Resistive Bloom Filters: From Approximate Membership to Approximate Computing with Bounded Errors

Vahideh Akhlaghi*, Abbas Rahimi†, Rajesh K. Gupta*

*Department of Computer Science and Engineering, UC San Diego, La Jolla, CA, USA
†Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA, USA
{vakhlagh, gupta}@cs.ucsd.edu, abbas@eecs.berkeley.edu

Abstract—Approximate computing provides an opportunity for exploiting application characteristics to trade the accuracy for gains in energy efficiency. However, such opportunity must be able to bound the error that the system designer provides to the application developer. Space-efficient probabilistic data structure such as Bloom filter can provide one such means. Bloom filter supports approximate set membership queries with a tunable rate of false positives (i.e., errors) and no false negatives. We propose a resistive Bloom filter (ReBF) to approximate a function by tightly integrating it to a functional unit (FU) implementing the function. ReBF approximately mimics partial functionality of the FU by recalling its frequent input patterns for computational reuse. The accuracy of the target FU is guaranteed by bounding the ReBF error behavior at the design time. We further lower energy consumption of a FU by designing its ReBF using low-power memristor arrays. The experimental results show that function approximation using ReBF for five image processing kernels running on the AMD Southern Islands GPU yields on average 24.1% energy saving in 45 nm technology compared to the exact computation.

I. INTRODUCTION

The scaling of physical dimensions in semiconductor circuits opens the way to an astonishing over 7 billion transistors on massively parallel integrated architectures enforcing energy efficiency as a primary concern [1]. Emerging applications including graphics, multimedia, recognition, and data mining offer significant degrees of tolerance to the precision in computing results, thus enabling the emerging vision of “approximate computing”. Approximate computing provides higher energy efficiency by accepting errors for a specific application. As an example, approximate adders are introduced by reducing circuit complexity [2], [3]. These techniques typically consider a specific set of input patterns to design approximate hardware/circuit, hence they cannot guarantee error bounds for all data set.

Another emerging, and unrelated, development is growing use and capability of non-volatile memory (NVM) in systems. NVM today offers high density and near zero leakage power that enables energy-efficient computation on memory-centric architectures. Resistive RAM (ReRAM), in particular, offers low-write energy, fast read access time and low read voltage [4]. A 1-Mb ternary content addressable memory (TCAM) test chip is proposed as an energy-efficient memory-centric architecture providing \( 10 \times \) smaller cell size compared to SRAM-based TCAMs [5]. Such properties determine memristors are a promising replacement for traditional SRAM-based memories in low-power and high capacity designs.

Value locality and similarity inside various applications [6] especially in data-level parallel applications provide opportunity to reuse frequent computations [7], [8], thereby, improving energy-efficiency by avoiding redundant processing. In computational reuse, memories play an important role in maintaining pre-calculated computations. In this context, a Bloom Filter (BF) provides a data structure with the ability to provide probabilistic response to membership queries. A BF stores a set of elements (i.e., patterns) using multiple hash functions performed on them. The size of the memory required for a BF scales linearly with the number of stored elements in the filter [9]. BF data structure is extensively used to keep track of incoming data and states of flows in network applications [10], [11].

Thus, our working thesis is that a combination of approximation in computing with enhanced reuse methods enabled by NVM can yield high energy efficiency. In this paper, we explore this possibility for programmable parallel architectures. In particular, this work aims to increase computational reuse for approximate computing with bounded errors by designing BFs using non-volatile memory elements. We make the following contributions: I) We design BFs to generate an approximate function with a guaranteed error bound. Hence, a set of BFs is tightly-coupled to individual FU to approximately represent highly frequent computations of the associated FU. Each BF reports no false negatives (i.e., recall rate of 100%), and has a tunable parameter to control false positives. II) To further lower energy consumption and enable scalability, we utilize low-power memristor array in designing resistive Bloom filter (ReBF). III) We integrate ReBF into the Southern Islands GPU that yields on average 24.1% energy saving for five image processing kernels. Caltech 101 computer vision data set [12] is used for profiling and finding the error bounds.

The rest of the paper is organized as follows. Section II explains our proposed methodology to use BF in approximating a function with the bounded error rate. The proposed scalable BF to demonstrate additional functionality is presented in Section III. The experimental results are discussed in Section IV. Finally, Section V concludes the paper.

II. APPROXIMATE COMPUTING WITH BLOOM FILTERS

A. Bloom Filters (BFs)

BFs provide a compact representation of a set of elements. A BF consists of an \( m \)-bit vector which is programmed using \( k \) random hash functions. For a given element, \( k \) bits of the Bloom vector (BV), selected by the hash functions, are set in the programming stage. For a look up process, the same hash functions are computed on an input pattern. If the \( k \) bits in the
vector determined by $k$ hash values are all set, the input pattern is said to be present in the BF. Since those $k$ values may also be obtained by any of actual members, the presence of an input in the BF may be confirmed erroneously. On the other hand, if at least one of the $k$ bits is not set, the input is certainly not the member of the BF. This explains that a BF has a recall rate of $100\%$ and there is no false negative; however, it allows tunable false positives (FP), the rate of which is formulated as:

$$fp = (1 - e^{-\frac{n}{m}})^k$$

(1)

where, $n$ is the number of patterns saved in the BF, $m$ is the length of BV, and $k$ is the number of hash functions.

The hardware implementation of the BF is represented in Figure 1. The BV is shown as an $m$-bit array. For a look up operation, BF, at first, computes $k$ hash functions (HF) concurrently. The input data is also split into the number of bytes to parallelize the computations of each hash function. Each HFB module inside the HF performs computation on one byte of the data. Each HFB is made of XOR gates, and a number of coefficients, each has $\log_2(m)$ bits. Coefficients are XORed if the corresponding bit of the input is one. The last XOR gate in each HF produces the final address. If all of the $k$ bits of the BV are one, hit occurs and the EnL signal becomes one.

![Fig. 1: The hardware architecture of Bloom filter.](image)

**B. Utilizing Bloom Filters for Function Approximation**

In this section, we describe our proposed method that exploits BF to expand the probabilistic membership query to function approximation.

To avoid redundant computation overhead due to re-execution of an operation for the same inputs, we identified highly frequent output values in the function co-domain and stored their corresponding input patterns in a set of BFs. To perform a function for a given input, at first, the membership of the input is queried in the BFs associated to the function. If a hit event occurs, the corresponding output value is returned as the final output of the function, and all pipeline stages in the implementation of the functional unit (FU) are clock-gated to save energy. As described in Section II-A, positive responses of the BF to the membership queries are not always correct due to the FP. The source of error in our computation, is a FP event where the BF wrongly reports the output value as the function output. Here, the rate of the error corresponds to the FP rate, and can be controlled by tuning the parameters affecting the FP (i.e., $n$, $m$, $k$). In case of a miss, the FU continues the exact computation for the mismatched input patterns. Based on the aforementioned details, a set of BFs can be exploited to approximate partial outputs of a function with the bounded errors that provide guaranteed quality. The output quality, here, is maintained by adjusting the FP rate of BFs.

**C. Bounding Bloom Filter Errors at Design Time**

In our method, we are able to limit the error rate (i.e., the degree of approximation) to meet the acceptable output quality by limiting the FP rate of associated BFs at the design time.

<table>
<thead>
<tr>
<th>App</th>
<th>ADD</th>
<th>MUL</th>
<th>MAC</th>
<th>SQRT</th>
<th>PSNR (min, max, avg) (dB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sobel Filter</td>
<td>0.001</td>
<td>0.001</td>
<td>0.001</td>
<td>0.001</td>
<td>(26.4, 32.9, 28.0)</td>
</tr>
<tr>
<td>Sharpen</td>
<td>–</td>
<td>0.01</td>
<td>0.01</td>
<td>0.01</td>
<td>(24.7, 36.8, 28.12)</td>
</tr>
<tr>
<td>Roberts</td>
<td>–</td>
<td>0.001</td>
<td>0.001</td>
<td>0.001</td>
<td>(25.4, 32.5, 27.9)</td>
</tr>
<tr>
<td>Prewitt</td>
<td>–</td>
<td>0.01</td>
<td>0.001</td>
<td>0.01</td>
<td>(25.2, 33.3, 27.1)</td>
</tr>
<tr>
<td>Scharr</td>
<td>–</td>
<td>0.001</td>
<td>0.0001</td>
<td>0.001</td>
<td>(26.1, 31.1, 27.2)</td>
</tr>
</tbody>
</table>

We focus on multimedia applications that are amenable to approximation. The multimedia applications offer the well-known notion of output quality with peak signal-to-noise ratio (PSNR). With approximate computing, an application with PSNR of equal or greater than 25 dB can still appear to execute correctly from the user’s perspective [13]. FUs in the GPU architecture are targeted for integrating BFs. To ensure that the output quality will never go below the desired threshold, we need to configure BFs. We require, at first, to identify the maximum tolerable error rate for each FU in a given kernel. We set the error magnitude conservatively to its maximum value for each FP event during the design time analysis. Then, the parameters of BF should be set to yield the FP rate less than or equal to the error rate. To do that, we need to specify the number of inputs to store in the BF through profiling. For the sake of clarity, we describe the process of configuring BFs in more detail, given the fact that four types of operations are identified in GPU architecture: adder (ADD), multiplier (MUL), multiply-accumulator (MAC) and SQRT.

![Fig. 2: Energy consumption comparison of the proposed architecture using CMOS BFs and conventional FU.](image)

**Maximum Tolerable Error Rates.** To find maximum tolerable error rates of the FUs used in a kernel, we use the following algorithm, and simulate the kernel for 30 different images using Multi2Sim [14]. The desired output quality, here, is assumed to be equal or greater than 25 dB. However, the algorithm has the ability to find the maximum error rate for any given PSNR threshold.

The first step is to find the maximum tolerable error rate for each individual operation. To achieve this, each time, one operation is selected and random error with different rates is injected into it. We start from error rate of 0.1 and decrease it until the average PSNR of final output images becomes acceptable. The next step aims to inject errors to all FUs...
to see their combined effect on the output quality. Here, we inject errors simultaneously into all operations with the corresponding error rate obtained in the previous step. In case of unacceptable output quality, in the third step, we identify the top frequent operation using the assembly code of the kernel generated by Multi2Sim, and decrease its error rate by ten times. The process is repeated for the next frequent FU until the acceptable PSNR is achieved. Table I summarizes the maximum tolerable error rate obtained for each FU of five image processing applications, and the corresponding PSNR.

Table I: Maximum Tolerable Error Rate

<table>
<thead>
<tr>
<th>Application</th>
<th>Error Rate (dB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sobel</td>
<td>30.2</td>
</tr>
<tr>
<td>Robert</td>
<td>28.5</td>
</tr>
<tr>
<td>Prewitt</td>
<td>27.8</td>
</tr>
<tr>
<td>Scharr</td>
<td>27.2</td>
</tr>
<tr>
<td>Sharpen</td>
<td>26.9</td>
</tr>
</tbody>
</table>

For the look up/search process, the ML is, at first, pre-charged to Vdd via the pre-charge circuit. Then, the same $k$ hash functions generate $k$ addresses which will be decoded in parallel through $k$ CMOS decoders. The select lines of the corresponding $k$ cells turn on the NMOS transistor in the cell. If the stored value in any of the $k$ cells is zero (i.e., the resistor value is low), a path between Vdd and ground will be established and the ML will be discharged. The voltage drop on the ML will be amplified by the sense circuit; therefore, a full swing will be observed on the EnL signal. This means that the incoming input is not saved in the ReBF. In this case, EnL prevents the output register to be read. The delay of the circuit is determined by the time it takes to pre-charge ML to Vdd, and the time it takes to discharge the line and observe the full swing on EnL. The worst-case delay happens when only one of the cells is not matched, and tries to discharge the ML. If more than one cell are not matched, the ML will be discharged rapidly, thereby, the delay is lower than the previous case. However, if all $k$ cells are set to one, the resistor values are all high, and prevent the ML from discharging. This means that hit occurs and EnL signal becomes high, which leads to returning pre-calculated value stored in a CMOS register as the final output of the FU that uses the ReBF.

### III. ReBF: Resistive Bloom Filter

Before introducing architecture of a ReBF, let us examine energy efficacy of CMOS implementation of BF. We implement BFs configured for five image processing applications using Verilog, and synthesized them with 45 nm standard CMOS library. Each individual output value is stored in a 32-bit register, connected to the corresponding BF. Figure 2 illustrates the energy of the proposed architecture using CMOS BFs compared to the conventional one. As shown, computational reuse with the aid of CMOS BFs offers negligible or no benefit (i.e., on average the overhead is 4.4%). This energy overhead is mainly due to the implementations of m-bit BVs (i.e., about 70% of the power). To improve the energy efficiency of our proposed architecture, we use 1-transistor/1-resistive (1T-1R) cells, to implement the m-bit BV.

**A. ReBF Architecture**

The architecture of ReBF is shown in Figure 3. Each bit of the resistive Bloom vector is implemented by a 1T-1R cell. The cells are connected to each other through a match line (ML). The stored value in each cell is represented by the value of the resistor element. High value of the resistor is considered as digital one, and low value as digital zero. The transistor in the cell is controlled by a select line (SL). During the programming stage, $k$ hash values for each member are obtained using the combinational CMOS circuit, and, then, the resistor of the corresponding cell in ReBF is set to the high value.

**TABLE II: Energy Consumption of Bloom vector**

<table>
<thead>
<tr>
<th>Size (bits)</th>
<th>Vdd (V)</th>
<th>RBV (fJ)</th>
<th>CMOS BV (fJ)</th>
</tr>
</thead>
<tbody>
<tr>
<td>2048</td>
<td>0.71</td>
<td>19.3</td>
<td>12188.4</td>
</tr>
<tr>
<td>1024</td>
<td>0.6</td>
<td>8.04</td>
<td>6662.4</td>
</tr>
<tr>
<td>512</td>
<td>0.54</td>
<td>5.10</td>
<td>3992.4</td>
</tr>
<tr>
<td>256</td>
<td>0.51</td>
<td>3.77</td>
<td>2603.1</td>
</tr>
<tr>
<td>128</td>
<td>0.47</td>
<td>1.0</td>
<td>1837.3</td>
</tr>
<tr>
<td>64</td>
<td>0.47</td>
<td>1.62</td>
<td>1649.55</td>
</tr>
</tbody>
</table>

For the look up/search process, the ML is, at first, pre-charged to Vdd via the pre-charge circuit. Then, the same $k$ hash functions generate $k$ addresses which will be decoded in parallel through $k$ CMOS decoders. The select lines of the corresponding $k$ cells turn on the NMOS transistor in the cell. If the stored value in any of the $k$ cells is zero (i.e., the resistor value is low), a path between Vdd and ground will be established and the ML will be discharged. The voltage drop on the ML will be amplified by the sense circuit; therefore, a full swing will be observed on the EnL signal. This means that the incoming input is not saved in the ReBF. In this case, EnL prevents the output register to be read. The delay of the circuit is determined by the time it takes to pre-charge ML to Vdd, and the time it takes to discharge the line and observe the full swing on EnL. The worst-case delay happens when only one of the cells is not matched, and tries to discharge the ML. If more than one cell are not matched, the ML will be discharged rapidly, thereby, the delay is lower than the previous case. However, if all $k$ cells are set to one, the resistor values are all high, and prevent the ML from discharging. This means that hit occurs and EnL signal becomes high, which leads to returning pre-calculated value stored in a CMOS register as the final output of the FU that uses the ReBF.

### IV. Experimental Results

**A. Experimental Setup**

To assess the efficiency of the function approximation using ReBF, we choose five image processing applications adopted from AMD APP SDK v2.5 [15]: Sobel, Robert, Prewitt, Schar and Sharpen. The openCL code of these applications are simulated by Multi2Sim [14] to perform profiling and finding the error rate bound on four FUs. We generates the VHDL code of these FUs as the six-stage pipeline unit, commensurate with the AMD Southern Islands GPUs [14], using FloPoCo [16]. The hardware realization of different size of BFs with different functions generate $k$ addresses which will be decoded in parallel through $k$ CMOS decoders. The select lines of the corresponding $k$ cells turn on the NMOS transistor in the cell. If the stored value in any of the $k$ cells is zero (i.e., the resistor value is low), a path between Vdd and ground will be established and the ML will be discharged. The voltage drop on the ML will be amplified by the sense circuit; therefore, a full swing will be observed on the EnL signal. This means that the incoming input is not saved in the ReBF. In this case, EnL prevents the output register to be read. The delay of the circuit is determined by the time it takes to pre-charge ML to Vdd, and the time it takes to discharge the line and observe the full swing on EnL. The worst-case delay happens when only one of the cells is not matched, and tries to discharge the ML. If more than one cell are not matched, the ML will be discharged rapidly, thereby, the delay is lower than the previous case. However, if all $k$ cells are set to one, the resistor values are all high, and prevent the ML from discharging. This means that hit occurs and EnL signal becomes high, which leads to returning pre-calculated value stored in a CMOS register as the final output of the FU that uses the ReBF.

**B. Energy Saving**

Energy consumption of an individual resistive Bloom vector (RBV) and CMOS Bloom vector of different length of
TABLE III: Optimum ReBF configuration for different applications

<table>
<thead>
<tr>
<th>FU</th>
<th>Sobel</th>
<th>Roberts</th>
<th>Sharpen</th>
<th>Prewitt</th>
<th>Scharr</th>
</tr>
</thead>
<tbody>
<tr>
<td>#Out</td>
<td>#In</td>
<td>HR%</td>
<td>BV</td>
<td>#Fn</td>
<td>#Out</td>
</tr>
<tr>
<td>ADD</td>
<td>2</td>
<td>12</td>
<td>29.46</td>
<td>256</td>
<td>4</td>
</tr>
<tr>
<td>MUL</td>
<td>2</td>
<td>16</td>
<td>29.61</td>
<td>1024</td>
<td>3</td>
</tr>
<tr>
<td>MAC</td>
<td>4</td>
<td>12</td>
<td>25.44</td>
<td>256</td>
<td>4</td>
</tr>
<tr>
<td>SQRT</td>
<td>16</td>
<td>16</td>
<td>20.8</td>
<td>512</td>
<td>3</td>
</tr>
</tbody>
</table>

2048, 1024, 512, 256, 128, and 64 bits are summarized in Table II. The delay of the RBV of different size is fixed at 1.4 ns by adaptively adjusting the operating voltage. As we can see, implementing Bloom vector using memristor cells presents extremely low energy consumption even if the larger vector is used. This presents the scalability of RBV and its benefit in representing more functionality. To assess the efficiency of our approach, we profile five image processing kernels using 10% of Caltech 101 computer vision dataset [12] and find the maximum tolerable error rate for each FU in the kernels as described in Section II-C using 30% of the dataset. For each kernels, we select various hit rates obtained from profiling stage. Then, we configure BFs based on the number of elements that depends on the hit rate (HR), and the maximum error rate obtained for each FU. We evaluate energy consumption of our approach for each configuration to find the maximum energy improvement compared to the conventional FU without ReBF. For each application, we summarize the optimum configuration of the Bloom filters (e.g., the length of Bloom vector (BV) , and the number of hash functions (Fn)) integrated to each FU in Table III.

Fig. 4: Energy comparison of the proposed architecture using ReBFs and conventional FUs.

Figure 4 illustrates the total energy of using ReBF (including resistive and CMOS parts) along with FUs and energy of conventional FUs without ReBF. ReBF demonstrates 21.7%, 25.2%, 25.3%, 31.4%, and 16.8% energy improvement for Sobel filter, Sharpen, Roberts, Prewitt, and Scharr, respectively. As an example, for Sharpen, maximum improvement is achieved by using ReBFs for MUL and MAC, to store 15 patterns for each module. To meet the maximum error rates, we employ Bloom vector of size 256 bits, and two hash functions for each operation. The provided hit rates are 42.18% and 30.2% for MUL and MAC, respectively. Here, high energy consumption of CMOS modules inhibits us from further computational reuse and energy saving.

V. CONCLUSION

Modern multimedia applications offer massive parallelism and significant degrees of tolerance to approximate computing. This paper aims to address the following challenge: how to increase approximate computational reuse through non-volatile resistive storages in GPUs with bounded errors? Combining computational reuse and approximate computing, we propose a resistive Bloom filter (ReBF) that provides an approximate representation of a function by integrating Bloom filters to the hardware functional units. ReBFs are used to store highly frequent patterns to avoid re-executions. This methodology has the ability to control the degree of function approximation by adjusting the false positive rate of the ReBFs. To reduce energy consumption of the proposed architecture, low-power memristive arrays are exploited to perform search operations at extremely low energy. Experimental results show that our approach represents on average 38.42% of the functionality of FUs in five different kernels running on GPUs, while guaranteeing the acceptable outputs with PSNR of greater than 27 dB. This leads to on average 24.1% energy reduction compared to conventional architectures without ReBF.

VI. ACKNOWLEDGEMENTS

This work was supported by NSF’s Variability Expedition (1029783).

REFERENCES