public class BloomFilter2 extends Object implements Externalizable
it.unimi.dsi.mg4j.util.BloomFilter
in the mg4j project. The
primary changes are:
fastutil
project.
The code uses Arrays.fill(long[], long)
instead to clear the state on
reset.Externalizable
implementation and added access methods for their
data.Instances of this class represent a set of character sequences or primitive-type arrays (with false positives) using a Bloom filter. Because of the way Bloom filters work, you cannot remove elements.
The intended usage is that character sequences and arrays should not be mixed (albeit in principle it could work). A Bloom filter is rather agnostic with respect to the data it contains—all it needs is a sequence of hash functions that can be applied to the data.
Bloom filters have an expected error rate, depending on the number of hash
functions used, on the filter size and on the number of elements in the
filter. This implementation uses a variable optimal number of hash functions,
depending on the expected number of elements. More precisely, a Bloom filter
for n character sequences with d hash functions will
use ln 2 dn ≈ 1.44 dn
bits; false positives will happen with probability 2-d.
The maximum number of bits supported is MAX_BITS
.
Hash functions are generated at creation time using a mix of universal hashing and shift-add-xor hashing (for the latter, see Performance in practice of string hashing functions, by M.V. Ramakrishna and Justin Zobel, Proc. of the Fifth International Conference on Database Systems for Advanced Applications, 1997, pages 215−223).
Each hash function uses NUMBER_OF_WEIGHTS
random integers, which are
cyclically multiplied by the character codes in a character sequence. The
resulting integers are then summed with a shifted hash and XOR-ed together.
This class exports access methods that are similar to those of
Set
, but it does not implement that interface, as too many
non-optional methods would be unimplementable (e.g., iterators).
A main method makes it easy to create serialised Bloom filters starting from a list of terms.
Modifier and Type | Field and Description |
---|---|
static long |
MAX_BITS
The maximum number of bits in a filter (limited by array size and bits in a long).
|
static int |
NUMBER_OF_WEIGHTS
The number of weights used to create hash functions.
|
Constructor and Description |
---|
BloomFilter2()
De-serialization ctor.
|
BloomFilter2(int n)
Creates a new high-precision Bloom filter a given expected number of elements.
|
BloomFilter2(int n,
int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements.
|
Modifier and Type | Method and Description |
---|---|
boolean |
add(byte[] a)
Adds a byte array to the filter.
|
boolean |
add(char[] a)
Adds a character array to the filter.
|
boolean |
add(CharSequence s)
Adds a character sequence to the filter.
|
boolean |
add(double[] a)
Adds a double array to the filter.
|
boolean |
add(float[] a)
Adds a float array to the filter.
|
boolean |
add(int[] a)
Adds an int array to the filter.
|
boolean |
add(long[] a)
Adds a long array to the filter.
|
boolean |
add(short[] a)
Adds a short array to the filter.
|
void |
clear()
Clears this filter.
|
boolean |
contains(byte[] a)
Checks whether the given byte array is in this filter.
|
boolean |
contains(char[] a)
Checks whether the given character array is in this filter.
|
boolean |
contains(CharSequence s)
Checks whether the given character sequence is in this filter.
|
boolean |
contains(double[] a)
Checks whether the given double array is in this filter.
|
boolean |
contains(float[] a)
Checks whether the given float array is in this filter.
|
boolean |
contains(int[] a)
Checks whether the given int array is in this filter.
|
boolean |
contains(long[] a)
Checks whether the given long array is in this filter.
|
boolean |
contains(short[] a)
Checks whether the given short array is in this filter.
|
int |
d()
The number of hash functions used by this filter.
|
long |
m()
The number of bits in this filter.
|
void |
readExternal(ObjectInput in) |
long |
size()
Returns the size of this filter.
|
void |
writeExternal(ObjectOutput out) |
public static final transient long MAX_BITS
public static final transient int NUMBER_OF_WEIGHTS
public BloomFilter2()
public BloomFilter2(int n)
This constructor uses a number of hash functions that is logarithmic in the number of expected elements. This usually results in no false positives at all.
n
- the expected number of elements.public BloomFilter2(int n, int d)
n
- the expected number of elements.d
- the number of hash functions; under obvious uniformity and indipendence assumptions,
if the filter has not more than n
elements,
false positives will happen with probability 2-d.public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
readExternal
in interface Externalizable
IOException
ClassNotFoundException
public void writeExternal(ObjectOutput out) throws IOException
writeExternal
in interface Externalizable
IOException
public final long m()
public final int d()
public boolean contains(CharSequence s)
Note that this method may return true on a character sequence that has not been added to the filter. This will happen with probability 2-d, where d is the number of hash functions specified at creation time, if the number of the elements in the filter is less than n, the number of expected elements specified at creation time.
s
- a character sequence.s
(or some element
with the same hash sequence as s
) is in the filter.public boolean contains(byte[] a)
a
- a byte array.a
(or some element
with the same hash sequence as a
) is in the filter.contains(CharSequence)
public boolean contains(short[] a)
a
- a short array.a
(or some element
with the same hash sequence as a
) is in the filter.contains(CharSequence)
public boolean contains(char[] a)
a
- a character array.a
(or some element
with the same hash sequence as a
) is in the filter.contains(CharSequence)
public boolean contains(int[] a)
a
- an int array.a
(or some element
with the same hash sequence as a
) is in the filter.contains(CharSequence)
public boolean contains(long[] a)
a
- a long array.a
(or some element
with the same hash sequence as a
) is in the filter.contains(CharSequence)
public boolean contains(float[] a)
a
- a float array.a
(or some element
with the same hash sequence as a
) is in the filter.contains(CharSequence)
public boolean contains(double[] a)
a
- a double array.a
(or some element
with the same hash sequence as a
) is in the filter.contains(CharSequence)
public boolean add(CharSequence s)
s
- a character sequence.s
nor any
other element with the same hash sequence as s
was already in this filter).public boolean add(byte[] a)
a
- a byte array.a
nor any
other element with the same hash sequence as a
was already in this filter).public boolean add(short[] a)
a
- a short array.a
nor any
other element with the same hash sequence as a
was already in this filter).public boolean add(char[] a)
a
- a character array.a
nor any
other element with the same hash sequence as a
was already in this filter).public boolean add(int[] a)
a
- an int array.a
nor any
other element with the same hash sequence as a
was already in this filter).public boolean add(long[] a)
a
- a long array.a
nor any
other element with the same hash sequence as a
was already in this filter).public boolean add(float[] a)
a
- a float array.a
nor any
other element with the same hash sequence as a
was already in this filter).public boolean add(double[] a)
a
- a double array.a
nor any
other element with the same hash sequence as a
was already in this filter).public void clear()
public long size()
Note that the size of a Bloom filter is only a lower bound for the number of distinct elements that have been added to the filter. False positives might make the number returned by this method smaller than it should be.
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.