public class BloomFilterFactory extends Object implements Serializable
AbstractBTree
and which allows the caller to specify the expected
number of index entries, the desired error rate for the filter at that #of
index entries, and the maximum error rate before the bloom filter will be
disabled.
While the bloom filter is always a "perfect" fit for an IndexSegment
since the #of index entries is known before we build the IndexSegment
,
that is not true of a BTree
. For a BTree
, the performance
of the bloom filter will degrade as you exceed the #of index entries for
which it was provisioned (at a desired error rate). Therefore the factory
also specifies the error rate above which the bloom filter will be disabled
for a BTree
. The factory works backwards from the actual bit length
of the bloom filter and the maximum allowed error rate to derive the #of
index entries which would achieve that error rate. The bloom filter is
disabled when inserts into the BTree
reach that threshold #of index
entries.
Once a bloom filter has been disabled for a BTree
it will not be
re-enabled for that BTree
. However, in the scale-out architecture
the a new mutable BTree
is created each time the journal overflows
and the data from the historical view is migrated into IndexSegment
s.
Both the new mutable BTree
and IndexSegment
s will have bloom
filters configured according to this factory. The bloom filter on the new
BTree
will operate until the #of index entries in that BTree
(not in the index or the index partition) exceeds the threashold at which the
bloom filter would be expected to operate with the specified maximum error
rate, at which point it will be disabled.
Modifier and Type | Field and Description |
---|---|
static BloomFilterFactory |
DEFAULT
The recommenced default factory configuration.
|
static double |
DEFAULT_ERROR_RATE
The default target error rate 0.02.
|
static double |
DEFAULT_MAX_ERROR_RATE
The default maximum error rate 0.15.
|
static int |
DEFAULT_N
The default expected #of index entries 1000000.
|
int |
maxN
The maximum #of index entries before the expected performance will be
worse than the specified maximum error rate.
|
double |
maxP
The maximum error rate for the bloom filter (it will be disabled for a
BTree once the bloom filter can be expected to realize this error
rate). |
int |
n
The expected #of index entries.
|
double |
p
The desired error rate for the bloom filter at that #of index entries.
|
Constructor and Description |
---|
BloomFilterFactory(int n)
|
BloomFilterFactory(int n,
double p,
double maxP)
Core impl.
|
Modifier and Type | Method and Description |
---|---|
BloomFilter |
newBloomFilter()
Create and return a new (empty) bloom filter for a
BTree or
IndexSegment . |
String |
toString() |
public final int n
public final double p
public final double maxP
BTree
once the bloom filter can be expected to realize this error
rate).public final int maxN
BTree
will
automatically disable its bloom filter once this many elements have been
inserted. In practice, this is only a constraint on scale-up indices. For
scale-out indices the maximum applies only the mutable BTree
absorbing writes destined for a specific index partition and the
IndexSegment
s will have "perfect fit" bloom filters.public static final transient int DEFAULT_N
public static final transient double DEFAULT_ERROR_RATE
public static final transient double DEFAULT_MAX_ERROR_RATE
public static final transient BloomFilterFactory DEFAULT
The space requirements of a BloomFilter
are approximately one
byte per index entry for a reasonable false positive rate (aka error
rate). Use DEFAULT
for reasonable defaults.
With these values, the bloom filter will be limited to an index with no
more than ~2M entries. After than the bloom filter begins to have
diminishing results due to an increasing false positive rate and will be
automatically disabled. If you increase the expected #of index entries,
then you can handle larger indices with the same error rate, but this
swiftly gets into diminishing returns again because the latency required
to materialize and persist the bloom filter becomes noticeable. Therefore
the default configuration for the bloom filter is highly accurate up to
1M index entries (0.02 error rate) and then begins to loose precision. If
the index grows large enough then the expected error rate of the bloom
filter will degrade to the point where it will be automatically disabled.
While this places an effective upper bound on the size of a
scale-up index that can make effective use of a bloom filter,
scale-out indices DO NOT share the same limitation. Each time a scale-out
index is partitioned, it is broken into a mutable BTree
for
absorbing writes for an index partition and zero or more
IndexSegment
s. Each time an overflow occurs, index entries are
migrated to the IndexSegment
s, so the #of index entries in the
BTree
is never that large. Finally, #of index entries in an
IndexSegment
is always known when the IndexSegment
is
built, so the BloomFilter
for an IndexSegment
is always a
perfect fit.
public BloomFilterFactory(int n)
n
- The expected #of index entries for a BTree
(this
value is ignored for IndexSegment
s).IllegalArgumentException
- if n is non-positive.public BloomFilterFactory(int n, double p, double maxP)
n
- The expected #of index entries (this value is ignored for
IndexSegment
s).p
- The desired error rate for the bloom filter at that #of index
entries (or at the actual #of index entries for an
IndexSegment
).maxP
- The maximum error rate for the bloom filter for a
BTree
(it will be disabled for a BTree
once
the bloom filter can be expected to realize this error rate).IllegalArgumentException
- if n is non-positive.IllegalArgumentException
- unless p lies in (0:1].IllegalArgumentException
IllegalArgumentException
- unless maxP lies in (p:1].public BloomFilter newBloomFilter()
BTree
or
IndexSegment
.
The bloom filter can be provisioned with reference to /architecture/bloomfilter.xls
. Let p
be the probability of
a false positive (aka the error rate) and n
be the #of index
entries. The values p=.02 and n=1M result in a space requirement of
8656171 bits or approximately 1mb and uses ~ 8.6 bits per element. In
order to achieve the same error rate with n=10M, the size requirements of
the bloom filter will be approximately 10mb since the filter will still
use ~ 8.6 bits per element for that error rate, or roughly one byte per
index entry.
The maximum record length for the backing store can easily be exceeded by a large bloom filter, large bloom filters will require significant time to read/write during checkpoints, and large bloom filters will be written each time a dirty index is involved in a commit.
While the scale-out architecture uses group commits and hence can be
expected to perform more commits during a bulk data load, it also uses
one bloom filter per AbstractBTree
so the #of index entries is
bounded by the configured ISimpleSplitHandler
in an application
dependent manner. On the other hand, the bloom filter performance will
degrade as a scale-up index grows in size since the bloom filter can not
be made very large for a scale-up store (the maximum record size is
reduced in order to permit more records) and large indices will therefore
experience increasing false positive rates as they grow.
Whether or not a bloom filter is useful depends on the application. The
bloom filter will ONLY be used for point tests such as contains(),
lookup(), or an IAccessPath
that is fully bound and therefore can
benefit from testing a bloom filter.
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.