public abstract class AbstractBTree extends Object implements IIndex, IAutoboxBTree, ILinearList, IBTreeStatistics, ILocalBTreeView, ISimpleTreeIndexAccess, ICheckpointProtocol
Base class for mutable and immutable B+-Tree implementations.
The B+-Tree implementation supports variable length unsigned byte[] keys and
provides a IKeyBuilder
utilities designed to make it possible to
generate keys from any combination of primitive data types and Unicode
strings. The total ordering imposed by the index is that of a bit-wise
comparison of the variable length unsigned byte[] keys. Encoding Unicode keys
is support by an integration with ICU4J and applications may choose the
locale, strength, and other properties that govern the sort order of sort
keys generated from Unicode strings. Sort keys produces by different collator
objects are NOT compatible and applications that use Unicode data in their keys
MUST make sure that they use a collator that imposes the same sort order each
time they provision a KeyBuilder
.
The use of variable length unsigned byte[] keys makes it possible for the B+-Tree to perform very fast comparison of a search key with keys in the nodes and leaves of the tree. To support fast search, the leading prefix is factored out each time a node or leaf is made immutable, e.g., directly proceeding serialization. Further, the separator keys are chosen to be the shortest separator key in order to further shorten the keys in the nodes of the tree.
The B+Tree can optionally maintain version metadata (a version timestamp and
deletion marker). Version metadata MUST be enabled (a) if the index will be
used with transactional isolation; or (b) if the index is part of a scale-out
index. In both cases the version timestamps and deletion markers play a
critical role when reading from a fused view of an ordered set of indices
describing an index or an index partition. Note that the existence of
deletion markers means that rangeCount(byte[], byte[])
and
getEntryCount()
are upper bounds as deleted entries will be reported
until they are purged from the index by a compacting merge of the view.
KeyBuilder
Modifier and Type | Class and Description |
---|---|
static interface |
AbstractBTree.IBTreeCounters
Interface declaring namespaces for performance counters collected for a
B+Tree.
|
Modifier and Type | Field and Description |
---|---|
protected int |
branchingFactor
The branching factor for the btree.
|
protected boolean |
debug
Flag turns on the use of
AbstractNode.assertInvariants() and is
automatically enabled when the logger is set to
Level.DEBUG . |
protected static boolean |
DEBUG
True iff the
log level is DEBUG or less. |
static org.apache.log4j.Logger |
dumpLog
Log for
dump(PrintStream) and friends. |
protected Throwable |
error
This field is set if an error is encountered that renders an unisolated
index object unusable.
|
protected static String |
ERROR_CLOSED
The index is already closed.
|
protected static String |
ERROR_ERROR_STATE
An unisolated index view is in an error state.
|
protected static String |
ERROR_LESS_THAN_ZERO
A parameter was less than zero.
|
protected static String |
ERROR_READ_ONLY
The index is read-only but a mutation operation was requested.
|
protected static String |
ERROR_TOO_LARGE
A parameter was too large.
|
protected static String |
ERROR_TRANSIENT
The index is transient (not backed by persistent storage) but an
operation that requires persistence was requested.
|
protected static boolean |
INFO
True iff the
log level is INFO or less. |
protected static org.apache.log4j.Logger |
log
Log for btree opeations.
|
protected IndexMetadata |
metadata
The metadata record for the index.
|
protected int |
ndistinctOnWriteRetentionQueue
The #of distinct nodes and leaves on the
writeRetentionQueue . |
protected NodeSerializer |
nodeSer
Used to serialize and de-serialize the nodes and leaves of the tree.
|
protected boolean |
readOnly
When
true the AbstractBTree does not permit
mutation. |
protected AbstractNode<?> |
root
The root of the btree.
|
protected IRawStore |
store
The persistence store -or-
null iff the B+Tree is transient. |
protected ConcurrentMap<Long,Object> |
storeCache
Deprecated.
|
protected IHardReferenceQueue<PO> |
writeRetentionQueue
Nodes (that is nodes or leaves) are added to a hard reference queue when
they are created or read from the store.
|
Modifier | Constructor and Description |
---|---|
protected |
AbstractBTree(IRawStore store,
INodeFactory nodeFactory,
boolean readOnly,
IndexMetadata metadata,
IRecordCompressorFactory<?> recordCompressorFactory) |
Modifier and Type | Method and Description |
---|---|
protected abstract void |
_reopen()
This method is responsible for setting up the root leaf (either new or
read from the store), the bloom filter, etc.
|
protected void |
assertNotReadOnly() |
protected void |
assertNotTransient() |
void |
close()
The contract for
ICheckpointProtocol.close() is to reduce the resource burden of the
index while not rendering the index inoperative. |
boolean |
contains(byte[] key)
Core method to decide whether the index has a (non-deleted) entry under a
key.
|
boolean |
contains(Object key)
Return true iff there is an entry for the key.
|
static long |
decodeRecordAddr(byte[] buf)
Decodes a signed long value as encoded by
#appendSigned(long) . |
boolean |
dump(org.apache.log4j.Level level,
PrintStream out) |
boolean |
dump(PrintStream out)
Recursive dump of the tree.
|
BaseIndexStats |
dumpPages(boolean recursive,
boolean visitLeaves)
Reports statistics for the index.
|
static byte[] |
encodeRecordAddr(ByteArrayBuffer recordAddrBuf,
long addr)
Encode a raw record address into a byte[] suitable for storing in the
value associated with a tuple and decoding using
decodeRecordAddr(byte[]) |
abstract BloomFilter |
getBloomFilter()
Return the optional
IBloomFilter , transparently
reopen() ing the index if necessary. |
int |
getBranchingFactor()
The branching factor for the btree.
|
BTreeCounters |
getBtreeCounters()
Counters tracking various aspects of the btree.
|
Tuple |
getContainsTuple()
Return a
Tuple that may be used to test for the existence of
tuples having a given key within the AbstractBTree . |
CounterSet |
getCounters()
Return performance counters.
|
abstract long |
getEntryCount()
The #of entries (aka tuples) in the
AbstractBTree . |
IndexMetadata |
getIndexMetadata()
The metadata for the index.
|
abstract long |
getLastCommitTime()
The timestamp associated with the last
IAtomicStore.commit() in
which writes buffered by this index were made restart-safe on the backing
store. |
int |
getLevel(AbstractNode t)
Return the level of t below the root node or leaf.
|
int |
getLevel(AbstractNode t,
AbstractNode node)
Return the level of t below the given node or leaf.
|
Tuple |
getLookupTuple()
Return a
Tuple that may be used to copy the value associated with
a key out of the AbstractBTree . |
NodeSerializer |
getNodeSerializer()
The object responsible for (de-)serializing the nodes and leaves of the
IIndex . |
int |
getReadLockCount()
Return the #of read-locks held by the current thread for a mutable index
view.
|
IResourceMetadata[] |
getResourceMetadata()
The description of the resources comprising the index view.
|
abstract long |
getRevisionTimestamp()
The timestamp associated with unisolated writes on this index.
|
AbstractNode |
getRightMostNode(boolean nodesOnly)
Utility method returns the right-most node in the B+Tree.
|
protected AbstractNode<?> |
getRoot()
The root of the btree.
|
protected AbstractNode<?> |
getRootOrFinger(byte[] key)
Returns the node or leaf to be used for search.
|
IBTreeStatistics |
getStatistics()
Return a statistics snapshot of the B+Tree.
|
IBTreeUtilizationReport |
getUtilization()
Computes and returns the utilization of the tree.
|
Tuple |
getWriteTuple()
Private instance used for mutation operations (insert, remove) which are
single threaded.
|
long |
indexOf(byte[] key)
Lookup the index position of the key.
|
byte[] |
insert(byte[] key,
byte[] value)
Insert or update a value under the key.
|
Tuple |
insert(byte[] key,
byte[] value,
boolean delete,
boolean putIfAbsent,
long timestamp,
Tuple tuple)
Core method for inserting or updating a value under a key.
|
Object |
insert(Object key,
Object value)
Insert with auto-magic handling of keys and value objects.
|
boolean |
isBalanced()
Return
true iff the tree is balanced. |
boolean |
isOpen()
An "open" index has may have some buffered data.
|
boolean |
isReadOnly()
Return
true iff this B+Tree is read-only. |
boolean |
isTransient()
Return
true iff this is a transient data structure (no
backing store). |
byte[] |
keyAt(long index)
Return the key for the identified entry.
|
byte[] |
lookup(byte[] key)
Lookup a value for a key.
|
Tuple |
lookup(byte[] key,
Tuple tuple)
Core method for retrieving a value under a key.
|
Object |
lookup(Object key)
Lookup a value for a key.
|
abstract ILeafCursor |
newLeafCursor(byte[] key)
Return a cursor that may be used to efficiently locate and scan the
leaves in the B+Tree.
|
abstract ILeafCursor |
newLeafCursor(SeekEnum where)
Return a cursor that may be used to efficiently locate and scan the
leaves in the B+Tree.
|
byte[] |
putIfAbsent(byte[] key,
byte[] value)
Insert or update a value under the key iff there is no entry for that key
in the index.
|
protected boolean |
rangeCheck(byte[] key,
boolean allowUpperBound)
Deprecated.
This method has been disabled. It always returns true.
The method is disabled because it forces the caller to ensure
that the query lies within the specified range. While that is
all well and good, this places an undue burden on
|
long |
rangeCopy(IIndex src,
byte[] fromKey,
byte[] toKey,
boolean overflow)
Copy all data, including deleted index entry markers and timestamps iff
supported by the source and target.
|
long |
rangeCount()
Return the #of tuples in the index.
|
long |
rangeCount(byte[] fromKey,
byte[] toKey)
Return the #of tuples in a half-open key range.
|
long |
rangeCount(Object fromKey,
Object toKey)
Variant implicitly converts the optional application keys into unsigned
byte[]s.
|
long |
rangeCountExact(byte[] fromKey,
byte[] toKey)
Return the exact #of tuples in a half-open key range.
|
long |
rangeCountExactWithDeleted(byte[] fromKey,
byte[] toKey)
Note:
rangeCount(byte[], byte[]) already reports deleted tuples
for an AbstractBTree so this method is just delegated to that
one. |
ITupleIterator |
rangeIterator()
Visits all tuples in key order.
|
ITupleIterator |
rangeIterator(byte[] fromKey,
byte[] toKey)
Return an iterator that visits the entries in a half-open key range.
|
ITupleIterator |
rangeIterator(byte[] fromKey,
byte[] toKey,
int capacityIsIgnored,
int flags,
IFilter filter)
Designated variant (the one that gets overridden) for an iterator that
visits the entries in a half-open key range.
|
ITupleIterator |
rangeIterator(Object fromKey,
Object toKey)
Variant implicitly converts the optional application keys into unsigned
byte[]s.
|
ITupleIterator |
rangeIterator(Object fromKey,
Object toKey,
int capacity,
int flags,
IFilter filter)
Variant implicitly converts the optional application keys into unsigned
byte[]s.
|
Lock |
readLock()
Return a
Lock that may be used to obtain a shared read lock which
is used (in the absence of other concurrency control mechanisms) to
permit concurrent readers on an unisolated index while serializing access
to that index when a writer must run. |
protected AbstractNode<?> |
readNodeOrLeaf(long addr)
Read a node or leaf from the store.
|
protected int |
recycle(long addr)
Recycle (aka delete) the allocation.
|
byte[] |
remove(byte[] key)
Remove the tuple under that key (will write a delete marker if delete
markers are enabled).
|
Tuple |
remove(byte[] key,
Tuple tuple)
Core method for deleting a value under a key.
|
Object |
remove(Object key)
Remove the key and its associated value.
|
abstract void |
removeAll()
Remove all entries in the B+Tree.
|
void |
reopen()
(Re-) open the index.
|
ICloseableIterator<?> |
scan()
Visit all entries in the index in the natural order of the index
(dereferencing visited tuples to the application objects stored within
those tuples).
|
void |
setBTreeCounters(BTreeCounters btreeCounters)
Replace the
BTreeCounters . |
void |
submit(byte[] fromKey,
byte[] toKey,
IKeyRangeIndexProcedure proc,
IResultHandler handler)
The procedure will be transparently applied against each index partition
spanned by the given key range.
|
<T> T |
submit(byte[] key,
ISimpleIndexProcedure<T> proc)
Submits an index procedure that operations on a single key to the
appropriate index partition returning the result of that procedure.
|
void |
submit(int fromIndex,
int toIndex,
byte[][] keys,
byte[][] vals,
AbstractKeyArrayIndexProcedureConstructor ctor,
IResultHandler aggregator)
Runs a procedure against an index.
|
String |
toString()
Fast summary information about the B+Tree.
|
protected void |
touch(AbstractNode<?> node)
This method is responsible for putting the node or leaf onto the ring
buffer which controls (a) how long we retain a hard reference to the node
or leaf; and (b) for writes, when the node or leaf is evicted with a zero
reference count and made persistent (along with all dirty children).
|
byte[] |
valueAt(long index)
Return the value for the identified entry.
|
Tuple |
valueAt(long index,
Tuple tuple) |
Lock |
writeLock()
Return a
Lock that may be used to obtain an exclusive write lock
which is used (in the absence of other concurrency control mechanisms) to
serialize all processes accessing an unisolated index when a writer must
run. |
protected long |
writeNodeOrLeaf(AbstractNode<?> node)
Codes the node and writes the coded record on the store (non-recursive).
|
protected void |
writeNodeRecursive(AbstractNode<?> node)
Write a dirty node and its children using a post-order traversal that
first writes any dirty leaves and then (recursively) their parent nodes.
|
protected void |
writeNodeRecursiveCallersThread(AbstractNode<?> node)
This is the historical implementation and runs entirely in the caller's
thread.
|
protected void |
writeNodeRecursiveConcurrent(AbstractNode<?> node)
Writes the dirty nodes and leaves in level sets (one level at a time)
with up to one thread per dirty node/leave in a given level.
|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
getHeight, getLeafCount, getNodeCount
getMutableBTree, getSourceCount, getSources
getCounter
getHeight, getLeafCount, getNodeCount
getCheckpoint, getDirtyListener, getMetadataAddr, getRecordVersion, getRootAddr, setDirtyListener, setLastCommitTime, writeCheckpoint, writeCheckpoint2
handleCommit, invalidate
getStore
protected static final String ERROR_CLOSED
protected static final String ERROR_LESS_THAN_ZERO
protected static final String ERROR_TOO_LARGE
protected static final String ERROR_READ_ONLY
protected static final String ERROR_TRANSIENT
protected static final String ERROR_ERROR_STATE
protected static final org.apache.log4j.Logger log
protected static final boolean INFO
log
level is INFO or less.protected static final boolean DEBUG
log
level is DEBUG or less.public static final org.apache.log4j.Logger dumpLog
dump(PrintStream)
and friends.protected final boolean debug
AbstractNode.assertInvariants()
and is
automatically enabled when the logger
is set to
Level.DEBUG
.protected final IRawStore store
null
iff the B+Tree is transient.protected final boolean readOnly
true
the AbstractBTree
does not permit
mutation.@Deprecated protected final ConcurrentMap<Long,Object> storeCache
protected final int branchingFactor
protected volatile AbstractNode<?> root
This hard reference is cleared to null
if an index is
closed
. getRoot()
automatically uses
reopen()
to reload the root so that closed indices may be
transparently made ready for further use (indices are closed to reduce
their resource burden, not to make their references invalid). The
AbstractNode
and derived classes assume that the root
is non-null. This assumption is valid if close()
is invoked by
the application in a manner consistent with the single-threaded contract
for the AbstractBTree
.
Note: This field MUST be marked as [volatile] in order to guarantee
correct semantics for double-checked locking in reopen()
and reopen()
.
http://en.wikipedia.org/wiki/Double-checked_locking
protected volatile Throwable error
null
the index MUST be reloaded from the most recent
checkpoint before it can be used (that is, you need to obtain a new view
of the unisolated index since this field is sticky once set).protected final NodeSerializer nodeSer
protected final IHardReferenceQueue<PO> writeRetentionQueue
IRawStore
. The
nodes and leaves refer to their parent with a WeakReference
s.
Likewise, nodes refer to their children with a WeakReference
. The
hard reference queue in combination with touch(AbstractNode)
and
with hard references held on the stack ensures that the parent and/or
children remain reachable during operations. Once the node is no longer
strongly reachable weak references to that node may be cleared by the VM
- in this manner the node will become unreachable by navigation from its
ancestors in the btree. The special role of the hard reference queue is
to further ensure that dirty nodes remain dirty by defering persistence
until the reference count for the node is zero during an eviction from
the queue.
Note that nodes are evicted as new nodes are added to the hard reference queue. This occurs in two situations: (1) when a new node is created during a split of an existing node; and (2) when a node is read in from the store. Inserts on this hard reference queue always drive evictions. Incremental writes basically make it impossible for the commit set to get "too large" - the maximum #of nodes to be written is bounded by the size of the hard reference queue. This helps to ensure fast commit operations on the store.
The minimum capacity for the hard reference queue is two (2) so that a split may occur without forcing eviction of either node participating in the split.
Note: The code in Node.postOrderNodeIterator(boolean, boolean)
and DirtyChildIterator
MUST NOT touch the hard reference queue
since those iterators are used when persisting a node using a post-order
traversal. If a hard reference queue eviction drives the serialization of
a node and we touch the hard reference queue during the post-order
traversal then we break down the semantics of
HardReferenceQueue.add(Object)
as the eviction does not
necessarily cause the queue to reduce in length. Another way to handle
this is to have HardReferenceQueue.add(Object)
begin to evict
objects before is is actually at capacity, but that is also a bit
fragile.
Note: The writeRetentionQueue
uses a HardReferenceQueue
.
This is based on a RingBuffer
and is very fast. It does not use a
HashMap
because we can resolve the WeakReference
to the
child Node
or Leaf
using top-down navigation as long as
the Node
or Leaf
remains strongly reachable (that is, as
long as it is on the writeRetentionQueue
or otherwise strongly
held). This means that lookup in a map is not required for top-down
navigation.
protected int ndistinctOnWriteRetentionQueue
writeRetentionQueue
.protected IndexMetadata metadata
BTree
object, but it CAN be changed.protected AbstractBTree(IRawStore store, INodeFactory nodeFactory, boolean readOnly, IndexMetadata metadata, IRecordCompressorFactory<?> recordCompressorFactory)
store
- The persistence store.nodeFactory
- Object that provides a factory for node and leaf objects.readOnly
- true
IFF it is known that the
AbstractBTree
is read-only.metadata
- The IndexMetadata
object for this B+Tree.recordCompressorFactory
- Object that knows how to (de-)compress the serialized data
records.public final BTreeCounters getBtreeCounters()
public final void setBTreeCounters(BTreeCounters btreeCounters)
BTreeCounters
.
Note: This is used by the IndexManager
to ensure that an index
loaded from its backing store uses the BTreeCounters
associated
with that index since the DataService
was last (re-)started.
btreeCounters
- The counters to be used.IllegalArgumentException
- if the argument is null
.public abstract BloomFilter getBloomFilter()
IBloomFilter
, transparently
reopen()
ing the index if necessary.getBloomFilter
in interface ILocalBTreeView
null
if there is no bloom
filter (including the case where there is a bloom filter but it
has been disabled since the BTree
has grown too large and
the expected error rate of the bloom filter would be too high).public final CounterSet getCounters()
Interesting performance counters and other statistics about the index.
Return some "statistics" about the btree including both the static
CounterSet
and the BTreeCounters
s.
Note: counters reporting directly on the AbstractBTree
use a
snapshot mechanism which prevents a hard reference to the
AbstractBTree
from being attached to the return
CounterSet
object. One consequence is that these counters will
not update until the next time you invoke getCounters()
.
Note: In order to snapshot the counters use OneShotInstrument
to
prevent the inclusion of an inner class with a reference to the outer
AbstractBTree
instance.
Note: This method reports information that is NOT available from
BTreeCounters
s. Some of this information is either static or
relatively static (e.g., the UUID
for this index or its
implementation class). Other information is dynamic (such as the #of
nodes and leaves or the current size of the writeRetentionQueue). The
dynamic information is valid for the specific index view as of the moment
that it is sampled.
getCounters
in interface IIndex
getCounters
in interface ICounterSetAccess
BTreeCounters.getCounters()
,
Expose performance counters for read-only indices public void close()
ICheckpointProtocol.close()
is to reduce the resource burden of the
index while not rendering the index inoperative. An index that has been
closed
MAY be reopened
at any time
(conditional on the continued availability of the backing store). Such an
index reference remains valid after a ICheckpointProtocol.close()
. A closed index is
transparently reopened
by any access to the index data
(scanning the index, probing the index, etc).
Note: A ICheckpointProtocol.close()
on a dirty index MUST discard writes rather than
flushing them to the store and MUST NOT update its Checkpoint
record - (ICheckpointProtocol.close()
is used to discard indices with partial writes
when an AbstractTask
fails). If you are seeking to
ICheckpointProtocol.close()
a mutable index view that state can be recovered by
ICheckpointProtocol.reopen()
then you MUST write a new Checkpoint
record
before closing the index.
Note: CLOSING A TRANSIENT INDEX WILL DISCARD ALL DATA!
This implementation clears the hard reference queue (releasing all node
references), releases the hard reference to the root node, and releases
the buffers on the NodeSerializer
(they will be naturally
reallocated on reuse).
Note: AbstractBTree
is NOT thread-safe and close()
MUST
be invoked in a context in which there will not be concurrent threads --
the natural choice being the single-threaded commit service on the
journal.
close
in interface ICheckpointProtocol
IllegalStateException
- if the root is null
, indicating that the
index is already closed.IllegalStateException
- if the root is dirty (this implies that this is a mutable
btree and there are mutations that have not been written
through to the store)public final void reopen()
ICheckpointProtocol.close()
/
ICheckpointProtocol.reopen()
protocol. That protocol may be used to reduce the
resource burden of an index. This method is automatically invoked by a
variety of methods that need to ensure that the index is available for
use..
This method delegates to _reopen()
if double-checked locking
demonstrates that the root
is null
(indicating that
the index has been closed). This method is automatically invoked by a
variety of methods that need to ensure that the index is available for
use.
reopen
in interface ICheckpointProtocol
ICheckpointProtocol.close()
,
ICheckpointProtocol.isOpen()
,
#getRoot()
protected abstract void _reopen()
reopen()
once root
has been show to be
null
with double-checked locking. When invoked in this
context, the caller is guaranteed to hold a lock on this. This is
done to ensure that at most one thread gets to re-open the index from the
backing store.public final boolean isOpen()
ICheckpointProtocol
isOpen
in interface ICheckpointProtocol
ICheckpointProtocol.close()
,
ICheckpointProtocol.reopen()
public final boolean isTransient()
true
iff this is a transient data structure (no
backing store).protected final void assertNotTransient()
public final boolean isReadOnly()
true
iff this B+Tree is read-only.isReadOnly
in interface IReadWriteLockManager
protected final void assertNotReadOnly()
UnsupportedOperationException
- if the B+Tree is read-only.isReadOnly()
public abstract long getLastCommitTime()
IAtomicStore.commit()
in
which writes buffered by this index were made restart-safe on the backing
store. The lastCommitTime is set when the index is loaded from the
backing store and updated after each commit. It is ZERO (0L) when
BTree
is first created and will remain ZERO (0L) until the
BTree
is committed. If the backing store does not support atomic
commits, then this value will always be ZERO (0L).
Note: This is fixed for an IndexSegment
.
getLastCommitTime
in interface ICheckpointProtocol
ICheckpointProtocol.getLastCommitTime()
public abstract long getRevisionTimestamp()
The revision timestamp assigned by this method is
lastCommitTime+1
. The reasoning is as follows. Revision
timestamps are assigned by the transaction manager when the transaction
is validated as part of its commit protocol. Therefore, revision
timestamps are assigned after the transaction write set is complete.
Further, the assigned revisionTimestamp will be strictly LT the
commitTime for that transaction. By using lastCommitTime+1
we are guaranteed that the revisionTimestamp for new writes (which will
be part of some future commit point) will always be strictly GT the
revisionTimestamp of historical writes (which were part of some prior
commit point).
Note: Unisolated operations using this timestamp ARE NOT validated. The timestamp is simply applied to the tuple when it is inserted or updated and will become part of the restart safe state of the B+Tree once the unisolated operation participates in a commit.
Note: If an unisolated operation were to execute concurrent with a
transaction commit for the same index then that could produce
inconsistent results in the index and could trigger concurrent
modification errors. In order to avoid such concurrent modification
errors, unisolated operations which are to be mixed with full
transactions MUST ensure that they have exclusive access to the
unisolated index before proceeding. There are two ways to do this: (1)
take the application off line for transactions; (2) submit your unisolated
operations to the IConcurrencyManager
which will automatically
impose the necessary constraints on concurrent access to the unisolated
indices.
UnsupportedOperationException
- if the index is read-only.public final IResourceMetadata[] getResourceMetadata()
IIndex
getResourceMetadata
in interface IIndex
public IndexMetadata getIndexMetadata()
Note: The same method is exposed by ICheckpointProtocol
. It is
also exposed here in order to provide access to the IndexMetadata
to remote clients in the scale-out architecture.
Note: If the B+Tree is read-only then the metadata object will be cloned to avoid potential modification. However, only a single cloned copy of the metadata record will be shared between all callers for a given instance of this class.
getIndexMetadata
in interface ICheckpointProtocol
getIndexMetadata
in interface IIndex
null
.ICheckpointProtocol.getIndexMetadata()
public String toString()
public BaseIndexStats dumpPages(boolean recursive, boolean visitLeaves)
ICheckpointProtocol
dumpPages
in interface ICheckpointProtocol
recursive
- When true
, also collects statistics on the pages
(nodes and leaves) using a low-level approach.visitLeaves
- When true
and recursive:=true
then the
leaves of the index are also visited.protected final boolean rangeCheck(byte[] key, boolean allowUpperBound)
The method is disabled because it forces the caller to ensure
that the query lies within the specified range. While that is
all well and good, this places an undue burden on
FusedView
which must substitute the actual key range
of a shard view for the caller's key range. This makes the
test effectively into a self-check of whether
FusedView
has correctly done this substitution rather
than a check of whether a request was mapped onto the correct
shard, which was the original purpose of this test. The logic
for performing such range checks was moved into
RangeCheckUtil
.
Note: An index partition is identified by
IndexMetadata.getPartitionMetadata()
returning non-
null
.
Note: This method is used liberally in assert
s to detect
errors that can arise is client code when an index partition is split,
joined, or moved.
key
- The key.allowUpperBound
- true
iff the key represents an inclusive
upper bound and thus must be allowed to be LTE to the right
separator key for the index partition. For example, this would
be true
for the toKey parameter on
rangeCount or rangeIterator methods.true
always.IllegalArgumentException
- if the key is null
KeyOutOfRangeException
- if the key does not lie within the index partition.public final int getBranchingFactor()
IBTreeStatistics
getBranchingFactor
in interface IBTreeStatistics
public final boolean isBalanced()
true
iff the tree is balanced.
Note: Not all tree-structured indices are balanced. For example, the
BTree
is balanced, but the HTree
is not balanced. The
height of an unbalanced tree must be discovered through a traversal of
the tree.
isBalanced
in interface ISimpleTreeIndexAccess
true
since a B+Tree is a balanced tree.public abstract long getEntryCount()
IBTreeStatistics
AbstractBTree
. This is zero
(0) for a new B+Tree. When the B+Tree supports delete markers, this value
also includes tuples which have been marked as deleted.getEntryCount
in interface IBTreeStatistics
getEntryCount
in interface ISimpleTreeIndexAccess
public IBTreeStatistics getStatistics()
public IBTreeUtilizationReport getUtilization()
IBTreeStatistics
getUtilization
in interface IBTreeStatistics
public final NodeSerializer getNodeSerializer()
IIndex
.protected final AbstractNode<?> getRoot()
The hard reference to the root node is cleared if the index is
closed
. This method automatically reopen()
s
the index if it is closed, making it available for use.
protected AbstractNode<?> getRootOrFinger(byte[] key)
#finger
and will return it if the key lies within
the finger.key
- The key.public final Tuple getWriteTuple()
IRangeQuery.KEYS
and
IRangeQuery.VALS
are requested so that indices which encode part
of the application object within the key can recover the application
object in insert(Object, Object)
and remove(Object)
.
While the Tuple
is not safe for use by concurrent threads, the
mutation API is not safe for concurrent threads either.
public final Tuple getLookupTuple()
Tuple
that may be used to copy the value associated with
a key out of the AbstractBTree
.
While the returned Tuple
is not safe for use by concurrent
threads, each instance returned by this method is safe for use by the
thread in which it was obtained.
lookup(byte[], Tuple)
public final Tuple getContainsTuple()
Tuple
that may be used to test for the existence of
tuples having a given key within the AbstractBTree
.
The tuple does not copy either the keys or the values. Contains is
implemented as a lookup operation that either return this tuple or
null
. When isolation is supported, the version metadata is
examined to determine if the matching entry is flagged as deleted in
which case contains() will report "false".
While the returned Tuple
is not safe for use by concurrent
threads, each instance returned by this method is safe for use by the
thread in which it was obtained.
contains(byte[])
public final Object insert(Object key, Object value)
IAutoboxBTree
insert
in interface IAutoboxBTree
key
- The key is implicitly converted to an unsigned
byte[]
.value
- The value is implicitly converted to a byte[]
.null
if there was
no value stored under that key.public final byte[] insert(byte[] key, byte[] value)
ISimpleBTree
insert
in interface ISimpleBTree
key
- The key.value
- The value (may be null).null
if the
key was not found or if the previous entry for that key was
marked as deleted.public final byte[] putIfAbsent(byte[] key, byte[] value)
ISimpleBTree
if (!contains(key)) insert(key, value);However, if the index allows
null
values to be stored under
a key and the application in fact stores null
values for
some tuples, then caller is not able to decide using this method whether
or not the mutation was applied based on the return value. For these
cases if the caller needs to know whether or not the conditional mutation
actually took place, the caller CAN use the pattern
if(!contains()) insert(key,value);
to obtain that
information.putIfAbsent
in interface ISimpleBTree
key
- The key.value
- The value (may be null).null
if the key
was not found or if the previous entry for that key was marked as
deleted. Note that the return value MAY be null
even
if there was an entry under the key. This is because the index is
capable of storing a null
value. In such cases the
conditional mutation WAS NOT applied.(putIfAbsent)
public final Tuple insert(byte[] key, byte[] value, boolean delete, boolean putIfAbsent, long timestamp, Tuple tuple)
key
- The variable length unsigned byte[] key.value
- The variable length byte[] value (MAY be null
).delete
- true
iff the index entry should be marked as
deleted (this behavior is supported iff the btree supports
delete markers).putIfAbsent
- When true
, a pre-existing entry for the key will
NOT be replaced (unless it is a deleted tuple, which is the
same as if there was no entry under the key). This should ONLY
be true when the top-level method is putIfAbsent
.
Historical code paths should specify false for an unconditional
mutation. See BLZG-1539.timestamp
- The timestamp to be associated with the new or updated index
entry (required iff the btree supports transactional isolation
and otherwise 0L).tuple
- Data and metadata for the old value under the key will be
copied onto this object (optional).null
if there was no entry under
that key. See ITuple.isDeletedVersion()
to determine
whether or not the entry is marked as deleted.UnsupportedOperationException
- if the index is read-only.public final Object remove(Object key)
IAutoboxBTree
remove
in interface IAutoboxBTree
key
- The key is implicitly converted to an unsigned
byte[]
.null
if the key was not found.public final byte[] remove(byte[] key)
remove
in interface ISimpleBTree
key
- The key.null
if the key
was not found or if the previous entry under that key was marked
as deleted.public final Tuple remove(byte[] key, Tuple tuple)
remove(byte[])
uses #insert(byte[], byte[], boolean, long, Tuple)
instead of
this method to mark the index entry as deleted when delete markers are
being maintained.
Note: removing a key has no effect on the optional bloom filter. If a key
is removed from the index by this method then the bloom filter will
report a false positive for that key, which will be detected
when we test the index itself. This works out fine in the scale-out
design since the bloom filter is per AbstractBTree
instance and
split/join/move operations all result in new mutable BTree
s with
new bloom filters to absorb new writes. For a non-scale-out deployment,
this can cause the performance of the bloom filter to degrade if you are
removing a lot of keys. However, in the special case of a BTree
that does NOT use delete markers, BTree.removeAll()
will create a
new root leaf and a new (empty) bloom filter as well.
key
- The search key.tuple
- An object that will be used to report data and metadata for
the pre-existing version (optional).null
if there was no version
under that key.UnsupportedOperationException
- if delete markers are being maintained.UnsupportedOperationException
- if the index is read-only.public abstract void removeAll()
Note: The IIndexManager
defines methods for registering (adding)
and dropping indices vs removing the entries in an individual
AbstractBTree
.
removeAll
in interface ISimpleIndexAccess
public Object lookup(Object key)
IAutoboxBTree
lookup
in interface IAutoboxBTree
key
- The key is implicitly converted to an unsigned
byte[]
.null
if there is no
entry for that key.public byte[] lookup(byte[] key)
ISimpleBTree
lookup
in interface ISimpleBTree
null
if there
is no entry for that key or if the entry under that key is marked
as deleted.public Tuple lookup(byte[] key, Tuple tuple)
null
from a
missing index entry or (when delete markers are enabled) from a deleted
index entry. Applies the optional bloom filter if it exists. If the bloom
filter exists and reports true
, then looks up the value
for the key in the index (note that the key might not exist in the index
since a bloom filter allows false positives, further the key might exist
for a deleted entry).key
- The search key.tuple
- An object that will be used to report data and metadata for
the pre-existing version (required).null
if there is no entry in the
index under the key.public boolean contains(Object key)
IAutoboxBTree
contains
in interface IAutoboxBTree
key
- The key is implicitly converted to an unsigned
byte[]
.public boolean contains(byte[] key)
true
, then verifies that the key does in fact
exist in the index.contains
in interface ISimpleBTree
key
- The key.true
iff the key does not exist. Or, if the btree
supports isolation, if the key exists but it is marked as
"deleted".public long indexOf(byte[] key)
ILinearList
Note that ILinearList.indexOf(byte[])
is the basis for implementing the
IRangeQuery
interface.
indexOf
in interface ILinearList
key
- The key.(-(insertion point) - 1)
. The insertion point is
defined as the point at which the key would be found it it were
inserted into the btree without intervening mutations. Note that
this guarantees that the return value will be >= 0 if and only if
the key is found. When found the index will be in [0:nentries).
Adding or removing entries in the tree may invalidate the index.
pos = -(pos+1)
will convert an insertion point to
the index at which the key would be found if it were
inserted - this is also the index of the predecessor of key
in the index.
ILinearList.keyAt(long)
,
ILinearList.valueAt(long)
public byte[] keyAt(long index)
ILinearList
ISimpleBTree.lookup(byte[])
.keyAt
in interface ILinearList
index
- The index position of the entry (origin zero).ILinearList.indexOf(byte[])
,
ILinearList.valueAt(long)
public byte[] valueAt(long index)
ILinearList
ISimpleBTree.lookup(byte[])
.valueAt
in interface ILinearList
index
- The index position of the entry (origin zero).null
if
there is a deleted entry at that index position thenILinearList.indexOf(byte[])
,
ILinearList.keyAt(long)
public final long rangeCountExact(byte[] fromKey, byte[] toKey)
IRangeQuery
Note: If the index supports deletion markers then this operation will require a key-range scan.
rangeCountExact
in interface IRangeQuery
fromKey
- The lowest key that will be counted (inclusive). When
null
there is no lower bound.toKey
- The first key that will not be counted (exclusive). When
null
there is no upper bound.public final long rangeCount()
IRangeQuery
Note: If the index supports deletion markers then the range count will be an upper bound and may double count tuples which have been overwritten, including the special case where the overwrite is a delete.
rangeCount
in interface IRangeQuery
rangeCount
in interface ISimpleIndexAccess
ISimpleIndexAccess.rangeCount()
public final long rangeCount(Object fromKey, Object toKey)
fromKey
- toKey
- public final long rangeCount(byte[] fromKey, byte[] toKey)
Note: If the index supports deletion markers then the range count will be an upper bound and may double count tuples which have been overwritten, including the special case where the overwrite is a delete.
This method computes the #of entries in the half-open range using
AbstractNode#indexOf(Object)
. Since it does not scan the tuples
it can not differentiate between deleted and undeleted tuples for an
index that supports delete markers, but the result will be exact if
delete markers are not being used. The cost is equal to the cost of
lookup of the both keys. If both keys are null
, then the
cost is zero (no IOs).
rangeCount
in interface IRangeQuery
fromKey
- The lowest key that will be counted (inclusive). When
null
there is no lower bound.toKey
- The first key that will not be counted (exclusive). When
null
there is no upper bound.public long rangeCountExactWithDeleted(byte[] fromKey, byte[] toKey)
rangeCount(byte[], byte[])
already reports deleted tuples
for an AbstractBTree
so this method is just delegated to that
one. However,
FusedView.rangeCountExactWithDeleted(byte[], byte[])
treats this
case differently since it must report the exact range count when
considering all sources at once. It uses a range iterator scan visiting
both deleted and undeleted tuples for that.rangeCountExactWithDeleted
in interface IRangeQuery
fromKey
- The lowest key that will be counted (inclusive). When
null
there is no lower bound.toKey
- The first key that will not be counted (exclusive). When
null
there is no upper bound.IRangeQuery.rangeCountExact(byte[], byte[])
public final ICloseableIterator<?> scan()
ISimpleIndexAccess
scan
in interface ISimpleIndexAccess
public final ITupleIterator rangeIterator()
IRangeQuery
rangeIterator(null, null)
rangeIterator
in interface IRangeQuery
public final ITupleIterator rangeIterator(Object fromKey, Object toKey)
fromKey
- toKey
- public final ITupleIterator rangeIterator(byte[] fromKey, byte[] toKey)
IRangeQuery
rangeIterator
in interface IRangeQuery
fromKey
- The first key that will be visited (inclusive lower bound).
When null
there is no lower bound.toKey
- The first key that will NOT be visited (exclusive upper
bound). When null
there is no upper bound.SuccessorUtil, which may be used to compute the successor of a value
before encoding it as a component of a key.
,
BytesUtil#successor(byte[]), which may be used to compute the
successor of an encoded key.
,
EntryFilter, which may be used to filter the entries visited by the
iterator.
public final ITupleIterator rangeIterator(Object fromKey, Object toKey, int capacity, int flags, IFilter filter)
fromKey
- toKey
- public ITupleIterator rangeIterator(byte[] fromKey, byte[] toKey, int capacityIsIgnored, int flags, IFilter filter)
Core implementation.
Note: If IRangeQuery.CURSOR
is specified the returned iterator
supports traversal with concurrent modification by a single-threaded
process (the BTree
is NOT thread-safe for writers). Write are
permitted iff AbstractBTree
allows writes.
Note: IRangeQuery.REVERSE
is handled here by wrapping the
underlying ITupleCursor
.
Note: IRangeQuery.REMOVEALL
is handled here by wrapping the
iterator.
Note: FusedView.rangeIterator(byte[], byte[], int, int, IFilter)
is also responsible for constructing an ITupleIterator
in a
manner similar to this method. If you are updating the logic here, then
check the logic in that method as well!
rangeIterator
in interface IRangeQuery
fromKey
- The first key that will be visited (inclusive lower bound).
When null
there is no lower bound.toKey
- The first key that will NOT be visited (exclusive upper
bound). When null
there is no upper bound.capacityIsIgnored
- The #of entries to buffer at a time. This is a hint and MAY be
zero (0) to use an implementation specific default
capacity. A non-zero value may be used if you know that you
want at most N results or if you want to override the default
#of results to be buffered before sending them across a
network interface. (Note that you can control the default
value using
IBigdataClient.Options#DEFAULT_CLIENT_RANGE_QUERY_CAPACITY
).flags
- A bitwise OR of IRangeQuery.KEYS
, IRangeQuery.VALS
, etc.filter
- An optional object used to construct a stacked iterator. When
IRangeQuery.CURSOR
is specified in flags, the base
iterator will implement ITupleCursor
and the first
filter in the stack can safely cast the source iterator to an
ITupleCursor
. If the outermost filter in the stack
does not implement ITupleIterator
, then it will be
wrapped an ITupleIterator
.SuccessorUtil, which may be used to compute the successor of a value
before encoding it as a component of a key.
,
BytesUtil#successor(byte[]), which may be used to compute the
successor of an encoded key.
,
IFilterConstructor, which may be used to construct an iterator stack
performing filtering or other operations.
public long rangeCopy(IIndex src, byte[] fromKey, byte[] toKey, boolean overflow)
src
- The source index.fromKey
- The first key that will be copied (inclusive). When
null
there is no lower bound.toKey
- The first key that will NOT be copied (exclusive). When
null
there is no upper bound.overflow
- When true
the IOverflowHandler
defined
for the source index (if any) will be applied to copy raw
records onto the backing store for this index. This value
SHOULD be false
if the two indices have the
same backing store and true
otherwise.UnsupportedOperationException
- if the src and this do not have the same
setting for IndexMetadata.getDeleteMarkers()
UnsupportedOperationException
- if the src and this do not have the same
setting for IndexMetadata.getVersionTimestamps()
CompactTask
,
OverflowManager
public <T> T submit(byte[] key, ISimpleIndexProcedure<T> proc)
IIndex
submit
in interface IIndex
key
- The key.proc
- The procedure.IIndexProcedure.apply(IIndex)
public void submit(byte[] fromKey, byte[] toKey, IKeyRangeIndexProcedure proc, IResultHandler handler)
IIndex
Note: Since this variant of submit() does not split keys the
fromIndex and toIndex in the Split
s reported to
the IResultHandler
will be zero (0).
submit
in interface IIndex
fromKey
- The lower bound (inclusive) -or- null
if there
is no lower bound.toKey
- The upper bound (exclusive) -or- null
if there
is no upper bound.proc
- The procedure. If the procedure implements the
IParallelizableIndexProcedure
marker interface then it
MAY be executed in parallel against the relevant index
partition(s).public void submit(int fromIndex, int toIndex, byte[][] keys, byte[][] vals, AbstractKeyArrayIndexProcedureConstructor ctor, IResultHandler aggregator)
IIndex
Note: This may be used to send custom logic together with the data to a
remote index or index partition. When the index is remote both the
procedure and the return value MUST be Serializable
.
Note: The scale-out indices add support for auto-split of the procedure such that it runs locally against each relevant index partition.
submit
in interface IIndex
fromIndex
- The index of the first key to be used (inclusive).toIndex
- The index of the last key to be used (exclusive).keys
- The keys (required).vals
- The values (optional depending on the procedure).ctor
- An object that can create instances of the procedure.aggregator
- When defined, results from each procedure application will be
reported to this object.
TODO In order to allow parallelization within a shard, we need to modify
this method signature to pass in an IResultHandler
constructor
object. That might be something which could be pushed down onto the ctor
argument. It would be used in scale-out to create a DS local result handler
so we can locally aggregate when parallelizing against each shard and then
return that aggregated result to the client which would extract the aggregate
result across the shards from the client's result handler. See BLZG-1537.(Schedule more IOs when loading data)
public AbstractNode getRightMostNode(boolean nodesOnly)
nodesOnly
- when true
the search will halt at the
right-most non-leaf. Otherwise it will return the right-most
leaf.null
if
the root of the B+Tree is a leaf and
nodesOnly == true
.public abstract ILeafCursor newLeafCursor(SeekEnum where)
public abstract ILeafCursor newLeafCursor(byte[] key)
key
- The key (required).IllegalArgumentException
- if key is null
.public boolean dump(PrintStream out)
out
- The dump is written on this stream.public boolean dump(org.apache.log4j.Level level, PrintStream out)
public int getLevel(AbstractNode t)
t
- A node or leaf that belongs to this B+Tree.IllegalArgumentException
- if t is null.IllegalArgumentException
- if t does not belong to this B+Tree.public int getLevel(AbstractNode t, AbstractNode node)
t
- A node or leaf that belongs to this B+Tree.node
- A node or leaf that is either t or (recursively) some
parent of t.IllegalArgumentException
- if t is null.IllegalArgumentException
- if t does not belong to this B+Tree.IllegalArgumentException
- if node is null.IllegalArgumentException
- if node does not belong to this B+Tree.NotChildException
- if t is neither node nor some descendant of
node.protected void touch(AbstractNode<?> node)
This method is responsible for putting the node or leaf onto the ring buffer which controls (a) how long we retain a hard reference to the node or leaf; and (b) for writes, when the node or leaf is evicted with a zero reference count and made persistent (along with all dirty children). The concurrency requirements and the implementation behavior and guarentees differ depending on whether the B+Tree is read-only or mutable for two reasons: For writers, the B+Tree is single-threaded so there is no contention. For readers, every touch on the B+Tree goes through this point, so it is vital to make this method non-blocking.
This method guarantees that the specified node will NOT be synchronously persisted as a side effect and thereby made immutable. (Of course, the node may be already immutable.)
In conjunction with DefaultEvictionListener
, this method
guarantees that the reference counter for the node will reflect the #of
times that the node is actually present on the
writeRetentionQueue
.
If the node is not found on a scan of the head of the queue, then it is
appended to the queue and its AbstractNode.referenceCount
is
incremented. If a node is being appended to the queue and the queue is at
capacity, then this will cause a reference to be evicted from the queue.
If the reference counter for the evicted node or leaf is zero and the
evicted node or leaf is dirty, then a data record will be coded for the
evicted node or leaf and written onto the backing store. A subsequent
attempt to modify the node or leaf will force copy-on-write for that node
or leaf.
For the mutable B+Tree we also track the #of references to the node/leaf on the ring buffer. When that reference count reaches zero we do an eviction and the node/leaf is written onto the backing store if it is dirty. Those reference counting games DO NOT matter for read-only views so we can take a code path which does not update the per-node/leaf reference count and we do not need to use either synchronization or atomic counters to track the reference counts.
In order to reduce contention for the lock required to update the backing
queue, the writeRetentionQueue
is configured to collect
references for touched nodes or leaves in a thread-local queue and then
batch those references through to the backing hard reference queue while
holding the lock.
node
- The node or leaf.protected final void writeNodeRecursive(AbstractNode<?> node)
Node
s.
Note: This will throw an exception if the backing store is read-only.node
- The root of the hierarchy of nodes to be written. The node
MUST be dirty. The node this does NOT have to be the root of
the tree and it does NOT have to be a Node
.(Reduce commit latency by parallel checkpoint by level of
dirty pages in an index)
protected final void writeNodeRecursiveCallersThread(AbstractNode<?> node)
node
- protected final void writeNodeRecursiveConcurrent(AbstractNode<?> node)
ICheckpointProtocol.writeCheckpoint()
or for a Node
evicted from the writeRetentionQueue
. Whether this is driven by
ICheckpointProtocol.writeCheckpoint()
or touch(AbstractNode)
, we have the
same contract with respect to eviction as the single-threaded eviction
logic - we are just doing it in parallel level sets.node
- (Reduce commit latency by parallel checkpoint by level of
dirty pages in an index)
protected long writeNodeOrLeaf(AbstractNode<?> node)
writeRetentionQueue
, the B+Tree continuously converts nodes and
leaves to their more compact coded record forms which results in a
smaller in memory footprint.
Note: For a transient B+Tree, this merely codes the node but does not write the node on the store (there is none).
UnsupportedOperationException
- if the B+Tree (or the backing store) is read-only.protected AbstractNode<?> readNodeOrLeaf(long addr)
Note: Callers SHOULD be synchronized in order to ensure that only one thread will read the desired node or leaf in from the store and attach the reference to the newly read node or leaf as appropriate into the existing data structures (e.g., as the root reference or as a child of a node or leaf already live in memory).
addr
- The address in the store.IllegalArgumentException
- if the address is IAddressManager.NULL
.public static byte[] encodeRecordAddr(ByteArrayBuffer recordAddrBuf, long addr)
decodeRecordAddr(byte[])
recordAddrBuf
- The buffer that will be used to format the record address.addr
- The raw record address.public static long decodeRecordAddr(byte[] buf)
#appendSigned(long)
.buf
- The buffer containing the encoded record address.protected int recycle(long addr)
BTreeCounters
.addr
- The address to be recycled.IAddressManager.NULL
.public final Lock readLock()
IReadWriteLockManager
Lock
that may be used to obtain a shared read lock which
is used (in the absence of other concurrency control mechanisms) to
permit concurrent readers on an unisolated index while serializing access
to that index when a writer must run. This is exposed for processes which
need to obtain the write lock to coordinate external operations.
Note: If the persistence capable data structure is read-only then the
returned Lock
is a singleton that ignores all lock requests. This
is because our read-only persistence capable data structures are already
thread-safe for concurrent readers.
readLock
in interface IReadWriteLockManager
public final Lock writeLock()
IReadWriteLockManager
Lock
that may be used to obtain an exclusive write lock
which is used (in the absence of other concurrency control mechanisms) to
serialize all processes accessing an unisolated index when a writer must
run. This is exposed for processes which need to obtain the write lock to
coordinate external operations.writeLock
in interface IReadWriteLockManager
public final int getReadLockCount()
IReadWriteLockManager
getReadLockCount
in interface IReadWriteLockManager
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.