public class BTree extends AbstractBTree implements ICheckpointProtocol
This class implements a variant of a B+Tree in which all values are stored in leaves, but the leaves are not connected with prior-next links. This constraint arises from the requirement to support a copy-on-write policy.
Note: No mechanism is exposed for fetching a node or leaf of the tree other than the root by its key. This is because the parent reference on the node (or leaf) can only be set when it is read from the store in context by its parent node.
Note: the leaves can not be stitched together with prior and next references without forming cycles that make it impossible to write out the leaves of the btree. This restriction arises because each time we write out a node or leaf it is assigned a persistent identifier as an unavoidable artifact of providing isolation for the object index.
Note: This implementation is thread-safe for concurrent readers BUT NOT for
concurrent writers. If a writer has access to a BTree
then there MUST
NOT be any other reader -or- writer operating on the BTree
at the
same time.
Modifier and Type | Class and Description |
---|---|
static class |
BTree.Counter
Mutable counter.
|
class |
BTree.LeafCursor
A cursor that may be used to traversal
Leaf s. |
protected static class |
BTree.NodeFactory
Factory for mutable nodes and leaves used by the
NodeSerializer . |
static class |
BTree.PartitionedCounter
Places the counter values into a namespace formed by the partition
identifier.
|
protected static class |
BTree.Stack
A simple stack based on an array used to maintain hard references for the
parent
Node s in the BTree.LeafCursor . |
AbstractBTree.IBTreeCounters
Modifier and Type | Field and Description |
---|---|
protected AtomicLong |
counter
The mutable counter exposed by #getCounter()}.
|
protected int |
height
The height of the btree.
|
protected long |
nentries
The #of entries in the btree.
|
protected long |
nleaves
The #of leaf nodes in the btree.
|
protected long |
nnodes
The #of non-leaf nodes in the btree.
|
protected long |
recordVersion
The value of the record version number that will be assigned to the next
node or leaf written onto the backing store.
|
branchingFactor, debug, DEBUG, dumpLog, error, ERROR_CLOSED, ERROR_ERROR_STATE, ERROR_LESS_THAN_ZERO, ERROR_READ_ONLY, ERROR_TOO_LARGE, ERROR_TRANSIENT, INFO, log, metadata, ndistinctOnWriteRetentionQueue, nodeSer, readOnly, root, store, storeCache, writeRetentionQueue
Constructor and Description |
---|
BTree(IRawStore store,
Checkpoint checkpoint,
IndexMetadata metadata,
boolean readOnly)
Required constructor form for
BTree and any derived subclasses. |
Modifier and Type | Method and Description |
---|---|
protected void |
_reopen()
This method is responsible for setting up the root leaf (either new or
read from the store), the bloom filter, etc.
|
BTree |
asReadOnly()
Returns an immutable view of this
BTree . |
static BTree |
create(IRawStore store,
IndexMetadata metadata)
Create a new
BTree or derived class. |
static BTree |
createTransient(IndexMetadata metadata)
|
long |
createViewCheckpoint()
|
protected void |
fireDirtyEvent()
Fire an event to the listener (iff set).
|
BloomFilter |
getBloomFilter()
Lazily reads the bloom filter from the backing store if it exists and is
not already in memory.
|
Checkpoint |
getCheckpoint()
Returns the most recent
ICheckpoint record. |
ICounter |
getCounter()
Returns an
ICounter . |
IDirtyListener |
getDirtyListener()
Return the
IDirtyListener . |
long |
getEntryCount()
The #of entries (aka tuples) in the
AbstractBTree . |
int |
getHeight()
The height of the btree.
|
long |
getLastCommitTime()
The timestamp associated with the last
IAtomicStore.commit() in
which writes buffered by this index were made restart-safe on the backing
store. |
long |
getLeafCount()
The #of leaf nodes in the
AbstractBTree . |
long |
getMetadataAddr()
The address at which the most recent
IndexMetadata record was
written. |
BTree |
getMutableBTree()
The
BTree that is absorbing writes for the view. |
long |
getNodeCount()
The #of non-leaf nodes in the
AbstractBTree . |
long |
getRecordVersion()
The value of the record version number that will be assigned to the next
node or leaf written onto the backing store.
|
long |
getRevisionTimestamp()
The timestamp associated with unisolated writes on this index.
|
long |
getRootAddr()
The address of the last written root of the persistent data structure
-or-
0L if there is no root. |
int |
getSourceCount()
Returns ONE (1).
|
AbstractBTree[] |
getSources()
An array containing this
BTree . |
IRawStore |
getStore()
The backing store.
|
long |
handleCommit(long commitTime)
Handle request for a commit by
writeCheckpoint() ing dirty nodes
and leaves onto the store, writing a new metadata record, and returning
the address of that metadata record. |
void |
invalidate(Throwable t)
Mark an
ICommitter as invalid. |
static BTree |
load(IRawStore store,
long addrCheckpoint,
boolean readOnly)
Load an instance of a
BTree or derived class from the store. |
boolean |
needsCheckpoint()
Return true iff changes would be lost unless the B+Tree is flushed to the
backing store using
writeCheckpoint() . |
BTree.LeafCursor |
newLeafCursor(byte[] key)
Return a cursor that may be used to efficiently locate and scan the
leaves in the B+Tree.
|
BTree.LeafCursor |
newLeafCursor(SeekEnum where)
Return a cursor that may be used to efficiently locate and scan the
leaves in the B+Tree.
|
protected BloomFilter |
readBloomFilter()
Read the bloom filter from the backing store using the address stored in
the last
checkpoint record. |
void |
removeAll()
Remove all entries in the B+Tree.
|
void |
setDirtyListener(IDirtyListener listener)
Set or clear the listener (there can be only one).
|
void |
setIndexMetadata(IndexMetadata indexMetadata)
Method updates the index metadata associated with this
BTree . |
void |
setLastCommitTime(long lastCommitTime)
Sets the lastCommitTime.
|
long |
writeCheckpoint()
Checkpoint operation must
#flush() dirty nodes, dirty persistent
data structures, etc, write a new Checkpoint record on the
backing store, save a reference to the current Checkpoint , and
return the address of that Checkpoint record. |
Checkpoint |
writeCheckpoint2()
Checkpoint operation must
#flush() dirty nodes, dirty persistent
data structures, etc, write a new Checkpoint record on the
backing store, save a reference to the current Checkpoint , and
return the address of that Checkpoint record. |
assertNotReadOnly, assertNotTransient, close, contains, contains, decodeRecordAddr, dump, dump, dumpPages, encodeRecordAddr, getBranchingFactor, getBtreeCounters, getContainsTuple, getCounters, getIndexMetadata, getLevel, getLevel, getLookupTuple, getNodeSerializer, getReadLockCount, getResourceMetadata, getRightMostNode, getRoot, getRootOrFinger, getStatistics, getUtilization, getWriteTuple, indexOf, insert, insert, insert, isBalanced, isOpen, isReadOnly, isTransient, keyAt, lookup, lookup, lookup, putIfAbsent, rangeCheck, rangeCopy, rangeCount, rangeCount, rangeCount, rangeCountExact, rangeCountExactWithDeleted, rangeIterator, rangeIterator, rangeIterator, rangeIterator, rangeIterator, readLock, readNodeOrLeaf, recycle, remove, remove, remove, reopen, scan, setBTreeCounters, submit, submit, submit, toString, touch, valueAt, valueAt, writeLock, writeNodeOrLeaf, writeNodeRecursive, writeNodeRecursiveCallersThread, writeNodeRecursiveConcurrent
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
close, dumpPages, getIndexMetadata, isOpen, reopen
getCounters
rangeCount, scan
getReadLockCount, isReadOnly, readLock, writeLock
protected int height
height := 0
. Note
that the height only changes when we split the root node.protected long nnodes
protected long nleaves
protected long nentries
protected long recordVersion
protected AtomicLong counter
Note: This is protected
so that it will be visible to
Checkpoint
which needs to write the actual value store in this
counter into its serialized record (without the partition identifier).
public BTree(IRawStore store, Checkpoint checkpoint, IndexMetadata metadata, boolean readOnly)
BTree
and any derived subclasses.
This constructor is used both to create a new BTree
, and to load
a BTree
from the store using a Checkpoint
record.store
- The store.checkpoint
- A Checkpoint
record for that BTree
.metadata
- The metadata record for that BTree
.readOnly
- When true
the BTree
will be immutable.create(IRawStore, IndexMetadata)
,
load(IRawStore, long, boolean)
public final int getHeight()
IBTreeStatistics
height := 0
. A btree with a
root node and one level of leaves under it has height := 1
.
Note that all leaves of a btree are at the same height (this is what is
means for the btree to be "balanced"). Also note that the height only
changes when we split or join the root node (a btree maintains balance by
growing and shrinking in levels from the top rather than the leaves).getHeight
in interface IBTreeStatistics
getHeight
in interface ISimpleTreeIndexAccess
ISimpleTreeIndexAccess.isBalanced()
public final long getNodeCount()
IBTreeStatistics
AbstractBTree
. This is zero (0)
for a new btree.getNodeCount
in interface IBTreeStatistics
getNodeCount
in interface ISimpleTreeIndexAccess
public final long getLeafCount()
IBTreeStatistics
AbstractBTree
. This is one (1) for a
new btree.getLeafCount
in interface IBTreeStatistics
getLeafCount
in interface ISimpleTreeIndexAccess
public final long getEntryCount()
IBTreeStatistics
AbstractBTree
. This is zero
(0) for a new B+Tree. When the B+Tree supports delete markers, this value
also includes tuples which have been marked as deleted.getEntryCount
in interface IBTreeStatistics
getEntryCount
in interface ISimpleTreeIndexAccess
getEntryCount
in class AbstractBTree
public final IRawStore getStore()
ISimpleIndexAccess
getStore
in interface ISimpleIndexAccess
public ICounter getCounter()
ICounter
. The ICounter
is mutable iff the
BTree
is mutable. All ICounter
s returned by this method
report and increment the same underlying counter.
Note: When the BTree
is part of a scale-out index then the
counter will assign values within a namespace defined by the partition
identifier.
getCounter
in interface IIndexLocalCounter
protected void _reopen()
AbstractBTree
AbstractBTree.reopen()
once AbstractBTree.root
has been show to be
null
with double-checked locking. When invoked in this
context, the caller is guaranteed to hold a lock on this. This is
done to ensure that at most one thread gets to re-open the index from the
backing store._reopen
in class AbstractBTree
public final BloomFilter getBloomFilter()
getBloomFilter
in interface ILocalBTreeView
getBloomFilter
in class AbstractBTree
null
if there is no bloom
filter (including the case where there is a bloom filter but it
has been disabled since the BTree
has grown too large and
the expected error rate of the bloom filter would be too high).protected final BloomFilter readBloomFilter()
checkpoint
record. This method will be invoked by
getBloomFilter()
when the bloom filter reference is
null
but the bloom filter is known to exist and the bloom
filter object is requested.
Note: A bloom filter can be relatively large. The bit length of a bloom filter is approximately one byte per index entry, so a filter for an index with 10M index entries will be on the order of 10mb. Therefore this method will typically have high latency.
Note: the Checkpoint
record initially stores 0L
for the bloom filter address. newRootLeaf()
is responsible for
allocating the bloom filter (if one is to be used) when the root leaf is
(re-)created. The address then gets stored in the Checkpoint
record by writeCheckpoint()
(if invoked and once the bloom
filter is dirty).
IllegalStateException
- if the bloom filter does not exist (the caller should check
this first to avoid obtaining a lock).public final long getLastCommitTime()
AbstractBTree
IAtomicStore.commit()
in
which writes buffered by this index were made restart-safe on the backing
store. The lastCommitTime is set when the index is loaded from the
backing store and updated after each commit. It is ZERO (0L) when
BTree
is first created and will remain ZERO (0L) until the
BTree
is committed. If the backing store does not support atomic
commits, then this value will always be ZERO (0L).
Note: This is fixed for an IndexSegment
.
getLastCommitTime
in interface ICheckpointProtocol
getLastCommitTime
in class AbstractBTree
ICheckpointProtocol.getLastCommitTime()
public final long getRevisionTimestamp()
AbstractBTree
The revision timestamp assigned by this method is
lastCommitTime+1
. The reasoning is as follows. Revision
timestamps are assigned by the transaction manager when the transaction
is validated as part of its commit protocol. Therefore, revision
timestamps are assigned after the transaction write set is complete.
Further, the assigned revisionTimestamp will be strictly LT the
commitTime for that transaction. By using lastCommitTime+1
we are guaranteed that the revisionTimestamp for new writes (which will
be part of some future commit point) will always be strictly GT the
revisionTimestamp of historical writes (which were part of some prior
commit point).
Note: Unisolated operations using this timestamp ARE NOT validated. The timestamp is simply applied to the tuple when it is inserted or updated and will become part of the restart safe state of the B+Tree once the unisolated operation participates in a commit.
Note: If an unisolated operation were to execute concurrent with a
transaction commit for the same index then that could produce
inconsistent results in the index and could trigger concurrent
modification errors. In order to avoid such concurrent modification
errors, unisolated operations which are to be mixed with full
transactions MUST ensure that they have exclusive access to the
unisolated index before proceeding. There are two ways to do this: (1)
take the application off line for transactions; (2) submit your unisolated
operations to the IConcurrencyManager
which will automatically
impose the necessary constraints on concurrent access to the unisolated
indices.
getRevisionTimestamp
in class AbstractBTree
public final void setLastCommitTime(long lastCommitTime)
ICheckpointProtocol
Note: The lastCommitTime is set by a combination of the
AbstractJournal
and Name2Addr
based on the actual
commitTime of the commit during which an Name2Addr.Entry
for that index was
last committed. It is set for both historical index reads and unisolated
index reads using Name2Addr.Entry.commitTime
. The lastCommitTime for an
unisolated index will advance as commits are performed with that index.
setLastCommitTime
in interface ICheckpointProtocol
lastCommitTime
- The timestamp of the last committed state of this index.public final IDirtyListener getDirtyListener()
IDirtyListener
.getDirtyListener
in interface ICheckpointProtocol
public final void setDirtyListener(IDirtyListener listener)
setDirtyListener
in interface ICheckpointProtocol
listener
- The listener.protected final void fireDirtyEvent()
public BTree asReadOnly()
BTree
. If BTree
is
already read-only, then this instance is returned. Otherwise, a
read-only BTree
is loaded from the last checkpoint and returned.IllegalStateException
- If the BTree
is dirty.public final long writeCheckpoint()
#flush()
dirty nodes, dirty persistent
data structures, etc, write a new Checkpoint
record on the
backing store, save a reference to the current Checkpoint
, and
return the address of that Checkpoint
record.
Note: A checkpoint by itself is NOT an atomic commit. The commit protocol
is at the store level and uses Checkpoint
s to ensure that the
state of the persistence capable data structure is current on the backing
store.
This checkpoint operation flush()
es dirty nodes, the optional
IBloomFilter
(if dirty), the IndexMetadata
(if dirty),
and then writes a new Checkpoint
record on the backing store,
saves a reference to the current Checkpoint
and returns the
address of that Checkpoint
record.
writeCheckpoint
in interface ICheckpointProtocol
Checkpoint
record for the
persistence capable was written onto the store. The data
structure can be reloaded from this Checkpoint
record.#writeCheckpoint2(), which returns the {@link Checkpoint} record
itself.
,
load(IRawStore, long, boolean)
public final Checkpoint writeCheckpoint2()
#flush()
dirty nodes, dirty persistent
data structures, etc, write a new Checkpoint
record on the
backing store, save a reference to the current Checkpoint
, and
return the address of that Checkpoint
record.
Note: A checkpoint by itself is NOT an atomic commit. The commit protocol
is at the store level and uses Checkpoint
s to ensure that the
state of the persistence capable data structure is current on the backing
store.
writeCheckpoint2
in interface ICheckpointProtocol
Checkpoint
record for the persistent data structure
which was written onto the store. The persistent data structure
can be reloaded from this Checkpoint
record.load(IRawStore, long, boolean)
public final Checkpoint getCheckpoint()
ICheckpointProtocol
ICheckpoint
record.getCheckpoint
in interface ICheckpointProtocol
ICheckpoint
record and never
null
.public final long getRecordVersion()
ICheckpointProtocol
getRecordVersion
in interface ICheckpointProtocol
public final long getMetadataAddr()
ICheckpointProtocol
IndexMetadata
record was
written.getMetadataAddr
in interface ICheckpointProtocol
public final long getRootAddr()
ICheckpointProtocol
0L
if there is no root. A 0L
return may be
an indication that an empty data structure will be created on demand.getRootAddr
in interface ICheckpointProtocol
public boolean needsCheckpoint()
writeCheckpoint()
.
Note: In order to avoid needless checkpoints this method will return
false
if:
null
, indicating that the index
is closed (in which case there are no buffered writes);Checkpoint.getRootAddr()
, and the
counter
value agrees Checkpoint.getCounter()
.true
true iff changes would be lost unless the
B+Tree was flushed to the backing store using
writeCheckpoint()
.public final void setIndexMetadata(IndexMetadata indexMetadata)
BTree
.
The new metadata record will be written out as part of the next index
writeCheckpoint()
.
Note: this method should be used with caution.
indexMetadata
- The new metadata description for the BTree
.IllegalArgumentException
- if the new value is null
IllegalArgumentException
- if the new IndexMetadata
record has already been
written on the store - see
IndexMetadata.getMetadataAddr()
UnsupportedOperationException
- if the index is read-only.public long handleCommit(long commitTime)
writeCheckpoint()
ing dirty nodes
and leaves onto the store, writing a new metadata record, and returning
the address of that metadata record.
Note: In order to avoid needless writes the existing metadata record is
always returned if needsCheckpoint()
is false
.
Note: The address of the existing Checkpoint
record is always
returned if autoCommit
is disabled.
handleCommit
in interface ICommitter
commitTime
- The timestamp assigned to the commit.Checkpoint
record from which the btree
may be reloaded.public void invalidate(Throwable t)
ICommitter
ICommitter
as invalid. This will prevent it from allowing
any writes through to the backing store.invalidate
in interface ICommitter
t
- A cause (required).https://jira.blazegraph.com/browse/BLZG-1953
public final void removeAll()
When delete markers are not enabled this simply replaces the root with a new root leaf and resets the counters (height, #of nodes, #of leaves, etc). to be consistent with that new root. If the btree is then made restart-safe by the commit protocol of the backing store then the effect is as if all entries had been deleted. Old nodes and leaves will be swept from the store eventually when the journal overflows.
When delete markers are enabled all un-deleted entries in the index are overwritten with a delete marker.
Note: The IIndexManager
defines methods for registering (adding)
and dropping indices vs removing the entries in an individual
BTree
.
removeAll
in interface ISimpleIndexAccess
removeAll
in class AbstractBTree
public long createViewCheckpoint()
BTree
in which the view is
redefined to include the previous view of the BTree
(the one from
which this BTree
instance was loaded) plus the current view of
the BTree
. The root of the BTree
is replaced with an
empty leaf as part of this operation. The LocalPartitionMetadata
is updated to reflect the new view.
Note: This method is used by the scale-out architecture for which it performs a very specific function. It should not be used for any other purpose and should not be invoked by user code. This encapsulates all of the trickery for creating the necessary checkpoint without exposing any methods which could be used to replace the root node with an empty root leaf.
BTree
was loaded. This is the timestamp that was
placed on the 2nd element of the view. Any read-only build or
merge task will read from this timestamp.IllegalStateException
- if the BTree
is read only.IllegalStateException
- if the BTree
is dirty.IllegalStateException
- if the BTree
has never been committed.public static BTree create(IRawStore store, IndexMetadata metadata)
BTree
or derived class. This method works by writing
the IndexMetadata
record on the store and then loading the
BTree
from the IndexMetadata
record.store
- The store.metadata
- The metadata record.BTree
.IllegalStateException
- If you attempt to create two btree objects from the same
metadata record since the metadata address will have
already been noted on the IndexMetadata
object.
You can use IndexMetadata.clone()
to obtain a new
copy of the metadata object with the metadata address set
to 0L
.IllegalStateException
- if the IndexTypeEnum
in the supplied
IndexMetadata
object is not
IndexTypeEnum.BTree
.load(IRawStore, long, boolean)
public static BTree createTransient(IndexMetadata metadata)
BTree
or derived class that is fully transient (NO
backing IRawStore
).
Fully transient BTree
s provide the functionality of a B+Tree
without a backing persistence store. Internally, reachable nodes and
leaves of the transient BTree
use hard references to ensure that
remain strongly reachable. Deleted nodes and leaves simply clear their
references and will be swept by the garbage collector shortly thereafter.
Operations which attempt to write on the backing store will fail.
While nodes and leaves are never persisted, the keys and values of the
transient BTree
are unsigned byte[]s. This means that application
keys and values are always converted into unsigned byte[]s before being
stored in the BTree
. Hence if an object that is inserted into
the BTree
and then looked up using the BTree
API, you
WILL NOT get back the same object reference.
Note: CLOSING A TRANSIENT INDEX WILL DISCARD ALL DATA!
metadata
- The metadata record.BTree
.public static BTree load(IRawStore store, long addrCheckpoint, boolean readOnly)
BTree
or derived class from the store. The
BTree
or derived class MUST declare a constructor with the
following signature:
className(IRawStore store, Checkpoint checkpoint, IndexMetadata metadata, boolean readOnly)
store
- The store.addrCheckpoint
- The address of a Checkpoint
record for the index.readOnly
- When true
the BTree
will be marked as
read-only. Marking has some advantages relating to the locking
scheme used by Node.getChild(int)
since the root node
is known to be read-only at the time that it is allocated as
per-child locking is therefore in place for all nodes in the
read-only BTree
. It also results in much higher
concurrency for AbstractBTree.touch(AbstractNode)
.BTree
or derived class loaded from that
Checkpoint
record.IllegalArgumentException
- if store is null
.public final int getSourceCount()
getSourceCount
in interface ILocalBTreeView
public final AbstractBTree[] getSources()
BTree
.getSources
in interface ILocalBTreeView
public final BTree getMutableBTree()
ILocalBTreeView
BTree
that is absorbing writes for the view.getMutableBTree
in interface ILocalBTreeView
public BTree.LeafCursor newLeafCursor(SeekEnum where)
AbstractBTree
newLeafCursor
in class AbstractBTree
public BTree.LeafCursor newLeafCursor(byte[] key)
AbstractBTree
newLeafCursor
in class AbstractBTree
key
- The key (required).Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.