public class Node extends AbstractNode<Node> implements INodeData
A non-leaf node.
Node we must also
track the min/max timestamp for each direct child of that Node. While
this inflates the size of the INodeData data record considerably, we
are required to track those per-child data in order to avoid a scan of the
children when we need to recompute the min/max timestamp for the Node
. The IO latency costs of that scan are simply not acceptable, especially for
large branching factors. The min/max timestamp on the Node is ONLY
used for filtering iterators based on a desired tuple revision range. This is
why the choice to support tuple revision filters is its own configuration
option.
FIXME An alternative to per-child min/max tuple revision timestamps would be
the concurrent materialization of the direct children. These data are only
mutable for B+Tree instances with relatively small branching factors. They
are immutable for the IndexSegment. However, the per-Node
min/max timestamp also make the tuple revision filtering more efficient since
we can prune the search before we materialize the child.btree, DEBUG, log, parent, referenceCount, selfNULL| Modifier | Constructor and Description |
|---|---|
protected |
Node(AbstractBTree btree,
long addr,
INodeData data)
De-serialization constructor.
|
protected |
Node(BTree btree)
Used to create a new node when a node is split.
|
protected |
Node(BTree btree,
AbstractNode oldRoot,
long nentries)
|
protected |
Node(Node src,
long triggeredByChildId)
Copy constructor.
|
| Modifier and Type | Method and Description |
|---|---|
Iterator<AbstractNode> |
childIterator(boolean dirtyNodesOnly)
Iterator visits the direct child nodes in the external key ordering.
|
Iterator<AbstractNode> |
childIterator(byte[] fromKey,
byte[] toKey)
Iterator visits the direct child nodes in the external key ordering.
|
AbstractFixedByteArrayBuffer |
data()
The coded (aka compressed) data.
|
void |
delete()
Deletes the persistence capable object.
|
boolean |
dump(org.apache.log4j.Level level,
PrintStream out,
int height,
boolean recursive)
Dump the data onto the
PrintStream. |
protected int |
findChild(byte[] searchKey)
Return the index of the child to be searched.
|
AbstractNode |
getChild(int index)
Return the child node or leaf at the specified index in this node.
|
long |
getChildAddr(int index)
Return the persistent addresses of the specified child node.
|
int |
getChildCount()
The #of children of this node.
|
long |
getChildEntryCount(int index)
Return the #of tuples spanned by the indicated child of this node.
|
Reference<AbstractNode<?>> |
getChildRef(int index)
Return the
Reference for the child. |
INodeData |
getDelegate()
Return the delegate
IAbstractNodeData object. |
protected int |
getIndexOf(AbstractNode child)
Return the index of the child among the direct children of this node.
|
int |
getKeyCount()
Return the #of keys in the node or leaf.
|
IRaba |
getKeys()
The object used to contain and manage the keys.
|
protected AbstractNode |
getLeftSibling(AbstractNode child,
boolean materialize)
Return the left sibling.
|
long |
getMaximumVersionTimestamp()
The most recent tuple revision timestamp associated with any tuple
spanned by this node or leaf.
|
long |
getMinimumVersionTimestamp()
The earliest tuple revision timestamp associated with any tuple spanned
by this node or leaf.
|
protected AbstractNode |
getRightMostChild(boolean nodesOnly)
Return the right-most child of this node.
|
protected AbstractNode |
getRightSibling(AbstractNode child,
boolean materialize)
Return the right sibling of the specified child of a common parent.
|
long |
getSpannedTupleCount()
The #of tuples spanned by this node.
|
boolean |
hasVersionTimestamps()
Return
true iff the leaves maintain tuple revision
timestamps. |
long |
indexOf(byte[] key)
Recursive search locates the appropriate leaf and returns the index
position of the entry.
|
Tuple |
insert(byte[] key,
byte[] value,
boolean delete,
boolean putIfAbsent,
long timestamp,
Tuple tuple)
Insert or update a value.
|
protected void |
insertChild(byte[] key,
AbstractNode child)
Invoked by
AbstractNode.split() to insert a key and reference for
a child created when another child of this node is split. |
boolean |
isCoded()
true iff this is a coded data structure. |
boolean |
isLeaf()
Always returns
false. |
boolean |
isReadOnly()
The result depends on the backing
INodeData implementation. |
byte[] |
keyAt(long entryIndex)
Recursive search for the key at the specified entry index.
|
Tuple |
lookup(byte[] key,
Tuple tuple)
Lookup a key.
|
protected int |
maxKeys()
Return
branchingFactor - 1, which is the maximum #of keys
for a Node. |
protected void |
merge(AbstractNode sibling,
boolean isRightSibling)
Merge the keys and values from the sibling into this node, delete the
sibling from the store and remove the sibling from the parent.
|
protected int |
minKeys()
Return
((branchingFactor + 1) << 1) - 1 |
Iterator<AbstractNode> |
postOrderIterator(byte[] fromKey,
byte[] toKey)
Iterator visits children in the specified half-open key range,
recursively expanding each child with a post-order traversal of its
children and finally visits this node itself.
|
Iterator<AbstractNode> |
postOrderNodeIterator(boolean dirtyNodesOnly,
boolean nodesOnly)
Iterator visits children, recursively expanding each child with a
post-order traversal of its children and finally visits this node itself.
|
protected void |
prefetchChildLeaves(Node node,
byte[] fromKey,
byte[] toKey)
When we visit a node whose children are leaves, schedule memoization of
those leaves whose separator key in the node is LT the toKey
(non-blocking).
|
protected void |
prefetchRightSibling(Node node,
byte[] toKey)
If the caller's toKey is GT the separator keys for the children of
this node then the iterator will need to visit the rightSibling of the
node and this method will schedule the memoization of the node's
rightSibling in order to reduce the IO latency when the iterator visit
that rightSibling (non-blocking).
|
protected boolean |
rangeCheckChildIndex(int index)
Range check a child index.
|
protected boolean |
rangeCheckSpannedTupleIndex(long entryIndex)
Range check an index into the keys of the node.
|
protected void |
redistributeKeys(AbstractNode sibling,
boolean isRightSibling)
Redistributes a key from the specified sibling into this node in order to
bring this node up to the minimum #of keys.
|
Tuple |
remove(byte[] key,
Tuple tuple)
Recursive search locates the appropriate leaf and removes the entry for
the key.
|
protected void |
removeChild(AbstractNode child)
Invoked when a non-root node or leaf has no more keys to detach the child
from its parent.
|
protected IAbstractNode |
split()
Split an over-capacity node (a node with
maxKeys+1 keys),
creating a new rightSibling. |
String |
toString()
Human readable representation of the
Node. |
protected void |
updateEntryCount(AbstractNode<?> child,
long delta)
Apply the delta to the per-child count for this node and then recursively
ascend up the tree applying the delta to all ancestors of this node.
|
protected void |
updateMinMaxVersionTimestamp(AbstractNode<?> child)
Update the
getMinimumVersionTimestamp() and
getMaximumVersionTimestamp(). |
void |
valueAt(long entryIndex,
Tuple tuple)
Recursive search for the value at the specified entry index.
|
assertInvariants, assertKeysMonotonic, copyKey, copyOnWrite, copyOnWrite, dump, dump, entryIterator, getBranchingFactor, getParent, isLeftMostNode, isRightMostNode, join, keyAsString, postOrderNodeIterator, postOrderNodeIterator, rangeIteratorgetIdentity, indent, isDeleted, isDirty, isPersistent, setDirty, setIdentity, toShortStringprotected Node(AbstractBTree btree, long addr, INodeData data)
Note: The de-serialization constructor (and ONLY the de-serialization
constructor) ALWAYS creates a clean node. Therefore the PO.dirty
flag passed up from this constructor has the value false.
btree - The tree to which the node belongs.addr - The persistent identity of the node.data - The data record.protected Node(BTree btree)
protected Node(BTree btree, AbstractNode oldRoot, long nentries)
Leaf or a root
Node. The resulting node has a single child reference and NO
keys. The #of entries allocated to the child is the #of remaining in that
child after the split.btree - A mutable btree.oldRoot - The node that was previously the root of the tree (either a
node or a leaf).nentries - The #of entries spanned by the oldRoot before the
split.protected Node(Node src, long triggeredByChildId)
src - The source node (must be immutable).triggeredByChildId - The persistent identity of the child that triggered the copy
constructor. This should be the immutable child NOT the one
that was already cloned. This information is used to avoid
stealing the original child since we already made a copy of
it. It is IIdentityAccess.NULL when this information is not
available, e.g., when the copyOnWrite action is triggered by a
join() and we are cloning the sibling before we redistribute a
key to the node/leaf on which the join was invoked.protected final int minKeys()
((branchingFactor + 1) << 1) - 1minKeys in class AbstractNode<Node>protected final int maxKeys()
branchingFactor - 1, which is the maximum #of keys
for a Node.maxKeys in class AbstractNode<Node>protected final boolean rangeCheckChildIndex(int index)
index - The index of a child in [0:nkeys+1].trueIndexOutOfBoundsException - if the index is not in the legal range.public final Reference<AbstractNode<?>> getChildRef(int index)
Reference for the child. This is part of the internal
API.index - The index of the child.Reference for that child. This will be
null if the child is not buffered. Note that a non-
null return MAY still return null from
Reference.get().public final INodeData getDelegate()
AbstractNodeIAbstractNodeData object.public final boolean isLeaf()
false.isLeaf in interface IAbstractNodeDataisLeaf in class AbstractNode<Node>public final boolean isReadOnly()
INodeData implementation. The
Node will be mutable when it is first created and is made
immutable when it is persisted. If there is a mutation operation, the
backing INodeData is automatically converted into a mutable
instance.isReadOnly in interface IAbstractNodeDatapublic final boolean isCoded()
IAbstractNodeDatatrue iff this is a coded data structure.isCoded in interface IAbstractNodeDatapublic final AbstractFixedByteArrayBuffer data()
IAbstractNodeDatadata in interface IAbstractNodeDatadata in interface IDataRecordAccesspublic final int getKeyCount()
IKeysDatankeys+1
children. A leaf has nkeys keys and values. The maximum #of
keys for a node is one less than the branching factor of the B+Tree. The
maximum #of keys for a leaf is the branching factor of the B+Tree. For a
hash bucket, this is the #of entries in the bucket.getKeyCount in interface IKeysDatapublic final IRaba getKeys()
IKeysDatapublic final int getChildCount()
IChildDataIAbstractNodeData#getKeyCount()+1getChildCount in interface IChildDatapublic final long getSpannedTupleCount()
ISpannedTupleCountDataIKeysData.getKeyCount() or
ILeafData.getValueCount().getSpannedTupleCount in interface ISpannedTupleCountDatapublic final long getChildAddr(int index)
IChildDatagetChildAddr in interface IChildDataindex - The index of the child in [0:nkeys].public final long getChildEntryCount(int index)
ISpannedTupleCountDataISpannedTupleCountData.getSpannedTupleCount()
. These data are used to support fast computation of the index at which a
key occurs and the #of entries in a given key range.getChildEntryCount in interface ISpannedTupleCountDataindex - The index of the child in [0:nkeys].public final long getMaximumVersionTimestamp()
IAbstractNodeDataLong.MIN_VALUE since the initial value of the
maximum version timestamp is always the smallest possible long integer.getMaximumVersionTimestamp in interface IAbstractNodeDatapublic long getMinimumVersionTimestamp()
IAbstractNodeDataLong.MAX_VALUE since the initial value of the minimum
version timestamp is always the largest possible long integer.getMinimumVersionTimestamp in interface IAbstractNodeDatapublic final boolean hasVersionTimestamps()
IAbstractNodeDatatrue iff the leaves maintain tuple revision
timestamps. When true, the minimum and maximum tuple
revision timestamp for a node or leaf are available from
IAbstractNodeData.getMinimumVersionTimestamp() and
IAbstractNodeData.getMaximumVersionTimestamp().hasVersionTimestamps in interface IAbstractNodeDataprotected final void updateEntryCount(AbstractNode<?> child, long delta)
This also updates getMinimumVersionTimestamp() and
getMaximumVersionTimestamp() iff version timestamps are enabled
and the child has a minimum (maximum) version timestamp GE the minimum
(maximum) version timestamp of this node.
child - The direct child.delta - The change in the #of spanned children.protected final void updateMinMaxVersionTimestamp(AbstractNode<?> child)
getMinimumVersionTimestamp() and
getMaximumVersionTimestamp(). This is invoked when the min/max
in the child has changed without a corresponding change to the #of
spanned tuples. E.g., when an insert() causes a tuple to be updated
rather than added.child - The direct child.public void delete()
IIdentityAccessdelete in interface IIdentityAccessdelete in class AbstractNode<Node>public Tuple insert(byte[] key, byte[] value, boolean delete, boolean putIfAbsent, long timestamp, Tuple tuple)
AbstractNodeinsert in class AbstractNode<Node>key - The key (non-null).value - The value (may be null).delete - true iff the entry is to marked as deleted
(delete markers must be supported for if this is true).putIfAbsent - When true, a pre-existing entry for the key will
NOT be replaced (unless it is a deleted tuple, which is the
same as if there was no entry under the key). This should ONLY
be true when the top-level method is putIfAbsent.
Historical code paths should specify false for an unconditional
mutation. See BLZG-1539.timestamp - The timestamp associated with the version (the value is
ignored unless version metadata is being maintained).tuple - A tuple that may be used to obtain the data and metadata for
the pre-existing index entry overwritten by the insert
operation (optional).null otherwise.public Tuple lookup(byte[] key, Tuple tuple)
AbstractNodelookup in class AbstractNode<Node>key - The search key.tuple - A tuple that may be used to obtain the data and metadata for
the pre-existing index entry (required).null otherwise.public Tuple remove(byte[] key, Tuple tuple)
AbstractNodeNote: It is an error to call this method if delete markers are in use.
remove in class AbstractNode<Node>key - The search key.tuple - A tuple that may be used to obtain the data and metadata for
the pre-existing index entry that was either removed by the
remove operation (optional).null otherwise.public long indexOf(byte[] key)
AbstractNodeindexOf in class AbstractNode<Node>key - The search key.(-(insertion point) - 1). The insertion point is
defined as the point at which the key would be found it it were
inserted into the btree without intervening mutations. Note that
this guarantees that the return value will be >= 0 if and only if
the key is found.protected boolean rangeCheckSpannedTupleIndex(long entryIndex)
entryIndex - The key index.trueIndexOutOfBoundsException - if the index is LT ZERO (0) -or- GTE the
getSpannedTupleCount()public final byte[] keyAt(long entryIndex)
keyAt in class AbstractNode<Node>entryIndex - The index of the entry (relative to the first entry spanned by
this node).public final void valueAt(long entryIndex,
Tuple tuple)
valueAt in class AbstractNode<Node>entryIndex - The index of the entry (relative to the first entry spanned by
this node).tuple - A tuple that may be used to obtain the data and metadata for
the pre-existing index entry (required).protected final int findChild(byte[] searchKey)
The interpretation of the key index for a node is as follows. When searching nodes of the tree, we search for the index in keys[] of the first key value greater than or equal (GTE) to the probe key. If the match is equal, then we choose the child at index + 1. Otherwise we choose the child having the same index as the GTE key match. For example,
keys[] : [ 5 9 12 ] child[] : [ a b c d ]A probe with keys up to
4 matches at index zero (0) and we
choose the 1st child, a, which is at index zero (0).
A probe whose key is 5 matches at index zero (0) exactly and
we choose the child at index + 1, b, which is at index one
(1).
A probe with keys in [6:8] matches at index one (1) and we choose the 2nd
child, b, which is at index one (1). A probe with 9 also
matches at index one (1), but we choose index+1 equals two
(2) since this is an exact key match.
A probe with keys in [10:11] matches at index two (2) and we choose the
3rd child, c, which is at index two (2). A probe with 12
also matches at index two (2), but we choose index+1 equals
three (3) since this is an exact key match.
A probe with keys greater than 12 exceeds all keys in the node and always matches the last child in that node. In this case, d, which is at index three (3).
Note that we never stop a search on a node, even when there is an exact match on a key. All values are stored in the leaves and we always descend until we reach the leaf in which a value for the key would be stored. A test on the keys of that leaf is then conclusive - either a value is stored in the leaf for that key or it is not stored in the tree.
searchkey - The probe key.protected IAbstractNode split()
Split an over-capacity node (a node with maxKeys+1 keys),
creating a new rightSibling. The splitIndex is (maxKeys+1)/2
. The key at the splitIndex is the separatorKey. Unlike when we split a
Leaf, the separatorKey is lifted into the parent and does not
appear in either this node or the rightSibling after the split. All keys
and child references from splitIndex+1 (inclusive) are moved
to the new rightSibling. The child reference at splitIndex
remains in this node.
If this node is the root of the tree (no parent), then a new root
Node is created without any keys and is made the parent of this
node.
In any case, we then insert( separatorKey, rightSibling ) into the parent node, which may cause the parent node itself to split.
Note: splitting a node causes entry counts for the relocated childen to be relocated to the new rightSibling but it does NOT change the #of entries on the parent.
split in class AbstractNode<Node>protected void redistributeKeys(AbstractNode sibling, boolean isRightSibling)
While redistribution changes the #of entries spanned by the node and the
sibling and therefore must update #childEntryCounts on the shared
parent, it does not change the #of entries spanned by the parent.
redistributeKeys in class AbstractNode<Node>sibling - A direct sibling of this node (either the left or right
sibling). The sibling MUST be mutable.isRightSibling - True iff the sibling is the rightSibling of this node.protected void merge(AbstractNode sibling, boolean isRightSibling)
AbstractNode.join() if the parent node is now
deficient. While this changes the #of entries spanned by the current node
it does NOT effect the #of entries spanned by the parent.merge in class AbstractNode<Node>sibling - A direct sibling of this node (does NOT need to be mutable).
The sibling MUST have exactly the minimum #of keys.isRightSibling - True iff the sibling is the rightSibling of this node.protected void insertChild(byte[] key,
AbstractNode child)
AbstractNode.split() to insert a key and reference for
a child created when another child of this node is split. This method has
no effect on the #of entries spanned by the parent. *
Note: This operation is invoked only when a node or leaf is split. As
such, it can not cause the min/max tuple revision timestamp on this
Node to change since no tuples have been added or removed.
However, this method does need to record the min/max for the new
rightSibling.
key - The key on which the old node was split.child - The new node.
FIXME set min/max for the new child.protected AbstractNode getLeftSibling(AbstractNode child, boolean materialize)
AbstractNode.join() to explore their left sibling.child - The child (must be dirty).materialize - When true, the left sibling will be materialized if it exists
but is not resident.null if it does not exist -or-
if it is not materialized and materialized == false.
If the sibling is returned, then is NOT guaranteed to be mutable
and the caller MUST invoke copy-on-write before attempting to
modify the returned sibling.protected AbstractNode getRightSibling(AbstractNode child, boolean materialize)
AbstractNode.join() to explore their right
sibling.child - The child (must be dirty).materialize - When true, the left sibling will be materialized if it exists
but is not resident.null if it does not exist -or-
if it is not materialized and materialized == false.
If the sibling is returned, then is NOT guaranteed to be mutable
and the caller MUST invoke copy-on-write before attempting to
modify the returned sibling.protected int getIndexOf(AbstractNode child)
child - The child.childRefs where that child is found. This
may also be used as an index into #childAddr and
#childEntryCounts.IllegalArgumentException - iff child is not a child of this node.protected void removeChild(AbstractNode child)
child - The child (does NOT need to be mutable).public final AbstractNode getChild(int index)
Note: This implementation DOES NOT cause concurrent threads to block
unless they are performing IO for the same child. A Memoizer
pattern is used to assign each concurrent thread a FutureTask on
which it waits for the result. Once the result is available, there is a
small synchronized block during which the concurrent
requests for a child will content to update the appropriate element in
childRefs.
I believe the contention to update childRefs is unavoidable. If
this object was made into an AtomicReferenceArray then we would
have difficulty when inserting and removing tuples since the backing
array is not visible. An array of AtomicReference objects would
not help since it would not ensure "publication" when the element was
changed from a null to an AtomicReference, only when
AtomicReference.compareAndSet(Object, Object) was used. Thus it
could only help if we pre-populated the array with
AtomicReference objects, which seems wasteful.
As always, the mutable B+Tree is single threaded so there are not added
synchronization costs. Concurrent readers can only arise for read-only
BTrees and for IndexSegments.
index - The index of the child to be read from the store (in
[0:nkeys]).IndexOutOfBoundsException - if index is negative.IndexOutOfBoundsException - if index is GT the #of keys in the node (there is one more
child than keys in a node).protected AbstractNode getRightMostChild(boolean nodesOnly)
nodesOnly - when true the search will halt at the right-most
non-leaf. Otherwise it will return the right-most leaf.public Iterator<AbstractNode> postOrderNodeIterator(boolean dirtyNodesOnly, boolean nodesOnly)
postOrderNodeIterator in class AbstractNode<Node>dirtyNodesOnly - When true, only dirty nodes and leaves will be visitednodesOnly - When true, the leaves will not be visited.AbstractNodes.public Iterator<AbstractNode> postOrderIterator(byte[] fromKey, byte[] toKey)
postOrderIterator in class AbstractNode<Node>fromKey - The first key that will be visited (inclusive). When
null there is no lower bound.toKey - The first key that will NOT be visited (exclusive). When
null there is no upper bound.AbstractNodes.protected void prefetchChildLeaves(Node node, byte[] fromKey, byte[] toKey)
node - A node whose children are leaves.toKey - The exclusive upper bound of some iterator.protected void prefetchRightSibling(Node node, byte[] toKey)
node - A node.toKey - The exclusive upper bound of some iterator.public Iterator<AbstractNode> childIterator(boolean dirtyNodesOnly)
dirtyNodesOnly - When true, only the direct dirty child nodes will be visited.public Iterator<AbstractNode> childIterator(byte[] fromKey, byte[] toKey)
public boolean dump(org.apache.log4j.Level level,
PrintStream out,
int height,
boolean recursive)
AbstractNodePrintStream.dump in class AbstractNode<Node>level - The logging level.out - Where to write the dump.height - The height of this node in the tree or -1 iff you need to
invoke this method on a node or leaf whose height in the tree
is not known.recursive - When true, the node will be dumped recursively using a
pre-order traversal.Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.