public class Node extends AbstractNode<Node> implements INodeData

A non-leaf node.

When tuple revision timestamps are maintained, in addition to the min/max timestamp for the Node we must also track the min/max timestamp for each direct child of that Node. While this inflates the size of the INodeData data record considerably, we are required to track those per-child data in order to avoid a scan of the children when we need to recompute the min/max timestamp for the Node. The IO latency costs of that scan are simply not acceptable, especially for large branching factors. The min/max timestamp on the Node is ONLY used for filtering iterators based on a desired tuple revision range. This is why the choice to support tuple revision filters is its own configuration option.
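For example, a revision-range iterator can use those per-child timestamps to skip any child whose [min, max] interval cannot overlap the requested range, and so never pay the IO to materialize that child. The following is a minimal sketch of that pruning idea only, not the actual Blazegraph code; the NodeView type and its per-child timestamp accessors are hypothetical stand-ins for data held in the INodeData record.

```java
import java.util.function.Consumer;

// Minimal sketch (not the actual implementation) of how per-child min/max
// tuple revision timestamps let a revision-range iterator prune children
// without materializing them.
final class RevisionFilterSketch {

    interface NodeView {
        int getChildCount();
        long childMinimumVersionTimestamp(int index); // hypothetical accessor
        long childMaximumVersionTimestamp(int index); // hypothetical accessor
        Object materializeChild(int index);           // stands in for getChild(index)
    }

    /** Visit only children whose revision interval overlaps [fromRevision, toRevision]. */
    static void visitChildren(final NodeView node, final long fromRevision,
            final long toRevision, final Consumer<Object> visitor) {
        for (int i = 0; i < node.getChildCount(); i++) {
            final long min = node.childMinimumVersionTimestamp(i);
            final long max = node.childMaximumVersionTimestamp(i);
            if (max < fromRevision || min > toRevision) {
                continue; // no spanned tuple can satisfy the revision filter: skip the IO
            }
            visitor.accept(node.materializeChild(i)); // child is materialized only if needed
        }
    }
}
```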
FIXME An alternative to per-child min/max tuple revision timestamps would be the concurrent materialization of the direct children. These data are only mutable for B+Tree instances with relatively small branching factors. They are immutable for the IndexSegment. However, the per-Node min/max timestamps also make the tuple revision filtering more efficient since we can prune the search before we materialize the child.

Fields inherited from class AbstractNode: btree, DEBUG, log, parent, referenceCount, self

Fields inherited from interface IIdentityAccess: NULL
Modifier | Constructor and Description |
---|---|
protected | Node(AbstractBTree btree, long addr, INodeData data) De-serialization constructor. |
protected | Node(BTree btree) Used to create a new node when a node is split. |
protected | Node(BTree btree, AbstractNode oldRoot, long nentries) |
protected | Node(Node src, long triggeredByChildId) Copy constructor. |
Modifier and Type | Method and Description |
---|---|
Iterator<AbstractNode> | childIterator(boolean dirtyNodesOnly) Iterator visits the direct child nodes in the external key ordering. |
Iterator<AbstractNode> | childIterator(byte[] fromKey, byte[] toKey) Iterator visits the direct child nodes in the external key ordering. |
AbstractFixedByteArrayBuffer | data() The coded (aka compressed) data. |
void | delete() Deletes the persistence capable object. |
boolean | dump(org.apache.log4j.Level level, PrintStream out, int height, boolean recursive) Dump the data onto the PrintStream. |
protected int | findChild(byte[] searchKey) Return the index of the child to be searched. |
AbstractNode | getChild(int index) Return the child node or leaf at the specified index in this node. |
long | getChildAddr(int index) Return the persistent address of the specified child node. |
int | getChildCount() The #of children of this node. |
long | getChildEntryCount(int index) Return the #of tuples spanned by the indicated child of this node. |
Reference<AbstractNode<?>> | getChildRef(int index) Return the Reference for the child. |
INodeData | getDelegate() Return the delegate IAbstractNodeData object. |
protected int | getIndexOf(AbstractNode child) Return the index of the child among the direct children of this node. |
int | getKeyCount() Return the #of keys in the node or leaf. |
IRaba | getKeys() The object used to contain and manage the keys. |
protected AbstractNode | getLeftSibling(AbstractNode child, boolean materialize) Return the left sibling. |
long | getMaximumVersionTimestamp() The most recent tuple revision timestamp associated with any tuple spanned by this node or leaf. |
long | getMinimumVersionTimestamp() The earliest tuple revision timestamp associated with any tuple spanned by this node or leaf. |
protected AbstractNode | getRightMostChild(boolean nodesOnly) Return the right-most child of this node. |
protected AbstractNode | getRightSibling(AbstractNode child, boolean materialize) Return the right sibling of the specified child of a common parent. |
long | getSpannedTupleCount() The #of tuples spanned by this node. |
boolean | hasVersionTimestamps() Return true iff the leaves maintain tuple revision timestamps. |
long | indexOf(byte[] key) Recursive search locates the appropriate leaf and returns the index position of the entry. |
Tuple | insert(byte[] key, byte[] value, boolean delete, boolean putIfAbsent, long timestamp, Tuple tuple) Insert or update a value. |
protected void | insertChild(byte[] key, AbstractNode child) Invoked by AbstractNode.split() to insert a key and reference for a child created when another child of this node is split. |
boolean | isCoded() true iff this is a coded data structure. |
boolean | isLeaf() Always returns false. |
boolean | isReadOnly() The result depends on the backing INodeData implementation. |
byte[] | keyAt(long entryIndex) Recursive search for the key at the specified entry index. |
Tuple | lookup(byte[] key, Tuple tuple) Lookup a key. |
protected int | maxKeys() Return branchingFactor - 1, which is the maximum #of keys for a Node. |
protected void | merge(AbstractNode sibling, boolean isRightSibling) Merge the keys and values from the sibling into this node, delete the sibling from the store and remove the sibling from the parent. |
protected int | minKeys() Return ((branchingFactor + 1) >> 1) - 1, which is the minimum #of keys for a Node. |
Iterator<AbstractNode> | postOrderIterator(byte[] fromKey, byte[] toKey) Iterator visits children in the specified half-open key range, recursively expanding each child with a post-order traversal of its children and finally visits this node itself. |
Iterator<AbstractNode> | postOrderNodeIterator(boolean dirtyNodesOnly, boolean nodesOnly) Iterator visits children, recursively expanding each child with a post-order traversal of its children and finally visits this node itself. |
protected void | prefetchChildLeaves(Node node, byte[] fromKey, byte[] toKey) When we visit a node whose children are leaves, schedule memoization of those leaves whose separator key in the node is LT the toKey (non-blocking). |
protected void | prefetchRightSibling(Node node, byte[] toKey) If the caller's toKey is GT the separator keys for the children of this node then the iterator will need to visit the rightSibling of the node and this method will schedule the memoization of the node's rightSibling in order to reduce the IO latency when the iterator visits that rightSibling (non-blocking). |
protected boolean | rangeCheckChildIndex(int index) Range check a child index. |
protected boolean | rangeCheckSpannedTupleIndex(long entryIndex) Range check an index into the keys of the node. |
protected void | redistributeKeys(AbstractNode sibling, boolean isRightSibling) Redistributes a key from the specified sibling into this node in order to bring this node up to the minimum #of keys. |
Tuple | remove(byte[] key, Tuple tuple) Recursive search locates the appropriate leaf and removes the entry for the key. |
protected void | removeChild(AbstractNode child) Invoked when a non-root node or leaf has no more keys to detach the child from its parent. |
protected IAbstractNode | split() Split an over-capacity node (a node with maxKeys+1 keys), creating a new rightSibling. |
String | toString() Human readable representation of the Node. |
protected void | updateEntryCount(AbstractNode<?> child, long delta) Apply the delta to the per-child count for this node and then recursively ascend up the tree applying the delta to all ancestors of this node. |
protected void | updateMinMaxVersionTimestamp(AbstractNode<?> child) Update the getMinimumVersionTimestamp() and getMaximumVersionTimestamp(). |
void | valueAt(long entryIndex, Tuple tuple) Recursive search for the value at the specified entry index. |
Methods inherited from class AbstractNode: assertInvariants, assertKeysMonotonic, copyKey, copyOnWrite, copyOnWrite, dump, dump, entryIterator, getBranchingFactor, getParent, isLeftMostNode, isRightMostNode, join, keyAsString, postOrderNodeIterator, postOrderNodeIterator, rangeIterator

Methods inherited from class PO: getIdentity, indent, isDeleted, isDirty, isPersistent, setDirty, setIdentity, toShortString
protected Node(AbstractBTree btree, long addr, INodeData data)

De-serialization constructor.

Note: The de-serialization constructor (and ONLY the de-serialization constructor) ALWAYS creates a clean node. Therefore the PO.dirty flag passed up from this constructor has the value false.

Parameters:
    btree - The tree to which the node belongs.
    addr - The persistent identity of the node.
    data - The data record.

protected Node(BTree btree)

Used to create a new node when a node is split.
protected Node(BTree btree, AbstractNode oldRoot, long nentries)

Constructor used when splitting either the root Leaf or a root Node. The resulting node has a single child reference and NO keys. The #of entries allocated to the child is the #of remaining in that child after the split.

Parameters:
    btree - A mutable btree.
    oldRoot - The node that was previously the root of the tree (either a node or a leaf).
    nentries - The #of entries spanned by the oldRoot before the split.

protected Node(Node src, long triggeredByChildId)

Copy constructor.
Parameters:
    src - The source node (must be immutable).
    triggeredByChildId - The persistent identity of the child that triggered the copy constructor. This should be the immutable child, NOT the one that was already cloned. This information is used to avoid stealing the original child since we already made a copy of it. It is IIdentityAccess.NULL when this information is not available, e.g., when the copyOnWrite action is triggered by a join() and we are cloning the sibling before we redistribute a key to the node/leaf on which the join was invoked.

protected final int minKeys()

Return ((branchingFactor + 1) >> 1) - 1, which is the minimum #of keys for a Node.

    minKeys in class AbstractNode<Node>
protected final int maxKeys()

Return branchingFactor - 1, which is the maximum #of keys for a Node.

    maxKeys in class AbstractNode<Node>
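A quick worked example of those two formulas (illustrative, not from the original documentation): for a branching factor of 32, minKeys() is 15 and maxKeys() is 31, so a Node holds between 15 and 31 keys and always has one more child than keys.

```java
// Illustrative check of the minKeys()/maxKeys() formulas documented above.
public class NodeKeyBounds {

    static int minKeys(final int branchingFactor) {
        return ((branchingFactor + 1) >> 1) - 1;
    }

    static int maxKeys(final int branchingFactor) {
        return branchingFactor - 1;
    }

    public static void main(String[] args) {
        System.out.println(minKeys(32)); // 15
        System.out.println(maxKeys(32)); // 31
        System.out.println(minKeys(3));  // 1
        System.out.println(maxKeys(3));  // 2
    }
}
```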
protected final boolean rangeCheckChildIndex(int index)

Range check a child index.

Parameters:
    index - The index of a child in [0:nkeys+1].
Returns:
    true
Throws:
    IndexOutOfBoundsException - if the index is not in the legal range.

public final Reference<AbstractNode<?>> getChildRef(int index)

Return the Reference for the child. This is part of the internal API.

Parameters:
    index - The index of the child.
Returns:
    The Reference for that child. This will be null if the child is not buffered. Note that a non-null return MAY still return null from Reference.get().

public final INodeData getDelegate()

Description copied from class: AbstractNode

Return the delegate IAbstractNodeData object.

public final boolean isLeaf()

Always returns false.

    isLeaf in interface IAbstractNodeData
    isLeaf in class AbstractNode<Node>

public final boolean isReadOnly()

The result depends on the backing INodeData implementation. The Node will be mutable when it is first created and is made immutable when it is persisted. If there is a mutation operation, the backing INodeData is automatically converted into a mutable instance.

    isReadOnly in interface IAbstractNodeData

public final boolean isCoded()

Description copied from interface: IAbstractNodeData

true iff this is a coded data structure.

    isCoded in interface IAbstractNodeData
public final AbstractFixedByteArrayBuffer data()

Description copied from interface: IAbstractNodeData

The coded (aka compressed) data.

    data in interface IAbstractNodeData
    data in interface IDataRecordAccess

public final int getKeyCount()

Description copied from interface: IKeysData

Return the #of keys in the node or leaf. A node has nkeys+1 children. A leaf has nkeys keys and values. The maximum #of keys for a node is one less than the branching factor of the B+Tree. The maximum #of keys for a leaf is the branching factor of the B+Tree. For a hash bucket, this is the #of entries in the bucket.

    getKeyCount in interface IKeysData

public final IRaba getKeys()

The object used to contain and manage the keys.

    getKeys in interface IKeysData

public final int getChildCount()

Description copied from interface: IChildData

The #of children of this node, which is always IAbstractNodeData#getKeyCount() + 1.

    getChildCount in interface IChildData

public final long getSpannedTupleCount()

Description copied from interface: ISpannedTupleCountData

The #of tuples spanned by this node. For a leaf, this is the same value reported by IKeysData.getKeyCount() or ILeafData.getValueCount().

    getSpannedTupleCount in interface ISpannedTupleCountData
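The counting accessors above are related by two simple invariants: a Node has one more child than keys, and the tuples it spans are exactly those spanned by its children. A small illustrative check (not part of the API, written against the accessors documented on this page) could look like this:

```java
// Illustrative consistency checks based on the accessors documented above:
// a Node has one more child than keys, and the tuples it spans are exactly
// the tuples spanned by its children.
static void checkCountInvariants(final Node node) {
    assert node.getChildCount() == node.getKeyCount() + 1;

    long sum = 0L;
    for (int i = 0; i < node.getChildCount(); i++) {
        sum += node.getChildEntryCount(i);
    }
    assert sum == node.getSpannedTupleCount();
}
```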
public final long getChildAddr(int index)

Description copied from interface: IChildData

Return the persistent address of the specified child node.

    getChildAddr in interface IChildData

Parameters:
    index - The index of the child in [0:nkeys].

public final long getChildEntryCount(int index)

Description copied from interface: ISpannedTupleCountData

Return the #of tuples spanned by the indicated child of this node. The sum of these values across the children should equal ISpannedTupleCountData.getSpannedTupleCount(). These data are used to support fast computation of the index at which a key occurs and the #of entries in a given key range.

    getChildEntryCount in interface ISpannedTupleCountData

Parameters:
    index - The index of the child in [0:nkeys].

public final long getMaximumVersionTimestamp()

Description copied from interface: IAbstractNodeData

The most recent tuple revision timestamp associated with any tuple spanned by this node or leaf. When no tuples are spanned, this reports Long.MIN_VALUE since the initial value of the maximum version timestamp is always the smallest possible long integer.

    getMaximumVersionTimestamp in interface IAbstractNodeData

public long getMinimumVersionTimestamp()

Description copied from interface: IAbstractNodeData

The earliest tuple revision timestamp associated with any tuple spanned by this node or leaf. When no tuples are spanned, this reports Long.MAX_VALUE since the initial value of the minimum version timestamp is always the largest possible long integer.

    getMinimumVersionTimestamp in interface IAbstractNodeData
public final boolean hasVersionTimestamps()

Description copied from interface: IAbstractNodeData

Return true iff the leaves maintain tuple revision timestamps. When true, the minimum and maximum tuple revision timestamp for a node or leaf are available from IAbstractNodeData.getMinimumVersionTimestamp() and IAbstractNodeData.getMaximumVersionTimestamp().

    hasVersionTimestamps in interface IAbstractNodeData
protected final void updateEntryCount(AbstractNode<?> child, long delta)

Apply the delta to the per-child count for this node and then recursively ascend up the tree applying the delta to all ancestors of this node. This also updates getMinimumVersionTimestamp() and getMaximumVersionTimestamp() iff version timestamps are enabled and the child has a minimum (maximum) version timestamp GE the minimum (maximum) version timestamp of this node.

Parameters:
    child - The direct child.
    delta - The change in the #of spanned children.
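The ascent described above can be pictured with the following sketch (illustrative only; NodeSketch and its fields are hypothetical stand-ins for the real Node, its parent reference, and the counters kept in its data record, and version timestamps are omitted):

```java
// Hypothetical stand-in types for the real Node; this sketch only shows the
// "apply the delta locally, then ascend to the root" pattern described above.
final class EntryCountSketch {

    static final class NodeSketch {
        NodeSketch parent;            // null for the root
        NodeSketch[] children;
        long[] childEntryCount;       // per-child spanned tuple counts
        long spannedTupleCount;       // total #of tuples spanned by this node

        int indexOfChild(final NodeSketch child) {
            for (int i = 0; i < children.length; i++) {
                if (children[i] == child) return i;
            }
            throw new IllegalArgumentException("not a child of this node");
        }
    }

    static void updateEntryCountSketch(final NodeSketch node, final int childIndex,
            final long delta) {
        NodeSketch current = node;
        int index = childIndex;
        while (current != null) {
            current.childEntryCount[index] += delta; // per-child count on this node
            current.spannedTupleCount += delta;      // total spanned by this node
            final NodeSketch parent = current.parent;
            if (parent != null) {
                index = parent.indexOfChild(current); // position of current under its parent
            }
            current = parent;
        }
    }
}
```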
protected final void updateMinMaxVersionTimestamp(AbstractNode<?> child)

Update the getMinimumVersionTimestamp() and getMaximumVersionTimestamp(). This is invoked when the min/max in the child has changed without a corresponding change to the #of spanned tuples. E.g., when an insert() causes a tuple to be updated rather than added.

Parameters:
    child - The direct child.

public void delete()
Description copied from interface: IIdentityAccess

Deletes the persistence capable object.

    delete in interface IIdentityAccess
    delete in class AbstractNode<Node>
public Tuple insert(byte[] key, byte[] value, boolean delete, boolean putIfAbsent, long timestamp, Tuple tuple)

Description copied from class: AbstractNode

Insert or update a value.

    insert in class AbstractNode<Node>

Parameters:
    key - The key (non-null).
    value - The value (may be null).
    delete - true iff the entry is to be marked as deleted (delete markers must be supported if this is true).
    putIfAbsent - When true, a pre-existing entry for the key will NOT be replaced (unless it is a deleted tuple, which is the same as if there was no entry under the key). This should ONLY be true when the top-level method is putIfAbsent. Historical code paths should specify false for an unconditional mutation. See BLZG-1539.
    timestamp - The timestamp associated with the version (the value is ignored unless version metadata is being maintained).
    tuple - A tuple that may be used to obtain the data and metadata for the pre-existing index entry overwritten by the insert operation (optional).
Returns:
    The caller's tuple iff there was a pre-existing entry under the key, and null otherwise.
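For orientation, here is a hedged sketch of how those arguments line up for a putIfAbsent-style write; the variable names (rootNode, key, value, tuple) are illustrative, and in practice callers go through the BTree's public API rather than invoking Node.insert() directly.

```java
// Illustrative argument mapping only; rootNode, key, value, and tuple are
// assumed to exist, and real applications use the BTree's public methods.
final Tuple preExisting = rootNode.insert(
        key,    // byte[] search key (non-null)
        value,  // byte[] value (may be null)
        false,  // delete: not writing a delete marker
        true,   // putIfAbsent: keep any pre-existing (non-deleted) entry
        0L,     // timestamp: ignored unless version metadata is maintained
        tuple); // optional Tuple reporting a pre-existing entry
if (preExisting != null) {
    // The key was already bound; because putIfAbsent was true, the old value was kept.
}
```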
public Tuple lookup(byte[] key, Tuple tuple)

Description copied from class: AbstractNode

Lookup a key.

    lookup in class AbstractNode<Node>

Parameters:
    key - The search key.
    tuple - A tuple that may be used to obtain the data and metadata for the pre-existing index entry (required).
Returns:
    The caller's tuple iff there is an entry under the key, and null otherwise.

public Tuple remove(byte[] key, Tuple tuple)
Description copied from class: AbstractNode

Recursive search locates the appropriate leaf and removes the entry for the key.

Note: It is an error to call this method if delete markers are in use.

    remove in class AbstractNode<Node>

Parameters:
    key - The search key.
    tuple - A tuple that may be used to obtain the data and metadata for the pre-existing index entry that was removed by the remove operation (optional).
Returns:
    The caller's tuple iff there was an entry under the key, and null otherwise.

public long indexOf(byte[] key)
Description copied from class: AbstractNode

Recursive search locates the appropriate leaf and returns the index position of the entry.

    indexOf in class AbstractNode<Node>

Parameters:
    key - The search key.
Returns:
    The index position of the entry if the key is found; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be found if it were inserted into the btree without intervening mutations. Note that this guarantees that the return value will be >= 0 if and only if the key is found.
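This encoding mirrors the java.util.Arrays.binarySearch() convention, so a caller can decode a negative result as shown below (illustrative; rootNode and probeKey are assumed to exist).

```java
// Illustrative decoding of the indexOf() return value. A non-negative result
// is the entry index; a negative result encodes the insertion point as
// (-(insertionPoint) - 1).
final long pos = rootNode.indexOf(probeKey);
if (pos >= 0) {
    final byte[] key = rootNode.keyAt(pos);  // the key is present at index pos
} else {
    final long insertionPoint = -(pos + 1);  // where the key would be inserted
}
```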
protected boolean rangeCheckSpannedTupleIndex(long entryIndex)

Range check an index into the keys of the node.

Parameters:
    entryIndex - The key index.
Returns:
    true
Throws:
    IndexOutOfBoundsException - if the index is LT ZERO (0) -or- GTE the getSpannedTupleCount().
public final byte[] keyAt(long entryIndex)

Recursive search for the key at the specified entry index.

    keyAt in class AbstractNode<Node>

Parameters:
    entryIndex - The index of the entry (relative to the first entry spanned by this node).

public final void valueAt(long entryIndex, Tuple tuple)

Recursive search for the value at the specified entry index.

    valueAt in class AbstractNode<Node>

Parameters:
    entryIndex - The index of the entry (relative to the first entry spanned by this node).
    tuple - A tuple that may be used to obtain the data and metadata for the pre-existing index entry (required).

protected final int findChild(byte[] searchKey)
Return the index of the child to be searched.

The interpretation of the key index for a node is as follows. When searching nodes of the tree, we search for the index in keys[] of the first key value greater than or equal (GTE) to the probe key. If the match is equal, then we choose the child at index + 1. Otherwise we choose the child having the same index as the GTE key match. For example,

keys[]  : [ 5 9 12 ]
child[] : [ a b c d ]

- A probe with keys up to 4 matches at index zero (0) and we choose the 1st child, a, which is at index zero (0).
- A probe whose key is 5 matches at index zero (0) exactly and we choose the child at index + 1, b, which is at index one (1).
- A probe with keys in [6:8] matches at index one (1) and we choose the 2nd child, b, which is at index one (1). A probe with 9 also matches at index one (1), but we choose index + 1 equals two (2) since this is an exact key match.
- A probe with keys in [10:11] matches at index two (2) and we choose the 3rd child, c, which is at index two (2). A probe with 12 also matches at index two (2), but we choose index + 1 equals three (3) since this is an exact key match.
- A probe with keys greater than 12 exceeds all keys in the node and always matches the last child in that node. In this case, d, which is at index three (3).

Note that we never stop a search on a node, even when there is an exact match on a key. All values are stored in the leaves and we always descend until we reach the leaf in which a value for the key would be stored. A test on the keys of that leaf is then conclusive - either a value is stored in the leaf for that key or it is not stored in the tree.

Parameters:
    searchKey - The probe key.
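The rule above can be sketched in a few lines; the following is illustrative only, using int keys rather than the coded unsigned byte[] keys (IRaba) searched by the real implementation.

```java
// Illustrative sketch of the findChild() rule described above, using int keys.
// For keys [5, 9, 12]: probe 4 -> 0, probe 5 -> 1, probe 8 -> 1, probe 9 -> 2,
// probe 11 -> 2, probe 12 -> 3, and probe 13 -> 3, matching the worked example.
static int findChildSketch(final int[] keys, final int probe) {
    for (int i = 0; i < keys.length; i++) {
        if (probe < keys[i]) {
            return i;     // keys[i] is the first GTE key and the match is not exact
        }
        if (probe == keys[i]) {
            return i + 1; // exact match on keys[i]: choose the child at index + 1
        }
    }
    return keys.length;   // probe exceeds all keys: choose the last child
}
```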
protected IAbstractNode split()

Split an over-capacity node (a node with maxKeys+1 keys), creating a new rightSibling. The splitIndex is (maxKeys+1)/2. The key at the splitIndex is the separatorKey. Unlike when we split a Leaf, the separatorKey is lifted into the parent and does not appear in either this node or the rightSibling after the split. All keys and child references from splitIndex+1 (inclusive) are moved to the new rightSibling. The child reference at splitIndex remains in this node.

If this node is the root of the tree (no parent), then a new root Node is created without any keys and is made the parent of this node.

In any case, we then insert( separatorKey, rightSibling ) into the parent node, which may cause the parent node itself to split.

Note: splitting a node causes entry counts for the relocated children to be relocated to the new rightSibling but it does NOT change the #of entries on the parent.

    split in class AbstractNode<Node>
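A worked illustration of the key partitioning described above (illustrative only; the real split() also moves child references and entry counts, updates the parent, and respects copy-on-write): with maxKeys = 4 an over-capacity node holds 5 keys, splitIndex = (4 + 1) / 2 = 2, the key at index 2 is lifted into the parent, keys [0..1] stay in this node, and keys [3..4] move to the new rightSibling.

```java
// Illustrative sketch of the key partitioning described above (not the actual
// split() implementation, which also handles child references, entry counts,
// parent insertion, and copy-on-write).
import java.util.Arrays;

public class NodeSplitSketch {
    public static void main(String[] args) {
        final int maxKeys = 4;
        final int[] keys = { 10, 20, 30, 40, 50 };          // maxKeys + 1 keys: over capacity
        final int splitIndex = (maxKeys + 1) / 2;           // = 2
        final int separatorKey = keys[splitIndex];          // 30: lifted into the parent
        final int[] left = Arrays.copyOfRange(keys, 0, splitIndex);                // [10, 20]
        final int[] right = Arrays.copyOfRange(keys, splitIndex + 1, keys.length); // [40, 50]
        System.out.println("separator=" + separatorKey
                + " left=" + Arrays.toString(left)
                + " right=" + Arrays.toString(right));
    }
}
```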
protected void redistributeKeys(AbstractNode sibling, boolean isRightSibling)

Redistributes a key from the specified sibling into this node in order to bring this node up to the minimum #of keys. While redistribution changes the #of entries spanned by the node and the sibling and therefore must update #childEntryCounts on the shared parent, it does not change the #of entries spanned by the parent.

    redistributeKeys in class AbstractNode<Node>

Parameters:
    sibling - A direct sibling of this node (either the left or right sibling). The sibling MUST be mutable.
    isRightSibling - True iff the sibling is the rightSibling of this node.

protected void merge(AbstractNode sibling, boolean isRightSibling)
Merge the keys and values from the sibling into this node, delete the sibling from the store and remove the sibling from the parent, which in turn may trigger AbstractNode.join() on the parent if the parent node is now deficient. While this changes the #of entries spanned by the current node it does NOT affect the #of entries spanned by the parent.

    merge in class AbstractNode<Node>

Parameters:
    sibling - A direct sibling of this node (does NOT need to be mutable). The sibling MUST have exactly the minimum #of keys.
    isRightSibling - True iff the sibling is the rightSibling of this node.

protected void insertChild(byte[] key, AbstractNode child)
Invoked by AbstractNode.split() to insert a key and reference for a child created when another child of this node is split. This method has no effect on the #of entries spanned by the parent.

Note: This operation is invoked only when a node or leaf is split. As such, it cannot cause the min/max tuple revision timestamp on this Node to change since no tuples have been added or removed. However, this method does need to record the min/max for the new rightSibling.

Parameters:
    key - The key on which the old node was split.
    child - The new node.

FIXME set min/max for the new child.

protected AbstractNode getLeftSibling(AbstractNode child, boolean materialize)
Return the left sibling. Used by nodes and leaves in AbstractNode.join() to explore their left sibling.

Parameters:
    child - The child (must be dirty).
    materialize - When true, the left sibling will be materialized if it exists but is not resident.
Returns:
    The left sibling, or null if it does not exist -or- if it is not materialized and materialize == false. If the sibling is returned, then it is NOT guaranteed to be mutable and the caller MUST invoke copy-on-write before attempting to modify the returned sibling.

protected AbstractNode getRightSibling(AbstractNode child, boolean materialize)
Return the right sibling of the specified child of a common parent. Used by nodes and leaves in AbstractNode.join() to explore their right sibling.

Parameters:
    child - The child (must be dirty).
    materialize - When true, the right sibling will be materialized if it exists but is not resident.
Returns:
    The right sibling, or null if it does not exist -or- if it is not materialized and materialize == false. If the sibling is returned, then it is NOT guaranteed to be mutable and the caller MUST invoke copy-on-write before attempting to modify the returned sibling.

protected int getIndexOf(AbstractNode child)
Return the index of the child among the direct children of this node.

Parameters:
    child - The child.
Returns:
    The index in childRefs where that child is found. This may also be used as an index into #childAddr and #childEntryCounts.
Throws:
    IllegalArgumentException - iff child is not a child of this node.

protected void removeChild(AbstractNode child)
Invoked when a non-root node or leaf has no more keys to detach the child from its parent.

Parameters:
    child - The child (does NOT need to be mutable).

public final AbstractNode getChild(int index)
Return the child node or leaf at the specified index in this node.

Note: This implementation DOES NOT cause concurrent threads to block unless they are performing IO for the same child. A Memoizer pattern is used to assign each concurrent thread a FutureTask on which it waits for the result. Once the result is available, there is a small synchronized block during which the concurrent requests for a child will contend to update the appropriate element in childRefs.

I believe the contention to update childRefs is unavoidable. If this object was made into an AtomicReferenceArray then we would have difficulty when inserting and removing tuples since the backing array is not visible. An array of AtomicReference objects would not help since it would not ensure "publication" when the element was changed from a null to an AtomicReference, only when AtomicReference.compareAndSet(Object, Object) was used. Thus it could only help if we pre-populated the array with AtomicReference objects, which seems wasteful.

As always, the mutable B+Tree is single threaded so there are no added synchronization costs. Concurrent readers can only arise for read-only BTrees and for IndexSegments.

Parameters:
    index - The index of the child to be read from the store (in [0:nkeys]).
Throws:
    IndexOutOfBoundsException - if index is negative.
    IndexOutOfBoundsException - if index is GT the #of keys in the node (there is one more child than keys in a node).
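A highly simplified sketch of that memoization pattern is shown below (illustrative only, not the actual implementation, which also coordinates with childRefs inside the small synchronized block described above); loadChildFromStore() is a hypothetical stand-in for the child IO.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

// Illustrative sketch of a per-index memoizer: concurrent readers asking for
// the same child share one FutureTask, so the backing store is read only once.
final class ChildMemoizerSketch<T> {

    private final ConcurrentHashMap<Integer, FutureTask<T>> memo =
            new ConcurrentHashMap<>();

    interface ChildLoader<T> {
        T loadChildFromStore(int index); // hypothetical IO
    }

    T getChild(final int index, final ChildLoader<T> loader)
            throws InterruptedException, ExecutionException {
        FutureTask<T> f = memo.get(index);
        if (f == null) {
            final FutureTask<T> task =
                    new FutureTask<>(() -> loader.loadChildFromStore(index));
            f = memo.putIfAbsent(index, task);
            if (f == null) {
                f = task;
                task.run(); // only the winning thread performs the IO
            }
        }
        return f.get(); // other threads block here until the child is loaded
    }
}
```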
protected AbstractNode getRightMostChild(boolean nodesOnly)

Return the right-most child of this node.

Parameters:
    nodesOnly - when true the search will halt at the right-most non-leaf. Otherwise it will return the right-most leaf.

public Iterator<AbstractNode> postOrderNodeIterator(boolean dirtyNodesOnly, boolean nodesOnly)

Iterator visits children, recursively expanding each child with a post-order traversal of its children and finally visits this node itself.

    postOrderNodeIterator in class AbstractNode<Node>

Parameters:
    dirtyNodesOnly - When true, only dirty nodes and leaves will be visited.
    nodesOnly - When true, the leaves will not be visited.
Returns:
    An iterator visiting the AbstractNodes.

public Iterator<AbstractNode> postOrderIterator(byte[] fromKey, byte[] toKey)
Iterator visits children in the specified half-open key range, recursively expanding each child with a post-order traversal of its children and finally visits this node itself.

    postOrderIterator in class AbstractNode<Node>

Parameters:
    fromKey - The first key that will be visited (inclusive). When null there is no lower bound.
    toKey - The first key that will NOT be visited (exclusive). When null there is no upper bound.
Returns:
    An iterator visiting the AbstractNodes.

protected void prefetchChildLeaves(Node node, byte[] fromKey, byte[] toKey)

When we visit a node whose children are leaves, schedule memoization of those leaves whose separator key in the node is LT the toKey (non-blocking).
Parameters:
    node - A node whose children are leaves.
    toKey - The exclusive upper bound of some iterator.

protected void prefetchRightSibling(Node node, byte[] toKey)

If the caller's toKey is GT the separator keys for the children of this node then the iterator will need to visit the rightSibling of the node and this method will schedule the memoization of the node's rightSibling in order to reduce the IO latency when the iterator visits that rightSibling (non-blocking).
Parameters:
    node - A node.
    toKey - The exclusive upper bound of some iterator.

public Iterator<AbstractNode> childIterator(boolean dirtyNodesOnly)

Iterator visits the direct child nodes in the external key ordering.

Parameters:
    dirtyNodesOnly - When true, only the direct dirty child nodes will be visited.

public Iterator<AbstractNode> childIterator(byte[] fromKey, byte[] toKey)

Iterator visits the direct child nodes in the external key ordering.
public boolean dump(org.apache.log4j.Level level, PrintStream out, int height, boolean recursive)

Description copied from class: AbstractNode

Dump the data onto the PrintStream.

    dump in class AbstractNode<Node>

Parameters:
    level - The logging level.
    out - Where to write the dump.
    height - The height of this node in the tree or -1 iff you need to invoke this method on a node or leaf whose height in the tree is not known.
    recursive - When true, the node will be dumped recursively using a pre-order traversal.

Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.