public abstract class AbstractHTree extends Object implements ICounterSetAccess, ICheckpointProtocol, ISimpleTreeIndexAccess
Modifier and Type | Class and Description |
---|---|
static class |
AbstractHTree.HTreePageStateException
Exception that can be thrown for asserts and testing, retaining the
source page of the error
|
Modifier and Type | Field and Description |
---|---|
protected int |
addressBits
The #of bits in the address space for a directory page (from the
constructor).
|
protected int |
bucketSlots
The #of bucket slots in a bucket page.
|
protected static boolean |
DEBUG
True iff the
log level is DEBUG or less. |
static org.apache.log4j.Logger |
dumpLog
Log for
dump(PrintStream) and friends. |
protected Throwable |
error
This field is set if an error is encountered that renders an unisolated
index object unusable.
|
protected static String |
ERROR_CLOSED
The index is already closed.
|
protected static String |
ERROR_ERROR_STATE
An unisolated index view is in an error state.
|
protected static String |
ERROR_LESS_THAN_ZERO
A parameter was less than zero.
|
protected static String |
ERROR_READ_ONLY
The index is read-only but a mutation operation was requested.
|
protected static String |
ERROR_TOO_LARGE
A parameter was too large.
|
protected static String |
ERROR_TOO_SMALL
A parameter was too small.
|
protected static String |
ERROR_TRANSIENT
The index is transient (not backed by persistent storage) but an
operation that requires persistence was requested.
|
protected static boolean |
INFO
True iff the
log level is INFO or less. |
protected static org.apache.log4j.Logger |
log |
int |
MAX_ADDRESS_BITS
The maximum value for the
addressBits parameter. |
protected HTreeIndexMetadata |
metadata
The metadata record for the index.
|
int |
MIN_ADDRESS_BITS |
protected int |
ndistinctOnWriteRetentionQueue
The #of distinct nodes and leaves on the
writeRetentionQueue . |
protected NodeSerializer |
nodeSer
Used to serialize and de-serialize the nodes and leaves of the tree.
|
protected boolean |
readOnly
When
true the AbstractHTree does not permit
mutation. |
protected com.bigdata.htree.DirectoryPage |
root
The root directory.
|
protected IRawStore |
store
The backing store.
|
protected IHardReferenceQueue<PO> |
writeRetentionQueue
Nodes (that is nodes or leaves) are added to a hard reference queue when
they are created or read from the store.
|
Modifier | Constructor and Description |
---|---|
protected |
AbstractHTree(IRawStore store,
INodeFactory nodeFactory,
boolean readOnly,
HTreeIndexMetadata metadata,
IRecordCompressorFactory<?> recordCompressorFactory) |
Modifier and Type | Method and Description |
---|---|
protected abstract void |
_reopen()
This method is responsible for setting up the root leaf (either new or
read from the store), the bloom filter, etc.
|
protected void |
assertNotReadOnly() |
protected void |
assertNotTransient() |
void |
checkConsistency(com.bigdata.htree.AbstractPage pge,
boolean full) |
void |
checkConsistency(boolean full)
Checks globalDepth of each node and also whether any BucketPages
are empty.
|
void |
close()
The contract for
ICheckpointProtocol.close() is to reduce the resource burden of the
index while not rendering the index inoperative. |
static long |
decodeRecordAddr(byte[] buf)
Decodes a signed long value as encoded by
#appendSigned(long) . |
boolean |
dump(org.apache.log4j.Level level,
PrintStream out,
boolean materialize) |
boolean |
dump(PrintStream out)
Recursive dump of the tree.
|
static byte[] |
encodeRecordAddr(ByteArrayBuffer recordAddrBuf,
long addr)
Encode a raw record address into a byte[] suitable for storing in the
value associated with a tuple and decoding using
decodeRecordAddr(byte[]) |
int |
getAddressBits()
The #of bits in the address space for a hash directory page.
|
BTreeCounters |
getBtreeCounters()
Counters tracking various aspects of the btree.
|
CounterSet |
getCounters()
Return performance counters.
|
abstract long |
getEntryCount()
The #of tuples in the
HTree . |
int |
getHeight()
Throws an exception since the
HTree is not a balanced tree. |
HTreeIndexMetadata |
getIndexMetadata()
Returns the metadata record for this index.
|
abstract long |
getLastCommitTime()
The timestamp associated with the last
IAtomicStore.commit() in
which writes buffered by this index were made restart-safe on the backing
store. |
abstract long |
getLeafCount()
The #of
BucketPage s in the HTree (not buddy hash buckets,
but the pages on which they appear). |
abstract long |
getNodeCount()
The #of
DirectoryPage s in the HTree (not buddy hash
tables, but the pages on which they appear). |
NodeSerializer |
getNodeSerializer()
The object responsible for (de-)serializing the nodes and leaves of the
IIndex . |
int |
getReadLockCount()
Return the #of read-locks held by the current thread for a mutable index
view.
|
protected com.bigdata.htree.DirectoryPage |
getRoot()
The root of the
HTree . |
IRawStore |
getStore()
The backing store.
|
boolean |
isBalanced()
Return
true iff the tree is balanced. |
boolean |
isOpen()
An "open" index has may have some buffered data.
|
boolean |
isReadOnly()
Return
true iff this B+Tree is read-only. |
boolean |
isTransient()
Return
true iff this is a transient data structure (no
backing store). |
abstract long |
rangeCount()
The #of index entries.
|
ITupleIterator |
rangeIterator()
Simple iterator visits all tuples in the
HTree in order by the
effective prefix of their keys. |
Lock |
readLock()
Return a
Lock that may be used to obtain a shared read lock which
is used (in the absence of other concurrency control mechanisms) to
permit concurrent readers on an unisolated index while serializing access
to that index when a writer must run. |
protected com.bigdata.htree.AbstractPage |
readNodeOrLeaf(long addr)
Read an
AbstractPage from the store. |
void |
reopen()
(Re-) open the index.
|
ICloseableIterator<?> |
scan()
Visit all entries in the index in the natural order of the index
(dereferencing visited tuples to the application objects stored within
those tuples).
|
void |
setBTreeCounters(BTreeCounters btreeCounters)
Replace the
BTreeCounters . |
String |
toString()
Fast summary information about the B+Tree.
|
protected void |
touch(com.bigdata.htree.AbstractPage node)
This method is responsible for putting the node or leaf onto the ring
buffer which controls (a) how long we retain a hard reference to the node
or leaf; and (b) for writes, when the node or leaf is evicted with a zero
reference count and made persistent (along with all dirty children).
|
Iterator<byte[]> |
values()
Visits the values stored in the
HTree in order by the effective
prefix of their keys. |
Lock |
writeLock()
Return a
Lock that may be used to obtain an exclusive write lock
which is used (in the absence of other concurrency control mechanisms) to
serialize all processes accessing an unisolated index when a writer must
run. |
protected long |
writeNodeOrLeaf(com.bigdata.htree.AbstractPage node)
Codes the node and writes the coded record on the store (non-recursive).
|
protected void |
writeNodeRecursive(com.bigdata.htree.AbstractPage node)
Write a dirty node and its children using a post-order traversal that
first writes any dirty leaves and then (recursively) their parent nodes.
|
protected void |
writeNodeRecursiveCallersThread(com.bigdata.htree.AbstractPage node) |
protected void |
writeNodeRecursiveConcurrent(com.bigdata.htree.AbstractPage node)
Writes the dirty nodes and leaves in level sets (one level at a time)
with up to one thread per dirty node/leaf in a given level.
|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
dumpPages, getCheckpoint, getDirtyListener, getMetadataAddr, getRecordVersion, getRootAddr, setDirtyListener, setLastCommitTime, writeCheckpoint, writeCheckpoint2
handleCommit, invalidate
removeAll
protected static final String ERROR_CLOSED
protected static final String ERROR_LESS_THAN_ZERO
protected static final String ERROR_TOO_SMALL
protected static final String ERROR_TOO_LARGE
protected static final String ERROR_READ_ONLY
protected static final String ERROR_TRANSIENT
protected static final String ERROR_ERROR_STATE
public final int MIN_ADDRESS_BITS

public final int MAX_ADDRESS_BITS

The maximum value for the addressBits parameter.

Note: 2^32 overflows an int32. However, 2^16 is 65536, which is already a pretty large page, and 2^20 is 1,048,576. It is difficult to imagine use cases where the fan-out for the HTree should be that large.
protected static final transient org.apache.log4j.Logger log

protected static final boolean INFO

True iff the log level is INFO or less.

protected static final boolean DEBUG

True iff the log level is DEBUG or less.

public static final org.apache.log4j.Logger dumpLog

Log for dump(PrintStream) and friends.

protected final IRawStore store

The backing store.

protected final boolean readOnly

When true the AbstractHTree does not permit mutation.

protected final int addressBits

The #of bits in the address space for a directory page (from the constructor). A directory page has 2^addressBits entries. Those entries are divided up among one or more buddy hash tables on that page.

protected final int bucketSlots

The #of bucket slots in a bucket page.

protected volatile com.bigdata.htree.DirectoryPage root

The root directory.

protected volatile Throwable error

This field is set if an error is encountered that renders an unisolated index object unusable. When this field is non-null the index MUST be reloaded from the most recent checkpoint before it can be used (that is, you need to obtain a new view of the unisolated index since this field is sticky once set).

protected final IHardReferenceQueue<PO> writeRetentionQueue
Nodes (that is, nodes or leaves) are added to a hard reference queue when they are created or read from the IRawStore. The nodes and leaves refer to their parent with a WeakReference. Likewise, nodes refer to their children with WeakReferences. The hard reference queue in combination with touch(AbstractPage) and with hard references held on the stack ensures that the parent and/or children remain reachable during operations. Once the node is no longer strongly reachable, weak references to that node may be cleared by the VM - in this manner the node will become unreachable by navigation from its ancestors in the btree. The special role of the hard reference queue is to further ensure that dirty nodes remain dirty by deferring persistence until the reference count for the node is zero during an eviction from the queue.
Note that nodes are evicted as new nodes are added to the hard reference queue. This occurs in two situations: (1) when a new node is created during a split of an existing node; and (2) when a node is read in from the store. Inserts on this hard reference queue always drive evictions. Incremental writes basically make it impossible for the commit set to get "too large" - the maximum #of nodes to be written is bounded by the size of the hard reference queue. This helps to ensure fast commit operations on the store.
The minimum capacity for the hard reference queue is two (2) so that a split may occur without forcing eviction of either node participating in the split.
Note: The code in
AbstractPage.postOrderNodeIterator(boolean, boolean)
and
DirtyChildIterator
MUST NOT touch the hard reference queue since
those iterators are used when persisting a node using a post-order
traversal. If a hard reference queue eviction drives the serialization of
a node and we touch the hard reference queue during the post-order
traversal then we break down the semantics of
HardReferenceQueue.add(Object)
as the eviction does not
necessarily cause the queue to reduce in length. Another way to handle
this is to have HardReferenceQueue.add(Object)
begin to evict
objects before it is actually at capacity, but that is also a bit
fragile.
Note: The writeRetentionQueue
uses a HardReferenceQueue
.
This is based on a RingBuffer
and is very fast. It does not use a
HashMap
because we can resolve the WeakReference
to the
child DirectoryPage
or BucketPage
using top-down
navigation as long as the DirectoryPage
or BucketPage
remains strongly reachable (that is, as long as it is on the
writeRetentionQueue
or otherwise strongly held). This means that
lookup in a map is not required for top-down navigation.
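The following is a minimal conceptual sketch of the touch/eviction contract described above. It is illustrative only; the class and method names are hypothetical and it is not the bigdata HardReferenceQueue implementation.

```java
// Conceptual sketch of a hard reference queue with incremental eviction,
// as described above. Hypothetical names; not the bigdata HardReferenceQueue.
import java.util.ArrayDeque;
import java.util.Deque;

public class HardReferenceQueueSketch<T extends HardReferenceQueueSketch.Page> {

    public interface Page {
        boolean isDirty();
        void writeOnStore();           // code the page and write it on the backing store
        void incrementReferenceCount();
        int decrementReferenceCount(); // returns the #of times the page is still on the queue
    }

    private final Deque<T> queue = new ArrayDeque<>();
    private final int capacity; // >= 2 so a split never evicts a page participating in the split

    public HardReferenceQueueSketch(final int capacity) {
        if (capacity < 2)
            throw new IllegalArgumentException("capacity must be >= 2");
        this.capacity = capacity;
    }

    /** Append a touched page to the head; evict from the tail when over capacity. */
    public void touch(final T page) {
        page.incrementReferenceCount();
        queue.addFirst(page);
        while (queue.size() > capacity) {
            evict(queue.removeLast());
        }
    }

    private void evict(final T page) {
        // Persistence is deferred until the reference count reaches zero, i.e.,
        // until the page is no longer present anywhere on the queue.
        if (page.decrementReferenceCount() == 0 && page.isDirty()) {
            page.writeOnStore();
        }
    }
}
```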
protected int ndistinctOnWriteRetentionQueue

The #of distinct nodes and leaves on the writeRetentionQueue.

protected HTreeIndexMetadata metadata

The metadata record for the index. This data rarely changes during the life of the HTree object, but it CAN be changed.

protected final NodeSerializer nodeSer

Used to serialize and de-serialize the nodes and leaves of the tree.
protected AbstractHTree(IRawStore store, INodeFactory nodeFactory, boolean readOnly, HTreeIndexMetadata metadata, IRecordCompressorFactory<?> recordCompressorFactory)

Parameters:
store - The persistence store.
nodeFactory - Object that provides a factory for node and leaf objects.
readOnly - true IFF it is known that the AbstractHTree is read-only.
metadata - The IndexMetadata object for this AbstractHTree.
recordCompressorFactory - Object that knows how to (de-)compress the serialized data records.

Throws:
IllegalArgumentException - if addressBits is LT ONE (1).
IllegalArgumentException - if addressBits is GT 16.

public final BTreeCounters getBtreeCounters()

Counters tracking various aspects of the btree.
public final void setBTreeCounters(BTreeCounters btreeCounters)

Replace the BTreeCounters.

Note: This is used by the IndexManager to ensure that an index loaded from its backing store uses the BTreeCounters associated with that index since the DataService was last (re-)started.

Parameters:
btreeCounters - The counters to be used.

Throws:
IllegalArgumentException - if the argument is null.

public CounterSet getCounters()

Description copied from interface: ICounterSetAccess

Return performance counters.

Specified by:
getCounters in interface ICounterSetAccess
public HTreeIndexMetadata getIndexMetadata()

Returns the metadata record for this index.

Note: If the index is read-only then the metadata object will be cloned to avoid potential modification. However, only a single cloned copy of the metadata record will be shared between all callers for a given instance of this class.

Specified by:
getIndexMetadata in interface ICheckpointProtocol

Returns:
The metadata record for this index and never null.

See Also:
IIndex.getIndexMetadata()
public void close()
ICheckpointProtocol.close()
is to reduce the resource burden of the
index while not rendering the index inoperative. An index that has been
closed
MAY be reopened
at any time
(conditional on the continued availability of the backing store). Such an
index reference remains valid after a ICheckpointProtocol.close()
. A closed index is
transparently reopened
by any access to the index data
(scanning the index, probing the index, etc).
Note: A ICheckpointProtocol.close()
on a dirty index MUST discard writes rather than
flushing them to the store and MUST NOT update its Checkpoint
record - (ICheckpointProtocol.close()
is used to discard indices with partial writes
when an AbstractTask
fails). If you are seeking to ICheckpointProtocol.close() a mutable index view such that its state can be recovered by ICheckpointProtocol.reopen(), then you MUST write a new Checkpoint record before closing the index.
Note: CLOSING A TRANSIENT INDEX WILL DISCARD ALL DATA!
This implementation clears the hard reference queue (releasing all node
references), releases the hard reference to the root node, and releases
the buffers on the NodeSerializer
(they will be naturally
reallocated on reuse).
Note: AbstractHTree
is NOT thread-safe and close()
MUST
be invoked in a context in which there will not be concurrent threads --
the natural choice being the single-threaded commit service on the
journal.
Specified by:
close in interface ICheckpointProtocol

Throws:
IllegalStateException - if the root is null, indicating that the index is already closed.
IllegalStateException - if the root is dirty (this implies that this is a mutable HTree and there are mutations that have not been written through to the store).

public final void reopen()
(Re-) open the index. This method is part of the ICheckpointProtocol.close() / ICheckpointProtocol.reopen() protocol. That protocol may be used to reduce the resource burden of an index. This method is automatically invoked by a variety of methods that need to ensure that the index is available for use.

This method delegates to _reopen() if double-checked locking demonstrates that the root is null (indicating that the index has been closed).

Specified by:
reopen in interface ICheckpointProtocol

See Also:
ICheckpointProtocol.close(), ICheckpointProtocol.isOpen(), getRoot()
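The double-checked locking pattern described above can be illustrated with a small sketch. This is illustrative only; the field and method names follow the documented reopen()/_reopen() contract, not the actual bigdata source.

```java
// Illustrative sketch of the reopen() double-checked locking contract
// described above. Not the actual bigdata implementation.
public class ReopenSketch {

    private volatile Object root; // stands in for the root DirectoryPage reference

    public final void reopen() {
        if (root == null) {           // first (unsynchronized) check
            synchronized (this) {     // the caller of _reopen() holds the lock on this
                if (root == null) {   // second check under the lock
                    _reopen();        // at most one thread re-opens the index
                }
            }
        }
    }

    // Responsible for setting up the root (either new or read from the backing store).
    protected void _reopen() {
        root = new Object(); // e.g., read the root DirectoryPage from the store
    }
}
```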
protected abstract void _reopen()

This method is responsible for setting up the root leaf (either new or read from the store), the bloom filter, etc. It is invoked by reopen() once root has been shown to be null with double-checked locking. When invoked in this context, the caller is guaranteed to hold a lock on this. This is done to ensure that at most one thread gets to re-open the index from the backing store.

public final boolean isOpen()
Description copied from interface: ICheckpointProtocol

An "open" index may have some buffered data.

Specified by:
isOpen in interface ICheckpointProtocol

See Also:
ICheckpointProtocol.close(), ICheckpointProtocol.reopen()

public final boolean isTransient()

Return true iff this is a transient data structure (no backing store).

protected final void assertNotTransient()

public final boolean isReadOnly()

Return true iff this B+Tree is read-only.

Specified by:
isReadOnly in interface IReadWriteLockManager

protected final void assertNotReadOnly()

Throws:
UnsupportedOperationException - if the B+Tree is read-only.

See Also:
isReadOnly()
public abstract long getLastCommitTime()

The timestamp associated with the last IAtomicStore.commit() in which writes buffered by this index were made restart-safe on the backing store. The lastCommitTime is set when the index is loaded from the backing store and updated after each commit. It is ZERO (0L) when the HTree is first created and will remain ZERO (0L) until the HTree is committed. If the backing store does not support atomic commits, then this value will always be ZERO (0L).

Specified by:
getLastCommitTime in interface ICheckpointProtocol
public IRawStore getStore()

Description copied from interface: ISimpleIndexAccess

The backing store.

Specified by:
getStore in interface ISimpleIndexAccess
public final int getAddressBits()

The #of bits in the address space for a hash directory page. A directory page has 2^addressBits entries. When addressBits is 10 we have 2^10 := 1024. If the size of a child address is 4 bytes, then a 10 bit address space implies a 4k page size.
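A small illustration of the arithmetic above (the addressBits value and the 4-byte child address size are just the example figures used in the description):

```java
// Illustrative arithmetic for the page-size relationship described above.
public class AddressBitsMath {
    public static void main(String[] args) {
        final int addressBits = 10;           // e.g., the value reported by getAddressBits()
        final int entries = 1 << addressBits; // 2^10 = 1024 entries on a directory page
        final int childAddrBytes = 4;         // assumed size of a child address
        System.out.println(entries + " entries * " + childAddrBytes + " bytes = "
                + (entries * childAddrBytes) + " bytes (~4k page)");
    }
}
```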
public abstract long getNodeCount()

The #of DirectoryPages in the HTree (not buddy hash tables, but the pages on which they appear).

Specified by:
getNodeCount in interface ISimpleTreeIndexAccess

public abstract long getLeafCount()

The #of BucketPages in the HTree (not buddy hash buckets, but the pages on which they appear).

Specified by:
getLeafCount in interface ISimpleTreeIndexAccess

public abstract long getEntryCount()

The #of tuples in the HTree.

Specified by:
getEntryCount in interface ISimpleTreeIndexAccess
public final boolean isBalanced()

Return true iff the tree is balanced.

Note: Not all tree-structured indices are balanced. For example, the BTree is balanced, but the HTree is not balanced. The height of an unbalanced tree must be discovered through a traversal of the tree.

Specified by:
isBalanced in interface ISimpleTreeIndexAccess

Returns:
false since an HTree is NOT a balanced tree.

public int getHeight()

Throws an exception since the HTree is not a balanced tree.

Specified by:
getHeight in interface ISimpleTreeIndexAccess

See Also:
ISimpleTreeIndexAccess.isBalanced()
public final NodeSerializer getNodeSerializer()

The object responsible for (de-)serializing the nodes and leaves of the IIndex.

protected final com.bigdata.htree.DirectoryPage getRoot()

The root of the HTree.

public String toString()

Fast summary information about the B+Tree.

Overrides:
toString in class Object

public boolean dump(PrintStream out)

Recursive dump of the tree.

Parameters:
out - The dump is written on this stream.

public boolean dump(org.apache.log4j.Level level, PrintStream out, boolean materialize)
protected void touch(com.bigdata.htree.AbstractPage node)
This method is responsible for putting the node or leaf onto the ring buffer which controls (a) how long we retain a hard reference to the node or leaf; and (b) for writes, when the node or leaf is evicted with a zero reference count and made persistent (along with all dirty children). The concurrency requirements and the implementation behavior and guarantees differ depending on whether the B+Tree is read-only or mutable for two reasons: For writers, the B+Tree is single-threaded so there is no contention. For readers, every touch on the B+Tree goes through this point, so it is vital to make this method non-blocking.
This method guarantees that the specified node will NOT be synchronously persisted as a side effect and thereby made immutable. (Of course, the node may be already immutable.)
In conjunction with DefaultEvictionListener
, this method
guarantees that the reference counter for the node will reflect the #of
times that the node is actually present on the
writeRetentionQueue
.
If the node is not found on a scan of the head of the queue, then it is
appended to the queue and its AbstractPage.referenceCount
is
incremented. If a node is being appended to the queue and the queue is at
capacity, then this will cause a reference to be evicted from the queue.
If the reference counter for the evicted node or leaf is zero and the
evicted node or leaf is dirty, then a data record will be coded for the
evicted node or leaf and written onto the backing store. A subsequent
attempt to modify the node or leaf will force copy-on-write for that node
or leaf.
For the mutable B+Tree we also track the #of references to the node/leaf on the ring buffer. When that reference count reaches zero we do an eviction and the node/leaf is written onto the backing store if it is dirty. Those reference counting games DO NOT matter for read-only views so we can take a code path which does not update the per-node/leaf reference count and we do not need to use either synchronization or atomic counters to track the reference counts.
In order to reduce contention for the lock required to update the backing
queue, the writeRetentionQueue
is configured to collect
references for touched nodes or leaves in a thread-local queue and then
batch those references through to the backing hard reference queue while
holding the lock.
Parameters:
node - The node or leaf.

protected final void writeNodeRecursive(com.bigdata.htree.AbstractPage node)
Write a dirty node and its children using a post-order traversal that first writes any dirty leaves and then (recursively) their parent nodes.

Note: This will throw an exception if the backing store is read-only.

Parameters:
node - The root of the hierarchy of nodes to be written. The node MUST be dirty. The node does NOT have to be the root of the tree and it does NOT have to be a Node.

See Also:
(Reduce commit latency by parallel checkpoint by level of dirty pages in an index)

protected final void writeNodeRecursiveCallersThread(com.bigdata.htree.AbstractPage node)
protected final void writeNodeRecursiveConcurrent(com.bigdata.htree.AbstractPage node)

Writes the dirty nodes and leaves in level sets (one level at a time) with up to one thread per dirty node/leaf in a given level. This may be invoked by ICheckpointProtocol.writeCheckpoint() or for a Node evicted from the writeRetentionQueue. Whether this is driven by ICheckpointProtocol.writeCheckpoint() or #touch(AbstractNode), we have the same contract with respect to eviction as the single-threaded eviction logic - we are just doing it in parallel level sets.

Parameters:
node -

See Also:
(Reduce commit latency by parallel checkpoint by level of dirty pages in an index)
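A conceptual sketch of writing dirty pages in level sets, as described above, is shown below. It is illustrative only (hypothetical Page interface, not the bigdata classes) and assumes the copy-on-write invariant that a dirty child always has a dirty parent.

```java
// Conceptual sketch: write dirty pages one level at a time, deepest level
// first, with one task per dirty page in a level. Hypothetical types only.
import java.util.*;
import java.util.concurrent.*;

public class LevelSetWriteSketch {

    interface Page {
        int depth();
        boolean isDirty();
        void writeOnStore();
        List<Page> children();
    }

    static void writeDirtyPagesByLevel(final Page root, final ExecutorService exec)
            throws InterruptedException, ExecutionException {
        // Group the dirty pages by their depth, deepest level first.
        final SortedMap<Integer, List<Page>> levels = new TreeMap<>(Collections.reverseOrder());
        collect(root, levels);
        for (List<Page> level : levels.values()) {
            final List<Callable<Void>> tasks = new ArrayList<>();
            for (Page p : level) {
                tasks.add(() -> { p.writeOnStore(); return null; });
            }
            // Children are always made persistent before their parents: the next
            // (shallower) level is not started until this level has completed.
            for (Future<Void> f : exec.invokeAll(tasks)) {
                f.get(); // propagate any write error before starting the next level
            }
        }
    }

    private static void collect(final Page p, final SortedMap<Integer, List<Page>> levels) {
        if (!p.isDirty())
            return; // under copy-on-write, a clean page has no dirty descendants
        levels.computeIfAbsent(p.depth(), k -> new ArrayList<>()).add(p);
        for (Page c : p.children())
            collect(c, levels);
    }
}
```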
protected long writeNodeOrLeaf(com.bigdata.htree.AbstractPage node)

Codes the node and writes the coded record on the store (non-recursive). As nodes and leaves are evicted from the writeRetentionQueue, the B+Tree continuously converts nodes and leaves to their more compact coded record forms which results in a smaller in memory footprint.

Note: For a transient B+Tree, this merely codes the node but does not write the node on the store (there is none).

Throws:
UnsupportedOperationException - if the B+Tree (or the backing store) is read-only.

protected com.bigdata.htree.AbstractPage readNodeOrLeaf(long addr)
Read an AbstractPage from the store.
Note: Callers SHOULD be synchronized in order to ensure that only one thread will read the desired node or leaf in from the store and attach the reference to the newly read node or leaf as appropriate into the existing data structures (e.g., as the root reference or as a child of a node or leaf already live in memory).
Note: The caller MUST set the AbstractPage.globalDepth
on the
returned value.
Parameters:
addr - The address in the store.

Returns:
The AbstractPage.

Throws:
IllegalArgumentException - if the address is IAddressManager.NULL.

public static byte[] encodeRecordAddr(ByteArrayBuffer recordAddrBuf, long addr)
Encode a raw record address into a byte[] suitable for storing in the value associated with a tuple and decoding using decodeRecordAddr(byte[]).

Parameters:
recordAddrBuf - The buffer that will be used to format the record address.
addr - The raw record address.

public static long decodeRecordAddr(byte[] buf)

Decodes a signed long value as encoded by #appendSigned(long).

Parameters:
buf - The buffer containing the encoded record address.

public abstract long rangeCount()

The #of index entries.

Specified by:
rangeCount in interface ISimpleIndexAccess

See Also:
IRangeQuery.rangeCount()
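A hedged usage sketch for the raw record address helpers documented above. It assumes ByteArrayBuffer lives in com.bigdata.io and has a default constructor, and the address value is purely illustrative.

```java
// Illustrative round trip through encodeRecordAddr()/decodeRecordAddr().
// Assumes com.bigdata.io.ByteArrayBuffer with a default constructor.
import com.bigdata.htree.AbstractHTree;
import com.bigdata.io.ByteArrayBuffer;

public class RecordAddrSketch {
    public static void main(String[] args) {
        final long addr = 12345L; // a raw record address from the store (illustrative value)
        final ByteArrayBuffer recordAddrBuf = new ByteArrayBuffer();
        // Encode the address into a byte[] suitable for storing as a tuple value.
        final byte[] encoded = AbstractHTree.encodeRecordAddr(recordAddrBuf, addr);
        // Later, decode the tuple value back into the raw record address.
        final long decoded = AbstractHTree.decodeRecordAddr(encoded);
        assert decoded == addr;
    }
}
```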
public final ICloseableIterator<?> scan()

Description copied from interface: ISimpleIndexAccess

Visit all entries in the index in the natural order of the index (dereferencing visited tuples to the application objects stored within those tuples).

Specified by:
scan in interface ISimpleIndexAccess
public ITupleIterator rangeIterator()

Simple iterator visits all tuples in the HTree in order by the
effective prefix of their keys. Since the key is typically a hash of some
fields in the associated application data record, this will normally
visit application data records in what appears to be an arbitrary order.
Tuples within a buddy hash bucket will be delivered in a random order.
Note: The HTree
does not currently maintain metadata about the
#of spanned tuples in a DirectoryPage
. Without that we can not
provide fast range counts, linear list indexing, etc.
public Iterator<byte[]> values()

Visits the values stored in the HTree in order by the effective
prefix of their keys. Since the key is typically a hash of some fields in
the associated application data record, this will normally visit
application data records in what appears to be an arbitrary order. Tuples
within a buddy hash bucket will be delivered in a random order.
Note: This allocates a new byte[] for each visited value. It is more
efficient to reuse a buffer for each visited Tuple
. This can be
done using rangeIterator()
.
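A hedged usage sketch of the iteration methods documented above. The htree argument is assumed to be a concrete AbstractHTree view (for example an HTree) obtained elsewhere, and the import shown for ICloseableIterator is the package used by bigdata (an assumption).

```java
// Illustrative use of values() and scan() documented above.
import java.util.Iterator;
import cutthecrap.utils.striterators.ICloseableIterator; // package assumed

public class HTreeIterationSketch {

    /** Visit the raw values; note that values() allocates a new byte[] per visited value. */
    static void dumpValues(final com.bigdata.htree.AbstractHTree htree) {
        final Iterator<byte[]> itr = htree.values();
        while (itr.hasNext()) {
            final byte[] val = itr.next();
            System.out.println("value: " + val.length + " bytes");
        }
    }

    /** Count entries using scan(); close the iterator when done. */
    static long countViaScan(final com.bigdata.htree.AbstractHTree htree) {
        long n = 0L;
        final ICloseableIterator<?> itr = htree.scan();
        try {
            while (itr.hasNext()) {
                itr.next();
                n++;
            }
        } finally {
            itr.close();
        }
        return n;
    }
}
```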
public void checkConsistency(boolean full)

Checks globalDepth of each node and also whether any BucketPages are empty.

Parameters:
full - indicates whether to check consistency on disk.

public void checkConsistency(com.bigdata.htree.AbstractPage pge, boolean full)
public final Lock readLock()

Description copied from interface: IReadWriteLockManager

Return a Lock that may be used to obtain a shared read lock which
is used (in the absence of other concurrency control mechanisms) to
permit concurrent readers on an unisolated index while serializing access
to that index when a writer must run. This is exposed for processes which
need to obtain the write lock to coordinate external operations.
Note: If the persistence capable data structure is read-only then the
returned Lock
is a singleton that ignores all lock requests. This
is because our read-only persistence capable data structures are already
thread-safe for concurrent readers.
Specified by:
readLock in interface IReadWriteLockManager
public final Lock writeLock()

Description copied from interface: IReadWriteLockManager

Return a Lock that may be used to obtain an exclusive write lock
which is used (in the absence of other concurrency control mechanisms) to
serialize all processes accessing an unisolated index when a writer must
run. This is exposed for processes which need to obtain the write lock to
coordinate external operations.

Specified by:
writeLock in interface IReadWriteLockManager
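A hedged usage sketch of the lock protocol described above. The index argument is assumed to be an unisolated AbstractHTree view obtained elsewhere; readLock() and rangeCount() are the methods documented on this page.

```java
// Illustrative use of readLock() documented above: take the shared lock
// around a read operation on an unisolated index view.
import java.util.concurrent.locks.Lock;

public class HTreeLockingSketch {

    static long guardedRangeCount(final com.bigdata.htree.AbstractHTree index) {
        final Lock lock = index.readLock(); // a no-op singleton for read-only views
        lock.lock();
        try {
            return index.rangeCount();
        } finally {
            lock.unlock();
        }
    }
}
```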
public final int getReadLockCount()

Description copied from interface: IReadWriteLockManager

Return the #of read-locks held by the current thread for a mutable index view.

Specified by:
getReadLockCount in interface IReadWriteLockManager
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.