public class IndexSegmentBuilder extends Object implements Callable<IndexSegmentCheckpoint>
IndexSegment
given a source btree and a target branching
factor. There are two main use cases:
BTree
that is ideally backed by a fully buffered
IRawStore
so that no random reads are required.The disadvantage of (A) is that it requires two passes over the source view, which substantially increases the run time of the algorithm. In addition, the passes can drive evictions in the global LRU and could defeat caching for a view approaching the nominal size for a split. However, with (A) we can do builds for very large source B+Trees. Therefore, (A) is implemented for such use cases.
The disadvantage of (B) is that it requires more memory. However, it is much faster than (A). To compensate for the increased memory demand, we can single thread builds, merges, and splits and fall back to (A) if memory is very tight or the source view is very large.
The disadvantage of (C) is that the "hacks" break encapsulation and leak into
the API where operations such as retrieving the right sibling of a node could
return an empty leaf (since we ran out of tuples for the plan). Since these
"hacks" would break encapsulation, it would be difficult to have confidence
that the B+Tree API was fully insulated against the effects of ill-formed
IndexSegment
s. Therefore, I have discarded this approach and backed
out changes designed to support it from the code base.
For the two pass design described above as option (A), the code buffers the
nodes and leaves onto TemporaryRawStore
s. This approach is scalable,
which is the concern of (A), but requires at least twice the IO when compared
to directly writing the nodes and leaves onto the output file.
When sufficient memory is available, as cases where (B) would apply, we can write the leaves directly on the backing file (using double-buffering to update the prior/next addrs). Since there are far fewer nodes than leaves, we can buffer the nodes in memory, writing them once the leaves are finished.
IndexSegment
,
IndexSegmentFile
,
IndexSegmentCheckpoint
Modifier and Type | Class and Description |
---|---|
protected static class |
IndexSegmentBuilder.AbstractSimpleNodeData
Abstract base class for classes used to construct and serialize nodes and
leaves written onto the index segment.
|
protected static class |
IndexSegmentBuilder.NOPNodeFactory
Factory does not support node or leaf creation.
|
protected static class |
IndexSegmentBuilder.SimpleLeafData
A class that can be used to (de-)serialize the data for a leaf without
any of the logic for operations on the leaf.
|
protected static class |
IndexSegmentBuilder.SimpleNodeData
A class that can be used to (de-)serialize the data for a node without
any of the logic for operations on the node.
|
Modifier and Type | Field and Description |
---|---|
protected boolean |
bufferNodes
When
true the generated nodes will be fully buffered in RAM. |
long |
commitTime
The commit time associated with the view from which the
IndexSegment is being generated (from the ctor). |
boolean |
compactingMerge
true iff the generated IndexSegment will
incorporate all state for the source index (partition) as of the
specified commitTime. |
long |
elapsed
The process runtime in milliseconds.
|
long |
elapsed_build
The time to write the nodes and leaves into their respective buffers, not
including the time to transfer those buffered onto the output file.
|
long |
elapsed_setup
The time to setup the index build, including the generation of the index
plan and the initialization of some helper objects.
|
long |
elapsed_write
The time to write the nodes and leaves from their respective buffers
onto the output file and synch and close that output file.
|
long |
entryCount
The value specified to the ctor.
|
protected static String |
ERR_NO_TUPLES
Message when the index segment will be empty.
|
protected static String |
ERR_TOO_MANY_TUPLES
Error message when the #of tuples in the
IndexSegment would
exceed Integer.MAX_VALUE . |
float |
mbPerSec
The data throughput rate in megabytes per second.
|
IndexMetadata |
metadata
A copy of the metadata object provided to the ctor.
|
protected RandomAccessFile |
out
The file on which the
IndexSegment is written. |
File |
outFile
The file specified by the caller on which the
IndexSegment is
written. |
IndexSegmentPlan |
plan
The plan for building the B+-Tree.
|
UUID |
segmentUUID
The unique identifier for the generated
IndexSegment resource. |
Modifier | Constructor and Description |
---|---|
protected |
IndexSegmentBuilder(File outFile,
File tmpDir,
long entryCount,
ITupleIterator<?> entryIterator,
int m,
IndexMetadata metadata,
long commitTime,
boolean compactingMerge,
boolean bufferNodes)
Designated constructor sets up a build of an
IndexSegment for
some caller defined read-only view. |
Modifier and Type | Method and Description |
---|---|
protected void |
addChild(IndexSegmentBuilder.SimpleNodeData parent,
long childAddr,
IndexSegmentBuilder.AbstractSimpleNodeData child)
Record the persistent address of a child on its parent and the #of
entries spanned by that child.
|
protected void |
addSeparatorKey(IndexSegmentBuilder.SimpleLeafData leaf)
Copies the first key of a new leaf as a separatorKey for the appropriate
parent (if any) of that leaf.
|
protected void |
buildBTree()
Scan the source tuple iterator in key order writing output leaves onto
the index segment file with the new branching factor.
|
IndexSegmentCheckpoint |
call()
Build the
IndexSegment given the parameters specified to the
constructor. |
protected void |
flushNodeOrLeaf(IndexSegmentBuilder.AbstractSimpleNodeData node)
Flush a node or leaf that has been closed (no more data will be added).
|
IndexSegmentCheckpoint |
getCheckpoint()
The
IndexSegmentCheckpoint record written on the
IndexSegmentStore . |
protected IndexSegmentBuilder.SimpleNodeData |
getParent(IndexSegmentBuilder.AbstractSimpleNodeData node)
Return the parent of a node or leaf in the
stack . |
IResourceMetadata |
getSegmentMetadata()
The description of the constructed
IndexSegment resource. |
long |
getStartTime()
|
static void |
main(String[] args)
Driver for index segment build against a named index on a local journal.
|
static IndexSegmentBuilder |
newInstance(File outFile,
File tmpDir,
long entryCount,
ITupleIterator<?> entryIterator,
int m,
IndexMetadata metadata,
long commitTime,
boolean compactingMerge,
boolean bufferNodes)
A more flexible factory for an
IndexSegment build which permits
override of the index segment branching factor, replacement of the
IndexMetadata , and the use of the caller's iterator. |
static IndexSegmentBuilder |
newInstance(ILocalBTreeView src,
File outFile,
File tmpDir,
boolean compactingMerge,
long createTime,
byte[] fromKey,
byte[] toKey)
Builder factory will build an
IndexSegment from an index
(partition). |
static IndexSegmentBuilder |
newInstance(Object[] a,
int alen,
IndexMetadata indexMetadata,
File outFile,
File tmpDir,
int m,
boolean compactingMerge,
long createTime,
boolean bufferNodes)
Variant using an array of objects in the desired order.
|
protected static IndexSegmentBuilder |
newInstanceFullyBuffered(ILocalBTreeView src,
File outFile,
File tmpDir,
int m,
boolean compactingMerge,
long createTime,
byte[] fromKey,
byte[] toKey,
boolean bufferNodes)
A one pass algorithm which materializes the tuples in RAM, computing the
exact tuple count as it goes.
|
protected static IndexSegmentBuilder |
newInstanceTwoPass(ILocalBTreeView src,
File outFile,
File tmpDir,
int m,
boolean compactingMerge,
long createTime,
byte[] fromKey,
byte[] toKey,
boolean bufferNodes)
A two pass build algorithm.
|
protected void |
resetNode(IndexSegmentBuilder.SimpleNodeData parent)
The
stack contains nodes which are reused for each node or leaf
at a given level in the generated B+Tree. |
protected static void |
usage(String[] args,
String msg,
int exitCode)
Prints the usage and then exits.
|
protected IndexSegmentCheckpoint |
writeIndexSegment(FileChannel outChannel,
long commitTime)
Writes the complete file format for the index segment.
|
protected long |
writeLeaf(IndexSegmentBuilder.SimpleLeafData leaf)
Code the leaf, obtaining its address, update the prior/next addr of the
previous leaf, and write that previous leaf onto the output file.
|
protected long |
writeNode(IndexSegmentBuilder.SimpleNodeData node)
Code and write the node onto the
nodeBuffer . |
protected long |
writeNodeOrLeaf(IndexSegmentBuilder.AbstractSimpleNodeData node)
Write the node or leaf onto the appropriate output channel.
|
protected static final String ERR_TOO_MANY_TUPLES
IndexSegment
would
exceed Integer.MAX_VALUE
.
Note: This is not an inherent limit in the IndexSegment
but
rather a limit in the IndexSegmentPlan
(and perhaps the
IndexSegmentBuilder
) which presumes that the entry count is an
int
rather than a long
.
protected static final String ERR_NO_TUPLES
public final File outFile
IndexSegment
is
written.public final long entryCount
public final long commitTime
IndexSegment
is being generated (from the ctor). This value is
written into IndexSegmentCheckpoint.commitTime
.public final boolean compactingMerge
true
iff the generated IndexSegment
will
incorporate all state for the source index (partition) as of the
specified commitTime.
Note: This flag is written into the IndexSegmentCheckpoint
but it
has no other effect on the build process.
public final IndexMetadata metadata
IndexSegmentStore
.public final UUID segmentUUID
IndexSegment
resource.protected RandomAccessFile out
IndexSegment
is written. The file is closed
regardless of the outcome of the operation.protected final boolean bufferNodes
true
the generated nodes will be fully buffered in RAM.
Otherwise they will be buffered on the nodeBuffer
and then
transferred to the output file en mass.public final IndexSegmentPlan plan
public final long elapsed_setup
public long elapsed_build
public long elapsed_write
public long elapsed
public float mbPerSec
protected IndexSegmentBuilder(File outFile, File tmpDir, long entryCount, ITupleIterator<?> entryIterator, int m, IndexMetadata metadata, long commitTime, boolean compactingMerge, boolean bufferNodes) throws IOException
Designated constructor sets up a build of an IndexSegment
for
some caller defined read-only view.
Note: The caller must determine whether or not deleted index entries are present in the view. The entryCount MUST be the exact #of index entries that are visited by the given iterator. In general, this is not difficult. However, if a compacting merge is desired (that is, if you are trying to generate a view containing only the non-deleted entries) then you MUST explicitly count the #of entries that will be visited by the iterator, e.g., it will require two passes over the iterator to setup the index build operation.
Note: With a branching factor of 4096 a tree of height 2 (three levels)
could address 68,719,476,736 entries - well beyond what we want in a
given index segment! Well before that the index segment should be split
into multiple files. The split point should be determined by the size of
the serialized leaves and nodes, e.g., the amount of data on disk
required by the index segment and the amount of memory required to fully
buffer the index nodes. While the size of a serialized node can be
estimated easily, the size of a serialized leaf depends on the kinds of
values stored in that index. The actual sizes are recorded in the
IndexSegmentCheckpoint
record in the header of the
IndexSegment
.
outFile
- The file on which the index segment is written. The file MAY
exist but MUST have zero length if it does exist (this permits
you to use the temporary file facility to create the output
file).tmpDir
- The temporary directory in data are buffered during the build
(optional - the default temporary directory is used if this is
null
).entryCount
- The #of entries that will be visited by the iterator. This
MUST be an exact range count.entryIterator
- Visits the index entries in key order that will be written
onto the IndexSegment
.m
- The branching factor for the generated tree. This can be
chosen with an eye to minimizing the height of the generated
tree. (Small branching factors are permitted for testing, but
generally you want something relatively large.)metadata
- The metadata record for the source index. A copy will be made
of this object. The branching factor in the generated tree
will be overridden to m.commitTime
- The commit time associated with the view from which the
IndexSegment
is being generated. This value is written
into IndexSegmentCheckpoint.commitTime
.compactingMerge
- true
iff the generated IndexSegment
will
incorporate all state for the source index (partition) as of
the specified commitTime. This flag is written into the
IndexSegmentCheckpoint
but does not otherwise effect
the build process.bufferNodes
- When true
the generated nodes will be fully
buffered in RAM (faster, but imposes a memory constraint).
Otherwise they will be written onto a temporary file and then
transferred to the output file en mass.IOException
public IndexSegmentCheckpoint getCheckpoint()
IndexSegmentCheckpoint
record written on the
IndexSegmentStore
.public long getStartTime()
public static IndexSegmentBuilder newInstance(ILocalBTreeView src, File outFile, File tmpDir, boolean compactingMerge, long createTime, byte[] fromKey, byte[] toKey) throws IOException
IndexSegment
from an index
(partition). Delete markers are propagated to the IndexSegment
unless compactingMerge is true
.src
- A view of the index partition as of the createTime.
When compactingMerge is false
then this
MUST be a single BTree
since incremental builds are
only support for a BTree
source while compacting
merges are defined for any IIndex
.outFile
- The file on which the IndexSegment
will be written.
The file MAY exist, but if it exists then it MUST be empty.compactingMerge
- When true
the caller asserts that src is a
FusedView
and deleted index entries WILL NOT be
included in the generated IndexSegment
. Otherwise, it
is assumed that the only select component(s) of the index
partition view are being exported onto an IndexSegment
and deleted index entries will therefore be propagated to the
new IndexSegment
(aka an incremental build).createTime
- The commit time associated with the view from which the
IndexSegment
is being generated. This value is written
into IndexSegmentCheckpoint.commitTime
.fromKey
- The lowest key that will be included (inclusive). When
null
there is no lower bound.toKey
- The first key that will be included (exclusive). When
null
there is no upper bound.IndexSegment
.IOException
protected static IndexSegmentBuilder newInstanceTwoPass(ILocalBTreeView src, File outFile, File tmpDir, int m, boolean compactingMerge, long createTime, byte[] fromKey, byte[] toKey, boolean bufferNodes) throws IOException
IOException
protected static IndexSegmentBuilder newInstanceFullyBuffered(ILocalBTreeView src, File outFile, File tmpDir, int m, boolean compactingMerge, long createTime, byte[] fromKey, byte[] toKey, boolean bufferNodes) throws IOException
TestIndexSegmentBuilderWithLargeTrees
but not yet for the other
test suite variants.IOException
public static IndexSegmentBuilder newInstance(Object[] a, int alen, IndexMetadata indexMetadata, File outFile, File tmpDir, int m, boolean compactingMerge, long createTime, boolean bufferNodes) throws IOException
IndexSegment
.a
- The array of objects to be written onto the index. The index
must know how to generate tuples from these objects. The
objects must already be in the natural order of the keys that
will be generated for those tuples.alen
- The #of elements in that array.indexMetadata
- The IndexMetadata
that will serve as the template for
the generated IndexSegment
.outFile
- The file on which the IndexSegment
will be written.
The file MAY exist, but if it exists then it MUST be empty.tmpDir
- The temporary directory in data are buffered during the build
(optional - the default temporary directory is used if this is
null
).m
- The branching factor for the generated IndexSegment
.compactingMerge
- When true
the caller asserts that src is a
FusedView
and deleted index entries WILL NOT be
included in the generated IndexSegment
. Otherwise, it
is assumed that the only select component(s) of the index
partition view are being exported onto an IndexSegment
and deleted index entries will therefore be propagated to the
new IndexSegment
(aka an incremental build).createTime
- The commit time associated with the view from which the
IndexSegment
is being generated. This value is written
into IndexSegmentCheckpoint.commitTime
.bufferNodes
- When true
the generated nodes will be fully
buffered in RAM (faster, but imposes a memory constraint).
Otherwise they will be written onto a temporary file and then
transferred to the output file en mass.IOException
- TODO We could pass a flag indicating whether the leaf needs
to be sorted after it is generated, but the caller would
still be responsible for ensuring that there are no
duplicates in the array.public static IndexSegmentBuilder newInstance(File outFile, File tmpDir, long entryCount, ITupleIterator<?> entryIterator, int m, IndexMetadata metadata, long commitTime, boolean compactingMerge, boolean bufferNodes) throws IOException
A more flexible factory for an IndexSegment
build which permits
override of the index segment branching factor, replacement of the
IndexMetadata
, and the use of the caller's iterator.
Note: The caller must determine whether or not deleted index entries are present in the view. The entryCount MUST be the exact #of index entries that are visited by the given iterator. In general, this is not difficult. However, if a compacting merge is desired (that is, if you are trying to generate a view containing only the non-deleted entries) then you MUST explicitly count the #of entries that will be visited by the iterator, e.g., it will require two passes over the iterator to setup the index build operation.
Note: With a branching factor of 4096 a tree of height 2 (three levels)
could address 68,719,476,736 entries - well beyond what we want in a
given index segment! Well before that the index segment should be split
into multiple files. The split point should be determined by the size of
the serialized leaves and nodes, e.g., the amount of data on disk
required by the index segment and the amount of memory required to fully
buffer the index nodes. While the size of a serialized node can be
estimated easily, the size of a serialized leaf depends on the kinds of
values stored in that index. The actual sizes are recorded in the
IndexSegmentCheckpoint
record in the header of the
IndexSegment
.
outFile
- The file on which the index segment is written. The file MAY
exist but MUST have zero length if it does exist (this permits
you to use the temporary file facility to create the output
file).tmpDir
- The temporary directory in data are buffered during the build
(optional - the default temporary directory is used if this is
null
).entryCount
- The #of entries that will be visited by the iterator. This
MUST be an exact range count.entryIterator
- Visits the index entries in key order that will be written
onto the IndexSegment
.m
- The branching factor for the generated tree. This can be
chosen with an eye to minimizing the height of the generated
tree. (Small branching factors are permitted for testing, but
generally you want something relatively large.)metadata
- The metadata record for the source index. A copy will be made
of this object. The branching factor in the generated tree
will be overridden to m.commitTime
- The commit time associated with the view from which the
IndexSegment
is being generated. This value is written
into IndexSegmentCheckpoint.commitTime
.compactingMerge
- true
iff the generated IndexSegment
will
incorporate all state for the source index (partition) as of
the specified commitTime. This flag is written into the
IndexSegmentCheckpoint
but does not otherwise effect
the build process.bufferNodes
- When true
the generated nodes will be fully
buffered in RAM (faster, but imposes a memory constraint).
Otherwise they will be written onto a temporary file and then
transferred to the output file en mass.IOException
public IndexSegmentCheckpoint call() throws Exception
IndexSegment
given the parameters specified to the
constructor.call
in interface Callable<IndexSegmentCheckpoint>
Exception
protected void buildBTree()
The plan tells us the #of values to insert into each leaf and the #of children to insert into each node. Each time a leaf becomes full (according to the plan), we "close" the leaf, writing it out onto the store and obtaining its "address". The "close" logic also takes care of setting the address on the leaf's parent node (if any). If the parent node becomes filled (according to the plan) then it is also "closed".
Each time (except the first) that we start a new leaf we record its first key as a separatorKey in the appropriate parent node.
Note: The root may be a leaf as a degenerate case.
protected void flushNodeOrLeaf(IndexSegmentBuilder.AbstractSimpleNodeData node)
Flush a node or leaf that has been closed (no more data will be added).
Note: When a node or leaf is flushed we write it out to obtain its
address and set that address on its direct parent using
#addChild(SimpleNodeData, long, AbstractSimpleNodeData, boolean)
.
This also updates the per-child counters of the #of entries spanned by a
node.
node
- The node to be flushed.protected void addChild(IndexSegmentBuilder.SimpleNodeData parent, long childAddr, IndexSegmentBuilder.AbstractSimpleNodeData child)
parent
- The parent.childAddr
- The address of the child (node or leaf).child
- The child reference.protected void resetNode(IndexSegmentBuilder.SimpleNodeData parent)
stack
contains nodes which are reused for each node or leaf
at a given level in the generated B+Tree. This method prepares a node in
the stack for reuse.protected void addSeparatorKey(IndexSegmentBuilder.SimpleLeafData leaf)
leaf
- The current leaf. The first key on that leaf must be defined.protected IndexSegmentBuilder.SimpleNodeData getParent(IndexSegmentBuilder.AbstractSimpleNodeData node)
stack
.node
- The node or leaf.null
iff node is the root node
or leaf.protected long writeNodeOrLeaf(IndexSegmentBuilder.AbstractSimpleNodeData node)
protected long writeLeaf(IndexSegmentBuilder.SimpleLeafData leaf)
Note: For leaf addresses we know the absolute offset into the
IndexSegmentStore
where the leaf will wind up so we encode the
address of the leaf using the IndexSegmentRegion.BASE
region.
Note: In order to write out the leaves using a double-linked list with
prior-/next-leaf addresses we have to use a "write behind" strategy.
Instead of writing out the leaf as soon as it is serialized, we save the
uncoded address and a copy of the coded data record on private member
fields. When we code the next leaf (or if we learn that we have no more
leaves to code because IndexSegmentPlan.nleaves
EQ
nleavesWritten
) then we patch the coded representation of the
prior leaf and write it on the store at the previously obtained address,
thereby linking the leaves together in both directions. It is definitely
confusing.
IndexSegmentStore
.protected long writeNode(IndexSegmentBuilder.SimpleNodeData node)
nodeBuffer
.IndexSegmentBuilder.SimpleNodeData.addr
.IndexSegmentBuilder.SimpleNodeData
,
IndexSegmentRegion
,
IndexSegmentAddressManager
protected IndexSegmentCheckpoint writeIndexSegment(FileChannel outChannel, long commitTime) throws IOException, InterruptedException
Writes the complete file format for the index segment. The file is divided up as follows:
IndexSegmentCheckpoint
record (required)IndexMetadata
record (required, but extensible)
The index segment metadata is divided into a base
IndexSegmentCheckpoint
record with a fixed format containing only
essential data and additional metadata records written at the end of the
file including the optional bloom filter and the required
IndexMetadata
record. The latter is where we write variable
length metadata including the _name_ of the index, or additional metadata
defined by a specific class of index.
Once all nodes and leaves have been buffered we are ready to start writing the data. We skip over a fixed size metadata record since otherwise we are unable to pre-compute the offset to the leaves and hence the addresses of the leaves. The node addresses are written in an encoding that requires active translation by the receiver who must be aware of the offset to the start of the node region. We can not write the metadata record until we know the size and length of each of these regions (leaves, nodes, and the bloom filter, or other metadata records) since that information is required in order to be able to form their addresses for insertion in the metadata record.
outChannel
- commitTime
- IOException
InterruptedException
- FIXME There is no sense of an atomic commit when building a
new index segment. We should write ZEROs into the checkpoint
record initially and then seek back to the head of the file
once we are done and write out the correct checkpoint record.
Note: There are similar issues involved when we replicate index segment or journal files to verify that they are good.
public IResourceMetadata getSegmentMetadata()
IndexSegment
resource.IllegalStateException
- if requested before the build operation is complete.protected static void usage(String[] args, String msg, int exitCode)
args
- The command line args.public static void main(String[] args) throws Exception
args
- [opts] journal [name]*
, where journal is
the name of the journal file, where name is the name of
a B+Tree registered on that journal, and where opts are
any of:
BuildEnum
..seg
extension). .Exception
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.