public class HTreeHashJoinUtility extends Object implements IHashJoinUtility
HTree
, this class is written directly to the RDF data model. Rather
than POJO serialization, solutions are encoded as logical IV
[]s in a
manner very similar to how we represent the keys of the statement indices.
Since this encoding does not persist the cache
, a
separate mapping must be maintained from IV
to BigdataValue
for those IV
s which have a materialized BigdataValue
.
TODO Do a 64-bit hash version which could be used for hash indices having
more than 500M distinct join variable combinations. Note that at 500M
distinct join variable combinations we have a 1 in 4 chance of a hash
collision. Whether or not that turns into a cost really depends on the
cardinality of the solutions per distinct combination of the join variables.
If there is only one solution per join variable combination, then those
collisions will cause basically no increase in the work to be done. However,
if there are 50,000 solutions per distinct combination of the join variables
then we would be better off using a 64-bit hash code.
Modifier and Type | Class and Description |
---|---|
static class |
HTreeHashJoinUtility.BS
Glue class for hash code and binding set used when the hash code is for
just the join variables rather than the entire binding set.
|
Modifier and Type | Field and Description |
---|---|
static IHashJoinUtilityFactory |
factory
Singleton
IHashJoinUtilityFactory that can be used to create a
new HTreeHashJoinUtility . |
protected CAT |
nJoinsConsidered
The #of solution pairs considered for a join.
|
protected CAT |
nleftConsidered
The #of left solutions considered for a join.
|
protected CAT |
nrightConsidered
The #of right solutions considered for a join.
|
Constructor and Description |
---|
HTreeHashJoinUtility(IMemoryManager mmgr,
PipelineOp op,
JoinTypeEnum joinType) |
Modifier and Type | Method and Description |
---|---|
long |
acceptSolutions(ICloseableIterator<IBindingSet[]> itr,
BOpStats stats)
Buffer solutions on a hash index.
|
protected IBindingSet |
decodeSolution(ITuple<?> t)
Decode a solution from an encoded
IV []. |
long |
filterSolutions(ICloseableIterator<IBindingSet[]> itr,
BOpStats stats,
IBuffer<IBindingSet> sink)
Filter solutions, writing only the DISTINCT solutions onto the sink.
|
IVariable<?> |
getAskVar()
The variable bound based on whether or not a solution survives an
"EXISTS" graph pattern (optional).
|
IConstraint[] |
getConstraints()
The join constraints (optional).
|
protected IVBindingSetEncoder |
getEncoder() |
protected static HTreeIndexMetadata |
getIndexMetadata(PipelineOp op)
|
protected HTree |
getJoinSet()
The set of distinct source solutions which joined.
|
protected long |
getJoinSetSize() |
JoinTypeEnum |
getJoinType()
Return the type safe enumeration indicating what kind of operation is to
be performed.
|
IVariable<?>[] |
getJoinVars()
The join variables.
|
protected long |
getNoJoinVarsLimit() |
protected AtomicBoolean |
getOpen() |
protected boolean |
getOutputDistintcJVs() |
long |
getRightSolutionCount()
Return the #of solutions in the hash index.
|
protected HTree |
getRightSolutions()
The hash index.
|
IVariable<?>[] |
getSelectVars()
The variables to be retained (optional, all variables are retained if
not specified).
|
IRawStore |
getStore()
The backing
IRawStore . |
void |
hashJoin(ICloseableIterator<IBindingSet[]> leftItr,
BOpStats stats,
IBuffer<IBindingSet> outputBuffer)
Do a hash join between a stream of source solutions (left) and a hash
index (right).
|
void |
hashJoin2(ICloseableIterator<IBindingSet[]> leftItr,
BOpStats stats,
IBuffer<IBindingSet> outputBuffer,
IConstraint[] constraints)
Variant hash join method allows the caller to impose different
constraints or additional constraints.
|
ICloseableIterator<IBindingSet> |
indexScan()
Return an
BytesTrie.Iterator that visits all solutions in the index (index
scan). |
boolean |
isEmpty()
Return
true iff there are no solutions in the hash index. |
boolean |
isOutputDistinctJoinVars()
Returns true if the projection outputs the distinct join vars (in
that case, the variables delivered by {
IHashJoinUtility.getSelectVars() will
be ignored, might even be uninitialized). |
void |
mergeJoin(IHashJoinUtility[] others,
IBuffer<IBindingSet> outputBuffer,
IConstraint[] constraints,
boolean optional)
Perform an N-way merge join.
|
void |
outputJoinSet(IBuffer<IBindingSet> out)
Output the solutions which joined.
|
void |
outputOptionals(IBuffer<IBindingSet> outputBuffer)
Identify and output the optional solutions.
|
void |
outputSolutions(IBuffer<IBindingSet> out)
Output the solutions buffered in the hash index.
|
void |
release()
Discard the hash index.
|
protected void |
saveInJoinSet(int joinSetHashCode,
byte[] val)
Add to 2nd hash tree of all solutions which join.
|
void |
saveSolutionSet()
Checkpoint the generated hash index such that it becomes safe for
concurrent readers.
|
String |
toString()
Human readable representation of the
IHashJoinUtility metadata
(but not the solutions themselves). |
protected HTreeHashJoinUtility.BS[] |
vector(IBindingSet[] leftSolutions,
IVariable<?>[] joinVars,
IVariable<?>[] selectVars,
boolean ignoreUnboundVariables,
AtomicInteger vectorSize)
Vector a chunk of solutions.
|
public static final IHashJoinUtilityFactory factory
IHashJoinUtilityFactory
that can be used to create a
new HTreeHashJoinUtility
.protected final CAT nleftConsidered
protected final CAT nrightConsidered
protected final CAT nJoinsConsidered
public HTreeHashJoinUtility(IMemoryManager mmgr, PipelineOp op, JoinTypeEnum joinType)
mmgr
- The IMemoryManager which will back the named solution set.op
- The operator whose annotation will inform construction the
hash index. The HTreeAnnotations
may be specified for
this operator and will control the initialization of the
various HTree
instances.joinType
- The type of join to be performed.HTreeHashJoinAnnotations
protected AtomicBoolean getOpen()
protected IVBindingSetEncoder getEncoder()
protected long getNoJoinVarsLimit()
protected boolean getOutputDistintcJVs()
protected HTree getRightSolutions()
protected HTree getJoinSet()
null
otherwise.public String toString()
IHashJoinUtility
metadata
(but not the solutions themselves).public boolean isEmpty()
IHashJoinUtility
true
iff there are no solutions in the hash index.isEmpty
in interface IHashJoinUtility
public long getRightSolutionCount()
IHashJoinUtility
getRightSolutionCount
in interface IHashJoinUtility
protected long getJoinSetSize()
public JoinTypeEnum getJoinType()
IHashJoinUtility
getJoinType
in interface IHashJoinUtility
public IVariable<?> getAskVar()
IHashJoinUtility
getAskVar
in interface IHashJoinUtility
HashJoinAnnotations.ASK_VAR
public IVariable<?>[] getJoinVars()
IHashJoinUtility
getJoinVars
in interface IHashJoinUtility
HashJoinAnnotations.JOIN_VARS
public IVariable<?>[] getSelectVars()
IHashJoinUtility
getSelectVars
in interface IHashJoinUtility
JoinAnnotations.SELECT
public boolean isOutputDistinctJoinVars()
IHashJoinUtility
IHashJoinUtility.getSelectVars()
will
be ignored, might even be uninitialized). See
HashJoinAnnotations.OUTPUT_DISTINCT_JVs
.isOutputDistinctJoinVars
in interface IHashJoinUtility
public IConstraint[] getConstraints()
IHashJoinUtility
getConstraints
in interface IHashJoinUtility
JoinAnnotations.CONSTRAINTS
protected static HTreeIndexMetadata getIndexMetadata(PipelineOp op)
public void saveSolutionSet()
This implementation checkpoints the HTree
instance(s) used to
buffer the source solutions (rightSolutions
and #ivCache
) and then re-load the them in a read-only mode from their checkpoint(s).
This exposes a view of the HTree
which is safe for concurrent
readers.
saveSolutionSet
in interface IHashJoinUtility
public void release()
IHashJoinUtility
release
in interface IHashJoinUtility
public long acceptSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats)
IHashJoinUtility
When optional:=true
, solutions which do not have a binding
for one or more of the join variables will be inserted into the hash
index anyway using hashCode:=1
. This allows the solutions to
be discovered when we scan the hash index and the set of solutions which
did join to identify the optional solutions.
acceptSolutions
in interface IHashJoinUtility
itr
- The source from which the solutions will be drained.stats
- The statistics to be updated as the solutions are buffered on
the hash index.public long filterSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats, IBuffer<IBindingSet> sink)
IHashJoinUtility
filterSolutions
in interface IHashJoinUtility
itr
- The source solutions.stats
- The stats to be updated.sink
- The sink.protected IBindingSet decodeSolution(ITuple<?> t)
IV
[].
Note: The IVCache
associated are NOT resolved by this method. The
resolution step is relatively expensive since it must do lookups in
persistence capable data structures. The caller MUST use
IBindingSetDecoder.resolveCachedValues(IBindingSet)
to resolve
the IVCache
associations once they decide that the decoded
solution can join.
Note: This instance method is required by the MERGE JOIN logic which
associates the schema with the first IHashJoinUtility
instance.
t
- A tuple whose value is an encoded IV
[].IBindingSet
.public void hashJoin(ICloseableIterator<IBindingSet[]> leftItr, BOpStats stats, IBuffer<IBindingSet> outputBuffer)
IHashJoinUtility
Note: Some JoinTypeEnum
s have side-effects on the join state. For
this joins, once method has been invoked for the final time, you must
then invoke either IHashJoinUtility.outputOptionals(IBuffer)
(Optional or
NotExists) or IHashJoinUtility.outputJoinSet(IBuffer)
(Exists).
hashJoin
in interface IHashJoinUtility
leftItr
- A stream of chunks of solutions to be joined against the hash
index (left).stats
- The statistics to be updated as solutions are drained from the
leftItr (optional). When left
is the
pipeline, BOpStats.chunksIn
and
BOpStats.unitsIn
should be updated by passing in the
BOpStats
object. When left
is a hash
index (i.e., for a hash join against an access path), you
should pass null
since the chunksIn and unitsIn
are updated as the HashIndexOp
builds the hash index
rather than when it executes the join against the access
path).outputBuffer
- Where to write the solutions which join.public void hashJoin2(ICloseableIterator<IBindingSet[]> leftItr, BOpStats stats, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints)
IHashJoinUtility
Note: Some JoinTypeEnum
s have side-effects on the join state. For
this joins, once method has been invoked for the final time, you must
then invoke either IHashJoinUtility.outputOptionals(IBuffer)
(Optional or
NotExists) or IHashJoinUtility.outputJoinSet(IBuffer)
(Exists).
hashJoin2
in interface IHashJoinUtility
leftItr
- A stream of chunks of solutions to be joined against the hash
index (left).stats
- The statistics to be updated as solutions are drained from the
leftItr.outputBuffer
- Where to write the solutions which join.constraints
- Constraints attached to this join (optional). Any constraints
specified here are combined with those specified in the
constructor.protected HTreeHashJoinUtility.BS[] vector(IBindingSet[] leftSolutions, IVariable<?>[] joinVars, IVariable<?>[] selectVars, boolean ignoreUnboundVariables, AtomicInteger vectorSize)
leftSolutions
- The solutions.joinVars
- The variables on which the hash code will be computed.selectVars
- When non-null
, all other variables are dropped.
(This is used when we are modeling a DISTINCT solutions filter
since we need to drop anything which is not part of the
DISTINCT variables list.)ignoreUnboundVariables
- When true
, an unbound variable will not cause a
JoinVariableNotBoundException
to be thrown.vectorSize
- The vector size (set by side-effect). This will be LTE the
number of solutions in leftSolutions
. (If some
solutions are eliminated because they lack a binding for a
required join variable, then vectorSize is LT the number of
leftSolutions
).protected void saveInJoinSet(int joinSetHashCode, byte[] val)
Note: the hash key is based on the entire solution (not just the join
variables). The values are the full encoded IBindingSet
.
public void outputOptionals(IBuffer<IBindingSet> outputBuffer)
IHashJoinUtility
Optionals are identified using a joinSet containing each right solution which joined with at least one left solution. The total set of right solutions is then scanned once. For each right solution, we probe the joinSet. If the right solution did not join, then it is output now as an optional join.
outputOptionals
in interface IHashJoinUtility
outputBuffer
- Where to write the optional solutions.public ICloseableIterator<IBindingSet> indexScan()
IHashJoinUtility
BytesTrie.Iterator
that visits all solutions in the index (index
scan). The visited solutions MAY contain variables that would not be
projected out of the hash join.
Note: This is very nearly the same as IHashJoinUtility.outputSolutions(IBuffer)
except that the latter only outputs the projected variables and it writes
onto an IBuffer
rather than returning an
ICloseableIterator
.
indexScan
in interface IHashJoinUtility
BytesTrie.Iterator
.public void outputSolutions(IBuffer<IBindingSet> out)
IHashJoinUtility
outputSolutions
in interface IHashJoinUtility
out
- Where to write the solutions.public void outputJoinSet(IBuffer<IBindingSet> out)
IHashJoinUtility
outputJoinSet
in interface IHashJoinUtility
out
- Where to write the solutions.public void mergeJoin(IHashJoinUtility[] others, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints, boolean optional)
The merge join takes a set of solution sets in the some order and having the same join variables. It examines the next solution in order for each solution set and compares them. For each solution set which reported a solution having the same join variables as that earliest solution, it outputs the cross product and advances the iterator on that solution set.
The iterators draining the source solution sets need to be synchronized such that we consider only solutions having the same hash code in each cycle of the MERGE JOIN. The synchronization step is different depending on whether or not the MERGE JOIN is OPTIONAL.
If the MERGE JOIN is REQUIRED, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which they all have in common.
If the MERGE JOIN is OPTIONAL, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which appears for any source iterator. Solutions will not be drawn from iterators not having that key in that pass.
Note that each hash code may be an alias for solutions having different values for their join variables. Such solutions will not join. However, only solutions having the same values for the hash code can join. Thus, by proceeding with synchronized iterators and operating only on solutions having the same hash code in each round, we will consider all solutions which COULD join with one another in each round.
Note: If the solutions are not in a stable and mutually consistent order
by hash code in the hash indices then the solutions in each hash index
MUST be SORTED before proceeding. (The HTree
maintains solutions
in such an order but the JVM collections do not.)
Note: For the HTree
, the entries are in key order. Those keys are
hash codes computed from the solutions using the join variables. Since
the keys are hash codes and not the join variable bindings, each hash
code identifies a collision bucket from the perspective of the merge join
algorithm. Of course, from the perspective of the HTree
those
solutions are just consequective tuples readily identified using
HTree.lookupAll(int)
.
FIXME Either always project everything or raise [select] into a parameter
for this method. We DO NOT want to only project whatever was projected by
the first source.
mergeJoin
in interface IHashJoinUtility
others
- The other solution sets to be joined. All instances must be of
the same concrete type as this.outputBuffer
- Where to write the solutions.constraints
- The join constraints.optional
- true
iff the join is optional.Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.