HTreeHashJoinUtility (Blazegraph Database Platform 2.1.5 API)

java.lang.Object
- com.bigdata.bop.join.HTreeHashJoinUtility

All Implemented Interfaces:

IHashJoinUtility

Direct Known Subclasses:

HTreePipelinedHashJoinUtility
```
public class HTreeHashJoinUtility
extends Object
implements IHashJoinUtility
```
Utility methods to support hash index builds and hash index joins using a scalable native memory data structures.
Vectoring and IV encoding
In order to provide efficient encoding and persistence of solutions on the HTree, this class is written directly to the RDF data model. Rather than POJO serialization, solutions are encoded as logical IV[]s in a manner very similar to how we represent the keys of the statement indices.
Since this encoding does not persist the cache, a separate mapping must be maintained from IV to BigdataValue for those IVs which have a materialized BigdataValue. TODO Do a 64-bit hash version which could be used for hash indices having more than 500M distinct join variable combinations. Note that at 500M distinct join variable combinations we have a 1 in 4 chance of a hash collision. Whether or not that turns into a cost really depends on the cardinality of the solutions per distinct combination of the join variables. If there is only one solution per join variable combination, then those collisions will cause basically no increase in the work to be done. However, if there are 50,000 solutions per distinct combination of the join variables then we would be better off using a 64-bit hash code.

Version:

$Id: HTreeHashJoinUtility.java 5568 2011-11-07 19:39:12Z thompsonbry

Author:

Bryan Thompson

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`HTreeHashJoinUtility.BS` Glue class for hash code and binding set used when the hash code is for just the join variables rather than the entire binding set.

Field Summary

Fields
Modifier and Type	Field and Description
`static IHashJoinUtilityFactory`	`factory` Singleton `IHashJoinUtilityFactory` that can be used to create a new `HTreeHashJoinUtility`.
`protected CAT`	`nJoinsConsidered` The #of solution pairs considered for a join.
`protected CAT`	`nleftConsidered` The #of left solutions considered for a join.
`protected CAT`	`nrightConsidered` The #of right solutions considered for a join.

Constructor Summary

Constructors
Constructor and Description

HTreeHashJoinUtility(IMemoryManager mmgr, PipelineOp op, JoinTypeEnum joinType)

Constructors
Constructor and Description
`HTreeHashJoinUtility(IMemoryManager mmgr, PipelineOp op, JoinTypeEnum joinType)`

Method Summary

Methods
Modifier and Type	Method and Description
`long`	`acceptSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats)` Buffer solutions on a hash index.
`protected IBindingSet`	`decodeSolution(ITuple<?> t)` Decode a solution from an encoded `IV`[].
`long`	`filterSolutions(ICloseableIterator<IBindingSet[]> itr, BOpStats stats, IBuffer<IBindingSet> sink)` Filter solutions, writing only the DISTINCT solutions onto the sink.
`IVariable<?>`	`getAskVar()` The variable bound based on whether or not a solution survives an "EXISTS" graph pattern (optional).
`IConstraint[]`	`getConstraints()` The join constraints (optional).
`protected IVBindingSetEncoder`	`getEncoder()`
`protected static HTreeIndexMetadata`	`getIndexMetadata(PipelineOp op)` Setup the `IndexMetadata` for `rightSolutions` or `joinSet`.
`protected HTree`	`getJoinSet()` The set of distinct source solutions which joined.
`protected long`	`getJoinSetSize()`
`JoinTypeEnum`	`getJoinType()` Return the type safe enumeration indicating what kind of operation is to be performed.
`IVariable<?>[]`	`getJoinVars()` The join variables.
`protected long`	`getNoJoinVarsLimit()`
`protected AtomicBoolean`	`getOpen()`
`protected boolean`	`getOutputDistintcJVs()`
`long`	`getRightSolutionCount()` Return the #of solutions in the hash index.
`protected HTree`	`getRightSolutions()` The hash index.
`IVariable<?>[]`	`getSelectVars()` The variables to be retained (optional, all variables are retained if not specified).
`IRawStore`	`getStore()` The backing `IRawStore`.
`void`	`hashJoin(ICloseableIterator<IBindingSet[]> leftItr, BOpStats stats, IBuffer<IBindingSet> outputBuffer)` Do a hash join between a stream of source solutions (left) and a hash index (right).
`void`	`hashJoin2(ICloseableIterator<IBindingSet[]> leftItr, BOpStats stats, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints)` Variant hash join method allows the caller to impose different constraints or additional constraints.
`ICloseableIterator<IBindingSet>`	`indexScan()` Return an `BytesTrie.Iterator` that visits all solutions in the index (index scan).
`boolean`	`isEmpty()` Return `true` iff there are no solutions in the hash index.
`boolean`	`isOutputDistinctJoinVars()` Returns true if the projection outputs the distinct join vars (in that case, the variables delivered by {`IHashJoinUtility.getSelectVars()` will be ignored, might even be uninitialized).
`void`	`mergeJoin(IHashJoinUtility[] others, IBuffer<IBindingSet> outputBuffer, IConstraint[] constraints, boolean optional)` Perform an N-way merge join.
`void`	`outputJoinSet(IBuffer<IBindingSet> out)` Output the solutions which joined.
`void`	`outputOptionals(IBuffer<IBindingSet> outputBuffer)` Identify and output the optional solutions.
`void`	`outputSolutions(IBuffer<IBindingSet> out)` Output the solutions buffered in the hash index.
`void`	`release()` Discard the hash index.
`protected void`	`saveInJoinSet(int joinSetHashCode, byte[] val)` Add to 2nd hash tree of all solutions which join.
`void`	`saveSolutionSet()` Checkpoint the generated hash index such that it becomes safe for concurrent readers.
`String`	`toString()` Human readable representation of the `IHashJoinUtility` metadata (but not the solutions themselves).
`protected HTreeHashJoinUtility.BS[]`	`vector(IBindingSet[] leftSolutions, IVariable<?>[] joinVars, IVariable<?>[] selectVars, boolean ignoreUnboundVariables, AtomicInteger vectorSize)` Vector a chunk of solutions.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - factory
```
public static final IHashJoinUtilityFactory factory
```
    Singleton IHashJoinUtilityFactory that can be used to create a new HTreeHashJoinUtility.
  - nleftConsidered
```
protected final CAT nleftConsidered
```
    The #of left solutions considered for a join.
  - nrightConsidered
```
protected final CAT nrightConsidered
```
    The #of right solutions considered for a join.
  - nJoinsConsidered
```
protected final CAT nJoinsConsidered
```
    The #of solution pairs considered for a join.
- Constructor Detail
  - HTreeHashJoinUtility
```
public HTreeHashJoinUtility(IMemoryManager mmgr,
                    PipelineOp op,
                    JoinTypeEnum joinType)
```
    Parameters:
    mmgr - The IMemoryManager which will back the named solution set.
    op - The operator whose annotation will inform construction the hash index. The HTreeAnnotations may be specified for this operator and will control the initialization of the various HTree instances.
    joinType - The type of join to be performed.
    See Also:
    HTreeHashJoinAnnotations
- Method Detail
  - getOpen
```
protected AtomicBoolean getOpen()
```
  - getEncoder
```
protected IVBindingSetEncoder getEncoder()
```
  - getNoJoinVarsLimit
```
protected long getNoJoinVarsLimit()
```
  - getOutputDistintcJVs
```
protected boolean getOutputDistintcJVs()
```
  - getRightSolutions
```
protected HTree getRightSolutions()
```
    The hash index.
  - getJoinSet
```
protected HTree getJoinSet()
```
    The set of distinct source solutions which joined. This set is maintained iff the join is optional and is null otherwise.
  - toString
```
public String toString()
```
    Human readable representation of the IHashJoinUtility metadata (but not the solutions themselves).
    
    Overrides:
    
    toString in class Object
  - isEmpty
```
public boolean isEmpty()
```
    Description copied from interface: IHashJoinUtility
    
    Return true iff there are no solutions in the hash index.
    
    Specified by:
    
    isEmpty in interface IHashJoinUtility
  - getRightSolutionCount
```
public long getRightSolutionCount()
```
    Description copied from interface: IHashJoinUtility
    
    Return the #of solutions in the hash index.
    
    Specified by:
    
    getRightSolutionCount in interface IHashJoinUtility
  - getJoinSetSize
```
protected long getJoinSetSize()
```
  - getJoinType
```
public JoinTypeEnum getJoinType()
```
    Description copied from interface: IHashJoinUtility
    
    Return the type safe enumeration indicating what kind of operation is to be performed.
    
    Specified by:
    
    getJoinType in interface IHashJoinUtility
  - getAskVar
```
public IVariable<?> getAskVar()
```
    Description copied from interface: IHashJoinUtility
    
    The variable bound based on whether or not a solution survives an "EXISTS" graph pattern (optional).
    
    Specified by:
    
    getAskVar in interface IHashJoinUtility
    
    See Also:
    HashJoinAnnotations.ASK_VAR
  - getJoinVars
```
public IVariable<?>[] getJoinVars()
```
    Description copied from interface: IHashJoinUtility
    
    The join variables.
    
    Specified by:
    
    getJoinVars in interface IHashJoinUtility
    
    See Also:
    HashJoinAnnotations.JOIN_VARS
  - getSelectVars
```
public IVariable<?>[] getSelectVars()
```
    Description copied from interface: IHashJoinUtility
    
    The variables to be retained (optional, all variables are retained if not specified).
    
    Specified by:
    
    getSelectVars in interface IHashJoinUtility
    
    See Also:
    JoinAnnotations.SELECT
  - isOutputDistinctJoinVars
```
public boolean isOutputDistinctJoinVars()
```
    Description copied from interface: IHashJoinUtility
    
    Returns true if the projection outputs the distinct join vars (in that case, the variables delivered by {IHashJoinUtility.getSelectVars() will be ignored, might even be uninitialized). See HashJoinAnnotations.OUTPUT_DISTINCT_JVs.
    
    Specified by:
    
    isOutputDistinctJoinVars in interface IHashJoinUtility
  - getConstraints
```
public IConstraint[] getConstraints()
```
    Description copied from interface: IHashJoinUtility
    
    The join constraints (optional).
    
    Specified by:
    
    getConstraints in interface IHashJoinUtility
    
    See Also:
    JoinAnnotations.CONSTRAINTS
  - getIndexMetadata
```
protected static HTreeIndexMetadata getIndexMetadata(PipelineOp op)
```
    Setup the IndexMetadata for rightSolutions or joinSet.
  - getStore
```
public IRawStore getStore()
```
    The backing IRawStore.
  - saveSolutionSet
```
public void saveSolutionSet()
```
    Checkpoint the generated hash index such that it becomes safe for concurrent readers.
    This implementation checkpoints the HTree instance(s) used to buffer the source solutions (rightSolutions and #ivCache ) and then re-load the them in a read-only mode from their checkpoint(s). This exposes a view of the HTree which is safe for concurrent readers.
    
    Specified by:
    
    saveSolutionSet in interface IHashJoinUtility
  - release
```
public void release()
```
    Description copied from interface: IHashJoinUtility
    
    Discard the hash index.
    
    Specified by:
    
    release in interface IHashJoinUtility
  - acceptSolutions
```
public long acceptSolutions(ICloseableIterator<IBindingSet[]> itr,
                   BOpStats stats)
```
    Description copied from interface: IHashJoinUtility
    
    Buffer solutions on a hash index.
    When optional:=true, solutions which do not have a binding for one or more of the join variables will be inserted into the hash index anyway using hashCode:=1. This allows the solutions to be discovered when we scan the hash index and the set of solutions which did join to identify the optional solutions.
    
    Specified by:
    
    acceptSolutions in interface IHashJoinUtility
    
    Parameters:
    itr - The source from which the solutions will be drained.
    stats - The statistics to be updated as the solutions are buffered on the hash index.
    
    Returns:
    The #of solutions that were buffered.
  - filterSolutions
```
public long filterSolutions(ICloseableIterator<IBindingSet[]> itr,
                   BOpStats stats,
                   IBuffer<IBindingSet> sink)
```
    Description copied from interface: IHashJoinUtility
    
    Filter solutions, writing only the DISTINCT solutions onto the sink.
    
    Specified by:
    
    filterSolutions in interface IHashJoinUtility
    
    Parameters:
    itr - The source solutions.
    stats - The stats to be updated.
    sink - The sink.
    
    Returns:
    The #of source solutions which pass the filter.
  - decodeSolution
```
protected IBindingSet decodeSolution(ITuple<?> t)
```
    Decode a solution from an encoded IV[].
    Note: The IVCache associated are NOT resolved by this method. The resolution step is relatively expensive since it must do lookups in persistence capable data structures. The caller MUST use IBindingSetDecoder.resolveCachedValues(IBindingSet) to resolve the IVCache associations once they decide that the decoded solution can join.
    Note: This instance method is required by the MERGE JOIN logic which associates the schema with the first IHashJoinUtility instance.
    
    Parameters:
    t - A tuple whose value is an encoded IV[].
    
    Returns:
    The decoded IBindingSet.
  - hashJoin
```
public void hashJoin(ICloseableIterator<IBindingSet[]> leftItr,
            BOpStats stats,
            IBuffer<IBindingSet> outputBuffer)
```
    Description copied from interface: IHashJoinUtility
    
    Do a hash join between a stream of source solutions (left) and a hash index (right). For each left solution, the hash index (right) is probed for possible matches (solutions whose as-bound values for the join variables produce the same hash code). Possible matches are tested for consistency and the constraints (if any) are applied. Solutions which join are written on the caller's buffer.
    Note: Some JoinTypeEnums have side-effects on the join state. For this joins, once method has been invoked for the final time, you must then invoke either IHashJoinUtility.outputOptionals(IBuffer) (Optional or NotExists) or IHashJoinUtility.outputJoinSet(IBuffer) (Exists).
    
    Specified by:
    
    hashJoin in interface IHashJoinUtility
    
    Parameters:
    leftItr - A stream of chunks of solutions to be joined against the hash index (left).
    stats - The statistics to be updated as solutions are drained from the leftItr (optional). When left is the pipeline, BOpStats.chunksIn and BOpStats.unitsIn should be updated by passing in the BOpStats object. When left is a hash index (i.e., for a hash join against an access path), you should pass null since the chunksIn and unitsIn are updated as the HashIndexOp builds the hash index rather than when it executes the join against the access path).
    outputBuffer - Where to write the solutions which join.
  - hashJoin2
```
public void hashJoin2(ICloseableIterator<IBindingSet[]> leftItr,
             BOpStats stats,
             IBuffer<IBindingSet> outputBuffer,
             IConstraint[] constraints)
```
    Description copied from interface: IHashJoinUtility
    
    Variant hash join method allows the caller to impose different constraints or additional constraints. This is used to impose join constraints when a solution set is joined back into a query based on the join filters in the join group in which the solution set is included.
    Note: Some JoinTypeEnums have side-effects on the join state. For this joins, once method has been invoked for the final time, you must then invoke either IHashJoinUtility.outputOptionals(IBuffer) (Optional or NotExists) or IHashJoinUtility.outputJoinSet(IBuffer) (Exists).
    
    Specified by:
    
    hashJoin2 in interface IHashJoinUtility
    
    Parameters:
    leftItr - A stream of chunks of solutions to be joined against the hash index (left).
    stats - The statistics to be updated as solutions are drained from the leftItr.
    outputBuffer - Where to write the solutions which join.
    constraints - Constraints attached to this join (optional). Any constraints specified here are combined with those specified in the constructor.
  - vector
```
protected HTreeHashJoinUtility.BS[] vector(IBindingSet[] leftSolutions,
                               IVariable<?>[] joinVars,
                               IVariable<?>[] selectVars,
                               boolean ignoreUnboundVariables,
                               AtomicInteger vectorSize)
```
    Vector a chunk of solutions.
    
    Parameters:
    leftSolutions - The solutions.
    joinVars - The variables on which the hash code will be computed.
    selectVars - When non-null, all other variables are dropped. (This is used when we are modeling a DISTINCT solutions filter since we need to drop anything which is not part of the DISTINCT variables list.)
    ignoreUnboundVariables - When true, an unbound variable will not cause a JoinVariableNotBoundException to be thrown.
    vectorSize - The vector size (set by side-effect). This will be LTE the number of solutions in leftSolutions. (If some solutions are eliminated because they lack a binding for a required join variable, then vectorSize is LT the number of leftSolutions).
    
    Returns:
    The vectored chunk of solutions ordered by hash code.
  - saveInJoinSet
```
protected void saveInJoinSet(int joinSetHashCode,
                 byte[] val)
```
    Add to 2nd hash tree of all solutions which join.
    Note: the hash key is based on the entire solution (not just the join variables). The values are the full encoded IBindingSet.
  - outputOptionals
```
public void outputOptionals(IBuffer<IBindingSet> outputBuffer)
```
    Description copied from interface: IHashJoinUtility
    
    Identify and output the optional solutions. This is used with OPTIONAL and NOT EXISTS.
    Optionals are identified using a joinSet containing each right solution which joined with at least one left solution. The total set of right solutions is then scanned once. For each right solution, we probe the joinSet. If the right solution did not join, then it is output now as an optional join.
    
    Specified by:
    
    outputOptionals in interface IHashJoinUtility
    
    Parameters:
    outputBuffer - Where to write the optional solutions.
  - indexScan
```
public ICloseableIterator<IBindingSet> indexScan()
```
    Description copied from interface: IHashJoinUtility
    
    Return an BytesTrie.Iterator that visits all solutions in the index (index scan). The visited solutions MAY contain variables that would not be projected out of the hash join.
    Note: This is very nearly the same as IHashJoinUtility.outputSolutions(IBuffer) except that the latter only outputs the projected variables and it writes onto an IBuffer rather than returning an ICloseableIterator.
    
    Specified by:
    
    indexScan in interface IHashJoinUtility
    
    Returns:
    The BytesTrie.Iterator.
  - outputSolutions
```
public void outputSolutions(IBuffer<IBindingSet> out)
```
    Description copied from interface: IHashJoinUtility
    
    Output the solutions buffered in the hash index. This is used when an operator is building a hash index for use by a downstream operator.
    
    Specified by:
    
    outputSolutions in interface IHashJoinUtility
    
    Parameters:
    out - Where to write the solutions.
  - outputJoinSet
```
public void outputJoinSet(IBuffer<IBindingSet> out)
```
    Description copied from interface: IHashJoinUtility
    
    Output the solutions which joined. This is used with EXISTS.
    
    Specified by:
    
    outputJoinSet in interface IHashJoinUtility
    
    Parameters:
    out - Where to write the solutions.
  - mergeJoin
```
public void mergeJoin(IHashJoinUtility[] others,
             IBuffer<IBindingSet> outputBuffer,
             IConstraint[] constraints,
             boolean optional)
```
    Perform an N-way merge join. For an OPTIONAL join, this instance is understood to be the index having the "required" solutions.
    The merge join takes a set of solution sets in the some order and having the same join variables. It examines the next solution in order for each solution set and compares them. For each solution set which reported a solution having the same join variables as that earliest solution, it outputs the cross product and advances the iterator on that solution set.
    The iterators draining the source solution sets need to be synchronized such that we consider only solutions having the same hash code in each cycle of the MERGE JOIN. The synchronization step is different depending on whether or not the MERGE JOIN is OPTIONAL.
    If the MERGE JOIN is REQUIRED, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which they all have in common.
    If the MERGE JOIN is OPTIONAL, then we want to synchronize the source solution iterators on the next lowest key (aka hash code) which appears for any source iterator. Solutions will not be drawn from iterators not having that key in that pass.
    Note that each hash code may be an alias for solutions having different values for their join variables. Such solutions will not join. However, only solutions having the same values for the hash code can join. Thus, by proceeding with synchronized iterators and operating only on solutions having the same hash code in each round, we will consider all solutions which COULD join with one another in each round.
    Note: If the solutions are not in a stable and mutually consistent order by hash code in the hash indices then the solutions in each hash index MUST be SORTED before proceeding. (The HTree maintains solutions in such an order but the JVM collections do not.)
    Note: For the HTree, the entries are in key order. Those keys are hash codes computed from the solutions using the join variables. Since the keys are hash codes and not the join variable bindings, each hash code identifies a collision bucket from the perspective of the merge join algorithm. Of course, from the perspective of the HTree those solutions are just consequective tuples readily identified using HTree.lookupAll(int). FIXME Either always project everything or raise [select] into a parameter for this method. We DO NOT want to only project whatever was projected by the first source.
    
    Specified by:
    
    mergeJoin in interface IHashJoinUtility
    
    Parameters:
    others - The other solution sets to be joined. All instances must be of the same concrete type as this.
    outputBuffer - Where to write the solutions.
    constraints - The join constraints.
    optional - true iff the join is optional.

Class HTreeHashJoinUtility

Vectoring and IV encoding

Nested Class Summary

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Detail

factory

nleftConsidered

nrightConsidered

nJoinsConsidered

Constructor Detail

HTreeHashJoinUtility

Method Detail

getOpen

getEncoder

getNoJoinVarsLimit

getOutputDistintcJVs

getRightSolutions

getJoinSet

toString

isEmpty

getRightSolutionCount

getJoinSetSize

getJoinType

getAskVar

getJoinVars

getSelectVars

isOutputDistinctJoinVars

getConstraints

getIndexMetadata

getStore

saveSolutionSet

release

acceptSolutions

filterSolutions

decodeSolution

hashJoin

hashJoin2

vector

saveInJoinSet

outputOptionals

indexScan

outputSolutions

outputJoinSet

mergeJoin