public class HashCollisionUtility extends Object
Test data: /nas/data/lubm/U1/data/University0/, /nas/data/bsbm/bsbm_2785/dataset.nt.gz, /nas/data/bsbm/bsbm_566496/dataset.nt.gz, /data/bsbm3_200m_1MSplits, an 8B triple bioinformatics data set, and BTC data (some very large literals).

TODO: Order preserving hash codes could be interesting here. Look at 32 and 64 bit variants of the math and at generalized order preserving hash codes. With order preserving hash codes, it makes sense to insert all Unicode terms into TERM2ID such that we have a total order there.

TODO: Benchmark the load time with different hash codes. The cost of the hash computation and the randomness of the distribution will both play a role. The B+Tree will need to be set up with a sufficient [writeRetentionQueue] and we will need to specify [-server -Xmx1G].

SHA-256: no collisions on BSBM 200M (30G file). Time?

32-bit hash codes: #collisions=1544132 Elapsed: 16656445ms Journal size: 23841341440 bytes (23G)

Now limiting the size of values in a leaf and also increasing the branching factor to 512 (was 32). [The current run is scanning after the initial insert, which involves a little wasted effort. It was also without the -server -Xmx2g and write retention queue parameters. Finally, it was serializing BigdataValue objects, including their IV, rather than RDF Value objects. The code has since been modified to serialize just the BigdataValue. Also, I've since raised the initial extent from 10M to 200M.] maxCollisions=3, Elapsed: 22579073ms Journal size: 35270950912 bytes

Now buffering 100k values at a time: 2x faster.
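To make the keying scheme in these notes concrete, here is a minimal sketch (not the utility's actual code) that builds a lexicon key as the SHA-256 digest of the term's string form followed by a one-byte collision counter used to break ties between distinct values that hash alike. The class and method names are hypothetical; only the MessageDigest API is the real JDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Illustrative sketch: build a B+Tree key as hash(value) + a one-byte
 * collision counter. Hypothetical code, not the HashCollisionUtility API.
 */
public class HashKeyExample {

    /** Hash the value's string form with SHA-256. */
    static byte[] hash(final String value) throws NoSuchAlgorithmException {
        final MessageDigest md = MessageDigest.getInstance("SHA-256");
        return md.digest(value.getBytes(StandardCharsets.UTF_8));
    }

    /** Append a one-byte counter used to break ties on a hash collision. */
    static byte[] key(final byte[] hash, final int collisionCounter) {
        if (collisionCounter > 255)
            throw new IllegalStateException("Too many collisions");
        final byte[] key = new byte[hash.length + 1];
        System.arraycopy(hash, 0, key, 0, hash.length);
        key[hash.length] = (byte) collisionCounter;
        return key;
    }

    public static void main(String[] args) throws Exception {
        final byte[] h = hash("http://example.org/someTerm");
        // First occurrence gets counter 0; a colliding distinct value gets 1, etc.
        System.out.println(key(h, 0).length + " byte key");
    }
}
```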
U1:

Elapsed: 23379ms NumStatements: 1000313 NumDistinctVals: 291259 TotalKeyBytes: 1747554 TotalValBytes: 60824514 MaxCollisions: 1 TotalCollisions: 6 Journal size: 209715200 bytes

name  m     height  nnodes  nleaves  nodeBytes  leafBytes  totalBytes  avgNodeBytes  avgLeafBytes  minNodeBytes  maxNodeBytes  minLeafBytes  maxLeafBytes
lex   1024  1       1       474      7913       3662623    3670536     7913          7727          7913          7913          5786          13784

With only a byte (versus short) counter in the key. Oddly, this has no impact on the average leaf size. That suggests that the keys in the leaves are very sparse in terms of the hash code space, such that prefix compression is not really doing that much for us.
Elapsed: 23235ms NumStatements: 1000313 NumDistinctVals: 291259 TotalKeyBytes: 1456295 TotalValBytes: 60824514 MaxCollisions: 1 TotalCollisions: 6 Journal size: 209715200 bytes

name  m     height  nnodes  nleaves  nodeBytes  leafBytes  totalBytes  avgNodeBytes  avgLeafBytes  minNodeBytes  maxNodeBytes  minLeafBytes  maxLeafBytes
lex   1024  1       1       474      7913       3371370    3379283     7913          7112          7913          7913          5274          12774

BSBM 200M: This is the best time and space so far, using a byte counter rather than a short.
Elapsed: 16338357ms NumStatements: 198808848 NumDistinctVals: 45647082 TotalKeyBytes: 228235410 TotalValBytes: 11292849582 MaxCollisions: 3 TotalCollisions: 244042 Journal size: 16591683584 bytes

BSBM 200M: Note: I restarted this run after terminating yourkit, so the results should be valid (right?). The main changes are to use stringValue() to test for dateTime and to use the canonical huffman coder for the leaf keys.
Elapsed: 20148506ms NumStatements: 198808848 NumDistinctVals: 45647082 TotalKeyBytes: 228235410 TotalValBytes: 11292849582 MaxCollisions: 3 TotalCollisions: 244042 Journal size: 16591683584 bytes

BSBM 200M: Raw records are compressed if they are over 64 bytes long.
Elapsed: 18757003ms NumStatements: 198808848 NumDistinctVals: 45647082 TotalKeyBytes: 228235410 TotalValBytes: 7910596818 MaxCollisions: 3 TotalCollisions: 244042 Journal size: 12270108672 bytes

BSBM 200M: Literals whose labels are LT 64 bytes are assumed inlined into the statement indices (except datatype URIs).
Elapsed: 16193915ms NumStatements: 198808848 NumDistinctVals: 43273381 NumShortLiterals: 2723662 TotalKeyBytes: 216366905 TotalValBytes: 7807037644 MaxCollisions: 3 TotalCollisions: 219542 Journal size: 11083186176 bytes

BSBM 200M: URIs whose localNames are LT 64 bytes are assumed inlined into the statement indices (plus datatype literals LT 64 bytes).
Elapsed: 5699248ms NumStatements: 198808848 NumDistinctVals: 12198222 NumShortLiterals: 32779032 NumShortURIs: 493520581 TotalKeyBytes: 60991110 TotalValBytes: 4944223808 MaxCollisions: 2 TotalCollisions: 17264 Journal size: 7320764416 bytes

BSBM 200M: one parser thread and one indexer thread.
Elapsed: 3724415ms NumStatements: 198808848 NumDistinctVals: 12198222 NumShortLiterals: 32779032 NumShortBNodes: 0 NumShortURIs: 493520581 TotalKeyBytes: 60991110 TotalValBytes: 4944223808 MaxCollisions: 2 TotalCollisions: 17264 Journal size: 7320764416 bytes

GC overhead (OH) problem trying to run multiple parsers against BSBM 200M when split into 200 files.
valBufSize := 10000 valQueueCapacity := 100 maxDrain := 50 nparserThreads := 2 parserWorkQueue := 1000

BSBM 200M - this is 3x longer. This run did not have the GC OH problem, but GC had frequent 10% spikes, which is a lot in comparison to our best run.
valBufSize := 1000 valQueueCapacity := 10 maxDrain := 5 nparserThreads := 4 parserWorkQueue := 100

Elapsed: 9589520ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 32776100 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945278396 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M: split into 200 files. 69m versus the best time so far of 62m. There is only one thread in the pool, but the caller runs policy means that we are actually running two parsers. So this is not really the same as the best run, which was one parser running in the main thread with the indexer running in another thread.
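The runs above vary the parser/indexer pipeline configuration. As an illustration only, the sketch below wires a bounded parser work queue with a caller-runs rejection policy (so the submitting thread becomes an extra parser when the pool is saturated) to a single indexer thread fed over a bounded value queue. The parameter names mirror nparserThreads, parserWorkQueue, and valQueueCapacity from the notes; the class and chunk type are hypothetical, not the actual implementation.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative parser/indexer pipeline sketch (hypothetical). */
public class ParserIndexerPipelineSketch {

    static final int nparserThreads = 1;
    static final int parserWorkQueue = 100;
    static final int valQueueCapacity = 10;

    public static void main(String[] args) throws Exception {

        // Chunks of parsed values handed from the parsers to the single indexer.
        final BlockingQueue<List<String>> valQueue =
                new ArrayBlockingQueue<>(valQueueCapacity);

        // Bounded work queue + CallerRunsPolicy: when the queue fills up,
        // the caller parses in its own thread, throttling submission.
        final ExecutorService parserService = new ThreadPoolExecutor(
                nparserThreads, nparserThreads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(parserWorkQueue),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Single indexer thread draining the value queue.
        final Thread indexer = new Thread(() -> {
            try {
                while (true) {
                    final List<String> chunk = valQueue.take();
                    if (chunk.isEmpty()) // empty chunk == poison pill
                        break;
                    // index the chunk here ...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        for (final String file : args) {
            parserService.submit(() -> {
                // parse the file and enqueue chunks of values ...
                try {
                    valQueue.put(List.of("parsed values from " + file));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        parserService.shutdown();
        parserService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        valQueue.put(List.of()); // poison pill for the indexer
        indexer.join();
    }
}
```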
valBufSize := 10000 valQueueCapacity := 10 maxDrain := 5 nparserThreads := 1 parserWorkQueue := 100

Elapsed: 4119775ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 32776100 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945278396 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M with 1M statement splits, using the memory manager to buffer the data on the native heap. This is the best score so far (compare with 3724415ms with one parser and one indexer thread). For some reason, the #of distinct values and literals is slightly different for these two runs. One other change in this run is that we always gzip the record, since we cannot deserialize the record unless we know in advance whether or not it is compressed. Previous runs had conditionally compressed based on the original byte[] value length and stored the compressed record iff it was shorter. However, we can only conditionally compress if we use a header or bit flag to indicate that the record is compressed. Peak memory manager use was 262M.
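The note above about needing a header or bit flag before records can be conditionally compressed could look roughly like the following: a one-byte flag ahead of each record tells the reader whether the payload is gzipped, so the writer is free to store whichever form is shorter. This is a hypothetical sketch, not the code used for these runs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative record codec with a one-byte compression flag (hypothetical). */
public class FlaggedRecordCodec {

    private static final byte RAW = 0;
    private static final byte GZIP = 1;

    static byte[] encode(final byte[] raw) throws IOException {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        final byte[] compressed = bos.toByteArray();
        // Keep the compressed form only if it is actually smaller.
        final boolean useGzip = compressed.length < raw.length;
        final byte[] body = useGzip ? compressed : raw;
        final byte[] record = new byte[body.length + 1];
        record[0] = useGzip ? GZIP : RAW;
        System.arraycopy(body, 0, record, 1, body.length);
        return record;
    }

    static byte[] decode(final byte[] record) throws IOException {
        final byte flag = record[0];
        final ByteArrayInputStream bis =
                new ByteArrayInputStream(record, 1, record.length - 1);
        if (flag == RAW) {
            return bis.readAllBytes();
        }
        try (GZIPInputStream gz = new GZIPInputStream(bis)) {
            return gz.readAllBytes();
        }
    }
}
```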
Elapsed: 2863898ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 61100900 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945376779 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M using the memory manager (high tide of 351M) and 5 parser threads (plus the main thread). Heap usage is pretty controlled.
Elapsed: 2803451ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 61100900 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945376779 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M: using the memory manager and only one parser thread. This does run significantly slower (55m versus 47m with two parser threads). It might not be slower if we also ran against the single source file (this ran against the split files), since each chunk placed onto the queue would then be full, but I doubt that this will make that much difference.
Elapsed: 3300871ms NumStatements: 198805837 NumDistinctVals: 12049125 NumShortLiterals: 61100900 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 60245625 TotalValBytes: 4877760110 MaxCollisions: 2 TotalCollisions: 16840 Journal size: 7320764416 bytes

BSBM 200M: using the memory manager, one parser thread (the caller), and a single source file. The question is whether we do better with a statement handler that is only flushed incrementally (when full) compared to using 2 parsers and flushing each time we reach the end of a 1M statement source file. Nope. This was 77 minutes. (This was a fair comparison, since the split source files are also compressed, so we really do better with two parsers and split files.)
/allocationCount=0 /bufferCapacity=1000 /bufferCount=232 /extent=243269632 /slotBytes=0 /userBytes=0

Elapsed: 4605950ms NumStatements: 198808848 NumDistinctVals: 12198222 NumShortLiterals: 61103832 NumShortBNodes: 0 NumShortURIs: 493520581 TotalKeyBytes: 60991110 TotalValBytes: 4944322031 MaxCollisions: 2 TotalCollisions: 17264 Journal size: 7320764416 bytes

TODO: Try with only N bytes worth of the SHA hash code, leaving some bits left over for partitioning URIs, Literals, and BNodes (for told bnode mode) and for a counter to break ties when there is a hash collision. We should wind up with an 8-12 byte termId which is collision proof and very well distributed.

TODO: Add bit flags at the front for {BLOB, URI, Literal, BNode} (BLOB being the odd one out). If we move BLOBs out of the key range of other plain literals, or literals of a given language code or datatype, then we cannot do an ordered scan of the literals anymore which is inclusive of the blobs. There is a similar consequence of moving small literals into the statement index.
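A rough sketch of the termId layout proposed in the TODOs above: a leading byte of type flags ({BLOB, URI, Literal, BNode}), the first N bytes of the SHA digest, and a trailing one-byte collision counter, giving a compact 8-byte identifier. The flag values and the exact byte layout are assumptions made for illustration, not the actual encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative compact termId: flag byte + truncated hash + counter (hypothetical). */
public class TermIdSketch {

    // Hypothetical type flags packed into the leading byte.
    static final byte URI = 0x00, LITERAL = 0x01, BNODE = 0x02, BLOB = 0x03;

    // Number of hash bytes to keep (6 bytes + flag + counter = 8-byte termId).
    static final int N = 6;

    static byte[] termId(final byte flags, final String value,
            final int collisionCounter) throws NoSuchAlgorithmException {
        final byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        final byte[] id = new byte[1 + N + 1];
        id[0] = flags;                          // type partition
        System.arraycopy(hash, 0, id, 1, N);    // truncated hash
        id[1 + N] = (byte) collisionCounter;    // tie-breaker on collision
        return id;
    }
}
```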
If we inline small Unicode values (<32 bytes) and reserve the TERM2ID index for large(r) values, then we approach a situation in which it serves solely for blobs, with a tradeoff in size (of the statement indices) versus indirection.
Large value promotion does not really let us handle large blobs (multi-megabytes) in scale-out (s/o), as 50 4M blobs would fill up a shard. There, I think we need to give control over to the application and require it to write on a shared resource (shared file system, S3, etc.). The value inserted into the index would then be just the pathname in the shared file system or the URL of the S3 resource. This breaks the ACID decision boundary, though, as the application has no means available to atomically decide that the resource does not exist and hence create it. Even using a conditional E-Tag on S3 would not work, since it would have to have an index over the S3 entities to detect a write-write conflict for the same data under different URLs.
Modifier and Type | Method and Description
---|---
static void | main(String[] args) - Parse files, inserting Values into indices and counting hash collisions.
void | shutdown() - Normal shutdown.
void | shutdownNow() - Immediate shutdown.
void | start() - Start the task which will index data as it is parsed.
public void start()
Start the task which will index data as it is parsed.

public void shutdown() throws Exception
Normal shutdown.
Throws:
Exception

public void shutdownNow() throws Exception
Immediate shutdown.
Throws:
Exception

public static void main(String[] args) throws Exception
Parse files, inserting Values into indices and counting hash collisions.
Parameters:
args - filename(s)
Throws:
IOException
org.openrdf.rio.RDFHandlerException
org.openrdf.rio.RDFParseException
NoSuchAlgorithmException
Exception
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.