public class HashCollisionUtility extends Object
Test data: /nas/data/lubm/U1/data/University0/, /nas/data/bsbm/bsbm_2785/dataset.nt.gz, /nas/data/bsbm/bsbm_566496/dataset.nt.gz, /data/bsbm3_200m_1MSplits, an 8B triple bioinformatics data set, and BTC data (some very large literals).

TODO: Order preserving hash codes could be interesting here. Look at 32 and 64 bit variants of the math and at generalized order preserving hash codes. With order preserving hash codes, it makes sense to insert all Unicode terms into TERM2ID such that we have a total order there.

TODO: Benchmark the load time with different hash codes. The cost of the hash computation and the randomness of the distribution will both play a role. The B+Tree will need to be set up with a sufficient [writeRetentionQueue] and we will need to specify [-server -Xmx1G].

SHA-256: no collisions on BSBM 200M (30G file). Time?

32-bit hash codes: #collisions=1544132 Elapsed: 16656445ms Journal size: 23841341440 bytes (23G)

Now limiting the size of values in a leaf and also increasing the branching factor to 512 (was 32). [The current run is scanning after the initial insert, which involves a little wasted effort. It was also without the -server -Xmx2g and write retention queue parameters. Finally, it was serializing BigdataValue objects, including their IV, rather than RDF Value objects. The code has since been modified to serialize just the BigdataValue. Also, I've since raised the initial extent from 10M to 200M.] maxCollisions=3, Elapsed: 22579073ms Journal size: 35270950912 bytes

Now buffering 100k values at a time: 2x faster.
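To make the keying scheme in these notes concrete, here is a minimal sketch (not the utility's actual code) that builds a lexicon key as the SHA-256 digest of the term's string form followed by a one-byte collision counter used to break ties between distinct values that hash alike. The class and method names are hypothetical; only the MessageDigest API is the real JDK.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Illustrative sketch: build a B+Tree key as hash(value) + a one-byte
 * collision counter. Hypothetical code, not the HashCollisionUtility API.
 */
public class HashKeyExample {

    /** Hash the value's string form with SHA-256. */
    static byte[] hash(final String value) throws NoSuchAlgorithmException {
        final MessageDigest md = MessageDigest.getInstance("SHA-256");
        return md.digest(value.getBytes(StandardCharsets.UTF_8));
    }

    /** Append a one-byte counter used to break ties on a hash collision. */
    static byte[] key(final byte[] hash, final int collisionCounter) {
        if (collisionCounter > 255)
            throw new IllegalStateException("Too many collisions");
        final byte[] key = new byte[hash.length + 1];
        System.arraycopy(hash, 0, key, 0, hash.length);
        key[hash.length] = (byte) collisionCounter;
        return key;
    }

    public static void main(String[] args) throws Exception {
        final byte[] h = hash("http://example.org/someTerm");
        // First occurrence gets counter 0; a colliding distinct value gets 1, etc.
        System.out.println(key(h, 0).length + " byte key");
    }
}
```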
U1:

Elapsed: 23379ms NumStatements: 1000313 NumDistinctVals: 291259 TotalKeyBytes: 1747554 TotalValBytes: 60824514 MaxCollisions: 1 TotalCollisions: 6 Journal size: 209715200 bytes

name  m     height  nnodes  nleaves  nodeBytes  leafBytes  totalBytes  avgNodeBytes  avgLeafBytes  minNodeBytes  maxNodeBytes  minLeafBytes  maxLeafBytes
lex   1024  1       1       474      7913       3662623    3670536     7913          7727          7913          7913          5786          13784

With only a byte (versus short) counter in the key. Oddly, this has no impact on the average leaf size. That suggests that the keys in the leaves are very sparse in terms of the hash code space, such that prefix compression is not really doing that much for us.
Elapsed: 23235ms NumStatements: 1000313 NumDistinctVals: 291259 TotalKeyBytes: 1456295 TotalValBytes: 60824514 MaxCollisions: 1 TotalCollisions: 6 Journal size: 209715200 bytes

name  m     height  nnodes  nleaves  nodeBytes  leafBytes  totalBytes  avgNodeBytes  avgLeafBytes  minNodeBytes  maxNodeBytes  minLeafBytes  maxLeafBytes
lex   1024  1       1       474      7913       3371370    3379283     7913          7112          7913          7913          5274          12774

BSBM 200M: This is the best time and space so far, using a byte counter rather than a short.
Elapsed: 16338357ms NumStatements: 198808848 NumDistinctVals: 45647082 TotalKeyBytes: 228235410 TotalValBytes: 11292849582 MaxCollisions: 3 TotalCollisions: 244042 Journal size: 16591683584 bytes

BSBM 200M: Note: I restarted this run after terminating yourkit, so the results should be valid (right?). The main changes are to use stringValue() to test for dateTime and to use the canonical huffman coder for the leaf keys.
Elapsed: 20148506ms NumStatements: 198808848 NumDistinctVals: 45647082 TotalKeyBytes: 228235410 TotalValBytes: 11292849582 MaxCollisions: 3 TotalCollisions: 244042 Journal size: 16591683584 bytes

BSBM 200M: Raw records are compressed if they are over 64 bytes long.
Elapsed: 18757003ms NumStatements: 198808848 NumDistinctVals: 45647082 TotalKeyBytes: 228235410 TotalValBytes: 7910596818 MaxCollisions: 3 TotalCollisions: 244042 Journal size: 12270108672 bytes

BSBM 200M: Literals whose labels are LT 64 bytes are assumed inlined into the statement indices (except datatype URIs).
Elapsed: 16193915ms NumStatements: 198808848 NumDistinctVals: 43273381 NumShortLiterals: 2723662 TotalKeyBytes: 216366905 TotalValBytes: 7807037644 MaxCollisions: 3 TotalCollisions: 219542 Journal size: 11083186176 bytes

BSBM 200M: URIs whose localNames are LT 64 bytes are assumed inlined into the statement indices (plus datatype literals LT 64 bytes).
Elapsed: 5699248ms NumStatements: 198808848 NumDistinctVals: 12198222 NumShortLiterals: 32779032 NumShortURIs: 493520581 TotalKeyBytes: 60991110 TotalValBytes: 4944223808 MaxCollisions: 2 TotalCollisions: 17264 Journal size: 7320764416 bytes

BSBM 200M: one parser thread and one indexer thread.
Elapsed: 3724415ms NumStatements: 198808848 NumDistinctVals: 12198222 NumShortLiterals: 32779032 NumShortBNodes: 0 NumShortURIs: 493520581 TotalKeyBytes: 60991110 TotalValBytes: 4944223808 MaxCollisions: 2 TotalCollisions: 17264 Journal size: 7320764416 bytes

GC overhead (OH) problem trying to run multiple parsers against BSBM 200M when split into 200 files.
valBufSize := 10000 valQueueCapacity := 100 maxDrain := 50 nparserThreads := 2 parserWorkQueue := 1000

BSBM 200M - this is 3x longer. This run did not have the GC OH problem, but GC had frequent 10% spikes, which is a lot in comparison to our best run.
valBufSize := 1000 valQueueCapacity := 10 maxDrain := 5 nparserThreads := 4 parserWorkQueue := 100

Elapsed: 9589520ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 32776100 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945278396 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M: split into 200 files. 69m versus the best time so far of 62m. There is only one thread in the pool, but the caller runs policy means that we are actually running two parsers. So this is not really the same as the best run, which was one parser running in the main thread with the indexer running in another thread.
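The runs above vary the parser/indexer pipeline configuration. As an illustration only, the sketch below wires a bounded parser work queue with a caller-runs rejection policy (so the submitting thread becomes an extra parser when the pool is saturated) to a single indexer thread fed over a bounded value queue. The parameter names mirror nparserThreads, parserWorkQueue, and valQueueCapacity from the notes; the class and chunk type are hypothetical, not the actual implementation.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative parser/indexer pipeline sketch (hypothetical). */
public class ParserIndexerPipelineSketch {

    static final int nparserThreads = 1;
    static final int parserWorkQueue = 100;
    static final int valQueueCapacity = 10;

    public static void main(String[] args) throws Exception {

        // Chunks of parsed values handed from the parsers to the single indexer.
        final BlockingQueue<List<String>> valQueue =
                new ArrayBlockingQueue<>(valQueueCapacity);

        // Bounded work queue + CallerRunsPolicy: when the queue fills up,
        // the caller parses in its own thread, throttling submission.
        final ExecutorService parserService = new ThreadPoolExecutor(
                nparserThreads, nparserThreads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(parserWorkQueue),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Single indexer thread draining the value queue.
        final Thread indexer = new Thread(() -> {
            try {
                while (true) {
                    final List<String> chunk = valQueue.take();
                    if (chunk.isEmpty()) // empty chunk == poison pill
                        break;
                    // index the chunk here ...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        for (final String file : args) {
            parserService.submit(() -> {
                // parse the file and enqueue chunks of values ...
                try {
                    valQueue.put(List.of("parsed values from " + file));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        parserService.shutdown();
        parserService.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        valQueue.put(List.of()); // poison pill for the indexer
        indexer.join();
    }
}
```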
valBufSize := 10000 valQueueCapacity := 10 maxDrain := 5 nparserThreads := 1 parserWorkQueue := 100

Elapsed: 4119775ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 32776100 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945278396 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M with 1M statement splits, using the memory manager to buffer the data on the native heap. This is the best score so far (compare with 3724415ms with one parser and one indexer thread). For some reason, the #of distinct values and literals is slightly different for these two runs. One other change in this run is that we always gzip the record, since we cannot deserialize the record unless we know in advance whether or not it is compressed. Previous runs had conditionally compressed based on the original byte[] value length and stored the compressed record iff it was shorter. However, we can only conditionally compress if we use a header or bit flag to indicate that the record is compressed. Peak memory manager use was 262M.
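The note above about needing a header or bit flag before records can be conditionally compressed could look roughly like the following: a one-byte flag ahead of each record tells the reader whether the payload is gzipped, so the writer is free to store whichever form is shorter. This is a hypothetical sketch, not the code used for these runs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative record codec with a one-byte compression flag (hypothetical). */
public class FlaggedRecordCodec {

    private static final byte RAW = 0;
    private static final byte GZIP = 1;

    static byte[] encode(final byte[] raw) throws IOException {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        final byte[] compressed = bos.toByteArray();
        // Keep the compressed form only if it is actually smaller.
        final boolean useGzip = compressed.length < raw.length;
        final byte[] body = useGzip ? compressed : raw;
        final byte[] record = new byte[body.length + 1];
        record[0] = useGzip ? GZIP : RAW;
        System.arraycopy(body, 0, record, 1, body.length);
        return record;
    }

    static byte[] decode(final byte[] record) throws IOException {
        final byte flag = record[0];
        final ByteArrayInputStream bis =
                new ByteArrayInputStream(record, 1, record.length - 1);
        if (flag == RAW) {
            return bis.readAllBytes();
        }
        try (GZIPInputStream gz = new GZIPInputStream(bis)) {
            return gz.readAllBytes();
        }
    }
}
```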
Elapsed: 2863898ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 61100900 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945376779 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M using the memory manager (high tide of 351M) and 5 parser threads (plus the main thread). Heap usage is pretty controlled.
Elapsed: 2803451ms NumStatements: 198805837 NumDistinctVals: 12202052 NumShortLiterals: 61100900 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 61010260 TotalValBytes: 4945376779 MaxCollisions: 2 TotalCollisions: 17260 Journal size: 7320764416 bytes

BSBM 200M: using the memory manager and only one parser thread. This does run significantly slower (55m versus 47m with two parser threads). It might not be slower if we also ran against the single source file (this ran against the split files), since each chunk placed onto the queue would then be full, but I doubt that this will make that much difference.
Elapsed: 3300871ms NumStatements: 198805837 NumDistinctVals: 12049125 NumShortLiterals: 61100900 NumShortBNodes: 0 NumShortURIs: 493514954 TotalKeyBytes: 60245625 TotalValBytes: 4877760110 MaxCollisions: 2 TotalCollisions: 16840 Journal size: 7320764416 bytes

BSBM 200M: using the memory manager, one parser thread (the caller), and a single source file. The question is whether we do better with a statement handler that is only flushed incrementally (when full) compared to using 2 parsers and flushing each time we reach the end of a 1M statement source file. Nope. This was 77 minutes. (This was a fair comparison, since the split source files are also compressed, so we really do better with two parsers and split files.)
/allocationCount=0 /bufferCapacity=1000 /bufferCount=232 /extent=243269632 /slotBytes=0 /userBytes=0

Elapsed: 4605950ms NumStatements: 198808848 NumDistinctVals: 12198222 NumShortLiterals: 61103832 NumShortBNodes: 0 NumShortURIs: 493520581 TotalKeyBytes: 60991110 TotalValBytes: 4944322031 MaxCollisions: 2 TotalCollisions: 17264 Journal size: 7320764416 bytes

TODO: Try with only N bytes worth of the SHA hash code, leaving some bits left over for partitioning URIs, Literals, and BNodes (for told bnode mode) and for a counter to break ties when there is a hash collision. We should wind up with an 8-12 byte termId which is collision proof and very well distributed.

TODO: Add bit flags at the front for {BLOB, URI, Literal, BNode} (BLOB being the odd one out). If we move BLOBs out of the key range of other plain literals, or literals of a given language code or datatype, then we cannot do an ordered scan of the literals anymore which is inclusive of the blobs. There is a similar consequence of moving small literals into the statement index.
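A rough sketch of the termId layout proposed in the TODOs above: a leading byte of type flags ({BLOB, URI, Literal, BNode}), the first N bytes of the SHA digest, and a trailing one-byte collision counter, giving a compact 8-byte identifier. The flag values and the exact byte layout are assumptions made for illustration, not the actual encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative compact termId: flag byte + truncated hash + counter (hypothetical). */
public class TermIdSketch {

    // Hypothetical type flags packed into the leading byte.
    static final byte URI = 0x00, LITERAL = 0x01, BNODE = 0x02, BLOB = 0x03;

    // Number of hash bytes to keep (6 bytes + flag + counter = 8-byte termId).
    static final int N = 6;

    static byte[] termId(final byte flags, final String value,
            final int collisionCounter) throws NoSuchAlgorithmException {
        final byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        final byte[] id = new byte[1 + N + 1];
        id[0] = flags;                          // type partition
        System.arraycopy(hash, 0, id, 1, N);    // truncated hash
        id[1 + N] = (byte) collisionCounter;    // tie-breaker on collision
        return id;
    }
}
```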
If we inline small Unicode values (<32 bytes) and reserve the TERM2ID index for large(r) values, then we approach a situation in which it serves solely for blobs, with a tradeoff in size (of the statement indices) versus indirection.
Large value promotion does not really let us handle large blobs (multi-megabytes) in scale-out (s/o), as 50 4M blobs would fill up a shard. There, I think we need to give control over to the application and require it to write on a shared resource (shared file system, S3, etc.). The value inserted into the index would then be just the pathname in the shared file system or the URL of the S3 resource. This breaks the ACID decision boundary, though, as the application has no means available to atomically decide that the resource does not exist and hence create it. Even using a conditional E-Tag on S3 would not work, since it would have to have an index over the S3 entities to detect a write-write conflict for the same data under different URLs.
Modifier and Type | Method and Description
---|---
static void | main(String[] args) - Parse files, inserting Values into indices and counting hash collisions.
void | shutdown() - Normal shutdown.
void | shutdownNow() - Immediate shutdown.
void | start() - Start the task which will index data as it is parsed.
public void start()
Start the task which will index data as it is parsed.

public void shutdown() throws Exception
Normal shutdown.
Throws:
Exception

public void shutdownNow() throws Exception
Immediate shutdown.
Throws:
Exception

public static void main(String[] args) throws Exception
Parse files, inserting Values into indices and counting hash collisions.
Parameters:
args - filename(s)
Throws:
IOException
org.openrdf.rio.RDFHandlerException
org.openrdf.rio.RDFParseException
NoSuchAlgorithmException
Exception
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.