public class CanonicalHuffmanRabaCoder extends Object implements IRabaCoder, Externalizable
CanonicalHuffmanRabaCoder can be used for keys or
values and supports efficient search of the coded keys.
version:byte The version identifier for this record format (8 bits,
which allows for 256 format revisions).
isKeys:1 Bit flag indicates whether the record is coding B+Tree keys
or B+Tree values. When coding values only the non-zero
byte frequency counts are used to compute the dictionary
and nulls are permitted. When coding keys, the dictionary
includes codes for all 256 possible byte values so we may
code search keys and nulls are not permitted.
isSymbolTable:1
A bit flag whose value is 1 iff the symbols are given as
a packed symbol[] and 0 if all possible byte values were
coded. Note that the implementation MAY choose to code
byte values with a zero frequency if MOST byte values are
used since that does not degrade the code by much and it
allows us to eliminate a ˜256 bytes from the data record.
isOffsetArray:1 A bit flag whose value is 1 iff the codedValueOffset[] is
stored. At this time, this array is always stored. Either
the offset[] or the #of symbols in each coded value MUST be
stored in order for us to compute the end of each sequence
of codeWord[]s for a given coded value.
isByteAlignedOffset:1
A bit flag whose value is 1 iff the codedValueOffset[] is
byte aligned. When true, the start of the array will fall
on a byte boundary and each array element will be a multiple
of 8 bits.
unused:4 unused bit flags.
size:int31 The #of elements in the logical byte[][].
nsymbols:uint9 There are at most 256 distinct symbols, which are the
distinct possible byte values (9 bits, which allows
for an empty leaf or node with no byte values used as
well as a leaf or node with all 256 byte values used).
-- note: at this point the record is still byte aligned --
symbol2byte:byte[]
The symbol2byte array. The index into the array is the
symbol, which is correlated with the frequency[] used to
build the code. The value is the byte corresponding to that
symbol. There are nsymbols entries in the array. The array
is present IFF isSymbolTable is true.
O_symbols := 48 bits.
-- note: at this point the record is still byte aligned --
codedValueOffsetBits:uint8
The width in bits of the integers used to code the
codedValueOffset[].
O_codedValueOffsetBits := [ibs.readBits];
sumCodedValueBitLengths:uint32
The sum of the bit lengths of the coded values. This is
a 32bit unsigned integer, which is sufficient to code up
bit lengths of up to 512MB. This field IS NOT present
if the codeOffsetBits is ZERO (0) since the field is only
used to compute the bit offset of the codeOffset[].
O_sumCodedValueBitLengths := O_codedValueOffsetBits + 8;
nulls[] A vector of [nvalues] bit flags IFF isKeys==false. A flag
is a ONE (1) iff the corresponding byte[] in the logical
byte[][] was a null. A null is coded as a sequence of ZERO
(0) code words. However, empty byte[]s are also permitted
for B+Tree values (but not for B+Tree keys). Therefore you
MUST test the nulls[] to distinguish between a null byte[]
and an empty byte[]. This field IS NOT present when keys
are coded.
O_nulls := O_codedValueOffsetBits + 8 + (codedValueOffsetBits==0?0:32);
-- note: if nsymbols==0 then this is the end of the record --
decoderInputs:=
length[], The bit length of each code word in the assigned order.
symbol[], The correlated symbol indices.
codeWord[0]. The shortest code word.
These length[] and the symbol[] data are written out together
using a compact representation. The length of the shortest
code word length is given by length[0]. Since we are using
a canonical huffman code, all we need is the shortest code
and the length[] to regenerate the entire code.
O_decoderInputs := O_symbols + (isSymbolTable?0:nsymbols*8)
codedValue:bit[] The coded values given as a sequence of code words. The bit
length of this field is given by sumCodedValueBitLengths.
O_codedValue[] := O_nulls + size IFF values are coded
-or-
O_codedValue[] := O_nulls IFF keys are coded.
codeValueOffset[]
The bit offset to the start of each coded value in the
codedValue[]. The offsets are relative to the start of
the first coded value.
While the delta in the offsets could be represented more
efficiently, the offsets are represented directly so that
we may avoid reading the entire codeOffset[] into memory.
This array is present iff isOffsetArray is true.
O_codedValueOffset[] := O_codedValue[] + sumCodedValueBitLengths
HuffmanCodec,
http://en.wikipedia.org/wiki/Huffman_coding,
http://en.wikipedia.org/wiki/Canonical_Huffman_code,
Serialized Form| Modifier and Type | Class and Description |
|---|---|
protected static class |
CanonicalHuffmanRabaCoder.AbstractCodingSetup
Abstract base class for preparing a logical byte[][] for coding.
|
static class |
CanonicalHuffmanRabaCoder.CodedRabaImpl
Decoder.
|
protected static class |
CanonicalHuffmanRabaCoder.RabaCodingSetup
Sets up for coding an
IRaba representing B+Tree values. |
| Modifier and Type | Field and Description |
|---|---|
static CanonicalHuffmanRabaCoder |
INSTANCE |
protected static org.apache.log4j.Logger |
log |
protected static byte |
VERSION0
The original serialization version for the coded data record.
|
| Constructor and Description |
|---|
CanonicalHuffmanRabaCoder() |
| Modifier and Type | Method and Description |
|---|---|
ICodedRaba |
decode(AbstractFixedByteArrayBuffer data)
Return an
IRaba which can access the coded data. |
AbstractFixedByteArrayBuffer |
encode(IRaba raba,
DataOutputBuffer buf)
Encode the data.
|
ICodedRaba |
encodeLive(IRaba raba,
DataOutputBuffer buf)
Encode the data, returning an
ICodedRaba. |
protected long |
getSumCodedValueBitLengths(BitVector[] codeWords,
IRaba raba,
com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol)
Deprecated.
Leave this field and the #of bits per codedValueOffset[]
element blank until we have written out the coded values and
then rewind the OBS and fill in those fields. Otherwise we
are encoding the same byte[][] data twice, which is wasted
effort.
|
boolean |
isDuplicateKeys()
Return true iff this
IRabaCoder supports duplicate keys. |
boolean |
isKeyCoder()
Return
true if this implementation can code B+Tree keys
(supports search on the coded representation). |
boolean |
isValueCoder()
Return
true if this implementation can code B+Tree values
(allows nulls). |
protected static HuffmanCodec.DecoderInputs |
readDecoderInputs(int nsymbols,
InputBitStream ibs,
StringBuilder sb)
Reconstruct the
HuffmanCodec.DecoderInputs from the data written by
#writeDecoderInputs(BitVector[], OutputBitStream). |
void |
readExternal(ObjectInput in) |
protected long |
writeCodedValues(PrefixCoder coder,
IRaba raba,
com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol,
long[] codedValueOffset,
OutputBitStream obs)
Write out the coded values.
|
protected static void |
writeDecoderInputs(HuffmanCodec.DecoderInputs decoderInputs,
OutputBitStream obs,
StringBuilder sb)
Write a compact minimum representation of the data required to
reconstruct the decoder (bit lengths and correlated symbols).
|
void |
writeExternal(ObjectOutput out) |
protected void |
writeSymbolTable(com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Symbol2Byte symbol2byte,
OutputBitStream obs)
Write out the optional packed symbol table (symbol2byte).
|
protected static final org.apache.log4j.Logger log
protected static final transient byte VERSION0
public static final transient CanonicalHuffmanRabaCoder INSTANCE
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
readExternal in interface ExternalizableIOExceptionClassNotFoundExceptionpublic void writeExternal(ObjectOutput out) throws IOException
writeExternal in interface ExternalizableIOExceptionpublic final boolean isKeyCoder()
IRabaCodertrue if this implementation can code B+Tree keys
(supports search on the coded representation). Note that some
implementations can code either keys or values.isKeyCoder in interface IRabaCoderpublic final boolean isValueCoder()
IRabaCodertrue if this implementation can code B+Tree values
(allows nulls). Note that some implementations can code
either keys or values.isValueCoder in interface IRabaCoderprotected void writeSymbolTable(com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Symbol2Byte symbol2byte,
OutputBitStream obs)
throws IOException
symbol2byte - The symbol table.obs - The symbol table is written on this bit stream.IOExceptionprotected static void writeDecoderInputs(HuffmanCodec.DecoderInputs decoderInputs, OutputBitStream obs, StringBuilder sb) throws IOException
min max count, symbol(s)where
min is the bit length of the shortest code word;
max is the bit length of the longest code word;
count is the #of code words of a given length; and
symbol(s) are the symbols associated with each code word.
All values are nibble coded and will generally fit in 1-2 bytes each.decoderInputs - This contains both the bit lengths of the canonical huffman
code and the symbols assigned to each code word.obs - The output bit stream.sb - Debugging information is added to this buffer (optional).IOExceptionDecoderInputs}protected static HuffmanCodec.DecoderInputs readDecoderInputs(int nsymbols, InputBitStream ibs, StringBuilder sb) throws IOException
HuffmanCodec.DecoderInputs from the data written by
#writeDecoderInputs(BitVector[], OutputBitStream).nsymbols - The #of symbols.ibs - The input bit stream.sb - Debugging information is added to this buffer (optional).IOExceptionprotected long getSumCodedValueBitLengths(BitVector[] codeWords, IRaba raba, com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol)
coder - The coder.raba - The logical byte[][].byte2symbol - The mapping from byte values to symbol indices.protected long writeCodedValues(PrefixCoder coder, IRaba raba, com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol, long[] codedValueOffset, OutputBitStream obs) throws IOException
coder - The coder.raba - The logical byte[][].byte2symbol - The mapping from byte values to symbol indices.codedValueOffset - An optional array dimensioned to nvalues+1, where
nvalues is #of values in the logical byte[][]. When
given, the array will be populated with the relative bit
offset of the start of each coded value. The offsets are
relative to the start of the first coded value.IOExceptionpublic AbstractFixedByteArrayBuffer encode(IRaba raba, DataOutputBuffer buf)
IRabaCoder
Note: Implementations of this method are typically heavy. While it is
always valid to IRabaCoder.encode(IRaba, DataOutputBuffer) an IRaba
, DO NOT invoke this arbitrarily on data which may already be
coded. The ICodedRaba interface will always be implemented for
coded data.
encode in interface IRabaCoderraba - The data.buf - A buffer on which the coded data will be written.public ICodedRaba encodeLive(IRaba raba, DataOutputBuffer buf)
IRabaCoderICodedRaba. Implementations of this
method should be optimized for the very common use case where the caller
requires immediate access to the coded data record. In that case, many of
the IRabaCoder implementations can be optimized by passing the
underlying decoding object directly into an alternative constructor for
the ICodedRaba. The byte[] slice for the coded data record is
available from ICodedRaba.data().
This method covers the vast major of the use cases for coding data, which
is to code B+Tree keys or values for a node or leaf that has been evicted
from the AbstractBTree's write retention queue. The common use
case is to wrap a coded record that was read from an IRawStore.
The IndexSegmentBuilder is a special case, since the coded record
will not be used other than to write it on the disk.
encodeLive in interface IRabaCoderpublic ICodedRaba decode(AbstractFixedByteArrayBuffer data)
IRabaCoderIRaba which can access the coded data. In general,
implementations SHOULD NOT materialize a backing byte[][]. Instead, the
implementation should access the data in place within the caller's
buffer. Frequently used fields MAY be cached, but the whole point of the
IRabaCoder is to minimize the in-memory footprint for the B+Tree
by using a coded (aka compressed) representation of the keys and values
whenever possible.decode in interface IRabaCoderdata - The record containing the coded data.public boolean isDuplicateKeys()
IRabaCoderIRabaCoder supports duplicate keys.isDuplicateKeys in interface IRabaCoderCopyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.