public class CanonicalHuffmanRabaCoder extends Object implements IRabaCoder, Externalizable
CanonicalHuffmanRabaCoder
can be used for keys or
values and supports efficient search of the coded keys.
version:byte The version identifier for this record format (8 bits, which allows for 256 format revisions). isKeys:1 Bit flag indicates whether the record is coding B+Tree keys or B+Tree values. When coding values only the non-zero byte frequency counts are used to compute the dictionary and nulls are permitted. When coding keys, the dictionary includes codes for all 256 possible byte values so we may code search keys and nulls are not permitted. isSymbolTable:1 A bit flag whose value is 1 iff the symbols are given as a packed symbol[] and 0 if all possible byte values were coded. Note that the implementation MAY choose to code byte values with a zero frequency if MOST byte values are used since that does not degrade the code by much and it allows us to eliminate a ˜256 bytes from the data record. isOffsetArray:1 A bit flag whose value is 1 iff the codedValueOffset[] is stored. At this time, this array is always stored. Either the offset[] or the #of symbols in each coded value MUST be stored in order for us to compute the end of each sequence of codeWord[]s for a given coded value. isByteAlignedOffset:1 A bit flag whose value is 1 iff the codedValueOffset[] is byte aligned. When true, the start of the array will fall on a byte boundary and each array element will be a multiple of 8 bits. unused:4 unused bit flags. size:int31 The #of elements in the logical byte[][]. nsymbols:uint9 There are at most 256 distinct symbols, which are the distinct possible byte values (9 bits, which allows for an empty leaf or node with no byte values used as well as a leaf or node with all 256 byte values used). -- note: at this point the record is still byte aligned -- symbol2byte:byte[] The symbol2byte array. The index into the array is the symbol, which is correlated with the frequency[] used to build the code. The value is the byte corresponding to that symbol. There are nsymbols entries in the array. The array is present IFF isSymbolTable is true. O_symbols := 48 bits. -- note: at this point the record is still byte aligned -- codedValueOffsetBits:uint8 The width in bits of the integers used to code the codedValueOffset[]. O_codedValueOffsetBits := [ibs.readBits]; sumCodedValueBitLengths:uint32 The sum of the bit lengths of the coded values. This is a 32bit unsigned integer, which is sufficient to code up bit lengths of up to 512MB. This field IS NOT present if the codeOffsetBits is ZERO (0) since the field is only used to compute the bit offset of the codeOffset[]. O_sumCodedValueBitLengths := O_codedValueOffsetBits + 8; nulls[] A vector of [nvalues] bit flags IFF isKeys==false. A flag is a ONE (1) iff the corresponding byte[] in the logical byte[][] was a null. A null is coded as a sequence of ZERO (0) code words. However, empty byte[]s are also permitted for B+Tree values (but not for B+Tree keys). Therefore you MUST test the nulls[] to distinguish between a null byte[] and an empty byte[]. This field IS NOT present when keys are coded. O_nulls := O_codedValueOffsetBits + 8 + (codedValueOffsetBits==0?0:32); -- note: if nsymbols==0 then this is the end of the record -- decoderInputs:= length[], The bit length of each code word in the assigned order. symbol[], The correlated symbol indices. codeWord[0]. The shortest code word. These length[] and the symbol[] data are written out together using a compact representation. The length of the shortest code word length is given by length[0]. Since we are using a canonical huffman code, all we need is the shortest code and the length[] to regenerate the entire code. O_decoderInputs := O_symbols + (isSymbolTable?0:nsymbols*8) codedValue:bit[] The coded values given as a sequence of code words. The bit length of this field is given by sumCodedValueBitLengths. O_codedValue[] := O_nulls + size IFF values are coded -or- O_codedValue[] := O_nulls IFF keys are coded. codeValueOffset[] The bit offset to the start of each coded value in the codedValue[]. The offsets are relative to the start of the first coded value. While the delta in the offsets could be represented more efficiently, the offsets are represented directly so that we may avoid reading the entire codeOffset[] into memory. This array is present iff isOffsetArray is true. O_codedValueOffset[] := O_codedValue[] + sumCodedValueBitLengths
HuffmanCodec
,
http://en.wikipedia.org/wiki/Huffman_coding
,
http://en.wikipedia.org/wiki/Canonical_Huffman_code
,
Serialized FormModifier and Type | Class and Description |
---|---|
protected static class |
CanonicalHuffmanRabaCoder.AbstractCodingSetup
Abstract base class for preparing a logical byte[][] for coding.
|
static class |
CanonicalHuffmanRabaCoder.CodedRabaImpl
Decoder.
|
protected static class |
CanonicalHuffmanRabaCoder.RabaCodingSetup
Sets up for coding an
IRaba representing B+Tree values. |
Modifier and Type | Field and Description |
---|---|
static CanonicalHuffmanRabaCoder |
INSTANCE |
protected static org.apache.log4j.Logger |
log |
protected static byte |
VERSION0
The original serialization version for the coded data record.
|
Constructor and Description |
---|
CanonicalHuffmanRabaCoder() |
Modifier and Type | Method and Description |
---|---|
ICodedRaba |
decode(AbstractFixedByteArrayBuffer data)
Return an
IRaba which can access the coded data. |
AbstractFixedByteArrayBuffer |
encode(IRaba raba,
DataOutputBuffer buf)
Encode the data.
|
ICodedRaba |
encodeLive(IRaba raba,
DataOutputBuffer buf)
Encode the data, returning an
ICodedRaba . |
protected long |
getSumCodedValueBitLengths(BitVector[] codeWords,
IRaba raba,
com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol)
Deprecated.
Leave this field and the #of bits per codedValueOffset[]
element blank until we have written out the coded values and
then rewind the OBS and fill in those fields. Otherwise we
are encoding the same byte[][] data twice, which is wasted
effort.
|
boolean |
isDuplicateKeys()
Return true iff this
IRabaCoder supports duplicate keys. |
boolean |
isKeyCoder()
Return
true if this implementation can code B+Tree keys
(supports search on the coded representation). |
boolean |
isValueCoder()
Return
true if this implementation can code B+Tree values
(allows null s). |
protected static HuffmanCodec.DecoderInputs |
readDecoderInputs(int nsymbols,
InputBitStream ibs,
StringBuilder sb)
Reconstruct the
HuffmanCodec.DecoderInputs from the data written by
#writeDecoderInputs(BitVector[], OutputBitStream) . |
void |
readExternal(ObjectInput in) |
protected long |
writeCodedValues(PrefixCoder coder,
IRaba raba,
com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol,
long[] codedValueOffset,
OutputBitStream obs)
Write out the coded values.
|
protected static void |
writeDecoderInputs(HuffmanCodec.DecoderInputs decoderInputs,
OutputBitStream obs,
StringBuilder sb)
Write a compact minimum representation of the data required to
reconstruct the decoder (bit lengths and correlated symbols).
|
void |
writeExternal(ObjectOutput out) |
protected void |
writeSymbolTable(com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Symbol2Byte symbol2byte,
OutputBitStream obs)
Write out the optional packed symbol table (symbol2byte).
|
protected static final org.apache.log4j.Logger log
protected static final transient byte VERSION0
public static final transient CanonicalHuffmanRabaCoder INSTANCE
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
readExternal
in interface Externalizable
IOException
ClassNotFoundException
public void writeExternal(ObjectOutput out) throws IOException
writeExternal
in interface Externalizable
IOException
public final boolean isKeyCoder()
IRabaCoder
true
if this implementation can code B+Tree keys
(supports search on the coded representation). Note that some
implementations can code either keys or values.isKeyCoder
in interface IRabaCoder
public final boolean isValueCoder()
IRabaCoder
true
if this implementation can code B+Tree values
(allows null
s). Note that some implementations can code
either keys or values.isValueCoder
in interface IRabaCoder
protected void writeSymbolTable(com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Symbol2Byte symbol2byte, OutputBitStream obs) throws IOException
symbol2byte
- The symbol table.obs
- The symbol table is written on this bit stream.IOException
protected static void writeDecoderInputs(HuffmanCodec.DecoderInputs decoderInputs, OutputBitStream obs, StringBuilder sb) throws IOException
min max count, symbol(s)where
min
is the bit length of the shortest code word;
max
is the bit length of the longest code word;
count
is the #of code words of a given length; and
symbol(s)
are the symbols associated with each code word.
All values are nibble coded and will generally fit in 1-2 bytes each.decoderInputs
- This contains both the bit lengths of the canonical huffman
code and the symbols assigned to each code word.obs
- The output bit stream.sb
- Debugging information is added to this buffer (optional).IOException
DecoderInputs}
protected static HuffmanCodec.DecoderInputs readDecoderInputs(int nsymbols, InputBitStream ibs, StringBuilder sb) throws IOException
HuffmanCodec.DecoderInputs
from the data written by
#writeDecoderInputs(BitVector[], OutputBitStream)
.nsymbols
- The #of symbols.ibs
- The input bit stream.sb
- Debugging information is added to this buffer (optional).IOException
protected long getSumCodedValueBitLengths(BitVector[] codeWords, IRaba raba, com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol)
coder
- The coder.raba
- The logical byte[][].byte2symbol
- The mapping from byte values to symbol indices.protected long writeCodedValues(PrefixCoder coder, IRaba raba, com.bigdata.btree.raba.codec.CanonicalHuffmanRabaCoder.Byte2Symbol byte2symbol, long[] codedValueOffset, OutputBitStream obs) throws IOException
coder
- The coder.raba
- The logical byte[][].byte2symbol
- The mapping from byte values to symbol indices.codedValueOffset
- An optional array dimensioned to nvalues+1
, where
nvalues is #of values in the logical byte[][]. When
given, the array will be populated with the relative bit
offset of the start of each coded value. The offsets are
relative to the start of the first coded value.IOException
public AbstractFixedByteArrayBuffer encode(IRaba raba, DataOutputBuffer buf)
IRabaCoder
Note: Implementations of this method are typically heavy. While it is
always valid to IRabaCoder.encode(IRaba, DataOutputBuffer)
an IRaba
, DO NOT invoke this arbitrarily on data which may already be
coded. The ICodedRaba
interface will always be implemented for
coded data.
encode
in interface IRabaCoder
raba
- The data.buf
- A buffer on which the coded data will be written.public ICodedRaba encodeLive(IRaba raba, DataOutputBuffer buf)
IRabaCoder
ICodedRaba
. Implementations of this
method should be optimized for the very common use case where the caller
requires immediate access to the coded data record. In that case, many of
the IRabaCoder
implementations can be optimized by passing the
underlying decoding object directly into an alternative constructor for
the ICodedRaba
. The byte[] slice for the coded data record is
available from ICodedRaba.data()
.
This method covers the vast major of the use cases for coding data, which
is to code B+Tree keys or values for a node or leaf that has been evicted
from the AbstractBTree
's write retention queue. The common use
case is to wrap a coded record that was read from an IRawStore
.
The IndexSegmentBuilder
is a special case, since the coded record
will not be used other than to write it on the disk.
encodeLive
in interface IRabaCoder
public ICodedRaba decode(AbstractFixedByteArrayBuffer data)
IRabaCoder
IRaba
which can access the coded data. In general,
implementations SHOULD NOT materialize a backing byte[][]. Instead, the
implementation should access the data in place within the caller's
buffer. Frequently used fields MAY be cached, but the whole point of the
IRabaCoder
is to minimize the in-memory footprint for the B+Tree
by using a coded (aka compressed) representation of the keys and values
whenever possible.decode
in interface IRabaCoder
data
- The record containing the coded data.public boolean isDuplicateKeys()
IRabaCoder
IRabaCoder
supports duplicate keys.isDuplicateKeys
in interface IRabaCoder
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.