public class CustomByteArrayFrontCodedList extends it.unimi.dsi.fastutil.objects.AbstractObjectList<byte[]> implements Serializable, Cloneable
This class stores immutably a list of arrays in a single large array using
front coding (of course, the compression will be reasonable only if the list
is sorted lexicographically—see below). It implements an immutable
type-specific list that returns the i-th array when calling
get(i)
. The returned array may be freely
modified.
Front coding is based on the idea that if the i-th and the (i+1)-th array have a common prefix, we might store the length of the common prefix, and then the rest of the second array.
This approach, of course, requires that once in a while an array is stored
entirely. The ratio()
arrays). A higher ratio means more
compression, but means also a longer access time, as more arrays have to be
probed to build the result. Note that we must build an array every time
get(int)
is called, but this class provides also methods that
extract one of the stored arrays in a given array, reducing garbage
collection. See the documentation of the family of get()
methods.
By setting the ratio to 1 we actually disable front coding: however, we still have a data structure storing large list of arrays with a reduced overhead (just one integer per array, plus the space required for lengths).
Note that the typical usage of front-coded lists is under the form of serialized objects; usually, the data that has to be compacted is processed offline, and the resulting structure is stored permanently. Since the pointer array is not stored, the serialized format is very small.
All arrays are stored in a large array. A separate array of pointers indexes arrays whose position is a multiple of the ratio: thus, a higher ratio means also less pointers.
More in detail, an array whose position is a multiple of the ratio is stored
as the array length, followed by the elements of the array. The array length
is coded by a simple variable-length list of k-1 bit blocks, where
k is the number of bits of the underlying primitive type. All
other arrays are stored as follows: let common
the length of the
maximum common prefix between the array and its predecessor. Then we store
the array length decremented by common
, followed by
common
, followed by the array elements whose index is greater
than or equal to common
. For instance, if we store
foo, foobar, football and
fool in a front-coded character-array list with ratio 3, the
character array will contain
<b>3</b> f o o <b>3</b> <b>3</b> b a r <b>5</b> <b>3</b> t b a l l <b>4</b> f o o l
All arrays are stored in a large array: thus, the compressed list must not
exceed Integer.MAX_VALUE
elements. Moreover, iterators are
less efficient when they move back, as
previous()
cannot build
incrementally the previous array (whereas (
next()
can).
it.unimi.dsi.fastutil.bytes.ArrayFrontCodedList
, which is part
of fastutils. The folowing changes were made:
serialVersionUID
and the serialization logic
has been modified to allow serialization against DataOutput
by
defining getBackingBuffer()
and a new constructors that operate on a
byte[] slice.byte[] array
has been replaced by an interface
suitable for wrapping either a byte[]
or a ByteBuffer
.
This was done in order to permit access to the front-coded representation
without "de-serializing" the data and a suitable constructor was added for
the ByteBuffer
case.Collection
and Iterator
ctors strongly typed.Modifier and Type | Class and Description |
---|---|
static interface |
CustomByteArrayFrontCodedList.BackingBuffer
Abstraction allowing different implementations of the backing buffer.
|
Modifier and Type | Field and Description |
---|---|
protected int |
n
The number of arrays in the list.
|
protected int[] |
p
The pointers to entire arrays in the list.
|
protected int |
ratio
The ratio of this front-coded list.
|
Constructor and Description |
---|
CustomByteArrayFrontCodedList(Collection<byte[]> c,
int ratio)
Creates a new front-coded list containing the arrays in the given
collection.
|
CustomByteArrayFrontCodedList(Collection<byte[]> c,
int ratio,
boolean hasDups)
Creates a new front-coded list containing the arrays in the given
collection.
|
CustomByteArrayFrontCodedList(int n,
int ratio,
byte[] array)
Reconsitute an instance from just the coded byte[], the #of elements in
the array, and the ratio.
|
CustomByteArrayFrontCodedList(int n,
int ratio,
byte[] array,
int off,
int len,
boolean hasDups)
Reconsitute an instance from a slice byte[] containing the coded data,
the #of elements in the array, and the ratio.
|
CustomByteArrayFrontCodedList(int n,
int ratio,
ByteBuffer b)
Reconsitute an instance from just a
ByteBuffer view onto the
coded byte[], the #of elements in the array, and the ratio. |
CustomByteArrayFrontCodedList(Iterator<byte[]> arrays,
int ratio)
Creates a new front-coded list containing the arrays returned by the
given iterator.
|
CustomByteArrayFrontCodedList(Iterator<byte[]> arrays,
int ratio,
boolean hasDups) |
Modifier and Type | Method and Description |
---|---|
int |
arrayLength(int index)
Computes the length of the array at the given index.
|
Object |
clone()
Returns a copy of this list.
|
byte[] |
get(int index) |
int |
get(int index,
byte[] a)
Stores in the given array an array stored in this front-coded list.
|
int |
get(int index,
byte[] a,
int offset,
int length)
Stores in the given array elements from an array stored in this
front-coded list.
|
byte[] |
getArray(int index) |
CustomByteArrayFrontCodedList.BackingBuffer |
getBackingBuffer()
Return the backing buffer.
|
it.unimi.dsi.fastutil.objects.ObjectListIterator<byte[]> |
listIterator(int start) |
int |
ratio()
Returns the ratio of this list.
|
int |
search(byte[] a)
Search for the index of the value having the same data.
|
int |
size() |
String |
toString()
Modified to dump internal record metadata and to show the byte[]s as
unsigned values.
|
static String |
toString(byte[] key)
Formats a key as a series of comma delimited unsigned bytes.
|
static String |
toString(byte[] key,
int off,
int len)
Formats a key as a series of comma delimited unsigned bytes.
|
int |
writeOn(OutputStream os,
int index)
Write the specified byte[] onto a stream.
|
add, add, addAll, addAll, addElements, addElements, compareTo, contains, ensureIndex, ensureRestrictedIndex, equals, getElements, hashCode, indexOf, iterator, lastIndexOf, listIterator, objectListIterator, objectListIterator, objectSubList, peek, pop, push, remove, removeElements, set, size, subList, top
containsAll, isEmpty, objectIterator, removeAll, retainAll, toArray, toArray
clear, remove
finalize, getClass, notify, notifyAll, wait, wait, wait
clear, containsAll, isEmpty, remove, removeAll, retainAll, toArray, toArray
protected int n
protected int ratio
protected transient int[] p
public CustomByteArrayFrontCodedList(Iterator<byte[]> arrays, int ratio)
arrays
- an iterator returning arrays.ratio
- the desired ratio.public CustomByteArrayFrontCodedList(Iterator<byte[]> arrays, int ratio, boolean hasDups)
public CustomByteArrayFrontCodedList(Collection<byte[]> c, int ratio)
c
- a collection containing arrays.ratio
- the desired ratio.public CustomByteArrayFrontCodedList(Collection<byte[]> c, int ratio, boolean hasDups)
c
- a collection containing arrays.ratio
- the desired ratio.hasDups
- true
iff the list allows duplicate keys.public CustomByteArrayFrontCodedList(int n, int ratio, byte[] array)
n
- The #of elements in the array.ratio
- The ratio of this front-coded list.array
- The array containing the compressed arrays.public CustomByteArrayFrontCodedList(int n, int ratio, byte[] array, int off, int len, boolean hasDups)
n
- The #of elements in the array.ratio
- The ratio of this front-coded list.array
- The array containing the compressed arrays.hasDups
- true
iff the list allows duplicate keys.public CustomByteArrayFrontCodedList(int n, int ratio, ByteBuffer b)
ByteBuffer
view onto the
coded byte[], the #of elements in the array, and the ratio.n
- The #of elements in the array.ratio
- The ratio of this front-coded list.b
- The view onto the compressed arrays.public int ratio()
public int arrayLength(int index)
index
- an index.index
-th array.public byte[] getArray(int index)
get(int)
public int writeOn(OutputStream os, int index) throws IOException
os
- The stream.index
- The index of the byte[].IOException
public int get(int index, byte[] a, int offset, int length)
index
- an index.a
- the array that will store the result.offset
- an offset into a
where elements will be store.length
- a maximum number of elements to store in a
.a
can hold the extracted elements, the number of
extracted elements; otherwise, the number of remaining elements
with the sign changed.public int get(int index, byte[] a)
index
- an index.a
- the array that will store the content of the result (we assume
that it can hold the result).a
can hold the extracted elements, the number of
extracted elements; otherwise, the number of remaining elements
with the sign changed.public int size()
size
in interface Collection<byte[]>
size
in interface List<byte[]>
size
in class AbstractCollection<byte[]>
public it.unimi.dsi.fastutil.objects.ObjectListIterator<byte[]> listIterator(int start)
listIterator
in interface it.unimi.dsi.fastutil.objects.ObjectList<byte[]>
listIterator
in interface List<byte[]>
listIterator
in class it.unimi.dsi.fastutil.objects.AbstractObjectList<byte[]>
public Object clone()
public String toString()
toString
in class it.unimi.dsi.fastutil.objects.AbstractObjectList<byte[]>
public CustomByteArrayFrontCodedList.BackingBuffer getBackingBuffer()
public int search(byte[] a)
Note: The full length entry is coded every [i/ratio] entries. However, the subsequent entries code their common length with respect to the previous front-coded entry NOT to the full length entry. This means that the length of the common prefix can increase or decrease as we scan a bucket and the #of already matched bytes can increase or decrease as well. Consider the following example, when coded with a ratio of 8.
[121, 59, 18, 79, 99, 112, 24, 116], // #0 rlen=8, clen=0 (new bucket) [121, 59, 18, 79, 99, 112, 43, 68], // #1 rlen=2, clen=6 [121, 59, 18, 79, 99, 112, 46, 78], // #2 rlen=2, clen=6 [121, 59, 18, 79, 99, 112, 54, 48], // #3 rlen=2, clen=6 [121, 59, 18, 79, 99, 112, 54, 108], // #4 rlen=1, clen=7 (***) [121, 59, 18, 79, 99, 112, 55, 81], // #5 rlen=2, clen=6 [121, 59, 18, 79, 99, 112, 62, 85], // #6 rlen=2, clen=6 [121, 59, 18, 79, 99, 112, 63, 110], // #7 rlen=8, clen=0 (new bucket) [121, 59, 18, 79, 99, 112, 71, 124], // #8 ... [121, 59, 18, 79, 99, 112, 73, 49] // #9 ...The common length grows for entry #4 because the [54] in the next to last byte in the array already appears in the same position in the previous front-coded entry. However, the common length decreases again for the next entry (#5).
The following rules guide the linear search of the bucket identified by the binary search.
clen GT mlen
, then skip to the next entry in the
bucket as no match is possible (the common prefix was demonstrated to be
longer than the matched prefix on a previous entry).mlen += prefixLength
, where prefixLength is
the length of the matched prefix from step 2.
a
- The search probe, which is interpreted as an
unsigned byte[](-(insertion point) - 1)
. The insertion point is
defined as the point at which the key would be inserted. Note
that this guarantees that the return value will be >= 0 if and
only if the key is found.public static final String toString(byte[] key)
key
- The key.public static final String toString(byte[] key, int off, int len)
key
- The key.off
- The index of the first byte that will be visited.len
- The #of bytes to visit.Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.