Type Parameters:
V - The generic type of the document identifier.

public class FullTextIndex<V extends Comparable<V>>
extends AbstractRelation
The basic data model consists of documents, fields in documents, and tokens extracted by an analyzer from those fields.
The frequency distributions may be normalized to account for a variety of effects, producing "term weights": for example, normalizing for document length or for the relative frequency of a term in the overall collection. Therefore the logical model is:
token : {docId, freq?, weight?}+

(For RDF, docId is the term identifier as assigned by the term:id index.)
The freq and weight are optional values that are representative of the kinds of statistical data that are kept on a per-token-document basis. The freq is the token frequency (the frequency of occurrence of the token in the document). The weight is generally a normalized token frequency weight for the token in that document in the context of the overall collection.
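For concreteness, here is a minimal sketch of one common weight normalization (an illustration only, not necessarily the normalization applied by this class): each raw token frequency is divided by the Euclidean norm of the document's frequency vector, yielding a unit-length weight vector.

```java
import java.util.HashMap;
import java.util.Map;

public class TermWeightsSketch {

    /**
     * Given raw token frequencies for one document, compute unit-length
     * normalized weights (freq / ||freq||). A sketch of one common
     * normalization; the index may apply other collection-level adjustments.
     */
    public static Map<String, Double> normalize(Map<String, Integer> freq) {
        double sumOfSquares = 0d;
        for (int f : freq.values()) {
            sumOfSquares += (double) f * f;
        }
        final double norm = Math.sqrt(sumOfSquares);
        final Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            weights.put(e.getKey(), norm == 0d ? 0d : e.getValue() / norm);
        }
        return weights;
    }
}
```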
In practice, we represent the data as follows:
{sortKey(token), weight, docId, fldId} : {freq?, sorted(pos)+}

That is, there is a distinct entry in the full text B+Tree for each field in each document in which a given token was recognized. The text of the token is not stored in the key, just the Unicode sort key generated from the token text. The value associated with the B+Tree entry is optional - it is simply not used unless we are storing statistics for the token-document pair. The advantages of this approach are: (a) it reuses the existing B+Tree data structures efficiently; (b) we are never faced with the possibility of overflow when a token is used in a large number of documents - the entries for the token will simply be spread across several leaves in the B+Tree; (c) leading key compression makes the resulting B+Tree very efficient; and (d) in a scale-out range partitioned index we can load balance the resulting index partitions by choosing the partition based on an even token boundary.
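The sketch below illustrates how such an entry key might be composed, following the {sortKey(token), weight, docId, fldId} layout given above. It uses java.text.Collator and java.nio.ByteBuffer directly for illustration; the class itself composes keys with its configured IKeyBuilder (see getKeyBuilder()), and a real key builder must encode the numeric components so that unsigned byte order agrees with numeric order.

```java
import java.nio.ByteBuffer;
import java.text.Collator;
import java.util.Locale;

public class EntryKeySketch {

    /**
     * Compose an illustrative index entry key: {sortKey(token), weight,
     * docId, fldId}. Only the Unicode sort key of the token is stored,
     * never the token text itself.
     */
    public static byte[] entryKey(String token, double weight, long docId, int fieldId) {
        final Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY); // few distinctions, short keys
        final byte[] sortKey = collator.getCollationKey(token).toByteArray();
        // NOTE: the raw IEEE-754 / two's-complement encodings used here do
        // NOT preserve numeric order under unsigned byte comparison; a real
        // key builder applies order-preserving encodings instead.
        final ByteBuffer buf = ByteBuffer.allocate(sortKey.length + 8 + 8 + 4);
        buf.put(sortKey);       // groups all entries for the token together
        buf.putDouble(weight);
        buf.putLong(docId);
        buf.putInt(fieldId);
        return buf.array();
    }
}
```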
A field is any pre-identified text container within a document. Field identifiers are 32-bit integers, so there are 2^32 distinct possible field identifiers. It is possible to manage the field identifiers through a secondary index, but that has no direct bearing on the structure of the full text index itself. Field identifiers appear after the token in the key so that queries may be expressed that will be matched against any field in the document. Likewise, field identifiers occur before the document identifier in the key since we always search across documents (in a search key, the document identifier is always Long.MIN_VALUE and the field identifier is always Integer.MIN_VALUE). There are many applications for fields: for example, distinct fields may be used for the title, abstract, and full text of a document, or for the CDATA section of each distinct element in documents corresponding to some DTD. The application is responsible for recognizing the fields in the document and producing the appropriate token stream, each of which must be tagged by the field.
A query is tokenized, producing a (possibly normalized) token-frequency vector. The relevance of documents to the query is generally taken as the cosine between the query's and each document's (possibly normalized) token-frequency vectors. The main effort of search is assembling a token-frequency vector for just those documents that overlap with the query. This is done using a key-range scan for each token in the query against the full text index:
fromKey := token, Long.MIN_VALUE
toKey   := successor(token), Long.MIN_VALUE

The scan extracts the appropriate token frequency, normalized token weight, or other statistic from each visited entry. When no value is associated with the entry we follow the convention of assuming a token frequency of ONE (1) for each document in which the token appears.
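As a sketch of the relevance computation itself (an illustration, not this class's internal code), the cosine between the query vector and a document vector assembled by those scans can be computed as follows:

```java
import java.util.Map;

public class CosineSketch {

    /**
     * Cosine between two sparse term vectors. Only tokens present in both
     * vectors contribute to the dot product, which is why search only needs
     * to visit documents that overlap the query.
     */
    public static double cosine(Map<String, Double> query, Map<String, Double> doc) {
        double dot = 0d;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            final Double d = doc.get(e.getKey());
            if (d != null) {
                dot += e.getValue() * d;
            }
        }
        final double denom = norm(query) * norm(doc);
        return denom == 0d ? 0d : dot / denom;
    }

    private static double norm(Map<String, Double> v) {
        double ss = 0d;
        for (double x : v.values()) {
            ss += x * x;
        }
        return Math.sqrt(ss);
    }
}
```

When the vectors are already unit-normalized, as described above, the denominator is one and the cosine reduces to the dot product.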
Tokenization is informed by the language code (when declared) and by the configured Locale for the database otherwise. An appropriate Analyzer is chosen based on the language code or Locale, and the "document" is broken into a token-frequency distribution (alternatively, a set of tokens). The same process is used to tokenize queries, and the API allows the caller to specify the language code used to select the Analyzer to tokenize the query.
Once the tokens are formed, the language code / Locale used to produce the token is discarded (it is not represented in the index). The reason for this is that we never utilize the total ordering of the full text index, merely the manner in which it groups tokens that map onto the same Unicode sort key together. Further, we use only a single Unicode collator configuration regardless of the language family in which the token was originally expressed. Unlike the collator used by the terms index (which often is set at IDENTICAL strength), the collator used by the full text index should be chosen such that it makes relatively few distinctions in order to increase recall (e.g., set at PRIMARY strength). Since a total order over the full text index is not critical from the perspective of its IR application, the Locale for the collator is likewise not critical, and PRIMARY strength will produce significantly shorter Unicode sort keys.
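The effect of collator strength can be seen with the standard java.text.Collator (shown for illustration; the index's actual collator configuration is governed by its options). At PRIMARY strength, case and accent differences collapse onto the same sort key, which is what increases recall:

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthDemo {
    public static void main(String[] args) {
        final Collator primary = Collator.getInstance(Locale.ROOT);
        primary.setStrength(Collator.PRIMARY);
        // Case and accent differences are ignored at PRIMARY strength, so
        // these tokens map onto the same (short) Unicode sort key.
        System.out.println(primary.compare("resume", "Résumé")); // prints 0

        final Collator identical = Collator.getInstance(Locale.ROOT);
        identical.setStrength(Collator.IDENTICAL);
        // At IDENTICAL strength (as often used by the terms index) the same
        // two tokens remain distinct.
        System.out.println(identical.compare("resume", "Résumé")); // non-zero
    }
}
```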
For RDF, the term frequency within the indexed literal is an optional property associated with each term identifier, as is the computed weight for the token in that literal.
Note: Documents should be tokenized using an Analyzer appropriate for their declared language code (if any). However, once tokenized, the language code is discarded and we perform search purely on the Unicode sort keys resulting from the extracted tokens.
Because the first component in the key is the token, both updates (when indexing documents) and queries (reading against different tokens) will be scattered across shards. Therefore it is not necessary to register a split handler for the full text index.
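Before the API reference below, a hypothetical end-to-end sketch of indexing and searching. The FullTextIndex constructor follows the signature documented below; the namespace, the ITx.UNISOLATED timestamp constant, the TokenBuffer constructor arguments, and the single-argument FullTextQuery constructor are assumptions for illustration (Blazegraph imports omitted for brevity).

```java
import java.io.StringReader;
import java.util.Properties;

public class FullTextIndexUsageSketch {

    public static void demo(IIndexManager indexManager) {
        // Hypothetical setup; real configuration comes via FullTextIndex.Options.
        final FullTextIndex<Long> ndx = new FullTextIndex<Long>(
                indexManager, "kb.search" /* assumed namespace */,
                ITx.UNISOLATED /* assumed timestamp constant */,
                new Properties());
        ndx.create(); // conditionally registers the index

        // Index field 0 of document 1 (the TokenBuffer ctor args are assumed).
        final TokenBuffer<Long> buffer = new TokenBuffer<Long>(1000, ndx);
        ndx.index(buffer, 1L, 0 /* fieldId */, "en",
                new StringReader("The quick brown fox"));
        buffer.flush(); // MUST flush remaining buffered data to the indices

        // Search; the FullTextQuery ctor shown here is an assumption.
        final Hiterator<Hit<Long>> hits =
                ndx.search(new ITextIndexer.FullTextQuery("quick fox"));
        while (hits.hasNext()) {
            System.out.println(hits.next());
        }
    }
}
```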
Modifier and Type | Class and Description |
---|---|
static interface | FullTextIndex.Options: Options understood by the FullTextIndex. |
Modifier and Type | Field and Description |
---|---|
static String | NAME_SEARCH: The basename of the search index. |
Fields inherited from class AbstractResource: indexManager
Constructor and Description |
---|
FullTextIndex(IIndexManager indexManager, String namespace, Long timestamp, Properties properties): Ctor specified by DefaultResourceLocator. |
Modifier and Type | Method and Description |
---|---|
Hit<V>[] | _search(ITextIndexer.FullTextQuery query) |
protected Hit<V>[] | applyRegex(Hit<V>[] hits, Pattern regex): Subclasses can override this method to do regex post-processing. |
int | count(ITextIndexer.FullTextQuery query): Perform a range count on a full text query. |
void | create(): Conditionally registers the necessary index(s). |
long | delete(IChunkedOrderedIterator itr): Remove elements from the relation. |
void | destroy(): Destroy any logically contained resources (relations, indices). |
protected Hit<V>[] | executeQuery(TermFrequencyData<V> qdata, boolean prefixMatch, long timeout, TimeUnit unit) |
protected org.apache.lucene.analysis.Analyzer | getAnalyzer(String languageCode, boolean filterStopwords): Return the token analyzer to be used for the given language code. |
Class<?> | getElementClass(): Return the class for the generic type of this relation. |
IIndex | getIndex(): The index used to associate term identifiers with tokens parsed from documents. |
Set<String> | getIndexNames(): Return the fully qualified name of each index maintained by this relation. |
protected IKeyBuilder | getKeyBuilder(): Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search. |
IKeyOrder | getKeyOrder(IPredicate p): Return the IKeyOrder for the predicate corresponding to the perfect access path. |
Iterator | getKeyOrders(): Return the IKeyOrders corresponding to the registered indices for this relation. |
IKeyOrder | getPrimaryKeyOrder(): Return the IKeyOrder for the primary index for the relation. |
protected org.apache.lucene.analysis.TokenStream | getTokenStream(String languageCode, Reader r, boolean filterStopwords): Tokenize text using an Analyzer that is appropriate to the specified language family. |
void | index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r): See index(TokenBuffer, long, int, String, Reader, boolean). |
void | index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r, boolean filterStopwords): Index a field in a document. |
long | insert(IChunkedOrderedIterator itr): Write elements on the relation. |
boolean | isOverwrite(): Return the value configured by the FullTextIndex.Options.OVERWRITE property. |
boolean | isReadOnly() |
protected Hit<V>[] | matchExact(Hit<V>[] hits, String query): Subclasses can override this method to do exact match processing. |
Object | newElement(List a, IBindingSet bindingSet): Create and return a new element. |
Hiterator<Hit<V>> | search(ITextIndexer.FullTextQuery query): Performs a full text search against indexed documents returning a hit list. |
protected Hit<V>[] | slice(ITextIndexer.FullTextQuery query, Hit<V>[] a) |
protected TermFrequencyData<V> | tokenize(ITextIndexer.FullTextQuery query) |
Methods inherited from class AbstractRelation: getAccessPath, getAccessPath, getAccessPath, getFQN, getFQN, getFQN, getIndex, getIndex, getIndex, newAccessPath, newIndexMetadata

Methods inherited from class AbstractResource: acquireExclusiveLock, assertWritable, getBareProperties, getChunkCapacity, getChunkOfChunksCapacity, getChunkTimeout, getCommitTime, getContainer, getContainerNamespace, getExecutorService, getFullyBufferedReadThreshold, getIndexManager, getMaxParallelSubqueries, getNamespace, getProperties, getProperty, getProperty, getTimestamp, init, isForceSerialExecution, toString, unlock

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface IRelation: getExecutorService, getIndexManager

Methods inherited from interface ILocatableResource: getContainerNamespace, getNamespace, getTimestamp, init
public static final transient String NAME_SEARCH
public FullTextIndex(IIndexManager indexManager, String namespace, Long timestamp, Properties properties)

Ctor specified by DefaultResourceLocator.

Parameters:
client - The client. Configuration information is obtained from the client. See FullTextIndex.Options.

See Also: FullTextIndex.Options
public IIndex getIndex()
public boolean isOverwrite()

Return the value configured by the FullTextIndex.Options.OVERWRITE property.

public final boolean isReadOnly()

Overrides: isReadOnly in class AbstractResource
public void create()

Specified by: create in interface IMutableResource
Overrides: create in class AbstractResource
Throws: IllegalStateException - if the client does not have write access.

public void destroy()

Description copied from interface: IMutableResource
Specified by: destroy in interface IMutableResource
Overrides: destroy in class AbstractResource
protected org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode, boolean filterStopwords)
Return the token analyzer to be used for the given language code.

Parameters:
languageCode - The language code or null to use the default Locale.

protected final IKeyBuilder getKeyBuilder()

Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search.

public void index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r)

See index(TokenBuffer, long, int, String, Reader, boolean). Uses a default filterStopwords value of true.
public void index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r, boolean filterStopwords)
Note: This method does NOT force a write on the indices. If the buffer overflows, then there will be an index write. Once the caller is done indexing, they MUST invoke TokenBuffer.flush() to force any data remaining in their buffer to the indices.

Note: If a document is pre-existing, then the existing data for that document MUST be removed unless you know that the fields to be found in the document will not have changed (they may have different contents, but the same fields exist in the old and new versions of the document).
Parameters:
buffer - Used to buffer writes onto the text index.
docId - The document identifier.
fieldId - The field identifier.
languageCode - The language code -or- null to use the default Locale.
r - A reader on the text to be indexed.
filterStopwords - if true, filter stopwords from the token stream

See Also: TokenBuffer.flush()
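A minimal sketch of the indexing pattern this method implies (the TokenBuffer constructor arguments and the field identifier assignments are assumptions for illustration; the flush() requirement is from the note above):

```java
import java.io.StringReader;

public class IndexFieldsSketch {

    /**
     * Index two fields of one document, then flush. The TokenBuffer
     * constructor arguments (capacity, index) are assumed for illustration.
     */
    public static <V extends Comparable<V>> void indexDocument(
            FullTextIndex<V> ndx, V docId, String title, String body) {
        final TokenBuffer<V> buffer = new TokenBuffer<V>(1000, ndx);
        // Distinct field identifiers for the title and the full text.
        ndx.index(buffer, docId, 0 /* title */, "en", new StringReader(title));
        ndx.index(buffer, docId, 1 /* body */, "en", new StringReader(body));
        // index(...) does NOT force a write; flush() pushes buffered data.
        buffer.flush();
    }
}
```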
protected org.apache.lucene.analysis.TokenStream getTokenStream(String languageCode, Reader r, boolean filterStopwords)
Tokenize text using an Analyzer that is appropriate to the specified language family.

Parameters:
languageCode - The language code -or- null to use the default Locale.
r - A reader on the text to be indexed.
filterStopwords - if true, filter stopwords from the token stream

public Hiterator<Hit<V>> search(ITextIndexer.FullTextQuery query)
The basic algorithm computes the cosine between the term-frequency vector of the query and that of each indexed "document". The cosine may be directly interpreted as the "relevance" of a "document" to the query. The query and document term-frequency vectors are normalized, so the cosine values are bounded in [0.0:1.0]. The higher the cosine, the more relevant the document is to the query. A cosine of less than .4 is rarely of any interest.
The implementation creates and runs a set of parallel tasks, one for each distinct token found in the query, and waits for those tasks to complete or for a timeout to occur. Each task uses a key-range scan on the terms index, collecting metadata for the matching "documents" and aggregating it on a "hit" for that document. Since the tasks run concurrently, there are concurrent writers on the "hits". On a timeout, the remaining tasks are interrupted.
The collection of hits is scored and hits that fail a threshold are discarded. The remaining hits are placed into a total order and the caller is returned an iterator which can read from that order. If the operation is interrupted, then only those IHits that have already been computed will be returned.
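The described evaluation strategy can be sketched as follows (a simplification, not the class's internal code; the per-token key-range scan is left as a placeholder):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class QueryEvaluationSketch {

    /**
     * One task per distinct query token; each task performs a key-range scan
     * and aggregates per-document metadata onto a shared "hit" map with
     * concurrent writers. A simplification of the strategy described above.
     */
    static Map<Long, Double> evaluate(ExecutorService service,
            Set<String> queryTokens, long timeout, TimeUnit unit)
            throws InterruptedException {
        final ConcurrentHashMap<Long, Double> hits = new ConcurrentHashMap<>();
        final List<Callable<Void>> tasks = queryTokens.stream()
                .map(token -> (Callable<Void>) () -> {
                    // Placeholder for the key-range scan over one token:
                    // for each (docId, weight) visited, fold into the hit:
                    // hits.merge(docId, weight, Double::sum);
                    return null;
                })
                .collect(Collectors.toList());
        // Wait for completion or timeout; invokeAll cancels tasks that have
        // not completed when the timeout expires, matching the
        // interrupt-on-timeout behavior described above.
        service.invokeAll(tasks, timeout, unit);
        return hits;
    }
}
```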
Parameters:
query - The query (it will be parsed into tokens).
languageCode - The language code that should be used when tokenizing the query -or- null to use the default Locale.
minCosine - The minimum cosine that will be returned.
maxCosine - The maximum cosine that will be returned.
minRank - The min rank of the search result.
maxRank - The max rank of the search result.
prefixMatch - When true, the matches will be on tokens which include the query tokens as a prefix. This includes exact matches as a special case when the prefix is the entire token, but it also allows longer matches. For example, free will be an exact match on free but a partial match on freedom. When false, only exact matches will be made.
matchAllTerms - if true, return only hits that match all search terms
timeout - The timeout -or- ZERO (0) for NO timeout (this is equivalent to using Long.MAX_VALUE).
unit - The unit in which the timeout is expressed.

public int count(ITextIndexer.FullTextQuery query)

Perform a range count on a full text query.
protected TermFrequencyData<V> tokenize(ITextIndexer.FullTextQuery query)
public Hit<V>[] _search(ITextIndexer.FullTextQuery query)
protected Hit<V>[] slice(ITextIndexer.FullTextQuery query, Hit<V>[] a)
protected Hit<V>[] executeQuery(TermFrequencyData<V> qdata, boolean prefixMatch, long timeout, TimeUnit unit)
protected Hit<V>[] matchExact(Hit<V>[] hits, String query)
protected Hit<V>[] applyRegex(Hit<V>[] hits, Pattern regex)
public long delete(IChunkedOrderedIterator itr)

Description copied from interface: IMutableRelation

Parameters:
itr - An iterator visiting the elements to be removed. Existing elements in the relation having a key equal to the key formed from the visited elements will be removed from the relation.

public long insert(IChunkedOrderedIterator itr)

Description copied from interface: IMutableRelation

Parameters:
itr - An iterator visiting the elements to be written.

public Set<String> getIndexNames()
Description copied from interface: IRelation

public Iterator getKeyOrders()

Description copied from interface: IRelation

Return the IKeyOrders corresponding to the registered indices for this relation. [rather than getIndexNames?]

public Object newElement(List a, IBindingSet bindingSet)
Description copied from interface: IRelation

Parameters:
a - An ordered list of variables and/or constants.
bindingSet - A set of bindings.

public Class<?> getElementClass()

Description copied from interface: IRelation

public IKeyOrder getPrimaryKeyOrder()

Description copied from interface: IRelation

Return the IKeyOrder for the primary index for the relation.

public IKeyOrder getKeyOrder(IPredicate p)

Description copied from interface: IRelation

Return the IKeyOrder for the predicate corresponding to the perfect access path. A perfect access path is one where the bound values in the predicate form a prefix in the key space of the corresponding index.

Returns: The IKeyOrder for the perfect access path -or- null if there is no index which provides a perfect access path for that predicate.

Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.