public class FullTextIndex<V extends Comparable<V>> extends AbstractRelation

Type Parameters: V - The generic type of the document identifier.
The basic data model consists of documents, fields in documents, and tokens extracted by an analyzer from those fields.
The frequency distributions may be normalized to account for a variety of effects, producing "term weights": for example, normalizing for document length or for the relative frequency of a term in the overall collection. The logical model is therefore:
 
             token : {docId, freq?, weight?}+

 (For RDF, docId is the term identifier as assigned by the term:id index.)
 The freq and weight are optional values that are representative of the kinds of statistical data that are kept on a per-token-document basis. The freq is the token frequency (the frequency of occurrence of the token in the document). The weight is generally a normalized token frequency weight for the token in that document in the context of the overall collection.
In fact, we actually represent the data as follows:
 
             {sortKey(token), weight, docId, fldId} : {freq?, sorted(pos)+}

 That is, there is a distinct entry in the full text B+Tree for each field in
 each document in which a given token was recognized. The text of the token is
 not stored in the key, just the Unicode sort key generated from the token
 text. The value associated with the B+Tree entry is optional - it is simply
 not used unless we are storing statistics for the token-document pair. The
 advantages of this approach are: (a) it reuses the existing B+Tree data
 structures efficiently; (b) we are never faced with the possibility of
 overflow when a token is used in a large number of documents, since the
 entries for the token will simply be spread across several leaves in the
 B+Tree; (c) leading key compression makes the resulting B+Tree very
 efficient; and (d) in a scale-out range-partitioned index we can load balance
 the resulting index partitions by choosing the partition based on an even
 token boundary.
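
 The composite key can be pictured with a small sketch. This is purely
 illustrative and is not the IKeyBuilder code path actually used by the index;
 in particular, a real key builder must encode signed values so that unsigned
 lexicographic byte order agrees with numeric order, which is omitted here.

```java
import java.nio.ByteBuffer;

// Illustrative only: assembles {sortKey(token), weight, docId, fldId}
// following the formula above. The real index uses getKeyBuilder().
final class KeyLayoutSketch {

    static byte[] entryKey(final byte[] tokenSortKey, final double weight,
            final long docId, final int fieldId) {
        return ByteBuffer
                .allocate(tokenSortKey.length + Double.BYTES + Long.BYTES + Integer.BYTES)
                .put(tokenSortKey) // Unicode sort key, not the token text.
                .putDouble(weight) // normalized token weight.
                .putLong(docId)    // document identifier.
                .putInt(fieldId)   // field identifier.
                .array();
    }
}
```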
 
 A field is any pre-identified text container within a document. Field
 identifiers are 32-bit integers, so there are 2^32 distinct possible
 field identifiers. It is possible to manage the field identifiers through a
 secondary index, but that has no direct bearing on the structure of the full
 text index itself. Field identifiers appear after the token in the key so that
 queries may be expressed that will be matched against any field in the
 document. Likewise, field identifiers occur before the document identifier in
 the key since we always search across documents (in a search key, the
 document identifier is always Long.MIN_VALUE and the field identifier
 is always Integer.MIN_VALUE). There are many applications for fields:
 for example, distinct fields may be used for the title, abstract, and full
 text of a document, or for the CDATA section of each distinct element in
 documents corresponding to some DTD. The application is responsible for
 recognizing the fields in the document and producing the appropriate token
 stream, each of which must be tagged by the field.
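
 For example (the identifiers below are hypothetical, application-chosen
 values; the index itself attaches no meaning to them):

```java
// Hypothetical application-assigned field identifiers; the application
// tags each token stream with the field it was extracted from.
final class Fields {
    static final int TITLE    = 0;
    static final int ABSTRACT = 1;
    static final int FULLTEXT = 2;
}
```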
 
A query is tokenized, producing a (possibly normalized) token-frequency vector. The relevance of documents to the query is generally taken as the cosine between the query's and each document's (possibly normalized) token-frequency vectors. The main effort of search is assembling a token-frequency vector for just those documents that overlap with the query. This is done using a key-range scan for each token in the query against the full text index:
             fromKey := token, Long.MIN_VALUE
             toKey   := successor(token), Long.MIN_VALUE

 and extracting the appropriate token frequency, normalized token weight, or
 other statistic. When no value is associated with the entry we follow the
 convention of assuming a token frequency of ONE (1) for each document in
 which the token appears.
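
 The per-token range scan and the implied default frequency can be pictured
 with a short, self-contained sketch. This is a hypothetical illustration,
 not the actual implementation: a TreeMap stands in for the B+Tree, and
 string keys of the form "token|docId" stand in for the byte[] keys
 described above.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class RangeScanSketch {

    // Visit every entry for the given token: the analogue of the
    // fromKey/toKey key-range scan described above.
    static void scanToken(final NavigableMap<String, Integer> index, final String token) {
        for (Map.Entry<String, Integer> e
                : index.subMap(token + "|", token + "|\uffff").entrySet()) {
            // When no value is stored we assume a token frequency of ONE (1).
            final int freq = (e.getValue() == null) ? 1 : e.getValue();
            System.out.println(e.getKey() + " -> freq=" + freq);
        }
    }

    public static void main(final String[] args) {
        final NavigableMap<String, Integer> index = new TreeMap<>();
        index.put("free|0017", 2);    // "free" occurs twice in document 0017.
        index.put("free|0042", null); // no statistics: implied frequency of 1.
        index.put("freedom|0017", 1); // not visited: outside the range for "free".
        scanToken(index, "free");
    }
}
```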
 
 Tokenization is informed by the language code (when declared) and by the
 configured Locale for the database otherwise. An appropriate
 Analyzer is chosen based on the language code or Locale and
 the "document" is broken into a token-frequency distribution (alternatively a
 set of tokens). The same process is used to tokenize queries, and the API
 allows the caller to specify the language code used to select the
 Analyzer to tokenize the query.
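
 As a rough sketch of this process, the following uses Lucene's token-stream
 idiom to build a token-frequency distribution. It assumes Lucene 5.x or
 later, and StandardAnalyzer is a stand-in for whatever
 getAnalyzer(languageCode, filterStopwords) would actually select.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

final class TokenizeSketch {

    // Break text into a token-frequency distribution.
    static Map<String, Integer> termFrequencies(final String text) throws IOException {
        final Map<String, Integer> freq = new HashMap<>();
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("f", new StringReader(text))) {
            final CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                freq.merge(term.toString(), 1, Integer::sum);
            }
            ts.end();
        }
        return freq;
    }
}
```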
 
 Once the tokens are formed, the language code / Locale used to produce
 the token is discarded (it is not represented in the index). The reason for
 this is that we never utilize the total ordering of the full text index,
 merely the manner in which it groups tokens that map onto the same Unicode
 sort key together. Further, we use only a single Unicode collator
 configuration regardless of the language family in which the token was
 originally expressed. Unlike the collator used by the terms index (which
 is often set at IDENTICAL strength), the collator used by the full text
 index should be chosen such that it makes relatively few distinctions in
 order to increase recall (e.g., set at PRIMARY strength). Since a total order
 over the full text index is not critical from the perspective of its IR
 application, the Locale for the collator is likewise not critical, and
 PRIMARY strength will produce significantly shorter Unicode sort keys.
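
 The effect of PRIMARY strength on recall can be demonstrated with the
 standard java.text.Collator API:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Demonstrates why PRIMARY strength increases recall: case and accent
// differences collapse onto the same Unicode sort key.
final class SortKeySketch {
    public static void main(final String[] args) {
        final Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY);

        final CollationKey a = collator.getCollationKey("Resume");
        final CollationKey b = collator.getCollationKey("résumé");

        // At PRIMARY strength these compare as equal, so both tokens land
        // under the same key in the full text index.
        System.out.println(a.compareTo(b) == 0); // true
    }
}
```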
 
The term frequency within a given literal is an optional property associated with each term identifier, as is the computed weight for the token in that literal.
 Note: Documents should be tokenized using an Analyzer appropriate for
 their declared language code (if any). However, once tokenized, the language
 code is discarded and we perform search purely on the Unicode sort keys
 resulting from the extracted tokens.
 
Because the first component in the key is the token, both updates (when indexing documents) and queries (reading against different tokens) will be scattered across shards. Therefore it is not necessary to register a split handler for the full text index.
| Modifier and Type | Class and Description |
|---|---|
| static interface | FullTextIndex.Options - Options understood by the FullTextIndex. |
| Modifier and Type | Field and Description |
|---|---|
| static String | NAME_SEARCH - The basename of the search index. |
| Constructor and Description |
|---|
| FullTextIndex(IIndexManager indexManager, String namespace, Long timestamp, Properties properties) - Ctor specified by DefaultResourceLocator. |
| Modifier and Type | Method and Description |
|---|---|
| Hit<V>[] | _search(ITextIndexer.FullTextQuery query) |
| protected Hit<V>[] | applyRegex(Hit<V>[] hits, Pattern regex) Subclasses can override this method to do regex post-processing. |
| int | count(ITextIndexer.FullTextQuery query) Perform a range count on a full text query. |
| void | create() Conditionally registers the necessary index(s). |
| long | delete(IChunkedOrderedIterator itr) Remove elements from the relation. |
| void | destroy() Destroy any logically contained resources (relations, indices). |
| protected Hit<V>[] | executeQuery(TermFrequencyData<V> qdata, boolean prefixMatch, long timeout, TimeUnit unit) |
| protected org.apache.lucene.analysis.Analyzer | getAnalyzer(String languageCode, boolean filterStopwords) Return the token analyzer to be used for the given language code. |
| Class<?> | getElementClass() Return the class for the generic type of this relation. |
| IIndex | getIndex() The index used to associate term identifiers with tokens parsed from documents. |
| Set<String> | getIndexNames() Return the fully qualified name of each index maintained by this relation. |
| protected IKeyBuilder | getKeyBuilder() Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search. |
| IKeyOrder | getKeyOrder(IPredicate p) Return the IKeyOrder for the predicate corresponding to the perfect access path. |
| Iterator | getKeyOrders() Return the IKeyOrders corresponding to the registered indices for this relation. |
| IKeyOrder | getPrimaryKeyOrder() Return the IKeyOrder for the primary index for the relation. |
| protected org.apache.lucene.analysis.TokenStream | getTokenStream(String languageCode, Reader r, boolean filterStopwords) Tokenize text using an Analyzer that is appropriate to the specified language family. |
| void | index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r) See index(TokenBuffer, long, int, String, Reader, boolean). |
| void | index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r, boolean filterStopwords) Index a field in a document. |
| long | insert(IChunkedOrderedIterator itr) Write elements on the relation. |
| boolean | isOverwrite() Return the value configured by the FullTextIndex.Options.OVERWRITE property. |
| boolean | isReadOnly() |
| protected Hit<V>[] | matchExact(Hit<V>[] hits, String query) Subclasses can override this method to do exact match processing. |
| Object | newElement(List a, IBindingSet bindingSet) Create and return a new element. |
| Hiterator<Hit<V>> | search(ITextIndexer.FullTextQuery query) Performs a full text search against indexed documents returning a hit list. |
| protected Hit<V>[] | slice(ITextIndexer.FullTextQuery query, Hit<V>[] a) |
| protected TermFrequencyData<V> | tokenize(ITextIndexer.FullTextQuery query) |
Methods inherited from class AbstractRelation: getAccessPath, getAccessPath, getAccessPath, getFQN, getFQN, getFQN, getIndex, getIndex, getIndex, newAccessPath, newIndexMetadata

Methods inherited from class AbstractResource: acquireExclusiveLock, assertWritable, getBareProperties, getChunkCapacity, getChunkOfChunksCapacity, getChunkTimeout, getCommitTime, getContainer, getContainerNamespace, getExecutorService, getFullyBufferedReadThreshold, getIndexManager, getMaxParallelSubqueries, getNamespace, getProperties, getProperty, getProperty, getTimestamp, init, isForceSerialExecution, toString, unlock

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from implemented interfaces: getExecutorService, getIndexManager, getContainerNamespace, getNamespace, getTimestamp, init

public static final transient String NAME_SEARCH
The basename of the search index.
public FullTextIndex(IIndexManager indexManager, String namespace, Long timestamp, Properties properties)
Ctor specified by DefaultResourceLocator.
Parameters:
indexManager - The index manager. Configuration information is obtained from the index manager. See FullTextIndex.Options.

public IIndex getIndex()
The index used to associate term identifiers with tokens parsed from documents.
public boolean isOverwrite()
Return the value configured by the FullTextIndex.Options.OVERWRITE property.

public final boolean isReadOnly()
Overrides: isReadOnly in class AbstractResource

public void create()
Conditionally registers the necessary index(s).
Specified by: create in interface IMutableResource
Overrides: create in class AbstractResource
Throws: IllegalStateException - if the client does not have write access.

public void destroy()
Destroy any logically contained resources (relations, indices).
Specified by: destroy in interface IMutableResource
Overrides: destroy in class AbstractResource

protected org.apache.lucene.analysis.Analyzer getAnalyzer(String languageCode, boolean filterStopwords)
Return the token analyzer to be used for the given language code.
Parameters:
languageCode - The language code or null to use the default Locale.

protected final IKeyBuilder getKeyBuilder()
Return a ThreadLocal IKeyBuilder instance configured to support full text indexing and search.

public void index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r)
See index(TokenBuffer, long, int, String, Reader, boolean). Uses a default filterStopwords value of true.
public void index(TokenBuffer<V> buffer, V docId, int fieldId, String languageCode, Reader r, boolean filterStopwords)
 Note: This method does NOT force a write on the indices. If the buffer
 overflows, then there will be an index write. Once the caller is done
 indexing, they MUST invoke TokenBuffer.flush() to force any data
 remaining in their buffer to the indices.
 
Note: If a document is pre-existing, then the existing data for that document MUST be removed unless you know that the fields to be found in the document will not have changed (they may have different contents, but the same fields exist in the old and new versions of the document).
Parameters:
buffer - Used to buffer writes onto the text index.
docId - The document identifier.
fieldId - The field identifier.
languageCode - The language code -or- null to use the default Locale.
r - A reader on the text to be indexed.
filterStopwords - if true, filter stopwords from the token stream
See Also: TokenBuffer.flush()
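
A hedged usage sketch follows; the TokenBuffer constructor arguments, the locateIndex() helper, and the use of Long for V (as for RDF term identifiers) are assumptions for illustration, and Fields.FULLTEXT is the hypothetical field identifier from the earlier sketch.

```java
// Hypothetical usage sketch; see the assumptions noted above.
final FullTextIndex<Long> ndx = locateIndex(); // hypothetical helper.
final TokenBuffer<Long> buffer = new TokenBuffer<Long>(1000/* capacity */, ndx);

// Index the full text field of document 42, declared as English.
ndx.index(buffer, 42L/* docId */, Fields.FULLTEXT, "en",
        new java.io.StringReader("The quick brown fox"));

// The buffer only writes on the index when it overflows, so the caller
// MUST flush any remaining data once indexing is complete.
buffer.flush();
```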
protected org.apache.lucene.analysis.TokenStream getTokenStream(String languageCode, Reader r, boolean filterStopwords)
Tokenize text using an Analyzer that is appropriate to the specified language family.
Parameters:
languageCode - The language code -or- null to use the default Locale.
r - A reader on the text to be indexed.
filterStopwords - if true, filter stopwords from the token stream

public Hiterator<Hit<V>> search(ITextIndexer.FullTextQuery query)
Performs a full text search against indexed documents returning a hit list.
The basic algorithm computes the cosine between the term-frequency vector of the query and those of the indexed "documents". The cosine may be directly interpreted as the "relevance" of a "document" to the query. The query and document term-frequency vectors are normalized, so the cosine values are bounded in [0.0:1.0]. The higher the cosine, the more relevant the document is to the query. A cosine of less than 0.4 is rarely of any interest.
The implementation creates and runs a set of parallel tasks, one for each distinct token found in the query, and waits for those tasks to complete or for a timeout to occur. Each task uses a key-range scan on the terms index, collecting metadata for the matching "documents" and aggregating it on a "hit" for that document. Since the tasks run concurrently, there are concurrent writers on the "hits". On a timeout, the remaining tasks are interrupted.
 The collection of hits is scored and hits that fail a threshold are
 discarded. The remaining hits are placed into a total order and the
 caller is returned an iterator which can read from that order. If the
 operation is interrupted, then only those IHits that have already
 been computed will be returned.
Parameters:
query - The query (it will be parsed into tokens).
languageCode - The language code that should be used when tokenizing the query -or- null to use the default Locale.
minCosine - The minimum cosine that will be returned.
maxCosine - The maximum cosine that will be returned.
minRank - The min rank of the search result.
maxRank - The max rank of the search result.
prefixMatch - When true, the matches will be on tokens which include the query tokens as a prefix. This includes exact matches as a special case when the prefix is the entire token, but it also allows longer matches. For example, free will be an exact match on free but a partial match on freedom. When false, only exact matches will be made.
matchAllTerms - if true, return only hits that match all search terms
timeout - The timeout -or- ZERO (0) for NO timeout (this is equivalent to using Long.MAX_VALUE).
unit - The unit in which the timeout is expressed.
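
A similarly hedged search sketch follows; the FullTextQuery constructor arguments and the accessors on Hit are assumptions based on the parameters documented above, not a documented signature.

```java
// Hypothetical usage sketch; constructor arguments and Hit accessors
// are assumptions.
final ITextIndexer.FullTextQuery query = new ITextIndexer.FullTextQuery(
        "free will", // the query; it will be parsed into tokens.
        "en",        // languageCode used to tokenize the query.
        true         // prefixMatch: "free" also matches "freedom".
);

final Hiterator<Hit<Long>> hits = ndx.search(query);
while (hits.hasNext()) {
    final Hit<Long> hit = hits.next();
    System.out.println(hit.getDocId() + " : cosine=" + hit.getCosine());
}
```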
public int count(ITextIndexer.FullTextQuery query)
Perform a range count on a full text query.

protected TermFrequencyData<V> tokenize(ITextIndexer.FullTextQuery query)
public Hit<V>[] _search(ITextIndexer.FullTextQuery query)
protected Hit<V>[] slice(ITextIndexer.FullTextQuery query, Hit<V>[] a)
protected Hit<V>[] executeQuery(TermFrequencyData<V> qdata, boolean prefixMatch, long timeout, TimeUnit unit)
protected Hit<V>[] matchExact(Hit<V>[] hits, String query)
protected Hit<V>[] applyRegex(Hit<V>[] hits, Pattern regex)
public long delete(IChunkedOrderedIterator itr)
Remove elements from the relation.
Specified by: delete in interface IMutableRelation
Parameters:
itr - An iterator visiting the elements to be removed. Existing elements in the relation having a key equal to the key formed from the visited elements will be removed from the relation.

public long insert(IChunkedOrderedIterator itr)
Write elements on the relation.
Specified by: insert in interface IMutableRelation
Parameters:
itr - An iterator visiting the elements to be written.

public Set<String> getIndexNames()
Return the fully qualified name of each index maintained by this relation.
Specified by: getIndexNames in interface IRelation

public Iterator getKeyOrders()
Return the IKeyOrders corresponding to the registered indices for this relation (rather than getIndexNames?).
Specified by: getKeyOrders in interface IRelation

public Object newElement(List a, IBindingSet bindingSet)
Create and return a new element.
Specified by: newElement in interface IRelation
Parameters:
a - An ordered list of variables and/or constants.
bindingSet - A set of bindings.

public Class<?> getElementClass()
Return the class for the generic type of this relation.
Specified by: getElementClass in interface IRelation

public IKeyOrder getPrimaryKeyOrder()
Return the IKeyOrder for the primary index for the relation.
Specified by: getPrimaryKeyOrder in interface IRelation

public IKeyOrder getKeyOrder(IPredicate p)
Return the IKeyOrder for the predicate corresponding to the perfect access path. A perfect access path is one where the bound values in the predicate form a prefix in the key space of the corresponding index.
Specified by: getKeyOrder in interface IRelation
Returns: The IKeyOrder for the perfect access path -or- null if there is no index which provides a perfect access path for that predicate.

Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.