public class TextPattern extends Object implements Serializable, CharSequence
The regular expression facilities of the Java API are a powerful tool; however, when searching for a constant pattern, many algorithms can speed up a search by orders of magnitude.
This class provides constant-pattern text search facilities by implementing Sunday's QuickSearch (a simplified but very effective variant of the Boyer-Moore search algorithm) using compact approximators, a randomised data structure that can accommodate in a small space (but in an approximated way) the bad-character shift table of a large alphabet such as Unicode.
Since a large subset of US-ASCII (e.g., whitespace and punctuation) is used in all languages, this class caches separately the shifts for the first 128 Unicode characters, resulting in very good performance even on text in pure US-ASCII.
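The following is a minimal sketch of the QuickSearch idea described above, not the code of this class. It keeps an exact shift array for the first 128 characters, mirroring the separate US-ASCII cache, and, as a simplifying assumption, replaces the compact approximator with an exact HashMap for all other characters. The class name QuickSearchSketch and its members are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Illustrative QuickSearch sketch; all names here are hypothetical. */
final class QuickSearchSketch {
    private final char[] pattern;
    // Exact shifts for the first 128 characters (the US-ASCII range).
    private final int[] asciiShift = new int[128];
    // Exact shifts for all other characters. Assumption: a plain map stands in
    // for the compact approximator mentioned in the class description.
    private final Map<Character, Integer> otherShift = new HashMap<>();

    QuickSearchSketch(CharSequence p) {
        pattern = p.toString().toCharArray();
        final int m = pattern.length;
        // A character that does not occur in the pattern lets the window jump past it.
        Arrays.fill(asciiShift, m + 1);
        for (int i = 0; i < m; i++) {
            final char c = pattern[i];
            // Later occurrences overwrite earlier ones, so the rightmost one wins.
            if (c < 128) asciiShift[c] = m - i;
            else otherShift.put(c, m - i);
        }
    }

    private int shift(char c) {
        if (c < 128) return asciiShift[c];
        return otherShift.getOrDefault(c, pattern.length + 1);
    }

    /** Returns the first match lying entirely in [from, to), or -1 if there is none. */
    int search(CharSequence s, int from, int to) {
        final int m = pattern.length;
        final int end = Math.min(to, s.length());
        int i = Math.max(from, 0);
        while (i + m <= end) {
            int j = 0;
            while (j < m && s.charAt(i + j) == pattern[j]) j++;
            if (j == m) return i;
            if (i + m >= end) break; // no character after the window to inspect
            // QuickSearch shifts on the character just after the current window.
            i += shift(s.charAt(i + m));
        }
        return -1;
    }
}
```

Because the shift depends only on the character following the window, the setup cost is a single pass over the pattern, which is what makes repeated searches with the same pattern cheap.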
Note that the indexOf methods of MutableString use an even more simplified variant of QuickSearch, which is less efficient but has a smaller setup time and does not create any objects. The search facilities provided by this class are targeted at searches on very large texts, repeated searches with the same pattern, and case-insensitive searches.
Instances of this class are immutable and thread-safe.
This class is experimental: APIs could change with the next release.
Modifier and Type | Field and Description |
---|---|
static int | CASE_INSENSITIVE - Enables case-insensitive matching. |
protected char[] | pattern - The pattern backing array. |
static int | UNICODE_CASE - Enables Unicode-aware case folding. |
Constructor and Description |
---|
TextPattern(CharSequence pattern) - Creates a new case-sensitive TextPattern object that can be used to search for the given pattern. |
TextPattern(CharSequence pattern, int flags) - Creates a new TextPattern object that can be used to search for the given pattern. |
Modifier and Type | Method and Description |
---|---|
boolean | caseInsensitive() - Returns whether this pattern is case insensitive. |
char | charAt(int i) |
boolean | equals(Object o) - Compares this text pattern to another object. |
int | hashCode() - Returns a hash code for this text pattern. |
int | length() |
int | search(byte[] a) - Returns the index of the first occurrence of this pattern in the given byte array. |
int | search(byte[] a, int from) - Returns the index of the first occurrence of this pattern in the given byte array starting from a given index. |
int | search(byte[] a, int from, int to) - Returns the index of the first occurrence of this pattern in the given byte array between given indices. |
int | search(char[] array) - Returns the index of the first occurrence of this pattern in the given character array. |
int | search(char[] array, int from) - Returns the index of the first occurrence of this pattern in the given character array starting from a given index. |
int | search(char[] a, int from, int to) - Returns the index of the first occurrence of this pattern in the given character array between given indices. |
int | search(it.unimi.dsi.fastutil.chars.CharList list) - Returns the index of the first occurrence of this pattern in the given character list. |
int | search(it.unimi.dsi.fastutil.chars.CharList list, int from) - Returns the index of the first occurrence of this pattern in the given character list starting from a given index. |
int | search(it.unimi.dsi.fastutil.chars.CharList list, int from, int to) - Returns the index of the first occurrence of this pattern in the given character list between given indices. |
int | search(CharSequence s) - Returns the index of the first occurrence of this pattern in the given character sequence. |
int | search(CharSequence s, int from) - Returns the index of the first occurrence of this pattern in the given character sequence starting from a given index. |
int | search(CharSequence s, int from, int to) - Returns the index of the first occurrence of this pattern in the given character sequence between given indices. |
CharSequence | subSequence(int from, int to) |
String | toString() |
boolean | unicodeCase() - Returns whether this pattern uses Unicode case folding. |
public static final int CASE_INSENSITIVE
By default, case-insensitive matching assumes that only characters in the ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.
Case-insensitivity involves a performance drop.
public static final int UNICODE_CASE
When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the ASCII charset are being matched.
Unicode-aware case folding is very expensive (two method calls per examined non-ASCII character).
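As an illustration of the difference between the two flags, here is a hedged sketch of per-character folding. The source only states that Unicode-aware folding costs two method calls per examined non-ASCII character; the specific pair of calls used below (Character.toUpperCase followed by Character.toLowerCase) is an assumption, and the class CaseFoldSketch is hypothetical.

```java
/** Illustrative case-folding sketch; not the code used by TextPattern. */
final class CaseFoldSketch {
    /** ASCII-only folding: maps 'A'..'Z' to 'a'..'z' and leaves every other character untouched. */
    static char foldAscii(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    /** Unicode-aware folding: cheap path for ASCII, two Character calls otherwise. */
    static char foldUnicode(char c) {
        if (c < 128) return foldAscii(c);
        // Assumption: upper-then-lower folding, applied only to non-ASCII characters.
        return Character.toLowerCase(Character.toUpperCase(c));
    }
}
```

With ASCII-only folding, an accented capital such as U+00C9 does not match its lowercase form U+00E9; with Unicode-aware folding it does, at the price of the extra calls.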
protected char[] pattern
public TextPattern(CharSequence pattern)
Creates a new case-sensitive TextPattern object that can be used to search for the given pattern.
Parameters:
pattern - the constant pattern to search for.
public TextPattern(CharSequence pattern, int flags)
Creates a new TextPattern object that can be used to search for the given pattern.
Parameters:
pattern - the constant pattern to search for.
flags - a bit mask that may include CASE_INSENSITIVE and UNICODE_CASE.
public boolean caseInsensitive()
public boolean unicodeCase()
public int length()
Specified by: length in interface CharSequence
public char charAt(int i)
Specified by: charAt in interface CharSequence
public CharSequence subSequence(int from, int to)
Specified by: subSequence in interface CharSequence
public int search(char[] array)
Parameters:
array - the character array to look in.
Returns:
the index of the first occurrence of this pattern in the given array, or -1 if the pattern cannot be found.
public int search(char[] array, int from)
Parameters:
array - the character array to look in.
from - the index from which the search must start.
Returns:
the index of the first occurrence of this pattern starting from from (inclusive), or -1 if the pattern cannot be found.
public int search(char[] a, int from, int to)
Parameters:
a - the character array to look in.
from - the index from which the search must start.
to - the index at which the search must end.
Returns:
the index of the first occurrence of this pattern between positions from (inclusive) and to (exclusive), or -1 if the pattern cannot be found.
public int search(CharSequence s)
Parameters:
s - the character sequence to look in.
Returns:
the index of the first occurrence of this pattern in the given sequence, or -1 if the pattern cannot be found.
public int search(CharSequence s, int from)
Parameters:
s - the character sequence to look in.
from - the index from which the search must start.
Returns:
the index of the first occurrence of this pattern starting from from (inclusive), or -1 if the pattern cannot be found.
public int search(CharSequence s, int from, int to)
Parameters:
s - the character sequence to look in.
from - the index from which the search must start.
to - the index at which the search must end.
Returns:
the index of the first occurrence of this pattern between positions from (inclusive) and to (exclusive), or -1 if the pattern cannot be found.
public int search(byte[] a)
Parameters:
a - the byte array to look in.
Returns:
the index of the first occurrence of this pattern in the given array, or -1 if the pattern cannot be found.
public int search(byte[] a, int from)
Parameters:
a - the byte array to look in.
from - the index from which the search must start.
Returns:
the index of the first occurrence of this pattern starting from from (inclusive), or -1 if the pattern cannot be found.
public int search(byte[] a, int from, int to)
Parameters:
a - the byte array to look in.
from - the index from which the search must start.
to - the index at which the search must end.
Returns:
the index of the first occurrence of this pattern between positions from (inclusive) and to (exclusive), or -1 if the pattern cannot be found.
TODO: this must be tested
public int search(it.unimi.dsi.fastutil.chars.CharList list)
Parameters:
list - the character list to look in.
Returns:
the index of the first occurrence of this pattern in the given list, or -1 if the pattern cannot be found.
public int search(it.unimi.dsi.fastutil.chars.CharList list, int from)
Parameters:
list - the character list to look in.
from - the index from which the search must start.
Returns:
the index of the first occurrence of this pattern starting from from (inclusive), or -1 if the pattern cannot be found.
public int search(it.unimi.dsi.fastutil.chars.CharList list, int from, int to)
Parameters:
list - the character list to look in.
from - the index from which the search must start.
to - the index at which the search must end.
Returns:
the index of the first occurrence of this pattern between positions from (inclusive) and to (exclusive), or -1 if the pattern cannot be found.
public final boolean equals(Object o)
This method will return true iff its argument is a TextPattern containing the same constant pattern with the same flags set.
public final int hashCode()
The hash code of a text pattern is the same as that of a String with the same content (suitably lower cased, if the pattern is case insensitive).
Overrides: hashCode in class Object
See Also: String.hashCode()
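The hash code contract above can be sketched as follows. This is an illustration under assumptions, not the class's code: it applies the documented String.hashCode() polynomial to a character array, and it assumes that "suitably lower cased" means Character.toLowerCase applied per character. The class PatternHashSketch is hypothetical.

```java
/** Illustrative sketch of a String-compatible hash over a pattern's characters. */
final class PatternHashSketch {
    static int hash(char[] pattern, boolean caseInsensitive) {
        int h = 0;
        for (char c : pattern) {
            // Assumption: case-insensitive patterns are hashed on their lower-cased form.
            if (caseInsensitive) c = Character.toLowerCase(c);
            // Same recurrence as String.hashCode(): s[0]*31^(n-1) + ... + s[n-1].
            h = 31 * h + c;
        }
        return h;
    }
}
```

Hashing the folded form keeps hashCode consistent with an equals that treats two case-insensitive patterns differing only in case as the same pattern.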
public final String toString()
Specified by: toString in interface CharSequence
Overrides: toString in class Object
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.