public class BulletParser extends Object
The bullet parser has been written with two specific goals in mind: web crawling and targeted data extraction from massive web data sets. To be usable in such environments, a parser must obey a number of restrictions:
Thus, the bullet parser is in fact not a parser. It is a bunch of spaghetti code that analyses a stream of characters pretending that it is an (X)HTML document. It takes a very defensive attitude towards the character stream it is parsing, but at the same time it is forgiving of all typical (X)HTML mistakes.
The bullet parser is officially StringFree™. MutableStrings are used for internal processing, and Java strings are used only to return attribute values. All internal maps are reference-based maps from fastutil, which helps to accelerate the parsing process further.
The bullet parser uses attributes and methods of HTMLFactory, Element, Attribute and Entity. Thus, for instance, whenever an element is to be passed around it is one of the shared objects contained in Element (e.g., Element.BODY).
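Why shared objects and reference-based maps go together can be sketched with the standard library alone. The snippet below is a minimal illustration, not the parser's code: `Tag` and `ReferenceMapSketch` are hypothetical names, and `java.util.IdentityHashMap` stands in for the fastutil reference maps mentioned above. Because every tag is a single shared instance, keys can be compared with `==` and no string hashing or `equals()` call is needed.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical stand-in for a class that exposes shared element objects. */
final class Tag {
    static final Tag BODY = new Tag("body");
    static final Tag A = new Tag("a");

    final String name;

    private Tag(String name) {
        this.name = name;
    }
}

class ReferenceMapSketch {
    /** Counts occurrences per tag; keys are compared by reference (==) only. */
    static Map<Tag, Integer> count(Tag... seen) {
        // IdentityHashMap is a stdlib analogue of a reference-based map.
        Map<Tag, Integer> counts = new IdentityHashMap<>();
        for (Tag t : seen) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `ReferenceMapSketch.count(Tag.A, Tag.BODY, Tag.A)` yields a two-entry map in which `Tag.A` is counted twice.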
The result of the parsing process is the invocation of a callback. The callback interface of the bullet parser closely resembles SAX2, but it has some additional methods targeted at (X)HTML, such as Callback.cdata(it.unimi.dsi.parser.Element,char[],int,int), which returns characters found in a CDATA section (e.g., a stylesheet).
Each callback must configure the parser by requesting the analysis and the callbacks it requires. A callback that wants to extract and tokenise text, for instance, will certainly require parseText(true), but not parseTags(true). On the other hand, a callback wishing to extract links will need to selectively parse certain attribute types.
A more precise description follows.
The first important issue is what must be requested from the parser. A newly created parser does not invoke any callback. It is up to every callback to add features so that it can do its job. Remember that since many callbacks can be composed, you must always add features, never remove them; moreover, your callbacks must be ready to be invoked with features they did not request (e.g., attribute types added by another callback).
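The "always add, never remove" rule is easiest to see with two composed callbacks. The following is a minimal sketch under stated assumptions: `FeatureSet`, `FeatureRequester`, `TextExtractor`, `LinkExtractor` and `ComposedRequester` are all hypothetical names invented for illustration, and attributes are modelled as plain strings rather than the library's attribute objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical feature holder standing in for the parser's configuration. */
class FeatureSet {
    boolean tags;
    boolean text;
    final List<String> attributes = new ArrayList<>();

    FeatureSet tags(boolean on) { tags = on; return this; }
    FeatureSet text(boolean on) { text = on; return this; }

    FeatureSet attribute(String name) {
        if (!attributes.contains(name)) attributes.add(name);
        return this;
    }
}

/** Hypothetical contract: each callback requests what it needs. */
interface FeatureRequester {
    void configure(FeatureSet features);
}

class TextExtractor implements FeatureRequester {
    // Requests text only; it never switches anything off.
    public void configure(FeatureSet f) { f.text(true); }
}

class LinkExtractor implements FeatureRequester {
    // Requests tags plus the one attribute it cares about.
    public void configure(FeatureSet f) { f.tags(true).attribute("href"); }
}

class ComposedRequester implements FeatureRequester {
    private final FeatureRequester[] parts;

    ComposedRequester(FeatureRequester... parts) { this.parts = parts; }

    // The union of all requests survives because no part removes features.
    public void configure(FeatureSet f) {
        for (FeatureRequester p : parts) p.configure(f);
    }
}
```

After `new ComposedRequester(new TextExtractor(), new LinkExtractor()).configure(f)`, both `f.text` and `f.tags` are set. The text extractor will therefore see tag events it never asked for, which is exactly the situation the paragraph above warns about.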
The following parse features may be configured. Most of them are just boolean features, a.k.a. flags; unless otherwise specified, all flags are set to false by default (e.g., by default the parser will not parse tags):
- parseTags(boolean) method: whether tags should be parsed;
- parseAttributes(boolean) and parseAttribute(Attribute) methods: whether attributes should be parsed (of course, setting this flag is useless if you are not parsing tags); note that setting this flag will just activate the attribute parsing feature, but you must also register every attribute whose value you want to obtain;
- parseText(boolean) method: whether text should be parsed; if this flag is set, the parser will call the Callback.characters(char[], int, int, boolean) method for every text chunk found;
- parseCDATA(boolean) method: whether CDATA sections (stylesheets & scripts) should be parsed; if this flag is set, the parser will call the Callback.cdata(Element,char[],int,int) method for every CDATA section found.
After setting the parser callback, you just call parse(char[], int, int).
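The overall flow (set a callback, let it request features, then parse a character array) can be sketched with a toy scanner. This is not the bullet parser's implementation: `ToyHandler`, `ToyParser` and `TextCollector` are hypothetical, the tag scanning is deliberately naive, and having `setHandler` call `configure` right after resetting the flags is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler, loosely modelled on the callback idea described above. */
interface ToyHandler {
    void configure(ToyParser parser);
    void tag(String name);
    void characters(char[] text, int offset, int length);
}

/** Hypothetical, heavily simplified parser: flags decide which handler methods run. */
class ToyParser {
    private boolean parseTags;
    private boolean parseText;
    private ToyHandler handler;

    ToyParser parseTags(boolean on) { parseTags = on; return this; }
    ToyParser parseText(boolean on) { parseText = on; return this; }

    /** Resets every flag, then lets the handler request what it needs (assumed order). */
    ToyParser setHandler(ToyHandler h) {
        parseTags = false;
        parseText = false;
        handler = h;
        h.configure(this);
        return this;
    }

    void parse(char[] text, int offset, int length) {
        int end = offset + length;
        int pos = offset;
        while (pos < end) {
            if (text[pos] == '<') {
                // Naive tag scan: everything up to the next '>' is the tag body.
                int close = pos + 1;
                while (close < end && text[close] != '>') close++;
                if (parseTags) handler.tag(new String(text, pos + 1, close - pos - 1));
                pos = Math.min(close + 1, end);
            } else {
                int start = pos;
                while (pos < end && text[pos] != '<') pos++;
                if (parseText) handler.characters(text, start, pos - start);
            }
        }
    }
}

/** Collects text chunks only; it never asks for tags. */
class TextCollector implements ToyHandler {
    final List<String> chunks = new ArrayList<>();

    public void configure(ToyParser p) { p.parseText(true); }

    public void tag(String name) {
        // Ignored, but may still be invoked if another handler requested tags.
    }

    public void characters(char[] text, int offset, int length) {
        chunks.add(new String(text, offset, length));
    }
}
```

Parsing `<p>Hello <b>world</b></p>` with a `TextCollector` leaves the two chunks `"Hello "` and `"world"` in `chunks`; no tag event is delivered because the collector never enabled tag parsing.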
Modifier and Type | Field and Description |
---|---|
protected it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Attribute,MutableString> | attrMap: A map from attributes to attribute values. |
protected Callback | callback: The callback of this parser. |
protected static TextPattern | CLOSED_CDATA: Closed section (conditional, CDATA, etc.). |
protected static TextPattern | CLOSED_COMMENT: Closed comment. |
protected static TextPattern | CLOSED_PERCENT: Closed ASP or similar tag. |
protected static TextPattern | CLOSED_PIC: Closed processing instruction. |
protected static TextPattern | CLOSED_SECTION: Closed section (conditional, etc.). |
ParsingFactory | factory: The parsing factory used by this parser. |
protected static int | HEXADECIMAL: The base for non-decimal entity. |
protected char | lastEntity: The character represented by the last scanned entity. |
protected static int | MAX_DEC_ENTITY_LENGTH: The maximum number of digits of a decimal numeric entity. |
protected static int | MAX_ENTITY_VALUE: The maximum Unicode value accepted for a numeric entity. |
protected static int | MAX_HEX_ENTITY_LENGTH: The maximum number of digits of a hexadecimal numeric entity. |
protected static char[] | NONSPACE_WHITESPACE: An array containing the non-space whitespace. |
protected boolean | parseAttributes: Whether we should parse attributes. |
protected boolean | parseCDATA: Whether we should invoke the CDATA section handler. |
it.unimi.dsi.fastutil.objects.ReferenceSet<Attribute> | parsedAttributes: An externally visible, immutable subset of attributes whose values will be actually parsed. |
protected it.unimi.dsi.fastutil.objects.ReferenceArraySet<Attribute> | parsedAttrs: The subset of attributes whose values will be actually parsed (if, of course, parseAttributes is true). |
protected boolean | parseTags: Whether we should parse tags. |
protected boolean | parseText: Whether we should invoke the text handler. |
protected static TextPattern | SCRIPT_CLOSE_TAG_PATTERN: Closing tag for a script element. |
protected static char[] | SPACE: An array, parallel to NONSPACE_WHITESPACE, containing spaces. |
protected static int | STATE_BEFORE_END_TAG_NAME: Scanning a closing tag. |
protected static int | STATE_BEFORE_START_TAG_NAME: Scanning attribute name/value pairs. |
protected static int | STATE_IN_END_TAG: Scanning a closing tag. |
protected static int | STATE_IN_START_TAG: Scanning attribute name/value pairs. |
protected static int | STATE_TEXT: Scanning text. |
protected static TextPattern | STYLE_CLOSE_TAG_PATTERN: Closing tag for a style element. |
Constructor and Description |
---|
BulletParser(): Creates a new bullet parser using the default factory HTMLFactory.INSTANCE. |
BulletParser(ParsingFactory factory): Creates a new bullet parser. |
Modifier and Type | Method and Description |
---|---|
protected char | entity2Char(MutableString name): Returns the character corresponding to a given entity name. |
protected int | handleMarkup(char[] text, int pos, int end): Handles markup. |
protected int | handleProcessingInstruction(char[] text, int pos, int end): Handles processing instruction, ASP tags etc. |
void | parse(char[] text): Analyze the text document to extract information. |
void | parse(char[] text, int offset, int length): Analyze the text document to extract information. |
BulletParser | parseAttribute(Attribute attribute): Adds the given attribute to the set of attributes to be parsed. |
boolean | parseAttributes(): Returns whether this parser will parse attributes. |
BulletParser | parseAttributes(boolean parseAttributes): Sets the attribute parsing flag. |
boolean | parseCDATA(): Returns whether this parser will invoke the CDATA-section handler. |
BulletParser | parseCDATA(boolean parseCDATA): Sets the CDATA-section handler flag. |
boolean | parseTags(): Returns whether this parser will parse tags and invoke element handlers. |
BulletParser | parseTags(boolean parseTags): Sets whether this parser will parse tags and invoke element handlers. |
boolean | parseText(): Returns whether this parser will invoke the text handler. |
BulletParser | parseText(boolean parseText): Sets the text handler flag. |
protected void | replaceEntities(MutableString s, MutableString entity, boolean loose): Replaces entities with the corresponding characters. |
protected int | scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity): Searches for the end of an entity. |
BulletParser | setCallback(Callback callback): Sets the callback for this parser, resetting at the same time all parsing flags. |
protected static final int STATE_TEXT
protected static final int STATE_BEFORE_START_TAG_NAME
protected static final int STATE_BEFORE_END_TAG_NAME
protected static final int STATE_IN_START_TAG
protected static final int STATE_IN_END_TAG
protected static final int MAX_ENTITY_VALUE
protected static final int HEXADECIMAL
protected static final int MAX_HEX_ENTITY_LENGTH
protected static final int MAX_DEC_ENTITY_LENGTH
protected static final TextPattern SCRIPT_CLOSE_TAG_PATTERN
protected static final TextPattern STYLE_CLOSE_TAG_PATTERN
protected static final char[] NONSPACE_WHITESPACE
protected static final char[] SPACE
An array, parallel to NONSPACE_WHITESPACE, containing spaces.
protected static final TextPattern CLOSED_COMMENT
protected static final TextPattern CLOSED_PERCENT
protected static final TextPattern CLOSED_PIC
protected static final TextPattern CLOSED_SECTION
protected static final TextPattern CLOSED_CDATA
public final ParsingFactory factory
protected Callback callback
protected it.unimi.dsi.fastutil.objects.Reference2ObjectMap<Attribute,MutableString> attrMap
protected boolean parseText
protected boolean parseCDATA
protected boolean parseTags
protected boolean parseAttributes
protected it.unimi.dsi.fastutil.objects.ReferenceArraySet<Attribute> parsedAttrs
The subset of attributes whose values will be actually parsed (if, of course, parseAttributes is true).
public it.unimi.dsi.fastutil.objects.ReferenceSet<Attribute> parsedAttributes
protected char lastEntity
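public BulletParser(ParsingFactory factory)
Creates a new bullet parser.
public BulletParser()
Creates a new bullet parser using the default factory HTMLFactory.INSTANCE.
public boolean parseText()
Returns whether this parser will invoke the text handler. See also parseText(boolean).
public BulletParser parseText(boolean parseText)
Sets the text handler flag.
parseText - the new value.
public boolean parseCDATA()
Returns whether this parser will invoke the CDATA-section handler. See also parseCDATA(boolean).
public BulletParser parseCDATA(boolean parseCDATA)
Sets the CDATA-section handler flag.
parseCDATA - the new value.
public boolean parseTags()
Returns whether this parser will parse tags and invoke element handlers. See also parseTags(boolean).
public BulletParser parseTags(boolean parseTags)
Sets whether this parser will parse tags and invoke element handlers.
parseTags - the new value.
public boolean parseAttributes()
Returns whether this parser will parse attributes. See also parseAttributes(boolean).
public BulletParser parseAttributes(boolean parseAttributes)
Sets the attribute parsing flag.
parseAttributes - the new value for the flag.
public BulletParser parseAttribute(Attribute attribute)
Adds the given attribute to the set of attributes to be parsed.
attribute - an attribute that should be parsed.
IllegalStateException - if parseAttributes(true) has not been invoked on this parser.
public BulletParser setCallback(Callback callback)
Sets the callback for this parser, resetting at the same time all parsing flags.
callback - the new callback.
protected char entity2Char(MutableString name)
Returns the character corresponding to a given entity name.
name - the name of an entity.
protected int scanEntity(char[] a, int offset, int length, boolean loose, MutableString entity)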
This method will search for the end of an entity starting at the given offset (the offset must correspond to the ampersand).
Real-world HTML pages often contain hundreds of misplaced ampersands, due to the unfortunate idea of using the ampersand as query separator (please use the comma in new code!). All such ampersands should be specified as &amp;amp;. If named entities are delimited using a transition from alphabetical to non-alphabetical characters, we can easily get false positives. If the parameter loose is false, named entities can be delimited only by whitespace or by a comma.
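The false-positive problem can be illustrated with a small stand-alone scanner. This is a sketch, not the method documented here: `EntityEndSketch.scanName` is a hypothetical helper, it only finds where a candidate name ends and never looks the name up, and it assumes that a semicolon and the end of input always terminate a name. The strict delimiters (whitespace or comma) follow the wording above.

```java
class EntityEndSketch {
    /**
     * Scans the candidate entity name that follows the ampersand at offset.
     * Returns the position just past the name (or past a closing ';'),
     * or -1 if the name is empty or not acceptably terminated.
     */
    static int scanName(char[] a, int offset, int end, boolean loose) {
        int i = offset + 1;
        while (i < end && Character.isLetter(a[i])) i++;
        if (i == offset + 1) return -1;   // nothing alphabetical after '&'
        if (i == end) return i;           // assumption: end of input terminates the name
        char c = a[i];
        if (c == ';') return i + 1;       // assumption: ';' always terminates the name
        if (loose) return i;              // any non-alphabetical character is accepted
        // Strict mode: only whitespace or a comma may follow the name.
        return (Character.isWhitespace(c) || c == ',') ? i : -1;
    }
}
```

For the query string `?a=1&copy=2`, where the ampersand sits at index 4, the loose scan returns 9 and would treat `&copy` as an entity, whereas the strict scan returns -1 and leaves the ampersand alone.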
a - a character array containing the entity.
offset - the offset at which the entity starts (the offset must point at the ampersand).
length - an upper bound to the maximum returned position.
loose - if true, named entities can be terminated by any non-alphabetical character (instead of whitespace or comma).
entity - a support mutable string used to query ParsingFactory.getEntity(MutableString).
protected void replaceEntities(MutableString s, MutableString entity, boolean loose)
This method will modify the mutable string s so that all legal occurrences of entities are replaced by the corresponding character.
s - a mutable string whose entities will be replaced by the corresponding characters.
entity - a support mutable string used by scanEntity(char[], int, int, boolean, MutableString).
loose - a parameter that will be passed to scanEntity(char[], int, int, boolean, MutableString).
protected int handleMarkup(char[] text, int pos, int end)
Handles markup.
text - the text.
pos - the first character in the markup after <!.
end - the end of text.
protected int handleProcessingInstruction(char[] text, int pos, int end)
Handles processing instruction, ASP tags etc.
text - the text.
pos - the first character in the markup after <%.
end - the end of text.
public void parse(char[] text)
Analyze the text document to extract information.
text - a char array of text to be parsed.
public void parse(char[] text, int offset, int length)
Analyze the text document to extract information.
text - a char array of text to be parsed.
offset - the offset in the array from which the parsing will begin.
length - the number of characters to be parsed.
Copyright © 2006–2019 SYSTAP, LLC DBA Blazegraph. All rights reserved.