public final class HtmlUtils extends Object
The HtmlParser will be open-sourced; hence we decided to keep these
utilities in this package and not to leverage others that may exist
in the google3 code base.
The functionality exposed is designed to be 100% compatible with the corresponding logic in the C version of the HtmlParser; as such, we are particularly concerned with cross-language compatibility.
Note: The words Javascript and ECMAScript are used
interchangeably unless otherwise noted.
| Modifier and Type | Class and Description |
|---|---|
| static class | HtmlUtils.META_REDIRECT_TYPE: Indicates the type of content contained in the content HTML attribute of the meta HTML tag. |
| Modifier and Type | Method and Description |
|---|---|
| static String | encodeCharForAscii(char chr): Encodes the specified character using ASCII for convenient insertion into a single-quote enclosed String. |
| static boolean | isAttributeJavascript(String attribute): Determines if the HTML attribute specified expects javascript for its value. |
| static boolean | isAttributeStyle(String attribute): Determines if the HTML attribute specified expects a style for its value. |
| static boolean | isAttributeUri(String attribute): Determines if the HTML attribute specified expects a URI for its value. |
| static boolean | isHtmlSpace(char chr): Determines if the specified character is an HTML whitespace character. |
| static boolean | isJavascriptIdentifier(char chr): Determines if the specified character is a valid character in an ECMAScript identifier. |
| static boolean | isJavascriptRegexpPrefix(String input): Determines if the input token provided is a valid token prefix to a javascript regular expression. |
| static boolean | isJavascriptWhitespace(char chr): Determines if the specified character is an ECMAScript whitespace or line terminator character. |
| static HtmlUtils.META_REDIRECT_TYPE | parseContentAttributeForUrl(String value): Parses the given String to determine if it contains a URL in the format followed by the content attribute of the meta HTML tag. |

public static boolean isAttributeJavascript(String attribute)
Determines if the HTML attribute specified expects javascript for its value, as is the case, for example, with the onclick attribute.
Currently returns true for any attribute name that starts with "on". This is not exactly correct, but we trust developers not to use non-spec-compliant attribute names (e.g. onbogus).
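
The prefix rule described above can be sketched as follows. This is a minimal illustration, not the library's source: the class name AttributeJsSketch is hypothetical, and the comparison is assumed to be case-sensitive because the text does not say otherwise.

```java
final class AttributeJsSketch {
    // Returns true for any non-null attribute name starting with "on".
    // Assumption: the match is case-sensitive; the documentation does not specify.
    static boolean isAttributeJavascript(String attribute) {
        return attribute != null && attribute.startsWith("on");
    }

    public static void main(String[] args) {
        System.out.println(isAttributeJavascript("onclick"));
        // A non-spec name is still accepted, which is the imprecision noted above.
        System.out.println(isAttributeJavascript("onbogus"));
        System.out.println(isAttributeJavascript("href"));
        System.out.println(isAttributeJavascript(null));
    }
}
```

The sketch accepts onbogus just as it accepts onclick, so a caller that needs strict checking would have to compare against an explicit list of event-handler attributes instead.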
Parameters: attribute - the name of an HTML attribute
Returns: false if the input is null or is not an attribute that expects javascript code; true otherwise

public static boolean isAttributeStyle(String attribute)
Determines if the HTML attribute specified expects a style for its value. Currently this is only true for the style HTML attribute.
Parameters: attribute - the name of an HTML attribute
Returns: true iff the attribute name is one that expects a style for a value; otherwise false

public static boolean isAttributeUri(String attribute)
Determines if the HTML attribute specified expects a URI for its value. For example, both href and src expect a URI but style does not. Returns false if the attribute given was null.
Parameters: attribute - the name of an HTML attribute
Returns: true if the attribute name is one that expects a URI for a value; otherwise false
See Also: ATTRIBUTE_EXPECTS_URI

public static boolean isHtmlSpace(char chr)
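Determines if the specified character is an HTML whitespace character. The following characters qualify:
- Space character
- Tab character
- Line feed character
- Carriage Return character
- Zero-Width Space character

Note: the list includes the Zero-Width Space character (U+200B), which is not included in the C version.
Parameters: chr - the char to check
Returns: true if the character is an HTML whitespace character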
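
The membership test described above can be sketched as a simple switch. This is an illustration only, not the library's source; the class name HtmlSpaceSketch is hypothetical.

```java
final class HtmlSpaceSketch {
    // True only for the five characters listed in the documentation.
    static boolean isHtmlSpace(char chr) {
        switch (chr) {
            case ' ':       // Space
            case '\t':      // Tab
            case '\n':      // Line feed
            case '\r':      // Carriage Return
            case '\u200B':  // Zero-Width Space (not part of the C version)
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHtmlSpace(' '));
        System.out.println(isHtmlSpace('\u200B'));
        System.out.println(isHtmlSpace('x'));
    }
}
```

A caller that needs results identical to the C version would have to treat the Zero-Width Space case separately.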
See Also: White space

public static boolean isJavascriptWhitespace(char chr)
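Determines if the specified character is an ECMAScript whitespace or line terminator character, that is, one of the following:
- A whitespace character (Tab, Vertical Tab, Form Feed, Space, No-break space)
- A line terminator character (Line Feed, Carriage Return, Line separator, Paragraph Separator)

Encompasses the characters in sections 7.2 and 7.3 of ECMAScript 3. In particular, this list is quite different from that in Character.isWhitespace.
See the ECMAScript Language Specification.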
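
To make the difference from Character.isWhitespace concrete, here is a minimal sketch covering only the characters named above. It is not the library's source, and the class name JsWhitespaceSketch is hypothetical.

```java
final class JsWhitespaceSketch {
    // True for the whitespace and line terminator characters named in the documentation.
    static boolean isJavascriptWhitespace(char chr) {
        switch (chr) {
            // Whitespace characters (ECMAScript 3, section 7.2)
            case '\t':      // Tab
            case '\u000B':  // Vertical Tab
            case '\f':      // Form Feed
            case ' ':       // Space
            case '\u00A0':  // No-break space
            // Line terminators (ECMAScript 3, section 7.3)
            case '\n':      // Line Feed
            case '\r':      // Carriage Return
            case '\u2028':  // Line separator
            case '\u2029':  // Paragraph Separator
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        char noBreakSpace = '\u00A0';
        // The two predicates disagree on the no-break space.
        System.out.println(isJavascriptWhitespace(noBreakSpace));
        System.out.println(Character.isWhitespace(noBreakSpace));
    }
}
```

Character.isWhitespace explicitly excludes non-breaking spaces, which is one place where the two definitions diverge.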
Parameters: chr - the char to check
Returns: true or false

public static boolean isJavascriptIdentifier(char chr)
Determines if the specified character is a valid character in an ECMAScript identifier. An alternative would be to leverage Character.isJavaIdentifierStart and Character.isJavaIdentifierPart, given that Java and Javascript follow similar identifier naming rules, but doing so would lose compatibility with the C version.
Parameters: chr - char to check
Returns: true if chr is a valid Javascript identifier character; otherwise false

public static boolean isJavascriptRegexpPrefix(String input)
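Determines if the input token provided is a valid token prefix to a javascript regular expression. The token is checked against a Set of identifiers that can precede a regular expression in the javascript grammar, and the method returns true if the provided String is in that Set.
Parameters: input - the String token to check
Returns: true iff the token is a valid prefix of a regexp

public static String encodeCharForAscii(char chr)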
Encodes the specified character using ASCII for convenient insertion into a single-quote enclosed String. Printable characters are returned as-is. Carriage Return, Line Feed, Horizontal Tab, back-slash and single quote are all backslash-escaped. All other characters are returned hex-encoded.
Parameters: chr - char to encode
Returns: the encoded char

public static HtmlUtils.META_REDIRECT_TYPE parseContentAttributeForUrl(String value)
Parses the given String to determine if it contains a URL in the
format followed by the content attribute of the meta
HTML tag.
This function expects to receive the value of the content HTML
attribute. This attribute takes on different meanings depending on the
value of the http-equiv HTML attribute of the same meta
tag. Since we may not have access to the http-equiv attribute,
we instead rely on parsing the given value to determine if it contains
a URL.
The specification of the meta HTML tag can be found in:
http://dev.w3.org/html5/spec/Overview.html#attr-meta-http-equiv-refresh
We return HtmlUtils.META_REDIRECT_TYPE indicating whether the
value contains a URL and whether we are at the start of the URL or past
the start. We are at the start of the URL if and only if one of the two
conditions below is true:
Examples:
meta tag where the content
attribute contains a URL [we are not at the start of the URL]:
<meta http-equiv="refresh" content="5; URL=http://www.google.com">
meta tag where the content
attribute contains a URL [we are at the start of the URL]:
<meta http-equiv="refresh" content="5; URL=">
meta tag where the content
attribute does not contain a URL:
<meta http-equiv="content-type" content="text/html">
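
The three example cases above can be reproduced with a small regular-expression sketch. It is an illustration under stated assumptions, not the library's parser. The class name MetaContentSketch and the enum constants NONE, URL_START and URL are hypothetical stand-ins for HtmlUtils.META_REDIRECT_TYPE. The sketch assumes that "at the start of the URL" means no URL characters follow URL= (optionally after an opening quote), and that a null value is treated as containing no URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaContentSketch {
    // Hypothetical stand-in for HtmlUtils.META_REDIRECT_TYPE.
    enum MetaRedirectType { NONE, URL_START, URL }

    // Optional delay, a ';' or ',' separator, "url", '=', an optional quote,
    // then whatever part of the URL is present (captured in group 1).
    private static final Pattern REFRESH = Pattern.compile(
        "\\s*\\d*\\s*[;,]\\s*url\\s*=\\s*['\"]?(.*)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static MetaRedirectType parseContentAttributeForUrl(String value) {
        if (value == null) {
            return MetaRedirectType.NONE;  // assumption: null means no URL
        }
        Matcher m = REFRESH.matcher(value);
        if (!m.matches()) {
            return MetaRedirectType.NONE;
        }
        // Nothing after "URL=" (or after the opening quote): still at the start.
        return m.group(1).isEmpty() ? MetaRedirectType.URL_START : MetaRedirectType.URL;
    }

    public static void main(String[] args) {
        System.out.println(parseContentAttributeForUrl("5; URL=http://www.google.com"));
        System.out.println(parseContentAttributeForUrl("5; URL="));
        System.out.println(parseContentAttributeForUrl("text/html"));
    }
}
```

Because the sketch never sees the http-equiv attribute, it decides purely from the shape of the value, which mirrors the constraint described earlier in this entry.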
Parameters: value - String to parse
Returns: HtmlUtils.META_REDIRECT_TYPE indicating the presence of a URL in the given value

Copyright © 2010–2016 Google. All rights reserved.