public class HtmlDocument extends Object
The HtmlDocument class creates a Lucene Document from an HTML document.
It does this using the JTidy package. It can take input
from a File or an InputStream.
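
As a rough illustration of the title and body accessors described above, here is a minimal, JDK-only sketch. It is not the real HtmlDocument: the class name `HtmlTextSketch` is hypothetical, it uses simple regular expressions instead of JTidy, and it does not build a Lucene Document. It only shows the kind of plain text one might expect `getTitle()` and `getBody()` to return for a small, well-formed page read from an InputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration only: no JTidy, no Lucene. */
final class HtmlTextSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY = Pattern.compile(
        "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final String title;
    private final String body;

    /** Reads the whole stream as UTF-8 (an assumption) and extracts the text. */
    HtmlTextSketch(InputStream is) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = is.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            String html = new String(buf.toByteArray(), StandardCharsets.UTF_8);
            this.title = textOf(TITLE, html);
            this.body = textOf(BODY, html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the first match with nested tags removed and whitespace collapsed. */
    private static String textOf(Pattern p, String html) {
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return "";
        }
        return m.group(1).replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    String getTitle() { return title; }

    String getBody() { return body; }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
            + "<body><p>Some <b>bold</b> text.</p></body></html>";
        HtmlTextSketch doc = new HtmlTextSketch(
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getTitle());
        System.out.println(doc.getBody());
    }
}
```

Regular expressions break down on malformed markup, which is presumably why the real class delegates parsing to JTidy; the sketch is only meant to make the accessor contract concrete.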
| Constructor and Description |
|---|
| HtmlDocument(File file): Constructs an HtmlDocument from a File. |
| HtmlDocument(File file, String tidyConfigFile): Constructs an HtmlDocument from a File. |
| HtmlDocument(InputStream is): Constructs an HtmlDocument from an InputStream. |
| Modifier and Type | Method and Description |
|---|---|
| static org.apache.lucene.document.Document | Document(File file): Creates a Lucene Document from a File. |
| static org.apache.lucene.document.Document | Document(File file, String tidyConfigFile): Creates a Lucene Document from a File. |
| String | getBody(): Gets the bodyText attribute of the HtmlDocument object. |
| static org.apache.lucene.document.Document | getDocument(InputStream is): Creates a Lucene Document from an InputStream. |
| String | getTitle(): Gets the title attribute of the HtmlDocument object. |
| static void | main(String[] args): Runs HtmlDocument on the files specified on the command line. |

public HtmlDocument(File file) throws IOException
Constructs an HtmlDocument from a File.
Parameters:
file - the File containing the HTML to parse
Throws:
IOException - if an I/O exception occurs

public HtmlDocument(InputStream is)
Constructs an HtmlDocument from an InputStream.
Parameters:
is - the InputStream containing the HTML

public HtmlDocument(File file, String tidyConfigFile) throws IOException
Constructs an HtmlDocument from a File.
Parameters:
file - the File containing the HTML to parse
tidyConfigFile - the String containing the full path to the Tidy config file
Throws:
IOException - if an I/O exception occurs

public static org.apache.lucene.document.Document Document(File file, String tidyConfigFile) throws IOException
Creates a Lucene Document from a File.
Parameters:
file
tidyConfigFile - the full path to the Tidy config file
Throws:
IOException

public static org.apache.lucene.document.Document getDocument(InputStream is)
Creates a Lucene Document from an InputStream.
Parameters:
is

public static org.apache.lucene.document.Document Document(File file) throws IOException
Creates a Lucene Document from a File.
Parameters:
file
Throws:
IOException

public static void main(String[] args) throws Exception
Runs HtmlDocument on the files specified on the command line.
Parameters:
args - Command line arguments
Throws:
Exception - Description of Exception

public String getTitle()
Gets the title attribute of the HtmlDocument object.

public String getBody()
Gets the bodyText attribute of the HtmlDocument object.

Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.