public class RussianLetterTokenizer extends CharTokenizer
A Tokenizer that extends the behavior of LetterTokenizer
by additionally looking up letters in a given "russian charset".
LetterTokenizer relies on the Character.isLetter(char) method,
which cannot detect letters in text held in encodings such as CP1252 and KOI8
(the well-known problems with the 0xD7 and 0xF7 chars).
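The 0xD7/0xF7 remark can be illustrated with a small stand-alone sketch. It assumes one plausible reading of the problem: a raw byte taken directly as a Unicode char lands on U+00D7 or U+00F7, which are the multiplication and division signs, while the same byte decoded as KOI8-R is a Cyrillic letter. The class and method names are hypothetical, the sketch does not use Lucene, and it assumes the JVM ships the KOI8-R charset.

```java
import java.nio.charset.Charset;

public class IsLetterEncodingDemo {

    /** Treats the value as a Unicode char (no decoding) and asks Character.isLetter. */
    static boolean isLetterAsUnicode(int codeUnit) {
        return Character.isLetter((char) codeUnit);
    }

    /** Decodes one byte with the named charset, then asks Character.isLetter. */
    static boolean isLetterAfterDecoding(int byteValue, String charsetName) {
        // Charset.forName throws if the JVM does not provide the charset (assumed available here).
        String decoded = new String(new byte[] {(byte) byteValue}, Charset.forName(charsetName));
        return decoded.length() == 1 && Character.isLetter(decoded.charAt(0));
    }

    public static void main(String[] args) {
        for (int b : new int[] {0xD7, 0xF7}) {
            System.out.printf("0x%X: raw char is letter = %b, KOI8-R decoded is letter = %b%n",
                    b, isLetterAsUnicode(b), isLetterAfterDecoding(b, "KOI8-R"));
        }
    }
}
```

Decoding the input with the correct charset before it reaches the tokenizer avoids the mismatch, which is consistent with the charset-taking constructor being deprecated in favor of RussianLetterTokenizer(Reader).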
Nested classes/interfaces inherited from class AttributeSource: AttributeSource.AttributeFactory, AttributeSource.State

| Constructor and Description |
|---|
| RussianLetterTokenizer(AttributeSource.AttributeFactory factory, Reader in) |
| RussianLetterTokenizer(AttributeSource source, Reader in) |
| RussianLetterTokenizer(Reader in) |
| RussianLetterTokenizer(Reader in, char[] charset) Deprecated. Use RussianLetterTokenizer(Reader) instead. |
| Modifier and Type | Method and Description |
|---|---|
| protected boolean | isTokenChar(char c) Collects only characters which satisfy Character.isLetter(char). |
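To make the role of isTokenChar(char) concrete, here is a minimal stand-alone sketch of predicate-driven tokenization: characters accepted by the predicate are collected into a token, and any rejected character ends the current token. This is an illustration only, not Lucene's CharTokenizer implementation; the class LetterRunSketch and its tokenize method are hypothetical names, and details such as offsets, buffering, and normalization are left out. The sample input is the Russian phrase for "Hello, world!" followed by "2x2", written with Unicode escapes so the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterRunSketch {

    /** Same rule the summary above documents for isTokenChar(char): letters only. */
    static boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    /** Splits text into maximal runs of characters accepted by isTokenChar. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);              // still inside a token
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // a non-token char closes the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());     // flush the trailing token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Cyrillic sample text plus digits and Latin letters.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440! 2x2";
        System.out.println(tokenize(sample));
    }
}
```

Because the predicate accepts letters only, punctuation, whitespace, and digits all act as token boundaries in this sketch.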
Methods inherited from class CharTokenizer: end, incrementToken, next, next, normalize, reset
Methods inherited from class Tokenizer: close, correctOffset
Methods inherited from class TokenStream: getOnlyUseNewAPI, reset, setOnlyUseNewAPI
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString

public RussianLetterTokenizer(Reader in, char[] charset)
Deprecated. Use RussianLetterTokenizer(Reader) instead.

public RussianLetterTokenizer(Reader in)

public RussianLetterTokenizer(AttributeSource source, Reader in)

public RussianLetterTokenizer(AttributeSource.AttributeFactory factory, Reader in)

protected boolean isTokenChar(char c)
Collects only characters which satisfy Character.isLetter(char).
Specified by: isTokenChar in class CharTokenizer

Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.