API reference
- class vaporetto.Vaporetto(model, /, predict_tags=False, wsconst='', norm=True)
Python binding of Vaporetto tokenizer.
- Examples:
>>> import vaporetto
>>> with open('path/to/vaporetto.model', 'rb') as fp:
...     tokenizer = vaporetto.Vaporetto(fp.read(), predict_tags=True)
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'
>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
- Parameters:
model (bytes) – A byte sequence of the model.
predict_tags (bool) – If True, the tokenizer predicts tags.
wsconst (str) – Does not split the specified character types. D: Digit, R: Roman, H: Hiragana, T: Katakana, K: Kanji, O: Other, G: Grapheme cluster. You can specify multiple types, such as DGR.
norm (bool) – If True, input texts will be normalized beforehand.
- Return type:
Vaporetto
- Raises:
ValueError – if the model is invalid.
ValueError – if the wsconst value is invalid.
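The wsconst letters listed above can be checked before constructing a tokenizer. The following is a minimal sketch: `WSCONST_TYPES` and `unknown_wsconst_letters` are hypothetical names introduced here, not part of the vaporetto package, and the ValueError raised by the constructor remains the authoritative check.

```python
from typing import List

# Character-type letters copied from the wsconst parameter description.
WSCONST_TYPES = {
    "D": "Digit",
    "R": "Roman",
    "H": "Hiragana",
    "T": "Katakana",
    "K": "Kanji",
    "O": "Other",
    "G": "Grapheme cluster",
}


def unknown_wsconst_letters(wsconst: str) -> List[str]:
    """Return the letters in `wsconst` that are not documented type codes.

    Hypothetical helper for illustration only.
    """
    return [ch for ch in wsconst if ch not in WSCONST_TYPES]


print(unknown_wsconst_letters("DGR"))  # every letter is a documented type
print(unknown_wsconst_letters("DXR"))  # 'X' is not a documented type
```

A check like this is only useful for reporting a friendlier message; whether the library accepts repeated letters or other edge cases is not stated in this reference.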
- static create_from_kytea_model(model, /, wsconst='', norm=True)
Create a new Vaporetto instance from a KyTea model.
Vaporetto does not support tag prediction with KyTea models.
- Parameters:
model (bytes) – A byte sequence of the model.
wsconst (str) – Does not split the specified character types.
norm (bool) – If True, input texts will be normalized beforehand.
- Return type:
Vaporetto
- Raises:
ValueError – if the model is invalid.
ValueError – if the wsconst value is invalid.
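One way to choose between the regular constructor and create_from_kytea_model is sketched below. `load_tokenizer` and its `vaporetto_cls` parameter are hypothetical names used for illustration; the sketch assumes the model file can be passed to the library exactly as read from disk, as in the example at the top of this page.

```python
def load_tokenizer(path, *, kytea=False, predict_tags=False, wsconst="",
                   norm=True, vaporetto_cls=None):
    """Read model bytes from `path` and build a tokenizer.

    `vaporetto_cls` is a hypothetical hook that lets a stand-in class be
    substituted; by default the real vaporetto.Vaporetto class is used.
    """
    if vaporetto_cls is None:
        # Imported lazily so the helper can be defined without the package.
        import vaporetto
        vaporetto_cls = vaporetto.Vaporetto

    with open(path, "rb") as fp:
        model = fp.read()

    if kytea:
        # Tag prediction is documented as unsupported for KyTea models.
        if predict_tags:
            raise ValueError("tag prediction is not supported with a KyTea model")
        return vaporetto_cls.create_from_kytea_model(
            model, wsconst=wsconst, norm=norm
        )
    return vaporetto_cls(
        model, predict_tags=predict_tags, wsconst=wsconst, norm=norm
    )
```

Both construction paths may raise ValueError for an invalid model or wsconst value, so callers would typically wrap the call in a try/except block.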
- tokenize(text, /)
Tokenize the given text and return the result as a list of tokens.
- Parameters:
text (str) – A text to tokenize.
- Return type:
TokenList
- tokenize_to_string(text, /)
Tokenize the given text and return the result as a string.
- Parameters:
text (str) – A text to tokenize.
- Return type:
str
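The string returned by tokenize_to_string can be split back into surfaces and tags. The sketch below is based only on the output shown in the example at the top of this page (tokens separated by spaces, tags appended with '/'). `parse_tokenized` is a hypothetical helper, and it assumes that no surface or tag itself contains a space or a slash; how the library represents such characters is not covered here.

```python
from typing import List, Tuple


def parse_tokenized(line: str) -> List[Tuple[str, List[str]]]:
    """Split 'surface/tag/tag surface/tag ...' into (surface, tags) pairs.

    Naive split: assumes surfaces and tags contain no ' ' or '/'.
    """
    result = []
    for chunk in line.split(" "):
        if not chunk:
            continue  # skip empty pieces, e.g. for an empty input string
        surface, *tags = chunk.split("/")
        result.append((surface, tags))
    return result


line = "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ"
for surface, tags in parse_tokenized(line):
    print(surface, tags)
```

When structured access is needed, tokenize() is likely the more robust choice, since it exposes surfaces and tags directly without any string parsing.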
- class vaporetto.TokenList
List of Token returned by the tokenizer.
- __getitem__(key, /)
Return self[key].
- __iter__()
Implement iter(self).
- __len__()
Return len(self).
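Because TokenList supports len(), indexing, and iteration, generic sequence code can work with it. The sketch below uses only len() and non-negative integer indexes; `adjacent_pairs` is a hypothetical helper, and a plain Python list stands in for a TokenList so that the snippet runs without a model. Whether TokenList also accepts negative indexes or slices is not stated in this reference, so the sketch avoids both.

```python
def adjacent_pairs(tokens):
    """Yield (tokens[i], tokens[i + 1]) for each neighbouring pair.

    Relies only on len() and non-negative integer indexing.
    """
    for i in range(len(tokens) - 1):
        yield tokens[i], tokens[i + 1]


# A plain list is used here as a stand-in for a TokenList.
sample = ["まぁ", "社長", "は"]
for left, right in adjacent_pairs(sample):
    print(left, right)
```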
- class vaporetto.Token
Representation of a token.
- end()
Return the end position (exclusive) in characters.
- Return type:
int
- n_tags()
Return the number of tags assigned to this token.
- Return type:
int
- start()
Return the start position (inclusive) in characters.
- Return type:
int
- surface()
Return the surface of this token.
- Return type:
str
- tag(index, /)
Return the tag assigned to a given index.
- Parameters:
index (int) – An index into the set of tags.
- Return type:
Optional[str]
- Raises:
ValueError – if the index is out of range.
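The Token accessors can be combined to collect everything known about a token in one place. In the sketch below, `token_record` and `FakeToken` are hypothetical names: `FakeToken` only mimics the method names documented above so that the snippet runs without a model, and it is not how the library implements Token.

```python
def token_record(token):
    """Collect surface, character offsets, and all tags of a token."""
    # Staying below n_tags() avoids the ValueError for out-of-range indexes.
    tags = [token.tag(i) for i in range(token.n_tags())]
    return {
        "surface": token.surface(),
        "start": token.start(),  # inclusive, in characters
        "end": token.end(),      # exclusive, in characters
        "tags": tags,            # entries may be None (Optional[str])
    }


class FakeToken:
    """Stand-in with the same method names as vaporetto.Token (illustrative)."""

    def __init__(self, surface, start, tags):
        self._surface, self._start, self._tags = surface, start, tags

    def surface(self):
        return self._surface

    def start(self):
        return self._start

    def end(self):
        return self._start + len(self._surface)

    def n_tags(self):
        return len(self._tags)

    def tag(self, index):
        if not 0 <= index < len(self._tags):
            raise ValueError("index out of range")
        return self._tags[index]


print(token_record(FakeToken("社長", 2, ["名詞", "シャチョー"])))
```

With a real tokenizer, the same function could be applied to each element of the list returned by tokenize().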
- VAPORETTO_VERSION: str
Indicates the version number of vaporetto used by this wrapper. It can be used to check the compatibility of the model file.
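A version check might look like the sketch below. This reference does not say which versions are compatible with one another, so the rule used here (matching major and minor components) is purely an assumption, and `same_major_minor` is a hypothetical helper. The expected version would have to be recorded by the user alongside the model file.

```python
def same_major_minor(a: str, b: str) -> bool:
    """Return True if two dotted version strings share major and minor parts.

    Assumed compatibility rule for illustration; not taken from the library.
    """
    return a.split(".")[:2] == b.split(".")[:2]


# With the package installed, the first argument would be
# vaporetto.VAPORETTO_VERSION; literal strings are used here instead.
print(same_major_minor("0.6.3", "0.6.1"))
print(same_major_minor("0.6.3", "0.5.0"))
```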