API reference

class vaporetto.Vaporetto(model, /, predict_tags=False, wsconst='', norm=True)

Python binding of the Vaporetto tokenizer.

Examples:
>>> import vaporetto
>>> with open('path/to/vaporetto.model', 'rb') as fp:
...     tokenizer = vaporetto.Vaporetto(fp.read(), predict_tags=True)
>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'
>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
Parameters:
  • model (bytes) – A byte sequence of the model.

  • predict_tags (bool) – If True, the tokenizer predicts tags.

  • wsconst (str) – Character types that the tokenizer must not split. D: Digit, R: Roman, H: Hiragana, T: Katakana, K: Kanji, O: Other, G: Grapheme cluster. Multiple types can be combined in one string, such as DGR.

  • norm (bool) – If True, input texts will be normalized beforehand.

Return type:

vaporetto.Vaporetto

Raises:
  • ValueError – if the model is invalid.

  • ValueError – if the wsconst value is invalid.
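
Because an invalid wsconst value raises ValueError at construction time, it can be convenient to check the string before the model is loaded. The following is a minimal sketch under that assumption: check_wsconst and WSCONST_CODES are hypothetical names that are not part of vaporetto, and the check only mirrors the codes listed above (the library performs its own validation, which may be stricter or more lenient).

```python
# Character-type codes as listed in the wsconst parameter description.
WSCONST_CODES = {
    "D": "Digit",
    "R": "Roman",
    "H": "Hiragana",
    "T": "Katakana",
    "K": "Kanji",
    "O": "Other",
    "G": "Grapheme cluster",
}


def check_wsconst(wsconst):
    """Return the names of the character types selected by ``wsconst``.

    Hypothetical helper, not part of vaporetto. It raises ValueError for
    any code outside the documented set, matching the exception type that
    the constructor documents for an invalid wsconst value.
    """
    unknown = sorted(set(wsconst) - set(WSCONST_CODES))
    if unknown:
        raise ValueError("unknown wsconst code(s): " + "".join(unknown))
    # Keep the order in which the caller wrote the codes.
    return [WSCONST_CODES[code] for code in wsconst]


selected = check_wsconst("DGR")
```

Here selected holds the human-readable names for D, G, and R. The empty string, which is the default, selects no character types.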

static create_from_kytea_model(model, /, wsconst='', norm=True)

Create a new Vaporetto instance from a KyTea model.

Vaporetto does not support tag prediction with KyTea models.

Parameters:
  • model (bytes) – A byte sequence of the model.

  • wsconst (str) – Character types that the tokenizer must not split. See the wsconst parameter of the constructor for the type codes.

  • norm (bool) – If True, input texts will be normalized beforehand.

Return type:

vaporetto.Vaporetto

Raises:
  • ValueError – if the model is invalid.

  • ValueError – if the wsconst value is invalid.

tokenize(text, /)

Tokenize the given text and return the result as a list of tokens.

Parameters:

text (str) – A text to tokenize.

Return type:

vaporetto.TokenList

tokenize_to_string(text, /)

Tokenize the given text and return the result as a string.

Parameters:

text (str) – A text to tokenize.

Return type:

str
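
In the class example above, the returned string separates tokens with spaces and joins each surface to its tags with slashes. A minimal sketch for turning such a string back into structured data follows. parse_tokenized is a hypothetical helper, not part of vaporetto, and it assumes that neither surfaces nor tags contain a space or a slash; when that cannot be guaranteed, tokenize() is the safer choice.

```python
def parse_tokenized(line):
    """Split 'surface/tag/... surface/tag/...' into (surface, tags) pairs.

    Hypothetical helper. Assumes the layout shown in the class example:
    tokens separated by single spaces, fields separated by '/'.
    """
    parsed = []
    for chunk in line.split(" "):
        if not chunk:
            continue  # ignore empty pieces produced by repeated spaces
        surface, *tags = chunk.split("/")
        parsed.append((surface, tags))
    return parsed


# Output string copied from the class example above.
line = "まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ"
pairs = parse_tokenized(line)
surfaces = [surface for surface, _ in pairs]
```

When tags are not predicted, each chunk would presumably carry only the surface, in which case the tags list in every pair is empty.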

class vaporetto.TokenList

List of Token returned by the tokenizer.

__getitem__(key, /)

Return self[key].

__iter__()

Implement iter(self).

__len__()

Return len(self).

class vaporetto.TokenIterator

Iterator that returns Token.

__next__()

Implement next(self).

class vaporetto.Token

Representation of a token.

end()

Return the end position (exclusive) in characters.

Return type:

int

n_tags()

Return the number of tags assigned to this token.

Return type:

int

start()

Return the start position (inclusive) in characters.

Return type:

int

surface()

Return the surface of this token.

Return type:

str

tag(index, /)

Return the tag at the given index.

Parameters:

index (int) – An index into the tags assigned to this token.

Return type:

Optional[str]

Raises:

ValueError – if the index is out of range.
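
The accessors above can be combined to copy a token into a plain Python structure, for example before serializing results. The sketch below is illustrative only: token_record and token_records are hypothetical helpers, not part of vaporetto, and they rely solely on the methods documented here. Bounding the loop with n_tags() avoids the ValueError that tag() raises for an out-of-range index.

```python
def token_record(token):
    """Collect the documented accessors of one token into a dict.

    Hypothetical helper. ``token`` is any object that offers surface(),
    start(), end(), n_tags() and tag(index), such as vaporetto.Token.
    """
    return {
        "surface": token.surface(),
        "start": token.start(),  # inclusive, in characters
        "end": token.end(),      # exclusive, in characters
        # tag() is documented as Optional[str], so None entries are kept.
        "tags": [token.tag(i) for i in range(token.n_tags())],
    }


def token_records(tokens):
    """Apply token_record to every token of an iterable such as a TokenList."""
    return [token_record(token) for token in tokens]
```

With a tokenizer built as in the class example, token_records(tokenizer.tokenize(text)) would yield one dict per token.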

VAPORETTO_VERSION: str

The version number of the Vaporetto library used by this wrapper. It can be used to check the compatibility of a model file.