Example usage

python-vaporetto does not include model files. Before tokenizing, follow the Vaporetto documentation to download a distributed model or train your own.

Check the Vaporetto version number as shown below so that you can pick a compatible model:

>>> import vaporetto
>>> vaporetto.VAPORETTO_VERSION
'0.6.5'
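
If a script depends on a particular model, it can help to fail early when the installed library reports an unexpected version. The following is a minimal sketch only: `parse_version` and `same_minor_series` are hypothetical helpers, and the rule that versions sharing the major and minor components are compatible is an assumption made for illustration, not a policy stated by Vaporetto. Consult the Vaporetto documentation for the actual compatibility rules.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '0.6.5' into (0, 6, 5)."""
    return tuple(int(part) for part in version.split('.'))


def same_minor_series(installed: str, expected: str) -> bool:
    """ASSUMED rule: versions match when major and minor components agree."""
    return parse_version(installed)[:2] == parse_version(expected)[:2]


# In real code, pass vaporetto.VAPORETTO_VERSION as the first argument.
print(same_minor_series('0.6.5', '0.6.0'))
print(same_minor_series('0.6.5', '0.5.1'))
```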

Tokenize with Vaporetto model

The following example tokenizes a string using a Vaporetto model.

>>> import vaporetto
>>> with open('tests/data/vaporetto.model', 'rb') as fp:
...     model = fp.read()

>>> tokenizer = vaporetto.Vaporetto(model, predict_tags=True)

>>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
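
If only the string form is available (for example, in a log), it can be split back into surfaces and tags. The sketch below uses a hypothetical helper, `parse_tagged`, and assumes that tokens are separated by single spaces, that fields are separated by `/`, and that no surface or tag contains either character.

```python
def parse_tagged(line: str) -> list[tuple[str, list[str]]]:
    """Split 'surface/tag1/tag2 ...' into (surface, [tag1, tag2]) pairs.

    ASSUMPTION: tokens are separated by single spaces and fields by '/',
    and neither character occurs inside a surface or a tag.
    """
    result = []
    for item in line.split(' '):
        surface, *tags = item.split('/')
        result.append((surface, tags))
    return result


output = 'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
pairs = parse_tagged(output)
print(pairs[0])
```

When the input text may itself contain spaces or slashes, tokenize() is the safer route because it returns surfaces and tags directly.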

>>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
>>> len(tokens)
6
>>> tokens[0].surface()
'まぁ'
>>> tokens[0].tag(0)
'名詞'
>>> tokens[0].tag(1)
'マー'
>>> [token.surface() for token in tokens]
['まぁ', '社長', 'は', '火星', '猫', 'だ']
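
For downstream processing it is often convenient to flatten tokens into plain tuples. The following sketch relies only on the surface() and tag(index) methods shown above. `tokens_to_rows` is a hypothetical helper, and `StandInToken` is a stand-in class used so that the example runs without a model; with a real tokenizer, pass the result of tokenizer.tokenize(...) instead. The number of tags is supplied by the caller.

```python
class StandInToken:
    """Hypothetical stand-in that mimics only surface() and tag(index)."""

    def __init__(self, surface, tags):
        self._surface = surface
        self._tags = tags

    def surface(self):
        return self._surface

    def tag(self, index):
        return self._tags[index]


def tokens_to_rows(tokens, n_tags):
    """Collect (surface, tag_0, ..., tag_{n_tags - 1}) for each token."""
    return [
        (token.surface(), *(token.tag(i) for i in range(n_tags)))
        for token in tokens
    ]


tokens = [
    StandInToken('まぁ', ['名詞', 'マー']),
    StandInToken('社長', ['名詞', 'シャチョー']),
]
rows = tokens_to_rows(tokens, n_tags=2)
print(rows)
```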

The distributed models are compressed in the zstd format. To load one, decompress it yourself and pass the resulting bytes to the API:

>>> import vaporetto
>>> import zstandard  # zstandard package in PyPI

>>> dctx = zstandard.ZstdDecompressor()
>>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
...     with dctx.stream_reader(fp) as dict_reader:
...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags=True)
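
If the same code path has to accept both compressed and uncompressed model files, one option is to look at the first four bytes of the file. The sketch below checks for the standard zstd frame magic number and decompresses only when it is present. `is_zstd` and `read_model_bytes` are hypothetical helpers, not part of python-vaporetto, and the check covers standard zstd frames only.

```python
import io

# Magic number 0xFD2FB528 of a standard zstd frame, stored little-endian.
ZSTD_MAGIC = b'\x28\xb5\x2f\xfd'


def is_zstd(data: bytes) -> bool:
    """Return True if the data starts with the zstd frame magic number."""
    return data[:4] == ZSTD_MAGIC


def read_model_bytes(data: bytes) -> bytes:
    """Return raw model bytes, decompressing only when the data is zstd."""
    if not is_zstd(data):
        return data
    # Imported lazily so that uncompressed models work without the package.
    import zstandard  # zstandard package in PyPI
    dctx = zstandard.ZstdDecompressor()
    with dctx.stream_reader(io.BytesIO(data)) as reader:
        return reader.read()
```

The returned bytes can then be passed to vaporetto.Vaporetto(...) in the same way as in the examples above.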

Tokenize with KyTea model

To use a KyTea model, call create_from_kytea_model() instead of the constructor:

>>> import vaporetto
>>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:
...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())