Example usage
=============

python-vaporetto does not contain model files. To perform tokenization, follow the Vaporetto documentation to download distribution models or train your own models beforehand.

You can check the version number as shown below to choose compatible models:

.. code-block:: python

    >>> import vaporetto
    >>> vaporetto.VAPORETTO_VERSION
    '0.6.5'

Tokenize with Vaporetto model
-----------------------------

The following example tokenizes a string using a Vaporetto model.

.. code-block:: python

    >>> import vaporetto
    >>> with open('tests/data/vaporetto.model', 'rb') as fp:
    ...     model = fp.read()

    >>> tokenizer = vaporetto.Vaporetto(model, predict_tags=True)

    >>> tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
    'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'

    >>> tokens = tokenizer.tokenize('まぁ社長は火星猫だ')

    >>> len(tokens)
    6

    >>> tokens[0].surface()
    'まぁ'

    >>> tokens[0].tag(0)
    '名詞'

    >>> tokens[0].tag(1)
    'マー'

    >>> [token.surface() for token in tokens]
    ['まぁ', '社長', 'は', '火星', '猫', 'だ']

The distributed models are compressed in zstd format. The API does not decompress them for you, so to load a compressed model you must decompress it yourself before passing the bytes to the tokenizer:

.. code-block:: python

    >>> import vaporetto
    >>> import zstandard  # zstandard package in PyPI

    >>> dctx = zstandard.ZstdDecompressor()
    >>> with open('tests/data/vaporetto.model.zst', 'rb') as fp:
    ...     with dctx.stream_reader(fp) as dict_reader:
    ...         tokenizer = vaporetto.Vaporetto(dict_reader.read(), predict_tags=True)

Tokenize with KyTea model
-------------------------

If you want to use a KyTea model, use ``create_from_kytea_model()`` instead.

.. code-block:: python

    >>> import vaporetto
    >>> with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:  # doctest: +SKIP
    ...     tokenizer = vaporetto.Vaporetto.create_from_kytea_model(fp.read())
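
Both constructors shown on this page take the model as raw bytes, so the file-reading step can be factored out. The following is a minimal sketch, not part of python-vaporetto: ``read_model_bytes`` is a hypothetical helper name, and deciding whether to decompress from a ``.zst`` file suffix is an assumption made for illustration.

```python
from pathlib import Path


def read_model_bytes(path):
    """Return the raw bytes of a model file.

    Hypothetical helper for illustration; it is not part of python-vaporetto.
    Assumption: a file whose name ends in ``.zst`` is zstd-compressed and
    every other file is stored uncompressed.
    """
    path = Path(path)
    if path.suffix == '.zst':
        # Imported lazily so that plain models load without zstandard installed.
        import zstandard  # zstandard package in PyPI

        dctx = zstandard.ZstdDecompressor()
        with open(path, 'rb') as fp:
            with dctx.stream_reader(fp) as reader:
                return reader.read()
    # Uncompressed model: return the file contents unchanged.
    return path.read_bytes()
```

The returned bytes can then be passed to ``vaporetto.Vaporetto(...)`` or to ``vaporetto.Vaporetto.create_from_kytea_model(...)`` exactly as in the examples above.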