natto package

Submodules

natto.api module

API module for general classes used throughout the natto-py package.

exception natto.api.MeCabError

Bases: exceptions.Exception

MeCabError is a general error class for the natto-py package.

natto.binding module

Binding via CFFI to the MeCab library.

natto.dictionary module

Wrapper for MeCab dictionary information.

class natto.dictionary.DictionaryInfo(dptr, filepath, charset)

Bases: object

Representation of a MeCab DictionaryInfo struct.

The dicts attribute of MeCab returns a list of the dictionaries MeCab is using. Each DictionaryInfo exposes the attributes listed below.

Variables:
  • ptr – FFI pointer to the mecab_dictionary_info_t.
  • filepath – Full path to the dictionary file.
  • charset – Dictionary character set, e.g., SHIFT-JIS, UTF-8.
  • size – Number of words registered in this dictionary.
  • type – Dictionary type; 0 (SYS_DIC), 1 (USR_DIC), 2 (UNK_DIC)
  • lsize – Left attributes size.
  • rsize – Right attributes size.
  • version – Dictionary version.
  • next – Pointer to next dictionary information struct.

Example usage:

from natto import MeCab

with MeCab() as nm:

    # first dictionary info is MeCab's system dictionary
    sysdic = nm.dicts[0]

    # print absolute path to system dictionary
    print(sysdic.filepath)
    ...
    /usr/local/lib/mecab/dic/ipadic/sys.dic

    # print system dictionary character encoding
    print(sysdic.charset)
    ...
    utf8

    # is this really the system dictionary?
    print(sysdic.is_sysdic())
    ...
    True

SYS_DIC = 0
UNK_DIC = 2
USR_DIC = 1

is_sysdic()

Is this a system dictionary?

Returns: True if system dictionary, False otherwise.

is_unkdic()

Is this an unknown dictionary?

Returns: True if unknown dictionary, False otherwise.

is_usrdic()

Is this a user-defined dictionary?

Returns: True if user-defined dictionary, False otherwise.
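
When logging which dictionaries are loaded, it can help to map the integer type value back to a readable name. The following is a minimal sketch that does not import natto: the DictType enum and the describe_dict_type helper are hypothetical names introduced for illustration, and only the integer values are taken from the constants listed above.

```python
from enum import IntEnum


class DictType(IntEnum):
    """Hypothetical enum mirroring the DictionaryInfo type constants."""
    SYS_DIC = 0
    USR_DIC = 1
    UNK_DIC = 2


def describe_dict_type(type_value):
    """Return a readable name for a dictionary type value."""
    try:
        return DictType(type_value).name
    except ValueError:
        # Any value outside 0..2 is not covered by the documented constants.
        return "UNKNOWN_TYPE({})".format(type_value)


print(describe_dict_type(0))
print(describe_dict_type(1))
```

With a real MeCab instance, the value passed in would presumably be the type attribute of an entry in nm.dicts.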

natto.environment module

Convenience API to obtain information on MeCab environment.

class natto.environment.MeCabEnv(**kwargs)

Bases: object

Convenience class for obtaining information on the MeCab environment.

It attempts to determine the character encoding (charset) of MeCab’s system dictionary, which determines the encoding used when passing strings to MeCab and when reading string results back.

It also attempts to locate the MeCab library and obtain its absolute path.

To do so, it invokes the mecab and mecab-config executables (mecab-config is not available on Windows).

User-provided values in the environment variables MECAB_PATH and MECAB_CHARSET take precedence over the detected values.

MECAB_CHARSET = 'MECAB_CHARSET'
MECAB_PATH = 'MECAB_PATH'
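
The precedence of the two environment variables can be sketched as follows. This is an illustration only, not natto-py's implementation: resolve_setting is a hypothetical helper, and the detected values are made-up placeholders standing in for whatever the mecab and mecab-config executables would report.

```python
import os

MECAB_PATH = 'MECAB_PATH'
MECAB_CHARSET = 'MECAB_CHARSET'


def resolve_setting(env_name, detected, environ=None):
    """Prefer a non-empty user-provided environment value over a detected one."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_name)
    return value if value else detected


# Hypothetical environment in which only the charset is overridden by the user.
fake_env = {MECAB_CHARSET: 'utf8'}
print(resolve_setting(MECAB_CHARSET, 'euc-jp', fake_env))
print(resolve_setting(MECAB_PATH, '/usr/local/lib/libmecab.so', fake_env))
```

In practice the variables would be set in the shell, or through os.environ before MeCab is instantiated.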

natto.mecab module

The main interface to MeCab via natto-py.

class natto.mecab.MeCab(options=None, **kwargs)

Bases: object

The main interface to the MeCab library, wrapping the MeCab Tagger.

Create one instance for each set of MeCab options you wish to use. This interface allows for parsing Japanese into simple strings of morpheme surface and related features, or for iterating over MeCabNode instances, which contain detailed information about each morpheme.

Configure logging before instantiating MeCab to see debug messages:

import logging

fmt='%(asctime)s : %(levelname)s : %(message)s'

logging.basicConfig(format=fmt, level=logging.DEBUG)

Example usage:

from natto import MeCab

# Use a Python with-statement to ensure mecab_destroy is invoked
#
with MeCab() as nm:

    # print MeCab version
    print(nm.version)
    ...
    0.996

    # print absolute path to MeCab library
    print(nm.libpath)
    ...
    /usr/local/lib/libmecab.so

    # parse text and print result
    print(nm.parse('この星の一等賞になりたいの卓球で俺は、そんだけ!'))
    ...
    この    連体詞,*,*,*,*,*,この,コノ,コノ
    星      名詞,一般,*,*,*,*,星,ホシ,ホシ
    の      助詞,連体化,*,*,*,*,の,ノ,ノ
    一等    名詞,一般,*,*,*,*,一等,イットウ,イットー
    賞      名詞,接尾,一般,*,*,*,賞,ショウ,ショー
    に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
    なり    動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
    たい    助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
    の      助詞,連体化,*,*,*,*,の,ノ,ノ
    卓球    名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
    で      助詞,格助詞,一般,*,*,*,で,デ,デ
    俺      名詞,代名詞,一般,*,*,*,俺,オレ,オレ
    は      助詞,係助詞,*,*,*,*,は,ハ,ワ
    、      記号,読点,*,*,*,*,、,、,、
    そん    名詞,一般,*,*,*,*,そん,ソン,ソン
    だけ    助詞,副助詞,*,*,*,*,だけ,ダケ,ダケ
    !      記号,一般,*,*,*,*,!,!,!
    EOS

    # parse text into Python Generator yielding MeCabNode instances,
    # and display much more detailed information about each morpheme
    for n in nm.parse('飛べねえ鳥もいるってこった。', as_nodes=True):
        if n.is_nor():
            # morpheme surface
            # part-of-speech ID (IPADIC)
            # word cost
            print("{}\t{}\t{}".format(n.surface, n.posid, n.wcost))
    ...
    飛べ    31      7175
    ねえ    25      6661
    鳥      38      4905
    も      16      4669
    いる    31      9109
    って    15      6984
    こっ    31      9587
    た      25      5500
    。      7       215
MECAB_ANY_BOUNDARY = 0
MECAB_CHARSET = 'MECAB_CHARSET'
MECAB_INSIDE_TOKEN = 2
MECAB_LATTICE_ALLOCATE_SENTENCE = 64
MECAB_LATTICE_ALL_MORPHS = 32
MECAB_LATTICE_ALTERNATIVE = 16
MECAB_LATTICE_MARGINAL_PROB = 8
MECAB_LATTICE_NBEST = 2
MECAB_LATTICE_ONE_BEST = 1
MECAB_LATTICE_PARTIAL = 4
MECAB_PATH = 'MECAB_PATH'
MECAB_TOKEN_BOUNDARY = 1
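
The MECAB_LATTICE_* values above are distinct powers of two, so they can be combined and inspected as a bitmask. The sketch below shows this with plain integers. decode_request_type is a hypothetical helper, and how natto-py uses these values internally is not shown in this section.

```python
LATTICE_FLAGS = {
    'MECAB_LATTICE_ONE_BEST': 1,
    'MECAB_LATTICE_NBEST': 2,
    'MECAB_LATTICE_PARTIAL': 4,
    'MECAB_LATTICE_MARGINAL_PROB': 8,
    'MECAB_LATTICE_ALTERNATIVE': 16,
    'MECAB_LATTICE_ALL_MORPHS': 32,
    'MECAB_LATTICE_ALLOCATE_SENTENCE': 64,
}


def decode_request_type(value):
    """Return the names of all flags set in value, lowest bit first."""
    ordered = sorted(LATTICE_FLAGS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if value & bit]


# Combine two flags with bitwise OR, then decode the result again.
combined = (LATTICE_FLAGS['MECAB_LATTICE_NBEST']
            | LATTICE_FLAGS['MECAB_LATTICE_MARGINAL_PROB'])
print(combined)
print(decode_request_type(combined))
```
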
parse(text, **kwargs)

Parse the given text and return result from MeCab.

Parameters:
  • text (str) – the text to parse.
  • as_nodes (bool, defaults to False) – return generator of MeCabNodes if True; or string if False.
  • boundary_constraints (str or re) – regular expression for morpheme boundary splitting; if non-None and feature_constraints is None, then boundary constraint parsing will be used.
  • feature_constraints (tuple) – tuple containing tuple instances of target morpheme and corresponding feature string in order of precedence; if non-None and boundary_constraints is None, then feature constraint parsing will be used.
Returns:

A single string containing the entire MeCab output; or a Generator yielding the MeCabNode instances.

Raises:

MeCabError
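
The rule for choosing between boundary and feature constraint parsing, as stated in the parameter descriptions above, can be sketched in plain Python. select_parse_mode is a hypothetical helper and not part of natto-py, and the constraint values are illustrative. The case where both arguments are given is not covered by the descriptions, so the sketch reports it as unspecified.

```python
import re


def select_parse_mode(boundary_constraints=None, feature_constraints=None):
    """Mirror the documented rule for choosing a constraint parsing mode."""
    if boundary_constraints is not None and feature_constraints is None:
        return 'boundary'
    if feature_constraints is not None and boundary_constraints is None:
        return 'feature'
    if boundary_constraints is None and feature_constraints is None:
        return 'default'
    # The parameter descriptions do not say what happens when both are given.
    return 'unspecified'


# A regular expression to be used for morpheme boundary splitting (illustrative).
boundary = re.compile(r'卓球')
# A tuple of (target morpheme, feature string) tuples, in order of precedence.
features = (('卓球', '名詞'),)

print(select_parse_mode(boundary_constraints=boundary))
print(select_parse_mode(feature_constraints=features))
```
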

natto.node module

Wrapper for MeCab node.

class natto.node.MeCabNode(nptr, surface, feature)

Bases: object

Representation of a MeCab Node struct.

A generator yielding MeCab nodes is returned when parsing a string of Japanese with as_nodes=True. Each node contains detailed information about the morpheme it represents.

Variables:
  • ptr – This node’s pointer.
  • prev – Pointer to previous node.
  • next – Pointer to next node.
  • enext – Pointer to the node which ends at the same position.
  • bnext – Pointer to the node which starts at the same position.
  • rpath – Pointer to the right path; None if MECAB_ONE_BEST mode.
  • lpath – Pointer to the left path; None if MECAB_ONE_BEST mode.
  • surface – Surface string, Unicode.
  • feature – Feature string, Unicode.
  • nodeid – Unique node id.
  • length – Length of surface form.
  • rlength – Length of the surface form including leading white space.
  • rcattr – Right attribute id.
  • lcattr – Left attribute id.
  • posid – Part-of-speech id.
  • char_type – Character type.
  • stat – Node status; 0 (NOR), 1 (UNK), 2 (BOS), 3 (EOS), 4 (EON).
  • isbest – 1 if this node is the best node.
  • alpha – Forward accumulative log summation (with marginal probability).
  • beta – Backward accumulative log summation (with marginal probability).
  • prob – Marginal probability, only with marginal probability flag.
  • wcost – Word cost.
  • cost – Best accumulative cost from bos node to this node.

Example usage:

from natto import MeCab

text = '卓球なんて死ぬまでの暇つぶしだよ。'

# Ex. basic node parsing
#
with MeCab() as nm:
    for n in nm.parse(text, as_nodes=True):
        # ignore the end-of-sentence nodes
        if not n.is_eos():
            # output the morpheme surface and default ChaSen feature
            print('{}\t{}'.format(n.surface, n.feature))
...
卓球    名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
なんて  助詞,副助詞,*,*,*,*,なんて,ナンテ,ナンテ
死ぬ    動詞,自立,*,*,五段・ナ行,基本形,死ぬ,シヌ,シヌ
まで    助詞,副助詞,*,*,*,*,まで,マデ,マデ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
暇つぶし        名詞,一般,*,*,*,*,暇つぶし,ヒマツブシ,ヒマツブシ
だ      助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
よ      助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
。      記号,句点,*,*,*,*,。,。,。

# Ex. custom node format
#
# -F         ... short-form of --node-format
# %F,[6,7,0] ... extract these elements from ChaSen feature as CSV
#                - morpheme root-form (6th index)
#                - reading (7th index)
#                - part-of-speech (0th index)
# %h         ... part-of-speech ID (IPADIC)
#
# -U         ... short-form of --unk-format
#                specify empty CSV ,,, when morpheme cannot be found
#                in dictionary
#
with MeCab(r'-F%F,[6,7,0],%h\n -U,,,\n') as nm:
    for n in nm.parse(text, as_nodes=True):
        # ignore the end-of-sentence nodes
        if not n.is_eos():
            # output custom-formatted node feature
            print(n.feature)
...
卓球,タッキュウ,名詞,36
なんて,ナンテ,助詞,21
死ぬ,シヌ,動詞,31
まで,マデ,助詞,21
の,ノ,助詞,24
暇つぶし,ヒマツブシ,名詞,38
だ,ダ,助動詞,25
よ,ヨ,助詞,17
。,。,記号,7
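
The field selection requested by %F,[6,7,0] can be reproduced in plain Python on a default feature string, which may help when working out which indices to ask for. This is only a sketch: pick_features is a hypothetical helper, and splitting on commas assumes that no individual field contains a comma.

```python
def pick_features(feature, indices):
    """Select comma-separated fields from a feature string by index."""
    fields = feature.split(',')
    return ','.join(fields[i] for i in indices)


# Default ChaSen-style feature string for 卓球, taken from the output above.
feature = '名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー'

# Root form (6), reading (7), part-of-speech (0).
print(pick_features(feature, [6, 7, 0]))
```
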

BOS_NODE = 2
EON_NODE = 4
EOS_NODE = 3
NOR_NODE = 0
UNK_NODE = 1

is_bos()

Is this a beginning-of-sentence node?

Returns: True if beginning-of-sentence node, False otherwise.

is_eon()

Is this an end of an N-best node list?

Returns: True if end of an N-best node list, False otherwise.

is_eos()

Is this an end-of-sentence node?

Returns: True if end-of-sentence node, False otherwise.

is_nor()

Is this a normal node, defined in a dictionary?

Returns: True if normal node, False otherwise.

is_unk()

Is this an unknown node, not defined in any dictionary?

Returns: True if unknown node, False otherwise.
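
The status values can be used to keep only nodes that carry actual text. The sketch below runs without MeCab by using a hypothetical FakeNode stand-in that carries just a surface and a stat. With real MeCabNode instances, the is_nor() and is_unk() methods would express the same check.

```python
from collections import namedtuple

# Status values as listed for the stat attribute above.
NOR_NODE, UNK_NODE, BOS_NODE, EOS_NODE, EON_NODE = 0, 1, 2, 3, 4

# Hypothetical stand-in for MeCabNode carrying only surface and stat.
FakeNode = namedtuple('FakeNode', ['surface', 'stat'])


def content_surfaces(nodes):
    """Keep surfaces of normal and unknown nodes, dropping BOS/EOS/EON markers."""
    return [n.surface for n in nodes if n.stat in (NOR_NODE, UNK_NODE)]


nodes = [
    FakeNode('', BOS_NODE),
    FakeNode('卓球', NOR_NODE),
    FakeNode('xyzzy', UNK_NODE),
    FakeNode('', EOS_NODE),
]
print(content_surfaces(nodes))
```
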

natto.option_parse module

Helper class for parsing MeCab options.

class natto.option_parse.OptionParse(envch)

Bases: object

Helper class for transforming arguments into input for mecab_new2.

build_options_str(options)

Returns a string concatenation of the MeCab options.

Parameters: options (dict) – dictionary of options to use when instantiating the MeCab instance.

Returns: A string concatenation of the options used when instantiating the MeCab instance, in long-form.

parse_mecab_options(options)

Parses the MeCab options, returning them in a dictionary.

The lattice-level option has been deprecated; please use marginal or nbest instead.

Parameters: options (str or dict) – string or dictionary of options to use when instantiating the MeCab instance. May be in short- or long-form, or in a Python dictionary.

Returns: A dictionary of the specified MeCab options, where the keys are snake-cased names of the long-form of the option names.

Raises: MeCabError – An invalid value for N-best was passed in.
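
The relationship between long-form option names and snake-cased dictionary keys can be sketched as below. This is not natto-py's implementation: to_snake_key and build_long_form are hypothetical helpers, and the sorted order and the --name=value joining are assumptions made so that the example is deterministic.

```python
def to_snake_key(long_option):
    """Convert a long-form option name such as '--node-format' to 'node_format'."""
    return long_option.lstrip('-').replace('-', '_')


def build_long_form(options):
    """Join a dict of snake-cased options into long-form '--name=value' arguments."""
    parts = []
    for key in sorted(options):
        parts.append('--{}={}'.format(key.replace('_', '-'), options[key]))
    return ' '.join(parts)


print(to_snake_key('--node-format'))
print(build_long_form({'node_format': '%m', 'nbest': 2}))
```
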

natto.support module

Internal-use functions for string- and byte-conversion for supporting Python 2 and 3.

natto.support.splitter_support(py2enc)

Create tokenizer for use in boundary constraint parsing.

Parameters: py2enc (str) – Encoding used by Python 2 environment.

natto.support.string_support(py3enc)

Create byte-to-string and string-to-byte conversion functions for internal use.

Parameters: py3enc (str) – Encoding used by Python 3 environment.
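
A pair of conversion functions bound to one encoding could look like the sketch below. This only illustrates the idea: make_converters is a hypothetical name, and the actual return shape of string_support is not specified in this section.

```python
def make_converters(encoding):
    """Return (bytes_to_str, str_to_bytes) functions bound to one encoding."""
    def bytes_to_str(data):
        return data.decode(encoding)

    def str_to_bytes(text):
        return text.encode(encoding)

    return bytes_to_str, str_to_bytes


bytes_to_str, str_to_bytes = make_converters('utf-8')
raw = str_to_bytes('卓球')

# Two CJK characters take three bytes each in UTF-8.
print(len(raw))
print(bytes_to_str(raw) == '卓球')
```
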

natto.version module

Version and version info for the natto-py package.

Module contents

class natto.MeCab(options=None, **kwargs)

Bases: object

The main interface to the MeCab library, wrapping the MeCab Tagger.

Create one instance for each set of MeCab options you wish to use. This interface allows for parsing Japanese into simple strings of morpheme surface and related features, or for iterating over MeCabNode instances, which contain detailed information about each morpheme.

Configure logging before instantiating MeCab to see debug messages:

import logging

fmt='%(asctime)s : %(levelname)s : %(message)s'

logging.basicConfig(format=fmt, level=logging.DEBUG)

Example usage:

from natto import MeCab

# Use a Python with-statement to ensure mecab_destroy is invoked
#
with MeCab() as nm:

    # print MeCab version
    print(nm.version)
    ...
    0.996

    # print absolute path to MeCab library
    print(nm.libpath)
    ...
    /usr/local/lib/libmecab.so

    # parse text and print result
    print(nm.parse('この星の一等賞になりたいの卓球で俺は、そんだけ!'))
    ...
    この    連体詞,*,*,*,*,*,この,コノ,コノ
    星      名詞,一般,*,*,*,*,星,ホシ,ホシ
    の      助詞,連体化,*,*,*,*,の,ノ,ノ
    一等    名詞,一般,*,*,*,*,一等,イットウ,イットー
    賞      名詞,接尾,一般,*,*,*,賞,ショウ,ショー
    に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
    なり    動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
    たい    助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
    の      助詞,連体化,*,*,*,*,の,ノ,ノ
    卓球    名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
    で      助詞,格助詞,一般,*,*,*,で,デ,デ
    俺      名詞,代名詞,一般,*,*,*,俺,オレ,オレ
    は      助詞,係助詞,*,*,*,*,は,ハ,ワ
    、      記号,読点,*,*,*,*,、,、,、
    そん    名詞,一般,*,*,*,*,そん,ソン,ソン
    だけ    助詞,副助詞,*,*,*,*,だけ,ダケ,ダケ
    !      記号,一般,*,*,*,*,!,!,!
    EOS

    # parse text into Python Generator yielding MeCabNode instances,
    # and display much more detailed information about each morpheme
    for n in nm.parse('飛べねえ鳥もいるってこった。', as_nodes=True):
        if n.is_nor():
            # morpheme surface
            # part-of-speech ID (IPADIC)
            # word cost
            print("{}\t{}\t{}".format(n.surface, n.posid, n.wcost))
    ...
    飛べ    31      7175
    ねえ    25      6661
    鳥      38      4905
    も      16      4669
    いる    31      9109
    って    15      6984
    こっ    31      9587
    た      25      5500
    。      7       215
MECAB_ANY_BOUNDARY = 0
MECAB_CHARSET = 'MECAB_CHARSET'
MECAB_INSIDE_TOKEN = 2
MECAB_LATTICE_ALLOCATE_SENTENCE = 64
MECAB_LATTICE_ALL_MORPHS = 32
MECAB_LATTICE_ALTERNATIVE = 16
MECAB_LATTICE_MARGINAL_PROB = 8
MECAB_LATTICE_NBEST = 2
MECAB_LATTICE_ONE_BEST = 1
MECAB_LATTICE_PARTIAL = 4
MECAB_PATH = 'MECAB_PATH'
MECAB_TOKEN_BOUNDARY = 1
parse(text, **kwargs)

Parse the given text and return result from MeCab.

Parameters:
  • text (str) – the text to parse.
  • as_nodes (bool, defaults to False) – return generator of MeCabNodes if True; or string if False.
  • boundary_constraints (str or re) – regular expression for morpheme boundary splitting; if non-None and feature_constraints is None, then boundary constraint parsing will be used.
  • feature_constraints (tuple) – tuple containing tuple instances of target morpheme and corresponding feature string in order of precedence; if non-None and boundary_constraints is None, then feature constraint parsing will be used.
Returns:

A single string containing the entire MeCab output; or a Generator yielding the MeCabNode instances.

Raises:

MeCabError

class natto.DictionaryInfo(dptr, filepath, charset)

Bases: object

Representation of a MeCab DictionaryInfo struct.

The dicts attribute of MeCab returns a list of the dictionaries MeCab is using. Each DictionaryInfo exposes the attributes listed below.

Variables:
  • ptr – FFI pointer to the mecab_dictionary_info_t.
  • filepath – Full path to the dictionary file.
  • charset – Dictionary character set, e.g., SHIFT-JIS, UTF-8.
  • size – Number of words registered in this dictionary.
  • type – Dictionary type; 0 (SYS_DIC), 1 (USR_DIC), 2 (UNK_DIC)
  • lsize – Left attributes size.
  • rsize – Right attributes size.
  • version – Dictionary version.
  • next – Pointer to next dictionary information struct.

Example usage:

from natto import MeCab

with MeCab() as nm:

    # first dictionary info is MeCab's system dictionary
    sysdic = nm.dicts[0]

    # print absolute path to system dictionary
    print(sysdic.filepath)
    ...
    /usr/local/lib/mecab/dic/ipadic/sys.dic

    # print system dictionary character encoding
    print(sysdic.charset)
    ...
    utf8

    # is this really the system dictionary?
    print(sysdic.is_sysdic())
    ...
    True

SYS_DIC = 0
UNK_DIC = 2
USR_DIC = 1

is_sysdic()

Is this a system dictionary?

Returns: True if system dictionary, False otherwise.

is_unkdic()

Is this an unknown dictionary?

Returns: True if unknown dictionary, False otherwise.

is_usrdic()

Is this a user-defined dictionary?

Returns: True if user-defined dictionary, False otherwise.

class natto.MeCabNode(nptr, surface, feature)

Bases: object

Representation of a MeCab Node struct.

A generator yielding MeCab nodes is returned when parsing a string of Japanese with as_nodes=True. Each node contains detailed information about the morpheme it represents.

Variables:
  • ptr – This node’s pointer.
  • prev – Pointer to previous node.
  • next – Pointer to next node.
  • enext – Pointer to the node which ends at the same position.
  • bnext – Pointer to the node which starts at the same position.
  • rpath – Pointer to the right path; None if MECAB_ONE_BEST mode.
  • lpath – Pointer to the left path; None if MECAB_ONE_BEST mode.
  • surface – Surface string, Unicode.
  • feature – Feature string, Unicode.
  • nodeid – Unique node id.
  • length – Length of surface form.
  • rlength – Length of the surface form including leading white space.
  • rcattr – Right attribute id.
  • lcattr – Left attribute id.
  • posid – Part-of-speech id.
  • char_type – Character type.
  • stat – Node status; 0 (NOR), 1 (UNK), 2 (BOS), 3 (EOS), 4 (EON).
  • isbest – 1 if this node is the best node.
  • alpha – Forward accumulative log summation (with marginal probability).
  • beta – Backward accumulative log summation (with marginal probability).
  • prob – Marginal probability, only with marginal probability flag.
  • wcost – Word cost.
  • cost – Best accumulative cost from bos node to this node.

Example usage:

from natto import MeCab

text = '卓球なんて死ぬまでの暇つぶしだよ。'

# Ex. basic node parsing
#
with MeCab() as nm:
    for n in nm.parse(text, as_nodes=True):
        # ignore the end-of-sentence nodes
        if not n.is_eos():
            # output the morpheme surface and default ChaSen feature
            print('{}\t{}'.format(n.surface, n.feature))
...
卓球    名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
なんて  助詞,副助詞,*,*,*,*,なんて,ナンテ,ナンテ
死ぬ    動詞,自立,*,*,五段・ナ行,基本形,死ぬ,シヌ,シヌ
まで    助詞,副助詞,*,*,*,*,まで,マデ,マデ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
暇つぶし        名詞,一般,*,*,*,*,暇つぶし,ヒマツブシ,ヒマツブシ
だ      助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
よ      助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
。      記号,句点,*,*,*,*,。,。,。

# Ex. custom node format
#
# -F         ... short-form of --node-format
# %F,[6,7,0] ... extract these elements from ChaSen feature as CSV
#                - morpheme root-form (6th index)
#                - reading (7th index)
#                - part-of-speech (0th index)
# %h         ... part-of-speech ID (IPADIC)
#
# -U         ... short-form of --unk-format
#                specify empty CSV ,,, when morpheme cannot be found
#                in dictionary
#
with MeCab(r'-F%F,[6,7,0],%h\n -U,,,\n') as nm:
    for n in nm.parse(text, as_nodes=True):
        # ignore the end-of-sentence nodes
        if not n.is_eos():
            # output custom-formatted node feature
            print(n.feature)
...
卓球,タッキュウ,名詞,36
なんて,ナンテ,助詞,21
死ぬ,シヌ,動詞,31
まで,マデ,助詞,21
の,ノ,助詞,24
暇つぶし,ヒマツブシ,名詞,38
だ,ダ,助動詞,25
よ,ヨ,助詞,17
。,。,記号,7

BOS_NODE = 2
EON_NODE = 4
EOS_NODE = 3
NOR_NODE = 0
UNK_NODE = 1

is_bos()

Is this a beginning-of-sentence node?

Returns: True if beginning-of-sentence node, False otherwise.

is_eon()

Is this an end of an N-best node list?

Returns: True if end of an N-best node list, False otherwise.

is_eos()

Is this an end-of-sentence node?

Returns: True if end-of-sentence node, False otherwise.

is_nor()

Is this a normal node, defined in a dictionary?

Returns: True if normal node, False otherwise.

is_unk()

Is this an unknown node, not defined in any dictionary?

Returns: True if unknown node, False otherwise.

exception natto.MeCabError

Bases: exceptions.Exception

MeCabError is a general error class for the natto-py package.

class natto.OptionParse(envch)

Bases: object

Helper class for transforming arguments into input for mecab_new2.

build_options_str(options)

Returns a string concatenation of the MeCab options.

Parameters: options (dict) – dictionary of options to use when instantiating the MeCab instance.

Returns: A string concatenation of the options used when instantiating the MeCab instance, in long-form.

parse_mecab_options(options)

Parses the MeCab options, returning them in a dictionary.

The lattice-level option has been deprecated; please use marginal or nbest instead.

Parameters: options (str or dict) – string or dictionary of options to use when instantiating the MeCab instance. May be in short- or long-form, or in a Python dictionary.

Returns: A dictionary of the specified MeCab options, where the keys are snake-cased names of the long-form of the option names.

Raises: MeCabError – An invalid value for N-best was passed in.

natto.string_support(py3enc)

Create byte-to-string and string-to-byte conversion functions for internal use.

Parameters: py3enc (str) – Encoding used by Python 3 environment.

natto.splitter_support(py2enc)

Create tokenizer for use in boundary constraint parsing.

Parameters: py2enc (str) – Encoding used by Python 2 environment.