natto package
Submodules
natto.api module
API module for general classes used throughout the natto-py package.
exception natto.api.MeCabError
Bases: Exception
MeCabError is a general error class for the natto-py package.
natto.binding module
Binding via CFFI to the MeCab library.
natto.dictionary module
Wrapper for MeCab dictionary information.
class natto.dictionary.DictionaryInfo(dptr, filepath, charset)
Bases: object
Representation of a MeCab DictionaryInfo struct.
The dicts attribute of a MeCab instance returns a list of the dictionaries MeCab is using. Each entry exposes the attributes listed below.
Variables:
- ptr – FFI pointer to the mecab_dictionary_info_t.
- filepath – Full path to the dictionary file.
- charset – Dictionary character set, e.g., SHIFT-JIS, UTF-8.
- size – Number of words registered in this dictionary.
- type – Dictionary type; 0 (SYS_DIC), 1 (USR_DIC), 2 (UNK_DIC).
- lsize – Left attributes size.
- rsize – Right attributes size.
- version – Dictionary version.
- next – Pointer to next dictionary information struct.
Example usage:
    from natto import MeCab

    with MeCab() as nm:

        # first dictionary info is MeCab's system dictionary
        sysdic = nm.dicts[0]

        # print absolute path to system dictionary
        print(sysdic.filepath)
        # /usr/local/lib/mecab/dic/ipadic/sys.dic

        # print system dictionary character encoding
        print(sysdic.charset)
        # utf8

        # is this really the system dictionary?
        print(sysdic.is_sysdic())
        # True
SYS_DIC = 0
UNK_DIC = 2
USR_DIC = 1
is_sysdic()
Is this a system dictionary?
Returns: True if system dictionary, False otherwise.
is_unkdic()
Is this an unknown dictionary?
Returns: True if unknown dictionary, False otherwise.
is_usrdic()
Is this a user-defined dictionary?
Returns: True if user-defined dictionary, False otherwise.
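The type codes and attributes above can be combined into a short summary of each loaded dictionary. The following is a minimal sketch, not part of natto-py: the summarize helper and the TYPE_LABELS mapping are hypothetical, and stand-in objects replace real DictionaryInfo instances so that it runs without MeCab installed.

```python
from types import SimpleNamespace

# Type codes as documented for DictionaryInfo above.
SYS_DIC, USR_DIC, UNK_DIC = 0, 1, 2
TYPE_LABELS = {SYS_DIC: "system", USR_DIC: "user", UNK_DIC: "unknown"}


def summarize(dicts):
    """Return one line per dictionary: label, charset and file path."""
    return [
        "{} [{}] {}".format(TYPE_LABELS.get(d.type, "other"), d.charset, d.filepath)
        for d in dicts
    ]


# Stand-ins for DictionaryInfo objects; the second path is made up.
fake_dicts = [
    SimpleNamespace(type=SYS_DIC, charset="utf8",
                    filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic"),
    SimpleNamespace(type=USR_DIC, charset="utf8",
                    filepath="/hypothetical/user.dic"),
]
lines = summarize(fake_dicts)
for line in lines:
    print(line)
```

With natto-py installed, the same helper should accept nm.dicts directly, since it only reads the type, charset and filepath attributes.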
natto.environment module
Convenience API for obtaining information about the MeCab environment.
class natto.environment.MeCabEnv(**kwargs)
Bases: object
Convenience class for obtaining information about the MeCab environment.
This attempts to determine the character encoding (charset) of MeCab’s system dictionary, which in turn determines the encoding used when passing strings to MeCab and when reading string results back.
It also attempts to locate the MeCab library and obtain its absolute path.
Doing so invokes the mecab and mecab-config executables (mecab-config is not available on Windows).
Values provided by the user in the environment variables MECAB_PATH and MECAB_CHARSET take precedence.
MECAB_CHARSET = 'MECAB_CHARSET'
MECAB_PATH = 'MECAB_PATH'
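The precedence rule described above (an environment variable, if set, wins over automatic detection) can be sketched as follows. This is an illustration only and not MeCabEnv's actual implementation: resolve_setting and the detect callables are hypothetical names, and the detected values are placeholders.

```python
import os


def resolve_setting(env_name, detect, environ=None):
    """Return the environment variable's value if set and non-empty,
    otherwise fall back to the result of detect()."""
    environ = os.environ if environ is None else environ
    value = environ.get(env_name)
    if value:
        return value
    return detect()


# A fake environment keeps the example independent of the real one.
fake_env = {"MECAB_CHARSET": "utf8"}

# MECAB_CHARSET is set, so the detection callable is never consulted.
charset = resolve_setting("MECAB_CHARSET", lambda: "euc-jp", fake_env)

# MECAB_PATH is not set, so the (placeholder) detection result is used.
libpath = resolve_setting("MECAB_PATH", lambda: "/hypothetical/libmecab.so", fake_env)

print(charset, libpath)
```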
natto.mecab module
The main interface to MeCab via natto-py.
class natto.mecab.MeCab(options=None, **kwargs)
Bases: object
The main interface to the MeCab library, wrapping the MeCab Tagger.
Instantiate this once for each set of MeCab options you wish to use. This interface can parse Japanese text into a simple string of morpheme surfaces and their features, or iterate over MeCabNode instances that carry detailed information about each morpheme.
Configure logging before instantiating MeCab to see debug messages:
    import logging

    fmt = '%(asctime)s : %(levelname)s : %(message)s'
    logging.basicConfig(format=fmt, level=logging.DEBUG)
Example usage:
    from natto import MeCab

    # Use a Python with-statement to ensure mecab_destroy is invoked
    #
    with MeCab() as nm:

        # print MeCab version
        print(nm.version)
        # 0.996

        # print absolute path to MeCab library
        print(nm.libpath)
        # /usr/local/lib/libmecab.so

        # parse text and print result
        print(nm.parse('この星の一等賞になりたいの卓球で俺は、そんだけ!'))
        # この 連体詞,*,*,*,*,*,この,コノ,コノ
        # 星 名詞,一般,*,*,*,*,星,ホシ,ホシ
        # の 助詞,連体化,*,*,*,*,の,ノ,ノ
        # 一等 名詞,一般,*,*,*,*,一等,イットウ,イットー
        # 賞 名詞,接尾,一般,*,*,*,賞,ショウ,ショー
        # に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
        # なり 動詞,自立,*,*,五段・ラ行,連用形,なる,ナリ,ナリ
        # たい 助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
        # の 助詞,連体化,*,*,*,*,の,ノ,ノ
        # 卓球 名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
        # で 助詞,格助詞,一般,*,*,*,で,デ,デ
        # 俺 名詞,代名詞,一般,*,*,*,俺,オレ,オレ
        # は 助詞,係助詞,*,*,*,*,は,ハ,ワ
        # 、 記号,読点,*,*,*,*,、,、,、
        # そん 名詞,一般,*,*,*,*,そん,ソン,ソン
        # だけ 助詞,副助詞,*,*,*,*,だけ,ダケ,ダケ
        # ! 記号,一般,*,*,*,*,!,!,!
        # EOS

        # parse text into Python Generator yielding MeCabNode instances,
        # and display much more detailed information about each morpheme
        for n in nm.parse('飛べねえ鳥もいるってこった。', as_nodes=True):
            if n.is_nor():
                # morpheme surface
                # part-of-speech ID (IPADIC)
                # word cost
                print("{}\t{}\t{}".format(n.surface, n.posid, n.wcost))
        # 飛べ 31 7175
        # ねえ 25 6661
        # 鳥 38 4905
        # も 16 4669
        # いる 31 9109
        # って 15 6984
        # こっ 31 9587
        # た 25 5500
        # 。 7 215
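When parse returns a single string, as in the first half of the example above, that string usually has to be split before further processing. The sketch below is a hypothetical helper, not part of natto-py. It assumes MeCab's default output format, in which each line holds the surface, a tab, and a comma-separated feature string, with a final EOS line; a custom node format would need different handling.

```python
def split_output(output):
    """Split default-format MeCab output into (surface, features) pairs."""
    rows = []
    for line in output.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        surface, _, feature = line.partition("\t")
        rows.append((surface, feature.split(",")))
    return rows


# Two lines taken from the example output above, joined with tabs
# (the tab separator is the assumption noted in the lead-in).
sample = (
    "この\t連体詞,*,*,*,*,*,この,コノ,コノ\n"
    "星\t名詞,一般,*,*,*,*,星,ホシ,ホシ\n"
    "EOS"
)
rows = split_output(sample)
for surface, features in rows:
    print(surface, features[0])
```

For anything beyond quick inspection, iterating with as_nodes=True avoids the string parsing altogether.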
MECAB_ANY_BOUNDARY = 0
MECAB_CHARSET = 'MECAB_CHARSET'
MECAB_INSIDE_TOKEN = 2
MECAB_LATTICE_ALLOCATE_SENTENCE = 64
MECAB_LATTICE_ALL_MORPHS = 32
MECAB_LATTICE_ALTERNATIVE = 16
MECAB_LATTICE_MARGINAL_PROB = 8
MECAB_LATTICE_NBEST = 2
MECAB_LATTICE_ONE_BEST = 1
MECAB_LATTICE_PARTIAL = 4
MECAB_PATH = 'MECAB_PATH'
MECAB_TOKEN_BOUNDARY = 1
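The MECAB_LATTICE_* values listed above are distinct powers of two, so they can be combined and inspected as bit flags. The sketch below only demonstrates that arithmetic; the flag_names helper is hypothetical, and this section does not say whether callers are expected to combine these constants themselves.

```python
# Values copied from the constants listed above.
LATTICE_FLAGS = {
    "MECAB_LATTICE_ONE_BEST": 1,
    "MECAB_LATTICE_NBEST": 2,
    "MECAB_LATTICE_PARTIAL": 4,
    "MECAB_LATTICE_MARGINAL_PROB": 8,
    "MECAB_LATTICE_ALTERNATIVE": 16,
    "MECAB_LATTICE_ALL_MORPHS": 32,
    "MECAB_LATTICE_ALLOCATE_SENTENCE": 64,
}


def flag_names(request_type):
    """Return the names of all flags whose bit is set in request_type."""
    return [name for name, bit in LATTICE_FLAGS.items() if request_type & bit]


# Combine two flags with bitwise OR: 2 | 8 == 10.
request_type = (LATTICE_FLAGS["MECAB_LATTICE_NBEST"]
                | LATTICE_FLAGS["MECAB_LATTICE_MARGINAL_PROB"])
print(request_type, flag_names(request_type))
```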
parse(text, **kwargs)
Parse the given text and return the result from MeCab.
Parameters:
- text (str) – the text to parse.
- as_nodes (bool, defaults to False) – return a generator of MeCabNode instances if True, or a string if False.
- boundary_constraints (str or re) – regular expression for morpheme boundary splitting; if non-None and feature_constraints is None, then boundary constraint parsing will be used.
- feature_constraints (tuple) – tuple of tuples, each pairing a target morpheme with its corresponding feature string, in order of precedence; if non-None and boundary_constraints is None, then feature constraint parsing will be used.
Returns: A single string containing the entire MeCab output, or a generator yielding MeCabNode instances.
Raises: MeCabError
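The two constraint parameters are mutually exclusive as documented: each takes effect only when the other is None. The sketch below restates that rule as a small, hypothetical helper (constraint_mode is not a natto-py function) and shows the argument shapes described above. The documentation does not say what happens when both are supplied, so this helper simply refuses that case; the sample morpheme and feature string are illustrative values.

```python
import re


def constraint_mode(boundary_constraints=None, feature_constraints=None):
    """Report which kind of constrained parsing the documented rule selects."""
    if boundary_constraints is not None and feature_constraints is None:
        return "boundary"
    if feature_constraints is not None and boundary_constraints is None:
        return "feature"
    if boundary_constraints is None and feature_constraints is None:
        return "none"
    # Not covered by the documentation above; rejecting it is this sketch's choice.
    raise ValueError("supply only one of boundary_constraints or feature_constraints")


# boundary_constraints: a str or compiled regular expression.
pattern = re.compile("ヒーロー見参")

# feature_constraints: a tuple of (morpheme, feature string) tuples.
features = (("ヒーロー見参", "感動詞"),)

print(constraint_mode(boundary_constraints=pattern))
print(constraint_mode(feature_constraints=features))
```

With natto-py installed, these values would be passed straight through, for example nm.parse(text, boundary_constraints=pattern, as_nodes=True).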
natto.node module
Wrapper for MeCab node.
class natto.node.MeCabNode(nptr, surface, feature)
Bases: object
Representation of a MeCab Node struct.
Parsing a string of Japanese with as_nodes=True yields MeCabNode instances. Each node contains detailed information about the morpheme it represents.
Variables:
- ptr – This node’s pointer.
- prev – Pointer to previous node.
- next – Pointer to next node.
- enext – Pointer to the node which ends at the same position.
- bnext – Pointer to the node which starts at the same position.
- rpath – Pointer to the right path; None if MECAB_ONE_BEST mode.
- lpath – Pointer to the left path; None if MECAB_ONE_BEST mode.
- surface – Surface string, Unicode.
- feature – Feature string, Unicode.
- nodeid – Unique node id.
- length – Length of surface form.
- rlength – Length of the surface form including leading white space.
- rcattr – Right attribute id.
- lcattr – Left attribute id.
- posid – Part-of-speech id.
- char_type – Character type.
- stat – Node status; 0 (NOR), 1 (UNK), 2 (BOS), 3 (EOS), 4 (EON).
- isbest – 1 if this node is best node.
- alpha – Forward accumulative log summation (with marginal probability).
- beta – Backward accumulative log summation (with marginal probability).
- prob – Marginal probability, only with marginal probability flag.
- wcost – Word cost.
- cost – Best accumulative cost from bos node to this node.
Example usage:
    from natto import MeCab

    text = '卓球なんて死ぬまでの暇つぶしだよ。'

    # Ex. basic node parsing
    #
    with MeCab() as nm:
        for n in nm.parse(text, as_nodes=True):
            # ignore the end-of-sentence nodes
            if not n.is_eos():
                # output the morpheme surface and default ChaSen feature
                print('{}\t{}'.format(n.surface, n.feature))
    # 卓球 名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー
    # なんて 助詞,副助詞,*,*,*,*,なんて,ナンテ,ナンテ
    # 死ぬ 動詞,自立,*,*,五段・ナ行,基本形,死ぬ,シヌ,シヌ
    # まで 助詞,副助詞,*,*,*,*,まで,マデ,マデ
    # の 助詞,連体化,*,*,*,*,の,ノ,ノ
    # 暇つぶし 名詞,一般,*,*,*,*,暇つぶし,ヒマツブシ,ヒマツブシ
    # だ 助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
    # よ 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
    # 。 記号,句点,*,*,*,*,。,。,。

    # Ex. custom node format
    #
    # -F         ... short-form of --node-format
    # %F,[6,7,0] ... extract these elements from ChaSen feature as CSV
    #                - morpheme root-form (6th index)
    #                - reading (7th index)
    #                - part-of-speech (0th index)
    # %h         ... part-of-speech ID (IPADIC)
    #
    # -U         ... short-form of --unk-format
    #                specify empty CSV ,,, when morpheme cannot be found
    #                in dictionary
    #
    with MeCab(r'-F%F,[6,7,0],%h\n -U,,,\n') as nm:
        for n in nm.parse(text, as_nodes=True):
            # ignore the end-of-sentence nodes
            if not n.is_eos():
                # output custom-formatted node feature
                print(n.feature)
    # 卓球,タッキュウ,名詞,36
    # なんて,ナンテ,助詞,21
    # 死ぬ,シヌ,動詞,31
    # まで,マデ,助詞,21
    # の,ノ,助詞,24
    # 暇つぶし,ヒマツブシ,名詞,38
    # だ,ダ,助動詞,25
    # よ,ヨ,助詞,17
    # 。,。,記号,7
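The custom node format in the example picks fields 6, 7 and 0 out of the ChaSen feature string inside MeCab itself. The same selection can be made in Python on the default feature string, which may be handy when one MeCab instance has to serve several output shapes. This is a minimal sketch with a hypothetical helper name; it assumes the features contain no quoted commas, and it pads missing positions with empty strings on the assumption that some entries (for example unknown words) may carry fewer fields.

```python
def feature_fields(feature, indices=(6, 7, 0)):
    """Pick the given positions out of a comma-separated feature string."""
    parts = feature.split(",")
    # Pad with "" when an index is beyond the available fields.
    return [parts[i] if i < len(parts) else "" for i in indices]


# Default ChaSen feature string for the first morpheme in the example above.
feature = "名詞,サ変接続,*,*,*,*,卓球,タッキュウ,タッキュー"
root, reading, pos = feature_fields(feature)
print(root, reading, pos)
```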
BOS_NODE = 2
EON_NODE = 4
EOS_NODE = 3
NOR_NODE = 0
UNK_NODE = 1
is_bos()
Is this a beginning-of-sentence node?
Returns: True if beginning-of-sentence node, False otherwise.
is_eon()
Is this an end of an N-best node list?
Returns: True if end of an N-best node list, False otherwise.
is_eos()
Is this an end-of-sentence node?
Returns: True if end-of-sentence node, False otherwise.
is_nor()
Is this a normal node, defined in a dictionary?
Returns: True if normal node, False otherwise.
is_unk()
Is this an unknown node, not defined in any dictionary?
Returns: True if unknown node, False otherwise.
natto.option_parse module
Helper class for parsing MeCab options.
class natto.option_parse.OptionParse(envch)
Bases: object
Helper class for transforming arguments into input for mecab_new2.
build_options_str(options)
Returns a string concatenation of the MeCab options.
Args:
- options: dictionary of options to use when instantiating the MeCab instance.
Returns:
- A string concatenation of the options used when instantiating the MeCab instance, in long-form.
parse_mecab_options(options)
Parses the MeCab options, returning them in a dictionary.
The lattice-level option has been deprecated; please use marginal or nbest instead.
Args:
- options: string or dictionary of options to use when instantiating the MeCab instance. May be in short- or long-form, or in a Python dictionary.
Returns:
- A dictionary of the specified MeCab options, where the keys are snake-cased names of the long-form of the option names.
Raises:
- MeCabError: An invalid value for N-best was passed in.
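The relationship between long-form option names and the snake-cased dictionary keys can be illustrated with two small functions. These are hypothetical sketches of the naming convention only, not the OptionParse implementation; in particular, the exact string that build_options_str produces (separators, quoting, ordering) is an assumption here.

```python
def option_key(long_option):
    """Convert a long-form option name such as '--node-format' to 'node_format'."""
    return long_option.lstrip("-").replace("-", "_")


def long_form(options):
    """Render a dict of snake-cased options as long-form flags.

    A value of True is treated as a bare flag; anything else is
    rendered as --name=value.
    """
    parts = []
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        parts.append(flag if value is True else "{}={}".format(flag, value))
    return " ".join(parts)


key = option_key("--node-format")
rendered = long_form({"nbest": 2, "all_morphs": True})
print(key, rendered)
```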
natto.support module
Internal-use functions for MeCab-Python string and byte conversion.
natto.support.splitter_support()
Create tokenizer for use in boundary constraint parsing.
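As a rough picture of what a tokenizer for boundary constraint parsing has to do, the sketch below splits text into chunks and marks which chunks matched the pattern. It is an assumption about the general technique, written with the standard re module; the function name, its signature and the (chunk, matched) pair format are hypothetical and may differ from what splitter_support actually returns.

```python
import re


def tokenize(text, pattern):
    """Yield (chunk, matched) pairs that cover text from left to right."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            # Text between the previous match and this one.
            yield text[pos:m.start()], False
        if m.group():
            # The matched span itself; empty matches are skipped.
            yield m.group(), True
        pos = m.end()
    if pos < len(text):
        # Trailing text after the last match.
        yield text[pos:], False


tokens = list(tokenize("心の中でヒーロー見参!", "ヒーロー見参"))
for chunk, matched in tokens:
    print(chunk, matched)
```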
natto.support.string_support(enc)
Create byte-to-string and string-to-byte conversion functions for internal use.
Parameters: enc (str) – Character encoding.
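A pair of conversion functions bound to one encoding can be built with a simple closure. The sketch below shows the idea using only built-in str and bytes methods; make_converters is a hypothetical name, and the real string_support may return its functions in a different order or handle errors differently.

```python
def make_converters(enc):
    """Return (bytes_to_str, str_to_bytes) functions bound to encoding enc."""

    def bytes_to_str(data):
        return data.decode(enc)

    def str_to_bytes(text):
        return text.encode(enc)

    return bytes_to_str, str_to_bytes


to_str, to_bytes = make_converters("utf-8")
raw = to_bytes("卓球")          # bytes suitable for passing to a C library
round_trip = to_str(raw)        # back to a Python str
print(raw, round_trip)
```

Binding the encoding once matters because, as noted for MeCabEnv, the charset of the system dictionary determines how every string is passed in and read back.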
natto.version module
Version and version info for the natto-py package.
Module contents
class natto.MeCab(options=None, **kwargs)
Bases: object
The main interface to the MeCab library, wrapping the MeCab Tagger.
class natto.DictionaryInfo(dptr, filepath, charset)
Bases: object
Representation of a MeCab DictionaryInfo struct.
class natto.MeCabNode(nptr, surface, feature)
Bases: object
Representation of a MeCab Node struct.
exception natto.MeCabError
Bases: Exception
MeCabError is a general error class for the natto-py package.
class natto.OptionParse(envch)
Bases: object
Helper class for transforming arguments into input for mecab_new2.
natto.string_support(enc)
Create byte-to-string and string-to-byte conversion functions for internal use.
Parameters: enc (str) – Character encoding.
natto.splitter_support()
Create tokenizer for use in boundary constraint parsing.