data.aff
: contents of .aff file
The module represents data from Hunspell’s *.aff
file.
This text file has the following format:
# pseudo-comment
DIRECTIVE_NAME value1 value2 value 3
# directives with large array of values
DIRECTIVE_NAME <num_of_values>
DIRECTIVE_NAME value1_1 value1_2 value1_3
DIRECTIVE_NAME value2_1 value2_2 value2_3
# ...
The number of values that follow DIRECTIVE_NAME is defined by the directive itself. Values are separated
by any number of spaces (so a value that must include a literal “ ” encodes it as “_”).
Note: We say “pseudo-comment” above because it is just a convention. In fact, Hunspell has no code that explicitly interprets anything starting with # as a comment; rather, it ignores everything that is not a known directive name, and everything after the expected number of directive values. But it is important NOT to drop # and the content after it before interpreting, as it might be meaningful! Some dictionaries define # to be a flag, or a BREAK character. For example, en_GB in LibreOffice does this:
# in .aff file:
COMPOUNDRULE #*0{
# reads: rule of producing compound words:
# any words with flag "#", 0 or more times (*),
# then any word with flag "0",
# then any word with flag "{"
# in .dic file:
1/#0
# reads: "1" is a word, having flags "#" and "0"
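A minimal sketch of the line format described above. The helper name split_directive is hypothetical (it is not part of Spylls or Hunspell); it only illustrates that values are split on runs of whitespace and that # is kept as an ordinary value instead of being stripped as a comment.

```python
def split_directive(line):
    """Split one .aff line into (directive_name, values).

    Hypothetical helper, for illustration only. '#' is NOT treated as
    a comment marker here, because it may be a legitimate value.
    """
    parts = line.split()  # any run of whitespace separates values
    if not parts:
        return None, []
    return parts[0], parts[1:]


name, values = split_directive("COMPOUNDRULE   #*0{")
print(name, values)
```

How many of the returned values are meaningful is then decided by the code that handles each particular directive.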
The Aff class stores all data from the file; read the class docs to better understand the
conventions and usage of directives.
Aff
-
class
Aff
(SET='Windows-1252', FLAG='short', LANG=None, WORDCHARS=None, IGNORE=None, CHECKSHARPS=False, FORBIDDENWORD=None, KEEPCASE=None, NOSUGGEST=None, KEY='qwertyuiop|asdfghjkl|zxcvbnm', TRY='', REP=<factory>, MAP=<factory>, NOSPLITSUGS=False, PHONE=None, MAXCPDSUGS=3, MAXNGRAMSUGS=4, MAXDIFF=-1, ONLYMAXDIFF=False, PFX=<factory>, SFX=<factory>, NEEDAFFIX=None, CIRCUMFIX=None, COMPLEXPREFIXES=False, FULLSTRIP=False, BREAK=<factory>, COMPOUNDRULE=<factory>, COMPOUNDMIN=3, COMPOUNDWORDMAX=None, COMPOUNDFLAG=None, COMPOUNDBEGIN=None, COMPOUNDMIDDLE=None, COMPOUNDEND=None, ONLYINCOMPOUND=None, COMPOUNDPERMITFLAG=None, COMPOUNDFORBIDFLAG=None, FORCEUCASE=None, CHECKCOMPOUNDCASE=False, CHECKCOMPOUNDDUP=False, CHECKCOMPOUNDREP=False, CHECKCOMPOUNDTRIPLE=False, CHECKCOMPOUNDPATTERN=<factory>, SIMPLIFIEDTRIPLE=False, COMPOUNDSYLLABLE=None, COMPOUNDMORESUFFIXES=False, SYLLABLENUM=None, COMPOUNDROOT=None, ICONV=None, OCONV=None, AF=<factory>, AM=<factory>, WARN=None, FORBIDWARN=False, SUBSTANDARD=None)
The class contains all directives from the .aff file in its attributes.
Attribute names are exactly the same as the directives they are read from (they are upper-case, which is un-Pythonic, but makes it possible to unambiguously relate directives to attributes and to grep for them in code).
Attribute values are either appropriate primitive data types (strings, numbers, arrays etc.), or simple objects wrapping this data to make it easily usable in algorithms (mostly pattern-like objects, similar to the result of Python’s standard re.compile, but specific to the Hunspell domain).
Attribute docs include explanations derived from Hunspell’s man page (sometimes rephrased/abbreviated), plus links to the relevant chunks of spylls code that use the directive.
Note that all directives are optional; an empty .aff file is a valid one.
General
-
SET
: str = 'Windows-1252' .aff and .dic encoding.
Usage: Stored in readers.aff.Context and used for reopening the .aff file (after the directive was read) in reader_aff, and for opening the .dic file in reader_dic
-
FLAG
: str = 'short' The .aff file declares one of the possible flag formats:
short (default) – each flag is one ASCII character
long – each flag is two ASCII characters
numeric – each flag is a number; flags in a set are separated by ,
UTF-8 – each flag is one UTF-8 character
Flag format defines how flag sets attached to stems and affixes are parsed. For example, the .dic file entry cat/ABCD can be considered as having flags {"A", "B", "C", "D"} (default flag format, “short”), or {"AB", "CD"} (flag format “long”).
Usage: Stored in readers.aff.Context and used in reader_aff and in reader_dic
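The effect of the flag format can be sketched as follows. This is an illustrative helper, not the Spylls implementation: the function name parse_flags and the format names (taken from the list above) are assumptions.

```python
import re


def parse_flags(text, flag_format="short"):
    """Split a flag string into a set of flags (illustrative sketch)."""
    if flag_format in ("short", "UTF-8"):
        return set(text)                     # one character per flag
    if flag_format == "long":
        return set(re.findall(r"..", text))  # two characters per flag
    if flag_format == "numeric":
        return set(text.split(","))          # comma-separated numbers
    raise ValueError(f"unknown flag format: {flag_format}")


print(parse_flags("ABCD"))
print(parse_flags("ABCD", "long"))
```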
-
LANG
: Optional[str] = None ISO language code. The only codes that change behavior are those of Turkic languages, which have different I/i capitalization logic.
Usage: Abstracted into
casing
which is used in both lookup and suggest.
-
WORDCHARS
: Optional[str] = None Extends tokenizer of Hunspell command line interface with additional word characters, for example, dot, dash, n-dash, numbers.
Usage: Not used in Spylls at all, as it doesn’t do tokenization.
-
IGNORE
: Optional[spylls.hunspell.data.aff.Ignore] = None Sets characters to ignore in dictionary words, affixes and input words. Useful for optional characters, such as Arabic (harakat) or Hebrew (niqqud) diacritical marks.
Usage: in Lookup.__call__ for preparing the input word, and in reader_aff and reader_dic.
-
CHECKSHARPS
: bool = False Specifies that this language has the German “sharp S” (ß), so this language is probably German :)
The effect of this declaration is that an uppercase word containing “ß” is considered correct (the uppercase form of “ß” is “SS”, but it is allowed to leave the lowercase “ß”). The effect can be prohibited for some words by applying the KEEPCASE flag to the word (in other situations this flag has a different meaning).
Usage: To define whether to use GermanCasing in casing (which changes word lower/upper-casing slightly), and in Lookup.good_forms to drop forms where lowercase “ß” is prohibited.
-
FORBIDDENWORD
: Optional[str] = None Flag that marks a word as forbidden. The main use of this flag is to specify that some form that is logically possible (by affixing or compounding) is in fact non-existent.
Imaginary example (not from an actual English dictionary!): let’s say the word “create” can have suffixes “-d”, “-s”, “-ion”, and prefixes “un-”, “re-”, “de-”, but of all possible forms (created, creates, creation, uncreates, uncreation, …) we decide “decreated” is not an existing word. Then we mark (in the .dic file) the word “create” with flags for all those suffixes and prefixes, but also add a separate word “decreated” to the dictionary, marked with the flag specified in the .aff’s FORBIDDENWORD directive. Now this word wouldn’t be considered correct, but all other combinations would.
Suggestions
-
KEY
: str = 'qwertyuiop|asdfghjkl|zxcvbnm' String that specifies sets of adjacent characters on the keyboard (so suggest can understand that “kitteb” is the most probable misspelling of “kitten”). Format is “abc|def|xyz”. For a QWERTY English keyboard it might be qwertyuiop|asdfghjkl|zxcvbnm
Usage: Suggest.edits to pass to permutations.badcharkey.
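To illustrate how such a layout string can drive suggestions, here is a rough sketch that replaces each character of a word with its left and right keyboard neighbours. The function name keyboard_neighbour_edits is hypothetical, and the real permutations.badcharkey may do more than this.

```python
def keyboard_neighbour_edits(word, layout="qwertyuiop|asdfghjkl|zxcvbnm"):
    """Yield variants of `word` with one char replaced by a keyboard neighbour."""
    for i, ch in enumerate(word):
        pos = layout.find(ch)
        while pos != -1:
            for j in (pos - 1, pos + 1):
                # "|" separates keyboard rows, so it is never a neighbour
                if 0 <= j < len(layout) and layout[j] != "|":
                    yield word[:i] + layout[j] + word[i + 1:]
            pos = layout.find(ch, pos + 1)


candidates = set(keyboard_neighbour_edits("kitteb"))
print("kitten" in candidates)
```

Each candidate would still have to be checked against the dictionary before being offered as a suggestion.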
-
TRY
: str = '' List of all characters that can be used in words, in order of probability (most probable first), used in edits for suggestions (trying to add a missing character, or replace an erroneous one).
Usage: Suggest.edits to pass to permutations.badchar and permutations.forgotchar. Note that, obscurely enough, Suggest checks this option to decide whether a dash should be used when suggesting two words (e.g. for misspelled “foobar”, when it is decided that it is two words erroneously joined, suggest either returns only “foo bar”, or also “foo-bar”). Whether the dash is suggested is decided by the presence of "-" in TRY, or by the presence of Latin "a" (= “the language uses Latin script, and all such languages allow dashes between words”)… That’s how it is in Hunspell!
-
NOSUGGEST
: Optional[str] = None Flag to mark a word/affix as “shouldn’t be suggested” (but considered correct on lookup), like obscenities.
Usage: on Suggest creation (to make the list of dictionary words for ngram-check), and in Lookup.is_good_form (if the lookup is called from suggest, with allow_nosuggest=False)
-
KEEPCASE
: Optional[str] = None Flag to mark words which shouldn’t be considered correct unless their casing is exactly like in the dictionary.
Note: With the CHECKSHARPS declaration, words with sharp s (ß) and the KEEPCASE flag may be capitalized and uppercased, but uppercased forms of these words may not contain “ß”, only “SS”.
Usage: Suggest.suggestions to produce suggestions in proper case, and Lookup.is_good_form.
-
REP
: List[RepPattern] Table of replacements for typical typos (like “shun” -> “tion”) to try on suggest. See RepPattern for details of the format.
Usage: Suggest.edits to pass to permutations.replchars. Note that the table is populated from the aff’s REP directive, and from the dic file’s ph: tags (see Word and read_dic for detailed explanations).
-
MAP
: List[Set[str]] Sets of “similar” chars to try in suggestion (like aáã: if they all exist in the language, replacing one with another would be a frequent typo). Several chars forming a single entry should be grouped by parentheses: MAP ß(ss) (German “sharp s” and the “ss” sequence are more or less the same).
Usage: Suggest.edits to pass to permutations.mapchars.
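A small sketch of how one MAP value could be turned into the Set[str] stored in this attribute. The helper name parse_map_entry is an assumption made for illustration; it treats a parenthesised group as one entry and every other character as its own entry.

```python
import re


def parse_map_entry(value):
    """Turn a MAP value such as 'ß(ss)' or 'aáã' into a set of strings."""
    # Each match is either a parenthesised group (captured without the
    # parentheses) or a single character.
    pairs = re.findall(r"\(([^)]*)\)|(.)", value)
    return {group or char for group, char in pairs}


print(parse_map_entry("ß(ss)"))
print(parse_map_entry("aáã"))
```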
-
NOSPLITSUGS
: bool = False Never try to suggest “this word should be split in two”. LibreOffice sv_SE dictionary says “it is a must for Swedish”. (Interestingly enough, Hunspell’s tests don’t check this flag at all).
Usage:
Suggest.edits
-
PHONE
: Optional[spylls.hunspell.data.aff.PhonetTable] = None Table for metaphone transformations. Format is borrowed from aspell and described in its docs.
Note that dictionaries with a PHONE table are extremely rare: of all LibreOffice/Firefox dictionaries, only en_ZA (South Africa) contains it, though these are generic English metaphone rules and it is quite weird that they are not used more frequently.
Showcase (with LibreOffice dictionaries):
>>> misspelled = 'excersized'
>>> nometaphone = Dictionary.from_files('en/en_US')
>>> [*nometaphone.suggest(misspelled)]
['supersized']
>>> withmetaphone = Dictionary.from_files('en/en_ZA')
>>> [*withmetaphone.suggest(misspelled)]
['excerpted', 'exercised', 'excessive']
Usage:
phonet_suggest
-
MAXCPDSUGS
: int = 3 Limits the number of compound suggestions.
Usage: Suggest.suggestions to limit the number of edit-based suggestions which are compound words.
N-gram suggestions
-
MAXNGRAMSUGS
: int = 4 Set max. number of n-gram suggestions. Value 0 switches off the n-gram suggestions (see also MAXDIFF).
Usage: Suggest.ngram_suggestions (to decide whether ngram_suggest should be called at all) and Suggest.suggestions (to limit the number of ngram-based suggestions).
-
MAXDIFF
: int = -1 Set the similarity factor for the n-gram based suggestions:
5 = default value
0 = fewer n-gram suggestions, but at least one;
10 (max) = MAXNGRAMSUGS n-gram suggestions.
Usage: Suggest.ngram_suggestions where it is passed to the ngram_suggest module, and used in detailed_affix_score.
-
ONLYMAXDIFF
: bool = False Remove all bad n-gram suggestions (default mode keeps one, see MAXDIFF).
Usage: Suggest.ngram_suggestions where it is passed to the ngram_suggest module, and used in filter_guesses.
Stemming
-
PFX
: Dict[str, List[Prefix]] Dictionary of flag => prefixes with this flag. See Affix for detailed format and meaning description.
Usage:
in Suggest.ngram_suggestions to pass to ngram_suggest (and there to construct all possible forms).
also parsed into the prefixes_index Trie, which is then used in Lookup.deprefix
-
SFX
: Dict[str, List[Suffix]] Dictionary of flag => suffixes with this flag. See Affix for detailed format and meaning description.
Usage:
in Suggest.ngram_suggestions to pass to ngram_suggest (and there to construct all possible forms).
also parsed into the suffixes_index Trie, which is then used in Lookup.desuffix
-
NEEDAFFIX
: Optional[str] = None Flag saying “this stem can’t be used without affixes”. Can be also assigned to suffix/prefix, meaning “there should be other affixes besides this one”.
Usage:
Lookup.is_good_form
-
CIRCUMFIX
: Optional[str] = None Suffixes signed with this flag may be on a word when this word also has a prefix with this flag, and vice versa.
Usage:
Lookup.is_good_form
-
COMPLEXPREFIXES
: bool = False Whether stripping two prefixes is allowed (only one prefix by default). Random fun fact: of all currently available LibreOffice and Firefox dictionaries, only Firefox’s Zulu has this flag.
Usage:
Lookup.deprefix
-
FULLSTRIP
: bool = False Whether affixes are allowed to remove the entire stem.
Not used in Spylls (i.e. Spylls doesn’t fail when this option is False and the entire word is removed, so Hunspell’s fullstrip.* tests are passing).
Compounding
-
BREAK
: List[BreakPattern] Defines break points for breaking words and checking word parts separately. See BreakPattern for format definition.
Usage: Lookup.break_word
-
COMPOUNDRULE
: List[CompoundRule] Rule of producing compound words, with regexp-like syntax. See CompoundRule for format definition.
Usage: Lookup.compounds_by_rules
-
COMPOUNDMIN
: int = 3 Minimum length of words used for compounding.
Usage: Lookup.compounds_by_rules & Lookup.compounds_by_flags
-
COMPOUNDWORDMAX
: Optional[int] = None Set maximum word count in a compound word.
Usage: Lookup.compounds_by_rules & Lookup.compounds_by_flags
-
COMPOUNDFLAG
: Optional[str] = None Forms with this flag (marking either the stem, or one of the affixes) can be part of a compound. Note that the triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND is a more precise way of marking (“this word can be at the beginning of a compound”).
Usage: Lookup.is_good_form to compare the form’s compound position (or lack thereof) with the presence of the flag.
-
COMPOUNDBEGIN
: Optional[str] = None Forms with this flag (marking either the stem, or one of the affixes) can be at the beginning of the compound. Part of the triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND; the alternative to the triple is just COMPOUNDFLAG (“this form can be at any place in compound”).
Usage: Lookup.is_good_form to compare the form’s compound position (or lack thereof) with the presence of the flag.
-
COMPOUNDMIDDLE
: Optional[str] = None Forms with this flag (marking either the stem, or one of the affixes) can be in the middle of the compound (not the last part, and not the first). Part of the triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND; the alternative to the triple is just COMPOUNDFLAG (“this form can be at any place in compound”).
Usage: Lookup.is_good_form to compare the form’s compound position (or lack thereof) with the presence of the flag.
-
COMPOUNDEND
: Optional[str] = None Forms with this flag (marking either the stem, or one of the affixes) can be at the end of the compound. Part of the triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND; the alternative to the triple is just COMPOUNDFLAG (“this form can be at any place in compound”).
Usage: Lookup.is_good_form to compare the form’s compound position (or lack thereof) with the presence of the flag.
-
ONLYINCOMPOUND
: Optional[str] = None Forms with this flag (marking either the stem, or one of the affixes) can only be part of a compound word, and never standalone.
Usage: Lookup.is_good_form to compare the form’s compound position (or lack thereof) with the presence of the flag. Also in Suggest to produce the list of words suitable for ngram search.
-
COMPOUNDPERMITFLAG
: Optional[str] = None Prefixes are allowed at the beginning of compounds, suffixes are allowed at the end of compounds by default. Affixes with COMPOUNDPERMITFLAG may be inside of compounds.
Usage: Lookup.compounds_by_flags to make the list of flags passed to Lookup.produce_affix_forms (for this part of the compound, try to find affixed spellings, you can use affixes with this flag).
-
COMPOUNDFORBIDFLAG
: Optional[str] = None Prefixes are allowed at the beginning of compounds, suffixes are allowed at the end of compounds by default. Suffixes with COMPOUNDFORBIDFLAG may not be even at the end, and prefixes with this flag may not be even at the beginning.
Usage: Lookup.compounds_by_flags to make the list of flags passed to Lookup.produce_affix_forms (for this part of the compound, try to find affixed spellings, you can use affixes with this flag).
-
FORCEUCASE
: Optional[str] = None The last word part of a compound with the FORCEUCASE flag forces capitalization of the whole compound word. E.g. the Dutch word “straat” (street) with the FORCEUCASE flag will be allowed only in capitalized compound forms, according to the Dutch spelling rules for proper names.
Usage: Lookup.is_bad_compound and Suggest.suggestions (if this flag is present in the .aff file, we check whether just capitalizing the misspelled word would make it right).
-
CHECKCOMPOUNDCASE
: bool = False Forbid upper case characters at word boundaries in compounds.
Usage:
Lookup.is_bad_compound
-
CHECKCOMPOUNDDUP
: bool = False Forbid word duplication in compounds (e.g. “foofoo”).
Usage:
Lookup.is_bad_compound
-
CHECKCOMPOUNDREP
: bool = False Forbid compounding, if the (usually bad) compound word may be a non-compound word if some replacement by the REP table (frequent misspellings) is made. Useful for languages with “compound friendly” orthography.
Usage: Lookup.is_bad_compound
-
CHECKCOMPOUNDTRIPLE
: bool = False Forbid compounding, if compound word contains triple repeating letters (e.g. foo|ox or xo|oof).
Usage:
Lookup.is_bad_compound
-
CHECKCOMPOUNDPATTERN
: List[CompoundPattern] List of patterns which forbid compound words when a pair of words in the compound matches this pattern. See CompoundPattern for an explanation of the format.
Usage: Lookup.is_bad_compound
-
SIMPLIFIEDTRIPLE
: bool = False Allow simplified 2-letter forms of the compounds forbidden by CHECKCOMPOUNDTRIPLE. Example: “Schiff” + “fahrt” -> “Schiffahrt”
Usage: Lookup.compounds_by_flags, after the main splitting cycle, we also try the hypothesis that if the letter on the current boundary is duplicated, we should triplicate it.
-
COMPOUNDSYLLABLE
: Optional[Tuple[int, str]] = None Need for special compounding rules in Hungarian.
Not implemented in Spylls
-
COMPOUNDMORESUFFIXES
: bool = False Allow twofold suffixes within compounds.
Not used in Spylls and doesn’t have tests in Hunspell
-
COMPOUNDROOT
: Optional[str] = None Flag that signs the compounds in the dictionary (Now it is used only in the Hungarian language specific code).
Not used in Spylls.
Pre/post-processing
-
ICONV
: Optional[spylls.hunspell.data.aff.ConvTable] = None Input conversion table (what to do with a word before checking if it is valid). See ConvTable for format description.
Usage: Lookup.__call__
-
OCONV
: Optional[spylls.hunspell.data.aff.ConvTable] = None Output conversion table (what to do with a suggestion before returning it to the user). See ConvTable for format description.
Usage: Suggest.suggestions
Aliasing
-
AF
: Dict[str, Set[str]] Table of flag set aliases. Defined in the .aff file this way:
AF 3
AF ABC
AF BCD
AF DE
This means the set of flags “ABC” has the alias “1”, “BCD” has the alias “2”, and “DE” has the alias “3” (aliases are just sequential numbers in the table). Now, in the .dic file, foo/1 would be equivalent to foo/ABC, meaning the stem foo has flags A, B, C.
Usage: Stored in
readers.aff.Context
to decode flags on reading .aff and .dic files.
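The alias numbering can be sketched like this. The helper read_af_table is hypothetical and assumes the default “short” flag format; it is not the code Spylls uses.

```python
def read_af_table(lines):
    """Build {alias_number: flag_set} from AF lines (short flags assumed)."""
    rows = [line.split()[1] for line in lines if line.startswith("AF ")]
    # rows[0] is the declared number of entries; aliases are numbered from 1
    return {str(i): set(flags) for i, flags in enumerate(rows[1:], start=1)}


aliases = read_af_table(["AF 3", "AF ABC", "AF BCD", "AF DE"])
print(aliases["1"])
```

With this table, a .dic entry foo/1 would be decoded by looking up aliases["1"] instead of parsing “1” as a flag.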
-
AM
: Dict[str, Set[str]] Table of word data aliases. Logic of aliasing is the same as for AF.
Usage: read_dic
Other/Ignored
-
WARN
: Optional[str] = None This flag is for rare words, which are also often spelling mistakes. With command-line flag -r, Hunspell will warn about words with this flag in input text.
Not implemented in Spylls
-
FORBIDWARN
: bool = False Sets whether words with the WARN flag should be considered misspellings (errors, not warnings).
Not used in any known dictionary, and not implemented in Spylls (even in aff-reader).
-
SYLLABLENUM
: Optional[str] = None Need for special compounding rules in Hungarian. (The previous phrase is the only docs Hunspell provides :) )
Not used in Spylls.
-
SUBSTANDARD
: Optional[str] = None Flag signs affix rules and dictionary words (allomorphs) not used in morphological generation and root words removed from suggestion.
Not implemented in Spylls
Some other directives that are in docs, but are deprecated/not used (and never implemented by Spylls):
LEMMA_PRESENT
Derived attributes
These attributes are calculated after Aff reading and initialization
-
casing
: spylls.hunspell.algo.capitalization.Casing “Casing” class (defining how the words in this language are lowercased/uppercased). See Casing for details. In Aff, basically, it is GermanCasing if CHECKSHARPS is True, TurkicCasing if LANG is one of the Turkic languages (Turkish, Azerbaijani, Crimean Tatar), and regular Casing otherwise.
-
suffixes_index
: spylls.hunspell.algo.trie.Trie Trie structure for fast selecting of all possible suffixes for some word, created from
SFX
-
prefixes_index
: spylls.hunspell.algo.trie.Trie Trie structure for fast selecting all possible prefixes for some word, created from
PFX
-
Prefix
and Suffix
-
class
Affix
(flag, crossproduct, strip, add, condition, flags=<factory>)
Common base for Prefix and Suffix.
Affixes are stored in a table that looks like this:
SFX X Y 1
SFX X 0 able/CD . ds:able
Meaning of the first line (table header):
Suffix (can be PFX for prefix)
…designated by flag X
…supports cross-product (Y or N; “cross-product” means a form with this suffix is also allowed to have prefixes)
…and there is 1 of them below
Meaning of the table row:
Suffix X (should be the same as in the table header)
…when applied, doesn’t change the stem (0 = “”, but it can be “…removes some part at the end of the stem”)
…when applied, adds “able” to the stem
…and the whole form will also have flags “C”, “D”
…condition of application is “any stem” (. means “any char”, as in a regexp)
…and the whole form would have data tags (morphology) ds:able
Then, if in the dictionary we have drink/X (can have the suffix marked by X), the whole thing means “‘drinkable’ is a valid word form, has additional flags ‘C’, ‘D’ and some morphological info”.
Another example (from en_US.aff):
SFX N Y 3
SFX N e ion e
SFX N y ication y
SFX N 0 en [^ey]
This defines a suffix designated by flag N, cross-productable (Y in the header), with 3 forms:
removes “e” and adds “ion” for words ending with “e” (animate => animation)
removes “y” and adds “ication” for words ending with “y” (amplify => amplification)
removes nothing and adds “en” for words ending with neither (befall => befallen)
(TBH, I don’t have the slightest idea why the third option is grouped with the two previous ones… Probably because dictionary building is a semi-automated process of “packing” word lists into dic+aff, and the “affixes” don’t actually need to make any grammatical sense.)
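The strip/add/condition mechanics of the en_US example can be sketched as follows. SuffixRule and suffixed_forms are hypothetical names used only for this illustration; the real Suffix class also carries flag, crossproduct and flags, and is used for stripping suffixes on lookup rather than only for generating forms.

```python
import re
from dataclasses import dataclass


@dataclass
class SuffixRule:
    strip: str      # removed from the end of the stem ("" for 0)
    add: str        # appended after stripping
    condition: str  # regexp-like condition on the end of the stem

    def apply(self, stem):
        """Return the suffixed form, or None if the condition fails."""
        if not re.search(f"(?:{self.condition})$", stem):
            return None
        base = stem[: len(stem) - len(self.strip)] if self.strip else stem
        return base + self.add


# The three rows of "SFX N" from the example above
RULES_N = [
    SuffixRule("e", "ion", "e"),
    SuffixRule("y", "ication", "y"),
    SuffixRule("", "en", "[^ey]"),
]


def suffixed_forms(stem, rules):
    forms = (rule.apply(stem) for rule in rules)
    return [form for form in forms if form is not None]


print(suffixed_forms("animate", RULES_N))
```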
-
flag
: str Flag this affix marked with. Note that several affixes can have same flag (and in this case, which of them is relevant for the word, is decided by its
condition
)
-
crossproduct
: bool Whether this affix is compatible with opposite affix (e.g. if the word has both suffix and prefix, both of them should have
crossproduct=True
)
-
strip
: str What is stripped from the stem when the affix is applied
-
add
: str What is added when the affix is applied
-
condition
: str Condition against which stem should be checked to understand whether this affix is relevant
-
flags
: Set[str] Flags this affix has
Helper pattern-alike classes
These classes wrap several types of somewhat pattern-like objects that can appear in a *.aff file,
“compiling” them into something applicable, much like Python’s re
module compiles regexps.
-
class
BreakPattern
(pattern)
Contents of the Aff.BREAK directive, a pattern for splitting the word, compiled to a regexp.
Directives are stored this way:
BREAK 3
BREAK -
BREAK ^-
BREAK -$
(That’s, by the way, the default value of BREAK.) It means that Hunspell, while checking a word like “left-right”, will check “left” and “right” separately; it will also ignore “-” at the beginning and end of the word (the ^- and -$ patterns). Note that BREAK - without any special chars will NOT ignore “-” at the beginning/end.
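A rough sketch of compiling BREAK values to regexps, under the assumption that only ^ and $ keep their regexp meaning and everything else is literal. The names compile_break and break_once are made up for this example, and the real BreakPattern may build its regexp differently.

```python
import re


def compile_break(pattern):
    """Compile a BREAK value to a regexp, keeping ^ and $ as anchors."""
    escaped = re.escape(pattern).replace(r"\^", "^").replace(r"\$", "$")
    if escaped.startswith("^") or escaped.endswith("$"):
        return re.compile(escaped)
    # An unanchored pattern needs at least one char on both sides
    return re.compile(f"(?<=.){escaped}(?=.)")


def break_once(word, patterns):
    """Yield (left, right) splits of `word` at every pattern match."""
    for pattern in patterns:
        for match in compile_break(pattern).finditer(word):
            yield word[: match.start()], word[match.end():]


print(list(break_once("left-right", ["-", "^-", "-$"])))
```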
-
class
Ignore
(chars)
Contents of the Aff.IGNORE directive, chars to ignore on lookup/suggest, compiled with str.maketrans.
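For reference, this is how a deletion table built with str.maketrans behaves; the helper name make_ignore_table and the sample characters are only illustrative.

```python
def make_ignore_table(chars):
    """Build a translation table that deletes every char in `chars`."""
    return str.maketrans("", "", chars)


table = make_ignore_table("-_")
print("a-b_c".translate(table))
```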
-
class
RepPattern
(pattern, replacement)
Contents of the Aff.REP directive, a pair of (frequent typo, its replacement). The typo pattern is compiled to a regexp.
Example from Hunspell’s docs, showing all the features:
REP 5
REP f ph
REP ph f
REP tion$ shun
REP ^cooccurr co-occurr
REP ^alot$ a_lot
This means:
table of 5 replacements (first line):
try to see if “f -> ph” produces good word,
try “ph -> f”,
at the end of the word try “tion -> shun”,
at the beginning of the word try “cooccurr -> co-occurr”,
and try to replace the whole word “alot” with “a lot” (
_
stands for space).
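A sketch of how such a table can produce candidate corrections. The function rep_candidates is hypothetical; it assumes the patterns can be used as Python regexps verbatim, which holds for patterns made of letters plus ^ and $ like the ones above.

```python
import re


def rep_candidates(word, table):
    """Yield variants of `word` for each (pattern, replacement) pair."""
    for pattern, replacement in table:
        replacement = replacement.replace("_", " ")  # "_" stands for space
        for match in re.finditer(pattern, word):
            yield word[: match.start()] + replacement + word[match.end():]


REP_TABLE = [
    ("f", "ph"),
    ("ph", "f"),
    ("tion$", "shun"),
    ("^cooccurr", "co-occurr"),
    ("^alot$", "a_lot"),
]

print(list(rep_candidates("alot", REP_TABLE)))
```

As with other edits, each candidate is only a guess and still has to pass lookup.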
-
class
ConvTable
(pairs)
Table of conversions that should be applied on pre- or post-processing, stored in Aff.ICONV and Aff.OCONV. Format is as follows (as far as I can guess from code and tests; documentation is very sparse):
ICONV <number of entries>
ICONV <pattern> <replacement>
Typically, pattern and replacement are just simple strings, used mostly for replacing typographics (like trigraphs and “nice” apostrophes) before/after processing.
But if there is a _ in pattern, it is treated as: regexp ^ if at the beginning of the pattern, regexp $ if at the end, and just ignored otherwise. This seems to be a “hidden” feature, demonstrated by the nepali.* set of tests in the Hunspell distribution.
Conversion rules are applied as follows:
for each position in word
…find any matching rules
…choose the one with the longest pattern
…apply it, and shift to the position after the applied pattern (so there can’t be recursive application of several rules on top of each other).
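The application order described above can be sketched as a simple left-to-right scan. This is an illustration under simplifying assumptions: the function name convert is made up, patterns are plain strings, and the _ anchoring feature is left out.

```python
def convert(word, pairs):
    """Apply (pattern, replacement) pairs left to right, longest match first."""
    result = []
    pos = 0
    while pos < len(word):
        # All non-empty patterns that match at the current position
        matches = [(p, r) for p, r in pairs if p and word.startswith(p, pos)]
        if matches:
            pattern, replacement = max(matches, key=lambda pair: len(pair[0]))
            result.append(replacement)
            pos += len(pattern)  # replaced text is never rescanned
        else:
            result.append(word[pos])
            pos += 1
    return "".join(result)


print(convert("it’s", [("’", "'")]))
```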
-
class
CompoundPattern
(left, right, replacement=None)
Pattern to check whether a compound word is correct, stored in the Aff.CHECKCOMPOUNDPATTERN directive. Format of the pattern:
endchars[/flag] beginchars[/flag] [replacement]
The pattern matches (telling that this compound is not allowed) if some pair of the words inside the compound matches these conditions:
first word ends with endchars (and has the flags from the first element, if they are specified)
second word starts with beginchars (and has the flags from the second element, if they are specified)
endchars can be 0, specifying “word has zero affixes”.
replacement complicates things, allowing to specify “…but this string at the border of the words should be unpacked into this endchars and that beginchars, but make the compound allowed”… It complicates the algorithm significantly, and no known dictionary uses this feature, so replacement is just ignored by Spylls.
-
class
CompoundRule
(text)
Regexp-like rule for generating compound words, content of the Aff.COMPOUNDRULE directive. It is a way of specifying compounding that is alternative (and unrelated) to Aff.COMPOUNDFLAG and similar. Rules look this way:
COMPOUNDRULE A*B?CD
…reading: a compound word might consist of any number of words with flag A, then 0 or 1 words with flag B, then words with flags C and D.
en_US.aff uses this feature to specify the spelling of numerals. In the .aff file, it has
COMPOUNDRULE 2
COMPOUNDRULE n*1t
COMPOUNDRULE n*mp
And, in the .dic file:
0/nm
0th/pt
1/n1
1st/p
1th/tc
2/nm
2nd/p
2th/tc
# ...and so on...
Which makes “111th” valid (one hundred eleventh): “1” with “n”, “1” with “1” and “1th” with “t” is valid by rule n*1t, but “121th” is not valid (should be “121st”).
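The numerals example can be checked with a small sketch that treats a rule as a regexp over flag characters. The names rule_to_regexp and matches_rule are hypothetical, and the sketch assumes single-character flags and only the * and ? operators; it is not how Spylls implements CompoundRule.

```python
import re


def rule_to_regexp(rule):
    """Translate a COMPOUNDRULE with one-char flags into a regexp."""
    parts = [ch if ch in "*?" else re.escape(ch) for ch in rule]
    return re.compile("".join(parts))


def matches_rule(rule, flag_sets):
    """True if picking one flag per word can spell a full rule match."""
    regexp = rule_to_regexp(rule)

    def walk(prefix, rest):
        if not rest:
            return regexp.fullmatch(prefix) is not None
        return any(walk(prefix + flag, rest[1:]) for flag in sorted(rest[0]))

    return walk("", flag_sets)


# "111th" = "1" (flags n, 1) + "1" (flags n, 1) + "1th" (flags t, c)
print(matches_rule("n*1t", [{"n", "1"}, {"n", "1"}, {"t", "c"}]))
```

Trying every flag combination is exponential in the number of parts, which is acceptable for a demonstration but not for a real checker.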
-
class
PhonetTable
(table)
Represents a table of metaphone transformations stored in Aff.PHONE. Format is borrowed from aspell and described in its docs.
Basically, each line of the table specifies a “pattern”/“replacement” pair. Replacement is a literal string (with “_” meaning “empty string”), and pattern is … complicated. Spylls, as of now, parses rules fully (see the parse_rule method in the source), but doesn’t implement all of the algorithm’s details (like rule prioritizing, the concept of a “follow-up rule” etc.).
It is enough to pass Hunspell’s (small) test for the PHONE implementation, but it is definitely more naive than expected. But as it is a marginal feature (and there are enough metaphone implementations in Python), we aren’t (yet?) bothered by this fact.