`data.aff`: contents of .aff file

The module represents data from Hunspell’s *.aff file.

This text file has the following format:

# pseudo-comment
DIRECTIVE_NAME value1 value2 value3

# directives with large array of values
DIRECTIVE_NAME <num_of_values>
DIRECTIVE_NAME value1_1 value1_2 value1_3
DIRECTIVE_NAME value2_1 value2_2 value2_3
# ...

The number of values that follow DIRECTIVE_NAME is defined by the directive itself. Values are separated by any number of spaces (so values that should contain a literal space encode it as “_”).

Note: We say “pseudo-comment” above because it is just a convention. In fact, Hunspell has no code that explicitly interprets anything starting with # as a comment; rather, it ignores everything that is not a known directive name, and everything after the expected number of directive values. But it is important NOT to drop # and the content after it before interpreting, as it might be meaningful! Some dictionaries define # to be a flag, or a BREAK character. For example, en_GB in LibreOffice does this:
# in .aff file:
COMPOUNDRULE #*0{
# reads: rule of producing compound words:
#  any words with flag "#", 0 or more times (*),
#  then any word with flag "0",
#  then any word with flag "{"

# in .dic file:
1/#0
# reads: "1" is a word, having flags "#" and "0"
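
The caveat about # can be illustrated with a minimal sketch of a line reader that splits on whitespace and never strips comments. The names `read_directives` and `KNOWN` are hypothetical and are not part of Spylls; the sketch assumes the caller knows how many values each directive expects.

```python
def read_directives(text):
    """Split .aff text into (name, values) pairs without stripping '#'.

    Hypothetical helper: deciding which names are real directives (and how
    many values each one takes) is left to the caller.
    """
    result = []
    for line in text.splitlines():
        parts = line.split()  # any run of whitespace separates values
        if parts:
            result.append((parts[0], parts[1:]))
    return result


# Assumed for this example: directive name -> expected number of values.
KNOWN = {"COMPOUNDRULE": 1}

sample = "# in .aff file:\nCOMPOUNDRULE #*0{ trailing remark\n"
parsed = [
    (name, values[:KNOWN[name]])
    for name, values in read_directives(sample)
    if name in KNOWN
]
print(parsed)  # [('COMPOUNDRULE', ['#*0{'])]
```

The line starting with # is dropped only because "#" is not a known directive name, while the "#" inside the COMPOUNDRULE value survives.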

The Aff class stores all data from the file; read the class docs to better understand the conventions and usage of directives.

`Aff`

class Aff(SET='Windows-1252', FLAG='short', LANG=None, WORDCHARS=None, IGNORE=None, CHECKSHARPS=False, FORBIDDENWORD=None, KEEPCASE=None, NOSUGGEST=None, KEY='qwertyuiop|asdfghjkl|zxcvbnm', TRY='', REP=<factory>, MAP=<factory>, NOSPLITSUGS=False, PHONE=None, MAXCPDSUGS=3, MAXNGRAMSUGS=4, MAXDIFF=-1, ONLYMAXDIFF=False, PFX=<factory>, SFX=<factory>, NEEDAFFIX=None, CIRCUMFIX=None, COMPLEXPREFIXES=False, FULLSTRIP=False, BREAK=<factory>, COMPOUNDRULE=<factory>, COMPOUNDMIN=3, COMPOUNDWORDMAX=None, COMPOUNDFLAG=None, COMPOUNDBEGIN=None, COMPOUNDMIDDLE=None, COMPOUNDEND=None, ONLYINCOMPOUND=None, COMPOUNDPERMITFLAG=None, COMPOUNDFORBIDFLAG=None, FORCEUCASE=None, CHECKCOMPOUNDCASE=False, CHECKCOMPOUNDDUP=False, CHECKCOMPOUNDREP=False, CHECKCOMPOUNDTRIPLE=False, CHECKCOMPOUNDPATTERN=<factory>, SIMPLIFIEDTRIPLE=False, COMPOUNDSYLLABLE=None, COMPOUNDMORESUFFIXES=False, SYLLABLENUM=None, COMPOUNDROOT=None, ICONV=None, OCONV=None, AF=<factory>, AM=<factory>, WARN=None, FORBIDWARN=False, SUBSTANDARD=None)

The class contains all directives from .aff file in its attributes.

Attribute names are exactly the same as the directives they are read from (they are upper-case, which is un-Pythonic, but makes it possible to unambiguously relate directives to attributes and grep for them in the code).

Attribute values are either appropriate primitive data types (strings, numbers, arrays etc), or simple objects wrapping this data to make it easily usable in algorithms (mostly it is some pattern-alike objects, like the result of Python’s standard re.compile, but specific for Hunspell domain).

Attribute docs include explanations derived from Hunspell’s man page (sometimes rephrased/abbreviated), plus links to relevant chunks of spylls code which uses the directive.

Note that all directives are optional: an empty .aff file is a valid one.

General

SET: str = 'Windows-1252'

.aff and .dic encoding.

Usage: Stored in readers.aff.Context and used for reopening .aff file (after the directive was read) in reader_aff, and for opening .dic file in reader_dic

FLAG: str = 'short'

.aff file declares one of the possible flag formats:

short (default) – each flag is one ASCII character
long – each flag is two ASCII characters
numeric – each flag is a number; flags in a set are separated by ,
UTF-8 – each flag is one UTF-8 character

Flag format defines how flag sets attached to stems and affixes are parsed. For example, .dic file entry cat/ABCD can be considered having flags {"A", "B", "C", "D"} (default flag format, “short”), or {"AB", "CD"} (flag format “long”)

Usage: Stored in readers.aff.Context and used in reader_aff, and in reader_dic
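
A minimal sketch of how the flag format changes parsing, using the format names from the list above. `parse_flags` is a hypothetical helper written for this illustration, not the Spylls reader.

```python
import re


def parse_flags(text, flag_format="short"):
    """Split the flag part of an entry (the text after '/') into a set of flags."""
    if flag_format in ("short", "UTF-8"):
        return set(text)                      # one character per flag
    if flag_format == "long":
        return set(re.findall(r"..", text))   # two characters per flag
    if flag_format == "numeric":
        return set(text.split(","))           # comma-separated numbers
    raise ValueError(f"unknown flag format: {flag_format}")


print(sorted(parse_flags("ABCD")))          # ['A', 'B', 'C', 'D']
print(sorted(parse_flags("ABCD", "long")))  # ['AB', 'CD']
```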

LANG: Optional[str] = None

ISO language code. The only codes that change behavior are those of Turkic languages, which have different I/i capitalization logic.

Usage: Abstracted into casing which is used in both lookup and suggest.
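
To show what “different I/i capitalization logic” means, here is a small sketch of Turkic-style case conversion. The function names are hypothetical; the actual Spylls logic lives in its Casing classes and may differ in details.

```python
def turkic_lower(word):
    """Lowercase with Turkic rules: dotless I -> ı, dotted İ -> i."""
    return word.replace("I", "ı").replace("İ", "i").lower()


def turkic_upper(word):
    """Uppercase with Turkic rules: dotted i -> İ, dotless ı -> I."""
    return word.replace("i", "İ").replace("ı", "I").upper()


print("ISPARTA".lower())        # isparta
print(turkic_lower("ISPARTA"))  # ısparta
print(turkic_upper("izmir"))    # İZMİR
```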

WORDCHARS: Optional[str] = None

Extends tokenizer of Hunspell command line interface with additional word characters, for example, dot, dash, n-dash, numbers.

Usage: Not used in Spylls at all, as it doesn’t do tokenization.

IGNORE: Optional[spylls.hunspell.data.aff.Ignore] = None

Sets characters to ignore in dictionary words, affixes and input words. Useful for optional characters, such as Arabic (harakat) or Hebrew (niqqud) diacritical marks.

Usage: in Lookup.__call__ for preparing input word, and in reader_aff, and in reader_dic.

CHECKSHARPS: bool = False

Specify this language has German “sharp S” (ß), so this language is probably German :)

The effect of this declaration is that an uppercase word containing the letter “ß” is considered correct (the uppercase form of “ß” is “SS”, but it is allowed to leave “ß” lowercase). The effect can be prohibited for some words by applying the KEEPCASE flag to them (which has a different meaning in other situations).

Usage: To define whether to use GermanCasing in casing (which changes word lower/upper-casing slightly), and in Lookup.good_forms to drop forms where lowercase “ß” is prohibited.

FORBIDDENWORD: Optional[str] = None

Flag that marks word as forbidden. The main usage of this flag is to specify that some form that is logically possible (by affixing/suffixing or compounding) is in fact non-existent.

Imaginary example (not from an actual English dictionary!): let’s say the word “create” can have suffixes “-d”, “-s”, “-ion”, and prefixes “un-”, “re-”, “de-”, but of all possible forms (created, creates, creation, uncreates, uncreation, …) we decide “decreated” is not an existing word. Then we mark (in the .dic file) the word “create” with flags for all those suffixes and prefixes, but also add a separate word “decreated” to the dictionary, marked with the flag that is specified in the .aff file’s FORBIDDENWORD directive. Now this word wouldn’t be considered correct, but all other combinations would.

Usage: multiple times in both Lookup and Suggest

Suggestions

KEY: str = 'qwertyuiop|asdfghjkl|zxcvbnm'

String that specifies sets of adjacent characters on the keyboard (so suggest can understand that “kitteb” is most probably a misspelling of “kitten”). Format is “abc|def|xyz”. For an English QWERTY keyboard it might be qwertyuiop|asdfghjkl|zxcvbnm

Usage: Suggest.edits to pass to permutations.badcharkey.
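
A simplified sketch of how the KEY string can drive suggestions: each character is replaced by its left and right neighbour in the layout. `keyboard_neighbours` is a hypothetical name; the real permutations.badcharkey may try additional variants.

```python
def keyboard_neighbours(word, layout="qwertyuiop|asdfghjkl|zxcvbnm"):
    """Yield variants of `word` with one character replaced by a key next to it.

    Neighbours never cross '|', which separates keyboard rows.
    """
    for i, char in enumerate(word):
        pos = layout.find(char)
        while pos != -1:
            for npos in (pos - 1, pos + 1):
                if 0 <= npos < len(layout) and layout[npos] != "|":
                    yield word[:i] + layout[npos] + word[i + 1:]
            pos = layout.find(char, pos + 1)


print("kitten" in set(keyboard_neighbours("kitteb")))  # True
```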

TRY: str = ''

List of all characters that can be used in words, in order of probability (most probable first), used on edits for suggestions (trying to add missing, or replace erroneous character).

Usage: Suggest.edits to pass to permutations.badchar and permutations.forgotchar. Note that, obscurely enough, Suggest also checks this option to decide whether a dash should be used when suggesting two words (e.g. for misspelled “foobar”, when it is decided that it is two words erroneously joined, suggest either returns only “foo bar”, or also “foo-bar”). Whether the dash is suggested is decided by the presence of "-" in TRY, or by the presence of Latin "a" (= “the language uses Latin script, and all such languages allow dashes between words”)… That’s how it is in Hunspell!

NOSUGGEST: Optional[str] = None

Flag to mark word/affix as “shouldn’t be suggested” (but considered correct on lookup), like obscenities.

Usage: on Suggest creation (to make list of dictionary words for ngram-check), and in Lookup.is_good_form (if the lookup is called from suggest, with allow_nosuggest=False)

KEEPCASE: Optional[str] = None

Flag to mark words which shouldn’t be considered correct unless their casing is exactly like in the dictionary.

Note: With CHECKSHARPS declaration, words with sharp s (ß) and KEEPCASE flag may be capitalized and uppercased, but uppercased forms of these words may not contain “ß”, only “SS”.

Usage: Suggest.suggestions to produce suggestions in proper case, Lookup.is_good_form.

REP: List[RepPattern]

Table of replacements for typical typos (like “shun”->”tion”) to try on suggest. See RepPattern for details of format.

Usage: Suggest.edits to pass to permutations.replchars. Note that the table is populated both from the .aff file’s REP directive and from ph: tags in the .dic file (see Word and read_dic for detailed explanations).

MAP: List[Set[str]]

Sets of “similar” chars to try in suggestions (like aáã: if they all exist in the language, replacing one with another would be a frequent typo). Several chars forming a single entry should be grouped by parentheses: MAP ß(ss) (German “sharp s” and the “ss” sequence are more or less the same).

Usage: Suggest.edits to pass to permutations.mapchars.
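
A simplified sketch of how MAP groups could be used: swap one occurrence of a group entry for another entry of the same group. `map_candidates` is a hypothetical helper that makes a single replacement per candidate; the real permutations.mapchars may combine several.

```python
def map_candidates(word, groups):
    """Yield variants of `word` where one entry of a MAP group is swapped for another."""
    for group in groups:
        for source in group:
            start = word.find(source)
            while start != -1:
                for target in group:
                    if target != source:
                        yield word[:start] + target + word[start + len(source):]
                start = word.find(source, start + 1)


# Groups as they would be parsed from "MAP aáã" and "MAP ß(ss)".
groups = [{"a", "á", "ã"}, {"ß", "ss"}]
print(sorted(map_candidates("strasse", groups)))
```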

NOSPLITSUGS: bool = False

Never try to suggest “this word should be split in two”. The LibreOffice sv_SE dictionary says “it is a must for Swedish”. (Interestingly enough, Hunspell’s tests don’t check this flag at all.)

Usage: Suggest.edits

PHONE: Optional[spylls.hunspell.data.aff.PhonetTable] = None

Table for metaphone transformations. Format is borrowed from aspell and described in its docs.

Note that dictionaries with a PHONE table are extremely rare: of all LibreOffice/Firefox dictionaries, only en_ZA (South Africa) contains it, though these are generic English metaphone rules and it is quite weird that they are not used more frequently.

Showcase (with LibreOffice dictionaries):

>>> misspelled = 'excersized'

>>> nometaphone = Dictionary.from_files('en/en_US')
>>> [*nometaphone.suggest(misspelled)]
['supersized']

>>> withmetaphone = Dictionary.from_files('en/en_ZA')
>>> [*withmetaphone.suggest(misspelled)]
['excerpted', 'exercised', 'excessive']

Usage: phonet_suggest

MAXCPDSUGS: int = 3

Limits number of compound suggestions.

Usage: Suggest.suggestions to limit number of edit-based suggestions which are compound words.

N-gram suggestions

MAXNGRAMSUGS: int = 4

Set max. number of n-gram suggestions. Value 0 switches off the n-gram suggestions (see also MAXDIFF).

Usage: Suggest.ngram_suggestions (to decide whether ngram_suggest should be called at all) and Suggest.suggestions (to limit amount of ngram-based suggestions).

MAXDIFF: int = -1

Set the similarity factor for the n-gram based suggestions:

5 = default value
0 = fewer n-gram suggestions, but at least one;
10 (max) = MAXNGRAMSUGS n-gram suggestions.

Usage: Suggest.ngram_suggestions where it is passed to ngram_suggest module, and used in detailed_affix_score.

ONLYMAXDIFF: bool = False

Remove all bad n-gram suggestions (default mode keeps one, see MAXDIFF).

Usage: Suggest.ngram_suggestions where it is passed to ngram_suggest module, and used in filter_guesses.

Stemming

PFX: Dict[str, List[Prefix]]

Dictionary of flag => prefixes with this flag. See Affix for detailed format and meaning description.

Usage:

in Suggest.ngram_suggestions to pass to ngram_suggest (and there to construct all possible forms).
also parsed into prefixes_index Trie, which then used in Lookup.deprefix

SFX: Dict[str, List[Suffix]]

Dictionary of flag => suffixes with this flag. See Affix for detailed format and meaning description.

Usage:

in Suggest.ngram_suggestions to pass to ngram_suggest (and there to construct all possible forms).
also parsed into suffixes_index Trie, which then used in Lookup.desuffix

NEEDAFFIX: Optional[str] = None

Flag saying “this stem can’t be used without affixes”. Can be also assigned to suffix/prefix, meaning “there should be other affixes besides this one”.

Usage: Lookup.is_good_form

CIRCUMFIX: Optional[str] = None

Suffixes signed with this flag may be on a word when this word also has a prefix with this flag, and vice versa.

Usage: Lookup.is_good_form

COMPLEXPREFIXES: bool = False

Whether stripping two prefixes is allowed (only one prefix by default). Random fun fact: of all currently available LibreOffice and Firefox dictionaries, only Firefox’s Zulu has this flag.

Usage: Lookup.deprefix

FULLSTRIP: bool = False

Whether affixes are allowed to remove the entire stem.

Not used in Spylls (e.g. Spylls doesn’t fail when this option is False and the entire word is removed, so Hunspell’s tests fullstrip.* are passing).

Compounding

BREAK: List[BreakPattern]

Defines break points for breaking words and checking word parts separately. See BreakPattern for format definition.

Usage: Lookup.break_word

COMPOUNDRULE: List[CompoundRule]

Rule of producing compound words, with regexp-like syntax. See CompoundRule for format definition.

Usage: Lookup.compounds_by_rules

COMPOUNDMIN: int = 3

Minimum length of words used for compounding.

Usage: Lookup.compounds_by_rules & Lookup.compounds_by_flags

COMPOUNDWORDMAX: Optional[int] = None

Set maximum word count in a compound word.

Usage: Lookup.compounds_by_rules & Lookup.compounds_by_flags

COMPOUNDFLAG: Optional[str] = None

Forms with this flag (marking either stem, or one of affixes) can be part of the compound. Note that triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND is more precise way of marking (“this word can be at the beginning of compound”).

Usage: Lookup.is_good_form to compare form’s compound position (or lack thereof) with the presence of the flag.

COMPOUNDBEGIN: Optional[str] = None

Forms with this flag (marking either stem, or one of affixes) can be at the beginning of the compound. Part of the triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND; alternative to the triple is just COMPOUNDFLAG (“this form can be at any place in compound”).

Usage: Lookup.is_good_form to compare form’s compound position (or lack thereof) with the presence of the flag.

COMPOUNDMIDDLE: Optional[str] = None

Forms with this flag (marking either stem, or one of affixes) can be in the middle of the compound (not the last part, and not the first). Part of the triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND; alternative to the triple is just COMPOUNDFLAG (“this form can be at any place in compound”).

Usage: Lookup.is_good_form to compare form’s compound position (or lack thereof) with the presence of the flag.

COMPOUNDEND: Optional[str] = None

Forms with this flag (marking either stem, or one of affixes) can be at the end of the compound. Part of the triple of flags COMPOUNDBEGIN, COMPOUNDMIDDLE, COMPOUNDEND; alternative to the triple is just COMPOUNDFLAG (“this form can be at any place in compound”).

Usage: Lookup.is_good_form to compare form’s compound position (or lack thereof) with the presence of the flag.

ONLYINCOMPOUND: Optional[str] = None

Forms with this flag (marking either stem, or one of affixes) can only be part of the compound word, and never standalone.

Usage: Lookup.is_good_form to compare form’s compound position (or lack thereof) with the presence of the flag. Also in Suggest to produce list of the words suitable for ngram search.

COMPOUNDPERMITFLAG: Optional[str] = None

Prefixes are allowed at the beginning of compounds, suffixes are allowed at the end of compounds by default. Affixes with COMPOUNDPERMITFLAG may be inside of compounds.

Usage: Lookup.compounds_by_flags to make list of flags passed to Lookup.produce_affix_forms (for this part of the compound, try find affixed spellings, you can use affixes with this flag).

COMPOUNDFORBIDFLAG: Optional[str] = None

Prefixes are allowed at the beginning of compounds, suffixes are allowed at the end of compounds by default. Suffixes with COMPOUNDFORBIDFLAG may not be even at the end, and prefixes with this flag may not be even at the beginning.

Usage: Lookup.compounds_by_flags to make list of flags passed to Lookup.produce_affix_forms (for this part of the compound, try find affixed spellings, you can use affixes with this flag).

FORCEUCASE: Optional[str] = None

The last word part of a compound with the FORCEUCASE flag forces capitalization of the whole compound word. E.g. the Dutch word “straat” (street) with the FORCEUCASE flag will be allowed only in capitalized compound forms, according to the Dutch spelling rules for proper names.

Usage: Lookup.is_bad_compound and Suggest.suggestions (if this flag is present in the .aff file, we check that maybe just capitalization of misspelled word would make it right).

CHECKCOMPOUNDCASE: bool = False

Forbid upper case characters at word boundaries in compounds.

Usage: Lookup.is_bad_compound

CHECKCOMPOUNDDUP: bool = False

Forbid word duplication in compounds (e.g. “foofoo”).

Usage: Lookup.is_bad_compound

CHECKCOMPOUNDREP: bool = False

Forbid compounding, if the (usually bad) compound word may be a non-compound word if some replacement by REP table (frequent misspellings) is made. Useful for languages with “compound friendly” orthography.

Usage: Lookup.is_bad_compound

CHECKCOMPOUNDTRIPLE: bool = False

Forbid compounding, if compound word contains triple repeating letters (e.g. foo|ox or xo|oof).

Usage: Lookup.is_bad_compound

CHECKCOMPOUNDPATTERN: List[CompoundPattern]

List of patterns which forbid compound words when pair of words in compound matches this pattern. See CompoundPattern for explanation about format.

Usage: Lookup.is_bad_compound

SIMPLIFIEDTRIPLE: bool = False

Allow simplified 2-letter forms of the compounds forbidden by CHECKCOMPOUNDTRIPLE. Example: “Schiff”+”fahrt” -> “Schiffahrt”

Usage: Lookup.compounds_by_flags, after the main splitting cycle, we also try the hypothesis that if the letter on the current boundary is duplicated, we should triplicate it.

COMPOUNDSYLLABLE: Optional[Tuple[int, str]] = None

Needed for special compounding rules in Hungarian.

Not implemented in Spylls

COMPOUNDMORESUFFIXES: bool = False

Allow twofold suffixes within compounds.

Not used in Spylls and doesn’t have tests in Hunspell

COMPOUNDROOT: Optional[str] = None

Flag that signs the compounds in the dictionary (Now it is used only in the Hungarian language specific code).

Not used in Spylls.

Pre/post-processing

ICONV: Optional[spylls.hunspell.data.aff.ConvTable] = None

Input conversion table (what to do with word before checking if it is valid). See ConvTable for format description.

Usage: Lookup.__call__

OCONV: Optional[spylls.hunspell.data.aff.ConvTable] = None

Output conversion table (what to do with suggestion before returning it to the user). See ConvTable for format description.

Usage: Suggest.suggestions

Aliasing

AF: Dict[str, Set[str]]

Table of flag set aliases. Defined in .aff-file this way:

AF 3
AF ABC
AF BCD
AF DE

This means the set of flags “ABC” has the alias “1”, “BCD” has the alias “2”, and “DE” has the alias “3” (an alias is just the sequential number of the row in the table). Now, in the .dic file, foo/1 would be equivalent to foo/ABC, meaning the stem foo has flags A, B, C.

Usage: Stored in readers.aff.Context to decode flags on reading .aff and .dic files.
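
The alias lookup described above can be sketched as follows. Both function names are hypothetical, the default “short” flag format is assumed, and the fallback to literal flags for an unknown alias is a simplification made for this example.

```python
def read_flag_aliases(rows):
    """Build {alias number: set of flags} from the rows after the 'AF <count>' header.

    Assumes the default 'short' flag format (one character per flag).
    """
    return {str(i): set(flags) for i, flags in enumerate(rows, start=1)}


def decode_flags(raw, aliases):
    """Resolve the flag part of a .dic entry, which may be an alias number."""
    # Simplification: anything that is not a known alias is read as literal flags.
    return aliases.get(raw, set(raw))


aliases = read_flag_aliases(["ABC", "BCD", "DE"])
print(sorted(decode_flags("1", aliases)))  # ['A', 'B', 'C']
```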

AM: Dict[str, Set[str]]

Table of word data aliases. Logic of aliasing is the same as for AF.

Usage: read_dic

Other/Ignored

WARN: Optional[str] = None

This flag is for rare words, which are also often spelling mistakes. With command-line flag -r, Hunspell will warn about words with this flag in input text.

Not implemented in Spylls

FORBIDWARN: bool = False

Sets if words with WARN flag should be considered as misspellings (errors, not warnings).

Not used in any known dictionary, and not implemented in Spylls (even in aff-reader).

SYLLABLENUM: Optional[str] = None

Needed for special compounding rules in Hungarian. (The previous phrase is the only docs Hunspell provides :))

Not used in Spylls.

SUBSTANDARD: Optional[str] = None

Flag signs affix rules and dictionary words (allomorphs) not used in morphological generation and root words removed from suggestion.

Not implemented in Spylls

Some other directives that are in docs, but are deprecated/not used (and never implemented by Spylls):

LEMMA_PRESENT

Derived attributes

These attributes are calculated after Aff reading and initialization.

casing: spylls.hunspell.algo.capitalization.Casing

“Casing” class (defining how the words in this language are lowercased/uppercased). See Casing for details. In Aff, basically, it is

GermanCasing if CHECKSHARPS is True,
TurkicCasing if LANG is one of Turkic languages (Turkish, Azerbaijani, Crimean Tatar),
regular Casing otherwise.

suffixes_index: spylls.hunspell.algo.trie.Trie: Trie structure for fast selecting of all possible suffixes for some word, created from SFX

prefixes_index: spylls.hunspell.algo.trie.Trie: Trie structure for fast selecting all possible prefixes for some word, created from PFX

`Prefix` and `Suffix`

class Affix(flag, crossproduct, strip, add, condition, flags=<factory>)

Common base for Prefix and Suffix.

Affixes are stored in table looking this way:

SFX X Y 1
SFX X   0 able/CD . ds:able

Meaning of the first line (table header):

Suffix (can be PFX for prefix)
…designated by flag X
…supports cross-product (Y or N, “cross-product” means form with this suffix also allowed to have prefixes)
…and there is 1 of them below

Meaning of the table row:

Suffix X (should be same as table header)
…when applies, doesn’t change the stem (0 = “”, but it can be “…removes some part at the end of the stem”)
…when applies, adds “able” to the stem
…and the whole form will have also flags “C”, “D”
…condition of application is “any stem” (. – read it as regexp’s “any char”)
…and the whole form would have data tags (morphology) ds:able

Then, if in the dictionary we have drink/X (can have the suffix marked by X), the whole thing means “‘drinkable’ is a valid word form, has additional flags ‘C’, ‘D’ and some morphological info”.

Another example (from en_US.aff):

SFX N Y 3
SFX N   e     ion        e
SFX N   y     ication    y
SFX N   0     en         [^ey]

This defines a suffix designated by flag N, cross-productable (Y in the header), with 3 forms:

removes “e” and adds “ion” for words ending with “e” (animate => animation)
removes “y” and adds “ication” for words ending with “y” (amplify => amplification)
removes nothing and adds “en” for words ending with neither (befall => befallen)
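
The three rows of the SFX N table above can be sketched in code. `SuffixRule` is a hypothetical stand-in for a table row, not the Spylls Suffix class; it treats the condition as a regexp anchored at the end of the stem.

```python
import re
from dataclasses import dataclass


@dataclass
class SuffixRule:
    """Hypothetical stand-in for one row of an SFX table."""
    strip: str
    add: str
    condition: str

    def apply(self, stem):
        """Return the suffixed form, or None if the condition does not match."""
        if not re.search(f"(?:{self.condition})$", stem):
            return None
        base = stem[:len(stem) - len(self.strip)] if self.strip else stem
        return base + self.add


# Rows of the "SFX N" table ("0" in the strip column means "nothing").
rules = [SuffixRule("e", "ion", "e"),
         SuffixRule("y", "ication", "y"),
         SuffixRule("", "en", "[^ey]")]


def forms(stem):
    """Collect every form that some row of the table produces for `stem`."""
    return [f for f in (rule.apply(stem) for rule in rules) if f is not None]


print(forms("animate"))  # ['animation']
print(forms("amplify"))  # ['amplification']
print(forms("befall"))   # ['befallen']
```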

(TBH, I don’t have the slightest idea why the third option is grouped with the two previous ones… Probably because dictionary building is a semi-automated process of “packing” word lists into dic+aff, and the “affixes” actually don’t need to bear any grammatical sense.)

flag: str: Flag this affix marked with. Note that several affixes can have same flag (and in this case, which of them is relevant for the word, is decided by its condition)

crossproduct: bool: Whether this affix is compatible with opposite affix (e.g. if the word has both suffix and prefix, both of them should have crossproduct=True)

strip: str: What is stripped from the stem when the affix is applied

add: str: What is added when the affix is applied

condition: str: Condition against which stem should be checked to understand whether this affix is relevant

flags: Set[str]: Flags this affix has

class Prefix(flag, crossproduct, strip, add, condition, flags=<factory>): Affix at the beginning of the word, stored in Aff.PFX directive.

class Suffix(flag, crossproduct, strip, add, condition, flags=<factory>): Affix at the end of the word, stored in Aff.SFX directive.

Helper pattern-alike classes

These classes wrap several types of somewhat pattern-alike objects that can appear in a *.aff file, “compiling” them into something applicable, much like Python’s re module compiles regexps.

class BreakPattern(pattern)

Contents of the Aff.BREAK directive, pattern for splitting the word, compiled to regexp.

Directives are stored this way:

BREAK 3
BREAK -
BREAK ^-
BREAK -$

(That’s, by the way, the default value of BREAK.) It means that Hunspell, while checking a word like “left-right”, will check “left” and “right” separately; it will also ignore “-” at the beginning and end of the word (second and third lines). Note that BREAK - without any special chars will NOT ignore “-” at the beginning/end.
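
A minimal sketch of compiling the three default BREAK values into regexps, under the assumption that a plain pattern needs at least one character on each side. `compile_break` and `split_candidates` are hypothetical names; the actual BreakPattern regexp may be built differently.

```python
import re


def compile_break(pattern):
    """Turn a BREAK value into a regexp; only '^' and '$' anchors are special."""
    if pattern.startswith("^"):
        return re.compile("^(" + re.escape(pattern[1:]) + ")")
    if pattern.endswith("$"):
        return re.compile("(" + re.escape(pattern[:-1]) + ")$")
    # A plain pattern only matches with at least one character on each side.
    return re.compile("(?<=.)(" + re.escape(pattern) + ")(?=.)")


def split_candidates(word, patterns):
    """Yield (left, right) pairs obtained by cutting `word` at each match."""
    for regexp in patterns:
        for m in regexp.finditer(word):
            yield word[:m.start(1)], word[m.end(1):]


patterns = [compile_break(p) for p in ["-", "^-", "-$"]]
print(list(split_candidates("left-right", patterns)))  # [('left', 'right')]
print(list(split_candidates("-left", patterns)))       # [('', 'left')]
```

With only the plain "-" pattern, "-left" produces no split at all, which matches the note about BREAK - not ignoring a leading or trailing dash.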

class Ignore(chars): Contents of the Aff.IGNORE directive, chars to ignore on lookup/suggest, compiled with str.maketrans.
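
A short sketch of what “compiled with str.maketrans” amounts to. `make_ignore` is a hypothetical helper, and the three Arabic harakat code points are only an example value for IGNORE.

```python
def make_ignore(chars):
    """Build a translation table that deletes every character in `chars`."""
    return str.maketrans("", "", chars)


# Example IGNORE value: Arabic fatha, damma and kasra (harakat marks).
table = make_ignore("\u064E\u064F\u0650")

# The word with three fatha marks is reduced to its bare consonants.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(vocalized.translate(table) == "\u0643\u062A\u0628")  # True
```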

class RepPattern(pattern, replacement)

Contents of the Aff.REP directive, pair of (frequent typo, its replacement). Typo pattern compiled to regexp.

Example from Hunspell’s docs, showing all the features:

REP 5
REP f ph
REP ph f
REP tion$ shun
REP ^cooccurr co-occurr
REP ^alot$ a_lot

This means:

table of 5 replacements (first line):
try to see if “f -> ph” produces good word,
try “ph -> f”,
at the end of the word try “tion -> shun”,
at the beginning of the word try “cooccurr -> co-occurr”,
and try to replace the whole word “alot” with “a lot” (_ stands for space).
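
The table above can be sketched as a list of compiled regexps. `compile_rep` and `rep_candidates` are hypothetical helpers; each candidate replaces a single occurrence, which is an assumption about how the table is applied.

```python
import re


def compile_rep(pattern, replacement):
    """Return (regexp, replacement) for one REP row.

    '^' and '$' in the pattern are kept as regexp anchors; '_' in the
    replacement stands for a space.
    """
    start = "^" if pattern.startswith("^") else ""
    end = "$" if pattern.endswith("$") else ""
    literal = pattern[len(start):len(pattern) - len(end)]
    return re.compile(start + re.escape(literal) + end), replacement.replace("_", " ")


def rep_candidates(word, table):
    """Yield every variant of `word` with one occurrence of a typo replaced."""
    for regexp, replacement in table:
        for m in regexp.finditer(word):
            yield word[:m.start()] + replacement + word[m.end():]


table = [compile_rep(*row) for row in
         [("f", "ph"), ("ph", "f"), ("tion$", "shun"),
          ("^cooccurr", "co-occurr"), ("^alot$", "a_lot")]]

print(list(rep_candidates("alot", table)))  # ['a lot']
print(list(rep_candidates("fone", table)))  # ['phone']
```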

class ConvTable(pairs)

Table of conversions that should be applied on pre- or post-processing, stored in Aff.ICONV and Aff.OCONV. Format is as follows (as far as I can guess from code and tests, documentation is very sparse):

ICONV <number of entries>
ICONV <pattern> <replacement>

Typically, pattern and replacement are just simple strings, used mostly for replacing typographics (like trigraphs and “nice” apostrophes) before/after processing.

But if there is a _ in the pattern, it is treated as regexp ^ if at the beginning of the pattern, regexp $ if at the end, and just ignored otherwise. This seems to be a “hidden” feature, demonstrated by the nepali.* set of tests in the Hunspell distribution.

Conversion rules are applied as follows:

for each position in the word
…find any matching rules
…choose the one with the longest pattern
…apply it, and shift to the position after the replaced part (so there can’t be recursive application of several rules on top of each other).
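
The steps above can be sketched as a single left-to-right pass. `convert` is a hypothetical function that works on plain string pairs and ignores the “_” anchor feature; the pairs in the example are made up.

```python
def convert(word, pairs):
    """Apply conversion pairs left to right, preferring the longest pattern."""
    result = []
    pos = 0
    while pos < len(word):
        matches = [(pattern, replacement) for pattern, replacement in pairs
                   if pattern and word.startswith(pattern, pos)]
        if matches:
            pattern, replacement = max(matches, key=lambda pair: len(pair[0]))
            result.append(replacement)
            pos += len(pattern)  # continue after the replaced span
        else:
            result.append(word[pos])
            pos += 1
    return "".join(result)


pairs = [("’", "'"), ("a", "b"), ("ab", "c")]
print(convert("it’s", pairs))  # it's
print(convert("abba", pairs))  # cbb
```

In the second call “ab” wins over “a” at position 0, and the “b” produced from the final “a” is not converted again.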

class CompoundPattern(left, right, replacement=None)

Pattern to check whether compound word is correct, stored in Aff.CHECKCOMPOUNDPATTERN directive. Format of the pattern:

endchars[/flag] beginchars[/flag] [replacement]

The pattern matches (telling that this compound is not allowed) if some pair of the words inside compound matches conditions:

first word ends with endchars (and has flags from the first element, if they are specified)
second word starts with beginchars (and has flags from the second element, if they are specified)

endchars can be 0, specifying “word has zero affixes”.

replacement complicates things, allowing to specify “…but this string at the border of the words, should be unpacked into this endchars and that beginchars, but make the compound allowed”… It complicates algorithm significantly, and no known dictionary uses this feature, so replacement is just ignored by Spylls.

class CompoundRule(text)

Regexp-alike rule for generating compound words, content of Aff.COMPOUNDRULE directive. It is a way of specifying compounding alternative (and unrelated) to Aff.COMPOUNDFLAG and similar. Rules look this way:

COMPOUNDRULE A*B?CD

…reading: compound word might consist of any number of words with flag A, then 0 or 1 words with flag B, then words with flags C and D.

en_US.aff uses this feature to specify spelling of numerals. In .aff-file, it has

COMPOUNDRULE 2
COMPOUNDRULE n*1t
COMPOUNDRULE n*mp

And, in .dic-file:

0/nm
0th/pt
1/n1
1st/p
1th/tc
2/nm
2nd/p
2th/tc
# ...and so on...

Which makes “111th” valid (one hundred eleventh): “1” with flag “n”, “1” with flag “1” and “1th” with flag “t” match rule n*1t. But “121th” is not valid (it should be “121st”).
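
The numeral example can be checked with a small sketch that treats a rule as a regexp over one flag per word. `rule_matches` and `is_compound` are hypothetical; one-character flags are assumed, and the word is assumed to be already split into parts.

```python
import re
from itertools import product


def rule_matches(rule, flag_sets):
    """Check whether a sequence of words (given by their flag sets) fits the rule.

    Assumes one-character flags; only '*' and '?' are treated as operators.
    """
    regexp = re.compile("".join(c if c in "*?" else re.escape(c) for c in rule))
    # Try every way of picking one flag per word.
    return any(regexp.fullmatch("".join(choice)) for choice in product(*flag_sets))


rules = ["n*1t", "n*mp"]
# Flags of a few .dic entries from the example.
flags = {"1": {"n", "1"}, "2": {"n", "m"}, "1th": {"t", "c"}, "1st": {"p"}}


def is_compound(parts):
    """True if the parts, in this order, satisfy at least one rule."""
    return any(rule_matches(rule, [flags[p] for p in parts]) for rule in rules)


print(is_compound(["1", "1", "1th"]))  # True  (111th)
print(is_compound(["1", "2", "1th"]))  # False (121th)
print(is_compound(["1", "2", "1st"]))  # True  (121st)
```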

class PhonetTable(table)

Represents table of metaphone transformations stored in Aff.PHONE. Format is borrowed from aspell and described in its docs.

Basically, each line of the table specifies a “pattern”/“replacement” pair. Replacement is a literal string (with “_” meaning “empty string”), and pattern is … complicated. Spylls, as of now, parses rules fully (see the parse_rule method in the source), but doesn’t implement all the algorithm’s details (like rule prioritizing, the concept of a “follow-up rule” etc.)

It is enough to pass Hunspell’s (small) test for PHONE implementation, but definitely more naive than expected. But as it is marginal feature (and there are enough metaphone implementations in Python), we aren’t (yet?) bothered by this fact.
