data.dic
: contents of .dic file
The module represents data from Hunspell’s *.dic
file.
This text file has the following format:
124 # first line: number of entries
# pseudo-comments are marked with "#"
# Each entry has form:
cat/ABC ph:kat
See Word for an explanation of the fields.
The meaning of flags, as well as the file encoding and other reading settings, is defined by Aff.
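As a rough illustration of the format above, the following sketch splits *.dic text into the declared entry count and raw entries. The function name split_dic_text is hypothetical and this is not how read_dic works: it ignores encoding, flag formats, aliases, and stems containing spaces or escaped slashes, all of which depend on Aff.

```python
from typing import List, Tuple

RawEntry = Tuple[str, str, List[str]]  # (stem, raw flags, raw data tags)


def split_dic_text(text: str) -> Tuple[int, List[RawEntry]]:
    """Hypothetical helper: split *.dic text into the declared count and raw entries."""
    lines = text.splitlines()
    # The first line starts with the declared number of entries.
    declared = int(lines[0].split()[0])
    entries: List[RawEntry] = []
    for line in lines[1:]:
        line = line.strip()
        # Skip blank lines and pseudo-comments.
        if not line or line.startswith("#"):
            continue
        word, *tags = line.split()
        # Flags, if any, follow the first "/".
        stem, _, flags = word.partition("/")
        entries.append((stem, flags, tags))
    return declared, entries


sample = """2 # first line: number of entries
# pseudo-comment
cat/ABC ph:kat
dog
"""
declared, entries = split_dic_text(sample)
```

Here the raw flag string and data tags are left unparsed on purpose, since their interpretation is defined by the .aff file.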
Dic contains all the entries from the *.dic file (converted to Word) in a linear list, and also provides some indexes and utilities for convenience.
Dic is read by read_dic.
Dic
: list of entries
-
class Dic(words)
Represents the list of words from a *.dic file. Each word is stored as an instance of Word.
Besides the flat list of all words, it also creates word indexes on initialization; see index and lowercase_index.
Note that there can be (and typically are) several entries in the dictionary with the same stem but different flags and/or data tags, which is why index values are lists of words. For example, in an English dictionary, “spell” (verb, related to reading/writing) and “spell” (noun, magical formula) may be different entries, defining different possible sets of suffixes and morphological properties.
Typically, a spylls user shouldn’t create an instance of this class themselves; it is created when the whole dictionary is read:
>>> dictionary = Dictionary.from_files('dictionaries/en_US')
>>> dictionary.dic
Dictionary(... 62119 words ...)
>>> dictionary.dic.homonyms('spell')
[Word(spell /G,R,S,J,Z,D)]
Data contents:
Querying (used by lookup and suggest):
-
homonyms(stem, *, ignorecase=False)
Returns all Word instances with the same stem.
- Parameters:
stem (str) – Stem to search
ignorecase (bool) – If passed, the stems are searched in the lowercased index (and the stem itself is assumed to be lowercased). Used by lookup to find a match for an uppercased word when the stem has complex capitalization (find “McDonalds” by “MCDONALDS”)
- Return type:
List[Word]
def homonyms(self, stem: str, *, ignorecase: bool = False) -> List[Word]:
    """
    Returns all :class:`Word` instances with the same stem.

    Args:
        stem: Stem to search
        ignorecase: If passed, the stems are searched in the lowercased index (and the ``stem`` itself
                    assumed to be lowercased). Used by lookup to find a correspondence for uppercased
                    word, if the stem has complex capitalization (find "McDonalds" by "MCDONALDS")
    """
    if ignorecase:
        return self.lowercase_index.get(stem, [])
    return self.index.get(stem, [])
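To make the two indexes concrete, here is a minimal stand-alone sketch. Entry is a hypothetical stand-in for Word, the flags are invented, and lowercase forms are computed with plain str.lower() for every stem, which is an assumption: spylls pre-computes them with Casing.lower and may index a different set of stems.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for Word: just a stem and its flags."""
    stem: str
    flags: FrozenSet[str]


index: Dict[str, List[Entry]] = defaultdict(list)
lowercase_index: Dict[str, List[Entry]] = defaultdict(list)

for entry in [
    Entry("spell", frozenset("GD")),  # e.g. the verb
    Entry("spell", frozenset("S")),   # e.g. the noun
    Entry("McDonalds", frozenset()),
]:
    # Several entries may share a stem, so every index value is a list.
    index[entry.stem].append(entry)
    # Assumption: str.lower() stands in for Casing.lower here.
    lowercase_index[entry.stem.lower()].append(entry)


def homonyms(stem: str, *, ignorecase: bool = False) -> List[Entry]:
    if ignorecase:
        return lowercase_index.get(stem, [])
    return index.get(stem, [])


spells = homonyms("spell")                               # both "spell" entries
exact = homonyms("MCDONALDS")                            # no exact stem
folded = homonyms("MCDONALDS".lower(), ignorecase=True)  # finds "McDonalds"
```

Note that the caller lowercases the stem before passing ignorecase=True; the method itself does not.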
-
has_flag(stem, flag, *, for_all=False)
Checks whether any/all of the homonyms have the specified flag. The lookup algorithm frequently needs to check something like “…but if there is ANY dictionary entry with this stem and ‘forbidden’ flag…”, or “…but if ALL dictionary entries with this stem are marked as ‘forbidden’…”
- Parameters:
stem (str) – Stem present in dictionary
flag (str) – Flag to test
for_all (bool) – If True, checks if all homonyms have this flag; if False, checks if at least one does.
- Return type:
bool
def has_flag(self, stem: str, flag: str, *, for_all: bool = False) -> bool:
    """
    If any/all of the homonyms have specified flag. It is frequently necessary in lookup algo to
    check something like "...but if there is ANY dictionary entry with this stem and 'forbidden' flag...",
    or "...but if ALL dictionary entries with this stem marked as 'forbidden'..."

    Args:
        stem: Stem present in dictionary
        flag: Flag to test
        for_all: If ``True``, checks if **all** homonyms have this flag, if ``False``, checks if at
                 least one.
    """
    homonyms = self.homonyms(stem)
    if not homonyms:
        return False
    if for_all:
        return all(flag in homonym.flags for homonym in homonyms)
    return any(flag in homonym.flags for homonym in homonyms)
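The early return for a missing stem matters mostly for for_all=True, because Python's all() over an empty sequence is True. The sketch below replays the same logic over a hypothetical index that maps stems straight to flag sets; the flag "!" is an invented "forbidden" marker, not something taken from a real .aff file.

```python
from typing import Dict, List, Set

# Hypothetical index: stem -> flag sets of its homonyms.
index: Dict[str, List[Set[str]]] = {
    "foo": [{"!"}, {"S"}],       # only one homonym is forbidden
    "bar": [{"!"}, {"!", "S"}],  # every homonym is forbidden
}


def has_flag(stem: str, flag: str, *, for_all: bool = False) -> bool:
    homonyms = index.get(stem, [])
    if not homonyms:
        # Without this guard, all() over an empty list would report True.
        return False
    check = all if for_all else any
    return check(flag in flags for flags in homonyms)


any_foo = has_flag("foo", "!")                    # True
all_foo = has_flag("foo", "!", for_all=True)      # False
all_bar = has_flag("bar", "!", for_all=True)      # True
missing = has_flag("nope", "!", for_all=True)     # False, thanks to the guard
```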
Dictionary creation
-
append(word, *, lower)
Used only by read_dic to put the word into the dictionary.
- Parameters:
word (spylls.hunspell.data.dic.Word) – The word instance, already pre-populated
lower (List[str]) – List of all the lowercase forms of the word stem. They are pre-calculated on dictionary reading, because proper lowercasing requires casing context and may produce several lowercased variants (for German). See Casing.lower for details.
def append(self, word: Word, *, lower: List[str]):
    """
    Used only by :meth:`read_dic <spylls.hunspell.readers.dic.read_dic>` to put the word into the
    dictionary.

    Args:
        word: The word instance, already pre-populated
        lower: List of all the lowercase forms of word stems. They are pre-calculated on dictionary
               reading, because proper lowercasing requires casing context; and may produce
               several lowercased variants (for German).
               See :meth:`Casing.lower <spylls.hunspell.algo.capitalization.Casing.lower>` for details.
    """
    self.words.append(word)
    self.index[word.stem].append(word)
    for lword in lower:
        self.lowercase_index[lword].append(word)
-
Word
: dictionary entry
-
class Word(stem, flags, data, alt_spellings, captype)
One word (stem) of a .dic file.
Each entry in the source contains something like:
foo/ABC ph:phoo is:bar
Where foo is the stem itself, ABC is the word flags (the meaning and format of flags are defined by the *.aff file), and ph:phoo is:bar are additional data tags (ph is the tag and phoo is the value). Both flags and tags can be absent.
Both flags and data tags can also be represented by numeric aliases defined in the .aff file (see Aff.AF and Aff.AM); this is handled at the reading stage, see the read_dic docs for details.
The meaning of data tags is discussed in the hunspell docs. Spylls, for now, provides special handling only for the ph: field. The code probably means “phonetic”, but the idea is that this field contains “alternative spellings” (or, rather, common misspellings) of the word. The simplest example is
which ph:wich
This specifies that the dictionary word which is frequently misspelled as wich, and this would be considered in Suggest. More complicated forms:
pretty ph:prity*
happy ph:hepi->happi
The first one means “any prit inside a word should be replaced by prett” (chomping off the last letter of both the pattern and the stem); the second means “any hepi should be replaced by happi, but we store this fact with the stem happy” (think “hepiness -> happiness”).
The first (simple) form is stored in alt_spellings and used in ngram_suggest; the more complex forms are processed at the reading stage and are actually stored in Aff.REP.
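The three ph: forms described above can be told apart mechanically. The sketch below is an illustration under stated assumptions, not the read_dic code: split_ph_values is a hypothetical name, and the returned pairs only model what would end up in Aff.REP. Applying the "chomp the last letter of both" rule to prity* and pretty yields the pair prit and prett.

```python
from typing import List, Tuple


def split_ph_values(stem: str, values: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Hypothetical helper: sort ph: values into alt spellings and replacement pairs."""
    alt_spellings: List[str] = []
    replacements: List[Tuple[str, str]] = []
    for value in values:
        if "->" in value:
            # "hepi->happi": an explicit replacement stored alongside the stem.
            wrong, right = value.split("->", 1)
            replacements.append((wrong, right))
        elif value.endswith("*"):
            # "prity*": drop the "*", then the last letter of both pattern and stem.
            replacements.append((value[:-2], stem[:-1]))
        else:
            # "wich": a plain alternative spelling.
            alt_spellings.append(value)
    return alt_spellings, replacements


simple = split_ph_values("which", ["wich"])
starred = split_ph_values("pretty", ["prity*"])
arrow = split_ph_values("happy", ["hepi->happi"])
```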
Attributes from source data:
-
stem
: str Word stem
-
flags
: Set[str] Flags of the word, parsed depending on aff-file settings. ABCD might be parsed into {"A", "B", "C", "D"} (default flag format, “short”), or {"AB", "CD"} (“long” flag format)
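A minimal sketch of the two flag formats mentioned here, assuming the format name is passed in explicitly (in spylls it comes from the .aff file). parse_flags is a hypothetical name, and Hunspell's other flag formats (numeric and UTF-8) are deliberately left out.

```python
from typing import Set


def parse_flags(raw: str, flag_format: str = "short") -> Set[str]:
    """Hypothetical helper: split a raw flag string into individual flags."""
    if flag_format == "short":
        # Every character is a separate flag.
        return set(raw)
    if flag_format == "long":
        # Every pair of characters is a separate flag.
        return {raw[i:i + 2] for i in range(0, len(raw), 2)}
    raise ValueError(f"unsupported flag format in this sketch: {flag_format}")


short_flags = parse_flags("ABCD")
long_flags = parse_flags("ABCD", "long")
```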
-
data
: Dict[str, List[str]] Raw values of data tags. Each tag can be repeated several times, like witch ph:wich ph:which, which is why dictionary values are lists
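As an illustration of why the values are lists, this sketch groups raw tag strings by tag name. group_tags is a hypothetical helper and assumes each tag has the tag:value shape shown in the examples.

```python
from collections import defaultdict
from typing import Dict, List


def group_tags(tags: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: collect repeated data tags under one key."""
    data: Dict[str, List[str]] = defaultdict(list)
    for tag in tags:
        # Split on the first ":" only, so values may contain colons.
        name, _, value = tag.partition(":")
        data[name].append(value)
    return dict(data)


witch_data = group_tags(["ph:wich", "ph:which"])
```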
Attributes calculated on dictionary reading:
-
alt_spellings
: List[str] List of alternative word spellings, defined with the ph: data tag and used by ngram_suggest. Not everything specified with ph: is stored here; see the explanations in the class docs.
-
captype
: CapType One of capitalization.Type (no capitalization, initial letter capitalized, all letters capitalized, or mixed), determined on dictionary reading and useful during lookup.
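The four categories listed above could be distinguished roughly as follows. This is only a sketch: the Cap enum and guess_cap function are hypothetical names, the real capitalization.Type may define its members differently, and locale-specific casing rules are ignored.

```python
from enum import Enum


class Cap(Enum):
    """Hypothetical stand-in for capitalization.Type."""
    NO = "no capitalization"
    INIT = "initial letter capitalized"
    ALL = "all letters capitalized"
    MIXED = "mixed"


def guess_cap(word: str) -> Cap:
    if word == word.lower():
        return Cap.NO
    if word == word.upper():
        return Cap.ALL
    if word[0].isupper() and word[1:] == word[1:].lower():
        return Cap.INIT
    return Cap.MIXED


caps = [guess_cap(w) for w in ("cat", "Cat", "CAT", "McDonalds")]
```

Computing this once while reading the dictionary avoids repeating the same string checks on every lookup.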
-