data.dic: contents of .dic file
The module represents data from Hunspell’s *.dic file.
This text file has the following format:
124 # first line: number of entries
# pseudo-comments are marked with "#"
# Each entry has form:
cat/ABC ph:kat
See Word for an explanation of the fields.
The meaning of flags, as well as file encoding and other reading settings, are defined
by Aff.
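As a rough illustration of the format above, the sketch below splits a small .dic text into raw entries. It is not the spylls reader (that is read_dic, driven by Aff settings): the name split_dic_lines is hypothetical, the flag string is left unparsed because its meaning depends on Aff, and stems containing spaces or escaped slashes are deliberately ignored.

```python
from typing import List, Tuple

RawEntry = Tuple[str, str, List[str]]  # (stem, raw flag string, raw data tags)


def split_dic_lines(text: str) -> Tuple[int, List[RawEntry]]:
    """Hypothetical helper: split .dic text into the declared count and raw entries."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # First line: declared number of entries; anything after the number is ignored here.
    declared = int(lines[0].split()[0])
    entries = []
    for line in lines[1:]:
        if line.startswith('#'):
            continue  # pseudo-comment line, as in the example above
        head, *tags = line.split()
        # Everything before the first "/" is the stem, the rest is the flag string.
        stem, _, flags = head.partition('/')
        entries.append((stem, flags, tags))
    return declared, entries


sample = """\
2 # first line: number of entries
# pseudo-comments are marked with "#"
cat/ABC ph:kat
dog
"""
declared, entries = split_dic_lines(sample)
```

An entry without flags or tags, such as dog here, simply yields an empty flag string and an empty tag list.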
Dic contains all the entries (converted to Word) from the *.dic file in a flat
list, and also provides some indexes and utilities for convenience.
Dic is read by read_dic.
Dic: list of entries
-
class Dic(words)
Represents the list of words from a *.dic file. Each word is stored as an instance of Word.

Besides the flat list of all words, initialization also creates word indexes, see index and lowercase_index.

Note that there can be (and typically are) several entries in the dictionary with the same stem but different flags and/or data tags; that is why index values are lists of words. For example, in an English dictionary “spell” (verb, related to reading/writing) and “spell” (noun, magical formula) may be different entries, defining different possible sets of suffixes and morphological properties.

Typically, a spylls user shouldn’t create an instance of this class themselves; it is created when the whole dictionary is read:

>>> dictionary = Dictionary.from_files('dictionaries/en_US')
>>> dictionary.dic
Dictionary(... 62119 words ...)
>>> dictionary.dic.homonyms('spell')
[Word(spell /G,R,S,J,Z,D)]
Data contents:
Querying (used by lookup and suggest):
-
homonyms(stem, *, ignorecase=False)
Returns all Word instances with the same stem.
- Parameters:
stem (str) – Stem to search
ignorecase (bool) – If passed, the stems are searched in the lowercased index (and the stem itself is assumed to be lowercased). Used by lookup to find a correspondence for an uppercased word if the stem has complex capitalization (find “McDonalds” by “MCDONALDS”)
- Return type:
List[Word]
def homonyms(self, stem: str, *, ignorecase: bool = False) -> List[Word]:
    """
    Returns all :class:`Word` instances with the same stem.

    Args:
        stem: Stem to search
        ignorecase: If passed, the stems are searched in the lowercased index (and the ``stem``
                    itself assumed to be lowercased). Used by lookup to find a correspondence for
                    uppercased word, if the stem has complex capitalization (find "McDonalds" by
                    "MCDONALDS")
    """
    if ignorecase:
        return self.lowercase_index.get(stem, [])
    return self.index.get(stem, [])
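To show how the two indexes interact, here is a minimal sketch that does not use spylls itself. Entry is a hypothetical stand-in for Word, and str.lower() is a simplification of the casing-aware lowercasing done on dictionary reading.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for spylls' Word, with only the fields needed here.
Entry = namedtuple('Entry', ['stem', 'flags'])

index = defaultdict(list)            # exact stem -> entries
lowercase_index = defaultdict(list)  # lowercased stem -> entries

for entry in [Entry('McDonalds', {'M'}), Entry('spell', {'S'}), Entry('spell', {'G'})]:
    index[entry.stem].append(entry)
    # Simplification: plain str.lower() instead of casing-aware lowercasing.
    lowercase_index[entry.stem.lower()].append(entry)


def homonyms(stem, *, ignorecase=False):
    """Same lookup logic as the method above, over the module-level indexes."""
    if ignorecase:
        return lowercase_index.get(stem, [])
    return index.get(stem, [])


# The all-caps input "MCDONALDS" has no exact match in the stem index...
exact = homonyms('MCDONALDS')
# ...but its lowercased form is found through the lowercase index.
folded = homonyms('mcdonalds', ignorecase=True)
# Two entries share the stem "spell", so the result is a list of both.
both_spells = homonyms('spell')
```

Using .get(stem, []) rather than indexing avoids creating empty lists in a defaultdict for every stem that is merely queried.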
-
has_flag(stem, flag, *, for_all=False)
Checks whether any/all of the homonyms have the specified flag. The lookup algorithm frequently needs to check something like “…but if there is ANY dictionary entry with this stem and ‘forbidden’ flag…”, or “…but if ALL dictionary entries with this stem are marked as ‘forbidden’…”
- Parameters:
stem (str) – Stem present in dictionary
flag (str) – Flag to test
for_all (bool) – If True, checks if all homonyms have this flag, if False, checks if at least one.
- Return type:
bool
def has_flag(self, stem: str, flag: str, *, for_all: bool = False) -> bool:
    """
    If any/all of the homonyms have specified flag. It is frequently necessary in lookup algo to
    check something like "...but if there is ANY dictionary entry with this stem and 'forbidden'
    flag...", or "...but if ALL dictionary entries with this stem marked as 'forbidden'..."

    Args:
        stem: Stem present in dictionary
        flag: Flag to test
        for_all: If ``True``, checks if **all** homonyms have this flag, if ``False``, checks if at
                 least one.
    """
    homonyms = self.homonyms(stem)
    if not homonyms:
        return False
    if for_all:
        return all(flag in homonym.flags for homonym in homonyms)
    return any(flag in homonym.flags for homonym in homonyms)
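The difference between the "any" and "all" modes can be seen in a small standalone sketch. The flag sets and the flag name 'X' are made up for illustration, and the plain dict stands in for the result of homonyms.

```python
# Hypothetical data: stem -> flag sets of each dictionary entry with that stem.
flag_sets_by_stem = {
    'foo': [{'X', 'A'}, {'B'}],  # two entries, only one of them carries 'X'
    'bar': [{'X'}, {'X', 'C'}],  # every entry carries 'X'
}


def has_flag(stem, flag, *, for_all=False):
    flag_sets = flag_sets_by_stem.get(stem, [])
    if not flag_sets:
        # Without this guard, all() over an empty list would return True.
        return False
    check = all if for_all else any
    return check(flag in flags for flags in flag_sets)


any_foo = has_flag('foo', 'X')
all_foo = has_flag('foo', 'X', for_all=True)
all_bar = has_flag('bar', 'X', for_all=True)
all_missing = has_flag('missing', 'X', for_all=True)
```

The explicit check for an empty homonym list matters in the for_all case: a stem that is absent from the dictionary should not count as having the flag.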
Dictionary creation
-
append(word, *, lower)
Used only by read_dic to put the word into the dictionary.
- Parameters:
word (spylls.hunspell.data.dic.Word) – The word instance, already pre-populated
lower (List[str]) – List of all the lowercase forms of word stems. They are pre-calculated on dictionary reading, because proper lowercasing requires casing context; and may produce several lowercased variants (for German). See Casing.lower for details.
def append(self, word: Word, *, lower: List[str]):
    """
    Used only by :meth:`read_dic <spylls.hunspell.readers.dic.read_dic>` to put the word into the
    dictionary.

    Args:
        word: The word instance, already pre-populated
        lower: List of all the lowercase forms of word stems. They are pre-calculated on dictionary
               reading, because proper lowercasing requires casing context; and may produce several
               lowercased variants (for German).
               See :meth:`Casing.lower <spylls.hunspell.algo.capitalization.Casing.lower>` for details.
    """
    self.words.append(word)
    self.index[word.stem].append(word)
    for lword in lower:
        self.lowercase_index[lword].append(word)
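The effect of passing several lowercase forms can be sketched with a stripped-down stand-in. MiniDic is hypothetical and stores plain strings instead of Word instances; the two lowercase variants are written by hand for illustration and are not the output of Casing.lower.

```python
from collections import defaultdict


class MiniDic:
    """Hypothetical, stripped-down stand-in for Dic; stems are plain strings."""

    def __init__(self):
        self.words = []
        self.index = defaultdict(list)
        self.lowercase_index = defaultdict(list)

    def append(self, stem, *, lower):
        self.words.append(stem)
        self.index[stem].append(stem)
        # One entry may be reachable through several lowercased keys.
        for lword in lower:
            self.lowercase_index[lword].append(stem)


dic = MiniDic()
# Hand-written variants, assumed for illustration only.
dic.append('STRASSE', lower=['strasse', 'straße'])
```

The word is stored once in the flat list and once in the exact index, but it is registered under every lowercase variant, so a case-insensitive query with either form finds it.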
Word: dictionary entry
-
class Word(stem, flags, data, alt_spellings, captype)
One word (stem) of a .dic file.
Each entry in the source contains something like:
foo/ABC ph:phoo is:bar
where foo is the stem itself, ABC is the word flags (flag meaning and format are defined by the *.aff file), and ph:phoo is:bar are additional data tags (ph is the tag and phoo is the value). Both flags and tags can be absent.

Both flags and data tags can also be represented by numeric aliases defined in the .aff file (see Aff.AF and Aff.AM); this is handled at the reading stage, see the read_dic docs for details.

The meaning of data tags is discussed in the hunspell docs. Spylls, for now, provides special handling only for the ph: field. The code probably means “phonetic”, but the idea is that this field contains “alternative spellings” (or, rather, common misspellings) of the word. The simplest example is

which ph:wich

This specifies that the dictionary word which is frequently misspelled as wich, and would be considered in Suggest. More complicated forms:

pretty ph:prity*
happy ph:hepi->happi

The first one means “any prit inside a word should be replaced by prett” (chomping off the last letter of both); the second: “any hepi should be replaced with happi, but we store this fact with the stem happy” (think “hepiness -> happiness”).

The first (simple) form is stored in alt_spellings and used in ngram_suggest; the more complex forms are processed at the reading stage and are actually stored in Aff.REP.

Attributes from source data:
-
stem: str
Word stem
-
flags: Set[str]
Flags of the word, parsed depending on aff-file settings. ABCD might be parsed into {"A", "B", "C", "D"} (default flag format, “short”), or {"AB", "CD"} (“long” flag format)
-
data: Dict[str, List[str]]
Raw values of data tags. Each tag can be repeated several times, like witch ph:wich ph:which, that’s why dictionary values are lists
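The shapes of flags and data can be sketched as follows. This is only an illustration: parse_flags and parse_data are hypothetical helper names, not spylls functions, and only the “short” and “long” flag formats mentioned above are handled (Hunspell has other flag formats, which are not covered here).

```python
from collections import defaultdict
from typing import Dict, List, Set


def parse_flags(raw: str, fmt: str = 'short') -> Set[str]:
    """Hypothetical helper: split a raw flag string according to the flag format."""
    if fmt == 'long':
        # "long" format: every flag is two characters wide.
        return {raw[i:i + 2] for i in range(0, len(raw), 2)}
    # "short" (default) format: every character is a separate flag.
    return set(raw)


def parse_data(tags: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: group "key:value" tags, keeping repeated keys."""
    data = defaultdict(list)
    for tag in tags:
        key, _, value = tag.partition(':')
        data[key].append(value)
    return dict(data)


short_flags = parse_flags('ABCD')
long_flags = parse_flags('ABCD', 'long')
witch_data = parse_data(['ph:wich', 'ph:which'])
```

Because a tag such as ph: may appear more than once on the same entry, collecting values into lists keeps every occurrence instead of overwriting the earlier one.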
Attributes calculated on dictionary reading:
-
alt_spellings: List[str]
List of alternative word spellings, defined with the ph: data tag, and used by ngram_suggest. Not everything specified with ph: is stored here, see explanations in class docs.
-
captype: CapType
One of capitalization.Type (no capitalization, initial letter capitalized, all letters, or mixed), analyzed on dictionary reading; useful on lookup.
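A minimal sketch of how the two calculated attributes could be derived, under stated assumptions: CapKind, guess_captype and split_ph are hypothetical names, the four capitalization kinds simply mirror the list above, and the ph: handling follows the three forms described in the class docs (for prity* and pretty, stripping the last character of both gives prit and prett). The actual reader may differ in details.

```python
from enum import Enum
from typing import List, Tuple


class CapKind(Enum):
    """Hypothetical stand-in for the four capitalization kinds named above."""
    NO = 'no capitalization'
    INIT = 'initial letter capitalized'
    ALL = 'all letters capitalized'
    MIXED = 'mixed'


def guess_captype(stem: str) -> CapKind:
    if stem == stem.lower():
        return CapKind.NO
    if stem == stem.upper():
        return CapKind.ALL
    if stem[0].isupper() and stem[1:] == stem[1:].lower():
        return CapKind.INIT
    return CapKind.MIXED


def split_ph(stem: str, values: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split ph: values into simple alternative spellings and (wrong, right) replacements."""
    alt_spellings = []
    replacements = []
    for value in values:
        if '->' in value:
            # "hepi->happi": an explicit replacement pair.
            wrong, right = value.split('->', 1)
            replacements.append((wrong, right))
        elif value.endswith('*'):
            # "prity*": drop the "*" and the last letter of the misspelling,
            # and the last letter of the stem.
            replacements.append((value[:-2], stem[:-1]))
        else:
            # "wich": a plain alternative spelling.
            alt_spellings.append(value)
    return alt_spellings, replacements


simple = split_ph('which', ['wich'])
starred = split_ph('pretty', ['prity*'])
arrow = split_ph('happy', ['hepi->happi'])
kinds = [guess_captype(w) for w in ('cat', 'Cat', 'CAT', 'McDonalds')]
```

Only the simple form ends up in the first list, which matches the note that not everything specified with ph: is stored in alt_spellings.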
-