data.dic: contents of .dic file

The module represents data from Hunspell’s *.dic file.

This text file has the following format:

124 # first line: number of entries

# pseudo-comments are marked with "#"
# Each entry has form:

cat/ABC ph:kat

See Word for an explanation of the fields.

The meaning of flags, the file encoding, and other reading settings are defined by Aff.

Dic contains all the entries from the *.dic file (converted to Word) in a linear list, and also provides some indexes and utilities for convenience.

Dic is read by read_dic.

Dic: list of entries

class Dic(words)[source]

Represents the list of words from a *.dic file. Each word is stored as an instance of Word.

Besides the flat list of all words, initialization also creates word indexes; see index and lowercase_index.

Note that there can be (and typically are) several entries in the dictionary with the same stem but different flags and/or data tags, which is why index values are lists of words. For example, in an English dictionary “spell” (verb, related to reading/writing) and “spell” (noun, magical formula) may be separate entries that define different sets of possible suffixes and morphological properties.

Typically, a spylls user shouldn’t create an instance of this class themselves; it is created when the whole dictionary is read:

>>> dictionary = Dictionary.from_files('dictionaries/en_US')

>>> dictionary.dic
Dictionary(... 62119 words ...)

>>> dictionary.dic.homonyms('spell')
[Word(spell /G,R,S,J,Z,D)]

Data contents:

words: List[Word]

List of all words from *.dic file

index: Dict[str, List[Word]]

All .dic file entries for some stem.

lowercase_index: Dict[str, List[Word]]

All .dic file entries for lowercase version of some stem.

Querying (used by lookup and suggest):

homonyms(stem, *, ignorecase=False)[source]

Returns all Word instances with the same stem.

Parameters:
  • stem (str) – Stem to search

  • ignorecase (bool) – If True, the stem is searched in the lowercased index (and the stem itself is assumed to be already lowercased). Used by lookup to find a match for an uppercased word when the stem has complex capitalization (find “McDonalds” by “MCDONALDS”).

Return type:

List[spylls.hunspell.data.dic.Word]

def homonyms(self, stem: str, *, ignorecase: bool = False) -> List[Word]:
    """
    Returns all :class:`Word` instances with the same stem.

    Args:
        stem: Stem to search
        ignorecase: If passed, the stems are searched in the lowercased index (and the ``stem``
                    itself assumed to be lowercased). Used by lookup to find a correspondence
                    for uppercased word, if the stem has complex capitalization (find "McDonalds"
                    by "MCDONALDS")
    """
    if ignorecase:
        return self.lowercase_index.get(stem, [])
    return self.index.get(stem, [])
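The ignorecase behaviour can be tried out with a small stand-in. This is only a sketch: the two dicts, their contents, and the use of plain strings instead of Word instances are assumptions made for illustration, not data from a real dictionary. It shows that the method does not lowercase the stem itself, so the caller has to do it.

```python
from typing import Dict, List

# Hypothetical stand-in data: plain strings instead of Word instances.
index: Dict[str, List[str]] = {
    "McDonalds": ["McDonalds/S"],
    "spell": ["spell/GRSJZD"],
}
lowercase_index: Dict[str, List[str]] = {
    "mcdonalds": ["McDonalds/S"],
}


def homonyms(stem: str, *, ignorecase: bool = False) -> List[str]:
    """Same branching as the method above, over the stand-in dicts."""
    if ignorecase:
        return lowercase_index.get(stem, [])
    return index.get(stem, [])


# The exact index has no all-caps key, so this finds nothing.
exact = homonyms("MCDONALDS")

# The caller lowercases the stem before asking for a case-insensitive match.
relaxed = homonyms("MCDONALDS".lower(), ignorecase=True)
```

Passing the uppercased form together with ignorecase=True would also return an empty list, because the lowercased index is keyed by lowercase stems only.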

has_flag(stem, flag, *, for_all=False)[source]

Checks whether any/all of the homonyms have the specified flag. It is frequently necessary in the lookup algo to check something like “…but if there is ANY dictionary entry with this stem and the ‘forbidden’ flag…”, or “…but if ALL dictionary entries with this stem are marked as ‘forbidden’…”

Parameters:
  • stem (str) – Stem present in dictionary

  • flag (str) – Flag to test

  • for_all (bool) – If True, checks whether all homonyms have this flag; if False, checks whether at least one does.

Return type:

bool

def has_flag(self, stem: str, flag: str, *, for_all: bool = False) -> bool:
    """
    Checks whether any/all of the homonyms have the specified flag. It is frequently necessary in
    the lookup algo to check something like "...but if there is ANY dictionary entry with this stem
    and the 'forbidden' flag...", or "...but if ALL dictionary entries with this stem are marked as
    'forbidden'..."

    Args:
        stem: Stem present in dictionary
        flag: Flag to test
        for_all: If ``True``, checks if **all** homonyms have this flag, if ``False``, checks if
                 at least one.
    """
    homonyms = self.homonyms(stem)
    if not homonyms:
        return False
    if for_all:
        return all(flag in homonym.flags for homonym in homonyms)
    return any(flag in homonym.flags for homonym in homonyms)
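The difference between the two modes, and the early return for unknown stems, can be seen in a self-contained sketch. The Entry class is a hypothetical, simplified stand-in for Word, and the flag "X" playing the role of a 'forbidden' flag is an assumption for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    """Hypothetical, simplified stand-in for Word: only stem and flags."""
    stem: str
    flags: Set[str] = field(default_factory=set)


# Two homonyms of "spell"; only the second one carries the illustrative flag "X".
index: Dict[str, List[Entry]] = {
    "spell": [Entry("spell", {"G", "S"}), Entry("spell", {"S", "X"})],
}


def has_flag(stem: str, flag: str, *, for_all: bool = False) -> bool:
    homonyms = index.get(stem, [])
    if not homonyms:
        # Without this guard, all() over an empty list would return True.
        return False
    check = all if for_all else any
    return check(flag in homonym.flags for homonym in homonyms)


some_forbidden = has_flag("spell", "X")                  # at least one entry has "X"
all_forbidden = has_flag("spell", "X", for_all=True)     # but not every entry
all_have_s = has_flag("spell", "S", for_all=True)        # both entries have "S"
unknown = has_flag("missing", "S", for_all=True)         # unknown stem
```

The guard for an empty homonym list matters mostly for for_all=True: an unknown stem is reported as not having the flag, rather than vacuously having it.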

Dictionary creation

append(word, *, lower)[source]

Used only by read_dic to put the word into the dictionary.

Parameters:
  • word (spylls.hunspell.data.dic.Word) – The word instance, already pre-populated

  • lower (List[str]) – List of all the lowercase forms of the word stem. They are pre-calculated on dictionary reading, because proper lowercasing requires casing context and may produce several lowercased variants (for German). See Casing.lower for details.

def append(self, word: Word, *, lower: List[str]):
    """
    Used only by :meth:`read_dic <spylls.hunspell.readers.dic.read_dic>` to put the word into the
    dictionary.

    Args:
        word: The word instance, already pre-populated
        lower: List of all the lowercase forms of word stems. They are pre-calculated on dictionary
               reading, because proper lowercasing requires casing context; and may produce several
               lowercased variants (for German). See
               :meth:`Casing.lower <spylls.hunspell.algo.capitalization.Casing.lower>` for details.
    """
    self.words.append(word)
    self.index[word.stem].append(word)
    for lword in lower:
        self.lowercase_index[lword].append(word)
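A minimal sketch of how one entry ends up in all three containers follows. It assumes the indexes behave like defaultdict(list), which the unconditional .append calls above suggest but which is not stated here. The stems are plain strings, and the two lowercase variants passed for the German-looking word are chosen for illustration only.

```python
from collections import defaultdict
from typing import DefaultDict, List

# Assumed container types; stems are stored as plain strings for brevity.
words: List[str] = []
index: DefaultDict[str, List[str]] = defaultdict(list)
lowercase_index: DefaultDict[str, List[str]] = defaultdict(list)


def append(stem: str, *, lower: List[str]) -> None:
    """Register the entry in the flat list, the exact index and the lowercase index."""
    words.append(stem)
    index[stem].append(stem)
    for lword in lower:
        lowercase_index[lword].append(stem)


# One entry registered under two lowercase keys (illustrative variants).
append("STRASSE", lower=["strasse", "straße"])

# An entry for which the reader supplied no lowercase forms at all.
append("cat", lower=[])
```

After these calls, both lowercase keys lead back to the same entry, while "cat" is reachable only through the exact index.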

Word: dictionary entry

class Word(stem, flags, data, alt_spellings, captype)[source]

One word (stem) of a .dic file.

Each entry in the source contains something like:

foo/ABC ph:phoo is:bar

Where foo is the stem itself, ABC is the word flags (the meaning and format of flags are defined by the *.aff file), and ph:phoo is:bar are additional data tags (ph is the tag and phoo is the value). Both flags and tags can be absent.

Both flags and data tags can also be represented by numeric aliases defined in the .aff file (see Aff.AF and Aff.AM); this is handled at the reading stage, see the read_dic docs for details.

The meaning of data tags is discussed in the hunspell docs. Spylls, for now, provides special handling only for the ph: field. The code probably means “phonetic”, but the idea is that this field contains “alternative spellings” (or, rather, common misspellings) of the word. The simplest example is
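The entry layout described above can be sketched as a small splitting function. This is not the read_dic implementation: split_entry is a hypothetical helper that assumes whitespace-separated fields and a single unescaped slash, and it ignores aliases, flag formats, and stems containing spaces.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_entry(line: str) -> Tuple[str, str, Dict[str, List[str]]]:
    """Split 'stem/FLAGS tag:value ...' into stem, raw flag string and grouped tags."""
    head, *tag_parts = line.split()
    stem, _, raw_flags = head.partition("/")
    data: Dict[str, List[str]] = defaultdict(list)
    for part in tag_parts:
        tag, sep, value = part.partition(":")
        if sep:  # keep only well-formed tag:value pairs
            data[tag].append(value)
    return stem, raw_flags, dict(data)


full = split_entry("foo/ABC ph:phoo is:bar")
repeated = split_entry("witch ph:wich ph:which")
bare = split_entry("cat")
```

The flags come back as one raw string because splitting them into individual flags depends on settings from the .aff file.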

which ph:wich

This specifies that the dictionary word which is frequently misspelled as wich, so this spelling will be considered in Suggest. More complicated forms:

pretty ph:prity*
happy ph:hepi->happi

The first one means “any prit inside a word should be replaced by prett” (chomping off the last letter of both); the second means “any hepi should be replaced by happi, but we store this fact with the stem happy” (think “hepiness -> happiness”).

The first (simple) form is stored in alt_spellings and used in ngram_suggest; the more complex forms are processed at the reading stage and are actually stored in Aff.REP.
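The three forms of the ph: value can be told apart with a few string checks. The sketch below is an illustration under assumptions, not the reader's actual code: interpret_ph is a hypothetical helper, and replacement pairs are returned as plain tuples rather than being added to Aff.REP. Applying the chomping rule literally to the stem pretty yields prett.

```python
from typing import List, Tuple


def interpret_ph(stem: str, values: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Sort ph: values into simple alternative spellings and (wrong, right) pairs."""
    alt_spellings: List[str] = []
    replacements: List[Tuple[str, str]] = []
    for value in values:
        if value.endswith("*"):
            # 'prity*': drop the star and the last letter of the pattern,
            # and pair it with the stem minus its last letter.
            replacements.append((value[:-2], stem[:-1]))
        elif "->" in value:
            # 'hepi->happi': the pair is given explicitly; the stem is not used.
            wrong, _, right = value.partition("->")
            replacements.append((wrong, right))
        else:
            # 'wich': a plain alternative spelling.
            alt_spellings.append(value)
    return alt_spellings, replacements


simple = interpret_ph("which", ["wich"])
starred = interpret_ph("pretty", ["prity*"])
arrow = interpret_ph("happy", ["hepi->happi"])
```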

Attributes from source data:

stem: str

Word stem

flags: Set[str]

Flags of the word, parsed depending on aff-file settings. ABCD might be parsed into {"A", "B", "C", "D"} (default flag format, “short”), or {"AB", "CD"} (“long” flag format)
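The two flag formats mentioned above differ only in how the raw string is cut. The following is a minimal sketch, with parse_flags and its long_format parameter being hypothetical names; Hunspell defines further flag formats that are not covered here.

```python
from typing import Set


def parse_flags(raw: str, *, long_format: bool = False) -> Set[str]:
    """Cut a raw flag string into one-character or two-character flags."""
    if long_format:
        # "long" format: every flag is exactly two characters.
        return {raw[i:i + 2] for i in range(0, len(raw), 2)}
    # "short" (default) format: every character is a flag.
    return set(raw)


short_flags = parse_flags("ABCD")
long_flags = parse_flags("ABCD", long_format=True)
```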

data: Dict[str, List[str]]

Raw values of data tags. Each tag can be repeated several times, like witch ph:wich ph:which, that’s why dictionary values are lists

Attributes calculated on dictionary reading:

alt_spellings: List[str]

List of alternative word spellings, defined with ph: data tag, and used by ngram_suggest. Not everything specified with ph: is stored here, see explanations in class docs.

captype: CapType

One of capitalization.Type (no capitalization, initial letter capitalized, all letters capitalized, or mixed), determined when the dictionary is read and used later on lookup.
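The four categories can be approximated with plain string methods. This is a rough sketch only: the Cap enum and guess_captype are hypothetical names, not the library's types, and the code uses Python's built-in lower() and upper() without any of the casing context that the dictionary's own lowercasing takes into account.

```python
from enum import Enum


class Cap(Enum):
    """Hypothetical labels for the four categories described above."""
    NO = "no capitalization"
    INIT = "initial letter capitalized"
    ALL = "all letters capitalized"
    MIXED = "mixed"


def guess_captype(word: str) -> Cap:
    if word == word.lower():
        return Cap.NO
    if word == word.upper():
        return Cap.ALL
    if word[:1].isupper() and word[1:] == word[1:].lower():
        return Cap.INIT
    return Cap.MIXED


examples = {w: guess_captype(w) for w in ["cat", "Paris", "NASA", "McDonalds"]}
```

Computing this once per entry at reading time means lookup can compare capitalization types without re-inspecting every stem.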