algo.suggest: main suggestion algorithm

The main “suggest correction for this misspelling” module.

At a bird's-eye view, suggest does the following:

  • tries small word “edits” (remove letters, insert letters, swap letters) and checks (with the help of lookup) whether any of them are valid words

  • if no good suggestions are found, tries “ngram-based” suggestions (calculating the ngram-based distance to all dictionary words and selecting the closest ones), handled by ngram_suggest

  • if possible, tries metaphone-based suggestions, handled by phonet_suggest

Note that Spylls’s implementation takes one liberty compared to Hunspell’s: in Hunspell, ngram suggestions (select all words from the dictionary that are ngram-similar => produce suggestions) and phonetic suggestions (select all words from the dictionary that are phonetically similar => produce suggestions) are done in the same cycle, because they both iterate through the entire dictionary. Spylls does it in two separate cycles, again for the sake of clarity (note that dictionaries with metaphone transformation rules defined are extremely rare).
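
A quick way to see which stage produced a given suggestion is to look at the kind tag of the Suggestion objects (described in “Suggestion classes” below). A minimal sketch, assuming an en_US dictionary is available at the path used in the examples on this page:

from spylls.hunspell import Dictionary

dictionary = Dictionary.from_files('dictionaries/en_US')
suggest = dictionary.suggester

# Edit-based kinds ('badchar', 'swapchar', 'spaceword', ...) come first;
# 'ngram' and 'phonet' only appear if no good edit-based suggestions were found.
for suggestion in suggest.suggestions('unredable'):
    print(suggestion.kind, suggestion.text)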

To follow the algorithm details, start reading from Suggest.__call__().

class Suggest(aff, dic, lookup)[source]

The Suggest object is created when the Dictionary is read. Typically, you would not use it directly, but you might want to for experiments:

>>> dictionary = Dictionary.from_files('dictionaries/en_US')
>>> suggest = dictionary.suggester

>>> [*suggest('spylls')]
['spells', 'spills']

>>> for suggestion in suggest.suggestions('spylls'):
...    print(suggestion)
Suggestion[badchar](spells)
Suggestion[badchar](spills)

See __call__() as the main entry point for algorithm explanation.

Main methods

__call__(word)[source]

Outer “public” interface: returns a list of all valid suggestions, as strings.

Method returns a generator, so it is up to client code to fetch as many suggestions as it needs:

>>> suggestions = suggester('unredable')
>>> suggestions
<generator object Suggest.__call__ at 0x7f74f5056350>
>>> suggestions.__next__()
'unreadable'
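
Since the result is lazy, client code that only needs a few suggestions can stop early, and the more expensive ngram/phonet stages are simply never reached; for example (reusing the suggester object from above):

import itertools

# Take at most three suggestions; the rest of the search is never executed.
top_three = list(itertools.islice(suggester('unredable'), 3))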

Note that a suggestion to split the word in two is also returned as a single string, with a space:

>>> [*suggester('badcat')]
['bad cat', 'bad-cat', 'baccarat']

Internally, the method just calls suggestions() (which returns instances of Suggestion) and yields suggestion texts.

Parameters:

word (str) – Word to check

Return type:

Iterator[str]

def __call__(self, word: str) -> Iterator[str]:
    """
    Outer "public" interface: returns a list of all valid suggestions, as strings.

    Method returns a generator, so it is up to client code to fetch as many suggestions as it
    needs::

        >>> suggestions = suggester('unredable')
        >>> suggestions
        <generator object Suggest.__call__ at 0x7f74f5056350>
        >>> suggestions.__next__()
        'unreadable'

    Note that a suggestion to split the word in two is also returned as a single string, with a space::

        >>> [*suggester('badcat')]
        ['bad cat', 'bad-cat', 'baccarat']

    Internally, the method just calls :meth:`suggestions` (which returns instances of :class:`Suggestion`)
    and yields suggestion texts.

    Args:
        word: Word to check
    """
    yield from (suggestion.text for suggestion in self.suggestions(word))

suggestions(word)[source]

Main suggestion search loop. What it does, in general, is:

  • generates possible misspelled word cases (for ex., “KIttens” in dictionary might’ve been ‘KIttens’, ‘kIttens’, ‘kittens’, or ‘Kittens’)

  • produces word edits with edits() (with the help of permutations module), checks them with Lookup, and decides if that’s maybe enough

  • but if it is not (and if .aff settings allow), ngram-based suggestions are produced with ngram_suggestions(), and phonetically similar suggestions with phonet_suggestions()

That’s a very simplified explanation; read the code!

Parameters:

word (str) – Word to check

Return type:

Iterator[spylls.hunspell.algo.suggest.Suggestion]

def suggestions(self, word: str) -> Iterator[Suggestion]:  # pylint: disable=too-many-statements
    """
    Main suggestion search loop. What it does, in general, is:

    * generates possible misspelled word cases (for ex., "KIttens" in dictionary might've been
      'KIttens', 'kIttens', 'kittens', or 'Kittens')
    * produces word edits with :meth:`edits` (with the help of
      :mod:`permutations <spylls.hunspell.algo.permutations>` module), checks them with
      :class:`Lookup <spylls.hunspell.algo.lookup.Lookup>`, and decides if that's maybe enough
    * but if it is not (and if .aff settings allow), ngram-based suggestions are produced with
      :meth:`ngram_suggestions`, and phonetically similar suggestions with :meth:`phonet_suggestions`

    That's a very simplified explanation; read the code!

    Args:
        word: Word to check
    """

    # Checks whether some suggestion (permutation of the word) is an existing and allowed word;
    # just delegates to Lookup
    def is_good_suggestion(word):
        # Note that instead of using Lookup's main method, we just check whether there are any good
        # forms of this exact word, skipping ICONV and the splitting of the word by dashes.
        return any(self.lookup.good_forms(word, capitalization=False, allow_nosuggest=False))

    # The suggestion is considered forbidden if there is ANY homonym in the dictionary with the
    # FORBIDDENWORD flag. Besides marking swear words, this feature also allows dictionaries to
    # include known "correct-looking but actually non-existent" forms, which might be important
    # for highly inflected languages.
    def is_forbidden(word):
        return self.aff.FORBIDDENWORD and self.dic.has_flag(word, self.aff.FORBIDDENWORD)

    # This set will gather all good suggestions that have already been returned (so that, for
    # example, we don't return the same suggestion twice)
    handled: Set[str] = set()

    # Suggestions that are already considered good are passed through this method, which converts
    # them to the proper capitalization form, and then either yields them (if not forbidden,
    # not already seen, etc.), or does nothing.
    # The method is quite lengthy, but is nested because it updates and reuses the ``handled`` local var
    def handle_found(suggestion, *, check_inclusion=False):
        text = suggestion.text

        # If any of the homonyms has KEEPCASE flag, we shouldn't coerce it from the base form.
        # But CHECKSHARPS flag presence changes the meaning of KEEPCASE...
        if (self.aff.KEEPCASE and self.dic.has_flag(text, self.aff.KEEPCASE) and not
                (self.aff.CHECKSHARPS and 'ß' in text)):
            # Don't try to change text's case
            pass
        else:
            # "Coerce" suggested text from the capitalization that it has in the dictionary, to
            # the capitalization of the misspelled word. E.g., if misspelled was "Kiten", suggestion
            # is "kitten" (how it is in the dictionary), and coercion (what we really want
            # to return to user) is "Kitten"
            text = self.aff.casing.coerce(text, captype)
            # ...but if this particular capitalized form is forbidden, return back to original text
            if text != suggestion.text and is_forbidden(text):
                text = suggestion.text

            # "aNew" will suggest "a new", here we fix it back to "a New"
            if captype in [CapType.HUH, CapType.HUHINIT] and ' ' in text:
                pos = text.find(' ')
                if text[pos + 1] != word[pos] and text[pos + 1].upper() == word[pos]:
                    text = text[:pos+1] + word[pos] + text[pos+2:]

        # If the word is forbidden, nothing more to do
        if is_forbidden(text):
            return

        # Finally, OCONV table in .aff-file might specify what chars to replace in suggestions
        # (for example, "'" to proper typographic "’", or common digraphs)
        text = self.aff.OCONV(text) if self.aff.OCONV else text

        # If we have already seen this suggestion, there is nothing to do.
        # Note that this should happen AFTER the OCONV: it sometimes changes the content significantly.
        # For example, the Dutch dictionary recodes "ij" into a ligature (one character representing
        # these two letters) to enforce proper capitalization (and decodes it back on OCONV).
        if text in handled:
            return

        # Sometimes we want to skip suggestions even if they are the same as PARTS of already
        # seen ones: for example, ngram-based suggestions might produce very similar forms, like
        # "impermanent" and "permanent" -- both of them are correct, but if the first is
        # closer (by length/content) to the misspelling, there is no point in suggesting the second
        if check_inclusion and any(previous.lower() in text.lower() for previous in handled):
            return

        # Remember we've seen it
        handled.add(text)

        # And here we are!
        yield suggestion.replace(text=text)

    # **Start of the main suggest code**

    # First, produce all possible "good capitalizations" of the provided word. For example,
    # if the word is "MsDonalds", good capitalizations (what it might've been in the dictionary) are
    # "msdonalds" (full lowercase), "msDonalds" (first letter lowercased), or maybe "Msdonalds"
    # (only first letter capitalized). Note that "MSDONALDS" (it should've been all caps) is not
    # produced as a possible good form, but is checked separately in ``edits``
    captype, variants = self.aff.casing.corrections(word)

    # Check a special case: if words may be capitalized when compounding (FORCEUCASE), then we
    # check the capitalized form of the word. If it is correct, that's the only suggestion we ever need.
    if self.aff.FORCEUCASE and captype == CapType.NO:
        for capitalized in self.aff.casing.capitalize(word):
            if is_good_suggestion(capitalized):
                yield from handle_found(Suggestion(capitalized.capitalize(), 'forceucase'))
                return  # No more need to check anything

    good_edits_found = False

    # Now, for each capitalization variant
    for idx, variant in enumerate(variants):
        # If it is different from original capitalization, and is good, we suggest it
        if idx > 0 and is_good_suggestion(variant):
            yield from handle_found(Suggestion(variant, 'case'))

        # Now we do two rounds:
        # * generate edits and check whether they are valid non-compound words,
        # * generate the same edits again and check whether they are valid compound words
        #
        # The reason for splitting the checks is: a) optimization (checking whether the word is
        # a valid compound is much more expensive) and b) suggestion quality (non-compound words
        # are considered more probable).
        # There are cases when this algorithm brings problems: for example, "11thhour" will NOT
        # suggest "11th hour" -- because "11th" is a compound word and "hour" is not, but we
        # first check whether the splitting is correct assuming all words are non-compound, and
        # then check whether it is correct assuming both are compound. That's how it is in Hunspell,
        # and Spylls doesn't try to fix that.

        nocompound = False

        # 1. Only generate suggestions that are correct NON-COMPOUND words.
        for suggestion in self.edit_suggestions(variant, handle_found, limit=MAXSUGGESTIONS, compounds=False):
            yield suggestion
            # remember if any "good" edits were found -- in that case we don't need
            # ngram suggestions
            good_edits_found = good_edits_found or (suggestion.kind in GOOD_EDITS)
            # If any of those kinds are found, we don't need to try _any_ compound suggestions
            if suggestion.kind in ['uppercase', 'replchars', 'mapchars']:
                nocompound = True
            # We might return from the method immediately if the suggestion is to split the word
            # with a space, and this phrase is _in the dictionary as a whole_: "alot" => "a lot".
            # Then we don't need ANY other suggestions. This is a trick to enforce suggesting
            # breaking commonly misspelled phrases without oversuggesting
            if suggestion.kind == 'spaceword':
                return

        # 2. Only generate suggestions that are correct COMPOUND words
        if not nocompound:
            for suggestion in self.edit_suggestions(variant, handle_found,
                                                    limit=self.aff.MAXCPDSUGS, compounds=True):
                yield suggestion
                good_edits_found = good_edits_found or (suggestion.kind in GOOD_EDITS)

    if good_edits_found:
        return

    # One additional pass at suggesting: if the word contains dashes, and there were no dash-containing
    # suggestions, we should try to spellcheck it separately... But only one misspelled part is allowed.
    if '-' in word and not any('-' in sug for sug in handled):
        chunks = word.split('-')
        for idx, chunk in enumerate(chunks):
            # If this chunk of the word is misspelled...
            if not is_good_suggestion(chunk):
                # ...try all suggestions for this separate chunk
                for sug in self(chunk):
                    candidate = '-'.join([*chunks[:idx], sug, *chunks[idx+1:]])
                    # And check if the whole word with this chunk replaced is a good word. Note that
                    # we use lookup's main method here, which will also break it into words by
                    # dashes. This is done (instead of just checking chunk-by-chunk) because there
                    # are ambiguities in how to split: if we are fixing "quas-Afro-American"->"quasi-Afro-American",
                    # and "Afro-American" is present in the dictionary as a whole word, it is better
                    # to delegate the search for the right split to the Lookup.
                    if self.lookup(candidate):
                        yield Suggestion(candidate, 'dashes')
                # after the first misspelled chunk, we stop the search anyway, whether we found
                # some suggestions or not
                break

    # If there were no good edits that produced valid words, we might try the
    # ngram-based suggestion algorithm: it is slower, but able to correct severely misspelled words

    ngrams_seen = 0
    for sug in self.ngram_suggestions(word, handled=handled):
        for res in handle_found(Suggestion(sug, 'ngram'), check_inclusion=True):
            ngrams_seen += 1
            yield res
        if ngrams_seen >= self.aff.MAXNGRAMSUGS:
            break

    # Also, if metaphone transformations (phonetic coding of words) were defined in the .aff file,
    # we might try to use them to produce suggestions

    phonet_seen = 0
    for sug in self.phonet_suggestions(word):
        for res in handle_found(Suggestion(sug, 'phonet'), check_inclusion=True):
            phonet_seen += 1
            yield res
        if phonet_seen >= MAXPHONSUGS:
            break

Suggestion types

edits(word)[source]

Produces all possible word edits in a form of Suggestion or MultiWordSuggestion. Note that:

  • order is important: that’s the order in which the user will receive the suggestions (and the further down the order a suggestion type is, the more likely it is to be dropped due to the suggestion count limit)

  • the suggestion “source” tag is important: suggestions() uses it to distinguish between good and questionable edits (if there were any good ones, ngram suggestions wouldn’t be used)

Parameters:

word (str) – Word to mutate

Return type:

Iterator[Union[spylls.hunspell.algo.suggest.Suggestion, spylls.hunspell.algo.suggest.MultiWordSuggestion]]

def edits(self, word: str) -> Iterator[Union[Suggestion, MultiWordSuggestion]]:
    """
    Produces all possible word edits in a form of :class:`Suggestion` or :class:`MultiWordSuggestion`.
    Note that:

    * order is important: that's the order in which the user will receive the suggestions (and the
      further down the order a suggestion type is, the more likely it is to be dropped due to the
      suggestion count limit)
    * the suggestion "source" tag is important: :meth:`suggestions` uses it to distinguish between
      good and questionable edits (if there were any good ones, ngram suggestions wouldn't
      be used)

    Args:
        word: Word to mutate
    """
    # suggestions for an uppercase word (html -> HTML)
    yield Suggestion(self.aff.casing.upper(word), 'uppercase')

    # The REP table in the affix file specifies "typical misspellings", and we try to replace them.
    # Note that the content of the REP table is taken not only from the aff file, but also from the
    # "ph:" tag in the dictionary (lines looking like ``hello ph:helo`` put the word "hello" in the
    # dictionary, and "helo -> hello" into the REP table), see read_dic.
    #
    # It might return several words if the REP table has "REP <something> <some>_<thing>" (_ is the
    # code for space).
    #
    # ...in this case we should suggest both "<word1> <word2>" as one dictionary entry, and
    # "<word1>" "<word2>" as a sequence -- but making clear that this sequence must NOT be joined by "-"
    for suggestion in pmt.replchars(word, self.aff.REP):
        if isinstance(suggestion, list):
            yield Suggestion(' '.join(suggestion), 'replchars')
            yield MultiWordSuggestion(suggestion, 'replchars', allow_dash=False)
        else:
            yield Suggestion(suggestion, 'replchars')

    for words in pmt.twowords(word):
        yield Suggestion(' '.join(words), 'spaceword')

        if self.use_dash():
            # "alot" => "a-lot"
            yield Suggestion('-'.join(words), 'spaceword')

    # MAP in the aff file specifies related chars (for example, "ïi"), and mapchars produces all
    # variants of the word with related chars replaced. For example, "naive" produces "naïve".
    for suggestion in pmt.mapchars(word, self.aff.MAP):
        yield Suggestion(suggestion, 'mapchars')

    # Try to swap adjacent characters (ktiten -> kitten); produces all possible forms with ONE
    # swap, but for 4- and 5-letter words also tries two swaps ("ahev" -> "have")
    for suggestion in pmt.swapchar(word):
        yield Suggestion(suggestion, 'swapchar')

    # Try longer swaps (up to 4 chars distance)
    for suggestion in pmt.longswapchar(word):
        yield Suggestion(suggestion, 'longswapchar')

    # Try to replace chars with those close to them on the keyboard ("wueue" -> "queue"); KEY in
    # the aff file specifies the keyboard layout.
    for suggestion in pmt.badcharkey(word, self.aff.KEY):
        yield Suggestion(suggestion, 'badcharkey')

    # Try to remove a character (produces all forms with one char removed: "clat" => "lat", "cat", "clt", "cla")
    for suggestion in pmt.extrachar(word):
        yield Suggestion(suggestion, 'extrachar')

    # Try to insert a character (from the set of all possible language chars specified in aff);
    # produces all forms with any of the TRY chars inserted in all possible positions
    for suggestion in pmt.forgotchar(word, self.aff.TRY):
        yield Suggestion(suggestion, 'forgotchar')

    # Try to move a character forward and backward:
    #  "rnai" => "nari", "nair", "rain" (forward: r moved 2 and 3 chars, n moved 2 chars)
    #         => "rina", "irna", "arni" (backward: i moved 2 and 3 chars, a moved 2 chars)
    # (no 1-char movements necessary, they are covered by swapchar above)
    for suggestion in pmt.movechar(word):
        yield Suggestion(suggestion, 'movechar')

    # Try replacing each character with any of the language's other characters
    for suggestion in pmt.badchar(word, self.aff.TRY):
        yield Suggestion(suggestion, 'badchar')

    # Try to fix two-character doubling: "chickcken" -> "chicken" (one-character doubling is
    # already covered by extrachar)
    for suggestion in pmt.doubletwochars(word):
        yield Suggestion(suggestion, 'doubletwochars')

    if not self.aff.NOSPLITSUGS:
        # Try to split the word by a space in all possible positions.
        # The NOSPLITSUGS option in aff prohibits it; this is important, say, for Scandinavian languages
        for suggestion_pair in pmt.twowords(word):
            yield MultiWordSuggestion(suggestion_pair, 'twowords', allow_dash=self.use_dash())
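
To get a feel for what edits() produces (and in what order), the raw stream can be inspected directly. A small sketch, reusing the suggest object from the example at the top of this page; note that MultiWordSuggestion carries its tag in source, while Suggestion uses kind:

from collections import Counter

from spylls.hunspell.algo.suggest import Suggestion

# Count raw edit candidates per tag for one misspelling. Most of them are not
# valid words; it is suggestions() that filters them through Lookup.
tags = Counter(
    edit.kind if isinstance(edit, Suggestion) else edit.source
    for edit in suggest.edits('kiten')
)
print(tags.most_common())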

ngram_suggestions(word, handled)[source]

Produces ngram-based suggestions by passing the current misspelling, already-found suggestions and settings from the .aff file to ngram_suggest.ngram_suggest.

See ngram_suggest.

Parameters:
  • word (str) – Misspelled word

  • handled (Set[str]) – Set of already handled (known) suggestions; it is reused in ngram_suggest.filter_guesses to decide whether we add “not really good” ngram-based suggestions to the result

Return type:

Iterator[str]

def ngram_suggestions(self, word: str, handled: Set[str]) -> Iterator[str]:
    """
    Produces ngram-based suggestions by passing the current misspelling, already-found suggestions
    and settings from the .aff file to
    :meth:`ngram_suggest.ngram_suggest <spylls.hunspell.algo.ngram_suggest.ngram_suggest>`.

    See :mod:`ngram_suggest <spylls.hunspell.algo.ngram_suggest>`.

    Args:
        word: Misspelled word
        handled: Set of already handled (known) suggestions; it is reused in
                 :meth:`ngram_suggest.filter_guesses <spylls.hunspell.algo.ngram_suggest.filter_guesses>`
                 to decide whether we add "not really good" ngram-based suggestions to the result
    """
    if self.aff.MAXNGRAMSUGS == 0:
        return

    yield from ngram_suggest.ngram_suggest(
                word.lower(),
                dictionary_words=self.words_for_ngram,
                prefixes=self.aff.PFX, suffixes=self.aff.SFX,
                known={*(word.lower() for word in handled)},
                maxdiff=self.aff.MAXDIFF,
                onlymaxdiff=self.aff.ONLYMAXDIFF,
                has_phonetic=(self.aff.PHONE is not None))

phonet_suggestions(word)[source]

Produces phonetic similarity-based suggestions by passing the current misspelling and settings from the .aff file to phonet_suggest.phonet_suggest.

See phonet_suggest.

Parameters:

word (str) – Misspelled word

Return type:

Iterator[str]

def phonet_suggestions(self, word: str) -> Iterator[str]:
    """
    Produces phonetic similarity-based suggestions by passing the current misspelling and settings
    from the .aff file to
    :meth:`phonet_suggest.phonet_suggest <spylls.hunspell.algo.phonet_suggest.phonet_suggest>`.

    See :mod:`phonet_suggest <spylls.hunspell.algo.phonet_suggest>`.

    Args:
        word: Misspelled word
    """
    if not self.aff.PHONE:
        return

    yield from phonet_suggest.phonet_suggest(word, dictionary_words=self.words_for_ngram, table=self.aff.PHONE)
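
Both methods can also be called on their own when debugging the slower stages; a sketch, again reusing the suggest object from above (phonet_suggestions yields nothing unless the dictionary's .aff file defines a PHONE table):

# Raw ngram candidates, with no previously handled suggestions to exclude
for candidate in suggest.ngram_suggestions('unredable', handled=set()):
    print('ngram:', candidate)

# Raw phonetic candidates (empty for most dictionaries)
for candidate in suggest.phonet_suggestions('unredable'):
    print('phonet:', candidate)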

Suggestion classes

class Suggestion(text, kind)[source]

Suggestion is what Suggest produces internally to store enough information about a candidate suggestion to decide whether it is a good one.

text: str

Actual suggestion text

kind: str

How the suggestion was produced; useful for debugging. Typically the same as the name of the edit which led to this suggestion, like “badchar”, “twowords”, “phonet”, etc.

class MultiWordSuggestion(words, source, allow_dash=True)[source]

Represents a suggestion to split the word into several words.

words: List[str]

List of words

source: str

Same as Suggestion.kind

allow_dash: bool = True

Whether these words are allowed to be joined by a dash. We should disallow it if the multi-word suggestion was produced by the Aff.REP table, see Suggest.edits() for details.
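
For illustration only (this is not Spylls's actual code), a MultiWordSuggestion could be flattened into plain Suggestions following the allow_dash semantics described above:

from typing import List

from spylls.hunspell.algo.suggest import MultiWordSuggestion, Suggestion

def expand(multi: MultiWordSuggestion) -> List[Suggestion]:
    # The space-joined phrase is always offered...
    result = [Suggestion(' '.join(multi.words), multi.source)]
    # ...and the dash-joined one only when allowed (e.g. not for REP-produced splits)
    if multi.allow_dash:
        result.append(Suggestion('-'.join(multi.words), multi.source))
    return result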