Build your own toy spellchecker

Spellcheckers and autocorrect, aren't they magical? They do feel like magic to me. I mean, how is it able to predict what word do you want to type?

According to Norvig, some of the most talented engineers at Google don't have a complete understanding of how a spellchecker works.

Peter Norvig's solution in Python is so elegant and simple, I don't think anyone can write better than that. However, I wanted to know how it works, so I thought of building it to understand it's functionality. Also, an excuse to exercise my C++ skills.

So, are you as curious as I am? If you are, I think you're in the right spot.

How it works?

Okay, so think about it? What does a spellchecker do? You type in a misspelled word and it returns a word with the highest probability, right?

Well, there's a little bit more to it.

Create a word dictionary

First, we must create a dictionary, in order to do that, you need to extract words from a large piece of text and store it in a Hashmap in which each word will have a word count. In this example, I've used a Sherlock Holmes novel (which is around 6MB). The words are extracted from a novel instead of an actual dictionary because it can be used to create a simple Language Model.

Source code to create a dictionary:

void SpellChecker::extractWords(string &filename)
{
    ifstream infile;
    infile.open(filename);
    string x;
    while(infile >> x)
    {
        x = filterAlpha(x);
        dictionary[x]++;
    }
}

string SpellChecker::filterAlpha(string &word)
{
    for(int i=0; i<word.size(); i++)
    {
        char ch = word[i];

        if(ch < 0)
        {
            word[i] = '-';
        }

        if(isalpha(ch))
        {
            word[i] = tolower(ch);
        }
    }
    return word;
}

Create a list of candidates

Second, we must able to predict/hypothesize the ways of editing text when the user types. It could be one of the following types of editing:

Adding a letter
Replacing a letter
Switching two adjacent letters
Removing a letter

Based on the types of edits a user could make, we can generate a list of possible candidates by creating permutations using these edit methods mentioned above.

Adding a letter

In this method, you generate a list of candidates by inserting a letter in every iteration.

void SpellChecker::inserts(string &word, Vector &result)
{
    for(int i=0; i<word.size()+1; i++)
    {
        for(int j=0; j<alphabets.size(); j++)
        {
            char ch = alphabets[j];
            result.push_back(word.substr(0,i) + ch + word.substr(i));
        }
    }
}

Replacing a letter

In this method, you generate a list of candidates by replacing each character with a letter from a list of alphabets in every iteration.

void SpellChecker::replaces(string &word, Vector &result)
{
    for(int i=0; i<word.size(); i++)
    {
        for(int j=0; j<alphabets.size(); j++)
        {
            char ch = alphabets[j];
            result.push_back(word.substr(0,i) + ch + word.substr(i+1));
        }
    }
}

Switching two adjacent letters

In this method, you generate a list of candidates by switcing two adjacent letters in every iteration. For example: the word "ornage" would look like this: "orange", when the letters "n" and "a" are swapped.

void SpellChecker::transposes(string &word, Vector &result)
{
    for(int i=0; i<word.size()-1; i++)
    {
        result.push_back(word.substr(0,i) + word[i+1] + word[i] + word.substr(i+2));
    }
}

Removing a letter

In this method, you generate a list of candidates by removing a letter in every iteration.

void SpellChecker::deletes(string &word, Vector &result)
{
    for(int i=0; i<word.size(); i++)
    {
        result.push_back(word.substr(0,i) + word.substr(i+1));
    }
}

All of these methods are called in one wrapper method:

void SpellChecker::edits(string &word, Vector &result)
{
    //Deletion
    deletes(word, result);

    //Transposition
    transposes(word, result);

    //Replacement
    replaces(word, result);

    //Insertion
    inserts(word, result);
}

Extract the known words

Third, at this stage, the above step would've generated a huge list of words but 90% of them would be gibberish, so we need to "clean" the list and extract the known words using the dictionary we've created.

void SpellChecker::known_words(Vector& results, Dictionary &candidates)
{
    Dictionary::iterator end = dictionary.end();

    for(int i=0; i<results.size(); i++)
    {
        Dictionary::iterator val = dictionary.find(results[i]);

        if(val != end)
        {
            candidates[val->first] = val->second;
        }
    }
}

The edits() method apply to words that have a edit distance of 1, what if it was 2 or more? Like if the user typed "the", it could've been "then" or "they". So, all you have to do is create a method that generates a new set of permutations based on the already generated list of edited words and extract the known words.

void SpellChecker::edits2(Vector &result, Dictionary &candidates)
{
    for(int i=0; i<result.size(); i++)
    {
        Vector edit2;

        edits(result[i], edit2);
        known_words(edit2, candidates);
    }   
}

Display the correct word

In order to determine the correct word, the following possibilities are considered:

Check if this word is in the dictionary, if it does, display it.
Generate known words that have an edit distance of 1 and check in the dictionary, if it does, display it.
Generate known words that have an edit distance of 2 and check in the dictionary, if it does, display it.
If all else fails, this word is unique or not a known word.

string SpellChecker::correct(string &word)
{
    Vector result;
    Dictionary candidates;

    string file = "big.txt";

    //1. if it's in the dictionary, display it
    if(dictionary.find(word) != dictionary.end())
    {
        return word;
    }

    extractWords(file);

    edits(word, result);
    known_words(result, candidates);

    //2. if it's a known word but one edit away
    if(candidates.size() > 0)
    {
        return max_element(candidates.begin(), candidates.end())->first;
    }

    //3. if it's a known word but two edits away
    edits2(result, candidates);

    if(candidates.size() > 0)
    {
        return max_element(candidates.begin(), candidates.end())->first;
    }

    //4. Display nothing if it doesn't exist
    return "This word doesn't exist!";
}

However, for conditions 2 and 3, the word displayed would most likely have the highest word frequency in the dictionary.

Conclusion

Phew! I hope that wasn't a long read. Although, I've written this on C++, I plan to rewrite this on Javascript for some of my future projects.

To be honest, I don't think it's completely accurate, although, I got most of the words correct when tested.

The source code can be found on my GitHub repository.

Hope you liked reading this article!

Stay tuned for more!