Building a personal predictive text engine

Published under the Coding category.

I write a lot of technical documentation, which involves using long(er) words like “augmentation” and harder-to-type-correctly words like YOLOv5 (the name of a computer vision model, the “YOLO” part of which stands for You Only Look Once). I also happen to write blog posts where words like “documentation” are not uncommon. Herein lies an idea: what if my computer could help me type the next word?

There is prior art for this. For years, Apple has had an auto-suggest feature available in iOS. Three words appear above your keyboard when you are typing. If you make a typo, Apple chooses the middle of the three words and fixes the error. Sometimes, it can complete typing a word when you press the “space” key.

For some, the Apple auto-suggest behaviour is frustrating, since the phone is making estimations about what you want to type, and sometimes the phone is wrong. But what if I could have such a system trained on my own writing: my public documentation from work and the personal blog posts published on this website? Going further: if I had such a system, how would it be designed?

This system has an important difference from GPT-based systems: I only want it to predict the next word or two, rather than try to suggest a whole sentence. Such a system is more about typing faster than it is about helping me write sentences. I want a system that helps me get the words I already know I want to type onto the digital page faster than it takes to type the whole word.

I made such a system. Here is a quick (work-in-progress) demo video:

I called my system AutoWrite. In my mind, AutoWrite is similar to “auto tune”. With auto tune, you already know what you want to say, but you want to tune it to the right key. With AutoWrite, you know what you want to say, but the tool helps you write the words faster.

The beginning of a project: Word probabilities

One of my favourite concepts in natural language processing is “surprisal.” Surprisal measures how “surprising” a word is given a corpus of text. That corpus could be a blog post or all of the words you have written. Surprisal is calculated as such (in Python):

probabilities[word] = counts[word] / len(words)

surprisals[word] = -math.log(probabilities[word])

First, you calculate the probability of a word appearing. This is measured as the number of times it appears divided by the number of words in the text. Then, you take the logarithm of that probability and negate it. You do this for every word in a corpus.
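Put together, a minimal sketch of the whole calculation (my own code, not necessarily what the library below does; `words` is assumed to be an already-tokenised list):

```python
import math
from collections import Counter

def surprisals(words):
    """Compute the surprisal of every unique word in a corpus.

    Case is preserved, so "IndieWeb" and "indieweb" are counted
    separately.
    """
    counts = Counter(words)
    total = len(words)
    return {word: -math.log(count / total) for word, count in counts.items()}

corpus = "the cat and the dog and the bird".split()
scores = surprisals(corpus)
# "the" appears three times out of eight words, so it is the
# least surprising word in this tiny corpus
assert min(scores, key=scores.get) == "the"
```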

The lower the surprisal, the more likely it is that a word comes up in a text. You can thus expect words like “and” to be unsurprising and words like “integrated” to be more surprising. Typos will be more surprising still, since they are likely to appear very few times, or only once.

Surprisals alone can’t predict words, but they can tell you how likely it is that a word comes up. Unlike modern neural networks, surprisals are calculated for a single word, rather than a word in context. You can calculate how surprising bigrams are, though, which you could use as a feature in predicting the word after the one you are writing. Yes, this will overfit with a small sample size, or a corpus that only covers one subject. But that might be what you want! If you mostly write “computer vision” together, the computer suggesting “computer vision” might not be such a bad thing.
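As a sketch, the same counting trick extends to bigrams (again my own code, not the project's):

```python
import math
from collections import Counter

def bigram_surprisals(words):
    """Surprisal of each adjacent word pair in a corpus."""
    pairs = list(zip(words, words[1:]))
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: -math.log(count / total) for pair, count in counts.items()}

corpus = "computer vision models need computer vision data".split()
pair_scores = bigram_surprisals(corpus)
# "computer vision" appears twice, so it is the least surprising pair;
# after typing "computer", "vision" is a strong candidate
assert min(pair_scores, key=pair_scores.get) == ("computer", "vision")
```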

I decided to calculate case-sensitive surprisals. This ensures that acronyms are preserved.

I want a system where if I write:

Ind

The computer would suggest:

IndieWeb

I use the word “IndieWeb” a lot in my writing so this would be a neat shortcut.

How do we build this? We need another component: the trusty trie.

Unrelated but technically curious note: surprisals have an interesting feature: you can compare the distribution of surprisals from all words in your writing with the distribution from a single blog post using a metric like KL Divergence. I calculated the KL Divergence of every post on my blog against the distribution from all words. I was able to identify the interview posts that other people had written for this blog by looking for the posts with the highest KL Divergence.
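A sketch of that comparison (my own code; the smoothing constant is my choice, added so the logarithm is always defined when a word is missing from one of the texts):

```python
import math
from collections import Counter

def distribution(words, vocab, eps=1e-9):
    """Smoothed probability of each vocabulary word in a text."""
    counts = Counter(words)
    total = len(words) + eps * len(vocab)
    return {word: (counts[word] + eps) / total for word in vocab}

def kl_divergence(p, q):
    """KL(p || q): how much p diverges from the reference q."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

site_words = "the cat sat on the mat".split()
post_words = "the dog sat down".split()
vocab = set(site_words) | set(post_words)
p = distribution(post_words, vocab)
q = distribution(site_words, vocab)
# A post identical to the reference has zero divergence; this one differs
assert kl_divergence(q, q) < 1e-12
assert kl_divergence(p, q) > kl_divergence(q, q)
```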

The trusty trie: Predicting next words

I stored surprisals in a dictionary that maps each word to its surprisal. This data structure is not appropriate for predicting next words, though. Instead, we need a trie. A trie is a nested tree data structure commonly used in predictive text. With a trie, you can search a tree character by character.

Here is an example of a trie that represents the word “and”:


{"a": { "n": { "d": 1 }}}

Where 1 is how “surprising” the word is.

If I had “ad” in my list, my trie would be:


{"a": { "n": { "d": 1 }, "d": 1 }}

Given a search query “an”, I would search for “a” and then the “n” key in the “a” tree. I could then traverse the sub-trees to get all results. In this case, the only option for “an” would be “and”. But there could be “ante” or “antler”, etc.

I turned my surprisals dictionary into a trie, where the keys are the letters of each word in the corpus and the value at the end of each path is the word’s surprisal.
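A minimal sketch of that conversion (my own code; I add a "$" end-of-word key, which the simplified examples above omit, so that a word and a longer word sharing its prefix can coexist):

```python
def build_trie(surprisals):
    """Nest each word character by character; a "$" key marks the end
    of a complete word and stores its surprisal."""
    trie = {}
    for word, surprisal in surprisals.items():
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node["$"] = surprisal
    return trie

trie = build_trie({"and": 1.0, "ad": 2.0, "ante": 3.0})
assert trie["a"]["n"]["d"]["$"] == 1.0
assert trie["a"]["d"]["$"] == 2.0
```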

Consider the query “Ind”. I would look up “Ind” in my trie, then get all words that start with that sequence of letters. I then order them by their surprisals, from lowest to highest. The lower the surprisal, the more likely the word is.

I end up with an autocomplete feature that is aware of my words. Given “Ind”, my system predicts I want to say “IndieWeb”. Given “micr”, the system predicts I want to say “microformats”.
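The lookup itself can be sketched like this (my own code, not the project's; a "$" key marks the end of a word and stores its surprisal, and the example values are made up):

```python
def complete(trie, prefix):
    """Return (word, surprisal) pairs for every word under `prefix`,
    most likely (lowest surprisal) first."""
    node = trie
    for char in prefix:
        if char not in node:
            return []
        node = node[char]
    results = []

    def walk(node, word):
        for key, value in node.items():
            if key == "$":  # a complete word ends here; value is its surprisal
                results.append((word, value))
            else:
                walk(value, word + key)

    walk(node, prefix)
    return sorted(results, key=lambda pair: pair[1])

# A tiny trie holding "IndieWeb" (surprisal 2.1) and "Index" (4.7)
trie = {"I": {"n": {"d": {"i": {"e": {"W": {"e": {"b": {"$": 2.1}}}}},
                          "e": {"x": {"$": 4.7}}}}}}
assert [word for word, _ in complete(trie, "Ind")] == ["IndieWeb", "Index"]
```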

The user experience

I decided to build a web application that uses my code. The web application has two parts:

  • The server, which can take a sequence of characters and calculate the most probable next word, and;
  • A client, in which you can write text and requests are made to the server to retrieve probable words.

When you press the tab key in the client, the word auto-completes. This is similar to the mechanism by which you complete code in Visual Studio Code. On the tab key, Visual Studio Code knows you want to follow an autocomplete suggestion (or a Copilot suggestion, depending on whether you have Copilot installed).

The front-end uses a contenteditable div as the text input area. There is a span tag that appears at the end of the div and shows the next predicted letters in the word you are writing. When you press tab, the suggestion is accepted and you can start typing a new word.

When this runs locally, the autocomplete suggestions are almost instant. I put this application on a server so friends could play around with it, and performance was a bit slower. Not frustratingly slow, but based on what I observed I suspect some responses arrived out of order and overrode the most recent (and thus most accurate) prediction.

As I was experimenting with this, I wanted to see if I could autocomplete on space. This is valuable for two reasons:

  1. I don’t have to reach up to the tab key, which I don’t use regularly in my (non-code) writing, and;
  2. Mobile phones don’t have a tab key and the space would be more intuitive.

Therein lies a problem. Space also has another common use: adding a space. How would I differentiate between the two? I built this feature and limited it so that autocomplete would only fire if the sequence of letters you had typed so far was not already a word in my dictionary. This helped eliminate a lot of false positives, but there were still a few. The video below shows me typing on my phone:
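That rule can be sketched as a small guard (a sketch of the idea only; `known_words` and `predict` stand in for the real dictionary and trie lookup):

```python
def on_space(fragment, known_words, predict):
    """Decide what pressing space should do.

    If the fragment is already a complete known word, just add a space;
    otherwise, try to autocomplete it before adding the space.
    """
    if fragment in known_words:
        return fragment + " "
    suggestion = predict(fragment)
    if suggestion:
        return suggestion + " "
    return fragment + " "

known = {"and", "the", "computer"}
predict = lambda fragment: "IndieWeb" if fragment == "Ind" else None
assert on_space("and", known, predict) == "and "       # a real word: no completion
assert on_space("Ind", known, predict) == "IndieWeb "  # a partial word: completed
```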

This video shows the Apple autocomplete feature working when blue lines appear under words. My feature is working when words autocomplete that appear in light text. Halfway through, a word was autocompleted that I didn’t want to be autocompleted. This is an active problem I am trying to solve. How do I know when I want to accept a prediction on mobile? Do I need some kind of classifier? If any readers know of any directions I should explore (or want to talk about Taylor Swift), please email me at readers [at] jamesg [dot] blog.

AutoWrite also takes into account words you have already written in a document, boosting them. In the demo above, “AutoWrite” autocompleted after its first use because I had already defined it. This means that AutoWrite can be globally aware and context specific.

Apple is also experimenting with inline word prediction in iOS, but I want something trained on my words. On macOS, I would love an operating-system-level API that I could program with my system. I would ideally like AutoWrite in the tools I already use, rather than in a new tool.

Source code

My source code for calculating surprisals is available on GitHub. You will need to build from source because there is a bug in the installation process for the latest version. You can download the code like this:


git clone https://github.com/capjamesg/pysurprisal
cd pysurprisal
pip install -e .

You can see my auto-write code in my AutoWrite repository. I have not yet written documentation; this is an active hobby project.
