A guide to building AI apps and artefacts

Chapter 5 - Using AI with words and sentences

Ken Kahn, University of Oxford

Browser compatibility

This chapter of the guide includes many interactive elements that currently run best in the Chrome browser. See the troubleshooting guide if you encounter problems.

Introduction

AI programs can do many things with text. These include

  1. Answering questions (including more intelligent handling of web searches).
  2. Summarising text.
  3. Detecting the sentiment of the text (positive or negative? happy or sad? angry?).
  4. Authoring text (many sports and financial news articles are written by computers today).
  5. Determining the grammatical structure of a sentence.
  6. Translating between languages.

Doing arithmetic with words and sentences

While computers can deal with text as strings of characters, a technique called word embedding converts each word into a long list of numbers. These numbers can be created by humans, where each number has a meaning such as 'minimum size', 'maximum size', or 'average life expectancy'. Most AI programs instead use numbers created by machine learning (see the previous chapter and the next chapter). The numbers are created by processing text containing billions of words (e.g. all Wikipedia pages in a given language). People don't understand what the numbers mean, but similar words have similar numbers and unrelated words have very different numbers. Each number measures a 'feature' of the word, but what that feature is remains a mystery.

The word embeddings used in this chapter were created by Facebook. They trained their machine learning models for 157 different languages on all the Wikipedia articles in each language. Even though that was about a billion words per language, it wasn't enough, so they also trained their models on tens of billions more words found by crawling the web. For each language they created tables of at least a million different words. The blocks described here provide the 20,000 most common words for 15 languages. (Larger tables and more languages can be added. Send requests to toontalk@gmail.com.)

Turning a word into lots of numbers

We have created Snap! blocks for exploring how word embeddings can be used to find similar words, to find words that lie between other words, and, most surprisingly, to solve word analogy problems. The features of block reports a list of 300 numbers. If the language field is left empty the default language will be used. You can think of the numbers as placing the word in a 300-dimensional space. The numbers were adjusted so that all 20,000 words fit inside a 300-dimensional hypersphere with a radius of 1. There are databases with word embeddings for one million words, but loading and searching such a large data set would be very slow. The features of block is based upon the 20,000 most frequently occurring entries that are lower case (no proper nouns) and contain only letters (no punctuation or digits).
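If you would like to experiment with the same idea outside Snap!, here is a minimal Python sketch. The three-number vectors below are made-up stand-ins for the 300 numbers a real fastText table provides; only the shape of the idea matters.

```python
import math

# Stand-in word embeddings: real tables map each word to 300 numbers;
# these 3-number vectors are only to keep the example short.
embeddings = {
    "dog":   [0.70, 0.69, 0.19],
    "puppy": [0.65, 0.74, 0.17],
    "truth": [0.10, 0.20, 0.97],
}

def features_of(word):
    """Report the list of feature numbers for a word (like the 'features of' block)."""
    return embeddings[word]

print(features_of("dog"))
# Each vector is (roughly) unit length, i.e. it lies on a hypersphere of radius 1.
print(math.sqrt(sum(x * x for x in features_of("dog"))))
```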

Finding the closest word to a list of feature numbers

A program can search through all the words to find the word that is closest to a list of numbers. The closest word to reporter block does this.

Click to read an advanced topic

Different ways of measuring distances in high dimensional spaces

There are two common ways of measuring distance in a high-dimensional space. One is Euclidean distance, which is a generalisation of how distance is computed in two- and three-dimensional space. The idea is to take the sum of the squares of the differences along each of the 300 dimensions and then report the square root of that sum. The other measure is called cosine similarity. Both work pretty well in the closest word to reporter, which lets you choose which to use. While they usually agree, sometimes small differences can be observed. For example, the third closest word to "dog" can be "canine" or "puppy" depending on the measure.
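Here is a small Python sketch of the two measures, again using made-up stand-in vectors rather than the real 300-number embeddings. The closest_word_to function only illustrates what the reporter does; it is not its actual implementation.

```python
import math

# Stand-in embeddings (real entries have 300 numbers per word).
embeddings = {
    "dog":    [0.70, 0.69, 0.19],
    "puppy":  [0.65, 0.74, 0.17],
    "canine": [0.72, 0.66, 0.21],
    "truth":  [0.10, 0.20, 0.97],
}

def euclidean_distance(a, b):
    # Square root of the sum of the squared differences along each dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

def closest_word_to(vector, measure="euclidean"):
    """A brute-force stand-in for the 'closest word to' reporter."""
    if measure == "euclidean":
        return min(embeddings, key=lambda w: euclidean_distance(vector, embeddings[w]))
    return max(embeddings, key=lambda w: cosine_similarity(vector, embeddings[w]))

print(closest_word_to(embeddings["dog"]))  # 'dog' itself is closest
# Rank every word by each measure to see whether the orderings ever differ.
print(sorted(embeddings, key=lambda w: euclidean_distance(embeddings["dog"], embeddings[w])))
print(sorted(embeddings, key=lambda w: -cosine_similarity(embeddings["dog"], embeddings[w])))
```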

Finding the word half way between two other words

You can take two words and average their features by adding together corresponding numbers and dividing the result by 2. You can then use the closest word to reporter to find the word closest to the average.

Try averaging more than two words. Also see which word is closest to a point between two words other than the halfway point.
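A sketch of the averaging idea in Python, with made-up stand-in vectors; with the real 20,000-word table the halfway point between two words can land near an interestingly related third word.

```python
import math

# Stand-in embeddings (real ones have 300 numbers per word).
embeddings = {
    "dog":   [0.70, 0.69, 0.19],
    "cat":   [0.20, 0.95, 0.20],
    "puppy": [0.65, 0.74, 0.17],
    "truth": [0.10, 0.20, 0.97],
}

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_features(*words):
    """Average the feature vectors of any number of words."""
    vectors = [embeddings[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

halfway = average_features("dog", "cat")
# With this toy data, 'puppy' happens to lie closest to the point halfway between 'dog' and 'cat'.
print(min(embeddings, key=lambda w: euclidean_distance(halfway, embeddings[w])))
```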

Using word embeddings to solve word analogy problems

One of the most surprising things about word embeddings is that, with the right formula, one can solve word analogy problems. For example "man is to woman as king is to what?" can be expressed as "king+(woman-man)=x".

Note that "king is to man as woman is to what?" can be expressed as "woman+(king-man)=x". And "king+(woman-man)=x" amd "woman+(king-man)=x" are equivalent yet they solve different word analogy problems! This use of word embeddings works for grammatical analogies as well. Try solving "slow is to slower as fast is to what?". You might need to add 'fast' as an exception.

Finding all the 'closest' words

If you were to use the closest word to reporter to sort all the words by distance to a list of features, it would take about a full day, since it would have to call the reporter 20,000 times. Instead we provide the closest words to reporter, which does it all at once in less than a second (though the first time it is called it may take several seconds). Optionally it can also report the distances as cosines.

While one rarely needs all 20,000 words, it might be interesting to compare two words by seeing how many of the nearest 100 or 500 words of each they have in common. Think up other uses for this reporter.
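The speed-up comes from computing all 20,000 distances with a single array operation rather than 20,000 separate calls. Here is a NumPy sketch, using random stand-in vectors and a placeholder vocabulary.

```python
import numpy as np

# Stand-in data: 20,000 random unit vectors of 300 numbers each.
# In the real blocks these rows come from the fastText tables.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20000, 300))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
words = [f"word{i}" for i in range(20000)]  # placeholder vocabulary

def closest_words_to(target, how_many=10):
    """Report the nearest words to a feature vector, all at once."""
    distances = np.linalg.norm(vectors - target, axis=1)  # 20,000 Euclidean distances in one step
    order = np.argsort(distances)
    return [words[i] for i in order[:how_many]]

print(closest_words_to(vectors[42], 5))  # 'word42' should come first

# Comparing two words by how many of their nearest 100 neighbours they share:
common = set(closest_words_to(vectors[1], 100)) & set(closest_words_to(vectors[2], 100))
print(len(common))
```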

Drawing word embeddings

It would be nice to visualise the 300 numbers associated with a word. One way is to draw a succession of vertical lines, one for each feature.

A full screen version of this program can be found here.
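Outside Snap! the same kind of picture can be drawn with matplotlib. The random vector below is only a stand-in for a real word's 300 feature numbers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the 300 feature numbers of a word.
rng = np.random.default_rng(1)
features = rng.normal(scale=0.1, size=300)

# One vertical line per feature: x is the feature index, the line's height is the feature's value.
plt.vlines(range(len(features)), 0, features, linewidth=1)
plt.xlabel("feature number")
plt.ylabel("feature value")
plt.title("A word embedding drawn as 300 vertical lines")
plt.show()
```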

Mapping 300 dimensional points to two-dimensional points

No one can visualise 300-dimensional space. There are, however, techniques for giving an impression of the relationships between very high-dimensional points by mapping the points to two or three dimensions. We use a technique called t-SNE. It can be understood as a physics simulation where points in crowded areas repel each other, while points that are a small distance apart (in the high-dimensional space) attract each other. This data projector displays all 20,000 English words in two or three dimensions using either t-SNE or PCA (principal component analysis). You can also use the projector to see the word embeddings of these languages: German, Greek, Spanish, French, Finnish, Hindi, Indonesian, Italian, Japanese, Lithuanian, Portuguese, Sinhalese, Swedish, and Chinese.

Note that it takes several hundred iterations of t-SNE before it settles down on a good mapping from 300 dimensions. You can also search for words and their neighbours and create bookmarks. The above links launch the projector with a bookmark showing t-SNE and highlighting the hundred words closest to 'dog'.

In the lower right corner you can select projector bookmarks.

Here is a program that displays 50 random words at the location generated by t-SNE.
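If you want to try the projection itself, scikit-learn's TSNE class does the same kind of mapping. The sketch below uses 50 random stand-in vectors and a placeholder vocabulary instead of real word embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in embeddings: 50 random 300-dimensional vectors instead of 50 real words.
rng = np.random.default_rng(2)
vectors = rng.normal(size=(50, 300))
words = [f"word{i}" for i in range(50)]  # placeholder vocabulary

# Map the 300-dimensional points down to 2 dimensions.
points_2d = TSNE(n_components=2, perplexity=10, init="pca", random_state=0).fit_transform(vectors)

for word, (x, y) in zip(words, points_2d):
    print(word, round(float(x), 2), round(float(y), 2))
```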

Word embeddings can do translations

What would happen if you took, for example, the features of the English word 'dog' and asked for the closest word in, say, French? Try this with different source and target languages and different words. Compare the results with Google Translate. Tip: it is easy to copy and paste words that your keyboard can't type from the Google Translate page. Or you can use the input method editor supported by the operating system of your device. Supported languages are Chinese, English, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Lithuanian, Portuguese, Sinhalese, Spanish, and Swedish.

Note that this version of closest word to offers a choice of how to measure the distance between two vectors. Euclidean distance is the familiar 2D distance measure generalised to 300 dimensions. Cosine similarity is similar but preferred by experts. See if the choice makes a difference in which words are closest to the untranslated word.
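Here is a Python sketch of the lookup, assuming (as discussed later in this chapter) that the two languages' embedding tables have already been aligned so translations end up near each other. The tiny tables below are invented stand-ins.

```python
import math

# Stand-in tables: after alignment, translations end up near each other.
# Real tables would have 20,000 entries of 300 numbers per language.
english = {"dog": [0.70, 0.69, 0.19], "truth": [0.10, 0.20, 0.97]}
french  = {"chien": [0.69, 0.70, 0.20], "vérité": [0.11, 0.19, 0.97], "chat": [0.20, 0.95, 0.20]}

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def translate(word, source, target):
    """Take the word's features from one language and find the closest word in another."""
    features = source[word]
    return min(target, key=lambda w: euclidean_distance(features, target[w]))

print(translate("dog", english, french))  # 'chien'
```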

It is possible to add word embeddings for more languages. The process is documented here.

A 'Guess My Word' game using word embeddings

The following game picks a random word and gives the player 'warmer' or 'colder' feedback as the player makes guesses. It does so by comparing the distance from the current guess to the secret word with the distance of the previous guess. It uses the location of ... reporter block to display your guesses. The game is very hard! There are many ways to make the game better. See if you can!
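Here is a rough text-only sketch of the game in Python, using made-up stand-in vectors; the Snap! version also displays the guesses at their t-SNE locations.

```python
import math
import random

# Stand-in embeddings; the real game uses the 20,000-word table for a language.
embeddings = {
    "dog":   [0.70, 0.69, 0.19],
    "puppy": [0.65, 0.74, 0.17],
    "cat":   [0.20, 0.95, 0.20],
    "truth": [0.10, 0.20, 0.97],
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

secret = random.choice(list(embeddings))
previous = None
while True:
    guess = input("Guess my word: ").strip()
    if guess == secret:
        print("You got it!")
        break
    if guess not in embeddings:
        print("I don't know that word, try another.")
        continue
    current = distance(embeddings[guess], embeddings[secret])
    if previous is None:
        print("First guess recorded.")
    elif current < previous:
        print("Warmer!")   # closer to the secret word than the previous guess
    else:
        print("Colder!")
    previous = current
```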

Benefits and risks of using word embeddings

Word embeddings can be used as a component in AI programs that do sentiment analysis, entity detection, recommendations, text summarisation, translation, and question answering. This is typically done by replacing the words in a text with their embeddings and then doing machine learning on the approximate meaning of the words. This makes the systems work better with synonyms and paraphrasings.

Word embeddings are learned by examining text with billions of words. These texts may have captured societal biases. For example, the following example seems to have captured the bias that butchers are male and bakers female. But the bias is so weak that if cosine similarity is used instead of Euclidean distance the unbiased "chef" is found. Some word embedding databases have the bias that doctors are male and nurses female. They will answer the question "man is to doctor as woman is to X" with "nurse". Is this a bias? Or might it be due to the fact that only women can nurse babies? Run some experiments below to explore these kinds of questions.

A paper called Semantics derived automatically from language corpora contain human-like biases proposed a way to measure word biases. The idea is to compare the average distances a word has to two sets of "attribute" words. To explore gender bias, for example, the attribute word lists can be "male, man, boy" and "female, woman, girl". The difference of the average distances provides a score that can be used to compare words. In this implementation of the scoring reporter one can see that "mathematics" has a higher "maleness" score than "art". And "art" has a higher "pleasantness" score than "mathematics".
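Here is a sketch of the scoring idea in Python. The vectors are invented so that 'mathematics' leans towards the male attribute words and 'art' towards the female ones; real scores require the full embedding table, and the actual scoring reporter may differ in detail.

```python
import math

# Stand-in embeddings; real scores need the full 300-number table.
embeddings = {
    "mathematics": [0.80, 0.30, 0.40],
    "art":         [0.30, 0.80, 0.40],
    "male":        [0.85, 0.25, 0.35],
    "man":         [0.83, 0.28, 0.36],
    "boy":         [0.82, 0.30, 0.34],
    "female":      [0.28, 0.84, 0.37],
    "woman":       [0.30, 0.82, 0.38],
    "girl":        [0.27, 0.83, 0.36],
}

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_distance(word, attribute_words):
    return sum(distance(embeddings[word], embeddings[a]) for a in attribute_words) / len(attribute_words)

def bias_score(word, attributes_a, attributes_b):
    """Positive means the word is closer to the first attribute set, negative means the second."""
    return average_distance(word, attributes_b) - average_distance(word, attributes_a)

male_words = ["male", "man", "boy"]
female_words = ["female", "woman", "girl"]
print(bias_score("mathematics", male_words, female_words))  # positive: leans 'male' in this toy data
print(bias_score("art", male_words, female_words))          # negative: leans 'female'
```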

How does this work?

While we don't really know what the numbers mean, they must be encoding lots of things about words, such as gender, grammatical category, family relationships, and hundreds more. But the numbers aren't perfect. See if you can create some examples where the results are not good. One known problem with how the numbers are generated is that they combine features of different senses of the same word. There is only one entry, for example, for 'bank', which combines the ways that word is used in sentences about financial institutions and those about the sides of rivers. This can cause words to be closer than they should be. For example, "rat" and "screen" end up closer together than they otherwise would be because "rat" is close to "mouse" and "mouse" (the computer input device) is close to (computer) "screen". This is a problem researchers are working on. Another problem is that sometimes short phrases act like words. "Ice cream", for example, has no word embedding while "sherbet" and "sorbet" do.

A sample project using word embeddings for translation

Here is a program that asks the user for two languages, obtains the feature vector of a random word from the first language, and then displays several of the words closest to that feature vector. It places the words in the t-SNE two-dimensional approximation of where the 300-dimensional words really are.

How does translation using word embeddings work?

The word embeddings for each language were generated independently, based upon text from Wikipedia and the web. The location in 300-dimensional space of the features of a word like 'dog' therefore has no relationship to the location of the features of translations of the word 'dog'. Researchers noticed, however, that in most (all?) languages certain words are close together. For example 'dog', 'dogs', 'puppy', and 'canine' are close. Words like 'wolf', 'cow', and 'mouse' are close to these but not as close. And all of these words are far from abstract words like 'truth' and 'logic'. Researchers discovered that it is possible to find a rotation that brings many word embeddings in one language close to the embeddings of their translations in another. The way it was done at first, and in these Snap! blocks, is by giving a program a word list pairing English words with their translations in each other language. 500 words is enough to find a good rotation that brings most of the other 19,500 words close to where their translations are. While it is impressive that translation works at all given a word list that covers only 2.5% of the vocabulary, Word Translation Without Parallel Data describes a technique that uses no word lists or translated texts at all. A rotation is all that is needed because all the word embeddings are centred around zero, so they don't need to be translated (in the mathematical sense, i.e. moved) as well. But note that the translation happens in 299 dimensions!

Two ways of aligning word embeddings in different languages

In the figure, (A) and (B) show X being rotated to match Y so that a small number of words in X align with their translations in Y. Many other words become roughly aligned as a result. Other techniques can then be applied to improve the alignment.
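For readers comfortable with linear algebra, here is a NumPy sketch of finding such a rotation from a small seed dictionary, using the classic orthogonal Procrustes solution. The data is randomly generated so that a perfect alignment exists; this illustrates the idea and is not necessarily the exact procedure used by the blocks.

```python
import numpy as np

# Stand-in data: 500 seed words in language Y, and the 'same' words in language X
# made by rotating them, so a perfect alignment is known to exist.  Real use would
# load two fastText tables and a ~500-pair translation dictionary.
rng = np.random.default_rng(3)
target = rng.normal(size=(500, 300))                      # seed-word vectors in language Y
rotation = np.linalg.qr(rng.normal(size=(300, 300)))[0]   # an orthogonal matrix (the hidden alignment)
source = target @ rotation.T                              # the same seed words as seen in language X

# Orthogonal Procrustes: find the rotation W that best maps source onto target.
u, _, vt = np.linalg.svd(source.T @ target)
w = u @ vt

aligned = source @ w
print(np.allclose(aligned, target, atol=1e-6))  # True: the seed words line up again
```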

Image embeddings are possible as well

Using a technique similar to how vectors are generated for words, we can also generate vectors for images. The get costume features of ... block will pass a vector of 1280 numbers to the blocks provided. It uses MobileNet to compute the numbers from the "top" of the neural net.

One can use image embeddings to determine which images are close to other images. Closeness takes into account many factors including texture, colour, parts, and semantics. Image embeddings can be used to work out image analogy problems similar to how word analogy problems are solved.

In the machine learning chapter there is a description of the train with image buckets ... block, which is used for training. It works by collecting the feature vectors of all the training images and then finding the nearest neighbours of a test image to determine what label to give it.
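Here is a Python sketch of that nearest-neighbour idea, with random stand-in vectors in place of the 1280-number MobileNet features that the real block works with.

```python
import numpy as np

# Stand-in training data: two labelled clusters of fake 1280-number feature vectors.
rng = np.random.default_rng(4)
training = {
    "cat": rng.normal(loc=0.0, size=(20, 1280)),   # 20 training images labelled 'cat'
    "dog": rng.normal(loc=0.5, size=(20, 1280)),   # 20 training images labelled 'dog'
}

def classify(features, k=5):
    """Label a test image by the majority label among its k nearest training images."""
    labelled = [(np.linalg.norm(vector - features), label)
                for label, vectors in training.items() for vector in vectors]
    nearest = sorted(labelled, key=lambda pair: pair[0])[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

test_image = rng.normal(loc=0.5, size=1280)  # resembles the 'dog' cluster
print(classify(test_image))
```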

Possible project ideas using word embeddings

Here are some project ideas:

  1. Try using word embeddings to explore the similarity of sentences. One idea is to average the vectors of all the words in the sentence. This is called the bag of words technique since it ignores the order of the words, just as if they were put in a bag. (A sketch of this idea appears after this list.)
  2. Find a chain of similar words by finding the nearest word to the starting word. Then repeatedly find the nearest word to that while making sure to never repeat the same word. Use this to repeatedly change random words one at a time in a famous poem or text (e.g. "roses are red and violets are blue").
  3. Make word games using word embeddings. For example, something like Semantris, a semantic version of Tetris. A bilingual version of Semantris might be a good idea.
  4. Create a program that searches for new word analogies. Hint: If A is to B as C is to D then A-B "is close to" C-D.
  5. Explore why sometimes word analogies are right and sometimes wrong. Does the second, third, or tenth closest answer make more sense? Hint: use the closest words to reporter to explore this. Is it better at word analogies when A is close to B in "A is to B as C is to D"?
  6. Researchers have found that if you look at the average distance between pleasant words and flowers the distance is much smaller than the distance to unpleasant words. The opposite holds if you replace words about flowers with words about insects. Based upon this observation people have made other comparisons to see how, for example, words about males are closer to words about science while words about females are closer to art words. See if you can find other biases that arise from the way people write about things. If you know another language (and it is one of the 15 supported languages) see if it applies across languages.
  7. Be creative! Word embeddings are new and there is much that remains to discover.
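Here is the sketch of the bag-of-words idea promised in project idea 1, with made-up stand-in vectors; a real version would use the 300-number embeddings and would need to decide what to do with words that have no embedding.

```python
import math

# Stand-in embeddings; real sentences would use the 300-number fastText vectors.
embeddings = {
    "the": [0.5, 0.5, 0.5], "dog": [0.7, 0.7, 0.2], "puppy": [0.65, 0.74, 0.17],
    "sleeps": [0.3, 0.4, 0.8], "naps": [0.32, 0.38, 0.82], "truth": [0.1, 0.2, 0.97],
}

def sentence_vector(sentence):
    """'Bag of words': average the word vectors, ignoring word order."""
    vectors = [embeddings[w] for w in sentence.lower().split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# The first pair of sentences should score closer to 1 than the second pair.
print(cosine_similarity(sentence_vector("the dog sleeps"), sentence_vector("the puppy naps")))
print(cosine_similarity(sentence_vector("the dog sleeps"), sentence_vector("the truth")))
```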

Future directions for this chapter

New word embeddings blocks could be added. Currently the word embeddings blocks exclude all proper nouns; if we added them, one could solve analogies such as "Paris is to France as Berlin is to X". Exploring how words change over time can lead to great projects: word embeddings generated from publications in different time periods could be used to see how words like "awful" and "broadcast" have changed over the last two centuries. New blocks could also be added based upon research on generating "word sense" embeddings instead of word embeddings. E.g. one sense of "duck" is close to "chicken" while another sense is close to "jump".

There is plenty more that AI programs can do with language including determining the grammatical structure of sentences (this is called "parsing"), figuring out the sentiment in some text, and question answering. We plan to add more.

Additional resources

Wikipedia's word embedding article is short and written for an advanced audience. The Facebook team wrote a paper detailing how they generated the word embeddings used here: E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages. The Understanding word vectors web page has a very good introduction to the subject and contains examples that are helpful but require familiarity with Python. This Google blog about biases in word embeddings is very good and clear. Google's Steering the right course for AI discusses bias along with other societal issues including interpretability, jobs, and doing good. Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor discusses biases in word embeddings in depth. How to Use t-SNE Effectively is a clear interactive description of how t-SNE works. Exploiting Similarities among Languages for Machine Translation pioneered the idea of adjusting word embeddings to support translation. Word Translation Without Parallel Data explores how word embeddings can be used for translation without using word lists or translated texts. projector.tensorflow.org is a great website for interactively exploring different ways of visualising high-dimensional spaces. Here is a video of a nice talk by Laurens van der Maaten, who invented the idea of t-SNE.

Learn about making and training neural nets

Go to the next chapter on neural nets

Return to the previous chapter on machine learning.