Adding more languages to word embeddings

How to add word embeddings to more languages

First we download a language file from Facebook's fastText site. After unzipping it, we run write_vectors to select the N most common words that are "good" (typically lowercase and containing only letters). We also run write_js, which selects the same words but writes them to a file in JavaScript syntax. Both write_vectors and write_js are defined in this Python Jupyter file.
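The selection step can be sketched roughly as follows. This is not the write_vectors from the Jupyter file; select_common_words and is_good_word are hypothetical names, and the "good" test (letters only, no uppercase) is an assumption based on the description above. It relies on the fastText text format: a header line with the word count and dimension, then one word per line followed by its vector, most frequent words first.

```python
def is_good_word(word):
    # Assumed filter: letters only and unchanged by lowercasing.
    # Scripts that use combining marks may need a looser test.
    return word.isalpha() and word == word.lower()


def select_common_words(lines, n, is_good=is_good_word):
    """Return the first n (word, vector) pairs whose word passes is_good.

    lines is an iterable of text lines in fastText .vec format:
    "<word count> <dimension>" followed by "<word> <v1> ... <vd>" lines.
    """
    lines = iter(lines)
    dim = int(next(lines).split()[1])
    selected = []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines or tokens containing spaces
        word = parts[0]
        if is_good(word):
            selected.append((word, [float(x) for x in parts[1:]]))
            if len(selected) == n:
                break
    return selected


# Tiny in-memory stand-in for an unzipped .vec file.
sample = ["5 2", "the 0.1 0.2 ", "The 0.3 0.4", ", 0.5 0.6",
          "cat 0.7 0.8", "dog 0.9 1.0"]
for word, vector in select_common_words(sample, 2):
    print(word, vector)
```

For a real file, pass an open file object (opened with encoding="utf-8") instead of the sample list; because the file is sorted by frequency, the loop stops as soon as N good words have been found.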

Then, to generate the translation matrix, one needs the vectors file produced by write_vectors as well as a vocabulary list. A list of roughly 500 words seems to work fine. We used 1000mostcommonwords.com and this JavaScript script to turn a vocabulary page into the correct format and length. The Jupyter file defines translate, which seems to work fine if the warnings are ignored. Next, download translate.html to the folder with the generated files and open it. It will generate a rotated version of the JavaScript file of word embeddings, which needs to be saved (after removing the tiny bit of HTML at the top and bottom of the file).
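One common way to compute such a matrix is the orthogonal Procrustes solution, sketched below with NumPy. Whether translate in the Jupyter file works this way is an assumption; the mention of a "rotated" embedding file only suggests that the matrix is a rotation. The name rotation_between is hypothetical, and the demo data is synthetic.

```python
import numpy as np


def rotation_between(source, target):
    """Orthogonal matrix W minimizing ||source @ W - target|| (Frobenius norm).

    source and target are (n_pairs, dim) arrays; row i of each holds the
    vectors of the two words in vocabulary pair i.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Synthetic check: rotate random vectors by a known orthogonal matrix
# and see whether the same matrix is recovered from 500 pairs.
rng = np.random.default_rng(0)
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
source = rng.normal(size=(500, 4))
target = source @ true_rotation
w = rotation_between(source, target)
print(np.allclose(w, true_rotation))
```

Multiplying every vector of the new language by W then places it in the same space as the reference language, which is the role the rotated JavaScript file plays.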

Finally, to generate 2D locations for the words, download tsne.html to the folder with the generated files and open it. After several minutes a text area containing the locations will appear. Your computer may be sluggish while this is running. Save the generated locations in a file named word-locations.js. To view the word embeddings using TensorFlow's projector, copy projector.html to the language's folder and open it. Save the results in two files and host them on the web along with the projector.json file described on the Projector page.
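For orientation, the Embedding Projector reads tab-separated files: one with a vector per line and one with a label per line in the same order. The sketch below writes such a pair from Python. It assumes that the two files produced by projector.html are of this kind, which the text above does not state; write_projector_files and the file names are hypothetical.

```python
import os
import tempfile


def write_projector_files(words, vectors, vectors_path, metadata_path):
    """Write one tab-separated vector per line and one word per line.

    A single-column metadata file has no header row.
    """
    with open(vectors_path, "w", encoding="utf-8") as vectors_file, \
         open(metadata_path, "w", encoding="utf-8") as metadata_file:
        for word, vector in zip(words, vectors):
            vectors_file.write("\t".join(str(x) for x in vector) + "\n")
            metadata_file.write(word + "\n")


# Demo with two made-up words, written to a temporary folder.
folder = tempfile.mkdtemp()
write_projector_files(
    ["cat", "dog"],
    [[0.5, 1.0], [2.0, 3.0]],
    os.path.join(folder, "vectors.tsv"),
    os.path.join(folder, "metadata.tsv"),
)
print(sorted(os.listdir(folder)))
```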

To use another language, the block called "wait for word embeddings ..." can optionally take a URL from which to load that language.