Representing Words
When a semantic application processes a text, it can now replace every word in the text with its index in the vocabulary, e.g. cat is 12424, cats is 12431, dog is 15879, etc.
| Figure 1: an illustration of the one-hot vectors of cat, cats, and dog. The only non-zero entry in each vector is at the index of the word in the vocabulary. |
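To make this concrete, here is a minimal sketch of such one-hot vectors in Python (the toy vocabulary and its indices are made up for illustration):

```python
import numpy as np

# Toy vocabulary; a real vocabulary V contains tens of thousands of words.
vocab = {"cat": 0, "cats": 1, "dog": 2, "elevator": 3, "lift": 4}

def one_hot(word, vocab):
    """Return a |V|-dimensional vector whose only non-zero entry is at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

print(one_hot("cat", vocab))  # [1. 0. 0. 0. 0.]
```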
One important characteristic of a word is the company it keeps. According to the distributional hypothesis [1], words that occur in similar contexts (i.e. with the same neighboring words) tend to have similar meanings (e.g. elevator and lift will both appear next to down, up, building, floor, and stairs). Simply put, “tell me who your friends are and I will tell you who you are” – the word version.
This idea was leveraged to represent words by their contexts. Suppose that each word is again represented by a |V|-dimensional vector, where each entry corresponds to an index in the vocabulary. Now, instead of marking a word with its own index, we mark the occurrences of other words next to it. Given a large corpus, we search for all the occurrences of a certain word (e.g. elevator). We take a window of a pre-defined size k around elevator, and every word in this window is considered a neighbor of elevator. For instance, if the corpus contains the sentence “The left elevator goes down to the second floor”, then with a window of size 5 (the target word plus two words on each side) we get the following neighbors: the, left, goes, down. A larger k will also include characteristic words such as floor.
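A minimal sketch of this counting procedure, assuming a tokenized corpus given as lists of lowercased words (the function and variable names are illustrative, not from any particular library):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window_size=5):
    """Count how often every other word appears within a window of the given total
    size: the target word plus (window_size - 1) // 2 words on each side."""
    side = (window_size - 1) // 2
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            neighbors = sentence[max(0, i - side):i] + sentence[i + 1:i + 1 + side]
            for neighbor in neighbors:
                counts[word][neighbor] += 1
    return counts

corpus = [["the", "left", "elevator", "goes", "down", "to", "the", "second", "floor"]]
print(dict(cooccurrence_counts(corpus, window_size=5)["elevator"]))
# {'the': 1, 'left': 1, 'goes': 1, 'down': 1}
```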
Word embeddings to the rescue. The basic idea is to store the same contextual information in a low-dimensional vector: each word is now represented by a D-dimensional vector, where D is relatively small (typically between 50 and 1000). This approach was first presented in 2003 by Bengio et al. [2], but gained enormous popularity with word2vec [3] in 2013. Other variants of word embeddings followed, such as GloVe [4].
Instead of counting occurrences of neighboring words, the vectors are now predicted (learned). Without getting too technical, this is what these algorithms basically do: they start from a random vector for each word in the vocabulary. Then they go over a large corpus, and at every step observe a target word and its context (its neighbors within a window). The target word’s vector and the context words’ vectors are updated to bring them closer together in the vector space (and therefore increase their similarity score), while other vectors (all of them, or a sample of them) are updated to move away from the target word’s vector. After observing many such windows, the vectors become meaningful, yielding similar vectors for similar words.
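In practice these algorithms are rarely implemented from scratch; a library such as gensim performs the window scanning and vector updates described above. A minimal sketch, assuming gensim 4.x parameter names and a two-sentence stand-in for a real corpus:

```python
from gensim.models import Word2Vec

# A real corpus would contain millions of sentences; these two are placeholders.
sentences = [
    ["the", "left", "elevator", "goes", "down", "to", "the", "second", "floor"],
    ["take", "the", "lift", "up", "to", "the", "third", "floor"],
]

# sg=1 selects the skip-gram variant; window and vector_size correspond to k and D above.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

vector = model.wv["elevator"]                     # the learned 100-dimensional vector
print(model.wv.most_similar("elevator", topn=3))  # nearest neighbors (meaningless on a toy corpus)
```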
What are the advantages of word embeddings? First of all, despite the low dimensions, the information regarding word similarity is kept. Similar words still have similar vectors – this is what the algorithms aim to do while learning these vectors. Second, they are compact, so any operation on these vectors (e.g. computing similarities) is efficient and fast.
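For example, the similarity between two D-dimensional vectors is usually measured with cosine similarity, which takes time linear in D; a small numpy sketch:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: close to 1 for similar directions, near 0 for unrelated ones."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.2, 0.9, 0.1])
v = np.array([0.25, 0.8, 0.05])
print(cosine_similarity(u, v))  # ~0.99
```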
Moreover, people have done some amazing things with these vectors. Some of these things (if not all) are possible with high-dimensional vectors as well, but are computationally difficult.
- Most similar words – we can easily find the most similar words to a certain word, by finding the most similar vectors. For instance, the 5 most similar words to July are June, January, October, November, and February. We can also find the word which is most similar to a set of other words: for example, the most similar word to Israel + Lebanon + Syria + Jordan is Iraq, while the most similar word to France + Germany + Italy + Switzerland is Austria! Isn’t this cool? If that’s not impressive enough, using techniques for projecting high-dimensional vectors onto a 2-dimensional space (t-SNE, PCA), one can visualize word embeddings and get some really nice insights about what the vectors capture (see Figure 4 below).
- Analogies – following Figure 4, the authors of this paper presented some nice results regarding the vectors’ ability to solve analogy questions: a is to b as c is to d; given a, b, and c, find the missing word d. It turns out that this can be solved pretty accurately by selecting the word whose vector is similar to b and c, but not to a. This is done by vector addition and subtraction, and the most famous example is “king + woman – man = queen”, illustrated in Figure 5 below. A code sketch reproducing these similarity and analogy queries appears after the figures.
| Figure 4: two-dimensional projection of word2vec vectors of countries and their capitals, from the paper. The lines between countries and their capitals are approximately parallel, indicating that the vectors capture this relation. |
| Figure 5: an illustration of the analogical male-female relation between word pairs such as (king, queen), (man, woman) and (uncle, aunt), from the paper. |
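Both the nearest-neighbor queries and the analogy arithmetic above can be reproduced with pretrained vectors. Below is a sketch using gensim’s downloader; the model name is one of several options it offers, and the exact neighbors returned depend on the chosen pretrained model, so they may differ from the examples in the text:

```python
import gensim.downloader as api

# Downloads the pretrained vectors on first use (this particular model is over a gigabyte).
vectors = api.load("word2vec-google-news-300")

# Nearest neighbors of a single word.
print(vectors.most_similar("July", topn=5))

# Nearest neighbor of a set of words (their vectors are combined before searching).
print(vectors.most_similar(positive=["Israel", "Lebanon", "Syria", "Jordan"], topn=1))

# Analogy: king + woman - man ~ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```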
Further reading: Deep Learning, NLP, and Representations, from Christopher Olah’s blog (more technical and advanced).
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. “A neural probabilistic language model.” The Journal of Machine Learning Research, 2003.
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient estimation of word representations in vector space.” CoRR, 2013.
[4] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. “GloVe: Global Vectors for Word Representation.” EMNLP 2014.


