Humans understand words; machines work with numbers. In the world of text analytics, we must somehow convert our human words into numbers a computer can understand.
Of course, it is not enough to assign each word some sort of key value. One number cannot encompass the meaning and application of a word any more than your social insurance number conveys information about your eye color or your favorite sport. One approach is to define a multidimensional abstract space in which each individual characteristic of a word is represented by some “distance” in that direction. Each word is then represented by a vector in this abstract space. There are many possible ways that such an abstract vector space can be defined, and just as assuredly there is no abstract space that represents a word perfectly.
In practice, we most often look for a word embedding. Unfortunately, this phrase is not always used consistently. Some authors observe that words are “defined” for the computer by describing the myriad of ways in which they are “embedded” among other words in phrases, sentences, and documents. A more mathematical definition would be that words are converted into vectors that are “embedded” in a vector space of lower dimensionality than the space of all uses of all words. These word vector embeddings fulfill our requirement of being a numerical word representation that computers can work with. Embeddings, of course, are not unique; there are many possible embeddings to choose from, and each has its own strengths and weaknesses. Embeddings can be broadly classified into two groups, frequency-based embeddings which essentially provide word counts, and prediction-based embeddings. Many classic methods of text analysis are frequency-based embeddings. Prediction embeddings are sometimes called “neural word embeddings” since they use simple neural networks to organize the text data for further analysis.
One very popular, and often quite effective, prediction-based embedding is word2vec, which is actually not so much a single method as a cluster of closely related algorithms. Word2vec uses a neural network, simple by today’s standards, to group words based on their “similarity”. One example, which has become a sort of “hello, world” for text analytics, is the pair of words “king” and “queen”. Clearly kings and queens have distinct features, but the words often appear in similar contexts within a document.
In text analytics, as in any engineering enterprise, there are always tradeoffs. We can improve the effectiveness of a model by training it with a greater amount of text, increasing the number of dimensions in our abstract vector space, and increasing the number of words we consider a “context” within a document. All of these choices come at the cost of increased computational complexity (and therefore increased time). Let’s consider some examples.
Word2vec uses a shallow two-layer neural network to learn the contexts of words in a document or set of documents. Neural networks are, therefore, now being used to learn how to feed information to bigger, more powerful neural networks.
CBOW (continuous bag-of-words) and skip-grams attack word embedding from opposite directions. CBOW algorithms attempt to develop a neural network model that predicts the occurrence of a word based on the surrounding context. Skip-grams, on the other hand, attempt to predict the context based on individual words.
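The difference is easiest to see in the training pairs each variant generates. The helper below is a hypothetical illustration (not Gensim code) that builds the two kinds of pairs for a window of one word on each side:

```python
def training_pairs(tokens, window=1):
    """Build (context, target) pairs for CBOW and (word, context) pairs for skip-gram."""
    cbow, skipgram = [], []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, word))       # CBOW: the context predicts the word
        for c in context:
            skipgram.append((word, c))     # skip-gram: the word predicts each context word
    return cbow, skipgram

cbow, skipgram = training_pairs(["the", "queen", "rules"])
print(cbow[1])       # (['the', 'rules'], 'queen')
print(skipgram[:2])  # [('the', 'queen'), ('queen', 'the')]
```

The same text thus yields two mirror-image prediction tasks, which is why the two algorithms are usually presented together.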
Remember, however, that this is not a parlour game in which the goal is to predict the missing word. It is an attempt to generate vectors that in some way encapsulate the use and meaning of words.
Many fields of artificial intelligence grapple with the necessity of providing a mathematical description for the somewhat vague notion of similarity. Your movie streaming service wants to quantify films that are similar to the ones you have already enjoyed. An autonomous vehicle needs to infer if objects in a video are similar to children in a school crosswalk.
In the world of text analytics a common measure of similarity is cosine similarity. Cosine similarity is closely related to the Pearson correlation coefficient of classic statistics. If two words point in the same direction in our abstract vector space, the angle between their vectors is small and its cosine is close to one. If one word is straight ahead of us and the other is off our left shoulder, then the angle between the vectors is close to 90° and the cosine is close to zero.
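The formula itself is just the dot product of the two vectors divided by the product of their lengths. A minimal implementation in plain Python:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors: angle 0, cosine 1
print(cosine_similarity([1, 2], [2, 4]))  # 1.0
# Perpendicular vectors: angle 90 degrees, cosine 0
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Note that cosine similarity ignores vector length entirely, which is usually what we want: how often a word appears should not swamp how it is used.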
The definition of cosine similarity provides a direct path to the prediction of word analogies. For the text analytics “hello world” example, we would expect the cosine similarity between “male” and “king” to be close to the similarity between “female” and “queen”.
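With hand-made toy vectors (entirely illustrative; real embeddings are learned from text, not assigned by hand) the analogy becomes simple vector arithmetic:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-D embeddings: axis 0 = "royalty", axis 1 = "gender" (hand-assigned)
male   = [0.0,  1.0]
female = [0.0, -1.0]
king   = [1.0,  1.0]
queen  = [1.0, -1.0]

# The pairwise similarities match, as the text predicts
print(cosine(male, king), cosine(female, queen))  # both about 0.707

# And the classic arithmetic king - male + female lands on queen
analogy = [k - m + f for k, m, f in zip(king, male, female)]
print(cosine(analogy, queen))  # close to 1.0
```

In a real embedding the match is only approximate, but the best answer to “king − male + female” is typically “queen”.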
Ultimately, however, our concern with language is a concern with meaning, not words. In English, many words have multiple meanings. In Chinese, the meaningful interpretation of a word without its context is virtually impossible.
The idea of word embeddings can be extended to include sense embeddings, that is, embeddings that include the multiple senses of individual words. Indeed, the popular skip-grams algorithm has been modified and extended into the multi-sense skip-gram, or MSSG.
Word embeddings more sophisticated than those described here must be enlisted to meet the needs of machine translation. Examples include ELMo, XLNet, not to mention BERT and ERNIE from Google and Baidu, respectively.
In the next blog we will look at some actual examples that apply word2vec techniques using the popular library Gensim.