A Simple Implementation of Word2Vec
I have always been puzzled and amazed by the idea of “embedding”. A high-dimensional space, such as the vocabulary of a language corpus, can be represented using, say, only 50 dimensions. How amazing! This is a huge saving in the dimension of the covariate space compared to one-hot encoding. In a previous course at UConn called Data Science in Action, I did some text classification based on one-hot encoding and tf-idf weighting of text messages after tokenization, but that was a rather naive application: there were 9376 words in a total of 5572 messages, and I did not try to lower the dimension of the covariate space but applied a bunch of classification algorithms directly. The project is on GitHub.
Word2Vec is an approach that uses a small neural network of 3 layers (input, 1 hidden, output) to produce a word’s embedding based on its context. It can be implemented using either continuous bag of words (CBOW), which predicts the probability of seeing a word based on its context, or skip-gram, which predicts the context based on the word. This post provides a brief but easy-to-understand illustration of CBOW. The corpus, though, is too small to show any interesting results. I intend to use a bigger corpus and see what the algorithm tells us.
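To make the difference between the two variants concrete, here is a minimal sketch of how training pairs could be built in each case. The helper names `skipgram_pairs` and `cbow_pairs` and the toy sentence are my own illustrations, not part of any library.

```python
def skipgram_pairs(tokens, window):
    """(center, neighbor) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(i - window, 0), min(i + window, len(tokens) - 1)
        for j in range(lo, hi + 1):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """(context, center) pairs: the whole context predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(i - window, 0), min(i + window, len(tokens) - 1)
        context = [tokens[j] for j in range(lo, hi + 1) if j != i]
        pairs.append((context, center))
    return pairs


toy = ["the", "duke", "and", "duchess"]  # hypothetical toy sentence
print(skipgram_pairs(toy, 1))
print(cbow_pairs(toy, 1))
```

Skip-gram yields one training example per (word, neighbor) pair, while CBOW yields one example per position, with the whole context as input.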
For a larger corpus I use a news article about the Duke and Duchess of Sussex, a couple much in the spotlight. First we load the needed packages and read in the text of the article.
import tensorflow as tf
import numpy as np
import nltk
import pandas as pd
import re
## make a corpus of all sentences, and convert all sentences to lower case
## replace all special characters with spaces
corpus = []
with open("./word2vectext.txt", "r") as f:
    for x in f:
        corpus.append(re.sub(r'[^\w]', ' ', x.lower()))
## make a list of words
words = []
for i in corpus:
    for word in i.split():
        words.append(word)
## remove duplicates
words = set(words)
len(words) # 541 unique words
Next we create the dictionaries containing the mapping of a word to its index, and its index to the word.
word2int, int2word = {}, {}
vocab_size = len(words)
for i, word in enumerate(words):
    word2int[word] = i
    int2word[i] = word
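One caveat: iterating over a `set` of strings does not give a fixed order across Python runs, so the index assigned to a given word can change from one run to the next. If reproducible indices matter, one option is to sort the vocabulary first. The sketch below uses a hypothetical toy vocabulary and the illustrative names `toy_word2int` and `toy_int2word`.

```python
toy_words = {"meghan", "harry", "ottawa", "duke"}  # hypothetical toy vocabulary

# Sorting gives every word the same index on every run.
toy_word2int = {word: i for i, word in enumerate(sorted(toy_words))}
toy_int2word = {i: word for word, i in toy_word2int.items()}

print(toy_word2int)
print(toy_int2word[2])
```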
Then we create another list, where each element is itself a list containing the words of one sentence. A window of size 3 is slid over every sentence to record the co-occurrence of pairs of words that are at most 3 positions apart.
sentences = []
for sentence in corpus:
    sentences.append(sentence.split())
data = []
window_size = 3
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - window_size, 0) : min(word_index + window_size, len(sentence)) + 1]:
            if nb_word != word:
                data.append([word, nb_word])
len(data) # 6730
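To see what the window does, the same loop can be wrapped in a small function and run on toy sentences. The name `window_pairs` and the example sentences are my own; the logic mirrors the loop above. Note that because the filter compares words rather than positions, a repeated word inside the window is also dropped.

```python
def window_pairs(sentence, window_size):
    """Collect [word, neighbor] pairs for neighbors within window_size positions."""
    pairs = []
    for word_index, word in enumerate(sentence):
        start = max(word_index - window_size, 0)
        stop = min(word_index + window_size, len(sentence)) + 1
        for nb_word in sentence[start:stop]:
            if nb_word != word:  # skips the word itself and any repeats of it
                pairs.append([word, nb_word])
    return pairs


toy = ["harry", "and", "meghan", "visit", "canada"]  # hypothetical sentence
pairs = window_pairs(toy, 2)
print(len(pairs))
print(pairs[:2])

repeated = ["the", "duke", "met", "the", "queen"]  # "the" appears twice
print(["the", "the"] in window_pairs(repeated, 3))
```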
For these 6730 pairs, we treat the first word as x and the second word as y, and use a small neural network to predict y from x. (Predicting a neighboring word from the word itself is the skip-gram direction described earlier.) For this purpose, the text needs to be converted to a numerical representation. We use one-hot encoding to convert the words into matrices in which each row sums to 1.
## this is a brute-force way to do one-hot encoding
## pandas has more convenient solutions
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train, y_train = [], []
for data_word in data:
    x_train.append(to_one_hot(word2int[data_word[0]], vocab_size))
    y_train.append(to_one_hot(word2int[data_word[1]], vocab_size))
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
print(x_train.shape, y_train.shape) # (6730, 541) (6730, 541)
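As an alternative to the loop, the rows of an identity matrix can be indexed directly to get the same one-hot matrix in one step. This is only a sketch with a toy vocabulary size and made-up indices; `np.eye` and integer-array indexing are standard NumPy.

```python
import numpy as np

vocab_size = 5                      # hypothetical toy vocabulary size
indices = np.array([0, 3, 3, 1])    # hypothetical word indices

# Row i of the identity matrix is the one-hot vector for index i.
one_hot = np.eye(vocab_size, dtype=np.float32)[indices]

print(one_hot.shape)
print(one_hot.sum(axis=1))
```

Using `float32` here also halves the memory compared to the default `float64` of `np.zeros`; at 6730 x 541 entries each training matrix is roughly 29 MB in `float64`.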
Next we define the model using TensorFlow.
x = tf.placeholder(tf.float32, shape = (None, vocab_size))
y_label = tf.placeholder(tf.float32, shape = (None, vocab_size))
### choose how many dimensions to use to represent these more than 500 words
embedding_dim = 5
## initialize model weights using random numbers
W1 = tf.Variable(tf.random.normal([vocab_size, embedding_dim]))
b1 = tf.Variable(tf.random.normal([embedding_dim]))
hidden_representation = tf.add(tf.matmul(x, W1), b1)
## output layer
W2 = tf.Variable(tf.random.normal([embedding_dim, vocab_size]))
b2 = tf.Variable(tf.random.normal([vocab_size]))
prediction = tf.nn.softmax(tf.add(tf.matmul(hidden_representation, W2), b2))
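Because each input row is one-hot, multiplying it by W1 simply picks out one row of W1. That is why the embedding of word i can later be read off as row i of W1 plus b1. The NumPy sketch below checks this on toy sizes with random weights; the sizes and the seed are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3                    # toy sizes
W1 = rng.normal(size=(vocab_size, embedding_dim))   # random stand-in weights
b1 = rng.normal(size=embedding_dim)

word_index = 4
x = np.zeros(vocab_size)
x[word_index] = 1.0                                 # one-hot input for word 4

hidden = x @ W1 + b1                                # hidden layer, as in the model
print(np.allclose(hidden, W1[word_index] + b1))
```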
Now we are ready to train the weights:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * \
    tf.log(prediction), axis = 1))
train_step = tf.train.GradientDescentOptimizer(0.08).minimize(cross_entropy_loss)
n_iters = 10000
## one step at a time
for _ in range(n_iters):
    sess.run(train_step, feed_dict = {x: x_train, y_label: y_train})
print(sess.run(W1))
print(sess.run(b1))
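A practical note on the loss: taking `tf.log` of a softmax output can produce `-inf` or `nan` when a predicted probability underflows to 0. A common remedy is to compute the log-softmax directly from the logits after subtracting the row maximum. The NumPy sketch below illustrates the idea; the function name `softmax_cross_entropy` and the toy inputs are mine, and this is not the code used for the results in this post. TensorFlow 1.x offers `tf.nn.softmax_cross_entropy_with_logits_v2` for the same purpose.

```python
import numpy as np


def softmax_cross_entropy(logits, one_hot_labels):
    """Mean cross-entropy computed from raw logits in a numerically stable way."""
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1).mean()


# Toy logits that would overflow in a naive exp-then-normalize softmax.
logits = np.array([[1000.0, 0.0, -1000.0], [0.0, 0.0, 0.0]])
labels = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
loss = softmax_cross_entropy(logits, labels)
print(np.isfinite(loss))
```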
After the model is trained, we are able to obtain the embedded vectors:
vectors = sess.run(W1 + b1)
We use a function to find the closest word to a given word:
def euclidean_dist(vec1, vec2):
    ## square each difference before summing
    return np.sqrt(np.sum((vec1 - vec2) ** 2))
## a linear search for minimum
def find_closest(word_index, vectors):
    min_dist = float("inf")
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        dist = euclidean_dist(vector, query_vector)
        if dist < min_dist and not np.array_equal(vector, query_vector):
            min_dist = dist
            min_index = index
    return min_index
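The linear search can also be written without an explicit Python loop. The sketch below computes the Euclidean distance (square root of the sum of squared differences) from the query to every row at once. The name `find_closest_vectorized` and the toy vectors are my own, and unlike the function above it excludes the query by index rather than by comparing vector values.

```python
import numpy as np


def find_closest_vectorized(word_index, vectors):
    """Index of the row nearest to vectors[word_index], excluding itself."""
    diffs = vectors - vectors[word_index]        # broadcast over all rows
    dists = np.sqrt((diffs ** 2).sum(axis=1))    # Euclidean distance per row
    dists[word_index] = np.inf                   # never return the query itself
    return int(np.argmin(dists))


# Hypothetical 2-dimensional "embeddings" for four words.
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [1.0, 0.5]])
print(find_closest_vectorized(1, toy_vectors))  # → 3
```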
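With the trained model and its embedding vectors, we can look up the nearest neighbors of a few words. The indices and neighbors below come from one particular run; they will differ between runs because the weights are initialized randomly and the set-based vocabulary has no fixed order.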
word2int["meghan"] # 337
word2int["harry"] # 363
find_closest(337, vectors) # 482
int2word[482] # 'ottawa'
find_closest(363, vectors) # 341
int2word[341] # 'avoided'
- The code is based on TensorFlow version 1.14.0; credit to Towards Data Science.