Word2vec Defined

Introduction

Word2vec (word to vector), as the name suggests, is a tool that converts words into vector form.

Technically, Word2vec is a shallow, two-layer neural network that is used to produce word embeddings. It takes a large corpus of text as input and outputs a vector space in which each word in the vocabulary is mapped to a unique vector. This allows us to represent relationships between words.

Broadly, there are two models: CBOW (Continuous Bag of Words) and Skip-Gram. Both neural network architectures essentially help the network learn how to represent a word. This is unsupervised machine learning: no external labels are needed, because either of these two models can create labels from the given input itself and use them to train the neural network to perform the desired task.

The difference in architecture between CBOW & Skip-gram models

Skip-Gram

Suppose we have 10,000 unique words, and we represent an input word like “ape” as a one-hot vector (a categorical word/variable is better understood by an ML algorithm when it is one-hot encoded into 0’s and 1’s). This vector will have 10,000 components, one for each word in the vocabulary. We place a 1 in the position that represents the word “ape” and a 0 in all other positions.
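As a minimal sketch (using a toy five-word vocabulary in place of the full 10,000 words), one-hot encoding looks like this:

import numpy as np

# toy vocabulary standing in for the full 10,000-word vocabulary (illustration only)
vocab = ["ant", "ape", "bee", "cat", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    # a vector of zeros with a single 1 at the word's index
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("ape", len(vocab)))  # [0. 1. 0. 0. 0.]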

The output of the network is also a single vector with 10,000 components: for each vocabulary word, it holds the probability that a randomly selected nearby word is that word. The hidden-layer neurons have no activation function, while the output neurons use softmax.

When we evaluate the trained network on an input word, the output is a probability distribution instead of a one-hot vector.

The Hidden Layer

For example, suppose we use word vectors with 300 features. The hidden layer is then a 10,000 × 300 weight matrix: 10,000 rows, one for each word in the vocabulary, and 300 columns, one for each hidden neuron.

The one-hot vector we use as input simply picks out the corresponding row of this matrix. This means the hidden layer acts as a lookup table, and its output is just the word vector for the input word.
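Here is a rough numpy sketch of that lookup, with made-up sizes (5 words and 3 hidden neurons instead of 10,000 and 300):

import numpy as np

np.random.seed(0)
V, N = 5, 3                      # toy sizes standing in for 10,000 words and 300 features
W_hidden = np.random.rand(V, N)  # hidden-layer weight matrix: one row per word

one_hot_ape = np.zeros(V)
one_hot_ape[1] = 1               # "ape" sits at index 1 in the toy vocabulary

# multiplying the one-hot vector by the matrix simply selects row 1,
# so the hidden layer behaves like a lookup table
hidden_output = one_hot_ape @ W_hidden
assert np.allclose(hidden_output, W_hidden[1])
print(hidden_output)             # this is the word vector for "ape"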

The Output Layer

Once we pick up the word vector for “ape” from the hidden layer, it is sent to the output layer. The output layer is a softmax regression classifier.

So what is the softmax regression classifier? Softmax regression is a generalization of logistic regression that we can use for multi-class classification.
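A minimal numpy sketch of softmax, showing how raw output scores become a probability distribution that sums to 1:

import numpy as np

def softmax(scores):
    # convert raw scores into probabilities that sum to 1
    exp_scores = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # roughly [0.66 0.24 0.10]
print(probs.sum())  # 1.0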

CBOW

CBOW (continuous bag-of-words) is a model suited to smaller datasets, and it does not require much RAM either. In CBOW, we predict a word given its context. The entire vocabulary is used to create a “bag of words” model before proceeding to the next task.

The bag of words model is a simple representation of words that disregards grammar.

Here is an example for better understanding.

John likes to watch movies. Mary likes movies too.

Mary also likes to watch football games.

A list of words is created by breaking the two sentences into individual words.

We then label each word with the number of occurrences. This is called a bag of words. In computers, this is usually captured as a JSON file.

The sentences are then combined to get the overall frequency of each unique word.
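A quick sketch of building that bag of words with Python's Counter; the combined counts can then be written out as JSON:

import json
from collections import Counter

sentences = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

# split each sentence into lowercase words, stripping punctuation
words = []
for sentence in sentences:
    words.extend(word.strip(".").lower() for word in sentence.split())

bag_of_words = Counter(words)
print(json.dumps(bag_of_words, indent=2))
# e.g. "likes": 3, "to": 2, "watch": 2, "movies": 2, "mary": 2, ...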

Now that the bag of words model is ready, we can use CBOW to predict the probability of a word given these groups of words.

CBOW model to predict probabilities of words

Each context word is passed as a one-hot vector into the input layer of the neural network. If we have X context words, the input layer takes in X [1 × V] vectors and the output layer gives out one [1 × V] vector.

The input-hidden weight matrix has size [V × N] and the hidden-output weight matrix has size [N × V], where N is the number of dimensions. There is no activation function between the input and hidden layers.

To calculate the output, the hidden layer’s activations are multiplied by the hidden-output weight matrix. The error between the output and the target is calculated, and the weights are continually readjusted through backpropagation. The only non-linearity is the softmax in the output layer, which generates the probabilities for the word vectors.

Overall, after training, the weights between the hidden and output layers are taken as the word vector representation. As we can see, this architecture allows the model to predict the current word based on the influence of the surrounding words.
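A rough numpy sketch of one CBOW forward pass under the dimensions above, using toy sizes and randomly initialised weights, and averaging the context word vectors (one common formulation):

import numpy as np

np.random.seed(0)
V, N = 5, 3                   # toy vocabulary size and number of dimensions
W_in = np.random.rand(V, N)   # input-to-hidden weights, [V x N]
W_out = np.random.rand(N, V)  # hidden-to-output weights, [N x V]

context_indices = [0, 2, 3]   # indices of the surrounding (context) words

# hidden layer: average of the context words' input vectors (no activation function)
hidden = W_in[context_indices].mean(axis=0)

# output layer: a score for every word, turned into probabilities with softmax
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(probs, int(np.argmax(probs)))  # probabilities and the predicted centre word's index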

Differences between CBOW and Skip-Gram

Although the two models show mirror symmetry, they vary in terms of architecture and performance.

  1. The Skip-Gram model predicts words surrounding a certain word by relying on the contextual similarity of words. On the other hand, the CBOW model uses the Bag of words approach and predicts a word using the words that surround it.
  2. The Skip-Gram model is more accurate when it comes to infrequent words. It suits larger datasets and requires more RAM to function. The CBOW model is faster, does not guarantee good handling of infrequent words, requires less RAM, and suits smaller datasets.

The choice of model depends on the user’s task.
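In gensim, switching between the two architectures is a single parameter: sg=0 trains CBOW (the default) and sg=1 trains Skip-Gram. A minimal sketch on a toy corpus, assuming gensim 4’s API (where vector_size replaces the older size argument):

from gensim.models import Word2Vec

# toy corpus: each "sentence" is a list of tokens
corpus = [["john", "likes", "movies"], ["mary", "likes", "football"]]

cbow_model = Word2Vec(corpus, vector_size=10, window=2, min_count=1, sg=0)      # CBOW
skipgram_model = Word2Vec(corpus, vector_size=10, window=2, min_count=1, sg=1)  # Skip-Gram

print(cbow_model.wv.most_similar("movies", topn=2))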

Code tutorial

We use word2vec from gensim on the Online Retail dataset. The code below trains a word2vec model and uses it to find items similar to a given item.

import pandas as pd
import numpy as np
import random
from gensim.models import Word2Vec

df = pd.read_csv('OnlineRetail.csv')
df.dropna(subset=["CustomerID"], inplace=True)  # drop rows without a customer ID

# assumption: we train on the full dataframe; a train/validation split could be added here
train_df = df
customers_train = train_df["CustomerID"].unique()

# build one "sentence" per customer: the list of product codes they purchased
purchases_train = []
for i in customers_train:
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

# train the word2vec model (skip-gram with negative sampling)
model = Word2Vec(window=2, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
# assumption: products_dict maps each StockCode to its product description
products = train_df[["StockCode", "Description"]].drop_duplicates("StockCode")
products_dict = products.groupby("StockCode")["Description"].apply(list).to_dict()

def similar_products(v, n=4):

    # extract the most similar products for the input vector (skipping the product itself)
    ms = model.wv.similar_by_vector(v, topn=n + 1)[1:]

    # extract the name and similarity score of each similar product
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)

    return new_ms

similar_products(model.wv['82482'])
# '82482' is the StockCode for WOODEN PICTURE FRAME WHITE FINISH

The output is:

[('WOOD S/3 CABINET ANT WHITE FINISH', 0.9902486205101013),
('WOOD 2 DRAWER CABINET WHITE FINISH', 0.9844652414321899),
('WOODEN FRAME ANTIQUE WHITE ', 0.9791655540466309),
('VINTAGE BILLBOARD LOVE/HATE MUG', 0.9734472036361694)]

These products are the most similar to “WOODEN PICTURE FRAME WHITE FINISH”.

Recommender System

The recommender system is based on the past purchases a user has made: we average the vectors of the products the user bought and look up products similar to that average vector.

def recommend_system(products):
    # average the vectors of all products the user has purchased
    product_vec = []
    for i in products:
        try:
            product_vec.append(model.wv[i])
        except KeyError:
            continue
    return np.mean(product_vec, axis=0)

# assumption: recommend products based on the first customer's purchase history
similar_products(recommend_system(purchases_train[0]))

Output of the recommender system:

[('PAPER BUNTING WHITE LACE', 0.999230682849884),
('CHARLIE + LOLA RED HOT WATER BOTTLE', 0.9992085695266724),
('CHARLIE+LOLA PINK HOT WATER BOTTLE', 0.9992051124572754),
('TOTE BAG I LOVE LONDON', 0.9991719722747803)]

In this blog post, we introduced and explained Word2vec, a commonly used word embedding technique, through an introduction, diagrams, and an implementation with code on a real-world dataset. Word embedding is the foundation of any major text analysis task, and we hope to have done justice to this topic by covering it in depth.
