Explain The Idea

Word vector is widely used in text-related AI processes. Its center idea is to maximize the following likelihood function.

$$\begin{equation} \label{eq:originalL}
L(\theta) = \prod_{t=1}^T \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta)

Terrifying! Right? Let me explain it.

Imagine as if you have a long text which is a sequence of words. At any time, consider a word $w_t$ as center of a window in the long word sequence. What is the probability of words surrounding $w_t$? Rephrase it, what is the probability of surrounding words appearing in a window, given $w_t$ is the center? Solution to this question can be considered in the following way.

For any one specific occurrence of $w_t$, its surrounding is $\{ w_{-m}, w_{-m+1}, …, w_{t-1}, w_{t+1}, …, w_{m-1}, w_{m} \}$. For each of the $w_{t+j}$, where $-m \le j \le m \land j \ne 0$, the probability of $w_{t+j}$ given $w_t$ is just denoted as $p(w_{t+j} | w_t ; \theta)$. The $\theta$ there is the parameter to be trained, which guides how each probability is calculated while the formulae are fixed. In fact, you can think of $\theta$ as a store. When computing $p(w_{t+j} | w_t ; \theta)$, $p$ takes the part of $w_{t+j}$ out of $\theta$, takes the part of $w_t$ out of $\theta$, then do the computation using a fixed formula. Now multiply all $p(w_{t+j} | w_t ; \theta)$ for $w_t$’s surrounding words, we get a probability estimate of one surrounding given $w_t$. This is the $$ \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta) $$ part.

Now, I have got a probability estimate for one position in the long sequence of words. We use the same formula for all positions, and multiply the probabilities to get the probability estimate of all windows appearing collectively, given the long word sequence. And the whole probability estimate is $L(\theta)$. Note: the word sequence is the universe, therefore it is not explicitly included in the formula.

$L(\theta)$ is clearly between $0$ and $1$, by definition. And the word sequence appears in reality. Therefore, ideally, $L(\theta)$ should be $1$. So, what we need to do is to maximize $L(\theta)$ to make it as close to $1$ as possible.

In real computation, all $p(w_{t+j} | w_t ; \theta)$’s are between $0$ and $1$. Their products will be extremely close to 0. This will cause underflow and loss of precision. Hence, we use $J(\theta)=-\frac{1}{T}log(L(\theta))$ instead. The minus sign changes maximization into minimization, as people are more comfortable with minimization and more software packages are tuned to minimization. The $\frac{1}{T}$ helps to constrain the sum in the magnitude of one log-probability value, namely the sum does not grow indefinitely large in magnitude as $T$ grows. $J(\theta)$ expands to $$
J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))

Before working further, we must determine the formula of $p(w_{t+j} | w_t ; \theta)$. Any formula that constitutes a probability density function will do. But softmax is a well acccepted one. So, I will use softmax. This is also the function used in the original algorithm. The original algorithm uses two vectors for each word. $u_c$ when the word considered is the center word, $u_o$ otherwise. Hence, $$ p(w_o | w_c ; \theta) = \frac{exp({u_o}^\top u_c)}{\sum_{q \in V}exp({u_q}^\top u_c)} $$, where $u_o$, $u_c$ and $u_q$ are taken out of $\theta$, and $V$ is the vocabulary, namely the set of all words. In this setting, each word will end up with two vectors. People will have to merge the two vectors for each word, because eventually we want one vector for each word. Here I think, why not just use one vector $u$ for each word? I am going to try it. The formula looks identical: $$ p(w_a | w_b ; \theta) = \frac{exp({u_a}^\top u_b)}{\sum_{d \in V}exp({u_d}^\top u_b)} $$, just it no longer distinguishes center from surrounding. A side effect of this change is that size of $\theta$ is halved.

The full $J(\theta)$ is $$\begin{align}
&=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta)) \\
&=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)} \right) \label{eq:originalJ}

Derive The Formula for Minimization

To minimize $J(\theta)$ over $\theta$, I compute derivative of $J(\theta)$.

$$ \begin{align}
& \frac{\partial}{\partial \theta} log(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)}) \\
&= \frac{\partial}{\partial \theta} \left[ log(exp({u_{t+j}}^\top u_t)) – log({\sum_{d \in V}exp({u_d}^\top u_t)}) \right] \\
&= \frac{\partial}{\partial \theta} log(exp({u_{t+j}}^\top u_t)) – \frac{\partial}{\partial \theta} log({\sum_{d \in V}exp({u_d}^\top u_t)}) \\
&= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} \frac{\partial}{\partial \theta} exp({u_d}^\top u_t)} \right] \\
&= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} exp({u_d}^\top u_t) \frac{\partial}{\partial \theta} {u_d}^\top u_t} \right] \\
&=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} \sum_{d \in V} exp({u_d}^\top u_t) {u_d} \right] \\
&=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{\sum_{d \in V} exp({u_d}^\top u_t) {u_d}}{\sum_{f \in V}exp({u_f}^\top u_t)} \right] \\
&=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \sum_{d \in V} { \frac{exp({u_d}^\top u_t) }{\sum_{f \in V}exp({u_f}^\top u_t)} u_d } \right] \\
&=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] \\

As the computational process will be iterative, we can treat $u_{t+j}$ as constant parameters to $u_t$. So the last step becomes:

$$ u_{t+j} – \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] $$

Put them together:

$$ \begin{align} \label{eq:derivativeJ}
\frac{\partial J}{\partial \theta} = -\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ u_{t+j} – \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right]
\end{align} $$

So, we can now use $\frac{\partial J}{\partial \theta}$ to update $\theta$ so that $\frac{\partial J}{\partial \theta}$ will approach $0$. But how? Let me explain intuitively.

You can think of $\frac{\partial J}{\partial \theta}$ as a slope. A slope is a vector pointing to the direction of changing function value, assuming the function is continuous and convex in a vicinity that contains both $\theta$ and the minimal point of $J(\theta)$. So, if the direction of $\frac{\partial J}{\partial \theta}$ coincides the growing direction of $J(\theta)$, a portion of $\frac{\partial J}{\partial \theta}$ need be subtracted from $\theta$. If the direction of $\frac{\partial J}{\partial \theta}$ contradicts the growing direction of $J(\theta)$, a portion of negated $\frac{\partial J}{\partial \theta}$ need be added to $\theta$, which amounts to also subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$. Therefore, subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$ is the method for updating $\theta$ in the minimization process. Repeatedly updating $\theta$ in this way until no changes in $\theta$ is perceivable. This is the Gradient Descent Algorithm. If the portion is not fixed, but changing due to some intuitive, then it is the Adaptive Gradient Descent Algorithm. In formula, it is:

$$ \theta_{k+1} \gets \theta_{k} – \alpha \frac{\partial J}{\partial \theta} $$

Computational Considerations

In the process of computing $\frac{\partial J}{\partial \theta}$, we will need all ${u_a}^\top u_b$’s for all $a,b \in V$. As $|V|$ can be very large. The memory that are required for storing all ${u_a}^\top u_b$’s will be formidably large. So, storing them is better avoided.

Look closely at structure of $$p(w_d | w_t ; \theta) = \frac{exp({u_d}^\top u_t)}{\sum_{f \in V}exp({u_f}^\top u_t)}$$. It needs dot products of $u_t$ with all other vectors $u_f$ for all $f \in V$. And $t$ is just some $f$. Observe structure of $J(\theta)$ again: $$
J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
$$. We can find that in the inner sum $$
\sum_{\substack{-m \le j \le m \\ j \ne 0}}^T log(p(w_{t+j} | w_t ; \theta))
$$, denominator of $p$ is repeatedly computed, and it does not depend on $j$. Hence denominator of $p$ can be calculated beforehand and used repeatedly.

However, you may notice that $p(w_{t+j} | w_t ; \theta)$ still asks for too much computation as it must iterate over the whole vocabulary.

Changing Probability Evaluation Method into An Intuitive One

Let’s focus on $$\begin{align}
log(p(w_{t+j} | w_t ; \theta))
&= log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)} \right) \\
& \notag \\
&= \underbrace{log(exp({u_{t+j}}^\top u_t))}_\text{part1} \quad – \quad \underbrace{log(\sum_{f \in V}exp({u_f}^\top u_t))}_\text{part2}
\end{align}$$. Part1 gives similarity of $u_t$ to its surrounding words. Part2 removes some ground similarity that $u_t$ is to all words on average. The heavy computation burden comes from part2. So, we need to come up with some efficient, yet possibly approximate, alternatives.

As part2 is an estimate of $u_t$’s overall similarity, we can make the estimate from a sample of $V$. A way to do it is: $$\begin{equation}
\underbrace{log(exp({u_{t+j}}^\top u_t))}_\text{part1} \quad – \quad \underbrace{
log\left(\sum_{f \in K}exp({u_f}^\top u_t)\right)
\end{equation}$$, where $K$ is a sample of $V$. This, in turn, is approximately $$\begin{equation}
\underbrace{log(exp({u_{t+j}}^\top u_t))}_\text{part1} \quad – \quad \underbrace{
\sum_{f \in K}log(exp({u_f}^\top u_t))
\end{equation}$$. As this is no longer softmax. A way is to change the $exp$ to the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$. Thus $log$ and $exp$ no longer cancel out. The above becomes $$
\underbrace{log(\sigma({u_{t+j}}^\top u_t))}_\text{part1} \quad + \quad \underbrace{
\sum_{f \in K}log(\sigma(-{u_f}^\top u_t))
$$. Note the plus sign before part 2. It was minus before. This change is because similarity between $w_f$ and $w_t$ are evaluated negatively as $-{u_f}^\top u_t$. Thus, if a supposedly unrelated pair $u_f$ and $u_t$ yields a large positive dot-product, the log value will be very negative, and I need this value to help decrease similarity as measured by part 1. Put them back into the formula of $J$: $$\begin{equation} \label{eq:alternativeL1}
J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ log(\sigma({u_{t+j}}^\top u_t)) \; + \;
\sum_{f \in K}log(\sigma(-{u_f}^\top u_t)) \right]
\end{equation}$$. This is called Skip-Gram with Negative Sampling.

The first summation $$
$$ can still be too big. One can update $\theta$ partially by selecting a sample from the indices $\{w_1,…,w_T\}$. This entitles the algorithm Stochastic: $$
J(\theta)=-\frac{1}{|S|} \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ log(\sigma({u_{t+j}}^\top u_t)) \; + \;
\sum_{f \in K}log(\sigma(-{u_f}^\top u_t)) \right]
$$, where $S$ is a sample of $\{w_1,…,w_T\}$. If $|S|$ is fixed, $\frac{1}{|S|}$ can be omitted. And $J$ becomes $$\begin{equation} \label{eq:alternativeL2}
J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ log(\sigma({u_{t+j}}^\top u_t)) \; + \;
\sum_{f \in K}log(\sigma(-{u_f}^\top u_t)) \right]

Now, find $\frac{\partial J}{\partial \theta}$.

\frac{\partial}{\partial \theta} log\sigma(u)
&=\frac{\partial}{\partial \theta} log\frac{1}{1+e^{-u}} \\
&=\frac{\partial}{\partial \theta} \left( log(1) – log(1+e^{-u}) \right) \\
&=-\frac{\partial}{\partial \theta} log({1+e^{-u}}) \\
&=-\frac{1}{{1+e^{-u}}}\frac{\partial}{\partial \theta}(1+e^{-u}) \\
&=-\frac{1}{{1+e^{-u}}}\frac{\partial}{\partial \theta}e^{-u} \\
&=-\frac{e^{-u}}{{1+e^{-u}}}\frac{\partial}{\partial \theta}({-u}) \\
&=\frac{e^{-u}}{{1+e^{-u}}}\frac{\partial u}{\partial \theta} \\
&=\frac{e^{-u} \div e^{-u}}{{1 \div e^{-u} + e^{-u} \div e^{-u}}}\frac{\partial u}{\partial \theta} \\
&=\frac{1}{{1 \div e^{-u} + 1}}\frac{\partial u}{\partial \theta} \\
&=\frac{1}{{e^{u} + 1}}\frac{\partial u}{\partial \theta} \\
&=\frac{1}{{1 + e^{u}}}\frac{\partial u}{\partial \theta} \\
&=\sigma(-u)\frac{\partial u}{\partial \theta} \\

Then, $$\begin{align}
\frac{\partial}{\partial \theta}J(\theta)&=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[
\sigma(-{u_{t+j}}^\top u_t)\frac{\partial}{\partial \theta}{u_{t+j}}^\top u_t \; + \;
\sum_{f \in K}\sigma({u_f}^\top u_t)\frac{\partial}{\partial \theta}(-{u_f}^\top u_t) \right] \\
&=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[
\sigma(-{u_{t+j}}^\top u_t)u_{t+j} \; – \;
\sum_{f \in K}\sigma({u_f}^\top u_t){u_f} \right]

There are two kinds of samples $S$ and $K$. They are sampled using different strategies. $S$ needs to cover the whole set of word positions, hence sampling of $S$ can be a permutation of $\{w_1,…,w_T\}$ cut into segments. $K$ needs to avoid words in current window, hence it is almost a tendency to cover less frequent words. Two methods can be used. One is to sample the vocabulary uniformly. Due to Zipf’s law, you will most probably always hit less frequent words. The other is to sample the unigram distribution while taking a bias towards less frequent words. Which one is better? Have a try!

A careful reader may find some inconsistency in the above equations. $J$ takes $\theta$ as its argument, but $\theta$ disappears in the expansion. This is because each $u_a$ ($a \in V$) is taken out of $\theta$. You can think of $u_a$ as a vector of the same size as $\theta$, but have zeroes everywhere except those positions of word $a$. Another way to thinking of it is to treat $\theta$ as a vector of $u_a$’s. In this way, only the $u_a$ entries that appears in the expansion affects and are affected.

Another potential problem in the expansion above is that the possibility of $w_f=w_t$ has not been fully considered. But if $|K| \ll |V|$, the probability that $t \in K$ is very small. And even if it happens, it affects only one out of $|K|$ terms in the summation. Hence the effect could be neglectable. However, if we are to make it strict, we can multiply $C(w_f,w_t)$ to the last $u_f$, where $C(w_f,w_t)$ is $$
1 & \text{if } w_f \ne w_t \\
2 & \text{if } w_f = w_t
$$. And the same problem appears when $w_{t+j}=w_t$. Hence $$\begin{align} \label{eq:derivativeJ2}
\frac{\partial}{\partial \theta}J(\theta)
&=\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[
-\sigma(-{u_{t+j}}^\top u_t)C(w_{t+j},w_t)u_{t+j} \; + \;
\sum_{f \in K}C(w_f,w_t)\sigma({u_f}^\top u_t){u_f} \right]
\end{align}$$. Note that I have distributed the leftmost minus sign into the bracket, as this is the better form for program evaluation.

Consider Computation Again

All word vectors have the same size. When they are lined up, they constitute $\theta$: $$
\theta = \left[ u_{w_1} ; u_{w_2} ; … ; u_{w_{|V|}} \right]
$$, one vector for each word. Suppose each $u$ has $d$ entries, then, $\theta$ will be $d \times |V|$ dimensions. $\theta$ will be one $u$ at a time.

Now look closer to the summation. It is a bit tricky. How should I use $\frac{\partial}{\partial \theta}J(\theta)$ to update $\theta$, as the formula now only considers part of $\theta$. So, where are the parts should I update? To answer this question, it will be easier to consider the extreme case where $|S|$ is $1$. It is then clear that the second summation is updating $u_t$, which is the word vector of the center word of the window. The outermost summation just says that we need to update all the word vectors in batch.

What Could Be the Problem?

It turns out that still some problems remain challenging.


First, let’s look back at equation \eqref{eq:originalJ} or \eqref{eq:alternativeL2}. Think of each $p(\bullet)$s as $x_i$. On any 2-D circle formed by any of $x_i$ and $x_j$, $x_i x_j$ reaches its maximum when $x_i = x_j$, and minimum when $x_i=-x_j$. So do they when applied $exp(\bullet)$. So, $\vec{0}$ is indeed a saddle point. This means that I cannot set initial values of $\theta$ to $\vec{0}$. And this raises a risk in training that $\theta$ might converge to $\vec{0}$. At least, if this happens, I must retrain the model.

Linear Dependency

Look close at \eqref{eq:derivativeJ} and \eqref{eq:derivativeJ2}, you can find that the derivative can only update $\theta$ in multiple of $u_t$ and $u_f$. So, if the $u_t$ and $u_f$ are linearly-dependent to begin with, the resulting $\theta$ can only consist of linearly-dependent vectors. So, I must initialize $\theta$ so that all the $u$’s are likely to be linear-independent, or at least find a way to introduce some disturbance to break the linear-dependency.


$J(\theta)$ is affected by dot products of two word vectors. Positive dot products tend to be greater in positive direction, negative dot products tend to be greater in negative direction. After the dot products growing big, they almost do not affect value of $\sigma$, thus almost do not affect value of $J(\theta)$. This is bad. Because, if a dot product is two big, it will definitely cause a float-pointing overflow, which will make further computation meaningless. So, I will need a mechanism to restrict magnitudes of the word vectors. One way to do it is to add a squared $L_2$ norm. This turns \eqref{eq:alternativeL2} into $$\begin{equation}
J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}}
log(\sigma({u_{t+j}}^\top u_t)) \; + \;
\sum_{f \in K}log(\sigma(-{u_f}^\top u_t)) \; – \;
D {u_t}^\top u_t
\end{equation}$$. Here, $D$ is a constant for tuning the balance between model fidelity and vector magnitude. Then, \eqref{eq:derivativeJ2} becomes $$
\frac{\partial}{\partial \theta}J(\theta)
=\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[
-\sigma(-{u_{t+j}}^\top u_t)C(w_{t+j},w_t)u_{t+j} \; + \;
\sum_{f \in K}C(w_f,w_t)\sigma({u_f}^\top u_t){u_f} \; + \; 2 D u_t

Step Size of Update

It is not easily known that if my computed $\frac{\partial J}{\partial \theta}$ will give $\theta$ a suitable stride so that I have to compute $J(\theta)$ to see if a step really makes a better $J(\theta)$, and update $\theta$ when it does. Otherwise, I can decrease the step size to see if a smaller step size does better.