Category: AI

  • Understand ReLU as a Piecewise Function

    ReLU is an activation function defined as $$
    ReLU(x)=\begin{cases}
    x, & x > 0 \\
    0, & \text{otherwise}
    \end{cases}
    $$. This function has an interesting property: it can act like an open/close switch. To see why, suppose $\vec{x}$ is a vector with entries $x_i$. When a ReLU layer is applied to $\vec{x}$, the check is entry-wise. Only those $x_i$ that are greater than $0$ are passed, just as electricity passes through a closed switch. All non-positive entries are zeroed out, just as electricity is blocked by an open switch.
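    As a minimal illustration of the switch behavior (my own toy code, not taken from any framework), the following C snippet applies ReLU entry-wise to a small vector and reports which entries pass:

    #include <stdio.h>

    /* Entry-wise ReLU: positive entries pass, non-positive entries are zeroed. */
    static void relu(const double *in, double *out, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] > 0.0 ? in[i] : 0.0;
    }

    int main(void)
    {
        double x[4] = { 1.5, -0.3, 0.0, 2.0 };
        double y[4];
        relu(x, y, 4);
        for (int i = 0; i < 4; i++)
            printf("x[%d]=%+.1f -> %s\n", i, x[i], y[i] > 0.0 ? "passed" : "blocked");
        return 0;
    }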

    When a neural network uses ReLU layers, each ReLU layer can be thought of as selecting some outputs of its preceding layer, so it behaves as a feature selector. Training the network is then training a feature-selection model. When the model has multiple layers, ReLU is in fact selecting computed features.

    So, one can try to design a feature-selection process using ReLU, like $$\text{Input} \rightarrow \text{D} \rightarrow \text{ReLU} \rightarrow \text{More DNN}$$. Here, $\text{D}$ is a layer equivalent to multiplying by a diagonal matrix. After several epochs of training, all entries of the input that correspond to non-positive entries of $\text{D}$ can be discarded.
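    Here is a small C sketch of the $\text{Input} \rightarrow \text{D} \rightarrow \text{ReLU}$ idea, assuming $\text{D}$ is simply a learned per-entry scale; the weights shown are hypothetical values after training:

    #include <stdio.h>

    #define P 5

    int main(void)
    {
        /* Hypothetical learned diagonal weights after several epochs of training. */
        double d[P] = { 0.8, -0.2, 1.3, 0.0, 2.1 };
        double x[P] = { 1.0, 1.0, 1.0, 1.0, 1.0 };  /* an input sample */
        double y[P];

        for (int i = 0; i < P; i++) {
            double z = d[i] * x[i];          /* the D layer: element-wise scaling */
            y[i] = z > 0.0 ? z : 0.0;        /* the ReLU layer: open/close switch */
        }

        /* Per the article's idea, inputs whose diagonal weight ended up
           non-positive are candidates for removal. */
        for (int i = 0; i < P; i++)
            if (d[i] <= 0.0)
                printf("input %d can be discarded (d=%.1f)\n", i, d[i]);
        return 0;
    }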

  • Understand Why A Sole Linear Layer Does Not Classify Well

    In deep neural networks, the most frequently used component may be the linear layer. However, a linear layer by itself does not work well as a classifier. In this article, I intend to explain why, from my own point of view.

    A linear layer is a linear function of the form $$F(A) = A \vec{x}$$, where $A$ is a matrix and $\vec{x}$ is the input vector. The training process of a DNN is to find an optimal $A$ that maximizes an objective function. In this process, $\frac{\partial{F}}{\partial{A}}$ is computed and used to update $A$. To see what $\frac{\partial{F}}{\partial{A}}$ looks like, we can rewrite $F$ as $$
    F'(vec(A))=M \, vec(A)
    $$, where $vec(A)$ is the vectorized $A$, and $M$ is a corresponding matrix containing suitably placed entries of $\vec{x}$ so that $$
    F(A) = F'(vec(A)) = M \, vec(A)
    $$.
    Now, it is clear that $$
    \frac{\partial{F}}{\partial{A}} = \frac{\partial{F'}}{\partial{vec(A)}} = M
    $$. When an optimizer uses a multiple of $M$ to update $A$, as in a gradient descent algorithm, $A \leftarrow A - \vec{\Lambda} M$, where $\vec{\Lambda}$ is some vector that summarizes the backpropagated gradients. Hence, it is clear that $A$ eventually only changes along some vector in the column space of $M$. When the optimal $A$ is not in the column space of $M$, no matter how $A$ is updated, the optimal point will never be reached.

    The above finding can help us improve the design of even simple networks. For example, assume a layer $G(\vec{x}) = A \vec{x}$, where $A$ is a $2 \times m$ matrix. When the training samples $\vec{x}$ are almost collinear, $A$ will only move along those $\vec{x}$'s. Hence, it is quite difficult to achieve an effect like swapping two entries of $A$.

    To solve the problem, the effective dimension of the $\vec{x}$'s must be increased, either directly, by adding more varied samples, or indirectly, by expanding the model to multiple layers. For example, define $$
    F(\vec{x}) = A \, v( B \vec{x} )
    $$, where $v$ is some activation function. This is in fact two linear layers. $B \vec{x}$ first expands $\vec{x}$ into more varied values, then $v$ introduces more linear independence before the result is multiplied by $A$. From this point of view, ReLU is clearly a good choice of activation function: it behaves far from a linear function and therefore produces higher dimensions, and the higher dimension lets the columns of $A$ combine in more varied ways.
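    To make the two-layer picture concrete, here is a minimal C sketch of $F(\vec{x}) = A \, v(B\vec{x})$ with ReLU as $v$; the dimensions and the values of $A$, $B$ and $\vec{x}$ are made up for illustration.

    #include <stdio.h>

    #define M 3   /* input dimension  */
    #define H 4   /* hidden dimension */
    #define O 2   /* output dimension */

    int main(void)
    {
        double B[H][M] = { {1,-1,0}, {0,1,-1}, {-1,0,1}, {1,1,1} };
        double A[O][H] = { {1,0,-1,2}, {0,1,1,-1} };
        double x[M]    = { 0.5, -0.2, 0.8 };
        double h[H], y[O];

        /* hidden layer: h = ReLU(B x) */
        for (int i = 0; i < H; i++) {
            h[i] = 0.0;
            for (int j = 0; j < M; j++)
                h[i] += B[i][j] * x[j];
            if (h[i] < 0.0) h[i] = 0.0;
        }
        /* output layer: y = A h */
        for (int i = 0; i < O; i++) {
            y[i] = 0.0;
            for (int j = 0; j < H; j++)
                y[i] += A[i][j] * h[j];
            printf("y[%d] = %f\n", i, y[i]);
        }
        return 0;
    }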

    On the contrary, if $v$ is a sigmoid function, it changes smoothly everywhere and behaves almost linearly near the origin. Hence, its ability to add dimensions is low compared to ReLU.

    That is all! ……Wait! Not all!

    There is still a problem. The nonlinearity of ReLU only occurs at the point $0$. If the input to ReLU is far from $0$, ReLU is virtually linear and loses its merits. This is where a good initialization is needed. If the layer is large enough, a good initialization will cover both the positive and the negative side of the ReLU input for every input entry, and hence will give enough dimensions to the column space of $B$. However, if $B$ is too large, it is wasteful. A natural question is how large the layer must be, at a minimum. Suppose the input $\vec{x}$ has $p$ entries. To give each entry a chance to be both passed and blocked by the ReLU, there must be at least two rows of $B$ per input entry, so $B$ must have at least $2p$ rows. Still, this does not guarantee that each $x_i$ is both passed and blocked, because initialization is usually a random process; it is only probable that the initialization turns out perfectly. Hence, using more than $2p$ rows can also be practical. With that, $A$ can form enough combinations to cover the whole output space of $F$.
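    Under my reading of the paragraph above, here is one (non-standard) way to initialize a $2p$-row $B$ so that every input entry is guaranteed to hit both sides of the ReLU: initialize the rows in sign-mirrored pairs.

    #include <stdio.h>
    #include <stdlib.h>

    #define P 3          /* input dimension           */
    #define H (2 * P)    /* hidden width: at least 2p */

    int main(void)
    {
        double B[H][P];
        srand(42);

        /* Rows 2i and 2i+1 are negatives of each other, so whichever sign
           row 2i produces on an input, the paired row produces the opposite
           sign, and every input direction is both passed and blocked. */
        for (int i = 0; i < P; i++) {
            for (int j = 0; j < P; j++) {
                double r = (double)rand() / RAND_MAX - 0.5;
                B[2 * i][j]     =  r;
                B[2 * i + 1][j] = -r;
            }
        }

        for (int i = 0; i < H; i++) {
            for (int j = 0; j < P; j++)
                printf("%+6.3f ", B[i][j]);
            printf("\n");
        }
        return 0;
    }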

    In addition, to avoid overfitting, dropout has been the de facto method. If the dropout ratio is $d$, then $B$ must have at least $2p/d$ rows.

    Conclusion: a minimal practical DNN should have the form $F(\vec{x}) = A \, v(B \vec{x})$, where the number of rows of $B$ is at least double the number of entries of $\vec{x}$.

  • Is the Vibe Coding Boom About to Fade?

    Video sites are still buzzing with vibe-coding content. However, some results that do not support the optimism have already come out as data. Recently, the following result caught my eye.

    The AI Productivity Boom that Wasn’t | L&DeepDive

    To summarize: veteran programmers who used generative AI had the illusion of being about 20% faster, while the measured result was that they were actually about 20% slower.

    With that, the pretty bubble of vibe coding has all but burst.

    Vibe coding was already expected to pile up technical debt at lightning speed. With this result added, it has now also been confirmed again that it lowers productivity.

    Of course, generative AI is one direction of development, but cramming the enormously complex context of a real business environment into a single generative AI at low cost will be quite difficult.

    The company behind Claude Code has declared that 80%-90% of its code is written by AI, but that can also be regarded as part of building its own product, and its tolerance for increased cost is presumably much higher than in ordinary project development.

    It looks like it will still take quite a while before vibe coding becomes the mainstay of production code.

  • Compute the Sigmoid Function the Smart Way

    The sigmoid function frequently appears in machine learning contexts. It has a simple form: $$
    \sigma(x)=\frac{1}{1+e^{-x}}
    $$. A naive implementation in C would be:

    #include <math.h>

    double sigma(double x)
    {
        return 1.0/(1.0+exp(-x));
    }

    It looks good, right? No, it is not. When $x \ll 0$, say $x=-1000$, exp(-x) overflows, even though $\sigma(x)$ simply approaches $0$. We need a way to bypass the overflow. Let’s look at the graph of $\sigma(x)$.

    It looks symmetric about $(0, \frac{1}{2})$. If that is the case, $$\begin{equation}
    \sigma(x)=1-\sigma(-x)
    \end{equation}$$ will have to hold. Let’s try to prove it.

    $$\begin{align}
    1-\sigma(-x) &= 1-\frac{1}{1+e^x} \\
    &=\frac{1+e^x}{1+e^x}-\frac{1}{1+e^x} \\
    &=\frac{1+e^x-1}{1+e^x} \\
    &=\frac{e^x}{1+e^x} \\
    &=\frac{1}{\frac{1}{e^x}+\frac{e^x}{e^x}} \\
    &=\frac{1}{e^{-x}+1} \\
    &=\sigma(x)
    \end{align}$$

    Done. It’s proven. Now, whenever $x<0$, we can switch $\sigma(x)$ to $1-\sigma(-x)$, which yields the same value and completely avoids the overflow problem, because exp never receives a large positive argument. The following is the code.

    double sigma(double x)
    {
        if (x >= 0)
            return 1.0/(1.0+exp(-x));
        /* x < 0: use the identity sigma(x) = 1 - sigma(-x) */
        return 1.0 - 1.0/(1.0+exp(x));
    }
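    A quick self-contained check of the stable version (typically compiled with something like cc sigma.c -lm); the extreme inputs that break the naive version stay well behaved here:

    #include <math.h>
    #include <stdio.h>

    /* Numerically stable sigmoid: switch to 1 - sigma(-x) when x < 0. */
    static double sigma(double x)
    {
        if (x >= 0)
            return 1.0 / (1.0 + exp(-x));
        return 1.0 - 1.0 / (1.0 + exp(x));
    }

    int main(void)
    {
        double xs[] = { -1000.0, -10.0, 0.0, 10.0, 1000.0 };
        for (int i = 0; i < 5; i++)
            printf("sigma(%+8.1f) = %.12f\n", xs[i], sigma(xs[i]));
        return 0;
    }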
  • Try Word Vector

    Explain The Idea

    Word vectors are widely used in text-related AI. The central idea is to maximize the following likelihood function.

    $$\begin{equation} \label{eq:originalL}
    L(\theta) = \prod_{t=1}^T \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta)
    \end{equation}$$

    Terrifying! Right? Let me explain it.

    Imagine you have a long text, which is a sequence of words. At any time, consider a word $w_t$ as the center of a window in the long word sequence. What is the probability of the words surrounding $w_t$? To rephrase: what is the probability of the surrounding words appearing in the window, given that $w_t$ is the center? This question can be approached in the following way.

    For any one specific occurrence of $w_t$, its surrounding is $\{ w_{t-m}, w_{t-m+1}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m-1}, w_{t+m} \}$. For each $w_{t+j}$, where $-m \le j \le m \land j \ne 0$, the probability of $w_{t+j}$ given $w_t$ is denoted $p(w_{t+j} | w_t ; \theta)$. The $\theta$ is the parameter to be trained; it guides how each probability is calculated while the formulae stay fixed. In fact, you can think of $\theta$ as a store. When computing $p(w_{t+j} | w_t ; \theta)$, $p$ takes the part belonging to $w_{t+j}$ out of $\theta$, takes the part belonging to $w_t$ out of $\theta$, then does the computation using a fixed formula. Multiplying all the $p(w_{t+j} | w_t ; \theta)$'s over $w_t$’s surrounding words, we get a probability estimate of one surrounding given $w_t$. This is the $$ \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta) $$ part.

    Now we have a probability estimate for one position in the long sequence of words. We use the same formula for all positions and multiply the probabilities to get the probability estimate of all windows appearing collectively, given the long word sequence. That whole probability estimate is $L(\theta)$. Note: the word sequence is the universe, so it is not explicitly included in the formula.

    $L(\theta)$ is clearly between $0$ and $1$, by definition. And the word sequence appears in reality. Therefore, ideally, $L(\theta)$ should be $1$. So, what we need to do is to maximize $L(\theta)$ to make it as close to $1$ as possible.

    In real computation, all the $p(w_{t+j} | w_t ; \theta)$’s are between $0$ and $1$. Their product will be extremely close to $0$, which causes underflow and loss of precision. Hence, we use $J(\theta)=-\frac{1}{T}log(L(\theta))$ instead. The minus sign turns the maximization into a minimization, since people are more comfortable with minimization and more software packages are tuned for it. The $\frac{1}{T}$ keeps the sum on the scale of a single log-probability value, so it does not grow indefinitely large in magnitude as $T$ grows. $J(\theta)$ expands to $$
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$

    Before working further, we must determine the formula for $p(w_{t+j} | w_t ; \theta)$. Any formula that constitutes a probability density function will do, but softmax is a well accepted one, so I will use softmax. It is also the function used in the original algorithm. The original algorithm uses two vectors for each word: $u_c$ when the word is the center word, and $u_o$ otherwise. Hence, $$ p(w_o | w_c ; \theta) = \frac{exp({u_o}^\top u_c)}{\sum_{q \in V}exp({u_q}^\top u_c)} $$, where $u_o$, $u_c$ and $u_q$ are taken out of $\theta$, and $V$ is the vocabulary, namely the set of all words. In this setting, each word ends up with two vectors, which then have to be merged, because eventually we want one vector per word. Here I think: why not just use one vector $u$ per word? I am going to try it. The formula looks identical: $$ p(w_a | w_b ; \theta) = \frac{exp({u_a}^\top u_b)}{\sum_{d \in V}exp({u_d}^\top u_b)} $$, it just no longer distinguishes center from surrounding. A side effect of this change is that the size of $\theta$ is halved.
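    As a concrete sketch of the single-vector softmax $p(w_a \mid w_b)$, here is toy C code over a tiny made-up vocabulary; the subtraction of the maximum score is a standard numerical safeguard that I add, not part of the formula above:

    #include <math.h>
    #include <stdio.h>

    #define V 4   /* vocabulary size     */
    #define D 3   /* embedding dimension */

    static double dot(const double *a, const double *b)
    {
        double s = 0.0;
        for (int k = 0; k < D; k++) s += a[k] * b[k];
        return s;
    }

    /* p(w_a | w_b) = exp(u_a . u_b) / sum_d exp(u_d . u_b), one vector per word */
    static double prob(const double u[V][D], int a, int b)
    {
        double scores[V], mx = -INFINITY, z = 0.0;
        for (int d = 0; d < V; d++) {
            scores[d] = dot(u[d], u[b]);
            if (scores[d] > mx) mx = scores[d];
        }
        for (int d = 0; d < V; d++)
            z += exp(scores[d] - mx);          /* subtract max for stability */
        return exp(dot(u[a], u[b]) - mx) / z;
    }

    int main(void)
    {
        double u[V][D] = { {0.1,0.2,0.3}, {0.0,-0.1,0.4}, {0.5,0.5,-0.2}, {-0.3,0.1,0.1} };
        printf("p(w_2 | w_0) = %f\n", prob(u, 2, 0));
        return 0;
    }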

    The full $J(\theta)$ is $$\begin{align}
    J(\theta)
    &=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta)) \\
    &=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)} \right) \label{eq:originalJ}
    \end{align}$$.

    Derive The Formula for Minimization

    To minimize $J(\theta)$ over $\theta$, I compute derivative of $J(\theta)$.

    $$ \begin{align}
    & \frac{\partial}{\partial \theta} log(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)}) \\
    &= \frac{\partial}{\partial \theta} \left[ log(exp({u_{t+j}}^\top u_t)) - log({\sum_{d \in V}exp({u_d}^\top u_t)}) \right] \\
    &= \frac{\partial}{\partial \theta} log(exp({u_{t+j}}^\top u_t)) - \frac{\partial}{\partial \theta} log({\sum_{d \in V}exp({u_d}^\top u_t)}) \\
    &= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} \frac{\partial}{\partial \theta} exp({u_d}^\top u_t)} \right] \\
    &= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} exp({u_d}^\top u_t) \frac{\partial}{\partial \theta} {u_d}^\top u_t} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} \sum_{d \in V} exp({u_d}^\top u_t) {u_d} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{\sum_{d \in V} exp({u_d}^\top u_t) {u_d}}{\sum_{f \in V}exp({u_f}^\top u_t)} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \sum_{d \in V} { \frac{exp({u_d}^\top u_t) }{\sum_{f \in V}exp({u_f}^\top u_t)} u_d } \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] \\
    \end{align}
    $$

    As the computational process will be iterative, we can treat $u_{t+j}$ as a constant with respect to $u_t$ in this step. So the last line becomes:

    $$ u_{t+j} - \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] $$

    Put them together:

    $$ \begin{align} \label{eq:derivativeJ}
    \frac{\partial J}{\partial \theta} = -\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ u_{t+j} - \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right]
    \end{align} $$

    So, we can now use $\frac{\partial J}{\partial \theta}$ to update $\theta$ so that $\frac{\partial J}{\partial \theta}$ will approach $0$. But how? Let me explain intuitively.

    You can think of $\frac{\partial J}{\partial \theta}$ as a slope: a vector pointing in the direction in which the function value grows, assuming the function is continuous and convex in a vicinity that contains both $\theta$ and the minimum of $J(\theta)$. So, if the direction of $\frac{\partial J}{\partial \theta}$ coincides with the growing direction of $J(\theta)$, a portion of $\frac{\partial J}{\partial \theta}$ should be subtracted from $\theta$. If the direction of $\frac{\partial J}{\partial \theta}$ opposes the growing direction of $J(\theta)$, a portion of the negated $\frac{\partial J}{\partial \theta}$ should be added to $\theta$, which again amounts to subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$. Therefore, subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$ is how $\theta$ is updated in the minimization. We repeat this update until no change in $\theta$ is perceivable. This is the Gradient Descent Algorithm. If the portion is not fixed but changes according to some heuristic, it is an adaptive gradient descent algorithm. As a formula, it is:

    $$ \theta_{k+1} \gets \theta_{k} - \alpha \frac{\partial J}{\partial \theta} $$
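    As a tiny, concrete illustration of this update rule (on a toy objective, not the word-vector objective itself), the following C snippet runs gradient descent on $J(\theta)=(\theta-3)^2$, whose derivative is $2(\theta-3)$:

    #include <stdio.h>

    /* Gradient descent on a toy objective J(t) = (t - 3)^2: repeatedly
       subtract a portion (alpha) of the slope from the parameter. */
    int main(void)
    {
        double theta = 10.0, alpha = 0.1;
        for (int k = 0; k < 100; k++) {
            double grad = 2.0 * (theta - 3.0);   /* dJ/dtheta */
            theta -= alpha * grad;
        }
        printf("theta converged to %f (expected 3.0)\n", theta);
        return 0;
    }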

    Computational Considerations

    In the process of computing $\frac{\partial J}{\partial \theta}$, we will need ${u_a}^\top u_b$ for all $a,b \in V$. Since $|V|$ can be very large, the memory required to store all of these would be formidably large, so storing them is better avoided.

    Look closely at the structure of $$p(w_d | w_t ; \theta) = \frac{exp({u_d}^\top u_t)}{\sum_{f \in V}exp({u_f}^\top u_t)}$$. It needs the dot products of $u_t$ with all other vectors $u_f$, $f \in V$ (and $t$ is itself just some $f$). Observe the structure of $J(\theta)$ again: $$
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$. We can see that in the inner sum $$
    \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$, the denominator of $p$ is computed repeatedly, and it does not depend on $j$. Hence the denominator of $p$ can be calculated once beforehand and reused.
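    A sketch of that caching in C, reusing the same toy vocabulary idea as before: the denominator is computed once for the center word and reused for every context position in the window.

    #include <math.h>
    #include <stdio.h>

    #define V 4
    #define D 3

    static double dot(const double *a, const double *b)
    {
        double s = 0.0;
        for (int k = 0; k < D; k++) s += a[k] * b[k];
        return s;
    }

    int main(void)
    {
        double u[V][D] = { {0.1,0.2,0.3}, {0.0,-0.1,0.4}, {0.5,0.5,-0.2}, {-0.3,0.1,0.1} };
        int center = 0;                     /* index of u_t                 */
        int context[4] = { 1, 2, 3, 2 };    /* the 2m surrounding word ids  */

        /* The denominator depends only on the center word: compute it once. */
        double denom = 0.0;
        for (int d = 0; d < V; d++)
            denom += exp(dot(u[d], u[center]));

        /* log p(w_{t+j} | w_t) = u_{t+j} . u_t - log(denom), reusing denom */
        double inner_sum = 0.0;
        for (int j = 0; j < 4; j++)
            inner_sum += dot(u[context[j]], u[center]) - log(denom);

        printf("inner sum for this window = %f\n", inner_sum);
        return 0;
    }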

    However, you may notice that $p(w_{t+j} | w_t ; \theta)$ still requires too much computation, as it must iterate over the whole vocabulary.

    Changing Probability Evaluation Method into An Intuitive One

    Let’s focus on $$\begin{align}
    log(p(w_{t+j} | w_t ; \theta))
    &= log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{f \in V}exp({u_f}^\top u_t)} \right) \\
    & \notag \\
    &= log(exp({u_{t+j}}^\top u_t)) \quad - \quad log(\sum_{f \in V}exp({u_f}^\top u_t)) \\
    &= \underbrace{{u_{t+j}}^\top u_t}_\text{part1} \quad - \quad \underbrace{log(\sum_{f \in V}exp({u_f}^\top u_t))}_\text{part2}
    \end{align}$$.
    Part1 is the linear part; it measures the similarity of two words in a vector space. Part2 subtracts a baseline: roughly, how similar $u_t$ is to all words on average. The heavy computational burden comes from part2, so we need an efficient, possibly approximate, alternative.

    Look deeper into part2. Suppose the average of all the exponentials is $E$. Then part2 is $log(\sum_{f \in V}E)=log(|V| \cdot E)$. This means part2 can be treated as a kind of average over the set of ${u_f}^\top u_t$'s, with the property that large $exp({u_f}^\top u_t)$'s dominate the value of the logged sum. Only when all the $exp({u_f}^\top u_t)$'s are close to $0$ does part2 become negative. In fact, becoming negative is not strictly necessary, because the negativity only helps when all the negative words are dissimilar to the center word. A similar effect is achieved by only punishing similarity between the negative words and the center word. With this observation, part2 becomes $$
    \sum_{f \in V}\Lambda({u_f}^\top u_t)
    $$, where $\Lambda$ is the rectified linear unit function $
    ReLU(x)=
    \begin{cases}
    x & \text{if } x \ge 0 \\
    0 & \text{if } x < 0
    \end{cases}
    $.

    Put them back into the formula of $J$: $$\begin{equation} \label{eq:alternativeL1}
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \underbrace{{u_{t+j}}^\top u_t}_\text{part1} \; - \;
    \underbrace{\sum_{f \in V}\Lambda({u_f}^\top u_t)}_\text{part2} \right]
    \end{equation}$$. This is a variation of Skip-Gram with Negative Sampling.

    The first summation $$
    \sum_{t=1}^T
    $$ can still be too big. One can update $\theta$ partially by selecting a sample of the positions $\{1,\dots,T\}$. This is what makes the algorithm stochastic: $$
    J(\theta)=-\frac{1}{|S|} \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K}\Lambda({u_f}^\top u_t) \right]
    $$, where $S$ is a sample of the positions $\{1,\dots,T\}$ and $K$ is a sample of words drawn from the vocabulary (its sampling is discussed below).
    Theoretically, if $|S|$ is fixed, $\frac{1}{|S|}$ can be omitted, and $J$ becomes $$\begin{equation} \label{eq:alternativeL2}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K} \Lambda({u_f}^\top u_t) \right]
    \end{equation}$$.
    However, removing $\frac{1}{|S|}$ may not be a very good idea. The reason is that $\frac{1}{|S|}$ scales the derivative down so that it does not grow too big as the number of terms increases. Without $\frac{1}{|S|}$, floating-point overflow can easily occur.

    There are two kinds of samples, $S$ and $K$, and they are sampled with different strategies. $S$ needs to cover the whole set of word positions, so sampling $S$ can be done by permuting $\{1,\dots,T\}$ and cutting the permutation into segments. $K$ needs to avoid the words in the current window, which in practice means it tends towards less frequent words. Two methods can be used. One is to sample the vocabulary uniformly; due to Zipf’s law, you will most probably hit less frequent words. The other is to sample with an explicit bias towards less frequent words. Which one is better? Have a try!
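    A toy C sketch of the two sampling ideas for $K$, with a made-up frequency table; the biased variant draws with probability proportional to the inverse frequency, which is just one possible way to lean towards rare words:

    #include <stdio.h>
    #include <stdlib.h>

    #define V 5

    /* Toy word frequencies (Zipf-like). */
    static const double freq[V] = { 1000.0, 300.0, 90.0, 25.0, 5.0 };

    /* Strategy 1: sample the vocabulary uniformly. */
    static int sample_uniform(void)
    {
        return rand() % V;
    }

    /* Strategy 2: sample with probability proportional to 1/frequency,
       explicitly biased towards less frequent words. */
    static int sample_biased(void)
    {
        double w[V], total = 0.0;
        for (int i = 0; i < V; i++) { w[i] = 1.0 / freq[i]; total += w[i]; }
        double r = total * rand() / (double)RAND_MAX;
        for (int i = 0; i < V; i++) {
            if (r < w[i]) return i;
            r -= w[i];
        }
        return V - 1;
    }

    int main(void)
    {
        srand(7);
        printf("uniform draw: %d, biased draw: %d\n", sample_uniform(), sample_biased());
        return 0;
    }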

    Now, consider a partial derivative $\frac{\partial J}{\partial u_i}$ with respect to a single word vector. Consider an arbitrary pair $(t, j)$ with $t \in S$ and $-m \le j \le m$, $j \ne 0$.

    $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$ can be nonzero only if either $w_{t+j}=w_i$ or $w_t=w_i$.
    – When $w_{t+j}=w_i$, $u_t$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$.
    – When $w_t=w_i$, $u_{t+j}$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$.

    Then, consider an arbitrary pair $(t,f) \in S \times K$.

    $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$ can be nonzero only if $w_i = w_t \vee w_i = w_f$.
    – When $w_i=w_t$, $-u_f$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_f}^\top u_t$.
    – When $w_i=w_f$, $-u_t$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_f}^\top u_t$.

    The above conditions are not exclusive, meaning that if a $w_i$ satisfies multiple conditions, multiple additive terms are involved. Because $j$ can take $2m$ values, the contribution of $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$ is repeated $2m$ times.

    To summarize, define $
    \Lambda'(x)=
    \begin{cases}
    1 & \text{if } x>0 \\
    0 & \text{if } x \le 0
    \end{cases}
    $ and $
    I(a,b)=\begin{cases}
    1 & \text{if }a=b \\
    0 & \text{if } a \ne b
    \end{cases}
    $, then, $$\begin{align}
    &\frac{\partial}{\partial u_i}J(\theta) \\
    &=-\frac{1}{|S|}\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[
    \frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K} \frac{\partial}{\partial u_i}\Lambda({u_f}^\top u_t) \right] \\
    &=- \left\{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|}\left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]
    \; - \;
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|}\Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    \right\} \\
    &=-
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]
    \; + \;
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right] \label{eq:derivativeJ2} \\
    \end{align}$$
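    In code, the indicator functions turn into a scatter-add: each $(t, j)$ pair and each negative sample simply adds its contribution into the gradient slots of the words involved. The following C sketch does this for a toy corpus, taking $S$ to be all positions and a fixed $K$; everything here is made up for illustration.

    #include <stdio.h>

    #define V 6     /* vocabulary size     */
    #define D 3     /* embedding dimension */
    #define T 8     /* corpus length       */
    #define M 2     /* window half-width m */

    int main(void)
    {
        /* Toy data: the corpus is a sequence of word ids; S is all positions. */
        int corpus[T] = { 0, 1, 2, 3, 1, 4, 5, 2 };
        int K[2]      = { 4, 5 };              /* negative sample word ids */
        double u[V][D] = {
            {0.1,0.2,0.0}, {0.0,-0.1,0.3}, {0.2,0.1,-0.1},
            {-0.2,0.0,0.1}, {0.3,-0.2,0.2}, {-0.1,0.1,0.2}
        };
        double grad[V][D] = {{0}};
        double invS = 1.0 / T;                 /* 1/|S| with S = all positions */

        for (int t = 0; t < T; t++) {
            int wc = corpus[t];
            /* window part: I(w_{t+j}, w_i) u_t and I(w_t, w_i) u_{t+j}
               become scatter-adds into the slots of w_{t+j} and w_t */
            for (int j = -M; j <= M; j++) {
                if (j == 0 || t + j < 0 || t + j >= T) continue;  /* toy boundary handling */
                int wo = corpus[t + j];
                for (int k = 0; k < D; k++) {
                    grad[wo][k] -= invS * u[wc][k];
                    grad[wc][k] -= invS * u[wo][k];
                }
            }
            /* negative part: active only where Lambda'(u_f . u_t) = 1 */
            for (int n = 0; n < 2; n++) {
                int wf = K[n];
                double d = 0.0;
                for (int k = 0; k < D; k++) d += u[wf][k] * u[wc][k];
                if (d <= 0.0) continue;
                for (int k = 0; k < D; k++) {
                    grad[wc][k] += 2 * M * invS * u[wf][k];
                    grad[wf][k] += 2 * M * invS * u[wc][k];
                }
            }
        }

        printf("grad[0] = (%f, %f, %f)\n", grad[0][0], grad[0][1], grad[0][2]);
        return 0;
    }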

    What Could Be the Problem?

    It turns out that some problems remain challenging.

    Non-Convexity

    First, let’s look back at equation \eqref{eq:originalJ} or \eqref{eq:alternativeL2}. Think of each of the $p(\bullet)$'s as an $x_i$. On any 2-D circle formed by a pair $x_i$ and $x_j$, the product $x_i x_j$ reaches its maximum when $x_i = x_j$ and its minimum when $x_i=-x_j$, and the same holds after applying $exp(\bullet)$. So $\vec{0}$ is indeed a saddle point. This means I cannot set the initial value of $\theta$ to $\vec{0}$, and it raises a risk during training that $\theta$ might converge towards $\vec{0}$. If that happens, I must at least retrain the model.

    Linear Dependency

    Looking closely at \eqref{eq:derivativeJ} and \eqref{eq:derivativeJ2}, you can see that the derivative can only update $\theta$ by multiples of the $u_t$'s and $u_f$'s. So, if the $u_t$'s and $u_f$'s are linearly dependent to begin with, the resulting $\theta$ can only consist of linearly dependent vectors. Therefore, I must initialize $\theta$ so that all the $u$'s are likely to be linearly independent, or at least find a way to introduce some disturbance that breaks the linear dependency.

    Magnitude

    $J(\theta)$ is driven by dot products of pairs of word vectors. Positive dot products tend to grow further in the positive direction, and negative dot products in the negative direction. Once the dot products grow large, they barely change the value of $\sigma$, and thus barely change $J(\theta)$. This is bad, because if a dot product gets too big it will eventually cause a floating-point overflow, which makes further computation meaningless. So I need a mechanism to restrict the magnitudes of the word vectors. One way is to add a squared $L_2$ norm penalty, which turns \eqref{eq:alternativeL2} into $$\begin{equation}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \frac{1}{|S|} {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K} \frac{1}{|S|} \Lambda({u_f}^\top u_t) \right] + \frac{D}{2} \theta^\top \theta
    \end{equation}$$.
    Here, $D$ is a constant for tuning the balance between model fidelity and vector magnitude. Then, \eqref{eq:derivativeJ2} becomes $$\begin{align}
    \frac{\partial J(\theta)}{\partial u_i}=
    -\underbrace{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]}_\text{Term1}
    \; + \;
    \underbrace{
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    }_\text{Term2}
    \; + \;
    \underbrace{ D u_i }_\text{Term3}
    \end{align}$$.

    In practice, a correct $D$ may be hard to estimate in advance. An alternative is to rescale the magnitude of the whole collection of word vectors to some constant between samples.
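    A sketch of that alternative: after each sample, rescale the whole collection of word vectors so that its overall norm returns to a fixed constant (the constant, like everything else here, is arbitrary).

    #include <math.h>
    #include <stdio.h>

    #define V 4
    #define D 3

    /* Rescale all word vectors so that the Frobenius norm of the whole
       collection equals `target`; called between samples instead of
       using the L2 penalty term. */
    static void rescale(double u[V][D], double target)
    {
        double sq = 0.0;
        for (int i = 0; i < V; i++)
            for (int k = 0; k < D; k++)
                sq += u[i][k] * u[i][k];
        if (sq == 0.0) return;                 /* nothing to scale */
        double s = target / sqrt(sq);
        for (int i = 0; i < V; i++)
            for (int k = 0; k < D; k++)
                u[i][k] *= s;
    }

    int main(void)
    {
        double u[V][D] = { {3,0,0}, {0,4,0}, {0,0,12}, {1,1,1} };
        rescale(u, 1.0);
        printf("u[2][2] after rescaling = %f\n", u[2][2]);
        return 0;
    }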

    Weighting

    In the formulae above, the two sequences $\{a,b,c,d,e\}$ and $\{d,a,c,e,b\}$ are treated the same. This is not good, because "Let me give you books" and "Let me give books you" clearly mean different things, and the latter is perhaps very rare. One way to alleviate this problem is to weight the dot products in $\text{Term1}$ according to their distance from the center word. So: $$\begin{equation}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \mu (1-\tau^{m-|j|+1}) {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K}\Lambda({u_f}^\top u_t) \right] + \frac{D}{2} \theta^\top \theta
    \end{equation}$$,
    where $0 < \tau \le 1$ is some weight decay constant and $\mu$ is a constant scaling factor. The reason for $\mu$ is that the weight $(1-\tau^{m-|j|+1})$ is at most $1$, so it tends to underweight $\text{Term1}$; $\mu$ is used to scale it back.

    Then $$\begin{align}
    \frac{\partial J(\theta)}{\partial u_i}=
    -\underbrace{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \mu (1-\tau^{m-|j|+1}) \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]}_\text{Term1′}
    \; + \;
    \underbrace{
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    }_\text{Term2′}
    \; + \;
    \underbrace{ D u_i }_\text{Term3′}
    \end{align}$$

    One way to calculate a $\mu$ is to set $$
    \mu=\frac{m}{\sum_{j=1}^m (1-\tau^{m-|j|+1})}
    $$.
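    For example, a few lines of C compute $\mu$ from $m$ and $\tau$ as defined above (the values of $m$ and $\tau$ are arbitrary):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int m = 5;            /* window half-width     */
        double tau = 0.8;     /* weight decay constant */

        /* mu = m / sum_{j=1..m} (1 - tau^(m - j + 1)) */
        double denom = 0.0;
        for (int j = 1; j <= m; j++)
            denom += 1.0 - pow(tau, m - j + 1);
        double mu = m / denom;

        printf("mu = %f\n", mu);
        return 0;
    }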

    Stability

    To prevent a random, imperfect initialization from determining the result prematurely, I added some random disturbance during training. And because only one sample is seen at a time, the results from each sample are merged with an exponential moving average.
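    A minimal sketch of these two tricks, with an arbitrary noise scale and EMA decay: a small random disturbance is added to the per-sample result, which is then merged into the running parameters with an exponential moving average.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 3   /* number of parameters in this toy example */

    int main(void)
    {
        srand(1);
        double theta[N]  = { 0.0, 0.0, 0.0 };   /* running (merged) parameters   */
        double sample[N] = { 0.5, -0.2, 0.8 };  /* result from the latest sample */
        double beta = 0.9;                      /* EMA decay                     */

        for (int i = 0; i < N; i++) {
            /* small random disturbance to break symmetry / linear dependency */
            double noise = 0.01 * ((double)rand() / RAND_MAX - 0.5);
            /* exponential moving average merge of the per-sample result */
            theta[i] = beta * theta[i] + (1.0 - beta) * (sample[i] + noise);
        }

        printf("theta = (%f, %f, %f)\n", theta[0], theta[1], theta[2]);
        return 0;
    }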