Month: January 2024

  • Compute Sigmoid Function Smart

    The sigmoid function frequently appears in machine learning contexts. It has a simple form: $$
    \sigma(x)=\frac{1}{1+e^{-x}}
    $$. A naive implementation in C would be:

    double sigma(double x)
    {
        return 1.0/(1.0+exp(-x));
    }

    It looks good, right? No, it is not. When $x << 0$, say $x=1000$, exp(-x) will yield NaN, even though $\sigma(x)$ approaches $0$. We need a way to bypass the NaN. Let’s look at the graph of $\sigma(x)$.

    It looks symmetric about $(0, \frac{1}{2})$. If it is the case, $$\begin{equation}
    \sigma(x)=1-\sigma(-x)
    \end{equation}$$ will have to hold. Let’s try to prove it.

    $$\begin{align}
    1-\sigma(-x) &= 1-\frac{1}{1+e^x} \\
    &=\frac{1+e^x}{1+e^x}-\frac{1}{1+e^x} \\
    &=\frac{1+e^x-1}{1+e^x} \\
    &=\frac{e^x}{1+e^x} \\
    &=\frac{1}{\frac{1}{e^x}+\frac{e^x}{e^x}} \\
    &=\frac{1}{e^{-x}+1} \\
    &=\sigma(x)
    \end{align}$$

    Done. It’s proven. Now, whenever $x<0$, we can switch $\sigma(x)$ to $1-\sigma(-x)$, which yields the same value and completely avoids the NaN problem. The following is the code.

    double sigma(double x)
    {
        if (x>=0)
            return 1.0/(1.0+exp(-x));
        return 1.0 - 1.0/(1.0+exp(x));
    }
  • HOWTO: Eclipse 2023-12 + Cygwin + GDB 13 — A Work Around

    It was said that GDB 11 or above does not work well with Eclipse with Cygwin. The true reason seems to be that Eclipse sends Windows-style path of your executable to GDB of Cygwin, while GDB of cygwin does not recognize “C:\” as beginning of an absolute path. So GDB will start end up some ugly combined path like

    /cygdrive/c/Users/yourname/Desktop/C:/store/your-project-name/C:/store/your-project-name/Debug/your-project-name.exe

    I haven’t found a beautiful way to work around it as GDB does not fix this problem, and Eclipse developers insist this as a problem of GDB. So, stalemate.

    But, I am not going to get jammed here. GDB wants that path? I give GDB the path.

    So, under Cygwin, I make a symbolic link like the following:

    mkdir -p "/cygdrive/c/Users/yourname/Desktop/C:/store/your-project-name/C:"
    cd "/cygdrive/c/Users/yourname/Desktop/C:/store/your-project-name/C:"
    ln -s /cygdrive/c/store store

    This will let GDB find your executable.

    What if you cannot be sure the path that GDB is looking at? Don’t worry! Follow this operation path. Window > Preferences > C/C++ > Debug > GDB, check “Show the GDB traces consoles with character limit:”. Run debugging again. Then in the console tab, open the console named “… gdb traces”. The path that GDB looks for will be in it.

    Eclipse cannot find source files when debugging? Don’t worry! I bet you see window reporting “Can’t find a source file at …”. Then click the “Edit Source Lookup Path …” button. Then Add > Path Mapping > OK >(enter a new mapping name, e.g. “cygwin”) > Add . Enter “/cygdrive/c/” as the “Compilcation path”, select “C:\” as Local file system path. Repeat this adding operation until all source files are found.

    By the way, if you get some strange exit code when starting GDB, like 0xc0000135 or something alike, be sure to look into environment variable settings. Your system PATH variable, project environment variables in Debug settings. Be sure to set a complete list of paths or nothing.

    But there is still a drawback. I still have not found how to make breakpoint work. …

    So, the better solution may still be installing gdb 9.2-1. If you cannot find it in your Cygwin setup program, try using another mirror.

  • Try Word Vector

    Explain The Idea

    Word vector is widely used in text-related AI processes. Its center idea is to maximize the following likelihood function.

    $$\begin{equation} \label{eq:originalL}
    L(\theta) = \prod_{t=1}^T \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta)
    \end{equation}$$

    Terrifying! Right? Let me explain it.

    Imagine as if you have a long text which is a sequence of words. At any time, consider a word $w_t$ as center of a window in the long word sequence. What is the probability of words surrounding $w_t$? Rephrase it, what is the probability of surrounding words appearing in a window, given $w_t$ is the center? Solution to this question can be considered in the following way.

    For any one specific occurrence of $w_t$, its surrounding is $\{ w_{-m}, w_{-m+1}, …, w_{t-1}, w_{t+1}, …, w_{m-1}, w_{m} \}$. For each of the $w_{t+j}$, where $-m \le j \le m \land j \ne 0$, the probability of $w_{t+j}$ given $w_t$ is just denoted as $p(w_{t+j} | w_t ; \theta)$. The $\theta$ there is the parameter to be trained, which guides how each probability is calculated while the formulae are fixed. In fact, you can think of $\theta$ as a store. When computing $p(w_{t+j} | w_t ; \theta)$, $p$ takes the part of $w_{t+j}$ out of $\theta$, takes the part of $w_t$ out of $\theta$, then do the computation using a fixed formula. Now multiply all $p(w_{t+j} | w_t ; \theta)$ for $w_t$’s surrounding words, we get a probability estimate of one surrounding given $w_t$. This is the $$ \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta) $$ part.

    Now, I have got a probability estimate for one position in the long sequence of words. We use the same formula for all positions, and multiply the probabilities to get the probability estimate of all windows appearing collectively, given the long word sequence. And the whole probability estimate is $L(\theta)$. Note: the word sequence is the universe, therefore it is not explicitly included in the formula.

    $L(\theta)$ is clearly between $0$ and $1$, by definition. And the word sequence appears in reality. Therefore, ideally, $L(\theta)$ should be $1$. So, what we need to do is to maximize $L(\theta)$ to make it as close to $1$ as possible.

    In real computation, all $p(w_{t+j} | w_t ; \theta)$’s are between $0$ and $1$. Their products will be extremely close to 0. This will cause underflow and loss of precision. Hence, we use $J(\theta)=-\frac{1}{T}log(L(\theta))$ instead. The minus sign changes maximization into minimization, as people are more comfortable with minimization and more software packages are tuned to minimization. The $\frac{1}{T}$ helps to constrain the sum in the magnitude of one log-probability value, namely the sum does not grow indefinitely large in magnitude as $T$ grows. $J(\theta)$ expands to $$
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$

    Before working further, we must determine the formula of $p(w_{t+j} | w_t ; \theta)$. Any formula that constitutes a probability density function will do. But softmax is a well acccepted one. So, I will use softmax. This is also the function used in the original algorithm. The original algorithm uses two vectors for each word. $u_c$ when the word considered is the center word, $u_o$ otherwise. Hence, $$ p(w_o | w_c ; \theta) = \frac{exp({u_o}^\top u_c)}{\sum_{q \in V}exp({u_q}^\top u_c)} $$, where $u_o$, $u_c$ and $u_q$ are taken out of $\theta$, and $V$ is the vocabulary, namely the set of all words. In this setting, each word will end up with two vectors. People will have to merge the two vectors for each word, because eventually we want one vector for each word. Here I think, why not just use one vector $u$ for each word? I am going to try it. The formula looks identical: $$ p(w_a | w_b ; \theta) = \frac{exp({u_a}^\top u_b)}{\sum_{d \in V}exp({u_d}^\top u_b)} $$, just it no longer distinguishes center from surrounding. A side effect of this change is that size of $\theta$ is halved.

    The full $J(\theta)$ is $$\begin{align}
    J(\theta)
    &=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta)) \\
    &=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)} \right) \label{eq:originalJ}
    \end{align}$$.

    Derive The Formula for Minimization

    To minimize $J(\theta)$ over $\theta$, I compute derivative of $J(\theta)$.

    $$ \begin{align}
    & \frac{\partial}{\partial \theta} log(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)}) \\
    &= \frac{\partial}{\partial \theta} \left[ log(exp({u_{t+j}}^\top u_t)) – log({\sum_{d \in V}exp({u_d}^\top u_t)}) \right] \\
    &= \frac{\partial}{\partial \theta} log(exp({u_{t+j}}^\top u_t)) – \frac{\partial}{\partial \theta} log({\sum_{d \in V}exp({u_d}^\top u_t)}) \\
    &= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} \frac{\partial}{\partial \theta} exp({u_d}^\top u_t)} \right] \\
    &= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} exp({u_d}^\top u_t) \frac{\partial}{\partial \theta} {u_d}^\top u_t} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} \sum_{d \in V} exp({u_d}^\top u_t) {u_d} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \frac{\sum_{d \in V} exp({u_d}^\top u_t) {u_d}}{\sum_{f \in V}exp({u_f}^\top u_t)} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \sum_{d \in V} { \frac{exp({u_d}^\top u_t) }{\sum_{f \in V}exp({u_f}^\top u_t)} u_d } \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] – \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] \\
    \end{align}
    $$

    As the computational process will be iterative, we can treat $u_{t+j}$ as constant parameters to $u_t$. So the last step becomes:

    $$ u_{t+j} – \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] $$

    Put them together:

    $$ \begin{align} \label{eq:derivativeJ}
    \frac{\partial J}{\partial \theta} = -\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ u_{t+j} – \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right]
    \end{align} $$

    So, we can now use $\frac{\partial J}{\partial \theta}$ to update $\theta$ so that $\frac{\partial J}{\partial \theta}$ will approach $0$. But how? Let me explain intuitively.

    You can think of $\frac{\partial J}{\partial \theta}$ as a slope. A slope is a vector pointing to the direction of changing function value, assuming the function is continuous and convex in a vicinity that contains both $\theta$ and the minimal point of $J(\theta)$. So, if the direction of $\frac{\partial J}{\partial \theta}$ coincides the growing direction of $J(\theta)$, a portion of $\frac{\partial J}{\partial \theta}$ need be subtracted from $\theta$. If the direction of $\frac{\partial J}{\partial \theta}$ contradicts the growing direction of $J(\theta)$, a portion of negated $\frac{\partial J}{\partial \theta}$ need be added to $\theta$, which amounts to also subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$. Therefore, subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$ is the method for updating $\theta$ in the minimization process. Repeatedly updating $\theta$ in this way until no changes in $\theta$ is perceivable. This is the Gradient Descent Algorithm. If the portion is not fixed, but changing due to some intuitive, then it is the Adaptive Gradient Descent Algorithm. In formula, it is:

    $$ \theta_{k+1} \gets \theta_{k} – \alpha \frac{\partial J}{\partial \theta} $$

    Computational Considerations

    In the process of computing $\frac{\partial J}{\partial \theta}$, we will need all ${u_a}^\top u_b$’s for all $a,b \in V$. As $|V|$ can be very large. The memory that are required for storing all ${u_a}^\top u_b$’s will be formidably large. So, storing them is better avoided.

    Look closely at structure of $$p(w_d | w_t ; \theta) = \frac{exp({u_d}^\top u_t)}{\sum_{f \in V}exp({u_f}^\top u_t)}$$. It needs dot products of $u_t$ with all other vectors $u_f$ for all $f \in V$. And $t$ is just some $f$. Observe structure of $J(\theta)$ again: $$
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$. We can find that in the inner sum $$
    \sum_{\substack{-m \le j \le m \\ j \ne 0}}^T log(p(w_{t+j} | w_t ; \theta))
    $$, denominator of $p$ is repeatedly computed, and it does not depend on $j$. Hence denominator of $p$ can be calculated beforehand and used repeatedly.

    However, you may notice that $p(w_{t+j} | w_t ; \theta)$ still asks for too much computation as it must iterate over the whole vocabulary.

    Changing Probability Evaluation Method into An Intuitive One

    Let’s focus on $$\begin{align}
    log(p(w_{t+j} | w_t ; \theta))
    &= log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)} \right) \\
    & \notag \\
    &= log(exp({u_{t+j}}^\top u_t)) \quad – \quad log(\sum_{f \in V}exp({u_f}^\top u_t)) \\
    &= \underbrace{{u_{t+j}}^\top u_t}_\text{part1} \quad – \quad \underbrace{log(\sum_{f \in V}exp({u_f}^\top u_t))}_\text{part2}
    \end{align}$$.
    Part1 is a linear part, it measures similarity of two words in a vector space. Part2 removes some ground similarity that $u_t$ is to all words on average. The heavy computation burden comes from part2. So, we need to come up with some efficient, yet possibly approximate, alternatives.

    Look deeper into part2. suppose the average of all the exponential is $E$. Then part2 is $log(\sum_{f \in V}E)=log(|V| \cdot E)$. This means that part2 can be treated as a kind of average over the set of ${u_f}^\top u_t$. It has the property that large $exp({u_f}^\top u_t)$’s will dominate the value of the logged sum. Only when all the $exp({u_f}^\top u_t)$’s are close to $0$, will the part2 become negative. In fact, becoming negative is not definitely necessary, because this negativity only helps when all the negative words are dissimilar to the center word. This achieves the similar effect of only punishing similarity between the negative words with the center word. With this observation, part2 becomes $$
    \sum_{f \in V}\Lambda({u_f}^\top u_t)
    $$, where $\Lambda$ is the rectified linear unit function $
    ReLU(x)=
    \begin{cases}
    x & \text{if } x \ge 0 \\
    0 & \text{if } x < 0
    \end{cases}
    $.

    Put them back into the formula of $J$: $$\begin{equation} \label{eq:alternativeL1}
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \underbrace{{u_{t+j}}^\top u_t}_\text{part1} \; – \;
    \underbrace{\sum_{f \in V}\Lambda({u_f}^\top u_t)}_\text{part2} \right]
    \end{equation}$$. This is a variation of Skip-Gram with Negative Sampling.

    The first summation $$
    \sum_{t=1}^T
    $$ can still be too big. One can update $\theta$ partially by selecting a sample from the indices $\{w_1,…,w_T\}$. This entitles the algorithm Stochastic: $$
    J(\theta)=-\frac{1}{|S|} \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ {u_{t+j}}^\top u_t \; – \;
    \sum_{f \in K}\Lambda({u_f}^\top u_t) \right]
    $$, where $S$ is a sample of $\{w_1,…,w_T\}$.
    Theoretically, if $|S|$ is fixed, $\frac{1}{|S|}$ can be omitted. And $J$ becomes $$\begin{equation} \label{eq:alternativeL2}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ {u_{t+j}}^\top u_t \; – \;
    \sum_{f \in K} \Lambda({u_f}^\top u_t) \right]
    \end{equation}$$.
    However, removing $\frac{1}{|S|}$ may not be a very good idea. The reason is that $\frac{1}{|S|}$ scales down the derivative, so that it does not grow too big as the number of involved terms increase. Without $\frac{1}{|S|}$, it can easy cause floating-point number overflow.

    There are two kinds of samples $S$ and $K$. They are sampled using different strategies. $S$ needs to cover the whole set of word positions, hence sampling of $S$ can be a permutation of $\{w_1,…,w_T\}$ cut into segments. $K$ needs to avoid words in current window, hence it is almost a tendency to cover less frequent words. Two methods can be used. One is to sample the vocabulary uniformly. Due to Zipf’s law, you will most probably always hit less frequent words. The other is to sample the uniform distribution while taking a bias towards less frequent words. Which one is better? Have a try!

    Now, consider any partial derivative $\frac{\partial J}{\partial u_i}$ on a single variable. Consider an arbitrary pair $(t, j) \in S \times \{-m \le j \le +m \wedge j \ne 0 \}$.

    $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$ can be nonzero only if either $w_{t+j}=w_i$ or $w_t=w_i$.
    – When $w_{t+j}=w_i$, $u_t$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$.
    – When $w_t=w_i$, $u_{t+j}$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$.

    Then, consider an arbitrary pair $(t,f) \in S \times K$.

    $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$ can be nonzero only if $w_i = w_t \vee w_i = w_f$.
    – When $w_i=w_t$, $-u_f$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_f}^\top u_t$.
    – When $w_i=w_f$, $-u_t$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_f}^\top u_t$.

    The above conditions are not exclusive, meaning, if a $w_i$ satisfies multiple conditions, multiple additive terms will be involved. Because $j$ can have $2m$ values, the expansion of $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$ will be repeated for $2m$ times.

    To summarize, define $
    \Lambda'(x)=
    \begin{cases}
    1 & \text{if } x>0 \\
    0 & \text{if } x \le 0
    \end{cases}
    $ and $
    I(a,b)=\begin{cases}
    1 & \text{if }a=b \\
    0 & \text{if } a \ne b
    \end{cases}
    $, then, $$\begin{align}
    &\frac{\partial}{\partial u_i}J(\theta) \\
    &=-\frac{1}{|S|}\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[
    \frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t \; – \;
    \sum_{f \in K} \frac{\partial}{\partial u_i}\Lambda({u_f}^\top u_t) \right] \\
    &=- \left\{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|}\left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]
    \; – \;
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|}\Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    \right\} \\
    &=-
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]
    \; + \;
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right] \label{eq:derivativeJ2} \\
    \end{align}$$

    What Could Be the Problem?

    It turns out that still some problems remain challenging.

    Non-Convexity

    First, let’s look back at equation \eqref{eq:originalJ} or \eqref{eq:alternativeL2}. Think of each $p(\bullet)$s as $x_i$. On any 2-D circle formed by any of $x_i$ and $x_j$, $x_i x_j$ reaches its maximum when $x_i = x_j$, and minimum when $x_i=-x_j$. So do they when applied $exp(\bullet)$. So, $\vec{0}$ is indeed a saddle point. This means that I cannot set initial values of $\theta$ to $\vec{0}$. And this raises a risk in training that $\theta$ might converge to $\vec{0}$. At least, if this happens, I must retrain the model.

    Linear Dependency

    Look close at \eqref{eq:derivativeJ} and \eqref{eq:derivativeJ2}, you can find that the derivative can only update $\theta$ in multiple of $u_t$ and $u_f$. So, if the $u_t$ and $u_f$ are linearly-dependent to begin with, the resulting $\theta$ can only consist of linearly-dependent vectors. So, I must initialize $\theta$ so that all the $u$’s are likely to be linear-independent, or at least find a way to introduce some disturbance to break the linear-dependency.

    Magnitude

    $J(\theta)$ is affected by dot products of two word vectors. Positive dot products tend to be greater in positive direction, negative dot products tend to be greater in negative direction. After the dot products growing big, they almost do not affect value of $\sigma$, thus almost do not affect value of $J(\theta)$. This is bad. Because, if a dot product is two big, it will definitely cause a float-pointing overflow, which will make further computation meaningless. So, I will need a mechanism to restrict magnitudes of the word vectors. One way to do it is to add a squared $L_2$ norm. This turns \eqref{eq:alternativeL2} into $$\begin{equation}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \frac{1}{|S|} {u_{t+j}}^\top u_t \; – \;
    \sum_{f \in K} \frac{1}{|S|} \Lambda({u_f}^\top u_t) \right] + \frac{D}{2} \theta^\top \theta
    \end{equation}$$.
    Here, $D$ is a constant for tuning the balance between model fidelity and vector magnitude. Then, \eqref{eq:derivativeJ2} becomes $$\begin{align}
    \frac{\partial J(\theta)}{\partial u_i}=
    -\underbrace{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ \ j \ne 0}} \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]}_\text{Term1}
    \; + \;
    \underbrace{
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    }_\text{Term2}
    \; + \;
    \underbrace{ D u_i }_\text{Term3}
    \end{align}$$

    .

    In practice, a correct $D$ may be hard to estimate in advance. Another method could be scaling magnitude of the whole collection of word vectors to some constant, between each sample.

    Weighting

    In the above formulae, the two sequences $\{a,b,c,d,e\}$ and $\{d,a,c,e,b\}$ are indifferent. This is not good. Because, clearly, “Let me give you books” and “Let me give books you” clearly means different action direction. And the latter one is perhaps very rare. One method to alleviate this problem is weight the dot products in $\text{Term1}$ according to their distance from the center words. So: $$\begin{equation}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \mu (1-\tau^{m-|j|+1}) {u_{t+j}}^\top u_t \; – \;
    \sum_{f \in K}\Lambda({u_f}^\top u_t) \right] + \frac{D}{2} \theta^\top \theta
    \end{equation}$$,
    where $0 < \tau \le 1$ is some weight decay constant, and $\mu$ is a constant scaling factor. The reason for $\mu$ is that $\tau^{|j|-1}$ is at most $1$, this weight is likely to underweight $\text{Term1}$, so $\mu$ is used to scale it back.

    Then $$\begin{align}
    \frac{\partial J(\theta)}{\partial u_i}=
    -\underbrace{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ \ j \ne 0}} \mu (1-\tau^{m-|j|+1}) \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]}_\text{Term1′}
    \; + \;
    \underbrace{
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    }_\text{Term2′}
    \; + \;
    \underbrace{ D u_i }_\text{Term3′}
    \end{align}$$

    One way to calculate a $\mu$ is to set $$
    \mu=\frac{m}{\sum_{j=1}^m (1-\tau^{m-|j|+1})}
    $$.

    Stability

    To avoid a random imperfect initialization to determine the result prematurely, I added some random disturbance during the training. And because each time only a sample is seen, results from each sample are merged with exponential moving average.