Blog

  • Is the vibe-coding boom finally about to pass?

    Video sites are still abuzz with vibe coding. However, some results that do not invite optimism are already showing up in the data. Recently, the following one caught my eye.

    The AI Productivity Boom that Wasn’t | L&DeepDive

    To summarize: veteran programmers using generative AI had the illusion of being 20% faster, when in fact they ended up 20% slower.

    With that, the pretty bubble of vibe coding has all but burst.

    Vibe coding was already predicted to pile up technical debt at lightning speed. With this result added, it has now also been confirmed once more to drag down productivity.

    Of course, generative AI is one direction of progress, but cramming the enormously complex context of a real business environment into a single generative AI at low cost will be quite difficult.

    The company behind Claude Code has declared that 80%-90% of its code is written by AI, but that can be seen as part of building its own product, where the tolerance for rising costs is much higher than in ordinary project development.

    It looks like vibe coding still has a long way to go before it becomes the main force behind production code.

  • I made a browser extension that helps generate low-code test cases.

    Are you a pure programming lover? Or are you someone who doesn’t dare to take on programming?

    I’d prefer the latter as testers. The reason is that people who are not programmers tend to focus more on how users see the system. The problem, then, is how to let them build automated tests.

    Asking testers to write even low-code tests can be a bit overwhelming. The only realistic solution seems to be record and replay. But most tools record straight to Python or JavaScript code, and I do not want source code shoved in testers’ faces.

    How to solve this dilemma?

    My answer is a self-made browser extension! I call it the CommandRecorder.
    The CommandRecorder simply records user actions and replays them in order. The code it generates is in Japanese! It is accompanied by a test runner hidden behind a GUI program; testers just need to select which case file and which sheet to run.

    The test case books are in Excel format. An Excel file offers some conveniences that a plain text file cannot:
    – Sheets can use powerful formulae.
    – Cells are neatly aligned.
    – To repeat a pattern, you just drag.
    – Data areas can be cleanly separated by purpose across sheets.
    – Editing is straightforward; you don’t have to learn a structured grammar.

    The only difficulty with Excel files is the lack of an effective comparison tool for spotting version differences. But what the hell! I just need to find time to make one.

    One more thing: have you noticed that the output of the CommandRecorder is in Japanese? Yes, a natural language! This makes the generated code very easy to understand. Even someone completely unaware of the system and of any programming language can understand what a test does. This largely flattens the learning curve.

  • Rust Exercise: Divisibility

    Problem

    Theorem: Take any positive integer with at least two digits, remove its ones digit, and subtract five times that digit from the remaining number; the original number is a multiple of 17 if and only if the result is a multiple of 17.

    For example, 34 is such a multiple, because 3 - 20 = -17 is a multiple of 17. 201 is not a multiple of 17, because 20 - 5 = 15 is not. Given a positive integer n as input, your task is to decide whether it is a multiple of 17.

    Input format

    The input file contains at most 10 test cases. Each test case occupies one line holding a single integer n (1<=n<=10^100), the number to be judged. A line with n=0 marks the end of input and should not be processed.

    Output format

    For each test case, output one line telling whether the corresponding n is a multiple of 17: 1 means yes, 0 means no.

    Sample input

    34
    201
    2098765413
    1717171717171717171717171717171717171717171717171718
    0

    Sample output

    1
    0
    1
    0

    Analysis

    The difficulty of this problem lies in the range of n. Even a 64-bit integer type covers only about 20 decimal digits, which is useless for an n of up to 100 digits. In other words, no built-in type of the programming language can represent n, so we must build a type (or something like one) that can.

    Also, by the theorem above, whether n is a multiple of 17 can be decided through another integer with one fewer digit.

    That is, the question of whether $n=n_0$ is a multiple of 17 reduces to the question:
    ・whether $n_1$ is a multiple of 17, which in turn reduces to
    ・whether $n_2$ is a multiple of 17, which in turn reduces to
    ・whether $n_3$ is a multiple of 17, which in turn reduces to
    ・……
    with $n_1 > 10 n_2 > 10^2 n_3 > 10^3 n_4 > ……$

    So once some $n_p$ becomes small enough to fit in a built-in type, ordinary division can decide whether it is a multiple of 17.

    The theorem involves a multiplication and a subtraction. The numbers entering the multiplication are small, while the numbers entering the subtraction are large. Hence we need to implement subtraction of big numbers.
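    Before building the big-number machinery, the reduction above can be sanity-checked with a quick Python sketch, relying on Python's built-in big integers (a throwaway check, not part of the Rust solution; the function name is mine):

```python
def is_mult_17(n: int) -> bool:
    # Repeatedly apply the theorem: n = 10*q + d is a multiple of 17
    # iff q - 5*d is. For n >= 1000 we have q >= 100 > 5*d, so the
    # value stays positive until it is small enough for a direct check.
    while n >= 1000:
        q, d = divmod(n, 10)
        n = q - 5 * d
    return n % 17 == 0

print(is_mult_17(34))    # True
print(is_mult_17(201))   # False
```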

    Solution

    use std::io;
    
    fn main() {
        let mut line = String::new();
        while io::stdin().read_line(&mut line).unwrap_or(0) > 0 {
            line = String::from(line.trim());
            if line == "0" {
                break;
            }
            let bn = BigNumber { digits: line.clone() };
            println!("{}", if is_divisible_by_17(&bn) { 1 } else { 0 });
            line.clear()
        }
    }
    
    struct BigNumber {
        digits: String
    }
    
    impl BigNumber {
        fn is_negative(&self) -> bool {
            if let Some(first_char) = self.digits.chars().next() {
                match first_char {
                    '-' => true,
                    _ => false
                }
            }
            else {
                false
            }
        }
    
        fn subtract(&self, other: &BigNumber) -> BigNumber {
            let mut result_digits = String::new();
            let mut ards : Vec<i8> = Vec::new();
            let mut brds : Vec<i8> = Vec::new();
            for c in self.digits.chars().rev() {
                match c {
                    '0'..='9' => ards.push((c as u8 - '0' as u8) as i8),
                    _ => panic!("Only unsigned numbers are allowed in minuend")
                }
            }
            for c in other.digits.chars().rev() {
                match c {
                    '0'..='9' => brds.push((c as u8 - '0' as u8) as i8),
                    _ => panic!("Only unsigned numbers are allowed in subtrahend")
                }
            }
            let mut diffds : Vec<i8> = Vec::new();
            let mut carry : i8 = 0;
            let maxlen = if ards.len() > brds.len() {ards.len()} else {brds.len()};
            for di in 0..maxlen {
                let mut df = 
                    if di >= ards.len() {0} else {ards[di]}
                    - if di >= brds.len() {0} else {brds[di]}
                    - carry;
                if df < 0 {
                    carry = 1;
                    df += 10;
                }
                else {
                    carry = 0;
                }
                diffds.push(df);
            }
            if carry == 1 {
                // The result is negative, and need be complemented.
                for d in diffds.iter_mut() {
                    *d = 9 - *d + carry;
                    if *d < 10 {
                        carry = 0;
                    }
                    else {
                        *d -= 10;
                    }
                }
                result_digits.push('-');
            }
            for d in diffds.iter().rev() {
                result_digits.push(('0' as u8 + *d as u8) as char);
            }
            BigNumber {
                digits: result_digits
            }
        }
    
        fn last_digit(&self) -> Option<i8> {
            if let Some(c) = self.digits.chars().rev().next() {
                Some((c as u32 - '0' as u32) as i8)
            }
            else {
                None
            }
        }
    
        fn remove_last_digit(&mut self) {
            if self.is_negative() {
                if self.digits.len() > 2 {
                    self.digits.pop();
                }
                else {
                    self.digits.clear();
                    self.digits.push('0');
                }
            }
            else {
                if self.digits.len() > 1 {
                    self.digits.pop();
                }
                else {
                    self.digits.clear();
                    self.digits.push('0');
                }
            }
        }
    }
    
    fn is_divisible_by_17(number: &BigNumber) -> bool {
        if number.digits.len() < 4 {
            let intval = number.digits.parse::<i32>().unwrap();
            return intval % 17 == 0;
        }
        let last_digit_times_5: u32 = number.last_digit().unwrap() as u32 * 5;
        let mut shortened = BigNumber {digits: number.digits.clone()};
        shortened.remove_last_digit();
        let subtrahend = BigNumber {digits: last_digit_times_5.to_string()};
        let difference = shortened.subtract(&subtrahend);
        return is_divisible_by_17(&difference);
    }
  • Rust Exercise: Checking My Little Brother’s Arithmetic

    Problem

    My little brother has been practicing addition and subtraction within 100; please check his work for him. Each problem has the form a+b=c or a-b=c, where a and b are non-negative integers not exceeding 100. c is the answer my brother computed: either a non-negative integer within 200, or a single “?”, which means he could not answer.

    Input format

    The input has at most 100 lines and ends at EOF. Each line contains one problem in the format specified above and contains no spaces. None of the integers in the input carries unnecessary leading zeros.

    Output format

    Output a single line containing one non-negative integer: the number of problems my brother answered correctly.

    Sample input

    1+2=3
    3-1=5
    6+7=?
    99-0=99

    Solution

    use std::io;
    use core::iter::Peekable;
    use core::str::Chars;
    pub struct Expression {
        a: i32,
        b: i32,
        op: char,
        answer: i32,
        unanswered: bool
    }
    
    fn main() {
        let mut line: String = String::new();
        let mut correct_count: i32 = 0;
        while io::stdin().read_line(&mut line).unwrap() > 0 {
            let mut char_seq: Peekable<Chars<'_>> = line.trim().chars().peekable();
            let a: i32   = read_number(&mut char_seq).unwrap().unwrap();
            let op: char = char_seq.next().unwrap();
            let b: i32   = read_number(&mut char_seq).unwrap().unwrap();
            assert!(char_seq.next().unwrap()=='=');
            let got_answer = read_number(&mut char_seq).unwrap();
            let expression = Expression {
                a: a,
                op: op,
                b: b,
                answer: got_answer.unwrap_or_default(),
                unanswered: got_answer.is_none()
            };
            match check_expression(&expression) {
                Ok(r) => correct_count += if r {1} else {0},
                Err(e) => println!("Error: {}", e)
            }
            line.clear();
        }
        println!("{}", correct_count);
    }
    
    fn read_number(text_iter: &mut Peekable<Chars<'_>>) -> Result<Option<i32>, &'static str> {
        let mut number_str: String = String::new();
        while let Some(&c) = text_iter.peek() {
            match c {
                c if c >= '0' && c <= '9' => {
                    number_str.push(c);
                    text_iter.next();
                },
                '?' => {
                    number_str.push(c);
                    text_iter.next();
                    break;
                },
                _ => break
            }
        }
        if number_str == "?" {
            Ok(None)
        } else if number_str.len() > 0 {
            Ok(Some(number_str.parse().unwrap()))
        }
        else {
            Err("Reading a number while not at a number.")
        }
    }
    
    fn check_expression(expr: &Expression) -> Result<bool, &'static str> {
        let expected: i32;
        match expr.op {
            '+' => expected = expr.a + expr.b,
            '-' => expected = expr.a - expr.b,
            '*' => expected = expr.a * expr.b,
            '/' => expected = expr.a / expr.b,
            _ => return Err("Invalid operator.")
        }
        Ok(!expr.unanswered && expr.answer == expected)
    }
  • $\frac{1}{2022}+\frac{5}{2022}+\frac{7}{2022}+…+$: summing the irreducible proper fractions.

    Since only irreducible proper fractions are added, any fraction whose numerator is a multiple of a prime factor of the denominator is excluded from the sum.

    2022 has three prime factors: 2, 3, and 337. So the problem becomes counting how many numbers in $U=\{1,2,3,…,2021\}$ are multiples of neither 2, nor 3, nor 337. To keep the formulas readable, define some notation: let $A$ be the set of multiples of 2 in $U$, let $B$ be the set of multiples of 3 in $U$, and let $C$ be the set of multiples of 337 in $U$. The count we want is then $|U|-|A \cup B \cup C|$.

    $$\begin{align}
    |A|&=\lfloor 2021 \div 2 \rfloor = 1010 \\
    |B|&=\lfloor 2021 \div 3 \rfloor = 673 \\
    |C|&=\lfloor 2021 \div 337 \rfloor = 5 \\
    |A \cap B|&=\lfloor 2021 \div 2 \div 3 \rfloor = 336 \\
    |A \cap C|&=\lfloor 2021 \div 2 \div 337 \rfloor = 2 \\
    |B \cap C|&=\lfloor 2021 \div 3 \div 337 \rfloor = 1 \\
    |A \cap B \cap C|&=\lfloor 2021 \div 2 \div 3 \div 337 \rfloor = 0 \\
    \end{align}$$

    $$\begin{align}
    |A \cup B \cup C|&=|A|+|B|+|C|-|A \cap B|-|A \cap C|-|B \cap C|+|A \cap B \cap C| \\
    &=1010+673+5-336-2-1+0 \\
    &=1349
    \end{align}$$

    $$\begin{align}
    |U|-|A \cup B \cup C|=2021-1349=672
    \end{align}$$

    So the sum $\frac{1}{2022}+\frac{5}{2022}+\frac{7}{2022}+…+$ has 672 terms in total.
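    A quick brute-force count in Python confirms the inclusion-exclusion result:

```python
from math import gcd

# Count the numerators in {1, ..., 2021} that are coprime to 2022.
count = sum(1 for a in range(1, 2022) if gcd(a, 2022) == 1)
print(count)  # 672
```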

    Let us look more closely at the structure of this sum. Suppose $gcd(a, 2022)=f$ and let $gcd(2022-a, 2022)=k$; we determine $k$. For any integer $d$,

    $$\begin{align}
    & (d \mid 2022-a) \wedge (d \mid 2022) \\
    & \iff (d \mid a) \wedge (d \mid 2022)
    \end{align}$$

    That is, any common divisor $d$ of $2022-a$ and $2022$ is also a common divisor of $a$ and $2022$, and vice versa. Hence $gcd(a, 2022)=gcd(2022-a, 2022)$.

    In the sum, the numerators grow from left to right. So if $\frac{a}{2022}$ is the $n$-th term from the left, then $\frac{2022-a}{2022}$ is the $n$-th term from the right. Pairing $\frac{a}{2022}$ with $\frac{2022-a}{2022}$, each pair sums to $\frac{a}{2022}+\frac{2022-a}{2022}=1$. Since there are 672 terms in total, there are 336 pairs. Hence the value of the whole sum is 336.

    Let’s add everything up with a simple Python program to confirm the result.

    >>> from math import gcd
    >>> total = 0
    >>> for i in range(1, 2022):
    ...     total += 0 if gcd(i, 2022) > 1 else i
    ...
    >>> print(total / 2022)
    336.0

    The result matches exactly.

  • Compute Sigmoid Function Smart

    The sigmoid function frequently appears in machine learning contexts. It has a simple form: $$
    \sigma(x)=\frac{1}{1+e^{-x}}
    $$. A naive implementation in C would be:

    double sigma(double x)
    {
        return 1.0/(1.0+exp(-x));
    }

    It looks good, right? No, it is not. When $x \ll 0$, say $x=-1000$, exp(-x) overflows, and the computation degenerates into infinity arithmetic (or, depending on the environment, NaN), even though $\sigma(x)$ simply approaches $0$. We need a way to bypass the overflow. Let’s look at the graph of $\sigma(x)$.

    It looks symmetric about $(0, \frac{1}{2})$. If that is the case, $$\begin{equation}
    \sigma(x)=1-\sigma(-x)
    \end{equation}$$ must hold. Let’s try to prove it.

    $$\begin{align}
    1-\sigma(-x) &= 1-\frac{1}{1+e^x} \\
    &=\frac{1+e^x}{1+e^x}-\frac{1}{1+e^x} \\
    &=\frac{1+e^x-1}{1+e^x} \\
    &=\frac{e^x}{1+e^x} \\
    &=\frac{1}{\frac{1}{e^x}+\frac{e^x}{e^x}} \\
    &=\frac{1}{e^{-x}+1} \\
    &=\sigma(x)
    \end{align}$$

    Done. It’s proven. Now, whenever $x<0$, we can compute $\sigma(x)$ as $1-\sigma(-x)$, which by the derivation above equals $\frac{e^x}{1+e^x}$. This form only exponentiates a negative argument, so it cannot overflow, and using it directly also avoids the cancellation that literally computing $1-\frac{1}{1+e^x}$ would suffer for very negative $x$. The following is the code.

    double sigma(double x)
    {
        if (x >= 0)
            return 1.0/(1.0+exp(-x));
        double ex = exp(x);   /* x < 0 here, so exp(x) <= 1: no overflow */
        return ex/(1.0+ex);   /* equals 1 - sigma(-x), without cancellation */
    }
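    The same branching can be mirrored in Python to spot-check the extreme inputs (a quick sanity test of the idea, not part of the C code):

```python
from math import exp

def sigma(x: float) -> float:
    # Only ever exponentiate a non-positive value, so exp cannot overflow;
    # at worst it underflows harmlessly to 0.0.
    if x >= 0:
        return 1.0 / (1.0 + exp(-x))
    ex = exp(x)  # x < 0 here
    return ex / (1.0 + ex)

print(sigma(-1000.0))  # 0.0
print(sigma(1000.0))   # 1.0
print(sigma(0.0))      # 0.5
```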
  • HOWTO: Eclipse 2023-12 + Cygwin + GDB 13 — A Workaround

    It is said that GDB 11 or above does not work well with Eclipse under Cygwin. The real reason seems to be that Eclipse sends the Windows-style path of your executable to Cygwin’s GDB, while Cygwin’s GDB does not recognize “C:\” as the beginning of an absolute path. So GDB ends up with some ugly combined path like

    /cygdrive/c/Users/yourname/Desktop/C:/store/your-project-name/C:/store/your-project-name/Debug/your-project-name.exe

    I haven’t found a clean way to work around it: GDB does not fix this problem, and the Eclipse developers insist it is a GDB problem. So, stalemate.

    But, I am not going to get jammed here. GDB wants that path? I give GDB the path.

    So, under Cygwin, I make a symbolic link like the following:

    mkdir -p "/cygdrive/c/Users/yourname/Desktop/C:/store/your-project-name/C:"
    cd "/cygdrive/c/Users/yourname/Desktop/C:/store/your-project-name/C:"
    ln -s /cygdrive/c/store store

    This will let GDB find your executable.

    What if you cannot be sure which path GDB is looking at? Don’t worry! Follow this menu path: Window > Preferences > C/C++ > Debug > GDB, and check “Show the GDB traces consoles with character limit:”. Run debugging again. Then, in the Console tab, open the console named “… gdb traces”. The path that GDB looks for will be in it.

    Eclipse cannot find source files when debugging? Don’t worry! I bet you see a window reporting “Can’t find a source file at …”. Click the “Edit Source Lookup Path …” button, then Add > Path Mapping > OK > (enter a new mapping name, e.g. “cygwin”) > Add. Enter “/cygdrive/c/” as the “Compilation path” and select “C:\” as the local file system path. Repeat this adding operation until all source files are found.

    By the way, if you get a strange exit code when starting GDB, like 0xc0000135 or something alike, be sure to look into your environment variable settings: the system PATH variable, and the project environment variables in the Debug settings. Be sure to set either a complete list of paths or nothing at all.

    But there is still a drawback: I have not yet found how to make breakpoints work. …

    So the better solution may still be to install gdb 9.2-1. If you cannot find it in your Cygwin setup program, try using another mirror.

  • Try Word Vector

    The Idea, Explained

    Word vectors are widely used in text-related AI processes. The central idea is to maximize the following likelihood function.

    $$\begin{equation} \label{eq:originalL}
    L(\theta) = \prod_{t=1}^T \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta)
    \end{equation}$$

    Terrifying! Right? Let me explain it.

    Imagine you have a long text, which is a sequence of words. At any time, consider a word $w_t$ as the center of a window in that sequence. What is the probability of the words surrounding $w_t$? Rephrased: what is the probability of the surrounding words appearing in a window, given that $w_t$ is the center? This question can be approached in the following way.

    For any one specific occurrence of $w_t$, its surrounding is $\{ w_{t-m}, w_{t-m+1}, …, w_{t-1}, w_{t+1}, …, w_{t+m-1}, w_{t+m} \}$. For each $w_{t+j}$, where $-m \le j \le m \land j \ne 0$, the probability of $w_{t+j}$ given $w_t$ is denoted $p(w_{t+j} | w_t ; \theta)$. The $\theta$ there is the parameter to be trained; it guides how each probability is calculated while the formulae themselves stay fixed. In fact, you can think of $\theta$ as a store: when computing $p(w_{t+j} | w_t ; \theta)$, $p$ takes the part for $w_{t+j}$ out of $\theta$, takes the part for $w_t$ out of $\theta$, then does the computation using a fixed formula. Now multiply the $p(w_{t+j} | w_t ; \theta)$ over all of $w_t$’s surrounding words, and we get a probability estimate of one surrounding given $w_t$. This is the $$ \prod_{\substack{-m \le j \le m \\ j \ne 0}} p(w_{t+j} | w_t ; \theta) $$ part.

    Now we have a probability estimate for one position in the long sequence of words. Using the same formula for all positions and multiplying the probabilities gives the probability estimate of all windows appearing collectively, given the long word sequence. That whole estimate is $L(\theta)$. Note: the word sequence is the universe here, so it is not explicitly included in the formula.

    $L(\theta)$ is clearly between $0$ and $1$, by definition. And the word sequence does appear in reality. Therefore, ideally, $L(\theta)$ should be $1$. So what we need to do is maximize $L(\theta)$, making it as close to $1$ as possible.

    In real computation, all the $p(w_{t+j} | w_t ; \theta)$’s are between $0$ and $1$, so their product will be extremely close to $0$. That causes underflow and loss of precision. Hence, we use $J(\theta)=-\frac{1}{T}log(L(\theta))$ instead. The minus sign turns maximization into minimization, as people are more comfortable with minimization and more software packages are tuned for it. The $\frac{1}{T}$ keeps the sum on the order of a single log-probability value, so that it does not grow indefinitely in magnitude as $T$ grows. $J(\theta)$ expands to $$
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$

    Before working further, we must determine the formula for $p(w_{t+j} | w_t ; \theta)$. Any formula that constitutes a probability distribution will do, but softmax is a well-accepted one, so I will use softmax; it is also the function used in the original algorithm. The original algorithm uses two vectors for each word: $u_c$ when the word considered is the center word, $u_o$ otherwise. Hence, $$ p(w_o | w_c ; \theta) = \frac{exp({u_o}^\top u_c)}{\sum_{q \in V}exp({u_q}^\top u_c)} $$, where $u_o$, $u_c$ and $u_q$ are taken out of $\theta$, and $V$ is the vocabulary, namely the set of all words. In this setting, each word ends up with two vectors, which people then have to merge, because eventually we want one vector per word. Here I think: why not just use one vector $u$ per word? I am going to try it. The formula looks identical: $$ p(w_a | w_b ; \theta) = \frac{exp({u_a}^\top u_b)}{\sum_{d \in V}exp({u_d}^\top u_b)} $$, except that it no longer distinguishes center from surrounding. A side effect of this change is that the size of $\theta$ is halved.
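    As a concrete reading of the single-vector formula, here is a small Python sketch, assuming $\theta$ is stored as a list `U` of word vectors indexed by word id (the names and the toy data are mine):

```python
from math import exp

def p_given_center(U, b):
    # p(w_a | w_b; theta) for every word a in V, one vector per word.
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    scores = [dot(u, U[b]) for u in U]
    mx = max(scores)                    # shift exponents for stability;
    e = [exp(s - mx) for s in scores]   # the distribution is unchanged
    z = sum(e)
    return [x / z for x in e]

U = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]   # toy 3-word vocabulary, d = 2
p = p_given_center(U, 0)
print(sum(p))   # 1.0 (up to rounding)
```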

    The full $J(\theta)$ is $$\begin{align}
    J(\theta)
    &=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta)) \\
    &=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)} \right) \label{eq:originalJ}
    \end{align}$$.

    Derive The Formula for Minimization

    To minimize $J(\theta)$ over $\theta$, I compute the derivative of $J(\theta)$.

    $$ \begin{align}
    & \frac{\partial}{\partial \theta} log(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{d \in V}exp({u_d}^\top u_t)}) \\
    &= \frac{\partial}{\partial \theta} \left[ log(exp({u_{t+j}}^\top u_t)) - log({\sum_{d \in V}exp({u_d}^\top u_t)}) \right] \\
    &= \frac{\partial}{\partial \theta} log(exp({u_{t+j}}^\top u_t)) - \frac{\partial}{\partial \theta} log({\sum_{d \in V}exp({u_d}^\top u_t)}) \\
    &= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} \frac{\partial}{\partial \theta} exp({u_d}^\top u_t)} \right] \\
    &= \left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} {\sum_{d \in V} exp({u_d}^\top u_t) \frac{\partial}{\partial \theta} {u_d}^\top u_t} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{1}{\sum_{f \in V}exp({u_f}^\top u_t)} \sum_{d \in V} exp({u_d}^\top u_t) {u_d} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \frac{\sum_{d \in V} exp({u_d}^\top u_t) {u_d}}{\sum_{f \in V}exp({u_f}^\top u_t)} \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \sum_{d \in V} { \frac{exp({u_d}^\top u_t) }{\sum_{f \in V}exp({u_f}^\top u_t)} u_d } \right] \\
    &=\left[ \frac{\partial}{\partial \theta} {u_{t+j}}^\top u_t \right] - \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] \\
    \end{align}
    $$

    As the computational process will be iterative, we can treat $u_{t+j}$ as a constant parameter relative to $u_t$. So the last step becomes:

    $$ u_{t+j} - \left[ \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right] $$

    Put them together:

    $$ \begin{align} \label{eq:derivativeJ}
    \frac{\partial J}{\partial \theta} = -\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ u_{t+j} - \sum_{d \in V} { p(w_d | w_t ; \theta) u_d } \right]
    \end{align} $$

    So, we can now use $\frac{\partial J}{\partial \theta}$ to update $\theta$ so that $\frac{\partial J}{\partial \theta}$ will approach $0$. But how? Let me explain intuitively.

    You can think of $\frac{\partial J}{\partial \theta}$ as a slope. A slope is a vector pointing in the direction of steepest increase of the function value, assuming the function is continuous and convex in a vicinity that contains both $\theta$ and the minimum point of $J(\theta)$. So, if the direction of $\frac{\partial J}{\partial \theta}$ coincides with the growing direction of $J(\theta)$, a portion of $\frac{\partial J}{\partial \theta}$ needs to be subtracted from $\theta$. If it opposes the growing direction of $J(\theta)$, a portion of the negated $\frac{\partial J}{\partial \theta}$ needs to be added to $\theta$, which again amounts to subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$. Therefore, subtracting a portion of $\frac{\partial J}{\partial \theta}$ from $\theta$ is the method for updating $\theta$ in the minimization process. Repeatedly update $\theta$ this way until no change in $\theta$ is perceptible. This is the Gradient Descent Algorithm. If the portion is not fixed, but changes according to some heuristic, then it is the Adaptive Gradient Descent Algorithm. As a formula, it is:

    $$ \theta_{k+1} \gets \theta_{k} - \alpha \frac{\partial J}{\partial \theta} $$
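    The update rule can be illustrated on a toy one-dimensional function (a generic sketch of gradient descent, not the word-vector $J$):

```python
# Minimize J(theta) = (theta - 3)^2, whose derivative is 2*(theta - 3).
theta = 0.0
alpha = 0.1                      # the fixed "portion" (learning rate)
for _ in range(200):
    grad = 2.0 * (theta - 3.0)   # dJ/dtheta at the current theta
    theta -= alpha * grad        # theta_{k+1} = theta_k - alpha * dJ/dtheta
print(theta)                     # converges to 3.0 (up to rounding)
```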

    Computational Considerations

    In the process of computing $\frac{\partial J}{\partial \theta}$, we will need all the ${u_a}^\top u_b$’s for $a,b \in V$. As $|V|$ can be very large, the memory required to store all of them would be formidably large, so storing them is better avoided.

    Look closely at the structure of $$p(w_d | w_t ; \theta) = \frac{exp({u_d}^\top u_t)}{\sum_{f \in V}exp({u_f}^\top u_t)}$$. It needs the dot products of $u_t$ with all other vectors $u_f$ for $f \in V$, and $t$ is just some $f$. Observe the structure of $J(\theta)$ again: $$
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$. We can find that in the inner sum $$
    \sum_{\substack{-m \le j \le m \\ j \ne 0}} log(p(w_{t+j} | w_t ; \theta))
    $$, the denominator of $p$ is computed repeatedly even though it does not depend on $j$. Hence the denominator of $p$ can be calculated once beforehand and reused.

    However, you may notice that $p(w_{t+j} | w_t ; \theta)$ still asks for too much computation, as it must iterate over the whole vocabulary.

    Changing the Probability Evaluation Method into an Intuitive One

    Let’s focus on $$\begin{align}
    log(p(w_{t+j} | w_t ; \theta))
    &= log \left(\frac{exp({u_{t+j}}^\top u_t)}{\sum_{f \in V}exp({u_f}^\top u_t)} \right) \\
    & \notag \\
    &= log(exp({u_{t+j}}^\top u_t)) \quad - \quad log(\sum_{f \in V}exp({u_f}^\top u_t)) \\
    &= \underbrace{{u_{t+j}}^\top u_t}_\text{part1} \quad - \quad \underbrace{log(\sum_{f \in V}exp({u_f}^\top u_t))}_\text{part2}
    \end{align}$$.
    Part1 is a linear part; it measures the similarity of two words in a vector space. Part2 removes a baseline similarity that $u_t$ has to all words on average. The heavy computational burden comes from part2, so we need to come up with an efficient, possibly approximate, alternative.

    Look deeper into part2. Suppose the average of all the exponentials is $E$. Then part2 is $log(\sum_{f \in V}E)=log(|V| \cdot E)$. This means that part2 can be treated as a kind of average over the set of the ${u_f}^\top u_t$’s. It has the property that large $exp({u_f}^\top u_t)$’s dominate the value of the logged sum, and only when all the $exp({u_f}^\top u_t)$’s are close to $0$ does part2 become negative. In fact, going negative is not strictly necessary, because that negativity only helps when all the negative words are dissimilar to the center word. A similar effect can be achieved by only punishing similarity between the negative words and the center word. With this observation, part2 becomes $$
    \sum_{f \in V}\Lambda({u_f}^\top u_t)
    $$, where $\Lambda$ is the rectified linear unit function $
    ReLU(x)=
    \begin{cases}
    x & \text{if } x \ge 0 \\
    0 & \text{if } x < 0
    \end{cases}
    $.

    Put them back into the formula of $J$: $$\begin{equation} \label{eq:alternativeL1}
    J(\theta)=-\frac{1}{T} \sum_{t=1}^T \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \underbrace{{u_{t+j}}^\top u_t}_\text{part1} \; - \;
    \underbrace{\sum_{f \in V}\Lambda({u_f}^\top u_t)}_\text{part2} \right]
    \end{equation}$$. This is a variation of Skip-Gram with Negative Sampling.

    The first summation $$
    \sum_{t=1}^T
    $$ can still be too big. One can update $\theta$ partially by selecting a sample of the positions $\{1,…,T\}$. This entitles the algorithm Stochastic: $$
    J(\theta)=-\frac{1}{|S|} \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K}\Lambda({u_f}^\top u_t) \right]
    $$, where $S$ is a sample of the positions and $K$ is a sample of negative words drawn from the vocabulary.
    Theoretically, if $|S|$ is fixed, $\frac{1}{|S|}$ can be omitted, and $J$ becomes $$\begin{equation} \label{eq:alternativeL2}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K} \Lambda({u_f}^\top u_t) \right]
    \end{equation}$$.
    However, removing $\frac{1}{|S|}$ may not be a very good idea. The reason is that $\frac{1}{|S|}$ scales down the derivative so that it does not grow too big as the number of involved terms increases. Without $\frac{1}{|S|}$, floating-point overflow becomes easy to cause.

    There are two kinds of samples, $S$ and $K$, drawn with different strategies. $S$ needs to cover the whole set of word positions, so $S$ can be drawn by permuting the positions $\{1,…,T\}$ and cutting the permutation into segments. $K$ needs to avoid the words in the current window, so there is a natural tendency to favor less frequent words. Two methods can be used. One is to sample the vocabulary uniformly: due to Zipf’s law, you will almost always hit less frequent words. The other is to sample with an explicit bias toward less frequent words. Which one is better? Have a try!
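    The biased variant might be sketched like this; the weighting exponent is an arbitrary choice of mine, not something fixed by the text:

```python
import random

random.seed(0)

def sample_negatives(counts, k, power=-0.75):
    # Weight word id i by counts[i] ** power: rarer words get larger weight.
    weights = [c ** power for c in counts]
    return random.choices(range(len(counts)), weights=weights, k=k)

counts = [1000, 500, 50, 10, 5]   # toy word frequencies
K = sample_negatives(counts, k=8) # eight word ids, biased toward rare words
```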

    Now, consider any partial derivative $\frac{\partial J}{\partial u_i}$ with respect to a single vector. Consider an arbitrary pair $(t, j) \in S \times \{\, j : -m \le j \le m \wedge j \ne 0 \,\}$.

    $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$ can be nonzero only if either $w_{t+j}=w_i$ or $w_t=w_i$.
    – When $w_{t+j}=w_i$, $u_t$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$.
    – When $w_t=w_i$, $u_{t+j}$ is an additive term in the expansion of $\frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t$.

    Then, consider an arbitrary pair $(t,f) \in S \times K$.

    $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$ can be nonzero only if $w_i = w_t \vee w_i = w_f$.
    – When $w_i=w_t$, $-u_f$ is an additive term in the expansion of $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$.
    – When $w_i=w_f$, $-u_t$ is an additive term in the expansion of $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$.

    The above conditions are not exclusive; that is, if a $w_i$ satisfies multiple conditions, multiple additive terms are involved. Because $j$ can take $2m$ values, the expansion of $-\frac{\partial}{\partial u_i}{u_f}^\top u_t$ is repeated $2m$ times.

    To summarize, define $
    \Lambda'(x)=
    \begin{cases}
    1 & \text{if } x>0 \\
    0 & \text{if } x \le 0
    \end{cases}
    $ and $
    I(a,b)=\begin{cases}
    1 & \text{if }a=b \\
    0 & \text{if } a \ne b
    \end{cases}
    $, then, $$\begin{align}
    &\frac{\partial}{\partial u_i}J(\theta) \\
    &=-\frac{1}{|S|}\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[
    \frac{\partial}{\partial u_i}{u_{t+j}}^\top u_t \; – \;
    \sum_{f \in K} \frac{\partial}{\partial u_i}\Lambda({u_f}^\top u_t) \right] \\
    &=- \left\{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|}\left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]
    \; - \;
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|}\Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    \right\} \\
    &=-
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]
    \; + \;
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right] \label{eq:derivativeJ2} \\
    \end{align}$$
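    To make the bookkeeping concrete, here is a hedged Python sketch of one stochastic update built directly from the derivative above (all names and toy data are mine; window edges are handled by skipping, and the sampling of $K$ is simplified to uniform):

```python
import random

random.seed(0)

def sgd_step(U, words, S, m, k_neg, alpha):
    # One update of theta (the word vectors U) using the derivative above.
    T, dim = len(words), len(U[0])
    grad = [[0.0] * dim for _ in U]
    for t in S:
        wt = words[t]
        # First bracket: each center/context pair contributes -u / |S|.
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < T:
                continue
            wc = words[t + j]
            for x in range(dim):
                grad[wc][x] -= U[wt][x] / len(S)
                grad[wt][x] -= U[wc][x] / len(S)
        # Second bracket: negative samples, gated by Lambda' (dot > 0),
        # with the factor 2m pulled out as in the formula.
        for wf in random.choices(range(len(U)), k=k_neg):
            if sum(a * b for a, b in zip(U[wf], U[wt])) > 0:
                for x in range(dim):
                    grad[wt][x] += 2 * m * U[wf][x] / len(S)
                    grad[wf][x] += 2 * m * U[wt][x] / len(S)
    # Gradient step: theta <- theta - alpha * dJ/dtheta.
    for i in range(len(U)):
        for x in range(dim):
            U[i][x] -= alpha * grad[i][x]
    return U

U = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(20)]  # toy theta
words = [random.randrange(20) for _ in range(100)]                 # toy corpus
U = sgd_step(U, words, S=[10, 40, 70], m=2, k_neg=5, alpha=0.01)
```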

    What Could Be the Problem?

    It turns out that some problems remain challenging.

    Non-Convexity

    First, let’s look back at equation \eqref{eq:originalJ} or \eqref{eq:alternativeL2}. Think of each $p(\bullet)$ as an $x_i$. On any 2-D circle formed by an $x_i$ and an $x_j$, the product $x_i x_j$ reaches its maximum when $x_i = x_j$ and its minimum when $x_i=-x_j$, and the same holds after applying $exp(\bullet)$. So, $\vec{0}$ is indeed a saddle point. This means that I cannot set the initial value of $\theta$ to $\vec{0}$, and it raises a risk that during training $\theta$ might converge to $\vec{0}$. At the least, if that happens, I must retrain the model.

    Linear Dependency

    Look closely at \eqref{eq:derivativeJ} and \eqref{eq:derivativeJ2}: the derivative can only update $\theta$ with linear combinations of $u_t$ and $u_f$. So, if the $u_t$ and $u_f$ are linearly dependent to begin with, the resulting $\theta$ can only consist of linearly dependent vectors. I must therefore initialize $\theta$ so that all the $u$’s are likely to be linearly independent, or at least find a way to introduce some disturbance to break the linear dependency.
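    This span-trapping effect is easy to demonstrate with a toy rank-one initialization (a hypothetical sketch; the update below stands in for one gradient step built from Term1/Term2-style linear combinations):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Rank-1 initialization: every word vector is a multiple of the same v.
    v = rng.normal(size=4)
    U = np.outer(rng.normal(size=5), v)      # 5 word vectors, all in span{v}

    # A gradient step only adds linear combinations of other u's,
    # so the vectors can never leave span{v}:
    U[2] -= 0.1 * (U[0] + 3.0 * U[4])
    assert np.linalg.matrix_rank(U) == 1
    ```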

    Magnitude

    $J(\theta)$ is driven by dot products of pairs of word vectors. Positive dot products tend to grow ever more positive, and negative dot products ever more negative. Once a dot product becomes large, it barely changes the value of $\sigma$, and thus barely changes the value of $J(\theta)$. This is bad: if a dot product grows too big, it will eventually cause a floating-point overflow, which makes further computation meaningless. So, I need a mechanism to restrict the magnitudes of the word vectors. One way to do it is to add a squared $L_2$ norm penalty. This turns \eqref{eq:alternativeL2} into $$\begin{equation}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \frac{1}{|S|} {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K} \frac{1}{|S|} \Lambda({u_f}^\top u_t) \right] + \frac{D}{2} \theta^\top \theta
    \end{equation}$$.
    Here, $D$ is a constant for tuning the balance between model fidelity and vector magnitude. Then, \eqref{eq:derivativeJ2} becomes $$\begin{align}
    \frac{\partial J(\theta)}{\partial u_i}=
    -\underbrace{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]}_\text{Term1}
    \; + \;
    \underbrace{
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    }_\text{Term2}
    \; + \;
    \underbrace{ D u_i }_\text{Term3}
    \end{align}$$.

    In practice, a correct $D$ may be hard to estimate in advance. Another method could be to rescale the magnitude of the whole collection of word vectors to some constant after each sample.
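    That alternative can be sketched as follows (illustrative only; the target value and the choice of the Frobenius norm over the whole matrix are assumptions):

    ```python
    import numpy as np

    def rescale(U, target=1.0):
        # Scale the whole embedding matrix so its overall (Frobenius) norm
        # equals `target`; intended to run after each sample in place of
        # tuning the L2 penalty constant D in advance.
        norm = np.linalg.norm(U)
        return U if norm == 0 else U * (target / norm)

    U = np.random.default_rng(0).normal(size=(5, 4))
    U = rescale(U, target=1.0)
    assert np.isclose(np.linalg.norm(U), 1.0)
    ```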

    Weighting

    In the above formulae, the two sequences $\{a,b,c,d,e\}$ and $\{d,a,c,e,b\}$ are indistinguishable. This is not good, because “Let me give you books” and “Let me give books you” clearly indicate different directions of the action, and the latter is perhaps very rare. One method to alleviate this problem is to weight the dot products in $\text{Term1}$ according to their distance from the center word. So: $$\begin{equation}
    J(\theta)=-\sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \frac{1}{|S|} \mu (1-\tau^{m-|j|+1}) {u_{t+j}}^\top u_t \; - \;
    \sum_{f \in K} \frac{1}{|S|} \Lambda({u_f}^\top u_t) \right] + \frac{D}{2} \theta^\top \theta
    \end{equation}$$,
    where $0 < \tau \le 1$ is a weight-decay constant, and $\mu$ is a constant scaling factor. The reason for $\mu$ is that $1-\tau^{m-|j|+1}$ is at most $1$, so the weights alone would underweight $\text{Term1}$; $\mu$ is used to scale it back.

    Then $$\begin{align}
    \frac{\partial J(\theta)}{\partial u_i}=
    -\underbrace{
    \sum_{t \in S} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \mu (1-\tau^{m-|j|+1}) \frac{1}{|S|} \left[
    I(w_{t+j}, w_i) u_t + I(w_t, w_i) u_{t+j}
    \right]}_\text{Term1′}
    \; + \;
    \underbrace{
    2m \sum_{t \in S} \sum_{f \in K} \frac{1}{|S|} \Lambda'({u_f}^\top u_t) \left[
    I(w_i, w_t)u_f+I(w_i, w_f)u_t
    \right]
    }_\text{Term2′}
    \; + \;
    \underbrace{ D u_i }_\text{Term3′}
    \end{align}$$

    One way to choose $\mu$ is to set $$
    \mu=\frac{m}{\sum_{j=1}^m (1-\tau^{m-j+1})}
    $$, which makes the weights average to $1$ over the window.
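    For a concrete feel, here is a small check (with made-up $m$ and $\tau$) that this choice of $\mu$ makes the weights $\mu(1-\tau^{m-j+1})$ average to exactly $1$ over the half-window:

    ```python
    import numpy as np

    m, tau = 3, 0.5                       # illustrative values
    weights = np.array([1 - tau**(m - j + 1) for j in range(1, m + 1)])
    mu = m / weights.sum()
    # mu * weights averages to exactly 1, so Term1' keeps roughly the
    # same overall scale as the unweighted Term1.
    assert np.isclose((mu * weights).mean(), 1.0)
    ```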

    Stability

    To avoid having a random, imperfect initialization determine the result prematurely, I added some random disturbance during training. And because only one sample is seen at a time, results from each sample are merged with an exponential moving average.
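    A minimal sketch of that merge step (the smoothing factor, noise scale, and function shape are all assumptions, not the post’s actual code):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def merge(theta, sample_theta, alpha=0.1, noise=1e-3):
        # Exponential moving average of the running parameters with the
        # result from one sample, plus a small random disturbance to help
        # break symmetry and linear dependence.
        mixed = (1 - alpha) * theta + alpha * sample_theta
        return mixed + rng.normal(scale=noise, size=theta.shape)

    theta = np.zeros((5, 4))
    theta = merge(theta, np.ones((5, 4)))
    assert np.allclose(theta, 0.1, atol=0.01)   # ~0.1 plus tiny noise
    ```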

  • Install React-Native for Windows on Mapped Network Drive. Very Limited Success!

    React-Native claims to let people write once and run everywhere. So today, I decided to try it on my PC. And I did not want it on my local drive, whose free space is becoming more and more precious. So, I decided to install it on one of my mapped network drives, say “U:”.

    Before installing react-native, you may need to install git first and run the following commands to set your email and name, and to mark the network location as a safe directory.

    git config --global user.email "you@example.com"
    git config --global user.name "Your Name"
    git config --global --add safe.directory "//192.168.x.y/mount-point/project-name"

    The installation process is described on the Get Started with Windows page, which gives the following command line that can easily be run in Windows CMD.

    npx react-native@nightly init <projectName> --version "nightly"

    And you may have noticed the word “nightly”. This is a bit too aggressive; I wanted the latest release version. Therefore, I substituted “latest” for “nightly” and ran the following lines.

    npx react-native@latest init project-name --version "latest" --verbose
    cd project-name
    npx react-native-windows-init --overwrite

    These commands ran smoothly to the end.

    Now, I ran:

    npx react-native run-windows

    However, it complained about missing dependencies, and I needed to run the following in an elevated PowerShell:

    U:\project-name\node_modules\react-native-windows\scripts\rnw-dependencies.ps1

    But before that, I had to switch to the U: drive in PowerShell. This time, I ran into the following problem:

    PowerShell did not share the same drive mappings as CMD/File Explorer!

    Here is what came to the rescue: I needed to map the network drive inside PowerShell, with the following command.

    PS C:\WINDOWS\system32> New-PSDrive -Name U -Root \\192.168.x.y\network-mount -PSProvider FileSystem

    After that, I was able to switch to U: in PowerShell. BTW, if you need to keep the U: drive permanently, just add a “-Persist” argument to the command line above.

    Run the following commands to install dependencies.

    Set-ExecutionPolicy Unrestricted -Scope Process -Force;
    U:\project-name\node_modules\react-native-windows\scripts\rnw-dependencies.ps1

    Or, you can run the following instead, as introduced on the System Requirements page.

    Set-ExecutionPolicy Unrestricted -Scope Process -Force;
    iex (New-Object System.Net.WebClient).DownloadString('https://aka.ms/rnw-vs2022-deps.ps1');

    The installation process ends here. However, my troubles did not: the command did not finish successfully. It could not run WindowsStoreAppUtils.ps1 due to the PowerShell execution policy. I bypassed this problem by editing the following file:

    U:\project-name\node_modules\@react-native-windows\cli\lib-commonjs\utils\commandWithProgress.js

    Find the string '-NoProfile' in the file, and add the following two lines below it:

                '-ExecutionPolicy',
                'ByPass',
    

    Now you can run:

    npx react-native run-windows --logging

    The above adjustments helped me successfully build a debug version. However, registering the app failed, because the app resides on a network drive. After searching the Internet, it seems that registering an app on a network drive is simply not supported.

    I also tried running through Visual Studio, but the result was the same.

    DEP0700: Registration of the app failed. [0x80073D1F]

    Bad luck! At least, I learned that react-native-windows apps cannot be developed on network drives, because you cannot debug them there.

    Later, I tried building without running, and succeeded. This is what I did:

    "C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64\MakeCert" /n "CN=project-name" /r /h 0 /eku "1.3.6.1.5.5.7.3.3,1.3.6.1.4.1.311.10.3.13" /e "12/31/2100" /sv project-name.pvk project-name.cer
    
    "C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64\Pvk2Pfx" /pvk project-name.pvk /spc project-name.cer /pfx project-name.pfx

    You may have to change the paths and project names above to match your project. These commands create three files.

    project-name.cer
    project-name.pvk
    project-name.pfx

    The .cer file is the certificate file. Find it in File Explorer and double-click it to import it into your PC. Be sure to import it into the local machine, not the current user, and put it into the “Trusted Root Certification Authorities” store. Otherwise, the app install will not succeed.

    The .pfx file is the one you use to sign your app package. Select “From File …” when building the app package in Visual Studio.

    As a note, be sure to build with the Release configuration; otherwise, the app will not run, because it will fail to connect to the local development-server port even if you have started the server.

    Good luck!

  • Use Powershell To Collect Windows Performance Data

    Powershell has many cmdlets that help users perform various tasks. In this post, I will use several commands to collect CPU and memory usage of processes.

    Collect Process CPU Usage

    Powershell has a Get-Counter cmdlet, which lets users obtain system metrics. One of the counters that you can use is “\Process(*)\% Processor Time”. Open your powershell, and run the following command:

    Get-Counter "\Process(*)\% Processor Time"

    The result contains a Timestamp and a CounterSamples property. The timestamp is just when the user ran the command. What we need is just the CounterSamples part, which the following command extracts.

    (Get-Counter "\Process(*)\% Processor Time").CounterSamples

    This gives a list of all processes with their CPU usages. But it is too long for performance monitoring. Therefore, I will restrict the results to those having nonzero CPU usage. This is done with the following command.

    (Get-Counter "\Process(*)\% Processor Time").CounterSamples | Where-Object { $_.CookedValue -gt 0 }

    The CookedValue column is the “percentage” of CPU use. You can find that the values can be greater than 100, because the counter adds up multiple cores. On my PC, there are 16 cores, so the total comes to about 1600.

    Collect Process Memory Usage

    To get memory usage, another cmdlet, Get-Process, is useful. It goes like the following:

    Get-Process | Where-Object { $_.CPU -gt 0 } | Format-Table Name, WS

    Here, like in the previous section, I used the Where-Object cmdlet to restrict the result to active processes only. And because Get-Process can give a long list of properties for every process, I restrict the output to two fields: Name is the name of the process, and WS is the working-set memory in bytes. The output goes like the following:

    Name                           WS
    ----                           --
    ChsIME                    7901184
    conhost                  16814080
    conhost                  17293312
    ctfmon                   21581824
    dllhost                  13258752
    explorer                207691776
    LockApp                  61153280
    msedge                   20189184
    msedge                  289972224
    msedge                   17846272
    msedge                   29536256
    msedge                   47423488
    msedge                  106610688
    msedge                  123375616
    msedge                  125349888
    msedge                  168783872
    msedge                  128581632
    msedge                  191397888
    msedge                  104919040
    msedge                   21688320
    msedge                  120528896
    msedge                    8261632
    MusNotifyIcon            14864384
    notepad++                30425088
    powershell               98615296
    powershell              101376000
    RtkAudUService64          9863168
    RuntimeBroker            28770304
    RuntimeBroker            24854528
    RuntimeBroker            41598976
    RuntimeBroker            32833536
    SearchApp               266919936
    SecurityHealthSystray     9510912
    sihost                   29876224
    StartMenuExperienceHost  73383936
    svchost                  13090816
    svchost                  25026560
    svchost                  33251328
    svchost                   8740864
    svchost                  22769664
    taskhostw                18632704
    TextInputHost            53374976
    TSVNCache                 9928704
    ttpmenu                  12865536
    UserOOBEBroker           10113024
    

    Summary

    In this article, I introduced two ways of collecting CPU and memory usage, respectively. These outputs can be treated as input data matrices for producing higher-level summaries, such as identifying usage patterns or bottlenecks.

    Hope this helps!