XGBoostについての備忘録

XGBoost の公式ドキュメントを見るたびに、毎回「どのような式変形でこの形になるのだっけ」となり、思い出すのに時間がかかる箇所があったので、備忘録としてまとめようと思います。

xgboost.readthedocs.io

式変形の箇所

「Tree Boosting」という見出しの中の「Additive Training」の部分です。

Additive Training

... We write the prediction value at step $t$ as $\hat{y}^{(t)}$ . Then we have

$\displaystyle \hat{y}^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y_i}^{(t-1)} + f_t(x_i)$

It remains to ask: which tree do we want at each step? A natural thing is to add the one that optimizes our objective.

$\displaystyle obj^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y_i}^{(t)}) + \sum_{i=1}^{t} \omega(f_t) \\ = \sum_{i=1}^{n} l(y_i, \hat{y_i}^{(t-1)} + f_t(x_i)) + \omega(f_t) + constant \qquad (1)$

... So in the general case, we take the Taylor expansion of the loss function up to the second order:

$\displaystyle obj^{(t)} = \sum_{i=1}^{n} [ l(y_i, \hat{y_i}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) ] + \omega(f_t) + constant \quad (2)$

where the $g_i$ and $h_i$ are defined as

$\displaystyle g_i = \partial_{\hat{y_i}^{(t-1)}} l(y_i, \hat{y_i}^{(t-1)}) \\ h_i = \partial^{2}_{\hat{y_i}^{(t-1)}} l(y_i, \hat{y_i}^{(t-1)})$

この最後のテーラー展開の仕方を思い出すときに時間が掛かってしまいます。

式変形の詳細

では、上記の変形がどのように行われているか詳細を見ていきましょう。

まず、テーラー展開そのものについて復習です。テーラー展開は関数 $f$ を以下のような無限和で表現したものを言うのでした。

$\displaystyle f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x-a)^{n} \\\ = f(a) + f^{'}(a)(x-a) + \frac{f^{''}(a)}{2!}(x-a)^{2} + \frac{f^{'''}(a)}{3!}(x-a)^{3} + \cdots$

なお、このとき点 $a$ での情報を用いているため、「 $a$ まわりでテーラー展開する」と言います。

また、この $f(x)$ の $x$ を $x+\Delta$ とし、 $x$ まわりで展開することで、以下のように別の表現をすることができます。

$\displaystyle f(x + \Delta x) \\\ = \sum_{n=0}^{\infty} \frac{f^{(n)}(x)}{n!} (x+\Delta x - x)^{n} \\\ = \sum_{n=0}^{\infty} \frac{f^{(n)}(x)}{n!} (\Delta x)^{n} \\\ = f(x) + f'(x) \Delta x + f''(x) (\Delta x)^{2} + f'''(x) (\Delta x)^{3} + \cdots$

XGBoost のドキュメントに載っている式変形では、こちらの表現が使われています。具体的には、前述の引用文にある (1) 式

$\displaystyle obj^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y_i}^{(t-1)} + f_t(x_i)) + \omega(f_t) + constant$

に対して、テーラー展開の $f$ ・ $x$ ・ $\Delta x$ を以下に置き換えて展開しています。

$f$ を $l$
$x$ を $\hat{y_i}^{(t-1)}$
$\Delta x$ を $f_t(x)$

すなわち、以下のように関数 $l$ を表しています。

$\displaystyle l(y_i, \hat{y_i}^{(t-1)} + f_t(x_i)) \\\ = \sum_{n=0}^{\infty} \frac{l^{(n)}(y_i, \hat{y_i}^{(t-1)})}{n!} (f_t(x_i))^{n} \\\ = l(y_i, \hat{y_i}^{(t-1)}) + l'(y_i, \hat{y_i}^{(t-1)}) f_t(x_i) + \frac{1}{2} l''(y_i, \hat{y_i}^{(t-1)}) (f_t(x_i))^{2} + \cdots$

なお、

$\displaystyle l^{(n)}(y_i, \hat{y_i}^{(t-1)}) = \partial_{\hat{y_i}^{(t-1)}}^{(n)} l(y_i, \hat{y_i}^{(t-1)})$

です。

そして、テーラー展開後の2次の項までで表される式を $l$ の近似式とし、 $l'(y_i, \hat{y_i}^{(t-1)})$ を $g_i$ 、 $l''(y_i, \hat{y_i}^{(t-1)})$ を $h_i$ と置くと、関数 $l$ は次のように表されます。

$\displaystyle l(y_i, \hat{y_i}^{(t-1)} + f_t(x_i)) \approx l(y_i, \hat{y_i}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i (f_t(x_i))^{2}$

この近似式を (1) 式に入れると、

$\displaystyle obj^{(t)} = \sum_{i=1}^{n} [ l(y_i, \hat{y_i}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i (f_t(x_i))^{2} ] + \omega(f_t) + constant \qquad$

となり、(2) 式と同じになります。このようにして展開されています。

補足

上記では (1) 式

$\displaystyle obj^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y_i}^{(t-1)} + f_t(x_i)) + \omega(f_t) + constant$

に対して、テーラー展開 $f(x + \Delta x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x)}{n!} (\Delta x)^{n}$ の $f$ ・ $x$ ・ $\Delta x$ を以下に置き換えて展開しました。

$f$ を $l$
$x$ を $\hat{y_i}^{(t-1)}$
$\Delta x$ を $f_t(x)$

しかし、XGBoost のドキュメントを読むとステップ t においては既に $\hat{y_i}^{(t-1)}$ の値が定まっており、定数として扱われています。

そのため、変数ではないのに $x$ の置き換え先として $\hat{y_i}^{(t-1)}$ を使って良いのかと思うかもしれませんが、論理としては、一旦 $\hat{y_i}^{(t-1)}$ を変数として扱いテーラー展開しても、入出力の関係を表す関数としては変わりがないため、このようにしても問題ないのだと思われます。

また、これと同じように、ステップ t において $f_t(x)$ は値が定まっていない変数ですが、一旦定数として見て $\Delta x$ の置き換え先にしたとしても、入出力の関係を表す関数として変わりがないため問題ないのだと思われます。

参考記事

テーラー展開の説明：

computationjp.com

テーラー展開を用いた式変形の大まかな流れ：

qiita.com