分岐はないほうがいい？

前回はedaxというオセロソフトのboard_score_1関数を分岐不要に書き換える話をしました。cmov命令がそれに使えるのがポイントでした。私はcmov命令の存在を最近知ったのですが、記事を投稿したあとで"cmov"でググったら達人の間では昔から当たり前だったみたいでした。

https://www.slideshare.net/herumi/cmovmaxps

分岐予測器がうまく機能するならば分岐で書くべきだというのはその通りだと思います。分岐予測器の正解率をいつどうやって予測するのかが鍵のような気がします。状況のほうは

問題の性質によって予測できるケース
- コーディング時にちょっと考えれば予想できるケース
- コンパイル時にprofile-guided optimizationすればわかるケース
入力クエリの性質によって変わるケース
様々なユーザーが各々のデバイス（異なる分岐予測器を持っている）の上で走らせるケース

に分かれると思います。

以降はしょうもないネタを書いていきます。

board_solve関数について

edaxでは、両者に合法手がない終局局面（空白マスが残っている場合もある）を入力として最終スコアを返す関数が以下のように実装されています。

edax-reversi/endgame.c at 8e5da6d206b750119fe7157e43ea6985b468e8d0 · abulmo/edax-reversi · GitHub

static int board_solve(const Board *board, const int n_empties)
{
	int score = bit_count(board->player) * 2 - SCORE_MAX;	// in case of opponents win
	int diff = score + n_empties;		// = n_discs_p - (64 - n_empties - n_discs_p)

	SEARCH_STATS(++statistics.n_search_solve);

	if (diff >= 0)
		score = diff;
	if (diff > 0)
		score += n_empties;
	return score;
}

オセロのルールでは、空白マスを残した状態で終局した場合、盤上の石が多い方（＝勝者）が空白マスを"総取り"します。上記のコードはちゃんと「先手の勝ち・引き分け・後手の勝ち」の3パターンを考慮したコードになっています。

しかし、空白マスの数(=n_empties)が奇数であることが保証されていれば、引き分けがありえないことを利用してちょっとシンプルにできます。例えば以下のように書けます。

static int board_solve_odd(const Board *board, const int n_empties)
{
	assert(n_empties & 1);
	int score = bit_count(board->player) * 2 - SCORE_MAX;	// in case of opponents win
	int diff = score + n_empties;		// = n_discs_p - (64 - n_empties - n_discs_p)

	SEARCH_STATS(++statistics.n_search_solve);

	if (diff > 0)
		score = diff + n_empties;
	return score;
}

edaxでは残り4手以降は個別の関数を用意していて、かつboard_solveは全部インライン化されるはずなので、残り3手の場所でこの関数を使えばちょっと高速化させられるでしょう。

move_evaluateについて

edaxのmove_evaluateという関数のなかに、以下の記述があります。

https://github.com/abulmo/edax-reversi/blob/master/src/move.c

static void move_evaluate(Move *move, Search *search, const HashData *hash_data, const int sort_alpha, const int sort_depth)
{
（前略）
	const int w_low_parity = 1 << 3;
	const int w_mid_parity = 1 << 2;
	const int w_high_parity = 1 << 1;
（中略）
		if (search->n_empties < 12 && search->parity & QUADRANT_ID[move->x]) move->score += w_low_parity; // parity
		else if (search->n_empties < 21 && search->parity & QUADRANT_ID[move->x]) move->score += w_mid_parity; // parity
		else if (search->n_empties < 30 && search->parity & QUADRANT_ID[move->x]) move->score += w_high_parity; // parity
（後略）

QUADRANT_IDは以下のようなテーブルです。

https://github.com/abulmo/edax-reversi/blob/8e5da6d206b750119fe7157e43ea6985b468e8d0/src/search.c

const unsigned char QUADRANT_ID[] = {
		1, 1, 1, 1, 2, 2, 2, 2,
		1, 1, 1, 1, 2, 2, 2, 2,
		1, 1, 1, 1, 2, 2, 2, 2,
		1, 1, 1, 1, 2, 2, 2, 2,
		4, 4, 4, 4, 8, 8, 8, 8,
		4, 4, 4, 4, 8, 8, 8, 8,
		4, 4, 4, 4, 8, 8, 8, 8,
		4, 4, 4, 4, 8, 8, 8, 8,
		0, 0
	};

この処理の意味は、序盤中盤終盤でparity based move sortingの重要度を変えたいということなのでしょう。

（おさらい）parity based move sortingとは、まずオセロの盤面を4つの象限に分けて、各象限に存在する空白マスの個数の偶奇を記録しておき、奇数個空きの象限のマスを優先して探索するというものです。なぜなら、オセロの終盤においては孤立した空白マスに打つほうが一般に有利だからです。「奇数個空き象限」は「偶数個空き象限」より空きマス数の期待値が小さいので、前者を優先するのが良いといえます。

（i7-4700, Visual Studio 2017の）プロファイリングによると、ここのif文の連続が重いようだったので、（分岐予測器が外しまくっているのではと予想して）分岐不要でcmovと表引きに頼るように書き換えたところ、私の環境では少し早くなりました。以下のような感じでやりました。

（前略）
	const int w_low_parity = 1 << 3;
	const int w_mid_parity = 1 << 2;
	const int w_high_parity = 1 << 1;
（中略）
const int32_t parity_table[64] = {
	w_low_parity,w_low_parity,w_low_parity,w_low_parity,w_low_parity,w_low_parity,w_low_parity,w_low_parity,
	w_low_parity,w_low_parity,w_low_parity,w_low_parity,

	w_mid_parity,w_mid_parity,w_mid_parity,w_mid_parity,w_mid_parity,w_mid_parity,w_mid_parity,w_mid_parity,
	w_mid_parity,

	w_high_parity,w_high_parity,w_high_parity,w_high_parity,w_high_parity,w_high_parity,w_high_parity,w_high_parity,
	w_high_parity,

	0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,
	0,0,
};
（中略）
		//if (search->n_empties < 12 && search->parity & QUADRANT_ID[move->x]) move->score += w_low_parity; // parity
		//else if (search->n_empties < 21 && search->parity & QUADRANT_ID[move->x]) move->score += w_mid_parity; // parity
		//else if (search->n_empties < 30 && search->parity & QUADRANT_ID[move->x]) move->score += w_high_parity; // parity
		const bool f1 = (search->parity & QUADRANT_ID[move->x]) != 0;
		const int32_t p1 = parity_table[search->n_empties];
		move->score += f1 ? p1 : 0;

parity based move sortingについて

edaxで、残り4マス空き時点でのparity based move sortingは以下のように記述されています。

https://github.com/abulmo/edax-reversi/blob/8e5da6d206b750119fe7157e43ea6985b468e8d0/src/endgame.c

	// parity based move sorting.
	// The following hole sizes are possible:
	//    4 - 1 3 - 2 2 - 1 1 2 - 1 1 1 1
	// Only the 1 1 2 case needs move sorting.
	if (!(search->parity & QUADRANT_ID[x1])) {
		if (search->parity & QUADRANT_ID[x2]) {
			if (search->parity & QUADRANT_ID[x3]) { // case 1(x2) 1(x3) 2(x1 x4)
				int tmp = x1; x1 = x2; x2 = x3; x3 = tmp;
			} else { // case 1(x2) 1(x4) 2(x1 x3)
				int tmp = x1; x1 = x2; x2 = x4; x4 = x3; x3 = tmp;
			}
		} else if (search->parity & QUADRANT_ID[x3]) { // case 1(x3) 1(x4) 2(x1 x2)
			int tmp = x1; x1 = x3; x3 = tmp; tmp = x2; x2 = x4; x4 = tmp;
		}
	} else {
		if (!(search->parity & QUADRANT_ID[x2])) {
			if (search->parity & QUADRANT_ID[x3]) { // case 1(x1) 1(x3) 2(x2 x4)
				int tmp = x2; x2 = x3; x3 = tmp;
			} else { // case 1(x1) 1(x4) 2(x2 x3)
				int tmp = x2; x2 = x4; x4 = x3; x3 = tmp;
			}
		}
	}

分岐だらけですね。

これは分岐なしで以下のように書けます。

static const uint16_t QUADRANT_ID_2[] = {
	1, 1, 1, 1, 8, 8, 8, 8,
	1, 1, 1, 1, 8, 8, 8, 8,
	1, 1, 1, 1, 8, 8, 8, 8,
	1, 1, 1, 1, 8, 8, 8, 8,
	64, 64, 64, 64, 512, 512, 512, 512,
	64, 64, 64, 64, 512, 512, 512, 512,
	64, 64, 64, 64, 512, 512, 512, 512,
	64, 64, 64, 64, 512, 512, 512, 512
};
static const uint16_t QUADRANT_ID_3[] = {
	6, 6, 6, 6, 48, 48, 48, 48,
	6, 6, 6, 6, 48, 48, 48, 48,
	6, 6, 6, 6, 48, 48, 48, 48,
	6, 6, 6, 6, 48, 48, 48, 48,
	384, 384, 384, 384, 3072, 3072, 3072, 3072,
	384, 384, 384, 384, 3072, 3072, 3072, 3072,
	384, 384, 384, 384, 3072, 3072, 3072, 3072,
	384, 384, 384, 384, 3072, 3072, 3072, 3072
};
static const int8_t answer_order[32][4] = {
	{0,3,1,2}, {0,3,1,2}, {0,1,2,3}, {0,1,2,3},
	{0,1,2,3}, {0,2,1,3}, {1,2,0,3}, {0,1,2,3},
	{0,1,2,3}, {0,1,2,3}, {0,1,2,3}, {0,1,2,3},
	{2,3,0,1}, {0,1,2,3}, {0,1,2,3}, {0,1,2,3},
	{0,2,1,3}, {0,1,2,3}, {1,3,0,2}, {0,1,2,3},
	{0,1,2,3}, {0,1,2,3}, {0,1,2,3}, {0,1,2,3},
	{3,0,1,2}, {0,1,2,3}, {0,1,2,3}, {0,1,2,3},
	{0,1,2,3}, {0,1,2,3}, {0,1,2,3}, {0,1,2,3},
};
inline bool SortWithParity(const int32_t empties[4], int32_t * const x0, int32_t * const x1, int32_t * const x2, int32_t * const x3) {

	const uint64_t p =
		QUADRANT_ID_2[empties[0]] +
		QUADRANT_ID_2[empties[1]] +
		QUADRANT_ID_2[empties[2]] +
		QUADRANT_ID_2[empties[3]];

	const bool b0 = (p & QUADRANT_ID_3[empties[0]]) == 0;//empties[0]の象限に自分しか居なければtrue
	const bool b1 = (p & QUADRANT_ID_3[empties[1]]) == 0;//empties[1]の象限に自分しか居なければtrue
	const bool b2 = (p & QUADRANT_ID_3[empties[2]]) == 0;//empties[2]の象限に自分しか居なければtrue
	const bool b3 = QUADRANT_ID_2[empties[0]] == QUADRANT_ID_2[empties[1]];//empties[0]の象限が、empties[1]の象限と同じならtrue
	const bool b4 = QUADRANT_ID_2[empties[0]] == QUADRANT_ID_2[empties[2]];//empties[0]の象限が、empties[2]の象限と同じならtrue

	const uint64_t i0 = b0 ? 1 : 0;
	const uint64_t i1 = b1 ? 2 : 0;
	const uint64_t i2 = b2 ? 4 : 0;
	const uint64_t i3 = b3 ? 8 : 0;
	const uint64_t i4 = b4 ? 16 : 0;
	const uint64_t parity_index = i0 + i1 + i2 + i3 + i4;

	*x0 = empties[answer_order[parity_index][0]];
	*x1 = empties[answer_order[parity_index][1]];
	*x2 = empties[answer_order[parity_index][2]];
	*x3 = empties[answer_order[parity_index][3]];

	return (((1ULL << 0b00000) + (1ULL << 0b01000) + (1ULL << 0b10000)) & (1ULL << parity_index)) != 0;//全部2ならtrue
}

引数emptiesは空きマス4つの座標で、x0～x4はソート結果とします。b0～b2は、empties[0]～[2]がその象限に1つなのかどうかを示します。b3とb4は、2つの空きマスが同じ象限かどうかを示します。その5bitの情報から、どのマスを優先すべきかを表引きで求めて返します。parity_indexの5bitの情報だけでは、各象限への偏りのパターンを完全に特定はできませんが、安定ソートでなくてよければ必ずソートできます。（例えばparity_index == 0b01100のとき、(x0,x1,x2,x3) が (3,3,1,3) か (2,2,1,1) のどちらかであることまでしかわかりませんが、x2→x3→x0→x1 とソートすればどちらにも対応できます。

さらに、この方法なら（元のedaxと違い）残り3手時点で再びソートする必要をなくすことができます。理由は以下の2点です。

3-1で偏っているケースを正しくソートできる。
2-2で偏っているケースも"ソート"しておけるので、残り4手時点で打った手と同じ象限に属す手を優先させられる。

これがより速いどうかは元のedaxの処理を分岐予測器がうまく捌けているかなどの諸々によるでしょう。