
This is the story of Fatboy, a neural network that taught itself to play backgammon back in the early 90s. When it connected to the First Internet Backgammon Server (FIBS) 24 hours a day, 7 days a week in the summer of 94, it was probably the first autonomous AI game-playing agent on the internet. It wasn’t the strongest backgammon-playing network (that title was held by TD-Gammon and later by JellyFish), but for a while it was the strongest freely available program. It played at a decent level and generated a lot of interest on FIBS when it first appeared.

DeepMind’s AlphaZero may have attracted all the headlines in the last few years as the best Go, chess, and shogi-playing entity on the planet. Its playing strength is often attributed to its ability to learn through self-play, but this technique is not new - its use was pioneered by Gerald Tesauro in the late 80s and early 90s. Tesauro created TD-Gammon, a world-class backgammon AI that learned by self-play.

In 1993-94 I was introduced to neural networks while studying for a postgraduate Diploma in Computer Science at the University of Cambridge and I wanted to learn more. Somehow, although I cannot remember how, I had heard of Gerald Tesauro’s revolutionary work and was interested in replicating his results for my thesis. I was already playing backgammon for the Cambridge University Backgammon Club and socially at the Department of Pure Mathematics and Mathematical Statistics, so the project naturally combined two interests.

The result was Fatboy, a backgammon playing agent taught entirely through self-play. Fatboy included an interface to FIBS and to the best of my knowledge was the first AI agent to play online against all-comers on the internet.

Cover of the thesis describing Fatboy

Game-playing AIs

Game playing is a natural environment in which to research and develop AI and has been used as such since Samuel’s checkers programs in the 50s, with many other notable early examples following.

Why backgammon?

Out of all these early game-playing AIs, why was it that the most success was found in backgammon with TD-Gammon? Many of these agents shared the same learning algorithm and used similar neural architectures. The consensus is that it was the nature of the game of backgammon that enabled TD-Gammon to perform so well. The dice in backgammon are stochastic, and this precludes long-term planning. Calculating tactical skirmishes is less important than in games such as chess. The evaluation of most chess positions (especially those critical to the outcome of the game) relies heavily on deep tree search to resolve these tactics. In contrast, the evaluation of backgammon positions depends more on the recognition of positional features and an understanding of how they interact than on resolving a tree of variations.

In addition to the effect the dice have on the principles of evaluating backgammon positions, they may also aid the learning algorithm. Reinforcement learning algorithms for game playing can get stuck in local strategic minima when they believe their sub-optimal strategies are best. However, a bad roll can force the agent to visit positions it would not ordinarily choose. This exploration can be vital for successful learning when it results in the re-evaluation of positions previously thought to be bad. Indeed, the importance of exploration for reinforcement learning is demonstrated by how much active research is dedicated to exploration/exploitation trade-offs.

Reinforcement learning

The field of reinforcement learning is concerned with how to optimise an agent’s behaviour to maximise some reward signal. In board games the reward signal is typically sparse, as the agent only receives feedback when the game finishes.

Most game-playing AI agents’ behaviour relies on value functions. A value function is a proxy for how good it is to be in any possible position in the game. For example, a value function that accurately estimates expected future rewards would allow an agent to select optimal moves by considering the value of each possible resulting position. Of course, in practice 100% accurate estimates are infeasible, and much research effort is dedicated to how to learn value functions.
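
As a concrete illustration, here is a minimal C sketch of greedy move selection driven by a value function. The linear value function, the feature encoding and the weights are purely illustrative stand-ins; Fatboy’s real $V$ was a neural network over hand-crafted features.

```c
#include <stdio.h>

#define NUM_FEATURES 4   /* illustrative only */

/* Illustrative linear value function: V(x) = w . features(x).
 * Fatboy's real V was a neural network over hand-crafted features. */
static double value(const double *features, const double *weights)
{
    double v = 0.0;
    for (int i = 0; i < NUM_FEATURES; ++i)
        v += weights[i] * features[i];
    return v;
}

/* Greedy move selection: evaluate the position reached by each legal move
 * and pick the move whose resulting position has the highest value. */
static int best_move(double candidates[][NUM_FEATURES], int num_candidates,
                     const double *weights)
{
    int best = 0;
    double best_value = value(candidates[0], weights);
    for (int i = 1; i < num_candidates; ++i) {
        double v = value(candidates[i], weights);
        if (v > best_value) {
            best_value = v;
            best = i;
        }
    }
    return best;
}

int main(void)
{
    /* Made-up feature vectors for three candidate positions. */
    double candidates[3][NUM_FEATURES] = {
        {0.2, 0.1, 0.0, 0.5},
        {0.6, 0.3, 0.1, 0.2},
        {0.1, 0.9, 0.4, 0.0},
    };
    double weights[NUM_FEATURES] = {0.5, 0.2, -0.3, 0.1};

    printf("best move: %d\n", best_move(candidates, 3, weights));
    return 0;
}
```

With a trained network the same greedy loop simply swaps the linear function for the network’s evaluation of each candidate position.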

Temporal difference learning

Temporal difference algorithms learn a value function by bootstrapping. Given an existing value function, $V$, they use evaluations of future states, $V(x_{t+k})$, to update the evaluation of the current state, $V(x_t) \mapsto V^*(x_t)$. TD-Lambda is one such algorithm that smooths evaluations using exponential discounting.

\[V^*(x_t) = \frac{1}{n-t} \sum_{k=t+1}^n\left[(1 - \lambda^{k-t-1})V(x_t) + \lambda^{k-t-1}V(x_{k})\right]\]

$\lambda$ is the discounting parameter, $x_n$ is the end-of-game state, and $V(x_n)$ is the reward signal.
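
Here is a small C sketch of that smoothed target, assuming the current evaluations $V(x_t), \dots, V(x_n)$ are already stored in an array; the values in main are made up.

```c
#include <stdio.h>
#include <math.h>

/* Smoothed TD-Lambda target for state x_t, given the current evaluations
 * V(x_0), ..., V(x_n) of every state in the game (values[n] is the reward). */
static double td_lambda_target(const double *values, int n, int t, double lambda)
{
    double sum = 0.0;
    for (int k = t + 1; k <= n; ++k) {
        double w = pow(lambda, k - t - 1);   /* lambda^(k-t-1) */
        sum += (1.0 - w) * values[t] + w * values[k];
    }
    return sum / (double)(n - t);            /* average of the n-t terms */
}

int main(void)
{
    /* Made-up evaluations for a six-state game; the last entry is the reward. */
    double values[] = {0.55, 0.60, 0.48, 0.70, 0.85, 1.0};
    int n = 5;                               /* index of the terminal state x_n */

    for (int t = 0; t < n; ++t)
        printf("V*(x_%d) = %.3f\n", t, td_lambda_target(values, n, t, 0.7));
    return 0;
}
```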

Fatboy

In Fatboy, $V$ is implemented as a neural network, the states $\{x_1, \dots, x_n\}$ are generated by self-play with random dice, and $V(x_t)$ is trained to approximate $V^*(x_t)$ for all $t$. This procedure is repeated over many thousands of games.
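
In outline, that training loop looks something like the C sketch below. To keep it self-contained, a linear value function stands in for the neural network and a random-state generator stands in for proper self-play with dice, so treat it as a toy illustration of the procedure rather than Fatboy’s actual code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NUM_FEATURES 4      /* illustrative sizes only */
#define MAX_STATES   64
#define NUM_GAMES    10000
#define ALPHA        0.01   /* learning rate */
#define LAMBDA       0.7    /* TD-Lambda discounting parameter */

static double weights[NUM_FEATURES];

/* Linear stand-in for the neural network evaluation V(x). */
static double value(const double *features)
{
    double v = 0.0;
    for (int i = 0; i < NUM_FEATURES; ++i)
        v += weights[i] * features[i];
    return v;
}

/* Placeholder for self-play with random dice: fills in a sequence of
 * position feature vectors and returns its length. Real self-play would
 * roll dice, generate legal moves and pick them greedily using value(). */
static int self_play_game(double states[][NUM_FEATURES], double *reward)
{
    int length = 8 + rand() % (MAX_STATES - 8);
    for (int t = 0; t < length; ++t)
        for (int i = 0; i < NUM_FEATURES; ++i)
            states[t][i] = (double)rand() / RAND_MAX;
    *reward = (rand() % 2) ? 1.0 : 0.0;   /* win or lose */
    return length;
}

int main(void)
{
    double states[MAX_STATES][NUM_FEATURES];
    double values[MAX_STATES];

    for (int game = 0; game < NUM_GAMES; ++game) {
        double reward;
        int n = self_play_game(states, &reward) - 1;  /* index of x_n */

        for (int t = 0; t < n; ++t)
            values[t] = value(states[t]);
        values[n] = reward;                           /* V(x_n) is the reward */

        /* Move each evaluation towards its smoothed TD-Lambda target. */
        for (int t = 0; t < n; ++t) {
            double target = 0.0;
            for (int k = t + 1; k <= n; ++k) {
                double w = pow(LAMBDA, k - t - 1);
                target += (1.0 - w) * values[t] + w * values[k];
            }
            target /= (double)(n - t);

            /* Gradient step for the linear stand-in model. */
            double error = target - values[t];
            for (int i = 0; i < NUM_FEATURES; ++i)
                weights[i] += ALPHA * error * states[t][i];
        }
    }

    printf("first weight after training: %f\n", weights[0]);
    return 0;
}
```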

Early 90s neural networks

In the early 90s there were no deep learning frameworks such as TensorFlow or PyTorch; Fatboy was implemented from scratch in ANSI C. Modern optimisers like Adam were not available either, but I still had to choose between back-propagation (with momentum), (scaled) conjugate gradient descent, delta-bar-delta, and RProp.
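
As a flavour of the kind of update rule involved, here is a sketch of the classic momentum update in C. The gradient values are dummies; in a real trainer they would come from back-propagating the TD error through the network.

```c
#include <stdio.h>

#define NUM_WEIGHTS   4
#define LEARNING_RATE 0.1
#define MOMENTUM      0.9

/* Back-propagation with momentum: each weight keeps a velocity that blends
 * the previous update with the freshly back-propagated gradient. */
static void momentum_update(double *weights, double *velocity,
                            const double *gradient)
{
    for (int i = 0; i < NUM_WEIGHTS; ++i) {
        velocity[i] = MOMENTUM * velocity[i] - LEARNING_RATE * gradient[i];
        weights[i] += velocity[i];
    }
}

int main(void)
{
    double weights[NUM_WEIGHTS]  = {0.1, -0.2, 0.3, 0.0};
    double velocity[NUM_WEIGHTS] = {0.0, 0.0, 0.0, 0.0};
    double gradient[NUM_WEIGHTS] = {0.05, -0.01, 0.02, 0.03};  /* dummy values */

    momentum_update(weights, velocity, gradient);
    printf("w[0] after one update: %f\n", weights[0]);
    return 0;
}
```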

The final Fatboy agent consisted of:

  • a set of hand-crafted position features (à la TD-Gammon)
  • a neural network with three hidden layers (roughly sketched in the code after this list)
  • outputs that represent the probability that a single or double game would be won or lost (backgammons were ignored as they were so rare and could almost always be avoided by a roll-off database)
  • an almost perfect database for bearing off
  • an understanding of backgammon match equity in relation to cube decisions
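
Below is a rough C sketch of how such an evaluation network hangs together: hand-crafted features feed three sigmoid hidden layers, which feed outputs for the four game outcomes. The layer sizes, the all-zero weights and the exact output encoding are illustrative assumptions, not Fatboy’s actual configuration.

```c
#include <stdio.h>
#include <math.h>

/* Illustrative sizes only - not Fatboy's actual dimensions. */
#define NUM_FEATURES 8
#define HIDDEN       16
#define NUM_OUTPUTS  4   /* win single, win double, lose single, lose double */

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* One fully connected layer with sigmoid activations. Weights are stored
 * row-major with a bias as the last entry of each row. */
static void layer(const double *in, int n_in, double *out, int n_out,
                  const double *w)
{
    for (int j = 0; j < n_out; ++j) {
        double z = w[j * (n_in + 1) + n_in];              /* bias */
        for (int i = 0; i < n_in; ++i)
            z += w[j * (n_in + 1) + i] * in[i];
        out[j] = sigmoid(z);
    }
}

/* Forward pass: hand-crafted features -> three hidden layers -> outcome
 * probabilities. The weights would come from TD-Lambda self-play training. */
static void evaluate(const double *features,
                     const double *w1, const double *w2,
                     const double *w3, const double *w4,
                     double *outputs)
{
    double h1[HIDDEN], h2[HIDDEN], h3[HIDDEN];
    layer(features, NUM_FEATURES, h1, HIDDEN, w1);
    layer(h1, HIDDEN, h2, HIDDEN, w2);
    layer(h2, HIDDEN, h3, HIDDEN, w3);
    layer(h3, HIDDEN, outputs, NUM_OUTPUTS, w4);
}

int main(void)
{
    static double w1[HIDDEN * (NUM_FEATURES + 1)];   /* zero-initialised, so */
    static double w2[HIDDEN * (HIDDEN + 1)];         /* every output is 0.5  */
    static double w3[HIDDEN * (HIDDEN + 1)];
    static double w4[NUM_OUTPUTS * (HIDDEN + 1)];
    double features[NUM_FEATURES] = {0};
    double outputs[NUM_OUTPUTS];

    evaluate(features, w1, w2, w3, w4, outputs);
    printf("P(win single game) = %.2f\n", outputs[0]);
    return 0;
}
```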

As described above, TD-Lambda reinforcement learning was used to update the position evaluations arrived at through self-play.

Unfortunately, despite buying a USB floppy disk drive and trawling through some antiquated disks, I have been unable to recover Fatboy’s source code and weights, and he will probably never play backgammon again. This is probably for the best, as he might embarrass himself against the modern neural network backgammon-playing monsters.

Online backgammon

Playing 24-by-7 had some advantages, and Fatboy soon became the most active player on FIBS. By November 1995 he had played over 36,000 games.

There was some discussion of his playing strength. Several in the community were convinced he was over-rated, but this wasn’t necessarily supported by the evidence. He played many more games than human players, which reduced the variance of his rating. In addition, he had virtually no defences against individuals who would not complete matches they were about to lose, which had a negative effect on his rating. He definitely had deficiencies in certain technical aspects of backgammon, which may have led observers to believe he was over-rated, although it could be that his positional play more than made up for this. Indeed, TD-Gammon and other neural-network-based agents forced a re-evaluation of positional play in backgammon as they convinced the world that some of their strategies were superior to those followed by world-class players.

In any case, Fatboy was rated well above the median and was getting into expert territory. He was never close to challenging TD-Gammon for the title of strongest backgammon-playing bot, and soon other neural network bots overtook him too (for example mloner, JellyFish and Snowie). However, for a while he was the strongest freely available bot.

Fatboy was popular with at least some of the FIBS community:

“I find fatboy to be one of the politest players on FIBS. He never whines about bad rolls. He almost always plays quickly (unless lagged) and yet never complains if you play slowly. He accepts matches from anyone, no matter how inexperienced or annoying they may be. I think a lot of people on FIBS could take a lesson in proper conduct from fatboy.

Many people must agree with me – otherwise, why would he be the most popular player on FIBS?

So he’s a little on the quiet side – maybe he’s shy. :-)

In any case, there are many many other candidates more deserving of the Least Congenial award.” - Darse Billings

This is presumably the same Darse Billings who went on to research poker-playing AI. Anyway, thanks for the quote, Darse!

Thesis

If you are really interested in more details, please refer to my thesis (or just ask me online):

Reid, John. ‘A Comparison of Various Neural Network Architectures for Learning Context-Dependent Game Strategies’. Diploma in Computer Science, Cambridge University, 1994.
