<h1 id="fatboy-a-backgammon-ai-title">Fatboy: a backgammon AI</h1>
<p>This is the story of <em>Fatboy</em>, a neural network that taught itself to play
backgammon back in the early 90s. When it connected to the <a href="http://www.fibs.com/">First Internet
Backgammon Server</a> (FIBS) 24 hours a day, seven days
a week in the summer of '94, it was probably the first autonomous AI
game-playing agent on the internet. It wasn’t the strongest backgammon-playing
network; that title was held by TD-Gammon and later by JellyFish. For a while,
though, it was the strongest freely available program. It played at a decent
level and generated a lot of
<a href="https://groups.google.com/forum/#!searchin/rec.games.backgammon/fatboy|sort:date/rec.games.backgammon/U7uYIu3wk6Q/htGTNKMWz5AJ">interest</a>
on FIBS when it first appeared.</p>
<!-- Control how much is shown as an excerpt. -->
<!--more-->
<p>DeepMind’s
<a href="https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go">AlphaZero</a>
may have attracted all the
<a href="https://www.theguardian.com/sport/2018/dec/11/creative-alphazero-leads-way-chess-computers-science">headlines</a>
in the last few years as the best Go, chess, and shogi playing entity on the
planet. Its playing strength is often attributed to its ability to learn
through self-play, but this technique is not new: its use was pioneered by
Gerald Tesauro in the late 80s and early 90s. Tesauro created
<em><a href="http://www.mitpressjournals.org/doi/10.1162/neco.1994.6.2.215">TD-Gammon</a></em>,
a world-class backgammon AI that learned by self-play.</p>
<p>In 1993-94 I was introduced to neural networks while studying for
a postgraduate Diploma in Computer Science at the University of Cambridge and
I wanted to learn more. Somehow, although I cannot remember how, I had heard of
Gerald Tesauro’s revolutionary work and was interested in replicating his
results for my thesis. I was already playing backgammon for the Cambridge
University Backgammon Club and socially at the Department of Pure Mathematics
and Mathematical Statistics, so the project naturally combined two interests.</p>
<p>The result was <em>Fatboy</em>, a backgammon playing agent taught entirely through
self-play. Fatboy included an interface to FIBS and to the best of my knowledge
was the first AI agent to play online against all-comers on the internet.</p>
<p><img src="http://johnreid.github.io/images/fatboy-cover.jpg" alt="Cover of the thesis describing Fatboy" /></p>
<h2 id="game-playing-ais">Game-playing AIs</h2>
<p>Game playing is a natural environment to research and develop AI and has been
used as such since Samuel’s
<a href="https://ieeexplore.ieee.org/abstract/document/5392560">checker</a>
<a href="https://ieeexplore.ieee.org/abstract/document/5391906">programs</a> in the 50s.
Some other notable early examples include:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Hans_Berliner">Hans Berliner’s</a>
rules-based <a href="http://www.sciencedirect.com/science/article/pii/0004370280900417">backgammon
program</a>, which in 1979 defeated the reigning world champion (Berliner was
also a well-known computer chess programmer and correspondence chess world
champion).</li>
<li>Tesauro’s world class backgammon neural network agent
<a href="https://en.wikipedia.org/wiki/TD-Gammon">TD-Gammon</a> (1989)</li>
<li>Boyan’s <a href="https://bkgm.com/articles/Grater/Bibliography/files/Boyan-BackgammonThesis.pdf">modular
networks</a>
for playing backgammon (1992)</li>
<li>Schraudolph, Dayan and Sejnowski’s 9x9 <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/sds94.html">Go playing neural network
agent</a> (1994)</li>
<li>Moriarty and Miikkulainen’s <a href="http://nn.cs.utexas.edu/downloads/papers/moriarty.discovering.pdf">genetic
algorithms</a>
for Othello (1995)</li>
</ul>
<h2 id="why-backgammon">Why backgammon?</h2>
<p>Out of all these early game-playing AIs why was it that the most success was
found in backgammon with TD-Gammon? Many of these agents shared the same
learning algorithm and used similar neural architectures. The consensus is that
it was the nature of the game of backgammon that enabled TD-Gammon to perform
so well. The dice make backgammon stochastic, which limits deep look-ahead
planning. Calculating tactical skirmishes is less important than in games such
as chess. The evaluation of most chess positions (especially those critical to
the outcome of the game) relies heavily on deep tree search to resolve these
tactics. In contrast, the evaluation of backgammon positions depends more on
the recognition of positional features and an understanding of how they
interact rather than resolving a tree of variations.</p>
<p>In addition to the effect the dice have on the principles of evaluating
backgammon positions, they may also aid the learning algorithm. Reinforcement
learning algorithms for game playing can get stuck in local strategic minima
when they believe their sub-optimal strategies are best. However, a bad roll
can force the agent to visit positions it would not ordinarily choose. This
exploration can be vital for successful learning when it results in the
re-evaluation of positions previously thought to be bad. Indeed, the importance
of exploration for reinforcement learning is demonstrated by how much active
research is dedicated to exploration/exploitation trade-offs.</p>
<h2 id="reinforcement-learning">Reinforcement learning</h2>
<p>The field of <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">Reinforcement
learning</a> is concerned
with how to optimise an agent’s behaviour to maximise some reward signal. The
reward signal is typically sparse in board games as the agent only receives
feedback when the game finishes.</p>
<p>Most game-playing AI agents’ behaviour relies on value functions. A value
function is a proxy for how good it is to be in any possible position in the
game. For example, a value function that accurately estimates expected future
rewards would allow an agent to select optimal moves by considering the value
of each possible resulting position. Of course, in practice 100% accurate estimates
are infeasible and much research effort is dedicated to how to learn value
functions.</p>
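<p>Greedy one-ply move selection against such a value function is simple. Here is a toy Python sketch (all names are hypothetical, for illustration only):</p>

```python
def choose_move(position, legal_moves, apply_move, value):
    """Pick the legal move whose resulting position the value function rates highest."""
    return max(legal_moves, key=lambda move: value(apply_move(position, move)))

# Toy example: positions are integers and the 'best' position is 2,
# so from position 0 the move 2 should be chosen.
best = choose_move(0, [1, 2, 3], lambda pos, m: pos + m, lambda pos: -abs(pos - 2))
```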
<h3 id="temporal-difference-learning">Temporal difference learning</h3>
<p>Temporal difference algorithms learn a value function by bootstrapping. Given
an existing value function, $V$, they use evaluations of future states,
$V(x_k)$ for $k > t$, to update the evaluation of the current state, $V(x_t) \mapsto V^*(x_t)$.
TD-Lambda is one such algorithm that smooths evaluations using exponential
discounting.</p>
\[V^*(x_t) = \frac{1}{n-t} \sum_{k=t+1}^n[(1 - \lambda^{k-t-1})V(x_t) + \lambda^{k-t-1}V(x_k)]\]
<p>$\lambda$ is the discounting parameter, $x_n$ is the end-of-game state and
$V(x_n)$ is the reward signal.</p>
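<p>The smoothed targets above can be computed directly from a sequence of evaluations. A minimal Python sketch (not Fatboy's original ANSI C, and deliberately unoptimised):</p>

```python
def td_lambda_targets(values, lam):
    """Smoothed TD-Lambda targets for states x_0, ..., x_{n-1}.

    `values` holds V(x_0), ..., V(x_n), where V(x_n) is the reward signal.
    Returns the list of targets V*(x_t) for t = 0, ..., n-1.
    """
    n = len(values) - 1
    targets = []
    for t in range(n):
        total = 0.0
        for k in range(t + 1, n + 1):
            w = lam ** (k - t - 1)  # exponential discounting of later evaluations
            total += (1 - w) * values[t] + w * values[k]
        targets.append(total / (n - t))
    return targets
```

<p>With $\lambda = 1$ each target is simply the average of all future evaluations; with $\lambda = 0$ only the immediately following state contributes a new evaluation.</p>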
<h2 id="fatboy">Fatboy</h2>
<p>In Fatboy, $V$ is implemented as a neural network, the states $\{x_1, \dots,
x_n\}$ are generated by self-play with random dice and $V(x_t)$ is trained to
approximate $V^*(x_t)$ for all $t$. This procedure is repeated over many
thousands of games.</p>
<!-- ### Training -->
<!-- ![Hyperparameter search](http://johnreid.github.io/images/fatboy-params.jpg){: width="400px" .align-center} -->
<h3 id="early-90s-neural-networks">Early 90s neural networks</h3>
<p>In the early 90s there were no deep learning frameworks such as TensorFlow or
PyTorch. Fatboy was implemented from scratch in ANSI C. Modern optimisers like
Adam were not available; instead I had to choose between backpropagation
(with momentum), (scaled) conjugate gradient descent, delta-bar-delta, and
RProp.</p>
<!-- ![Fatboy's forward propagation algorithm](http://johnreid.github.io/images/fatboy-forward-i.jpg){: .center-image width="400px" } -->
<!-- ![Fatboy's forward propagation algorithm](http://johnreid.github.io/images/fatboy-forward-ii.jpg){: .center-image width="400px" } -->
<p>The final Fatboy agent consisted of:</p>
<ul>
<li>a set of hand-crafted position features (à la TD-Gammon)</li>
<li>a neural network with three hidden layers</li>
<li>outputs that represent the probability that a single or double game would
be won or lost (backgammons were ignored as they were so rare and could
almost always be avoided by a roll-off database)</li>
<li>an almost perfect database for bearing off</li>
<li>an understanding of backgammon match equity in relation to cube decisions</li>
</ul>
<p>As described above, TD-Lambda reinforcement learning was used to update the
evaluations of positions reached through self-play.</p>
<p>Unfortunately, despite purchasing a USB floppy disk drive and trawling through
some antiquated disks, Fatboy’s source code and weights have been lost and he
will probably never play backgammon again. This is probably for the best as he
might embarrass himself against the modern neural network backgammon playing
monsters.</p>
<h2 id="online-backgammon">Online backgammon</h2>
<p>Playing 24/7 had its advantages and Fatboy soon became the most active
player on FIBS. By November 1995 he had played over <a href="https://groups.google.com/forum/#!topic/rec.games.backgammon/C8KL6uF9vuU">36,000
games</a>.</p>
<p>There was some <a href="https://bkgm.com/rgb/rgb.cgi?view+181">discussion</a> of his
playing strength. Several in the community were convinced he was over-rated, but
this wasn’t necessarily supported by the evidence. He played many more games
than human players, which reduced the variance of his rating. In addition, he had
virtually no defences against individuals who would abandon matches they
were about to lose, which had a negative effect on his rating. He certainly had
deficiencies in some technical aspects of backgammon, which may have led observers
to believe he was over-rated, but it could be that his positional play more than
made up for them. Indeed, TD-Gammon and other neural-network-based agents forced
a re-evaluation of positional play in backgammon as they convinced the world
that some of their strategies were superior to those followed by world-class players.</p>
<p>In any case, Fatboy was rated well above the median and was getting into
expert territory. He was never close to challenging TD-Gammon for the title of
strongest backgammon-playing bot, and soon other neural network bots overtook
him too (for example mloner, JellyFish and Snowie). However, for a while he
was the strongest freely available bot.</p>
<p>Fatboy was popular with at least some of the FIBS
<a href="https://groups.google.com/forum/#!searchin/rec.games.backgammon/fatboy|sort:date/rec.games.backgammon/S4T7wmYE5Bs/BJKMz9lc8k0J">community</a>:</p>
<blockquote>
<p>“I find fatboy to be one of the politest players on FIBS. He never
whines about bad rolls. He almost always plays quickly (unless lagged)
and yet never complains if you play slowly. He accepts matches from
anyone, no matter how inexperienced or annoying they may be. I think
a lot of people on FIBS could take a lesson in proper conduct from
fatboy.</p>
</blockquote>
<blockquote>
<p>Many people must agree with me – otherwise, why would he be the most
popular player on FIBS?</p>
</blockquote>
<blockquote>
<p>So he’s a little on the quiet side – maybe he’s shy. :-)</p>
</blockquote>
<blockquote>
<p>In any case, there are many many other candidates more deserving of the
Least Congenial award.” - Darse Billings</p>
</blockquote>
<p>This is presumably the same <a href="https://webdocs.cs.ualberta.ca/~darse/">Darse
Billings</a> who went on to research poker
playing AI. Anyway, thanks for the quote, Darse!</p>
<h2 id="thesis">Thesis</h2>
<p>If you are really interested in more details please refer to my thesis (or just ask me online):</p>
<p>Reid, John. ‘A Comparison of Various Neural Network Architectures for Learning Context-Dependent Game Strategies’. Diploma in Computer Science, Cambridge University, 1994.</p>
<p><a href="http://johnreid.github.io/2020/03/fatboy">Fatboy: a backgammon AI</a> was originally published by John Reid at <a href="http://johnreid.github.io">John Reid</a> on March 08, 2020.</p>
<h1 id="hamiltonian-annealed-importance-sampling-title">Hamiltonian Annealed Importance Sampling</h1>
<p>Estimating expectations with respect to high-dimensional multimodal
distributions is difficult. Here we describe an
<a href="https://github.com/JohnReid/HAIS">implementation</a> of Hamiltonian annealed
importance sampling in TensorFlow and compare it to other annealed importance
sampling implementations. This is joint work with <a href="https://twitter.com/bilginhalil">Halil Bilgin</a>.</p>
<!-- Control how much is shown as an excerpt. -->
<!--more-->
<h2 id="introduction">Introduction</h2>
<p>Radford Neal <a href="http://arxiv.org/abs/physics/9803008">showed how</a> to use
annealing techniques to define importance samplers suitable for complex
multimodal distributions. <a href="http://arxiv.org/abs/1205.1925">Sohl-Dickstein and Culpepper
extended</a> his work by demonstrating the utility
of Hamiltonian dynamics for the transition kernels between the annealing
distributions. We summarise these developments here.</p>
<h3 id="naive-monte-carlo-expectations">Naive Monte Carlo expectations</h3>
<p>A naive yet often practical way to estimate the expectation of a function $f(x)$
is simply to sample from the underlying distribution $p(x)$ and take an average:</p>
\[\mathbb{E}_p[f(X)] \approx \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} f(X_n), \qquad X_n \sim p\]
<p>However, when $f(x)$ is close to zero outside some set $\mathcal{A}$ where
$\mathbb{P}(X \in \mathcal{A})$ is small, samples from $p$ typically fall
outside of $\mathcal{A}$ and the variance of $\hat{\mu}$ is high. Estimating
the marginal likelihood (or evidence) of a model given some data $\mathcal{D}$
is a canonical example. In this case $f(x) = p(\mathcal{D}|x)$ is the
likelihood and $p(x)$ is the prior. For many models and data sets the
posterior will be highly concentrated around a typical set, $\mathcal{A}$, that
has only small support under the prior. That is, $p(\mathcal{D}|x)$ will be
small for most samples from $p(x)$ and a few terms in the average will
dominate, resulting in a high-variance estimator.</p>
<h3 id="importance-sampling">Importance Sampling</h3>
<p><a href="https://en.wikipedia.org/wiki/Importance_sampling">Importance sampling</a> (IS)
is a method that can be well-suited for estimating expectations for which the
naive Monte Carlo estimator has high variance. The IS estimate of the
expectation is</p>
\[\hat{\mu}_q = \frac{1}{N} \sum_{n=1}^{N} \frac{f(X_n) p(X_n)}{q(X_n)}, \qquad X_n \sim q\]
<p>where $q$ is the <strong>importance distribution</strong> and $p$ is the <strong>nominal
distribution</strong>. Choosing $q=p$ gives the naive Monte Carlo estimator. A well
chosen $q$ will give $\mathbb{V}[\hat{\mu}_q] < \mathbb{V}[\hat{\mu}]$. A badly
chosen $q$ can give $\hat{\mu}_q$ infinite variance. The ideal $q$ is
proportional to $f(x)p(x)$. However, in general this is not helpful as the
required normalising constant is the intractable expectation we wish to
estimate in the first place.</p>
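<p>As a concrete sketch of the estimator above, here we estimate $\mathbb{E}[X^2]$ under a standard normal, sampling from a wider normal $N(0, 2^2)$ as the importance distribution (a toy Python illustration, not the HAIS code):</p>

```python
import numpy as np

def importance_sampling(f, log_p, log_q, sample_q, n_samples, rng):
    """Estimate E_p[f(X)] using samples drawn from the importance distribution q."""
    x = sample_q(n_samples, rng)
    log_w = log_p(x) - log_q(x)  # log importance weights p(x)/q(x)
    return float(np.mean(f(x) * np.exp(log_w)))

rng = np.random.default_rng(0)
# Nominal distribution p: standard normal. Importance distribution q: N(0, 2^2).
log_p = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)
log_q = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2.0 * np.pi)
sample_q = lambda n, rng: 2.0 * rng.standard_normal(n)
est = importance_sampling(lambda x: x ** 2, log_p, log_q, sample_q, 200_000, rng)
# est should be close to E[X^2] = 1 under the standard normal
```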
<h3 id="annealed-importance-sampling">Annealed Importance Sampling</h3>
<p>Finding good importance distributions when $X$ is high-dimensional and/or $p$
is multimodal can be difficult. This makes the variance of $\hat{\mu}_q$
difficult to control. <a href="http://arxiv.org/abs/physics/9803008">Annealed importance
sampling</a> (AIS) is designed to alleviate
this issue. AIS produces importance weighted samples from an unnormalised
target distribution $p_0$ by annealing towards it from some proposal distribution
$p_N$. For example,</p>
\[p_n(x) = p_0(x)^{\beta_n} p_N(x)^{1-\beta_n}\]
<p>where $1 = \beta_0 > \beta_1 > \dots > \beta_N = 0$. To implement AIS we must be
able to</p>
<ul>
<li>sample from $p_N$</li>
<li>evaluate each (potentially unnormalised) distribution $p_n$</li>
<li>simulate a Markov transition $T_n$ for each $1 \le n \le N-1$ that leaves $p_n$ invariant</li>
</ul>
<h3 id="hamiltonian-annealed-importance-sampling">Hamiltonian Annealed Importance Sampling</h3>
<p><a href="http://arxiv.org/abs/1205.1925">Hamiltonian annealed importance sampling</a>
(HAIS) is a variant of AIS in which Hamiltonian dynamics are used to simulate
the Markov transitions between the annealed distributions. An important feature
is that the momentum term is partially preserved between transitions.</p>
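<p>The Hamiltonian transitions are built on leapfrog integration of the log-density gradient. A minimal sketch of the integrator alone (the accept/reject step and the partial momentum persistence are omitted):</p>

```python
import numpy as np

def leapfrog(x, p, grad_log_target, step, n_steps):
    """Integrate Hamiltonian dynamics (position x, momentum p) with leapfrog."""
    p = p + 0.5 * step * grad_log_target(x)  # initial half-step for momentum
    for _ in range(n_steps - 1):
        x = x + step * p
        p = p + step * grad_log_target(x)
    x = x + step * p
    p = p + 0.5 * step * grad_log_target(x)  # final half-step for momentum
    return x, p

# Standard normal target: the Hamiltonian H(x, p) = x^2/2 + p^2/2 is
# conserved up to O(step^2) along the trajectory.
x1, p1 = leapfrog(1.0, 0.0, lambda x: -x, step=0.01, n_steps=100)
```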
<h2 id="examples">Examples</h2>
<h3 id="log-gamma-normalising-constant">Log-gamma normalising constant</h3>
<p>Unnormalised versions of well known densities are useful test cases for
annealed importance samplers. If $p$ is such an unnormalised density function
then its normalising constant is simply $\mathbb{E}_p[1]$ and can be estimated
using AIS. This estimate can be compared against the exact value to
double-check the validity of the estimate.</p>
<p>If $X \sim \Gamma(\alpha, \beta)$ then $Y = \log X$ is said to follow a
$\textrm{log-gamma}(\alpha, \beta)$ distribution. The probability density
function of the gamma distribution is</p>
\[f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{- \beta x}\]
<p>for all $x > 0$ and any given shape $\alpha > 0$ and rate $\beta > 0$. Given the change
of variables $y = \log(x)$ we have the density of the log-gamma distribution</p>
\[f(y; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} e^{\alpha y - \beta e^y}\]
<p>Thus if we define our unnormalised density as $p_0(x) = e^{\alpha x - \beta e^x}$
its normalising constant is $\frac{\Gamma(\alpha)}{\beta^\alpha}$.</p>
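<p>As a sanity check on the algebra, the normalising constant can be confirmed numerically. A quick Python sketch using trapezoidal integration (not part of the HAIS implementation):</p>

```python
import numpy as np
from math import gamma, log

alpha, beta = 2.0, 3.0
# Integrate the unnormalised log-gamma density over a grid wide enough
# that both tails have decayed to zero.
y = np.linspace(-20.0, 5.0, 200_001)
p0 = np.exp(alpha * y - beta * np.exp(y))
Z = float(np.sum((p0[1:] + p0[:-1]) * np.diff(y)) / 2)  # trapezoidal rule
exact = gamma(alpha) / beta ** alpha  # Gamma(2) / 3^2 = 1/9
# log(Z) should be close to log(1/9) = -2.1972...
```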
<p>Running our HAIS sampler on this unnormalised density with $\alpha = 2$, $\beta = 3$
and a standard normal prior gives the samples below. The estimate of the log
normalising constant is -2.1975 (the true value is $\log \frac{1}{9} \approx -2.1972$).</p>
<p><img src="http://johnreid.github.io/images/hais-log-gamma-samples.png" alt="HAIS samples from the log-gamma distribution" /></p>
<h3 id="marginal-likelihood">Marginal likelihood</h3>
<p>We used our HAIS implementation to estimate the marginal log likelihood of a
latent variable model for which an analytic solution is known (model 1a from
Sohl-Dickstein and Culpepper). In the plot below we compare our HAIS estimates
with those from the BayesFlow AIS sampler included with TensorFlow (version
1.6).</p>
<p><img src="http://johnreid.github.io/images/model1a-gaussian-estimates.png" alt="HAIS and BayesFlow AIS estimates of the marginal log likelihood for model 1a" /></p>
<p>The dotted line represents the ideal marginal log likelihood estimates. Our
estimates are much closer to these true values.</p>
<h2 id="implementation">Implementation</h2>
<p>Our implementation is <a href="https://github.com/JohnReid/HAIS">available</a> under an MIT
licence. Test scripts to generate the figures in this post are also
<a href="https://github.com/JohnReid/HAIS/tree/master/tests">available</a>.</p>
<p><a href="http://johnreid.github.io/2018/12/hais">Hamiltonian Annealed Importance Sampling</a> was originally published by John Reid at <a href="http://johnreid.github.io">John Reid</a> on December 04, 2018.</p>
<h1 id="partial-sort-title">Retrieving the k largest (or smallest) elements in R</h1>
<p><em>26 September 2018</em></p>
<p>A common problem in computer science is selecting the $k$ largest (or smallest)
elements from an unsorted list containing $n$ elements. The most commonly implemented solution is far from optimal. This post describes a better way.</p>
<!-- Control how much is shown as an excerpt. -->
<!--more-->
<p>The problem is a form of <a href="https://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_selection">partition-based selection</a>.
For example, when computing k-nearest-neighbour distances, we first calculate
all the pairwise distances between samples, then for each sample we select the
$k$ closest distances. In R this is too often implemented as</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sort</span><span class="p">(</span><span class="n">dists</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="n">k</span><span class="p">]</span></code></pre></figure>
<p>which is correct but does not scale well. It sorts the entire vector <code class="language-plaintext highlighter-rouge">dists</code>
before selecting the first $k$ elements. As the number of elements $n$ grows
this is inefficient as the <code class="language-plaintext highlighter-rouge">sort()</code> call runs in $\mathcal{O}(n \log n)$ time.
Partition-based selection algorithms do not sort the entire list: selection
alone runs in $\mathcal{O}(n)$ expected time, or $\mathcal{O}(n + k \log k)$
if the selected elements are also sorted, a saving of roughly a factor of
$\log n$.</p>
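<p>The same partition-based selection exists outside R too. For example, in NumPy it is <code>np.partition</code>; a Python sketch mirroring the R setup below ($n = 3000$, $k = 20$):</p>

```python
import numpy as np

rng = np.random.default_rng(3737)
dists = rng.standard_normal(3000)
k = 20
# Partition-based selection: the k smallest elements in O(n) expected time,
# unsorted amongst themselves.
smallest_k = np.partition(dists, k - 1)[:k]
# Sorting just the selected elements costs only O(k log k) more.
smallest_k_sorted = np.sort(smallest_k)
```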
<p>The statistical programming language R has an inbuilt and under-appreciated
partial sorting implementation that can help tremendously. We showcase,
benchmark and discuss this functionality here.</p>
<h2 id="set-up">Set up</h2>
<p>Load the necessary packages.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">suppressPackageStartupMessages</span><span class="p">({</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggthemes</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">microbenchmark</span><span class="p">)</span><span class="w">
</span><span class="p">})</span></code></pre></figure>
<p>Configure plots and seed RNG.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">3737</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_few</span><span class="p">())</span></code></pre></figure>
<p>Set parameters.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3000</span><span class="w"> </span><span class="c1"># Number of samples</span><span class="w">
</span><span class="n">k</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w"> </span><span class="c1"># How many to select</span><span class="w">
</span><span class="n">zoom.margin</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># Margin for zoomed-in plot</span></code></pre></figure>
<h2 id="an-example">An example</h2>
<p>Just to demonstrate what R’s partial sorting implementation does, we generate
some test samples.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="c1"># samples</span></code></pre></figure>
<p>R’s standard <code class="language-plaintext highlighter-rouge">sort</code> function takes a <code class="language-plaintext highlighter-rouge">partial</code> argument specifying the indexes
at which you wish the vector to be partitioned. Here we want to select the
smallest $k$ elements so we have just one such index, $k$ itself.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">x_selected</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sort</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">partial</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">)</span></code></pre></figure>
<p>We plot the partially sorted vector to show that every element up to the $k$’th is
indeed no larger than every element that comes after it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gp</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">qplot</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">x_selected</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_vline</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">linetype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x_selected</span><span class="p">[</span><span class="n">k</span><span class="p">],</span><span class="w"> </span><span class="n">linetype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">gp</span></code></pre></figure>
<p><img src="http://johnreid.github.io/images/R-figs/plotPartial-1.svg" alt="plot of chunk plotPartial" /></p>
<p>Zoom in to the detail around the $k$’th element.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gp</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlim</span><span class="p">(</span><span class="n">k</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">zoom.margin</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">zoom.margin</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylim</span><span class="p">(</span><span class="n">x_selected</span><span class="p">[</span><span class="n">k</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">zoom.margin</span><span class="p">],</span><span class="w"> </span><span class="n">x_selected</span><span class="p">[</span><span class="n">k</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">zoom.margin</span><span class="p">])</span></code></pre></figure>
<p><img src="http://johnreid.github.io/images/R-figs/plotPartialZoom-1.svg" alt="plot of chunk plotPartialZoom" /></p>
<h2 id="benchmarks">Benchmarks</h2>
<p>Here we use the <code class="language-plaintext highlighter-rouge">microbenchmark</code> package to show how much quicker
partition-based selection is than full sorting. Note we also test finding the
largest $k$ elements (<code class="language-plaintext highlighter-rouge">sort(x, partial = length(x) - k)</code>).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">microbenchmark</span><span class="p">(</span><span class="w">
</span><span class="n">sort</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">partial</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w">
</span><span class="n">sort</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">partial</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">k</span><span class="p">),</span><span class="w">
</span><span class="n">sort</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Unit: microseconds
## expr min lq mean median
## sort(x, partial = k) 48.626 50.6075 54.18525 53.0365
## sort(x, partial = length(x) - k) 46.398 48.2705 51.06711 50.1240
## sort(x) 151.349 153.8045 161.37612 156.5275
## uq max neval cld
## 54.9455 101.500 100 a
## 52.3850 73.985 100 a
## 158.7200 284.841 100 b</code></pre></figure>
<h2 id="asymptotics">Asymptotics</h2>
<p>The running time should be linear in $n$. We define a function to time the
partition-based selection.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">time_partial_sort</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">samples_n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">samples</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="n">n</span><span class="p">]</span><span class="w">
</span><span class="n">then</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">proc.time</span><span class="p">()</span><span class="w">
</span><span class="n">sort</span><span class="p">(</span><span class="n">samples_n</span><span class="p">,</span><span class="w"> </span><span class="n">partial</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="nf">proc.time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">then</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>We choose 50 problem sizes ($n$) ranging from 100,000 to 100,000,000.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">problem_sizes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="m">10</span><span class="o">^</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">)))</span></code></pre></figure>
<p>Sample data to test with.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">problem_sizes</span><span class="p">))</span></code></pre></figure>
<p>Time the partition-based selection.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">timings</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">t</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">problem_sizes</span><span class="p">,</span><span class="w"> </span><span class="n">time_partial_sort</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">problem_sizes</span><span class="p">)</span></code></pre></figure>
<p>Plot the elapsed times. We observe a linear relationship between the running
time and $n$, consistent with the linear-time behaviour of partition-based selection.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">timings</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">elapsed</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'lm'</span><span class="p">)</span></code></pre></figure>
<p><img src="/../_posts/../images/R-figs/plotElapsed-1.svg" alt="plot of chunk plotElapsed" /></p>
<h1 id="drawbacks">Drawbacks</h1>
<p>Frequently we are interested not in the values of the $k$ smallest elements but
in their indexes. Unfortunately R’s <code class="language-plaintext highlighter-rouge">sort()</code> will not let us retrieve these
indexes as the <code class="language-plaintext highlighter-rouge">index.return = TRUE</code> parameter is not compatible with the
<code class="language-plaintext highlighter-rouge">partial</code> argument.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sort</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">partial</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">index.return</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...): unsupported options for partial sorting</code></pre></figure>
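<p>For comparison, a full <code class="language-plaintext highlighter-rouge">order()</code> call does return the indexes of the $k$ smallest elements, albeit at $\mathcal{O}(n \log n)$ cost (a minimal sketch):</p>

```r
# A full sort of the order statistics: O(n log n), but it gives us indexes.
set.seed(1)
x <- rnorm(3000)
k <- 20
idx <- head(order(x), k)  # indexes of the k smallest values, in ascending order
x[idx]                    # the k smallest values themselves
```

<p>For the problem sizes timed above the full sort's $\log n$ factor is what we were trying to avoid, so this is only attractive when $n$ is moderate.</p>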
<p>One possible solution is to find the $k$’th smallest element by partition-based
selection and then to run through the data again to locate those elements that
are less than or equal to it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">kth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sort</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">partial</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">k</span><span class="p">)[</span><span class="n">k</span><span class="p">]</span><span class="w">
</span><span class="n">kth</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] -2.642236</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">indexes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">kth</span><span class="p">)</span><span class="w">
</span><span class="n">indexes</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 74 82 305 335 347 509 594 656 744 1093 1384 1512 2003 2103
## [15] 2403 2494 2512 2638 2736 2815</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">x</span><span class="p">[</span><span class="n">indexes</span><span class="p">]</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] -2.664565 -2.645308 -2.801753 -2.642236 -3.058703 -2.997622 -2.690972
## [8] -3.167249 -2.934196 -2.656970 -2.685767 -2.647660 -3.342775 -4.279542
## [15] -2.984152 -3.673439 -2.849113 -2.884244 -3.026133 -3.874028</code></pre></figure>
<p>Note this does not deal with ties: when the $k$’th smallest value occurs more than
once, <code class="language-plaintext highlighter-rouge">which()</code> can return more than $k$ indexes.
This approach still has running time $\mathcal{O}(n + k \log k)$, but with a larger
constant factor and higher memory requirements from the second pass over the data.</p>
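<p>If exactly $k$ indexes are needed, the surplus from ties can be trimmed after the second pass (a sketch; ties here are broken by position in the vector):</p>

```r
# Toy vector where the k'th smallest value (2) is tied three ways.
x <- c(5, 2, 3, 2, 4, 2)
k <- 2
kth <- sort(x, partial = k)[k]
idx <- which(x <= kth)           # returns 3 indexes here, not 2
idx <- idx[order(x[idx])][1:k]   # keep exactly k, ties broken by position
x[idx]
```
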
<p>A more sophisticated approach could build upon this Rcpp
<a href="http://gallery.rcpp.org/articles/sorting/">example</a>.</p>
<p><a href="http://johnreid.github.io/2018/09/partial-sort">Retrieving the k largest (or smallest) elements in R</a> was originally published by John Reid at <a href="http://johnreid.github.io">John Reid</a> on September 26, 2018.</p>