## September 26, 2018

A common problem in computer science is selecting the $k$ largest (or smallest) elements from an unsorted list containing $n$ elements. The most commonly implemented solution is far from optimal. This post describes a better way.

The problem is a form of partition-based selection. For example, when computing $k$-nearest-neighbour distances, we first calculate all the pairwise distances between samples, then for each sample we select the $k$ closest distances. In R this is too often implemented as
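The original code chunk is missing here; a minimal sketch of the usual pattern, assuming a numeric vector `dists` of pairwise distances and a selection size `k` (both hypothetical):

```r
# Hypothetical data: pairwise distances for one sample.
dists <- runif(1000)
k <- 10

# The common but suboptimal pattern: sort everything, keep the first k.
smallest_k <- sort(dists)[1:k]
```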

which is correct but does not scale well. It sorts the entire vector dists before selecting the first $k$ elements. As the number of elements $n$ grows, this becomes inefficient: the sort() call runs in $\mathcal{O}(n \log n)$ time. Partition-based selection algorithms sort neither the entire list nor the selected elements. They run in $\mathcal{O}(n + k \log k)$ time, a saving of a factor of $\log n$.

The statistical programming language R has an inbuilt and under-appreciated partial sorting implementation that can help tremendously. We showcase, benchmark and discuss this functionality here.

## Set up

Configure plots and seed RNG.
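The set-up chunk is missing; a sketch of what it presumably did (the seed value is an assumption, not the post's original):

```r
# Seed the RNG so the results below are reproducible (seed value assumed).
set.seed(37)
```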

Set parameters.
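Again the chunk is missing; plausible values (both numbers are assumptions, not the post's originals):

```r
n <- 1e6  # number of elements (assumed)
k <- 100  # number of smallest elements to select (assumed)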

## An example

Just to demonstrate what R’s partial sorting implementation does, we generate some test samples.
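A sketch of the sample generation, assuming standard normal draws (the distribution and size are assumptions):

```r
set.seed(37)
n <- 1e6
x <- rnorm(n)  # n test samples
```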

R’s standard sort function takes a partial argument specifying the indexes at which you wish the vector to be partitioned. Here we want to select the smallest $k$ elements so we have just one such index, $k$ itself.
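A sketch of the call (the values of $n$ and $k$ are the assumed ones from above, repeated so the chunk stands alone):

```r
set.seed(37)
x <- rnorm(1e6)
k <- 100

# After this call the k smallest values occupy positions 1..k, and the
# value at position k is in its fully sorted place.
y <- sort(x, partial = k)
```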

We plot the selected array to show that every element beneath the $k$’th is indeed smaller than the $(k+1)$’th. Zoom in to the detail around the $k$’th element.

## Benchmarks

Here we use the microbenchmark package to show how much quicker partition-based selection is than full sorting. Note we also test finding the largest $k$ elements (sort(x, partial = length(x) - k)).
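A sketch of such a benchmark, assuming the microbenchmark package is installed (the vector size, $k$, and the expression labels are assumptions):

```r
library(microbenchmark)

set.seed(37)
x <- rnorm(1e6)
k <- 100

bm <- microbenchmark(
  full_sort     = sort(x)[1:k],                      # sort everything
  partial_small = sort(x, partial = k)[1:k],         # k smallest only
  partial_large = sort(x, partial = length(x) - k),  # partition for the k largest
  times = 10
)
print(bm)
```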

## Asymptotics

The running time should be linear in $n$. We define a function to time the partition-based selection.
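A sketch of such a helper (the function name and return convention are my own):

```r
# Time one partition-based selection of the k smallest elements,
# returning the elapsed wall-clock time in seconds.
time_partial_selection <- function(x, k) {
  unname(system.time(sort(x, partial = k))["elapsed"])
}
```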

We choose 50 problem sizes ($n$) ranging from 100,000 to 100,000,000.
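One way to generate the grid, spaced evenly on a log scale (the spacing is an assumption):

```r
# 50 problem sizes from 1e5 to 1e8, evenly spaced on a log scale.
ns <- round(10 ^ seq(log10(1e5), log10(1e8), length.out = 50))
```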

Sample data to test with.

Time the partition-based selection.
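Putting the pieces together, a sketch of the sampling and timing loop (using the assumed $k = 100$; each sample is drawn inside the loop to bound memory use, which may differ from the post's original approach):

```r
k <- 100
ns <- round(10 ^ seq(log10(1e5), log10(1e8), length.out = 50))

# For each problem size, draw a fresh sample and time the selection.
elapsed <- sapply(ns, function(n) {
  x <- rnorm(n)
  unname(system.time(sort(x, partial = k))["elapsed"])
})
```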

Plot the elapsed times. We observe a linear relationship between the running time and $n$.

## Drawbacks

Frequently we are interested not in the values of the $k$ smallest elements but their indexes. Unfortunately R’s sort() will not let us retrieve these indexes as the index.return = TRUE parameter is not compatible with the partial argument.

One possible solution is to find the $k$’th largest element by partition-based selection and then to run through the data again to locate those elements that are less than or equal to it.
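A sketch of this two-pass approach (the function name is my own):

```r
# Indexes of the k smallest elements: find the k'th smallest value by
# partition-based selection, then scan for everything no larger than it.
k_smallest_idx <- function(x, k) {
  kth <- sort(x, partial = k)[k]
  which(x <= kth)
}
```

With distinct values this returns exactly $k$ indexes.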

Note this does not deal with ties: if the $k$’th smallest value occurs more than once, more than $k$ indexes will be returned. The approach still runs in $\mathcal{O}(n + k \log k)$ time, but with a worse constant factor and additional memory requirements.

A more sophisticated approach could build upon this Rcpp example.
