A common problem in computer science is selecting the $k$ largest (or smallest) elements from an unsorted list containing $n$ elements. The most commonly implemented solution is far from optimal. This post describes a better way.

The problem is a form of partition-based selection. For example, when computing k-nearest-neighbour distances, we first calculate all the pairwise distances between samples, then for each sample we select the $k$ closest distances. In R this is too often implemented as
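The usual idiom looks something like the following sketch (the names `dists` and `k` are illustrative; here `dists` stands in for one sample's pairwise distances):

```r
# Toy data standing in for one sample's pairwise distances
set.seed(1)
dists <- runif(1000)
k <- 10

# The common but wasteful idiom: sort everything, keep the first k
nearest <- sort(dists)[1:k]
```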

which is correct but does not scale well. It sorts the entire vector `dists` before selecting the first $k$ elements. As the number of elements $n$ grows this becomes inefficient, because the `sort()` call runs in $\mathcal{O}(n \log n)$ time. Partition-based selection algorithms sort neither the entire list nor the selected elements. They run in $\mathcal{O}(n + k \log k)$ time, a saving of a factor of roughly $\log n$.

The statistical programming language R has an inbuilt and under-appreciated partial sorting implementation that can help tremendously. We showcase, benchmark and discuss this functionality here.

## Set up

Load the necessary packages.

Configure plots and seed RNG.

Set parameters.
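The original set-up chunks are not shown; a minimal reconstruction might look like this (the package choices and the parameter values are assumptions):

```r
# Packages used later in the post (install.packages() them first if missing)
library(microbenchmark)  # benchmarking
library(ggplot2)         # plotting

# Seed the RNG for reproducibility
set.seed(37)

# Parameters (illustrative values)
n <- 1e6   # number of elements
k <- 100   # number of smallest elements to select
```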

## An example

Just to demonstrate what R’s partial sorting implementation does, we generate some test samples.
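For instance (the distribution and sample size are assumptions):

```r
set.seed(37)
n <- 1e6
k <- 100

# Test data: a million standard normal draws
samples <- rnorm(n)
```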

R’s standard `sort` function takes a `partial` argument specifying the indexes at which you wish the vector to be partitioned. Here we want to select the smallest $k$ elements, so we need just one such index, $k$ itself.
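A sketch of the call (variable names assumed):

```r
set.seed(37)
n <- 1e6
k <- 100
samples <- rnorm(n)

# Partially sort: afterwards the k'th element is in its final sorted
# position, everything before it is no larger and everything after it
# is no smaller. Neither side is itself sorted.
partially_sorted <- sort(samples, partial = k)
smallest_k <- partially_sorted[1:k]
```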

We plot the partially sorted array to show that every element at or before the $k$’th position is indeed no larger than the $(k+1)$’th.
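A quick base-graphics sketch of that check (the plotting code is an assumption, showing the first $2k$ elements around the partition point):

```r
set.seed(37)
k <- 100
p <- sort(rnorm(1e6), partial = k)

# Everything left of the dashed line is no larger than anything right
# of it, even though neither side is fully sorted
plot(p[1:(2 * k)], pch = 20,
     xlab = "index", ylab = "value",
     main = "Partial sort: partition at k")
abline(v = k + 0.5, lty = 2)
```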

Zoom in to the detail around the $k$’th element.

## Benchmarks

Here we use the `microbenchmark` package to show how much quicker partition-based selection is than full sorting. Note we also test finding the largest $k$ elements (`sort(x, partial = length(x) - k)`).
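A sketch of the benchmark (data and repetition count are assumptions):

```r
library(microbenchmark)

set.seed(37)
x <- rnorm(1e6)
k <- 100

bm <- microbenchmark(
  full    = sort(x)[1:k],                      # full sort, then take k
  partial = sort(x, partial = k)[1:k],         # partition-based selection
  largest = sort(x, partial = length(x) - k),  # k largest end up in the tail
  times = 10
)
bm
```

For the `largest` case, the $k$ largest elements occupy the last $k$ positions of the result (in arbitrary order), so `tail(sort(x, partial = length(x) - k), k)` recovers them.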

## Asymptotics

The running time should be linear in $n$. We define a function to time the partition-based selection.
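One way to write such a timing helper (the function name is an assumption):

```r
# Elapsed seconds taken to select the k smallest elements of x
# by partition-based selection
time_partial_selection <- function(x, k) {
  system.time(sort(x, partial = k))["elapsed"]
}
```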

We choose 50 problem sizes ($n$) ranging from 100,000 to 100,000,000.

Sample data to test with.

Time the partition-based selection.

Plot the elapsed times. We observe a linear relationship between the running time and $n$.
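The whole experiment can be sketched as follows (sizes scaled down here so the sketch runs quickly; the post uses 50 sizes from $10^5$ up to $10^8$):

```r
time_partial_selection <- function(x, k) {
  system.time(sort(x, partial = k))["elapsed"]
}

k <- 100
ns <- round(10^seq(5, 6, length.out = 10))  # post: 10^seq(5, 8, length.out = 50)

set.seed(37)
elapsed <- vapply(ns, function(n) {
  x <- rnorm(n)                    # fresh sample data for each n
  time_partial_selection(x, k)
}, numeric(1))

# Elapsed time should grow roughly linearly with n
plot(ns, elapsed, xlab = "n", ylab = "elapsed (s)",
     main = "Partition-based selection is linear in n")
```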

## Drawbacks

Frequently we are interested not in the values of the $k$ smallest elements but in their indexes. Unfortunately R’s `sort()` will not let us retrieve these indexes, as the `index.return = TRUE` argument is not compatible with the `partial` argument.

One possible solution is to find the $k$’th largest element by partition-based selection and then to run through the data again to locate those elements that are less than or equal to it.
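A sketch of that two-pass approach:

```r
set.seed(37)
x <- rnorm(1e6)
k <- 100

# Pass 1: find the k'th smallest value by partition-based selection
kth <- sort(x, partial = k)[k]

# Pass 2: scan the data for elements no larger than it.
# If several elements tie with the k'th value, this returns
# more than k indexes.
idx <- which(x <= kth)
```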

Note this does not deal with ties: when several elements equal the $k$’th smallest value, more than $k$ indexes are returned. The running time is still $\mathcal{O}(n + k \log k)$, but with a worse constant and extra memory for the second pass.

A more sophisticated approach could build upon this Rcpp example.