A common problem in computer science is selecting the $k$ largest (or smallest) elements from an unsorted list containing $n$ elements. The most commonly implemented solution is far from optimal. This post describes a better way.
The problem is a form of partition-based selection. For example, when computing k-nearest-neighbour distances, we first calculate all the pairwise distances between samples, then for each sample we select the $k$ closest distances. In R this is implemented too often as
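something like the following (a sketch, assuming a numeric vector dists of pairwise distances and a cut-off k):

```r
# full sort of all n distances, then keep the smallest k
nearest_k <- sort(dists)[1:k]
```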
This is correct but does not scale well: it sorts the entire vector dists
before selecting the first $k$ elements, so the sort()
call runs in $\mathcal{O}(n \log n)$ time, which becomes expensive as the
number of elements $n$ grows.
Partition-based selection algorithms sort neither the entire list nor the
selected elements. They run in $\mathcal{O}(n + k \log k)$ time, a saving of
a factor of $\log n$ when $k$ is small relative to $n$.
The statistical programming language R has an inbuilt and under-appreciated partial sorting implementation that can help tremendously. We showcase, benchmark and discuss this functionality here.
Set up
Load the necessary packages.
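A sketch of the setup chunk; microbenchmark is used in the benchmarks below, while ggplot2 for the plots is an assumption:

```r
library(microbenchmark)  # benchmarking full sort vs. partial sort
library(ggplot2)         # plotting (assumed)
```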
Configure plots and seed RNG.
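For example (a sketch; the original theme and seed are not shown):

```r
theme_set(theme_bw())  # assumed plotting theme
set.seed(37)           # arbitrary seed for reproducibility
```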
Set parameters.
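Something along these lines, with placeholder values:

```r
n <- 1e6   # number of elements (placeholder)
k <- 100   # number of smallest elements to select (placeholder)
```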
An example
Just to demonstrate what R’s partial sorting implementation does, we generate some test samples.
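A sketch using standard normal draws (the original distribution is not shown):

```r
x <- rnorm(n)  # n unsorted test values
```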
R’s standard sort
function takes a partial
argument specifying the indexes
at which you wish the vector to be partitioned. Here we want to select the
smallest $k$ elements so we have just one such index, $k$ itself.
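A sketch of the call:

```r
# partially sort x so that the k smallest values occupy positions 1..k,
# with the k'th smallest in its correct sorted position
y <- sort(x, partial = k)
```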
We plot the selected array to show that every element at or beneath the $k$’th is indeed no larger than the $(k+1)$’th.
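For instance, with base graphics (a sketch; the original figure is not shown):

```r
plot(y, pch = ".", xlab = "index", ylab = "value")
abline(v = k, col = "red")  # mark the k'th position
```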
Zoom in to the detail around the $k$’th element.
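A sketch of the zoomed view:

```r
window <- max(1, k - 50):min(length(y), k + 50)
plot(window, y[window], xlab = "index", ylab = "value")
abline(v = k, col = "red")
```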
Benchmarks
Here we use the microbenchmark
package to show how much quicker
partition-based selection is than full sorting. Note we also test finding the
largest $k$ elements (sort(x, partial = length(x) - k)).
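A sketch of such a benchmark (timings and output omitted):

```r
microbenchmark(
  full     = sort(x)[1:k],               # full sort, smallest k
  smallest = sort(x, partial = k)[1:k],  # partial sort, smallest k
  largest  = sort(x, partial = length(x) - k)[(length(x) - k + 1):length(x)],
  times = 10
)
```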
Asymptotics
The running time should be linear in $n$. We define a function to time the partition-based selection.
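A minimal sketch of such a function; passing the full vector and subsetting to the first n elements is an assumption:

```r
# elapsed time (seconds) of a single partial sort of the first n elements
time_partial <- function(x, n, k) {
  system.time(sort(x[seq_len(n)], partial = k))["elapsed"]
}
```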
We choose 50 problem sizes ($n$) ranging from 100,000 to 100,000,000.
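For example, 50 sizes spaced evenly on a log scale (a sketch):

```r
ns <- round(10 ^ seq(log10(1e5), log10(1e8), length.out = 50))
```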
Sample data to test with.
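A sketch, drawing once at the largest problem size (the original distribution is not shown):

```r
x <- runif(max(ns))  # reused for every problem size; roughly 800 MB at n = 1e8
```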
Time the partition-based selection.
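A sketch of the timing loop, using the time_partial helper assumed above:

```r
elapsed <- sapply(ns, function(n) time_partial(x, n, k))
```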
Plot the elapsed times. We observe a linear relationship between the running time and $n$.
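For instance (a sketch using ggplot2, assumed loaded above):

```r
ggplot(data.frame(n = ns, seconds = elapsed), aes(n, seconds)) +
  geom_point() +
  labs(x = "number of elements n", y = "elapsed time (seconds)")
```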
Drawbacks
Frequently we are interested not in the values of the $k$ smallest elements but
their indexes. Unfortunately R’s sort()
will not let us retrieve these
indexes as the index.return = TRUE
parameter is not compatible with the
partial
argument.
One possible solution is to find the $k$’th smallest element by partition-based selection and then run through the data again to locate the elements that are less than or equal to it.
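A sketch of this two-pass approach:

```r
# pass 1: partial sort to find the k'th smallest value
kth_value <- sort(x, partial = k)[k]
# pass 2: scan the original data for elements no larger than it
idx <- which(x <= kth_value)
```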
Note this does not handle ties: when more than one element equals the $k$’th smallest, more than $k$ indexes are returned. The approach still runs in $\mathcal{O}(n + k \log k)$ time, but with a larger constant factor and additional memory requirements.
A more sophisticated approach could build upon this Rcpp example.