The Book of Gehn

Greenwald-Khanna e-approximated q-quantile - a review

December 31, 2024

Sorted observations stored in an array. Assuming 1-based indexing, the observation at rank \(r\) is at index \(r\).

Given \(n\) observations, finding which value is at rank \(r\) is trivially easy: if we store and sort the observations, the observation at rank \(r\) will be at index \(r\).

But when \(n\) gets really large, it is infeasible to store or sort all the observations.

Greenwald and Khanna developed a data structure that solves this with a trade-off: we can answer which value is at rank \(r\) only within a certain error.

Throughout this post I will talk about ranks. To deal with quantiles it is just a matter of computing the equivalent rank \(r \leftarrow q n\).

It is called a \(q\)-quantile \(\epsilon\)-approximation summary.

I coded it, it didn’t work, and after a week on this I realized that the original work of Greenwald and Khanna may have a few imprecisions.

This post describes how I rethought the data structure from scratch, where I found the mentioned imprecisions, and how I got a working implementation.

TL;DR -> Python implementation in cryptonita
Paper -> Greenwald and Khanna (GK01)

Square one: what is wrong with using a sorted array?

The cost is not bad: sorting is \(O(n \textrm{log}(n))\) and answering is \(O(1)\). If we query multiple times, the cost of the sort will be amortized.

But this assumes that \(n\) is fixed and we have all the observations at the moment of sorting.

What would happen if we want to add new observations on the fly?

You see, to keep the array sorted we find the insertion point of the new observation with a binary search, \(O(\textrm{log}(n))\), and once we know where to insert, we need to move all the subsequent elements of the array to make room. And this is \(O(n)\).

And that displacement is bad.
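To make the costs concrete, here is a minimal sketch using Python's `bisect` (the observation values are illustrative): the binary search is cheap, but `insort` still shifts every element after the insertion point.

```python
import bisect

observations = [4, 9, 12, 20, 21]   # illustrative values, already sorted

# Querying is O(1): with 1-based ranks, rank r lives at index r - 1.
r = 3
assert observations[r - 1] == 12

# Inserting on the fly: the binary search is O(log(n)) but insort
# must shift every element after the insertion point, which is O(n).
bisect.insort(observations, 16)
assert observations == [4, 9, 12, 16, 20, 21]
```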

Implicit (computed) ranks

An array is a dead end; long live the linked list!

The downside is that we lose the \(O(1)\) for querying a rank but we avoid having to move the elements to make room for a new one.

Or kind of.

If we explicitly track the rank of each element we still have an \(O(n)\) insert because we would have to update all the subsequent elements’ ranks.

Example of inserting the value \(12\) at rank \(3\). The ranks of all the observations to its right must be updated, increasing them by \(1\).

Instead, we store tuples of the form \(t_i = (v_i, g_i)\), where \(v_i\) is the observed value and \(g_i\) is the rank offset with respect to the previous tuple: \(g_i = \textrm{rank}_i - \textrm{rank}_{i-1}\).

For example, \(g_1 + g_2 + g_3 = 1 + 1 + 1 = 3\) gives the rank of the third observation \(20\).

Every time we want to know the rank of any tuple \(t_i\) we compute \(\sum_{j \le i} g_j\).

Now let’s review what happens when we insert the observation \(12\). The observations before it should not change, but we know that \(20\) cannot be at rank \(3\) anymore: it must be at least one rank higher, making room for the incoming \(12\).

In other words, adding a new observation shifts by 1 the rank of all the observations on the right. But there is no need to update anything: setting \(g = 1\) for \(12\) is enough, no need to do any \(O(n)\) stuff.
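Here is a minimal sketch of the idea (the values are illustrative and a plain Python list stands in for the linked list, just for brevity):

```python
import bisect

# Each tuple is (v, g): v is the observed value, g is the rank offset
# with respect to the previous tuple.
summary = [(4, 1), (9, 1), (20, 1), (21, 1)]

def rank(summary, i):
    # Rank of the i-th tuple: the sum of the g's up to and including i.
    return sum(g for _, g in summary[:i + 1])

assert rank(summary, 2) == 3    # the value 20 is at rank 3

# Insert 12: find its position, set g = 1 and we are done. The ranks of
# everything on the right shift implicitly; there is nothing to update.
pos = bisect.bisect_left([v for v, _ in summary], 12)
summary.insert(pos, (12, 1))

assert rank(summary, 3) == 4    # 20 is now at rank 4
assert rank(summary, 4) == 5    # and 21 is at rank 5
```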

We are going to improve this even further at the end of the post.

The insertions are now just \(O(\textrm{log}(n))\).

Don’t keep all, remove some tuples

To handle very large \(n\) we cannot keep all the observations in memory: we need to remove some while still preserving the rank of the others within some error.

For example, the tuple \(t_{i+1}\) with value \(12\) is at rank \(3\). If we remove the previous tuple \(t_i\), the value \(12\) must still be at rank \(3\).

This is easy to achieve: we just need to increase \(g_{i+1}\) by \(g_i\).
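A sketch of the removal, reusing the \((v, g)\) tuples from above (values are illustrative):

```python
summary = [(4, 1), (9, 1), (12, 1), (20, 1), (21, 1)]

def remove(summary, i):
    # Fold g_i into g_{i+1} so every surviving tuple keeps its rank.
    _, g = summary.pop(i)
    v_next, g_next = summary[i]
    summary[i] = (v_next, g_next + g)

remove(summary, 1)    # remove (9, 1):  the g of 12 becomes 2
remove(summary, 1)    # remove (12, 2): the g of 20 becomes 3
assert summary == [(4, 1), (20, 3), (21, 1)]

# 21 is still at rank 1 + 3 + 1 = 5
assert sum(g for _, g in summary) == 5
```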

Here are a few more examples:

You can corroborate that the observation \(21\) remains at rank \(5\) in all the cases.

Removing tuples trades accuracy for space.

If we ask for rank \(2\), we have lost the precise answer but we can return an approximate one: if we answer with rank \(1\) (value \(4\)), we are off by just \(1\) rank.

We are asked for the observation at rank \(2\) but answer with the observation at rank \(1\), effectively making an error of \(1\) rank.

In general we answer with the closest tuple, so the maximum error occurs at the middle point of the gap and it is half the count of tuples removed between the two surviving tuples.

In this example, \(\lceil \frac{4 - 1}{2} \rceil = 2\)

Luckily, we are already tracking how many tuples we removed: it is just the \(g_i\) value of the right-most tuple minus \(1\).

Why the \(- 1\)? Because each \(g_i\) starts at \(1\) and increases by the number of tuples removed immediately before it.

So we can put an upper bound on the error when answering for a rank \(r\) that falls in the gap: \(\lceil \frac{g_i - 1}{2} \rceil\).
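As a sanity check, here is the bound as a tiny hypothetical helper, applied to the example above:

```python
import math

def max_rank_error(g):
    # g - 1 tuples were removed in the gap; answering with the closest
    # surviving tuple misses by at most half of the gap.
    return math.ceil((g - 1) / 2)

# A right-most tuple with g = 4 hides 3 removed tuples, so answering
# for a rank that falls in that gap is off by at most 2 ranks.
assert max_rank_error(4) == 2
```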

Minimum and maximum ranks

Let’s go back to how we handle an insert. What would happen if we want to insert the value \(16\)?

We don’t know what rank the value \(16\) should have but we know that it is between ranks \(2\) and \(6\).

In fact, had we not deleted any tuple, the precise rank of \(16\) would have been \(4\), which is between \(2\) and \(5\).

The value \(16\) is between \(4\) and \(21\), so the rank of \(4\) should not change, but the value \(21\) can no longer be at rank \(5\): it must be at rank \(6\) to make room for the new observation.

Therefore the value \(16\) is at a rank between \(2\) and \(5\) (inclusive).

Those ranks are the respective minimum and maximum ranks for \(16\).

$$\begin{align*} \textrm{rmin}_i &= \textrm{rank}_{i-1} + 1 \\ \textrm{rmax}_i & = \textrm{rank}_{i+1} \end{align*}$$

But our tuples don’t store explicit ranks, only relative offsets.

From \(\textrm{rmin}_{i}\) we solve \(g_{i}\):

I’m using \(\textrm{rmax}_{i+1}\) here to mean the rank before updating the \(g_{i+1}\) of the tuple \(t_{i+1}\). In the example, \(\textrm{rmax}_{i+1}\) would be \(5\), not \(6\).

In practice this detail doesn’t matter because the algorithm never works with ranks directly.

$$\begin{align*} \textrm{rank}_{i} &= \textrm{rank}_{i-1} + 1 \\ \textrm{rank}_{i} - \textrm{rank}_{i-1} &= 1 \\ g_{i} &= 1 \end{align*}$$

Let’s define a new offset, \(\Delta_{i} = \textrm{rmax}_{i} - \textrm{rmin}_{i}\) which can be rewritten as:

$$\begin{align*} \textrm{rmax}_{i} & = \textrm{rank}_{i+1} \\ &= \textrm{rank}_{i-1} + g_{i+1} \\ \textrm{rmax}_{i} - \textrm{rmin}_{i} &= \textrm{rank}_{i-1} + g_{i+1} - \textrm{rmin}_{i} \\ \textrm{rmax}_{i} - \textrm{rmin}_{i} &= \textrm{rank}_{i-1} + g_{i+1} - \textrm{rank}_{i-1} - 1 \\ &= g_{i+1} - 1 \\ \Delta_{i} &= g_{i+1} - 1 \end{align*}$$

Our tuples now have 3 values: \(t_i = (v_i, g_i, \Delta_i)\).
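As a sketch, the tuple can be modeled as a small Python dataclass (the field names are my own):

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    v: float    # the observed value
    g: int      # rank offset: rmin_i = rmin_{i-1} + g_i
    delta: int  # rank uncertainty: rmax_i = rmin_i + delta_i
```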

Insert, revised

But the above is incomplete. Let’s look at an insert between two observations that don’t have a single rank but a range of possible ranks.

The value \(4\) may be at rank \(1\) or \(2\). We know for sure that the value \(16\) is after \(4\) so it cannot have a rank of \(1\) but it could have a rank of \(2\) or anything above.

The \(\textrm{rmin}\) of \(4\) defines the \(\textrm{rmin}\) of \(16\).

For the value \(21\) we do the same analysis: it may be at rank \(5\) or \(6\), so the value \(16\) cannot be at rank \(6\) but it could be at rank \(5\) or anything below.

In summary:

$$\begin{align*} \textrm{rmin}_i &= \textrm{rmin}_{i-1} + 1 \\ \textrm{rmax}_i & = \textrm{rmax}_{i+1} - 1 \end{align*}$$

From the first equation we get \(g_i = 1\) for the inserted tuple \(t_i\) as we got before.

\(\Delta_i\) is slightly different:

As before, I’m using \(\textrm{rmax}_{i+1}\) here to mean the rank before updating the \(g_{i+1}\) of the tuple \(t_{i+1}\).

$$\begin{align*} \Delta_i & = \textrm{rmax}_i - \textrm{rmin}_i \\ & = \textrm{rmax}_{i+1} - \textrm{rmin}_{i-1} - 1 \\ & = \textrm{rmin}_{i+1} + \Delta_{i+1} - \textrm{rmin}_{i-1} - 1 \\ & = \textrm{rmin}_{i-1} + g_{i+1} + \Delta_{i+1} - \textrm{rmin}_{i-1} - 1 \\ \Delta_i & = g_{i+1} + \Delta_{i+1} - 1 \end{align*}$$
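Putting both derivations together, a direct insert would look like this hypothetical sketch (it uses the `Tuple` dataclass from above; as we will see next, the real implementation defers the inserts):

```python
import bisect

def insert_one(summary, v):
    # summary is a list of Tuple sorted by v
    pos = bisect.bisect_left([t.v for t in summary], v)
    if pos == 0 or pos == len(summary):
        delta = 0    # a new minimum or maximum has an exactly known rank
    else:
        nxt = summary[pos]
        delta = nxt.g + nxt.delta - 1    # as derived above
    summary.insert(pos, Tuple(v, 1, delta))    # g is always 1
```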

Removed tuples count, revised

We saw that \(g_i - 1\) counts the tuples removed immediately before the tuple \(t_i\), but this is not enough.

Visually we want to count the removed tuples from \(\textrm{rmin}_{i-1}\) to \(\textrm{rmax}_{i}\).

$$\begin{align*} \textrm{rmax}_{i} - \textrm{rmin}_{i-1} - 1 &= \Delta_i + \textrm{rmin}_{i} - \textrm{rmin}_{i-1} - 1 \\ &= \Delta_i + g_i - 1 \end{align*}$$

With this, the upper bound error for answering for a rank \(r\) that falls in the gap becomes \(\lceil \frac{\Delta_i + g_i - 1}{2} \rceil\)

Answer, revised

To answer for a rank \(r\), we search for a tuple whose possible ranks are fully contained in the range \((r - \lfloor \epsilon n \rfloor , r + \lfloor \epsilon n \rfloor)\).

This is because we cannot answer with a tuple \(t_i\) if any of its possible ranks falls outside that range: answering with such a tuple would mean potentially answering with a value whose true rank is beyond the tolerance \(\lfloor \epsilon n \rfloor\), hence having a relative error larger than \(\epsilon\).

Answering requires an \(O(s)\) scan, where \(s\) is the number of tuples in the summary, much smaller than the total number of observations \(n\).

Invariant of the summary: we can always guarantee to answer any rank

It’s easy to see why: the gap between 2 consecutive tuples is \(\Delta_i + g_i - 1\) (twice the maximum error), so if we keep every gap below \(2 \lfloor \epsilon n \rfloor\) we can always answer any rank with an error of at most \(\epsilon n\) (just divide both expressions by 2).

This is not a proof, but here is the intuition:

In short, the summary always maintains the invariant:

$$\Delta_i + g_i - 1 \le 2 \lfloor \epsilon n \rfloor$$

This imposes a restriction on when we can remove a tuple: a tuple can be removed only if doing so does not create a gap that violates the invariant.

In other words, the tuple \(t_i\) can be removed iff:

$$g_i + \Delta_i + g_{i+1} + \Delta_{i+1} - 1 \le 2 \lfloor \epsilon n \rfloor$$
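As a sketch, the check could be a hypothetical helper like this (it also refuses to remove the extremes, a detail discussed near the end of the post):

```python
import math

def can_remove(summary, i, eps, n):
    # Never remove the extremes (see "Minimum is not preserved" below).
    if i == 0 or i == len(summary) - 1:
        return False
    t, nxt = summary[i], summary[i + 1]
    return t.g + t.delta + nxt.g + nxt.delta - 1 <= 2 * math.floor(eps * n)
```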

Amortization via deferring inserts

So far we have \(O(\textrm{log} s)\) inserts. However, it is possible to defer the inserts, holding them in a temporary buffer and inserting them all together later.

Once the buffer is full, we sort it in \(O(b \textrm{log} b)\) and do a merge between it and the summary in \(O(s+b)\).

We traded \(b\) inserts of \(O(\textrm{log} s)\) each for a single \(O(b \textrm{log} b) + O(s+b)\) operation.

Moreover, during the merge we can check if the tuple can be removed without violating the invariant.

Wrapping it all together

The summary data structure has 3 operations: insert, merge and query.

On each insert, store the observation in a buffer. Once the buffer reaches the size of the current summary or some predefined limit, merge it and create a new summary.

The merge operation is as follows: sort the buffer, then walk the buffer and the summary in value order; each incoming observation becomes a tuple with \(g = 1\) and \(\Delta = g_{i+1} + \Delta_{i+1} - 1\) as derived above, and along the way we fold any tuple whose removal does not violate the invariant.
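Here is a minimal sketch of such a merge under the formulas derived in this post (the actual implementation in cryptonita may differ in the details):

```python
import math

def merge(summary, buffer, eps, n):
    # n is the total count of observations, including the buffer.
    out, i = [], 0
    for v in sorted(buffer):
        while i < len(summary) and summary[i].v <= v:
            out.append(summary[i]); i += 1
        if out and i < len(summary):
            nxt = summary[i]
            out.append(Tuple(v, 1, nxt.g + nxt.delta - 1))
        else:
            out.append(Tuple(v, 1, 0))    # a new minimum or maximum
    out.extend(summary[i:])

    # Fold a tuple into its right neighbour whenever the resulting gap
    # keeps the invariant; the extremes are never folded away.
    merged, bound = out[:1], 2 * math.floor(eps * n)
    for t in out[1:]:
        prev = merged[-1]
        if len(merged) > 1 and prev.g + prev.delta + t.g + t.delta - 1 <= bound:
            merged.pop()
            t = Tuple(t.v, t.g + prev.g, t.delta)
        merged.append(t)
    return merged
```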

For query, we can offer answers for quantiles or ranks.

To answer for the rank \(r\) we scan the summary, summing the \(g_j\) values along the way to compute:

$$\begin{align*} \textrm{rmin}_i &\leftarrow \sum_{j \le i} g_j \\ \textrm{rmax}_i &\leftarrow \textrm{rmin}_i + \Delta_i \end{align*}$$

We stop once we find a tuple \(t_i\) that satisfies both inequalities:

$$\begin{align*} \textrm{rmin}_i &\ge r - \lfloor \epsilon n \rfloor \\ \textrm{rmax}_i &\le r + \lfloor \epsilon n \rfloor \end{align*}$$

To answer for the quantile \(q\), we compute the equivalent rank \(r \leftarrow q n\) and proceed as above.
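A sketch of the query, following the two inequalities above (hypothetical function names):

```python
import math

def query_rank(summary, r, eps, n):
    tol = math.floor(eps * n)
    rmin = 0
    for t in summary:
        rmin += t.g                # rmin_i = sum of g_j for j <= i
        rmax = rmin + t.delta      # rmax_i = rmin_i + delta_i
        if rmin >= r - tol and rmax <= r + tol:
            return t.v
    raise AssertionError("broken invariant: no tuple within the tolerance")

def query_quantile(summary, q, eps, n):
    return query_rank(summary, q * n, eps, n)
```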

So, what are the discrepancies with Greenwald-Khanna’s work?

Typo in the invariant

Consider Corollary 1 (GK01):

$$\textrm{max}(g_i + \Delta_i) \le 2 \epsilon n \quad \forall i$$

The incorrect part is \(g_i + \Delta_i\).

For any new tuple, its \(g_i\) is \(1\), and it is easy to find a valid small \(n\) such that \(2 \epsilon n\) is smaller than \(1\); because \(\Delta_i\) is never negative, the corollary does not hold.

I think that the expression should have been:

$$\textrm{max}(g_i + \Delta_i - 1) \le 2 \epsilon n \quad \forall i$$

The left side is never negative (it can be zero), so it does not conflict with the right side. This is just a slightly weaker version of the invariant mentioned in the blog post.

And of course, it may not be an error at all and I’m just not interpreting the paper correctly.

Compress and bands

The authors also implemented a separate routine called \(\textrm{COMPRESS}\) that removes unneeded tuples, which requires categorizing all the \(\Delta_i\) into bands.

While they use the concept of bands to prove the properties of the algorithm, IMO, bands are not needed to run it.

Initial \(\Delta_i\)

In the first sections of the paper, the authors define for a new tuple \(t_i\), its \(\Delta_i\) as \(\Delta_i = \lfloor 2 \epsilon n \rfloor\).

This is a loose value but the authors used it to prove properties of the algorithm.

In the later sections of the paper, they use \(\Delta_i = g_{i+1} + \Delta_{i+1} - 1\), as in this blog post.

Minimum is not preserved

I didn’t explain this, but in the Python implementation I added an extra condition for deciding whether to delete a tuple: if the tuple is at one of the extremes (its value \(v\) is either the maximum or the minimum), do not delete it.

This was the intention of the GK01 authors too, but their implementation (or better said, its pseudo code) does not prevent the minimum from being removed.

Conclusions

It was hard. You have no idea.

Once I realized that the paper had some inconsistencies, it took me a week to fully understand and write down, from scratch, the algorithm and summary data structure.

But that was the easy part. It took me like a month to write this blog post and I truly have deep respect for Greenwald and Khanna.

Writing this blog post taught me how hard it is to explain something, and even harder to write it without mistakes. I had to redo more than one diagram, fixing indices or adding a missing \(- 1\), and, of course, I’m quite sure there is still room for improvement.

This is also a reminder of why publishing the full implementation, not just pseudo code, is so important. It is written in a formal language and it can be executed and put to the test easily.

For example, checking the corrected invariant \(\textrm{max}(g_i + \Delta_i - 1) \le 2 \epsilon n \quad \forall i\) would be trivial: just add an assert in the code.
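Something along these lines (a hypothetical helper):

```python
def check_invariant(summary, eps, n):
    # n is the count of observations summarized so far
    for t in summary:
        assert t.g + t.delta - 1 <= 2 * eps * n
```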

The code is never immune to errors, but it is a bit closer to the truth.

Future work

Greenwald-Khanna e-approximated q-quantile - a review - December 31, 2024 - Martin Di Paola