Friday, February 14, 2014

FSE decoding : wrap up

 With all the previous articles written on FSE, it seems we now have all elements to build a complete FSE decoder. Let's wrap up all this learning into a single piece.

To start with, we consider that all symbols are attributed a weight, of minimum 1, normalized over a sum which is a power of 2 (8, 16, 32, etc.).
The size of the decoding table is this sum.

Each symbol will appear in the table exactly as many times as it weights.

We have seen that, to maximize table's capability to capture fractional bits, we have to spread each symbol over the table, at roughly regular distances (the more regular, the better the compression performance).
For example, using the previous example of a symbol A with weight 5/16 :


We have seen that destination sub-ranges must sum up to cover the complete table, with each sub-range having a size in power of 2, defining an exact number of bits to use. We have seen that the smaller ranges must be placed at the beginning.


We now must associate each symbol in the table with a destination range.
To do that, we will number the ranges, beginning with the larger ones.


So the final assembling becomes :


If you build a decoding table using the above construction method, you have a simple decoding algorithm as a result. The hot loop is a mere 4-lines, as follows :

// Persistent variables across the loops : state and bitStream
       BYTE symbol = decodeTable[state].symbol;
       int nbBits = decodeTable[state].nbBits;
       int rest = getBits (bitStream, nbBits);
       state = decodeTable[state].newStateBaseline + rest;

This is all it takes to build your own FSE decoder.
You can always use the code at Github for further guidance.

Next part will be a study of the compression algorithm part of FSE, which is a bit more complex...

Sunday, February 9, 2014

FSE : distributing symbol values


Note : Charles Bloom develops an article around the same issue on his blog : Understanding ANS - 8

 Following the last article on defining optimal sub-ranges for FSE, and attributing them according to symbol ranking, we were left with the question of how to position symbol values into the state table.

We already said that an FSE decoding table following ANS theory is equivalent to an Huffman Low-bit-order decoding table : a symbol with a probability of 50% will be positioned once every 2 cells. With a probability of 25%, it will be positioned once every 4 cells. And so on.



Obviously, the question is : what if the probability of a symbol is not a power of 2 ?
Well, with this context in mind, it seems obvious that we should follow the same logic, and if for example, a symbol has a probability of 33.333%, it should be positioned once every 3 cells.

OK, but even 33% is an easy target, because it's such a convenient fractional number. What if we have, for example, a probability of 7.27%, resulting in an average distance between symbols at every 13.75515.... cells ?
Well, an intuitive idea is that we will try to approach this distribution, typically by spacing symbols sometimes by 13 cells, sometimes by 14 cells. And indeed, that's how it works.

OK, it may look clear and simple for a single symbol. But what about all these other symbols together, trying to get their own regularly spaced slots ? Won't they be fighting for some same slots ?

Let's take an example, to see that situation clearer. Let's imagine that, on a 16-states decoding table, we have the following distribution :

A : 5 ; B : 5 ; C : 3 ; D : 3

In this case, we have an issue, because 16/5=3.2, which obviously is not an integer. So we will have to space A by a distance of 3.2, which we can only approximate, by being 3 most of the time, and maybe 4 sometimes.

The intuitive optimal distribution for A is reached by applying this algorithm :
Position 0 : at P/2 (=1.6, so rounded down to 1 for example)
Position n = position n-1 + 3.2
Which gives : 1 (1.6), 4 (4.8), 8 (8.0), 11 (11.2), 14 (14.4)
(Edit : Charles Bloom calls this method "bias 0.5", and recommends "bias 1.0" instead, which gives better results. The reason why it's better is likely related to the fact that positions in the state table are not equal in term of contribution to probability, with the lower states being worth more than the higher ones.)

Yeah, this would be fine enough, except that, A is not alone.
What about B ? It has the same probability, so it should occupy the same positions ?
Of course not, only one symbol can occupy a state, so let's say that B is kind enough to move its own position by 1, leading to 2, 5, 9, 12, 15.

OK, it looks fine, so now, what about C ?
C distribution is 3, so Pos0(C) = (16/3) / 2 = 5.33/2 = 2.66 => 2.
Humm, that's a problem, 2 is already used by B.
OK, let's say B has higher priority because it should have 1.6 to begin with, so it's lower than 2.66. So we keep 2 for B, and give 3 for C.

And so on. Add D to the mix for even more fun. D basically takes whatever cell remains for its own use, wherever they are.



As you can see, it seems to be a mess. And it is : with more and more symbols filling the table, space is necessarily becoming scarce for the later ones.
However, a perfect distribution algorithm exists along the same lines : for each position, compare all fractional state values of all symbols, pick the lowest one, increase its value by step=1/probability, and start again the same process for next state. You still have a lot of ties to solve, but you can for example decide to select symbol values in order in such cases.
It works.
It's just damn slow, due to the huge number of comparisons required.



It can be sped up by using bucket sorting strategy. An interesting method is to use more buckets than states, for easier tie management. It works fine too, and drastically reduces the number of comparisons involved, at the cost of a bit of accuracy. But that's okay, since precise initialization is not required, it only compresses "closer to Shannon limit".



Let's by the way spend some more time on this strategy : precise initialization is not required.
Indeed, the only point is to reduce the amount of "waste" in fractional bit estimation. Where does that waste come from ?
Well, as said in previous article, a fractional bit is translated into a "decrease" of the state value.


For this to work correctly, i.e. to minimize the fractional bit cost, we need to have our target symbol approximately at the intended destination state value. If not, then we will "waste" some more fractional bit, in order to find the intended symbol.
With this idea in mind, the worst thing that can happen is to "cluster" the values together at one end of the range : it maximizes the probability that state value will have to be decreased more to find the required symbol.



On the other hand, if a symbol with a probability 50% is distributed evenly, once every 2 cells, finding it is highly simplified, and even guaranteed at a distance of at most 1.
The above explanation is a bit simplified, but works well at explaining how fractional bit works.

So, it's not very important to be precise, what matters is to be well "spread" across the spectrum of state values.
With this idea in mind, we can concentrate on the next objective : speed.
Rather than trying to find the best distribution, we'll try to find one which is "good enough" at spreading, and most importantly fast to implement.
With some experiment, I ended up selecting this simple formula :
nextState = currentState + (5/8) range + 3
which is way faster than all previous proposals, thanks to the absence of comparisons, branches, and the guarantee to reach all state values in a single scan. Using the above example, this heuristic solution gives the following distribution result :



in a word, good enough.
Note that results are better than this example with larger tables. We are limited here, for visibility, at 16 states, but the heuristic has been tuned for larger sizes, 512 states and beyond.

In Charles Bloom's article, you'll find a different recommended distribution method, closer to optimal, and backed with solid experimental evidences. There's no contradiction : this blog is called "fast compression" for a reason, and that's why in this recommendation, speed takes precedence.