Thursday, April 12, 2012

Detecting errors

After thinking about it for a split second, it seems clear to me that an important feature expected from the framing format is a good enough guarantee that the regenerated data is correct.
Surely there are also valid usages of LZ4 which do not need error detection and its associated CPU and bandwidth overhead; hence this feature will remain optional in the final specification, although recommended.

This requirement is really very common. A reasonable method to check for unintentional errors (such as transmission or storage ones) is a simple checksum. Typically, a good 32-bit checksum makes it very improbable (less than 1 in a billion) that random changes go unnoticed. This is, for example, the method used by Zip.
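For illustration, here is a minimal sketch of such a 32-bit checksum, computed with zlib's crc32() (the data buffer is just a stand-in):

    #include <zlib.h>     /* crc32() */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char data[] = "Hello, LZ4!";
        uLong crc = crc32(0L, Z_NULL, 0);   /* initial value */
        crc = crc32(crc, (const Bytef*)data, (uInt)strlen(data));
        printf("CRC32 = %08lX\n", crc);     /* 32-bit fingerprint of the data */
        return 0;
    }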

One could note that the LZ4 decoder already includes some form of detection for malformed input. That's true, but it's not enough, because it only checks that the provided instructions are valid. Its main objective is to forbid any attempt to read or write outside of the memory buffers. Hence, it cannot detect, for example, a corrupted literal within the compressed stream.
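For reference, this is what that built-in detection looks like in use; the sketch below relies on the function name from current LZ4 releases, LZ4_decompress_safe(), which returns a negative value when the instruction stream is malformed:

    #include <lz4.h>
    #include <stdio.h>

    int decode_block(const char* src, int srcSize, char* dst, int dstCapacity)
    {
        int const dSize = LZ4_decompress_safe(src, dst, srcSize, dstCapacity);
        if (dSize < 0) {
            /* Malformed instructions : the decoder refused to read or write
               out of bounds. A corrupted literal, however, passes this test. */
            fprintf(stderr, "Invalid compressed block\n");
            return -1;
        }
        return dSize;   /* number of decoded bytes */
    }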

On the other hand, one could also require the method to guarantee that the input data was never modified, neither randomly nor intentionally (data integrity). Such an objective, however, seems out of scope.
To begin with, it requires an electronic signature of at least 128 bits, such as MD5. But even MD5 is now considered too weak for this use, and should be superseded by the stronger 160-bit SHA-1 and such.
One obvious issue is speed. What's the use of a super-fast compression / decompression algorithm if the checksum function is several times slower?
But the most important issue is that a signature alone is not enough. The signature's mission is to make it almost impossible to find a modification of the original source file which produces the same electronic signature. But since the electronic signature is provided within the framing format, a malicious attacker would just need to change both the content and the signature.
Therefore, an effective protection would also require encryption, or separate storage and transmission of the signature, which is way beyond the objective of this framing format.
So we'll just consider random data corruption from now on.

The next question is: what has to be checksummed?
Due to the popularity of the Zip format, an obvious answer could be "the original data".
But if we look into it, it's not that obvious.

Take the snappy framing format, for example. It checksums the original data of each chunk individually, each chunk being <= 32 KB. The main advantage is that an incorrect chunk is detected immediately; there is no need to wait for the end of the stream for this result. This is especially important when the stream is very large (for example 1 GB) and the consumer process wants to use the decoded data without waiting for the last chunk.
The main drawback, however, is silent truncation. Imagine, for example, that the last few chunks are missing: there would be no way to detect it.
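To make the issue concrete, here is a quick sketch of an EOF-terminated chunk reader (the 8-byte header layout is hypothetical): every chunk that is present verifies correctly, so a stream missing its tail looks complete.

    #include <stdio.h>

    int read_stream(FILE* in)
    {
        unsigned char header[8];   /* hypothetical : [ u32 size | u32 checksum ] */
        while (fread(header, 1, sizeof(header), in) == sizeof(header)) {
            /* ... read the payload, decode it, verify its checksum : all fine ... */
        }
        /* EOF reached. Every chunk seen was valid, but nothing proves this was
           really the last chunk : silent truncation goes undetected. */
        return 0;
    }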

Another possible answer is to checksum the compressed data instead of the original one.
This is valid: a compressed stream can only be decoded into a single result, so protecting the compressed data from errors is equivalent to protecting the original data.
And it brings several advantages.
First, checking the checksum against the compressed stream warns about data corruption even before attempting to decode.
Second, since the compressed data is generally smaller than the original, the checksum function is faster to execute. Both properties have a positive impact on CPU consumption.

Note however that, for the first property to hold, each chunk must be checksummed, not just the entire stream, because decoding starts as soon as the first chunk is read from I/O, both to save time and memory.
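As an illustration, here is a rough sketch of that per-chunk flow; the chunk layout is hypothetical, and zlib's crc32() stands in for whichever checksum function ends up being selected:

    #include <lz4.h>
    #include <zlib.h>
    #include <string.h>

    #define CHUNK_MAX (64*1024)   /* hypothetical fixed chunk capacity */

    /* Hypothetical chunk : [ u32 compressedSize | u32 checksum | compressed bytes ]
       Returns the decoded size, or -1 on checksum mismatch or decoding error. */
    int read_chunk(const unsigned char* chunk, char* dst)
    {
        unsigned cSize, storedCrc;
        memcpy(&cSize, chunk, 4);
        memcpy(&storedCrc, chunk + 4, 4);

        /* 1) verify the compressed payload BEFORE spending any time decoding it */
        if ((unsigned)crc32(crc32(0L, Z_NULL, 0), chunk + 8, cSize) != storedCrc)
            return -1;

        /* 2) only then decode */
        return LZ4_decompress_safe((const char*)(chunk + 8), dst, (int)cSize, CHUNK_MAX);
    }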

OK, so considering these advantages, I'm almost sold on the method of checksumming the compressed chunks. The most important drawback I would still worry about is the silent truncation issue of the snappy frame.

It seems, however, that this can be easily solved. One just needs to make sure that some "continuation" information is provided, and there are several ways to do it. For example, store the size of the original data in the framing format, and derive the number of (fixed-size) chunks from it. Or embed the continuation information in a field associated with each chunk, and make it part of the checksummed data, as in the sketch below. There are surely other possibilities. The main idea is that the "end of stream" information must be explicitly provided, not just deduced on hitting the EOF marker.
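For the second option, here is a rough sketch of what such a chunk header could look like (field names are hypothetical): the "last chunk" flag is covered by the checksum together with the payload, so a reader reaching EOF without having seen the flag knows the stream was truncated, and the flag itself cannot be silently flipped.

    #include <zlib.h>   /* crc32(), as a stand-in checksum */

    /* Hypothetical chunk header, covered by the chunk's checksum */
    typedef struct {
        unsigned compressedSize;
        unsigned char isLastChunk;   /* 1 on the final chunk of the stream */
    } ChunkHeader;

    unsigned chunk_checksum(const ChunkHeader* h, const unsigned char* payload)
    {
        uLong crc = crc32(0L, Z_NULL, 0);
        crc = crc32(crc, (const Bytef*)&h->compressedSize, sizeof(h->compressedSize));
        crc = crc32(crc, &h->isLastChunk, 1);
        crc = crc32(crc, payload, h->compressedSize);
        return (unsigned)crc;
    }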

One last property that could be missing is using the checksum as a kind of "weak signature". For example, in the Zip format, since the checksum is calculated from the original data, it is possible to "check" that the file present in the archive is likely the expected one.
But I'm not sure such usage has any real value beyond convenience. As stated earlier, such a "signature" is much too weak to be trusted, so it can merely be used as a "hint", not much else.

As usual, comments are welcome.
There is another point to discuss, which is the checksum algorithm itself.
We'll keep that for another blog note.

3 comments:

  1. 1. For message signing you need a signing algorithm, not a hash,
    and it's not directly related to encryption. See
    http://en.wikipedia.org/wiki/Digital_Signature_Algorithm
    http://en.wikipedia.org/wiki/Elliptic_Curve_DSA

    2. I still think that there's no need to add weak CRCs to ensure
    block integrity. For a format where there's a risk of literals
    getting modified without affecting the match structure, I'd suggest
    adding some fast encryption instead (like XOR with the previous
    encrypted byte, or RC4) - then accidental changes would mess up
    the whole block and you'd notice it.
    Some options which can be used for integrity checks:
    - strict block size: the encoder can make sure that each block
    is compressed separately (no matches overlapping the block boundary).
    - unused distance (or length) range: some interval of distance
    values (e.g. 64k-256..64k) can be intentionally left unused, making
    it possible to treat its appearance as an error and detect broken blocks.
    - end-of-block code.

    3. As to the CRC choice, it's certainly CRC32, because of
    http://www.strchr.com/crc32_popcnt

    ReplyDelete
  2. Well, that's about 1.5 bytes per cycle for hardware CRC32. That's certainly fast (although apparently limited to the newest Intel processors, and I don't think such an intrinsic is easily accessible - i.e. hello hand-made ASM).

    Now, what if something even faster was possible? ;)

    ReplyDelete
  3. http://msdn.microsoft.com/en-us/library/bb514033.aspx (see the sketch below)

    ReplyDelete
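For reference, the hardware CRC32 discussed in the comments above is reachable from plain C through compiler intrinsics; here is a minimal sketch (note the instruction computes CRC-32C, a polynomial different from zlib's crc32; compile with -msse4.2 on gcc/clang):

    #include <nmmintrin.h>   /* SSE4.2 : _mm_crc32_* intrinsics */
    #include <stddef.h>

    /* CRC-32C over a buffer, one byte at a time; wider variants
       (_mm_crc32_u32 / _mm_crc32_u64) process more data per call. */
    unsigned int crc32c(const unsigned char* buf, size_t len)
    {
        unsigned int crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++)
            crc = _mm_crc32_u8(crc, buf[i]);
        return crc ^ 0xFFFFFFFFu;
    }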