The Free Lossless Image Format (FLIF) is a lossless image compression format. It supports grayscale, RGB and RGBA, with a color depth from 1 to 16 bits per channel. Both still images and animations are supported. FLIF can handle sparse-color images (e.g. 256 color palettes) effectively. It has an interlaced and a non-interlaced mode; unlike PNG, interlacing usually does not come with worse compression. In terms of compression, FLIF usually outperforms other lossless compression formats like PNG, lossless WebP, BPG, JPEG 2000, JPEG XR, JPEG-LS, APNG, MNG, or GIF.
This document specifies the FLIF16 bitstream, as implemented by version 0.2 of the reference encoder (https://github.com/FLIF-hub/FLIF), which was released in September 2016.
Overview of the format
A FLIF file consists of 4 parts, with increasingly complicated encoding methods:
- The main header, containing basic image metadata: width, height, color depth, number of frames. This information is encoded in a straightforward way, to make it easy to quickly identify a file.
- Optional metadata chunks. These chunks contain non-pixel data: Exif/XMP metadata, an ICC color profile, … This information is encoded using DEFLATE compression.
- The second header, containing further information about how the actual pixel data is encoded. This header also describes a chain of transformations that were applied before encoding, which have to be reversed after decoding (e.g. the YCoCg color transform). This information is encoded using CABAC entropy coding.
- The actual pixel data. This information is encoded using MANIAC entropy coding.
Notational conventions
We use '/' to denote integer division (rounding towards zero) and '>>' to denote arithmetic right shift. This corresponds to the usual operator semantics in e.g. the C programming language.
>>1 denotes arithmetic right shift by one, which corresponds to dividing by two rounding down. For positive integers this is the same as /2, but for negative integers it is not, e.g. -1 / 2 = 0 while -1 >> 1 = -1.
Part 1: main header
Type | Description | Value |
---|---|---|
4 bytes | Magic | "FLIF" |
4 bits | Interlacing, animation | 3 = ni still; 4 = i still; 5 = ni anim; 6 = i anim |
4 bits | Number of channels (nb_channels) | 1 = Grayscale; 3 = RGB; 4 = RGBA |
1 byte | Bytes per channel (Bpc) | '1' = 8 bits per channel, '2' = 16 bits per channel, '0' = custom (see Part 3) |
varint | Width | width-1 |
varint | Height | height-1 |
varint | Number of frames (nb_frames) | nb_frames-2 (only if animation) |
The fifth byte is an ASCII character that can be interpreted as follows: for non-interlaced still images, '1' = Grayscale, '3' = RGB, '4' = RGBA; for interlaced still images add +0x10 (so 'A' = Grayscale, 'C' = RGB, 'D' = RGBA); for animations add +0x20 (so 'Q', 'S', 'T' for non-interlaced and 'a', 'c', 'd' for interlaced animations).
Variable-size integer encoding (varint):
- An unsigned integer (big-endian, MSB first) stored on a variable number of bytes.
- All the bytes except the last one have a '1' as their most significant bit.
- The unsigned integer is represented as the concatenation of the remaining 7-bit codewords.
A decoding sketch is given below.
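The following non-normative sketch shows one way to decode a varint; read_byte() is an assumed helper that returns the next byte of the stream.
// Sketch: decode a varint (MSB-first groups of 7 bits).
uint64_t read_varint(void) {
    uint64_t value = 0;
    uint8_t byte;
    do {
        byte = read_byte();
        value = (value << 7) | (byte & 0x7F); // append the next 7 bits
    } while (byte & 0x80);                    // top bit set: more bytes follow
    return value;
}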
nb_frames is 1 for a still image, > 1 for an animation.
Part 2: metadata
Chunks are defined similarly to PNG chunks, except that the chunk size is encoded with a variable number of bytes and there is no chunk CRC. Also, the first chunk (which is always "FLIF") has no explicit chunk size and is not DEFLATE-compressed.
Chunk names are either 4 letters (4 bytes), indicating an optional chunk, or 1 byte with a value below 32, indicating a non-optional chunk.
The convention of using upper and lower case letters is kept, but the meaning of the bits is slightly different.
- First letter: uppercase = critical, lowercase = non-critical → non-critical chunks can be stripped while keeping a valid file (it might be rendered differently though)
- Second letter: uppercase = public, lowercase = private
- Third letter: uppercase = needed to correctly display the image (e.g. a color profile), lowercase = can be stripped safely without changing anything visually
- Fourth letter: uppercase = safe to copy blindly (when editing the actual image data), lowercase = may depend on image data (e.g. contain a thumbnail)
This optional part is a concatenation of chunks that each look like this:
Type | Description |
---|---|
4 bytes | Chunk name |
varint | Chunk length (size) |
size bytes | DEFLATE-compressed chunk content |
The optional chunks are followed by one non-optional chunk with a single-byte chunk name. Non-optional chunks do not have an explicit chunk size and their content is not DEFLATE-compressed. Currently the only non-optional chunk has name 0x00 (a NUL byte), which indicates a FLIF16 bitstream. Future (non-backwards compatible) revisions of the FLIF format will use a different name for this non-optional chunk.
If a decoder encounters an unknown optional chunk, it should proceed decoding (optionally it can produce a warning), as long as the chunk is non-critical. If a decoder encounters an unknown non-optional chunk, it should fail and produce an error message.
The following optional chunks are defined:
Chunk name | Description | Content (after DEFLATE-decompression) |
---|---|---|
iCCP | ICC color profile | raw ICC color profile data |
eXif | Exif metadata | |
eXmp | XMP metadata | XMP contained within a read-only xpacket with no padding |
Symbol encoding
From here on, all values are encoded using arithmetic coding.
uni_int(min,max) denotes an integer in the interval min..max (inclusive), encoded using Uniform symbol coding (see below) and a 24-bit RAC. This encoding is not context-adaptive, and every number in the interval gets a fixed chance (close to a uniform distribution, hence the name). uni_int(b) denotes an integer of b bits, i.e. it’s a synonym for uni_int(0,2^b-1).
nz_int_CONTEXT(min,max) denotes an integer in the interval min..max (inclusive), encoded using Near-zero symbol coding (see below) and a 24-bit RAC, where the chance table is given by CONTEXT. This encoding is context-adaptive, i.e. the chance table gets updated when encoding or decoding a number. Numbers are encoded using a zero-sign-exponent-mantissa representation, which favors distributions in which near-zero values have a higher chance (hence the name).
gnz_int_CONTEXT(min,max) denotes a simple generalization of Near-zero symbol coding that can be used in case the interval does not necessarily include zero. It is defined as follows:
gnz_int_CONTEXT(min,max) =
- nz_int_CONTEXT(0, max - min) + min, if min > 0
- nz_int_CONTEXT(min - max, 0) + max, if max < 0
- nz_int_CONTEXT(min, max), otherwise
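Expressed as a non-normative sketch, where read_nz_int(ctx, min, max) stands for decoding nz_int with the given chance-table context:
// Sketch: generalized near-zero decoding in terms of the near-zero decoder.
int read_gnz_int(Context *ctx, int min, int max) {
    if (min > 0) return read_nz_int(ctx, 0, max - min) + min;
    if (max < 0) return read_nz_int(ctx, min - max, 0) + max;
    return read_nz_int(ctx, min, max);
}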
Range encoder
TODO: describe the 24-bit RAC and the 12-bit chances.
Chance table update
TODO: describe how cutoff and alpha influence this.
Uniform Symbol Coding
An integer encoded with Uniform symbol coding uses a series of 50% chance bits (chance is 2048/4096, non-adaptive) to recursively narrow down the input interval until it is a singleton. The following code describes how to decode such an integer:
// assumption: min <= max
int read_uniform_int(int min, int max) {
  if (min == max) return min;
  // split the interval into [min..mid] and [mid+1..max]
  int mid = min + (max-min)/2;
  if (rac.read_bit()) {
    return read_uniform_int(mid+1, max);
  } else {
    return read_uniform_int(min, mid);
  }
}
There is no compression advantage of using a RAC to encode these 50% chance bits (just regular bits would be just as efficient), but it is convenient to use the RAC anyway to avoid having to mix the RAC bitstream with regular bits.
For intervals of the form 0..2^k-1 this representation boils down to writing the number as k binary bits; for other intervals it can be slightly more concise than binary notation.
This representation is used mostly for simple signaling, where context adaptation would be of little use.
Near-zero symbol coding
This is the representation used for the bulk of the data, including the pixel values. It is very efficient for values that are close to zero (e.g. well-predicted pixels).
An integer encoded with Near-zero symbol coding uses a zero-sign-exponent-mantissa representation, where the exponent is represented in unary notation. Every bit position of this representation has its own RAC context (so their chances adapt separately); the exponent bits use different chances not just depending on their position but also depending on the sign. Chances are represented as 12-bit numbers (1..4095), where 1 means that the bit is very likely to be zero (it has 1/4096 chance of being one), while 4095 means that the bit is very likely to be one (the chance is 4095/4096).
The following code describes how to decode an integer in this representation:
// assumption: min <= 0 <= max
int read_nearzero_int(int min, int max) {
if (min == max) return min;
if (rac.read(ZERO)) return 0;
bool sign; // true = positive, false = negative
if (min < 0) {
if (max > 0) {
// only actually read the sign bit if both signs are possible
sign = rac.read(SIGN);
} else {
// max <= 0 and the number is not zero, so it is negative
sign = false;
}
} else {
// min >= 0 and the number is not zero, so it is positive
sign = true;
}
// highest possible absolute value
const int amax = (sign? max : -min);
// highest possible exponent
const int emax = ilog2(amax);
// exponent is represented in unary: a series of 1s followed by 0
// (the end 0 is dropped if e=emax)
int e;
for (e = 0 ; e < emax ; e++) {
if (rac.read(EXP(e,sign))) break;
}
// the first mantissa bit is always 1
int have = (1 << e);
// if all other mantissa bits are 1, then the total is have+left
int left = have-1;
// read mantissa bits from most-significant to least-significant
for (int pos = e; pos>0;) {
left >>= 1; pos--;
// if the bit is 1, then the value will be at least minabs1
int minabs1 = have | (1<<pos);
// if the bit is 0, then the value will be at most maxabs0
int maxabs0 = have | left;
if (minabs1 > amax) {
// 1-bit is impossible (would bump value above maximum),
// so assume the bit is 0 without reading it
} else if (maxabs0 >= 1) {
// 0-bit and 1-bit are both possible,
// so we read the bit and adjust what we have if it is a 1
if (rac.read(MANT(pos))) have = minabs1;
} else {
// 0-bit is impossible (would make the value zero),
// so assume the bit is 1 without reading it
have = minabs1;
}
}
return (sign ? have : -have);
}
The chances are initialized as follows:
bit | initial chance |
---|---|
ZERO | 1000 |
SIGN | 2048 |
EXP(0,sign) | 1000 |
EXP(1,sign) | 1200 |
EXP(2,sign) | 1500 |
EXP(3,sign) | 1750 |
EXP(4,sign) | 2000 |
EXP(5,sign) | 2300 |
EXP(6,sign) | 2800 |
EXP(7,sign) | 2400 |
EXP(8,sign) | 2300 |
EXP(9,sign) | 2048 |
EXP(k,sign), k > 9 | 2048 |
MANT(0) | 1900 |
MANT(1) | 1850 |
MANT(2) | 1800 |
MANT(3) | 1750 |
MANT(4) | 1650 |
MANT(5) | 1600 |
MANT(6) | 1600 |
MANT(7) | 2048 |
MANT(k), k > 7 | 2048 |
Part 3: second header
Type | Description | Condition | Default value |
---|---|---|---|
1 byte | NUL byte (0x00), chunk name of a FLIF16 bitstream | | |
uni_int(1,16) | Bits per pixel of the channels | Bpc == '0': repeat(nb_channels) | 8 if Bpc == '1', 16 if Bpc == '2' |
uni_int(0,1) | Flag: alpha_zero | nb_channels > 3 | 0 |
uni_int(0,100) | Number of loops | nb_frames > 1 | |
uni_int(0,60_000) | Frame delay in ms | nb_frames > 1: repeat(nb_frames) | |
uni_int(0,1) | Flag: has_custom_cutoff_and_alpha | | |
uni_int(1,128) | cutoff | has_custom_cutoff_and_alpha | 2 |
uni_int(2,128) | alpha divisor | has_custom_cutoff_and_alpha | 19 |
uni_int(0,1) | Flag: has_custom_bitchance | has_custom_cutoff_and_alpha | 0 |
? | Bitchance | has_custom_bitchance | |
variable | Transformations (see below) | | |
uni_int(1) = 0 | Indicator bit: done with transformations | | |
uni_int(0,2) | Invisible pixel predictor | alpha_zero && interlaced && alpha range includes zero | |
Channels are ordered as follows:
Channel number | Description |
---|---|
0 | Red or Gray |
1 | Green |
2 | Blue |
3 | Alpha |
Transformations
For each transformation:
Type | Description |
---|---|
uni_int(1) = 1 | Indicator bit: not done yet |
uni_int(0,13) | Transformation identifier |
variable | Transformation data (depends on transformation) |
Transformations have to be encoded in ascending order of transformation identifier. All transformations are optional.
Transformations serve two main purposes:
- to modify the pixel data (in a reversible way) to make it compress better, and
- to keep track of the range of actually occurring pixel values, in order to narrow it down.
Initially, pixel values are assumed to be in the range 0..2^(bit_depth)-1; this range can be modified by transformations. We’ll use range(channel).min and range(channel).max to denote the global minimum and maximum value of a particular channel.
We also use a potentially more accurate (narrow) conditional range crange(channel,values) to denote the range of a pixel value in channel channel, given that the pixel values in previously encoded channels are values. Initially, the conditional ranges are simply equal to the global range, but transformations might change that.
Finally, we define a function snap(channel,values,x) which given a pixel value x for channel channel and pixel values values in previously encoded channels, returns a 'valid' value as close as possible to x. Usually, snap(channel,values,x) simply clamps x to the conditional range crange(channel,values), but the ColorBuckets transformation changes that behavior.
In the following descriptions of transformations, we use orig_range, orig_crange, orig_snap to denote the original ranges and snap function (the initial ones, or the ones resulting from the previous transformation in the chain). We use new_range, new_crange, new_snap to denote the updated ranges and snap function.
In part 4 we will use range, crange and snap to denote the final ranges and snap functions, i.e. after applying all transformations.
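As a non-normative sketch, the default snap function simply clamps to the conditional range (the ColorBuckets transformation replaces this behavior, see Transformation 7); crange_min and crange_max are assumed helpers returning the bounds of crange(channel,values).
// Sketch: default snap = clamp x to crange(channel, values).
int default_snap(int channel, const int *values, int x) {
    int lo = crange_min(channel, values);
    int hi = crange_max(channel, values);
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}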
We will now describe the transformations and their encoding in more detail.
Transformation 0: ChannelCompact
The ChannelCompact transformation looks at each channel independently, and reduces its range by eliminating values that do not actually occur in the image.
To be able to reconstruct the original values, the mapping from the reduced range to the original range is encoded. Near-zero symbol coding is used, with a single context which we’ll call A.
The information is encoded as follows:
- For each channel c:
  - new_range(c).max = nz_int_A(0, orig_range(c).max - orig_range(c).min)
  - min = orig_range(c).min
  - For i = 0..new_range(c).max:
    - decompacted(i) = min + nz_int_A(0, orig_range(c).max - min - (new_range(c).max - i))
    - min = decompacted(i) + 1
The effect of this transformation is as follows:
- new_range(c).min = new_crange(c,…).min = 0
- new_range(c).max = new_crange(c,…).max is explicitly encoded
- new_snap is the default snap function (simple clamping)
To reverse the transformation (after decoding the pixel values):
- For each channel c:
  - For every pixel value v:
    - Set v = decompacted(v)
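A non-normative sketch of decoding the mapping and undoing the transformation for one channel; read_nz_int(A, lo, hi) stands for nz_int_A(lo,hi), and orig_min/orig_max are the original range bounds of the channel.
// Sketch: decode the ChannelCompact mapping for one channel and
// decompact the already-decoded pixel values of that channel.
void channel_compact_reverse(int orig_min, int orig_max, int *pixels, int nb_pixels) {
    int new_max = read_nz_int(A, 0, orig_max - orig_min);
    int decompacted[new_max + 1];
    int min = orig_min;
    for (int i = 0; i <= new_max; i++) {
        // leave room for the remaining (new_max - i) strictly larger values
        decompacted[i] = min + read_nz_int(A, 0, orig_max - min - (new_max - i));
        min = decompacted[i] + 1;
    }
    for (int j = 0; j < nb_pixels; j++)
        pixels[j] = decompacted[pixels[j]];   // map back to the original values
}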
Transformation 1: YCoCg
The YCoCg transformation converts the colorspace from RGB to YCoCg. No information has to be encoded for this (besides the identifier of the transformation).
The transformation only affects the first three channels (0,1,2). It is not allowed to be used if nb_channels = 1.
Pixel transformation
Channel | Original meaning | New meaning | Forward transform | Reverse transform |
---|---|---|---|---|
0 | Red (R) | Luma (Y) | Y = (((R+B)>>1) + G)>>1 | R = Co + Y + ((1-Cg)>>1) - (Co>>1) |
1 | Green (G) | Chroma orange (Co) | Co = R - B | G = Y - ((-Cg)>>1) |
2 | Blue (B) | Chroma green (Cg) | Cg = G - ((R+B)>>1) | B = Y + ((1-Cg)>>1) - (Co>>1) |
Luma (Y) corresponds to roughly 50% green, 25% red, 25% blue. Chroma orange (Co) is positive for colors near orange (red, orange, yellow), and negative for colors near blue. Chroma green (Cg) is positive for colors near green and negative for colors near purple.
The YCoCg transformation tends to decorrelate the channels, which helps to improve compression. For example, for grayscale images, i.e. images where R=G=B, the result of the YCoCg transform is Y=R (=G=B), Co=Cg=0. Human perception is more sensitive to luma than it is to chroma. For this reason, in interlaced mode, the default zoomlevel/channel ordering (see below) gives priority to the luma channel; this results in partial decodes (progressive previews) which are effectively chroma subsampled.
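The forward and reverse transforms from the table above, written out as a non-normative C sketch (>> is the arithmetic right shift defined in the notational conventions):
// Forward RGB -> YCoCg, as in the table above.
void ycocg_forward(int r, int g, int b, int *y, int *co, int *cg) {
    *y  = (((r + b) >> 1) + g) >> 1;
    *co = r - b;
    *cg = g - ((r + b) >> 1);
}
// Reverse YCoCg -> RGB.
void ycocg_reverse(int y, int co, int cg, int *r, int *g, int *b) {
    *r = co + y + ((1 - cg) >> 1) - (co >> 1);
    *g = y - ((-cg) >> 1);
    *b = y + ((1 - cg) >> 1) - (co >> 1);
}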
New ranges
Define origmax4 to be equal to max(orig_range(0).max,orig_range(1).max,orig_range(2).max)/4+1 and newmax to be equal to 4 * (origmax4) - 1. In the most common case where the three channels have the range 0..255, this evaluates to origmax4 = 64 and newmax = 255.
Channel number c | Original meaning | New meaning | new_range(c) |
---|---|---|---|
0 | Red (R) | Luma (Y) | 0..newmax |
1 | Green (G) | Chroma orange (Co) | -newmax..newmax |
2 | Blue (B) | Chroma green (Cg) | -newmax..newmax |
Unlike the RGB color space, not every coordinate in the YCoCg color space corresponds to an actual color. In particular, the range for Co and Cg is much smaller for near-black and near-white colors than for intermediate luma values.
The conditional range function crange is updated to reflect this. It is updated as follows:
- new_crange(0) = new_range(0)
- new_crange(1,yval).min =
  - -3 - 4 * yval, if yval < origmax4 - 1
  - 4 * (yval - newmax), if yval > 3 * origmax4 - 1
  - -newmax, otherwise
- new_crange(1,yval).max =
  - 3 + 4 * yval, if yval < origmax4 - 1
  - 4 * (newmax - yval), if yval > 3 * origmax4 - 1
  - newmax, otherwise
- new_crange(2,yval,coval).min =
  - -2 - 2 * yval, if yval < origmax4 - 1
  - -2 * (newmax - yval) + 2 * ((abs(coval)+1)/2), if yval > 3 * origmax4 - 1
  - -min(2 * yval + 1, 2 * newmax - 2 * yval - 2 * ((abs(coval)+1)/2)), otherwise
- new_crange(2,yval,coval).max =
  - 1 + 2 * yval - 2 * (abs(coval)/2), if yval < origmax4 - 1
  - 2 * (newmax - yval), if yval > 3 * origmax4 - 1
  - min(2 * (newmax - yval), 2 * yval + 1 - 2 * (abs(coval)/2)), otherwise
Transformation 2: reserved (unused)
Transformation identifier 2 is not used. It is reserved for future extensions that support transformations to other color spaces like YCbCr.
Transformation 3: PermutePlanes
The PermutePlanes transformation reorders (permutes) the channels; optionally it also subtracts the values of the new channel 0 from the values of channels 1 and 2. This transformation is useful if for some reason the YCoCg transformation is not used: it can e.g. be used to transform RGB to G (R-G) (B-G).
This transformation is not allowed to be used in conjunction with the YCoCg transformation; it is also not allowed to be used if nb_channels = 1. Also, if alpha_zero is true, then channel 3 (Alpha) is not allowed to be permuted to a different channel number.
There are two main reasons to do a channel reordering: better compression (the order matters for compression since the values of previously encoded channels are used in the MANIAC properties, see below), and better progressive previews (e.g. Green is perceptually more important than Red and Blue, so it makes sense to encode it first). Additionally, subtracting channel 0 from the other channels is a simple form of channel decorrelation; usually not as good as the YCoCg transformation though.
We denote the permutation used by PermutePlanes with p, where p(nc)=oc means that the new channel number nc corresponds to the old channel number oc.
Without subtraction, the forward transformation looks as follows:
Channel number c | Original pixel value | New pixel value (no Subtract) | new_range(c) (no Subtract) |
---|---|---|---|
0 | v_0 | v_p(0) | orig_range(p(0)) |
1 | v_1 | v_p(1) | orig_range(p(1)) |
2 | v_2 | v_p(2) | orig_range(p(2)) |
3 | v_3 | v_p(3) | orig_range(p(3)) |
With subtraction, the forward transformation looks as follows:
Channel number c | Original pixel value | New pixel value (with Subtract) | new_range(c) (with Subtract) |
---|---|---|---|
0 | v_0 | v_p(0) | orig_range(p(0)) |
1 | v_1 | v_p(1) - v_p(0) | orig_range(p(1)).min - orig_range(p(0)).max .. orig_range(p(1)).max - orig_range(p(0)).min |
2 | v_2 | v_p(2) - v_p(0) | orig_range(p(2)).min - orig_range(p(0)).max .. orig_range(p(2)).max - orig_range(p(0)).min |
3 | v_3 | v_p(3) | orig_range(p(3)) |
The reverse transformation can easily be derived from this: given input values (in_0, in_1, in_2, in_3), the output values are given by out_p(c) = in_c if there is no Subtract or if c is 0 or 3, and by out_p(c) = in_c + in_0 if there is Subtract and c is 1 or 2.
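As a non-normative sketch of this reversal (p is the decoded permutation, subtract the decoded Subtract flag):
// Sketch: undo PermutePlanes for one pixel.
// in[] holds the decoded channel values, out[] receives the original values,
// p[] maps new channel number -> old channel number.
void permute_planes_reverse(const int *in, int *out, const int *p,
                            int nb_channels, int subtract) {
    for (int c = 0; c < nb_channels; c++) {
        int v = in[c];
        if (subtract && (c == 1 || c == 2)) v += in[0];
        out[p[c]] = v;
    }
}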
To encode the parameters of this transformation, near-zero symbol coding is used, with a single context which we will call A.
Type | Description | Condition |
---|---|---|
nz_int_A(0,1) | Boolean: Subtract | |
nz_int_A(0,nb_channels-1) | p(c) | repeat: c from 0 to nb_channels-1 |
The decoder has to check that p actually describes a permutation, i.e. it is a bijection (no two input channels map to the same output channel).
Transformation 4: Bounds
Transformation 5: PaletteAlpha
Transformation 6: Palette
Transformation 7: ColorBuckets
The ColorBuckets transformation is an alternative to the Palette transformations; it is useful for sparse-color images, especially if the number of colors is relatively small but still too large for effective palette encoding.
Unlike the Palette transformations, ColorBuckets does not modify the actual pixel values. As a result, the reverse transformation is trivial: nothing has to be done. However, the transformation does change the crange and the snap functions. By reducing the range of valid pixel values (sometimes drastically), compression improves.
A 'Color Bucket' is a (possibly empty) set of pixel values. For channel 0, there is a single Color Bucket b0. For channel 1, there is one Color Bucket for each pixel value in orig_range(0); we’ll denote these Color Buckets with b1(v0). For channel 2, there is one Color Bucket for each combination of values (v0,Q(v1)), denoted b2(v0,Q(v1)), where v0 is in orig_range(0), v1 is in orig_range(1), and the quantization function Q maps x to (x - orig_range(1).min) / 4. Finally, for channel 3, there is a single Color Bucket b3.
The new ranges are identical to the original ranges: new_range(c) = orig_range(c).
The new conditional ranges are given by the minimum and maximum of the corresponding Color Buckets:
- new_crange(0) = min(b0) .. max(b0)
- new_crange(1,v0) = min(b1(v0)) .. max(b1(v0))
- new_crange(2,v0,v1) = min(b2(v0,Q(v1))) .. max(b2(v0,Q(v1)))
- new_crange(3) = min(b3) .. max(b3)
The new snap function returns the value in the corresponding Color Bucket that is closest to the input value; if there are two such values, it returns the lowest one. For example, if Color Bucket b1(20) is the set {-1,3,4,6,8}, then new_snap(1,20,x) returns -1 for x=-100, 4 for x=5, 6 for x=6, and 8 for x=100.
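A non-normative sketch of this bucket-based snap, assuming the bucket is stored as a sorted, non-empty array:
// Sketch: snap x to the closest value in a Color Bucket; ties pick the lower value.
int bucket_snap(const int *bucket, int n, int x) {
    int best = bucket[0];
    for (int i = 1; i < n; i++) {
        int d_best = (x > best)      ? x - best      : best - x;
        int d_cur  = (x > bucket[i]) ? x - bucket[i] : bucket[i] - x;
        if (d_cur < d_best) best = bucket[i];   // strictly closer only, so ties keep the lower value
    }
    return best;
}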
The ColorBuckets transformation is not allowed in the following circumstances:
- Palette or PaletteAlpha is used, or in general, both channel 0 and channel 2 contain only zeroes: orig_range(0) = orig_range(2) = 0..0
- The image is a grayscale image
- Channel 1 is trivial: orig_range(1) is a singleton
- Channel 0, 1 or 2 requires more than 10 bits: orig_range(c).max - orig_range(c).min > 1023
To encode the Color Buckets, (generalized) near-zero symbol coding is used with 6 different contexts, which we will call A,B,C,D,E, and F. The encoding is quite complicated.
To decode, all Color Buckets are initialized to empty sets.
First b0 is encoded:
Type | Description | Condition | Effect |
---|---|---|---|
nz_int_A(0,1) = 1 | Boolean: nonempty | | |
gnz_int_B(orig_range(0).min, orig_range(0).max) | min | | |
gnz_int_C(min, orig_range(0).max) | max | | if max - min < 2, then b0 = { min, max } |
nz_int_D(0,1) | discrete | max - min > 1 | if discrete = 0, then b0 = min .. max |
gnz_int_E(2, min(255, max - min)) | n = size of b0 | discrete | b0[0] = min, b0[n-1] = max |
gnz_int_F(b0[i-1]+1, max + 1 + i - n) | b0[i] | discrete, repeat: i from 1 to n-2 | |
Next, for all values v0 in b0, Color Bucket b1(v0) is encoded:
Type | Description | Condition | Effect |
---|---|---|---|
nz_int_A(0,1) | Boolean: nonempty | | if false, b1(v0) is the empty set |
gnz_int_B(orig_crange(1,v0).min, orig_crange(1,v0).max) | min | nonempty | |
gnz_int_C(min, orig_crange(1,v0).max) | max | nonempty | if max - min < 2, then b1(v0) = { min, max } |
nz_int_D(0,1) | discrete | max - min > 1 | if discrete = 0, then b1(v0) = min .. max |
gnz_int_E(2, min(510, max - min)) | n = size of b1(v0) | discrete | b1(v0)[0] = min, b1(v0)[n-1] = max |
gnz_int_F(b1(v0)[i-1]+1, max + 1 + i - n) | b1(v0)[i] | discrete, repeat: i from 1 to n-2 | |
Next, for all values v0 in b0, for all values qv1 from 0 to (orig_range(1).max - orig_range(1).min) / 4, Color Bucket b2(v0,qv1) is encoded if for some k in 0..3, it is the case that v1 = qv1 * 4 + orig_range(1).min + k is in the set b1(v0):
Type | Description | Condition | Effect |
---|---|---|---|
nz_int_A(0,1) | Boolean: nonempty | | if false, remove v1 from b1(v0) |
gnz_int_B(min over k=0..3 of orig_crange(2,v0,v1+k).min, max over k=0..3 of orig_crange(2,v0,v1+k).max) | min | nonempty | |
gnz_int_C(min, max over k=0..3 of orig_crange(2,v0,v1+k).max) | max | nonempty | if max - min < 2, then b2(v0,qv1) = { min, max } |
nz_int_D(0,1) | discrete | max - min > 1 | if discrete = 0, then b2(v0,qv1) = min .. max |
gnz_int_E(2, min(5, max - min)) | n = size of b2(v0,qv1) | discrete | b2(v0,qv1)[0] = min, b2(v0,qv1)[n-1] = max |
gnz_int_F(b2(v0,qv1)[i-1]+1, max + 1 + i - n) | b2(v0,qv1)[i] | discrete, repeat: i from 1 to n-2 | |
Finally, if there is an Alpha channel (i.e. nb_channels > 3), then b3 is encoded in exactly the same way as b0.
Transformation 8: reserved (unused)
Transformation 9: reserved (unused)
Transformation 10: DuplicateFrame
Transformation 11: FrameShape
Transformation 12: FrameLookback
Transformation 13: reserved (unused)
MANIAC entropy coding
Meta-Adaptive Near-zero Integer Arithmetic Coding (MANIAC) is a variant of Context-adaptive binary arithmetic coding (CABAC) in which not just the probability model is adaptive (based on local context), but also the context model itself is adaptive. In this sense, it is meta-adaptive.
The symbol encoding itself corresponds to the Near-zero symbol coding method described above. The chance table (the chances for ZERO, SIGN, EXP(p,s) and MANT(p)) depends on the local context; these chances adapt each time a symbol is encoded or decoded.
The local context is described as a vector of integer numbers, which are called properties. To determine which chance table to use, a decision tree is used. This tree is called a MANIAC tree. The full structure of the tree is known in advance by the decoder (it is encoded in part 4 of the bitstream, see below). However, at decode time, the tree is initially only partially used (only the root node at the beginning), and it 'grows' to eventually become the full tree.
The inner nodes of a MANIAC tree contain a test of the form property[k] > value. If this test evaluates to true, then the left branch is taken, otherwise the right branch is taken. Eventually a leaf node is reached.
The leaf nodes of a MANIAC tree contain a counter and a chance table. This chance table is used in the Near-zero symbol coding. The counter gets decremented each time the leaf node is reached. When it reaches zero, the tree 'grows' and the leaf node becomes an inner node (a decision node). Its chance table gets duplicated and becomes the chance table of its two branch nodes.
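A non-normative sketch of one possible data layout and of the leaf lookup with on-the-fly growth; the names (ManiacNode, ChanceTable, maniac_lookup) are illustrative, and the exact moment at which the split takes effect relative to the current symbol follows the prose above rather than the reference implementation.
// Sketch: MANIAC tree node and context lookup with tree growth.
typedef struct ManiacNode {
    int property;                    // -1 for a true leaf of the decoded tree
    int test_value;                  // inner node: test "property[k] > test_value"
    int counter;                     // visits left before this node starts deciding
    ChanceTable chances;             // chance table used by near-zero coding
    struct ManiacNode *left, *right; // children, known from the decoded full tree
} ManiacNode;

ChanceTable *maniac_lookup(ManiacNode *node, const int *pvec) {
    // walk down through nodes that have already grown into decision nodes
    while (node->property >= 0 && node->counter == 0)
        node = (pvec[node->property] > node->test_value) ? node->left : node->right;
    if (node->property >= 0 && --node->counter == 0) {
        // the node becomes a decision node; both children start from a copy
        // of its chance table
        node->left->chances  = node->chances;
        node->right->chances = node->chances;
    }
    return &node->chances;
}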
Part 4: pixel data
In this final part of the FLIF bitstream, the actual pixel values are encoded. There are two methods (or modes) of pixel encoding: non-interlaced and interlaced. The main header indicates which method is used (see Part 1 above).
Non-Interlaced method
If this encode method is used, then we start immediately with the encoding of the MANIAC trees (see below), followed by the encoding of the pixels. The order in which the pixels are encoded is described by the following nested loops:
- For all channels c (in the order 4,3,0,1,2, skipping those that don’t exist or have a singleton range):
  - For all rows y, i.e. for y from 0 to height-1:
    - For all frames f, i.e. for f from 0 to nb_frames-1:
      - For all columns x, i.e. for x from 0 to width-1:
        - Encode the pixel at location (x,y) in channel c of frame f
How the pixel is actually encoded is described below.
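As a non-normative sketch of this scan order (channel_exists, is_trivial and encode_pixel stand for the conditions and the per-pixel encoding described in this document; width, height and nb_frames come from the main header):
// Sketch: scan order of the non-interlaced mode.
void encode_non_interlaced(void) {
    static const int channel_order[5] = { 4, 3, 0, 1, 2 };
    for (int i = 0; i < 5; i++) {
        int c = channel_order[i];
        if (!channel_exists(c) || is_trivial(c)) continue;  // skip singleton ranges
        for (int y = 0; y < height; y++)
            for (int f = 0; f < nb_frames; f++)
                for (int x = 0; x < width; x++)
                    encode_pixel(f, c, x, y);
    }
}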
Interlaced method
For interlacing, we define the notion of zoomlevels. Zoomlevel 0 is the full image. Zoomlevel 1 are all the even-numbered rows of the image (counting from 0). Zoomlevel 2 are all the even-numbered columns of zoomlevel 1. In general: zoomlevel 2k+1 are all the even-numbered rows of zoomlevel 2k, and zoomlevel 2k+2 are all the even-numbered columns of zoomlevel 2k+1.
In other words, every even-numbered zoomlevel 2k is a downsampled version of the image, at scale 1:2^k.
We define the 'maximum zoomlevel' max_zl of an image as the zoomlevel with the lowest number that consists of a single pixel. This is always the pixel in the top-left corner of the image (row 0, column 0). This pixel is always encoded first, as a special case.
The zoomlevels are encoded from highest (most zoomed out) to lowest; in each zoomlevel, obviously only those pixels are encoded that haven’t been encoded previously. So in an even-numbered zoomlevel, the odd-numbered rows are encoded, while in an odd-numbered zoomlevel, the odd-numbered columns are encoded.
If the interlaced encode method is used, we do not encode the MANIAC trees right away. Instead, we initialize the trees to a single root node per channel, and start encoding a 'rough preview' of the image (a few of the highest zoomlevels). This allows a rough thumbnail extraction without needing to decode the MANIAC tree. Then the MANIAC tree is encoded, and then the rest of the zoomlevels are encoded.
Type | Description | Condition |
---|---|---|
uni_int(0,max_zl) | Number of the first MANIAC-encoded zoomlevel: first_zl | |
uni_int(range(c).min,range(c).max) | pixel value of the top-left (0,0) pixel of channel c | repeat(c in 0..nb_channels-1) |
Encode_zoomlevels(max_zl-1,first_zl+1) | encoding of zoomlevels max_zl-1 until first_zl+1 | |
encoding of MANIAC trees | see further below | |
Encode_zoomlevels(first_zl,0) | encoding of zoomlevels first_zl until 0 | |
The encoding of a series of zoomlevels happens by interleaving the channels in some way. This interleaving is either in the 'default order', or in a custom order. In any case, the following invariants must hold:
- Zoomlevel k of a channel can only be encoded after zoomlevel k+1 of that channel has already been encoded;
- If channel 3 exists and alpha_zero is true, then zoomlevel k of channel 0 (usually Luma) can only be encoded after zoomlevel k of channel 3 (Alpha) has already been encoded;
- Zoomlevel k of channel 1 (usually Co) can only be encoded after zoomlevel k of channel 0 (usually Luma) has been encoded;
- Zoomlevel k of channel 2 (usually Cg) can only be encoded after zoomlevel k of channels 0 and 1 (Luma and Co) have been encoded;
- If channel 4 (FrameLookback) exists: zoomlevel k of any other channel (0, 1, 2, or 3) can only be encoded after zoomlevel k of channel 4 has already been encoded.
TODO: describe default order
Encode_zoomlevels(h,l)
There are three different pixel predictors in interlaced mode (numbered 0, 1, 2). For each channel, either the same predictor is used in all zoomlevels h to l (in that case the predictor number is encoded only once), or a different predictor can be picked for each zoomlevel (in that case the number -1 is encoded, and the actual predictor number is encoded at the beginning of each zoomlevel).
Type | Description | Condition |
---|---|---|
uni_int(0,1) | Boolean: default_order | |
uni_int(-1,2) | Pixel predictors pred[channel] | repeat(nb_channels) |
Repeat nb_channels * (h-l+1) times: (once for every channel/zoomlevel in the range from h to l)
Type | Description | Condition | Default value |
---|---|---|---|
uni_int(0,nb_channels-1) | Channel c to be encoded | not default_order | given by default order |
 | Zoomlevel z is implicit | | |
uni_int(0,2) | Pixel predictor p to use | pred[c] == -1 | pred[c] |
Encode_zl(c,z,p) | Encoding of the next zoomlevel of channel c | range(c).min < range(c).max | |
The encoding of a single zoomlevel/channel combination is defined as follows:
Encode_zl(c,z,p) :
- If z is even (a 'horizontal' zoomlevel):
  - For all odd rows y of zoomlevel z, i.e. for y from 1 to (height-1)/(2^(z/2)) in steps of 2:
    - For all frames f, i.e. for f from 0 to nb_frames-1:
      - For all columns x of zoomlevel z, i.e. for x from 0 to (width-1)/(2^(z/2)):
        - Encode the pixel at location (x,y) in zoomlevel z of channel c of frame f, using predictor p
- Else (if z is odd, that is, a 'vertical' zoomlevel):
  - For all rows y of zoomlevel z, i.e. for y from 0 to (height-1)/(2^((z+1)/2)):
    - For all frames f, i.e. for f from 0 to nb_frames-1:
      - For all odd columns x of zoomlevel z, i.e. for x from 1 to (width-1)/(2^(z/2)) in steps of 2:
        - Encode the pixel at location (x,y) in zoomlevel z of channel c of frame f, using predictor p
Pixel encoding
The overall encoding mechanism of a pixel value v is as follows:
- The value is predicted, resulting in a predicted value guess:
  - The range of possible values is computed as accurately as possible using crange(c,…), resulting in min and max
  - guess is adjusted to be a valid value in this range using snap(c,…)
- The local context of the pixel is computed as a MANIAC property vector pvec
- The number that actually gets encoded is the difference between the actual and predicted value, v - guess:
  - If the prediction is any good, these numbers tend to be close to zero
  - To decode the value: nz_int_MANIAC(c,pvec)(min-guess, max-guess) + guess
The above mechanism is the same in interlaced and non-interlaced mode and for all channels; however, the predictor (guess) and the layout of the property vector (pvec) depend on the mode and the channel.
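A non-normative sketch of the decoder side of this mechanism; crange_min, crange_max, snap, predict, compute_properties, read_nz_int_maniac and MAX_PROPERTIES are assumed helpers/constants for the steps above.
// Sketch: decode one pixel value of channel c at (x,y) of frame f.
// 'values' holds the already-decoded values of earlier channels at this position.
int decode_pixel(int f, int c, int x, int y, const int *values) {
    int vmin  = crange_min(c, values);
    int vmax  = crange_max(c, values);
    int guess = snap(c, values, predict(c, x, y));   // predicted value
    int pvec[MAX_PROPERTIES];
    compute_properties(f, c, x, y, pvec);            // MANIAC property vector
    // the encoded number is the residual v - guess, coded with near-zero coding
    return guess + read_nz_int_maniac(c, pvec, vmin - guess, vmax - guess);
}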
Skipped pixels
Some pixels are not encoded at all. There are three such cases:
- If the alpha_zero flag is true, then 'invisible pixels' (pixels with Alpha value zero) semantically have undefined RGB values. So if the pixel value in channel 3 (Alpha) is equal to 0, then the pixel values in channels 0, 1, 2 (RGB, or e.g. YCoCg after transformations) are not encoded. However, the decoder does need to set the values of these pixels to something, since those 'invisible' values might very well be used in predictors or properties of neighboring visible pixels. It sets these values simply to the predicted value. In interlaced mode, the predictor to use for that is called the 'invisible pixel predictor' (see part 3 above).
- In animations which use the FrameShape transformation, from the second frame onwards, all rows have a 'begin' and 'end' column. Pixels before the begin and after the end are not encoded; their value is equal to the corresponding pixel value from the previous frame.
- In animations which use the FrameLookback transformation, channel 4 (Lookback) refers to past frames. If the value in channel 4 is not zero but some number k > 0, then all other channels are not encoded for that pixel. The pixel value for channels 0,1,2,3 is equal to the corresponding pixel value from k frames ago.
As an optimization, it is safe to simply skip pixels from a constant channel, that is, a channel which has a singleton range. Technically, one could argue that they are actually encoded, but for each pixel value v, we have v = min = max = guess, which means the symbol encoding doesn’t use any bits.
Pixel predictors
The different pixel predictors are described below. In each case, the actual predicted value is the value returned by the snap function. Usually the snap function will simply return its input value, but not always. Noteworthy exceptions are Predictor 1 of interlaced mode, which as defined below can even fall outside the channel range, and any predictor in combination with the ColorBuckets transformation, which can have a nontrivial snap function.
Non-interlaced mode
Assume we have to predict pixel X in channel c given its previously decoded neighbors:
TL | T |
L | X |
The predicted value for X is the median of the following three values:
- L + T - TL
- L
- T
Border conditions:
- If X is at column 0 (so it has no left neighbor L), then we set L = range(c).min, except when alpha_zero is true, c < 3 and the alpha value at the position of pixel X is zero; in that case we set L = (range(c).min+range(c).max)/2.
- If X is at row 0 (so it has no top neighbor T), then we set T = TL = L.
- If X is at row > 0 and column 0, then TL = T.
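A non-normative sketch of this predictor, including the border handling above; pix(c,x,y), alpha_is_zero, range_min and range_max are assumed accessors, and median3 returns the median of its three arguments. The result is then passed through snap() as described in the pixel encoding section.
// Median of three values.
static int median3(int a, int b, int c) {
    if (a < b) { if (b < c) return b; return (a < c) ? c : a; }
    else       { if (a < c) return a; return (b < c) ? c : b; }
}
// Sketch: non-interlaced prediction of pixel (x,y) in channel c.
int predict_noninterlaced(int c, int x, int y) {
    int L, T, TL;
    if (x > 0)
        L = pix(c, x - 1, y);
    else if (alpha_zero && c < 3 && alpha_is_zero(x, y))
        L = (range_min(c) + range_max(c)) / 2;       // invisible pixel at column 0
    else
        L = range_min(c);
    T  = (y > 0) ? pix(c, x, y - 1) : L;             // no top row: T = TL = L
    TL = (y > 0) ? ((x > 0) ? pix(c, x - 1, y - 1) : T) : L;
    return median3(L + T - TL, L, T);
}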
Interlaced mode
In interlaced mode, there are three different predictors. Moreover, the predictors are slightly different in even-numbered zoomlevels compared to odd-numbered zoomlevels, since slightly different neighboring pixels are available (already decoded) : in horizontal zoomlevels the pixel below the current pixel is already known (since it is part of the previously decoded zoomlevel), while in vertical zoomlevels the pixel to the right of the current pixel is already known.
Horizontal zoomlevel (even-numbered)
Assume we have to predict pixel X in channel c given its previously decoded neighbors.
These neighbors are defined with respect to the current zoomlevel. In the full image, these 'neighbors' are not directly adjacent (except in zoomlevel 0).
   | TT |    |
TL | T  | TR |
LL | L  | X  |
BL | B  | BR |
The following predictors are defined:
- Predictor 0: (T + B)>>1
- Predictor 1: the median of the following three values:
  - (T + B)>>1
  - L + T - TL
  - L + B - BL
- Predictor 2: the median of T, B and L
Border conditions:
- If X is at column 0 (so it has no left neighbor L), then we set L = TL = BL = T.
- If X is at the rightmost column, then we set TR = T.
- If X is at the last row, then BL = B = L.
- If X is at the rightmost column or last row, then BR = B.
Vertical zoomlevel (odd-numbered)
Assume we have to predict pixel X in channel c given its previously decoded neighbors.
   | TT |    |
TL | T  | TR |
LL | L  | X  | R
BL |    | BR |
The following predictors are defined:
- Predictor 0 (P0): (L + R)>>1
- Predictor 1 (P1): the median of the following three values:
  - (L + R)>>1
  - L + T - TL
  - R + T - TR
- Predictor 2 (P2): the median of T, L and R
Border conditions:
- If X is at row 0 (so it has no top neighbor T), then we set T = TL = TR = L.
- If X is at the rightmost column, then we set TR = R = T.
- If X is at the last row, then BL = L.
- If X is at the rightmost column or last row, then BR = R.
MANIAC properties
The MANIAC property vector pvec which is used to determine the context (i.e. the MANIAC tree leaf node) is constructed in the following way. Assume the pixel currently being decoded is pixel X in channel number c and the pixel values of previously decoded neighboring pixels (of the same channel) are named as in the diagram below — it depends on the interlacing mode and zoomlevel which of these are known at decode time.
   | TT |    |
TL | T  | TR |
LL | L  | X  | R
BL | B  |    |
Moreover, we use X_pc to denote the pixel value at the position of X in a previously decoded channel pc, and P to denote the predicted value for X (as described above).
Finally, we define the range maxdiff(c) as the range of a difference between two pixel values in channel c, that is: maxdiff(c).min = range(c).min - range(c).max and maxdiff(c).max = range(c).max - range(c).min.
Non-interlaced mode
In non-interlaced mode, the following MANIAC properties are defined for a pixel in channel c:
Property | Description | Condition | Property range (prange) |
---|---|---|---|
X_0 | Luma / channel 0 value | 0 < c < 3 | range(0) |
X_1 | Co / channel 1 value | 1 < c < 3 | range(1) |
X_3 | Alpha value | c < 3 < nb_channels | range(3) |
P | Predicted value snap(c,…,median(L+T-TL,L,T)) | | range(c) |
Mni | Median-index (see below) | | 0..2 |
L - TL | Left - Topleft | | maxdiff(c) |
TL - T | Topleft - Top | | maxdiff(c) |
T - TR | Top - Topright | | maxdiff(c) |
TT - T | Toptop - Top | | maxdiff(c) |
LL - L | Leftleft - Left | | maxdiff(c) |
For images without an alpha channel, channel 0 has 7 properties (numbered 0 to 6), channel 1 has 8 properties, and channel 2 has 9 properties. For images with an alpha channel, channel 0 has 8 properties, channel 1 has 9 properties, channel 2 has 10 properties, and channel 3 has 7 properties. In animations with FrameLookback, channel 4 has 7 properties.
Border conditions: if in the last five properties, which are differences between two neighboring pixels, one of those pixels does not exist, the property is defined to be zero.
The median-index Mni indicates which of the three input values for the median predictor became the predictor; more precisely, it is defined as follows:
Mni =
- 0, if P = L + T - TL
- 1, if P = L (and the above condition does not hold)
- 2, if P = T (and neither of the above conditions holds)
- 0, otherwise
Interlaced mode
In interlaced mode, the following MANIAC properties are defined for a pixel in zoomlevel z of channel c using predictor p:
Property | Description | Condition | Property range (prange) |
---|---|---|---|
X_0 | Luma / channel 0 value | 0 < c < 3 | range(0) |
X_1 | Co / channel 1 value | 1 < c < 3 | range(1) |
X_3 | Alpha value | c < 3 < nb_channels | range(3) |
Mi | Median-index (see below) | | 0..2 |
X_0 - ((T_0 + B_0)>>1) | Luma 'prediction miss' | 0 < c < 3 and z is even | maxdiff(0) |
X_0 - ((L_0 + R_0)>>1) | Luma 'prediction miss' | 0 < c < 3 and z is odd | maxdiff(0) |
T - B | Top - Bottom | z is even | maxdiff(c) |
T - ((TL + TR)>>1) | Top 'prediction miss' | z is even | maxdiff(c) |
L - ((BL + TL)>>1) | Left 'prediction miss' | z is even | maxdiff(c) |
B - ((BL + BR)>>1) | Bottom 'prediction miss' | z is even | maxdiff(c) |
L - R | Left - Right | z is odd | maxdiff(c) |
L - ((BL + TL)>>1) | Left 'prediction miss' | z is odd | maxdiff(c) |
T - ((TL + TR)>>1) | Top 'prediction miss' | z is odd | maxdiff(c) |
R - ((BR + TR)>>1) | Right 'prediction miss' | z is odd | maxdiff(c) |
P_p | Predicted value (see above) | | range(c) |
TT - T | Toptop - Top | c is not 2 | maxdiff(c) |
LL - L | Leftleft - Left | c is not 2 | maxdiff(c) |
For images without an alpha channel, channel 0 has 8 properties (numbered 0 to 7), channel 1 has 10 properties, and channel 2 has 9 properties. For images with an alpha channel, channel 0 has 9 properties, channel 1 has 11 properties, channel 2 has 10 properties, and channel 3 has 8 properties. In animations with FrameLookback, channel 4 has 8 properties.
Border conditions: if TL, T, TR, L, BL, B, BR or R are missing, they are defined as described in the interlaced pixel predictor definition above. If TT or LL are missing, the property they occur in gets value zero. If B_0 is missing, B_0 = T_0; if R_0 is missing, R_0 = L_0.
The median-index Mi indicates which of the three input values for the median in Predictor 1 is the median; more precisely, it is defined as follows:
Mi =
- 0, if z is even and P1 = (T + B)>>1
- 0, if z is odd and P1 = (L + R)>>1
- 1, if P1 = L + T - TL (and the above conditions do not hold)
- 2, otherwise
MANIAC tree encoding
There is one tree per non-trivial channel (a channel is trivial if its range is a singleton or if it doesn’t exist). The trees are encoded one after another and in a recursive (depth-first) way, as follows:
nb_properties depends on the channel, the number of channels, and the encoding method (interlaced or non-interlaced), as specified above. The range of each property is maintained during tree traversal. The initial property ranges prange[property] are defined above; these are narrowed down when going to a deeper node in the tree.
For each channel c, three different contexts are used: we’ll just call them Ac, Bc and Cc.
Type | Description | Condition |
---|---|---|
nz_int_Ac(0,nb_properties) | 0 = leaf node, > 0: property+1 | |
nz_int_Bc(1,512) | node counter | not a leaf node |
nz_int_Cc(prange[property].min, prange[property].max-1) | test_value | not a leaf node |
recursive encoding of left branch | where prange[property].min = test_value+1 | not a leaf node |
recursive encoding of right branch | where prange[property].max = test_value | not a leaf node |
From this description, the MANIAC trees can be reconstructed. Leaf nodes have a counter value that is effectively infinity (they can never be turned into a decision node).
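A non-normative sketch of this recursive decoding for one channel; read_nz_int(ctx, lo, hi) stands for the nz_int fields above, Range and alloc_node are assumed types/helpers, and prange[] is narrowed while recursing exactly as described in the table (ManiacNode is the illustrative node type sketched earlier).
// Sketch: recursively decode one MANIAC tree for channel c.
ManiacNode *decode_tree(int nb_properties, Range *prange) {
    ManiacNode *node = alloc_node();
    int prop = read_nz_int(A_c, 0, nb_properties);           // 0 = leaf node
    if (prop == 0) {
        node->property = -1;                                  // leaf: counter is effectively infinite
        return node;
    }
    node->property   = prop - 1;
    node->counter    = read_nz_int(B_c, 1, 512);
    node->test_value = read_nz_int(C_c, prange[node->property].min,
                                        prange[node->property].max - 1);
    Range saved = prange[node->property];
    prange[node->property].min = node->test_value + 1;        // left branch: property > test_value
    node->left = decode_tree(nb_properties, prange);
    prange[node->property] = saved;
    prange[node->property].max = node->test_value;            // right branch: property <= test_value
    node->right = decode_tree(nb_properties, prange);
    prange[node->property] = saved;                           // restore for the caller
    return node;
}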
Checksum
At the very end of the bitstream, there is an optional 32-bit checksum to verify the integrity of the file.
Type | Description | Condition |
---|---|---|
uni_int(1) | Boolean: have_checksum | |
uni_int(16) | Most significant 16 bits of checksum | have_checksum |
uni_int(16) | Least significant 16 bits of checksum | have_checksum |
This checksum is a standard CRC-32, computed from the original (or decoded) pixel data in the following way:
The CRC is initialized to the value (width << 16) + height, interpreted as an unsigned 32-bit number. Then the pixel values of all channels are processed, in scanline order (top to bottom, left to right), where each value is represented as a little-endian integer in 1, 2 or 4 bytes, depending on the bit depth of the image. The following sizes are used:
Channel | All channels use at most 8 bits per pixel | At least one channel uses 9 or more bits per pixel |
---|---|---|
0 | 1 byte | 2 bytes |
1 | 2 bytes | 4 bytes |
2 | 2 bytes | 4 bytes |
3 | 1 byte | 2 bytes |