The Free Lossless Image Format (FLIF) is a lossless image compression format. It supports grayscale, RGB and RGBA, with a color depth from 1 to 16 bits per channel. Both still images and animations are supported. FLIF can handle sparse-color images (e.g. 256 color palettes) effectively. It has an interlaced and a non-interlaced mode; unlike PNG, interlacing usually does not come with worse compression. In terms of compression, FLIF usually outperforms other lossless compression formats like PNG, lossless WebP, BPG, JPEG 2000, JPEG XR, JPEG-LS, APNG, MNG, or GIF.
This document specifies the FLIF16 bitstream, as implemented by version 0.2 of the reference encoder (https://github.com/FLIF-hub/FLIF), which was released in September 2016.
Overview of the format
A FLIF file consists of 4 parts, with increasingly complicated encoding methods:

The main header, containing basic image metadata: width, height, color depth, number of frames. This information is encoded in a straightforward way, to make it easy to quickly identify a file.

Optional metadata chunks. These chunks contain non-pixel data: Exif/XMP metadata, an ICC color profile, … This information is encoded using DEFLATE compression.

Second header, containing further information about how the actual pixel data is encoded. This header also describes a chain of transformations that were applied before encoding, which have to be reversed after decoding (e.g. the YCoCg color transform). This information is encoded using CABAC entropy coding.

The actual pixel data. This information is encoded using MANIAC entropy coding.
Notational conventions
We use '/' to denote integer division (rounding towards zero) and '>>' to denote arithmetic right shift. This corresponds to the usual operator semantics in e.g. the C programming language.
>> 1 denotes arithmetic right shift by one, which corresponds to dividing by two rounding down. For positive integers this is the same as / 2, but for negative integers it is not, e.g. -1 / 2 = 0 while -1 >> 1 = -1.
Part 1: main header
Type | Description | Value
4 bytes | Magic | "FLIF"
4 bits | Interlacing, animation | 3 = non-interlaced still; 4 = interlaced still; 5 = non-interlaced animation; 6 = interlaced animation
4 bits | Number of channels (nb_channels) | 1 = Grayscale; 3 = RGB; 4 = RGBA
1 byte | Bytes per channel (Bpc) | '1' = 8 bit, '2' = 16 bit, '0' = custom
varint | Width | width-1
varint | Height | height-1
varint | Number of frames (nb_frames) | nb_frames-2 (only if animation)
The fifth byte is an ASCII character that can be interpreted as follows: for non-interlaced still images, '1' = Grayscale, '3' = RGB, '4' = RGBA; for interlaced images add 0x10 (so 'A' = Grayscale, 'C' = RGB, 'D' = RGBA); for animations add 0x20 (so 'Q', 'S', 'T' for non-interlaced, 'a', 'c', 'd' for interlaced).
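As a sketch, the two nibbles of this byte can be unpacked as described above (the struct and helper names below are illustrative, not part of the format):

```c
#include <stdbool.h>

/* Illustrative helper: split the fifth header byte into its two nibbles. */
typedef struct {
    bool interlaced;
    bool animation;
    int nb_channels; /* 1 = Grayscale, 3 = RGB, 4 = RGBA */
} header_info;

header_info parse_fifth_byte(unsigned char b) {
    header_info h;
    int hi = b >> 4;            /* 3 = ni still; 4 = i still; 5 = ni anim; 6 = i anim */
    h.interlaced  = (hi == 4 || hi == 6);
    h.animation   = (hi == 5 || hi == 6);
    h.nb_channels = b & 0x0F;   /* low nibble: number of channels */
    return h;
}
```

For example, 'C' (0x43) yields an interlaced still RGB image, and 'd' (0x64) an interlaced RGBA animation.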
Variable-size integer encoding (varint):

An unsigned integer (big-endian, MSB first) stored on a variable number of bytes.

All the bytes except the last one have a '1' as their first bit.

The unsigned integer is represented as the concatenation of the remaining 7-bit codewords.
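A minimal decoding sketch of this varint scheme (the function name is illustrative; bounds checking is omitted for brevity):

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one varint: big-endian groups of 7 bits, continuation flag in the
   top bit of every byte except the last. */
uint64_t read_varint(const unsigned char *buf, size_t *pos) {
    uint64_t value = 0;
    unsigned char b;
    do {
        b = buf[(*pos)++];
        value = (value << 7) | (b & 0x7F);
    } while (b & 0x80);
    return value;
}
```

For example, the two bytes 0x81 0x00 decode to 128.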
nb_frames is 1 for a still image, > 1 for an animation.
Part 2: metadata
Chunks are defined similarly to PNG chunks, except the chunk size is encoded with a variable number of bytes, and there is no chunk CRC. Also, the first chunk (which is always "FLIF") has no explicit chunk size and is not DEFLATE-compressed.
Chunk names are either 4 letters (4 bytes), indicating an optional chunk, or 1 byte with a value below 32, indicating a non-optional chunk.
The convention of using upper and lower case letters is kept, but the meaning of the bits is slightly different.

First letter: uppercase=critical, lowercase=non-critical → non-critical chunks can be stripped while keeping a valid file (it might be rendered differently though)

Second letter: uppercase=public, lowercase=private

Third letter: uppercase=needed to correctly display the image (e.g. a color profile), lowercase=can be stripped safely without changing anything visually

Fourth letter: uppercase=safe to copy blindly (when editing the actual image data), lowercase=may depend on image data (e.g. contain a thumbnail)
This optional part is a concatenation of chunks that each look like this:
Type | Description
4 bytes | Chunk name
varint | Chunk length (size)
size bytes | DEFLATE-compressed chunk content
The optional chunks are followed by one non-optional chunk with a single-byte chunk name. Non-optional chunks do not have an explicit chunk size and their content is not DEFLATE-compressed. Currently the only non-optional chunk has name 0x00 (a NUL byte), which indicates a FLIF16 bitstream. Future (non-backwards-compatible) revisions of the FLIF format will use a different name for this non-optional chunk.
If a decoder encounters an unknown optional chunk, it should proceed decoding (optionally it can produce a warning), as long as the chunk is non-critical. If a decoder encounters an unknown non-optional chunk, it should fail and produce an error message.
The following optional chunks are defined:
Chunk name | Description | Content (after DEFLATE-decompression)
iCCP | ICC color profile | raw ICC color profile data
eXif | Exif metadata |
eXmp | XMP metadata | XMP contained within a read-only xpacket with no padding
Symbol encoding
From here on, all values are encoded using arithmetic coding.
uni_int(min,max) denotes an integer in the interval min..max (inclusive), encoded using Uniform symbol coding (see below) and a 24-bit RAC. This encoding is not context-adaptive, and every number in the interval gets a fixed chance (close to a uniform distribution, hence the name). uni_int(b) denotes an integer of b bits, i.e. it’s a synonym for uni_int(0,2^b - 1).
nz_int_CONTEXT(min,max) denotes an integer in the interval min..max (inclusive), encoded using Near-zero symbol coding (see below) and a 24-bit RAC, where the chance table is given by CONTEXT. This encoding is context-adaptive, i.e. the chance table gets updated when encoding or decoding a number. Numbers are encoded using a zero-sign-exponent-mantissa representation, which favors distributions in which near-zero values have a higher chance (hence the name).
gnz_int_CONTEXT(min,max) denotes a simple generalization of Near-zero symbol coding that can be used in case the interval does not necessarily include zero. It is defined as follows:
gnz_int_CONTEXT(min,max) =
nz_int_CONTEXT(0, max - min) + min    if min > 0
nz_int_CONTEXT(min - max, 0) + max    if max < 0
nz_int_CONTEXT(min, max)              otherwise
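This definition translates directly into code. In the sketch below, nz_int_stub is a placeholder standing in for actual Near-zero decoding (it simply hands back a caller-supplied value), so only the interval shifting of gnz_int is exercised:

```c
/* Placeholder for real Near-zero symbol decoding: pretend the RAC
   produced the value 'decoded' within the interval [min, max]. */
int nz_int_stub(int min, int max, int decoded) {
    (void)min; (void)max;
    return decoded;
}

/* The gnz_int generalization: shift the interval so that it includes zero,
   decode, then shift back. */
int gnz_int(int min, int max, int decoded) {
    if (min > 0) return nz_int_stub(0, max - min, decoded) + min;
    if (max < 0) return nz_int_stub(min - max, 0, decoded) + max;
    return nz_int_stub(min, max, decoded);
}
```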
Range encoder
TODO: describe the 24bit RAC and the 12bit chances.
Chance table update
TODO: describe how cutoff and alpha influence this.
Uniform Symbol Coding
An integer encoded with Uniform symbol coding uses a series of 50% chance bits (chance is 2048/4096, non-adaptive) to recursively narrow down the input interval until it is a singleton. The following code describes how to decode such an integer:
// assumption: min <= max
int read_uniform_int(int min, int max) {
    if (min == max) return min;
    // split in [min..mid] [mid+1..max]
    int mid = min + (max-min)/2;
    if (rac.read_bit()) {
        return read_uniform_int(mid+1, max);
    } else {
        return read_uniform_int(min, mid);
    }
}
There is no compression advantage of using a RAC to encode these 50% chance bits (just regular bits would be just as efficient), but it is convenient to use the RAC anyway to avoid having to mix the RAC bitstream with regular bits.
For intervals of the form 0..2^k - 1 this representation boils down to writing down the number as k binary bits; for other intervals it can be slightly more concise than binary notation.
This representation is used mostly for simple signaling, where context adaptation would be of little use.
Near-zero symbol coding
This is the representation used for the bulk of the data, including the pixel values. It is very efficient for values that are close to zero (e.g. well-predicted pixels).
An integer encoded with Near-zero symbol coding uses a zero-sign-exponent-mantissa representation, where the exponent is represented in unary notation. Every bit position of this representation has its own RAC context (so their chances adapt separately); the exponent bits use different chances not just depending on their position but also depending on the sign. Chances are represented as 12-bit numbers (1..4095), where 1 means that the bit is very likely to be zero (it has 1/4096 chance of being one), while 4095 means that the bit is very likely to be one (the chance is 4095/4096).
The following code describes how to decode an integer in this representation:
// assumption: min <= 0 <= max
int read_nearzero_int(int min, int max) {
    if (min == max) return min;
    if (rac.read(ZERO)) return 0;
    bool sign; // true = positive, false = negative
    if (min < 0) {
        if (max > 0) {
            // only actually read the sign bit if both signs are possible
            sign = rac.read(SIGN);
        } else {
            // max <= 0 and the number is not zero, so it is negative
            sign = false;
        }
    } else {
        // min >= 0 and the number is not zero, so it is positive
        sign = true;
    }
    // highest possible absolute value
    const int amax = (sign ? max : -min);
    // highest possible exponent
    const int emax = ilog2(amax);
    // exponent is represented in unary: a series of 1s followed by a 0
    // (the end 0 is dropped if e == emax)
    int e;
    for (e = 0; e < emax; e++) {
        if (rac.read(EXP(e,sign)) == 0) break;
    }
    // the first mantissa bit is always 1
    int have = (1 << e);
    // if all other mantissa bits are 1, then the total is have+left
    int left = have - 1;
    // read mantissa bits from most-significant to least-significant
    for (int pos = e; pos > 0;) {
        left >>= 1; pos--;
        // if the bit is 1, then the value will be at least minabs1
        int minabs1 = have | (1 << pos);
        // if the bit is 0, then the value will be at most maxabs0
        int maxabs0 = have | left;
        if (minabs1 > amax) {
            // 1-bit is impossible (would bump value above maximum),
            // so assume the bit is 0 without reading it
        } else if (maxabs0 >= 1) {
            // 0-bit and 1-bit are both possible,
            // so we read the bit and adjust what we have if it is a 1
            if (rac.read(MANT(pos))) have = minabs1;
        } else {
            // 0-bit is impossible (would make the value zero),
            // so assume the bit is 1 without reading it
            have = minabs1;
        }
    }
    return (sign ? have : -have);
}
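The pseudocode above relies on an ilog2 helper (the position of the highest set bit, so that 1 << ilog2(amax) <= amax); a minimal definition for x >= 1:

```c
/* Position of the highest set bit: ilog2(1) = 0, ilog2(255) = 7. */
int ilog2(int x) {
    int r = 0;
    while (x > 1) { x >>= 1; r++; }
    return r;
}
```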
The chances are initialized as follows:
bit | initial chance
ZERO | 1000
SIGN | 2048
EXP(0,sign) | 1000
EXP(1,sign) | 1200
EXP(2,sign) | 1500
EXP(3,sign) | 1750
EXP(4,sign) | 2000
EXP(5,sign) | 2300
EXP(6,sign) | 2800
EXP(7,sign) | 2400
EXP(8,sign) | 2300
EXP(9,sign) | 2048
EXP(k,sign), k > 9 | 2048
MANT(0) | 1900
MANT(1) | 1850
MANT(2) | 1800
MANT(3) | 1750
MANT(4) | 1650
MANT(5) | 1600
MANT(6) | 1600
MANT(7) | 2048
MANT(k), k > 7 | 2048
Part 3: second header
Type | Description | Condition | Default value
1 byte | NUL byte (0x00), chunk name of a FLIF16 bitstream | |
uni_int(1,16) | Bits per pixel of the channels | Bpc == '0': repeat(nb_channels) | 8 if Bpc == '1', 16 if Bpc == '2'
uni_int(0,1) | Flag: alpha_zero | nb_channels > 3 | 0
uni_int(0,100) | Number of loops | nb_frames > 1 |
uni_int(0,60_000) | Frame delay in ms | nb_frames > 1: repeat(nb_frames) |
uni_int(0,1) | Flag: has_custom_cutoff_and_alpha | |
uni_int(1,128) | cutoff | has_custom_cutoff_and_alpha | 2
uni_int(2,128) | alpha divisor | has_custom_cutoff_and_alpha | 19
uni_int(0,1) | Flag: has_custom_bitchance | has_custom_cutoff_and_alpha | 0
? | Bitchance | has_custom_bitchance |
variable | Transformations (see below) | |
uni_int(1) = 0 | Indicator bit: done with transformations | |
uni_int(0,2) | Invisible pixel predictor | alpha_zero && interlaced && alpha range includes zero |
Channels are ordered as follows:
Channel number | Description
0 | Red or Gray
1 | Green
2 | Blue
3 | Alpha
Transformations
For each transformation:
Type | Description
uni_int(1) = 1 | Indicator bit: not done yet
uni_int(0,13) | Transformation identifier
variable | Transformation data (depends on transformation)
Transformations have to be encoded in ascending order of transformation identifier. All transformations are optional.
Transformations serve two main purposes:

to modify the pixel data (in a reversible way) to make it compress better, and

to keep track of the range of actually occurring pixel values, in order to narrow it down.
Initially, pixel values are assumed to be in the range 0..2^(bit_depth) - 1; this range can be modified by transformations. We’ll use range(channel).min and range(channel).max to denote the global minimum and maximum value of a particular channel.
We also use a potentially more accurate (narrow) conditional range crange(channel,values) to denote the range of a pixel value in channel channel, given that the pixel values in previously encoded channels are values. Initially, the conditional ranges are simply equal to the global range, but transformations might change that.
Finally, we define a function snap(channel,values,x) which given a pixel value x for channel channel and pixel values values in previously encoded channels, returns a 'valid' value as close as possible to x. Usually, snap(channel,values,x) simply clamps x to the conditional range crange(channel,values), but the ColorBuckets transformation changes that behavior.
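In the common case, the default snap behavior is a plain clamp; a minimal sketch with the conditional range endpoints passed in explicitly (the function name and parameters are illustrative):

```c
/* Default snap: clamp x to the conditional range [cmin, cmax]. */
int snap_clamp(int x, int cmin, int cmax) {
    if (x < cmin) return cmin;
    if (x > cmax) return cmax;
    return x;
}
```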
In the following descriptions of transformations, we use orig_range, orig_crange, orig_snap to denote the original ranges and snap function (the initial ones, or the ones resulting from the previous transformation in the chain). We use new_range, new_crange, new_snap to denote the updated ranges and snap function.
In part 4 we will use range, crange and snap to denote the final ranges and snap functions, i.e. after applying all transformations.
We will now describe the transformations and their encoding in more detail.
Transformation 0: ChannelCompact
The ChannelCompact transformation looks at each channel independently, and reduces its range by eliminating values that do not actually occur in the image.
To be able to reconstruct the original values, the mapping from the reduced range to the original range is encoded. Near-zero symbol coding is used, with a single context which we’ll call A.
The information is encoded as follows:

For each channel c :

    new_range(c).max = nz_int_A(0, orig_range(c).max - orig_range(c).min)

    min = orig_range(c).min

    For i = 0..new_range(c).max :

        decompacted(i) = min + nz_int_A(0, orig_range(c).max - min - new_range(c).max + i)

        min = decompacted(i) + 1


The effect of this transformation is as follows:

new_range(c).min = new_crange(c,…).min = 0

new_range(c).max = new_crange(c,…).max is explicitly encoded

new_snap is the default snap function (simple clamping)
To reverse the transformation (after decoding the pixel values) :

For each channel c :

For every pixel value v :

Set v = decompacted(v)
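The reversal loop above can be sketched as follows, with the decoded mapping passed in as an array (names are illustrative):

```c
/* Reverse ChannelCompact for one channel: every decoded pixel value is an
   index into the decompacted[] mapping read from the bitstream. */
void channel_compact_reverse(int *pixels, int n, const int *decompacted) {
    for (int i = 0; i < n; i++)
        pixels[i] = decompacted[pixels[i]];
}
```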


Transformation 1: YCoCg
The YCoCg transformation converts the colorspace from RGB to YCoCg. No information has to be encoded for this (besides the identifier of the transformation).
The transformation only affects the first three channels (0,1,2). It is not allowed to be used if nb_channels = 1.
Pixel transformation
Channel | Original meaning | New meaning | Forward transform | Reverse transform
0 | Red (R) | Luma (Y) | Y = (((R+B)>>1) + G)>>1 | R = Co + Y + ((1-Cg)>>1) - (Co>>1)
1 | Green (G) | Chroma orange (Co) | Co = R - B | G = Y - ((-Cg)>>1)
2 | Blue (B) | Chroma green (Cg) | Cg = G - ((R+B)>>1) | B = Y + ((1-Cg)>>1) - (Co>>1)
Luma (Y) corresponds to roughly 50% green, 25% red, 25% blue. Chroma orange (Co) is positive for colors near orange (red, orange, yellow), and negative for colors near blue. Chroma green (Cg) is positive for colors near green and negative for colors near purple.
The YCoCg transformation tends to decorrelate the channels, which helps to improve compression. For example, for grayscale images, i.e. images where R=G=B, the result of the YCoCg transform is Y=R (=G=B), Co=Cg=0. Human perception is more sensitive to luma than it is to chroma. For this reason, in interlaced mode, the default zoomlevel/channel ordering (see below) gives priority to the luma channel; this results in partial decodes (progressive previews) which are effectively chroma subsampled. 
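The forward and reverse transforms can be checked to round-trip exactly for any integer RGB triple; a direct transcription of the table:

```c
/* Forward YCoCg transform as specified above. */
void rgb_to_ycocg(int R, int G, int B, int *Y, int *Co, int *Cg) {
    *Y  = (((R + B) >> 1) + G) >> 1;
    *Co = R - B;
    *Cg = G - ((R + B) >> 1);
}

/* Reverse YCoCg transform; exactly inverts rgb_to_ycocg. */
void ycocg_to_rgb(int Y, int Co, int Cg, int *R, int *G, int *B) {
    *R = Co + Y + ((1 - Cg) >> 1) - (Co >> 1);
    *G = Y - ((-Cg) >> 1);
    *B = Y + ((1 - Cg) >> 1) - (Co >> 1);
}
```

Note that the shifts are arithmetic, so the rounding of the lifting steps is reproduced exactly and no information is lost.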
New ranges
Define origmax4 to be equal to max(orig_range(0).max, orig_range(1).max, orig_range(2).max)/4 + 1 and newmax to be equal to 4 * origmax4 - 1. In the most common case where the three channels have the range 0..255, this evaluates to origmax4 = 64 and newmax = 255.
Channel number c | Original meaning | New meaning | new_range(c)
0 | Red (R) | Luma (Y) | 0..newmax
1 | Green (G) | Chroma orange (Co) | -newmax..newmax
2 | Blue (B) | Chroma green (Cg) | -newmax..newmax
Unlike the RGB color space, not every coordinate in the YCoCg color space corresponds to an actual color. In particular, the range for Co and Cg is much smaller for nearblack and nearwhite colors than for intermediate luma values:
The conditional range function crange is updated to reflect this. It is updated as follows:

new_crange(0) = new_range(0)

new_crange(1,yval).min =
-(3 + 4 * yval)    if yval < origmax4 - 1
4 * (yval - newmax)    if yval > 3 * origmax4 - 1
-newmax    otherwise

new_crange(1,yval).max =
3 + 4 * yval    if yval < origmax4 - 1
4 * (newmax - yval)    if yval > 3 * origmax4 - 1
newmax    otherwise

new_crange(2,yval,coval).min =
-2 - 2 * yval    if yval < origmax4 - 1
-2 * (newmax - yval) + 2 * ((abs(coval)+1)/2)    if yval > 3 * origmax4 - 1
-min(2 * yval + 1, 2 * newmax - 2 * yval - 2 * ((abs(coval)+1)/2))    otherwise

new_crange(2,yval,coval).max =
1 + 2 * yval - 2 * (abs(coval)/2)    if yval < origmax4 - 1
2 * (newmax - yval)    if yval > 3 * origmax4 - 1
-min(2 * (yval - newmax), -2 * yval - 1 + 2 * (abs(coval)/2))    otherwise
Transformation 2: reserved (unused)
Transformation identifier 2 is not used. It is reserved for future extensions that support transformations to other color spaces like YCbCr.
Transformation 3: PermutePlanes
The PermutePlanes transformation reorders (permutes) the channels; optionally it also subtracts the values of the new channel 0 from the values of channels 1 and 2. This transformation is useful if for some reason the YCoCg transformation is not used: it can e.g. be used to transform RGB to G, (R-G), (B-G).
This transformation is not allowed to be used in conjunction with the YCoCg transformation; it is also not allowed to be used if nb_channels = 1. Also, if alpha_zero is true, then channel 3 (Alpha) is not allowed to be permuted to a different channel number.
There are two main reasons to do a channel reordering: better compression (the order matters for compression since the values of previously encoded channels are used in the MANIAC properties, see below), and better progressive previews (e.g. Green is perceptually more important than Red and Blue, so it makes sense to encode it first). Additionally, subtracting channel 0 from the other channels is a simple form of channel decorrelation; usually not as good as the YCoCg transformation though.
We denote the permutation used by PermutePlanes with p, where p(nc)=oc means that the new channel number nc corresponds to the old channel number oc.
Without subtraction, the forward transformation looks as follows:
Channel number c | Original pixel value | New pixel value (no Subtract) | new_range(c) (no Subtract)
0 | v_{0} | v_{p(0)} | orig_range(p(0))
1 | v_{1} | v_{p(1)} | orig_range(p(1))
2 | v_{2} | v_{p(2)} | orig_range(p(2))
3 | v_{3} | v_{p(3)} | orig_range(p(3))
With subtraction, the forward transformation looks as follows:
Channel number c | Original pixel value | New pixel value (with Subtract) | new_range(c) (with Subtract)
0 | v_{0} | v_{p(0)} | orig_range(p(0))
1 | v_{1} | v_{p(1)} - v_{p(0)} | orig_range(p(1)).min - orig_range(p(0)).max to orig_range(p(1)).max - orig_range(p(0)).min
2 | v_{2} | v_{p(2)} - v_{p(0)} | orig_range(p(2)).min - orig_range(p(0)).max to orig_range(p(2)).max - orig_range(p(0)).min
3 | v_{3} | v_{p(3)} | orig_range(p(3))
The reverse transformation can easily be derived from this: given input values (in_{0},in_{1},in_{2},in_{3}), the output values are given by out_{p(c)} = in_{c} if there is no Subtract or c is 0 or 3, and by out_{p(c)} = in_{c} + in_{0} if there is Subtract and c is 1 or 2.
To encode the parameters of this transformation, near-zero symbol coding is used, with a single context which we will call A.
Type | Description | Condition
nz_int_A(0,1) | Boolean: Subtract |
nz_int_A(0,nb_channels-1) | p(c) | repeat: c from 0 to nb_channels-1
The decoder has to check that p actually describes a permutation, i.e. it is a bijection (no two input channels map to the same output channel).
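A sketch of that validity check (the function name is illustrative):

```c
#include <stdbool.h>

/* Returns true iff p[0..nb_channels-1] is a bijection on 0..nb_channels-1,
   i.e. no two entries map to the same channel and all are in range. */
bool is_permutation(const int *p, int nb_channels) {
    bool seen[4] = {false, false, false, false}; /* at most 4 channels */
    for (int c = 0; c < nb_channels; c++) {
        if (p[c] < 0 || p[c] >= nb_channels || seen[p[c]]) return false;
        seen[p[c]] = true;
    }
    return true;
}
```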
Transformation 4: Bounds
Transformation 5: PaletteAlpha
Transformation 6: Palette
Transformation 7: ColorBuckets
The ColorBuckets transformation is an alternative to the Palette transformations; it is useful for sparse-color images, especially if the number of colors is relatively small but still too large for effective palette encoding.
Unlike the Palette transformations, ColorBuckets does not modify the actual pixel values. As a result, the reverse transformation is trivial: nothing has to be done. However, the transformation does change the crange and the snap functions. By reducing the range of valid pixel values (sometimes drastically), compression improves.
A 'Color Bucket' is a (possibly empty) set of pixel values. For channel 0, there is a single Color Bucket b_{0}. For channel 1, there is one Color Bucket for each pixel value in orig_range(0); we’ll denote these Color Buckets with b_{1}(v_{0}). For channel 2, there is one Color Bucket for each combination of values (v_{0},Q(v_{1})) where v_{0} is in orig_range(0), v_{1} is in orig_range(1), and the quantization function Q maps x to (x - orig_range(1).min) / 4. Finally, for channel 3, there is a single Color Bucket b_{3}.
The new ranges are identical to the original ranges: new_range(c) = orig_range(c).
The new conditional ranges are given by the minimum and maximum of the corresponding Color Buckets:

new_crange(0) = min(b_{0}) .. max(b_{0})

new_crange(1,v_{0}) = min(b_{1}(v_{0})) .. max(b_{1}(v_{0}))

new_crange(2,v_{0},v_{1}) = min(b_{2}(v_{0},Q(v_{1}))) .. max(b_{2}(v_{0},Q(v_{1})))

new_crange(3) = min(b_{3}) .. max(b_{3})
The new snap function returns the value in the corresponding Color Bucket that is closest to the input value; if there are two such values, it returns the lowest one. For example, if Color Bucket b_{1}(20) is the set {1,3,4,6,8}, then new_snap(1,20,x) returns 1 for x=-100, 4 for x=5, 6 for x=6, and 8 for x=100.
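A sketch of this bucket-based snap over a sorted value list (names are illustrative; the strict comparison makes ties resolve to the lower value):

```c
#include <stdlib.h>

/* Return the value in the sorted array bucket[0..n-1] closest to x,
   preferring the lower value on ties. */
int bucket_snap(const int *bucket, int n, int x) {
    int best = bucket[0];
    for (int i = 1; i < n; i++) {
        /* strict '<': on a tie, the earlier (lower) value is kept */
        if (abs(bucket[i] - x) < abs(best - x)) best = bucket[i];
    }
    return best;
}
```

On the example bucket {1,3,4,6,8} this reproduces the behavior described above, e.g. x=5 snaps to 4 rather than 6.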
The ColorBuckets transformation is not allowed in the following circumstances:

Palette or PaletteAlpha is used, or in general, both channel 0 and 2 contain only zeroes: orig_range(0) = orig_range(2) = 0 .. 0

The image is a grayscale image

Channel 1 is trivial: orig_range(1) is a singleton

Channel 0, 1 or 2 requires more than 10 bits: orig_range(c).max - orig_range(c).min > 1023
To encode the Color Buckets, (generalized) near-zero symbol coding is used with 6 different contexts, which we will call A, B, C, D, E, and F. The encoding is quite complicated.
To decode, all Color Buckets are initialized to empty sets.
First b_{0} is encoded:
Type | Description | Condition | Effect
nz_int_A(0,1) = 1 | Boolean: nonempty | |
gnz_int_B(orig_range(0).min, orig_range(0).max) | min | |
gnz_int_C(min, orig_range(0).max) | max | | if max - min < 2, then b_{0} = { min, max }
nz_int_D(0,1) | discrete | max - min > 1 | if discrete = 0, then b_{0} = min .. max
gnz_int_E(2, min(255, max - min)) | n = size of b_{0} | discrete | b_{0}[0] = min, b_{0}[n-1] = max
gnz_int_F(b_{0}[i-1]+1, max + 1 + i - n) | b_{0}[i] | discrete, repeat: i from 1 to n-2 |
Next, for all values v_{0} in b_{0}, Color Bucket b_{1}(v_{0}) is encoded:
Type | Description | Condition | Effect
nz_int_A(0,1) | Boolean: nonempty | | if false, b_{1}(v_{0}) is the empty set
gnz_int_B(orig_crange(1,v_{0}).min, orig_crange(1,v_{0}).max) | min | nonempty |
gnz_int_C(min, orig_crange(1,v_{0}).max) | max | nonempty | if max - min < 2, then b_{1}(v_{0}) = { min, max }
nz_int_D(0,1) | discrete | max - min > 1 | if discrete = 0, then b_{1}(v_{0}) = min .. max
gnz_int_E(2, min(510, max - min)) | n = size of b_{1}(v_{0}) | discrete | b_{1}(v_{0})[0] = min, b_{1}(v_{0})[n-1] = max
gnz_int_F(b_{1}(v_{0})[i-1]+1, max + 1 + i - n) | b_{1}(v_{0})[i] | discrete, repeat: i from 1 to n-2 |
Next, for all values v_{0} in b_{0}, for all values qv_{1} from 0 to (orig_range(1).max - orig_range(1).min) / 4, Color Bucket b_{2}(v_{0},qv_{1}) is encoded if for some k in 0..3, it is the case that v_{1} = qv_{1} * 4 + orig_range(1).min + k is in the set b_{1}(v_{0}):
Type | Description | Condition | Effect
nz_int_A(0,1) | Boolean: nonempty | | if false, remove v_{1} from b_{1}(v_{0})
gnz_int_B(min_{k=0..3}(orig_crange(2,v_{0},v_{1}+k).min), max_{k=0..3}(orig_crange(2,v_{0},v_{1}+k).max)) | min | nonempty |
gnz_int_C(min, max_{k=0..3}(orig_crange(2,v_{0},v_{1}+k).max)) | max | nonempty | if max - min < 2, then b_{2}(v_{0},qv_{1}) = { min, max }
nz_int_D(0,1) | discrete | max - min > 1 | if discrete = 0, then b_{2}(v_{0},qv_{1}) = min .. max
gnz_int_E(2, min(5, max - min)) | n = size of b_{2}(v_{0},qv_{1}) | discrete | b_{2}(v_{0},qv_{1})[0] = min, b_{2}(v_{0},qv_{1})[n-1] = max
gnz_int_F(b_{2}(v_{0},qv_{1})[i-1]+1, max + 1 + i - n) | b_{2}(v_{0},qv_{1})[i] | discrete, repeat: i from 1 to n-2 |
Finally, if there is an Alpha channel (i.e. nb_channels > 3), then b_{3} is encoded in exactly the same way as b_{0}.
Transformation 8: reserved (unused)
Transformation 9: reserved (unused)
Transformation 10: DuplicateFrame
Transformation 11: FrameShape
Transformation 12: FrameLookback
Transformation 13: reserved (unused)
MANIAC entropy coding
Meta-Adaptive Near-zero Integer Arithmetic Coding (MANIAC) is a variant of context-adaptive binary arithmetic coding (CABAC) in which not just the probability model is adaptive (based on local context), but also the context model itself is adaptive. In this sense, it is meta-adaptive.
The symbol encoding itself corresponds to the Near-zero symbol coding method described above. The chance table (the chances for ZERO, SIGN, EXP(p,s) and MANT(p)) depends on the local context; these chances adapt each time a symbol is encoded or decoded.
The local context is described as a vector of integer numbers, which are called properties. To determine which chance table to use, a decision tree is used. This tree is called a MANIAC tree. The full structure of the tree is known in advance by the decoder (it is encoded in part 4 of the bitstream, see below). However, at decode time, the tree is initially only partially used (only the root node at the beginning), and it 'grows' to eventually become the full tree.
The inner nodes of a MANIAC tree contain a test of the form property[k] > value. If this test evaluates to true, then the left branch is taken, otherwise the right branch is taken. Eventually a leaf node is reached.
The leaf nodes of a MANIAC tree contain a counter and a chance table. This chance table is used in the Nearzero symbol coding. The counter gets decremented each time the leaf node is reached. When it reaches zero, the tree 'grows' and the leaf node becomes an inner node (a decision node). Its chance table gets duplicated and becomes the chance table of its two branch nodes.
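The tree walk described above can be sketched as follows; the array-based node layout is purely illustrative and is not the bitstream representation of the tree:

```c
/* Walk an array-encoded MANIAC tree: inner node i tests
   properties[prop[i]] > value[i]; on true go to left[i], else right[i].
   prop[i] == -1 marks a leaf; the leaf's index identifies its chance table. */
int maniac_lookup(const int *prop, const int *value,
                  const int *left, const int *right,
                  int node, const int *properties) {
    while (prop[node] >= 0)
        node = (properties[prop[node]] > value[node]) ? left[node] : right[node];
    return node;
}
```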
Part 4: pixel data
In this final part of the FLIF bitstream, the actual pixel values are encoded. There are two methods (or modes) of pixel encoding: non-interlaced and interlaced. The main header indicates which method is used (see Part 1 above).
Non-interlaced method
If this encode method is used, then we start immediately with the encoding of the MANIAC trees (see below), followed by the encoding of the pixels. The order in which the pixels are encoded is described by the following nested loops:

For all channels c : (in the order 4,3,0,1,2, skipping those that don’t exist or have a singleton range)
    For all rows y, i.e. for y from 0 to height-1 :
        For all frames f, i.e. for f from 0 to nb_frames-1 :
            For all columns x, i.e. for x from 0 to width-1 :
                Encode the pixel at location (x,y) in channel c of frame f
How the pixel is actually encoded is described below.
Interlaced method
For interlacing, we define the notion of zoomlevels. Zoomlevel 0 is the full image. Zoomlevel 1 are all the even-numbered rows of the image (counting from 0). Zoomlevel 2 are all the even-numbered columns of zoomlevel 1. In general: zoomlevel 2k+1 are all the even-numbered rows of zoomlevel 2k, and zoomlevel 2k+2 are all the even-numbered columns of zoomlevel 2k+1.
In other words, every even-numbered zoomlevel 2k is a downsampled version of the image, at scale 1:2^k.
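Under this definition, the dimensions of a zoomlevel follow directly (a sketch; the function names are illustrative):

```c
/* Number of rows/columns of zoomlevel z of a height x width image:
   odd levels halve the rows, even levels (other than 0) halve the columns. */
int zoomlevel_rows(int height, int z) { return ((height - 1) >> ((z + 1) / 2)) + 1; }
int zoomlevel_cols(int width,  int z) { return ((width  - 1) >> (z / 2)) + 1; }
```

For example, a 5x5 image has 3 rows and 5 columns at zoomlevel 1, and 3 rows and 3 columns at zoomlevel 2.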
We define the 'maximum zoomlevel' max_zl of an image as the zoomlevel with the lowest number that consists of a single pixel. This is always the pixel in the top-left corner of the image (row 0, column 0). This pixel is always encoded first, as a special case.
The zoomlevels are encoded from highest (most zoomed out) to lowest; in each zoomlevel, obviously only those pixels are encoded that haven’t been encoded previously. So in an even-numbered zoomlevel, the odd-numbered rows are encoded, while in an odd-numbered zoomlevel, the odd-numbered columns are encoded.
If the interlaced encode method is used, we do not encode the MANIAC trees right away. Instead, we initialize the trees to a single root node per channel, and start encoding a 'rough preview' of the image (a few of the highest zoomlevels). This allows a rough thumbnail extraction without needing to decode the MANIAC tree. Then the MANIAC tree is encoded, and then the rest of the zoomlevels are encoded.
Type | Description | Condition
uni_int(0,max_zl) | Number of the first MANIAC-encoded zoomlevel: first_zl |
uni_int(range(c).min, range(c).max) | pixel value of the top-left (0,0) pixel of channel c | repeat(c in 0..nb_channels-1)
Encode_zoomlevels(max_zl-1, first_zl+1) | encoding of zoomlevels max_zl-1 until first_zl+1 |
variable | encoding of MANIAC trees | see further below
Encode_zoomlevels(first_zl, 0) | encoding of zoomlevels first_zl until 0 |
The encoding of a series of zoomlevels happens by interleaving the channels in some way. This interleaving is either in the 'default order', or in a custom order. In any case, the following invariants must hold:

Zoomlevel k of a channel can only be encoded after zoomlevel k+1 of that channel has already been encoded;

If channel 3 exists and alpha_zero is true, then zoomlevel k of channel 0 (usually Luma) can only be encoded after zoomlevel k of channel 3 (Alpha) has already been encoded;

Zoomlevel k of channel 1 (usually Co) can only be encoded after zoomlevel k of channel 0 (usually Luma) has been encoded;

Zoomlevel k of channel 2 (usually Cg) can only be encoded after zoomlevel k of channels 0 and 1 (Luma and Co) have been encoded;

If channel 4 (FrameLookback) exists: zoomlevel k of any other channel (0,1,2, or 3) can only be encoded after zoomlevel k of channel 4 has already been encoded.
TODO: describe default order
Encode_zoomlevels(h,l)
There are three different pixel predictors in interlaced mode (numbered 0, 1, 2). For each channel, either the same predictor is used in all zoomlevels h to l (in that case the predictor number is encoded only once), or a different predictor can be picked for each zoomlevel (in that case the number -1 is encoded, and the actual predictor number is encoded at the beginning of each zoomlevel).
Type | Description | Condition
uni_int(0,1) | Boolean: default_order |
uni_int(-1,2) | Pixel predictors pred[channel] | repeat(nb_channels)
Repeat nb_channels * (h-l+1) times: (once for every channel/zoomlevel in the range from h to l)
Type | Description | Condition | Default value
uni_int(0,nb_channels-1) | Channel c to be encoded | not default_order | given by default order
| Zoomlevel z is implicit | |
uni_int(0,2) | Pixel predictor p to use | pred[c] == -1 | pred[c]
Encode_zl(c,z,p) | Encoding of the next zoomlevel of channel c | range(c).min < range(c).max |
The encoding of a single zoomlevel/channel combination is defined as follows:
Encode_zl(c,z,p) :

If z is even (a 'horizontal' zoomlevel):

  For all odd rows y of zoomlevel z, i.e. for y from 1 to (height-1)/(2^(z/2)) in steps of 2:

    For all frames f, i.e. for f from 0 to nb_frames-1:

      For all columns x of zoomlevel z, i.e. for x from 0 to (width-1)/(2^(z/2)):

        Encode the pixel at location (x,y) in zoomlevel z of channel c of frame f, using predictor p

Else (if z is odd, that is, a 'vertical' zoomlevel):

  For all rows y of zoomlevel z, i.e. for y from 0 to (height-1)/(2^((z+1)/2)):

    For all frames f, i.e. for f from 0 to nb_frames-1:

      For all odd columns x of zoomlevel z, i.e. for x from 1 to (width-1)/(2^(z/2)) in steps of 2:

        Encode the pixel at location (x,y) in zoomlevel z of channel c of frame f, using predictor p
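The traversal order can be summarized as a coordinate generator. The following Python sketch (names are ours; the pixel encoding itself is elided) yields (frame, row, column) triples in the order the pixels of a zoomlevel are visited, using integer division as in the formulas above:

```python
def zoomlevel_coords(z, width, height, nb_frames):
    """Yield (f, y, x) in the order Encode_zl visits pixels of zoomlevel z."""
    if z % 2 == 0:                      # horizontal zoomlevel: new odd rows
        for y in range(1, (height - 1) // 2**(z // 2) + 1, 2):
            for f in range(nb_frames):
                for x in range((width - 1) // 2**(z // 2) + 1):
                    yield f, y, x
    else:                               # vertical zoomlevel: new odd columns
        for y in range((height - 1) // 2**((z + 1) // 2) + 1):
            for f in range(nb_frames):
                for x in range(1, (width - 1) // 2**(z // 2) + 1, 2):
                    yield f, y, x
```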
Pixel encoding
The overall encoding mechanism of a pixel value v is as follows:

The value is predicted, resulting in a predicted value guess

The range of possible values is computed as accurately as possible using crange(c,…), resulting in min and max

guess is adjusted to be a valid value in this range using snap(c,…)


The local context of the pixel is computed as a MANIAC property vector pvec

The number that actually gets encoded is the difference between the actual and predicted value: v - guess

If the prediction is any good, these numbers tend to be close to zero

To decode the value: nz_int_MANIAC_{c,pvec}(min-guess, max-guess) + guess

The above mechanism is the same in interlaced and noninterlaced mode and for all channels; however, the predictor (guess) and the layout of the property vector (pvec) depend on the mode and the channel.
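As an illustration of this round trip, here is a Python sketch in which the default snap is modeled as a simple clamp (transforms such as ColorBuckets install a more involved snap) and the MANIAC coder itself is elided; all names are ours:

```python
def snap(guess, vmin, vmax):
    """Default snap: clamp the prediction into the valid range [vmin, vmax]."""
    return min(max(guess, vmin), vmax)

def encode_pixel(v, raw_guess, vmin, vmax):
    """Return (residual, guess): the residual is the symbol handed to the
    MANIAC coder, constrained to [vmin - guess, vmax - guess]."""
    guess = snap(raw_guess, vmin, vmax)
    return v - guess, guess

def decode_pixel(residual, raw_guess, vmin, vmax):
    """Inverse: nz_int_MANIAC(min - guess, max - guess) + guess."""
    guess = snap(raw_guess, vmin, vmax)
    return residual + guess
```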
Skipped pixels
Some pixels are not encoded at all. There are three such cases:

If the alpha_zero flag is true, then 'invisible pixels' (pixels with Alpha value zero) semantically have undefined RGB values. So if the pixel value in channel 3 (Alpha) is equal to 0, then the pixel values in channels 0, 1, 2 (RGB, or e.g. YCoCg after transformations) are not encoded. However, the decoder does need to set the values of these pixels to something, since those 'invisible' values might very well be used in predictors or properties of neighboring visible pixels. It sets these values simply to the predicted value. In interlaced mode, the predictor to use for that is called the 'invisible pixel predictor' (see part 3 above).

In animations which use the FrameShape transformation, from the second frame onwards, all rows have a 'begin' and 'end' column. Pixels before the begin and after the end are not encoded; their value is equal to the corresponding pixel value from the previous frame.

In animations which use the FrameLookback transformation, channel 4 (Lookback) refers to past frames. If the value in channel 4 is not zero but some number k > 0, then all other channels are not encoded for that pixel. The pixel value for channels 0,1,2,3 is equal to the corresponding pixel value from k frames ago.
As an optimization, it is safe to simply skip pixels from a constant channel, that is, a channel which has a singleton range. Technically, one could argue that they are actually encoded, but for each pixel value v, we have v = min = max = guess, which means the symbol encoding doesn’t use any bits. 
Pixel predictors
The different pixel predictors are described below. In each case, the actual predicted value is the value returned by the snap function. Usually the snap function simply returns its input value, but not always. Noteworthy exceptions are Predictor 1 of interlaced mode, which as defined below could even be outside the channel range, and any predictor in combination with the ColorBuckets transformation, which can have a nontrivial snap function.
Noninterlaced mode
Assume we have to predict pixel X in channel c given its previously decoded neighbors:
  TL  T
  L   X
The predicted value for X is the median of the following three values:

L + T - TL

L

T
Border conditions:

If X is at column 0 (so it has no left neighbor L), then we set L = range(c).min, except when alpha_zero is true, c < 3 and the alpha value at the position of pixel X is zero; in that case we set L = (range(c).min+range(c).max)/2.

If X is at row 0 (so it has no top neighbor T), then we set T = TL = L.

If X is at row > 0 and column 0, then TL = T.
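Putting the median predictor and its border rules together gives the following Python sketch (names are ours; `img[y][x]` holds the already-decoded values of the channel, and `invisible` marks an alpha-zero pixel whose left border value uses the range midpoint):

```python
def median3(a, b, c):
    return sorted((a, b, c))[1]

def predict_noninterlaced(img, x, y, rmin, rmax, invisible=False):
    """Median predictor for noninterlaced mode, with border substitutions."""
    if x > 0:
        L = img[y][x - 1]
    elif invisible:
        L = int((rmin + rmax) / 2)  # the spec's '/' truncates towards zero
    else:
        L = rmin
    T = img[y - 1][x] if y > 0 else L            # no top neighbor: T = L
    TL = img[y - 1][x - 1] if (x > 0 and y > 0) else (T if y > 0 else L)
    return median3(L + T - TL, L, T)
```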
Interlaced mode
In interlaced mode, there are three different predictors. Moreover, the predictors are slightly different in even-numbered zoomlevels compared to odd-numbered zoomlevels, since slightly different neighboring pixels are available (already decoded): in horizontal zoomlevels the pixel below the current pixel is already known (since it is part of the previously decoded zoomlevel), while in vertical zoomlevels the pixel to the right of the current pixel is already known.
Horizontal zoomlevel (even-numbered)
Assume we have to predict pixel X in channel c given its previously decoded neighbors.
These neighbors are defined with respect to the current zoomlevel. In the full image, these 'neighbors' are not directly adjacent (except in zoomlevel 0). 
      TT
  TL  T   TR
  LL  L   X
  BL  B   BR
The following predictors are defined:

Predictor 0: (T + B)>>1

Predictor 1: the median of the following three values: (T + B)>>1, L + T - TL, L + B - BL

Predictor 2: the median of T, B and L
Border conditions:

If X is at column 0 (so it has no left neighbor L), then we set L = TL = BL = T

If X is at the rightmost column, then we set TR = T.

If X is at the last row, then BL = B = L.

If X is at the rightmost column or last row, then BR = B.
Vertical zoomlevel (odd-numbered)
Assume we have to predict pixel X in channel c given its previously decoded neighbors.
      TT
  TL  T   TR
  LL  L   X   R
  BL      BR
The following predictors are defined:

Predictor 0 (P_{0}): (L + R)>>1

Predictor 1 (P_{1}): the median of the following three values: (L + R)>>1, L + T - TL, R + T - TR

Predictor 2 (P_{2}): the median of T, L and R
Border conditions:

If X is at row 0 (so it has no top neighbor T), then we set T = TL = TR = L

If X is at the rightmost column, then we set TR = R = T.

If X is at the last row, then BL = L.

If X is at the rightmost column or last row, then BR = R.
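Both predictor families can be sketched in a single Python function (names are ours; the caller is assumed to have already applied the border substitutions listed above):

```python
def median3(a, b, c):
    return sorted((a, b, c))[1]

def predict_interlaced(p, z, T, B, L, R, TL, TR, BL):
    """Interlaced predictors 0-2; Python's >> is arithmetic, as in the spec."""
    if z % 2 == 0:                       # horizontal zoomlevel: B is known
        avg = (T + B) >> 1
        if p == 0:
            return avg
        if p == 1:
            return median3(avg, L + T - TL, L + B - BL)
        return median3(T, B, L)
    else:                                # vertical zoomlevel: R is known
        avg = (L + R) >> 1
        if p == 0:
            return avg
        if p == 1:
            return median3(avg, L + T - TL, R + T - TR)
        return median3(T, L, R)
```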
MANIAC properties
The MANIAC property vector pvec which is used to determine the context (i.e. the MANIAC tree leaf node) is constructed in the following way. Assume the pixel currently being decoded is pixel X in channel number c and the pixel values of previously decoded neighboring pixels (of the same channel) are named as in the diagram below; it depends on the interlacing mode and zoomlevel which of these are known at decode time.
      TT
  TL  T   TR
  LL  L   X   R
  BL  B
Moreover, we use X_{pc} to denote the pixel value at the position of X in a previously decoded channel pc, and P to denote the predicted value for X (as described above).
Finally, we define the range maxdiff(c) as the range of a difference between two pixel values in channel c, that is: maxdiff(c).min = range(c).min - range(c).max and maxdiff(c).max = range(c).max - range(c).min.
Noninterlaced mode
In noninterlaced mode, the following MANIAC properties are defined for a pixel in channel c:
| Property | Description | Condition | Property range (prange) |
| --- | --- | --- | --- |
| X_{0} | Luma / channel 0 value | 0 < c < 3 | range(0) |
| X_{1} | Co / channel 1 value | 1 < c < 3 | range(1) |
| X_{3} | Alpha value | c < 3 < nb_channels | range(3) |
| P | Predicted value snap(c,…,median(L+T-TL,L,T)) | | range(c) |
| M_{ni} | Median-index (see below) | | 0..2 |
| L - TL | Left - Top-left | | maxdiff(c) |
| TL - T | Top-left - Top | | maxdiff(c) |
| T - TR | Top - Top-right | | maxdiff(c) |
| TT - T | Top-top - Top | | maxdiff(c) |
| LL - L | Left-left - Left | | maxdiff(c) |
For images without an alpha channel, channel 0 has 7 properties (numbered 0 to 6), channel 1 has 8 properties, and channel 2 has 9 properties. For images with an alpha channel, channel 0 has 8 properties, channel 1 has 9 properties, channel 2 has 10 properties, and channel 3 has 7 properties. In animations with FrameLookback, channel 4 has 7 properties. 
Border conditions: if, for one of the last five properties (which are differences between two neighboring pixels), one of those pixels does not exist, the property is defined to be zero.
The median-index M_{ni} indicates which of the three input values for the median predictor became the prediction; more precisely, it is defined as follows:

M_{ni} =

| Value | Condition |
| --- | --- |
| 0 | if P = L+T-TL |
| 1 | if P = L (and the above condition does not hold) |
| 2 | if P = T (and neither of the above conditions holds) |
| 0 | otherwise |
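A direct Python transcription of this definition (the function name is ours):

```python
def median_index_noninterlaced(P, L, T, TL):
    """M_ni: which median input produced the prediction P (snap may have
    moved P away from all three inputs, in which case the index is 0)."""
    if P == L + T - TL:
        return 0
    if P == L:
        return 1
    if P == T:
        return 2
    return 0
```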
Interlaced mode
In interlaced mode, the following MANIAC properties are defined for a pixel in zoomlevel z of channel c using predictor p:
| Property | Description | Condition | Property range (prange) |
| --- | --- | --- | --- |
| X_{0} | Luma / channel 0 value | 0 < c < 3 | range(0) |
| X_{1} | Co / channel 1 value | 1 < c < 3 | range(1) |
| X_{3} | Alpha value | c < 3 < nb_channels | range(3) |
| M_{i} | Median-index (see below) | | 0..2 |
| X_{0} - (T_{0}+B_{0})>>1 | Luma 'prediction miss' | 0 < c < 3 and z is even | maxdiff(0) |
| X_{0} - (L_{0}+R_{0})>>1 | Luma 'prediction miss' | 0 < c < 3 and z is odd | maxdiff(0) |
| T - B | Top - Bottom | z is even | maxdiff(c) |
| T - (TL + TR)>>1 | Top 'prediction miss' | z is even | maxdiff(c) |
| L - (BL + TL)>>1 | Left 'prediction miss' | z is even | maxdiff(c) |
| B - (BL + BR)>>1 | Bottom 'prediction miss' | z is even | maxdiff(c) |
| L - R | Left - Right | z is odd | maxdiff(c) |
| L - (BL + TL)>>1 | Left 'prediction miss' | z is odd | maxdiff(c) |
| T - (TL + TR)>>1 | Top 'prediction miss' | z is odd | maxdiff(c) |
| R - (BR + TR)>>1 | Right 'prediction miss' | z is odd | maxdiff(c) |
| P_{p} | Predicted value (see above) | | range(c) |
| TT - T | Top-top - Top | c is not 2 | maxdiff(c) |
| LL - L | Left-left - Left | c is not 2 | maxdiff(c) |
For images without an alpha channel, channel 0 has 8 properties (numbered 0 to 7), channel 1 has 10 properties, and channel 2 has 9 properties. For images with an alpha channel, channel 0 has 9 properties, channel 1 has 11 properties, channel 2 has 10 properties, and channel 3 has 8 properties. In animations with FrameLookback, channel 4 has 8 properties. 
Border conditions: If TL, T, TR, L, BL, B, BR or R are missing, they are defined as described in the interlaced pixel predictor definition above. If TT or LL are missing, the property they occur in gets value zero. If B_{0} is missing, B_{0} = T_{0}; if R_{0} is missing, R_{0} = L_{0}.
The median-index M_{i} indicates which of the three input values for the median in Predictor 1 is the median; more precisely, it is defined as follows:

M_{i} =

| Value | Condition |
| --- | --- |
| 0 | if z is even and P_{1} = (T + B)>>1 |
| 0 | if z is odd and P_{1} = (L + R)>>1 |
| 1 | if P_{1} = L + T - TL (and the above conditions do not hold) |
| 2 | otherwise |
MANIAC tree encoding
There is one tree per non-trivial channel (a channel is trivial if its range is a singleton or if it doesn’t exist). The trees are encoded one after another, in a recursive (depth-first) way, as follows:
nb_properties depends on the channel, the number of channels, and the encoding method (interlaced or noninterlaced), as specified above. The range of each property is maintained during tree traversal. The initial property ranges prange[property] are defined above; these are narrowed down when going to a deeper node in the tree.
For each channel c, three different contexts are used: we’ll just call them A_{c}, B_{c} and C_{c}.
| Type | Description | Condition |
| --- | --- | --- |
| nz_int_A_{c}(0,nb_properties) | 0 = leaf node, > 0: property+1 | |
| nz_int_B_{c}(1,512) | node counter | not a leaf node |
| nz_int_C_{c}(prange[property].min, prange[property].max-1) | test_value | not a leaf node |
| recursive encoding of left branch | where prange[property].min = test_value+1 | not a leaf node |
| recursive encoding of right branch | where prange[property].max = test_value | not a leaf node |
From this description, the MANIAC trees can be reconstructed. Leaf nodes have a counter value that is effectively infinity (they can never be turned into a decision node).
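A decoder-side sketch of this recursion in Python, with the three entropy contexts abstracted as a flat list of already-decoded symbols (all names and the dict representation are ours):

```python
def decode_maniac_tree(symbols, prange):
    """Rebuild one MANIAC tree. `symbols` stands in for the decoded outputs
    of contexts A, B, C in stream order; `prange` is a list of (min, max)
    property ranges, narrowed as we descend."""
    prop_plus1 = symbols.pop(0)                 # nz_int_A(0, nb_properties)
    if prop_plus1 == 0:
        return {'leaf': True, 'count': float('inf')}  # never becomes a node
    prop = prop_plus1 - 1
    count = symbols.pop(0)                      # nz_int_B(1, 512)
    lo, hi = prange[prop]
    test = symbols.pop(0)                       # nz_int_C(lo, hi - 1)
    left_range = prange.copy()
    left_range[prop] = (test + 1, hi)           # left branch: property > test
    right_range = prange.copy()
    right_range[prop] = (lo, test)              # right branch: property <= test
    return {'leaf': False, 'property': prop, 'count': count,
            'test_value': test,
            'left': decode_maniac_tree(symbols, left_range),
            'right': decode_maniac_tree(symbols, right_range)}
```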
Checksum
At the very end of the bitstream, there is an optional 32-bit checksum to verify the integrity of the file.
| Type | Description | Condition |
| --- | --- | --- |
| uni_int(1) | Boolean: have_checksum | |
| uni_int(16) | Most significant 16 bits of checksum | have_checksum |
| uni_int(16) | Least significant 16 bits of checksum | have_checksum |
This checksum is a standard CRC32, computed from the original (or decoded) pixel data in the following way:
The CRC is initialized to the value (width << 16) + height, interpreted as an unsigned 32-bit number. Then the pixel values of all channels are processed, in scanline order (top to bottom, left to right), where each value is represented as a little-endian integer in 1, 2 or 4 bytes, depending on the bit depth of the image. The following sizes are used:
| Channel | All channels use at most 8 bits per pixel | At least one channel uses 9 or more bits per pixel |
| --- | --- | --- |
| 0 | 1 byte | 2 bytes |
| 1 | 2 bytes | 4 bytes |
| 2 | 2 bytes | 4 bytes |
| 3 | 1 byte | 2 bytes |
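The checksum computation can be sketched with Python's zlib. Two assumptions in this sketch are ours: that zlib.crc32's running-value parameter matches the spec's 'initialized to (width << 16) + height', and that channels are processed one after the other, each in scanline order:

```python
import struct
import zlib

def flif_checksum(channels, width, height, bytes_per_value):
    """CRC32 of the decoded pixel data (sketch). `channels` is a list of
    flat, row-major value lists; `bytes_per_value` gives 1, 2 or 4 per
    channel, per the size table above."""
    crc = (width << 16) + height            # assumed initial running value
    fmt = {1: '<B', 2: '<H', 4: '<I'}       # little-endian, per the spec
    for values, size in zip(channels, bytes_per_value):
        for v in values:
            crc = zlib.crc32(struct.pack(fmt[size], v), crc)
    return crc & 0xFFFFFFFF
```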