

Code Optimizing, guru wanted for lossless decoding
Sir_Lagsalot
Posted: Mar 10 2005, 10:03 PM


Unregistered









I figured this would be a good place to find coders who enjoy working on video-related stuff and have experience with low-level optimization.
I've written a lossless codec (http://lags.leetcode.net/codec.html); it's stable, offers superior compression compared to most other lossless codecs, and is fairly fast at encoding. Its Achilles' heel, though, is its slow decoding speed...
The following code snippet undoes the range coding, and accounts for about 60-80 percent of total execution time. I'm hoping someone will be able to spot some optimizations I've overlooked, or even be willing to help put it into assembly:

#define inbyte() ((unsigned int)(*in++))

// Decompress a byte stream that has had range coding applied to it.
// The frequency table "prob_ranges" will have previously been set up using
// the readProb function.

void CompressClass::decode( const unsigned char * in, unsigned char * out, unsigned int length ){
    unsigned int buffer = inbyte();
    unsigned int low = buffer >> 1;
    buffer &= 1;
    unsigned int range = 0x80;
    const unsigned char * ending = out + length;
    const unsigned int rangetop = prob_ranges[255];
    const unsigned int * const p_ranges = prob_ranges;
    const unsigned int shift = scale;

    do {
        // This loop executes at most 3 times; using a switch or nested
        // conditionals offered no noticeable change in speed.
        while ( range <= 0x00800000 ){
            low += low;
            low |= buffer & 1;
            low <<= 7;
            buffer = inbyte();
            low |= ( buffer >> 1 );
            range <<= 8;
        }

        unsigned int help = range >> shift;
        unsigned int tmp = low / help;
        if ( tmp < rangetop ){
            // determine the symbol from the probability (binary search)
            int x = 0;
            if ( tmp >= p_ranges[128] )
                x = 128;
            if ( tmp >= p_ranges[x+64] )
                x += 64;
            if ( tmp >= p_ranges[x+32] )
                x += 32;
            if ( tmp >= p_ranges[x+16] )
                x += 16;
            if ( tmp >= p_ranges[x+8] )
                x += 8;
            if ( tmp >= p_ranges[x+4] )
                x += 4;
            if ( tmp >= p_ranges[x+2] )
                x += 2;
            if ( tmp >= p_ranges[x+1] )
                x += 1;

            *out++ = x;    // output the decoded value

            tmp = help * p_ranges[x];
            low -= tmp;
            range = help * ( p_ranges[x+1] - p_ranges[x] );
        } else {
            // if tmp >= p_ranges[255], then x has to equal 255
            *out++ = 255;
            tmp = help * rangetop;
            low -= tmp;
            range -= tmp;
        }
    } while ( out < ending );
}

This was written based on the decoder implementation described at http://www.arturocampos.com/ac_range.html, which should give a good overall picture of what is going on. The code will only be running on 486 or better machines, and is compiled with the Intel 8.0 compiler. Processor-specific optimizations are fine if they offer a significant speed improvement. If you are interested and need more info, email me at brg0853 at rit.edu, or AIM me at 'slrlagsalot'.
 
phaeron
Posted: Mar 11 2005, 05:02 AM


Virtualdub Developer


Group: Administrator
Posts: 7773
Member No.: 61
Joined: 30-July 02



A profiler will tell you for sure, but you have two glaring targets in your source code: the range divide, and the binary search on the interval list.

The latter is the easier of the two to attack. I'd suggest partitioning the code space into 256 parts: have a pointer to the starting symbol of each part and search linearly upward from there until you find the correct code. Don't worry about the linear search; even in the worst case, where 255 symbols land in one bin, you will only hit that bin 1/256th of the time, since the entropy coding ensures that the code-space distribution is roughly even.

The divide is pretty bad since it's a ~30+ clock stall on the critical path, but it's also much harder to deal with, particularly since I'm not sure what the range or precision requirements for that divide are. If the number of divisors is small enough or your precision requirements are lax enough, I'd build a table of reciprocals and do a scaled multiply. I did this once in a normal-map computation routine to good benefit (1Kx1K bump map to normals in about 9ms), but you have to be very careful about accuracy. You could try building a table of reciprocals for the intervals and maintaining 1/range as well, but again you'd have to be really careful about controlling accumulated error.
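To make the table-of-reciprocals idea concrete, here is a minimal C++ sketch: one fixed-point reciprocal per possible divisor, with the division replaced by a 32x32->64 multiply and a shift. The names (`RecipTable`, `div`) and the ceil(2^32/d) scaling are my own illustrative choices, not anything from the codec:

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch only: a table of fixed-point reciprocals, one per
// possible divisor, replacing n / d with a 32x32->64 multiply and a shift.
struct RecipTable {
    std::vector<uint64_t> recip;                        // recip[d] == ceil(2^32 / d)
    explicit RecipTable( uint32_t max_d ) : recip( max_d + 1 ) {
        for ( uint32_t d = 1; d <= max_d; ++d )
            recip[d] = (( uint64_t(1) << 32 ) + d - 1 ) / d;
    }
    // Approximates n / d; exact for small operands, but can come out one
    // low for large dividends -- exactly the failure mode that comes up
    // later in the thread.
    uint32_t div( uint32_t n, uint32_t d ) const {
        return uint32_t(( uint64_t(n) * recip[d] ) >> 32 );
    }
};
```

As the later posts show, a table like this can still be off by one for large dividends, so it needs either more precision or a fix-up step before it can replace an exact divide.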
 
Sir_Lagsalot
Posted: Mar 11 2005, 06:55 PM


I'm not exactly sure what you are suggesting for replacing the binary search; doing a vanilla linear search ( for ( ; tmp >= p_ranges[x]; x++ ); ) increases the execution time of the routine by about half, and even unrolling that doesn't strike me as being anywhere near as efficient as the binary search. Another point about the search is that the byte distributions are nowhere near even; the vast majority of the values will be tightly clustered around either 0 or 0xff.

A reciprocal multiply instead of a divide could be useful; the value of help will typically have a maximal value between 0x800 and 0x1000. However, the resulting value will need to exactly match the value division would have given, as being off by 1 can crash the codec or completely corrupt the resulting video frame.
 
phaeron
Posted: Mar 12 2005, 05:07 AM


The purpose of an entropy coder, like a Huffman or arithmetic coder, is to compress data by removing imbalances in the output. Let's say you had the following prefix code generated by a Huffman tree:
CODE

A -> 0 (occurs 1/2 of the time)
B -> 10 (occurs 1/4th of the time)
C -> 11 (occurs 1/4th of the time)

The distributions are powers of two, so this Huffman tree is optimal. Now, if you were to use a two-bit lookup table to decode data compressed with that code:
CODE

00 -> A
01 -> A
10 -> B
11 -> C

...you'd notice that the distribution of symbols in that table matches the probabilities. More importantly, this means that all entries in the table are likely to be hit with roughly equal probability. Think about it -- if they weren't, you could use a second arithmetic coder to take advantage of the imbalance! In other words, your byte values had better not be clustered around 00 or FF, or it means your entropy coder isn't doing its job.
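The two tables above can be turned into a tiny runnable decoder sketch (my own illustrative code): peek two bits, emit the symbol from the table, then consume only that symbol's code length.

```cpp
#include <cstdint>
#include <string>

// Illustrative version of the 2-bit lookup decode above.
struct Entry { char sym; int len; };
const Entry lut[4] = { {'A',1}, {'A',1}, {'B',2}, {'C',2} };

std::string decode_bits( uint32_t bits, int nbits ){
    std::string out;
    int pos = nbits;
    while ( pos >= 2 ){
        uint32_t idx = ( bits >> ( pos - 2 )) & 3;   // peek the next two bits
        out += lut[idx].sym;
        pos -= lut[idx].len;                         // consume only the code length
    }
    return out;
}
```

Feeding it the bit string 0 10 11 (value 0x0B, 5 bits) yields "ABC". A real decoder pads the stream so a full two-bit peek is always possible at the tail.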

As a side note, this means that if you have a situation where all symbols have the same probability of occurring, neither Huffman nor range nor arithmetic coding is going to do squat.

What I was suggesting was to take the high 8 bits of your fraction and use that to split the code space into 256 bins. Each bin then points to the starting symbol for that interval. It'd work something like this:

CODE

unsigned int tmp2 = tmp >> (fraction_bits - 8);

for( x = p_bin_start[tmp2]; tmp >= p_ranges[x+1]; ++x )
 ;


(This is a common trick for decoding Huffman or VLC codes.)

Since the probability over [0,1] is uniform, this means that each of the bins will be hit 1/256th of the time. You are correct in noting that the linear search becomes worse performance-wise as there are more symbols in a bin -- but as symbols cluster into fewer bins, they're less likely to be hit overall.
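For completeness, here is one way the p_bin_start table itself might be built. This is a sketch under assumed conventions (p_ranges as a nondecreasing cumulative table with num_symbols+1 entries, fraction_bits >= 8); the function name is illustrative:

```cpp
#include <cstdint>

// Build the 256-entry bin table: each bin points at the first symbol whose
// probability interval could contain a fraction value mapping to that bin.
void build_bin_start( const uint32_t *p_ranges, int num_symbols,
                      int fraction_bits, uint8_t *p_bin_start ){
    int x = 0;
    for ( int bin = 0; bin < 256; ++bin ){
        // lowest fraction value that maps to this bin, in p_ranges' scale
        uint32_t lo = uint32_t(bin) << ( fraction_bits - 8 );
        // advance to the symbol whose interval contains lo
        while ( x + 1 < num_symbols && p_ranges[x + 1] <= lo )
            ++x;
        p_bin_start[bin] = uint8_t(x);
    }
}
```

Since symbols only ever advance, the whole table is built in one linear pass over p_ranges, and it only needs rebuilding when the frequency table changes.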

When I wrote my Motion JPEG decoder, I first used a dumb method that simply shifted bits into an accumulator and tested each of the length bins in a loop. It turned out to not be that bad, because the more frequent symbols were in the shorter bins, and thus the loop ran faster for the more common symbols.

As for the divide, if your value of help really is limited to that small of a range, the table should be small enough not to blow the cache too much. My normal map conversion routine used a 64K table! You will probably need to use a 32x32->64 multiply to get the rounding correct. AMD's Athlon optimization manual will show you how to construct an expression ((x*a+b)>>32) that computes a divide by a constant with correct rounding.
 
Sir_Lagsalot
Posted: Mar 12 2005, 07:29 AM


I think you're missing something here; the binary search is _decoding_ a symbol based on the probability range that it falls into - the result of the binary search is the original input symbol. This should decidedly not be balanced; if it were, there would be no point in attempting to compress in the first place, since all bytes would have equal probability of occurring and thus need 8 bits.
For DVD quality video (which is what I am trying to optimize for), the value of shift is almost always between 16 and 19, meaning 'help' should have at most 16 significant bits in the majority of cases. I've found a good explanation of how to implement the reciprocal multiply, so that part is going well.
 
phaeron
Posted: Mar 12 2005, 09:06 AM


Oops, I did misinterpret slightly; yes, the uneven distribution in your input samples will cause the branches in the binary search to be skewed. Nevertheless, I still think you can do better than eight conditionals using binning. That entire binary search, despite a possibly high branch prediction rate, can't be parallelized by the CPU at all, and is also on the critical path.
 
Sir_Lagsalot
Posted: Mar 15 2005, 05:52 AM


w00t, using the binning you suggested managed to improve decoding speed by about 17% overall.
With the multiply by the reciprocal, I occasionally have values that are one less than the correct value ( i.e. x*y' sometimes equals x/y - 1 ). Am I correct in assuming that this is an inherent inaccuracy of the algorithm? If so, this may still be useful, since I can simply check whether adding one would cause the value to cross a range boundary, and do an actual division in the rare case where it does.
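The guess-then-verify idea can be sketched like this. Note this variant repairs the quotient with a single multiply-compare rather than the fallback division described above, which works if the result is off by at most one; `approx_div` is an illustrative name, and the truncated (floor) reciprocal is chosen deliberately so the off-by-one case actually occurs:

```cpp
#include <cstdint>

// Guess-and-fix-up: a truncated reciprocal can return q == n/d or
// q == n/d - 1; one extra multiply-compare repairs the low case.
uint32_t approx_div( uint32_t n, uint32_t d ){
    uint64_t recip = ( uint64_t(1) << 32 ) / d;           // floor(2^32 / d): may be low
    uint32_t q = uint32_t(( uint64_t(n) * recip ) >> 32 );
    if ( uint64_t(q + 1) * d <= n )                       // landed one below? bump it
        ++q;
    return q;
}
```

For example, with d = 7 the raw multiply gives 0 for n = 7 (the boundary case), and the compare bumps it back to the correct 1.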
 
phaeron
Posted: Mar 15 2005, 07:36 AM


You're going to be penalized on the P4 for the check, since a multiply on that CPU goes through the floating point unit (10+ clock latency), but maybe it won't be too bad if it's infrequent. I think the AMD algorithms are supposed to be exact, though. Sure you're not accidentally rounding off bits or something?
 
Sir_Lagsalot
Posted: Mar 15 2005, 05:38 PM


I never found the AMD algorithm, but I tried 2 other methods for generating the reciprocal and they both exhibited it. It makes sense to me that it would occasionally be under, since the basic idea as I understand it is that instead of a = b/c you have a = ( b * ((1<<r)/c) ) >> r, and (1<<r)/c is sometimes slightly less than the exact reciprocal due to it using integer math. As for the check, that would be performed after the initial lookup, so the value will already be available, and I guesstimate the division would only need to be done about 1/20000 of the time.
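A concrete instance of that underestimate, with illustrative values r = 16 and c = 7: the truncated reciprocal (1<<16)/7 = 9362 is slightly below the exact 65536/7, and right at an interval boundary the product comes out one below the true quotient.

```cpp
#include <cstdint>

// Demonstrates the off-by-one described above; illustrative helper, not
// codec code.
uint32_t truncated_recip_div( uint32_t b, uint32_t c, uint32_t r ){
    uint32_t recip = ( 1u << r ) / c;                    // truncated reciprocal
    return uint32_t(( uint64_t(b) * recip ) >> r );
}
// truncated_recip_div(7, 7, 16) yields 0, although 7/7 == 1:
// 7 * 9362 == 65534, which is just under 1 << 16.
```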
 
phaeron
Posted: Mar 16 2005, 04:45 AM


You should get better accuracy with a correction term, i.e. (a*b+c)/k rather than (a*b)/k. Basically, you're using a fixed-point reciprocal and rounding the result.

For the AMD algorithm:
http://www.amd.com/us-en/assets/content_ty..._docs/25112.PDF

You want section 8.8, Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants.
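One concrete correction term, as a sketch rather than the AMD derivation itself: adding the dividend back before the shift effectively turns the truncated reciprocal floor(2^r/c) into ceil(2^r/c), which removes the boundary off-by-one as long as the dividend is small enough relative to 2^r. Section 8.8 of the manual derives exact multiplier/shift pairs for full 32-bit operands; the function name here is illustrative.

```cpp
#include <cstdint>

// Correction-term variant: b*recip + b == b*(recip + 1), and recip + 1 is
// ceil(2^r/c) whenever c does not divide 2^r; when c does divide 2^r the
// extra b only adds b/2^r, which cannot change the floored result for
// b < 2^r.
uint32_t corrected_recip_div( uint32_t b, uint32_t c, uint32_t r ){
    uint32_t recip = ( 1u << r ) / c;                    // truncated reciprocal
    return uint32_t(( uint64_t(b) * recip + b ) >> r );
}
```

With r = 16 and c = 7 this fixes the boundary case from earlier in the thread (b = 7 now yields 1), though for large b relative to 2^r the error can reappear, which is why the exact bounds in the manual matter.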
 
dlandelle
Posted: Apr 23 2005, 10:55 AM


Unregistered









QUOTE (Sir_Lagsalot @ Mar 10 2005, 04:03 PM)
The following code snippet undoes the range coding, and accounts for about 60-80 percent of total execution time

What are the possible values in prob_ranges[] ?

I did some code optimization on image processing a few years ago, mostly working on cache behavior and branch prediction.

Forgive me if I write things you find obvious.

1) I liked LUTs (lookup tables with precalculated data).

2) Testing if(condition) versus if(!condition) is not the same, depending on the probability of the test result.

3) I noticed that it was generally faster to do unnecessary work (useless code) than to test for it... unless the balance is really heavy, as phaeron says.

4) I tried all the compiler options... sometimes optimizing for size is faster than optimizing for speed, because of cache size.

5) I was sometimes disappointed by assembly (even MMX) because the bottleneck was more often data access.

Nevertheless, in your case, where the data seems to be a small stream, that should not be the case.

Hope this helps a little.
 
10 replies since Mar 10 2005, 10:03 PM