Unofficial VirtualDub Support Forums > General Discussion > 32 Bit To 64 Bit Performance Impact. Impressions


Posted by: Thefumigator Feb 9 2005, 05:26 PM
Hi everyone!! I wanted to share my first test of the speed difference between 32-bit and 64-bit VirtualDub, running 32-bit and 64-bit XviD respectively. Here are the details:


TEST:

Recompression of an XviD file, length approx. 3:10.


Hardware:

ASUS K8V-X (Socket 754) + Athlon 64 3000+ + Kingston 512 MB DDR400, CAS latency 3
(a similar performance difference should be seen with an "EM64T-enabled" Intel processor)


Software:

VirtualDub 1.6.3 64-bit + XviD codec 64-bit (under Windows XP x64 RC)

vs.

VirtualDub 1.5.1 32-bit + XviD 32-bit (under Windows 2000, 32-bit)


Results:

Second pass full recompress:

Decoder configuration (32-bit, 64-bit):

[screenshots]


Motion configuration (32-bit, 64-bit):

[screenshots]

Profile (32-bit, 64-bit):

[screenshots]

Doubts:

- I don't know if closed GOV is significant for performance. I didn't have time to check; I only noticed it while looking at the screenshots.
- As far as I know, different versions of VirtualDub don't have a performance impact; at least, version 1.5.1 is no faster or slower than 1.6.
- It seems the 64-bit XviD I've got was compiled with Intel's 64-bit compiler, which makes programs run slower on AMD64 processors (quite logical). However, I don't know which compiler was used for the 64-bit VirtualDub, but as far as I know, Microsoft has a free preview of a 64-bit Visual Basic .NET 2005, so people who want to make some noise may want to download that.
- Deringing was enabled in the 64-bit test; I don't know if that has a performance impact.


Conclusions

The 64-bit test is 28 seconds faster than the 32-bit test. That's about 20% faster. Imagine a 10-hour processing job with XviD; under 64-bit that would take 8 hours. It's quite interesting considering that everything, including Windows, is still beta and hardly optimized for 64-bit, just compiled to run. I hope to see better 64-bit versions of VDub and XviD. Does anyone know if DivX is going to release a 64-bit version? It seems a 64-bit AviSynth is going to be released... Hope you like this test.


Posted by: Thefumigator Feb 9 2005, 06:10 PM
-I forgot to mention that the RC version I used is the latest available on the Microsoft website. cool.gif
-Where it says "Visual Basic .NET 2005" it should say "Visual Studio .Net 2005"
Note that it's 32 bit also. wink.gif

Posted by: stephanV Feb 9 2005, 06:56 PM
you used XviD 1.0.x for the 32-bit test and XviD 1.1 for the 64-bit one. It is known that there are speed-ups in XviD 1.1 compared with 1.0.3. Also, why didn't you use the same VirtualDub version for both tests? There are 32-bit builds of the programs you used with 64-bit. It would certainly straighten out the test. smile.gif

Posted by: Thefumigator Feb 9 2005, 07:13 PM
I'm not sure about the 64-bit XviD build; I got it from a forum I don't even remember, and it only came with two DLLs and an INF file... so I can't tell if it's v1.1, but if you know where to get something more official, tell me and I will redo the test.
As far as I know, VirtualDub 1.6.3 will not make XviD faster. But if XviD 1.1 is faster than 1.0.3, then I should update mine (XviD-1.0.3-20122004 _Final Release_). I didn't know there was a 1.1 version of XviD LOL. biggrin.gif

Posted by: Thefumigator Feb 9 2005, 07:38 PM
Uhmm, tested 32-bit VDub 1.6.3 with the latest Koepi's XviD; quite an improvement over the old XviD... ohmy.gif

[screenshot]

Well... the 32-bit version is still slower. But at least we can see the difference between the new and old XviD... sad.gif
Anyway, I still have doubts about the 64-bit version I've got. If someone knows of a compiled 64-bit XviD build (one that expressly says 1.1b), please tell me (a direct link would be good, thanks) rolleyes.gif

Posted by: stephanV Feb 9 2005, 07:42 PM
it is certainly 1.1, since VHQ for B-frames was first implemented in 1.1. Doesn't it say anything in the About section of the codec config screen?

[edit] I'm moving this to "general" as it doesn't really fit in the bug-report section, but it is still relevant for VirtualDub wink.gif

Posted by: Thefumigator Feb 9 2005, 08:00 PM
HEY!! I also tested old virtual dub with new Xvid:

[screenshot]

As you can see from the post above, I seem to get a slight performance boost (2:38 against 2:28) when using the new VDub 1.6.3.

I hoped for better performance in 64-bit... sad.gif
Many 32-bit programs out there run quite a bit slower than their 64-bit versions... trying to test 64-bit capabilities with VirtualDub 64 and XviD 64 was quite a mistake, since they are too new... I'll wait for the x86-64 asm code to be optimized for speed.

wink.gif

Posted by: stephanV Feb 9 2005, 08:23 PM
I must warn you though: the tests you're doing are rather short (take 10,000 frames as a minimum). And to be really proper you should actually reboot after each encode... but yeah, it's still indicative, I think. smile.gif

Posted by: Thefumigator Feb 10 2005, 12:45 AM
One note to add though: someone at the doom9 forum told me that my encode was quite slow, and that his Pentium 3 1.1 GHz was almost as fast. The source file is already encoded in XviD (so I recompressed to XviD) and it is 720x352, quite a large area to encode; and decoding that resolution is quite a task too. I mean, maybe not as demanding as encoding, but it takes a fair bit of CPU.

ALSO: don't look at the framerate, since it goes up and down all the time. I suggest looking at the time spent encoding.

I disagree, I actually don't need to reboot. I take care of my Windows installation as if I were a Ford Mustang maniac. Trust me... smile.gif

Posted by: fccHandler Feb 10 2005, 07:06 AM
QUOTE (Thefumigator @ Feb 9 2005, 08:45 PM)
I disagree, I actually don't need to reboot. I take care of my Windows installation as if I were a Ford Mustang maniac. Trust me... smile.gif

Me too! biggrin.gif

But Windows itself caches file reads (and presumably lots of other stuff). Haven't you ever noticed that the second time you read from a file it's often significantly faster than the first time?

Rebooting flushes all of Windows' buffers and caches, so in theory it should make your results more accurate.

Posted by: phaeron Feb 10 2005, 07:23 AM
The 32-bit build of VirtualDub is cheating slightly -- it is using MMX-optimized pixmap converters while the 64-bit build is not (because I haven't written 64-bit SSE2 versions yet). Tweaking the video format to do direct YCbCr-to-YCbCr recoding so that the pixmap converters aren't used may make a difference. Check the real-time profiler and make sure the video compression is taking the lion's share of the CPU time.



Posted by: squid_80 Feb 10 2005, 09:22 AM
In case anyone's wondering, there's more information here: http://forum.doom9.org/showthread.php?s=&threadid=87524

I've also done an x64 version of HuffYUV: http://home.iprimus.com.au/ajdunstan/huffyuv64.zip

There's no EXE installer; just unzip, then right-click the INF file and install. Also, I haven't tested HuffYUV on RC1 (build 1289).
(Now you see why I want to get MPEG-2 streams into VDub64 smile.gif)

Posted by: i4004 Feb 10 2005, 09:51 AM
In the context of this story, this article holds some interesting points too:
http://www.digit-life.com/articles2/pentium4-xvid-opt/

Who says the person building XviD will hit all the right compiler switches needed for the codec to go faster?

Perhaps squid's version is better, though.

Posted by: stephanV Feb 10 2005, 11:35 AM
QUOTE (Thefumigator @ Feb 10 2005, 01:45 AM)
I disagree, I actually don't need to reboot. I take care of my Windows installation as if I were a Ford Mustang maniac. Trust me... smile.gif

me too: almost 7 days uptime now --> http://www.uptime-project.net/page.php?page=toplist&content=profile&uid=42418 smile.gif

but yeah, flushing the caches was my main concern. wink.gif

Posted by: Thefumigator Feb 10 2005, 08:58 PM
stephanV: I think I have a record here tongue.gif

[screenshot]

Phaeron: what do you mean by "Check the real-time profiler and make sure the video compression is taking the lion's share of the CPU time"?

Squid_80: Keep Working on it! wink.gif

Posted by: squid_80 Feb 10 2005, 09:34 PM
QUOTE (i4004 @ Feb 10 2005, 09:51 AM)
In the context of this story, this article holds some interesting points too:
http://www.digit-life.com/articles2/pentium4-xvid-opt/

Who says the person building XviD will hit all the right compiler switches needed for the codec to go faster?

Perhaps squid's version is better, though.

That article confuses the hell out of me. The majority of the optimizations are MMX-based; with them turned off, my encoding speed plummets. But they claim enabling MMX on a Pentium 4 decreases performance. If the P4 were that bad at MMX instructions, there'd be a lot of unhappy gamers out there.
It just doesn't make sense that the MMX functions would be that much slower than their C counterparts. I think the only way you could cripple a P4 in the way they describe is if you purposely modified the source to do so... I don't think Koepi would do that, but who knows.

I have no idea what the performance of my build is on an intel compared to AMD... There was a guy who posted on doom9 who was running it on a dual nocona 3.6Ghz but he never said anything about speeds and I don't know anyone who's got that kind of hardware to compare it with anyway.

Posted by: phaeron Feb 11 2005, 05:12 AM
@Thefumigator:
If you use Options > Show Real-time Profiler before starting a render, the profiler window will show you a graphical display of what the threads are doing. For a compression benchmark, you want the block marked "V-Compress" to occupy as much of the time as possible within the processing thread. If you saw "V-Blit" showing up in significant amounts, for instance, it would mean that VirtualDub's pixmap conversion routines were taking up some CPU power and thus making your results less valid, because you would be measuring overhead besides the codec.

@squid_80:
I was reading the article with some interest up until this line:
QUOTE

What kind of optimization is it if the outdated MMX beats 3DNow! + SSE?

This betrays the author's lack of optimization experience. MMX is an integer instruction set, while 3DNow! and SSE are floating-point -- the two are not good replacements for each other. SSE is useless for processing packed pixels; MMX is terrible for transforming vertices. SSE has no more parallelism than MMX -- 4x -- and can easily be slower than MMX because it has to move twice as much data and has more awkward conversion primitives. 3DNow! vs. MMX is especially dumb because it's the same register set, same data width (64-bit), and half the parallelism. Wheeeee.

I should note that if the MMX code is poorly integrated into the surrounding code -- such as a short portion of inline assembly -- reduced compiler optimization and EMMS overhead may explain the lower performance. The Athlon is generally better at switching modes than the P4, which really doesn't like to take detours.

That the P4 is crippled with respect to MMX has some truth to it. When it comes to executing MMX instructions, the P4 is issue-bound because it has three units that can execute one 64-bit op per cycle -- multiplier, shifter, and adder -- but only one execution pipe to push MMX ALU uops through. Pentium II/III/M can execute two such instructions per clock as long as both aren't multiplies or both shifts; Athlons can pair those as well. This tilts the scale toward the 128-bit ops on P4 because with those the instructions take two clocks in the pipelines, but you can still start one per clock as long as you're hitting different pipes and thus can double your throughput compared to MMX. In theory, if you have an operation which is heavy on adds and subtracts, such as a DCT butterfly, you might be able to beat MMX with very-well optimized code on the double-speed scalar units, since they have a sustained throughput of 3 uops/clock, and thus can produce 96 bits of result per clock instead of 64. (It would be 4 uops except that the trace cache and retirement stations are bottlenecks.) In practice, though, you get nailed by load/store overhead, and well-written integer SSE2 will beat both anyway.

Posted by: squid_80 Feb 11 2005, 07:15 AM
QUOTE (phaeron @ Feb 11 2005, 05:12 AM)
This betrays the author's lack of optimization experience. MMX is an integer instruction set, while 3DNow! and SSE are floating-point -- the two are not good replacements for each other. SSE is useless for processing packed pixels; MMX is terrible for transforming vertices. SSE has no more parallelism than MMX -- 4x -- and can easily be slower than MMX because it has to move twice as much data and has more awkward conversion primitives. 3DNow! vs. MMX is especially dumb because it's the same register set, same data width (64-bit), and half the parallelism. Wheeeee.

What XviD calls "Integer SSE" is actually the MMX extensions. Also, XviD checks for SSE support but doesn't have any optimizations that actually use it (SSE2 yes, SSE1 no). Hence more confusion: how can code using the MMX extensions give a speed increase while code using plain MMX decreases speed?

Posted by: phaeron Feb 11 2005, 07:30 AM
They are indeed actually extensions to MMX, but officially they're part of SSE and you have to test the SSE bit in the CPUID feature register to detect them. AMD added them in their Athlon CPUs as part of 3DNow! Professional before adding full SSE starting with the Athlon XP; this is the reason programs have separate detection for them. The integer SSE instructions are heavily geared toward optimizing MPEG encoders, including instructions to assist with half-pel motion prediction (pavgb/pavgw, packed average byte/word) and motion search (psadbw, packed sum of absolute difference bytes to word). It also contains the first prefetch and streaming store instructions, which can assist in speeding up routines with heavy memory traffic. It's possible that XviD's plain MMX routines are suboptimal but the difference is more than made up with the additional instructions. I haven't looked at the code though so I couldn't tell you if this is true. A profile with Intel VTune or AMD CodeAnalyst would probably be rather enlightening.

Posted by: Thefumigator Feb 11 2005, 05:23 PM
WOW... quite interesting. So I have a question: I've heard the Athlon 64 is not a 64-bit processor but a 32-bit one with 64-bit extensions... if that's true, does it mean that the "64" in "Athlon 64" just refers to a new extension set, or am I only adding more confusion?

Posted by: phaeron Feb 12 2005, 04:58 AM
I don't know why you wouldn't consider the Athlon 64 to be a true 64-bit processor. It has a 64-bit address space, processing of 64-bit values, and can do so at full speed. The extension argument would work if the 64-bit instructions were added on like MMX -- your native ops were still 32-bit and it wasn't possible to really use 64-bit for everything, or there was a penalty for doing so related to main data paths not being 64-bit. Pentium MMXs have 64-bit processing in their FPU and MMX units, but you can't really rewrite Notepad using floating-point and vector operations. The Athlon 64, though, really can process 64-bit data at full speed with the generic operations necessary for a CPU to be general purpose.

It's true that when in 64-bit mode (long mode) the default operand size is 32-bit, given that most code doesn't need 64-bit operations. However, that's just an optimization, and several important aspects of program execution, such as the program counter and memory addressing, are natively 64-bit. Nor is there any awkwardness in using 64-bit data sizes -- you simply use the new 64-bit register names to use the whole register.

The Athlon 64 can also seamlessly execute 32-bit code in compatibility mode, but that doesn't make it a 32-bit processor any more than real mode makes an Athlon XP a 16-bit processor.

Posted by: squid_80 Feb 13 2005, 09:58 AM
QUOTE (phaeron @ Feb 11 2005, 07:30 AM)
They are indeed actually extensions to MMX, but officially they're part of SSE and you have to test the SSE bit in the CPUID feature register to detect them. AMD added them in their Athlon CPUs as part of 3DNow! Professional before adding full SSE starting with the Athlon XP; this is the reason programs have separate detection for them. The integer SSE instructions are heavily geared toward optimizing MPEG encoders, including instructions to assist with half-pel motion prediction (pavgb/pavgw, packed average byte/word) and motion search (psadbw, packed sum of absolute difference bytes to word). It also contains the first prefetch and streaming store instructions, which can assist in speeding up routines with heavy memory traffic. It's possible that XviD's plain MMX routines are suboptimal but the difference is more than made up with the additional instructions. I haven't looked at the code though so I couldn't tell you if this is true. A profile with Intel VTune or AMD CodeAnalyst would probably be rather enlightening.

As usual, your knowledge is enlightening. I didn't realize that a CPU indicating SSE support implies support for the MMX extensions; I probably should have guessed it logically, or read Intel's programming manuals as well as AMD's, or maybe just paid more attention when I did read the AMD version. Anyway, looking at XviD's check_cpu_features function again, it makes sense: it tests for SSE support and sets both the XVID_CPU_SSE and XVID_CPU_MMXEXT flags, and then, if the CPU is an AMD, it uses the AMD-specific test for the MMX extensions and sets XVID_CPU_MMXEXT (for Athlons before the XP). All I saw the first time was that the XVID_CPU_SSE flag isn't used when assigning the pointers to assembly functions, presumably because there are better alternatives than floating-point ops.

I've tried CodeAnalyst with varying success: in 32-bit mode it works, but when I try it under Windows x64 (which is what I really want, to find out how best to optimize the codec), it gives strange results, recording rIP values that are apparently in the .data section instead of .text. I think maybe it's getting the wrong base address for where xvidcore.dll is loaded in memory, but that's only a guess.

I hope I'm not hijacking this thread too much... tongue.gif

Re Athlon 64s being 32-bit with 64-bit extensions: it depends how you want to look at it, I think. Like phaeron says, it is a true 64-bit processor (meaning it can do operations on 64 bits at a time, not two 32-bit ops or something like that), but the instruction set is based on Intel's 32-bit one (IA-32). But if that's what it means to have a 32-bit processor with 64-bit extensions, I'd gladly choose one any day over an Intel Itanium, which is 64-bit but uses a completely different instruction set (IA-64) and runs existing 32-bit code very poorly.

Posted by: phaeron Feb 13 2005, 10:21 PM
Unless they've revved the download, the current version of CodeAnalyst has problems with build 1218+ of Windows x64 because the CA driver is built against the 1069 DDK. If you have problems with CodeAnalyst I highly recommend that you write to their feedback address; they're very responsive to good feedback and I've gotten several responses before.

Posted by: wiak Feb 15 2005, 04:27 AM
you are using XviD 1.0.3 on 32-bit and XviD 1.1 beta 1 on 64-bit!

Posted by: Thefumigator Feb 16 2005, 06:10 PM
wiak: why don't you read the whole discussion? tongue.gif There's also a speed increase when switching to VirtualDub 1.6.3.
