|
|
| squid_80 |
| Posted: Feb 10 2005, 09:34 PM |
 |
|
Advanced Member
  
Group: Members
Posts: 594
Member No.: 13813
Joined: 22-January 05

|
| QUOTE (i4004 @ Feb 10 2005, 09:51 AM) | in the context of this story, this one holds some interesting points too; http://www.digit-life.com/articles2/pentium4-xvid-opt/
who said person building xvid will hit all the right switches needed for codec to go faster?
perhaps squid's version is better, though. |
That article confuses the hell out of me. The majority of the optimizations are MMX based; without them turned on my encoding speed plummets. But they claim enabling MMX on a Pentium 4 decreases performance - if the P4 was that bad at MMX instructions, there'd be a lot of unhappy gamers out there. It just doesn't make sense that the MMX functions would be that much slower than their C source counterparts. I think the only way you could cripple a P4 in the way they describe is if you purposefully modified the source to do so.... I don't think Koepi would do that, but who knows.
I have no idea what the performance of my build is on an Intel compared to AMD... There was a guy who posted on doom9 who was running it on a dual Nocona 3.6GHz but he never said anything about speeds and I don't know anyone who's got that kind of hardware to compare it with anyway. |
 |
| phaeron |
| Posted: Feb 11 2005, 05:12 AM |
 |
|

Virtualdub Developer
  
Group: Administrator
Posts: 7773
Member No.: 61
Joined: 30-July 02

|
@Thefumigator: If you use Options > Show Real-time Profiler before starting a render, the profiler window will show you a graphical display of what the threads are doing. For a compression benchmark, you want the block marked "V-Compress" to occupy as much of the time as possible within the processing thread. If you saw "V-Blit" showing up in significant amounts, for instance, it would mean that VirtualDub's pixmap conversion routines were taking up some CPU power and thus making your results less valid, because you are measuring some overhead besides the codec.
@squid_80: I was reading the article with some interest up until this line:
| QUOTE | What kind of optimization is it if the outdated MMX beats 3DNow! + SSE?
|
This betrays the author's lack of optimization experience. MMX is an integer instruction set, while 3DNow! and SSE are floating-point -- the two are not good replacements for each other. SSE is useless for processing packed pixels; MMX is terrible for transforming vertices. SSE has no more parallelism than MMX -- 4x -- and can easily be slower than MMX because it has to move twice as much data and has more awkward conversion primitives. 3DNow! vs. MMX is especially dumb because it's the same register set, same data width (64-bit), and half the parallelism. Wheeeee.
I should note that if the MMX code is poorly integrated into the surrounding code -- such as a short portion of inline assembly -- reduced compiler optimization and EMMS overhead may explain the lower performance. The Athlon is generally better at switching modes than the P4, which really doesn't like to take detours.
That the P4 is crippled with respect to MMX has some truth to it. When it comes to executing MMX instructions, the P4 is issue-bound because it has three units that can execute one 64-bit op per cycle -- multiplier, shifter, and adder -- but only one execution pipe to push MMX ALU uops through. Pentium II/III/M can execute two such instructions per clock as long as they aren't both multiplies or both shifts; Athlons can pair those as well. This tilts the scale toward the 128-bit ops on P4 because with those the instructions take two clocks in the pipelines, but you can still start one per clock as long as you're hitting different pipes and thus can double your throughput compared to MMX. In theory, if you have an operation which is heavy on adds and subtracts, such as a DCT butterfly, you might be able to beat MMX with very well-optimized code on the double-speed scalar units, since they have a sustained throughput of 3 uops/clock, and thus can produce 96 bits of result per clock instead of 64. (It would be 4 uops except that the trace cache and retirement stations are bottlenecks.) In practice, though, you get nailed by load/store overhead, and well-written integer SSE2 will beat both anyway. |
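The parallelism and throughput figures in this post are simple arithmetic, and a small C sketch can make them concrete. The function names below are made up for illustration, and the element sizes are assumptions: 16-bit words for MMX and 32-bit floats for SSE and 3DNow!, which is what the "4x" and "half the parallelism" figures suggest. These are peak numbers only and ignore the load/store overhead mentioned above.

```c
#include <assert.h>

/* Elements handled by one packed instruction: register width / element width. */
unsigned lanes(unsigned register_bits, unsigned element_bits)
{
    assert(element_bits != 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Peak result bits per clock: ops started per clock times bits per op. */
unsigned peak_bits_per_clock(unsigned ops_per_clock, unsigned bits_per_op)
{
    return ops_per_clock * bits_per_op;
}
```

With those assumptions, `lanes(64, 16)` and `lanes(128, 32)` both give 4 (MMX words vs. SSE floats), while `lanes(64, 32)` gives 2 for 3DNow!. For the P4 case, `peak_bits_per_clock(1, 64)` is the single MMX pipe, `peak_bits_per_clock(3, 32)` is the 96 bits from three scalar uops, and `peak_bits_per_clock(1, 128)` is one 128-bit SSE2 op started per clock.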
 |
| squid_80 |
| Posted: Feb 11 2005, 07:15 AM |
 |
|
|
| QUOTE (phaeron @ Feb 11 2005, 05:12 AM) | | This betrays the author's lack of optimization experience. MMX is an integer instruction set, while 3DNow! and SSE are floating-point -- the two are not good replacements for each other. SSE is useless for processing packed pixels; MMX is terrible for transforming vertices. SSE has no more parallelism than MMX -- 4x -- and can easily be slower than MMX because it has to move twice as much data and has more awkward conversion primitives. 3DNow! vs. MMX is especially dumb because it's the same register set, same data width (64-bit), and half the parallelism. Wheeeee. |
What xvid calls "Integer SSE" is actually Extensions for MMX. Also xvid checks for SSE support but doesn't have any optimizations that actually use it (SSE2 yes, SSE1 no). Hence more confusion - how can code using mmx extensions give a speed increase while code using plain mmx decreases speed? |
 |
| phaeron |
| Posted: Feb 11 2005, 07:30 AM |
 |
|


|
They are indeed actually extensions to MMX, but officially they're part of SSE and you have to test the SSE bit in the CPUID feature register to detect them. AMD added them in their Athlon CPUs as part of Enhanced 3DNow! before adding full SSE (as 3DNow! Professional) starting with the Athlon XP; this is the reason programs have separate detection for them. The integer SSE instructions are heavily geared toward optimizing MPEG encoders, including instructions to assist with half-pel motion prediction (pavgb/pavgw, packed average byte/word) and motion search (psadbw, packed sum of absolute difference bytes to word). The set also contains the first prefetch and streaming store instructions, which can assist in speeding up routines with heavy memory traffic. It's possible that XviD's plain MMX routines are suboptimal but the difference is more than made up with the additional instructions. I haven't looked at the code though so I couldn't tell you if this is true. A profile with Intel VTune or AMD CodeAnalyst would probably be rather enlightening. |
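The detection rule described in this post can be sketched in C roughly as follows. This is not XviD's code: the `FEAT_*` flags and function names are hypothetical, and the function takes raw EDX values as parameters so it can be read without a CPUID wrapper. The bit positions are the documented ones: EDX bits 23, 25 and 26 of CPUID function 1 for MMX, SSE and SSE2, and EDX bit 22 of AMD's extended function 0x80000001 for the MMX extensions.

```c
#include <assert.h>
#include <stdint.h>

#define STD_EDX_MMX     (1u << 23)  /* CPUID function 1, EDX */
#define STD_EDX_SSE     (1u << 25)
#define STD_EDX_SSE2    (1u << 26)
#define EXT_EDX_MMXEXT  (1u << 22)  /* CPUID function 0x80000001, EDX (AMD) */

/* Hypothetical flag names for this sketch only. */
enum { FEAT_MMX = 1, FEAT_MMXEXT = 2, FEAT_SSE = 4, FEAT_SSE2 = 8 };

unsigned decode_features(uint32_t std_edx, uint32_t ext_edx, int is_amd)
{
    unsigned f = 0;
    if (std_edx & STD_EDX_MMX)  f |= FEAT_MMX;
    /* The SSE bit covers the integer (MMX extension) instructions too. */
    if (std_edx & STD_EDX_SSE)  f |= FEAT_SSE | FEAT_MMXEXT;
    if (std_edx & STD_EDX_SSE2) f |= FEAT_SSE2;
    /* Pre-XP Athlons: no SSE bit, but AMD's own bit reports the extensions. */
    if (is_amd && (ext_edx & EXT_EDX_MMXEXT)) f |= FEAT_MMXEXT;
    return f;
}

/* Usage example: an Athlon without full SSE still gets the MMX extensions. */
void example_pre_xp_athlon(void)
{
    unsigned f = decode_features(STD_EDX_MMX, EXT_EDX_MMXEXT, 1);
    assert((f & FEAT_MMXEXT) && !(f & FEAT_SSE));
}
```

The extended bit is only consulted for AMD parts because that bit position is not defined by Intel, which is why a separate vendor-specific test is needed at all.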
 |
| Thefumigator |
| Posted: Feb 11 2005, 05:23 PM |
 |
|
Unregistered

|
WOW... quite interesting. So I have a question: I've heard the Athlon 64 is not a 64 bit processor but a 32 bit one with 64 bit extensions... if that's true, does it mean that the "64" in "Athlon 64" is just a new extension set, or am I just adding more confusion? |
 |
| phaeron |
| Posted: Feb 12 2005, 04:58 AM |
 |
|


|
I don't know why you wouldn't consider the Athlon 64 to be a true 64-bit processor. It has a 64-bit address space, processing of 64-bit values, and can do so at full speed. The extension argument would work if the 64-bit instructions were added on like MMX -- your native ops were still 32-bit and it wasn't possible to really use 64-bit for everything, or there was a penalty for doing so related to main data paths not being 64-bit. Pentium MMXs have 64-bit processing in their FPU and MMX units, but you can't really rewrite Notepad using floating-point and vector operations. The Athlon 64, though, really can process 64-bit data at full speed with the generic operations necessary for a CPU to be general purpose.
It's true that when in 64-bit mode (long mode) the default operand size is 32-bit, given that most code doesn't need 64-bit operations. However, that's just an optimization, and several important aspects of program execution, such as the program counter and memory addressing, are natively 64-bit. Nor is there any awkwardness in using 64-bit data sizes -- you simply use the new 64-bit register names to use the whole register.
The Athlon 64 can also seamlessly execute 32-bit code in compatibility mode, but that doesn't make it a 32-bit processor any more than real mode makes an Athlon XP a 16-bit processor. |
 |
| squid_80 |
| Posted: Feb 13 2005, 09:58 AM |
 |
|

|
| QUOTE (phaeron @ Feb 11 2005, 07:30 AM) | | They are indeed actually extensions to MMX, but officially they're part of SSE and you have to test the SSE bit in the CPUID feature register to detect them. AMD added them in their Athlon CPUs as part of Enhanced 3DNow! before adding full SSE (as 3DNow! Professional) starting with the Athlon XP; this is the reason programs have separate detection for them. The integer SSE instructions are heavily geared toward optimizing MPEG encoders, including instructions to assist with half-pel motion prediction (pavgb/pavgw, packed average byte/word) and motion search (psadbw, packed sum of absolute difference bytes to word). The set also contains the first prefetch and streaming store instructions, which can assist in speeding up routines with heavy memory traffic. It's possible that XviD's plain MMX routines are suboptimal but the difference is more than made up with the additional instructions. I haven't looked at the code though so I couldn't tell you if this is true. A profile with Intel VTune or AMD CodeAnalyst would probably be rather enlightening. |
As usual, your knowledge is enlightening. I didn't realize if a cpu indicates SSE support, this implies support for MMXext - probably should have guessed it logically, or read Intel's programming manuals as well as AMD's, or maybe just paid more attention when I did read the AMD version. Anyway, when I look at xvid's check_cpu_features function again, it makes sense; it tests for SSE support and sets both the XVID_CPU_SSE and XVID_CPU_MMXEXT flags, then if the CPU is an AMD it uses the AMD specific test for MMX Extensions and sets XVID_CPU_MMXEXT (for athlons before xp). All I saw the first time is that the XVID_CPU_SSE flag isn't used when assigning the pointers to assembly functions, presumably because there's better alternatives than using floating point ops.
I've tried CodeAnalyst with varying success - in 32-bit mode it works, but when I try it under windows x64 (which is what I really want, to find how best to optimize the codec) it gives strange results, recording rIP values that are apparently in .data section instead of .text. I think maybe it's getting the wrong base address for where xvidcore.dll is loaded in memory, but it's only a guess.
I hope I'm not hijacking this thread too much...
Re Athlon 64s being 32-bit with 64-bit extensions, it depends how you want to look at it I think. Like Phaeron says, it is a true 64-bit processor (meaning it can do operations using 64 bits at a time, not 2 ops using 32 bits or something like that) but the instruction set used is based on Intel's 32-bit one (IA-32). But if that's what it means to have a 32-bit processor with 64-bit extensions, I'd gladly choose one of them any day over an Intel Itanium which is 64-bit but uses a completely different instruction set (IA-64) and runs existing 32-bit code very poorly. |
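To illustrate the "64 bits at a time, not 2 ops using 32 bits" distinction, here is a minimal C sketch. The function names are made up for this example. The first function spells out the low-half add plus high-half add-with-carry that 32-bit code has to perform for a 64-bit sum; the second is the plain 64-bit add that a compiler targeting 64-bit mode can typically emit as one instruction on a full-width register.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit add done the 32-bit way: low halves first, then high halves plus carry. */
uint64_t add64_via_32bit_halves(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo    = a_lo + b_lo;          /* first 32-bit op (like ADD) */
    uint32_t carry = lo < a_lo;            /* result wrapped, so carry out */
    uint32_t hi    = a_hi + b_hi + carry;  /* second 32-bit op (like ADC) */

    return ((uint64_t)hi << 32) | lo;
}

/* The same sum as a single 64-bit operation. */
uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Usage example: a carry out of the low half must reach the high half. */
void example_carry(void)
{
    uint64_t a = 0x00000001FFFFFFFFull;
    assert(add64_via_32bit_halves(a, 1) == 0x0000000200000000ull);
    assert(add64_via_32bit_halves(a, 1) == add64_native(a, 1));
}
```

Both functions return the same value for any inputs; the difference is only in how many machine operations the work takes on a 32-bit versus a 64-bit target.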
 |
| phaeron |
| Posted: Feb 13 2005, 10:21 PM |
 |
|


|
Unless they've revved the download, the current version of CodeAnalyst has problems with build 1218+ of Windows x64 because the CA driver is built against the 1069 DDK. If you have problems with CodeAnalyst I highly recommend that you write to their feedback address; they're very responsive to good feedback and I've gotten several responses before. |
 |
| wiak |
| Posted: Feb 15 2005, 04:27 AM |
 |
|
Member
 
Group: Members
Posts: 24
Member No.: 10451
Joined: 30-May 04

|
you are using XviD 1.0.3 on 32bit and XviD 1.1 beta 1 on 64bit ! |
 |
| Thefumigator |
| Posted: Feb 16 2005, 06:10 PM |
 |
|
Unregistered

|
WIAK: Why don't you read the whole discussion? There's also a speed increase when switching to VirtualDub 1.6.3 |
 |