
|
| QUOTE (phaeron @ May 12 2012, 10:37 PM) | Ack, I was an idiot and edited your post instead of responding to it. Hope it still makes sense to people.  |

Must have been early in the morning, before your coffee kicked in?
| QUOTE (phaeron @ May 12 2012, 10:37 PM) | | This is with /Ox with Visual Studio 2010 SP1 Pro. Notice the imul instructions in the loop. |
I see, but unless your primary target audience is people with CPUs older than Presler (i.e. with no dedicated integer multiplier), even two IMULs in the loop will be heavily outweighed by the 4x speedup obtained by threading.
Video processing is either compute bound (complex math for each pixel) or bandwidth bound (a large radius and thus a lot of memory fetches for each pixel). You would have to run really simple processing code, or run it on a very crappy CPU, to notice the performance effect of those IMULs in the loop.
| QUOTE (phaeron @ May 12 2012, 10:37 PM) | | I prefer not having to second-guess the compiler all of the time in critical loops. |
Then maybe you should invest in the Intel C++ Compiler?
Same loop, same options (/Ox) with Intel Compiler 12.1.3.102:
| CODE |
        push      esi                                   ;9.46
        push      edi                                   ;9.46
        push      ebx                                   ;9.46
        push      ebp                                   ;9.46
        push      esi                                   ;9.46
        mov       edx, DWORD PTR [28+esp]               ;9.6
        mov       esi, DWORD PTR [8+edx]                ;14.21
        test      esi, esi                              ;14.21
        mov       edi, DWORD PTR [edx]                  ;12.38
        jbe       .B1.5         ; Prob 10%              ;14.21
                                ; LOE edx esi edi
.B1.2:                          ; Preds .B1.1
        mov       eax, DWORD PTR [24+esp]               ;9.6
        xor       ecx, ecx                              ;
        xor       ebx, ebx                              ;
        mov       ebp, DWORD PTR [4+eax]                ;15.9
        mov       eax, DWORD PTR [4+edx]                ;15.29
        xor       edx, edx                              ;
        mov       DWORD PTR [esp], eax                  ;
                                ; LOE edx ecx ebx ebp esi edi
.B1.3:                          ; Preds .B1.3 .B1.2
        movzx     eax, BYTE PTR [edx+edi]               ;15.22
        inc       ecx                                   ;14.35
        mov       BYTE PTR [ebx+edi], al                ;15.2
        add       ebx, ebp                              ;14.35
        add       edx, DWORD PTR [esp]                  ;14.35
        cmp       ecx, esi                              ;14.21
        jb        .B1.3         ; Prob 82%              ;14.21
                                ; LOE edx ecx ebx ebp esi edi
.B1.5:                          ; Preds .B1.3 .B1.1
        pop       ecx                                   ;17.1
        pop       ebp                                   ;17.1
        pop       ebx                                   ;17.1
        pop       edi                                   ;17.1
        pop       esi                                   ;17.1
        ret                                             ;17.1
|
The same code is generated even at /O1 with the Intel Compiler. The only way to get code with IMUL is to use /Oa-, which disables "assume no aliasing". Frankly, I prefer the compiler to assume there is no aliasing, because 99% of the time there is none, and for the remaining 1% your code should be fixed.
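For readers who cannot see the original C source (it is not quoted in this post), here is a rough sketch of the transformation being discussed. The function names, parameters, and loop shape are assumptions made up for illustration, not the actual filter code: the first version recomputes each offset with a multiply per iteration, while the second advances running offsets by addition, which is roughly what the `add ebx, ebp` / `add edx, DWORD PTR [esp]` pair in the Intel listing does.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical loop: copies one byte per row, recomputing both offsets
// with a multiplication on every iteration.
void copy_column_mul(std::uint8_t* dst, std::ptrdiff_t dst_pitch,
                     const std::uint8_t* src, std::ptrdiff_t src_pitch,
                     std::size_t height) {
    for (std::size_t y = 0; y < height; ++y) {
        const std::ptrdiff_t row = static_cast<std::ptrdiff_t>(y);
        dst[row * dst_pitch] = src[row * src_pitch];
    }
}

// Same result with the multiplies replaced by running offsets that are
// advanced by addition (strength reduction done by hand).
void copy_column_add(std::uint8_t* dst, std::ptrdiff_t dst_pitch,
                     const std::uint8_t* src, std::ptrdiff_t src_pitch,
                     std::size_t height) {
    std::ptrdiff_t dst_off = 0;
    std::ptrdiff_t src_off = 0;
    for (std::size_t y = 0; y < height; ++y) {
        dst[dst_off] = src[src_off];
        dst_off += dst_pitch;
        src_off += src_pitch;
    }
}
```

Writing the second form by hand is one way to avoid depending on the compiler's aliasing assumptions, at the cost of slightly less obvious indexing.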
| QUOTE (phaeron @ May 12 2012, 10:37 PM) | | Logical cores is not really a problem. In fact, I would actually recommend just counting the number of bits in the process affinity mask rather than checking the CPU for purposes of estimating a parallelism target. This will make your filter execute more reasonably if someone intentionally restricts core usage on the process. |
I have a Core i7 2600K here with 8 logical and 4 physical cores, and I have to disagree with this: I just ran some tests where I get slower than single-threaded performance with 8 threads, and 3.98x single-threaded performance with 4 cores. There are some loads where logical cores do not help.
Now, I can either restrict the number of threads myself, because I know that logical cores do not help in this particular case, or let users with HyperThreading-capable CPUs figure out on their own that performance is worse with HyperThreading.
If I limit the number of threads, the user can still benefit from using all logical cores in the filters where that actually helps. If I don't, they will have to disable HyperThreading either in the BIOS or by fiddling with the process affinity mask, which means disabling it for the whole pipeline. I personally think the first option would be better for the user.
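A minimal sketch of how the two positions could be combined: count the bits in the process affinity mask as suggested in the quote, then cap the result at the physical core count for filters known not to benefit from HyperThreading. `GetProcessAffinityMask` is the real Win32 call; the helper names, the `benefits_from_smt` flag, and the idea of passing the physical core count in from elsewhere are assumptions for illustration only (detecting physical cores is not shown here).

```cpp
#include <cstdint>
#include <thread>
#ifdef _WIN32
#ifndef NOMINMAX
#define NOMINMAX
#endif
#include <windows.h>
#endif

// Count the set bits in an affinity mask (one bit per usable logical CPU).
unsigned count_mask_bits(std::uint64_t mask) {
    unsigned n = 0;
    while (mask != 0) {
        mask &= mask - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Logical CPUs this process is allowed to run on. On Windows this respects
// a user-restricted affinity mask; elsewhere it falls back to the
// standard library's estimate.
unsigned available_logical_cpus() {
#ifdef _WIN32
    DWORD_PTR process_mask = 0;
    DWORD_PTR system_mask = 0;
    if (GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask)) {
        const unsigned n = count_mask_bits(static_cast<std::uint64_t>(process_mask));
        if (n > 0) return n;
    }
#endif
    const unsigned hc = std::thread::hardware_concurrency();
    return hc > 0 ? hc : 1u;
}

// Hypothetical per-filter policy: use every available logical CPU only when
// the filter is known to benefit from SMT, otherwise cap at physical cores.
unsigned choose_thread_count(unsigned logical_cpus, unsigned physical_cores,
                             bool benefits_from_smt) {
    unsigned limit = logical_cpus;
    if (!benefits_from_smt && physical_cores < limit) limit = physical_cores;
    return limit > 0 ? limit : 1u;
}
```

On the 2600K described above, `choose_thread_count(8, 4, false)` would give 4 threads, while a filter flagged as benefiting from SMT would still get all 8. If the user restricts the affinity mask to 2 CPUs, both cases drop to 2.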
| QUOTE (phaeron @ May 12 2012, 10:37 PM) | | Checking physical cores and/or HyperThreading, however, is a pain in the butt. I know how to do it, I'm honestly not sure it would be a good idea for me to provide this, especially since it's likely to be something the user would want or need to tune. |
I don't see how a typical user could make more than an educated guess about which filters would benefit from his "tuning" attempts. You would then have to allow tuning per filter, because there is no "one size fits all" option when it comes to threading.
| QUOTE (phaeron @ May 12 2012, 10:37 PM) | | Two instances of your filter can run in parallel -- and this will happen if filter multithreading is enabled |
Then performance will suffer, because the threads will compete for resources. There should be a way to detect that running filters in parallel is enabled in VirtualDub, and to warn the user that multi-threaded filters perform better when they are not run in parallel with themselves and other filters. |