Plugin Sdk V1.2
levicki
Posted: May 13 2012, 01:18 PM





QUOTE (phaeron @ May 12 2012, 10:37 PM)
Ack, I was an idiot and edited your post instead of responding to it. Hope it still makes sense to people. :)


:D

Must have been early in the morning, before your coffee had kicked in?

QUOTE (phaeron @ May 12 2012, 10:37 PM)
This is with /Ox with Visual Studio 2010 SP1 Pro. Notice the imul instructions in the loop.


I see, but unless your primary target audience is people with CPUs older than Presler (i.e. with no dedicated integer multiplier), even 2x IMUL in the loop will be heavily outweighed by the 4x speedup obtained by threading.

Video processing is either compute bound (complex math for each pixel) or bandwidth bound (a large radius and thus a lot of memory fetches for each pixel). You would have to run really simple processing code, or run it on a very slow CPU, to notice the performance effect of those IMULs in the loop.

QUOTE (phaeron @ May 12 2012, 10:37 PM)
I prefer not having to second-guess the compiler all of the time in critical loops.


Then maybe you should invest in the Intel C++ Compiler? ;)

Same loop, same options (/Ox) with Intel Compiler 12.1.3.102:
CODE

       push      esi                                          ;9.46
       push      edi                                          ;9.46
       push      ebx                                          ;9.46
       push      ebp                                          ;9.46
       push      esi                                          ;9.46
       mov       edx, DWORD PTR [28+esp]                      ;9.6
       mov       esi, DWORD PTR [8+edx]                      ;14.21
       test      esi, esi                                    ;14.21
       mov       edi, DWORD PTR [edx]                        ;12.38
       jbe       .B1.5        ; Prob 10%                    ;14.21
                             ; LOE edx esi edi
.B1.2:                        ; Preds .B1.1
       mov       eax, DWORD PTR [24+esp]                      ;9.6
       xor       ecx, ecx                                    ;
       xor       ebx, ebx                                    ;
       mov       ebp, DWORD PTR [4+eax]                      ;15.9
       mov       eax, DWORD PTR [4+edx]                      ;15.29
       xor       edx, edx                                    ;
       mov       DWORD PTR [esp], eax                        ;
                             ; LOE edx ecx ebx ebp esi edi
.B1.3:                        ; Preds .B1.3 .B1.2
       movzx     eax, BYTE PTR [edx+edi]                      ;15.22
       inc       ecx                                          ;14.35
       mov       BYTE PTR [ebx+edi], al                      ;15.2
       add       ebx, ebp                                    ;14.35
       add       edx, DWORD PTR [esp]                        ;14.35
       cmp       ecx, esi                                    ;14.21
       jb        .B1.3        ; Prob 82%                    ;14.21
                             ; LOE edx ecx ebx ebp esi edi
.B1.5:                        ; Preds .B1.3 .B1.1
       pop       ecx                                          ;17.1
       pop       ebp                                          ;17.1
       pop       ebx                                          ;17.1
       pop       edi                                          ;17.1
       pop       esi                                          ;17.1
       ret                                                    ;17.1


The Intel Compiler generates the same code even at /O1. The only way to get code with IMUL is to use /Oa-, which disables "assume no aliasing". Frankly, I prefer the compiler to assume there is no aliasing, because 99% of the time there is none, and for the remaining 1% your code should be fixed.
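For anyone who would rather not depend on the compiler's aliasing assumptions, the same strength reduction can be written by hand. The following is only a sketch: the `Plane` struct and `copy_column` are hypothetical names made up for illustration and are not the loop from the original post or anything in the VirtualDub SDK. It copies the pitches into locals before the loop, so byte stores cannot be assumed to modify them, and steps two running offsets instead of multiplying the row index by the pitch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical plane descriptor (an assumption, not an SDK type).
struct Plane {
    std::uint8_t*  data;   // first byte of the plane
    std::ptrdiff_t pitch;  // distance in bytes between consecutive rows
};

// Copies the first byte of each row from src to dst.
// The indexed form would be: dst.data[y * dst.pitch] = src.data[y * src.pitch];
// which needs two multiplies per row unless the compiler reduces them.
void copy_column(const Plane& dst, const Plane& src, std::size_t rows) {
    assert(dst.data != nullptr && src.data != nullptr);

    // Read the fields once. Locals cannot be changed by the byte stores below,
    // so no aliasing assumption is needed to keep them in registers.
    std::uint8_t* const       d         = dst.data;
    const std::uint8_t* const s         = src.data;
    const std::ptrdiff_t      dst_pitch = dst.pitch;
    const std::ptrdiff_t      src_pitch = src.pitch;

    // Running offsets replace the per-row multiplications with additions.
    std::ptrdiff_t dst_off = 0;
    std::ptrdiff_t src_off = 0;
    for (std::size_t y = 0; y < rows; ++y) {
        d[dst_off] = s[src_off];
        dst_off += dst_pitch;
        src_off += src_pitch;
    }
}
```

This has the same shape as the inner loop in the listing above (one load, one store, two adds, a compare and a branch), but whether a given compiler emits exactly that still depends on the compiler and its options.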

QUOTE (phaeron @ May 12 2012, 10:37 PM)
Logical cores is not really a problem. In fact, I would actually recommend just counting the number of bits in the process affinity mask rather than checking the CPU for purposes of estimating a parallelism target. This will make your filter execute more reasonably if someone intentionally restricts core usage on the process.


I have a Core i7 2600K here with 8 logical and 4 physical cores, and I have to disagree with this: I just ran some tests where 8 threads were slower than single-threaded, while 4 threads, one per physical core, reached 3.98x single-threaded performance. There are some loads where logical cores do not help.

Now, I can either restrict the number of threads myself because I know that logical cores do not help in this particular case, or let the users with HyperThreading capable CPUs figure out on their own that the performance is worse with HyperThreading.

If I limit the number of threads, then the user can still benefit from using all logical cores in the filters where it actually does help. If I don't, then they will have to disable HyperThreading either in the BIOS or by fiddling with the process affinity mask, which means they will disable it for the whole pipeline. I personally think the first option would be better for the user.
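The two positions can be combined: count the bits in the process affinity mask as suggested in the quote, then apply a per-filter cap on top. The sketch below is one possible way to do that, not an SDK facility. `count_mask_bits`, `choose_thread_count` and `current_affinity_mask` are hypothetical names, the cap value is something the filter author would have to pick from their own measurements, and only `GetProcessAffinityMask` and `GetCurrentProcess` are real Win32 calls.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#define NOMINMAX            // keep windows.h from defining min/max macros
#include <windows.h>
#endif

// Number of logical processors the process is allowed to run on.
unsigned count_mask_bits(std::uint64_t mask) {
    return static_cast<unsigned>(std::bitset<64>(mask).count());
}

// Hypothetical policy: never exceed what the affinity mask allows, and
// additionally honor a per-filter cap (0 means "no cap").
unsigned choose_thread_count(std::uint64_t affinity_mask, unsigned filter_cap) {
    const unsigned allowed = std::max(1u, count_mask_bits(affinity_mask));
    const unsigned result  = (filter_cap == 0) ? allowed
                                               : std::min(allowed, filter_cap);
    assert(result >= 1);
    return result;
}

#ifdef _WIN32
// Returns the current process affinity mask, or 0 if the query fails.
std::uint64_t current_affinity_mask() {
    DWORD_PTR process_mask = 0;
    DWORD_PTR system_mask  = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask))
        return 0;
    return static_cast<std::uint64_t>(process_mask);
}
#endif
```

On the 2600K described above, a filter that does not scale with HyperThreading could call `choose_thread_count(current_affinity_mask(), 4)`: it would use 4 threads by default and fewer if the user restricts the process to fewer logical processors, while other filters in the pipeline remain free to use all 8.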

QUOTE (phaeron @ May 12 2012, 10:37 PM)
Checking physical cores and/or HyperThreading, however, is a pain in the butt. I know how to do it, I'm honestly not sure it would be a good idea for me to provide this, especially since it's likely to be something the user would want or need to tune.


I don't see how a typical user could make more than an educated guess about which filters would benefit from his "tuning" attempts. You would then have to allow them to tune per filter, because there is no "one size fits all" option when it comes to threading.

QUOTE (phaeron @ May 12 2012, 10:37 PM)
Two instances of your filter can run in parallel -- and this will happen if filter multithreading is enabled


Then the performance will suffer because the threads will compete for resources. There should be a way to detect that running filters in parallel is enabled in VirtualDub, and to warn the user that multi-threaded filters perform better when they are not run in parallel with themselves and other filters.
 