Welcome Guest ( Log In | Register )


Important

The forums will be closing permanently the weekend of March 15th. Please see the notice in the announcements forum for details.

 
Best Grayscale Code, Just my little experiments with Grays...
« Next Oldest | Next Newest » Track this topic | Email this topic | Print this topic
asmoptimizer
Posted: Jun 6 2010, 05:35 PM


Newbie


Group: Members
Posts: 3
Member No.: 27756
Joined: 6-June 10



Hi, I was experimenting with Grayscale code in order to make it run
as fast as possible without losing quality. (see my code bellow...)

I used MMX + a weird command: "pmaddwd"
while playing with this command, I got really confused about what it does. I've
checked many books that describes it but it has no seance with the reality.
I think it is a CPU bug (I used Intel Atom N280) because when I use numbers:
(23434 * color) that return max possible result: 255*23434 = 5B2E76h which is
greater then 65535 (word), All works fine - that proves that "pmaddwd" really use
dwords for it's calculations.

But when I use (color * 46869) this function starts to return some weird numbers.
Does anyone know what is the problem???

My main idea was to use as big numbers as possible in order to get as higher
precision as possible using MMX. By using "pmaddwd" command, I kill two rabbits
in one shut - It uses dwords(for better precision) and it also calculate the sum
(means smaller code size).

I don't know if it really improves the quality of an image but, I would like
to hear your opinion about my code. Thanks.


One more question: I have seen that different image processing codes uses
different calculations like - Colour = 0.30*R + 0.59*G + 0.11*B
Just wondering, What is the preferred one ?


NOTE: I use "punpcklbw mm0,qword[esi]" - it may cause problems - depends
how do you use it. When it loads the last pixel of the image, it reads one more dword
which is "outside of the image". if this byte is unreadable - out of the allocated
memory or something, This may cause a CPU exception -> app crash etc...



The code (Fasm syntax) you should copy it to a text file for a better view...


;; normal conversion equation is:
;; Y = 0.212671*R + 0,715160*G + 0.072169*B;
;;
;; Y = (13937 * R + 46869 * G + 4730 * B ) /65536; <-- Not working, Is it MMX bug???
;; Y = (6969 * R + 23434 * G + 2365 * B ) /32768; V
;; Y = (54 * R + 183 * G + 19 * B ) /256;

;; Y = (13937 * 255 + 46869 * 255 + 4730 * 255) = 3553935 + 11951595 + 1206150 = 16711680

;; Input: ecx = layer size (width * height)
;; ds:esi = layer pos


Grayscale: movq mm2,qword[Num_Gray_mul] ;;

Grayscale1: punpcklbw mm0,qword[esi] ;; Before: 00 00 00 00 11 22 33 44
psrlw mm0,8 ;; After: 1100 2200 3300 4400
pmaddwd mm0,mm2 ;; [R*6969 + G*23434] | [B*2365 + 0000000]
punpckldq mm1,mm0 ;; [B*2365 + 0000000] | [xxxxxxxxxxxxxxxx]
paddd mm0,mm1 ;; [R*6969 + G*23434 + B*2365]
psrlq mm0,32+8+8-1 ;; /32768
movd eax,mm0 ;; al=new color
mov ah,al ;;
mov [esi],ax ;; B,G
mov [esi+2],al ;; R (A is not changed)
add esi,4 ;; pixel++
loopd Grayscale1 ;;
ret ;;;;;;;;;;;;;;;;;;;;;;;;;;


Num_Gray_mul dw 2365,23434,6969,0 ; BGRA for grayscale


smile.gif smile.gif
 
     Top
phaeron
Posted: Jun 7 2010, 07:01 AM


Virtualdub Developer


Group: Administrator
Posts: 7773
Member No.: 61
Joined: 30-July 02



QUOTE

But when I use (color * 46869) this function starts to return some weird numbers.
Does anyone know what is the problem???


PMADDWD does signed 16-bit multiplication. You're not multiplying by 46869, but by -18667.

QUOTE

One more question: I have seen that different image processing codes uses
different calculations like - Colour = 0.30*R + 0.59*G + 0.11*B
Just wondering, What is the preferred one ?


It depends on your intended use. For casual viewing or real-time effects, it usually doesn't matter which axis you use as long as you're somewhere in the ballpark of how the eye sees luma. It matters more if you're doing your own video compression format, as chroma can be compressed more than luma, and so getting a good match to perceived luma and chroma lets you compress more. It's critical if you're encoding into a defined format like digital video or a standard video compression format, as you have to match what the decoder expects on the other end.

Standard definition usually uses Rec. 601 (luma axis is 30%/59%/11%) and high definition uses Rec. 709 (luma axis is 21%/72%/7%).

CODE

Grayscale1:
   punpcklbw mm0,qword[esi];; Before: 00 00 00 00 11 22 33 44
   psrlw     mm0,8         ;; After: 1100 2200 3300 4400
   pmaddwd   mm0,mm2       ;; [R*6969 + G*23434] | [B*2365 + 0000000]
   punpckldq mm1,mm0       ;; [B*2365 + 0000000] | [xxxxxxxxxxxxxxxx]
   paddd     mm0,mm1       ;; [R*6969 + G*23434 + B*2365]
   psrlq     mm0,32+8+8-1  ;; /32768
   movd      eax,mm0       ;; al=new color
   mov       ah,al         ;;
   mov       [esi],ax      ;; B,G
   mov       [esi+2],al    ;; R (A is not changed)
   add       esi,4         ;; pixel++
   loopd     Grayscale1    ;;


Hoo boy... where to start...

First, you're doing the fixed point to integer conversion without adding a rounding constant. The right shift effectively does a floor(), so you are incurring an average -0.5 error in doing so. One symptom is that you will only get a white (255) result for exactly #ffffff, when it should also result from #fffffe.

Second, you're only doing one pixel at a time per loop, which means you're wasting a lot of CPU cycles underutilizing the vector hardware. The only time you're using the hardware fully is in the first three instructions. All of the rest of the code is operating at half capacity or less, because part of the word contains unused / don't care data. An optimal loop would process two or four pixels per loop iteration.

You have some big no-nos in the latter half of the loop:

  • You're moving data between the vector unit (MMX) and the scalar unit (GPRs). That has extra latency and should be avoided. Keep vector data in MMX/SSE and scalar data on the GPR side when possible.
  • You're using 16-bit operand size. This requires an instruction prefix in 32-bit and 64-bit modes that can slow down decoding.
  • You're writing to two halves of a register (AL/AH) and then operating on the combined register (AX). This causes a partial register stall on many CPUs, as it forces the pipeline to stall until both register halves are ready.
  • You used the LOOP instruction. LOOP's been terrible for a long time -- use DEC+JNZ instead.


What you want to do is unroll this loop by 2x, then expand the grayscale values to full 32-bit values on two pixels at a time with MMX so that you can write the result with a direct MOVQ instruction. Then, unroll it by 2 again to 4x so that you can interleave the calculations and give the CPU something to do while waiting for the multiply.

I suggest reading the excellent software optimization guide at agner.org. It has lots of good optimization tips as well as information specific for various x86 CPUs, including the Atom.

 
    Top
asmoptimizer
Posted: Jun 7 2010, 04:03 PM


Newbie


Group: Members
Posts: 3
Member No.: 27756
Joined: 6-June 10



Thanks a lot for your comments, I have expected that the second half of my code
is far away from being perfect. I will try to process two pixels at a time, it will
help me to replace the AX register by MMX units.

You right about rounding the result, I just forgot to add one more command
before the right shift.
It can be easily done by adding the value 16384 or 16385 to high dword of mm0. (32768/2)




It looks like I have a lot of homework to do...
 
     Top
asmoptimizer
Posted: Jun 9 2010, 12:28 PM


Newbie


Group: Members
Posts: 3
Member No.: 27756
Joined: 6-June 10



Hi, This code is processing two pixels at a time.
I tested it with ollydbg to see that it really returns correct results.

I didn't find a better way to optimize the second part of the code, but I hope
that It is better than the previous version.

By the way, How do I add TABs or spaces in my posts so it won't look so ugly? smile.gif

MOVQ MM4,QWORD[Num_Gray_round] ;;
MOVQ MM5,QWORD[Num_Gray_and_1] ;;
MOVQ MM6,QWORD[Num_Gray_mul_1] ;;
MOVQ MM7,QWORD[Num_Gray_mul_2] ;;

Grayscale1: MOVQ MM0,QWORD[ESI] ;; load two pixels...
PUNPCKLBW MM1,MM0 ;; mm1 = A1 R1 | G1 B1
PUNPCKHBW MM2,MM0 ;; mm2 = A2 R2 | G2 B2
PSRLW MM1,8 ;; Convert to little endien...
PSRLW MM2,8 ;; Convert to little endien...
MOVQ MM3,MM1 ;; mm3 = xx xx | G1 B1
PUNPCKHDQ MM1,MM2 ;; mm1 = A2 R2 | A1 R1
PUNPCKLDQ MM3,MM2 ;; mm3 = G2 B2 | G1 B1
PMADDWD MM1,MM6 ;; mm1 = [ A2*0 + R2*6969 ] | [ A1*0 + R1*6969 ]
PMADDWD MM3,MM7 ;; mm3 = [ G2*23434 + B2*2365 ] | [ G1*23434 + B1*2365 ]
PAND MM0,MM5 ;; mm0 = A2 00 00 00 | A1 00 00 00 (prepare alpha)
PADDD MM1,MM3 ;; mm1 = A2+R2+G2+B2 | A1+R1+G1+B1

PADDD MM1,MM4 ;; round up...

PSRLD MM1,15 ;; /32768
MOVQ MM3,MM1 ;; mm3 = 00 00 00 C2 | 00 00 00 C1
PSLLD MM1,8 ;; mm1 = 00 00 C2 00 | 00 00 C1 00
POR MM3,MM1 ;; mm3 = 00 00 C2 C2 | 00 00 C1 C1
PSLLD MM1,8 ;; mm1 = 00 C2 00 00 | 00 C1 00 00
POR MM1,MM3 ;; mm1 = 00 C2 C2 C2 | 00 C1 C1 C1
POR MM1,MM0 ;; mm1 = A2 C2 C2 C2 | A1 C1 C1 C1 (restore alpha)

MOVQ [ESI],MM1 ;; put two pixels
ADD ESI,8 ;; pos+2


Num_Gray_mul_1 dw 6969,0,6969,0 ;; RARA for grayscale
Num_Gray_mul_2 dw 2365,23434,2365,23434 ;; BGBG for grayscale
Num_Gray_and_1 dd 0xFF000000,0xFF000000 ;; A000A000
Num_Gray_round dd 16384,16384 ;; used for round up
 
     Top
phaeron
Posted: Jun 11 2010, 04:42 AM


Virtualdub Developer


Group: Administrator
Posts: 7773
Member No.: 61
Joined: 30-July 02



Use the code tag to preserve spacing.

I'm not sure that doing an unpack backwards and then shifting down is advantageous. It's bad on a Pentium M / Core 2 because of the false loop carried dependency; it's bad on the Atom because it requires two FP0 slots. FP0 slots are precious because integer shift/multiply ops go there. MOVQ can go in either pipe, so you're probably better off with MOVQ/PUNPCKLBW. You do need zero in a register, but you should have a free FP1 slot somewhere near the end of the loop to create it.

You can gain a bit in the multiply section by rearranging the instructions a bit. For instance, you should move PADDD MM1, MM4 up above PADDD MM1, MM3 since MM1 will be done a clock sooner and doesn't need to wait for MM4.

Since you're optimizing for Atom, you should consider using SSE/SSE2/SSSE3 instructions. With SSE, you can use PSHUFW to speed up the replication section slightly. Using SSE2 and XMM registers will give you a major speed boost as Atom handles many integer vector operations just as fast with 128-bit vectors (XMM) as with 64-bit vectors (MMX). With SSSE3, you can use PMADDUBSW if you're willing to sacrifice some precision.
 
    Top
1 User(s) are reading this topic (1 Guests and 0 Anonymous Users)
0 Members:
4 replies since Jun 6 2010, 05:35 PM Track this topic | Email this topic | Print this topic

<< Back to VirtualDub Development Forum