|
|
| asmoptimizer |
| Posted: Jun 6 2010, 05:35 PM |
 |
|
Newbie

Group: Members
Posts: 3
Member No.: 27756
Joined: 6-June 10

|
Hi, I was experimenting with Grayscale code in order to make it run as fast as possible without losing quality. (see my code bellow...)
I used MMX + a weird command: "pmaddwd" while playing with this command, I got really confused about what it does. I've checked many books that describes it but it has no seance with the reality. I think it is a CPU bug (I used Intel Atom N280) because when I use numbers: (23434 * color) that return max possible result: 255*23434 = 5B2E76h which is greater then 65535 (word), All works fine - that proves that "pmaddwd" really use dwords for it's calculations.
But when I use (color * 46869) this function starts to return some weird numbers. Does anyone know what is the problem???
My main idea was to use as big numbers as possible in order to get as higher precision as possible using MMX. By using "pmaddwd" command, I kill two rabbits in one shut - It uses dwords(for better precision) and it also calculate the sum (means smaller code size).
I don't know if it really improves the quality of an image but, I would like to hear your opinion about my code. Thanks.
One more question: I have seen that different image processing codes uses different calculations like - Colour = 0.30*R + 0.59*G + 0.11*B Just wondering, What is the preferred one ?
NOTE: I use "punpcklbw mm0,qword[esi]" - it may cause problems - depends how do you use it. When it loads the last pixel of the image, it reads one more dword which is "outside of the image". if this byte is unreadable - out of the allocated memory or something, This may cause a CPU exception -> app crash etc...
The code (Fasm syntax) you should copy it to a text file for a better view...
;; normal conversion equation is: ;; Y = 0.212671*R + 0,715160*G + 0.072169*B; ;; ;; Y = (13937 * R + 46869 * G + 4730 * B ) /65536; <-- Not working, Is it MMX bug??? ;; Y = (6969 * R + 23434 * G + 2365 * B ) /32768; V ;; Y = (54 * R + 183 * G + 19 * B ) /256;
;; Y = (13937 * 255 + 46869 * 255 + 4730 * 255) = 3553935 + 11951595 + 1206150 = 16711680
;; Input: ecx = layer size (width * height) ;; ds:esi = layer pos
Grayscale: movq mm2,qword[Num_Gray_mul] ;; Grayscale1: punpcklbw mm0,qword[esi] ;; Before: 00 00 00 00 11 22 33 44 psrlw mm0,8 ;; After: 1100 2200 3300 4400 pmaddwd mm0,mm2 ;; [R*6969 + G*23434] | [B*2365 + 0000000] punpckldq mm1,mm0 ;; [B*2365 + 0000000] | [xxxxxxxxxxxxxxxx] paddd mm0,mm1 ;; [R*6969 + G*23434 + B*2365] psrlq mm0,32+8+8-1 ;; /32768 movd eax,mm0 ;; al=new color mov ah,al ;; mov [esi],ax ;; B,G mov [esi+2],al ;; R (A is not changed) add esi,4 ;; pixel++ loopd Grayscale1 ;; ret ;;;;;;;;;;;;;;;;;;;;;;;;;;
Num_Gray_mul dw 2365,23434,6969,0 ; BGRA for grayscale
|
 |
| phaeron |
| Posted: Jun 7 2010, 07:01 AM |
 |
|

Virtualdub Developer
  
Group: Administrator
Posts: 7773
Member No.: 61
Joined: 30-July 02

|
| QUOTE | But when I use (color * 46869) this function starts to return some weird numbers. Does anyone know what is the problem???
|
PMADDWD does signed 16-bit multiplication. You're not multiplying by 46869, but by -18667.
| QUOTE | One more question: I have seen that different image processing codes uses different calculations like - Colour = 0.30*R + 0.59*G + 0.11*B Just wondering, What is the preferred one ?
|
It depends on your intended use. For casual viewing or real-time effects, it usually doesn't matter which axis you use as long as you're somewhere in the ballpark of how the eye sees luma. It matters more if you're doing your own video compression format, as chroma can be compressed more than luma, and so getting a good match to perceived luma and chroma lets you compress more. It's critical if you're encoding into a defined format like digital video or a standard video compression format, as you have to match what the decoder expects on the other end.
Standard definition usually uses Rec. 601 (luma axis is 30%/59%/11%) and high definition uses Rec. 709 (luma axis is 21%/72%/7%).
| CODE | Grayscale1: punpcklbw mm0,qword[esi];; Before: 00 00 00 00 11 22 33 44 psrlw mm0,8 ;; After: 1100 2200 3300 4400 pmaddwd mm0,mm2 ;; [R*6969 + G*23434] | [B*2365 + 0000000] punpckldq mm1,mm0 ;; [B*2365 + 0000000] | [xxxxxxxxxxxxxxxx] paddd mm0,mm1 ;; [R*6969 + G*23434 + B*2365] psrlq mm0,32+8+8-1 ;; /32768 movd eax,mm0 ;; al=new color mov ah,al ;; mov [esi],ax ;; B,G mov [esi+2],al ;; R (A is not changed) add esi,4 ;; pixel++ loopd Grayscale1 ;;
|
Hoo boy... where to start...
First, you're doing the fixed point to integer conversion without adding a rounding constant. The right shift effectively does a floor(), so you are incurring an average -0.5 error in doing so. One symptom is that you will only get a white (255) result for exactly #ffffff, when it should also result from #fffffe.
Second, you're only doing one pixel at a time per loop, which means you're wasting a lot of CPU cycles underutilizing the vector hardware. The only time you're using the hardware fully is in the first three instructions. All of the rest of the code is operating at half capacity or less, because part of the word contains unused / don't care data. An optimal loop would process two or four pixels per loop iteration.
You have some big no-nos in the latter half of the loop:
- You're moving data between the vector unit (MMX) and the scalar unit (GPRs). That has extra latency and should be avoided. Keep vector data in MMX/SSE and scalar data on the GPR side when possible.
- You're using 16-bit operand size. This requires an instruction prefix in 32-bit and 64-bit modes that can slow down decoding.
- You're writing to two halves of a register (AL/AH) and then operating on the combined register (AX). This causes a partial register stall on many CPUs, as it forces the pipeline to stall until both register halves are ready.
- You used the LOOP instruction. LOOP's been terrible for a long time -- use DEC+JNZ instead.
What you want to do is unroll this loop by 2x, then expand the grayscale values to full 32-bit values on two pixels at a time with MMX so that you can write the result with a direct MOVQ instruction. Then, unroll it by 2 again to 4x so that you can interleave the calculations and give the CPU something to do while waiting for the multiply.
I suggest reading the excellent software optimization guide at agner.org. It has lots of good optimization tips as well as information specific for various x86 CPUs, including the Atom.
|
 |
| asmoptimizer |
| Posted: Jun 7 2010, 04:03 PM |
 |
|
Newbie

Group: Members
Posts: 3
Member No.: 27756
Joined: 6-June 10

|
Thanks a lot for your comments, I have expected that the second half of my code is far away from being perfect. I will try to process two pixels at a time, it will help me to replace the AX register by MMX units.
You right about rounding the result, I just forgot to add one more command before the right shift. It can be easily done by adding the value 16384 or 16385 to high dword of mm0. (32768/2)
It looks like I have a lot of homework to do... |
 |
| asmoptimizer |
| Posted: Jun 9 2010, 12:28 PM |
 |
|
Newbie

Group: Members
Posts: 3
Member No.: 27756
Joined: 6-June 10

|
Hi, This code is processing two pixels at a time. I tested it with ollydbg to see that it really returns correct results.
I didn't find a better way to optimize the second part of the code, but I hope that It is better than the previous version.
By the way, How do I add TABs or spaces in my posts so it won't look so ugly?
MOVQ MM4,QWORD[Num_Gray_round] ;; MOVQ MM5,QWORD[Num_Gray_and_1] ;; MOVQ MM6,QWORD[Num_Gray_mul_1] ;; MOVQ MM7,QWORD[Num_Gray_mul_2] ;;
Grayscale1: MOVQ MM0,QWORD[ESI] ;; load two pixels... PUNPCKLBW MM1,MM0 ;; mm1 = A1 R1 | G1 B1 PUNPCKHBW MM2,MM0 ;; mm2 = A2 R2 | G2 B2 PSRLW MM1,8 ;; Convert to little endien... PSRLW MM2,8 ;; Convert to little endien... MOVQ MM3,MM1 ;; mm3 = xx xx | G1 B1 PUNPCKHDQ MM1,MM2 ;; mm1 = A2 R2 | A1 R1 PUNPCKLDQ MM3,MM2 ;; mm3 = G2 B2 | G1 B1 PMADDWD MM1,MM6 ;; mm1 = [ A2*0 + R2*6969 ] | [ A1*0 + R1*6969 ] PMADDWD MM3,MM7 ;; mm3 = [ G2*23434 + B2*2365 ] | [ G1*23434 + B1*2365 ] PAND MM0,MM5 ;; mm0 = A2 00 00 00 | A1 00 00 00 (prepare alpha) PADDD MM1,MM3 ;; mm1 = A2+R2+G2+B2 | A1+R1+G1+B1
PADDD MM1,MM4 ;; round up...
PSRLD MM1,15 ;; /32768 MOVQ MM3,MM1 ;; mm3 = 00 00 00 C2 | 00 00 00 C1 PSLLD MM1,8 ;; mm1 = 00 00 C2 00 | 00 00 C1 00 POR MM3,MM1 ;; mm3 = 00 00 C2 C2 | 00 00 C1 C1 PSLLD MM1,8 ;; mm1 = 00 C2 00 00 | 00 C1 00 00 POR MM1,MM3 ;; mm1 = 00 C2 C2 C2 | 00 C1 C1 C1 POR MM1,MM0 ;; mm1 = A2 C2 C2 C2 | A1 C1 C1 C1 (restore alpha)
MOVQ [ESI],MM1 ;; put two pixels ADD ESI,8 ;; pos+2
Num_Gray_mul_1 dw 6969,0,6969,0 ;; RARA for grayscale Num_Gray_mul_2 dw 2365,23434,2365,23434 ;; BGBG for grayscale Num_Gray_and_1 dd 0xFF000000,0xFF000000 ;; A000A000 Num_Gray_round dd 16384,16384 ;; used for round up |
 |
| phaeron |
| Posted: Jun 11 2010, 04:42 AM |
 |
|

Virtualdub Developer
  
Group: Administrator
Posts: 7773
Member No.: 61
Joined: 30-July 02

|
Use the code tag to preserve spacing.
I'm not sure that doing an unpack backwards and then shifting down is advantageous. It's bad on a Pentium M / Core 2 because of the false loop carried dependency; it's bad on the Atom because it requires two FP0 slots. FP0 slots are precious because integer shift/multiply ops go there. MOVQ can go in either pipe, so you're probably better off with MOVQ/PUNPCKLBW. You do need zero in a register, but you should have a free FP1 slot somewhere near the end of the loop to create it.
You can gain a bit in the multiply section by rearranging the instructions a bit. For instance, you should move PADDD MM1, MM4 up above PADDD MM1, MM3 since MM1 will be done a clock sooner and doesn't need to wait for MM4.
Since you're optimizing for Atom, you should consider using SSE/SSE2/SSSE3 instructions. With SSE, you can use PSHUFW to speed up the replication section slightly. Using SSE2 and XMM registers will give you a major speed boost as Atom handles many integer vector operations just as fast with 128-bit vectors (XMM) as with 64-bit vectors (MMX). With SSSE3, you can use PMADDUBSW if you're willing to sacrifice some precision. |
 |
|