Image: array [0..250, 0..255] of byte;
Each byte contains a value in the range 0..255 with zero denoting black, 255 representing white, and the other values representing even shades of gray between these two extremes.
The blurring algorithm averages a pixel with its eight closest neighbors. A single blur operation applies this average to all interior pixels of an image (that is, it does not apply to the pixels on the boundary of the image because they do not have the same number of neighbors as the other pixels). The following Pascal program implements the blurring algorithm and lets the user specify the amount of blurring (by looping through the algorithm the number of times the user specifies):
program PhotoFilter(input,output); (* Here is the raw file data type produced by the Photoshop program *) type image = array [0..250] of array [0..255] of byte; (* The variables we will use. Note that the "datain" and "dataout" *) (* variables are pointers because Turbo Pascal will not allow us to *) (* allocate more than 64K data in the one global data segment it *) (* supports. *) var h,i,j,k,l,sum,iterations:integer; datain, dataout: ^image; f,g:file of image; begin (* Open the files and real the input data *) assign(f, 'roller1.raw'); assign(g, 'roller2.raw'); reset(f); rewrite(g); new(datain); new(dataout); read(f,datain^); (* Get the number of iterations from the user *) write('Enter number of iterations:'); readln(iterations); writeln('Computing result'); (* Copy the data from the input array to the output array. *) (* This is a really lame way to copy the border from the *) (* input array to the output array. *) for i := 0 to 250 do for j := 0 to 255 do dataout^ [i][j] := datain^ [i][j]; (* Okay, here's where all the work takes place. The outside *) (* loop repeats this blurring operation the number of *) (* iterations specified by the user. *) for h := 1 to iterations do begin (* For each row except the first and the last, compute *) (* a new value for each element. *) for i := 1 to 249 do (* For each column except the first and the last, com- *) (* pute a new value for each element. *) for j := 1 to 254 do begin (* For each element in the array, compute a new blurred value by adding up the eight cells around an array element along with eight times the current cell's value. Then divide this by sixteen to compute a weighted average of the nine cells forming a square around the current cell. The current cell has a 50% weighting, the other eight cells around the current cel provide the other 50% weighting (6.25% each). *) sum := 0; for k := -1 to 1 do for l := -1 to 1 do sum := sum + datain^ [i+k][j+l]; (* Sum currently contains the sum of the nine *) (* cells, add in seven times the current cell so *) (* we get a total of eight times the current cell. *) dataout^ [i][j] := (sum + datain^ [i][j]*7) div 16; end; (* Copy the output cell values back to the input cells *) (* so we can perform the blurring on this new data on *) (* the next iteration. *) for i := 0 to 250 do for j := 0 to 255 do datain^ [i][j] := dataout^ [i][j]; end; writeln('Writing result'); write(g,dataout^); close(f); close(g); end.
The Pascal program above, compiled with Turbo Pascal v7.0, takes 45 seconds to compute 100 iterations of the blurring algorithm. A comparable program written in C and compiled with Borland C++ v4.02 takes 29 seconds to run. The same source file compiled with Microsoft C++ v8.00 runs in 21 seconds. Obviously the C compilers produce better code than Turbo Pascal. It took about three hours to get the Pascal version running and tested. The C versions took about another hour to code and test. The following two images provide a "before" and "after" example of this program's function:
Before blurring:
After blurring (10 iterations):
The following is a crude translation from Pascal directly into assembly language of the above program. It requires 36 seconds to run. Yes, the C compilers did a better job, but once you see how bad this code is, you'll wonder what it is that Turbo Pascal is doing to run so slow. It took about an hour to translate the Pascal version into this assembly code and debug it to the point it produced the same output as the Pascal version.
; IMGPRCS.ASM ; ; An image processing program. ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K), the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; ; Performance comparisons (66 MHz 80486 DX/2 system). ; ; This code- 36 seconds. ; Borland Pascal v7.0- 45 seconds. ; Borland C++ v4.02- 29 seconds. ; Microsoft C++ v8.00- 21 seconds. .xlist include stdlib.a includelib stdlib.lib .list .286 dseg segment para public 'data' ; Loop control variables and other variables: h word ? i word ? j word ? k word ? l word ? sum word ? iterations word ? ; File names: InName byte "roller1.raw",0 OutName byte "roller2.raw",0 dseg ends ; Here is the input data that we operate on. InSeg segment para public 'indata' DataIn byte 251 dup (256 dup (?)) InSeg ends ; Here is the output array that holds the result. OutSeg segment para public 'outdata' DataOut byte 251 dup (256 dup (?)) OutSeg ends cseg segment para public 'code' assume cs:cseg, ds:dseg Main proc mov ax, dseg mov ds, ax meminit mov ax, 3d00h ;Open input file for reading. lea dx, InName int 21h jnc GoodOpen print byte "Could not open input file.",cr,lf,0 jmp Quit GoodOpen: mov bx, ax ;File handle. mov dx, InSeg ;Where to put the data. mov ds, dx lea dx, DataIn mov cx, 256*251 ;Size of data file to read. mov ah, 3Fh int 21h cmp ax, 256*251 ;See if we read the data. je GoodRead print byte "Did not read the file properly",cr,lf,0 jmp Quit GoodRead: mov ax, dseg mov ds, ax print byte "Enter number of iterations: ",0 getsm atoi free mov iterations, ax print byte "Computing Result",cr,lf,0 ; Copy the input data to the output buffer. mov i, 0 iloop0: cmp i, 250 ja iDone0 mov j, 0 jloop0: cmp j, 255 ja jDone0 mov bx, i ;Compute index into both shl bx, 8 ; arrays using the formula add bx, j ; i*256+j (row major). mov cx, InSeg ;Point at input segment. mov es, cx mov al, es:DataIn[bx] ;Get DataIn[i][j]. mov cx, OutSeg ;Point at output segment. mov es, cx mov es:DataOut[bx], al ;Store into DataOut[i][j] inc j ;Next iteration of j loop. jmp jloop0 jDone0: inc i ;Next iteration of i loop. jmp iloop0 iDone0: ; for h := 1 to iterations- mov h, 1 hloop: mov ax, h cmp ax, iterations ja hloopDone ; for i := 1 to 249 - mov i, 1 iloop: cmp i, 249 ja iloopDone ; for j := 1 to 254 - mov j, 1 jloop: cmp j, 254 ja jloopDone ; sum := 0; ; for k := -1 to 1 do for l := -1 to 1 do mov ax, InSeg ;Gain access to InSeg. mov es, ax mov sum, 0 mov k, -1 kloop: cmp k, 1 jg kloopDone mov l, -1 lloop: cmp l, 1 jg lloopDone ; sum := sum + datain [i+k][j+l] mov bx, i add bx, k shl bx, 8 ;Multiply by 256. add bx, j add bx, l mov al, es:DataIn[bx] mov ah, 0 add Sum, ax inc l jmp lloop lloopDone: inc k jmp kloop ; dataout [i][j] := (sum + datain[i][j]*7) div 16; kloopDone: mov bx, i shl bx, 8 ;*256 add bx, j mov al, es:DataIn[bx] mov ah, 0 imul ax, 7 add ax, sum shr ax, 4 ;div 16 mov bx, OutSeg mov es, bx mov bx, i shl bx, 8 add bx, j mov es:DataOut[bx], al inc j jmp jloop jloopDone: inc i jmp iloop iloopDone: ; Copy the output data to the input buffer. mov i, 0 iloop1: cmp i, 250 ja iDone1 mov j, 0 jloop1: cmp j, 255 ja jDone1 mov bx, i ;Compute index into both shl bx, 8 ; arrays using the formula add bx, j ; i*256+j (row major). mov cx, OutSeg ;Point at input segment. mov es, cx mov al, es:DataOut[bx] ;Get DataIn[i][j]. mov cx, InSeg ;Point at output segment. mov es, cx mov es:DataIn[bx], al ;Store into DataOut[i][j] inc j ;Next iteration of j loop. jmp jloop1 jDone1: inc i ;Next iteration of i loop. jmp iloop1 iDone1: inc h jmp hloop hloopDone: print byte "Writing result",cr,lf,0 ; Okay, write the data to the output file: mov ah, 3ch ;Create output file. mov cx, 0 ;Normal file attributes. lea dx, OutName int 21h jnc GoodCreate print byte "Could not create output file.",cr,lf,0 jmp Quit GoodCreate: mov bx, ax ;File handle. push bx mov dx, OutSeg ;Where the data can be found. mov ds, dx lea dx, DataOut mov cx, 256*251 ;Size of data file to write. mov ah, 40h ;Write operation. int 21h pop bx ;Retrieve handle for close. cmp ax, 256*251 ;See if we wrote the data. je GoodWrite print byte "Did not write the file properly",cr,lf,0 jmp Quit GoodWrite: mov ah, 3eh ;Close operation. int 21h Quit: ExitPgm ;DOS macro to quit program. Main endp cseg ends sseg segment para stack 'stack' stk byte 1024 dup ("stack ") sseg ends zzzzzzseg segment para public 'zzzzzz' LastBytes byte 16 dup (?) zzzzzzseg ends end Main
This assembly code is a very straight-forward, line by line translation of the previous Pascal code. Even beginning programmers (who've read and understand Chapters Eight and Nine) should easily be able to improve the performance of this code.
While we could run a profiler on this program to determine where the "hot spots" are in this code, a little analysis, particularly of the Pascal version, should make it obvious that there are a lot of nested loops in this code. As Chapter Ten points out, when optimizing code you should always start with the innermost loops. The major change between the code above and the following assembly language version is that we've unrolled the innermost loops and we've replaced the array index computations with some constant computations. These minor changes speed up the execution by a factor of six! The assembly version now runs in six seconds rather than 36. A Microsoft C++ version of the same program with comparable optimzations runs in eight seconds. It required nearly four hours to develop, test, and debug this code. It required an additional hour to apply these same modifications to the C version.
; IMGPRCS2.ASM ; ; An image processing program (First optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K), the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; ; ; Performance comparisons (66 MHz 80486 DX/2 system). ; ; This code- 6 seconds. ; Original ASM code- 36 seconds. ; Borland Pascal v7.0- 45 seconds. ; Borland C++ v4.02- 29 seconds. ; Microsoft C++ v8.00- 21 seconds. ; // Lots of omitted code goes here, see the previous version// print byte "Computing Result",cr,lf,0 ; for h := 1 to iterations- mov h, 1 hloop: ; Copy the input data to the output buffer. ; Optimization step #1: Replace with movs instruction. push ds mov ax, OutSeg mov ds, ax mov ax, InSeg mov es, ax lea si, DataOut lea di, DataIn mov cx, (251*256)/4 rep movsd pop ds ; Optimization Step #1: Convert loops to repeat..until form. ; for i := 1 to 249 - mov i, 1 iloop: ; for j := 1 to 254 - mov j, 1 jloop: ; Optimization. Unroll the innermost two loops: mov bh, byte ptr i ;i is always less than 256. mov bl, byte ptr j ;Computes i*256+j! push ds mov ax, InSeg ;Gain access to InSeg. mov ds, ax mov cx, 0 ;Compute sum here. mov ah, ch mov cl, ds:DataIn[bx-257] ;DataIn[i-1][j-1] mov al, ds:DataIn[bx-256] ;DataIn[i-1][j] add cx, ax mov al, ds:DataIn[bx-255] ;DataIn[i-1][j+1] add cx, ax mov al, ds:DataIn[bx-1] ;DataIn[i][j-1] add cx, ax mov al, ds:DataIn[bx+1] ;DataIn[i][j+1] add cx, ax mov al, ds:DataIn[bx+255] ;DataIn[i+1][j-1] add cx, ax mov al, ds:DataIn[bx+256] ;DataIn[i+1][j] add cx, ax mov al, ds:DataIn[bx+257] ;DataIn[i+1][j+1] add cx, ax mov al, ds:DataIn[bx] ;DataIn[i][j] shl ax, 3 ;DataIn[i][j]*8 add cx, ax shr cx, 4 ;Divide by 16 mov ax, OutSeg mov ds, ax mov ds:DataOut[bx], cl pop ds inc j cmp j, 254 jbe jloop inc i cmp i, 249 jbe iloop inc h mov ax, h cmp ax, Iterations jnbe Done jmp hloop Done: print byte "Writing result",cr,lf,0 ; //More omitted code goes here, see the previous version//
The second version above still uses memory variables for most computations. The optimizations applied to the original code were mainly language-independent optimizations. The next step was to begin applying some assembly language specific optimizations to the code. The first optimization we need to do is to move as many variables as possible into the 80x86's register set. The following code provides this optimization. Although this only improves the running time by 2 seconds, that is a 33% improvement (six seconds down to four)!
; IMGPRCS.ASM ; ; An image processing program (Second optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K), the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; Version #3: Used registers for all variables. Set up segment registers ; once and for all through the execution of the main loop so ; the code didn't have to reload ds each time through. Computed ; index into each row only once (outside the j loop). ; ; ; Performance comparisons (66 MHz 80486 DX/2 system). ; ; This code- 4 seconds. ; 1st optimization pass- 6 seconds. ; Original ASM code- 36 seconds. ; <<Lots of delete code goes here» print byte "Computing Result",cr,lf,0 ; Copy the input data to the output buffer. hloop: mov ax, InSeg mov es, ax mov ax, OutSeg mov ds, ax lea si, DataOut lea di, DataIn mov cx, (251*256)/4 rep movsd assume ds:InSeg, es:OutSeg mov ax, InSeg mov ds, ax mov ax, OutSeg mov es, ax mov cl, 249 iloop: mov bh, cl ;i*256 mov bl, 1 ;Start at j=1. mov ch, 254 ;# of times through loop. jloop: mov dx, 0 ;Compute sum here. mov ah, dh mov dl, DataIn[bx-257] ;DataIn[i-1][j-1] mov al, DataIn[bx-256] ;DataIn[i-1][j] add dx, ax mov al, DataIn[bx-255] ;DataIn[i-1][j+1] add dx, ax mov al, DataIn[bx-1] ;DataIn[i][j-1] add dx, ax mov al, DataIn[bx+1] ;DataIn[i][j+1] add dx, ax mov al, DataIn[bx+255] ;DataIn[i+1][j-1] add dx, ax mov al, DataIn[bx+256] ;DataIn[i+1][j] add dx, ax mov al, DataIn[bx+257] ;DataIn[i+1][j+1] add dx, ax mov al, DataIn[bx] ;DataIn[i][j] shl ax, 3 ;DataIn[i][j]*8 add dx, ax shr dx, 4 ;Divide by 16 mov DataOut[bx], dl inc bx dec ch jne jloop dec cl jne iloop dec bp jne hloop Done: print byte "Writing result",cr,lf,0 ; //More deleted code goes here, see the original version//
Note that on each iteration, the code above still copies the output data back to the input data. That's almost six and a half megabytes of data movement for 100 iterations! The following version of the blurring program unrolls the hloop
twice. The first occurrence copies the data from DataIn to DataOut while computing the blur, the second instance copies the data from DataOut back to DataIn while blurring the image. By using these two code sequences, the program save copying the data from one point to another. This version also maintains some common computations between two adjacent cells to save a few instructions in the innermost loop. This version arranges instructions in the innermost loop to help avoid data hazards on 80486 and later processors. The end result is almost 40% faster than the previous version (down to 2.5 seconds from four seconds).
; IMGPRCS.ASM ; ; An image processing program (Third optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K), the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; ; Version #3: Used registers for all variables. Set up segment registers ; once and for all through the execution of the main loop so ; the code didn't have to reload ds each time through. Computed ; index into each row only once (outside the j loop). ; ; Version #4: Eliminated copying data from DataOut to DataIn on each pass. ; Removed hazards. Maintained common subexpressions. Did some ; more loop unrolling. ; ; ; Performance comparisons (66 MHz 80486 DX/2 system, 100 iterations). ; ; This code- 2.5 seconds. ; 2nd optimization pass- 4 seconds. ; 1st optimization pass- 6 seconds. ; Original ASM code- 36 seconds. ; <<Lots of deleted code here, see the original version» print byte "Computing Result",cr,lf,0 assume ds:InSeg, es:OutSeg mov ax, InSeg mov ds, ax mov ax, OutSeg mov es, ax ; Copy the data once so we get the edges in both arrays. mov cx, (251*256)/4 lea si, DataIn lea di, DataOut rep movsd ; "hloop" repeats once for each iteration. hloop: mov ax, InSeg mov ds, ax mov ax, OutSeg mov es, ax ; "iloop" processes the rows in the matrices. mov cl, 249 iloop: mov bh, cl ;i*256 mov bl, 1 ;Start at j=1. mov ch, 254/2 ;# of times through loop. mov si, bx mov dh, 0 ;Compute sum here. mov bh, 0 mov ah, 0 ; "jloop" processes the individual elements of the array. ; This loop has been unrolled once to allow the two portions to share ; some common computations. jloop: ; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] + ; DataIn [i+1][j+1] will be used in the second half of this computation. ; So save its value in a register (di) until we need it again. mov dl, DataIn[si-256] ;[i-1,j] mov al, DataIn[si-255] ;[i-1,j+1] mov bl, DataIn[si+257] ;[i+1,j+1] add dx, ax mov al, DataIn[si+256] ;[I+1,j] add dx, bx mov bl, DataIn[si+1] ;[i,j+1] add dx, ax mov al, DataIn[si+255] ;[i+1,j-1] mov di, dx ;Save partial result. add dx, bx mov bl, DataIn[si-1] ;[i,j-1] add dx, ax mov al, DataIn[si] ;[i,j] add dx, bx mov bl, DataIn[si-257] ;[i-1,j-1] shl ax, 3 ;DataIn[i,j] * 8. add dx, bx add dx, ax shr ax, 3 ;Restore DataIn[i,j]. shr dx, 4 ;Divide by 16. add di, ax mov DataOut[si], dl ; Okay, process the next cell over. Note that we've got a partial sum ; sitting in DI already. Don't forget, we haven't bumped SI at this point, ; so the offsets are off by one. (This is the second half of the unrolled ; loop.) mov dx, di ;Partial sum. mov bl, DataIn[si-254] ;[i-1,j+1] mov al, DataIn[si+2] ;[i,j+1] add dx, bx mov bl, DataIn[si+258] ;[i+1,j+1]; add dx, ax mov al, DataIn[si+1] ;[i,j] add dx, bx shl ax, 3 ;DataIn[i][j]*8 add si, 2 ;Bump array index. add dx, ax mov ah, 0 ;Clear for next iter. shr dx, 4 ;Divide by 16 dec ch mov DataOut[si-1], dl jne jloop dec cl jne iloop dec bp je Done ; Special case so we don't have to move the data between the two arrays. ; This is an unrolled version of the hloop that swaps the input and output ; arrays so we don't have to move data around in memory. mov ax, OutSeg mov ds, ax mov ax, InSeg mov es, ax assume es:InSeg, ds:OutSeg hloop2: mov cl, 249 iloop2: mov bh, cl mov bl, 1 mov ch, 254/2 mov si, bx mov dh, 0 mov bh, 0 mov ah, 0 jloop2: mov dl, DataOut[si-256] mov al, DataOut[si-255] mov bl, DataOut[si+257] add dx, ax mov al, DataOut[si+256] add dx, bx mov bl, DataOut[si+1] add dx, ax mov al, DataOut[si+255] mov di, dx add dx, bx mov bl, DataOut[si-1] add dx, ax mov al, DataOut[si] add dx, bx mov bl, DataOut[si-257] shl ax, 3 add dx, bx add dx, ax shr ax, 3 shr dx, 4 mov DataIn[si], dl mov dx, di mov bl, DataOut[si-254] add dx, ax mov al, DataOut[si+2] add dx, bx mov bl, DataOut[si+258] add dx, ax mov al, DataOut[si+1] add dx, bx shl ax, 3 add si, 2 add dx, ax mov ah, 0 shr dx, 4 dec ch mov DataIn[si-1], dl jne jloop2 dec cl jne iloop2 dec bp je Done2 jmp hloop ; Kludge to guarantee that the data always resides in the output segment. Done2: mov ax, InSeg mov ds, ax mov ax, OutSeg mov es, ax mov cx, (251*256)/4 lea si, DataIn lea di, DataOut rep movsd Done: print byte "Writing result",cr,lf,0 ; //Lots of deleted code here, see the original program//
This code provides a good example of the kind of optimization that scares a lot of people. There is a lot of cycle counting, instruction scheduling, and other crazy stuff that makes program very difficult to read and understand. This is the kind of optimization for which assembly language programmers are famous; the stuff that spawned the phrase "never optimize early." You should never try this type of optimization until you feel you've exhausted all other possibilities. Once you write your code in this fashion, it is going to be very difficult to make further changes to it. By the way, the above code took about 15 hours to develop and debug (debugging took the most time). That works out to a 0.1 second improvement (for 100 iterations) for each hour of work. Although this code certainly isn't optimal yet, it is difficult to justify more time attempting to improve this code by mechanical means (e.g., moving instructions around, etc.) because the performance gains would be so little.
In the four steps above, we've reduced the running time of the assembly code from 36 seconds down to 2.5 seconds. Quite an impressive feat. However, you shouldn't get the idea that this was easy or even that there were only four steps involved. During the actual development of this example, there were many attempts that did not improve performance (in fact, some modifications wound up reducing performance) and others did not improve performance enough to justify their inclusion. Just to demonstrate this last point, the following code included a major change in the way the program organized data. The main loop operates on 16 bit objects in memory rather than eight bit objects. On some machines with large external caches (256K or better) this algorithm provides a slight improvement in performance (2.4 seconds, down from 2.5). However, on other machines it runs slower. Therefore, this code was not chosen as the final implementation:
; IMGPRCS.ASM ; ; An image processing program (Fourth optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K), the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; ; Version #3: Used registers for all variables. Set up segment registers ; once and for all through the execution of the main loop so ; the code didn't have to reload ds each time through. Computed ; index into each row only once (outside the j loop). ; ; Version #4: Eliminated copying data from DataOut to DataIn on each pass. ; Removed hazards. Maintained common subexpressions. Did some ; more loop unrolling. ; ; Version #5: Converted data arrays to words rather than bytes and operated ; on 16-bit values. Yielded minimal speedup. ; ; Performance comparisons (66 MHz 80486 DX/2 system). ; ; This code- 2.4 seconds. ; 3rd optimization pass- 2.5 seconds. ; 2nd optimization pass- 4 seconds. ; 1st optimization pass- 6 seconds. ; Original ASM code- 36 seconds. .xlist include stdlib.a includelib stdlib.lib .list .386 option segment:use16 dseg segment para public 'data' ImgData byte 251 dup (256 dup (?)) InName byte "roller1.raw",0 OutName byte "roller2.raw",0 Iterations word 0 dseg ends ; This code makes the naughty assumption that the following ; segments are loaded contiguously in memory! Also, because these ; segments are paragraph aligned, this code assumes that these segments ; will contain a full 65,536 bytes. You cannot declare a segment with ; exactly 65,536 bytes in MASM. However, the paragraph alignment option ; ensures that the extra byte of padding is added to the end of each ; segment. DataSeg1 segment para public 'ds1' Data1a byte 65535 dup (?) DataSeg1 ends DataSeg2 segment para public 'ds2' Data1b byte 65535 dup (?) DataSeg2 ends DataSeg3 segment para public 'ds3' Data2a byte 65535 dup (?) DataSeg3 ends DataSeg4 segment para public 'ds4' Data2b byte 65535 dup (?) DataSeg4 ends cseg segment para public 'code' assume cs:cseg, ds:dseg Main proc mov ax, dseg mov ds, ax meminit mov ax, 3d00h ;Open input file for reading. lea dx, InName int 21h jnc GoodOpen print byte "Could not open input file.",cr,lf,0 jmp Quit GoodOpen: mov bx, ax ;File handle. lea dx, ImgData mov cx, 256*251 ;Size of data file to read. mov ah, 3Fh int 21h cmp ax, 256*251 ;See if we read the data. je GoodRead print byte "Did not read the file properly",cr,lf,0 jmp Quit GoodRead: print byte "Enter number of iterations: ",0 getsm atoi free mov Iterations, ax cmp ax, 0 jle Quit printf byte "Computing Result for %d iterations",cr,lf,0 dword Iterations ; Copy the data and expand it from eight bits to sixteen bits. ; The first loop handles the first 32,768 bytes, the second loop ; handles the remaining bytes. mov ax, DataSeg1 mov es, ax mov ax, DataSeg3 mov fs, ax mov ah, 0 mov cx, 32768 lea si, ImgData xor di, di ;Output data is at ofs zero. CopyLoop: lodsb ;Read a byte mov fs:[di], ax ;Store a word in DataSeg3 stosw ;Store a word in DataSeg1 dec cx jne CopyLoop mov di, DataSeg2 mov es, di mov di, DataSeg4 mov fs, di mov cx, (251*256) - 32768 xor di, di CopyLoop1: lodsb ;Read a byte mov fs:[di], ax ;Store a word in DataSeg4 stosw ;Store a word in DataSeg2 dec cx jne CopyLoop1 ; hloop completes one iteration on the data moving it from Data1a/Data1b ; to Data2a/Data2b hloop: mov ax, DataSeg1 mov ds, ax mov ax, DataSeg3 mov es, ax ; Process the first 127 rows (65,024 bytes) of the array): mov cl, 127 lea si, Data1a+202h ;Start at [1,1] iloop0: mov ch, 254/2 ;# of times through loop. jloop0: mov dx, [si] ;[i,j] mov bx, [si-200h] ;[i-1,j] mov ax, dx shl dx, 3 ;[i,j] * 8 add bx, [si-1feh] ;[i-1,j+1] mov bp, [si+2] ;[i,j+1] add bx, [si+200h] ;[i+1,j] add dx, bp add bx, [si+202h] ;[i+1,j+1] add dx, [si-202h] ;[i-1,j-1] mov di, [si-1fch] ;[i-1,j+2] add dx, [si-2] ;[i,j-1] add di, [si+4] ;[i,j+2] add dx, [si+1feh] ;[i+1,j-1] add di, [si+204h] ;[i+1,j+2] shl bp, 3 ;[i,j+1] * 8 add dx, bx add bp, ax shr dx, 4 ;Divide by 16. add bp, bx mov es:[si], dx ;Store [i,j] entry. add bp, di add si, 4 ;Affects next store operation! shr bp, 4 ;Divide by 16. dec ch mov es:[si-2], bp ;Store [i,j+1] entry. jne jloop0 add si, 4 ;Skip to start of next row. dec cl jne iloop0 ; Process the last 124 rows of the array). This requires that we switch from ; one segment to the next. Note that the segments overlap. mov ax, DataSeg2 sub ax, 40h ;Back up to last 2 rows in DS2 mov ds, ax mov ax, DataSeg4 sub ax, 40h ;Back up to last 2 rows in DS4 mov es, ax mov cl, 251-127-1 ;Remaining rows to process. mov si, 202h ;Continue with next row. iloop1: mov ch, 254/2 ;# of times through loop. jloop1: mov dx, [si] ;[i,j] mov bx, [si-200h] ;[i-1,j] mov ax, dx shl dx, 3 ;[i,j] * 8 add bx, [si-1feh] ;[i-1,j+1] mov bp, [si+2] ;[i,j+1] add bx, [si+200h] ;[i+1,j] add dx, bp add bx, [si+202h] ;[i+1,j+1] add dx, [si-202h] ;[i-1,j-1] mov di, [si-1fch] ;[i-1,j+2] add dx, [si-2] ;[i,j-1] add di, [si+4] ;[i,j+2] add dx, [si+1feh] ;[i+1,j-1] add di, [si+204h] ;[i+1,j+2] shl bp, 3 ;[i,j+1] * 8 add dx, bx add bp, ax shr dx, 4 ;Divide by 16 add bp, bx mov es:[si], dx ;Store [i,j] entry. add bp, di add si, 4 ;Affects next store operation! shr bp, 4 dec ch mov es:[si-2], bp ;Store [i,j+1] entry. jne jloop1 add si, 4 ;Skip to start of next row. dec cl jne iloop1 mov ax, dseg mov ds, ax assume ds:dseg dec Iterations je Done0 ; Unroll the iterations loop so we can move the data from DataSeg2/4 back ; to DataSeg1/3 without wasting extra time. Other than the direction of the ; data movement, this code is virtually identical to the above. mov ax, DataSeg3 mov ds, ax mov ax, DataSeg1 mov es, ax mov cl, 127 lea si, Data1a+202h iloop2: mov ch, 254/2 jloop2: mov dx, [si] mov bx, [si-200h] mov ax, dx shl dx, 3 add bx, [si-1feh] mov bp, [si+2] add bx, [si+200h] add dx, bp add bx, [si+202h] add dx, [si-202h] mov di, [si-1fch] add dx, [si-2] add di, [si+4] add dx, [si+1feh] add di, [si+204h] shl bp, 3 add dx, bx add bp, ax shr dx, 4 add bp, bx mov es:[si], dx add bp, di add si, 4 shr bp, 4 dec ch mov es:[si-2], bp jne jloop2 add si, 4 dec cl jne iloop2 mov ax, DataSeg4 sub ax, 40h mov ds, ax mov ax, DataSeg2 sub ax, 40h mov es, ax mov cl, 251-127-1 mov si, 202h iloop3: mov ch, 254/2 jloop3: mov dx, [si] mov bx, [si-200h] mov ax, dx shl dx, 3 add bx, [si-1feh] mov bp, [si+2] add bx, [si+200h] add dx, bp add bx, [si+202h] add dx, [si-202h] mov di, [si-1fch] add dx, [si-2] add di, [si+4] add dx, [si+1feh] add di, [si+204h] shl bp, 3 add dx, bx add bp, ax shr dx, 4 add bp, bx mov es:[si], dx add bp, di add si, 4 shr bp, 4 dec ch mov es:[si-2], bp jne jloop3 add si, 4 dec cl jne iloop3 mov ax, dseg mov ds, ax assume ds:dseg dec Iterations je Done2 jmp hloop Done2: mov ax, DataSeg1 mov bx, DataSeg2 jmp Finish Done0: mov ax, DataSeg3 mov bx, DataSeg4 Finish: mov ds, ax print byte "Writing result",cr,lf,0 ; Convert data back to byte form and write to the output file: mov ax, dseg mov es, ax mov cx, 32768 lea di, ImgData xor si, si ;Output data is at offset zero. CopyLoop3: lodsw ;Read a word from final array. stosb ;Write a byte to output array. dec cx jne CopyLoop3 mov ds, bx mov cx, (251*256) - 32768 xor si, si CopyLoop4: lodsw ;Read final data word. stosb ;Write data byte to output array. dec cx jne CopyLoop4 ; Okay, write the data to the output file: mov ah, 3ch ;Create output file. mov cx, 0 ;Normal file attributes. mov dx, dseg mov ds, dx lea dx, OutName int 21h jnc GoodCreate print byte "Could not create output file.",cr,lf,0 jmp Quit GoodCreate: mov bx, ax ;File handle. push bx mov dx, dseg ;Where the data can be found. mov ds, dx lea dx, ImgData mov cx, 256*251 ;Size of data file to write. mov ah, 40h ;Write operation. int 21h pop bx ;Retrieve handle for close. cmp ax, 256*251 ;See if we wrote the data. je GoodWrite print byte "Did not write the file properly",cr,lf,0 jmp Quit GoodWrite: mov ah, 3eh ;Close operation. int 21h Quit: ExitPgm ;DOS macro to quit program. Main endp cseg ends sseg segment para stack 'stack' stk byte 1024 dup ("stack ") sseg ends zzzzzzseg segment para public 'zzzzzz' LastBytes byte 16 dup (?) zzzzzzseg ends end Main
Of course, the absolute best way to improve the performance of any piece of code is with a better algorithm. All of the above assembly language versions were limited by a single requirement - they all must produce the same output file as the original Pascal program. Often, programmers lose sight of what it is that they are trying to accomplish and get so caught up in the computations they are performing that they fail to see other possibilities. The optimization example above is a perfect example. The assembly code faithfully preserves the semantics of the original Pascal program; it computes the weighted average of all interior pixels as the sum of the eight neighbors around a pixel plus eight times the current pixel's value, with the entire sum divided by 16. Now this is a good blurring function, but it is not the only blurring function. A Photoshop (or other image processing program) user doesn't care about algorithms or such. When that user selects "blur image" they want it to go out of focus. Exactly how much out of focus is generally immaterial. In fact, the less the better because the user can always run the blur algorithm again (or specify some number of iterations). The following assembly language program shows how to get better performance by modifying the blurring algorithm to reduce the number of instructions it needs to execute in the innermost loops. It computes blurring by averaging a pixel with the four neighbors above, below, to the left, and to the right of the current pixel. This modification yields a program that runs 100 iterations in 2.2 seconds, a 12% improvement over the previous version:
; IMGPRCS.ASM ; ; An image processing program (Fifth optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K), the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; ; Version #3: Used registers for all variables. Set up segment registers ; once and for all through the execution of the main loop so ; the code didn't have to reload ds each time through. Computed ; index into each row only once (outside the j loop). ; ; Version #4: Eliminated copying data from DataOut to DataIn on each pass. ; Removed hazards. Maintained common subexpressions. Did some ; more loop unrolling. ; ; Version #6: Changed the blurring algorithm to use fewer computations. ; This version does *NOT* produce the same data as the other ; programs. ; ; ; Performance comparisons (66 MHz 80486 DX/2 system, 100 iterations). ; ; This code- 2.2 seconds. ; 3rd optmization pass- 2.5 seconds. ; 2nd optimization pass- 4 seconds. ; 1st optimization pass- 6 seconds. ; Original ASM code- 36 seconds. ; <<Lots of deleted code here, see the original program» print byte "Computing Result",cr,lf,0 assume ds:InSeg, es:OutSeg mov ax, InSeg mov ds, ax mov ax, OutSeg mov es, ax ; Copy the data once so we get the edges in both arrays. mov cx, (251*256)/4 lea si, DataIn lea di, DataOut rep movsd ; "hloop" repeats once for each iteration. hloop: mov ax, InSeg mov ds, ax mov ax, OutSeg mov es, ax ; "iloop" processes the rows in the matrices. mov cl, 249 iloop: mov bh, cl ;i*256 mov bl, 1 ;Start at j=1. mov ch, 254/2 ;# of times through loop. mov si, bx mov dh, 0 ;Compute sum here. mov bh, 0 mov ah, 0 ; "jloop" processes the individual elements of the array. ; This loop has been unrolled once to allow the two portions to share ; some common computations. jloop: ; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] + ; DataIn [i+1][j+1] will be used in the second half of this computation. ; So save its value in a register (di) until we need it again. mov dl, DataIn[si] ;[i,j] mov al, DataIn[si-256] ;[I-1,j] shl dx, 2 ;[i,j]*4 mov bl, DataIn[si-1] ;[i,j-1] add dx, ax mov al, DataIn[si+1] ;[i,j+1] add dx, bx mov bl, DataIn[si+256] ;[i+1,j] add dx, ax shl ax, 2 ;[i,j+1]*4 add dx, bx mov bl, DataIn[si-255] ;[i-1,j+1] shr dx, 3 ;Divide by 8. add ax, bx mov DataOut[si], dl mov bl, DataIn[si+2] ;[i,j+2] mov dl, DataIn[si+257] ;[i+1,j+1] add ax, bx mov bl, DataIn[si] ;[i,j] add ax, dx add ax, bx shr ax, 3 dec ch mov DataOut[si+1], al jne jloop dec cl jne iloop dec bp je Done ; Special case so we don't have to move the data between the two arrays. ; This is an unrolled version of the hloop that swaps the input and output ; arrays so we don't have to move data around in memory. mov ax, OutSeg mov ds, ax mov ax, InSeg mov es, ax assume es:InSeg, ds:OutSeg hloop2: mov cl, 249 iloop2: mov bh, cl mov bl, 1 mov ch, 254/2 mov si, bx mov dh, 0 mov bh, 0 mov ah, 0 jloop2: mov dl, DataOut[si-256] mov al, DataOut[si-255] mov bl, DataOut[si+257] add dx, ax mov al, DataOut[si+256] add dx, bx mov bl, DataOut[si+1] add dx, ax mov al, DataOut[si+255] mov di, dx add dx, bx mov bl, DataOut[si-1] add dx, ax mov al, DataOut[si] add dx, bx mov bl, DataOut[si-257] shl ax, 3 add dx, bx add dx, ax shr ax, 3 shr dx, 4 mov DataIn[si], dl mov dx, di mov bl, DataOut[si-254] add dx, ax mov al, DataOut[si+2] add dx, bx mov bl, DataOut[si+258] add dx, ax mov al, DataOut[si+1] add dx, bx shl ax, 3 add si, 2 add dx, ax mov ah, 0 shr dx, 4 dec ch mov DataIn[si-1], dl jne jloop2 dec cl jne iloop2 dec bp je Done2 jmp hloop ; Kludge to guarantee that the data always resides in the output segment. Done2: mov ax, InSeg mov ds, ax mov ax, OutSeg mov es, ax mov cx, (251*256)/4 lea si, DataIn lea di, DataOut rep movsd Done: print byte "Writing result",cr,lf,0 ; //Lots of delete code here, see the original program//
One very important thing to keep in mind about the codein this section is that we've optimized it for 100 iterations. While it turns out that these optimizations apply equally well to more iterations, this isn't necessarily true for fewer iterations. In particular, if we run only one iteration, any copying of data at the end of the operation will easily consume a large part of the time we save by the optimizations. Since it is very rare for a user to blur an image 100 times in a row, our optimizations may not be as good as we could make them. However, this section does provide a good example of the steps you must go through in order to optimize a given program. One hundred iterations was a good choice for this example because it was easy to measure the running time of all versions of the program. However, you must keep in mind that you should optimize your programs for the expected case, not an arbitrary case.