Art of Assembly: Chapter Twenty-Five

[Chapter Twenty-Five][Previous] [Art of Assembly][Randall Hyde]

Art of Assembly: Chapter Twenty-Five

25.5 - Improving the Implementation of an Algorithm

25.5 Improving the Implementation of an Algorithm

One easy way to partially demonstrate how to optimize a piece of code is to provide an example of some program and the optimization steps you can apply to that program. This section will present a short program that blurs an eight-bit gray scale image. Then, this section will lead though through several optimization steps and show you how to get that program running over 16 times faster.

The following code assumes that you provide it with a file containing a 251x256 gray scale photographic image. The data structure for this file is as follows:









	Image: array [0..250, 0..255] of byte;

Each byte contains a value in the range 0..255 with zero denoting black, 255 representing white, and the other values representing even shades of gray between these two extremes.

The blurring algorithm averages a pixel with its eight closest neighbors. A single blur operation applies this average to all interior pixels of an image (that is, it does not apply to the pixels on the boundary of the image because they do not have the same number of neighbors as the other pixels). The following Pascal program implements the blurring algorithm and lets the user specify the amount of blurring (by looping through the algorithm the number of times the user specifies):









program PhotoFilter(input,output);

(* Here is the raw file data type produced by the Photoshop program *)

type
 image = array [0..250] of array [0..255] of byte;

(* The variables we will use. Note that the "datain" and "dataout" *)
(* variables are pointers because Turbo Pascal will not allow us to *)
(* allocate more than 64K data in the one global data segment it *)
(* supports. *)

var
 h,i,j,k,l,sum,iterations:integer;
 datain, dataout: ^image;
 f,g:file of image;

begin

 (* Open the files and real the input data *)

 assign(f, 'roller1.raw');
 assign(g, 'roller2.raw');
 reset(f);
 rewrite(g);
 new(datain);
 new(dataout);
 read(f,datain^);

 (* Get the number of iterations from the user *)

 write('Enter number of iterations:');
 readln(iterations);

 writeln('Computing result');

 (* Copy the data from the input array to the output array. *)
 (* This is a really lame way to copy the border from the *)
 (* input array to the output array. *)

 for i := 0 to 250 do
        for j := 0 to 255 do
                dataout^ [i][j] := datain^ [i][j];

 (* Okay, here's where all the work takes place. The outside                                                            *)
 (* loop repeats this blurring operation the number of                                                          *)
 (* iterations specified by the user.                                                           *)

 for h := 1 to iterations do begin

        (* For each row except the first and the last, compute                                                  *)
        (* a new value for each element.                                                        *)

        for i := 1 to 249 do

                (* For each column except the first and the last, com-                                                  *)
                (* pute a new value for each element.                                                   *)

                for j := 1 to 254 do begin

                        (* For each element in the array, compute a new
                           blurred value by adding up the eight cells
                           around an array element along with eight times
                           the current cell's value. Then divide this by
                           sixteen to compute a weighted average of the
                           nine cells forming a square around the current
                           cell. The current cell has a 50% weighting,
                           the other eight cells around the current cel
                           provide the other 50% weighting (6.25% each). *)

                        sum := 0;
                        for k := -1 to 1 do
                            for l := -1 to 1 do
                                sum := sum + datain^ [i+k][j+l];

                        (* Sum currently contains the sum of the nine     *)
                        (* cells, add in seven times the current cell so  *)
                        (* we get a total of eight times the current cell. *)

                        dataout^ [i][j] := (sum + datain^ [i][j]*7) div 16;

                end;

                (* Copy the output cell values back to the input cells                                                  *)
                (* so we can perform the blurring on this new data on                                                   *)
                (* the next iteration.                                                  *)

                for i := 0 to 250 do
                    for j := 0 to 255 do
                        datain^ [i][j] := dataout^ [i][j];

 end;

 writeln('Writing result');
 write(g,dataout^);
 close(f);
 close(g);

end.

The Pascal program above, compiled with Turbo Pascal v7.0, takes 45 seconds to compute 100 iterations of the blurring algorithm. A comparable program written in C and compiled with Borland C++ v4.02 takes 29 seconds to run. The same source file compiled with Microsoft C++ v8.00 runs in 21 seconds. Obviously the C compilers produce better code than Turbo Pascal. It took about three hours to get the Pascal version running and tested. The C versions took about another hour to code and test. The following two images provide a "before" and "after" example of this program's function:

Before blurring:

After blurring (10 iterations):

The following is a crude translation from Pascal directly into assembly language of the above program. It requires 36 seconds to run. Yes, the C compilers did a better job, but once you see how bad this code is, you'll wonder what it is that Turbo Pascal is doing to run so slow. It took about an hour to translate the Pascal version into this assembly code and debug it to the point it produced the same output as the Pascal version.









; IMGPRCS.ASM
;
; An image processing program.
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K), the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              36 seconds.
;       Borland Pascal v7.0-                    45 seconds.
;       Borland C++ v4.02-                      29 seconds.
;       Microsoft C++ v8.00-                    21 seconds.

                .xlist
                include         stdlib.a
                includelib      stdlib.lib
                .list
                .286

dseg            segment para public 'data'

; Loop control variables and other variables:

h               word    ?
i               word    ?
j               word    ?
k               word    ?
l               word    ?
sum             word    ?
iterations      word    ?

; File names:

InName          byte    "roller1.raw",0
OutName         byte    "roller2.raw",0

dseg            ends


; Here is the input data that we operate on.

InSeg           segment para public 'indata'

DataIn          byte    251 dup (256 dup (?))

InSeg           ends


; Here is the output array that holds the result.

OutSeg          segment para public 'outdata'

DataOut         byte    251 dup (256 dup (?))

OutSeg          ends




cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg

Main            proc
                mov     ax, dseg
                mov     ds, ax
                meminit

                mov     ax, 3d00h               ;Open input file for reading.
                lea     dx, InName
                int     21h
                jnc     GoodOpen
                print
                byte    "Could not open input file.",cr,lf,0
                jmp     Quit

GoodOpen:       mov     bx, ax                  ;File handle.
                mov     dx, InSeg               ;Where to put the data.
                mov     ds, dx
                lea     dx, DataIn
                mov     cx, 256*251             ;Size of data file to read.
                mov     ah, 3Fh
                int     21h
                cmp     ax, 256*251             ;See if we read the data.
                je      GoodRead
                print
                byte    "Did not read the file properly",cr,lf,0
                jmp     Quit

GoodRead:       mov     ax, dseg
                mov     ds, ax
                print
                byte    "Enter number of iterations: ",0
                getsm
                atoi
                free
                mov     iterations, ax
                print
                byte    "Computing Result",cr,lf,0

; Copy the input data to the output buffer.

                mov     i, 0
iloop0:         cmp     i, 250
                ja      iDone0
                mov     j, 0
jloop0:         cmp     j, 255
                ja      jDone0

                mov     bx, i                   ;Compute index into both
                shl     bx, 8                   ; arrays using the formula
                add     bx, j                   ; i*256+j (row major).

                mov     cx, InSeg               ;Point at input segment.
                mov     es, cx
                mov     al, es:DataIn[bx]       ;Get DataIn[i][j].

                mov     cx, OutSeg              ;Point at output segment.
                mov     es, cx
                mov     es:DataOut[bx], al      ;Store into DataOut[i][j]

                inc     j                       ;Next iteration of j loop.
                jmp     jloop0

jDone0:         inc     i                       ;Next iteration of i loop.
                jmp     iloop0

iDone0:

; for h := 1 to iterations-

                mov     h, 1
hloop:          mov     ax, h
                cmp     ax, iterations
                ja      hloopDone



; for i := 1 to 249 -

                mov     i, 1
iloop:          cmp     i, 249
                ja      iloopDone

; for j := 1 to 254 -
                mov     j, 1
jloop:          cmp     j, 254
                ja      jloopDone


; sum := 0;
; for k := -1 to 1 do for l := -1 to 1 do

                mov     ax, InSeg               ;Gain access to InSeg.
                mov     es, ax

                mov     sum, 0
                mov     k, -1
kloop:          cmp     k, 1
                jg      kloopDone

                mov     l, -1
lloop:          cmp     l, 1
                jg      lloopDone

; sum := sum + datain [i+k][j+l]

                mov     bx, i
                add     bx, k
                shl     bx, 8                   ;Multiply by 256.
                add     bx, j
                add     bx, l

                mov     al, es:DataIn[bx]
                mov     ah, 0
                add     Sum, ax

                inc     l
                jmp     lloop

lloopDone:      inc     k
                jmp     kloop


; dataout [i][j] := (sum + datain[i][j]*7) div 16;

kloopDone:      mov     bx, i
                shl     bx, 8                   ;*256
                add     bx, j
                mov     al, es:DataIn[bx]
                mov     ah, 0
                imul    ax, 7
                add     ax, sum
                shr     ax, 4                   ;div 16

                mov     bx, OutSeg
                mov     es, bx

                mov     bx, i
                shl     bx, 8
                add     bx, j
                mov     es:DataOut[bx], al

                inc     j
                jmp     jloop

jloopDone:              inc     i
                jmp     iloop

iloopDone:
; Copy the output data to the input buffer.

                mov     i, 0
iloop1:         cmp     i, 250
                ja      iDone1
                mov     j, 0
jloop1:         cmp     j, 255
                ja      jDone1

                mov     bx, i                   ;Compute index into both
                shl     bx, 8                   ; arrays using the formula
                add     bx, j                   ; i*256+j (row major).

                mov     cx, OutSeg              ;Point at input segment.
                mov     es, cx
                mov     al, es:DataOut[bx]      ;Get DataIn[i][j].

                mov     cx, InSeg               ;Point at output segment.
                mov     es, cx
                mov     es:DataIn[bx], al       ;Store into DataOut[i][j]

                inc     j                       ;Next iteration of j loop.
                jmp     jloop1

jDone1:         inc     i                       ;Next iteration of i loop.
                jmp     iloop1

iDone1:         inc     h
                jmp     hloop

hloopDone:      print
                byte    "Writing result",cr,lf,0


; Okay, write the data to the output file:

                mov     ah, 3ch         ;Create output file.
                mov     cx, 0           ;Normal file attributes.
                lea     dx, OutName
                int     21h
                jnc     GoodCreate
                print
                byte    "Could not create output file.",cr,lf,0
                jmp     Quit

GoodCreate:     mov     bx, ax          ;File handle.
                push    bx
                mov     dx, OutSeg      ;Where the data can be found.
                mov     ds, dx
                lea     dx, DataOut
                mov     cx, 256*251     ;Size of data file to write.
                mov     ah, 40h         ;Write operation.
                int     21h
                pop     bx              ;Retrieve handle for close.
                cmp     ax, 256*251     ;See if we wrote the data.
                je      GoodWrite
                print
                byte    "Did not write the file properly",cr,lf,0
                jmp     Quit

GoodWrite:      mov     ah, 3eh         ;Close operation.
                int     21h


Quit:           ExitPgm                 ;DOS macro to quit program.
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             byte    1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       byte    16 dup (?)
zzzzzzseg       ends
                end     Main

This assembly code is a very straight-forward, line by line translation of the previous Pascal code. Even beginning programmers (who've read and understand Chapters Eight and Nine) should easily be able to improve the performance of this code.

While we could run a profiler on this program to determine where the "hot spots" are in this code, a little analysis, particularly of the Pascal version, should make it obvious that there are a lot of nested loops in this code. As Chapter Ten points out, when optimizing code you should always start with the innermost loops. The major change between the code above and the following assembly language version is that we've unrolled the innermost loops and we've replaced the array index computations with some constant computations. These minor changes speed up the execution by a factor of six! The assembly version now runs in six seconds rather than 36. A Microsoft C++ version of the same program with comparable optimzations runs in eight seconds. It required nearly four hours to develop, test, and debug this code. It required an additional hour to apply these same modifications to the C version.









; IMGPRCS2.ASM
;
; An image processing program (First optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K), the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              6 seconds.
;       Original ASM code-                      36 seconds.
;       Borland Pascal v7.0-                    45 seconds.
;       Borland C++ v4.02-                      29 seconds.
;       Microsoft C++ v8.00-                    21 seconds.


;       // Lots of omitted code goes here, see the previous version//


                print
                byte    "Computing Result",cr,lf,0

; for h := 1 to iterations-

                mov     h, 1
hloop:

; Copy the input data to the output buffer.
; Optimization step #1: Replace with movs instruction.

                push    ds
                mov     ax, OutSeg
                mov     ds, ax
                mov     ax, InSeg
                mov     es, ax
                lea     si, DataOut
                lea     di, DataIn
                mov     cx, (251*256)/4
        rep     movsd
                pop     ds


; Optimization Step #1: Convert loops to repeat..until form.

; for i := 1 to 249 -

                mov     i, 1
iloop:

; for j := 1 to 254 -

                mov     j, 1
jloop:


; Optimization. Unroll the innermost two loops:

                mov     bh, byte ptr i          ;i is always less than 256.
                mov     bl, byte ptr j          ;Computes i*256+j!

                push    ds
                mov     ax, InSeg               ;Gain access to InSeg.
                mov     ds, ax

                mov     cx, 0                   ;Compute sum here.
                mov     ah, ch
                mov     cl, ds:DataIn[bx-257]   ;DataIn[i-1][j-1]
                mov     al, ds:DataIn[bx-256]   ;DataIn[i-1][j]
                add     cx, ax
                mov     al, ds:DataIn[bx-255]   ;DataIn[i-1][j+1]
                add     cx, ax
                mov     al, ds:DataIn[bx-1]     ;DataIn[i][j-1]
                add     cx, ax
                mov     al, ds:DataIn[bx+1]     ;DataIn[i][j+1]
                add     cx, ax
                mov     al, ds:DataIn[bx+255]   ;DataIn[i+1][j-1]
                add     cx, ax
                mov     al, ds:DataIn[bx+256]   ;DataIn[i+1][j]
                add     cx, ax
                mov     al, ds:DataIn[bx+257]   ;DataIn[i+1][j+1]
                add     cx, ax

                mov     al, ds:DataIn[bx]       ;DataIn[i][j]
                shl     ax, 3                   ;DataIn[i][j]*8
                add     cx, ax
                shr     cx, 4                   ;Divide by 16
                mov     ax, OutSeg
                mov     ds, ax
                mov     ds:DataOut[bx], cl
                pop     ds

                inc     j
                cmp     j, 254
                jbe     jloop

                inc     i
                cmp     i, 249
                jbe     iloop

                inc     h
                mov     ax, h
                cmp     ax, Iterations
                jnbe    Done
                jmp     hloop

Done:           print
                byte    "Writing result",cr,lf,0

;       //More omitted code goes here, see the previous version//

The second version above still uses memory variables for most computations. The optimizations applied to the original code were mainly language-independent optimizations. The next step was to begin applying some assembly language specific optimizations to the code. The first optimization we need to do is to move as many variables as possible into the 80x86's register set. The following code provides this optimization. Although this only improves the running time by 2 seconds, that is a 33% improvement (six seconds down to four)!









; IMGPRCS.ASM
;
; An image processing program (Second optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K), the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.


;       <<Lots of delete code goes here»

                print
                byte    "Computing Result",cr,lf,0


; Copy the input data to the output buffer.


hloop:          mov     ax, InSeg
                mov     es, ax
                mov     ax, OutSeg
                mov     ds, ax
                lea     si, DataOut
                lea     di, DataIn
                mov     cx, (251*256)/4
        rep     movsd

                assume  ds:InSeg, es:OutSeg
                mov     ax, InSeg
                mov     ds, ax
                mov     ax, OutSeg
                mov     es, ax



                mov     cl, 249
iloop:          mov     bh, cl                  ;i*256
                mov     bl, 1                   ;Start at j=1.
                mov     ch, 254                 ;# of times through loop.
jloop:
                mov     dx, 0                   ;Compute sum here.
                mov     ah, dh
                mov     dl, DataIn[bx-257]      ;DataIn[i-1][j-1]
                mov     al, DataIn[bx-256]      ;DataIn[i-1][j]
                add     dx, ax
                mov     al, DataIn[bx-255]      ;DataIn[i-1][j+1]
                add     dx, ax
                mov     al, DataIn[bx-1]        ;DataIn[i][j-1]
                add     dx, ax
                mov     al, DataIn[bx+1]        ;DataIn[i][j+1]
                add     dx, ax
                mov     al, DataIn[bx+255]      ;DataIn[i+1][j-1]
                add     dx, ax
                mov     al, DataIn[bx+256]      ;DataIn[i+1][j]
                add     dx, ax
                mov     al, DataIn[bx+257]      ;DataIn[i+1][j+1]
                add     dx, ax

                mov     al, DataIn[bx]          ;DataIn[i][j]
                shl     ax, 3                   ;DataIn[i][j]*8
                add     dx, ax
                shr     dx, 4                   ;Divide by 16
                mov     DataOut[bx], dl

                inc     bx
                dec     ch
                jne     jloop

                dec     cl
                jne     iloop

                dec     bp
                jne     hloop

Done:           print
                byte    "Writing result",cr,lf,0

;       //More deleted code goes here, see the original version//

Note that on each iteration, the code above still copies the output data back to the input data. That's almost six and a half megabytes of data movement for 100 iterations! The following version of the blurring program unrolls the hloop twice. The first occurrence copies the data from DataIn to DataOut while computing the blur, the second instance copies the data from DataOut back to DataIn while blurring the image. By using these two code sequences, the program save copying the data from one point to another. This version also maintains some common computations between two adjacent cells to save a few instructions in the innermost loop. This version arranges instructions in the innermost loop to help avoid data hazards on 80486 and later processors. The end result is almost 40% faster than the previous version (down to 2.5 seconds from four seconds).









; IMGPRCS.ASM
;
; An image processing program (Third optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K), the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
; Version #4: Eliminated copying data from DataOut to DataIn on each pass.
;        Removed hazards. Maintained common subexpressions. Did some
;        more loop unrolling.
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system, 100 iterations).
;
;       This code-                              2.5 seconds.
;       2nd optimization pass-                  4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.

;       <<Lots of deleted code here, see the original version»

                print
                byte    "Computing Result",cr,lf,0


                assume  ds:InSeg, es:OutSeg

                mov     ax, InSeg
                mov     ds, ax
                mov     ax, OutSeg
                mov     es, ax

; Copy the data once so we get the edges in both arrays.

                mov     cx, (251*256)/4
                lea     si, DataIn
                lea     di, DataOut
        rep     movsd


; "hloop" repeats once for each iteration.

hloop:
                mov     ax, InSeg
                mov     ds, ax
                mov     ax, OutSeg
                mov     es, ax

; "iloop" processes the rows in the matrices.

                mov     cl, 249
iloop:          mov     bh, cl                  ;i*256
                mov     bl, 1                   ;Start at j=1.
                mov     ch, 254/2               ;# of times through loop.
                mov     si, bx
                mov     dh, 0                   ;Compute sum here.
                mov     bh, 0
                mov     ah, 0

; "jloop" processes the individual elements of the array.
; This loop has been unrolled once to allow the two portions to share
; some common computations.

jloop:

; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] +
; DataIn [i+1][j+1] will be used in the second half of this computation.
; So save its value in a register (di) until we need it again.

                mov     dl, DataIn[si-256]      ;[i-1,j]
                mov     al, DataIn[si-255]      ;[i-1,j+1]
                mov     bl, DataIn[si+257]      ;[i+1,j+1]
                add     dx, ax
                mov     al, DataIn[si+256]      ;[I+1,j]
                add     dx, bx
                mov     bl, DataIn[si+1]        ;[i,j+1]
                add     dx, ax
                mov     al, DataIn[si+255]      ;[i+1,j-1]

                mov     di, dx                  ;Save partial result.

                add     dx, bx
                mov     bl, DataIn[si-1]        ;[i,j-1]
                add     dx, ax
                mov     al, DataIn[si]          ;[i,j]
                add     dx, bx
                mov     bl, DataIn[si-257]      ;[i-1,j-1]
                shl     ax, 3                   ;DataIn[i,j] * 8.
                add     dx, bx
                add     dx, ax
                shr     ax, 3                   ;Restore DataIn[i,j].
                shr     dx, 4                   ;Divide by 16.
                add     di, ax
                mov     DataOut[si], dl

; Okay, process the next cell over. Note that we've got a partial sum
; sitting in DI already. Don't forget, we haven't bumped SI at this point,
; so the offsets are off by one. (This is the second half of the unrolled
; loop.)

                mov     dx, di                  ;Partial sum.
                mov     bl, DataIn[si-254]      ;[i-1,j+1]
                mov     al, DataIn[si+2]        ;[i,j+1]
                add     dx, bx
                mov     bl, DataIn[si+258]      ;[i+1,j+1];
                add     dx, ax
                mov     al, DataIn[si+1]        ;[i,j]
                add     dx, bx
                shl     ax, 3                   ;DataIn[i][j]*8
                add     si, 2                   ;Bump array index.
                add     dx, ax
                mov     ah, 0                   ;Clear for next iter.
                shr     dx, 4                   ;Divide by 16
                dec     ch
                mov     DataOut[si-1], dl
                jne     jloop

                dec     cl
                jne     iloop

                dec     bp
                je      Done


; Special case so we don't have to move the data between the two arrays.
; This is an unrolled version of the hloop that swaps the input and output
; arrays so we don't have to move data around in memory.

                mov     ax, OutSeg
                mov     ds, ax
                mov     ax, InSeg
                mov     es, ax
                assume  es:InSeg, ds:OutSeg

hloop2:

                mov     cl, 249
iloop2:         mov     bh, cl
                mov     bl, 1
                mov     ch, 254/2
                mov     si, bx
                mov     dh, 0
                mov     bh, 0
                mov     ah, 0
jloop2:
                mov     dl, DataOut[si-256]
                mov     al, DataOut[si-255]
                mov     bl, DataOut[si+257]
                add     dx, ax
                mov     al, DataOut[si+256]
                add     dx, bx
                mov     bl, DataOut[si+1]
                add     dx, ax
                mov     al, DataOut[si+255]

                mov     di, dx

                add     dx, bx
                mov     bl, DataOut[si-1]
                add     dx, ax
                mov     al, DataOut[si]
                add     dx, bx
                mov     bl, DataOut[si-257]
                shl     ax, 3
                add     dx, bx
                add     dx, ax
                shr     ax, 3
                shr     dx, 4
                mov     DataIn[si], dl

                mov     dx, di
                mov     bl, DataOut[si-254]
                add     dx, ax
                mov     al, DataOut[si+2]
                add     dx, bx
                mov     bl, DataOut[si+258]
                add     dx, ax
                mov     al, DataOut[si+1]
                add     dx, bx
                shl     ax, 3
                add     si, 2
                add     dx, ax
                mov     ah, 0
                shr     dx, 4
                dec     ch
                mov     DataIn[si-1], dl
                jne     jloop2

                dec     cl
                jne     iloop2

                dec     bp
                je      Done2
                jmp     hloop


; Kludge to guarantee that the data always resides in the output segment.

Done2:
                mov     ax, InSeg
                mov     ds, ax
                mov     ax, OutSeg
                mov     es, ax
                mov     cx, (251*256)/4
                lea     si, DataIn
                lea     di, DataOut
        rep     movsd

Done:           print
                byte    "Writing result",cr,lf,0

;       //Lots of deleted code here, see the original program//

This code provides a good example of the kind of optimization that scares a lot of people. There is a lot of cycle counting, instruction scheduling, and other crazy stuff that makes program very difficult to read and understand. This is the kind of optimization for which assembly language programmers are famous; the stuff that spawned the phrase "never optimize early." You should never try this type of optimization until you feel you've exhausted all other possibilities. Once you write your code in this fashion, it is going to be very difficult to make further changes to it. By the way, the above code took about 15 hours to develop and debug (debugging took the most time). That works out to a 0.1 second improvement (for 100 iterations) for each hour of work. Although this code certainly isn't optimal yet, it is difficult to justify more time attempting to improve this code by mechanical means (e.g., moving instructions around, etc.) because the performance gains would be so little.

In the four steps above, we've reduced the running time of the assembly code from 36 seconds down to 2.5 seconds. Quite an impressive feat. However, you shouldn't get the idea that this was easy or even that there were only four steps involved. During the actual development of this example, there were many attempts that did not improve performance (in fact, some modifications wound up reducing performance) and others did not improve performance enough to justify their inclusion. Just to demonstrate this last point, the following code included a major change in the way the program organized data. The main loop operates on 16 bit objects in memory rather than eight bit objects. On some machines with large external caches (256K or better) this algorithm provides a slight improvement in performance (2.4 seconds, down from 2.5). However, on other machines it runs slower. Therefore, this code was not chosen as the final implementation:









; IMGPRCS.ASM
;
; An image processing program (Fourth optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K), the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
; Version #4: Eliminated copying data from DataOut to DataIn on each pass.
;        Removed hazards. Maintained common subexpressions. Did some
;        more loop unrolling.
;
; Version #5: Converted data arrays to words rather than bytes and operated
;        on 16-bit values. Yielded minimal speedup.
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              2.4 seconds.
;       3rd optimization pass-                  2.5 seconds.
;       2nd optimization pass-                  4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.

                .xlist
                include         stdlib.a
                includelib      stdlib.lib
                .list
                .386
                option          segment:use16



dseg            segment para public 'data'

ImgData         byte    251 dup (256 dup (?))

InName          byte    "roller1.raw",0
OutName         byte    "roller2.raw",0
Iterations      word    0

dseg            ends


; This code makes the naughty assumption that the following
; segments are loaded contiguously in memory! Also, because these
; segments are paragraph aligned, this code assumes that these segments
; will contain a full 65,536 bytes. You cannot declare a segment with
; exactly 65,536 bytes in MASM. However, the paragraph alignment option
; ensures that the extra byte of padding is added to the end of each
; segment.

DataSeg1        segment para public 'ds1'
Data1a          byte    65535 dup (?)
DataSeg1        ends

DataSeg2        segment para public 'ds2'
Data1b          byte    65535 dup (?)
DataSeg2        ends

DataSeg3        segment para public 'ds3'
Data2a          byte    65535 dup (?)
DataSeg3        ends

DataSeg4        segment para public 'ds4'
Data2b          byte    65535 dup (?)
DataSeg4        ends




cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg

Main            proc
                mov     ax, dseg
                mov     ds, ax
                meminit

                mov     ax, 3d00h       ;Open input file for reading.
                lea     dx, InName
                int     21h
                jnc     GoodOpen
                print
                byte    "Could not open input file.",cr,lf,0
                jmp     Quit

GoodOpen:       mov     bx, ax          ;File handle.
                lea     dx, ImgData
                mov     cx, 256*251     ;Size of data file to read.
                mov     ah, 3Fh
                int     21h
                cmp     ax, 256*251     ;See if we read the data.
                je      GoodRead
                print
                byte    "Did not read the file properly",cr,lf,0
                jmp     Quit

GoodRead:       print
                byte    "Enter number of iterations: ",0
                getsm
                atoi
                free
                mov     Iterations, ax
                cmp     ax, 0
                jle     Quit

                printf
                byte    "Computing Result for %d iterations",cr,lf,0
                dword   Iterations



; Copy the data and expand it from eight bits to sixteen bits.
; The first loop handles the first 32,768 bytes, the second loop
; handles the remaining bytes.

                mov     ax, DataSeg1
                mov     es, ax
                mov     ax, DataSeg3
                mov     fs, ax

                mov     ah, 0
                mov     cx, 32768
                lea     si, ImgData
                xor     di, di          ;Output data is at ofs zero.
CopyLoop:       lodsb                   ;Read a byte
                mov     fs:[di], ax     ;Store a word in DataSeg3
                stosw                   ;Store a word in DataSeg1
                dec     cx
                jne     CopyLoop

                mov     di, DataSeg2
                mov     es, di
                mov     di, DataSeg4
                mov     fs, di
                mov     cx, (251*256) - 32768
                xor     di, di
CopyLoop1:      lodsb                   ;Read a byte
                mov     fs:[di], ax     ;Store a word in DataSeg4
                stosw                   ;Store a word in DataSeg2
                dec     cx
                jne     CopyLoop1

; hloop completes one iteration on the data moving it from Data1a/Data1b
; to Data2a/Data2b

hloop:          mov     ax, DataSeg1
                mov     ds, ax
                mov     ax, DataSeg3
                mov     es, ax

; Process the first 127 rows (65,024 bytes) of the array):

                mov     cl, 127
                lea     si, Data1a+202h ;Start at [1,1]
iloop0:         mov     ch, 254/2       ;# of times through loop.
jloop0:         mov     dx, [si]        ;[i,j]
                mov     bx, [si-200h]   ;[i-1,j]
                mov     ax, dx
                shl     dx, 3           ;[i,j] * 8
                add     bx, [si-1feh]   ;[i-1,j+1]
                mov     bp, [si+2]      ;[i,j+1]
                add     bx, [si+200h]   ;[i+1,j]
                add     dx, bp
                add     bx, [si+202h]   ;[i+1,j+1]
                add     dx, [si-202h]   ;[i-1,j-1]
                mov     di, [si-1fch]   ;[i-1,j+2]
                add     dx, [si-2]      ;[i,j-1]
                add     di, [si+4]      ;[i,j+2]
                add     dx, [si+1feh]   ;[i+1,j-1]
                add     di, [si+204h]   ;[i+1,j+2]
                shl     bp, 3           ;[i,j+1] * 8
                add     dx, bx
                add     bp, ax
                shr     dx, 4           ;Divide by 16.
                add     bp, bx
                mov     es:[si], dx     ;Store [i,j] entry.
                add     bp, di
                add     si, 4           ;Affects next store operation!
                shr     bp, 4           ;Divide by 16.
                dec     ch
                mov     es:[si-2], bp   ;Store [i,j+1] entry.
                jne     jloop0

                add     si, 4           ;Skip to start of next row.

                dec     cl
                jne     iloop0

; Process the last 124 rows of the array). This requires that we switch from
; one segment to the next. Note that the segments overlap.

                mov     ax, DataSeg2
                sub     ax, 40h         ;Back up to last 2 rows in DS2
                mov     ds, ax
                mov     ax, DataSeg4
                sub     ax, 40h         ;Back up to last 2 rows in DS4
                mov     es, ax

                mov     cl, 251-127-1   ;Remaining rows to process.
                mov     si, 202h        ;Continue with next row.
iloop1:         mov     ch, 254/2       ;# of times through loop.
jloop1:         mov     dx, [si]        ;[i,j]
                mov     bx, [si-200h]   ;[i-1,j]
                mov     ax, dx
                shl     dx, 3           ;[i,j] * 8
                add     bx, [si-1feh]   ;[i-1,j+1]
                mov     bp, [si+2]      ;[i,j+1]
                add     bx, [si+200h]   ;[i+1,j]
                add     dx, bp
                add     bx, [si+202h]   ;[i+1,j+1]
                add     dx, [si-202h]   ;[i-1,j-1]
                mov     di, [si-1fch]   ;[i-1,j+2]
                add     dx, [si-2]      ;[i,j-1]
                add     di, [si+4]      ;[i,j+2]
                add     dx, [si+1feh]   ;[i+1,j-1]
                add     di, [si+204h]   ;[i+1,j+2]
                shl     bp, 3           ;[i,j+1] * 8
                add     dx, bx
                add     bp, ax
                shr     dx, 4           ;Divide by 16
                add     bp, bx
                mov     es:[si], dx     ;Store [i,j] entry.
                add     bp, di
                add     si, 4           ;Affects next store operation!
                shr     bp, 4
                dec     ch
                mov     es:[si-2], bp   ;Store [i,j+1] entry.
                jne     jloop1

                add     si, 4           ;Skip to start of next row.

                dec     cl
                jne     iloop1

                mov     ax, dseg
                mov     ds, ax
                assume  ds:dseg

                dec     Iterations
                je      Done0

; Unroll the iterations loop so we can move the data from DataSeg2/4 back
; to DataSeg1/3 without wasting extra time. Other than the direction of the
; data movement, this code is virtually identical to the above.

                mov     ax, DataSeg3
                mov     ds, ax
                mov     ax, DataSeg1
                mov     es, ax

                mov     cl, 127
                lea     si, Data1a+202h
iloop2:         mov     ch, 254/2
jloop2:         mov     dx, [si]
                mov     bx, [si-200h]
                mov     ax, dx
                shl     dx, 3
                add     bx, [si-1feh]
                mov     bp, [si+2]
                add     bx, [si+200h]
                add     dx, bp
                add     bx, [si+202h]
                add     dx, [si-202h]
                mov     di, [si-1fch]
                add     dx, [si-2]
                add     di, [si+4]
                add     dx, [si+1feh]
                add     di, [si+204h]
                shl     bp, 3
                add     dx, bx
                add     bp, ax
                shr     dx, 4
                add     bp, bx
                mov     es:[si], dx
                add     bp, di
                add     si, 4
                shr     bp, 4
                dec     ch
                mov     es:[si-2], bp
                jne     jloop2

                add     si, 4

                dec     cl
                jne     iloop2


                mov     ax, DataSeg4
                sub     ax, 40h
                mov     ds, ax
                mov     ax, DataSeg2
                sub     ax, 40h
                mov     es, ax

                mov     cl, 251-127-1
                mov     si, 202h
iloop3:         mov     ch, 254/2
jloop3:         mov     dx, [si]
                mov     bx, [si-200h]
                mov     ax, dx
                shl     dx, 3
                add     bx, [si-1feh]
                mov     bp, [si+2]
                add     bx, [si+200h]
                add     dx, bp
                add     bx, [si+202h]
                add     dx, [si-202h]
                mov     di, [si-1fch]
                add     dx, [si-2]
                add     di, [si+4]
                add     dx, [si+1feh]
                add     di, [si+204h]
                shl     bp, 3
                add     dx, bx
                add     bp, ax
                shr     dx, 4
                add     bp, bx
                mov     es:[si], dx
                add     bp, di
                add     si, 4
                shr     bp, 4
                dec     ch
                mov     es:[si-2], bp
                jne     jloop3

                add     si, 4

                dec     cl
                jne     iloop3

                mov     ax, dseg
                mov     ds, ax
                assume  ds:dseg

                dec     Iterations
                je      Done2
                jmp     hloop

Done2:          mov     ax, DataSeg1
                mov     bx, DataSeg2
                jmp     Finish

Done0:          mov     ax, DataSeg3
                mov     bx, DataSeg4
Finish:         mov     ds, ax
                print
                byte    "Writing result",cr,lf,0

; Convert data back to byte form and write to the output file:

                mov     ax, dseg
                mov     es, ax

                mov     cx, 32768
                lea     di, ImgData
                xor     si, si          ;Output data is at offset zero.
CopyLoop3:      lodsw                   ;Read a word from final array.
                stosb                   ;Write a byte to output array.
                dec     cx
                jne     CopyLoop3

                mov     ds, bx
                mov     cx, (251*256) - 32768
                xor     si, si
CopyLoop4:      lodsw                   ;Read final data word.
                stosb                   ;Write data byte to output array.
                dec     cx
                jne     CopyLoop4


; Okay, write the data to the output file:

                mov     ah, 3ch         ;Create output file.
                mov     cx, 0           ;Normal file attributes.
                mov     dx, dseg
                mov     ds, dx
                lea     dx, OutName
                int     21h
                jnc     GoodCreate
                print
                byte    "Could not create output file.",cr,lf,0
                jmp     Quit

GoodCreate:     mov     bx, ax          ;File handle.
                push    bx
                mov     dx, dseg        ;Where the data can be found.
                mov     ds, dx
                lea     dx, ImgData
                mov     cx, 256*251     ;Size of data file to write.
                mov     ah, 40h         ;Write operation.
                int     21h
                pop     bx              ;Retrieve handle for close.
                cmp     ax, 256*251     ;See if we wrote the data.
                je      GoodWrite
                print
                byte    "Did not write the file properly",cr,lf,0
                jmp     Quit

GoodWrite:      mov     ah, 3eh         ;Close operation.
                int     21h


Quit:           ExitPgm                 ;DOS macro to quit program.
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             byte    1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       byte    16 dup (?)
zzzzzzseg       ends
                end     Main

Of course, the absolute best way to improve the performance of any piece of code is with a better algorithm. All of the above assembly language versions were limited by a single requirement - they all must produce the same output file as the original Pascal program. Often, programmers lose sight of what it is that they are trying to accomplish and get so caught up in the computations they are performing that they fail to see other possibilities. The optimization example above is a perfect example. The assembly code faithfully preserves the semantics of the original Pascal program; it computes the weighted average of all interior pixels as the sum of the eight neighbors around a pixel plus eight times the current pixel's value, with the entire sum divided by 16. Now this is a good blurring function, but it is not the only blurring function. A Photoshop (or other image processing program) user doesn't care about algorithms or such. When that user selects "blur image" they want it to go out of focus. Exactly how much out of focus is generally immaterial. In fact, the less the better because the user can always run the blur algorithm again (or specify some number of iterations). The following assembly language program shows how to get better performance by modifying the blurring algorithm to reduce the number of instructions it needs to execute in the innermost loops. It computes blurring by averaging a pixel with the four neighbors above, below, to the left, and to the right of the current pixel. This modification yields a program that runs 100 iterations in 2.2 seconds, a 12% improvement over the previous version:









; IMGPRCS.ASM
;
; An image processing program (Fifth optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16, weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K), the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
; Version #4: Eliminated copying data from DataOut to DataIn on each pass.
;        Removed hazards. Maintained common subexpressions. Did some
;        more loop unrolling.
;
; Version #6: Changed the blurring algorithm to use fewer computations.
;        This version does *NOT* produce the same data as the other
;        programs.
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system, 100 iterations).
;
;       This code-                              2.2 seconds.
;       3rd optmization pass-                   2.5 seconds.
;       2nd optimization pass-                  4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.

;       <<Lots of deleted code here, see the original program»


                print
                byte    "Computing Result",cr,lf,0


                assume  ds:InSeg, es:OutSeg

                mov     ax, InSeg
                mov     ds, ax
                mov     ax, OutSeg
                mov     es, ax

; Copy the data once so we get the edges in both arrays.

                mov     cx, (251*256)/4
                lea     si, DataIn
                lea     di, DataOut
        rep     movsd


; "hloop" repeats once for each iteration.

hloop:
                mov     ax, InSeg
                mov     ds, ax
                mov     ax, OutSeg
                mov     es, ax

; "iloop" processes the rows in the matrices.

                mov     cl, 249
iloop:          mov     bh, cl                  ;i*256
                mov     bl, 1                   ;Start at j=1.
                mov     ch, 254/2               ;# of times through loop.
                mov     si, bx
                mov     dh, 0                   ;Compute sum here.
                mov     bh, 0
                mov     ah, 0

; "jloop" processes the individual elements of the array.
; This loop has been unrolled once to allow the two portions to share
; some common computations.

jloop:

; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] +
; DataIn [i+1][j+1] will be used in the second half of this computation.
; So save its value in a register (di) until we need it again.

                mov     dl, DataIn[si]          ;[i,j]
                mov     al, DataIn[si-256]      ;[I-1,j]
                shl     dx, 2                   ;[i,j]*4
                mov     bl, DataIn[si-1]        ;[i,j-1]
                add     dx, ax
                mov     al, DataIn[si+1]        ;[i,j+1]
                add     dx, bx
                mov     bl, DataIn[si+256]      ;[i+1,j]
                add     dx, ax
                shl     ax, 2                   ;[i,j+1]*4
                add     dx, bx
                mov     bl, DataIn[si-255]      ;[i-1,j+1]
                shr     dx, 3                   ;Divide by 8.
                add     ax, bx
                mov     DataOut[si], dl
                mov     bl, DataIn[si+2]        ;[i,j+2]
                mov     dl, DataIn[si+257]      ;[i+1,j+1]
                add     ax, bx
                mov     bl, DataIn[si]          ;[i,j]
                add     ax, dx
                add     ax, bx
                shr     ax, 3
                dec     ch
                mov     DataOut[si+1], al
                jne     jloop

                dec     cl
                jne     iloop

                dec     bp
                je      Done


; Special case so we don't have to move the data between the two arrays.
; This is an unrolled version of the hloop that swaps the input and output
; arrays so we don't have to move data around in memory.

                mov     ax, OutSeg
                mov     ds, ax
                mov     ax, InSeg
                mov     es, ax
                assume  es:InSeg, ds:OutSeg

hloop2:

                mov     cl, 249
iloop2:         mov     bh, cl
                mov     bl, 1
                mov     ch, 254/2
                mov     si, bx
                mov     dh, 0
                mov     bh, 0
                mov     ah, 0
jloop2:
                mov     dl, DataOut[si-256]
                mov     al, DataOut[si-255]
                mov     bl, DataOut[si+257]
                add     dx, ax
                mov     al, DataOut[si+256]
                add     dx, bx
                mov     bl, DataOut[si+1]
                add     dx, ax
                mov     al, DataOut[si+255]

                mov     di, dx

                add     dx, bx
                mov     bl, DataOut[si-1]
                add     dx, ax
                mov     al, DataOut[si]
                add     dx, bx
                mov     bl, DataOut[si-257]
                shl     ax, 3
                add     dx, bx
                add     dx, ax
                shr     ax, 3
                shr     dx, 4
                mov     DataIn[si], dl

                mov     dx, di
                mov     bl, DataOut[si-254]
                add     dx, ax
                mov     al, DataOut[si+2]
                add     dx, bx
                mov     bl, DataOut[si+258]
                add     dx, ax
                mov     al, DataOut[si+1]
                add     dx, bx
                shl     ax, 3
                add     si, 2
                add     dx, ax
                mov     ah, 0
                shr     dx, 4
                dec     ch
                mov     DataIn[si-1], dl
                jne     jloop2

                dec     cl
                jne     iloop2

                dec     bp
                je      Done2
                jmp     hloop


; Kludge to guarantee that the data always resides in the output segment.

Done2:
                mov     ax, InSeg
                mov     ds, ax
                mov     ax, OutSeg
                mov     es, ax
                mov     cx, (251*256)/4
                lea     si, DataIn
                lea     di, DataOut
        rep     movsd

Done:           print
                byte    "Writing result",cr,lf,0


;       //Lots of delete code here, see the original program//

One very important thing to keep in mind about the codein this section is that we've optimized it for 100 iterations. While it turns out that these optimizations apply equally well to more iterations, this isn't necessarily true for fewer iterations. In particular, if we run only one iteration, any copying of data at the end of the operation will easily consume a large part of the time we save by the optimizations. Since it is very rare for a user to blur an image 100 times in a row, our optimizations may not be as good as we could make them. However, this section does provide a good example of the steps you must go through in order to optimize a given program. One hundred iterations was a good choice for this example because it was easy to measure the running time of all versions of the program. However, you must keep in mind that you should optimize your programs for the expected case, not an arbitrary case.

25.5 - Improving the Implementation of an Algorithm

Art of Assembly: Chapter Twenty-Five - 29 SEP 1996

[Chapter Twenty-Five][Previous] [Art of Assembly][Randall Hyde]

25.5 Improving the Implementation of an Algorithm

Number of Web Site Hits since Jan 1, 2000: