Important Notice: As you have probably discovered by now, I am no longer updating this document. The reason is quite simple: I'm working on a Windows version of "The Art of Assembly Language Programming". In the past I have encouraged individuals to send me corrections to this text. However, as I am no longer updating this material, don't expect those corrections to appear in a future release. I am collecting errata that I will post to Webster someday, so feel free to continue sending corrections to AoA/DOS (16-bit) to rhyde@cs.ucr.edu. If you're more interested in leading edge material, please see the Win/32 edition.
We are currently working on ways to publish this text in a form other than HTML (e.g., Postscript, PDF, Frameviewer, hard copy, etc.). This, however, is a low-priority project. Please do not contact Randall Hyde concerning this effort. When something happens, an announcement will appear on "Randall Hyde's Assembly Language Page." Please visit this WEB site at http://webster.ucr.edu for the latest scoop.
Statements like mov ax,0 and add ax,bx are meaningless to the microprocessor. As arcane as these statements appear, they are still human readable forms of 80x86 instructions. The 80x86 responds to commands like B80000 and 03C3. An assembler is a program that converts strings like mov ax,0 to 80x86 machine code like "B80000". An assembly language program consists of statements like mov ax,0. The assembler converts an assembly language source file to machine code, the binary equivalent of the assembly language program. In this respect, the assembler program is much like a compiler: it reads an ASCII source file from the disk and produces a machine language program as output. The major difference between a compiler for a high level language (HLL) like Pascal and an assembler is that the compiler usually emits several machine instructions for each Pascal statement. The assembler generally emits a single machine instruction for each assembly language statement.
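As a small illustration of this one-to-one mapping, the fragment below pairs the two statements mentioned above with the machine code bytes quoted in the text. The comments only restate those byte values; the layout is a sketch rather than the output of any particular assembler listing.

```asm
                mov     ax, 0           ;One statement -> one instruction: B8 00 00
                add     ax, bx          ;One statement -> one instruction: 03 C3
```

A compiler, by contrast, would typically emit both of these instructions (and possibly more) for a single Pascal assignment statement.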
{Label} {Mnemonic {Operand}} {;Comment}
Each entity above is a field. The four fields above are the label field, the mnemonic field, the operand field, and the comment field.
The label field is (usually) an optional field containing a symbolic label for the current statement. Labels are used in assembly language, just as in HLLs, to mark lines as the targets of GOTOs (jumps). You can also specify variable names, procedure names, and other entities using symbolic labels. Most of the time the label field is optional, meaning a label need be present only if you want a label on that particular line. Some mnemonics, however, require a label, others do not allow one. In general, you should always begin your labels in column one (this makes your programs easier to read).
A mnemonic is an instruction name (e.g., mov, add, etc.). The word mnemonic means memory aid. mov is much easier to remember than the binary equivalent of the mov instruction! The braces denote that this item is optional. Note, however, that you cannot have an operand without a mnemonic.
The mnemonic field contains an assembler instruction. Instructions are divided into three classes: 80x86 machine instructions, assembler directives, and pseudo opcodes. 80x86 instructions, of course, are assembler mnemonics that correspond to the actual 80x86 instructions introduced in Chapter Six.
Assembler directives are special instructions that provide information to the assembler but do not generate any code. Examples include the segment directive, equ, assume, and end. These mnemonics are not valid 80x86 instructions. They are messages to the assembler, nothing else.
A pseudo-opcode is a message to the assembler, just like an assembler directive; however, a pseudo-opcode will emit object code bytes. Examples of pseudo-opcodes include byte, word, dword, qword, and tbyte. These instructions emit the bytes of data specified by their operands but they are not true 80x86 machine instructions.
The operand field contains the operands, or parameters, for the instruction specified in the mnemonic field. Operands never appear on lines by themselves. The type and number of operands (zero, one, two, or more) depend entirely on the specific instruction.
The comment field allows you to annotate each line of source code in your program. Note that the comment field always begins with a semicolon. When the assembler is processing a line of text, it completely ignores everything on the source line following a semicolon.
Each assembly language statement appears on its own line in the source file. You cannot have multiple assembly language statements on a single line. On the other hand, since all the fields in an assembly language statement are optional, blank lines are fine. You can use blank lines anywhere in your source file. Blank lines are useful for spacing out certain sections of code, making them easier to read.
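The fragment below is a minimal sketch of how the four fields can combine on individual lines. The label names Start and Done are made up for the illustration.

```asm
; A comment on a line by itself.

Start:          mov     ax, 0           ;Label, mnemonic, operands, and comment.
                add     ax, bx          ;No label on this line.

Done:                                   ;A label and a comment, but no mnemonic.
                ret                     ;A mnemonic that takes no operands.
```

Each line holds at most one statement, and the blank lines are there only for readability.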
The Microsoft Macro Assembler is a free form assembler. The various fields of an assembly language statement may appear in any column (as long as they appear in the proper order). Any number of spaces or tabs can separate the various fields in the statement. To the assembler, the following two code sequences are identical:
______________________________________________________
                mov     ax, 0
                mov     bx, ax
                add     ax, dx
                mov     cx, ax
______________________________________________________
 mov   ax,                     0
mov bx,        ax
      add         ax, dx
                mov             cx, ax
______________________________________________________
The first code sequence is much easier to read than the second (if you don't think so, perhaps you should go see a doctor!). With respect to readability, the judicious use of spacing within your program can make all the difference in the world.
Placing the labels in column one, the mnemonics in column 17 (two tabstops), the operand field in column 25 (the third tabstop), and the comments out around column 41 or 49 (five or six tabstops) produces the best looking listings. Assembly language programs are hard enough to read as it is. Formatting your listings to help make them easier to read will make them much easier to maintain.
You may have a comment on a line by itself. In such a case, place the semicolon in column one and use the entire line for the comment. For example:

; The following section of code positions the cursor to the upper
; left hand position on the screen:

                mov     X, 0
                mov     Y, 0

; Now clear from the current cursor position to the end of the
; screen to clear the video display:

;               etc.
mov ax, bx since this instruction is two bytes long.

0 :             or      ah, 9
3 :             and     ah, 0c9h
6 :             xor     ah, 40h
9 :             pop     cx
A :             mov     al, cl
C :             pop     bp
D :             pop     cx
E :             pop     dx
F :             pop     ds
10:             ret
The or, and, and xor instructions are all three bytes long; the mov instruction is two bytes long; the remaining instructions are all one byte long. If these instructions appear at the beginning of a segment, the location counter would be the same as the numbers that appear immediately to the left of each instruction above. For example, the or instruction above begins at offset zero. Since the or instruction is three bytes long, the next instruction (and) follows at offset three. Likewise, and is three bytes long, so xor follows at offset six, etc.
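To make the arithmetic concrete, the same sequence is repeated below with each instruction's offset and a plausible encoding in the comment field. The byte values are the standard 8086 encodings as I understand them, not output copied from a MASM listing. The sketch assumes a near return, and an assembler may legitimately choose 88 C8 instead of 8A C1 for mov al, cl.

```asm
                or      ah, 9           ;Offset 00h: 80 CC 09  (3 bytes)
                and     ah, 0c9h        ;Offset 03h: 80 E4 C9  (3 bytes)
                xor     ah, 40h         ;Offset 06h: 80 F4 40  (3 bytes)
                pop     cx              ;Offset 09h: 59        (1 byte)
                mov     al, cl          ;Offset 0Ah: 8A C1     (2 bytes)
                pop     bp              ;Offset 0Ch: 5D        (1 byte)
                pop     cx              ;Offset 0Dh: 59        (1 byte)
                pop     dx              ;Offset 0Eh: 5A        (1 byte)
                pop     ds              ;Offset 0Fh: 1F        (1 byte)
                ret                     ;Offset 10h: C3        (1 byte, near return)
```

Each offset is simply the previous offset plus the length of the previous instruction.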
Consider the jmp instruction for a moment. This instruction takes the form:
jmp target
Target is the destination address. Imagine how painful it would be if you had to actually specify the target memory address as a numeric value. If you've ever programmed in BASIC (where line numbers are the same thing as statement labels) you've experienced about 10% of the trouble you would have in assembly language if you had to specify the target of a jmp by an address.
To illustrate, suppose you wanted to jump to some group of instructions you've yet to write. What is the address of the target instruction? How can you tell until you've written every instruction before the target instruction? What happens if you change the program? (Remember, inserting and deleting instructions will cause the location counter values for all the following instructions within that segment to change.) Fortunately, all these problems are of concern only to machine language programmers. Assembly language programmers can deal with addresses in a much more reasonable fashion - by using symbolic addresses.
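A short sketch of a symbolic address in use follows. The label SkipInit is a made-up name. The point is that the jmp never mentions a numeric offset, so instructions can be inserted or deleted between the jump and its target without touching the jmp itself.

```asm
                jmp     SkipInit        ;Forward reference to a label defined below.
                mov     ax, 0           ;Add or remove instructions here freely;
                mov     bx, ax          ; the assembler recomputes the target address.
SkipInit:       add     ax, dx          ;The label stands for this instruction's offset.
```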
A symbol, identifier, or label is a name associated with some particular value. This value can be an offset within a segment, a constant, a string, a segment address, an offset within a record, or even an operand for an instruction. In any case, a label provides us with the ability to represent some otherwise incomprehensible value with a familiar, mnemonic name.
A symbolic name consists of a sequence of letters, digits, and special characters, with the following restrictions:
%out .186 .286 .286P .287 .386 .386P .387 .486 .486P .8086 .8087 .ALPHA .BREAK .CODE .CONST .CREF .DATA .DATA? .DOSSEG .ELSE .ELSEIF .ENDIF .ENDW .ERR .ERR1 .ERR2 .ERRB .ERRDEF .ERRDIF .ERRDIFI .ERRE .ERRIDN .ERRIDNI .ERRNB .ERRNDEF .ERRNZ .EXIT .FARDATA .FARDATA? .IF .LALL .LFCOND .LIST .LISTALL .LISTIF .LISTMACRO .LISTMACROALL .MODEL .MSFLOAT .NO87 .NOCREF .NOLIST .NOLISTIF .NOLISTMACRO .RADIX .REPEAT .UNTIL .SALL .SEQ .SFCOND .STACK .STARTUP .TFCOND .UNTIL .UNTILCXZ .WHILE .XALL .XCREF .XLIST ALIGN ASSUME BYTE CATSTR COMM COMMENT DB DD DF DOSSEG DQ DT DW DWORD ECHO ELSE ELSEIF ELSEIF1 ELSEIF2 ELSEIFB ELSEIFDEF ELSEIFDEF ELSEIFE ELSEIFIDN ELSEIFNB ELSEIFNDEF END ENDIF ENDM ENDP ENDS EQU EVEN EXITM EXTERN EXTRN EXTERNDEF FOR FORC FWORD GOTO GROUP IF IF1 IF2 IFB IFDEF IFDIF IFDIFI IFE IFIDN IFIDNI IFNB IFNDEF INCLUDE INCLUDELIB INSTR INVOKE IRP IRPC LABEL LOCAL MACRO NAME OPTION ORG PAGE POPCONTEXT PROC PROTO PUBLIC PURGE PUSHCONTEXT QWORD REAL4 REAL8 REAL10 RECORD REPEAT REPT SBYTE SDWORD SEGMENT SIZESTR STRUC STRUCT SUBSTR SUBTITLE SUBTTL SWORD TBYTE TEXTEQU TITLE TYPEDEF UNION WHILE WORD
In addition, all valid 80x86 instruction names and register names are reserved as well. Note that this list applies to Microsoft's Macro Assembler version 6.0. Earlier versions of the assembler have fewer reserved words. Later versions may have more.
Some examples of valid symbols include:
L1 Bletch RightHere Right_Here Item1 __Special $1234 @Home $_@1 Dollar$ WhereAmI? @1234
$1234 and @1234 are perfectly valid, strange though they may seem.
Some examples of illegal symbols include:
1TooMany        - Begins with a digit.
Hello.There     - Contains a period in the middle of the symbol.
$               - Cannot have $ or ? by itself.
LABEL           - Assembler reserved word.
Right Here      - Symbols cannot contain spaces.
Hi,There        - or other special symbols besides _, ?, $, and @.
Symbols, as mentioned previously, can be assigned numeric values (such as location counter values), strings, or even whole operands. To keep things straightened out, the assembler assigns a type to each symbol. Examples of types include near, far, byte, word, double word, quad word, text, and strings. How you declare labels of a certain type is the subject of much of the rest of this chapter. For now, simply note that the assembler always assigns some type to a label and will tend to complain if you try to use a label at some point where it does not allow that type of label.
Except for the last example above, most of these literal constants should be reasonably familiar to anyone who has written a program in a high level language like Pascal or C++. Text constants are special forms of strings that allow textual substitution during assembly.
A literal constant's representation corresponds to what we would normally expect for its "real world value." Literal constants are also known as non symbolic constants since they use the value's actual representation, rather than some symbolic name, within your program. MASM also lets you define symbolic, or manifest, constants in a program, but more on that later.
Name | Base | Valid Digits |
---|---|---|
Binary | 2 | 0 1 |
Decimal | 10 | 0 1 2 3 4 5 6 7 8 9 |
Hexadecimal | 16 | 0 1 2 3 4 5 6 7 8 9 A B C D E F |
To differentiate between numbers in the various bases, you use a suffix character. If you terminate a number with a "b" or "B", then MASM assumes that it is a binary number. If it contains any digits other than zero or one the assembler will generate an error. If the suffix is "t", "T", "d" or "D", then the assembler assumes that the number is a decimal (base 10) value. A suffix of "h" or "H" will select the hexadecimal radix.
All integer constants must begin with a decimal digit, including hexadecimal constants. To represent the value "FDED" you must specify 0FDEDh. The leading decimal digit is required by the assembler so that it can differentiate between symbols and numeric constants; remember, "FDEDh" is a perfectly valid symbol to the Microsoft Macro Assembler.
Examples:
0F000h 12345d 0110010100b 1234h 100h 08h
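The leading-digit rule matters in practice. The sketch below defines a symbol that happens to be spelled FDEDh (legal, if unwise, according to the rule above) and shows how the leading zero separates the constant from the symbol. The symbol and its value are invented for the illustration.

```asm
FDEDh           =       5               ;A symbol whose name merely looks like a hex number.

                mov     ax, 0FDEDh      ;Leading digit: the hexadecimal constant FDED.
                mov     bx, FDEDh       ;No leading digit: the symbol above, i.e., 5.
```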
If you do not specify a suffix after your numeric constants, the assembler uses the current default radix. Initially, the default radix is decimal. Therefore, you can usually specify decimal values without the trailing "D" character. The .radix assembler directive can be used to change the default radix to some other base. The .radix directive takes the following form:
.radix base ;Optional comment
Base is a decimal value between 2 and 16.
The .radix statement takes effect as soon as MASM encounters it in the source file. All the statements before the .radix statement will use the previous default base for numeric constants. By sprinkling multiple .radix instructions throughout your source file, you can switch the default base amongst several values depending upon what's most convenient at each point in your program.
Generally, decimal is fine as the default base so the .radix instruction doesn't get used much. However, faced with entering a gigantic table of hexadecimal values, you can save a lot of typing by temporarily switching to base 16 before the table and switching back to decimal after the table. Note: if the default radix is hexadecimal, you should use the "T" suffix to denote decimal values since MASM will confuse the "D" suffix with a hexadecimal digit.
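A minimal sketch of that table-entry trick follows. The table name and its values are invented. The example relies on the .radix operand itself being read as a decimal value, as stated above, and that is worth confirming against your assembler's documentation. Constants whose last digit is B or D are deliberately avoided here, since those letters double as radix suffixes.

```asm
                .radix  16              ;Default base is now hexadecimal.

HexTable        byte    0F0, 0A5, 3C, 7E ;Hex values with no "h" suffix needed.
                byte    10T             ;"T" suffix: decimal ten, even in base 16.

                .radix  10              ;Operand is decimal: back to base ten.
```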
"This is a string" 'So is this'
You may freely place apostrophes inside string constants enclosed by quotation marks and vice versa. If you want to place an apostrophe inside a string delimited by apostrophes, you must place a pair of apostrophes next to each other in the string, e.g.,
'Doesn''t this look weird?'
Quotation marks appearing within a string delimited by quotes must also be doubled up, e.g.,
"Microsoft claims ""Our software is very fast."" Do you believe them?"
Although you can double up apostrophes or quotes as shown in the examples above, the easiest way to include these characters in a string is to use the other character as the string delimiter:
"Doesn't this look weird?" 'Microsoft claims "Our software is very fast." Do you believe them?'
The only time it would be absolutely necessary to double up quotes or apostrophes in a string is if that string contained both symbols. This rarely happens in real programs.
As in the C and C++ programming languages, there is a subtle difference between a character value and a string value. A single character (that is, a string of length one) may appear anywhere MASM allows an integer constant or a string. If you specify a character constant where MASM expects an integer constant, MASM uses the ASCII code of that character as the integer value. Strings (whose length is greater than one) are allowed only within certain contexts.
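The distinction can be sketched as follows. The variable name Greeting is made up. The first two lines use a single character where an integer is expected, while the longer string appears as the operand of a data declaration, which is one of the contexts that accepts it.

```asm
                mov     al, 'A'         ;Character as integer: same as mov al, 41h.
                cmp     al, 'Y'         ;Compares al against the ASCII code for Y.

Greeting        byte    "Hello there"   ;A string longer than one character.
```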
1.0 3.14159 625.25 -128.0 0.5
Scientific notation is also identical to the form used by various HLLs:
1e5 1.567e-2 -6.02e-10 5.34e+12
The exact range and precision of the numbers depends on your particular floating point package. However, MASM generally emits binary data for the above constants that is compatible with the 80x87 numeric coprocessors. This form corresponds to the numeric format specified by the IEEE standard for floating point values. In particular, the constant 1.0 is not the binary equivalent of the integer one.
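To see why 1.0 and 1 differ, compare the data emitted for a single precision real and for a 32 bit integer. The variable names are hypothetical. The byte values in the comments follow from the IEEE single precision format (1.0 is 3F800000h) and the 80x86's little endian byte order.

```asm
RealOne         real4   1.0             ;Emits 00 00 80 3F (3F800000h, IEEE single).
IntOne          dword   1               ;Emits 01 00 00 00 (00000001h, integer).
```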
5[bx] could be a textual constant associated with the symbol VAR1. During assembly, an instruction of the form mov ax, VAR1 would be converted to the instruction mov ax, 5[bx].

The text constant 5[bx] would normally be written as <5[bx]>. When the text substitution occurs, MASM strips the delimiting "<" and ">" characters.
symbol          equ     expression
symbol          =       expression
symbol          textequ expression
The expression operand is typically a numeric expression or a text string. The symbol is given the value and type of the expression. The equ and "=" directives have been with MASM since the beginning. Microsoft added the textequ directive starting with MASM 6.0.
The purpose of the "=" directive is to define symbols that have an integer (or single character) quantity associated with them. This directive does not allow real, string, or text operands. This is the primary directive you should use to create numeric symbolic constants in your programs. Some examples:
NumElements     =       16
                 .
                 .
                 .
Array           byte    NumElements dup (?)
                 .
                 .
                 .
                mov     cx, NumElements
                mov     bx, 0
ClrLoop:        mov     Array[bx], 0
                inc     bx
                loop    ClrLoop
The textequ directive defines a text substitution symbol. The expression in the operand field must be a text constant delimited with the "<" and ">" symbols. Whenever MASM encounters the symbol within a statement, it substitutes the text in the operand field for the symbol. Programmers typically use this equate to save typing or to make some code more readable:
Count           textequ <6[bp]>
DataPtr         textequ <8[bp]>
                 .
                 .
                 .
                les     bx, DataPtr     ;Same as les bx, 8[bp]
                mov     cx, Count       ;Same as mov cx, 6[bp]
                mov     al, 0
ClrLp:          mov     es:[bx], al
                inc     bx
                loop    ClrLp
Note that it is perfectly legal to equate a symbol to a blank operand using an equate like the following:
BlankEqu textequ <>
The purpose of such an equate will become clear in the sections on conditional assembly and macros.
The equ directive provides almost a superset of the capabilities of the "=" and textequ directives. It allows operands that are numeric, text, or string literal constants. The following are all legal uses of the equ directive:
One             equ     1
Minus1          equ     -1
TryAgain        equ     'Y'
StringEqu       equ     "Hello there"
TxtEqu          equ     <4[si]>
                 .
                 .
                 .
HTString        byte    StringEqu       ;Same as HTString byte "Hello there"
                 .
                 .
                 .
                mov     ax, TxtEqu      ;Same as mov ax, 4[si]
                 .
                 .
                 .
                mov     bl, One         ;Same as mov bl, 1
                cmp     al, TryAgain    ;Same as cmp al, 'Y'
Manifest constants you declare with equates help you parameterize a program. If you use the same value, string, or text, multiple times within a program, using a symbolic equate will make it very easy to change that value in future modifications to the program. Consider the following example:
Array           byte    16 dup (?)
                 .
                 .
                 .
                mov     cx, 16
                mov     bx, 0
ClrLoop:        mov     Array[bx], 0
                inc     bx
                loop    ClrLoop
If you decide you want Array to have 32 elements rather than 16, you will need to search throughout your program and locate every reference to this data and adjust the literal constants accordingly. Then there is the possibility that you missed modifying some particular section of code, introducing a bug into your program. On the other hand, if you use the NumElements symbolic constant shown earlier, you would only have to change a single statement in your program, reassemble it, and you would be in business; MASM would automatically update all references using NumElements.
MASM lets you redefine symbols declared with the "=" directive. That is, the following is perfectly legal:
SomeSymbol      =       0
                 .
                 .
                 .
SomeSymbol      =       1
Since you can change the value of a constant in the program, the symbol's scope (where the symbol has a particular value) becomes important. If you could not redefine a symbol, one would expect the symbol to have that constant value everywhere in the program. Given that you can redefine a constant, a symbol's scope cannot be the entire program. The solution MASM uses is the obvious one: a manifest constant's scope is from the point it is defined to the point it is redefined. This has one important ramification - you must declare all manifest constants with the "=" directive before you use that constant. Of course, once you redefine a symbolic constant, the previous value of that constant is forgotten. Note that you cannot redefine symbols you declare with the textequ or equ directives.
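The scoping rule is easy to see in a small sketch. The symbol name Value is invented. Each instruction picks up whichever definition most recently precedes it in the source file.

```asm
Value           =       1
                mov     ax, Value       ;Assembles as mov ax, 1.

Value           =       2               ;Redefinition: the old value is forgotten.
                mov     ax, Value       ;Assembles as mov ax, 2.
```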
The .8087, .287, and .387 directives activate the floating point instruction set for the given floating point coprocessors. However, the .8086 directive also enables the 8087 instruction set; likewise, .286 enables the 80287 instruction set and .386 enables the 80387 floating point instruction set. About the only purpose for these FPU (floating point unit) directives is to allow 80287 instructions with the 8086 or 80186 instruction set or 80387 instructions with the 8086, 80186, or 80286 instruction set.

When you specify .386, .486, or .586, MASM generates instructions for 32 bit segments by default. If you attempt to run such code in real mode under MS-DOS, you will probably crash the system. There are two solutions to this problem. The first is to specify use16 as an operand to each segment you create in your program. The other solution is slightly more practical: simply put the following statement after the 32 bit processor directive:
option segment:use16
This directive tells MASM to generate 16 bit segments by default, rather than 32 bit segments.
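Both approaches are sketched below. The segment names are made up, and this is an outline under those assumptions rather than a complete DOS program. The first segment inherits the 16 bit default set by the option statement; the second requests use16 explicitly on its own segment directive.

```asm
                .386                    ;Enable 80386 instructions.
                option  segment:use16   ;Make 16 bit segments the default.

cseg            segment para public 'code'      ;16 bit by default.
                mov     eax, 0          ;32 bit register inside a 16 bit segment.
cseg            ends

cseg2           segment para public use16 'code' ;Explicit use16 operand.
                mov     ebx, eax
cseg2           ends
                end
```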
Note that MASM does not require an 80486 or Pentium processor if you specify the .486 or .586 directives. The assembler itself is written in 80386 code so you only need an 80386 processor to assemble any program with MASM. Of course, if you use 80486 or Pentium processor specific instructions, you will need an 80486 or Pentium processor to run the assembled code.
You can selectively enable or disable various instruction sets throughout your program. For example, you can turn on 80386 instructions for several lines of code and then return back to 8086 only instructions. The following code sequence demonstrates this:
                .386            ;Begin using 80386 instructions
                 .
                 .              ;This code can have 80386 instrs.
                 .
                .8086           ;Return back to 8086-only instr set.
                 .
                 .              ;This code can only have 8086 instrs.
                 .
It is possible to write a routine that detects, at run-time, what processor a program is actually running on. Therefore, you can detect an 80386 processor and use 80386 instructions. If you do not detect an 80386 processor, you can stick with 8086 instructions. By selectively turning 80386 instructions on in those sections of your program that execute only if an 80386 processor is present, you can take advantage of the additional instructions. Likewise, by turning off the 80386 instruction set in other sections of your program, you can prevent the inadvertent use of 80386 instructions in the 8086-only portion of the program.
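The following is a rough sketch of that arrangement, not a tested program. Have386 is a hypothetical flag that some detection routine (not shown) is assumed to have set, Buffer and ClearBuffer are made-up names, and the code assumes ds already points at dseg when ClearBuffer is called.

```asm
                .8086                   ;Start with the 8086 instruction set.

dseg            segment para public 'data'
Have386         byte    0               ;Hypothetical: nonzero if an 80386 was detected.
Buffer          dword   ?
dseg            ends

cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg

ClearBuffer     proc    near
                cmp     Have386, 0
                je      Clear8086       ;No 80386: take the 8086-only path.

                .386                    ;80386 instructions are legal from here...
                xor     eax, eax
                mov     Buffer, eax     ;One 32 bit store.
                ret

                .8086                   ;...and rejected again from here on.
Clear8086:      mov     word ptr Buffer, 0
                mov     word ptr Buffer+2, 0
                ret
ClearBuffer     endp

cseg            ends
                end
```

Because the segment is opened while .8086 is in effect, it stays a 16 bit segment. An accidental 80386 instruction in the Clear8086 path would be flagged by the assembler rather than crashing an older processor at run time.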
ret instruction encountered along that execution path terminates the procedure. Such expressive freedom, however, is often abused, yielding programs that are very hard to read and maintain. Therefore, MASM provides facilities to declare procedures within your code. The basic mechanism for declaring a procedure is:
procname        proc    {NEAR or FAR}

                <statements>

procname        endp
As you can see, the definition of a procedure looks similar to that for a segment. One difference is that procname (that is, the name of the procedure you're defining) must be a unique identifier within your program. Your code calls this procedure using this name; it wouldn't do to have another procedure by the same name. If you did, how would the program determine which routine to call?
Proc allows several different operands, though we will only consider three: the single keyword near, the single keyword far, or a blank operand field. MASM uses these operands to determine if you're calling this procedure with a near or far call instruction. They also determine which type of ret instruction MASM emits within the procedure. Consider the following two procedures:
NProc           proc    near
                mov     ax, 0
                ret
NProc           endp

FProc           proc    far
                mov     ax, 0FFFFH
                ret
FProc           endp
and:
                call    NPROC
                call    FPROC
The assembler automatically generates a three-byte (near) call for the first call instruction above because it knows that NProc is a near procedure. It also generates a five-byte (far) call instruction for the second call because FProc is a far procedure. Within the procedures themselves, MASM automatically converts all ret instructions to near or far returns depending on the type of routine.
Note that if you do not terminate a proc/endp section with a ret or some other transfer of control instruction and program flow runs into the endp directive, execution will continue with the next executable instruction following the endp. For example, consider the following:
Proc1           proc
                mov     ax, 0
Proc1           endp

Proc2           proc
                mov     bx, 0FFFFH
                ret
Proc2           endp
If you call Proc1, control will flow on into Proc2 starting with the mov bx,0FFFFh instruction. Unlike high level language procedures, an assembly language procedure does not contain an implicit return instruction before the endp directive. So always be aware of how the proc/endp directives work.
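The usual remedy is simply to end every procedure with an explicit return. A corrected sketch of Proc1 might look like this:

```asm
Proc1           proc
                mov     ax, 0
                ret                     ;Explicit return: control no longer falls
Proc1           endp                    ; through into whatever follows the endp.
```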
There is nothing special about procedure declarations. They're a convenience provided by the assembler, nothing more. You could write assembly language programs for the rest of your life and never use the proc and endp directives. Doing so, however, would be poor programming practice. Proc and endp are marvelous documentation features which, when properly used, can help make your programs much easier to read and maintain.
MASM versions 6.0 and later treat all statement labels inside a procedure as local. That is, you cannot refer directly to those symbols outside the procedure. For more details, see "How to Give a Symbol a Particular Type".