# BASM for Beginners

## BASM for beginners, lesson 2

This is the second chapter of the introduction to BASM programming with Delphi. The first chapter was a short introduction to integer code and this second one is about floating point code. Our example functions can evaluate a second order polynomial. The parameters A, B and C that defines the polynomial is coded as local constants. Input to the function is the variable X of type double and the result is also of type double. The function looks like this.

function SecondOrderPolynomial1(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

begin
Result := A*X*X + B*X + C;
end;

Copying the assembler code from the cpu view gives us this.

function SecondOrderPolynomial2(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

begin
{
push  ebp
mov   ebp,esp
}
Result := A*X*X + B*X + C;
{
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
fstp  qword ptr [ebp-\$08]
wait
fld   qword ptr [ebp-\$08]
}
{
pop   ecx
pop   ecx
pop   ebp
}
end;

Let us explain the asm code line by line. The begin results in this code,

begin
{
push  ebp
mov   ebp,esp
}

which sets up a stackframe for the function. A stack frame is just a piece of memory that is reserved for the stack. A stack frame is accessed through two pointers, the basepointer and the stack pointer. The basepointer is in ebp and the stack pointer is in esp. These two registers are reserved for use by these pointers only. The line push ebp backs up the basepointer, The line mov ebp, esp sets up a new basepointer which is pointing to the top of the stack. The line add esp, -\$08 moves the stack pointer 8 bytes down. As a curiosity the stacks grows downward and the last line could more intuitively have been sub esp, 8. The new stack frame that was created by these three lines is standing on top of, or actually hanging under, the last stackframe, which was probably allocated by the function that called our SecondOrderPolynomial function.

The next line of Pascal was compiled into no less than 9 lines of ASM.

Result := A*X*X + B*X + C;
{
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
fstp  qword ptr [ebp-\$08]
wait
fld   qword ptr [ebp-\$08]

fstp  qword ptr [ebp-\$08]
fld   qword ptr [ebp-\$08]

are redundant. They just copy the result into the stack frame and loads it back in again. Such a waste of precious time and energy :-). The line wait makes sure that any exceptions that migth have been raised by one of the floating point instructions is checked. See Intel SW Developers Manual Volume 2 page 822 for the full explanation.

Then there is only three lines of asm back to explain.

{
pop   ecx
pop   ecx
pop   ebp
}
end;

These are removing the stack frame, by restoring the values of esp and ebp back to the values they had when the function was entered. This code is much more intuitive and does the same thing

pop ebp

it is also more effective and I do not know why the compiler is incrementing the stack pointer in this cumbersome way. Remember that ecx can be used for free and assigning values to it is just like pouring them into a waste bucket.

Now we only need to investigate what is hiding behind the [A] in the line fld qword ptr [A]. We know that A must be a pointer to the place where A is placed in memory. The address of A is coded in the instruction. This is the full line from the cpu view.

00451E40 DD05803C4500     fld qword ptr [B]

00451E40 is the address of this instruction in the exe file. DD05803C4500 is the machine code for the line and fld qword ptr [B] is the more human readable format of it. By consulting the Intel SW Developers Manual Volume 2 on page 280 we see that the opcode for fld is D9, DD, DB or D9C0 depending on the type of data it should load. We recognize DD which is the opcode for fld double. What is left is 05803C4500. 05 is (Somebody help me ! ). 803C4500 is the 32 bit address of A.

Let us convert the function into a pure BASM function now that we have finished analyzing it.

function SecondOrderPolynomial3(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
push  ebp
mov   ebp,esp
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
fstp  qword ptr [ebp-\$08]
wait
fld   qword ptr [ebp-\$08]
pop   ecx
pop   ecx
pop   ebp
end;

Now comes a few surprises. First the function will not compile. faddp st(1) is not recognized as a valid combination of opcode and operands. By again consulting the Intel manual we learn that faddp comes in one version only. It operates on st(0), st(1) and it is not necessary to write faddp st(0), st(1) and the short form faddp is the only valid one. We comment out st(1) and it compiles now.

Second surprise. Calling the function with X = 2 yields the calculation Y = 2^2+2*2+3 = 11. SecondOrderPolynomial3 returns 3! We must open the FPU view as well as the CPU view and trace through the code and watch what is happening. It is seen that A=1 is correctly loaded into st(0) by the fourth line, but the 5. line that should multiply A by X, 1 by 2, is resulting in st(0) being a very small number, in effect 0. This tells us that X is near zero instead of 2. Two things can be wrong. The calling code is transferring a wrong value of X or we are addressing X incorrectly. By comparing the calling code when calling function SecondOrderPolynomial3 and SecondOrderPolynomial1 we see that it is the same and this is not the bug. It would also be quite surprising if Delphi was suddenly getting this wrong! Try to step through the calling code while watching the memory pane in the cpu view. The little green arrow is the position of the stackpointer. The calling code looks like this.

push dword ptr [ebp-\$0c]
push dword ptr [ebp-\$10]
call SecondOrderPolynomial1

Two pointers are pushed onto the stack. One of them is a pointer to X. I do not what the other one is. By watching the memory pane we see that the first one is the pointer to X and the second one is a nil pointer. When we trace into the function we see that the first two lines are repeated. The compiler automatically inserted the push ebp and mov ebp, esp lines. Because push decrements the stack pointer by 4, the reference to X went wrong. These two first lines are removed and everything is ok again.

Now we have finished analyzing the code and know what it does, we can begin optimizing it.

Let us first change the two fstp/fld lines that we already have seen is redundant.

function SecondOrderPolynomial4(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//push  ebp
//mov   ebp,esp
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
//fstp  qword ptr [ebp-\$08]
wait
//fld   qword ptr [ebp-\$08]
pop   ecx
pop   ecx
pop   ebp
end;

This was the only references to the stack frame which is not needed now.

function SecondOrderPolynomial5(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//push  ebp
//mov   ebp,esp
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
wait
//pop   ecx
//pop   ecx
//pop   ebp
end;

That removed another 6 lines and reduces the function to this.

function SecondOrderPolynomial6(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//Result := A*X*X + B*X + C;
fld   qword ptr [A]
fmul  qword ptr [ebp+\$08]
fmul  qword ptr [ebp+\$08]
fld   qword ptr [B]
fmul  qword ptr [ebp+\$08]
wait
end

X is loaded from memory into the FPU 3 times. It would be more effective to load it once and then reuse it.

function SecondOrderPolynomial7(X : Double) : Double;

const
A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+\$08]
fld   qword ptr [A]
fmul  st(0), st(1)
fmul  st(0), st(1)
fld   qword ptr [B]
fmul  st(0), st(2)
ffree st(2)
wait
end
;

I magically came up with this code. The first line loads X. The second line loads A. The third line multiplies A with X. The 4. line multiplies a*X know in st(0) with X. Then we have calculated the first term. Calculating the second term is done by loading B and multiply it with X. This was the last time we needed X in we free's the register, st(2), holding it. Now we add term 1 and 2 and popping term 2 of the stack. The only thing left to do is adding C. The result is now in st(0) and the other registers are empty. Then we check for exceptions with wait and is done. It is seen that no redundant work is done and this implementation is near optimal.

There exits seven instructions for loading often used constants into the FPU. One of these constants is 1, which can be loaded with the instruction fld1. Using it saves one load from memory, which can be costly in terms of clock cycles if data are not properly aligned.

function SecondOrderPolynomial8(X : Double) : Double;

const
//A : Double = 1;
B : Double = 2;
C : Double = 3;

asm
//Result := A*X*X + B*X + C;
fld   qword ptr [ebp+\$08]
//fld   qword ptr [A]
fld1
fmul  st(0), st(1)
fmul  st(0), st(1)
fld   qword ptr [B]
fmul  st(0), st(2)
ffree st(2)
wait
end
;

This ended the second lesson. Stay tuned for more.

Regards

Dennis