# Chapter 4 - Floating Point Mathematics

Many applications require floating point mathematics. Unlike many small microcontrollers, the QED Board provides a built-in floating point package including trigonometric, logarithmic, and exponential functions and formatted real number input and output. The floating point routines also provide the basis for a complete matrix algebra library as described in later chapters.

Programming in QED-Forth with floating point arithmetic confers many advantages. QED-Forth's assembly coded floating point operations are fast; in fact, they run about two and a half times faster than an integer */ operation. Fractional numbers are easy to represent, and the wide dynamic range of floating point numbers means that you don't have to worry about the range limitations of integer math or the hassle of scaling operations to preserve precision. Floating point math is simple; instead of a host of integer, double number, and mixed-number operations, the four basic floating point operations F+, F-, F*, and F/ perform the math while optimally preserving precision. Error checking and benchmarking are supported, and floating point numbers can be recognized and printed in a variety of convenient formats. This chapter describes the floating point operations in detail.

## Floating Point Number Input Format

Numbers that include either a decimal point or an embedded "E" or "e" are recognized by the QED-Forth interpreter as floating point numbers. Numbers without a decimal point or "E" are recognized as integers. It is strongly recommended that all floating point numbers be entered with a decimal point. The following general floating point format is accepted:

Sxxx.yyyEszz

where

S | is the sign of the number, either "+"," -", or missing. |

xxx.yyy | is the mantissa part of the number, any length, with an optional (but recommended) decimal point. Isolated embedded commas are permitted. |

E ( or "e") | designates that an optional base 10 exponent may follow. |

s | is the sign of the base 10 exponent, either "+","-", or missing. |

zz | is the exponent, up to 2 digits. |

Positive and negative numbers are represented over a magnitude range of 1.0x2ˆ-128 to 1.99999x2ˆ+127 corresponding to approximately 2.9x10ˆ-39 to 3.4x10ˆ38. The resolution of the mantissa is 4.8 decimal digits.

It is recommended that a decimal point always be included when entering a floating point number. If the current number base is hexadecimal, "E" is a valid digit, and a number containing an E but no decimal point could be interpreted as an integer. As described in the previous chapter, when the QED-Forth interpreter gets a string from the input stream, it first tries to find it in the dictionary. If it is not found, it tries to convert it to an integer in the current number base and, if it cannot be converted, tries to convert it to a floating point number. An embedded decimal point in the numeric string prevents the string from being interpreted as a valid integer, but allows it to be interpreted as a floating point number (if the rest of the string is in the proper floating point format). Depending on the current number base, an embedded "E" may or may not be sufficient to flag the number as a floating point quantity.

Floating point numbers are always converted and printed as decimal numbers, irrespective of the contents of the user variable BASE.

Some examples of valid input numbers are:

- .414
- +1.414e
- 1234.56E11
- 1.5678E-23
- -0.123e5
- -.0E
- -.00001E+1
- 5.
- 12,344.
- 1e
- 0.
- .1e-4

## Printing Formats

Floating point numbers can be printed using any of the words

- SCIENTIFIC.
- FIXED.
- FLOATING.
- F.

All of these routines remove a floating point number from the stack and print it via the serial port. The first three of these routines print a number in the specified format (scientific, fixed, or floating, respectively). The word F. prints the number in the default format. After a cold restart, the FLOATING format is the default used by F., but the programmer can execute one of the words FIXED SCIENTIFIC or FLOATING to select which format is used by subsequent execution of F**. .**

### Fixed Format

The fixed format is controlled by the user variables LEFT.PLACES and RIGHT.PLACES which set the maximum number of characters to the left and right of the decimal point. After a cold startup, these values are initialized to 4 and 3, respectively, and the programmer can alter them using the ! (store) command.

Other user variables that control the fixed format are the flags TRAILING.ZEROS and FILL.FIELD. If TRAILING.ZEROS is OFF, trailing zeros to the right of the decimal point are suppressed; this is the default state. To print all significant zeros up to a maximum of RIGHT.PLACES digits, execute

TRAILING.ZEROS ON

To decimal align tabular output in fixed mode (as when you are printing out a matrix or table of values), execute

FILL.FIELD ON

which pads out each floating point number with spaces so that the decimal points line up in each column of data. The default state of FILL.FIELD is OFF.

### Scientific Format

The SCIENTIFIC format prints a minus sign if needed, then one digit to the left of the decimal point, MANTISSA.PLACES digits to the right of the decimal point, followed by E, the sign of the exponent, and a 2-digit exponent. MANTISSA.PLACES is a user variable that can be altered by the programmer.

### Floating Format

FLOATING format automatically chooses between fixed and scientific formats to display the number. The number is printed in a field size specified by the SCIENTIFIC format, but the number is printed in FIXED format if it can be displayed with at least as many significant digits as it would using scientific format. If not, SCIENTIFIC format is used. Try playing with the format options to see how they work:

SCIENTIFIC↓ok

45.678 F.↓4.568E+01 ok

45.678 4 MANTISSA.PLACES ! F.↓4.5678E+01 ok

FIXED↓ok

45.678 1 RIGHT.PLACES ! F.↓45.7 ok

45.678 3 RIGHT.PLACES ! F.↓45.678 ok

FLOATING↓ok

45.678 1 MANTISSA.PLACES ! F.↓46. ok

45.678 4 MANTISSA.PLACES ! F.↓45.678 ok

Turning FILL.FIELD ON makes tabular output in FLOATING format neater by ensuring that the field width of numbers printed in scientific notation is the same as the width of numbers printed in fixed notation.

## Stack Notation and Floating Point Stack Operators

A floating point number uses two normal stack locations. A family of stack manipulation words is available to manipulate these 32-bit items, including:

- FDROP
- F2DROP
- FSWAP
- FROT
- FDUP
- F2DUP
- FOVER
- FDUP>R
- F>R
- FR>
- FR@
- FR>DROP

In writing stack notation floating point numbers are identified with an "r" (for *r*eal numbers), while signed and unsigned integers are identified with an "n" and "u", respectively (consult the glossary for a list of stack symbols).

## Floating Point Operators

The floating point operators F+ F- F* and F/ perform floating point addition, subtraction, multiplication, and addition, respectively. Each requires two operands on the stack and places its result on the stack. A number of transcendental functions are also available for powering (FˆN, F and 10ˆN), arithmetic inversion (FNEGATE), multiplicative inversion (1/F), absolute value (FABS), square root (FSQRT), scaling (F2*, F2/, FSCALE), trigonometric, logarithmic, and exponential conversions.

### Trigonometric Functions

The following trigonometric and inverse trigonometric functions are available:

- FSIN
- FASIN
- FCOS
- FACOS
- FTAN
- FATAN

The trigonometric functions require their argument in radians and the inverse trigonometric functions return their results in radians. Degrees may be converted to radians using >RADIANS :

45.0 >RADIANS FSIN F.↓0.7071 ok

Similarly, the results of the inverse trigonometric functions may be converted to degrees using >DEGREES :

0.5 FSQRT FASIN >DEGREES F.↓45. ok

### Logarithmic and Exponential Functions

QED-Forth's logarithmic functions are:

- FLOG2
- FLN
- LOG10

and exponential ("anti-log") functions are:

- FALOG2
- FALN
- FALOG10

For example,

200. FLOG10 F.↓2.301 ok

2.301 FALOG10 F.↓200. ok

The base two logarithm FLOG2, and exponential, FALOG2, are slightly faster than the natural logarithm, FLN, and exponential, FALN, or the base 10 log and exponential FLOG10 and FALOG10.

### Random Number Generation

FRANDOM uses a pseudo-random bit sequence generator to leave a random number between 1.0 and 2.0 on the stack. The last 16-bit mantissa generated is stored in the user variable RANDOM#. This user variable can be initialized to create a reproducible sequence of pseudo-random numbers. RANDOM.GAUSSIAN generates a pseudo-random value drawn from a gaussian distribution with unity standard deviation and zero mean.

### Floating Point Comparison

The following words compare one or two floating point numbers on the stack and return a boolean flag:

- F0=
- F0<>
- F0<
- F0>
- F0>=
- F0⇐
- F=
- F<>
- F<
- F>
- F>=
- F⇐

In addition, there are fast operators for choosing the minimum or maximum of two values:

- FMAX ( r1\r2 -- r | retains the most positive )
- FMIN ( r1\r2 -- r | retains the most negative )

### Pre-defined Floating Point Quantities

The following pre-defined words place floating point constants on the stack:

ZERO | ONE | TEN | |

PI | PI/2 | 2PI/360 | 360/2PI |

1/TEN | 1/PI | SQRT(2) | 1/SQRT(2) |

LOG10(2) | LN(2) | 1/LOG10(2) | 1/LN(2) |

INFINITY | -INFINITY | 1/INFINITY | -1/INFINITY |

The final row of constants lists the largest and smallest numbers that can be represented. In addition to setting the FP.ERROR flag, mathematical operations that overflow return a result of positive or negative infinity, and operations that underflow return positive or negative 1/infinity.

## Floating Point Memory Access, Variables, and Constants

The kernel words FCONSTANT and FVARIABLE define floating point constants and variables. For example, the square root of three may be defined as a floating point constant by executing

3.0 FSQRT FCONSTANT SQRT.OF.3↓ok

Now execution of SQRT.OF.3 will place the square root of 3.0 on the stack:

SQRT.OF.3 F.↓1.732 ok

F@ and F! can be used with floating point variables just as @ and ! are used with integer variables. These are "page smart" memory operators that can access memory on any page, and correctly handle memory accesses that cross page boundaries.

Floating point self-fetching variables may be defined with the word REAL: as

REAL: BUDGET.DEFICIT

To set the value, use the TO operator as

4.0E12 TO BUDGET.DEFICIT

Stating the name of the self-fetching variable leaves the value on the stack:

BUDGET.DEFICIT F.↓4.000E+12ok

## Number Conversion

### Integer to Floating Point Conversion

FLOT converts a signed integer on the stack to a floating point number on the stack. For example,

-1234 FLOT F.↓-1234. ok

-1234 FLOT 10000 FLOT F/ F.-0.1234 ok

Note that FLOT converts integers greater than 32,767 to negative floating point numbers. Use UFLOT to convert an unsigned integer on the stack to a floating point number. For example,

60000 UFLOT F.↓60000. ok

DFLOT converts a signed double number to a floating point number. For example,

DIN 123400 DFLOT F.↓1.234E+05 ok

### Floating Point to Integer Conversion

The kernel word INT converts a signed floating point number on the stack to an integer, truncating the fractional part toward zero. For example,

35.6 INT .↓35 ok

-12.7 INT .↓-12 ok

Likewise, DINT converts a signed floating point number to a double number, as

1.234E5 DINT D.↓123400 ok

FIXX rounds a signed floating point number to the nearest integer. For example,

1.45 FIXX .↓1 ok

-4.8 FIXX .↓-5 ok

Note that FIXX converts floating point numbers greater than 32,767 to the most positive signed 16-bit integer which is +32,767. Use UFIXX to convert an unsigned floating point number up to 65535 to a floating point number. For example,

60000. UFIXX U.↓60000 ok

DFIXX rounds a signed floating point number to the nearest double number. For example,

70,000. DFIXX D.↓70000 ok

INT.FLOOR returns the most positive signed integer less than or equal to a floating point number. For example,

4.99 INT.FLOOR .↓4 ok

-3.99 INT.FLOOR .↓-4 ok

-2. INT.FLOOR .↓-2 ok

Likewise, DINT.FLOOR returns the most positive signed double number less than or equal to a floating point number.

### Floating Point to Integer to Floating Point Conversion

FINT is just like INT but it returns a floating point number--it's the equivalent of INT FLOT. For example,

35.6 FINT F.↓35. ok

FRTI (f-round-to-integer) rounds a floating point number to the nearest signed integer, just as FIXX does, but leaves it in floating point representation. It is the equivalent of FIXX FLOT. For example,

0.45 FRTI F.↓0. ok

0.55 FRTI F.↓1. ok

-4.8 FRTI F. ↓-5. ok

Likewise, FLOOR is a version of INT.FLOOR that returns the result in floating point format.

### Floating Point to String Conversion

F>FIXED$ (pronounced "f-to-fixed-string"), F>SCIENTIFIC$ and F>FLOATING$ each convert a floating point number on the stack to a string in fixed, scientific or floating format respectively. These formats are explained in a prior section of this chapter. The address of the string is returned under the count under a true flag. For example, we could define

: TYPE.FIXED ( r -- ) \ types r in fixed format

F>FIXED$

IF COUNT.TYPE

ENDIF

;

PI TYPE.FIXED↓3.142 ok

If conversion is unsuccessful (for example, as a result of LEFT.PLACES and RIGHT.PLACES being too small to accommodate the number in FIXED format) a false flag is returned.

### String to Floating Point Conversion

FNUMBER expects the extended address of the string on the stack, and converts the string to a floating point number which is left on the stack under a flag. The string can have any number of leading spaces but it must be terminated with a space. For example, try

" .01234e+3 " FNUMBER . F.↓-1 12.34 ok

### Numeric Input

NEXT.NUMBER may be used to input a floating point number from the keyboard. It ignores any non-numeric inputs preceding the first valid number, and waits for as many lines as necessary for the first valid number. The first valid integer or floating point number is left on the stack as a floating point number (integers are converted to floating point via the FLOT operator). For example,

NEXT.NUMBER some garbage 12↓ok [ 2 ] -32768 \ 259F.↓12. ok

ASK.FNUMBER waits for a line of text terminated by a carriage return, and attempts to convert the text to a valid floating point number. Consult its glossary listing for details of operation.

## Floating Point Error Handling

The user variable FP.ERROR holds an error flag that can be inspected after any floating point operation. If the contents are 0, there was no error, while contents of 1 indicate underflow and -1 indicates overflow. The FORTH words UNDERFLOW and OVERFLOW set FP.ERROR to 1 and -1, respectively. In an application in which errors must be detected, the FP.ERROR flag must be polled before the next floating point operation changes them.

Errors detected in the floating point math operations do not cause an ABORT. If the programmer desires, this can be accomplished by polling the FP.ERROR variable and executing an ABORT with an appropriate message if it is non-zero. For example, F/ could be redefined to halt processing if an error is detected during division:

: F/ ( r1\r2 -- r3 | r3 = r1/r2 ) F/ FP.ERROR @ ?DUP IF \ if error occurred -1 = IF ." Overflow or divide by zero error!" ELSE ." Underflow error!" ENDIF \ print message... ABORT \ ... and abort ENDIF ;

## Internal Representation of Floating Point Numbers

Ordinarily you need not worry about internal representation; for input and output the interpreter takes care of conversion automatically. Internally each floating point number is represented as two 16-bit numbers (a total of 4 bytes) in memory or on the stack. It is convenient to think of the number as two integer-sized locations on the stack. The top one contains both the sign of the number (in the most significant byte) and a signed base 2 exponent (the least significant byte), and the bottom one contains a 16-bit mantissa.

The mantissa uses a hidden-bit format. That is, it represents the fractional part of a mantissa between 1 and 2. Using a full 8 bits just for the number's sign is a little wasteful of memory, but it improves the speed of the operations. The two 16-bit numbers have the following format:

- SSEE –> Top of Stack or low memory
- XXXX –> Next on Stack or high memory

where

SS | is the 8-bit sign of the floating point number that is 01 if the number is positive, FF_{H} if the number is negative, and 00 if the number is zero. |

EE | is the 8-bit signed base 2 exponent |

XXXX | is the 16-bit left-justified hidden-bit mantissa. |

There is always assumed to be a 1 to the left of the binary point. Zero is represented when both 16-bit cells are zero. The magnitude of the floating point number is given by 1.XXXX x 2ˆEE .

You can examine the internal representation of floating point numbers by placing a number on the stack and looking at the stack print in hexadecimal base. For example,

15.0 HEX↓ok [ 2 ] \ E000 \ 103

DECIMAL SP!↓ok

shows that 15.0 is represented internally as a sign byte of 01, a base 2 exponent of 03, and a mantissa of E000H.

## Floating Point Numerical Accuracy

### Arithmetic Accuracy

Because a hidden-bit format is used, the precision of representation is always better than 16 bits. (It actually averages to 16.5 bits -- 16 bits for mantissas near 1.0000 and 17 bits for mantissas near 1.9999). Excluding division the basic arithmetic operations preserve this precision, rounding their results to the numeral closest to the infinitely precise result. In the interest of speed precision is relaxed a bit for the division algorithm, and errors of up to +/- 1 least significant bit occur for some operands.

### Function Accuracy

Algorithms are used for the transcendental functions that give constant relative accuracy over the argument's range. The results are usually accurate to 15 bits.

## Floating Point Benchmarks

QED-Forth includes the kernel word BENCHMARK: that lets you measure the execution time and number of basic floating point operations for any word. Let's use BENCHMARK: to calculate the speed of QED-Forth's basic floating point operations.

First, a word is defined that performs a specified number of floating point multiplications:

: F*TEST ( n -- | n is the number of floating point operations ) 1- FOR PI LOG10(2) F* \ multiply two constants and FDROP \ drop the result NEXT ;

The numbers being multiplied are constants, and the result is dropped so that the loop has no net stack effect. To use the BENCHMARK: routine, QED-Forth's system clock must first be started by executing:

START.TIMESLICER↓ok

This elapsed-time clock will run until a reset or restart occurs, or until STOP.TIMESLICER is executed. Next put any needed stack items on the stack and invoke BENCHMARK: followed by the word to be tested. To see how long it takes to do 10,000 multiplies, execute

DECIMAL↓ok\ make sure base is decimal

10,000 BENCHMARK: F*TEST↓Time (sec) = 2.725#F+ & F- = 0#F* & F/ = 10000ok

Thus it requires 2.725/10,000 sec = 0.275 msec to place two floating point constants on the stack, multiply them, drop the result, and branch to the start of the loop. This is a rate of over 3600 floating point operations per second (FLOPS) with an 8 MHz crystal frequency. To calculate the speed of the F* operation alone, we can subtract out the execution time of the following routine:

: FP.ADJUST ( n -- | n is the number of loops ) 1- FOR PI LOG10(2) F2DROP NEXT ;

Because F2DROP executes in exactly the same amount of time as FDROP, this routine accounts for all of the execution time of F*TEST except for the F* operation. This routine is benchmarked as

10000 BENCHMARK: FP.ADJUST↓Time (sec) = 1.12#F+ & F- = 0#F* & F/ = 0ok

Thus 10,000 F* operations require 2.725 - 1.12 seconds, which means that the raw speed of the F* operation is about 0.16 msec. This translates to 6250 floating point operations per second, which will be the upper bound on the speed of the multiplication if the QED Board is operated with an 8 MHz crystal. Vector and matrix operations come closest to this raw speed. Of course, clocking the QED Board with a 16 MHz crystal doubles all of these speeds.

The following table gives benchmarks for the raw speed of some floating point operations:

Operation | Typical speed (msec) at 8 MHz crystal frequency |
---|---|

F+ , F- | 0.170 |

F* | 0.160 |

F/ | 0.150 |

FSQRT | 1.550 |

FSIN | 2.770 |

FCOS | 3.430 |

FLN | 2.480 |

By way of comparison, single precision integer + * and / operations require 0.015, 0.05, and 0.070 msec respectively. Where integer math can be used, it definitely confers a speed advantage over floating point operations. In many realistic computations, however, 16-bit integer quantities are not sufficient because they overflow (if used to accumulate a result), or cannot handle a wide enough range of numbers while maintaining precision, or cannot represent fractional quantities without scaling operations. In these cases, integer computations must rely on double number mathematics, and/or must use the */ operator to scale the results to the proper magnitudes. The benchmarks for the relevant double number and mixed single/double number operations are:

Operation | Typical speed (msec) at 8 MHz crystal frequency |
---|---|

D+ , D- | 0.030 |

UD*S | 0.230 |

*/ | 0.385 |

In practice, floating point mathematics is often more convenient and just as fast as scaled double number arithmetic for many computation-intensive applications.