Floating-point mathematics is a complex topic that confuses many programmers. The tutorial below should help you recognize programming situations where floating-point errors are likely to occur and how to avoid them. It should also allow you to recognize cases that are caused by inherent floating-point math limitations as opposed to actual compiler bugs.
Decimal and Binary Number Systems
Normally, we count things in base 10. The base is completely arbitrary. The only reason that people have traditionally used base 10 is that they have 10 fingers, which make handy counting tools.
The number 532.25 in decimal (base 10) means the following:
   (5 * 10^2) + (3 * 10^1) + (2 * 10^0) + (2 * 10^-1) + (5 * 10^-2)
 = 500 + 30 + 2 + 2/10 + 5/100
 = 532.25
In the binary number system (base 2), each column represents a power of 2 instead of 10. For example, the number 101.01 means the following:
   (1 * 2^2) + (0 * 2^1) + (1 * 2^0) + (0 * 2^-1) + (1 * 2^-2)
 = 4 + 0 + 1 + 0 + 1/4
 = 5.25 Decimal
How Integers Are Represented in PCs
Because there is no fractional part to an integer, its machine representation is much simpler than it is for floating-point values. Normal integers on personal computers (PCs) are 2 bytes (16 bits) long with the most significant bit indicating the sign. Long integers are 4 bytes long. Positive values are straightforward binary numbers. For example:
    1 Decimal = 1 Binary
    2 Decimal = 10 Binary
   22 Decimal = 10110 Binary, etc.
However, negative integers are represented using the two's complement scheme. To get the two's complement representation for a negative number, take the binary representation for the number's absolute value and then flip all the bits and add 1. For example:
    4 Decimal = 0000 0000 0000 0100

                1111 1111 1111 1011   Flip the Bits

   -4 Decimal = 1111 1111 1111 1100   Add 1
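The flip-and-add-1 recipe can be checked in a few lines of Python (a sketch; Python integers are arbitrary-precision, so we mask to 16 bits to mimic a PC integer):

```python
def twos_complement16(n):
    """Return the 16-bit two's complement bit pattern of integer n."""
    return n & 0xFFFF  # masking to 16 bits yields the two's complement encoding

def flip_and_add_one(n):
    """Textbook recipe: flip all bits of |n|, then add 1 (within 16 bits)."""
    flipped = ~abs(n) & 0xFFFF      # flip the bits
    return (flipped + 1) & 0xFFFF   # add 1

print(format(twos_complement16(-4), "016b"))  # 1111111111111100
print(format(twos_complement16(-1), "016b"))  # 1111111111111111 (all 1s)
```

Both routines agree, and -1 comes out as all 1s, matching the note below about Basic's logical true.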
Note that -1 Decimal = 1111 1111 1111 1111 in Binary, which explains why Basic treats -1 as logical true (all bits = 1). This is a consequence of not having separate operators for bitwise and logical comparisons. Often in Basic, it is convenient to use the code fragment below when your program will be making many logical comparisons. This greatly aids readability.
   CONST TRUE = -1
   CONST FALSE = NOT TRUE
Note that adding any combination of two's complement numbers together using ordinary binary arithmetic produces the correct result.
Every decimal integer can be exactly represented by a binary integer; however, this is not true for fractional numbers. In fact, every fraction that has a repeating (non-terminating) representation in base 10 will also repeat in base 2, because 2 divides 10.
For binary, in particular, only fractional numbers that can be represented in the form p/q, where q is an integer power of 2, can be expressed exactly, with a finite number of bits.
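The p/q rule is easy to check with Python's fractions module (a sketch; `Fraction(x)` for a float recovers the exact rational value the binary representation actually stores):

```python
from fractions import Fraction

# 3/8 has a power-of-2 denominator, so the float stores it exactly.
print(Fraction(0.375) == Fraction(3, 8))   # True

# 1/10 does not, so the stored double is only the nearest approximation.
print(Fraction(0.1) == Fraction(1, 10))    # False
print(Fraction(0.1))                       # the exact rational actually stored
```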
Even common decimal fractions, such as decimal 0.0001, cannot be represented exactly in binary. (0.0001 is a repeating binary fraction with a period of 500 bits!)
This explains why a simple example, such as the following
   SUM = 0
   FOR I% = 1 TO 10000
      SUM = SUM + 0.0001
   NEXT I%
   PRINT SUM   ' Theoretically = 1.0.
will PRINT 1.000054 as output. The small error in representing 0.0001 in binary propagates to the sum.
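The same experiment can be run in Python (a sketch; Python floats are double precision, so the drift is far smaller than single precision's 1.000054, but the sum still misses 1.0):

```python
total = 0.0
for _ in range(10000):
    total += 0.0001   # 0.0001 is not exactly representable in binary

print(total)          # close to, but not exactly, 1.0
print(total == 1.0)   # False: the representation error accumulates
```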
For the same reason, you should always be very cautious when making comparisons on real numbers. The following example illustrates a common programming error:
   item1# = 69.82#
   item2# = 69.20# + 0.62#
   IF item1# = item2# THEN PRINT "Equality!"
This will NOT PRINT "Equality!" because 69.82 cannot be represented exactly in binary, which causes the value that results from the assignment to be SLIGHTLY different (in binary) from the value that is generated from the expression. In practice, you should always code such comparisons in such a way as to allow for some tolerance. For example:
   IF (item1# < 69.83#) AND (item1# > 69.81#) THEN PRINT "Equal"
This will PRINT "Equal".
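The same tolerance idiom can be written in Python (a sketch; `math.isclose` is the standard-library helper for comparing within a tolerance):

```python
import math

item1 = 69.82
item2 = 69.20 + 0.62

# Exact equality on floats is fragile; compare within a tolerance instead.
print(math.isclose(item1, item2, abs_tol=1e-9))    # True

# The classic illustration of why exact equality fails:
print(0.1 + 0.2 == 0.3)                            # False
print(math.isclose(0.1 + 0.2, 0.3, abs_tol=1e-9))  # True
```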
IEEE Format Numbers
QuickBasic for MS-DOS, version 3.0 was shipped with an MBF (Microsoft Binary Floating Point) version and an IEEE (Institute of Electrical and Electronics Engineers) version for machines with a math coprocessor. QuickBasic for MS-DOS, versions 4.0 and later only use IEEE. Microsoft chose the IEEE standard to represent floating-point values in current versions of Basic for the following three primary reasons:
- To allow Basic to use the Intel math coprocessors, which use IEEE format. The Intel 80x87 series coprocessors cannot work with Microsoft Binary Format numbers.
- To make interlanguage calling between Basic, C, Pascal, FORTRAN, and MASM much easier. Otherwise, conversion routines would have to be used to send numeric values from one language to another.
- To achieve consistency. IEEE is the accepted industry standard for C and FORTRAN compilers.
The following is a quick comparison of IEEE and MBF representations for a double-precision number:
           Sign Bits   Exponent Bits   Mantissa Bits
           ---------   -------------   -------------
   IEEE        1            11          52 + 1 (Implied)
   MBF         1             8          56
For more information on the differences between IEEE and MBF floating-point representation, query in the Microsoft Knowledge Base on the following words:
IEEE and floating and point and appnote
Note that IEEE has more bits dedicated to the exponent, which allows it to represent a wider range of values. MBF has more mantissa bits, which allows it to be more precise within its narrower range.
General Floating-Point Concepts
It is very important to realize that any binary floating-point system can represent only a finite number of floating-point values in exact form. All other values must be approximated by the closest representable value. The IEEE standard specifies the method for rounding values to the "closest" representable value. QuickBasic for MS-DOS supports the standard and rounds according to the IEEE rules.
Also, keep in mind that the numbers that can be represented in IEEE are spread out over a very wide range. You can imagine them on a number line: the representable values are densely packed near zero and grow farther and farther apart as you move toward infinity, so the relative precision stays roughly constant across the range.
The goal of the IEEE standard, which is designed for engineering calculations, is to maximize accuracy (to get as close as possible to the actual number). Precision refers to the number of digits that you can represent. The IEEE standard attempts to balance the number of bits dedicated to the exponent with the number of bits used for the fractional part of the number, to keep both accuracy and precision within acceptable limits.
Floating-point numbers are represented in the following form, where [exponent] is the binary exponent:
X = Fraction * 2^(exponent - bias)
[Fraction] is the normalized fractional part of the number, normalized because the exponent is adjusted so that the leading bit is always a 1. This way, it does not have to be stored, and you get one more bit of precision. This is why there is an implied bit. You can think of this like scientific notation, where you manipulate the exponent to have one digit to the left of the decimal point; except in binary, you can always manipulate the exponent so that the first bit is a 1, since there are only 1s and 0s.
[bias] is the bias value used to avoid having to store negative exponents.
The bias is 127 (decimal) for single-precision numbers and 1023 for double-precision numbers.
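The sign/exponent/fraction layout can be inspected directly with Python's struct module (a sketch for single precision, bias 127; the field names and helper are ours, not part of any library):

```python
import struct

def decode_float32(x):
    """Split a single-precision float into sign, biased exponent, fraction."""
    bits = int.from_bytes(struct.pack(">f", x), "big")
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent
    fraction = bits & 0x7FFFFF       # 23 stored mantissa bits
    # Reconstruct: implied leading 1, exponent minus the bias of 127.
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, value

print(decode_float32(6.0))   # sign 0, biased exponent 129, reconstructs 6.0
```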
Exponent values of all 0s and all 1s (binary) are reserved for representing special cases, such as zero, infinity, and various error conditions. The following examples show the single-precision IEEE format:
2 = 1 * 2^1 = 0100 0000 0000 0000 ... 0000 0000 = 4000 0000 hex
Note the sign bit is zero, and the stored exponent is 128, or 100 0000 0 in binary, which is 127 plus 1. The stored mantissa is (1.) 000 0000 ... 0000 0000, which has an implied leading 1 and binary point, so the actual mantissa is 1.
-2 = -1 * 2^1 = 1100 0000 0000 0000 ... 0000 0000 = C000 0000 hex
Same as +2 except that the sign bit is set. This is true for all IEEE format floating-point numbers.
4 = 1 * 2^2 = 0100 0000 1000 0000 ... 0000 0000 = 4080 0000 hex
Same mantissa, exponent increases by one (biased value is 129, or 100 0000 1 in binary).
6 = 1.5 * 2^2 = 0100 0000 1100 0000 ... 0000 0000 = 40C0 0000 hex
Same exponent, mantissa is larger by half -- it's (1.) 100 0000 ... 0000 0000, which, since this is a binary fraction, is 1-1/2 (the values of the fractional digits are 1/2, 1/4, 1/8, etc.).
1 = 1 * 2^0 = 0011 1111 1000 0000 ... 0000 0000 = 3F80 0000 hex
Same mantissa as the other powers of 2; the biased exponent is one less than for 2, at 127, or 011 1111 1 in binary.
.75 = 1.5 * 2^-1 = 0011 1111 0100 0000 ... 0000 0000 = 3F40 0000 hex
The biased exponent is 126, 011 1111 0 in binary, and the mantissa is (1.) 100 0000 ... 0000 0000, which is 1-1/2.
2.5 = 1.25 * 2^1 = 0100 0000 0010 0000 ... 0000 0000 = 4020 0000 hex
Exactly the same as 2 except that the bit which represents 1/4 is set in the mantissa.
0.1 = 1.6 * 2^-4 = 0011 1101 1100 1100 ... 1100 1101 = 3DCC CCCD hex
1/10 is a repeating fraction in binary. The mantissa is just shy of 1.6, and the biased exponent says that 1.6 is to be divided by 16 (it is 011 1101 1 in binary, which is 123 in decimal). The true exponent is 123 - 127 = -4, which means that the factor by which to multiply is 2^-4 = 1/16. Note that the stored mantissa is rounded up in the last bit. This is an attempt to represent the unrepresentable number as accurately as possible. (The reason that 1/10 and 1/100 are not exactly representable in binary is similar to the way that 1/3 is not exactly representable in decimal.)
0 = 1.0 * 2^-128 = all zeros -- a special case.
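All of the worked bit patterns above can be verified from Python by packing each value as a big-endian single-precision float (a sketch; `f32_hex` is our own helper name):

```python
import struct

def f32_hex(x):
    """Hex string of the IEEE single-precision encoding of x."""
    return struct.pack(">f", x).hex().upper()

for value, expected in [(2.0, "40000000"), (-2.0, "C0000000"),
                        (4.0, "40800000"), (6.0, "40C00000"),
                        (1.0, "3F800000"), (0.75, "3F400000"),
                        (2.5, "40200000"), (0.1, "3DCCCCCD"),
                        (0.0, "00000000")]:
    print(value, "=", f32_hex(value), "hex")
    assert f32_hex(value) == expected   # matches the worked examples
```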
Other Common Floating-Point Errors
The following are common floating-point errors:
- Round-off error
This error results when all of the bits in a binary number cannot be used in a calculation.
Example: Adding 0.0001 to 0.9900 (Single Precision)
Decimal 0.0001 will be represented as:
      (1.)10100011011011100010111 * 2^(-14+Bias)   (13 Leading 0s in Binary!)

  0.9900 will be represented as:

      (1.)11111010111000010100011 * 2^(-1+Bias)

  Now to actually add these numbers, the decimal (binary) points must be aligned. For this they must be Unnormalized. Here is the resulting addition:

       .000000000000011010001101 * 2^0   <- Only 11 of 23 Bits retained
     + .111111010111000010100011 * 2^0
     ________________________________
       .111111010111011100110000 * 2^0

  This is called a round-off error because some computers round when shifting for addition. Others simply truncate. Round-off errors are important to consider whenever you are adding or multiplying two very different values.
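Round-off from mismatched magnitudes is easy to reproduce in Python (a sketch in double precision: the mantissa holds 53 bits, so adding 1 to 10^16 loses the 1 entirely when the points are aligned):

```python
big = 1e16     # consumes almost all 53 mantissa bits
small = 1.0    # falls off the end of the mantissa when aligned with big

result = (big + small) - big
print(result)  # 0.0 -- the 1.0 was rounded away entirely
```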
- Subtracting two almost equal values
      .1235
     -.1234
     ______
      .0001

  This will be normalized. Note that although the original numbers each had four significant digits, the result has only one significant digit.
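This cancellation shows up in double precision too (a sketch: each operand carries a tiny representation error, and subtracting nearly equal values magnifies that error's relative size):

```python
a = 0.1235
b = 0.1234

diff = a - b
print(diff)            # approximately 0.0001, but not exactly
print(diff == 0.0001)  # False: the leading digits cancelled, leaving
                       # mostly the operands' representation error
```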
- Overflow and underflow
This occurs when the result is too large or too small to be represented by the data type.
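Python doubles show both ends of the range (a sketch; the limits cited are those of IEEE double precision):

```python
import math

# Overflow: the result exceeds the largest double (about 1.8e308) -> infinity.
print(1e308 * 10)        # inf
print(math.isinf(1e308 * 10))  # True

# Underflow: 5e-324 is the smallest positive double; halving it rounds to zero.
print(5e-324 / 2)        # 0.0
```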
- Quantizing error
This occurs with those numbers that cannot be represented in exact form by the floating-point standard.
- Division by a very small number
This can trigger a "divide by zero" error or can produce bad results, as in the following example:
      A = 112000000
      B = 100000
      C = 0.0009
      X = A - B / C

  In QuickBasic for MS-DOS, X now has the value 888887, instead of the correct answer, 888888.8888.
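The same expression can be evaluated in Python (a sketch; in double precision the result lands much closer to the true value of about 888888.8889, and note that B / C binds tighter than the subtraction, just as in the Basic code):

```python
a = 112000000.0
b = 100000.0
c = 0.0009

x = a - b / c   # division happens first: b / c is about 111111111.1
print(x)        # approximately 888888.8889 in double precision
```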
- Output error
This type of error occurs when the output routines round or truncate the value for display, so the printed number differs from the value actually stored.
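Output rounding is visible in Python as well (a sketch: the default printed form is a rounded, shortest round-tripping string, not the stored bits):

```python
x = 0.1
print(x)            # prints 0.1 -- a rounded form of the stored value
print(f"{x:.20f}")  # more digits of what is actually stored:
                    # 0.10000000000000000555
```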