Article ID: 125056
This article was previously published under Q125056
Precision, rounding, and accuracy in floating-point calculations can combine to produce results that surprise the programmer. Four general rules should be followed:
In general, the rules described above apply to all languages, including C, C++, and assembler. The samples below demonstrate some of the rules using FORTRAN PowerStation. All of the samples were compiled using FORTRAN PowerStation 32 without any options, except for the last one, which is written in C.
Please refer to the FORTRAN manual(s) supplied with Microsoft FORTRAN for a description of numeric constants, and to article 36068 (https://support.microsoft.com/kb/36068/EN-US/) for a description of the internal representation of floating-point values.
SAMPLE 1
The first sample demonstrates two things:
The result of multiplying a single precision value by an accurate double precision value is nearly as bad as multiplying two single precision values. Both calculations have thousands of times as much error as multiplying two double precision values.
x = 1.100000000000000
y = 1.100000023841858
true = 1.320000000000000 (multiplying 2 double precision values)
y = 1.320000052452087 (multiplying a double and a single)
z = 1.320000081062318 (multiplying 2 single precision values)
SAMPLE 2
Sample 2 uses the quadratic equation. It demonstrates that even double precision calculations are not perfect, and that the result of a calculation should be tested before it is depended on if small errors can have drastic results. The input to the square root function in sample 2 is only very slightly negative, but it is still invalid. If the double precision calculations did not have slight errors, the result would be:
Root = -1.1500000000
Instead, it generates the following error:
run-time error M6201: MATH
- sqrt: DOMAIN error
SAMPLE 3
Sample 3 demonstrates that, because of optimizations that occur even if optimization is not turned on, values may temporarily retain a higher precision than expected, and that it is very unwise to test two floating-point values for equality.
In this example, two values are both equal and not equal. At the first IF, the value of Z is still on the coprocessor's stack and has the same precision as Y. Therefore X does not equal Y and the first message is printed out. At the time of the second IF, Z had to be loaded from memory and therefore had the same precision and value as X, and the second message also is printed.
SAMPLE 4
The first part of sample code 4 calculates the smallest possible difference between two numbers close to 1.0. It does this by adding a single bit to the binary representation of 1.0.
Some versions of FORTRAN round the numbers when displaying them so that the inherent numerical imprecision is not so obvious. This is why x and y look the same when displayed.
x = 1.00000000000000000 (one bit more than 1.0)
y = 1.00000000000000000 (exactly 1.0)
x-y = .00000000000000022 (smallest possible difference)
The second part of sample code 4 calculates the smallest possible difference between 2 numbers close to 10.0. Again, it does this by adding a single bit to the binary representation of 10.0. Notice that the difference between numbers near 10 is larger than the difference near 1. This demonstrates the general principle that the larger the absolute value of a number, the less precisely it can be stored in a given number of bits.
The binary representation of these numbers is also displayed to show that they do differ by only one bit.
x = 10.00000000000000000 (one bit more than 10.0)
y = 10.00000000000000000 (exactly 10.0)
x-y = .00000000000000178
x = 4024000000000001 Hex
y = 4024000000000000 Hex
The last part of sample code 4 shows that simple nonrepeating decimal values often can be represented in binary only by a repeating fraction. In this case x=1.05, which requires a repeating factor CCCCCCCC....(Hex) in the mantissa. In FORTRAN, the last digit "C" is rounded up to "D" in order to maintain the highest possible accuracy:
x = 3FF0CCCCCCCCCCCD (Hex representation of 1.05D0)
Even after rounding, the result is not perfectly accurate. There is some error after the least significant digit, which we can see by removing the first digit.
x-1 = .05000000000000004
SAMPLE 5
In C, floating constants are doubles by default. Use an "f" suffix to indicate a float value, as in "89.95f".
Article ID: 125056 - Last Review: February 24, 2005 - Revision: 2.1