INFO: Precision and Accuracy in Floating-Point Calculations

This article was previously published under Q125056
SUMMARY
There are many situations in which precision, rounding, and accuracy in floating-point calculations can work to generate results that are surprising to the programmer. There are four general rules that should be followed:
  1. In a calculation involving both single and double precision, the result will not usually be any more accurate than single precision. If double precision is required, be certain all terms in the calculation, including constants, are specified in double precision.
  2. Never assume that a simple numeric value is accurately represented in the computer. Most decimal fractions can't be represented exactly as a finite binary value. For example, .1 is .0001100110011... in binary (it repeats forever), so it can't be represented with complete accuracy on a computer using binary arithmetic, which includes all PCs.
  3. Never assume that the result is accurate to the last decimal place. There are always small differences between the "true" answer and what can be calculated with the finite precision of any floating point processing unit.
  4. Never compare two floating-point values to see if they are equal or not equal. This is a corollary to rule 3. There are almost always going to be small differences between numbers that "should" be equal. Instead, always check to see if the numbers are nearly equal. In other words, check to see if the difference between them is very small or insignificant.
MORE INFORMATION
In general, the rules described above apply to all languages, including C, C++, and assembler. The samples below demonstrate some of the rules using FORTRAN PowerStation. All of the samples were compiled using FORTRAN PowerStation 32 without any options, except for the last one, which is written in C.

Please refer to the FORTRAN manual(s) supplied with Microsoft FORTRAN for a description of numeric constants, and article 36068 for a description of the internal representation of floating-point values.

SAMPLE 1

The first sample demonstrates two things:

  • That FORTRAN constants are single precision by default (C constants are double precision by default).
  • Calculations that contain any single precision terms are not much more accurate than calculations in which all terms are single precision.
After being initialized with 1.1 (a single precision constant), y is as inaccurate as a single precision variable.
   x = 1.100000000000000  y = 1.100000023841858
The result of multiplying a single precision value by an accurate double precision value is nearly as bad as multiplying two single precision values. Both calculations have thousands of times as much error as multiplying two double precision values.
   true = 1.320000000000000 (multiplying 2 double precision values)
   y    = 1.320000052452087 (multiplying a double and a single)
   z    = 1.320000081062318 (multiplying 2 single precision values)

Sample Code

C Compile options: none

      real*8 x,y,z
      x = 1.1D0
      y = 1.1
      print *, 'x =',x, 'y =', y
      y = 1.2 * x
      z = 1.2 * 1.1
      print *, x, y, z
      end

SAMPLE 2

Sample 2 uses the quadratic equation. It demonstrates that even double precision calculations are not perfect, and that the result of a calculation should be tested before it is depended on if small errors can have drastic results. The input to the square root function in sample 2 is only very slightly negative, but it is still invalid. If the double precision calculations did not have slight errors, the result would be:
   Root =   -1.1500000000
Instead, it generates the following error:
run-time error M6201: MATH
- sqrt: DOMAIN error

Sample Code

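C Compile options: none
C With c = 1.3225, b**2 - 4*a*c is zero in exact arithmetic,
C which gives the root -1.15 described above.

      real*8 a,b,c,x,y
      a=1.0D0
      b=2.3D0
      c=1.3225D0
      x = b**2
      y = 4*a*c
      print *,x,y,x-y
      print "(' Root =',F16.10)",(-b+dsqrt(x-y))/(2*a)
      end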

SAMPLE 3

Sample 3 demonstrates that due to optimizations that occur even if optimization is not turned on, values may temporarily retain a higher precision than expected, and that it is very unwise to test two floating-point values for equality.

In this example, two values are both equal and not equal. At the first IF, the value of Z is still on the coprocessor's stack and has the same precision as Y. Therefore X does not equal Z and the first message is printed out. At the time of the second IF, Z had to be loaded from memory and therefore had the same precision and value as X, and the second message is also printed.

Sample Code

C Compile options: none

      real*8 y
      y=27.1024D0
      x=27.1024
      z=y
      if (x.ne.z) then
        print *,'X does not equal Z'
      end if
      if (x.eq.z) then
        print *,'X equals Z'
      end if
      end

SAMPLE 4

The first part of sample code 4 calculates the smallest possible difference between two numbers close to 1.0. It does this by adding a single bit to the binary representation of 1.0.
   x   = 1.00000000000000000  (one bit more than 1.0)
   y   = 1.00000000000000000  (exactly 1.0)
   x-y =  .00000000000000022  (smallest possible difference)
Some versions of FORTRAN round the numbers when displaying them so that the inherent numerical imprecision is not so obvious. This is why x and y look the same when displayed.

The second part of sample code 4 calculates the smallest possible difference between 2 numbers close to 10.0. Again, it does this by adding a single bit to the binary representation of 10.0. Notice that the difference between numbers near 10 is larger than the difference near 1. This demonstrates the general principle that the larger the absolute value of a number, the less precisely it can be stored in a given number of bits.
   x   = 10.00000000000000000  (one bit more than 10.0)
   y   = 10.00000000000000000  (exactly 10.0)
   x-y =   .00000000000000178
The binary representation of these numbers is also displayed to show that they do differ by only one bit.
   x = 4024000000000001 Hex
   y = 4024000000000000 Hex
The last part of sample code 4 shows that simple nonrepeating decimal values often can be represented in binary only by a repeating fraction. In this case x=1.05, which requires a repeating factor CCCCCCCC....(Hex) in the mantissa. In FORTRAN, the last digit "C" is rounded up to "D" in order to maintain the highest possible accuracy:
   x = 3FF0CCCCCCCCCCCD (Hex representation of 1.05D0)
Even after rounding, the result is not perfectly accurate. There is some error after the least significant digit, which we can see by removing the first digit.
   x-1 = .05000000000000004

Sample Code

C Compile options: none

      IMPLICIT real*8 (A-Z)
      integer*4 i(2)
      real*8 x,y
      equivalence (i(1),x)

      x=1.
      y=x
      i(1)=i(1)+1
      print "(1x,'x  =',F20.17,'  y=',f20.17)", x,y
      print "(1x,'x-y=',F20.17)", x-y
      print *

      x=10.
      y=x
      i(1)=i(1)+1
      print "(1x,'x  =',F20.17,'  y=',f20.17)", x,y
      print "(1x,'x-y=',F20.17)", x-y
      print *
      print "(1x,'x  =',Z16,' Hex  y=',Z16,' Hex')", x,y
      print *

      x=1.05D0
      print "(1x,'x  =',F20.17)", x
      print "(1x,'x  =',Z16,' Hex')", x
      x=x-1
      print "(1x,'x-1=',F20.17)", x
      print *
      end

SAMPLE 5

In C, floating constants are doubles by default. Use an "f" to indicate a float value, as in "89.95f".
   /* Compile options needed: none
   */

   #include <stdio.h>

   int main(void)
   {
      float floatvar;
      double doublevar;

   /* Print double constant. */
      printf("89.95 = %f\n", 89.95);      // 89.95 = 89.950000

   /* Print float constant. */
      printf("89.95 = %f\n", 89.95F);     // 89.95 = 89.949997

   /*** Use double constant. ***/
      floatvar = 89.95;
      doublevar = 89.95;

      printf("89.95 = %f\n", floatvar);   // 89.95 = 89.949997
      printf("89.95 = %lf\n", doublevar); // 89.95 = 89.950000

   /*** Use float constant. ***/
      floatvar = 89.95f;
      doublevar = 89.95f;

      printf("89.95 = %f\n", floatvar);   // 89.95 = 89.949997
      printf("89.95 = %lf\n", doublevar); // 89.95 = 89.949997

      return 0;
   }
Properties

Article ID: 125056 - Last Review: 02/24/2005 16:05:00 - Revision: 2.1

  • Microsoft FORTRAN PowerStation 1.0 Standard Edition
  • Microsoft Fortran PowerStation 1.0a for MS-DOS
  • Microsoft FORTRAN PowerStation 32
  • Microsoft Visual C++ 1.0 Professional Edition
  • Microsoft Visual C++ 1.5 Professional Edition
  • Microsoft Visual C++ 1.51
  • Microsoft Visual C++ 2.0 Professional Edition
  • Microsoft Visual C++ 4.0 Standard Edition
  • Microsoft Visual C++ 5.0 Enterprise Edition
  • Microsoft Visual C++ 6.0 Enterprise Edition
  • Microsoft Visual C++ 5.0 Professional Edition
  • Microsoft Visual C++ 6.0 Professional Edition
  • Microsoft Visual C++, 32-bit Learning Edition 6.0
  • kbinfo kblangfortran kblangc kbcode KB125056