You are currently offline, waiting for your internet to reconnect

- LINEST and related functions.
- NORMSDIST and related functions.
- Functions that involve sums of squares (for example, VAR and STDEV).
- Continuous distribution inverse functions (for example, CHIINV, FINV, NORMSINV, and TINV).
- Functions for discrete distributions (for example, BINOMDIST and POISSON).
- Random number generator (RAND).

Because improvements were made for Excel 2003, you may inevitably wonder about the risk of (and magnitude of) inaccuracies in statistical functions in the earlier versions of Excel. These issues will be discussed briefly in each of the six sections. The six areas are ranked in this author's perception of their importance. While most users do not have to be the least bit alarmed by performance in Microsoft Excel 2002 and earlier, problems with LINEST (area 1) are much more likely to surface than problems with BINOMDIST (area five) or RAND (area six).

There is one shortcoming in Excel 2002 and earlier that was identified and not fixed for Excel 2003. The prototypical application of the Analysis ToolPak's tool for t-Test: Paired Two Sample for Means is one where data includes measurements on the same set of subjects before and after a treatment (for example, weights before and after a diet program). This tool will not compute if there are unequal numbers of missing before and after observations; it will compute incorrect answers if there are missing observations and the number of missing before and after observations are equal. Unless you can guarantee that there are no missing observations, the TTEST function in Excel is preferred (because it handles missing observations correctly).

For more information, click the following article number to view the article in the Microsoft Knowledge Base:

829252 You may obtain incorrect results and misleading labels when you use the Analysis ToolPak t-Test in Excel 2003

Two primary references that point out numeric inaccuracies in Microsoft Excel 97 are papers by Knusel (see note 1) for areas two, four, and five and McCullough and Wilson (see note two) for areas one, three, and six. Improvements to the statistical functions in Excel in Microsoft Excel 2000 and Microsoft Excel 2002 were relatively minor.For more information, click the following article number to view the article in the Microsoft Knowledge Base:

828533 Description of the LINEST function in Excel 2003 and in Excel 2004 for Mac

- Knusel, L. On the Accuracy of Statistical Distributions in MicrosoftExcel 97, Computational Statistics and Data Analysis, 26, 375-377, 1998.
- McCullough, B.D. & B. Wilson, On the accuracy of statistical proceduresin Microsoft Excel 97, Computational Statistics and Data Analysis, 31, 27-37, 1999.

Improving LINEST has a good effect on functions that call it: LOGEST, TREND, and GROWTH in addition to the Linear Regression tool in the Analysis ToolPak. The charting tool in Excel allows you to fit a straight line to a plot of data points by using a linear regression tool. Unfortunately, linear regression capabilities in the charting tool have not been upgraded in Excel 2003.

In summary, if you use a version of Excel that is earlier than Excel 2003, you must reject detailed LINEST results in the last three rows of the five row output table in LINEST if LINEST was called with its third argument set to FALSE. Also be aware of the chance of collinearity; for most data sets, collinearity should not be a problem.

Excel 2003 code yields at least fifteen decimal point accuracy for any z. This means fifteen significant digits for z = 1.2 and (because of leading 0's) ten significant digits for z = 4. Improving the accuracy of NORMSDIST has the useful side effect of also improving the following functions that depend on it: CONFIDENCE, LOGINV, LOGNORMDIST, NORMDIST, NORMINV, NORMSINV and ZTEST. Code for CONFIDENCE, LOGNORMDIST, NORMDIST, and ZTEST has not been revised; accuracy of each of them is improved because each essentially calls NORMSDIST one or more times and uses the results of these calls in its computations. Improved accuracy of NORMSDIST also benefited LOGINV, NORMINV, and NORMSINV. Because these are continuous distribution inverse functions (see area four below), they also benefited from search refinements that were associated with such functions. The search refinements in area four discussed for other inverse functions were implemented for these three inverse functions in Excel 2002.

In summary, if you use Excel 2002 and earlier, you should be satisfied with NORMSDIST. However, if you must have highly accurate NORMSDIST(z) values for z far from 0 (such as |z| greater than or equal to four), Excel 2003 might be required. NORMSDIST(-4) = 0.0000316712; earlier versions would be accurate only as far as 0.0000317. You can expect that "about 0.00003" or "about 3 chances in 100,000" is likely to be an accurate enough answer for many users and earlier versions carry this to two more decimal places.

Sum of squares of observations – ((sum of observations)^2) / number of observations

The alternative procedure counts the number of observations and the sum of the observations on the first pass. From this it can calculate the mean as sum of observations divided by number of observations. On the second pass, the procedure calculates:Sum of squared deviations of individual observations from the mean

With either computational approach, VAR is computed by dividing the result by number of observations – 1; VARP is computed by dividing the result by number of observations.With infinitely precise arithmetic, both procedures yield the same results. However, because of the finite precision of Excel, the calculator formula is more prone to round off errors. In texts on statistical computing, the calculator formula is generally presented as an example of how not to compute variance. With the worksheet that accompanies the article about VAR, you can experiment to judge the extent that round off errors in earlier versions of Excel are likely to pose problems. You will see that problems occur when there are many significant digits in the data but very little difference between values. You can expect that for most users, such round off errors are not likely to be troubling in practice. However, revising various functions to substitute the two-pass procedure for the calculator formula was worthwhile because it replaced an outdated and discredited procedure by an appropriate one. It also affected many functions.

DEVSQ also computes the sum of squared deviations about a mean. DEVSQ has always been implemented by using the two-pass procedure. Therefore, another alternative for computing VAR(data) is to compute DEVSQ(data) / (COUNT(data) – 1). This would give the same results in all Excel versions. If you are using Excel 2002 or earlier, you can compute both VAR and DEVSQ(data) / (COUNT(data) – 1) to see how much their results differ. The difference is essentially the round off error in VAR. Other functions that require a sum of squared deviations about a mean and that have always used the two-pass procedure are CORREL and COVAR. PEARSON and CORREL both compute the Pearson Product-Moment Correlation Coefficient. Both yield the same results in Excel 2003. In earlier versions of Excel, PEARSON is implemented with the one pass algorithm.

Many functions involve sums of squared deviations about a mean. The two-pass procedure has been implemented for each of the following: VAR, VARA, VARP, VARPA, STDEV, STDEVA, STDEVP, STDEVPA, PEARSON, SLOPE, and STEYX. Other functions that have been improved because they essentially call one of the functions in the previous list include: FORECAST, INTERCEPT, RSQ, TTEST, and ZTEST (when user omits standard deviation from arguments). In the Analysis ToolPak, each of the three ANOVA tools has been improved by replacing the calculator formula by the two-pass procedure. There are also versions of VAR, VARP, STDEV, and STDEVP that are used in pivot tables and data consolidation. There is a single article on the functions DVAR, DVARP, DSTDEV, and DSTDEVP.

For more information, click the following article number to view the article in the Microsoft Knowledge Base:

828125 Excel statistical functions: DVAR, DVARP, DSTDEV, and DSTDEVP

All of these have also been upgraded.

In summary, if you use an earlier version of Excel, you must be concerned about round-off errors in cases where data contains many significant digits but the variance is small. The article about VAR presents contrived examples of this; you can expect that these round-off problems do not naturally occur frequently in real data.

For more information, click the following article number to view the article in the Microsoft Knowledge Base:

826112 Excel statistical functions: VAR

For Excel 2002, the binary search process was improved in one case, NORMSINV. This improvement also affects NORMINV. NORMSINV and NORMINV are by far the most frequently used inverse functions. The nature of the improvement was to refine the search so that it continued until the result was accurate to the limit of Excel's precision, instead of stopping earlier if the probability associated with the value found was within 3 * 10^(-7) of the correct probability. While this search process refinement improved NORMSINV, Excel 2002's version still required a more accurate NORMSDIST.

For Excel 2003, the improved binary search used for NORMSINV for Excel 2002, was implemented for each of the other "INV" functions. Because NORMSDIST was also improved (area two), the version of NORMSINV in Excel 2003 is better than the version in Excel 2002 and better yet than that in Excel 2000 and earlier versions of Excel.

In summary, these functions will behave well enough in earlier versions of Excel assuming that you call them with a probability value (for example, p in NORMSINV(p)) that is not too close to 0 or 1 (such as within about 10^(-6)).

These problems have been fixed. You can view them as relatively unimportant because they occur only for unlikely values of input parameters. For example, BINOMDIST will return accurate answers in Excel 2002 and earlier as long as the number of trials is fewer than 1,030.

All five of these functions were fixed in the same way: continue to use existing code in cases where inputs to the functions guaranteed no computational problems (such as fewer than 1,030 trials for BINOMDIST); switch to an alternative plan in problematical cases. For each of the five functions, this method uses a process that permits evaluation of the probability that you want without having to evaluate a combinatorial coefficient like COMBIN(1030, 515) or a factorial of a large number like FACT(200). In each case, the alternative plan has the same general approach.

For more information, click the following article numbers to view the articles in the Microsoft Knowledge Base:

827459 Excel statistical functions: BINOMDIST

828117 Excel statistical functions: CRITBINOM

828515 Excel statistical functions: HYPGEOMDIST

828361 Excel statistical functions: NEGBINOMDIST

828130 Excel statistical functions: POISSON

In summary, users of earlier versions of Excel should not be troubled in practice by numeric deficiencies because input values that cause #NUM! results are extreme. Additionally, you can be assured that if you receive a result other than an error message, the result is accurate. Round-off error is not a problem here.

The chance of a serious practical affect on your random data by the random number generator in Excel 2002 and earlier is minimal. For example, you must have a lengthy sequence of random numbers (such as 1 million) before the repetitive behavior would have a serious affect on your results.

Still, because the replacement algorithm was easily implemented, it was worthwhile to try to make improvements.

In summary, if you use an earlier version of Excel, do not be concerned with the quality of pseudo-random numbers generated by RAND unless you use many random numbers.

This article has warned users of Excel 2002 and earlier to avoid using results in the last three rows of the output table for LINEST(known_y's, known_x's, FALSE, TRUE). Users of all versions should not use the Analysis ToolPak's tool for t-Test: Paired Two Sample for Means if there are missing data. These two cases return incorrect results regardless of data values.

In other cases, the extent that numeric inaccuracies in Excel 2002 and earlier will affect users is difficult to measure. For functions where you supply numeric arguments, degree of inaccuracy typically depends on the value of these arguments (for example, BINOMDIST with 1,500 trials or NORMSDIST(-7)). In these cases such arguments typically have to be "extreme" before there is a risk of serious numeric problems. For functions where you supply one or more data ranges, such as any function involving sums of squares (VAR, STDEV, SLOPE), data values also have to be "extreme" in a sense with many significant digits and small variation before round-off problems have an affect. For LINEST, you must be aware of the chance of collinearity.

Perhaps you will never notice a difference when a worksheet that is created in Excel 2002 or earlier is recalculated in Excel 2003. However, it was important to improve functions whose numerical accuracy was found not to be sufficient in tests that were designed to investigate the ability of the functions to handle inputs known by experts in the computational statistics community to stretch such functions to their limits. When the functions were originally added to Excel, nobody could anticipate future uses. For example, Six Sigma techniques were not in widespread use. Now, you would not want to return inaccurate values of NORMSDIST(-6) or NORMSDIST(6). Hopefully, numeric improvements for Excel 2003 will make these statistical functions appropriate for unanticipated future use.

Properties

Article ID: 828888 - Last Review: 11/15/2006 18:35:15 - Revision: 3.4

- Microsoft Office Excel 2003
- Microsoft Excel 2004 for Mac

- kbfunctions kbfuncstat kbinfo KB828888