This article describes the effect of numeric improvements in the statistical functions of Microsoft Office Excel 2003 and of later versions of Excel on Analysis ToolPak (ATP) tools. Most ATP tools call Excel statistical functions in the process of calculating results. In many cases, this article serves as a pointer to articles on the individual Excel statistical functions. Additionally, discussion of future improvements that are helpful is included for some ATP tools.
Microsoft Excel 2004 for Mac informationThe statistical functions in Excel 2004 for Mac were updated by using the same algorithms that were used to update the statistical functions in Microsoft Office Excel 2003 and in later versions of Excel. Any information in this article that describes how a function works or how a function was modified for Excel 2003 and for later versions of Excel also applies to Excel 2004 for Mac.
Code for the ATP has not been edited directly except to introduce improvements in the three ATP ANOVA tools.
For several ATP tools, numeric performance has been improved for Excel 2003 and for later versions of Excel because the tool calls an Excel statistical function that has been improved for Excel 2003 and for later versions of Excel. In cases where results are different for earlier versions of Excel and later versions of Excel, the values for Excel 2003 and for later versions of Excel are more accurate.
Most users will not notice a difference in results between different versions of Excel. This is because differences are generally caused by round-off errors that are significant only in extreme cases. However, this article must first point out one case where differences occur because of a wrong formula in Microsoft Excel 2002 and in earlier versions of Excel. Avoid the tool in those versions.
A second example involves an incorrect formula in Excel 2002 and in earlier versions of Excel that persists in Excel 2003 and in later versions of Excel. Avoid the ATP tool in this scenario for all versions of Excel.
First, avoid the Regression tool when you have to click to check the Constant is Zero
check box. This has been corrected in Excel 2003 and in later versions of Excel. You do not have to avoid the Regression tool when the Constant is Zero
check box is clear (the more typical case in practice).
Second, users of all versions of Excel should avoid the ATP t-Test: Paired Two Sample for Means tool unless you can guarantee that there are no missing data observations. The tool gives inappropriate answers (or no answers at all) if there are one or more missing observations.
For more information about the ATP Matched Pairs Two Sample t-Test tool, click the following article number to view the article in the Microsoft Knowledge Base:
You may obtain incorrect results and misleading labels when you use the Analysis ToolPak t-Test in Excel
If you want to use this tool, and if there is missing data (or if there is even the chance of missing data), the TTEST function in Excel will handle it correctly.
Separate sections are provided later in this article for individual ATP tools. Tools that are not listed have not been affected by improvements in Excel 2003 and in later versions of Excel.
ANOVA: Single Factor, Two-Factor with Replication, and Two-Factor without Replication
Each of these three ANOVA tools has been rewritten to upgrade the computational procedure to a two-pass algorithm that is more numerically robust. These improvements are similar to improvements in statistical functions that compute sums of squared deviations about a mean (for example: VAR, STDEV, SLOPE, PEARSON).
For more information about ATP ANOVA, click the following article number to view the article in the Microsoft Knowledge Base:
Description of numeric improvement in Analysis ToolPak ANOVA tools in Excel
This tool has not been changed. However, there is a small difference between the Correlation tool and the Covariance tool that persists in all versions of Excel. The Correlation tool returns a lower triangular correlation table with 1's on the diagonal and correlations off the diagonal. The tool uses CORREL to compute off-diagonal entries and fills those entries with the value that is returned by CORREL. (Therefore, if any data entry changes, no entry in the table changes. Contrast this behavior with the behavior of Covariance.)
This tool returns a lower triangular covariance table with variances on the diagonal and covariances off the diagonal. Cells on the diagonal contain a formula "=VARP(...)" so that if a data entry changes, the result in the table also changes. VARP has been improved for Office Excel 2003 and for later versions of Excel.
For more information about VARP, click the following article number to view the article in the Microsoft Knowledge Base:
Excel statistical functions: VARP
The Covariance tool uses COVAR to compute off-diagonal entries and fills those entries with the value that is returned by COVAR. Therefore if a data entry changes, the off-diagonal entries do not change.
This tool calls Excel statistical functions for everything that it computes. Because VAR and STDEV are improved for Excel 2003 and for later versions of Excel, different values are possible because of round-off errors in extreme cases.
For more information about VAR, click the following article number to view the article in the Microsoft Knowledge Base:
Excel statistical functions: VAR
F-Test Two-Sample for Variances
Like the Descriptive Statistics tool, this tool calls VAR. Again, different values are possible because of round-off errors in extreme cases.
Random Number Generation
This tool fills a range with random observations. Values of these observations are put directly in the cells so that these cell values are not recomputed and replaced with fresh observations when the sheet is recalculated. The built-in RAND function in Excel, on the other hand, replaces existing random numbers with new ones every time the sheet is recalculated. You can use RAND to preserve values. To do so, copy the results in a range, and then use the Paste Special
command to paste values into the same range.
The Random Number Generation (RNG) tool also produces random observations from various probability distributions whereas RAND corresponds to the single option in the tool: Uniform with range between 0 and 1. This article describes how to combine RAND with the statistical functions in Excel to generate such observations.
Therefore, in terms of functionality, you can emulate the ATP random number tool by using RAND and some ingenuity. Sometimes this is useful, particularly when many random numbers are wanted.
For Excel 2002 and earlier, both the ATP random number generator and RAND were known to perform poorly on standard tests of randomness. Performance was poor because the length of a cycle before the sequence of pseudo-random numbers starts repeating was too short. This is an issue only when many random numbers are required.
RAND has been improved for Excel 2003 and for later versions of Excel so that RAND now passes all such standard tests. RAND's sequence of random numbers will start repeating itself after more than 1 trillion numbers are generated.
For more information about RAND, click the following article number to view the article in the Microsoft Knowledge Base:
Description of the RAND function in Excel
However, the ATP's separate random number generator was not upgraded. As in the version of RAND that is in Excel 2002 and in earlier versions of Excel, the ATP's separate random number generator is known to perform poorly on standard tests of randomness and to have a short repetition cycle. This has negative implications only if you require a very lengthy sequence of random numbers (for example, 1 million).
The RNG tool provides random observations from various probability distributions and the Uniform[0,1], the distribution used for outputting random numbers through RAND. The ATP tool first draws a Uniform[0,1] random number (or more than one such number) and then converts the answer into an observation from one of the following specific distributions. For the benefit of those who prefer RAND because they are going to generate many observations, this article suggests formulas that use RAND in the table below. Following the table, there are some cautionary comments about the Analysis ToolPak's normal distribution case.
|Distribution||Excel formula using RAND()|
|Bernoulli(p)||=IF(RAND() <= p, 1, 0)|
|Binomial(n,p)||=CRITBINOM(n, p, RAND())|
|Normal(mu, sigma)||=NORMINV(RAND(), mu, sigma)|
|Patterned||Not really random|
|Uniform(low, high)||= low + (high – low) * RAND()|
There are two reasons that you might prefer RAND and the formula in this table instead of the ATP random number tool in the Normal(mu, sigma) case. First, RAND is a better Uniform[0,1] random number generator than ATP's. Second, the ATP's tool does not call Excel's NORMINV function, but instead has its own built-in version of the inverse normal distribution. This is not as accurate as the version of NORMINV that is in Excel 2003 and in later versions of Excel. It is inferior both in the accuracy of the normal distribution approximation that it uses (Excel uses the much improved NORMSDIST function) and in the refinement of the binary search (Excel carries it much further to guarantee a value that is closer to NORMINV's probability argument). In a nutshell, using ATP in this case does not take advantage of improvements for Excel 2003 and for later versions of Excel in the NORMINV, NORMSDIST, and RAND functions.
For observations from a discrete distribution, assume that the values are in column B and that their probabilities are in column C. One would then want to fill each row of column A with the probability of observing a value strictly less than the value in column B in that row. Assuming that there were 10 values, assume that this data is in cells A1:C10. Then, because A1 contains the probability of observing a value strictly less than the first value, it must be set to 0. You can use VLOOKUP(RAND(), A1:C10, 2); the fourth argument to VLOOKUP is optional and must be omitted or set to TRUE. The "2" means that you want to return the value in the second column (column B in this example).
The ATP uses an adaptation of the method of generating Poisson observations in Press, W.H., S.A. Teukolsky, W. T. Vetterling, and B.P. Flannery, Numerical Recipes in C, The Art of Scientific Computing
, 2nd ed., Cambridge University Press, 1992, pp. 293-295. There are two methods to take easy advantage of existing Excel functions.
The first uses the observation that a POISSON random variable with mean m has a distribution that is well approximated by a BINOMIAL(n, m/n) for large n. You can then call CRITBINOM(n, m/n, RAND()). Choice of n depends on m; n greater than 1,000 times m should be large enough.
The second relates the POISSON distribution to the Exponential. If events occur according to a POISSON process at rate m per unit time, the time between events has an Exponential distribution with mean 1/m. For a POISSON observation you can take a sequence of observations from this Exponential distribution and count how many of them occur before their sum exceeds 1. To obtain an observation from this Exponential distribution, use GAMMAINV(RAND(), 1, 1/m). This method would be suitable when m is relatively close to 0.
The Regression tool calls Excel's LINEST. The article on LINEST describes extensive improvements for Excel 2003 and for later versions of Excel.
For more information about LINEST, click the following article number to view the article in the Microsoft Knowledge Base:
Description of the LINEST function in Excel
If you use Excel 2002 or an earlier version of Excel, note the same two shortcomings of the ATP Regression tool as of LINEST:
- Regression Sum of Squares, r squared, and f statistic values are always incorrect for the case where the regression is forced through the origin.
For LINEST, this means "third argument set to FALSE instead set to TRUE or omitted." For the ATP tool, it means "Constant is Zero check box is checked."
- LINEST and the ATP tool are insensitive to collinearity issues. The article on LINEST discusses the computational approach to LINEST in Excel 2003 and in later versions of Excel that is designed to find collinearity or near-collinearity when it exists and to act appropriately.
Both of these LINEST shortcomings have been overcome in Excel 2003 and in later versions of Excel. Performance of the ATP Regression tool will be similarly improved. There were no changes made to the tool's code; it has been improved by calling an improved Excel function. This author considers the improvement in LINEST to be the most important of the statistical function improvements.
The following table shows the Regression tool's output for earlier versions of Excel and for later versions of Excel with the Constant is Zero
check box selected. It illustrates the first shortcoming that is mentioned earlier. In earlier versions of Excel, the Regression sum of squares is negative as is the R Square value.
|Excel 2002 and earlier versions|
|Adjusted R Square||-20.9285714|
|Excel 2003 and later versions of Excel version|
|Adjusted R Square||0.401250823|
t-Test: Paired Two Sample for Means
As mentioned earlier, avoid this tool if there is any chance of one or more missing data values. The prototypical application of this test is an experiment with measurements on subjects Before and After a treatment (such as weights Before and After a 60-day diet plan). If there are no missing observations, the tool will behave fine. If there are different numbers of missing Before and After observations, you receive an error message and the tool will not compute anything. If there are missing observations and the numbers of missing Before and After observations are equal, the tool will return answers that contain several errors.
Standard procedure is to remove a subject from the data if either the Before or After measurement is missing and to analyze the data that contains only those subjects that have both Before and After measurements. Excel's TTEST function handles missing data according to this standard procedure.
The other two t-Test tools, Two-Sample Assuming Equal Variances and Two-Sample Assuming Unequal Variances, do not share this defect.
z-Test: Two Sample for Means
This article noted that the normal distribution case of the random number generation tool does not call the NORMSINV function (or more precisely, NORMINV that calls NORMSINV) but has its own inferior procedure for finding normal inverse values.
The z-Test tool does call the NORMSINV function and takes advantage of improvements for Excel 2003 and for later versions of Excel.
Results in earlier versions of Excel
There are ATP tools whose performance has been improved for Excel 2003 and for later versions of Excel because they call Excel statistical functions that have been improved for Excel 2003 and for later versions of Excel. One of these improvements to LINEST, when its third argument is set to FALSE, implies that the ATP regression tool returns incorrect results in Excel 2002 and in earlier versions of Excel when the Constant is Zero
check box is selected. In other cases where Excel functions have been improved, users of earlier versions are not likely to notice differences (most of these differences involve round-off errors in extreme situations).
The three ATP ANOVA tools have been improved by editing ATP code to substitute a more numerically robust algorithm (in the same spirit as the improvement in Excel's VAR). Users of these tools in earlier versions of Excel are likely to notice differences only in extreme situations.
Warning for users of all versions: avoid the t-Test: Paired Two Sample for Means if there is the slightest chance of missing data.
Results in Excel 2003 and in later versions of Excel
Significant improvements have been made in Excel's statistical functions. This translates into improvements in many ATP tools that call these functions. There is one ATP tool, the random number generator, that does not take advantage of an improved RAND function (because it is implemented in a self-contained manner and does not call RAND). This is unfortunate, but even more unfortunate is the special case of normally distributed random observations. The inverse normal distribution is also implemented in a self-contained manner and does not call the much improved NORMSINV function.
The following table lists ATP tools and those Excel functions that they call that have been improved for Excel 2003 and for later versions of Excel. Readers are referred to separate articles on each Excel function that is called.
|ATP tool||Excel functions that are called|
|ANOVA: Single Factor||VAR, FINV|
|ANOVA: Two-Factor With Replication||VAR, FINV|
|ANOVA: Two-Factor Without Replication||VAR, FINV|
|Descriptive Statistics||STDEV, TINV, VAR|
|F-Test Two-Sample for Variances||VAR, FINV|
|Random Number Generation|
|Rank and Percentile|
|t-Test: Paired Two Sample for Means||VAR, PEARSON, TINV|
|t-Test: Two-Sample Assuming Equal Variances||VAR, TINV|
|t-Test: Two-Sample Assuming Unequal Variances||VAR, TINV|
|z-Test: Two Sample for Means||NORMSDIST, NORMSINV|
For all functions that appear in this table other than LINEST and RAND, you are likely to see differences between earlier versions of Excel and later versions of Excel only because of round-off errors in extreme situations. LINEST has been much improved, as discussed earlier. RAND has been improved also. Curiously, the Sampling tool calls RAND, but the RNG tool relies on a self-contained generator that will give inferior performance when a lengthy sequence of random observations is required.
Except for changes to ATP code for each of the three ANOVA tools, ATP code has not been rewritten. However, it benefits from calling improved Excel functions as shown in the ATP Tool table. Flaws in the t-Test: Paired Two Sample for Means test have not been fixed for Excel 2003 or for later versions of Excel. The most noticeable improvements are likely to be in the Regression tool where LINEST no longer returns incorrect results when the Constant is Zero
check box is selected and where LINEST is designed to handle collinearity appropriately.