Description of the LINEST function in Excel 2003 and in later versions of Excel
Article ID: 828533

SUMMARY
The purpose of this article is to describe the LINEST function in Microsoft Office Excel 2003 and in later versions of Excel, to illustrate how the LINEST function is used, and to compare the results of the LINEST function in Excel 2003 and in later versions of Excel with the results of the LINEST function in earlier versions of Excel.

Microsoft has made extensive changes to the LINEST function to correct incorrect formulas that are used when the regression line must go through the origin. The changes also pay more attention to issues that involve collinear predictor variables. Because of these extensive improvements, this article focuses more on the improvements and less on instructing users about how to use LINEST.

Microsoft Excel 2004 for Mac information
The statistical functions in Excel 2004 for Mac were updated by using the same algorithms that were used to update the statistical functions in Microsoft Office Excel 2003 and in later versions of Excel. Any information in this article that describes how a function works or how a function was modified for Excel 2003 and for later versions of Excel also applies to Excel 2004 for Mac.

MORE INFORMATION
The LINEST(known_y's, known_x's, intercept, statistics)
function is used to perform linear regression. A least squares criterion is used, and LINEST tries to find the best fit under that criterion. Known_y's represents data on the dependent variable, and known_x's represents data on one or more independent variables. The second argument is optional. If the second argument is omitted, it is assumed to be an array of the same size as known_y's that contains the values {1, 2, 3, and so on}.

The last argument is set to TRUE if you want additional statistics (various sums of squares, r-squared, f-statistic, or standard errors of the regression coefficients, for example). In this case, LINEST must be entered as an array formula. The last argument is optional; if it is omitted, it is interpreted as FALSE. The array's dimensions are five rows by a number of columns that is equal to the number of independent variables plus one if the third argument is set to TRUE (if the third argument is not set to TRUE, the number of columns is equal to the number of independent variables). Setting the third argument to FALSE in Microsoft Excel 2002 and in earlier versions of Excel requires a workaround. This workaround is discussed later in this article.

In the most common uses of LINEST, the argument intercept is set to TRUE. This setting means that you want the linear regression model to include the possibility of a non-zero intercept coefficient in its model. If known_x's is represented in data columns, setting intercept to TRUE tells LINEST to add a data column that is filled with 1s as data on an additional independent variable. The intercept argument should be set to FALSE only if you want to force the regression line to go through the origin. For Excel 2002 and for earlier versions of Excel, setting this argument to FALSE always returns results that are not correct, at least in the detailed statistics that are available from LINEST. This article discusses this issue and provides a workaround. This problem has been corrected in Excel 2003 and in later versions of Excel. The third argument is optional; if it is omitted, it is interpreted as TRUE.

For ease of exposition in the remainder of this article, assume that the data is arranged in columns, so that known_y's is a column of y data and known_x's is one or more columns of x data. The dimensions (or lengths) of each of these columns must be equal.
All the following observations are equally true if the data is not arranged in columns, but it is easier to discuss this single (most frequently used) case.

Another reason for setting the intercept argument to FALSE is if you have already explicitly modeled the intercept in the data by including a column of 1s. In Excel 2002 and in earlier versions of Excel, the best solution is to ignore the column of 1s and to call LINEST with this column missing from known_x's and with the intercept argument set to TRUE. Excel 2002 and earlier versions of Excel always return results that are not correct when the intercept argument is set to FALSE. For Excel 2003 and for later versions of Excel, this approach is also preferred, although the formulas have been corrected for Excel 2003 and for later versions of Excel.

The performance of LINEST in earlier versions of Excel (or more precisely, the performance of the Analysis ToolPak's linear regression tool that calls LINEST) has been justifiably criticized (see the "References" section in this article for more information). The main concern about Excel's linear regression tools is a lack of attention to issues of collinear (or nearly collinear) predictor variables. Tests that used datasets provided by the National Institute of Standards and Technology (NIST, formerly the National Bureau of Standards), which were designed to test the effectiveness of statistical software, found numeric inaccuracies in the areas of linear regression, analysis of variance, and non-linear regression. In Excel 2003 and in later versions of Excel, these problems have been addressed, except for non-linear regression, where the inaccuracies are caused by an issue with the Solver add-in instead of with the statistical functions or the Analysis ToolPak. The RAND function in Excel was also put through standard tests of randomness and produced subpar results. The RAND function has also been revised in Excel 2003 and in later versions of Excel.
In earlier versions of Excel, LINEST used the "Normal Equations" for finding regression coefficients. This method is less stable numerically than Singular Value Decomposition or QR Decomposition. Excel 2003 and later versions of Excel have implemented QR Decomposition. While this is a standard technique that is described in many texts, a small example is discussed in this article.

QR Decomposition effectively analyzes collinearity issues and excludes any data column from the final model if that column can be expressed as a sum of multiples of the included columns. Near collinearity is treated in the same way; a set of columns is nearly collinear if, when you try to express one data column as a sum of multiples of others, the resulting fit is extremely close. For example, the fit is extremely close if the sum of squared differences between the data column and the fitted values is less than 10^(-12).

The LINEST Help file has been updated in Excel 2003 and in later versions of Excel. In summary, the main changes are as follows:
Syntax
The most common usage of LINEST includes two ranges of cells that contain the data, such as LINEST(A1:A100, B1:F100, TRUE, TRUE). Because there is typically more than one predictor variable, the second argument in this example contains multiple columns. In this example, there are one hundred subjects, one dependent variable value (known_y's) for each subject, and five independent variable values (known_x's) for each subject.

Example of usage
Separate Excel worksheet examples are provided to illustrate different key concepts.

To illustrate a negative sum of squares in Excel with the third argument set to FALSE, follow these steps:
Cells D6:E11 show the LINEST output in Excel 2002 and in earlier versions of Excel. In these versions of Excel, LINEST computes the total sum of squares for the model that has the third argument set to FALSE as the sum of squared deviations of y-values about the y column mean. This value is shown in cell A13 and is an appropriate computation when the third argument is set to TRUE. However, when the third argument is set to FALSE, the correct total sum of squares is the sum of squares of the y-values and is shown in cell A17. Use of the wrong formula for total sum of squares leads to the negative regression sum of squares in cell A15. The correct output in Excel 2003 is shown in cells G6:H11. If you use an earlier version of Excel and if you want to force the best fit linear regression through the origin, you must compute some entries in the last three rows of the output array again. To do this, use the following workaround. Note You can refer to the previous worksheet.
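The total sum of squares issue described above can also be reproduced outside Excel. The following is a minimal Python sketch, not the worksheet workaround itself. The data values are hypothetical and are not the values from the article's worksheet. The sketch fits a line through the origin and then compares the regression sum of squares obtained from the two candidate formulas for the total sum of squares.

```python
# Hypothetical data: y = 10 + x, so a line forced through the origin fits poorly.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

# Least squares slope for a model with no intercept: b = sum(x*y) / sum(x*x).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# The residual sum of squares is the same whichever total is used.
ss_resid = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

# Total sum of squares about the mean (appropriate only when an intercept is fitted).
y_mean = sum(y) / len(y)
ss_total_about_mean = sum((yi - y_mean) ** 2 for yi in y)

# Total sum of squares about zero (appropriate when the line goes through the origin).
ss_total_about_zero = sum(yi * yi for yi in y)

# Regression SS = Total SS - Residual SS.
ss_reg_wrong = ss_total_about_mean - ss_resid
ss_reg_right = ss_total_about_zero - ss_resid

print("regression SS with total about the mean:", ss_reg_wrong)
print("regression SS with total about zero:   ", ss_reg_right)
```

With this data, the first value is negative and the second is positive, which mirrors the behavior that the article attributes to the incorrect formula.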
Predictor columns (known_x's) are collinear if at least one column, c, can be expressed as a sum of multiples of others (c1, c2, and perhaps additional columns). Column c is frequently called redundant because the information that it contains can be constructed from the columns c1, c2, and other columns. The fundamental principle in the presence of collinearity is that results should not be affected by whether a redundant column is included in the original data or removed from the original data. Because LINEST in Excel 2002 and in earlier versions of Excel did not look for collinearity, this principle was easily violated. Predictor columns are nearly collinear if at least one column, c, can be expressed as almost equal to a sum of multiples of others (c1, c2, and others). In this case, "almost equal" means a very small sum of squared deviations of entries in c from corresponding entries in the weighted sum of c1, c2, and other columns; "very small" might be less than 10^(-12), for example. To illustrate collinearity, follow these steps:
To verify that the results in your version coincide with the results in cells F8:I27 or in cells K8:N27, you can enter the following three array formulas:
The second model, in rows 15 to 20, uses columns B, C, and D as predictors but sets the third argument of LINEST to FALSE. Because the intercept was explicitly modeled through column D, you do not want Excel to separately model the intercept by building a second column of 1s. Again, collinearity is present because entries in column C in rows 2 to 6 are exactly equal to the sum of corresponding entries in columns B and D. Analyzing the presence of collinearity is not affected by the fact that column D is explicitly used in this model and a similar column of 1s is created internally by Excel in the first model. In this case, values are computed for the LINEST output table, but some of the values are not appropriate. Any version of Excel can handle the third model (in rows 22 to 27). There is no collinearity, and Excel models the intercept, thereby avoiding the model with the third argument set to FALSE (that uses the incorrect formulas to compute some statistics in versions of Excel earlier than Excel 2003). This example is included in this article for the following reasons:
The second model in rows 15 to 20 sets the third argument of LINEST to FALSE. The entries in cells N16:N17 are Excel's standard way of conveying this information. Entries in cells K16:K17 show that LINEST removed one column (column D) from the model. Coefficients in columns L and M are for data columns C and B, respectively.

In the third model, in rows 22 to 27, no collinearity is present and no columns are removed. The predicted y values are the same in all three models because explicitly modeling an intercept (as in the second model) provides exactly the same modeling capability as implicitly modeling it in Excel internally (as in the first model and the third model). Also, removing a redundant column that is a sum of multiples of others (as in the first model and the second model) does not reduce the goodness of fit of the resulting model. Such columns are removed precisely because they add no value in trying to find the best least squares fit.

The following example is a final example of collinearity. The data in this example is also used in the QR Decomposition example in this article. To illustrate the final example of collinearity, follow these steps:
All versions of Excel provide the same goodness of fit as measured by cell B18 and cell B25. However, Excel 2002 provides all zeros as the values for the standard errors of the regression coefficients. The entries for df in cell B17 and cell B24 differ. The f-statistics in cell A17 and cell A24 also differ. The df for Excel 2003 is correct for a model with two predictor columns, exactly what the model uses (Excel's built-in intercept column and X1). The df for Excel 2002 is appropriate for three predictor columns. However, because of collinearity, there are only two predictor columns. There are only two predictor columns because after you have used any two of the three columns, expanding the model to use the third column has no value added. Therefore, because of collinearity, the entry in cell B17 is not correct and the entry in cell B24 is correct. The incorrect value of df affects statistics that depend on df: the f ratios in cell A17 and cell A24 and the standard error of y in cell B16 and cell B23. Entries in cell A17 and cell B16 are not correct; the entries in cell A24 and cell B23 are correct.

The following example illustrates the QR Decomposition algorithm. It has two primary advantages over the algorithm that uses the "Normal Equations." First, results are more stable numerically. When collinearity is not an issue, results are typically accurate to more decimal places with QR Decomposition. Second, QR Decomposition appropriately handles collinearity. It can be thought of as "processing" columns one at a time, and it does not process columns that are linearly dependent on previously processed columns. The previous algorithm does not correctly handle collinearity. If collinearity is present, the results from the previous algorithm are frequently distorted, sometimes to the point of returning #NUM!.
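One way to see why the "Normal Equations" approach struggles with collinear columns is to look at the matrix that it has to invert. The following Python sketch uses hypothetical integer data (not the worksheet data) in which the third column is the sum of the first two. The helper names gram and det are illustrative only. The matrix of column cross-products is then singular, so the Normal Equations have no unique solution.

```python
def gram(cols):
    """Matrix of cross-products X^T X, where X has the given columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def det(m):
    """Determinant by cofactor expansion (adequate for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

ones = [1, 1, 1, 1]                        # intercept column
x1 = [1, 2, 3, 4]                          # hypothetical predictor
x2 = [a + b for a, b in zip(ones, x1)]     # redundant: x2 = ones + x1

det_two_columns = det(gram([ones, x1]))        # non-zero: solvable
det_three_columns = det(gram([ones, x1, x2]))  # zero: singular

print(det_two_columns, det_three_columns)
```

In floating-point arithmetic with nearly collinear data, the determinant is tiny instead of exactly zero, which is where severe round off errors come from.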
After these preliminary changes, you can use the main loop of the QR Decomposition algorithm. You want to find a 4x4 matrix (because there are 4 rows of data) that you can use to premultiply each column. This transformation does not change the squared lengths of each column. You first find the column vector V by taking the first column and adding the square root of the column's sum of squares (computed in cell B29) to its first entry. Other entries in the first column are not changed. This action yields the vector in cells E29:E32. The sum of squares in V (as V^T V) is in cell G29. The 4x4 matrix V V^T is in cells I29:L32. Use this information to compute the 4x4 transformation matrix, P, by using the following formula, where I is the 4x4 identity matrix:
P = I - (2 / (V^T V)) * V V^T
The resulting matrix P is displayed in cells A35:D38.

If you premultiply the revised X columns in cells C23:E26 by P, you receive the results in cells G35:I38. Similarly, if you premultiply the revised Y column in cells A23:A26 by P, you receive the results in cells L35:L38. The X1 column has been transformed so that it still has the same sum of squares as before, but all entries except the top entry in the column are 0. More precisely, entries in cells G36:G38 are "effectively 0" because they are zero to fifteen decimal places. In row 40, sums of squares for all columns are computed and are not changed by the transformation.

The algorithm continues for a second iteration of the main loop and uses only the X0 and X2 data in cells H36:I38 and the Y data in cells L36:L38. Because you are concerned with only three rows, you can calculate the sums of squares for only the last three rows of the X0 and X2 columns. These values are displayed in cells H42:H43. The sum of squares of X0 is essentially 0. The X0 and X2 columns are swapped because X2 has the larger relevant sum of squares. After the columns are swapped, revised columns are displayed in cells A45:E49.
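A single step of this transformation can be sketched in Python. The sketch assumes the standard Householder form P = I - 2 V V^T / (V^T V), which is consistent with the properties described above. The column values and the names householder_matrix and premultiply are hypothetical; they are not taken from the worksheet.

```python
import math

def householder_matrix(col):
    """Build P = I - 2*V*V^T/(V^T*V), where V is col with the square root
    of col's sum of squares added to its first entry.
    Assumes col is not all zeros (otherwise V^T*V would be zero)."""
    n = len(col)
    v = list(col)
    v[0] += math.sqrt(sum(t * t for t in col))
    vtv = sum(t * t for t in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vtv
             for j in range(n)] for i in range(n)]

def premultiply(p, col):
    """Return the matrix-vector product P * col."""
    return [sum(a * b for a, b in zip(row, col)) for row in p]

first_col = [3.0, 1.0, 5.0, 1.0]   # hypothetical column, sum of squares 36
other_col = [1.0, 2.0, 3.0, 4.0]   # hypothetical column, sum of squares 30

P = householder_matrix(first_col)
new_first = premultiply(P, first_col)
new_other = premultiply(P, other_col)

print(new_first)                          # only the top entry is non-zero
print(sum(t * t for t in new_other))      # sum of squares is unchanged
```

The transformed first column keeps its sum of squares in the top entry alone, and every other column keeps its own sum of squares, which is why the residual and regression sums of squares can later be read directly from the transformed Y column.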
V is computed exactly as in the first iteration except that now V has only three rows. Computations of V^T V, V V^T, and P continue exactly as before and are shown in rows 51-54 and cells A57:C59. You can then premultiply only the last three rows of the X2, X0, and Y columns by P to yield the revised columns in cells G56:L60. To make this more readable, these columns are rewritten in cells G63:L67 by setting values that are effectively zero to exactly zero.

The next iteration only involves the X0 column and its last two rows. Because the sum of squares of entries in these rows is zero, the main loop of the algorithm terminates. The residual sum of squares is the sum of squares of revised Y vector entries below the second row. All the rows that were not processed at the time the main loop of the QR Decomposition algorithm terminated are included here. In this case, processing stopped because the last two rows in the X0 column contained only zeros. The residual sum of squares is calculated in cell G74.

You can see from the entries in cells G63:L67 that any values for the coefficients of the Xs leave a fitted value of zero for each of these last two rows. The values of coefficients for X1 and X2 that have been found yield an exact fit to Y values in the first two rows. Therefore, Y has been transformed so that its total sum of squares is not changed, the residual sum of squares is the sum of squares in the last two rows, and the regression sum of squares is the sum of squares in the first two rows.

The algorithm spotted collinearity when it noticed that the remaining entries in the X0 column were zero. At this point, no columns remain whose coefficients may improve the fit. The X0 column does not contain any useful additional information because X1 and X2 are already included in the model. Although X2 has a coefficient of zero, this does not make it a redundant column that is eliminated as a result of collinearity.
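The main loop described above can be sketched in Python as a Householder QR Decomposition with column swapping. This is a sketch under stated assumptions, not Excel's implementation: the function name pivoted_qr, the absolute tolerance of 1e-12, the sign choice for V (a common stability refinement), and the data are all hypothetical, and the columns are assumed to be centered already because an intercept is fitted.

```python
import math

def pivoted_qr(x_cols, y, tol=1e-12):
    """Transform the columns in place-style copies; return (rank, transformed y).
    At step k, the unprocessed column with the largest sum of squares in
    rows k and below is swapped into position k. The loop stops when that
    sum of squares is below tol (remaining columns are redundant)."""
    cols = [list(c) for c in x_cols]
    y = list(y)
    rank = 0
    for k in range(min(len(y), len(cols))):
        remaining = [sum(t * t for t in c[k:]) for c in cols[k:]]
        best = max(range(len(remaining)), key=remaining.__getitem__)
        if remaining[best] < tol:
            break
        cols[k], cols[k + best] = cols[k + best], cols[k]
        v = cols[k][k:]                      # copy of rows k and below
        norm = math.sqrt(sum(t * t for t in v))
        v[0] += norm if v[0] >= 0 else -norm
        vtv = sum(t * t for t in v)
        for vec in cols[k:] + [y]:           # apply P = I - 2*V*V^T/(V^T*V)
            scale = 2.0 * sum(a * b for a, b in zip(v, vec[k:])) / vtv
            for i, a in enumerate(v):
                vec[k + i] -= scale * a
        rank += 1
    return rank, y

def centered(col):
    mean = sum(col) / len(col)
    return [t - mean for t in col]

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 1.0, 4.0, 3.0]
x0 = [a + b for a, b in zip(x1, x2)]         # redundant: x0 = x1 + x2
y_c = centered([1.0, 3.0, 2.0, 5.0])

rank, y_t = pivoted_qr([centered(x0), centered(x1), centered(x2)], y_c)
ss_total = sum(t * t for t in y_t)           # unchanged by the transformation
ss_resid = sum(t * t for t in y_t[rank:])    # rows not processed
ss_reg = sum(t * t for t in y_t[:rank])      # rows processed
print(rank, ss_total, ss_reg, ss_resid)
```

Only two of the three columns are processed, so the redundant column is detected by the loop itself instead of by a separate check.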
At this point, you can extract most of the summary statistics that LINEST provides. However, this article does not discuss how to determine standard errors of the regression coefficients. Values from LINEST output in Excel 2003 are shown in cells I74:K78 for comparison. The regression sum of squares is calculated in cell E74 and R-squared is calculated in cell E75; these values are displayed in the LINEST output in cell I78 and cell I76, respectively. The residual sum of squares (or error sum of squares) is calculated in cell G72 and displayed in the LINEST output in cell J78. Other entries in the LINEST output depend on the degrees of freedom (DF). Many statistical packages report Regression DF, Error DF, and Total DF. Excel reports only Error DF (in cell J77). Earlier versions of Excel compute Error DF correctly in all cases except when there is collinearity that should have eliminated one or more predictor columns. The value of Error DF depends on the number of predictor columns that are actually used. With collinearity, Excel 2003 handles this computation correctly, while earlier versions count all predictor columns even though one or more should have been eliminated by collinearity. Degrees of freedom is examined here in more detail. Assume that collinearity is not an issue. When the intercept is fitted, in other words, the third argument to LINEST is missing or true:
Earlier versions of Excel use these formulas to correctly compute DF, except that Excel 2002 does not look for collinearity. Looking for collinearity is one of the reasons for using QR Decomposition for these computations. The predictor columns form a matrix. If the intercept is fitted, there is effectively an additional column of 1s that does not appear on your spreadsheet. QR Decomposition determines the rank of this matrix. The previous formulas for Regression DF should be changed to the following formulas:
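A rank-based degrees of freedom computation can be sketched in Python as follows. This is an illustration under assumptions, not Excel's code: the function name degrees_of_freedom is hypothetical, rank is taken to count the implicit column of 1s when the intercept is fitted, and the no-intercept branch follows the usual textbook convention (Total DF equals the number of rows, Regression DF equals the rank).

```python
def degrees_of_freedom(n_rows, rank, intercept=True):
    """Return (total_df, regression_df, error_df).

    n_rows: number of data rows.
    rank:   rank of the predictor matrix, counting the implicit column
            of 1s when intercept is True (assumption of this sketch).
    """
    if intercept:
        total_df = n_rows - 1
        regression_df = rank - 1
    else:
        # Assumed textbook convention for a line forced through the origin.
        total_df = n_rows
        regression_df = rank
    error_df = total_df - regression_df
    return total_df, regression_df, error_df

# Four rows, intercept fitted, matrix of rank 2 after collinearity is detected.
print(degrees_of_freedom(4, 2))   # (3, 1, 2)

# Same data if all three columns are counted without checking collinearity.
print(degrees_of_freedom(4, 3))   # (3, 2, 1)
```

The only input that changes between the two calls is the rank, which is the quantity that QR Decomposition determines.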
In the example on the worksheet, the intercept was fitted. Total DF is 4 - 1 = 3; Regression DF is 2 - 1 = 1; Error DF is Total DF - Regression DF = 3 - 1 = 2. For this example, Excel 2002 and earlier versions of Excel calculated Regression DF as 3 - 1 = 2 and Error DF as 3 - 2 = 1. The difference comes from the failure to look for collinearity. Earlier versions of Excel noted that there were three predictor columns; Excel 2003 examined these three columns and found that there were really only two.

Standard error of Y is calculated in cell E77 and is shown in the LINEST output in cell J76. The f statistic is calculated in cell H78 and in the LINEST output in cell I77. The formula for the f statistic is:
(Regression sum of squares / Regression DF) / (Residual sum of squares / Error DF)

The intercept coefficient is:
Y col mean minus the sum over all X columns (except the intercept column) of X col regression coefficient times X col mean
This value is calculated in C80 and agrees with the LINEST output in cell K74.

Summary of results in earlier versions of Excel
LINEST used a formula that is not correct to find the total sum of squares when the third argument to LINEST was set to FALSE. This formula caused values that are not correct in the regression sum of squares and values that are not correct for the other output that depends on the regression sum of squares: r squared and the f statistic.

Regardless of the value of the third argument, LINEST was calculated by using an approach that paid no attention to collinearity issues. The presence of collinearity caused round off errors, standard errors of regression coefficients that are not appropriate, and degrees of freedom that are not appropriate. Sometimes, round off problems were sufficiently severe that LINEST filled its output table with #NUM!. LINEST generally provides acceptable results if the following conditions are true:
Summary of results in Excel 2003
Improvements include correcting the formula for total sum of squares in the case where the third argument to LINEST was set to FALSE and switching to the QR Decomposition method of determining the regression coefficients. QR Decomposition has the following two advantages:
Conclusions
LINEST has been greatly improved for Excel 2003 and for later versions of Excel. If you use an earlier version of Excel, verify that predictor columns are not collinear before you use LINEST. Be careful to use the workaround in this article if the third argument in LINEST is set to FALSE. Note that collinearity is only a problem in a small percentage of cases, and calls to LINEST with the third argument set to FALSE are also relatively rare in practice. Earlier versions of Excel give acceptable LINEST results when there is no collinearity and when the third argument in LINEST is TRUE or omitted. Improvements in LINEST affect the Analysis ToolPak's linear regression tool that calls LINEST and the following related functions:
REFERENCES
McCullough, B.D. and B. Wilson. "On the Accuracy of Statistical Procedures in Microsoft Excel 97." Computational Statistics and Data Analysis, 1999, 31, 27-37.










