As of Version 7.0, linear regression functionality is built into the Wolfram Language.

Linear Regression Package

The built-in function Fit finds a least-squares fit to a list of data as a linear combination of the specified basis functions. The functions Regress and DesignedRegress provided in this package augment Fit by giving a list of commonly required diagnostics such as the coefficient of determination RSquared, the analysis of variance table ANOVATable, and the mean squared error EstimatedVariance. The output of regression functions can be controlled so that only needed information is produced. The Nonlinear Regression Package provides analogous functionality for nonlinear models.

The basis functions $f_j$ specify the predictors as functions of the independent variables. The resulting model for the response variable is $y_i = \beta_1 f_{1i} + \beta_2 f_{2i} + \cdots + \beta_p f_{pi} + e_i$, where $y_i$ is the $i$-th response, $f_{ji}$ is the $j$-th basis function evaluated at the $i$-th observation, and $e_i$ is the $i$-th residual error.

Estimates of the coefficients $\beta_1, \ldots, \beta_p$ are calculated to minimize $\sum_i e_i^2$, the error or residual sum of squares. For example, simple linear regression is accomplished by defining the basis functions as $f_1 = 1$ and $f_2 = x$, in which case $\beta_1$ and $\beta_2$ are found to minimize $\sum_i [y_i - (\beta_1 + \beta_2 x_i)]^2$.
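The simple linear regression case can be sketched outside the package in a few lines of Python. This is only an illustration of the least-squares criterion, not the package's implementation; the data values and the function names simple_least_squares and residual_sum_of_squares are hypothetical.

```python
# Hypothetical data for illustration only (not taken from the package documentation).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def simple_least_squares(xs, ys):
    """Return (b1, b2) minimizing sum((y - (b1 + b2*x))**2)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b2 = sxy / sxx              # slope
    b1 = y_mean - b2 * x_mean   # intercept
    return b1, b2

def residual_sum_of_squares(b1, b2, xs, ys):
    """Sum of squared residuals for the line b1 + b2*x."""
    return sum((y - (b1 + b2 * x)) ** 2 for x, y in zip(xs, ys))

b1, b2 = simple_least_squares(xs, ys)
rss = residual_sum_of_squares(b1, b2, xs, ys)
print(b1, b2, rss)
```

Any other choice of intercept and slope gives a larger residual sum of squares on the same data, which is what the minimization criterion means.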

Regress[data,{1,x,x²},x]	fit a list of data points data to a quadratic model
Regress[data,{1,x₁,x₂,x₁x₂},{x₁,x₂}]	fit data to a model that includes interaction between independent variables x₁ and x₂
Regress[data,{f₁,f₂,…},vars]	fit data to a model as a linear combination of the functions f_i of variables vars

Using Regress.

The arguments of Regress are of the same form as those of Fit. The data can be a list of vectors, each vector consisting of the observed values of the independent variables and the associated response. The basis functions f_j must be functions of the symbols given as variables. These symbols correspond to the independent variables represented in the data. By default, a constant function f_j=1 is added to the list of basis functions if it is not given explicitly.
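As a rough illustration of how a list of basis functions turns into a fitted model, the following Python sketch evaluates each basis function at each observation and solves the normal equations. The package itself relies on a singular value decomposition, so this is a substitute technique chosen for brevity; the names solve and fit_linear_model and the data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_linear_model(data, basis):
    """Least-squares coefficients for y ~ sum_j beta_j * basis[j](x)."""
    design = [[f(x) for f in basis] for x, _ in data]
    ys = [y for _, y in data]
    p = len(basis)
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical noise-free data generated from 1 + 2x + 3x^2.
data = [(x, 1.0 + 2.0 * x + 3.0 * x * x) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]  # constant given explicitly
coeffs = fit_linear_model(data, basis)
print(coeffs)
```

Because the data are exactly quadratic, the recovered coefficients match the generating values up to rounding.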

The data can also be a vector of data points. In this case, Regress assumes that this vector represents the values of a response variable with the independent variable having values 1, 2, ….

{y₁,y₂,…}	data points specified by a list of response values, where a single independent variable is assumed to take the values 1, 2, …
{{x₁₁,x₁₂,…,y₁},{x₂₁,x₂₂,…,y₂},…}	data points specified by a matrix, where x_ik is the value of the i-th case of the k-th independent variable, and y_i is the i-th response

Ways of specifying data in Regress.
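The two data forms can be brought to a common shape with a small helper. This Python sketch only mirrors the convention described above; normalize_data is a hypothetical name and not part of the package.

```python
def normalize_data(data):
    """Return rows [x1, ..., xk, y].

    A flat list of responses is paired with the implied
    independent variable values 1, 2, ...
    """
    if data and not isinstance(data[0], (list, tuple)):
        return [[i, y] for i, y in enumerate(data, start=1)]
    return [list(row) for row in data]

print(normalize_data([2.0, 4.0, 5.0]))            # vector of responses
print(normalize_data([(0.5, 2.0), (1.5, 4.0)]))   # matrix form is kept as is
```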

This loads the package.

This data contains ordered pairs of a single predictor and a response.

This is a plot of the data.

This is the output for fitting the model $y_i = \beta_0 + \beta_1 x_i^2 + e_i$.

You can use Fit if you want only the fitted function.

option name	default value	description
IncludeConstant	True	constant automatically included in model
RegressionReport	SummaryReport	fit diagnostics to include
Weights	Automatic	list of weights for each point or pure function
BasisNames	Automatic	names of basis elements for table headings

Options for Regress.

Two of the options of Regress influence the method of calculation. IncludeConstant has a default setting True, which causes a constant term to be added to the model even if it is not specified in the basis functions. To fit a model without this constant term, specify IncludeConstant->False and do not include a constant in the basis functions.

The Weights option allows you to implement weighted least squares by specifying a list of weights, one for each data point; the default Weights->Automatic implies a weight of unity for each data point. When Weights->{w₁,…,w_n}, the parameter estimates are chosen to minimize the weighted sum of squared residuals $\sum_i w_i e_i^2$.

Weights can also specify a pure function of the response. For example, to choose parameter estimates to minimize $\sum_i \sqrt{y_i}\, e_i^2$, set Weights->(Sqrt[#]&).
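The weighted criterion can be sketched for a straight-line model in Python. This is an illustration under assumed data, not the package's code; weighted_line_fit is a hypothetical name, and the second call imitates weights given as the square root of each response.

```python
import math

def weighted_line_fit(xs, ys, weights):
    """Return (b1, b2) minimizing sum(w * (y - (b1 + b2*x))**2)."""
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, xs, ys))
    b2 = sxy / sxx
    return y_bar - b2 * x_bar, b2

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Unit weights reproduce ordinary least squares.
unit_fit = weighted_line_fit(xs, ys, [1.0] * len(xs))

# Weights computed as a function of the response.
sqrt_weights = [math.sqrt(y) for y in ys]
b1, b2 = weighted_line_fit(xs, ys, sqrt_weights)
weighted_residuals = [y - (b1 + b2 * x) for x, y in zip(xs, ys)]
print(unit_fit, (b1, b2))
```

At the weighted minimum the residuals satisfy the weighted normal equations, so the weighted sums of e_i and of x_i e_i both vanish.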

The options RegressionReport and BasisNames affect the form and content of the output. If RegressionReport is not specified, Regress automatically gives a list including values for ParameterTable, RSquared, AdjustedRSquared, EstimatedVariance and ANOVATable. This set of objects comprises the default SummaryReport. The option RegressionReport can be used to specify a single object or a list of objects so that more (or less) than the default set of results is included in the output. RegressionReportValues[Regress] gives the objects that may be included in the RegressionReport list for the Regress function.

With the option BasisNames, you can label the headings of predictors in tables such as ParameterTable and ParameterCITable.

The regression functions will also accept any option that can be specified for SingularValueList or StudentTCI. In particular, the numerical tolerance for the internal singular value decomposition is specified using Tolerance, and the confidence level for hypothesis testing and confidence intervals is specified using ConfidenceLevel.

BestFit	best fit function
BestFitParameters	best fit parameter estimates
ANOVATable	analysis of variance table
EstimatedVariance	estimated error variance
ParameterTable	table of parameter information including standard errors and test statistics
ParameterCITable	table of confidence intervals for the parameters
ParameterConfidenceRegion	ellipsoidal joint confidence region for the parameters
ParameterConfidenceRegion[{f_i1,f_i2,…}]	ellipsoidal conditional joint confidence region for the parameters associated with the basis functions {f_i1,f_i2,…}
FitResiduals	differences between the observed responses and the predicted responses
PredictedResponse	fitted values obtained by evaluating the best fit function at the observed values of the independent variables
SinglePredictionCITable	table of confidence intervals for predicting a single observation of the response variable
MeanPredictionCITable	table of confidence intervals for predicting the expected value of the response variable
RSquared	coefficient of determination
AdjustedRSquared	adjusted coefficient of determination
CoefficientOfVariation	coefficient of variation
CovarianceMatrix	covariance matrix of the parameters
CorrelationMatrix	correlation matrix of the parameters

Some RegressionReport values.

ANOVATable, a table for analysis of variance, provides a comparison of the given model to a smaller one including only a constant term. If IncludeConstant->False is specified, the smaller model contains no parameters at all, so the fitted model is compared against the raw data. The table includes the degrees of freedom, the sums of squares, and the mean squares due to the model (in the row labeled Model) and due to the residuals (in the row labeled Error). The residual mean square is also available in EstimatedVariance, and is calculated by dividing the residual sum of squares by its degrees of freedom. The F-test compares the two models using the ratio of their mean squares. If the value of F is large, the null hypothesis supporting the smaller model is rejected.
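The quantities in the analysis of variance table can be computed by hand for a straight-line model with a constant term. The Python sketch below is an illustration with hypothetical data; anova_simple_regression and the dictionary keys are names chosen here, not package output.

```python
def anova_simple_regression(xs, ys):
    """Degrees of freedom, sums of squares, mean squares and F ratio
    for the model y = b1 + b2*x compared with a constant-only model."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    fitted = [b1 + b2 * x for x in xs]
    ss_model = sum((f - y_bar) ** 2 for f in fitted)
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    df_model = 1        # one nonconstant basis function
    df_error = n - 2    # n minus the number of fitted parameters
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error   # the estimated error variance
    return {
        "Model": (df_model, ss_model, ms_model),
        "Error": (df_error, ss_error, ms_error),
        "Total": (n - 1, ss_model + ss_error),
        "FRatio": ms_model / ms_error,
    }

# Hypothetical data for illustration only.
table = anova_simple_regression([1.0, 2.0, 3.0, 4.0, 5.0],
                                [2.0, 4.0, 5.0, 4.0, 5.0])
print(table)
```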

To evaluate the importance of each basis function, you can get information about the parameter estimates from the parameter table obtained by including ParameterTable in the list specified by RegressionReport. This table includes the estimates, their standard errors, and t-statistics for testing whether each parameter is zero. The p-values are calculated by comparing the obtained statistic to the t distribution with n-p degrees of freedom, where n is the sample size and p is the number of predictors. Confidence intervals for the parameter estimates, also based on the t distribution, can be found by specifying ParameterCITable. ParameterConfidenceRegion specifies the ellipsoidal joint confidence region of all fit parameters. ParameterConfidenceRegion[{f_i1,f_i2,…}] specifies the joint conditional confidence region of the fit parameters associated with basis functions {f_i1,f_i2,…}, a subset of the complete set of basis functions.
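For a straight-line model the standard errors and t-statistics have closed forms, which the following Python sketch evaluates. It is an illustration with hypothetical data; parameter_table is a name chosen here, and p-values are omitted because the standard library has no t distribution.

```python
import math

def parameter_table(xs, ys):
    """Return [(estimate, standard error, t statistic)] for the
    intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    sse = sum((y - (b1 + b2 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)   # estimated error variance, n - p degrees of freedom
    se_b1 = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    se_b2 = math.sqrt(s2 / sxx)
    return [(b1, se_b1, b1 / se_b1), (b2, se_b2, b2 / se_b2)]

# Hypothetical data for illustration only.
rows = parameter_table([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
for estimate, se, t in rows:
    print(estimate, se, t)
```

With a single nonconstant basis function, the square of the slope's t-statistic equals the F ratio from the analysis of variance.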

The square of the multiple correlation coefficient is called the coefficient of determination $R^2$, and is given by the ratio of the model sum of squares to the total sum of squares. It is a summary statistic that describes the relationship between the predictors and the response variable. AdjustedRSquared is defined as $\bar{R}^2 = 1 - \left(\frac{n-1}{n-p}\right)(1 - R^2)$, with n and p as above, and gives an adjusted value that you can use to compare subsequent subsets of models. The coefficient of variation is given by the ratio of the residual root mean square to the mean of the response variable. If the response is strictly positive, this is sometimes used to measure the relative magnitude of error variation.
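These three summary statistics follow directly from the observed and fitted responses. The Python sketch below assumes a model that includes a constant term; the data, the fitted values (from the line 2.2 + 0.6x), and the name summary_statistics are hypothetical.

```python
import math

def summary_statistics(ys, fitted, p):
    """Return (R squared, adjusted R squared, coefficient of variation).

    p is the number of fitted parameters, including the constant.
    """
    n = len(ys)
    y_bar = sum(ys) / n
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_error = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_model = ss_total - ss_error   # valid when a constant term is fitted
    r2 = ss_model / ss_total
    adjusted = 1.0 - (n - 1) / (n - p) * (1.0 - r2)
    cv = math.sqrt(ss_error / (n - p)) / y_bar
    return r2, adjusted, cv

# Hypothetical responses and their least-squares fitted values.
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
r2, adjusted, cv = summary_statistics(ys, fitted, p=2)
print(r2, adjusted, cv)
```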

Each row in MeanPredictionCITable gives the confidence interval for the mean response at each of the values of the independent variables. Each row in SinglePredictionCITable gives the confidence interval for a single observed response at each of the values of the independent variables. MeanPredictionCITable gives a region likely to contain the regression curve, while SinglePredictionCITable gives a region likely to contain all possible observations.
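The difference between the two kinds of interval shows up in their standard errors: the single-observation interval adds the error variance of a new observation. The Python sketch below illustrates this for a straight-line model with hypothetical data. The name prediction_intervals is chosen here, and the t quantile is passed in as an assumed value read from a table (about 3.182 for a 95% interval with 3 degrees of freedom).

```python
import math

def prediction_intervals(xs, ys, x0, t_quantile):
    """Return (mean_interval, single_interval) at x0 for the
    least-squares line fitted to (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    sse = sum((y - (b1 + b2 * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / (n - 2)
    y0 = b1 + b2 * x0
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = math.sqrt(s2 * leverage)            # mean response
    se_single = math.sqrt(s2 * (1.0 + leverage))  # one new observation
    mean_ci = (y0 - t_quantile * se_mean, y0 + t_quantile * se_mean)
    single_ci = (y0 - t_quantile * se_single, y0 + t_quantile * se_single)
    return mean_ci, single_ci

# Hypothetical data; t_quantile is an assumed table value.
mean_ci, single_ci = prediction_intervals(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0],
    x0=3.0, t_quantile=3.182)
print(mean_ci, single_ci)
```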

The following gives the residuals, the confidence interval table for the predicted response of single observations, and the parameter joint confidence region.

This is a list of the residuals extracted from the output.

The observed response, the predicted response, the standard errors of the predicted response, and the confidence intervals may also be extracted.

This plots the predicted responses against the residuals.

Here the predicted responses and lower and upper confidence limits are paired with the corresponding x values.

This displays the raw data, fitted curve, and the 95% confidence intervals for the predicted responses of single observations.

Graphics may be used to display an Ellipsoid object. This is the joint 95% confidence region for the regression parameters.

This package provides numerous diagnostics for evaluating the data and the fit. The HatDiagonal gives the leverage of each point, measuring whether each observation of the independent variables is unusual. CookD and PredictedResponseDelta are influence diagnostics, simultaneously measuring whether the independent variables and the response variable are unusual. Unfortunately, these diagnostics are primarily useful in detecting single outliers. In particular, the diagnostics may indicate a single outlier, but deleting that observation and recomputing the diagnostics may indicate others. All these diagnostics are subject to this masking effect.

HatDiagonal	diagonal of the hat matrix $X(X^{T}X)^{-1}X^{T}$, where X is the n by p (weighted) design matrix
JackknifedVariance	{v₁,…,v_n}, where v_i is the estimated error variance computed using the data with the i-th case deleted
StandardizedResiduals	fit residuals scaled by their standard errors, computed using the estimated error variance
StudentizedResiduals	fit residuals scaled by their standard errors, computed using the jackknifed estimated error variances
CookD	{d₁,…,d_n}, where d_i is Cook’s squared distance diagnostic for evaluating whether the i-th case is an outlier
PredictedResponseDelta	{d₁,…,d_n}, where d_i is Kuh and Welsch’s DFFITS diagnostic giving the standardized signed difference in the i-th predicted response, between using all the data and the data with the i-th case deleted
BestFitParametersDelta	{{d₁₁,…,d_1p},…,{d_n1,…,d_np}}, where d_ij is Kuh and Welsch’s DFBETAS diagnostic giving the standardized signed difference in the j-th parameter estimate, between using all the data and the data with the i-th case deleted
CovarianceMatrixDetRatio	{r₁,…,r_n}, where r_i is Kuh and Welsch’s COVRATIO diagnostic giving the ratio of the determinant of the parameter covariance matrix computed using the data with the i-th case deleted, to the determinant of the parameter covariance matrix computed using the original data

Diagnostics for detecting outliers.
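Several of these diagnostics have simple closed forms for an unweighted straight-line model. The Python sketch below uses the standard textbook formulas, which are assumed here to match the package's definitions; the data and the name outlier_diagnostics are hypothetical.

```python
import math

def outlier_diagnostics(xs, ys):
    """Leverage, jackknifed variance, standardized and studentized
    residuals, and Cook's distance for the line y = b1 + b2*x."""
    n, p = len(xs), 2
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    res = [y - (b1 + b2 * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in res) / (n - p)
    # Diagonal of the hat matrix for a straight-line fit.
    hat = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]
    # Error variance with case i deleted, without refitting.
    jack = [((n - p) * s2 - e * e / (1.0 - h)) / (n - p - 1)
            for e, h in zip(res, hat)]
    standardized = [e / math.sqrt(s2 * (1.0 - h)) for e, h in zip(res, hat)]
    studentized = [e / math.sqrt(v * (1.0 - h))
                   for e, v, h in zip(res, jack, hat)]
    cook = [r * r * h / (p * (1.0 - h)) for r, h in zip(standardized, hat)]
    return {"hat": hat, "jackknifed": jack, "standardized": standardized,
            "studentized": studentized, "cook": cook}

# Hypothetical data for illustration only.
diag = outlier_diagnostics([1.0, 2.0, 3.0, 4.0, 5.0],
                           [2.0, 4.0, 5.0, 4.0, 5.0])
print(diag["hat"])
print(diag["cook"])
```

The leverages sum to the number of parameters, and the first jackknifed variance agrees with refitting the line to the remaining four points.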

Some diagnostics indicate the degree to which individual basis functions contribute to the fit, or whether the basis functions are involved in a collinear relationship. The sum of the elements in the SequentialSumOfSquares vector gives the model sum of squares listed in the ANOVATable. Each element corresponds to the increment in the model sum of squares obtained by sequentially adding each nonconstant basis function to the model. Each element in the PartialSumOfSquares vector gives the increase in the model sum of squares due to adding the corresponding nonconstant basis function to a model consisting of all other basis functions. SequentialSumOfSquares is useful in determining the degree of a univariate polynomial model, while PartialSumOfSquares is useful in trimming a large set of predictors. VarianceInflation or EigenstructureTable may also be used for predictor set trimming.

PartialSumOfSquares	a list giving the increase in the model sum of squares due to adding each nonconstant basis function to the model consisting of the remaining basis functions
SequentialSumOfSquares	a list giving a partitioning of the model sum of squares, one element for each nonconstant basis function added sequentially to the model
VarianceInflation	{v₁,…,v_p}, where v_j is the variance inflation factor associated with the j-th parameter
EigenstructureTable	table giving the eigenstructure of the correlation matrix of the nonconstant basis functions

Diagnostics for evaluating basis functions and detecting collinearity.
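The sequential partitioning can be illustrated by fitting a growing sequence of models and recording how much the residual sum of squares drops at each step. This Python sketch uses normal equations rather than the package's method; the data and the names solve, residual_ss, and sequential_sum_of_squares are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def residual_ss(xs, ys, basis):
    """Residual sum of squares of the least-squares fit on basis."""
    design = [[f(x) for f in basis] for x in xs]
    p = len(basis)
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, y in zip(design, ys))

def sequential_sum_of_squares(xs, ys, nonconstant_basis):
    """Increase in the model sum of squares as each basis function is added."""
    basis = [lambda x: 1.0]                 # start from the constant-only model
    previous = residual_ss(xs, ys, basis)
    increments = []
    for f in nonconstant_basis:
        basis = basis + [f]
        current = residual_ss(xs, ys, basis)
        increments.append(previous - current)
        previous = current
    return increments

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
seq = sequential_sum_of_squares(xs, ys, [lambda x: x, lambda x: x * x])
print(seq)
```

Here most of the model sum of squares comes from the linear term, which is the kind of evidence used when choosing the degree of a polynomial model.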

The Durbin–Watson d statistic is used for testing the existence of a first-order autoregressive process. The statistic takes on values between 0 and 4, with values near the middle of that range indicating uncorrelated errors, an underlying assumption of the regression model. Critical values for the statistic vary with sample size, the number of parameters in the model, and the desired significance. These values can be found in published tables.

DurbinWatsonD	Durbin–Watson d statistic

Correlated errors diagnostic.
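The statistic itself is the sum of squared successive differences of the residuals divided by the residual sum of squares. A minimal Python sketch, with hypothetical residuals and a function name chosen here:

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals, in observation order.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
d = durbin_watson(residuals)
print(d)
```

The order of the residuals matters, since the statistic compares each residual with the one before it.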

Other statistics not mentioned here can be computed with the help of the catcher matrix. This matrix catches all the information the predictors have about the parameter vector. This matrix can be exported from Regress by specifying CatcherMatrix with the RegressionReport option.

CatcherMatrix	p×n matrix C, where C·y is the estimated parameter vector and y is the response vector

Matrix describing the parameter information provided by the predictors.
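For a full-rank, unweighted design matrix X, a matrix with this property is $(X^{T}X)^{-1}X^{T}$. The Python sketch below builds it for a straight-line model using the closed-form inverse of a 2 by 2 matrix. Whether the package computes it this way is an assumption, and the data and the name catcher_matrix are hypothetical.

```python
def catcher_matrix(xs):
    """Return the 2 x n matrix C = (X^T X)^(-1) X^T for the design
    matrix X whose rows are [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det],
           [-sx / det, n / det]]
    # Column i of X^T is (1, x_i).
    return [[inv[r][0] + inv[r][1] * x for x in xs] for r in range(2)]

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
C = catcher_matrix(xs)
estimates = [sum(c * y for c, y in zip(row, ys)) for row in C]
print(estimates)
```

Applying the same matrix to a different response vector gives the estimates for that response without refitting.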

Frequently, linear regression is applied to an existing design matrix rather than the original data. A design matrix is a list containing the basis functions evaluated at the observed values of the independent variable. If your data is already in the form of a design matrix with a corresponding vector of response data, you can use DesignedRegress for the same analyses as provided by Regress. DesignMatrix puts your data in the form of a design matrix.

DesignedRegress[designmatrix,response]	fit the model represented by designmatrix given the vector response of response data
DesignMatrix[data,{f₁,f₂,…},vars]	give the design matrix for modeling data as a linear combination of the functions f_i of variables vars

Functions for linear regression using a design matrix.

DesignMatrix takes the same arguments as Regress. It can be used to get the necessary arguments for DesignedRegress, or to check whether you correctly specified your basis functions. When you use DesignMatrix, the constant term is always included in the model unless IncludeConstant->False is specified. Every option of Regress except IncludeConstant is accepted by DesignedRegress. RegressionReportValues[DesignedRegress] gives the values that may be included in the RegressionReport list for the DesignedRegress function.
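The relationship between the data, the design matrix, and the response vector can be sketched in Python. This is a simplified illustration: the hypothetical function design_matrix always prepends the constant column when include_constant is true, so it assumes the basis list does not already contain a constant.

```python
def design_matrix(data, basis, include_constant=True):
    """Evaluate each basis function at each observation.

    Each row of data is (x1, ..., xk, y); the response is ignored here.
    """
    rows = []
    for *xvals, _response in data:
        row = [1.0] if include_constant else []
        row.extend(f(*xvals) for f in basis)
        rows.append(row)
    return rows

def response_vector(data):
    """Return the last element of each data row."""
    return [row[-1] for row in data]

# Hypothetical data and basis functions for illustration only.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0)]
basis = [lambda x: x, lambda x: x * x]
X = design_matrix(data, basis)
X_no_constant = design_matrix(data, basis, include_constant=False)
y = response_vector(data)
print(X)
print(y)
```

Inspecting the rows of the matrix is a quick way to confirm that the basis functions were specified as intended.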

This is the design matrix used in the previous regression analysis.

Here is the vector of observed responses.

The result of DesignedRegress is equivalent to that of Regress.

DesignedRegress[svd,response]	fit the model represented by svd, the singular value decomposition of a design matrix, given the vector response of response data

Linear regression using the singular value decomposition of a design matrix.

DesignedRegress will also accept the singular value decomposition of the design matrix. If the regression is not weighted, this approach will save recomputing the design matrix decomposition.

This is the singular value decomposition of the design matrix.

When several responses are of interest, this will save recomputing the design matrix decomposition.
