Purpose of regression analysis. Methods of mathematical statistics

The main goal of regression analysis is to determine the analytical form of the relationship in which the change in the resultant attribute is due to the influence of one or more factor attributes, while the set of all other factors that also affect the resultant attribute is held at constant, average values.
Tasks of regression analysis:
a) Establishing the form of the dependence. With respect to the nature and form of the relationship between phenomena, one distinguishes positive linear and non-linear regression and negative linear and non-linear regression.
b) Definition of the regression function in the form of a mathematical equation of one type or another and establishing the influence of explanatory variables on the dependent variable.
c) Estimation of unknown values of the dependent variable. Using the regression function, one can reproduce the values of the dependent variable within the interval of given values of the explanatory variables (i.e., solve the interpolation problem) or evaluate the course of the process outside this interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Paired regression is an equation relating two variables y and x: ŷ = f(x), where y is the dependent variable (resultant attribute) and x is the independent, explanatory variable (factor attribute).

There are linear and non-linear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, and regressions that are non-linear with respect to the estimated parameters.
Regressions that are non-linear in the explanatory variables (but linear in the estimated parameters), for example:
- polynomials of various degrees: ŷ = a + b1·x + b2·x² + b3·x³;
- the equilateral hyperbola: ŷ = a + b/x.

Regressions that are non-linear in the estimated parameters, for example:
- the power function: ŷ = a·x^b;
- the exponential function: ŷ = a·b^x.

Building a regression equation reduces to estimating its parameters. To estimate the parameters of regressions that are linear in the parameters, the method of ordinary least squares (OLS) is used. OLS yields parameter estimates for which the sum of squared deviations of the actual values of the resultant attribute y from the theoretical values ŷx is minimal, i.e.

Σ(y − ŷx)² → min.
For linear equations and non-linear equations reducible to linear form, the following system is solved for a and b:

Σy = n·a + b·Σx,
Σy·x = a·Σx + b·Σx².

You can use the ready-made formulas that follow from this system:

b = (mean(y·x) − ȳ·x̄) / (mean(x²) − (x̄)²),  a = ȳ − b·x̄.
The closeness of the relationship between the phenomena under study is estimated by the linear coefficient of pair correlation r_xy for linear regression:

r_xy = b·(σ_x / σ_y) = (mean(y·x) − ȳ·x̄) / (σ_x·σ_y),  −1 ≤ r_xy ≤ 1,

and by the correlation index ρ_xy for non-linear regression:

ρ_xy = sqrt(1 − Σ(y − ŷx)² / Σ(y − ȳ)²),  0 ≤ ρ_xy ≤ 1.
The quality of the constructed model is assessed by the coefficient (index) of determination and by the average approximation error.
The average approximation error is the average deviation of the calculated values from the actual ones:

Ā = (1/n)·Σ| (y − ŷx) / y |·100%.

The permissible limit for Ā is no more than 8-10%.
The average elasticity coefficient Ē shows by how many percent, on average, the result y changes from its average value when the factor x changes by 1% from its average value:

Ē = f′(x)·(x̄ / ȳ).

The task of the analysis of variance is to decompose the variance of the dependent variable:

Σ(y − ȳ)² = Σ(ŷx − ȳ)² + Σ(y − ŷx)²,

where Σ(y − ȳ)² is the total sum of squared deviations;
Σ(ŷx − ȳ)² is the sum of squared deviations due to regression ("explained" or "factorial");
Σ(y − ŷx)² is the residual sum of squared deviations.
The share of the variance explained by the regression in the total variance of the resultant attribute y is characterized by the coefficient (index) of determination R²:

R² = Σ(ŷx − ȳ)² / Σ(y − ȳ)² = 1 − Σ(y − ŷx)² / Σ(y − ȳ)².

The coefficient of determination is the square of the correlation coefficient or of the correlation index.

The F-test - an assessment of the quality of the regression equation - consists in testing the hypothesis H0 of the statistical insignificance of the regression equation and of the indicator of closeness of the relationship. For this, the actual value F_fact is compared with the critical (tabulated) value F_table of Fisher's F-criterion. F_fact is determined from the ratio of the factorial and residual variances, each calculated per one degree of freedom:

F_fact = (Σ(ŷx − ȳ)² / m) / (Σ(y − ŷx)² / (n − m − 1)) = (R² / (1 − R²))·((n − m − 1) / m),

where n is the number of population units and m is the number of parameters attached to the variables x.
F_table is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom and significance level α. The significance level α is the probability of rejecting the null hypothesis given that it is true. Usually α is taken equal to 0.05 or 0.01.
If F_table < F_fact, then H0, the hypothesis about the random nature of the estimated characteristics, is rejected, and their statistical significance and reliability are recognized. If F_table > F_fact, then H0 is not rejected, and the regression equation is recognized as statistically insignificant and unreliable.
To assess the statistical significance of the regression and correlation coefficients, Student's t-test is used and confidence intervals are calculated for each of the indicators. The hypothesis H0 about the random nature of the indicators is put forward, i.e. about their insignificant difference from zero. The significance of the regression and correlation coefficients is assessed with Student's t-test by comparing their values with the magnitude of the random error:

t_b = b / m_b;  t_a = a / m_a;  t_r = r_xy / m_r.
The random errors of the linear regression parameters and of the correlation coefficient are determined by the formulas:

m_a = S_resid·sqrt(Σx²) / (n·σ_x),
m_b = S_resid / (σ_x·sqrt(n)),
m_r = sqrt((1 − r_xy²) / (n − 2)),

where S_resid = sqrt(Σ(y − ŷx)² / (n − 2)) is the residual standard error.
Comparing the actual and critical (tabulated) values of the t-statistic, t_fact and t_table, we accept or reject the hypothesis H0.
The relationship between Fisher's F-test and Student's t-statistic is expressed by the equality

t_r² = t_b² = F.

If t_table < t_fact, then H0 is rejected, i.e. a, b and r_xy do not differ from zero by chance but were formed under the influence of the systematically acting factor x. If t_table > t_fact, then H0 is not rejected and the random nature of the formation of a, b, or r_xy is recognized.
To construct the confidence intervals, we determine the marginal error Δ for each indicator:

Δ_a = t_table·m_a,  Δ_b = t_table·m_b.
The formulas for calculating the confidence intervals are as follows:

γ_a = a ± Δ_a;  a_min = a − Δ_a;  a_max = a + Δ_a;
γ_b = b ± Δ_b;  b_min = b − Δ_b;  b_max = b + Δ_b.

If zero falls within the boundaries of the confidence interval, i.e. the lower limit is negative and the upper limit is positive, then the estimated parameter is taken to be zero, since it cannot simultaneously take on both positive and negative values.
The forecast value ŷ_p is determined by substituting the corresponding (forecast) value x_p into the regression equation. The average standard error of the forecast is calculated:

m_ŷp = σ_resid·sqrt(1 + 1/n + (x_p − x̄)² / Σ(x − x̄)²),

where σ_resid² = Σ(y − ŷx)² / (n − m − 1),

and the confidence interval of the forecast is built:

γ_ŷp = ŷ_p ± Δ_ŷp,  i.e.  ŷ_p − Δ_ŷp ≤ y_p ≤ ŷ_p + Δ_ŷp,

where Δ_ŷp = t_table·m_ŷp.
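All of the statistics above are easy to compute directly. Below is a minimal Python sketch for a simple paired sample; the x and y arrays are hypothetical illustration data, not the worked example that follows.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: factor x and result y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1, 6.3, 6.9])
n = len(x)

# OLS estimates from the normal equations
b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# Closeness of the relationship and quality of the fit
r = b * np.std(x) / np.std(y)                      # linear pair correlation coefficient
r2 = r**2                                          # coefficient of determination
A_bar = np.mean(np.abs((y - y_hat) / y)) * 100     # average approximation error, %

# F-test (m = 1 parameter attached to x)
F_fact = r2 / (1 - r2) * (n - 2)
F_table = stats.f.ppf(0.95, 1, n - 2)

# Random errors, t-statistics and confidence intervals of the parameters
s_resid = np.sqrt(np.sum((y - y_hat)**2) / (n - 2))
m_b = s_resid / np.sqrt(np.sum((x - x.mean())**2))
m_a = s_resid * np.sqrt(np.sum(x**2)) / (n * np.std(x))
t_b, t_a = b / m_b, a / m_a
t_table = stats.t.ppf(0.975, n - 2)
ci_a = (a - t_table * m_a, a + t_table * m_a)
ci_b = (b - t_table * m_b, b + t_table * m_b)

print(a, b, r, r2, A_bar, F_fact, F_table, t_a, t_b, ci_a, ci_b)
```

The same script applies to any paired data set once x and y are replaced with the actual observations.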

Solution Example

Task 1. For seven territories of the Ural region, the values of two attributes for 199X are known (Table 1): y is the share of expenditures on the purchase of food products in total expenditures, %, and x is the average daily wage of one worker, rub. (the y and x values are reproduced in the first columns of the calculation table below).
Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power (it is first necessary to linearize the variables by taking logarithms of both sides);
c) exponential;
d) equilateral hyperbola (you also need to work out how to linearize this model beforehand).
2. Evaluate each model through the mean approximation error and Fisher's F-test.

Solution (Option #1)

1a. To calculate the parameters a and b of the linear regression ŷ = a + bx (the calculation can be done with a calculator), we solve the system of normal equations with respect to a and b:

Σy = n·a + b·Σx,
Σy·x = a·Σx + b·Σx².

Based on the initial data, we calculate Σy, Σx, Σy·x, Σx², Σy²:
| № | y | x | yx | x² | y² | ŷx | y − ŷx | Ai, % |
| 1 | 68.8 | 45.1 | 3102.88 | 2034.01 | 4733.44 | 61.3 | 7.5 | 10.9 |
| 2 | 61.2 | 59.0 | 3610.80 | 3481.00 | 3745.44 | 56.5 | 4.7 | 7.7 |
| 3 | 59.9 | 57.2 | 3426.28 | 3271.84 | 3588.01 | 57.1 | 2.8 | 4.7 |
| 4 | 56.7 | 61.8 | 3504.06 | 3819.24 | 3214.89 | 55.5 | 1.2 | 2.1 |
| 5 | 55.0 | 58.8 | 3234.00 | 3457.44 | 3025.00 | 56.5 | -1.5 | 2.7 |
| 6 | 54.3 | 47.2 | 2562.96 | 2227.84 | 2948.49 | 60.5 | -6.2 | 11.4 |
| 7 | 49.3 | 55.2 | 2721.36 | 3047.04 | 2430.49 | 57.8 | -8.5 | 17.2 |
| Total | 405.2 | 384.3 | 22162.34 | 21338.41 | 23685.76 | 405.2 | 0.0 | 56.7 |
| Mean (Total/n) | 57.89 | 54.90 | 3166.05 | 3048.34 | 3383.68 | X | X | 8.1 |
| σ | 5.74 | 5.86 | X | X | X | X | X | X |
| σ² | 32.92 | 34.34 | X | X | X | X | X | X |


b = (mean(y·x) − ȳ·x̄) / (mean(x²) − (x̄)²) = (3166.05 − 57.89·54.9) / (3048.34 − 54.9²) ≈ −0.35;
a = ȳ − b·x̄ = 57.89 + 0.35·54.9 ≈ 76.88.

Regression equation: ŷ = 76.88 − 0.35x. With an increase in the average daily wage by 1 rub., the share of expenditures on the purchase of food products decreases by an average of 0.35 percentage points.
We calculate the linear coefficient of pair correlation:

r_xy = b·σ_x / σ_y = −0.35·5.86 / 5.74 ≈ −0.357.

The relationship is moderate and inverse.
Let us determine the coefficient of determination:

r_xy² = (−0.357)² ≈ 0.127.

The variation in the factor x explains 12.7% of the variation in the result. Substituting the actual values of x into the regression equation, we determine the theoretical (calculated) values ŷx. Let us find the value of the average approximation error Ā:

Ā = (1/n)·Σ| (y − ŷx) / y |·100% = 56.7 / 7 ≈ 8.1%.

On average, the calculated values ​​deviate from the actual ones by 8.1%.
Let us calculate the F-criterion:

F_fact = (r² / (1 − r²))·(n − 2) = (0.127 / 0.873)·5 ≈ 0.7.

Since the F-ratio is considered on the range 1 < F < ∞ (the larger variance is placed in the numerator), for F_fact < 1 its reciprocal 1/F_fact ≈ 1.4 should be considered; either way it is far below the tabulated value F_table (for α = 0.05, k1 = 1 and k2 = 5, F_table ≈ 6.6).
The obtained value indicates that the hypothesis H0 about the random nature of the revealed dependence must be accepted: the parameters of the equation and the indicator of the closeness of the relationship are statistically insignificant.
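As a quick check of this worked example, here is a short Python sketch using the y and x columns of the table above; the printed values should land close to the hand-computed results (a ≈ 76.9, b ≈ −0.35, r ≈ −0.35, Ā ≈ 8%, F ≈ 0.7), with small differences caused by rounding in the manual calculation.

```python
import numpy as np

y = np.array([68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3])
x = np.array([45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2])

b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

r = b * np.std(x) / np.std(y)                    # linear pair correlation
A_bar = np.mean(np.abs((y - y_hat) / y)) * 100   # average approximation error, %
F_fact = r**2 / (1 - r**2) * (len(x) - 2)        # F-criterion
print(round(a, 2), round(b, 3), round(r, 3), round(A_bar, 1), round(F_fact, 2))
```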
1b. The construction of the power model ŷ = a·x^b is preceded by the procedure of linearizing the variables. In this example, linearization is performed by taking the logarithm of both sides of the equation:

lg(y) = lg(a) + b·lg(x),
Y = C + b·X,


where Y = lg(y), X = lg(x), C = lg(a).

For the calculations, we use the data of Table 1.3.

Table 1.3

| № | Y | X | YX | Y² | X² | ŷx | y − ŷx | (y − ŷx)² | Ai, % |
| 1 | 1.8376 | 1.6542 | 3.0398 | 3.3768 | 2.7364 | 61.0 | 7.8 | 60.8 | 11.3 |
| 2 | 1.7868 | 1.7709 | 3.1642 | 3.1927 | 3.1361 | 56.3 | 4.9 | 24.0 | 8.0 |
| 3 | 1.7774 | 1.7574 | 3.1236 | 3.1592 | 3.0885 | 56.8 | 3.1 | 9.6 | 5.2 |
| 4 | 1.7536 | 1.7910 | 3.1407 | 3.0751 | 3.2077 | 55.5 | 1.2 | 1.4 | 2.1 |
| 5 | 1.7404 | 1.7694 | 3.0795 | 3.0290 | 3.1308 | 56.3 | -1.3 | 1.7 | 2.4 |
| 6 | 1.7348 | 1.6739 | 2.9039 | 3.0095 | 2.8019 | 60.2 | -5.9 | 34.8 | 10.9 |
| 7 | 1.6928 | 1.7419 | 2.9487 | 2.8656 | 3.0342 | 57.4 | -8.1 | 65.6 | 16.4 |
| Total | 12.3234 | 12.1587 | 21.4003 | 21.7078 | 21.1355 | 403.5 | 1.7 | 197.9 | 56.3 |
| Mean | 1.7605 | 1.7370 | 3.0572 | 3.1011 | 3.0194 | X | X | 28.27 | 8.0 |
| σ | 0.0425 | 0.0484 | X | X | X | X | X | X | X |
| σ² | 0.0018 | 0.0023 | X | X | X | X | X | X | X |

We calculate C and b by the same OLS formulas applied to the logarithms:

b = (mean(Y·X) − Ȳ·X̄) / (mean(X²) − (X̄)²),  C = Ȳ − b·X̄.


We obtain the linear equation Ŷ = C + b·X in the logarithms. Potentiating it, we obtain the power model in the usual form ŷ = a·x^b, where a = 10^C.

Substituting the actual values of x into this equation, we obtain the theoretical values of the result ŷx. From them we calculate the indicators of the closeness of the relationship (the correlation index) and the average approximation error:

ρ_xy = sqrt(1 − Σ(y − ŷx)² / Σ(y − ȳ)²) = sqrt(1 − 197.9 / 230.4) ≈ 0.38;  Ā = 56.3 / 7 ≈ 8.0%.

The characteristics of the power model indicate that it describes the relationship somewhat better than the linear function.

1c. The construction of the exponential curve equation ŷ = a·b^x is preceded by the procedure of linearizing the variables by taking the logarithm of both sides of the equation:

lg(y) = lg(a) + x·lg(b),  i.e.  Y = C + B·x,  where Y = lg(y), C = lg(a), B = lg(b).

For calculations, we use the table data.

| № | Y | x | Yx | Y² | x² | ŷx | y − ŷx | (y − ŷx)² | Ai, % |
| 1 | 1.8376 | 45.1 | 82.8758 | 3.3768 | 2034.01 | 60.7 | 8.1 | 65.61 | 11.8 |
| 2 | 1.7868 | 59.0 | 105.4212 | 3.1927 | 3481.00 | 56.4 | 4.8 | 23.04 | 7.8 |
| 3 | 1.7774 | 57.2 | 101.6673 | 3.1592 | 3271.84 | 56.9 | 3.0 | 9.00 | 5.0 |
| 4 | 1.7536 | 61.8 | 108.3725 | 3.0751 | 3819.24 | 55.5 | 1.2 | 1.44 | 2.1 |
| 5 | 1.7404 | 58.8 | 102.3355 | 3.0290 | 3457.44 | 56.4 | -1.4 | 1.96 | 2.5 |
| 6 | 1.7348 | 47.2 | 81.8826 | 3.0095 | 2227.84 | 60.0 | -5.7 | 32.49 | 10.5 |
| 7 | 1.6928 | 55.2 | 93.4426 | 2.8656 | 3047.04 | 57.5 | -8.2 | 67.24 | 16.6 |
| Total | 12.3234 | 384.3 | 675.9974 | 21.7078 | 21338.41 | 403.4 | -1.8 | 200.78 | 56.3 |
| Mean | 1.7605 | 54.9 | 96.5711 | 3.1011 | 3048.34 | X | X | 28.68 | 8.0 |
| σ | 0.0425 | 5.86 | X | X | X | X | X | X | X |
| σ² | 0.0018 | 34.339 | X | X | X | X | X | X | X |

The values of the regression parameters C and B are found from:

B = (mean(Y·x) − Ȳ·x̄) / (mean(x²) − (x̄)²),  C = Ȳ − B·x̄.


A linear equation Ŷ = C + B·x is obtained. Potentiating it, we write the model in the usual form ŷ = a·b^x, where a = 10^C and b = 10^B.

We estimate the closeness of the relationship through the correlation index and the average approximation error:

ρ_xy = sqrt(1 − Σ(y − ŷx)² / Σ(y − ȳ)²) = sqrt(1 − 200.78 / 230.4) ≈ 0.36;  Ā = 56.3 / 7 ≈ 8.0%.
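The same linearizations are easy to script. A sketch in Python with base-10 logarithms and OLS on the transformed variables, using the y and x values of this example; the printed correlation indices and approximation errors should land near the hand-computed values, with small rounding differences.

```python
import numpy as np

y = np.array([68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3])
x = np.array([45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2])

def ols(u, v):
    """Intercept and slope of v = c + m*u by ordinary least squares."""
    m = (np.mean(u * v) - u.mean() * v.mean()) / (np.mean(u**2) - u.mean()**2)
    return v.mean() - m * u.mean(), m

# Power model: lg(y) = lg(a) + b*lg(x)  ->  y_hat = a * x**b
C, b_pow = ols(np.log10(x), np.log10(y))
y_hat_pow = 10**C * x**b_pow

# Exponential model: lg(y) = lg(a) + x*lg(b)  ->  y_hat = a * b**x
C2, lg_b = ols(x, np.log10(y))
y_hat_exp = 10**C2 * (10**lg_b)**x

def correlation_index(y, y_hat):
    return np.sqrt(1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2))

for name, yh in (("power", y_hat_pow), ("exponential", y_hat_exp)):
    A_bar = np.mean(np.abs((y - yh) / y)) * 100
    print(name, round(correlation_index(y, yh), 3), round(A_bar, 1))
```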

The goal of regression analysis is to measure the relationship between a dependent variable and one (pairwise regression analysis) or more (multiple) independent variables. Independent variables are also called factorial, explanatory, determinant, regressors and predictors.

The dependent variable is sometimes referred to as the defined, explained, or "response" variable. The extremely wide distribution of regression analysis in empirical research is due not only to the fact that it is a convenient tool for testing hypotheses. Regression, especially multiple regression, is an effective modeling and forecasting technique.

Let's start explaining the principles of working with regression analysis with a simpler one - the pair method.

Pairwise regression analysis

The first steps when using regression analysis will be almost identical to those taken by us in the framework of calculating the correlation coefficient. The three main conditions for the effectiveness of correlation analysis using the Pearson method - the normal distribution of variables, the interval measurement of variables, the linear relationship between variables - are also relevant for multiple regression. Accordingly, at the first stage, scatterplots are constructed, a statistical and descriptive analysis of the variables is carried out, and a regression line is calculated. As in the framework of correlation analysis, regression lines are built using the least squares method.

To more clearly illustrate the differences between the two methods of data analysis, let's turn to the example already considered with the variables "SPS support" and "rural population share". The original data is identical. The difference in scatterplots will be that in the regression analysis it is correct to plot the dependent variable - in our case, “SPS support” along the Y axis, while in the correlation analysis it does not matter. After cleaning outliers, the scatterplot looks like:

The fundamental idea of ​​regression analysis is that, having a general trend for variables - in the form of a regression line - you can predict the value of the dependent variable, having the values ​​of the independent.

Let's imagine an ordinary mathematical linear function. Any straight line in Euclidean space can be described by the formula:

y = a + bx,

where a is a constant that specifies the offset along the y-axis (the intercept), and b is the coefficient that determines the slope of the line.

Knowing the slope and the constant, you can calculate (predict) the value of y for any x.

This simplest function formed the basis of the regression analysis model with the caveat that we will predict the value of y not exactly, but within a certain confidence interval, i.e. approximately.

The constant is the point of intersection of the regression line and the y-axis (the Y-intercept, usually labelled "intercept" in statistical packages). In our example of voting for the SPS, its rounded value is 10.55. The slope coefficient b is equal to approximately -0.1 (as in correlation analysis, the sign shows the type of relationship - direct or inverse). Thus, the resulting model looks like SPS = -0.1 × rural population share + 10.55.

For the Republic of Adygea, with a rural population share of 47%, the predicted value is: SPS = -0.10 × 47 + 10.55 ≈ 5.63 (the displayed coefficients are rounded; the prediction uses their unrounded values).

The difference between the observed and predicted values is called the residual (we have already encountered this term - fundamental for statistics - when analyzing contingency tables). So, for the case of the Republic of Adygea, the residual is 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the worse the value is predicted.

We calculate the predicted values ​​and residuals for all cases:
| Case | Rural pop., % | SPS support (observed) | SPS support (predicted) | Residual |
| Republic of Adygea | 47 | 3.92 | 5.63 | -1.71 |
| Altai Republic | 76 | 5.40 | 2.59 | 2.81 |
| Republic of Bashkortostan | 36 | 6.04 | 6.78 | -0.74 |
| The Republic of Buryatia | 41 | 8.36 | 6.25 | 2.11 |
| The Republic of Dagestan | 59 | 1.22 | 4.37 | -3.15 |
| The Republic of Ingushetia | 59 | 0.38 | 4.37 | -3.99 |
| Etc. | | | | |

The analysis of the relationship between observed and predicted values serves to assess the quality of the resulting model and its predictive ability. One of the main indicators of regression statistics is the multiple correlation coefficient R - the correlation coefficient between the observed and predicted values of the dependent variable. In paired regression analysis it is equal to the usual Pearson correlation coefficient between the dependent and independent variable, in our case 0.63. To interpret the multiple R meaningfully, it must be converted into the coefficient of determination. This is done in the same way as in correlation analysis - by squaring. The coefficient of determination R-square (R²) shows the proportion of variation in the dependent variable explained by the independent (explanatory) variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "share of rural population" explains about 40% of the variation in the variable "SPS support". The larger the value of the coefficient of determination, the higher the quality of the model.

Another measure of model quality is the standard error of estimate. This is a measure of how much the points are "scattered" around the regression line. The measure of dispersion for interval variables is the standard deviation. Accordingly, the standard error of the estimate is the standard deviation of the distribution of the residuals. The higher its value, the greater the spread and the worse the model. In our case, the standard error is 2.18. It is by this amount that our model will “err on average” when predicting the value of the variable “SPS support”.
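These quality measures can be computed directly from observed and predicted values. A sketch in Python, using only the six cases listed in the table above for illustration (the full dataset in the study is larger, so the output will not reproduce the 0.63 / 0.39 / 2.18 reported in the text):

```python
import numpy as np

# Six cases from the table above: rural population share and observed SPS support
rural = np.array([47.0, 76.0, 36.0, 41.0, 59.0, 59.0])
sps = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])

b = (np.mean(rural * sps) - rural.mean() * sps.mean()) / (np.mean(rural**2) - rural.mean()**2)
a = sps.mean() - b * rural.mean()
predicted = a + b * rural
residuals = sps - predicted

multiple_R = abs(np.corrcoef(sps, predicted)[0, 1])         # correlation of observed and predicted
R2 = multiple_R**2                                          # share of explained variation
std_error = np.sqrt(np.sum(residuals**2) / (len(sps) - 2))  # standard error of estimate
print(round(multiple_R, 3), round(R2, 3), round(std_error, 3))
```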

Regression statistics also include the analysis of variance. With its help we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable is accounted for by the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). Variance statistics are especially important for sample studies: they show how likely it is that a relationship between the independent and dependent variables also exists in the general population. However, the results are useful for complete-enumeration studies as well (as in our example). In this case we check whether the revealed statistical regularity is caused by a coincidence of random circumstances and how characteristic it is of the set of conditions in which the surveyed population finds itself; that is, what is established is not whether the obtained result is true for some more extensive general population, but the degree of its regularity and freedom from random influences.

In our case, the analysis of variance statistics is as follows:

| | SS | df | MS | F | Significance |
| Regression | 258.77 | 1 | 258.77 | 54.29 | 0.000000001 |
| Residual | 395.59 | 83 | 4.77 | | |
| Total | 654.36 | 84 | | | |

The F-ratio of 54.29 is significant at the 0.0000000001 level. Accordingly, we can safely reject the null hypothesis (that the relationship we found is random).

A similar function is performed by the t-criterion, but with respect to the regression coefficients (the slope and the Y-intercept). Using the t-criterion, we test the hypothesis that the regression coefficients in the general population are equal to zero. In our case, we can again confidently reject the null hypothesis.

Multiple regression analysis

The multiple regression model is almost identical to the pairwise regression model; the only difference is that several independent variables are sequentially included in the linear function:

Y = b1X1 + b2X2 + …+ bpXp + a.

If there are more than two independent variables, we are not able to get a visual representation of their relationship; in this regard, multiple regression is less "visible" than pair regression. When there are two independent variables, it can be useful to display the data in a 3D scatterplot. In professional statistical software packages (for example, Statistica) there is an option to rotate a three-dimensional chart, which allows a good visual representation of the data structure.

When working with multiple regression, unlike pair regression, it is necessary to determine the analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The step-by-step algorithm assumes sequential inclusion (exclusion) of independent variables, based on their explanatory "weight". The stepwise method is good when there are many independent variables; it "cleanses" the model of frankly weak predictors, making it more compact and concise.

An additional condition for the correctness of multiple regression (along with interval, normality and linearity) is the absence of multicollinearity - the presence of strong correlations between independent variables.
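A quick way to screen for multicollinearity is to inspect the correlation matrix of the predictors and their variance inflation factors (VIF). A sketch with hypothetical predictor data (a common rule of thumb treats VIF values far above 10 as a warning sign):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(50, 10, 100)
x2 = 0.9 * x1 + rng.normal(0, 3, 100)   # deliberately correlated with x1
x3 = rng.normal(20, 5, 100)
X = np.column_stack([x1, x2, x3])

print(np.round(np.corrcoef(X, rowvar=False), 2))   # pairwise correlations of the predictors

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])
```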

The interpretation of multiple regression statistics includes all the elements that we have considered for the case of pairwise regression. In addition, there are other important components in the statistics of multiple regression analysis.

We will illustrate the work with multiple regression on the example of testing hypotheses that explain the differences in the level of electoral activity in the regions of Russia. Specific empirical studies have suggested that voter turnout is affected by:

National factor (variable "Russian population"; operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the proportion of the Russian population leads to a decrease in voter turnout;

Urbanization factor (variable "urban population"; operationalized as the share of the urban population in the constituent entities of the Russian Federation, we have already worked with this factor as part of the correlation analysis). It is assumed that an increase in the proportion of the urban population also leads to a decrease in voter turnout.

The dependent variable - "intensity of electoral activity" ("active") is operationalized through the average turnout data by regions in the federal elections from 1995 to 2003. The initial data table for two independent and one dependent variable will have the following form:

| Case | Turnout | Urban pop., % | Russian pop., % |
| Republic of Adygea | 64.92 | 53 | 68 |
| Altai Republic | 68.60 | 24 | 60 |
| The Republic of Buryatia | 60.75 | 59 | 70 |
| The Republic of Dagestan | 79.92 | 41 | 9 |
| The Republic of Ingushetia | 75.05 | 41 | 23 |
| Republic of Kalmykia | 68.52 | 39 | 37 |
| Karachay-Cherkess Republic | 66.68 | 44 | 42 |
| Republic of Karelia | 61.70 | 73 | 73 |
| Komi Republic | 59.60 | 74 | 57 |
| Mari El Republic | 65.19 | 62 | 47 |

Etc. (after removal of outliers, 83 cases out of 88 remain)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-square = 0.38. Therefore, the national factor and the urbanization factor together explain about 38% of the variation in the variable "electoral activity".

2. The standard error of estimate is 3.38. This is by how much, "on average", the constructed model errs when predicting the turnout level.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis about the randomness of the revealed relationships is rejected.

4. The t-criterion for the constant and for the regression coefficients of the variables "urban population" and "Russian population" is significant at the 0.0000001, 0.00005 and 0.007 levels respectively. The null hypothesis about the randomness of the coefficients is rejected.

Additional useful statistics in the analysis of the ratio of the initial and predicted values ​​of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of the case (shows how much the combination of values ​​of all independent variables for a given case deviates from the average value for all independent variables simultaneously). The second is a measure of the influence of the case. Different observations affect the slope of the regression line in different ways, and using the Cook's distance, you can compare them according to this indicator. This is useful when cleaning up outliers (an outlier can be thought of as an overly influential case).
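Both measures are available in standard statistics libraries. A sketch using statsmodels on simulated data (the data are hypothetical, generated around the coefficients reported later in the text, so only the structure of the calculation is illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
urban = rng.uniform(20, 80, 80)
russian = rng.uniform(5, 95, 80)
turnout = 76 - 0.1 * urban - 0.06 * russian + rng.normal(0, 3, 80)

X = sm.add_constant(np.column_stack([urban, russian]))
fit = sm.OLS(turnout, X).fit()

# Cook's distance: influence of each case on the fitted regression
cooks_d = fit.get_influence().cooks_distance[0]

# Mahalanobis distance of each case in the predictor space
Z = np.column_stack([urban, russian])
diff = Z - Z.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(Z, rowvar=False))
mahalanobis = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

# Indices of the most influential and the most unusual cases
print(np.argsort(cooks_d)[-3:], np.argsort(mahalanobis)[-3:])
```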

In our example, Dagestan is one of the unique and influential cases.

| Case | Observed value | Predicted value | Residual | Mahalanobis distance | Cook's distance |
| Adygea | 64.92 | 66.33 | -1.40 | 0.69 | 0.00 |
| Altai Republic | 68.60 | 69.91 | -1.31 | 6.80 | 0.01 |
| The Republic of Buryatia | 60.75 | 65.56 | -4.81 | 0.23 | 0.01 |
| The Republic of Dagestan | 79.92 | 71.01 | 8.91 | 10.57 | 0.44 |
| The Republic of Ingushetia | 75.05 | 70.21 | 4.84 | 6.73 | 0.08 |
| Republic of Kalmykia | 68.52 | 69.59 | -1.07 | 4.20 | 0.00 |

The actual regression model has the following parameters: Y-intercept (constant) = 75.99; b (urban population) = -0.1; b (Russian population) = -0.06. The final formula is: turnout = 75.99 - 0.1 × urban population share - 0.06 × Russian population share.

The main feature of regression analysis is that it can be used to obtain specific information about the form and nature of the relationship between the variables under study.

The sequence of stages of regression analysis

Let us briefly consider the stages of regression analysis.

    Task formulation. At this stage, preliminary hypotheses about the dependence of the studied phenomena are formed.

    Definition of dependent and independent (explanatory) variables.

    Collection of statistical data. Data must be collected for each of the variables included in the regression model.

    Formulation of a hypothesis about the form of connection (simple or multiple, linear or non-linear).

    Determination of the regression function (consists in calculating the numerical values of the parameters of the regression equation).

    Evaluation of the accuracy of regression analysis.

    Interpretation of the obtained results. The results of the regression analysis are compared with preliminary hypotheses. The correctness and plausibility of the obtained results are evaluated.

    Prediction of unknown values ​​of the dependent variable.

With the help of regression analysis, it is possible to solve the problem of forecasting and classification. Predictive values ​​are calculated by substituting the values ​​of the explanatory variables into the regression equation. The classification problem is solved in this way: the regression line divides the entire set of objects into two classes, and the part of the set where the value of the function is greater than zero belongs to one class, and the part where it is less than zero belongs to another class.

Tasks of regression analysis

Consider the main tasks of regression analysis: establishing the form of dependence, determining regression functions, an estimate of the unknown values ​​of the dependent variable.

Establishing the form of dependence.

The nature and form of the relationship between variables can form the following types of regression:

    positive linear regression (expressed as a uniform growth of the function);

    positive uniformly accelerating regression;

    positive uniformly decelerating increasing regression;

    negative linear regression (expressed as a uniform drop in function);

    negative uniformly accelerated decreasing regression;

    negative uniformly decelerating decreasing regression.

However, the varieties described are usually not found in pure form, but in combination with each other. In this case, one speaks of combined forms of regression.

Definition of the regression function.

The second task is to establish the effect of the main factors or causes on the dependent variable, all other things being equal and with the impact of random elements on the dependent variable excluded. The regression function is defined as a mathematical equation of one type or another.

Estimation of unknown values ​​of the dependent variable.

The solution of this problem is reduced to solving a problem of one of the following types:

    Estimation of the values ​​of the dependent variable within the considered interval of the initial data, i.e. missing values; this solves the problem of interpolation.

    Estimating the future values ​​of the dependent variable, i.e. finding values ​​outside the given interval of the initial data; this solves the problem of extrapolation.

Both problems are solved by substituting the found parameter estimates and the values of the independent variables into the regression equation. The result of solving the equation is an estimate of the value of the target (dependent) variable.

Let's look at some of the assumptions that regression analysis relies on.

Linearity assumption, i.e. it is assumed that the relationship between the variables under consideration is linear. So, in this example, we built a scatterplot and were able to see a clear linear relationship. If, on the scatterplot of variables, we see a clear absence of a linear relationship, i.e. there is a non-linear relationship, non-linear methods of analysis should be used.

Assumption of normality of the residuals. It assumes that the distribution of the differences between the predicted and observed values is normal. To visually assess the nature of the distribution, you can use histograms of the residuals.

When using regression analysis, one should take into account its main limitation. It consists in the fact that regression analysis allows you to detect only dependencies, and not the relationships that underlie these dependencies.

Regression analysis makes it possible to assess the degree of association between variables by calculating the expected value of a variable based on several known values.

Regression equation.

The regression equation looks like this: Y=a+b*X

Using this equation, the variable Y is expressed in terms of the constant a and the slope of the line (or slope) b multiplied by the value of the variable X. The constant a is also called the intercept, and the slope is the regression coefficient or B-factor.

In most cases (if not always) there is a certain scatter of observations about the regression line.

A residual is the deviation of an individual point (observation) from the regression line (the predicted value).

To solve a regression analysis problem in MS Excel, select Tools -> Data Analysis from the menu and choose the Regression analysis tool. Specify the Y and X input ranges. The Y input range is the range of the dependent data being analyzed and must consist of one column. The X input range is the range of the independent data to be analyzed. The number of input ranges must not exceed 16.

At the output of the procedure, in the output range, we get the report given in tables 8.3a-8.3c.

RESULTS

Table 8.3a. Regression statistics

Regression statistics

Multiple R

R-square

Normalized R-square

standard error

Observations

First, consider the upper part of the calculations presented in table 8.3a, - regression statistics.

The R-square value, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the original data and the regression model (the calculated data). The measure of certainty always lies within the interval [0; 1].

In most cases, the value R-square is between these values, called extreme, i.e. between zero and one.

If the value R-square close to unity, this means that the constructed model explains almost all the variability of the corresponding variables. Conversely, the value R-square, close to zero, means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R - the multiple correlation coefficient R - expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination; it takes values in the range from zero to one.

In simple linear regression analysis, multiple R is equal to the Pearson correlation coefficient. Indeed, multiple R in our case equals the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients

Coefficients

standard error

t-statistic

Y-intercept

Variable X 1

* A truncated version of the calculations is given

Now consider the middle part of the calculations presented in table 8.3b. Here, the regression coefficient b (2.305454545) and the offset along the y-axis are given, i.e. constant a (2.694545455).

Based on the calculations, we can write the regression equation as follows:

Y= x*2.305454545+2.694545455

The direction of the relationship between the variables is determined based on the signs (negative or positive) of the regression coefficients (coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. In our case, the sign of the regression coefficient is positive, therefore, the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the residuals output. In order for these results to appear in the report, you must activate the "Residuals" checkbox when launching the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation

Predicted Y

Residuals

Standard residuals

Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest absolute value of the residual in our case is 0.778, the smallest is 0.043. To interpret these data better, we use the graph of the original data and the constructed regression line presented in Fig. 8.3. As you can see, the regression line is "fitted" quite accurately to the values of the original data.

It should be taken into account that the example under consideration is quite simple and it is far from always possible to qualitatively construct a linear regression line.

Fig. 8.3. Initial data and regression line

The problem of estimating unknown future values ​​of the dependent variable based on the known values ​​of the independent variable remained unconsidered, i.e. forecasting task.

Having a regression equation, the forecasting problem is reduced to solving the equation Y= x*2.305454545+2.694545455 with known values ​​of x. The results of predicting the dependent variable Y six steps ahead are presented in table 8.4.

Table 8.4. Y variable prediction results

Y(predicted)
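The same forecasting step can be scripted once the coefficients are known. A sketch in Python (the coefficients are those of the Excel model above; the future x values are hypothetical placeholders, since Table 8.4 lists only the predicted Y):

```python
# Coefficients taken from the fitted model: Y = 2.305454545 * x + 2.694545455
a, b = 2.694545455, 2.305454545

def predict(x):
    return a + b * x

future_x = [11, 12, 13, 14, 15, 16]   # hypothetical values of x, six steps ahead
print([round(predict(x), 3) for x in future_x])
```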

Thus, as a result of using regression analysis in the Microsoft Excel package, we:

    built a regression equation;

    established the form of dependence and the direction of the relationship between the variables - a positive linear regression, which is expressed in a uniform growth of the function;

    established the direction of the relationship between the variables;

    assessed the quality of the resulting regression line;

    were able to see the deviations of the calculated data from the data of the original set;

    predicted the future values ​​of the dependent variable.

If a regression function is defined, interpreted and justified, and the assessment of the accuracy of the regression analysis meets the requirements, we can assume that the constructed model and predictive values ​​are sufficiently reliable.

The predicted values ​​obtained in this way are the average values ​​that can be expected.

In this paper, we reviewed the main characteristics of descriptive statistics, among them such concepts as the mean, median, maximum, minimum and other characteristics of data variation.

There was also a brief discussion of the concept of outliers. The characteristics considered relate to so-called exploratory data analysis; its conclusions may apply not to the general population but only to a data sample. Exploratory data analysis is used to draw preliminary conclusions and to form hypotheses about the population.

The basics of correlation and regression analysis, their tasks and possibilities of practical use were also considered.

The concepts of correlation and regression are directly related. There are many common computational techniques in correlation and regression analysis. They are used to identify cause-and-effect relationships between phenomena and processes. However, while correlation analysis makes it possible to assess the strength and direction of a stochastic relationship, regression analysis also assesses the form of the dependence.

Regression can be:

a) depending on the number of phenomena (variables):

Simple (regression between two variables);

Multiple (regression between the dependent variable (y) and several variables explaining it (x1, x2 ... xn);

b) depending on the form:

Linear (displayed as a linear function, and there are linear relationships between the variables under study);

Non-linear (displayed as a non-linear function, the relationship between the variables under study is non-linear);

c) by the nature of the relationship between the variables included in the consideration:

Positive (an increase in the value of the explanatory variable leads to an increase in the value of the dependent variable and vice versa);

Negative (with an increase in the value of the explanatory variable, the value of the explained variable decreases);

d) by type:

Direct (the cause has a direct effect on the consequence, i.e. the dependent and explanatory variables are directly related to each other);

Indirect (the explanatory variable has an indirect effect through a third or a number of other variables on the dependent variable);

False (nonsense regression) - can arise with a superficial and formal approach to the processes and phenomena under study. An example of nonsense is a regression that establishes a relationship between a decrease in the amount of alcohol consumed in our country and a decrease in the sale of washing powder.

When conducting regression analysis, the following main tasks are solved:

1. Determination of the form of dependence.

2. Definition of the regression function. For this, a mathematical equation of one type or another is used, which allows, firstly, to establish a general trend in the change of the dependent variable, and, secondly, to calculate the effect of the explanatory variable (or several variables) on the dependent variable.

3. Estimation of unknown values ​​of the dependent variable. The resulting mathematical dependence (regression equation) allows you to determine the value of the dependent variable both within the range of given values ​​of the explanatory variables and beyond. In the latter case, regression analysis acts as a useful tool in predicting changes in socio-economic processes and phenomena (provided that existing trends and relationships are preserved). Usually, the length of the time interval for which forecasting is carried out is chosen to be no more than half the time interval over which the observations of the initial indicators were made. It is possible to carry out both a passive forecast, solving the extrapolation problem, and an active one, reasoning according to the well-known "if ... then" scheme and substituting different values ​​into one or more explanatory regression variables.



To build the regression, a special method is used - the method of least squares. This method has advantages over other smoothing methods: a relatively simple mathematical determination of the required parameters and a good theoretical justification from a probabilistic point of view.

When choosing a regression model, one of the essential requirements is the greatest possible simplicity that still allows a solution of sufficient accuracy. Therefore, to establish statistical relationships, a model from the class of linear functions (as the simplest of all possible classes of functions) is usually considered first:

y_i = a + b1·x_i1 + b2·x_i2 + ... + bn·x_in + e_i,  i = 1, ..., N,

where b1, b2, ..., bn are coefficients that determine the influence of the independent variables x_ij on the value y_i; a is the free term; e_i is a random deviation that reflects the influence of unaccounted-for factors on the dependent variable; n is the number of independent variables; N is the number of observations, and the condition N > n + 1 must be satisfied.

A linear model can describe a very wide class of different problems. However, in practice, in particular in socio-economic systems, it is sometimes difficult to use linear models because of large approximation errors. Therefore, non-linear multiple regression functions that allow linearization are often used. Among them, for example, is the production function (the Cobb-Douglas power function), which has found application in various socio-economic studies. It has the form:

y_i = b0 · x_i1^b1 · x_i2^b2 · ... · x_ij^bj · e_i,

where b0 is a normalization factor, b1, ..., bj are unknown coefficients, and e_i is a random deviation.

Taking natural logarithms, we can convert this equation into linear form:

ln y_i = ln b0 + b1·ln x_i1 + b2·ln x_i2 + ... + bj·ln x_ij + ln e_i.

The resulting model allows you to use the standard linear regression procedures described above. Having built models of two types (additive and multiplicative), one can choose the best ones and conduct further studies with smaller approximation errors.
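A sketch of fitting such a multiplicative model by log-linearization (hypothetical two-factor data; the coefficients are recovered by ordinary least squares on the logarithms):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
y = 2.0 * x1**0.4 * x2**0.5 * np.exp(rng.normal(0, 0.05, n))  # multiplicative model with noise

# Log-linearization: ln y = ln b0 + b1*ln x1 + b2*ln x2
A = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
b0, b1, b2 = np.exp(coef[0]), coef[1], coef[2]
print(round(b0, 2), round(b1, 2), round(b2, 2))   # should recover values close to 2.0, 0.4, 0.5
```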

There is a well-developed system for selecting approximating functions - the group method of data handling (GMDH).

The correctness of the chosen model can be judged by the results of studying the residuals, i.e. the differences between the observed values y_i and the corresponding values ŷ_i predicted by the regression equation. In this case, to check the adequacy of the model, the average approximation error is calculated:

ε = (1/N)·Σ| (y_i − ŷ_i) / y_i |·100%.

The model is considered adequate if ε is within 15% or less.

We emphasize in particular that in relation to socio-economic systems, the basic conditions for the adequacy of the classical regression model are by no means always met.

Without dwelling on all the causes of the resulting inadequacy, we will name only multicollinearity - the most difficult problem for the effective application of regression analysis procedures in the study of statistical dependences. Multicollinearity is understood as the presence of a linear relationship between the explanatory variables.

This phenomenon:

a) distorts the meaning of the regression coefficients in their meaningful interpretation;

b) reduces the accuracy of estimation (the variance of estimates increases);

c) enhances the sensitivity of coefficient estimates to sample data (an increase in the sample size can greatly affect the values ​​of the estimates).

There are various techniques to reduce multicollinearity. The most accessible way is to eliminate one of the two variables if the correlation coefficient between them exceeds a value equal in absolute value to 0.8. Which of the variables to keep is decided based on meaningful considerations. Then the regression coefficients are calculated again.

Using the stepwise regression algorithm allows you to consistently include one independent variable in the model and analyze the significance of the regression coefficients and the multicollinearity of the variables. Finally, only those variables remain in the studied dependence that provide the necessary significance of the regression coefficients and the minimum effect of multicollinearity.
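A minimal sketch of such a stepwise (forward-selection) procedure, adding at each step the predictor that most improves adjusted R-square (hypothetical data; real implementations also check coefficient significance and multicollinearity before accepting a variable):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 150
X = rng.normal(size=(n, 5))                                  # five candidate predictors
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(0, 1, n)    # only predictors 0 and 2 matter

selected, remaining, best_adj_r2 = [], list(range(X.shape[1])), -np.inf
while remaining:
    scores = []
    for j in remaining:
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        scores.append((fit.rsquared_adj, j))
    adj_r2, j_best = max(scores)
    if adj_r2 <= best_adj_r2:          # no improvement - stop
        break
    best_adj_r2 = adj_r2
    selected.append(j_best)
    remaining.remove(j_best)

print(selected, round(best_adj_r2, 3))  # expected to keep predictors 0 and 2
```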

In the previous notes, the focus has often been on a single numerical variable, such as mutual fund returns, Web page load time, or soft drink consumption. In this and the following notes, we will consider methods for predicting the values ​​of a numeric variable depending on the values ​​of one or more other numeric variables.

The material will be illustrated with a through example. Forecasting sales volume in a clothing store. The Sunflowers chain of discount clothing stores has been constantly expanding for 25 years. However, the company does not currently have a systematic approach to selecting new outlets. The location where the company intends to open a new store is determined based on subjective considerations. The selection criteria are favorable rental conditions or the manager's idea of ​​the ideal location of the store. Imagine that you are the head of the Special Projects and Planning Department. You have been tasked with developing a strategic plan for opening new stores. This plan should contain a forecast of annual sales in newly opened stores. You believe that selling space is directly related to revenue and want to factor that fact into your decision making process. How do you develop a statistical model that predicts annual sales based on new store size?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that predicts the values of the dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note, we will consider simple linear regression - a statistical method that allows you to predict the values of the dependent variable Y from the values of the independent variable X. The following notes will describe the multiple regression model, designed to predict the values of the dependent variable Y from the values of several independent variables (X1, X2, ..., Xk).


Types of regression models

where ρ1 is the first-order autocorrelation coefficient; if ρ1 = 0 (no autocorrelation), D ≈ 2; if ρ1 ≈ 1 (positive autocorrelation), D ≈ 0; if ρ1 = -1 (negative autocorrelation), D ≈ 4.

In practice, applying the Durbin-Watson criterion is based on comparing the value D with the theoretical critical values d_L and d_U for a given number of observations n, the number of independent variables of the model k (for simple linear regression k = 1) and significance level α. If D < d_L, the hypothesis of independence of the random deviations is rejected (hence there is positive autocorrelation); if D > d_U, the hypothesis is not rejected (i.e., there is no autocorrelation); if d_L < D < d_U, there is not enough reason to make a decision. When the calculated value D exceeds 2, then it is not the coefficient D itself but the expression (4 - D) that is compared with d_L and d_U.

To calculate the Durbin-Watson statistic in Excel, we turn to the bottom table in Fig. 14, Residual Output. The numerator in expression (10) is calculated using the function =SUMXMY2(array1; array2), and the denominator with =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic

In our example, D = 0.883. The main question is: what value of the Durbin-Watson statistic should be considered small enough to conclude that positive autocorrelation is present? It is necessary to compare the value of D with the critical values (d_L and d_U), which depend on the number of observations n and the significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (table fragment)

Thus, in the problem of the sales volume of a store delivering goods to the home, there is one independent variable (k = 1), 15 observations (n = 15) and a significance level α = 0.05. Hence d_L = 1.08 and d_U = 1.36. Since D = 0.883 < d_L = 1.08, there is positive autocorrelation between the residuals and the least squares method cannot be applied.
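Computing the same statistic outside Excel is straightforward. A sketch in Python (the residuals array is a hypothetical stand-in for a model's residual output):

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e)**2) / np.sum(e**2)

# Hypothetical residuals with a positively autocorrelated pattern
resid = [1.2, 1.0, 0.7, 0.3, -0.1, -0.5, -0.8, -0.9, -0.6, -0.2, 0.2, 0.5, 0.8, 0.9, 0.6]
print(round(durbin_watson(resid), 3))   # values well below 2 suggest positive autocorrelation
```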

Testing Hypotheses about Slope and Correlation Coefficient

The regression above was applied solely for forecasting. To determine the regression coefficients and to predict the value of the variable Y for a given value of the variable X, the method of least squares was used. In addition, we considered the standard error of the estimate and the coefficient of determination (r²). If the analysis of the residuals confirms that the applicability conditions of the least squares method are not violated and the simple linear regression model is adequate, then, on the basis of the sample data, it can be argued that there is a linear relationship between the variables in the general population.

Application of the t-criterion for the slope. By checking whether the slope of the general population β1 is equal to zero, one can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that there is a linear relationship between the variables X and Y. The null and alternative hypotheses are formulated as follows: H0: β1 = 0 (no linear relationship), H1: β1 ≠ 0 (there is a linear relationship). By definition, the t-statistic is equal to the difference between the sample slope and the hypothetical slope of the general population, divided by the standard error of the slope estimate:

(11) t = (b1 - β1) / S_b1,

where b1 is the slope of the regression line based on the sample data, β1 is the hypothesized slope of the line for the general population, S_b1 is the standard error of the slope estimate, and the test statistic t has a t-distribution with n - 2 degrees of freedom.

Let us check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-criterion is displayed along with other parameters when using the Analysis ToolPak (Regression option). The full results of the Analysis ToolPak are shown in Fig. 4; the fragment relating to the t-statistic is in Fig. 18.

Fig. 18. Results of applying the t-criterion

Since the number of stores is n = 14 (see Fig. 3), the critical values of the t-statistic at significance level α = 0.05 can be found by the formulas: t_L = T.INV(0.025; 12) = -2.1788, where 0.025 is half the significance level and 12 = n - 2; t_U = T.INV(0.975; 12) = +2.1788.

Since the t-statistic = 10.64 > t_U = 2.1788 (Fig. 19), the null hypothesis H0 is rejected. On the other hand, the p-value for t = 10.6411, calculated by the formula =1-T.DIST(D3; 12; TRUE), is approximately equal to zero, so the hypothesis H0 is rejected again. The fact that the p-value is almost zero means that if there were no real linear relationship between store size and annual sales, it would be almost impossible to detect it using linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.

Fig. 19. Testing the hypothesis about the slope of the general population at a significance level of 0.05 and 12 degrees of freedom

Application of the F-criterion for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-criterion. Recall that the F-criterion is used to test the ratio of two variances (see details). When testing the hypothesis about the slope, the measure of random errors is the error variance (the sum of squared errors divided by the number of degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (i.e., SSR divided by the number of independent variables k) to the error variance (MSE = S²_YX).

By definition, the F-statistic is equal to the mean square due to regression (MSR) divided by the error variance (MSE): F = MSR / MSE, where MSR = SSR / k, MSE = SSE / (n - k - 1), and k is the number of independent variables in the regression model. The test statistic F has an F-distribution with k and n - k - 1 degrees of freedom.

For a given significance level α, the decision rule is formulated as follows: if F > FU, the null hypothesis is rejected; otherwise, it is not rejected. The results, presented in the form of a summary table of the analysis of variance, are shown in fig. 20.

Fig. 20. Analysis-of-variance table for testing the hypothesis of the statistical significance of the regression coefficient

Like the t-criterion, the F-criterion is displayed in the table produced by the Analysis ToolPak (Regression option). The full results of the Analysis ToolPak are shown in Fig. 4; the fragment relating to the F-statistic is in Fig. 21.

Fig. 21. Results of applying the F-criterion, obtained using the Excel Analysis ToolPak

The F-statistic equals 113.23 and the p-value is close to zero (the Significance F cell). At significance level α = 0.05, the critical value of the F-distribution with one and 12 degrees of freedom can be obtained from the formula F_U = F.INV(1-0.05; 1; 12) = 4.7472 (Fig. 22). Since F = 113.23 > F_U = 4.7472 and the p-value, close to 0, is < 0.05, the null hypothesis H0 is rejected, i.e. the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the hypothesis about the slope of the general population at a significance level of 0.05, with one and 12 degrees of freedom

Confidence interval containing the slope β1. To test the hypothesis about the existence of a linear relationship between the variables, you can build a confidence interval containing the slope β1 and check whether the hypothetical value β1 = 0 belongs to this interval. The center of the confidence interval containing the slope β1 is the sample slope b1, and its boundaries are the quantities b1 ± t_(n-2)·S_b1.

As shown in Fig. 18, b1 = +1.670, n = 14, S_b1 = 0.157. t_12 = T.INV(0.975; 12) = 2.1788. Hence b1 ± t_(n-2)·S_b1 = +1.670 ± 2.1788 × 0.157 = +1.670 ± 0.342, or +1.328 ≤ β1 ≤ +2.012. Thus, with probability 0.95 the slope of the general population lies in the range from +1.328 to +2.012 (i.e., from $1,328,000 to $2,012,000). Since these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area. If the confidence interval contained zero, there would be no relationship between the variables. In addition, the confidence interval means that each additional 1,000 sq. ft results in an increase in average sales of from $1,328,000 to $2,012,000.
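The same slope inference can be reproduced from the summary numbers quoted above. A sketch in Python (b1, S_b1 and n are taken from the text; everything else follows the formulas):

```python
from scipy import stats

b1, s_b1, n = 1.670, 0.157, 14   # sample slope, its standard error, sample size
df = n - 2

t_stat = b1 / s_b1                               # H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df)        # two-tailed p-value
t_crit = stats.t.ppf(0.975, df)
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)    # 95% confidence interval for beta1

print(round(t_stat, 2), p_value, tuple(round(v, 3) for v in ci))
# roughly 10.64, a p-value near zero, and an interval near (1.328, 2.012)
```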

Application of the t-criterion for the correlation coefficient. Earlier, the correlation coefficient r was introduced as a measure of the relationship between two numeric variables. It can be used to determine whether there is a statistically significant relationship between two variables. Let us denote the correlation coefficient in the general population by the symbol ρ. The null and alternative hypotheses are formulated as follows: H0: ρ = 0 (no correlation), H1: ρ ≠ 0 (there is a correlation). The test for the existence of a correlation uses the statistic

t = r·sqrt(n - 2) / sqrt(1 - r²),

where r = +sqrt(r²) if b1 > 0 and r = -sqrt(r²) if b1 < 0. The test statistic t has a t-distribution with n - 2 degrees of freedom.

In the problem of the Sunflowers store chain, r² = 0.904 and b1 = +1.670 (see Fig. 4). Since b1 > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let us test the null hypothesis that there is no correlation between these variables using the t-statistic:

t = 0.951·sqrt(14 - 2) / sqrt(1 - 0.904) ≈ 10.64.

At a significance level of α = 0.05, the null hypothesis should be rejected because t= 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.

When discussing inferences about population slopes, confidence intervals and criteria for testing hypotheses are interchangeable tools. However, the calculation of the confidence interval containing the correlation coefficient turns out to be more difficult, since the form of the sampling distribution of the statistic r depends on the true correlation coefficient.

Estimation of mathematical expectation and prediction of individual values

This section discusses methods for estimating the expected response Y and predictions of individual values Y for given values ​​of the variable X.

Construction of a confidence interval. In example 2 (see the section Least squares method above), the regression equation made it possible to predict the value of the variable Y from a given value of the variable X. In the problem of choosing a location for a retail outlet, the average annual sales in a store with an area of 4,000 sq. ft were estimated at 7.644 million dollars. However, this estimate of the mathematical expectation of the general population is a point estimate. Earlier, to estimate the mathematical expectation of the general population, the concept of a confidence interval was proposed. Similarly, one can introduce the concept of a confidence interval for the mathematical expectation of the response for a given value of the variable X:

(13) Ŷ_i ± t_(n-2)·S_YX·sqrt(h_i),  where h_i = 1/n + (X_i - X̄)² / SSX,

where Ŷ_i = b0 + b1·X_i is the predicted value of the variable Y at X = X_i, S_YX is the root mean square error, n is the sample size, X_i is the given value of the variable X, µ_(Y|X=X_i) is the mathematical expectation of the variable Y at X = X_i, and SSX = Σ(X_i - X̄)².

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. At a given significance level, an increase in the amplitude of fluctuations around the regression line, measured by the root mean square error, leads to an increase in the width of the interval. On the other hand, as expected, an increase in the sample size is accompanied by a narrowing of the interval. In addition, the width of the interval changes depending on the value of X_i. If the value of Y is predicted for values of X close to the mean X̄, the confidence interval turns out to be narrower than when predicting the response for values far from the mean.

Let us say that, when choosing a location for a store, we want to build a 95% confidence interval for the average annual sales of all stores with an area of 4,000 sq. ft:

Therefore, the average annual sales volume of all stores with an area of 4,000 sq. ft lies, with 95% probability, in the range from 6.971 to 8.317 million dollars.

Computing the confidence interval for the predicted value. In addition to the confidence interval for the mathematical expectation of the response for a given value of the variable X, it is often necessary to know the confidence interval for the predicted value itself. Although the formula for such a confidence interval is very similar to formula (13), this interval contains the predicted value rather than an estimate of a parameter. The interval for the predicted response Y at X = X_i, for a specific value of the variable X_i, is determined by the formula:

Ŷ_i ± t_(n-2)·S_YX·sqrt(1 + h_i),  where h_i = 1/n + (X_i - X̄)² / SSX.
Let us assume that, when choosing a location for a retail outlet, we want to build a 95% confidence interval for the predicted annual sales volume of a store with an area of 4,000 sq. ft:

Therefore, the predicted annual sales volume of a store with an area of 4,000 sq. ft lies, with 95% probability, in the range from 5.433 to 9.854 million dollars. As you can see, the confidence interval for the predicted value is much wider than the confidence interval for its mathematical expectation. This is because the variability in predicting individual values is much greater than in estimating the mathematical expectation.
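A sketch of both intervals in Python, following formulas of the form Ŷ ± t·S_YX·sqrt(h) and Ŷ ± t·S_YX·sqrt(1 + h). The store data below are hypothetical; only the structure of the calculation mirrors the example:

```python
import numpy as np
from scipy import stats

# Hypothetical stores: area in thousands of sq. ft and annual sales in $ millions
area = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0])
sales = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1])
n = len(area)

b1 = (np.mean(area * sales) - area.mean() * sales.mean()) / (np.mean(area**2) - area.mean()**2)
b0 = sales.mean() - b1 * area.mean()
resid = sales - (b0 + b1 * area)
s_yx = np.sqrt(np.sum(resid**2) / (n - 2))
ssx = np.sum((area - area.mean())**2)
t_crit = stats.t.ppf(0.975, n - 2)

x_new = 4.0                                   # a store of 4,000 sq. ft
y_hat = b0 + b1 * x_new
h = 1 / n + (x_new - area.mean())**2 / ssx

ci_mean = (y_hat - t_crit * s_yx * np.sqrt(h), y_hat + t_crit * s_yx * np.sqrt(h))
pi_value = (y_hat - t_crit * s_yx * np.sqrt(1 + h), y_hat + t_crit * s_yx * np.sqrt(1 + h))
print(round(y_hat, 3), np.round(ci_mean, 3), np.round(pi_value, 3))
```

The prediction interval printed last is always the wider of the two, exactly as the text explains.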

Pitfalls and ethical issues associated with the use of regression

Difficulties associated with regression analysis:

  • Ignoring the conditions of applicability of the method of least squares.
  • An erroneous estimate of the conditions for applicability of the method of least squares.
  • Wrong choice of alternative methods in violation of the conditions of applicability of the least squares method.
  • Application of regression analysis without in-depth knowledge of the subject of study.
  • Extrapolation of the regression beyond the range of the explanatory variable.
  • Confusion between statistical and causal relationships.

The widespread use of spreadsheets and statistical software has eliminated the computational problems that prevented the use of regression analysis. However, this led to the fact that regression analysis began to be used by users who do not have sufficient qualifications and knowledge. How do users know about alternative methods if many of them have no idea at all about the conditions for applicability of the least squares method and do not know how to check their implementation?

The researcher should not get carried away with number crunching - calculating the intercept, the slope and the coefficient of determination. Deeper knowledge is needed. Let us illustrate this with a classic example taken from textbooks. Anscombe showed that all four datasets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis ToolPak

So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis ended there, we would lose a lot of useful information. This is evidenced by the scatter plots (Fig. 25) and residual plots (Fig. 26) constructed for these data sets.

Fig. 25. Scatter plots for the four data sets

Scatter plots and residual plots show that these data differ from one another. The only set distributed along a straight line is set A. The plot of the residuals calculated from set A has no pattern. The same cannot be said for sets B, C and D. The scatter plot for set B shows a pronounced quadratic pattern. This conclusion is confirmed by the plot of the residuals, which has a parabolic shape. The scatter plot and residual plot show that data set C contains an outlier. In this situation, it is necessary to exclude the outlier from the data set and repeat the analysis. The technique for detecting and eliminating outliers from observations is called influence analysis. After eliminating the outlier, the result of re-estimating the model may be completely different. The scatter plot for data set D illustrates an unusual situation in which the empirical model depends heavily on a single response (X8 = 19, Y8 = 12.5). Such regression models need to be calculated especially carefully. So, scatter and residual plots are an essential tool of regression analysis and should be an integral part of it. Without them, regression analysis is not credible.

Fig. 26. Plots of residuals for the four data sets
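Anscombe's quartet ships with common plotting libraries, so the identity of the summary statistics is easy to verify. A sketch using seaborn's copy of the data:

```python
import numpy as np
import seaborn as sns

df = sns.load_dataset("anscombe")   # columns: dataset ('I'..'IV'), x, y
for name, g in df.groupby("dataset"):
    x, y = g["x"].to_numpy(), g["y"].to_numpy()
    b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
    a = y.mean() - b * x.mean()
    r = np.corrcoef(x, y)[0, 1]
    print(name, round(a, 2), round(b, 2), round(r, 2))
# all four sets give approximately a = 3.0, b = 0.5, r = 0.82
```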

How to avoid pitfalls in regression analysis:

  • Analysis of the possible relationship between variables X and Y always start with a scatterplot.
  • Before interpreting the results of a regression analysis, check the conditions for its applicability.
  • Plot the residuals versus the independent variable. This will allow to determine how the empirical model corresponds to the results of observation, and to detect violation of the constancy of the variance.
  • Use histograms, stem and leaf plots, box plots, and normal distribution plots to test the assumption of a normal distribution of errors.
  • If the applicability conditions of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
  • If the applicability conditions of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and construct confidence intervals containing the mathematical expectation and the predicted response value.
  • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
  • Keep in mind that statistical dependencies are not always causal. Remember that correlation between variables does not mean that there is a causal relationship between them.

Summary. As shown in the block diagram (Fig. 27), this note describes the simple linear regression model, the conditions for its applicability and ways to test those conditions. The t-criterion for testing the statistical significance of the regression slope was considered. A regression model was used to predict the values of the dependent variable. An example was considered related to the choice of a location for a retail outlet, in which the dependence of the annual sales volume on the store area is studied. The information obtained allows you to select a location for the store more accurately and to predict its annual sales. In the following notes, the discussion of regression analysis will continue, including multiple regression models.

Fig. 27. Block diagram of the note

Materials from the book: Levin et al., Statistics for Managers, Moscow: Williams, 2004, pp. 792-872, are used.

If the dependent variable is categorical, logistic regression should be applied.