Regression equation statistics. Finding the parameters of the linear regression equation and giving an economic interpretation of the regression coefficient

x is called the predictor: the independent or explanatory variable.

For a given value of x, Y is the value of the y variable (called the dependent, outcome, or response variable) that lies on the estimated line. This is the value we expect for y (on average) if we know the value of x, and it is called the "predicted value of y" (Figure 5).

a is the intercept of the estimated line: the value of Y when x = 0.

b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit (Figure 5). The coefficient b is called the regression coefficient.

For example: with an increase in human body temperature by 1 °C, the pulse rate increases by an average of 10 beats per minute.

Figure 5. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when X increases by one unit).

Mathematically, solving the linear regression equation reduces to calculating the parameters a and b in such a way that the points of the original data in the correlation field lie as close as possible to the regression line.
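As an illustration, here is a minimal Python sketch of the least-squares formulas for a and b; the temperature and pulse data are invented for the example, not taken from the text:

```python
import numpy as np

# Illustrative data only (invented): body temperature (x) and pulse rate (y).
x = np.array([36.6, 37.0, 37.5, 38.2, 38.9, 39.5])
y = np.array([72.0, 76.0, 81.0, 88.0, 95.0, 101.0])

# Least-squares estimates: b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²), a = ȳ - b·x̄
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"Y = {a:.2f} + {b:.2f} * x")
```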

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Francis Galton (1889). He showed that while tall fathers tend to have tall sons, the average height of sons is smaller than that of their tall fathers. The average height of sons "regressed" or "reversed" towards the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

We see regression to the mean in screening and in clinical trials, where a subset of patients may be selected for treatment because their levels of a particular variable, say cholesterol, are extremely high (or low). If the measurement is repeated over time, the mean of the second reading for the subgroup is usually less extreme than the first, tending (i.e., regressing) towards the age- and sex-matched population mean, regardless of any treatment they may receive. Patients recruited into a clinical trial on the basis of high cholesterol at their first visit are thus likely to show an average drop in cholesterol levels at their second visit, even if they were not treated during that period.

Regression analysis is often used to develop normative scales and standards of physical development.


How well the regression line fits the data can be judged by calculating the coefficient of determination R² (usually expressed as a percentage), which is equal to the square of the correlation coefficient (r²). It represents the proportion or percentage of the variance of y that can be explained by the relationship with x, i.e. the share of variation of the outcome feature that arises under the influence of the independent feature. It can take values in the range from 0 to 1, or, correspondingly, from 0 to 100%. The difference (100% − R²) is the percentage of the variance of y that cannot be explained by this relationship.

Example

Consider the relationship between height (measured in cm) and systolic blood pressure (SBP, measured in mm Hg) in children. We performed a simple linear regression analysis of SBP on height (Fig. 6). There is a significant linear relationship between height and SBP.

Figure 6. Scatter plot showing the relationship between systolic blood pressure and height, with the estimated regression line for systolic blood pressure.

The estimated regression line equation is as follows:

SBP = 46.28 + 0.48 × height.

In this example the intercept is of no interest (a height of zero is clearly outside the range observed in the study). However, we can interpret the slope: in these children SBP is predicted to increase by an average of 0.48 mm Hg for each one-centimetre increase in height.

We can apply the regression equation to predict the SBP we would expect in a child of a given height. For example, a child 115 cm tall has a predicted SBP of 46.28 + (0.48 × 115) = 101.48 mm Hg; a child 130 cm tall has a predicted SBP of 46.28 + (0.48 × 130) = 108.68 mm Hg.

The correlation coefficient was found to be 0.55, which indicates a direct relationship of average strength. The coefficient of determination is then r² = 0.55² = 0.30. Thus we can say that the share of the influence of height on the level of blood pressure in children does not exceed 30%; the remaining 70% is accounted for by other factors.

Linear (simple) regression is limited to considering the relationship between the dependent variable and only one independent variable. If there is more than one independent variable in the relationship, then we need to turn to multiple regression. The equation for such a regression looks like this:

y = a + b₁x₁ + b₂x₂ + … + bₙxₙ

One may be interested in the combined influence of several independent variables x₁, x₂, …, xₙ on the response variable y. If we think these x's may be interdependent, we must not look separately at the effect on y of changing the value of one x, but must simultaneously take the values of all the other x's into account.
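A minimal sketch of how such an equation can be estimated by least squares: the design matrix simply gains one column per additional x. The data here are placeholders, not values from the example below:

```python
import numpy as np

# Placeholder data: response y with two predictors x1 and x2.
x1 = np.array([110.0, 115.0, 120.0, 125.0, 130.0, 135.0])
x2 = np.array([20.0, 22.0, 24.0, 29.0, 30.0, 33.0])
y  = np.array([98.0, 101.0, 104.0, 108.0, 111.0, 115.0])

X = np.column_stack([np.ones_like(x1), x1, x2])  # intercept column + predictors
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # returns [a, b1, b2]
print("a, b1, b2 =", coef)
```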

Example

Since there is a strong relationship between height and body weight of a child, one might wonder if the relationship between height and systolic blood pressure also changes when the child's body weight and sex are also taken into account. Multiple linear regression examines the combined effect of these multiple independent variables on y.

The multiple regression equation in this case can look like this:

SBP = 79.44 − (0.03 × height) + (1.18 × weight) + (4.23 × sex)*

* For sex, the values are 0 for a boy and 1 for a girl.

According to this equation, a girl who is 115 cm tall and weighs 37 kg would have a predicted SBP:

SBP = 79.44 − (0.03 × 115) + (1.18 × 37) + (4.23 × 1) = 123.88 mm Hg.

Logistic regression is very similar to linear regression; it is used when there is a binary outcome of interest (i.e. presence/absence of a symptom, or a subject who has/does not have a disease) and a set of predictors. From the logistic regression equation one can determine which predictors influence the outcome and, using the values of a patient's predictors, estimate the likelihood of a certain outcome: for example, whether complications will arise, or whether treatment will be effective.

We start by creating a binary variable to represent the two outcomes (e.g. "has disease" = 1, "has no disease" = 0). However, we cannot apply these two values as the dependent variable in a linear regression analysis, because the normality assumption is violated and we cannot interpret predicted values that are not zero or one.

Instead, we take the probability p that the subject is classified into the category of interest (i.e. "has the disease") of the dependent variable and, to overcome the mathematical difficulties, apply the logistic transformation in the regression equation: the natural logarithm of the ratio of the probability of "disease" (p) to the probability of "no disease" (1 − p).

An iterative process called the maximum likelihood method, rather than ordinary least squares regression (the linear regression procedure cannot be applied here), produces an estimate of the logistic regression equation from the sample data:

logit(p) = a + b₁x₁ + b₂x₂ + … + bₙxₙ

where logit(p) is the estimated log-odds, from which the true probability that a patient with the individual set of values x₁, …, xₙ has the disease can be recovered;

a is the estimate of the constant (intercept);

b₁, b₂, …, bₙ are the estimates of the logistic regression coefficients.
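A small sketch of the back-transformation from logit(p) to the probability p; the coefficients below are hypothetical, not from any fitted model:

```python
import numpy as np

def predict_probability(a, b, x):
    """Probability of the outcome from a fitted logistic regression."""
    logit = a + np.dot(b, x)             # a + b1*x1 + ... + bn*xn
    return 1.0 / (1.0 + np.exp(-logit))  # inverse of logit(p) = ln(p / (1 - p))

a = -4.0                    # hypothetical intercept
b = np.array([0.03, 0.80])  # hypothetical coefficients for x1, x2
x = np.array([60.0, 1.0])   # one patient's predictor values
print(predict_probability(a, b, x))  # ≈ 0.20
```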

1. Questions on the topic of the lesson:

1. Define a functional relationship and a correlational relationship.

2. Give examples of direct and inverse correlation.

3. Indicate the size of the correlation coefficients for weak, medium and strong relationships between features.

4. In what cases is the rank method for calculating the correlation coefficient used?

5. In what cases is the calculation of the Pearson correlation coefficient used?

6. What are the main steps in calculating the correlation coefficient by the rank method?

7. Define "regression". What is the essence of the regression method?

8. Describe the formula for a simple linear regression equation.

9. Define the regression coefficient.

10. What conclusion can be drawn if the regression coefficient of weight for height is 0.26 kg/cm?

11. What is the regression equation formula used for?

12. What is the coefficient of determination?

13. In what cases is the multiple regression equation used?

14. What is the method of logistic regression used for?

Task.

For light industry enterprises in the region, information was obtained characterizing the dependence of the volume of output (Y, million rubles) on the volume of capital investments (X, million rubles).

Table 1.

Dependence of the volume of output on the volume of capital investments.

X: 36, 28, 43, 52, 51, 54, 25, 37, 51, 29
Y: 104, 77, 117, 137, 143, 144, 82, 101, 132, 77
(the X and Y values are restored from the calculations in Table 3 below)

Required:

1. Find the parameters of the linear regression equation, give an economic interpretation of the regression coefficient.

2. Calculate the residuals; find the residual sum of squares; estimate the variance of the residuals; plot the residuals.

3. Check the fulfillment of the LSM prerequisites.

4. Check the significance of the parameters of the regression equation using Student's t-test (α = 0.05).

5. Calculate the coefficient of determination, check the significance of the regression equation using Fisher's F - criterion (α = 0.05), find the average relative approximation error. Make a judgment about the quality of the model.

6. Predict the average value of the indicator Y at a significance level of α = 0.1, if the predicted value of the factor X is 80% of its maximum value.

7. Present graphically the actual and model Y values ​​of the forecast point.

8. Compose non-linear regression equations and build their graphs:

hyperbolic;

power;

exponential.

9. For these models, find the coefficients of determination and average relative approximation errors. Compare models according to these characteristics and draw a conclusion.

Let us find the parameters of the linear regression equation and give an economic interpretation of the regression coefficient.

The linear regression equation is: ŷ = a + b·x, where the parameters are found by the least squares method: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², a = ȳ − b·x̄.

Calculations for finding the parameters a and b are given in Table 2.

Table 2.

Calculation of values ​​to find the parameters of the linear regression equation.

The regression equation is: y = 13.8951 + 2.4016*x.

With an increase in the volume of capital investments (X) by 1 million rubles, the volume of output (Y) increases by an average of 2.4016 million rubles. Thus there is a positive relationship between the features, which indicates the efficiency of the enterprises and the profitability of investment in their activities.

2. Calculate the residuals; find the residual sum of squares; estimate the variance of the residuals; plot the residuals.

The residuals are calculated by the formula: e_i = y_i − ŷ_i.

Residual sum of squares: SSE = Σe_i² = 207.74.

Residual variance: S² = SSE / (n − 2) = 207.74 / 8 = 25.97.

The calculations are shown in Table 3.

Table 3

y_i     x_i     ŷ_i = a + b·x_i     e_i = y_i − ŷ_i     e_i²
104     36      100.35              3.65                13.306
77      28      81.14               -4.14               17.131
117     43      117.16              -0.16               0.0269
137     52      138.78              -1.78               3.1649
143     51      136.38              6.62                43.859
144     54      143.58              0.42                0.1744
82      25      73.93               8.07                65.061
101     37      102.75              -1.75               3.0765
132     51      136.38              -4.38               19.161
77      29      83.54               -6.54               42.78
Sum                                 0.00                207.74
Mean    111.4   40.6

(The y_i and x_i columns are restored from ŷ_i and e_i: y_i = ŷ_i + e_i, x_i = (ŷ_i − a)/b.)
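These calculations can be cross-checked with a short Python sketch (a sketch only, using the y and x columns restored in Table 3):

```python
import numpy as np

x = np.array([36, 28, 43, 52, 51, 54, 25, 37, 51, 29], dtype=float)
y = np.array([104, 77, 117, 137, 143, 144, 82, 101, 132, 77], dtype=float)
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - (a + b * x)      # residuals
sse = np.sum(e ** 2)     # residual sum of squares
s2 = sse / (n - 2)       # residual variance
print(a, b)              # ≈ 13.8951, 2.4016
print(sse, s2)           # ≈ 207.74, 25.97
```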

The residual plot looks like this:


Fig. 1. Residual plot

3. Let us check the fulfillment of the LSM prerequisites, which includes the following elements:

- checking that the mathematical expectation of the random component equals zero;

- the random nature of the residuals;

- checking the independence of the residuals;

- conformity of the residual series to the normal distribution law.

Checking that the mathematical expectation of the levels of the residual series equals zero.

This is carried out by testing the corresponding null hypothesis H₀: E(e) = 0. For this purpose the t-statistic t = |ē|·√n / S_e is constructed, where ē is the mean level of the residual series and S_e is its standard deviation.

Here ē = 0 (the residuals of a least-squares fit with an intercept sum to zero), so t = 0 < t_table and the hypothesis is accepted.

Randomness of the residuals.

Let us check the randomness of the levels of the residual series using the turning-points criterion. The number of turning points is determined from the table of residuals:

e_i = y_i − ŷ_i     turning point     e_i²      (e_i − e_{i−1})²
3.65                                  13.31
-4.14               *                 17.13     60.63
-0.16               *                 0.03      15.80
-1.78               *                 3.16      2.61
6.62                *                 43.86     70.59
0.42                *                 0.17      38.50
8.07                *                 65.06     58.50
-1.75                                 3.08      96.43
-4.38                                 19.16     6.88
-6.54                                 42.78     4.68
Sum                                   207.74    354.62

The number of turning points is p = 6. The critical value is p_crit = ⌊2(n − 2)/3 − 1.96·√((16n − 29)/90)⌋ = 2 for n = 10. Since p = 6 > 2, the property of randomness of the residuals is satisfied.
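A sketch of this count in Python (the critical-value formula is the standard one for the turning-points criterion):

```python
import numpy as np

e = np.array([3.65, -4.14, -0.16, -1.78, 6.62, 0.42, 8.07, -1.75, -4.38, -6.54])
n = len(e)

# e[i] is a turning point if it is a strict local maximum or minimum.
p = sum((e[i - 1] < e[i] > e[i + 1]) or (e[i - 1] > e[i] < e[i + 1])
        for i in range(1, n - 1))
p_crit = int(2 * (n - 2) / 3 - 1.96 * np.sqrt((16 * n - 29) / 90))
print(p, p_crit)  # 6 > 2, so the residuals behave randomly
```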

Independence of the residuals is verified using the Durbin-Watson test:

d = Σ(e_i − e_{i−1})² / Σe_i² = 354.62 / 207.74 = 1.707; 4 − d = 4 − 1.707 = 2.293.

Since the statistic falls into the interval from d₂ (= d_U) to 2, by this criterion the independence property is satisfied. This means there is no autocorrelation in the residual series; the model is adequate by this criterion.
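A quick check of this statistic in Python:

```python
import numpy as np

e = np.array([3.65, -4.14, -0.16, -1.78, 6.62, 0.42, 8.07, -1.75, -4.38, -6.54])
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(d)  # ≈ 1.707; values near 2 indicate no autocorrelation
```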

Conformity of the residual series to the normal distribution law is determined using the R/S criterion, with critical bounds (2.7; 3.7).

Calculate the RS value:

RS = (e_max − e_min) / S,

where e_max is the maximum level of the residual series, e_max = 8.07;

e_min is the minimum level of the residual series, e_min = −6.54;

S is the standard deviation of the residuals, S = 4.8044.

RS = (e_max − e_min) / S = (8.07 + 6.54) / 4.8044 = 3.04.

Since 2.7 < 3.04 < 3.7, the obtained RS value falls into the specified interval, so the normality property of the distribution is satisfied.
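The same check in code:

```python
import numpy as np

e = np.array([3.65, -4.14, -0.16, -1.78, 6.62, 0.42, 8.07, -1.75, -4.38, -6.54])
s = e.std(ddof=1)                # sample standard deviation, ≈ 4.8044
rs = (e.max() - e.min()) / s
print(rs)                        # ≈ 3.04, inside the (2.7; 3.7) band
```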

Thus, having considered various criteria for fulfilling the prerequisites of the LSM, we conclude that the prerequisites of the LSM are met.

4. Let us check the significance of the parameters of the regression equation using Student's t-test (α = 0.05).

Checking the significance of individual regression coefficients involves computing the calculated t-test values (t-statistics) for the corresponding regression coefficients: t_a = a / S_a and t_b = b / S_b, where S_a and S_b are the standard errors of the estimated parameters.

The calculated values are then compared with the tabular value t_table = 2.3060. The tabular value of the criterion is determined for (n − 2) degrees of freedom (n is the number of observations) and the chosen significance level α = 0.05.

If the calculated value of the t-test with (n − 2) degrees of freedom exceeds its tabular value at the given significance level, the regression coefficient is considered significant.

In our case, the intercept a is insignificant, while the regression coefficient b is significant.
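A sketch of the same check in Python, using the data restored in Table 3 and the usual standard-error formulas for simple linear regression:

```python
import numpy as np

x = np.array([36, 28, 43, 52, 51, 54, 25, 37, 51, 29], dtype=float)
y = np.array([104, 77, 117, 137, 143, 144, 82, 101, 132, 77], dtype=float)
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
e = y - (a + b * x)
s_e = np.sqrt(np.sum(e ** 2) / (n - 2))           # residual standard error

s_b = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))  # std. error of b
s_a = s_e * np.sqrt(np.sum(x ** 2) / (n * np.sum((x - x.mean()) ** 2)))
print(b / s_b, a / s_a)  # ≈ 15.6 and ≈ 2.16 vs t_table = 2.306
```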

The regression line is a graphical reflection of the relationship between phenomena. You can easily build a regression line in Excel.

For this you need:

1.Open the Excel program

2. Create columns with data. In our example, we will build a regression line, or relationship, between aggressiveness and self-doubt in first-graders. The experiment involved 30 children, the data are presented in the Excel table:

1 column - number of the subject

2 column - aggressiveness in points

3 column - self-doubt in points

3. Next, select both data columns (without the column headings), open the Insert tab, choose Scatter, and from the proposed layouts pick the very first one, Scatter with Markers.

4. This gives the blank for the regression line, the so-called scatterplot. To get the regression line itself, click on the resulting chart, open the Design tab, find the Chart Layouts panel and choose Layout 9 (the one marked f(x)).

5. We now have a regression line. The chart also shows its equation and the squared correlation coefficient (R²).

6. It remains to add the chart title and the axis titles. If desired, you can also remove the legend and reduce the number of horizontal gridlines (Layout tab, then Gridlines). The main changes and settings are made on the Layout tab.

The regression line is built in MS Excel. Now it can be added to the text of the work.

In the previous notes, the focus has often been on a single numerical variable, such as mutual fund returns, Web page load time, or soft drink consumption. In this and the following notes, we will consider methods for predicting the values ​​of a numeric variable depending on the values ​​of one or more other numeric variables.

The material will be illustrated with a running example: forecasting sales volume in a clothing store. The Sunflowers chain of discount clothing stores has been expanding steadily for 25 years. However, the company currently has no systematic approach to selecting new outlets. The location where the company intends to open a new store is chosen on subjective grounds: favourable rental conditions or the manager's idea of the ideal store location. Imagine that you are the head of the Special Projects and Planning Department. You have been tasked with developing a strategic plan for opening new stores. This plan should contain a forecast of annual sales in newly opened stores. You believe that selling space is directly related to revenue and want to factor that fact into your decision-making process. How do you develop a statistical model that predicts annual sales based on the size of a new store?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that predicts the values of the dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note we consider simple linear regression: a statistical method for predicting the values of the dependent variable Y from the values of the independent variable X. Later notes will describe the multiple regression model, designed to predict the values of the dependent variable Y from the values of several independent variables (X₁, X₂, …, X_k).


The Durbin-Watson statistic is calculated as

D = Σ(e_i − e_{i−1})² / Σe_i²,  (10)

and is related to the residual autocorrelation by D ≈ 2(1 − ρ₁), where ρ₁ is the autocorrelation coefficient; if ρ₁ = 0 (no autocorrelation), D ≈ 2; if ρ₁ ≈ 1 (positive autocorrelation), D ≈ 0; if ρ₁ = −1 (negative autocorrelation), D ≈ 4.

In practice, the application of the Durbin-Watson criterion is based on comparing the value of D with the critical theoretical values d_L and d_U for a given number of observations n, number of independent variables k (for simple linear regression k = 1), and significance level α. If D < d_L, the hypothesis of independence of the random deviations is rejected (hence there is positive autocorrelation); if D > d_U, the hypothesis is not rejected (i.e. there is no autocorrelation); if d_L < D < d_U, there is not enough evidence to make a decision. When the calculated value of D exceeds 2, it is the expression (4 − D), rather than D itself, that is compared with d_L and d_U.

To calculate the Durbin-Watson statistic in Excel, we turn to the bottom table in Fig. 14, Residual Output. The numerator in expression (10) is calculated using the function =SUMXMY2(array1; array2), and the denominator using =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic

In our example D = 0.883. The main question is: what value of the Durbin-Watson statistic should be considered small enough to conclude that positive autocorrelation exists? The value of D must be related to the critical values (d_L and d_U), which depend on the number of observations n and the significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (table fragment)

Thus, in the problem of the volume of sales in a store delivering goods to the home, there is one independent variable (k = 1), 15 observations (n = 15) and significance level α = 0.05. Hence d_L = 1.08 and d_U = 1.36. Since D = 0.883 < d_L = 1.08, there is positive autocorrelation between the residuals and the least squares method cannot be applied.

Testing Hypotheses about Slope and Correlation Coefficient

The above regression was applied solely for forecasting. To determine regression coefficients and predict the value of a variable Y for a given variable value X the method of least squares was used. In addition, we considered the standard error of the estimate and the coefficient of mixed correlation. If the residual analysis confirms that the applicability conditions of the least squares method are not violated, and the simple linear regression model is adequate, based on the sample data, it can be argued that there is a linear relationship between the variables in the population.

Application of the t-test for the slope. By checking whether the population slope β₁ equals zero, one can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that there is a linear relationship between X and Y. The null and alternative hypotheses are formulated as follows: H₀: β₁ = 0 (no linear relationship), H₁: β₁ ≠ 0 (there is a linear relationship). By definition, the t-statistic equals the difference between the sample slope and the hypothesised population slope, divided by the standard error of the slope estimate:

(11) t = (b₁ − β₁) / S_b₁

where b₁ is the slope of the regression line based on the sample data, β₁ is the hypothesised slope of the population regression line, S_b₁ is the standard error of the slope estimate, and the test statistic t has a t-distribution with n − 2 degrees of freedom.
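As a quick numeric sketch of formula (11), using the values reported in Fig. 18 (scipy is assumed for the critical value):

```python
from scipy import stats

b1, beta1, s_b1 = 1.670, 0.0, 0.157  # sample slope, H0 slope, its std. error
t = (b1 - beta1) / s_b1              # formula (11)
t_crit = stats.t.ppf(0.975, df=12)   # two-sided critical value, n - 2 = 12
print(t, t_crit)                     # ≈ 10.64 > 2.1788 => reject H0
```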

Let us check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-test is displayed along with the other parameters when the Analysis ToolPak (Regression option) is used. The full results of the Analysis ToolPak are shown in Fig. 4; the fragment related to the t-statistic is shown in Fig. 18.

Fig. 18. Results of applying the t-test, obtained using the Analysis ToolPak

Since the number of stores is n = 14 (see Fig. 3), the critical values of the t-statistic at significance level α = 0.05 can be found with the formulas t_L = T.INV(0.025; 12) = −2.1788, where 0.025 is half the significance level and 12 = n − 2, and t_U = T.INV(0.975; 12) = +2.1788.

Since the t-statistic = 10.64 > t_U = 2.1788 (Fig. 19), the null hypothesis H₀ is rejected. Moreover, the p-value for t = 10.6411, calculated with the formula =1-T.DIST(D3; 12; TRUE), is approximately equal to zero, so H₀ is rejected again. The fact that the p-value is almost zero means that if there were no real linear relationship between store size and annual sales, it would be almost impossible to detect it using linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.

Fig. 19. Testing the hypothesis about the population slope at significance level 0.05 with 12 degrees of freedom

Application of the F-test for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-test. Recall that the F-test is used to test the ratio of two variances. When testing the slope hypothesis, the measure of random errors is the error variance (the sum of squared errors divided by the number of degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (i.e. SSR divided by the number of independent variables k) to the error variance (MSE = S²_YX).

By definition, the F-statistic equals the mean square due to regression (MSR) divided by the error variance (MSE): F = MSR/MSE, where MSR = SSR/k, MSE = SSE/(n − k − 1), and k is the number of independent variables in the regression model. The test statistic F has an F-distribution with k and n − k − 1 degrees of freedom.

For a given significance level α, the decision rule is formulated as follows: if F > F_U, the null hypothesis is rejected; otherwise, it is not rejected. The results, presented as an analysis-of-variance summary table, are shown in Fig. 20.

Fig. 20. Analysis-of-variance table for testing the hypothesis of the statistical significance of the regression coefficient
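A minimal sketch of this decision rule, with the F-statistic from the regression output and the critical value from the F-distribution (scipy assumed):

```python
from scipy import stats

F = 113.23                                   # MSR / MSE from the output
F_crit = stats.f.ppf(1 - 0.05, dfn=1, dfd=12)
print(F_crit, F > F_crit)                    # ≈ 4.7472, True => significant
```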

Like the t-test, the F-test is included in the output table when the Analysis ToolPak (Regression option) is used. The full results of the Analysis ToolPak are shown in Fig. 4; the fragment related to the F-statistic is shown in Fig. 21.

Fig. 21. Results of applying the F-test, obtained using the Excel Analysis ToolPak

The F-statistic is 113.23 and the p-value is close to zero (cell Significance F). At significance level α = 0.05, the critical value of the F-distribution with 1 and 12 degrees of freedom can be obtained from the formula F_U = F.INV(1 − 0.05; 1; 12) = 4.7472 (Fig. 22). Since F = 113.23 > F_U = 4.7472, and the p-value is close to 0 < 0.05, the null hypothesis H₀ is rejected, i.e. the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the hypothesis about the population slope at significance level 0.05 with 1 and 12 degrees of freedom

Confidence interval for the slope β₁. To test the hypothesis of a linear relationship between the variables, one can instead build a confidence interval for the slope β₁ and check whether the hypothesised value β₁ = 0 belongs to this interval. The centre of the confidence interval is the sample slope b₁, and its boundaries are the quantities b₁ ± t_{n−2}·S_b₁.

As shown in Fig. 18, b₁ = +1.670, n = 14, S_b₁ = 0.157, and t₁₂ = T.INV(0.975; 12) = 2.1788. Hence b₁ ± t_{n−2}·S_b₁ = +1.670 ± 2.1788 × 0.157 = +1.670 ± 0.342, or +1.328 ≤ β₁ ≤ +2.012. Thus, the population slope lies, with probability 0.95, between +1.328 and +2.012 (i.e. between $1,328,000 and $2,012,000). Since these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area; if the confidence interval contained zero, there would be no relationship between the variables. The confidence interval also means that each additional 1,000 sq. ft of selling space increases average sales by between $1,328,000 and $2,012,000.
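The same interval can be reproduced directly:

```python
from scipy import stats

b1, s_b1, n = 1.670, 0.157, 14
t_crit = stats.t.ppf(0.975, df=n - 2)            # ≈ 2.1788
low, high = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(low, high)                                 # ≈ 1.328 and ≈ 2.012
```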

Using the t-test for the correlation coefficient. Earlier, the correlation coefficient r was introduced as a measure of the relationship between two numeric variables. It can be used to determine whether there is a statistically significant relationship between two variables. Let us denote the population correlation coefficient by ρ. The null and alternative hypotheses are formulated as follows: H₀: ρ = 0 (no correlation), H₁: ρ ≠ 0 (there is a correlation). The existence of a correlation is checked with the test statistic

(12) t = (r − ρ) / √((1 − r²) / (n − 2)),

where r = +√r², if b₁ > 0, and r = −√r², if b₁ < 0. The test statistic t has a t-distribution with n − 2 degrees of freedom.

In the problem of the Sunflowers store chain, r² = 0.904 and b₁ = +1.670 (see Fig. 4). Since b₁ > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let us test the null hypothesis that there is no correlation between these variables using this t-statistic:

At a significance level of α = 0.05, the null hypothesis should be rejected, because t = 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.
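The same test as a sketch, starting from r² = 0.904:

```python
import numpy as np
from scipy import stats

r2, n = 0.904, 14
r = np.sqrt(r2)                              # b1 > 0, so r is positive
t = (r - 0.0) / np.sqrt((1 - r2) / (n - 2))  # statistic under H0: rho = 0
print(t, stats.t.ppf(0.975, n - 2))          # ≈ 10.6 > 2.1788 => reject H0
```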

When discussing inferences about population slopes, confidence intervals and criteria for testing hypotheses are interchangeable tools. However, the calculation of the confidence interval containing the correlation coefficient turns out to be more difficult, since the form of the sampling distribution of the statistic r depends on the true correlation coefficient.

Estimation of mathematical expectation and prediction of individual values

This section discusses methods for estimating the mathematical expectation of the response Y and for predicting individual values of Y for given values of the variable X.

Construction of a confidence interval. In example 2 (see the section Least squares method above), the regression equation made it possible to predict the value of the variable Y from the value of the variable X. In the problem of choosing a location for a retail outlet, the average annual sales in a store with an area of 4,000 sq. ft equalled 7.644 million dollars. However, this estimate of the population mathematical expectation is a point estimate. Earlier, the concept of a confidence interval was proposed for estimating the population mean; similarly, one can introduce the concept of a confidence interval for the mathematical expectation of the response at a given value of the variable X:

(13) Ŷ_i ± t_{n−2} · S_YX · √(1/n + (X_i − X̄)² / SSX)

where Ŷ_i = b₀ + b₁X_i is the predicted value of the variable Y at X = X_i, S_YX is the standard error of the estimate, n is the sample size, X_i is the given value of the variable X, µ_{Y|X=X_i} is the mathematical expectation of Y at X = X_i, and SSX = Σ(X_i − X̄)².

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. At a given significance level, an increase in the scatter around the regression line, measured by the standard error of the estimate, widens the interval. On the other hand, as expected, an increase in the sample size narrows it. In addition, the width of the interval changes with the value X_i: if the value of Y is predicted for values of X close to the mean X̄, the confidence interval is narrower than when predicting the response for values far from the mean.

Suppose that when choosing a location for a store we want to build a 95% confidence interval for the average annual sales of all stores with an area of 4,000 sq. ft:

Therefore, the average annual sales volume of all stores with an area of 4,000 sq. ft lies, with 95% probability, in the range from 6.971 to 8.317 million dollars.

Computing the confidence interval for a predicted value. In addition to the confidence interval for the mathematical expectation of the response at a given value of the variable X, it is often necessary to know the confidence interval for a predicted value. Although the formula for this interval is very similar to formula (13), it contains a predicted value rather than a parameter estimate. The interval for the predicted response Y_{X=X_i} at a specific value X_i is determined by the formula:

(14) Ŷ_i ± t_{n−2} · S_YX · √(1 + 1/n + (X_i − X̄)² / SSX)

Suppose that when choosing a location for a retail outlet we want to build a 95% confidence interval for the predicted annual sales of a store with an area of 4,000 sq. ft:

Therefore, the predicted annual sales volume of a 4,000 sq. ft store lies, with 95% probability, in the range from 5.433 to 9.854 million dollars. As you can see, the confidence interval for a predicted response value is much wider than the confidence interval for its mathematical expectation, because the variability in predicting individual values is much greater than in estimating the mathematical expectation.
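A generic sketch of formulas (13) and (14); the inputs (S_YX, X̄, SSX) are placeholders, since their numeric values are not reproduced in the text above:

```python
import numpy as np
from scipy import stats

def intervals(y_hat, s_yx, n, x_i, x_bar, ssx, alpha=0.05):
    """CI for E(Y|X=x_i) and prediction interval for an individual new Y."""
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    h = 1.0 / n + (x_i - x_bar) ** 2 / ssx        # leverage term
    half_mean = t * s_yx * np.sqrt(h)             # formula (13)
    half_pred = t * s_yx * np.sqrt(1 + h)         # formula (14)
    return ((y_hat - half_mean, y_hat + half_mean),
            (y_hat - half_pred, y_hat + half_pred))
```

The prediction interval is always the wider of the two, because its variance includes the extra 1 for the scatter of an individual observation around the line.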

Pitfalls and ethical issues associated with the use of regression

Difficulties associated with regression analysis:

  • Ignoring the conditions of applicability of the method of least squares.
  • An erroneous estimate of the conditions for applicability of the method of least squares.
  • Wrong choice of alternative methods in violation of the conditions of applicability of the least squares method.
  • Application of regression analysis without in-depth knowledge of the subject of study.
  • Extrapolation of the regression beyond the range of the explanatory variable.
  • Confusion between statistical and causal relationships.

The widespread use of spreadsheets and statistical software has eliminated the computational problems that prevented the use of regression analysis. However, this led to the fact that regression analysis began to be used by users who do not have sufficient qualifications and knowledge. How do users know about alternative methods if many of them have no idea at all about the conditions for applicability of the least squares method and do not know how to check their implementation?

The researcher should not get carried away with number crunching, calculating the intercept, the slope and the correlation coefficient; deeper knowledge is needed. Let us illustrate this with a classic example taken from textbooks. Anscombe showed that all four data sets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis ToolPak

So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis ended there, we would lose a lot of useful information. This is evidenced by the scatter plots (Fig. 25) and residual plots (Fig. 26) constructed for these data sets.

Fig. 25. Scatter plots for the four data sets

Scatter plots and residual plots show that these data sets differ from one another. The only set distributed along a straight line is set A; the plot of the residuals calculated from set A has no pattern. The same cannot be said for sets B, C and D. The scatter plot for set B shows a pronounced quadratic pattern, a conclusion confirmed by the residual plot, which has a parabolic shape. The scatter plot and residual plot show that data set C contains an outlier. In this situation, it is necessary to exclude the outlier from the data set and repeat the analysis. The technique of detecting and eliminating outliers from observations is called influence analysis. After the outlier is eliminated, the result of re-estimating the model may be completely different. The scatter plot for data set D illustrates an unusual situation in which the empirical model depends heavily on a single response (X₈ = 19, Y₈ = 12.5). Such regression models must be calculated especially carefully. So, scatter and residual plots are an essential tool of regression analysis and should be an integral part of it; without them, regression analysis is not credible.

Fig. 26. Residual plots for the four data sets

How to avoid pitfalls in regression analysis:

  • Analysis of the possible relationship between variables X and Y always start with a scatterplot.
  • Before interpreting the results of a regression analysis, check the conditions for its applicability.
  • Plot the residuals as a function of the independent variable. This will show how well the empirical model fits the observations and help detect violations of the constancy of variance.
  • Use histograms, stem and leaf plots, box plots, and normal distribution plots to test the assumption of a normal distribution of errors.
  • If the applicability conditions of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
  • If the applicability conditions of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and construct confidence intervals containing the mathematical expectation and the predicted response value.
  • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
  • Keep in mind that statistical dependencies are not always causal. Remember that correlation between variables does not mean that there is a causal relationship between them.

Summary. As shown in the block diagram (Fig. 27), the note describes the simple linear regression model, the conditions of its applicability, and ways to test these conditions. The t-test for the statistical significance of the regression slope was considered. A regression model was used to predict the values of the dependent variable. An example was considered related to choosing a site for a retail outlet, studying the dependence of annual sales volume on store area. The information obtained allows one to select a location for the store more accurately and to predict its annual sales. The following notes continue the discussion of regression analysis and cover multiple regression models.

Fig. 27. Block diagram of the note

Based on materials from Levin et al., Statistics for Managers, Moscow: Williams, 2004, pp. 792–872.

If the dependent variable is categorical, logistic regression should be applied.