Critical Spearman rank correlation values. Application of Spearman and Pearson correlation

37. Spearman's rank correlation coefficient.

p. 56 (64), 063.JPG

http://psystat.at.ua/publ/1-1-0-33

Spearman's rank correlation coefficient is used in cases where:
- the variables are measured on a ranking (ordinal) scale;
- the data distribution differs markedly from normal or is not known at all;
- the samples are small (N < 30).

The interpretation of the Spearman rank correlation coefficient does not differ from that of the Pearson coefficient, but its meaning is somewhat different. To understand the difference between these methods and logically justify their areas of application, let us compare their formulas.

Pearson correlation coefficient:

r_xy = Σ(x_i − M_x)(y_i − M_y) / (N · σ_x · σ_y)

Spearman correlation coefficient:

r_s = 1 − 6·Σd² / (N(N² − 1)),

where d is the difference between the ranks of a given case on the two variables.

As you can see, the formulas differ significantly. Let us compare them point by point.

The Pearson formula uses the arithmetic mean and standard deviation of the correlated series; the Spearman formula does not. Thus, to obtain an adequate result with the Pearson formula, the correlated series must be close to the normal distribution (the mean and standard deviation are the parameters of a normal distribution). This is irrelevant for the Spearman formula.

An element of the Pearson formula is the standardization of each series on the z-scale.

As you can see, the conversion of the variables to the z-scale is present in the formula for the Pearson correlation coefficient. Accordingly, for the Pearson coefficient the scale of the data does not matter at all: for example, we can correlate two variables, one of which has min = 0 and max = 1, and the second min = 100 and max = 1000. However different the ranges of values are, they will all be converted to standard z-values that are identical in scale.

Such normalization does not occur in the Spearman coefficient, therefore

A MANDATORY CONDITION FOR USING THE SPEARMAN COEFFICIENT IS THE EQUALITY OF THE RANGE OF THE TWO VARIABLES.

Before using the Spearman coefficient on data series with different ranges, it is necessary to rank them. Ranking causes the values of these series to acquire the same minimum = 1 (the minimum rank) and a maximum equal to the number of values (the maximum, last rank = N, i.e., the number of cases in the sample).
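A minimal sketch of this equalization with SciPy (the arrays are illustrative):

from scipy.stats import rankdata

# Two variables with very different ranges
a = [0.1, 0.5, 0.2, 0.9, 0.4]   # min = 0.1, max = 0.9
b = [100, 450, 980, 120, 300]   # min = 100, max = 980

# After ranking, both series run from 1 (minimum rank) to N (maximum rank)
print(rankdata(a))  # [1. 4. 2. 5. 3.]
print(rankdata(b))  # [1. 4. 5. 2. 3.]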

In what cases can you do without ranking?

These are cases when the data are initially on a ranking scale, for example, Rokeach's test of value orientations.

Also, these are cases when the number of value options is small and the sample contains a fixed minimum and maximum. For example, in a semantic differential, minimum = 1, maximum = 7.

Example of calculating Spearman's rank correlation coefficient

Rokeach's test of value orientations was carried out on two samples, X and Y. Objective: to find out how close the value hierarchies of these samples are (literally, how similar they are).

The resulting value r_s = 0.747 is checked against the table of critical values. According to the table, with N = 18 the obtained value is significant at the level p ≤ 0.005.
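A minimal sketch of such a check with SciPy; the two rank hierarchies below are hypothetical stand-ins, since the original data are not reproduced in the source:

from scipy.stats import spearmanr

# Hypothetical value hierarchies (ranks 1..18) for samples X and Y
x = list(range(1, 19))
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15, 18, 17]

rs, p = spearmanr(x, y)
print(rs, p)  # rs is then compared with the tabulated critical value for N = 18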

Spearman and Kendall rank correlation coefficients

For variables belonging to an ordinal scale, or for variables that do not follow a normal distribution, as well as for variables belonging to an interval scale, Spearman's rank correlation is calculated instead of the Pearson coefficient. To do this, the individual values of the variables are assigned ranks, which are then processed using the appropriate formulas. To obtain a rank correlation, clear the default Pearson correlation check box in the Bivariate Correlations... dialog box and activate the Spearman correlation calculation instead. In the example considered, the rank correlation coefficients turn out to be very close to the corresponding Pearson coefficients (the original variables have a normal distribution).

titkova-matmetody.pdf p. 45

Spearman's rank correlation method allows you to determine the tightness (strength) and direction of the correlation between two features or between two profiles (hierarchies) of features.

To calculate a rank correlation, you need two series of values that can be ranked. Such series of values could be:

1) two features measured in the same group of subjects;
2) two individual hierarchies of features identified in two subjects using the same set of features;
3) two group hierarchies of features;
4) an individual and a group hierarchy of features.

First, the indicators are ranked separately for each feature. As a rule, the lower rank is assigned to the lower feature value.

In the first case (two features), the individual values obtained by different subjects are ranked on the first feature, and then the individual values on the second feature.

If two features are positively related, then subjects with low ranks on one of them will have low ranks on the other, and subjects with high ranks on one feature will also have high ranks on the other. To calculate rs, the differences (d) between the ranks obtained by a given subject on the two features must be determined. These d values are then transformed in a certain way and subtracted from 1. The smaller the differences between the ranks, the larger rs will be, the closer to +1.

If there is no correlation, all the ranks will be mixed and show no correspondence. The formula is designed so that in this case rs will be close to 0.

In the case of a negative correlation, subjects' low ranks on one feature will correspond to high ranks on the other, and vice versa. The greater the discrepancy between subjects' ranks on the two variables, the closer rs is to −1.

In the second case (two individual profiles), the individual values obtained by each of the two subjects on a certain set of features (the same for both of them) are ranked. The first rank is given to the feature with the lowest value, the second rank to the feature with a higher value, and so on. Obviously, all features must be measured in the same units, otherwise ranking is impossible. For example, it is impossible to rank the scores on the Cattell Personality Inventory (16PF) if they are expressed in "raw" points, since the ranges of values differ across factors: from 0 to 13, from 0 to 20, and from 0 to 26. We cannot say which factor will take first place in degree of expression until all the values are brought to a single scale (most often the sten scale).

If the individual hierarchies of two subjects are positively related, then features that have low ranks in one of them will have low ranks in the other, and vice versa. For example, if factor E (dominance) has the lowest rank in one subject, it should have a low rank in the other subject as well; if factor C (emotional stability) has the highest rank in one subject, then the other subject must also have a high rank on this factor, and so on.

In the third case (two group profiles), the group average values obtained in two groups of subjects on a specific set of features, identical for both groups, are ranked. The line of reasoning is then the same as in the previous two cases.

In the fourth case (an individual and a group profile), the subject's individual values and the group average values are ranked separately on the same set of features. The group averages are obtained, as a rule, with this individual subject excluded: he does not take part in the average group profile with which his individual profile will be compared. Rank correlation lets you check how consistent the individual and group profiles are.

In all four cases, the significance of the resulting correlation coefficient is determined by the number of ranked values N. In the first case, this number coincides with the sample size n. In the second case, the number of observations is the number of features making up the hierarchy. In the third and fourth cases, N is likewise the number of compared features, not the number of subjects in the groups. Detailed explanations are given in the examples. If the absolute value of rs reaches or exceeds the critical value, the correlation is reliable.

Hypotheses.

There are two possible pairs of hypotheses. The first applies to case 1, the second to the other three cases.

First pair of hypotheses

H0: The correlation between variables A and B does not differ from zero.

H1: The correlation between variables A and B differs significantly from zero.

Second pair of hypotheses

H0: The correlation between hierarchies A and B does not differ from zero.

H1: The correlation between hierarchies A and B differs significantly from zero.

Limitations of the rank correlation coefficient

1. At least 5 observations must be available for each variable. The upper sampling boundary is determined by the available tables of critical values.

2. Spearman's rank correlation coefficient rs gives rough values when there are large numbers of identical ranks in one or both compared variables. Ideally, both correlated series should be sequences of non-coinciding values. If this condition is not met, a correction for tied ranks must be applied.

Spearman's rank correlation coefficient is calculated using the formula:

r_s = 1 − 6·Σd² / (n(n² − 1)).

If both compared rank series contain groups of identical ranks, then before calculating the rank correlation coefficient it is necessary to introduce corrections for tied ranks, Ta and Tb:

Ta = Σ(a³ − a)/12,

Tb = Σ(b³ − b)/12,

where a is the volume of each group of identical ranks in rank series A, and b is the volume of each group of identical ranks in rank series B.

To calculate the empirical value of rs, use the formula:

r_s = 1 − 6·(Σd² + Ta + Tb) / (n(n² − 1)).
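A minimal sketch of this corrected computation, assuming NumPy/SciPy (the helper names are illustrative):

import numpy as np
from scipy.stats import rankdata

def tie_correction(ranks):
    # T = Σ(a³ − a)/12 over the groups of identical ranks, a = group size
    _, counts = np.unique(ranks, return_counts=True)
    return float(np.sum((counts**3 - counts) / 12))

def spearman_corrected(x, y):
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    d2 = float(np.sum((rx - ry) ** 2))
    ta, tb = tie_correction(rx), tie_correction(ry)
    return 1 - 6 * (d2 + ta + tb) / (n * (n**2 - 1))

Note that scipy.stats.spearmanr, which applies the Pearson formula to the ranks, handles ties exactly and may therefore differ slightly from this textbook correction.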

38. Point-biserial correlation coefficient.

About correlation in general, see question No. 36, p. 56 (64), 063.JPG

harchenko-korranaliz.pdf

Let variable X be measured on a metric (strong) scale, and variable Y on a dichotomous scale. The point-biserial correlation coefficient rpb is calculated using the formula:

r_pb = ((x̄₁ − x̄₀) / s_x) · √(n₁·n₀ / (n(n − 1))).

Here x̄₁ is the average value of X over objects with a value of "one" on Y;

x̄₀ is the average value of X over objects with a value of "zero" on Y;

s_x is the standard deviation of all values of X;

n₁ is the number of objects with "one" on Y, n₀ is the number of objects with "zero" on Y;

n = n₁ + n₀ is the sample size.

The point-biserial correlation coefficient can also be calculated using other equivalent expressions, for example:

r_pb = ((x̄₁ − x̄) / s_x) · √(n₁·n / (n₀(n − 1))).

Here x̄ is the overall average value of the variable X.

The point-biserial correlation coefficient rpb varies from −1 to +1. Its value is zero if the objects with a one on Y have the same average X as the objects with a zero on Y.

Testing the significance hypothesis for the point-biserial correlation coefficient amounts to testing the null hypothesis H0 that the general correlation coefficient equals zero, ρ = 0, which is carried out using Student's t-test. The empirical value

t = r_pb · √(n − 2) / √(1 − r_pb²)

is compared with the critical value t_α(df) for df = n − 2 degrees of freedom.

If |t| ≤ t_α(df), the null hypothesis ρ = 0 is not rejected. The point-biserial correlation coefficient differs significantly from zero if the empirical value |t| falls into the critical region, that is, if |t| > t_α(n − 2). The reliability of a relationship calculated with the point-biserial coefficient rpb can also be determined using the χ² criterion with df = 2 degrees of freedom.
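A minimal sketch of the computation with NumPy (the data layout is illustrative):

import numpy as np

def point_biserial(x, y):
    # x: values on a metric scale; y: dichotomous 0/1 codes
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    x1, x0 = x[y == 1], x[y == 0]
    n1, n0 = len(x1), len(x0)
    n = n1 + n0
    sx = x.std(ddof=1)  # sample standard deviation of all X values
    return (x1.mean() - x0.mean()) / sx * np.sqrt(n1 * n0 / (n * (n - 1)))

With the sample standard deviation (ddof=1) this expression coincides with the Pearson correlation between X and the 0/1 codes, which is what scipy.stats.pointbiserialr returns.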

Point biserial correlation

The subsequent modification of the product-moment correlation coefficient is reflected in the point-biserial r. This statistic shows the relationship between two variables, one of which is assumed to be continuous and normally distributed, while the other is discrete in the strict sense of the word. The point-biserial correlation coefficient is denoted r_pbis. Since in r_pbis the dichotomy reflects the true nature of the discrete variable, rather than being artificial as in the case of r_bis, its sign is determined arbitrarily. Therefore, for all practical purposes, r_pbis is considered in the range from 0.00 to +1.00.

There is also the case where two variables are assumed to be continuous and normally distributed, but both are artificially dichotomized, as in biserial correlation. To assess the relationship between such variables, the tetrachoric correlation coefficient r_tet is used, which was also derived by Pearson. The basic (exact) formulas and procedures for calculating r_tet are quite complex. Therefore, in practice, approximations of r_tet obtained on the basis of abbreviated procedures and tables are used.

/on-line/dictionary/dictionary.php?term=511

THE POINT-BISERIAL COEFFICIENT is the correlation coefficient between two variables, one measured on a dichotomous scale and the other on an interval scale. It is used in classical and modern testing as an indicator of the quality of a test item: its reliability and consistency with the overall test score.

To correlate variables measured on a dichotomous and an interval scale, the point-biserial correlation coefficient is used.
The point-biserial correlation coefficient is a method of correlation analysis for the relationship between variables, one of which is measured on a nominal scale and takes only 2 values (for example, men/women, correct answer/wrong answer, feature present/absent), and the other on a ratio or interval scale. Formula for calculating the point-biserial correlation coefficient:

r_pb = ((m₁ − m₀) / σ_x) · √(n₁·n₀ / (n(n − 1))),

where:
m₁ and m₀ are the average values of X for objects with a value of 1 or 0 on Y;
σ_x is the standard deviation of all values of X;
n₁, n₀ are the numbers of X values with 1 or 0 on Y;
n is the total number of pairs of values.

Most often, this type of correlation coefficient is used to calculate the relationship between test items and the total scale. This is one type of validity check.

39. Rank-biserial correlation coefficient.

About correlation in general, see question No. 36, p. 56 (64), 063.JPG

harchenko-korranaliz.pdf p. 28

The rank-biserial correlation coefficient, used in cases where one of the variables (X) is presented on an ordinal scale and the other (Y) is dichotomous, is calculated by the formula

r_rb = 2(x̄₁ − x̄₀) / n.

Here x̄₁ is the average rank of objects having a one on Y; x̄₀ is the average rank of objects having a zero on Y; n is the sample size.

Testing the significance hypothesis for the rank-biserial correlation coefficient is carried out in the same way as for the point-biserial coefficient, using Student's t-test with rpb replaced by rrb in the formulas.

In cases where one variable is measured on a dichotomous scale (variable X) and the other on a rank scale (variable Y), the rank-biserial correlation coefficient is used. Recall that the variable X, measured on a dichotomous scale, takes only two values (codes), 0 and 1. We especially emphasize: although this coefficient varies in the range from −1 to +1, its sign does not matter for interpreting the results. This is another exception to the general rule.

This coefficient is calculated using the formula

r_rb = 2(X̄₁ − X̄₀) / N,

where X̄₁ is the average rank over those elements of the variable Y that correspond to code (feature) 1 in the variable X;

X̄₀ is the average rank over those elements of the variable Y that correspond to code (feature) 0 in the variable X;

N is the total number of elements in the variable X.

To apply the rank-biserial correlation coefficient, the following conditions must be met:

1. The variables being compared must be measured on different scales: one, X, on a dichotomous scale; the other, Y, on a ranking scale.

2. The number of varying features in the compared variables X and Y must be the same.

3. To assess the reliability level of the rank-biserial correlation coefficient, use formula (11.9) and the table of critical values for the Student criterion with k = n − 2.

http://psystat.at.ua/publ/drugie_vidy_koehfficienta_korreljacii/1-1-0-38

Cases where one of the variables is presented on a dichotomous scale and the other on a rank (ordinal) scale require the rank-biserial correlation coefficient:

r_rb = (2/n) · (m̄₁ − m̄₀),

where:
n is the number of measurement objects;
m̄₁ and m̄₀ are the average ranks of objects with 1 or 0 on the second variable.

This coefficient is also used when checking the validity of tests.
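A minimal sketch, assuming SciPy (the variable roles follow the definition above):

import numpy as np
from scipy.stats import rankdata

def rank_biserial(x, y):
    # x: dichotomous 0/1 codes; y: values on an ordinal (rank) scale
    ranks = rankdata(y)
    x = np.asarray(x)
    m1 = ranks[x == 1].mean()   # average rank of objects coded 1
    m0 = ranks[x == 0].mean()   # average rank of objects coded 0
    return 2 * (m1 - m0) / len(ranks)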

40. Linear correlation coefficient.

For correlation in general (and linear correlation in particular), see question No. 36, p. 56 (64), 063.JPG

PEARSON'S COEFFICIENT r

r-Pearson (Pearson r) is used to study the relationship between two metric variables measured on the same sample. There are many situations in which its use is appropriate. Does intelligence affect academic performance in the senior university years? Is the size of an employee's salary related to his friendliness towards colleagues? Does a student's mood affect the success of solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample. The data for studying the relationship are then tabulated, as in the example below.

EXAMPLE 6.1

The table shows an example of initial data for measuring two indicators of intelligence (verbal and nonverbal) for 20 8th grade students.

The relationship between these variables can be depicted using a scatterplot (see Figure 6.3). The diagram shows that there is some relationship between the measured indicators: the greater the value of verbal intelligence, the (mostly) the greater the value of non-verbal intelligence.

Before giving the formula for the correlation coefficient, let us try to trace the logic of its derivation using the data of Example 6.1. The position of each i-th point (the subject with number i) on the scatter diagram relative to the other points (Fig. 6.3) can be specified by the values and signs of the deviations of the corresponding variable values from their means: (x_i − M_x) and (y_i − M_y). If the signs of these deviations coincide, this indicates a positive relationship (larger values of x correspond to larger values of y, or smaller values of x correspond to smaller values of y).

For subject No. 1, the deviations from the mean on both x and y are positive, and for subject No. 3 both deviations are negative. Consequently, the data of both indicate a positive relationship between the studied features. On the contrary, if the signs of the deviations from the means on x and y differ, this indicates a negative relationship between the features. Thus, for subject No. 4 the deviation from the mean on x is negative and on y positive, and for subject No. 9 it is the other way around.

Thus, if the product of deviations (x_i − M_x)(y_i − M_y) is positive, the data of the i-th subject indicate a direct (positive) relationship, and if it is negative, an inverse (negative) relationship. Accordingly, if x and y are on the whole related in direct proportion, most of the products of deviations will be positive, and if they are inversely related, most of the products will be negative. Therefore, an overall indicator of the strength and direction of the relationship can be the sum of all products of deviations for a given sample:

Σ(x_i − M_x)(y_i − M_y).

With a directly proportional relationship between the variables, this value is large and positive: for most subjects the deviations coincide in sign (large values of one variable correspond to large values of the other, and vice versa). If x and y are inversely related, then for most subjects larger values of one variable will correspond to smaller values of the other; the signs of the products will be negative, and the sum of the products as a whole will also be large in absolute value but negative in sign. If there is no systematic connection between the variables, the positive terms (products of deviations) will be balanced by the negative ones, and the sum of all products of deviations will be close to zero.

So that the sum of the products does not depend on the sample size, it is enough to average it. But we are interested in the measure of interconnection not as a general parameter but as a computed estimate of it, a statistic. Therefore, as in the dispersion formula, we divide the sum of products of deviations not by N but by N − 1. The result is a measure of connection, widely used in physics and the technical sciences, called the covariance:

cov_xy = Σ(x_i − M_x)(y_i − M_y) / (N − 1).


In psychology, unlike physics, most variables are measured on arbitrary scales, since psychologists are interested not in the absolute value of a feature but in the relative position of subjects in a group. Besides, covariance is very sensitive to the scale (the variance) on which the features are measured. To make the measure of connection independent of the units of measurement of the two features, it is enough to divide the covariance by the product of the corresponding standard deviations. This yields the formula for the Pearson correlation coefficient:

r_xy = cov_xy / (σ_x · σ_y),

or, after substituting the expressions for σ_x and σ_y:

r_xy = Σ(x_i − M_x)(y_i − M_y) / √(Σ(x_i − M_x)² · Σ(y_i − M_y)²).

If the values of both variables are converted to z-values using the formula

z = (x − M_x) / σ_x,

then the formula for the r-Pearson correlation coefficient looks simpler (071.JPG):

r_xy = Σ(z_x · z_y) / (N − 1).
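A minimal numerical check of this equivalence (the data are illustrative):

import numpy as np

x = np.array([5.0, 7.0, 9.0, 4.0, 8.0])
y = np.array([110.0, 130.0, 170.0, 100.0, 160.0])

# Standardize each series (sample SD, ddof=1), then average the products
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = np.sum(zx * zy) / (len(x) - 1)

print(r, np.corrcoef(x, y)[0, 1])  # the two values coincide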

/dict/sociology/article/soc/soc-0525.htm

LINEAR CORRELATION is a statistical linear relationship of a non-causal nature between two quantitative variables x and y. It is measured using the Pearson linear correlation coefficient, the result of dividing the covariance by the standard deviations of both variables:

r_xy = s_xy / (s_x · s_y), with s_xy = Σ(x_i − x̄)(y_i − ȳ) / (n − 1),

where s_xy is the covariance between the variables x and y;

s_x, s_y are the standard deviations of the variables x and y;

x_i, y_i are the values of the variables x and y for the object with number i;

x̄, ȳ are the arithmetic means of the variables x and y.

The Pearson coefficient r can take values from the interval [−1; +1]. The value r = 0 means there is no linear relationship between the variables x and y (but it does not exclude a nonlinear statistical relationship). Positive coefficient values (r > 0) indicate a direct linear relationship; the closer the value is to +1, the stronger the direct relationship. Negative coefficient values (r < 0) indicate an inverse linear relationship; the closer the value is to −1, the stronger the inverse relationship. The values r = ±1 mean the presence of a complete linear relationship, direct or inverse. In the case of a complete relationship, all points with coordinates (x_i, y_i) lie on the straight line y = a + bx.

"Coefficient K.L." Pearson is also used to measure the strength of connection in a linear pairwise regression model.

41. Correlation matrix and correlation graph.

About correlation in general, see question No. 36, p. 56 (64), 063.JPG

Correlation matrix. Often, correlation analysis includes the study of relationships between not two, but many variables measured on a quantitative scale in one sample. In this case, correlations are calculated for each pair of this set of variables. The calculations are usually carried out on a computer, and the result is a correlation matrix.

A correlation matrix (Correlation Matrix) is the result of calculating correlations of one type for each pair of a set of P variables measured on a quantitative scale in one sample.

EXAMPLE

Suppose we are studying the relationships between 5 variables (v1, v2, ..., v5; P = 5) measured on a sample of N = 30 people. Below are the table of source data and the correlation matrix.

Source data:

Correlation matrix:

It is easy to notice that the correlation matrix is square and symmetric about the main diagonal (since r_ij = r_ji), with ones on the main diagonal (since r_ii = 1).

The correlation matrix is square: the number of rows and columns equals the number of variables. It is symmetric about the main diagonal, since the correlation of x with y equals the correlation of y with x. Ones lie on its main diagonal, since the correlation of a feature with itself equals one. Consequently, not all elements of the correlation matrix are subject to analysis, but only those that lie above or below the main diagonal.

The number of correlation coefficients to be analyzed when studying the relationships of P features is determined by the formula P(P − 1)/2. In the above example, the number of such correlation coefficients is 5(5 − 1)/2 = 10.
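A minimal sketch with NumPy (random data stand in for the v1..v5 example):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))      # N = 30 cases, P = 5 variables

R = np.corrcoef(data, rowvar=False)  # 5 x 5 matrix: symmetric, ones on the diagonal
P = R.shape[0]
print(P * (P - 1) // 2)              # 10 coefficients to analyze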

The main task of analyzing a correlation matrix is to identify the structure of the relationships among many features. Visual analysis of correlation galaxies is possible here: a graphic image of the structure of the statistically significant connections, if there are not very many of them (up to 10-15). Another way is to use multivariate methods: multiple regression, factor, or cluster analysis (see the section "Multivariate methods..."). Using factor or cluster analysis, it is possible to identify groupings of variables that are more closely related to each other than to other variables. A combination of these methods is also very effective, for example, when there are many features and they are not homogeneous.

Comparison of correlations is an additional task of analyzing the correlation matrix, and it has two variants. If it is necessary to compare correlations in one of the rows of the correlation matrix (for one of the variables), the comparison method for dependent samples is used (pp. 148-149). When comparing correlations of the same name calculated for different samples, the comparison method for independent samples is used (pp. 147-148).

Methods for comparing correlations on the diagonals of a correlation matrix (to assess the stationarity of a random process) and for comparing several correlation matrices obtained for different samples (for homogeneity) are labor-intensive and beyond the scope of this book. You can get acquainted with these methods in the book by G. V. Sukhodolsky.

The problem of the statistical significance of correlations. The problem is that the procedure of statistical hypothesis testing assumes a single test carried out on a single sample. If the same method is applied many times, even to different variables, the probability of obtaining a result purely by chance increases. In general, if we repeat the same hypothesis-testing method k times on different variables or samples, then with a fixed level α we are guaranteed to obtain confirmation of the hypothesis in about α·k cases.

Suppose a correlation matrix is analyzed for 15 variables, that is, 15(15 − 1)/2 = 105 correlation coefficients are calculated. To test the hypotheses, the level α = 0.05 is set. Testing the hypothesis 105 times, we will obtain its confirmation about five times (!), regardless of whether the connection actually exists. Knowing this and having, say, 15 "statistically significant" correlation coefficients, can we tell which of them were obtained by chance and which reflect a real relationship?

Strictly speaking, to make a statistical decision it would be necessary to reduce the level α by as many times as the number of hypotheses being tested. But this is hardly advisable, since the probability of ignoring a really existing connection (making a Type II error) grows unpredictably.

The correlation matrix alone is not a sufficient basis for statistical conclusions about the individual correlation coefficients it contains!

There is only one truly convincing way to solve this problem: divide the sample randomly into two halves and take into account only those correlations that are statistically significant in both halves of the sample. An alternative may be the use of multivariate methods (factor, cluster, or multiple regression analysis) to identify and subsequently interpret groups of statistically significantly related variables.
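A minimal sketch of this split-half check, assuming SciPy (the threshold and data layout are illustrative):

import numpy as np
from scipy.stats import pearsonr

def stable_pairs(data, alpha=0.05, seed=0):
    # Keep only the variable pairs significant in BOTH random halves of the sample
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    half1, half2 = data[idx[: len(data) // 2]], data[idx[len(data) // 2 :]]
    kept = []
    for i in range(data.shape[1]):
        for j in range(i + 1, data.shape[1]):
            _, p1 = pearsonr(half1[:, i], half1[:, j])
            _, p2 = pearsonr(half2[:, i], half2[:, j])
            if p1 < alpha and p2 < alpha:
                kept.append((i, j))
    return kept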

The problem of missing values. If there are missing values in the data, two options are possible for calculating the correlation matrix: (a) row-by-row deletion of values (Exclude cases listwise); (b) pairwise deletion of values (Exclude cases pairwise).

With row-by-row (listwise) deletion, the entire row for an object (subject) that has at least one missing value on one of the variables is deleted. This method leads to a "correct" correlation matrix in the sense that all coefficients are calculated on the same set of objects. However, if the missing values are distributed randomly across the variables, this method can leave not a single object in the data set (each row will contain at least one missing value).

To avoid this situation, another method, called pairwise deletion, is used. It considers only the gaps in each selected pair of variables and ignores gaps in the other variables. The correlation for a pair of variables is calculated over those objects that have no gaps in that pair. In many situations, especially when the number of gaps is relatively small, say 10%, and the gaps are distributed fairly randomly, this method does not lead to serious errors. However, sometimes it does. For example, a systematic bias (shift) in the estimates may be hidden by a systematic arrangement of omissions, which causes the correlation coefficients built on different subsets (for example, on different subgroups of objects) to differ.

Another problem with a correlation matrix computed with pairwise deletion of gaps arises when this matrix is used in other kinds of analysis (for example, multiple regression or factor analysis). They assume a "correct" correlation matrix with a certain level of consistency and "agreement" among the various coefficients. Using a matrix with "bad" (biased) estimates leads either to the program being unable to analyze such a matrix, or to erroneous results. Therefore, if the pairwise method of excluding missing data is used, it is necessary to check whether there are systematic patterns in the distribution of the gaps.

If pairwise deletion of missing data does not lead to any systematic shift in the means and variances (standard deviations), these statistics will be similar to those calculated with row-by-row deletion. If a significant difference is observed, there is reason to suspect a shift in the estimates. For example, if the mean (or standard deviation) of the values of variable A used in computing its correlation with variable B is much smaller than the mean (or standard deviation) of the values of variable A used in computing its correlation with variable C, then there is every reason to expect that these two correlations (A-B and A-C) are based on different subsets of the data. There will be a bias in the correlations caused by the non-random placement of gaps in the variables' values.

Analysis of correlation galaxies. After the problem of the statistical significance of the elements of the correlation matrix has been solved, the statistically significant correlations can be represented graphically as a correlation galaxy (pleiad). A correlation galaxy is a figure consisting of vertices and the lines connecting them. The vertices correspond to the features and are usually designated by numbers (the variable numbers). The lines correspond to statistically significant connections and graphically express the sign, and sometimes the p-level of significance, of the connection.

The correlation galaxy can reflect all the statistically significant connections of the correlation matrix (it is then sometimes called a correlation graph) or only a meaningfully selected part of them (for example, the connections corresponding to one factor according to the results of factor analysis).

EXAMPLE OF CONSTRUCTING A CORRELATION PLEIADE



  • In cases where the measurements of the features under study are made on an ordinal scale, or the form of the relationship differs from linear, the study of the connection between two random variables is carried out using rank correlation coefficients. Consider the Spearman rank correlation coefficient. To calculate it, the sample variants must be ranked (ordered). Ranking is the grouping of experimental data in a certain order, either ascending or descending.

    The ranking operation is carried out according to the following algorithm:

    1. A lower value is assigned a lower rank. The highest value is assigned a rank corresponding to the number of ranked values; the smallest value is assigned rank 1. For example, if n = 7, the highest value receives rank 7, except in the cases provided for by the second rule.

    2. If several values are equal, they are assigned a rank equal to the average of the ranks they would receive if they were not equal. As an example, consider an ascending-ordered sample of 7 elements: 22, 23, 25, 25, 25, 28, 30. The values 22 and 23 occur once each, so their ranks are R22 = 1 and R23 = 2. The value 25 occurs 3 times; if these values did not repeat, their ranks would be 3, 4, and 5, so their rank R25 equals the arithmetic mean of 3, 4, and 5: (3 + 4 + 5)/3 = 4. The values 28 and 30 do not repeat, so their ranks are R28 = 6 and R30 = 7. Finally, we have the following correspondence:

    value: 22 23 25 25 25 28 30
    rank:   1  2  4  4  4  6  7

    3. The total sum of ranks must coincide with the calculated one, determined by the formula:

    ΣR = n(n + 1)/2,

    where n is the total number of ranked values.

    A discrepancy between the actual and calculated rank sums will indicate an error made when calculating ranks or summing them up. In this case, you need to find and fix the error.
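    A minimal check of this procedure with SciPy's rankdata, which assigns average ranks to ties:

    from scipy.stats import rankdata

    sample = [22, 23, 25, 25, 25, 28, 30]
    ranks = rankdata(sample)
    print(ranks)                             # [1. 2. 4. 4. 4. 6. 7.]

    n = len(sample)
    assert ranks.sum() == n * (n + 1) / 2    # checksum: 28 = 7 * 8 / 2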

    Spearman's rank correlation coefficient is a method that allows one to determine the strength and direction of the relationship between two traits or two hierarchies of traits. The use of the rank correlation coefficient has a number of limitations:

    • a) The assumed correlation dependence must be monotonic.
    • b) The volume of each sample must be greater than or equal to 5. To determine the upper limit of the sample, use tables of critical values ​​(Table 3 of the Appendix). The maximum value of n in the table is 40.
    • c) During the analysis, it is likely that a large number of identical ranks may arise. In this case, an amendment must be made. The most favorable case is when both samples under study represent two sequences of divergent values.

    To conduct a correlation analysis, the researcher must have two samples that can be ranked, for example:

    • - two characteristics measured in the same group of subjects;
    • - two individual hierarchies of traits identified in two subjects using the same set of traits;
    • - two group hierarchies of characteristics;
    • - individual and group hierarchies of characteristics.

    We begin the calculation by ranking the studied indicators separately for each of the characteristics.

    Let us analyze a case with two signs measured in the same group of subjects. First, the individual values ​​obtained by different subjects are ranked according to the first characteristic, and then the individual values ​​are ranked according to the second characteristic. If lower ranks of one indicator correspond to lower ranks of another indicator, and higher ranks of one indicator correspond to greater ranks of another indicator, then the two characteristics are positively related. If higher ranks of one indicator correspond to lower ranks of another indicator, then the two characteristics are negatively related. To find rs, we determine the differences between the ranks (d) for each subject. The smaller the difference between the ranks, the closer the rank correlation coefficient rs will be to “+1”. If there is no relationship, then there will be no correspondence between them, hence rs will be close to zero. The greater the difference between the ranks of subjects on two variables, the closer to “-1” the value of the rs coefficient will be. Thus, the Spearman rank correlation coefficient is a measure of any monotonic relationship between the two characteristics under study.

    Let us consider the case with two individual hierarchies of traits identified in two subjects using the same set of traits. In this situation, the individual values obtained by each of the two subjects are ranked according to a certain set of characteristics. The feature with the lowest value must be assigned the first rank; the characteristic with a higher value, the second rank, and so on. Particular care should be taken to ensure that all attributes are measured in the same units. For example, it is impossible to rank indicators expressed in points of different "cost", since it is impossible to determine which factor will take first place in terms of severity until all values are brought to a single scale. If features that have low ranks in one of the subjects also have low ranks in the other, and vice versa, then the individual hierarchies are positively related.

    In the case of two group hierarchies of characteristics, the average group values ​​obtained in two groups of subjects are ranked according to the same set of characteristics for the studied groups. Next, we follow the algorithm given in previous cases.

    Let us analyze a case with an individual and group hierarchy of characteristics. They begin by ranking separately the individual values ​​of the subject and the average group values ​​according to the same set of characteristics that were obtained, excluding the subject who does not participate in the average group hierarchy, since his individual hierarchy will be compared with it. Rank correlation allows us to assess the degree of consistency of the individual and group hierarchy of traits.

    Let us consider how the significance of the correlation coefficient is determined in the cases listed above. In the case of two characteristics, it will be determined by the sample size. In the case of two individual feature hierarchies, the significance depends on the number of features included in the hierarchy. In the last two cases, significance is determined by the number of characteristics being studied, and not by the number of groups. Thus, the significance of rs in all cases is determined by the number of ranked values ​​n.

    When checking the statistical significance of rs, tables of critical values ​​of the rank correlation coefficient are used, compiled for different numbers of ranked values ​​and different levels of significance. If the absolute value of rs reaches or exceeds a critical value, then the correlation is reliable.

    When considering the first option (a case with two signs measured in the same group of subjects), the following hypotheses are possible.

    H0: The correlation between variables x and y is not different from zero.

    H1: The correlation between variables x and y is significantly different from zero.

    If we work with any of the three remaining cases, then it is necessary to put forward another pair of hypotheses:

    H0: The correlation between hierarchies x and y is not different from zero.

    H1: The correlation between hierarchies x and y is significantly different from zero.

    The sequence of actions when calculating the Spearman rank correlation coefficient rs is as follows.

    • - Determine which two features or two hierarchies of features will participate in the comparison as variables x and y.
    • - Rank the values ​​of the variable x, assigning rank 1 to the smallest value, in accordance with the ranking rules. Place the ranks in the first column of the table in order of test subjects or characteristics.
    • - Rank the values ​​of the variable y. Place the ranks in the second column of the table in order of test subjects or characteristics.
    • - Calculate the differences d between the ranks x and y for each row of the table. Place the results in the next column of the table.
    • - Calculate the squared differences (d2). Place the resulting values ​​in the fourth column of the table.
    • - Calculate the sum of squared differences Σd².
    • - If identical ranks occur, calculate the corrections:

    Tx = Σ(t_x³ − t_x)/12,
    Ty = Σ(t_y³ − t_y)/12,

    where t_x is the volume of each group of identical ranks in sample x, and t_y is the volume of each group of identical ranks in sample y.

    Calculate the rank correlation coefficient depending on the presence or absence of identical ranks. If there are no identical ranks, calculate rs using the formula:

    rs = 1 − 6·Σd² / (n(n² − 1)).

    If there are identical ranks, calculate rs using the formula:

    rs = 1 − 6·(Σd² + Tx + Ty) / (n(n² − 1)),

    where Σd² is the sum of squared differences between ranks; Tx and Ty are the corrections for tied ranks; n is the number of subjects or features participating in the ranking.

    Determine the critical values ​​of rs from Appendix Table 3 for a given number of subjects n. A significant difference from zero of the correlation coefficient will be observed provided that rs is not less than the critical value.
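    A minimal sketch implementing the steps listed above (assuming SciPy; the function name is illustrative):

    from collections import Counter
    from scipy.stats import rankdata

    def spearman_by_steps(x, y):
        rx, ry = rankdata(x), rankdata(y)                        # rank x, rank y
        n = len(rx)
        sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))       # d, d², Σd²
        tx = sum(t**3 - t for t in Counter(rx).values()) / 12    # correction Tx
        ty = sum(t**3 - t for t in Counter(ry).values()) / 12    # correction Ty
        return 1 - 6 * (sum_d2 + tx + ty) / (n * (n**2 - 1))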

    The Spearman rank correlation coefficient is a quantitative assessment of the statistical study of the connection between phenomena, used in nonparametric methods.

    The indicator shows how the sum of squared differences between ranks obtained during observation differs from the case of no connection.


    Spearman's rank correlation coefficient refers to indicators for assessing the closeness of communication. The qualitative characteristic of the closeness of the connection of the rank correlation coefficient, as well as other correlation coefficients, can be assessed using the Chaddock scale.


    Application area. The rank correlation coefficient is used to assess the quality of the connection between two populations. In addition, its statistical significance is used when analyzing data for heteroskedasticity.

    Example. Based on a sample of observed variables X and Y:

    1. create a ranking table;
    2. find Spearman's rank correlation coefficient and check its significance at the level 2α;
    3. assess the nature of the dependence
    Solution. Let's assign ranks to feature Y and factor X.
    X  | Y  | rank X, d_x | rank Y, d_y
    28 | 21 | 1  | 1
    30 | 25 | 2  | 2
    36 | 29 | 4  | 3
    40 | 31 | 5  | 4
    30 | 32 | 3  | 5
    46 | 34 | 6  | 6
    56 | 35 | 8  | 7
    54 | 38 | 7  | 8
    60 | 39 | 10 | 9
    56 | 41 | 9  | 10
    60 | 42 | 11 | 11
    68 | 44 | 12 | 12
    70 | 46 | 13 | 13
    76 | 50 | 14 | 14

    Rank matrix.
    rank X, d_x | rank Y, d_y | (d_x − d_y)²
    1  | 1  | 0
    2  | 2  | 0
    4  | 3  | 1
    5  | 4  | 1
    3  | 5  | 4
    6  | 6  | 0
    8  | 7  | 1
    7  | 8  | 1
    10 | 9  | 1
    9  | 10 | 1
    11 | 11 | 0
    12 | 12 | 0
    13 | 13 | 0
    14 | 14 | 0
    Σ = 105 | 105 | 10

    Checking the correctness of the matrix based on the checksum:

    ΣR = n(n + 1)/2 = 14 · 15/2 = 105.

    The column sums of the matrix equal each other and the checksum, so the matrix is composed correctly.

    Using the formula, we calculate the Spearman rank correlation coefficient:

    rs = 1 − 6 · 10 / (14 · (14² − 1)) = 1 − 60/2730 ≈ 0.978.

    The relationship between the feature Y and the factor X is strong and direct.
    Significance of Spearman's rank correlation coefficient.

    To test, at significance level α, the null hypothesis that the general Spearman rank correlation coefficient equals zero against the competing hypothesis H1: ρ ≠ 0, we calculate the critical point:

    T_kp = t(α, k) · √((1 − ρ²)/(n − 2)),

    where n is the sample size; ρ is the sample Spearman rank correlation coefficient; t(α, k) is the critical point of the two-sided critical region, found from the table of critical points of the Student distribution for significance level α and the number of degrees of freedom k = n − 2.

    If |ρ| < T_kp, there are no grounds to reject the null hypothesis: the rank correlation between the features is not significant. If |ρ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the features.

    Using the Student table we find t(α/2, k) = t(0.1/2; 12) = 1.782, so T_kp = 1.782 · √((1 − 0.978²)/12) ≈ 0.107.

    Since T_kp < ρ, we reject the hypothesis that the Spearman rank correlation coefficient equals 0. In other words, the rank correlation coefficient is statistically significant, and the rank correlation between the scores on the two tests is significant.
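    A minimal check of this significance test with SciPy (numbers from the example above):

    from math import sqrt
    from scipy.stats import t

    n, rs = 14, 0.978
    alpha = 0.1                                    # two-sided level used here
    t_crit = t.ppf(1 - alpha / 2, n - 2)           # ≈ 1.782
    T_kp = t_crit * sqrt((1 - rs**2) / (n - 2))    # ≈ 0.107
    print(abs(rs) > T_kp)                          # True: the correlation is significant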

    The calculator below computes Spearman's rank correlation coefficient between two random variables. The theory is given, as usual, below the calculator.


    The method of calculating Spearman's rank correlation coefficient is actually pretty simple. It is designed like the Pearson correlation coefficient, only computed not on the measured values of the random variables themselves but on their rank values.

    We only have to understand what a rank value is and why all this is necessary.

    If the elements of a variational series are arranged in ascending or descending order, the rank of an element is its number in the ordered series.

    For example, take the variational series (17, 26, 5, 14, 21). Sort its elements in descending order: (26, 21, 17, 14, 5). 26 has rank 1, 21 has rank 2, and so on. The series of rank values will look like this: (3, 1, 5, 4, 2).

    I.e., when calculating Spearman's coefficient, the initial variational series are converted into series of rank values, and then Pearson's formula is applied to them.

    There is one subtlety: the rank of repeating values is taken as the average of their ranks. That is, for the series (17, 15, 14, 15) the rank series will look like (1, 2.5, 4, 2.5), since the first 15 would get rank 2 and the second rank 3, and their average is (2 + 3)/2 = 2.5.

    If you don't have repeating values, that is, all the values of the rank series are the numbers between 1 and n, Pearson's formula can be simplified to

    r_s = 1 − 6·Σd² / (n(n² − 1)).

    By the way, this formula is often given as the formula for calculating Spearman's coefficient.

    What is the essence of the transition from the values themselves to their rank values? When investigating the correlation of rank values, you can find out how well the dependence between the two variables is described by a monotonic function.

    The sign of the coefficient indicates the direction of the relationship between the variables. If the sign is positive, the values of Y tend to increase as X increases. If the sign is negative, the values of Y tend to decrease as X increases. If the coefficient is 0, there is no tendency. If the coefficient equals 1 or −1, the relationship between X and Y has the form of a monotonic function: as X increases, Y also increases, or vice versa.

    That is, unlike the Pearson correlation coefficient, which can detect only a linear relationship of one variable with another, the Spearman correlation coefficient can detect a monotonic dependence where a direct linear relationship cannot be revealed.

    Let us explain with an example. Suppose we examine the function y = 10/x and have the following measurements of X and Y:

    {{1, 10}, {5, 2}, {10, 1}, {20, 0.5}, {100, 0.1}}

    For these data the Pearson correlation coefficient equals −0.4686, i.e., the relationship is weak or absent. But Spearman's correlation coefficient is strictly equal to −1, as if hinting to the researcher that Y has a strictly negative monotonic dependence on X.
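    A minimal check of this example with SciPy:

    from scipy.stats import pearsonr, spearmanr

    x = [1, 5, 10, 20, 100]
    y = [10, 2, 1, 0.5, 0.1]     # y = 10 / x

    print(pearsonr(x, y)[0])     # ≈ -0.4686: weak linear relationship
    print(spearmanr(x, y)[0])    # -1.0: perfect negative monotonic dependence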

    The rank correlation coefficient proposed by K. Spearman is a nonparametric measure of the relationship between variables measured on a rank scale. When calculating this coefficient, no assumptions about the nature of the distributions of the features in the population are required. This coefficient determines the degree of closeness of the connection between ordinal features, which in this case represent the ranks of the compared quantities.

    The Spearman correlation coefficient also lies in the range from −1 to +1. Like the Pearson coefficient, it can be positive or negative, characterizing the direction of the relationship between two features measured on a rank scale.

    In principle, the number of ranked features (qualities, traits, etc.) can be any, but ranking more than 20 features is difficult. Possibly that is why the table of critical values of the rank correlation coefficient is calculated only for forty ranked features (n ≤ 40; Table 20 of Appendix 6).

    Spearman's rank correlation coefficient is calculated using the formula:

    r_s = 1 − 6·ΣD² / (n(n² − 1)),

    where n is the number of ranked features (indicators, subjects);

    D is the difference between the ranks on the two variables for each subject;

    ΣD² is the sum of the squared rank differences.

    Using the rank correlation coefficient, consider the following example.

    Example: A psychologist finds out how individual indicators of readiness for school, obtained before the start of school among 11 first-graders, are related to each other and their average performance at the end of the school year.

    To solve this problem, we ranked, firstly, the values ​​of indicators of school readiness obtained upon admission to school, and, secondly, the final indicators of academic performance at the end of the year for these same students on average. We present the results in the table. 13.

    Table 13
    Student no. | Ranks of school readiness indicators | Average annual performance ranks

    We substitute the obtained data into the formula and perform the calculation. We get rs = 0.76.

    To find the significance level, we refer to Table 20 of Appendix 6, which gives the critical values for the rank correlation coefficients.

    We emphasize that in Table 20 of Appendix 6, as in the table for the linear Pearson correlation, all values of the correlation coefficients are given in absolute value. Therefore, the sign of the correlation coefficient is taken into account only when interpreting it.

    The significance levels in this table are found by the number n, i.e., by the number of subjects. In our case n = 11. For this number we find:

    0.61 for P ≤ 0.05;

    0.76 for P ≤ 0.01.

    We construct the corresponding ``significance axis'':

    The obtained correlation coefficient coincided with the critical value for the 1% significance level. Consequently, it can be argued that the indicators of school readiness and the final grades of first-graders are connected by a positive correlation: in other words, the higher the indicator of school readiness, the better the first-grader studies. In terms of statistical hypotheses, the psychologist must reject the null hypothesis of no difference and accept the alternative hypothesis that the relationship between the indicators of school readiness and average academic performance differs from zero.

    The case of identical (tied) ranks

    If there are tied ranks, the formula for calculating the Spearman rank correlation coefficient is slightly different. In this case, two new terms are added to the formula: corrections for tied ranks, which are added to the numerator of the calculation formula:

    D1 = (n³ − n)/12,
    D2 = (k³ − k)/12,

    where n is the number of identical ranks in the first column, and k is the number of identical ranks in the second column.

    If there are two groups of identical ranks in a column, the correction formula becomes somewhat more complicated:

    D = ((n³ − n) + (k³ − k))/12,

    where n is the number of identical ranks in the first group of the ranked column, and k is the number of identical ranks in the second group of the ranked column. The modification of the formula in the general case is:

    r_s = 1 − 6·(ΣD² + D1 + D2) / (n(n² − 1)).

    Example: A psychologist, using a school mental development test (SHTUR), studies intelligence in 12 ninth-grade students. At the same time, he asks the literature and mathematics teachers to rank these same students by indicators of mental development. The task is to determine how the objective indicators of mental development (the SHTUR data) and the teachers' expert assessments are related to each other.

    We present the experimental data of this problem and the additional columns necessary to calculate the Spearman correlation coefficient in the form of a table. 14.

    Table 14
    Student no. | Ranks of SHTUR testing | Expert assessments of mathematics teachers | Expert assessments of literature teachers | D (2nd and 3rd columns) | D (2nd and 4th columns) | D² (2nd and 3rd columns) | D² (2nd and 4th columns)

    Since tied ranks were used in the ranking, it is necessary to check the correctness of the ranking in the second, third, and fourth columns of the table. Summing each of these columns gives the same total, 78.

    We check against the calculation formula: Σ = n(n + 1)/2 = 12 · 13/2 = 78.

    The fifth and sixth columns of the table give, for each student, the rank differences between the psychologist's SHTUR assessments and the teachers' expert assessments in mathematics and literature, respectively. The sum of the rank differences must equal zero. Summing the D values in the fifth and sixth columns gave the desired result, so the subtraction of ranks was carried out correctly. A similar check must be done every time a complex type of ranking is carried out.

    Before starting the calculation by the formula, the corrections for tied ranks must be calculated for the second, third, and fourth columns of the table.

    In our case, the second column of the table contains two identical ranks; therefore, by the formula, the correction D1 is: D1 = (2³ − 2)/12 = 0.5.

    The third column contains three identical ranks; therefore, the correction D2 is: D2 = (3³ − 3)/12 = 2.

    The fourth column contains two groups of three identical ranks; therefore, the correction D3 is: D3 = 2 · (3³ − 3)/12 = 4.

    Before proceeding to solve the problem, let us recall that the psychologist is clarifying two questions - how the values ​​of ranks on the SHTUR test are related to expert assessments in mathematics and literature. That is why the calculation is carried out twice.

    We calculate the first rank correlation coefficient with the corrections, using the formula above.

    Now let us calculate it without the corrections.

    As we can see, the difference between the values of the correlation coefficients turned out to be very small.

    We calculate the second rank correlation coefficient with the corrections, using the formula above.

    Let us calculate it without the corrections as well.

    Again, the differences were very minor. Since the number of students is the same in both cases, we find the critical values from Table 20 of Appendix 6 at n = 12 for both correlation coefficients at once:

    0.58 for P ≤ 0.05;

    0.73 for P ≤ 0.01.

    We plot the first value on the ``significance axis'':

    In the first case, the obtained rank correlation coefficient is in the zone of significance. Therefore, the psychologist must reject the null hypothesis that the correlation coefficient is similar to zero and accept the alternative hypothesis that the correlation coefficient is significantly different from zero. In other words, the obtained result suggests that the higher the students’ expert assessments on the SHTUR test, the higher their expert assessments in mathematics.

    We plot the second value on the ``significance axis'':

    In the second case, the rank correlation coefficient is in the zone of uncertainty. Therefore, the psychologist can accept the null hypothesis that the correlation coefficient does not differ from zero and reject the alternative hypothesis that it differs significantly from zero. In this case, the result obtained suggests that the students' expert assessments on the SHTUR test are not related to the expert assessments in literature.

    To apply the Spearman correlation coefficient, the following conditions must be met:

    1. The variables being compared must be obtained on an ordinal (rank) scale, but can also be measured on an interval and ratio scale.

    2. The nature of the distribution of correlated quantities does not matter.

    3. The number of varying characteristics in the compared variables X and Y must be the same.

    Tables for determining the critical values ​​of the Spearman correlation coefficient (Table 20, Appendix 6) are calculated from the number of characteristics equal to n = 5 to n = 40, and with a larger number of compared variables, the table for the Pearson correlation coefficient should be used (Table 19, Appendix 6). Finding critical values ​​is carried out at k = n.