Algorithm for constructing an interval variation series with equal intervals.


Posted on http://www.allbest.ru/

TASK 1

We have the following data on the wages of employees in the enterprise:

Table 1.1

Wages, conventional monetary units

It is required to construct an interval distribution series and use it to find:

1) average salary;

2) average linear deviation;

4) standard deviation;

5) range of variation;

6) oscillation coefficient;

7) linear coefficient of variation;

8) simple coefficient of variation;

10) median;

11) coefficient of asymmetry;

12) Pearson asymmetry index;

13) kurtosis coefficient.

Solution

As is well known, variants (observed values of the attribute) arranged in ascending order form a discrete variation series. When the number of variants is large (more than 10), an interval series is constructed even for a discrete attribute.

If an interval series is constructed with equal intervals, the range of variation is divided by the chosen number of intervals. If the quotient is an integer (which is rare), the interval length is taken equal to it. Otherwise the quotient is rounded, always upward, so that the last retained digit is even. Increasing the interval length obviously expands the range of variation by the product of the number of intervals and the difference between the rounded and the initial interval length.

a) If the expansion of the range of variation is insignificant, it is either added to the largest value of the attribute or subtracted from the smallest;

b) If the expansion of the range of variation is appreciable, then, to avoid shifting the center of the range, it is split roughly in half: one half is added to the largest value of the attribute and the other is subtracted from the smallest.

If an interval series is constructed with unequal intervals, the procedure is simpler, but, as before, the interval lengths should be expressed as numbers whose last digit is even, which greatly simplifies the subsequent calculation of numerical characteristics.

n = 30 is the sample size.

We construct the interval distribution series, choosing the number of groups by the Sturges formula:

K = 1 + 3.32 * lg n,

where K is the number of groups and n is the sample size:

K = 1 + 3.32 * lg 30 = 5.91 ≈ 6

We find the range of the attribute x (the wages of employees at the enterprise) by the formula

R = xmax - xmin = 195 - 112 = 83

and divide it by 6. The interval length is then l = 83 : 6 = 13.83.

The first interval begins at 112. Adding l = 13.83 to 112 gives its upper boundary, 125.83, which is also the beginning of the second interval, and so on; the end of the sixth interval is 195.
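The construction above can be sketched in Python. This is a minimal illustration rather than the author's procedure: it uses only the figures quoted in the text (n = 30, xmin = 112, xmax = 195), rounds the Sturges value to the nearest integer, and the function name `interval_boundaries` is invented for the example.

```python
import math

def interval_boundaries(n, x_min, x_max):
    """Return (K, interval length, boundaries) for an equal-interval series."""
    k = round(1 + 3.32 * math.log10(n))   # Sturges formula, rounded to an integer
    length = (x_max - x_min) / k          # range of variation divided by K
    bounds = [x_min + i * length for i in range(k + 1)]
    bounds[-1] = x_max                    # guard against floating-point drift
    return k, length, bounds

k, length, bounds = interval_boundaries(30, 112, 195)
print(k, round(length, 2))
print([round(b, 2) for b in bounds])
```

The text rounds the length to 13.83 before adding it, so its inner boundaries can differ from the unrounded ones computed here in the last digit.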

When finding frequencies, one should be guided by the rule: "if the value of a feature coincides with the boundary of the internal interval, then it should be referred to the previous interval."

We obtain an interval series of frequencies and cumulative frequencies.

Table 1.2

Thus, 3 employees earn between 112 and 125.83 conventional monetary units. Only 6 employees earn the highest wages, between 181.15 and 195 conventional monetary units.

To calculate the numerical characteristics, we convert the interval series into a discrete one, taking the middle of the intervals as a variant:

Table 1.3

14131,83

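By the weighted arithmetic mean formula,

$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = 159.485$ conventional monetary units.

Mean linear deviation:

$L = \frac{\sum |x_i - \bar{x}| f_i}{\sum f_i}$,

where $x_i$ is the value of the studied attribute in the i-th unit of the population (here, the midpoint of the i-th interval), $f_i$ is its frequency, and $\bar{x}$ is the mean value of the studied attribute.

L = 17.765 conventional monetary units.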

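Variance:

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i} = \frac{14131.83}{30} = 471.06$

Standard deviation:

$\sigma = \sqrt{\sigma^2} = 21.704$ conventional monetary units.

Relative range of variation (coefficient of oscillation): $c = R : \bar{x}$

Relative linear deviation: $q = L : \bar{x}$

Coefficient of variation: $V = \sigma : \bar{x}$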

The oscillation coefficient shows the relative fluctuation of the extreme values ​​of the attribute around the arithmetic mean, and the coefficient of variation characterizes the degree and homogeneity of the population.

$c = R : \bar{x} = 83 / 159.485 \cdot 100\% = 52.043\%$

Thus, the range between the extreme values amounts to 52.043% of the average wage of employees at the enterprise, i.e. it is 47.957% (= 100% - 52.043%) less than the average.

$q = L : \bar{x} = 17.765 / 159.485 \cdot 100\% = 11.139\%$

$V = \sigma : \bar{x} = 21.704 / 159.485 \cdot 100\% = 13.609\%$

The coefficient of variation is less than 33%, which indicates a weak variation in the wages of employees in the enterprise, i.e. that the average is a typical characteristic of the wages of workers (homogeneous aggregate).
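The figures above can be reproduced with a short Python sketch. Table 1.2 did not survive in this copy, so the frequencies 3, 3, 5, 7, 6, 6 are a reconstruction and should be treated as an assumption: the first four and the last appear in the text, and the fifth follows from n = 30. The boundaries use the rounded step 13.83, with the last one set to 195.

```python
import math

# Interval boundaries built with the rounded step 13.83; the last one is 195.
bounds = [112, 125.83, 139.66, 153.49, 167.32, 181.15, 195]
freqs = [3, 3, 5, 7, 6, 6]   # assumed: reconstructed, not quoted from Table 1.2

mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]   # interval midpoints
n = sum(freqs)

mean = sum(x * f for x, f in zip(mids, freqs)) / n                  # weighted mean
lin_dev = sum(abs(x - mean) * f for x, f in zip(mids, freqs)) / n   # mean linear deviation
variance = sum((x - mean) ** 2 * f for x, f in zip(mids, freqs)) / n
sigma = math.sqrt(variance)                                         # standard deviation

value_range = bounds[-1] - bounds[0]
print(round(mean, 3), round(lin_dev, 3), round(sigma, 3))
# Oscillation, linear and simple coefficients of variation, in percent.
print(round(100 * value_range / mean, 3),
      round(100 * lin_dev / mean, 3),
      round(100 * sigma / mean, 3))
```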

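In an interval distribution series, the mode is determined by the formula

$Mo = x_{Mo} + h_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$,

where $f_{Mo}$ is the frequency of the modal interval, i.e. the interval containing the largest number of variants;

$f_{Mo-1}$ is the frequency of the interval preceding the modal one;

$f_{Mo+1}$ is the frequency of the interval following the modal one;

$h_{Mo}$ is the length of the modal interval;

$x_{Mo}$ is the lower bound of the modal interval.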

To determine the median in an interval series, we use the formula

$Me = x_{Me} + h_{Me} \cdot \frac{\frac{1}{2}\sum f_i - S_{Me-1}}{f_{Me}}$,

where $S_{Me-1}$ is the cumulative frequency of the interval preceding the median one;

$x_{Me}$ is the lower bound of the median interval;

$f_{Me}$ is the frequency of the median interval;

$h_{Me}$ is the length of the median interval.

The median interval is the first interval whose cumulative frequency (3 + 3 + 5 + 7 = 18) exceeds half the sum of frequencies (30 : 2 = 15); here it is (153.49; 167.32).
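A hedged sketch of the usual interpolation formulas for the mode and median of an interval series is given below. The frequencies are again the reconstructed set 3, 3, 5, 7, 6, 6 (an assumption, since the frequency table is missing from this copy), and the function names are hypothetical; the resulting values are therefore illustrative and are not quoted from the source.

```python
def grouped_mode(bounds, freqs):
    """Mode of an interval series; assumes a single, clearly modal interval."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal interval
    prev_f = freqs[m - 1] if m > 0 else 0
    next_f = freqs[m + 1] if m + 1 < len(freqs) else 0
    width = bounds[m + 1] - bounds[m]
    return bounds[m] + width * (freqs[m] - prev_f) / (
        (freqs[m] - prev_f) + (freqs[m] - next_f))

def grouped_median(bounds, freqs):
    """Median of an interval series by linear interpolation."""
    half = sum(freqs) / 2
    cum = 0   # cumulative frequency of the intervals before the current one
    for i, f in enumerate(freqs):
        if f > 0 and cum + f >= half:
            width = bounds[i + 1] - bounds[i]
            return bounds[i] + width * (half - cum) / f
        cum += f

bounds = [112, 125.83, 139.66, 153.49, 167.32, 181.15, 195]
freqs = [3, 3, 5, 7, 6, 6]   # assumed reconstruction of the missing table

mode = grouped_mode(bounds, freqs)
median = grouped_median(bounds, freqs)
print(round(mode, 2), round(median, 2))
```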

Let's calculate the skewness and kurtosis, for which we will compile a new worksheet:

Table 1.4

Factual data

Estimated data

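We calculate the third-order central moment:

$\mu_3 = \frac{\sum (x_i - \bar{x})^3 f_i}{\sum f_i}$

The asymmetry coefficient is therefore

$As = \frac{\mu_3}{\sigma^3}$

Since |As| = 0.3553 > 0.25, the asymmetry is considered significant.

We calculate the fourth-order central moment:

$\mu_4 = \frac{\sum (x_i - \bar{x})^4 f_i}{\sum f_i}$

The kurtosis is therefore

$Ex = \frac{\mu_4}{\sigma^4} - 3$

Since Ex < 0, the distribution is flat-topped.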
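The moment-based skewness and kurtosis can be checked with a small Python sketch. It relies on the same assumed frequencies (3, 3, 5, 7, 6, 6) as a stand-in for the missing worksheet, and `central_moment` is a helper name introduced only for this example.

```python
import math

bounds = [112, 125.83, 139.66, 153.49, 167.32, 181.15, 195]
freqs = [3, 3, 5, 7, 6, 6]   # assumed reconstruction, not quoted from the source

mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]
n = sum(freqs)
mean = sum(x * f for x, f in zip(mids, freqs)) / n

def central_moment(order):
    """Weighted central moment of the given order for the grouped data."""
    return sum((x - mean) ** order * f for x, f in zip(mids, freqs)) / n

sigma = math.sqrt(central_moment(2))
skewness = central_moment(3) / sigma ** 3        # asymmetry coefficient
kurtosis = central_moment(4) / sigma ** 4 - 3    # excess kurtosis

print(round(skewness, 4), round(kurtosis, 4))
```

Under these assumptions the skewness comes out negative with an absolute value of about 0.355, and the kurtosis is negative, in line with the conclusions stated in the text.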

The degree of skewness can also be determined using Pearson's asymmetry coefficient (As):

$As = \frac{\bar{x} - Mo}{\sigma}$,

where $\bar{x}$ is the arithmetic mean of the distribution series, Mo is the mode, and $\sigma$ is the standard deviation.

For a symmetric (normal) distribution $\bar{x}$ = Mo, so the asymmetry coefficient is zero. If As > 0, then $\bar{x}$ is greater than the mode, so there is right-sided asymmetry.

If As < 0, then $\bar{x}$ is less than the mode, so there is left-sided asymmetry. The asymmetry coefficient can vary from -3 to +3.

The distribution is not symmetrical, but has a left-sided asymmetry.

TASK 2

What should be the sample size so that there is a probability of 0.954 that the sampling error does not exceed 0.04 if the variance is known from previous surveys to be 0.24?

Solution

The sample size for non-repetitive sampling is calculated by the formula:

$n = \frac{t^2 \sigma^2 N}{\Delta_x^2 N + t^2 \sigma^2}$,

where t is the confidence factor (for a probability of 0.954 it equals 2.0; determined from tables of the probability integral),

$\sigma^2$ = 0.24 is the variance;

N = 10000 people is the size of the general population;

$\Delta_x$ = 0.04 is the marginal error of the sample mean.

$n = \frac{2^2 \cdot 0.24 \cdot 10000}{0.04^2 \cdot 10000 + 2^2 \cdot 0.24} = \frac{9600}{16.96} \approx 566$

With a probability of 95.4%, it can be stated that a sample providing a marginal error of no more than 0.04 should contain at least 566 families.
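The calculation can be sketched in Python under the assumption that the standard formula for non-repetitive (without replacement) sampling is used, with a population of 10000, variance 0.24, t = 2 and margin 0.04 as given in the text. The function name is hypothetical.

```python
def sample_size_without_replacement(t, variance, population, margin):
    """Required sample size for estimating a mean, non-repetitive sampling."""
    return (t ** 2 * variance * population) / (
        margin ** 2 * population + t ** 2 * variance)

n_required = sample_size_without_replacement(
    t=2.0, variance=0.24, population=10_000, margin=0.04)
print(round(n_required, 2))
```

The text reports the value rounded to 566; rounding up to the next whole unit, which is the more cautious convention, would give 567.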

TASK 3

The following data are available on income from the main activity of the enterprise, million rubles.

To analyze a series of dynamics, determine the following indicators:

1) chain and base:

absolute increases;

growth rates;

rates of increase;

2) average:

level of the time series;

absolute increase;

growth rate;

rate of increase;

3) the absolute value of a 1% increase.

Solution

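1. The absolute increase ($\Delta y$) is the difference between a given level of the series and the previous (or base) level:

chain: $\Delta y = y_i - y_{i-1}$,

base: $\Delta y = y_i - y_0$,

where $y_i$ is the level of the series,

i is the number of the level,

$y_0$ is the level of the base year.

2. The growth rate ($T_g$) is the ratio of a given level of the series to the previous one (or to the level of the base year, 2001), expressed in %:

chain: $T_g = \frac{y_i}{y_{i-1}} \cdot 100\%$;

base: $T_g = \frac{y_i}{y_0} \cdot 100\%$

3. The rate of increase ($T_{inc}$) is the ratio of the absolute increase to the previous (or base) level, expressed in %:

chain: $T_{inc} = \frac{y_i - y_{i-1}}{y_{i-1}} \cdot 100\%$;

base: $T_{inc} = \frac{y_i - y_0}{y_0} \cdot 100\%$

4. The absolute value of a 1% increase (A) is the ratio of the chain absolute increase to the chain rate of increase expressed in %:

$A = \frac{y_i - y_{i-1}}{T_{inc}} = 0.01 \cdot y_{i-1}$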
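The chain and base indicators defined above can be illustrated with a short Python sketch. The income table is missing from this copy, so the four levels below are hypothetical values chosen only for illustration; they are not the source data.

```python
# Hypothetical income levels for four consecutive years, million rubles.
levels = [150.0, 153.0, 157.0, 160.0]

base = levels[0]
rows = []
for prev, cur in zip(levels, levels[1:]):
    rows.append({
        "chain_increase": cur - prev,                          # y_i - y_(i-1)
        "base_increase": cur - base,                           # y_i - y_0
        "chain_growth_rate": cur / prev * 100,                 # percent
        "base_growth_rate": cur / base * 100,                  # percent
        "chain_rate_of_increase": (cur - prev) / prev * 100,   # percent
        "base_rate_of_increase": (cur - base) / base * 100,    # percent
        "value_of_1_percent": 0.01 * prev,                     # A = 0.01 * y_(i-1)
    })

for row in rows:
    print({key: round(value, 3) for key, value in row.items()})
```

Dividing each chain increase by the corresponding chain rate of increase gives the same number as 0.01 times the previous level, which is why the last entry needs no separate formula.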

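The average level of the series is calculated by the arithmetic mean formula:

$\bar{y} = \frac{\sum y_i}{n}$

Average level of income from core activities for 4 years:

The average absolute increase is calculated by the formula:

$\overline{\Delta y} = \frac{y_n - y_0}{n - 1}$,

where n is the number of levels in the series, $y_n$ is the final level, and $y_0$ is the initial level.

On average, income from core activities increased by 3.333 million rubles per year.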

The average annual growth rate is calculated by the geometric mean formula:

$\bar{T}_g = \sqrt[n-1]{\frac{y_n}{y_0}} \cdot 100\%$,

where $y_n$ is the final level of the series,

$y_0$ is the initial level of the series.

$\bar{T}_g$ = 102.174%

The average annual rate of increase is calculated by the formula:

$\bar{T}_{inc} = \bar{T}_g - 100\% = 102.174\% - 100\% = 2.174\%$.

Thus, on average, income from the main activity of the enterprise increased by 2.174% per year.
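The average indicators can be sketched as follows. Since the original data table is missing, the levels are hypothetical; they were chosen so that the first and last of four levels differ by 10 million rubles, which is consistent with the average absolute increase of 3.333 quoted in the text, but they are not the source data.

```python
# Hypothetical income levels for four consecutive years, million rubles.
levels = [150.0, 153.0, 157.0, 160.0]
n = len(levels)

avg_level = sum(levels) / n                                        # arithmetic mean
avg_increase = (levels[-1] - levels[0]) / (n - 1)                  # average absolute increase
avg_growth_rate = (levels[-1] / levels[0]) ** (1 / (n - 1)) * 100  # geometric mean, percent
avg_rate_of_increase = avg_growth_rate - 100                       # percent

print(round(avg_level, 3), round(avg_increase, 3))
print(round(avg_growth_rate, 2), round(avg_rate_of_increase, 2))
```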

TASK 4

Calculate:

1. Individual price indices;

2. General turnover index;

3. Aggregate price index;

4. Aggregate index of the physical volume of the sale of goods;

5. The absolute increase in the value of turnover and decompose by factors (due to changes in prices and the number of goods sold);

6. Make brief conclusions on all the indicators obtained.

Solution

1. By condition, the individual price indices for products A, B and C are

ipA = 1.20; ipB = 1.15; ipC = 1.00.

2. The general turnover index is calculated by the formula:

$I_{pq} = \frac{\sum p_1 q_1}{\sum p_0 q_0} = 1470 / 1045 \cdot 100\% = 140.67\%$

Turnover increased by 40.67% (140.67% - 100%).

3. The aggregate price index:

$I_p = \frac{\sum p_1 q_1}{\sum p_0 q_1} = 1470 / 1333.478 \cdot 100\% = 110.24\%$

On average, commodity prices rose by 10.24%.

The additional expenditure of buyers caused by the price increase:

$\Delta w(p) = \sum p_1 q_1 - \sum p_0 q_1 = 1470 - 1333.478 = 136.522$ million rubles.

As a result of rising prices, buyers had to spend an additional 136.522 million rubles.

4. The general index of the physical volume of trade:

$I_q = \frac{\sum p_0 q_1}{\sum p_0 q_0} = 1333.478 / 1045 \cdot 100\% = 127.61\%$

The physical volume of trade increased by 27.61%.

5. We determine the total change in turnover in the second period compared with the first:

Δw = 1470 - 1045 = 425 million rubles,

including due to price changes:

Δw(p) = 1470 - 1333.478 = 136.522 million rubles,

and due to the change in physical volume:

Δw(q) = 1333.478 - 1045 = 288.478 million rubles.

The turnover of goods increased by 40.67%. Prices on average for 3 goods increased by 10.24%. The physical volume of trade increased by 27.61%.

In general, the volume of sales increased by 425 million rubles, including due to rising prices, it increased by 136.522 million rubles, and due to an increase in sales volumes - by 288.478 million rubles.
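The index system and the factor decomposition can be verified with a few lines of Python. The three aggregate sums are taken from the text; the variable names are chosen for this sketch only.

```python
# Aggregate sums quoted in the text, million rubles.
sum_p1q1 = 1470.0      # current prices, current quantities
sum_p0q1 = 1333.478    # base prices, current quantities
sum_p0q0 = 1045.0      # base prices, base quantities

turnover_index = sum_p1q1 / sum_p0q0
price_index = sum_p1q1 / sum_p0q1
volume_index = sum_p0q1 / sum_p0q0

delta_total = sum_p1q1 - sum_p0q0
delta_price = sum_p1q1 - sum_p0q1     # effect of price changes
delta_volume = sum_p0q1 - sum_p0q0    # effect of physical volume changes

# The indices multiply and the absolute effects add up.
assert abs(turnover_index - price_index * volume_index) < 1e-12
assert abs(delta_total - (delta_price + delta_volume)) < 1e-9

print(round(turnover_index * 100, 2), round(price_index * 100, 2), round(volume_index * 100, 2))
print(round(delta_total, 3), round(delta_price, 3), round(delta_volume, 3))
```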

TASK 5

For 10 plants in one industry, the following data are available.

Factory No.

Output, thousand pieces (X)

Based on the given data:

1) to confirm the conclusions of the logical analysis about the presence of a linear correlation between the factor attribute (output volume) and the resultant attribute (electricity consumption), plot the initial data on a correlation field, draw conclusions about the form of the relationship and give its formula;

2) determine the parameters of the connection equation and plot the resulting theoretical line on the graph of the correlation field;

3) calculate the linear correlation coefficient,

4) explain the values ​​of the indicators obtained in paragraphs 2) and 3);

5) using the obtained model, make a forecast about the possible consumption of electricity at a plant with a production volume of 4.5 thousand units.

Solution

Let the factor attribute, output volume, be denoted by $x_i$ and the resultant attribute, electricity consumption, by $y_i$; the points with coordinates (x, y) are plotted on the correlation field OXY.

The points of the correlation field lie along a straight line. The relationship is therefore linear, and we look for the regression equation in the form of a straight line $\hat{y}_x = ax + b$. To find it, we use the system of normal equations:

$\begin{cases} a \sum x_i^2 + b \sum x_i = \sum x_i y_i \\ a \sum x_i + b n = \sum y_i \end{cases}$

We compile a calculation table.

Based on the averages found, we set up the system and solve it for the parameters a and b:

Thus, we obtain the regression equation of y on x: $\hat{y}_x = 3.57692 x + 3.19231$

We plot the regression line on the correlation field.

Substituting the x values from column 2 into the regression equation, we obtain the calculated values $\hat{y}_x$ (column 7) and compare them with the observed y, which is reflected in column 8. The correctness of the calculations is also confirmed by the coincidence of the average values of y and $\hat{y}_x$.

The linear correlation coefficient evaluates the closeness of the relationship between the attributes x and y and is calculated by the formula

$r = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\sigma_x \sigma_y}$

The slope a of the regression line (the coefficient of x) characterizes the direction of the identified dependence between the attributes: for a > 0 they change in the same direction, for a < 0 in opposite directions. Its absolute value measures the change in the resultant attribute when the factor attribute changes by one unit of measurement.

The sign of the free term of the regression line reveals the direction, and its absolute value the quantitative measure, of the influence of all other factors on the resultant attribute.

If a < 0, the resource of the factor attribute of an individual object is used with lower efficiency, and if it is > 0, with higher efficiency than the average for the entire set of objects.

Let's do a post-regression analysis.

1. The coefficient of x in the regression line is 3.57692 > 0; therefore, as output increases (decreases), electricity consumption increases (falls). An increase in output of 1 thousand pieces gives an average increase in electricity consumption of 3.57692 thousand kWh.

2. The free term of the regression line equals 3.19231; therefore, the influence of other factors increases the effect of output on electricity consumption in absolute terms by 3.19231 thousand kWh.

3. The correlation coefficient of 0.8235 reveals a very close dependence of electricity consumption on output.

Predictions are easy to make with the regression equation. To do this, values of x (the volume of output) are substituted into the regression equation, and electricity consumption is predicted. The values of x can be taken not only within the observed range but also outside it.

We forecast the possible electricity consumption at a plant with a production volume of 4.5 thousand units:

3.57692 * 4.5 + 3.19231 = 19.28845 thousand kWh.
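The least-squares fit and the correlation coefficient can be sketched in Python as below. The plant data table is missing from this copy, so `xs` and `ys` are hypothetical values used only to exercise the functions; the forecast at the end uses the coefficients stated in the text. The function names are invented for the example.

```python
import math

def fit_line(xs, ys):
    """Solve the normal equations for y = a*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b

def correlation(xs, ys):
    """Linear correlation coefficient from means and standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    mxy = sum(x * y for x, y in zip(xs, ys)) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return (mxy - mx * my) / (sdx * sdy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical output, thousand pieces
ys = [7.0, 10.0, 14.0, 18.0, 21.0]    # hypothetical consumption, thousand kWh
a, b = fit_line(xs, ys)
r = correlation(xs, ys)
print(round(a, 4), round(b, 4), round(r, 4))

# Forecast with the coefficients reported in the text.
forecast = 3.57692 * 4.5 + 3.19231
print(round(forecast, 5))
```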



Lab #1

in Mathematical Statistics

Topic: Primary processing of experimental data

3. Scoring

5. Control questions

6. Methodology for performing the laboratory work

Objective

Acquisition of skills of primary processing of empirical data by methods of mathematical statistics.

On the basis of a set of experimental data, perform the following tasks:

Task 1. Construct an interval variation series of the distribution.

Task 2. Construct a histogram of the frequencies of the interval variation series.

Task 3. Compose the empirical distribution function and plot its graph.

Task 4. Find the numerical characteristics of the variation series:

a) mode and median;

b) conditional initial moments;

c) sample mean;

d) sample variance, corrected population variance, corrected standard deviation;

e) coefficient of variation;

f) asymmetry;

g) kurtosis;

Task 5. Determine the boundaries of the true values ​​of the numerical characteristics of the random variable under study with a given reliability.

Task 6. Meaningful interpretation of the results of primary processing according to the condition of the problem.

Scoring

Tasks 1-5: 6 points

Task 6: 2 points

Lab defense (oral interview on the control questions and the laboratory work): 2 points

The work is submitted in writing on A4 sheets and includes:

1) Title page (Appendix 1)

2) Initial data.

3) Presentation of work according to the specified sample.

4) Calculation results (performed manually and/or using MS Excel) in the specified order.

5) Conclusions - a meaningful interpretation of the results of primary processing according to the condition of the problem.

6) Oral interview on work and control questions.



5. Control questions


Methodology for performing laboratory work

Task 1. Construct an interval variation series of distribution

To present statistical data as a variation series with equal intervals, it is necessary to:

1. Find the smallest and largest values, $x_{min}$ and $x_{max}$, in the original data table.

2. Determine the range of variation: $R = x_{max} - x_{min}$.

3. Determine the interval length h. If the sample contains up to 1000 values, use the formula $h = \frac{R}{1 + 3.322 \lg n}$, where n is the sample size (the number of values in the sample) and lg n is the decimal logarithm of n.

The calculated ratio is rounded to a convenient integer value.

4. To determine the beginning of the first interval for an even number of intervals, it is recommended to take the value ; and for an odd number of intervals .

5. Write down the grouping intervals in ascending order of their boundaries:

$(a_0; a_1), (a_1; a_2), \ldots, (a_{k-1}; a_k)$,

where $a_0$ is the lower bound of the first interval. A convenient number no greater than $x_{min}$ is taken for $a_0$; the upper bound of the last interval, $a_k$, must be no less than $x_{max}$. It is recommended that the intervals cover all the initial values of the random variable and that there be from 5 to 20 intervals.

6. Distribute the initial data over the grouping intervals, i.e. count from the original table how many values of the random variable fall within each interval. If some values coincide with interval boundaries, they are assigned consistently either only to the preceding or only to the following interval.

Remark 1. The intervals need not be of equal length. Where the values are denser, it is more convenient to take shorter intervals, and where they are sparser, longer ones.

Remark 2. If zero or very small frequencies are obtained for some intervals, the data must be regrouped with larger intervals (a larger step h).
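The grouping steps above can be sketched in Python. This is a simplified illustration under stated assumptions: the interval length is not rounded to a convenient value, the first interval starts exactly at the smallest observation, boundary values are always assigned to the following interval (except the maximum, which stays in the last one), and the function name and data are hypothetical.

```python
import math

def interval_series(data, k=None):
    """Group data into k equal intervals and count the frequencies."""
    n = len(data)
    lo, hi = min(data), max(data)
    if k is None:
        k = round(1 + 3.322 * math.log10(n))   # number of intervals, Sturges rule
    h = (hi - lo) / k                          # interval length (not rounded here)
    edges = [lo + i * h for i in range(k + 1)]
    edges[-1] = hi                             # guard against floating-point drift
    counts = [0] * k
    for x in data:
        idx = min(int((x - lo) / h), k - 1)    # the maximum stays in the last interval
        counts[idx] += 1
    return edges, counts

data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 9]          # hypothetical sample
edges, counts = interval_series(data, k=4)
print(edges)
print(counts)
```

If some counts turn out to be zero or very small, calling the function again with a smaller `k` corresponds to the regrouping described in Remark 2.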

Once statistical observation data characterizing a phenomenon are available, they must first of all be put in order, i.e. given a systematic form.

The English statistician W. J. Reichmann said figuratively of unordered aggregates that facing a mass of ungeneralized data is like being thrown into a forest thicket without a compass. What does the systematization of statistical data in the form of distribution series consist of?

A statistical distribution series is an ordered statistical population (Table 17). The simplest kind of statistical distribution series is a ranked series, i.e. a series of values of the varying attribute arranged in ascending or descending order. Such a series does not allow us to judge the patterns inherent in the data: around which value most of the observations are grouped, what the deviations from this value are, and what the general pattern of the distribution is. For this purpose the data are grouped, showing how often individual observations occur in their total number (Scheme 1).

Table 17

General view of statistical distribution series

Scheme 1. Scheme of statistical distribution series

A distribution series of population units by attributes that have no quantitative expression is called an attribute series (for example, the distribution of enterprises by their line of production).

Distribution series of population units by attributes that have a quantitative expression are called variation series. In such series the values of the attribute (variants) are arranged in ascending or descending order.

A variation distribution series has two elements: variants and frequencies. A variant is an individual value of the grouping attribute; a frequency is a number showing how many times each variant occurs.

In mathematical statistics one more element of the variation series is calculated, the relative frequency. It is defined as the ratio of the frequency of a given interval to the sum of all frequencies and is expressed in fractions of a unit, in percent (%) or in per mille (‰).

Thus, a variation distribution series is a series in which the variants are arranged in ascending or descending order and their frequencies or relative frequencies are indicated. Variation series are either discrete (discontinuous) or interval (continuous).

Discrete variation series are distribution series in which the variant, as the value of a quantitative attribute, can take only certain definite values. The variants differ from each other by one or more units.

Thus, the number of parts produced per shift by a particular worker can be expressed only by one definite number (6, 10, 12, etc.). An example of a discrete variation series is the distribution of workers by the number of parts produced (Table 18).

Table 18

Discrete distribution series

Interval (continuous) variation series are distribution series in which the values of the variants are given as intervals, i.e. the values of the attribute can differ from each other by an arbitrarily small amount. When constructing a variation series for a continuous attribute, it is impossible to list every value of the variants, so the population is distributed over intervals. The intervals may or may not be equal. For each of them, frequencies or relative frequencies are indicated (Table 19).

In interval distribution series with unequal intervals, characteristics such as the distribution density and the relative distribution density on a given interval are calculated. The first is the ratio of the frequency to the width of the interval; the second is the ratio of the relative frequency to the width of the same interval. For the example above, the distribution density in the first interval is 3 : 5 = 0.6, and the relative density in this interval is 7.5 : 5 = 1.5%.
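As a quick arithmetic check, the two densities for the first interval can be computed directly. The inputs (frequency 3, relative frequency 7.5%, width 5) are the ones quoted in the paragraph; the function name is invented for this sketch.

```python
def interval_densities(freq, rel_freq_percent, width):
    """Return (distribution density, relative density in percent per unit width)."""
    return freq / width, rel_freq_percent / width

density, rel_density = interval_densities(freq=3, rel_freq_percent=7.5, width=5)
print(density, rel_density)
```

With these inputs 7.5 : 5 evaluates to 1.5.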

Table 19

Interval distribution series

Mathematical statistics is a branch of mathematics devoted to mathematical methods for processing, systematizing and using statistical data to draw scientific and practical conclusions.

3.1. BASIC CONCEPTS OF MATHEMATICAL STATISTICS

In biomedical problems, it is often necessary to study the distribution of some attribute over a very large number of individuals. The attribute takes different values for different individuals, so it is a random variable. For example, any therapeutic drug has different efficacy in different patients. However, to get an idea of the effectiveness of the drug, it is not necessary to administer it to all patients. One can follow the results of using the drug in a relatively small group of patients and, from the data obtained, identify the essential features (efficacy, contraindications) of the treatment process.

General population - a set of homogeneous elements to be studied, characterized by some attribute. This attribute is a continuous random variable with distribution density f(x).

For example, if we are interested in the prevalence of a disease in a certain region, the general population is the entire population of the region. If we want to find out the susceptibility of men and women to this disease separately, two general populations should be considered.

To study the properties of the general population, some part of its elements is selected.

Sample - the part of the general population selected for examination (treatment).

Where this causes no confusion, the term sample is used both for the set of objects selected for examination and for the set of values of the attribute under study obtained during the examination. These values can be presented in several ways.

Simple statistical series - the values ​​of the trait under study, recorded in the order in which they were obtained.

An example of a simple statistical series obtained by measuring the surface wave velocity (m/s) in the forehead skin of 20 patients is shown in Table. 3.1.

Table 3.1. Simple statistical series

A simple statistical series is the basic and most complete way of recording survey results. It can contain hundreds of elements, and such a set is very hard to take in at a glance. Large samples are therefore usually divided into groups. To do this, the range of the attribute is divided into several (N) intervals of equal width, and the relative frequencies ($n_i/n$) with which the attribute falls into these intervals are calculated. The width of each interval is:

$d = \frac{x_{max} - x_{min}}{N}$

The boundaries of the intervals have the following values:

$x_{min},\ x_{min} + d,\ x_{min} + 2d,\ \ldots,\ x_{min} + Nd = x_{max}$

If an element of the sample lies on the boundary between two adjacent intervals, it is assigned to the left interval. Data grouped in this way are called an interval statistical series.

An interval statistical series is a table that shows the intervals of values of the attribute and the relative frequencies with which the attribute falls into these intervals.

In our case, we can form, for example, such an interval statistical series (N = 5, d= 4), tab. 3.2.

Table 3.2.Interval statistical series

Here, two values ​​equal to 28 are assigned to the interval 28-32 (Table 3.1), and the values ​​32, 33, 34 and 35 are assigned to the interval 32-36.

An interval statistical series can be represented graphically. To do this, intervals of characteristic values ​​are plotted along the abscissa axis, and on each of them, as on the basis, a rectangle is built with a height equal to the relative frequency. The resulting bar chart is called histogram.

Fig. 3.1. Histogram

The histogram shows the statistical patterns of the distribution of the attribute quite clearly.

With a large sample size (several thousand) and narrow columns, the shape of the histogram is close to the graph of the distribution density of the attribute.

The number of columns of the histogram can be selected using the following formula:

Building a histogram manually is a long process. Therefore, computer programs have been developed for their automatic construction.

3.2. NUMERICAL CHARACTERISTICS OF STATISTICAL SERIES

Many statistical procedures use sample estimates for the mean and variance (or standard deviation) of the population.

The sample mean ($\bar{X}$) is the arithmetic mean of all elements of a simple statistical series:

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$

For our example $\bar{X}$ = 37.05 (m/s).

The sample mean is the best estimate of the general mean M.

The sample variance $s^2$ equals the sum of the squared deviations of the elements from the sample mean, divided by n - 1:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2$ (3.3)

In our example, $s^2$ = 25.2 (m/s)².

Note that in calculating the sample variance the denominator of the formula is not the sample size n but n - 1. This is because the deviations in formula (3.3) are computed not from the unknown mathematical expectation but from its estimate, the sample mean.

The sample variance is the best estimate of the general variance ($\sigma^2$).

The sample standard deviation (s) is the square root of the sample variance:

$s = \sqrt{s^2}$

For our example s = 5.02 (m/s).

The sample standard deviation is the best estimate of the general standard deviation ($\sigma$).

As the sample size increases without limit, all sample characteristics tend to the corresponding characteristics of the general population.

Sample characteristics are calculated with computer tools. In Excel, these calculations are performed by the statistical functions AVERAGE, VAR and STDEV.
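Outside a spreadsheet, the same three sample characteristics can be obtained with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator. The data below are hypothetical and serve only to show the calls; they are not the values of Table 3.1.

```python
import statistics

# Hypothetical measurements; not the data of Table 3.1.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_mean = statistics.mean(values)
sample_variance = statistics.variance(values)   # divides by n - 1
sample_stdev = statistics.stdev(values)         # square root of the sample variance

print(sample_mean, round(sample_variance, 4), round(sample_stdev, 4))
```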

3.3. INTERVAL ESTIMATE

All sample characteristics are random variables. This means that for another sample of the same size the values of the sample characteristics will be different. Thus, sample characteristics are only estimates of the corresponding characteristics of the general population.

The shortcomings of sample (point) estimation are compensated by interval estimation, which gives a numerical interval inside which the true value of the estimated parameter lies with a given probability $P_d$.

Let $U_r$ be some parameter of the general population (general mean, general variance, etc.).

An interval estimate of the parameter $U_r$ is an interval ($U_1$, $U_2$) satisfying the condition:

$P(U_1 < U_r < U_2) = P_d$. (3.5)

The probability $P_d$ is called the confidence probability.

The confidence probability $P_d$ is the probability that the true value of the estimated quantity lies inside the specified interval.

The interval ($U_1$, $U_2$) is called the confidence interval for the estimated parameter.

Often, instead of the confidence probability, the associated quantity α = 1 - $P_d$ is used, which is called the significance level.

The significance level is the probability that the true value of the estimated parameter lies outside the confidence interval.

Sometimes α and $P_d$ are expressed as percentages, for example 5% instead of 0.05 and 95% instead of 0.95.

In interval estimation, one first chooses an appropriate confidence probability (usually 0.95 or 0.99) and then finds the corresponding interval of values of the estimated parameter.

We note some general properties of interval estimates.

1. The lower the significance level (the larger $P_d$), the wider the interval estimate. So, if at a significance level of 0.05 the interval estimate of the general mean is 34.7 < M < 39.4, then for the level 0.01 it will be much wider: 33.85 < M < 40.25.

2. The larger the sample size n, the narrower the interval estimate at the chosen significance level. Suppose, for example, that the 5% estimate of the general mean (α = 0.05) obtained from a sample of 20 elements is 34.7 < M < 39.4.

By increasing the sample size to 80, we get a more accurate estimate at the same significance level: 35.5 < M < 38.6.

In the general case, the construction of reliable confidence estimates requires knowledge of the law according to which the estimated random feature is distributed in the general population. Let us consider how an interval estimate is constructed for the general mean of a trait that is distributed in the general population according to the normal law.

3.4. INTERVAL ESTIMATE OF THE GENERAL MEAN FOR THE NORMAL DISTRIBUTION LAW

The construction of an interval estimate of the general mean M for a general population with a normal distribution law is based on the following property. For a sample of size n, the ratio

$$t = \frac{\bar{X} - M}{s / \sqrt{n}}$$

obeys the Student distribution with ν = n - 1 degrees of freedom.

Here $\bar{X}$ is the sample mean and s is the sample standard deviation.

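Using Student's distribution tables or their computer analogue, one can find a boundary value t such that, with a given confidence probability, the following inequality is satisfied:

$$\left|\frac{\bar{X} - M}{s / \sqrt{n}}\right| < t$$

This inequality corresponds to the following inequality for M:

$$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$$

where $\bar{X}$ is the sample mean and ε is the half-width of the confidence interval.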

Thus, the construction of a confidence interval for M is carried out in the following sequence.

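1. Choose the confidence probability $P_d$ (usually 0.95 or 0.99) and, from the Student's distribution table, find the corresponding parameter t.

2. Calculate the half-width of the confidence interval ε from the sample standard deviation s and the sample size n:

$$\varepsilon = \frac{t \cdot s}{\sqrt{n}} \qquad (3.6)$$

3. Obtain the interval estimate of the general mean with the chosen confidence probability, where $\bar{X}$ is the sample mean:

$$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$$

Briefly it is written like this:

$$M = \bar{X} \pm \varepsilon, \quad P_d$$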

Computer procedures have been developed to find interval estimates.

Let's explain how to use the Student's distribution table. This table has two "entries": the left column gives the number of degrees of freedom ν = n - 1, and the top row gives the significance level α. The Student's coefficient t is found at the intersection of the corresponding row and column.

Let's apply this method to our sample. A fragment of the Student's distribution table is presented below.

Table 3.3. Fragment of Student's distribution table

A simple statistical series for a sample of 20 people (n = 20, ν = 19) is presented in Table 3.1. For this series, calculations using formulas (3.1-3.3) give: sample mean 37.05; s = 5.02.

Let's choose α = 0.05 ($P_d$ = 0.95). At the intersection of row "19" and column "0.05" we find t = 2.09.

Let us calculate the estimation accuracy by formula (3.6): ε = 2.09 · 5.02 / √20 = 2.34.

Let's build an interval estimate: with a probability of 95%, the unknown general mean satisfies the inequality:

37.05 - 2.34 < M < 37.05 + 2.34, or M = 37.05 ± 2.34 (m/s), $P_d$ = 0.95.
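The calculation above can be reproduced with a short Python sketch. The function name `confidence_interval` is hypothetical; the Student coefficient 2.09 is taken from the text rather than computed, and the half-width is assumed to be t · s / √n, as in the worked arithmetic. Direct evaluation gives a half-width of about 2.35, which agrees with the quoted 2.34 to within rounding.

```python
import math


def confidence_interval(sample_mean, sample_std, n, t_value):
    """Return (lower bound, upper bound, half-width) for the general mean.

    t_value is the Student coefficient read from a table for the chosen
    significance level and n - 1 degrees of freedom.
    """
    half_width = t_value * sample_std / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width, half_width


# Values quoted in the text: mean 37.05, s = 5.02, n = 20, t = 2.09
low, high, eps = confidence_interval(37.05, 5.02, 20, 2.09)
print(f"{low:.2f} < M < {high:.2f}, half-width {eps:.2f}")
```

Rounded to one decimal place, the bounds are 34.7 and 39.4.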

3.5. METHODS FOR VERIFICATION OF STATISTICAL HYPOTHESES

Statistical hypotheses

Before formulating what a statistical hypothesis is, consider the following example.

To compare two methods of treating a certain disease, two groups of 20 patients each were selected and treated by these methods. For each patient, the number of procedures after which a positive effect was achieved was recorded. From these data, the sample means, sample variances ($s^2$) and sample standard deviations (s) were found for each group.

The results are presented in table. 3.4.

Table 3.4

The number of procedures required to obtain a positive effect is a random variable, all information about which is currently contained in the above sample.

Table 3.4 shows that the sample mean in the first group is less than in the second. Does this mean that the same relation holds for the general means: $M_1 < M_2$? Are the statistical data sufficient for such a conclusion? These questions are answered by statistical testing of hypotheses.

A statistical hypothesis is an assumption about the properties of general populations.

We will consider hypotheses about the properties of two general populations.

If the populations have a known, identical distribution of the quantity being estimated, and the assumptions concern the values of some parameter of this distribution, then the hypotheses are called parametric. For example, samples are drawn from populations with a normal distribution law and equal variances, and it is required to find out whether the general means of these populations are the same.

If nothing is known about the distribution laws of the general populations, then hypotheses about their properties are called nonparametric. For example: are the distribution laws of the populations from which the samples are taken the same?

Null and alternative hypotheses.

The task of testing hypotheses. Significance level

Let's get acquainted with the terminology used in hypothesis testing.

$H_0$, the null hypothesis (the skeptic's hypothesis), is the hypothesis that there is no difference between the compared samples. The skeptic believes that the differences between the sample estimates obtained in the research are random;

$H_1$, the alternative hypothesis (the optimist's hypothesis), is the hypothesis that there are differences between the compared samples. The optimist believes that the differences between the sample estimates are caused by objective reasons and correspond to differences between the general populations.

Testing of statistical hypotheses is feasible only when the elements of the compared samples can be used to form some quantity (a criterion) whose distribution law is known when $H_0$ is true. Then, for this quantity, one can specify a confidence interval into which its value falls with a given probability $P_d$. In this text that interval is called the critical region (many textbooks reserve this term for the complementary rejection region). If the criterion value falls into the critical region, the hypothesis $H_0$ is accepted. Otherwise, the hypothesis $H_1$ is accepted.

In medical research, $P_d$ = 0.95 or $P_d$ = 0.99 is used. These values correspond to the significance levels α = 0.05 and α = 0.01.

When testing statistical hypotheses, the significance level (α) is the probability of rejecting the null hypothesis when it is true.

Note that, at its core, the hypothesis testing procedure is aimed at detecting differences, not at confirming their absence. When the criterion value goes beyond the critical region, we can say to the “skeptic” with a clear conscience: well, what more do you want?! If there were no differences, then with a probability of 95% (or 99%) the calculated value would lie within the specified limits. But it does not!

If, however, the value of the criterion falls into the critical region, there is still no reason to conclude that the hypothesis $H_0$ is true. Most likely this points to one of two possible causes.

1. Sample sizes are not large enough to detect differences. It is likely that continued experimentation will bring success.

2. There are differences. But they are so small that they are of no practical importance. In this case, the continuation of experiments does not make sense.

Let's move on to consider some of the statistical hypotheses used in medical research.

3.6. HYPOTHESES TESTING ON EQUALITY OF VARIANCES, FISHER F-CRITERION

In some clinical studies, a positive effect is evidenced not so much by the magnitude of the parameter under study as by its stabilization, that is, a reduction of its fluctuations. In this case, the question arises of comparing two general variances based on the results of a sample survey. This task can be solved using Fisher's criterion.

Formulation of the problem

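Two samples were obtained from populations with a normal distribution law. The sample sizes are $n_1$ and $n_2$, and the sample variances are $s_1^2$ and $s_2^2$, respectively. It is required to compare the general variances.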

Tested hypotheses:

$H_0$: the general variances are the same;

$H_1$: the general variances are different.

It has been shown that if samples are drawn from populations with a normal distribution law, then, if the hypothesis $H_0$ is true, the ratio of the sample variances obeys the Fisher distribution. Therefore, the value F calculated by the following formula is taken as the criterion for testing $H_0$:

$$F = \frac{s_1^2}{s_2^2}$$

where $s_1^2$ and $s_2^2$ are the sample variances.

This ratio obeys the Fisher distribution with $\nu_1 = n_1 - 1$ degrees of freedom of the numerator and $\nu_2 = n_2 - 1$ degrees of freedom of the denominator. The boundaries of the critical region are found from tables of Fisher's distribution or with the computer function BRASPOBR (a Russian-locale Excel function; its counterpart in English-language Excel is FINV).

For the example presented in Table 3.4, we get: $\nu_1 = \nu_2$ = 20 - 1 = 19; F = 2.16/4.05 = 0.53. At α = 0.05, the left and right boundaries of the critical region are 0.40 and 2.53, respectively.

The criterion value fell into the critical region, so the hypothesis $H_0$ is accepted: the general variances are the same.
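The decision in this example can be sketched in Python. The helper names `f_ratio` and `within_bounds` are hypothetical; the variances 2.16 and 4.05 and the boundaries 0.40 and 2.53 are the values quoted in the text and are not recomputed here. The sketch follows the convention of this text, in which a value between the boundaries supports $H_0$.

```python
def f_ratio(var1, var2):
    """Ratio of two sample variances, used as the test statistic."""
    return var1 / var2


def within_bounds(value, left, right):
    """True when the statistic lies between the tabulated boundaries."""
    return left <= value <= right


# Sample variances and boundaries quoted in the text (alpha = 0.05, 19 and 19 d.f.)
F = f_ratio(2.16, 4.05)
h0_retained = within_bounds(F, 0.40, 2.53)
print(round(F, 2), h0_retained)
```

With equal degrees of freedom in the numerator and denominator, the left boundary is the reciprocal of the right one (1/2.53 is approximately 0.40), which is a quick consistency check on the tabulated values.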

3.7. HYPOTHESIS TESTING REGARDING EQUALITY OF AVERAGES, STUDENT'S t-test

The problem of comparing the means of two general populations arises when it is the magnitude of the trait under study that matters, for example, when comparing the duration of treatment by two different methods or the number of complications that arise when using them. In this case, Student's t-test can be used.

Formulation of the problem

Two samples ($X_1$) and ($X_2$) were obtained from populations with a normal distribution law and the same variance. The sample sizes are $n_1$ and $n_2$, the sample means are $\bar{X}_1$ and $\bar{X}_2$, and the sample variances are $s_1^2$ and $s_2^2$, respectively. It is required to compare the general means.

Tested hypotheses:

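$H_0$: the general means are the same;

$H_1$: the general means are different.

It has been shown that if the hypothesis $H_0$ is true, the value t calculated from the sample means $\bar{X}_1$, $\bar{X}_2$ and the sample variances $s_1^2$, $s_2^2$ by the formula

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\nu_1 s_1^2 + \nu_2 s_2^2}{\nu_1 + \nu_2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (3.10)$$

is distributed according to Student's law with $\nu = \nu_1 + \nu_2 = n_1 + n_2 - 2$ degrees of freedom.

Here $\nu_1 = n_1 - 1$ is the number of degrees of freedom for the first sample and $\nu_2 = n_2 - 1$ is the number of degrees of freedom for the second sample.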

The boundaries of the critical region are found from tables of the t-distribution or with the computer function STUDRASP (a Russian-locale Excel function; in English-language Excel, boundary values are returned by TINV). The Student distribution is symmetric about zero, so the left and right boundaries of the critical region are equal in absolute value and opposite in sign.

For the example presented in Table 3.4, we get:

$\nu_1 = \nu_2$ = 20 - 1 = 19; ν = 38, t = -2.51. With α = 0.05 the boundary value is 2.02.

The criterion value goes beyond the left boundary of the critical region, so we accept the hypothesis $H_1$: the general means are different. Moreover, the general mean of the population represented by the first sample is the smaller one.
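A minimal Python sketch of the t statistic follows, assuming the standard pooled-variance form of Student's two-sample statistic. The function name `pooled_t` is hypothetical. The variances 2.16 and 4.05 and the sample sizes are those quoted in the text, but the two means (10.0 and 11.4) are hypothetical placeholders, since the values of Table 3.4 are not reproduced here; they were chosen only so that the result is comparable with the quoted t.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample Student statistic with a pooled variance estimate.

    Returns the statistic and the total number of degrees of freedom.
    """
    nu1, nu2 = n1 - 1, n2 - 1
    # Weighted average of the two sample variances
    pooled_var = (nu1 * var1 + nu2 * var2) / (nu1 + nu2)
    # Standard error of the difference of the sample means
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, nu1 + nu2


# Means are hypothetical; variances and sizes are the ones quoted in the text
t, nu = pooled_t(10.0, 2.16, 20, 11.4, 4.05, 20)
print(round(t, 2), nu)
```

The statistic is then compared with the tabulated boundary for ν degrees of freedom (2.02 in the example of the text).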

Applicability of Student's t-test

Student's t-test applies only to samples from normal populations with equal general variances. If at least one of these conditions is violated, the applicability of the criterion is doubtful. The requirement of normality of the general population is usually ignored with a reference to the central limit theorem. Indeed, the difference of the sample means in the numerator of (3.10) can be considered normally distributed for ν > 30. But the question of the equality of variances is not subject to verification, and references to the fact that the Fisher test did not detect differences cannot be taken into account. Nevertheless, the t-test is widely used to detect differences in population means, although without sufficient justification.

Below we consider a nonparametric criterion that is successfully used for the same purposes and requires neither normality nor equality of variances.

3.8. NONPARAMETRIC COMPARISON OF TWO SAMPLES: THE MANN-WHITNEY TEST

Nonparametric criteria are designed to detect differences in the distribution laws of two general populations. Criteria that are sensitive to differences in the general means are called shift criteria. Criteria that are sensitive to differences in the general variances are called scale criteria. The Mann-Whitney test is a shift criterion and is used to detect differences in the means of two populations whose samples are presented on a rank scale. The measured values are arranged on this scale in ascending order and then numbered with the integers 1, 2, ... These numbers are called ranks. Equal values are assigned equal ranks. What matters is not the value of the attribute itself but only the ordinal place it occupies among the other values.

In Table 3.5 the first group from Table 3.4 is presented in expanded form (line 1) and ranked (line 2), and then the ranks of equal values are replaced by their arithmetic mean. For example, the elements 4 and 4 in the first row were given ranks 2 and 3, which were then replaced with the same value 2.5.

Table 3.5
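The ranking rule described above (equal values share the arithmetic mean of their ranks) can be sketched in Python as follows. The function name `average_ranks` and the input list are hypothetical and do not reproduce Table 3.5.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values receive the mean of their ranks."""
    # Indices of the values in ascending order of value
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values equal to the one at 'pos'
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Mean of the 1-based positions pos + 1 .. end + 1
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Hypothetical values: the two 4s would get ranks 2 and 3, so both receive 2.5
ranks = average_ranks([3, 4, 4, 6, 7])
print(ranks)  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

The ranks are returned in the order of the input, so the function can be applied to a combined sample and the ranks of each original sample picked out by position.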

Formulation of the problem

Independent samples ($X_1$) and ($X_2$) are drawn from populations with unknown distribution laws. The sample sizes are $n_1$ and $n_2$, respectively. The values of the sample elements are presented on a rank scale. It is required to check whether these general populations differ from each other.

Tested hypotheses:

$H_0$: the samples belong to the same general population; $H_1$: the samples belong to different general populations.

To test such hypotheses, the Mann-Whitney U test is used.

First, a combined sample (X) is made from two samples, the elements of which are ranked. Then the sum of the ranks corresponding to the elements of the first sample is found. This sum is the criterion for testing hypotheses.

U= The sum of the ranks of the first sample. (3.11)

For independent samples with sizes larger than 20, the value U obeys a normal distribution whose mathematical expectation and standard deviation are equal to:

$$\mu = \frac{n_1 (n_1 + n_2 + 1)}{2}, \qquad \sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

Therefore, the boundaries of the critical region are found from the normal distribution tables.

For the example presented in Table 3.4, we get: $\nu_1 = \nu_2$ = 20 - 1 = 19, U = 339, μ = 410, σ = 37. For α = 0.05 we get the boundaries: left = 338 and right = 482.

The value of the criterion goes beyond the left boundary of the critical region, so the hypothesis $H_1$ is accepted: the general populations have different distribution laws. Moreover, the general mean of the population represented by the first sample is the smaller one.
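The parameters of the normal approximation and the boundaries quoted in this example can be checked with a short Python sketch. It assumes the standard expressions for the mean and standard deviation of a rank sum under $H_0$, namely $n_1(n_1+n_2+1)/2$ and $\sqrt{n_1 n_2 (n_1+n_2+1)/12}$, and the two-sided normal quantile 1.96 for α = 0.05. The function names are hypothetical.

```python
import math


def rank_sum_normal_params(n1, n2):
    """Mean and standard deviation of the first sample's rank sum under H0."""
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return mu, sigma


def two_sided_bounds(mu, sigma, z=1.96):
    """Symmetric boundaries mu -/+ z * sigma (z = 1.96 assumes alpha = 0.05)."""
    return mu - z * sigma, mu + z * sigma


# Two samples of 20 elements each, as in the example of the text
mu, sigma = rank_sum_normal_params(20, 20)
left, right = two_sided_bounds(mu, sigma)
print(mu, round(sigma), round(left), round(right))
```

For $n_1 = n_2 = 20$ this reproduces the quoted values μ = 410 and σ = 37 and, after rounding, the boundaries 338 and 482.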

When processing large amounts of information, which is especially important in modern scientific work, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, there are no problems: one only needs to count the frequency of each value of the feature. If the trait under study is continuous (which is more common in practice), then choosing the optimal number of intervals for grouping the feature is by no means a trivial task.

To group continuous random variables, the entire variation range of the feature is divided into a certain number of intervals k.

A grouped interval (continuous) variation series is a set of intervals ranked by the value of the feature, given together with the corresponding frequencies $m_i$ (the number of observations that fell into the i-th interval) or relative frequencies:

Intervals of feature values | Frequencies $m_i$
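As a minimal sketch of such a grouping, the Python function below splits the variation range into k intervals of equal width and counts the observations in each. The function name `interval_series`, the convention that the largest value is placed in the last interval, and the data are assumptions made for illustration only.

```python
def interval_series(values, k):
    """Group values into k equal-width intervals and count the frequencies."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; the variation range is zero")
    width = (high - low) / k
    counts = [0] * k
    for x in values:
        # The maximum value would give index k, so it is put in the last interval
        idx = min(int((x - low) / width), k - 1)
        counts[idx] += 1
    bounds = [(low + i * width, low + (i + 1) * width) for i in range(k)]
    return bounds, counts


# Hypothetical continuous data
data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
bounds, counts = interval_series(data, 4)
relative = [m / len(data) for m in counts]
print(bounds, counts, relative)
```

Each interval is treated as closed on the left and open on the right, except the last, which also includes the maximum; other conventions are equally possible as long as the intervals do not overlap.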

The histogram and the cumulate (ogive), already discussed in detail, are excellent data visualization tools that give a first understanding of the data structure. Such graphs (Fig. 1.15) are built for continuous data in the same way as for discrete data, bearing in mind that continuous data completely fill the range of their possible values and can take any value in it.

Fig. 1.15.

Consequently, the bars of the histogram and of the cumulate must touch one another, with no gaps inside the range of possible values (i.e., the histogram and the cumulate should not have "holes" along the abscissa axis into which no values of the variable under study fall, as in Fig. 1.16). The height of a bar corresponds to the frequency (the number of observations that fall into the given interval) or to the relative frequency (the proportion of observations). The intervals must not overlap and are usually of the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, considered in the course of probability theory. Their construction is therefore important in the primary statistical processing of quantitative continuous data: their shape suggests the hypothetical distribution law.

The cumulate is the curve of the accumulated frequencies (or relative frequencies) of the interval variation series. The cumulate is compared with the graph of the integral distribution function F(x), also considered in the course of probability theory.

Basically, the concepts of the histogram and the cumulate are associated precisely with continuous data and their interval variation series, since their graphs are empirical estimates of the probability density function and the distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. And this task is perhaps the most difficult, important and controversial in the issue under study.

The number of intervals should not be too small, because the histogram will then be too smooth (oversmoothed) and lose all the features of the variability of the initial data: in Fig. 1.17 you can see how the same data used for the graphs of Fig. 1.15 produce a histogram with a smaller number of intervals (left graph).

At the same time, the number of intervals should not be too large, otherwise we will not be able to estimate the distribution density of the data under study along the numerical axis: the histogram will turn out to be insufficiently smoothed (undersmoothed), uneven and with unfilled intervals (see Fig. 1.17, right graph).

Fig. 1.17.

How to determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the initial set of values of the studied attribute should be divided. This formula has become extremely popular: most statistical textbooks offer it, and many statistical packages use it by default. Whether this is justified in all cases is a very serious question.

So what is the Sturges formula based on?

Consider the binomial distribution )