Statistics

Statistics are generally considered hard by most students. Don't be put off by this. Although you may have to put a little effort in to understanding them at first, it is well worth the effort, as they are the key to good experimental design. Being able to do them well is a very important skill, and you'd be surprised how many people use them without understanding what they are actually doing. Like flying a plane when you don't understand all the controls, this is a recipe for disaster.

Data

Data is a collection of experimental observations. It may be:

Data can also be:

Data can also be artificially grouped, e.g. 0-4 yolks, 5-9 yolks. Grouped, discrete and qualitative data usually involve counting, and the results are best tabulated in a frequency distribution table, where frequency (the count) and relative frequency (the count divided by the total count) are recorded. In presentations, they lend themselves to bar charts. Continuous quantitative data lend themselves to line graphs and histograms.
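As a rough illustration of a frequency distribution table, the sketch below counts how often each value occurs, works out the relative frequency, and then repeats the exercise with the values grouped into classes of width 5 (0-4, 5-9). The observations are invented for the example, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical counts, invented for illustration (not data from the text).
observations = [0, 1, 1, 2, 2, 2, 3, 3, 5, 6, 8, 9]


def frequency_table(values):
    """Rows of (value, frequency, relative frequency), sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, counts[v], counts[v] / total) for v in sorted(counts)]


def grouped_frequency_table(values, width):
    """Rows of (class label, frequency, relative frequency) for classes of equal width."""
    counts = Counter((v // width) * width for v in values)
    total = len(values)
    return [
        (f"{low}-{low + width - 1}", counts[low], counts[low] / total)
        for low in sorted(counts)
    ]


table = frequency_table(observations)
grouped = grouped_frequency_table(observations, 5)

for label, freq, rel in grouped:
    print(f"{label:>5}  {freq:>3}  {rel:.2f}")
```

The relative frequencies in either table should always add up to 1, which is a handy check that no observation has been missed.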

Question 1.

Give examples of:

Use the descriptions above to help.

Question 2.

Mrs. McKenna wanted to know how many students were put on detention by staff of her department. The numbers of students detained by teachers were recorded as follows:

Express these records in frequency tables.

Question 3.

By means of random sampling, the number of daisy capitula ('flowers') per square metre of lawn was obtained. The results are shown in the table below. Express these results in frequency tables, grouping the data into five conveniently sized groups.

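| Square number | Number of capitula | Square number | Number of capitula |
| --- | --- | --- | --- |
| 1 | 11 | 21 | 21 |
| 2 | 3 | 22 | 17 |
| 3 | 13 | 23 | 10 |
| 4 | 19 | 24 | 15 |
| 5 | 20 | 25 | 13 |
| 6 | 16 | 26 | 7 |
| 7 | 9 | 27 | 20 |
| 8 | 14 | 28 | 15 |
| 9 | 12 | 29 | 24 |
| 10 | 6 | 30 | 13 |
| 11 | 19 | 31 | 9 |
| 12 | 15 | 32 | 5 |
| 13 | 23 | 33 | 12 |
| 14 | 12 | 34 | 14 |
| 15 | 10 | 35 | 3 |
| 16 | 6 | 36 | 9 |
| 17 | 13 | 37 | 10 |
| 18 | 15 | 38 | 2 |
| 19 | 4 | 39 | 12 |
| 20 | 10 | 40 | 18 |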

Statistics

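Statistics is about calculating numbers that summarise your data in an easy-to-digest form. Strictly, a summary number that describes a whole population is called a parameter, and the same number calculated from a sample is called a statistic, although the two terms are often used loosely. We cannot usually measure every single member of a population (think how long it would take to find the height of everyone in the world), so instead we take a small sample of the population, and use the values calculated from that sample as estimates of the population's parameters.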

Mathematical symbols

Statistical analysis usually begins by calculating descriptive statistics, such as means, standard deviations, standard errors and confidence intervals. These can be presented neatly in a table when writing up experiments. Statistics is quite mathematical, so here is a list of the symbols you are likely to come across, and with which you should become familiar.

Averages

There are several sorts of average, which are appropriate for different sorts of data.

Standard deviation

The standard deviation is a measure of the variation around the mean. It can be calculated in two ways, but both give the same answer. The first way shows you what it actually means:

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

We take a measurement (x), subtract the mean from it (x − x̄) and square the result. This tells us how far the measurement is from the mean. We do this for every measurement we have, and add them up (∑). We have to do the squaring because otherwise the clearly spread-out data set { 1, 2, 3, 4, 5 } would have a deviation of zero, as (1 − 3) + (2 − 3) + (3 − 3) + (4 − 3) + (5 − 3) = 0. This sum of squared differences from the mean is unimaginatively called the sum of squares (or SS).

We now take the SS, and 'average' it by dividing by the sample size minus one. The minus 1 is for mathematical reasons, and n-1 is called the degrees of freedom. As you know, the degrees of freedom is equal to the total sample size minus the number of means we have estimated from the data; we'll come across this idea again later.

The sum of squares divided by the degrees of freedom gives us a parameter called the variance (s²). The standard deviation is the square root of this value: we do this so that if our measurements have units (like metres), the standard deviation will also be in metres (rather than m²), so it is nicely comparable to the mean.

A quicker way to calculate the standard deviation is to use the following formula:

$$ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} $$

This version requires you to calculate the sum of the measurements, ∑x, and the sum of the squares of each measurement, ∑x². Note that this is not the same thing as the square of the sum of the measurements, (∑x)². This version is quicker to calculate by hand than the first version.

An even quicker way is to get Excel or a calculator to do it for you, but you should be aware what the standard deviation actually means before you go applying it willy-nilly.
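If you would like to check that the two routes really do agree, here is a minimal sketch in Python using the { 1, 2, 3, 4, 5 } data set from above. The function names are made up for this example; the first function follows the sum-of-squares description, and the second uses the running totals ∑x and ∑x².

```python
import math


def sd_from_deviations(data):
    """Standard deviation from the sum of squared deviations, SS / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)  # sum of squares (SS)
    return math.sqrt(ss / (n - 1))           # n - 1 degrees of freedom


def sd_shortcut(data):
    """The same quantity from the totals sum(x) and sum(x squared)."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    ss = sum_x2 - sum_x ** 2 / n              # equals the SS above
    return math.sqrt(ss / (n - 1))


data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)

# Without squaring, the deviations from the mean cancel out completely.
deviations_total = sum(x - mean for x in data)

sd_a = sd_from_deviations(data)
sd_b = sd_shortcut(data)
print(deviations_total, round(sd_a, 3), round(sd_b, 3))
```

Both functions divide by n − 1, so they should match a calculator's 'sample' standard deviation rather than its 'population' one.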

Question 4.

Calculate the mean, mode, median and standard deviation of the following two data sets.

To summarise this section, there are always two statistics you should calculate from your data: the mean (the sum of the measurements divided by the sample size), and the standard deviation (square root of the sum of squares divided by the degrees of freedom). The first tells you where the 'middle' of your data is, and the second tells you how much spread there is around this average value.

Normal distribution

The normal distribution is useful to describe the 'shape' (the probability distribution) of biological data. If we plot a bar chart from a frequency table (i.e. frequency of measurement against measurement), we will often find it has a peaked shape: look at the capitula graph again, and you'll see what I mean. So, if we plot the frequency of a measurement against its value (e.g. the number of people of a given height against the values of height that we measure), we will get a bar chart with a peak that looks like the curve below. The curve below has been mathematically smoothed: imagine it as a bar chart with extremely thin bars. For a normal distribution, which is common in biology, the mean, median and mode are all equal.

In a sample with a normal distribution, 68% of the measurements will be within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three. These have been coloured in various shades of blue below.

Normal curve.
The normal distribution. The mean is 50 cm and the standard deviation is 10 cm. The dark blue area (the mean plus or minus one standard deviation, 40-60 cm) contains 68% of the total area under the curve. If we include the mid-blue area too (all measurements within 2 standard deviations of the mean, i.e. 30-70 cm), this contains 95% of all the measurements. Remembering that 2 standard deviations contains 95% of the measurements is essential to understanding the t test we will come to later.

By understanding the shape of the normal curve, we can calculate probabilities, standard errors and confidence limits from the value of the standard deviation. Probability is the likelihood of something occurring on a scale of 0 to 1. Since 95% of the measurements are within two standard deviations of the mean, there is a probability of 0.95 that any particular measurement will be in this range (the dark and mid blue part of the curve above).
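The percentages quoted for one, two and three standard deviations can be checked numerically. This is a small sketch using the error function from Python's standard library: for a normal distribution, the fraction of measurements within k standard deviations of the mean is erf(k ⁄ √2). The function name is hypothetical.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# 1.96 is included because it is the exact multiplier that gives 95%.
coverage = {k: fraction_within(k) for k in (1, 1.96, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} s.d.: {p:.4f}")
```

Note that exactly two standard deviations actually cover slightly more than 95% of the curve, which is why 2 works as a rule of thumb in place of 1.96.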

Standard error

There is another important measurement of spread in data besides the standard deviation: the standard error of the mean (usually written s.e.). This is the standard deviation divided by the square root of the sample size:

$$ \text{s.e.} = \frac{s}{\sqrt{n}} $$

Most people don't actually know what the standard error is: they just use it on graphs to make error bars look smaller! However, it is very important to understand what it really means. The standard error shows you how certain you are that the sample mean (x̄) you have calculated is the same as the mean of the population (μ) you have sampled from. (Read that until it sinks in.)

Random sampling error aside, the standard deviation will be the same however big your sample size is: it's a measure of the variation in your population, so a large standard deviation isn't necessarily something to be ashamed of, since some populations are very variable. However, the standard error should get smaller as your sample size increases, so it is less a measure of how variable your population is, and more a measurement of how good your sampling technique is: a large standard error is something to be ashamed of!

Confidence limits show how sure you are that the mean of the population you have sampled is the same as the sample mean you have calculated. In biology, we will only accept your data as significant if you are 95% confident that your sample's mean is the same as the population mean. To calculate a confidence limit properly, you need a Student's t table; however, for the moment, just remember that if you calculate the s.e. and multiply it by 2 (remember the normal distribution: 95% of the area under the curve is within 2 standard deviations), then you will get your confidence limits, providing your sample size is greater than 30. Technically the multiplier should be 1.96, not 2, but as a rule of thumb, 2 will do. 30 is a sort of magic number in statistics: for mathematical reasons, it is the boundary between a 'small' and a 'large' sample size.

So, for a sample of 30 measurements with mean 16.30 and standard deviation of 2.80, we can calculate the s.e. as 2.80 ⁄ √30 = 0.51 and the 95% confidence limits as ± 2 × 0.51 = ± 1.02. This means we can be 95% sure that the true population mean μ lies between 15.28 and 17.32 (16.30 ± 1.02).
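The worked example above can be reproduced in a few lines of Python. This is only a sketch of the rule-of-thumb method (multiplier of 2, sample size of at least 30); the function names and the `multiplier` argument are invented for the example.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: standard deviation over root sample size."""
    return sd / math.sqrt(n)


def confidence_limits(mean, sd, n, multiplier=2.0):
    """Approximate 95% confidence limits using the rule-of-thumb multiplier."""
    half_width = multiplier * standard_error(sd, n)
    return mean - half_width, mean + half_width


se = standard_error(2.80, 30)
lower, upper = confidence_limits(16.30, 2.80, 30)
print(round(se, 2), round(lower, 2), round(upper, 2))
```

Passing `multiplier=1.96` gives the slightly narrower limits you would get from the exact normal-curve value.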

Question 5.

Produce a table of descriptive statistics for the following data set.

Hypothesis testing

A hypothesis is a specific prediction about your data. It may apply to a single set of data (e.g. I expect that my mean human height will be equal to 1.7 m), or it may apply to a comparison of two or more sets of data (e.g. I expect that the mean height of oak trees will be larger than the mean height of humans). The first of these is also really a comparison, between a sample mean and an actual number, 1.7.

For every idea there are two hypotheses you should formulate. The first is the null hypothesis (H0), which states:

The second is the alternative hypothesis (H1), which states:

Some example null hypotheses:

Note that the null hypothesis always says 'no difference', 'same', 'no correlation', 'nothing happening here officer', etc.

Question 6.

Formulate null hypotheses for the following ideas.

In order to apply a statistical test to our oak tree/human comparison, we must:

Golden rules

To do any sort of statistical test, it is essential that you take precautions first. These are my golden rules of experimental design.

Once you've made sure you've taken these precautions, and collected the data, you can do the test you chose.

In biology we are often interested in whether our two means are significantly different from one another. Say we have measured the heights of humans and oak trees, and found that the oak tree mean is much higher than the human mean (surprise!). How can we be sure that this difference is real, and not due to pure chance (accidentally sampling very tall trees and very short people)? The answer is, we can't ever be sure, but we can say how sure we are. If we are 95% sure that the means are different, we say they are significantly different, provisionally accept that the difference is real, and reject the null hypothesis. If we are less sure than this, we say they are insignificantly different, and accept the null hypothesis.

Statistical tests

Statistical tests are used to find out whether a particular hypothesis can be supported, or needs to be rejected. There are many, many sorts of statistical test and technique. The one you use depends on the sort of data, and the relationship you want to (dis)prove. We will discuss three tests here, Student's t test, the χ² (chi-squared) test, and the Mann-Whitney U test. In all cases, we do two things: we calculate a test statistic from our data, and we compare this to a table of critical values for different confidence levels and sample sizes. To see if our data are significant, we look at the critical value for the 95% confidence level (this is usually written on tables as the 5% significance level, or p = 0.05 level). If our calculated value is larger than the critical value, the data (or the difference between data) is significant, and we can reject the null hypothesis.

Student's t test

The t test is used to determine whether there is a significant difference between the means of two samples. It should only be used if you are sure the data is normally distributed, but you may well have to just assume this in school work. If you're really worried about the data distribution, you can use the Mann-Whitney U test instead, which works on any sort of distribution, normal or not. All statistical tests are more reliable on larger sample sizes; t is most reliable if n > 30.

t is calculated using the formula below.

$$ t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$

x̄1 is the mean of the first sample, s1 is the standard deviation of the first sample, n1 is the size of the first sample. x̄2, s2 and n2 are the values from the second sample. The |x̄1 − x̄2| means 'ignore any minus sign'. The thing on the bottom combines the standard errors of the two means: it is the square root of the sum of their squares.

To determine the critical value, we need to know the significance level (p = 0.05), and the number of degrees of freedom (sometimes written ν, more usually as df). The degrees of freedom is simply n1 + n2 − 2 (as we discussed earlier, this is the total sample size minus the number of means we have estimated). We then just look these up on a t table, and if the calculated value of t is greater than the critical value from the table, the null hypothesis can be rejected, and we say there is a significant difference between the two sample means.
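To make the recipe concrete, here is a hedged sketch in Python. It assumes t is the absolute difference between the two means divided by √(s1² ⁄ n1 + s2² ⁄ n2), with n1 + n2 − 2 degrees of freedom as described above. The two samples are invented for illustration, the function name is hypothetical, and the critical value (2.306 for 8 degrees of freedom at p = 0.05, two-tailed) has been copied by hand from a standard t table.

```python
import math
import statistics


def two_sample_t(sample1, sample2):
    """Return (t, degrees of freedom) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance divides by n - 1, i.e. it is the sample variance s squared.
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    t = abs(mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    df = n1 + n2 - 2
    return t, df


# Hypothetical measurements, invented for this example.
sample_a = [5, 6, 7, 8, 9]
sample_b = [1, 2, 3, 4, 5]

t, df = two_sample_t(sample_a, sample_b)

# Critical value for df = 8 at the 5% significance level (two-tailed),
# taken from a printed t table rather than computed here.
CRITICAL_T_DF8 = 2.306
significant = t > CRITICAL_T_DF8
print(round(t, 2), df, significant)
```

With real data you would look up the critical value for your own degrees of freedom instead of reusing the one hard-coded here.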

If you use Excel to do t tests, it will give you a number of different sorts of t test you can use. They are all slightly different. In general, you will want to choose an "unpaired, two-tailed test assuming equal variances". The other tests are for less common problems, and are usually less strict.

It's worth understanding how t actually works; many people use the test like a sort of magical incantation, without having the foggiest idea what they are actually doing. The t test is not nearly as difficult to understand as you might think. All t does is measure the overlap between two normal distributions (your two samples). That's it. A big overlap means a small value of t and therefore an insignificant difference. A small overlap means a big value of t and therefore a significant difference.

Student's t test with large overlap.
These two normal distributions have a very large overlap. The means of the left and right curves are not significantly different, because the overlap is > 5% of the area under the curves. t would be very small.

Student's t test with overlap made smaller by decreasing standard deviations.
These two normal distributions have the same means as before, but much smaller standard deviations. See that the overlap is now much smaller. The means of the right and left curves are now significantly different, because the overlap is < 5% of the area under the curves. t would be very large.

Student's t test with overlap made smaller by increasing difference in means
These two distributions have the same standard deviations as the original curves, but the difference between the means is now large enough to make the overlap smaller. These means are now significantly different, and t would be very large.

If you look at the formula for t, you will see that you get a large t (and therefore be more likely to reject the null hypothesis) if:

A useful variation on the t test is to see if a mean is significantly different from a particular number (N), like in our hypothesis that humans have a mean height of 1.7 m. To do this, simply use the equation below, and look up a critical value of t with degrees of freedom n-1.

$$ t = \frac{|\bar{x} - N|}{s / \sqrt{n}} $$

Calculate the difference between the mean and the number (N) you want to compare it with, and divide this by the standard error.
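The one-sample version is short enough to sketch directly. The heights below are invented for illustration, the function name is hypothetical, and the critical value (2.776 for 4 degrees of freedom at p = 0.05, two-tailed) is copied from a standard t table.

```python
import math
import statistics


def one_sample_t(sample, target):
    """Return (t, degrees of freedom) for comparing a sample mean with a fixed number."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)  # s / root n
    t = abs(statistics.mean(sample) - target) / standard_error
    return t, n - 1


# Hypothetical heights in metres, invented for this example.
heights = [1.6, 1.7, 1.8, 1.9, 2.0]

t, df = one_sample_t(heights, 1.7)

# Critical value for df = 4 at the 5% significance level (two-tailed),
# taken from a printed t table.
CRITICAL_T_DF4 = 2.776
significant = t > CRITICAL_T_DF4
print(round(t, 3), df, significant)
```

Here t is smaller than the critical value, so with only five measurements this invented sample mean of 1.8 m is not significantly different from 1.7 m.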

Question 6.

For the following data sets, formulate a null hypothesis, and test the hypothesis with a t test. You will need to calculate descriptive statistics and degrees of freedom to do this.

Use the formulae above to help.

χ² test

The χ² test is used to determine if what you predict from theory and what you observe in an experiment are significantly different from each other. It is used only for count data, and is particularly useful in genetics testing. It is easiest to understand χ² with an example. If we cross snakes heterozygous for the recessive albinism gene (A is 'normal', and a is albino), we expect a 3:1 ratio of normal to albino offspring (Aa × Aa → (1 AA + 2 Aa) + 1 aa, i.e. three quarters normal phenotype and one quarter albino).

Cornsnake with albino pigmentation.
Albino snake (with red eyes that look white under flash photography) right with white markings. The red colour is not affected by albinism.

Our null hypothesis is that there is no difference between our observed ratio and our expected ratio. If we do this cross and score 100 offspring, we expect 100 × ¾ = 75 normal offspring, and 100 × ¼ = 25 albino offspring. If we actually observe these results:

| Class | Observed (O) | Expected (E) |
| --- | --- | --- |
| Normal | 70 | 75 |
| Albino | 30 | 25 |
| Total | 100 | 100 |

then calculating χ² is very simple:

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

For each observed (O)/expected (E) pair, we subtract E from O, square the answer, and divide by E. We then add these up for all our categories, and this gives us χ².

So χ² = (70 − 75)² ⁄ 75 + (30 − 25)² ⁄ 25 = 1.33. To see if this is significant, we just need to look it up on a χ² table, with degrees of freedom equal to the number of categories minus 1. In this case we have two categories (normal and albino), so we have 1 df. The critical value is 3.84. Our calculated value is smaller, so we can accept the null hypothesis that there is no difference between our ratio and a 3:1 ratio. So our data support a simple Mendelian monohybrid cross ratio.
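The snake calculation can be checked with a few lines of Python. This sketch uses the observed counts and the 3:1 expectation from the example, and the critical value of 3.84 quoted in the text; the function name is made up.

```python
def chi_squared(observed, expected):
    """Sum of (O - E) squared over E across all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [70, 30]                        # normal, albino
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]  # 3:1 Mendelian ratio

chi2 = chi_squared(observed, expected)
df = len(observed) - 1

CRITICAL_5_PERCENT_DF1 = 3.84              # value quoted in the text for 1 df
significant = chi2 > CRITICAL_5_PERCENT_DF1
print(round(chi2, 2), df, significant)
```

Note that the expected values are worked out from the observed total, so O and E always add up to the same number of snakes.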

When doing χ², it is important to avoid having any category with fewer than 5 tallies in it. It is also extremely important that O and E are actual counts (30 albino snakes, 20 purple corn kernels, 78 dwarf pea plants), not percentages or proportions. You cannot do a χ² test if you only know you had 30% albino snakes: you must know whether this means 300 snakes in 1000, or 3 in 10, as this makes a big difference to how sure you can be about the ratio. This also goes back to making sure you collect the right sort of data for the test you want to do.

χ² can also be used to see whether a distribution of things is significantly different from random. Question 7 is an example of this.

Question 7.

Often, you will want to do a χ² test on more than one independent category, for example:

| Species of bird | Ground layer | Shrub layer | Tree layer | Row total |
| --- | --- | --- | --- | --- |
| Blackbird | 123 | 43 | 12 |  |
| Great tit | 93 | 247 | 64 |  |
| Blue tit | 52 | 72 | 178 |  |
| Column total |  |  |  | Grand total |

In this table, the observed values are shown. Our null hypothesis here is that there is no association between the vertical zone and the species of bird found there, i.e. the birds are distributed at random amongst the different forest layers. To calculate our expected values (E), we need to be a little careful, and use the formula below to ensure that we get the same total number of birds in each place:

$$ E = \frac{R \times C}{T} $$

R is the row total, C is the column total, and T is the grand total.

You will need to calculate these totals yourself. To work out χ², you just need to do the 'O minus E all squared over E' thing to the nine values, and add them up. The number of degrees of freedom for a two-category χ² is (number of rows − 1) × (number of columns − 1). Do this calculation, and check your answer.
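If you want to check your hand calculation afterwards, the whole procedure can be sketched in Python as below. It uses the observed bird counts from the table and the E = R × C ⁄ T rule; the variable names are invented for the example, and no critical value is looked up here.

```python
# Observed counts: rows are blackbird, great tit, blue tit;
# columns are ground, shrub and tree layers.
observed = [
    [123, 43, 12],
    [93, 247, 64],
    [52, 72, 178],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total x column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(row_totals, col_totals, grand_total)
print(round(chi2, 1), df)
```

Because each expected value is built from the row and column totals, the expected counts add up to the same grand total as the observed counts, which is the point of being 'a little careful'.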

Question 8.

Try these other χ² tests.

Mann-Whitney U test

The t test is very powerful, but it relies on some assumptions that may not be true for your data. It assumes:

You can usually get away with these assumptions if you have a large enough sample size (n > 30), but if you are unsure, you may want to try out the U test instead, which is less powerful, but doesn't have these dodgy assumptions. The U test relies on your putting your data into rank order, so it shouldn't be a surprise that it compares medians rather than means. These sorts of ranking tests are called nonparametric tests, because you don't have to calculate parameters like the mean and standard deviation.

Here are some data showing the number of fly agaric toadstools found per hectare in two different sorts of wood. Our null hypothesis is that we find the same number of toadstools in both sorts of wood. The data are not normally distributed (the mean of the birch data is 52.2, but the median is 47.5, which is too different for the data to be normal), so we cannot use a t test.

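| Sample number | Birch wood | Mixed wood |
| --- | --- | --- |
| 1 | 43 | 12 |
| 2 | 19 | 18 |
| 3 | 91 | 24 |
| 4 | 76 | 40 |
| 5 | 52 | 19 |
| 6 | 40 | 33 |
| 7 | 82 | 29 |
| 8 | 58 | 44 |
| 9 | 22 | 15 |
| 10 | 39 | 28 |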

To use U instead, we need to arrange the data in overall rank order (from 1 to 20), and add up the ranks for the two sets of data. Note the 4.5 and the 12.5: we have to take a mean rank when there are identical pieces of data in both sets.

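| Rank | Birch wood | Mixed wood | Rank |
| --- | --- | --- | --- |
|  |  | 12 | 1 |
|  |  | 15 | 2 |
|  |  | 18 | 3 |
| 4.5 | 19 | 19 | 4.5 |
| 6 | 22 |  |  |
|  |  | 24 | 7 |
|  |  | 28 | 8 |
|  |  | 29 | 9 |
|  |  | 33 | 10 |
| 11 | 39 |  |  |
| 12.5 | 40 | 40 | 12.5 |
| 14 | 43 |  |  |
|  |  | 44 | 15 |
| 16 | 52 |  |  |
| 17 | 58 |  |  |
| 18 | 76 |  |  |
| 19 | 82 |  |  |
| 20 | 91 |  |  |
| ∑ R1 = 138 | median = 47.5 | median = 26 | ∑ R2 = 72 |
|  | n1 = 10 | n2 = 10 |  |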

The R (rank) scores are summed up (∑ R) and then you can feed them into a simple formula to calculate the values of U for the two data sets:

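$$ U_1 = n_1 n_2 + \tfrac{1}{2} n_2 (n_2 + 1) - \sum R_2 $$

$$ U_2 = n_1 n_2 + \tfrac{1}{2} n_1 (n_1 + 1) - \sum R_1 $$

U1 = ( 10 × 10 ) + ( 0.5 × 10 × 11 ) − 72 = 83, and U2 = ( 10 × 10 ) + ( 0.5 × 10 × 11 ) − 138 = 17.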

These U scores give you an idea of how many of your data points are in the top half of the rank order, and how many are in the bottom half. We compare the smaller of these two values to the critical value from a U table for our two sample sizes. From a table with n1 = 10 and n2 = 10 we get a critical value of 23. Unfortunately, for U, we reject the null hypothesis if our calculated value is smaller than (or equal to) the critical value. This goes against every other test I have ever come across, so I can only apologise on behalf of Whitney and Mann. Our value (17) is smaller, so we do reject the null hypothesis, and we conclude that there are more toadstools in birch woods than mixed woods. You may also come across a very similar test called the Wilcoxon matched pairs test, which is suitable for matched data, and may be worth investigating if your U test doesn't seem to see a very obvious (to you) difference between two data sets.
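The ranking and the two U values for the toadstool data can be reproduced with the sketch below. It gives tied values the mean of the ranks they share, sums the ranks for each wood, and applies the U calculation as worked through in the text (n1 × n2 + ½ × n × (n + 1) minus the other group's rank sum). The function name is hypothetical, and the critical value of 23 is the one quoted in the text.

```python
birch = [43, 19, 91, 76, 52, 40, 82, 58, 22, 39]
mixed = [12, 18, 24, 40, 19, 33, 29, 44, 15, 28]


def average_ranks(values):
    """Map each value to its rank (1 = smallest), sharing the mean rank among ties."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Sorted positions i..j-1 hold the same value and cover ranks i+1..j.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks


ranks = average_ranks(birch + mixed)
r1 = sum(ranks[v] for v in birch)   # sum of ranks for birch wood
r2 = sum(ranks[v] for v in mixed)   # sum of ranks for mixed wood
n1, n2 = len(birch), len(mixed)

u1 = n1 * n2 + 0.5 * n2 * (n2 + 1) - r2
u2 = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
u = min(u1, u2)

CRITICAL_U = 23                     # value quoted in the text for n1 = n2 = 10
reject_null = u <= CRITICAL_U       # note the direction: smaller or equal rejects
print(r1, r2, u1, u2, reject_null)
```

A useful check is that U1 and U2 always add up to n1 × n2, here 100.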

Plotting graphs

There are two main sorts of graph we use in biology. The best one to use depends on the nature of the input variable. Bar charts are used to display discrete data, and line charts are used to display continuous data.

Bar charts and error bars

When you draw a bar chart, it is good practice to add error bars to it. These show the standard deviation (or standard error) for each bar. To plot data about the average lifespan of baby snakes, we first calculate some descriptive statistics:

| Pigmentation | Average lifespan (yrs) |
| --- | --- |
| Normal | 15 (± s.e. 2) |
| Albino | 4 (± s.e. 3) |

It's a good idea to plot error bars, like the ones on the diagram below, for each category. You can only do this if you've taken several measurements of lifespan from both groups of snakes, so make sure you plan for this when you do your experiment. To draw error bars, just add the standard error (or standard deviation) to the mean, and put a little horizontal dash on your diagram at this value (at 17 for the normal snakes), then do the same with the mean minus the standard deviation/error (13 for normal snakes). Join these up and you have an error bar. They're very easy to do by hand; in Excel, you will need to right click data points and select 'Format data series'…'X error bars'. Simple.

Bar chart with error bars.

Line graphs and regression

Most people have no trouble plotting a line graph. Now you know how to create error bars, you should be able to draw a line graph with error bars on each plot point (if you've been able to collect enough data to make each point a mean).

Linear regression is a way of finding the best fit line to a set of data. What regression actually does is plot a straight line through your data that passes through the mean of the x values and the mean of the y values. This point is the red dot on the diagrams below. It then pivots the line about this dot until it gets the smallest possible sum of distances from the points to the line. If we start off with a flat line, with slope 0:

Regression with slope 0 and high error sum of squares.
The red dot is (x-mean, y-mean), in this case (5,5). Regression plots a line through this point, and calculates the distance to all the points. If we add these distances (red lines) up, we get a measure of how good a fit the line is to the data. This one is obviously not very good.

We can see that if we add up the squares of the distances of each point to the flat line, it will come to quite a large number. This is yet another of those 'sums-of-squares' (SS) that we have seen before. This line doesn't fit very well, so we pivot the line anticlockwise about the red dot, to a larger slope, here about 1.1.

Regression with slope 1.1 and low error sum of squares.

This looks a lot better: the distances of the points from the line sum to a much smaller value (the SS is much smaller). What happens if we pivot it a little more anticlockwise, say to a slope of 3.5?

Regression with slope 3 and high error sum of squares.

The SS starts to increase again: most of the distances are actually so large they are off-scale. It looks like our line of best fit was nearest the one in the second diagram.

Regression is a mathematical way of finding the slope that gives us the smallest SS. If we plot the SS against the slope we get a U-shaped curve: as we increase the slope from 0 to infinity, the SS decreases at first, reaches a minimum, then starts to increase again. A simple bit of calculus is all that's needed to find the slope that gives the minimum of this curve, and therefore the best slope to fit your data. There is no need to do this by hand: calculators and spreadsheets will do regression for you.

Regression gives you two parameters, a slope (often called 'b' on calculators, but we use the mathematician's 'm') and a y-axis intercept (often called 'a' on calculators, here 'c'), which you can use to plot a straight line:

$$ y = mx + c $$
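The pivoting idea can be tried out numerically. The sketch below uses a small invented data set, pivots a line about the point (x-mean, y-mean) at a few trial slopes, and compares the resulting SS with the SS at the slope given by the standard least-squares result, m = ∑(x − x̄)(y − ȳ) ⁄ ∑(x − x̄)². The data and function names are hypothetical.

```python
# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)


def ss_for_slope(m):
    """Sum of squared vertical distances from a line of slope m through the mean point."""
    return sum((y - (y_mean + m * (x - x_mean))) ** 2 for x, y in zip(xs, ys))


# Least-squares slope and intercept.
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_mean - slope * x_mean   # forces the line through (x_mean, y_mean)

ss_flat = ss_for_slope(0.0)
ss_best = ss_for_slope(slope)
ss_steep = ss_for_slope(3.5)
print(round(slope, 2), round(intercept, 2))
print(round(ss_flat, 2), round(ss_best, 2), round(ss_steep, 2))
```

Trying slopes either side of the fitted one should always give a larger SS, which is the U-shaped curve described above.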

Regression also gives you three other parameters: the standard errors of the slope and the intercept, and R². R² describes the degree of correlation between the x and y variables (if x and y are correlated, one will increase or decrease when the other increases). It is actually the square of another parameter, the correlation coefficient R, which you can look up on R tables with (number-of-pairs − 2) degrees of freedom to see if your correlation is significant or not. If you look at the graph below, you can see that two data sets have had a regression performed on them, so both have estimates of the slope and intercept (remember that the equation of a straight line is y = mx + c where m is the slope and c is the y-intercept). The lower line fits the data perfectly (positive correlation with no scatter, hence R² = 1). The upper line has much more scatter and hence R² is less than 1. If we had perfect negative correlation (a line with no scatter sloping down from left to right), R² would also be 1 (although R would be −1): the confusion between R² and R is a reason I dislike this statistic! It is often better to calculate (or get Excel to calculate) the standard error of the slope and intercept. Then you can see if they are significantly different from a particular value (e.g. see if the slope is significantly different from 0) by using a simple t test.

The commonest application of linear regression is to fit a straight line (y = a + bx) through data. Please make sure your data really is on an approximate straight line before you try fitting straight lines through it! Linear regression can also be used to fit curves of best fit described by higher order polynomials, such as y = a + bx + cx², since (in statistical parlance) the response variable is a linear function of the parameters (a, b, c) that are estimated by the regression. You can sometimes use linear regression to fit curves to non-linear data (such as y = ae^(bx), y = ax ⁄ (b + x), y = ax^b), by first mathematically transforming the equation and the data into a linear function (using logarithms or reciprocals), and then fitting a straight line through the transformed data. However, this often has unintended consequences: a common linearisation in enzyme kinetics (Lineweaver-Burk) uses reciprocals, which gives undue weight to the values in which you have the least confidence. Non-polynomial functions, particularly those that are not amenable to linearisation, can also be fitted using more sophisticated forms of nonlinear regression.

Scatter is measured by R-squared.

Spearman's rank correlation (rs) test

If you come across data that is not in a straight line, but you still want to know whether the data are correlated (i.e. as X gets bigger, so does Y), you can use a test called Spearman's rank correlation test. Like the Mann-Whitney U test, it is less powerful than a t test, but has fewer assumptions. It cannot be used unless you have more than 4 data points, and shouldn't be used if you have fewer than 6.

Say we are wondering whether the length of a sycamore seed wing influences how quickly it will fall. Our null hypothesis is that there is no such correlation. We have no idea if the relationship is a straight-line or not, so we can't use linear regression. We collect the following data:

| Wing length (mm) | Speed of descent (m s⁻¹) |
| --- | --- |
| 25 | 1.38 |
| 41 | 0.67 |
| 27 | 1.28 |
| 35 | 0.95 |
| 36 | 1.03 |
| 31 | 1.15 |
| 34 | 1.02 |
| 29 | 1.17 |
| 33 | 1.17 |

To do this test, arrange your X and Y data in rank order.

You can then draw up a rank correlation table, with the data arranged in pairs, in rank order of wing length:

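| Wing length rank | Wing length | Speed of descent | Speed of descent rank | D² |
| --- | --- | --- | --- | --- |
| 1 | 25 | 1.38 | 9 | 64 |
| 2 | 27 | 1.28 | 8 | 36 |
| 3 | 29 | 1.17 | 7 | 16 |
| 4 | 31 | 1.15 | 6 | 4 |
| 5 | 33 | 1.17 | 2 | 9 |
| 6 | 34 | 1.02 | 4 | 4 |
| 7 | 35 | 0.95 | 3 | 16 |
| 8 | 36 | 1.03 | 5 | 9 |
| 9 | 41 | 0.67 | 1 | 64 |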

The column labelled D² is the square of the difference between the ranks for each data point, i.e. (1 − 9)² = 64, (2 − 8)² = 36, etc. If we add up these values of D² we get ∑ D² = 222. We then apply a simple formula to get our rank correlation statistic, rs:

$$ r_s = 1 - \frac{6 \sum D^2}{n (n^2 - 1)} $$

rs = 1 − ( 6 × 222 ) ⁄ ( 9 × ( 9² − 1 ) ) = −0.85. Like R, if rs is close to +1, there is good positive correlation, and if it is near −1, there is good negative correlation. We have good negative correlation: as the wing length increases, the speed of descent decreases. To see if this is a significant correlation, we compare it to a table of critical values with a sample size of n = 9. From an rs table, the critical value is ±0.600. As our calculated value is larger (ignore the negative sign), we can reject the null hypothesis, and say that the longer the wing the more slowly the seeds fall.
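The arithmetic for rs is easy to check in Python. This sketch starts from the two columns of ranks exactly as they appear in the rank correlation table (it does not re-rank the raw measurements), and uses the critical value of 0.600 quoted in the text; the variable names are invented.

```python
# Ranks copied from the rank correlation table, in order of wing length.
wing_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9]
speed_ranks = [9, 8, 7, 6, 2, 4, 3, 5, 1]

n = len(wing_ranks)
d_squared = [(w - s) ** 2 for w, s in zip(wing_ranks, speed_ranks)]
sum_d_squared = sum(d_squared)

rs = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

CRITICAL_RS_N9 = 0.600              # value quoted in the text for n = 9
significant = abs(rs) > CRITICAL_RS_N9
print(d_squared, sum_d_squared, round(rs, 2), significant)
```

The sign of rs tells you the direction of the correlation, while its size (ignoring the sign) is what gets compared with the critical value.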

ANOVA and beyond

Our final port of call in this whirlwind tour of statistical techniques is analysis of variance or ANOVA. If you're lucky you'll not have to use this technique, so feel free to skip to the end if you don't think this will be useful. ANOVA can be used when you come across a situation where you are trying to do dozens of t tests on data to show that several means are significantly different from each other.

One way ANOVA

For example, if you have a trial where you give three different sorts of food to Venus fly traps (nothing, BabyBio and flies), and want to find if these different foods have different effects on growth, you can use ANOVA to compare your three means, rather than doing three pairwise t tests. This is called a three level, one factor (or one way) ANOVA: we are only investigating one factor (sort of food), at three different levels (nothing, BabyBio, flies).

Here is our data:

                            Food
                            Nothing   BabyBio   Flies
Leaves produced per year    6         2         7
                            4         3         8
                            5         2         8
                            7         3         7
                            6         3         8
                            5         2         9
                            6         1         10
                            7         0         11
                            5         2         9
                            4         1         7
Mean                        5.5       1.9       8.4

We have calculated individual means for each of the three food sources. We must also calculate the grand mean of the entire data set, which is 5.3. To see whether the number of leaves produced per year is influenced by the kind of food we give the flytraps, we need to calculate three sums-of-squares:

The first is the sum-of-squares of the entire data set. This total sum of squares (SST) is found by calculating:

Total sum of squares: SST = ∑ (y − ȳ)², where y is each data point and ȳ is the grand mean,

i.e. find the grand mean of all the data (= 5.3), and add together (6 − 5.3)² + (2 − 5.3)² + (7 − 5.3)² + (4 − 5.3)² + … for all 30 data points. This is a bit of a hassle to do by hand, so you might like to know that it is equal to the total variance (standard deviation squared) multiplied by the number of degrees of freedom in the data (n − 1), so SST = s²(n − 1) = 247.9, with 29 degrees of freedom.
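Both routes to SST can be checked with a short Python sketch. It uses only the standard library's statistics module; the variable names are chosen for this illustration and are not part of the original guide.

```python
from statistics import mean, variance

# Leaves produced per year, copied from the data table.
nothing = [6, 4, 5, 7, 6, 5, 6, 7, 5, 4]
babybio = [2, 3, 2, 3, 3, 2, 1, 0, 2, 1]
flies = [7, 8, 8, 7, 8, 9, 10, 11, 9, 7]

all_data = nothing + babybio + flies
n = len(all_data)
grand_mean = mean(all_data)

# Route 1: add up the squared deviation of every data point from the grand mean.
sst_direct = sum((y - grand_mean) ** 2 for y in all_data)

# Route 2: sample variance (which divides by n - 1) multiplied back by n - 1.
sst_from_variance = variance(all_data) * (n - 1)

print(f"grand mean = {grand_mean:.1f}")
print(f"SST (direct) = {sst_direct:.1f}")
print(f"SST (variance route) = {sst_from_variance:.1f}")
print(f"degrees of freedom = {n - 1}")
```

The code works with the unrounded grand mean (about 5.27), so it agrees with the hand calculation to the one decimal place quoted in the text.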

The second SS to calculate is the amount of variation that is accounted for by our model, SSM. This is found by calculating the difference between each of our individual means and the grand mean. In the graph below, you can see the three data sets, with a line showing the grand mean. SSM is found by calculating the distance of each individual mean from the grand mean, much like we saw when we were discussing regression.

Figure: One way ANOVA (the three data sets, with a line showing the grand mean).

The actual calculation we need to do is:

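Model sum of squares: SSM = ∑ nᵢ × (ȳᵢ − ȳ)², summed over the sample groups, where nᵢ is the number of replicates in group i, ȳᵢ is that group's mean and ȳ is the grand mean,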

i.e. number of replicates within a sample group multiplied by (individual mean − grand mean)², added up over all the groups.

So, SSM = 10 × (5.5 − 5.3)² + 10 × (1.9 − 5.3)² + 10 × (8.4 − 5.3)² = 212.1. All SS values have an associated number of degrees of freedom; for model SS values, this is a little different to the ones you have come across before. For SSM, the df is the number of means we have estimated minus 1, so as we have estimated 3 means, SSM has 2 degrees of freedom. (Although this sounds like a new rule, different from the sample-size-minus-number-of-means we have used before, it isn't really: we have a 'sample' of three means, and have estimated one grand mean from them, so ν = 3 − 1.)
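The same model sum of squares can be reproduced with a small Python sketch. The dictionary layout and variable names are assumptions made for this example only.

```python
from statistics import mean

# Leaves produced per year for each food, copied from the data table.
groups = {
    "Nothing": [6, 4, 5, 7, 6, 5, 6, 7, 5, 4],
    "BabyBio": [2, 3, 2, 3, 3, 2, 1, 0, 2, 1],
    "Flies": [7, 8, 8, 7, 8, 9, 10, 11, 9, 7],
}

all_data = [y for values in groups.values() for y in values]
grand_mean = mean(all_data)

# For each group: replicates x (individual mean - grand mean) squared.
ssm = sum(
    len(values) * (mean(values) - grand_mean) ** 2
    for values in groups.values()
)

# One degree of freedom is used up by the grand mean.
df_model = len(groups) - 1

for name, values in groups.items():
    print(f"{name}: mean = {mean(values):.1f}")
print(f"SSM = {ssm:.1f} with {df_model} degrees of freedom")
```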

By fitting three means to the data, we account for about 86% of the total variation (212.1 ⁄ 247.9), so our model seems quite good. To see exactly how good, we need to work out one final sum of squares. This is the SSR, the residual sum of squares, which is the variation left over after fitting the model. It equals SST − SSM, so for our data, this is 247.9 − 212.1 = 35.8. This has whatever degrees of freedom are left from the total, i.e. 29 (total) − 2 (model) = 27 (residual).
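As a check on the subtraction, the residual sum of squares can also be worked out directly, by adding up the squared distance of every data point from its own group mean. The sketch below (variable names are illustrative only) does it both ways and confirms that the two answers, and the degrees of freedom, agree.

```python
from statistics import mean

# Leaves produced per year for each food, copied from the data table.
groups = {
    "Nothing": [6, 4, 5, 7, 6, 5, 6, 7, 5, 4],
    "BabyBio": [2, 3, 2, 3, 3, 2, 1, 0, 2, 1],
    "Flies": [7, 8, 8, 7, 8, 9, 10, 11, 9, 7],
}

all_data = [y for values in groups.values() for y in values]
grand_mean = mean(all_data)

sst = sum((y - grand_mean) ** 2 for y in all_data)
ssm = sum(
    len(values) * (mean(values) - grand_mean) ** 2
    for values in groups.values()
)

# Route 1: whatever is left over after fitting the model.
ssr = sst - ssm

# Route 2: squared deviations of each point from its own group mean.
ssr_direct = 0.0
for values in groups.values():
    group_mean = mean(values)
    ssr_direct += sum((y - group_mean) ** 2 for y in values)

df_total = len(all_data) - 1
df_model = len(groups) - 1
df_residual = df_total - df_model

print(f"SSR by subtraction = {ssr:.1f}")
print(f"SSR directly = {ssr_direct:.1f}")
print(f"residual degrees of freedom = {df_residual}")
```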

These three sums of squares and degrees of freedom can be summarised in an ANOVA table: it's good practice to do this, because you can check your maths and make sure that the things that should add up do add up!

Source   SS      df   MSS        F      Significance
SSM      212.1   2    106.0      80.0   **
SSR      35.8    27   s² = 1.3
SST      247.9   29

The column labelled MSS is the 'mean sum of squares': it shows how much variation is accounted for given the number of means (df) you need to get this reduction in variation. It would be possible to fit a model that had 15 means in it, but this would barely be better than the 30 pieces of raw data, so to see how good our model parameters really are, we scale the SS values by dividing them by the df, e.g. 212.1 ⁄ 2 = 106.0. These MSS values show us how 'explanatory' our model is: an explanatory model explains away a large amount of variation (SSblah) without having to invoke hundreds of means (df).

One of these MSS values is special: the one we get from SSR. This value is called the error variance, s², and measures the amount of variation that is not explained by our model. To see how explanatory our model is, the final step is to divide the model's MSS by s², which gives us a statistic called F, which finally allows us to test how significant the various parameters in our model are! To do this, we look up the critical value of F from tables with (apologies for the words) numerator = parameter df and denominator = s² df. So for SSM we check a table with (2, 27) degrees of freedom. You can do this on a p = 0.05 F table to see if it is significant (*), and also on a p = 0.01 F table to see if it is very significant (**). As usual, it's significant if our calculated F is bigger than the table F.
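The last two columns of the ANOVA table can be sketched in Python as follows. The two critical values are not given in this guide: they are assumed here from a standard F table for (2, 27) degrees of freedom, so check them against your own table. The helper name stars is made up for this example.

```python
from statistics import mean

# Leaves produced per year for each food, copied from the data table.
groups = {
    "Nothing": [6, 4, 5, 7, 6, 5, 6, 7, 5, 4],
    "BabyBio": [2, 3, 2, 3, 3, 2, 1, 0, 2, 1],
    "Flies": [7, 8, 8, 7, 8, 9, 10, 11, 9, 7],
}

all_data = [y for values in groups.values() for y in values]
grand_mean = mean(all_data)
group_means = {name: mean(values) for name, values in groups.items()}

ssm = sum(
    len(values) * (group_means[name] - grand_mean) ** 2
    for name, values in groups.items()
)
ssr = sum(
    (y - group_means[name]) ** 2
    for name, values in groups.items()
    for y in values
)

df_model = len(groups) - 1
df_residual = len(all_data) - len(groups)

ms_model = ssm / df_model   # MSS for the model
s2 = ssr / df_residual      # error variance
f_stat = ms_model / s2

# ASSUMED critical values for (2, 27) df, taken from a standard F table,
# not from the original text.
F_CRIT_05 = 3.35
F_CRIT_01 = 5.49


def stars(f, crit_05, crit_01):
    """Return '**', '*' or '' depending on which critical values f exceeds."""
    if f > crit_01:
        return "**"
    if f > crit_05:
        return "*"
    return ""


print(f"MSS (model) = {ms_model:.1f}")
print(f"s2 = {s2:.1f}")
print(f"F = {f_stat:.1f} {stars(f_stat, F_CRIT_05, F_CRIT_01)}")
```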

From this we conclude that the sort of food we give the flytraps makes a significant difference to the number of leaves they produce per year. Two-way ANOVA is rather more complex.

ANOVA is 'real' statistics. Along with regression, it is the tip of an iceberg of techniques for mass data analysis. One worth knowing is beyond Excel's pathetic powers, but at least you'll know what to Google for if an experiment like the one below comes up. It's called ANCOVA (analysis of covariance), and is simultaneous regression and ANOVA. Say we have done the same wheat experiment as we have already described, but this time we have used ten different concentrations of each fertiliser too. What we really want to do is plot six graphs and see whether their slopes are significantly different. ANCOVA lets us do this, but we won't go into the details. Multiple regression is a technique that can be used to fit several regression lines to a single data set, so you can see how both fertiliser concentration and light intensity influence grain yield at the same time. ANODEV (analysis of deviance) is the most powerful of the statistical techniques: it is a sort of superset of all the statistical techniques so far described, and can be used to fit any model to any data set. Such ANODEV modelling is usually called generalised linear modelling (GLM for short). If you ever want to do 'real' stats, GLM is the way to go, but I wouldn't recommend it unless you have access to a good stats package, and a lot of time on your hands. It's way beyond A-level standard anyway!

Summary

Statistics tend to be thought of as rather difficult, and to be left to the last minute. In fact, they are mostly quite simple to use, especially the commoner tests, which make intuitive sense when you realise what they actually do. The reason that most people have difficulties with them is that they put off thinking about them until after they collect their data, and then get into a mess trying to retro-fit statistics onto data they don't suit.

The important thing to remember is to plan the stats before you collect the data. That way, the statistics are a breeze, and practically analyse your data for you. If you'd like an excellent and free package to explore these statistics and many more, then you can't go far wrong with R.

Peer Review.
This page has been peer reviewed by 4 people. Thanks to Adrian Newton and Beat Rupp for their feedback, to Rob Campbell for his suggestions and correction, and to Michael Kospach, author of the indispensable Perl module Statistics::Distributions, without whom there would be no statistical tables to go with this guide.