Contents
Statistics are generally considered hard by most students. Don't be put off by this. Although you may have to put a little effort in to understanding them at first, it is well worth the effort, as they are the key to good experimental design. Being able to do them well is a very important skill, and you'd be surprised how many people use them without understanding what they are actually doing. Like flying a plane when you don't understand all the controls, this is a recipe for disaster.
- Data
- Statistics
- Normal distribution
- Standard error
- Hypothesis testing
- Statistical tests
- Plotting graphs
- ANOVA and beyond
- Summary
Data
Data is a collection of experimental observations. It may be:
- Qualitative, where the data cannot be described numerically, but falls into certain categories, such as tall and dwarf (as in Mendel's peas), or brown and blue eye colour.
- Quantitative, where the quantity can be described numerically, such as speed (in metres per second) or mass (in kilograms).
Data can also be:
- Continuous, where the data is grouped only by how accurately the data was measured, e.g. eggs of length 8.5 cm include eggs from 8.45 − 8.54 cm. Most quantitative data is of this sort.
- Discrete (or discontinuous), where the data is naturally grouped into discrete 'chunks'. Most qualitative data is of this sort (blue or brown eyes), but some quantitative data is too: e.g. 3 yolks means exactly 3 yolks, not 2.8 or 3.3.
Data can also be artificially grouped, e.g. 0-4 yolks, 5-9 yolks. Grouped, discrete and qualitative data usually involve counting, and the results are best tabulated in a frequency distribution table, where frequency (the count) and relative frequency (the count divided by the total count) are recorded. In presentations, they lend themselves to bar charts. Continuous quantitative data lend themselves to line graphs and histograms.
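A frequency distribution table is easy to build by machine as well as by hand. The sketch below is only an illustration: the yolk counts are made-up data, and `frequency_table` is a hypothetical helper name, not something from the original text.

```python
from collections import Counter

# Hypothetical counts of yolks per egg (made-up data for illustration only).
yolks = [1, 1, 2, 1, 3, 1, 2, 1, 1, 2]

def frequency_table(values):
    """Return (value, frequency, relative frequency) rows in value order."""
    counts = Counter(values)
    total = len(values)
    return [(value, counts[value], counts[value] / total)
            for value in sorted(counts)]

table = frequency_table(yolks)
for value, frequency, relative_frequency in table:
    print(f"{value}\t{frequency}\t{relative_frequency:.2f}")
```

The relative frequencies always add up to 1, which is a handy check that nothing has been miscounted.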
Question 1.
Give examples of:
- Qualitative data,
- Quantitative data,
- Grouped data,
- A discrete variable,
- A continuous variable.
Use the descriptions above to help.
Question 2.
Mrs. McKenna wanted to know how many students were put on detention by staff of her department. The numbers of students detained by each teacher were recorded as follows:
- 3, 5, 6, 6, 8, 7, 4, 7, 6, 5, 5, 7, 8, 7, 4, 6, 6, 6, 8, 5.
Express these records in frequency tables.
Question 3.
By means of random sampling, the numbers of daisy capitula ('flowers') per square metre of lawn were obtained. The results are shown in the table below. Express these results in frequency tables, grouping the data in five conveniently sized groups.
| Square number | Number of capitula | Square number | Number of capitula |
|---|---|---|---|
| 1 | 11 | 21 | 21 |
| 2 | 3 | 22 | 17 |
| 3 | 13 | 23 | 10 |
| 4 | 19 | 24 | 15 |
| 5 | 20 | 25 | 13 |
| 6 | 16 | 26 | 7 |
| 7 | 9 | 27 | 20 |
| 8 | 14 | 28 | 15 |
| 9 | 12 | 29 | 24 |
| 10 | 6 | 30 | 13 |
| 11 | 19 | 31 | 9 |
| 12 | 15 | 32 | 5 |
| 13 | 23 | 33 | 12 |
| 14 | 12 | 34 | 14 |
| 15 | 10 | 35 | 3 |
| 16 | 6 | 36 | 9 |
| 17 | 13 | 37 | 10 |
| 18 | 15 | 38 | 2 |
| 19 | 4 | 39 | 12 |
| 20 | 10 | 40 | 18 |
Statistics
Statistics is about calculating numbers that summarise your data in an easy to digest form. These summary numbers are called statistics when they are calculated from a sample, and parameters when they describe a whole population. We cannot usually measure every single member of a population (think how long it would take to find the height of everyone in the world), so instead we take a small sample of the population, and estimate the parameters from that instead.
Mathematical symbols
Statistical analysis usually begins by calculating descriptive statistics, such as means, standard deviations, standard errors and confidence intervals. These can be presented neatly in a table when writing up experiments. Statistics is quite mathematical, so here is a list of the symbols you are likely to come across, and with which you should become familiar.
- ∑ = the sum of
- x = individual measurement(s). Hence ∑ x = the sum of the measurements.
- n = the total number of measurements (sample size).
- x̄ = the mean of all the individual measurements in a sample. You may also see this written, wrongly, as μ; μ is the population mean, not the sample mean. The two should be similar to each other, but probably not identical because of sampling error: think of the difference between the mean height of the entire human race, and the mean height you would calculate if you accidentally sampled thirty primary school children thinking they were representative of the whole of humanity.
- s = the standard deviation of a sample. You may also see this written as σ or σn−1; σ is the population standard deviation, and is to μ what s is to x̄; σn−1 is simply wrong, but often used on calculators in place of s.
- p = probability, how likely something is to happen on a scale of 0 (never happens) to 1 (will certainly happen). Equivalently, on a scale of 0% to 100% certainty.
- df = degrees of freedom. This is related to the sample size, and is usually the sample size minus the number of means you have estimated from the data. Say your friend measured the heights of the eleven players on a football team, and calculated their mean. If she then told you the mean, and gave you a list of the heights of ten of the players, you could work out the height of the eleventh player, because you know that the eleven numbers add up to eleven times the mean. You only have ten degrees of freedom because the eleventh number is fixed.
- SS = sum of squares. We will come across this when we talk about standard deviations. To calculate it, we find the difference between a measurement and the mean (subtract one from the other), square it to make it positive, then add these squared-differences up for every measurement. It gives a measure of the total variability (or deviance) in the sample.
- s.e. = standard error, a statistic related to the standard deviation, but scaled to the sample size. It tells you how precisely your sample mean estimates the population mean.
Averages
There are several sorts of average, which are appropriate for different sorts of data.
- The (arithmetic) mean x̄ is the sum of all the measurements divided by the number in the sample:

$$\bar{x} = \frac{\sum x}{n}$$

(There are other sorts of mean, such as the geometric mean, which is the nth root of the mathematical product of the samples, but these are not very often used in biology.)

- The mode is the measurement that occurs most often.
- The median is the measurement that occurs in the middle when measurements are put in rank order. Medians are useful when you're not sure if your data is normally distributed.
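The three averages can be compared on a small data set. This is a minimal sketch using Python's standard `statistics` module; the lengths are made-up data, chosen so that one large value pulls the mean above the median and mode.

```python
import statistics

# Hypothetical lengths in cm (made-up data for illustration only).
lengths = [4, 5, 5, 6, 7, 5, 10]

mean = statistics.mean(lengths)      # sum of the measurements / sample size
mode = statistics.mode(lengths)      # the measurement that occurs most often
median = statistics.median(lengths)  # the middle measurement in rank order

print(mean, mode, median)
```

Here the single large value (10) drags the mean up to 6, while the median and mode both stay at 5. That is why the median is the safer choice when the data may not be normally distributed.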
Standard deviation
The standard deviation is a measure of the variation around the mean. It can be calculated in two ways, but both give the same answer. The first way shows you what it actually means:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

We take a measurement (x), subtract the mean from it (x − x̄) and square it. This tells us how far the measurement is from the mean. We do this for every measurement we have, and add them up (∑). We have to do the squaring because otherwise the clearly spread-out data set { 1, 2, 3, 4, 5 } would have a deviation of zero, as (1 − 3) + (2 − 3) + (3 − 3) + (4 − 3) + (5 − 3) = 0. This sum of squared differences from the mean is unimaginatively called the sum of squares (or SS).
We now take the SS, and 'average' it by dividing by the sample size minus one. The minus 1 is for mathematical reasons, and n-1 is called the degrees of freedom. As you know, the degrees of freedom is equal to the total sample size minus the number of means we have estimated from the data; we'll come across this idea again later.
The sum of squares divided by the degrees of freedom gives us a quantity called the variance (s²). The standard deviation is the square root of this value: we do this so that if our measurements have units (like metres), the standard deviation will also be in metres (rather than m²), so it is nicely comparable to the mean.
A quicker way to calculate the standard deviation is to use the following formula:

$$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$

This version requires you to calculate the sum of the measurements, ∑ x, and the sum of the squares of each measurement, ∑ x². Note that this is not the same thing as the square of the sum of the measurements, (∑ x)². This version is quicker to calculate by hand than the first version.
An even quicker way is to get Excel or a calculator to do it for you, but you should be aware what the standard deviation actually means before you go applying it willy-nilly.
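To see that the two routes agree, here is a short sketch that works through both on the data set { 1, 2, 3, 4, 5 } used above. It assumes the quicker formula is the usual computational form, (∑ x² − (∑ x)² ⁄ n) ⁄ (n − 1) under the square root; the variable names are illustrative only.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]  # the spread-out example set from the text
n = len(data)
mean = sum(data) / n

# First way: sum of squared differences from the mean (SS),
# divided by the degrees of freedom (n - 1), then square-rooted.
ss = sum((x - mean) ** 2 for x in data)
variance = ss / (n - 1)
s_definition = math.sqrt(variance)

# Quicker way (assumed computational form): needs only the sum of x
# and the sum of x squared.
sum_x = sum(data)
sum_x2 = sum(x ** 2 for x in data)
s_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# The standard library's sample standard deviation also divides by n - 1.
s_library = statistics.stdev(data)

print(ss, variance, round(s_definition, 2), round(s_shortcut, 2))
```

The sum of squares is 10, the variance is 2.5, and all three routes give a standard deviation of about 1.58.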
Question 4.
Calculate the mean, mode, median and standard deviation of the following two data sets.
- 32, 8, 15, 15, 28, 10, 5, 5, 15, 22.
- 1, 1, 1, 3, 5, 2, 5, 3, 9, 2, 9, 2, 2, 2.
To summarise this section, there are always two statistics you should calculate from your data: the mean (the sum of the measurements divided by the sample size), and the standard deviation (square root of the sum of squares divided by the degrees of freedom). The first tells you where the 'middle' of your data is, and the second tells you how much spread there is around this average value.
Normal distribution
The normal distribution is useful to describe the 'shape' (the probability distribution) of biological data. If we plot a bar chart from a frequency table (i.e. frequency of measurement against measurement), we will often find it has a peaked shape: look at the capitula graph again, and you'll see what I mean. So, if we plot the frequency of a measurement against its value (e.g. the number of people of a given height against the values of height that we measure), we will get a bar chart with a peak that looks like the curve below. The curve below has been mathematically smoothed: imagine it as a bar chart with extremely thin bars. For a normal distribution, which is common in biology, the mean, median and mode are all equal.
In a sample with a normal distribution, 68% of the measurements will be within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three. These have been coloured in various shades of blue below.

The normal distribution. The mean is 50 cm and the standard deviation is 10 cm. The dark blue area (the mean plus or minus one standard deviation, 40-60 cm) contains 68% of the total area under the curve. If we include the mid-blue area too (all measurements within 2 standard deviations of the mean, i.e. 30-70 cm), this contains 95% of all the measurements. Remembering that 2 standard deviations contains 95% of the measurements is essential to understanding the t test we will come to later.
By understanding the shape of the normal curve, we can calculate probabilities, standard errors and confidence limits from the value of the standard deviation. Probability is the likelihood of something occurring on a scale of 0 to 1. Since 95% of the measurements are within two standard deviations of the mean, there is a probability of 0.95 that any particular measurement will be in this range (the dark and mid blue part of the curve above).
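These areas can be checked numerically. The sketch below uses `statistics.NormalDist` from Python's standard library (Python 3.8 or later) with the mean of 50 cm and standard deviation of 10 cm from the curve described above; `fraction_within` is a hypothetical helper name.

```python
from statistics import NormalDist

heights = NormalDist(mu=50, sigma=10)  # mean 50 cm, standard deviation 10 cm

def fraction_within(dist, k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

within_1 = fraction_within(heights, 1)  # 40-60 cm, about 0.68
within_2 = fraction_within(heights, 2)  # 30-70 cm, about 0.95
within_3 = fraction_within(heights, 3)  # 20-80 cm, about 0.997

print(f"{within_1:.4f} {within_2:.4f} {within_3:.4f}")
```

The same fractions come out whatever mean and standard deviation you choose, because they depend only on the shape of the curve.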
Standard error
There is another important measurement of spread in data besides the standard deviation: the standard error of the mean (usually written s.e.). This is the standard deviation divided by the square root of the sample size:

$$\mathrm{s.e.} = \frac{s}{\sqrt{n}}$$

Most people don't actually know what the standard error is; they just use it on graphs to make error bars look smaller! However, it is very important to understand what it really means. The standard error shows you how close the sample mean (x̄) you have calculated is likely to be to the mean of the population (μ) you have sampled from. (Read that until it sinks in.)
Random sampling error aside, the standard deviation will be the same however big your sample size is: it's a measure of the variation in your population, so a large standard deviation isn't necessarily something to be ashamed of, since some populations are very variable. However, the standard error should get smaller as your sample size increases, so it is less a measure of how variable your population is, and more a measurement of how good your sampling technique is: a large standard error is something to be ashamed of!
- If you take a large sample from a population (n is large), you can be fairly sure that x̄ = μ, because you have looked at so many individuals. In this case, the s.e. is small.
- If you only take a small sample (n is small), you cannot be sure that your estimate x̄ is the same as μ, and the s.e. will be large. For example, if you were looking at human height, and only took a sample of 2 people, one (by sheer chance) might be a dwarf, and your sample mean x̄ would clearly be very different from the true population mean μ. Your large s.e. will warn you of this flaw in your sampling technique.
Confidence limits mark out the range around your sample mean within which you can be reasonably sure the population mean lies. In biology, the convention is to use 95% confidence limits: the range that you are 95% confident contains the population mean. To calculate a confidence limit properly, you need a Student's t table; however, for the moment, just remember that if you calculate the s.e. and multiply it by 2 (remember the normal distribution: 95% of the area under the curve is within 2 standard deviations), then you will get your confidence limits, providing your sample size is greater than 30. Technically the multiplier should be 1.96, not 2, but as a rule of thumb, 2 will do. 30 is a sort of magic number in statistics: for mathematical reasons, it is the conventional boundary between a 'small' and a 'large' sample size.
So, for a sample of 30 measurements with mean 16.30 and standard deviation of 2.80, we can calculate the s.e. as 2.80 ⁄ √30 = 0.51 and the 95% confidence interval as 2 × 0.51 = 1.02. This means we can be 95% sure that the true population mean μ lies between 15.28 and 17.32 (16.30 ± 1.02).
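The same arithmetic can be sketched in a few lines of code. The function names and the `multiplier` argument are illustrative assumptions; the default multiplier of 2 is the rule of thumb from the text, valid only for samples larger than about 30.

```python
import math

def standard_error(s, n):
    """Standard error of the mean: standard deviation / square root of sample size."""
    return s / math.sqrt(n)

def confidence_limits(mean, s, n, multiplier=2):
    """Approximate 95% confidence limits as mean +/- multiplier * s.e."""
    half_width = multiplier * standard_error(s, n)
    return mean - half_width, mean + half_width

# The worked example from the text: n = 30, mean 16.30, standard deviation 2.80.
se = standard_error(2.80, 30)
low, high = confidence_limits(16.30, 2.80, 30)

print(round(se, 2), round(low, 2), round(high, 2))
```

Passing `multiplier=1.96` gives the slightly narrower limits that the normal distribution strictly calls for.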
Question 5.
Produce a table of descriptive statistics for the following data set.
- 7.1, 7.3, 7.8, 7.4, 7.9, 7.4, 7.9, 7.5, 8.1, 7.5, 8.1, 7.5, 8.2, 7.6, 8.3, 7.6, 8.4, 7.6, 8.4, 7.6, 8.5, 7.6, 8.6
Hypothesis testing
A hypothesis is a specific prediction about your data. It may apply to a single set of data (e.g. I expect that my mean human height will be equal to 1.7 m), or it may apply to a comparison of two or more sets of data (e.g. I expect that the mean height of oak trees will be larger than the mean height of humans). The first of these is also really a comparison, between a sample mean and an actual number, 1.7.
For every idea there are two hypotheses you should formulate. The first is the null hypothesis (H0), which states:
- There is no difference between the mean of X and the mean of Y (trees and humans are the same height).
The second is the alternative hypothesis (H1), which states:
- There is a significant difference between the mean of X and the mean of Y (trees and humans are different heights).
Some example null hypotheses:
- The number of woodlice found in containers differing only in humidity, and between which the woodlice can move freely, is the same.
- There is no correlation between the number of bloodworms found in a stream and the oxygen concentration of the water.
- Plants watered with 1 M saline solution will grow at the same rate as plants watered with distilled water.
- The mean height of trees is the same as the mean height of humans.
Note that the null hypothesis always says 'no difference', 'same', 'no correlation', 'nothing happening here officer', etc.
Question 6.
Formulate null hypotheses for the following ideas.
- Are people with fair hair more likely to have blue eyes?
- If you increase the temperature of a pond, will more bacteria be able to live there?
In order to apply a statistical test to our oak tree/human comparison, we must:
- Formulate a null hypothesis. Here, H0 is "There is no difference between the mean heights of humans and the oak trees".
- Look at our two means and their standard errors to determine how similar they are.
- Determine whether we are 95% confident that there is a difference (or to put it another way, see whether the probability of our sample means differing this much by random chance alone is less than 5%).
Golden rules
To do any sort of statistical test, it is essential that you take precautions first. These are my golden rules of experimental design.
- You must plan your method and the data you collect around the statistical test you are going to use. Dr. Cook's first golden law of statistics is Decide on the statistics you plan to use before you collect the data. Retro-fitting an inappropriate statistical test onto the wrong sort of data is a disaster.
- It is much easier to analyse quantitative measurements (like height) or discrete categorical data (like blue eyes and brown eyes) than grouped data, or data collected on arbitrary scales. Ecologists are often keen on arbitrary abundance scales (like ACFOR: abundant, common, frequent, occasional, rare), which are difficult to analyse: if you can collect 'real' numbers instead, then do so. Second law: Use real numbers whenever you can.
- Ensure you have a reasonable sample size: 30 for ecology, 3 to 5 for lab practicals. It really depends on how your test works, but if you are planning on plotting bar charts, make sure every bar is the mean of at least 3 measurements, and if you plan on plotting lines, make sure that every line contains at least six points, preferably themselves the mean of three measurements. Third golden rule of statistics: The larger the sample size the better. Related to this, is: don't be tempted to make it too complicated. It is better for an experiment to look at the effect of pH on enzyme activity than for it to look at temperature, pH and their interaction. Although this is interesting, you will probably not have time to do all the necessary experiments to reasonable sample sizes. Alternative third law: It is better to have twenty replicates of a single experiment than two repeats each of ten.
- Randomise your data as much as possible, especially in ecology, where you must be very careful to make sure that you are not biasing your data, because it is so very easy. Even in the lab, try to ensure you don't always do experiments in the same order: it's amazing what a difference warmed-up machines and over-tired experimenters will make to the quality of the data collected. Fourth law: Randomise your sampling.
Once you've made sure you've taken these precautions, and collected the data, you can do the test you chose. In biology we are often interested in whether our two means are significantly different from one another. Say we have measured the heights of humans and oak trees, and found that the oak tree mean is much higher than the human mean (surprise!). How can we be sure that this difference is real, and not due to pure chance (accidentally sampling very tall trees and very short people)? The answer is, we can't ever be sure, but we can say how sure we are. If we are 95% sure that the means are different, we say they are significantly different, provisionally accept that the difference is real, and reject the null hypothesis. If we are less sure than this, we say they are not significantly different, and accept the null hypothesis.
Statistical tests
Statistical tests are used to find out whether a particular hypothesis can be supported, or needs to be rejected. There are many, many sorts of statistical test and technique. The one you use depends on the sort of data, and the relationship you want to (dis)prove. We will discuss three tests here, Student's t test, the χ² (chi-squared) test, and the Mann-Whitney U test. In all cases, we do two things: we calculate a test statistic from our data, and we compare this to a table of critical values for different confidence levels and sample sizes. To see if our data are significant, we look at the critical value for the 95% confidence level (this is usually written on tables as the 5% significance level, or p = 0.05 level). If our calculated value is larger than the critical value, the data (or the difference between data) is significant, and we can reject the null hypothesis.
Student's t test
The t test is used to determine whether there is a significant difference between the means of two samples. It should only be used if you are sure the data is normally distributed, but you may well have to just assume this in school work. If you're really worried about the data distribution, you can use the Mann-Whitney U test instead, which works on any sort of distribution, normal or not. All statistical tests are more reliable on larger sample sizes; t is most reliable if n > 30.
t is calculated using the formula below.

$$t = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

x̄1 is the mean of the first sample, s1 is the standard deviation of the first sample, n1 is the size of the first sample. x̄2, s2 and n2 are the values from the second sample. The |x̄1 − x̄2| means 'ignore any minus sign'. The thing on the bottom combines the standard errors of the two means: it is the square root of the sum of their squares.
To determine the critical value, we need to know the significance level (p = 0.05), and the number of degrees of freedom (sometimes written ν, more usually as df). The degrees of freedom is simply n1 + n2 − 2 (as we discussed earlier, this is the total sample size minus the number of means we have estimated). We then just look these up on a t table, and if the calculated value of t is greater than the critical value from the table, the null hypothesis can be rejected, and we say there is a significant difference between the two sample means.
If you use Excel to do t tests, it will give you a number of different sorts of t test you can use. They are all slightly different. In general, you will want to choose an "unpaired, two-tailed test assuming equal variances". The other tests are for less common problems, and are usually less strict.
It's worth understanding how t actually works; many people use the test like a sort of magical incantation, without having the foggiest idea what they are actually doing. The t test is not nearly as difficult to understand as you might think. All t does is measure the overlap between two normal distributions (your two samples). That's it. A big overlap means a small value of t and therefore an insignificant difference. A small overlap means a big value of t and therefore a significant difference.

These two normal distributions have a very large overlap. The means of the left and right curves are not significantly different, because the overlap is > 5% of the area under the curves. t would be very small.

These two normal distributions have the same means as before, but much smaller standard deviations. See that the overlap is now much smaller. The means of the right and left curves are now significantly different, because the overlap is < 5% of the area under the curves. t would be very large.

These two distributions have the same standard deviations as the original curves, but the difference between the means is now large enough to make the overlap smaller. These means are now significantly different, and t would be very large.
If you look at the formula for t, you will see that you get a large t (and therefore be more likely to reject the null hypothesis) if:
- the difference between the means is large
- the standard deviations are small
- the sample sizes are large
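The three effects listed above can be seen by calculating t directly. This sketch assumes the usual form of the statistic, the difference between the means divided by the square root of (s1² ⁄ n1 + s2² ⁄ n2), with n1 + n2 − 2 degrees of freedom as described in the text. The function name and all the data are made up for illustration.

```python
import math
import statistics

def t_statistic(sample1, sample2):
    """Return (t, degrees of freedom) for two unpaired samples."""
    n1, n2 = len(sample1), len(sample2)
    difference = abs(statistics.mean(sample1) - statistics.mean(sample2))
    # statistics.variance is the sample variance s squared (divides by n - 1).
    bottom = math.sqrt(statistics.variance(sample1) / n1
                       + statistics.variance(sample2) / n2)
    return difference / bottom, n1 + n2 - 2

# Hypothetical measurements (made-up data for illustration only).
a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, 7]

t, df = t_statistic(a, b)                      # baseline
t_far, _ = t_statistic(a, [5, 6, 7, 8, 9])     # larger difference between means
t_big_n, _ = t_statistic(a * 2, b * 2)         # same values, doubled sample size

print(round(t, 2), df, round(t_far, 2), round(t_big_n, 2))
```

For the baseline pair t is 2 with 8 degrees of freedom, which is below the tabulated critical value of 2.306 at p = 0.05, so the difference would not be significant. Moving the means further apart or collecting more data both push t upwards.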
A useful variation on the t test is to see if a mean is significantly different from a particular number (N), like in our hypothesis that humans have a mean height of 1.7 m. To do this, simply use the equation below, and look up a critical value of t with degrees of freedom n − 1.

$$t = \frac{|\bar{x} - N|}{\mathrm{s.e.}}$$

Calculate the difference between the mean and the number (N) you want to compare it with, and divide this by the standard error.
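As a sketch of this one-sample version, the function below divides the difference between the sample mean and the target number by the standard error, and reports n − 1 degrees of freedom. The function name and the heights are hypothetical, invented purely for illustration.

```python
import math
import statistics

def one_sample_t(sample, target):
    """Return (t, degrees of freedom) for comparing a sample mean with a fixed number."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t = abs(statistics.mean(sample) - target) / standard_error
    return t, n - 1

# Hypothetical human heights in metres (made-up data for illustration only).
heights_m = [1.62, 1.75, 1.68, 1.80, 1.71, 1.66]
t, df = one_sample_t(heights_m, 1.7)

print(round(t, 3), df)
```

The calculated t is then compared with the critical value from a t table at df = 5, exactly as for the two-sample test.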
Question 6.
For the following data sets, formulate a null hypothesis, and test the hypothesis with a t test. You will need to calculate descriptive statistics and degrees of freedom to do this.
- Dog whelks on sheltered and exposed shores appear to be different sizes ('heights'). Here are some data from two different shore types. Are they significantly different?
  - Height on sheltered shores (mm): 22, 23, 26, 29, 30, 31, 32, 32, 33, 33, 35, 39.
  - Height on exposed shores (mm): 15, 17, 19, 19, 20, 20, 21, 21, 21, 22, 24, 27.
- Some gardeners pollinate tomato plants by hand to ensure a good fruit-set. Other more economically-minded gardeners just spray them with water to encourage pollen transfer. Which gives a better fruit yield, if either?
  - Fruits per plant (sprayed only): 33, 28, 56, 43, 45, 62, 74, 45, 32, 48.
  - Fruits per plant (hand pollinated): 46, 42, 63, 40, 52, 60, 82, 74, 62, 55.
- The Asellus water louse lives in polluted, oxygen-poor water. It is possible to count the number of gill movements the louse makes per minute to determine how much effort it is putting into breathing. The number of gill movements per minute in Asellus from stagnant and oxygen-rich habitats was counted. Do the lice breathe harder in stagnant water?
  - Gill movements per minute (stagnant): 44, 53, 54, 43, 48, 49, 53.
  - Gill movements per minute (oxygen-rich): 42, 48, 46, 43, 49, 42, 41, 40, 44, 48.
Use the formulae above to help.
χ² test
The χ² test is used to determine if what you predict from theory and what you observe in an experiment are significantly different from each other. It is used only for count data, and is particularly useful in genetics testing. It is easiest to understand χ² with an example. If we cross snakes heterozygous for the recessive albinism gene (A is 'normal', and a is albino), we expect a 3:1 ratio of normal to albino offspring (Aa × Aa → (1 AA + 2 Aa) + 1 aa, i.e. three quarters normal phenotype and one quarter albino).
Albino snake (with red eyes that look white under flash photography) right with white markings. The red colour is not affected by albinism.
Our null hypothesis is that there is no difference between our observed ratio and our expected ratio. If we do this cross and score 100 offspring, we expect 100 × ¾ = 75 normal offspring, and 100 × ¼ = 25 albino offspring. If we actually observe these results:
| Class | Observed (O) | Expected (E) |
|---|---|---|
| Normal | 70 | 75 |
| Albino | 30 | 25 |
| Total | 100 | 100 |
then calculating χ² is very simple:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

For each observed (O) and expected (E) pair, we subtract E from O, square the answer, and divide by E. We then add these up for all our categories, and this gives us χ².
So χ² = (70 − 75)² ⁄ 75 + (30 − 25)² ⁄ 25 = 1.33. To see if this is significant, we just need to look it up on a χ² table, with degrees of freedom equal to the number of categories minus 1. In this case we have two categories (normal and albino), so we have 1 df. The critical value is 3.84. Our calculated value is smaller, so we can accept the null hypothesis that there is no difference between our ratio and a 3:1 ratio. So our data support a simple Mendelian monohybrid cross ratio.
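The snake calculation can be reproduced with a short sketch. The function and variable names are illustrative; the observed counts, the 3:1 ratio and the critical value of 3.84 are the ones given in the text.

```python
def chi_squared(observed, expected):
    """Sum of (O - E) squared over E across all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [70, 30]                         # normal, albino
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]   # 3:1 ratio gives 75 and 25

chi2 = chi_squared(observed, expected)
df = len(observed) - 1                      # number of categories minus 1
critical_value = 3.84                       # from a table, p = 0.05, 1 df
significant = chi2 > critical_value

print(round(chi2, 2), df, significant)
```

The calculated value of about 1.33 is below 3.84, so the observed counts are not significantly different from a 3:1 ratio.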
When doing χ², it is important to avoid having any category with fewer than 5 tallies in it. It is also extremely important that O and E are real numbers (30 albino snakes, 20 purple corn kernels, 78 dwarf pea plants), not percentages or proportions. You cannot do a χ² test if you only know you had 30% albino snakes: you must know whether this means 300 snakes in 1000, or 3 in 10, as this makes a big difference to how sure you can be about the ratio. This also goes back to making sure you collect the right sort of data for the test you want to do.
χ² can also be used to see whether a distribution of things is significantly different from random. Question 7 is an example of this.
Question 7.
Often, you will want to do a χ² test on more than one independent category, for example:
| Species of bird | Ground layer | Shrub layer | Tree layer | Row total |
|---|---|---|---|---|
| Blackbird | 123 | 43 | 12 | |
| Great tit | 93 | 247 | 64 | |
| Blue tit | 52 | 72 | 178 | |
| Column total | | | | Grand total |
In this table, the observed values are shown. Our null hypothesis here is that there is no association between the vertical zone and the species of bird found there, i.e. the birds are distributed at random amongst the different forest layers. To calculate our expected values (E), we need to be a little careful, and use the formula below to ensure that we get the same total number of birds in each place:

$$E = \frac{R \times C}{T}$$

R is the row total, C is the column total, and T is the grand total.
You will need to calculate these totals yourself. To work out χ², you just need to do the 'O minus E all squared over E' thing to the nine values, and add them up. The number of degrees of freedom for a two-category χ² is (number of rows − 1) × (number of columns − 1). Do this calculation, and check your answer.
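The row-total times column-total over grand-total recipe is sketched below on a small made-up 2 × 2 table, so as not to give away the answer for the birds. The function names and counts are hypothetical.

```python
def expected_counts(observed):
    """Expected value for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

def chi_squared_table(observed):
    """Return (chi-squared, degrees of freedom) for a table of counts."""
    expected = expected_counts(observed)
    chi2 = sum((o - e) ** 2 / e
               for observed_row, expected_row in zip(observed, expected)
               for o, e in zip(observed_row, expected_row))
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical counts (made-up data for illustration only).
counts = [[10, 20],
          [30, 40]]

expected = expected_counts(counts)
chi2, df = chi_squared_table(counts)

print(expected, round(chi2, 3), df)
```

Note that the expected values have the same row and column totals as the observed ones, which is the point of calculating them this way.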
Question 8.
Try these other χ² tests.
- Some snails of the species Cepaea nemoralis are banded, whilst others are unbanded. Some of these snails live in woodland, others on sand dunes. Is there any difference in the number of bands that wood-inhabiting and dune-inhabiting snails have?

| Habitat | Zero bands | One band | Two bands | Three bands | Four bands | Row totals |
|---|---|---|---|---|---|---|
| Woods | 5 | 8 | 16 | 22 | 10 | |
| Dunes | 10 | 7 | 15 | 9 | 4 | |
| Column totals | | | | | | |

- Clover plants can produce cyanide in their leaves if they possess a particular gene. It is thought that this cyanide wards off herbivores. Clover seedlings of the CN+ (cyanide producing) and CN− (cyanide free) phenotypes were planted out and the amount of nibbling to leaves was measured after 48 hr. Leaves with >10% nibbling were scored as 'nibbled', those with less were scored as 'unnibbled'. Do the data support the idea that cyanide reduces herbivore damage?

| Cyanogenesis | Nibbled | Unnibbled | Row totals |
|---|---|---|---|
| CN+ | 26 | 74 | |
| CN− | 34 | 93 | |
| Column totals | | | |

- In a dihybrid cross between maize plants heterozygous for the S/s (plump-starchy or shrunken-sugary grains) and P/p (purple or yellow grains) loci, we expect a ratio of 9 S-P- (plump purple) : 3 S-pp (plump yellow) : 3 ssP- (shrunken purple) : 1 sspp (shrunken yellow). In a sample of grains, we actually count the numbers of grains shown. Do these data fit a 9:3:3:1 ratio?

| Genotype | Number of kernels |
|---|---|
| S-P- plump purple | 827 |
| S-pp plump yellow | 321 |
| ssP- shrunken purple | 297 |
| sspp shrunken yellow | 89 |
| Total | |

Mann-Whitney's U test
The t test is very powerful, but it relies on some assumptions that may not be true for your data. It assumes:
- The standard deviations of your two samples are the same.
- Your data follow a normal distribution (they are not skewed, don't have two modes, etc).
You can usually get away with these assumptions if you have a large enough sample size (30), but if you are unsure, you may want to try out the U test instead, which is less powerful, but doesn't have these dodgy assumptions. The U test relies on your putting your data into rank order, so it shouldn't be a surprise that it compares medians rather than means. These sorts of ranking tests are called nonparametric tests, because you don't have to calculate parameters like the mean and standard deviation.
Here are some data showing the number of fly agaric toadstools found per hectare in two different sorts of wood. Our null hypothesis is that we find the same number of toadstools in both sorts of wood. The data are not normally distributed (the mean of the birch data is 52.2, but the median is 47.5, which is too different for the data to be normal), so we cannot use a t test.
| Sample number | Birch wood | Mixed wood |
|---|---|---|
| 1 | 43 | 12 |
| 2 | 19 | 18 |
| 3 | 91 | 24 |
| 4 | 76 | 40 |
| 5 | 52 | 19 |
| 6 | 40 | 33 |
| 7 | 82 | 29 |
| 8 | 58 | 44 |
| 9 | 22 | 15 |
| 10 | 39 | 28 |
To use U instead, we need to arrange the data in overall rank order (from 1 to 20), and add up the ranks for the two sets of data. Note the 4.5 and the 12.5: we have to take a mean rank when there are identical pieces of data in both sets.
| Rank | Birch wood | Mixed wood | Rank |
|---|---|---|---|
| | | 12 | 1 |
| | | 15 | 2 |
| | | 18 | 3 |
| 4.5 | 19 | 19 | 4.5 |
| 6 | 22 | | |
| | | 24 | 7 |
| | | 28 | 8 |
| | | 29 | 9 |
| | | 33 | 10 |
| 11 | 39 | | |
| 12.5 | 40 | 40 | 12.5 |
| 14 | 43 | | |
| | | 44 | 15 |
| 16 | 52 | | |
| 17 | 58 | | |
| 18 | 76 | | |
| 19 | 82 | | |
| 20 | 91 | | |
| ∑ R1 = 138 | median = 47.5 | median = 26 | ∑ R2 = 72 |
| | n1 = 10 | n2 = 10 | |
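The R (rank) scores are summed up (∑ R) and then you can feed them into a simple formula to calculate the values of U for the two data sets:
U1 = ( n1 × n2 ) + ( 0.5 × n2 × ( n2 + 1 ) ) − ∑ R2
U2 = ( n1 × n2 ) + ( 0.5 × n1 × ( n1 + 1 ) ) − ∑ R1
So U1 = ( 10 × 10 ) + ( 0.5 × 10 × 11 ) − 72 = 83, and U2 = ( 10 × 10 ) + ( 0.5 × 10 × 11 ) − 138 = 17.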
These U scores give you an idea of how many of your data points are in the top half of the rank order, and how many are in the bottom half. We take the smaller of the two values (here U2 = 17) and compare it to the critical value from a U table, looked up using the sizes of the two samples. From a table with n1 = 10 and n2 = 10 we get a critical value of 23. Unfortunately, for U, we reject the null hypothesis if our calculated value is smaller than (or equal to) the critical value. This goes against every other test I have ever come across, so I can only apologise on behalf of Whitney and Mann. Our value is smaller, so we do reject the null hypothesis, and we conclude that there are more toadstools in birch woods than mixed woods. You may also come across a very similar test called the Wilcoxon matched pairs test, which is suitable for matched data, and may be worth investigating if your U test doesn't seem to see a very obvious (to you) difference between two data sets.
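The ranking and the two U calculations above can be sketched in a few lines of Python. This is a minimal illustration rather than a library-grade implementation: the helper name `mean_ranks` is invented here, and the critical value of 23 is simply the one quoted from the U table in the text.

```python
# Toadstools per hectare, from the table above.
birch = [43, 19, 91, 76, 52, 40, 82, 58, 22, 39]
mixed = [12, 18, 24, 40, 19, 33, 29, 44, 15, 28]


def mean_ranks(values):
    """Map each distinct value to its rank, using the mean rank for ties."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}


# Rank all 20 values together, then add up the ranks for each wood.
ranks = mean_ranks(birch + mixed)
r1 = sum(ranks[value] for value in birch)
r2 = sum(ranks[value] for value in mixed)
n1, n2 = len(birch), len(mixed)

u1 = n1 * n2 + 0.5 * n2 * (n2 + 1) - r2
u2 = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
u = min(u1, u2)

CRITICAL_U = 23  # quoted in the text for n1 = n2 = 10

print(f"sum R1 = {r1}, sum R2 = {r2}")
print(f"U1 = {u1}, U2 = {u2}")
# For U, we reject when the calculated value is at or below the table value.
print("reject null hypothesis" if u <= CRITICAL_U else "do not reject")
```

A useful check on the arithmetic is that U1 + U2 should always equal n1 × n2 (here 83 + 17 = 100).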
Plotting graphs
There are two main sorts of graph we use in biology. The best one to use depends on the nature of the input variable. Bar charts are used to display discrete data, and line charts are used to display continuous data.
Bar charts and error bars
When you draw a bar chart, it is good practice to add error bars to it. These show the standard deviation (or standard error) for each bar. To plot data about the average lifespan of baby snakes, we first calculate some descriptive statistics:
| Pigmentation | Average lifespan (yrs) |
|---|---|
| Normal | 15 (± s.e. 2) |
| Albino | 4 (± s.e. 3) |
It's a good idea to plot error bars, like the ones on the diagram below, for each category. You can only do this if you've taken several measurements of lifespan from both groups of snakes, so make sure you plan for this when you do your experiment. To draw error bars, just add the standard error (or standard deviation) to the mean, and put a little horizontal dash on your diagram at this value (at 17 for the normal snakes), then do the same with the mean minus the standard deviation/error (13 for normal snakes). Join these up and you have an error bar. They're very easy to do by hand; in Excel, you will need to right-click data points and select 'Format data series'…'Y error bars' (the error is in the lifespan, which is on the y-axis). Simple.
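If you would rather work out the ends of the error bars in code before drawing them, the arithmetic is just mean minus error and mean plus error. The short sketch below uses the snake figures from the table; the function name `error_bar` is only an illustrative choice.

```python
# (mean lifespan, standard error) in years, from the table above.
lifespans = {"Normal": (15, 2), "Albino": (4, 3)}


def error_bar(mean, error):
    """Return the (lower, upper) ends of an error bar around the mean."""
    return mean - error, mean + error


bars = {group: error_bar(mean, se) for group, (mean, se) in lifespans.items()}

for group, (lower, upper) in bars.items():
    print(f"{group}: draw the bar from {lower} to {upper} years")
```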

Line graphs and regression
Most people have no trouble plotting a line graph. Now you know how to create error bars, you should be able to draw a line graph with error bars on each plot point (if you've been able to collect enough data to make each point a mean).
Linear regression is a way of finding the best fit line to a set of data. What regression actually does is plot a straight line through your data that passes through the mean of the x values and the mean of the y values. This point is the red dot on the diagrams below. It then pivots the line about this dot until it gets the smallest possible sum of (squared) distances from the points to the line. If we start off with a flat line, with slope 0:

The red dot is (x-mean, y-mean), in this case (5,5). Regression plots a
line through this point, and calculates the distance to all the points.
If we add these distances (red lines) up, we get a measure of how good
a fit the line is to the data. This one is obviously not very good.
We can see that if we add up the squares of the distances of each point to the flat line, it will come to quite a large number. This is yet another of those 'sums-of-squares' (SS) that we have seen before. This line doesn't fit very well, so we pivot the line anticlockwise about the red dot, to a larger slope, here about 1.1.

This looks a lot better: the distances of the points from the line sum to a much smaller value (the SS is much smaller). What happens if we pivot it a little more anticlockwise, say to a slope of 3.5?

The SS starts to increase again: most of the distances are actually so large they are off-scale. It looks like our line of best fit was nearest the one in the second diagram.
Regression is a mathematical way of finding the slope that gives us the smallest SS. If we plot the SS against the slope we get a U shaped curve: as we increase the slope from 0, the SS decreases at first, reaches a minimum, then starts to increase again. A simple bit of calculus is all that's needed to find the slope that gives the minimum of this curve, and therefore the best slope to fit your data. There is no need to do this by hand: calculators and spreadsheets will do regression for you.
Regression gives you two parameters, a slope (often called 'b' on calculators, but we use the mathematician's 'm') and a y-axis intercept (often called 'a' on calculators, here 'c'), which you can use to plot a straight line:
- Plot the point (0, c).
- Plot the point (X, mX + c), where X is the maximum number on the x-axis (10 on the graphs above).
- Join them up.
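The pivoting idea, and the slope and intercept that come out of it, can be sketched as follows. The data here are hypothetical, invented so that the mean point is (5, 5) as in the diagrams; they are not the data behind the diagrams themselves. The slope formula used is the standard least-squares result of the calculus mentioned above.

```python
# Hypothetical data, invented for illustration; the mean point is (5, 5).
xs = [1, 3, 5, 7, 9]
ys = [0.5, 3.0, 5.5, 6.5, 9.5]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)


def sum_of_squares(slope):
    """SS of the vertical distances from the points to a line of the
    given slope that passes through (x_mean, y_mean)."""
    return sum(
        (y - (y_mean + slope * (x - x_mean))) ** 2 for x, y in zip(xs, ys)
    )


# Least-squares slope (m) and intercept (c).
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
m = sxy / sxx
c = y_mean - m * x_mean

# Pivot the line: too flat, best fit, too steep.
for slope in (0, m, 3.5):
    print(f"slope {slope:.3f}: SS = {sum_of_squares(slope):.3f}")

# The two points needed to draw the line on axes running from 0 to 10.
x_max = 10
line_points = [(0, c), (x_max, m * x_max + c)]
print(f"y = {m:.3f}x + {c:.3f}, plotted through {line_points}")
```

For these made-up numbers the SS is largest for the steep line, smaller for the flat line, and smallest at the fitted slope of about 1.1, which is the U shaped behaviour described above.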
Regression also gives you three other parameters: the standard errors of the slope and the intercept, and the coefficient R². R² describes the degree of correlation between the x and y variables (if x and y are correlated, one will increase or decrease when the other increases). It is actually the square of another parameter called R, the correlation coefficient, which you can look up on R tables with (number-of-pairs − 2) degrees of freedom to see if your correlation is significant or not. If you look at the graph below, you can see that two data sets have had a regression performed on them, so both have estimates of the slope and intercept (remember that the equation of a straight line is y = mx + c, where m is the slope and c is the y-intercept). The lower line fits the data perfectly (positive correlation with no scatter, hence R² = 1). The upper line has much more scatter and hence R² is less than 1. If we had perfect negative correlation (a line with no scatter sloping down from left to right), R² would also be 1, although R would be −1: the confusion between R² and R is a reason I dislike this statistic! It is often better to calculate (or get Excel to calculate) the standard error of the slope and intercept. Then you can see if they are significantly different from a particular value (e.g. see if the slope is significantly different from 0) by using a simple t test.
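As a rough sketch of where these extra parameters come from, the code below calculates R, R², the standard errors of the slope and intercept, and a t value for the slope. The data are hypothetical (invented for this example), the formulas are the usual textbook ones for simple linear regression, and the critical t of 3.182 is the standard table value for 3 degrees of freedom at p = 0.05.

```python
import math

# Hypothetical data, invented for illustration.
xs = [1, 3, 5, 7, 9]
ys = [0.5, 3.0, 5.5, 6.5, 9.5]
n = len(xs)

x_mean = sum(xs) / n
y_mean = sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

# Slope and intercept of the best fit line.
m = sxy / sxx
c = y_mean - m * x_mean

# Correlation coefficient R and its square.
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# Variation left over after fitting the line, with n - 2 degrees of freedom.
residual_ss = syy - m * sxy
error_variance = residual_ss / (n - 2)

# Standard errors of the slope and the intercept.
se_slope = math.sqrt(error_variance / sxx)
se_intercept = math.sqrt(error_variance * (1 / n + x_mean ** 2 / sxx))

# t test of whether the slope differs from 0, with n - 2 degrees of freedom.
t_slope = m / se_slope
CRITICAL_T = 3.182  # standard table value: 3 df, p = 0.05

print(f"R = {r:.3f}, R squared = {r_squared:.3f}")
print(f"slope = {m:.3f} (s.e. {se_slope:.3f})")
print(f"intercept = {c:.3f} (s.e. {se_intercept:.3f})")
print(f"t = {t_slope:.2f}; significant: {abs(t_slope) > CRITICAL_T}")
```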
The commonest application of linear regression is to fit a straight line (y = a + bx) through data. Please make sure your data really is on an approximate straight line before you try fitting straight lines through it! Linear regression can also be used to fit curves of best fit described by higher-order polynomials, such as y = a + bx + cx², since (in statistical parlance) the response variable is a linear function of the parameters (a, b, c) that are estimated by the regression. You can sometimes use linear regression to fit curves to non-linear data (such as y = ae^(bx), y = ax/(b + x), y = ax^b), by first mathematically transforming the equation and the data into a linear function (using logarithms or reciprocals), and then fitting a straight line through the transformed data. However, this often has unintended consequences: a common linearisation in enzyme kinetics (Lineweaver-Burk) uses reciprocals, which gives undue weight to the values in which you have the least confidence. Non-polynomial functions, particularly those that are not amenable to linearisation, can also be fitted using more sophisticated forms of nonlinear regression.
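To illustrate the transformation trick, here is a small sketch that fits y = ae^(bx) by taking logarithms, so that ln y = ln a + bx becomes a straight line. The data are hypothetical and generated exactly from a = 2 and b = 0.5 with no scatter, so the fit recovers those values; real data with scatter would be subject to the weighting problems described above. The helper `fit_line` is a name invented for this example.

```python
import math

# Hypothetical data generated exactly from y = 2 * e^(0.5 * x), no scatter.
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]


def fit_line(x_values, y_values):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(x_values)
    x_mean = sum(x_values) / n
    y_mean = sum(y_values) / n
    sxy = sum(
        (x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values)
    )
    sxx = sum((x - x_mean) ** 2 for x in x_values)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Transform: ln y = ln a + b * x is a straight line in x.
log_ys = [math.log(y) for y in ys]
b, ln_a = fit_line(xs, log_ys)

# Back-transform the intercept to recover a.
a = math.exp(ln_a)

print(f"a = {a:.3f}, b = {b:.3f}")
```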

Spearman's rank correlation (rs) test
If you come across data that is not in a straight line, but you still want to know whether the data are correlated (i.e. as X gets bigger, so does Y), you can use a test called the Spearman's rank correlation test. Like the Mann-Whitney U test, it is less powerful than its parametric equivalent (here, linear regression and R), but has fewer assumptions. It cannot be used unless you have more than 4 data points, and shouldn't be used if you have fewer than 6.
Say we are wondering whether the length of a sycamore seed wing influences how quickly it will fall. Our null hypothesis is that there is no such correlation. We have no idea if the relationship is a straight-line or not, so we can't use linear regression. We collect the following data:
| Wing length (mm) | Speed of descent (m s⁻¹) |
|---|---|
| 25 | 1.38 |
| 41 | 0.67 |
| 27 | 1.28 |
| 35 | 0.95 |
| 36 | 1.03 |
| 31 | 1.15 |
| 34 | 1.02 |
| 29 | 1.17 |
| 33 | 0.93 |
To do this test, arrange your X and Y data in rank order.
- Wing length: 25 (rank 1), 27, 29, 31, 33, 34, 35, 36, 41 (rank 9)
- Speed of descent: (rank 1) 0.67, 0.93, 0.95, 1.02, 1.03, 1.15, 1.17, 1.28, 1.38 (rank 9)
You can then draw up a rank correlation table, with the data arranged in pairs, in rank order of wing length:
| Wing length rank | Wing length | Speed of descent | Speed of descent rank | D² |
|---|---|---|---|---|
| 1 | 25 | 1.38 | 9 | 64 |
| 2 | 27 | 1.28 | 8 | 36 |
| 3 | 29 | 1.17 | 7 | 16 |
| 4 | 31 | 1.15 | 6 | 4 |
| 5 | 33 | 0.93 | 2 | 9 |
| 6 | 34 | 1.02 | 4 | 4 |
| 7 | 35 | 0.95 | 3 | 16 |
| 8 | 36 | 1.03 | 5 | 9 |
| 9 | 41 | 0.67 | 1 | 64 |
The column labelled D² is the square of the difference between the ranks for each data point, i.e. (1 − 9)² = 64, (2 − 8)² = 36, etc. If we add up these values of D² we get ∑ D² = 222. We then apply a simple formula to get our rank correlation statistic, rs:
rs = 1 − ( 6 × ∑ D² ) ⁄ ( n × ( n² − 1 ) )
So rs = 1 − ( 6 × 222 ) ⁄ ( 9 × ( 9² − 1 ) ) = −0.85. Like R, if rs is close to +1, there is good positive correlation, and if it is near −1, there is good negative correlation. We have good negative correlation: as the wing length increases, the speed of descent decreases. To see if this is a significant correlation, we compare it to a table of critical values with a sample size of n = 9. From an rs table, the critical value is ±0.600. As our calculated value is larger (ignore the negative sign), we can reject the null hypothesis, and say that the longer the wing the more slowly the seeds fall.
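The ranking and the rs calculation can be sketched as below. The speeds are the ones given in the ranked list above (so the 33 mm wing is paired with 0.93 m s⁻¹, rank 2). The helper `ranks` is an illustrative name, and it assumes there are no tied values, which is true for these data; the critical value of 0.600 is the one quoted from the rs table in the text.

```python
# Paired measurements, in the order they were collected.
wing_length = [25, 41, 27, 35, 36, 31, 34, 29, 33]
speed = [1.38, 0.67, 1.28, 0.95, 1.03, 1.15, 1.02, 1.17, 0.93]


def ranks(values):
    """Rank values from smallest (1) to largest. Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(value) + 1 for value in values]


# Squared difference between the two ranks of each pair.
d_squared = [
    (rank_x - rank_y) ** 2
    for rank_x, rank_y in zip(ranks(wing_length), ranks(speed))
]
sum_d2 = sum(d_squared)
n = len(wing_length)

rs = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

CRITICAL_RS = 0.600  # quoted in the text for n = 9

print(f"sum of D squared = {sum_d2}")
print(f"rs = {rs:.2f}")
# Compare the size of rs, ignoring its sign, with the table value.
print("significant" if abs(rs) > CRITICAL_RS else "not significant")
```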
ANOVA and beyond
Our final port of call in this whirlwind tour of statistical techniques is analysis of variance or ANOVA. If you're lucky you'll not have to use this technique, so feel free to skip to the end if you don't think this will be useful. ANOVA can be used when you come across a situation where you are trying to do dozens of t tests on data to show that several means are significantly different from each other.
One way ANOVA
For example, if you have a trial where you give three different sorts of food to Venus fly traps (nothing, BabyBio and flies), and want to find if these different foods have different effects on growth, you can use ANOVA to compare your three means, rather than doing three pairwise t tests. This is called a three level, one factor (or one way) ANOVA: we are only investigating one factor (sort of food), at three different levels (nothing, BabyBio, flies).
Here is our data:
| | Food: Nothing | Food: BabyBio | Food: Flies |
|---|---|---|---|
| Leaves produced per year | 6 | 2 | 7 |
| | 4 | 3 | 8 |
| | 5 | 2 | 8 |
| | 7 | 3 | 7 |
| | 6 | 3 | 8 |
| | 5 | 2 | 9 |
| | 6 | 1 | 10 |
| | 7 | 0 | 11 |
| | 5 | 2 | 9 |
| | 4 | 1 | 7 |
| Mean | 5.5 | 1.9 | 8.4 |
We have calculated individual means for each of the three food sources. We must also calculate the grand mean of the entire data set, which is 5.3. To see whether the number of leaves produced per year is influenced by the kind of food we give the flytraps, we need to calculate three sums-of-squares:
The first is the sum-of-squares of the entire data set. This total sum of squares (SST) is found by calculating:
SST = ∑ ( x − x̄ )²
i.e. find the grand mean of all the data (x̄ = 5.3), and add together (6 − 5.3)² + (2 − 5.3)² + (7 − 5.3)² + (4 − 5.3)² + … for all 30 data points. This is a bit of a hassle to do by hand, so you might like to know that it is equal to the total variance (standard deviation squared) multiplied by the number of degrees of freedom in the data (n − 1), so SST = s²(n − 1) = 247.9, with 29 degrees of freedom.
The second SS to calculate is the amount of variation that is accounted for by our model, SSM. This is found by calculating the difference between each of our individual means and the grand mean. In the graph below, you can see the three data sets, with a line showing the grand mean. SSM is found by calculating the distance of each individual mean from the grand mean, much like we saw when we were discussing regression.

The actual calculation we need to do is:
SSM = ∑ n × ( individual mean − grand mean )²
i.e. number of replicates within a sample group multiplied by (individual mean − grand mean)², added up over all the groups.
So, SSM = 10 × (5.5 − 5.3)² + 10 × (1.9 − 5.3)² + 10 × (8.4 − 5.3)² = 212.1. All SS values have an associated number of degrees of freedom; for model SS's, this is a little different to the ones you have come across before. For SSM, the df is the number of means we have estimated minus 1, so as we have estimated 3 means, SSM has 2 degrees of freedom. (Although this sounds like a new rule, different from the sample-size-minus-number-of-means we have used before, it isn't really: we have a 'sample' of three means, and have estimated one grand mean from them, so ν = 3 − 1.)
By fitting three means to the data, we account for nearly all of the total variation, so our model seems quite good. To see exactly how good, we need to work out one final sum of squares. This is the SSR, the residual sum of squares, which is the variation left over after fitting the model. It equals SST − SSM, so for our data, this is 247.9 − 212.1 = 35.8. This has whatever degrees of freedom are left from the total, i.e. 29 (total) − 2 (model) = 27 (residual).
These three sums of squares and degrees of freedom can be summarised in an ANOVA table: it's good practice to do this, because you can check your maths and make sure that the things that should add up do add up!
| Source | SS | df | MSS | F | Significance |
|---|---|---|---|---|---|
| SSM | 212.1 | 2 | 106.0 | 80.0 | ** |
| SSR | 35.8 | 27 | s² = 1.3 | | |
| SST | 247.9 | 29 | | | |
The column labelled MSS is the 'mean sum of squares': it shows how much variation is accounted for given the number of means (df) you need to get this reduction in variation. It would be possible to fit a model that had 15 means in it, but this would barely be better than the 30 pieces of raw data, so to see how good our model parameters really are, we scale the SS values by dividing them by the df, e.g. 212.1 ⁄ 2 = 106.0. These MSS values show us how 'explanatory' our model is: an explanatory model explains away a large amount of variation (SSblah) without having to invoke hundreds of means (df).
One of these MSS values is special: the one we get from SSR. This value is called the error variance, s², and measures the amount of variation that is not explained by our model. To see how explanatory our model is, the final step is to divide the model's MSS by s², which gives us a statistic called F, which finally allows us to test how significant the various parameters in our model are! To do this, we look up the critical value of F from tables with (apologies for the words) numerator = parameter df and denominator = s² df. So for SSM we check a table with (2, 27) degrees of freedom. You can do this on a p = 0.05 F table to see if it is significant (*), and also on a p = 0.01 F table to see if it is very significant (**). As usual, it's significant if our calculated F is bigger than the table F.
From this we conclude that the sort of food we give the flytraps makes a significant difference to the number of leaves they make per year. Two-way ANOVA is rather more complex.
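The whole one-way ANOVA table above can be reproduced with a short script, which is a handy way to check the arithmetic. This is a minimal sketch using the flytrap data; the variable names are illustrative, and it works from the unrounded grand mean, so the results agree with the table after rounding.

```python
# Leaves produced per year for each food, from the data table above.
groups = {
    "Nothing": [6, 4, 5, 7, 6, 5, 6, 7, 5, 4],
    "BabyBio": [2, 3, 2, 3, 3, 2, 1, 0, 2, 1],
    "Flies": [7, 8, 8, 7, 8, 9, 10, 11, 9, 7],
}

all_values = [value for values in groups.values() for value in values]
grand_mean = sum(all_values) / len(all_values)

# Total SS: every data point against the grand mean.
sst = sum((value - grand_mean) ** 2 for value in all_values)

# Model SS: each group mean against the grand mean, times group size.
ssm = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Residual SS: whatever variation the model has not explained.
ssr = sst - ssm

df_total = len(all_values) - 1
df_model = len(groups) - 1
df_residual = df_total - df_model

mss_model = ssm / df_model
error_variance = ssr / df_residual
f_value = mss_model / error_variance

print(f"SSM = {ssm:.1f} on {df_model} df, MSS = {mss_model:.1f}")
print(f"SSR = {ssr:.1f} on {df_residual} df, s squared = {error_variance:.1f}")
print(f"SST = {sst:.1f} on {df_total} df")
print(f"F = {f_value:.1f}")
```

The calculated F still has to be compared with the critical value from an F table with (2, 27) degrees of freedom, as described above.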
ANOVA is 'real' statistics. Along with regression, it is the tip of an iceberg of techniques for mass data analysis. One worth knowing is beyond Excel's pathetic powers, but at least you'll know what to Google for if an experiment like the one below comes up. It's called ANCOVA (analysis of covariance), and is simultaneous regression and ANOVA. Say we have done the same wheat experiment as we have already described, but this time we have used ten different concentrations of each fertiliser too. What we really want to do is plot six graphs and see whether their slopes are significantly different. ANCOVA lets us do this, but we won't go into the details. Multiple regression is a technique that can be used to fit several regression lines to a single data set, so you can see how both fertiliser concentration and light intensity influence grain yield at the same time. ANODEV (analysis of deviance) is the most powerful of the statistical techniques: it is a sort of superset of all the statistical techniques so far described, and can be used to fit a very wide range of models. Such ANODEV modelling is usually called generalised linear modelling (GLM for short). If you ever want to do 'real' stats, GLM is the way to go, but I wouldn't recommend it unless you have access to a good stats package, and a lot of time on your hands. It's way beyond A-level standard anyway!
Summary
Statistics tend to be thought of as rather difficult, and to be left to the last minute. In fact, they are mostly quite simple to use, especially the commoner tests, which make intuitive sense when you realise what they actually do. The reason that most people have difficulties with them is that they put off thinking about them until after they collect their data, and then get into a mess trying to retro-fit statistics onto data they don't suit.
The important thing to remember is to plan the stats before you collect the data. That way, the statistics are a breeze, and practically analyse your data for you. If you'd like an excellent and free package to explore these statistics and many more, then you can't go far wrong with R.

This page has been peer reviewed by 4 people. Thanks to Adrian Newton
and Beat Rupp for their feedback, to Rob Campbell for his suggestions
and correction, and to Michael Kospach, author of the indispensable
Perl module Statistics::Distributions, without whom there
would be no statistical tables to go with this guide.
