
Plant Detectives Manual: a research-led approach for teaching plant science

Appendix B: What do I do with my data?

B.1) An introduction to data

Data are wonderful (and always plural). In the Plant Detectives Project you will produce a surprising amount of data, and we want to get you off on the right foot so you know how to record and organise them. This may seem basic, and even unimportant, but just imagine you’re listening to years of previous students saying ‘I really wish I’d paid more attention to how I collected my data in the first weeks of the project’, and you might decide to follow these tips from the start!

 

Ongoing and efficient data collection and record keeping puts you in a position to analyse your data and ask questions that you may not be able to anticipate at the outset. So, try to keep all bases covered. We will demonstrate these techniques in principle, but not with a specific software package in mind. Your instructor will provide you with instructions tailored to the packages you have available.

 


1. Labels

  1. As described in the practical activities and appendices, label every plant with a unique number, indicate whether it is wild type or mutant, and provide the background genotype you have. Make sure the labels stay with the plants. To avoid losing a label, consider painting the label onto the pot containing the plant.
  2. Whenever you record data, make sure you also record the information on the label.

2. Datasets

The data that you collect should first be entered into a spreadsheet or database. Your instructor will advise on what package to use, but the following rules apply to most:

  1. Put the column labels in the first row; do not put other text in the rows. Use comments or a notes page for additional text.
  2. Include columns for the following in all of your datasets:
     1. plant ID: this is the plant number, if you decide to serial number your samples, or the pot number
     2. genotype: wild type or mutant
     3. replicate: number of the replicate
     4. date: the date of data collection. You may also want to include the practical week, since this will be used when graphing the data
     5. name of collector: this is optional. If you record who in your group collected the samples it can be useful in the future if something goes missing or looks confusing
     6. treatment (if applicable)
  3. Always record the units of your measurements; it is surprisingly easy to mix up centimetres (cm) and millimetres (mm) after a period of time has passed without reviewing data.
  4. Do not leave blank lines between your rows or columns. Blank lines confuse the sorting algorithms of your spreadsheet program and will lead to the sorting of some, but not all, data. At the least this is annoying. At the worst you end up sorting your labels and not your data or vice versa.
  5. Save and back up your data regularly:
     1. take turns within your practical group to upload data to an online storage option each week
     2. get agreement from the group on the dataset to be used and to keep it backed up.
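The rules above can be sketched in code. This is only an illustration using Python's standard `csv` module, not a package recommended by the manual: the column names, the example rows, the dates and the helper functions `write_dataset` and `find_incomplete_rows` are all hypothetical. Note that the trait column carries its unit in its name.

```python
import csv
import io

# Hypothetical column labels for the first row; the unit is part of the trait name.
COLUMNS = ["plant_id", "genotype", "replicate", "date", "week",
           "collector", "treatment", "stem_height_cm"]
# Columns that should never be empty (collector and treatment are optional).
REQUIRED = ["plant_id", "genotype", "replicate", "date"]

# Invented example records; the third one is missing its genotype on purpose.
rows = [
    {"plant_id": "1", "genotype": "wild type", "replicate": "1", "date": "2024-03-04",
     "week": "1", "collector": "AB", "treatment": "control", "stem_height_cm": "4.2"},
    {"plant_id": "2", "genotype": "mutant", "replicate": "1", "date": "2024-03-04",
     "week": "1", "collector": "AB", "treatment": "control", "stem_height_cm": "3.1"},
    {"plant_id": "3", "genotype": "", "replicate": "2", "date": "2024-03-04",
     "week": "1", "collector": "CD", "treatment": "control", "stem_height_cm": "4.0"},
]


def write_dataset(records, handle):
    """Write the column labels as the first row, then one row per plant."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)


def find_incomplete_rows(handle):
    """Return the 1-based data-row numbers that lack a required value."""
    reader = csv.DictReader(handle)
    incomplete = []
    for number, row in enumerate(reader, start=1):
        if any(not (row.get(name) or "").strip() for name in REQUIRED):
            incomplete.append(number)
    return incomplete


buffer = io.StringIO()          # stands in for a file on disk
write_dataset(rows, buffer)
buffer.seek(0)
print(find_incomplete_rows(buffer))
```

Running a check like this before each weekly backup is one way to catch missing label information while the plants are still available to re-check.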

 

Sample data sheets

  1. Record of major phenotypic observations

Experiment:
Date start:
Seeds:

| Date | Day | Comment/observation |
| --- | --- | --- |
|  |  |  |

  2. Excel spreadsheet to record data (always record units too)

| Student or group name | Plant number | Genotype | Treatment (if applicable) | Date and/or week in practical | Trait value 1 (e.g., stem height) | Trait value 2 (e.g., leaf size) |
| --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  |  |  |

  3. Alternative spreadsheet format, for germination assay

| Plate # | Position # | Genotype | Week | Date | Trait value 1 (e.g., root length (mm)) | Trait value 2 (e.g., secondary roots (1/0)) |
| --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  |  |  |

B.2) Exploratory data analysis

In Section B.1 we discussed how to record and organise your data. This section covers how to explore your data and present them visually. This is called exploratory data analysis (EDA). While these techniques can be demonstrated in principle, this section has not been structured with a specific software package in mind. Your instructor will provide you with guidelines tailored to the packages you have available.

 

Why is this important? Too often researchers (not just students) leap into analysing their data statistically before getting to know what it’s telling them. EDA is an important first step that can save time in the long run.

B.2.1) Look for outliers and check the distribution of your data points

The following analyses are based on the assumption that your data will fit, roughly, a normal (bell-shaped) distribution. But, data don’t always oblige. You need to check the distribution of your data before proceeding.

 

Further, you need to check for outlying points — outliers — in your data. Most frequently these are data entry errors, but sometimes you will identify a data point that is different from the rest. This point may cause your data to violate the assumptions of normality. You may need to exclude the point or to transform your data to make it approximately normal.

 

You can use a statistical analysis package to plot a histogram or box plot of your data to check for normality and look for outliers:

  1. If your data are normally distributed, your histogram will look like a bell curve. If your bell has long tails or an isolated point shows up on one or other side of the curve, you have an outlier. If your data do not form a single smooth hump, or your curve leans heavily in one direction or the other, you may have data that are not normally distributed. You may need to use a mathematical transformation to rescale your data so they meet the assumption of normality. The most common transformation in biology is the log transformation. Don’t fear transforming your data: you have not changed them, just scaled them.
  2. A box plot is an alternative way of looking at your data. A box plot (Fig. i) shows you the mean (average), median (middle value), and the quantiles of your data. The 25% and 75% quantiles are generally shown as the bottom and top of your ‘box’. The median (50% quantile) is shown as a line drawn across the box. The mean is generally a dot or diamond that should sit near to the median. The box usually has whiskers to show you where the 5% and 95% quantiles of your data sit, although the exact convention varies among packages. Any points that appear above or below the whiskers are outliers. If your mean is skewed to one side of your box, your data are not normally distributed.
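The quantities a box plot displays can also be computed directly. The sketch below uses Python's standard `statistics` module; the rosette diameters are invented, and the helper `box_plot_summary` is hypothetical. As an assumption that differs from the 5% and 95% whiskers described above, it flags outliers with the widely used rule of 1.5 times the interquartile range beyond the box, which is the default in many plotting packages.

```python
import statistics

# Invented rosette diameters (cm); the last value mimics a misplaced decimal point.
diameters = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 50.0]


def box_plot_summary(values):
    """Return the mean, median, quartiles and any points beyond the fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # 25%, 50%, 75% cut points
    iqr = q3 - q1                                   # height of the box
    low_fence = q1 - 1.5 * iqr                      # assumed whisker rule
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "outliers": outliers,
    }


summary = box_plot_summary(diameters)
print(summary)
```

In this invented dataset the mean sits far above the median, which is the kind of skew the box plot description warns about, and the single flagged point is the one to check against your lab notes.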

 

You can also use a spreadsheet program with chart options to look for outliers by making an x vs y plot, or scatter plot, of two columns of your data. For example, plot leaf number vs rosette diameter, where you have both measures for each plant. In this case, it is likely that these two variables would scale closely with each other. Any plant that has a high leaf number for its rosette diameter, or a high diameter for its leaf number, will fall at a distance from the other points on your graph. Check these points to see if there has been a data entry error (most often this is a mistakenly added or removed digit or decimal place). If most plants with a rosette diameter of 5 cm have ten leaves, but one of your plants has one or 100, it’s probably a typo.
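The same check can be done numerically rather than by eye. The following is a rough sketch, not a formal test: the records are invented, the helper `flag_suspect_plants` is hypothetical, and the threshold (a ratio more than three times larger or smaller than the typical one) is an arbitrary assumption chosen only to illustrate the idea.

```python
import statistics

# Invented (plant_id, rosette_diameter_cm, leaf_number) records.
plants = [
    (1, 5.0, 10),
    (2, 4.5, 9),
    (3, 5.5, 11),
    (4, 5.0, 100),   # probably an extra digit
    (5, 6.0, 12),
    (6, 5.0, 1),     # probably a missing digit
]


def flag_suspect_plants(records, factor=3.0):
    """Return IDs of plants whose leaves-per-cm ratio is far from the median ratio."""
    ratios = {pid: leaves / diameter for pid, diameter, leaves in records}
    typical = statistics.median(ratios.values())
    return [pid for pid, ratio in ratios.items()
            if ratio > typical * factor or ratio < typical / factor]


suspects = flag_suspect_plants(plants)
print(suspects)
```

A flagged plant is only a prompt to go back and check the original record; it does not by itself tell you whether the value is a typo or a real biological difference.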

 

What do you do with an outlier?

  1. First you must decide whether the outlier is biologically real or a data entry error. If the latter, fix the mistake. If possible, repeat the measurement. If not possible, then you may choose to delete the entry from your dataset and make a note in a comments column explaining why a data point was deleted. These are called missing data and should be mentioned in your write-up.
  2. If the point is not a typo but is a statistical outlier you may still need to exclude it. Determine whether the value reflects a plant that has somehow been damaged or is ill or whether you have had an instrument failure that has given you an incorrect result. If so, delete as above.
  3. Finally, sometimes you will come across an individual that is just very different from the others. In the case of this practical, this might even be a seed that has mistakenly been mixed into a seed lot, or a spontaneous mutation. Such data points, even though real, will prevent you from being able to statistically analyse your data for two reasons. First, the points may lead you to violate the assumption of normality. Second, the data points will have undue influence on your ability to estimate the average value of the variable you are measuring. For example, if you have one plant out of 24 that is ten times the size of the others, you would calculate an average size that is much larger than 23 of your 24 plants. This is not an accurate description of the average size. In such cases you are justified in excluding the data point from your graphs and statistical analysis. You must, however, mention in your write-up that the data were excluded and why.
  4. Note: if you delete a data point because it is an outlier, make sure to do so in all copies of your dataset and keep records — that way you won’t get confused down the track.
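The influence of a single extreme plant on the mean is easy to demonstrate with a worked example. The sizes below are invented to match the scenario in point 3 (23 similar plants and one that is ten times larger); the variable names are illustrative only.

```python
import statistics

# Invented sizes (cm): 23 plants of 2.0 cm and one plant ten times that size.
sizes = [2.0] * 23 + [20.0]

mean_all = statistics.mean(sizes)            # (23 * 2.0 + 20.0) / 24
median_all = statistics.median(sizes)        # barely affected by the large plant
mean_without = statistics.mean(sizes[:-1])   # mean after excluding the large plant

print(mean_all, median_all, mean_without)
```

Here the mean of all 24 plants is larger than 23 of the 24 individual values, while the median and the mean without the extreme plant both describe the typical plant. If you exclude a point like this, record the exclusion and the reason, as described above.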

 

B.2.2) Compare the average (mean) of your two populations (wild type and mutant)

Once you are sure that you have entered your data correctly and there are no outliers, it is possible to compare the mean of the two populations. By ‘population’ here we mean wild type versus mutant. Or, if you have imposed a treatment like a drought stress, you will have two populations × two treatments = four means to compare. To compare the populations, calculate the mean and the variance. The variance describes how much difference there is among the plants within each of your populations. If the variance is large, then it will be more difficult to see a statistically significant difference in the population means.

 

Most spreadsheet packages will enable you to calculate a mean and standard deviation (a common descriptive measure of variability in your data) using formula or equation functions. If you wish to calculate these yourself, the mean is simply the average (sum the values and divide by the total number of points). The standard deviation (SD) is calculated as:

$$\mathrm{SD} = \sqrt{\frac{\sum (X - \bar{x})^2}{n - 1}}$$

Where X is the value for a given sample, x̅ is the mean of all samples, and n is the total number of samples of a given treatment or genotype combination (e.g., number of wild type plants measured).

 

When you want to compare averages (means), for example, across treatments or genotypes, you need a measure of the variability of the mean itself rather than of the individual plants. This measure is the standard error (SE), which is used to infer whether two sampled means are different. To calculate standard error, use SD and n:

$$\mathrm{SE} = \frac{\mathrm{SD}}{\sqrt{n}}$$
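If you would like to check your spreadsheet's output by hand, the calculations can be sketched as follows. This assumes the sample form of the standard deviation, with n − 1 in the denominator, which is what Python's `statistics.stdev` and most spreadsheet functions compute, and takes the standard error as SD divided by the square root of n. The stem heights and the helper `describe` are invented for illustration.

```python
import math
import statistics


def describe(values):
    """Return the mean, sample standard deviation and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared deviations from the mean, divided by n - 1 (assumed sample SD).
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    se = sd / math.sqrt(n)
    return mean, sd, se


# Invented stem heights (cm) for four wild type plants.
heights_cm = [4.0, 5.0, 6.0, 5.0]
mean, sd, se = describe(heights_cm)
print(round(mean, 3), round(sd, 3), round(se, 3))
```

The standard error is always smaller than the standard deviation when n is greater than 1, which is why it matters to state in your figure legend which of the two your error bars show.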

B.2.3) A picture is worth 1,000 words

Plot a graph of your data so that you can visualise the differences (or lack thereof) between the average outcome of your wild type and mutant plants. A simple bar chart will usually suffice. Be sure to plot error bars on your graph so that you can see the variance around your mean. If the error bars of your means are heavily overlapping, don’t expect a statistical test to tell you there is a difference in the means!

 

We recommend that you graph your data at the end of each week’s practical. This way you keep up to date on what your results are. You should bring your graphs to the start of the next practical class so that you can present your results in your discussion group and find out if the other student groups got the same results. If your results differ, you may have found something that distinguishes your mutant.

 

As most spreadsheet packages have graphing options that are adequate to make the graphs needed for this project, and some statistical software packages also make excellent graphs, it is not necessary to use a dedicated graphing package. And, if you are so inclined, graph paper and rulers still provide a perfectly effective way of graphing your results!

B.3) Comparing two means

T-tests (or one-way analysis of variance, ANOVA) are used to assess whether the average value for a trait in one population differs from that of another. Specifically, these tests assess whether the variation between populations (e.g., between the means) is greater than that within the populations. In this case the populations are the wild type and mutant plants and the trait of interest may be rosette diameter or leaf size. The T-test formalises what you saw when you graphed your data and checked whether the error bars overlapped.

 

All statistical packages and most spreadsheet packages have an option for calculating a T-test. The formatting requirements vary among packages. If the package requires setting a hypothesised mean difference, set this to 0. You may also need to set an alpha level; pick 0.05 (see below).

 

To run and interpret the T-test, you will need the following information and terms:

  • the measured variable that you are interested in comparing, e.g., root length or rosette diameter, which is known as the data variate, response variable, or dependent variable; statistical packages differ in their choice of language
  • the variable indicating group or population, which is known as the group factor or independent variable. In this case the population is genotype: wild type or mutant
  • ‘discrete’ and ‘continuous’ variables. Continuous variables are those for which the values vary continuously, such as a length. A factor or variable is called discrete when it can only have particular states (low, medium and high, for example). For a T-test, the population variable is discrete and the measured variable is continuous
  • degrees of freedom, which reflect the number of data points, or samples, in your analysis. For a T-test the degrees of freedom is the number of samples in each population minus 1, summed over the two populations; that is, the total number of samples minus 2
  • the P value, which tells you the probability that a difference in means at least as large as the one observed could have arisen by chance alone if there were no true difference. Small values of P indicate that the difference was unlikely to have arisen by chance, and we call this evidence for a true difference. By convention, a P value of less than 0.05 is considered sufficiently small to count as evidence for a true difference. This is a ‘statistically significant difference’.

When you run your T-test, specify the measured variable and the group variable and get the package to calculate the test statistics. Your table will look something like this:

 

Image

 

In this case, the most important things to be able to glean from this table are the means, the standard errors of the means, the group size and the P value. The degrees of freedom should be two less than your number of samples. If P < 0.05 then it is legitimate to refer to a statistically significant difference in the mean value between your genotypes.

 

Think about your results and compare the statistical result with your observations of the means from your exploratory data analysis. Do the statistics support your previous conclusions?

B.4) Two-way tests

Up to this point we have discussed statistical tests comparing two populations only: wild type and mutant. For some studies that you will do in the Plant Detectives Project, however, you will consider two experimental factors at once; these are factorial designs (see box in Activity 9). For example, in the drought experiment you will consider the effect of water stress as well as genotype, asking: Do the mutants differ from the wild type plants in their response to water stress? If your drought-stressed mutant plants fare better than your drought-stressed wild type plants, then you will conclude that the mutation improves the ability of the plants to cope with drought. Sometimes the effect of a mutation will be apparent only under stressed conditions, not under benign ones. We say there is an interaction between the genotype and drought effects if the severity of the drought effect depends on genotype.

 

To run a two-way ANOVA, you need to supply similar information to that gathered for the T-test:

  • the measured variable that you are interested in comparing; e.g., root length or rosette diameter
  • the variables indicating the populations/treatments. In your case the population is genotype: wild type or mutant, and the treatment is normal or drought
  • indicate that you want to compare means by genotype and by drought treatment, as well as a comparison of the drought response by genotype. In the treatment structure box, write:

genotype + drought + genotype.drought

 

A two-way ANOVA produces a table like this:

 

Image

From this table you want to assess three different P values. The P value for the genotype effect tells you whether or not the genotype means (ignoring drought) differ. The P value for the drought effect tells you whether or not the drought treatment means (ignoring genotype) differ. The interaction term P value tells you whether the effect of drought on wild type is the same as the effect on the mutant.

 

In our case, the interaction term P value is the one of interest, and small P values for the interaction term provide evidence that the effect of drought on mutants is different to the effect of drought on wild type plants. The interaction term may be significant if one type responds more than the other type (for example if the wild type is less sensitive to drought). Alternatively, the interaction term may be significant if one type responded in the opposite direction to the other, or didn’t respond at all. If you have significant type and treatment effects, but don’t have a significant interaction term, the analysis indicates that each type responds to drought in the same way, but that the types differ inherently in the trait you measured.

 

Again, your stats package should also calculate the means and standard error for your treatments. There will be three sets of means: the genotype means and drought means (each of which ignores the other treatment) and the means of all four groups (wild type under drought, wild type well watered, mutant under drought and mutant well watered).
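To make the three sets of means concrete, the sketch below computes them for an invented drought dataset, along with a simple interaction contrast: the drought response of the mutant minus the drought response of the wild type. The data, the group labels and the helper functions are hypothetical, and this is not a two-way ANOVA. It gives no P value, so a non-zero contrast on its own is not evidence of a significant interaction.

```python
import statistics

# Invented rosette diameters (cm), keyed by (genotype, drought treatment).
data = {
    ("wild type", "well watered"): [6.0, 6.2, 5.8],
    ("wild type", "drought"): [3.0, 3.2, 2.8],
    ("mutant", "well watered"): [6.1, 5.9, 6.0],
    ("mutant", "drought"): [5.0, 5.2, 4.8],
}


def cell_means(groups):
    """Mean of each of the four genotype-by-treatment groups."""
    return {key: statistics.mean(values) for key, values in groups.items()}


def marginal_means(groups, position):
    """Means by genotype (position 0) or by treatment (position 1), ignoring the other factor."""
    pooled = {}
    for key, values in groups.items():
        pooled.setdefault(key[position], []).extend(values)
    return {level: statistics.mean(values) for level, values in pooled.items()}


def interaction_contrast(means):
    """Drought response of the mutant minus drought response of the wild type."""
    wild_type = means[("wild type", "drought")] - means[("wild type", "well watered")]
    mutant = means[("mutant", "drought")] - means[("mutant", "well watered")]
    return mutant - wild_type


four_means = cell_means(data)
genotype_means = marginal_means(data, 0)
drought_means = marginal_means(data, 1)
contrast = interaction_contrast(four_means)
print(genotype_means, drought_means, contrast)
```

In this invented example the wild type loses about 3 cm under drought and the mutant about 1 cm, so the contrast is about 2 cm. Whether a difference of that size is statistically significant is what the interaction term P value from your stats package tells you.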

 

What do you do with this information? The test statistics can be reported in a table or in text and the means can be graphed to illustrate significant results. If there is not a significant P value for the interaction of genotype and drought treatment, plot the means of each, ignoring the other. Otherwise, plot the four means in a bar or line graph. Sometimes the lines make it easier to interpret the results, sometimes bar charts are more effective. It is up to you which you present (see Fig. 12).

Image

Figure 12: P values for several hypothetical two-way analyses of variance are given in the table

The scenarios A, B, C and D are graphed in the interaction plots of the same label. Note particularly the slopes of the lines and the overlap (or lack thereof) of the error bars.

B.5) Presenting your results

For your final report include tables and figures that present your results as described in Activity 12.1. There is an art to presenting your data well — we do not want to see your raw data, nor do we want exhaustive tables of mean values or pages and pages of statistical analyses. Rather, we want you to use your data to tell a story — in this case, to make an identification of your mutant Arabidopsis and to justify that identification. As the semester progresses, you will accumulate a substantial dataset and a wide range of results from your work. You may choose to present all or just some of the data you have collected and analysed — sometimes less is more if it helps get your story across more clearly.

 

Remember the following when presenting your results:

B.5.1) Tables

  1. tables have a brief descriptive title above them and columns below the title
  2. the first line of the table should contain column identifiers
  3. ensure that all values show units (e.g., cm or grams (g)). Tables can be made in word processing or spreadsheet programs
  4. do not present the same data in both a table and a graph

B.5.2) Figures

  1. figures have a brief descriptive legend below them. Either describe in the legend or show on the figure the meaning of symbols and abbreviations
  2. label all axes and show the units (e.g., cm or g).
  3. include error bars with SD (or standard error if that is what your statistics package calculated) on them. Specify what the error bars are in your legend
  4. figures can be made in a range of software packages or they can be drawn by hand using graph paper. Do whatever you prefer, but make sure the points above are adhered to for all graphs

B.6) Statistical description of two-sample tests

For those of you interested in the guts of the T-test: the equation we use for this test is:

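$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$

Where $\bar{x}_1$ and $\bar{x}_2$ are the means of samples 1 and 2 respectively, and $n_1$ and $n_2$ are the sample sizes (number of data) for samples 1 and 2 respectively. And,

$$s_p^2 = \frac{DF_1 \, s_1^2 + DF_2 \, s_2^2}{DF_1 + DF_2}$$

Where $DF_1$ and $DF_2$ are the degrees of freedom (n − 1) and $s_1^2$ and $s_2^2$ are the variances for samples 1 and 2 respectively.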
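For readers who want to compute the statistic themselves, here is a minimal sketch of the standard pooled-variance two-sample t statistic, which is assumed here to be the form this section describes. It uses only Python's standard library; the root lengths and the helper `pooled_t` are invented for illustration. It returns the t value and the degrees of freedom but not a P value, which still has to come from a t table or a statistics package.

```python
import math
import statistics


def pooled_t(sample1, sample2):
    """Return the pooled-variance t statistic and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    df1, df2 = n1 - 1, n2 - 1
    var1 = statistics.variance(sample1)   # sample variance (n - 1 denominator)
    var2 = statistics.variance(sample2)
    # Pooled variance: each sample's variance weighted by its degrees of freedom.
    pooled_var = (df1 * var1 + df2 * var2) / (df1 + df2)
    difference = statistics.mean(sample1) - statistics.mean(sample2)
    t = difference / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df1 + df2


# Invented root lengths (mm) for wild type and mutant seedlings.
wild_type = [5.0, 6.0, 7.0]
mutant = [2.0, 3.0, 4.0]
t_value, degrees_of_freedom = pooled_t(wild_type, mutant)
print(round(t_value, 3), degrees_of_freedom)
```

The degrees of freedom returned here are the total number of samples minus 2, matching the description in B.3.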

 

When you have calculated your t-statistic, look up the t value in this table. The following information is based on Bower et al. (1989).

 

