How to Read a Bar Graph With Error Bars
Abstract
Error bars commonly appear in figures in publications, but experimental biologists are often unsure how they should be used and interpreted. In this article we illustrate some basic features of error bars and explain how they can help communicate data and assist correct interpretation. Error bars may show confidence intervals, standard errors, standard deviations, or other quantities. Different types of error bars give quite different information, and so figure legends must make clear what error bars represent. We suggest eight simple rules to assist with effective use and interpretation of error bars.
What are error bars for?
Journals that publish science—knowledge gained through repeated observation or experiment—don't just present new conclusions, they also present evidence so readers can verify that the authors' reasoning is correct. Figures with error bars can, if used properly (1–6), give information describing the data (descriptive statistics), or information about what conclusions, or inferences, are justified (inferential statistics). These two basic categories of error bars are depicted in exactly the same way, but are actually fundamentally different. Our aim is to illustrate basic properties of figures with any of the common error bars, as summarized in Table I, and to explain how they should be used.
Table I.
Common error bars
| Error bar | Type | Description | Formula |
|---|---|---|---|
| Range | Descriptive | Amount of spread between the extremes of the data | Highest data point minus the lowest |
| Standard deviation (SD) | Descriptive | Typical or (roughly speaking) average difference between the data points and their mean | SD = √(Σ(X − M)² / (n − 1)) |
| Standard error (SE) | Inferential | A measure of how variable the mean will be, if you repeat the whole study many times | SE = SD/√n |
| Confidence interval (CI), usually 95% CI | Inferential | A range of values you can be 95% confident contains the true mean | M ± t(n−1) × SE, where t(n−1) is a critical value of t. If n is 10 or more, the 95% CI is approximately M ± 2 × SE. |
What do error bars tell you?
Descriptive error bars.
Range and standard deviation (SD) are used for descriptive error bars because they show how the data are spread (Fig. 1). Range error bars encompass the lowest and highest values. SD is calculated by the formula
SD = √( Σ(X − M)² / (n − 1) )
where X refers to the individual data points, M is the mean, and Σ (sigma) means add to find the sum, for all the n data points. SD is, roughly, the average or typical difference between the data points and their mean, M. About two thirds of the data points will lie within the region of mean ± 1 SD, and ∼95% of the data points will be within 2 SD of the mean.
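As a quick check of these definitions, the descriptive quantities in Table I are easy to compute directly. The following sketch uses Python's standard library and a set of invented measurements (the data values are hypothetical, for illustration only):

```python
import math
import statistics

data = [32.1, 37.4, 40.2, 41.8, 45.9, 51.0]  # hypothetical data points

data_range = max(data) - min(data)  # range: highest data point minus the lowest
M = statistics.mean(data)           # sample mean, M
SD = statistics.stdev(data)         # sample standard deviation

# The same SD, written out from the formula in the text:
n = len(data)
SD_by_hand = math.sqrt(sum((x - M) ** 2 for x in data) / (n - 1))

print(round(data_range, 1), round(M, 1), round(SD, 2))
```

Note that `statistics.stdev` uses the n − 1 (sample) denominator, matching the formula above; `statistics.pstdev` would give the population version.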
Descriptive error bars. Means with error bars for three cases: n = 3, n = 10, and n = 30. The small black dots are data points, and the column denotes the data mean M. The bars on the left of each column show range, and the bars on the right show standard deviation (SD). M and SD are the same for every case, but notice how much the range increases with n. Note also that although the range error bars encompass all of the experimental results, they do not necessarily encompass all the results that could possibly occur. SD error bars include about two thirds of the sample, and 2 × SD error bars would encompass roughly 95% of the sample.
Descriptive error bars can also be used to see whether a single result fits within the normal range. For example, if you wished to see if a red blood cell count was normal, you could check whether it was within 2 SD of the mean of the population as a whole. Less than 5% of all red blood cell counts are more than 2 SD from the mean, so if the count in question is more than 2 SD from the mean, you might consider it to be abnormal.
As you increase the size of your sample, or repeat the experiment more times, the mean of your results (M) will tend to get closer and closer to the true mean, or the mean of the whole population, μ. We can use M as our best estimate of the unknown μ. Similarly, as you repeat an experiment more and more times, the SD of your results will tend to more and more closely approximate the true standard deviation (σ) that you would get if the experiment was performed an infinite number of times, or on the whole population. However, the SD of the experimental results will approximate σ whether n is large or small. Like M, SD does not change systematically as n changes, and we can use SD as our best estimate of the unknown σ, whatever the value of n.
Inferential error bars.
In experimental biology it is more common to be interested in comparing samples from two groups, to see if they are different. For example, you might be comparing wild-type mice with mutant mice, or drug with placebo, or experimental results with controls. To make inferences from the data (i.e., to make a judgment whether the groups are significantly different, or whether the differences might just be due to random fluctuation or chance), a different type of error bar can be used. These are standard error (SE) bars and confidence intervals (CIs). The mean of the data, M, with SE or CI error bars, gives an indication of the region where you can expect the mean of the whole possible set of results, or the whole population, μ, to lie (Fig. 2). The interval defines the values that are most plausible for μ.
Confidence intervals. Means and 95% CIs for 20 independent sets of results, each of size n = 10, from a population with mean μ = 40 (marked by the dotted line). In the long run we expect 95% of such CIs to capture μ; here 18 do so (large black dots) and 2 do not (open circles). Successive CIs vary considerably, not only in position relative to μ, but also in length. The variation from CI to CI would be less for larger sets of results, for example n = 30 or more, but variation in position and in CI length would be even greater for smaller samples, for example n = 3.
Because error bars can be descriptive or inferential, and could be any of the bars listed in Table I or even something else, they are meaningless, or misleading, if the figure legend does not state what kind they are. This leads to the first rule. Rule 1: when showing error bars, always describe in the figure legend what they are.
Statistical significance tests and P values
If you carry out a statistical significance test, the result is a P value, where P is the probability that, if there really is no difference, you would get, by chance, a difference as large as the one you observed, or even larger. Other things (e.g., sample size, variation) being equal, a larger difference in results gives a lower P value, which makes you suspect there is a true difference. By convention, if P < 0.05 you say the result is statistically significant, and if P < 0.01 you say the result is highly significant and you can be more confident you have found a true effect. As always with statistical inference, you may be wrong! Perhaps there really is no effect, and you had the bad luck to get one of the 5% (if P < 0.05) or 1% (if P < 0.01) of sets of results that suggests a difference where there is none. Of course, even if results are statistically highly significant, it does not mean they are necessarily biologically important. It is also essential to note that if P > 0.05, and you therefore cannot conclude there is a statistically significant effect, you may not conclude that the effect is zero. There may be a real effect, but it is small, or you may not have repeated your experiment often enough to reveal it. It is a common and serious error to conclude "no effect exists" just because P is greater than 0.05. If you measured the heights of three male and three female Biddelonian basketball players, and did not see a significant difference, you could not conclude that sex has no relationship with height, as a larger sample size might reveal one. A big advantage of inferential error bars is that their length gives a graphic indication of how much uncertainty there is in the data: the true value of the mean μ we are estimating could plausibly be anywhere in the 95% CI.
Wide inferential bars indicate large error; short inferential bars indicate high precision.
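The definition of P—the chance of getting a difference at least as large as the observed one if there really is no difference—can be made concrete with a small simulation. The sketch below uses a permutation test (not the t test biologists usually report, but the same logic) and invented cell counts: it shuffles the group labels many times and counts how often chance alone produces as large a gap between the means:

```python
import random
import statistics

def permutation_p(a, b, n_iter=10_000, seed=1):
    """Two-sided permutation test: the fraction of random relabelings
    whose absolute mean difference is at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) -
                   statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical cell counts: well-separated groups give a small P.
control = [40.1, 42.3, 38.7, 41.0, 39.5, 43.2, 40.8, 41.9, 38.9, 42.5]
treated = [47.9, 50.2, 46.8, 49.1, 48.4, 51.0, 47.2, 49.8, 46.5, 50.6]
print(permutation_p(control, treated))  # small: strong evidence of a true difference
```

Overlapping groups, by contrast, return a large P, which (as the text stresses) does not prove the effect is zero—only that this experiment could not distinguish it from chance.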
Replicates or independent samples—what is n?
Science typically copes with the wide variation that occurs in nature by measuring a number (n) of independently sampled individuals, independently conducted experiments, or independent observations.
Rule 2: the value of n (i.e., the sample size, or the number of independently performed experiments) must be stated in the figure legend.
It is essential that n (the number of independent results) is carefully distinguished from the number of replicates, which refers to repetition of measurement on one individual in a single condition, or multiple measurements of the same or identical samples. Consider trying to determine whether deletion of a gene in mice affects tail length. We could choose one mutant mouse and one wild type, and perform 20 replicate measurements of each of their tails. We could calculate the means, SDs, and SEs of the replicate measurements, but these would not permit us to answer the central question of whether gene deletion affects tail length, because n would equal 1 for each genotype, no matter how often each tail was measured. To address the question successfully we must distinguish the possible effect of gene deletion from natural animal-to-animal variation, and to do this we need to measure the tail lengths of a number of mice, including several mutants and several wild types, with n > 1 for each type.
Similarly, a number of replicate cell cultures can be made by pipetting the same volume of cells from the same stock culture into adjacent wells of a tissue culture plate, and subsequently treating them identically. Although it would be possible to assay the plate and determine the means and errors of the replicate wells, the errors would reflect the accuracy of pipetting, not the reproducibility of the differences between the experimental cells and the control cells. For replicates, n = 1, and it is therefore inappropriate to show error bars or statistics.
If an experiment involves triplicate cultures, and is repeated four independent times, then n = 4, not 3 or 12. The variation within each set of triplicates is related to the fidelity with which the replicates were created, and is irrelevant to the hypothesis being tested.
To identify the appropriate value for n, think of what entire population is being sampled, or what the entire set of experiments would be if all possible ones of that type were performed. Conclusions can be drawn only about that population, so make sure it is appropriate to the question the research is intended to answer.
In the example of replicate cultures from the one stock of cells, the population being sampled is the stock cell culture. For n to be greater than 1, the experiment would have to be performed using separate stock cultures, or separate cell clones of the same type. Again, consider the population you wish to make inferences about—it is unlikely to be just a single stock culture. Whenever you see a figure with very small error bars (such as Fig. 3), you should ask yourself whether the very small variation implied by the error bars is due to analysis of replicates rather than independent samples. If so, the bars are useless for making the inference you are considering.
Inappropriate use of error bars. Enzyme activity for MEFs showing mean + SD from duplicate samples from one of three representative experiments. Values for wild-type vs. −/− MEFs were significant for enzyme activity at the 3-h timepoint (P < 0.0005). This figure and its legend are typical, but illustrate inappropriate and misleading use of statistics because n = 1. The very low variation of the duplicate samples implies consistency of pipetting, but says nothing about whether the differences between the wild-type and −/− MEFs are reproducible. In this case, the means and errors of the three experiments should have been shown.
Sometimes a figure shows only the data for a representative experiment, implying that several other similar experiments were also conducted. If a representative experiment is shown, then n = 1, and no error bars or P values should be shown. Instead, the means and errors of all the independent experiments should be given, where n is the number of experiments performed.
Rule 3: error bars and statistics should only be shown for independently repeated experiments, and never for replicates. If a "representative" experiment is shown, it should not have error bars or P values, because in such an experiment, n = 1 (Fig. 3 shows what not to do).
What type of error bar should be used?
Rule 4: because experimental biologists are usually trying to compare experimental results with controls, it is usually appropriate to show inferential error bars, such as SE or CI, rather than SD. However, if n is very small (for example n = 3), rather than showing error bars and statistics, it is better to simply plot the individual data points.
What is the difference between SE bars and CIs?
Standard error (SE).
Suppose three experiments gave measurements of 28.7, 38.7, and 52.6, which are the data points in the n = 3 case at the left in Fig. 1. The mean of the data is M = 40.0, and the SD = 12.0, which is the length of each arm of the SD bars. M (in this case 40.0) is the best estimate of the true mean μ that we would like to know. But how accurate an estimate is it? This can be shown by inferential error bars such as standard error (SE, sometimes referred to as the standard error of the mean, SEM) or a confidence interval (CI). SE is defined as SE = SD/√n. In Fig. 4, the large dots mark the means of the same three samples as in Fig. 1. For the n = 3 case, SE = 12.0/√3 = 6.93, and this is the length of each arm of the SE bars shown.
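The arithmetic of this example is easy to reproduce; the following sketch recomputes M, SD, and SE for the three measurements given above:

```python
import math
import statistics

data = [28.7, 38.7, 52.6]        # the three measurements from the text

M = statistics.mean(data)        # 40.0, best estimate of the true mean μ
SD = statistics.stdev(data)      # ~12.0, the arm length of the SD bars
SE = SD / math.sqrt(len(data))   # SE = SD/√n = 12.0/√3 ≈ 6.93

print(round(M, 1), round(SD, 1), round(SE, 2))  # 40.0 12.0 6.93
```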
Inferential error bars. Means with SE and 95% CI error bars for three cases, ranging in size from n = 3 to n = 30, with descriptive SD bars shown for comparison. The small black dots are data points, and the large dots indicate the data mean M. For each case the error bars on the left show SD, those in the center show 95% CI, and those on the right show SE. Note that SD does not change, whereas the SE bars and CI both decrease as n gets larger. The ratio of CI to SE is the t statistic for that n, and changes with n. Values of t are shown at the bottom. For each case, we can be 95% confident that the 95% CI includes μ, the true mean. The likelihood that the SE bars capture μ varies depending on n, and is lower for n = 3 (for such low values of n, it is better to simply plot the data points rather than showing error bars, as we have done here for illustrative purposes).
The SE varies inversely with the square root of n, so the more often an experiment is repeated, or the more samples are measured, the smaller the SE becomes (Fig. 4). This allows more and more accurate estimates of the true mean, μ, by the mean of the experimental results, M.
We illustrate and give rules for n = 3 not because we recommend using such a small n, but because researchers currently often use such small n values and it is necessary to be able to interpret their papers. It is highly desirable to use larger n, to achieve narrower inferential error bars and more precise estimates of true population values.
Confidence interval (CI).
Fig. 2 illustrates what happens if, hypothetically, 20 different labs performed the same experiments, with n = 10 in each case. The 95% CI error bars are approximately M ± 2×SE, and they vary in position because of course M varies from lab to lab, and they also vary in width because SE varies. Such error bars capture the true mean μ on ∼95% of occasions—in Fig. 2, the results from 18 out of the 20 labs happen to include μ. The trouble is in real life we don't know μ, and we never know if our error bar interval is in the 95% majority and includes μ, or by bad luck is one of the 5% of cases that just misses μ.
The error bars in Fig. 2 are only approximately M ± 2×SE. They are in fact 95% CIs, which are designed by statisticians so in the long run exactly 95% will capture μ. To achieve this, the interval needs to be M ± t(n−1) × SE, where t(n−1) is a critical value from tables of the t statistic. This critical value varies with n. For n = 10 or more it is ∼2, but for small n it increases, and for n = 3 it is ∼4. Therefore M ± 2×SE intervals are quite good approximations to 95% CIs when n is 10 or more, but not for small n. CIs can be thought of as SE bars that have been adjusted by a factor (t) so they can be interpreted the same way, regardless of n.
This relation means you can easily swap in your mind's eye between SE bars and 95% CIs. If a figure shows SE bars you can mentally double them in width, to get approximate 95% CIs, as long as n is 10 or more. However, if n = 3, you need to multiply the SE bars by 4.
Rule 5: 95% CIs capture μ on 95% of occasions, so you can be 95% confident your interval includes μ. SE bars can be doubled in width to get the approximate 95% CI, provided n is 10 or more. If n = 3, SE bars must be multiplied by 4 to get the approximate 95% CI.
Determining CIs requires slightly more calculating by the authors of a paper, but for people reading it, CIs make things easier to understand, as they mean the same thing regardless of n. For this reason, in medicine, CIs have been recommended for more than 20 years, and are required by many journals (7).
Fig. 4 illustrates the relation between SD, SE, and 95% CI. The data points are shown as dots to emphasize the different values of n (from 3 to 30). The leftmost error bars show SD, the same in each case. The middle error bars show 95% CIs, and the bars on the right show SE bars—both these types of bars vary greatly with n, and are especially wide for small n. The ratio of CI/SE bar width is t(n−1); the values are shown at the bottom of the figure. Note also that, whatever error bars are shown, it can be helpful to the reader to show the individual data points, especially for small n, as in Figs. 1 and 4, and rule 4.
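A minimal sketch of the exact CI formula makes the rule concrete. The two-tailed 95% critical t values are hardcoded from standard tables to keep the example dependency-free; note how the factor is ∼4 for n = 3 but ∼2 for n = 10 or more:

```python
import math
import statistics

# Two-tailed 95% critical t values for df = n - 1, from standard t tables.
T_CRIT = {2: 4.303, 9: 2.262, 29: 2.045}

def ci95(data):
    """Exact 95% CI: M ± t(n−1) × SE."""
    n = len(data)
    M = statistics.mean(data)
    SE = statistics.stdev(data) / math.sqrt(n)
    t = T_CRIT[n - 1]  # ~4 for n = 3, ~2 for n = 10 or more
    return M - t * SE, M + t * SE

# The n = 3 example from the text: the CI is far wider than M ± 2×SE.
low, high = ci95([28.7, 38.7, 52.6])
print(round(low, 1), round(high, 1))
```

For the three measurements above, the interval runs from roughly 10 to 70, illustrating why doubling SE bars in your mind's eye is safe only when n is 10 or more.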
Using inferential intervals to compare groups
When comparing two sets of results, e.g., from n knock-out mice and n wild-type mice, you can compare the SE bars or the 95% CIs on the two means (6). The smaller the overlap of bars, or the larger the gap between bars, the smaller the P value and the stronger the evidence for a true difference. As well as noting whether the figure shows SE bars or 95% CIs, it is vital to note n, because the rules giving approximate P are different for n = 3 and for n ≥ 10.
Fig. 5 illustrates the rules for SE bars. The panels on the right show what is needed when n ≥ 10: a gap equal to SE indicates P ≈ 0.05 and a gap of 2 SE indicates P ≈ 0.01. To assess the gap, use the average SE for the two groups, meaning the average of one arm of the group C bars and one arm of the E bars. However, if n = 3 (the number beloved of joke tellers, Snark hunters (8), and experimental biologists), the P value has to be estimated differently. In this case, P ≈ 0.05 if double the SE bars just touch, meaning a gap of 2 SE.
Estimating statistical significance using the overlap rule for SE bars. Here, SE bars are shown on two separate means, for control results C and experimental results E, when n is 3 (left) or n is 10 or more (right). "Gap" refers to the number of error bar arms that would fit between the bottom of the error bars on the controls and the top of the bars on the experimental results; i.e., a gap of 2 means the distance between the C and E error bars is equal to twice the average of the SEs for the two samples. When n = 3, and double the length of the SE error bars just touch (i.e., the gap is 2 SEs), P is ∼0.05 (we don't recommend using error bars where n = 3 or some other very small value, but we include rules to help the reader interpret such figures, which are common in experimental biology).
Rule 6: when n = 3, and double the SE bars don't overlap, P < 0.05, and if double the SE bars just touch, P is close to 0.05 (Fig. 5, leftmost panel). If n is 10 or more, a gap of SE indicates P ≈ 0.05 and a gap of 2 SE indicates P ≈ 0.01 (Fig. 5, right panels).
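The n ≥ 10 part of rule 6 can be written as a small helper: measure the gap between the tips of the two SE bars in units of the average SE arm, then read off the approximate P. This is only the overlap heuristic from the text, not an exact test, and the means and SEs below are invented:

```python
def se_gap(m_c, se_c, m_e, se_e):
    """Gap between the tips of two SE bars, in units of the average SE arm.
    Negative values mean the bars overlap."""
    avg_se = (se_c + se_e) / 2
    return (abs(m_e - m_c) - (se_c + se_e)) / avg_se

def approx_p(gap):
    """Rule-of-thumb P for n >= 10 (rule 6): gap ~1 -> ~0.05, gap ~2 -> ~0.01."""
    if gap >= 2:
        return "P ≈ 0.01 or smaller"
    if gap >= 1:
        return "P ≈ 0.05"
    return "not significant by this rule"

# Hypothetical means and SEs for control (C) and experimental (E) groups:
print(approx_p(se_gap(40.0, 1.5, 47.0, 1.5)))
```

Remember that for n = 3 the thresholds are different (double the SE bars must not overlap), so a helper like this would need the sample size as well before being used in practice.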
Rule 5 states how SE bars relate to 95% CIs. Combining that relation with rule 6 for SE bars gives the rules for 95% CIs, which are illustrated in Fig. 6. When n ≥ 10 (right panels), overlap of half of one arm indicates P ≈ 0.05, and just touching means P ≈ 0.01. To assess overlap, use the average of one arm of the group C interval and one arm of the E interval. If n = 3 (left panels), P ≈ 0.05 when two arms entirely overlap so each mean is about lined up with the end of the other CI. If the overlap is 0.5, P ≈ 0.01.
Estimating statistical significance using the overlap rule for 95% CI bars. Here, 95% CI bars are shown on two separate means, for control results C and experimental results E, when n is 3 (left) or n is 10 or more (right). "Overlap" refers to the fraction of the average CI error bar arm, i.e., the average of the control (C) and experimental (E) arms. When n ≥ 10, if CI error bars overlap by half the average arm length, P ≈ 0.05. If the tips of the error bars just touch, P ≈ 0.01.
Rule 7: with 95% CIs and n = 3, overlap of one full arm indicates P ≈ 0.05, and overlap of half an arm indicates P ≈ 0.01 (Fig. 6, left panels).
Repeated measurements of the same group
The rules illustrated in Figs. 5 and 6 apply when the means are independent. If two measurements are correlated, as for example with tests at different times on the same group of animals, or kinetic measurements of the same cultures or reactions, the CIs (or SEs) do not give the information needed to assess the significance of the differences between means of the same group at different times, because they are not sensitive to correlations within the group. Consider the example in Fig. 7, in which groups of independent experimental and control cell cultures are each measured at four times. Error bars can only be used to compare the experimental to control groups at any one time point. Whether the error bars are 95% CIs or SE bars, they can only be used to assess between-group differences (e.g., E1 vs. C1, E3 vs. C3), and may not be used to assess within-group differences, such as E1 vs. E2.
Inferences between and within groups. Means and SE bars are shown for an experiment where the number of cells in three independent clonal experimental cell cultures (E) and three independent clonal control cell cultures (C) was measured over time. Error bars can be used to assess differences between groups at the same time point, for example by using an overlap rule to estimate P for E1 vs. C1, or E3 vs. C3; but the error bars shown here cannot be used to assess within-group comparisons, for example the change from E1 to E2.
Assessing a within-group difference, for example E1 vs. E2, requires an analysis that takes account of the within-group correlation, for example a Wilcoxon or paired t analysis. A graphical approach would require finding the E1 vs. E2 difference for each culture (or animal) in the group, then graphing the single mean of those differences, with error bars that are the SE or 95% CI calculated from those differences. If that 95% CI does not include 0, there is a statistically significant difference (P < 0.05) between E1 and E2.
Rule 8: in the case of repeated measurements on the same group (e.g., of animals, individuals, cultures, or reactions), CIs or SE bars are irrelevant to comparisons within the same group (Fig. 7).
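The graphical approach described above is straightforward to compute: form the per-culture differences, then put a 95% CI on their mean. This sketch uses invented counts for three cultures and a critical t value hardcoded from standard tables (4.303 for df = 2):

```python
import math
import statistics

def paired_diff_ci95(at_t1, at_t2, t_crit):
    """95% CI for the mean within-group change, computed from the
    per-subject differences. t_crit is the two-tailed 95% critical t
    for df = n - 1 (4.303 for n = 3, from standard tables)."""
    diffs = [b - a for a, b in zip(at_t1, at_t2)]
    n = len(diffs)
    M = statistics.mean(diffs)
    SE = statistics.stdev(diffs) / math.sqrt(n)
    return M - t_crit * SE, M + t_crit * SE

# Hypothetical cell counts for three cultures at times E1 and E2:
e1 = [100.0, 120.0, 110.0]
e2 = [112.0, 129.0, 123.0]
low, high = paired_diff_ci95(e1, e2, t_crit=4.303)
print(round(low, 2), round(high, 2))  # CI excludes 0 -> significant change
```

The key point is that the CI is computed from the differences, not from the separate E1 and E2 error bars, so the within-group correlation is properly taken into account.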
Conclusion
Error bars can be valuable for understanding results in a journal article and deciding whether the authors' conclusions are justified by the data. However, there are pitfalls. When first seeing a figure with error bars, ask yourself, "What is n? Are they independent experiments, or just replicates?" and, "What kind of error bars are they?" If the figure legend gives you satisfactory answers to these questions, you can interpret the data, but remember that error bars and other statistics can only be a guide: you also need to use your biological understanding to appreciate the meaning of the numbers shown in any figure.
Acknowledgments
This research was supported by the Australian Research Council.
References
1. Belia, S., F. Fidler, J. Williams, and G. Cumming. 2005. Researchers misunderstand confidence intervals and standard error bars. Psychol. Methods. 10:389–396. [PubMed] [Google Scholar]
2. Cumming, G., J. Williams, and F. Fidler. 2004. Replication, and researchers' understanding of confidence intervals and standard error bars. Understanding Statistics. 3:299–311. [Google Scholar]
4. Cumming, G., F. Fidler, M. Leonard, P. Kalinowski, A. Christiansen, A. Kleinig, J. Lo, N. McMenamin, and S. Wilson. 2007. Statistical reform in psychology: Is anything changing? Psychol. Sci. In press. [PubMed]
5. Schenker, N., and J.F. Gentleman. 2001. On judging the significance of differences by examining the overlap between confidence intervals. Am. Stat. 55:182–186. [Google Scholar]
6. Cumming, G., and S. Finch. 2005. Inference past centre: Confidence intervals, and how to read pictures of data. Am. Psychol. 60:170–180. [PubMed] [Google Scholar]
7. International Committee of Medical Journal Editors. 1997. Uniform requirements for manuscripts submitted to biomedical journals. Ann. Intern. Med. 126:36–47. [PubMed] [Google Scholar]
8. Carroll, L. 1876. The Hunting of the Snark: An Agony in Eight Fits. Macmillan, London. 83 pp.
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2064100/