The Green Rocket

Helping the Green Community Take Off!


How to Accurately Interpret Statistics – Part II

By NikkiJade • Jun 19th, 2008 • Category: Articles, Spotlight

As a continuation of Tuesday’s article, “How to Accurately Interpret Statistics, Part I”, here are some more technical points about analyzing sets of data.

What Was the Size of the Sample Taken?

A sample refers to a subset of something, while the set from which the sample is obtained is called a population. So if environmentalists want to study the waste per kilogram created from the production of metal door handles, they would not test every single door handle created but might take a sample of 100 door handles to study from the total population of door handles produced.

The key point about samples is their size. In general, the larger the sample, the more representative it will be of the population. If environmentalists were to sample only 5 door handles, this may not generate enough data to draw a conclusion about the waste created by the entire population of door handles. Whether a sample is too small depends in part on the size of the total population (and on how much the individual units vary). For example, if only 10 door handles were produced (the total population), sampling 5 of them covers half the population and may well be sufficient, whereas if 10 000 door handles were produced, a sample of only 5 would likely be too small.
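To put rough numbers on this, here is a minimal sketch using the standard error of a sample mean, with the finite population correction applied for sampling without replacement. The spread of 2.0 units of waste per handle is a made-up assumption, and the function name is only illustrative.

```python
import math

def standard_error(std_dev, sample_size, population_size=None):
    """Standard error of a sample mean.

    If a population size is given, the finite population correction is
    applied (this assumes sampling without replacement).
    """
    se = std_dev / math.sqrt(sample_size)
    if population_size is not None and population_size > 1:
        se *= math.sqrt((population_size - sample_size) / (population_size - 1))
    return se

ASSUMED_STD_DEV = 2.0  # hypothetical spread of waste per door handle

se_5_of_10000 = standard_error(ASSUMED_STD_DEV, 5, 10_000)
se_100_of_10000 = standard_error(ASSUMED_STD_DEV, 100, 10_000)
se_5_of_10 = standard_error(ASSUMED_STD_DEV, 5, 10)

print(f"5 of 10 000:   {se_5_of_10000:.3f}")
print(f"100 of 10 000: {se_100_of_10000:.3f}")
print(f"5 of 10:       {se_5_of_10:.3f}")
```

Under these assumptions, the uncertainty drops sharply when the sample grows from 5 to 100 handles. Sampling 5 out of only 10 handles also lowers it compared with 5 out of 10 000, though by less than one might expect.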

How Was the Sample Taken?

I wanted to bring up the importance of randomness here. A random sample is taken when each observation (unit of a sample to be observed) is drawn at random from the population. This means that no unit is more likely to be selected than any other unit, and each draw is independent of all other draws.

If a sample is not selected randomly, the results are likely to be biased. For example, consider a poll on the importance of environmental policy in a presidential candidate’s platform. If the pollsters decide to ask everyone who is in a nature park on a given day, they will likely find more value placed on environmental policy than they would if they asked people selected at random throughout the entire region they are polling. This is because individuals who go out of their way to spend time in a nature park may value the environment more as part of their leisure activity, so asking only them may bias the results.
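The park example can be sketched in a few lines. Everything about the population below is invented for illustration: 1 000 residents, 200 of whom are park visitors who rate environmental policy 8 out of 10, while everyone else rates it 5.

```python
import random

# Hypothetical region of 1,000 residents. The first 200 are park visitors,
# assumed to rate environmental policy 8/10; the rest are assumed to rate
# it 5/10. These numbers are made up for illustration only.
population = [
    {"in_park": i < 200, "rating": 8 if i < 200 else 5}
    for i in range(1000)
]

def mean_rating(people):
    """Average rating across a list of respondents."""
    return sum(p["rating"] for p in people) / len(people)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Random sample: every resident is equally likely to be chosen
# (drawn without replacement).
random_sample = rng.sample(population, 100)

# Biased sample: only people found in the park are asked.
park_sample = [p for p in population if p["in_park"]][:100]

population_mean = mean_rating(population)
random_mean = mean_rating(random_sample)
park_mean = mean_rating(park_sample)

print(f"True population mean: {population_mean}")
print(f"Random sample mean:   {random_mean}")
print(f"Park-only mean:       {park_mean}")
```

The park-only sample reports a rating of 8 no matter how many people are asked, while the true average in this made-up population is 5.6. Asking more park visitors does not fix the bias; only changing how the sample is drawn does.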

What Assumptions are Made?

Making assumptions is necessary in most statistical models, especially when studying large-scale populations such as the U.S. economy. An assumption just means that a model testing a specific set of variables holds other factors constant. To return to the education and wage model I mentioned before, one assumption made in this case may be to hold the number of years of experience constant; that is, to compare people as if they all had the same experience, so that any difference in wage can be attributed to education rather than to experience.

When deciding whether an assumption is reasonable, just think intuitively: how strong and logical is the relation between the variables being tested in the first place? If a factor being held constant potentially has a larger impact on the outcome than the variable being tested, the model is probably not an effective way of measuring the relationship. For instance, suppose you are measuring a student’s grade based on how many hours of music they listen to each day, and assuming all other factors are constant (such as how many classes they attend). This assumption is not very realistic, because there isn’t a strong logical relation between how many hours a day a student listens to music and their grade. The number of classes a student attends will likely have a much greater effect on their grade than their exposure to music.

Mean, Median and Mode

Some basic ways of describing data are the mean, median and mode, each of which describes an “average” in a different way. The mean is what most people are familiar with as the average. It is calculated by summing the values in a data set and then dividing by the total number of values in the set.

The median is simply the middle number in a data set that has been sorted in numerical order. For example, if the data set is {1,2,4,4,5} the median would be 4. Sometimes the median is a more effective measure than the mean, as the mean can be easily skewed by an “outlier”, an observation (entry) that is substantially different from the bulk of the data (perhaps because of an error).

The mode is the value that is the most “popular”, that is, the one that appears most often. So in the above example of a data set, the mode would also be 4.
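Here is a short sketch of all three measures using Python's built-in `statistics` module on the data set above. The extra value of 50 is a hypothetical outlier added only to show how the mean and the median react differently.

```python
import statistics

data = [1, 2, 4, 4, 5]

mean_value = statistics.mean(data)      # (1 + 2 + 4 + 4 + 5) / 5 = 3.2
median_value = statistics.median(data)  # middle value of the sorted data: 4
mode_value = statistics.mode(data)      # most frequent value: 4

# Add one hypothetical outlier, e.g. a data-entry error.
with_outlier = data + [50]

mean_with_outlier = statistics.mean(with_outlier)      # 66 / 6 = 11
median_with_outlier = statistics.median(with_outlier)  # (4 + 4) / 2 = 4

print(mean_value, median_value, mode_value)
print(mean_with_outlier, median_with_outlier)
```

A single outlier pulls the mean from 3.2 up to 11, while the median stays at 4. When a data set has an even number of values, the median is the mean of the two middle values.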

Ratios, Per Capita and Nominal vs. Real Numbers

This point goes hand in hand with putting data in context, as described in Part I, where I already brought up ratios in the car accident example. A ratio is a comparison statistic: it expresses one number in terms of another to make it easier for you to judge the significance of that number. One type of ratio is a per capita figure, which is popular in all sorts of public policy, economic and environmental issues. Per capita just means by or for each individual person, so the per capita emissions of a country are the country’s total emissions divided by its population.

Knowing a per capita figure is extremely important, as a number at face value means much less without taking factors like population into consideration. For example, say Canada takes on serious environmental initiatives to reduce its carbon footprint, and in five years its total emissions are measured, compared to 2007’s emissions, and found to be higher. At face value, that makes the initiatives sound ineffective. However, if the population had grown by a third in that time as well, the per capita emissions might actually be lower than in 2007, implying that the reduction measures were indeed effective.
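The arithmetic behind that example looks like this. All of the figures below are invented for illustration and are not actual Canadian statistics; only the "population grows by a third" part comes from the example above.

```python
def per_capita(total, population):
    """Divide a total by the number of people it is spread across."""
    return total / population

# Hypothetical figures, for illustration only.
emissions_2007 = 700_000_000    # tonnes
population_2007 = 33_000_000

emissions_later = 800_000_000   # tonnes, higher than in 2007
population_later = 44_000_000   # one third larger than in 2007

pc_2007 = per_capita(emissions_2007, population_2007)
pc_later = per_capita(emissions_later, population_later)

print(f"2007:        {pc_2007:.2f} tonnes per person")
print(f"5 years on:  {pc_later:.2f} tonnes per person")
```

With these assumed numbers, total emissions rise by about 14%, yet emissions per person fall from roughly 21.2 to 18.2 tonnes.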

Another pair of terms to keep in mind is nominal versus real numbers. Real numbers are ones that are adjusted to account for other factors: for example, real interest rates are adjusted for inflation, while nominal rates are simply the face values. Again, this can make a large difference in the interpretation of a set of data, especially if you are considering long-term gains. For example, an environmental policy that brings $1 000 000 in benefits ten years from now may be worth considerably less in today’s dollars once the inflation over those same ten years is taken into account.
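As a rough illustration of the adjustment, the sketch below deflates a future nominal amount by compounding inflation. The 3% annual inflation rate is an assumed figure, not one taken from the article, and the function name is only illustrative.

```python
def real_value(nominal_amount, annual_inflation, years):
    """Value of a future nominal amount in today's dollars,
    assuming a constant annual inflation rate."""
    return nominal_amount / (1 + annual_inflation) ** years

ASSUMED_INFLATION = 0.03  # hypothetical 3% per year

benefit_today = real_value(1_000_000, ASSUMED_INFLATION, 10)
print(f"${benefit_today:,.0f} in today's dollars")
```

At an assumed 3% a year, $1 000 000 received in ten years has the purchasing power of roughly $744 000 today. A higher inflation rate or a longer horizon shrinks the real value further.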

Margin of Error and Confidence Interval (CI)

Statisticians should build models that allow for error, both in the process of measuring and recording the data and in the fact that a sample is only an estimate of the population it was drawn from. The margin of error expresses this uncertainty: it is the amount by which a sample result could plausibly differ from the true population value. It is reported through a confidence interval, which is constructed so that, if the sampling were repeated many times, a certain percentage of the intervals produced would contain the target population value (usually the mean). The confidence level sets this percentage, and is most commonly 95% (90% and 99% are other common values).

The margin of error is important in helping decide how significant a change in data is, as well as how confident you should be in the values as representatives of the true population values. For example, say you are given the statistic that 80% of high school students do not reuse water bottles, with a confidence interval of 72% to 88% and a 95% confidence level. This means the margin of error is plus or minus eight percentage points. In literal terms, if the survey were repeated many times with new samples of students, about 95% of the intervals constructed this way would contain the true percentage of students who do not reuse their water bottles.
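One common way to arrive at an interval like this is the normal approximation for a proportion. The article does not say how many students were surveyed, so the sample size of 96 below is an assumption, chosen because it reproduces a margin of about eight percentage points. Other interval methods exist and would give slightly different results.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a sample proportion.

    Returns (lower bound, upper bound, margin of error).
    """
    # z is the number of standard errors that covers the chosen confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

ASSUMED_SAMPLE_SIZE = 96  # hypothetical; not stated in the article

low, high, margin = proportion_confidence_interval(0.80, ASSUMED_SAMPLE_SIZE)
print(f"95% CI: {low:.2%} to {high:.2%} (margin of error {margin:.2%})")
```

With 96 students the interval comes out at roughly 72% to 88%. A larger sample would narrow it, and a higher confidence level such as 99% would widen it.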

Now say the high school in question implements an awareness program to educate its students about the amount of waste generated by water bottles (which really is huge!). As a result, the proportion of students who do not reuse their water bottles falls from 80% to 74%. While this may suggest that the program was effective, 74% is within the original margin of error, so the drop may simply reflect sampling variation rather than a real change.
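A more formal way to check a before-and-after change is a two-proportion z-test, which is a technique swapped in here for illustration and is not something the article itself uses. The sketch assumes two independent surveys of 96 students each; both sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two sample proportions,
    assuming independent samples and using a pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 80% before and 74% after, with an assumed 96 students in each survey.
z, p_value = two_proportion_z_test(0.80, 96, 0.74, 96)
print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```

Under these assumptions the p-value is well above the usual 0.05 threshold, so a drop from 80% to 74% would not count as statistically significant. With much larger samples, the same six-point drop could be.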

These points, along with the intuitions presented in Part I, are just a few of the tools that help you properly read and understand information. As Mark Twain famously wrote, attributing the line to Benjamin Disraeli, “There are three kinds of lies: lies, damned lies, and statistics.” In other words, readers often do not realize how easily a set of data can be manipulated with the smallest of efforts. So it is important as an interpreter to analyze fully the information being presented, so that when you form your own judgment, it rests on an accurate and valid interpretation.


NikkiJade is Co-Founder of TheGreenRocket.com, an indoor cycling instructor and Honours Economics and Global Studies student at Wilfrid Laurier University with a focus in econometrics, environmental and development economics, and ecotourism. Nicole is passionate about everything green, as she believes nature’s services can be used more efficiently to generate sustainable development in all areas of the world. Twitter: @NikkiJade