Understanding Key Statistical Terms: Mean, Median, and More

Dec 12, 2024
Lecture 1: Technical Terms and Measures of Location
Formal data analysis procedures require the calculation and interpretation of summary statistics in numerical form. Suppose, then, that we have a data set, x1, ..., xn, where each xi is a number. What sort of features might be of interest? One of the most important features that we can investigate is the data's location along the number line, in particular its center. However, before we can analyze any data, we should nail down some key terminology that will be of use all semester long.
Basic Definitions
To succeed in statistics, we must first master the basics. This means understanding the fundamental terms that we will see over and over again when studying topics such as probability.
Definition. Population: a well-defined collection of objects on which measurements can be taken.
For example, if we are interested in learning how the country might vote in a presidential election, then the population of interest would be the registered voters in the United States. If we were interested in learning about the effects of sugar in gum, then the population of interest might be all dentists.
Most often populations, while they may be technically finite, will be too big for us to measure every object in them, due to constraints on time, money, and other scarce resources.
Definition. Sample: a subset of the population.
While a sample is, by definition, smaller than the corresponding population, we still want to draw conclusions about the population. Techniques exist, as you will learn throughout the ST 371/372 sequence, that allow us to take what we learn from the sample and draw conclusions about the population. This is called making inferences.
We are typically only interested in very specific characteristics of the objects in a population.
Definition. Variable: any characteristic whose value may change from one object to another in the population.
Initially we will denote variables by lowercase letters. As we proceed through the semester this will change, so stay tuned. Some examples of variables, if our population is the registered voters of the United States from the presidential election example:
  x = age
  y = political affiliation
  z = state lived in
The Mean
The most familiar, and often most useful, measure of the center of a data set is the mean. This is the formal term we will use to describe the arithmetic average of a data set. However, it is possible, even likely, that in the course of discussions the words mean and average will be used interchangeably. While different types of averages exist, generally speaking, the word average, when appearing without an adjective, refers to the arithmetic average, or the mean. Since we will almost always be referring to the set of data, the xi's, as a sample, it is also quite common to call this summary statistic the sample mean. This is another term we will use interchangeably for the time being. In later chapters it will become important to distinguish between different means, and when that time comes we will be more thoughtful about our word choices.
If the data are represented as x1, ..., xn, then we will denote the sample mean by x̄.
Definition. The sample mean, x̄, of observations x1, ..., xn is given by
  x̄ = (x1 + ... + xn)/n = (Σ_{i=1}^{n} xi)/n
The numerator of x̄ can be written more informally by dropping the indices on the summation: Σxi when we are summing over all observations. This informal form is what I will most often write when context makes it clear that we are using all of the observations in the sample.
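As a minimal sketch of the definition above, the sample mean can be computed by summing the observations and dividing by n. The function name sample_mean and the toy data are illustrative assumptions, not part of the lecture.

```python
def sample_mean(xs):
    """Return the arithmetic average of the observations in xs."""
    n = len(xs)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(xs) / n  # (x1 + ... + xn) / n


# Toy data chosen only for illustration.
observations = [2, 4, 6, 8]
x_bar = sample_mean(observations)
print(x_bar)  # 5.0
```

Python's standard library also provides statistics.mean, which computes the same quantity.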
Example 1
Exercise: The table below gives the run time (in minutes) of the pilot episode for a sample of 10 popular TV shows.

  Show              Time (min)
  Friends           21
  Cheers            20
  GoT               53
  Flash             40
  Parks & Rec       19
  White Collar      42
  Burn Notice       47
  Psych             38
  DWTS              92
  Criminal Minds    41

What is the mean run time of these shows?
Population Average
We have been discussing the sample mean. This comes from the fact that we have collected a sample from the population. But what if we were able to collect measurements on every single element in the population? Would we still be computing a sample mean? Smart money is on no, because we no longer have a sample. We would be directly computing the population mean! So, since we are computing something technically different, we should not use the same notation that we did for the sample mean. We will denote the population mean with the Greek letter µ. In statistics, it is standard practice to denote the population equivalent of a sample statistic with a Greek letter, as we will see quite often in this course.
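To make the distinction concrete, here is a small sketch that computes both quantities. The population of 1,000 ages is made up, and the seed and sample size of 25 are arbitrary assumptions. In practice we rarely get to see the whole population, which is exactly why the sample mean matters.

```python
import random

# Hypothetical population: made-up ages for 1,000 people.
rng = random.Random(371)  # fixed seed so the run is repeatable
population = [rng.randint(18, 90) for _ in range(1000)]

# Population mean: uses every element of the population.
mu = sum(population) / len(population)

# Sample mean: uses only a random subset of 25 elements.
sample = rng.sample(population, 25)
x_bar = sum(sample) / len(sample)

print(f"population mean mu = {mu:.2f}")
print(f"sample mean x_bar  = {x_bar:.2f}")
```

The two printed values will usually be close but not equal, since the sample mean is computed from only part of the population.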
Mean Problems
What did we notice about the mean run time of the shows in the previous example? The average was roughly double the run time of the 3 sitcoms. Why? Did we forget to include them in our calculation? The answer lies in the time from Dancing with the Stars. Since that episode ran for approximately 1.5 hours, its value pulled the average up. In other words, it is an outlier. Here lies the major deficiency of the sample mean: it is quite susceptible to outliers. We are still interested in measuring the location of the data, specifically the center, so how can we get around the fact that outliers pull our estimate away from the center?
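The pull of the outlier can be checked numerically with the run times from Example 1. This sketch assumes that "the 3 sitcoms" are Friends, Cheers, and Parks & Rec. The helper name mean is also just an illustrative choice.

```python
# Run times (in minutes) from Example 1.
run_times = {
    "Friends": 21, "Cheers": 20, "GoT": 53, "Flash": 40, "Parks & Rec": 19,
    "White Collar": 42, "Burn Notice": 47, "Psych": 38, "DWTS": 92,
    "Criminal Minds": 41,
}


def mean(values):
    """Arithmetic average of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


mean_all = mean(run_times.values())
mean_without_dwts = mean(t for show, t in run_times.items() if show != "DWTS")
# Assumption: these are the three sitcoms the text refers to.
mean_sitcoms = mean(run_times[s] for s in ("Friends", "Cheers", "Parks & Rec"))

print(mean_all)                     # 41.3
print(round(mean_without_dwts, 2))  # 35.67
print(mean_sitcoms)                 # 20.0
```

Dropping the single DWTS value lowers the mean by more than five minutes, which is the sensitivity to outliers described above.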
The Median
Since we are interested in the center, or middle, of our data, we will simply look at the middle. That is what the median is: the exact middle of our ordered sample.
Definition. The sample median, x̃, is obtained by first ordering the n observations from smallest to largest (with repeated values included). Then
  x̃ is the single middle value if n is odd;
  x̃ is the average of the two middle values if n is even.
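The two cases in the definition can be sketched as a short function. The name sample_median and the toy inputs are illustrative assumptions.

```python
def sample_median(xs):
    """Return the middle of the ordered observations in xs."""
    ordered = sorted(xs)  # smallest to largest, repeated values kept
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample median is undefined for an empty sample")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the single middle value
        return ordered[mid]
    # n even: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([7, 1, 3]))     # 3
print(sample_median([7, 1, 3, 5]))  # 4.0
```

Python's standard library offers statistics.median, which follows the same odd/even rule.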
Example 2
Exercise: What is the median run time of the shows in the previous example?
Population Median
Just as the sample mean, x̄, is the sample equivalent of the population mean, µ, there is also a population equivalent of the sample median. We will denote the population median, quite uncreatively, by µ̃. The population mean and median will generally not be identical. If the population is either left or right skewed, even slightly, then typically µ ≠ µ̃. If this is the case, then we must usually decide which of the two measures of center is of greater interest and proceed accordingly.
Median Problems
The reason for even discussing the median is that the sample mean is quite susceptible to outliers in the data. But is this problem fixed by considering the median? Change a few of the numbers at the top and bottom of the run time list. Does the median move? What does this mean? It turns out that the median is the exact opposite of the mean in terms of susceptibility to outliers. The median does not care that there are outliers, or even a great many of them. This, in itself, is another problem: the mean is susceptible to a single outlier, while the median is impervious to many outliers. There are other measures of center, such as the trimmed mean, that were specifically created to combat this issue, but they are not quite as theoretically appealing as the mean.
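The experiment suggested above can be tried directly. In this sketch the altered values (1, 2, 500, 900) are made up for illustration. The trimmed_mean helper shows one simple form of the trimmed mean, which drops the k smallest and k largest observations. Other conventions, such as trimming a fixed percentage from each end, also exist.

```python
from statistics import mean, median

# Run times (in minutes) from Example 1.
run_times = [21, 20, 53, 40, 19, 42, 47, 38, 92, 41]

# Made-up alteration: push the two smallest and two largest values further out.
altered = sorted(run_times)
altered[0], altered[1] = 1, 2
altered[-2], altered[-1] = 500, 900

print(median(run_times), median(altered))  # 40.5 40.5
print(mean(run_times), mean(altered))      # 41.3 163.2


def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(xs)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large: no observations would remain")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(trimmed_mean(run_times, 1))  # 37.75
```

The median is identical before and after the change, while the mean roughly quadruples. The trimmed mean sits between the two behaviors: it ignores the most extreme values but still averages the rest.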
Categorical Data
So far, we have been discussing numerical data, such as weight or age. But data can come in another form: categorical. When this is the case, we generally shift our focus to a summary statistic called the sample proportion, p̂. This is the proportion of the sample that falls into whatever category we happen to be discussing. For example, we can have a categorical variable that describes someone's political affiliation. This variable can have categories such as Republican, Democrat, Independent, etc. If we collect a sample of n observations, then a certain proportion of them will fall into each category. This can become burdensome to track.
Generally speaking, it is easiest to pick one of the categories of the variable and focus attention on it. This allows us to code the responses so that each is assigned a value of 1 if it falls into the category of interest and a 0 if it does not. If we code the data this way, then the sample proportion of observations that fall into the category of interest is the sample mean of the 0's and 1's. Thus, we can use a numerical method to summarize categorical data. We can also generalize this statistic to the population, and we will use p to denote the population proportion of observations that fall into the category of interest.
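The 0/1 coding can be sketched in a few lines. The eight responses below are made up for illustration, and "Democrat" is an arbitrary choice for the category of interest.

```python
# Hypothetical sample of political affiliations (made-up responses).
responses = [
    "Democrat", "Republican", "Independent", "Democrat",
    "Republican", "Democrat", "Independent", "Democrat",
]
category = "Democrat"  # the category of interest

# Code each response as 1 if it falls in the category and 0 otherwise.
coded = [1 if r == category else 0 for r in responses]

# The sample proportion is the sample mean of the 0's and 1's.
p_hat = sum(coded) / len(coded)

print(coded)  # [1, 0, 0, 1, 0, 1, 0, 1]
print(p_hat)  # 0.5
```

Counting the matching responses directly and dividing by n gives the same value, which is why the mean of the coded data can stand in for the proportion.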
