Understanding Key Statistical Terms: Mean, Median, and More

School: North Carolina State University
Course: ST 371
Subject: Statistics
Date: Dec 12, 2024
Pages: 16
Uploaded by AdmiralExploration4857
Lecture 1: Technical Terms and Measures of Location
Prob & Stats: Lecture 1

Outline: Introduction; Key Terms (Population, Sample, Variable); The Mean (Intro, Definition, Example, Population Mean, Issues); The Median (Definition, Example, Population Median, Issues); Categorical Data (Sample Proportion)

Introduction

Formal data analysis procedures require the calculation and interpretation of summary statistics in their numerical form.

Suppose, then, that we have a data set, $x_1, \ldots, x_n$, where each $x_i$ is a number. What sort of features might be of interest?

One of the most important features that we can investigate is the data’s location along the number line, in particular its center.

However, before we can analyze any data, we should nail down some key terminology that will be of use all semester long.

Basic Definitions

To succeed in statistics, we must first master the basics. This means understanding the fundamental terms that we will see over and over again when studying topics such as probability.

Definition. Population: a well-defined collection of objects on which measurements can be taken.

For example, if we are interested in learning about how the country might vote in a presidential election, then the population of interest would be the registered voters in the United States.

If we were interested in learning about the effects of sugar in gum, then the population of interest might be all dentists.

Basic Definitions

Most often populations, while they may be technically finite, will be too big for us to measure every object within them, due to constraints on time, money and other scarce resources.

Definition. Sample: a subset of the population.

While a sample is, by definition, smaller than the corresponding population, we still want to draw conclusions about the population.

Techniques exist, as you will learn throughout the ST 371/372 sequence, that allow us to take what we learn from the sample and draw conclusions about the population. This is called making inferences.

Basic Definitions

We are typically only interested in very specific characteristics of the objects in a population.

Definition. Variable: any characteristic whose value may change from one object to another in the population.

Initially we will denote variables by using lowercase letters. As we proceed through the semester this will change, so stay tuned.

Some examples of variables, if our population is the registered voters of the United States in our presidential election example:

x = age
y = political affiliation
z = state lived in

The Mean

The most familiar, and often most useful, measure of the center of a data set is the mean.

This is the formal term we will use to describe the arithmetic average of a data set. However, it is possible, even likely, that in the course of discussions the words mean and average are interchanged. While different types of averages exist, generally speaking, the word average, when appearing without an adjective, will be referring to the arithmetic average, or the mean.

Since we will almost always be referring to the set of data, the $x_i$’s, as a sample, it is also quite common to call this summary statistic the sample mean. This is another term we will treat as interchangeable for the time being.

In later chapters it will become important to distinguish between different means, and when that time comes we will be more thoughtful about our word choices.

The Mean

If the data is represented as $x_1, \ldots, x_n$, then we will mathematically denote the sample mean as $\bar{x}$.

Definition. The sample mean, $\bar{x}$, of observations $x_1, \ldots, x_n$ is given by

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The numerator of $\bar{x}$ can be written more informally by dropping the indices on the summation, $\sum x_i$, when we are summing over all observations.

This informal representation of the numerator is most often what I will be writing when context should make it clear that we are using all of the observations from the sample.
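As a quick check of the definition, here is a small worked case. The three observations 2, 4 and 9 are made up for illustration and do not come from the lecture.

```latex
\bar{x} = \frac{x_1 + x_2 + x_3}{3} = \frac{2 + 4 + 9}{3} = \frac{15}{3} = 5
```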

Example 1

Exercise. In the table below you will find the run time (in minutes) of the pilot episode in a sample of 10 popular TV shows.

Show              Time (min)
Friends           21
Cheers            20
GoT               53
Flash             40
Parks & Rec       19
White Collar      42
Burn Notice       47
Psych             38
DWTS              92
Criminal Minds    41

What is the mean run time of these shows?
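One way to work this exercise is sketched below in Python. The run times are copied from the table; the names `run_times` and `sample_mean` are chosen here for illustration and are not part of the lecture.

```python
# Pilot run times in minutes, copied from the table in Example 1.
run_times = {
    "Friends": 21, "Cheers": 20, "GoT": 53, "Flash": 40, "Parks & Rec": 19,
    "White Collar": 42, "Burn Notice": 47, "Psych": 38, "DWTS": 92,
    "Criminal Minds": 41,
}


def sample_mean(observations):
    """Return the arithmetic average: the sum of the observations divided by n."""
    observations = list(observations)
    if not observations:
        raise ValueError("the sample mean needs at least one observation")
    return sum(observations) / len(observations)


total = sum(run_times.values())          # numerator: sum of all x_i
n = len(run_times)                       # denominator: number of observations
x_bar = sample_mean(run_times.values())

print(f"sum = {total}, n = {n}, mean = {x_bar:.1f} minutes")
# prints: sum = 413, n = 10, mean = 41.3 minutes
```

The ten times add up to 413 minutes, so the sample mean is 413 / 10 = 41.3 minutes.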

Population Average

We have been discussing the sample mean. This comes from the fact that we have collected a sample from the population. But what if we were able to collect measurements on every single element in the population? Would we still be computing a sample mean?

Smart money is on no, because we no longer have a sample. We would be directly computing the population mean! So, since we are computing something technically different, we should not use the same notation that we did for the sample mean.

We will denote the population mean with the Greek letter $\mu$.

In statistics, it is standard practice to denote the population equivalent of a sample statistic with a Greek letter, as we will see quite often in this course.

Mean Problems

What did we notice about the mean run time of the shows in the previous example?

The average was about double that of the 3 sitcoms. Why? Did we forget to include them in our calculation?

The answer lies in the time from Dancing with the Stars. Since that episode had approximately 1.5 hours’ worth of run time, this value pulled the average up. In other words, it’s an outlier.

Here lies the major deficiency of the sample mean: it is quite susceptible to outliers. We are still interested in measuring the location of the data, specifically the center, so how can we get around the fact that outliers pull our estimate away from the center?
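The pull of the outlier can be checked directly. The sketch below uses Python's standard `statistics.mean` on the run times from Example 1. Treating Friends, Cheers and Parks & Rec as "the 3 sitcoms" is an assumption based on the table, and the variable names are illustrative.

```python
from statistics import mean

# Run times (minutes) from Example 1, in table order.
times = [21, 20, 53, 40, 19, 42, 47, 38, 92, 41]

# Assumption: the three sitcoms are Friends (21), Cheers (20), Parks & Rec (19).
sitcoms = [21, 20, 19]

# Drop the Dancing with the Stars pilot (92 minutes) to see its influence.
without_dwts = [t for t in times if t != 92]

mean_all = mean(times)
mean_sitcoms = mean(sitcoms)
mean_without = mean(without_dwts)

print(f"all 10 shows:  {mean_all:.1f}")
print(f"3 sitcoms:     {mean_sitcoms:.1f}")
print(f"without DWTS:  {mean_without:.1f}")
```

Removing the single 92-minute value lowers the mean from 41.3 to 321 / 9, or about 35.7 minutes, which shows how much one observation can move it.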

The Median

Since we are interested in the center, or middle, of our data, we will simply look at the middle. That’s what the median is: the exact middle of our ordered sample.

Definition. The sample median, $\tilde{x}$, is obtained by first ordering the $n$ observations from smallest to largest (with repeated values included). Then

$\tilde{x}$ is the single middle value if $n$ is odd;
$\tilde{x}$ is the average of the two middle values if $n$ is even.
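The two cases of the definition translate almost line for line into code. The function below is a minimal sketch; the name `sample_median` and the small test lists are made up for illustration.

```python
def sample_median(observations):
    """Return the middle of the ordered sample, following the definition above."""
    ordered = sorted(observations)       # smallest to largest, repeats kept
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample median needs at least one observation")
    mid = n // 2
    if n % 2 == 1:
        # n odd: a single middle value exists
        return ordered[mid]
    # n even: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([7, 1, 3]))        # ordered: 1, 3, 7     -> 3
print(sample_median([7, 1, 3, 4]))     # ordered: 1, 3, 4, 7  -> 3.5
```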

Example 2

Exercise. What is the median run time of the shows in the previous example?
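A possible way to work this exercise, using the run times from Example 1 and Python's standard `statistics.median` (variable names are illustrative):

```python
from statistics import median

# Run times (minutes) from Example 1, in table order.
times = [21, 20, 53, 40, 19, 42, 47, 38, 92, 41]

ordered = sorted(times)
n = len(ordered)

# n = 10 is even, so the median averages the 5th and 6th ordered values.
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])
x_tilde = median(times)

print(ordered)       # [19, 20, 21, 38, 40, 41, 42, 47, 53, 92]
print(middle_pair)   # (40, 41)
print(x_tilde)       # 40.5
```

The median of 40.5 minutes sits close to the bulk of the hour-long shows and is not dragged upward by the 92-minute pilot.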

Population Median

Just as the sample mean, $\bar{x}$, is the sample equivalent of the population mean, $\mu$, there is also a population equivalent of the sample median.

We will denote the population median, quite uncreatively, $\tilde{\mu}$.

The population mean and median will generally not be identical. If the population is either left or right skewed, even slightly, then $\mu \neq \tilde{\mu}$. If this is the case, then we must usually decide which of the two measures of center is of greater interest and proceed accordingly.

Median Problems

The reason for even discussing the median is that the sample mean is quite susceptible to outliers in the data. But is this problem fixed by considering the median?

Change a few of the numbers at the top and bottom of the run time list. Does the median move? What does this mean?

It turns out that the median is the exact opposite of the mean in terms of susceptibility to outliers. The median doesn’t care that there are outliers, or even a great many of them.

This, in itself, is another problem. The mean is susceptible to a single outlier and the median is impervious to many outliers.

There are other measures of center, such as the trimmed mean, that were specifically created to combat this issue, but they are not quite as theoretically appealing as the mean.
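The experiment suggested above can be tried in a few lines. The sketch below makes the smallest and largest run times from Example 1 far more extreme and compares the three measures. The replacement values 1 and 500 are arbitrary. The lecture does not define the trimmed mean, so `trimmed_mean` here assumes one common form: drop the `k` smallest and `k` largest observations, then average the rest.

```python
from statistics import mean, median

# Run times (minutes) from Example 1, in table order.
times = [21, 20, 53, 40, 19, 42, 47, 38, 92, 41]

# Arbitrary perturbation: push the minimum (19) down to 1 and the maximum (92) up to 500.
perturbed = [1 if t == 19 else 500 if t == 92 else t for t in times]


def trimmed_mean(observations, k):
    """Mean after dropping the k smallest and k largest values (assumed definition)."""
    ordered = sorted(observations)
    if k < 0 or 2 * k >= len(ordered):
        raise ValueError("k must leave at least one observation")
    return mean(ordered[k:len(ordered) - k])


for label, data in (("original", times), ("perturbed", perturbed)):
    print(f"{label:>9}: mean = {mean(data):.2f}, "
          f"median = {median(data):.2f}, "
          f"trimmed (k=1) = {trimmed_mean(data, 1):.2f}")
```

With these values the mean jumps from 41.3 to 80.3, while the median stays at 40.5 and the trimmed mean with `k = 1` stays at 37.75, because the changed values are exactly the ones being ignored or trimmed.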

Categorical Data

So far, we have been discussing numerical data, such as weight or age. But data can come in another form: categorical. When this is the case, we generally shift our focus to a summary statistic called the sample proportion, $\hat{p}$. This is the proportion of the sample that falls into whatever category we happen to be discussing.

For example, we can have a categorical variable that describes someone’s political affiliation. This variable can have categories such as Republican, Democrat, Independent, etc.

If we collect a sample of $n$ observations, then we will have a certain proportion of them falling into each category. This can become burdensome to track.

Categorical Data

Generally speaking, it is easiest to pick one of the categories from the variable and focus attention on it. This will allow us to code the responses in such a way that they are assigned a value of 1 if the response falls into the category of interest and a 0 if it doesn’t.

If we code the data this way, then the sample proportion of observations that fall into the category of interest is the sample mean of the 0’s and 1’s.

Thus, we can use a numerical method to summarize categorical data.

We can also generalize this statistic to the population, and we will use $p$ to denote the population proportion of observations that fall into the category of interest.
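The 0/1 coding can be illustrated with a short Python sketch. The eight responses below are hypothetical, invented purely to show the mechanics, and the choice of "Democrat" as the category of interest is arbitrary.

```python
# Hypothetical sample of political affiliations (made-up data for illustration).
affiliations = [
    "Democrat", "Republican", "Independent", "Democrat",
    "Republican", "Democrat", "Independent", "Democrat",
]

category = "Democrat"   # arbitrary category of interest

# Code each response as 1 if it falls in the category of interest, else 0.
coded = [1 if a == category else 0 for a in affiliations]

# The sample proportion is just the sample mean of the 0's and 1's.
p_hat = sum(coded) / len(coded)

print(coded)    # [1, 0, 0, 1, 0, 1, 0, 1]
print(p_hat)    # 0.5
```

Counting directly gives the same answer: 4 of the 8 responses fall in the chosen category, so the proportion is 4 / 8 = 0.5.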