Lesson 1 Collecting and Summarizing Data


School

New York University

Course

STAT-GB 3214

Subject

Statistics

Date

Dec 20, 2024


Lesson 1: Collecting and Summarizing Data

Overview

The lesson starts with a discussion about how to collect data and the different methods to use. Once the data is collected, we need to summarize the data. How to summarize the data depends on which variable type we have. All of these concepts will be presented here.

Objectives

Upon successful completion of this lesson, you should be able to:

- Describe the benefits and limitations of non-probability and probability sampling methods.
- Distinguish between experimental and observational studies. Identify explanatory and response variables in a research study.
- Based on the type of study, determine when a causal conclusion (as opposed to associations) can be made.
- Articulate how the principles of experimental design (control, randomization, replication) would apply to a given study.
- Given a research study, identify possible lurking and confounding variables.
- Classify variables as categorical and quantitative. Create (using technology) graphical displays of categorical variables using pie charts and bar charts.
- Create (using technology) graphical displays of quantitative variables using dotplots, histograms and boxplots.
- Select and interpret the appropriate visual representations for one categorical variable, and one quantitative variable.
- Given a data set, compute and interpret measures of center, position (percentiles), and spread.
- Construct and interpret a box plot.

1.1 - Collecting Data

Collecting data is an important first step in statistical analysis. The goal of statistics is to make inferences about a population based on a sample. How we collect the data is important. If the sample is not representative of the whole population, we cannot make inferences about the population from that sample.

The following are a few frequently used methods for collecting data:

Personal Interview
People usually respond when asked by a person, but their answers may be influenced by the interviewer.

Telephone Interview
Cost-effective, but it needs to be kept short since respondents tend to be impatient.

Self-Administered Questionnaires
Cost-effective, but the response rate is lower and the respondents may be a biased sample.

Direct Observation
For certain quantities of interest, one may be able to measure them directly from the sample.

Web-Based Survey
Can only target the population that uses the web.

12/20/24, 11:09 PM  Lesson 1: Collecting and Summarizing Data  https://online.stat.psu.edu/stat500/book/export/html/464

1.1.1 - Types of Bias

Whenever data is collected, there is a risk that the sample is biased. Here are some potential types of bias.

Types of Bias

Non-Response Bias
When a large percentage of those sampled do not respond or participate.

Response Bias
When study participants either do not respond truthfully or give answers they feel the researcher wants to hear. For example, when students are asked if they have ever cheated on an exam, even those who have may respond with "no."

Selection Bias
This bias occurs when the sample selected does not reflect the population of interest. For instance, you are interested in the attitude of female students regarding campus safety, but when sampling you also include males. In this case, your population of interest was female students; however, your sample included subjects not in that population (i.e. males).

Looking Ahead
Students interested in pursuing topics related to the design of experiments might explore STAT 503: Design of Experiments. STAT 503 includes extensive coverage of the implementation and analysis of a wide range of experimental designs. [1] [2]

1.1.2 - Strategies for Collecting Data

How can we get data? How do we select observations or measurements for a study?

There are two types of methods for collecting data: non-probability methods and probability methods.

Non-probability Methods

These might include:

- Convenience sampling (haphazard): Collecting data from subjects who are conveniently obtained.
  Example: surveying students as they pass by in the university's student union building.
- Gathering volunteers: Collecting data from subjects who volunteer to provide data.
  Example: using an advertisement in a magazine or on a website inviting people to complete a form or participate in a study.

Probability Methods

- Simple random sample: making selections from a population where each subject in the population has an equal chance of being selected.
- Stratified random sample: having first identified the population of interest, you divide this population into strata or groups based on some characteristic (e.g. sex, geographic region), then perform a simple random sample from each stratum.
- Cluster sample: a random cluster of subjects is taken from the population of interest. For instance, if we were to estimate the average salary for faculty members at Penn State - University Park Campus, we could take a simple random sample of departments and find the salary of each faculty member within the sampled departments. This would be our cluster sample.

There are advantages and disadvantages to both types of methods. Non-probability methods are often easier and cheaper to carry out. When non-probability methods are used, it is often the case that the sample is not representative of the population. If it is not representative, you can make generalizations only about the sample, not the population. The primary benefit of using probability sampling methods is the ability to make inferences. We can assume that by using random sampling we attain a representative sample of the population. The results can be "extended" or "generalized" to the population from which the sample came.

Example 1-1: Survey Methods

Airline Company Survey of Passengers

Let's say that you are the owner of a large airline company and you live in Los Angeles. You want to survey your L.A. passengers on what they like and dislike about traveling on your airline. For each of the methods, determine if a non-probability method or a probability method is used. Then determine the type of sampling.

a. Since you live in L.A., you go to the airport and just interview passengers as they approach your ticket counter.
Answer: Non-probability method; convenience sampling.

b. You have your ticket counter personnel distribute a questionnaire to each passenger, requesting that they complete the survey and return it at the end of the flight.
Answer: Non-probability method; volunteer sampling.

c. You randomly select a set of passengers flying on your airline and question those that you have selected.
Answer: Probability method; simple random sampling.

d. You group your passengers by the class they fly (first, business, economy), and then take a random sample from each of these groups.
Answer: Probability method; stratified sampling.

e. You group your passengers by the class they fly (first, business, economy), randomly select such classes from various flights, and survey each passenger in the class and flight selected.
Answer: Probability method; cluster sampling.

Think About it!
In predicting the 2008 Iowa Caucus results, a phone survey said that Hillary Clinton would win, but instead, Obama won. Where did they go wrong?

The survey was based on landline phones, which skewed the sample toward older people, who tended to support Hillary Clinton. However, lots of younger people got involved in this election and voted for Obama. The younger people could only be reached by cell phone.

Looking Ahead
Students interested in pursuing topics related to sampling might explore STAT 506: Sampling Theory. STAT 506 covers sampling design and analysis methods that are useful for research and management in many fields. A well-designed sampling procedure ensures that we can summarize and analyze data with a minimum of assumptions and complications. [3] [4]

1.1.3 - Types of Studies

Now that we know how to collect data, the next step is to determine the type of study. The type of study will determine what type of relationship we can conclude.

There are predominantly two different types of studies:

Observational
A study where a researcher records or observes the observations or measurements without manipulating any variables. These studies show that there may be a relationship, but not necessarily a cause and effect relationship.

Experimental
A study that involves some random assignment* of a treatment; researchers can draw cause and effect (or causal) conclusions. An experimental study may also be called a scientific study or an experiment.

*Note! Random selection (a probability method of sampling) is not random assignment (as in an experiment). In an ideal world you would have a completely randomized experiment: one that incorporates both random sampling and random assignment.

Example 1-2: Types of Studies

Quiz and Exam Score Studies

Let's say that there is an option to take quizzes throughout this class. In an observational study, we may find that better students tend to take the quizzes and do better on exams. Consequently, we might conclude that there may be a relationship between quizzes and exam scores.

In an experimental study, we would randomly assign quizzes to specific students to look for improvements. In other words, we would look to see whether taking quizzes causes higher exam scores.

Causation

It is very important to distinguish between observational and experimental studies, since one has to be very skeptical about drawing cause and effect conclusions from observational studies. The use of random assignment of treatments (i.e. what distinguishes an experimental study from an observational study) allows one to draw cause and effect conclusions.

Ethics is an important aspect of experimental design to keep in mind. For example, the original relationship between smoking and lung cancer was based on an observational study and not an assignment of smoking behavior, because researchers could not ethically assign subjects to smoke.

Try It!

We want to decide whether Advil or Tylenol is more effective in reducing fever.

Method 1
Ask the subjects which one they use and ask them to rate the effectiveness. Is this an observational study or an experimental study?

Answer
This is an observational study, since we just observe the data and have no control over which subject uses which treatment.

Method 2
Randomly assign half of the subjects to take Tylenol and the other half to take Advil. Ask the subjects to rate the effectiveness. Is this an observational study or an experimental study?

Answer
This is an experimental study, since we decide which subject uses which treatment. Thus the self-selection bias will be eliminated.

1.1.4 - Variables

There may be many variables in a study. The variables may play different roles in the study. Variables can be classified as either explanatory or response variables.

Variable
A variable is any characteristic, number, or quantity that can be measured, counted, or observed for record.

Response Variable
The variable about which the researcher is posing the question. May also be called the outcome or the dependent variable.

Explanatory Variable
Variables that serve to explain changes in the response. They may also be called the predictor or independent variables.

Note! A variable can serve as an explanatory variable in one study but as the response in another.

Example 1-3: Response and Explanatory Variables

Consider the variables Sex (Female, Male) and Height (in inches). Which variable do you believe explains the other? In other words, would it make more sense to say a person's sex more likely explains that person's height, or to say a person's height explains that person's sex?

Answer
In this case, Sex would explain Height, making Sex the explanatory variable and Height the response.

Consider the variables Height and Weight. Which is the response? Which is the explanatory?

Answer
In this case, a person's height would more likely explain their weight than the other way around, making Height the explanatory variable and Weight the response.

Other Variables

Other types of variables include:

Lurking variable
A variable that is neither the explanatory variable nor the response variable but has a relationship (e.g. may be correlated) with the response and the explanatory variable. It is not considered in the study but could influence the relationship between the variables in the study.

Confounding variable
A variable that is in the study and is related to the other study variables, thus having an effect on the relationship between these variables.

A lurking variable, if included in the study, could have a confounding effect and then be classified as a confounding variable.

Example 1-4: Lurking and Confounding Variables

Suppose you teach a class where students must submit weekly homework and then take a weekly quiz. You want to see if there is a relationship between the scores on the two assignments (i.e. higher homework scores are aligned with higher quiz scores). As you look at the data, you begin to consider whether the submission date of the homework has an effect on the quiz grades; that is, do students who submit the homework several days before taking the quiz perform better overall on the quiz than students who do not leave much of a time gap between completing the assignments (e.g. they do both on the same day)? The rationale is that students who allow time between the homework and quiz to study may perform better compared to the other group.

Answer
In this example, "days between submission of homework and quiz" would be a lurking variable, as it was not included in the study. Suppose you then obtained that information and re-examined the relationship between the two assignments, taking the time gap into consideration. If the relationship between the two assignments changed (i.e. the analysis with the time gap differed somewhat from the analysis without it), then "days between submission" would be considered a confounding variable.

In an experiment where treatments are randomly assigned, one assumes these variables get evenly shared across the groups, with the intention that any influence they may have on the outcome is negated or reduced.

1.1.5 - Principles of Experimental Design

The following principles of experimental design have to be followed to enable a researcher to conclude that differences in the results of an experiment, not reasonably attributable to chance, are likely caused by the treatments.

Control
Need to control for effects due to factors other than the ones of primary interest.

Randomization
Subjects should be randomly divided into groups to avoid unintentional selection bias in the groups.

Replication
A sufficient number of subjects should be used to ensure that randomization creates groups that resemble each other closely and to increase the chances of detecting differences among the treatments when such differences actually exist.

The benefits of randomization are:

1. If random assignment of treatments is done, then significant results support causal (cause and effect) conclusions. That is, the treatment caused the result. The treatment can be referred to as the explanatory variable and the result as the response variable.
2. If random selection is done, where the subjects are randomly selected from some population, then the results can be extended to that population.

Random assignment is required for an experiment. When both random assignment and random selection are part of the study, we have a completely randomized experiment. Without random assignment (i.e. in an observational study), the treatment can only be referred to as being related to the outcome.

1.2 - Classifying Data

Distinguishing between the different types of variables is a basic and integral part of applied statistics. The methods used to analyze these data are very different, and therefore it is important to make the distinction. The two types of variables are Qualitative and Quantitative.

Qualitative (Categorical)
Data that serves the function of a name only. Categorical values may be:

- Binary – where there are two choices, e.g. Male and Female
- Ordinal – where the names imply levels with hierarchy or order of preference, e.g. level of education
- Nominal – where no hierarchy is implied, e.g. political party affiliation

For example, for coding purposes, you may assign Male as 0 and Female as 1. The numbers 0 and 1 stand only for the two categories, and there is no order between them.

Quantitative
Data that take on numerical values with a meaningful measure of distance between them. Quantitative values can be:

- Discrete - or "counted," as in the number of people in attendance
- Continuous - or "measured," as in the weight or height of a person

Additional examples of both include:

- Number of females in this class (Quantitative, Discrete)
- Nationality (Categorical, Nominal)
- Amount of milk in a 1-gallon container (Quantitative, Continuous)
- Sex of students (even if coded as M = 0, F = 1) (Categorical, Binary)

1.3 - Summarizing One Qualitative Variable

Once we determine that a variable is Qualitative (or Categorical), we need tools to summarize the data. We can summarize the data by using frequencies and by graphing the data.

Let's start with an example. In a class of 30 students, a survey question asked the students to indicate their eye color. The responses are listed below.

Hazel, Brown, Brown, Brown, Blue, Brown, Brown, Brown, Brown, Brown,
Brown, Green, Brown, Brown, Brown, Brown, Brown, Brown, Blue, Brown,
Brown, Brown, Hazel, Blue, Brown, Brown, Brown, Brown, Brown, Brown

From this list, we can clearly see that the eye color brown is the most common. Which is more frequent, Hazel or Green? It may only take a few seconds to answer the question, but what if there were 100 students? Or 1000? The best way to summarize categorical data is to use frequencies and percentages (or proportions).

Proportion
A proportion is a fraction or part of the total that possesses a certain characteristic.

The frequencies and percentages for the eye color data are shown in the table.

Eye Color    Frequency    Percentage
Brown        24           80%
Blue         3            10%
Hazel        2            6.6667%
Green        1            3.3333%
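The tally behind a frequency table can be sketched in a few lines of Python. This is only an illustration, not part of the course software: the function name `frequency_table` is made up for this example, and the list reproduces the 30 eye colors given above.

```python
from collections import Counter

# Eye colors of the 30 students, in the order listed in the lesson.
eye_colors = (
    ["Hazel"] + ["Brown"] * 3 + ["Blue"] + ["Brown"] * 6 + ["Green"]
    + ["Brown"] * 6 + ["Blue"] + ["Brown"] * 3 + ["Hazel", "Blue"]
    + ["Brown"] * 6
)


def frequency_table(values):
    """Return (category, count, percentage) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, 100 * n / total) for cat, n in counts.most_common()]


for category, count, percent in frequency_table(eye_colors):
    print(f"{category:<8}{count:>4}{percent:>10.4f}%")
```

Each percentage is simply the category count divided by the total number of observations, times 100, which is the proportion expressed as a percent.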

The table is much easier to read than the raw data. It is clear that more students have Hazel than Green eyes in the class.

As the saying goes, "A picture is worth 1000 words," so it is helpful to visualize the data in a graph.

1.4 - Graphing One Qualitative Variable

How can one graph qualitative variables? Two common choices are the pie chart and the bar chart.

Pie Charts

Pie Chart
Each sector of the circle represents the percentage of that category.

Example: A pie chart of the eye color

[Figure: A pie chart for eye color. Brown - 80%, Blue - 10%, Hazel - 6.7%, Green - 3.3%]

Notes on pie charts

- Pie charts may not be suitable for too many categories. Thus, if there are too many categories, you can either combine some categories or use a bar chart to represent the data. What is meant by "too many"? There is no clear cutoff; it is more of a judgment based on the appearance.
- Readers may find the pie chart more useful if the percentages are arranged in descending or ascending order.

Bar Charts

Bar Chart
The height of the bar for each category is equal to the frequency (number of observations) in the category. Leave space in between the bars to emphasize that there is no ordering in the classes.

Example: A bar chart of the eye color

[Figure: Bar chart showing the counts of each eye color (Brown 24, Blue 3, Hazel 2, and Green 1).]

Notes on Bar Charts

- Please note that even though histograms (shown later in this lesson) also have bars sticking up, they are used to describe the frequency of quantitative variables; the term bar chart is reserved for graphs that show the frequency of categorical variables.
- For this class, we do not expect you to create the graphs by hand. You should, however, make sure you understand how they are created.

1.4.1 - Minitab: Graphing One Qualitative Variable

Minitab® – Using Minitab to Construct Pie and Bar Charts

Steps for Creating a Pie Chart

1. In Minitab choose Graph > Pie Chart.
2. Choose one of the following, depending on the format of your data:
   - In Category names, enter the column of categorical data that defines the groups.
   - In Summary values, enter the column of summary data that you want to graph.
3. Choose OK.

Steps for Creating a Bar Chart

1. In Minitab choose Graph > Bar Chart.
2. Choose one of the following, depending on the format of your data:
   - Counts of unique values (this is the best option). Choose Simple for the graph type.
   - A function of a variable
   - Values from a table
3. Choose OK.
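To see how a bar chart is built without any graphing software, the following text-only sketch draws one `#` per observation, so the bar length equals the category frequency. It is an illustration in plain Python, not a substitute for the Minitab steps above; the helper name `text_bar_chart` is hypothetical, and the counts come from the eye color table.

```python
def text_bar_chart(counts):
    """Build one line per category; the bar length equals the frequency."""
    width = max(len(name) for name in counts)
    return [f"{name:<{width}} | {'#' * n} {n}" for name, n in counts.items()]


# Frequencies from the eye color table.
eye_color_counts = {"Brown": 24, "Blue": 3, "Hazel": 2, "Green": 1}

for line in text_bar_chart(eye_color_counts):
    print(line)
```

Each category gets its own separate line, mirroring the gaps left between bars for a categorical variable.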

Try It!

One survey of 500 Penn State University students about their favorite sport to watch shows that 283 said Football, 126 said Basketball, 45 said Hockey, and 46 said Others. Practice with Minitab to create a pie chart and a bar chart for favorite sport.

1.5 - Summarizing One Quantitative Variable

We will first talk about descriptive measures of quantitative data. Measures of the most important characteristic of a data set, central tendency, will be given first. After that, a few descriptive measures of the other important characteristic of a data set, variability, will be discussed. This lesson will be concluded by a discussion of plots, which are simple graphs that show the central location, variability, symmetry, and outliers very clearly.

1.5.1 - Measures of Central Tendency

Mean, Median, and Mode

A measure of central tendency is an important aspect of quantitative data. It is an estimate of a "typical" value.

Three of the many ways to measure central tendency are the mean, median and mode. There are other measures, such as a trimmed mean, that we do not discuss here.

Mean
The mean is the average of the data.

Sample Mean
Let $x_1, x_2, \ldots, x_n$ be our sample. The sample mean is usually denoted by $\bar{x}$:

$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$

where $n$ is the sample size and $x_i$ are the measurements. One may need to use the sample mean to estimate the population mean, since usually only a random sample is drawn and we don't know the population mean.

The sample mean is a statistic and a population mean is a parameter. Review the definitions of statistic and parameter in Lesson 0.2. [5]

Note on Notation

What if we say we used $y_i$ for our measurements instead of $x_i$? Is this a problem? No. The formula would simply look like this:

$\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$

The formulas are exactly the same. The letters that you select to denote the measurements are up to you. For instance, many textbooks use $y$ instead of $x$ to denote the measurements. The point is to understand how the calculation that is expressed in the formula works. In this case, the formula is calculating the mean by summing all of the observations and dividing by the number of observations. There is some notation that you will come to see as standard, e.g. $n$ will always equal the sample size. We will make a point of letting you know what these are. However, when it comes to the variables, these labels can (and do) vary.

Median
The median is the middle value of the ordered data.

The most important step in finding the median is to first order the data from smallest to largest.

Steps to finding the median for a set of data:

1. Arrange the data in increasing order, i.e. smallest to largest.
2. Find the location of the median in the ordered data by $\dfrac{n+1}{2}$, where $n$ is the sample size.
3. The value that represents the location found in Step 2 is the median.

Note on Odd or Even Sample Sizes

If the sample size is an odd number, then the location point will produce a median that is an observed value. If the sample size is an even number, then the location will require one to take the mean of two numbers to calculate the median. The result may or may not be an observed value, as the example below illustrates.

Mode
The mode is the value that occurs most often in the data. It is important to note that there may be more than one mode in the dataset.

Example 1-5: Test Scores

Consider the aptitude test scores of ten children below:

95, 78, 69, 91, 82, 76, 76, 86, 88, 80

Find the mean, median, and mode.

Answer

Mean
$\bar{x} = \dfrac{95 + 78 + 69 + 91 + 82 + 76 + 76 + 86 + 88 + 80}{10} = \dfrac{821}{10} = 82.1$

Median
First, order the data.

69, 76, 76, 78, 80, 82, 86, 88, 91, 95

With n = 10, the median position is found by (10 + 1) / 2 = 5.5. Thus, the median is the average of the fifth (80) and sixth (82) ordered values, and the median = 81.

Mode
The most frequent value in this data set is 76. Therefore the mode is 76.

Note! Mean, median and mode are usually not equal.

Effects of Outliers

One shortcoming of the mean is that means are easily affected by extreme values. Measures that are not that affected by extreme values are called resistant. Measures that are affected by extreme values are called sensitive.

Example 1-6: Test Scores Cont'd...

Using the data from Example 1-5, how would the mean and median change if the entry 91 is mistakenly recorded as 9?

Answer
The data set would be

9, 69, 76, 76, 78, 80, 82, 86, 88, 95

Mean
The mean would be $\bar{x} = \dfrac{739}{10} = 73.9$, which is very different from 82.1.

Median
Let us see the effect of the mistake on the median value. The data set (with 91 coded as 9) in increasing order is:

9, 69, 76, 76, 78, 80, 82, 86, 88, 95

where the median = 79.

The medians of the two sets (81 and 79) are not that different. Therefore the median is not that affected by the extreme value 9.

The mean is a sensitive measure (or sensitive statistic) and the median is a resistant measure (or resistant statistic).

After reading this lesson you should know that there are quite a few options when one wants to describe central tendency. In future lessons, we talk mainly about the mean. However, we need to be aware of one of its shortcomings, which is that it is easily affected by extreme values.

Unless data points are known mistakes, one should not remove them from the data set! One should keep the extreme points and use more resistant measures. For example, use the sample median to estimate the population median. We will discuss methods using the median in Lesson 11.

Adding and Multiplying Constants

What happens to the mean and median if we add a constant to each observation in a data set, or multiply each observation by a constant? Consider, for example, an instructor who curves an exam by adding five points to each student's score. What effect does this have on the mean and the median? Adding a constant to each value has the intended effect of shifting the mean and median by that constant.

For example, returning to the 10 aptitude scores above, if 5 was added to each score, the mean of this new data set would be 87.1 (the original mean of 82.1 plus 5) and the new median would be 86 (the original median of 81 plus 5).

Similarly, if each observed data value was multiplied by a constant, the new mean and median would change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original scores were doubled, then the new mean and new median would be double the original mean and median. As we will learn shortly, the effect is not the same on the variance!

Looking Ahead!
Why would you want to know this? One reason, especially for those moving onward to more applied statistics (e.g. Regression, ANOVA), is transforming data. For many applied statistical methods, a required assumption is that the data is normal, or very near bell-shaped. When the data is not normal, statisticians will transform the data using numerous techniques, e.g. a logarithmic transformation. We just need to remember that the original data was transformed!
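The calculations in Examples 1-5 and 1-6, and the effect of adding or multiplying constants, can be checked with Python's standard `statistics` module. This is a minimal sketch for verification only; the variable names and the helper `center` are chosen for this illustration, and the course itself uses Minitab.

```python
from statistics import mean, median, multimode

# Example 1-5: aptitude test scores of ten children.
scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]


def center(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(values), median(values), multimode(values)


print("original:", center(scores))

# Example 1-6: the entry 91 is mistakenly recorded as 9.
miscoded = [9 if s == 91 else s for s in scores]
print("miscoded:", center(miscoded))

# Adding a constant shifts the mean and median by that constant.
curved = [s + 5 for s in scores]
print("plus 5:  ", center(curved))

# Multiplying by a constant scales the mean and median by that constant.
doubled = [2 * s for s in scores]
print("doubled: ", center(doubled))
```

Comparing the first two lines of output shows the mean moving far more than the median when a single value is miscoded, which is what is meant by the mean being sensitive and the median resistant.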

Shape

The shape of the data helps us to determine the most appropriate measure of central tendency. The three most important descriptions of shape are Symmetric, Left-skewed, and Right-skewed. Skewness is a measure of the degree of asymmetry of the distribution.

Symmetric
- mean, median, and mode are all the same here
- no skewness is apparent
- the distribution is described as symmetric

[Figure: A symmetrical distribution. Mean = Median = Mode.]

Left-Skewed or Skewed Left
- mean < median
- long tail on the left

[Figure: A left-skewed distribution, with the median, mean, and mode marked.]

Right-Skewed or Skewed Right
- mean > median
- long tail on the right

[Figure: A right-skewed distribution, with the median, mean, and mode marked.]

Note! When one has very skewed data, it is better to use the median as the measure of central tendency, since the median is not much affected by extreme values.

Application: The Skewed Nature of Salary Data

Salary distributions are almost always right-skewed, with a few people who make the most money. To illustrate this, consider your favorite sports team or even the company for which you work. There will be one or two players or personnel who earn the "big bucks," followed by others who earn less. This will produce a shape that is skewed to the right. Knowing this can be a useful aid in negotiating a higher salary.

When one interviews for a position and the discussion gets around to compensation, it is common that the interviewer states an offer that is "typical for someone in your position." That is, they are offering you the average salary for someone with your particular skill set (e.g. little experience). But is this average the mode, median, or mean? The company (for whom business is business!) will want to pay you the least they can, while you prefer to earn the most you can. Since salaries tend to be skewed to the right, the offer will most likely reflect the mode or median. You simply need to ask which "average" the offer refers to and what the mean salary is, since in right-skewed data the mean would be the highest of the three values. Once you have these averages, you can begin to negotiate toward the highest number.
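A small numerical illustration of this point follows. The salary figures are entirely hypothetical (made up for this sketch, in thousands of dollars), chosen only so that one large value creates a long right tail; the helper name `three_averages` is likewise an assumption of the example.

```python
from statistics import mean, median, mode

# Hypothetical annual salaries in thousands of dollars (illustrative only).
salaries = [40, 40, 40, 45, 50, 55, 60, 250]


def three_averages(values):
    """Return the mode, median, and mean of the values, in that order."""
    return mode(values), median(values), mean(values)


salary_mode, salary_median, salary_mean = three_averages(salaries)
print(f"mode = {salary_mode}, median = {salary_median}, mean = {salary_mean}")
```

With these made-up numbers the single large salary pulls the mean well above the median and the mode, matching the ordering described for right-skewed data.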

1.5.2 - Measures of Position

While measures of central tendency are important, they do not tell the whole story. For example, suppose the mean score on a statistics exam is 80%. From this information, can we determine a range in which most people scored? The answer is no. There are two other types of measures, measures of position and variability, that help paint a more complete picture of what is going on in the data. In this section, we will consider the measures of position, and we will discuss measures of variability in the next one.

Measures of position give a range where a certain percentage of the data fall. The measures we consider here are percentiles and quartiles.

Percentiles
The pth percentile of the data set is a measurement such that, after the data are ordered from smallest to largest, at most p% of the data are below this value and at most (100 - p)% are above it.

A common application of percentiles is their use in determining passing or failure cutoffs for standardized exams such as the GRE. If you have a 95th percentile score, then you are at or above 95% of all test takers.

The median is the value where fifty percent of the data values fall at or below it. Therefore, the median is the 50th percentile.

We can find any percentile we wish. Two other important percentiles are the 25th percentile, typically denoted Q1, and the 75th percentile, typically denoted Q3. Q1 is commonly called the lower quartile and Q3 is commonly called the upper quartile.

Finding Quartiles
The method we will demonstrate for calculating Q1 and Q3 may differ from the method described in our textbook. The results shown here will always be the same as Minitab's results. The method here is also different from the method presented in many undergraduate statistics courses. This method is what we require students to use.
There are two steps to follow:
1. Find the location of the desired quartile. If there are $n$ observations, arranged in increasing order, then the first quartile is at position $\dfrac{n+1}{4}$, the second quartile (i.e. the median) is at position $\dfrac{2(n+1)}{4}$, and the third quartile is at position $\dfrac{3(n+1)}{4}$.
2. Find the value in that position for the ordered data.
Note!
If the position found in step 1 is not a whole number, interpolate between the two neighboring values.
Example 1-7: Final Exam Scores
The final exam scores of 18 students are (in increasing order):
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
Find the lower quartile (Q1), the median, and the upper quartile (Q3).
Answer
In this example, $n=18$.
For Q1, its position is: $\dfrac{18+1}{4}=4.75$
The actual value of Q1: Q1 = 67 (4th position) + 0.75 · (71 - 67) = 70
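As a minimal sketch of the two-step method above (find the position, then interpolate), the following Python function reproduces the Q1 calculation. The function name `quartile_by_position` is illustrative and is not part of Minitab or of the lesson. It assumes the data are already sorted and that the computed position lies between 1 and $n$.

```python
import math

def quartile_by_position(sorted_data, k):
    """Return the k-th quartile (k = 1, 2, 3) using position k(n+1)/4.

    Assumes sorted_data is in increasing order and that the position falls
    between 1 and n, which holds for the small examples in this lesson.
    """
    n = len(sorted_data)
    position = k * (n + 1) / 4        # 1-based position, may be fractional
    lower = math.floor(position)      # whole-number part of the position
    fraction = position - lower       # how far to move toward the next value
    value = sorted_data[lower - 1]    # convert 1-based position to 0-based index
    if fraction > 0:
        # Interpolate between the values at positions `lower` and `lower + 1`.
        value += fraction * (sorted_data[lower] - sorted_data[lower - 1])
    return value

scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]
print(quartile_by_position(scores, 1))  # 70.0
```

The same function can be called with `k = 2` or `k = 3` to check the median and Q3 for this example.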

For the median, its position is: $\dfrac{18+1}{2}=9.5$
The actual value of the median: Q2 = 82 (9th position) + 0.5 · (83 - 82) = 82.5
For Q3, its position is: $\dfrac{3(18+1)}{4}=14.25$
The actual value of Q3: Q3 = 88 (14th position) + 0.25 · (92 - 88) = 89
The Five-Number Summary
A helpful summary of the data is called the five-number summary. The five-number summary consists of five values:
1. The minimum
2. The lower quartile, Q1
3. The median (also known as Q2)
4. The upper quartile, Q3
5. The maximum
Example 1-7
Find the five-number summary for the final exam scores. Interpret the values.
Answer
The five-number summary is: minimum = 24, Q1 = 70, median = 82.5, Q3 = 89, maximum = 97.
The lowest score on the final exam was 24. The highest score on the exam was 97. 25% of the students scored 70 or below. 50% of the students scored above 82.5. 75% of the students scored 89 or below. We can also say that 25% of the students scored at least 89.
1.5.3 - Measures of Variability
To introduce the idea of variability, consider this example. Two vending machines, A and B, drop candies when a quarter is inserted. The number of pieces of candy one gets is random. The following data are recorded for six trials at each vending machine:
Vending Machine A
Pieces of candy from vending machine A: 1, 2, 3, 3, 5, 4
mean = 3, median = 3, mode = 3
Vending Machine B
Pieces of candy from vending machine B: 2, 3, 3, 3, 3, 4
mean = 3, median = 3, mode = 3

The dot plots for the pieces of candy from vending machine A and vending machine B are displayed in Figure 1.4. They have the same center, but what about their spreads?
Measures of Variability
There are many ways to describe variability or spread, including:
- Range
- Interquartile range (IQR)
- Variance and Standard Deviation
Range
The range is the difference between the maximum and minimum values of a data set. The maximum is the largest value in the data set and the minimum is the smallest value. The range is easy to calculate but it is very much affected by extreme values.

Like the range, the IQR is a measure of variability, but you must find the quartiles in order to compute its value.
Interquartile Range (IQR)
The interquartile range is the difference between the upper and lower quartiles and is denoted IQR: IQR = Q3 - Q1.
Note!
The IQR is not affected by extreme values. It is thus a resistant measure of variability.
Try it!
Find the IQR for the final exam scores example.
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
Solution
IQR = Q3 - Q1 = 89 - 70 = 19
Variance and Standard Deviation
One way to describe spread or variability is to compute the standard deviation. In the following section, we are going to talk about how to compute the sample variance and the sample standard deviation for a data set. The standard deviation is the square root of the variance.
Variance
the average squared distance from the mean
Population variance
$\sigma^2=\dfrac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$
where $\mu$ is the population mean, the summation is over all possible values of the population, and $N$ is the population size.
$\sigma^2$ is often estimated by using the sample variance.
Sample Variance
$s^2=\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$
where $n$ is the sample size and $\bar{x}$ is the sample mean.
Why do we divide by $n-1$ instead of by $n$?
When we calculate the sample variance, we estimate the population mean with the sample mean. Dividing by $n-1$ rather than $n$ gives $s^2$ a special property: it is what we call an "unbiased estimator" of the population variance.
The sample variance (and therefore the sample standard deviation) is the common default calculation used by software. When asked to calculate the variance or standard deviation of a set of data, assume, unless otherwise instructed, that this is sample data and therefore calculate the sample variance and sample standard deviation.
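The difference between dividing by $N$ and dividing by $n-1$ can be seen in a short Python sketch. The data below are hypothetical, chosen only so the arithmetic is easy to follow. The two helper functions are written out for illustration, and the standard library's `statistics.variance` (sample) and `statistics.pvariance` (population) are used as a cross-check.

```python
from statistics import mean, pvariance, variance

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = mean(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

def population_variance(data):
    """Sum of squared deviations from the mean, divided by N."""
    mu = mean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

# Hypothetical data: mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(data))      # 32 / 7, about 4.571
print(population_variance(data))  # 32 / 8 = 4.0
print(variance(data), pvariance(data))  # library versions of the same two quantities
```

The sample variance is always the larger of the two for the same data, because the same sum of squares is divided by a smaller number.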

Example 1-8
Calculate the variance for these final exam scores.
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
Answer
First, find the mean:
$\bar{x}=\dfrac{24+58+61+67+71+73+76+79+82+83+85+87+88+88+92+93+94+97}{18}=\dfrac{1398}{18}=\dfrac{233}{3}$
Next, use a table to sum the squared distances.
Table
$x_i$    $(x_i-\bar{x})$    $(x_i-\bar{x})^2$
24       -161/3             25921/9
58       -59/3              3481/9
61       -50/3              2500/9
67       -32/3              1024/9
71       -20/3              400/9
73       -14/3              196/9
76       -5/3               25/9
79       4/3                16/9
82       13/3               169/9
83       16/3               256/9
85       22/3               484/9
87       28/3               784/9
88       31/3               961/9
88       31/3               961/9
92       43/3               1849/9
93       46/3               2116/9
94       49/3               2401/9
97       58/3               3364/9
Sum      0                  46908/9
Finally,
$s^2=\dfrac{\sum(x_i-\bar{x})^2}{n-1}=\dfrac{46908/9}{18-1}=\dfrac{5212}{17}\approx 306.59$
Try it!
Calculate the sample variances for the data sets from vending machines A and B yourself and check that the variance for B is smaller than that for A. Work out your answer first, then compare with the answers below.
a. 1, 2, 3, 3, 4, 5
Answer
$s^2=\dfrac{10}{5}=2$
b. 2, 3, 3, 3, 3, 4
Answer
$s^2=\dfrac{2}{5}=0.4$
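The table-based calculation in Example 1-8 can be checked with exact fractions. This is only a sketch of one way to verify the arithmetic in Python; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]
n = len(scores)

x_bar = Fraction(sum(scores), n)               # sample mean as an exact fraction
deviations = [x - x_bar for x in scores]       # the (x_i - x_bar) column
sum_squares = sum(d ** 2 for d in deviations)  # total of the squared-deviation column
s_squared = sum_squares / (n - 1)              # sample variance, dividing by n - 1

print(x_bar)             # 233/3
print(sum(deviations))   # 0
print(s_squared)         # 5212/17
print(round(float(s_squared), 2), round(sqrt(s_squared), 2))  # 306.59 17.51
```

Working in fractions avoids rounding until the last step, which mirrors the thirds and ninths used in the table.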

Standard Deviation
The standard deviation is a very useful measure. One reason is that it has the same unit of measurement as the data itself (e.g. if a sample of student heights were in inches then so, too, would be the standard deviation; the variance would be in squared units, for example $\text{inches}^2$). Also, the empirical rule, which will be explained later, makes the standard deviation an important yardstick to find out approximately what percentage of the measurements fall within certain intervals.
Standard Deviation
approximately the average distance the values of a data set are from the mean, or the square root of the variance
Population Standard Deviation
$\sigma=\sqrt{\sigma^2}$
It has the same unit as the $x_i$'s. This is a desirable property since one may think about the spread in terms of the original unit.
$\sigma$ is estimated by the sample standard deviation $s$:
Sample Standard Deviation
$s=\sqrt{s^2}$
A rough estimate of the standard deviation can be found using $s\approx\dfrac{\text{range}}{4}$.
Adding and Multiplying Constants
What happens to measures of variability if we add a constant to each observation in a data set, or multiply each observation by a constant? We learned previously about the effect such actions have on the mean and the median, but do variation measures behave similarly? Not really.
When we add a constant to all values we are basically shifting the data upward (or downward if we subtract a constant). This has the result of moving the middle but leaving the variability measures (e.g. range, IQR, variance, standard deviation) unchanged.
On the other hand, if one multiplies each value by a constant this does affect measures of variation. The new variance is the original variance multiplied by the square of the constant, while the standard deviation, range, and IQR are multiplied by the absolute value of the constant. For example, if the observed values of Machine A in the example above were multiplied by three, the new variance would be 18 (the original variance of 2 multiplied by 9). The new standard deviation would be about 4.24 (the original standard deviation of about 1.414 multiplied by 3). The range and IQR would also change by a factor of 3.
Coefficient of Variation
Above we considered three measures of variation: range, IQR, and variance (and its square root counterpart, standard deviation). These are all measures we can calculate from one quantitative variable, e.g. height or weight. But how can we compare dispersion (i.e. variability) of data from two or more distinct populations that have vastly different means?
A popular statistic to use in such situations is the coefficient of variation, or CV. This is a unit-free statistic and one where the higher the value the greater the dispersion. The calculation of CV is:
Coefficient of Variation (CV)
$CV=\dfrac{\text{standard deviation}}{\text{mean}}$
The CV is often reported as a percentage, that is, multiplied by 100.
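The effect of adding or multiplying by a constant, and the unit-free nature of the CV, can be checked numerically. The sketch below uses the Machine A data from the lesson. The shift of 10 is an arbitrary choice, and `coefficient_of_variation` is an illustrative helper that assumes the CV is the sample standard deviation divided by the sample mean, expressed as a percentage.

```python
from statistics import mean, stdev, variance

machine_a = [1, 2, 3, 3, 4, 5]          # Machine A data from the example above
shifted = [x + 10 for x in machine_a]   # add a constant (10 is arbitrary)
scaled = [3 * x for x in machine_a]     # multiply every value by 3

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    return stdev(data) / mean(data) * 100

# Adding a constant leaves the variance at 2; multiplying by 3 gives 2 * 9 = 18.
print(variance(machine_a), variance(shifted), variance(scaled))  # 2 2 18

# The standard deviation is multiplied by 3, not by 9.
print(round(stdev(machine_a), 3), round(stdev(scaled), 3))  # 1.414 4.243

# Rescaling changes the mean and standard deviation by the same factor,
# so the CV is unchanged.
print(round(coefficient_of_variation(machine_a), 2),
      round(coefficient_of_variation(scaled), 2))  # 47.14 47.14
```

Because multiplying by a positive constant changes the mean and the standard deviation by the same factor, the CV stays the same, which is one way to see why it is described as unit-free.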

To demonstrate, think of prices for luxury and budget hotels. Which do you think would have the higher average cost per night? Which would have the greater standard deviation? The CV would allow you to compare this dispersion in costs in relative terms by accounting for the fact that the luxury hotels would have a greater mean and standard deviation.
Example 1-9: Comparing Prices
You are shopping for toilet tissue. As you compare prices of various brands, some offer price per roll while others offer price per sheet. You are interested in determining which pricing method has less variability, so you sample several of each and calculate the mean and standard deviation for the sampled items that are priced per roll, and the mean and standard deviation for the sampled items that are priced per sheet. The table below summarizes your results.
Item               Mean      Standard Deviation
Price per Roll     0.9196    0.4233
Price per Sheet    0.01134   0.00553
Answer
Comparing the standard deviations, Price per Sheet appears to have much less variability in pricing. However, its mean is also much smaller. The coefficient of variation allows us to make a relative comparison of the variability of these two pricing schemes:
$CV_{\text{Roll}}=\dfrac{0.4233}{0.9196}\approx 0.46$
$CV_{\text{Sheet}}=\dfrac{0.00553}{0.01134}\approx 0.49$
Relatively speaking, the variation for Price per Sheet is greater than the variation for Price per Roll.
1.5.4 - Minitab: Descriptive Statistics
Minitab® Descriptive Statistics
Let's perform some basic operations in Minitab. Some of the examples below are repeats of what we did by hand in earlier lessons while others are new. We saw previously how you can enter data into the Minitab worksheet by hand; we will now walk through how to load a dataset into Minitab from an Excel file.

Loading Data into Minitab from an Excel File
For the examples in this section, download the minitabintrodata.xlsx [6] spreadsheet file. Save the file locally (if using Minitab installed on the computer you are using).
Open Minitab web and choose 'Open Local File' to find the spreadsheet file.
With the data in the Minitab worksheet, you can then perform any number of procedures. First, we obtain some basic descriptive statistics.
Descriptive Statistics
With the data from the Excel spreadsheet file loaded into your Minitab worksheet window, you should notice that all columns are labeled ‘Cx’ where the ‘x’ is a number. Some of these are followed by a ‘-T’. Those columns with the ‘-T’ indicate that the data in this column are considered text or categorical data. Otherwise, Minitab recognizes the data as quantitative. If the operation you conduct in Minitab only functions on a certain variable type (e.g. calculating the mean can only be done on quantitative data) then only columns of that data type will be available to use for those operations.
Minitab® Example 1-10: Hours Data
Let's use Minitab to calculate the five-number summary, mean, and standard deviation for the Hours data (contained in minitabintrodata.xlsx [7]). And, as you will see, Minitab by default will provide some added information.
1. At the top of the Minitab window select the menu option Stat > Basic Statistics > Display Descriptive Statistics.
2. Once this dialog box opens, your cursor should be blinking in the 'Variables' window. If not, simply click inside this part of the dialog box. The only variables you should see in the left side window are columns of quantitative data (the two price columns, age, and hours). To enter a variable from the left hand window into the Variables window you can either double-click that variable or click the variable to highlight it and then click the 'Select' button. Do so with the variable 'Hours'.

3. With the variable 'Hours' in the 'Variables' window, click the 'OK' button.
The following output should appear in the Session window above the worksheet.
Statistics
Variable  N   N*   Mean   SE Mean  StDev  Minimum  Q1    Median  Q3     Maximum
Hours     50  835  16.00  1.19     8.45   4.00     9.00  15.00   22.25  34.00

The mean, standard deviation (StDev), etc. should be the same values as those calculated in the practice problems. Minitab also gives the size of the sample used to create these statistics (N) and the number of observations that were missing (N*).
These statistics are the default statistics. Additional basic descriptive statistics are also available, such as the trimmed mean and the coefficient of variation (CV).
Minitab® Example 1-11: The Coefficient of Variation (CV)
To get the CV values for Price per Sheet and Price per Roll, an example found in an earlier section (data contained in minitabintrodata.xlsx [8]):
1. Open Minitab and return to Stat > Basic Statistics > Display Descriptive Statistics.
2. Enter both variables into the Variables window. That is, both 'Price_Roll' and 'Price_Sheet' should be in the Variables window.
3. Click the Statistics tab and then check the box for 'Coefficient of Variation' (notice the other statistics available!) and click OK.
4. Click OK again.
The output will include the same statistics as the example above plus the CV values (titled 'CoefVar').
Descriptive Statistics: Price_Roll, Price_Sheet
Statistics
Variable     N   N*   Mean     SE Mean  StDev    CoefVar  Minimum  Q1       Median   Q3       Maximum
Price_Roll   24  861  0.9196   0.0864   0.4233   46.03    0.5300   0.6075   0.7750   0.9500   1.9800
Price_Sheet  24  861  0.01134  0.00113  0.00553  48.78    0.00610  0.00753  0.00995  0.01490  0.03180
1.6 - Graphing One Quantitative Variable
Now that we have discussed how to find summary statistics for quantitative variables, the next step is to graph the data. The graphs we will discuss include:
1. Dotplot
2. Stem-and-leaf Diagram
3. Histogram
4. Boxplot

1.6.1 - Dotplots, Stem-and-Leaf Diagrams
Dotplot
A dotplot displays the data as dots on a number line. It is useful for showing the relative positions of the data.
Dotplot Example
Each of the ten children in the second grade was given a reading aptitude test. The scores were as follows:
95, 78, 69, 91, 82, 76, 76, 86, 88, 79
Here is a dotplot for the data.
[Figure: dotplot of the reading aptitude scores]
Each of the observations is represented as a dot. If there is more than one observation with the same value, a dot is placed above the others. A dotplot provides us with a quick glance at the data. We can easily see the minimum and maximum values and can see that the mode is 76. Dotplots are generally used for small data sets.
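For readers without graphing software at hand, a rough text version of the dotplot can be sketched in Python. This is an illustration only: it prints the plot sideways, with one row per observed value and one dot per observation, rather than stacking dots above a horizontal number line.

```python
from collections import Counter

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 79]
counts = Counter(scores)  # how many times each score occurs

# One row per observed value; repeated values get one dot per observation.
for value in sorted(counts):
    print(f"{value:>3} | {'.' * counts[value]}")

most_common_value, frequency = counts.most_common(1)[0]
print("min:", min(scores), "max:", max(scores), "mode:", most_common_value)
# min: 69 max: 95 mode: 76
```

The row for 76 is the only one with two dots, which is how the mode stands out in the plot.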

Minitab®
Minitab: Dotplots
How to create a dotplot in Minitab:
1. Click Graph > Dotplot
2. Choose Simple.
3. Enter the column with your variable
4. Click OK.
Stem-and-Leaf Diagrams
To produce the diagram, the data need to be grouped based on the “stem”, which depends on the number of digits of the quantitative variable. The “leaves” represent the last digit. One advantage of this diagram is that the original data can be recovered (except for the order in which the data were taken) from the diagram.
Stem-and-Leaf Example
Jessica has weighed herself every Saturday for the past 30 weeks. The table below shows her recorded weights in pounds.
135  137  136  137  138  139
140  139  137  140  142  146
148  145  139  140  142  143
144  143  141  139  137  138
139  136  133  134  132  132
Create a Stem-and-Leaf Diagram for Jessica’s Weight.
Answer
The first step is to determine the stem. The weights range from 132 to 148. The stems should be 13 and 14. The leaves should be the last digit. For example, the smallest value is 132; it has a stem of 13 and a leaf of 2.
Stem-and-Leaf of weight of Jessica  N = 30
Leaf Unit = 1.0
  3   13  223
  5   13  45
 11   13  667777
 (7)  13  8899999
 12   14  0001
  8   14  2233
  4   14  45
  2   14  6
  1   14  8
The first column, called depths, displays cumulative frequencies. Starting from the top, the depths indicate the number of observations that lie in a given row or before. For example, the 11 in the third row indicates that there are 11 observations in the first three rows. The row that contains the middle observation is denoted by a bracketed number of observations in that row; (7) for our example. We thus know that the middle value lies in the fourth row. The depths following that row indicate the number of observations that lie in a given row or after. For example, the 4 in the seventh row indicates that there are four observations in the last three rows.
Minitab®
Minitab: Stem-and-Leaf Diagrams
How to create a Stem-and-Leaf Diagram in Minitab:
1. Click Graph > Stem-and-Leaf
2. Enter the column with your variable
3. Click OK.
1.6.2 - Histograms
Histogram
If there are many data points and we would like to see the distribution of the data, we can represent the data by a frequency histogram or a relative frequency histogram.
A histogram looks similar to a bar chart but it is for quantitative data. To create a histogram, the data need to be grouped into class intervals. Then create a tally to show the frequency (or relative frequency) of the data in each interval. The relative frequency is the frequency in a particular class divided by the total number of observations. The bars are as wide as the class interval and as tall as the frequency (or relative frequency).
Histogram Example
Jessica has weighed herself every Saturday for the past 30 weeks. The table below shows her recorded weights in pounds.
135  137  136  137  138  139
140  139  137  140  142  146
148  145  139  140  142  143
144  143  141  139  137  138
139  136  133  134  132  132
Create a histogram of her weight.
Answer
For histograms, we usually want to have from 5 to 20 intervals. Since the data range is from 132 to 148, it is convenient to have a class width of 2 since that will give us 9 intervals.
131.5-133.5
133.5-135.5
135.5-137.5
137.5-139.5
139.5-141.5
141.5-143.5
143.5-145.5
145.5-147.5
147.5-149.5
The reason that we choose end points ending in .5 is to avoid confusion about whether an end point belongs to the interval to its left or the interval to its right. An alternative is to specify the endpoint convention. For example, Minitab includes the left end point and excludes the right end point.
Having the intervals, one can construct the frequency table and then draw the frequency histogram, or compute the relative frequencies to construct the relative frequency histogram. The following histogram is produced by Minitab when we specify the midpoints for the definition of intervals according to the intervals chosen above.
[Figure: histogram of Jessica's weights using the class intervals above]


If we do not specify the midpoints for the definition of intervals, Minitab will by default choose another set of class intervals, resulting in the following histogram. According to the include-left, exclude-right endpoint convention, the observation 133 is included in the class 133-135.
[Figure: histogram of Jessica's weights using Minitab's default class intervals]
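The frequency table behind these histograms can be sketched in Python. The helper `frequency_table` is illustrative, not a Minitab function; it applies the include-left, exclude-right convention described above. The whole-number class boundaries (131, 133, ..., 149) are an assumption made for illustration, since the lesson does not list the exact default intervals Minitab chose.

```python
weights = [
    135, 137, 136, 137, 138, 139, 140, 139, 137, 140,
    142, 146, 148, 145, 139, 140, 142, 143, 144, 143,
    141, 139, 137, 138, 139, 136, 133, 134, 132, 132,
]

def frequency_table(data, edges):
    """Return (left, right, frequency, relative frequency) for each class.

    Each class includes its left end point and excludes its right end point.
    """
    rows = []
    for left, right in zip(edges, edges[1:]):
        frequency = sum(1 for x in data if left <= x < right)
        rows.append((left, right, frequency, frequency / len(data)))
    return rows

half_edges = [131.5 + 2 * i for i in range(10)]  # 131.5, 133.5, ..., 149.5
whole_edges = list(range(131, 150, 2))           # 131, 133, ..., 149 (assumed)

for left, right, frequency, relative in frequency_table(weights, half_edges):
    print(f"{left}-{right}: {frequency:2d}  ({relative:.3f})")

# With whole-number boundaries, 133 is counted in the class 133-135.
print(frequency_table(weights, whole_edges)[1][:3])  # (133, 135, 2)
```

With the .5 end points no observation can fall on a boundary, so the endpoint convention only matters for the whole-number classes.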


Note that different choices of class intervals will result in different histograms. Relative frequency histograms are constructed in much the same way as frequency histograms except that the vertical axis represents the relative frequency instead of the frequency. For the purpose of visually comparing the distributions of two data sets, it is better to use relative frequency histograms rather than frequency histograms, since the same vertical scale, from 0 to 1, is used for all relative frequencies.
Minitab®
Minitab: Histograms
How to create a histogram in Minitab:
1. Click Graph > Histogram
2. Choose Simple.
3. Enter the column with your variable
4. Click OK.
1.6.3 - Boxplots
Boxplot
To create this plot we need the five-number summary. Therefore, we need:
- minimum value,
- Q1 (lower quartile),
- Q2 (median),
- Q3 (upper quartile), and
- maximum value.
Using the five-number summary, one can construct a skeletal boxplot.
1. Mark the five-number summary above the horizontal axis with vertical lines.
2. Connect Q1, Q2, Q3 to form a box, then connect the box to the min and max with lines to form the whiskers.
Most statistical software does NOT create a skeletal boxplot but instead opts for the boxplot described below. Boxplots from statistical software are more detailed than skeletal boxplots because they also show outliers. However, if there are no outliers, what is produced by the software is essentially the skeletal boxplot.
The following terminology will prepare us to understand and draw this more detailed type of boxplot.
Potential outliers are observations that lie outside the lower and upper limits.
Lower limit = Q1 - 1.5 * IQR
Upper limit = Q3 + 1.5 * IQR
Adjacent values are the most extreme values that are not potential outliers.
Boxplot Example

Let's revisit the final exam score data:
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
IQR = Q3 - Q1 = 89 - 70 = 19
Lower limit = Q1 - 1.5 * IQR = 70 - 1.5 * 19 = 41.5
Upper limit = Q3 + 1.5 * IQR = 89 + 1.5 * 19 = 117.5
Lower adjacent value = 58
Upper adjacent value = 97
Since 24 lies below the lower limit of 41.5, it is a potential outlier.
Statistical software will create a boxplot of the final exam scores that may look like this:
[Figure: boxplot of the final exam scores]
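The limits, potential outliers, and adjacent values for this example can be reproduced with a short Python sketch. The quartiles are entered directly from the earlier calculation (Q1 = 70, Q3 = 89) rather than recomputed, and the variable names are illustrative.

```python
scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]
q1, q3 = 70, 89  # quartiles found earlier in the lesson

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Potential outliers lie outside the limits; everything else is "inside".
outliers = [x for x in scores if x < lower_limit or x > upper_limit]
inside = [x for x in scores if lower_limit <= x <= upper_limit]

# Adjacent values are the most extreme observations that are not outliers.
lower_adjacent, upper_adjacent = min(inside), max(inside)

print(iqr, lower_limit, upper_limit)   # 19 41.5 117.5
print(outliers)                        # [24]
print(lower_adjacent, upper_adjacent)  # 58 97
```

In a software-drawn boxplot the whiskers would end at the adjacent values, with the potential outlier drawn as a separate point.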

Boxplots and Distribution Shapes
Symmetric Data
A symmetric distribution with its corresponding boxplot:
[Figure: a symmetric boxplot with distribution curve, with Q1, Q2, and Q3 marked]
Right-Skewed Data
A right-skewed distribution along with its corresponding boxplot:
[Figure: a right-skewed boxplot with distribution curve, with Q1, Q2, and Q3 marked]
Left-Skewed Data
A left-skewed distribution along with its corresponding boxplot:
[Figure: a left-skewed boxplot with distribution curve, with Q1, Q2, and Q3 marked]

Minitab®
Minitab: Boxplots
How to create a single boxplot in Minitab:
1. You must have a column of measurement data.
2. Click Graph > Boxplot
3. Under One Y, choose Simple, then click OK.
4. Enter the column of interest under Graph Variables.
5. Click OK.
1.7 - Lesson 1 Summary
The goal of statistics is to make inferences about the population based on the sample. Therefore, it is essential to understand how the sample is obtained. We also have to know what conclusions we can make based on the type of study we have.
Once we gather the data, we need to summarize it in order to make sense of what we have. How we summarize the data depends on what type of variable we have. For qualitative variables, we can summarize using frequencies and proportions. For quantitative data, we can summarize the data using various measures of center, variability, and position.
For both types of variables, quantitative and qualitative, we can produce graphs to help us visualize the data.
Collecting and summarizing data (numerically and graphically) helps us understand what is going on in the sample. The goal is to understand what is happening in the population from that sample (i.e. inference). To begin inference, we need to first learn about probability. Probability is discussed in the next lesson.
Source: https://online.stat.psu.edu/stat500/lesson/1
Links:
1. https://online.stat.psu.edu/stat503
2. https://online.stat.psu.edu/stat503
3. https://online.stat.psu.edu/stat506
4. https://online.stat.psu.edu/stat506
5. https://online.stat.psu.edu/stat500/lesson/0/0.2
6. https://online.stat.psu.edu/stat500/sites/stat500/files/minitab/minitabintrodata.xlsx
7. https://online.stat.psu.edu/stat500/sites/stat500/files/minitab/minitabintrodata.xlsx
8. https://online.stat.psu.edu/stat500/sites/stat500/files/minitab/minitabintrodata.xlsx