Understanding Categorical Variable Associations in Statistics

School

University of Michigan

Course

STATS 250

Subject

Statistics

Date

Dec 11, 2024


STATS 250 Lecture 22: Associations between categorical variables

□ Compute and interpret differences in conditional proportions and relative risk. Use these summary statistics to informally assess independence/association between two categorical variables.
□ Informally assess whether two categorical variables are independent or associated using graphical summaries, including segmented/side-by-side bar charts and mosaic plots.

1 Summarizing associations between categorical variables

In previous lectures, we've explored numerous methods for assessing association (or the lack of association) between two variables. The table below summarizes many of these topics:

Explanatory variable                                            Response variable   Statistical method
Categorical (binary), forming two independent groups of data    Quantitative        Two-sample t-test
Categorical, forming more than two independent groups of data   Quantitative        ANOVA
Quantitative                                                    Quantitative        Simple Linear Regression
Multiple explanatory variables, which can be a mix of
  categorical and quantitative characteristics                  Quantitative        Multiple Linear Regression

In today's lecture, we'll begin exploring a new form of research scenario. Like all those listed in the table above, this research scenario focuses on the question of whether an association exists between a chosen explanatory variable and a chosen response variable. In contrast to these examples, however, this lecture focuses on scenarios where both the explanatory and response variables are categorical.

1.1 Does Hospital Choice Predict Survival Chances?

Let's start with a simple thought experiment. Suppose you are in a life-threatening accident requiring life-saving medical care. You are loaded into the ambulance and the driver offers you a choice between Hospital A and Hospital B. Which hospital should you go to?
The table below gives the records of past patients who visited each hospital for medical treatment.

              Survived   Died
Hospital A       50       50
Hospital B       68       32

This scenario poses a question about whether the distributions of two categorical variables are related (associated) or unrelated (independent):

Variable 1 (explanatory): _________________________________
Variable 2 (response): ___________________________________

a. What proportion of the patients attended each hospital? What proportion of the patients survived? What proportion died?

Definitions: Marginal & Conditional distributions

In the space above, you just computed the marginal distributions of the explanatory variable (Hospital) and the response variable (Outcome). When analyzing two-way tables, one typically starts by considering the marginal distribution of each variable by itself before moving on to explore possible relationships between the variables. To study possible relationships between two categorical variables, we examine the conditional distributions, i.e., the distributions of one variable for given outcomes of the other variable.

a. Restrict your attention just to Hospital A. What proportion of patients survived?
b. Restrict your attention just to Hospital B. What proportion of patients survived?
c. What was the difference in survival rates?
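The computations asked for above can be sketched in a few lines of Python. This is only an illustration: the variable names are made up for the example, the counts are those recorded for the 200 past patients (Hospital A: 50 survived, 50 died; Hospital B: 68 survived, 32 died), and relative risk is assumed to mean the ratio of the two conditional proportions, the usual definition, since this part of the notes does not spell it out.

```python
# Counts from the hospital table (rows: hospital, columns: outcome).
counts = {
    "Hospital A": {"Survived": 50, "Died": 50},
    "Hospital B": {"Survived": 68, "Died": 32},
}

n = sum(sum(row.values()) for row in counts.values())  # total number of patients

# Marginal distribution of Hospital: row totals divided by n.
hospital_marginal = {h: sum(row.values()) / n for h, row in counts.items()}

# Marginal distribution of Outcome: column totals divided by n.
outcome_marginal = {
    o: sum(row[o] for row in counts.values()) / n for o in ("Survived", "Died")
}

# Conditional distribution of Outcome given Hospital: each count over its row total.
conditional = {
    h: {o: c / sum(row.values()) for o, c in row.items()}
    for h, row in counts.items()
}

p_a = conditional["Hospital A"]["Survived"]
p_b = conditional["Hospital B"]["Survived"]
difference = p_a - p_b      # difference of conditional proportions (A minus B)
relative_risk = p_a / p_b   # assumed definition: ratio of conditional proportions

print("Marginal (Hospital):", hospital_marginal)
print("Marginal (Outcome): ", outcome_marginal)
print(f"Survival | A = {p_a:.2f}, Survival | B = {p_b:.2f}")
print(f"Difference = {difference:+.2f}, relative risk = {relative_risk:.2f}")
```

Subtracting in the other order (B minus A) only flips the sign of the difference; what matters for interpretation is stating which group is the reference.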

1.2 Difference of conditional proportions

You can summarize how two categorical variables are related by considering the rate at which one variable takes on a particular outcome across the groups formed by the other variable.

The difference of conditional proportions = $\hat{\pi}_1 - \hat{\pi}_2$

A difference in conditional proportions simply records the difference in the rates at which a categorical outcome occurs across two groups.

d. Consider the summary statistics you've computed above. If the variables Hospital and Outcome were completely independent, what values of these statistics would you have expected to compute?
e. Provide interpretations of the difference in conditional proportions and the relative risk. Informally, do the distributions of the variables Hospital and Outcome appear to be related?

When a categorical, binary outcome variable is exactly independent of the categorical, binary explanatory variable forming the groups, the difference in conditional proportions will equal zero. The further this statistic is from zero (i.e., the larger its magnitude), the more evidence we have against the claim that these variables are independent and for the claim that these variables are associated.

1.3 Visualizing associations between two categorical variables

We can also visualize the joint distribution of two categorical variables with a variety of graphical options, including segmented bar charts, side-by-side bar charts, and mosaic plots.

For instance, consider the dataset below, which records two categorical variables for a sample of $n = 365$ residents in Washtenaw County, MI.

         High School   Partial College   Bachelors   Graduate Degree   Sum
Bike          33              19             33             18         103
Car           15              63             65             44         187
Other         28               8             20             19          75
Sum           76              90            118             81         365

Try It! Educational Level and Commute Method

a. Based on the data, does education level appear to be related to whether residents in Washtenaw County bike to work? Support your decision with numerical summaries.

As mentioned above, we can use several types of graphs to visualize the relationship (or lack of relationship) between two categorical variables. One is a segmented bar chart, which gives a visual of the two-way table. In a segmented bar chart, the height of each bar represents the number of residents in each education level, while the colors indicate how many of the residents in each education level use each commute mode.

This same information can instead be displayed in side-by-side bar charts, in which separate bar charts are given for each group in one of the categorical variables. The height of each bar displays the commute mode count for each education level.
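The numerical summaries that the Try It! exercise asks for are conditional proportions of commute mode within each education level. A minimal sketch in Python follows; the variable names are illustrative, and the counts are copied from the Washtenaw County table above.

```python
education = ["High School", "Partial College", "Bachelors", "Graduate Degree"]

# Observed counts: one row per commute mode, one entry per education level.
table = {
    "Bike":  [33, 19, 33, 18],
    "Car":   [15, 63, 65, 44],
    "Other": [28, 8, 20, 19],
}

# Column totals: number of residents at each education level.
column_totals = [sum(col) for col in zip(*table.values())]

# Conditional distribution of commute mode given education level:
# each count divided by the total for its education level.
commute_given_education = {
    mode: [count / total for count, total in zip(row, column_totals)]
    for mode, row in table.items()
}

for level, p in zip(education, commute_given_education["Bike"]):
    print(f"{level:<16} proportion who bike = {p:.3f}")
```

If education level and commute method were independent, the four printed proportions would be roughly equal; large gaps between them are informal evidence of an association.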

The graph we choose to display depends on what information we hope to convey about the data. Graphs such as segmented bar charts or side-by-side bar charts are called comparative plots, since they allow us to compare groups in a categorical variable. We could also display these data in a less common graphical display called a mosaic plot, as shown below.

1.3.1 Independence and association in bar charts and mosaic plots

Each of the three graphical displays above provides informal evidence that Educational Level and Commute Method are associated variables. What would these displays look like if, in fact, these two variables were exactly independent (i.e., if a person's educational level had no relationship at all with how they commuted to work)?

Below you'll find a new table that displays the observed counts used to create the graphs on the previous pages alongside fabricated counts in parentheses. These fabricated data display nearly perfect statistical independence. That is to say, any difference of conditional proportions $\hat{\pi}_1 - \hat{\pi}_2$ computed from them will be approximately zero!

         High School   Partial College   Bachelors   Graduate Degree   Sum
Bike       33 (21)         19 (25)        33 (33)        18 (23)       103
Car        15 (39)         63 (46)        65 (60)        44 (41)       187
Other      28 (16)          8 (18)        20 (24)        19 (17)        75
Sum          76              90            118             81          365

Try It! Exploring independence through bar charts and mosaic plots

Consider the original stacked bar chart above and how it compares to a similar stacked bar chart based on the fabricated, independent data. What differences/similarities do you notice?

Consider the original side-by-side bar chart and how it compares to a similar side-by-side bar chart based on the fabricated, independent data. What differences/similarities do you notice?

In the space below, sketch your prediction for what the mosaic plot of the fabricated, independent data might look like.
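The fabricated counts in Section 1.3.1 appear to agree, after rounding, with the counts one would expect under exact independence: (row total × column total) / n. The notes do not say how the fabricated counts were produced, so treating them as rounded expected counts is an assumption. A short Python sketch of that calculation, with illustrative variable names:

```python
# Observed counts: one row per commute mode, one entry per education level
# (High School, Partial College, Bachelors, Graduate Degree).
observed = {
    "Bike":  [33, 19, 33, 18],
    "Car":   [15, 63, 65, 44],
    "Other": [28, 8, 20, 19],
}

row_totals = {mode: sum(row) for mode, row in observed.items()}
column_totals = [sum(col) for col in zip(*observed.values())]
n = sum(row_totals.values())

# Expected count under independence: (row total * column total) / n.
expected = {
    mode: [row_totals[mode] * col_total / n for col_total in column_totals]
    for mode in observed
}

for mode, row in expected.items():
    print(f"{mode:<6}", [round(x, 1) for x in row])

# Under independence every education level has the same conditional
# distribution of commute mode, equal to the marginal distribution.
for mode in observed:
    for j, col_total in enumerate(column_totals):
        assert abs(expected[mode][j] / col_total - row_totals[mode] / n) < 1e-12
```

Because the conditional proportions are identical in every column, the segments of a stacked bar chart or mosaic plot built from these counts would line up at the same relative heights across education levels.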

2 The need for multivariate thinking

Recall our earlier example concerning the survival rates of patients attending two hospitals (A and B) for life-saving medical treatment.

Table of Hospital Choice vs. Outcome for $n = 200$ patients

              Died   Survived   Total
Hospital A     50       50       100
Hospital B     32       68       100
Total          82      118       200

Suppose we further categorize each of the $n = 200$ patients according to whether, when they arrived at the hospital, they needed to be immediately placed on a life support system (e.g., a mechanical breathing system or dialysis machine).

Table of Hospital Choice vs. Outcome for $n = 100$ patients not placed on immediate life support

              Died   Survived   Total
Hospital A      2       18        20
Hospital B     16       64        80
Total          18       82       100

Table of Hospital Choice vs. Outcome for $n = 100$ patients placed on immediate life support

              Survived   Died   Total
Hospital A       32       48      80
Hospital B        4       16      20
Total            36       64     100

Try It! Exploring associations after accounting for a third variable

a. What was the observed difference in survival rates across patients who attended hospital A vs. B?
b. What was the observed difference in survival rates across patients who attended hospital A vs. B, among those patients not placed on life support?
c. What was the observed difference in survival rates across patients who attended hospital A vs. B, among those patients placed on immediate life support?
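The three differences requested above can be computed with one small helper applied to the pooled data and to each life-support group. The sketch below is illustrative only: the function and variable names are invented for the example, and every count is entered as a (died, survived) pair, so the columns of the life-support table (listed as Survived, Died) are swapped on entry.

```python
# (died, survived) counts for each hospital within each life-support group.
strata = {
    "No life support":        {"Hospital A": (2, 18),  "Hospital B": (16, 64)},
    "Immediate life support": {"Hospital A": (48, 32), "Hospital B": (16, 4)},
}


def survival_rate(died, survived):
    """Conditional proportion of patients who survived."""
    return survived / (died + survived)


def pooled(hospital):
    """Add a hospital's counts across both life-support groups."""
    died = sum(group[hospital][0] for group in strata.values())
    survived = sum(group[hospital][1] for group in strata.values())
    return died, survived


# Difference in survival rates (A minus B), ignoring life-support status.
overall_diff = (survival_rate(*pooled("Hospital A"))
                - survival_rate(*pooled("Hospital B")))

# The same difference computed separately within each life-support group.
stratum_diffs = {
    name: survival_rate(*group["Hospital A"]) - survival_rate(*group["Hospital B"])
    for name, group in strata.items()
}

print(f"All patients: {overall_diff:+.2f}")
for name, diff in stratum_diffs.items():
    print(f"{name}: {diff:+.2f}")
```

Comparing the sign of the pooled difference with the signs of the within-group differences is the key step in answering parts a through c.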

2.1 Confounding Variables

Recall from previous lectures that a confounding variable is defined as a characteristic of the observational units that influences both the explanatory and response variables of a study but may be left unmonitored by a researcher. In the space below, explain how the variable Life Support Status confounds the observed association between Hospital and Outcome.

All Patients
              Died   Survived   Total
Hospital A     50       50       100
Hospital B     32       68       100
Total          82      118       200

No Life Support
              Died   Survived   Total
Hospital A      2       18        20
Hospital B     16       64        80
Total          18       82       100

Immediate Life Support
              Died   Survived   Total
Hospital A     48       32        80
Hospital B     16        4        20
Total          64       36       100
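One way to check the two links in the definition of a confounder is to ask how Life Support Status relates to Hospital and, separately, how it relates to Outcome. The Python sketch below does this with the counts from the tables above; the names are illustrative, and this is one informal check rather than a formal test of confounding.

```python
# (died, survived) counts, keyed by life-support status and then by hospital.
counts = {
    "No life support":        {"Hospital A": (2, 18),  "Hospital B": (16, 64)},
    "Immediate life support": {"Hospital A": (48, 32), "Hospital B": (16, 4)},
}
hospitals = ("Hospital A", "Hospital B")

# Link 1: confounder vs. explanatory variable.
# Share of each hospital's patients who needed immediate life support.
life_support_share = {}
for h in hospitals:
    on_support = sum(counts["Immediate life support"][h])
    total = sum(sum(counts[status][h]) for status in counts)
    life_support_share[h] = on_support / total

# Link 2: confounder vs. response variable.
# Survival rate within each life-support group, pooling both hospitals.
survival_by_status = {}
for status, group in counts.items():
    died = sum(group[h][0] for h in hospitals)
    survived = sum(group[h][1] for h in hospitals)
    survival_by_status[status] = survived / (died + survived)

print("Share on immediate life support:", life_support_share)
print("Survival rate by status:", survival_by_status)
```

If the shares differ sharply between hospitals and the survival rates differ sharply between life-support groups, then Life Support Status is tied to both variables, which is what allows it to distort the pooled comparison.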

2.2 Simpson's Paradox

There is a special instance of confounding called Simpson's Paradox. In this special case, the direction of the observed association between two variables is reversed when a confounding variable is identified and taken into account.

Ex. In this case, Hospital B has the higher overall survival rate (68% vs. 50%) and appears to be the safer hospital across all patients. When we take into account a patient's condition, Hospital A has the better survival rate in each group of patients (90% vs. 80% among patients not placed on life support, and 40% vs. 20% among patients placed on immediate life support).

2.3 How much do Minority Lives Matter? [1]

In 2012, Trayvon Martin was shot to death by George Zimmerman in Sanford, Florida. When Zimmerman stood trial, he invoked Florida's Stand Your Ground law in his defense, which allows someone to use lethal force to defend themselves or another person against an intruder in their own home. After he was acquitted, the Tampa Bay Times created a webpage on which they presented data on $n = 220$ cases in Florida in which the Stand Your Ground law was used by the accused person as part of their defense strategy. A researcher explored these data and recorded three variables: the race of the person who invoked the Stand Your Ground law as a defense when accused of assault/murder; the race of the victim in the assault/murder trial; and the outcome of the trial.

[1] This example is amended from Witmer, J. (2015). How much do minority lives matter? Journal of Statistics Education, 23(2).

Try It! Simpson's Paradox in criminal court cases

The table below represents $n = 220$ criminal cases where the Florida Stand Your Ground law was invoked and records two variables: the explanatory variable is the race of the defendant who invoked the law, and the response variable is the outcome of the trial.

220 Florida Trials
                          Defendant convicted   Defendant found innocent   Total
Defendant was White               45                      86                131
Defendant was non-White           29                      60                 89
Total                             74                     146                220

a. Based on the data above, does a defendant's racial identity as either White or non-White appear to be associated with their chance of being found innocent? Use numerical support.

The two tables below represent the same $n = 220$ assault/murder trials, separated by whether the victim identified as White or non-White.

132 Florida Trials where victim was White
                          Defendant convicted   Defendant found innocent   Total
Defendant was White               40                      67                107
Defendant was non-White           10                      15                 25
Total                             50                      82                132

88 Florida Trials where victim was non-White
                          Defendant convicted   Defendant found innocent   Total
Defendant was White                5                      19                 24
Defendant was non-White           19                      45                 64
Total                             24                      64                 88

b. Do these additional results suggest that a defendant's racial identity as either White or non-White appears to be associated with their chance of being found innocent?
c. Do these data represent an example of Simpson's Paradox? Explain why or why not using your results from exercises (a) and (b) above.
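A reversal of the kind described in Section 2.2 can be checked mechanically by comparing the sign of the pooled difference in conditional proportions with the sign of the difference inside every subgroup. The Python sketch below applies that check to the Stand Your Ground data. The names are illustrative, only the cell counts are entered (row and column totals are recomputed rather than read from the tables), and the sign test is an informal screen, not a formal inference procedure.

```python
# (convicted, found innocent) counts by defendant's race, split by victim's race.
by_victim = {
    "Victim White":     {"White": (40, 67), "non-White": (10, 15)},
    "Victim non-White": {"White": (5, 19),  "non-White": (19, 45)},
}


def innocent_rate(convicted, innocent):
    """Conditional proportion of defendants found innocent."""
    return innocent / (convicted + innocent)


def rate_difference(group):
    """Proportion found innocent: White defendants minus non-White defendants."""
    return innocent_rate(*group["White"]) - innocent_rate(*group["non-White"])


# Pool the two victim groups to recover the table of all 220 trials.
pooled = {
    race: tuple(sum(by_victim[v][race][i] for v in by_victim) for i in range(2))
    for race in ("White", "non-White")
}

overall_diff = rate_difference(pooled)
stratum_diffs = {name: rate_difference(group) for name, group in by_victim.items()}

# A reversal occurs when every subgroup difference has the opposite sign
# from the pooled difference.
reversal = all(diff * overall_diff < 0 for diff in stratum_diffs.values())

print(f"All trials: {overall_diff:+.3f}")
for name, diff in stratum_diffs.items():
    print(f"{name}: {diff:+.3f}")
print("Direction reverses within every victim group:", reversal)
```

The same three functions work for the hospital example by swapping in its counts, which makes it easy to compare the size of the two reversals.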
