Data

Data Source

In this project, we use data from the Youth Risk Behavior Surveillance System (YRBSS), collected by the Centers for Disease Control and Prevention (CDC). YRBSS was established in 1990 to monitor health behaviors among U.S. adolescents and adults that are leading causes of death, disability, and social problems. Information about YRBSS can be found at Youth Risk Behavior Surveillance System (YRBSS) Overview. Among the six categories of health-related behaviors that YRBSS monitors, we focus mainly on drug use among youth. The YRBSS data are collected every two years, and the most recent data were collected in 2019. We use the 2019 state-level data.

The dataset is too large to upload to GitHub. To download it, use the code in download_dataset.Rmd.

library(tidyverse)

# Load the two halves of the 2019 state-level data and stack them into one tibble
yrbss_state_a <- readRDS("data/2019 states a-m.rds")
yrbss_state_z <- readRDS("data/2019 states n-z.rds")
yrbss_state <- as_tibble(rbind(yrbss_state_a, yrbss_state_z))

# Read the lite CSV version of the state data
yrbss_state_tidy <- read_csv("./data/yrbss_state_lite.csv")

Data Cleaning

The variable we care most about is overall drug use status. Because the YRBSS questionnaire has no question on overall drug use, we derived a variable, drug_use_freq, by aggregating the answers for specific drug types. The survey asks about the lifetime frequency of use for eight types of drugs: marijuana, cocaine, heroin, methamphetamine, inhalants, ecstasy, steroids, and injected drugs. Each frequency question has six answer options: 0 times, 1 or 2 times, 3 to 9 times, 10 to 19 times, 20 to 39 times, and 40 or more times (the marijuana question has one additional, higher option, which is also scored as 40). We assigned each answer a numeric value equal to the midpoint of its range. For example, the answer 1 or 2 times is scored as 1.5, and the open-ended answer 40 or more times is scored as 40. The injection question has only three options, which the code scores as 0, 20, and 40. We then summed the values across all drug types for each observation and classified drug use status by the sum: a sum of 0 is No Drug Use; a sum greater than 0 but less than 40 is Light Dose; a sum of 40 or more is Heavy Dose. We omitted observations with a missing value for any drug type, since we have no information with which to impute it.

drug_use_tidy <- function(input_list){
  output <- case_when(input_list == 1 ~ 0,
                      input_list == 2 ~ 1.5,
                      input_list == 3 ~ 6,
                      input_list == 4 ~ 14.5,
                      input_list == 5 ~ 29.5,
                      input_list > 5 ~ 40)
  return(output)
}

drug_use_frequency <- function(input_list){
  output <- case_when(input_list == 0 ~ "No Drug Use",
                      input_list > 0 & input_list < 40 ~ "Light Dose",
                      input_list >= 40 ~ "Heavy Dose")
  factor(output, levels = c("No Drug Use", "Light Dose", "Heavy Dose"))
}


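# Note: age_tidy() is defined in the Data Analysis section and must be run first.
yrbss_state_tidy <- 
  yrbss_state %>% 
  # keep only respondents who answered every drug question
  drop_na(q45, q46, q50, q51, q52, q53, q54, q55, q56) %>% 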
  mutate(
    age_initial_alcohol = age_tidy(q40),
    age_initial_smoking = age_tidy(q31),
    age_marijuana = age_tidy(q46),
    marijuana = drug_use_tidy(q45),
    cocaine = drug_use_tidy(q50),
    inhale = drug_use_tidy(q51),
    heroin = drug_use_tidy(q52),
    methamphetamine = drug_use_tidy(q53),
    ecstasy = drug_use_tidy(q54),
    steroid = drug_use_tidy(q55),
    illegal_injection = case_when(q56 == 1 ~ 0,
                                  q56 == 2 ~ 20,
                                  q56 == 3 ~ 40),
    drug_use_sum = rowSums(across(marijuana:illegal_injection)),
    drug_use_freq = drug_use_frequency(drug_use_sum))
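
To make the scoring concrete, the following minimal sketch applies the same rules to three made-up respondents and two of the eight drug questions. The names score_answer, classify_sum, toy, and toy_scored are hypothetical and exist only for this illustration; the answer codes are invented, not real YRBSS records.

```r
library(dplyr)

# Same scoring rules as drug_use_tidy() and drug_use_frequency() above,
# repeated under hypothetical names so the example runs on its own.
score_answer <- function(x) {
  case_when(x == 1 ~ 0,
            x == 2 ~ 1.5,
            x == 3 ~ 6,
            x == 4 ~ 14.5,
            x == 5 ~ 29.5,
            x > 5 ~ 40)
}

classify_sum <- function(s) {
  out <- case_when(s == 0 ~ "No Drug Use",
                   s > 0 & s < 40 ~ "Light Dose",
                   s >= 40 ~ "Heavy Dose")
  factor(out, levels = c("No Drug Use", "Light Dose", "Heavy Dose"))
}

# Invented answer codes for three respondents (assumption, not survey data)
toy <- tibble(
  q45 = c(1, 3, 5),  # marijuana
  q50 = c(1, 2, 4)   # cocaine
)

toy_scored <- toy %>%
  mutate(
    marijuana = score_answer(q45),
    cocaine = score_answer(q50),
    drug_use_sum = marijuana + cocaine,
    drug_use_freq = classify_sum(drug_use_sum)
  )

print(toy_scored)
```

In this toy data the third respondent is classified as Heavy Dose even though neither answer is in the top category, because the scores are summed across drug types before the cut-off of 40 is applied.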

Data Exploration

The State-wise Drug Use Analysis

First, we are interested in drug use in each state, so we grouped the cleaned data by state and explored five topics. Each topic is examined across all states.

Marijuana is the most abused drug, so we are interested in the proportion of youth using marijuana, as well as the proportion using any other drug, across the states. The plot is in the dashboard. According to the plot, Arizona has the largest proportion of marijuana use among youth, at 52%. Arizona also has the highest proportion of other drug use, at 32%. The plot also suggests that states with a high proportion of marijuana use are those where marijuana use is legal, whereas states where it is illegal, such as Utah, Iowa, and Nebraska, have low rates of marijuana use.
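
A state-level proportion of this kind could be computed roughly as follows. This is a sketch on invented rows, not the dashboard code: it assumes that a respondent counts as a user when the scored value is greater than 0, and that other_drug_sum holds the summed score over the non-marijuana drugs. The names toy and state_prop are hypothetical.

```r
library(dplyr)

# Invented rows in the shape of the cleaned data (assumption, not survey data)
toy <- tibble(
  sitecode = c("AZ", "AZ", "AZ", "AZ", "UT", "UT", "UT", "UT"),
  marijuana = c(0, 1.5, 6, 0, 0, 0, 1.5, 0),
  other_drug_sum = c(0, 0, 1.5, 0, 0, 0, 0, 0)
)

# Share of respondents in each state with any lifetime use
state_prop <- toy %>%
  group_by(sitecode) %>%
  summarise(
    marijuana_prop = mean(marijuana > 0),
    other_drug_prop = mean(other_drug_sum > 0)
  ) %>%
  arrange(desc(marijuana_prop))

print(state_prop)
```

Taking the mean of a logical vector gives the share of TRUE values, so no separate count and division step is needed.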

The Average Starting Age of Marijuana

We are also interested in the average age at which youth first use marijuana, and in whether the mean starting age differs by state. According to the plot in the dashboard, the mean starting age is relatively similar across states, but Arizona has the youngest. This suggests that Arizona has a severe marijuana problem not only in the proportion of users but also in how early use begins.
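
The per-state mean could be obtained along these lines. This is a sketch on invented rows, assuming age_marijuana is NA for respondents who never used marijuana (as age_tidy() produces for answer code 1), so those rows are excluded from the mean. The names toy and state_age are hypothetical.

```r
library(dplyr)

# Invented rows (assumption, not survey data); NA means never used marijuana
toy <- tibble(
  sitecode = c("AZ", "AZ", "AZ", "UT", "UT", "UT"),
  age_marijuana = c(13.5, 11.5, NA, 15.5, 13.5, NA)
)

# Mean starting age among users only, youngest state first
state_age <- toy %>%
  group_by(sitecode) %>%
  summarise(
    mean_start_age = mean(age_marijuana, na.rm = TRUE),
    n_users = sum(!is.na(age_marijuana))
  ) %>%
  arrange(mean_start_age)

print(state_age)
```

Reporting n_users next to the mean helps show how many respondents each state's average rests on.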

Data Analysis

The data analysis explores the association between drug use status and other health-related behaviors. The original questionnaire records many answers as frequency ranges, but we are concerned with whether the student experienced the behavior at all, so we recoded some variables as binary (Yes/No), such as drunk driving, carrying a weapon, suicide attempt, and physical fight. Screen time is the sum of the time spent using a computer and watching TV, and we classify heavy screen use based on this total. Some variables are more reasonably reclassified into more than two categories, such as smoking, drinking, seat belt use, and GPA. All of these variables are reclassified and converted to factors. For the questions about age at first drink and first cigarette, we assigned the midpoint of each age range as a numeric value for convenience. The demographic and age variables (initial smoking age, initial drinking age, grade, gender, and race) are converted to readable text and factors. Observations with a missing value in any of these variables are removed. The resulting data set has 91854 observations and 20 variables.

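# Convert an age-range answer code to an approximate numeric age (NA = never)
age_tidy <- function(input_list) {
  output <- case_when(input_list == 1 ~ NA_real_,
                      input_list == 2 ~ 8,
                      input_list == 3 ~ 9.5,
                      input_list == 4 ~ 11.5,
                      input_list == 5 ~ 13.5,
                      input_list == 6 ~ 15.5,
                      input_list == 7 ~ 17.5)
  return(output)
}

# Convert an age-range answer code to a readable label
firstage_tidy <- function(input_list){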
  output <- case_when(input_list == 1 ~ "never",
                      input_list == 2 ~ "8 years old or younger",
                      input_list == 3 ~ "9 or 10 years old",
                      input_list == 4 ~ "11 or 12 years old",
                      input_list == 5 ~ "13 or 14 years old",
                      input_list == 6 ~ "15 or 16 years old",
                      input_list == 7 ~ "17 years old or older")
  return(output)
}

yrbss_state_tidy_additional<-
  yrbss_state_tidy %>% 
  drop_na(q31, q40, q9, q10, q14, q17, q26, q39,q59, q79, q80, q88, q33, q42, q89, q8, q11, grade, race4, sex) %>% 
  mutate(
    grade = fct_recode(factor(grade, ordered = TRUE), 
                       "9th Grade" = "1",
                       "10th Grade" = "2",
                       "11th Grade" = "3",
                       "12th Grade" = "4"),
    sex = fct_recode(factor(sex, ordered = TRUE), 
                       "female" = "1",
                       "male" = "2"),
    race4 = fct_recode(factor(race4), 
                       "White" = "1",
                       "Black or African American" = "2",
                       "Hispanic/Latino" = "3",
                       "All other Races" = "4"),
    age_initial_alcohol = age_tidy(q40),
    age_initial_smoking = age_tidy(q31),
    drunk_driving = ifelse(q9 > 1 | q10 > 2, "Yes", "No"),
    drunk_driving = factor(drunk_driving, levels = c("Yes", "No")),
    # Assumes q11 is coded like q10 (1 = did not drive, 2 = 0 days),
    # so only codes above 2 count as texting while driving
    text_driving = ifelse(q11 > 2, "Yes", "No"),
    text_driving = factor(text_driving, levels = c("Yes", "No")),
    carrying_weapon = ifelse(q14 == 1, "No", "Yes"),
    carrying_weapon = factor(carrying_weapon, levels = c("Yes", "No")),
    suicide_attempt = ifelse(q26 == 1, "Yes", "No"),
    suicide_attempt = factor(suicide_attempt, levels = c("Yes", "No")),
    quit_smoke = case_when(q39 == 1 ~ "Never Smoke",
                           q39 == 2 ~ "Yes",
                           q39 == 3 ~ "No"),
    quit_smoke = factor(quit_smoke, levels = c("Never Smoke", "Yes", "No")),
    physical_fight = ifelse(q17 == 1, "No", "Yes"),
    physical_fight = factor(physical_fight, levels = c("Yes", "No")),
    early_sex = ifelse(q59 > 1 & q59 < 5 , "Yes", "No"),
    early_sex = factor(early_sex, levels = c("Yes", "No")),
    q79 = as.numeric(q79),
    q80 = as.numeric(q80),
    # Convert answer codes to approximate hours per school day
    tv_use = case_when(q79 == 1 ~ 0, 
                       q79 == 2 ~ 0.5,
                       q79 > 2 ~ q79 - 2),
    computer_use = case_when(q80 == 1 ~ 0, 
                             q80 == 2 ~ 0.5,
                             q80 > 2 ~ q80 - 2),
    screening_use = tv_use + computer_use,
    sleeping_time = as.numeric(q88) + 3,
    smoking_status = case_when(q33 == 1 ~ "Never Smoker",
                               q33 >= 2 & q33 <= 5 ~ "Light Smoker",
                               q33 >= 6 ~ "Heavy Smoker"),
    smoking_status = factor(smoking_status, levels = c("Never Smoker", "Light Smoker", "Heavy Smoker")),
    binge_drinking = case_when(q42 == 1 ~ "No Binge Drinking",
                               q42 >= 2 & q42 <= 4 ~ "Light Binge Drinking",
                               q42 >= 5 ~ "Heavy Binge Drinking"),
    binge_drinking = factor(binge_drinking, levels = c("No Binge Drinking", "Light Binge Drinking", "Heavy Binge Drinking")),
    seat_belt = case_when(q8 <= 2 ~ "Never or Rarely",
                          q8 == 3 ~ "Sometimes",
                          q8 >= 4 ~ "Most of the time or Always"),
    seat_belt = factor(seat_belt, levels = c("Never or Rarely", "Sometimes", "Most of the time or Always")),
    grades_school = case_when(q89 == 1 ~ "Mostly A's",
                              q89 == 2 ~ "Mostly B's",
                              q89 == 3 ~ "Mostly C's",
                              q89 %in% c(4, 5) ~ "Mostly Below C's",
                              q89 == 6 ~ "None of these grades",
                              q89 == 7 ~ "Not sure"),
    grades_school = factor(grades_school, levels = c("Mostly A's", "Mostly B's", "Mostly C's", "Mostly Below C's", "None of these grades", "Not sure")),
    smoke_age = firstage_tidy(q31),
    alcohol_age = firstage_tidy(q40)
    ) %>% 
  select(drug_use_freq, grade, sex, bmi, race4, drunk_driving, text_driving, carrying_weapon, suicide_attempt, quit_smoke, physical_fight, early_sex, screening_use, sleeping_time, smoking_status, binge_drinking, seat_belt, grades_school, smoke_age, alcohol_age)

Variable Description

For the yrbss_state_tidy_additional dataset:

  • drug_use_freq : frequency of drug use. Classified as “No Drug Use”, “Light Dose”, and “Heavy Dose”.
  • grade : classified as “9th Grade”, “10th Grade”, “11th Grade”, and “12th Grade”.
  • sex : classified as “female” or “male”.
  • bmi : body mass index. Extracted from the original dataset.
  • race4 : race of subjects. Classified as “White”, “Black or African American”, “Hispanic/Latino”, and “All other Races”.
  • drunk_driving : classified as “Yes” or “No”. “Yes” if, at least once in the past 30 days, the student rode in a vehicle driven by someone who had been drinking alcohol or drove a vehicle after drinking. Otherwise, “No”.
  • text_driving : classified as “Yes” or “No”. “Yes” if the student texted or e-mailed while driving on at least 1 day in the past 30 days. Otherwise, “No”.
  • carrying_weapon : classified as “Yes” or “No”. “Yes” if the student carried a gun on at least 1 day in the past 12 months. Otherwise, “No”.
  • suicide_attempt : classified as “Yes” or “No”. “Yes” if the student seriously considered attempting suicide during the past 12 months. Otherwise, “No”.
  • quit_smoke : whether the student tried to quit using all tobacco products during the past 12 months. Classified as “Never Smoke”, “Yes”, and “No”.
  • physical_fight : classified as “Yes” or “No”. “Yes” if the student was in a physical fight at least 1 time during the past 12 months. Otherwise, “No”.
  • early_sex : classified as “Yes” or “No”. “Yes” if the student had sexual intercourse for the first time before the age of 14. Otherwise, “No”.
  • screening_use : hours of watching TV and using a computer on an average school day.
  • sleeping_time : hours of sleep on an average school night.
  • smoking_status : classified as “Never Smoker”, “Light Smoker”, and “Heavy Smoker”. Based on smoking during the past 30 days: students who did not smoke are classified as “Never Smoker”, those who smoked up to 10 cigarettes per day as “Light Smoker”, and those who smoked more than 10 cigarettes per day as “Heavy Smoker”.
  • binge_drinking : classified as “No Binge Drinking”, “Light Binge Drinking”, and “Heavy Binge Drinking”. Based on question 42 in the original dataset: students with no binge drinking during the past 30 days are classified as “No Binge Drinking”, those with 1 to 5 days of binge drinking as “Light Binge Drinking”, and those with more than 5 days as “Heavy Binge Drinking”.
  • seat_belt : classified as “Never or Rarely”, “Sometimes”, and “Most of the time or Always”.
  • grades_school : classified as “Mostly A’s”, “Mostly B’s”, “Mostly C’s”, “Mostly Below C’s”, “None of these grades”, and “Not sure”.
  • smoke_age : the age of starting to smoke. Classified as “never”, “8 years old or younger”, “9 or 10 years old”, “11 or 12 years old”, “13 or 14 years old”, “15 or 16 years old”, and “17 years old or older”.
  • alcohol_age : the age of starting to drink alcohol. Classified as “never”, “8 years old or younger”, “9 or 10 years old”, “11 or 12 years old”, “13 or 14 years old”, “15 or 16 years old”, and “17 years old or older”.

For the state-wise tidy dataset:

  • sitecode : the abbreviation of each state.
  • sex : the sex of each subject, 1 = “female”, 2 = “male”.
  • race4 : race of subjects. Classified as “White”, “Black or African American”, “Hispanic/Latino”, and “All other Races”.
  • age : age of subjects.
  • age_marijuana : the starting age to use marijuana.
  • marijuana : the scored number of times the subject used marijuana in their lifetime.
  • cocaine : the scored number of times the subject used cocaine in their lifetime.
  • inhale : the scored number of times the subject sniffed glue, breathed the contents of aerosol spray cans, or inhaled any paints or sprays in their lifetime.
  • heroin : the scored number of times the subject used heroin in their lifetime.
  • methamphetamine : the scored number of times the subject used methamphetamine in their lifetime.
  • ecstasy : the scored number of times the subject used ecstasy in their lifetime.
  • steroid : the scored number of times the subject used steroids in their lifetime.
  • illegal_injection : the scored number of times the subject injected an illegal drug in their lifetime.
  • drug_use_sum : the sum of the scored values across all eight drug types (marijuana through illegal_injection).
  • other_drug_sum :
  • marijuana_use_freq : the frequency of using marijuana. (No Drug Use = the subject never used marijuana; Light Dose = the subject used marijuana fewer than 40 times during life; Heavy Dose = the subject used marijuana 40 or more times during life)
  • cocaine_use_freq : the frequency of using cocaine. (No Drug Use = the subject never used cocaine; Light Dose = the subject used cocaine fewer than 40 times during life; Heavy Dose = the subject used cocaine 40 or more times during life)
  • inhale_use_freq : the frequency of using inhalants. (No Drug Use = the subject never used inhalants; Light Dose = the subject used inhalants fewer than 40 times during life; Heavy Dose = the subject used inhalants 40 or more times during life)
  • heroin_use_freq : the frequency of using heroin. (No Drug Use = the subject never used heroin; Light Dose = the subject used heroin fewer than 40 times during life; Heavy Dose = the subject used heroin 40 or more times during life)
  • methamphetamine_use_freq : the frequency of using methamphetamine. (No Drug Use = the subject never used methamphetamine; Light Dose = the subject used methamphetamine fewer than 40 times during life; Heavy Dose = the subject used methamphetamine 40 or more times during life)
  • ecstasy_use_freq : the frequency of using ecstasy. (No Drug Use = the subject never used ecstasy; Light Dose = the subject used ecstasy fewer than 40 times during life; Heavy Dose = the subject used ecstasy 40 or more times during life)
  • steroid_use_freq : the frequency of using steroids. (No Drug Use = the subject never used steroids; Light Dose = the subject used steroids fewer than 40 times during life; Heavy Dose = the subject used steroids 40 or more times during life)
  • illegal_injection_use_freq : the frequency of illegal injection. (No Drug Use = the subject never had an illegal injection; Light Dose = the subject's scored injection value is greater than 0 but less than 40; Heavy Dose = the scored value is 40 or more)
  • drug_use_freq : the frequency of using all illegal drugs as a whole. (No Drug Use = the subject never used any illegal drug; Light Dose = drug_use_sum is greater than 0 but less than 40; Heavy Dose = drug_use_sum is 40 or more)