Technologies jobs are looking for: AWS / Azure / GCE; Python DS libraries (numpy, matplotlib, etc.), which raises the question of why not just use R; building data pipelines; programming ETL tools (extract, transform, load).

Load the tidyverse library.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.3

Load the dataset.

costds = read_csv("cost_of_living_us.csv")
## Rows: 31430 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): state, areaname, county, family_member_count
## dbl (10): case_id, housing_cost, food_cost, transportation_cost, healthcare_...
## lgl  (1): isMetro
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#https://www.kaggle.com/datasets/asaniczka/us-cost-of-living-dataset-3171-counties

From the Kaggle page:

About Dataset: The US Family Budget Dataset provides insights into the cost of living in different US counties, based on the Family Budget Calculator by the Economic Policy Institute (EPI).

This dataset offers community-specific estimates for ten family types, including one or two adults with zero to four children, in all 1877 counties and metro areas across the United States.

Explore dataset.

head(costds)
## # A tibble: 6 × 15
##   case_id state isMetro areaname         county family_member_count housing_cost
##     <dbl> <chr> <lgl>   <chr>            <chr>  <chr>                      <dbl>
## 1       1 AL    TRUE    Montgomery, AL … Autau… 1p0c                       8506.
## 2       1 AL    TRUE    Montgomery, AL … Autau… 1p1c                      12068.
## 3       1 AL    TRUE    Montgomery, AL … Autau… 1p2c                      12068.
## 4       1 AL    TRUE    Montgomery, AL … Autau… 1p3c                      15257.
## 5       1 AL    TRUE    Montgomery, AL … Autau… 1p4c                      15257.
## 6       1 AL    TRUE    Montgomery, AL … Autau… 2p0c                      10180.
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
tail(costds)
## # A tibble: 6 × 15
##   case_id state isMetro areaname         county family_member_count housing_cost
##     <dbl> <chr> <lgl>   <chr>            <chr>  <chr>                      <dbl>
## 1    3171 WY    FALSE   Weston County, … Westo… 1p4c                       13632
## 2    3171 WY    FALSE   Weston County, … Westo… 2p0c                        8316
## 3    3171 WY    FALSE   Weston County, … Westo… 2p1c                       10956
## 4    3171 WY    FALSE   Weston County, … Westo… 2p2c                       10956
## 5    3171 WY    FALSE   Weston County, … Westo… 2p3c                       13632
## 6    3171 WY    FALSE   Weston County, … Westo… 2p4c                       13632
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
length(unique(costds$case_id)) #number of unique cases / counties observed in dataset.
## [1] 3143

Restructure some data fields: factorize county, family_member_count, state

costds = costds %>% mutate_at(vars(county,family_member_count,state),as.factor)
typeof(costds$county)
## [1] "integer"
typeof(costds$family_member_count)
## [1] "integer"
head(costds)
## # A tibble: 6 × 15
##   case_id state isMetro areaname         county family_member_count housing_cost
##     <dbl> <fct> <lgl>   <chr>            <fct>  <fct>                      <dbl>
## 1       1 AL    TRUE    Montgomery, AL … Autau… 1p0c                       8506.
## 2       1 AL    TRUE    Montgomery, AL … Autau… 1p1c                      12068.
## 3       1 AL    TRUE    Montgomery, AL … Autau… 1p2c                      12068.
## 4       1 AL    TRUE    Montgomery, AL … Autau… 1p3c                      15257.
## 5       1 AL    TRUE    Montgomery, AL … Autau… 1p4c                      15257.
## 6       1 AL    TRUE    Montgomery, AL … Autau… 2p0c                      10180.
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
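Aside: `typeof()` reports the low-level storage mode, and factors are stored as integer codes, which is why the calls above return "integer". `class()` is the more informative check. A quick base-R illustration:

```r
f <- factor(c("1p0c", "1p1c", "1p0c"))
typeof(f)  # "integer": factors are stored as integer codes
class(f)   # "factor": the class is what mutate/ggplot care about
levels(f)  # "1p0c" "1p1c"
```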

Questions about the data:

1. To what extent does family_member_count impact housing_cost?
2. What is the relationship between family_member_count and food_cost?
3. How does total_cost compare between households with $0 childcare_cost and >$0 childcare_cost?
4. What is the difference in total_cost with respect to isMetro?
5. What is the difference in transportation_cost with respect to isMetro?

First, I always like visualizing the distribution of the data over a variable of interest. This is a histogram of the total cost of living over the dataset.

Histogram of the data: (with beautifications)

costds %>%
  ggplot(aes(x=total_cost))+
  geom_histogram(fill="purple")+
  labs(title="Distribution of cost of living in Dataset", subtitle="Cost of living across all regions.")+
  xlab("Cost of living ($)")+
  ylab("Count")+
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
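The `stat_bin()` message is worth heeding: the default of 30 bins is arbitrary, and passing `binwidth = 5000` (or similar) to `geom_histogram()` makes the bin size meaningful in dollars. A base-R sketch of how bin granularity is just a choice of breaks (toy numbers, not the dataset):

```r
x <- c(30, 35, 36, 40, 55, 70, 71, 85, 90, 120)  # toy costs in $1000s
h_coarse <- hist(x, breaks = 3,  plot = FALSE)   # `breaks` is a suggestion passed to pretty()
h_fine   <- hist(x, breaks = 10, plot = FALSE)

# every observation lands in exactly one bin either way
sum(h_coarse$counts)  # 10
sum(h_fine$counts)    # 10
```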

Answering question 1:

First, make the first plot that comes to mind.

costds %>% 
  ggplot(aes(x=family_member_count,y=housing_cost)) +
  geom_point()

This looks like each factor on family_member_count can become its own boxplot. Let’s group them as such.

costds %>% 
  ggplot(aes(x=family_member_count,y=housing_cost)) +
  geom_boxplot()

Outliers seem to be squishing our boxes. To be precise, ggplot flags points beyond 1.5 × IQR from the quartiles as outliers (it is not a standard-deviation cutoff); rather than changing that rule, let's limit the y-axis to the middle 80% of the data (the 10th to 90th percentiles).

boxplots = costds %>%
  ggplot(aes(x=family_member_count,y=total_cost/1000,fill=family_member_count)) +
  geom_boxplot() +
  scale_y_continuous(limits = quantile(costds$total_cost/1000, c(0.1,0.9)))
boxplots
## Warning: Removed 6286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
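For reference, the boxplot outlier rule (1.5 × IQR beyond the quartiles) can be computed by hand in base R (toy data, not the EPI dataset):

```r
x <- c(1, 2, 3, 4, 5, 100)               # one extreme value
q <- quantile(x, c(0.25, 0.75))          # Q1 and Q3
iqr <- unname(q[2] - q[1])
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr
outliers <- x[x < lower | x > upper]
outliers  # 100
```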

Make the boxplots pretty. (Note the switch from housing_cost to total_cost, which turns out to give a more interesting picture.)

boxplots +
  theme_bw() +
  labs(title="Cost of Living Boxplot Analysis",
       subtitle="What do families expect to pay to live based on size?")+
  xlab("Family group (p=parent, c=child)")+
  ylab("Cost of Living (x$1000)")+
  theme(legend.position="none")
## Warning: Removed 6286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Numerical summaries.

five_num_sum = costds %>%
  group_by(family_member_count) %>%
  summarize(count = n(),
            min=min(total_cost),
            Q1=quantile(total_cost,prob=0.25,na.rm=TRUE),
            med=quantile(total_cost,prob=0.50,na.rm=TRUE),
            avg=mean(total_cost),
            Q3=quantile(total_cost,prob=0.75,na.rm=TRUE),
            max=max(total_cost))
five_num_sum
## # A tibble: 10 × 8
##    family_member_count count    min     Q1    med    avg      Q3     max
##    <fct>               <int>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
##  1 1p0c                 3143 30088. 34139. 35625. 36627.  37720.  84336.
##  2 1p1c                 3143 41888. 48660. 51454. 53540.  55791. 139711.
##  3 1p2c                 3143 49612. 59670. 64242. 66631.  70893. 159283.
##  4 1p3c                 3143 59166. 72138. 78485. 81493.  86840. 200590.
##  5 1p4c                 3143 62831. 77259. 83996. 87137.  92887. 212576.
##  6 2p0c                 3143 40894. 46396. 48455. 49653.  51536. 109036.
##  7 2p1c                 3143 53267. 60889. 64331. 66308.  69579. 153675.
##  8 2p2c                 3143 59685. 71269. 76215. 78348.  82934. 172888.
##  9 2p3c                 3143 68384. 82456. 88673. 91691.  97004. 214410.
## 10 2p4c                 3143 72425. 87719. 94428. 97587. 103313. 223718.
dsmed=quantile(costds$total_cost,prob=0.50,na.rm=TRUE)

cdsskim = skim(costds)
cdsskim
Data summary
Name costds
Number of rows 31430
Number of columns 15
_______________________
Column type frequency:
character 1
factor 3
logical 1
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
areaname 0 1 12 183 0 2561 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 51 TX: 2540, GA: 1590, VA: 1330, KY: 1200
county 0 1 FALSE 1877 Was: 300, Jef: 250, Fra: 240, Jac: 230
family_member_count 0 1 FALSE 10 1p0: 3143, 1p1: 3143, 1p2: 3143, 1p3: 3143

Variable type: logical

skim_variable n_missing complete_rate mean count
isMetro 0 1 0.37 FAL: 19730, TRU: 11700

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
case_id 0 1 1589.31 917.22 1.00 792.00 1593.00 2386.00 3171.00 ▇▇▇▇▇
housing_cost 0 1 11073.67 4165.61 4209.31 8580.00 10416.00 12444.00 61735.59 ▇▁▁▁▁
food_cost 0 1 8287.50 3271.14 2220.28 5801.42 8129.16 10703.62 31178.62 ▇▇▁▁▁
transportation_cost 0 1 13593.86 1640.46 2216.46 12535.16 13698.16 14765.76 19816.48 ▁▁▃▇▁
healthcare_cost 0 1 13394.03 5204.55 3476.38 9667.44 13082.70 16657.82 37252.27 ▅▇▃▁▁
other_necessities_cost 0 1 7015.32 2397.42 2611.64 5286.35 6733.06 8413.09 28829.44 ▇▅▁▁▁
childcare_cost 0 1 9879.58 6778.22 0.00 5341.62 10166.34 14276.38 48831.09 ▇▇▁▁▁
taxes 0 1 7657.71 3339.80 1027.80 5597.97 6898.47 8790.21 47753.39 ▇▁▁▁▁
total_cost 0 1 70901.68 21846.55 30087.66 53776.02 70977.68 85371.34 223717.55 ▇▇▁▁▁
median_family_income 10 1 68316.00 16886.97 25529.98 57223.99 65955.61 76136.07 177662.47 ▂▇▁▁▁

(10/17/2023) The only thing wrong that I can see with this 5-num summary is that I should be seeing different Q1 and Q3s for each group, but those numbers may not be necessary for now. If they’re needed, there are good estimates provided by the graph. I’ll have to look into this.

Update: 2/20/2024 I’m seeing that I used Q1=quantile(costds$total_cost,prob=0.25,na.rm=TRUE) instead of Q1=quantile(total_cost,prob=0.25,na.rm=TRUE) which resulted in the loss of progress yielded by the piping I did over the previous lines of the command. Interesting error to make and overlook, but hey, we learn not to do that anymore!
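That bug is easy to reproduce on toy data: referring back to the full column (the `costds$total_cost` style) bypasses the grouping, so every group receives the same global statistic. A base-R sketch of the difference:

```r
df <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 2, 10, 20))

grouped_meds <- tapply(df$x, df$g, median)  # what a grouped summarize should give
global_med   <- median(df$x)                # what the `df$x`-style bug gives every group

grouped_meds  # a: 1.5, b: 15
global_med    # 6
```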

Let’s break it down by necessity that contributes to total cost!

boxplots_housing = costds %>%
  ggplot(aes(x=family_member_count,y=housing_cost/1000,fill=family_member_count)) +
  geom_boxplot() +
  scale_y_continuous(limits = quantile(costds$housing_cost/1000, c(0.1,0.9)))
boxplots_housing
## Warning: Removed 6284 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Looks weird. 1p1c and 1p2c (and 2p1c and 2p2c) exhibit the same behavior. Hmm. 3/13/24 skipping for now! Too confusing.

I’d like to try to explore median family income.

costds %>%
  ggplot(aes(x=median_family_income))+
  geom_histogram(fill="cyan")+
  labs(title="Distribution of median family income in Dataset", subtitle="Median family income across all regions.")+
  xlab("Median family income ($/yr)")+
  ylab("Count")+
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite outside the scale range
## (`stat_bin()`).

COLds = costds %>%
  group_by(county) %>%
  summarise(count = n(),
            minCOL=min(total_cost,na.rm=TRUE),
            Q1COL=quantile(total_cost,prob=0.25,na.rm=TRUE),
            medCOL=quantile(total_cost,prob=0.50,na.rm=TRUE),
            avgCOL=mean(total_cost,na.rm=TRUE),
            Q3COL=quantile(total_cost,prob=0.75,na.rm=TRUE),
            maxCOL=max(total_cost,na.rm=TRUE))
            #mfi=median_family_income)

clean_COLds = na.omit(COLds)
clean_COLds
## # A tibble: 1,877 × 8
##    county           count minCOL  Q1COL medCOL avgCOL  Q3COL  maxCOL
##    <fct>            <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1 Abbeville County    10 33966. 47132. 60306. 59188. 69649.  81662.
##  2 Acadia Parish       10 35492. 52403. 67822. 66328. 79435.  92556.
##  3 Accomack County     10 33954. 52983. 67378. 65524. 77979.  89866.
##  4 Ada County          10 37437. 60141. 79311. 78165. 98435. 109790.
##  5 Adair County        40 32895. 49651. 66650. 65487. 79784.  96457.
##  6 Adams County       120 32524. 52588. 72088. 70403. 84831. 127218.
##  7 Addison County      10 43172. 72389. 89307. 84677. 98336. 112173.
##  8 Aiken County        10 34063. 49372. 62957. 61226. 72113.  82906.
##  9 Aitkin County       10 33545. 51617. 65721. 64217. 77795.  88177.
## 10 Alachua County      10 38150. 58904. 75068. 73451. 88672. 101047.
## # ℹ 1,867 more rows
#COMMITTING TO CLEANLINESS
COLds=clean_COLds
  
COLds
## # A tibble: 1,877 × 8
##    county           count minCOL  Q1COL medCOL avgCOL  Q3COL  maxCOL
##    <fct>            <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1 Abbeville County    10 33966. 47132. 60306. 59188. 69649.  81662.
##  2 Acadia Parish       10 35492. 52403. 67822. 66328. 79435.  92556.
##  3 Accomack County     10 33954. 52983. 67378. 65524. 77979.  89866.
##  4 Ada County          10 37437. 60141. 79311. 78165. 98435. 109790.
##  5 Adair County        40 32895. 49651. 66650. 65487. 79784.  96457.
##  6 Adams County       120 32524. 52588. 72088. 70403. 84831. 127218.
##  7 Addison County      10 43172. 72389. 89307. 84677. 98336. 112173.
##  8 Aiken County        10 34063. 49372. 62957. 61226. 72113.  82906.
##  9 Aitkin County       10 33545. 51617. 65721. 64217. 77795.  88177.
## 10 Alachua County      10 38150. 58904. 75068. 73451. 88672. 101047.
## # ℹ 1,867 more rows
costds %>%
  ggplot(aes(x=isMetro,fill=isMetro))+
  geom_bar()+
  labs(title="Proportion of data from metro counties")+
  xlab("Metro County")+
  ylab("# of cases")+
  theme_bw()+
  theme(legend.position="none")

Conditional testing from this result: how do transportation costs in non-metro counties compare to those in metro counties, and how large is the cost-of-living gap between the two county types?

Given that a county is a metro county, what is the probability that the cost of living is greater than the median of the dataset?

Marginal probability of isMetro:

marginMetro = costds %>%
  count(isMetro) %>%
  mutate(prop = n/sum(n))
marginMetro
## # A tibble: 2 × 3
##   isMetro     n  prop
##   <lgl>   <int> <dbl>
## 1 FALSE   19730 0.628
## 2 TRUE    11700 0.372

These two proportions sum to 1, as expected: isMetro and its complement are mutually exclusive and exhaustive.
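The conditional probability posed above, P(above-median COL | metro), is the joint probability divided by the marginal. A base-R sketch with made-up logical vectors (hypothetical values, not the dataset):

```r
is_metro  <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
above_med <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

p_joint <- mean(is_metro & above_med)  # P(metro AND above-median COL)
p_metro <- mean(is_metro)              # P(metro)
p_cond  <- p_joint / p_metro           # P(above-median COL | metro)
p_cond
```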

If the median cost of living in this dataset is 70977.68, then there exist two groups: below the median and above the median. I want to learn more about the relationship between cost of living and metro status by examining this boolean setup.

I want to add a couple of features to this dataset first: incorporate that national median into the dataset, and compare each county's median family income to its median cost of living.

medincds=costds %>%
  group_by(county) %>%
  summarize(med_inc=quantile(median_family_income,prob=0.50,na.rm=TRUE))
medincds
## # A tibble: 1,877 × 2
##    county           med_inc
##    <fct>              <dbl>
##  1 Abbeville County  53251.
##  2 Acadia Parish     58170.
##  3 Accomack County   59259.
##  4 Ada County        87161.
##  5 Adair County      56373.
##  6 Adams County      68103.
##  7 Addison County    86722.
##  8 Aiken County      64666.
##  9 Aitkin County     62395.
## 10 Alachua County    77957.
## # ℹ 1,867 more rows
costds_isMetro = costds %>% 
  group_by(county) %>%
  summarize(isMetro = first(isMetro))
costds_medianinc = costds %>%
  group_by(county) %>%
  summarize(family_income = first(median_family_income))

COLds=left_join(COLds,medincds,by="county")
COLds=left_join(COLds,costds_isMetro,by="county")
COLds=left_join(COLds,costds_medianinc,by="county")
#wholeds=left_join(COLds,costds,by="county")
COLds
## # A tibble: 1,877 × 11
##    county        count minCOL  Q1COL medCOL avgCOL  Q3COL maxCOL med_inc isMetro
##    <fct>         <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <lgl>  
##  1 Abbeville Co…    10 33966. 47132. 60306. 59188. 69649. 8.17e4  53251. FALSE  
##  2 Acadia Parish    10 35492. 52403. 67822. 66328. 79435. 9.26e4  58170. TRUE   
##  3 Accomack Cou…    10 33954. 52983. 67378. 65524. 77979. 8.99e4  59259. FALSE  
##  4 Ada County       10 37437. 60141. 79311. 78165. 98435. 1.10e5  87161. TRUE   
##  5 Adair County     40 32895. 49651. 66650. 65487. 79784. 9.65e4  56373. FALSE  
##  6 Adams County    120 32524. 52588. 72088. 70403. 84831. 1.27e5  68103. TRUE   
##  7 Addison Coun…    10 43172. 72389. 89307. 84677. 98336. 1.12e5  86722. FALSE  
##  8 Aiken County     10 34063. 49372. 62957. 61226. 72113. 8.29e4  64666. TRUE   
##  9 Aitkin County    10 33545. 51617. 65721. 64217. 77795. 8.82e4  62395. FALSE  
## 10 Alachua Coun…    10 38150. 58904. 75068. 73451. 88672. 1.01e5  77957. TRUE   
## # ℹ 1,867 more rows
## # ℹ 1 more variable: family_income <dbl>
COLds = COLds%>%
  mutate(
    NATL_med_fam_inc = 70977.68, #NOTE: this constant is the dataset-wide median total_cost from above, not an income figure; the name is kept for consistency with the printed output
    med_inc_below_NATLmed = med_inc < NATL_med_fam_inc,
    hard_to_live = med_inc < medCOL
  )
COLds
## # A tibble: 1,877 × 14
##    county        count minCOL  Q1COL medCOL avgCOL  Q3COL maxCOL med_inc isMetro
##    <fct>         <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <lgl>  
##  1 Abbeville Co…    10 33966. 47132. 60306. 59188. 69649. 8.17e4  53251. FALSE  
##  2 Acadia Parish    10 35492. 52403. 67822. 66328. 79435. 9.26e4  58170. TRUE   
##  3 Accomack Cou…    10 33954. 52983. 67378. 65524. 77979. 8.99e4  59259. FALSE  
##  4 Ada County       10 37437. 60141. 79311. 78165. 98435. 1.10e5  87161. TRUE   
##  5 Adair County     40 32895. 49651. 66650. 65487. 79784. 9.65e4  56373. FALSE  
##  6 Adams County    120 32524. 52588. 72088. 70403. 84831. 1.27e5  68103. TRUE   
##  7 Addison Coun…    10 43172. 72389. 89307. 84677. 98336. 1.12e5  86722. FALSE  
##  8 Aiken County     10 34063. 49372. 62957. 61226. 72113. 8.29e4  64666. TRUE   
##  9 Aitkin County    10 33545. 51617. 65721. 64217. 77795. 8.82e4  62395. FALSE  
## 10 Alachua Coun…    10 38150. 58904. 75068. 73451. 88672. 1.01e5  77957. TRUE   
## # ℹ 1,867 more rows
## # ℹ 4 more variables: family_income <dbl>, NATL_med_fam_inc <dbl>,
## #   med_inc_below_NATLmed <lgl>, hard_to_live <lgl>

Time to sample this new frame to derive neat insights:

sample_colDS = COLds %>%
  sample_n(50)
sample_colDS
## # A tibble: 50 × 14
##    county        count minCOL  Q1COL medCOL avgCOL  Q3COL maxCOL med_inc isMetro
##    <fct>         <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <lgl>  
##  1 Dillingham C…    10 39511. 67013. 8.67e4 8.47e4 1.06e5 1.18e5  57301  FALSE  
##  2 McClain Coun…    10 37880. 57135. 7.63e4 7.41e4 9.15e4 1.03e5  80289. TRUE   
##  3 Weld County      10 40387. 68980. 8.66e4 8.54e4 1.07e5 1.18e5  86818. TRUE   
##  4 Alameda Coun…    10 57829. 97210. 1.17e5 1.18e5 1.48e5 1.61e5 122821. TRUE   
##  5 Ogle County      10 35607. 56985. 7.26e4 7.00e4 8.31e4 9.61e4  76972. FALSE  
##  6 Adair County     40 32895. 49651. 6.66e4 6.55e4 7.98e4 9.65e4  56373. FALSE  
##  7 Indian River…    10 36742. 58945. 7.46e4 7.35e4 8.96e4 1.01e5  71933. TRUE   
##  8 Pickett Coun…    10 33343. 48242. 6.12e4 5.92e4 6.85e4 8.07e4  47319. FALSE  
##  9 Ziebach Coun…    10 37781. 56462. 7.45e4 7.22e4 8.58e4 1.00e5  40431. FALSE  
## 10 Tulare County    10 38393. 57220. 7.24e4 7.11e4 8.59e4 9.76e4  54761. TRUE   
## # ℹ 40 more rows
## # ℹ 4 more variables: family_income <dbl>, NATL_med_fam_inc <dbl>,
## #   med_inc_below_NATLmed <lgl>, hard_to_live <lgl>

To be clear about the definitions (which are represented mathematically in the code): we are using a constant national median (this comprises all types of family structures represented in the dataset). Each county has a median family income, and this value is either above or below the national median. As a reminder, this project is primarily interested in situations where families struggle with the cost of living in America, so it makes sense to name the new logical comparison variable with respect to this purpose. Next, “hard-to-live” is defined in its most trivial sense: the median family income of the area is less than the median cost of living for that area.
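Both flags reduce to vectorised comparisons; a base-R sketch with hypothetical county values (not the real data):

```r
med_inc <- c(53000, 87000, 62000)   # county median family incomes (made up)
medCOL  <- c(60000, 79000, 65000)   # county median costs of living (made up)
natl_benchmark <- 70977.68          # the constant used above

med_inc_below_natl <- med_inc < natl_benchmark
hard_to_live <- med_inc < medCOL

med_inc_below_natl  # TRUE FALSE TRUE
hard_to_live        # TRUE FALSE TRUE
```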

From the most recent data frame, it looks like we might be able to consider whether we can observe more “hard-to-live” records in metropolitan counties. Question: is there a significant association between a county being metropolitan and being borderline-to-extremely unlivable?

A summary of the features we have for this dataset:

colnames(sample_colDS)
##  [1] "county"                "count"                 "minCOL"               
##  [4] "Q1COL"                 "medCOL"                "avgCOL"               
##  [7] "Q3COL"                 "maxCOL"                "med_inc"              
## [10] "isMetro"               "family_income"         "NATL_med_fam_inc"     
## [13] "med_inc_below_NATLmed" "hard_to_live"

Checking out isMetro:

sample_colDS %>%
  group_by(isMetro,hard_to_live) %>%
  summarize(count=n()) %>%
  mutate(prop=count/sum(count))
## `summarise()` has grouped output by 'isMetro'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   isMetro [2]
##   isMetro hard_to_live count  prop
##   <lgl>   <lgl>        <int> <dbl>
## 1 FALSE   FALSE            4 0.148
## 2 FALSE   TRUE            23 0.852
## 3 TRUE    FALSE           13 0.565
## 4 TRUE    TRUE            10 0.435
#REVISIT THIS PROP TABLE; something is wrong. Proportions are not being calculated correctly...
#EDIT from 5 minutes ago: it turns out that this proportions table is a row-based proportions
#table. The values are replicable from adding parameter `margin=1` to prop.table().
#table() sorts its dimnames (FALSE before TRUE); relabeling with unique(), which
#preserves order of appearance, can silently swap the labels. Let table() name
#the dimensions itself instead.
tbl = table(isMetro = sample_colDS$isMetro, hard_to_live = sample_colDS$hard_to_live)
prop.table(tbl)
##        hard_to_live
## isMetro FALSE TRUE
##   FALSE  0.08 0.46
##   TRUE   0.26 0.20
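As the comment above notes, `prop.table()` can normalise over the whole table or within rows/columns via `margin`; a base-R demonstration using the same counts:

```r
tbl <- matrix(c(4, 23, 13, 10), nrow = 2, byrow = TRUE,
              dimnames = list(isMetro = c("FALSE", "TRUE"),
                              hard_to_live = c("FALSE", "TRUE")))

prop.table(tbl)              # cell / grand total (all four cells sum to 1)
prop.table(tbl, margin = 1)  # cell / row total (each row sums to 1)
```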

How likely is it that a sample drawn from this dataset would produce this table? The results here are surprising.

TODO:

- Run some stats tests on this proportions table.
- Investigate other samples?
- Re-examine the hard_to_live definition?

Changing gears

Let’s think about how individual costs and necessities are correlated!

#wholeds = wholeds %>%
#  group_by(county) %>%
#  summarize(county=first(county))
#wholeds = left_join(wholeds,costds,by="county")
#wholeds  

#wholeds has been giving me problems and I'm over it 3/14/24

3/13/24 Strange behavior with the new dataset I’m calling wholeds. The point of adding wholeds was to include all of the median family incomes in the COLds dataset without breaking anything. I’m hoping to solve this problem so that I can run regression analyses predicting what taxes should be for each family given median income.

3/14/24 Try this on for size:

COLds
## # A tibble: 1,877 × 14
##    county        count minCOL  Q1COL medCOL avgCOL  Q3COL maxCOL med_inc isMetro
##    <fct>         <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <lgl>  
##  1 Abbeville Co…    10 33966. 47132. 60306. 59188. 69649. 8.17e4  53251. FALSE  
##  2 Acadia Parish    10 35492. 52403. 67822. 66328. 79435. 9.26e4  58170. TRUE   
##  3 Accomack Cou…    10 33954. 52983. 67378. 65524. 77979. 8.99e4  59259. FALSE  
##  4 Ada County       10 37437. 60141. 79311. 78165. 98435. 1.10e5  87161. TRUE   
##  5 Adair County     40 32895. 49651. 66650. 65487. 79784. 9.65e4  56373. FALSE  
##  6 Adams County    120 32524. 52588. 72088. 70403. 84831. 1.27e5  68103. TRUE   
##  7 Addison Coun…    10 43172. 72389. 89307. 84677. 98336. 1.12e5  86722. FALSE  
##  8 Aiken County     10 34063. 49372. 62957. 61226. 72113. 8.29e4  64666. TRUE   
##  9 Aitkin County    10 33545. 51617. 65721. 64217. 77795. 8.82e4  62395. FALSE  
## 10 Alachua Coun…    10 38150. 58904. 75068. 73451. 88672. 1.01e5  77957. TRUE   
## # ℹ 1,867 more rows
## # ℹ 4 more variables: family_income <dbl>, NATL_med_fam_inc <dbl>,
## #   med_inc_below_NATLmed <lgl>, hard_to_live <lgl>

It has everything I want. I’m over it. Let’s get on with the analysis. 3/14/24 5:46P –No it doesn’t. Just use groupby bro.

At this moment, my curiosity is leading me to try predicting a family’s cost in taxes. (Okay, for some reason it’s taken me this long to realize that the easiest solution is to group the whole dataset by family size and type and just plot multiple lines. Why didn’t I think of that before?)

Just use costds and group by family_member_count.

costds %>%
  group_by(family_member_count)
## # A tibble: 31,430 × 15
## # Groups:   family_member_count [10]
##    case_id state isMetro areaname        county family_member_count housing_cost
##      <dbl> <fct> <lgl>   <chr>           <fct>  <fct>                      <dbl>
##  1       1 AL    TRUE    Montgomery, AL… Autau… 1p0c                       8506.
##  2       1 AL    TRUE    Montgomery, AL… Autau… 1p1c                      12068.
##  3       1 AL    TRUE    Montgomery, AL… Autau… 1p2c                      12068.
##  4       1 AL    TRUE    Montgomery, AL… Autau… 1p3c                      15257.
##  5       1 AL    TRUE    Montgomery, AL… Autau… 1p4c                      15257.
##  6       1 AL    TRUE    Montgomery, AL… Autau… 2p0c                      10180.
##  7       1 AL    TRUE    Montgomery, AL… Autau… 2p1c                      12068.
##  8       1 AL    TRUE    Montgomery, AL… Autau… 2p2c                      12068.
##  9       1 AL    TRUE    Montgomery, AL… Autau… 2p3c                      15257.
## 10       1 AL    TRUE    Montgomery, AL… Autau… 2p4c                      15257.
## # ℹ 31,420 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
# Create a toy dataset first (note: V3 is constant within each level of V1, so V3
# and V1 are collinear and lm() will drop one coefficient; see "singularities" in the output)
data = tibble(
  V1 = c("A", "A", "B", "B", "C", "C"), #county
  V2 = c(3, 3.53, 6.20, 2.2, 1, 3), #taxes
  V3 = c(242, 242, 289, 289, 303, 303) #median family income
)

# Convert V1 to a factor variable
data$V1 = as.factor(data$V1)

# Fit a linear regression model
model = lm(V2 ~ V3 + V1, data = data)

# Print the summary of the regression model
summary(model)
## 
## Call:
## lm(formula = V2 ~ V3 + V1, data = data)
## 
## Residuals:
##      1      2      3      4      5      6 
## -0.265  0.265  2.000 -2.000 -1.000  1.000 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.28352    8.26434   1.002    0.390
## V3          -0.02074    0.03014  -0.688    0.541
## V1B          1.90967    1.66806   1.145    0.335
## V1C               NA         NA      NA       NA
## 
## Residual standard error: 1.839 on 3 degrees of freedom
## Multiple R-squared:  0.3247, Adjusted R-squared:  -0.1255 
## F-statistic: 0.7213 on 2 and 3 DF,  p-value: 0.5549
# Now revisit the real dataset
costds
## # A tibble: 31,430 × 15
##    case_id state isMetro areaname        county family_member_count housing_cost
##      <dbl> <fct> <lgl>   <chr>           <fct>  <fct>                      <dbl>
##  1       1 AL    TRUE    Montgomery, AL… Autau… 1p0c                       8506.
##  2       1 AL    TRUE    Montgomery, AL… Autau… 1p1c                      12068.
##  3       1 AL    TRUE    Montgomery, AL… Autau… 1p2c                      12068.
##  4       1 AL    TRUE    Montgomery, AL… Autau… 1p3c                      15257.
##  5       1 AL    TRUE    Montgomery, AL… Autau… 1p4c                      15257.
##  6       1 AL    TRUE    Montgomery, AL… Autau… 2p0c                      10180.
##  7       1 AL    TRUE    Montgomery, AL… Autau… 2p1c                      12068.
##  8       1 AL    TRUE    Montgomery, AL… Autau… 2p2c                      12068.
##  9       1 AL    TRUE    Montgomery, AL… Autau… 2p3c                      15257.
## 10       1 AL    TRUE    Montgomery, AL… Autau… 2p4c                      15257.
## # ℹ 31,420 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
# family_member_count was already converted to a factor earlier; this is a harmless no-op safeguard
costds$family_member_count = as.factor(costds$family_member_count)

# Fit a linear regression model (pass the data frame via `data =` and use bare
# column names in the formula, rather than `costds$` and a misspelled argument)
model = lm(taxes ~ median_family_income + costds$family_member_count, data = costds)
model = lm(taxes ~ median_family_income + family_member_count, data = costds)

# Print the summary of the regression model
summary(model)
## 
## Call:
## lm(formula = taxes ~ median_family_income + family_member_count, 
##     data = costds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9256.3 -1401.2  -102.2  1141.7 30876.4 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.652e+03  7.336e+01 -22.524  < 2e-16 ***
## median_family_income     1.060e-01  8.461e-04 125.292  < 2e-16 ***
## family_member_count1p1c  5.075e+02  6.389e+01   7.943 2.03e-15 ***
## family_member_count1p2c  1.202e+03  6.389e+01  18.815  < 2e-16 ***
## family_member_count1p3c  3.296e+03  6.389e+01  51.589  < 2e-16 ***
## family_member_count1p4c  3.086e+03  6.389e+01  48.294  < 2e-16 ***
## family_member_count2p0c  9.923e+02  6.389e+01  15.531  < 2e-16 ***
## family_member_count2p1c  2.177e+03  6.389e+01  34.065  < 2e-16 ***
## family_member_count2p2c  2.452e+03  6.389e+01  38.372  < 2e-16 ***
## family_member_count2p3c  3.667e+03  6.389e+01  57.385  < 2e-16 ***
## family_member_count2p4c  3.297e+03  6.389e+01  51.606  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2533 on 31409 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.4252, Adjusted R-squared:  0.425 
## F-statistic:  2324 on 10 and 31409 DF,  p-value: < 2.2e-16
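With the model fit, `predict()` returns fitted tax values for new cases. A self-contained sketch on a toy linear model with the same structure, one numeric predictor plus a factor (hypothetical numbers, not the coefficients above):

```r
d <- data.frame(x = c(1, 2, 3, 4),
                g = factor(c("a", "a", "b", "b")),
                y = c(2, 4, 7, 9))   # y = 2x within "a", 2x + 1 within "b"
m <- lm(y ~ x + g, data = d)

# predict for a new case: x = 2.5 in group "a"
newd <- data.frame(x = 2.5, g = factor("a", levels = c("a", "b")))
predict(m, newdata = newd)  # 5
```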

Plot this:

costds %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 10 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_point()`).

Noooooopppppeeee. Ten overlapping groups over ~31k points is unreadable.

Sample time!

#identify unique ID column by which each of the rows are grouped
uniqueIDs = unique(costds$case_id)

#sample by unique IDs
sampledUID = sample(uniqueIDs, 250)

#extract rows for sample by unique IDs
sampleDataByUID = costds[costds$case_id %in% sampledUID, ]

#display samples
print(sampleDataByUID)
## # A tibble: 2,500 × 15
##    case_id state isMetro areaname        county family_member_count housing_cost
##      <dbl> <fct> <lgl>   <chr>           <fct>  <fct>                      <dbl>
##  1      23 AL    FALSE   Dale County, AL Dale … 1p0c                        6276
##  2      23 AL    FALSE   Dale County, AL Dale … 1p1c                        7608
##  3      23 AL    FALSE   Dale County, AL Dale … 1p2c                        7608
##  4      23 AL    FALSE   Dale County, AL Dale … 1p3c                       10680
##  5      23 AL    FALSE   Dale County, AL Dale … 1p4c                       10680
##  6      23 AL    FALSE   Dale County, AL Dale … 2p0c                        6312
##  7      23 AL    FALSE   Dale County, AL Dale … 2p1c                        7608
##  8      23 AL    FALSE   Dale County, AL Dale … 2p2c                        7608
##  9      23 AL    FALSE   Dale County, AL Dale … 2p3c                       10680
## 10      23 AL    FALSE   Dale County, AL Dale … 2p4c                       10680
## # ℹ 2,490 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
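Worth noting: `sample()` draws a different set of 250 counties on every knit, so every number downstream of this chunk will shift run to run. A minimal sketch of making the draw reproducible (the seed value 42 is an arbitrary choice, and `1:3171` stands in for `unique(costds$case_id)`):

```r
# Fix the RNG state so the same 250 IDs are drawn on every knit.
set.seed(42)                      # arbitrary constant; any fixed seed works
uniqueIDs  <- 1:3171              # stand-in for unique(costds$case_id)
sampledUID <- sample(uniqueIDs, 250)

# Re-seeding and re-sampling reproduces the exact same draw.
set.seed(42)
identical(sampledUID, sample(uniqueIDs, 250))  # TRUE
```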

Hoping for a more readable plot:

#change on 3/19/24
#wanting 1p group
singles = sampleDataByUID %>%
  filter(family_member_count %in% c("1p0c","1p1c","1p2c","1p3c","1p4c"))
singles
## # A tibble: 1,250 × 15
##    case_id state isMetro areaname        county family_member_count housing_cost
##      <dbl> <fct> <lgl>   <chr>           <fct>  <fct>                      <dbl>
##  1      23 AL    FALSE   Dale County, AL Dale … 1p0c                        6276
##  2      23 AL    FALSE   Dale County, AL Dale … 1p1c                        7608
##  3      23 AL    FALSE   Dale County, AL Dale … 1p2c                        7608
##  4      23 AL    FALSE   Dale County, AL Dale … 1p3c                       10680
##  5      23 AL    FALSE   Dale County, AL Dale … 1p4c                       10680
##  6      34 AL    TRUE    Henry County, … Henry… 1p0c                        5904
##  7      34 AL    TRUE    Henry County, … Henry… 1p1c                        7608
##  8      34 AL    TRUE    Henry County, … Henry… 1p2c                        7608
##  9      34 AL    TRUE    Henry County, … Henry… 1p3c                       10296
## 10      34 AL    TRUE    Henry County, … Henry… 1p4c                       10296
## # ℹ 1,240 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
#ensuring we only have 1p group:
unique(singles$family_member_count)
## [1] 1p0c 1p1c 1p2c 1p3c 1p4c
## Levels: 1p0c 1p1c 1p2c 1p3c 1p4c 2p0c 2p1c 2p2c 2p3c 2p4c
couples = sampleDataByUID %>%
  filter(family_member_count %in% c("2p0c","2p1c","2p2c","2p3c","2p4c"))
couples
## # A tibble: 1,250 × 15
##    case_id state isMetro areaname        county family_member_count housing_cost
##      <dbl> <fct> <lgl>   <chr>           <fct>  <fct>                      <dbl>
##  1      23 AL    FALSE   Dale County, AL Dale … 2p0c                        6312
##  2      23 AL    FALSE   Dale County, AL Dale … 2p1c                        7608
##  3      23 AL    FALSE   Dale County, AL Dale … 2p2c                        7608
##  4      23 AL    FALSE   Dale County, AL Dale … 2p3c                       10680
##  5      23 AL    FALSE   Dale County, AL Dale … 2p4c                       10680
##  6      34 AL    TRUE    Henry County, … Henry… 2p0c                        6384
##  7      34 AL    TRUE    Henry County, … Henry… 2p1c                        7608
##  8      34 AL    TRUE    Henry County, … Henry… 2p2c                        7608
##  9      34 AL    TRUE    Henry County, … Henry… 2p3c                       10296
## 10      34 AL    TRUE    Henry County, … Henry… 2p4c                       10296
## # ℹ 1,240 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
unique(couples$family_member_count)
## [1] 2p0c 2p1c 2p2c 2p3c 2p4c
## Levels: 1p0c 1p1c 1p2c 1p3c 1p4c 2p0c 2p1c 2p2c 2p3c 2p4c
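A side note on filtering for several values at once: `filter(x == c(...))` compares element-wise with recycling, so it only accidentally keeps the right rows when they arrive in a fixed order; `%in%` is the order-independent membership test. A toy sketch of the difference:

```r
library(dplyr)

toy <- tibble(id = 1:4, grp = c("b", "a", "a", "b"))

# `==` recycles c("a","b") to a,b,a,b and compares position by position,
# so only rows 3 and 4 happen to match here:
recycled <- toy %>% filter(grp == c("a", "b"))
nrow(recycled)    # 2

# `%in%` keeps every row whose grp is any of the listed values:
membership <- toy %>% filter(grp %in% c("a", "b"))
nrow(membership)  # 4
```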
sampleDataByUID %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

#change on 3/19/24
#graph by group
#re-examined 4/12/24
singles %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Median Family Income", y="Taxes Paid", color="Family size",
       title="Taxes paid by singles given income") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

couples %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Median Family Income", y="Taxes Paid", color="Family size", title="Taxes paid by couples given income") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Okay then! Let’s just do one group, please! My eyes hurt.

costds$family_member_count = factor(costds$family_member_count)
singleNoKids=costds %>%
  filter(family_member_count==levels(costds$family_member_count)[1])
singleNoKids
## # A tibble: 3,143 × 15
##    case_id state isMetro areaname        county family_member_count housing_cost
##      <dbl> <fct> <lgl>   <chr>           <fct>  <fct>                      <dbl>
##  1       1 AL    TRUE    Montgomery, AL… Autau… 1p0c                       8506.
##  2       2 AL    TRUE    Daphne-Fairhop… Baldw… 1p0c                       8616 
##  3       3 AL    FALSE   Barbour County… Barbo… 1p0c                       5856 
##  4       4 AL    TRUE    Birmingham-Hoo… Bibb … 1p0c                       7974.
##  5       5 AL    TRUE    Birmingham-Hoo… Bloun… 1p0c                       7246.
##  6       6 AL    FALSE   Bullock County… Bullo… 1p0c                       6384 
##  7       7 AL    FALSE   Butler County,… Butle… 1p0c                       5748 
##  8       8 AL    TRUE    Anniston-Oxfor… Calho… 1p0c                       5784 
##  9       9 AL    FALSE   Chambers Count… Chamb… 1p0c                       6408 
## 10      10 AL    FALSE   Cherokee Count… Chero… 1p0c                       5748 
## # ℹ 3,133 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>

(3/19/24) Some reflective notes about this chunk: I spent a lot of time on it because it was showing me an unexpected return. I had the following:

#costds$family_member_count = factor(costds$family_member_count)
#costds %>%
#  filter(family_member_count=="1p0c")
#costds

and then was all surprised Pikachu when I wasn’t getting the expected return. Silly mistake: I never assigned the filtered result, so printing costds just showed the unfiltered data. Gotta make sure I’m paying attention to what I want and what I am returning.
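For future me, the gotcha in one line: dplyr verbs return a new data frame and never modify their input, so a `filter()` whose result isn’t assigned is silently thrown away. A toy sketch:

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3, 4))

df %>% filter(x > 2)          # prints a 2-row tibble, but df is unchanged
nrow(df)                      # still 4

kept <- df %>% filter(x > 2)  # the assignment is what was missing
nrow(kept)                    # 2
```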

singleNoKids %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

All graphs together:

allplotted = sampleDataByUID %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
  geom_point() +
  #geom_smooth(method="lm", se=FALSE) +
  facet_wrap(~family_member_count) +
  labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),  # Remove x-axis labels
        axis.text.y = element_blank(),  # Remove y-axis labels
        strip.background = element_blank()) +  # Remove facet background
  labs(subtitle="Note: these graphs are \"to scale\", which means that the plots can be compared despite unlabeled axes.")
allplotted

Adding linear regression lines:

# Define a function to fit the linear regression model and calculate summaries
fitSumLM <- function(dependent_var, independent_vars, data) {
  # Construct the formula for the linear model
  formula <- as.formula(paste(dependent_var, "~", paste(independent_vars, collapse = " + ")))

  # Fit the linear model
  model <- lm(formula, data = data)

  # Print the summary of the fitted model
  print(summary(model))
  
  # Return the fitted model
  return(model)
}

fitSumLM("taxes",c("median_family_income","family_member_count"),sampleDataByUID)
## 
## Call:
## lm(formula = formula, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6329.3 -1320.6   -79.8  1157.4 13069.2 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.859e+03  2.397e+02  -7.754 1.29e-14 ***
## median_family_income     1.087e-01  2.744e-03  39.608  < 2e-16 ***
## family_member_count1p1c  5.103e+02  2.086e+02   2.446   0.0145 *  
## family_member_count1p2c  1.189e+03  2.086e+02   5.701 1.33e-08 ***
## family_member_count1p3c  3.291e+03  2.086e+02  15.773  < 2e-16 ***
## family_member_count1p4c  3.102e+03  2.086e+02  14.871  < 2e-16 ***
## family_member_count2p0c  9.972e+02  2.086e+02   4.780 1.86e-06 ***
## family_member_count2p1c  2.190e+03  2.086e+02  10.499  < 2e-16 ***
## family_member_count2p2c  2.460e+03  2.086e+02  11.794  < 2e-16 ***
## family_member_count2p3c  3.657e+03  2.086e+02  17.529  < 2e-16 ***
## family_member_count2p4c  3.314e+03  2.086e+02  15.885  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2332 on 2489 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4758 
## F-statistic: 227.8 on 10 and 2489 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = formula, data = data)
## 
## Coefficients:
##             (Intercept)     median_family_income  family_member_count1p1c  
##              -1858.5479                   0.1087                 510.3012  
## family_member_count1p2c  family_member_count1p3c  family_member_count1p4c  
##               1189.2756                3290.6074                3102.4217  
## family_member_count2p0c  family_member_count2p1c  family_member_count2p2c  
##                997.1658                2190.3589                2460.3912  
## family_member_count2p3c  family_member_count2p4c  
##               3656.9417                3313.8517

4/1/24 My current understanding of this output: it seems very unlikely that the predictors have no effect on the response. The p-values indicate as much, especially for the coefficients coded ***. In other words, each independent variable (median family income and family size) has some effect on the response variable (taxes paid). The t-values tell me how strongly each feature influences the response relative to its standard error; by that measure, median family income explains most of the variability in taxes paid within this sample of the dataset (which I passed through a function designed to just fit, run, and print the summary of a linear model). Because I ran a multiple linear regression on this data, I have made some assumptions about it:

  1. The relationship between the predictors (median family income and family size) and the taxes a family will pay is linear. I can test this assumption by plotting the residuals of the model against each predictor; no trend on such a plot would indicate linearity.
  2. The residuals are independent of each other. A Durbin-Watson autocorrelation test should yield a value of around 2.
  3. The data is homoscedastic; i.e., the variance of the residuals is constant. A scatterplot of residuals vs. predicted values should show an even spread over the predicted values.
  4. The residuals are normally distributed. There are various checks for this: the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and a QQ plot of the residuals against a normal distribution, which should yield a straight line.
  5. Since I ran a linear model using multiple independent variables, I also assume no perfect multicollinearity, which means that no predictor is an exact linear combination of the others. A Variance Inflation Factor greater than 10 indicates that this assumption might be violated for some feature of the model.
  6. I also assume that there are no outliers or significantly influential points that break the model. A Cook’s distance greater than 1 for a point means that it significantly influences the model.
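Assumptions 4 through 6 can each be checked in a couple of lines as well; a sketch on a toy model (swap in the real fitted model when running this against the sample data — `car::vif()` needs the `car` package installed, everything else is base R):

```r
# Toy model standing in for the real fit; the checks themselves are the point.
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(100)
m <- lm(y ~ x1 + x2, data = d)

# Assumption 4: normal residuals -- QQ plot plus a Shapiro-Wilk test.
qqnorm(residuals(m)); qqline(residuals(m))
shapiro.test(residuals(m))   # large p-value => no evidence against normality

# Assumption 5: multicollinearity -- VIF > 10 flags a problem predictor.
# car::vif(m)                # requires install.packages("car")

# Assumption 6: influential points -- Cook's distance > 1 flags a point.
max(cooks.distance(m))
```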

I tried to get this code to work to plot regression lines and figured that I have libraries that can do it for me. No need to reinvent the wheel. The code doesn’t work anyway for some reason which I’m too lazy to figure out.

## Create a dataframe to store the regression coefficients
#coeffDF <- data.frame()

## Define a function to fit the linear regression model and add the regression line to each facet
#add_regression_line <- function(givenData, independent_var, dependent_var) {
#  model <- lm(dependent_var ~ independent_var, data = givenData)
#  coefficients <- summary(model)$coefficients[, 1]
#  coeffDF <<- bind_rows(coeffDF, cbind(variable = independent_var, coefficients))
#  geom_smooth(method = "lm", se = FALSE, color = "red") +
#  geom_text(aes(label = paste("R-squared:", round(summary(model)$adj.r.squared, 3))), x = Inf, y = Inf, hjust = 1, vjust = 1)
#}
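The idea that commented-out function was after — a fitted line in each facet plus its adjusted R² — can be done without a custom function. A sketch of my understanding using ggplot2 and dplyr, on toy data with the same column names as the sample (the label placement via `Inf` and the hjust/vjust nudges are judgment calls):

```r
library(tidyverse)

# Toy stand-in with the same column names as sampleDataByUID.
set.seed(7)
demo <- tibble(
  family_member_count = rep(c("1p0c", "2p0c"), each = 50),
  median_family_income = runif(100, 40000, 120000)
) %>%
  mutate(taxes = 0.1 * median_family_income +
                 ifelse(family_member_count == "2p0c", 1000, 0) +
                 rnorm(100, sd = 1500))

# Per-group adjusted R-squared, computed once and drawn as a facet label.
r2_labels <- demo %>%
  group_by(family_member_count) %>%
  summarise(adj_r2 = summary(lm(taxes ~ median_family_income))$adj.r.squared)

demo %>%
  ggplot(aes(x = median_family_income, y = taxes)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_text(data = r2_labels,
            aes(label = paste("adj. R2:", round(adj_r2, 3))),
            x = Inf, y = Inf, hjust = 1.1, vjust = 1.5) +
  facet_wrap(~family_member_count) +
  theme_minimal()
```

Swapping `demo` for `sampleDataByUID` should reproduce the faceted plot above with a regression line and label per family size.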

Here I test each assumption:

#generate the model explicitly and gather residuals
MLRmodel = lm(formula = taxes ~ median_family_income+family_member_count, sampleDataByUID)
residuals = residuals(MLRmodel)
#summary(MLRmodel) #just verifying that the model is the same.
#I'll need this library:
#install.packages("lmtest")
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
#Assumption 1
ggplot(sampleDataByUID, aes(x=residuals, y=median_family_income)) +
  geom_point() +
  labs(x="Residuals",
       y="MFI",
       title="Residual Plot Against Median Family Income (MFI)")

ggplot(sampleDataByUID, aes(x=residuals, y=family_member_count)) +
  geom_point() +
  labs(x="Residuals",
       y="Family Size",
       title="Residual Plot Against Family Size")

The graphs show no trend between the residuals and the independent variables used in the model, so the linearity assumption appears reasonable.

#Assumption 2
dw = dwtest(MLRmodel)
print(dw)
## 
##  Durbin-Watson test
## 
## data:  MLRmodel
## DW = 0.54418, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

It seems that the test yields a significant result of DW = 0.54418, a value much closer to 0 than to 2, the value which would indicate no autocorrelation. My idea for why this might be is that I did not separate the data appropriately to satisfy this assumption. Generally, family sizes containing 1p will make less income than family sizes containing 2p, and since each county contributes ten consecutive rows that share county-level values, neighboring residuals are likely correlated with each other.

This will definitely need to be further explored and addressed. The sample data might need to be modified to subset the groups of one and two parents.

#Assumption 3:
bp=bptest(MLRmodel)
print(bp)
## 
##  studentized Breusch-Pagan test
## 
## data:  MLRmodel
## BP = 371.72, df = 10, p-value < 2.2e-16

4/12/24 The Breusch-Pagan test is also significant, so the homoscedasticity assumption is violated as well. I definitely need to regroup the sample data I’m using.

head(singles)
## # A tibble: 6 × 15
##   case_id state isMetro areaname         county family_member_count housing_cost
##     <dbl> <fct> <lgl>   <chr>            <fct>  <fct>                      <dbl>
## 1      23 AL    FALSE   Dale County, AL  Dale … 1p0c                        6276
## 2      23 AL    FALSE   Dale County, AL  Dale … 1p1c                        7608
## 3      23 AL    FALSE   Dale County, AL  Dale … 1p2c                        7608
## 4      23 AL    FALSE   Dale County, AL  Dale … 1p3c                       10680
## 5      23 AL    FALSE   Dale County, AL  Dale … 1p4c                       10680
## 6      34 AL    TRUE    Henry County, A… Henry… 1p0c                        5904
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
head(couples)
## # A tibble: 6 × 15
##   case_id state isMetro areaname         county family_member_count housing_cost
##     <dbl> <fct> <lgl>   <chr>            <fct>  <fct>                      <dbl>
## 1      23 AL    FALSE   Dale County, AL  Dale … 2p0c                        6312
## 2      23 AL    FALSE   Dale County, AL  Dale … 2p1c                        7608
## 3      23 AL    FALSE   Dale County, AL  Dale … 2p2c                        7608
## 4      23 AL    FALSE   Dale County, AL  Dale … 2p3c                       10680
## 5      23 AL    FALSE   Dale County, AL  Dale … 2p4c                       10680
## 6      34 AL    TRUE    Henry County, A… Henry… 2p0c                        6384
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## #   healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## #   taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
#set labels for number of children:
customLabels = c("1p0c"="0",
                 "1p1c"="1",
                 "1p2c"="2",
                 "1p3c"="3",
                 "1p4c"="4")

singles %>% ggplot(aes(x=median_family_income,y=taxes,color=family_member_count)) +
  geom_point()+
  scale_color_manual(
            values=c("1p0c"="maroon",
                     "1p1c"="darkorange",
                     "1p2c"="olivedrab",
                     "1p3c"="goldenrod",
                     "1p4c"="indianred"),
            labels=customLabels)+
  labs(title="Singles' taxes paid against median income by number of children",
       x="Median Family Income (yearly)",
       y="Taxes Paid (yearly)",
       color="Number of Children")

Linear regression time! This time, it’s still a multiple linear regression, but without the two-parent group found in the previous data.

#Setup model and residuals:
MLRmodel = lm(formula = taxes ~ median_family_income+family_member_count, singles)
residuals = residuals(MLRmodel)

#Assumption 1:
ggplot(singles, aes(x=residuals, y=median_family_income)) +
  geom_point() +
  labs(x="Residuals",
       y="MFI",
       title="Residual Plot Against Median Family Income (MFI)")

ggplot(singles, aes(x=residuals, y=family_member_count)) +
  geom_point() +
  labs(x="Residuals",
       y="Family Size",
       title="Residual Plot Against Family Size")

#Assumption 2:
dw = dwtest(MLRmodel)
print(dw)
## 
##  Durbin-Watson test
## 
## data:  MLRmodel
## DW = 0.58734, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
#Assumption 3:
bp=bptest(MLRmodel)
print(bp)
## 
##  studentized Breusch-Pagan test
## 
## data:  MLRmodel
## BP = 196.13, df = 5, p-value < 2.2e-16

Interesting. The autocorrelation is barely any better (DW = 0.59 vs. 0.54, both far from 2). Time to separate even further! Metro and non-metro singles!

metroSingles = singles %>%
  filter(isMetro==TRUE)
unique(metroSingles$isMetro)
## [1] TRUE
nonmetroSingles = singles %>%
  filter(isMetro==FALSE)
unique(nonmetroSingles$isMetro)
## [1] FALSE

Easy.
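To see whether the metro split actually helps, the same model and Durbin-Watson test can be rerun on each subset; a sketch with a small helper (assuming `metroSingles`, `nonmetroSingles`, and `lmtest` are available as above):

```r
library(lmtest)

# Helper: refit the same model on a subset and return its DW statistic.
dw_for <- function(d) {
  m <- lm(taxes ~ median_family_income + family_member_count, data = d)
  dwtest(m)$statistic
}

dw_for(metroSingles)     # values closer to 2 mean less autocorrelation
dw_for(nonmetroSingles)
```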

For future: Compare differences between childcare cost, median family income, and family size.