Technologies jobs are looking for: AWS / Azure / GCE; Python DS libraries (numpy, matplotlib, etc.), which raises the question of why not just use R; building data pipelines; programming ETL (extract, transform, load) tools.
Load in the tidyverse library.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.3
Load in the dataset.
costds = read_csv("cost_of_living_us.csv")
## Rows: 31430 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): state, areaname, county, family_member_count
## dbl (10): case_id, housing_cost, food_cost, transportation_cost, healthcare_...
## lgl (1): isMetro
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
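As the second message suggests, the column types can be declared up front, which both silences the message and makes any parsing surprise fail loudly. A minimal sketch of both options (the column list below mirrors the specification printed above):

```r
library(readr)

# Option 1: just quiet the message
costds = read_csv("cost_of_living_us.csv", show_col_types = FALSE)

# Option 2: pin the types explicitly; anything unexpected becomes a visible parsing problem
costds = read_csv(
  "cost_of_living_us.csv",
  col_types = cols(
    state = col_character(),
    areaname = col_character(),
    county = col_character(),
    family_member_count = col_character(),
    isMetro = col_logical(),
    .default = col_double()  # case_id, the cost columns, and median_family_income
  )
)
```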
#https://www.kaggle.com/datasets/asaniczka/us-cost-of-living-dataset-3171-counties
From the Kaggle page:
About Dataset: The US Family Budget Dataset provides insights into the cost of living in different US counties based on the Family Budget Calculator by the Economic Policy Institute (EPI). This dataset offers community-specific estimates for ten family types, including one or two adults with zero to four children, in all 1877 counties and metro areas across the United States.
Explore the dataset.
head(costds)
## # A tibble: 6 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <chr> <lgl> <chr> <chr> <chr> <dbl>
## 1 1 AL TRUE Montgomery, AL … Autau… 1p0c 8506.
## 2 1 AL TRUE Montgomery, AL … Autau… 1p1c 12068.
## 3 1 AL TRUE Montgomery, AL … Autau… 1p2c 12068.
## 4 1 AL TRUE Montgomery, AL … Autau… 1p3c 15257.
## 5 1 AL TRUE Montgomery, AL … Autau… 1p4c 15257.
## 6 1 AL TRUE Montgomery, AL … Autau… 2p0c 10180.
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
tail(costds)
## # A tibble: 6 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <chr> <lgl> <chr> <chr> <chr> <dbl>
## 1 3171 WY FALSE Weston County, … Westo… 1p4c 13632
## 2 3171 WY FALSE Weston County, … Westo… 2p0c 8316
## 3 3171 WY FALSE Weston County, … Westo… 2p1c 10956
## 4 3171 WY FALSE Weston County, … Westo… 2p2c 10956
## 5 3171 WY FALSE Weston County, … Westo… 2p3c 13632
## 6 3171 WY FALSE Weston County, … Westo… 2p4c 13632
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
length(unique(costds$case_id)) #number of unique cases / counties observed in dataset (note: the column is case_id, not caseid)
## [1] 3143
Restructure some data fields: factorize county, family_member_count, and state.
costds = costds %>% mutate(across(c(county, family_member_count, state), as.factor))
typeof(costds$county)
## [1] "integer"
typeof(costds$family_member_count)
## [1] "integer"
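A quick aside on why typeof() says "integer" here: factors are stored as integer level codes plus a levels attribute, so typeof() reports the underlying storage mode. class() is the better check that the conversion actually worked. A self-contained illustration:

```r
f = factor(c("1p0c", "1p1c", "1p0c"))
typeof(f)  # "integer": the underlying storage is integer level codes
class(f)   # "factor": the class is what confirms the conversion
levels(f)  # "1p0c" "1p1c"
```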
head(costds)
## # A tibble: 6 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 1 AL TRUE Montgomery, AL … Autau… 1p0c 8506.
## 2 1 AL TRUE Montgomery, AL … Autau… 1p1c 12068.
## 3 1 AL TRUE Montgomery, AL … Autau… 1p2c 12068.
## 4 1 AL TRUE Montgomery, AL … Autau… 1p3c 15257.
## 5 1 AL TRUE Montgomery, AL … Autau… 1p4c 15257.
## 6 1 AL TRUE Montgomery, AL … Autau… 2p0c 10180.
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
Questions about the data:
1. To what extent does family_member_count impact housing_cost?
2. What is the relationship between family_member_count and food_cost?
3. How does total_cost compare between households with $0 childcare_cost and households with >$0 childcare_cost?
4. What is the difference in total_cost with respect to isMetro?
5. What is the difference in transportation_cost with respect to isMetro?
First, I always like visualizing the distribution of the data over a variable of interest. This is a histogram of the total cost of living over the dataset.
Histogram of the data: (with beautifications)
costds %>%
ggplot(aes(x=total_cost))+
geom_histogram(fill="purple")+
labs(title="Distribution of cost of living in Dataset", subtitle="Cost of living across all regions.")+
xlab("Cost of living ($)")+
ylab("Count")+
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
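Following the stat_bin() hint, the bin width can be set explicitly instead of accepting the default 30 bins. A sketch, assuming costds is loaded as above; the $5,000 width is an arbitrary choice:

```r
costds %>%
  ggplot(aes(x = total_cost)) +
  geom_histogram(binwidth = 5000, fill = "purple") +  # one bar per $5,000 of cost
  labs(title = "Distribution of cost of living in Dataset",
       subtitle = "Cost of living across all regions.") +
  xlab("Cost of living ($)") +
  ylab("Count") +
  theme_bw()
```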
Answering question 1:
First, make the first plot that comes to mind.
costds %>%
ggplot(aes(x=family_member_count,y=housing_cost)) +
geom_point()
This looks like each level of family_member_count can become its own boxplot. Let’s group them as such.
costds %>%
ggplot(aes(x=family_member_count,y=housing_cost)) +
geom_boxplot()
Outliers seem to be squishing our boxes. The boxplot’s default rule flags points beyond 1.5 x IQR from the quartiles (not 2.5 standard deviations beyond the mean), so rather than redefining outliers, let’s just restrict the y-axis to the 10th-90th percentile range of total_cost.
boxplots = costds %>%
ggplot(aes(x=family_member_count,y=total_cost/1000,fill=family_member_count)) +
geom_boxplot() +
scale_y_continuous(limits = quantile(costds$total_cost/1000, c(0.1,0.9)))
boxplots
## Warning: Removed 6286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Make the boxplots pretty. (Note the switch above from housing_cost to total_cost; total_cost turns out to yield a more interesting picture.)
boxplots +
theme_bw() +
labs(title="Cost of Living Boxplot Analysis",
subtitle="What do families expect to pay to live based on size?")+
xlab("Family group (p=parent, c=child)")+
ylab("Cost of Living (x$1000)")+
theme(legend.position="none")
## Warning: Removed 6286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
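The “Removed 6286 rows” warning is a side effect of scale_y_continuous(limits = ...), which drops out-of-range rows before the boxplot statistics are computed, so the quartiles themselves shift. coord_cartesian() zooms the view while keeping every row in the stats; a sketch of the same plot with that swap (assuming costds is loaded as above):

```r
costds %>%
  ggplot(aes(x = family_member_count, y = total_cost / 1000, fill = family_member_count)) +
  geom_boxplot() +
  # zoom only: quartiles and whiskers are computed from the full data, no rows removed
  coord_cartesian(ylim = quantile(costds$total_cost / 1000, c(0.1, 0.9)))
```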
Numerical summaries.
five_num_sum = costds %>%
group_by(family_member_count) %>%
summarize(count = n(),
min=min(total_cost),
Q1=quantile(total_cost,prob=0.25,na.rm=TRUE),
med=quantile(total_cost,prob=0.50,na.rm=TRUE),
avg=mean(total_cost),
Q3=quantile(total_cost,prob=0.75,na.rm=TRUE),
max=max(total_cost))
five_num_sum
## # A tibble: 10 × 8
## family_member_count count min Q1 med avg Q3 max
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1p0c 3143 30088. 34139. 35625. 36627. 37720. 84336.
## 2 1p1c 3143 41888. 48660. 51454. 53540. 55791. 139711.
## 3 1p2c 3143 49612. 59670. 64242. 66631. 70893. 159283.
## 4 1p3c 3143 59166. 72138. 78485. 81493. 86840. 200590.
## 5 1p4c 3143 62831. 77259. 83996. 87137. 92887. 212576.
## 6 2p0c 3143 40894. 46396. 48455. 49653. 51536. 109036.
## 7 2p1c 3143 53267. 60889. 64331. 66308. 69579. 153675.
## 8 2p2c 3143 59685. 71269. 76215. 78348. 82934. 172888.
## 9 2p3c 3143 68384. 82456. 88673. 91691. 97004. 214410.
## 10 2p4c 3143 72425. 87719. 94428. 97587. 103313. 223718.
dsmed=quantile(costds$total_cost,prob=0.50,na.rm=TRUE)
cdsskim = skim(costds)
cdsskim
Name | costds |
Number of rows | 31430 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 1 |
factor | 3 |
logical | 1 |
numeric | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
areaname | 0 | 1 | 12 | 183 | 0 | 2561 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
state | 0 | 1 | FALSE | 51 | TX: 2540, GA: 1590, VA: 1330, KY: 1200 |
county | 0 | 1 | FALSE | 1877 | Was: 300, Jef: 250, Fra: 240, Jac: 230 |
family_member_count | 0 | 1 | FALSE | 10 | 1p0: 3143, 1p1: 3143, 1p2: 3143, 1p3: 3143 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
isMetro | 0 | 1 | 0.37 | FAL: 19730, TRU: 11700 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
case_id | 0 | 1 | 1589.31 | 917.22 | 1.00 | 792.00 | 1593.00 | 2386.00 | 3171.00 | ▇▇▇▇▇ |
housing_cost | 0 | 1 | 11073.67 | 4165.61 | 4209.31 | 8580.00 | 10416.00 | 12444.00 | 61735.59 | ▇▁▁▁▁ |
food_cost | 0 | 1 | 8287.50 | 3271.14 | 2220.28 | 5801.42 | 8129.16 | 10703.62 | 31178.62 | ▇▇▁▁▁ |
transportation_cost | 0 | 1 | 13593.86 | 1640.46 | 2216.46 | 12535.16 | 13698.16 | 14765.76 | 19816.48 | ▁▁▃▇▁ |
healthcare_cost | 0 | 1 | 13394.03 | 5204.55 | 3476.38 | 9667.44 | 13082.70 | 16657.82 | 37252.27 | ▅▇▃▁▁ |
other_necessities_cost | 0 | 1 | 7015.32 | 2397.42 | 2611.64 | 5286.35 | 6733.06 | 8413.09 | 28829.44 | ▇▅▁▁▁ |
childcare_cost | 0 | 1 | 9879.58 | 6778.22 | 0.00 | 5341.62 | 10166.34 | 14276.38 | 48831.09 | ▇▇▁▁▁ |
taxes | 0 | 1 | 7657.71 | 3339.80 | 1027.80 | 5597.97 | 6898.47 | 8790.21 | 47753.39 | ▇▁▁▁▁ |
total_cost | 0 | 1 | 70901.68 | 21846.55 | 30087.66 | 53776.02 | 70977.68 | 85371.34 | 223717.55 | ▇▇▁▁▁ |
median_family_income | 10 | 1 | 68316.00 | 16886.97 | 25529.98 | 57223.99 | 65955.61 | 76136.07 | 177662.47 | ▂▇▁▁▁ |
(10/17/2023) The only thing wrong that I can see with this 5-num summary is that I should be seeing different Q1 and Q3s for each group, but those numbers may not be necessary for now. If they’re needed, there are good estimates provided by the graph. I’ll have to look into this.
Update (2/20/2024): I’m seeing that I used Q1=quantile(costds$total_cost,prob=0.25,na.rm=TRUE) instead of Q1=quantile(total_cost,prob=0.25,na.rm=TRUE), which threw away the grouping established by the earlier steps of the pipe. Interesting error to make and overlook, but hey, we learn not to do that anymore!
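The bug described above is easy to reproduce on a toy tibble: inside summarize(), df$x reaches past the grouping and grabs the whole column, while the bare column name respects the groups:

```r
library(dplyr)

df = tibble(g = c("a", "a", "b", "b"), x = c(1, 2, 10, 20))
df %>%
  group_by(g) %>%
  summarize(
    right = median(x),    # per-group medians: 1.5 and 15
    wrong = median(df$x)  # the full column's median (6) repeated for every group
  )
```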
Let’s break it down by necessity that contributes to total cost!
boxplots_housing = costds %>%
ggplot(aes(x=family_member_count,y=housing_cost/1000,fill=family_member_count)) +
geom_boxplot() +
scale_y_continuous(limits = quantile(costds$housing_cost/1000, c(0.1,0.9)))
boxplots_housing
## Warning: Removed 6284 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Looks weird: 1p1c and 1p2c (and 2p1c and 2p2c) exhibit the same behavior. Hmm. (3/13/24: skipping for now! Too confusing.)
I’d like to try to explore median family income.
costds %>%
ggplot(aes(x=median_family_income))+
geom_histogram(fill="cyan")+
labs(title="Distribution of median family income in Dataset", subtitle="Median family income across all regions.")+
xlab("Median family income ($/yr)")+
ylab("Count")+
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite outside the scale range
## (`stat_bin()`).
COLds = costds %>%
group_by(county) %>%
summarise(count = n(),
minCOL=min(total_cost,na.rm=TRUE),
Q1COL=quantile(total_cost,prob=0.25,na.rm=TRUE),
medCOL=quantile(total_cost,prob=0.50,na.rm=TRUE),
avgCOL=mean(total_cost,na.rm=TRUE),
Q3COL=quantile(total_cost,prob=0.75,na.rm=TRUE),
maxCOL=max(total_cost,na.rm=TRUE))
#mfi=median_family_income)
clean_COLds = na.omit(COLds)
clean_COLds
## # A tibble: 1,877 × 8
## county count minCOL Q1COL medCOL avgCOL Q3COL maxCOL
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abbeville County 10 33966. 47132. 60306. 59188. 69649. 81662.
## 2 Acadia Parish 10 35492. 52403. 67822. 66328. 79435. 92556.
## 3 Accomack County 10 33954. 52983. 67378. 65524. 77979. 89866.
## 4 Ada County 10 37437. 60141. 79311. 78165. 98435. 109790.
## 5 Adair County 40 32895. 49651. 66650. 65487. 79784. 96457.
## 6 Adams County 120 32524. 52588. 72088. 70403. 84831. 127218.
## 7 Addison County 10 43172. 72389. 89307. 84677. 98336. 112173.
## 8 Aiken County 10 34063. 49372. 62957. 61226. 72113. 82906.
## 9 Aitkin County 10 33545. 51617. 65721. 64217. 77795. 88177.
## 10 Alachua County 10 38150. 58904. 75068. 73451. 88672. 101047.
## # ℹ 1,867 more rows
#COMMITTING TO CLEANLINESS
COLds=clean_COLds
COLds
## # A tibble: 1,877 × 8
## county count minCOL Q1COL medCOL avgCOL Q3COL maxCOL
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abbeville County 10 33966. 47132. 60306. 59188. 69649. 81662.
## 2 Acadia Parish 10 35492. 52403. 67822. 66328. 79435. 92556.
## 3 Accomack County 10 33954. 52983. 67378. 65524. 77979. 89866.
## 4 Ada County 10 37437. 60141. 79311. 78165. 98435. 109790.
## 5 Adair County 40 32895. 49651. 66650. 65487. 79784. 96457.
## 6 Adams County 120 32524. 52588. 72088. 70403. 84831. 127218.
## 7 Addison County 10 43172. 72389. 89307. 84677. 98336. 112173.
## 8 Aiken County 10 34063. 49372. 62957. 61226. 72113. 82906.
## 9 Aitkin County 10 33545. 51617. 65721. 64217. 77795. 88177.
## 10 Alachua County 10 38150. 58904. 75068. 73451. 88672. 101047.
## # ℹ 1,867 more rows
costds %>%
ggplot(aes(x=isMetro,fill=isMetro))+
geom_bar()+
labs(title="Proportion of data from metro counties")+
xlab("Metro County")+
ylab("# of cases")+
theme_bw()+
theme(legend.position="none")
Conditional testing from this result: How do the transportation costs in non-metro counties compare to metro counties? What is the extent of the gap between cost of living between both types?
Given that a county is a metro county, what is the probability that the cost of living is greater than the median of the dataset?
Marginal probability of isMetro:
marginMetro = costds %>%
count(isMetro) %>%
mutate(prop = n/sum(n))
marginMetro
## # A tibble: 2 × 3
## isMetro n prop
## <lgl> <int> <dbl>
## 1 FALSE 19730 0.628
## 2 TRUE 11700 0.372
These proportions sum to 1, as they must: metro and non-metro are mutually exclusive and exhaustive categories.
The median cost of living in this dataset is 70977.68, so counties split into two groups: those below the median and those above it. I want to know more about the relationship between cost of living and metro status by examining this boolean setup.
I want to add a couple of features to this dataset first. First, incorporate that dataset-wide median as a constant column, and compare each county’s median family income against both it and the county’s own median cost of living.
medincds=costds %>%
group_by(county) %>%
summarize(med_inc=quantile(median_family_income,prob=0.50,na.rm=TRUE))
medincds
## # A tibble: 1,877 × 2
## county med_inc
## <fct> <dbl>
## 1 Abbeville County 53251.
## 2 Acadia Parish 58170.
## 3 Accomack County 59259.
## 4 Ada County 87161.
## 5 Adair County 56373.
## 6 Adams County 68103.
## 7 Addison County 86722.
## 8 Aiken County 64666.
## 9 Aitkin County 62395.
## 10 Alachua County 77957.
## # ℹ 1,867 more rows
costds_isMetro = costds %>%
group_by(county) %>%
summarize(isMetro = first(isMetro))
costds_medianinc = costds %>%
group_by(county) %>%
summarize(family_income = first(median_family_income))
COLds=left_join(COLds,medincds,by="county")
COLds=left_join(COLds,costds_isMetro,by="county")
COLds=left_join(COLds,costds_medianinc,by="county")
#wholeds=left_join(COLds,costds,by="county")
COLds
## # A tibble: 1,877 × 11
## county count minCOL Q1COL medCOL avgCOL Q3COL maxCOL med_inc isMetro
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 Abbeville Co… 10 33966. 47132. 60306. 59188. 69649. 8.17e4 53251. FALSE
## 2 Acadia Parish 10 35492. 52403. 67822. 66328. 79435. 9.26e4 58170. TRUE
## 3 Accomack Cou… 10 33954. 52983. 67378. 65524. 77979. 8.99e4 59259. FALSE
## 4 Ada County 10 37437. 60141. 79311. 78165. 98435. 1.10e5 87161. TRUE
## 5 Adair County 40 32895. 49651. 66650. 65487. 79784. 9.65e4 56373. FALSE
## 6 Adams County 120 32524. 52588. 72088. 70403. 84831. 1.27e5 68103. TRUE
## 7 Addison Coun… 10 43172. 72389. 89307. 84677. 98336. 1.12e5 86722. FALSE
## 8 Aiken County 10 34063. 49372. 62957. 61226. 72113. 8.29e4 64666. TRUE
## 9 Aitkin County 10 33545. 51617. 65721. 64217. 77795. 8.82e4 62395. FALSE
## 10 Alachua Coun… 10 38150. 58904. 75068. 73451. 88672. 1.01e5 77957. TRUE
## # ℹ 1,867 more rows
## # ℹ 1 more variable: family_income <dbl>
COLds = COLds%>%
mutate(
NATL_med_fam_inc = dsmed, # dataset-wide median total_cost (70977.68), used as the comparison threshold
med_inc_below_NATLmed = med_inc < NATL_med_fam_inc,
hard_to_live = med_inc < medCOL
)
COLds
## # A tibble: 1,877 × 14
## county count minCOL Q1COL medCOL avgCOL Q3COL maxCOL med_inc isMetro
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 Abbeville Co… 10 33966. 47132. 60306. 59188. 69649. 8.17e4 53251. FALSE
## 2 Acadia Parish 10 35492. 52403. 67822. 66328. 79435. 9.26e4 58170. TRUE
## 3 Accomack Cou… 10 33954. 52983. 67378. 65524. 77979. 8.99e4 59259. FALSE
## 4 Ada County 10 37437. 60141. 79311. 78165. 98435. 1.10e5 87161. TRUE
## 5 Adair County 40 32895. 49651. 66650. 65487. 79784. 9.65e4 56373. FALSE
## 6 Adams County 120 32524. 52588. 72088. 70403. 84831. 1.27e5 68103. TRUE
## 7 Addison Coun… 10 43172. 72389. 89307. 84677. 98336. 1.12e5 86722. FALSE
## 8 Aiken County 10 34063. 49372. 62957. 61226. 72113. 8.29e4 64666. TRUE
## 9 Aitkin County 10 33545. 51617. 65721. 64217. 77795. 8.82e4 62395. FALSE
## 10 Alachua Coun… 10 38150. 58904. 75068. 73451. 88672. 1.01e5 77957. TRUE
## # ℹ 1,867 more rows
## # ℹ 4 more variables: family_income <dbl>, NATL_med_fam_inc <dbl>,
## # med_inc_below_NATLmed <lgl>, hard_to_live <lgl>
Time to sample this new frame to derive neat insights:
sample_colDS = COLds %>%
sample_n(50)
sample_colDS
## # A tibble: 50 × 14
## county count minCOL Q1COL medCOL avgCOL Q3COL maxCOL med_inc isMetro
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 Dillingham C… 10 39511. 67013. 8.67e4 8.47e4 1.06e5 1.18e5 57301 FALSE
## 2 McClain Coun… 10 37880. 57135. 7.63e4 7.41e4 9.15e4 1.03e5 80289. TRUE
## 3 Weld County 10 40387. 68980. 8.66e4 8.54e4 1.07e5 1.18e5 86818. TRUE
## 4 Alameda Coun… 10 57829. 97210. 1.17e5 1.18e5 1.48e5 1.61e5 122821. TRUE
## 5 Ogle County 10 35607. 56985. 7.26e4 7.00e4 8.31e4 9.61e4 76972. FALSE
## 6 Adair County 40 32895. 49651. 6.66e4 6.55e4 7.98e4 9.65e4 56373. FALSE
## 7 Indian River… 10 36742. 58945. 7.46e4 7.35e4 8.96e4 1.01e5 71933. TRUE
## 8 Pickett Coun… 10 33343. 48242. 6.12e4 5.92e4 6.85e4 8.07e4 47319. FALSE
## 9 Ziebach Coun… 10 37781. 56462. 7.45e4 7.22e4 8.58e4 1.00e5 40431. FALSE
## 10 Tulare County 10 38393. 57220. 7.24e4 7.11e4 8.59e4 9.76e4 54761. TRUE
## # ℹ 40 more rows
## # ℹ 4 more variables: family_income <dbl>, NATL_med_fam_inc <dbl>,
## # med_inc_below_NATLmed <lgl>, hard_to_live <lgl>
To be clear about the definitions (which are mathematically represented in the code): NATL_med_fam_inc is a constant threshold, the dataset-wide median total cost (70977.68); despite the name, it is the median cost of living, not a median income. Each case county has a median family income, and med_inc_below_NATLmed records whether that income falls below the threshold. As a reminder, this project is primarily interested in situations where families struggle with the cost of living in America, so it makes sense to name the new logical comparison variable with respect to this purpose. Next, “hard_to_live” is defined in its most trivial sense: the median family income of the area is less than the median cost of living for that area.
From the most recent data frame, it looks like we might be able to consider whether we can observe more “hard-to-live” records in metropolitan counties. Question: is there a significant association between a county being metropolitan and being borderline-to-extremely unlivable?
A summary of the features we have for this dataset:
colnames(sample_colDS)
## [1] "county" "count" "minCOL"
## [4] "Q1COL" "medCOL" "avgCOL"
## [7] "Q3COL" "maxCOL" "med_inc"
## [10] "isMetro" "family_income" "NATL_med_fam_inc"
## [13] "med_inc_below_NATLmed" "hard_to_live"
Checking out isMetro:
sample_colDS %>%
group_by(isMetro,hard_to_live) %>%
summarize(count=n()) %>%
mutate(prop=count/sum(count))
## `summarise()` has grouped output by 'isMetro'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups: isMetro [2]
## isMetro hard_to_live count prop
## <lgl> <lgl> <int> <dbl>
## 1 FALSE FALSE 4 0.148
## 2 FALSE TRUE 23 0.852
## 3 TRUE FALSE 13 0.565
## 4 TRUE TRUE 10 0.435
#REVISIT THIS PROP TABLE; something is wrong. Proportions are not being calculated correctly...
#EDIT from 5 minutes ago: it turns out that this proportions table is a row-based proportions
#table. The values are replicable from adding parameter `margin=1` to prop.table().
# table() sorts both dimensions by level (FALSE, TRUE); relabeling afterwards with
# unique() can scramble the labels, so name the dimensions in the table() call itself.
tbl = table(isMetro = sample_colDS$isMetro, hard_to_live = sample_colDS$hard_to_live)
prop.table(tbl)
##        hard_to_live
## isMetro FALSE TRUE
##   FALSE  0.08 0.46
##   TRUE   0.26 0.20
What are the chances this sample created this table given the dataset? The results here are surprising.
TODO: run some stats tests on this proportions table. Investigate other samples? Re-examine the hard_to_live definition?
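For the first TODO, a chi-squared test of independence on the 2x2 isMetro x hard_to_live table is a natural starting point. A sketch, assuming sample_colDS from above; with only 50 sampled counties some expected cell counts may be small, in which case fisher.test() is the safer choice:

```r
tbl = table(isMetro = sample_colDS$isMetro, hard_to_live = sample_colDS$hard_to_live)
chisq.test(tbl)   # H0: metro status and hard_to_live are independent
fisher.test(tbl)  # exact test, robust to small expected cell counts
```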
Changing gears
Let’s think about how individual costs and necessities are correlated!
#wholeds = wholeds %>%
# group_by(county) %>%
# summarize(county=first(county))
#wholeds = left_join(wholeds,costds,by="county")
#wholeds
#wholeds has been giving me problems and I'm over it 3/14/24
3/13/24: Strange behavior with the new dataset I’m calling wholeds. My point for adding wholeds was to include all of the median family incomes in the COLds dataset without breaking anything. I’m hoping to be able to solve this problem so that I can run regression analyses predicting what taxes should be for each family given median income.
3/14/24 Try this on for size:
COLds
## # A tibble: 1,877 × 14
## county count minCOL Q1COL medCOL avgCOL Q3COL maxCOL med_inc isMetro
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 Abbeville Co… 10 33966. 47132. 60306. 59188. 69649. 8.17e4 53251. FALSE
## 2 Acadia Parish 10 35492. 52403. 67822. 66328. 79435. 9.26e4 58170. TRUE
## 3 Accomack Cou… 10 33954. 52983. 67378. 65524. 77979. 8.99e4 59259. FALSE
## 4 Ada County 10 37437. 60141. 79311. 78165. 98435. 1.10e5 87161. TRUE
## 5 Adair County 40 32895. 49651. 66650. 65487. 79784. 9.65e4 56373. FALSE
## 6 Adams County 120 32524. 52588. 72088. 70403. 84831. 1.27e5 68103. TRUE
## 7 Addison Coun… 10 43172. 72389. 89307. 84677. 98336. 1.12e5 86722. FALSE
## 8 Aiken County 10 34063. 49372. 62957. 61226. 72113. 8.29e4 64666. TRUE
## 9 Aitkin County 10 33545. 51617. 65721. 64217. 77795. 8.82e4 62395. FALSE
## 10 Alachua Coun… 10 38150. 58904. 75068. 73451. 88672. 1.01e5 77957. TRUE
## # ℹ 1,867 more rows
## # ℹ 4 more variables: family_income <dbl>, NATL_med_fam_inc <dbl>,
## # med_inc_below_NATLmed <lgl>, hard_to_live <lgl>
It has everything I want. I’m over it. Let’s get on with the analysis. (3/14/24 5:46P: no it doesn’t. Just use group_by, bro.)
At this moment, my curiosity is leading me to try predicting a family’s cost in taxes (okay, for some reason it’s taken me this long to realize that the easiest solution would be to just group the whole dataset by family size and type…. and just have multiple lines. Like… Duh? What?? Why didn’t I think of that before?)
Just use costds and group by family_member_count.
costds %>%
group_by(family_member_count)
## # A tibble: 31,430 × 15
## # Groups: family_member_count [10]
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 1 AL TRUE Montgomery, AL… Autau… 1p0c 8506.
## 2 1 AL TRUE Montgomery, AL… Autau… 1p1c 12068.
## 3 1 AL TRUE Montgomery, AL… Autau… 1p2c 12068.
## 4 1 AL TRUE Montgomery, AL… Autau… 1p3c 15257.
## 5 1 AL TRUE Montgomery, AL… Autau… 1p4c 15257.
## 6 1 AL TRUE Montgomery, AL… Autau… 2p0c 10180.
## 7 1 AL TRUE Montgomery, AL… Autau… 2p1c 12068.
## 8 1 AL TRUE Montgomery, AL… Autau… 2p2c 12068.
## 9 1 AL TRUE Montgomery, AL… Autau… 2p3c 15257.
## 10 1 AL TRUE Montgomery, AL… Autau… 2p4c 15257.
## # ℹ 31,420 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
# Create the dataset
data = tibble(
V1 = c("A", "A", "B", "B", "C", "C"), #county
V2 = c(3, 3.53, 6.20, 2.2, 1, 3), #taxes
V3 = c(242, 242, 289, 289, 303, 303) #median family income
)
# Convert V1 to a factor variable
data$V1 = as.factor(data$V1)
# Fit a linear regression model
model = lm(V2 ~ V3 + V1, data = data)
# Print the summary of the regression model
summary(model)
##
## Call:
## lm(formula = V2 ~ V3 + V1, data = data)
##
## Residuals:
## 1 2 3 4 5 6
## -0.265 0.265 2.000 -2.000 -1.000 1.000
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28352 8.26434 1.002 0.390
## V3 -0.02074 0.03014 -0.688 0.541
## V1B 1.90967 1.66806 1.145 0.335
## V1C NA NA NA NA
##
## Residual standard error: 1.839 on 3 degrees of freedom
## Multiple R-squared: 0.3247, Adjusted R-squared: -0.1255
## F-statistic: 0.7213 on 2 and 3 DF, p-value: 0.5549
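The “(1 not defined because of singularities)” note is worth unpacking: V3 is constant within each level of V1, so once the intercept and the V1 dummies are in the model, V3 is a linear combination of them and lm() drops one coefficient. alias() makes the dependency explicit:

```r
library(tibble)

data = tibble(
  V1 = factor(c("A", "A", "B", "B", "C", "C")),  # county
  V2 = c(3, 3.53, 6.20, 2.2, 1, 3),              # taxes
  V3 = c(242, 242, 289, 289, 303, 303)           # median family income, constant per county
)
model = lm(V2 ~ V3 + V1, data = data)
alias(model)  # reports the dropped V1C term as a linear combination of the others
```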
# Create the dataset
costds
## # A tibble: 31,430 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 1 AL TRUE Montgomery, AL… Autau… 1p0c 8506.
## 2 1 AL TRUE Montgomery, AL… Autau… 1p1c 12068.
## 3 1 AL TRUE Montgomery, AL… Autau… 1p2c 12068.
## 4 1 AL TRUE Montgomery, AL… Autau… 1p3c 15257.
## 5 1 AL TRUE Montgomery, AL… Autau… 1p4c 15257.
## 6 1 AL TRUE Montgomery, AL… Autau… 2p0c 10180.
## 7 1 AL TRUE Montgomery, AL… Autau… 2p1c 12068.
## 8 1 AL TRUE Montgomery, AL… Autau… 2p2c 12068.
## 9 1 AL TRUE Montgomery, AL… Autau… 2p3c 15257.
## 10 1 AL TRUE Montgomery, AL… Autau… 2p4c 15257.
## # ℹ 31,420 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
# Ensure family_member_count is a factor (already converted above; this is a no-op)
costds$family_member_count = as.factor(costds$family_member_count)
# Fit a linear regression model. Note: lm()'s data-frame argument is `data`;
# an argument named `costds` would be silently disregarded.
model = lm(taxes ~ median_family_income + family_member_count, data = costds)
# Print the summary of the regression model
summary(model)
## 
## Call:
## lm(formula = taxes ~ median_family_income + family_member_count, 
##     data = costds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9256.3 -1401.2  -102.2  1141.7 30876.4 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.652e+03  7.336e+01 -22.524  < 2e-16 ***
## median_family_income     1.060e-01  8.461e-04 125.292  < 2e-16 ***
## family_member_count1p1c  5.075e+02  6.389e+01   7.943 2.03e-15 ***
## family_member_count1p2c  1.202e+03  6.389e+01  18.815  < 2e-16 ***
## family_member_count1p3c  3.296e+03  6.389e+01  51.589  < 2e-16 ***
## family_member_count1p4c  3.086e+03  6.389e+01  48.294  < 2e-16 ***
## family_member_count2p0c  9.923e+02  6.389e+01  15.531  < 2e-16 ***
## family_member_count2p1c  2.177e+03  6.389e+01  34.065  < 2e-16 ***
## family_member_count2p2c  2.452e+03  6.389e+01  38.372  < 2e-16 ***
## family_member_count2p3c  3.667e+03  6.389e+01  57.385  < 2e-16 ***
## family_member_count2p4c  3.297e+03  6.389e+01  51.606  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2533 on 31409 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.4252, Adjusted R-squared:  0.425 
## F-statistic:  2324 on 10 and 31409 DF,  p-value: < 2.2e-16
Plot this:
costds %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 10 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_point()`).
Noooooopppppeeee
Sample time!
#identify unique ID column by which each of the rows are grouped
uniqueIDs = unique(costds$case_id)
#sample by unique IDs
sampledUID = sample(uniqueIDs, 250)
#extract rows for sample by unique IDs
sampleDataByUID = costds[costds$case_id %in% sampledUID, ]
#display samples
print(sampleDataByUID)
## # A tibble: 2,500 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 23 AL FALSE Dale County, AL Dale … 1p0c 6276
## 2 23 AL FALSE Dale County, AL Dale … 1p1c 7608
## 3 23 AL FALSE Dale County, AL Dale … 1p2c 7608
## 4 23 AL FALSE Dale County, AL Dale … 1p3c 10680
## 5 23 AL FALSE Dale County, AL Dale … 1p4c 10680
## 6 23 AL FALSE Dale County, AL Dale … 2p0c 6312
## 7 23 AL FALSE Dale County, AL Dale … 2p1c 7608
## 8 23 AL FALSE Dale County, AL Dale … 2p2c 7608
## 9 23 AL FALSE Dale County, AL Dale … 2p3c 10680
## 10 23 AL FALSE Dale County, AL Dale … 2p4c 10680
## # ℹ 2,490 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
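The base-R sampling above works fine; the same idea (sample whole counties, then keep all ten family-type rows for each) can also be written in the tidyverse idiom used elsewhere in this notebook. A sketch, assuming costds is loaded; the set.seed() call is an added assumption for reproducibility:

```r
set.seed(42)  # reproducible draw (not in the original sampling)
sampled_ids = costds %>%
  distinct(case_id) %>%
  slice_sample(n = 250)                   # 250 random counties
sampleDataByUID = costds %>%
  semi_join(sampled_ids, by = "case_id")  # all 10 rows for each sampled county
```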
Hoping for a more readable plot:
#change on 3/19/24
#wanting 1p group
singles = sampleDataByUID %>%
filter(family_member_count %in% c("1p0c","1p1c","1p2c","1p3c","1p4c")) # %in%, not ==: == recycles the vector and only matches positionally
singles
## # A tibble: 1,250 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 23 AL FALSE Dale County, AL Dale … 1p0c 6276
## 2 23 AL FALSE Dale County, AL Dale … 1p1c 7608
## 3 23 AL FALSE Dale County, AL Dale … 1p2c 7608
## 4 23 AL FALSE Dale County, AL Dale … 1p3c 10680
## 5 23 AL FALSE Dale County, AL Dale … 1p4c 10680
## 6 34 AL TRUE Henry County, … Henry… 1p0c 5904
## 7 34 AL TRUE Henry County, … Henry… 1p1c 7608
## 8 34 AL TRUE Henry County, … Henry… 1p2c 7608
## 9 34 AL TRUE Henry County, … Henry… 1p3c 10296
## 10 34 AL TRUE Henry County, … Henry… 1p4c 10296
## # ℹ 1,240 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
#ensuring we only have 1p group:
unique(singles$family_member_count)
## [1] 1p0c 1p1c 1p2c 1p3c 1p4c
## Levels: 1p0c 1p1c 1p2c 1p3c 1p4c 2p0c 2p1c 2p2c 2p3c 2p4c
couples = sampleDataByUID %>%
filter(family_member_count %in% c("2p0c","2p1c","2p2c","2p3c","2p4c")) # %in% avoids the ==-recycling pitfall
couples
## # A tibble: 1,250 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 23 AL FALSE Dale County, AL Dale … 2p0c 6312
## 2 23 AL FALSE Dale County, AL Dale … 2p1c 7608
## 3 23 AL FALSE Dale County, AL Dale … 2p2c 7608
## 4 23 AL FALSE Dale County, AL Dale … 2p3c 10680
## 5 23 AL FALSE Dale County, AL Dale … 2p4c 10680
## 6 34 AL TRUE Henry County, … Henry… 2p0c 6384
## 7 34 AL TRUE Henry County, … Henry… 2p1c 7608
## 8 34 AL TRUE Henry County, … Henry… 2p2c 7608
## 9 34 AL TRUE Henry County, … Henry… 2p3c 10296
## 10 34 AL TRUE Henry County, … Henry… 2p4c 10296
## # ℹ 1,240 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
unique(couples$family_member_count)
## [1] 2p0c 2p1c 2p2c 2p3c 2p4c
## Levels: 1p0c 1p1c 1p2c 1p3c 1p4c 2p0c 2p1c 2p2c 2p3c 2p4c
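Worth noting: filtering with `==` against a length-5 vector only behaved here because each county's rows happen to arrive in the same order as the comparison vector; base R recycles the shorter vector, so reordered rows would be silently dropped. `%in%` tests membership and is order-proof. A toy sketch of the difference (made-up data, not the real dataset):

```r
library(dplyr)

# Toy data: two "counties" with the five 1p groups each, first county shuffled
toy <- tibble(family_member_count = c("1p1c","1p0c","1p2c","1p4c","1p3c",
                                      "1p0c","1p1c","1p2c","1p3c","1p4c"))
groups <- c("1p0c","1p1c","1p2c","1p3c","1p4c")

# `==` recycles `groups` row-by-row: shuffled rows silently fail the comparison
nrow(filter(toy, family_member_count == groups))   # 6 rows survive, not 10

# `%in%` tests membership, so every matching row is kept
nrow(filter(toy, family_member_count %in% groups)) # 10
```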
sampleDataByUID %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
#change on 3/19/24
#graph by group
#re-examined 4/12/24
singles %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Median Family Income", y="Taxes Paid", color="Family size",
title="Taxes paid by singles given income") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
couples %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Median Family Income", y="Taxes Paid", color="Family size", title="Taxes paid by couples given income") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Okay then! Let’s just do one group, please! My eyes hurt.
costds$family_member_count = factor(costds$family_member_count)
singleNoKids=costds %>%
filter(family_member_count==levels(costds$family_member_count)[1])
singleNoKids
## # A tibble: 3,143 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 1 AL TRUE Montgomery, AL… Autau… 1p0c 8506.
## 2 2 AL TRUE Daphne-Fairhop… Baldw… 1p0c 8616
## 3 3 AL FALSE Barbour County… Barbo… 1p0c 5856
## 4 4 AL TRUE Birmingham-Hoo… Bibb … 1p0c 7974.
## 5 5 AL TRUE Birmingham-Hoo… Bloun… 1p0c 7246.
## 6 6 AL FALSE Bullock County… Bullo… 1p0c 6384
## 7 7 AL FALSE Butler County,… Butle… 1p0c 5748
## 8 8 AL TRUE Anniston-Oxfor… Calho… 1p0c 5784
## 9 9 AL FALSE Chambers Count… Chamb… 1p0c 6408
## 10 10 AL FALSE Cherokee Count… Chero… 1p0c 5748
## # ℹ 3,133 more rows
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
(3/19/24) Some reflective notes about this chunk: I spent a lot of time on it because it was showing me an unexpected return. I had the following:
#costds$family_member_count = factor(costds$family_member_count)
#costds %>%
# filter(family_member_count=="1p0c")
#costds
and then I was all surprised-Pikachu when I wasn't getting the expected return. Silly mistake: the filtered result was never assigned, so printing costds just showed the unfiltered data. Gotta make sure I'm paying attention to what I want versus what I am actually returning.
singleNoKids %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
All graphs together:
allplotted = sampleDataByUID %>% ggplot(aes(x=median_family_income, y=taxes, color=family_member_count)) +
geom_point() +
#geom_smooth(method="lm", se=FALSE) +
facet_wrap(~family_member_count) +
labs(x="Median Family Income", y="Taxes Paid", color="Family size") +
theme_minimal() +
theme(axis.text.x = element_blank(), # Remove x-axis labels
axis.text.y = element_blank(), # Remove y-axis labels
strip.background = element_blank()) + # Remove facet background
labs(subtitle="Note: these facets share a common scale, which means the plots can be compared despite unlabeled axes.")
allplotted
Adding linear regression lines:
# Define a function to fit the linear regression model and calculate summaries
fitSumLM <- function(dependent_var, independent_vars, data) {
# Construct the formula for the linear model
formula <- as.formula(paste(dependent_var, "~", paste(independent_vars, collapse = " + ")))
# Fit the linear model
model <- lm(formula, data = data)
# Print the summary of the fitted model
print(summary(model))
# Return the fitted model
return(model)
}
fitSumLM("taxes",c("median_family_income","family_member_count"),sampleDataByUID)
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6329.3 -1320.6 -79.8 1157.4 13069.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.859e+03 2.397e+02 -7.754 1.29e-14 ***
## median_family_income 1.087e-01 2.744e-03 39.608 < 2e-16 ***
## family_member_count1p1c 5.103e+02 2.086e+02 2.446 0.0145 *
## family_member_count1p2c 1.189e+03 2.086e+02 5.701 1.33e-08 ***
## family_member_count1p3c 3.291e+03 2.086e+02 15.773 < 2e-16 ***
## family_member_count1p4c 3.102e+03 2.086e+02 14.871 < 2e-16 ***
## family_member_count2p0c 9.972e+02 2.086e+02 4.780 1.86e-06 ***
## family_member_count2p1c 2.190e+03 2.086e+02 10.499 < 2e-16 ***
## family_member_count2p2c 2.460e+03 2.086e+02 11.794 < 2e-16 ***
## family_member_count2p3c 3.657e+03 2.086e+02 17.529 < 2e-16 ***
## family_member_count2p4c 3.314e+03 2.086e+02 15.885 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2332 on 2489 degrees of freedom
## Multiple R-squared: 0.4779, Adjusted R-squared: 0.4758
## F-statistic: 227.8 on 10 and 2489 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = formula, data = data)
##
## Coefficients:
## (Intercept) median_family_income family_member_count1p1c
## -1858.5479 0.1087 510.3012
## family_member_count1p2c family_member_count1p3c family_member_count1p4c
## 1189.2756 3290.6074 3102.4217
## family_member_count2p0c family_member_count2p1c family_member_count2p2c
## 997.1658 2190.3589 2460.3912
## family_member_count2p3c family_member_count2p4c
## 3656.9417 3313.8517
4/1/24 My current understanding of this output is that it is very
unlikely that the predictors have no effect on the response. The
p-values indicate as much, especially for the coefficients coded ***.
In other words, each independent variable (median family income and
family size) has some effect on the response variable (taxes paid).
The t value for each predictor tells me how strongly that feature
influences the response; judging by those, median family income,
compared to the other features, explains most of the variability in
taxes paid in the sample passed to the function (which just fits,
summarizes, and returns a linear model). Because I ran a multiple
linear regression on this data, I have made some assumptions about it:
a linear relationship between predictors and response, independence of
the residuals, and constant variance of the residuals
(homoscedasticity).
I tried to get the code below to plot regression lines, then realized existing libraries can do this for me; no need to reinvent the wheel. The code doesn't work anyway, for some reason I haven't yet tracked down.
## Create a dataframe to store the regression coefficients
#coeffDF <- data.frame()
## Define a function to fit the linear regression model and add the regression line to each facet
#add_regression_line <- function(givenData, independent_var, dependent_var) {
# model <- lm(dependent_var ~ independent_var, data = givenData)
# coefficients <- summary(model)$coefficients[, 1]
# coeffDF <<- bind_rows(coeffDF, cbind(variable = independent_var, coefficients))
# geom_smooth(method = "lm", se = FALSE, color = "red") +
# geom_text(aes(label = paste("R-squared:", round(summary(model)$adj.r.squared, 3))), x = Inf, y = Inf, hjust = 1, vjust = 1)
#}
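For what it's worth, ggplot2 handles this without any coefficient bookkeeping: geom_smooth(method = "lm") fits a separate regression inside each facet. A minimal sketch of what the abandoned function was reaching for, assuming sampleDataByUID as defined earlier:

```r
library(ggplot2)
library(dplyr)

# Per-facet least-squares lines: geom_smooth() refits lm() within each facet
sampleDataByUID %>%
  ggplot(aes(x = median_family_income, y = taxes)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~family_member_count) +
  labs(x = "Median Family Income", y = "Taxes Paid") +
  theme_minimal()
```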
Here I test each assumption:
#generate the model explicitly and gather residuals
MLRmodel = lm(formula = taxes ~ median_family_income+family_member_count, sampleDataByUID)
residuals = residuals(MLRmodel)
#summary(MLRmodel) #just verifying that the model is the same.
#I'll need this library:
#install.packages("lmtest")
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#Assumption 1
ggplot(sampleDataByUID, aes(x=residuals, y=median_family_income)) +
geom_point() +
labs(x="Residuals",
y="MFI",
title="Residual Plot Against Median Family Income (MFI)")
ggplot(sampleDataByUID, aes(x=residuals, y=family_member_count)) +
geom_point() +
labs(x="Residuals",
y="Family Size",
title="Residual Plot Against Family Size")
The graphs show no trend between the residuals and the independent
variables used in the model, which is consistent with the linearity assumption.
#Assumption 2
dw = dwtest(MLRmodel)
print(dw)
##
## Durbin-Watson test
##
## data: MLRmodel
## DW = 0.54418, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
The test yields a significant result with DW = 0.54418, a value much closer to 0 than to 2, the value that would indicate no autocorrelation. My idea for why this might be is that I did not separate the data appropriately to satisfy this assumption. Generally, family sizes containing 1p will make less income than family sizes containing 2p, so rows from these groups correlate with each other.
This will definitely need to be further explored and addressed. The sample data might need to be modified to subset the groups of one and two parents.
#Assumption 3:
bp=bptest(MLRmodel)
print(bp)
##
## studentized Breusch-Pagan test
##
## data: MLRmodel
## BP = 371.72, df = 10, p-value < 2.2e-16
4/12/24 The Breusch-Pagan test is also significant, which points to heteroscedastic residuals. So, I definitely need to regroup the sample data I’m using.
head(singles)
## # A tibble: 6 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 23 AL FALSE Dale County, AL Dale … 1p0c 6276
## 2 23 AL FALSE Dale County, AL Dale … 1p1c 7608
## 3 23 AL FALSE Dale County, AL Dale … 1p2c 7608
## 4 23 AL FALSE Dale County, AL Dale … 1p3c 10680
## 5 23 AL FALSE Dale County, AL Dale … 1p4c 10680
## 6 34 AL TRUE Henry County, A… Henry… 1p0c 5904
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
head(couples)
## # A tibble: 6 × 15
## case_id state isMetro areaname county family_member_count housing_cost
## <dbl> <fct> <lgl> <chr> <fct> <fct> <dbl>
## 1 23 AL FALSE Dale County, AL Dale … 2p0c 6312
## 2 23 AL FALSE Dale County, AL Dale … 2p1c 7608
## 3 23 AL FALSE Dale County, AL Dale … 2p2c 7608
## 4 23 AL FALSE Dale County, AL Dale … 2p3c 10680
## 5 23 AL FALSE Dale County, AL Dale … 2p4c 10680
## 6 34 AL TRUE Henry County, A… Henry… 2p0c 6384
## # ℹ 8 more variables: food_cost <dbl>, transportation_cost <dbl>,
## # healthcare_cost <dbl>, other_necessities_cost <dbl>, childcare_cost <dbl>,
## # taxes <dbl>, total_cost <dbl>, median_family_income <dbl>
#set labels for number of children:
customLabels = c("1p0c"="0",
"1p1c"="1",
"1p2c"="2",
"1p3c"="3",
"1p4c"="4")
singles %>% ggplot(aes(x=median_family_income,y=taxes,color=family_member_count)) +
geom_point()+
scale_color_manual(
values=c("1p0c"="maroon",
"1p1c"="darkorange",
"1p2c"="olivedrab",
"1p3c"="goldenrod",
"1p4c"="indianred"),
labels=customLabels)+ #reuse the label vector instead of repeating it inline
labs(title="Singles' taxes paid against median income by number of children",
x="Median Family Income (yearly)",
y="Taxes Paid (yearly)",
color="Number of Children")
Linear regression time! This time it’s the same multiple linear model,
but without the two-parent groups found in the previous data.
#Setup model and residuals:
MLRmodel = lm(formula = taxes ~ median_family_income+family_member_count, singles)
residuals = residuals(MLRmodel)
#Assumption 1:
ggplot(singles, aes(x=residuals, y=median_family_income)) +
geom_point() +
labs(x="Residuals",
y="MFI",
title="Residual Plot Against Median Family Income (MFI)")
ggplot(singles, aes(x=residuals, y=family_member_count)) +
geom_point() +
labs(x="Residuals",
y="Family Size",
title="Residual Plot Against Family Size")
#Assumption 2:
dw = dwtest(MLRmodel)
print(dw)
##
## Durbin-Watson test
##
## data: MLRmodel
## DW = 0.58734, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
#Assumption 3:
bp=bptest(MLRmodel)
print(bp)
##
## studentized Breusch-Pagan test
##
## data: MLRmodel
## BP = 196.13, df = 5, p-value < 2.2e-16
Interesting. The autocorrelation barely budged (DW = 0.58734, still far below 2). Time to separate even further! Metro and non-metro singles!
metroSingles = singles %>%
filter(isMetro==TRUE)
unique(metroSingles$isMetro)
## [1] TRUE
nonmetroSingles = singles %>%
filter(isMetro==FALSE)
unique(nonmetroSingles$isMetro)
## [1] FALSE
Easy.
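A sketch of the natural next step: refit the same model on one of the new subsets and re-run the assumption checks (assumes metroSingles from above and that lmtest is still loaded):

```r
library(lmtest)

# Refit on metro-area singles only, then re-check assumptions 2 and 3
metroModel <- lm(taxes ~ median_family_income + family_member_count,
                 data = metroSingles)
print(dwtest(metroModel))  # Durbin-Watson: independence of residuals
print(bptest(metroModel))  # Breusch-Pagan: constant variance
```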
For future: Compare differences between childcare cost, median family income, and family size.