Current deliverable: RMD Notebook of Cost of Living Analysis

Project objective and overview:

This project uses the Kaggle dataset "US Cost of Living Dataset (1877 Counties)" by asaniczka to practice skills in data engineering, analysis, visualization, predictive modeling, diagnostic analysis, and presentation of insights.

According to the Kaggle dataset page,

The US Family Budget Dataset provides insights into the cost of living in different US counties based on the Family Budget Calculator by the Economic Policy Institute (EPI). This dataset offers community-specific estimates for ten family types, including one or two adults with zero to four children, in all 1877 counties and metro areas across the United States.
The dataset suggests many possible tasks, but its size makes it daunting. It was difficult at first to analyze anything of substance, so I began by looking through what the data contained and how it was composed. This gave me a cursory view of a) what kinds of variables I am working with, b) their types, and c) potential relationships each variable might have with the others.
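A first pass over the variables and their types could look something like the sketch below. It is written in plain Python as an illustration of the logic, not as the notebook's actual code. The column names family_member_count and housing_cost come from the notes on this page; the state column, the sample values, and the helper names infer_type and describe_columns are assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample rows. Only the column names family_member_count and
# housing_cost appear in the project notes; "state" and all values are invented.
SAMPLE_CSV = """family_member_count,housing_cost,state
1p0c,8505.7,AL
2p1c,12067.5,AL
2p4c,15257.2,TX
"""


def infer_type(values):
    """Label a column 'numeric' if every non-empty value parses as a float."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "categorical"
    return "numeric"


def describe_columns(csv_text):
    """Map each column name in the CSV text to its inferred type label."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {name: infer_type([row[name] for row in rows]) for name in columns}


column_types = describe_columns(SAMPLE_CSV)
for name, kind in column_types.items():
    print(f"{name}: {kind}")
```

Columns labeled categorical here are the ones that would later be converted to factors, while the numeric ones are candidates for histograms and numerical summaries.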

I aim to maintain this page to document my progress on this project. Below, I record the work that has contributed to the project's goals, along with ideas to work on and their respective statuses.

On average, I spend approximately 5 hours per week on this project. I anticipate that every hour I spend on the project


Feature matrix:

Feature: Various Exploratory Analyses of Data
Scope: Broad look into the dataset to bring out questions about what the data can show.
Milestones and Dates:
  • 10/17/23: Broad boxplots and histograms, factorizing categorical data, numerical summaries
  • 02/20/24: Coloring/styling graphs, cleaning/reformatting dataframes, proportion and probability tables
Status: EDA complete; delving deeper!
Roadblocks:
  • Strange behavior in the family_member_count versus housing_cost boxplot. This may signal a systematic data collection decision that could affect the analysis; the cause remains unclear.
  • Struggled a little with the number of datasets in play, but managed to troubleshoot and make progress.
Value to Add: Gain a better idea of the data being investigated; formulate specific and well-formed questions to ask and examine.
Notes:
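The numerical summaries and proportion tables listed among the milestones can be sketched as follows, again in plain Python as an illustration and not the notebook's own code. The records are invented, and the helper names five_number_summary, summarize_by_group, and proportion_table are hypothetical. The second group is deliberately constant to show one way a per-group boxplot can look odd (a box with zero height); this is an assumption for demonstration, not a diagnosis of the family_member_count versus housing_cost issue noted above.

```python
from collections import Counter, defaultdict
from statistics import quantiles

# Made-up (family_member_count, housing_cost) pairs for illustration only.
records = [
    ("1p0c", 8000.0), ("1p0c", 9000.0), ("1p0c", 10000.0), ("1p0c", 11000.0),
    ("2p2c", 12000.0), ("2p2c", 12000.0), ("2p2c", 12000.0), ("2p2c", 12000.0),
]


def five_number_summary(values):
    """Return the min, quartiles, and max that a boxplot is drawn from."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}


def summarize_by_group(pairs):
    """Group costs by label and summarize each group separately."""
    groups = defaultdict(list)
    for label, cost in pairs:
        groups[label].append(cost)
    return {label: five_number_summary(costs) for label, costs in groups.items()}


def proportion_table(labels):
    """Return the share of observations that falls under each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


summaries = summarize_by_group(records)
proportions = proportion_table(label for label, _ in records)

for label, summary in summaries.items():
    iqr = summary["q3"] - summary["q1"]
    print(label, summary, "IQR:", iqr)
print(proportions)
```

A group whose interquartile range is zero, like the second one here, would be worth checking against the raw rows before drawing conclusions from the plot.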



Idea matrix
Ideas Backlog Priority Work in Progress Completed
a Backlog Priority Work in Progress Completed