Data Wrangling Final Project

Role: Project Manager / Data Analyst

Team Size: 3 students

Tools: R (tidyverse, stringr, dplyr, ggplot2, maps, sf, patchwork, scales), GitHub


My Role

  • Defined the project scope and coordinated the team's workflow as Project Manager, keeping tasks and timelines clearly aligned.

  • Set up a shared environment to streamline collaboration and maintain version control across the team.

  • Led the data cleaning process and resolved inconsistencies to prepare the datasets for analysis.

  • Built visualizations that showcased the cleaned datasets and surfaced key insights for decision-making.

Problem Statement


Large public datasets often contain transcription errors, duplicates, inconsistent formatting, and missing values. Our goal was to take two government datasets:

  • Small Business Administration (SBA) loan data

  • Annual Survey of Manufacturers (ASM)

and transform them into clean, consistent, and joinable data sources that could provide insights into U.S. small businesses and the manufacturing sector.

Highlights of the Process


  1. Initial Data Cleaning & Audit

  2. Handling Missing Data

  3. Dataset Integration (SBA + ASM)

Full analysis available on my GitHub.
Here we can see that the BorrCity column needs to be converted to title case.
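A minimal sketch of that fix with stringr's str_to_title() (BorrCity is the column named above; the example city value is hypothetical):

library(tidyverse)  # loads dplyr and stringr

# Convert all-caps city names (e.g. "NEW BRUNSWICK") to title case
foia_data <- foia_data %>%
  mutate(BorrCity = str_to_title(BorrCity))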
  • Found 192 missing values in key columns

  • Instead of dropping those rows, we joined the dataset with FDIC bank data to fill them in

# Audit: count missing or blank values in every column
foia_data %>%
  summarise(across(everything(), ~ sum(is.na(.x) | .x == "")))

# Backfill missing lender locations by joining FDIC bank data on lender name,
# then keep only the original column set
foia_data <- foia_data %>%
  left_join(lender_location, by = "ThirdPartyLender_Name") %>%
  select(all_of(names(foia_data)))

# Classify six-digit NAICS codes into broad industry sectors
foia_data <- foia_data %>%
  mutate(NaicsCode = as.character(NaicsCode)) %>%
  mutate(NaicsIndustry = case_when(
    str_detect(NaicsCode, "^11\\d{4}$") ~ "Agriculture",
    str_detect(NaicsCode, "^21\\d{4}$") ~ "Mining",
    # ... remaining sectors elided ...
  ))

# Recode ASM state names to two-letter abbreviations to match the SBA data
asm_data <- asm_data %>%
  mutate(State = case_when(
    str_detect(State, "Alabama") ~ "AL",
    str_detect(State, "Alaska") ~ "AK",
    # ... remaining states elided ...
  ))

# Join ASM data with the state-year summary of the loan data (`test`)
asm_foia_merge <- left_join(asm_data, test, by = c("State", "Year"))
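For the backfill step, a more explicit variant is to coalesce() each original column with its joined counterpart. A minimal sketch, assuming the FDIC table carries ThirdPartyLender_City and ThirdPartyLender_State columns (assumed names) that collide with the loan data and get suffixed by the join:

# Hedged sketch: explicit backfill with coalesce()
foia_filled <- foia_data %>%
  left_join(lender_location, by = "ThirdPartyLender_Name",
            suffix = c(".x", ".y")) %>%
  mutate(
    ThirdPartyLender_City  = coalesce(ThirdPartyLender_City.x,
                                      ThirdPartyLender_City.y),
    ThirdPartyLender_State = coalesce(ThirdPartyLender_State.x,
                                      ThirdPartyLender_State.y)
  ) %>%
  select(-ends_with(".x"), -ends_with(".y"))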
  • Made a custom column to classify NAICS codes

  • Made a column of state abbreviations to match the ASM State column (e.g., the format was "New Jersey" instead of "NJ")

  • Left-joined with the ASM data to view loan data alongside manufacturing statistics

Some of the columns from the joined data
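The screenshot of those columns isn't reproduced here, but a quick way to peek at the merged data (the non-key column names below are illustrative assumptions; any_of() tolerates absent ones):

# Inspect a handful of columns from the merged data frame
asm_foia_merge %>%
  select(any_of(c("State", "Year", "NaicsIndustry", "GrossApproval"))) %>%
  glimpse()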

Graphs and Visualizations


After conducting multiple joins and heavily cleaning the data, these were the key figures we were able to produce:
  • California consistently has the most jobs supported across the years.

  • In 2021, more jobs were supported across the US than in previous years.

Jobs Supported 2018-2021 By State
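The figure itself isn't reproduced here, but a sketch of how such a state-level choropleth can be built with ggplot2 and maps (JobsSupported is an assumed column name for the per-loan jobs figure):

library(tidyverse)
library(maps)
library(scales)

# Total jobs supported per state and year
state_jobs <- asm_foia_merge %>%
  group_by(State, Year) %>%
  summarise(jobs = sum(JobsSupported, na.rm = TRUE), .groups = "drop")

# map_data() keys states by lowercase full name; map back to abbreviations
us_states <- map_data("state") %>%
  mutate(State = state.abb[match(str_to_title(region), state.name)])

us_states %>%
  left_join(state_jobs, by = "State", relationship = "many-to-many") %>%
  filter(!is.na(Year)) %>%
  ggplot(aes(long, lat, group = group, fill = jobs)) +
  geom_polygon(color = "white", linewidth = 0.2) +
  coord_quickmap() +
  scale_fill_viridis_c(labels = label_comma(), name = "Jobs") +
  facet_wrap(~ Year) +
  labs(title = "Jobs Supported 2018-2021 by State", x = NULL, y = NULL)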

Trends in Loan Metrics

  • The COVID-19 loan surge in 2021 supported more jobs, but loan activity has since returned to pre-pandemic levels.
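A sketch of the underlying trend plot, assuming the loan data carries a Year column (derived from the approval date) and a GrossApproval amount (both names hedged):

# Yearly totals of SBA loan activity
foia_data %>%
  group_by(Year) %>%
  summarise(total_approved = sum(GrossApproval, na.rm = TRUE),
            loans = n(), .groups = "drop") %>%
  ggplot(aes(Year, total_approved)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(title = "SBA loan activity by year",
       y = "Total gross approval", x = NULL)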

Results and Impact



  • Built a reliable, improved dataset for analysis.

  • Preserved 89% more data through joins rather than deletion.

  • Enabled cross-dataset analysis between SBA loans and manufacturing spending.

  • Delivered a workflow that can scale to other government datasets; the same analysis could be performed in SQL as well.



Deliverables


  • Code Repository: GitHub

  • Preliminary Work:

Reflection


This project taught me that data wrangling is an iterative process. It can be tempting to simply omit missing values, especially when they look small relative to the dataset, but with thorough research there are often external resources that let you recover them instead. This project taught me the value of preserving data integrity and handling information properly.

I also gained experience leading a collaborative project. I set up the shared environment, assigned responsibilities to group members, and set deadlines for tasks. The team worked efficiently, and everyone contributed meaningfully to the final project.
