Data Wrangling Final Project
Role: Project Manager / Data Analyst
Team Size: 3 students
Tools: R (tidyverse, stringr, dplyr, ggplot2, maps, sf, patchwork, scales), GitHub




My Role
As Project Manager, defined the project scope and coordinated the team workflow, keeping tasks and timelines clearly aligned.
Set up a shared GitHub environment to streamline collaboration and maintain version control across the team.
Led the data-cleaning process and resolved inconsistencies to prepare the datasets for analysis.
Built visualizations that highlighted the cleaned datasets and surfaced key insights for decision-making.
Problem Statement
Large public datasets often contain transcription errors, duplicates, inconsistent formatting, and missing values. Our goal was to take two government datasets:
Small Business Administration (SBA) loan data
Annual Survey of Manufacturers (ASM)
and transform them into clean, consistent, and joinable data sources that could provide insights into U.S. small businesses and the manufacturing sector.
Highlights of the Process
Initial Data Cleaning & Audit
Handling Missing Data
Dataset Integration (SBA + ASM)
Full Analysis Available on my GitHub
Here we can see that the BorrCity column needs to be converted to title case.
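A minimal sketch of that fix (the sample city values here are hypothetical; only the BorrCity column name comes from the data):

```r
library(dplyr)
library(stringr)

# Hypothetical sample mirroring the inconsistent BorrCity values
foia_sample <- tibble(BorrCity = c("NEWARK", "jersey city", "New Brunswick"))

# str_to_title() normalizes each word to Title Case
foia_sample <- foia_sample %>%
  mutate(BorrCity = str_to_title(BorrCity))
# BorrCity is now "Newark", "Jersey City", "New Brunswick"
```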
Found 192 missing values in key columns.
Instead of dropping these rows, we joined the dataset with FDIC bank data to recover the missing information.
# Audit: count NA or empty-string values in every column
foia_data %>% summarize_all(~ sum(is.na(.) | . == ""))
# Fill missing lender locations from FDIC data, keeping the original columns
foia_data <- foia_data %>%
  left_join(lender_location, by = "ThirdPartyLender_Name") %>%
  select(names(foia_data))
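A quick way to confirm a join like this recovers values instead of dropping rows is a before/after check. The miniature below is a hypothetical illustration (the tables and the LenderState column are invented for the example; only ThirdPartyLender_Name appears in the snippet above):

```r
library(dplyr)

# Hypothetical miniature: loan rows with missing lender locations,
# recovered from a small FDIC-style lookup instead of being dropped
foia_mini <- tibble(
  ThirdPartyLender_Name = c("Bank A", "Bank B", "Bank A"),
  LenderState           = c(NA, "NJ", NA)
)
lender_location <- tibble(
  ThirdPartyLender_Name = "Bank A",
  LenderState           = "NY"
)

foia_filled <- foia_mini %>%
  left_join(lender_location, by = "ThirdPartyLender_Name",
            suffix = c("", ".fdic")) %>%
  mutate(LenderState = coalesce(LenderState, LenderState.fdic)) %>%
  select(-LenderState.fdic)

# left_join keeps every original row, and the missing states are now filled
stopifnot(nrow(foia_filled) == 3, sum(is.na(foia_filled$LenderState)) == 0)
```

The `coalesce()` step is what prefers the original value and falls back to the joined one, which is the pattern that preserves rows rather than deleting them.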
# Classify NAICS codes into industry sectors by their two-digit prefix
foia_data <- foia_data %>%
  mutate(NaicsCode = as.character(NaicsCode)) %>%
  mutate(NaicsIndustry = case_when(
    str_detect(NaicsCode, "^11\\d{4}$") ~ "Agriculture",
    str_detect(NaicsCode, "^21\\d{4}$") ~ "Mining",
    # ... remaining sector branches elided
  ))
# Convert full state names to two-letter abbreviations to match the SBA data
asm_data <- asm_data %>%
  mutate(State = case_when(
    str_detect(State, "Alabama") ~ "AL",
    str_detect(State, "Alaska") ~ "AK",
    # ... remaining states elided
  ))

# Join the loan data with manufacturing statistics by state and year
asm_foia_merge <- left_join(asm_data, test, by = c("State", "Year"))




Made a custom column to classify NAICS codes.
Made a column of state abbreviations to match the ASM State column (e.g., the format was "New Jersey" instead of "NJ").
Left joined with the ASM data to view loan data alongside manufacturing statistics.



Some of the columns from the joined data
Graphs and Visualizations
After conducting multiple joins and heavily cleaning the data, these were the key figures we were able to produce:

California consistently has the most jobs supported throughout the years.
In 2021, more jobs were supported across the U.S. than in previous years.
Jobs Supported 2018-2021 By State
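A figure along these lines can be sketched with ggplot2 and scales from the project's tool list. The data frame below is a hypothetical slice; the real column names come from the merged dataset:

```r
library(dplyr)
library(ggplot2)
library(scales)

# Hypothetical slice of the merged data (values are illustrative only)
jobs_df <- tibble(
  State          = rep(c("CA", "TX", "NJ"), each = 4),
  Year           = rep(2018:2021, times = 3),
  jobs_supported = c(90, 95, 100, 140, 60, 62, 65, 90, 20, 22, 24, 35) * 1000
)

ggplot(jobs_df, aes(x = Year, y = jobs_supported, color = State)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Jobs Supported 2018-2021 By State", y = "Jobs Supported")
```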
Trends in Loan Metrics




The COVID-19 loan surge in 2021 supported more jobs, but loan activity has since returned to pre-pandemic levels.
Results and Impact
Built a reliable, improved dataset for analysis.
Preserved 89% more data through joins rather than deletion.
Enabled cross-dataset analysis between SBA loans and manufacturing spending.
Delivered a workflow that can scale across other government datasets; the same analysis could also be performed in SQL.
Reflection