Data Wrangling Final Project
Role: Project Manager / Data Analyst
Team Size: 3 students
Tools: R (tidyverse, stringr, dplyr, ggplot2, maps, sf, patchwork, scales), GitHub




My Role
As Project Manager, defined the project scope and coordinated the team workflow, keeping tasks and timelines clearly aligned.
Set up a shared GitHub environment to streamline collaboration and maintain version control across the team.
Led the data-cleaning process and resolved inconsistencies to prepare the datasets for analysis.
Built visualizations that highlighted the cleaned datasets and surfaced key insights for decision-making.
Problem Statement
Large public datasets often contain transcription errors, duplicates, inconsistent formatting, and missing values. Our goal was to take two government datasets:
Small Business Administration (SBA) loan data
Annual Survey of Manufacturers (ASM)
and transform them into clean, consistent, and joinable data sources that could provide insights into U.S. small businesses and the manufacturing sector.
Highlights of the Process
Initial Data Cleaning & Audit
Handling Missing Data
Dataset Integration (SBA + ASM)
Full Analysis Available on my GitHub
Here we can see that the BorrCity column needs to be converted to title case.
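A minimal sketch of that fix (the sample city values here are hypothetical; only the BorrCity column name comes from the data):

```r
library(dplyr)
library(stringr)

# Hypothetical sample mirroring the inconsistent BorrCity values
foia_sample <- tibble(BorrCity = c("NEWARK", "jersey city", "New Brunswick"))

# str_to_title() normalizes each word to Title Case
foia_sample <- foia_sample %>%
  mutate(BorrCity = str_to_title(BorrCity))
# BorrCity is now "Newark", "Jersey City", "New Brunswick"
```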
Found 192 missing values in key columns.
Instead of dropping these rows, we joined the dataset with FDIC bank data to recover the missing information.
# Audit: count NA or empty-string values in every column
foia_data %>% summarize_all(~ sum(is.na(.) | . == ""))
# Fill missing lender locations from FDIC data, keeping the original columns
foia_data <- foia_data %>%
  left_join(lender_location, by = "ThirdPartyLender_Name") %>%
  select(names(foia_data))
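A quick way to confirm a join like this recovers values instead of dropping rows is a before/after check. The miniature below is a hypothetical illustration (the tables and the LenderState column are invented for the example; only ThirdPartyLender_Name appears in the snippet above):

```r
library(dplyr)

# Hypothetical miniature: loan rows with missing lender locations,
# recovered from a small FDIC-style lookup instead of being dropped
foia_mini <- tibble(
  ThirdPartyLender_Name = c("Bank A", "Bank B", "Bank A"),
  LenderState           = c(NA, "NJ", NA)
)
lender_location <- tibble(
  ThirdPartyLender_Name = "Bank A",
  LenderState           = "NY"
)

foia_filled <- foia_mini %>%
  left_join(lender_location, by = "ThirdPartyLender_Name",
            suffix = c("", ".fdic")) %>%
  mutate(LenderState = coalesce(LenderState, LenderState.fdic)) %>%
  select(-LenderState.fdic)

# left_join keeps every original row, and the missing states are now filled
stopifnot(nrow(foia_filled) == 3, sum(is.na(foia_filled$LenderState)) == 0)
```

The `coalesce()` step is what prefers the original value and falls back to the joined one, which is the pattern that preserves rows rather than deleting them.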
# Classify NAICS codes into industry sectors by their two-digit prefix
foia_data <- foia_data %>%
  mutate(NaicsCode = as.character(NaicsCode)) %>%
  mutate(NaicsIndustry = case_when(
    str_detect(NaicsCode, "^11\\d{4}$") ~ "Agriculture",
    str_detect(NaicsCode, "^21\\d{4}$") ~ "Mining",
    # ... remaining sector branches elided
  ))
# Convert full state names to two-letter abbreviations to match the SBA data
asm_data <- asm_data %>%
  mutate(State = case_when(
    str_detect(State, "Alabama") ~ "AL",
    str_detect(State, "Alaska") ~ "AK",
    # ... remaining states elided
  ))

# Join the loan data with manufacturing statistics by state and year
asm_foia_merge <- left_join(asm_data, test, by = c("State", "Year"))




Made a custom column to classify NAICS codes.
Made a column of state abbreviations to match the ASM State column (e.g., the format was "New Jersey" instead of "NJ").
Left joined with the ASM data to view loan data alongside manufacturing statistics.



Some of the columns from the joined data
Graphs and Visualizations
After conducting multiple joins and heavily cleaning the data, these were the key figures we were able to produce:

California consistently has the most jobs supported throughout the years.
In 2021, more jobs were supported across the U.S. than in previous years.
Jobs Supported 2018-2021 By State
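A figure along these lines can be sketched with ggplot2 and scales from the project's tool list. The data frame below is a hypothetical slice; the real column names come from the merged dataset:

```r
library(dplyr)
library(ggplot2)
library(scales)

# Hypothetical slice of the merged data (values are illustrative only)
jobs_df <- tibble(
  State          = rep(c("CA", "TX", "NJ"), each = 4),
  Year           = rep(2018:2021, times = 3),
  jobs_supported = c(90, 95, 100, 140, 60, 62, 65, 90, 20, 22, 24, 35) * 1000
)

ggplot(jobs_df, aes(x = Year, y = jobs_supported, color = State)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Jobs Supported 2018-2021 By State", y = "Jobs Supported")
```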
Trends in Loan Metrics




The COVID-19 loan surge in 2021 supported more jobs, but loan activity has since returned to pre-pandemic levels.
Results and Impact
Built a reliable, improved dataset for analysis.
Preserved 89% more data through joins rather than deletion.
Enabled cross-dataset analysis between SBA loans and manufacturing spending.
Delivered a workflow that can scale across other government datasets; the same analysis could also be performed in SQL.
Reflection