A collaborative ETL (Extract, Transform, Load) project.
Initial Planning:
After examining data available through the CDC and Census.gov, we decided to focus upon allocations of the Covid 19 vaccines by state. We wrote a project proposal, created an ERD for the database, set up a GitHub repository, and created a Bootstrap-based HTML framework for displaying the data. We planned to clean, join, and filter the data in a relational database in order to produce an effective representation of Covid 19 allocations by state. The tables we produced included a hard-coded vaccine information table, a state population table based on census data, and an allocation table based on CDC data (see the ERD, below).
These tables were ultimately joined to produce analyses based on population per state vs. the number of vaccinated individuals.
Extract:
We began by making API calls to CDC and Census.gov to extract the appropriate information. From the CDC, we pulled state-specific weekly allocation data for the Pfizer, Moderna, and Janssen vaccines. From the census, we pulled population estimates for each state for the last 10 years.
# Census URL
census_url = "https://api.census.gov/data/2019/pep/population?get=DATE_CODE,DATE_DESC,POP,NAME,STATE&for=state:*"
#Request data in json and store variable
census_response = requests.get(census_url)
census_data = census_response.json()
# Create list to store values for the census data
date_code = []
date_desc = []
pop = []
name = []
state = []
# Loop through data and append list
for items in census_data:
date_code.append(items[0])
date_desc.append(items[1])
pop.append(items[2])
name.append(items[3])
state.append(items[4])
# Create dataframe from list
census_df = pd.DataFrame({"Date Code":date_code, "Date Desc":date_desc,
"Population":pop, "State":name, "State ID":state})
census_df.head()
Transform:
We created several pandas dataframes to hold and analyze information. From our API calls to the CDC website, we created three dataframes, each devoted to storing CDC allocation data for one of the vaccines. We created a dataframe giving essential information about each vaccine, including required doses, days between doses, and minimum age. We also created a dataframe with state population data, which required filtering the dataframe to include only the most recent date (July 1, 2019). We extracted the substring that contained the date in order to isolate the most recent data.
After creating these initial dataframes, we concatenated the three dataframes with CDC information on specific vaccine allocations into a single dataframe called ‘allocation_df’. We reset the index, renamed columns for consistency, and dropped unnecessary columns. In addition, we joined our ‘allocation_df’ with the state and vaccine dfs and renamed the columns to match the plan in our ERD for our postgres db.
df = state_df.reset_index()
df = df.set_index('state')
allocation_df = allocation_df.join(df, on='State', rsuffix='_state')
df = vaccine_df.reset_index()
df = vaccine_df.set_index('vaccine')
allocation_df = allocation_df.join(df, on='vaccine', rsuffix='_vaccine')
allocation_df = allocation_df.rename(columns={'Week of Allocation':'week'
,'1st Dose Allocation':'dose1'
,'2nd Dose Allocation':'dose2'
,'id':'state_id'
,'id_vaccine':'vaccine_id'})
allocation_df = allocation_df[['week','dose1','dose2','state_id','vaccine_id']]
allocation_df.tail()
Load:
We wrote an SQL script to set up the tables in postgres according to our ERD. We loaded the dataframe data into the postgres tables using SQLAlchemy. Then, we queried postgres from our jupyter notebook to retrieve the data from the tables. We chose to design a relational database in third normal form because it allows us to easily keep the vaccine and state information separate from the allocation table for easy updates and because all rows in each table have a consistent structure.
Database Design:

/* Drop Tables if they exist */
DROP TABLE IF EXISTS allocation;
DROP TABLE IF EXISTS vaccine;
DROP TABLE IF EXISTS state;
/* Create Tables with proper PKs & FKs */
CREATE TABLE state
(id INT PRIMARY KEY
,state TEXT
,population INT
,population_date DATE
);
CREATE TABLE vaccine
(id INT PRIMARY KEY
,vaccine TEXT
,req_doses INT
,days_between INT
,min_age INT
);
CREATE TABLE allocation
(id SERIAL PRIMARY KEY
,week DATE
,dose1 INT
,dose2 INT
,state_id INT REFERENCES state
,vaccine_id INT REFERENCES vaccine
);