Step 1: Scope the Project and Gather Data

Scope
The purpose of this project is to provide a deep dive into US immigration, focusing primarily on the types of visas being issued and the profiles associated with them. The scope of this project is limited to the data sources listed below, with data aggregated across several features such as visatype, gender, port_of_entry, nationality and month.

Data Description & Sources

- I94 Immigration Data: This data comes from the US National Tourism and Trade Office, found here. Each report contains international visitor arrival statistics by world regions and select countries (including the top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries).
- World Temperature Data: This dataset comes from Kaggle, found here.
- U.S. City Demographic Data: This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. It comes from OpenSoft, found here.
- Airport Code Table: This is a simple table of airport codes and their corresponding cities. An airport code may refer to either the IATA airport code, a three-letter code used in passenger reservation, ticketing and baggage-handling systems, or the ICAO airport code, a four-letter code used by ATC systems and for airports that do not have an IATA code (from Wikipedia). It comes from here.

Step 2: Preprocessing Data

Note: preprocessing was performed prior to storing the CSV files in the S3 buckets, e.g. converting/expanding columns, capitalizing/lowercasing text, etc.

Explore Data

- Identify missing values
- Identify duplicate values (both checks are sketched below)
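
Both checks are quick to run in pandas. The sketch below assumes the I94 extract has already been pulled down locally as a CSV; the file name is a placeholder, not the project's actual path.

```python
import pandas as pd

# Hypothetical local extract; the real files live in the S3 immigration bucket.
df = pd.read_csv("immigration_sample.csv")

# Missing values: percentage of nulls per column.
missing_pct = df.isnull().mean() * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

# Duplicate values: count of fully duplicated rows.
print(f"Duplicate rows: {df.duplicated().sum()}")
```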

Cleaning Steps

- Either drop rows or fill missing data with median values, where appropriate
- Expand coordinates into Latitude & Longitude columns
- Expand locations into City & State columns, e.g. the data provided for port_of_entry_codes was originally code and location; these have since been expanded into city and state_or_country, as shown in the sketch below:
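
The following is a minimal pandas sketch of these cleaning steps. The file names and column names (avg_temp, coordinates, location) are assumptions made for illustration; the actual layouts come from the source files listed in Step 1.

```python
import pandas as pd

# Hypothetical file and column names, used purely for illustration.
temps = pd.read_csv("temperatures.csv")          # assumed columns: avg_temp, coordinates ("lat, long")
ports = pd.read_csv("port_of_entry_codes.csv")   # assumed columns: code, location ("City, State")

# Fill numeric gaps with the column median (or drop the rows where filling makes no sense).
temps["avg_temp"] = temps["avg_temp"].fillna(temps["avg_temp"].median())

# Expand a "lat, long" string into separate latitude and longitude columns.
temps[["latitude", "longitude"]] = temps["coordinates"].str.split(",", expand=True)

# Expand "City, State" locations into city and state_or_country columns.
ports[["city", "state_or_country"]] = ports["location"].str.split(",", n=1, expand=True)
ports["city"] = ports["city"].str.strip().str.title()
ports["state_or_country"] = ports["state_or_country"].str.strip().str.upper()
```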

Step 3: Data Model

Step 4: Run Pipelines to Model the Data


4.1 Create the data model
Creating the data model involves various steps, which are made significantly easier through the use of Airflow. The process of extracting files from the S3 buckets, transforming the data, and writing CSV and Parquet files to Redshift is accomplished through the tasks highlighted below in the ETL DAG graph. These steps include (a minimal DAG sketch follows the list):

- Extracting data from the SAS documents and writing it as CSV files to the S3 immigration bucket
- Extracting the remaining CSV and Parquet files from the S3 immigration bucket
- Writing the CSV and Parquet files from S3 to Redshift
- Performing data quality checks on the newly created tables
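
The sketch below shows roughly how the DAG wiring might look. The DAG id, schedule, and task callables are placeholders standing in for the project's actual custom operators; only the task ordering mirrors the steps above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables standing in for the project's custom ETL logic.
def extract_sas_to_csv(**context):
    """Read the SAS documents and write them as CSV files to the S3 immigration bucket."""

def load_s3_to_redshift(**context):
    """Copy the CSV and Parquet files from S3 into Redshift tables."""

def run_quality_checks(**context):
    """Run row-count and integrity checks on the newly created tables."""


with DAG(
    dag_id="immigration_etl",          # hypothetical DAG id
    start_date=datetime(2016, 1, 1),   # hypothetical start date
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_sas_to_csv", python_callable=extract_sas_to_csv)
    load = PythonOperator(task_id="load_s3_to_redshift", python_callable=load_s3_to_redshift)
    checks = PythonOperator(task_id="run_quality_checks", python_callable=run_quality_checks)

    extract >> load >> checks
```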

4.2 Data Quality Checks
Data quality checks include:

- Integrity constraints on the relational database (e.g. unique key, data type, etc.)
- Unit tests for the scripts to ensure they are doing the right thing
- Source/count checks to ensure completeness (a minimal count check is sketched below)
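
As an example, a source/count check can be expressed as a small helper that queries each target table and fails the run when a table comes back empty. The sketch below assumes Airflow's PostgresHook (Redshift speaks the Postgres protocol); the connection id and table names are placeholders, not the project's actual identifiers.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_table_not_empty(table, redshift_conn_id="redshift"):
    """Source/count check: fail loudly if a target table loaded zero rows."""
    hook = PostgresHook(postgres_conn_id=redshift_conn_id)
    records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or not records[0] or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} contains no rows")
    return records[0][0]


# Hypothetical table names; the real list comes from the data model in Step 3.
for table in ["immigration", "demographics", "airport_codes", "temperature"]:
    check_table_not_empty(table)
```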
