
Capstone Project Overview

The goal of the Capstone Project is to provide you with an opportunity to synthesize the knowledge and
skills you have learned from previous courses and apply them to solve real-world cloud computing
challenges. You will work on a transportation dataset from the US Bureau of Transportation
Statistics (BTS) that is hosted as an Amazon EBS volume snapshot.

The dataset used in the Project contains data and statistics from the US Department of
Transportation on aviation, maritime, highway, transit, rail, pipeline, bike/pedestrian, and other
modes of transportation in CSV format. The data is described in more detail by the Bureau of
Transportation Statistics. (Note that the dataset we are using does not extend beyond 2008,
although more recent data is available from the previous link.) In this Project, we will concentrate
exclusively on the aviation portion of the dataset, which contains domestic flight data such as
departure and arrival delays, flight times, etc. For an example of analysis using this dataset,
see Which flight will get you there fastest?

You will be answering a common set of questions in different stacks/systems – in a batch processing
system (Apache Hadoop / Spark), and in a stream processing system (Apache Storm / Spark
Streaming / Flink). After completing Task 1, you will be able to use the results you obtained to verify
the results from Task 2.

The set of questions that must be answered using this dataset is provided in the next section. These
questions involve discovering useful information such as the best day of week to fly to minimize
delays, the most popular airports, the most on-time airlines, etc. Each task will require you to answer
a subset of these questions using a particular set of distributed systems. The exact methodology
used to answer the questions is largely left to you, but you must integrate and utilize the specified
systems to arrive at your answers.

Questions

For each task, you must answer a subset of the following questions. Each question is over
the entire dataset, unless otherwise specified.

Group 1 (Answer any 2):

1. Rank the top 10 most popular airports by number of flights to/from the airport (a sketch of this computation follows the list below).
2. Rank the top 10 airlines by on-time arrival performance.
3. Rank the days of the week by on-time arrival performance.
Note: only the tables with information at the flight/day granularity are applicable for the
Group 1 questions.
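
As a concrete illustration of the kind of computation Question 1 asks for, here is a minimal PySpark sketch. It assumes the cleaned data has already been written to HDFS as CSV with a header row and Origin/Dest columns; the path and column names are placeholders based on the BTS schema and should be adjusted to your own layout.

    # Sketch for Group 1, Question 1: top 10 airports by total flight count.
    # Assumes cleaned CSV data on HDFS with "Origin" and "Dest" columns
    # (hypothetical path; adjust to your layout).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("top-airports").getOrCreate()
    flights = spark.read.csv("hdfs:///capstone/cleaned/ontime", header=True)

    # Count each flight once for its origin and once for its destination,
    # then sum the counts per airport code.
    popularity = (
        flights.select(F.col("Origin").alias("airport"))
        .union(flights.select(F.col("Dest").alias("airport")))
        .groupBy("airport")
        .count()
        .orderBy(F.desc("count"))
    )
    popularity.show(10)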

Group 2 (Answer any 3):

For Questions 1 and 2 below, we are asking you to find, for each airport, the top 10 carriers and
destination airports from that airport with respect to on-time departure performance. We are not
asking you to rank the overall top 10 carriers/airports. For specific queries, see the Task 1
Queries and Task 2 Queries.

1. For each airport X, rank the top-10 carriers in decreasing order of on-time departure
performance from X (see the sketch after this list).
2. For each source airport X, rank the top-10 destination airports in decreasing order of on-time
departure performance from X.
3. For each source-destination pair X-Y, rank the top-10 carriers in decreasing order of on-time
arrival performance at Y from X.
4. For each source-destination pair X-Y, determine the mean arrival delay (in minutes) for a
flight from X to Y.
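
For Question 2.1, one workable approach is a grouped aggregation followed by a per-group ranking. The sketch below interprets "on-time departure performance" as mean departure delay (lower is better); the handout leaves the exact metric to you, and the column names (Origin, UniqueCarrier, DepDelay) are assumptions based on the BTS schema.

    # Sketch for Question 2.1: for each origin airport, the 10 carriers
    # with the lowest mean departure delay. "On-time performance" is
    # interpreted here as mean DepDelay; other metrics are possible.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("top-carriers").getOrCreate()
    flights = spark.read.csv("hdfs:///capstone/cleaned/ontime", header=True)

    avg_delay = (
        flights.groupBy("Origin", "UniqueCarrier")
        .agg(F.avg(F.col("DepDelay").cast("double")).alias("avg_dep_delay"))
    )
    # Rank carriers within each origin airport and keep the top 10.
    w = Window.partitionBy("Origin").orderBy(F.asc("avg_dep_delay"))
    top10 = (
        avg_delay.withColumn("rank", F.row_number().over(w))
        .filter(F.col("rank") <= 10)
    )
    top10.filter(F.col("Origin") == "SFO").show()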

Group 3 (Answer both questions using Hadoop or an equivalent batch processing system;
you may also use Spark Streaming or an equivalent stream processing system to answer
Question 2):

1. Does the popularity distribution of airports follow a Zipf distribution? If not, what distribution
does it follow?
2. Tom wants to travel from airport X to airport Z. However, Tom also wants to stop at airport Y
for some sightseeing on the way. More concretely, Tom has the following requirements (for
specific queries, see the Task 1 Queries and Task 2 Queries):

a) The second leg of the journey (flight Y-Z) must depart two days after the first leg (flight X-Y). For
example, if X-Y departs on January 5, 2008, Y-Z must depart on January 7, 2008.

b) Tom wants his flights scheduled to depart airport X before 12:00 PM local time and to depart
airport Y after 12:00 PM local time.

c) Tom wants to arrive at each destination with as little delay as possible. You can assume you know
the actual delay of each flight.

Your mission (should you choose to accept it!) is to find, for each X-Y-Z and day/month (dd/mm)
combination in the year 2008, the two flights (X-Y and Y-Z) that satisfy constraints (a) and (b) and
have the best individual performance with respect to constraint (c), if such flights exist.
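
One way to realize this in a batch job is a self-join of the flight table with itself, shifted by two days. The sketch below is a rough outline under the same schema assumptions as above (FlightDate, CRSDepTime as hhmm, ArrDelay); "best" is read as lowest arrival delay, and note that each leg can be optimized independently because constraints (a) and (b) decouple the two legs.

    # Rough sketch of the two-leg self-join for Question 3.2.
    # Column names follow the BTS schema (an assumption to verify).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("tom-itinerary").getOrCreate()
    flights = spark.read.csv("hdfs:///capstone/cleaned/ontime", header=True)
    dep = F.col("CRSDepTime").cast("int")

    # Constraint (b): leg 1 departs X before 12:00, leg 2 departs Y after 12:00.
    leg1 = flights.filter(dep < 1200).alias("l1")
    leg2 = flights.filter(dep > 1200).alias("l2")

    # Constraint (a): leg 2 departs exactly two days after leg 1.
    pairs = leg1.join(
        leg2,
        (F.col("l1.Dest") == F.col("l2.Origin"))
        & (F.date_add(F.to_date("l1.FlightDate"), 2) == F.to_date("l2.FlightDate")),
    )
    # Constraint (c): for each X-Y-Z and leg-1 date, each leg's arrival
    # delay can then be minimized independently (e.g., two window ranks).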
For the queries in Group 2 and Question 3.2, you will need to compute the results for ALL input
values (e.g., airport X, source-destination pair X-Y, etc.) for which the result is nonempty. These
results should then be stored in Cassandra so that the results for an input value can be queried by a
user. Then, closer to the grading deadline, we will give you sample queries (airports, flights, etc.) to
include in your video demo and report.

For example, after completing Question 2.2, a user should be able to provide an airport code (such
as “ATL”) and receive the top 10 airports in decreasing order of on-time departure performance from
ATL. Note that for questions such as 2.3, you do not need to compute the values for all possible
combinations of X-Y, but rather only for those such that a flight from X to Y exists.
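
As a concrete example of the storage side, the sketch below uses the DataStax Python driver to create a table keyed by origin airport and serve the per-airport lookup described above. The keyspace, table, column names, and the inserted row are all illustrative placeholders.

    # Sketch of storing/querying Question 2.2 results in Cassandra with
    # the DataStax Python driver. All names and values are placeholders.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS capstone WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS capstone.top_destinations (
            origin text, rank int, dest text, avg_dep_delay double,
            PRIMARY KEY (origin, rank))
    """)

    # Write one precomputed row (placeholder values), then serve a user
    # query by partition key -- exactly the "given ATL" lookup above.
    session.execute(
        "INSERT INTO capstone.top_destinations (origin, rank, dest, avg_dep_delay)"
        " VALUES (%s, %s, %s, %s)", ("ATL", 1, "XXX", 0.0))
    for row in session.execute(
            "SELECT rank, dest FROM capstone.top_destinations WHERE origin = %s",
            ("ATL",)):
        print(row.rank, row.dest)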

Submission

The submission you need to generate for each task consists of:

1. A report documenting what you have done with justification and explanation. The report
should address all criteria in the grading rubric and in the submission details described in each
task. The report should be no longer than 4-5 pages in 11-point font.
2. A video demonstration of the use of your system to answer the required questions. For the
queries in Group 2 and Question 3.2, it will suffice to illustrate the results for a small subset of
queries. The video demo should be no longer than 5 minutes.

Further details are provided in the instructions for each task. For your Task 2 report, you will also be
asked to include a general comparison of the stacks used in each task (Hadoop, Storm, Spark).

Make sure you review the Course Deadlines, Late Policy and Academic Calendar page for
detailed information about deadlines for each task.

Getting Started

Official documentation for the systems used in this project is available at the following locations
(there are many other guides and tutorials available online):

- Hadoop
- Storm
- Spark
- Cassandra
- Kafka

The EBS snapshot ID for the transportation dataset is snap-e1608d88 for Linux/Unix and
snap-37668b5e for Windows, in the us-east-1 (N. Virginia) region. Note: this is an EBS snapshot
ID, not an AMI. You can use any Linux or Windows AMI you like; whichever AMI you choose, you
will want to enter one of the aforementioned snapshot IDs when you have the option to add
additional storage.
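
If you prefer to script the volume setup rather than use the console, the boto3 sketch below creates a volume from the Linux snapshot and attaches it to a running instance. The availability zone must match your instance's zone; the instance ID and device name are placeholders.

    # Sketch: create a volume from the dataset snapshot and attach it to
    # an instance via boto3. The zone must match the instance's zone; the
    # instance ID and device name below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    vol = ec2.create_volume(SnapshotId="snap-e1608d88",
                            AvailabilityZone="us-east-1a")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",  # placeholder
                      Device="/dev/sdf")
    # Then, on the instance itself, make a mount point and mount the
    # device (it may appear as /dev/xvdf) to browse the dataset.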

Task 1 Overview
Instructions
You may use DynamoDB (or any storage service from your provider) instead of Cassandra
for this assignment if needed. Examples of such services from public cloud providers
include DynamoDB, CosmosDB, and Google Cloud Datastore, or any equivalent cloud storage of
your choice.

In general, the goals of this task are to perform the following:

1. Extract and clean the transportation dataset, and then store the result in HDFS.
2. Answer 2 questions from Group 1, 3 questions from Group 2, and both questions from
Group 3 using either Apache Hadoop or Apache Spark. Store the results for questions from
Group 2 and Question 3.2 in Cassandra or DynamoDB. You can access these questions in
the Capstone Project Overview.

Part 1

Your first task is to clean and store the dataset. To accomplish this, you must first retrieve and
extract the dataset from the EBS volume snapshot. Afterwards, you should explore the dataset and
decide how you wish to clean it. The exact methodology used to clean the dataset is left to you.
The cleaned data must be stored on HDFS.

Note: The dataset contains a large amount of information that will not be useful for this task. Before
you start, you should explore the dataset directory and decide what you want to keep and
discard. Consider removing or combining redundant fields and storing the useful data in a format
that makes it easier for you to answer your chosen questions; a sketch of one possible cleaning pass follows.
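
As an illustration only: the PySpark sketch below drops cancelled/diverted flights and keeps a handful of columns. Which columns you keep depends entirely on the questions you choose, and the paths and column names are assumptions based on the BTS schema.

    # Sketch of one possible cleaning pass: drop cancelled/diverted
    # flights, keep only the columns needed for the chosen questions,
    # and write the result back to HDFS. Paths and column names are
    # placeholders to adapt.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("clean-ontime").getOrCreate()
    raw = spark.read.csv("hdfs:///capstone/raw/ontime", header=True)

    cleaned = (
        raw.filter((F.col("Cancelled") == "0") & (F.col("Diverted") == "0"))
        .select("FlightDate", "DayOfWeek", "UniqueCarrier", "Origin",
                "Dest", "CRSDepTime", "DepDelay", "ArrDelay")
        .na.drop(subset=["DepDelay", "ArrDelay"])
    )
    cleaned.write.mode("overwrite").option("header", True) \
        .csv("hdfs:///capstone/cleaned/ontime")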

Part 2
Your second task is to answer your chosen questions using Hadoop (or Spark). As noted above, the
results for questions from Group 2 and Question 3.2 should be stored in Cassandra or DynamoDB.
The exact approach you use to answer these questions is again left to you. Whatever approaches
you choose, make sure you briefly explain and justify them in your report. See the Task 1
Queries below for specific queries. Please do not store all of the data, only those records needed for
answering the questions in Task 1 Queries.
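
If you go the Spark plus Cassandra route, the spark-cassandra-connector lets a results DataFrame be written straight to a table, avoiding a separate export step. The sketch below is illustrative: the connector package version must match your Spark/Scala versions, and the DataFrame column names must match the target table's columns.

    # Sketch: write precomputed results from Spark directly to Cassandra.
    # Launch with a matching connector package, e.g.:
    #   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 job.py
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("store-results")
        .config("spark.cassandra.connection.host", "127.0.0.1")
        .getOrCreate()
    )
    results = spark.read.csv("hdfs:///capstone/results/q22", header=True)
    (results.write
        .format("org.apache.spark.sql.cassandra")
        .options(table="top_destinations", keyspace="capstone")
        .mode("append")
        .save())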

Task 1 Queries
Below, you will find the set of individual queries for Questions 2.1-2.4 and 3.2 that must be included
in your submission for Task 1.

Questions 2.1 and 2.2

Provide the results using the following airport codes.

- CMI (University of Illinois Willard Airport)
- BWI (Baltimore-Washington International Airport)
- MIA (Miami International Airport)
- LAX (Los Angeles International Airport)
- IAH (George Bush Intercontinental Airport)
- SFO (San Francisco International Airport)

Questions 2.3 and 2.4

Provide the results using the following routes.

- CMI → ORD
- IND → CMH
- DFW → IAH
- LAX → SFO
- JFK → LAX
- ATL → PHX

Question 3.2

Provide the results using the following routes and start dates. Dates are in dd/mm/yyyy format.

- CMI → ORD → LAX, 04/03/2008
- JAX → DFW → CRP, 09/09/2008
- SLC → BFL → LAX, 01/04/2008
- LAX → SFO → PHX, 12/07/2008
- DFW → ORD → DFW, 10/06/2008
- LAX → ORD → JFK, 01/01/2008

Submission

PDF Report

You must submit your report in PDF format. Your report should be no longer than 4-5 pages
in 11-point font. Your report should include the following:

1. Give a brief overview of how you extracted and cleaned the data.
2. Give a brief overview of how you integrated each system.
3. What approaches and algorithms did you use to answer each question?
4. What are the results of each question? Use only the provided subset for questions from
Group 2 and Question 3.2.
5. What system- or application-level optimizations (if any) did you employ?
6. Give your opinion about whether the results make sense and are useful in any way.

Video Demonstration Link

In your report, you will also need to submit a link to a video demonstration (in female voice) of your
approach. Your video should be no more than 5 minutes long. Your video should include the
following:

1. Ingesting and analyzing data for each question
2. Displaying/querying the results for each question

Record Video Demonstration

Include the shareable link for the video demonstration in your report.

Paper Presentation Submission

Overview

For this assignment, you will need to prepare a presentation on the paper “Amazon Aurora: Design
Considerations for High Throughput Cloud-Native Relational Databases”.

The paper can be found at https://www.allthingsdistributed.com/files/p1041-verbitski.pdf

You are required to read the paper thoroughly and create a presentation based on the paper you are
assigned. Your presentation will be reviewed and graded by your classmates.

You are required to record your presentation in video format (in female voice).
Example Presentations:

1. Implementing Declarative Overlays, B. T. Loo et al., SOSP 2005.
2. Starfish: A Self-tuning System for Big Data Analytics, H. Herodotou et al., CIDR 2011.
3. Spanner: Google's Globally-Distributed Database, J. C. Corbett et al., OSDI 2012.
4. TAO: Facebook's Distributed Data Store for the Social Graph, N. Bronson et al., ATC 2013.
5. Twitter Heron: Stream Processing at Scale, S. Kulkarni et al., SIGMOD 2015.
6. Paxos Made Transparent, H. Cui et al., SOSP 2015.
7. Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics, J. Shi et al., VLDB 2016.
8. The SNOW Theorem and Latency-Optimal Read-Only Transactions, H. Lu et al., USENIX OSDI 2016.
9. TensorFlow: A System for Large-Scale Machine Learning, M. Abadi et al., OSDI 2016.
10. Octopus: an RDMA-enabled Distributed Persistent Memory File System, Youyou Lu et al., ATC 2017.
