Data Lake Foundation on the AWS Cloud
with Apache Zeppelin, Amazon RDS, and other AWS Services
Quick Start Reference Deployment

August 2017

Cloudwick Technologies
AWS Quick Start Reference Team

Contents
Overview
   Costs and Licenses
Architecture
   AWS Components
   Data Visualization Components
   Design
Prerequisites
   Specialized Knowledge
Deployment Options
Deployment Steps
   Step 1. Prepare Your AWS Account
   Step 2. Launch the Quick Start
   Step 3. Test the Deployment
   Step 4. Explore the Data Lake Portal
Deleting the Stack
Troubleshooting
Additional Resources
Send Us Feedback
Document Revisions

This Quick Start deployment guide was created by Amazon Web Services (AWS) in
partnership with Cloudwick Technologies Inc., an AWS Advanced Consulting Partner
specializing in big data.

Quick Starts are automated reference deployments that use AWS CloudFormation
templates to launch, configure, and run the AWS compute, network, storage, and other
services required to deploy a specific workload on AWS.

Overview
This Quick Start reference deployment guide provides step-by-step instructions for
deploying a data lake foundation on the Amazon Web Services (AWS) Cloud.

A data lake is a repository that holds a large amount of raw data in its native (structured or
unstructured) format until the data is needed. Storing data in its native format enables you
to accommodate any future schema requirements or design changes.

This Quick Start deploys a data lake foundation that integrates various AWS Cloud
components to help you migrate your structured and unstructured data from your on-
premises environment to the AWS Cloud, and store, monitor, and analyze the data. The
deployment uses Amazon Simple Storage Service (Amazon S3) as a core service to store the
data. It also includes other AWS services such as Amazon Relational Database Service
(Amazon RDS), AWS Data Pipeline, Amazon Redshift, AWS CloudTrail, and Amazon
Elasticsearch Service (Amazon ES). The Quick Start deploys Apache Zeppelin and Kibana
for analyzing and visualizing the data stored in Amazon S3.

The Quick Start also deploys a data lake portal, where you can upload files to, and
download files from, the data lake repository in Amazon S3; monitor real-time streaming
data using Amazon Kinesis Firehose; analyze and explore the data you’ve uploaded in
Kibana; and check your cloud resources. You can follow the instructions in this guide to
upload your data into an Amazon RDS table and try out some of this functionality.

This Quick Start supports multiple user scenarios, including:


 Ingestion, storage, and analytics of original data sets, whether they are structured or
unstructured
 Integration and analysis of data originating from disparate sources
 Reduction in analytics costs as the data captured grows exponentially
 Ability to leverage multiple analytic engines and processing frameworks by using the
same data stored in Amazon S3

Costs and Licenses


You are responsible for the cost of the AWS services used while running this Quick Start
reference deployment. There is no additional cost for using the Quick Start.

The AWS CloudFormation template for this Quick Start includes configuration parameters
that you can customize. Some of these settings, such as instance type, will affect the cost of
deployment. For cost estimates, see the pricing pages for each AWS service you will be
using. Prices are subject to change.

This Quick Start also deploys the Kibana and Apache Zeppelin open-source software, which
are both free of charge.

Architecture
AWS Components
The core AWS components used by this Quick Start include the following AWS services.
Infrastructure:
 Amazon EC2 – The Amazon Elastic Compute Cloud (Amazon EC2) service enables you
to launch virtual machine instances with a variety of operating systems. You can choose
from existing Amazon Machine Images (AMIs) or import your own virtual machine
images.
 AWS Lambda – Lambda is used to run code without provisioning or managing servers.
Your Lambda code can be triggered based on an event.
 Amazon VPC – The Amazon Virtual Private Cloud (Amazon VPC) service lets you
provision a private, isolated section of the AWS Cloud where you can launch AWS
services and other resources in a virtual network that you define. You have complete
control over your virtual networking environment, including selection of your own IP
address range, subnet creation, and configuration of route tables and network gateways.

 IAM – AWS Identity and Access Management (IAM) enables you to securely control
access to AWS services and resources for your users. With IAM, you can manage users,
security credentials such as access keys, and permissions that control which AWS
resources users can access, from a central location.
 AWS CloudTrail – CloudTrail enables governance, compliance, operational auditing,
and risk auditing of your AWS account. With CloudTrail, you can log, continuously
monitor, and retain events related to API calls across your AWS infrastructure.

Storage:
 Amazon S3 – Amazon Simple Storage Service (Amazon S3) provides a secure and
scalable repository for your data, and is closely integrated with other AWS services for
post-processing and analytics. This Quick Start uses Amazon S3 to store data in its
original format.
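
For example, landing a raw file in the data lake is a single S3 call. The following is a
minimal sketch using the AWS SDK for Python (boto3); the bucket and key names are
assumptions, not values created by the Quick Start.

import boto3

s3 = boto3.client("s3")

# Store a raw data file in its original format. The bucket name and key
# prefix below are examples; substitute the data lake bucket that the
# Quick Start creates in your account.
s3.upload_file("sales_2017.csv", "my-datalake-bucket", "raw/sales/sales_2017.csv")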

Database:
 Amazon RDS – Amazon RDS helps you set up, operate, and scale MySQL deployments
in the cloud. This Quick Start deploys Amazon RDS to demonstrate how AWS Data
Pipeline can be used to migrate data from your relational database to AWS Cloud
services such as Amazon S3 and Amazon Redshift.
 Amazon Redshift – Amazon Redshift helps you analyze all your data using standard
SQL and your existing business intelligence (BI) tools. This Quick Start uses Amazon
Redshift as the data warehouse for the data that’s migrated from an on-premises
relational database.

Analytics:
 Amazon ES – Amazon Elasticsearch Service (Amazon ES) helps you deploy, operate,
and scale Elasticsearch for log analytics, full-text search, and application and metadata
monitoring.
 Amazon Kinesis Firehose – Kinesis Firehose is part of the Kinesis streaming data
platform. It delivers real-time streaming data to Amazon ES, and this Quick Start
displays the streaming data captured by Kinesis in Kibana.
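
To illustrate that flow, the sketch below pushes a single JSON record into a Firehose
delivery stream with boto3. The stream name is an assumption; the Quick Start provisions
its own stream, whose name you can find on the portal’s Resources screen.

import json

import boto3

firehose = boto3.client("firehose")

# Send one metadata record into the delivery stream. Firehose buffers the
# record and delivers it to the Amazon ES domain for display in Kibana.
# "datalake-delivery-stream" is an assumed name, not the stack's.
firehose.put_record(
    DeliveryStreamName="datalake-delivery-stream",
    Record={"Data": json.dumps({"file": "sales_2017.csv", "event": "upload"}) + "\n"},
)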

Data Visualization Components


 Kibana plugin for Amazon ES – Kibana is a web interface for Elasticsearch and provides
visualization capabilities for content indexed on an Elasticsearch cluster.
 Apache Zeppelin – Zeppelin is an open-source tool for data ingestion, analysis, and
visualization based on the Apache Spark processing engine.

Design
Deploying this Quick Start for a new virtual private cloud (VPC) with default parameters
builds the following data lake environment in the AWS Cloud.

Figure 1: Quick Start data lake foundation architecture on AWS

The Quick Start sets up the following:


 A virtual private cloud (VPC) that spans two Availability Zones and includes two public
and two private subnets.*
 An Internet gateway to allow access to the Internet.*
 In the public subnets, managed NAT gateways to allow outbound Internet access for
resources in the private subnets.
 In the public subnets, optional Linux bastion hosts in an Auto Scaling group to allow
inbound Secure Shell (SSH) access to EC2 instances in public and private subnets.
 In a private subnet, a web server instance (launched from an Amazon Machine Image,
or AMI) in an Auto Scaling group to host the data lake portal. This web server also runs
Zeppelin for analytics on the data loaded into Amazon S3.

 IAM roles to provide permissions to access AWS resources; for example, to access data
in Amazon S3, to enable Amazon Redshift to copy data from Amazon S3 into its tables,
and to associate the generated IAM role with the Amazon Redshift cluster.
 In the private subnets, Amazon RDS to enable migrating data from a relational database
to Amazon Redshift using AWS Data Pipeline.
 Integration with Amazon S3, AWS Lambda, Amazon ES with Kibana, Amazon Kinesis
Firehose, and CloudTrail for data storage and analysis.
 Your choice to create a new VPC or deploy the data lake components into your existing
VPC on AWS. The template that deploys the Quick Start into an existing VPC skips the
components marked by asterisks above.

Here’s how these components work together, with Amazon S3 at the center of the
architecture:
 AWS Data Pipeline migrates your RDBMS data from Amazon RDS to Amazon Redshift.
After you deploy the Quick Start, you can follow the instructions in this guide to upload
your data into an Amazon RDS table to explore this functionality.
 Zeppelin analyzes and visualizes the data being migrated.
 Amazon S3 stores the structured or unstructured data files and associated log files for
the data lake.
 Lambda functions capture the metadata associated with the uploaded files and push the
metadata to Amazon ES. (A sketch of such a function follows this list.)
 Kinesis Firehose captures streams of metadata associated with the files being uploaded
to Amazon S3.
 Kibana fetches and displays the statistics from Amazon ES, and also displays graphics
based on the API calls made to the data lake.
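
To make the Lambda step concrete, here is a minimal sketch of a handler that reacts to an
S3 upload event and indexes the object’s metadata in Elasticsearch. This is not the Quick
Start’s actual function: the endpoint and index name are assumptions, and request signing
is omitted for brevity.

import json
import urllib.parse
import urllib.request

# Assumptions: your Amazon ES endpoint and a metadata index name.
ES_ENDPOINT = "https://search-example.us-west-2.es.amazonaws.com"
INDEX = "datalake-metadata"

def handler(event, context):
    """Triggered by an S3 PUT event; pushes object metadata to Amazon ES."""
    for record in event["Records"]:
        doc = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": urllib.parse.unquote_plus(record["s3"]["object"]["key"]),
            "size": record["s3"]["object"].get("size", 0),
            "eventTime": record["eventTime"],
        }
        req = urllib.request.Request(
            "%s/%s/metadata" % (ES_ENDPOINT, INDEX),
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # Assumes the domain's access policy permits this unsigned call.
        urllib.request.urlopen(req)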

Prerequisites
Specialized Knowledge
Before you deploy this Quick Start, we recommend that you become familiar with the AWS
services listed in the previous section by following the provided links. (If you are new to
AWS, see Getting Started with AWS.)

Deployment Options
This Quick Start provides two deployment options:
 Deploy the Quick Start into a new VPC (end-to-end deployment). This option
builds a new AWS environment consisting of the VPC, subnets, NAT gateways,
bastion hosts, security groups, and other infrastructure components, and then
deploys the data lake services and components into this new VPC.

 Deploy the Quick Start into an existing VPC. This option deploys the data lake
services and components in your existing AWS infrastructure.

The Quick Start provides separate templates for these options. It also lets you configure
CIDR blocks, instance types, and data lake settings, as discussed later in this guide.

Deployment Steps
Step 1. Prepare Your AWS Account
1. If you don’t already have an AWS account, create one at https://aws.amazon.com by
following the on-screen instructions.
2. Use the region selector in the navigation bar to choose the AWS Region where you want
to deploy the data lake components on AWS.

Important This Quick Start uses Amazon Kinesis Firehose, which is supported
only in the regions listed on the AWS Regions and Endpoints webpage.

3. Create a key pair in your preferred region. (For a scripted alternative, see the sketch at
the end of this step.)


4. If necessary, request a service limit increase for the Amazon EC2 M1 instance type. You
might need to do this if you already have an existing deployment that uses this instance
type, and you think you might exceed the default limit with this reference deployment.
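
If you prefer to script the key pair creation from step 3, here is a minimal boto3 sketch;
the key name and region are example values.

import os

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # example region

# Create the key pair referenced later by the KeyPairName parameter, and
# save the private key locally with restrictive permissions.
resp = ec2.create_key_pair(KeyName="datalake-qs-key")
with open("datalake-qs-key.pem", "w") as f:
    f.write(resp["KeyMaterial"])
os.chmod("datalake-qs-key.pem", 0o600)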

Step 2. Launch the Quick Start


Note You are responsible for the cost of the AWS services used while running this
Quick Start reference deployment. There is no additional cost for using this Quick
Start. For full details, see the pricing pages for each AWS service you will be using in
this Quick Start. Prices are subject to change.

1. Choose one of the following options to launch the AWS CloudFormation template into
your AWS account. For help choosing an option, see Deployment Options earlier in this
guide.

Option 1: Deploy Quick Start into a new VPC on AWS (Launch)
Option 2: Deploy Quick Start into an existing VPC on AWS (Launch)

Important If you’re deploying the Quick Start into an existing VPC, make sure
that your VPC has two private subnets in different Availability Zones for the database
instances. These subnets require NAT gateways or NAT instances in their route
tables, to allow the instances to download packages and software without exposing
them to the Internet. You’ll also need the domain name option configured in the
DHCP options as explained in the Amazon VPC documentation. You’ll be prompted
for your VPC settings when you launch the Quick Start.

Each deployment takes about 20 minutes to complete.


2. Check the region that’s displayed in the upper-right corner of the navigation bar, and
change it if necessary. This is where the network infrastructure for the data lake will be
built. The template is launched in the US West (Oregon) Region by default.

Important This Quick Start uses Amazon Kinesis Firehose, which is supported
only in the regions listed on the AWS Regions and Endpoints webpage.

3. On the Select Template page, keep the default setting for the template URL, and then
choose Next.
4. On the Specify Details page, change the stack name if needed. Review the parameters
for the template. Provide values for the parameters that require input. For all other
parameters, review the default settings and customize them as necessary. When you
finish reviewing and customizing the parameters, choose Next.
In the following tables, parameters are listed by category and described separately for
the two deployment options:
– Parameters for deploying the Quick Start into a new VPC
– Parameters for deploying the Quick Start into an existing VPC

 Option 1: Parameters for deploying the Quick Start into a new VPC
View template

Network Configuration:

Availability Zones (AvailabilityZones)
Default: Requires input
The list of Availability Zones to use for resource distribution in the VPC. This field displays the available zones within your selected region. You can choose 2, 3, or 4 Availability Zones from this list. The logical order of your selections is preserved in your deployment. After you make your selections, make sure that the value of the Number of Availability Zones parameter matches the number of selections.

Number of Availability Zones (NoOfAzs)
Default: 2
The number of Availability Zones to use in the VPC. This count must match the number of selections in the Availability Zones parameter; otherwise, your deployment will fail with an AWS CloudFormation template validation error. (Note that some regions provide only 2 or 3 Availability Zones.)

VPC CIDR (VPCCIDR)
Default: 10.0.0.0/16
CIDR block for the VPC.

Private Subnet 1 CIDR (PrivateSubnet1CIDR)
Default: 10.0.0.0/19
CIDR block for the private subnet located in Availability Zone 1.

Private Subnet 2 CIDR (PrivateSubnet2CIDR)
Default: 10.0.32.0/19
CIDR block for the private subnet located in Availability Zone 2.

Public Subnet 1 CIDR (PublicSubnet1CIDR)
Default: 10.0.128.0/20
CIDR block for the public (DMZ) subnet located in Availability Zone 1.

Public Subnet 2 CIDR (PublicSubnet2CIDR)
Default: 10.0.144.0/20
CIDR block for the public (DMZ) subnet located in Availability Zone 2.

Permitted IP range (AccessCIDR)
Default: Requires input
The CIDR IP range that is permitted to access the data lake web server instances. We recommend that you set this value to a trusted IP range; for example, you might want to grant only your corporate network access to the software.

Add Bastion Host (AddBastion)
Default: Yes
Set this parameter to No if you don’t want to include Linux bastion host instances in an Auto Scaling group in the VPC.

Amazon RDS Configuration:

RDS Instance Type (RDSInstanceType)
Default: db.t2.small
DB instance class for the RDS DB instance.

RDS Allocated Storage (RDSAllocatedStorage)
Default: 5
Size of the RDS database, in the range 5-1024 GiB.

RDS Database Name (RDSDatabaseName)
Default: awsdatalakeqs
The name of the RDS database. This is a 4-20 character string consisting of letters and numbers. The database name must start with a letter and contain no special characters.

RDS User Name (RDSUserName)
Default: admin
The user name associated with the administrator account for the RDS database instance. This is a 4-20 character string consisting of letters and numbers. The user name must start with a letter and contain no special characters.

RDS Password (RDSPassword)
Default: Requires input
The password associated with the administrator account for the RDS database instance. This string must be a minimum of 8 characters, consisting of letters, numbers, and symbols.

Elasticsearch Configuration:

Elasticsearch Instance Type (ElasticsearchInstanceType)
Default: t2.medium.elasticsearch
Instance type for the Elasticsearch nodes.

Elasticsearch Instance Count (ElasticsearchInstanceCount)
Default: 1
The number of Elasticsearch instances to provision. For guidance, see the Amazon ES documentation.

Elasticsearch Instance Volume Size (ElasticsearchVolumeSize)
Default: 20
Volume size of the Elasticsearch instances, in GiB.

Elasticsearch Instance Volume Type (ElasticsearchVolumeType)
Default: gp2
Volume type of the Elasticsearch instances:
 gp2 – General Purpose (SSD)
 standard – Magnetic
 io1 – Provisioned IOPS (SSD)

Amazon Redshift Configuration:

Redshift Cluster Type (RedshiftClusterType)
Default: single-node
Cluster type for the Amazon Redshift cluster. Options are single-node and multi-node.

Redshift Node Type (RedshiftNodeType)
Default: dc1.large
Instance type for the nodes in the Amazon Redshift cluster.

Number of Amazon Redshift Nodes (NumberOfNodes)
Default: 1
The number of nodes in the Amazon Redshift cluster. If the Redshift Cluster Type parameter is set to single-node, this parameter value should be 1.

Amazon EC2 Configuration:

Keypair Name (KeyPairName)
Default: Requires input
Public/private key pair, which allows you to connect securely to your instances after they launch. This is the key pair you created in your preferred region in step 1.

NAT Instance Type (NATInstanceType)
Default: t2.micro
EC2 instance type for NAT instances. This parameter is used only if your selected AWS Region doesn’t support NAT gateways.

Data Lake Portal Instance Type (PortalInstanceType)
Default: m1.medium
EC2 instance type for the data lake web portal.

Data Lake Administrator Configuration:

Administrator Name (AdministratorName)
Default: AdminName
User name for data lake portal access.

Administrator Email (AdministratorEmail)
Default: Requires input
Email address to which information for accessing the data lake portal will be sent after deployment is complete. (See step 3 for details.)

AWS Quick Start Configuration:

Quick Start S3 Bucket Name (QSS3BucketName)
Default: quickstart-reference
S3 bucket where the Quick Start templates and scripts are installed. Use this parameter to specify the S3 bucket name you’ve created for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. The bucket name can include numbers, lowercase letters, uppercase letters, and hyphens, but should not start or end with a hyphen.

Quick Start S3 Key Prefix (QSS3KeyPrefix)
Default: data/lake/cloudwick/latest/
The S3 key name prefix used to simulate a folder for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. This prefix can include numbers, lowercase letters, uppercase letters, hyphens, and forward slashes.

 Option 2: Parameters for deploying the Quick Start into an existing VPC
View template

Network Configuration:

VPC ID (VPCID)
Default: Requires input
ID of your existing VPC (e.g., vpc-0343606e).

VPC CIDR (VPCCIDR)
Default: Requires input
CIDR block for the VPC.

Private Subnet 1 ID (PrivateSubnet1ID)
Default: Requires input
ID of the private subnet in Availability Zone 1 in your existing VPC (e.g., subnet-a0246dcd).

Private Subnet 2 ID (PrivateSubnet2ID)
Default: Requires input
ID of the private subnet in Availability Zone 2 in your existing VPC (e.g., subnet-b1f432cd).

Public Subnet 1 ID (PublicSubnet1ID)
Default: Requires input
ID of the public subnet in Availability Zone 1 in your existing VPC (e.g., subnet-9bc642ac).

Public Subnet 2 ID (PublicSubnet2ID)
Default: Requires input
ID of the public subnet in Availability Zone 2 in your existing VPC (e.g., subnet-e3246d8e).

Amazon RDS Configuration:

RDS Instance Type (RDSInstanceType)
Default: db.t2.small
DB instance class for the RDS DB instance.

RDS Allocated Storage (RDSAllocatedStorage)
Default: 5
Size of the RDS database, in the range 5-1024 GiB.

RDS Database Name (RDSDatabaseName)
Default: awsdatalakeqs
The name of the RDS database. This is a 4-20 character string consisting of letters and numbers. The database name must start with a letter and contain no special characters.

RDS User Name (RDSUserName)
Default: admin
The user name associated with the administrator account for the RDS database instance. This is a 4-20 character string consisting of letters and numbers. The user name must start with a letter and contain no special characters.

RDS Password (RDSPassword)
Default: Requires input
The password associated with the administrator account for the RDS database instance. This string must be a minimum of 8 characters, consisting of letters, numbers, and symbols.

Elasticsearch Configuration:

Elasticsearch Instance Type (ElasticsearchInstanceType)
Default: t2.medium.elasticsearch
Instance type for the Elasticsearch nodes.

Elasticsearch Instance Count (ElasticsearchInstanceCount)
Default: 1
The number of Elasticsearch instances to provision. For guidance, see the Amazon ES documentation.

Elasticsearch Instance Volume Size (ElasticsearchVolumeSize)
Default: 20
Volume size of the Elasticsearch instances, in GiB.

Elasticsearch Instance Volume Type (ElasticsearchVolumeType)
Default: gp2
Volume type of the Elasticsearch instances:
 gp2 – General Purpose (SSD)
 standard – Magnetic
 io1 – Provisioned IOPS (SSD)

Amazon Redshift Configuration:

Redshift Cluster Type (RedshiftClusterType)
Default: single-node
Cluster type for the Amazon Redshift cluster. Options are single-node and multi-node.

Redshift Node Type (RedshiftNodeType)
Default: dc1.large
Instance type for the nodes in the Amazon Redshift cluster.

Number of Amazon Redshift Nodes (NumberOfNodes)
Default: 1
The number of nodes in the Amazon Redshift cluster. If the Redshift Cluster Type parameter is set to single-node, this parameter value should be 1.

Amazon EC2 Configuration:

Keypair Name (KeyPairName)
Default: Requires input
Public/private key pair, which allows you to connect securely to your instances after they launch. This is the key pair you created in your preferred region in step 1.

Data Lake Portal Instance Type (PortalInstanceType)
Default: m1.medium
EC2 instance type for the data lake web portal.

NAT Instance Type (NATInstanceType)
Default: t2.micro
EC2 instance type for NAT instances. This parameter is used only if your selected AWS Region doesn’t support NAT gateways.

Data Lake Administrator Configuration:

Administrator Name (AdministratorName)
Default: AdminName
User name for data lake portal access.

Administrator Email (AdministratorEmail)
Default: Requires input
Email address to which information for accessing the data lake portal will be sent after deployment is complete. (See step 3 for details.)

AWS Quick Start Configuration:

Quick Start S3 Bucket Name (QSS3BucketName)
Default: quickstart-reference
S3 bucket where the Quick Start templates and scripts are installed. Use this parameter to specify the S3 bucket name you’ve created for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. The bucket name can include numbers, lowercase letters, uppercase letters, and hyphens, but should not start or end with a hyphen.

Quick Start S3 Key Prefix (QSS3KeyPrefix)
Default: data/lake/cloudwick/latest/
The S3 key name prefix used to simulate a folder for your copy of Quick Start assets, if you decide to customize or extend the Quick Start for your own use. This prefix can include numbers, lowercase letters, uppercase letters, hyphens, and forward slashes.
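
If you’d rather launch the template programmatically than through the console, the
following sketch calls AWS CloudFormation with boto3. The template URL is a placeholder
(copy the real one from the Launch link you chose); the parameter keys come from the
tables above, and the values shown are examples only.

import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")  # example region

cfn.create_stack(
    StackName="datalake-foundation",
    # Placeholder URL; use the one behind the Launch button you chose.
    TemplateURL="https://s3.amazonaws.com/<bucket>/<prefix>/data-lake.template",
    Capabilities=["CAPABILITY_IAM"],  # the template creates IAM resources
    Parameters=[
        {"ParameterKey": "AvailabilityZones", "ParameterValue": "us-west-2a,us-west-2b"},
        {"ParameterKey": "KeyPairName", "ParameterValue": "datalake-qs-key"},
        {"ParameterKey": "AccessCIDR", "ParameterValue": "203.0.113.0/24"},
        {"ParameterKey": "RDSPassword", "ParameterValue": "ChangeMe12345!"},
        {"ParameterKey": "AdministratorEmail", "ParameterValue": "you@example.com"},
    ],
)
# Wait until the stack reaches CREATE_COMPLETE (roughly 20 minutes).
cfn.get_waiter("stack_create_complete").wait(StackName="datalake-foundation")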

5. On the Options page, you can specify tags (key-value pairs) for resources in your stack
and set advanced options. When you’re done, choose Next.
6. On the Review page, review and confirm the template settings. Under Capabilities,
select the check box to acknowledge that the template will create IAM resources.
7. Choose Create to deploy the stack.
8. Monitor the status of the stack. When the status is CREATE_COMPLETE, the data
lake cluster is ready.
9. Check the Events tab to monitor the status of the resources in the stack.

Step 3. Test the Deployment


1. When the Quick Start deployment has completed successfully, you’ll receive an email
with a URL, login ID, and password. Check your inbox for this information 15-20
minutes after deployment is complete.
2. Open the URL in your browser window and log in with the credentials you received to
access the data lake portal, as illustrated in Figure 2.

Figure 2: Login screen for portal

Step 4. Explore the Data Lake Portal


When you log in, you’ll see the data lake portal shown in Figure 3.

Figure 3: Data lake portal

From this portal page, you can manage data, check resources, and visualize data using the
Data Management, Resources, and Visualize options in the upper-right corner.

 Choose Data Management to manage data in Amazon S3 or Kinesis Firehose.


– Use the Amazon S3 option to upload files to, download files from, and delete files in
the data lake repository.

Figure 4: Data management in Amazon S3

– Use the Explore Catalogue option to monitor the metadata of the files in
Amazon S3.

Figure 5: Data management in Kinesis Firehose

 Choose Resources in the upper-right corner to check all the AWS resources used and
their endpoints in the data lake.
a. In the RDS Details section, choose the link next to Instance Identifier to go to
the Amazon RDS page.

Figure 6: Reviewing AWS resources used in the data lake

To test migrating data from Amazon RDS to Amazon Redshift with AWS Data Pipeline,
you’ll need to add some tables with data to Amazon RDS.

b. Choose SQL command in the left pane to create tables and insert data.

Figure 7: Adding data tables to Amazon RDS

Alternatively, you can use the Import option (next to the SQL command button)
to import your .sql files and execute them.

Figure 8: Importing .sql files
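
If you’d rather load the sample tables from outside the portal, here is a sketch using the
pymysql client (an assumption; any MySQL client works). The endpoint and password are
placeholders from your stack’s RDS parameters, and the customers table is an arbitrary
example.

import pymysql  # assumption: any MySQL client library would do

# The endpoint comes from the Resources screen; the user and database
# names below are the defaults from the parameter tables in this guide.
conn = pymysql.connect(
    host="<rds-endpoint>",
    user="admin",
    password="<RDSPassword>",
    database="awsdatalakeqs",
)
with conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        " id INT PRIMARY KEY, name VARCHAR(100), city VARCHAR(100))"
    )
    cur.execute("INSERT INTO customers VALUES (1, 'Jane Doe', 'Portland')")
conn.commit()
conn.close()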

c. Choose Resources again in the upper right and scroll down the page to the
Datapipeline Details section. Choose Run a datapipeline.
d. Fill out the form to migrate your data from Amazon RDS to Amazon Redshift.

Figure 9: Using AWS Data Pipeline to migrate data to Amazon Redshift
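
The form creates and schedules the pipeline for you. If you later want to re-run it from
code, AWS Data Pipeline exposes the same operations through its API; the following boto3
sketch assumes the portal’s pipeline name contains "datalake", which you should confirm
on the Resources screen.

import boto3

dp = boto3.client("datapipeline", region_name="us-west-2")  # example region

# Find the pipeline the portal created and activate it on demand. The
# name filter is an assumption; check the actual pipeline name first.
pipelines = dp.list_pipelines()["pipelineIdList"]
target = next(p for p in pipelines if "datalake" in p["name"].lower())
dp.activate_pipeline(pipelineId=target["id"])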

e. When the data has been migrated, you can view it in Amazon Redshift by using the
Amazon Redshift endpoint link on the Resources screen.

Figure 10: Amazon Redshift endpoint in Resources
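
You can also verify the migration programmatically. Amazon Redshift speaks the
PostgreSQL wire protocol, so a standard client such as psycopg2 works; the endpoint,
database, and credentials below are placeholders from your cluster, and the customers
table assumes the example created earlier.

import psycopg2  # Redshift accepts standard PostgreSQL clients

# The endpoint and credentials come from the Resources screen;
# 5439 is Redshift's default port.
conn = psycopg2.connect(
    host="<redshift-endpoint>",
    port=5439,
    dbname="<database>",
    user="<master-user>",
    password="<password>",
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM customers")
print("rows migrated:", cur.fetchone()[0])
conn.close()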

 Choose Visualize in the upper-right corner to visualize your data using Zeppelin or
Kibana.

– Use Zeppelin to run Spark code on the data in Amazon S3. You can also fetch data
from Amazon Redshift by using the Interpreter option in Zeppelin.

Figure 11: Using Zeppelin from the data lake portal
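
For example, a Zeppelin paragraph like the following reads uploaded files directly from
Amazon S3, assuming a Spark 2.x interpreter with the s3a connector. The bucket, prefix,
and "city" column are assumptions about your data, not values created by the deployment.

# Runs inside a Zeppelin notebook paragraph, which provides `spark`.
df = spark.read.csv("s3a://my-datalake-bucket/raw/sales/", header=True)
df.groupBy("city").count().show()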

– Use Kibana to visualize real-time streaming data with histograms, line graphs, pie
charts, and heat maps that reflect the types of API calls being made on the data lake.

Figure 12: Data streaming in Kibana

Deleting the Stack


When you have finished using the resources created by this Quick Start, you can delete the
stack. Deleting a stack, either by using the command line interface (CLI) or through the
AWS CloudFormation console, will remove all the resources created by the template for the
stack.
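
A minimal sketch of scripted deletion with boto3 follows; the stack name is whatever you
used at launch.

import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")  # example region

# Deleting the stack removes every resource the template created.
cfn.delete_stack(StackName="datalake-foundation")
cfn.get_waiter("stack_delete_complete").wait(StackName="datalake-foundation")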

Note The data pipeline is created by the data lake portal and is invoked on
demand. Pipelines are launched when you run them and are terminated once they finish.

Troubleshooting
Q. I encountered a CREATE_FAILED error when I launched the Quick Start. What should
I do?
A. If AWS CloudFormation fails to create the stack, we recommend that you relaunch the
template with Rollback on failure set to No. (This setting is under Advanced in the
AWS CloudFormation console, Options page.) With this setting, the stack’s state will be
retained and the instance will be left running, so you can troubleshoot the issue. (You’ll
want to look at the log files on the instance, for example /var/log/cloud-init.log and
/var/log/cfn-init.log.)

Important When you set Rollback on failure to No, you’ll continue to


incur AWS charges for this stack. Please make sure to delete the stack when
you’ve finished troubleshooting.

For additional information, see Troubleshooting AWS CloudFormation on the AWS website
or contact us on the AWS Quick Start Discussion Forum.

Q. I encountered a size limitation error when I deployed the AWS CloudFormation
templates.
A. We recommend that you launch the Quick Start templates from the location we’ve
provided or from another S3 bucket. If you deploy the templates from a local copy on your
computer or from a non-S3 location, you might encounter template size limitations when
you create the stack. For more information about AWS CloudFormation limits, see the AWS
documentation.

Additional Resources
AWS services
 Amazon EC2
https://aws.amazon.com/documentation/ec2/
 AWS CloudFormation
https://aws.amazon.com/documentation/cloudformation/
 Amazon VPC
https://aws.amazon.com/documentation/vpc/
 For a complete set of links to AWS services used in this Quick Start, see the AWS
Components section.

Data lake visualization tools


 Kibana plug-in
https://aws.amazon.com/elasticsearch-service/kibana/
 Apache Zeppelin
http://zeppelin.apache.org/

Quick Start reference deployments


 AWS Quick Start home page
https://aws.amazon.com/quickstart/

Send Us Feedback
You can visit our GitHub repository to download the templates and scripts for this Quick
Start, to post your comments, and to share your customizations with others.

Document Revisions
Date Change In sections

August 2017 Initial publication —

© 2017, Amazon Web Services, Inc. or its affiliates, and Cloudwick Technologies, Inc. All
rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings
and practices as of the date of issue of this document, which are subject to change without notice. Customers
are responsible for making their own independent assessment of the information in this document and any
use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether
express or implied. This document does not create any warranties, representations, contractual
commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities
and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of,
nor does it modify, any agreement between AWS and its customers.

The software included with this paper is licensed under the Apache License, Version 2.0 (the "License"). You
may not use this file except in compliance with the License. A copy of the License is located at
http://aws.amazon.com/apache2.0/ or in the "license" file accompanying this file. This code is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
