
Processing data

DATA ENGINEERING FOR EVERYONE

Hadrien Lacroix
Content Developer at DataCamp
A general definition
Data processing: converting raw data into meaningful information



Data processing value
Conceptually                                  At Spotflix
Remove unwanted data                          No long-term need for testing feature data
Optimize memory, process and network costs    Can't afford to store and stream files this big
Convert data from one type to another

Data processing value
Conceptually                             At Spotflix
Remove unwanted data                     No need for lossless format
To save memory                           Can't afford to store files this big
Convert data from one type to another    Convert songs from .flac to .ogg
Organize data                            Reorganize data from the data lake to data warehouses
To fit into a schema/structure           Employee table example
Increase productivity                    Enable data scientists
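
A minimal sketch of three of the ideas above: removing unwanted data, converting data from one type to another, and fitting a record into a schema. The Song schema, its field names, and the sample record are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


# Hypothetical schema for a song record; the field names are illustrative only.
@dataclass
class Song:
    title: str
    artist: str
    duration_seconds: int


def process(raw: dict) -> Song:
    """Turn one raw record into a Song that fits the schema."""
    return Song(
        title=raw["title"].strip(),
        artist=raw["artist"].strip(),
        # Convert data from one type to another: text -> integer.
        duration_seconds=int(raw["duration"]),
        # Any other key (for example test-feature data) is simply left out.
    )


raw_record = {
    "title": "  Song A ",
    "artist": "Artist A",
    "duration": "215",
    "test_feature_flag": "on",  # unwanted data
}
song = process(raw_record)
print(song)
```

Only the fields named in the schema survive, so the processed record is smaller and has predictable types.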



How data engineers process data
Data manipulation, cleaning, and tidying tasks    Rejecting corrupt song files
  that can be automated                           Deciding what happens with missing metadata
  that will always need to be done
Store data in a sanely structured database        Separate artists and albums tables...
Create views on top of the database tables        ...but provide a view combining them
Optimizing the performance of the database        Indexing
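
The tasks above can be sketched with Python's built-in sqlite3 module. This is one possible illustration, not how Spotflix does it: the table, view, and index names, the sample records, and the "Unknown artist" policy for missing metadata are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Separate artists and albums tables...
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE albums (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        artist_id INTEGER NOT NULL REFERENCES artists(id)
    );
    -- ...but a view combining them.
    CREATE VIEW album_overview AS
        SELECT albums.title AS album, artists.name AS artist
        FROM albums JOIN artists ON artists.id = albums.artist_id;
    -- An index to speed up lookups of albums by artist.
    CREATE INDEX idx_albums_artist ON albums(artist_id);
""")

# Hypothetical raw records.
raw_songs = [
    {"artist": "Artist A", "album": "Album 1", "corrupt": False},
    {"artist": None, "album": "Album 2", "corrupt": False},       # missing metadata
    {"artist": "Artist B", "album": "Album 3", "corrupt": True},  # corrupt file
]

for song in raw_songs:
    if song["corrupt"]:
        continue  # reject corrupt song files
    artist = song["artist"] or "Unknown artist"  # one possible policy for missing metadata
    row = conn.execute("SELECT id FROM artists WHERE name = ?", (artist,)).fetchone()
    if row is None:
        artist_id = conn.execute(
            "INSERT INTO artists (name) VALUES (?)", (artist,)
        ).lastrowid
    else:
        artist_id = row[0]
    conn.execute(
        "INSERT INTO albums (title, artist_id) VALUES (?, ?)",
        (song["album"], artist_id),
    )

rows = conn.execute("SELECT album, artist FROM album_overview ORDER BY album").fetchall()
print(rows)
```

Readers of the view never need to know that artists and albums live in separate tables.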



1 The difference between batch and stream will be explained in the next lesson!

Summary
What data processing is

Why it's necessary

What it consists of

How we process data at Spotflix



Let's practice!
Scheduling data
Scheduling
Can apply to any task listed in data processing

Scheduling is the glue of your system

Holds each piece and organizes how they work together

Runs tasks in a specific order and resolves all dependencies
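
Running tasks in a specific order while resolving dependencies can be sketched as a topological sort. The example below uses graphlib from the Python standard library (Python 3.9+); the task names are hypothetical, and real schedulers do far more than this.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each one maps to the set of tasks it depends on.
dependencies = {
    "extract_songs": set(),
    "clean_songs": {"extract_songs"},
    "update_employee_table": set(),
    "update_department_tables": {"update_employee_table"},
    "build_report": {"clean_songs", "update_department_tables"},
}


def run_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Several orders can be valid; the only guarantee is that no task appears before something it depends on.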

Manual, time, and sensor scheduling
Manually                                            Manually update the employee table
Automatically run at a specific time                Update the employee table at 6 AM
Automatically run if a specific condition is met    Update the department tables if a new employee was added
  (sensor scheduling)
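
The two automatic triggers above can be sketched as small check functions. The function names and the sample data are hypothetical; a real scheduler would evaluate such checks repeatedly rather than once.

```python
from datetime import datetime, time


def time_trigger(now: datetime, run_at: time) -> bool:
    """Time scheduling: fire when the clock reaches the chosen hour and minute."""
    return (now.hour, now.minute) == (run_at.hour, run_at.minute)


def sensor_trigger(known_employees: set, current_employees: set) -> bool:
    """Sensor scheduling: fire when a condition is met, here a new employee."""
    return bool(current_employees - known_employees)


six_am = time(6, 0)
# Would the 6 AM update of the employee table run at this moment?
print(time_trigger(datetime(2024, 1, 15, 6, 0), six_am))
# Would the department tables be updated after a new employee was added?
print(sensor_trigger({"ana", "ben"}, {"ana", "ben", "chloe"}))
```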



Batches and streams
Batches                                   Songs uploaded by artists
  Group records at intervals              Employee table
  Often cheaper                           Revenue table
Streams                                   New users signing in
  Send individual records right away
Another example: online vs. offline listening
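
A small sketch of the difference, with hypothetical event names. As a simplification, the batches here are cut after a fixed number of records, standing in for a time interval.

```python
from typing import Iterable, Iterator, List


def in_batches(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Batch: collect records and hand them over in groups."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # the last, possibly smaller, group


def as_stream(records: Iterable[str]) -> Iterator[str]:
    """Stream: hand over each record individually, right away."""
    for record in records:
        yield record


events = ["song_1", "song_2", "song_3", "song_4", "song_5"]
batches = list(in_batches(events, batch_size=2))
stream = list(as_stream(events))
print(batches)
print(stream)
```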



Scheduling tools



Summary
What scheduling is

Different ways to set it up

Difference between batches and streams

How scheduling is implemented at Spotflix

Airflow, Luigi



Let's practice!
Parallel computing
Parallel computing
Basis of modern data processing tools

Necessary:
Mainly because of memory

Also for processing power

How it works:
Split tasks up into several smaller subtasks

Distribute these subtasks over several computers
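
The split-and-distribute idea can be sketched on a single machine. In this illustration, worker threads from Python's concurrent.futures stand in for the "several computers"; the split helper and the summing task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split a list into `parts` chunks of nearly equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def subtask(chunk):
    """The smaller piece of work each worker performs."""
    return sum(chunk)


numbers = list(range(1, 101))
chunks = split(numbers, parts=4)

# Distribute the subtasks over four workers, then combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(subtask, chunks))
total = sum(partial_sums)
print(partial_sums, total)
```

Each worker only ever holds its own chunk, which is why splitting helps with memory as well as processing power.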

1 Emojis by Mohamed Hassan

Benefits and risks of parallel computing
Employees = processing units

Advantages
Extra processing power

Reduced memory footprint

Disadvantages
Moving data incurs a cost

Communication time
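
A toy model of this trade-off, under loudly stated assumptions: the work divides perfectly across workers, and the cost of moving data and communicating grows linearly with the number of workers. Neither assumption comes from the slides; the numbers are illustrative only.

```python
def parallel_time(work_seconds, workers, overhead_per_worker):
    """Toy model: perfectly divisible work plus overhead that grows with each worker."""
    return work_seconds / workers + overhead_per_worker * workers


for workers in (1, 2, 4, 8, 16, 32):
    print(workers, parallel_time(100.0, workers, 1.0))
```

In this model, adding workers helps at first, but beyond a certain point the communication overhead outweighs the extra processing power.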

Summary
Benefits and risks

How it's implemented at Spotflix



Let's practice!
Cloud computing
Cloud computing for data processing
Servers on premises                         Servers on the cloud
Bought                                      Rented
Need space                                  Don't need space
Electrical and maintenance cost             Use just the resources we need
Enough power for peak moments               When we need them
Processing power unused at quieter times    The closer to the user the better
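
The peak-capacity point can be made concrete with a little arithmetic. The demand figures below are hypothetical and carry no prices; they only count how many server time slots go unused when capacity is sized for the peak.

```python
# Hypothetical number of servers needed in each of six time slots of a day.
demand = [2, 2, 3, 10, 6, 3]

# On premises: enough servers for the peak moment, running in every slot.
on_premises_capacity = max(demand)
on_premises_server_slots = on_premises_capacity * len(demand)

# On the cloud: rent just the servers each slot needs.
cloud_server_slots = sum(demand)

unused = on_premises_server_slots - cloud_server_slots
print(on_premises_server_slots, cloud_server_slots, unused)
```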



Cloud computing for data storage
Database reliability: data replication

Risk with sensitive data
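
Data replication, mentioned above as the basis of database reliability, can be sketched as follows. The ReplicatedStore class is hypothetical and ignores everything a real database must handle (consistency, recovery, network failures); it only shows why a copy on another server keeps data readable when one server is down.

```python
class ReplicatedStore:
    """Toy key-value store that keeps a copy of every write on each replica."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.available = [True] * replica_count

    def write(self, key, value):
        # Copy the value to every replica that is currently up.
        for replica, up in zip(self.replicas, self.available):
            if up:
                replica[key] = value

    def read(self, key):
        # Answer from the first available replica that holds the key.
        for replica, up in zip(self.replicas, self.available):
            if up and key in replica:
                return replica[key]
        raise KeyError(key)


store = ReplicatedStore(3)
store.write("song_42", "metadata")
store.available[0] = False  # simulate one server going down
print(store.read("song_42"))
```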

Multicloud
Pros                                          Cons
Reducing reliance on a single vendor          Cloud providers try to lock in consumers
Cost efficiencies                             Incompatibility
Local laws requiring certain data to be       Security and governance
  physically present within the country
Mitigating disasters



Summary
Benefits and risks of cloud computing

How it is implemented at Spotflix

Can cite the main cloud providers and their services



Let's practice!
We are the champions
Actually, YOU are the champion!



What you learned - chapter 1
What Data Engineering is

How important it is

How data engineers differ from data scientists

What a data pipeline is and how it works



What you learned - chapter 2
The different structures data can take

How fundamental SQL is

The differences between data lakes, data warehouses and databases



What you learned - chapter 3
How data is processed

How scheduling holds it all together

Parallel computing

Cloud computing



And some more
What SQL code actually looks like

Main tools and technologies used in data engineering

And some more

Lexicon



A promise is a promise, DataChamps!
All the exercises are song titles

Search for "DataChamps" on Spotify



Congratulations!