https://cognitiveclass.ai/
Module 1 - Big Data - Beyond the Hype
1. Big Data Skills and Sources of Big Data
2. Big Data Adoption
Collection
Pre-processing
Hygiene
Analysis
Interpretation
Intervention
Visualisation
Sources of Data
Technical Components (Optional). The modules below will be covered at the end of the day.
Introduction to Python
Jupyter
Interactive computing
Functions, arguments in Python
Introduction to Pandas
Day 2
3. Source of Data
Data collection is expensive and time-consuming. In some cases you will be lucky enough to have
existing datasets available to support your analysis: datasets from previous analyses, access to
data providers, or curated datasets from your organization. In many cases, however, you will not
have access to the data you need, and you will have to find alternative sources. Twitter data is a
good example. Depending on the options selected by the Twitter user, every tweet contains more than
the message content that most users are aware of. It also contains a view of the person's network,
their home location, the location from which the message was sent, and a number of other features
that can be very useful when studying networks around a topic of interest.
Network Data
Social Context Data
Sensor Data
Systems Data
Machine log data
Structured Vs Unstructured Data
Basic Statistics
Analyse your dataset and determine features
Data validation
Noise and bias
Random errors
Systematic errors
5. Graph Theory
Technical Components (Optional). The modules below will be covered at the end of the day.
Introduction to NetworkX
Adjacency Matrix
Clustering
Create a Graph
Measure centrality
Degree distribution
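The graph topics listed above can be tried on a toy network. The following is a minimal sketch using NetworkX; the node names and edges are made up for illustration and are not part of the course material. The adjacency matrix is built by hand as a nested list so that the sketch needs nothing beyond NetworkX itself.

```python
import networkx as nx

# Create a graph: a small, invented friendship network.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("carol", "dave"),
])

# Adjacency matrix as a nested list, rows and columns in node insertion order.
nodes = list(G.nodes)
matrix = [[1 if G.has_edge(u, v) else 0 for v in nodes] for u in nodes]

# Degree centrality: each node's degree divided by (number of nodes - 1).
centrality = nx.degree_centrality(G)

# Local clustering coefficient: how connected a node's neighbours are to each other.
clustering = nx.clustering(G)

# Degree distribution: entry i is the number of nodes with degree i.
degree_histogram = nx.degree_histogram(G)

for name, row in zip(nodes, matrix):
    print(name, row)
print("most central node:", max(centrality, key=centrality.get))
print("degree histogram:", degree_histogram)
```

In this toy network "carol" is linked to every other node, so she has the highest degree centrality, while "dave" has a single neighbour and therefore a clustering coefficient of zero.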
Machine Learning
Metadata
Training data and test data
Identifying Features
Technical Components (Optional). The modules below will be covered at the end of the day.
Introduction to Scikit-learn
Introduction to Mlxtend
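The idea behind "Training data and test data" can be shown without any library. The sketch below is a dependency-free illustration; the function name `split_train_test`, the 25% test fraction, and the toy rows are assumptions made for this example. In practice, scikit-learn's `sklearn.model_selection.train_test_split` serves the same purpose.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy dataset: (feature, label) pairs invented for illustration.
rows = [(i, i % 2) for i in range(20)]
train, test = split_train_test(rows)

print(len(train), "training rows,", len(test), "test rows")
```

The model is fitted only on `train`; `test` is held back so that the evaluation uses rows the model has never seen.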
Day 3
7. Rolling out Big Data projects
Hypothetical Big Data project use case: cybersecurity measures within a company in relation to
insider threats. The company hosts thousands of applications for various business functions. The
context is User Behavior Analytics. Signals include login metadata for each application, location
data, network data, employee data, performance appraisal data, travel data, and desktop activity data.
The analytics focuses on determining a risk score for each user.
The technology component in the insider threat context requires collection and processing of the
following data:
User Data
Application logs
Access data
Business data
Assets, CMDB
User activity
Network data
A layered approach to data processing is ideal, starting with the implementation of an ETL (Extract,
Transform, Load) layer. The data is then processed with dedicated tools.
The last layer is the data lake, which stores all structured and unstructured data. It can be accessed
through tools such as pandas, Hadoop, and graph databases.
The data lake enables building algorithms that detect risky behavior and send alerts. The
objective is to prioritize the alerts based on a risk score. For example, a user who accesses a certain
application from a specific IP address, recently received a low rating in their performance appraisal,
and has booked a long holiday will be flagged as high risk.
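One simple way to turn such signals into a prioritized alert list is a weighted sum. The sketch below is only an illustration: the signal names, the weights, and the threshold of 80 are hypothetical choices, since the use case names the signals but does not say how they are combined.

```python
from typing import Mapping

# Hypothetical signal names and weights (assumptions, not from the course material).
WEIGHTS = {
    "sensitive_app_access": 30,
    "flagged_ip_address": 30,
    "low_appraisal_rating": 20,
    "long_holiday_booked": 20,
}
HIGH_RISK_THRESHOLD = 80  # assumed cut-off for a "high risk" alert


def risk_score(signals: Mapping[str, bool]) -> int:
    """Sum the weights of every signal that is present for a user."""
    return sum(weight for name, weight in WEIGHTS.items() if signals.get(name, False))


def prioritize(users: Mapping[str, Mapping[str, bool]]) -> list:
    """Return (user, score) pairs sorted from highest to lowest risk."""
    scored = [(user, risk_score(signals)) for user, signals in users.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented users for illustration only.
users = {
    "user_a": {"sensitive_app_access": True, "flagged_ip_address": True,
               "low_appraisal_rating": True, "long_holiday_booked": True},
    "user_b": {"sensitive_app_access": True},
    "user_c": {"low_appraisal_rating": True, "long_holiday_booked": True},
}

alerts = prioritize(users)
high_risk = [user for user, score in alerts if score >= HIGH_RISK_THRESHOLD]

print(alerts)
print("high risk:", high_risk)
```

Only the user who shows all four signals crosses the assumed threshold. A real system would more likely learn the weights from labelled incidents than fix them by hand.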
Project Management
Different Phases
Technology components
Privacy
System architecture
Technical Components (Optional). The modules below will be covered at the end of the day.
K-Anonymity
Data Coarsening
Data suppression
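The three privacy techniques above work together: coarsening makes records less specific, k-anonymity measures how many records share the same quasi-identifiers, and suppression drops the records that remain too unique. The following is a minimal sketch; the records, the choice of age and postcode as quasi-identifiers, and the value k = 2 are all assumptions made for illustration.

```python
from collections import Counter

# Assumed quasi-identifiers: attributes that could re-identify a person when combined.
QUASI_IDENTIFIERS = ("age_band", "postcode_prefix")


def coarsen(record):
    """Data coarsening: replace exact age with a 10-year band and mask the postcode."""
    decade = record["age"] // 10 * 10
    return {
        "age_band": f"{decade}-{decade + 9}",
        "postcode_prefix": record["postcode"][:2] + "**",
        "department": record["department"],
    }


def group_sizes(records):
    """Count how many records share each combination of quasi-identifiers."""
    return Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)


def k_level(records):
    """The dataset is k-anonymous for k equal to the size of its smallest group."""
    sizes = group_sizes(records)
    return min(sizes.values()) if sizes else 0


def suppress(records, k):
    """Data suppression: drop records whose group has fewer than k members."""
    sizes = group_sizes(records)
    return [r for r in records
            if sizes[tuple(r[q] for q in QUASI_IDENTIFIERS)] >= k]


# Invented records for illustration only.
raw = [
    {"age": 34, "postcode": "2001", "department": "Finance"},
    {"age": 36, "postcode": "2005", "department": "IT"},
    {"age": 38, "postcode": "2009", "department": "Finance"},
    {"age": 52, "postcode": "7441", "department": "HR"},
]

coarse = [coarsen(r) for r in raw]
released = suppress(coarse, k=2)

print("k before suppression:", k_level(coarse))
print("k after suppression:", k_level(released))
```

After coarsening, the one record in the 50-59 age band is still unique, so it is suppressed before release. The remaining three records share the same quasi-identifier values.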
Final Exam
40 Questions