
2H-MT (CARDENAS, NOGOY, TY)

UNIT I: INTRODUCTION TO DATA SCIENCE AND ANALYTICS

Data Science & Analytics:


- NEW techniques to solve problems

DATA SCIENCE: According to Harvard Business Review (2012) →

DATA SCIENCE:

According to Glassdoor, data scientists earn a base pay of $116,840 a year on average (Business Insider).
NOTE: Is there an opportunity for everybody in Data Science and Analytics? The encompassing scope of Data Science and Analytics across all industries gives everyone an opportunity to be part of a Data Science and Analytics team.

DATA SCIENCE: It is a multidisciplinary field that uses scientific methods, processes, algorithms, computations, and systems to extract understanding and insights from structured and/or unstructured data.

HISTORY OF DATA SCIENCE & ANALYTICS

NECESSITY is the MOTHER OF INVENTION

REPORT WRITING (1970s)

Goal: AUTOMATION

CENTRALIZED SYSTEM (1980s)

Goal: ERP (Enterprise Resource Planning) / MIS (Management Information System)

BUSINESS INTELLIGENCE (1990s)

Goal: APPS for everyone


Applications for personal use were invented and made to share (not YET to analyze)

INTERNET & DATA MINING (2000s)


Applications for personal use were invented and made to share (not YET to analyze)

BIG DATA & DATA SCIENCE (2010s)


used for real-time analysis

EVOLUTION OF DATA SCIENCE & ANALYTICS

A short history of Data Science and Analytics shows how each era's necessity called for the skill sets and tools needed to fulfill it.
Technology and the necessary skills allow industries to keep up with the demands of the time.

EVOLUTION OF DATA SCIENCE & ANALYTICS

The needs of the industry, as demanded by the fast-moving realities of the present time, also make analytics evolve.

WHAT ARE YOU GOING TO DO WITH ALL THAT DATA?

The VALUE in the data “haystack” is guided by your knowledge of the DOMAIN – not the tools or techniques.

Finding that VALUE – the combination of all the skill sets that you need – is ANALYTICS.

WHAT IS DATA SCIENCE AND ANALYTICS?

Analytics is the process and art of bringing sense of the data to bear on decision-making.
Successful use of analytics and data mining requires both an understanding of the business context where value is to be captured and an understanding of exactly what the data mining methods do.
The visuals show the depth of the analytics that a company could perform and how much impact it would have on the industry.

DATA SCIENCE & ANALYTICS IN HEALTH CARE

➢ Medical Image Analysis


➢ Machine Learning in Medicine
➢ Genetics & Genomics
➢ Drug Dev’t
➢ Virtual assistance for patients and customer support

DATA MINING

➢ Finding useful patterns in data.
➢ It is the process of knowledge discovery, machine learning, and predictive analytics.
➢ Extracting Meaningful Patterns.
➢ Building Representative Models.
➢ Combination of Statistics, Machine Learning, and Computing
➢ Algorithms

DATA MINING IS NOT ABOUT:

• Descriptive statistics.
• Exploratory visualization.
• Dimensional slicing
• Hypothesis testing
• Queries

DATA MINING: Types of Learning Models

• Supervised
o directed data mining
o The model generalizes the relationship between the input and output variables.
• Unsupervised
o Undirected data mining
o The objective of this class of data mining techniques is to find patterns in data based on
the relationship between data points themselves
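
The difference between the two learning types can be sketched in a few lines of Python. This is only an illustration on invented toy data (hours studied and a pass/fail label); the function names are hypothetical and the methods shown (nearest neighbor, splitting at the largest gap) are simple stand-ins, not the algorithms used in class.

```python
# Supervised (directed): labeled examples teach a rule that maps input to output.
# Hypothetical toy data: (hours_studied, exam_result)
labeled = [(1.0, "fail"), (2.0, "fail"), (6.0, "pass"), (8.0, "pass")]


def predict_nearest(x, examples):
    """Return the label of the labeled example whose input is closest to x."""
    _, nearest_label = min(examples, key=lambda pair: abs(pair[0] - x))
    return nearest_label


# Unsupervised (undirected): no labels, only the relationship between points.
unlabeled = [1.0, 2.0, 6.0, 8.0]


def split_into_two_groups(points):
    """Split the sorted points at the largest gap between neighbors."""
    ordered = sorted(points)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


print(predict_nearest(7.0, labeled))      # pass
print(split_into_two_groups(unlabeled))   # ([1.0, 2.0], [6.0, 8.0])
```

The supervised function needs the pass/fail column to work; the unsupervised one finds the two groups from the distances alone.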

DATA MINING: Groups of Learning Models

• Classification Models
• Regression Models
• Clustering Models
• Anomaly Detection
• Time Series Forecasting
• Association
• Text and Sentiment Analysis

DATA MINING: Steps

▪ Business Understanding
▪ Data Understanding
▪ Data Preparation
▪ Modeling
▪ Testing and Evaluation
▪ Deployment
UNIT II: DATA PREPARATION

How to import Data?

➢ Click FILE, then IMPORT DATA,
➢ OR click IMPORT DATA in the Repository tab.
➢ Choose the source of your data set.
➢ Locate the data (CustomerDetails.xls), then click Next.
➢ Verify the cells you want to import and click Next.
➢ Format the columns with your specifications.
➢ You may change the type, role, and name of each attribute (variable).
➢ Click Next.
➢ Choose the folder where the data will be stored.
➢ Type the file name.
➢ Click Finish.
➢ The data will appear in the Results tab.

TYPES OF DATA

➢ Polynominal - many different string values (for example: red, green, blue, yellow)
➢ Binominal - exactly two values (for example: true/false, yes/no)
➢ Real - a fractional number (for example: 11.23 or -0.0001)
➢ Integer - a whole number (for example: 23, -5, or 11,024,768)
➢ Date_time - both date and time (for example: 23.12.2014 17:59)
➢ Date - date without time (for example: 23.12.2014)
➢ Time - time without date (for example: 17:59)
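
To see how these types relate to raw cell values, here is a rough sketch in Python. It is not how RapidMiner detects types; the function names are hypothetical, the date formats are taken from the examples above, and the type names follow RapidMiner's spelling (polynominal, binominal).

```python
from datetime import datetime


def guess_value_type(text):
    """Return a rough type name for one raw cell value given as a string."""
    cleaned = text.strip()
    # Try the date/time layouts used in the examples above.
    for fmt, name in (("%d.%m.%Y %H:%M", "date_time"),
                      ("%d.%m.%Y", "date"),
                      ("%H:%M", "time")):
        try:
            datetime.strptime(cleaned, fmt)
            return name
        except ValueError:
            pass
    try:
        int(cleaned.replace(",", ""))  # allow thousands separators
        return "integer"
    except ValueError:
        pass
    try:
        float(cleaned)
        return "real"
    except ValueError:
        return "nominal"


def guess_column_type(values):
    """Combine the per-value guesses into one type for the whole column."""
    kinds = {guess_value_type(v) for v in values}
    if kinds == {"integer"}:
        return "integer"
    if kinds and kinds <= {"integer", "real"}:
        return "real"
    if len(kinds) == 1 and "nominal" not in kinds:
        return kinds.pop()
    # Text columns: two distinct values are binominal, otherwise polynominal.
    return "binominal" if len(set(values)) == 2 else "polynominal"


print(guess_column_type(["yes", "no", "yes"]))               # binominal
print(guess_column_type(["red", "green", "blue", "yellow"]))  # polynominal
print(guess_column_type(["23", "11.23"]))                     # real
```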

How to import Data? (Using a RapidMiner operator)


➢ In the Views tab, click Design.
➢ Search for Read Excel in the operator tab.
➢ Drag and drop it to the canvas.
➢ Click Import Configuration Wizard.
➢ Locate and open the file (OrderDetails.xls).
➢ Click Next, Next, and Finish.

Exploratory Analysis

➢ View Results.
➢ To find the basic statistics of each attribute, click Statistics.

Data preparation

➢ Go back to Design view.


➢ Connect the Out node of the Read Excel operator to the res of the result knob.

➢ Click Run to execute the process.


➢ View Data.
➢ Check Statistics.

DATA FILTERING USING RAPIDMINER

Data Preparation

➢ Go back to Design view.


➢ Filtering cases.
o In the operator tab, search for Filter Examples, then drag and drop it onto the line connecting the Read Excel operator and the res knob.
o In the parameter tab, click Add Filters (with the condition class left on custom filters).
o Choose the attribute's filtering criteria.
o For example, retain only the orders before 2016.
▪ This will remove case(s) ordered from 2016 onwards.
o You may add more criteria by clicking Add Entry.
o Once all criteria have been set, click OK, then RUN.
▪ RapidMiner removed 1 case, an order taken from 2016 onwards.
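
The "before 2016" filter can also be written out in Python to see what the operator is doing. This is a minimal sketch, not the Filter Examples operator itself; the OrderID and OrderDate names and the three sample orders are invented for illustration.

```python
from datetime import date

# Hypothetical order records; attribute names are assumptions.
orders = [
    {"OrderID": 1, "OrderDate": date(2014, 5, 17)},
    {"OrderID": 2, "OrderDate": date(2015, 12, 31)},
    {"OrderID": 3, "OrderDate": date(2016, 1, 4)},
]

# Keep only the cases ordered before 2016.
cutoff = date(2016, 1, 1)
kept = [row for row in orders if row["OrderDate"] < cutoff]
removed = len(orders) - len(kept)

print(f"kept {len(kept)} cases, removed {removed}")  # kept 2 cases, removed 1
```

Adding more criteria (the Add Entry button) corresponds to adding more conditions joined with `and` inside the list comprehension.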

MISSING VALUE IMPUTATION USING RapidMiner

Data preparation

➢ Instead of filtering by custom criteria, you may remove all cases with missing values by changing the condition class instead of using Add Filters.
o As seen in the statistics of the data, 199 cases have missing values in the Discount attribute.
➢ Go back to Design view.
➢ Imputing Missing Data
o In the operator tab, search for Replace Missing Values, then drag and drop it onto the line connecting the Filter Examples operator and the res knob.
o In the parameter tab, select the attribute filter type. Choose single if the imputation will apply to a single attribute.
o Select the attribute where the imputation will be applied.
o Select the imputation method in Default.
o Click Run to see the result.
▪ No more missing values in the Discount attribute.
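
As an illustration of what imputation does, the sketch below replaces missing discounts with the average of the known ones. The values are invented, `None` stands in for a missing cell, and the average is only one possible imputation method; the notes do not say which method was chosen in the Default setting.

```python
from statistics import mean

# Hypothetical Discount values in percent; None marks a missing value.
discounts = [10.0, None, 20.0, None, 30.0]

# Compute the fill value from the non-missing cases only.
known = [d for d in discounts if d is not None]
fill_value = mean(known)

# Replace each missing value with the fill value.
imputed = [fill_value if d is None else d for d in discounts]

print(imputed)  # [10.0, 20.0, 20.0, 20.0, 30.0]
```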

DEALING WITH MISCODED ENTRIES USING RapidMiner

➢ Go back to Design view.


➢ Instead of the Order Details data, we will use the Customer Details data.
➢ Drag and drop the Customer Details data onto the canvas.
➢ The Customer Details data can be viewed in the Results view.
➢ Notice in the Statistics tab that the Gender attribute has miscoded entries.
o Click Details…
➢ Go back to Design view.
➢ Dealing with miscoded data
o Connect the Out node of the Retrieve Customer operator to the second res of the result knob.
o To remove “white spaces” in the encoding, use the TRIM operator.
o Select single if trimming shall be applied to a single attribute.
o Then click RUN.
▪ You may see the trimming result by viewing the statistics.
• Click Details…
o Go back to Design view.
o To remove “duplicates” in the encoding, use the Remove Duplicates operator.
o Select single if duplicates shall be checked on a single attribute.
▪ This will retain only one entry if duplicate Customer IDs have been found.
o Then click RUN.
▪ Still, 2267 cases are retained, indicating that there are no duplicates in Customer IDs.
➢ Go back to Design view.
o To recode miscoded values, use the REPLACE operator.
o Select single if replacing of values shall be applied to a single attribute.
o Add another REPLACE operator
o Replace FEMALE with girl.
▪ Add another REPLACE operator replacing male with boy;
▪ Add another REPLACE operator replacing m with boy;
▪ Add another REPLACE operator replacing f with girl;
▪ Add another REPLACE operator replacing MALE with boy;
▪ Add another REPLACE operator replacing Male with boy;
▪ To convert girl and boy back to female and male, respectively:
▪ Add another REPLACE operator replacing girl with female;
▪ Add another REPLACE operator replacing boy with male.
o Click RUN to verify the process.
▪ You may impute missing values in other attributes using the REPLACE MISSING VALUES operator.
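
The three cleaning steps (trim, remove duplicates, recode) can be sketched in Python as follows. The customer rows are invented and the lookup table is a hypothetical shortcut, not the chain of REPLACE operators used in the process. One possible reason for the girl/boy detour in the steps is that replacing the text male would also hit the male inside female; that is an assumption, not something stated in the notes. Matching the whole value, as done here, avoids the problem.

```python
# Hypothetical Customer Details rows with miscoded Gender entries.
customers = [
    {"CustomerID": 1, "Gender": " FEMALE"},
    {"CustomerID": 2, "Gender": "m "},
    {"CustomerID": 2, "Gender": "m "},   # duplicate Customer ID
    {"CustomerID": 3, "Gender": "Male"},
    {"CustomerID": 4, "Gender": "f"},
]

# Step 1 (Trim): remove leading and trailing white space.
for row in customers:
    row["Gender"] = row["Gender"].strip()

# Step 2 (Remove Duplicates): keep the first row for each Customer ID.
seen_ids = set()
unique = []
for row in customers:
    if row["CustomerID"] not in seen_ids:
        seen_ids.add(row["CustomerID"])
        unique.append(row)

# Step 3 (Replace): map every known spelling to one standard code.
RECODE = {"female": "female", "f": "female", "male": "male", "m": "male"}
for row in unique:
    row["Gender"] = RECODE.get(row["Gender"].lower(), row["Gender"])

print([row["Gender"] for row in unique])  # ['female', 'male', 'male', 'female']
```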

SELECTING AND SETTING ROLES OF ATTRIBUTES USING RapidMiner

➢ Selecting the Attributes for Analysis


o Use the Select Attributes operator to select the attributes that you need for analysis.
o You can select all the attributes, a single attribute, or a subset.
o Select the attributes that will be used for analysis.
▪ This will remove the names and the Responder attribute from the final data.
➢ Setting the role that an attribute will perform.
o Use the Set Role operator to tag the attribute that will be used as the label (Target Variable) or any other role it will play in the analysis.
COMBINING DATA SETS USING RapidMiner

➢ Joining Two Data Sets


o If two data sets need to be merged for an analysis, use the Join operator.
▪ Connect the first data set or its result to the left node of the Join operator and the other data set to the right node.
o In the parameter tab, use Inner as the join type.
o Click Edit List.
o Select the attribute in the first data set (left) and in the second data set (right) that will be used to match the two data sets.
o Click Apply, then click Run.


➢ Creating a new data set from the cleaned/pre-processed data.
o Use the “Store” operator to create a RapidMiner data set from the process.
o Use the “Write ***” operator to store the data in the format you want.
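
To make the Inner join type concrete, here is a small Python sketch of joining two data sets on a shared key attribute. The data and the `inner_join` helper are hypothetical; the point is only that an inner join keeps the cases whose key appears on both the left and the right side.

```python
# Hypothetical left and right data sets sharing the CustomerID attribute.
customers = [
    {"CustomerID": 1, "Name": "Ana"},
    {"CustomerID": 2, "Name": "Ben"},
    {"CustomerID": 3, "Name": "Cara"},
]
orders = [
    {"OrderID": 10, "CustomerID": 1},
    {"OrderID": 11, "CustomerID": 1},
    {"OrderID": 12, "CustomerID": 4},
]


def inner_join(left, right, key):
    """Return merged rows for every left/right pair with the same key value."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in right_by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


result = inner_join(customers, orders, "CustomerID")
print([row["OrderID"] for row in result])  # [10, 11]
```

Customers 2 and 3 have no orders and order 12 has no matching customer, so none of them appear in the result.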

UNIT III: DATA VISUALIZATION

- graphical representation of data


- techniques used to communicate insights from data through visual representation.
- to distill large datasets into visual graphics to allow for easy understanding of complex
relationships within the data
- to analyze massive amounts of information and make data-driven decisions.

COMMON VISUALIZATION TECHNIQUES

- Bar Graph
- Line Graph
- Pie Graph
- Histogram
- Scatterplot
- Boxplot
- Heatmap

The HIVStages.xlsx

➢ Data from patients infected with HIV.


➢ 9 patients per group (Stage 1, Stage 2, Stage 3, Stage 4)
➢ CD4 count before (CD4Count1) and after (CD4Count2) 6 months of antiretroviral therapy (ART).
o CD4CountIncrease – the increase in CD4 count gained
o CD4CountPercentIncrease – the % increase relative to CD4Count1
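
The two derived attributes follow directly from their definitions. A short sketch, with hypothetical function names and an invented patient (CD4 count of 200 before and 250 after ART):

```python
def cd4_increase(before, after):
    """Increase in CD4 count gained after ART."""
    return after - before


def cd4_percent_increase(before, after):
    """Increase expressed as a percentage of the count before ART."""
    return (after - before) / before * 100


print(cd4_increase(200, 250))          # 50
print(cd4_percent_increase(200, 250))  # 25.0
```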

Reported Symptoms

➢ Symptom1 – if the symptom is present BEFORE taking ART


➢ Symptom2 – if the symptom is present AFTER taking ART
➢ SymptomX – whether the patient's condition has improved, worsened, or shown no improvement

Missing values indicate the symptom was not present before and after ART.

Click Visualizations

BAR GRAPH - to compare counts, percentage, or other measures (average) for different discrete
categories of data

HOW TO MAKE A BAR GRAPH

1. Click Visualizations
2. Click Plot Type
3. Click X-Axis Column and transfer CD4Count1 to Selected Attributes
4. Check Aggregate Data, set GROUP BY to STAGE, and use AVERAGE as the AGGREGATE FUNCTION
5. If you click Axis Style, you can:
a. Check REVERSE AXIS (To rearrange the x-axis categories)
b. Further customization of the title, axes range, font, etc. may be done on your own.
6. Interpret
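
What "Aggregate Data" with group by Stage and the average function computes can be sketched in Python. The four rows below are invented, not taken from HIVStages.xlsx; each bar in the graph corresponds to one entry of the resulting dictionary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (Stage, CD4Count1) pairs.
rows = [("Stage 1", 520), ("Stage 1", 480), ("Stage 2", 350), ("Stage 2", 410)]

# Group the CD4 counts by stage.
by_stage = defaultdict(list)
for stage, cd4 in rows:
    by_stage[stage].append(cd4)

# Apply the average aggregate function to each group.
averages = {stage: mean(values) for stage, values in by_stage.items()}

for stage, average in averages.items():
    print(stage, average)
```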

HOW TO MAKE A CLUSTERED BAR GRAPH

1. With the bar graph you created, CLICK value columns and TRANSFER CD4Count1 & CD4Count2 (because a clustered graph has two variables plotted)
2. Click Y-Axis, then click Axis style and label the y-axis as Average CD4 Count

HOW TO CREATE A RADAR CHART

1. With the bar graph you created, click “Display as radar chart”

LINE GRAPH – to observe trend

PIE GRAPH – shows the relative contribution of different categories to an overall total

HOW TO MAKE A PIE CHART

1. Click Plot Type and select Pie


2. In Value column click Stage
3. Check “Aggregate Data”
4. In Group by select “LymphadenopathyX”
5. In aggregate function select “Count”
6. INTERPRET

HISTOGRAM – the frequency distribution of a continuous attribute

Difference of Histogram and Bar Graphs

➢ A bar graph presents a categorical attribute, while a histogram presents a numerical attribute
➢ Bar graphs have spaces between bars, while histograms do not

HOW TO MAKE HISTOGRAM

1. In Plot Type select “Histogram”


2. In Value columns, transfer CD4Count1 to Selected Attributes
3. In the X-axis column change the Title to CD4 count before ART
4. DO NOT CHECK the reverse axis to keep the ORDER of the values

HOW TO MAKE HISTOGRAM OF TWO OR MORE VARIABLES

1. In Value columns, transfer “CD4Count1” and “CD4Count2” to Selected Attributes

SCATTERPLOT – plots 2 numerical attributes

HOW TO MAKE A SCATTERPLOT

1. In Plot Type select “Scatter”


2. In the X-Axis Column select “CD4Count1”
3. In value columns select “CD4CountPercentageIncrease”
4. In Color select “Stage”

BOXPLOTS – graphical representation of the quartiles

HOW TO MAKE A BOXPLOT

1. In Plot Type select “Boxplot”


2. In Value Column, select “CD4Count1”
3. In Group by, Select Stage
4. In the X-Axis section, click Axis Style and change the title to CD4 count before ART
5. Check Reverse Axis, Show Decimal ticks and Visible

HEAT MAPS - a graphical representation of data where the individual values contained in a matrix
(map) are represented as colors.

UNIT IV: SENTIMENT ANALYSIS

SENTIMENT

- A view of or attitude toward a situation or event; an opinion


- Exaggerated and self-indulgent feelings of tenderness, sadness, or nostalgia

*Why do we need sentiment analysis? If we ask people directly, we are priming them. What usually comes out in an interview is what the interviewer wants to hear. On social media, no one is telling you to post that “tweet”; it's your own choice. It is much more natural, unstructured, and free-flowing, so it gives a better result (not a more positive response, but a truer result).

WHAT IS SENTIMENT ANALYSIS?

the process of computationally identifying and categorizing opinions expressed in a piece of text, as
either positive, negative, or neutral.

Sentiment analysis has three opinion categories: positive, neutral, and negative.

HOW TO SEARCH TWEETS IN TWITTER?

• Search for Search Twitter in the Operator Tab.

*The port to connect to is “res”

• Drag and drop it to the canvas


• Connect the Out node of the Search Twitter operator and “res” of
the result knob
• In the Parameter tab, click the folder for selecting repository location
• Locate Twitter in the Local Repository Connections
• Type the topic that you want to search in the query box

* Query: what you are looking for (not case sensitive)


Result type: what a
Limit: how many tweets will be downloaded
• Set the result type to recent, popular, or recent or popular.
• The limit is the maximum number of tweets to be downloaded.

PERFORMING SENTIMENT ANALYSIS

1. Search for Analyze Sentiment operator.

*Tokenize: to break the statement down into words. NLP performs sentiment analysis by finding each word's meaning, then measuring how many words are positive and how many are negative.

*This is difficult in Filipino because negative words are sometimes not used in a negative way.

2. Drag and drop it to the canvas


3. In the Parameter Tab, click the button for edit configuration.
4. Click Add Connection.
5. Type Rosette as the Name of the connection.
6. Type or paste the API key obtained upon signing up at the Rosette website.
7. Click Save all changes.



8. Choose “Rosette” as the Connection
9. Choose text as the Attribute Selector.
10. Check the Sentiment Score.
11. Click RUN
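
The idea in the Tokenize note (break the text into words, then count positive and negative words) can be sketched in Python. This is a deliberately naive illustration and not how the Rosette service works; the word lists and function names are invented. It also shows the weakness mentioned above: a word is scored the same way no matter how it is used.

```python
import re

# Hypothetical, tiny word lists; real lexicons are far larger.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "sad"}


def tokenize(text):
    """Break a statement down into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def classify_sentiment(text):
    """Label text by comparing its counts of positive and negative words."""
    words = tokenize(text)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("I love this, it is great!"))    # positive
print(classify_sentiment("Terrible service, I hate it"))  # negative
print(classify_sentiment("The package arrived today"))    # neutral
```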


NIKKI'S NOTES

➢ Even if attributes such as geo-location show a question mark, they will still be captured eventually
➢ In sentiment analysis we don't care about the names of the users; we care more about the locations and the content of the tweets
➢ Even private accounts can be seen; even if an account is deactivated, its data can still be obtained and becomes historical data
➢ We need to provide our data to this service provider as a trade-off for them to be able to serve us better (so they know what I want). The question is more about who owns the data. The problem with Cambridge Analytica: Facebook treated the users' data as its own and did not ask permission from the people who owned the data themselves. When I sign off, other service providers should no longer see my data, but this is not yet happening. Private messages cannot be seen, but posts with a hashtag can.

Why do we need sentiment analysis?

1. For marketing purposes, to sell your product (for example, the common sentiment about Dengvaxia)
2. To determine if a tweet is real or fake (whether the tweet was AI-generated or came from a real person)
3. We can trace where a certain type of tweet comes from (which place)

*We have only done the most basic form of sentiment analysis

HOW TO DO VISUALIZATION WITH SENTIMENTS

1. Click VISUALIZATIONS
2. In Plot Type, Select Pie
3. In the Value Column, select “Sentiment”
4. Check “Aggregate Data”
5. In Group by select “Sentiment”
6. In Aggregate Function, select Count
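
The pie chart above is a count of tweets per sentiment category, shown as shares of the total. A minimal Python sketch of that aggregation, using an invented list of sentiment labels:

```python
from collections import Counter

# Hypothetical Sentiment values, one per tweet.
sentiments = ["positive", "negative", "neutral", "positive"]

# Group by Sentiment with the Count aggregate function.
counts = Counter(sentiments)

# Each pie slice is the category's share of the total, in percent.
total = sum(counts.values())
shares = {label: count / total * 100 for label, count in counts.items()}

print(shares)  # {'positive': 50.0, 'negative': 25.0, 'neutral': 25.0}
```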
