DATA SCIENCE:
According to Glassdoor, data scientists earn an average base pay of $116,840 a year (Business
Insider).
NOTE: Is there an opportunity for everybody in Data Science and Analytics? Because Data Science
and Analytics spans all industries, there is an opportunity for anyone to be part of a Data
Science and Analytics team.
Goal: AUTOMATION
A short history of Data Science and Analytics shows how emerging needs drove the skill sets and tools
required to fulfill them.
Technology and the necessary skills allow industries to optimize the time that their tasks demand.
Analytics – the process and art of making sense of data and bringing it to bear on decision-making.
Successful use of analytics and data mining requires both an understanding of the business context
where value is to be captured, and an understanding of exactly what the data mining methods do.
The visuals show the depth of the analytics that a company could perform and how much impact
it would provide to the industry.
DATA MINING
• Descriptive statistics.
• Exploratory visualization.
• Dimensional slicing
• Hypothesis testing
• Queries
• Supervised
o Directed data mining
o The model generalizes the relationship between the input and output variables.
• Unsupervised
o Undirected data mining
o The objective of this class of data mining techniques is to find patterns in data based on
the relationship between data points themselves
• Classification Models
• Regression Models
• Clustering Models
• Anomaly Detection
• Time Series Forecasting
• Association
• Text and Sentiment Analysis
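To make the supervised and unsupervised distinction above concrete, here is a minimal Python sketch. The numbers are made up for illustration, and the helper names `fit_line` and `two_means_1d` are hypothetical; they are not part of any tool used in these notes. The first function learns from labeled input/output pairs, while the second only looks at how close unlabeled points are to each other.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: learn slope and intercept from (input, output) pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, y_bar - slope * x_bar


def two_means_1d(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups by closeness."""
    low, high = min(values), max(values)  # initial group centers
    group_a, group_b = list(values), []
    for _ in range(iterations):
        group_a = [v for v in values if abs(v - low) <= abs(v - high)]
        group_b = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_b:  # all values identical: nothing to split
            break
        low, high = mean(group_a), mean(group_b)
    return sorted(group_a), sorted(group_b)


# Made-up labeled data: each output is 2 * input + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept  # generalize to an unseen input

# Made-up unlabeled data with two visible groups.
group_a, group_b = two_means_1d([1, 2, 3, 10, 11, 12])
```

The supervised model needs the output column to learn from; the unsupervised routine never sees one and finds structure only from the distances between points.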
▪ Business Understanding
▪ Data Understanding
▪ Data Preparation
▪ Modeling
▪ Testing and Evaluation
▪ Deployment
UNIT II: DATA PREPARATION
TYPES OF DATA
Polynominal - many different string values (for example: red, green, blue, yellow)
Binominal - exactly two values (for example: true/false, yes/no)
Real - a fractional number (for example: 11.23 or -0.0001)
Integer - a whole number (for example: 23, -5, or 11,024,768).
Date_time - both date and time (for example: 23.12.2014 17:59).
Date - date without time (for example 23.12.2014).
Time - time without date (for example 17:59).
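As a rough illustration of the value types listed above, the example strings can be parsed in Python as follows. The format strings assume the day.month.year layout used in the examples, and `nominal_kind` is a hypothetical helper written only for this sketch.

```python
from datetime import datetime

# Date and time examples, assuming the day.month.year layout shown above.
date_time = datetime.strptime("23.12.2014 17:59", "%d.%m.%Y %H:%M")
date_only = datetime.strptime("23.12.2014", "%d.%m.%Y").date()
time_only = datetime.strptime("17:59", "%H:%M").time()

# Numeric examples.
real_value = float("-0.0001")  # fractional number
integer_value = int("11,024,768".replace(",", ""))  # drop thousands separators


def nominal_kind(values):
    """Describe a text column by its number of distinct values."""
    distinct = len(set(values))
    if distinct == 2:
        return "two values"
    if distinct > 2:
        return "many values"
    return "single value"
```

For instance, `nominal_kind(["yes", "no", "yes"])` reports a two-valued column, while a column of colors such as red, green, and blue is many-valued.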
Exploratory Analysis
➢ View Results.
➢ To find the basic statistics of each attribute, click Statistics.
Data preparation
➢ Instead of building filters with Add Filters, you may remove all cases with missing values by
using the condition class parameter.
o As seen in the statistics of the data, 199 cases have missing values in the Discount
attribute.
➢ Go back to Design view.
➢ Imputing Missing Data
o In the Operators tab, search for Replace Missing Values, then drag and drop it onto the line
connecting Filter Examples and the res port.
o In the Parameters tab, select the attribute filter type. Choose single if the imputation
will apply to a single attribute.
o Select the attribute to which the imputation will be applied.
o Select the imputation method under default.
o Click Run to see the result.
▪ The Discount attribute no longer has missing values.
▪ Connect the first data set or its result in the left node of the Join operator and the
other data set at the right node.
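The two ways of handling missing values described above (removing incomplete cases, or imputing a replacement value) can be sketched in plain Python. The rows below are made up, `None` stands in for a missing value, and using the average as the replacement is only one possible imputation method; this is an illustration of the idea, not a reproduction of what the operators do internally.

```python
from statistics import mean

# Made-up rows; None marks a missing Discount value.
rows = [
    {"order": 1, "Discount": 0.10},
    {"order": 2, "Discount": None},
    {"order": 3, "Discount": 0.30},
]

# Option 1: remove every case that has any missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Option 2: impute, replacing each missing Discount with the average
# of the known Discount values.
known = [r["Discount"] for r in rows if r["Discount"] is not None]
average_discount = mean(known)
imputed_rows = [
    {**r, "Discount": average_discount if r["Discount"] is None else r["Discount"]}
    for r in rows
]
```

Removing cases shrinks the data set, while imputing keeps every case at the cost of introducing estimated values.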
- Bar Graph
- Line Graph
- Pie Graph
- Histogram
- Scatterplot
- Boxplot
- Heatmap
The HIVStages.xlsx
Reported Symptoms
Missing values indicate the symptom was not present before and after ART.
Click Visualizations
BAR GRAPH - used to compare counts, percentages, or other measures (such as averages) across
discrete categories of data
1. Click Visualizations
2. Click Plot Type
3. Click X-Axis Column and transfer CD4Count1 to Selected Attributes
4. Check Aggregate Data, set Group by to STAGE, and use the AVERAGE aggregate
function
5. If you click Axis Style, you can:
a. Check REVERSE AXIS (To rearrange the x-axis categories)
b. Further customization of the title, axes range, font, etc. may be done on your own.
6. Interpret
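The aggregation behind the bar graph steps above (group by stage, then average CD4Count1) can be sketched as follows. The records are made-up values, the column name "Stage" is an assumption about how the grouping attribute is spelled, and `average_by_group` is a hypothetical helper; real values would come from the data set itself.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only.
records = [
    {"Stage": "I", "CD4Count1": 500},
    {"Stage": "I", "CD4Count1": 450},
    {"Stage": "II", "CD4Count1": 300},
    {"Stage": "II", "CD4Count1": 350},
]


def average_by_group(rows, group_key, value_key):
    """Group rows by one attribute and average another within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


bar_heights = average_by_group(records, "Stage", "CD4Count1")

# Rough text rendering of the bars (one '#' per 50 units).
for stage, avg in bar_heights.items():
    print(f"{stage:>3} | {'#' * int(avg // 50)} {avg}")
```

Each bar height is the group's average, which is what the Aggregate Data option computes before plotting.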
1. With the bar graph you created, click Value Columns and transfer CD4Count1 and CD4Count2
(because a clustered bar graph plots two variables)
2. Click Y-Axis, then click Axis Style and label the y-axis Average CD4 Count
1. With the bar graph you created, click “Display as radar chart”
PIE GRAPH – shows the relative contribution of different categories to an overall total
➢ A bar graph presents a categorical attribute, while a histogram presents a numerical attribute
➢ Bar graphs have spaces between bars, while histograms do not
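The difference between the two chart types shows up in how their inputs are counted. In this sketch (made-up values, hypothetical helper names), a bar graph counts distinct categories, while a histogram counts numbers falling into contiguous numeric bins, which is why its bars touch.

```python
from collections import Counter


def bar_counts(categories):
    """Bar graph input: one count per distinct category."""
    return dict(Counter(categories))


def histogram_counts(values, bin_width):
    """Histogram input: counts per numeric bin [start, start + bin_width)."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Made-up categorical and numerical data.
stage_bars = bar_counts(["I", "II", "I", "III"])
cd4_hist = histogram_counts([120, 180, 210, 250, 260, 390], bin_width=100)
```

Here `cd4_hist` is keyed by the lower edge of each bin, so neighboring keys describe adjacent ranges with no gap between them.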
HEAT MAPS - a graphical representation of data where the individual values contained in a matrix
(map) are represented as colors.
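As a small illustration of this definition, the sketch below maps each value in a made-up matrix to a "shade" character, standing in for a color scale. The shade palette and the `to_heatmap` helper are assumptions made only for this example.

```python
SHADES = " .:*#"  # lightest to darkest, standing in for a color scale


def to_heatmap(matrix):
    """Render each value as a shade scaled between the matrix min and max."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    last = len(SHADES) - 1
    return [
        "".join(SHADES[round((v - lo) / span * last)] for v in row)
        for row in matrix
    ]


heatmap_rows = to_heatmap([[0, 5, 10], [10, 5, 0]])
for line in heatmap_rows:
    print(line)
```

A plotting tool does the same scaling, but maps the result to a color gradient instead of characters.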
SENTIMENT
*Why do we need sentiment analysis? If we ask people directly, we are priming them. What usually
comes out of an interview is what the interviewer wants to hear. On social media, no one tells you
to post that "tweet"; it is your own choice. Posts are more natural, unstructured, and free-flowing,
which gives a better result (not necessarily a more positive response, but a truer one).
Sentiment analysis – the process of computationally identifying and categorizing opinions expressed
in a piece of text as either positive, negative, or neutral.
*Tokenize: to break the statement down into words. NLP then performs sentiment analysis by finding
each word's meaning and counting how many words are positive and how many are negative.
*This is difficult in Filipino because negative words are sometimes not used in a negative way.
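The tokenize-then-count idea described above can be sketched in a few lines of Python. This is a deliberately naive illustration: the word lists are tiny and hand-made, the function names are hypothetical, and it is not how any particular sentiment service works internally.

```python
import re

# Tiny hand-made word lists, for illustration only.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "sad"}


def tokenize(text):
    """Break a statement into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"


print(tokenize("I love this!"))
print(sentiment("I love this, it is great"))
print(sentiment("What a terrible, sad day"))
```

Because it only counts words, this approach misses context: a negative word used affectionately or sarcastically would still be counted as negative, which is the difficulty noted above.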
8. Choose “Rosette” as the Connection
9. Choose text as the Attribute Selector.
10. Check the Sentiment Score.
11. Click RUN
NIKKI'S NOTES
➢ Even if attributes such as geo-location show a question mark, they will still be captured
eventually
➢ In sentiment analysis we don't care about the names of the users; we care more about the
locations and the content of the tweets
➢ Even private accounts can be seen; even if you deactivate, the data can still be obtained and
becomes historical data
➢ We need to provide our data to a service provider as a trade-off so that they can serve
us better (so that they know what I want). The question is rather: who owns the data?
The problem in the Cambridge Analytica case was that Facebook treated the users' data as its own
and did not ask permission from the people who owned the data. When I sign off, other service
providers should no longer be able to see my data, but this is not yet happening.
Private messages cannot be seen, but posts with a hashtag can.
1. For marketing purposes, to sell your product (example: the common sentiment toward Dengvaxia)
2. To determine whether a tweet is real or fake (whether it was AI-generated or came from a
real person)
3. To trace where a certain type of tweet comes from (which location)
*We have only done the most basic form of sentiment analysis
1. Click VISUALIZATIONS
2. In Plot Type, Select Pie
3. In the Value Column, select “Sentiment”
4. Check “Aggregate Data”
5. In Group by select “Sentiment”
6. In Aggregate Function, select Count
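The pie chart steps above amount to grouping by the Sentiment attribute and counting. A minimal sketch of that aggregation, using made-up labels in place of the real Sentiment column:

```python
from collections import Counter

# Made-up labels standing in for the "Sentiment" column.
sentiments = ["pos", "neg", "pos", "neu", "pos", "neg"]

# Group by Sentiment with the Count aggregate function.
counts = Counter(sentiments)

# Each slice's share of the whole pie, in percent.
total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} tweets ({share}%)")
```

The slice sizes in the pie chart correspond to these shares of the total count.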