
Data Analytics for Executives

Inspektorat Jenderal
Kementerian Keuangan
September 2019

Denny, Ph.D.
Short Bio

Education

Professional Experiences

Data science, Visual Analytics, Big Data Analytics, Large software architecture and development, competitive programming
Case Study: GOJEK
◼ Big Data Volume and Velocity:
◼ Around 100 GB of data from over 18 products flows through our data systems each hour.

◼ Value:
◼ Provide the best experience to customers

◼ Objective:
◼ Help customers meet their GOJEK driver without a single call

3
Gojek: Pickup Points

4
Gojek: Clustering Pickup Points

5
Gojek: Clustering Pickup Points

6
Naming Frequent Places of Interest?
7
Automatic naming from booking-text data

8
Challenge?

Use technology to disrupt illegal or fraudulent activities

9
DATA SCIENCE / ANALYTICS

10
Data Science, Data Mining, Big Data

◼ Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

◼ Data science is the same concept as data mining and big data: "use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems"

◼ Data science = Data mining = Analytics

11
Data Mining

◼ Data mining is the analysis of data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

◼ Cost-effective, innovative forms of information processing for enhanced insight and decision making

12
Data Mining vs Database

◼ Database systems store and manage data
◼ Queries return part of the stored data
◼ Queries do not extract hidden patterns

◼ Examples of querying databases
◼ Find all employees with income more than $250K
◼ Find top-spending customers in the last month
◼ Find all students from the engineering college with an above-average GPA
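To make the contrast concrete, here is a minimal sketch of the first example query, using Python's built-in `sqlite3` module. The table name, column names, and rows are hypothetical and only serve the illustration. The query hands back rows that were already stored; it does not discover anything new about them.

```python
import sqlite3

# Hypothetical in-memory table; names and figures are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, income REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ani", 300_000), ("Budi", 120_000), ("Citra", 260_000)],
)

# A query returns the stored rows that satisfy a condition, nothing more.
high_earners = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM employees WHERE income > 250000 ORDER BY name"
    )
]
print(high_earners)
```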

13
Business Value of Analytics (ATO)
◼ Fraud detection
◼ Identify high-risk refunds
• Previous practice: simple business rules based on experience
• Total claimed investment deductions > $N
• Ratio of self education deductions to total income > N
• Total international transfers > N times taxable income
• Luxury vehicle purchase $M > N times taxable income
• Use modelling
• Regression, decision trees, random forests
• Increase tax revenue

◼ Identify aggressive tax planning

◼ Assess levels of debt: propensity and capacity to pay
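The experience-based rules for high-risk refunds listed above could be sketched roughly as follows. This is only an illustration: the slide leaves the thresholds N and M unspecified, so every threshold value, the `TaxReturn` fields, and the function name here are assumptions.

```python
from dataclasses import dataclass

# Placeholder thresholds: the slide does not give values for N or M.
MAX_INVESTMENT_DEDUCTIONS = 50_000.0
MAX_SELF_EDUCATION_RATIO = 0.2
MAX_TRANSFER_MULTIPLE = 3.0


@dataclass
class TaxReturn:
    """Hypothetical subset of fields needed by the rules."""
    taxable_income: float
    total_income: float
    investment_deductions: float
    self_education_deductions: float
    international_transfers: float


def triggered_rules(r: TaxReturn) -> list:
    """Return the names of the simple business rules that a return triggers."""
    flags = []
    if r.investment_deductions > MAX_INVESTMENT_DEDUCTIONS:
        flags.append("investment_deductions")
    if r.total_income > 0:
        ratio = r.self_education_deductions / r.total_income
        if ratio > MAX_SELF_EDUCATION_RATIO:
            flags.append("self_education_ratio")
    if r.international_transfers > MAX_TRANSFER_MULTIPLE * r.taxable_income:
        flags.append("international_transfers")
    return flags


example = TaxReturn(
    taxable_income=80_000,
    total_income=100_000,
    investment_deductions=60_000,
    self_education_deductions=5_000,
    international_transfers=400_000,
)
print(triggered_rules(example))
```

Rules like these are easy to explain but depend on hand-picked thresholds, which is the motivation for moving to fitted models.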

14
DATA MINING CAPABILITY

16
Market Basket Analysis, Association
Rule Mining

17
Classification
◼ Predict the target class (categorical) for each case in the data

◼ Examples:
◼ Customer credit rating (high risk vs low risk)
◼ Identify tourists who will violate their terms
◼ Tax case selection (SPT)

◼ Needs training data
◼ Records that have been labelled as ‘positive’ and ‘negative’
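As a minimal sketch of learning from labelled records, the snippet below classifies a new case by majority vote among its nearest labelled neighbours (k-nearest neighbours). The choice of algorithm, the two features, and all values are assumptions made for illustration; the slides do not prescribe a method.

```python
import math
from collections import Counter

# Hypothetical labelled training records: (features, label).
# Features are (debt_to_income_ratio, late_payments_last_year).
training = [
    ((0.1, 0), "low risk"),
    ((0.2, 1), "low risk"),
    ((0.3, 0), "low risk"),
    ((0.8, 4), "high risk"),
    ((0.9, 5), "high risk"),
    ((0.7, 3), "high risk"),
]


def classify(features, k=3):
    """Predict the majority label among the k nearest training records."""
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((0.85, 4)))
print(classify((0.15, 0)))
```

In practice the features would be scaled first, since the raw distance here is dominated by whichever feature has the larger range.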

18
Classification

19
Tax Case Selection

◼ Backed up by extensive scientific research

◼ Combination of
◼ Expert judgement: includes factors and issues that are not included in the analytical model, thus improving the overall precision of the selection decision
• Hidden context

◼ Actuarial prediction: gives case selection staff the probability of a case being a ‘true positive’ rather than a ‘false alarm’
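One simple way to picture the actuarial side is an empirical hit rate: among past cases in the same risk band, what share turned out to be true positives? The sketch below assumes a hypothetical audit history with two bands; a real model would produce a probability per case rather than per band.

```python
from collections import defaultdict

# Hypothetical audit history: (risk band, did the audit find non-compliance?).
history = [
    ("high", True), ("high", True), ("high", False), ("high", True),
    ("low", False), ("low", False), ("low", True), ("low", False),
]


def hit_rate_by_band(records):
    """Estimate P(true positive | band) as the share of past audits that found an issue."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for band, found_issue in records:
        totals[band] += 1
        hits[band] += int(found_issue)
    return {band: hits[band] / totals[band] for band in totals}


rates = hit_rate_by_band(history)
print(rates)
```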

20
Regression
◼ Predict a continuous value for each case in the data
◼ Example:
◼ Estimate the value of adjustments
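A minimal sketch of regression is a least-squares line fitted to past cases and then used to estimate a value for a new case. The data below is hypothetical, and a single predictor is assumed purely to keep the example short.

```python
from statistics import mean

# Hypothetical past cases: declared income vs. adjustment value found on audit.
declared_income = [100.0, 200.0, 300.0, 400.0]
adjustment = [12.0, 21.0, 32.0, 39.0]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(declared_income, adjustment)

# Estimate the adjustment for a new case with declared income 250.
predicted = slope * 250.0 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```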

21
Cluster Analysis

◼ Group data to form new classes


◼ Examples
◼ cluster houses to find distribution patterns

◼ Understanding data – data exploration
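The idea of grouping unlabelled data can be sketched with plain k-means on 2-D points. The points and the hand-picked starting centroids are hypothetical, and k-means is only one of many clustering methods; the slides do not say which algorithm was used in any of the examples.

```python
import math

# Hypothetical 2-D points (for example, coordinates) forming two separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]


def k_means(data, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in data:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Starting centroids are picked by hand here; tools usually choose them randomly.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

No labels are supplied: the two groups emerge from the distances alone, which is what makes clustering useful for data exploration.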

22
Cluster Analysis and Visualizations
[Figure: component planes for Employee Market, lodge through e-tax, and Salary wage percentage]

◼ Challenging to identify hot spots using component planes in high-dimensional datasets

23
Clustering: Identifying Hot Spots / Outliers
[Figure: distance matrix visualization]

◼ The interesting hot spot is located in the bottom right corner of the map.

[Figure: component planes for Count of Debt Cases and Count of Debt Cases Paid]
Outlier Analysis
[Figure, panels A, B, C: component planes for Count of Debt Cases, Count of Debt Cases Paid, and SEIFA, with distance matrix visualization]

25
Time Series

26
Data Mining Should Not Be Used Blindly
◼ Data mining finds regularities in history, but history is not the same as the future.
◼ Concept drift
◼ Population drift

◼ Association implies neither trend nor causality.
◼ Observational vs experimental data

◼ Some abnormal data could be caused by humans.
◼ Noise, bias
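One rough way to watch for population drift is to compare a feature's recent values with the values seen when the model was trained. The sketch below uses a standardized difference in means. The data and the threshold of 2 standard deviations are hypothetical, and this is only one of several possible checks.

```python
from statistics import mean, stdev

# Hypothetical values of one feature in the training period and in recent data.
training_period = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
recent_period = [13.0, 14.0, 12.5, 13.5, 14.5, 13.0]


def standardized_shift(reference, current):
    """Difference in means, expressed in standard deviations of the reference sample."""
    return (mean(current) - mean(reference)) / stdev(reference)


shift = standardized_shift(training_period, recent_period)

# The threshold of 2 standard deviations is an arbitrary illustrative choice.
drift_suspected = abs(shift) > 2.0
print(round(shift, 2), drift_suspected)
```

A large shift does not prove the model is wrong, but it is a signal that the history it learned from may no longer resemble current cases.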

27
Data Scientist

28
Challenges

◼ Data science needs skilled people, not off-the-shelf solutions (yet)
◼ Plus staff who are competent in statistics and econometrics

◼ Shortage of skilled data scientists
◼ Not only know how to use the tools, but also understand the underlying mechanism
◼ Dedicated assignment?

◼ Data matching / record linkage – link datasets


29
Data Matching – Record Linkage

◼ Internal Kemenkeu data (silo)

◼ Taxation data
◼ Population data
◼ Banking data
◼ Property data
◼ Motor vehicle ownership data
◼ E-commerce sales transaction data
◼ Financial transaction data (including international)
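Linking records across such datasets usually starts with a normalised matching key. The sketch below links two hypothetical silos on a cleaned-up name plus birth date. The field names, the records, and the choice of key are all assumptions; real record linkage typically also needs fuzzy matching to cope with spelling variants.

```python
import re

# Hypothetical records from two silos; field names and values are made up.
tax_records = [
    {"id": "T1", "name": "Budi  Santoso", "birth_date": "1980-05-17"},
    {"id": "T2", "name": "Siti Aminah", "birth_date": "1975-11-02"},
]
vehicle_records = [
    {"id": "V9", "name": "BUDI SANTOSO.", "birth_date": "1980-05-17"},
    {"id": "V7", "name": "Siti Aminah", "birth_date": "1990-01-01"},
]


def linkage_key(record):
    """Normalise the name (case, punctuation, whitespace) and pair it with the birth date."""
    name = re.sub(r"[^a-z ]", "", record["name"].lower())
    name = " ".join(name.split())
    return (name, record["birth_date"])


# Index one dataset by key, then look up each record of the other dataset.
index = {linkage_key(r): r["id"] for r in tax_records}
links = [
    (index[linkage_key(v)], v["id"])
    for v in vehicle_records
    if linkage_key(v) in index
]
print(links)
```

The second pair shares a name but not a birth date, so it is not linked; this shows why a name alone is a weak key.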

30
DATA MINING PROCESS

31
Data Mining Process: CRISP-DM

32
Business Understanding
◼ Understand what you want to accomplish from a business perspective
◼ Unwise to commit to data science without assessing its value
◼ Weigh the expected value “lift” from enhanced insight and decision making against the total cost of operations

◼ Value:
◼ increase revenue
◼ decrease cost
◼ improve the customer experience
◼ reduce risks and increase compliance
◼ increase productivity

◼ Role: Business Leader


33
Business Understanding
◼ Validate against the hype, evaluate organizational fitness
◼ Factors:
◼ feasibility
◼ reasonability
◼ value
◼ integrability
◼ sustainability

34
Data Understanding
◼ Acquire the data listed in the project resources
◼ Describe data, explore data, verify data quality

◼ Data Management

◼ Data Dictionary

35
Data Understanding

36
CRISP-DM: Data Preparation
◼ Data cleaning and preprocessing
◼ May take 60% of effort!

◼ Data integration, reduction, and projection

◼ SLDK: Sistem Layanan Data Kemenkeu (Kemenkeu Data Service System)
◼ Needs to be enforced

◼ Data sharing via flash disk – not recommended
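To give a feel for why cleaning and preprocessing take so much effort, here is a minimal sketch that trims identifiers, parses formatted amounts, marks missing values, and removes duplicates. The field names and rows are hypothetical.

```python
# Hypothetical raw rows, as they might arrive from a source system.
raw_rows = [
    {"taxpayer_id": " 01.234.567.8 ", "amount": "1,500"},
    {"taxpayer_id": "01.234.567.8", "amount": "1,500"},  # duplicate once cleaned
    {"taxpayer_id": "09.876.543.2", "amount": ""},       # missing amount
]


def clean(row):
    """Trim the identifier, strip separators, and parse the amount (None if missing)."""
    taxpayer_id = row["taxpayer_id"].strip().replace(".", "")
    amount_text = row["amount"].replace(",", "")
    amount = float(amount_text) if amount_text else None
    return (taxpayer_id, amount)


# dict.fromkeys keeps the first occurrence of each cleaned row, in order.
cleaned = list(dict.fromkeys(clean(r) for r in raw_rows))
print(cleaned)
```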
37
Modelling

◼ Choose the data mining function and algorithm


◼ summarization and visualization,
◼ classification,
◼ regression,
◼ association,
◼ clustering,
◼ outlier analysis.

◼ Use software tools

38
Evaluation

◼ Evaluate the model based on some accuracy criteria

◼ Needs skill in cost-effectiveness analysis
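Common accuracy criteria for a classification model can be computed from a comparison of predictions against actual outcomes. The sketch below uses hypothetical data and three standard measures (accuracy, precision, recall); which criterion matters most depends on the relative cost of missed cases and false alarms.

```python
# Hypothetical model predictions vs. actual audit outcomes (True = non-compliant).
actual = [True, True, False, False, True, False, False, True]
predicted = [True, False, False, True, True, False, False, True]

pairs = list(zip(actual, predicted))
tp = sum(a and p for a, p in pairs)              # flagged and truly non-compliant
fp = sum((not a) and p for a, p in pairs)        # flagged but compliant (false alarm)
fn = sum(a and (not p) for a, p in pairs)        # missed non-compliant case
tn = sum((not a) and (not p) for a, p in pairs)  # correctly left alone

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # share of flagged cases that were real
recall = tp / (tp + fn)     # share of real cases that were flagged
print(accuracy, precision, recall)
```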

39
Deployment

◼ Deploy to operations

◼ Analytical models need to be updated periodically to keep current with the latest frauds, abuses, and other patterns of non-compliance

40
THANK YOU

41
