
Business Analytics Post Mid

1. Decision Tree
Pattern: the combination of attribute-test outcomes followed from the root node to arrive at a final class label.

Attributes: A decision tree includes many input variables that may have an impact on
the classification of different patterns. These input variables are usually called
attributes.

Class labels: They are the possible outputs of the classification, i.e., the end results.

Branch: A branch represents the outcomes of a test to classify a pattern using one
of the attributes.

Leaf Node: A leaf node at the end represents the final class choice for a pattern (a
chain of branches from the root node to a leaf node can be represented as a
complex if-then statement).

The basic idea behind a decision tree is that it recursively divides a training
set until each division consists entirely or primarily of examples from one class.
Each non-leaf node of the tree contains a split point, which is a test on one or more
attributes and determines how the data are to be divided further. Decision tree
algorithms, in general, build an initial tree from the training data such that each
leaf node is pure, and they then prune the tree to increase its generalisation, and
hence, the prediction accuracy on test data.
In the growth phase, the tree is built by recursively dividing the data until
each division is either pure (i.e., contains members of the same class) or relatively
small. The basic idea is to ask questions whose answers would provide the most
information.
The split used to partition data depends on the type of the attribute used in
the split.

Many different algorithms have been proposed for creating decision trees.
These algorithms differ primarily in the way they select the attribute (and its
split values), the order in which attributes are split (splitting the same attribute only
once or many times), the number of splits at each node (binary versus ternary), the
stopping criteria, and the pruning of the tree.
When building a decision tree, the goal at each node is to determine the
attribute and the split point of that attribute that best divides the training records in
order to purify the class representation at that node.
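As a minimal illustrative sketch (not part of the original notes), the following Python snippet grows and prints a decision tree with scikit-learn; the iris data set, the Gini criterion, and max_depth=3 are all assumptions chosen for the example:

```python
# Minimal decision tree sketch using scikit-learn; the data set and
# parameters (Gini criterion, max_depth=3) are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# At each node the algorithm selects the attribute and split point that
# best purify the class representation (measured here by Gini impurity).
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

# Each root-to-leaf path prints as a chain of if-then tests on attributes.
print(export_text(clf, feature_names=iris.feature_names))
```

Limiting max_depth here stands in for the pruning step described above: it keeps leaves from becoming so specific that generalisation suffers.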

2. Clustering
Clustering partitions a collection of things (e.g., objects, events, etc.,
presented in a structured data set) into segments (or natural groupings) whose
members share similar characteristics. Unlike in classification, in clustering
the class labels are unknown. As the selected algorithm goes through the data
set and identifies the commonalities of things based on their characteristics,
the clusters are established. Because the clusters are determined using a
heuristic-type algorithm, and because different algorithms may end up with
different sets of clusters for the same data set, before the results of clustering
are put to actual use it may be necessary for an expert to interpret, and
potentially modify, the suggested clusters. After reasonable clusters have been
identified, they can be used to classify and interpret new data.
Not surprisingly, clustering techniques include optimization. The goal
of clustering is to create groups so that the members within each group have
maximum similarity and the members across groups have minimum
similarity. The most commonly used clustering techniques include k-means
(from statistics).

Clustering can be used in segmenting customers and directing appropriate
marketing products to the segments at the right time in the right format at the
right place. Cluster analysis is also used to identify natural groupings of
events or objects so that a common set of characteristics of these groups can
be identified to describe them.
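A minimal k-means sketch in Python (scikit-learn); the synthetic two-dimensional data and the choice of k = 3 are assumptions made for illustration:

```python
# Illustrative k-means sketch; synthetic data and k=3 are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic "natural groupings" of 2-D points around 0, 5, and 10.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# k-means maximizes within-cluster similarity by minimizing each
# member's distance to its cluster centroid.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])      # cluster memberships: the labels were unknown
print(km.cluster_centers_)  # centroids describing each segment
```

Note that k must be supplied up front; in practice an expert would inspect (and possibly modify) the resulting clusters before putting them to use, as described above.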

3. Interpretation of regression output


https://www.statisticshowto.datasciencecentral.com/excel-regression-analysis-output-explained/
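As a hedged sketch of what such output contains, the following Python/statsmodels snippet produces a regression table with the usual quantities (R-squared, coefficients, standard errors, t-statistics, p-values); the data are synthetic and the true coefficients are assumptions of the example:

```python
# Illustrative regression-output sketch (statsmodels); data are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)  # true intercept 2, slope 3

X = sm.add_constant(x)        # add the intercept column
model = sm.OLS(y, X).fit()

# summary() mirrors Excel's regression output: R-squared (goodness of
# fit), coefficient estimates with standard errors, t-stats, and p-values.
print(model.summary())
```

Reading the output: the fitted coefficients should land near 2 and 3, R-squared tells how much of the variance in y the model explains, and a small p-value on a coefficient indicates it differs significantly from zero.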

4. Data Mining
Data mining is a process that uses statistical, mathematical, and artificial
intelligence techniques to extract and identify useful information and
subsequent knowledge from large sets of data. The patterns extracted can be in the
form of business rules, affinities, correlations, trends, etc.
a. Types of Data
Data refers to a collection of facts usually obtained as the result of
experiences, observations, or experiments. Data may consist of numbers, letters,
words, images, voice recordings, and so on, as measurements of a set of variables.

Data are often viewed as the lowest level of abstraction from which information
and then knowledge are derived. Data can be classified as structured,
unstructured, or semi-structured.

Unstructured/semi-structured data is composed of any combination of
textual, imagery, voice, and web content.

Structured data is what data mining algorithms use and it can be classified as
categorical or numeric data.

1. Categorical data: It represents the labels of multiple classes used to divide a
variable into specific groups. Categorical data may also be called
discrete data, implying that it represents a finite number of values with no
continuum between them, e.g., race, sex, age group, and educational level.
2. Nominal data: It contains measurements of simple codes assigned to objects
as labels, which are not measurements. Nominal data can be represented
with binomial values (having two possible values) or multinomial values
(having three or more possible values).
3. Ordinal data: It contains codes assigned to objects or events as labels that
also represent the rank order among them, e.g., credit score categories such as
low, medium, and high. Ordinal data can be represented with ordinal values
having three or more possible values.
4. Numeric data: It represents the numeric values of a specific variable. Numeric
data may also be called continuous data, implying that the variable
contains measures on a specific scale that allows the insertion of interim values.
5. Interval data: These are variables that can be measured on an interval scale, e.g., temperature in degrees Celsius, where differences are meaningful but there is no true zero.
6. Ratio data: It includes measurement variables commonly found in the
physical sciences and engineering, e.g., mass, length, and time, which are measured on a scale with a true zero point.
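A small pandas sketch (not from the notes; the columns and values are made up) showing how nominal, ordinal, and numeric variables are typically represented:

```python
# Illustrative sketch of data types in pandas; columns are assumptions.
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],                 # nominal, binomial
    "education": ["high school", "bachelor", "phd", "bachelor"],
    "temperature_c": [21.5, 19.0, 25.3, 22.1],   # numeric, interval scale
    "income": [42000, 58000, 91000, 67000],      # numeric, ratio scale
})

# Ordinal: the categories carry a rank order among them.
df["education"] = pd.Categorical(
    df["education"],
    categories=["high school", "bachelor", "phd"],
    ordered=True,
)

print(df.dtypes)
print(df["education"].min())  # rank order makes comparisons meaningful
```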

b. Applications of data mining


c. Patterns in data mining
Data mining builds models to identify patterns among the attributes
presented in the data set. Models are the mathematical representations that
identify the patterns among the attributes of the objects described in the data
set.

Data mining seeks to identify four major types of patterns


1. Associations find the commonly co-occurring groupings of things, e.g.,
beer and diapers going together in market-basket analysis (see the sketch after this list).
2. Predictions tell the nature of future occurrences of certain events
based on what has happened in the past.
3. Clusters identify natural groupings of things based on their known
characteristics, such as assigning customers to different segments
based on their demographics and past purchase behaviour.
4. Sequential relationships: Discover time ordered events, such as
predicting that an existing banking customer who already has a
checking account will open a savings account followed by an
investment account within a year.
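As a minimal sketch (not from the notes) of the association pattern above, the following Python snippet computes the support and confidence of the beer-diapers pairing over a handful of made-up baskets:

```python
# Illustrative market-basket sketch; the transactions are made up.
from collections import Counter
from itertools import combinations

transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
]
n = len(transactions)

# Count how often each pair of items co-occurs in a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: fraction of baskets containing both items.
# Confidence of beer -> diapers: support(pair) / support(beer).
pair_support = pair_counts[("beer", "diapers")] / n
beer_support = sum("beer" in b for b in transactions) / n
print(f"support(beer, diapers) = {pair_support:.2f}")                     # 0.40
print(f"confidence(beer -> diapers) = {pair_support / beer_support:.2f}") # 0.67
```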

These types of patterns have been extracted manually from
data by humans for centuries, but the increasing volume of data in
modern times has created a need for automatic approaches. As data
sets have grown in size and complexity, manual analysis has increasingly
been augmented with automated processing tools and algorithms.

The manifestation of this evolution of automatic and semi-automatic
means of processing large data sets is what we call data mining.

5. Data warehousing
A data warehouse is a pool of data produced to support decision making; it is also a
repository of current and historical data of potential interest to managers
throughout the organisation.

A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile
collection of data in support of management’s decision-making process.

a. Database structures
3-tier architecture of a data warehouse

Bottom Tier (data tier)

● Data is collected from operational databases and external sources. Five types
of operations are performed on the collected data: extract, clean, transform,
load, and refresh (see the ETL sketch below). This makes a data warehouse.
● The bottom tier also contains a metadata repository.
● The metadata repository and the data warehouse further produce data marts
with proper monitoring and administration.
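A minimal, hypothetical sketch of the extract-clean-transform-load steps in Python; the source format, column names, and cleaning rules are all assumptions of the example:

```python
# Hypothetical ETL sketch; source format, columns, and rules are assumptions.
import csv
import io
import sqlite3

RAW = """customer_id,amount
C001, 19.95
C002,
 C003 ,42.5
"""

def extract(src):
    """Extract: read raw rows from an operational source (CSV text here)."""
    return list(csv.DictReader(io.StringIO(src)))

def clean_and_transform(rows):
    """Clean: drop incomplete records. Transform: standardize types/format."""
    out = []
    for r in rows:
        if not (r.get("customer_id") or "").strip() or not (r.get("amount") or "").strip():
            continue                                # clean: discard bad rows
        out.append((r["customer_id"].strip(),       # transform: trim, cast,
                    round(float(r["amount"]), 2)))  # and round to 2 decimals
    return out

def load(rows, conn):
    """Load: write the conformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(clean_and_transform(extract(RAW)), warehouse)
print(warehouse.execute("SELECT * FROM sales").fetchall())
```

A refresh operation would rerun this pipeline periodically to propagate updates from the operational sources into the warehouse.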

Middle Tier:
● The middle tier consists of the OLAP (online analytical processing) server.
ROLAP (relational OLAP) can also be used.

Top Tier: (Client tier)


● This tier is the front-end client layer. The top tier consists of the tools and APIs
(application programming interfaces) used to connect to and get data out of
the data warehouse.
● This layer holds the query tools and reporting tools, analysis tools and data
mining tools. Various visualisation and reporting tools are used to get data
out to the user.
2-tier architecture of a data warehouse

In a 2-tier architecture, the application is directly connected to the data source layer
without any intermediate layer.

An example of the two-tier architecture would be storing patient-related data in the
database and retrieving patient information when required.

Client Application – Client tier

The client tier is the front-end application that the client uses to get data out of the
data warehouse or data tier. In the application tier, code is written for saving data
to, or getting data out of, the database. Usually, data warehouse reports are embedded
in the GUI to show the required reports.

Database – Data tier

The database or data tier is where the actual data is stored. Various ETL processes are
used to load data into the database or data warehouse.

b. Data warehouse types


Alternative data warehouse architectures

1.Independent Data Mart Architecture

Source systems → ETL → Staging area → Independent data marts (atomic/summarized data) → End-user access and applications

The data marts are developed to operate independently of each other to serve the
needs of individual organizational units. Because of their independence they may
have inconsistent data definitions and different dimensions and measures, making
it difficult to analyse data across the data marts.

2. Data mart bus architecture

Source systems → ETL → Staging area → Dimensionalized data marts linked by conformed dimensions (atomic/summarized data) → End-user access and applications

This architecture is a viable alternative to the independent data marts where the
individual marts are linked to each other via some kind of middleware. Because the
data are linked among the individual marts, there is a better chance of maintaining
data consistency across the enterprise.

3. Hub-and-spoke architecture (star schema)

Source systems → ETL → Staging area → Normalized relational warehouse (atomic data) → Dependent data marts (summarized/some atomic data) → End-user access and applications

This is perhaps the most famous data warehouse architecture today. Here the
attention is focused on building a scalable and maintainable infrastructure that
includes a centralised data warehouse and several dependent data marts. This
architecture allows customization of user interfaces and reports. On the negative
side, this architecture may lead to data redundancy and data latency.
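A hedged sketch of what the hub's star schema might look like, expressed as SQL DDL run through Python's sqlite3; all table and column names are illustrative assumptions:

```python
# Illustrative star schema sketch via sqlite3; all names are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes (the "spokes").
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);

-- The fact table holds atomic measures keyed by the dimensions (the "hub").
CREATE TABLE fact_sales (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    date_id     INTEGER REFERENCES dim_date(date_id),
    quantity    INTEGER,
    amount      REAL
);

-- A dependent data mart could be a summarized view over the central warehouse.
CREATE VIEW mart_sales_by_month AS
SELECT d.year, d.month, SUM(f.amount) AS total_amount
FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year, d.month;
""")
```

In a real deployment the dependent marts would typically be materialized copies rather than views, which is where the data redundancy and latency mentioned above come from.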

4. Centralized data warehouse

Source systems → ETL → Staging area → Normalized relational warehouse (atomic/some summarized data) → End-user access and applications

The centralized data warehouse architecture is similar to the hub-and-spoke


architecture except that there are no dependent data marts; instead, there is a
gigantic enterprise data warehouse that serves the needs of all organisational units.
This centralised design provides users with access to all data in the data warehouse
instead of limiting them to data marts. In addition, it reduces the amount of data the
technical team has to transfer or change, therefore simplifying data management
and administration. If designed and implemented properly, this architecture
provides a timely and holistic view of the enterprise to whoever needs it, wherever
and whenever they may be within the organization.

5. Federated data warehouse

Existing data warehouses, data marts, and legacy systems → Data mapping/metadata → Logical/physical integration of common data elements → End-user access and applications

It uses all possible means to integrate analytical resources from multiple
sources to meet changing needs or business conditions. Essentially, the federated
approach involves integrating disparate systems. In a federated architecture,
existing decision-support structures are left in place, and data are accessed from
the sources as needed. The federated approach is supported by middleware vendors
that provide distributed query and join capabilities. These eXtensible Markup
Language (XML)-based tools offer users a global view of distributed data sources,
including data warehouses, data marts, websites, documents, and operational
systems. When users choose query objects from this view and press the submit
button, the tool automatically queries the distributed sources, joins the results, and
presents them to the user.

6. Market Basket Analysis
7. BI Structure / Architecture

Data Warehouse Environment


In this stage the technical staff collects the required data from several data sources.
This data is then organised, summarized, and standardized to build a data
warehouse.

Business Analytics Environment


The business analytics environment consists of a set of tools that help
manipulate and analyse the data in the data warehouse for the required results
(data mining, text mining tools, SQL, etc. are used here).

Performance and strategy environment


In this environment the data is used by managers or executives to take decisions. It
is also used for business performance management (BPM) which focuses on
improving the corporate performance by improving the business processes.

User Interface
This is the medium through which the analysed data is displayed to the end users.
The analysed data can be projected on portals, browsers, dashboards, etc.

8. Importance of BI
● Convert business knowledge via analytics to solve issues such as low
response rates from direct-mail and Internet-delivered marketing
campaigns.
● Identify profitable customers and the underlying reasons for those
customers' loyalty, as well as future customers with comparable if
not greater potential.
● Analyze clickstream data to improve e-commerce strategy.
● Detect warranty-reported problems to minimize the impact of product
design deficiencies.
● Analyse potential growth customers' profitability and reduce risk
exposure through more accurate credit scoring of customers.
● Determine what combination of products and service lines customers
are likely to purchase and when.
● Set more profitable rates for insurance premiums.
● Reduce equipment downtime by applying predictive maintenance.
● Determine, with attrition and churn analysis, why customers leave
and/or become customers of competitors.
● Detect fraudulent behaviour from usage spikes, e.g., in case of theft.

9. Various Keys
Types of Keys
A database supports the following types of keys.
● Super Key
● Minimal Super Key
● Candidate Key
● Primary Key
● Unique Key
● Alternate Key
● Composite Key
● Foreign Key
● Natural Key
● Surrogate Key
For a better understanding of the keys, we take two tables: the first table is
“Branch_Info” and the second table is “Student_Information”.

Now let us look at each key.
Candidate Key
A candidate key is an attribute or set of attributes that uniquely identifies a record.
Among the set of candidate keys, one candidate key is chosen as the primary key. So a
table can have multiple candidate keys, but each table can have at most one
primary key.
Example:
Possible Candidate Keys in Branch_Info table.
1. Branch_Id
2. Branch_Name
3. Branch_Code

Possible Candidate keys in Student_Information table.
1. Student_Id
2. College_Id
3. Rtu_Roll_No
Primary Key
A primary key uniquely identifies each record in a table and must never be the
same for any two records. A primary key is a set of one or more fields (columns) of a
table that uniquely identifies a record in a database table. A table can have only one
primary key, and one of the candidate keys is selected as the primary key. The primary
key should be chosen such that its attributes are never or rarely changed; for example,
we cannot select the Student_Id field as a primary key because in some cases a
student's Student_Id may be changed.
Example:
Primary Key in Branch_Info table:
1. Branch_Id
Primary Key in Student_Information Table:
1. College_Id

Alternate Key:
Alternate keys are candidate keys that are not selected as the primary key. An alternate
key can also work as a primary key. An alternate key is also called a “secondary key”.
Example:
Alternate Key in Branch_Info table:
1. Branch_Name
2. Branch_Code
Alternate Key in Student_Information table:
1. Student_Id

2. Rtu_Roll_No
Unique Key:
A unique key is a set of one or more attributes that can be used to uniquely identify
the records in a table. A unique key is similar to a primary key, but a unique key field
can contain a “NULL” value, whereas a primary key does not allow “NULL” values.
Another difference (in SQL Server) is that the primary key field carries a clustered
index by default, whereas a unique key field carries a non-clustered index.
Example:
Possible Unique Key in Branch_Info table.
1. Branch_Name
Possible Unique Key in Student_Information table:
1. Rtu_Roll_No
Composite Key:
A composite key is a combination of more than one attribute that can be used to
uniquely identify each record. It is also known as a “compound” key. A composite
key may be a candidate or primary key.
Example:
Composite Key in Branch_Info table:
1. { Branch_Name, Branch_Code }
Composite Key in Student_Information table:
1. { Student_Id, Student_Name }
Super Key
A super key is a set of one or more attributes that can be used to uniquely
identify a record in a table. A super key for an entity is a set of one or more
attributes whose combined value uniquely identifies the entity in the entity set.
Primary keys, unique keys, and alternate keys are all subsets of super keys. A
super key may also be non-minimal, that is to say, it may contain additional columns
not strictly required to ensure uniqueness of the row. A super key can consist of a
single column.
Example:
Super Keys in Branch_Info Table.
1. Branch_Id
2. Branch_Name
3. Branch_Code
4. { Branch_Id, Branch_Code }
5. { Branch_Name , Branch_Code }
Super Keys in Student_Information Table:
1. Student_Id
2. College_Id
3. Rtu_Roll_No
4. { Student_Id, Student_Name}
5. { College_Id, Branch_Id }
6. { Rtu_Roll_No, Session }
Minimal Super Key:
A minimal super key is a minimum set of columns that can be used to uniquely
identify a row. In other words, it is the minimum number of columns that can be
combined to give a unique value for every row in the table.
Example:
Minimal Super Keys in Branch_Info Table.
1. Branch_Id
2. Branch_Name
3. Branch_Code
Minimal Super Keys in Student_Information Table:
1. Student_Id
2. College_Id
3. Rtu_Roll_No

Natural Keys:
A natural key is a key composed of columns that actually have a logical
relationship to other columns within a table. For example, if we use the Student_Id,
Student_Name, and Father_Name columns to form a key, it would be a “natural
key” because there is definitely a relationship between these columns and the other
columns that exist in the table. Natural keys are often called “business keys” or
“domain keys”.

Surrogate Key:
A surrogate key is an artificial key that is used to uniquely identify a record in a
table. For example, SQL Server and Sybase database systems contain an artificial
key known as “Identity”. Surrogate keys are just simple sequential numbers, and
they are used only to act as a primary key.
Example:
Branch_Id is a Surrogate Key in Branch_Info table and Student_Id is a Surrogate
key of Student_Information table.
Foreign Keys:
A foreign key is used to create a relationship between tables. A foreign key is
a field in a database table that is the primary key in another table. A foreign key can
accept NULL and duplicate values.
Example:
Branch_Id is a foreign key in the Student_Information table; it exists as the primary
key in the Branch_Info (Branch_Id) table.
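A minimal sqlite3 sketch (the column sets are simplified from the notes' tables) showing how primary, unique, composite, and foreign keys appear in DDL:

```python
# Illustrative sketch of key constraints via sqlite3; columns simplified.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if enabled
conn.executescript("""
CREATE TABLE Branch_Info (
    Branch_Id   INTEGER PRIMARY KEY,      -- primary (surrogate) key
    Branch_Name TEXT UNIQUE,              -- unique key (an alternate candidate key)
    Branch_Code TEXT,
    UNIQUE (Branch_Name, Branch_Code)     -- composite key
);
CREATE TABLE Student_Information (
    College_Id  INTEGER PRIMARY KEY,      -- chosen from the candidate keys
    Student_Id  INTEGER,
    Rtu_Roll_No TEXT UNIQUE,
    Branch_Id   INTEGER REFERENCES Branch_Info(Branch_Id)  -- foreign key
);
""")

conn.execute("INSERT INTO Branch_Info VALUES (1, 'CSE', 'C01')")
conn.execute("INSERT INTO Student_Information VALUES (101, 501, 'RTU-001', 1)")
# Inserting a row with Branch_Id = 99 would now fail, because no
# Branch_Info row carries that primary key value.
```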

In layman's terms: https://www.dotnettricks.com/learn/sqlserver/different-types-of-sql-keys
