1. Decision Tree
Pattern: a pattern is classified by the combination of branches followed from the root node to a final label.
Attributes: A decision tree involves many input variables that may have an impact on
the classification of different patterns. These input variables are usually called
attributes.
Class labels: the possible outputs of the classification, i.e., the end results.
Branch: A branch represents one outcome of a test used to classify a pattern by one
of the attributes.
Leaf Node: A leaf node at the end of the tree represents the final class choice for a
pattern (the chain of branches from the root node to a leaf node can be represented
as a complex if-then statement).
By Sarva
The basic idea behind a decision tree is that it recursively divides a training
set until each division consists entirely or primarily of examples from one class.
Each internal node of the tree contains a split point, which is a test on one or more
attributes and determines how the data are to be divided further. Decision tree
algorithms, in general, build an initial tree from the training data such that each
leaf node is pure, and they then prune the tree to increase its generalisation and,
hence, the prediction accuracy on test data.
In the growth phase, the tree is built by recursively dividing the data until
each division is either pure (i.e., contains members of the same class) or relatively
small. The basic idea is to ask questions whose answers would provide the most
information.
The split used to partition data depends on the type of the attribute used in
the split.
Many different algorithms have been proposed for creating decision trees.
These algorithms differ primarily in the way the attribute (and its split values) is
chosen, the order of splitting the attributes (splitting the same attribute only
once or many times), the number of splits at each node (binary versus ternary), the
stopping criteria, and the pruning of the tree.
When building a decision tree, the goal at each node is to determine the
attribute, and the split point of that attribute, that best divides the training records in
order to purify the class representation at that node.
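The idea of choosing the split that best purifies the classes can be illustrated with an entropy and information-gain computation. This is a minimal sketch; the toy labels and function names are illustrative, not taken from any particular algorithm's implementation:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Reduction in entropy when `labels` is split into `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# The split that separates the classes completely has the
# highest information gain; an uninformative split gains nothing.
labels = ["yes", "yes", "no", "no"]
good_split = [["yes", "yes"], ["no", "no"]]
bad_split = [["yes", "no"], ["yes", "no"]]
print(information_gain(labels, good_split))  # 1.0
print(information_gain(labels, bad_split))   # 0.0
```

A tree-growing algorithm would compute this gain for every candidate attribute and split point at a node, then pick the one with the highest value.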
2. Clustering
Clustering partitions a collection of things (e.g., objects or events
presented in a structured data set) into segments (or natural groupings) whose
members share similar characteristics. Unlike in classification, in clustering
the class labels are unknown. As the selected algorithm goes through the data
set and identifies the commonalities of things based on their characteristics, the
clusters are established. Because the clusters are determined using a
heuristic-type algorithm, and because different algorithms may end up with
different sets of clusters for the same data set, before the results of clustering
are put to actual use it may be necessary for an expert to interpret, and
potentially modify, the suggested clusters. After reasonable clusters have been
identified, they can be used to classify and interpret new data.
Not surprisingly, clustering techniques involve optimisation. The goal
of clustering is to create groups so that the members within each group have
maximum similarity and the members across groups have minimum
similarity. The most commonly used clustering techniques include k-means
(from statistics).
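A minimal k-means sketch in plain Python, assuming two-dimensional points; the sample points are made up to show two natural groupings, and no class labels are supplied:

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned members."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                                + (p[1] - centroids[j][1]) ** 2)
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # two centroids, one near each natural grouping
```

After the groups are found, a new point can be "classified" by assigning it to its nearest centroid, which matches the use of clusters to interpret new data described above.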
Clustering can be used in segmenting customers and directing appropriate
marketing products to the segments at the right time, in the right format, at the
right place. Cluster analysis is also used to identify natural groupings of
events or objects so that a common set of characteristics of these groups can
be identified to describe them.
4. Data Mining
Data mining is a process that uses statistical, mathematical, and artificial
intelligence techniques to extract and identify useful information, and
subsequent knowledge, from large sets of data. These patterns can be in the
form of business rules, affinities, correlations, trends, etc.
a. Types of Data
Data refers to a collection of facts usually obtained as the result of
experiences, observations, or experiments. Data may consist of numbers, letters,
words, images, voice recordings, and so on, as measurements of a set of variables.
Data are often viewed as the lowest level of abstraction from which information
and then knowledge are derived. Data can be classified as structured,
unstructured, or semi-structured.
Structured data is what data mining algorithms use, and it can be classified as
categorical or numeric data.
3. Ordinal data: categorical values that also represent the rank order among
them; ordinal variables can have three or more possible values.
4. Numeric data: it represents the numeric values of a specific variable. Numeric
data may also be called continuous data, implying that the variable contains
measures on a specific scale that allows insertion of interim values.
5. Interval data: variables that can be measured on an interval scale.
6. Ratio data: measurement variables commonly found in the physical sciences
and engineering.
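The practical difference between these types shows up when preparing data for mining algorithms: an ordinal attribute can be mapped to integers that preserve its rank order, while a nominal attribute should not be and is usually one-hot encoded instead. A minimal sketch with made-up category names:

```python
# Ordinal values carry rank order, so integer codes that
# preserve that order are meaningful.
ordinal_scale = {"low": 0, "medium": 1, "high": 2}
risk = ["low", "high", "medium"]
encoded = [ordinal_scale[v] for v in risk]
print(encoded)  # [0, 2, 1]

# Nominal values carry no order, so each category becomes
# its own indicator column instead.
colours = ["red", "blue", "red"]
categories = sorted(set(colours))             # ['blue', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colours]
print(one_hot)  # [[0, 1], [1, 0], [0, 1]]
```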
The manifestation of such evolution of automatic and semi-automated
means of processing large data sets is called data mining.
5. Data warehousing
A data warehouse is a pool of data produced to support decision making; it is also a
repository of current and historical data of potential interest to managers
throughout the organisation.
a. Database structures
Three-tier architecture of a data warehouse
Bottom Tier:
● Data is collected from operational databases and external sources. Five types
of operations are performed on the collected data: extract, clean, transform,
load, and refresh. This makes a data warehouse.
● The bottom tier also contains a metadata repository.
● The metadata repository and the data warehouse further produce data marts,
with proper monitoring and administration.
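The five operations listed above can be sketched as a toy pipeline in Python. The record layout and function names here are illustrative, not those of any real ETL tool, and the refresh step, which would periodically re-run the pipeline, is omitted:

```python
def extract(sources):
    """Pull raw records from operational databases / external feeds."""
    return [row for src in sources for row in src]

def clean(rows):
    """Drop records with missing keys and strip stray whitespace."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows if r.get("id") is not None]

def transform(rows):
    """Conform the data, e.g. standardise a country code."""
    return [{**r, "country": r["country"].upper()} for r in rows]

def load(warehouse, rows):
    """Append the conformed records to the warehouse store."""
    warehouse.extend(rows)
    return warehouse

# Made-up operational data with one dirty and one invalid record.
operational_db = [{"id": 1, "country": "us "}, {"id": None, "country": "de"}]
external_feed = [{"id": 2, "country": "in"}]

warehouse = []
load(warehouse, transform(clean(extract([operational_db, external_feed]))))
print(warehouse)  # [{'id': 1, 'country': 'US'}, {'id': 2, 'country': 'IN'}]
```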
Middle Tier:
● The middle tier consists of the OLAP (online analytical processing) server.
ROLAP (relational OLAP) can also be used.
An example of the two-tier architecture would be storing patient-related data in the
database and retrieving patient information when required.
Client Application – Client Tier
The client tier is the front-end application that the client uses to get data out of the
data warehouse or data tier. On the application tier, code is written for saving data
to, or getting data out of, the database. Usually, data warehouse reports are hidden
in the GUI to show the required reports.
The database or data tier is where the actual data is stored. Various ETL processes
are used to load data into the database or data warehouse.
Figure: Independent data marts architecture. Source systems → ETL → staging
area → independent data marts (atomic/summarized data) → end-user access and
applications.
The data marts are developed to operate independently of each other to serve the
needs of individual organizational units. Because of their independence they may
have inconsistent data definitions and different dimensions and measures, making
it difficult to analyse data across the data marts.
Figure: Data mart bus architecture. Source systems → ETL → staging area →
dimensionalized data marts linked by conformed dimensions (atomic/summarized
data) → end-user access and applications.
This architecture is a viable alternative to the independent data marts where the
individual marts are linked to each other via some kind of middleware. Because the
data are linked among the individual marts, there is a better chance of maintaining
data consistency across the enterprise.
Figure: Centralized data warehouse architecture. Source systems → ETL →
staging area → normalized relational warehouse (atomic data) → end-user access
and applications.
This is perhaps the most famous data warehouse architecture today. Here the
attention is focused on building a scalable and maintainable infrastructure that
includes a centralised data warehouse and several dependent data marts. This
architecture allows customisation of user interfaces and reports. On the negative
side, this architecture may lead to data redundancy and data latency.
Figure: Federated architecture. Existing data warehouses, data marts, and legacy
systems → ETL with data mapping/metadata → logical/physical integration of
common data elements → end-user access and applications.
The federated architecture uses all possible means to integrate analytical resources
from multiple sources to meet changing needs or business conditions. Essentially,
the federated approach involves integrating disparate systems. In a federated
architecture, existing decision-support structures are left in place, and data are
accessed from the sources as needed. The federated approach is supported by
middleware vendors that provide distributed query and join capabilities. These
eXtensible Markup Language (XML)-based tools offer users a global view of
distributed data sources, including data warehouses, data marts, websites,
documents, and operational systems. When users choose query objects from this
view and press the submit button, the tool automatically queries the distributed
sources, joins the results, and presents them to the user.
6. Market Basket Analysis
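Market basket analysis finds combinations of items that co-occur in customers' transactions. The conventional measures are support (how often an itemset appears in the baskets) and confidence (how often the rule's consequent appears given its antecedent). A minimal sketch with made-up baskets:

```python
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """How often the consequent appears given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule "bread => milk": bread and milk appear together in 2 of 4
# baskets, and 2 of the 3 bread baskets also contain milk.
print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 0.666...
```

Algorithms such as Apriori build on exactly these two measures, keeping only itemsets whose support and rules whose confidence exceed chosen thresholds.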
7. BI Structure / Architecture
User Interface
This is the medium through which the analysed data is displayed to the end users.
The analysed data can be projected on portals, browsers, dashboards, etc.
8. Importance of BI
● Apply business knowledge via analytics to solve issues such as the
response rate from direct-mail and Internet-delivered marketing
campaigns.
● Identify profitable customers and the underlying reasons for those
customers' loyalty, and identify future customers with comparable, if
not greater, potential.
● Analyse clickstream data to improve e-commerce strategy.
● Detect warranty-reported problems to minimise the impact of product
design deficiencies.
● Analyse potential growth customers' profitability, and reduce risk
exposure through more accurate financial credit scoring of customers.
● Determine what combinations of products and service lines customers
are likely to purchase, and when.
● Set more profitable rates for insurance premiums.
● Reduce equipment downtime by applying predictive maintenance.
● Determine, with attrition and churn analysis, why customers leave for
competitors and/or become their customers.
● Detect fraudulent behaviour, such as usage spikes in cases of theft.
9. Various Keys
Types of Keys
Databases support the following types of keys.
● Super Key
● Minimal Super Key
● Candidate Key
● Primary Key
● Unique Key
● Alternate Key
● Composite Key
● Foreign Key
● Natural Key
● Surrogate Key
Now we take two tables for a better understanding of the keys. The first table is
“Branch_Info” and the second table is “Student_Information”.
Now we look at each key.
Candidate Key
A candidate key is an attribute, or set of attributes, that uniquely identifies a record.
Among the set of candidate keys, one candidate key is chosen as the primary key.
So a table can have multiple candidate keys, but each table can have at most one
primary key.
Example:
Possible Candidate Keys in Branch_Info table.
1. Branch_Id
2. Branch_Name
3. Branch_Code
Possible Candidate keys in Student_Information table.
1. Student_Id
2. College_Id
3. Rtu_Roll_No
Primary Key
A primary key uniquely identifies each record in a table and must never be the
same for two records. A primary key is a set of one or more fields (columns) of a
table that uniquely identify a record in a database table. A table can have only one
primary key, and one candidate key is selected as the primary key. The primary key
should be chosen such that its attributes are never, or rarely, changed; for example,
we cannot select the Student_Id field as a primary key because in some cases the
Student_Id of a student may be changed.
Example:
Primary Key in Branch_Info table:
1. Branch_Id
Primary Key in Student_Information Table:
1. College_Id
Alternate Key:
Alternate keys are the candidate keys that are not selected as the primary key. An
alternate key can also work as a primary key. An alternate key is also called a
“secondary key”.
Example:
Alternate Key in Branch_Info table:
1. Branch_Name
2. Branch_Code
Alternate Key in Student_Information table:
1. Student_Id
2. Rtu_Roll_No
Unique Key:
A unique key is a set of one or more attributes that can be used to uniquely identify
the records in a table. A unique key is similar to a primary key, but a unique key
field can contain a NULL value, whereas a primary key does not allow NULL
values. The other difference is that, in systems such as SQL Server, the primary
key field carries a clustered index while a unique field carries a non-clustered index.
Example:
Possible Unique Key in Branch_Info table.
1. Branch_Name
Possible Unique Key in Student_Information table:
1. Rtu_Roll_No
Composite Key:
A composite key is a combination of more than one attribute that can be used to
uniquely identify each record. It is also known as a “compound” key. A composite
key may be a candidate or primary key.
Example:
Composite Key in Branch_Info table:
1. { Branch_Name, Branch_Code }
Composite Key in Student_Information table:
1. { Student_Id, Student_Name }
Super Key
A super key is a set of one or more keys that can be used to uniquely identify a
record in a table. A super key for an entity is a set of one or more attributes whose
combined values uniquely identify the entity in the entity set. Primary keys, unique
keys, and alternate keys are all subsets of super keys. A super key is simply a
non-minimal candidate key, that is to say, one with additional columns not strictly
required to ensure the uniqueness of the row. A super key can also consist of a
single column.
Example:
Super Keys in Branch_Info Table.
1. Branch_Id
2. Branch_Name
3. Branch_Code
4. { Branch_Id, Branch_Code }
5. { Branch_Name , Branch_Code }
Super Keys in Student_Information Table:
1. Student_Id
2. College_Id
3. Rtu_Roll_No
4. { Student_Id, Student_Name}
5. { College_Id, Branch_Id }
6. { Rtu_Roll_No, Session }
Minimal Super Key:
A minimal super key is a minimum set of columns that can be used to uniquely
identify a row; in other words, the minimum number of columns that can be
combined to give a unique value for every row in the table.
Example:
Minimal Super Keys in Branch_Info Table.
1. Branch_Id
2. Branch_Name
3. Branch_Code
Minimal Super Keys in Student_Information Table:
1. Student_Id
2. College_Id
3. Rtu_Roll_No
Natural Keys:
A natural key is a key composed of columns that actually have a logical
relationship to other columns within a table. For example, if we use the Student_Id,
Student_Name, and Father_Name columns to form a key, then it would be a
“natural key” because there is definitely a relationship between these columns and
the other columns that exist in the table. Natural keys are often called “business
keys” or “domain keys”.
Surrogate Key:
A surrogate key is an artificial key that is used to uniquely identify a record in a
table. For example, SQL Server and Sybase database systems provide an artificial
key known as an “identity”. Surrogate keys are just simple sequential numbers,
and they are used only to act as primary keys.
Example:
Branch_Id is a Surrogate Key in Branch_Info table and Student_Id is a Surrogate
key of Student_Information table.
Foreign Keys:
A foreign key is used to create a relationship between tables. A foreign key is
a field in a database table that is the primary key of another table. A foreign key
can accept NULL and duplicate values.
Example:
Branch_Id is a foreign key in the Student_Information table whose corresponding
primary key exists in the Branch_Info (Branch_Id) table.
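The key types above can be demonstrated with Python's standard-library sqlite3 module. This is a minimal sketch: the table and column names follow the examples above, but the data rows are made up and the constraint behaviour shown is SQLite's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

con.execute("""
    CREATE TABLE Branch_Info (
        Branch_Id   INTEGER PRIMARY KEY,   -- surrogate key used as primary key
        Branch_Name TEXT UNIQUE,           -- alternate/candidate key (unique)
        Branch_Code TEXT UNIQUE
    )""")
con.execute("""
    CREATE TABLE Student_Information (
        College_Id   INTEGER PRIMARY KEY,
        Student_Id   TEXT,
        Student_Name TEXT,
        Branch_Id    INTEGER REFERENCES Branch_Info(Branch_Id),  -- foreign key
        UNIQUE (Student_Id, Student_Name)  -- composite key
    )""")

con.execute("INSERT INTO Branch_Info VALUES (1, 'CSE', 'C01')")
con.execute("INSERT INTO Student_Information VALUES (101, 'S1', 'Asha', 1)")

# A foreign key may be NULL, but any non-NULL value must match
# an existing Branch_Id in the referenced table.
con.execute("INSERT INTO Student_Information VALUES (102, 'S2', 'Ravi', NULL)")
try:
    con.execute("INSERT INTO Student_Information VALUES (103, 'S3', 'Meena', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed
```

Inserting a duplicate Branch_Name or a duplicate (Student_Id, Student_Name) pair would be rejected in the same way, which is how the unique and composite keys enforce uniqueness.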