
COLLEGE OF ARTS AND SCIENCES

SQIT 3033

KNOWLEDGE ACQUISITION IN DECISION

MAKING

GROUP A

TITLE: MEDICAL DATA (GROUP PROJECT)

LECTURER NAME:

DR. IZWAN NIZAL MOHD SHAHARANEE

STUDENT NAME :

KANAGAMBAL D/O SUBRAMANIAM (211619)

KOGILAVANI D/O THIRUMALAIRAJAN (211292)

RENUGAA D/O S.MUTHANAGOPAL (211473)


CONTENTS

NO   TITLE                                                       PAGE
1.   1.0 INTRODUCTION                                            3-4
2.   2.0 PROBLEM STATEMENT                                       5-6
     2.1 OBJECTIVES                                              7
     2.2 JUSTIFY THE USE OF SPECIFIC SOLUTION TECHNIQUES
         OR PROBLEM SOLVING PROCEDURES IN YOUR WORK              8-9
3.   3.0 RESEARCH METHODOLOGY
     3.1 DATA MINING TECHNIQUE TO SOLVE THE PROBLEM              10-15
     3.2 STEPS INVOLVED IN SAS ENTERPRISE MINER
     3.3 ANALYSIS OF DATA                                        16-41
4.   4.0 CONCLUSION                                              42-43
5.   5.0 REFERENCE                                               44


1.0: INTRODUCTION

The medical field relates to the science and practice of medicine, including examinations that assess a person's state of physical health or fitness. Health care is essential for a long human life, and health factors are a major determinant of life span. Many factors are involved in the human life cycle, and various diseases are identified as major contributors to whether diagnosed patients live or die.

Gary Null, PhD; Carolyn Dean, MD, ND; Martin Feldman, MD; Debora Rasio, MD; and Dorothy Smith, PhD report that a group of researchers meticulously reviewed the statistical evidence and that their findings are shocking. These researchers authored a paper titled Death by Medicine that presents compelling evidence that today's system frequently causes more harm than good. This fully referenced report shows the number of people having in-hospital adverse reactions to prescribed drugs to be 2.2 million per year. The number of unnecessary antibiotics prescribed annually for viral infections is 20 million, the number of unnecessary medical and surgical procedures performed annually is 7.5 million, and the number of people exposed to unnecessary hospitalization annually is 8.9 million.

The most stunning statistic, however, is that the total number of deaths caused by conventional medicine is an astounding 783,936 per year. It is now evident that the American medical system is the leading cause of death and injury in the US. (By contrast, the number of deaths attributable to heart disease in 2001 was 699,697, while the number of deaths attributable to cancer was 553,251.) By exposing these statistics in painstaking detail, the authors provide a basis for competent and compassionate medical professionals to recognize the inadequacies of today's system and at least attempt to institute meaningful reforms.

Medicine and medicines are closely related, and both are vitally important to human life. The medical profession therefore has to be more conscious of identifying diseases at the right time and prescribing the right medicine to cure the illness. This can support a better lifestyle and a longer, more secure life.


2.0: PROBLEM STATEMENT

In this modern age, huge developments and improvements have been made, and a great deal of research has been done on medicines for illnesses in order to increase patients' life span. Researchers frequently modify medicines, always aiming for better cures and longer lives. However, unpredictable outcomes still occur: patients may live or die. Even as researchers find more alternatives for curing diseases, the percentage of people who survive is decreasing.

We are therefore going to carry out a study based on a secondary medical dataset that records various factors relating to whether patients live or die. The variables are status, death cause, age at coronary heart disease diagnosis, sex, age at start, height, weight, diastolic, systolic, MRW, smoking, age at death, cholesterol, cholesterol status, BP status, weight status and smoking status.

Status is our target variable: it indicates whether the patient is alive or dead. The second variable is death cause, categorized by the chronic diseases cancer, coronary heart disease and cerebral vascular disease, plus other and unknown causes.

The third is age at coronary heart disease diagnosis, which ranges from 32 to 90 years old. The fourth input is sex, male or female. The fifth is the age at which the illness started, from 28 to 62 years old. The sixth and seventh inputs are the patients' height and weight, in ranges of 51.5-76.5 and 67-300 respectively.

The eighth and ninth inputs are the diastolic and systolic rates. When your heart beats, it contracts and pushes blood through the arteries to the rest of your body. This force creates pressure on the arteries, called systolic blood pressure. A normal systolic blood pressure is 120 or below. A systolic blood pressure of 120-139 means you have normal blood pressure that is higher than ideal, or borderline high blood pressure; even people at this level are at greater risk of developing heart disease. A systolic blood pressure of 140 or higher, on repeated measurements, is considered hypertension, or high blood pressure. The diastolic blood pressure number, the bottom number, indicates the pressure in the arteries when the heart rests between beats. A normal diastolic blood pressure is 80 or less. A diastolic blood pressure between 80 and 89 is normal but higher than ideal. A diastolic blood pressure of 90 or higher, on repeated measurements, is considered hypertension or high blood pressure.

The tenth input is MRW, ranging from 67 to 268. The eleventh input is the patients' smoking rate. Next is age at death, from 36 to 93 years old. The next inputs are the cholesterol level and cholesterol status: the level ranges from 96 to 568, and the status is borderline, desirable or high. The next input is BP status, which is high, normal or optimal. Then the patients' weight status is classified as underweight, normal or overweight, and their smoking status as non-smoker, light, moderate, heavy or very heavy, according to the weight and smoking data.

So, we are going to use this dataset to classify each patient's status, dead or alive, using data mining techniques.


2.1: OBJECTIVES

1. To develop a classification model to determine whether people's status is alive or dead based on medical causes.

2. To identify the best model for determining the medical causes.

3. To identify the significant variables in determining which people are alive.


2.2: JUSTIFY THE USE OF SPECIFIC SOLUTION TECHNIQUES OR

PROBLEM SOLVING PROCEDURES IN YOUR WORK

Decision tree

A decision tree is a tree-shaped structure that represents a set of decisions or predictions of data trends. It is suitable for describing sequences of interrelated decisions or predicting future data trends, and it can classify entities into specific classes based on their features. Each tree consists of three types of nodes: the root node, internal nodes and terminal nodes (leaves). The topmost node is the root node, and it represents all of the rows in the dataset. Nodes with child nodes are internal nodes, while nodes without child nodes are called terminal nodes or leaves. A common algorithm for building a decision tree selects a subset of instances from the training data to construct an initial tree. The remaining training instances are then used to test the accuracy of the tree. If any instance is incorrectly classified, the instance is added to the current set of training data and the process is repeated. A main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization. Decision trees have been successfully applied to real problems, are easy to understand, and map nicely to a set of production rules.
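The project builds its tree with the SAS Enterprise Miner Tree node; as a rough stand-in, the same idea can be sketched with scikit-learn. The toy records and feature choices below are hypothetical, not the project's actual medical dataset:

```python
# Minimal decision-tree classification sketch (illustrative stand-in
# for the SAS Enterprise Miner Tree node; the toy records below are
# hypothetical, not the project's data).
from sklearn.tree import DecisionTreeClassifier

# Each row: [age_at_start, systolic, cigarettes_per_day]
X = [[35, 118, 0], [62, 160, 20], [45, 130, 5],
     [58, 152, 30], [29, 112, 0], [66, 170, 15]]
y = ["ALIVE", "DEAD", "ALIVE", "DEAD", "ALIVE", "DEAD"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted tree assigns a class to a new, hypothetical patient.
print(tree.predict([[40, 120, 0]]))
```

The shallow `max_depth` mirrors the stated goal of minimizing tree levels and nodes to maximize generalization.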

Regression

The Regression node in Enterprise Miner does either linear or logistic regression depending

upon the measurement level of the target variable. Linear regression is done if the target

variable is an interval variable. In linear regression the model predicts the mean of the target

variable at the given values of the input variables. Logistic regression is done if the target

variable is a discrete variable. In logistic regression the model predicts the probability of a

particular level(s) of the target variable at the given values of the input variables. Because the

predictions are probabilities, which are bounded by 0 and 1 and are not linear in this space,

the probabilities must be transformed in order to be adequately modelled. The most common

transformation for a binary target is the logit transformation. Probit and complementary log-

log transformations are also available in the regression node. There are three variable

selection methods available in the Regression node of Enterprise Miner. Forward first selects

the best one-variable model. Then it selects the best two variables among those that contain

the first selected variable. This process continues until it reaches the point where no

additional variables have a p-value less than the specified entry p-value.

9

Backward starts with the full model. Next, the variable that is least significant, given the

other variables, is removed from the model. This process continues until all of the remaining

variables have a p-value less than the specified stay p-value. Stepwise is a modification of the

forward selection method. The difference is that variables already in the model do not

necessarily stay there. After each variable is entered into the model, this method looks at all

the variables already included in the model and deletes any variable that is not significant at

the specified level. The process ends when none of the variables outside the model has a p-

value less than the specified entry value and every variable in the model is significant at the

specified stay value.
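The logit transformation mentioned in this section maps a probability p in (0, 1) onto the whole real line via log(p / (1 - p)), so that a linear model can be fitted to it; a minimal sketch of the transform and its inverse:

```python
# The logit transformation used for binary-target logistic regression:
# probabilities are bounded by 0 and 1 and are not linear in the
# inputs, so they are mapped to the real line before modelling.
import math

def logit(p):
    """log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """The logistic (sigmoid) function, inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))

print(logit(0.5))             # 0.0 -- even odds
print(inv_logit(logit(0.8)))  # recovers ~0.8: the two are inverses
```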

Neural Network

An artificial neural network is a network of many simple processors ("units"), each possibly

having a small amount of local memory. The units are connected by communication channels

("connections") that usually carry numeric (as opposed to symbolic) data encoded by various

means. The units operate only on their local data and on the inputs they receive via the

connections. The restriction to local operations is often relaxed during training. More

specifically, neural networks are a class of flexible, nonlinear regression models, discriminant

models, and data reduction models that are interconnected in a nonlinear dynamic system.

Neural networks are useful tools for interrogating increasing volumes of data and for learning

from examples to find patterns in data. By detecting complex nonlinear relationships in data,

neural networks can help make accurate predictions about real-world problems.
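A multilayer feed-forward network like the one described can be sketched with scikit-learn's MLPClassifier; the single hidden layer of three neurons mirrors the Enterprise Miner default mentioned later in this report, and the toy OR-style data is purely illustrative:

```python
# A feed-forward network with one hidden layer of three neurons,
# sketched with scikit-learn (illustrative stand-in for the
# Enterprise Miner Neural Network node; toy data, not the project's).
from sklearn.neural_network import MLPClassifier

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 10
y = [0, 1, 1, 1] * 10   # a simple OR-style binary target

net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000,
                    random_state=1)
net.fit(X, y)
print(net.predict([[0.0, 0.0], [1.0, 1.0]]))
```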


3.0: RESEARCH METHODOLOGY

In this project, our group gathered information about patients who are alive or dead with different types of medical causes. The causes include cancer, cerebral vascular disease, coronary heart disease, others and unknown.

After gathering the data, we process it using the KDD process. The KDD process is a method used to digest or find hidden information in a database; it helps to convert unknown or hidden patterns into a useful, understandable and informative form. The KDD process has 5 stages: selection, preprocessing, transformation, data mining, and interpretation & evaluation.

a) Selection

Data selection is defined as the process of determining the appropriate data

type and source, as well as suitable instruments to collect data. Data selection

precedes the actual practice of data collection. The data is obtained from the

UCI Machine Learning Repository.

b) Pre-processing

Data pre-processing is a data mining technique that involves transforming raw data

into an understandable format. Real-world data is often incomplete, inconsistent,

and/or lacking in certain behaviours or trends, and is likely to contain many errors.

Data pre-processing is a proven method of resolving such issues. It can be categorized into a few methods:


Data cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Its purpose is to identify incomplete, incorrect, inaccurate or irrelevant parts of the data and then replace, modify, or delete this dirty data. Data cleaning can handle incomplete, noisy and inconsistent data.

o Incomplete data means missing values, which occur due to improper data collection methods. During data collection, no value may be recorded for a certain attribute, leaving the data incomplete. To overcome this problem, we can fill in the mean value, estimate the probable value using regression, use a constant value, or ignore the missing record.

o Noisy data is random error or variance in the data. This happens due to corrupted data transmission or technological limitations. While entering data into software such as SPSS or SAS, we may key in wrong values, producing noisy data. To solve this problem, we can use the binning method or an outlier-removal method.

o Inconsistent data means the data contains replicated or possibly redundant records. The method to overcome this problem is removing the redundant or replicated data.
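The outlier-removal method named above can be sketched in plain Python. The document does not specify a criterion, so the common 1.5 x IQR (interquartile-range) rule used here is an assumption, and the blood-pressure readings are hypothetical:

```python
# Outlier removal for noisy data using the interquartile-range (IQR)
# rule -- one common criterion; the 1.5*IQR cutoff is an assumption,
# as the report names the method but not a specific rule.
def remove_outliers(values):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # rough quartile positions
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical systolic readings; 300 is a likely data-entry error.
systolic = [118, 122, 130, 126, 128, 124, 300]
print(remove_outliers(systolic))   # the 300 reading is dropped
```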

Data integration involves combining data residing in different sources and providing users with a unified view of these data. Data comes from different sources with different naming standards, which causes inconsistencies and redundancies. There are several ways to handle this problem:

- Consolidate the different sources into one repository (using metadata).
- Correlation analysis (measure the strength of the relationship between different attributes).

Data reduction is the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered, and simplified form. The basic concept is the reduction of multitudinous amounts of data down to the meaningful parts. This increases efficiency by reducing a huge data set into a smaller representation. Several techniques can be used in data reduction, such as data cube aggregation, dimension reduction, data compression and discretization.


In our medical data, we used data integration, in which we eliminated the unrelated variables from our data.

c) Transformation

The transformation process, also known as data normalization, basically re-scales the data into a suitable range. This process is important because it can increase processing speed and reduce memory allocation. There are several transformation methods:

- Decimal Scaling
- Min-Max
- Z-Score
- Logarithmic Normalization

We chose Min-Max normalization for our data. Min-Max normalization is a linear transformation of the original input into a newly specified range. The formula used is:

new_value = (value - min) / (max - min) x (new_max - new_min) + new_min
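Min-Max normalization, as described above, can be sketched in code; the weight values below are hypothetical, chosen to match the 67-300 range quoted in the problem statement:

```python
# Min-Max normalization: linearly rescales each value of the original
# input into a newly specified range [new_min, new_max].
def min_max(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

# Hypothetical weights spanning the 67-300 range from the data description.
weights = [67, 120, 180, 300]
print(min_max(weights))   # 67 maps to 0.0, 300 maps to 1.0
```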


d) Data Mining

Data mining is the use of algorithms to extract information and patterns as part of the KDD process. This step applies algorithms to the transformed data to generate the desired results. In this project, we use SAS Enterprise Miner to build the comparison models. Within SAS Enterprise Miner, we use the decision tree method described in Section 2.2: a tree-shaped structure of root, internal and terminal nodes that classifies entities into specific classes based on their features, built by growing an initial tree on a subset of the training instances and refining it with the instances it misclassifies. The resulting trees are usually quite understandable and can easily be used to obtain a better understanding of the phenomenon in question.


DATA VARIABLES

Of the data variables listed in the problem statement, our target is to predict Status.

e) Interpretation & evaluation process

In the interpretation & evaluation process, certain data mining output is in a format that is not human-understandable, and we need interpretation for better understanding. So, we convert the output into an easily understood medium.


3.2 STEPS INVOLVED IN SAS ENTERPRISE MINER

I. First, we open SAS Enterprise Miner, click File, and create a new project. From the SAS menu bar select File > New > Project. Then we name our project GroupProject.

II. Click CREATE. The GroupProject project opens an initial untitled diagram, which we name Data.


The SAS Enterprise Miner window contains the following interface components:

- Project Navigator - enables you to manage projects and diagrams, add tools to the Diagram Workspace, and view HTML reports that are created by the Reporter node. Note that when a tool is added to the Diagram Workspace, the tool is referred to as a node. The Project Navigator has three tabs:

  o Diagrams tab - lists the current project and the diagrams within the project. By default, the project window opens with the Diagrams tab activated.

  o Tools tab - contains the Enterprise Miner tools palette. This tab enables you to see all of the tools (or nodes) that are available in Enterprise Miner. The tools are grouped according to the SEMMA data-mining methodology. Many of the commonly used tools are shown on the Tools Bar at the top of the window. You can add additional tools to the Tools Bar by dragging them from the Tools tab onto the Tools Bar. In addition, you can rearrange the tools on the Tools Bar by dragging each tool to a new location on the Tools Bar.

  o Reports tab - displays the HTML reports that are generated by using the Reporter node.

- Diagram Workspace - enables you to build, edit, run, and save process flow diagrams.

- Tools Bar - contains a customizable subset of Enterprise Miner tools that are commonly used to build process flow diagrams in the Diagram Workspace. You can add or delete tools from the Tools Bar.

- Progress Indicator - displays a progress indicator bar that indicates the execution status of an Enterprise Miner task.

- Message Panel - displays messages about the execution of an Enterprise Miner task.

- Connection Status Indicator - displays the remote host name and indicates whether the connection is active for a client/server project.

III. The Sample Nodes that we used in our Group Project:

a) Input Data Source

The Input Data Source node reads data sources and defines their attributes for later

processing by Enterprise Miner. This node can perform various tasks:

- Access SAS data sets and data marts. Data marts can be defined by using the SAS Data Warehouse Administrator, and they can be set up for Enterprise Miner by using the Enterprise Miner Warehouse Add-ins.

- Automatically create a metadata sample for each variable when you import a data set with the Input Data Source node. By default, Enterprise Miner obtains the metadata sample by taking a random sample of 2,000 observations from the data set that is identified in the Input Data Source. Optionally, you can request larger samples. If the data is smaller than 2,000 observations, the entire data set is used.

- Use the metadata sample to set initial values for the measurement level and the model role for each variable. You can change these values if you are not satisfied with the automatic selections that are made by the node.

- Display summary statistics for interval and class variables.

- Define target profiles for each target in the input data set.


b) Data Partition

The Data Partition node enables you to partition data sets into training, test, and validation

data sets. The training data set is used for preliminary model fitting. The validation data set is

used to monitor and tune the model weights during estimation and is also used for model

assessment. The test data set is an additional data set that you can use for model assessment.

This node uses simple random sampling, stratified random sampling, or a user-defined

partition to create training, test, or validation data sets. Specify a user-defined partition if you

have determined which observations should be assigned to the training, validation, or test

data sets. This assignment is identified by a categorical variable that is in the raw data set.

c) Replacement

The Replacement node enables you to impute values for observations that have missing values. You can replace missing values for interval variables with the mean, median, midrange, mid-minimum spacing, or distribution-based replacement, or you can use a replacement M-estimator such as Tukey's biweight, Huber's, or Andrews' wave. You can also estimate the replacement values for each interval input by using a tree-based imputation method. Missing values for class variables can be replaced with the most frequently occurring value, distribution-based replacement, tree-based imputation, or a constant.
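The two simplest strategies the Replacement node offers, mean replacement for interval variables and most-frequent-level replacement for class variables, can be sketched in plain Python (None marks a missing value; the sample values are hypothetical):

```python
# Missing-value replacement as the Replacement node describes:
# interval variables get the sample mean, class variables the most
# frequently occurring non-missing level (plain-Python sketch).
from collections import Counter
from statistics import mean

def impute_interval(values):
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def impute_class(values):
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(impute_interval([120, None, 130, 110]))          # None -> the mean
print(impute_class(["High", None, "High", "Normal"]))  # None -> "High"
```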


d) Transform Variables

The Transform Variables node enables you to transform variables. For example, you can

transform variables by taking the square root of a variable, by taking the natural logarithm,

maximizing the correlation with the target, or normalizing a variable. Additionally, the node

supports user-defined formulas for transformations and enables you to group interval-valued

variables into buckets or quantiles. This node also automatically places interval variables into

buckets by using a decision tree-based algorithm. Transforming variables to similar scale and

variability may improve the fit of models and, subsequently, the classification and prediction

precision of fitted models.

e) Regression

The Regression node enables you to fit both linear and logistic regression models to your data. You can use both continuous and discrete variables as inputs. The node supports the stepwise, forward, and backward-selection methods. A point-and-click interaction builder enables you to create higher-order modelling terms.


f) Decision Tree

The Tree node enables you to perform multiway splitting of your database, based on nominal, ordinal, and continuous variables. This is the SAS implementation of decision trees, which represents a hybrid of the best of the CHAID, CART, and C4.5 algorithms. The node supports both automatic and interactive training. When you run the Tree node in automatic mode, it automatically ranks the input variables by the strength of their contribution to the tree. This ranking can be used to select variables for use in subsequent modelling. In addition, dummy variables can be generated for use in subsequent modelling. Using interactive training, you can override any automatic step by defining a splitting rule or by pruning a node or subtree.

g) Neural Network

The Neural Network node enables you to construct, train, and validate multilayer feed-

forward neural networks. By default, the Neural Network node automatically constructs a

multilayer feed-forward network that has one hidden layer consisting of three neurons. In

general, each input is fully connected to the first hidden layer, each hidden layer is fully

connected to the next hidden layer, and the last hidden layer is fully connected to the output.

The Neural Network node supports many variations of this general form.


h) Assessment

The Assessment node provides a common framework for comparing models and predictions

from any of the modelling nodes (Regression, Tree, Neural Network, and User Defined

Model nodes). The comparison is based on the expected and actual profits or losses that

would result from implementing the model. The node produces the following charts that help

to describe the usefulness of the model: lift, profit, return on investment, receiver operating

curves, diagnostic charts, and threshold-based charts. The Reporter node assembles the

results from a process flow analysis into an HTML report that can be viewed with a Web

browser. Each report contains header information, an image of the process flow diagram, and

a separate report for each node in the flow including node settings and results. Reports are

managed in the Reports tab of the Project Navigator.

IV) After that, we drag and drop an Input Data Source node onto the workspace. We use the Input Data Source node to access the medical data sets. After opening the data source node, we choose the data set EMDATA.MEDICAL.


Next, we continue by setting the model roles in the Input Data Source. We assign Status as the target; it becomes a target variable whose measurement level is binary. Cholesterol status, BP status, weight status and smoking status are measured as nominal. The unwanted variables Deathcause, Ageatdeath and Agechodiag will be rejected, as they do not contribute to the model.


Using the Cursor

The shape of the cursor changes depending on where it is positioned. The behavior of the

mouse commands depends on the shape of the cursor as well as on the selection state of the

node over which the cursor is positioned. Right-click in an open area to see the pop-up menu

as shown below. You can connect the node where the cursor is positioned (beginning node)

to any other node (ending node) as follows:

i. Ensure that the beginning node is deselected. It is much easier to drag a line when the

node is deselected. If the beginning node is selected, click in an open area of the

workspace to deselect it.

ii. Position the cursor on the edge of the icon that represents the beginning node (until

the cross-hair appears).

iii. Press the left mouse button and immediately begin to drag in the direction of the

ending node. Note: If you do not begin dragging immediately after pressing the left

mouse button, you will only select the node. Dragging a selected node will generally

result in moving the node (that is, no line will form).

iv. Release the mouse button after you reach the edge of the icon that represents the

ending node.

v. Click away from the arrow. Initially, the connection will appear as follows. After you

click away from the line in an open area of the workspace, the finished arrow forms.


Select View distribution to see the distribution of values for Status in the metadata sample. A distribution is shown.

We need to investigate the number of levels, the percentage of missing values, and the sort order of each variable. For the binary target, Status has two levels, one of which is the target event. Close the Input Data Source node, and save changes when you are prompted.


V) After that, we use the Data Partition node to partition the MEDICAL data set into training, test and validation sets. The training data is used for preliminary model fitting, the validation data is used to tune model weights during estimation, and the test data is used for model assessment. We set the percentage of training data to 70% and test data to 30%, and leave validation empty.
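The 70%/30% partition by simple random sampling that this step configures can be sketched in plain Python (fixed seed for reproducibility; the integer rows stand in for patient records):

```python
# A 70%/30% training/test partition by simple random sampling, as set
# in the Data Partition node (plain-Python sketch with a fixed seed;
# the integer rows stand in for patient records).
import random

def partition(rows, train_frac=0.70, seed=42):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))
train, test = partition(rows)
print(len(train), len(test))   # 70 30
```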


VI) Add a Regression node. Then connect the Data Partition node to the Regression node.

The Regression node fits models for interval, ordinal, nominal, and binary targets. Since we selected a binary variable (Status) as the target in the Input Data Source node (EMDATA.MEDICALDATA), the Regression node will fit (by default) a binary logistic regression model with main effects for each input variable.


Stepwise here refers to a modification of the forward selection method. The difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value.

The regression process can instead be run with the method set to Forward. This first selects the best one-variable model, then the best two variables among those that contain the first selected variable. The process continues until no additional variables have a p-value less than the specified entry p-value.

Different from the others, the Backward method starts with the full model. Next, the variable that is least significant, given the other variables, is removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value.


By default, the node uses Deviation coding for categorical input variables. Right-click the Regression node and select Run. When the run is complete, click Yes to view the results. The Estimates tab in the Regression Results Browser displays bar charts of effect T-scores and parameter estimates.

      VARIABLE NAME              EFFECT T-SCORE
  X1  AgeAtStart                  19.9558
  X2  Smoking                      9.3465
  X3  Systolic                     5.4062
  X4  Sex=Female                  -4.8960
  X5  Height                      -4.1090
  X6  Weight                       3.6369
  X7  MRW                         -3.5919
  X8  Cholesterol                  1.9340
  X9  Intercept (Status=DEAD)      1.7756
  X10 Diastolic                    1.3250


The T-scores are plotted (from left to right) in decreasing order of their absolute values. The higher the absolute value, the more important the variable is in the regression model. In this data, the variables X1 = AgeAtStart, X2 = Smoking and X3 = Systolic are the most important model predictors.

Next, right-click the Regression node in the Diagram Workspace and select Model Manager. In the Model Manager, select Tools > Lift Chart. A cumulative % Response chart appears. By default, this chart arranges observations into deciles based on their predicted probability of response, and then plots the actual percentage of respondents.

In the lift chart, the individuals are sorted in descending order of their predicted probability of the target event (Status = DEAD). The plotted values are the cumulative actual probabilities of the event. If the model is useful, the proportion of individuals with the event will be relatively high in the top deciles and the plotted curve will be decreasing. In this case, the default regression is not useful. Applying a default regression model directly to the training data set is not appropriate here, because regression models ignore observations that have a missing value for at least one input variable. We should consider performing imputation before fitting a regression model. In Enterprise Miner, we can use the Replacement node to perform imputation.


IV. Add an Insight node and connect it to the Data Partition node.

Run the flow from the Insight node by right-clicking the Insight node and selecting Run.

Select Yes when you are prompted to see the results. An output is shown below


VI) Add a Replacement node. This node allows you to replace missing values for each variable. This replacement is necessary to use all of the observations in the training data set when you build a regression or neural network model. By default, Enterprise Miner uses a sample from the training data set to select the replacement values. The following statements are true:

- Observations that have a missing value for an interval variable have the missing value replaced with the mean of the sample for the corresponding variable.

- Observations that have a missing value for a binary, nominal, or ordinal variable have the missing value replaced with the most commonly occurring non-missing level of the corresponding variable in the sample.
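Those two replacement rules amount to mean imputation for interval variables and mode imputation for class variables. A minimal stand-alone sketch (hypothetical helper name, not a SAS API):

```python
from collections import Counter

def impute_column(values, level):
    """Replace None with the sample mean (interval variables) or with the
    most common non-missing level (binary/nominal/ordinal variables)."""
    observed = [v for v in values if v is not None]
    if level == "interval":
        fill = sum(observed) / len(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

print(impute_column([1.0, None, 3.0], "interval"))      # → [1.0, 2.0, 3.0]
print(impute_column(["F", None, "F", "M"], "nominal"))  # → ['F', 'F', 'F', 'M']
```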

VII) Performing Variable Transformation

After you have viewed the results in Insight, it will be clear that some input variables

have highly skewed distributions. In highly skewed distributions, a small percentage of

the points may have a great deal of influence. Sometimes, performing a transformation on

an input variable can yield a better-fitting model. To do this, we add a Transform Variables node, as shown below.


After connecting the node, open the node by right-clicking on it and selecting Open. The

Variables tab is shown by default, which displays statistics for the interval level variables that

include the mean, standard deviation, skewness, and kurtosis (calculated from the metadata

sample).
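The effect a transformation has on skewness can be checked directly. The sketch below applies a log transform to synthetic right-skewed data (assumed for illustration, not drawn from the medical data set):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # heavily right-skewed

print(stats.skew(x))          # large positive skewness
print(stats.skew(np.log(x)))  # close to zero after the transform
```

This is why a log (or similar) transform on a skewed input can reduce the influence of the few extreme points and yield a better-fitting model.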

Open the Regression node. The Variables tab is active by default. Change the status of all

input variables except the M_ variables to don't use. Close the Regression node and save the

model. Run the flow from the Assessment node and select Yes to view the results. Create a

lift chart for the stepwise regression model.

VIII) Add a default Tree node, connect the Data Partition node to the Tree node, and then

connect the Tree node to the Assessment node. Decision trees handle missing values

directly, while regression and neural network models ignore all incomplete

observations (observations that have a missing value for one or more input variables).
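One simple way a tree can handle a missing value at a split is sketched below: route the incomplete observation down the branch that received more of the complete ones. This is an illustrative default, not SAS's exact surrogate-rule logic.

```python
def split_with_missing(rows, feature, threshold):
    """Split rows on feature <= threshold; rows missing that feature
    follow whichever branch ended up larger."""
    left = [r for r in rows if r.get(feature) is not None and r[feature] <= threshold]
    right = [r for r in rows if r.get(feature) is not None and r[feature] > threshold]
    missing = [r for r in rows if r.get(feature) is None]
    (left if len(left) >= len(right) else right).extend(missing)
    return left, right

rows = [{"Systolic": 120}, {"Systolic": 150}, {"Systolic": 118}, {"Systolic": None}]
left, right = split_with_missing(rows, "Systolic", 140)
print(len(left), len(right))  # → 3 1
```

Either way, the incomplete observation still reaches a leaf and receives a prediction, which is why no rows are discarded.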

The flow should now appear like the following flow.


Results for the tree with two branches:

Lift Chart


Competing Splits View:

Results for the tree with three branches:


Lift Chart

Competing Splits View:


X. Add a default Neural Network node. Then, connect the Neural Network node to the

Assessment node.

Run the flow from the Neural Network node. Select Yes when you are prompted to view the

results. The default Neural Network node fits a multilayer perceptron (MLP) model with no

direct connections, and the number of hidden layers is data dependent. In this case, the Neural

Network node fitted an MLP model with a single hidden layer. By default, the Tables tab in

the Neural Network Results Browser displays various statistics of the fitted model. Click the

Weights tab. The Weights tab displays the weights (parameter estimates) of the connections.

The following display shows the weights of the connections from each variable to the single

hidden layer. Each level of each status variable is also connected to the hidden layer. The

Neural Network node iteratively adjusts the weights of the connections to minimize the error

function.
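The structure being described, a single hidden layer with weighted connections from every input, can be written out as a forward pass. The weights and activation functions below are illustrative assumptions, not the ones SAS estimated:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer MLP: tanh hidden units, logistic output unit.
    Training iteratively adjusts W1, b1, W2, b2 to minimise the error."""
    h = np.tanh(W1 @ x + b1)                      # hidden-layer activations
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # predicted probability

rng = np.random.default_rng(2)
x = rng.normal(size=10)                 # e.g. 10 standardised inputs
W1, b1 = rng.normal(size=(3, 10)), rng.normal(size=3)   # 3 hidden units
W2, b2 = rng.normal(size=3), rng.normal()
print(mlp_forward(x, W1, b1, W2, b2))   # a probability in (0, 1)
```

The entries of W1 and W2 correspond to the connection weights shown in the Weights tab.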


The table below shows the neural network results for the medical data:


Weights:

There are 43 weight estimates in the results. From the SAS output, the largest weight is 2.39012, for variable 43 (H12).


Conclusion


By referring to the Assessment node and the Regression node, the misclassification rates in the training and testing data sets are as follows:

Tool                          Training    Testing
Decision Tree (2 branches)    0.24273     0.26807
Decision Tree (3 branches)    0.24495     0.28534
Neural Network                0.24602     0.25655
Regression (Stepwise)         0.25534     0.25019
Regression (Forward)          0.25535     0.25016
Regression (Backward)         0.25151     0.25784
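The misclassification rate being compared here is simply the fraction of observations whose predicted class disagrees with the actual one; a minimal sketch:

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class label is wrong."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

print(misclassification_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # → 0.25
```

Accuracy is its complement (1 minus the misclassification rate), so the model with the lowest testing misclassification rate also has the highest testing accuracy.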

For the selected model, the misclassification rate in the training data set is 0.25535 and in the testing data set is 0.25016.

The forward-selection regression is classified as the best model because it has the lowest misclassification rate on the testing data set compared with the other models; equivalently, it has the highest accuracy on the testing data set.


References

http://en.wikipedia.org/wiki/Medicine

http://www.medicinenet.com/script/main/hp.asp

http://www.saedsayad.com/decision_tree.htm

http://www.sas.com/technologies/analytics/datamining/miner/
