Data Mining Approaches for Intrusion Detection
Wenke Lee and Salvatore J. Stolfo
Computer Science Department, Columbia University
Overview
Intrusion detection and computer security
Current intrusion detection approaches and problems
Our proposed approach
Data mining
Classification models for intrusion detection
Mining patterns from audit data
System architecture
Current status
Research plans
Intrusion Detection and Computer Security

Computer security goals: confidentiality, integrity, and availability
An intrusion is a set of actions aimed at compromising these security goals
Intrusion prevention (authentication, encryption, etc.) alone is not sufficient
Intrusion detection is needed
Intrusion Detection
Primary assumption: user and program activities can be monitored and modeled
Key elements:
  Resources to be protected
  Models of the normal or legitimate behavior on the resources
  Efficient methods that compare real-time activities against the models and report activities that are probably intrusive
[Figure: agent-based architecture. Components shown:
  Learning Agent: Audit Records, Audit Data Preprocessor, Inductive Learning Engine
  Base Detection Agent: Activity Data, Detection Models (Rules), (Base) Detection Engine, Evidence
  Meta Detection Agent: Evidence from Other Agents, Decision Table, (Meta) Detection Engine, Final Assertion
  Decision Engine: Action/Report]
[Figure: learning profiles from audit data]
tcpdump output:
  10:35:41.5 128.59.23.34.30 > 113.22.14.65.80: . 512:1024(512) ack 1 win 9216
  10:35:41.5 102.20.57.15.20 > 128.59.12.49.3241: . ack 1073 win 16384
  10:35:41.6 128.59.25.14.2623 > 115.35.32.89.21: . ack 2650 win 16225
Connection records (input to learning a profile):
  time      dur   src  dst  bytes  srv
  10:35:41  1.2   A    B    42     http
  10:35:41  0.5   C    D    22     user
  10:35:41  10.2  E    F    1036   ftp
  ...
truss output:
  execve(/usr/ucb/finger, open(/dev/zero mmap( ...
System call sequence (input to learning a profile):
  execve open mmap ...
Intrusion Detection
Two categories of techniques:
Misuse detection: use patterns of well-known attacks to identify intrusions
Anomaly detection: use deviations from normal usage patterns to identify intrusions
Current Intrusion Detection Approaches

Misuse detection:
  Record the specific patterns of intrusions
  Monitor current audit trails (event sequences) and match them against the patterns
  Report the matched events as intrusions
  Representation models: expert rules, Colored Petri Nets, and state transition diagrams

Anomaly detection:
  Establish the normal behavior profiles
  Observe and compare current activities with the (normal) profiles
  Report significant deviations as intrusions
  Statistical measures as behavior profiles: ordinal and categorical (binary and linear)

Main problems: manual and ad hoc
Misuse detection:
  Known intrusion patterns have to be hand-coded
  Unable to detect any new intrusions (those with no matching patterns recorded in the system)
Anomaly detection:
  Selecting the right set of system features to measure is ad hoc and based on experience
  Unable to capture sequential interrelations between events
Our Proposed Approach

A systematic framework to:
  Build good models: select appropriate features of audit data to build intrusion detection models
  Build better models: architect a hierarchical detector system that combines multiple detection models
  Build updated models: dynamically update and deploy new detection systems as needed

Support for the feature selection and model construction process:
  Apply data mining algorithms to find consistent inter- and intra-audit record (event) patterns
  Use the features and time windows in the discovered patterns to build detection models
  Provide a support environment to semi-automate this process

Combining multiple detection models:
  Each (base) detector model monitors one aspect of the system
  Base detectors can employ different techniques and be independent of each other
  The learned (meta) detector combines evidence from a number of base detectors

An intelligent agent-based architecture:
  Learning agents: continuously compute (learn) the detection models
  Detection agents: use the (updated) models to detect intrusions
Data Mining
KDD (Knowledge Discovery in Databases):
  The process of identifying valid, useful, and understandable patterns in data
  Steps: understanding the application domain, data preparation, data mining, interpretation, and utilizing the discovered knowledge
  Data mining: applying specific algorithms to extract patterns from data

Relevant data mining algorithms:
  Classification: maps a data item into one of several pre-defined categories
  Link analysis: determines relations between fields in the database
  Sequence analysis: models sequence patterns

Why is it applicable to intrusion detection?
  Normal and intrusive activities leave evidence in audit data
  From the data-centric point of view, intrusion detection is a data analysis process
  Successful applications in related domains, e.g., fraud detection and fault/alarm management
Building Classifiers for Intrusion Detection

Experiments in constructing classification models for anomaly detection
Two experiments:
  sendmail system call data
  network tcpdump data
Use a meta classifier to combine multiple classification models
Classification Models on sendmail

The data: sequences of system calls made by sendmail
Classification models (rules): describe the normal patterns of the system call sequences. The rule set is the normal profile of sendmail
Detection: calculate the deviation from the profile
  A large number or a high score of rule violations in a new trace suggests an exploit

The sendmail data:
  Each trace has two columns: the process IDs and the system call numbers
  Normal traces: sendmail and sendmail daemon
  Abnormal traces: sunsendmailcp (sscp), syslog-remote, syslog-local, decode, sm5x, and sm565a attacks

Data preprocessing:
  Use a sliding window to create sequences of consecutive system calls
  Label the sequences to create training data:
    sequences (length 7)      class labels
    4 2 66 66 4 138 66        normal
    5 5 5 4 59 105 104        abnormal
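
The preprocessing step can be sketched as follows. This is a minimal illustration rather than the authors' tooling: the function names and the toy traces are hypothetical, and the labeling rule (a window is abnormal only if it never occurs in the normal traces) is an assumption, since the slide only states that the sequences are labeled.

```python
from typing import List, Sequence, Set, Tuple

Window = Tuple[int, ...]


def sliding_windows(calls: Sequence[int], length: int = 7) -> List[Window]:
    """Return every run of `length` consecutive system calls in one trace."""
    return [tuple(calls[i:i + length]) for i in range(len(calls) - length + 1)]


def label_windows(trace: Sequence[int], normal: Set[Window],
                  length: int = 7) -> List[Tuple[Window, str]]:
    """Assumed rule: a window is 'normal' if it also occurs in the normal traces."""
    return [(w, "normal" if w in normal else "abnormal")
            for w in sliding_windows(trace, length)]


# Toy traces of system call numbers (made up for illustration).
normal_trace = [4, 2, 66, 66, 4, 138, 66, 5]
attack_trace = [4, 2, 66, 66, 4, 138, 66, 105, 104]

normal_windows = set(sliding_windows(normal_trace))
training = label_windows(attack_trace, normal_windows)
for window, label in training:
    print(window, label)
```

Each labeled window corresponds to one training record of the kind shown in the table above.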

Experiment 1 - learning patterns of normal sequences:
  Each record: n consecutive system calls plus a class label, normal or abnormal
  Training data: sequences from 80% of the normal traces plus some of the attack traces
  Testing data: traces not used in training
  Use RIPPER to learn specific rules for the minority classes
sendmail Experiment 1
Examples of output RIPPER rules:
  if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is normal
  if the 6th system call is lseek and the 7th is sigvec, then the sequence is normal
  if none of the above, then the sequence is abnormal
Using the learned rules to analyze a new trace:
  label all sequences according to the rules
  define a region as l consecutive sequences
  define an abnormal region as one having more abnormal sequences than normal ones
  calculate the percentage of abnormal regions
  the trace is abnormal if the percentage is above a threshold
Hypothesis: specific rules of normal sequences are needed to detect unknown/new intrusions
Some results using various normal vs. abnormal distributions:
  Experiment A: 46% normal, length 11
  Experiment B: 46% normal, length 7
  Experiment C: 54% normal, length 11
  Experiment D: 54% normal, length 7
All 4 experiments:
  Training data includes sequences from intrusion traces in Bold and Italic, and sequences from 80% of the normal sendmail traces
  Percentage of abnormal regions of each trace (shown in the table) is used as the intrusion indicator
  The output rule sets contain ~250 rules, each with 2 or 3 attribute tests. This compares with the total of ~1,500 different sequences.
  Experiments A and B generate rules that characterize normal sequences of length 11 and 7 respectively
  Experiments C and D generate rules that characterize abnormal sequences of length 11 and 7 respectively
Traces            Forrest et al.   A      B      C      D
sscp-1            5.2              41.9   32.2   40.0   33.1
sscp-2            5.2              40.4   30.4   37.6   33.3
sscp-3            5.2              40.4   30.4   37.6   33.3
syslog-remote-1   5.1              30.8   21.2   30.3   21.9
syslog-remote-2   1.7              27.1   15.6   26.8   16.5
syslog-local-1    4.0              16.7   11.1   17.0   13.0
syslog-local-2    5.3              19.9   15.9   19.8   15.9
decode-1          0.3              4.7    2.1    3.1    2.1
decode-2          0.3              4.4    2.0    2.5    2.2
sm565a            0.6              11.7   8.0    1.1    1.0
sm5x              2.7              17.7   6.5    5.0    3.0
sendmail          0                1.0    0.1    0.2    0.3
(unlabeled values following the table: 3.4, 1.9, 0.9, 0.7)
Anomaly detectors A and B perform better than misuse detectors C and D.

Experiment 2 - learning to predict normal system call:
  Each record: n-1 consecutive system calls plus a class label, the nth or the middle system call
  Training data: sequences from 80% of the normal traces (no abnormal traces)
  Testing data: traces not used in training
  Use RIPPER to learn rules
Examples of output RIPPER rules:
  if the 3rd system call is lstat and the 4th is write, then the 7th is stat
  if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt
  if none of the above, then the 7th is open
Using the learned rules to analyze a new trace:
  predict system calls according to the rules
  if a rule is violated, the violation score is increased by 100 times the accuracy of the rule
  the trace is abnormal if the violation score is above a threshold
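
A minimal sketch of the violation scoring described above. The `Rule` class, the accuracy values, and the first-match-wins treatment of the rule list are assumptions made for illustration; only the "100 times the accuracy" increment comes from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Rule:
    conditions: Dict[int, str]  # 1-based position -> required system call
    predicted: str              # predicted call at the target position
    accuracy: float             # rule accuracy in [0, 1] (hypothetical values below)


def violation_score(sequences: List[Sequence[str]], rules: List[Rule],
                    target_pos: int) -> float:
    """Sum 100 * accuracy over every sequence that violates its first matching rule."""
    score = 0.0
    for seq in sequences:
        for rule in rules:
            if all(seq[pos - 1] == call for pos, call in rule.conditions.items()):
                if seq[target_pos - 1] != rule.predicted:
                    score += 100.0 * rule.accuracy
                break  # assumed: only the first matching rule applies
    return score


rules = [
    Rule({3: "lstat", 4: "write"}, "stat", 0.75),
    Rule({1: "sigblock", 4: "bind"}, "setsockopt", 0.5),
    Rule({}, "open", 0.25),  # default rule: matches every sequence
]
sequences = [
    ["a", "b", "lstat", "write", "c", "d", "stat"],   # rule 1 holds
    ["a", "b", "lstat", "write", "c", "d", "open"],   # rule 1 violated
    ["x", "y", "z", "w", "c", "d", "close"],          # default rule violated
]
print(violation_score(sequences, rules, target_pos=7))
```

Comparing the resulting score against a threshold then gives the normal/abnormal decision for the trace.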
Some results:
  Experiment A: predict the 11th system call
  Experiment B: predict the middle system call in a sequence of length 7
  Experiment C: predict the middle system call in a sequence of length 11
  Experiment D: predict the 7th system call
All 4 experiments:
  Training data includes only the sequences from 80% of the normal sendmail traces
  Output rules predict what the normal nth or middle system call should be
  Score of rule violations (mismatches) of each trace (shown in the table) is used as the intrusion indicator
  The output rule sets contain ~250 rules, each with 2 or 3 attribute tests. This compares with the total of ~1,500 different sequences.
Traces            A      B      C      D
sscp-1            24.1   13.5   14.3   24.7
sscp-2            23.5   13.6   13.9   24.4
sscp-3            23.5   13.6   13.9   24.4
syslog-remote-1   19.3   11.5   13.9   24.0
syslog-remote-2   15.9   8.4    10.9   23.0
syslog-local-1    13.4   6.1    7.2    19.0
syslog-local-2    15.2   8.0    9.0    20.2
decode-1          9.4    3.9    2.4    11.3
decode-2          9.6    4.2    2.8    11.5
sm565a            14.4   8.1    9.4    20.6
sm5x              17.2   8.2    10.1   18.0
*sendmail         5.7    0.6    1.2    12.6
(unlabeled)       3.7    3.3    1.2    1.3
The 11th (A) and 4th (B) system calls are more predictable

Lessons learned:
  Normal behavior can be established and used to detect anomalous usage
  Need to collect near-complete normal data in order to build the normal model
  But how do we know when to stop collecting?
  Need tools to guide the audit data gathering process
Classification Models on tcpdump

The tcpdump data (part of a public data visualization contest):
  Packets of incoming, outgoing, and internal broadcast traffic
  One trace of normal network traffic
  Three traces of network intrusions

Data preprocessing:
  Extract the connection-level features:
    Record connection attempts
    Monitor data packets and count: # of bytes in each direction, resent rate, hole rate, etc.
    Watch how the connection is terminated
  Each record has:
    start time and duration
    participating hosts and ports (applications)
    statistics (e.g., # of bytes)
    flag: normal or a connection/termination error
    protocol: TCP or UDP
  Divide connections into 3 types: incoming, outgoing, and inter-LAN

Building a classifier for each type of connection:
  Use the destination service (port) as the class label
  Training data: 80% of the normal connections
  Testing data: 20% of the normal connections and the connections in the 3 intrusion traces
  Apply RIPPER to learn rules

The output RIPPER rules describe the normal characteristics of the destination services. The rule set is the profile of the normal network traffic.
Using the rules to analyze tcpdump traces:
  Examine each connection record according to the rules
  Calculate the percentage of misclassifications (violations of a rule). This percentage is the deviation from the profile.
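
The deviation measure above amounts to a misclassification rate over connection records. In this sketch `toy_rules` is a hypothetical stand-in for a learned RIPPER rule set, and the record fields (`bytes`, `dst_srv`) are assumed names, not the authors' schema.

```python
from typing import Callable, Dict, List


def misclassification_rate(records: List[Dict],
                           predict: Callable[[Dict], str]) -> float:
    """Percentage of records whose predicted service differs from the actual one."""
    if not records:
        return 0.0
    wrong = sum(1 for rec in records if predict(rec) != rec["dst_srv"])
    return 100.0 * wrong / len(records)


def toy_rules(rec: Dict) -> str:
    # Hypothetical rule set: small transfers are http, everything else is ftp.
    return "http" if rec["bytes"] < 100 else "ftp"


records = [
    {"bytes": 42, "dst_srv": "http"},
    {"bytes": 1036, "dst_srv": "ftp"},
    {"bytes": 22, "dst_srv": "smtp"},   # violates the toy rules
    {"bytes": 50, "dst_srv": "http"},
]
rate = misclassification_rate(records, toy_rules)
print(rate)
```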

Results - misclassification rate on each type of connections:
Connection data   Normal   Intrusion1   Intrusion2   Intrusion3
Out-going         3.91%    3.81%        4.76%        3.71%
In-coming         4.68%    6.76%        7.47%        13.7%
Inter-LAN         4%       22.65%       8.7%         7.86%
This model is not very effective in detecting intrusions

Adding temporal features for better models:
Examine all connections in the past n seconds, and compute:
  Counts: connection errors, all other errors, connections to system services, connections to user applications, and connections to the same service as the current connection
  Averages: duration and data bytes of all connections, and the same averages for connections to the same service
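
A minimal sketch of how such temporal features could be computed, assuming connection records are dictionaries sorted by a `time` field in seconds. The field and feature names are hypothetical, and only a subset of the listed statistics is shown.

```python
from statistics import mean
from typing import Dict, List


def add_temporal_features(records: List[Dict], window: float = 30.0) -> List[Dict]:
    """Extend each connection record with statistics over the preceding `window` seconds.

    `records` must be sorted by their 'time' field.
    """
    out = []
    start = 0  # index of the oldest record still inside the window
    for i, rec in enumerate(records):
        while records[start]["time"] < rec["time"] - window:
            start += 1
        recent = records[start:i]  # earlier connections inside the window
        same = [r for r in recent if r["dst_srv"] == rec["dst_srv"]]
        feat = dict(rec)
        feat["n_conn"] = len(recent)
        feat["n_err"] = sum(1 for r in recent if r["flag"] != "normal")
        feat["n_same_srv"] = len(same)
        feat["avg_dur"] = mean(r["dur"] for r in recent) if recent else 0.0
        feat["avg_dur_same_srv"] = mean(r["dur"] for r in same) if same else 0.0
        out.append(feat)
    return out


records = [
    {"time": 0.0, "dst_srv": "http", "flag": "normal", "dur": 1.0},
    {"time": 10.0, "dst_srv": "http", "flag": "error", "dur": 0.0},
    {"time": 20.0, "dst_srv": "ftp", "flag": "normal", "dur": 4.0},
    {"time": 45.0, "dst_srv": "http", "flag": "normal", "dur": 2.0},
]
feats = add_temporal_features(records, window=30.0)
print(feats[2])
```

The extended records can then be fed to the same rule learner in place of the original connection records.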

Results of adding the temporal features, the time window is 30 seconds:
Connection data   Normal   Intrusion1   Intrusion2   Intrusion3
Out-going         0.88%    2.54%        3.04%        2.32%
In-coming         0.31%    27.37%       27.42%       42.20%
Inter-LAN         1.43%    20.48%       5.63%        6.80%
Adding temporal statistical features improves the effectiveness of the detection models
Effects of time window length on misclassification rate

[Figure: misclassification rate (0 to 0.45) versus time window in seconds (0 to 100), with one curve each for normal, attack1, attack2, and attack3]
How do we obtain the optimal time window length?

Lessons learned:
  Data preprocessing requires extensive domain knowledge
  Adding temporal features improves classification accuracy
  Need tools to guide (temporal) feature selection
Building Classifiers for Intrusion Detection

Meta classifier that combines evidence from multiple detection models:
  Build base classifiers that each model one aspect of the system
  The meta learning task:
    each record has a collection of evidence from base classifiers, and a class label, normal or abnormal, on the state of the system
  Apply a learning algorithm to produce the meta classifier
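
The meta learning task can be illustrated with a deliberately simple learner. The majority-vote decision table below is an assumption chosen to keep the sketch self-contained; the slides do not say which learning algorithm produces the meta classifier, and the two base detectors are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Evidence = Tuple[str, ...]


def build_meta_records(events: Sequence[dict],
                       base_detectors: Sequence[Callable[[dict], str]],
                       labels: Sequence[str]) -> List[Tuple[Evidence, str]]:
    """One meta-level record per event: the base detectors' outputs plus the true label."""
    return [(tuple(d(e) for d in base_detectors), y) for e, y in zip(events, labels)]


def learn_decision_table(meta_records: List[Tuple[Evidence, str]]) -> Dict[Evidence, str]:
    """Map each evidence combination to its most frequent label."""
    votes = defaultdict(Counter)
    for evidence, label in meta_records:
        votes[evidence][label] += 1
    return {ev: counts.most_common(1)[0][0] for ev, counts in votes.items()}


def meta_predict(table: Dict[Evidence, str], evidence: Evidence,
                 default: str = "abnormal") -> str:
    """Unseen evidence combinations fall back to `default` (an assumed policy)."""
    return table.get(tuple(evidence), default)


# Two hypothetical base detectors, each looking at one aspect of an event.
def syscall_detector(e: dict) -> str:
    return "abnormal" if e["violation_score"] > 10 else "normal"


def network_detector(e: dict) -> str:
    return "abnormal" if e["misclass_rate"] > 5 else "normal"


events = [
    {"violation_score": 2, "misclass_rate": 1},
    {"violation_score": 20, "misclass_rate": 1},
    {"violation_score": 20, "misclass_rate": 9},
    {"violation_score": 15, "misclass_rate": 2},
]
labels = ["normal", "normal", "abnormal", "normal"]
detectors = [syscall_detector, network_detector]
table = learn_decision_table(build_meta_records(events, detectors, labels))
print(meta_predict(table, ("abnormal", "abnormal")))
```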
Mining Patterns from Audit Data

Association rules: describe multi-feature (attribute) correlations in a database
X => Y, confidence, support:
  X and Y are subsets of the attribute values in a record
  support is the percentage of records that contain X and Y
  confidence is support(X+Y)/support(X)
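
The two measures can be computed directly from their definitions. In this sketch each record is represented as a set of `attribute=value` strings, which is an assumed encoding, and the toy records are made up.

```python
from typing import Iterable, List, Set


def support(records: List[Set[str]], items: Iterable[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for rec in records if wanted <= rec) / len(records)


def confidence(records: List[Set[str]], x: Iterable[str], y: Iterable[str]) -> float:
    """support(X+Y) / support(X); 0.0 when X never occurs."""
    x, y = set(x), set(y)
    sx = support(records, x)
    return support(records, x | y) / sx if sx else 0.0


records = [
    {"cmd=trn", "arg=rec.humor"},
    {"cmd=trn", "arg=comp.ai"},
    {"cmd=ls", "arg=-l"},
    {"cmd=vi", "arg=notes.txt"},
]
print(support(records, {"cmd=trn", "arg=rec.humor"}))
print(confidence(records, {"cmd=trn"}, {"arg=rec.humor"}))
```

Here the rule `cmd=trn => arg=rec.humor` has support 0.25 (one record in four) and confidence 0.5 (one of the two `trn` records).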
Association Rules
Motivations:
  Audit data can be easily formatted into a database table
  Program executions and user activities have frequent correlations among system features
  Incremental updating of the rule set is easy
An example from the .sh_history:

trn => rec.humor, [0.3, 0.1]
Meaning: 30% of the time when using trn, the user is reading rec.humor; and reading this newsgroup constitutes 10% of all sh commands
Mining Patterns from Audit Data

Frequent episodes: frequent events occurring within a time window
X => Y, confidence, support, window:
  X and Y are subsets of the attribute values in a record
  support is the percentage of (sliding) windows that contain X and Y
  confidence is support(X+Y)/support(X)
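
A simplified sketch of window-based support. It slides a fixed-width window in unit steps from the first event to the last and ignores the order of events inside a window; both choices are assumptions made for brevity, and the event log is a toy example.

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, item)


def episode_support(events: List[Event], items: Iterable[str],
                    width: float, step: float = 1.0) -> float:
    """Fraction of sliding windows [t, t + width) that contain every item."""
    if not events:
        return 0.0
    wanted = set(items)
    t, t_last = events[0][0], events[-1][0]
    total = hits = 0
    while t <= t_last:
        in_window = {item for ts, item in events if t <= ts < t + width}
        total += 1
        if wanted <= in_window:
            hits += 1
        t += step
    return hits / total


def episode_confidence(events: List[Event], x: Iterable[str], y: Iterable[str],
                       width: float) -> float:
    """support(X+Y) / support(X) over the same set of windows."""
    x, y = set(x), set(y)
    sx = episode_support(events, x, width)
    return episode_support(events, x | y, width) / sx if sx else 0.0


events = [(0, "home"), (1, "research"), (2, "theory"), (4, "home"), (5, "research")]
print(episode_support(events, {"home", "research"}, width=3))
print(episode_confidence(events, {"home", "research"}, {"theory"}, width=3))
```

With a 3-second window, three of the six windows contain both `home` and `research`, and one of those also contains `theory`.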
Frequent Episodes
Motivation:
  Sequence information needs to be included in a detection model
An example from a department's web log:

home, research => theory, [0.2, 0.05], [30]
Meaning: 20% of the time, after the home and research pages are visited (in that order), the theory page is then visited within 30 seconds of the visit to home; and visiting these three pages constitutes 5% of all visits to the web site
Using the Mined Patterns

Guide the audit data gathering process:
  Run a program under different settings
  For each run, calculate the association rules and frequent episodes from its audit data
  Merge them into an aggregate rule set
  Stop gathering audit data when no rules can be added from a new run
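
The stopping criterion can be sketched as a simple merge loop. Rules are represented here as plain strings and compared by exact equality, which is an assumption; a fuller version might also track changes in confidence and support.

```python
from typing import Iterable, Optional, Set, Tuple


def merge_until_stable(runs: Iterable[Set[str]]) -> Tuple[Set[str], Optional[int]]:
    """Merge per-run rule sets; stop at the first run that adds no new rule.

    Returns the aggregate rule set and the index of the run that added
    nothing, or None if every run contributed at least one new rule.
    """
    aggregate: Set[str] = set()
    for index, rules in enumerate(runs):
        new_rules = set(rules) - aggregate
        if index > 0 and not new_rules:
            return aggregate, index
        aggregate |= new_rules
    return aggregate, None


# Toy rule sets mined from four hypothetical runs of a program.
runs = [
    {"a => b", "c => d"},
    {"a => b", "e => f"},
    {"c => d", "e => f"},   # nothing new: gathering can stop here
    {"g => h"},
]
aggregate, stopped_at = merge_until_stable(runs)
print(sorted(aggregate), stopped_at)
```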

Support the feature selection process:
  System features in the association rules and frequent episodes should be included in the classification models
  The time windows and features in the frequent episodes suggest additional temporal features to consider

Alternatives and complements to classification models:
  Examine a new audit trace and calculate violation scores: missing rules, new rules, deviations in confidence and support, etc.
  Study the unique patterns in the trace of a suspected attack to further pinpoint the cause of the intrusion alarms

tcpdump data revisited:
  How to select the right time window?
  Hypothesis: the appropriate window should contain stable sets of frequent episodes
  Experiments: mine frequent episodes using different window lengths, and count the number of episodes
Results on time window length vs. # of episodes:

[Figure: # of episodes (0 to 300) versus time window in seconds (0 to 250), with one curve each for raw episodes, episode rules with conf=0.8, and episode rules with conf=0.6]
At the optimal time window length for classification, the # of episodes is stable

tcpdump data revisited:
  Unique patterns in intrusion data may provide some insights
  Intrusion 3:
    One of the unique frequent episode rules:
      dst_srv=auth => flag=unwanted_syn_ack, [0.82, 0.1], [30]
    One of the unique association rules:
      src_srv=smtp => duration=0, flag=unwanted_syn_ack, dst_srv=user_apps, [1.0, 0.38]
Architecture Support
Dedicated learning agents are responsible for building detection models
Base and meta detection agents are equipped with learned models
Detection agents provide new audit data to the learning agents
Learning agents dispatch updated models
JAM (Java Agents for Meta-learning) on fraud detection is the model architecture
[Figure: agent-based architecture. Components shown:
  Learning Agent: Audit Records, Audit Data Preprocessor, Inductive Learning Engine
  Base Detection Agent: Activity Data, Detection Models (Rules), (Base) Detection Engine, Evidence
  Meta Detection Agent: Evidence from Other Agents, Decision Table, (Meta) Detection Engine, Final Assertion
  Decision Engine: Action/Report]
Current Status
Accomplished:
  Experiments on sendmail and tcpdump data
  Implementation of the association rules and frequent episodes algorithms. Testing on medium-size data sets (30,000+ records, each with 6+ fields) has been completed.
  Design and 35% of the implementation of a support environment for mining patterns from audit data
  High-level system architecture design
Research Plans
To be completed within the next year and a half:
  Finish the implementation of the support environment for mining patterns
  Experiments on using the algorithms and the environment to gather audit data and select features
  Experiments on building meta detection models
  Detailed architecture design
  Implementing a prototype intrusion detection system
  Final evaluation using standard/public data sets
Conclusions
We demonstrated the effectiveness of classification models for intrusion detection
We propose to use systematic data mining approaches to select the relevant system features to build better detection models
We propose to use a (meta) learning agent-based architecture to combine multiple models, and to continuously update the detection models
