Wenke Lee and Salvatore J. Stolfo, Computer Science Department, Columbia University
Overview
- Intrusion detection and computer security
- Current intrusion detection approaches
- Our proposed approach
- Data mining
- Classification models for intrusion detection
- Mining patterns from audit data
- System architecture
- Current status
- Research plans
Intrusion Detection
Primary assumption: user and program activities can be monitored and modeled. Key elements:
- Resources to be protected
- Models of the normal or legitimate behavior on the resources
- Efficient methods that compare real-time activities against the models and report activities that are probably intrusive
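As a minimal illustration of the third element, the compare-and-report step can be sketched as follows (the profile contents and event names are invented for illustration; a real detector would use learned models rather than a hand-built set):

```python
# Hypothetical profile of normal behavior: the set of system calls a
# monitored program was observed to make during training.
normal_profile = {"open", "read", "write", "close", "mmap"}

def detect(events, profile):
    """Compare observed activities against the model; report likely intrusions."""
    return [e for e in events if e not in profile]

# 'execve' was never seen during training, so it is flagged.
alerts = detect(["open", "read", "execve", "close"], normal_profile)
```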
[Architecture diagram: a learning agent builds detection models from activity data; a (base) detection engine applies the model rules to produce evidence; a decision engine combines this evidence with evidence from other agents into a final assertion and an action/report]
[Figure: from raw audit data to profiles. Raw tcpdump output, e.g.:
10:35:41.5 128.59.23.34.30 > 113.22.14.65.80: . 512:1024(512) ack 1 win 9216
10:35:41.5 102.20.57.15.20 > 128.59.12.49.3241: . ack 1073 win 16384
10:35:41.6 128.59.25.14.2623 > 115.35.32.89.21: . ack 2650 win 16225
is preprocessed into connection records with features such as time and duration (e.g., 1.2, 0.5, 10.2); raw truss output, e.g. execve(/usr/ucb/finger, open(/dev/zero, mmap( ..., is preprocessed into system-call sequences. A learning step turns each into a profile of normal behavior.]
Intrusion Detection
Two categories of techniques:
- Misuse detection: use patterns of well-known attacks to identify intrusions
- Anomaly detection: use deviation from normal usage patterns to identify intrusions
Anomaly detection:
- Selecting the right set of system features to be measured is ad hoc and based on experience
- Unable to capture sequential interrelationships among events
Data Mining
KDD (Knowledge Discovery in Databases):
- The process of identifying valid, useful, and understandable patterns in data
- Steps: understanding the application domain, data preparation, data mining, interpretation, and utilizing the discovered knowledge
- Data mining: applying specific algorithms to extract patterns from data
Data Mining
Relevant data mining algorithms:
- Classification: maps a data item into one of several pre-defined categories
- Link analysis: determines relations between fields in the database
- Sequence analysis: models sequential patterns
Data Mining
Why is it applicable to intrusion detection?
- Normal and intrusive activities leave evidence in audit data
- From a data-centric point of view, intrusion detection is a data analysis process
- There have been successful applications in related domains, e.g., fraud detection and fault/alarm management
sendmail Experiment 1
Examples of output RIPPER rules:
- if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is normal
- if the 6th system call is lseek and the 7th is sigvec, then the sequence is normal
- if none of the above, then the sequence is abnormal
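The example rules above form an ordered decision list. A sketch of how such a rule set classifies a system-call sequence (the encoding is ours; positions are 1-based as on the slide):

```python
def classify(seq):
    """Label a system-call sequence using the three example rules above."""
    if seq[1] == "vtimes" and seq[6] == "vtrace":   # 2nd and 7th calls
        return "normal"
    if seq[5] == "lseek" and seq[6] == "sigvec":    # 6th and 7th calls
        return "normal"
    return "abnormal"                               # default rule

label = classify(["fork", "vtimes", "open", "read", "write", "close", "vtrace"])
```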
sendmail Experiment 1
Using the learned rules to analyze a new trace:
- label all sequences according to the rules
- define a region as l consecutive sequences
- define an abnormal region as one having more abnormal sequences than normal ones
- calculate the percentage of abnormal regions
- the trace is abnormal if the percentage is above a threshold
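A sketch of this post-processing, assuming sliding regions of l consecutive sequences (the region definition could also be non-overlapping; the values of l and the threshold here are only examples):

```python
def percent_abnormal_regions(labels, l=5):
    """labels: per-sequence 'normal'/'abnormal' labels for one trace."""
    # Sliding regions of l consecutive sequence labels.
    regions = [labels[i:i + l] for i in range(len(labels) - l + 1)]
    # A region is abnormal when abnormal sequences outnumber normal ones.
    abnormal = sum(r.count("abnormal") > r.count("normal") for r in regions)
    return 100.0 * abnormal / len(regions)

def is_abnormal_trace(labels, l=5, threshold=10.0):
    """Flag the trace when the percentage of abnormal regions exceeds a threshold."""
    return percent_abnormal_regions(labels, l) > threshold
```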
sendmail Experiment 1
Hypothesis: specific rules describing normal sequences are needed to detect unknown/new intrusions. Some results using various normal vs. abnormal distributions:
- Experiment A: 46% normal, sequence length 11
- Experiment B: 46% normal, sequence length 7
- Experiment C: 54% normal, sequence length 11
- Experiment D: 54% normal, sequence length 7
sendmail Experiment 1
All 4 experiments:
- Training data includes sequences from the intrusion traces (shown in bold and italic) and sequences from 80% of the normal sendmail traces
- The percentage of abnormal regions in each trace (shown in the table) is used as the intrusion indicator
- The output rule sets contain ~250 rules, each with 2 or 3 attribute tests, compared with ~1,500 distinct sequences in total
- Experiments A and B generate rules that characterize normal sequences of length 11 and 7, respectively
- Experiments C and D generate rules that characterize abnormal sequences of length 11 and 7, respectively
sendmail Experiment 1
traces            Forrest et al.     A      B      C      D
sscp-1                 5.2         41.9   32.2   40.0   33.1
sscp-2                 5.2         40.4   30.4   37.6   33.3
sscp-3                 5.2         40.4   30.4   37.6   33.3
syslog-remote-1        5.1         30.8   21.2   30.3   21.9
syslog-remote-2        1.7         27.1   15.6   26.8   16.5
syslog-local-1         4.0         16.7   11.1   17.0   13.0
syslog-local-2         5.3         19.9   15.9   19.8   15.9
decode-1               0.3          4.7    2.1    3.1    2.1
decode-2               0.3          4.4    2.0    2.5    2.2
sm565a                 0.6         11.7    8.0    1.1    1.0
sm5x                   2.7         17.7    6.5    5.0    3.0
sendmail               0            1.0    0.1    0.2    0.3
(unlabeled row)                     3.4    1.9    0.9    0.7
sendmail Experiment 2
Examples of output RIPPER rules:
- if the 3rd system call is lstat and the 4th is write, then the 7th is stat
- if the 1st system call is sigblock and the 4th is bind, then the 7th is setsockopt
- if none of the above, then the 7th is open
sendmail Experiment 2
Using the learned rules to analyze a new trace:
- predict system calls according to the rules
- if a rule is violated, the violation score is increased by 100 times the accuracy of the rule
- the trace is abnormal if the violation score is above a threshold
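A sketch of this scoring scheme (the rules and their accuracies below are invented stand-ins for the RIPPER output; a violated rule adds 100 times its accuracy to the score):

```python
# Each rule: (condition on the sequence, predicted 7th call, accuracy in [0, 1]).
rules = [
    (lambda s: s[2] == "lstat" and s[3] == "write", "stat", 0.9),
    (lambda s: s[0] == "sigblock" and s[3] == "bind", "setsockopt", 0.8),
]
default_prediction = ("open", 0.6)  # fallback when no rule fires

def violation_score(sequences, target=6):
    """Sum 100 * accuracy over every sequence whose predicted call is violated."""
    score = 0.0
    for s in sequences:
        for cond, pred, acc in rules:
            if cond(s):
                break               # first matching rule makes the prediction
        else:
            pred, acc = default_prediction
        if s[target] != pred:       # prediction violated (mismatch)
            score += 100 * acc
    return score
```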
sendmail Experiment 2
Some results:
- Experiment A: predict the 11th system call
- Experiment B: predict the middle system call in a sequence of length 7
- Experiment C: predict the middle system call in a sequence of length 11
- Experiment D: predict the 7th system call
sendmail Experiment 2
All 4 experiments:
- Training data includes only sequences from 80% of the normal sendmail traces
- Output rules predict what the normal nth or middle system call should be
- The rule-violation (mismatch) score of each trace (shown in the table) is used as the intrusion indicator
- The output rule sets contain ~250 rules, each with 2 or 3 attribute tests, compared with ~1,500 distinct sequences in total
sendmail Experiment 2
Traces               A      B      C      D
sscp-1             24.1   13.5   14.3   24.7
sscp-2             23.5   13.6   13.9   24.4
sscp-3             23.5   13.6   13.9   24.4
syslog-remote-1    19.3   11.5   13.9   24.0
syslog-remote-2    15.9    8.4   10.9   23.0
syslog-local-1     13.4    6.1    7.2   19.0
syslog-local-2     15.2    8.0    9.0   20.2
decode-1            9.4    3.9    2.4   11.3
decode-2            9.6    4.2    2.8   11.5
sm565a             14.4    8.1    9.4   20.6
sm5x               17.2    8.2   10.1   18.0
*sendmail           5.7    0.6    1.2   12.6
(unlabeled row)     3.7    3.3    1.2    1.3
The 11th (A) and 4th (B) system calls are more predictable
Adding temporal statistical features improves the effectiveness of the detection models
[Plot: an unlabeled measure (y-axis, 0 to 0.35) vs. time window length in seconds (x-axis, 0 to 100), with curves for normal, attack1, attack2, and attack3]
Association Rules
Motivations:
- Audit data can easily be formatted into a database table
- Program executions and user activities show frequent correlations among system features
- Incremental updating of the rule set is easy
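A minimal sketch of mining single-item association rules from audit records formatted as a table (the field names, records, and thresholds are invented; a full implementation also mines rules over larger itemsets):

```python
from collections import Counter
from itertools import combinations

# Invented audit records formatted as a database table.
records = [
    {"service": "smtp", "src": "host_a", "duration": "short"},
    {"service": "smtp", "src": "host_a", "duration": "short"},
    {"service": "http", "src": "host_b", "duration": "short"},
    {"service": "smtp", "src": "host_a", "duration": "long"},
]

def association_rules(records, min_support=0.5, min_conf=0.7):
    """Return (lhs, rhs, support, confidence) rules between field=value items."""
    n = len(records)
    item_counts, pair_counts = Counter(), Counter()
    for r in records:
        items = sorted(r.items())
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), c in pair_counts.items():
        if c / n >= min_support:                 # frequent pair
            for lhs, rhs in ((a, b), (b, a)):    # try both directions
                conf = c / item_counts[lhs]
                if conf >= min_conf:
                    rules.append((lhs, rhs, c / n, conf))
    return rules
```

On this toy table, service=smtp => src=host_a comes out with support 0.75 and confidence 1.0.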
Frequent Episodes
Motivation:
Sequence information needs to be included in a detection model
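A simplified sketch of the idea: count how often ordered event pairs occur within a sliding time window (real frequent-episode mining handles longer episodes; the event stream here is invented):

```python
from collections import Counter

# A time-ordered (timestamp_in_seconds, event) audit stream.
events = [(0, "connect"), (1, "auth"), (2, "read"), (10, "connect"),
          (11, "auth"), (12, "write"), (30, "read")]

def frequent_episodes(events, window=5, min_count=2):
    """Count ordered pairs (a -> b) that co-occur within the time window."""
    counts = Counter()
    for i, (t1, a) in enumerate(events):
        for t2, b in events[i + 1:]:
            if t2 - t1 > window:
                break  # events are time-ordered, so later ones are too far
            counts[(a, b)] += 1
    return {ep: c for ep, c in counts.items() if c >= min_count}
```

Here the episode connect -> auth occurs twice within a 5-second window, so it survives the frequency threshold.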
[Plot: number of episodes (y-axis, 0 to 250) vs. time window length]
The optimal time window length for classification is where the number of episodes becomes stable
Architecture Support
- Dedicated learning agents are responsible for building detection models
- Base and meta detection agents are equipped with the learned models
- Detection agents provide new audit data to the learning agents
- Learning agents dispatch updated models
- JAM (Java Agents for Meta-learning), built for fraud detection, serves as the model architecture
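One simple way a decision engine might combine evidence from base detection agents is a weighted vote (the agent names, weights, and threshold are illustrative; the meta-learning approach actually trains a combiner over base-model outputs):

```python
def meta_decision(evidence, weights, threshold=0.5):
    """Weighted vote over base-agent verdicts (1 = intrusive, 0 = normal)."""
    score = sum(weights[agent] * verdict for agent, verdict in evidence.items())
    return score / sum(weights.values()) > threshold

# Two of three hypothetical agents report intrusive activity.
evidence = {"net_agent": 1, "host_agent": 1, "app_agent": 0}
weights = {"net_agent": 0.5, "host_agent": 0.3, "app_agent": 0.2}
verdict = meta_decision(evidence, weights)
```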
[Architecture diagram, shown again: learning agent and activity data feed the (base) detection engine's models/rules; its evidence, together with evidence from other agents, goes to the decision engine, which emits the final assertion and action/report]
Current Status
Accomplished:
- Experiments on sendmail and tcpdump data
- Implementation of the association rules and frequent episodes algorithms; testing on medium-size data sets (30,000+ records, each with 6+ fields) has been completed
- Design and 35% of the implementation of a support environment for mining patterns from audit data
- High-level system architecture design
Research Plans
To be completed within the next year and a half:
- Finish the implementation of the support environment for mining patterns
- Experiments on using the algorithms and the environment to gather audit data and select features
- Experiments on building meta detection models
Research Plans
To be completed within the next year and a half:
- Detailed architecture design
- Implementing a prototype intrusion detection system
- Final evaluation using standard/public data sets
Conclusions
- We demonstrated the effectiveness of classification models for intrusion detection
- We propose to use systematic data mining approaches to select the relevant system features and build better detection models
- We propose a (meta) learning agent-based architecture to combine multiple models and to continuously update the detection models