
Computational Forensics

Digital & Computational Forensics
Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no
F o r e n s i cs L a b

Outline
Introduction to Digital Forensic Science
Digital Evidence
Digital Forensic Ontology
Computational Forensics
Machine Learning and Data Mining
Pattern Classification, Search, Clustering
Dimensionality Reduction
Application Examples
Approximate String Search
Feature / Attribute Selection
Distributed Malware Detection


Digital Forensics:
A Brief Introduction

Katrin Franke, André Årnes, Andreas Tellefsen, Kristian Nordhaug, Philip Clark
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no

NISlab-App to Information Security
Biometrics: User Authentication, BTA Protocol
Forensics: Forensic Readiness, Incident Response, Investigation/Analysis
Security Management: Risk-based Design, Security Economics, System/Adversary Modeling, Human Factors, Policies
Security Technology: Software Security, System Administration, Network and Critical Infrastructure Protection
Testimon (lat. evidence)
Computational & Digital Forensics: Fraud Detection, Analysis and Prevention


Forensic Sciences
Methodologically correct application of a broad spectrum of scientific disciplines to answer questions significant to the legal system.
Forensic methods consist of multi-disciplinary approaches to perform the following tasks:
- Investigate and reconstruct a crime scene or a scene of an accident,
- Collect and analyze trace evidence found,
- Identify, classify, quantify, individualize persons, objects, processes,
- Establish linkages, associations and reconstructions, and
- Use those findings in the prosecution or the defense in a court of law.
Forensic science mostly dealt with previously committed crime; there is now a greater focus on preventing future crime.
[Diagram labels: Application, Methodology, Technology, Forensic Science]

Challenges and Demands in Forensic Science
Challenges:
- Tiny pieces of evidence are hidden in a mostly chaotic environment,
- Trace study to reveal specific properties,
- Traces found will never be identical,
- Reasoning and deduction have to be performed on the basis of partial knowledge, approximations, uncertainties and conjectures.
Demands:
- Objective measurement and classification,
- Robustness and reproducibility,
- Secure against falsifications.

Current Situation
Knowledge and intuition of the human expert play a central role in daily forensic casework.
Courtroom forensic testimony is often criticized by defense lawyers as lacking a scientific basis.
Huge amounts of data, tight operational times, and data linkage pose challenges.

Computational Forensics, aka applying Artificial Intelligence Methodologies in Forensic Sciences


Strengthening Forensic Science in the United States: A Path Forward
Committee on Identifying the Needs of the Forensic
Sciences Community, National Research Council
ISBN: 0-309-13131-6, 352 pages, 6 x 9, (2009)
This PDF is available from the National Academies Press at:
http://www.nap.edu/catalog/12589.html

Computational vs.
Computer (Digital) Forensics
Computational Forensics uses computational
sciences to study any type of evidence:
Computer forensics
Crime Scene Investigation
Forensic paleography
Forensic anthropology
Forensic chemistry
Computer Forensics studies digital evidence:
File-system forensics
Live-system forensics
Mobile-device forensics, etc.

Examples of Ongoing Research I

Anthropology by Balleri et al. 2007

Crime Scene Reconstruction by Gryz et al. 2007

Examples of Ongoing Research II

Reconstruction of Shredded and Ripped-Up Documents by Ukovich et al. 2007, de Smet 2007, and Chanda et al. 2010.

Forensic Odontology by Nomir et al. 2007

Digital Evidence Sources

Computer, Network and Internet


Mobile Devices and Phones
Surveillance Cameras
Biometrics and Electronic
Passports
Multimedia Forensics:
Image, Video and Music Files

Examples of Digital Evidence I
Undeleted (renamed) files, Deleted files
Windows registry, Log files
Print spool files, Browser caches
Temp files (all those .TMP files!)
Swap files
Alternate partitions
Removable media (floppies, ZIP, Jaz, tapes, ...)
Digital Images, Videos, Audio files
Text Documents, Notes, Emails, Chat
Documents (e.g., GPS location, MACtimes)
Registries, Log files
Bomb-making diagrams
Malicious Software (e.g., Viruses, Worms)
Evidence by network connections made
Cell phone SMS messages
Devices: Computers, PDAs, cellular phones, SIM / Smart Card, videogame consoles, copy machines, printers, cameras, electronic pen-tablets

Examples of Digital Evidence II
Wireless telephones: Numbers called, Incoming calls, Voice mail access numbers, Debit/credit card numbers, Email addresses, Call forwarding numbers
Landline Telephones & Answering machines: Incoming/outgoing messages, Numbers called, Incoming call info, Access codes for voice mail systems, Contact lists
PDAs/Smart Phones: Above, plus contacts, maps, pictures, passwords, documents, ...
Copiers: Especially digital copiers, which may store entire copy jobs
Systems: operating systems, database systems, networks, middleware, wireless systems, firewalls, biometrics


Principles *
Evidence Integrity
- Preservation of evidence in original form
- Methods and tools to ensure that evidence is not altered (willingly or accidentally)
- Examples: Write Blocker, Hashes (MD5)
- Note: in live forensics the evidence changes during acquisition
- Consult the Order of Volatility (OOV) list to assess what to capture and to capture the most volatile data first
Chain of Custody
- Documentation of evidence acquisition, control, analysis, disposal (both physical and electronic)
- Use lists, notes, reports, timestamps and hash values, kept up to date and controlled throughout the investigative process.
- A slip might be costly in a court of law; one might speculate that the evidence is faulty, or question its origin.
* Forensic investigation process as defined by NIST
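The hash-based integrity check named on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a forensic tool: the function names `hash_evidence` and `verify_evidence` are hypothetical, and SHA-256 is computed alongside MD5 as a choice of this sketch (the slide names only MD5).

```python
import hashlib


def hash_evidence(path, chunk_size=1 << 20):
    """Return (md5_hex, sha256_hex) of a file, read in chunks to bound memory use."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def verify_evidence(path, recorded_digests):
    """True if the file still matches the digests recorded at acquisition time."""
    return hash_evidence(path) == tuple(recorded_digests)
```

In a chain-of-custody record, the digests returned at acquisition would be stored with a timestamp, and `verify_evidence` would be re-run before each analysis step to show that the image has not changed.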


Overview of Forensic Techniques
Post-mortem Analysis: File System, Registry, Event logs, Recovery of deleted files
Live Analysis (Volatility): System Date, Time, Running Processes, Network Connections, Users logged on, Open Files, Full Memory Dump
Network Analysis: Traffic Analysis
API Analysis: API commands, Data processed
Data Enrichment from the Internet
Forensic Readiness

Selected Forensic Tools
EnCase by Guidance Software: Windows suite of forensic tools, quasi-standard.
Forensic Toolkit (FTK) by AccessData: court-validated investigator platform for forensic analysis, incl. decryption and password-cracking capabilities; popular alternative to the EnCase suite.
Autopsy & The Sleuth Kit: open source; Autopsy is a graphical interface for The Sleuth Kit (TSK) command-line tools; both run on UNIX platforms, and on Windows via Cygwin.
Oxygen Forensic Suite by Oxygen Software: mobile forensic software, "smart forensics for smart phones".
COFEE by Microsoft: a useful tool for basic forensics.

Forensics Tasks vs Problem Areas
Tasks Accomplished (Examples)
- Reveal evidence that puts a person at a keyboard at a specific time,
- Recover deleted files,
- Discover when files were created, modified, deleted, applications run and installed, websites,
- Reassemble fragmented parts of images and other files.
Problem Areas
- Damaged hardware: device is physically destroyed,
- Securely overwritten: tools are used to destroy all the binary data on the disk,
- Encrypted devices: unless the encryption key can be obtained.

Forensic-tool Testing
Background / Motivation
Court rulings Frye v. United States (D.C. Circuit, 1923) and
Daubert v. Merrell Dow Pharmaceuticals, Inc. (US Supreme Court, 1993)
Daubert Criteria
Has the method in question undergone empirical testing?
Has the method been subjected to peer review?
Does the method have any known or potential error rate?
Do standards exist for the control of the technique's operation?
Has the method received general acceptance in the relevant scientific community?
NIST Computer Forensics Tool Testing (CFTT)
http://www.cftt.nist.gov/
Scientific Working Group on Digital Evidence (SWGDE)
2009-01-15 SWGDE Recommendations for Validation Testing Version v1.1
IEEE Standard 829 - Standard for Software Test Documentation: 1983 version superseded by 1998 version.

Digital Forensics Ontology Framework
TR, June 2010, on behalf of armasuisse

Jarle Kittelsen, Katrin Franke, Bernhard Hämmerli
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no

Objective
Comprehensive overview of the main topics and concepts
Updated framework ontology for the domain of digital forensics
Attempt to map some of the existing relations between these concepts
Intended as a seed for further definition
Over time, to serve as a common reference and define a common vocabulary

Previous Work
Significant contributions
Forensics wiki. http://www.forensicswiki.org/wiki/.
Ashley Brinson, Abigail Robinson, and Marcus Rogers.
A cyber forensics ontology: Creating a new approach to
studying cyber forensics. In The Proceedings of the 6th
Annual Digital Forensic Research Workshop (DFRWS 06),
volume 3, 2006.
David Christopher Harrill and Richard P. Mislan.
A small scale digital device forensics ontology. Small scale
digital device forensics journal, 1, 2007.

Total Number of References: 31


Developed Ontology (two layers expanded)
http://www.mindmeister.com/maps/show/48668592

Main Concepts I
Digital Evidence: Physical Evidence, Logical, External
Digital Forensic Methods: Data Duplication, Image Analysis, Audio Analysis, Document Analysis, File Analysis, Network Analysis, Data Reduction, Data Recovery, Data Analysis
Digital Forensic Tools
Counter-Forensics: Encryption, Steganography, Proxies, Storageless devices, Secure Deletion, Data Tampering
Digital Forensic Crime Cases: Cyber Crime Cases, Traditional Crime Cases

Main Concepts II
Digital Forensic Process: Preparation, Identification, Approach Strategy, Preservation, Collection, Examination / Analysis, Presentation, Returning evidence
Professions: Law, Academia, Military, Private sector
Legal Aspects
Terminology

Future Work
Mapping / Linking all the relations that
exist across classes
Represent the digital forensics ontology in
machine readable form
Usage of web ontology language (OWL)
World Wide Web Consortium (W3C). Owl web ontology language
overview. http://www.w3.org/TR/owl-features/.

Recommender / Expert systems to support evidence collection and analysis


Computational Forensics:
Admission of Artificial Intelligence
Methodologies in Forensic Sciences

Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no

Requirement of Adapted
Computer Models & Operators
[Figure: Computational Intelligence draws on three models: the brain (NN), reasoning under imprecision, uncertainty and partial truth (FL), and natural evolution (EC)]
NN: Neural Networks
FL: Fuzzy Logic
EC: Evolutionary Computation

Computational Methods
Signal / Image Processing: one-dimensional signals and two-dimensional images are transformed for better human or machine processing,
Computer Vision: images are automatically recognized to identify objects,
Computer Graphics / Data Visualization: two-dimensional images or three-dimensional scenes are synthesized from multi-dimensional data for better human understanding,
Statistical Pattern Recognition: abstract measurements are classified as belonging to one or more classes, e.g., whether a sample belongs to a known class and with what probability,
Machine Learning: a mathematical model is learnt from examples,
Data Mining: large volumes of data are processed to discover nuggets of information, e.g., presence of associations, number of clusters, outliers, etc.,
Robotics: human movements are replicated by a machine.

Objective
Study and development of computational
methods to
Assist in basic and applied research, e.g. to
establish or prove the scientific basis of a
particular investigative procedure,
Support the forensic examiner in their
daily casework.

Modern crime investigation shall profit from the hybrid intelligence of humans and machines.

Computational Forensics -
Definition
It is understood as the hypothesis-driven investigation of a
specific forensic problem using computers, with the primary
goal of discovery and advancement of forensic knowledge.

CF works towards:
1) In-depth Understanding of a forensic discipline,
2) Evaluation of the basis of a particular scientific method and
3) Systematic Approach to forensic sciences by applying
techniques of computer science, applied mathematics and
statistics.
It involves Modeling and computer Simulation (Synthesis) and/or computer-based Analysis and Recognition.

Admission of
Computational Forensics
1. Need of Automatization,
Standardization, and Benchmarking

2. Need of Education, Joint Research, and Development by Forensic and Computer Scientists

3. Need of Legal Framework

Automatization, Standardization,
and Benchmarking
Increase Efficiency and Effectiveness
Perform Method / Tool Testing regarding their
Strengths/Weaknesses and their Likelihood Ratio
(Error Rate)
Gather, manage and extrapolate data, and synthesize new Data Sets on demand.
Establish and implement Standards for data, work procedures and journal processes

Fulfillment of Daubert Criteria
http://en.wikipedia.org/wiki/Daubert_Standard

Joint Research & Development:
Forensic and Computer Scientist
Education and training,
Revealing the state-of-the art in *each* domain
Sources of information on events, activities and financing
opportunities
International forum to peer-review
and exchange, e.g., IWCF workshops
Performance evaluation, benchmarking, proof and
standardization of algorithms
Resources in forms of data sets, software tools, and
specifications e.g. data formats
New Insights on problem description and procedures

Legal Framework ?!

Questions on methods for dimensionality reduction: loss of relevant information
Questions on extracted numerical parameters: loss of information due to inappropriate features
Questions on the reliability of applied computational method / tool
Questions on the final conclusion due to wrong computational results

Educational Information
CompFor courses and study programs
Gjøvik University College, NO (Master, PhD)
Uni. of Amsterdam, NL (Master)
TU Kaiserslautern, DE (Master, announced)
Article in Wikipedia *
http://en.wikipedia.org/wiki/Computational_forensics
Brief Tutorial and Overview Article
http://sites.google.com/site/compforgroup/publications
Links to relevant data collections *
http://sites.google.com/site/compforgroup/data
* To be extended!

5th International Workshop on Computational Forensics
Gjøvik, Norway, March 15-16, 2012

http://iwcf2012.arsforensica.org


Computational Methods:
Machine Learning and Data Mining

Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no

Machine Learning, Data Mining
and Artificial Intelligence

"If an expert system - brilliantly designed, engineered and implemented - cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten."
Oliver G. Selfridge, from The Gardens of Learning.

"Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever."
Oliver G. Selfridge, in AI's Greatest Trends and Controversies.

General ML Approach
Data Collection: large sample of data of how humans perform the task
Model Selection: settle on a parametric statistical model of the process
Parameter Estimation: calculate parameter values by inspecting the data
Search: using the learned model, find the optimal solution to the given problem
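The four steps above can be sketched end to end on a toy problem. The example below is only an illustration under stated assumptions: the data (stroke-width measurements for two hypothetical writers) are invented, and a one-dimensional Gaussian per class is chosen here as the parametric model; the slide does not prescribe either.

```python
import math
from statistics import mean, pvariance


def estimate_parameters(samples_by_class):
    """Parameter estimation: one (mean, variance) pair per class from the collected data."""
    return {label: (mean(values), pvariance(values))
            for label, values in samples_by_class.items()}


def log_likelihood(x, mu, var):
    """Log density of a 1-D Gaussian, the parametric model assumed in this sketch."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def classify(x, parameters):
    """Search: return the class whose learned model explains x best."""
    return max(parameters, key=lambda label: log_likelihood(x, *parameters[label]))


# Data collection: hypothetical stroke-width measurements (mm) from two writers.
training_data = {"writer_a": [0.9, 1.0, 1.1, 1.0], "writer_b": [1.9, 2.1, 2.0, 2.2]}
model = estimate_parameters(training_data)
```

Calling `classify(1.05, model)` then searches over the two learned class models and picks the one with the higher likelihood for the new measurement.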

Example Problem:
Handwritten Digit Recognition

Handcrafted rules will result in a large number of rules and exceptions
Better to have a machine that learns from a large training set
[Figure: Wide variability of same numeral]

Role of Machine Learning
Principled way of building high performance information
processing systems

Machine Learning vs Pattern Recognition


Machine Learning has origins in Computer Science
Pattern Recognition has origins in Engineering
They are different facets of the same field

Language Related Technologies


Information Retrieval, Natural Language Processing, Automatic
Speech Recognition
Humans perform them well
Difficult to specify algorithmically

Pattern Recognition

Goals:
Supervised / unsupervised classification of patterns by means of computer technology
Small intraclass and large interclass variation
Pattern:
"as opposite of a chaos; it is an entity, vaguely defined, that could be given a name" (Watanabe 1985)
[Figure: example patterns of classes A, B and C, and an unknown pattern X]

Pattern Classification

[Figure: labeled patterns (A, B, C) with an unknown pattern "?", and unlabeled patterns (*)]
Supervised Classification: classes pre-defined by the system designer (Machine Learning)
Unsupervised Classification: classes learned, based on the similarity of patterns (Data Mining)


Representation of Pattern Characteristics


Feature Vector
Goal: Machine-readable Attribute/Feature Vector
Tasks:
Feature Extraction and Selection by using Training Patterns
Cross-validation by using Test Patterns
[Figure: patterns A, B, C and unlabeled patterns (*) mapped to Feature Vectors 1-3 with the components Number of corners, Size, Label]

Pattern Representation and
Classification Example

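[Figure: example patterns (X, C, A, B, B, A) encoded step by step as feature vectors with the components Number of corners, Size, Label. Feature Vector 1 fills in the number of corners (1**, 2**, 1**, 1**, 2**), Feature Vector 2 adds the size (14*, 24*, 16*, 16*, 20*), Feature Vector 3 adds the class label (14A, 24C, 16A, 16B, 20B)]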

Classifier Training, or
How do Computers learn?
Learning by Example!
Requirements
Representative Sample Data
Appropriate Feature Encoding
Challenge
Class Discrimination
Avoid Over-Learning (overfitting)

Best-known Approaches for
Pattern Recognition

Template Matching
Syntactical or Structural PR
Statistical PR
Neural Networks

Model for Pattern Recognition

Classification: Test pattern -> Preprocessing -> Feature Measurement -> Classification
Training: Training pattern -> Preprocessing -> Feature Extraction / Selection -> Learning

Recognition Methods in Numbers *
9 Feature Extraction and Projection Methods
7 Feature Selection Methods
7 Learning Algorithms
14 Classification Methods
18 Classifier Combination Schemes

Statistical Pattern Recognition: A Review, A.K. Jain, R.P.W. Duin and J. Mao, 2000, PAMI
* Note that biologically inspired methods come in addition

Three Application Examples


Approximate String Search
Feature Selections for IDS
Malware-Abnormality Detection

Challenges in Cybercrime

Rapidly growing computer storage capacities and data volumes
Increasing complexities in computer systems
Hidden evidence caused by, e.g., obfuscated malware
Architectural limitations in many digital forensic tools leave much of the work to the forensic analyst and his/her workstation, which is costly, time- and resource-consuming work
Independent analysis of machines, in cases where multiple parties or systems are involved in the same crime, can lead to loss of essential evidence that is only visible when correlated


Towards a Generic Feature-Selection Measure for Intrusion Detection
Hai Thanh Nguyen, Katrin Franke and Slobodan Petrović

Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no

Our Research Focus
1. Generalization of several feature selection measures.
2. Optimization to derive globally optimal feature subsets.

Considering the CFS measure (Hall, 1999) and the mRMR measure (Peng, 2005) for intrusion detection because:
Filter methods are usually used to select features from high-dimensional data sets, such as those in intrusion detection systems.
Relevance of features and relationships between features are considered.
The relevance and relationships are usually characterized in terms of correlation (CFS) or mutual information (mRMR).

CFS and mRMR Feature Selection

Correlation feature-selection (CFS) measure
- Class-feature correlation
- Feature-feature correlation
M. Hall. Correlation Based Feature Selection for Machine Learning. Doctoral Dissertation, University of Waikato, Department of Comp. Science, 1999.

Feature-selection measure based on mutual information (mRMR)
- Class-feature mutual information
- Feature-feature mutual information
H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on PAMI, Vol. 27, No. 8, pp. 1226-1238, 2005.
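As a rough illustration of how the two measures combine relevance and redundancy, the sketch below scores a given feature subset. It assumes the commonly cited forms of the measures: the CFS merit k*r_cf / sqrt(k + k(k-1)*r_ff) from Hall (1999) and the mRMR difference of mean relevance and mean redundancy from Peng et al. (2005). The function names and the input layout (precomputed correlation and mutual-information tables) are choices made for this example only.

```python
import math


def cfs_merit(class_corr, feature_corr, subset):
    """CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff) for the chosen feature subset."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean class-feature correlation over the subset.
    r_cf = sum(abs(class_corr[i]) for i in subset) / k
    # Mean feature-feature correlation over all distinct pairs in the subset.
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = sum(abs(feature_corr[i][j]) for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def mrmr_score(class_mi, feature_mi, subset):
    """mRMR score: mean class-feature MI minus mean feature-feature MI."""
    k = len(subset)
    if k == 0:
        return 0.0
    relevance = sum(class_mi[i] for i in subset) / k
    redundancy = sum(feature_mi[i][j] for i in subset for j in subset) / (k * k)
    return relevance - redundancy
```

With both measures, adding a feature that is strongly correlated with one already selected lowers the score even if that feature is individually relevant, which is the behaviour a filter method relies on.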

Generic Feature Selection (GeFS)
Question: Can the CFS measure and the mRMR measure be fused and generalized into a generic feature selection measure?
Definition 1: A generic feature selection (GeFS) measure is defined as follows:

Proposition 1: The CFS and the mRMR measures are instances of the GeFS measure.
Proposition 2: The feature selection by means of the GeFS measure is a polynomial mixed 0-1 fractional programming (PM01FP) problem.
How to solve it?

Search Procedures
Exhaustive search
- Globally optimal feature subsets
- Slow, with complexity of O(2^n)
- Examples: Exhaustive search, Breadth-first search, Depth-first search, Iterative deepening, Branch-and-Bound
Heuristic search
- Locally optimal feature subsets
- Faster than exhaustive search
- Examples: Greedy search, Gradient search, Simulated annealing, Genetic algorithms

Can we find globally optimal feature subsets yet avoid standard exhaustive search?
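The trade-off between the two families can be made concrete with a small sketch. The subset scores below are invented so that two features are only strong in combination; the function names are likewise hypothetical. Exhaustive search evaluates all 2^n subsets and finds that pair, while greedy forward selection commits to the best single feature first and ends in a weaker subset.

```python
from itertools import combinations


def exhaustive_search(n_features, score):
    """Evaluate all 2^n subsets; guaranteed to return a globally optimal one."""
    best_subset, best_value = (), score(())
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            value = score(subset)
            if value > best_value:
                best_subset, best_value = subset, value
    return best_subset, best_value


def greedy_forward_search(n_features, score):
    """Add the single best feature per step; stops when no addition improves the score."""
    current, current_value = (), score(())
    while True:
        candidates = [tuple(sorted(current + (f,)))
                      for f in range(n_features) if f not in current]
        if not candidates:
            return current, current_value
        best = max(candidates, key=score)
        if score(best) <= current_value:
            return current, current_value
        current, current_value = best, score(best)


# Hypothetical subset scores: features 1 and 2 are only strong together.
TOY_SCORES = {(): 0.0, (0,): 0.5, (1,): 0.4, (2,): 0.4,
              (0, 1): 0.55, (0, 2): 0.55, (1, 2): 0.9, (0, 1, 2): 0.6}


def toy_score(subset):
    """Look up the invented score of a subset, independent of element order."""
    return TOY_SCORES[tuple(sorted(subset))]
```

On this toy table, `exhaustive_search(3, toy_score)` returns the pair `(1, 2)`, whereas `greedy_forward_search(3, toy_score)` never visits it because it starts from feature 0.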

Problem Transformation
Chang's method for solving PM01FP
- Linearizing the PM01FP problem into a mixed 0-1 linear programming problem (M01LP).
- The number of variables & constraints: n^2
- Branch and Bound algorithm.
Our method for solving PM01FP
- Differently linearizing the PM01FP problem into a mixed 0-1 linear programming problem (M01LP).
- The number of variables & constraints: 4n+1
- Branch and Bound algorithm.
C-T. Chang. On the polynomial mixed 0-1 fractional programming problems, European Journal of Operational Research, vol. 131, issue 1, pages 224-227, 2001.


Experimental Setting
10% of the overall (5 million instances) KDD CUP99 test data set for Intrusion Detection Systems, which has normal traffic and 4 attack classes (DoS, Probe, U2R, R2L).
Consider 4 data subsets of the KDD CUP99:
Data sets          Number of instances
Normal & DoS       488,736
Normal & Probe     138,391
Normal & U2R       97,330
Normal & R2L       98,404
Feature selection methods: Opt-CFS & Opt-mRMR.
Classifiers: Hierarchical Classifier C4.5 & Bayesian Network.
Apply 5-fold cross validation.
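For readers unfamiliar with the evaluation protocol, 5-fold cross validation can be sketched as an index split. The helper below is a generic illustration with a hypothetical name; it is not the tooling used in the experiments, which the slides do not describe at this level.

```python
import random


def k_fold_indices(n_instances, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold
```

A classifier would be trained on each `train` list and scored on the matching `test_fold`, with the five scores averaged.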

Experimental Results

[Figures: Number of Selected Features; Achieved Recognition Performance]
Please see listed reference publications for further details.
Nguyen, Franke, Petrović, 2009-2010

Summary

Fusing and generalizing the CFS and the mRMR measures into a generic feature selection (GeFS) measure.
Transforming the feature-selection problem by means of the GeFS into a mixed 0-1 linear programming problem.
Proposing a new approach that ensures globally optimal feature sets.
Proposing a new feature-selection method (Opt-GeFS).
Validating the new method by conducting experiments on the KDD Cup 1999 test IDS data set and on other data sets.
Our method is domain-independent.


Cross-Computer Malware
Detection in Digital Forensics
Anders Orsten Flaglien, Peter Ekstrand Berg, Lars Arne Sand,
Katrin Franke, André Årnes

Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no

Distributed Malware, as Botnets

Botnets utilize malware to establish control over infected machines. Botnet-initiated attacks represent one of the biggest computer threats existing today, and they are heavily used for computer crime (e.g., DDoS attacks) [23, 28, 5]
The command and control (C&C) architecture of botnets ensures evidence is present in multiple locations
Patterns of malware used in botnets can be hard to detect, as malware often uses obfuscation techniques. However, in cases of C&C-dependent malware, like botnets, certain patterns are required in order for the bot-master to control bots.

Application Scenario

Applied Method
Data Collection
Examination
- File Metadata Extraction
- Hash Filtering
- Feature Extraction
Analysis (Link Mining)
- Combining
- Pre-processing
- Clustering
- Linking Machines

deLink
A method that can assist human analysis in order to improve decision making and further improve the results of digital forensics
Aims to improve the efficiency of the time invested in investigation, and the effectiveness of detecting relevant and high-quality evidence.

Features of interest have to be selected that best represent the characteristics of the input data
File metadata represent most of the features
Content-based features improve knowledge and represent file content
- Special characteristics that reflect typical malware patterns
- Strings from regular expressions
Case-supplied metadata should only be used for tracing its origin, not for the clustering task

Examination of input data to create a textual and structured representation
Feature files are created from selected features, extracted from copied disk images
ARFF file format is used (built using Fiwalk [16] and python scripts for additional features)
[Figure: Feature Files for Machine 1 to Machine n, each listing File Object 1 to File Object n with Features A, B, C, D, E, F, G, H]
Machines are represented by individual Feature Files, where each file is a file object represented as a feature vector with n features
Hash-based filtering is used to remove noise files and to limit the amount of data to correlate
- Known system and application files (NSRL RDS dataset, clean system)
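The hash-based filtering step can be sketched as a set lookup. This is an illustrative assumption of how such a filter might look, not the deLink implementation: the record layout (`name`, `content`) and function names are hypothetical, and SHA-1 is used here because it is one of the digests distributed with known-file reference sets.

```python
import hashlib


def sha1_of(data):
    """SHA-1 hex digest of a file's content (bytes)."""
    return hashlib.sha1(data).hexdigest()


def filter_known_files(file_objects, known_hashes):
    """Drop file objects whose content hash appears in the known-file set."""
    return [obj for obj in file_objects
            if sha1_of(obj["content"]) not in known_hashes]
```

Only the file objects that survive this filter would go on to feature extraction and clustering, which is what keeps the data set small enough to correlate across machines.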

Link Mining is performed to identify correlations, and thus identify malware traces
A dataset is generated from all Feature Files from the machines (separated with case metadata)
A descriptive data mining method, using an unsupervised clustering algorithm
Clustering provides group detection; links between multiple machines exist for clusters in which their files are present
The simple K-means algorithm is used for clustering, with Euclidean distance as proximity measure
Preprocessing is performed to suit and to optimize the result of the clustering algorithm
- Removing redundant features
- Converting features to an appropriate format (nominal and numeric)
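A minimal sketch of the clustering and linking idea follows. It is a plain k-means with Euclidean distance written for illustration, with the first k points used as initial centroids (an assumption of this example); the experiments themselves used the Weka tool, whose implementation is not reproduced here. The helper `link_machines` is a hypothetical name for the step that reads links off the clusters.

```python
import math
from collections import defaultdict


def k_means(points, k, iterations=20):
    """Plain k-means with Euclidean distance; initial centroids are the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment


def link_machines(machine_ids, assignment):
    """Map each cluster to the set of machines contributing files to it."""
    links = defaultdict(set)
    for machine, cluster in zip(machine_ids, assignment):
        links[cluster].add(machine)
    return dict(links)
```

A cluster whose set contains several machines indicates files with similar characteristics on all of them, which is the kind of cross-machine link an investigator would inspect first.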

Link Mining Evaluation is performed to measure the results and to find clusters of interest
Misinterpretation of the link mining and use of features has to be considered and can be identified through link mining evaluations [24, 33]
Classified data can be compared to clustering results to reveal uncertainties regarding the integrity of the clustering task, algorithm and features used [33]
Two-dimensional graphs can be used to examine links and features' involvement in the created clusters. This indicates variations within (to measure the compactness) and between (to measure the distances between) the K clusters (C) [22]
Figure 1: showing cluster characteristics (axes: Clusters C1-C5; Unalloc, Dir, Raw)

The practical implementation and involved processing steps for Machines m in Case n

74
Computational Forensics Multiple Experiments were executed to
evaluate deLinks efficiency and effectiveness
Three experiments were executed in a virtual environment:
Proof-of-concept
Passive, self-developed bot malware
Malware from the wild (Spybot v1.3)
Figure 2: Botnet with C&C server and attack website
The most realistic results were obtained from the experiment using Spybot v1.3
A network infrastructure with a command & control server, an infection site and 5 victim computers was used
Online banking attack scenario, in which the computers have been infected and taken over by an adversary bot master to, e.g., execute DDoS attacks

Hash-based filtering in the experiments reduced the amount of input data for the link mining to process

Filtering reduced the number of data objects down to 3%
Homogeneous experiment machines, which also resulted in many correlations
The effect of perfect hash-based filtering is reflected
Automatic reduction of the total number of files used for the link mining increases the value of the clusters it produces
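A minimal sketch of hash-based filtering, under stated assumptions: files whose digest appears in a set of known hashes are dropped before link mining. The use of SHA-256, the in-memory dictionary of file contents and the names `sha256_hex` and `filter_known` are illustrative choices; the slides do not specify the hash function or the reference hash set.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Hex digest used as the file fingerprint (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def filter_known(files, known_hashes):
    """Keep only the files whose hash is not in the known-file set.

    files: dict mapping a file name to its content as bytes.
    known_hashes: set of hex digests of files already known to be benign.
    """
    return {
        name: content
        for name, content in files.items()
        if sha256_hex(content) not in known_hashes
    }


if __name__ == "__main__":
    # Hypothetical file names and contents for illustration only.
    files = {
        "kernel32.dll": b"known operating system file",
        "notepad.exe": b"another known file",
        "dropper.exe": b"unknown content",
    }
    known = {sha256_hex(b"known operating system file"),
             sha256_hex(b"another known file")}
    remaining = filter_known(files, known)
    print(sorted(remaining), len(remaining) / len(files))
```

On homogeneous machines most files are shared and known, which is why such a filter can remove the bulk of the data objects before clustering.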

Cluster creation in the experiment grouped the file objects according to their characteristics
SOM diagrams were used to help estimate the nature of the input dataset (used to obtain the input k for k-means, but could also be used alone for clustering objects)
Three segments were identified in the dataset from the Malware from the Wild experiment
Figure 3: SOM diagram
The file objects with common characteristics across all machines were clustered (k-means with k=3, using the Weka machine learning tool)
Figure 4: Machine files representation in clusters (axes: Machine ID; Clusters C1 to C3)
Timeline visualization of experiment results, reflecting one cluster, filtered with the malicious IP address
Expert knowledge about an incident is available, e.g., the approximate time when the incident occurred, its origin, and indications of effect and extent.
Figure 5: One cluster with files, having suspected IP, timelined (axes: Machine ID; Access Time with T0, T1, T2)
T0: IE history files accessed, T1: Infection file accessed, T2: Additional infection files accessed
Summary

Efficient techniques to decrease the initially large data set
Clustering further separated and grouped common file objects
The ability to detect common file objects between the machines increased the understanding of the large data set
Correlations further improved the knowledge of which files could be associated with malware and thereby with an incident
Improving the efficiency and effectiveness of digital forensics by using mining techniques

Concluding Remarks
Computational forensics holds the
potential to greatly benefit all of the
forensic sciences.
For the computer scientist it poses a new
frontier where new problems and challenges
are to be faced.
The potential benefits to society, meaningful
inter-disciplinary research, and
challenging problems should attract high-quality
students and researchers to the field.
Further Reading
NAS Report: Strengthening Forensic Science in the United States: A Path Forward. http://www.nap.edu/catalog/12589.html
van der Steen, M., Blom, M.: A roadmap for future forensic research. Technical report, Netherlands Forensic Institute (NFI), The Hague, The Netherlands (2007)
M. Saks and J. Koehler. The coming paradigm shift in forensic identification science. Science, 309:892-895, 2005.
United States v. Starzecpyzel. 880 F. Supp. 1027 (S.D.N.Y.), 1995.
http://en.wikipedia.org/wiki/Daubert_Standard
C. Aitken and F. Taroni. Statistics and the Evaluation of Evidence for Forensic Scientists. Wiley, 2nd edition, 2005.
K. Foster and P. Huber. Judging Science. MIT Press, 1999.
Franke, K., Srihari, S.N. (2008). Computational Forensics: An Overview, in Computational Forensics - IWCF 2008, LNCS 5158, Srihari, S., Franke, K. (Eds.), Springer Verlag, pp. 1-10.
http://sites.google.com/site/compforgroup/
Nguyen, H., Franke, K., Petrovic, S. (2010). Towards a Generic Feature-Selection Measure for Intrusion Detection, In Proc. International Conference on Pattern Recognition (ICPR), Turkey.
Nguyen, H., Petrovic, S., Franke, K. (2010). A Comparison of Feature-Selection Methods for Intrusion Detection, In Proceedings of Fifth International Conference on Mathematical Methods, Models, and Architectures for Computer Networks Security (MMM-ACNS), St. Petersburg, Russia, September 8-11. (accepted for publication)
Nguyen, H., Franke, K., Petrovic, S. (2010). Improving Effectiveness of Intrusion Detection by Correlation Feature Selection, International Conference on Availability, Reliability and Security (ARES), Krakow, Poland, pp. 17-24.
Flaglien, A.O., Arnes, A., Franke, K. (2010). Cross-Computer Malware Detection in Digital Forensics, Techn. Report, Gjøvik University College, June 2010.


Thank you for your consideration of comments!
Getting in touch
WWW: kyfranke.com
Email: kyfranke@ieee.org
Skype/gTalk: kyfranke


Improving the Efficiency of Digital Forensic Search by Means of Constrained Edit Distance
Slobodan Petrovic, Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
www.nislab.no

Model for Pattern Recognition

Classification: Test pattern -> Preprocessing -> Feature Measurement -> Classification
Training: Training pattern -> Preprocessing -> Feature Extraction / Selection -> Learning

Challenges

Huge search space (size can be terabytes of data)
Distorted strings (intentionally / unintentionally)
Substituting, inserting or deleting 1...N characters
Consecutive insertions and deletions
Example: ROLEX distorted to POLEC, ROLEKS, ROLLE_

Objective

Efficient, non-exhaustive procedure
Approximate search
No predefined distortions
"... Buy replica POLEC and fake ROLEKS watches online from our online Swiss replica watch store and save money. Find, compare and buy! ...."

Applied Method

Search string (length M): ROLEX
Data fragment frag_i (length N): ... replica POLEC and fake ROLEKS watches ...
M - length of the search string
N - length of the data fragment
frag_i - the specific fragment
d_i - edit distance of frag_i and the search string

Edit Distance Revisited
Definition: The edit distance between two words (strings) is
the minimal number of edit operations (insertions, deletions,
or substitutions) that must be performed to convert one word
into the other.

ROLEX -> POLEX: d_i = 1
ROLEX -> POLEC: d_i = 2
ROLEX -> POLEKS: d_i = 3
ROLEX -> ROLLE_: d_i = 2

Many definitions and implementations of this metric exist, e.g., Hamming distance, Levenshtein distance, Damerau-Levenshtein distance, Jaro-Winkler distance, Wagner-Fischer edit distance, Ukkonen, Hirschberg
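As an illustration of the definition above, the sketch below computes the Levenshtein distance with the Wagner-Fischer dynamic program, keeping only two rows of the table. It is one of the listed variants, chosen here for brevity; the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and substitutions
    needed to convert string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string is i deletions
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for distorted in ("POLEX", "POLEC", "POLEKS", "ROLLE_"):
        print("ROLEX ->", distorted, levenshtein("ROLEX", distorted))
```

Run on the four distorted strings from the slide, the function gives 1, 2, 3 and 2, matching the d_i values shown.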
Constrained Edit Distance
F - the maximum number of consecutive deletions
G - the maximum number of consecutive insertions
[Figure: the search string ROLEX (length M) matched against data fragments of length N in which the characters R, O, L, E, X (or the distorted P, O, L, E, C) are spread out; annotated with d_i and F]
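To make the role of F and G concrete, here is a simplified sketch of an edit distance in which runs of consecutive deletions are limited to F and runs of consecutive insertions to G. This is an illustrative interpretation of the constraints, not the algorithm from the publication; the name `constrained_edit_distance`, the unit costs and the convention that deletions remove characters of the first string are all assumptions.

```python
from math import inf


def constrained_edit_distance(a, b, max_del_run, max_ins_run):
    """Minimal number of edits turning a into b, where at most max_del_run
    deletions and at most max_ins_run insertions may occur in a row.
    Returns None if no admissible edit sequence exists."""
    F, G = max_del_run, max_ins_run
    m, n = len(a), len(b)
    # State 0: last step was a match/substitution (or start).
    # States 1..F: that many consecutive deletions so far.
    # States F+1..F+G: (state - F) consecutive insertions so far.
    n_states = 1 + F + G
    dist = [[[inf] * n_states for _ in range(n + 1)] for _ in range(m + 1)]
    dist[0][0][0] = 0

    def relax(i, j, state, cost):
        if cost < dist[i][j][state]:
            dist[i][j][state] = cost

    for i in range(m + 1):
        for j in range(n + 1):
            for state in range(n_states):
                d = dist[i][j][state]
                if d == inf:
                    continue
                if i < m and j < n:  # match (cost 0) or substitution (cost 1)
                    relax(i + 1, j + 1, 0, d + (a[i] != b[j]))
                if i < m:  # delete a[i], extending or starting a deletion run
                    run = state + 1 if 1 <= state <= F else 1
                    if run <= F:
                        relax(i + 1, j, run, d + 1)
                if j < n:  # insert b[j], extending or starting an insertion run
                    run = state - F + 1 if state > F else 1
                    if run <= G:
                        relax(i, j + 1, F + run, d + 1)

    best = min(dist[m][n])
    return None if best == inf else best


if __name__ == "__main__":
    print(constrained_edit_distance("ROLEX", "ROLEKS", 0, 1))
    print(constrained_edit_distance("ROLEX", "ROLEKS", 0, 0))
```

With F = G = 0 only substitutions are admissible, so strings of different length cannot be matched at all; loosening the constraints makes the result approach the ordinary edit distance.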

Results I

R = f(N, F)
Dependency of the data-set reduction R on the length of the fragments (N) for different values of the maximum number of consecutive deletions (F).
Constant restrictiveness threshold = 0
IF (d_i <= (N - M + threshold)) THEN accept(frag_i)
M - length of the search string
frag_i - the specific fragment
d_i - edit distance of frag_i
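The pre-selection step can be sketched as a sliding window over the data. The sketch assumes the acceptance rule reads d_i <= N - M + threshold, and it uses the plain Levenshtein distance in place of the constrained edit distance to stay short; the names `search_fragments` and `levenshtein` and the one-character window step are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (stand-in for the constrained variant)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (char_a != char_b)))
        prev = curr
    return prev[-1]


def search_fragments(data: str, pattern: str, frag_len: int, threshold: int = 0):
    """Return (offset, d_i) for every fragment of length N = frag_len that
    satisfies the assumed rule d_i <= N - M + threshold."""
    m = len(pattern)
    accepted = []
    for offset in range(len(data) - frag_len + 1):
        fragment = data[offset:offset + frag_len]
        d_i = levenshtein(pattern, fragment)
        # N - M edits are always needed to absorb the surplus characters of
        # the fragment, so only the excess is compared with the threshold.
        if d_i <= frag_len - m + threshold:
            accepted.append((offset, d_i))
    return accepted


if __name__ == "__main__":
    text = "Buy replica POLEC and fake ROLEKS watches online"
    for offset, d_i in search_fragments(text, "ROLEX", frag_len=8, threshold=2):
        print(offset, repr(text[offset:offset + 8]), d_i)
```

With a threshold of 0 only fragments that contain the undistorted search string as a subsequence are accepted; raising the threshold tolerates more distortion at the cost of passing more fragments to the second stage.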
Summary
Two-stage search procedure
Constrained edit distance for pre-selection (1st phase)
Enables rejection of fragments in which the detected string is too distorted
Level of tolerance is controlled by means of the values of the constraints
Detailed algorithm and simplification are described in the publication
Experimental results demonstrate the efficiency of the proposed method