
Computational Forensics

Digital & Computational

Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College
Forensics Lab

Introduction to Digital Forensic Science
Digital Evidence
Digital Forensic Ontology
Computational Forensics
Machine Learning and Data Mining
Pattern Classification, Search, Clustering
Dimensionality Reduction
Application Examples
Approximate String Search
Feature / Attribute Selection
Distributed Malware Detection


Digital Forensics:
A Brief Introduction

Katrin Franke, André Årnes,
Andreas Tellefsen, Kristian Nordhaug, Philip Clark
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College

NISlab-App to Information Security
Biometrics: User Authentication, BTA Protocol
Forensics: Forensic Readiness, Incident Response
Security Management: Risk-based Design, Security Economics, System/Adversary Modeling, Human Factors, Policies
Security Technology: Software Security, System Administration, Network and Critical Infrastructure Protection
Testimon (lat. evidence), Computational & Digital Forensics: Fraud Detection, Analysis and Prevention

Forensic Sciences
Methodologically correct application of a broad spectrum of scientific disciplines to answer questions significant to the legal system.
Forensic methods consist of multi-disciplinary approaches to perform the following tasks:
- Investigate and reconstruct a crime scene or a scene of an accident,
- Collect and analyze trace evidence found,
- Identify, classify, quantify, individualize persons, objects, processes,
- Establish linkages, associations and reconstructions, and
- Use those findings in the prosecution or the defense in a court of law.
Forensic science has mostly dealt with previously committed crime; a greater focus is now on preventing future crime.
[Diagram: Methodology, Technology, Forensic Science]

Challenges and Demands in Forensic Science
Challenges:
- Tiny pieces of evidence are hidden in a mostly chaotic environment,
- Trace study to reveal specific properties,
- Traces found will never be identical,
- Reasoning and deduction have to be performed on the basis of partial knowledge, approximations, uncertainties and conjectures.
Demands:
- Objective measurement and classification,
- Robustness and reproducibility,
- Secure against falsifications.

Current Situation
- Knowledge and intuition of the human expert play a central role in daily forensic casework.
- Courtroom forensic testimony is often criticized by defense lawyers as lacking a scientific basis.
- Huge amounts of data, tight operational times, and data linkage pose challenges.
Computational Forensics, aka applying Artificial Intelligence methodologies in Forensic Sciences

Strengthening Forensic Science in the United States: A Path Forward
Committee on Identifying the Needs of the Forensic Sciences Community, National Research Council
ISBN: 0-309-13131-6, 352 pages, 6 x 9, (2009)
This PDF is available from the National Academies Press at:

Computational vs. Computer (Digital) Forensics
Computational Forensics uses computational sciences to study any type of evidence:
- Computer forensics
- Crime Scene Investigation
- Forensic paleography
- Forensic anthropology
- Forensic chemistry
Computer Forensics studies digital evidence:
- File-system forensics
- Live-system forensics
- Mobile-device forensics, etc.

Examples of Ongoing Research I
- Anthropology by Balleri et al. 2007
- Crime Scene Reconstruction by Gryz et al. 2007

Examples of Ongoing Research II
- Reconstruction of Shredded and Ripped-Up Documents by Ukovich et al. 2007, de Smet 2007, and Chanda et al. 2010.
- Forensic Odontology by Nomir et al. 2007


Digital Evidence Sources

Computer, Network and Internet

Mobile Devices and Phones
Surveillance Cameras
Biometrics and Electronic
Multimedia Forensics:
Image, Video and Music Files

Examples of Digital Evidence I
Traces on storage media:
- Undeleted (renamed) files, deleted files
- Windows registry, log files
- Print spool files, browser caches
- Temp files (all those .TMP files!)
- Swap files
- Alternate partitions
- Removable media (floppies, ZIP, Jaz, tapes, ...)
Content:
- Digital images, videos, audio files
- Text documents, notes, emails, chat
- Documents (e.g., GPS location, MACtimes)
- Registries, log files
- Bomb-making diagrams
- Malicious software (e.g. viruses, worms)
- Evidence by network connections made
- Cell phone SMS messages
Devices:
- Computers, PDAs, cellular phones, SIM / Smart Card, videogame consoles, copy machines, printers, cameras, electronic pen-tablets

Examples of Digital Evidence II
Wireless telephones:
- Numbers called
- Incoming calls
- Voice mail access numbers
- Debit/credit card numbers
- Email addresses
- Call forwarding numbers
Landline telephones & answering machines:
- Incoming/outgoing messages
- Numbers called
- Incoming call info
- Access codes for voice mail systems
- Contact lists
PDAs/Smart Phones:
- Above, plus contacts, maps, pictures, passwords
Copiers:
- Especially digital copiers, which may store entire copy jobs
Operating systems, database systems, networks, middleware, wireless systems, firewalls, biometrics

Evidence Integrity
- Preservation of evidence in its original form
- Methods and tools to ensure that evidence is not altered (willingly or accidentally)
- Examples: write blocker, hashes (MD5)
- Note: in live forensics the evidence changes during acquisition
- Consult the Order of Volatility (OOV) list to assess what to capture, and capture the most volatile data first

Chain of Custody
- Documentation of evidence acquisition, control, analysis and disposal (both physical and electronic)
- Use lists, notes, reports, timestamps and hash values, kept up to date and controlled throughout the investigative process
- A slip might be costly in a court of law; one might speculate that the evidence is faulty, or question its origin

* Forensic investigation process as defined by NIST
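
The hash-based integrity check named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a validated forensic tool: the function names are hypothetical, and SHA-256 is added alongside MD5 as an assumption, since MD5 alone is no longer collision resistant.

```python
import hashlib

def digest_file(path, algorithms=("md5", "sha256"), chunk_size=1 << 20):
    """Return {algorithm: hex digest} for the file at path, read in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use flat for large disk images.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

def verify_integrity(path, recorded):
    """True if every digest recorded at acquisition still matches the file."""
    current = digest_file(path, algorithms=tuple(recorded))
    return all(current[name] == value.lower() for name, value in recorded.items())
```

The digests returned by `digest_file` at acquisition time would be noted in the chain-of-custody record; `verify_integrity` can then be re-run before each analysis step.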

Overview of Forensic Techniques
- Post-mortem analysis: file system, registry, event logs, recovery of deleted files
- Live analysis (volatile data): system date and time, running processes, network connections, users logged on, open files, full memory dump
- Network analysis: traffic analysis
- API analysis: API commands, data processed
- Data enrichment from the Internet
- Forensic readiness

Selected Forensic Tools
- EnCase by Guidance Software: Windows suite of forensic tools, quasi-standard.
- Forensic Toolkit (FTK) by AccessData: court-validated investigator platform for forensic analysis, incl. decryption and password cracking capabilities; popular alternative to the EnCase suite.
- Autopsy & The Sleuth Kit: open source; Autopsy is a graphical interface for The Sleuth Kit (TSK) command-line tools; both run on UNIX platforms, and on Windows via Cygwin.
- Oxygen Forensic Suite by Oxygen Software: mobile forensic software, "smart forensics for smart phones".
- COFEE by Microsoft: a useful tool for basic forensics.

Forensics Tasks vs Problem Areas
Tasks accomplished (examples):
- Reveal evidence that puts a person at a keyboard at a specific time,
- Recover deleted files,
- Discover when files were created, modified, deleted, applications run and installed, websites visited,
- Reassemble fragmented parts of images and other files.
Problem areas:
- Damaged hardware: device is physically destroyed,
- Securely overwritten: tools are used to destroy all the binary data on the disk,
- Encrypted devices: unless the encryption key can be obtained.

Forensic-tool Testing
Background / Motivation
- US court rulings in Frye v. United States (1923) and Daubert v. Merrell Dow Pharmaceuticals, Inc. (US Supreme Court, 1993)
Daubert Criteria
- Has the method in question undergone empirical testing?
- Has the method been subjected to peer review?
- Does the method have any known or potential error rate?
- Do standards exist for the control of the technique's operation?
- Has the method received general acceptance in the relevant scientific community?
NIST Computer Forensics Tool Testing (CFTT)
Scientific Working Group on Digital Evidence (SWGDE)
- 2009-01-15 SWGDE Recommendations for Validation Testing, Version 1.1
IEEE Standard 829 - Standard for Software Test Documentation: 1983 version superseded by 1998 version.

Digital Forensics Ontology
TR, June 2010, on behalf of armasuisse
Jarle Kittelsen, Katrin Franke, Bernhard Hämmerli
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College

- Comprehensive overview of the main topics and concepts
- Updated framework ontology for the domain of digital forensics
- Attempt to map some of the existing relations between these concepts
- Intended as a seed for further definition
- Over time, to serve as a common reference and define a common vocabulary

Previous Work
Significant contributions
Forensics wiki.
Ashley Brinson, Abigail Robinson, and Marcus Rogers.
A cyber forensics ontology: Creating a new approach to
studying cyber forensics. In The Proceedings of the 6th
Annual Digital Forensic Research Workshop (DFRWS 06),
volume 3, 2006.
David Christopher Harrill and Richard P. Mislan.
A small scale digital device forensics ontology. Small scale
digital device forensics journal, 1, 2007.

Total Number of References: 31

Developed Ontology two layers expanded

Main Concepts I
Digital Evidence
Physical Evidence
Digital Forensic Methods:
- Data Duplication
- Image Analysis
- Audio Analysis
- Document Analysis
- File Analysis
- Network Analysis
- Data Reduction
- Data Recovery
- Data Analysis
Digital Forensic Tools
Counter-Forensics:
- Proxies
- Storageless devices
- Secure Deletion
- Data Tampering
Digital Forensic Crime Cases:
- Cyber Crime Cases
- Traditional Crime Cases

Main Concepts II
Digital Forensic Process:
- Preparation
- Identification
- Approach Strategy
- Preservation
- Collection
- Examination / Analysis
- Returning evidence
Professions:
- Law
- Academia
- Military
- Private sector
Legal Aspects
Terminology

Future Work
- Mapping / linking all the relations that exist across classes
- Represent the digital forensics ontology in machine-readable form
  - Usage of the Web Ontology Language (OWL): World Wide Web Consortium (W3C), OWL Web Ontology Language
- Recommender / expert systems to support evidence collection and analysis


Computational Forensics:
Admission of Artificial Intelligence
Methodologies in Forensic Sciences

Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College

Requirement of Adapted Computer Models & Operators
Imprecision, Uncertainty, Partial Truth
Computational Intelligence:
- NN: Neural Networks
- FL: Fuzzy Logic
- EC: Evolutionary Computation
Natural Evolution

Computational Methods
- Signal / Image Processing: one-dimensional signals and two-dimensional images are transformed for better human or machine processing,
- Computer Vision: images are automatically recognized to identify objects,
- Computer Graphics / Data Visualization: two-dimensional images or three-dimensional scenes are synthesized from multi-dimensional data for better human understanding,
- Statistical Pattern Recognition: abstract measurements are classified as belonging to one or more classes, e.g., whether a sample belongs to a known class and with what probability,
- Machine Learning: a mathematical model is learnt from examples,
- Data Mining: large volumes of data are processed to discover nuggets of information, e.g., presence of associations, number of clusters, outliers, etc.,
- Robotics: human movements are replicated by a machine.

Computational Forensics
Study and development of computational methods to
- assist in basic and applied research, e.g. to establish or prove the scientific basis of a particular investigative procedure,
- support the forensic examiner in their daily casework.
Modern crime investigation shall profit from the hybrid intelligence of humans and machines.

Computational Forensics (CF)
CF is understood as the hypothesis-driven investigation of a specific forensic problem using computers, with the primary goal of discovery and advancement of forensic knowledge.
CF works towards:
1) In-depth understanding of a forensic discipline,
2) Evaluation of a particular scientific method basis, and
3) Systematic approach to forensic sciences by applying techniques of computer science, applied mathematics and statistics.
It involves modeling and computer simulation (synthesis) and/or computer-based analysis and recognition.

Admission of
Computational Forensics
1. Need of Automatization, Standardization, and Benchmarking
2. Need of Education, Joint Research, and Development by Forensic and Computer Scientists
3. Need of Legal Framework

Automatization, Standardization, and Benchmarking
- Increase efficiency and effectiveness
- Perform method / tool testing regarding their strengths/weaknesses and their likelihood ratio (error rate)
- Gather, manage and extrapolate data, and synthesize new data sets on demand
- Establish and implement standards for data, work procedures and journal processes
Fulfillment of Daubert Criteria

Joint Research & Development: Forensic and Computer Scientists
- Education and training
- Revealing the state of the art in *each* domain
- Sources of information on events, activities and financing
- International forum to peer-review and exchange, e.g., IWCF workshops
- Performance evaluation, benchmarking, proof and standardization of algorithms
- Resources in the form of data sets, software tools, and specifications, e.g. data formats
- New insights on problem description and procedures

Legal Framework ?!
- Questions on methods for dimensionality reduction: loss of relevant information
- Questions on extracted numerical parameters: loss of information due to inappropriate features
- Questions on the reliability of the applied computational method / tool
- Questions on the final conclusion due to wrong computational results

Educational Information
CompFor courses and study programs
- Gjøvik University College, NO (Master, PhD)
- Uni. of Amsterdam, NL (Master)
- TU Kaiserslautern, DE (Master, announced)
Article in Wikipedia *
Brief Tutorial and Overview Article
Links to relevant data collections *
* To be extended!

5th International Workshop on Computational Forensics
Gjøvik, Norway, March 15-16, 2012


Computational Methods:
Machine Learning and Data Mining

Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College

Machine Learning, Data Mining and Artificial Intelligence

"If an expert system - brilliantly designed, engineered and implemented - cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten."
Oliver G. Selfridge, from The Gardens of Learning.

"Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever."
Oliver G. Selfridge, in AI's Greatest Trends and Controversies.

General ML Approach
- Data Collection: large sample of data of how humans perform the task
- Model Selection: settle on a parametric statistical model of the process
- Parameter Estimation: calculate parameter values by inspecting the data
- Search: using the learned model, find the optimal solution to the given problem
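
The four steps of the general ML approach can be sketched with a deliberately small example. The model choice (one 1-D Gaussian per class), the function names and the toy data below are assumptions made for illustration only; they do not come from the slides.

```python
import math
from collections import defaultdict

def estimate_parameters(samples):
    """Parameter estimation: per-class mean and variance of a 1-D feature."""
    grouped = defaultdict(list)
    for value, label in samples:
        grouped[label].append(value)
    params = {}
    for label, values in grouped.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        params[label] = (mean, max(var, 1e-9))  # guard against zero variance
    return params

def log_likelihood(value, mean, var):
    """Log density of a Gaussian (the selected parametric model)."""
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)

def classify(value, params):
    """Search: pick the class whose model explains the value best."""
    return max(params, key=lambda label: log_likelihood(value, *params[label]))

# Data collection: toy (feature value, label) pairs, invented for illustration.
training = [(1.0, "A"), (1.2, "A"), (0.8, "A"), (3.0, "B"), (3.3, "B"), (2.7, "B")]
model = estimate_parameters(training)
```

Calling `classify(1.1, model)` then searches the two class models for the better explanation of the new measurement.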

Example Problem: Handwritten Digit Recognition
[Figure: wide variability of the same numeral]
- Handcrafted rules will result in a large number of rules and exceptions
- Better to have a machine that learns from a large training set

Role of Machine Learning
Principled way of building high performance information
processing systems

Machine Learning vs Pattern Recognition

Machine Learning has origins in Computer Science
Pattern Recognition has origins in Engineering
They are different facets of the same field

Language Related Technologies

Information Retrieval, Natural Language Processing, Automatic
Speech Recognition
Humans perform them well
Difficult to specify algorithmically

Pattern Recognition
- Supervised / unsupervised classification of patterns by means of computer technology
- Small intraclass and large interclass variation
- Pattern: "as opposite of a chaos; it is an entity, vaguely defined, that could be given a name" (Watanabe 1985)

Pattern Classification
- Supervised Classification: classes pre-defined by the system designer (Machine Learning)
- Unsupervised Classification: classes learned based on the similarity of patterns (Data Mining)

Representation of Pattern Characteristics
- Attribute/Feature Vector: number of corners, size, label
- Feature extraction and selection by using training patterns
- Cross-validation by using test patterns
[Figure: patterns A, B, C represented as Feature Vector 1, Feature Vector 2, Feature Vector 3]

Pattern Representation and Classification Example
[Figure: five example patterns encoded step by step as feature vectors (number of corners, size, label); Feature Vector 3 row: 14A, 24C, 16A, 16B, 20B]

Classifier Training, or How Do Computers Learn?
Learning by example!
- Representative sample data
- Appropriate features
- Class discrimination
- Avoid overlearning (overfitting)

Best-known Approaches for Pattern Recognition
- Template Matching
- Syntactical or Structural PR
- Statistical PR
- Neural Networks

Model for Pattern Recognition
Classification: pattern -> Preprocessing -> Feature Measurement -> Classification
Training: training pattern -> Preprocessing -> Feature Extraction / Selection -> Learning

Recognition Methods in Numbers
- 9 Feature Extraction and Projection Methods
- 7 Feature Selection Methods
- 7 Learning Algorithms
- 14 Classification Methods
- 18 Classifier Combination Schemes
Statistical Pattern Recognition: A Review, A.K. Jain, R.P.W. Duin and J. Mao, 2000, PAMI
Note that biologically inspired methods come in addition.

Three Application Examples

Approximate String Search
Feature Selections for IDS
Malware-Abnormality Detection

Challenges in Cybercrime
- Rapidly growing computer storage capacities and data volumes
- Increasing complexities in computer systems
- Hidden evidence caused by, e.g., obfuscation
- Architectural limitations in many digital forensic tools leave much of the work to the forensic analyst and his/her workstation, which is costly, time- and resource-consuming work
- Independent analysis of machines, in cases where multiple parties or systems are involved in the same crime, can lead to loss of essential evidence that is only visible when correlated

Towards a Generic Feature-Selection Measure for Intrusion Detection
Hai Thanh Nguyen, Katrin Franke and Slobodan Petrović
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College


Our Research Focus
1. Generalization of several feature selection measures.
2. Optimization to derive globally optimal feature subsets.
Considering the CFS measure (Hall, 1999) and the mRMR measure (Peng, 2005) for intrusion detection because:
- Filter methods are usually used to select features from high-dimensional data sets, such as those of intrusion detection systems.
- Relevance of features and relationship between features are considered.
- The relevance and relationship are usually characterized in terms of correlation (CFS) or mutual information (mRMR).

CFS and mRMR Feature Selection
Correlation feature-selection (CFS) measure
- Class-feature correlation
- Feature-feature correlation
M. Hall. Correlation Based Feature Selection for Machine Learning. Doctoral Dissertation, University of Waikato, Department of Comp. Science, 1999.
Feature-selection measure based on mutual information (mRMR)
- Class-feature mutual information
- Feature-feature mutual information
H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on PAMI, Vol. 27, No. 8, pp. 1226-1238, 2005.
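
For reference, the two measures are commonly written as below. The notation is an assumption chosen here (S is a feature subset with k features, c is the class, r denotes correlation and I mutual information); the slides themselves only name the ingredients.

```latex
% CFS merit (Hall, 1999): average class-feature correlation in the numerator,
% average feature-feature correlation penalised in the denominator
\mathrm{Merit}_S(k) = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}

% mRMR criterion (Peng et al., 2005): maximise relevance minus redundancy
\max_{S}\left[\frac{1}{|S|}\sum_{f_i \in S} I(f_i; c)
  \;-\; \frac{1}{|S|^{2}}\sum_{f_i, f_j \in S} I(f_i; f_j)\right]
```

Both reward features that are informative about the class and penalise features that are redundant with each other, which is what makes a common generalization plausible.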

Generic Feature Selection (GeFS)
Question: Can the CFS measure and the mRMR measure be fused and generalized into a generic feature selection measure?
Definition 1: A generic feature selection (GeFS) measure is defined as follows:
[definition given as an equation on the slide]
Proposition 1: The CFS and the mRMR measures are instances of the GeFS measure.
Proposition 2: The feature selection by means of the GeFS measure is a polynomial mixed 0-1 fractional programming (PM01FP) problem.
How to solve it?

Search Procedures
Exhaustive search
- Globally optimal feature subsets
- Slow, with complexity exponential in the number of features
- Examples: exhaustive search, breadth-first search, depth-first search, iterative deepening, branch-and-bound
Heuristic search
- Locally optimal feature subsets
- Faster than exhaustive search
- Examples: greedy search, gradient search, simulated annealing, genetic algorithms
Can we find globally optimal feature subsets yet avoid standard exhaustive search?
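
The contrast between the two search families can be made concrete with a small sketch. It scores subsets with the CFS merit from precomputed correlations and compares brute-force enumeration with greedy forward selection. The function names and the toy correlation values are invented for illustration; this is not the optimization method proposed in the slides.

```python
from itertools import combinations
from math import sqrt

def cfs_merit(subset, r_cf, r_ff):
    """CFS merit of a feature subset from precomputed correlations."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(r_cf[i] for i in subset) / k
    pairs = list(combinations(subset, 2))
    mean_ff = sum(r_ff[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)

def exhaustive_search(n, score):
    """Globally optimal subset: evaluates all 2^n - 1 non-empty subsets."""
    candidates = (c for k in range(1, n + 1) for c in combinations(range(n), k))
    return max(candidates, key=score)

def greedy_forward_search(n, score):
    """Locally optimal subset: add the best single feature until no gain."""
    selected, best = (), 0.0
    while True:
        options = [tuple(sorted(selected + (i,))) for i in range(n) if i not in selected]
        if not options:
            return selected
        candidate = max(options, key=score)
        if score(candidate) <= best:
            return selected
        selected, best = candidate, score(candidate)

# Toy correlations, invented for illustration: feature 0 is individually the
# strongest but highly redundant with features 1 and 2.
r_cf = [0.5, 0.45, 0.45]
r_ff = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.0], [0.9, 0.0, 1.0]]

def score(subset):
    return cfs_merit(subset, r_cf, r_ff)
```

On this toy input the greedy search stops at feature 0 alone, while the exhaustive search finds the better pair of features 1 and 2, which illustrates the local versus global optimum distinction on the slide.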

Problem Transformation
Chang's method for solving PM01FP
- Linearizing the PM01FP problem into a mixed 0-1 linear programming problem (M01LP).
- The number of variables & constraints: n^2
- Branch and Bound algorithm.
Our method for solving PM01FP
- Differently linearizing the PM01FP problem into a mixed 0-1 linear programming problem (M01LP).
- The number of variables & constraints: 4n+1
- Branch and Bound algorithm.
C-T. Chang. On the polynomial mixed 0-1 fractional programming problems, European Journal of Operational Research, vol. 131, issue 1, pages 224-227, 2001.

Experimental Setting
- Data: 10% of the overall (5 million instances) KDD CUP99 test data set for intrusion detection systems, which contains normal traffic and 4 attack classes (DoS, Probe, U2R, R2L).
- Consider 4 data subsets of the KDD CUP99:

  Data sets         Number of instances
  Normal & DoS      488,736
  Normal & Probe    138,391
  Normal & U2R      97,330
  Normal & R2L      98,404

- Feature selection methods: Opt-CFS & Opt-mRMR.
- Classifiers: Hierarchical Classifier C4.5 & Bayesian Network.
- Apply 5-fold cross validation.
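
The 5-fold cross-validation step can be sketched as follows. This is a generic illustration in plain Python, not the authors' experimental code; `fit` and `predict` are hypothetical callables standing in for any classifier, and the sample count is assumed to be at least k.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(samples, labels, fit, predict, k=5):
    """Average accuracy of a classifier over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k
```

Each instance is used for testing exactly once and for training in the other four rounds, so the reported accuracy does not depend on a single lucky split.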

Experimental Results
[Figures: Number of Selected Features; Achieved Recognition Performance]
Please see the listed reference publications for further details.
Nguyen, Franke, Petrović, 2009-2010

- Fusing and generalizing the CFS and the mRMR measures into a generic feature selection (GeFS) measure.
- Transforming the feature-selection problem by means of the GeFS into a mixed 0-1 linear programming problem.
- Proposing a new approach that ensures globally optimal feature sets.
- Proposing a new feature-selection method (Opt-GeFS).
- Validating the new method by conducting experiments on the KDD Cup 1999 test IDS data set and on other data sets.
- Our method is domain-independent.

Cross-Computer Malware Detection in Digital Forensics
Anders Orsten Flaglien, Peter Ekstrand Berg, Lars Arne Sand,
Katrin Franke, André Årnes
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College


Distributed Malware, as Botnets
- Botnets utilize malware to establish control over infected machines. Botnet-initiated attacks represent one of the biggest computer threats existing today, and they are heavily used for computer crime (e.g., DDoS attacks) [23, 28, 5]
- The command and control (C&C) architecture of botnets ensures evidence is present in multiple locations
- Patterns of malware used in botnets can be hard to detect, as malware often uses obfuscation techniques. However, in cases of C&C-dependent malware, like botnets, certain patterns are required in order for the bot-master to control bots.

Application Scenario

Applied Method
- Data Collection
- File Metadata Extraction
- Hash Filtering
- Feature Extraction
- Analysis (Link Mining)
- Linking Machines

A method that can assist human analysis in order to improve the decision making and further improve the result of digital forensics.
It aims to improve the efficiency of the time invested in investigation, and the effectiveness of detecting relevant and high-quality evidence.

Features of interest have to be selected that best represent the characteristics of the input data
- File metadata represent most of the [...]
- Content-based features improve knowledge and represent file content
  - Special characteristics that reflect typical malware patterns
  - Strings from regular expressions
- Case-supplied metadata should only be used for tracing a file's origin, not for the clustering task

Examination of input data to create a textual and structured representation
- Feature files are created from selected features, extracted from copied disk images
- ARFF file format is used (built using Fiwalk [16] and Python scripts for additional features)
[Figure: Feature Files for Machine 1 ... n, each listing File Object 1 ... n with Features A, B, C, D, E, F, G, H]
- Machines are represented by individual Feature Files, where each entry is a file object represented as a feature vector with n features
- Hash-based filtering is used to remove noise files and to limit the amount of data to correlate
  - Known system and application files (NSRL RDS dataset, clean system)
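
Hash-based filtering of known files can be sketched as below. The set of known hashes stands in for a reference collection such as the NSRL RDS; the function names and the choice of SHA-1 as the lookup key are assumptions for illustration, and loading a real reference set is left out.

```python
import hashlib

def sha1_of(data):
    """Uppercase hex SHA-1 of a file's content (bytes)."""
    return hashlib.sha1(data).hexdigest().upper()

def filter_known_files(files, known_hashes):
    """Drop file objects whose SHA-1 appears in the known-file hash set.

    files maps a file name to its content; known_hashes is a set of
    uppercase hex digests of known system and application files.
    """
    return {name: data for name, data in files.items()
            if sha1_of(data) not in known_hashes}
```

Only the files that survive this filter need to be turned into feature vectors and passed on to the link mining step.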

Link Mining is performed to identify correlations, and thus identify malware traces
- A dataset is generated from all Feature Files from the machines (separated with case [...])
- A descriptive data mining method, using an unsupervised clustering algorithm
- Clustering provides a group detection; links between multiple machines exist for clusters in which files from those machines are present
- The simple K-means algorithm is used for clustering, with Euclidean distance as proximity measure
- Preprocessing is performed to suit and to optimize the result of the clustering algorithm
  - Removing redundant features
  - Converting features to an appropriate format (nominal and numeric)
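
The clustering and linking idea can be sketched in plain Python. The slides use Weka's simple K-means; the code below is a stand-in written for illustration (Lloyd's algorithm with Euclidean distance), and the function names and inputs are hypothetical.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, assignments) from Lloyd's algorithm."""
    centroids = random.Random(seed).sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: each file object joins its nearest centroid.
        new_assignments = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                           for p in points]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

def linked_machines(assignments, machine_ids):
    """Map each cluster to the set of machines contributing files to it."""
    links = {}
    for cluster, machine in zip(assignments, machine_ids):
        links.setdefault(cluster, set()).add(machine)
    return links
```

A cluster whose machine set contains several machines is a candidate link between them, which is the group detection the slide describes.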

Link Mining Evaluation is performed to measure the results and to find clusters of [...]
- Misinterpretation of the link mining and use of features have to be considered and can be identified through link mining evaluations [24, 33]
- Classified data can be compared to clustering results to reveal uncertainties regarding the integrity of the clustering task, algorithm and features used [33]
- Two-dimensional graphs can be used to examine the involvement of links and features in the created clusters. These indicate variations within (to measure the compactness) and between (to measure the distances between) the K clusters (C) [22]
Figure 1: showing cluster characteristics (clusters C1 to C5)
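
The two quantities mentioned above, variation within a cluster and distance between clusters, can be computed as in the following sketch. The specific definitions (mean distance to the centroid, and centroid-to-centroid distance) are one common choice assumed here; the slides do not state which definitions were used.

```python
import math

def centroid(points):
    """Component-wise mean of a list of numeric feature vectors."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def compactness(points):
    """Mean Euclidean distance of cluster members to their centroid."""
    center = centroid(points)
    return sum(math.dist(p, center) for p in points) / len(points)

def separation(cluster_a, cluster_b):
    """Euclidean distance between two cluster centroids."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))
```

Low compactness values together with large separation values suggest well-formed clusters; the opposite pattern is a hint that features or the choice of K should be revisited.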

[Figure: the practical implementation and involved processing steps for Machines m in Case n]

Multiple experiments were executed to evaluate deLink's efficiency and effectiveness
- Three experiments were executed in a virtual environment:
  - Passive self-developed bot
  - Malware from the wild
  - Spybot v1.3
- The most realistic results were obtained from the experiment using Spybot v1.3
- A network infrastructure with a command & control server, an infection site and 5 victim computers was used
- Online banking attack scenario, where the computers have been infected and taken control over by an adversary bot master to, e.g., execute DDoS attacks
Figure 2: Botnet with C&C server and attack website

Hash-based filtering in the experiments reduced the amount of input data for the link mining to process
- Filtering reduced the number of data objects down to 3%
- Homogeneous experiment machines, also resulting in many correlations
- The effect of perfect hash-based filtering is reflected
- Automatic reduction of the total number of files to be used for the link mining increases the value of the clusters produced by the link mining

Computational Forensics Cluster creation in the experiment grouped
the file objects, according to their
SOM diagrams was used to help estimate the nature of
input dataset (used as input k to k-means, but could also
be used alone for clustering objects)
Three segments were identified in the dataset from the
Malware from the Wild experiment
Figure 3: SOM diagram
Clustered the file objects (with k=3, means)with common
characteristics across all machines (using Weka machine
learning tool)

Machine ID

C1 C2 C3 F o r e n s i cs L a b
Clusters 77
77 Figure 4: Machine files representation in clusters
Computational Forensics Timeline visualization of experiment results,
reflecting one cluster, filtered with the
malicious IP address
Expert knowledge about an incident are available, e.g., the approximated
time when an incident occurred, origin, indications of effect and extent.

Figure 5: One cluster with files, having suspected IP, timelined (axes: Machine ID vs. access time)
T0: IE history files accessed, T1: Infection file accessed, T2: Additional infection files accessed
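
A rough sketch of how such a timeline could be derived: select the file objects of one cluster that contain the suspected IP address and order them by access time. The record layout, file names, timestamps and IP address are all hypothetical.

```python
from datetime import datetime

# Hypothetical records: (machine_id, path, cluster, access_time, ips_found_in_file)
records = [
    ("M1", "index.dat", 2, datetime(2010, 5, 1, 10, 0), {"10.0.0.66"}),
    ("M2", "bot.exe", 2, datetime(2010, 5, 1, 10, 5), {"10.0.0.66"}),
    ("M1", "report.doc", 2, datetime(2010, 5, 1, 9, 0), set()),
    ("M3", "extra.dll", 1, datetime(2010, 5, 1, 10, 7), {"10.0.0.66"}),
]

def timeline(records, cluster, suspect_ip):
    """Files of one cluster that mention the suspect IP, sorted by access time."""
    hits = [r for r in records if r[2] == cluster and suspect_ip in r[4]]
    return sorted(hits, key=lambda r: r[3])

events = timeline(records, cluster=2, suspect_ip="10.0.0.66")
for machine, path, _, accessed, _ in events:
    print(accessed.isoformat(), machine, path)
```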
Efficient techniques to decrease the initially large data set

Clustering further separated and grouped common file objects
Ability to detect common file objects between the machines increased the understanding of the large data set
Correlations further improved the knowledge of what files could be associated with malware and thereby an incident
Improving the efficiency and effectiveness of digital forensics by using mining techniques

Concluding Remarks
Computational forensics holds the
potential to greatly benefit all of the
forensic sciences.
For the computer scientist it poses a new
frontier where new problems and challenges
are to be faced.
The potential benefits to society, meaningful
inter-disciplinary research, and
challenging problems should attract
high-quality students and researchers to the field.
Further Reading
NAS Report: Strengthening Forensic Science in the United States: A Path Forward
van der Steen, M., Blom, M.: A roadmap for future forensic research. Technical report, Netherlands Forensic Institute
(NFI), The Hague, The Netherlands (2007)
M. Saks and J. Koehler. The coming paradigm shift in forensic identification science. Science, 309:892-895, 2005.
United States v. Starzecpyzel. 880 F. Supp. 1027 (S.D.N.Y.), 1995.
C. Aitken and F. Taroni. Statistics and the Evaluation of Evidence for Forensic Scientists. Wiley, 2nd edition, 2005.
K. Foster and P. Huber. Judging Science. MIT Press, 1999.
Franke, K., Srihari, S.N. (2008). Computational Forensics: An Overview, in Computational Forensics - IWCF 2008, LNCS
5158, Srihari, S., Franke, K. (Eds.), Springer Verlag, pp. 1-10.
Nguyen, H., Franke, K., Petrovic, S. (2010). Towards a Generic Feature-Selection Measure for Intrusion Detection, In
Proc. International Conference on Pattern Recognition (ICPR), Turkey.
Nguyen, H., Petrovic, S., Franke, K. (2010). A Comparison of Feature-Selection Methods for Intrusion Detection, In
Proceedings of Fifth International Conference on Mathematical Methods, Models, and Architectures for Computer
Networks Security (MMM-ACNS), St.Petersburg, Russia, September 8-11. (accepted for publication)
Nguyen, H., Franke, K., Petrovic, S. (2010). Improving Effectiveness of Intrusion Detection by Correlation Feature
Selection, International Conference on Availability, Reliability and Security (ARES), Krakow, Poland, pp. 17-24.
Flagien, A.O., Arnes, A., Franke, K. (2010). Cross-Computer Malware Detection in Digital Forensics, Techn. Report,
Gjøvik University College, June 2010.


Thank you for your

consideration of comments!
Getting in touch
Skype/gTalk: kyfranke


Improving the Efficiency of Digital Forensic Search by Means of Constrained Edit Distance
Slobodan Petrovic, Katrin Franke
Norwegian Information Security Laboratory (NISlab)
Gjøvik University College

Model for Pattern Recognition

[Diagram: a pattern passes through Preprocessing and Feature Extraction to Classification / Learning]


Huge search space (size can be terabytes of data)

Distorted strings (intentionally / unintentionally)
Inserting 1...N characters

Consecutive insertions and deletions




Efficient, non-exhaustive procedure

Approximate search
No predefined distortions

... Buy replica POLEC and fake ROLEKS watches online from our online Swiss replica watch store and save money. Find, compare and buy! ...

Applied Method


N ... replica POLEC and fake ROLEKS watches ...


M - length of the search string
N - length of the data fragment
frag_i - the specific fragment
d_i - edit distance of frag_i and the search string
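
The notation above can be illustrated with a short sketch that cuts the data into fragments of length N and computes d_i for each frag_i. It uses the plain, unconstrained edit distance. Cutting the data into consecutive, non-overlapping fragments is an assumption made here for simplicity, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Plain edit distance with unit costs (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def fragment_distances(data, search, n):
    """Yield (i, frag_i, d_i) for consecutive fragments of length n.

    The last fragment may be shorter than n.
    """
    for i, start in enumerate(range(0, len(data), n)):
        frag = data[start:start + n]
        yield i, frag, levenshtein(search, frag)

data = "replica POLEC and fake ROLEKS watches"
results = list(fragment_distances(data, "ROLEX", n=8))
for i, frag, d in results:
    print(i, repr(frag), d)
```

With non-overlapping fragments a distorted string can be split across a fragment boundary, as happens to ROLEKS here. Overlapping fragments would avoid that at the cost of more distance computations.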

Edit Distance Revisited
Definition: The edit distance between two words (strings) is
the minimal number of edit operations (insertions, deletions,
or substitutions) that must be performed to convert one word
into the other.
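
For unit costs, this definition corresponds to the standard Wagner-Fischer recurrence for the Levenshtein distance. The symbols x, y and D are introduced here for illustration and do not appear on the slide: x and y are the two strings, and D(i, j) is the distance between the first i characters of x and the first j characters of y.

```latex
D(i,0) = i, \qquad D(0,j) = j,
\qquad
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(deletion)} \\
D(i,j-1) + 1 & \text{(insertion)} \\
D(i-1,j-1) + [x_i \neq y_j] & \text{(substitution or match)}
\end{cases}
\qquad
d(x,y) = D(|x|,|y|)
```

For example, ROLEX and ROLEKS are at distance 2: substitute X by K, then insert S.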



[Figure: example string pairs with d_i = 1, d_i = 2, d_i = 3, d_i = 2]

Many definitions and implementations of this metric exist, e.g.,
Hamming distance
Levenshtein distance
Damerau-Levenshtein distance
Jaro-Winkler distance
Wagner-Fischer edit distance
Ukkonen
Hirschberg
Constrained Edit Distance
F - the maximum number of consecutive deletions
G - the maximum number of consecutive insertions
[Figure: search string ROLEX (length M) and its distance d_i to fragments containing R ... O ... L ... E ... X and P ... O ... L ... E ... C]
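
One possible reading of the constraints F and G can be sketched as follows: the edit sequence that turns the search string into the fragment may contain at most F deletions in a row and at most G insertions in a row. This is an illustrative interpretation with unit costs. The recurrence in the publication may differ, and the function name is hypothetical.

```python
from functools import lru_cache

def constrained_edit_distance(x, y, F, G):
    """Minimal cost to turn x into y with runs of at most F deletions
    and at most G insertions; returns infinity if no such edit sequence exists."""
    INF = float("inf")

    @lru_cache(maxsize=None)
    def best(i, j, del_run, ins_run):
        # Cost of aligning x[i:] with y[j:], given the current run lengths
        if i == len(x) and j == len(y):
            return 0
        cost = INF
        if i < len(x) and j < len(y):
            # Match or substitution ends any run
            cost = min(cost, (x[i] != y[j]) + best(i + 1, j + 1, 0, 0))
        if i < len(x) and del_run < F:
            # Delete x[i], extending the deletion run
            cost = min(cost, 1 + best(i + 1, j, del_run + 1, 0))
        if j < len(y) and ins_run < G:
            # Insert y[j], extending the insertion run
            cost = min(cost, 1 + best(i, j + 1, 0, ins_run + 1))
        return cost

    return best(0, 0, 0, 0)

print(constrained_edit_distance("ROLEX", "ROLEKS", F=1, G=1))
print(constrained_edit_distance("AB", "AXXXB", F=5, G=2))
```

With F = G = 0 only substitutions remain, so the measure reduces to the Hamming distance for strings of equal length. The recursive form is meant for short strings such as search terms.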


Results I

R = f(N, F)
Dependency of the data-set reduction R on the length of the fragments (N) for different values of consecutive deletions (F).

Constant restrictiveness:
IF (d_i ≤ (N − M + …)) THEN (accept(frag_i))
M - length of the search string
frag_i - the specific fragment
d_i - edit distance of frag_i
Two-stage search procedure
Constrained edit distance for pre-selection (1st phase)
Enables rejection of fragments in which the detected string is too distorted
Level of tolerance is controlled by means of the values of the constraints

Detailed algorithm and simplification are described in the publication
Experimental results prove the efficiency of the proposed method
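
The pre-selection phase can be sketched as a simple threshold test on already computed distances. The parameter `tolerance` is a hypothetical stand-in for the restrictiveness term of the acceptance rule. The definition of the reduction as the share of rejected fragments is likewise an assumption.

```python
def preselect(fragment_distances, m, tolerance):
    """First phase: keep fragments whose distance d_i is close to the
    unavoidable minimum N - M (N = fragment length, M = search string length).

    fragment_distances: list of (frag_i, d_i) pairs.
    tolerance: assumed extra distance allowed above N - M.
    """
    return [frag for frag, d in fragment_distances
            if d <= len(frag) - m + tolerance]

def reduction(total, accepted):
    """Assumed definition: share of fragments rejected by the pre-selection."""
    return 1 - len(accepted) / total

# Plain edit distances to the search string "ROLEX" (M = 5), fragments with N = 8
candidates = [("replica ", 8), ("POLEC an", 5), ("ROLEKS w", 4)]

accepted = preselect(candidates, m=5, tolerance=1)
print(accepted, reduction(len(candidates), accepted))
```

A second, more detailed search phase would then only need to examine the accepted fragments.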