
2012 International Conference on Innovations in Information Technology (IIT)

MOMENTUM: MetamOrphic Malware Exploration Techniques Using MSA signatures


Vinod P.1, V. Laxmi2, M. S. Gaur3, Grijesh Chauhan4

Department of Computer Engineering, Malaviya National Institute of Technology Jaipur, Rajasthan, India 302017 {vinodp1|vlaxmi2 |gaurms3 }@mnit.ac.in, grijesh.mnit@gmail.com

Abstract: Modern malware that is metamorphic or polymorphic in nature mutates its code by employing code obfuscation and encryption methods to thwart detection. Thus, conventional signature-based scanners fail to detect such malware. To address the problem of detecting known variants of metamorphic malware, we propose a method using bioinformatics techniques that are used effectively for protein and DNA matching. Instead of using exact signature matching methods, more sophisticated signature(s) are extracted using multiple sequence alignment (MSA). The results show that the proposed method is capable of identifying malware variants with few false alarms and misses. Also, the detection rate achieved with our proposed method compares favourably with the commercial antivirus products used in the study.
Keywords: metamorphic malware, sequence alignment, bioinformatics, multiple sequence alignment

I. INTRODUCTION
Metamorphic malware mutates its code on each replication while preserving the functionality of the program. The code is morphed by a small mutation engine, also known as a metamorphic engine. Such malware uses different obfuscation techniques to evade conventional signature-based scanners that employ exact string matching methods. The metamorphic engine is designed to be small in order to bypass detection [3]. Structural transformation of the malicious code is introduced with a limited set of instruction replacements, as a total change of the malicious code is impossible (it would alter the functionality of the variant). Detection of metamorphic malware is still possible because its variants are less diverse than benign samples. The primary reason is that, in order to preserve maliciousness, some generic code is embedded which cannot be transformed to a large extent.

DNA/protein sequences mutate from one generation to another, inheriting some functional and structural similarity from their ancestors. In this work, it is assumed that metamorphic malware, like a DNA/protein sequence, transforms its code through modifications in the opcode sequence. The mismatches in the opcode sequence from one generation to another may be considered as points of mutation. Thus, the exact string matching technique would fail to detect new malware variants. A bioinformatics sequence alignment method is therefore used in this research work to align the opcode sequences based on their evolutionary relationship. Motivated by bioinformatics methods, the objective of this article is to detect metamorphic malware. Using the sequence alignment method, three types of signature(s), known as (a) single, (b) group and (c) probabilistic signatures, are constructed for each malware family. Detection of unseen malware samples is verified with the generated signature(s).

II. RELATED WORK
The authors of [7] and [8] created a rewriting engine for detecting morphed malware variants. The analysis of malware variants is based on the syntactic as well as the semantic structure of a program. Kruegel et al. [9] proposed a method based on the CFG generated for worms, which describes a fingerprint for a worm. Their system is found to be resilient against common code transformation techniques. The authors of [10] proposed a novel method for analyzing malware based on a code graph. Each malware executable was inspected and the instructions corresponding to the system call sequence were represented in the form of a topological graph. The authors of [4] proposed a semantic-based approach for detecting variants of malware. This method is based on the functionality of the system calls executed by malware samples. The main focus is to identify all instructions, and their parameters, that are used for invoking a system call.
In [5], Hidden Markov Models (HMMs) were used to represent statistical properties of a set of metamorphic virus variants. Malware data set was generated from metamorphic engines: Second Generation virus generator (G2), Next Generation Virus Construction Kit (NGVCK), Virus Creation Lab for Win32 (VCL32) and Mass Code

Figure 1. Metamorphic malware variants using obfuscation. Embedded with metamorphic engine.

978-1-4673-1101-4/12/$31.00 2012 IEEE


Generator (MPCGEN). Also, the authors of [16] found that an HMM-based detector can be defeated if the malware variants are generated by inserting garbage instructions extracted from subroutines of benign files. The authors of [1] proposed a phylogeny model for identifying malware. They proposed the n-gram feature extraction technique and applied a fixed permutation to the code to generate new sequences, called n-perms.

III. BIOINFORMATICS SEQUENCE ALIGNMENT METHODS
Sequence alignment is an elementary method used in any biological study to compare two or more biological sequences (protein or DNA). The alignment methods attempt to identify regions of high similarity. This is performed to infer the evolutionary relationship between sequences. Metamorphic malware, like proteins or nucleotides, has segments of code which are inherited from its base malware. It can be assumed that, just as a DNA sequence characterizes a biological species, an opcode sequence characterizes malware and benign executables. Each sample is represented as a sequence of mnemonics without considering the operands. Initially, the approach might appear to be trivial, but metamorphic malware variants cannot undergo total transformation. Our assumption is that some opcode(s) may be replaced with equivalent opcode(s), but a complete change is impossible if the functional equivalence of variants in a family is to be maintained. Thus, it can be inferred that variants preserve some base malicious code which is transformed by the engine to produce new variant(s). Therefore, using sequence alignment techniques, opcode sequences are arranged (a) to determine the similarity amongst malware samples and (b) to explore frequently occurring patterns in a family of malware. These patterns depict maliciousness. Approaches to sequence alignment can be broadly categorized as (a) global sequence alignment, (b) local sequence alignment and (c) Multiple Sequence Alignment (MSA).

A. Global Alignment
Global alignment is used to align opcode sequences end to end. A well known method for global alignment is Needleman-Wunsch [6]. In our proposed method, we have used the Needleman-Wunsch global alignment method for aligning the opcode sequences of malware samples.

1) Needleman-Wunsch Method: The Needleman-Wunsch method determines the globally optimal alignment between two sequences X and Y. The basic steps used in aligning opcode sequences are as follows.
Initialization: In this phase, a score matrix S and a trace back matrix T, each of size (M + 1) x (N + 1), are created, where M and N are the lengths of the two sequences.
Populate Score Matrix: The score of each cell S(i, j) is determined by the scores of neighboring three cells

i.e., (top, diagonal and left). In addition to populating the score matrix, the trace back matrix is filled with one of three directions: left (L), diagonal (D) and up (U). Each entry of the trace back matrix records which of the neighbouring cells yielded the maximum value in the score matrix. Thus, S(i, j) is computed as follows:

$$S(i, 0) = i\,\delta, \qquad S(0, j) = j\,\delta$$
$$S(i, j) = \max\bigl( S(i-1, j-1) + \sigma(X[i], Y[j]),\; S(i-1, j) + \delta,\; S(i, j-1) + \delta \bigr) \tag{1}$$

where $\sigma(X[i], Y[j])$ is the match/mismatch score when the characters X[i] and Y[j] are aligned and $\delta$ is the gap penalty.
Traceback: The trace back matrix is read starting at its bottom-right cell, T(M, N) under the zero-based indexing used above, until the first row or column is encountered. Each cell with direction D depicts a match or mismatch between two aligned opcodes, and cells with directions L or U depict a gap introduced in one of the sequences.
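The initialization, score-matrix and traceback steps can be sketched in Python as follows. This is only an illustrative sketch: the scoring values (match = 1, mismatch = -1, gap = -1), the gap symbol "-" and the function name are assumptions and are not taken from the paper. For convenience the traceback here continues along the first row or column until cell (0, 0) is reached, so that both aligned sequences are returned in full.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap symbol, not specified in the paper


def needleman_wunsch(x: List[str], y: List[str],
                     match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[int, List[str], List[str]]:
    """Globally align two opcode sequences; return (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # Initialization: score and trace back matrices of size (M + 1) x (N + 1).
    score = [[0] * (n + 1) for _ in range(m + 1)]
    trace = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
        trace[i][0] = "U"
    for j in range(1, n + 1):
        score[0][j] = j * gap
        trace[0][j] = "L"
    # Populate: each cell takes the best of its diagonal, upper and left neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            diag = score[i - 1][j - 1] + s
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            trace[i][j] = "D" if best == diag else ("U" if best == up else "L")
    # Traceback: walk from the bottom-right cell back to the origin.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        direction = trace[i][j]
        if direction == "D":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif direction == "U":
            ax.append(x[i - 1]); ay.append(GAP); i -= 1
        else:
            ax.append(GAP); ay.append(y[j - 1]); j -= 1
    return score[m][n], ax[::-1], ay[::-1]
```

For example, aligning the mnemonic sequences push, mov, call and push, call under these assumed scores places a gap opposite mov in the shorter sequence.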

2) Multiple Sequence Alignment Method: Multiple sequence alignment (MSA) is used to add a series of sequences to an existing alignment. Multiple sequences can be aligned by repeatedly applying global/local alignment methods. In the proposed methodology, MSA is used in particular to determine related functional and structural aspects of opcode sequences, and to investigate semantically equivalent opcode sequence(s). Given a set of k malware samples with opcode sequences M1, M2, ..., Mk, gaps are inserted while aligning the opcode sequences so that all of them have the same length. Semantically equivalent opcode sequences are conserved and the number of gaps is minimized. The progressive alignment method identifies the most similar instances first, and successively less similar instances are added to the initial alignment. This process is repeated until the combined result of aligning all opcode sequences of a malware family is obtained. The basic approach involves three steps:

(a) Perform pairwise alignment for all opcode sequences of a malware family. The result of pairwise alignment is a score matrix. The score matrix consists of the score values (the bottom rightmost value of each alignment) obtained by aligning two sequences.
(b) With the alignment scores, a Phylogenetic Tree is constructed using the Neighbour-Joining (NJ) [15] method. The Neighbour-Joining method is a bottom-up method for constructing Phylogenetic or Phenetic Trees. A Phylogenetic tree is used to depict the evolutionary relationship between various species.
(c) Opcode sequences are aligned using the evolutionary relationship abstracted from the Phylogenetic Tree, with similar ones aligned first followed by the less similar instances.
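Step (a) can be illustrated with a short Python sketch that fills a family's pairwise score matrix and picks the most similar pair, which is where a progressive alignment would start. It does not implement Neighbour-Joining. The scoring values and all function names are assumptions made for illustration only.

```python
from itertools import combinations
from typing import List, Tuple


def nw_score(x: List[str], y: List[str],
             match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the global alignment score (the bottom rightmost cell) only."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_score_matrix(seqs: List[List[str]]) -> List[List[int]]:
    """Symmetric k x k matrix of alignment scores for k opcode sequences."""
    k = len(seqs)
    scores = [[0] * k for _ in range(k)]
    for i in range(k):
        scores[i][i] = nw_score(seqs[i], seqs[i])
    for i, j in combinations(range(k), 2):
        scores[i][j] = scores[j][i] = nw_score(seqs[i], seqs[j])
    return scores


def most_similar_pair(scores: List[List[int]]) -> Tuple[int, int]:
    """Indices of the two distinct samples with the highest alignment score."""
    return max(combinations(range(len(scores)), 2),
               key=lambda p: scores[p[0]][p[1]])
```

A guide tree built by Neighbour-Joining would then determine the order in which the remaining sequences are merged into the alignment.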


Windows XP operating system. Malware families (73 families) were created by scanning the samples with 14 antivirus products (with updated databases). Benign samples were scanned likewise. The entire dataset is divided into two parts: one for training (signature modelling) and the other reserved for testing.
B. Malware Unpacking
Figure 2. Progressive Alignment

IV. IMPLEMENTATION METHODOLOGY
Metamorphic malware can be detected using three types of signatures extracted from the multiple aligned opcode sequences of a malware family. The types of signature created in our implementation are (a) single, (b) group and (c) probabilistic signatures. Figure 3 depicts the steps used for detecting metamorphic malware.

Malware authors use packers to avoid detection by antivirus products. The basic function of a packer is to encrypt the code, resources and import table. Executable packers also add a random number of jump instructions in order to confuse disassemblers. Advanced packers encrypt the Portable Executable (PE) sections as well, so that the antivirus effectively fails to scan the malicious code. Unpacking can be performed using a generic unpacker like VMPacker (http://www.woodman.com) or GUNPacker (http://leechermods.com). The basic problems with these signature-based unpackers are (a) packer signatures need to be updated periodically and (b) executables packed in multiple layers are difficult to detect. Another way of unpacking software is to use Ether Unpack (http://ether.gtisc.gatech.edu/).
C. Pairwise Alignment
The unpacked executable samples are disassembled using the IDA Pro [2] disassembler to obtain the assembly code. The disassembled code is parsed to extract the mnemonics from each file, which are used for performing alignment with other samples. Pairwise (global) alignment of the opcode sequence of each sample is performed against the other samples to obtain a score matrix. The higher the score, the closer the sequences. The score matrix for a family is used to determine the evolutionary relationship with the help of a Phylogenetic tree. Each cell of the score matrix reflects the matches, mismatches and gaps introduced while aligning the opcode sequences. Therefore, the score is computed as:
score = #Match + #Mismatch + #Gap
D. Weighted Mnemonic Pair
Pairwise alignments of malware samples depict the matches, mismatches and gaps introduced in the alignment. The frequency of all mismatch mnemonic pairs in the alignments is determined and a weight is associated with semantically equivalent mnemonic pairs. During experimentation, we obtained more than 5000 mismatch mnemonic pairs, from which the top 500 were extracted.
The weight for any pair of mnemonics $m_i$ and $m_j$ is computed as:
$$\mathrm{weight}(m_i, m_j) = \frac{f(m_i, m_j)}{f(M_n)}$$

Figure 3. Proposed method for detecting metamorphic malware. [Flowchart stages: malware executables; unpack executables; separate into training and test sets; pairwise alignment of samples; generate Phylogenetic tree and construct multiple aligned sequence; determine semantically equivalent mismatch mnemonic pairs; assign weights to mnemonic pairs; generate signature(s) for the train set (single, group and probabilistic signatures); compute threshold for a family; test set; generate scan log.]

A. Dataset Preparation
A malware dataset consisting of 1209 executables was prepared. Samples were collected from different sources such as VX Heavens (http://vx.netlux.org) and user agencies, and some samples were created using malware constructors like NGVCK, G2, MPCGEN and PSMPC. Benign samples (150 executables) were collected from the System32 folder of a fresh installation of

where $m_i$ and $m_j$ represent a pair of mnemonics in an alignment, $f(m_i, m_j)$ is the frequency of that opcode pair in all alignments and $f(M_n)$ is the frequency of all mismatch mnemonic pairs in the alignments.
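As a hedged illustration of the weight computation, the following Python sketch counts mismatch pairs over a list of pairwise alignments and normalizes each count by the total number of mismatches. Treating a pair as unordered, ignoring positions that contain a gap, and the function names are all assumptions; the paper does not specify these details.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Alignment = Tuple[List[str], List[str]]


def count_mismatch_pairs(aligned_x: List[str], aligned_y: List[str],
                         gap: str = "-") -> Counter:
    """Count mismatching mnemonic pairs at aligned positions (gaps ignored)."""
    counts: Counter = Counter()
    for a, b in zip(aligned_x, aligned_y):
        if a != b and a != gap and b != gap:
            counts[tuple(sorted((a, b)))] += 1  # assumption: pairs are unordered
    return counts


def pair_weights(alignments: List[Alignment],
                 top_n: Optional[int] = None) -> Dict[Tuple[str, str], float]:
    """weight(mi, mj) = f(mi, mj) / f(Mn), reported for the top_n most frequent pairs."""
    totals: Counter = Counter()
    for ax, ay in alignments:
        totals += count_mismatch_pairs(ax, ay)
    all_mismatches = sum(totals.values())  # f(Mn): frequency of all mismatch pairs
    if all_mismatches == 0:
        return {}
    return {pair: c / all_mismatches for pair, c in totals.most_common(top_n)}
```

Calling pair_weights(alignments, top_n=500) would mirror the selection of the top 500 pairs described in the text.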


E. Phylogenetic Tree and MSA
The global alignment score matrix is used to construct a Phylogenetic tree which depicts the evolutionary relationship in a malware family. The Phylogenetic tree is also called a guide tree and is used to build the MSA corresponding to a malware family. The Phylogenetic tree is constructed using the Neighbour-Joining method, which groups the closer samples first, followed by the distant ones.
F. Signature Construction
A signature is constructed for each malware family in the training set. Three different types of signatures are extracted for a family: (a) single, (b) group and (c) probabilistic signatures. Once a signature is constructed, we determine the threshold for each family by computing the similarity score of malware variants and benign samples (in the training set) with the generated signatures. The threshold is used for classifying unseen samples (malware or benign) to investigate the effectiveness of the approach. The types of signatures are discussed below.
Single Signature: The opcode sequences corresponding to each malware family are aligned using MSA. From each row of the aligned MSA sequence, an opcode that appears in more than 50% of the row is preserved. The concatenation of all such opcodes from all rows of an MSA represents the single signature. Figure 4 depicts a single signature extracted from the MSA of opcode sequences.
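The majority rule behind the single signature can be sketched in Python as below. The sketch assumes that the MSA is given as one equal-length aligned sequence per sample, so that each "row" of the paper corresponds to one aligned position across the samples. The gap symbol "-" and the function name are also assumptions.

```python
from collections import Counter
from typing import List


def single_signature(msa: List[List[str]], gap: str = "-") -> List[str]:
    """Keep, for each aligned position, the opcode present in more than 50% of samples."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    signature = []
    for position in zip(*msa):  # one aligned position across all samples
        opcode, count = Counter(position).most_common(1)[0]
        # Positions dominated by gaps, or with no majority opcode, are dropped.
        if opcode != gap and count > len(msa) / 2:
            signature.append(opcode)
    return signature
```

Positions with no majority opcode contribute nothing to the signature, which is why opcodes involved in mutations tend to be lost from it.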

Table I. Mismatch mnemonic pairs and associated weights for each pair of mnemonics.

Mnemonic Pair     Weight
push  mov         0.06326
mov   lea         0.02312
call  mov         0.02308
push  lea         0.01962
pop   mov         0.01839
push  call        0.01737
push  pop         0.01482
mov   add         0.01470
mov   jmp         0.01352

signature. In this figure, Mi represents the signature for the i-th malware family.

Figure 5. Representation of Group signature.

Figure 4. Extraction of single signature.

Group Signature: Each malware family is subdivided into a number of smaller groups based on the evolutionary relationship abstracted from a Phylogenetic tree. All samples which are close to each other, based on distances, are grouped to form a subgroup. A subgroup may contain two or more samples. The opcode sequences of each subgroup are aligned using MSA and a single signature is extracted for each subgroup. Thus, for k subgroups we obtain k signatures. An MSA of the k signatures is then created and represented in the form of a wildcard. This wildcard signature is also referred to as the group signature (MSA signature). The main advantage of representing the group signature as a wildcard is that the time spent during the testing phase is minimized. Otherwise, a test sample needs to be matched against i prominent signatures from the k subgroup signatures, where i < k. Figure 5 shows group

Probabilistic Signature: To construct the probabilistic signature, the pairwise alignment of all malware samples in a family is determined. The pairwise alignment yields (a) an alignment between each pair of malware samples and (b) a score matrix corresponding to each malware family. The score matrix is mainly used to determine the threshold of a family. Pairwise alignment is performed using the global alignment method, which shows the pairs of mnemonics that match or mismatch and the places where gaps are introduced in the alignment. Mutations in malware are primarily depicted by the mismatches and gaps introduced in the alignment. A list of semantically equivalent mnemonic pairs is determined and weights are assigned based on their frequency of appearance in all alignments. The probabilistic signature is also represented in the form of a wildcard. Table I depicts semantically equivalent mnemonic pairs and their associated weights.

G. Testing with unseen samples
The threshold of each malware family is determined and the samples in the test set are classified using the three types of signature. For computing the threshold corresponding to a family, both malware and benign samples in the training set are considered. Samples in the training set are aligned with the signature(s) and a similarity score is determined. A higher score represents a closer match with a signature. The threshold th for a family is determined as follows:
$$th = \frac{B_{max} + M_{min}}{2}$$


where $B_{min}$ and $B_{max}$ denote the minimum and maximum scores of benign samples with the signature(s). Similarly, $M_{min}$ and $M_{max}$ denote the lowest and highest scores of malware samples with the signature(s). Figure 6 depicts diagrammatically how the threshold is computed from malware and benign samples. A test sample t is considered benign if its similarity score with the signature is less than the threshold th; otherwise, the sample is flagged as malware.
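The threshold rule and the resulting decision can be written as a minimal Python sketch. The function names and the example scores are hypothetical; how the similarity score itself is obtained is outside the scope of this sketch.

```python
from typing import Sequence


def family_threshold(benign_scores: Sequence[float],
                     malware_scores: Sequence[float]) -> float:
    """th = (Bmax + Mmin) / 2, computed from training-set similarity scores."""
    return (max(benign_scores) + min(malware_scores)) / 2


def classify(score: float, threshold: float) -> str:
    """A sample scoring below the threshold is benign; otherwise it is flagged as malware."""
    return "benign" if score < threshold else "malware"


# Hypothetical training scores for one family.
th = family_threshold(benign_scores=[20, 35, 40], malware_scores=[60, 80, 90])
label = classify(45, th)
```

With these hypothetical scores the threshold lies midway between the best-scoring benign sample (40) and the worst-scoring malware sample (60).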

opcode sequence in a multiple aligned sequence of a family of malware. Each row of the MSA depicts the matches, mismatches and gaps corresponding to the opcode sequences. The group signature is the wildcard representation of the signatures of the subfamilies in a family. The weighted (probabilistic) signature is constructed by assigning weights to pairs of opcodes which are semantically equivalent and responsible for mutation. Figure 7 shows the detection rate and false positives obtained with the different signatures.

Figure 6. Selection of Threshold.
Figure 7. Detection rate and false positives for three types of generated signatures.

V. RESULTS AND ANALYSIS
The detection of metamorphic malware is evaluated by dividing the dataset into two parts: (a) a train set and (b) a test set. The train set consists of 629 samples, comprising 45 benign and 584 malware samples. The test set contains a total of 724 samples, which include 100 benign samples, 623 malware samples and 37 samples which were not detected by any of the antivirus products. Signature modeling and threshold computation are performed on the training set, and unseen samples are tested against the threshold corresponding to each family. Experimental results are evaluated using the metrics TPR, TNR, FPR and FNR. These metrics are computed from True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). TP is the number of malware samples correctly classified as malware, TN is the number of correctly classified benign instances, FP is the number of benign samples incorrectly classified as malware and FN is the number of malicious samples classified as benign. The performance of any detector/scanner can be measured primarily by checking the True Positive Rate (TPR) and True Negative Rate (TNR), which are also known as sensitivity and specificity respectively.
1) True Positive Rate (TPR): TPR = TP / (TP + FN)
2) False Positive Rate (FPR): FPR = FP / (FP + TN)
3) True Negative Rate (TNR): TNR = TN / (TN + FP)
4) False Negative Rate (FNR): FNR = FN / (FN + TP)
A. Comparative Analysis with Generated Signatures
The malware samples labelled by the scanners were separated into a number of families. For each malware family, three types of signatures (single, group and probabilistic) are extracted. Single signature is the maximum preserving

It is observed that the detection rate with the single signature is approximately 91%, with an FPR of 52%. This indicates that most of the malware samples are detected, but many benign samples are incorrectly classified as malware. Since the single signature is constructed by extracting the maximum preserving opcodes in each MSA row, the opcodes responsible for mutations are lost from the signature (they appear to be less dominant). Thus, most of the benign samples in the test set score well against the signature and are detected as malware. In the case of the group signature, a detection rate of 72.2% is obtained with a very low false positive rate (FPR = 0.01). This indicates that malware samples in the test set are detected by the wildcard representation of the signature. The group signature is the wildcard representation of the signatures of the subfamilies of a family. The opcode sequences present in this signature are absent in benign samples, so benign samples can be discriminated from the malware samples. For the probabilistic signature, the top 500 pairs of mnemonics are selected, the test samples are matched with the probabilistic signature and the values of the evaluation metrics are determined. Experiments are repeated for the top 100, 200, 300, 400 and 500 mnemonic pairs and the highest detection rate is recorded. Through our experiments we observed that, with the top 200 pairs of opcodes, a detection rate of 71% was achieved. Also, an FPR of 7% is obtained, showing that most of the benign samples are correctly classified as benign. The detection rate degrades if the number of mnemonic pairs is increased further. This primarily happens because a large number of mismatch pairs are introduced that are less dominant in malware instances but predominant in benign ones. Thus, malware samples are


falsely classified as benign. Likewise, the detection rates of the group signature and the probabilistic signature are found to be comparable.
B. Comparative Analysis with Antiviruses
The entire dataset was scanned using 14 antivirus products and the detection rate was computed from their scan logs. Figure 8 depicts the detection rates obtained from the antivirus products and from our implementation. The top five detection rates were obtained with Avast, Avira, AVG, G Data and Kaspersky (arranged in descending order of detection rate). It was observed that the detection rate obtained with our method is close to that of the top three commercial antivirus products. Some of the malicious files (37 malware samples in total) were not detected by any of the antivirus products.

REFERENCES
[1] Md. Enamul Karim, Andrew Walenstein, and Arun Lakhotia. Malware Phylogeny Generation using Permutations of Code. Journal in Computer Virology, 1(1-2):13-23, 2005.
[2] The IDA Pro Disassembler. http://www.datarescue.com/idabase
[3] Mohamed R. Chouchane and Arun Lakhotia. Using engine signature to detect metamorphic malware. In Proceedings of the 4th ACM Workshop on Recurring Malcode, WORM '06, pages 73-78, New York, NY, USA, 2006.
[4] Qinghua Zhang and Douglas S. Reeves. MetaAware: Identifying Metamorphic Malware. Computer Security Applications Conference, Annual, pages 411-420, 2007.
[5] Wing Wong and Mark Stamp. Hunting for Metamorphic Engines. 2006.
[6] Saul B. Needleman and Christian D. Wunsch. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. pages 443-453, 1970.
[7] Guillaume Bonfante, Matthieu Kaczmarek, and Jean-Yves Marion. Architecture of a Morphological Malware Detector. Computer Virology, pages 263-270, 2009.
[8] Guillaume Bonfante, Matthieu Kaczmarek, and Jean-Yves Marion. Control Flow Graphs as Malware Signatures. 2007.
[9] Christopher Kruegel, Engin Kirda, Darren Mutz, William Robertson, and Giovanni Vigna. Polymorphic Worm Detection using Structural Information of Executables. In RAID, pages 207-226. Springer-Verlag, 2005.
[10] Kyoochang Jeong and Heejo Lee. Code Graph for Malware Detection. In International Conference on Information Networking (ICOIN), pages 1-5. IEEE, 2008.
[11] Mona Singh. Phylogenetics, Lecture Notes. http://www.cs.princeton.edu/mona/Lecture/msa1.pdf
[12] Vladimir Likic. The Needleman-Wunsch algorithm for sequence alignment. http://www.ludwig.edu.au/course/lectures2005/Likic.pdf
[13] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, vol. 147, no. 1, pp. 195-197, 1981.
[14] ClustalW2 - Multiple Sequence Alignment. http://www.ebi.ac.uk/Tools/msa/clustalw2/
[15] N. Saitou and M. Nei. The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Oxford Journals, Life Sciences and Medicine, Molecular Biology and Evolution, Volume 4, pp. 406-425.
[16] Da Lin and Mark Stamp. Hunting for undetectable metamorphic viruses. Journal in Computer Virology, volume 7, issue 3, pp. 201-214, August 2011.

Figure 8. Detection rate and false positives with different types of signature constructed with proposed method.

Out of the 37 malware executables that no commercial antivirus detected, our implementation detected 30 with the single signature, 20 with the group signature (wildcard signature) and between 4 and 7 with the probabilistic signature. The effectiveness of the method suggests that bioinformatics sequence alignment methods can be used effectively to detect malware. These methods can also be used for generating malware signatures and for assisting scanners in detection.

VI. CONCLUSIONS
In this paper, the problem of detecting metamorphic malware is addressed using MSA methods. Signature(s) (single, group and probabilistic) for each malware family are extracted and tested using unseen samples. It was observed that the unseen samples were detected using the signatures with low false positives. Also, the detection rate of the implemented method is comparable with that of antivirus products like Avast, Avira and AVG. Some of the malware executables that went undetected by all commercial antivirus products were detected by signatures generated using our implementation.
