ae
448 The Open Cybernetics & Systemics Journal, 2014, 8, 448-454
Open Access
A Computer Virus Detecting Model based on Artificial Immune and Key
Code
Zhang Li1,*, Xie Bin1, Lou Fang1, He Z. Qiang2 and Dong Z. Xin1
1
Institute of Computer Application, China; 2Academy of Engineering Physics, Sichuan, 621900 China
Abstract: Existing antivirus technology depends on extracting signatures. They are inefficient on detecting diverse forms
of computer viruses, especially new variants and unknown viruses. Inspired by biological immune system, a virus detec-
tion model based on artificial immune and key-signatures extraction is proposed. This model adopt TF-IDF Algorithm to
extract virus ODNS from virus DNA parts on code level, and on gene level these virus ODNs are matched by slither win-
dow to form virus candidate gene library and normal candidate gene library; then distinguish these gene through negative
selection algorithm to generate a detecting virus gene library; Last on the testing procedure level, use a cosine similarity
algorithm to estimate the testing procedure relevant to virus. To identify most of new variants and camouflage viruses,
virus polymorphism is considered. Different unsteady length genes compose a virus, and a r-adjustable match rule based
on RCB r-chunks is adopted to extract virus detecting library, which can mostly present virus signatures. In order to make
full use of effective information and fully taking the advantages of relevance between virus genes, in procedure phase,
suspicious programs are analyzed in contrast to the detecting gene matching technique, which leads to a fairly level false
and positive rate.
Keywords: Artificial immune, cosine similarity algorithm, feature extraction, successive matching, TF-IDF algorithm.
different family virus classification. Data mining is adopted virus in training sets, the virus code is extracted through a
to distinguish internal and external family to establish de- key extraction algorithm, which is composed of different
tailed index of virus features. kinds of virus gene pool, and then through the selection of
threshold as control, uses the prior knowledge mining infor-
Inspired by negative selection mechanism, Forrest pro-
poses detection algorithm for abnormal challenge. This algo- mation available to the greatest extent.
rithm without non-self information is especially suitable for Inspired by biological immune system, some academic
the unknown time-varying environment of the fault diagno- terms are redefined in this paper:
sis and computer security monitoring. But, features in collec-
-code DNA: Hex string in program;
tion are exponential based on the "self" information. In addi-
tion, the algorithm adopts without wizard randomly gener- -Gene: Hex string represent virus feature, as comparative
ated characteristics will have a lot of useless operations, item;
which will cause detector redundancy. -Nucleotides: each two bytes Hex string represents a Nu-
Adopting fixed length string to identify the individual cleotides, as ODN. A quantity of Nucleotides compose a
problem is also irrational. Therefore, a variable fuzzy match- gene.
ing negative selection algorithm is proposed [9], in which a A small amount of effective key code forms the virus
low complexity and detector selection generation algorithm genes and these genes are composed of several ODNs or-
is given to make the characteristic number and "self" infor- derly. This paper adopts the sliding window for ODN counts.
mation number into a linear relationship. This method
greatly reduces the number of features, but not fundamen- With gray pigeons virus code (text) as an example, the
tally solve the problem of feature. code of virus is as follow:
From above point of view, virus detection model based 14 EB 57 FF 75 14 6A 66 56 F8 24
on artificial immune system is proposed to detect unknown ODN code:
viruses [10]. The model uses the prior knowledge to generate
the virus features regularly, overcome random selection and 14EB EB57 57FF FF75 7514 146A 6A66 6656 56F8 F824
is only suitable for stabilization system. However, the valid- In order to locate and extract the key code accurately, this
ity of the model is influenced by feature extraction field, and paper intends to introduce TF-IDF keywords localization
unit size. Unknown factors in practice are not available while algorithm and combines the concentration ODN of virus, to
a small translation will cause failure. extract the key virus code from training sets.
For these reasons, Karnik et al. proposed a cosine simi- - I S : the number of ODN of normal program in training
larity method to measure program features [11]. This method sets.
can identify form of polymorphism of file and detect viruses.
This method has high measuring efficiency for unknown - I V : the number of ODN of virus program in training
viruses and virus variation. sets.
i
In addition, Chen qi et al. [12] proposed a filtering fea- - I S : the number of ODNi of normal program in training
ture of the selected algorithm based on TF*IDF and also sets.
i
provides good ideas for virus detection. - IV : the number of ODNi of virus program in training
sets.
3. VIRUS FEATURE EXTRACTION - N S : the number of normal programs in training sets.
3.1. Tendency Selection Algorithm - NV : the number of virus programs in training sets.
Although many experts, based on the negative selection i
- N S : the number of normal programs which contain
algorithm proposed linear method, greedy method, template ODNi in training sets.
method, evolutionary method, the negative database, and r- i
variable improved algorithm. However, these models do not - NV : the number of virus programs which contain ODNi
fully consider the operational features of the virus itself, and in training sets.
lack correlation between the viral genes. The overall non-self i
- ODN S : ODNi in normal program in training set.
detection rate is not high, the rate of false positives for self i
has also a higher level. - ODNV : ODNi in virus program in training set.
To solve these problems, this paper attempts to run the This paper adopt TF-IDF algorithm [13] to calculate
mechanism of the virus itself. According to the feature of the ODN Si and ODNVi , and generate TF-IDF formulas (3-1-1):
Ii N
s log S ,ODNi normal program
max I i
i s { } N i +1
s
TF IDF(ODNi) = (3-1-1)
I vi N
log V ,ODNi virus program
i
max I v
i { } N i +1
v
450 The Open Cybernetics & Systemics Journal, 2014, Volume 8 Li et al.
I si Nsi
W i = £¬ODNi normal program
S Ii N
s
s
i
concentration(ODNi) = (3-1-2)
i I vi N vi
WV = £¬ODNi virus program
i N
v
I v
i
G1i and Vi1 . In order to give accurate measurement to calcu- In this paper, we give the matching probability of candi-
date genes and legal virus genes:
late the similarity of two detecting programs, similarity ma-
A = { normal-program-ODN}
IS
trix of detecting gene and virus detection are as follows:
Vi1 Vi2 Vin
B = { virus-program-ODN} V , C = { virus-ODN pool}
I KL
j
S / S' S1,in / S in
' '
{ } , E = { D C}
S1,i2 / S i2
D = ODNi { A B}
1,i1 i1 i= M
G11 j M1
j . After the trend
X S 2,in / S in
'
G12 S 2,i1 / S i1
' '
S 2,i2 / S i2 i=1
similarity maxtrix = j j
T selection, ODN j C Let KL be the number of ODN in
G1m S / S
'
S / S
'
S / S
'
the virus ODN pool, M is the number of common ODN be-
m,i1 i1 m,i2 i2 m,in j in j
tween normal program and virus program. So anything be-
' tween candidate genes and normal suspicious virus genes
S 1i is the number of ODN in gene Vi1 . Calculation pro-
cedures are as follow: { A A B||C||A B-C} , it can be seen that each of genes
Step 1: Calculate consists of three parts:
'
{ ' '
A1i = max S i,i1 / S i1 , S i,i2 / S i2 , , S i,in / S in
j j } ,
1). Its own ODN cluster;
2). Virus ODN cluster;
A1i {
= max Si,i1 / S i1 , Si,i 2 / S i 2 ,, Si,in / S in
' '
j
'
j
}. 3). Remove common ODN cluster of virus ODN cluster.
Thus it is concluded that { B A B||C||A B-C} satisfy-
[ ] ,
{ }
i 1, Nv
Step 2: Vn Vn ing the candidate genes and legal procedures virus gene
i i
[
j 1, N ]
, Calculate genes similarity of
j j v
matching probability P1 of variable r, matching rules meets
virus i
Vn and virus detecting genes, as formula (3-3-2). the formula (4-2):
j
m '
r
CM1 min{GV 1,GS 1} C Ml 1
i=1 A1i A 1i r
P1 l=r r (4-2)
COS = (3-3-2) C KL C KL
m 2 m ' 2
( A1i ) ( A 1i )
i=1 i=1 4.1.3. Analysis of Similarity and Detection Rate
Similarity Matrix model is given in the paper, by com-
Nv
{
Step 3: Denote max cos 1 , cos 2 ,..., cos N
v
} as similarity paring the virus gene pool and each gene in the virus detec-
of detecting programs and virus programs, compare it with tion to be detected in the gene pool similarity value, to de-
S, if termine whether the program to be detected is a virus.
Obviously, the similarity value is proportional to each
Nv
{
max cos 1 , cos 2 ,..., cos N
v
}>s element value in the similarity matrix and is inversely pro-
portional to the length of the gene in the virus detection gene
decide whether this program is a virus program or not. pool. By the formula (3-3-1), the similarity value S satisfies
Here, the value of S is 0.5. the inequality (4-3);
m
4. ALGORITHM ANALYSIS i=1 A1iT A'1i
S 1 (4-3)
4.1. Match the Time Complexity and Space Complexity m
T
m
( A1i ) ( A 1i )
2 ' 2
Length N S be the length of the code of the normal pro- Set P2 as the detection rate of the virus program. Taking
i
in account that a program to be detected is misidentified as
gram Si . Let Length NV j be its counterpart of the virus legal, the procedure is as follows: the program to be detected
program, set T as the algorithm complexity to calculate all and the virus ODN pool generated class virus gene pool and
normal programs and virus programs, ODN generated and all virus detection of virus samples and not any one gene in
j1,NV
trend selection, the algorithm complexity meets the formula the gene pool match, namely {cos }j
< S , calculate
(4-1):
452 The Open Cybernetics & Systemics Journal, 2014, Volume 8 Li et al.
o
i
c
i 0.03
p
s Min probability
u
s Max probability
- 0.025
l
a
m
r
o
n 0.02
d
n
a
0.015
s
u
r
i
v 0.01
e
t
a
d
i 0.005
d
n
a
C
0
2.5 3 3.5 4 4.5 5 5.5 6 6.5
Matching threshold r
Fig. (1). Candidate virus genes with normal-suspicious virus gene matching probability.
the probability of the event P3, by the formula threshold S is inversely proportional to the value r and
i1,NV matching value, but r is proportional to the virus detection
{G1i } { }
i[1,m]
Vi,n , formula (4-4) is established. rate. Therefore, with the increasing of S, virus detection rate
j j1,NV
falls.
{
P3 = P {cos j }
j1,NV
<S } (4-4)
The method was applied to this scenario, compared with
the algorithm of literature [18].
{ j}
P2 = 1 P3 = 1 P cos j1,NV < S
virus, it is easy to cause the variations of the virus to
identify the effectiveness of the shortage, thus to calcu-
This paper collected 800 viruses and 1200 legal proce- late the similarity values; This scenario uses gene com-
dures, according to their respective properties. They are di- prehensive similarity values and calculates the program
vided into 245 kinds of viruses. Through the statistical of a particular gene to be detected and the sum of all
analysis, the simulation results are presented: genes similarity to each type of virus as the genetic simi-
larity, well balanced frequency problem;
Fig. (1) shows matching probability between the candi-
date gene pool and legal virus genes, it can be seen that after 2). As well as using the maximum as the judge whether the
virus ODN generated by a candidate gene pool and legal program has viruses, because its core code is crucial.
virus gene pool there still exists certain intersection. There is only a few and often hidden in a large number
of data. There is plenty of virus variation and the un-
Fig. (2) shows the relationship between the matching known virus detection is very difficult. Maximum value
threshold r and similarity threshold S. As the threshold r in- evaluation method can maximize the ability to detect
creases, the similarity threshold S also will increase. The
and mine programs that contain virus features.
similarity model given in the conclusion conforms to the
virus. The matching similarity threshold value S of the last sec-
tion is in chapter 3.4.3. Considering this is related to the cov-
Fig. (3) shows the intersection between virus program erage of virus sample program and the number of detection
ODN and legal program ODN, the intersection part is related sets, in this case, to balance the comprehensiveness of detec-
to the samples space concentrated in the training set.
tion accuracy, misjudgment rate and false negative rate, it
Fig. (4) shows the correct detection rate of program, with needs to constantly add new virus samples from training set.
the increasing of similarity threshold S, the detection rate of Give a more appropriate threshold and update the virus
legal procedure increases, but the detection rate of virus pro- detection gene pool to improve S value.
gram will decrease. The main reason is that the similarity
A Computer Virus Detecting Model based on Artificial Immune The Open Cybernetics & Systemics Journal, 2014, Volume 8 453
d 0.68
l
o
h 0.66
s
e
r 0.64
h
t
0.62
y
t
i 0.6
r
a
l 0.58
i
m
i 0.56
S
0.54
0.52
0.5
3 4 5 6 7 8
Matching threshold r
o
t 500
g
n
o 400
l
e
b 300
N
D 200
O
-
- 100
'
*
'
0
0 100 200 300 400 500 600 700 800
'o'--Virus program ODN
Fig. (3). Virus program ODN and normal ODN distribution.
n
0.9 Virus program detection rate
o Normal program detection rate
i
t 0.8
c
e
t 0.7
e
d
0.6
m
a 0.5
r
g
o
r 0.4
P
0.3
0.2
0.1
0
0.5 0.52 0.54 0.56 0.58 0.6 0.62 0.64 0.66
Similarity threshold S
Fig. (4). Program detection rate.
454 The Open Cybernetics & Systemics Journal, 2014, Volume 8 Li et al.
Received: November 01, 2014 Revised: December 15, 2014 Accepted: December 16, 2014
© Li et al.; Licensee Bentham Open.
This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/-
licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.