Figure 1, Figure 2 [diagram labels: Traffic; Text; Binary data; Text documents (doc, pdf, txt etc.); Detected letter body; Archives (zip, tar.gz etc.); Instant messaging; Video files; Images (jpg, png, tiff, gif etc.); PDF (containing images); Images extracted from MS Office files; OCR result]
using standard interfaces and receive only a certain small amount of data. Considering everything mentioned above, we propose a solution that helps to protect databases from uploading. The main purpose of the technique is to protect the exact data that may leak. However, it is worth noting that some data are not confidential in themselves but become confidential in conjunction with other data. For example, the list of employees may not be confidential; however, the list of employees along with their salaries may be considered a commercial secret. Therefore, it would be reasonable to introduce a mechanism that allows us to define relations between the columns of the DB. All these measures are likely to improve accuracy and reduce the number of false positives.

3.2.2 Filled Form Detection. In general, the standard digital fingerprint detection algorithm seems to be enough to detect filled forms. However, it detects both filled and empty forms. Therefore, there are likely to be many false positives, as empty forms containing no confidential data will be detected. Another problem is that the majority of forms contain typical input fields such as First Name, Second Name, and Address. Thus, there is a high probability of false positives, since other non-confidential texts containing these fields may be detected. In order to improve the detection quality of a DLP system, it is recommended to use a more complex solution that considers the relative position of the form fields and allows us to determine whether the intercepted text is a form.

3.3 Multilevel Analysis
Multilevel Analysis may be the best method to solve the problems mentioned above. In order to decide whether intercepted data is confidential, DLP systems usually utilize a limited number of techniques, and each analysis result affects the decision separately. In order to improve detection accuracy and reduce the number of false positives and false negatives, we recommend using Multilevel Analysis. The main feature of this approach is that the decision is reached by combining the results of several algorithms. Multilevel Analysis allows us to flexibly customize the DLP system and use simpler analysis techniques for interception. For example, without Multilevel Analysis it would be possible to protect some types of documents using only a classification algorithm, but it would be difficult to configure and might lead to a large number of false positives and false negatives. Multilevel Analysis allows us to prevent confidential data from leaking using more effective technologies such as Filled Form Detection, Database Uploading Detection, and Template Analysis (regular expressions). Thus, the final verdict of the system is the joint result of all analysis algorithms.

4 DLP CORPUS

We have created a corpus for training and evaluating DLP classification algorithms. This corpus contains text documents, images, and documents containing both text and images. First, we analysed the corpus using only the three standard approaches (digital fingerprints, classification and template analysis), and after that we added the extended analysis algorithms. The results showed better analysis quality after the extended analysis algorithms had been added. For example, in the first case (sample form) we had a false positive: an empty form was considered a violation. In the second case only the filled form was detected (no false positive). Conversely, when analysing DB uploading, in the first case we had a false negative (fingerprints are sensitive only to significant amounts of information). Then, using the Database Uploading Detection technique, we managed to detect even a single string from a large database; it would even be possible to detect a confidential part of a string from a database. We then continued with image analysis. The corpus contained copies of Russian Federation passports, and the images were of varying quality. The OCR method usually fails to extract any text from low-quality images; as a result, DLP systems rarely intercept such images. We intercepted about 90% of the passports using our image classification algorithm (we trained it on a collection of other passports). The same results were obtained when we analysed credit card scans as well as documents with sample stamps.
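
The empty-form versus filled-form case reported above can be illustrated with a minimal sketch of a two-level decision. This is not the implementation used in the paper: the names FORM_FIELDS, matches_form_template, form_is_filled and multilevel_verdict are hypothetical, the label-order check is only a simple stand-in for a digital fingerprint match, and the "Label: value" layout is an assumption about how the form appears as text.

```python
import re

# Hypothetical form template: field labels in the order they appear on the form.
FORM_FIELDS = ["First Name", "Second Name", "Address"]


def matches_form_template(text: str) -> bool:
    """Level 1 (stand-in for a fingerprint match): all labels occur in template order."""
    pos = 0
    for label in FORM_FIELDS:
        idx = text.find(label, pos)
        if idx < 0:
            return False
        pos = idx + len(label)
    return True


def form_is_filled(text: str) -> bool:
    """Level 2: at least one 'Label:' is followed by a non-empty value on the same line."""
    for label in FORM_FIELDS:
        pattern = re.escape(label) + r"[ \t]*:[ \t]*(\S.*)?$"
        match = re.search(pattern, text, re.MULTILINE)
        if match and match.group(1):
            return True
    return False


def multilevel_verdict(text: str) -> bool:
    """Joint verdict: a template match alone is not treated as a violation."""
    return matches_form_template(text) and form_is_filled(text)


empty_form = "First Name:\nSecond Name:\nAddress:\n"
filled_form = "First Name: Ivan\nSecond Name: Petrov\nAddress: 1 Main St\n"

print(multilevel_verdict(empty_form))   # empty form: not flagged
print(multilevel_verdict(filled_form))  # filled form: flagged
```

With only the first level, both the empty and the filled form would be flagged; requiring both levels to agree removes the false positive on the empty form, which mirrors the behaviour described for the corpus experiment.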
ISBN: 978-0-9891305-5-4 2014 SDIWC 212
Proceedings of the International conference on Computing Technology and Information Management, Dubai, UAE, 2014