
International Journal of Computer Information Systems, Vol. 3, No. 2, 2012

Probabilistic Data Deduplication Using Modern Backup Operation


Nirmalin Bala S H, Department of Computer Science and Engineering, Hindustan University, Chennai, India, nirmalin_bala@yahoo.co.in
Abirami G, Department of Computer Science and Engineering, Hindustan University, Chennai, India, gabirami08@gmail.com
Uma Maheshwari M, Department of Computer Science and Engineering, Hindustan University, Chennai, India, uma.muthusamy60@yahoo.co.in
Padmaveni K, Assistant Professor, Department of Computer Science and Engineering, Hindustan University, Chennai, India, kpadmaveni@hindustanuniv.ac.in

Abstract- In many data mining projects, the data to be analysed is very large. Cleaning and preprocessing such data involves deduplication or linkage with other data, which is often complicated by the lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Deduplication is an important step in the preprocessing phase of data mining projects, and it also improves the quality of data before the data is loaded into a data warehouse. However, publication of data that contains personal information is normally impossible due to privacy and confidentiality issues. In this project, the deduplication system is optimized by adjusting the factors involved in fingerprint lookup and chunking, the key ingredients of efficient deduplication.

I. INTRODUCTION

Data deduplication, otherwise known as single instancing, refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted and only one copy of the data is retained. This is one of this year's hottest topics in data storage. The logic behind deduplication is simple: eliminate the duplicate data and reduce the capacity needed during backups and other data copy activities. This paper focuses on the important design aspects of deduplication, giving evaluators the information needed to make informed decisions when examining deduplication solutions. Deduplication is the process of unduplicating data; the term was introduced by database administrators many years ago to describe the process of removing duplicate database records after combining two databases. Summary information is generated from the chunks. In this project, when a file is retrieved from the database it is partitioned into chunks, and fingerprints

are generated for each chunk. The Fingerprint Manager checks whether a file is redundant; if so, it deletes the redundant copies and stores only one copy of the file. Fingerprints are generated using the SHA-1 algorithm. In the disk storage environment, deduplication refers to any algorithm that searches for duplicate data objects, such as blocks, chunks, or files, and discards these duplicates. When a duplicate object is detected, its reference pointers are modified so that the object can still be located and retrieved. This project is highly effective and feasible for developers, vendors, and users. In information technology, a backup, or the process of backing up, is making copies of data which may be used to restore the original after a data loss event. Backups have two distinct purposes. The primary purpose is to recover data after its loss, be it by data deletion or corruption. Data loss is a very common experience of computer users; 67% of Internet users have suffered serious data loss. The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy, typically configured within a backup application, which specifies how long copies of data are required. Though backups popularly represent a simple form of disaster recovery and should be part of a disaster recovery plan, backups by themselves should not be considered disaster recovery. Not all backup systems or backup applications are able to reconstitute a computer system, or other complex configurations such as a computer cluster, Active Directory servers, or a database server, by restoring only data from a backup. Since a backup system contains at least one copy of all data worth saving, the data storage requirements are considerable. Organizing this storage space and managing the backup process is a complicated undertaking. A data repository model can be used to provide structure to the storage. In the modern era of computing there are many different types of data storage devices that are useful for


making backups. There are also many different ways in which these devices can be arranged to provide geographic redundancy, data security, and portability. Before data is sent to its storage location, it is selected, extracted, and manipulated. Many different techniques have been developed to optimize the backup procedure, including optimizations for dealing with open files and live data sources, as well as compression, encryption, and deduplication. Many organizations and individuals try to gain confidence that the process is working as expected and work to define measurements and validation techniques. It is also important to recognize the limitations and human factors involved in any backup scheme. The chunk location is indicated by a chunk file id and offset. The fingerprint table consists of a group of fingerprints. A new chunk is appended at the end of the current chunk file. There is a predefined limit for the chunk file size; when that limit is reached, the current chunk file is closed and a new chunk file is created. To restore a backup, the client sends a request to the server. The server locates the header file, which contains the fingerprints of the chunks. The backup server then consults the fingerprint table for the location of the original chunk data and restores the original file.
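To make the chunk-file layout and restore flow described above concrete, here is a minimal Python sketch of a chunk repository: chunks are appended to size-limited chunk files, each fingerprint is recorded in a fingerprint table as a (chunk file id, offset, length) entry, and a file is restored from the list of fingerprints in its header. The class and method names (ChunkRepository, store, restore) are illustrative assumptions, not the paper's implementation.

```python
import os
from typing import Dict, List, Tuple

class ChunkRepository:
    """Sketch of a chunk store: chunks are appended to size-limited chunk
    files, and a fingerprint table maps each fingerprint to its location."""

    def __init__(self, directory: str, chunk_file_limit: int = 64 * 1024 * 1024):
        self.directory = directory
        self.chunk_file_limit = chunk_file_limit   # close the chunk file when this size is reached
        self.fingerprint_table: Dict[str, Tuple[int, int, int]] = {}  # fp -> (file id, offset, length)
        self.current_file_id = 0
        self.current_offset = 0
        os.makedirs(directory, exist_ok=True)

    def _chunk_file_path(self, file_id: int) -> str:
        return os.path.join(self.directory, f"chunk_{file_id:06d}.dat")

    def store(self, fingerprint: str, data: bytes) -> None:
        # A redundant chunk is detected by a fingerprint lookup; only one copy is kept.
        if fingerprint in self.fingerprint_table:
            return
        # When the predefined limit is reached, close the current chunk file and start a new one.
        if self.current_offset + len(data) > self.chunk_file_limit:
            self.current_file_id += 1
            self.current_offset = 0
        with open(self._chunk_file_path(self.current_file_id), "ab") as f:
            f.write(data)
        self.fingerprint_table[fingerprint] = (self.current_file_id, self.current_offset, len(data))
        self.current_offset += len(data)

    def restore(self, fingerprints: List[str]) -> bytes:
        # The backup's header file lists the fingerprints of its chunks; each one
        # is resolved to (file id, offset, length) and read back from the chunk file.
        parts = []
        for fp in fingerprints:
            file_id, offset, length = self.fingerprint_table[fp]
            with open(self._chunk_file_path(file_id), "rb") as f:
                f.seek(offset)
                parts.append(f.read(length))
        return b"".join(parts)
```

A backup client would call store() for every chunk of a backed-up file and keep the ordered fingerprint list as the file's header, so that restore() can later rebuild the original file.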

[Figure: System architecture]

II. RELATED WORK

Reducing bytes means eliminating unneeded data, and there are numerous techniques for reducing redundancy when objects are stored or sent. The most outstanding example is data compression [1], which eliminates redundancy internal to an object and generally reduces textual data by factors of two to six. Due to the progress of storage technology, a larger portion of data is now being maintained in digitized form. The amount of digitized data increases by a factor of 10 every 3 years, and the type of data ranges from business data, e.g., medical records and log data, to personal data [2]. Backup applications also exploit information redundancy to reduce the amount of information to be backed up [3], [4], [5]. In a distributed file system, the deduplication technique has been used to reduce the network traffic involved in synchronizing file system contents between a client and a server [6]. In a SAN file system, when different files share an identical piece of information, each file harbors a pointer to the shared data chunk instead of maintaining redundant information [7]. Jaehong Min has proposed context-aware chunking to maximize the chunking speed and deduplication ratio, together with an analysis of chunking speed, deduplication ratio, and tablet management algorithms [8]. Aronovich et al. proposed maintaining summary information in larger units, e.g., 10 MB, and deduplicating the data based upon its similarity with the existing data [9]. To reduce the overhead of fingerprint lookup, Hamilton et al. maintain fingerprints in a hierarchical manner: they maintain a tree of fingerprints in which a parent node's fingerprint is the hash value of the fingerprints of its child nodes [10]. Bhagwat et al. exploit file similarity instead of locality in deduplication [11]; the lack of locality manifests itself when backing up a set of small files arriving from different hosts. A third approach is to reduce the disk overhead in fingerprint lookup, the key ingredient being to enforce access locality when storing fingerprints at the storage. Zhu et al. [5] proposed a technique called SISL, which simply appends the incoming fingerprints at the end of the existing table. Spyglass, a file metadata search system, proposed hierarchical partitioning of the namespace organization for performance and scalability [12]. Spyglass exploits namespace locality to improve performance, since the files that satisfy a query are often clustered in only a portion of the namespace.

III. PROPOSED SYSTEM

In the proposed system, deduplication is done by partitioning a file into chunks and generating a fingerprint for each chunk. The Fingerprint Manager checks for redundant data and eliminates it. An analysis is done of the original data size and the compressed data size.

Benefits: Deduplication is best for organizations that need to back up and consolidate data and to improve performance during backups. In cases where the data is being backed up or archived again and again, the storage savings get better and better, achieving 20:1 (95%) in many instances.


Disk backup systems extend the benefits of data deduplication across the enterprise and integrate it with tape, replication, and encryption into a complete backup solution for multi-site environments. Data deduplication reduces disk requirements by 90% or more and makes WAN-based replication a practical disaster recovery tool. The result is fast, reliable backup and restore, reduced media usage, reduced power and cooling requirements (a green solution of sorts), and lower overall data protection and retention costs.
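As a quick check on the figures quoted above, the space savings implied by a deduplication ratio follow from savings = 1 - 1/ratio. The snippet below is purely illustrative (not part of the paper's system) and reproduces the 20:1 and 90% numbers.

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of storage saved for a given deduplication ratio (original : deduplicated)."""
    return 1.0 - 1.0 / ratio

print(f"20:1 ratio -> {savings_from_ratio(20):.0%} savings")  # 95%
print(f"10:1 ratio -> {savings_from_ratio(10):.0%} savings")  # 90% disk reduction
```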

Chunk Analysis

Chunking is the process of scanning a file and partitioning it into pieces. Each file piece is called a chunk and is a unit of redundancy detection. There are two types of chunking: fixed-size chunking and variable-size chunking. In fixed-size chunking, a file is partitioned into fixed-size units, e.g., 8 KB blocks. Fixed-size chunking is conceptually simple and fast. However, this method has an important drawback: when a small amount of data is inserted into or deleted from a file, an entirely different set of chunks is generated from the updated file. To address this problem effectively, variable-size chunking, also known as content-based chunking, has been proposed. In variable-size chunking, the chunk boundary is determined based on the content of the chunk data, not on the offset from the beginning of the file. The Basic Sliding Window algorithm slides a window, a byte stream of a given length, from the beginning of a file, one byte at a time. We generate a signature for the window and determine whether it matches a predefined pattern called the target pattern. If the signature matches the target pattern, the end position of the window is set as the end of the chunk; otherwise, the window is shifted by one byte and the signature is generated again. Partitioning a file based on its content is a very CPU-demanding process, and efficient chunking is one of the key ingredients that governs the overall deduplication performance.
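The following Python sketch illustrates the sliding-window idea under simplifying assumptions: the window signature here is a plain polynomial rolling hash rather than the paper's (unspecified) signature function, the target pattern is "the low 13 bits of the signature are zero" (an expected chunk size of about 8 KB), and minimum and maximum chunk sizes are added so that chunks stay bounded. Names such as sliding_window_chunks are illustrative.

```python
from typing import Iterator

WINDOW_SIZE = 48          # bytes in the sliding window (assumed)
TARGET_BITS = 13          # expected chunk size ~ 2**13 = 8 KB
TARGET_MASK = (1 << TARGET_BITS) - 1
MIN_CHUNK = 2 * 1024      # bounds keep pathological inputs in check (assumed)
MAX_CHUNK = 64 * 1024
PRIME = 1000003
MOD = 1 << 32

def sliding_window_chunks(data: bytes) -> Iterator[bytes]:
    """Yield variable-size chunks whose boundaries depend only on content."""
    start = 0
    signature = 0
    # Precompute PRIME**(WINDOW_SIZE - 1) so the oldest byte can be removed in O(1).
    drop_factor = pow(PRIME, WINDOW_SIZE - 1, MOD)
    for i, byte in enumerate(data):
        # Update the rolling signature to cover the last WINDOW_SIZE bytes.
        signature = (signature * PRIME + byte) % MOD
        if i - start >= WINDOW_SIZE:
            oldest = data[i - WINDOW_SIZE]
            signature = (signature - oldest * drop_factor * PRIME) % MOD
        length = i - start + 1
        boundary = (signature & TARGET_MASK) == 0 and length >= MIN_CHUNK
        if boundary or length >= MAX_CHUNK:
            yield data[start : i + 1]
            start = i + 1
            signature = 0
    if start < len(data):
        yield data[start:]
```

Raising TARGET_BITS yields larger (and fewer) chunks and therefore fewer fingerprint lookups per megabyte; as the conclusion later reports, this can raise overall backup speed even though some duplicate detection is lost.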

Fingerprint Analysis

The Fingerprint Generator generates the summary information for each chunk. To expedite the comparison, we examine the fingerprints of the chunks instead of performing a byte-level comparison between two chunks. We use the SHA-1 algorithm to generate the fingerprint for each chunk. The SHA-1 (Secure Hash Algorithm) algorithm generates a condensed representation of a message called a message digest. It is used with the Digital Signature Algorithm as specified in the Digital Signature Standard; both the sender and the receiver of a message computing a digital signature use SHA-1.

SHA-1 Features

The message must be a bit string, and the number of bits in the message is its length. If the number of bits in the message is a multiple of 8, the message can be represented in hexadecimal format. Message padding is applied to make the total length of the message a multiple of 512 bits: a single 1 bit, followed by m 0 bits, followed by a 64-bit integer holding the length of the original message, are appended to the end of the message, so that the padded message is 512*n bits long. SHA-1 then sequentially processes the padded message as n 512-bit blocks when computing the message digest.
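Below is a minimal sketch of fingerprint generation using Python's standard hashlib library, which performs the padding described above internally; the function name is an assumption, not the paper's code.

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    """Return the SHA-1 message digest (160 bits, 40 hex characters) of a chunk."""
    return hashlib.sha1(chunk).hexdigest()

# Two chunks are treated as identical when their fingerprints match,
# so a byte-level comparison is avoided.
a = fingerprint(b"example chunk data")
b = fingerprint(b"example chunk data")
assert a == b
```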

Fingerprint Manager

The Fingerprint Manager is responsible for detecting redundancy and for locating the respective chunk in the Chunk Repository. The Fingerprint Manager at the client is responsible for insert, delete, and search operations on the existing set of fingerprints. In PRUNE, fingerprints are maintained as a collection of small fingerprint tables called tablets. The Fingerprint Manager maintains an array of pointers, each entry of which points to an individual tablet. At the client, when a fingerprint is passed to the Fingerprint Manager, it searches the list of tablets to see whether the given fingerprint is redundant. We use a commodity DBMS for lookup and insert on each tablet. One tablet in the list is designated as the current tablet; if the fingerprint is new, it is inserted into the current tablet.

Backup Generator

The Backup Generator computes the original file size and the size of the file after deduplication.

Data structures

There are three important backup objects in PRUNE: Backup Stream, Backup History, and Restore Stream. A Backup Stream is a sequence of bytes generated at the client. It consists of a Backup Header and a set of file information. The Backup Header consists of the PRUNE version and the backup request time. The file information consists of a File Header and an array of chunk information, and the chunk information consists of a Chunk Header and Chunk Data. The Chunk Data field is empty if a particular chunk is redundant. The fingerprint of a chunk is stored in the Chunk Header, and the chunk type field can be original, compressed, or redundant. If a chunk is non-redundant, the Chunking Module passes that chunk to the Backup Generator. If the user specifies during a backup session that compression is required, each chunk's data is compressed before being passed to the Backup Generator. When the Chunking Module knows that a given chunk is redundant, only metadata is passed to the Backup Generator. When a chunk is compressed, the original size and the size of the compressed chunk differ, so the Chunk Header carries fields for both sizes.
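The sketch below models these backup objects as Python dataclasses. The field names and types are assumptions made for illustration; the paper does not specify an exact layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class ChunkType(Enum):
    ORIGINAL = "original"      # raw chunk data follows
    COMPRESSED = "compressed"  # chunk data is compressed before being emitted
    REDUNDANT = "redundant"    # only metadata is emitted; the data field stays empty

@dataclass
class ChunkHeader:
    fingerprint: str           # SHA-1 digest of the (uncompressed) chunk
    chunk_type: ChunkType
    original_size: int
    compressed_size: int       # equals original_size when the chunk is not compressed

@dataclass
class ChunkInfo:
    header: ChunkHeader
    data: Optional[bytes]      # None for a redundant chunk

@dataclass
class FileInfo:
    file_header: str           # e.g., the file's path and attributes (assumed)
    chunks: List[ChunkInfo] = field(default_factory=list)

@dataclass
class BackupHeader:
    prune_version: str
    backup_requested_time: float

@dataclass
class BackupStream:
    header: BackupHeader
    files: List[FileInfo] = field(default_factory=list)
```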


IV. CONCLUSION

In this work, an analysis is done of the original file size and the compressed file size. We propose a fixed chunking mechanism to partition the files. Since an 8 KB chunk is too large to compare directly, we use the SHA-1 algorithm to condense each chunk into a short fingerprint; this is done in the fingerprint analysis. The Fingerprint Manager then eliminates the redundant data. When the target pattern size is increased from 11 to 13 bits, the deduplication detection rate decreases by two percent and the chunking performance decreases from 150 to 100 MB/sec with the files in memory; however, the whole backup speed increases from 51 to 77 MB/sec. This experimental result delivers a very important implication: for deduplication speed, more emphasis should be put on reducing the fingerprint lookup overhead than on finding more commonalities. A number of factors govern the performance of deduplication, among them the chunking speed, the deduplication ratio, and the fingerprint lookup speed; inappropriate deduplication settings may degrade the overall performance of backup.

[Table 1: Dataset experiment for chunking speed]

REFERENCES

[1] D. A. Lelewer and D. S. Hirschberg, "Data Compression," ACM Computing Surveys, vol. 19, no. 3, 1987.
[2] Y. Won, R. Kim, J. Ban, and J. Hur, "PRUN: Eliminating Information Redundancy for Large Scale Data Backup System," Proc. IEEE Int'l Conf. Computational Sciences and Its Applications, 2008.
[3] F. Douglis and A. Iyengar, "Application-Specific Delta-Encoding via Resemblance Detection," Proc. USENIX 2003 Ann. Technical Conf., June 2003.
[4] Y. Won, J. Ban, J. Min, J. Hur, S. Oh, and J. Lee, "Efficient Index Lookup for De-duplication Backup System," Proc. IEEE Int'l Symp. Modeling, Analysis and Simulation of Computers and Telecommunication Systems, Sept. 2008.
[5] B. Zhu, K. Li, and H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," Proc. Sixth USENIX Conf. File and Storage Technologies (FAST '08), pp. 1-14, 2008.
[6] A. Muthitacharoen, B. Chen, and D. Mazieres, "A Low-Bandwidth Network File System," SIGOPS Operating Systems Rev., vol. 35, no. 5, pp. 174-187, 2001.
[7] B. Hong and D. D. E. Long, "Duplicate Data Elimination in a SAN File System," Proc. 21st IEEE / 12th NASA Goddard Conf. Mass Storage Systems and Technologies (MSST), pp. 301-314, Apr. 2004.
[8] J. Min, D. Yoon, and Y. Won, "Efficient Deduplication Techniques for Modern Backup Operation," IEEE Transactions on Computers, vol. 60, no. 6, June 2011.
[9] L. Aronovich, R. Asher, and E. Bachmat, "The Design of a Similarity Based Deduplication System," Proc. SYSTOR '09: The Israeli Experimental Systems Conf., pp. 1-14, May 2009.
[10] J. Hamilton and E. Olsen, "Design and Implementation of a Storage Repository Using Commonality Factoring," Proc. 20th IEEE / 11th NASA Goddard Conf. Mass Storage Systems and Technologies (MSS '03), Aug. 2003.
[11] D. Bhagwat and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-Based File Backup," Proc. 17th IEEE Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '09), Sept. 2009.
[12] A. Leung, M. Shao, T. Bisson, and S. Pasupathy, "Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems," Proc. 23rd Int'l Conf. Supercomputing (ICS '09), pp. 370-379, 2009.

AUTHORS PROFILE
Nirmalin Bala S H, Chennai, 27.05.1984; M.E. Computer Science and Engineering, School of Computing Sciences and Engineering, Hindustan University, Chennai, Tamil Nadu, India.




Uma Maheswari M, Chennai, 29.06.1987; M.E. Computer Science and Engineering, School of Computing Sciences and Engineering, Hindustan University, Chennai, Tamil Nadu, India.

Abirami G, Chennai, 27.05.1983; M.E. Computer Science and Engineering, School of Computing Sciences and Engineering, Hindustan University, Chennai, Tamil Nadu, India.

Padmaveni K, Chennai, Assistant Professor, School of Computing Sciences and Engineering, Hindustan University, Chennai, Tamil Nadu, India.

