Data deduplication is a method of reducing the amount of data you need to store by only keeping one real copy of any given data chunk. Any duplicate data chunks in the system are replaced by bookmarks that point to the real copy.
What are the benefits of data deduplication?
Imagine that you’re setting up a web page and want to refer to content that other people have written on their websites. What do you do? You link, of course. But imagine that you couldn’t link: to refer to content elsewhere on the web, you would have to copy and paste it into your own page.
Your page would balloon in size very quickly. It would be long; it would be heavy; it would put more strain on your servers, take longer to load, and frustrate your users. In short, it would be bad. Good thing we can link.
Data deduplication keeps your data size from ballooning due to multiple copies of the same data. With only one real copy of the data and all other instances of that data replaced by bookmarks pointing to the copy (effectively like links), you need to store less data, saving you money and processing time.
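The “one real copy plus bookmarks” idea can be sketched in a few lines. This is a minimal, hypothetical illustration (the `store`, `bookmarks`, and `save` names are ours, not any particular product’s API) in which a chunk’s content hash serves as the bookmark:

```python
import hashlib

# A minimal sketch of deduplicated storage: one real copy per unique
# chunk, keyed by its content hash; duplicates become "bookmarks"
# (hash references) rather than second copies.
store = {}      # hash -> the single real copy of the data
bookmarks = []  # what each saved item points to

def save(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    if digest not in store:   # first time we see this content
        store[digest] = data  # keep the one real copy
    bookmarks.append(digest)  # every save is just a reference
    return digest

save(b"hello world")
save(b"hello world")  # duplicate: no new copy is stored
save(b"other data")

print(len(bookmarks))  # prints 3: three items were saved...
print(len(store))      # prints 2: ...but only two real copies are kept
```

The key design choice is content addressing: because the bookmark is derived from the data itself, identical chunks always collide on the same key, no matter where they came from.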
When should you deduplicate data? Inline vs. post-processing
Data deduplication processes can be run while the data is being processed and sent to its storage target (“inline”), or after it has already been written to the target (“post-processing”).
Inline deduplication makes the initial processing time longer, potentially holding up the use of the data. Its benefit is that the data reaches the target already deduplicated and reduced in size, requiring less storage capacity.
Post-processing deduplication can be run whenever it is convenient and does not have to interfere with the initial data processing or data use. However, because it only reduces the data volume after the data is already stored, there needs to be more storage capacity constantly available when using this deduplication method.
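The difference between the two approaches can be sketched as follows. This is a simplified illustration, not a real storage system: `write_inline`, `write_raw`, and `post_process` are hypothetical names, and a Python list stands in for the storage target.

```python
import hashlib

chunk_store = {}  # hash -> the single real copy of each unique chunk

def write_inline(target: list, data: bytes) -> None:
    """Inline: deduplicate before the data reaches the target."""
    digest = hashlib.sha256(data).hexdigest()
    chunk_store.setdefault(digest, data)  # extra work up front...
    target.append(digest)                 # ...but only a reference lands

def write_raw(target: list, data: bytes) -> None:
    """Post-processing path, step 1: store the full copy now."""
    target.append(data)                   # fast write, full size in storage

def post_process(target: list) -> None:
    """Post-processing path, step 2: a later pass replaces copies with references."""
    for i, item in enumerate(target):
        if isinstance(item, bytes):
            digest = hashlib.sha256(item).hexdigest()
            chunk_store.setdefault(digest, item)
            target[i] = digest

inline_target, raw_target = [], []
write_inline(inline_target, b"report contents")
write_raw(raw_target, b"report contents")  # full copy sits in storage for a while
post_process(raw_target)                   # the dedup pass catches it later
```

Both paths end up storing one real copy; the difference is only *when* the hashing work happens, and how much storage the raw copy occupies in the meantime.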
How should you deduplicate data? File-level vs. block-level
Data deduplication algorithms can target copies of whole files or copies of chunks of data (“blocks”) that are significantly smaller than most files. Usually the blocks have a fixed size (such as 8 KB), but some deduplication providers support variable-size blocks, using smarter chunking algorithms to determine natural, intuitive start and end points for each block.
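Fixed-size chunking is the simpler of the two schemes and takes only a few lines. This sketch (the `fixed_size_chunks` name is ours) uses the 8 KB block size mentioned above; the final block is simply whatever is left over:

```python
def fixed_size_chunks(data: bytes, block_size: int = 8192) -> list:
    """Split data into fixed-size blocks (8 KB here); the last block may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = fixed_size_chunks(b"x" * 20000)
print([len(b) for b in blocks])  # prints [8192, 8192, 3616]
```

Variable-size (content-defined) chunking is more involved: instead of cutting every 8 KB, it scans the data for recurring byte patterns and cuts there, so that inserting a few bytes near the start of a file shifts only nearby block boundaries rather than all of them.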
File-level deduplication compares each file to unique file listings in an index; if the file under examination exactly matches any file listing, it is deleted and a bookmark to the original unique file is put in its place. If the file does not exactly match, it is kept and a unique listing for it is stored in the index. Block-level deduplication does a similar thing for the small blocks of data that make up a file.
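The file-level index described above can be sketched like this. It is a toy model, assuming a content hash stands in for the “unique file listing”; `index`, `catalog`, and `store_file` are hypothetical names:

```python
import hashlib

index = {}    # file hash -> path of the one stored (unique) copy
catalog = {}  # logical path -> where its real bytes actually live

def store_file(path: str, contents: bytes) -> None:
    digest = hashlib.sha256(contents).hexdigest()
    if digest in index:
        catalog[path] = index[digest]  # exact match: bookmark the original
    else:
        index[digest] = path           # no match: keep it, add a listing
        catalog[path] = path

store_file("a.txt", b"same bytes")
store_file("b.txt", b"same bytes")   # exact duplicate of a.txt
store_file("c.txt", b"same bytes!")  # one byte differs: stored whole

print(catalog["b.txt"])  # prints a.txt -- a bookmark, not a second copy
print(len(index))        # prints 2: only two unique files stored
```

Note how unforgiving the file-level match is: `c.txt` differs by a single byte, so it gets its own full copy. That is exactly the weakness block-level deduplication addresses.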
File-level deduplication is usually faster and requires less processing power because the index is smaller. For the same reason, however, it doesn’t eliminate as much duplicate data. As an illustration, think of a 200-page document. In file-level deduplication, the entire file is the “chunk.” Change the headline, or make any other small edit, and the entire file is marked as unique and saved. In block-level deduplication, by contrast, the only chunk marked as unique and saved would be the one containing the headline. Every other block that makes up the 200-page document would be identified as a copy, eliminating the need to store it again. That’s a big data storage savings! The trade-off: while block-level deduplication saves more storage, it is slower and uses more processing power than file-level deduplication.
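The 200-page-document example can be made concrete. In this sketch (all names and the stand-in data are ours), two versions of a large file differ only in their headline; file-level deduplication must keep both versions whole, while block-level deduplication only adds the one block that changed:

```python
import hashlib

def unique_blocks(versions, block_size=8192):
    """Count the unique fixed-size blocks across several file versions."""
    seen = set()
    for data in versions:
        for i in range(0, len(data), block_size):
            seen.add(hashlib.sha256(data[i:i + block_size]).hexdigest())
    return len(seen)

original = b"HEADLINE" + b"A" * 200_000  # stand-in for a 200-page document
edited   = b"headline" + b"A" * 200_000  # only the headline changed

# File-level: the two versions hash differently, so both are stored whole.
file_level_copies = len({hashlib.sha256(v).hexdigest()
                         for v in (original, edited)})

# Block-level: only the first block differs; the rest are shared copies.
block_level_blocks = unique_blocks([original, edited])

print(file_level_copies)   # prints 2: two full ~200 KB files kept
print(block_level_blocks)  # prints 4: out of 50 total blocks, only 4 are unique
```

(The count is 4 rather than 2 because the repeated middle blocks all share one hash, while the two headline blocks and the short final block each get their own.) The same two versions cost two full files under file-level deduplication, but only one extra block under block-level deduplication.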