
AISHU Blog


AnyBackup 7 The Fourth-generation Deduplication

2024-05-13
AnyBackup 7 uses fourth-generation deduplication, which supports both source deduplication and parallel deduplication. Data is deduplicated before it is transferred to the storage media, which greatly improves backup performance.
 
Content-Based Length-Variable Data Slicing
The fourth-generation deduplication uses a content-based, variable-length data slicing algorithm that distinguishes modified data from unmodified data. When an edit shifts the data that follows it, the unmodified data is not repartitioned into new data blocks. This improves both deduplication performance and the deduplication ratio, with the goal that no redundant data is backed up.
Compared with a fixed-length slicing algorithm, content-based variable-length slicing achieves a higher deduplication ratio. Fixed-length slicing splits the file or data source into slices of a fixed size. If one character is added or removed at the beginning of a file, every slice boundary shifts and the fingerprints of all slices change. Two files that differ by only one character may then be backed up with a deduplication ratio of 0, that is, with nothing deduplicated.
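The difference between the two slicing approaches can be illustrated with a small sketch. This is not the AnyBackup algorithm, whose internals the article does not describe. It assumes a simple Gear-style rolling hash for the content-based cut points and SHA-256 for the fingerprints. All function names, slice sizes, and the test data are hypothetical.

```python
import hashlib
import random

# Gear table: one pseudo-random 64-bit value per byte value (fixed seed so runs repeat).
_rng = random.Random(7)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def fingerprint(block: bytes) -> str:
    """Fingerprint of one slice (SHA-256 is an assumed choice)."""
    return hashlib.sha256(block).hexdigest()


def fixed_slices(data: bytes, size: int = 1024) -> list:
    """Fixed-length slicing: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_slices(data: bytes, bits: int = 10,
                   min_size: int = 256, max_size: int = 8192) -> list:
    """Variable-length slicing: cut where a rolling hash of recent bytes hits a pattern."""
    slices, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each step shifts older bytes one bit further left; after 64 steps they
        # fall out of the 64-bit value, so h depends only on the last 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        # Cut when the top `bits` bits are zero, or when the slice gets too long.
        if (h >> (64 - bits)) == 0 or length >= max_size:
            slices.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        slices.append(data[start:])
    return slices


def reuse_ratio(old: bytes, new: bytes, slicer) -> float:
    """Fraction of the new file's slices whose fingerprints already exist for the old file."""
    known = {fingerprint(s) for s in slicer(old)}
    new_slices = slicer(new)
    return sum(fingerprint(s) in known for s in new_slices) / len(new_slices)


data_rng = random.Random(42)
original = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"X" + original  # one character added at the beginning

print("fixed-length reuse: ", reuse_ratio(original, edited, fixed_slices))
print("content-based reuse:", reuse_ratio(original, edited, content_slices))
```

With fixed-length slicing the inserted byte shifts every boundary, so no slice of the edited file matches. With content-based slicing the cut points follow the data, so only the slice containing the edit changes and the later slices line up again.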



Source Deduplication
After the data or file is sliced by the content-based variable-length slicing algorithm, each data block is uniquely identified by a hash value, called its fingerprint. Each fingerprint is then looked up in the fingerprint library. If the fingerprint already exists, an identical data block has already been saved, so the media server does not save the block again but references the existing one, which saves backup space. Because the duplicate block is never transferred, a large amount of bandwidth is saved as well.
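The lookup described above can be sketched as follows. This is a minimal illustration rather than the AnyBackup implementation. The `MediaServer` class, its methods, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


class MediaServer:
    """Hypothetical stand-in for the media server plus its fingerprint library."""

    def __init__(self):
        self.blocks = {}          # fingerprint -> stored data block
        self.bytes_received = 0   # payload bytes actually sent over the network

    def has(self, fp: str) -> bool:
        return fp in self.blocks

    def store(self, fp: str, block: bytes) -> None:
        self.blocks[fp] = block
        self.bytes_received += len(block)


def backup(blocks, server: MediaServer) -> list:
    """Back up a sequence of data blocks; return the fingerprints that describe the copy."""
    recipe = []
    for block in blocks:
        fp = hashlib.sha256(block).hexdigest()
        if not server.has(fp):       # unknown fingerprint: transfer and store the block
            server.store(fp, block)
        recipe.append(fp)            # known or not, the copy references the block
    return recipe


def restore(recipe, server: MediaServer) -> bytes:
    """Rebuild the original data from the referenced blocks."""
    return b"".join(server.blocks[fp] for fp in recipe)


server = MediaServer()
first = backup([b"alpha", b"beta", b"alpha"], server)
second = backup([b"alpha", b"beta", b"gamma"], server)
print(server.bytes_received, "bytes transferred for two backups")
```

The second backup transfers only the one block the server has not seen, yet both copies can still be restored in full from the shared blocks.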
The deduplication workflow is shown in the following figure.



Parallel Deduplication
Traditional deduplication usually runs on a single node, which in the big data era leads to data access bottlenecks, low processing performance, and insufficient storage space.
AnyBackup 7 adopts parallel deduplication, which builds fingerprint libraries on multiple nodes and distributes fingerprints to those nodes in parallel, relieving the single-node limits on performance and storage space.
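One way to picture fingerprints being spread over several nodes is the sketch below. The article does not say how AnyBackup assigns fingerprints to nodes, so the routing rule here (fingerprint prefix modulo the node count) is an assumption. `DedupNode`, `node_for`, and `deduplicate` are hypothetical names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class DedupNode:
    """One deduplication node holding its own partition of the fingerprints."""

    def __init__(self, name: str):
        self.name = name
        self.fingerprints = set()
        self._lock = Lock()

    def add_if_new(self, fp: str) -> bool:
        """Return True if the fingerprint was not yet known to this node."""
        with self._lock:
            if fp in self.fingerprints:
                return False
            self.fingerprints.add(fp)
            return True


def node_for(fp: str, nodes: list) -> DedupNode:
    # Assumed routing rule: the same fingerprint always maps to the same node.
    return nodes[int(fp[:8], 16) % len(nodes)]


def deduplicate(blocks, nodes) -> list:
    """Query the nodes in parallel; exactly one occurrence of each distinct block is reported as new."""
    fps = [hashlib.sha256(b).hexdigest() for b in blocks]
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(lambda fp: node_for(fp, nodes).add_if_new(fp), fps))


nodes = [DedupNode(f"node-{i}") for i in range(4)]
blocks = [bytes([i % 50]) * 64 for i in range(200)]  # 50 distinct blocks, each repeated 4 times
is_new = deduplicate(blocks, nodes)
print(sum(is_new), "blocks to store out of", len(blocks))
```

Because a given fingerprint always lands on the same node, each node only has to search its own partition, and no single node holds the whole fingerprint set.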


Traditional deduplication reads and writes fingerprints on disk, which generates a large amount of random I/O and seriously degrades performance. AnyBackup 7 parallel deduplication instead uses a memory-level fingerprint library: all fingerprint reads and writes are served from memory, which speeds up fingerprint queries and processing and reduces the random I/O pressure caused by growing on-disk fingerprint libraries. When the fingerprint library is no longer needed for backup, the fingerprints can be synchronized to disk. Although the fingerprint libraries are built on different deduplication nodes, production data can be deduplicated in parallel across multiple fingerprint libraries.
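The idea of serving lookups from memory and writing to disk only when the library is no longer in use can be sketched like this. The class name, the one-fingerprint-per-line file format, and the file path are assumptions for illustration. The article does not describe the real on-disk layout.

```python
import os
import tempfile


class MemoryFingerprintLibrary:
    """Hypothetical fingerprint library: lookups hit an in-memory set, disk is touched only on sync."""

    def __init__(self, path: str):
        self.path = path
        self.fingerprints = set()
        if os.path.exists(path):                      # load a previously synchronized library
            with open(path, encoding="ascii") as f:
                self.fingerprints = {line.strip() for line in f if line.strip()}

    def contains(self, fp: str) -> bool:
        return fp in self.fingerprints                # no disk I/O on the query path

    def add(self, fp: str) -> None:
        self.fingerprints.add(fp)

    def sync_to_disk(self) -> None:
        """Write all fingerprints in one sequential pass, then swap the file into place."""
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="ascii") as f:
            for fp in sorted(self.fingerprints):
                f.write(fp + "\n")
        os.replace(tmp, self.path)


path = os.path.join(tempfile.mkdtemp(), "fingerprints.txt")
library = MemoryFingerprintLibrary(path)
library.add("ab12")
library.add("cd34")
library.sync_to_disk()                                # backup finished: persist once
reloaded = MemoryFingerprintLibrary(path)
print(reloaded.contains("ab12"))
```

Writing the whole set in one pass turns many small random writes into a single sequential one, which is the effect the paragraph above attributes to the memory-level library.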


   
Through the fingerprint library policy, reused data can be stored sequentially and contiguously in the same space. This has two benefits.
First, it reduces the time spent querying all fingerprints in global deduplication.
Second, it lets the storage read cache reduce the frequent disk seeks caused by random reads, which improves recovery efficiency.

 
How does deduplication enhance backup performance?
Deduplication ensures that backups of large amounts of data can be completed within the limited backup window.
  • Parallel Deduplication: Suitable for backup scenarios with larger data volumes. The higher the deduplication ratio, the shorter the backup window.
  • Multiple Deduplication Nodes: The fingerprint library is composed of multiple deduplication nodes, which linearly increases the deduplication capacity of a single fingerprint library. Deduplication nodes are allocated on a load-balancing basis that spreads node resources evenly, avoiding both overloaded and idle nodes.
  • Source Deduplication: Data is deduplicated before it is transferred, which greatly reduces bandwidth usage and improves backup efficiency.
 
How are capacity and scalability problems handled?
As data grows, new capabilities are needed to meet capacity and scalability demands.
  • Scale-out: Up to 32 nodes are supported to meet massive data protection demands.
  • Flexible Selection Mechanism of Fingerprint Libraries: Multiple backup jobs of the same application type can share one fingerprint library. The deduplicated data and fingerprints of these jobs are shared and can be used to recover any of the jobs, which greatly increases the deduplication ratio and reduces the amount of backup data to be stored. Backup jobs of different application types use separate fingerprint libraries. Giving each application type its own library shortens comparison time and improves query efficiency, while balancing deduplication performance, deduplication ratio, and the data center resources required.
  • Automated Cleanup of Redundant Data: Historical copies are gradually phased out as redundant data according to the copy data protection policy. When the redundant data reaches a certain scale, the system automatically cleans it up to improve space utilization.
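The per-application-type libraries and the threshold-driven cleanup in the list above can be sketched together. This is one plausible design, assuming reference counting per fingerprint. The article does not state how AnyBackup tracks redundant data, and the class, the application-type labels, and the threshold value are hypothetical.

```python
from collections import defaultdict


class FingerprintLibrary:
    """Hypothetical library shared by all backup jobs of one application type."""

    def __init__(self):
        self.refs = defaultdict(int)   # fingerprint -> number of copies referencing it
        self.sizes = {}                # fingerprint -> block size in bytes

    def add_copy(self, blocks: dict) -> None:
        """Register a backup copy given as {fingerprint: block_size}."""
        for fp, size in blocks.items():
            self.refs[fp] += 1
            self.sizes[fp] = size

    def expire_copy(self, blocks: dict) -> None:
        """Phase out a historical copy: its blocks lose one reference each."""
        for fp in blocks:
            self.refs[fp] -= 1

    def redundant_bytes(self) -> int:
        return sum(self.sizes[fp] for fp, n in self.refs.items() if n == 0)

    def cleanup(self, threshold_bytes: int) -> int:
        """Reclaim unreferenced blocks once they reach the threshold; return bytes freed."""
        freed = self.redundant_bytes()
        if freed < threshold_bytes:
            return 0
        for fp in [fp for fp, n in self.refs.items() if n == 0]:
            del self.refs[fp], self.sizes[fp]
        return freed


libraries = defaultdict(FingerprintLibrary)   # one library per application type


def library_for(application_type: str) -> FingerprintLibrary:
    return libraries[application_type]


copy_a = {"f1": 100, "f2": 200}
copy_b = {"f2": 200, "f3": 300}
library_for("database").add_copy(copy_a)      # two jobs of the same type share a library
library_for("database").add_copy(copy_b)
library_for("file").add_copy({"f1": 100})     # a different type gets its own library
library_for("database").expire_copy(copy_a)
freed = library_for("database").cleanup(threshold_bytes=50)
print(freed, "bytes reclaimed")
```

Block `f2` survives the cleanup because the second copy still references it, while `f1` is reclaimed only from the library where no copy uses it any longer.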

How to enhance ROI while reducing risks?
  • Low Storage Cost: The deduplication ratio can reach up to 100:1, which greatly reduces the total amount of backup data and the storage space it occupies.
  • No Resource Contention: Source deduplication consumes few client resources, so it has little impact on the business system.
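To put the 100:1 figure above in storage terms, the arithmetic can be written out as a short sketch. The function names and the 100 TB example volume are illustrative assumptions, not measurements from the article.

```python
def space_saving(ratio: float) -> float:
    """Fraction of storage saved at a deduplication ratio of `ratio`:1."""
    return 1 - 1 / ratio


def stored_size(logical_bytes: float, ratio: float) -> float:
    """Physical bytes kept after deduplication at `ratio`:1."""
    return logical_bytes / ratio


for ratio in (2, 10, 100):
    print(f"{ratio}:1 saves {space_saving(ratio):.0%} of the space")

# Example volume (assumed): 100 TB of backup data at the best-case ratio.
print(stored_size(100e12, 100) / 1e12, "TB stored")
```

The saving is 1 - 1/ratio, so a 10:1 ratio already removes 90% of the data, and 100:1 removes 99%.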
