Naoko KIFUNE Hironori UCHIKAWA
At a flash memory, each stored data frame is protected by error correction codes (ECC) such as Bose-Chaudhuri-Hocquenghem (BCH) codes from random errors. Exclusive-OR (XOR) based erasure codes like RAID-5 have also been employed at the flash memory to protect from memory block defects. Conventionally, the ECC and erasure codes are used separately since their target errors are different. Due to recent aggressive technology scaling, additional error correction capability for random errors is required without adding redundancy. We propose an algorithm to improve error correction capability by using XOR parity with a simple counter that counts the number of unreliable bits in the XOR stripe. We also propose to apply Chase decoding to the proposed algorithm. The counter makes it possible to reduce the false correction and execute the efficient Chase decoding. We show that combining the proposed algorithm with Chase decoding can significantly improve the decoding performance.
Yang DING Qingye LI Yuting QIU
Locally repairable codes have attracted lots of interest in Distributed Storage Systems. If a symbol of a code can be repaired respectively by t disjoint groups of other symbols, each groups has size at most r, we say that the code symbol has (r, t)-locality. In this paper, we employ parity-check matrix to construct information single-parity (r, t)-locality LRCs. All our codes attain the Singleton-like bound of LRCs where each repair group contains a single parity symbol and thus are optimal.
Hybrid storage techniques are useful methods to improve the cost performance for input-output (IO) intensive workloads. These techniques choose areas of concentrated IO accesses and migrate them to an upper tier to extract as much performance as possible through greater use of upper tier areas. Automated tiered storage with fast memory and slow flash storage (ATSMF) is a hybrid storage system situated between non-volatile memories (NVMs) and solid-state drives (SSDs). ATSMF aims to reduce the average response time for IO accesses by migrating areas of concentrated IO access from an SSD to an NVM. When a concentrated IO access finishes, the system migrates these areas from the NVM back to the SSD. Unfortunately, the published ATSMF implementation temporarily consumes much NVM capacity upon migrating concentrated IO access areas to NVM, because its algorithm executes NVM migration with high priority. As a result, it often delays evicting areas in which IO concentrations have ended to the SSD. Therefore, to reduce the consumption of NVM while maintaining the average response time, we developed new techniques for making ATSMF more practical. The first is a queue handling technique based on the number of IO accesses for NVM migration and eviction. The second is an eviction method that selects only write-accessed partial regions in finished areas. The third is a technique for variable eviction timing to balance the NVM consumption and average response time. Experimental results indicate that the average response times of the proposed ATSMF are almost the same as those of the published ATSMF, while the NVM consumption is three times lower in best case.
Pan TAN Zhengchun ZHOU Haode YAN Yong WANG
Locally repairable codes (LRCs) with availability have received considerable attention in recent years since they are able to solve many problems in distributed storage systems such as repairing multiple node failures and managing hot data. Constructing LRCs with locality r and availability t (also called (r, t)-LRCs) with new parameters becomes an interesting research subject in coding theory. The objective of this paper is to propose two generic constructions of cyclic (r, t)-LRCs via linearized polynomials over finite fields. These two constructions include two earlier ones of cyclic LRCs from trace functions and truncated trace functions as special cases and lead to LRCs with new parameters that can not be produced by earlier ones.
Kazuichi OE Takeshi NANRI Koji OKAMURA
In previous studies, we determined that workloads often contain many input-output (IO) concentrations. Such concentrations are aggregations of IO accesses. They appear in narrow regions of a storage volume and continue for durations of up to about an hour. These narrow regions occupy a small percentage of the logical unit number capacity, include most IO accesses, and appear at unpredictable logical block addresses. We investigated these workloads by focusing on page-level regularity and found that they often include few regularities. This means that simple caching may not reduce the response time for these workloads sufficiently because the cache migration algorithm uses page-level regularity. We previously developed an on-the-fly automated storage tiering (OTF-AST) system consisting of an SSD and an HDD. The migration algorithm identifies IO concentrations with moderately long durations and migrates them from the HDD to the SSD. This means that there is little or no reduction in the response time when the workload includes few such concentrations. We have now developed a hybrid storage system consisting of a cache drive with an SSD and HDD and a multi-tier SSD that uses OTF-AST, called “OTF-AST with caching.” The OTF-AST scheme handles the IO accesses that produce moderately long duration IO concentrations while the caching scheme handles the remaining IO accesses. Experiments showed that the average response time for our system was 45% that of Facebook FlashCache on a Microsoft Research Cambridge workload.
Zhisheng HUO Limin XIAO Zhenxue HE Xiaoling RONG Bing WEI
Previous works have studied the throughput allocation of the heterogeneous storage system consisting of SSD and HDD in the dynamic setting where users are not all present in the system simultaneously, but those researches make multiple servers as one large resource pool, and cannot cope with the multi-server environment. We design a dynamic throughput allocation mechanism named DAM, which can handle the throughput allocation of multiple heterogeneous servers in the dynamic setting, and can provide a number of desirable properties. The experimental results show that DAM can make one dynamic throughput allocation of multiple servers for making sure users' local allocations in each server, and can provide one efficient and fair throughput allocation in the whole system.
Yingxun FU Junyi GUO Li MA Jianyong DUAN
As the demand of data reliability becomes more and more larger, most of today's storage systems adopt erasure codes to assure the data could be reconstructed when suffering from physical device failures. In order to fast recover the lost data from a single failure, recovery optimization methods have attracted a lot of attention in recent years. However, most of the existing optimization methods focus on homogeneous devices, ignoring the fact that the storage devices are usually heterogeneous. In this paper, we propose a new recovery optimization method named HSR (Heterogeneous Storage Recovery) method, which uses both loads and speed rate among physical devices as the optimization target, in order to further improve the recovery performance for heterogeneous devices. The experiment results show that, compared to existing popular recovery optimization methods, HSR method gains much higher recovery speed over heterogeneous storage devices.
Kazuichi OE Mitsuru SATO Takeshi NANRI
The response times of solid state drives (SSDs) have decreased dramatically due to the growing use of non-volatile memory express (NVMe) devices. Such devices have response times of less than 100 micro seconds on average. The response times of all-flash-array systems have also decreased dramatically through the use of NVMe SSDs. However, there are applications, particularly virtual desktop infrastructure and in-memory database systems, that require storage systems with even shorter response times. Their workloads tend to contain many input-output (IO) concentrations, which are aggregations of IO accesses. They target narrow regions of the storage volume and can continue for up to an hour. These narrow regions occupy a few percent of the logical unit number capacity, are the target of most IO accesses, and appear at unpredictable logical block addresses. To drastically reduce the response times for such workloads, we developed an automated tiered storage system called “automated tiered storage with fast memory and slow flash storage” (ATSMF) in which the data in targeted regions are migrated between storage devices depending on the predicted remaining duration of the concentration. The assumed environment is a server with non-volatile memory and directly attached SSDs, with the user applications executed on the server as this reduces the average response time. Our system predicts the effect of migration by using the previously monitored values of the increase in response time during migration and the change in response time after migration. These values are consistent for each type of workload if the system is built using both non-volatile memory and SSDs. In particular, the system predicts the remaining duration of an IO concentration, calculates the expected response-time increase during migration and the expected response-time decrease after migration, and migrates the data in the targeted regions if the sum of response-time decrease after migration exceeds the sum of response-time increase during migration. Experimental results indicate that ATSMF is at least 20% faster than flash storage only and that its memory access ratio is more than 50%.
Jung-Hyun KIM Min Kyu SONG Hong-Yeop SONG
In this paper, we investigate how to obtain binary locally repairable codes (LRCs) with good locality and availability from binary Simplex codes. We first propose a Combination code having the generator matrix with all the columns of positive weights less than or equal to a given value. Such a code can be also obtained by puncturing all the columns of weights larger than a given value from a binary Simplex Code. We call by block-puncturing such puncturing method. Furthermore, we suggest a heuristic puncturing method, called subblock-puncturing, that punctures a few more columns of the largest weight from the Combination code. We determine the minimum distance, locality, availability, joint information locality, joint information availability of Combination codes in closed-form. We also demonstrate the optimality of the proposed codes with certain choices of parameters in terms of some well-known bounds.
Locally repairable codes, which can repair erased symbols from other symbols, have attracted a good deal of attention in recent years because its local repair property is effective on distributed storage systems. (ru, δu)u∈[s]-locally repairable codes with multiple localities, which are an extension of ordinary locally repairable codes, can repair δu-1 erased symbols simultaneously from a set consisting of at most ru symbols. An upper bound on the minimum distance of these codes and a construction method of optimal codes, attaining this bound with equality, were given by Chen, Hao, and Xia. In this paper, we discuss the parameter restrictions of the existing construction, and we propose explicit constructions of optimal codes with multiple localities with relaxed restrictions based on the encoding polynomial introduced by Tamo and Barg. The proposed construction can design a code whose minimum distance is unrealizable by the existing construction.
Yingxun FU Shilin WEN Li MA Jianyong DUAN
With the rapid growth on data scale and complexity, single disk failure recovery becomes very important for erasure coded storage systems. In this paper, we propose a new strip-switched deployment method, which utilizes the feature that strips of each stripe of erasure codes could be switched, and uses simulated annealing algorithm to search for the proper strip-deployment on the stack level to balance the read accesses, in order to improve the recovery performance. The analysis and experiments results show that SSDM could effectively improve the single failure recovery performance.
Joon-Young PAIK Rize JIN Tae-Sun CHUNG
In terms of system reliability, data recovery is a crucial capability. The lack of data recovery leads to the permanent loss of valuable data. This paper aims at improving data recovery in flash-based storage devices where extremely poor data recovery is shown. For this, we focus on garbage collection that determines the life span of data which have high possibility of data recovery requests by users. A new garbage collection mechanism with awareness of data recovery is proposed. First, deleted or overwritten data are categorized into shallow invalid data and deep invalid data based on the possibility of data recovery requests. Second, the proposed mechanism selects victim area for reclamation of free space, considering the shallow invalid data that have the high possibility of data recovery requests. Our proposal prohibits more shallow invalid data from being eliminated during garbage collections. The experimental results show that our garbage collection mechanism can improve data recovery with minor performance degradation.
Joobeom YUN Junbeom HUR Youngjoo SHIN Dongyoung KOO
Ransomware becomes more and more threatening nowadays. In this paper, we propose CLDSafe, a novel and efficient file backup system against ransomware. It keeps shadow copies of files and provides secure restoration using cloud storage when a computer is infected by ransomware. After our system measures file similarities between a new file on the client and an old file on the server, the old file on the server is backed up securely when the new file is changed substantially. And then, only authenticated users can restore the backup files by using challenge-response mechanism. As a result, our proposed solution will be helpful in recovering systems from ransomware damage.
Daisuke ANDO Fumio TERAOKA Kunitake KANEKO
With rapid growth of producing high-resolution digital contents such as Full HD, 4K, and 8K movies, the demand for low cost and high throughput sharing of content files is increasing at digital content productions. In order to meet this demand, we have proposed DRIP (Distributed chunks Retrieval and Integration Procedure), a storage and retrieval mechanism for large file sharing using forward error correction (FEC) and global dispersed storage. DRIP was confirmed that it contributes to low cost and high throughput sharing. This paper describes the design and implementation of Content Espresso, a distributed large file sharing system for digital content productions using DRIP, and presents performance evaluations. We set up experimental environment using 79 physical machines including 72 inexpensive storage servers, and evaluate file metadata access performance, file storage/retrieval performance, FEC block size, and system availability by emulating global environments. The results confirm that Content Espresso has capability to deal with 15,000 requests per second, achieves 1 Gbps for file storage, and achieves more than 3 Gbps for file retrieval. File storage and retrieval performance are not significantly affected by the network conditions. Thus, we conclude that Content Espresso is capable of a global scale file sharing system for digital content productions.
The use of flash memory based storage devices is rapidly increasing, and user demands for high performance are also constantly increasing. The performance of the flash storage device is greatly influenced by cleaning operations of Flash Translation Layer (FTL). Various studies have been conducted to lower the cost of cleaning operations. However, there are limits to achieve sufficient performance improvement of flash storages without help of a host system, with only limited information in storage devices. Recently, SCSI, eMMC, and UFS standards provide an interface for sending semantic information from a host system to a storage device. In this paper, we analyze effects of semantic information on performance and lifetime of flash storage devices. We evaluate performance and lifetime improvement through SA-FTL (Semantic Aware Flash Translation Layer), which can take advantage of semantic information in storage devices. Experiments show that SA-FTL improves performance and lifetime of flash based storages by up to 30 and 35%, respectively, compared to a simple page-level FTL.
RAID has been widely deployed in disk array storage systems to manage both performance and reliability simultaneously. RAID conducts two performance-critical operations during disk failures known as degraded reads/writes and recovery process. Before the recovery process is complete, reads and writes are degraded because data is reconstructed using data redundancy. The performance of degraded reads/writes is critical in order to meet stipulations in customer service level agreements (SLAs), and the recovery process affects the reliability of a storage system considerably. Both operations require fast data reconstruction. Among the erasure codes for fast reconstruction, Local Reconstruction Codes (LRC) are known to offer the best (or optimal) trade-off between storage overhead, fault tolerance, and the number of disks involved in reconstruction. Originally, LRC was designed for fast reconstruction in distributed cloud storage systems, in which network traffic is a major bottleneck during reconstruction. Thus, LRC focuses on reducing the number of disks involved in data reconstruction, which reduces network traffic. However, we observe that when LRC is applied to primary array storage systems, a major bottleneck in reconstruction results from uneven disk utilization. In other words, underutilized disks can no longer receive I/O requests as a result of the bottleneck of overloaded disks. Uneven disk utilization in LRC is due to its dedicated group partitioning policy to achieve the Maximally Recoverable property. In this paper, we present Distributed Reconstruction Codes (DRC) that support fast reconstruction in primary array storage systems. DRC is designed with group shuffling policy to solve the problem of uneven disk utilization. Experiments on real-world workloads show that DRC using global parity rotation (DRC-G) improves degraded performance by as much as 72% compared to RAID-6 and by as much as 35% compared to LRC under the same reliability. In addition, our study shows that DRC-G reduces the recovery process completion time by as much as 52% compared to LRC.
Liyu WANG Lan CHEN Xiaoran HAO
NAND flash memory has been widely used in storage systems. Aiming to design an efficient buffer policy for NAND flash memory, a life-aware buffer management algorithm named LAB-LRU is proposed, which manages the buffer by three LRU lists. A life value is defined for every page and the active pages with higher life value can stay longer in the buffer. The definition of life value considers the effect of access frequency, recency and the cost of flash read and write operations. A series of trace-driven simulations are carried out and the experimental results show that the proposed LAB-LRU algorithm outperforms the previous best-known algorithms significantly in terms of the buffer hit ratio, the numbers of flash write and read operations and overall runtime.
Hiroaki AKUTSU Kazunori UEDA Takeru CHIBA Tomohiro KAWAGUCHI Norio SHIMOZONO
In recent data centers, large-scale storage systems storing big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems using more than a thousand low-cost large-capacity drives. Some large-scale storage systems protect data by erasure coding to prevent data loss. As the redundancy level of erasure coding is increased, the probability of data loss will decrease, but the increase in normal data write operation and additional storage for coding will be incurred. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two concerns regarding reliability in large-scale storage systems: (i) as the number of drives increases, systems are more subject to multiple drive failures and (ii) distributing stripes among many drives can speed up the rebuild time but increase the risk of data loss due to multiple drive failures. If data loss occurs by multiple drive failure, it affects many users using a storage system. These concerns were not addressed in prior quantitative reliability studies based on realistic settings. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method which we call Dynamic Refuging. Dynamic Refuging rebuilds failed blocks from those with the lowest redundancy and strategically selects blocks to read for repairing lost data. We modeled the dynamic change of amount of storage at each redundancy level caused by multiple drive failures, and performed reliability analysis with Monte Carlo simulation using realistic drive failure characteristics. We showed a failure impact model and a method for localizing the failure. When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, the proposed technique turned out to scale well, and the probability of data loss decreased by two orders of magnitude for systems with a thousand drives compared to normal RAID. The appropriate setting of a stripe distribution level could localize the failure.
Aseffa DEREJE TEKILU Chin-Hsien WU
A map-reduce framework is popular for big data analysis. In the typical map-reduce framework, both master node and worker nodes can use hard-disk drives (HDDs) as local disks for the map-reduce computation. However, because of the inherit mechanical problems of HDDs, the I/O performance is a bottleneck for the map-reduce framework when I/O-intensive applications (e.g., sorting) are performed. Replacing HDDs with solid-state drives (SSDs) is not economical, although SSDs have better performance than HDDs. In this paper, we propose a virtualization-based hybrid storage system for the map-reduce framework. The objective of the paper is to combine the advantages of the fast access property of SSDs and the low cost of HDDs by realizing an economical design and improving I/O performance of a map-reduce framework in a virtualization environment. We propose three storage combinations: SSD-based, HDD-based, and a hybrid of SSD-based and HDD-based storage systems which balances speed, capacity, and lifetime. According to experiments, the hybrid of SSD-based and HDD-based storage systems offers superior performance and economy.
Yuya TARUTANI Yuichi OHSITA Masayuki MURATA
Cloud storage has become popular and is being used to hold important data. As a result, availability to become important; cloud storage providers should allow users to upload or download data even if some part of the system has failed. In this paper, we discuss distributed cloud storage that is robust against failures. In distributed cloud storage, multiple replicas of each data chunk are stored in the virtual storage at geographically different locations. Thus, even if one of the virtual storage systems becomes unavailable, users can access the data chunk from another virtual storage system. In distributed cloud storage, the placement of the virtual storage system is important; if the placement of the virtual cloud storage system means that a large number of virtual storages are possible could become unavailable from a failure, a large number of replicas of each data chunk should be prepared to maintain availability. In this paper, we propose a virtual storage placement method that assures availability with a small number of replicas. We evaluated our method by comparing it with three other methods. The evaluation shows that our method can maintain availability while requiring only with 60% of the network costs required by the compared methods.