SSDFS: Towards LFS Flash-Friendly File System without GC operations
Viacheslav Dubeyko
Abstract
Solid state drives have a number of interesting characteristics. However, there are numerous file system and storage design issues for SSDs that impact the performance and device endurance. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that result in shorter SSD lifetime and the necessity to use NAND flash overprovisioning. The SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment’s PEBs pool, Main/Diff/Journal areas in the PEB’s log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree. The combination of all suggested concepts is able to: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system performance.
Index terms: NAND flash, SSD, Log-structured file system (LFS), write amplification issue, GC overhead, flash-friendly file system, SSDFS, delta-encoding, Copy-On-Write (COW), Diff-On-Write (DOW), PEB migration, deduplication.
Flash memory characteristics. Flash is available in two types: NOR and NAND. NOR flash is directly addressable, which enables not only reading but also executing instructions directly from the memory. A NAND-based SSD consists of a set of blocks which are fixed in number, and each block comprises a fixed set of pages. There are three types of operations in flash memory: read, write and erase. Read and write operations are executed at the page level. On the other hand, data is erased at the block level by using the erase operation. Because of the physical features of flash memory, write operations are able to modify bits only from one to zero. Hence, the erase operation, which sets all bits to one, should be executed before rewriting. The typical latencies are: (1) read operation - 20 us, (2) write operation - 200 us, (3) erase operation - 2 ms.
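The write-only-clears-bits constraint described above can be captured in a few lines of code. The following is a minimal, illustrative model (not taken from any real FTL or SSDFS code; sizes are assumptions): a page program succeeds only if it never needs to flip a bit from 0 back to 1, otherwise the whole containing block must be erased first.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE      4096          /* assumed page size, bytes        */
#define PAGES_PER_BLK  64            /* assumed pages per erase block   */

struct flash_block {
    uint8_t pages[PAGES_PER_BLK][PAGE_SIZE];
};

/* Erase sets every bit of every page in the block to 1. */
static void block_erase(struct flash_block *blk)
{
    memset(blk, 0xFF, sizeof(*blk));
}

/*
 * Program a page: a write may only clear bits (1 -> 0).  If the new
 * data would require setting a bit back to 1, the write is rejected
 * and the caller must erase the whole block first.
 */
static bool page_program(struct flash_block *blk, int page,
                         const uint8_t *data)
{
    uint8_t *cur = blk->pages[page];
    for (int i = 0; i < PAGE_SIZE; i++) {
        /* data[i] must be reachable from cur[i] by clearing bits only */
        if ((cur[i] & data[i]) != data[i])
            return false;            /* would need erase-before-write  */
    }
    for (int i = 0; i < PAGE_SIZE; i++)
        cur[i] &= data[i];
    return true;
}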
Flash Translation Layer (FTL). FTL emulates the functionality of a block device and enables the operating system to use flash memory without any modification. FTL mimics the block storage interface and hides the internal complexities of flash memory from the operating system, thus enabling the operating system to read/write flash memory in the same way as reading/writing the hard disk. The basic function of the FTL algorithm is to map the page number from logical to physical. However, internally the FTL needs to deal with erase-before-write, which makes it critical to overall performance and lifetime of the SSD.
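A minimal sketch of the logical-to-physical page mapping maintained by a page-level FTL is shown below. The structures and names are hypothetical and greatly simplified (free-page management and GC are left out): every logical page write is redirected to the next free physical page, the old physical page is marked invalid, and the map is updated.

#include <stdint.h>

#define TOTAL_PAGES   65536
#define INVALID_PPN   ((uint32_t)-1)

struct ftl {
    uint32_t l2p[TOTAL_PAGES];   /* logical page -> physical page; assumed  */
                                 /* initialized to INVALID_PPN everywhere   */
    uint8_t  valid[TOTAL_PAGES]; /* validity bitmap of physical pages       */
    uint32_t next_free;          /* next free physical page (simplified)    */
};

/* Out-of-place update: never overwrite a programmed page in place. */
static uint32_t ftl_write(struct ftl *f, uint32_t lpn)
{
    uint32_t old = f->l2p[lpn];

    if (old != INVALID_PPN)
        f->valid[old] = 0;              /* old copy becomes garbage         */

    uint32_t new_ppn = f->next_free++;  /* GC must replenish free pages     */
    f->valid[new_ppn] = 1;
    f->l2p[lpn] = new_ppn;
    return new_ppn;                     /* physical page actually programmed */
}

static uint32_t ftl_read(const struct ftl *f, uint32_t lpn)
{
    return f->l2p[lpn];                 /* INVALID_PPN if never written     */
}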
Garbage Collection and Wear Leveling. The process of collecting and moving valid data and erasing invalid data is called garbage collection. The file system can trigger garbage collection for deleted file blocks through the TRIM command of the SSD firmware. Frequently used erase blocks wear out quickly, slow down access times and finally burn out. Therefore, the erase count of each erase block should be monitored. There is a wide variety of wear-leveling techniques used in FTL.
Building blocks of SSD. An SSD includes a controller that incorporates the electronics that bridge the NAND memory components to the host computer. The controller is an embedded processor that executes firmware-level code. Some of the functions performed by the controller include error-correcting code (ECC), wear leveling, bad block mapping, read scrubbing and read disturb management, read and write caching, garbage collection, etc. A flash-based SSD typically uses a small amount of DRAM as a cache, similar to the cache in hard disk drives. A directory of block placement and wear leveling data is also kept in the cache while the drive is operating.
Write amplification. For write requests that come in random order, after a period of time, the free page count in flash memory becomes low. The garbage-collection mechanism then identifies a victim block for cleaning. All valid pages in the victim block are relocated into a new block with free pages, and finally the candidate block is erased so that the pages become available for rewriting. This mechanism introduces additional read and write operations, the extent of which depends on the specific policy deployed, as well as on the system parameters. These additional writes result in the multiplication of user writes, a phenomenon referred to as write amplification.

Read disturbance. A flash data block is composed of multiple NAND units to which the memory cells are connected in series. A memory operation on a specific flash cell will influence the charge contents of different cells. This is called disturbance; it can occur on any flash operation, is predominantly observed during the read operation, and leads to errors in undesignated memory cells. To avoid failure on reads, error-correcting codes (ECC) are widely employed. Read disturbance can occur when reading the same target cell multiple times without an erase and program operation. In general, to preserve data consistency, flash firmware reads all live data pages, erases the block, and writes down the live pages to the erased block. This process, called read block reclaiming, introduces long latencies and degrades performance.
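Write amplification is commonly quantified as the ratio of the data actually written to the flash (host writes plus the extra writes caused by garbage-collection relocation) to the data written by the host. A one-line helper makes the definition concrete; the function name is illustrative only.

/* Write amplification factor: WAF = (host_bytes + gc_bytes) / host_bytes */
static double write_amplification(double host_bytes, double gc_relocated_bytes)
{
    return (host_bytes + gc_relocated_bytes) / host_bytes;
}

/* Example: 100 GB written by the host, 40 GB relocated by GC -> WAF = 1.4 */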
SSD design issues. Solid state drives have a number of interesting characteristics that change the access patterns required to optimize metrics such as disk lifetime and read/write throughput. In particular, SSDs have approximately two orders of magnitude improvement in read and write latencies, as well as a significant increase in overall bandwidth. However, there are numerous file system and storage array design issues for SSDs that impact the performance and device endurance. SSDs suffer well-documented shortcomings: log-on-log, large tail latencies, unpredictable I/O latency, and resource underutilization. These shortcomings are not due to hardware limitations: the non-volatile memory chips at the core of SSDs provide predictable high performance at the cost of constrained operations and limited endurance/reliability. Providing the same block I/O interface as a magnetic disk is one of the important reasons for these drawbacks.
SSDFS features. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that result in shorter SSD lifetime and the necessity to use NAND flash overprovisioning. The SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment’s PEBs pool, Main/Diff/Journal areas in the PEB’s log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree. The combination of all suggested concepts is able to: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system performance. The rest of this paper is organized as follows. Section II surveys the related works. Section III explains the SSDFS architecture and approaches. Section IV includes final discussion. Section V offers conclusions.
File size. Agrawal, et al. [10] discovered that 1-1.5% of files on a file system’s volume have a size of zero. The arithmetic mean file size was 108 KB in 2000 and 189 KB in 2004. This metric grows roughly 15% per year. The median weighted file size increased from 3 MB to 9 MB. Most of the bytes in large files are in video, database, and blob files, and most of the video, database, and blob bytes are in large files. A large number of small files account for a small fraction of disk usage. Douceur, et al. [12] confirmed that 1.7% of all files have a size of zero. The mean file size ranges from 64 kB to 128 kB across the middle two quartiles of all file systems. The median size is 2 MB, which confirms that most files are small but most bytes are in large files. Ullah, et al. [16] observed that file sizes of 1 to 10 KB account for up to 32% of the total occurrences, and 29% of values are in the range of 10 KB to 100 KB. Gibson, et al. [17] agreed that most files are relatively small: more than half are less than 8 KB, and 80% or more of the files are smaller than 32 KB. On the other hand, while only 25% are larger than 8 KB, this 25% contains the majority of the bytes used on the different systems.
File age. Agrawal, et al. [10] stated that the median file age ranges between 80 and 160 days across datasets, with no clear trend over time. Douceur, et al. [12] reported a median file age of 48 days. Studies of short-term trace data have shown that the vast majority of files are deleted within a few minutes of their creation. On 50% of file systems, the median file age ranges by a factor of 8, from 12 to 97 days, and on 90% of file systems, it ranges by a factor of 256, from 1.5 to 388 days. Gibson, et al. [17] showed that while only 15% of the files are modified daily, these modifications account for over 70% of the bytes used daily. Relatively few files are used on any one day, normally less than 5%. Depending on the system, 5-10% of all files created are only used on one day. On the other hand, approximately 0.8% of the files are used more than 129 times, essentially every day. These files which are used more than 129 times account for less than 1% of all files created and approximately 10% of all the files which were accessed or modified. 90% of all files are not used after initial creation; those that are used are normally short-lived, and if a file is not used in some manner the day after it is created, it will probably never be used. 1% of all files are used daily.
Files count per file system. Agrawal, et al. [10] showed that the count of files per file system is going up from year to year. The arithmetic mean has grown from 30K to 90K files and the median has grown from 18K to 52K files (2000 to 2004). However, some percentage of file systems had already reached about 512K files in 2004. Douceur, et al. [12] found that 31% of all file systems contain 8k to 16k files, and 30% of file systems have fewer than 4k files. Ullah, et al. [16] found that 67% of the occurrences are in the range of 1-8 files in a directory. Directories with 9-16 files comprise 15% of the total data found. Directories with 17-32 files account for 9%, and only 9% of occurrences are found for more than 32 files in a directory.
File names. Agrawal, et al. [10] concluded that dynamic link libraries (dll files) contain more bytes than any other file type, and that virtual hard drives are consuming a rapidly increasing fraction of file-system space. Ullah, et al. [16] discovered that the file name length typically falls in the range from 9 to 17 characters, with a peak for file names of 12 characters. File names smaller than 8 characters make up to 11% of the total data collected, whereas file names larger than 16 characters and up to 32 characters account for 26%, and file names greater than 32 characters are found to be only 6%.
Directory size. Agrawal, et al. [10] discovered that across all years, 23-25% of directories contain no files. The arithmetic mean directory size has decreased slightly and steadily from 12.5 to 10.2 over the sample period, but the median directory size has remained steady at 2 files. Across all years, 65-67% of directories contain no subdirectories. Across all years, 46-49% of directories contain two or fewer entries. Douceur, et al. [12] shared that 18% of all directories contain no files and the median directory size is 2 files. 69% of all directories contain no subdirectories, 16% contain one, and fewer than 0.5% contain more than twenty. On 50% of file systems, the median directory size ranges from 1 to 4 files, and on 90% of file systems, it ranges from 0 to 7 files. On 95% of all file systems, the median count of subdirectories per directory is zero. 15% of all directories are at depth of 8 or greater.
Directories count per file system. Agrawal, et al. [10] registered that the count of directories per file system has increased steadily over the five-year sample period. The arithmetic mean has grown from 2400 to 8900 directories and the median has grown from 1K to 4K directories. Douceur, et al. [12] shared that 28% of all file systems contain 512 to 1023 directories, and 29% of file systems have fewer than 256 directories. Ullah, et al. [16] shared that 59% of the directories have sub-directories in the range of 1-5, and 35% of occurrences are found in the range of 6-10. The results show that only 6% of occurrences have more than 10 sub-directories in a directory.
Namespace tree depth. Agrawal, et al. [10] shared that there are many files deep in the namespace tree, especially at depth 7. Also, files deeper in the namespace tree tend to be orders of magnitude smaller than shallower files. The arithmetic mean depth has grown from 6.1 to 6.9, and the median directory depth has increased from 5 to 6. The count of files per directory is mostly independent of directory depth. Files deeper in the namespace tree tend to be smaller than shallower ones. The mean file size drops by two orders of magnitude between depth 1 and depth 3, and there is a drop of roughly 10% per depth level thereafter.
Capacity and usage. Agrawal, et al. [10] registered that 80% of file systems become fuller over a one-year period, and the mean increase in fullness is 14 percentage points. This increase is predominantly due to creation of new files, partly offset by deletion of old files, rather than due to extant files changing size. The space used in file systems has increased not only because mean file size has increased (from 108 KB to 189 KB), but also because the number of files has increased (from 30K to 90K). Douceur, et al. [12] discovered that file systems are on average only half full, and their fullness is largely independent of user job category. On average, half of the files in a file system have been created by copying without subsequent writes, and this is also independent of user job category. The mean space usage is 53%.
A File Is Not a File. Harter, et al. [14] showed that modern applications manage large databases of information organized into complex directory trees. Even simple word-processing documents, which appear to users as a "file", are in actuality small file systems containing many sub-files.
Auxiliary files dominate. Tan, et al. [13] discovered that on iOS, applications access resource, temp, and plist files very often. This is especially true for Facebook, which uses a large number of cache files. Also, on iOS resource files such as icons and thumbnails are stored individually on the file system. Harter, et al. [14] agree with that statement. Applications help users create, modify, and organize content, but user files represent a small fraction of the files touched by modern applications. Most files are helper files that applications use to provide a rich graphical experience, support multiple languages, and record history and other metadata.
Sequential Access Is Not Sequential. Harter, et al. [14] stated that even for streaming media workloads, "pure" sequential access is increasingly rare. Since file formats often include metadata in headers, applications often read and re-read the first portion of a file before streaming through its contents.
Writes are forced. Tan, et al. [13] shared that on iOS, Facebook calls fsync even on cache files, resulting in the largest number of fsync calls out of the applications. On Android, fsync is called for each temporary write-ahead logging journal file. Harter, et al. [14] found that applications are less willing to simply write data and hope it is eventually flushed to disk. Most written data is explicitly forced to disk by the application; for example, iPhoto calls fsync thousands of times in even the simplest of tasks.
Temporary files. Tan, et al. [13] showed that applications create many temporary files. This might have a negative impact on the durability of the flash storage device. Also, creating many files results in storage fragmentation. The SQLite database library creates many short-lived temporary journal files and calls fsync often.
Copied files. Agrawal, et al. [10] shared the interesting point that over the sample period (2000-2004), the arithmetic mean of the percentage of copied files has grown from 66% to 76%, and the median has grown from 70% to 78%. It means that more and more files are being copied across file systems rather than generated locally. Downey [15] concluded that the vast majority of files in most file systems were created by copying, either by installing software (operating system and applications) or by downloading from the World Wide Web. Many new files are created by translating a file from one format to another, compiling, or by filtering an existing file. Using a text editor or word processor, users add or remove material from existing files, sometimes replacing the original file and sometimes creating a series of versions.
Renaming Is Popular. Harter, et al. [14] discovered that home-user applications commonly use atomic operations, in particular rename, to present a consistent view of files to users.
Multiple Threads Perform I/O. Harter, et al. [14] showed that virtually all of the applications issue I/O requests from a number of threads; a few applications launch I/Os from hundreds of threads. Part of this usage stems from the GUI-based nature of these applications; threads are required to perform long-latency operations in the background to keep the GUI responsive.
Frameworks Influence I/O. Harter, et al. [14] found that modern applications are often developed in sophisticated IDEs and leverage powerful libraries, such as Cocoa and Carbon. Whereas UNIX-style applications often directly invoke system calls to read and write files, modern libraries put more code between applications and the underlying file system. Default behavior of some Cocoa APIs induces extra I/O and possibly unnecessary (and costly) synchronizations to disk. In addition, use of different libraries for similar tasks within an application can lead to inconsistent behavior between those tasks.
Applications’ behavior. Harter, et al. [14] made several conclusions about applications’ nature. Applications tend to open many very small files.

File System and Block IO Scheduler. Hui, et al. [20] have made an estimation of the interaction between file systems and the block I/O scheduler. They concluded that a higher proportion of reads or append-writes may result in better performance and less energy consumption, as in a web-server workload. Along with the increase of write operations, especially random writes, the performance declines and the energy consumption increases. The extent-based file systems show better performance and lower energy consumption. They expected that the NOOP I/O scheduler would be better suited for SSDs because it does not sort requests, which can cost much time and decrease performance. But after the test, they found that both CFQ and NOOP may be suitable for SSDs.
NAND flash storage device. Parthey, et al. [21] analyzed access timing of removable flash media. They found that many media access address zero especially fast. For some media, other locations such as the middle of the medium are sometimes slower than the average.

Rosenblum, et al. [34] introduced a new technique for disk storage management called a log-structured file system. A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on disk; it contains indexing information so that files can be read back from the log efficiently. In order to maintain large free areas on disk for fast writing, they divided the log into segments and use a segment cleaner to compress the live information from heavily fragmented segments. Log-structured file systems are based on the assumption that files are cached in main memory and that increasing memory sizes will make the caches more and more effective at satisfying read requests. As a result, disk traffic will become dominated by writes. A log-structured file system writes all new information to disk in a sequential structure called the log. This approach increases write performance dramatically by eliminating almost all seeks. The sequential nature of the log also permits much faster crash recovery: current Unix file systems typically must scan the entire disk to restore consistency after a crash, but a log-structured file system need only examine the most recent portion of the log. For a log-structured file system to operate efficiently, it must ensure that there are always large extents of free space available for writing new data. This is the most difficult challenge in the design of a log-structured file system. The authors presented a solution based on large extents called segments, where a segment cleaner process continually regenerates empty segments by compressing the live data from heavily fragmented segments.
JFFS (The Journalling Flash File System) [35, 36] is a purely log-structured file system (LFS). Nodes containing data and metadata are stored on the flash chips sequentially, progressing strictly linearly through the storage space available. In JFFS v1, there is only one type of node in the log: a structure known as struct jffs_raw_inode. Each such node is associated with a single inode. It starts with a common header containing the inode number of the inode to which it belongs and all the current file system metadata for that inode, and may also carry a variable amount of data. There is a total ordering between all the nodes belonging to any individual inode, which is maintained by storing a version number in each node. Each node is written with a version higher than all previous nodes belonging to the same inode. In addition to the normal inode metadata such as uid, gid, mtime, atime and ctime, each JFFS v1 raw node also contains the name of the inode to which it belongs and the inode number of the parent inode. Each node may also contain an amount of data, and if data are present the node will also record the offset in the file at which these data should appear. The entire medium is scanned at mount time, each node being read and interpreted. The data stored in the raw nodes provide sufficient information to rebuild the entire directory hierarchy and a complete map for each inode of the physical location on the medium of each range of data. Metadata changes such as ownership or permissions changes are performed by simply writing a new node to the end of the log recording the appropriate new metadata. File writes are similar, differing only in that the node written will have data associated with it.

The oldest node in the log is known as the head, and new nodes are added to the tail of the log. In a clean file system on which garbage collection has never been triggered, the head of the log will be at the very beginning of the flash. As the tail approaches the end of the flash, garbage collection will be triggered to make space. Garbage collection will happen either in the context of a kernel thread which attempts to make space before it is actually required, or in the context of a user process which finds insufficient free space on the medium to perform a requested write. In either case, garbage collection will only continue if there is dirty space which can be reclaimed. If there is not enough dirty space to ensure that garbage collection will improve the situation, the kernel thread will sleep, and writes will fail with ENOSPC errors. The goal of the garbage collection code is to erase the first flash block in the log. At each pass, the node at the head of the log is examined. If the node is obsolete, it is skipped and the head moves on to the next node. If the node is still valid, it must be rendered obsolete. The garbage collection code does so by writing out a new data or metadata node to the tail of the log.

While the original JFFS had only one type of node on the medium, JFFS2 is more flexible, allowing new types of node to be defined while retaining backward compatibility through use of a scheme inspired by the compatibility bitmasks of the ext2 file system. Every type of node starts with a common header containing the full node length, node type and a cyclic redundancy checksum (CRC).
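The description of a JFFS v1 log node above can be summarized by a simplified, hypothetical layout; the field names and widths below are illustrative only and do not reproduce the exact on-flash format.

#include <stdint.h>

/* Illustrative sketch of a JFFS v1 log node (not the exact on-flash layout). */
struct raw_inode_sketch {
    uint32_t ino;          /* inode this node belongs to                    */
    uint32_t pino;         /* inode number of the parent directory          */
    uint32_t version;      /* total ordering of nodes of the same inode     */
    uint32_t uid, gid;     /* ownership metadata carried by every node      */
    uint32_t atime, mtime, ctime;
    uint32_t offset;       /* file offset at which the payload data appear  */
    uint32_t dsize;        /* number of payload data bytes that follow      */
    uint8_t  name_len;     /* the inode's name is stored in each node too   */
    /* followed by: name[name_len], data[dsize]                             */
};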
Aside from the differences in the individual nodes, the high-level layout of JFFS2 also changed from a single circular log format, because of the problem caused by strictly garbage collecting in order. In JFFS2, each erase block is treated individually, and nodes may not overlap erase block boundaries as they did in the original JFFS. This means that the garbage collection code can work with increased efficiency by collecting from one block at a time and making intelligent decisions about which block to garbage collect from next.

In traditional file systems the index is usually kept and maintained on the media but, unfortunately, this is not the case for JFFS2. In JFFS2, the index is maintained in RAM, not on the flash media. And this is the root of all the JFFS2 scalability problems. Of course, by having the index in RAM JFFS2 achieves extremely high file system throughput, just because it does not need to update the index on flash after something has been changed in the file system. And this works very well for relatively small flashes, for which JFFS2 was originally designed. But as soon as one tries to use JFFS2 on large flashes (starting from about 128MB), many problems come up. JFFS2 needs to build the index in RAM when it mounts the file system. For this reason, it needs to scan the whole partition in order to locate all the nodes which are present there. So, the larger the JFFS2 partition and the more nodes it has, the longer it takes to mount. Second, the index obviously consumes some RAM, and the larger the JFFS2 file system and the more nodes it has, the more memory is consumed.
UBIFS (Unsorted Block Image File System) [37, 38] follows a node-structured design that enables its garbage collector to read eraseblocks directly and determine what data needs to be moved and what can be discarded, and to update its indexes accordingly. The combination of data and metadata is called a node. Each node records which file (more specifically, inode number) the node belongs to and what data (for example, file offset and data length) is contained in the node. The big difference between JFFS2 and UBIFS is that UBIFS stores the index on flash, whereas JFFS2 stores the index only in main memory, rebuilding it when the file system is mounted. Potentially that places a limit on the maximum size of a JFFS2 file system, because the mount time and memory usage grow linearly with the size of the flash. UBIFS was designed specifically to overcome that limitation.

The master node stores the position of all on-flash structures that are not at fixed logical positions. The master node itself is written repeatedly to logical eraseblocks (LEBs) one and two. LEBs are an abstraction created by UBI. UBI maps physical eraseblocks (PEBs) to LEBs, so LEB one and two can be anywhere on the flash media (strictly speaking, the UBI device), however UBI always records where they are. Two eraseblocks are used in order to keep two copies of the master node. This is done for the purpose of recovery, because there are two situations that can cause a corrupt or missing master node. LEB zero stores the superblock node. The superblock node contains file system parameters that change rarely if at all. For example, the flash geometry (eraseblock size, number of eraseblocks, etc.) is stored in the superblock node. The other UBIFS areas are: the log area (or simply the log), the LEB properties tree (LPT) area, the orphan area and the main area. The log is a part of UBIFS's journal.

The purpose of the UBIFS journal is to reduce the frequency of updates to the on-flash index. The index consists of the top part of the wandering tree that is made up of only index nodes, so to update the file system a leaf node must be added or replaced in the wandering tree and all the ancestral index nodes updated accordingly. It would be very inefficient if the on-flash index were updated every time a leaf node was written, because many of the same index nodes would be written repeatedly, particularly towards the top of the tree. Instead, UBIFS defines a journal where leaf nodes are written but not immediately added to the on-flash index. Note that the index in memory (see TNC) is updated. Periodically, when the journal is considered reasonably full, it is committed. The commit process consists of writing the new version of the index and the corresponding master node.

After the log area comes the LPT area. The size of the log area is defined when the file system is created and consequently so is the start of the LPT area. At present, the size of the LPT area is automatically calculated based on the LEB size and maximum LEB count specified when the file system is created. Like the log area, the LPT area must never run out of space. Unlike the log area, updates to the LPT area are not sequential in nature - they are random. In addition, the amount of LEB properties data is potentially quite large and access to it must be scalable. The solution is to store LEB properties in a wandering tree. In fact, the LPT area is much like a miniature file system in its own right. It has its own LEB properties - that is, the LEB properties of the LEB properties area (called ltab).
It has its own form of garbage collection. It has its own node structure that packs the nodes as tightly as possible into bit-fields. However, like the index, the LPT area is updated only during commit. Thus the on-flash index and the on-flash LPT represent what the file system looked like as at the last commit. The difference between that and the actual state of the file system is represented by the nodes in the journal.

The next UBIFS area to describe is the orphan area. An orphan is an inode number whose inode node has been committed to the index with a link count of zero. That happens when an open file is deleted (unlinked) and then a commit is run. In the normal course of events the inode would be deleted when the file is closed. However, in the case of an unclean unmount, orphans need to be accounted for. After an unclean unmount, the orphans' inodes must be deleted, which means either scanning the entire index looking for them, or keeping a list on flash somewhere. UBIFS implements the latter approach.

The final UBIFS area is the main area. The main area contains the nodes that make up the file system data and the index. A main area LEB may be an index eraseblock or a non-index eraseblock. A non-index eraseblock may be a bud (part of the journal) or have been committed. A bud may be currently one of the journal heads. A LEB that contains committed nodes can still become a bud if it has free space. Thus a bud LEB has an offset from which journal nodes begin, although that offset is usually zero.

There are three important differences between UBIFS and JFFS2. The first has already been mentioned: UBIFS has an on-flash index and JFFS2 does not - thus UBIFS is potentially scalable. The second difference is implied: UBIFS runs on top of the UBI layer which runs on top of the MTD subsystem, whereas JFFS2 runs directly over MTD. UBIFS benefits from the wear-leveling and error handling of UBI at the cost of the flash space, memory and other resources taken by UBI. The third important difference is that UBIFS allows writeback.
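The "wandering tree" update that the UBIFS journal tries to amortize can be sketched as follows. This is a simplified, assumed model rather than UBIFS code: replacing a leaf out of place forces every ancestor index node up to the root to be rewritten as well, so batching leaf updates in a journal and committing the index periodically saves many index writes.

#include <stdlib.h>

struct index_node {
    struct index_node *child[16];   /* fan-out is illustrative             */
    int                on_flash;    /* 1 = node already committed to flash */
};

/*
 * Copy-on-write update of one path: every node from the root down to the
 * changed leaf is replaced by a fresh copy that will be written to a new
 * location on flash.  Committing once per N leaf updates (the journal)
 * rewrites the shared upper nodes once instead of N times.
 */
static struct index_node *cow_path(struct index_node *node, int depth,
                                   const int *path)
{
    struct index_node *copy = malloc(sizeof(*copy));
    *copy = *node;
    copy->on_flash = 0;                         /* must be written again   */
    if (depth > 0) {
        int slot = path[0];
        copy->child[slot] = cow_path(node->child[slot], depth - 1, path + 1);
    }
    return copy;                                /* new root of the subtree */
}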
Yaffs (Yet Another Flash File System) [39] contains objects. An object is anything that is stored in the file system. These are: (1) regular data files, (2) directories, (3) hard links, (4) symbolic links, (5) special objects (pipes, devices etc). All objects are identified by a unique integer objectId. In Yaffs, the unit of allocation is the chunk. Typically a chunk will be the same as a NAND page, but there is flexibility to use chunks which map to multiple pages.

Many, typically 32 to 128 but as many as a few hundred, chunks form a block. A block is the unit of erasure. NAND flash may be shipped with bad blocks, and further blocks may go bad during the operation of the device. Thus, Yaffs is aware of bad blocks and needs to be able to detect and mark bad blocks. NAND flash also typically requires the use of some sort of error detection and correction code (ECC). Yaffs can either use existing ECC logic or provide its own.

Yaffs2 has a true log structure. A true log structured file system only ever writes sequentially. Instead of writing data in locations specific to the files, the file system data is written in the form of a sequential log. The entries in the log are all one chunk in size and can hold one of two types of chunk: (1) Data chunk - a chunk holding regular data file contents, (2) Object Header - a descriptor for an object (directory, regular data file, hard link, soft link, special descriptor, ...). This holds details such as the identifier for the parent directory, object name, etc. Each chunk has tags associated with it. The tags comprise the following important fields: (1) ObjectId - identifies which object the chunk belongs to, (2) ChunkId - identifies where in the file this chunk belongs, (3) Deletion Marker - (Yaffs1 only) shows that this chunk is no longer in use, (4) Byte Count - number of bytes of data if this is a data chunk, (5) Serial Number - (Yaffs1 only) serial number used to differentiate chunks with the same objectId and chunkId.

When a block is made up only of deleted chunks, that block can be erased and reused. Otherwise, Yaffs needs to copy the valid data chunks off a block, deleting the originals and allowing the block to be erased and reused. This process is referred to as garbage collection. If garbage collection is aggressive, the whole block is collected in a single garbage collection cycle. If the collection is passive, then the number of copies is reduced, thus spreading the effort over many garbage collection cycles. This is done to reduce garbage collection load and improve responsiveness. The rationale behind the above heuristics is to delay garbage collection when possible to reduce the amount of collection that needs to be performed, thus increasing average system performance. Yet there is a conflicting goal of trying to spread out garbage collection so that it does not all happen at the same time, causing fluctuations in file system throughput. These conflicting goals make garbage collection tuning quite challenging.

Mount scanning takes quite a lot of time and slows mounting. Checkpointing is a mechanism to speed up mounting by taking a snapshot of the Yaffs runtime state at unmount or sync() and then reconstituting the runtime state on remounting. The actual checkpoint mechanism is quite simple. A stream of data is written to a set of blocks which are marked as holding checkpoint data, and the important runtime state is written to the stream.
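The per-chunk tags enumerated above map naturally onto a small structure. The sketch below is illustrative only and does not reproduce the exact Yaffs tag layout.

#include <stdint.h>

/* Illustrative sketch of Yaffs per-chunk tags (not the exact layout). */
struct yaffs_tags_sketch {
    uint32_t object_id;     /* which object the chunk belongs to            */
    uint32_t chunk_id;      /* 0 = object header, n = n-th data chunk       */
    uint32_t byte_count;    /* valid data bytes if this is a data chunk     */
    uint8_t  serial;        /* Yaffs1: disambiguates same object/chunk ids  */
    uint8_t  deleted;       /* Yaffs1: deletion marker                      */
};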
NAFS (NAND flash memory Array File System) [46] consists of a Conventional File System and the NAND Flash Memory Array Interface; the former provides the users with basic file operations while the latter allows concurrent accesses to multiple NAND flash memories through a striping technique in order to increase I/O performance. Also, parity bits are distributed across all flash memories in the array to provide fault tolerance like RAID5.

The NAND flash memory is partitioned into two areas: one for the superblock addresses and the other for the superblock itself, inodes, and data. In order to provide uniform wear-leveling, the superblock is stored at a random location in the Data/Superblock/Inode-block Partition while its address is stored in the Superblock Address Partition. NAFS attempts to write file data consecutively into each block of NAND flash memory for better read and write performance.

In addition, NAFS adopts a new double list cache scheme that takes into account the characteristics of both large-capacity storage and NAND flash memory in order to increase I/O performance. The double list cache makes it possible to defer write operations and increase the cache hit ratio by prefetching relevant pages through data striping of the NAND Flash Memory Array Interface. The double list cache consists of the clean list for the actual caching and the dirty list for monitoring and analyzing page reference patterns. The dirty list maintains dirty pages in order to reduce their search times, and the clean list maintains clean pages. All the pages that are brought into memory by read operations are inserted into the clean list. If clean pages in the clean list are modified by write operations, they are removed from the clean list and inserted into the head of the dirty list. Also, if a new file is created, its new pages are inserted into the head of the dirty list. If a page fault occurs, clean pages are removed from the tail of the clean list.

When NAFS performs delayed write operations using the cache, since two blocks are assigned to each NAND flash memory, it is always guaranteed that file data can be written contiguously within at least one block. In addition, NAFS performs delayed write operations in the dirty list, resulting in a reduction of the number of write operations and consecutive write operations of file data in each block of NAND flash memory.
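The double list cache described above can be modeled with two linked lists. The sketch below captures only the promotion rules from the text (pages enter the clean list on read, move to the head of the dirty list when modified, and eviction takes clean pages from the tail of the clean list); the structure and helper names are assumptions, not NAFS code.

#include <stddef.h>

struct page_node {
    struct page_node *prev, *next;
    int dirty;
};

struct list { struct page_node *head, *tail; };

static void list_remove(struct list *l, struct page_node *p)
{
    if (p->prev) p->prev->next = p->next; else l->head = p->next;
    if (p->next) p->next->prev = p->prev; else l->tail = p->prev;
    p->prev = p->next = NULL;
}

static void list_push_head(struct list *l, struct page_node *p)
{
    p->prev = NULL; p->next = l->head;
    if (l->head) l->head->prev = p; else l->tail = p;
    l->head = p;
}

struct double_list_cache { struct list clean, dirty; };

/* Page brought in by a read: it joins the clean list. */
static void cache_on_read(struct double_list_cache *c, struct page_node *p)
{
    p->dirty = 0;
    list_push_head(&c->clean, p);
}

/* Clean page modified by a write: move it to the head of the dirty list. */
static void cache_on_write(struct double_list_cache *c, struct page_node *p)
{
    if (!p->dirty) {
        list_remove(&c->clean, p);
        p->dirty = 1;
        list_push_head(&c->dirty, p);
    }
}

/* Page fault: evict a clean page from the tail of the clean list. */
static struct page_node *cache_evict(struct double_list_cache *c)
{
    struct page_node *victim = c->clean.tail;
    if (victim) list_remove(&c->clean, victim);
    return victim;
}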
CFFS (Core Flash File System) [49], which is another file system based on YAFFS, stores index entries and metadata in index blocks which are distinct from data blocks. Since CFFS just reads in the index blocks during mount, its mount time is faster than YAFFS2's. Furthermore, since frequently modified metadata are collected and stored in index blocks, garbage collection performance of CFFS is better than YAFFS2's. However, since CFFS stores the physical addresses of index blocks in the first block of NAND flash memory in order to reduce mount time, wear-leveling performance of CFFS is worse than others due to frequent erasure of the first block.
NAMU (NAnd flash Multimedia file system) [48] takes into consideration the characteristics of both NAND flash memory and multimedia files. NAMU utilizes an index structure that is suitable for large-capacity files to shorten the mount time by scanning only index blocks located in the index area during mount. In addition, since NAMU manages data in the segment unit rather than in the page unit, NAMU's memory usage efficiency is better than that of JFFS2 and YAFFS2.
MNFS (novel mobile multimedia file system) [50] introduces (1) hybrid mapping, (2) block-based file allocation, (3) an in-core only Block Allocation Table (iBAT), and (4) upward directory representation. Using these methods, MNFS achieves uniform write-responses, quick mounting, and a small memory footprint.

The hybrid mapping scheme means that MNFS uses a page mapping scheme (log-structured method) for the metadata by virtue of the frequent updates. On the other hand, a block mapping scheme is used for user data, because it is rarely updated in mobile multimedia devices. The entire flash memory space is logically divided into two variable-sized areas: the Metadata area and the User data area. MNFS uses a log structure to manage the file system metadata. The metadata area is a collection of log blocks that contain file system metadata; the page mapping scheme is used for this area. The user data area is a collection of data blocks that contains multimedia file data; the block mapping scheme is used for this area. A multimedia file, e.g. a music or video clip, is an order of magnitude larger than a text-based file. Therefore, MNFS uses a larger allocation unit than the block size (usually 4 Kbytes) typically found in a legacy general purpose file system. MNFS defines the allocation unit of the file system as a block of NAND flash memory. The block size of NAND flash memory ranges from 16 Kbyte to 128 Kbyte, and this size is device specific.

MNFS uses the iBAT, which is similar to the File Allocation Table in the FAT file system, for both uniform write-responses and robustness of the file system. There are two important differences between the FAT and the iBAT. First, the iBAT is not stored in the flash memory. Like the in-memory tree structure in YAFFS, the iBAT is dynamically constructed, at mount time, in the main memory (RAM) by scanning the spare area of all the blocks. Secondly, the iBAT uses block-based allocation whereas the FAT uses cluster-based allocation. In the FAT file system, as the file size grows, a new cluster is allocated, requiring modification of the file allocation table in the storage device. Access to the storage device for the metadata update not only affects the response time of the write request, but it can also invoke file system inconsistency when the system crashes during the update. In MNFS, the iBAT is not stored separately in the flash memory, and the block allocation information is stored in the spare area of the block itself while the block is allocated to a file. These two differences make MNFS more robust than the FAT file system.

MNFS uses the upward directory representation method. In this method, each directory entry in the log block has its parent directory entry ID. That is, the child entry points to its parent entry. The directory structure of the file system can be represented using this parent directory entry ID. For the upward directory representation method, it is necessary to read all of the directory entries in order to construct the directory structure of the file system in the memory.
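A sketch of how an in-core only allocation table like the iBAT might be rebuilt at mount time by scanning per-block spare areas is shown below; the structures and the spare-area fields are assumptions for illustration, not the actual MNFS on-flash format.

#include <stdint.h>

#define NUM_BLOCKS 4096

/* Assumed spare-area content: which file owns the block, and its position. */
struct spare_info {
    uint32_t file_id;      /* 0 means the block is free                    */
    uint32_t file_block;   /* ordinal of this block inside the file        */
};

static struct spare_info spare_area[NUM_BLOCKS]; /* stands in for the device */

static void read_spare_area(uint32_t block, struct spare_info *out)
{
    *out = spare_area[block];          /* stands in for a real spare read   */
}

/* In-core only table: nothing below is ever written back to flash. */
struct ibat {
    uint32_t owner[NUM_BLOCKS];        /* file_id owning each flash block   */
    uint32_t position[NUM_BLOCKS];     /* block's ordinal within that file  */
};

/* Mount-time construction: one spare-area read per block, no metadata writes. */
static void ibat_build(struct ibat *t)
{
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        struct spare_info s;
        read_spare_area(b, &s);
        t->owner[b]    = s.file_id;
        t->position[b] = s.file_block;
    }
}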
NILFS (New Implementation of a Log-structured File System) [40, 41] has an on-disk layout divided into several parts: (1) superblock, (2) full segment, (3) partial segment, (4) logical segment, (5) segment management block. The superblock has the parameters of the file system, the disk block address of the latest segment being written, etc. Each full segment consists of a fixed length of disk blocks. This is the basic management unit of the garbage collector. The partial segment is the write unit. Dirty buffers are written out as partial segments. The partial segment does not exceed the full segment boundaries. The partial segment sequence includes inseparable directory operations. For example, a logical segment could consist of two partial segments. In the recovery operations, the two partial segments are treated as one inseparable segment. There are two flag bits, Logical Begin and Logical End, at the segment summary of the partial segment.

NILFS adopts the B-tree structure for both file block mapping and inode block mapping. The two mappings are implemented in the common B-tree operation routine. The B-tree intermediate node is used to construct the B-tree. It has 64-bit-wide key and 64-bit-wide pointer pairs. The file block B-tree uses a file block address as its key, whereas the inode block B-tree uses an inode number as its key. The root block number of the file block B-tree is stored in the corresponding inode block. The root block number of the inode block B-tree is stored in the superblock of the file system. So, there is only one inode block B-tree in the file system. File blocks, B-tree blocks for file block management, inode blocks, and B-tree blocks for inode management are written to the disk as logs. A newly created file first exists only in the memory page cache. Because the file must be accessible before being written to the disk, the B-tree structure exists even in memory. The B-tree intermediate node in memory is on the memory page cache, and the data structures are the same as those of the disk blocks. The pointer of the B-tree node stored in memory holds the disk block number or the memory address of the page cache that reads the block. When looking up a block in the B-tree, if the pointer of the B-tree node is a disk block number, the disk block is read into a newly allocated page cache before the pointer is rewritten. The original disk block number remains in the buffer-head structure on the page cache.

The partial segment consists of three parts: (1) The segment summary keeps the block usage information of the partial segment. The main contents are checksums of the data area, the segment summary, the length of the partial segment, and the partial segment creation time. (2) The data area contains file data blocks, file data B-tree node blocks, inode blocks, and inode block B-tree node blocks, in order. (3) A checkpoint is placed at the tail of the partial segment. The checkpoint includes a checksum of the checkpoint itself. Checkpoint accuracy means that the partial segment was successfully written to the disk. The most important information in the checkpoint is the root block number of the inode block B-tree.
The block number is written out last, and the whole file system state is updated.

The data write process, started by the sync system call and the NILFS kernel thread, advances in the following order: (1) Lock the directory operations, (2) The dirty pages of the file data are gathered from its radix-tree, (3) The dirty B-tree intermediate node pages of both file block management and inode management are gathered, (4) The dirty inode block pages are gathered, (5) The B-tree intermediate node pages which will become dirty because their registered block addresses are renewed are gathered, (6) New disk block addresses are assigned to those blocks in order of file data blocks, B-tree node blocks for file data, inode blocks, B-tree node blocks for inodes, (7) Rewrite the disk block addresses to new ones in the radix-tree and B-tree nodes, (8) Call the block device input/output routine to write out the blocks, (9) Unlock the directory operations. The NILFS snapshot is a whole consistent file system at some time instant.

In LFS, all blocks remain as is (until they are collected by garbage collection); therefore, no new information is needed to make a snapshot. In NILFS, the B-tree structure manages the file and inode blocks, and B-tree nodes are written out as a log too. So, the root block number of the inode management B-tree is the snapshot of the NILFS file system. The root block number is stored in the checkpoint position of a partial segment. The NILFS checkpoint is the snapshot of the file system itself. Actually, the user can specify the disk block address of the NILFS checkpoint to Linux using the "mount" command, and the captured file system is mounted as a read-only file system. However, if the user were to keep all checkpoints as snapshots, there would be no disk space for garbage collection. The user can select any checkpoint as a snapshot, and the garbage collector collects other checkpoint blocks.
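Because the root block number of the inode B-tree recorded in each checkpoint fully determines a file system state, a checkpoint can be represented conceptually as a tiny structure. The layout below is an illustrative sketch, not the on-disk NILFS format.

#include <stdint.h>

/* Illustrative checkpoint written at the tail of every partial segment. */
struct checkpoint_sketch {
    uint64_t checkpoint_no;    /* monotonically increasing sequence number  */
    uint64_t create_time;      /* partial segment creation time             */
    uint64_t ifile_btree_root; /* root block of the inode-block B-tree:     */
                               /* mounting read-only at this root yields    */
                               /* the snapshot of the whole file system     */
    uint32_t checksum;         /* checkpoint validity check                 */
};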
F2FS (Flash-Friendly File System) [43] employs three configurable units: segment, section and zone. It allocates storage blocks in the unit of segments from a number of individual zones. It performs "cleaning" in the unit of section. These units are introduced to align with the underlying FTL's operational units to avoid unnecessary (yet costly) data copying.

F2FS introduced a cost-effective index structure in the form of a node address table with the goal to attack the "wandering tree" problem. Also, multi-head logging was suggested. F2FS uses an effective hot/cold data separation scheme applied during logging time (i.e., block allocation time). It runs multiple active log segments concurrently and appends data and metadata to separate log segments based on their anticipated update frequency. Since flash storage devices exploit media parallelism, multiple active segments can run simultaneously without frequent management operations. F2FS builds basically on append-only logging to turn random writes into sequential ones. At high storage utilization, however, it changes the logging strategy to threaded logging to avoid long write latency. In essence, threaded logging writes new data to free space in a dirty segment without cleaning it in the foreground. F2FS optimizes small synchronous writes to reduce the latency of fsync requests, by minimizing required metadata writes and recovering synchronized data with an efficient roll-forward mechanism.

F2FS divides the whole volume into fixed-size segments. The segment is a basic unit of management in F2FS and is used to determine the initial file system metadata layout. A section is comprised of consecutive segments, and a zone consists of a series of sections. F2FS splits the entire volume into six areas: (1) Superblock (SB), (2) Checkpoint (CP), (3) Segment Information Table (SIT), (4) Node Address Table (NAT), (5) Segment Summary Area (SSA), (6) Main Area.

The Superblock (SB) has the basic partition information and default parameters of F2FS, which are given at the format time and are not changeable. The Checkpoint (CP) keeps the file system status, bitmaps for valid NAT/SIT sets, orphan inode lists and summary entries of currently active segments. The Segment Information Table (SIT) contains per-segment information such as the number of valid blocks and the bitmap for the validity of all blocks in the "Main" area. The SIT information is retrieved to select victim segments and identify valid blocks in them during the cleaning process. The Node Address Table (NAT) is a block address table to locate all the "node blocks" stored in the Main area. The Segment Summary Area (SSA) stores summary entries representing the owner information of all blocks in the Main area, such as parent inode number and its node/data offsets. The SSA entries identify parent node blocks before migrating valid blocks during cleaning. The Main Area is filled with 4KB blocks. Each block is allocated and typed to be node or data. A node block contains an inode or indices of data blocks, while a data block contains either directory or user file data. Note that a section does not store data and node blocks simultaneously.

F2FS utilizes the "node" structure that extends the inode map to locate more indexing blocks. Each node block has a unique identification number, "node ID". By using node ID as an index, the NAT serves the physical locations of all node blocks. A node block represents one of three types: inode, direct and indirect node. An inode block contains a file's metadata, such as file name, inode number, file size, atime and dtime.
A direct node block contains block addresses of data, and an indirect node block has node IDs locating other node blocks. In F2FS, a 4KB directory entry ("dentry") block is composed of a bitmap and two arrays of slots and names in pairs. The bitmap tells whether each slot is valid or not. A slot carries a hash value, inode number, length of a file name and file type (e.g., normal file, directory and symbolic link). A directory file constructs multi-level hash tables to manage a large number of dentries efficiently.

F2FS maintains six major log areas to maximize the effect of hot and cold data separation. F2FS statically defines three levels of temperature (hot, warm and cold) for node and data blocks. Direct node blocks are considered hotter than indirect node blocks since they are updated much more frequently. Indirect node blocks contain node IDs and are written only when a dedicated node block is added or removed. Direct node blocks and data blocks for directories are considered hot, since they have obviously different write patterns compared to blocks for regular files. Data blocks satisfying one of the following three conditions are considered cold: (1) Data blocks moved by cleaning, (2) Data blocks labeled "cold" by the user, (3) Multimedia file data.

F2FS performs cleaning in two distinct manners, foreground and background. Foreground cleaning is triggered only when there are not enough free sections, while a kernel thread wakes up periodically to conduct cleaning in the background. A cleaning process takes three steps: (1) Victim selection, (2) Valid block identification and migration, (3) Post-cleaning process.

The cleaning process starts by identifying a victim section among non-empty sections. There are two well-known policies for victim selection during LFS cleaning: greedy and cost-benefit. The greedy policy selects a section with the smallest number of valid blocks. Intuitively, this policy controls the overheads of migrating valid blocks. F2FS adopts the greedy policy for its foreground cleaning to minimize the latency visible to applications. Moreover, F2FS reserves a small unused capacity (5% of the storage space by default) so that the cleaning process has room for adequate operation at high storage utilization levels. On the other hand, the cost-benefit policy is practiced in the background cleaning process of F2FS. This policy selects a victim section not only based on its utilization but also its "age". F2FS infers the age of a section by averaging the ages of the segments in the section, which, in turn, can be obtained from their last modification time recorded in SIT. With the cost-benefit policy, F2FS gets another chance to separate hot and cold data.

After selecting a victim section, F2FS must identify valid blocks in the section quickly. To this end, F2FS maintains a validity bitmap per segment in SIT. Once having identified all valid blocks by scanning the bitmaps, F2FS retrieves parent node blocks containing their indices from the SSA information. If the blocks are valid, F2FS migrates them to other free logs. For background cleaning, F2FS does not issue actual I/Os to migrate valid blocks. Instead, F2FS loads the blocks into the page cache and marks them as dirty. Then, F2FS just leaves them in the page cache for the kernel worker thread to flush them to the storage later.
This lazy migration not only alleviates the performance impact on foreground I/O activities, but also allows small writes to be combined. Background cleaning does not kick in when normal I/O or foreground cleaning is in progress.

After all valid blocks are migrated, a victim section is registered as a candidate to become a new free section (called a "pre-free" section in F2FS). After a checkpoint is made, the section finally becomes a free section, to be reallocated. This is done because if a pre-free section were reused before checkpointing, the file system might lose the data referenced by a previous checkpoint when an unexpected power outage occurs.
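The post-cleaning rule above (a cleaned section may be reused only after a checkpoint) can be sketched as a small state machine; the names below are illustrative, not F2FS kernel identifiers.

enum section_state {
    SEC_FREE,       /* may be allocated for new logs                        */
    SEC_IN_USE,     /* holds valid node/data blocks                         */
    SEC_PRE_FREE,   /* all valid blocks migrated, waiting for a checkpoint  */
};

/* Cleaning migrated all valid blocks out of the victim section. */
static enum section_state on_cleaning_done(enum section_state s)
{
    return (s == SEC_IN_USE) ? SEC_PRE_FREE : s;
}

/*
 * Only a checkpoint turns pre-free sections into free ones.  Reusing the
 * section earlier could destroy blocks still referenced by the previous
 * checkpoint, breaking recovery after a sudden power loss.
 */
static enum section_state on_checkpoint(enum section_state s)
{
    return (s == SEC_PRE_FREE) ? SEC_FREE : s;
}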
SFS (SSD File System) [62] is based on three design principles. A file system should exploit the file block semantics directly. It needs to take a log-structured approach based on the observation that the random write bandwidth is much slower than the sequential one. The existing lazy data grouping in LFS during segment cleaning fails to fully utilize the skewness in write patterns, and the authors argue that an eager data grouping is necessary to achieve sharper bimodality in segment utilization. SFS takes a log-structured approach that turns random writes at the file level into sequential writes at the LBA level. Moreover, in order to utilize nearly 100% of the raw SSD bandwidth, the segment size is set to a multiple of the clustered block size. The result is that the performance of SFS will be limited by the maximum sequential write performance regardless of random write performance.

It shows that, if hot data and cold data are grouped into separate segments, the segment utilization distribution becomes bimodal: most of the segments are almost either full or empty of live blocks. Therefore, because the segment cleaner can almost always work with nearly empty segments, the cleaning overhead will be drastically reduced. To form a bimodal distribution, LFS uses a cost-benefit policy for segment cleaning that prefers cold segments to hot segments. However, previous studies show that even the cost-benefit policy performs poorly under a large segment size (e.g., 8 MB), because the increased segment size makes it harder to find nearly empty segments. With SSD, the cost-benefit policy encounters a dilemma: a small segment size enables LFS to form a bimodal distribution, but small random writes caused by the small segment severely degrade the write performance of SSD. Instead of separating the data lazily on segment cleaning after writing them regardless of their hotness, SFS classifies data proactively on writing using file block level statistics, as well as on segment cleaning. In such eager data grouping, since segments are already composed of homogeneous data with similar update likelihood, the segment cleaning overhead will be significantly reduced. In particular, the I/O skewness commonly found in many real workloads will make this more attractive.

SFS has four core operations: segment writing, segment cleaning, reading, and crash recovery. The first step of segment writing in SFS is to determine the hotness criteria for block grouping. This is, in turn, determined by segment quantization that quantizes a range of hotness values into a single hotness value for a group. It is assumed that there are four segment groups: hot, warm, cold, and read-only groups. The second step is to calculate the block hotness for each dirty block and assign it to the nearest quantized group by comparing the block hotness and the group hotness. At this point, those blocks with similar hotness levels should belong to the same group. The third step is to fill a segment with blocks belonging to the same group. If the number of blocks in a group is not enough to completely fill a segment, the segment writing of the group is deferred until the group grows to completely fill a segment. This eager grouping of file blocks according to hotness serves to colocate blocks with similar update likelihoods in the same segment.

Segment cleaning in SFS consists of three steps: select victim segments, read the live blocks from the victim segments into the page cache and mark the live blocks as dirty, and trigger the writing process.
The writing process treats the live blocks from victim segments the same as normal blocks; each live block is classified into a specific quantized group according to its hotness. After all the live blocks are read into the page cache, the victim segments are marked as free so that they can be reused for writing. For better victim segment selection, a cost-hotness policy is introduced, which takes into account both the number of live blocks in the segment (i.e., cost) and the segment hotness.
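As a rough illustration of the grouping step just described, the sketch below (in C) assigns each dirty block to the group whose quantized hotness is closest to the block's own hotness. The group hotness values and the nearest-group rule are invented for the example; this is a minimal sketch of the idea, not the grouping code of SFS.

#include <math.h>

enum sfs_group { GROUP_HOT, GROUP_WARM, GROUP_COLD, GROUP_READ_ONLY, GROUP_NR };

/* Quantized hotness of each group, e.g. produced by segment quantization
 * over the observed hotness distribution (values invented for illustration). */
static const double group_hotness[GROUP_NR] = { 100.0, 10.0, 1.0, 0.0 };

/* Assign a dirty block to the group with the nearest quantized hotness. */
static enum sfs_group classify_block(double block_hotness)
{
    enum sfs_group best = GROUP_HOT;
    double best_dist = fabs(block_hotness - group_hotness[GROUP_HOT]);

    for (int g = 1; g < GROUP_NR; g++) {
        double dist = fabs(block_hotness - group_hotness[g]);
        if (dist < best_dist) {
            best_dist = dist;
            best = (enum sfs_group)g;
        }
    }
    return best;
}

Blocks mapped to the same group are then packed into the same segment, which is what produces the bimodal segment utilization described above.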
Wu, et al. [132] proposed the greedy algorithm (GR) for garbage collection. The greedy algorithm selects the block with the fewest valid pages as the victim block for garbage collection. This approach reduces the overhead required for copying valid pages within the victim block to free space during garbage collection. However, the GR algorithm does not take into account wear leveling in flash-based consumer electronic devices. It has been shown that the GR algorithm performs well in terms of wear leveling for random memory accesses but does not perform well for memory accesses with high spatial locality of reference. Kawaguchi, et al. [133] proposed the cost-benefit (CB) algorithm for flash memory. CB calculates a cost-benefit value for each block and selects the block with the highest value as a victim. The cost-benefit value for a block is calculated as (age * (1 - u)) / 2u, where age is the elapsed time since the last modification of a page within the block and u is the fraction of valid pages within the block. Because the CB algorithm takes into account both the age of invalid pages and the fraction of valid pages in a block, it can provide improved wear leveling in flash-based consumer electronic devices. However, because the CB algorithm does not take into account the erase count of each block, its wear-leveling performance is not sufficient. Chiang, et al. [134] proposed the cost-age-time (CAT) algorithm, which extends the CB algorithm by considering the erase count of each block when selecting a victim block. The CAT algorithm attempts to maintain a balance between reducing the garbage collection overhead and improving wear leveling in flash memory.
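As a rough illustration of the cost-benefit policy above, the following sketch (in C, with hypothetical structures and field names) scores each candidate block by (age * (1 - u)) / 2u and picks the highest-scoring block as the victim. It is a minimal sketch of the formula, not the implementation evaluated in [133].

#include <stddef.h>
#include <stdint.h>

struct flash_block {
    uint32_t valid_pages;   /* number of still-valid pages in the block */
    uint32_t total_pages;   /* pages per erase block */
    uint64_t last_mod_time; /* timestamp of the most recent page write */
};

/* Cost-benefit score: (age * (1 - u)) / (2 * u), where u is utilization. */
static double cost_benefit(const struct flash_block *blk, uint64_t now)
{
    double u = (double)blk->valid_pages / (double)blk->total_pages;
    double age = (double)(now - blk->last_mod_time);

    if (u == 0.0)           /* a fully invalid block is the ideal victim */
        return 1e30;
    return (age * (1.0 - u)) / (2.0 * u);
}

/* Select the block with the highest cost-benefit value as the GC victim. */
static size_t select_victim(const struct flash_block *blocks, size_t count,
                            uint64_t now)
{
    size_t victim = 0;
    double best = -1.0;

    for (size_t i = 0; i < count; i++) {
        double score = cost_benefit(&blocks[i], now);
        if (score > best) {
            best = score;
            victim = i;
        }
    }
    return victim;
}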
Syu, et al. [87] developed a mechanism that takes advantage of FTL inactive time between requests and actively launches tasks to reclaim invalidated space. The underlying concept is that the timing impact can be reduced by distributing the one-time cost of space recycling. The active space recycling mechanism is developed on the basis of a task model composed of four tasks, namely the manager task, the collector task, the eraser task, and the read/write handler task. Jobs of these four tasks are released in a fixed time interval. The jobs of the first three tasks follow a precedence constraint: the manager task precedes the collector task, and the collector task precedes the eraser task. The read/write handler task is assigned the lowest priority among the four tasks. At the start of each time interval, a job of the manager task is released. It is responsible for determining the amount of invalidated space to be reclaimed. It gathers the necessary statistics and computes the number of pages holding invalidated data to be reclaimed. The mission of the collector task is to collect dirty blocks holding invalid pages and maintain a garbage queue holding these blocks. The collector task first selects "right" dirty blocks to form a candidate list. It then moves, from the candidate list, the block with the maximum invalidated space into the garbage queue. The valid data, if any, in the dirty block is copied to other free space before the block is put into the garbage queue. The action carried out by the eraser task is quite straightforward. It removes and erases the dirty blocks in the garbage queue maintained by the collector task. When a block is erased, it is marked as a free block and added to the free block list. Its associated erase count is also updated. The read/write handler task is responsible for carrying out the requests of read or write operations. It is active only when the other three jobs have finished and there are pending or unfinished read/write requests. When executing, the read/write handler task calculates how much time can be used for handling the read/write request before the end of the current time period. If the time left is not enough to complete reading or writing a page, the remaining read/write operations of this request are postponed. Yan, et al. [84] proposed an efficient file-aware garbage collection algorithm, called FaGC. The FaGC algorithm copies valid pages in a victim block to clusters in free blocks according to the calculated update frequency of the associated chunk. The FaGC algorithm adopts a hybrid wear-leveling policy to improve the lifespan of NAND flash memory. To avoid unnecessary garbage collection, a scattering factor is defined and calculated to determine when to trigger the garbage collection policy. A file consists of a series of chunks mapped to physical pages in the flash memory. Each file is assigned a unique number, called a File ID, and each chunk in a file is assigned a unique number, called a Chunk ID. In general, different files have different update frequencies, and different chunks in the same file have different update frequencies. In a file-aware system structure, an update frequency table (UFT) is built in random access memory (RAM) to record the update frequency of each chunk in a file. Each UFT entry contains four values: File ID, Chunk ID, Time, and Freq. Time records the most recent time that a chunk in a file has been updated, and Freq records the frequency with which a chunk has been updated. Simultaneously, the physical-to-logical translation table (PLT) maintains the File ID and Chunk ID for each block and physical page in the flash memory. When a chunk in a file is modified or updated, it is rewritten to another physical page in flash memory according to the out-of-place update scheme. At that time, the File ID and Chunk ID in the PLT are updated. Additionally, when a chunk in a file is modified or updated, the current time is recorded in the UFT and Freq is recalculated accordingly. In general, the block with the fewest valid pages is selected as the victim block to minimize the overhead of the copy operation, as in the GR algorithm. After the victim block is selected, the valid pages in the victim block are copied to free space, and the victim block is then erased and reclaimed. Before the valid pages are copied, the update frequency of the chunk associated with each valid page in the victim block is checked in the PLT and UFT. Therefore, wear leveling is improved by this clustering procedure based on the update frequency. Additionally, the decision of when to trigger garbage collection affects the performance of NAND flash-based consumer electronic devices.
Kuo, et al. [68] suggested an efficient on-line hot-data identification scheme based on a multi-hash-function framework, with the goal of managing the write amplification issue. The proposed framework adopts K independent hash functions to hash a given LBA into multiple entries of an M-entry hash table in order to track the write count of the LBA, where each entry is associated with a counter of C bits. Whenever a write is issued to the FTL, the corresponding LBA is hashed simultaneously by the K given hash functions. Each counter corresponding to the K hashed values (in the hash table) is incremented by one to reflect the fact that the LBA is written again. Whenever an LBA needs to be verified to see whether it is associated with hot data, the LBA is hashed simultaneously and in the same way by the K hash functions. The data addressed by the given LBA is considered hot if the H most significant bits of every counter of the K hashed values contain a non-zero bit.
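The multi-hash counting scheme above can be sketched roughly as follows (C). The parameters K = 2, M = 4096, C = 4, H = 2 are chosen here only for the example, and the hash functions are simple placeholders, not the ones used in [68].

#include <stdbool.h>
#include <stdint.h>

#define K_HASHES   2      /* number of independent hash functions */
#define M_ENTRIES  4096   /* hash table size */
#define C_BITS     4      /* counter width in bits */
#define H_BITS     2      /* most significant bits checked for hotness */

static uint8_t counters[M_ENTRIES];                     /* C-bit counters */
static const uint8_t COUNTER_MAX = (1u << C_BITS) - 1;

/* Two simple placeholder hash functions over the LBA. */
static uint32_t hash_fn(int k, uint64_t lba)
{
    uint64_t h = lba * (k == 0 ? 0x9E3779B97F4A7C15ULL : 0xC2B2AE3D27D4EB4FULL);
    return (uint32_t)(h % M_ENTRIES);
}

/* Called on every write issued to the FTL for this LBA. */
static void record_write(uint64_t lba)
{
    for (int k = 0; k < K_HASHES; k++) {
        uint32_t idx = hash_fn(k, lba);
        if (counters[idx] < COUNTER_MAX)
            counters[idx]++;    /* saturating increment of the C-bit counter */
    }
}

/* An LBA is classified as hot if the H most significant bits of every
 * corresponding counter contain at least one non-zero bit. */
static bool is_hot(uint64_t lba)
{
    uint8_t high_mask = (uint8_t)(((1u << H_BITS) - 1) << (C_BITS - H_BITS));

    for (int k = 0; k < K_HASHES; k++) {
        uint32_t idx = hash_fn(k, lba);
        if ((counters[idx] & high_mask) == 0)
            return false;
    }
    return true;
}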
Jagmohan, et al. [66] proposed a NAND Flash system which uses multi-write coding to reduce write amplification. Multi-write coding allows a NAND Flash page to be written more than once without requiring an intervening block erase. They presented a novel two-write coding technique based on enumerative coding, which achieves linear coding rates with low computational complexity. The proposed technique also seeks to minimize memory wear by reducing the number of programmed cells per page write. Chen, et al. [100] implemented CAFTL (a Content-Aware Flash Translation Layer). CAFTL eliminates duplicate writes and redundant data through a combination of both in-line and out-of-line deduplication. In-line deduplication refers to the case where CAFTL proactively examines the incoming data and cancels duplicate writes before committing a write request to flash. As a 'best-effort' solution, CAFTL does not guarantee that all duplicate writes are examined and removed immediately. Thus CAFTL also periodically scans the flash memory and coalesces redundant data out of line. When a write request is received at the SSD, (1) the incoming data is first temporarily maintained in the on-device buffer; (2) for each updated page in the buffer a hash value, also called a fingerprint, is later computed by a hash engine, which can be a dedicated processor or simply a part of the controller logic; (3) each fingerprint is looked up against a fingerprint store, which maintains the fingerprints of data already stored in the flash memory; (4) if a match is found, which means that a residing data unit holds the same content, the mapping tables, which translate the host-viewable logical addresses to the physical flash addresses, are updated by mapping the logical address to the physical location of the residing data, and correspondingly the write to flash is canceled; (5) if no match is found, the write is performed to the flash memory as a regular write.
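The in-line deduplication path described above can be approximated by the following sketch (C). The fingerprint store is modeled as a trivial linear array, the digest is a stand-in hash, and flash_program_page is a hypothetical helper; these are assumptions for illustration, not CAFTL's actual data structures.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE     4096
#define STORE_SIZE    1024

struct fingerprint_entry {
    uint64_t fingerprint;   /* digest of a page already stored on flash */
    uint32_t ppn;           /* physical page number holding that content */
};

static struct fingerprint_entry store[STORE_SIZE];
static uint32_t store_count;
static uint32_t l2p[1 << 20];   /* logical-to-physical mapping table */

/* Placeholder digest; a real design would use a cryptographic hash. */
static uint64_t digest(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 1099511628211ULL;
    return h;
}

extern uint32_t flash_program_page(const uint8_t *page); /* returns new PPN */

/* Handle one buffered page write: cancel it if the content already exists. */
static void caftl_write(uint32_t lpn, const uint8_t *page)
{
    uint64_t fp = digest(page);

    for (uint32_t i = 0; i < store_count; i++) {
        if (store[i].fingerprint == fp) {
            l2p[lpn] = store[i].ppn;  /* remap to existing data, skip the write */
            return;
        }
    }

    uint32_t ppn = flash_program_page(page);   /* regular write path */
    l2p[lpn] = ppn;
    if (store_count < STORE_SIZE) {
        store[store_count].fingerprint = fp;
        store[store_count].ppn = ppn;
        store_count++;
    }
}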
Wang, et al. [126] proposed a real-time, per-process, per-stream pattern detection scheme that identifies various write patterns. These patterns are then used to guide the write buffer to improve the write performance of SSDs that employ a log-structured block-based FTL. They classify fine-grained write patterns into the following three categories: (a) sequential, (b) clustered (page or block), and (c) random. Each of the above patterns is defined as follows: 1) a sequential pattern is defined as a series of requests with consecutive logical addresses in ascending order; 2) a page clustered pattern is defined as a process repeatedly updating a specific page; 3) a block clustered pattern is defined as a process repeatedly updating a specific block; 4) a pattern which falls into none of the above is classified as random. In order to filter out transition "noise", each pattern is allocated a bit map. The number of bits n determines how many times in a row a pattern has to be detected before the algorithm decides that the I/O stream has entered a new pattern. For each pattern, they devised an adaptive dirty flush policy. Each dirty page in the buffer cache is associated with a pattern type. The dirty pages with the same pattern are linked together in a linked list, so that dirty pages in the buffer cache are virtually partitioned according to their associated patterns. When a page is written by a process, it is moved to the head of the pattern list. As a result, in each list, the head is the most recently written page while the tail is the least recently written one. When the system needs to flush dirty pages, it scans the lists according to the following priorities. The sequential pattern is given the highest priority to be flushed, since its pages are most likely written only once and there is therefore no point in keeping them in the cache. The random pattern is given the next priority. The page clustered and block clustered dirty pages have the lowest priority, since they may be overwritten in the future and it is preferable to keep them in the cache. The suggested schemes reduce SSD erase cycles, which directly translates into a major improvement of the lifespan of SSDs. Huang, et al. [55] presented a Content and Semantics Aware File System (CSA-FS) which is able to reduce write traffic to SSDs. It employs deduplication and delta-encoding techniques for file system data blocks and semantic blocks, respectively. It is motivated by two important observations: (1) there exists a huge amount of content redundancy within primary storage systems, and (2) semantic blocks are visited much more frequently than data blocks, with each update bringing very minimal changes. By separately deduplicating redundant data blocks and delta-encoding similar semantic blocks, CSA-FS can significantly reduce the total write traffic to SSDs and correspondingly greatly improve their lifetime. CSA-FS applies deduplication to data blocks and delta-encoding to semantic blocks. Semantic blocks are extracted from the file system and exported for lookups. Semantic blocks mainly include super-blocks, group descriptors, the data block bitmap, the inode bitmap and inode tables. For every block write request, CSA-FS checks whether it accesses a semantic block or a data block by consulting the exported semantic blocks. For a data block write, it computes the block's MD5 digest and looks up the hash value in a hash table to determine whether it is a duplicate block write. If it is a duplicate write, CSA-FS simply returns the block number in the found hash entry, and then uses that block number to update the block pointer table of the file's inode. If it is a new write request, it first goes through the normal procedure, i.e., allocating a free block, updating the corresponding bitmap block and performing the necessary accounting, and finally inserts a new entry containing the block number, its MD5 value and some housekeeping information into the hash table. For a metadata block write, CSA-FS calculates the content delta relative to its original content, and then appends the delta to a delta-logging region.
Fu, et al. [98] studied cloud backup services in the personal computing environment. They concluded that the majority of storage space is occupied by a small number of compressed files with low sub-file redundancy. About 61% of all files are smaller than 10 KB, accounting for only 1.2% of the total storage capacity, and only 1.4% of files are larger than 1 MB but occupy 75% of the storage capacity. This suggests that tiny files can be ignored during the deduplication process so as to improve the deduplication efficiency, since it is the large files in the tiny minority that dominate the deduplication efficiency. The static chunking (SC) method can outperform content defined chunking (CDC) in deduplication effectiveness for static application data and virtual machine images. The computational overhead for deduplication is dominated by data capacity. The amount of data shared among different types of applications is negligible. They suggested AA-Dedupe (An Application-Aware Source Deduplication Approach) where tiny files are first filtered out by a file size filter for efficiency reasons, and backup data streams are broken into chunks by an intelligent chunker using an application-aware chunking strategy. Data chunks from the same type of files are then deduplicated in the application-aware deduplicator by looking up their hash values in an application-aware index that is stored on the local disk. If a match is found, the metadata for the file containing that chunk is updated to point to the location of the existing chunk. If there is no match, the new chunk is stored based on the container management in the cloud, the metadata for the associated file is updated to point to it, and a new entry is added into the application-aware index to index the new chunk. Meister, et al. [102] showed that most files are very small, but the minority of very large files occupies most of the storage capacity: 90% of the files occupy less than 10% of the storage space, in some cases even less than 1%. In most data sets, between 15% and 30% of the data is stored redundantly and can be removed by deduplication techniques. Often small files have high deduplication rates, but they contribute little to the overall savings. Middle-sized files usually have a high deduplication ratio. Full file duplication typically reduces the data capacity by 5% - 10%. In most data sets, most of the deduplication potential is lost if only full file elimination is used. The deduplication ratio decreases slowly with increasing chunk sizes. Fixed size chunking detects around 6-8% less redundancy than content-defined chunking. Between 3.1% and 9.4% of the data are zeros. Of all chunks, 90% were only referenced once. This means that these chunks are unique and do not contribute to deduplication. The most referenced chunk is the zero chunk. The mean number of references is 1.2, the median is 1 reference. The most referenced 5% of all chunks account for 35% and the first 24% account for 50% of all references. Of all multi-referenced chunks, about 72% were only referenced twice. A small fraction of the chunks causes most of the deduplication. The evaluation shows that typically 20% to 30% of online data can be removed by applying data deduplication techniques, peaking at up to 70% for some data sets. This reduction can only be achieved by a subfile deduplication approach, while approaches based on whole-file comparisons only lead to small capacity savings.
Xia, et al. [101] suggested DARE (a Deduplication-Aware Resemblance detection and Elimination scheme) for compressing backup datasets. DARE is designed to improve resemblance detection for additional data reduction in deduplication-based backup/archiving storage systems. For an incoming backup stream, DARE goes through the following four key steps: (1) Duplicate Detection - the data stream is first chunked by the CDC approach, fingerprinted by SHA-1, duplicate-detected, and then grouped into containers of sequential chunks to preserve the backup-stream locality. (2) Resemblance Detection - the DupAdj resemblance detection module in DARE first detects duplicate-adjacent chunks in the containers formed in Step 1. After that, DARE's improved super-feature module further detects similar chunks among the remaining non-duplicate and non-similar chunks that may have been missed by the DupAdj detection module when the duplicate-adjacency information is lacking or weak. (3) Delta Compression - for each of the resembling chunks detected in Step 2, DARE reads its base-chunk, then delta-encodes their differences with the Xdelta algorithm. In order to reduce the number of disk reads, an LRU and locality-preserved cache is implemented to prefetch the base-chunks in the form of locality-preserved containers. (4) Storage Management - the data NOT reduced, i.e., non-similar or delta chunks, is stored as containers on the disk. The file mapping information among the duplicate chunks, resembling chunks, and non-similar chunks is also recorded in the file recipes to facilitate future data restore operations in DARE. They concluded that supplementing deduplication with delta compression can effectively enlarge the logical space of the restoration cache, but the data fragmentation in data reduction systems remains a serious problem. Kim, et al. [110] designed a deduplication layer on top of the FTL. It consists of three components, namely, the fingerprint generator, the fingerprint manager, and the mapping manager. The fingerprint generator creates a hash value, called a fingerprint, which summarizes the content of written data. The fingerprint manager manipulates generated fingerprints and conducts fingerprint lookups for detecting duplicates. Finally, the mapping manager deals with the physical locations of duplicate data. They proposed two acceleration techniques: sampling-based filtering and recency-based fingerprint management. The former selectively applies deduplication based upon sampling, and the latter effectively exploits limited controller memory while maximizing the deduplication ratio. Experimental results have shown that they achieve a duplication rate ranging from 4% to 51%, with an average of 17%, for the nine considered workloads. The response time of a write request can be improved by up to 48%, with an average of 15%, while the lifespan of SSDs is expected to increase up to 4.1 times, with an average of 2.4 times. Ha, et al. [103] proposed a new deduplication scheme called block-level content-aware chunking to extend the lifetime of SSDs. The proposed scheme divides the data within a fixed-size block into a set of variable-sized chunks based on its contents and avoids storing duplicate copies of the same chunk. Evaluations on a real SSD platform showed that the proposed scheme improves the average deduplication rate by 77% compared to the previous block-level fixed-size chunking scheme.
Additional optimizations reduce the average memory consumption by 39% with a 1.4% gain in the average deduplication rate. Li [95] presented Flash Saver, which is coupled with the ext2/3 file system and aims to significantly reduce the write traffic to SSDs. Flash Saver deploys deduplication and delta-encoding to reduce the write traffic. Specifically, Flash Saver applies deduplication to file system data blocks and delta-encoding to file system metadata blocks, based on two important observations: (1) there exist large amounts of duplicate data blocks, and (2) metadata blocks are accessed/modified much more frequently than data blocks, but with very minimal changes for each update. Flash Saver semantically classifies the fixed-size (e.g. 4 KB) content blocks of the file system into data blocks and meta blocks. For data blocks, it computes the SHA-1 hash value of the block and uses the hash value to examine whether the same block has already been stored, in order to avoid storing multiple copies of a block having the same content. For meta blocks, it logs the incremental changes relative to the corresponding meta block to save I/Os and storage space. Under most circumstances, file system meta blocks are frequently modified with minor changes. The experimental results have shown that Flash Saver can save up to 63% of the total write traffic, which implies a reasonably prolonged lifetime, larger effective flash space and higher reliability than that of the original counterpart within the allowable lifespan. Rozier, et al. [112] have modeled the fault tolerance consequences of deduplication. They concluded that deduplication has a net negative impact on reliability, both due to its impact on unrecoverable data loss and the impact of silent data corruptions, though the former is easily countered by using higher-level RAID configurations. In both cases, system reliability can be increased by maintaining additional copies of deduplicated instances, typically by keeping multiple copies for a very small percentage of the deduplicated instances in a given category. Nam, et al. [107] introduced two reliability parameters for deduplication storage: chunk reliability and chunk loss severity. To provide the demanded reliability for an incoming data stream, most deduplication storage systems first carry out the deduplication process by eliminating duplicates from the data stream and then apply erasure coding to the remaining (unique) chunks. A unique chunk may be shared (i.e., duplicated) at many places of the data stream and shared by other data streams. That is why deduplication can reduce the required storage capacity. However, this occasionally becomes problematic for assuring the reliability levels required by different data streams. The chunk reliability means each chunk's tolerance level in the face of any failures. The chunk loss severity represents the expected damage level in the event of a chunk loss, formally defined as the multiplication of the actual damage by the probability of a chunk loss. They proposed a reliability-aware deduplication solution that not only assures all demanded chunk reliability levels by making already existing chunks sharable only if their reliability is high enough, but also mitigates the chunk loss severity by adaptively reducing the probability of having a chunk loss.
A distinction is made between lossless and lossy algorithms, i.e. those algorithms that preserve the original data exactly, and those that discard parts of the data, reducing the quality. The latter type is typically domain specific, i.e. knowledge about what type of data is being compressed is needed to determine what to discard. For general compression, three of the most often used algorithms are: (1) run length coding - a simple and fast scheme that replaces repeating patterns with the pattern and the number of repetitions; (2) Huffman coding [135] - Huffman coding analyzes the frequency of different fixed length symbols in a data set and assigns to the symbols codes whose lengths correspond to the frequency of the respective symbol in the data set, i.e. frequent symbols get short codes, infrequent symbols get long codes; (3) Lempel-Ziv [136, 137] - these algorithms basically replace strings (variable length symbols) found in a dictionary with codes representing those strings. The efficiency of these algorithms is determined by the size of the dictionary and how much effort is spent searching in the dictionaries. Gzip [138] and others use general compression algorithms based on the two Lempel-Ziv algorithms LZ77 [136] and LZ78 [137] (or Lempel-Ziv-Welch [139], which is based on LZ78). These algorithms are sometimes augmented with Huffman coding [135]. Huffman coding on its own is typically faster than Lempel-Ziv based algorithms, although it typically yields less compression. Run length encoding is extremely fast, but the gain is often small compared to Lempel-Ziv or Huffman coding.
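As a concrete reference point for the simplest of the schemes listed above, the sketch below (C) implements a byte-oriented run length encoder that emits (count, byte) pairs. It is only an illustration of the general idea, not the encoder used by any of the cited systems.

#include <stddef.h>
#include <stdint.h>

/* Encode 'len' input bytes as (count, byte) pairs; returns the output length.
 * 'out' must be able to hold up to 2 * len bytes (worst case: no runs). */
static size_t rle_encode(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t produced = 0;

    for (size_t i = 0; i < len; ) {
        uint8_t value = in[i];
        size_t run = 1;

        /* Runs longer than 255 are split because the count is one byte. */
        while (i + run < len && in[i + run] == value && run < 255)
            run++;

        out[produced++] = (uint8_t)run;
        out[produced++] = value;
        i += run;
    }
    return produced;
}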
Mannan, et al. [140] introduced the concept of block Huffman coding. Their main idea is to break the input stream into blocks and compress each block separately. They choose the block size in such a way that one full block can be stored in main memory. They use a block size as moderate as 5 KiB, 10 KiB or 12 KiB. Finally, they observed that, to obtain better efficiency from block Huffman coding, a moderately sized block is better and the block size does not depend on file types. Chang, et al. [120] have made a performance evaluation of the block LZSS compression algorithm. They studied the block LZSS algorithm and investigated the relationship between the compression ratio of block LZSS and the value of index or length. They found that as the block size increases, the compression ratio becomes better. To obtain better efficiency from block LZSS, a moderately sized block greater than 32 KiB may be optimal, and the optimal block size does not depend on file types. They have also found that, in some cases, block Huffman coding has a better compression ratio than non-blocking Huffman coding, and that with increasing block size the compression ratio deteriorates; the optimal block size, at which the best compression ratio is obtained, is about 16 KiB. The reason for the better efficiency may be attributed to the principle of locality of data. Douglis, et al. [109] have found that the benefits of application-specific deltas vary depending on the mix of content types. For example, HTML and email messages display a great deal of redundancy across large datasets, resulting in deltas that are significantly smaller than simply compressing the data, while mail attachments are often dominated by non-textual data that do not lend themselves to the technique. A few large files can contribute much of the total savings if they are particularly amenable to delta-encoding. Application-specific techniques, such as delta-encoding an unzipped version of a zip or gzip file and then zipping the result, can significantly improve results for a particular file, but unless an entire dataset consists of such files, overall results improve by just a couple of percent. For web content, substantial overlap has been found among pages on a single site. For the five web datasets they considered, deltas reduced the total size of the dataset to 8-19% of the original data, compared to 29-36% using compression. For files and email, there was much more variability, and the overall benefits are not as dramatic, but they are significant: two of the largest datasets reduced the overall storage needs by 10-20% beyond compression. There was significant skew in at least one dataset, with a small fraction of files accounting for a large portion of the savings. Factors such as shingle size and the number of features compared do not dramatically affect these results. Given a particular number of maximal matching features, there is not a wide variation across base files in the size of the resulting deltas. A new file is often created by making a small number of changes to an older file; the new file may even have the same name as the old file. In these cases, the new file can often be delta-encoded from the old file with minimal overhead.
REDO (REfactored Design of I/O architecture) [59] refactors the two main components of the I/O subsystem: the file system and the storage device. REDO removes logical-to-physical mapping and garbage collection from the storage device. Instead, a refactored file system (RFS) directly manages the storage address space, including garbage collection. Unlike a host-based FTL, all those functions are conducted by RFS without any help from an intermediate host layer like a device driver. This eliminates the need for maintaining a large logical-to-physical page-map table, allowing garbage collection to be performed more efficiently at the file system level. A refactored storage device controller (RSD) becomes simpler because it runs only a small number of essential flash management functions. RSD maintains a much smaller logical-to-physical segment-map table to manage wear-leveling and bad blocks. Unlike FFS, REDO provides interoperability with block I/O subsystems, allowing SSD vendors to hide all the details of their devices and NAND characteristics. RFS is designed differently from the conventional LFS in two ways: it only issues out-of-place update commands, and it informs the storage device about which blocks have become erasable via TRIM commands. This frees the flash controller from the task of garbage collection altogether. RFS writes file data, inodes, and the pieces of the inode map in an out-of-place update manner. Unlike LFS, the incoming data are written to a physical segment corresponding to a logical segment, and their relative offsets in the logical segment are preserved in the physical one. For check-pointing, RFS reserves two fixed logical segments, called check-point segments. RFS then appends new check-points with different version numbers, so that overwrites never happen. RFS manages all the obsolete data at the level of the file system and triggers garbage collection when free space is exhausted. RFS chooses a logical segment (e.g., segment 2) as a victim and copies its live data to free space. The victim segment becomes free for future use. To indicate that the physical segment for the victim holds obsolete data, RFS delivers a TRIM command to RSD. Finally, RSD marks the physical segment out-of-date and erases its flash blocks. RSD maintains the segment-map table, and each entry of the table points to the physical blocks that are mapped to a logical segment. When write requests come, RSD calculates a logical segment number (e.g., 100) from the logical file-system page number (e.g., 1,600). Then, it looks up the remapping table to find the physical blocks mapped to the logical segment. If physical blocks are not mapped yet, RSD builds the physical segment by allocating new flash blocks. RSD picks free blocks with the smallest P/E cycles in the corresponding channel/way. Bad blocks are ignored. If there are flash blocks already mapped, RSD writes the data to the fixed location in the physical segment. Block erasure commands are not explicitly issued by RFS. However, RSD easily figures out which blocks are out-of-date and ready for erasure, because RFS informs RSD of physical segments containing only obsolete data via a TRIM command. RSD handles overwrites like a block-level FTL.
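The segment-map lookup in RSD reduces to simple arithmetic on the page number. The sketch below (C) reproduces the example from the text, where logical page 1,600 maps to logical segment 100; the value of 16 pages per segment is an assumption chosen only so that the numbers match.

#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_SEGMENT 16u  /* assumed so that page 1600 -> segment 100 */

/* Translate a logical file-system page number into a logical segment number
 * and an offset inside that segment (the offset is preserved physically). */
static void page_to_segment(uint32_t page_no,
                            uint32_t *segment_no, uint32_t *offset)
{
    *segment_no = page_no / PAGES_PER_SEGMENT;
    *offset     = page_no % PAGES_PER_SEGMENT;
}

int main(void)
{
    uint32_t seg, off;

    page_to_segment(1600, &seg, &off);
    printf("page 1600 -> segment %u, offset %u\n", seg, off); /* 100, 0 */
    return 0;
}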
LightNVM (the Linux Open-Channel SSD Subsystem) [127, 128, 129, 130, 131] proposes that SSD management trade-offs should be handled through Open-Channel SSDs, a new class of SSDs that give hosts control over their internals. It introduces a new Physical Page Address I/O interface that exposes SSD parallelism and storage media characteristics. LightNVM integrates into traditional storage stacks, while also enabling storage engines to take advantage of the new I/O interface. The Physical Page Address (PPA) I/O interface is based on a hierarchical address space. It defines administration commands to expose the device geometry and let the host take control of SSD management, and data commands to efficiently store and retrieve data. The interface is independent of the type of non-volatile media chip embedded in the open-channel SSD. Open-channel SSDs expose to the host a collection of channels, each containing a set of Parallel Units (PUs), also known as LUNs. A PU may cover one or more physical dies, and a die may only be a member of one PU. Each PU processes a single I/O request at a time. Regardless of the media, storage space is quantized on each PU. NAND flash chips are decomposed into blocks, pages (the minimum unit of transfer), and sectors (the minimum unit of ECC). Byte-addressable memories may be organized as a flat space of sectors. PPAs are organized as a decomposition hierarchy that reflects the SSD and media architecture. The PPA address space can be organized logically to act as a traditional logical block address (LBA) space, e.g., by arranging NAND flash using "block, page, plane, and sector". This enables the PPA address space to be exposed through traditional read/write/trim commands. In contrast to traditional block I/O, the I/Os must follow certain rules. Writes must be issued sequentially within a block. Trim may be issued for a whole block, so that the device interprets the command as an erase. LightNVM is organized in three layers, each providing a level of abstraction for open-channel SSDs: (1) NVMe Device Driver: a LightNVM-enabled NVMe device driver gives kernel modules access to open-channel SSDs through the PPA I/O interface. (2) LightNVM Subsystem: an instance of the subsystem is initialized on top of the PPA I/O-supported block device. The instance enables the kernel to expose the geometry of the device through both an internal nvm dev data structure and sysfs. (3) High-level I/O Interface: a target gives kernel-space modules or user-space applications access to open-channel SSDs through a high-level I/O interface, either a standard interface like the block I/O interface provided by pblk, or an application-specific interface provided by a custom target.
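One way to picture the hierarchical PPA address is as a packed bit field; the sketch below (C) encodes channel, PU, block, plane, page, and sector into a 64-bit address. The field widths are illustrative assumptions rather than the layout mandated by the interface, which instead lets the device report its own geometry.

#include <stdint.h>

/* Illustrative bit widths; a real device reports its geometry and the host
 * builds the address format accordingly. */
#define SECTOR_BITS  3
#define PLANE_BITS   2
#define PAGE_BITS    9
#define BLOCK_BITS   12
#define PU_BITS      4
#define CHANNEL_BITS 4

struct ppa_addr {
    uint16_t channel;
    uint16_t pu;      /* parallel unit (LUN) */
    uint16_t block;
    uint16_t page;
    uint16_t plane;
    uint16_t sector;
};

/* Pack the hierarchical address into a single 64-bit PPA. */
static uint64_t ppa_pack(struct ppa_addr a)
{
    uint64_t ppa = 0;
    int shift = 0;

    ppa |= (uint64_t)a.sector  << shift;  shift += SECTOR_BITS;
    ppa |= (uint64_t)a.plane   << shift;  shift += PLANE_BITS;
    ppa |= (uint64_t)a.page    << shift;  shift += PAGE_BITS;
    ppa |= (uint64_t)a.block   << shift;  shift += BLOCK_BITS;
    ppa |= (uint64_t)a.pu      << shift;  shift += PU_BITS;
    ppa |= (uint64_t)a.channel << shift;
    return ppa;
}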
The segment is the cornerstone concept of any Log-structured File System (LFS). This notion reflects the presence of Physical Erase Blocks (PEBs) on the storage device (SSD) side. Generally speaking, a segment can be imagined as a portion of the storage device that includes one or several erase blocks. The erase block is a very important item of any NAND-based storage device because it is the unit of the erase operation. Finally, any SSDFS file system volume can be imagined as a sequence of segments (Fig. 1).

Figure 1: Logical segment concept.
Logical segment. Generally speaking, a segment would usually represent real physical unit(s) (for example, one or several PEBs identified by LBAs on the storage device). However, SSDFS operates on logical segments. The logical segment is a unit that is always located at the same offset from the volume's beginning for the whole lifetime of the file system volume (Fig. 1). A segment is capable of including a variable number of PEBs. However, an SSDFS file system volume includes a fixed number of segments of identical size after the segment size has been defined (during file system volume creation). A very important goal of the segment concept is the capability to execute the erase operation for the whole segment. However, a segment is an aggregation of several PEBs in the case of the SSDFS file system. It means that such a segment construction provides the opportunity to execute the erase operation on the basis of particular PEB(s) inside the same segment. Finally, the aggregation of several PEBs inside one segment has several goals: (1) exploitation of operation parallelism for different PEBs inside the segment, (2) capability to execute a partial erase operation on a per-PEB basis instead of the whole segment, (3) capability to use a RAID-like or erasure coding scheme inside the segment, (4) capability to select a proper segment size for a particular workload.

Figure 2: Logical extent concept.
Logical extent. Usually, a segment is associated with a PEB (flash-oriented file system) or with an LBA (flash-friendly file system). However, in the case of the SSDFS file system the segment is a purely logical entity without a strict relation to a PEB or LBA. Generally speaking, the segment is simply some portion of the file system volume that is always located at the same offset from the volume's beginning (Fig. 1). The nature of the SSDFS file system's segment serves the goal of implementing a logical extent concept. This concept implies that a logical extent (segment ID + logical block + length) is always located in the same logical position inside the same segment (Fig. 2). Generally speaking, the goal of the logical extent is to exclude the necessity of updating the metadata about a logical block's position in the case of data migration from the initial PEB into another one (in the case of an update or GC operation, for example). The logical extent concept is a technique for resolving the write amplification issue for the case of an LFS file system. It means that any metadata structure keeping a logical extent doesn't need to update the logical extent value in the case of data migration between PEBs, because the logical extent remains the same as long as the data lives in the same segment (Fig. 2). The implementation of the logical extent concept requires the introduction of the Logical Erase Block (LEB) concept (Fig. 1 - 2). The LEB represents the logical analogue of a PEB on the storage device side. Generally speaking, a segment is a sequence of LEBs, and every LEB is equal in size to one or multiple PEBs. As a result, the LEB can be imagined as a container that could be associated with any PEB on the storage device side. Finally, every LEB always has the same index in its particular segment, and the problem of associating a particular LEB with a PEB is resolved by means of a special mapping table. The PEB mapping table has several important goals in the SSDFS file system. LEB and PEB represent different notions. A PEB represents a physical erase block on the storage device side. Generally speaking, a PEB is a really allocated portion of the storage device that is able to receive read and write I/O requests; it is also possible to apply the erase operation to this portion of the storage device. Conversely, an LEB represents a fixed logical portion of the file system's volume that is identified by the segment ID and the index in the segment. It is possible to say that an LEB is a pre-allocated portion of the file system's volume space. However, the real association of an LEB with a PEB takes place only if some of the LEB's logical blocks/extents were allocated and filled with data. Otherwise, an empty LEB doesn't need an association with any PEB. It means that if an LEB has no data then no PEB is linked with such an LEB. Generally speaking, an SSDFS file system volume represents a sequence of logical segments. Every logical segment contains a set of LEBs that provide the opportunity to allocate some number of logical blocks (Fig. 1 - 2). As a result, internal metadata structures of the SSDFS file system operate on logical extents that need to be updated only in the case of a segment ID change.
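A minimal sketch of these notions in C (structure and function names are illustrative only, not the on-disk layout of SSDFS): a logical extent addresses data purely by segment ID, logical block, and length, and a separate mapping table resolves the LEB index to whatever PEB currently holds the data, so the extent itself never changes when data migrates between PEBs.

#include <stdint.h>

#define PEB_UNASSIGNED UINT32_MAX

/* A logical extent: stable for as long as the data stays in the segment. */
struct logical_extent {
    uint64_t seg_id;        /* logical segment ID */
    uint32_t logical_blk;   /* first logical block inside the segment */
    uint32_t len;           /* number of logical blocks */
};

/* One entry of a (simplified) PEB mapping table. */
struct leb_mapping {
    uint64_t seg_id;
    uint16_t leb_index;     /* index of the LEB inside the segment */
    uint32_t peb_id;        /* PEB currently backing this LEB, if any */
};

/* Resolve an extent to the PEB backing its LEB; returns PEB_UNASSIGNED if
 * the LEB is still empty and therefore not associated with any PEB. */
static uint32_t resolve_peb(const struct logical_extent *ext,
                            const struct leb_mapping *table, int entries,
                            uint32_t blks_per_leb)
{
    uint16_t leb_index = (uint16_t)(ext->logical_blk / blks_per_leb);

    for (int i = 0; i < entries; i++) {
        if (table[i].seg_id == ext->seg_id &&
            table[i].leb_index == leb_index)
            return table[i].peb_id;
    }
    return PEB_UNASSIGNED;
}

When data migrates to a different PEB, only the mapping entry changes; an extent stored in metadata stays valid as long as the segment ID is unchanged.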
Figure 3: Segment parallelism.

Segment parallelism. One of the important goals of having several LEBs/PEBs in one segment is to exploit the parallelism of operations with PEBs that are located on different dies. Usually, any SSD contains a set of dies that are able to execute various operations independently and concurrently (for example, the erase operation). Moreover, a multi-channel SSD architecture is capable of delivering commands and data to different dies by means of independent channels. Generally speaking, LEBs of the same segment can be associated with PEBs located on different dies (Fig. 3). As a result, the operation parallelism in one segment is able to improve the file system performance as a whole. From another point of view, the opportunity to associate any LEB with any PEB creates flexibility in the policy of distributing the LEBs of the same segment over different areas of the storage device. The critical point is the capability to know the distribution of PEB ranges among the different dies. Generally speaking, such a distribution can be implemented on the basis of static or dynamic policies. A static policy means that the whole address space of the storage device is distributed among the different dies in a static manner. Otherwise, the storage device itself should be able to inform the host about such a distribution by means of a special protocol. For example, an Open-Channel SSD could provide such data on the host side.

LEB/PEB Architecture
Log concept. The log is the fundamental basic structure of the SSDFS file system (Fig. 4). Any user data or metadata is stored in a log on an SSDFS file system volume. Generally speaking, the log concept tries to achieve several very important goals: (1) replication of the critical metadata structures that characterize a file system volume, (2) creation of the opportunity to recover the log's payload (user data or metadata) on the basis of the log's metadata even if all other logs are corrupted, (3) localization of the block bitmap to the scope of one PEB, (4) implementation of the concept of an offsets translation table, (5) implementation of the concept of main, diff updates, and journal areas.

Figure 4: Log concept.

It is possible to imagine the log as a container that includes a header, a payload, and a footer (Fig. 4). The responsibility of the header (Fig. 5) is the identification of the file system type and of the log's beginning, because the header is capable of playing the role of the file system's superblock. Any PEB can contain one or several logs, and the end-user is able to define the size of the log. Moreover, various segment types are able to have different log sizes. However, the mount type, an unmount operation, the segment type, or the workload type could result in the necessity to commit the log without enough data in the log's payload. As a result, it is necessary to distinguish full and partial logs. A full log has to contain the number of logical blocks (or NAND flash pages) that was defined by the end-user for this particular segment type during file system volume creation. Every full log contains the segment header, payload, and footer (Fig. 4).

Figure 5: Log header.

If there is a necessity to commit a log without enough data in the payload, then a chain of partial logs needs to be created in a PEB (Fig. 4). The first partial log contains the segment header (Fig. 5), the partial log header (Fig. 6), and the payload. Every subsequent partial log includes only the partial log header and the payload. Finally, the last partial log ends with the log footer (Fig. 4, Fig. 7). Generally speaking, selecting an optimal value for the log's size may not be an easy task because different workloads may need specialized log sizes. It means that collecting statistics about partial log sizes could be the basis for searching for and gradually correcting the full log's size with the goal of achieving a local or global optimum value.

Figure 6: Partial log header.

Figure 7: Log footer.
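A rough sketch of the full-log layout described above (C; field names and types are illustrative, not the SSDFS on-disk format): a header with area descriptors, a payload, and a footer that duplicates critical metadata.

#include <stdint.h>

#define LOG_AREAS_MAX 8   /* illustrative upper bound on areas per log */

/* Describes one area (block bitmap, block tables, payload, ...) in a log. */
struct area_descriptor {
    uint32_t offset;   /* byte offset of the area from the log's beginning */
    uint32_t size;     /* area size in bytes */
    uint16_t type;     /* what the area contains */
};

struct log_header {
    uint32_t magic;              /* identifies the file system and log start */
    uint64_t timestamp;          /* checkpoint time of this log */
    uint16_t area_count;
    struct area_descriptor areas[LOG_AREAS_MAX];
    /* the static part of the superblock would also live here */
};

struct log_footer {
    uint32_t magic;
    uint16_t area_count;         /* replicated descriptors of critical areas */
    struct area_descriptor areas[LOG_AREAS_MAX];
    /* the mutable (dynamic) part of the superblock would also live here */
};

/* A full log inside a PEB is laid out as:
 *   [ log_header ][ payload areas ... ][ log_footer ]
 * A partial log replaces the header with a smaller partial-log header and
 * carries a footer only in the last partial log of the chain. */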
Superblock. Usually, any file system starts from a superblock that is located at one or several fixed position(s) on the file system volume. The responsibility of the superblock is to identify the file system type and to provide the description of the key file system metadata structures. SSDFS is an LFS file system that uses the Copy-On-Write (COW) policy for updating the state of any data or metadata. Generally speaking, it means that the superblock cannot be located at a fixed position on an SSDFS file system volume. Instead, every log of the SSDFS file system contains a copy of the superblock. It means that extracting the header (Fig. 5 - 6) and footer (Fig. 7) of any log provides the state of the file system's superblock that was actual for some timestamp. The SSDFS file system uses a special algorithm for quickly finding the last actual superblock state. The SSDFS file system splits the superblock state into: (1) static data and (2) dynamic data. The static part of the superblock describes the key parameters/features of a file system volume (for example, logical block size, erase block size, segment size, creation timestamp and so on) that are defined during volume creation. This part of the superblock is kept in the volume header of the log's header (Fig. 5). Conversely, the dynamic part of the superblock is represented by the mutable parameters/features of the file system volume (for example, segment numbers, number of free logical blocks, volume UUID, volume label and so on). The volume state in the log footer (Fig. 7) keeps the dynamic part of the superblock. And, finally, the partial log header (Fig. 6) represents a restricted combination of the static and dynamic parts of the superblock. Every metadata structure of the log (segment header, partial log header, log footer) starts with a magic signature (Fig. 5 - 7). Generally speaking, the responsibility of the magic signature is to identify the file system type and the type of metadata structure. Another very important field is the log's area descriptors (Fig. 5 - 7). These descriptors describe the position and the size of every existing area (user data or metadata) in a log.
Block bitmap. One of the very important log metadata structures is the block bitmap. Usually, a file system uses the block bitmap as a single metadata structure for the whole volume. The responsibility of the block bitmap is to track the state of logical blocks (free or used). As a result, the block bitmap is a frequently accessed and modified metadata structure. However, this compact and efficient metadata structure cannot be used in the traditional way for the case of an LFS file system because: (1) frequent updates of the block bitmap can increase the write amplification, (2) a logical block of an LFS file system needs more states (free, used, invalid), (3) the volume capacity could change because of the necessity to track the presence of bad erase blocks.

Figure 8: Block bitmap concept.

The SSDFS file system introduces a PEB-based block bitmap because of the proven efficiency and compactness of this metadata structure. First of all, the block bitmap (Fig. 8) tracks the following states of a logical block: (1) free, (2) used, (3) pre-allocated, (4) invalid. The free state means that a logical block is ready for allocation and a write operation. Conversely, the used state means that the logical block was allocated and the write operation has taken place for this logical block. The invalid state represents the case when an update or GC operation invalidates (makes not actual) the state of a logical block in one PEB and stores the actual state into another one. And, finally, the pre-allocated state can be used for representing the case when several fragments of different logical blocks are stored into one NAND flash page (for example, in the case of compression or delta-encoding).

Figure 9: Technique of using the block bitmap.

The block bitmap is a PEB-based metadata structure in the case of the SSDFS file system (Fig. 8 - 9). The goals of such an approach are: (1) the opportunity to access/modify the block bitmaps of different PEBs without the necessity to use any synchronization primitives, (2) the capability to lose bad erase blocks without the necessity to rebuild the block bitmap, (3) the capability to allocate logical blocks and to execute GC operations concurrently for different PEBs in the same segment or for the file system volume as a whole, (4) the opportunity to track the state of logical blocks only inside the log's payload. Every log keeps the actual state of the block bitmap for some timestamp. It means that previous PEB logs play the role of block bitmap checkpoints or snapshots (Fig. 9). As a result, it is possible to use the block bitmaps of previous logs in the case of corruption of a particular PEB's log. If some LEB is under active migration, then every log of the destination PEB has to store the block bitmaps of both the source PEB and the destination PEB (Fig. 9), because the migration could be executed in several phases.
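Since the bitmap tracks four states per logical block, two bits per block are sufficient; the sketch below (C) is a minimal illustration of such a packed per-PEB bitmap, not the exact encoding used by SSDFS.

#include <stdint.h>

enum blk_state {
    BLK_FREE          = 0,
    BLK_PRE_ALLOCATED = 1,
    BLK_USED          = 2,
    BLK_INVALID       = 3,
};

#define BLKS_PER_PEB 2048   /* illustrative number of logical blocks per PEB */

/* Two bits per logical block, packed four blocks per byte. */
struct peb_block_bmap {
    uint8_t bits[BLKS_PER_PEB / 4];
};

static void bmap_set(struct peb_block_bmap *bmap, uint32_t blk,
                     enum blk_state state)
{
    uint32_t byte = blk / 4;
    uint32_t shift = (blk % 4) * 2;

    bmap->bits[byte] &= (uint8_t)~(0x3u << shift);           /* clear old state */
    bmap->bits[byte] |= (uint8_t)(((unsigned)state & 0x3u) << shift);
}

static enum blk_state bmap_get(const struct peb_block_bmap *bmap, uint32_t blk)
{
    uint32_t byte = blk / 4;
    uint32_t shift = (blk % 4) * 2;

    return (enum blk_state)((bmap->bits[byte] >> shift) & 0x3u);
}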
Figure 10: Offsets translation table concept.

Offsets translation table. Any subsystem of the SSDFS file system driver that needs to store user data or metadata treats the segment as a sequence of logical blocks. Generally speaking, the goal of such an approach is to provide the opportunity to access the stored data by means of segment ID and logical block number without knowing which PEB keeps the actual state of data for the requested logical block. The SSDFS file system uses an offsets translation table (Fig. 10) to implement this approach. Generally speaking, the offsets translation table looks like a sequence of fragments, and every fragment is stored in a particular log (Fig. 10). The fragment keeps a portion of the table that associates a logical block number with a descriptor keeping the offset to the data in the log's payload. As a result, to retrieve the actual state of data it is necessary to find the latest record for the requested logical block number in the sequence of fragments of the offsets translation table. The found record identifies the PEB, log index, and byte offset to the actual state of data in the log's payload.

Figure 11: Offsets translation table architecture.

Generally speaking, the offsets translation table includes several metadata structures inside the log (Fig. 11): (1) logical block table, (2) block descriptor table, (3) payload area. The logical block table represents an array of descriptors where the logical block number can be used as the index. Every descriptor of the logical block table keeps: (1) the logical offset from the beginning of the file or metadata structure, (2) the PEB page number that identifies the index of the logical block in the block bitmap, (3) the log's area (a metadata area or the payload) that keeps the content of the logical block, (4) the byte offset from the area's beginning to the data portion. Finally, the descriptor of the logical block table can point directly into the payload (for example, in the case of a full plain logical block) or into the block descriptor table (Fig. 11). Every record of the block descriptor table keeps the inode ID and several descriptors of the logical block's states in the payload area(s). The goal of keeping several descriptors of a logical block's states in one record of the block descriptor table is to provide the capability to represent several sequential modifications of a logical block or the various delta-encoded fragments of the same logical block. Finally, the payload could keep the plain full logical block or a compressed (delta-encoded) fragment with the associated checkpoint and parent snapshot IDs. It needs to be pointed out that the checkpoint and parent snapshot IDs can be extracted from the segment header for the case of a plain full logical block. Generally speaking, the knowledge of the logical offset from the file's beginning, the inode ID, the checkpoint ID, and the parent snapshot ID provides the capability to recover the stored data from the log's payload on the basis of the log's metadata only.
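The lookup path can be sketched roughly as follows (C; all structures are simplified stand-ins for the ones just described): walking the fragments from the newest to the oldest, the first fragment that contains a record for the requested logical block yields the PEB, log index, and byte offset of the current data.

#include <stdbool.h>
#include <stdint.h>

/* Where the current state of a logical block lives. */
struct blk_location {
    uint32_t peb_id;
    uint16_t log_index;
    uint32_t byte_offset;   /* offset inside the log's payload */
};

/* One record of a translation-table fragment. */
struct blk2off_record {
    uint32_t logical_blk;
    struct blk_location loc;
};

/* One fragment, as stored in one particular log. */
struct blk2off_fragment {
    const struct blk2off_record *records;
    uint32_t count;
};

/* Fragments ordered oldest-to-newest; the newest record wins. */
static bool blk2off_lookup(const struct blk2off_fragment *frags, int nr_frags,
                           uint32_t logical_blk, struct blk_location *out)
{
    for (int f = nr_frags - 1; f >= 0; f--) {
        for (uint32_t r = 0; r < frags[f].count; r++) {
            if (frags[f].records[r].logical_blk == logical_blk) {
                *out = frags[f].records[r].loc;
                return true;
            }
        }
    }
    return false;   /* the block has never been written */
}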
Figure 12: Log structure.

Log structure. As a result, the log's structure (Fig. 12) begins with the header that identifies the file system type and the log's beginning by means of a magic signature. Moreover, the header contains an array of area descriptors that describes the existing areas in the log: (1) block bitmap, (2) logical block table, (3) block descriptor table, (4) payload, (5) footer. The footer is also able to include the array of area descriptors, with the goal of replicating the critical metadata structures of the log (for example, the block bitmap and the logical block table).
Main/Diff/Journal areas. It is very important to distinguish "cold" and "hot" data for the case of an LFS file system, because the identification of "cold" and "hot" types of data provides the opportunity to implement an efficient data management scheme, especially for the case of GC operations. As a result, the SSDFS file system introduces (Fig. 13) three types of payload areas: (1) main area, (2) diff updates area, (3) journal area. The main area is used for storing plain full blocks. Generally speaking, the write operation for any logical block takes place in the main area only once, and the following updates are stored into the diff updates or journal area. As a result, such a write/update policy creates an area (the main area) with "cold" data, because all following updates of any logical block in the main area will be stored into another area (the diff updates or journal area). If the diff updates or the journal area gathers a significant amount of updates for some logical block in the main area of some log, then this logical block could be stored in the main area of another log with all existing updates applied. The diff updates area (Fig. 13) plays the role of an area with "warm" data. The responsibility of the diff updates area is to gather into one NAND flash page the compressed blocks or delta-encoded fragments of the same file. It means that this area is able to store a significant amount of updates for logical blocks in the main area. However, the updates of the data in the diff updates area could be not as frequent as for the journal area. Finally, the diff updates area will be hotter than the main area, but it could be colder than the journal area.

Figure 13: Main, diff and journal payload areas.

The goal of the journal area (Fig. 13) is to represent the area with "hot" data. One NAND flash page of the journal area is used for compaction of several small files or updates of logical blocks of different files. Finally, it means that a NAND flash page with fragments of various files is able to receive more updates than the main or diff updates areas. From one point of view, the compaction of several fragments of different logical blocks into one NAND flash page creates the capability to move more data in one GC operation. From another viewpoint, the warm/hot areas introduce areas with a high frequency of update operations. Generally speaking, it is possible to expect that the high frequency of update operations (in the diff updates and journal areas) creates a natural migration of data between PEBs without the necessity to use extensive GC operations.
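A toy version of the placement policy implied above might look like this (C; the threshold and the decision rule are invented for the illustration, not taken from SSDFS): the first write of a block goes to the main area, larger updates of already-written blocks go to the diff updates area, and tiny fragments or small files are compacted into the journal area.

#include <stdbool.h>
#include <stdint.h>

enum log_area {
    AREA_MAIN,          /* "cold": plain full blocks, written once */
    AREA_DIFF_UPDATES,  /* "warm": compressed/delta fragments of one file */
    AREA_JOURNAL,       /* "hot": small files and fragments of many files */
};

#define JOURNAL_FRAGMENT_MAX 512   /* illustrative threshold in bytes */

/* Pick the payload area for a write; a hypothetical placement heuristic. */
static enum log_area choose_area(bool first_write_of_block,
                                 uint32_t fragment_bytes)
{
    if (first_write_of_block)
        return AREA_MAIN;               /* full block written only once */

    if (fragment_bytes <= JOURNAL_FRAGMENT_MAX)
        return AREA_JOURNAL;            /* tiny update: compact with others */

    return AREA_DIFF_UPDATES;           /* larger delta/compressed update */
}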
Usually, user data and metadata are based on different granularities of items and very different frequencies of updates. Moreover, various metadata structures have different architectures and live under different workloads. The SSDFS file system distinguishes various types of segments with the goal of guaranteeing a predictable and deterministic nature of data management. As a result, there are several types of segments on any SSDFS file system volume: (1) superblock segment, (2) snapshot segment, (3) PEB mapping table segment, (4) segment bitmap, (5) b-tree segment, (6) user data segment. Generally speaking, the goal of distinguishing the different types of segments is to localize the peculiarities of different types of data (user data and metadata, for example) inside specialized segments. Another important responsibility of the segments' specialization is to provide a reliable basis for data and metadata recovery in the case of file system volume corruption. It means that if a PEB keeps a specialized type of metadata or user data then it simplifies the task of data/metadata recovery in the case of file system volume corruption.
Superblock segment. The superblock is one of the critical metadata structures of any file system (Fig. 14). First of all, the superblock identifies the type of the file system (ext4, xfs, btrfs, for example). From another viewpoint, the superblock's responsibility is to describe the crucial features of a file system volume (logical block size, number of free blocks, number of folders, for example). And, finally, the file system driver extracts from the superblock the knowledge about the position of the key metadata structures (block bitmap, inodes array, for example) on the volume. Usually, the superblock is stored at one or several fixed position(s) on the file system volume (Fig. 14). Generally speaking, the fixed position of the superblock provides the opportunity to find the superblock easily and to identify the file system type on the volume. However, SSDFS is a log-structured (LFS) and flash-friendly file system. It means that a fixed position of the superblock is not a suitable solution for the case of the SSDFS file system. If anybody considers the superblock metadata structure (Fig. 14), then it is possible to distinguish two principal types of fields: (1) static metadata - describes basic and unchangeable features of the file system volume (logical block size, for example), (2) mutable metadata - describes the volume's features that are modified during the mounted state of the file system volume (number of free blocks, for example).

Figure 14: Classic superblock approach.

Figure 15: Distributed superblock approach.

Any SSDFS file system volume represents a sequence of logical segments. Every segment contains some number of LEBs. Finally, an LEB needs to be associated with a PEB when it is necessary to store any data in the segment. As a result, any associated PEB stores a sequence of logs. Moreover, every full log starts from a fixed position in the PEB, and the full log contains the segment header and log footer (Fig. 15). Generally speaking, the segment header and log footer are located at fixed positions if the size of the full log is known. The SSDFS file system keeps the static part of the superblock's metadata in the segment header and the mutable part in the log footer of every full log (Fig. 15). The massive replication of the superblock's metadata has the goal of increasing the reliability of storing the superblock inside the SSDFS file system volume. From another point of view, keeping the superblock's metadata in every full log creates the opportunity to start the recovery of a corrupted file system volume from any particular PEB on the volume. Moreover, even if only one log from the whole SSDFS file system volume survives, it will be possible to extract and to recover the data or metadata from the surviving log on the basis of the metadata in its header and footer.

Figure 16: Specialized superblock concept.

However, the massive replication of the superblock's metadata creates the problem of finding the last actual state of the mutable part of the superblock's metadata. To resolve this problem the SSDFS file system introduces a special type of segment - the superblock segment (Fig. 16). Generally speaking, the goal of the superblock segment is to keep a sequence of superblock states stored for every mount and unmount operation. As a result, the superblock segment contains a sequence of logs that keep the state of the superblock in the header and footer (Fig. 16). Moreover, every log of the superblock segment is able to store in its payload a snapshot of some critical metadata structures.
The key goals of the superblock segment are: (1) to store the superblock's state for every mount and unmount operation, (2) to provide a mechanism for fast search of the last actual state of the superblock. The fast lookup method is based on knowledge of the numbers of the current and next superblock segments, which are stored in every segment header (Fig. 17). It means that the segment header of any valid PEB with logs is able to provide the numbers of the current and next superblock segments that were actual at some timestamp. Checking these segment numbers either confirms the actual superblock segments or discovers more recent numbers. Finally, it is possible to find the actual superblock segment by passing through the chain of segment numbers, and then to find the latest log in that segment with the goal of retrieving the actual superblock state. Moreover, SSDFS keeps two copies of the superblock segment with the goal of improving the reliability and increasing the performance of the lookup operation.

Figure 17: Superblock segments' migration scheme.

The segment header of any full log keeps the previous, current, and next numbers of the superblock segment (Fig. 17). These numbers create the basis for the migration technique of the superblock segment. Initially, the file system driver uses the current superblock segment's number for storing logs with superblock states. The next superblock segment's number plays the role of a reserve for the case of exhaustion of the current superblock segment. When the current superblock segment is exhausted, the driver moves its number from the current to the previous position. If the previous position already contained the number of some superblock segment, then that superblock segment is either saved as a snapshot or erased. As a result, the next superblock segment becomes the current one, and some clean segment needs to be allocated as the new reservation of space for the superblock segment. Otherwise, if no clean segment can be allocated, the number of the former previous superblock segment is reused. Generally speaking, the distributed superblock approach and the specialized superblock segment concept provide a reliable way of storing the superblock and an efficient technique of searching for its last actual state during the mount operation.
Snapshot segment. SSDFS is a Log-structured File System (LFS) that uses the Copy-On-Write (COW) policy for data updates. It means that SSDFS provides a rich basis for the concept of snapshots of the volume's states. Generally speaking, every log represents a checkpoint of user data's or metadata's state. Such a checkpoint is accessible until the next erase operation is applied to the PEB. It means that a checkpoint has to be converted into a snapshot for long-term keeping of that checkpoint's state. As a result, a snapshot is the long-term storage of the file system's state, or of a portion of the namespace, for some timestamp. A snapshot can play the role of a starting point for the evolution of some version of the file system's state, and that version can be accessed by means of mounting the volume with a snapshot's ID. The superblock segment is able to store the content of some critical metadata structures in the log's payload. As a result, the snapshots table can be stored in the superblock segment (Fig. 18). Generally speaking, the snapshots table is an array of records that keep a snapshot ID and the corresponding snapshot segment number. Finally, this table provides the mechanism for finding the snapshot segment number(s) for a given snapshot ID.

Figure 18: Snapshots table concept.
Figure 19: Snapshot segment concept.

SSDFS introduces the concept of a specialized snapshot segment (Fig. 19). The snapshot segment is dedicated to the long-term storage of a checkpoint's state or of some portion of the file system's namespace (which could play the role of a parent snapshot for a version of the file system's state). Generally speaking, the snapshot segment contains a sequence of logs that keep a special journal (Fig. 19). This journal is simply an aggregation of records that keep the state of files' content or metadata. Moreover, a log's header or footer is capable of storing the root node of the inodes tree that represents the initial state of the namespace for the particular snapshot. This root node points to the index/leaf nodes that will be stored in the regular, specialized segments. Generally speaking, only the inodes tree needs to be defined explicitly in the snapshot segment, because the rest of the b-trees (extents, dentries, xattr b-trees) are defined by means of root node(s) in the particular inode records. Finally, the nodes of these child b-trees are stored in the regular, specialized segments that are dedicated to keeping the index/leaf nodes of b-trees.

Figure 20: Snapshots concept.

Finally, the snapshots table in the superblock segment associates the snapshot IDs with segment numbers (Fig. 20). Every segment number in this table defines the specialized snapshot segment that stores a snapshot of some state of the file system's volume. Also, the snapshot segment contains the root node of the inodes tree in the log's segment header, which plays the role of a starting point for evolving the file system's state by means of adding index and leaf nodes into the inodes, extents, dentries, and xattr b-trees. The snapshot state is able to evolve when the volume is mounted with some snapshot ID.
PEB mapping table. SSDFS is based on the concept of the logical segment, which is an aggregation of LEBs. Moreover, initially a LEB has no association with a particular PEB. It means that a segment could have associations for only some of its LEBs, or even no association with any PEB at all (for example, in the case of a clean segment). Generally speaking, SSDFS needs a special metadata structure (the PEB mapping table) that is capable of associating any LEB with any PEB. The PEB mapping table is a crucial metadata structure that has several goals: (1) mapping LEB to PEB, (2) implementation of the logical extent concept, (3) implementation of the PEB migration concept, (4) implementation of the delayed erase operation by a specialized thread, (5) implementation of the bad erase block recovery approach.

Generally speaking, the PEB mapping table describes the state of all PEBs on a particular SSDFS volume. These descriptors are split into several fragments that are distributed amongst the PEBs of specialized segments (Fig. 21). The numbers of these specialized segments are stored in the segment headers of every log, and they describe the space of the volume reserved for the PEB mapping table. Because SSDFS employs the concept of the logical segment, the reserved numbers of specialized segments remain the same for the volume's lifetime. But if some PEB achieves the exhausted state, then it triggers the migration mechanism of moving the exhausted PEB's content into another PEB. The PEB mapping table is also enhanced by a special cache that is stored in the payload of the superblock segment's log (Fig. 21). Generally speaking, the cache stores a copy of the PEBs' state records. The goal of the PEB mapping table's cache is to resolve the case when a PEB's descriptor is associated with a LEB of the PEB mapping table itself, for example. If the unmount operation triggers the flush of the PEB mapping table, then there are cases when the PEB mapping table could be modified during the flush operation's activity. As a result, the actual PEB's state is stored only in the PEB mapping table's cache. Such a record is marked as inconsistent, and the inconsistency has to be resolved during the next mount operation by means of storing the actual PEB's state into the PEB mapping table by a specialized thread. Moreover, the cache plays another very important role: it is used for converting LEB ID into PEB ID for the basic metadata structures (PEB mapping table and segment bitmap, for example) before the PEB mapping table is fully initialized during the mount operation.

Figure 21: PEB mapping table architecture.
Figure 22: PEB mapping table's fragment structure.

Every fragment of the PEB mapping table represents the log's payload in a specialized segment (Fig. 22). Generally speaking, the payload's content is split into: (1) LEB table, and (2) PEB table. The LEB table starts with a header and contains an array of records ordered by LEB IDs. It means that the LEB ID plays the role of an index into the array of records. As a result, the responsibility of the LEB table is to define an index inside the PEB table. Moreover, every LEB table record defines two indexes. The first index (physical index) associates the LEB ID with some PEB ID. Additionally, the second index (relation index) is able to define a PEB ID that plays the role of the destination PEB during the migration process from the exhausted PEB into a new one.
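A minimal sketch of one LEB table record with its two indexes might look as follows; the field names and widths are assumptions, and the LEB ID itself is implicit because the records are ordered by LEB ID.

```c
#include <stdint.h>

#define PEB_INDEX_UNUSED 0xFFFF	/* hypothetical "no PEB associated" marker */

/* Hypothetical sketch of one LEB table record: the LEB ID equals the
 * record's position in the array, so only the indexes into the PEB
 * table need to be stored. */
struct leb_table_record {
	uint16_t physical_index;	/* index into the PEB table: source PEB */
	uint16_t relation_index;	/* destination PEB during migration,
					 * PEB_INDEX_UNUSED if no migration */
};
```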
It is possible to see (Fig. 22) that the PEB table starts with a header and contains an array of PEB state records ordered by PEB ID. The most important fields of the PEB state record are: (1) erase cycles, (2) PEB type, (3) PEB state.

Figure 23: Possible PEB's types and states.

The PEB type (Fig. 23) describes the possible types of data a PEB could contain: (1) user data, (2) leaf b-tree node, (3) hybrid b-tree node, (4) index b-tree node, (5) snapshot, (6) superblock, (7) segment bitmap, (8) PEB mapping table. The PEB state (Fig. 23) describes the possible states of a PEB during its lifecycle: (1) clean state means that the PEB contains only free NAND flash pages ready for write operations, (2) using state means that the PEB could contain valid, invalid, and free pages, (3) used state means that the PEB contains only valid pages, (4) pre-dirty state means that the PEB contains both valid and invalid pages, (5) dirty state means that the PEB contains only invalid pages, (6) migrating state means that the PEB is under migration, (7) pre-erase state means that the PEB has been added into the queue of PEBs waiting for the erase operation, (8) recovering state means that the PEB will be left untouched for some amount of time with the goal of recovering its ability to fulfill the erase operation, (9) bad state means that the PEB is unable to be used for storing data. Generally speaking, the responsibility of the PEB state is to track the passing of PEBs through the various phases of their lifetime with the goal of managing the volume's pool of PEBs efficiently.

Figure 24: PEB mapping table's cache.

The PEB mapping table's cache (Fig. 24) starts with a header that precedes: (1) LEB ID / PEB ID pairs, (2) PEB state records. The pairs' area associates the LEB IDs with PEB IDs. Additionally, the PEB state records' area contains information about the last actual state of the PEBs for every record in the pairs' area. The most important fields in the PEB state area are: (1) consistency, (2) PEB state, and (3) PEB flags. Generally speaking, the consistency field simply shows whether a record in the cache and in the mapping table are identical or not. If some record in the cache is marked as inconsistent, then the PEB mapping table has to be modified with the goal of matching the actual value kept in the cache. As a result, the value in the table and in the cache finally become consistent.

The PEB migration approach is the key technique of the SSDFS file system. Generally speaking, the migration mechanism implements the logical segment and logical extent concepts with the goal of decreasing or completely eliminating the write amplification issue. Moreover, SSDFS widely uses data compression, the delta-encoding technique, and the small files compaction technique, which provide the opportunity to employ the PEB migration mechanism without the necessity of additional overprovisioning. The PEB mapping table plays a critical role in the implementation of the PEB migration technique by means of the relation index in the LEB table (Fig. 22). Finally, this index creates the relation between two PEBs that defines the source and destination points for data migration.
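The PEB descriptor fields from Fig. 23 can be summarized as two enumerations plus a small state record. This is a sketch for illustration; the identifiers and field widths are assumptions rather than the SSDFS on-disk encoding.

```c
#include <stdint.h>

/* Hypothetical sketch of PEB types (Fig. 23). */
enum peb_type {
	PEB_TYPE_USER_DATA,
	PEB_TYPE_LEAF_NODE,
	PEB_TYPE_HYBRID_NODE,
	PEB_TYPE_INDEX_NODE,
	PEB_TYPE_SNAPSHOT,
	PEB_TYPE_SUPERBLOCK,
	PEB_TYPE_SEGBMAP,
	PEB_TYPE_MAPTBL,
};

/* Hypothetical sketch of PEB lifecycle states (Fig. 23). */
enum peb_state {
	PEB_STATE_CLEAN,	/* only free NAND pages */
	PEB_STATE_USING,	/* valid, invalid, and free pages */
	PEB_STATE_USED,		/* only valid pages */
	PEB_STATE_PRE_DIRTY,	/* valid and invalid pages */
	PEB_STATE_DIRTY,	/* only invalid pages */
	PEB_STATE_MIGRATING,	/* under migration */
	PEB_STATE_PRE_ERASE,	/* queued for erase */
	PEB_STATE_RECOVERING,	/* resting before erase is retried */
	PEB_STATE_BAD,		/* excluded from use */
};

/* Hypothetical sketch of one PEB table record. */
struct peb_table_record {
	uint32_t erase_cycles;	/* wear counter */
	uint8_t	 type;		/* enum peb_type */
	uint8_t	 state;		/* enum peb_state */
};
```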
The segment bitmap (Fig. 25) is a critical metadata structure of SSDFS that serves several goals: (1) searching for a candidate for a current segment that is capable of storing new data, (2) searching by the GC subsystem for the most suitable segment (in the pre-dirty state, for example) with the goal of preparing the segment in the background for storing new data.

Figure 25: Segment bitmap concept.

The segment bitmap (Fig. 25) is able to represent the following set of states: (1) clean state means that a segment contains free logical blocks only, (2) using state means that a segment could contain valid, invalid, and free logical blocks, (3) used state means that a segment contains valid logical blocks only, (4) pre-dirty state means that a segment contains valid and invalid logical blocks, (5) dirty state means that a segment contains only invalid blocks, (6) reserved state is used for reserving segment numbers for some metadata structures (for example, for the superblock segment), (7) bad state means that a segment is excluded from usage because the volume hasn't enough valid erase blocks (PEBs).

Generally speaking, the PEB migration scheme implies that segments are able to migrate from one state to another without the explicit use of the GC subsystem. For example, if some segment receives enough truncate operations (data invalidation), then the segment could change its state from used to pre-dirty. Additionally, the segment is able to migrate from the pre-dirty into the using state by means of PEB migration in the case of receiving enough data update requests. As a result, the segment in the using state could be selected as the current segment without any GC-related activity. However, a segment is able to stick in the pre-dirty state in the absence of update requests. Such a situation can be resolved by the GC subsystem by means of background migration of pre-dirty segments into the using state if the volume hasn't enough segments in the clean or using state.

Figure 26: Segment bitmap architecture.

The segment bitmap is implemented as a bitmap metadata structure that is split into several fragments (Fig. 26). Every fragment is stored in a log of a specialized PEB. As a result, the full size of the segment bitmap and the PEB's capacity define the number of fragments. The mkfs utility reserves the necessary number of segments for storing the segment bitmap's fragments during the volume creation. Finally, the numbers of the reserved segments are stored in the segment headers of every log on the volume. The segment bitmap "lives" in the same set of reserved segments during the whole lifetime of the volume. However, the update operations of the segment bitmap could trigger PEB migration in the case of exhaustion of any PEB used for keeping the segment bitmap's content.

Figure 27: B-tree concept.
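Goal (1) of the segment bitmap (finding a candidate for a current segment) amounts to scanning the per-segment states for the first clean or using entry. Below is a minimal in-memory sketch under the simplifying assumption that the states are kept in a plain byte array; the real structure is a packed, fragmented bitmap.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of segment states tracked by the segment bitmap. */
enum seg_state {
	SEG_STATE_CLEAN,
	SEG_STATE_USING,
	SEG_STATE_USED,
	SEG_STATE_PRE_DIRTY,
	SEG_STATE_DIRTY,
	SEG_STATE_RESERVED,
	SEG_STATE_BAD,
};

/* Find the first segment able to receive new data (clean or using).
 * Returns the segment number or -1 if nothing suitable was found; in
 * that case the GC subsystem would have to prepare a segment. */
static long find_candidate_segment(const uint8_t *states, size_t segs_count)
{
	for (size_t i = 0; i < segs_count; i++) {
		if (states[i] == SEG_STATE_CLEAN || states[i] == SEG_STATE_USING)
			return (long)i;
	}
	return -1;
}
```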
B-trees (Fig. 27) represent an efficient and compact metadata structure for storing and representing metadata on a file system's volume. A b-tree is also able to provide an efficient and fast way of searching for items. Moreover, another important feature of the b-tree is the compact representation of sparse metadata and the easy increase of the reserved capacity of metadata space. Such a feature could be especially important for the case of NAND flash, because it could be expensive and useless, for example, to store a huge and flat array of inodes when the namespace of a file system contains no files at all. However, a b-tree is usually treated as an unsuitable solution for flash-oriented file systems because of the wandering tree and excessive write amplification issues. But SSDFS is based on the logical segment concept, the logical extent concept, and the delta-encoding technique, which completely exclude the wandering tree issue. These concepts are also the basis for decreasing (or completely eliminating) the write amplification issue. Moreover, the b-tree metadata structure provides the way not to keep an unnecessary reserve of metadata space on the volume. As a result, it means the exclusion of management operations on reserved metadata space (moving from one PEB to another) with the goal of keeping it in a valid state. Generally speaking, it is a way to decrease the number of PEB erase and write operations.

The root node of the key b-trees (inodes b-tree, shared extents b-tree, shared dictionary b-tree) is stored in the segment header and/or log footer (Fig. 27). Oppositely, the root node of the extents b-tree, dentries b-tree, and xattr b-tree is stored in an inode record. Generally speaking, if the root node is capable of storing the b-tree's few metadata records, or the b-tree has no metadata records at all, then the volume will not contain any node of that b-tree. SSDFS uses specialized types of segments. It means that a b-tree's leaf nodes will be stored in a segment dedicated and reserved for leaf nodes, while index nodes will be stored in a segment containing index nodes only (Fig. 28). Moreover, nodes of different b-trees can be stored in the same segment. Generally speaking, it is expected that the workload of b-tree nodes of the same type (leaf nodes, for example) could be the same, and this is the basis for grouping the nodes of different b-trees into one segment. It could also create a more compact representation of the b-trees on the volume. Every b-tree node could use one or several logical blocks. As a result, any log of a segment with b-tree nodes (Fig. 28) contains: (1) segment header, (2) log footer, and (3) payload that contains the b-tree nodes. Finally, the b-tree nodes' content is distributed among the main, diff updates, and journal areas of the log.

Figure 28: B-tree segment type.
Figure 29: User data segment type.
User data segment. SSDFS aggregates user data inside segments dedicated to the user data type (Fig. 29). It needs to be pointed out that user data could live under different workloads and have various features. Theoretically, it is possible to introduce various types of segments for user data; however, only one type of segment is used for user data.

Generally speaking, the main, diff updates, and journal areas are the key technique of user data management on an SSDFS volume (Fig. 29). The main area plays the role of cold data storage because it is dedicated to storing plain, full logical blocks. Moreover, if a logical block is stored in the main area, then all subsequent updates of this logical block are directed to the diff updates or journal areas. As a result, this is the key technique for achieving the cold nature of data in the main area of a log. Finally, the data in the main area can be moved from the initial PEB into a new one by means of the PEB migration scheme (in the case of an update operation directed to an exhausted PEB) or by the GC subsystem (in the case of the absence of update operations and an exhausted state of the PEB). The moving operation of data in the main area could be accompanied by applying the associated updates stored in the diff updates and/or journal areas.

The diff updates area is able to store into one NAND flash page the compressed blocks, delta-encoded portions of data, and the tail of the same file (Fig. 29). It is possible to expect that the frequency of updates of different portions of the same file could be lower than the frequency of updates of different files. As a result, the diff updates area is treated as an area with warm data. It is important to point out that data in the diff updates area could be invalidated partially or completely by update operations. Also, the valid data of the diff updates area is used during migration of the main area's content with the goal of preparing the actual state of the logical block(s). Generally speaking, it is not necessary to reserve any space for the diff updates area's content in a destination PEB during the migration operation.

The responsibility of the journal area is to gather into one NAND flash page the small files, the tails of different files, and compressed updates or delta-encoded data of different files (Fig. 29). Generally speaking, different files are able to grow or to be updated with higher frequency than the content of one file. As a result, gathering the content of different files into one NAND flash page increases the probability of updates in the journal area. Finally, the journal area is treated as an area with hot data. Moreover, the amount of updates in the journal area is expected to be high enough to achieve complete invalidation of the journal area by means of natural migration of data between logs and PEBs (by means of the migration scheme) without any GC subsystem activity.

However, some small files could be completely "cold" or updated rarely. As a result, it means the necessity to employ the logic of the GC subsystem or the PEB migration scheme to process some data in the journal area. However, the scheme of compaction of several small files into one NAND flash page decreases the write amplification issue by virtue of the opportunity to move the several small files into one page by a single copy operation.
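The routing of data between the main, diff updates, and journal areas can be summarized by a small decision helper. This is only a sketch of the policy described above; the enum, the function, and the thresholds are assumptions, not SSDFS code.

```c
#include <stdbool.h>
#include <stddef.h>

enum log_area { AREA_MAIN, AREA_DIFF_UPDATES, AREA_JOURNAL };

/* Hypothetical sketch of the area selection policy:
 * - small files, tails, and fragments of different files -> journal (hot);
 * - updates/deltas of an already stored logical block    -> diff updates (warm);
 * - plain, full logical blocks stored for the first time -> main (cold). */
static enum log_area choose_area(size_t len, size_t block_size, bool is_update)
{
	if (len < block_size)
		return AREA_JOURNAL;
	if (is_update)
		return AREA_DIFF_UPDATES;
	return AREA_MAIN;
}
```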
Current segment. SSDFS employs the concept of current segments (Fig. 30). Generally speaking, if it is necessary to add some new data on the volume (a new file, new logical blocks of an existing file, a new b-tree node), then a segment with free logical blocks is needed. Only a segment in the clean or using state can be used for adding new data. As a result, the SSDFS driver has a set of current segments (user data, index node, hybrid node, leaf node) because of the policy of grouping different types of data into different types of segments (Fig. 30).

Figure 30: Current segment concept.

A segment can be used as current until the complete exhaustion of the free logical blocks' pool in that particular segment (used or pre-dirty state). In the case of the absence of free logical blocks in the current segment, the file system driver tries to find in the segment bitmap a new segment in the clean or using state. If the driver is unable to find any clean or using segment, then it triggers the GC logic for searching the segments in the pre-dirty or dirty state with the goal of transforming a pre-dirty segment into the using state and a dirty segment into the clean state. Finally, if no pre-dirty or dirty segment can be processed or transformed into the clean or using state, then the driver has to inform the user about the absence of free space on the volume. Generally speaking, the GC subsystem has to track the state of the segment bitmap in the background and to prepare enough clean or using segments (in the case of enough free space on the volume). However, the GC subsystem's background activity should not affect the overall performance of the file system driver and should not increase the write amplification issue.

From one point of view, the several types of current segments create several independent threads of processing of new data. Additionally, the segment object in the file system driver is implemented in such a way that every segment has a queue for requests with new data (see the sketch after this section). Generally speaking, it means that a thread adds some new data into a current segment by means of simple insertion of the request into the tail of the queue. The rest of the processing of the requests in the queue is executed in the background by specialized PEB flush threads. Finally, the whole architecture creates a fast, efficient, simple, and multi-threaded mechanism of new data processing. Of course, if a volume was mounted in synchronous mode, then a thread needs to wait for the finishing of processing of the request that was added into the queue, and it means an inevitable degradation of file system performance in one thread. However, if several threads add requests into the queue, then the whole file system's performance may not degrade dramatically even in the synchronous mode. Finally, it is possible to state that SSDFS has a flexible and efficient subsystem of current segments capable of providing good performance of data processing.

B-tree Architecture
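The per-segment request queue mentioned above can be sketched as a simple locked FIFO: a foreground thread only appends a request, and a PEB flush thread drains the queue in the background. The types and names below are hypothetical; this is a minimal user-space sketch, not the driver's implementation.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch of a new-data request queued to a current segment. */
struct seg_request {
	struct seg_request *next;
	const void *buf;	/* data to be stored */
	size_t len;
};

/* Hypothetical sketch of a current segment object with its request queue. */
struct current_segment {
	pthread_mutex_t lock;
	struct seg_request *head, *tail;	/* FIFO of pending requests */
};

/* Foreground path: adding data is just appending a request to the tail;
 * the actual placement into PEB logs is done later by the flush thread. */
static void seg_enqueue(struct current_segment *cur, struct seg_request *req)
{
	req->next = NULL;
	pthread_mutex_lock(&cur->lock);
	if (cur->tail)
		cur->tail->next = req;
	else
		cur->head = req;
	cur->tail = req;
	pthread_mutex_unlock(&cur->lock);
}
```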
The b-tree is a widely used metadata structure that has been proven to be efficient in various file systems. For example, XFS, btrfs, JFS, ReiserFS, and HFS+ use different b-tree implementations. Usually, a b-tree provides the opportunity to have multiple child nodes for the same parent node. A regular b-tree (Fig. 31) contains a root node, index nodes, and leaf nodes. Generally speaking, the key advantage of any b-tree is the capability to store data in the form of nodes with the goal of providing efficient data extraction in the case of a block-oriented storage device, because any b-tree node is capable of including multiple data items and, as a result, of decreasing the number of I/O operations needed for searching for any item in the b-tree. Mostly, b-trees are used for storing and representing various file system metadata types (for example, inodes or extents). Usually, the root node (Fig. 31) represents the starting point of a b-tree. It keeps index records that contain a key and a pointer to a child node. Any index node contains the same kind of index records, because the responsibility of the index node is to provide the way to find a leaf node with the data records. The data record is the pair of a key and some associated value (for example, an extent or an inode). A hash, an ID, or any other value could play the role of the key, which would also be a field of the data record itself. Moreover, key values are the basis for data record ordering in the b-tree. Generally speaking, any lookup operation starts from the root node, passes through the index levels, and finds some data record on the key basis in the found leaf node.

Figure 31: Common b-tree architecture.
Why b-tree for LFS file system?
Usually, the b-tree is considered not a very good choice for flash-oriented and flash-friendly file systems by virtue of the wandering tree issue and a high value of write amplification. However, the b-tree architecture provides very important advantages: (1) an efficient search mechanism, (2) compact storage of sparse data, (3) a flexible technique of capacity increase and shrinking. Generally speaking, the key reason for a possible b-tree inefficiency in the case of an LFS file system is the Copy-On-Write (COW) policy. The COW policy means the necessity to store an updated logical block into some new and free position on the file system's volume. As a result, it initiates a lot of metadata updates that, finally, result in a significant increase of write amplification when a b-tree is used.

However, SSDFS uses the logical segment and logical extent concepts and the PEBs migration scheme. Generally speaking, these techniques provide the opportunity to completely exclude the wandering tree issue and to significantly decrease write amplification. SSDFS introduces the technique of storing data on the basis of a logical extent that describes the data's position by means of a segment ID and a logical block number. Finally, the PEBs migration technique guarantees that data will be described by the same logical extent until a direct change of the segment ID or logical block number. As a result, it means that the logical extent stays the same while the data is sitting in the same logical segment. The responsibility of the PEBs migration technique is to implement the continuous migration of data between PEBs inside the logical segment for the case of data updates and GC activity. Generally speaking, SSDFS's internal techniques guarantee that the COW policy will not update the content of a b-tree; the content of a b-tree will be updated only by regular end-user operations with the file system.

SSDFS uses the b-tree architecture for metadata representation (for example, the inodes tree, extents tree, dentries tree, xattr tree) because it provides a compact way of reserving the metadata space without the excessive overprovisioning of metadata reservation (needed, for example, in the case of a plain table or array). The excessive overprovisioning of metadata reservation dictates the necessity of keeping the reserved space in a valid state by means of migration among PEBs, because of the continuously increasing NAND unrecoverable bit error rate (UBER) for stored data. As a result, such migration activity (on the SSD or file system side) increases the write amplification issue.

Moreover, a b-tree provides an efficient technique of item lookup, especially for the case of an aged or sparse b-tree that is capable of containing a mixture of used and deleted (or freed) items. Such a b-tree feature could be very useful for the case of extent invalidation, for example. Also, SSDFS aggregates the b-tree's root node in the superblock (for example, the inodes tree case) or in the inode (for example, the extents tree case). As a result, it means that an empty b-tree will contain only the root node without the necessity to reserve any b-tree node on the file system's volume. Moreover, if a b-tree needs to contain only several items (two items, for example), then the root node's space can be used to store these items inline without the necessity to create a full-featured b-tree node.

One of the fundamental mechanisms of the SSDFS file system is the current segments approach.
This approach is used for aggregation of data living under similar workloads in the current segment of some type. For example, there are current segments for different b-tree node types (index, hybrid, leaf nodes). Generally speaking, it means that the current segment for leaf nodes aggregates the leaf nodes of different b-trees (inodes, extents, dentries b-trees, for example). Every current segment allocates logical blocks until the complete exhaustion of this segment. Finally, the SSDFS driver needs to allocate a new current segment in the case of complete exhaustion of the free space of the previous current segment. Moreover, the SSDFS driver uses the compression and delta-encoding techniques, which are the way to achieve a compact representation of b-tree nodes in the PEB's space.

As a result, SSDFS uses b-trees with the goal of achieving a compact representation of metadata, a flexible way to expand or shrink the b-tree's space capacity, and an efficient mechanism of item lookup.
Hybrid b-tree architecture. A regular b-tree contains two types of nodes: index and leaf ones. The index node keeps pointers to other nodes with the goal of implementing the mechanism of a fast lookup operation in the b-tree. Oppositely, the leaf node keeps items of real data that are stored in the b-tree. Moreover, a node creation means the reservation of 4-64 KB of the volume's space. However, usually, a b-tree is a metadata structure that does not receive a lot of items at once. As a result, it means that a growing b-tree could contain some number of empty or semi-empty index nodes. These index nodes could stay empty or semi-empty for a significant amount of time, which results in an increased number of I/O operations during search in the b-tree and flush of the b-tree. Generally speaking, this side effect could increase write amplification for the case of flash-friendly file systems.

Figure 32: B-tree architecture with hybrid nodes.

SSDFS uses a hybrid b-tree architecture (Fig. 32) with the goal of eliminating the index nodes' side effect. The hybrid b-tree operates with three node types: (1) index node, (2) hybrid node, (3) leaf node. Generally speaking, the peculiarity of the hybrid node (Fig. 33) is the mixture of index and data records in one node. The hybrid b-tree starts with a root node (Fig. 34, case A) that is capable of keeping two index records or two data records inline (if the size of a data record is equal to or less than the size of an index record). If the b-tree needs to contain more than two items, then the first hybrid node should be added into the b-tree. The first level of the b-tree is able to contain only two nodes (Fig. 34, case B) because the root node is capable of storing only two index records. Generally speaking, the initial goal of the hybrid node is to store the data records in the presence of a reserved index area (Fig. 33).

Figure 33: Hybrid node architecture.
Figure 34: Hybrid b-tree evolution.

The exhaustion of the data area's space of the first hybrid node triggers the addition of the second hybrid node on the first level of the b-tree (Fig. 34, case B). If both hybrid nodes of the first level are completely exhausted, then a special transformation of the b-tree takes place (Fig. 34, case C). First of all, the left hybrid node is transformed into a leaf node. A portion of the data records is moved from the right hybrid node to the newly made leaf node. Also, the index area of the right hybrid node is increased in size after the move operation of data records. The next step is to move the index record of the left node from the root node into the index area of the right node. Finally, the root node will keep the index record for the hybrid node, but the hybrid node will keep the index record for the leaf node (Fig. 34, case C). As a result, the hybrid b-tree will contain the completely full leaf node and the hybrid node with free space for new data records.

The next step of the hybrid b-tree's evolution is adding data records into the hybrid b-tree node until the complete exhaustion of the node's data area. Exhaustion of the data area of the hybrid node triggers: (1) creation of another leaf node, (2) moving all data records from the hybrid node into the newly created leaf node, (3) adding an index record for the leaf node into the index area of the hybrid node. Usually, the data area of the hybrid node is smaller than the capacity of a leaf node. It means that data records will be added into the newly created leaf node at first. Finally, the hybrid node will gather the data records in the case of exhaustion of the leaf node's capacity.
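The hybrid node's layout (Fig. 33) can be sketched as a header followed by an index area that grows and a data area that shrinks as the node evolves. The offsets and field names below are assumptions for illustration, not the on-disk format.

```c
#include <stdint.h>

/* Hypothetical sketch of a hybrid b-tree node header (Fig. 33).
 * The node's payload is split between an index area and a data area;
 * exhaustion of the index area enlarges it at the expense of the data
 * area until the node degenerates into a pure index node. */
struct hybrid_node_header {
	uint16_t node_type;		/* index, hybrid, or leaf */
	uint16_t items_count;		/* data records currently stored */
	uint32_t index_area_offset;	/* start of index records in the node */
	uint32_t index_area_size;	/* bytes reserved for index records */
	uint32_t data_area_offset;	/* start of data records in the node */
	uint32_t data_area_size;	/* bytes reserved for data records */
};
```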
It means that the data area of the hybrid node plays the role of a temporary buffer that aggregates enough data records before a leaf node creation. Generally speaking, this sequence of leaf node creation takes place until the exhaustion of the index area of the hybrid node. Moreover, the index area's exhaustion triggers an increase of the index area's capacity. As a result, it means a decrease of the data area's capacity in the hybrid node. If the index area of the hybrid node extends over the whole node's space, then such a node becomes an index node (Fig. 34, case D). Finally, the index node needs to be completely filled by index records.

Figure 35: Hybrid b-tree evolution.

The exhaustion of the index node (Fig. 34, case D) implies the addition of a right hybrid node on the same level (Fig. 35, case E). It means that the root node will point to the newly added hybrid node. As a result, this hybrid node plays the role of the initial point for evolving the right branch of the b-tree by means of adding new leaf nodes. Generally speaking, the evolution implies addition of data records until the state when the right hybrid node is transformed into an index node. If both index nodes become exhausted by index records, then a hybrid node needs to be added. This hybrid node will contain the pointers to the exhausted index nodes, and the root node will point to the newly added hybrid node (Fig. 35, case G). Finally, the hybrid node plays the essential role in the hybrid b-tree's evolution.

Figure 36: Node type migration scheme.

The operation of deletion of data records could initiate the transformation of index node(s) into hybrid ones (Fig. 36). Such a transformation of the node's type could take place many times by virtue of a mixture of addition and deletion operations. Also, it is possible to imagine the situation of the necessity to split one index node into two hybrid ones in the case of inserting some data record into the middle of an exhausted leaf node. Moreover, such a splitting operation could result in the addition of a hybrid node on the next upper level of the b-tree.
B-tree delayed invalidation. SSDFS processes the delete operations in hybrid b-trees by means of special techniques. For example, deletion of any inode from the inodes b-tree is treated as freeing of the corresponding item in a particular node. Generally speaking, it means that deleted inodes can be allocated again, and the volume space used by the inodes b-tree's nodes remains reserved space. If anyone considers the case of deletion of all items in a leaf node, then such a node can be transformed into the pre-allocated state instead of real deletion of the node from the b-tree structure. Moreover, the pre-allocated state means that it doesn't need to keep the allocated space in a PEB for this node, but the index and/or hybrid nodes continue to keep the same index records for the node in the pre-allocated state. Finally, it decreases write amplification because it doesn't need to update the index/hybrid nodes by means of deletion of the index records that point to the leaf node. However, if a b-tree becomes completely empty, then it is the case of the real destruction of the b-tree structure.

Additionally, the SSDFS driver uses the technique of delayed invalidation/destruction of b-tree nodes or sub-trees. Especially, this technique could make the operation of truncation or deletion of big files faster and more efficient. Generally speaking, the SSDFS driver has a special invalidation queue for the index records (that point to a node or a sub-tree) and a dedicated thread whose goal is to invalidate the data records in the leaf/hybrid nodes and to destroy the sub-tree structure in the background. Finally, it means that the index record for a node or sub-tree only needs to be placed into the invalidation queue during the truncate or delete operation, while the real processing of the node or sub-tree takes place in the background. An interesting side effect of such an approach is the opportunity to fulfill this background activity in the idle state of the file system driver, for example.
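The delayed invalidation technique boils down to a producer/consumer queue of index records: truncate/delete only enqueues, and a later background pass destroys the referenced nodes or sub-trees. Below is a minimal single-threaded sketch under assumed names; in the driver the draining would be done by a dedicated thread.

```c
#include <stdlib.h>

/* Hypothetical sketch of a queued invalidation request: an index record
 * pointing to a b-tree node or a whole sub-tree awaiting destruction. */
struct invalidate_req {
	struct invalidate_req *next;
	void *index_record;	/* identifies the node/sub-tree to destroy */
};

struct invalidate_queue {
	struct invalidate_req *head, *tail;
};

/* Fast path used by truncate/delete: just remember what must be destroyed. */
static void invalidate_later(struct invalidate_queue *q, void *index_record)
{
	struct invalidate_req *req = malloc(sizeof(*req));
	if (!req)
		return;	/* a real driver would fall back to synchronous invalidation */
	req->next = NULL;
	req->index_record = index_record;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
}

/* Background pass (dedicated thread in the driver, e.g. while idle). */
static void drain_invalidate_queue(struct invalidate_queue *q,
				   void (*destroy)(void *index_record))
{
	while (q->head) {
		struct invalidate_req *req = q->head;
		q->head = req->next;
		if (!q->head)
			q->tail = NULL;
		destroy(req->index_record);
		free(req);
	}
}
```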
The inode is the cornerstone metadata structure of any Linux file system: it keeps all information about a file excluding the file's name and content (user data, for example). Generally speaking, this metadata structure is a critical one that requires both high reliability of storage and high efficiency of access and modification operations. The creation of a file results in the association of a name and an inode ID with the newly created file. Moreover, the inode ID is a unique number in the scope of a particular file system's volume. The name of the file and the inode ID are stored as an item of a folder; namely, the folder associates file names and inode instances. As a result, if an end-user or application tries to access a file by means of its name, then the OS employs this file's name for an inode ID lookup. The found inode ID is used by the file system driver for retrieving the inode instance.

Figure 37: Inodes b-tree architecture.

Generally speaking, the inode table can be imagined as a generalized array of inode instances (Fig. 37) because every inode is identified by an integer value (inode ID). However, the huge capacity of modern storage devices (HDD, SSD) and highly intensive operations of creation/deletion of files make the efficient management of the inode table a very complex problem. Moreover, the use of a simple array or table for inode instances reserves a big space for such a table, and such a reservation could increase write amplification because of the necessity to keep the reserved space in a valid state. Another possible issue is the easy exhaustion of the reserved space without a flexible way to extend it. Oppositely, a b-tree provides an easy way of compact representation of a small and sparse set of items. Moreover, a b-tree is an easily extendable metadata structure with a flexible mechanism for both increasing and shrinking the nodes' space. Finally, the efficient and fast lookup technique is another advantage of any b-tree. These points were the steady basis for the selection of the b-tree as the basic metadata structure for the inodes tree in SSDFS.

Figure 38: Raw inode structure.
The SSDFS raw inode (Fig. 38) is a metadata structure of fixed size that can vary from 256 bytes to several KBs. The size of the inode is defined during the file system's volume creation. Usually, the inode object includes the file mode; file attributes; user/group ID; access, change, and modification times; file size in bytes and blocks; and the links count. The most special part of the SSDFS raw inode is the private area that is used for storing: (1) a small file inline, (2) the root node of the extents, dentries, and/or xattr b-tree.
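A 256-byte raw inode with a private area can be sketched as follows. The exact field set, their widths, and the split between the fixed part and the private area are assumptions for illustration; only the 256-byte total and the listed field categories come from the text above.

```c
#include <stdint.h>

/* Hypothetical sketch of a 256-byte SSDFS raw inode. The private area is
 * used either for a small file stored inline or for the root node of the
 * extents / dentries / xattr b-tree. */
struct raw_inode {
	uint64_t atime, ctime, mtime;	/* timestamps */
	uint64_t size_bytes;		/* file size in bytes */
	uint64_t size_blocks;		/* file size in logical blocks */
	uint32_t uid, gid;		/* owner */
	uint32_t links_count;
	uint16_t mode;			/* file type and permissions */
	uint16_t flags;			/* file attributes */
	uint8_t	 reserved[72];		/* padding up to the private area */
	uint8_t	 private_area[128];	/* inline data or b-tree root node */
};

_Static_assert(sizeof(struct raw_inode) == 256, "raw inode must stay 256 bytes");
```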
The SSDFS inodes b-tree is a hybrid b-tree that includes hybrid nodes with the goal of using the node's space in a more efficient way by means of combining index and data records inside the node. The root node of the inodes b-tree is stored in the log footer or partial log header of every log. Generally speaking, it means that SSDFS uses massive replication of the root node of the inodes b-tree. Actually, the inodes b-tree node's space includes a header, an index area (in the case of a hybrid node), and an array of inodes ordered by ID values. If a node has 8 KB in size and the inode structure is 256 bytes in size, then the maximum capacity of one inodes b-tree node is 32 inodes.

Generally speaking, the inodes table can be imagined as an imaginary array that is extended by means of adding new inodes into the tail of the array (Fig. 37). However, an inode can be allocated or deleted by virtue of create-file or delete-file operations, for example. As a result, every b-tree node has an allocation bitmap that tracks the state (used or free) of every inode in the b-tree node. The allocation bitmap provides the mechanism of fast lookup of a free inode with the goal of reusing the inodes of deleted files. Also, the inodes b-tree uses a special technique of processing completely empty leaf nodes, which could achieve the empty state after deletion of all inodes in the node. This technique is based on the conversion of an empty b-tree node into the pre-allocated state. Generally speaking, the pre-allocated state means that the logical extent continues to be reserved for this b-tree node, but no space is allocated in the segment's PEBs. The important point of such a technique is the opportunity not to update the index records in the index/hybrid b-tree nodes that point to the leaf node converted into the pre-allocated state. It also means that the leaf node's space continues to be reserved on the file system's volume.

Additionally, every b-tree node has a dirty bitmap whose goal is to track modification of inodes. Generally speaking, the dirty bitmap provides the opportunity to flush not the whole node but the modified inodes only. As a result, such a bitmap could play the cornerstone role in the delta-encoding or the Diff-On-Write approach. Moreover, a b-tree node has a lock bitmap whose responsibility is to implement the mechanism of exclusive locking of a particular inode without the necessity to lock the whole node exclusively. Generally speaking, the lock bitmap was introduced with the goal of improving the granularity of the lock operation. As a result, it provides a way to modify different inodes in the same b-tree node without the use of an exclusive lock of the whole b-tree node. However, the exclusive lock of the whole tree has to be used for the case of addition or deletion of a b-tree node.
The Linux kernel identifies a file by means of an inode that is unique for the file. However, the association of a file name and an inode instance takes place by means of a directory entry (dentry). Moreover, different dentries in the same or different folders can identify the same file or inode. Dentries play an important role in the directory caching that contains metadata of frequently accessed files for more efficient access operations. Another important role of dentries is folder hierarchy traversal, because the dentries connect folders with files.

Figure 39: Dentries b-tree architecture.
The SSDFS dentry (Fig. 39) is a metadata structure of fixed size (32 bytes). It contains an inode ID, a name hash, a name length, and an inline string for 12 symbols. Generally speaking, the dentry is able to store an 8.3 filename inline. If the file/folder has a longer name (more than 12 symbols), then the dentry keeps only a portion of the name, while the whole name is stored in a shared dictionary. The goal of such an approach is to represent the dentry by a compact metadata structure of fixed size for fast and efficient operations with dentries. It is possible to point out that there are a lot of use cases when the name of a file or folder is not very long. As a result, the dentry's inline string could be the only storage for the file/folder name. Moreover, the goal of the shared dictionary is to store the long names efficiently by means of a deduplication mechanism.
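A 32-byte dentry with a 12-symbol inline name can be sketched as below. The field names and the exact split of the 32 bytes are assumptions; only the total size, the listed fields, and the 12-symbol inline string come from the text above.

```c
#include <stdint.h>

/* Hypothetical sketch of a fixed-size (32-byte) SSDFS dentry. Names longer
 * than 12 symbols keep only their initial portion inline; the full name is
 * stored in the shared dictionary and found via the name hash. */
struct ssdfs_dentry_sketch {
	uint64_t ino;			/* inode ID */
	uint64_t name_hash;		/* hash of the full name */
	uint8_t	 name_len;		/* full name length */
	uint8_t	 dentry_type;		/* e.g. inline name vs. name in dictionary */
	uint8_t	 file_type;		/* regular file, directory, symlink, ... */
	uint8_t	 flags;
	char	 inline_name[12];	/* an 8.3 name fits entirely */
};

_Static_assert(sizeof(struct ssdfs_dentry_sketch) == 32, "dentry must stay 32 bytes");
```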
The dentries b-tree is a hybrid b-tree (Fig. 39) whose root node is stored in the inode's private area. By default, the inode's private area has 128 bytes in size, and the SSDFS dentry has 32 bytes in size. As a result, the inode's private area provides enough space for 4 inline dentries. Generally speaking, if a folder contains 4 or fewer files, then the dentries can be stored in the inode's private area without the necessity to create the dentries b-tree. Otherwise, if a folder includes more than 4 files or folders, then the regular dentries b-tree needs to be created, with the root node stored in the private area of the inode. Actually, every node of the dentries b-tree contains a header, an index area (for the case of a hybrid node), and an array of dentries ordered by the hash value of the filename. Moreover, if a b-tree node has 8 KB in size, then it is capable of containing at most 256 dentries.

Generally speaking, the hybrid b-tree was chosen for the dentries metadata structure by virtue of the compactness of the metadata structure representation and the efficient lookup mechanism. Dentries are ordered on the basis of the name's hash. Every node of the dentries b-tree has: (1) a dirty bitmap for tracking modified dentries, (2) a lock bitmap for exclusive locking of particular dentries without the necessity to lock the whole b-tree node. Actually, it is expected that the dentries b-tree could contain few nodes on average, because two nodes (8 KB in size) of the dentries b-tree are capable of storing about 400 dentries.
Any file system is dedicated to storing user data in the form of files. Various files could have different lengths, and the inode stores information about the length of the file in blocks and bytes. Also, the file system is responsible for logical block allocation in the case of adding new data. Generally speaking, the file system driver always tries to allocate a contiguous sequence of logical blocks for any file's content. The contiguous sequence of logical blocks can be described by an extent (starting LBA and length) as the most compact descriptor of such a sequence. However, it is not always possible to allocate a contiguous sequence of free logical blocks by virtue of possible fragmentation of the file system's volume space by delete and truncate operations. As a result, the allocation operation can be fulfilled by means of allocating several smaller contiguous sequences of logical blocks in various locations on the volume. Moreover, an SSDFS extent cannot be greater than the segment size (Fig. 40). Finally, all the mentioned factors result in the description of any file's content by means of a set of extents.

Figure 40: Extents b-tree architecture.
The SSDFS raw extent (Fig. 40) describes a contiguous sequence of logical blocks by means of a segment ID, the logical block number of the starting position, and a length. By default, the SSDFS inode has a private area of 128 bytes in size, and the SSDFS extent has 16 bytes in size. As a result, the inode's private area is capable of storing not more than 8 raw extents.

Generally speaking, the hybrid b-tree was chosen with the goal of efficiently storing a larger number of raw extents. First of all, it was taken into account that file sizes can vary a lot on the same file system's volume. Moreover, the size of the same file could vary significantly during its lifetime. Finally, the b-tree is a really good mechanism for storing the extents compactly with a very flexible way of increasing or shrinking the reserved space. Also, the b-tree provides a very efficient technique of extent lookup. Additionally, SSDFS uses compression that guarantees a really compact storage of semi-empty b-tree nodes. Moreover, the hybrid b-tree provides a way to mix index and data records in the hybrid nodes with the goal of achieving a much more compact representation of the b-tree's content.

Moreover, it needs to be pointed out that the extents b-tree's nodes group the extent records into forks (Fig. 40). Generally speaking, the raw extent describes the position on the volume of some contiguous sequence of logical blocks without any details about the offset of this extent from the file's beginning. As a result, the fork (Fig. 40) describes the offset of some portion of the file's content from the file's beginning and the number of logical blocks in this portion. Also, the fork contains space for three raw extents that are able to define the positions of three contiguous sequences of logical blocks on the file system's volume. Finally, one fork has 64 bytes in size. If anybody considers a b-tree node of 4 KB in size, then such a node is capable of storing about 64 forks with 192 extents in total. Generally speaking, even a small b-tree is able to store a significant number of extents and to determine the positions of the fragments of a generally big file. If anybody imagines a b-tree with two 4 KB nodes in total, where every extent defines the position of an 8 MB portion of a file, then such a b-tree is able to describe a file of 3 GB in total.
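The 16-byte raw extent and the 64-byte fork can be sketched directly from the sizes given above; the field names and widths are assumptions, chosen only so that the stated sizes hold.

```c
#include <stdint.h>

/* Hypothetical sketch of a 16-byte SSDFS raw extent: a contiguous run of
 * logical blocks addressed by segment ID and starting logical block. */
struct raw_extent {
	uint64_t seg_id;	/* logical segment that holds the run */
	uint32_t logical_blk;	/* starting logical block inside the segment */
	uint32_t len;		/* number of logical blocks */
};

/* Hypothetical sketch of a 64-byte fork: a portion of the file (offset and
 * length in blocks from the file's beginning) described by up to 3 extents. */
struct extents_fork {
	uint64_t start_offset;		/* file offset of the portion, in blocks */
	uint64_t blks_count;		/* length of the portion, in blocks */
	struct raw_extent extents[3];	/* where the portion lives on the volume */
};

_Static_assert(sizeof(struct raw_extent) == 16, "extent must stay 16 bytes");
_Static_assert(sizeof(struct extents_fork) == 64, "fork must stay 64 bytes");
```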
Deduplication. One of the known techniques of decreasing write amplification is the deduplication approach. Generally speaking, the key mechanism of deduplication is the detection of replication of the same data on the volume with the goal of storing the found duplicated fragment in only one place. As a result, it means that all files containing such deduplicated data should store the same extent in their extents b-trees. SSDFS uses a shared extents b-tree for implementation of the deduplication technique.

First of all, the SSDFS driver takes into account the size of a file. If the size is smaller than some threshold (for example, 4 KB - 8 KB), then such a file is not considered a deduplication target. Otherwise, the fingerprint of the first 8 KB portion of the file needs to be calculated (the size of the initial portion can be defined by a special threshold value). Then the presence of the calculated fingerprint in the shared extents b-tree needs to be checked. If no such fingerprint exists in the shared extents b-tree, then only the calculated fingerprint has to be stored in the b-tree. Moreover, fingerprints for the rest of the file do not need to be calculated in such a case.

Figure 41: Deduplication mechanism of shared extents b-tree.

Oppositely, if the same fingerprint for the first 8 KB of the file is already present in the shared extents b-tree, then the fingerprints for the rest of the file need to be calculated and their presence in the shared extents b-tree needs to be checked. Again, the calculated fingerprints are stored in the shared extents b-tree if no such fingerprints were found. Otherwise, the file system driver has to store the extents of the found deduplicated fragments into the extents b-trees of the particular files (Fig. 41).

Generally speaking, the shared extents b-tree will keep only one fingerprint of the first 8 KB for all files that have unique content. Oppositely, duplicated file content will be detected during an attempt to store a second copy of the same file. However, the detection of this duplication will result in deduplication of only the first 8 KB of the file and in storing the fingerprints for the rest of the duplicated file in the shared extents b-tree. Finally, the third (and any subsequent) attempt to store the duplicated file will result in complete deduplication of the file's content.

Figure 42: Record types in shared extents b-tree.
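The first-portion fingerprint check can be sketched as follows. Everything here is a stand-in for illustration: the FNV-1a hash, the thresholds, and the tiny in-memory table replace the stronger fingerprint and the shared extents b-tree lookup that SSDFS would actually use.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define DEDUP_MIN_FILE_SIZE	8192	/* hypothetical threshold: smaller files are skipped */
#define DEDUP_PORTION_SIZE	8192	/* fingerprint is taken over the first 8 KB */

/* Stand-in fingerprint (FNV-1a); SSDFS would use a stronger fingerprint. */
static uint64_t fingerprint(const uint8_t *data, size_t len)
{
	uint64_t h = 1469598103934665603ULL;
	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Tiny in-memory stand-in for the shared extents b-tree (illustration only). */
static uint64_t known_fp[1024];
static size_t known_count;

static bool shared_tree_contains(uint64_t fp)
{
	for (size_t i = 0; i < known_count; i++)
		if (known_fp[i] == fp)
			return true;
	return false;
}

static void shared_tree_insert(uint64_t fp)
{
	if (known_count < 1024)
		known_fp[known_count++] = fp;
}

/* Decide whether the rest of the file must be fingerprinted:
 * only when the first portion is already known to the shared tree. */
static bool dedup_first_portion(const uint8_t *file, size_t size)
{
	if (size < DEDUP_MIN_FILE_SIZE)
		return false;	/* too small to be a deduplication target */

	uint64_t fp = fingerprint(file, DEDUP_PORTION_SIZE);
	if (!shared_tree_contains(fp)) {
		shared_tree_insert(fp);	/* remember the unique first portion only */
		return false;		/* no need to fingerprint the rest */
	}
	return true;	/* possible duplicate: fingerprint the rest of the file */
}
```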
The SSDFS shared extents b-tree is able to store several record types (Fig. 42 - 43): (1) deduplicated extent record, (2) fingerprint record, (3) invalidation record. The deduplicated extent records are ordered by fingerprint value, and each record contains a fingerprint, an extent (segment ID, logical block, length), and a reference counter value. Generally speaking, the goal of these records is to find the deduplicated extents on the basis of a fingerprint value.

Figure 43: Shared extents b-tree architecture.

The fingerprint records are ordered by segment ID and logical block values, and the responsibility of such records is to provide a way to find the fingerprint value on the basis of knowledge of the segment ID and logical block values. Every time the information about a deduplicated extent needs to be added, both a deduplicated extent record and a fingerprint record need to be inserted into the shared extents b-tree. Moreover, the reason to have two types of records is the necessity to use the fingerprint record in the case of file deletion or truncation. Generally speaking, only the extent data (segment ID, logical block, length) is available at the beginning of a delete or truncate operation. It means that the extent data can be used for searching for the fingerprint value. Finally, the found fingerprint value can be used for searching for the deduplicated extent record that has to be found with the goal of decrementing the reference counter (or completely removing the record if the reference counter is equal to zero).

The third record type is the invalidation record that implements a mechanism of delayed invalidation of extents. Generally speaking, it means that a big file doesn't need to be deleted (or truncated) immediately; instead, it is possible to create invalidation record(s) with a pointer to the whole extents b-tree (or a sub-tree) and to store the invalidation record(s) into the shared extents b-tree at first. The processing of invalidation records takes place in the background by a dedicated thread (in the idle state of the file system driver, for example). First of all, the thread has to extract an invalidation record and to check the presence of a deduplicated extent record for the extent under invalidation. If the shared extents b-tree contains the deduplicated extent record for this extent, then only the reference counter needs to be decremented. Otherwise, if the shared extents b-tree has no deduplicated extent record, or the reference counter has reached the nil value, then the requested extent needs to be invalidated. Moreover, the corresponding deduplicated extent and fingerprint records have to be deleted from the shared extents b-tree in the case of a zeroed reference counter. Finally, the invalidation record has to be deleted from the shared extents b-tree as well.
SSDFS introduces a dentry metadata structure of fixed size that is able to store only 12 inline symbols (an 8.3 filename) with the goal of achieving efficient operations with the dentries b-tree. However, it means that the dentry itself is capable of storing short names only. From one viewpoint, files/folders have short names very frequently. As a result, names are stored only in dentries with high frequency. Moreover, the fixed size of the dentry provides a simple and fast way to search for a particular dentry in the b-tree node. Oppositely, a varied size of dentries makes the searching algorithm more complex and inefficient, and it requires adding some additional metadata in the node.

As a result, SSDFS stores only the short names in the dentries and uses the shared dictionary for storing the long names. The shared dictionary's responsibility is to gather the long names created on the file system's volume. Generally speaking, gathering the names in one place means that the shared dictionary keeps only one copy of a name that can be used for different files. Also, the shared dictionary provides the basis for the technique of substring deduplication. Finally, the shared dictionary provides the way to keep the names in a very compact representation.

Moreover, one of the possible strategies of the shared dictionary is not to delete names at all. From one point of view, it means that such a strategy is able to decrease the number of update operations for the shared dictionary. From another point of view, if an end-user tries to use the name of a deleted file for a newly created one, then such a name doesn't need to be added into the shared dictionary because it will already be there. However, it needs to be pointed out that the strategy of not using the delete operation could have a side effect. Generally speaking, malicious activity of name generation is able to result in unmanageable growth of the shared dictionary. However, the substring deduplication technique is able to manage such malicious activity efficiently.

Figure 44: Shared dictionary b-tree architecture.
The shared dictionary is the hybrid b-tree whose root node is stored into the superblock (Fig. 44). Every hybrid or leaf node of the shared dictionary b-tree includes: (1) lookup table1, (2) lookup table2, (3) hash table, and (4) strings area (Fig. 47).
Figure 47: Shared dictionary b-tree's node structure.
The lookup table1 is located in the node's header and it implements clustering or grouping of the items of lookup table2. By design, lookup table1 is capable of keeping only 20 items. Every item (Fig. 47) contains: (1) hash value, (2) starting index in the lookup table2, and (3) number of items in the group. Generally speaking, the responsibility of lookup table1 is to provide the mechanism of fast search of some items' cluster in the lookup table2 on the basis of a hash value.
As a result, the found item in the lookup table1 is the basis for further search in the lookup table2. This table (lookup table2) is located at the bottom of the node (Fig. 47) and its goal is to provide the mechanism for searching the position of a name's prefix (or starting keyword). Every item of lookup table2 (Fig. 47) contains: (1) hash value, (2) prefix length, (3) number of deduplicated names, and (4) index in the hash table. Generally speaking, the lookup table2 describes positions of names' prefixes in the strings area.
Figure 45: Names deduplication mechanism.
Figure 46: Deduplicated strings representation.
Finally, the hash table (Fig. 47) is located above the lookup table2. It is responsible for describing every name in the strings area. Every item of the hash table contains: (1) hash value, (2) name offset, (3) name length, and (4) name type. Generally speaking, the hash table implements the mechanism to define the position and the length of the suffix of a deduplicated name because the full name is constructed from the prefix and the suffix (Fig. 45 - 46). Finally, it needs to find the prefix from the lookup table2 and the suffix from the hash table for the extraction of a full name. The last item of the node is the strings area that keeps the full and deduplicated names. Generally speaking, such a b-tree is an efficient mechanism for storing and searching strings of variable length.
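The three tables of a shared dictionary node can be sketched as follows; these item layouts are illustrative assumptions and are not the real on-disk structures.

/* Illustrative items of a shared dictionary node (assumed layout).
 * lookup1 groups lookup2 items; lookup2 locates a name prefix in the
 * strings area; the hash table locates each name's suffix. */
#include <stdint.h>

struct lookup1_item_sketch {	/* node header, up to 20 items */
	uint64_t hash;		/* lowest hash value of the group */
	uint16_t start_index;	/* first item of the group in lookup2 */
	uint16_t items_count;	/* number of lookup2 items in the group */
};

struct lookup2_item_sketch {	/* bottom of the node */
	uint64_t hash;		/* hash of the prefix (starting keyword) */
	uint16_t prefix_len;	/* length of the shared prefix */
	uint16_t names_count;	/* deduplicated names sharing the prefix */
	uint16_t hash_index;	/* first related item in the hash table */
};

struct hash_item_sketch {	/* one per name in the strings area */
	uint64_t hash;		/* hash of the full name */
	uint16_t name_offset;	/* offset of the suffix in the strings area */
	uint8_t name_len;	/* suffix length */
	uint8_t name_type;	/* full name or deduplicated suffix */
};
/* full name = prefix (found via lookup2) + suffix (found via hash table) */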
An extended attribute represents a pair of name and value that is associated with a file or a folder. It is possible to say that extended attributes play the role of an extension of the regular attributes that are associated with inodes. Frequently, extended attributes are used with the goal to provide additional functionality in a file system, for example, additional security features such as Access Control Lists (ACL). The name of an extended attribute is a null-terminated string and it is defined in the fully qualified namespace form (for example, security.selinux). Currently, there exist the security, system, trusted, and user classes of extended attributes. Usually, VFS limits the length of an xattr's name to 255 bytes and the size of the value to 64 KB.
Figure 48: Extended attributes (xattr) b-tree architecture.
Figure 49: Extended attributes b-tree's node structure.
SSDFS file system uses a metadata structure of fixed size (64 bytes) for representation and storing of the xattr record on a file system's volume. This metadata structure is capable of keeping 16 symbols of the name inline and a value of 32 bytes (Fig. 49). However, the namespace class is represented not by a string itself but by means of a special field of name type. Generally speaking, it means that if the name or the value is smaller than the declared limit then it can be stored inline in the xattr record. Otherwise, if a name is longer than 16 symbols then the initial portion of the name will be stored inline in the xattr record but the whole name has to be stored into the shared dictionary. Also, if a value is bigger than 32 bytes then the blob has to be stored in some logical block(s) of the volume and the xattr record will keep the extent that describes the position of this blob. Moreover, it is possible to employ the shared extents b-tree for storing the xattr's blobs in the range from 32 bytes to 4 KB. Additionally, the shared extents b-tree is able to deduplicate the blobs with identical content.
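A possible shape of the 64-byte xattr record is sketched below; the field set is an assumption that only mirrors the limits mentioned above (16 inline name symbols, a 32-byte inline value, an extent for bigger blobs).

/* Illustrative 64-byte xattr record (assumed layout). Short names and
 * small values are stored inline; longer ones spill into the shared
 * dictionary and into an extent on the volume, respectively. */
#include <stdint.h>

#define XATTR_INLINE_NAME 16
#define XATTR_INLINE_VALUE 32

struct xattr_record_sketch {
	uint64_t name_hash;			/* hash of the full name */
	uint8_t name_type;			/* namespace: user, security, ... */
	uint8_t name_len;			/* full name length */
	uint8_t flags;				/* inline value or external blob */
	uint8_t reserved;
	uint32_t blob_len;			/* value size */
	uint8_t inline_name[XATTR_INLINE_NAME];	/* first 16 symbols of the name */
	union {
		uint8_t inline_value[XATTR_INLINE_VALUE];	/* value <= 32 bytes */
		struct {					/* value > 32 bytes */
			uint64_t seg_id;
			uint32_t logical_blk;
			uint32_t len;
		} blob_extent;
	} value;
};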
SSDFS xattr tree (Fig. 48) is implemented as a hybrid b-tree with the root node stored in the inode's private area. By default, the private area of an inode is 128 bytes in size. Usually, a file owns the extents b-tree and a folder owns the dentries b-tree. Finally, it means that the first 64 bytes of the private area will be used by the root node of the extents or dentries b-tree and the remaining 64 bytes can be used for the root node of the xattr b-tree. Also, if a file or folder has only one extended attribute then the xattr record (64 bytes) can be stored inline in the second half of the private area.
The xattr record is a metadata structure of fixed size. Generally speaking, the goal of such approach is to keep an array of xattr records in the node because the fixed size of every item in the array provides a very efficient mechanism for lookup, access and modification operations. Moreover, the header of the b-tree's node contains a lookup table (Fig. 49) that is capable of storing 22 records. The goal of such lookup table is the clustering of xattr records in the main area for implementing an efficient search mechanism. Every item in the lookup table is a hash value of an extended attribute's name. Generally speaking, the hash value identifies the position of the starting xattr record in a group (or cluster) of xattr records. Every such starting record is located at a fixed position in the main area of the node. As a result, the lookup table provides the way to restrict the search to some cluster in the main area.
Generally speaking, the case of a significant number of extended attributes for the same file/folder is very rare. It means that it makes sense to consider an xattr record of bigger size (128 bytes, for example) with the goal to optimize the operations with xattr records by increasing the inline area of the value (blob). Moreover, it is possible to consider an inode's record of bigger size (512 bytes, for example). Such inode will be able to keep about 5 inline xattr records. Additionally, it is possible to implement a shared xattrs b-tree that will be able to store xattr records of different files/folders in one b-tree. However, even if anybody considers only the dedicated xattrs b-tree then a b-tree with 2 nodes of 4 KB in size is capable of storing about 128 xattr records in total.
The write amplification issue is the crucial problem for the case of flash-oriented and flash-friendly file systems. It is possible to state that this issue is the key reason of SSD lifetime shortening. Every particular file system has unique reasons for the write amplification issue and it contains some techniques to decrease or to eliminate this problem. SSDFS file system uses the following techniques for resolving the write amplification issue: (1) compression, (2) small files compaction scheme, (3) logical extent concept, (4) Diff-On-Write approach, (5) deduplication, (6) inline files.
Compression. SSDFS file system widely uses compression both for user data and for metadata. The current file system driver implementation supports zlib and LZO compression. Moreover, SSDFS file system uses a special compaction scheme which gathers several compressed fragments (even of different files) into one NAND flash page inside of a special log's area (diff updates or journal areas). Generally speaking, this compaction technique provides the opportunity to use only one NAND flash page for several compressed fragments of different files instead of several pages. As a result, decreasing the number of used NAND flash pages decreases the number of I/O operations and it creates the opportunity to reduce the write amplification issue.
Small files compaction. A number of research papers have investigated the state of aged file system volumes and elaborated a vision of the distribution of data amongst various types. As a result, it has been found that many file system volumes contain a significant number of small files. Some researchers estimate the number of small files as 61% of the total number of files on the volume. SSDFS file system introduces a special compaction scheme for the case of small files. Generally speaking, the PEB's log can contain a special journal area that is used for gathering several small files into one NAND flash page. As a result, this compaction technique reduces the number of I/O operations and is able to decrease the write amplification factor.
Inline content. SSDFS file system has an inode format with reservation of 128 bytes for the private area (by default). Moreover, increasing the size of the inode makes the private area bigger. Generally speaking, the private area can be used for keeping inline the content of small files, extent, dentry or xattr records. As a result, it means that keeping data inline in the inode's private area creates the opportunity not to allocate the logical blocks (NAND flash pages) for storing these data or metadata. Finally, the mechanism of keeping data inline is the way to reduce the write amplification issue and to improve the file system's performance.
Logical extent concept. SSDFS file system implements the logical extent concept as an additional mechanism of decreasing the write amplification issue. Generally speaking, the Copy-On-Write policy is the main technique of data updates in the scope of any LFS file system. It means that the necessity to update some data on the volume results in writing the actual state of data into a new position (logical block) on the file system's volume. As a result, the main problem of such approach is the necessity to update metadata (a block mapping table, for example) for every such update with the goal to track the position of the actual state of data. Finally, it results in increasing the number of I/O operations and makes the write amplification issue a more severe problem.
In contrast, SSDFS file system tracks the position of any data on the volume by means of a logical extent. The logical extent structure includes: (1) segment ID, (2) logical block number inside of this segment, (3) number of logical blocks in the extent. Moreover, SSDFS file system implements the PEBs migration technique. Finally, it means that if some logical block is stored into some segment then the logical extent remains the same during any update or modification operations with data inside of this logical extent. Generally speaking, the logical extent will have the same value until the data is moved into another segment. As a result, the nature of the logical extent provides the opportunity not to update the metadata structure that tracks the position of data on the volume by means of logical extents. Moreover, this technique reduces the write amplification issue.
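As a sketch of this property, the hypothetical helper below (not the driver's code) shows that metadata keeping a logical extent has to be rewritten only when the data actually changes its place inside the logical address space, which PEB migration never does.

/* Illustrative logical extent and a hypothetical check used on flush:
 * PEB migration changes only the physical placement, so the logical
 * extent stays the same and no metadata update is required. */
#include <stdbool.h>
#include <stdint.h>

struct logical_extent_sketch {
	uint64_t seg_id;	/* logical segment ID */
	uint32_t logical_blk;	/* first logical block inside the segment */
	uint32_t len;		/* number of logical blocks in the extent */
};

static bool extent_needs_metadata_update(const struct logical_extent_sketch *old_place,
					  const struct logical_extent_sketch *new_place)
{
	return old_place->seg_id != new_place->seg_id ||
	       old_place->logical_blk != new_place->logical_blk ||
	       old_place->len != new_place->len;
}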
Diff-On-Write approach. The Copy-On-Write (COW) policy is the central technique of a Log-structured file system. The goal of this policy is to overcome a peculiarity of NAND flash. Namely, a clean physical page of a NAND chip can be written only once, and a whole physical erase block has to be erased for the operation of re-writing a page. Usually, a physical erase block includes a bunch of pages. But, from another point of view, the COW policy can be treated as a reason of the write amplification issue, because every update of a file's data results in moving the updated block of the file into a new physical page of NAND flash (Fig. 50).
The write amplification issue (Fig. 51) has several reasons. First of all, the necessity to overcome write and read disturbance effects of NAND flash and the necessity to wear NAND flash erase blocks uniformly result in the wear-leveling policy. This policy dictates regular moving of user data from an aged segment into a new one. The COW policy as the basic technique of a Log-structured file system can be treated as another reason of the write amplification issue. And the final reason of the write amplification issue could be an inefficient Garbage Collection policy.
Figure 50: Copy-On-Write policy side effect.
Figure 51: Write amplification issue.
The Main, Diff Updates and Journal areas are the foundation of the Diff-On-Write approach (Fig. 52). This approach distinguishes the main, unchangeable ("cold") part of a file's data. These data are stored in the Main area. For example, a file's contiguous 4 KB binary stream can be treated as "cold" data. Such piece of data can be saved into one physical page of the Main area. And the read-only nature of this physical page can be provided by means of saving all updates of this page into another area (Diff Updates area). For example, File 1 has the string "Hello" as "cold" data on Fig. 52. The Journal area provides shared space for gathering updates of different files. Joining all current updates in one area looks like a journal and provides gathering of all "hot" data in one area. For example, Fig. 52 shows a situation when one block of the Journal area contains updates for File 1 (string "Good weather.") and for File 2 (string "Let's walk."). The Journal area can be imagined as a mixed sequence of updates for different files. As a result, if the Journal area in one or several logs has gathered updates of one file with accumulated size equal to the physical page size then it makes sense to join these updates in one block of the Diff Updates or Main areas. The updates need to be stored in the Diff Updates area for the case of presence of updates from different parts of the file. And, finally, a sequence of contiguous updates needs to be stored into one block of the Main area.
The Copy-On-Write (COW) policy means that every updated block should be copied into a new place. The Diff-On-Write approach suggests to store only the diff between the initial and the updated state of data for every update (Fig. 53).
Figure 52: Diff-On-Write approach.
Figure 53: Copy-On-Write vs. Diff-On-Write.
Fig. 54 shows different examples of a diff. The diff can be the result of: (1) file creation; (2) adding data into an existing file; (3) update of some file's part.
The Diff-On-Write approach suggests to gather small parts or small updates of different files in one block of the Journal area (Fig. 55). It is a well known fact that about 61% of all files on a volume are smaller than 10 KB. Such technique suggests the way of decreasing the write amplification factor and decreasing over-provisioning for the case of small files.
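The area-selection idea can be summarized with a small sketch; the helper and its thresholds below are hypothetical and only illustrate the policy described above (cold full blocks to the Main area, page-sized updates of one file to the Diff Updates area, small mixed updates of many files to the Journal area).

/* Sketch of the Diff-On-Write area selection (hypothetical helper,
 * not the SSDFS driver code). */
#include <stddef.h>

enum log_area { MAIN_AREA, DIFF_UPDATES_AREA, JOURNAL_AREA };

#define PAGE_SIZE 4096u

static enum log_area choose_area(size_t update_bytes, int is_contiguous_cold)
{
	if (update_bytes >= PAGE_SIZE && is_contiguous_cold)
		return MAIN_AREA;		/* whole "cold" block of one file */
	if (update_bytes >= PAGE_SIZE)
		return DIFF_UPDATES_AREA;	/* page-sized updates of one file */
	return JOURNAL_AREA;			/* small diffs of different files */
}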
Moreover, such approach gathers frequent updates in a dedicated "hot" area. As a result, it can improve the efficiency of the GC policy.
The Diff-On-Write approach provides a basis for decreasing the write amplification factor in the case of gradual growth of a file's content (Fig. 56). Let's suppose that a file contains 1 KB of data after creation. Then an additional 1 KB will be added on another day, for example. And, finally, 2 KB of data will be added after several days. The first two 1 KB diffs can be stored in the Journal areas of different logs. Every diff will share the space of a physical page with updates of other files. Finally, the file content will be saved into the Main area of a log with joining of all available updates.
The Diff-On-Write approach provides an especially good basis for decreasing the write amplification factor in the case of mixed workloads. Let's assume that the workload contains both adding data to the end of a file and updating internal areas of the file (Fig. 57). First of all, the diffs can be stored into the Journal area of different logs. Then the diffs of one file can be moved into the Diff Updates area with the goal to join the updates of different areas of the file into one block. And, finally, a sequence of contiguous diffs from the Diff Updates and Journal areas can be joined into one block of the Main area.
Figure 54: Diff concept.
Figure 55: Technique of joining files' diffs in journal area.
Deduplication. The technique of deduplication is the well known and proven mechanism of exclusion of duplicated content of files. The essence of this technique is the detection of data duplication on the basis of fingerprint calculation and comparison of the calculated fingerprint value with the hash table of existing fingerprints. Generally speaking, the deduplication technique is a very efficient mechanism of reducing the write amplification factor by virtue of the opportunity to share the same deduplicated content amongst several files.
SSDFS file system uses the shared extents b-tree as the key mechanism of the deduplication implementation. Generally speaking, the shared extents b-tree has the goal to keep a fingerprint value and an associated extent structure. The fingerprint value is used for comparison and detection of the duplication event, and the extent structure is used for sharing the deduplicated data fragment amongst different files. However, the deduplication technique could be a compute-intensive task because a file system's volume could contain a small number of duplicated fragments or have no duplications at all. Also the calculated fingerprint values need to be kept in some metadata structure that has to be stored on the file system's volume. As a result, the deduplication subsystem is capable of decreasing the file system driver's performance.
Figure 56: Technique of main and journal areas interaction in Diff-On-Write approach.
Figure 57: Technique of journal and diff updates areas interaction in Diff-On-Write approach.
The architecture of SSDFS's deduplication subsystem is designed with these possible drawbacks taken into account. First of all, SSDFS file system driver calculates the fingerprint value of the first 8 KB of the file only, if the file is bigger than some threshold value (for example, 8 KB in total). The next step is searching for the identical fingerprint value in the shared extents b-tree. If no fingerprint value has been found then the calculated fingerprint value should be stored into the shared extents b-tree. Moreover, the rest of the file is simply ignored by means of skipping the calculation of fingerprints. Oppositely, if an identical fingerprint value for the first 8 KB of the file was found in the shared extents b-tree then it needs to calculate the fingerprint values for the rest of the file and to try to find the identical fingerprints in the tree. Again, if no identical fingerprints were found then the calculated fingerprint values need to be stored into the shared extents b-tree. But the associated extent structures need to be used for the file's content deduplication in the case of detection of identical fingerprint values in the shared extents b-tree. Generally speaking, it means that the shared extents b-tree is ready to deduplicate the file's content only in the case of detection of the third case of data duplication. Moreover, it means that the file system volume will have two copies of identical data on the volume and that could increase the reliability of data storing.
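The described flow can be sketched as follows. The helpers fingerprint_of(), tree_lookup(), tree_insert() and share_existing_extent() are hypothetical placeholders for the real fingerprinting and shared extents b-tree operations, and the 8 KB chunk size is only the example threshold mentioned above.

/* Illustrative sketch of the deduplication flow on flush (assumed logic,
 * hypothetical helpers). */
#include <stdbool.h>
#include <stddef.h>

#define DEDUP_CHUNK (8u * 1024u)	/* fingerprint granularity (assumed) */

/* Placeholders for the real fingerprinting and b-tree operations. */
unsigned long fingerprint_of(const void *buf, size_t len);
bool tree_lookup(unsigned long fingerprint);
void tree_insert(unsigned long fingerprint, const void *buf, size_t len);
void share_existing_extent(unsigned long fingerprint);

void dedup_on_flush(const char *data, size_t size)
{
	unsigned long fp;
	size_t off;

	if (size <= DEDUP_CHUNK)
		return;				/* small files are not deduplicated */

	fp = fingerprint_of(data, DEDUP_CHUNK);
	if (!tree_lookup(fp)) {
		/* The first chunk is unique: remember it and skip the rest. */
		tree_insert(fp, data, DEDUP_CHUNK);
		return;
	}

	/* The first chunk matched: fingerprint the rest of the file and
	 * share the extents of every chunk already known to the tree. */
	for (off = 0; off < size; off += DEDUP_CHUNK) {
		size_t len = (size - off < DEDUP_CHUNK) ? size - off : DEDUP_CHUNK;

		fp = fingerprint_of(data + off, len);
		if (tree_lookup(fp))
			share_existing_extent(fp);	/* reuse the stored copy */
		else
			tree_insert(fp, data + off, len);
	}
}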
A Garbage Collector (GC) is an inevitable subsystem of any LFS file system because of the Copy-On-Write (COW) policy. Generally speaking, the simplified way of thinking about a volume of an LFS file system is to imagine the volume as a sequence of logs that fill the volume's space sequentially. Moreover, the data update operations create the volume's state when old logs are a mixture of valid and invalid data (or completely invalid data). It means that the responsibility of the GC subsystem is to move valid data from old logs into new ones and to erase completely invalid erase blocks (segments) with the goal to prepare completely clean erase blocks for allocation for the new logs. Generally speaking, GC activity is a vital but auxiliary action that could compete with the regular file system's I/O operations. Finally, GC activity degrades the file system's performance dramatically and in a completely unpredictable way. It is possible to say that the GC overhead management problem is the crucial and the key problem of any LFS file system and it needs to be taken into account at the initial stage of a file system's architecture design.
Segment bitmap. Any classic GC subsystem of an LFS file system is implemented as a thread that selects in the background the aged segments with the goal to move valid data into a new log(s) and to apply the erase operation to these segments. Generally speaking, the important problem of such approach is to find an aged segment with as small as possible number of valid blocks because this is the possible strategy to manage the GC overhead efficiently. It needs to be pointed out that SSDFS file system doesn't use this classical way of GC overhead management as the basic and fundamental mechanism of GC operations. However, the technique of searching for a segment with the minimal number of valid blocks can be used in the environment of a critical lack of free space on the volume.
SSDFS file system uses the segment bitmap as the basic metadata structure for searching the segments with minimum overhead for GC activity. The responsibility of the segment bitmap is the tracking of segments' state (clean, using, used, pre-dirty, dirty). The key responsibility of the main GC thread is: (1) detecting the idle state of the file system driver, (2) defining the total I/O budget that can be employed by the GC subsystem, (3) selecting segments for processing by the GC subsystem on the segment bitmap basis, (4) distribution of the total I/O budget between the GC threads of particular PEBs. Finally, the GC thread of a particular PEB has to move the cold data gradually in the background on the basis of the determined I/O budget.
The using state means that the segment has free logical blocks. This state of segments doesn't need to be processed by the GC subsystem. The used state means that the whole segment is filled by valid blocks. Finally, it means that such segment contains the cold data and it is the most expensive type of segments for processing by the GC subsystem. However, a flash-friendly file system could delegate the migration of such cold data to the SSD side and not process it by the GC subsystem.
The dirty state means that no valid blocks exist in such segment and the GC subsystem needs to apply only the erase operation to all erase blocks in the dirty segment. Generally speaking, it is the cheapest case of segment processing by the GC subsystem and the dirty segments are the key target for the GC subsystem. The pre-dirty state means that the segment contains both valid and invalid logical blocks. This segment's state has the lower priority for the GC subsystem and this state will be used only in the case of complete absence of dirty segments. Finally, the key technique of processing the pre-dirty segment is to create the gradual migration of cold data by means of adding GC operations to the regular I/O operations with data in the segment.
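The states and the implied GC priority can be summarized with a small sketch; the enum and the priority mapping are illustrative assumptions rather than the driver's actual encoding.

/* Sketch of the segment states tracked by the segment bitmap and of the
 * GC priority implied by the text (illustrative, not the driver code). */
enum seg_state_sketch {
	SEG_CLEAN,	/* no data, ready for a new log */
	SEG_USING,	/* still has free logical blocks - skip for GC */
	SEG_USED,	/* fully valid (cold) data - most expensive for GC */
	SEG_PRE_DIRTY,	/* mixture of valid and invalid blocks */
	SEG_DIRTY,	/* only invalid blocks - just erase, cheapest */
};

/* Lower value = higher priority for the GC subsystem. */
static int gc_priority(enum seg_state_sketch state)
{
	switch (state) {
	case SEG_DIRTY:		return 0;	/* erase only */
	case SEG_PRE_DIRTY:	return 1;	/* used only if no dirty segments */
	case SEG_USED:		return 2;	/* cold data - delegate to SSD's FTL */
	default:		return 3;	/* clean/using - nothing to do */
	}
}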
PEBs migration scheme. The migration scheme is the fundamental technique of GC overhead management in the SSDFS file system. The key responsibility of the migration scheme is to guarantee the presence of data in the same segment for any update operations. Generally speaking, the migration scheme's model is implemented on the basis of association of an exhausted PEB with a clean one. The goal of such association of two PEBs is to implement the gradual migration of data by means of the update operations in the initial (exhausted) PEB. As a result, the old, exhausted PEB becomes invalidated after complete data migration and it will be possible to apply the erase operation to convert it into the clean state. Moreover, the destination PEB in the association replaces the initial PEB for some index in the segment and, finally, it becomes the only PEB for this position. Namely, such technique implements the concept of the logical extent with the goal to decrease the write amplification issue and to manage the GC overhead, because the logical extent concept excludes the necessity to update metadata that is tracking the position of user data on the file system's volume. Generally speaking, the migration scheme is capable of decreasing the GC activity significantly by means of excluding the necessity to update metadata and by means of self-migration of data between PEBs that is triggered by regular update operations.
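A minimal sketch of the migration pair idea is given below; the structure and the helper are hypothetical and only illustrate how a regular update can migrate data as a side effect of the write itself.

/* Illustrative migration pair: an exhausted "source" PEB is associated
 * with a clean "destination" PEB; every regular update writes the new
 * block state into the destination and invalidates the old copy in the
 * source. */
#include <stdbool.h>
#include <stdint.h>

struct peb_migration_pair_sketch {
	uint64_t src_peb_id;	/* exhausted PEB under migration */
	uint64_t dst_peb_id;	/* clean PEB receiving new writes */
	uint32_t src_valid_blks;	/* valid blocks still left in the source */
};

/* Bookkeeping sketch for a regular update: the new block state is written
 * into the destination PEB elsewhere; here we only account for the fact
 * that the old copy in the source PEB became invalid. Returns true when
 * the source PEB holds no valid blocks and can simply be erased. */
static bool account_update(struct peb_migration_pair_sketch *pair)
{
	if (pair->src_valid_blks > 0)
		pair->src_valid_blks--;
	return pair->src_valid_blks == 0;
}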
Hot/warm data self-migration. The important addition to the migration scheme is the technique of hot/warm data self-migration. It means that any update operation in the environment of two PEBs' association results in moving data from the exhausted PEB into the new one. Finally, if a PEB contains only hot data then all data is able to migrate between PEBs by means of regular update operations without the necessity to employ the GC activity. Moreover, it is possible to delay applying the erase operation to the completely invalidated PEB. However, the important peculiarity of such approach is to provide enough time for complete migration of valid data between PEBs. If a file system's volume contains enough clean PEBs then it will be possible to finish the data migration by means of regular update operations only (without using the GC service). However, if a PEB contains a significant amount of cold valid blocks or the volume hasn't enough clean PEBs then the migration process needs to be stimulated by means of GC activity. The key item of the stimulation activity is the PEB's dedicated GC thread. Generally speaking, the responsibility of such GC thread is to orchestrate the gradual migration of cold data on the basis of the I/O budget allocated for a particular GC thread. The goal of such approach is to minimize the GC threads' activity and to guarantee the stable file system driver's performance for regular I/O operations. Finally, this policy has to exclude the degradation of the file system's performance because of GC threads' activity and to prepare enough free space for file system operations.
Overprovisioning is a widely used technique of reservation of some amount of SSD's erase blocks (for example, 20% of the whole volume) with the goal to exchange the bad erase blocks for good ones from the reserved pool. One of the critical reasons of the presence of bad erase blocks could be the high number of erase cycles because of the write amplification issue and significant GC activity. Generally speaking, decreasing the write amplification factor and eliminating the GC activity are able to prolong the SSD lifetime because of the capability to reduce the number of erase cycles used for auxiliary file system activity (for example, GC activity). Moreover, it also means the opportunity to prolong the lifetime of the main pool of SSD's erase blocks. As a result, the overprovisioning pool can be decreased or it could be used for prolongation of the SSD lifetime.
Pre-allocated state. SSDFS file system introduces a special pre-allocated state of logical blocks. Generally speaking, the pre-allocated state defines the presence of some data portion or reservation without the allocation of the whole NAND flash page for the logical block. As a result, the pre-allocated state provides the opportunity to reserve some space on the file system's volume without the real allocation (delayed allocation). The goal of the pre-allocated state is not only to reserve some space (for metadata, for example) but it can also be used for small files or compressed data portions. If some file or compressed data portion is smaller than 4 KB in size then such data portion can be marked as pre-allocated and several data portions can be compacted or gathered into one NAND flash page. Generally speaking, such compaction scheme is able to reduce the number of used NAND flash pages and, as a result, could decrease the overprovisioning and prolong the SSD lifetime.
Compression + delta-encoding. SSDFS file system widely uses compression for more compact representation of user data and metadata. Moreover, compression is complemented by the compaction scheme with the goal to merge several compressed data fragments into one NAND flash page. Also SSDFS file system tries to use a delta-encoding technique. This delta-encoding technique implies not saving the whole modified logical block (for example, 4 KB in size) but extracting and saving only the modified area (for example, 128 bytes) in this logical block. Finally, it means that only 128 bytes instead of 4 KB will be flushed on the volume. SSDFS file system uses the delta-encoding technique with the compaction scheme for gathering several data portions into one NAND flash page. Finally, the goal of these techniques is to achieve the more compact representation of user data and metadata and to reduce the amount of write operations on the file system's volume. Generally speaking, it implies decreasing the number of erase cycles and the prolongation of the SSD lifetime.
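The delta extraction itself can be illustrated with a short sketch; make_delta() is a hypothetical helper that finds the smallest modified byte range of a 4 KB logical block, which could then be stored instead of the whole block.

/* Sketch of the delta-encoding idea: store only the modified byte range
 * of a 4 KB logical block instead of the whole block (illustrative
 * helper, not the SSDFS implementation). */
#include <stdint.h>

#define BLK_SIZE 4096u

struct block_delta_sketch {
	uint32_t offset;	/* first modified byte inside the logical block */
	uint32_t len;		/* length of the modified area */
	const uint8_t *data;	/* points into the new block state */
};

/* Find the smallest byte range that covers all differences between the
 * old and the new state of one logical block. */
static struct block_delta_sketch make_delta(const uint8_t *old_blk, const uint8_t *new_blk)
{
	uint32_t first = 0, last = BLK_SIZE;

	while (first < BLK_SIZE && old_blk[first] == new_blk[first])
		first++;
	while (last > first && old_blk[last - 1] == new_blk[last - 1])
		last--;

	struct block_delta_sketch d = { first, last - first, new_blk + first };
	return d;	/* d.len == 0 means the block did not change at all */
}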
The whole SSDFS file system's design and architecture tries to achieve the prolongation of SSD lifetime through decreasing the write amplification factor. Generally speaking, the suggested and implemented approaches are capable of improving the flush/write operations' performance. However, a potential side effect of such efforts could be some reduction of read operations' performance. But the asymmetric nature of read/write latency of NAND flash (read operations are faster and there are no seek operation penalties) gives a steady basis to expect a good performance of read operations for the case of SSDFS file system's architecture. Moreover, the aggregation of several PEBs into one segment, the PEB's dedicated threads model, and the GC I/O budget model provide rich opportunities for achieving a good file system's performance.
Any SSD represents a multi-die and multi-channel architecture. If some protocol of interaction with the SSD (for example, the open-channel SSD model) shares the knowledge of distribution of erase blocks among NAND dies then the file system is able to employ this knowledge for enhancing the I/O operations' performance. SSDFS file system uses the technique of aggregation of several PEBs inside of one segment. Generally speaking, if one segment aggregates several PEBs from different NAND dies then such approach provides the way to process the I/O requests in parallel by different NAND dies. As a result, it is capable of improving the performance of I/O requests in the scope of one segment significantly.
SSDFS file system has the following goals: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system's performance. To implement these goals SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment's PEBs pool, Main/Diff/Journal areas in the PEB's log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree.
It has been shown that 80% or more of the files are smaller than 32 KB. To manage this peculiarity, SSDFS file system uses the inode's private area to store small files inline. Moreover, a special compaction scheme was introduced that gathers several small files into one NAND flash page. Also, SSDFS file system uses block-level compression with the addition of the compaction scheme that keeps several compressed portions in one NAND flash page. Additionally, delta-encoding, the compaction scheme, and deduplication are employed for the case of big files.
The vast majority of files are deleted within a few minutes of their creation. One efficient technique of managing such a case is using the inode's private area for keeping inline files. The default raw SSDFS inode is able to store about 128 bytes of a file's content. But a bigger size of the raw inode is able to provide more space for inline files. Moreover, SSDFS file system gathers the content of files into a specialized user data segment. As a result, the deletion of files creates the self-invalidation effect for the case of the user data segment and that can decrease the GC activity or completely eliminate the GC overhead through the PEB migration scheme.
The median file age ranges between 80 and 160 days. 0.8% of the files are used essentially every day. A flash-friendly file system doesn't need to follow a strict wear-leveling scheme. It makes sense to delegate moving the cold data to the SSD's FTL once in 3 months (90 days). Extensive use of GC operations and a strict wear-leveling scheme for moving cold data by an LFS increases the write amplification issue. The PEB's migration scheme is able to provide free space in a cost-efficient manner. Compression and delta-encoding combined with the PEB's migration scheme are able to provide an easy and efficient mechanism for combining both hot and cold data in one PEB and to guarantee the space for gathering hot data updates with fast/easy migration between PEBs.
Several research works showed the growth of the files count per file system and the directories count per file system. One needs to expect at minimum 30K - 90K files per file system and 1K - 4K directories per file system.
To manage the growing demands for the number of files and folders on the file system volume, SSDFS file system employs an inodes b-tree that provides the way for easy increasing of the number of files and efficient management for the case of frequent delete/remove operations. Also, inline files provide the way to store small files in the inode itself without allocation of the volume's space.
File name length falls in the range from 9 to 17 characters. The peak occurs for file names with length of 12 characters.
SSDFS’s dentry is able to include 12 inline char-acters. The tail of longer name is stored into shared dictio-nary. The fixed size of dentry provides the efficient mech-42nism of dentries management. Mostly, file names will bestored into dentries only.
SSDFS raw inode is able to keep two inline dentries. It means that very frequently the raw inode is able to store the content of the dentries tree. Moreover, an SSDFS b-tree node is stored compressed. As a result, it implies that a small dentries b-tree could be represented in a very efficient way on the file system's volume.
There are many files deep in the namespace tree, especially at depth 7.
An SSDFS b-tree node stores several raw inodes. It means that the operation of reading one b-tree node is able to provide access to the whole or a part of the namespace tree. Also, SSDFS gathers b-tree nodes of the same type in one segment/PEB. As a result, the readahead operation is able to read several b-tree nodes. It implies that several contiguous b-tree nodes could contain the whole namespace tree. Finally, SSDFS raw inode is able to keep the content of the dentries tree inline.
Many end-users have a file system volume that is on average only half full.
The PEBs migration technique tries to employ this fact. It means that PEBs association during migration can be done without affecting the availability of free space on the volume. Moreover, SSDFS uses compression and delta-encoding. Finally, it provides a good basis for the PEBs migration technique.
On average, half of the files in a file system have been created by copying without subsequent writes.
It is possible to conclude that user data is mostly cold but metadata is mostly hot. SSDFS uses the model of current segments of different types. It means that user data is aggregated in one segment but metadata is aggregated into another one. As a result, the user data segment will be managed under the cold data policy but the metadata segment will be managed under the hot data policy. SSDFS uses three area types in the log (main area, diff updates area, journal area). It provides an efficient way to manage data for the case of mixed nature of data (cold and hot) in one log. SSDFS is a flash-friendly file system and it doesn't move a segment with cold data as part of GC activity. SSDFS delegates error correction and read block reclaiming to the FTL side. Also, deduplication is able to exclude the replication of existing files on the volume.
Modern applications manage large databases of information organized into complex directory trees (A File Is Not a File).
First of all, SSDFS is able to manage the case of mixed nature of data efficiently by means of three area types in the log (main area, diff updates area, journal area). Also, using the delta-encoding technique provides the way to store only the updated portion(s) of data. Moreover, a segment is able to contain several PEBs from different NAND dies. It means that different extents of a file can be stored into different PEBs and the parallelism of operations can be implemented. SSDFS associates read/write threads with PEBs and that implements parallelism both on the file system and on the SSD level.
Applications help users create, modify, and organize content, but user files represent a small fraction of the files touched by modern applications. Most files are helper files that applications use to provide a rich graphical experience, support multiple languages, and record history and other metadata.
Auxiliary files of the same application can be aggregated into one segment/PEB. It means that the readahead operation will be able to extract the content of all auxiliary files from one segment/PEB. SSDFS's segment is able to contain several PEBs from different NAND dies. As a result, this approach is capable of implementing parallelism of read operations both on the file system and on the SSD level.
Most written data is explicitly forced to disk by the application; for example, iPhoto calls fsync thousands of times in even the simplest of tasks.
SSDFS supports partial logs. It means that the file system driver tries to prepare the full log before flushing on the volume. However, the file system driver is able to prepare the partial logs in the case of fsync requests or synchronous mount. The partial logs could increase the amount of metadata on the volume. SSDFS supports several types of current segments. It means that metadata and user data will be processed simultaneously in different threads. A PEB has an associated flush thread. As a result, the update operations in different PEBs will be processed by different threads in a multi-threaded environment.
It has been shown that applications create many temporary files.
From one point of view, it is possible to consider adding an additional type of current segment for storing the temporary files. Finally, it means that such type of segment will be invalidated completely. And it will make the GC activity for such type of segments very cheap. But it would be a much more efficient way to keep the temporary files in the page cache without flushing on the volume.
Home-user applications commonly use atomic operations, in particular rename, to present a consistent view of files to users.
As a result, it is possible to expect more frequent metadata update operations (hot data). Finally, the PEBs migration technique could migrate updated metadata between PEBs without the necessity to use the GC activity.
Write amplification issue. SSDFS file system uses the following techniques for resolving the write amplification issue: (1) compression, (2) small files compaction scheme, (3) logical extent concept, (4) Diff-On-Write approach, (5) deduplication, (6) inline files.
The logical extent concept is the technique of resolving the write amplification issue for the case of an LFS file system. It means that any metadata structure keeping a logical extent doesn't need to update the logical extent value in the case of data migration between the PEBs because the logical extent remains the same as long as the data lives in the same segment. The migration mechanism implements the logical segment and logical extent concepts with the goal to decrease or completely eliminate the write amplification issue. Moreover, SSDFS file system widely uses the data compression, delta-encoding technique, and small files compaction technique that provide the opportunity to employ the PEB migration mechanism without the necessity to use additional overprovisioning.
Moreover, the b-tree metadata structure provides the way not to keep an unnecessary reserve of metadata space on the volume. As a result, it means the exclusion of management operations of the reserved metadata space (moving from one PEB to another one) with the goal to support it in the valid state. Generally speaking, it is the way to decrease the amount of PEBs' erase and write operations.
SSDFS file system uses a special compaction scheme which gathers several compressed fragments (even of different files) into one NAND flash page inside of a special log's area (diff updates or journal areas). Generally speaking, this compaction technique provides the opportunity to use only one NAND flash page for several compressed fragments of different files instead of several pages. As a result, decreasing the number of used NAND flash pages decreases the number of I/O operations and it creates the opportunity to reduce the write amplification issue.
SSDFS file system introduces a special compaction scheme for the case of small files. Generally speaking, the PEB's log can contain a special journal area that is used for gathering several small files into one NAND flash page. As a result, this compaction technique reduces the number of I/O operations and is able to decrease the write amplification factor. The mechanism of keeping data inline in the inode's private area is the way to reduce the write amplification issue and to improve the file system's performance.
GC overhead management. There are several types of segments on any SSDFS file system's volume: (1) superblock segment, (2) snapshot segment, (3) PEB mapping table segment, (4) segment bitmap, (5) b-tree segment, (6) user data segment. Generally speaking, the goal of distinguishing the different types of segments is to localize the peculiarities of different types of data (user data and metadata, for example) inside of specialized segments.
The migration scheme is the fundamental technique of GC overhead management in the SSDFS file system. The key responsibility of the migration scheme is to guarantee the presence of data in the same segment for any update operations. Generally speaking, the migration scheme is capable of decreasing the GC activity significantly by means of excluding the necessity to update metadata and by means of self-migration of data between PEBs that is triggered by regular update operations. The important addition to the migration scheme is the technique of hot/warm data self-migration. It means that any update operation in the environment of two PEBs' association results in moving data from the exhausted PEB into the new one. Finally, if a PEB contains only hot data then all data is able to migrate between PEBs by means of regular update operations without the necessity to employ the GC activity.
However, if a PEB contains a significant amount of cold valid blocks or the volume hasn't enough clean PEBs then the migration process needs to be stimulated by means of GC activity. The key item of the stimulation activity is the PEB's dedicated GC thread. Generally speaking, the responsibility of such GC thread is to orchestrate the gradual migration of cold data on the basis of the I/O budget allocated for a particular GC thread. The goal of such approach is to minimize the GC threads' activity and to guarantee the stable file system driver's performance for regular I/O operations. Finally, this policy has to exclude the degradation of the file system's performance because of GC threads' activity and to prepare enough free space for file system operations.
The compaction of several fragments of different logical blocks into one NAND flash page creates the capability to move more data for one GC operation. From another viewpoint, warm/hot areas introduce the areas with a high frequency of update operations. Generally speaking, it is possible to expect that the high frequency of update operations (in diff updates and journal areas) creates the natural migration of data between PEBs without the necessity to use extensive GC operations.
SSD lifetime. Decreasing the write amplification factor and eliminating the GC activity are able to prolong the SSD lifetime because of the capability to reduce the number of erase cycles used for auxiliary file system activity (for example, GC activity). Moreover, it also means the opportunity to prolong the lifetime of the main pool of SSD's erase blocks. As a result, the overprovisioning pool can be decreased or it could be used for prolongation of the SSD lifetime. SSDFS file system uses the delta-encoding technique with the compaction scheme for gathering several data portions into one NAND flash page. Finally, the goal of these techniques is to achieve the more compact representation of user data and metadata and to reduce the amount of write operations on the file system's volume. Generally speaking, it implies decreasing the number of erase cycles and the prolongation of the SSD lifetime.
File system performance. The whole SSDFS file system's design and architecture tries to achieve the prolongation of SSD lifetime through decreasing the write amplification factor. Generally speaking, the suggested and implemented approaches are capable of improving the flush/write operations' performance. However, a potential side effect of such efforts could be some reduction of read operations' performance. But the asymmetric nature of read/write latency of NAND flash (read operations are faster and there are no seek operation penalties) gives a steady basis to expect a good performance of read operations for the case of SSDFS file system's architecture. Moreover, the aggregation of several PEBs into one segment, the PEB's dedicated threads model, and the GC I/O budget model provide rich opportunities for achieving a good file system's performance. Any SSD represents a multi-die and multi-channel architecture. If some protocol of interaction with the SSD (for example, the open-channel SSD model) shares the knowledge of distribution of erase blocks among NAND dies then the file system is able to employ this knowledge for enhancing the I/O operations' performance. SSDFS file system uses the technique of aggregation of several PEBs inside of one segment. Generally speaking, if one segment aggregates several PEBs from different NAND dies then such approach provides the way to process the I/O requests in parallel by different NAND dies. As a result, it is capable of improving the performance of I/O requests in the scope of one segment significantly.
Solid state drives have a number of interesting characteristics. However, there are numerous file system and storage design issues for SSDs that impact the performance and device endurance. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that results in shorter SSD lifetime and the necessity to use the NAND flash overprovisioning. SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment's PEBs pool, Main/Diff/Journal areas in the PEB's log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree. The combination of all suggested concepts is able to: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system's performance.
Currently, SSDFS file system driver is not fully functional and is not completely implemented. It still needs bug fixing. The Diff-On-Write approach is implemented only partially. Deduplication and snapshot support are not implemented yet. Additionally, SSDFS file system doesn't have an fsck tool yet.
SSDFS project information is available online ( ). Source code of the user-space tools is available at https://github.com/dubeyko/ssdfs-tools.git. Source code of the file system driver is available at https://github.com/dubeyko/ssdfs-driver.git. Source code of the Linux kernel with the integrated SSDFS file system driver is available at https://github.com/dubeyko/linux.git. The author gratefully acknowledges the initial support of the idea by Zvonimir Bandic and Cyril Guyot.
References [1] SSDFS Project, [Online]. Available: , Accessed on: Jun. 19, 2019.[2] V. A. Dubeyko, C. Guyot, ”Systems and methods forimproving flash-oriented file system garbage collec-tion,” U.S. Patent Application US20170017405, pub-lished January 19, 2017.[3] V. A. Dubeyko, C. Guyot, ”Systems and methods forimproving flash-oriented file system garbage collec-tion,” U.S. Patent Application US20170017406, pub-lished January 19, 2017.[4] V. A. Dubeyko, C. Guyot, ”Method of decreasingwrite amplification factor and over-provisioning ofNAND flash by means of Diff-On-Write approach,”U.S. Patent Application US20170139616, publishedMay 18, 2017.[5] V. A. Dubeyko, C. Guyot, ”Method of decreasing writeamplification of NAND flash using a journal approach,”U.S. Patent 10,013,346, issued March 7, 2018.[6] V. A. Dubeyko, C. Guyot, ”Method of improvinggarbage collection efficiency of flash-oriented file sys-tems using a journaling approach,” U.S. Patent Appli-cation US20170139825, published May 18, 2017.[7] V. A. Dubeyko, ”Bitmap Processing for Log-StructuredData Store,” U.S. Patent Application US20190018601,published January 17, 2019.[8] V. A. Dubeyko, S. Song, ”Non-volatile storage systemthat reclaims bad blocks,” U.S. Patent 10,223,216, is-sued March 5, 2019.[9] V. A. Dubeyko, S. Song, ”Non-volatile storage sys-tem that reclaims bad blocks,” U.S. Patent ApplicationUS20190155703, published May 23, 2019.[10] Agrawal, et al., ”A Five-Year Study of File-SystemMetadata,” ACM Transactions on Storage (TOS), vol.3 Issue 3, Oct. 2007, Article No. 9.[11] Avishay Traeger, Erez Zadok, Nikolai Joukov, andCharles P. Wright, ”A nine year study of file systemand storage benchmarking,” Trans. Storage 4, 2, Arti-cle 5 (May 2008), 56 pages.4512] Douceur, et al., ”A Large-Scale Study of File-SystemContents,” SIGMETRICS ’99 Proceedings of the 1999ACM SIGMETRICS international conference on Mea-surement and modeling of computer systems, pp. 59-70, May 1-4, 1999.[13] Lucas Tan, Fuyao Zhao, Xu Zhang, ”15712 AdvancedOperating and Distributed System Android and iOSPlatform Study Final Report,” [Online]. Available: https://pdfs.semanticscholar.org/48f8/1b9339ec3fcee1cc8031575e6f7b84c57c84.pdf ,Accessed on: Jun. 21, 2019.[14] Tyler Harter, Chris Dragga, Michael Vaughn, AndreaC. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau,”A file is not a file: understanding the I/O behaviorof Apple desktop applications,” In Proceedings of theTwenty-Third ACM Symposium on Operating SystemsPrinciples (SOSP ’11). ACM, New York, NY, USA, 71-83.[15] A. B. Downey, ”The structural cause of file size dis-tributions,” MASCOTS 2001, Proceedings Ninth Inter-national Symposium on Modeling, Analysis and Sim-ulation of Computer and Telecommunication Systems,Cincinnati, OH, USA, 2001, pp. 361-370.[16] M. I. Ullah, F. Ahsan, I. Ahmad and A. F. M. Ishaq,”Analysis of file system space utilization patterns inUNIX based volumes,” Proceedings of the IEEE Sym-posium on Emerging Technologies, 2005, Islamabad,2005, pp. 542-546.[17] Tim Gibson, Ethan L. Miller, Darrell D. E. Long,”Long-term File Activity and Inter-Reference Pat-terns,” [Online]. Available: , Ac-cessed on: Jun. 25, 2019.[18] Yifan Wang, ”A Statistical Study for File Sys-tem Meta Data On High Performance ComputingSites,” [Online]. Available: ,Accessed on: Jun. 25, 2019.[19] A. Wildani, I. F. Adams and E. L. Miller, ”Single-Snapshot File System Analysis,” 2013 IEEE 21st Inter-national Symposium on Modelling, Analysis and Sim-ulation of Computer and Telecommunication Systems,San Francisco, CA, 2013, pp. 
338-341.
[20] S. Hui, Z. Rui, C. Jin, L. Lei, W. Fei and X. C. Sheng, "Analysis of the File System and Block IO Scheduler for SSD in Performance and Energy Consumption," 2011 IEEE Asia-Pacific Services Computing Conference, Jeju Island, 2011, pp. 48-55.
[21] D. Parthey and R. Baumgartl, "Analyzing Access Timing of Removable Flash Media," 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2007), Daegu, 2007, pp. 510-515.
[22] Y. Son, H. Kang, H. Han and H. Y. Yeom, "An Empirical Evaluation of NVM Express SSD," 2015 International Conference on Cloud and Autonomic Computing, Boston, MA, 2015, pp. 275-282.
[23] K. Zhou, P. Huang, C. Li and H. Wang, "An Empirical Study on the Interplay between Filesystems and SSD," 2012 IEEE Seventh International Conference on Networking, Architecture, and Storage, Xiamen, Fujian, 2012, pp. 124-133.
[24] P. Olivier, J. Boukhobza and E. Senn, "Micro-benchmarking Flash Memory File-System Wear Leveling and Garbage Collection: A Focus on Initial State Impact," 2012 IEEE 15th International Conference on Computational Science and Engineering, Nicosia, 2012, pp. 437-444.
[25] P. Olivier, J. Boukhobza and E. Senn, "Modeling driver level NAND flash memory I/O performance and power consumption for embedded Linux," 2013 11th International Symposium on Programming and Systems (ISPS), Algiers, 2013, pp. 143-152.
[26] Y. Wei and D. Shin, "NAND flash storage device performance in Linux file system," 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT), Seogwipo, 2011, pp. 574-577.
[27] G. Kim and D. Shin, "Performance analysis of SSD write using TRIM in NTFS and EXT4," 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT), Seogwipo, 2011, pp. 422-423.
[28] S. Park and K. Shen, "A performance evaluation of scientific I/O workloads on Flash-based SSDs," 2009 IEEE International Conference on Cluster Computing and Workshops, New Orleans, LA, 2009, pp. 1-5.
[29] B. Gu, J. Lee, B. M. Jung, J. Seo and H. Shin, "Utilization analysis of trim-enabled NAND flash memory," 2013 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, 2013, pp. 645-646.
[30] Y. Wang, K. Goda, M. Nakano and M. Kitsuregawa, "Early Experience and Evaluation of File Systems on SSD with Database Applications," 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage, Macau, 2010, pp. 467-476.
[31] S. S. Rizvi and T. Chung, "Flash memory SSD based DBMS for high performance computing embedded and multimedia systems," The 2010 International Conference on Computer Engineering & Systems, Cairo, 2010, pp. 183-188.
[32] L. Lin and X. Lizhen, "The Research of Key Technology in Flash-Based DBMS," 2009 Sixth Web Information Systems and Applications Conference, Xuzhou, Jiangsu, 2009, pp. 15-18.
[33] J. Chen, J. Wang, Z. Tan and C. Xie, "Effects of Recursive Update in Copy-on-Write File Systems: A BTRFS Case Study," in Canadian Journal of Electrical and Computer Engineering, vol. 37, no. 2, pp. 113-122, Spring 2014.
[34] Mendel Rosenblum and John K. Ousterhout, "The design and implementation of a log-structured file system," ACM Trans. Comput. Syst. 10, 1 (February 1992), 26-52.
[35] David Woodhouse, "JFFS: the journalling flash file system," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.630.3461, Accessed on: Jun. 20, 2019.
[36] Artem B. Bityutskiy, "JFFS3 design issues," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.9834, Accessed on: Jun. 20, 2019.
[37] Adrian Hunter, "A Brief Introduction to the Design of UBIFS," [Online]. Accessed on: Jun. 20, 2019.
[38] Adrian Hunter, Artem B. Bityutskiy, "UBIFS file system," [Online]. Accessed on: Jun. 20, 2019.
[39] Charles Manning, "How YAFFS Works," [Online]. Available: https://yaffs.net/sites/yaffs.net/files/HowYaffsWorks.pdf, Accessed on: Jun. 20, 2019.
[40] Technical note, the Nilfs version 1: overview. [Online]. Available: https://nilfs.sourceforge.io/papers/overview-v1.pdf, Accessed on: Jun. 20, 2019.
[41] Ryusuke Konishi, "Development of a New Log-structured File System for Linux," Technical Note, Oct. 2005. [Online]. Available: https://nilfs.sourceforge.io/papers/nilfs-051019.pdf, Accessed on: Jun. 20, 2019.
[42] Jörn Engel, Robert Mertens, "LogFS - finally a scalable flash file system," [Online]. Accessed on: Jun. 21, 2019.
[43] Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho, "F2FS: a new file system for flash storage," in Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST'15), USENIX Association, Berkeley, CA, USA, 273-286.
[44] TaeHoon Kim, KwangMu Shin, TaeHoon Lee, KiDong Jung, "Design of a Reliable NAND Flash Software for Mobile Device," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.554.8864&rep=rep1&type=pdf, Accessed on: Jun. 24, 2019.
[45] Jeong-Ki Kim, Hyung-Seok Lee and Heung-Nam Kim, "Dual Journaling Store Method for Embedded Systems," 2006 8th International Conference on Advanced Communication Technology, Phoenix Park, 2006, pp. 1241-1244.
[46] S. O. Park and S. J. Kim, "An Efficient Array File System for Multiple Small-Capacity NAND Flash Memories," 2011 14th International Conference on Network-Based Information Systems, Tirana, 2011, pp. 569-572.
[47] J. Kim, H. Jo, H. Shim, J. Kim and S. Maeng, "Efficient Metadata Management for Flash File Systems," 2008 11th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing (ISORC), Orlando, FL, 2008, pp. 535-540.
[48] S. O. Park and S. J. Kim, "An efficient multimedia file system for NAND flash memory storage," in IEEE Transactions on Consumer Electronics, vol. 55, no. 1, pp. 139-145, February 2009.
[49] Seung-Ho Lim and Kyu-Ho Park, "An efficient NAND flash file system for flash memory storage," in IEEE Transactions on Computers, vol. 55, no. 7, pp. 906-912, July 2006.
[50] H. Kim, Y. Won and S. Kang, "Embedded NAND flash file system for mobile multimedia devices," in IEEE Transactions on Consumer Electronics, vol. 55, no. 2, pp. 545-552, May 2009.
[51] C. T. Chen, C. H. Chen and W. T. Huang, "Energy-aware management of NAND type flash file system," in Electronics Letters, vol. 42, no. 14, pp. 795-796, 6 July 2006.
[52] A. S. Ramasamy and P. Karantharaj, "File system and storage array design challenges for flash memory," 2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE), Coimbatore, 2014, pp. 1-8.
[53] B. Nahill and Z. Zilic, "FLogFS: A lightweight flash log file system," 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN), Cambridge, MA, 2015, pp. 1-6.
[54] Yang Ou, Xiaoquan Wu, Nong Xiao, Fang Liu and Wei Chen, "HIFFS: A Hybrid Index for Flash File System," 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), Boston, MA, 2015, pp. 363-364.
[55] P. Huang, G. Wan, K. Zhou, M. Huang, C. Li and H. Wang, "Improve Effective Capacity and Lifetime of Solid State Drives," 2013 IEEE Eighth International Conference on Networking, Architecture and Storage, Xi'an, 2013, pp. 50-59.
[56] S. Yang and C. Wu, "A Low-Memory Management for Log-Based File Systems on Flash Memory," 2009 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, Beijing, 2009, pp. 219-227.
[57] W. Qiu, X. Chen, N. Xiao, F. Liu and Z. Chen, "A New Exploration to Build Flash-Based Storage Systems by Co-designing File System and FTL," 2013 IEEE 16th International Conference on Computational Science and Engineering, Sydney, NSW, 2013, pp. 925-932.
[58] T. Chen, X. Wang, W. Hu and W. Duan, "A New Type of NAND Flash-Based File System: Design and Implementation," 2006 International Conference on Wireless Communications, Networking and Mobile Computing, Wuhan, 2006, pp. 1-4.
[59] S. Lee, J. Kim and A. Mithal, "Refactored Design of I/O Architecture for Flash Storage," in IEEE Computer Architecture Letters, vol. 14, no. 1, pp. 70-74, 1 Jan.-June 2015.
[60] Junkil Ryu and C. Park, "A technique to enhance performance of log-based file systems for flash memory in embedded systems," 2007 2nd International Conference on Digital Information Management, Lyon, 2007, pp. 580-582.
[61] Byungjo Kim, Dong Hyun Kang, Changwoo Min and Young Ik Eom, "Understanding implications of trim, discard, and background command for eMMC storage device," 2014 IEEE 3rd Global Conference on Consumer Electronics (GCCE), Tokyo, 2014, pp. 709-710.
[62] C. Min, S. Lee and Y. I. Eom, "Design and Implementation of a Log-Structured File System for Flash-Based Solid State Drives," in IEEE Transactions on Computers, vol. 63, no. 9, pp. 2215-2227, Sept. 2014.
[63] Jun Wang and Yiming Hu, "A novel reordering write buffer to improve write performance of log-structured file systems," in IEEE Transactions on Computers, vol. 52, no. 12, pp. 1559-1572, Dec. 2003.
[64] Jun Wang and Yiming Hu, "PROFS - performance-oriented data reorganization for log-structured file system on multi-zone disks," MASCOTS 2001, Proceedings Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Cincinnati, OH, USA, 2001, pp. 285-292.
[65] R. Agarwal and M. Marrow, "A closed-form expression for write amplification in NAND Flash," 2010 IEEE Globecom Workshops, Miami, FL, 2010, pp. 1846-1850.
[66] A. Jagmohan, M. Franceschini and L. Lastras, "Write amplification reduction in NAND Flash through multi-write coding," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, 2010, pp. 1-6.
[67] Y. Chang and T. Kuo, "A commitment-based management strategy for the performance and reliability enhancement of flash-memory storage systems," 2009 46th ACM/IEEE Design Automation Conference, San Francisco, CA, 2009, pp. 858-863.
[68] Tei-Wei Kuo, Jen-Wei Hsieh, Li-Pin Chang and Yuan-Hao Chang, "Configurability of performance and overheads in flash management," Asia and South Pacific Conference on Design Automation, Yokohama, 2006, p. 8.
[69] J. Hsieh, C. Wu and G. Chiu, "Design and Implementation for Multi-level Cell Flash Memory Storage Systems," 2010 IEEE 16th International Conference on Embedded and Real-Time Computing Systems and Applications, Macau SAR, 2010, pp. 247-252.
[70] C. Park, W. Cheon, Y. Lee, M. Jung, W. Cho and H. Yoon, "A Re-configurable FTL (Flash Translation Layer) Architecture for NAND Flash based Applications," 18th IEEE/IFIP International Workshop on Rapid System Prototyping (RSP '07), Porto Alegre, 2007, pp. 202-208.
[71] J. Lee, H. Kim, H. Kim, J. Park and M. Ryu, "A sequentializing device driver for optimizing random write performance of eSSD," 2014 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, 2014, pp. 432-433.
[72] Y. He, S. Wan, N. Xiong and J. H. Park, "A New Prefetching Strategy Based on Access Density in Linux," International Symposium on Computer Science and its Applications, Hobart, ACT, 2008, pp. 22-27.
[73] Dingqing Hu, Changsheng Xie and C. CaiBin, "A Study of Parallel Prefetching Algorithms Using Trace-Driven Simulation," Sixth International Conference on Parallel and Distributed Computing Applications and Technologies (PDCAT'05), Dalian, China, 2005, pp. 476-478.
[74] Y. Kang, J. Yang and E. L. Miller, "Efficient Storage Management for Object-based Flash Memory," 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Miami Beach, FL, 2010, pp. 407-409.
[75] Q. Xie et al., "Research on the Framework of NAND FLASH Based Object-Based-Storage-Device," 2012 Second International Conference on Intelligent System Design and Engineering Application, Sanya, Hainan, 2012, pp. 1298-1301.
[76] Goetz Graefe, "Modern B-Tree Techniques," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.7269&rep=rep1&type=pdf, Accessed on: Jun. 21, 2019.
[77] J. Ahn, D. Kang, D. Jung, J. Kim and S. Maeng, "μ*-Tree: An Ordered Index Structure for NAND Flash Memory with Adaptive Page Layout Scheme," in IEEE Transactions on Computers, vol. 62, no. 4, pp. 784-797, April 2013.
[78] C. Lee and S. Lim, "Caching and Deferred Write of Metadata for Yaffs2 Flash File System," 2011 IFIP 9th International Conference on Embedded and Ubiquitous Computing, Melbourne, VIC, 2011, pp. 41-46.
[79] J. He et al., "Discovering Structure in Unstructured I/O," 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Salt Lake City, UT, 2012, pp. 1-6.
[80] Tsozen Yeh, J. Arul, Jia-Shian Wu, I. Chen and Kuo-Hsin Tan, "Using File Grouping to Improve the Disk Performance (Extended Abstract)," 2006 15th IEEE International Conference on High Performance Distributed Computing, Paris, 2006, pp. 365-366.
[81] Li-Pin Chang and Tei-Wei Kuo, "An adaptive striping architecture for flash memory storage systems of embedded systems," Proceedings Eighth IEEE Real-Time and Embedded Technology and Applications Symposium, San Jose, CA, USA, 2002, pp. 187-196.
[82] Y. Xin, R. Chun-ming and H. Ben-xiong, "A Flexible Garbage Collect Algorithm for Flash Storage Management," 2008 Second International Conference on Future Generation Communication and Networking, Hainan Island, 2008, pp. 354-357.
[83] Che-Wei Tsao, Yuan-Hao Chang and Ming-Chang Yang, "Performance enhancement of garbage collection for flash storage devices: An efficient victim block selection design," 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, 2013, pp. 1-6.
[84] H. Yan and Q. Yao, "An efficient file-aware garbage collection algorithm for NAND flash-based consumer electronics," in IEEE Transactions on Consumer Electronics, vol. 60, no. 4, pp. 623-627, Nov. 2014.
[85] L. Zeng, Y. Zhang and X. Zhao, "An Improved Approach on B Tree Management for NAND Flash-Memory Storage Systems," 2009 WASE International Conference on Information Engineering, Taiyuan, Shanxi, 2009, pp. 443-447.
[86] S. Jung, Y. Lee and Y. H. Song, "A process-aware hot/cold identification scheme for flash memory storage systems," in IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 339-347, May 2010.
[87] Sheng-Jie Syu and Jing Chen, "An active space recycling mechanism for flash storage systems in real-time application environment," 11th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'05), Hong Kong, China, 2005, pp. 53-59.
[88] H. Lim and J. Park, "Dynamic Configuration of SSD File Management," 2014 International Conference on Information Science & Applications (ICISA), Seoul, 2014, pp. 1-3.
[89] T. Huang and D. Chang, "Extending Lifetime and Reducing Garbage Collection Overhead of Solid State Disks with Virtual Machine Aware Journaling," 2011 IEEE 17th International Conference on Parallel and Distributed Systems, Tainan, 2011, pp. 1-8.
[90] C. Wu, P. Wu, K. Chen, W. Chang and K. Lai, "A Hotness Filter of Files for Reliable Non-Volatile Memory Systems," in IEEE Transactions on Dependable and Secure Computing, vol. 12, no. 4, pp. 375-386, 1 July-Aug. 2015.
[91] H. Gwak, Y. Kang and D. Shin, "Reducing garbage collection overhead of log-structured file systems with GC journaling," 2015 International Symposium on Consumer Electronics (ISCE), Madrid, 2015, pp. 1-2.
[92] D. Choi and D. Shin, "Semantic-Aware Hot Data Selection Policy for Flash File System in Android-Based Smartphones," 2013 International Conference on Parallel and Distributed Systems, Seoul, 2013, pp. 444-445.
[93] C. Wu, W. Chang and Z. Hong, "A Reliable Non-volatile Memory System: Exploiting File-System Characteristics," 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, Shanghai, 2009, pp. 202-207.
[94] D. Shapira, "Compressed Transitive Delta Encoding," 2009 Data Compression Conference, Snowbird, UT, 2009, pp. 203-212.
[95] H. Li, "Flash Saver: Save the Flash-Based Solid State Drives through Deduplication and Delta-encoding," 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, Beijing, 2012, pp. 436-441.
[96] Z. Zhang, Z. Jiang, C. Peng and Z. Liu, "Analysis of data fragments in deduplication system," 2012 International Conference on System Science and Engineering (ICSSE), Dalian, Liaoning, 2012, pp. 559-563.
[97] Yong-Ting Wu, Min-Chieh Yu, Jenq-Shiou Leu, Eau-Chung Lee and Tian Song, "Design and implementation of various file deduplication schemes on storage devices," 2015 11th International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness (QSHINE), Taipei, 2015, pp. 80-84.
[98] Y. Fu, H. Jiang, N. Xiao, L. Tian and F. Liu, "AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment," 2011 IEEE International Conference on Cluster Computing, Austin, TX, 2011, pp. 112-120.
[99] N. Wanigasekara and C. I. Keppittiyagama, "BuddyFS: A File-System to Improve Data Deduplication in Virtualization Environments," 2014 Eighth International Conference on Complex, Intelligent and Software Intensive Systems, Birmingham, 2014, pp. 198-204.
[100] Feng Chen, Tian Luo, and Xiaodong Zhang, "CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives," in Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST'11), USENIX Association, Berkeley, CA, USA, 6-6.
[101] W. Xia, H. Jiang, D. Feng and L. Tian, "Combining Deduplication and Delta Compression to Achieve Low-Overhead Data Reduction on Backup Datasets," 2014 Data Compression Conference, Snowbird, UT, 2014, pp. 203-212.
[102] D. Meister, J. Kaiser, A. Brinkmann, T. Cortes, M. Kuhn and J. Kunkel, "A study on data deduplication in HPC storage systems," SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, 2012, pp. 1-11.
[103] J. Ha, Y. Lee and J. Kim, "Deduplication with Block-Level Content-Aware Chunking for Solid State Drives (SSDs)," 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, Zhangjiajie, 2013, pp. 1982-1989.
[104] Y. Deng, L. Song and X. Huang, "Evaluating Memory Compression and Deduplication," 2013 IEEE Eighth International Conference on Networking, Architecture and Storage, Xi'an, 2013, pp. 282-286.
[105] E. W. D. Rozier and W. H. Sanders, "A framework for efficient evaluation of the fault tolerance of deduplicated storage systems," IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), Boston, MA, 2012, pp. 1-12.
[106] X. Zhao, Y. Zhang, Y. Wu, K. Chen, J. Jiang and K. Li, "Liquid: A Scalable Deduplication File System for Virtual Machine Images," in IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 5, pp. 1257-1266, May 2014.
[107] Youngjin Nam, Guanlin Lu and D. H. C. Du, "Reliability-aware deduplication storage: Assuring chunk reliability and chunk loss severity," 2011 International Green Computing Conference and Workshops, Orlando, FL, 2011, pp. 1-6.
[108] Calicrates Policroniades, Ian Pratt, "Alternatives for detecting redundancy in storage systems data," in Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC '04), USENIX Association, Berkeley, CA, USA, 6-6.
[109] Fred Douglis, Arun Iyengar, "Application-specific Delta-encoding via Resemblance Detection," [Online]. Available: https://pdfs.semanticscholar.org/5aef/da15f1dcf04529bbf518659a23112cbb5246.pdf, Accessed on: Jun. 26, 2019.
[110] J. Kim et al., "Deduplication in SSDs: Model and quantitative analysis," 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), San Diego, CA, 2012, pp. 1-12.
[111] D. Harnik, O. Margalit, D. Naor, D. Sotnikov and G. Vernik, "Estimation of deduplication ratios in large data sets," 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), San Diego, CA, 2012, pp. 1-11.
[112] E. W. D. Rozier, W. H. Sanders, P. Zhou, N. Mandagere, S. M. Uttamchandani and M. L. Yakushev, "Modeling the Fault Tolerance Consequences of Deduplication," 2011 IEEE 30th International Symposium on Reliable Distributed Systems, Madrid, 2011, pp. 75-84.
[113] Y. Joo, J. Ryu, S. Park, H. Shin and K. G. Shin, "Rapid Prototyping and Evaluation of Intelligence Functions of Active Storage Devices," in IEEE Transactions on Computers, vol. 63, no. 9, pp. 2356-2368, Sept. 2014.
[114] E. Jeannot, B. Knutsson and M. Bjorkman, "Adaptive online data compression," Proceedings 11th IEEE International Symposium on High Performance Distributed Computing, Edinburgh, UK, 2002, pp. 379-388.
[115] T. Quan, D. Yeo and Y. Won, "CMFS: Compressed metadata file system for hybrid storage," 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content, Beijing, 2010, pp. 1030-1034.
[116] S. Ahn, S. Hyun, T. Kim and H. Bahn, "A compressed file system manager for flash memory based consumer electronics devices," in IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 544-549, August 2013.
[117] K. Kim, S. Jung and Y. H. Song, "Compression ratio based hot/cold data identification for flash memory," 2011 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, 2011, pp. 33-34.
[118] D. Zhao, K. Qiao, J. Yin and I. Raicu, "Dynamic Virtual Chunks: On Supporting Efficient Accesses to Compressed Scientific Data," in IEEE Transactions on Services Computing, vol. 9, no. 1, pp. 96-109, 1 Jan.-Feb. 2016.
[119] S. Hyun, H. Bahn and K. Koh, "LeCramFS: an efficient compressed file system for flash-based portable consumer devices," in IEEE Transactions on Consumer Electronics, vol. 53, no. 2, pp. 481-488, May 2007.
[120] W. Chang, X. Yun, B. Fang, S. Wang and X. Yu, "Performance evaluation of block LZSS compression algorithm," 2010 2nd International Conference on Future Computer and Communication, Wuhan, 2010, pp. V2-449-V2-454.
[121] C. Constantinescu and M. Lu, "Quick Estimation of Data Compression and De-duplication for Large Storage Systems," 2011 First International Conference on Data Compression, Communications and Processing, Palinuro, 2011, pp. 98-102.
[122] O. Kwon, Y. Yoo, K. Koh and H. Bahn, "Replacement and swapping strategy to improve read performance of portable consumer devices using compressed file systems," in IEEE Transactions on Consumer Electronics, vol. 54, no. 2, pp. 551-559, May 2008.
[123] A. Molfetas, A. Wirth and J. Zobel, "Using Inter-file Similarity to Improve Intra-file Compression," 2014 IEEE International Congress on Big Data, Anchorage, AK, 2014, pp. 192-199.
[124] T. Makatos, Y. Klonatos, M. Marazakis, M. D. Flouris and A. Bilas, "ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency," 2010 International Workshop on Storage Network Architecture and Parallel I/Os, Incline Village, NV, 2010, pp. 61-70.
[125] B. Shen, X. Jin, Y. H. Song and S. S. Lee, "APRA: Adaptive Page Replacement Algorithm for NAND Flash Memory Storages," 2009 International Forum on Computer Science-Technology and Applications, Chongqing, 2009, pp. 11-14.
[126] M. Wang and Y. Hu, "Exploit real-time fine-grained access patterns to partition write buffer to improve SSD performance and life-span," 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC), San Diego, CA, 2013, pp. 1-7.
[127] Matias Bjørling, Javier Gonzalez, Philippe Bonnet, "LightNVM: The Linux Open-Channel SSD Subsystem," [Online]. Accessed on: Jun. 26, 2019.
[128] Matias Bjørling, Jesper Madsen, Philippe Bonnet, Aviad Zuck, Zvonimir Bandic, Qingbo Wang, "LightNVM: Lightning Fast Evaluation Platform for Non-Volatile Memories," [Online]. Available: https://pdfs.semanticscholar.org/30eb/bf2b42ef3a5714b0f5350f85842e3ca2e408.pdf, Accessed on: Jun. 26, 2019.
[129] Matias Bjørling, Jesper Madsen, Javier Gonzalez, Philippe Bonnet, "Linux Kernel Abstractions for Open-Channel Solid State Drives," [Online]. Accessed on: Jun. 26, 2019.
[130] Javier Gonzalez, Matias Bjørling, "Multi-Tenant I/O Isolation with Open-Channel SSDs," [Online]. Accessed on: Jun. 26, 2019.
[131] Javier Gonzalez, Matias Bjørling, Seongno Lee, Charlie Dong, Yiren Ronnie Huang, "Application-Driven Flash Translation Layers on Open-Channel SSDs," [Online]. Available: