SSDFS: Towards LFS Flash-Friendly File System without GC operations
Viacheslav Dubeyko
Abstract
Solid state drives have a number of interesting characteristics. However, there are numerous file system and storage design issues for SSDs that impact the performance and device endurance. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that result in shorter SSD lifetime and the necessity to use NAND flash overprovisioning. The SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment’s PEBs pool, Main/Diff/Journal areas in the PEB’s log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree. The combination of all suggested concepts is able to: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system performance.
Index terms: NAND flash, SSD, Log-structured file system (LFS), write amplification issue, GC overhead, flash-friendly file system, SSDFS, delta-encoding, Copy-On-Write (COW), Diff-On-Write (DOW), PEB migration, deduplication.
Flash memory characteristics. Flash is available in two types: NOR and NAND. NOR flash is directly addressable, which enables not only reading but also executing instructions directly from the memory. A NAND-based SSD consists of a set of blocks which are fixed in number, and each block comprises a fixed set of pages. There are three types of operations in flash memory: read, write and erase. Read and write operations are executed at the page level. On the other hand, data is erased at the block level by using the erase operation. Because of the physical features of flash memory, write operations are able to modify bits only from one to zero. Hence, the erase operation, which sets all bits to one, should be executed before rewriting. The typical latencies are: (1) read operation - 20 us, (2) write operation - 200 us, (3) erase operation - 2 ms.
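The write-only-clears-bits constraint described above can be captured in a few lines of code. The following is a minimal, illustrative model (not taken from any real FTL or SSDFS code; sizes are assumptions): a page program succeeds only if it never needs to flip a bit from 0 back to 1, otherwise the whole containing block must be erased first.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE      4096          /* assumed page size, bytes        */
#define PAGES_PER_BLK  64            /* assumed pages per erase block   */

struct flash_block {
    uint8_t pages[PAGES_PER_BLK][PAGE_SIZE];
};

/* Erase sets every bit of every page in the block to 1. */
static void block_erase(struct flash_block *blk)
{
    memset(blk, 0xFF, sizeof(*blk));
}

/*
 * Program a page: a write may only clear bits (1 -> 0).  If the new
 * data would require setting a bit back to 1, the write is rejected
 * and the caller must erase the whole block first.
 */
static bool page_program(struct flash_block *blk, int page,
                         const uint8_t *data)
{
    uint8_t *cur = blk->pages[page];
    for (int i = 0; i < PAGE_SIZE; i++) {
        /* data[i] must be reachable from cur[i] by clearing bits only */
        if ((cur[i] & data[i]) != data[i])
            return false;            /* would need erase-before-write  */
    }
    for (int i = 0; i < PAGE_SIZE; i++)
        cur[i] &= data[i];
    return true;
}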
Flash Translation Layer (FTL). FTL emulates the functionality of a block device and enables the operating system to use flash memory without any modification. FTL mimics the block storage interface and hides the internal complexities of flash memory from the operating system, thus enabling the operating system to read/write flash memory in the same way as reading/writing the hard disk. The basic function of the FTL algorithm is to map the page number from logical to physical. However, internally the FTL needs to deal with erase-before-write, which makes it critical to overall performance and lifetime of the SSD.
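A minimal sketch of the logical-to-physical page mapping maintained by a page-level FTL is shown below. The structures and names are hypothetical and greatly simplified (free-page management and GC are left out): every logical page write is redirected to the next free physical page, the old physical page is marked invalid, and the map is updated.

#include <stdint.h>

#define TOTAL_PAGES   65536
#define INVALID_PPN   ((uint32_t)-1)

struct ftl {
    uint32_t l2p[TOTAL_PAGES];   /* logical page -> physical page; assumed  */
                                 /* initialized to INVALID_PPN everywhere   */
    uint8_t  valid[TOTAL_PAGES]; /* validity bitmap of physical pages       */
    uint32_t next_free;          /* next free physical page (simplified)    */
};

/* Out-of-place update: never overwrite a programmed page in place. */
static uint32_t ftl_write(struct ftl *f, uint32_t lpn)
{
    uint32_t old = f->l2p[lpn];

    if (old != INVALID_PPN)
        f->valid[old] = 0;              /* old copy becomes garbage         */

    uint32_t new_ppn = f->next_free++;  /* GC must replenish free pages     */
    f->valid[new_ppn] = 1;
    f->l2p[lpn] = new_ppn;
    return new_ppn;                     /* physical page actually programmed */
}

static uint32_t ftl_read(const struct ftl *f, uint32_t lpn)
{
    return f->l2p[lpn];                 /* INVALID_PPN if never written     */
}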
Garbage Collection and Wear Leveling. The process of collecting and moving valid data and erasing invalid data is called garbage collection. The file system can trigger garbage collection for deleted file blocks through the TRIM command of the SSD firmware. Frequently used erase blocks wear out quickly, slow down access times and finally burn out. Therefore, the erase count of each erase block should be monitored. There is a wide variety of wear-leveling techniques used in FTL.
Building blocks of SSD. An SSD includes a controller that incorporates the electronics that bridge the NAND memory components to the host computer. The controller is an embedded processor that executes firmware-level code. Some of the functions performed by the controller include error-correcting code (ECC), wear leveling, bad block mapping, read scrubbing and read disturb management, read and write caching, garbage collection, etc. A flash-based SSD typically uses a small amount of DRAM as a cache, similar to the cache in hard disk drives. A directory of block placement and wear leveling data is also kept in the cache while the drive is operating.
Write amplification. For write requests that come in random order, after a period of time, the free page count in flash memory becomes low. The garbage-collection mechanism then identifies a victim block for cleaning. All valid pages in the victim block are relocated into a new block with free pages, and finally the candidate block is erased so that the pages become available for rewriting. This mechanism introduces additional read and write operations, the extent of which depends on the specific policy deployed, as well as on the system parameters. These additional writes result in the multiplication of user writes, a phenomenon referred to as write amplification.

Read disturbance. A flash data block is composed of multiple NAND units to which the memory cells are connected in series. A memory operation on a specific flash cell will influence the charge contents of different cells. This is called disturbance; it can occur on any flash operation, is predominantly observed during the read operation, and leads to errors in undesignated memory cells. To avoid failure on reads, error-correcting codes (ECC) are widely employed. Read disturbance can occur when reading the same target cell multiple times without an erase and program operation. In general, to preserve data consistency, flash firmware reads all live data pages, erases the block, and writes down the live pages to the erased block. This process, called read block reclaiming, introduces long latencies and degrades performance.
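Write amplification is commonly quantified as the ratio of the data actually written to the flash (host writes plus the extra writes caused by garbage-collection relocation) to the data written by the host. A one-line helper makes the definition concrete; the function name is illustrative only.

/* Write amplification factor: WAF = (host_bytes + gc_bytes) / host_bytes */
static double write_amplification(double host_bytes, double gc_relocated_bytes)
{
    return (host_bytes + gc_relocated_bytes) / host_bytes;
}

/* Example: 100 GB written by the host, 40 GB relocated by GC -> WAF = 1.4 */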
SSD design issues. Solid state drives have a number of interesting characteristics that change the access patterns required to optimize metrics such as disk lifetime and read/write throughput. In particular, SSDs have approximately two orders of magnitude improvement in read and write latencies, as well as a significant increase in overall bandwidth. However, there are numerous file system and storage array design issues for SSDs that impact the performance and device endurance. SSDs suffer well-documented shortcomings: log-on-log, large tail latencies, unpredictable I/O latency, and resource underutilization. These shortcomings are not due to hardware limitations: the non-volatile memory chips at the core of SSDs provide predictable high performance at the cost of constrained operations and limited endurance/reliability. Providing the same block I/O interface as a magnetic disk is one of the important reasons for these drawbacks.
SSDFS features. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that result in shorter SSD lifetime and the necessity to use NAND flash overprovisioning. The SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment’s PEBs pool, Main/Diff/Journal areas in the PEB’s log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree. The combination of all suggested concepts is able to: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system performance. The rest of this paper is organized as follows. Section II surveys the related works. Section III explains the SSDFS architecture and approaches. Section IV includes final discussion. Section V offers conclusions.
File size. Agrawal, et al. [10] discovered that 1-1.5% of files on a file system’s volume have a size of zero. The arithmetic mean file size was 108 KB in 2000 and 189 KB in 2004. This metric grows roughly 15% per year. The median weighted file size increased from 3 MB to 9 MB. Most of the bytes in large files are in video, database, and blob files, and most of the video, database, and blob bytes are in large files. A large number of small files account for a small fraction of disk usage. Douceur, et al. [12] confirmed that 1.7% of all files have a size of zero. The mean file size ranges from 64 kB to 128 kB across the middle two quartiles of all file systems. The median size is 2 MB, which confirms that most files are small but most bytes are in large files. Ullah, et al. [16] observed that file sizes of 1 to 10 KB account for up to 32% of the total occurrences, and 29% of values are in the range of 10 KB to 100 KB. Gibson, et al. [17] agreed that most files are relatively small: more than half are less than 8 KB, and 80% or more of the files are smaller than 32 KB. On the other hand, while only 25% are larger than 8 KB, this 25% contains the majority of the bytes used on the different systems.
File age. Agrawal, et al. [10] stated that the median file age ranges between 80 and 160 days across datasets, with no clear trend over time. Douceur, et al. [12] reported a median file age of 48 days. Studies of short-term trace data have shown that the vast majority of files are deleted within a few minutes of their creation. On 50% of file systems, the median file age ranges by a factor of 8, from 12 to 97 days, and on 90% of file systems, it ranges by a factor of 256, from 1.5 to 388 days. Gibson, et al. [17] showed that while only 15% of the files are modified daily, these modifications account for over 70% of the bytes used daily. Relatively few files are used on any one day, normally less than 5%. Depending on the system, 5-10% of all files created are only used on one day. On the other hand, approximately 0.8% of the files are used more than 129 times, essentially every day. These files which are used more than 129 times account for less than 1% of all files created and approximately 10% of all the files which were accessed or modified. 90% of all files are not used after initial creation; those that are used are normally short-lived, and if a file is not used in some manner the day after it is created, it will probably never be used. 1% of all files are used daily.
Files count per file system. Agrawal, et al. [10] showed that the count of files per file system is going up from year to year. The arithmetic mean has grown from 30K to 90K files and the median has grown from 18K to 52K files (2000 to 2004). However, some percentage of file systems had already reached about 512K files in 2004. Douceur, et al. [12] found that 31% of all file systems contain 8k to 16k files, and 30% of file systems have fewer than 4k files. Ullah, et al. [16] found that 67% of the occurrences are in the range of 1-8 files in a directory. Directories with 9-16 files comprise 15% of the total data found. Directories with 17-32 files account for 9%, and only 9% of occurrences are found for more than 32 files in a directory.
File names. Agrawal, et al. [10] concluded that dynamic link libraries (dll files) contain more bytes than any other file type, and that virtual hard drives are consuming a rapidly increasing fraction of file-system space. Ullah, et al. [16] discovered that the file name length typically falls in the range from 9 to 17 characters, with a peak for file names of 12 characters. File names smaller than 8 characters make up to 11% of the total data collected, whereas file names larger than 16 characters and up to 32 characters account for 26%, and file names greater than 32 characters are found to be only 6%.
Directory size. Agrawal, et al. [10] discovered that across all years, 23-25% of directories contain no files. The arithmetic mean directory size has decreased slightly and steadily from 12.5 to 10.2 over the sample period, but the median directory size has remained steady at 2 files. Across all years, 65-67% of directories contain no subdirectories. Across all years, 46-49% of directories contain two or fewer entries. Douceur, et al. [12] shared that 18% of all directories contain no files and the median directory size is 2 files. 69% of all directories contain no subdirectories, 16% contain one, and fewer than 0.5% contain more than twenty. On 50% of file systems, the median directory size ranges from 1 to 4 files, and on 90% of file systems, it ranges from 0 to 7 files. On 95% of all file systems, the median count of subdirectories per directory is zero. 15% of all directories are at depth of 8 or greater.
Directories count per file system. Agrawal, et al. [10] registered that the count of directories per file system has increased steadily over the five-year sample period. The arithmetic mean has grown from 2400 to 8900 directories and the median has grown from 1K to 4K directories. Douceur, et al. [12] shared that 28% of all file systems contain 512 to 1023 directories, and 29% of file systems have fewer than 256 directories. Ullah, et al. [16] shared that 59% of the directories have sub-directories in the range of 1-5, and 35% of occurrences are found in the range of 6-10. The results show that only 6% of occurrences have more than 10 sub-directories in a directory.
Namespace tree depth. Agrawal, et al. [10] shared that there are many files deep in the namespace tree, especially at depth 7. Also, files deeper in the namespace tree tend to be orders of magnitude smaller than shallower files. The arithmetic mean depth has grown from 6.1 to 6.9, and the median directory depth has increased from 5 to 6. The count of files per directory is mostly independent of directory depth. Files deeper in the namespace tree tend to be smaller than shallower ones. The mean file size drops by two orders of magnitude between depth 1 and depth 3, and there is a drop of roughly 10% per depth level thereafter.
Capacity and usage. Agrawal, et al. [10] registered that 80% of file systems become fuller over a one-year period, and the mean increase in fullness is 14 percentage points. This increase is predominantly due to creation of new files, partly offset by deletion of old files, rather than due to extant files changing size. The space used in file systems has increased not only because mean file size has increased (from 108 KB to 189 KB), but also because the number of files has increased (from 30K to 90K). Douceur, et al. [12] discovered that file systems are on average only half full, and their fullness is largely independent of user job category. On average, half of the files in a file system have been created by copying without subsequent writes, and this is also independent of user job category. The mean space usage is 53%.
A File Is Not a File. Harter, et al. [14] showed that modern applications manage large databases of information organized into complex directory trees. Even simple word-processing documents, which appear to users as a "file", are in actuality small file systems containing many sub-files.
Auxiliary files dominate. Tan, et al. [13] discovered that on iOS, applications access resource, temp, and plist files very often. This is especially true for Facebook, which uses a large number of cache files. Also, on iOS resource files such as icons and thumbnails are stored individually on the file system. Harter, et al. [14] agree with that statement. Applications help users create, modify, and organize content, but user files represent a small fraction of the files touched by modern applications. Most files are helper files that applications use to provide a rich graphical experience, support multiple languages, and record history and other metadata.
Sequential Access Is Not Sequential. Harter, et al. [14] stated that even for streaming media workloads, "pure" sequential access is increasingly rare. Since file formats often include metadata in headers, applications often read and re-read the first portion of a file before streaming through its contents.
Writes are forced. Tan, et al. [13] shared that on iOS, Facebook calls fsync even on cache files, resulting in the largest number of fsync calls out of the applications. On Android, fsync is called for each temporary write-ahead logging journal file. Harter, et al. [14] found that applications are less willing to simply write data and hope it is eventually flushed to disk. Most written data is explicitly forced to disk by the application; for example, iPhoto calls fsync thousands of times in even the simplest of tasks.
Temporary files. Tan, et al. [13] showed that applications create many temporary files. This might have a negative impact on the durability of the flash storage device. Also, creating many files results in storage fragmentation. The SQLite database library creates many short-lived temporary journal files and calls fsync often.
Copied files. Agrawal, et al. [10] shared the interesting point that over the sample period (2000-2004), the arithmetic mean of the percentage of copied files has grown from 66% to 76%, and the median has grown from 70% to 78%. It means that more and more files are being copied across file systems rather than generated locally. Downey [15] concluded that the vast majority of files in most file systems were created by copying, either by installing software (operating system and applications) or by downloading from the World Wide Web. Many new files are created by translating a file from one format to another, compiling, or by filtering an existing file. Using a text editor or word processor, users add or remove material from existing files, sometimes replacing the original file and sometimes creating a series of versions.
Renaming Is Popular. Harter, et al. [14] discovered that home-user applications commonly use atomic operations, in particular rename, to present a consistent view of files to users.
Multiple Threads Perform I/O. Harter, et al. [14] showed that virtually all of the applications issue I/O requests from a number of threads; a few applications launch I/Os from hundreds of threads. Part of this usage stems from the GUI-based nature of these applications; threads are required to perform long-latency operations in the background to keep the GUI responsive.
Frameworks Influence I/O. Harter, et al. [14] found that modern applications are often developed in sophisticated IDEs and leverage powerful libraries, such as Cocoa and Carbon. Whereas UNIX-style applications often directly invoke system calls to read and write files, modern libraries put more code between applications and the underlying file system. Default behavior of some Cocoa APIs induces extra I/O and possibly unnecessary (and costly) synchronizations to disk. In addition, use of different libraries for similar tasks within an application can lead to inconsistent behavior between those tasks.
Applications’ behavior. Harter, et al. [14] made several conclusions about applications’ nature. Applications tend to open many very small files.

File System and Block IO Scheduler. Hui, et al. [20] have made an estimation of the interaction between file systems and the block I/O scheduler. They concluded that a higher proportion of reads or append-writes may result in better performance and less energy consumption, as in a web-server workload. Along with the increase of write operations, especially random writes, the performance declines and the energy consumption increases. The extent-based file systems show better performance and lower energy consumption. They expected that the NOOP I/O scheduler would be better suited for SSDs because it does not sort requests, which can cost much time and decrease performance. But after the test, they found that both CFQ and NOOP may be suitable for SSDs.
NAND flash storage device. Parthey, et al. [21] analyzed access timing of removable flash media. They found that many media access address zero especially fast. For some media, other locations such as the middle of the medium are sometimes slower than the average.

Rosenblum, et al. [34] introduced a new technique for disk storage management called a log-structured file system. A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on disk; it contains indexing information so that files can be read back from the log efficiently. In order to maintain large free areas on disk for fast writing, they divided the log into segments and use a segment cleaner to compress the live information from heavily fragmented segments. Log-structured file systems are based on the assumption that files are cached in main memory and that increasing memory sizes will make the caches more and more effective at satisfying read requests. As a result, disk traffic will become dominated by writes. A log-structured file system writes all new information to disk in a sequential structure called the log. This approach increases write performance dramatically by eliminating almost all seeks. The sequential nature of the log also permits much faster crash recovery: current Unix file systems typically must scan the entire disk to restore consistency after a crash, but a log-structured file system need only examine the most recent portion of the log. For a log-structured file system to operate efficiently, it must ensure that there are always large extents of free space available for writing new data. This is the most difficult challenge in the design of a log-structured file system. The authors presented a solution based on large extents called segments, where a segment cleaner process continually regenerates empty segments by compressing the live data from heavily fragmented segments.
JFFS (The Journalling Flash File System) [35, 36] is a purely log-structured file system (LFS). Nodes containing data and metadata are stored on the flash chips sequentially, progressing strictly linearly through the storage space available. In JFFS v1, there is only one type of node in the log: a structure known as struct jffs_raw_inode. Each such node is associated with a single inode. It starts with a common header containing the inode number of the inode to which it belongs and all the current file system metadata for that inode, and may also carry a variable amount of data. There is a total ordering between all the nodes belonging to any individual inode, which is maintained by storing a version number in each node. Each node is written with a version higher than all previous nodes belonging to the same inode. In addition to the normal inode metadata such as uid, gid, mtime, atime and ctime, each JFFS v1 raw node also contains the name of the inode to which it belongs and the inode number of the parent inode. Each node may also contain an amount of data, and if data are present the node will also record the offset in the file at which these data should appear. The entire medium is scanned at mount time, each node being read and interpreted. The data stored in the raw nodes provide sufficient information to rebuild the entire directory hierarchy and a complete map for each inode of the physical location on the medium of each range of data. Metadata changes such as ownership or permissions changes are performed by simply writing a new node to the end of the log recording the appropriate new metadata. File writes are similar, differing only in that the node written will have data associated with it.

The oldest node in the log is known as the head, and new nodes are added to the tail of the log. In a clean file system on which garbage collection has never been triggered, the head of the log will be at the very beginning of the flash. As the tail approaches the end of the flash, garbage collection will be triggered to make space. Garbage collection will happen either in the context of a kernel thread which attempts to make space before it is actually required, or in the context of a user process which finds insufficient free space on the medium to perform a requested write. In either case, garbage collection will only continue if there is dirty space which can be reclaimed. If there is not enough dirty space to ensure that garbage collection will improve the situation, the kernel thread will sleep, and writes will fail with ENOSPC errors. The goal of the garbage collection code is to erase the first flash block in the log. At each pass, the node at the head of the log is examined. If the node is obsolete, it is skipped and the head moves on to the next node. If the node is still valid, it must be rendered obsolete. The garbage collection code does so by writing out a new data or metadata node to the tail of the log.

While the original JFFS had only one type of node on the medium, JFFS2 is more flexible, allowing new types of node to be defined while retaining backward compatibility through use of a scheme inspired by the compatibility bitmasks of the ext2 file system. Every type of node starts with a common header containing the full node length, node type and a cyclic redundancy checksum (CRC).
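The description of a JFFS v1 log node above can be summarized by a simplified, hypothetical layout; the field names and widths below are illustrative only and do not reproduce the exact on-flash format.

#include <stdint.h>

/* Illustrative sketch of a JFFS v1 log node (not the exact on-flash layout). */
struct raw_inode_sketch {
    uint32_t ino;          /* inode this node belongs to                    */
    uint32_t pino;         /* inode number of the parent directory          */
    uint32_t version;      /* total ordering of nodes of the same inode     */
    uint32_t uid, gid;     /* ownership metadata carried by every node      */
    uint32_t atime, mtime, ctime;
    uint32_t offset;       /* file offset at which the payload data appear  */
    uint32_t dsize;        /* number of payload data bytes that follow      */
    uint8_t  name_len;     /* the inode's name is stored in each node too   */
    /* followed by: name[name_len], data[dsize]                             */
};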
Aside from the differences in the individual nodes, the high-level layout of JFFS2 also changed from a single circular log format, because of the problem caused by strictly garbage collecting in order. In JFFS2, each erase block is treated individually, and nodes may not overlap erase block boundaries as they did in the original JFFS. This means that the garbage collection code can work with increased efficiency by collecting from one block at a time and making intelligent decisions about which block to garbage collect from next.

In traditional file systems the index is usually kept and maintained on the media but, unfortunately, this is not the case for JFFS2. In JFFS2, the index is maintained in RAM, not on the flash media. And this is the root of all the JFFS2 scalability problems. Of course, by having the index in RAM JFFS2 achieves extremely high file system throughput, just because it does not need to update the index on flash after something has been changed in the file system. And this works very well for relatively small flashes, for which JFFS2 was originally designed. But as soon as one tries to use JFFS2 on large flashes (starting from about 128MB), many problems come up. JFFS2 needs to build the index in RAM when it mounts the file system. For this reason, it needs to scan the whole partition in order to locate all the nodes which are present there. So, the larger the JFFS2 partition and the more nodes it has, the longer it takes to mount. Second, the index obviously consumes some RAM, and the larger the JFFS2 file system and the more nodes it has, the more memory is consumed.
UBIFS (Unsorted Block Image File System) [37, 38] follows a node-structured design that enables its garbage collector to read eraseblocks directly and determine what data needs to be moved and what can be discarded, and to update its indexes accordingly. The combination of data and metadata is called a node. Each node records which file (more specifically, inode number) the node belongs to and what data (for example, file offset and data length) is contained in the node. The big difference between JFFS2 and UBIFS is that UBIFS stores the index on flash, whereas JFFS2 stores the index only in main memory, rebuilding it when the file system is mounted. Potentially that places a limit on the maximum size of a JFFS2 file system, because the mount time and memory usage grow linearly with the size of the flash. UBIFS was designed specifically to overcome that limitation.

The master node stores the position of all on-flash structures that are not at fixed logical positions. The master node itself is written repeatedly to logical eraseblocks (LEBs) one and two. LEBs are an abstraction created by UBI. UBI maps physical eraseblocks (PEBs) to LEBs, so LEB one and two can be anywhere on the flash media (strictly speaking, the UBI device), however UBI always records where they are. Two eraseblocks are used in order to keep two copies of the master node. This is done for the purpose of recovery, because there are two situations that can cause a corrupt or missing master node. LEB zero stores the superblock node. The superblock node contains file system parameters that change rarely if at all. For example, the flash geometry (eraseblock size, number of eraseblocks, etc.) is stored in the superblock node. The other UBIFS areas are: the log area (or simply the log), the LEB properties tree (LPT) area, the orphan area and the main area. The log is a part of UBIFS's journal.

The purpose of the UBIFS journal is to reduce the frequency of updates to the on-flash index. The index consists of the top part of the wandering tree that is made up of only index nodes, so to update the file system a leaf node must be added or replaced in the wandering tree and all the ancestral index nodes updated accordingly. It would be very inefficient if the on-flash index were updated every time a leaf node was written, because many of the same index nodes would be written repeatedly, particularly towards the top of the tree. Instead, UBIFS defines a journal where leaf nodes are written but not immediately added to the on-flash index. Note that the index in memory (see TNC) is updated. Periodically, when the journal is considered reasonably full, it is committed. The commit process consists of writing the new version of the index and the corresponding master node.

After the log area comes the LPT area. The size of the log area is defined when the file system is created and consequently so is the start of the LPT area. At present, the size of the LPT area is automatically calculated based on the LEB size and maximum LEB count specified when the file system is created. Like the log area, the LPT area must never run out of space. Unlike the log area, updates to the LPT area are not sequential in nature - they are random. In addition, the amount of LEB properties data is potentially quite large and access to it must be scalable. The solution is to store LEB properties in a wandering tree. In fact, the LPT area is much like a miniature file system in its own right. It has its own LEB properties - that is, the LEB properties of the LEB properties area (called ltab).
It has its own form of garbage collection. It has its own node structure that packs the nodes as tightly as possible into bit-fields. However, like the index, the LPT area is updated only during commit. Thus the on-flash index and the on-flash LPT represent what the file system looked like as at the last commit. The difference between that and the actual state of the file system is represented by the nodes in the journal.

The next UBIFS area to describe is the orphan area. An orphan is an inode number whose inode node has been committed to the index with a link count of zero. That happens when an open file is deleted (unlinked) and then a commit is run. In the normal course of events the inode would be deleted when the file is closed. However, in the case of an unclean unmount, orphans need to be accounted for. After an unclean unmount, the orphans' inodes must be deleted, which means either scanning the entire index looking for them, or keeping a list on flash somewhere. UBIFS implements the latter approach.

The final UBIFS area is the main area. The main area contains the nodes that make up the file system data and the index. A main area LEB may be an index eraseblock or a non-index eraseblock. A non-index eraseblock may be a bud (part of the journal) or have been committed. A bud may be currently one of the journal heads. A LEB that contains committed nodes can still become a bud if it has free space. Thus a bud LEB has an offset from which journal nodes begin, although that offset is usually zero.

There are three important differences between UBIFS and JFFS2. The first has already been mentioned: UBIFS has an on-flash index and JFFS2 does not - thus UBIFS is potentially scalable. The second difference is implied: UBIFS runs on top of the UBI layer which runs on top of the MTD subsystem, whereas JFFS2 runs directly over MTD. UBIFS benefits from the wear-leveling and error handling of UBI at the cost of the flash space, memory and other resources taken by UBI. The third important difference is that UBIFS allows writeback.
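The "wandering tree" update that the UBIFS journal tries to amortize can be sketched as follows. This is a simplified, assumed model rather than UBIFS code: replacing a leaf out of place forces every ancestor index node up to the root to be rewritten as well, so batching leaf updates in a journal and committing the index periodically saves many index writes.

#include <stdlib.h>

struct index_node {
    struct index_node *child[16];   /* fan-out is illustrative             */
    int                on_flash;    /* 1 = node already committed to flash */
};

/*
 * Copy-on-write update of one path: every node from the root down to the
 * changed leaf is replaced by a fresh copy that will be written to a new
 * location on flash.  Committing once per N leaf updates (the journal)
 * rewrites the shared upper nodes once instead of N times.
 */
static struct index_node *cow_path(struct index_node *node, int depth,
                                   const int *path)
{
    struct index_node *copy = malloc(sizeof(*copy));
    *copy = *node;
    copy->on_flash = 0;                         /* must be written again   */
    if (depth > 0) {
        int slot = path[0];
        copy->child[slot] = cow_path(node->child[slot], depth - 1, path + 1);
    }
    return copy;                                /* new root of the subtree */
}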
Yaffs (Yet Another Flash File System) [39] contains objects. An object is anything that is stored in the file system. These are: (1) regular data files, (2) directories, (3) hard links, (4) symbolic links, (5) special objects (pipes, devices etc). All objects are identified by a unique integer objectId. In Yaffs, the unit of allocation is the chunk. Typically a chunk will be the same as a NAND page, but there is flexibility to use chunks which map to multiple pages.

Many, typically 32 to 128 but as many as a few hundred, chunks form a block. A block is the unit of erasure. NAND flash may be shipped with bad blocks, and further blocks may go bad during the operation of the device. Thus, Yaffs is aware of bad blocks and needs to be able to detect and mark bad blocks. NAND flash also typically requires the use of some sort of error detection and correction code (ECC). Yaffs can either use existing ECC logic or provide its own.

Yaffs2 has a true log structure. A true log structured file system only ever writes sequentially. Instead of writing data in locations specific to the files, the file system data is written in the form of a sequential log. The entries in the log are all one chunk in size and can hold one of two types of chunk: (1) Data chunk - a chunk holding regular data file contents, (2) Object Header - a descriptor for an object (directory, regular data file, hard link, soft link, special descriptor, ...). This holds details such as the identifier for the parent directory, object name, etc. Each chunk has tags associated with it. The tags comprise the following important fields: (1) ObjectId - identifies which object the chunk belongs to, (2) ChunkId - identifies where in the file this chunk belongs, (3) Deletion Marker - (Yaffs1 only) shows that this chunk is no longer in use, (4) Byte Count - number of bytes of data if this is a data chunk, (5) Serial Number - (Yaffs1 only) serial number used to differentiate chunks with the same objectId and chunkId.

When a block is made up only of deleted chunks, that block can be erased and reused. Otherwise, Yaffs needs to copy the valid data chunks off a block, deleting the originals and allowing the block to be erased and reused. This process is referred to as garbage collection. If garbage collection is aggressive, the whole block is collected in a single garbage collection cycle. If the collection is passive, then the number of copies is reduced, thus spreading the effort over many garbage collection cycles. This is done to reduce garbage collection load and improve responsiveness. The rationale behind the above heuristics is to delay garbage collection when possible to reduce the amount of collection that needs to be performed, thus increasing average system performance. Yet there is a conflicting goal of trying to spread out garbage collection so that it does not all happen at the same time, causing fluctuations in file system throughput. These conflicting goals make garbage collection tuning quite challenging.

Mount scanning takes quite a lot of time and slows mounting. Checkpointing is a mechanism to speed up mounting by taking a snapshot of the Yaffs runtime state at unmount or sync() and then reconstituting the runtime state on remounting. The actual checkpoint mechanism is quite simple. A stream of data is written to a set of blocks which are marked as holding checkpoint data, and the important runtime state is written to the stream.
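The per-chunk tags enumerated above map naturally onto a small structure. The sketch below is illustrative only and does not reproduce the exact Yaffs tag layout.

#include <stdint.h>

/* Illustrative sketch of Yaffs per-chunk tags (not the exact layout). */
struct yaffs_tags_sketch {
    uint32_t object_id;     /* which object the chunk belongs to            */
    uint32_t chunk_id;      /* 0 = object header, n = n-th data chunk       */
    uint32_t byte_count;    /* valid data bytes if this is a data chunk     */
    uint8_t  serial;        /* Yaffs1: disambiguates same object/chunk ids  */
    uint8_t  deleted;       /* Yaffs1: deletion marker                      */
};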
NAFS (NAND flash memory Array File System) [46] consists of a Conventional File System and the NAND Flash Memory Array Interface; the former provides the users with basic file operations while the latter allows concurrent accesses to multiple NAND flash memories through a striping technique in order to increase I/O performance. Also, parity bits are distributed across all flash memories in the array to provide fault tolerance like RAID5.

The NAND flash memory is partitioned into two areas: one for the superblock addresses and the other for the superblock itself, inodes, and data. In order to provide uniform wear-leveling, the superblock is stored at a random location in the Data/Superblock/Inode-block Partition while its address is stored in the Superblock Address Partition. NAFS attempts to write file data consecutively into each block of NAND flash memory for better read and write performance.

In addition, NAFS adopts a new double list cache scheme that takes into account the characteristics of both large-capacity storage and NAND flash memory in order to increase I/O performance. The double list cache makes it possible to defer write operations and increase the cache hit ratio by prefetching relevant pages through data striping of the NAND Flash Memory Array Interface. The double list cache consists of the clean list for the actual caching and the dirty list for monitoring and analyzing page reference patterns. The dirty list maintains dirty pages in order to reduce their search times, and the clean list maintains clean pages. All the pages that are brought into memory by read operations are inserted into the clean list. If clean pages in the clean list are modified by write operations, they are removed from the clean list and inserted into the head of the dirty list. Also, if a new file is created, its new pages are inserted into the head of the dirty list. If a page fault occurs, clean pages are removed from the tail of the clean list.

When NAFS performs delayed write operations using the cache, since two blocks are assigned to each NAND flash memory, it is always guaranteed that file data can be written contiguously within at least one block. In addition, NAFS performs delayed write operations in the dirty list, resulting in a reduction of the number of write operations and consecutive write operations of file data in each block of NAND flash memory.
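The double list cache described above can be modeled with two linked lists. The sketch below captures only the promotion rules from the text (pages enter the clean list on read, move to the head of the dirty list when modified, and eviction takes clean pages from the tail of the clean list); the structure and helper names are assumptions, not NAFS code.

#include <stddef.h>

struct page_node {
    struct page_node *prev, *next;
    int dirty;
};

struct list { struct page_node *head, *tail; };

static void list_remove(struct list *l, struct page_node *p)
{
    if (p->prev) p->prev->next = p->next; else l->head = p->next;
    if (p->next) p->next->prev = p->prev; else l->tail = p->prev;
    p->prev = p->next = NULL;
}

static void list_push_head(struct list *l, struct page_node *p)
{
    p->prev = NULL; p->next = l->head;
    if (l->head) l->head->prev = p; else l->tail = p;
    l->head = p;
}

struct double_list_cache { struct list clean, dirty; };

/* Page brought in by a read: it joins the clean list. */
static void cache_on_read(struct double_list_cache *c, struct page_node *p)
{
    p->dirty = 0;
    list_push_head(&c->clean, p);
}

/* Clean page modified by a write: move it to the head of the dirty list. */
static void cache_on_write(struct double_list_cache *c, struct page_node *p)
{
    if (!p->dirty) {
        list_remove(&c->clean, p);
        p->dirty = 1;
        list_push_head(&c->dirty, p);
    }
}

/* Page fault: evict a clean page from the tail of the clean list. */
static struct page_node *cache_evict(struct double_list_cache *c)
{
    struct page_node *victim = c->clean.tail;
    if (victim) list_remove(&c->clean, victim);
    return victim;
}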
CFFS (Core Flash File System) [49], which is another file system based on YAFFS, stores index entries and metadata in index blocks which are distinct from data blocks. Since CFFS just reads in the index blocks during mount, its mount time is faster than YAFFS2's. Furthermore, since frequently modified metadata are collected and stored in index blocks, garbage collection performance of CFFS is better than YAFFS2's. However, since CFFS stores the physical addresses of index blocks in the first block of NAND flash memory in order to reduce mount time, wear-leveling performance of CFFS is worse than others due to frequent erasure of the first block.
NAMU (NAnd flash Multimedia file system) [48] takes into consideration the characteristics of both NAND flash memory and multimedia files. NAMU utilizes an index structure that is suitable for large-capacity files to shorten the mount time by scanning only index blocks located in the index area during mount. In addition, since NAMU manages data in the segment unit rather than in the page unit, NAMU's memory usage efficiency is better than that of JFFS2 and YAFFS2.
MNFS (novel mobile multimedia file system) [50] introduces (1) hybrid mapping, (2) block-based file allocation, (3) an in-core only Block Allocation Table (iBAT), and (4) upward directory representation. Using these methods, MNFS achieves uniform write-responses, quick mounting, and a small memory footprint.

The hybrid mapping scheme means that MNFS uses a page mapping scheme (log-structured method) for the metadata by virtue of the frequent updates. On the other hand, a block mapping scheme is used for user data, because it is rarely updated in mobile multimedia devices. The entire flash memory space is logically divided into two variable-sized areas: the Metadata area and the User data area. MNFS uses a log structure to manage the file system metadata. The metadata area is a collection of log blocks that contain file system metadata; the page mapping scheme is used for this area. The user data area is a collection of data blocks that contains multimedia file data; the block mapping scheme is used for this area. A multimedia file, e.g. a music or video clip, is an order of magnitude larger than a text-based file. Therefore, MNFS uses a larger allocation unit than the block size (usually 4 Kbytes) typically found in a legacy general purpose file system. MNFS defines the allocation unit of the file system as a block of NAND flash memory. The block size of NAND flash memory ranges from 16 Kbyte to 128 Kbyte, and this size is device specific.

MNFS uses the iBAT, which is similar to the File Allocation Table in the FAT file system, for both uniform write-responses and robustness of the file system. There are two important differences between the FAT and the iBAT. First, the iBAT is not stored in the flash memory. Like the in-memory tree structure in YAFFS, the iBAT is dynamically constructed, at mount time, in the main memory (RAM) by scanning the spare area of all the blocks. Secondly, the iBAT uses block-based allocation whereas the FAT uses cluster-based allocation. In the FAT file system, as the file size grows, a new cluster is allocated, requiring modification of the file allocation table in the storage device. Access to the storage device for the metadata update not only affects the response time of the write request, but it can also invoke file system inconsistency when the system crashes during the update. In MNFS, the iBAT is not stored separately in the flash memory, and the block allocation information is stored in the spare area of the block itself while the block is allocated to a file. These two differences make MNFS more robust than the FAT file system.

MNFS uses the upward directory representation method. In this method, each directory entry in the log block has its parent directory entry ID. That is, the child entry points to its parent entry. The directory structure of the file system can be represented using this parent directory entry ID. For the upward directory representation method, it is necessary to read all of the directory entries in order to construct the directory structure of the file system in the memory.
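A sketch of how an in-core only allocation table like the iBAT might be rebuilt at mount time by scanning per-block spare areas is shown below; the structures and the spare-area fields are assumptions for illustration, not the actual MNFS on-flash format.

#include <stdint.h>

#define NUM_BLOCKS 4096

/* Assumed spare-area content: which file owns the block, and its position. */
struct spare_info {
    uint32_t file_id;      /* 0 means the block is free                    */
    uint32_t file_block;   /* ordinal of this block inside the file        */
};

static struct spare_info spare_area[NUM_BLOCKS]; /* stands in for the device */

static void read_spare_area(uint32_t block, struct spare_info *out)
{
    *out = spare_area[block];          /* stands in for a real spare read   */
}

/* In-core only table: nothing below is ever written back to flash. */
struct ibat {
    uint32_t owner[NUM_BLOCKS];        /* file_id owning each flash block   */
    uint32_t position[NUM_BLOCKS];     /* block's ordinal within that file  */
};

/* Mount-time construction: one spare-area read per block, no metadata writes. */
static void ibat_build(struct ibat *t)
{
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        struct spare_info s;
        read_spare_area(b, &s);
        t->owner[b]    = s.file_id;
        t->position[b] = s.file_block;
    }
}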
NILFS (New Implementation of a Log-structured File System) [40, 41] has an on-disk layout divided into several parts: (1) superblock, (2) full segment, (3) partial segment, (4) logical segment, (5) segment management block. The superblock has the parameters of the file system, the disk block address of the latest segment being written, etc. Each full segment consists of a fixed length of disk blocks. This is the basic management unit of the garbage collector. The partial segment is the write unit. Dirty buffers are written out as partial segments. The partial segment does not exceed the full segment boundaries. The partial segment sequence includes inseparable directory operations. For example, a logical segment could consist of two partial segments. In the recovery operations, the two partial segments are treated as one inseparable segment. There are two flag bits, Logical Begin and Logical End, at the segment summary of the partial segment.

NILFS adopts the B-tree structure for both file block mapping and inode block mapping. The two mappings are implemented in the common B-tree operation routine. The B-tree intermediate node is used to construct the B-tree. It has 64-bit-wide key and 64-bit-wide pointer pairs. The file block B-tree uses a file block address as its key, whereas the inode block B-tree uses an inode number as its key. The root block number of the file block B-tree is stored in the corresponding inode block. The root block number of the inode block B-tree is stored in the superblock of the file system. So, there is only one inode block B-tree in the file system. File blocks, B-tree blocks for file block management, inode blocks, and B-tree blocks for inode management are written to the disk as logs. A newly created file first exists only in the memory page cache. Because the file must be accessible before being written to the disk, the B-tree structure exists even in memory. The B-tree intermediate node in memory is on the memory page cache, and the data structures are the same as those of the disk blocks. The pointer of the B-tree node stored in memory holds the disk block number or the memory address of the page cache that reads the block. When looking up a block in the B-tree, if the pointer of the B-tree node is a disk block number, the disk block is read into a newly allocated page cache before the pointer is rewritten. The original disk block number remains in the buffer-head structure on the page cache.

The partial segment consists of three parts: (1) The segment summary keeps the block usage information of the partial segment. The main contents are checksums of the data area, the segment summary, the length of the partial segment, and the partial segment creation time. (2) The data area contains file data blocks, file data B-tree node blocks, inode blocks, and inode block B-tree node blocks, in order. (3) A checkpoint is placed at the tail of the partial segment. The checkpoint includes a checksum of the checkpoint itself. Checkpoint accuracy means that the partial segment was successfully written to the disk. The most important information in the checkpoint is the root block number of the inode block B-tree.
The block number is written out last, and the whole file system state is updated.

The data write process, started by the sync system call and the NILFS kernel thread, advances in the following order: (1) Lock the directory operations, (2) The dirty pages of the file data are gathered from its radix-tree, (3) The dirty B-tree intermediate node pages of both file block management and inode management are gathered, (4) The dirty inode block pages are gathered, (5) The B-tree intermediate node pages which will become dirty because their registered block addresses are renewed are gathered, (6) New disk block addresses are assigned to those blocks in order of file data blocks, B-tree node blocks for file data, inode blocks, B-tree node blocks for inodes, (7) Rewrite the disk block addresses to new ones in the radix-tree and B-tree nodes, (8) Call the block device input/output routine to write out the blocks, (9) Unlock the directory operations. The NILFS snapshot is a whole consistent file system at some time instant.

In LFS, all blocks remain as is (until they are collected by garbage collection); therefore, no new information is needed to make a snapshot. In NILFS, the B-tree structure manages the file and inode blocks, and B-tree nodes are written out as a log too. So, the root block number of the inode management B-tree is the snapshot of the NILFS file system. The root block number is stored in the checkpoint position of a partial segment. The NILFS checkpoint is the snapshot of the file system itself. Actually, the user can specify the disk block address of the NILFS checkpoint to Linux using the "mount" command, and the captured file system is mounted as a read-only file system. However, if the user were to keep all checkpoints as snapshots, there would be no disk space for garbage collection. The user can select any checkpoint as a snapshot, and the garbage collector collects other checkpoint blocks.
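Because the root block number of the inode B-tree recorded in each checkpoint fully determines a file system state, a checkpoint can be represented conceptually as a tiny structure. The layout below is an illustrative sketch, not the on-disk NILFS format.

#include <stdint.h>

/* Illustrative checkpoint written at the tail of every partial segment. */
struct checkpoint_sketch {
    uint64_t checkpoint_no;    /* monotonically increasing sequence number  */
    uint64_t create_time;      /* partial segment creation time             */
    uint64_t ifile_btree_root; /* root block of the inode-block B-tree:     */
                               /* mounting read-only at this root yields    */
                               /* the snapshot of the whole file system     */
    uint32_t checksum;         /* checkpoint validity check                 */
};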
F2FS (Flash-Friendly File System) [43] employs three configurable units: segment, section and zone. It allocates storage blocks in the unit of segments from a number of individual zones. It performs "cleaning" in the unit of section. These units are introduced to align with the underlying FTL's operational units to avoid unnecessary (yet costly) data copying.

F2FS introduced a cost-effective index structure in the form of a node address table with the goal to attack the "wandering tree" problem. Also, multi-head logging was suggested. F2FS uses an effective hot/cold data separation scheme applied during logging time (i.e., block allocation time). It runs multiple active log segments concurrently and appends data and metadata to separate log segments based on their anticipated update frequency. Since flash storage devices exploit media parallelism, multiple active segments can run simultaneously without frequent management operations. F2FS builds basically on append-only logging to turn random writes into sequential ones. At high storage utilization, however, it changes the logging strategy to threaded logging to avoid long write latency. In essence, threaded logging writes new data to free space in a dirty segment without cleaning it in the foreground. F2FS optimizes small synchronous writes to reduce the latency of fsync requests, by minimizing required metadata writes and recovering synchronized data with an efficient roll-forward mechanism.

F2FS divides the whole volume into fixed-size segments. The segment is a basic unit of management in F2FS and is used to determine the initial file system metadata layout. A section is comprised of consecutive segments, and a zone consists of a series of sections. F2FS splits the entire volume into six areas: (1) Superblock (SB), (2) Checkpoint (CP), (3) Segment Information Table (SIT), (4) Node Address Table (NAT), (5) Segment Summary Area (SSA), (6) Main Area.

The Superblock (SB) has the basic partition information and default parameters of F2FS, which are given at the format time and are not changeable. The Checkpoint (CP) keeps the file system status, bitmaps for valid NAT/SIT sets, orphan inode lists and summary entries of currently active segments. The Segment Information Table (SIT) contains per-segment information such as the number of valid blocks and the bitmap for the validity of all blocks in the "Main" area. The SIT information is retrieved to select victim segments and identify valid blocks in them during the cleaning process. The Node Address Table (NAT) is a block address table to locate all the "node blocks" stored in the Main area. The Segment Summary Area (SSA) stores summary entries representing the owner information of all blocks in the Main area, such as parent inode number and its node/data offsets. The SSA entries identify parent node blocks before migrating valid blocks during cleaning. The Main Area is filled with 4KB blocks. Each block is allocated and typed to be node or data. A node block contains an inode or indices of data blocks, while a data block contains either directory or user file data. Note that a section does not store data and node blocks simultaneously.

F2FS utilizes the "node" structure that extends the inode map to locate more indexing blocks. Each node block has a unique identification number, "node ID". By using node ID as an index, the NAT serves the physical locations of all node blocks. A node block represents one of three types: inode, direct and indirect node. An inode block contains a file's metadata, such as file name, inode number, file size, atime and dtime.
A direct node block contains block addresses of data, and an indirect node block has node IDs locating other node blocks. In F2FS, a 4KB directory entry ("dentry") block is composed of a bitmap and two arrays of slots and names in pairs. The bitmap tells whether each slot is valid or not. A slot carries a hash value, inode number, length of a file name and file type (e.g., normal file, directory and symbolic link). A directory file constructs multi-level hash tables to manage a large number of dentries efficiently.

F2FS maintains six major log areas to maximize the effect of hot and cold data separation. F2FS statically defines three levels of temperature (hot, warm and cold) for node and data blocks. Direct node blocks are considered hotter than indirect node blocks since they are updated much more frequently. Indirect node blocks contain node IDs and are written only when a dedicated node block is added or removed. Direct node blocks and data blocks for directories are considered hot, since they have obviously different write patterns compared to blocks for regular files. Data blocks satisfying one of the following three conditions are considered cold: (1) Data blocks moved by cleaning, (2) Data blocks labeled "cold" by the user, (3) Multimedia file data.

F2FS performs cleaning in two distinct manners, foreground and background. Foreground cleaning is triggered only when there are not enough free sections, while a kernel thread wakes up periodically to conduct cleaning in the background. A cleaning process takes three steps: (1) Victim selection, (2) Valid block identification and migration, (3) Post-cleaning process.

The cleaning process starts by identifying a victim section among non-empty sections. There are two well-known policies for victim selection during LFS cleaning: greedy and cost-benefit. The greedy policy selects a section with the smallest number of valid blocks. Intuitively, this policy controls the overheads of migrating valid blocks. F2FS adopts the greedy policy for its foreground cleaning to minimize the latency visible to applications. Moreover, F2FS reserves a small unused capacity (5% of the storage space by default) so that the cleaning process has room for adequate operation at high storage utilization levels. On the other hand, the cost-benefit policy is practiced in the background cleaning process of F2FS. This policy selects a victim section not only based on its utilization but also its "age". F2FS infers the age of a section by averaging the ages of the segments in the section, which, in turn, can be obtained from their last modification time recorded in SIT. With the cost-benefit policy, F2FS gets another chance to separate hot and cold data.

After selecting a victim section, F2FS must identify valid blocks in the section quickly. To this end, F2FS maintains a validity bitmap per segment in SIT. Once having identified all valid blocks by scanning the bitmaps, F2FS retrieves parent node blocks containing their indices from the SSA information. If the blocks are valid, F2FS migrates them to other free logs. For background cleaning, F2FS does not issue actual I/Os to migrate valid blocks. Instead, F2FS loads the blocks into the page cache and marks them as dirty. Then, F2FS just leaves them in the page cache for the kernel worker thread to flush them to the storage later.
This lazy migration not only alleviates the performance impact on foreground I/O activities, but also allows small writes to be combined. Background cleaning does not kick in when normal I/O or foreground cleaning is in progress.

After all valid blocks are migrated, a victim section is registered as a candidate to become a new free section (called a "pre-free" section in F2FS). After a checkpoint is made, the section finally becomes a free section, to be reallocated. This is done because if a pre-free section were reused before checkpointing, the file system might lose the data referenced by a previous checkpoint when an unexpected power outage occurs.
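The post-cleaning rule above (a cleaned section may be reused only after a checkpoint) can be sketched as a small state machine; the names below are illustrative, not F2FS kernel identifiers.

enum section_state {
    SEC_FREE,       /* may be allocated for new logs                        */
    SEC_IN_USE,     /* holds valid node/data blocks                         */
    SEC_PRE_FREE,   /* all valid blocks migrated, waiting for a checkpoint  */
};

/* Cleaning migrated all valid blocks out of the victim section. */
static enum section_state on_cleaning_done(enum section_state s)
{
    return (s == SEC_IN_USE) ? SEC_PRE_FREE : s;
}

/*
 * Only a checkpoint turns pre-free sections into free ones.  Reusing the
 * section earlier could destroy blocks still referenced by the previous
 * checkpoint, breaking recovery after a sudden power loss.
 */
static enum section_state on_checkpoint(enum section_state s)
{
    return (s == SEC_PRE_FREE) ? SEC_FREE : s;
}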
SFS (SSD File System) [62] is based on three design principles. A file system should exploit the file block semantics directly. It needs to take a log-structured approach based on the observation that the random write bandwidth is much slower than the sequential one. The existing lazy data grouping in LFS during segment cleaning fails to fully utilize the skewness in write patterns, and the authors argue that an eager data grouping is necessary to achieve sharper bimodality in segment utilization. SFS takes a log-structured approach that turns random writes at the file level into sequential writes at the LBA level. Moreover, in order to utilize nearly 100% of the raw SSD bandwidth, the segment size is set to a multiple of the clustered block size. The result is that the performance of SFS will be limited by the maximum sequential write performance regardless of random write performance.

It shows that, if hot data and cold data are grouped into separate segments, the segment utilization distribution becomes bimodal: most of the segments are almost either full or empty of live blocks. Therefore, because the segment cleaner can almost always work with nearly empty segments, the cleaning overhead will be drastically reduced. To form a bimodal distribution, LFS uses a cost-benefit policy for segment cleaning that prefers cold segments to hot segments. However, previous studies show that even the cost-benefit policy performs poorly under a large segment size (e.g., 8 MB), because the increased segment size makes it harder to find nearly empty segments. With SSD, the cost-benefit policy encounters a dilemma: a small segment size enables LFS to form a bimodal distribution, but small random writes caused by the small segment severely degrade the write performance of SSD. Instead of separating the data lazily on segment cleaning after writing them regardless of their hotness, SFS classifies data proactively on writing using file block level statistics, as well as on segment cleaning. In such eager data grouping, since segments are already composed of homogeneous data with similar update likelihood, the segment cleaning overhead will be significantly reduced. In particular, the I/O skewness commonly found in many real workloads will make this more attractive.

SFS has four core operations: segment writing, segment cleaning, reading, and crash recovery. The first step of segment writing in SFS is to determine the hotness criteria for block grouping. This is, in turn, determined by segment quantization that quantizes a range of hotness values into a single hotness value for a group. It is assumed that there are four segment groups: hot, warm, cold, and read-only groups. The second step is to calculate the block hotness for each dirty block and assign it to the nearest quantized group by comparing the block hotness and the group hotness. At this point, those blocks with similar hotness levels should belong to the same group. The third step is to fill a segment with blocks belonging to the same group. If the number of blocks in a group is not enough to completely fill a segment, the segment writing of the group is deferred until the group grows to completely fill a segment. This eager grouping of file blocks according to hotness serves to colocate blocks with similar update likelihoods in the same segment.

Segment cleaning in SFS consists of three steps: select victim segments, read the live blocks from the victim segments into the page cache and mark the live blocks as dirty, and trigger the writing process.
The writing process treats the live blocks from victim segments the same as normal blocks; each live block is classified into a specific quantized group according to its hotness. After all the live blocks are read into the page cache, the victim segments are marked as free so that they can be reused for writing. For better victim segment selection, a cost-hotness policy is introduced, which takes into account both the number of live blocks in the segment (i.e., cost) and the segment hotness.
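As a rough illustration of the grouping step just described, the sketch below (in C) assigns each dirty block to the group whose quantized hotness is closest to the block's own hotness. The group hotness values and the nearest-group rule are invented for the example; this is a minimal sketch of the idea, not the grouping code of SFS.

#include <math.h>

enum sfs_group { GROUP_HOT, GROUP_WARM, GROUP_COLD, GROUP_READ_ONLY, GROUP_NR };

/* Quantized hotness of each group, e.g. produced by segment quantization
 * over the observed hotness distribution (values invented for illustration). */
static const double group_hotness[GROUP_NR] = { 100.0, 10.0, 1.0, 0.0 };

/* Assign a dirty block to the group with the nearest quantized hotness. */
static enum sfs_group classify_block(double block_hotness)
{
    enum sfs_group best = GROUP_HOT;
    double best_dist = fabs(block_hotness - group_hotness[GROUP_HOT]);

    for (int g = 1; g < GROUP_NR; g++) {
        double dist = fabs(block_hotness - group_hotness[g]);
        if (dist < best_dist) {
            best_dist = dist;
            best = (enum sfs_group)g;
        }
    }
    return best;
}

Blocks mapped to the same group are then packed into the same segment, which is what produces the bimodal segment utilization described above.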
Wu, et al. [132] proposed the greedy algorithm (GR) for garbage collection. The greedy algorithm selects the block with the fewest valid pages as the victim block for garbage collection. This approach reduces the overhead required for copying valid pages within the victim block to free space during garbage collection. However, the GR algorithm does not take into account wear leveling in flash-based consumer electronic devices. It has been shown that the GR algorithm performs well in terms of wear leveling for random memory accesses but does not perform well for memory accesses with high spatial locality of reference. Kawaguchi, et al. [133] proposed the cost-benefit (CB) algorithm for flash memory. CB calculates a cost-benefit value for each block and selects the block with the highest value as a victim. The cost-benefit value for a block is calculated as (age * (1 - u)) / 2u, where age is the elapsed time since the last modification of a page within the block and u is the fraction of valid pages within the block. Because the CB algorithm takes into account both the age of invalid pages and the fraction of valid pages in a block, it can provide improved wear leveling in flash-based consumer electronic devices. However, because the CB algorithm does not take into account the erase count of each block, its wear-leveling performance is not sufficient. Chiang, et al. [134] proposed the cost-age-time (CAT) algorithm, which extends the CB algorithm by considering the erase count of each block when selecting a victim block. The CAT algorithm attempts to maintain a balance between reducing the garbage collection overhead and improving wear leveling in flash memory.
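As a rough illustration of the cost-benefit policy above, the following sketch (in C, with hypothetical structures and field names) scores each candidate block by (age * (1 - u)) / 2u and picks the highest-scoring block as the victim. It is a minimal sketch of the formula, not the implementation evaluated in [133].

#include <stddef.h>
#include <stdint.h>

struct flash_block {
    uint32_t valid_pages;   /* number of still-valid pages in the block */
    uint32_t total_pages;   /* pages per erase block */
    uint64_t last_mod_time; /* timestamp of the most recent page write */
};

/* Cost-benefit score: (age * (1 - u)) / (2 * u), where u is utilization. */
static double cost_benefit(const struct flash_block *blk, uint64_t now)
{
    double u = (double)blk->valid_pages / (double)blk->total_pages;
    double age = (double)(now - blk->last_mod_time);

    if (u == 0.0)           /* a fully invalid block is the ideal victim */
        return 1e30;
    return (age * (1.0 - u)) / (2.0 * u);
}

/* Select the block with the highest cost-benefit value as the GC victim. */
static size_t select_victim(const struct flash_block *blocks, size_t count,
                            uint64_t now)
{
    size_t victim = 0;
    double best = -1.0;

    for (size_t i = 0; i < count; i++) {
        double score = cost_benefit(&blocks[i], now);
        if (score > best) {
            best = score;
            victim = i;
        }
    }
    return victim;
}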
Syu, et al. [87] developed a mechanism that takes advantage of FTL inactive time between requests and actively launches tasks to reclaim invalidated space. The underlying concept is that the timing impact can be reduced by distributing the one-time cost of space recycling. The active space recycling mechanism is developed on the basis of a task model composed of four tasks, namely the manager task, the collector task, the eraser task, and the read/write handler task. Jobs of these four tasks are released in a fixed time interval. The jobs of the first three tasks follow a precedence constraint: the manager task precedes the collector task, and the collector task precedes the eraser task. The read/write handler task is assigned the lowest priority among the four tasks. At the start of each time interval, a job of the manager task is released. It is responsible for determining the amount of invalidated space to be reclaimed. It gathers the necessary statistics and computes the number of pages holding invalidated data to be reclaimed. The mission of the collector task is to collect dirty blocks holding invalid pages and maintain a garbage queue holding these blocks. The collector task first selects "right" dirty blocks to form a candidate list. It then moves, from the candidate list, the block with the maximum invalidated space into the garbage queue. The valid data, if any, in the dirty block is copied to other free space before the block is put into the garbage queue. The action carried out by the eraser task is quite straightforward. It removes and erases the dirty blocks in the garbage queue maintained by the collector task. When a block is erased, it is marked as a free block and added to the free block list. Its associated erase count is also updated. The read/write handler task is responsible for carrying out the requests of read or write operations. It is active only when the other three jobs have finished and there are pending or unfinished read/write requests. When executing, the read/write handler task calculates how much time can be used for handling the read/write request before the end of the current time period. If the time left is not enough to complete reading or writing a page, the remaining read/write operations of this request are postponed. Yan, et al. [84] proposed an efficient file-aware garbage collection algorithm, called FaGC. The FaGC algorithm copies valid pages in a victim block to clusters in free blocks according to the calculated update frequency of the associated chunk. The FaGC algorithm adopts a hybrid wear-leveling policy to improve the lifespan of NAND flash memory. To avoid unnecessary garbage collection, a scattering factor is defined and calculated to determine when to trigger the garbage collection policy. A file consists of a series of chunks mapped to physical pages in the flash memory. Each file is assigned a unique number, called a File ID, and each chunk in a file is assigned a unique number, called a Chunk ID. In general, different files have different update frequencies, and different chunks in the same file have different update frequencies. In a file-aware system structure, an update frequency table (UFT) is built in random access memory (RAM) to record the update frequency of each chunk in a file. Each UFT entry contains four values: File ID, Chunk ID, Time, and Freq. Time records the most recent time that a chunk in a file has been updated, and Freq records the frequency with which a chunk has been updated. Simultaneously, the physical-to-logical translation table (PLT) maintains the File ID and Chunk ID for each block and physical page in the flash memory. When a chunk in a file is modified or updated, it is rewritten to another physical page in flash memory according to the out-of-place update scheme. At that time, the File ID and Chunk ID in the PLT are updated. Additionally, when a chunk in a file is modified or updated, the current time is recorded in the UFT and Freq is recalculated accordingly. In general, the block with the fewest valid pages is selected as the victim block to minimize the overhead of the copy operation, as in the GR algorithm. After the victim block is selected, the valid pages in the victim block are copied to free space, and the victim block is then erased and reclaimed. Before the valid pages are copied, the update frequency of the chunk associated with each valid page in the victim block is checked in the PLT and UFT. Therefore, wear leveling is improved by this clustering procedure based on the update frequency. Additionally, the decision of when to trigger garbage collection affects the performance of NAND flash-based consumer electronic devices.
Kuo, et al. [68] suggested an efficient on-line hot-data identification scheme based on a multi-hash-function framework, with the goal of managing the write amplification issue. The proposed framework adopts K independent hash functions to hash a given LBA into multiple entries of an M-entry hash table in order to track the write count of the LBA, where each entry is associated with a counter of C bits. Whenever a write is issued to the FTL, the corresponding LBA is hashed simultaneously by the K given hash functions. Each counter corresponding to the K hashed values (in the hash table) is incremented by one to reflect the fact that the LBA is written again. Whenever an LBA needs to be verified to see whether it is associated with hot data, the LBA is hashed simultaneously and in the same way by the K hash functions. The data addressed by the given LBA is considered hot if the H most significant bits of every counter of the K hashed values contain a non-zero bit.
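The multi-hash counting scheme above can be sketched roughly as follows (C). The parameters K = 2, M = 4096, C = 4, H = 2 are chosen here only for the example, and the hash functions are simple placeholders, not the ones used in [68].

#include <stdbool.h>
#include <stdint.h>

#define K_HASHES   2      /* number of independent hash functions */
#define M_ENTRIES  4096   /* hash table size */
#define C_BITS     4      /* counter width in bits */
#define H_BITS     2      /* most significant bits checked for hotness */

static uint8_t counters[M_ENTRIES];                     /* C-bit counters */
static const uint8_t COUNTER_MAX = (1u << C_BITS) - 1;

/* Two simple placeholder hash functions over the LBA. */
static uint32_t hash_fn(int k, uint64_t lba)
{
    uint64_t h = lba * (k == 0 ? 0x9E3779B97F4A7C15ULL : 0xC2B2AE3D27D4EB4FULL);
    return (uint32_t)(h % M_ENTRIES);
}

/* Called on every write issued to the FTL for this LBA. */
static void record_write(uint64_t lba)
{
    for (int k = 0; k < K_HASHES; k++) {
        uint32_t idx = hash_fn(k, lba);
        if (counters[idx] < COUNTER_MAX)
            counters[idx]++;    /* saturating increment of the C-bit counter */
    }
}

/* An LBA is classified as hot if the H most significant bits of every
 * corresponding counter contain at least one non-zero bit. */
static bool is_hot(uint64_t lba)
{
    uint8_t high_mask = (uint8_t)(((1u << H_BITS) - 1) << (C_BITS - H_BITS));

    for (int k = 0; k < K_HASHES; k++) {
        uint32_t idx = hash_fn(k, lba);
        if ((counters[idx] & high_mask) == 0)
            return false;
    }
    return true;
}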
Jagmohan, et al. [66] proposed a NAND Flash system which uses multi-write coding to reduce write amplification. Multi-write coding allows a NAND Flash page to be written more than once without requiring an intervening block erase. They presented a novel two-write coding technique based on enumerative coding, which achieves linear coding rates with low computational complexity. The proposed technique also seeks to minimize memory wear by reducing the number of programmed cells per page write. Chen, et al. [100] implemented CAFTL (a Content-Aware Flash Translation Layer). CAFTL eliminates duplicate writes and redundant data through a combination of both in-line and out-of-line deduplication. In-line deduplication refers to the case where CAFTL proactively examines the incoming data and cancels duplicate writes before committing a write request to flash. As a 'best-effort' solution, CAFTL does not guarantee that all duplicate writes are examined and removed immediately. Thus CAFTL also periodically scans the flash memory and coalesces redundant data out of line. When a write request is received at the SSD, (1) the incoming data is first temporarily maintained in the on-device buffer; (2) for each updated page in the buffer a hash value, also called a fingerprint, is later computed by a hash engine, which can be a dedicated processor or simply a part of the controller logic; (3) each fingerprint is looked up against a fingerprint store, which maintains the fingerprints of data already stored in the flash memory; (4) if a match is found, which means that a residing data unit holds the same content, the mapping tables, which translate the host-viewable logical addresses to the physical flash addresses, are updated by mapping the logical address to the physical location of the residing data, and correspondingly the write to flash is canceled; (5) if no match is found, the write is performed to the flash memory as a regular write.
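The in-line deduplication path described above can be approximated by the following sketch (C). The fingerprint store is modeled as a trivial linear array, the digest is a stand-in hash, and flash_program_page is a hypothetical helper; these are assumptions for illustration, not CAFTL's actual data structures.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE     4096
#define STORE_SIZE    1024

struct fingerprint_entry {
    uint64_t fingerprint;   /* digest of a page already stored on flash */
    uint32_t ppn;           /* physical page number holding that content */
};

static struct fingerprint_entry store[STORE_SIZE];
static uint32_t store_count;
static uint32_t l2p[1 << 20];   /* logical-to-physical mapping table */

/* Placeholder digest; a real design would use a cryptographic hash. */
static uint64_t digest(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 1099511628211ULL;
    return h;
}

extern uint32_t flash_program_page(const uint8_t *page); /* returns new PPN */

/* Handle one buffered page write: cancel it if the content already exists. */
static void caftl_write(uint32_t lpn, const uint8_t *page)
{
    uint64_t fp = digest(page);

    for (uint32_t i = 0; i < store_count; i++) {
        if (store[i].fingerprint == fp) {
            l2p[lpn] = store[i].ppn;  /* remap to existing data, skip the write */
            return;
        }
    }

    uint32_t ppn = flash_program_page(page);   /* regular write path */
    l2p[lpn] = ppn;
    if (store_count < STORE_SIZE) {
        store[store_count].fingerprint = fp;
        store[store_count].ppn = ppn;
        store_count++;
    }
}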
Wang, et al. [126] proposed a real-time, per-process, per-stream pattern detection scheme that identifies various write patterns. These patterns are then used to guide the write buffer to improve the write performance of SSDs that employ a log-structured block-based FTL. They classify fine-grained write patterns into the following three categories: (a) sequential, (b) clustered (page or block), and (c) random. Each of the above patterns is defined as follows: 1) a sequential pattern is defined as a series of requests with consecutive logical addresses in ascending order; 2) a page clustered pattern is defined as a process repeatedly updating a specific page; 3) a block clustered pattern is defined as a process repeatedly updating a specific block; 4) a pattern which falls into none of the above is classified as random. In order to filter out transition "noise", each pattern is allocated a bit map. The number of bits n determines how many times in a row a pattern has to be detected before the algorithm decides that the I/O stream has entered a new pattern. For each pattern, they devised an adaptive dirty flush policy. Each dirty page in the buffer cache is associated with a pattern type. The dirty pages with the same pattern are linked together in a linked list, so that dirty pages in the buffer cache are virtually partitioned according to their associated patterns. When a page is written by a process, it is moved to the head of the pattern list. As a result, in each list, the head is the most recently written page while the tail is the least recently written one. When the system needs to flush dirty pages, it scans the lists according to the following priorities. The sequential pattern is given the highest priority to be flushed, since its pages are most likely written only once and there is therefore no point in keeping them in the cache. The random pattern is given the next priority. The page clustered and block clustered dirty pages have the lowest priority, since they may be overwritten in the future and it is preferable to keep them in the cache. The suggested schemes reduce SSD erase cycles, which directly translates into a major improvement of the lifespan of SSDs. Huang, et al. [55] presented a Content and Semantics Aware File System (CSA-FS) which is able to reduce write traffic to SSDs. It employs deduplication and delta-encoding techniques for file system data blocks and semantic blocks, respectively. It is motivated by two important observations: (1) there exists a huge amount of content redundancy within primary storage systems, and (2) semantic blocks are visited much more frequently than data blocks, with each update bringing very minimal changes. By separately deduplicating redundant data blocks and delta-encoding similar semantic blocks, CSA-FS can significantly reduce the total write traffic to SSDs and correspondingly greatly improve their lifetime. CSA-FS applies deduplication to data blocks and delta-encoding to semantic blocks. Semantic blocks are extracted from the file system and exported for lookups. Semantic blocks mainly include super-blocks, group descriptors, the data block bitmap, the inode bitmap and inode tables. For every block write request, CSA-FS checks whether it accesses a semantic block or a data block by consulting the exported semantic blocks. For a data block write, it computes the block's MD5 digest and looks up the hash value in a hash table to determine whether it is a duplicate block write. If it is a duplicate write, CSA-FS simply returns the block number in the found hash entry, and then uses that block number to update the block pointer table of the file's inode. If it is a new write request, it first goes through the normal procedure, i.e., allocating a free block, updating the corresponding bitmap block and performing the necessary accounting, and finally inserts a new entry containing the block number, its MD5 value and some housekeeping information into the hash table. For a metadata block write, CSA-FS calculates the content delta relative to its original content, and then appends the delta to a delta-logging region.
Fu, et al. [98] studied cloud backup services in the personal computing environment. They concluded that the majority of storage space is occupied by a small number of compressed files with low sub-file redundancy. About 61% of all files are smaller than 10 KB, accounting for only 1.2% of the total storage capacity, and only 1.4% of files are larger than 1 MB but occupy 75% of the storage capacity. This suggests that tiny files can be ignored during the deduplication process so as to improve the deduplication efficiency, since it is the large files in the tiny minority that dominate the deduplication efficiency. The static chunking (SC) method can outperform content defined chunking (CDC) in deduplication effectiveness for static application data and virtual machine images. The computational overhead for deduplication is dominated by data capacity. The amount of data shared among different types of applications is negligible. They suggested AA-Dedupe (An Application-Aware Source Deduplication Approach) where tiny files are first filtered out by a file size filter for efficiency reasons, and backup data streams are broken into chunks by an intelligent chunker using an application-aware chunking strategy. Data chunks from the same type of files are then deduplicated in the application-aware deduplicator by looking up their hash values in an application-aware index that is stored on the local disk. If a match is found, the metadata for the file containing that chunk is updated to point to the location of the existing chunk. If there is no match, the new chunk is stored based on the container management in the cloud, the metadata for the associated file is updated to point to it, and a new entry is added into the application-aware index to index the new chunk. Meister, et al. [102] showed that most files are very small, but the minority of very large files occupies most of the storage capacity: 90% of the files occupy less than 10% of the storage space, in some cases even less than 1%. In most data sets, between 15% and 30% of the data is stored redundantly and can be removed by deduplication techniques. Often small files have high deduplication rates, but they contribute little to the overall savings. Middle-sized files usually have a high deduplication ratio. Full file duplication typically reduces the data capacity by 5% - 10%. In most data sets, most of the deduplication potential is lost if only full file elimination is used. The deduplication ratio decreases slowly with increasing chunk sizes. Fixed size chunking detects around 6-8% less redundancy than content-defined chunking. Between 3.1% and 9.4% of the data are zeros. Of all chunks, 90% were only referenced once. This means that these chunks are unique and do not contribute to deduplication. The most referenced chunk is the zero chunk. The mean number of references is 1.2, the median is 1 reference. The most referenced 5% of all chunks account for 35% and the first 24% account for 50% of all references. Of all multi-referenced chunks, about 72% were only referenced twice. A small fraction of the chunks causes most of the deduplication. The evaluation shows that typically 20% to 30% of online data can be removed by applying data deduplication techniques, peaking at up to 70% for some data sets. This reduction can only be achieved by a subfile deduplication approach, while approaches based on whole-file comparisons only lead to small capacity savings.
Xia, et al. [101] suggested DARE (a Deduplication-Aware Resemblance detection and Elimination scheme) for compressing backup datasets. DARE is designed to improve resemblance detection for additional data reduction in deduplication-based backup/archiving storage systems. For an incoming backup stream, DARE goes through the following four key steps: (1) Duplicate Detection - the data stream is first chunked by the CDC approach, fingerprinted by SHA-1, duplicate-detected, and then grouped into containers of sequential chunks to preserve the backup-stream locality. (2) Resemblance Detection - the DupAdj resemblance detection module in DARE first detects duplicate-adjacent chunks in the containers formed in Step 1. After that, DARE's improved super-feature module further detects similar chunks among the remaining non-duplicate and non-similar chunks that may have been missed by the DupAdj detection module when the duplicate-adjacency information is lacking or weak. (3) Delta Compression - for each of the resembling chunks detected in Step 2, DARE reads its base-chunk, then delta-encodes their differences with the Xdelta algorithm. In order to reduce the number of disk reads, an LRU and locality-preserved cache is implemented to prefetch the base-chunks in the form of locality-preserved containers. (4) Storage Management - the data NOT reduced, i.e., non-similar or delta chunks, is stored as containers on the disk. The file mapping information among the duplicate chunks, resembling chunks, and non-similar chunks is also recorded in the file recipes to facilitate future data restore operations in DARE. They concluded that supplementing deduplication with delta compression can effectively enlarge the logical space of the restoration cache, but the data fragmentation in data reduction systems remains a serious problem. Kim, et al. [110] designed a deduplication layer on top of the FTL. It consists of three components, namely, the fingerprint generator, the fingerprint manager, and the mapping manager. The fingerprint generator creates a hash value, called a fingerprint, which summarizes the content of written data. The fingerprint manager manipulates generated fingerprints and conducts fingerprint lookups for detecting duplicates. Finally, the mapping manager deals with the physical locations of duplicate data. They proposed two acceleration techniques: sampling-based filtering and recency-based fingerprint management. The former selectively applies deduplication based upon sampling, and the latter effectively exploits limited controller memory while maximizing the deduplication ratio. Experimental results have shown that they achieve a duplication rate ranging from 4% to 51%, with an average of 17%, for the nine considered workloads. The response time of a write request can be improved by up to 48%, with an average of 15%, while the lifespan of SSDs is expected to increase up to 4.1 times, with an average of 2.4 times. Ha, et al. [103] proposed a new deduplication scheme called block-level content-aware chunking to extend the lifetime of SSDs. The proposed scheme divides the data within a fixed-size block into a set of variable-sized chunks based on its contents and avoids storing duplicate copies of the same chunk. Evaluations on a real SSD platform showed that the proposed scheme improves the average deduplication rate by 77% compared to the previous block-level fixed-size chunking scheme.
Additional optimizations reduce the average memory consumption by 39% with a 1.4% gain in the average deduplication rate. Li [95] presented Flash Saver, which is coupled with the ext2/3 file system and aims to significantly reduce the write traffic to SSDs. Flash Saver deploys deduplication and delta-encoding to reduce the write traffic. Specifically, Flash Saver applies deduplication to file system data blocks and delta-encoding to file system metadata blocks, based on two important observations: (1) there exist large amounts of duplicate data blocks, and (2) metadata blocks are accessed/modified much more frequently than data blocks, but with very minimal changes for each update. Flash Saver semantically classifies the fixed-size (e.g. 4 KB) content blocks of the file system into data blocks and meta blocks. For data blocks, it computes the SHA-1 hash value of the block and uses the hash value to examine whether the same block has already been stored, in order to avoid storing multiple copies of a block having the same content. For meta blocks, it logs the incremental changes relative to the corresponding meta block to save I/Os and storage space. Under most circumstances, file system meta blocks are frequently modified with minor changes. The experimental results have shown that Flash Saver can save up to 63% of the total write traffic, which implies a reasonably prolonged lifetime, larger effective flash space and higher reliability than that of the original counterpart within the allowable lifespan. Rozier, et al. [112] have modeled the fault tolerance consequences of deduplication. They concluded that deduplication has a net negative impact on reliability, both due to its impact on unrecoverable data loss and the impact of silent data corruptions, though the former is easily countered by using higher-level RAID configurations. In both cases, system reliability can be increased by maintaining additional copies of deduplicated instances, typically by keeping multiple copies for a very small percentage of the deduplicated instances in a given category. Nam, et al. [107] introduced two reliability parameters for deduplication storage: chunk reliability and chunk loss severity. To provide the demanded reliability for an incoming data stream, most deduplication storage systems first carry out the deduplication process by eliminating duplicates from the data stream and then apply erasure coding to the remaining (unique) chunks. A unique chunk may be shared (i.e., duplicated) at many places of the data stream and shared by other data streams. That is why deduplication can reduce the required storage capacity. However, this occasionally becomes problematic for assuring the reliability levels required by different data streams. The chunk reliability means each chunk's tolerance level in the face of any failures. The chunk loss severity represents the expected damage level in the event of a chunk loss, formally defined as the multiplication of the actual damage by the probability of a chunk loss. They proposed a reliability-aware deduplication solution that not only assures all demanded chunk reliability levels by making already existing chunks sharable only if their reliability is high enough, but also mitigates the chunk loss severity by adaptively reducing the probability of having a chunk loss.
A distinction is made between lossless and lossy algorithms, i.e. those algorithms that preserve the original data exactly, and those that discard parts of the data, reducing the quality. The latter type is typically domain specific, i.e. knowledge about what type of data is being compressed is needed to determine what to discard. For general compression, three of the most often used algorithms are: (1) run length coding - a simple and fast scheme that replaces repeating patterns with the pattern and the number of repetitions; (2) Huffman coding [135] - Huffman coding analyzes the frequency of different fixed length symbols in a data set and assigns to the symbols codes whose lengths correspond to the frequency of the respective symbol in the data set, i.e. frequent symbols get short codes, infrequent symbols get long codes; (3) Lempel-Ziv [136, 137] - these algorithms basically replace strings (variable length symbols) found in a dictionary with codes representing those strings. The efficiency of these algorithms is determined by the size of the dictionary and how much effort is spent searching in the dictionaries. Gzip [138] and others use general compression algorithms based on the two Lempel-Ziv algorithms LZ77 [136] and LZ78 [137] (or Lempel-Ziv-Welch [139], which is based on LZ78). These algorithms are sometimes augmented with Huffman coding [135]. Huffman coding on its own is typically faster than Lempel-Ziv based algorithms, although it typically yields less compression. Run length encoding is extremely fast, but the gain is often small compared to Lempel-Ziv or Huffman coding.
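As a concrete reference point for the simplest of the schemes listed above, the sketch below (C) implements a byte-oriented run length encoder that emits (count, byte) pairs. It is only an illustration of the general idea, not the encoder used by any of the cited systems.

#include <stddef.h>
#include <stdint.h>

/* Encode 'len' input bytes as (count, byte) pairs; returns the output length.
 * 'out' must be able to hold up to 2 * len bytes (worst case: no runs). */
static size_t rle_encode(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t produced = 0;

    for (size_t i = 0; i < len; ) {
        uint8_t value = in[i];
        size_t run = 1;

        /* Runs longer than 255 are split because the count is one byte. */
        while (i + run < len && in[i + run] == value && run < 255)
            run++;

        out[produced++] = (uint8_t)run;
        out[produced++] = value;
        i += run;
    }
    return produced;
}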
Mannan, et al. [140] introduced the concept of block Huffman coding. Their main idea is to break the input stream into blocks and compress each block separately. They choose the block size in such a way that one full block can be stored in main memory. They use a block size as moderate as 5 KiB, 10 KiB or 12 KiB. Finally, they observed that, to obtain better efficiency from block Huffman coding, a moderately sized block is better and the block size does not depend on file types. Chang, et al. [120] have made a performance evaluation of the block LZSS compression algorithm. They studied the block LZSS algorithm and investigated the relationship between the compression ratio of block LZSS and the value of index or length. They found that as the block size increases, the compression ratio becomes better. To obtain better efficiency from block LZSS, a moderately sized block greater than 32 KiB may be optimal, and the optimal block size does not depend on file types. They have also found that, in some cases, block Huffman coding has a better compression ratio than non-blocking Huffman coding, and that with increasing block size the compression ratio deteriorates; the optimal block size, at which the best compression ratio is obtained, is about 16 KiB. The reason for the better efficiency may be attributed to the principle of locality of data. Douglis, et al. [109] have found that the benefits of application-specific deltas vary depending on the mix of content types. For example, HTML and email messages display a great deal of redundancy across large datasets, resulting in deltas that are significantly smaller than simply compressing the data, while mail attachments are often dominated by non-textual data that do not lend themselves to the technique. A few large files can contribute much of the total savings if they are particularly amenable to delta-encoding. Application-specific techniques, such as delta-encoding an unzipped version of a zip or gzip file and then zipping the result, can significantly improve results for a particular file, but unless an entire dataset consists of such files, overall results improve by just a couple of percent. For web content, substantial overlap has been found among pages on a single site. For the five web datasets they considered, deltas reduced the total size of the dataset to 8-19% of the original data, compared to 29-36% using compression. For files and email, there was much more variability, and the overall benefits are not as dramatic, but they are significant: two of the largest datasets reduced the overall storage needs by 10-20% beyond compression. There was significant skew in at least one dataset, with a small fraction of files accounting for a large portion of the savings. Factors such as shingle size and the number of features compared do not dramatically affect these results. Given a particular number of maximal matching features, there is not a wide variation across base files in the size of the resulting deltas. A new file is often created by making a small number of changes to an older file; the new file may even have the same name as the old file. In these cases, the new file can often be delta-encoded from the old file with minimal overhead.
REDO (REfactored Design of I/O architecture) [59] refactors the two main components of the I/O subsystem: the file system and the storage device. REDO removes logical-to-physical mapping and garbage collection from the storage device. Instead, a refactored file system (RFS) directly manages the storage address space, including garbage collection. Unlike a host-based FTL, all those functions are conducted by RFS without any help from an intermediate host layer like a device driver. This eliminates the need for maintaining a large logical-to-physical page-map table, allowing garbage collection to be performed more efficiently at the file system level. A refactored storage device controller (RSD) becomes simpler because it runs only a small number of essential flash management functions. RSD maintains a much smaller logical-to-physical segment-map table to manage wear-leveling and bad blocks. Unlike FFS, REDO provides interoperability with block I/O subsystems, allowing SSD vendors to hide all the details of their devices and NAND characteristics. RFS is designed differently from the conventional LFS in two ways: it only issues out-of-place update commands, and it informs the storage device about which blocks have become erasable via TRIM commands. This frees the flash controller from the task of garbage collection altogether. RFS writes file data, inodes, and the pieces of the inode map in an out-of-place update manner. Unlike LFS, the incoming data are written to a physical segment corresponding to a logical segment, and their relative offsets in the logical segment are preserved in the physical one. For check-pointing, RFS reserves two fixed logical segments, called check-point segments. RFS then appends new check-points with different version numbers, so that overwrites never happen. RFS manages all the obsolete data at the level of the file system and triggers garbage collection when free space is exhausted. RFS chooses a logical segment (e.g., segment 2) as a victim and copies its live data to free space. The victim segment becomes free for future use. To indicate that the physical segment for the victim holds obsolete data, RFS delivers a TRIM command to RSD. Finally, RSD marks the physical segment out-of-date and erases its flash blocks. RSD maintains the segment-map table, and each entry of the table points to the physical blocks that are mapped to a logical segment. When write requests come, RSD calculates a logical segment number (e.g., 100) from the logical file-system page number (e.g., 1,600). Then, it looks up the remapping table to find the physical blocks mapped to the logical segment. If physical blocks are not mapped yet, RSD builds the physical segment by allocating new flash blocks. RSD picks free blocks with the smallest P/E cycles in the corresponding channel/way. Bad blocks are ignored. If there are flash blocks already mapped, RSD writes the data to the fixed location in the physical segment. Block erasure commands are not explicitly issued by RFS. However, RSD easily figures out which blocks are out-of-date and ready for erasure, because RFS informs RSD of physical segments containing only obsolete data via a TRIM command. RSD handles overwrites like a block-level FTL.
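The segment-map lookup in RSD reduces to simple arithmetic on the page number. The sketch below (C) reproduces the example from the text, where logical page 1,600 maps to logical segment 100; the value of 16 pages per segment is an assumption chosen only so that the numbers match.

#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_SEGMENT 16u  /* assumed so that page 1600 -> segment 100 */

/* Translate a logical file-system page number into a logical segment number
 * and an offset inside that segment (the offset is preserved physically). */
static void page_to_segment(uint32_t page_no,
                            uint32_t *segment_no, uint32_t *offset)
{
    *segment_no = page_no / PAGES_PER_SEGMENT;
    *offset     = page_no % PAGES_PER_SEGMENT;
}

int main(void)
{
    uint32_t seg, off;

    page_to_segment(1600, &seg, &off);
    printf("page 1600 -> segment %u, offset %u\n", seg, off); /* 100, 0 */
    return 0;
}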
LightNVM (the Linux Open-Channel SSD Subsystem) [127, 128, 129, 130, 131] proposes that SSD management trade-offs should be handled through Open-Channel SSDs, a new class of SSDs that give hosts control over their internals. It introduces a new Physical Page Address I/O interface that exposes SSD parallelism and storage media characteristics. LightNVM integrates into traditional storage stacks, while also enabling storage engines to take advantage of the new I/O interface. The Physical Page Address (PPA) I/O interface is based on a hierarchical address space. It defines administration commands to expose the device geometry and let the host take control of SSD management, and data commands to efficiently store and retrieve data. The interface is independent of the type of non-volatile media chip embedded in the open-channel SSD. Open-channel SSDs expose to the host a collection of channels, each containing a set of Parallel Units (PUs), also known as LUNs. A PU may cover one or more physical dies, and a die may only be a member of one PU. Each PU processes a single I/O request at a time. Regardless of the media, storage space is quantized on each PU. NAND flash chips are decomposed into blocks, pages (the minimum unit of transfer), and sectors (the minimum unit of ECC). Byte-addressable memories may be organized as a flat space of sectors. PPAs are organized as a decomposition hierarchy that reflects the SSD and media architecture. The PPA address space can be organized logically to act as a traditional logical block address (LBA) space, e.g., by arranging NAND flash using "block, page, plane, and sector". This enables the PPA address space to be exposed through traditional read/write/trim commands. In contrast to traditional block I/O, the I/Os must follow certain rules. Writes must be issued sequentially within a block. Trim may be issued for a whole block, so that the device interprets the command as an erase. LightNVM is organized in three layers, each providing a level of abstraction for open-channel SSDs: (1) NVMe Device Driver: a LightNVM-enabled NVMe device driver gives kernel modules access to open-channel SSDs through the PPA I/O interface. (2) LightNVM Subsystem: an instance of the subsystem is initialized on top of the PPA I/O-supported block device. The instance enables the kernel to expose the geometry of the device through both an internal nvm dev data structure and sysfs. (3) High-level I/O Interface: a target gives kernel-space modules or user-space applications access to open-channel SSDs through a high-level I/O interface, either a standard interface like the block I/O interface provided by pblk, or an application-specific interface provided by a custom target.
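One way to picture the hierarchical PPA address is as a packed bit field; the sketch below (C) encodes channel, PU, block, plane, page, and sector into a 64-bit address. The field widths are illustrative assumptions rather than the layout mandated by the interface, which instead lets the device report its own geometry.

#include <stdint.h>

/* Illustrative bit widths; a real device reports its geometry and the host
 * builds the address format accordingly. */
#define SECTOR_BITS  3
#define PLANE_BITS   2
#define PAGE_BITS    9
#define BLOCK_BITS   12
#define PU_BITS      4
#define CHANNEL_BITS 4

struct ppa_addr {
    uint16_t channel;
    uint16_t pu;      /* parallel unit (LUN) */
    uint16_t block;
    uint16_t page;
    uint16_t plane;
    uint16_t sector;
};

/* Pack the hierarchical address into a single 64-bit PPA. */
static uint64_t ppa_pack(struct ppa_addr a)
{
    uint64_t ppa = 0;
    int shift = 0;

    ppa |= (uint64_t)a.sector  << shift;  shift += SECTOR_BITS;
    ppa |= (uint64_t)a.plane   << shift;  shift += PLANE_BITS;
    ppa |= (uint64_t)a.page    << shift;  shift += PAGE_BITS;
    ppa |= (uint64_t)a.block   << shift;  shift += BLOCK_BITS;
    ppa |= (uint64_t)a.pu      << shift;  shift += PU_BITS;
    ppa |= (uint64_t)a.channel << shift;
    return ppa;
}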
The segment is the cornerstone concept of any Log-structured File System (LFS). This notion reflects the presence of Physical Erase Blocks (PEBs) on the storage device (SSD) side. Generally speaking, a segment can be imagined as a portion of the storage device that includes one or several erase blocks. The erase block is a very important item of any NAND-based storage device because it is the unit of the erase operation. Finally, any SSDFS file system volume can be imagined as a sequence of segments (Fig. 1).

Figure 1: Logical segment concept.
Logical segment. Generally speaking, a segment would usually represent real physical unit(s) (for example, one or several PEBs identified by LBAs on the storage device). However, SSDFS operates on logical segments. The logical segment is a unit that is always located at the same offset from the volume's beginning for the whole lifetime of the file system volume (Fig. 1). A segment is capable of including a variable number of PEBs. However, an SSDFS file system volume includes a fixed number of segments of identical size after the segment size has been defined (during file system volume creation). A very important goal of the segment concept is the capability to execute the erase operation for the whole segment. However, a segment is an aggregation of several PEBs in the case of the SSDFS file system. It means that such a segment construction provides the opportunity to execute the erase operation on the basis of particular PEB(s) inside the same segment. Finally, the aggregation of several PEBs inside one segment has several goals: (1) exploitation of operation parallelism for different PEBs inside the segment, (2) capability to execute a partial erase operation on a per-PEB basis instead of the whole segment, (3) capability to use a RAID-like or erasure coding scheme inside the segment, (4) capability to select a proper segment size for a particular workload.

Figure 2: Logical extent concept.
Logical extent. Usually, a segment is associated with a PEB (flash-oriented file system) or with an LBA (flash-friendly file system). However, in the case of the SSDFS file system the segment is a purely logical entity without a strict relation to a PEB or LBA. Generally speaking, the segment is simply some portion of the file system volume that is always located at the same offset from the volume's beginning (Fig. 1). The nature of the SSDFS file system's segment serves the goal of implementing a logical extent concept. This concept implies that a logical extent (segment ID + logical block + length) is always located in the same logical position inside the same segment (Fig. 2). Generally speaking, the goal of the logical extent is to exclude the necessity of updating the metadata about a logical block's position in the case of data migration from the initial PEB into another one (in the case of an update or GC operation, for example). The logical extent concept is a technique for resolving the write amplification issue for the case of an LFS file system. It means that any metadata structure keeping a logical extent doesn't need to update the logical extent value in the case of data migration between PEBs, because the logical extent remains the same as long as the data lives in the same segment (Fig. 2). The implementation of the logical extent concept requires the introduction of the Logical Erase Block (LEB) concept (Fig. 1 - 2). The LEB represents the logical analogue of a PEB on the storage device side. Generally speaking, a segment is a sequence of LEBs, and every LEB is equal in size to one or multiple PEBs. As a result, the LEB can be imagined as a container that could be associated with any PEB on the storage device side. Finally, every LEB always has the same index in its particular segment, and the problem of associating a particular LEB with a PEB is resolved by means of a special mapping table. The PEB mapping table has several important goals in the SSDFS file system. LEB and PEB represent different notions. A PEB represents a physical erase block on the storage device side. Generally speaking, a PEB is a really allocated portion of the storage device that is able to receive read and write I/O requests; it is also possible to apply the erase operation to this portion of the storage device. Conversely, an LEB represents a fixed logical portion of the file system's volume that is identified by the segment ID and the index in the segment. It is possible to say that an LEB is a pre-allocated portion of the file system's volume space. However, the real association of an LEB with a PEB takes place only if some of the LEB's logical blocks/extents were allocated and filled with data. Otherwise, an empty LEB doesn't need an association with any PEB. It means that if an LEB has no data then no PEB is linked with such an LEB. Generally speaking, an SSDFS file system volume represents a sequence of logical segments. Every logical segment contains a set of LEBs that provide the opportunity to allocate some number of logical blocks (Fig. 1 - 2). As a result, internal metadata structures of the SSDFS file system operate on logical extents that need to be updated only in the case of a segment ID change.
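A minimal sketch of these notions in C (structure and function names are illustrative only, not the on-disk layout of SSDFS): a logical extent addresses data purely by segment ID, logical block, and length, and a separate mapping table resolves the LEB index to whatever PEB currently holds the data, so the extent itself never changes when data migrates between PEBs.

#include <stdint.h>

#define PEB_UNASSIGNED UINT32_MAX

/* A logical extent: stable for as long as the data stays in the segment. */
struct logical_extent {
    uint64_t seg_id;        /* logical segment ID */
    uint32_t logical_blk;   /* first logical block inside the segment */
    uint32_t len;           /* number of logical blocks */
};

/* One entry of a (simplified) PEB mapping table. */
struct leb_mapping {
    uint64_t seg_id;
    uint16_t leb_index;     /* index of the LEB inside the segment */
    uint32_t peb_id;        /* PEB currently backing this LEB, if any */
};

/* Resolve an extent to the PEB backing its LEB; returns PEB_UNASSIGNED if
 * the LEB is still empty and therefore not associated with any PEB. */
static uint32_t resolve_peb(const struct logical_extent *ext,
                            const struct leb_mapping *table, int entries,
                            uint32_t blks_per_leb)
{
    uint16_t leb_index = (uint16_t)(ext->logical_blk / blks_per_leb);

    for (int i = 0; i < entries; i++) {
        if (table[i].seg_id == ext->seg_id &&
            table[i].leb_index == leb_index)
            return table[i].peb_id;
    }
    return PEB_UNASSIGNED;
}

When data migrates to a different PEB, only the mapping entry changes; an extent stored in metadata stays valid as long as the segment ID is unchanged.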
Figure 3: Segment parallelism.

Segment parallelism. One of the important goals of having several LEBs/PEBs in one segment is to exploit the parallelism of operations with PEBs that are located on different dies. Usually, any SSD contains a set of dies that are able to execute various operations independently and concurrently (for example, the erase operation). Moreover, a multi-channel SSD architecture is capable of delivering commands and data to different dies by means of independent channels. Generally speaking, LEBs of the same segment can be associated with PEBs located on different dies (Fig. 3). As a result, the operation parallelism in one segment is able to improve the file system performance as a whole. From another point of view, the opportunity to associate any LEB with any PEB creates flexibility in the policy of distributing the LEBs of the same segment over different areas of the storage device. The critical point is the capability to know the distribution of PEB ranges among the different dies. Generally speaking, such a distribution can be implemented on the basis of static or dynamic policies. A static policy means that the whole address space of the storage device is distributed among the different dies in a static manner. Otherwise, the storage device itself should be able to inform the host about such a distribution by means of a special protocol. For example, an Open-Channel SSD could provide such data on the host side.

LEB/PEB Architecture
Log concept. The log is the fundamental basic structure of the SSDFS file system (Fig. 4). Any user data or metadata is stored in a log on an SSDFS file system volume. Generally speaking, the log concept tries to achieve several very important goals: (1) replication of the critical metadata structures that characterize a file system volume, (2) creation of the opportunity to recover the log's payload (user data or metadata) on the basis of the log's metadata even if all other logs are corrupted, (3) localization of the block bitmap to the scope of one PEB, (4) implementation of the concept of an offsets translation table, (5) implementation of the concept of main, diff updates, and journal areas.

Figure 4: Log concept.

It is possible to imagine the log as a container that includes a header, a payload, and a footer (Fig. 4). The responsibility of the header (Fig. 5) is the identification of the file system type and of the log's beginning, because the header is capable of playing the role of the file system's superblock. Any PEB can contain one or several logs, and the end-user is able to define the size of the log. Moreover, various segment types are able to have different log sizes. However, the mount type, an unmount operation, the segment type, or the workload type could result in the necessity to commit the log without enough data in the log's payload. As a result, it is necessary to distinguish full and partial logs. A full log has to contain the number of logical blocks (or NAND flash pages) that was defined by the end-user for this particular segment type during file system volume creation. Every full log contains the segment header, payload, and footer (Fig. 4).

Figure 5: Log header.

If there is a necessity to commit a log without enough data in the payload, then a chain of partial logs needs to be created in a PEB (Fig. 4). The first partial log contains the segment header (Fig. 5), the partial log header (Fig. 6), and the payload. Every subsequent partial log includes only the partial log header and the payload. Finally, the last partial log ends with the log footer (Fig. 4, Fig. 7). Generally speaking, selecting an optimal value for the log's size may not be an easy task because different workloads may need specialized log sizes. It means that collecting statistics about partial log sizes could be the basis for searching for and gradually correcting the full log's size with the goal of achieving a local or global optimum value.

Figure 6: Partial log header.

Figure 7: Log footer.
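A rough sketch of the full-log layout described above (C; field names and types are illustrative, not the SSDFS on-disk format): a header with area descriptors, a payload, and a footer that duplicates critical metadata.

#include <stdint.h>

#define LOG_AREAS_MAX 8   /* illustrative upper bound on areas per log */

/* Describes one area (block bitmap, block tables, payload, ...) in a log. */
struct area_descriptor {
    uint32_t offset;   /* byte offset of the area from the log's beginning */
    uint32_t size;     /* area size in bytes */
    uint16_t type;     /* what the area contains */
};

struct log_header {
    uint32_t magic;              /* identifies the file system and log start */
    uint64_t timestamp;          /* checkpoint time of this log */
    uint16_t area_count;
    struct area_descriptor areas[LOG_AREAS_MAX];
    /* the static part of the superblock would also live here */
};

struct log_footer {
    uint32_t magic;
    uint16_t area_count;         /* replicated descriptors of critical areas */
    struct area_descriptor areas[LOG_AREAS_MAX];
    /* the mutable (dynamic) part of the superblock would also live here */
};

/* A full log inside a PEB is laid out as:
 *   [ log_header ][ payload areas ... ][ log_footer ]
 * A partial log replaces the header with a smaller partial-log header and
 * carries a footer only in the last partial log of the chain. */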
Superblock. Usually, any file system starts from a superblock that is located at one or several fixed position(s) on the file system volume. The responsibility of the superblock is to identify the file system type and to provide the description of the key file system metadata structures. SSDFS is an LFS file system that uses the Copy-On-Write (COW) policy for updating the state of any data or metadata. Generally speaking, it means that the superblock cannot be located at a fixed position on an SSDFS file system volume. Instead, every log of the SSDFS file system contains a copy of the superblock. It means that extracting the header (Fig. 5 - 6) and footer (Fig. 7) of any log provides the state of the file system's superblock that was actual for some timestamp. The SSDFS file system uses a special algorithm for quickly finding the last actual superblock state. The SSDFS file system splits the superblock state into: (1) static data and (2) dynamic data. The static part of the superblock describes the key parameters/features of a file system volume (for example, logical block size, erase block size, segment size, creation timestamp and so on) that are defined during volume creation. This part of the superblock is kept in the volume header of the log's header (Fig. 5). Conversely, the dynamic part of the superblock is represented by the mutable parameters/features of the file system volume (for example, segment numbers, number of free logical blocks, volume UUID, volume label and so on). The volume state in the log footer (Fig. 7) keeps the dynamic part of the superblock. And, finally, the partial log header (Fig. 6) represents a restricted combination of the static and dynamic parts of the superblock. Every metadata structure of the log (segment header, partial log header, log footer) starts with a magic signature (Fig. 5 - 7). Generally speaking, the responsibility of the magic signature is to identify the file system type and the type of metadata structure. Another very important field is the log's area descriptors (Fig. 5 - 7). These descriptors describe the position and the size of every existing area (user data or metadata) in a log.
Block bitmap. One of the very important log metadata structures is the block bitmap. Usually, a file system uses the block bitmap as a single metadata structure for the whole volume. The responsibility of the block bitmap is to track the state of logical blocks (free or used). As a result, the block bitmap is a frequently accessed and modified metadata structure. However, this compact and efficient metadata structure cannot be used in the traditional way for the case of an LFS file system because: (1) frequent updates of the block bitmap can increase the write amplification, (2) a logical block of an LFS file system needs more states (free, used, invalid), (3) the volume capacity could change because of the necessity to track the presence of bad erase blocks.

Figure 8: Block bitmap concept.

The SSDFS file system introduces a PEB-based block bitmap because of the proven efficiency and compactness of this metadata structure. First of all, the block bitmap (Fig. 8) tracks the following states of a logical block: (1) free, (2) used, (3) pre-allocated, (4) invalid. The free state means that a logical block is ready for allocation and a write operation. Conversely, the used state means that the logical block was allocated and the write operation has taken place for this logical block. The invalid state represents the case when an update or GC operation invalidates (makes not actual) the state of a logical block in one PEB and stores the actual state into another one. And, finally, the pre-allocated state can be used for representing the case when several fragments of different logical blocks are stored into one NAND flash page (for example, in the case of compression or delta-encoding).

Figure 9: Technique of using the block bitmap.

The block bitmap is a PEB-based metadata structure in the case of the SSDFS file system (Fig. 8 - 9). The goals of such an approach are: (1) the opportunity to access/modify the block bitmaps of different PEBs without the necessity to use any synchronization primitives, (2) the capability to lose bad erase blocks without the necessity to rebuild the block bitmap, (3) the capability to allocate logical blocks and to execute GC operations concurrently for different PEBs in the same segment or for the file system volume as a whole, (4) the opportunity to track the state of logical blocks only inside the log's payload. Every log keeps the actual state of the block bitmap for some timestamp. It means that previous PEB logs play the role of block bitmap checkpoints or snapshots (Fig. 9). As a result, it is possible to use the block bitmaps of previous logs in the case of corruption of a particular PEB's log. If some LEB is under active migration, then every log of the destination PEB has to store the block bitmaps of both the source PEB and the destination PEB (Fig. 9), because the migration could be executed in several phases.
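Since the bitmap tracks four states per logical block, two bits per block are sufficient; the sketch below (C) is a minimal illustration of such a packed per-PEB bitmap, not the exact encoding used by SSDFS.

#include <stdint.h>

enum blk_state {
    BLK_FREE          = 0,
    BLK_PRE_ALLOCATED = 1,
    BLK_USED          = 2,
    BLK_INVALID       = 3,
};

#define BLKS_PER_PEB 2048   /* illustrative number of logical blocks per PEB */

/* Two bits per logical block, packed four blocks per byte. */
struct peb_block_bmap {
    uint8_t bits[BLKS_PER_PEB / 4];
};

static void bmap_set(struct peb_block_bmap *bmap, uint32_t blk,
                     enum blk_state state)
{
    uint32_t byte = blk / 4;
    uint32_t shift = (blk % 4) * 2;

    bmap->bits[byte] &= (uint8_t)~(0x3u << shift);           /* clear old state */
    bmap->bits[byte] |= (uint8_t)(((unsigned)state & 0x3u) << shift);
}

static enum blk_state bmap_get(const struct peb_block_bmap *bmap, uint32_t blk)
{
    uint32_t byte = blk / 4;
    uint32_t shift = (blk % 4) * 2;

    return (enum blk_state)((bmap->bits[byte] >> shift) & 0x3u);
}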
Figure 10: Offsets translation table concept.

Offsets translation table. Any subsystem of the SSDFS file system driver that needs to store user data or metadata treats the segment as a sequence of logical blocks. Generally speaking, the goal of such an approach is to provide the opportunity to access the stored data by means of segment ID and logical block number without knowing which PEB keeps the actual state of data for the requested logical block. The SSDFS file system uses an offsets translation table (Fig. 10) to implement this approach. Generally speaking, the offsets translation table looks like a sequence of fragments, and every fragment is stored in a particular log (Fig. 10). The fragment keeps a portion of the table that associates a logical block number with a descriptor keeping the offset to the data in the log's payload. As a result, to retrieve the actual state of data it is necessary to find the latest record for the requested logical block number in the sequence of fragments of the offsets translation table. The found record identifies the PEB, log index, and byte offset to the actual state of data in the log's payload.

Figure 11: Offsets translation table architecture.

Generally speaking, the offsets translation table includes several metadata structures inside the log (Fig. 11): (1) logical block table, (2) block descriptor table, (3) payload area. The logical block table represents an array of descriptors where the logical block number can be used as the index. Every descriptor of the logical block table keeps: (1) the logical offset from the beginning of the file or metadata structure, (2) the PEB page number that identifies the index of the logical block in the block bitmap, (3) the log's area (a metadata area or the payload) that keeps the content of the logical block, (4) the byte offset from the area's beginning to the data portion. Finally, the descriptor of the logical block table can point directly into the payload (for example, in the case of a full plain logical block) or into the block descriptor table (Fig. 11). Every record of the block descriptor table keeps the inode ID and several descriptors of the logical block's states in the payload area(s). The goal of keeping several descriptors of a logical block's states in one record of the block descriptor table is to provide the capability to represent several sequential modifications of a logical block or the various delta-encoded fragments of the same logical block. Finally, the payload could keep the plain full logical block or a compressed (delta-encoded) fragment with the associated checkpoint and parent snapshot IDs. It needs to be pointed out that the checkpoint and parent snapshot IDs can be extracted from the segment header for the case of a plain full logical block. Generally speaking, the knowledge of the logical offset from the file's beginning, the inode ID, the checkpoint ID, and the parent snapshot ID provides the capability to recover the stored data from the log's payload on the basis of the log's metadata only.
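The lookup path can be sketched roughly as follows (C; all structures are simplified stand-ins for the ones just described): walking the fragments from the newest to the oldest, the first fragment that contains a record for the requested logical block yields the PEB, log index, and byte offset of the current data.

#include <stdbool.h>
#include <stdint.h>

/* Where the current state of a logical block lives. */
struct blk_location {
    uint32_t peb_id;
    uint16_t log_index;
    uint32_t byte_offset;   /* offset inside the log's payload */
};

/* One record of a translation-table fragment. */
struct blk2off_record {
    uint32_t logical_blk;
    struct blk_location loc;
};

/* One fragment, as stored in one particular log. */
struct blk2off_fragment {
    const struct blk2off_record *records;
    uint32_t count;
};

/* Fragments ordered oldest-to-newest; the newest record wins. */
static bool blk2off_lookup(const struct blk2off_fragment *frags, int nr_frags,
                           uint32_t logical_blk, struct blk_location *out)
{
    for (int f = nr_frags - 1; f >= 0; f--) {
        for (uint32_t r = 0; r < frags[f].count; r++) {
            if (frags[f].records[r].logical_blk == logical_blk) {
                *out = frags[f].records[r].loc;
                return true;
            }
        }
    }
    return false;   /* the block has never been written */
}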
Figure 12: Log structure.

Log structure. As a result, the log's structure (Fig. 12) begins with the header that identifies the file system type and the log's beginning by means of a magic signature. Moreover, the header contains an array of area descriptors that describes the existing areas in the log: (1) block bitmap, (2) logical block table, (3) block descriptor table, (4) payload, (5) footer. The footer is also able to include the array of area descriptors, with the goal of replicating the critical metadata structures of the log (for example, the block bitmap and the logical block table).
Main/Diff/Journal areas. It is very important to distinguish "cold" and "hot" data for the case of an LFS file system, because the identification of "cold" and "hot" types of data provides the opportunity to implement an efficient data management scheme, especially for the case of GC operations. As a result, the SSDFS file system introduces (Fig. 13) three types of payload areas: (1) main area, (2) diff updates area, (3) journal area. The main area is used for storing plain full blocks. Generally speaking, the write operation for any logical block takes place in the main area only once, and the following updates are stored into the diff updates or journal area. As a result, such a write/update policy creates an area (the main area) with "cold" data, because all following updates of any logical block in the main area will be stored into another area (the diff updates or journal area). If the diff updates or the journal area gathers a significant amount of updates for some logical block in the main area of some log, then this logical block could be stored in the main area of another log with all existing updates applied. The diff updates area (Fig. 13) plays the role of an area with "warm" data. The responsibility of the diff updates area is to gather into one NAND flash page the compressed blocks or delta-encoded fragments of the same file. It means that this area is able to store a significant amount of updates for logical blocks in the main area. However, the updates of the data in the diff updates area could be not as frequent as for the journal area. Finally, the diff updates area will be hotter than the main area, but it could be colder than the journal area.

Figure 13: Main, diff and journal payload areas.

The goal of the journal area (Fig. 13) is to represent the area with "hot" data. One NAND flash page of the journal area is used for compaction of several small files or updates of logical blocks of different files. Finally, it means that a NAND flash page with fragments of various files is able to receive more updates than the main or diff updates areas. From one point of view, the compaction of several fragments of different logical blocks into one NAND flash page creates the capability to move more data in one GC operation. From another viewpoint, the warm/hot areas introduce areas with a high frequency of update operations. Generally speaking, it is possible to expect that the high frequency of update operations (in the diff updates and journal areas) creates a natural migration of data between PEBs without the necessity to use extensive GC operations.
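A toy version of the placement policy implied above might look like this (C; the threshold and the decision rule are invented for the illustration, not taken from SSDFS): the first write of a block goes to the main area, larger updates of already-written blocks go to the diff updates area, and tiny fragments or small files are compacted into the journal area.

#include <stdbool.h>
#include <stdint.h>

enum log_area {
    AREA_MAIN,          /* "cold": plain full blocks, written once */
    AREA_DIFF_UPDATES,  /* "warm": compressed/delta fragments of one file */
    AREA_JOURNAL,       /* "hot": small files and fragments of many files */
};

#define JOURNAL_FRAGMENT_MAX 512   /* illustrative threshold in bytes */

/* Pick the payload area for a write; a hypothetical placement heuristic. */
static enum log_area choose_area(bool first_write_of_block,
                                 uint32_t fragment_bytes)
{
    if (first_write_of_block)
        return AREA_MAIN;               /* full block written only once */

    if (fragment_bytes <= JOURNAL_FRAGMENT_MAX)
        return AREA_JOURNAL;            /* tiny update: compact with others */

    return AREA_DIFF_UPDATES;           /* larger delta/compressed update */
}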
Usually, user data and metadata are based on different granularities of items and very different frequencies of updates. Moreover, various metadata structures have different architectures and live under different workloads. The SSDFS file system distinguishes various types of segments with the goal of guaranteeing a predictable and deterministic nature of data management. As a result, there are several types of segments on any SSDFS file system volume: (1) superblock segment, (2) snapshot segment, (3) PEB mapping table segment, (4) segment bitmap, (5) b-tree segment, (6) user data segment. Generally speaking, the goal of distinguishing the different types of segments is to localize the peculiarities of different types of data (user data and metadata, for example) inside specialized segments. Another important responsibility of the segments' specialization is to provide a reliable basis for data and metadata recovery in the case of file system volume corruption. It means that if a PEB keeps a specialized type of metadata or user data then it simplifies the task of data/metadata recovery in the case of file system volume corruption.
Superblock segment. The superblock is one of the critical metadata structures of any file system (Fig. 14). First of all, the superblock identifies the type of the file system (ext4, xfs, btrfs, for example). From another viewpoint, the superblock's responsibility is to describe the crucial features of a file system volume (logical block size, number of free blocks, number of folders, for example). And, finally, the file system driver extracts from the superblock the knowledge about the position of the key metadata structures (block bitmap, inodes array, for example) on the volume. Usually, the superblock is stored at one or several fixed position(s) on the file system volume (Fig. 14). Generally speaking, the fixed position of the superblock provides the opportunity to find the superblock easily and to identify the file system type on the volume. However, SSDFS is a log-structured (LFS) and flash-friendly file system. It means that a fixed position of the superblock is not a suitable solution for the case of the SSDFS file system. If anybody considers the superblock metadata structure (Fig. 14), then it is possible to distinguish two principal types of fields: (1) static metadata - describes basic and unchangeable features of the file system volume (logical block size, for example), (2) mutable metadata - describes the volume's features that are modified during the mounted state of the file system volume (number of free blocks, for example).

Figure 14: Classic superblock approach.

Figure 15: Distributed superblock approach.

Any SSDFS file system volume represents a sequence of logical segments. Every segment contains some number of LEBs. Finally, an LEB needs to be associated with a PEB when it is necessary to store any data in the segment. As a result, any associated PEB stores a sequence of logs. Moreover, every full log starts from a fixed position in the PEB, and the full log contains the segment header and log footer (Fig. 15). Generally speaking, the segment header and log footer are located at fixed positions if the size of the full log is known. The SSDFS file system keeps the static part of the superblock's metadata in the segment header and the mutable part in the log footer of every full log (Fig. 15). The massive replication of the superblock's metadata has the goal of increasing the reliability of storing the superblock inside the SSDFS file system volume. From another point of view, keeping the superblock's metadata in every full log creates the opportunity to start the recovery of a corrupted file system volume from any particular PEB on the volume. Moreover, even if only one log from the whole SSDFS file system volume survives, it will be possible to extract and to recover the data or metadata from the surviving log on the basis of the metadata in its header and footer.

Figure 16: Specialized superblock concept.

However, the massive replication of the superblock's metadata creates the problem of finding the last actual state of the mutable part of the superblock's metadata. To resolve this problem the SSDFS file system introduces a special type of segment - the superblock segment (Fig. 16). Generally speaking, the goal of the superblock segment is to keep a sequence of superblock states stored for every mount and unmount operation. As a result, the superblock segment contains a sequence of logs that keep the state of the superblock in the header and footer (Fig. 16). Moreover, every log of the superblock segment is able to store in its payload a snapshot of some critical metadata structures.
The key goals of the superblock segment are: (1) to store the superblock's state for every mount and unmount operation, (2) to provide a mechanism for fast search of the last actual state of the superblock. The fast lookup method is based on knowledge of the numbers of the current and next superblock segments, which are stored in every segment header (Fig. 17). It means that the segment header of any valid PEB with logs is able to provide the numbers of the current and next superblock segments that were actual at some timestamp. Checking these segment numbers either confirms the actual superblock segments or discovers more recent numbers. Finally, it is possible to find the actual superblock segment by passing through the chain of segment numbers, and then to find the latest log in that segment with the goal of retrieving the actual superblock state. Moreover, SSDFS keeps two copies of the superblock segment with the goal of improving the reliability and increasing the performance of the lookup operation.

Figure 17: Superblock segments' migration scheme.

The segment header of any full log keeps the previous, current, and next numbers of the superblock segment (Fig. 17). These numbers create the basis for the migration technique of the superblock segment. Initially, the file system driver uses the current superblock segment's number for storing logs with superblock states. The next superblock segment's number plays the role of a reserve for the case of exhaustion of the current superblock segment. When the current superblock segment is exhausted, the driver moves its number from the current to the previous position. If the previous position already contained the number of some superblock segment, then that superblock segment is either saved as a snapshot or erased. As a result, the next superblock segment becomes the current one, and some clean segment needs to be allocated as the new reservation of space for the superblock segment. Otherwise, if no clean segment can be allocated, the number of the former previous superblock segment is reused. Generally speaking, the distributed superblock approach and the specialized superblock segment concept provide a reliable way of storing the superblock and an efficient technique of searching for its last actual state during the mount operation.
Snapshot segment. SSDFS is a Log-structured File System (LFS) that uses the Copy-On-Write (COW) policy for data updates. It means that SSDFS provides a rich basis for the concept of snapshots of the volume's states. Generally speaking, every log represents a checkpoint of user data's or metadata's state. Such a checkpoint is accessible until the next erase operation is applied to the PEB. It means that a checkpoint has to be converted into a snapshot for long-term keeping of that checkpoint's state. As a result, a snapshot is the long-term storage of the file system's state, or of a portion of the namespace, for some timestamp. A snapshot can play the role of a starting point for the evolution of some version of the file system's state, and that version can be accessed by means of mounting the volume with a snapshot's ID. The superblock segment is able to store the content of some critical metadata structures in the log's payload. As a result, the snapshots table can be stored in the superblock segment (Fig. 18). Generally speaking, the snapshots table is an array of records that keep a snapshot ID and the corresponding snapshot segment number. Finally, this table provides the mechanism for finding the snapshot segment number(s) for a given snapshot ID.

Figure 18: Snapshots table concept.
Figure 19: Snapshot segment concept.

SSDFS introduces the concept of a specialized snapshot segment (Fig. 19). The snapshot segment is dedicated to the long-term storage of a checkpoint's state or of some portion of the file system's namespace (which could play the role of a parent snapshot for a version of the file system's state). Generally speaking, the snapshot segment contains a sequence of logs that keep a special journal (Fig. 19). This journal is simply an aggregation of records that keep the state of files' content or metadata. Moreover, a log's header or footer is capable of storing the root node of the inodes tree that represents the initial state of the namespace for the particular snapshot. This root node points to the index/leaf nodes that will be stored in the regular, specialized segments. Generally speaking, only the inodes tree needs to be defined explicitly in the snapshot segment, because the rest of the b-trees (extents, dentries, xattr b-trees) are defined by means of root node(s) in the particular inode records. Finally, the nodes of these child b-trees are stored in the regular, specialized segments that are dedicated to keeping the index/leaf nodes of b-trees.

Figure 20: Snapshots concept.

Finally, the snapshots table in the superblock segment associates the snapshot IDs with segment numbers (Fig. 20). Every segment number in this table defines the specialized snapshot segment that stores a snapshot of some state of the file system's volume. Also, the snapshot segment contains the root node of the inodes tree in the log's segment header, which plays the role of a starting point for evolving the file system's state by means of adding index and leaf nodes into the inodes, extents, dentries, and xattr b-trees. The snapshot state is able to evolve when the volume is mounted with some snapshot ID.
PEB mapping table. SSDFS is based on the concept of the logical segment, which is an aggregation of LEBs. Moreover, initially a LEB has no association with a particular PEB. It means that a segment could have associations for only some of its LEBs, or even no association with any PEB at all (for example, in the case of a clean segment). Generally speaking, SSDFS needs a special metadata structure (the PEB mapping table) that is capable of associating any LEB with any PEB. The PEB mapping table is a crucial metadata structure that has several goals: (1) mapping LEB to PEB, (2) implementation of the logical extent concept, (3) implementation of the PEB migration concept, (4) implementation of the delayed erase operation by a specialized thread, (5) implementation of the bad erase block recovery approach.

Generally speaking, the PEB mapping table describes the state of all PEBs on a particular SSDFS volume. These descriptors are split into several fragments that are distributed amongst the PEBs of specialized segments (Fig. 21). The numbers of these specialized segments are stored in the segment headers of every log, and they describe the space of the volume reserved for the PEB mapping table. Because SSDFS employs the concept of the logical segment, the reserved numbers of specialized segments remain the same for the volume's lifetime. But if some PEB achieves the exhausted state, then it triggers the migration mechanism of moving the exhausted PEB's content into another PEB. The PEB mapping table is also enhanced by a special cache that is stored in the payload of the superblock segment's log (Fig. 21). Generally speaking, the cache stores a copy of the PEBs' state records. The goal of the PEB mapping table's cache is to resolve the case when a PEB's descriptor is associated with a LEB of the PEB mapping table itself, for example. If the unmount operation triggers the flush of the PEB mapping table, then there are cases when the PEB mapping table could be modified during the flush operation's activity. As a result, the actual PEB's state is stored only in the PEB mapping table's cache. Such a record is marked as inconsistent, and the inconsistency has to be resolved during the next mount operation by means of storing the actual PEB's state into the PEB mapping table by a specialized thread. Moreover, the cache plays another very important role: it is used for converting LEB ID into PEB ID for the basic metadata structures (PEB mapping table and segment bitmap, for example) before the PEB mapping table is fully initialized during the mount operation.

Figure 21: PEB mapping table architecture.
Figure 22: PEB mapping table's fragment structure.

Every fragment of the PEB mapping table represents the log's payload in a specialized segment (Fig. 22). Generally speaking, the payload's content is split into: (1) LEB table, and (2) PEB table. The LEB table starts with a header and contains an array of records ordered by LEB IDs. It means that the LEB ID plays the role of an index into the array of records. As a result, the responsibility of the LEB table is to define an index inside the PEB table. Moreover, every LEB table record defines two indexes. The first index (physical index) associates the LEB ID with some PEB ID. Additionally, the second index (relation index) is able to define a PEB ID that plays the role of the destination PEB during the migration process from the exhausted PEB into a new one.
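A minimal sketch of one LEB table record with its two indexes might look as follows; the field names and widths are assumptions, and the LEB ID itself is implicit because the records are ordered by LEB ID.

```c
#include <stdint.h>

#define PEB_INDEX_UNUSED 0xFFFF	/* hypothetical "no PEB associated" marker */

/* Hypothetical sketch of one LEB table record: the LEB ID equals the
 * record's position in the array, so only the indexes into the PEB
 * table need to be stored. */
struct leb_table_record {
	uint16_t physical_index;	/* index into the PEB table: source PEB */
	uint16_t relation_index;	/* destination PEB during migration,
					 * PEB_INDEX_UNUSED if no migration */
};
```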
It is possible to see (Fig. 22) that the PEB table starts with a header and contains an array of PEB state records ordered by PEB ID. The most important fields of the PEB state record are: (1) erase cycles, (2) PEB type, (3) PEB state.

Figure 23: Possible PEB's types and states.

The PEB type (Fig. 23) describes the possible types of data a PEB could contain: (1) user data, (2) leaf b-tree node, (3) hybrid b-tree node, (4) index b-tree node, (5) snapshot, (6) superblock, (7) segment bitmap, (8) PEB mapping table. The PEB state (Fig. 23) describes the possible states of a PEB during its lifecycle: (1) clean state means that the PEB contains only free NAND flash pages ready for write operations, (2) using state means that the PEB could contain valid, invalid, and free pages, (3) used state means that the PEB contains only valid pages, (4) pre-dirty state means that the PEB contains both valid and invalid pages, (5) dirty state means that the PEB contains only invalid pages, (6) migrating state means that the PEB is under migration, (7) pre-erase state means that the PEB has been added into the queue of PEBs waiting for the erase operation, (8) recovering state means that the PEB will be left untouched for some amount of time with the goal of recovering its ability to fulfill the erase operation, (9) bad state means that the PEB is unable to be used for storing data. Generally speaking, the responsibility of the PEB state is to track the passing of PEBs through the various phases of their lifetime with the goal of managing the volume's pool of PEBs efficiently.

Figure 24: PEB mapping table's cache.

The PEB mapping table's cache (Fig. 24) starts with a header that precedes: (1) LEB ID / PEB ID pairs, (2) PEB state records. The pairs' area associates the LEB IDs with PEB IDs. Additionally, the PEB state records' area contains information about the last actual state of the PEBs for every record in the pairs' area. The most important fields in the PEB state area are: (1) consistency, (2) PEB state, and (3) PEB flags. Generally speaking, the consistency field simply shows whether a record in the cache and in the mapping table are identical or not. If some record in the cache is marked as inconsistent, then the PEB mapping table has to be modified with the goal of matching the actual value kept in the cache. As a result, the value in the table and in the cache finally become consistent.

The PEB migration approach is the key technique of the SSDFS file system. Generally speaking, the migration mechanism implements the logical segment and logical extent concepts with the goal of decreasing or completely eliminating the write amplification issue. Moreover, SSDFS widely uses data compression, the delta-encoding technique, and the small files compaction technique, which provide the opportunity to employ the PEB migration mechanism without the necessity of additional overprovisioning. The PEB mapping table plays a critical role in the implementation of the PEB migration technique by means of the relation index in the LEB table (Fig. 22). Finally, this index creates the relation between two PEBs that defines the source and destination points for data migration.
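The PEB descriptor fields from Fig. 23 can be summarized as two enumerations plus a small state record. This is a sketch for illustration; the identifiers and field widths are assumptions rather than the SSDFS on-disk encoding.

```c
#include <stdint.h>

/* Hypothetical sketch of PEB types (Fig. 23). */
enum peb_type {
	PEB_TYPE_USER_DATA,
	PEB_TYPE_LEAF_NODE,
	PEB_TYPE_HYBRID_NODE,
	PEB_TYPE_INDEX_NODE,
	PEB_TYPE_SNAPSHOT,
	PEB_TYPE_SUPERBLOCK,
	PEB_TYPE_SEGBMAP,
	PEB_TYPE_MAPTBL,
};

/* Hypothetical sketch of PEB lifecycle states (Fig. 23). */
enum peb_state {
	PEB_STATE_CLEAN,	/* only free NAND pages */
	PEB_STATE_USING,	/* valid, invalid, and free pages */
	PEB_STATE_USED,		/* only valid pages */
	PEB_STATE_PRE_DIRTY,	/* valid and invalid pages */
	PEB_STATE_DIRTY,	/* only invalid pages */
	PEB_STATE_MIGRATING,	/* under migration */
	PEB_STATE_PRE_ERASE,	/* queued for erase */
	PEB_STATE_RECOVERING,	/* resting before erase is retried */
	PEB_STATE_BAD,		/* excluded from use */
};

/* Hypothetical sketch of one PEB table record. */
struct peb_table_record {
	uint32_t erase_cycles;	/* wear counter */
	uint8_t	 type;		/* enum peb_type */
	uint8_t	 state;		/* enum peb_state */
};
```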
The segment bitmap (Fig. 25) is a critical metadata structure of SSDFS that serves several goals: (1) searching for a candidate for a current segment that is capable of storing new data, (2) searching by the GC subsystem for the most suitable segment (in the pre-dirty state, for example) with the goal of preparing the segment in the background for storing new data.

Figure 25: Segment bitmap concept.

The segment bitmap (Fig. 25) is able to represent the following set of states: (1) clean state means that a segment contains free logical blocks only, (2) using state means that a segment could contain valid, invalid, and free logical blocks, (3) used state means that a segment contains valid logical blocks only, (4) pre-dirty state means that a segment contains valid and invalid logical blocks, (5) dirty state means that a segment contains only invalid blocks, (6) reserved state is used for reserving segment numbers for some metadata structures (for example, for the superblock segment), (7) bad state means that a segment is excluded from usage because the volume hasn't enough valid erase blocks (PEBs).

Generally speaking, the PEB migration scheme implies that segments are able to migrate from one state to another without the explicit use of the GC subsystem. For example, if some segment receives enough truncate operations (data invalidation), then the segment could change its state from used to pre-dirty. Additionally, the segment is able to migrate from the pre-dirty into the using state by means of PEB migration in the case of receiving enough data update requests. As a result, the segment in the using state could be selected as the current segment without any GC-related activity. However, a segment is able to stick in the pre-dirty state in the absence of update requests. Such a situation can be resolved by the GC subsystem by means of background migration of pre-dirty segments into the using state if the volume hasn't enough segments in the clean or using state.

Figure 26: Segment bitmap architecture.

The segment bitmap is implemented as a bitmap metadata structure that is split into several fragments (Fig. 26). Every fragment is stored in a log of a specialized PEB. As a result, the full size of the segment bitmap and the PEB's capacity define the number of fragments. The mkfs utility reserves the necessary number of segments for storing the segment bitmap's fragments during the volume creation. Finally, the numbers of the reserved segments are stored in the segment headers of every log on the volume. The segment bitmap "lives" in the same set of reserved segments during the whole lifetime of the volume. However, the update operations of the segment bitmap could trigger PEB migration in the case of exhaustion of any PEB used for keeping the segment bitmap's content.

Figure 27: B-tree concept.
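Goal (1) of the segment bitmap (finding a candidate for a current segment) amounts to scanning the per-segment states for the first clean or using entry. Below is a minimal in-memory sketch under the simplifying assumption that the states are kept in a plain byte array; the real structure is a packed, fragmented bitmap.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of segment states tracked by the segment bitmap. */
enum seg_state {
	SEG_STATE_CLEAN,
	SEG_STATE_USING,
	SEG_STATE_USED,
	SEG_STATE_PRE_DIRTY,
	SEG_STATE_DIRTY,
	SEG_STATE_RESERVED,
	SEG_STATE_BAD,
};

/* Find the first segment able to receive new data (clean or using).
 * Returns the segment number or -1 if nothing suitable was found; in
 * that case the GC subsystem would have to prepare a segment. */
static long find_candidate_segment(const uint8_t *states, size_t segs_count)
{
	for (size_t i = 0; i < segs_count; i++) {
		if (states[i] == SEG_STATE_CLEAN || states[i] == SEG_STATE_USING)
			return (long)i;
	}
	return -1;
}
```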
B-trees (Fig. 27) represent an efficient and compact metadata structure for storing and representing metadata on a file system's volume. A b-tree is also able to provide an efficient and fast way of searching for items. Moreover, another important feature of the b-tree is the compact representation of sparse metadata and the easy increase of the reserved capacity of metadata space. Such a feature could be especially important for the case of NAND flash, because it could be expensive and useless, for example, to store a huge and flat array of inodes when the namespace of a file system contains no files at all. However, a b-tree is usually treated as an unsuitable solution for flash-oriented file systems because of the wandering tree and excessive write amplification issues. But SSDFS is based on the logical segment concept, the logical extent concept, and the delta-encoding technique, which completely exclude the wandering tree issue. These concepts are also the basis for decreasing (or completely eliminating) the write amplification issue. Moreover, the b-tree metadata structure provides the way not to keep an unnecessary reserve of metadata space on the volume. As a result, it means the exclusion of management operations on reserved metadata space (moving from one PEB to another) with the goal of keeping it in a valid state. Generally speaking, it is a way to decrease the number of PEB erase and write operations.

The root node of the key b-trees (inodes b-tree, shared extents b-tree, shared dictionary b-tree) is stored in the segment header and/or log footer (Fig. 27). Oppositely, the root node of the extents b-tree, dentries b-tree, and xattr b-tree is stored in an inode record. Generally speaking, if the root node is capable of storing the b-tree's few metadata records, or the b-tree has no metadata records at all, then the volume will not contain any node of that b-tree. SSDFS uses specialized types of segments. It means that a b-tree's leaf nodes will be stored in a segment dedicated and reserved for leaf nodes, while index nodes will be stored in a segment containing index nodes only (Fig. 28). Moreover, nodes of different b-trees can be stored in the same segment. Generally speaking, it is expected that the workload of b-tree nodes of the same type (leaf nodes, for example) could be the same, and this is the basis for grouping the nodes of different b-trees into one segment. It could also create a more compact representation of the b-trees on the volume. Every b-tree node could use one or several logical blocks. As a result, any log of a segment with b-tree nodes (Fig. 28) contains: (1) segment header, (2) log footer, and (3) payload that contains the b-tree nodes. Finally, the b-tree nodes' content is distributed among the main, diff updates, and journal areas of the log.

Figure 28: B-tree segment type.
Figure 29: User data segment type.
User data segment. SSDFS aggregates user data inside segments dedicated to the user data type (Fig. 29). It needs to be pointed out that user data could live under different workloads and have various features. Theoretically, it is possible to introduce various types of segments for user data; however, only one type of segment is used for user data.

Generally speaking, the main, diff updates, and journal areas are the key technique of user data management on an SSDFS volume (Fig. 29). The main area plays the role of cold data storage because it is dedicated to storing plain, full logical blocks. Moreover, if a logical block is stored in the main area, then all subsequent updates of this logical block are directed to the diff updates or journal areas. As a result, this is the key technique for achieving the cold nature of data in the main area of a log. Finally, the data in the main area can be moved from the initial PEB into a new one by means of the PEB migration scheme (in the case of an update operation directed to an exhausted PEB) or by the GC subsystem (in the case of the absence of update operations and an exhausted state of the PEB). The moving operation of data in the main area could be accompanied by applying the associated updates stored in the diff updates and/or journal areas.

The diff updates area is able to store into one NAND flash page the compressed blocks, delta-encoded portions of data, and the tail of the same file (Fig. 29). It is possible to expect that the frequency of updates of different portions of the same file could be lower than the frequency of updates of different files. As a result, the diff updates area is treated as an area with warm data. It is important to point out that data in the diff updates area could be invalidated partially or completely by update operations. Also, the valid data of the diff updates area is used during migration of the main area's content with the goal of preparing the actual state of the logical block(s). Generally speaking, it is not necessary to reserve any space for the diff updates area's content in a destination PEB during the migration operation.

The responsibility of the journal area is to gather into one NAND flash page the small files, the tails of different files, and compressed updates or delta-encoded data of different files (Fig. 29). Generally speaking, different files are able to grow or to be updated with higher frequency than the content of one file. As a result, gathering the content of different files into one NAND flash page increases the probability of updates in the journal area. Finally, the journal area is treated as an area with hot data. Moreover, the amount of updates in the journal area is expected to be high enough to achieve complete invalidation of the journal area by means of natural migration of data between logs and PEBs (by means of the migration scheme) without any GC subsystem activity.

However, some small files could be completely "cold" or updated rarely. As a result, it means the necessity to employ the logic of the GC subsystem or the PEB migration scheme to process some data in the journal area. However, the scheme of compaction of several small files into one NAND flash page decreases the write amplification issue by virtue of the opportunity to move the several small files into one page by a single copy operation.
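The routing of data between the main, diff updates, and journal areas can be summarized by a small decision helper. This is only a sketch of the policy described above; the enum, the function, and the thresholds are assumptions, not SSDFS code.

```c
#include <stdbool.h>
#include <stddef.h>

enum log_area { AREA_MAIN, AREA_DIFF_UPDATES, AREA_JOURNAL };

/* Hypothetical sketch of the area selection policy:
 * - small files, tails, and fragments of different files -> journal (hot);
 * - updates/deltas of an already stored logical block    -> diff updates (warm);
 * - plain, full logical blocks stored for the first time -> main (cold). */
static enum log_area choose_area(size_t len, size_t block_size, bool is_update)
{
	if (len < block_size)
		return AREA_JOURNAL;
	if (is_update)
		return AREA_DIFF_UPDATES;
	return AREA_MAIN;
}
```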
Current segment. SSDFS employs the concept of current segments (Fig. 30). Generally speaking, if it is necessary to add some new data on the volume (a new file, new logical blocks of an existing file, a new b-tree node), then a segment with free logical blocks is needed. Only a segment in the clean or using state can be used for adding new data. As a result, the SSDFS driver has a set of current segments (user data, index node, hybrid node, leaf node) because of the policy of grouping different types of data into different types of segments (Fig. 30).

Figure 30: Current segment concept.

A segment can be used as current until the complete exhaustion of the free logical blocks' pool in that particular segment (used or pre-dirty state). In the case of the absence of free logical blocks in the current segment, the file system driver tries to find in the segment bitmap a new segment in the clean or using state. If the driver is unable to find any clean or using segment, then it triggers the GC logic for searching the segments in the pre-dirty or dirty state with the goal of transforming a pre-dirty segment into the using state and a dirty segment into the clean state. Finally, if no pre-dirty or dirty segment can be processed or transformed into the clean or using state, then the driver has to inform the user about the absence of free space on the volume. Generally speaking, the GC subsystem has to track the state of the segment bitmap in the background and to prepare enough clean or using segments (in the case of enough free space on the volume). However, the GC subsystem's background activity should not affect the overall performance of the file system driver and should not increase the write amplification issue.

From one point of view, the several types of current segments create several independent threads of processing of new data. Additionally, the segment object in the file system driver is implemented in such a way that every segment has a queue for requests with new data (see the sketch after this section). Generally speaking, it means that a thread adds some new data into a current segment by means of simple insertion of the request into the tail of the queue. The rest of the processing of the requests in the queue is executed in the background by specialized PEB flush threads. Finally, the whole architecture creates a fast, efficient, simple, and multi-threaded mechanism of new data processing. Of course, if a volume was mounted in synchronous mode, then a thread needs to wait for the finishing of processing of the request that was added into the queue, and it means an inevitable degradation of file system performance in one thread. However, if several threads add requests into the queue, then the whole file system's performance may not degrade dramatically even in the synchronous mode. Finally, it is possible to state that SSDFS has a flexible and efficient subsystem of current segments capable of providing good performance of data processing.

B-tree Architecture
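The per-segment request queue mentioned above can be sketched as a simple locked FIFO: a foreground thread only appends a request, and a PEB flush thread drains the queue in the background. The types and names below are hypothetical; this is a minimal user-space sketch, not the driver's implementation.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch of a new-data request queued to a current segment. */
struct seg_request {
	struct seg_request *next;
	const void *buf;	/* data to be stored */
	size_t len;
};

/* Hypothetical sketch of a current segment object with its request queue. */
struct current_segment {
	pthread_mutex_t lock;
	struct seg_request *head, *tail;	/* FIFO of pending requests */
};

/* Foreground path: adding data is just appending a request to the tail;
 * the actual placement into PEB logs is done later by the flush thread. */
static void seg_enqueue(struct current_segment *cur, struct seg_request *req)
{
	req->next = NULL;
	pthread_mutex_lock(&cur->lock);
	if (cur->tail)
		cur->tail->next = req;
	else
		cur->head = req;
	cur->tail = req;
	pthread_mutex_unlock(&cur->lock);
}
```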
The b-tree is a widely used metadata structure that has been proven to be efficient in various file systems. For example, XFS, btrfs, JFS, ReiserFS, and HFS+ use different b-tree implementations. Usually, a b-tree provides the opportunity to have multiple child nodes for the same parent node. A regular b-tree (Fig. 31) contains a root node, index nodes, and leaf nodes. Generally speaking, the key advantage of any b-tree is the capability to store data in the form of nodes with the goal of providing efficient data extraction in the case of a block-oriented storage device, because any b-tree node is capable of including multiple data items and, as a result, of decreasing the number of I/O operations needed for searching for any item in the b-tree. Mostly, b-trees are used for storing and representing various file system metadata types (for example, inodes or extents). Usually, the root node (Fig. 31) represents the starting point of a b-tree. It keeps index records that contain a key and a pointer to a child node. Any index node contains the same kind of index records, because the responsibility of the index node is to provide the way to find a leaf node with the data records. The data record is the pair of a key and some associated value (for example, an extent or an inode). A hash, an ID, or any other value could play the role of the key, which would also be a field of the data record itself. Moreover, key values are the basis for data record ordering in the b-tree. Generally speaking, any lookup operation starts from the root node, passes through the index levels, and finds some data record on the key basis in the found leaf node.

Figure 31: Common b-tree architecture.
Why b-tree for LFS file system?
Usually, the b-tree is considered not a very good choice for flash-oriented and flash-friendly file systems by virtue of the wandering tree issue and a high value of write amplification. However, the b-tree architecture provides very important advantages: (1) an efficient search mechanism, (2) compact storage of sparse data, (3) a flexible technique of capacity increase and shrinking. Generally speaking, the key reason for a possible b-tree inefficiency in the case of an LFS file system is the Copy-On-Write (COW) policy. The COW policy means the necessity to store an updated logical block into some new and free position on the file system's volume. As a result, it initiates a lot of metadata updates that, finally, result in a significant increase of write amplification when a b-tree is used.

However, SSDFS uses the logical segment and logical extent concepts and the PEBs migration scheme. Generally speaking, these techniques provide the opportunity to completely exclude the wandering tree issue and to significantly decrease write amplification. SSDFS introduces the technique of storing data on the basis of a logical extent that describes the data's position by means of a segment ID and a logical block number. Finally, the PEBs migration technique guarantees that data will be described by the same logical extent until a direct change of the segment ID or logical block number. As a result, it means that the logical extent stays the same while the data is sitting in the same logical segment. The responsibility of the PEBs migration technique is to implement the continuous migration of data between PEBs inside the logical segment for the case of data updates and GC activity. Generally speaking, SSDFS's internal techniques guarantee that the COW policy will not update the content of a b-tree; the content of a b-tree will be updated only by regular end-user operations with the file system.

SSDFS uses the b-tree architecture for metadata representation (for example, the inodes tree, extents tree, dentries tree, xattr tree) because it provides a compact way of reserving the metadata space without the excessive overprovisioning of metadata reservation (needed, for example, in the case of a plain table or array). The excessive overprovisioning of metadata reservation dictates the necessity of keeping the reserved space in a valid state by means of migration among PEBs, because of the continuously increasing NAND unrecoverable bit error rate (UBER) for stored data. As a result, such migration activity (on the SSD or file system side) increases the write amplification issue.

Moreover, a b-tree provides an efficient technique of item lookup, especially for the case of an aged or sparse b-tree that is capable of containing a mixture of used and deleted (or freed) items. Such a b-tree feature could be very useful for the case of extent invalidation, for example. Also, SSDFS aggregates the b-tree's root node in the superblock (for example, the inodes tree case) or in the inode (for example, the extents tree case). As a result, it means that an empty b-tree will contain only the root node without the necessity to reserve any b-tree node on the file system's volume. Moreover, if a b-tree needs to contain only several items (two items, for example), then the root node's space can be used to store these items inline without the necessity to create a full-featured b-tree node.

One of the fundamental mechanisms of the SSDFS file system is the current segments approach.
This approach is used for aggregation of data living under similar workloads in the current segment of some type. For example, there are current segments for different b-tree node types (index, hybrid, leaf nodes). Generally speaking, it means that the current segment for leaf nodes aggregates the leaf nodes of different b-trees (inodes, extents, dentries b-trees, for example). Every current segment allocates logical blocks until the complete exhaustion of this segment. Finally, the SSDFS driver needs to allocate a new current segment in the case of complete exhaustion of the free space of the previous current segment. Moreover, the SSDFS driver uses the compression and delta-encoding techniques, which are the way to achieve a compact representation of b-tree nodes in the PEB's space.

As a result, SSDFS uses b-trees with the goal of achieving a compact representation of metadata, a flexible way to expand or shrink the b-tree's space capacity, and an efficient mechanism of item lookup.
Hybrid b-tree architecture. A regular b-tree contains two types of nodes: index and leaf ones. The index node keeps pointers to other nodes with the goal of implementing the mechanism of a fast lookup operation in the b-tree. Oppositely, the leaf node keeps items of real data that are stored in the b-tree. Moreover, a node creation means the reservation of 4-64 KB of the volume's space. However, usually, a b-tree is a metadata structure that does not receive a lot of items at once. As a result, it means that a growing b-tree could contain some number of empty or semi-empty index nodes. These index nodes could stay empty or semi-empty for a significant amount of time, which results in an increased number of I/O operations during search in the b-tree and flush of the b-tree. Generally speaking, this side effect could increase write amplification for the case of flash-friendly file systems.

Figure 32: B-tree architecture with hybrid nodes.

SSDFS uses a hybrid b-tree architecture (Fig. 32) with the goal of eliminating the index nodes' side effect. The hybrid b-tree operates with three node types: (1) index node, (2) hybrid node, (3) leaf node. Generally speaking, the peculiarity of the hybrid node (Fig. 33) is the mixture of index and data records in one node. The hybrid b-tree starts with a root node (Fig. 34, case A) that is capable of keeping two index records or two data records inline (if the size of a data record is equal to or less than the size of an index record). If the b-tree needs to contain more than two items, then the first hybrid node should be added into the b-tree. The first level of the b-tree is able to contain only two nodes (Fig. 34, case B) because the root node is capable of storing only two index records. Generally speaking, the initial goal of the hybrid node is to store the data records in the presence of a reserved index area (Fig. 33).

Figure 33: Hybrid node architecture.
Figure 34: Hybrid b-tree evolution.

The exhaustion of the data area's space of the first hybrid node triggers the addition of the second hybrid node on the first level of the b-tree (Fig. 34, case B). If both hybrid nodes of the first level are completely exhausted, then a special transformation of the b-tree takes place (Fig. 34, case C). First of all, the left hybrid node is transformed into a leaf node. A portion of the data records is moved from the right hybrid node to the newly made leaf node. Also, the index area of the right hybrid node is increased in size after the move operation of data records. The next step is to move the index record of the left node from the root node into the index area of the right node. Finally, the root node will keep the index record for the hybrid node, but the hybrid node will keep the index record for the leaf node (Fig. 34, case C). As a result, the hybrid b-tree will contain the completely full leaf node and the hybrid node with free space for new data records.

The next step of the hybrid b-tree's evolution is adding data records into the hybrid b-tree node until the complete exhaustion of the node's data area. Exhaustion of the data area of the hybrid node triggers: (1) creation of another leaf node, (2) moving all data records from the hybrid node into the newly created leaf node, (3) adding an index record for the leaf node into the index area of the hybrid node. Usually, the data area of the hybrid node is smaller than the capacity of a leaf node. It means that data records will be added into the newly created leaf node at first. Finally, the hybrid node will gather the data records in the case of exhaustion of the leaf node's capacity.
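The hybrid node's layout (Fig. 33) can be sketched as a header followed by an index area that grows and a data area that shrinks as the node evolves. The offsets and field names below are assumptions for illustration, not the on-disk format.

```c
#include <stdint.h>

/* Hypothetical sketch of a hybrid b-tree node header (Fig. 33).
 * The node's payload is split between an index area and a data area;
 * exhaustion of the index area enlarges it at the expense of the data
 * area until the node degenerates into a pure index node. */
struct hybrid_node_header {
	uint16_t node_type;		/* index, hybrid, or leaf */
	uint16_t items_count;		/* data records currently stored */
	uint32_t index_area_offset;	/* start of index records in the node */
	uint32_t index_area_size;	/* bytes reserved for index records */
	uint32_t data_area_offset;	/* start of data records in the node */
	uint32_t data_area_size;	/* bytes reserved for data records */
};
```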
It means that the data area of the hybrid node plays the role of a temporary buffer that aggregates enough data records before a leaf node creation. Generally speaking, this sequence of leaf node creation takes place until the exhaustion of the index area of the hybrid node. Moreover, the index area's exhaustion triggers an increase of the index area's capacity. As a result, it means a decrease of the data area's capacity in the hybrid node. If the index area of the hybrid node extends over the whole node's space, then such a node becomes an index node (Fig. 34, case D). Finally, the index node needs to be completely filled by index records.

Figure 35: Hybrid b-tree evolution.

The exhaustion of the index node (Fig. 34, case D) implies the addition of a right hybrid node on the same level (Fig. 35, case E). It means that the root node will point to the newly added hybrid node. As a result, this hybrid node plays the role of the initial point for evolving the right branch of the b-tree by means of adding new leaf nodes. Generally speaking, the evolution implies addition of data records until the state when the right hybrid node is transformed into an index node. If both index nodes become exhausted by index records, then a hybrid node needs to be added. This hybrid node will contain the pointers to the exhausted index nodes, and the root node will point to the newly added hybrid node (Fig. 35, case G). Finally, the hybrid node plays the essential role in the hybrid b-tree's evolution.

Figure 36: Node type migration scheme.

The operation of deletion of data records could initiate the transformation of index node(s) into hybrid ones (Fig. 36). Such a transformation of the node's type could take place many times by virtue of a mixture of addition and deletion operations. Also, it is possible to imagine the situation of the necessity to split one index node into two hybrid ones in the case of inserting some data record into the middle of an exhausted leaf node. Moreover, such a splitting operation could result in the addition of a hybrid node on the next upper level of the b-tree.
B-tree delayed invalidation. SSDFS processes the delete operations in hybrid b-trees by means of special techniques. For example, deletion of any inode from the inodes b-tree is treated as freeing of the corresponding item in a particular node. Generally speaking, it means that deleted inodes can be allocated again, and the volume space used by the inodes b-tree's nodes remains reserved space. If anyone considers the case of deletion of all items in a leaf node, then such a node can be transformed into the pre-allocated state instead of real deletion of the node from the b-tree structure. Moreover, the pre-allocated state means that it doesn't need to keep the allocated space in a PEB for this node, but the index and/or hybrid nodes continue to keep the same index records for the node in the pre-allocated state. Finally, it decreases write amplification because it doesn't need to update the index/hybrid nodes by means of deletion of the index records that point to the leaf node. However, if a b-tree becomes completely empty, then it is the case of the real destruction of the b-tree structure.

Additionally, the SSDFS driver uses the technique of delayed invalidation/destruction of b-tree nodes or sub-trees. Especially, this technique could make the operation of truncation or deletion of big files faster and more efficient. Generally speaking, the SSDFS driver has a special invalidation queue for the index records (that point to a node or a sub-tree) and a dedicated thread whose goal is to invalidate the data records in the leaf/hybrid nodes and to destroy the sub-tree structure in the background. Finally, it means that the index record for a node or sub-tree only needs to be placed into the invalidation queue during the truncate or delete operation, while the real processing of the node or sub-tree takes place in the background. An interesting side effect of such an approach is the opportunity to fulfill this background activity in the idle state of the file system driver, for example.
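The delayed invalidation technique boils down to a producer/consumer queue of index records: truncate/delete only enqueues, and a later background pass destroys the referenced nodes or sub-trees. Below is a minimal single-threaded sketch under assumed names; in the driver the draining would be done by a dedicated thread.

```c
#include <stdlib.h>

/* Hypothetical sketch of a queued invalidation request: an index record
 * pointing to a b-tree node or a whole sub-tree awaiting destruction. */
struct invalidate_req {
	struct invalidate_req *next;
	void *index_record;	/* identifies the node/sub-tree to destroy */
};

struct invalidate_queue {
	struct invalidate_req *head, *tail;
};

/* Fast path used by truncate/delete: just remember what must be destroyed. */
static void invalidate_later(struct invalidate_queue *q, void *index_record)
{
	struct invalidate_req *req = malloc(sizeof(*req));
	if (!req)
		return;	/* a real driver would fall back to synchronous invalidation */
	req->next = NULL;
	req->index_record = index_record;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
}

/* Background pass (dedicated thread in the driver, e.g. while idle). */
static void drain_invalidate_queue(struct invalidate_queue *q,
				   void (*destroy)(void *index_record))
{
	while (q->head) {
		struct invalidate_req *req = q->head;
		q->head = req->next;
		if (!q->head)
			q->tail = NULL;
		destroy(req->index_record);
		free(req);
	}
}
```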
The inode is the cornerstone metadata structure of any Linux file system: it keeps all information about a file excluding the file's name and content (user data, for example). Generally speaking, this metadata structure is a critical one that requires both high reliability of storage and high efficiency of access and modification operations. The creation of a file results in the association of a name and an inode ID with the newly created file. Moreover, the inode ID is a unique number in the scope of a particular file system's volume. The name of the file and the inode ID are stored as an item of a folder; namely, the folder associates file names and inode instances. As a result, if an end-user or application tries to access a file by means of its name, then the OS employs this file's name for an inode ID lookup. The found inode ID is used by the file system driver for retrieving the inode instance.

Figure 37: Inodes b-tree architecture.

Generally speaking, the inode table can be imagined as a generalized array of inode instances (Fig. 37) because every inode is identified by an integer value (inode ID). However, the huge capacity of modern storage devices (HDD, SSD) and highly intensive operations of creation/deletion of files make the efficient management of the inode table a very complex problem. Moreover, the use of a simple array or table for inode instances reserves a big space for such a table, and such a reservation could increase write amplification because of the necessity to keep the reserved space in a valid state. Another possible issue is the easy exhaustion of the reserved space without a flexible way to extend it. Oppositely, a b-tree provides an easy way of compact representation of a small and sparse set of items. Moreover, a b-tree is an easily extendable metadata structure with a flexible mechanism for both increasing and shrinking the nodes' space. Finally, the efficient and fast lookup technique is another advantage of any b-tree. These points were the steady basis for the selection of the b-tree as the basic metadata structure for the inodes tree in SSDFS.

Figure 38: Raw inode structure.
The SSDFS raw inode (Fig. 38) is a metadata structure of fixed size that can vary from 256 bytes to several KBs. The size of the inode is defined during the file system's volume creation. Usually, the inode object includes the file mode; file attributes; user/group ID; access, change, and modification times; file size in bytes and blocks; and the links count. The most special part of the SSDFS raw inode is the private area that is used for storing: (1) a small file inline, (2) the root node of the extents, dentries, and/or xattr b-tree.
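A 256-byte raw inode with a private area can be sketched as follows. The exact field set, their widths, and the split between the fixed part and the private area are assumptions for illustration; only the 256-byte total and the listed field categories come from the text above.

```c
#include <stdint.h>

/* Hypothetical sketch of a 256-byte SSDFS raw inode. The private area is
 * used either for a small file stored inline or for the root node of the
 * extents / dentries / xattr b-tree. */
struct raw_inode {
	uint64_t atime, ctime, mtime;	/* timestamps */
	uint64_t size_bytes;		/* file size in bytes */
	uint64_t size_blocks;		/* file size in logical blocks */
	uint32_t uid, gid;		/* owner */
	uint32_t links_count;
	uint16_t mode;			/* file type and permissions */
	uint16_t flags;			/* file attributes */
	uint8_t	 reserved[72];		/* padding up to the private area */
	uint8_t	 private_area[128];	/* inline data or b-tree root node */
};

_Static_assert(sizeof(struct raw_inode) == 256, "raw inode must stay 256 bytes");
```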
The SSDFS inodes b-tree is a hybrid b-tree that includes hybrid nodes with the goal of using the node's space in a more efficient way by means of combining index and data records inside the node. The root node of the inodes b-tree is stored in the log footer or partial log header of every log. Generally speaking, it means that SSDFS uses massive replication of the root node of the inodes b-tree. Actually, the inodes b-tree node's space includes a header, an index area (in the case of a hybrid node), and an array of inodes ordered by ID values. If a node has 8 KB in size and the inode structure is 256 bytes in size, then the maximum capacity of one inodes b-tree node is 32 inodes.

Generally speaking, the inodes table can be imagined as an imaginary array that is extended by means of adding new inodes into the tail of the array (Fig. 37). However, an inode can be allocated or deleted by virtue of create-file or delete-file operations, for example. As a result, every b-tree node has an allocation bitmap that tracks the state (used or free) of every inode in the b-tree node. The allocation bitmap provides the mechanism of fast lookup of a free inode with the goal of reusing the inodes of deleted files. Also, the inodes b-tree uses a special technique of processing completely empty leaf nodes, which could achieve the empty state after deletion of all inodes in the node. This technique is based on the conversion of an empty b-tree node into the pre-allocated state. Generally speaking, the pre-allocated state means that the logical extent continues to be reserved for this b-tree node, but no space is allocated in the segment's PEBs. The important point of such a technique is the opportunity not to update the index records in the index/hybrid b-tree nodes that point to the leaf node converted into the pre-allocated state. It also means that the leaf node's space continues to be reserved on the file system's volume.

Additionally, every b-tree node has a dirty bitmap whose goal is to track modification of inodes. Generally speaking, the dirty bitmap provides the opportunity to flush not the whole node but the modified inodes only. As a result, such a bitmap could play the cornerstone role in the delta-encoding or the Diff-On-Write approach. Moreover, a b-tree node has a lock bitmap whose responsibility is to implement the mechanism of exclusive locking of a particular inode without the necessity to lock the whole node exclusively. Generally speaking, the lock bitmap was introduced with the goal of improving the granularity of the lock operation. As a result, it provides a way to modify different inodes in the same b-tree node without the use of an exclusive lock of the whole b-tree node. However, the exclusive lock of the whole tree has to be used for the case of addition or deletion of a b-tree node.
The Linux kernel identifies a file by means of an inode that is unique for the file. However, the association of a file name and an inode instance takes place by means of a directory entry (dentry). Moreover, different dentries in the same or different folders can identify the same file or inode. Dentries play an important role in the directory caching that contains metadata of frequently accessed files for more efficient access operations. Another important role of dentries is folder hierarchy traversal, because the dentries connect folders with files.

Figure 39: Dentries b-tree architecture.
The SSDFS dentry (Fig. 39) is a metadata structure of fixed size (32 bytes). It contains an inode ID, a name hash, a name length, and an inline string for 12 symbols. Generally speaking, the dentry is able to store an 8.3 filename inline. If the file/folder has a longer name (more than 12 symbols), then the dentry keeps only a portion of the name, while the whole name is stored in a shared dictionary. The goal of such an approach is to represent the dentry by a compact metadata structure of fixed size for fast and efficient operations with dentries. It is possible to point out that there are a lot of use cases when the name of a file or folder is not very long. As a result, the dentry's inline string could be the only storage for the file/folder name. Moreover, the goal of the shared dictionary is to store the long names efficiently by means of a deduplication mechanism.
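A 32-byte dentry with a 12-symbol inline name can be sketched as below. The field names and the exact split of the 32 bytes are assumptions; only the total size, the listed fields, and the 12-symbol inline string come from the text above.

```c
#include <stdint.h>

/* Hypothetical sketch of a fixed-size (32-byte) SSDFS dentry. Names longer
 * than 12 symbols keep only their initial portion inline; the full name is
 * stored in the shared dictionary and found via the name hash. */
struct ssdfs_dentry_sketch {
	uint64_t ino;			/* inode ID */
	uint64_t name_hash;		/* hash of the full name */
	uint8_t	 name_len;		/* full name length */
	uint8_t	 dentry_type;		/* e.g. inline name vs. name in dictionary */
	uint8_t	 file_type;		/* regular file, directory, symlink, ... */
	uint8_t	 flags;
	char	 inline_name[12];	/* an 8.3 name fits entirely */
};

_Static_assert(sizeof(struct ssdfs_dentry_sketch) == 32, "dentry must stay 32 bytes");
```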
The dentries b-tree is a hybrid b-tree (Fig. 39) whose root node is stored in the inode's private area. By default, the inode's private area has 128 bytes in size, and the SSDFS dentry has 32 bytes in size. As a result, the inode's private area provides enough space for 4 inline dentries. Generally speaking, if a folder contains 4 or fewer files, then the dentries can be stored in the inode's private area without the necessity to create the dentries b-tree. Otherwise, if a folder includes more than 4 files or folders, then the regular dentries b-tree needs to be created, with the root node stored in the private area of the inode. Actually, every node of the dentries b-tree contains a header, an index area (for the case of a hybrid node), and an array of dentries ordered by the hash value of the filename. Moreover, if a b-tree node has 8 KB in size, then it is capable of containing at most 256 dentries.

Generally speaking, the hybrid b-tree was chosen for the dentries metadata structure by virtue of the compactness of the metadata structure representation and the efficient lookup mechanism. Dentries are ordered on the basis of the name's hash. Every node of the dentries b-tree has: (1) a dirty bitmap for tracking modified dentries, (2) a lock bitmap for exclusive locking of particular dentries without the necessity to lock the whole b-tree node. Actually, it is expected that the dentries b-tree could contain few nodes on average, because two nodes (8 KB in size) of the dentries b-tree are capable of storing about 400 dentries.
Any file system is dedicated to storing user data in the form of files. Various files could have different lengths, and the inode stores information about the length of the file in blocks and bytes. Also, the file system is responsible for logical block allocation in the case of adding new data. Generally speaking, the file system driver always tries to allocate a contiguous sequence of logical blocks for any file's content. The contiguous sequence of logical blocks can be described by an extent (starting LBA and length) as the most compact descriptor of such a sequence. However, it is not always possible to allocate a contiguous sequence of free logical blocks by virtue of possible fragmentation of the file system's volume space by delete and truncate operations. As a result, the allocation operation can be fulfilled by means of allocating several smaller contiguous sequences of logical blocks in various locations on the volume. Moreover, an SSDFS extent cannot be greater than the segment size (Fig. 40). Finally, all the mentioned factors result in the description of any file's content by means of a set of extents.

Figure 40: Extents b-tree architecture.
The SSDFS raw extent (Fig. 40) describes a contiguous sequence of logical blocks by means of a segment ID, the logical block number of the starting position, and a length. By default, the SSDFS inode has a private area of 128 bytes in size, and the SSDFS extent has 16 bytes in size. As a result, the inode's private area is capable of storing not more than 8 raw extents.

Generally speaking, the hybrid b-tree was chosen with the goal of efficiently storing a larger number of raw extents. First of all, it was taken into account that file sizes can vary a lot on the same file system's volume. Moreover, the size of the same file could vary significantly during its lifetime. Finally, the b-tree is a really good mechanism for storing the extents compactly with a very flexible way of increasing or shrinking the reserved space. Also, the b-tree provides a very efficient technique of extent lookup. Additionally, SSDFS uses compression that guarantees a really compact storage of semi-empty b-tree nodes. Moreover, the hybrid b-tree provides a way to mix index and data records in the hybrid nodes with the goal of achieving a much more compact representation of the b-tree's content.

Moreover, it needs to be pointed out that the extents b-tree's nodes group the extent records into forks (Fig. 40). Generally speaking, the raw extent describes the position on the volume of some contiguous sequence of logical blocks without any details about the offset of this extent from the file's beginning. As a result, the fork (Fig. 40) describes the offset of some portion of the file's content from the file's beginning and the number of logical blocks in this portion. Also, the fork contains space for three raw extents that are able to define the positions of three contiguous sequences of logical blocks on the file system's volume. Finally, one fork has 64 bytes in size. If anybody considers a b-tree node of 4 KB in size, then such a node is capable of storing about 64 forks with 192 extents in total. Generally speaking, even a small b-tree is able to store a significant number of extents and to determine the positions of the fragments of a generally big file. If anybody imagines a b-tree with two 4 KB nodes in total, where every extent defines the position of an 8 MB portion of a file, then such a b-tree is able to describe a file of 3 GB in total.
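The 16-byte raw extent and the 64-byte fork can be sketched directly from the sizes given above; the field names and widths are assumptions, chosen only so that the stated sizes hold.

```c
#include <stdint.h>

/* Hypothetical sketch of a 16-byte SSDFS raw extent: a contiguous run of
 * logical blocks addressed by segment ID and starting logical block. */
struct raw_extent {
	uint64_t seg_id;	/* logical segment that holds the run */
	uint32_t logical_blk;	/* starting logical block inside the segment */
	uint32_t len;		/* number of logical blocks */
};

/* Hypothetical sketch of a 64-byte fork: a portion of the file (offset and
 * length in blocks from the file's beginning) described by up to 3 extents. */
struct extents_fork {
	uint64_t start_offset;		/* file offset of the portion, in blocks */
	uint64_t blks_count;		/* length of the portion, in blocks */
	struct raw_extent extents[3];	/* where the portion lives on the volume */
};

_Static_assert(sizeof(struct raw_extent) == 16, "extent must stay 16 bytes");
_Static_assert(sizeof(struct extents_fork) == 64, "fork must stay 64 bytes");
```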
Deduplication. One of the known techniques of decreasing write amplification is the deduplication approach. Generally speaking, the key mechanism of deduplication is the detection of replication of the same data on the volume with the goal of storing the found duplicated fragment in only one place. As a result, it means that all files containing such deduplicated data should store the same extent in their extents b-trees. SSDFS uses a shared extents b-tree for implementation of the deduplication technique.

First of all, the SSDFS driver takes into account the size of a file. If the size is smaller than some threshold (for example, 4 KB - 8 KB), then such a file is not considered a deduplication target. Otherwise, the fingerprint of the first 8 KB portion of the file needs to be calculated (the size of the initial portion can be defined by a special threshold value). Then the presence of the calculated fingerprint in the shared extents b-tree needs to be checked. If no such fingerprint exists in the shared extents b-tree, then only the calculated fingerprint has to be stored in the b-tree. Moreover, fingerprints for the rest of the file do not need to be calculated in such a case.

Figure 41: Deduplication mechanism of shared extents b-tree.

Oppositely, if the same fingerprint for the first 8 KB of the file is already present in the shared extents b-tree, then the fingerprints for the rest of the file need to be calculated and their presence in the shared extents b-tree needs to be checked. Again, the calculated fingerprints are stored in the shared extents b-tree if no such fingerprints were found. Otherwise, the file system driver has to store the extents of the found deduplicated fragments into the extents b-trees of the particular files (Fig. 41).

Generally speaking, the shared extents b-tree will keep only one fingerprint of the first 8 KB for all files that have unique content. Oppositely, duplicated file content will be detected during an attempt to store a second copy of the same file. However, the detection of this duplication will result in deduplication of only the first 8 KB of the file and in storing the fingerprints for the rest of the duplicated file in the shared extents b-tree. Finally, the third (and any subsequent) attempt to store the duplicated file will result in complete deduplication of the file's content.

Figure 42: Record types in shared extents b-tree.
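The first-portion fingerprint check can be sketched as follows. Everything here is a stand-in for illustration: the FNV-1a hash, the thresholds, and the tiny in-memory table replace the stronger fingerprint and the shared extents b-tree lookup that SSDFS would actually use.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define DEDUP_MIN_FILE_SIZE	8192	/* hypothetical threshold: smaller files are skipped */
#define DEDUP_PORTION_SIZE	8192	/* fingerprint is taken over the first 8 KB */

/* Stand-in fingerprint (FNV-1a); SSDFS would use a stronger fingerprint. */
static uint64_t fingerprint(const uint8_t *data, size_t len)
{
	uint64_t h = 1469598103934665603ULL;
	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/* Tiny in-memory stand-in for the shared extents b-tree (illustration only). */
static uint64_t known_fp[1024];
static size_t known_count;

static bool shared_tree_contains(uint64_t fp)
{
	for (size_t i = 0; i < known_count; i++)
		if (known_fp[i] == fp)
			return true;
	return false;
}

static void shared_tree_insert(uint64_t fp)
{
	if (known_count < 1024)
		known_fp[known_count++] = fp;
}

/* Decide whether the rest of the file must be fingerprinted:
 * only when the first portion is already known to the shared tree. */
static bool dedup_first_portion(const uint8_t *file, size_t size)
{
	if (size < DEDUP_MIN_FILE_SIZE)
		return false;	/* too small to be a deduplication target */

	uint64_t fp = fingerprint(file, DEDUP_PORTION_SIZE);
	if (!shared_tree_contains(fp)) {
		shared_tree_insert(fp);	/* remember the unique first portion only */
		return false;		/* no need to fingerprint the rest */
	}
	return true;	/* possible duplicate: fingerprint the rest of the file */
}
```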
The SSDFS shared extents b-tree is able to store several record types (Fig. 42 - 43): (1) deduplicated extent record, (2) fingerprint record, (3) invalidation record. The deduplicated extent records are ordered by fingerprint value, and each record contains a fingerprint, an extent (segment ID, logical block, length), and a reference counter value. Generally speaking, the goal of these records is to find the deduplicated extents on the basis of a fingerprint value.

Figure 43: Shared extents b-tree architecture.

The fingerprint records are ordered by segment ID and logical block values, and the responsibility of such records is to provide a way to find the fingerprint value on the basis of knowledge of the segment ID and logical block values. Every time the information about a deduplicated extent needs to be added, both a deduplicated extent record and a fingerprint record need to be inserted into the shared extents b-tree. Moreover, the reason to have two types of records is the necessity to use the fingerprint record in the case of file deletion or truncation. Generally speaking, only the extent data (segment ID, logical block, length) is available at the beginning of a delete or truncate operation. It means that the extent data can be used for searching for the fingerprint value. Finally, the found fingerprint value can be used for searching for the deduplicated extent record that has to be found with the goal of decrementing the reference counter (or completely removing the record if the reference counter is equal to zero).

The third record type is the invalidation record that implements a mechanism of delayed invalidation of extents. Generally speaking, it means that a big file doesn't need to be deleted (or truncated) immediately; instead, it is possible to create invalidation record(s) with a pointer to the whole extents b-tree (or a sub-tree) and to store the invalidation record(s) into the shared extents b-tree at first. The processing of invalidation records takes place in the background by a dedicated thread (in the idle state of the file system driver, for example). First of all, the thread has to extract an invalidation record and to check the presence of a deduplicated extent record for the extent under invalidation. If the shared extents b-tree contains the deduplicated extent record for this extent, then only the reference counter needs to be decremented. Otherwise, if the shared extents b-tree has no deduplicated extent record, or the reference counter has reached the nil value, then the requested extent needs to be invalidated. Moreover, the corresponding deduplicated extent and fingerprint records have to be deleted from the shared extents b-tree in the case of a zeroed reference counter. Finally, the invalidation record has to be deleted from the shared extents b-tree as well.
SSDFS introduces a dentry metadata structure of fixed size that is able to store only 12 inline symbols (an 8.3 filename) with the goal of achieving efficient operations with the dentries b-tree. However, it means that the dentry itself is capable of storing short names only. From one viewpoint, files/folders have short names very frequently. As a result, names are stored only in dentries with high frequency. Moreover, the fixed size of the dentry provides a simple and fast way to search for a particular dentry in the b-tree node. Oppositely, a varied size of dentries makes the searching algorithm more complex and inefficient, and it requires adding some additional metadata in the node.

As a result, SSDFS stores only the short names in the dentries and uses the shared dictionary for storing the long names. The shared dictionary's responsibility is to gather the long names created on the file system's volume. Generally speaking, gathering the names in one place means that the shared dictionary keeps only one copy of a name that can be used for different files. Also, the shared dictionary provides the basis for the technique of substring deduplication. Finally, the shared dictionary provides the way to keep the names in a very compact representation.

Moreover, one of the possible strategies of the shared dictionary is not to delete names at all. From one point of view, it means that such a strategy is able to decrease the number of update operations for the shared dictionary. From another point of view, if an end-user tries to use the name of a deleted file for a newly created one, then such a name doesn't need to be added into the shared dictionary because it will already be there. However, it needs to be pointed out that the strategy of not using the delete operation could have a side effect. Generally speaking, malicious activity of name generation is able to result in unmanageable growth of the shared dictionary. However, the substring deduplication technique is able to manage such malicious activity efficiently.

Figure 44: Shared dictionary b-tree architecture.
The shared dictionary is the hybrid b-tree whose root node is stored into the superblock (Fig. 44). Every hybrid or leaf node of the shared dictionary b-tree includes: (1) lookup table1, (2) lookup table2, (3) hash table, and (4) strings area (Fig. 47).
Figure 47: Shared dictionary b-tree's node structure.
The lookup table1 is located in the node's header and it implements clustering or grouping of the items of lookup table2. By design, lookup table1 is capable of keeping only 20 items. Every item (Fig. 47) contains: (1) hash value, (2) starting index in the lookup table2, and (3) number of items in the group. Generally speaking, the responsibility of lookup table1 is to provide the mechanism of fast search of some items' cluster in the lookup table2 on the basis of a hash value.
As a result, the found item in the lookup table1 is the basis for further search in the lookup table2. This table (lookup table2) is located at the bottom of the node (Fig. 47) and its goal is to provide the mechanism for searching the position of a name's prefix (or starting keyword). Every item of lookup table2 (Fig. 47) contains: (1) hash value, (2) prefix length, (3) number of deduplicated names, and (4) index in the hash table. Generally speaking, the lookup table2 describes positions of names' prefixes in the strings area.
Figure 45: Names deduplication mechanism.
Figure 46: Deduplicated strings representation.
Finally, the hash table (Fig. 47) is located above the lookup table2. It is responsible for describing every name in the strings area. Every item of the hash table contains: (1) hash value, (2) name offset, (3) name length, and (4) name type. Generally speaking, the hash table implements the mechanism to define the position and the length of the suffix of a deduplicated name because the full name is constructed from the prefix and the suffix (Fig. 45 - 46). Finally, it needs to find the prefix from the lookup table2 and the suffix from the hash table for the extraction of a full name. The last item of the node is the strings area that keeps the full and deduplicated names. Generally speaking, such a b-tree is an efficient mechanism for storing and searching strings of variable length.
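The three tables of a shared dictionary node can be sketched as follows; these item layouts are illustrative assumptions and are not the real on-disk structures.

/* Illustrative items of a shared dictionary node (assumed layout).
 * lookup1 groups lookup2 items; lookup2 locates a name prefix in the
 * strings area; the hash table locates each name's suffix. */
#include <stdint.h>

struct lookup1_item_sketch {	/* node header, up to 20 items */
	uint64_t hash;		/* lowest hash value of the group */
	uint16_t start_index;	/* first item of the group in lookup2 */
	uint16_t items_count;	/* number of lookup2 items in the group */
};

struct lookup2_item_sketch {	/* bottom of the node */
	uint64_t hash;		/* hash of the prefix (starting keyword) */
	uint16_t prefix_len;	/* length of the shared prefix */
	uint16_t names_count;	/* deduplicated names sharing the prefix */
	uint16_t hash_index;	/* first related item in the hash table */
};

struct hash_item_sketch {	/* one per name in the strings area */
	uint64_t hash;		/* hash of the full name */
	uint16_t name_offset;	/* offset of the suffix in the strings area */
	uint8_t name_len;	/* suffix length */
	uint8_t name_type;	/* full name or deduplicated suffix */
};
/* full name = prefix (found via lookup2) + suffix (found via hash table) */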
An extended attribute represents a pair of name and value that is associated with a file or a folder. It is possible to say that extended attributes play the role of an extension of the regular attributes that are associated with inodes. Frequently, extended attributes are used with the goal to provide additional functionality in a file system, for example, additional security features such as Access Control Lists (ACL). The name of an extended attribute is a null-terminated string and it is defined in the fully qualified namespace form (for example, security.selinux). Currently, there exist the security, system, trusted, and user classes of extended attributes. Usually, VFS limits the length of an xattr's name to 255 bytes and the size of the value to 64 KB.
Figure 48: Extended attributes (xattr) b-tree architecture.
Figure 49: Extended attributes b-tree's node structure.
SSDFS file system uses a metadata structure of fixed size (64 bytes) for representation and storing of the xattr record on a file system's volume. This metadata structure is capable of keeping 16 symbols of the name inline and a value of 32 bytes (Fig. 49). However, the namespace class is represented not by a string itself but by means of a special field of name type. Generally speaking, it means that if the name or the value is smaller than the declared limit then it can be stored inline in the xattr record. Otherwise, if a name is longer than 16 symbols then the initial portion of the name will be stored inline in the xattr record but the whole name has to be stored into the shared dictionary. Also, if a value is bigger than 32 bytes then the blob has to be stored in some logical block(s) of the volume and the xattr record will keep the extent that describes the position of this blob. Moreover, it is possible to employ the shared extents b-tree for storing the xattr's blobs in the range from 32 bytes to 4 KB. Additionally, the shared extents b-tree is able to deduplicate the blobs with identical content.
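A possible shape of the 64-byte xattr record is sketched below; the field set is an assumption that only mirrors the limits mentioned above (16 inline name symbols, a 32-byte inline value, an extent for bigger blobs).

/* Illustrative 64-byte xattr record (assumed layout). Short names and
 * small values are stored inline; longer ones spill into the shared
 * dictionary and into an extent on the volume, respectively. */
#include <stdint.h>

#define XATTR_INLINE_NAME 16
#define XATTR_INLINE_VALUE 32

struct xattr_record_sketch {
	uint64_t name_hash;			/* hash of the full name */
	uint8_t name_type;			/* namespace: user, security, ... */
	uint8_t name_len;			/* full name length */
	uint8_t flags;				/* inline value or external blob */
	uint8_t reserved;
	uint32_t blob_len;			/* value size */
	uint8_t inline_name[XATTR_INLINE_NAME];	/* first 16 symbols of the name */
	union {
		uint8_t inline_value[XATTR_INLINE_VALUE];	/* value <= 32 bytes */
		struct {					/* value > 32 bytes */
			uint64_t seg_id;
			uint32_t logical_blk;
			uint32_t len;
		} blob_extent;
	} value;
};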
SSDFS xattr tree (Fig. 48) is implemented as a hybrid b-tree with the root node stored in the inode's private area. By default, the private area of an inode is 128 bytes in size. Usually, a file owns the extents b-tree and a folder owns the dentries b-tree. Finally, it means that the first 64 bytes of the private area will be used by the root node of the extents or dentries b-tree and the remaining 64 bytes can be used for the root node of the xattr b-tree. Also, if a file or folder has only one extended attribute then the xattr record (64 bytes) can be stored inline in the second half of the private area.
The xattr record is a metadata structure of fixed size. Generally speaking, the goal of such approach is to keep an array of xattr records in the node because the fixed size of every item in the array provides a very efficient mechanism for lookup, access and modification operations. Moreover, the header of the b-tree's node contains a lookup table (Fig. 49) that is capable of storing 22 records. The goal of such lookup table is the clustering of xattr records in the main area for implementing an efficient search mechanism. Every item in the lookup table is a hash value of an extended attribute's name. Generally speaking, the hash value identifies the position of the starting xattr record in a group (or cluster) of xattr records. Every such starting record is located at a fixed position in the main area of the node. As a result, the lookup table provides the way to restrict the search to some cluster in the main area.
Generally speaking, the case of a significant number of extended attributes for the same file/folder is very rare. It means that it makes sense to consider an xattr record of bigger size (128 bytes, for example) with the goal to optimize the operations with xattr records by increasing the inline area of the value (blob). Moreover, it is possible to consider an inode's record of bigger size (512 bytes, for example). Such inode will be able to keep about 5 inline xattr records. Additionally, it is possible to implement a shared xattrs b-tree that will be able to store xattr records of different files/folders in one b-tree. However, even if anybody considers only the dedicated xattrs b-tree then a b-tree with 2 nodes of 4 KB in size is capable of storing about 128 xattr records in total.
The write amplification issue is the crucial problem for the case of flash-oriented and flash-friendly file systems. It is possible to state that this issue is the key reason of SSD lifetime shortening. Every particular file system has unique reasons for the write amplification issue and it contains some techniques to decrease or to eliminate this problem. SSDFS file system uses the following techniques for resolving the write amplification issue: (1) compression, (2) small files compaction scheme, (3) logical extent concept, (4) Diff-On-Write approach, (5) deduplication, (6) inline files.
Compression. SSDFS file system widely uses compression both for user data and for metadata. The current file system driver implementation supports zlib and LZO compression. Moreover, SSDFS file system uses a special compaction scheme which gathers several compressed fragments (even of different files) into one NAND flash page inside of a special log's area (diff updates or journal areas). Generally speaking, this compaction technique provides the opportunity to use only one NAND flash page for several compressed fragments of different files instead of several pages. As a result, decreasing the number of used NAND flash pages decreases the number of I/O operations and it creates the opportunity to reduce the write amplification issue.
Small files compaction. A number of research papers have investigated the state of aged file system volumes and elaborated a vision of the distribution of data amongst various types. As a result, it has been found that many file system volumes contain a significant number of small files. Some researchers estimate the number of small files as 61% of the total number of files on the volume. SSDFS file system introduces a special compaction scheme for the case of small files. Generally speaking, the PEB's log can contain a special journal area that is used for gathering several small files into one NAND flash page. As a result, this compaction technique reduces the number of I/O operations and is able to decrease the write amplification factor.
Inline content. SSDFS file system has an inode format with reservation of 128 bytes for the private area (by default). Moreover, increasing the size of the inode makes the private area bigger. Generally speaking, the private area can be used for keeping inline the content of small files, extent, dentry or xattr records. As a result, it means that keeping data inline in the inode's private area creates the opportunity not to allocate the logical blocks (NAND flash pages) for storing these data or metadata. Finally, the mechanism of keeping data inline is the way to reduce the write amplification issue and to improve the file system's performance.
Logical extent concept. SSDFS file system implements the logical extent concept as an additional mechanism of decreasing the write amplification issue. Generally speaking, the Copy-On-Write policy is the main technique of data updates in the scope of any LFS file system. It means that the necessity to update some data on the volume results in writing the actual state of data into a new position (logical block) on the file system's volume. As a result, the main problem of such approach is the necessity to update metadata (a block mapping table, for example) for every such update with the goal to track the position of the actual state of data. Finally, it results in increasing the number of I/O operations and makes the write amplification issue a more severe problem.
In contrast, SSDFS file system tracks the position of any data on the volume by means of a logical extent. The logical extent structure includes: (1) segment ID, (2) logical block number inside of this segment, (3) number of logical blocks in the extent. Moreover, SSDFS file system implements the PEBs migration technique. Finally, it means that if some logical block is stored into some segment then the logical extent remains the same during any update or modification operations with data inside of this logical extent. Generally speaking, the logical extent will have the same value until the data is moved into another segment. As a result, the nature of the logical extent provides the opportunity not to update the metadata structure that tracks the position of data on the volume by means of logical extents. Moreover, this technique reduces the write amplification issue.
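As a sketch of this property, the hypothetical helper below (not the driver's code) shows that metadata keeping a logical extent has to be rewritten only when the data actually changes its place inside the logical address space, which PEB migration never does.

/* Illustrative logical extent and a hypothetical check used on flush:
 * PEB migration changes only the physical placement, so the logical
 * extent stays the same and no metadata update is required. */
#include <stdbool.h>
#include <stdint.h>

struct logical_extent_sketch {
	uint64_t seg_id;	/* logical segment ID */
	uint32_t logical_blk;	/* first logical block inside the segment */
	uint32_t len;		/* number of logical blocks in the extent */
};

static bool extent_needs_metadata_update(const struct logical_extent_sketch *old_place,
					  const struct logical_extent_sketch *new_place)
{
	return old_place->seg_id != new_place->seg_id ||
	       old_place->logical_blk != new_place->logical_blk ||
	       old_place->len != new_place->len;
}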
Diff-On-Write approach. The Copy-On-Write (COW) policy is the central technique of a Log-structured file system. The goal of this policy is to overcome a peculiarity of NAND flash. Namely, a clean physical page of a NAND chip can be written only once, and a whole physical erase block has to be erased for the operation of re-writing a page. Usually, a physical erase block includes a bunch of pages. But, from another point of view, the COW policy can be treated as a reason of the write amplification issue, because every update of a file's data results in moving the updated block of the file into a new physical page of NAND flash (Fig. 50).
The write amplification issue (Fig. 51) has several reasons. First of all, the necessity to overcome write and read disturbance effects of NAND flash and the necessity to wear NAND flash erase blocks uniformly result in the wear-leveling policy. This policy dictates regular moving of user data from an aged segment into a new one. The COW policy as the basic technique of a Log-structured file system can be treated as another reason of the write amplification issue. And the final reason of the write amplification issue could be an inefficient Garbage Collection policy.
Figure 50: Copy-On-Write policy side effect.
Figure 51: Write amplification issue.
The Main, Diff Updates and Journal areas are the foundation of the Diff-On-Write approach (Fig. 52). This approach distinguishes the main, unchangeable ("cold") part of a file's data. These data are stored in the Main area. For example, a file's contiguous 4 KB binary stream can be treated as "cold" data. Such piece of data can be saved into one physical page of the Main area. And the read-only nature of this physical page can be provided by means of saving all updates of this page into another area (Diff Updates area). For example, File 1 has the string "Hello" as "cold" data on Fig. 52. The Journal area provides shared space for gathering updates of different files. Joining all current updates in one area looks like a journal and provides gathering of all "hot" data in one area. For example, Fig. 52 shows a situation when one block of the Journal area contains updates for File 1 (string "Good weather.") and for File 2 (string "Let's walk."). The Journal area can be imagined as a mixed sequence of updates for different files. As a result, if the Journal area in one or several logs has gathered updates of one file with accumulated size equal to the physical page size then it makes sense to join these updates in one block of the Diff Updates or Main areas. The updates need to be stored in the Diff Updates area for the case of presence of updates from different parts of the file. And, finally, a sequence of contiguous updates needs to be stored into one block of the Main area.
The Copy-On-Write (COW) policy means that every updated block should be copied into a new place. The Diff-On-Write approach suggests to store only the diff between the initial and the updated state of data for every update (Fig. 53).
Figure 52: Diff-On-Write approach.
Figure 53: Copy-On-Write vs. Diff-On-Write.
Fig. 54 shows different examples of a diff. The diff can be the result of: (1) file creation; (2) adding data into an existing file; (3) update of some file's part.
The Diff-On-Write approach suggests to gather small parts or small updates of different files in one block of the Journal area (Fig. 55). It is a well known fact that about 61% of all files on a volume are smaller than 10 KB. Such technique suggests the way of decreasing the write amplification factor and decreasing over-provisioning for the case of small files.
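The area-selection idea can be summarized with a small sketch; the helper and its thresholds below are hypothetical and only illustrate the policy described above (cold full blocks to the Main area, page-sized updates of one file to the Diff Updates area, small mixed updates of many files to the Journal area).

/* Sketch of the Diff-On-Write area selection (hypothetical helper,
 * not the SSDFS driver code). */
#include <stddef.h>

enum log_area { MAIN_AREA, DIFF_UPDATES_AREA, JOURNAL_AREA };

#define PAGE_SIZE 4096u

static enum log_area choose_area(size_t update_bytes, int is_contiguous_cold)
{
	if (update_bytes >= PAGE_SIZE && is_contiguous_cold)
		return MAIN_AREA;		/* whole "cold" block of one file */
	if (update_bytes >= PAGE_SIZE)
		return DIFF_UPDATES_AREA;	/* page-sized updates of one file */
	return JOURNAL_AREA;			/* small diffs of different files */
}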
Moreover, such approach gathers frequent updates in a dedicated "hot" area. As a result, it can improve the efficiency of the GC policy.
The Diff-On-Write approach provides a basis for decreasing the write amplification factor in the case of gradual growth of a file's content (Fig. 56). Let's suppose that a file contains 1 KB of data after creation. Then an additional 1 KB will be added on another day, for example. And, finally, 2 KB of data will be added after several days. The first two 1 KB diffs can be stored in the Journal areas of different logs. Every diff will share the space of a physical page with updates of other files. Finally, the file content will be saved into the Main area of a log with joining of all available updates.
The Diff-On-Write approach provides an especially good basis for decreasing the write amplification factor in the case of mixed workloads. Let's assume that the workload contains both adding data to the end of a file and updating internal areas of the file (Fig. 57). First of all, the diffs can be stored into the Journal area of different logs. Then the diffs of one file can be moved into the Diff Updates area with the goal to join the updates of different areas of the file into one block. And, finally, a sequence of contiguous diffs from the Diff Updates and Journal areas can be joined into one block of the Main area.
Figure 54: Diff concept.
Figure 55: Technique of joining files' diffs in journal area.
Deduplication. The technique of deduplication is the well known and proven mechanism of exclusion of duplicated content of files. The essence of this technique is the detection of data duplication on the basis of fingerprint calculation and comparison of the calculated fingerprint value with the hash table of existing fingerprints. Generally speaking, the deduplication technique is a very efficient mechanism of reducing the write amplification factor by virtue of the opportunity to share the same deduplicated content amongst several files.
SSDFS file system uses the shared extents b-tree as the key mechanism of the deduplication implementation. Generally speaking, the shared extents b-tree has the goal to keep a fingerprint value and an associated extent structure. The fingerprint value is used for comparison and detection of the duplication event, and the extent structure is used for sharing the deduplicated data fragment amongst different files. However, the deduplication technique could be a compute-intensive task because a file system's volume could contain a small number of duplicated fragments or have no duplications at all. Also the calculated fingerprint values need to be kept in some metadata structure that has to be stored on the file system's volume. As a result, the deduplication subsystem is capable of decreasing the file system driver's performance.
Figure 56: Technique of main and journal areas interaction in Diff-On-Write approach.
Figure 57: Technique of journal and diff updates areas interaction in Diff-On-Write approach.
The architecture of SSDFS's deduplication subsystem is designed with these possible drawbacks taken into account. First of all, SSDFS file system driver calculates the fingerprint value of the first 8 KB of the file only, if the file is bigger than some threshold value (for example, 8 KB in total). The next step is searching for the identical fingerprint value in the shared extents b-tree. If no fingerprint value has been found then the calculated fingerprint value should be stored into the shared extents b-tree. Moreover, the rest of the file is simply ignored by means of skipping the calculation of fingerprints. Oppositely, if an identical fingerprint value for the first 8 KB of the file was found in the shared extents b-tree then it needs to calculate the fingerprint values for the rest of the file and to try to find the identical fingerprints in the tree. Again, if no identical fingerprints were found then the calculated fingerprint values need to be stored into the shared extents b-tree. But the associated extent structures need to be used for the file's content deduplication in the case of detection of identical fingerprint values in the shared extents b-tree. Generally speaking, it means that the shared extents b-tree is ready to deduplicate the file's content only in the case of detection of the third case of data duplication. Moreover, it means that the file system volume will have two copies of identical data on the volume and that could increase the reliability of data storing.
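The described flow can be sketched as follows. The helpers fingerprint_of(), tree_lookup(), tree_insert() and share_existing_extent() are hypothetical placeholders for the real fingerprinting and shared extents b-tree operations, and the 8 KB chunk size is only the example threshold mentioned above.

/* Illustrative sketch of the deduplication flow on flush (assumed logic,
 * hypothetical helpers). */
#include <stdbool.h>
#include <stddef.h>

#define DEDUP_CHUNK (8u * 1024u)	/* fingerprint granularity (assumed) */

/* Placeholders for the real fingerprinting and b-tree operations. */
unsigned long fingerprint_of(const void *buf, size_t len);
bool tree_lookup(unsigned long fingerprint);
void tree_insert(unsigned long fingerprint, const void *buf, size_t len);
void share_existing_extent(unsigned long fingerprint);

void dedup_on_flush(const char *data, size_t size)
{
	unsigned long fp;
	size_t off;

	if (size <= DEDUP_CHUNK)
		return;				/* small files are not deduplicated */

	fp = fingerprint_of(data, DEDUP_CHUNK);
	if (!tree_lookup(fp)) {
		/* The first chunk is unique: remember it and skip the rest. */
		tree_insert(fp, data, DEDUP_CHUNK);
		return;
	}

	/* The first chunk matched: fingerprint the rest of the file and
	 * share the extents of every chunk already known to the tree. */
	for (off = 0; off < size; off += DEDUP_CHUNK) {
		size_t len = (size - off < DEDUP_CHUNK) ? size - off : DEDUP_CHUNK;

		fp = fingerprint_of(data + off, len);
		if (tree_lookup(fp))
			share_existing_extent(fp);	/* reuse the stored copy */
		else
			tree_insert(fp, data + off, len);
	}
}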
A Garbage Collector (GC) is an inevitable subsystem of any LFS file system because of the Copy-On-Write (COW) policy. Generally speaking, the simplified way of thinking about a volume of an LFS file system is to imagine the volume as a sequence of logs that fill the volume's space sequentially. Moreover, the data update operations create the volume's state when old logs are a mixture of valid and invalid data (or completely invalid data). It means that the responsibility of the GC subsystem is to move valid data from old logs into new ones and to erase completely invalid erase blocks (segments) with the goal to prepare completely clean erase blocks for allocation for the new logs. Generally speaking, GC activity is a vital but auxiliary action that could compete with the regular file system's I/O operations. Finally, GC activity degrades the file system's performance dramatically and in a completely unpredictable way. It is possible to say that the GC overhead management problem is the crucial and the key problem of any LFS file system and it needs to be taken into account at the initial stage of a file system's architecture design.
Segment bitmap. Any classic GC subsystem of an LFS file system is implemented as a thread that selects in the background the aged segments with the goal to move valid data into a new log(s) and to apply the erase operation to these segments. Generally speaking, the important problem of such approach is to find an aged segment with as small as possible number of valid blocks because this is the possible strategy to manage the GC overhead efficiently. It needs to be pointed out that SSDFS file system doesn't use this classical way of GC overhead management as the basic and fundamental mechanism of GC operations. However, the technique of searching for a segment with the minimal number of valid blocks can be used in the environment of a critical lack of free space on the volume.
SSDFS file system uses the segment bitmap as the basic metadata structure for searching the segments with minimum overhead for GC activity. The responsibility of the segment bitmap is the tracking of segments' state (clean, using, used, pre-dirty, dirty). The key responsibility of the main GC thread is: (1) detecting the idle state of the file system driver, (2) defining the total I/O budget that can be employed by the GC subsystem, (3) selecting segments for processing by the GC subsystem on the segment bitmap basis, (4) distribution of the total I/O budget between the GC threads of particular PEBs. Finally, the GC thread of a particular PEB has to move the cold data gradually in the background on the basis of the determined I/O budget.
The using state means that the segment has free logical blocks. This state of segments doesn't need to be processed by the GC subsystem. The used state means that the whole segment is filled by valid blocks. Finally, it means that such segment contains the cold data and it is the most expensive type of segments for processing by the GC subsystem. However, a flash-friendly file system could delegate the migration of such cold data to the SSD side and not process it by the GC subsystem.
The dirty state means that no valid blocks exist in such segment and the GC subsystem needs to apply only the erase operation to all erase blocks in the dirty segment. Generally speaking, it is the cheapest case of segment processing by the GC subsystem and the dirty segments are the key target for the GC subsystem. The pre-dirty state means that the segment contains both valid and invalid logical blocks. This segment's state has the lower priority for the GC subsystem and this state will be used only in the case of complete absence of dirty segments. Finally, the key technique of processing the pre-dirty segment is to create the gradual migration of cold data by means of adding GC operations to the regular I/O operations with data in the segment.
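The states and the implied GC priority can be summarized with a small sketch; the enum and the priority mapping are illustrative assumptions rather than the driver's actual encoding.

/* Sketch of the segment states tracked by the segment bitmap and of the
 * GC priority implied by the text (illustrative, not the driver code). */
enum seg_state_sketch {
	SEG_CLEAN,	/* no data, ready for a new log */
	SEG_USING,	/* still has free logical blocks - skip for GC */
	SEG_USED,	/* fully valid (cold) data - most expensive for GC */
	SEG_PRE_DIRTY,	/* mixture of valid and invalid blocks */
	SEG_DIRTY,	/* only invalid blocks - just erase, cheapest */
};

/* Lower value = higher priority for the GC subsystem. */
static int gc_priority(enum seg_state_sketch state)
{
	switch (state) {
	case SEG_DIRTY:		return 0;	/* erase only */
	case SEG_PRE_DIRTY:	return 1;	/* used only if no dirty segments */
	case SEG_USED:		return 2;	/* cold data - delegate to SSD's FTL */
	default:		return 3;	/* clean/using - nothing to do */
	}
}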
PEBs migration scheme. The migration scheme is the fundamental technique of GC overhead management in the SSDFS file system. The key responsibility of the migration scheme is to guarantee the presence of data in the same segment for any update operations. Generally speaking, the migration scheme's model is implemented on the basis of association of an exhausted PEB with a clean one. The goal of such association of two PEBs is to implement the gradual migration of data by means of the update operations in the initial (exhausted) PEB. As a result, the old, exhausted PEB becomes invalidated after complete data migration and it will be possible to apply the erase operation to convert it into the clean state. Moreover, the destination PEB in the association replaces the initial PEB for some index in the segment and, finally, it becomes the only PEB for this position. Namely, such technique implements the concept of the logical extent with the goal to decrease the write amplification issue and to manage the GC overhead, because the logical extent concept excludes the necessity to update metadata that is tracking the position of user data on the file system's volume. Generally speaking, the migration scheme is capable of decreasing the GC activity significantly by means of excluding the necessity to update metadata and by means of self-migration of data between PEBs that is triggered by regular update operations.
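A minimal sketch of the migration pair idea is given below; the structure and the helper are hypothetical and only illustrate how a regular update can migrate data as a side effect of the write itself.

/* Illustrative migration pair: an exhausted "source" PEB is associated
 * with a clean "destination" PEB; every regular update writes the new
 * block state into the destination and invalidates the old copy in the
 * source. */
#include <stdbool.h>
#include <stdint.h>

struct peb_migration_pair_sketch {
	uint64_t src_peb_id;	/* exhausted PEB under migration */
	uint64_t dst_peb_id;	/* clean PEB receiving new writes */
	uint32_t src_valid_blks;	/* valid blocks still left in the source */
};

/* Bookkeeping sketch for a regular update: the new block state is written
 * into the destination PEB elsewhere; here we only account for the fact
 * that the old copy in the source PEB became invalid. Returns true when
 * the source PEB holds no valid blocks and can simply be erased. */
static bool account_update(struct peb_migration_pair_sketch *pair)
{
	if (pair->src_valid_blks > 0)
		pair->src_valid_blks--;
	return pair->src_valid_blks == 0;
}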
Hot/warm data self-migration. The important addition to the migration scheme is the technique of hot/warm data self-migration. It means that any update operation in the environment of two PEBs' association results in moving data from the exhausted PEB into the new one. Finally, if a PEB contains only hot data then all data is able to migrate between PEBs by means of regular update operations without the necessity to employ the GC activity. Moreover, it is possible to delay applying the erase operation to the completely invalidated PEB. However, the important peculiarity of such approach is to provide enough time for complete migration of valid data between PEBs. If a file system's volume contains enough clean PEBs then it will be possible to finish the data migration by means of regular update operations only (without using the GC service). However, if a PEB contains a significant amount of cold valid blocks or the volume hasn't enough clean PEBs then the migration process needs to be stimulated by means of GC activity. The key item of the stimulation activity is the PEB's dedicated GC thread. Generally speaking, the responsibility of such GC thread is to orchestrate the gradual migration of cold data on the basis of the I/O budget allocated for a particular GC thread. The goal of such approach is to minimize the GC threads' activity and to guarantee the stable file system driver's performance for regular I/O operations. Finally, this policy has to exclude the degradation of the file system's performance because of GC threads' activity and to prepare enough free space for file system operations.
Overprovisioning is a widely used technique of reservation of some amount of SSD's erase blocks (for example, 20% of the whole volume) with the goal to exchange the bad erase blocks for good ones from the reserved pool. One of the critical reasons of the presence of bad erase blocks could be the high number of erase cycles because of the write amplification issue and significant GC activity. Generally speaking, decreasing the write amplification factor and eliminating the GC activity are able to prolong the SSD lifetime because of the capability to reduce the number of erase cycles used for auxiliary file system activity (for example, GC activity). Moreover, it also means the opportunity to prolong the lifetime of the main pool of SSD's erase blocks. As a result, the overprovisioning pool can be decreased or it could be used for prolongation of the SSD lifetime.
Pre-allocated state. SSDFS file system introduces a special pre-allocated state of logical blocks. Generally speaking, the pre-allocated state defines the presence of some data portion or reservation without the allocation of the whole NAND flash page for the logical block. As a result, the pre-allocated state provides the opportunity to reserve some space on the file system's volume without the real allocation (delayed allocation). The goal of the pre-allocated state is not only to reserve some space (for metadata, for example) but it can also be used for small files or compressed data portions. If some file or compressed data portion is smaller than 4 KB in size then such data portion can be marked as pre-allocated and several data portions can be compacted or gathered into one NAND flash page. Generally speaking, such compaction scheme is able to reduce the number of used NAND flash pages and, as a result, could decrease the overprovisioning and prolong the SSD lifetime.
Compression + delta-encoding. SSDFS file system widely uses compression for more compact representation of user data and metadata. Moreover, compression is complemented by the compaction scheme with the goal to merge several compressed data fragments into one NAND flash page. Also SSDFS file system tries to use a delta-encoding technique. This delta-encoding technique implies not saving the whole modified logical block (for example, 4 KB in size) but extracting and saving only the modified area (for example, 128 bytes) in this logical block. Finally, it means that only 128 bytes instead of 4 KB will be flushed on the volume. SSDFS file system uses the delta-encoding technique with the compaction scheme for gathering several data portions into one NAND flash page. Finally, the goal of these techniques is to achieve the more compact representation of user data and metadata and to reduce the amount of write operations on the file system's volume. Generally speaking, it implies decreasing the number of erase cycles and the prolongation of the SSD lifetime.
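The delta extraction itself can be illustrated with a short sketch; make_delta() is a hypothetical helper that finds the smallest modified byte range of a 4 KB logical block, which could then be stored instead of the whole block.

/* Sketch of the delta-encoding idea: store only the modified byte range
 * of a 4 KB logical block instead of the whole block (illustrative
 * helper, not the SSDFS implementation). */
#include <stdint.h>

#define BLK_SIZE 4096u

struct block_delta_sketch {
	uint32_t offset;	/* first modified byte inside the logical block */
	uint32_t len;		/* length of the modified area */
	const uint8_t *data;	/* points into the new block state */
};

/* Find the smallest byte range that covers all differences between the
 * old and the new state of one logical block. */
static struct block_delta_sketch make_delta(const uint8_t *old_blk, const uint8_t *new_blk)
{
	uint32_t first = 0, last = BLK_SIZE;

	while (first < BLK_SIZE && old_blk[first] == new_blk[first])
		first++;
	while (last > first && old_blk[last - 1] == new_blk[last - 1])
		last--;

	struct block_delta_sketch d = { first, last - first, new_blk + first };
	return d;	/* d.len == 0 means the block did not change at all */
}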
The whole SSDFS file system's design and architecture tries to achieve the prolongation of SSD lifetime through decreasing the write amplification factor. Generally speaking, the suggested and implemented approaches are capable of improving the flush/write operations' performance. However, a potential side effect of such efforts could be some reduction of read operations' performance. But the asymmetric nature of read/write latency of NAND flash (read operations are faster and there are no seek operation penalties) gives a steady basis to expect a good performance of read operations for the case of SSDFS file system's architecture. Moreover, the aggregation of several PEBs into one segment, the PEB's dedicated threads model, and the GC I/O budget model provide rich opportunities for achieving a good file system's performance.
Any SSD represents a multi-die and multi-channel architecture. If some protocol of interaction with the SSD (for example, the open-channel SSD model) shares the knowledge of distribution of erase blocks among NAND dies then the file system is able to employ this knowledge for enhancing the I/O operations' performance. SSDFS file system uses the technique of aggregation of several PEBs inside of one segment. Generally speaking, if one segment aggregates several PEBs from different NAND dies then such approach provides the way to process the I/O requests in parallel by different NAND dies. As a result, it is capable of improving the performance of I/O requests in the scope of one segment significantly.
SSDFS file system has the following goals: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system's performance. To implement these goals SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment's PEBs pool, Main/Diff/Journal areas in the PEB's log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree.
It has been shown that 80% or more of the files are smaller than 32 KB. To manage this peculiarity, SSDFS file system uses the inode's private area to store small files inline. Moreover, a special compaction scheme was introduced that gathers several small files into one NAND flash page. Also, SSDFS file system uses block-level compression with the addition of the compaction scheme that keeps several compressed portions in one NAND flash page. Additionally, delta-encoding, the compaction scheme, and deduplication are employed for the case of big files.
The vast majority of files are deleted within a few minutes of their creation. One efficient technique of managing such a case is using the inode's private area for keeping inline files. The default raw SSDFS inode is able to store about 128 bytes of a file's content. But a bigger size of the raw inode is able to provide more space for inline files. Moreover, SSDFS file system gathers the content of files into a specialized user data segment. As a result, the deletion of files creates the self-invalidation effect for the case of the user data segment and that can decrease the GC activity or completely eliminate the GC overhead through the PEB migration scheme.
The median file age ranges between 80 and 160 days. 0.8% of the files are used essentially every day. A flash-friendly file system doesn't need to follow a strict wear-leveling scheme. It makes sense to delegate moving the cold data to the SSD's FTL once in 3 months (90 days). Extensive use of GC operations and a strict wear-leveling scheme for moving cold data by an LFS increases the write amplification issue. The PEB's migration scheme is able to provide free space in a cost-efficient manner. Compression and delta-encoding combined with the PEB's migration scheme are able to provide an easy and efficient mechanism for combining both hot and cold data in one PEB and to guarantee the space for gathering hot data updates with fast/easy migration between PEBs.
Several research works showed the growth of the files count per file system and the directories count per file system. One needs to expect at minimum 30K - 90K files per file system and 1K - 4K directories per file system.
To manage the growing demands for the number of files and folders on the file system volume, SSDFS file system employs an inodes b-tree that provides the way for easy increasing of the number of files and efficient management for the case of frequent delete/remove operations. Also, inline files provide the way to store small files in the inode itself without allocation of the volume's space.
File name length falls in the range from 9 to 17 characters. The peak occurs for file names with length of 12 characters.
SSDFS’s dentry is able to include 12 inline char-acters. The tail of longer name is stored into shared dictio-nary. The fixed size of dentry provides the efficient mech-42nism of dentries management. Mostly, file names will bestored into dentries only.
SSDFS raw inode is able to keep two inline dentries. It means that very frequently the raw inode is able to store the content of the dentries tree. Moreover, an SSDFS b-tree node is stored compressed. As a result, it implies that a small dentries b-tree could be represented in a very efficient way on the file system's volume.
There are many files deep in the namespace tree, especially at depth 7.
An SSDFS b-tree node stores several raw inodes. It means that the operation of reading one b-tree node is able to provide access to the whole or a part of the namespace tree. Also, SSDFS gathers b-tree nodes of the same type in one segment/PEB. As a result, the readahead operation is able to read several b-tree nodes. It implies that several contiguous b-tree nodes could contain the whole namespace tree. Finally, SSDFS raw inode is able to keep the content of the dentries tree inline.
Many end-users have a file system volume that is on average only half full.
The PEBs migration technique tries to employ this fact. It means that PEBs association during migration can be done without affecting the availability of free space on the volume. Moreover, SSDFS uses compression and delta-encoding. Finally, it provides a good basis for the PEBs migration technique.
On average, half of the files in a file system have been created by copying without subsequent writes.
It is possible to conclude that user data is mostly cold but metadata is mostly hot. SSDFS uses the model of current segments of different types. It means that user data is aggregated in one segment but metadata is aggregated into another one. As a result, the user data segment will be managed under the cold data policy but the metadata segment will be managed under the hot data policy. SSDFS uses three area types in the log (main area, diff updates area, journal area). It provides an efficient way to manage data for the case of mixed nature of data (cold and hot) in one log. SSDFS is a flash-friendly file system and it doesn't move a segment with cold data as part of GC activity. SSDFS delegates error correction and read block reclaiming to the FTL side. Also, deduplication is able to exclude the replication of existing files on the volume.
Modern applications manage large databases of information organized into complex directory trees (A File Is Not a File).
First of all, SSDFS is able to manage the case of mixed nature of data efficiently by means of three area types in the log (main area, diff updates area, journal area). Also, using the delta-encoding technique provides the way to store only the updated portion(s) of data. Moreover, a segment is able to contain several PEBs from different NAND dies. It means that different extents of a file can be stored into different PEBs and the parallelism of operations can be implemented. SSDFS associates read/write threads with PEBs and that implements parallelism both on the file system and on the SSD level.
Applications help users create, modify, and organize content, but user files represent a small fraction of the files touched by modern applications. Most files are helper files that applications use to provide a rich graphical experience, support multiple languages, and record history and other metadata.
Auxiliary files of the same application can be aggregated into one segment/PEB. It means that the readahead operation will be able to extract the content of all auxiliary files from one segment/PEB. SSDFS's segment is able to contain several PEBs from different NAND dies. As a result, this approach is capable of implementing parallelism of read operations both on the file system and on the SSD level.
Most written data is explicitly forced to disk by the application; for example, iPhoto calls fsync thousands of times in even the simplest of tasks.
SSDFS supports partial logs. It means that the file system driver tries to prepare the full log before flushing on the volume. However, the file system driver is able to prepare the partial logs in the case of fsync requests or synchronous mount. The partial logs could increase the amount of metadata on the volume. SSDFS supports several types of current segments. It means that metadata and user data will be processed simultaneously in different threads. A PEB has an associated flush thread. As a result, the update operations in different PEBs will be processed by different threads in a multi-threaded environment.
It has been shown that applications create many temporary files.
From one point of view, it is possible to consider adding an additional type of current segment for storing the temporary files. Finally, it means that such type of segment will be invalidated completely. And it will make the GC activity for such type of segments very cheap. But it would be a much more efficient way to keep the temporary files in the page cache without flushing on the volume.
Home-user applications commonly use atomic operations, in particular rename, to present a consistent view of files to users.
As a result, it is possible to expect more frequent metadata update operations (hot data). Finally, the PEBs migration technique could migrate updated metadata between PEBs without the necessity to use the GC activity.
Write amplification issue. SSDFS file system uses the following techniques for resolving the write amplification issue: (1) compression, (2) small files compaction scheme, (3) logical extent concept, (4) Diff-On-Write approach, (5) deduplication, (6) inline files.
The logical extent concept is the technique of resolving the write amplification issue for the case of an LFS file system. It means that any metadata structure keeping a logical extent doesn't need to update the logical extent value in the case of data migration between the PEBs because the logical extent remains the same as long as the data lives in the same segment. The migration mechanism implements the logical segment and logical extent concepts with the goal to decrease or completely eliminate the write amplification issue. Moreover, SSDFS file system widely uses the data compression, delta-encoding technique, and small files compaction technique that provide the opportunity to employ the PEB migration mechanism without the necessity to use additional overprovisioning.
Moreover, the b-tree metadata structure provides the way not to keep an unnecessary reserve of metadata space on the volume. As a result, it means the exclusion of management operations of the reserved metadata space (moving from one PEB to another one) with the goal to support it in the valid state. Generally speaking, it is the way to decrease the amount of PEBs' erase and write operations.
SSDFS file system uses a special compaction scheme which gathers several compressed fragments (even of different files) into one NAND flash page inside of a special log's area (diff updates or journal areas). Generally speaking, this compaction technique provides the opportunity to use only one NAND flash page for several compressed fragments of different files instead of several pages. As a result, decreasing the number of used NAND flash pages decreases the number of I/O operations and it creates the opportunity to reduce the write amplification issue.
SSDFS file system introduces a special compaction scheme for the case of small files. Generally speaking, the PEB's log can contain a special journal area that is used for gathering several small files into one NAND flash page. As a result, this compaction technique reduces the number of I/O operations and is able to decrease the write amplification factor. The mechanism of keeping data inline in the inode's private area is the way to reduce the write amplification issue and to improve the file system's performance.
GC overhead management. There are several types of segments on any SSDFS file system's volume: (1) superblock segment, (2) snapshot segment, (3) PEB mapping table segment, (4) segment bitmap, (5) b-tree segment, (6) user data segment. Generally speaking, the goal of distinguishing the different types of segments is to localize the peculiarities of different types of data (user data and metadata, for example) inside of specialized segments.
The migration scheme is the fundamental technique of GC overhead management in the SSDFS file system. The key responsibility of the migration scheme is to guarantee the presence of data in the same segment for any update operations. Generally speaking, the migration scheme is capable of decreasing the GC activity significantly by means of excluding the necessity to update metadata and by means of self-migration of data between PEBs that is triggered by regular update operations. The important addition to the migration scheme is the technique of hot/warm data self-migration. It means that any update operation in the environment of two PEBs' association results in moving data from the exhausted PEB into the new one. Finally, if a PEB contains only hot data then all data is able to migrate between PEBs by means of regular update operations without the necessity to employ the GC activity.
However, if a PEB contains a significant amount of cold valid blocks or the volume hasn't enough clean PEBs then the migration process needs to be stimulated by means of GC activity. The key item of the stimulation activity is the PEB's dedicated GC thread. Generally speaking, the responsibility of such GC thread is to orchestrate the gradual migration of cold data on the basis of the I/O budget allocated for a particular GC thread. The goal of such approach is to minimize the GC threads' activity and to guarantee the stable file system driver's performance for regular I/O operations. Finally, this policy has to exclude the degradation of the file system's performance because of GC threads' activity and to prepare enough free space for file system operations.
The compaction of several fragments of different logical blocks into one NAND flash page creates the capability to move more data for one GC operation. From another viewpoint, warm/hot areas introduce the areas with a high frequency of update operations. Generally speaking, it is possible to expect that the high frequency of update operations (in diff updates and journal areas) creates the natural migration of data between PEBs without the necessity to use extensive GC operations.
SSD lifetime. Decreasing the write amplification factor and eliminating the GC activity are able to prolong the SSD lifetime because of the capability to reduce the number of erase cycles used for auxiliary file system activity (for example, GC activity). Moreover, it also means the opportunity to prolong the lifetime of the main pool of SSD's erase blocks. As a result, the overprovisioning pool can be decreased or it could be used for prolongation of the SSD lifetime. SSDFS file system uses the delta-encoding technique with the compaction scheme for gathering several data portions into one NAND flash page. Finally, the goal of these techniques is to achieve the more compact representation of user data and metadata and to reduce the amount of write operations on the file system's volume. Generally speaking, it implies decreasing the number of erase cycles and the prolongation of the SSD lifetime.
File system performance. The whole SSDFS file system's design and architecture tries to achieve the prolongation of SSD lifetime through decreasing the write amplification factor. Generally speaking, the suggested and implemented approaches are capable of improving the flush/write operations' performance. However, a potential side effect of such efforts could be some reduction of read operations' performance. But the asymmetric nature of read/write latency of NAND flash (read operations are faster and there are no seek operation penalties) gives a steady basis to expect a good performance of read operations for the case of SSDFS file system's architecture. Moreover, the aggregation of several PEBs into one segment, the PEB's dedicated threads model, and the GC I/O budget model provide rich opportunities for achieving a good file system's performance. Any SSD represents a multi-die and multi-channel architecture. If some protocol of interaction with the SSD (for example, the open-channel SSD model) shares the knowledge of distribution of erase blocks among NAND dies then the file system is able to employ this knowledge for enhancing the I/O operations' performance. SSDFS file system uses the technique of aggregation of several PEBs inside of one segment. Generally speaking, if one segment aggregates several PEBs from different NAND dies then such approach provides the way to process the I/O requests in parallel by different NAND dies. As a result, it is capable of improving the performance of I/O requests in the scope of one segment significantly.
Solid state drives have a number of interesting characteristics. However, there are numerous file system and storage design issues for SSDs that impact the performance and device endurance. Many flash-oriented and flash-friendly file systems introduce a significant write amplification issue and GC overhead that results in shorter SSD lifetime and the necessity to use the NAND flash overprovisioning. SSDFS file system introduces several authentic concepts and mechanisms: logical segment, logical extent, segment's PEBs pool, Main/Diff/Journal areas in the PEB's log, Diff-On-Write approach, PEBs migration scheme, hot/warm data self-migration, segment bitmap, hybrid b-tree, shared dictionary b-tree, shared extents b-tree. The combination of all suggested concepts is able to: (1) manage write amplification in a smart way, (2) decrease GC overhead, (3) prolong SSD lifetime, and (4) provide predictable file system's performance.
Currently, SSDFS file system driver is not fully functional and is not completely implemented. It still needs bug fixing. The Diff-On-Write approach is implemented only partially. Deduplication and snapshot support are not implemented yet. Additionally, SSDFS file system doesn't have an fsck tool yet.
SSDFS project information is available online ( ). Source code of the user-space tools is available at https://github.com/dubeyko/ssdfs-tools.git. Source code of the file system driver is available at https://github.com/dubeyko/ssdfs-driver.git. Source code of the Linux kernel with the integrated SSDFS file system driver is available at https://github.com/dubeyko/linux.git. The author gratefully acknowledges the initial support of the idea by Zvonimir Bandic and Cyril Guyot.
References [1] SSDFS Project, [Online]. Available: , Accessed on: Jun. 19, 2019.[2] V. A. Dubeyko, C. Guyot, ”Systems and methods forimproving flash-oriented file system garbage collec-tion,” U.S. Patent Application US20170017405, pub-lished January 19, 2017.[3] V. A. Dubeyko, C. Guyot, ”Systems and methods forimproving flash-oriented file system garbage collec-tion,” U.S. Patent Application US20170017406, pub-lished January 19, 2017.[4] V. A. Dubeyko, C. Guyot, ”Method of decreasingwrite amplification factor and over-provisioning ofNAND flash by means of Diff-On-Write approach,”U.S. Patent Application US20170139616, publishedMay 18, 2017.[5] V. A. Dubeyko, C. Guyot, ”Method of decreasing writeamplification of NAND flash using a journal approach,”U.S. Patent 10,013,346, issued March 7, 2018.[6] V. A. Dubeyko, C. Guyot, ”Method of improvinggarbage collection efficiency of flash-oriented file sys-tems using a journaling approach,” U.S. Patent Appli-cation US20170139825, published May 18, 2017.[7] V. A. Dubeyko, ”Bitmap Processing for Log-StructuredData Store,” U.S. Patent Application US20190018601,published January 17, 2019.[8] V. A. Dubeyko, S. Song, ”Non-volatile storage systemthat reclaims bad blocks,” U.S. Patent 10,223,216, is-sued March 5, 2019.[9] V. A. Dubeyko, S. Song, ”Non-volatile storage sys-tem that reclaims bad blocks,” U.S. Patent ApplicationUS20190155703, published May 23, 2019.[10] Agrawal, et al., ”A Five-Year Study of File-SystemMetadata,” ACM Transactions on Storage (TOS), vol.3 Issue 3, Oct. 2007, Article No. 9.[11] Avishay Traeger, Erez Zadok, Nikolai Joukov, andCharles P. Wright, ”A nine year study of file systemand storage benchmarking,” Trans. Storage 4, 2, Arti-cle 5 (May 2008), 56 pages.4512] Douceur, et al., ”A Large-Scale Study of File-SystemContents,” SIGMETRICS ’99 Proceedings of the 1999ACM SIGMETRICS international conference on Mea-surement and modeling of computer systems, pp. 59-70, May 1-4, 1999.[13] Lucas Tan, Fuyao Zhao, Xu Zhang, ”15712 AdvancedOperating and Distributed System Android and iOSPlatform Study Final Report,” [Online]. Available: https://pdfs.semanticscholar.org/48f8/1b9339ec3fcee1cc8031575e6f7b84c57c84.pdf ,Accessed on: Jun. 21, 2019.[14] Tyler Harter, Chris Dragga, Michael Vaughn, AndreaC. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau,”A file is not a file: understanding the I/O behaviorof Apple desktop applications,” In Proceedings of theTwenty-Third ACM Symposium on Operating SystemsPrinciples (SOSP ’11). ACM, New York, NY, USA, 71-83.[15] A. B. Downey, ”The structural cause of file size dis-tributions,” MASCOTS 2001, Proceedings Ninth Inter-national Symposium on Modeling, Analysis and Sim-ulation of Computer and Telecommunication Systems,Cincinnati, OH, USA, 2001, pp. 361-370.[16] M. I. Ullah, F. Ahsan, I. Ahmad and A. F. M. Ishaq,”Analysis of file system space utilization patterns inUNIX based volumes,” Proceedings of the IEEE Sym-posium on Emerging Technologies, 2005, Islamabad,2005, pp. 542-546.[17] Tim Gibson, Ethan L. Miller, Darrell D. E. Long,”Long-term File Activity and Inter-Reference Pat-terns,” [Online]. Available: , Ac-cessed on: Jun. 25, 2019.[18] Yifan Wang, ”A Statistical Study for File Sys-tem Meta Data On High Performance ComputingSites,” [Online]. Available: ,Accessed on: Jun. 25, 2019.[19] A. Wildani, I. F. Adams and E. L. Miller, ”Single-Snapshot File System Analysis,” 2013 IEEE 21st Inter-national Symposium on Modelling, Analysis and Sim-ulation of Computer and Telecommunication Systems,San Francisco, CA, 2013, pp. 
338-341.
[20] S. Hui, Z. Rui, C. Jin, L. Lei, W. Fei and X. C. Sheng, "Analysis of the File System and Block IO Scheduler for SSD in Performance and Energy Consumption," 2011 IEEE Asia-Pacific Services Computing Conference, Jeju Island, 2011, pp. 48-55.
[21] D. Parthey and R. Baumgartl, "Analyzing Access Timing of Removable Flash Media," 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2007), Daegu, 2007, pp. 510-515.
[22] Y. Son, H. Kang, H. Han and H. Y. Yeom, "An Empirical Evaluation of NVM Express SSD," 2015 International Conference on Cloud and Autonomic Computing, Boston, MA, 2015, pp. 275-282.
[23] K. Zhou, P. Huang, C. Li and H. Wang, "An Empirical Study on the Interplay between Filesystems and SSD," 2012 IEEE Seventh International Conference on Networking, Architecture, and Storage, Xiamen, Fujian, 2012, pp. 124-133.
[24] P. Olivier, J. Boukhobza and E. Senn, "Micro-benchmarking Flash Memory File-System Wear Leveling and Garbage Collection: A Focus on Initial State Impact," 2012 IEEE 15th International Conference on Computational Science and Engineering, Nicosia, 2012, pp. 437-444.
[25] P. Olivier, J. Boukhobza and E. Senn, "Modeling driver level NAND flash memory I/O performance and power consumption for embedded Linux," 2013 11th International Symposium on Programming and Systems (ISPS), Algiers, 2013, pp. 143-152.
[26] Y. Wei and D. Shin, "NAND flash storage device performance in Linux file system," 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT), Seogwipo, 2011, pp. 574-577.
[27] G. Kim and D. Shin, "Performance analysis of SSD write using TRIM in NTFS and EXT4," 2011 6th International Conference on Computer Sciences and Convergence Information Technology (ICCIT), Seogwipo, 2011, pp. 422-423.
[28] S. Park and K. Shen, "A performance evaluation of scientific I/O workloads on Flash-based SSDs," 2009 IEEE International Conference on Cluster Computing and Workshops, New Orleans, LA, 2009, pp. 1-5.
[29] B. Gu, J. Lee, B. M. Jung, J. Seo and H. Shin, "Utilization analysis of trim-enabled NAND flash memory," 2013 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, 2013, pp. 645-646.
[30] Y. Wang, K. Goda, M. Nakano and M. Kitsuregawa, "Early Experience and Evaluation of File Systems on SSD with Database Applications," 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage, Macau, 2010, pp. 467-476.
[31] S. S. Rizvi and T. Chung, "Flash memory SSD based DBMS for high performance computing embedded and multimedia systems," The 2010 International Conference on Computer Engineering & Systems, Cairo, 2010, pp. 183-188.
[32] L. Lin and X. Lizhen, "The Research of Key Technology in Flash-Based DBMS," 2009 Sixth Web Information Systems and Applications Conference, Xuzhou, Jiangsu, 2009, pp. 15-18.
[33] J. Chen, J. Wang, Z. Tan and C. Xie, "Effects of Recursive Update in Copy-on-Write File Systems: A BTRFS Case Study," in Canadian Journal of Electrical and Computer Engineering, vol. 37, no. 2, pp. 113-122, Spring 2014.
[34] Mendel Rosenblum and John K. Ousterhout, "The design and implementation of a log-structured file system," ACM Trans. Comput. Syst. 10, 1 (February 1992), 26-52.
[35] David Woodhouse, "JFFS: the journalling flash file system," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.630.3461, Accessed on: Jun. 20, 2019.
[36] Artem B. Bityutskiy, "JFFS3 design issues," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.9834, Accessed on: Jun. 20, 2019.
[37] Adrian Hunter, "A Brief Introduction to the Design of UBIFS," [Online]. Accessed on: Jun. 20, 2019.
[38] Adrian Hunter, Artem B. Bityutskiy, "UBIFS file system," [Online]. Accessed on: Jun. 20, 2019.
[39] Charles Manning, "How YAFFS Works," [Online]. Available: https://yaffs.net/sites/yaffs.net/files/HowYaffsWorks.pdf, Accessed on: Jun. 20, 2019.
[40] Technical note, the Nilfs version 1: overview. [Online]. Available: https://nilfs.sourceforge.io/papers/overview-v1.pdf, Accessed on: Jun. 20, 2019.
[41] Ryusuke Konishi, "Development of a New Log-structured File System for Linux," Technical Note, Oct. 2005. [Online]. Available: https://nilfs.sourceforge.io/papers/nilfs-051019.pdf, Accessed on: Jun. 20, 2019.
[42] Jörn Engel, Robert Mertens, "LogFS - finally a scalable flash file system," [Online]. Accessed on: Jun. 21, 2019.
[43] Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho, "F2FS: a new file system for flash storage," in Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST'15), USENIX Association, Berkeley, CA, USA, 273-286.
[44] TaeHoon Kim, KwangMu Shin, TaeHoon Lee, KiDong Jung, "Design of a Reliable NAND Flash Software for Mobile Device," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.554.8864&rep=rep1&type=pdf, Accessed on: Jun. 24, 2019.
[45] Jeong-Ki Kim, Hyung-Seok Lee and Heung-Nam Kim, "Dual Journaling Store Method for Embedded Systems," 2006 8th International Conference on Advanced Communication Technology, Phoenix Park, 2006, pp. 1241-1244.
[46] S. O. Park and S. J. Kim, "An Efficient Array File System for Multiple Small-Capacity NAND Flash Memories," 2011 14th International Conference on Network-Based Information Systems, Tirana, 2011, pp. 569-572.
[47] J. Kim, H. Jo, H. Shim, J. Kim and S. Maeng, "Efficient Metadata Management for Flash File Systems," 2008 11th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing (ISORC), Orlando, FL, 2008, pp. 535-540.
[48] S. O. Park and S. J. Kim, "An efficient multimedia file system for NAND flash memory storage," in IEEE Transactions on Consumer Electronics, vol. 55, no. 1, pp. 139-145, February 2009.
[49] Seung-Ho Lim and Kyu-Ho Park, "An efficient NAND flash file system for flash memory storage," in IEEE Transactions on Computers, vol. 55, no. 7, pp. 906-912, July 2006.
[50] H. Kim, Y. Won and S. Kang, "Embedded NAND flash file system for mobile multimedia devices," in IEEE Transactions on Consumer Electronics, vol. 55, no. 2, pp. 545-552, May 2009.
[51] C. T. Chen, C. H. Chen and W. T. Huang, "Energy-aware management of NAND type flash file system," in Electronics Letters, vol. 42, no. 14, pp. 795-796, 6 July 2006.
[52] A. S. Ramasamy and P. Karantharaj, "File system and storage array design challenges for flash memory," 2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE), Coimbatore, 2014, pp. 1-8.
[53] B. Nahill and Z. Zilic, "FLogFS: A lightweight flash log file system," 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN), Cambridge, MA, 2015, pp. 1-6.
[54] Yang Ou, Xiaoquan Wu, Nong Xiao, Fang Liu and Wei Chen, "HIFFS: A Hybrid Index for Flash File System," 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), Boston, MA, 2015, pp. 363-364.
[55] P. Huang, G. Wan, K. Zhou, M. Huang, C. Li and H. Wang, "Improve Effective Capacity and Lifetime of Solid State Drives," 2013 IEEE Eighth International Conference on Networking, Architecture and Storage, Xi'an, 2013, pp. 50-59.
[56] S. Yang and C. Wu, "A Low-Memory Management for Log-Based File Systems on Flash Memory," 2009 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, Beijing, 2009, pp. 219-227.
[57] W. Qiu, X. Chen, N. Xiao, F. Liu and Z. Chen, "A New Exploration to Build Flash-Based Storage Systems by Co-designing File System and FTL," 2013 IEEE 16th International Conference on Computational Science and Engineering, Sydney, NSW, 2013, pp. 925-932.
[58] T. Chen, X. Wang, W. Hu and W. Duan, "A New Type of NAND Flash-Based File System: Design and Implementation," 2006 International Conference on Wireless Communications, Networking and Mobile Computing, Wuhan, 2006, pp. 1-4.
[59] S. Lee, J. Kim and A. Mithal, "Refactored Design of I/O Architecture for Flash Storage," in IEEE Computer Architecture Letters, vol. 14, no. 1, pp. 70-74, 1 Jan.-June 2015.
[60] Junkil Ryu and C. Park, "A technique to enhance performance of log-based file systems for flash memory in embedded systems," 2007 2nd International Conference on Digital Information Management, Lyon, 2007, pp. 580-582.
[61] Byungjo Kim, Dong Hyun Kang, Changwoo Min and Young Ik Eom, "Understanding implications of trim, discard, and background command for eMMC storage device," 2014 IEEE 3rd Global Conference on Consumer Electronics (GCCE), Tokyo, 2014, pp. 709-710.
[62] C. Min, S. Lee and Y. I. Eom, "Design and Implementation of a Log-Structured File System for Flash-Based Solid State Drives," in IEEE Transactions on Computers, vol. 63, no. 9, pp. 2215-2227, Sept. 2014.
[63] Jun Wang and Yiming Hu, "A novel reordering write buffer to improve write performance of log-structured file systems," in IEEE Transactions on Computers, vol. 52, no. 12, pp. 1559-1572, Dec. 2003.
[64] Jun Wang and Yiming Hu, "PROFS - performance-oriented data reorganization for log-structured file system on multi-zone disks," MASCOTS 2001, Proceedings Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Cincinnati, OH, USA, 2001, pp. 285-292.
[65] R. Agarwal and M. Marrow, "A closed-form expression for write amplification in NAND Flash," 2010 IEEE Globecom Workshops, Miami, FL, 2010, pp. 1846-1850.
[66] A. Jagmohan, M. Franceschini and L. Lastras, "Write amplification reduction in NAND Flash through multi-write coding," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, 2010, pp. 1-6.
[67] Y. Chang and T. Kuo, "A commitment-based management strategy for the performance and reliability enhancement of flash-memory storage systems," 2009 46th ACM/IEEE Design Automation Conference, San Francisco, CA, 2009, pp. 858-863.
[68] Tei-Wei Kuo, Jen-Wei Hsieh, Li-Pin Chang and Yuan-Hao Chang, "Configurability of performance and overheads in flash management," Asia and South Pacific Conference on Design Automation, Yokohama, 2006, p. 8.
[69] J. Hsieh, C. Wu and G. Chiu, "Design and Implementation for Multi-level Cell Flash Memory Storage Systems," 2010 IEEE 16th International Conference on Embedded and Real-Time Computing Systems and Applications, Macau SAR, 2010, pp. 247-252.
[70] C. Park, W. Cheon, Y. Lee, M. Jung, W. Cho and H. Yoon, "A Re-configurable FTL (Flash Translation Layer) Architecture for NAND Flash based Applications," 18th IEEE/IFIP International Workshop on Rapid System Prototyping (RSP '07), Porto Alegre, 2007, pp. 202-208.
[71] J. Lee, H. Kim, H. Kim, J. Park and M. Ryu, "A sequentializing device driver for optimizing random write performance of eSSD," 2014 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, 2014, pp. 432-433.
[72] Y. He, S. Wan, N. Xiong and J. H. Park, "A New Prefetching Strategy Based on Access Density in Linux," International Symposium on Computer Science and its Applications, Hobart, ACT, 2008, pp. 22-27.
[73] Dingqing Hu, Changsheng Xie and C. CaiBin, "A Study of Parallel Prefetching Algorithms Using Trace-Driven Simulation," Sixth International Conference on Parallel and Distributed Computing Applications and Technologies (PDCAT'05), Dalian, China, 2005, pp. 476-478.
[74] Y. Kang, J. Yang and E. L. Miller, "Efficient Storage Management for Object-based Flash Memory," 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Miami Beach, FL, 2010, pp. 407-409.
[75] Q. Xie et al., "Research on the Framework of NAND FLASH Based Object-Based-Storage-Device," 2012 Second International Conference on Intelligent System Design and Engineering Application, Sanya, Hainan, 2012, pp. 1298-1301.
[76] Goetz Graefe, "Modern B-Tree Techniques," [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.7269&rep=rep1&type=pdf, Accessed on: Jun. 21, 2019.
[77] J. Ahn, D. Kang, D. Jung, J. Kim and S. Maeng, "μ*-Tree: An Ordered Index Structure for NAND Flash Memory with Adaptive Page Layout Scheme," in IEEE Transactions on Computers, vol. 62, no. 4, pp. 784-797, April 2013.
[78] C. Lee and S. Lim, "Caching and Deferred Write of Metadata for Yaffs2 Flash File System," 2011 IFIP 9th International Conference on Embedded and Ubiquitous Computing, Melbourne, VIC, 2011, pp. 41-46.
[79] J. He et al., "Discovering Structure in Unstructured I/O," 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Salt Lake City, UT, 2012, pp. 1-6.
[80] Tsozen Yeh, J. Arul, Jia-Shian Wu, I. Chen and Kuo-Hsin Tan, "Using File Grouping to Improve the Disk Performance (Extended Abstract)," 2006 15th IEEE International Conference on High Performance Distributed Computing, Paris, 2006, pp. 365-366.
[81] Li-Pin Chang and Tei-Wei Kuo, "An adaptive striping architecture for flash memory storage systems of embedded systems," Proceedings Eighth IEEE Real-Time and Embedded Technology and Applications Symposium, San Jose, CA, USA, 2002, pp. 187-196.
[82] Y. Xin, R. Chun-ming and H. Ben-xiong, "A Flexible Garbage Collect Algorithm for Flash Storage Management," 2008 Second International Conference on Future Generation Communication and Networking, Hainan Island, 2008, pp. 354-357.
[83] Che-Wei Tsao, Yuan-Hao Chang and Ming-Chang Yang, "Performance enhancement of garbage collection for flash storage devices: An efficient victim block selection design," 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, 2013, pp. 1-6.
[84] H. Yan and Q. Yao, "An efficient file-aware garbage collection algorithm for NAND flash-based consumer electronics," in IEEE Transactions on Consumer Electronics, vol. 60, no. 4, pp. 623-627, Nov. 2014.
[85] L. Zeng, Y. Zhang and X. Zhao, "An Improved Approach on B Tree Management for NAND Flash-Memory Storage Systems," 2009 WASE International Conference on Information Engineering, Taiyuan, Shanxi, 2009, pp. 443-447.
[86] S. Jung, Y. Lee and Y. H. Song, "A process-aware hot/cold identification scheme for flash memory storage systems," in IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 339-347, May 2010.
[87] Sheng-Jie Syu and Jing Chen, "An active space recycling mechanism for flash storage systems in real-time application environment," 11th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'05), Hong Kong, China, 2005, pp. 53-59.
[88] H. Lim and J. Park, "Dynamic Configuration of SSD File Management," 2014 International Conference on Information Science & Applications (ICISA), Seoul, 2014, pp. 1-3.
[89] T. Huang and D. Chang, "Extending Lifetime and Reducing Garbage Collection Overhead of Solid State Disks with Virtual Machine Aware Journaling," 2011 IEEE 17th International Conference on Parallel and Distributed Systems, Tainan, 2011, pp. 1-8.
[90] C. Wu, P. Wu, K. Chen, W. Chang and K. Lai, "A Hotness Filter of Files for Reliable Non-Volatile Memory Systems," in IEEE Transactions on Dependable and Secure Computing, vol. 12, no. 4, pp. 375-386, 1 July-Aug. 2015.
[91] H. Gwak, Y. Kang and D. Shin, "Reducing garbage collection overhead of log-structured file systems with GC journaling," 2015 International Symposium on Consumer Electronics (ISCE), Madrid, 2015, pp. 1-2.
[92] D. Choi and D. Shin, "Semantic-Aware Hot Data Selection Policy for Flash File System in Android-Based Smartphones," 2013 International Conference on Parallel and Distributed Systems, Seoul, 2013, pp. 444-445.
[93] C. Wu, W. Chang and Z. Hong, "A Reliable Non-volatile Memory System: Exploiting File-System Characteristics," 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, Shanghai, 2009, pp. 202-207.
[94] D. Shapira, "Compressed Transitive Delta Encoding," 2009 Data Compression Conference, Snowbird, UT, 2009, pp. 203-212.
[95] H. Li, "Flash Saver: Save the Flash-Based Solid State Drives through Deduplication and Delta-encoding," 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, Beijing, 2012, pp. 436-441.
[96] Z. Zhang, Z. Jiang, C. Peng and Z. Liu, "Analysis of data fragments in deduplication system," 2012 International Conference on System Science and Engineering (ICSSE), Dalian, Liaoning, 2012, pp. 559-563.
[97] Yong-Ting Wu, Min-Chieh Yu, Jenq-Shiou Leu, Eau-Chung Lee and Tian Song, "Design and implementation of various file deduplication schemes on storage devices," 2015 11th International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness (QSHINE), Taipei, 2015, pp. 80-84.
[98] Y. Fu, H. Jiang, N. Xiao, L. Tian and F. Liu, "AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment," 2011 IEEE International Conference on Cluster Computing, Austin, TX, 2011, pp. 112-120.
[99] N. Wanigasekara and C. I. Keppittiyagama, "BuddyFS: A File-System to Improve Data Deduplication in Virtualization Environments," 2014 Eighth International Conference on Complex, Intelligent and Software Intensive Systems, Birmingham, 2014, pp. 198-204.
[100] Feng Chen, Tian Luo, and Xiaodong Zhang, "CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives," in Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST'11), USENIX Association, Berkeley, CA, USA, 6-6.
[101] W. Xia, H. Jiang, D. Feng and L. Tian, "Combining Deduplication and Delta Compression to Achieve Low-Overhead Data Reduction on Backup Datasets," 2014 Data Compression Conference, Snowbird, UT, 2014, pp. 203-212.
[102] D. Meister, J. Kaiser, A. Brinkmann, T. Cortes, M. Kuhn and J. Kunkel, "A study on data deduplication in HPC storage systems," SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, 2012, pp. 1-11.
[103] J. Ha, Y. Lee and J. Kim, "Deduplication with Block-Level Content-Aware Chunking for Solid State Drives (SSDs)," 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, Zhangjiajie, 2013, pp. 1982-1989.
[104] Y. Deng, L. Song and X. Huang, "Evaluating Memory Compression and Deduplication," 2013 IEEE Eighth International Conference on Networking, Architecture and Storage, Xi'an, 2013, pp. 282-286.
[105] E. W. D. Rozier and W. H. Sanders, "A framework for efficient evaluation of the fault tolerance of deduplicated storage systems," IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), Boston, MA, 2012, pp. 1-12.
[106] X. Zhao, Y. Zhang, Y. Wu, K. Chen, J. Jiang and K. Li, "Liquid: A Scalable Deduplication File System for Virtual Machine Images," in IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 5, pp. 1257-1266, May 2014.
[107] Youngjin Nam, Guanlin Lu and D. H. C. Du, "Reliability-aware deduplication storage: Assuring chunk reliability and chunk loss severity," 2011 International Green Computing Conference and Workshops, Orlando, FL, 2011, pp. 1-6.
[108] Calicrates Policroniades, Ian Pratt, "Alternatives for detecting redundancy in storage systems data," in Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC '04), USENIX Association, Berkeley, CA, USA, 6-6.
[109] Fred Douglis, Arun Iyengar, "Application-specific Delta-encoding via Resemblance Detection," [Online]. Available: https://pdfs.semanticscholar.org/5aef/da15f1dcf04529bbf518659a23112cbb5246.pdf, Accessed on: Jun. 26, 2019.
[110] J. Kim et al., "Deduplication in SSDs: Model and quantitative analysis," 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), San Diego, CA, 2012, pp. 1-12.
[111] D. Harnik, O. Margalit, D. Naor, D. Sotnikov and G. Vernik, "Estimation of deduplication ratios in large data sets," 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), San Diego, CA, 2012, pp. 1-11.
[112] E. W. D. Rozier, W. H. Sanders, P. Zhou, N. Mandagere, S. M. Uttamchandani and M. L. Yakushev, "Modeling the Fault Tolerance Consequences of Deduplication," 2011 IEEE 30th International Symposium on Reliable Distributed Systems, Madrid, 2011, pp. 75-84.
[113] Y. Joo, J. Ryu, S. Park, H. Shin and K. G. Shin, "Rapid Prototyping and Evaluation of Intelligence Functions of Active Storage Devices," in IEEE Transactions on Computers, vol. 63, no. 9, pp. 2356-2368, Sept. 2014.
[114] E. Jeannot, B. Knutsson and M. Bjorkman, "Adaptive online data compression," Proceedings 11th IEEE International Symposium on High Performance Distributed Computing, Edinburgh, UK, 2002, pp. 379-388.
[115] T. Quan, D. Yeo and Y. Won, "CMFS: Compressed metadata file system for hybrid storage," 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content, Beijing, 2010, pp. 1030-1034.
[116] S. Ahn, S. Hyun, T. Kim and H. Bahn, "A compressed file system manager for flash memory based consumer electronics devices," in IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 544-549, August 2013.
[117] K. Kim, S. Jung and Y. H. Song, "Compression ratio based hot/cold data identification for flash memory," 2011 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, 2011, pp. 33-34.
[118] D. Zhao, K. Qiao, J. Yin and I. Raicu, "Dynamic Virtual Chunks: On Supporting Efficient Accesses to Compressed Scientific Data," in IEEE Transactions on Services Computing, vol. 9, no. 1, pp. 96-109, 1 Jan.-Feb. 2016.
[119] S. Hyun, H. Bahn and K. Koh, "LeCramFS: an efficient compressed file system for flash-based portable consumer devices," in IEEE Transactions on Consumer Electronics, vol. 53, no. 2, pp. 481-488, May 2007.
[120] W. Chang, X. Yun, B. Fang, S. Wang and X. Yu, "Performance evaluation of block LZSS compression algorithm," 2010 2nd International Conference on Future Computer and Communication, Wuhan, 2010, pp. V2-449-V2-454.
[121] C. Constantinescu and M. Lu, "Quick Estimation of Data Compression and De-duplication for Large Storage Systems," 2011 First International Conference on Data Compression, Communications and Processing, Palinuro, 2011, pp. 98-102.
[122] O. Kwon, Y. Yoo, K. Koh and H. Bahn, "Replacement and swapping strategy to improve read performance of portable consumer devices using compressed file systems," in IEEE Transactions on Consumer Electronics, vol. 54, no. 2, pp. 551-559, May 2008.
[123] A. Molfetas, A. Wirth and J. Zobel, "Using Inter-file Similarity to Improve Intra-file Compression," 2014 IEEE International Congress on Big Data, Anchorage, AK, 2014, pp. 192-199.
[124] T. Makatos, Y. Klonatos, M. Marazakis, M. D. Flouris and A. Bilas, "ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency," 2010 International Workshop on Storage Network Architecture and Parallel I/Os, Incline Village, NV, 2010, pp. 61-70.
[125] B. Shen, X. Jin, Y. H. Song and S. S. Lee, "APRA: Adaptive Page Replacement Algorithm for NAND Flash Memory Storages," 2009 International Forum on Computer Science-Technology and Applications, Chongqing, 2009, pp. 11-14.
[126] M. Wang and Y. Hu, "Exploit real-time fine-grained access patterns to partition write buffer to improve SSD performance and life-span," 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC), San Diego, CA, 2013, pp. 1-7.
[127] Matias Bjørling, Javier Gonzalez, Philippe Bonnet, "LightNVM: The Linux Open-Channel SSD Subsystem," [Online]. Accessed on: Jun. 26, 2019.
[128] Matias Bjørling, Jesper Madsen, Philippe Bonnet, Aviad Zuck, Zvonimir Bandic, Qingbo Wang, "LightNVM: Lightning Fast Evaluation Platform for Non-Volatile Memories," [Online]. Available: https://pdfs.semanticscholar.org/30eb/bf2b42ef3a5714b0f5350f85842e3ca2e408.pdf, Accessed on: Jun. 26, 2019.
[129] Matias Bjørling, Jesper Madsen, Javier Gonzalez, Philippe Bonnet, "Linux Kernel Abstractions for Open-Channel Solid State Drives," [Online]. Accessed on: Jun. 26, 2019.
[130] Javier Gonzalez, Matias Bjørling, "Multi-Tenant I/O Isolation with Open-Channel SSDs," [Online]. Accessed on: Jun. 26, 2019.
[131] Javier Gonzalez, Matias Bjørling, Seongno Lee, Charlie Dong, Yiren Ronnie Huang, "Application-Driven Flash Translation Layers on Open-Channel SSDs," [Online]. Available: