APEX: Adaptive Ext4 File System for Enhanced Data Recoverability in Edge Devices
Shreshth Tuli
Indian Institute of Technology Delhi
[email protected]
Shikhar Tuli
Indian Institute of Technology Delhi
[email protected]
Udit Jain
Indian Institute of Technology Delhi
[email protected]
Rajkumar Buyya
The University of Melbourne, Melbourne
[email protected]
Abstract—Recently, the Edge Computing paradigm has gained significant popularity in both industry and academia. With its increased usage in real-life scenarios, the security, privacy and integrity of data in such environments have become critical. Malicious deletion of mission-critical data due to ransomware, trojans and viruses has been a huge menace, and recovering such lost data is an active field of research. As most Edge computing devices have compute and storage limitations, difficult constraints arise in providing an optimal scheme for data protection. These devices mostly use Linux/Unix based operating systems. Hence, this work focuses on extending the Ext4 file system to APEX (Adaptive Ext4): a file system based on a novel on-the-fly learning model that provides an Adaptive Recoverability Aware file allocation platform for efficient post-deletion data recovery and therefore maintains data integrity. Our recovery model and its lightweight implementation allow significant improvement in recoverability of lost data with lower compute, space, time, and cost overheads compared to other methods. We demonstrate the effectiveness of APEX through a case study of overwriting surveillance videos by the CryPy malware on a Raspberry-Pi based Edge deployment and show 678% and 32% higher recovery than Ext4 and the current state-of-the-art file systems. We also evaluate the overhead characteristics and experimentally show that they are lower than those of other related works.
Index Terms—Edge computing, Security, File System, Data Recovery, Data Stealing Malware
I. INTRODUCTION
The Internet of Things (IoT) paradigm enables the integration of different sensors, compute resources and actuators to perceive the external environment and act to provide utility in applications like healthcare, transportation and surveillance, among others [1]. The original idea for providing enhanced Quality of Service (QoS), which is a measure of the performance of such services, was to provide distributed computation and computation on the Cloud [2]. A new computing paradigm, namely 'Fog Computing', leverages resources both at the public cloud and at the edge of the network. It provides resource abundant machines available at multi-hop distance and resource constrained machines close to the user. Several works have shown that Fog computing can provide better and cheaper solutions compared to Cloud-only approaches [3], [4]. Thus, many such IoT systems have recently been realized using Edge/Fog computing frameworks [5]. Due to the increased usage of such devices and frameworks, there has been growing interest in developing efficient techniques to improve the QoS that such systems provide [6].

Edge is considered a nontrivial extension of the cloud; therefore, it is inevitable that the same security and privacy challenges will persist [7]. Many works in the literature show that there exist new security threats in the Edge paradigm which provide opportunities for developing more robust and better systems [8]. One of the most crucial security problems is the loss of critical data due to malicious entities [8]. Some ways used by hackers to corrupt, delete and steal such critical data include data stealing malware, ransomware, trojans, and deleting or overwriting viruses. Most malware attacks have been based on stealing crucial data from the system and requesting ransom payments in return for the data. Unfortunately, according to a recent survey [9], the number of novel ransomwares/trojans has increased up to 50 times in the last decade, amounting to millions of dollars of illicit revenue.

Preventing such attacks has not been popular in IoT, because it requires significant computation and space [10], which drastically increases the cost of IoT deployment and is unfavorable for users. Also, it has been shown that no matter what protection mechanisms are put in place, edge paradigms will be successfully attacked [11]. Thus, detection of and recovery from such attacks is a critical requirement for the Edge Computing domain. Due to the increasing frequency of novel attacks and attack types, detection is quite challenging [12]. Despite such challenges, there exists prior work with high accuracy in detecting such attacks [13], [14], but only a few of these works utilize the ability of policy based allocation to recover from such attacks. The ones that do utilize policy based allocation schemes [14], [15] have significant overheads of read/write latency, computation time and space requirements, and limited recoverability, which are not feasible for Edge nodes. We discuss more limitations of such strategies in
Section II and identify the scope for improvement in terms of proper allocation with mechanisms allowing faster, more portable and more efficient recovery. Our work primarily focuses on recovery in Hard Disk Drives (HDD), Flash media and Solid State Drives (SSD).

Most conventional file systems such as Ext, NTFS and exFAT do not allow users to independently tag blocks/clusters/sectors as deleted or unused [16], where a block on a hard disk is a group of sectors that the operating system can point to. The lack of such freedom forces the kernel to overwrite data at arbitrary locations, reducing recovery. Some optimizations exist but are proprietary and not customizable to user specific needs. This prevents currently available kernels from utilizing the full potential of file allocation and tagging for efficient recovery. Even file allocation and mapping with virtual tables is restricted to specific, fixed algorithms. Allowing these algorithms to be recoverability aware when allocating blocks/clusters to files can improve the amount of data that can be recovered from such file systems. Making these algorithms adaptive and equipping them with a learning model can lead to further optimization absent in current systems.

The proposed file system APEX (Adaptive Ext4) implements an adaptive file allocation policy that supports a wide diversity of platforms due to its portable implementation. It provides a significant improvement in recovery of files with low overheads. It is designed to be lightweight and easy to deploy in Edge/Fog computing frameworks, increasing their reliability and data protection. Another advantage is that it provides improved forensic recovery for criminal investigations to expose evidence and hence catch hackers/invaders. The main contributions of our work are as follows:
• We propose a lightweight, adaptive, portable and efficient file allocation system optimized for higher post-deletion recovery, which is flexible, robust and independent of storage architecture.
• We provide a set of pre-optimized weights that need only slight variation of hyper-parameters depending on usage, and thus a low adaptation time for new scenarios.
• We develop a prototype file system, APEX, and show its efficacy on a real-life scenario of malicious deletion/overwriting of video surveillance footage.
The rest of the paper is organized as follows. In
Section II, we provide related works and compare them with ours. In Section III, we first provide a basic recoverability aware allocation mechanism and describe a heuristic based block ranking method that can optimize post-deletion data recovery. We then improve this heuristic measure by updating it dynamically to prioritize files based on the user's file access characteristics in Section IV, and also provide model level details of a Disk Simulator for learning the weights (hyper-parameters) of the block parameters based on general-user file access characteristics. In Section V, we extend our implementation discussion to the APEX file system using the FUSE (Filesystem in Userspace) framework [17]. In Section VI, we provide a case study of overwriting video surveillance data using the CryPy malware [18] and provide experimental results of the model and comparisons with other works, both for recovery and overheads, to show that APEX outperforms them. Section VII concludes the paper and provides future directions to improve APEX.

II. RELATED WORK
The goal of data recovery is to recover lost data from a disk to the maximum extent. This data might be 'lost' because the disk has been corrupted or because files have been deleted. We focus on the latter aspect of the problem, and specifically on deletion by data stealing/deleting malware. There has been significant work on data recovery at different levels, including the file system, file allocation software, and recovery tools. However, there is still no holistic system that addresses generic, highly recoverable data storage by modifying the allocation policy. Table I provides a brief summary comparing APEX with other systems.

There are two main directions in which work on data recovery has progressed. The first concerns how data is recovered after deletion, and the second concerns allocating data such that later recovery is optimized. Significant parameters for comparison include overheads, on-demand recoverability, adaptivity to different application scenarios, and custom policy employability, among others. There has been work on dynamic file systems and allocation, such as by Ulrich et al. [19], where data is allocated between different drives for optimum resource utilization and distribution. Complex parity generation and comparison methods are used, spanning multiple drives, for improved utilization and recovery performance. However, it lacks priority based allocation and replacement policies optimized for recovery and access time. The popular Andrew File System (AFS) [15] in distributed systems also provides a backup mechanism to recover deleted or lost files for a limited period of time. This is not suitable for Fog nodes due to the limited disk space available (there is a high storage-to-compute cost ratio in Fog framework deployments, along with other communication limitations across the Fog network). Many efforts have also been made in the direction of hardware tweaking and optimization; an example is the work by Hanson [20].

Techniques that tag file blocks with identifiers have been used by Alhussein et al. in [22] and [24], where frameworks like FUSE have been used for the development of forensics oriented file systems. They provide forensic file identifiers at cluster-level file allocation to supply the information needed for file types to be recovered after deletion. As they are limited to file cluster identification and the identification of file types, the amount of recoverable data is limited because they completely ignore file usage characteristics and temporal locality. Another adaptive approach, by Stoica et al. [13], uses a weighted least squares Iterative Adaptive Approach (IAA) to detect and recover missing data, but this works mostly across streams of data signals and is not scalable to actual file systems. We consider it because of its unique adaptive approach to categorizing cluster cells for efficiently allocating them in buffers/disk. Other works, such as by Continella et al. [14] and Baek et al. [25], provide self-healing, ransomware-aware file systems capable of both detection of and recovery from ransomware attacks. They work by analyzing the I/O traces of various processes and use a classifier to detect whether a process is maliciously deleting data. If such a process is discovered, a recovery approach similar to that of copy-on-write file systems is used. This poses significant overheads in terms of disk space and I/O bandwidth requirements and hence is not optimal for Edge nodes.
Another problem is that it has a non-zero detection error for file destructive ransomware, as it has been trained on encryption based ransomware; we tested it on GandCrab and it was not able to identify the attack.

Table I. A comparison of related works with ours. Columns (✓ = property satisfied): recovery specific allocation; low computation overhead; low memory overhead; low disk overhead; low file I/O overhead; selective files can be marked critical; allows custom policies; designed for Edge/Fog/Cloud; adaptive; user specific; cross platform.

Ulrich et al. [19]: ✓
AFS [15]: ✓ ✓ ✓
Hanson [20]: ✓ ✓ ✓ ✓
Breeuwsma et al. [21]: ✓ ✓
Alhussein et al. [22]: ✓ ✓ ✓ ✓ ✓ ✓
Stoica et al. [13]: ✓ ✓ ✓
Huang et al. [23]: ✓ ✓ ✓ ✓ ✓ ✓ ✓
Alhussein et al. [24]: ✓ ✓ ✓ ✓
Baek et al. [25]: ✓ ✓ ✓ ✓
Lee et al. [26]: ✓ ✓ ✓ ✓ ✓ ✓
Continella et al. [14]: ✓ ✓ ✓ ✓ ✓ ✓
APEX [this work]: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Another technique, by Huang et al. [23], has been proposed for recovering encrypted data from flash drives. They provide a ransomware tolerant SSD with firmware-level recovery, but the main drawback of their approach is that they keep multiple copies of data and hence need significantly more space than required. Another disadvantage is that their approach is specific to SSDs rather than generic storage devices. Lee et al. [26] propose ExtSFR, a scalable file recovery framework for distributed file systems in IoT domains. It uses files' metadata to identify and recover them, but ignores file usage and access characteristics, which limits recoverability. In the aforementioned works, file access characteristics, recovery based allocation strategies, and cluster identifiers are exploited from a narrow perspective, and the capabilities of adaptive, priority based allocation approaches have not been fully leveraged. In addition, most systems overlook the constraints of the edge paradigm and hence are not cost and/or energy efficient for such deployments. APEX ensures data security using adaptive prioritization of blocks in storage media based on recovery heuristics, and provides efficient mechanisms with minimal compute, bandwidth and space overheads.

III. RECOVERABILITY AWARE FILE ALLOCATION
In the previous section, we described the flaws of existing systems and emphasized how a recovery-aware file allocation system can provide a more robust and efficient mechanism for recovering deleted data. Here we provide the details of implementing such a system and later describe how much improvement in recovery it provides. For this work, the threat model is a malicious entity which directly attacks the system to sabotage critical files, while the attack surface is limited to the user application and the kernel is assumed to be secure.
A. Block Parameters and Priority Factor
The four proposed parameters that act as heuristics to rank a block by its priority for allocation are the History Factor (HF), Usage Factor (UF), Spatial Factor (SF) and Linking Factor (LF). The Priority Factor (PF) of a block is a linear combination of these parameters. The weights of these factors are kept as hyper-parameters which are dynamically learned for improved recovery based on the user's file usage characteristics in the APEX model. This priority score is used to sort the unused blocks in a priority queue, which is then used for finding fresh blocks to allocate to a new file. The Priority Factor (PF) is defined as:

PF = λ · HF − σ · UF + ρ · SF + µ · LF

The Priority Factor is periodically calculated for each unused block of the disk. The hyper-parameters λ, σ, ρ, µ are the coefficients of the disk block parameters; they are dynamically updated based on the user's file access characteristics. A cluster/block is either Used or Unused. When a file is created, some unused blocks are allocated and thus move to the Used category. When a file is deleted, the Used blocks that correspond to that file are converted to the Unused category. The different types of files in APEX nomenclature are:
• Used: the file exists on disk.
• Deleted: the file has been deleted but its blocks still exist on disk and can be overwritten (only partial fragments of the file may be present).
• Obsolete: no block of the file exists in the current state of the disk.
The different types of operations are read, write, delete and create (new file). We now define the block parameters and the reasoning behind their importance:
1) History Factor (HF): The history factor accounts for how old a particular file's blocks are, in terms of "delete" or "over-write" operations. Each time some of a file's blocks are deleted or over-written, the HF of the rest of that file's blocks increases by one. At block level, the history factor can be visualized as shown in Figure 1(a); here, the "file" of a block refers to the set of blocks that belonged to the last file of which that block was a part. For the different cases of a block, we have:
• Unused: for every delete/over-write operation on the block's last file, HF is raised by 1.
• Unused to used transition: set HF to 1.
• Used: no change.
• Used to unused transition: set HF to 0.
This parameter exists for each block in the disk. The reason HF is important for recovery is that if blocks of the same file are overwritten, the extent to which the remaining blocks can be recovered decreases. Similar notions for capturing the history of a file are used in other works [15], [25], [27]. When HF increases, recovery of the cluster becomes more difficult, and hence PF should increase, which implies that the scalar constant (λ) should be positive.

Figure 1. Block heuristics: (a) History Factor, (b) Usage Factor.
2) Usage Factor (UF): The usage factor takes into account the usage of a file (and thus of its blocks). It is quantified by the number of read/write operations on that particular file before deletion/over-writing. The higher the usage of a particular file, the more recovery sensitive (important to the user) the file is. The change to the usage factor for each file operation is shown in Figure 1(b). The UF of all blocks in the disk is initialized to 0. For the different cases of a block, we have:
• Used: for each read/write operation on any block of the file, the UF of each of the file's blocks increases by 1.
• Used to unused transition: no change.
• Unused: no change.
• Unused to used transition: set UF to 1 when a new file is created.
Other works [25], [27] also use similar notions of file/block prioritization based on file usage, because frequently used files are considered more important to the user. This means that the PF (the priority for overwriting) should decrease with usage, and hence the scalar constant (σ) carries a negative sign.
3) Spatial Factor (SF): The spatial factor captures the possibility of recovery of the blocks located in the neighborhood of a particular block. Its importance has been shown in other works as well [25], [28] because of its direct effect on file access time. The spatial factor of a block is higher if the overall priority factors of the neighboring blocks are high. Thus, SF is kept as the average PF of the blocks that are physically adjacent. 'Spatial adjacency' depends on the physical characteristics of the medium; for an HDD we may consider the blocks in the same sector as the neighboring ones. As the blocks allocated to a new file are the high priority blocks, a block whose neighbors have high PF should also be replaced so that spatial locality increases; hence for high SF the PF should be high, which shows that the scalar constant (ρ) is positive. The SFs of all blocks are updated after each I/O operation. For the different cases of a block, we have:
• Used: reset to 0.
• Unused: average PF of the nearby blocks.
This factor is initialized to 0. It is very useful for HDDs, and our algorithm can take into account on-the-fly de-fragmentation in such storage devices. However, the introduction of SF should not interfere with the wear leveling algorithms employed by flash drives, and thus this factor is dropped for random access drives. For random access drives and SSDs, optimizing spatial locality is not required.
4) Linking Factor (LF): The linking factor depends on the format of a particular file and captures the extent of recovery possible for specific file formats. Some files, like .jpg, .mp3 and .avi, can be partially recovered even if some blocks are deleted/over-written. Such files should be distinguished from ELF (Executable and Linkable Format) types like .exe, which cannot be recovered even if only one block has been over-written [29]. Unused blocks whose last file is still present and belongs to the non-ELF class of formats have LF = 0; blocks that belonged to files with ELF formats have LF = 1. This factor is initialized to 1, as in the beginning blocks are not linked to any file. The scalar constant (µ) is positive in this case as well. The Priority Factor (PF) depicts the priority of a block for being overwritten by new data: higher-PF blocks are overwritten first, while blocks with low PF are more sensitive to recovery.
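To make the above bookkeeping concrete, the following is a minimal Python sketch (illustrative only; the class and helper names are ours, not the APEX implementation) of per-block state and the priority factor:

```python
from dataclasses import dataclass

@dataclass
class Block:
    """Per-block heuristic state (illustrative sketch)."""
    hf: int = 0      # History Factor: delete/overwrite events on the block's last file
    uf: int = 0      # Usage Factor: read/write operations on the block's last file
    sf: float = 0.0  # Spatial Factor: average PF of physically adjacent blocks
    lf: int = 1      # Linking Factor: 1 for ELF-style formats, 0 for partially recoverable ones
    used: bool = False

def priority_factor(b: Block, lam: float, sigma: float, rho: float, mu: float) -> float:
    # PF = lambda*HF - sigma*UF + rho*SF + mu*LF; higher PF means "overwrite me first"
    return lam * b.hf - sigma * b.uf + rho * b.sf + mu * b.lf

def on_file_deleted(block: Block) -> None:
    # Used -> Unused transition: HF resets to 0 (UF and LF keep their last values)
    block.used, block.hf = False, 0

def on_sibling_overwritten(block: Block) -> None:
    # An unused block loses another block of its former file: recovering the
    # remaining blocks is now less valuable, so HF (and hence PF) increases
    if not block.used:
        block.hf += 1
```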
B. Functional Model Working
We use PF as a heuristic to rank blocks in decreasing priority of being allocated to new files. Block allocation, whenever a new file is created, is thus based on the priority factor: the blocks with the highest priority factors are allocated to the new file, up to the number of blocks it requires. Whenever a file is deleted, its used blocks are moved to the unused set, and this set is again used for allocation to new files, allocating the highest-PF blocks first.

We now show how the ranking based on the priority factor can be improved further by tuning the hyper-parameters λ, σ, ρ, µ dynamically. This tuning is based on the file access characteristics of the user and allows more efficient allocation.
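Continuing the sketch above (same illustrative assumptions), allocation for a new file can be expressed as popping the highest-PF blocks from a heap of unused blocks:

```python
import heapq

def allocate(unused_blocks, n_needed, weights):
    """Pick the n_needed unused blocks with the highest PF (heapq is a min-heap,
    so priorities are negated). 'weights' is the tuple (lambda, sigma, rho, mu)."""
    heap = [(-priority_factor(b, *weights), i, b) for i, b in enumerate(unused_blocks)]
    heapq.heapify(heap)
    allocated = []
    while heap and len(allocated) < n_needed:
        _, _, blk = heapq.heappop(heap)
        blk.used = True
        blk.hf, blk.uf = 1, 1   # unused -> used transition: HF = 1, UF = 1 for the new file
        allocated.append(blk)
    return allocated
```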
IV. HYPER-PARAMETER OPTIMIZATION: ADAPTIVE ALLOCATION MODEL
In the previous section, we described how the priority factor can be used to rank blocks to improve post-deletion data recovery. This heuristic measure is still an approximation, and there is scope for further tuning the hyper-parameters λ, σ, ρ, µ discussed before. This tuning can be made dynamic, based on the user's file access characteristics, which makes the heuristic more precise and leads to better allocation. We next describe a simulated environment, close to a real-life scenario, used to determine the optimum values of the hyper-parameters.

Figure 2. Model description using UML (Unified Modeling Language): the learning module interacts with the File Allocation Table, the OS and the storage media, updating block parameters on each I/O operation and allocating new blocks by priority.
A. Learning Model
We use Q-Learning, a reinforcement learning method, to optimize the set of hyper-parameters. The state of the Q-Learning model is defined by the tuple of hyper-parameters, and an action is an increment/decrement of one of the hyper-parameters. Converging to the optimal set of hyper-parameters is the goal of the model. The model is divided into two parts: the first is the learning model, which optimizes these parameters; the other is the allocation system, which allocates blocks to new files based on the current hyper-parameter set and the corresponding PFs from the learning model.

Considering each I/O operation as one iteration, the learning model updates the coefficients at each iteration and gradually converges to the best set of values. The coefficients are updated frequently because of the dynamism of disk/file accesses. The model is trained on the basis of a performance measure (P), which is a combination of the average recovery ratio of files and the spatial locality (only the recovery ratio for random access drives). Each action changes this measure, and through the learning algorithm the model converges to an optimum hyper-parameter set. In building the prototype file system APEX, we trained the Q-learning model on representative file access rates and a distribution of common file types.

Figure 2 shows a diagrammatic representation of the model and how the learning model interacts with the File Allocation Table (FAT), the OS and the disk. As the state space needs to be finite and the model should converge to a definite solution, each hyper-parameter is restricted to the integer values 1, 2, ..., 10. This range is based on empirical results and convergence constraints, and gives a crude idea of the values required for the best recovery efficiency while keeping the least count of the value scale at 1.
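As a rough illustration (not the authors' code), the state, action set and tabular update of such a Q-learning loop could look as follows; the learning rate and discount factor are arbitrary placeholders:

```python
from collections import defaultdict

# State: the tuple (lambda, sigma, rho, mu); action: +/-1 on one coefficient, clamped to 1..10
ACTIONS = [(i, d) for i in range(4) for d in (-1, +1)]

def apply_action(state, action):
    idx, delta = action
    coeffs = list(state)
    coeffs[idx] = min(10, max(1, coeffs[idx] + delta))
    return tuple(coeffs)

Q = defaultdict(float)

def q_update(state, action, reward, next_state, lr=0.1, gamma=0.9):
    """Tabular Q-learning update; 'reward' is the observed change in performance P."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += lr * (reward + gamma * best_next - Q[(state, action)])
```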
B. Objective Function

The objective function, referred to as the performance P of the learning model, is implemented as a linear combination of the weighted average Recovery Ratio (RR) and a measure of file access time. RR is a standard metric for comparing recovery across file systems [14], [22] and is here weighted by the usage frequency of the files; a measure quantifying the average access time of files on the disk [25], [28] is also a common metric. We define, for each file:
• Recovery Ratio (RR):
  – For executable, object and archived file types: 1 for complete recovery, 0 for partial/no recovery.
  – For files such as text, word processor, multimedia and .pdf files: (data recovered in bytes) / (original file size in bytes) if the meta-data is recovered, else 0 (because the file cannot even be partially read without the meta-information).
• Approximate Access Time (AAT): an approximate measure of the access time of a file, taken as the time of the last read/write operation; when a file is created it is the file creation time.
Now, the performance P is defined as a convex combination of the above terms:

P = α · (Σ over deleted files of RR · UF) / (Σ over deleted files of UF) − β · (Σ over current files of AAT) / (number of current files)
Here, 0 ≤ α ≤ 1, 0 ≤ β ≤ 1 and α + β = 1. The two terms weighted by α and β are not independent. The access time term depends not only on the spatial distribution of files but also on the average size of the current files, which in turn affects the recovery of other files; it also affects recovery because in many spatial distributions the meta-data is erased, which makes a partially recoverable file (i.e. one for which some data blocks still exist) have RR = 0. The recovery ratio term depends on how many blocks of deleted files are present and also on which of them are present (meta-data or data). Different values of α and β suit different scenarios. If we want to optimize recovery only, then α = 1 and β = 0; for example, where edge devices store critical data like video surveillance footage or health monitoring data, which do not require high I/O bandwidth but critically require recoverability. If we want to optimize I/O only, then α = 0 and β = 1; this case arises in edge configurations with high bandwidth data streams where all data is stored on a separate database node or in the cloud and fast I/O is required.
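For reference, the performance measure could be evaluated in a simulator as in the following sketch, where each file object carries its RR, UF and AAT values (names are illustrative, not the paper's code):

```python
def performance(deleted_files, current_files, alpha, beta):
    """P = alpha * (sum RR*UF / sum UF over deleted files)
           - beta * (average AAT over current files), with alpha + beta = 1."""
    uf_sum = sum(f.uf for f in deleted_files)
    recovery = (sum(f.rr * f.uf for f in deleted_files) / uf_sum) if uf_sum else 0.0
    access = (sum(f.aat for f in current_files) / len(current_files)) if current_files else 0.0
    return alpha * recovery - beta * access
```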
C. Disk Simulator

The disk block size in the simulation model is kept the same as the real value, 4 KB. The total disk space is kept at 256 MB, and a file can vary from 16 KB (4 blocks) to 1024 KB/1 MB (256 blocks). Based on the rules for parameter updates, the parameter maps are created for each state, i.e., each hyper-parameter set. The simulator also allocates blocks (based on ranking by PF) whenever there is a call from the OS simulator (explained next). It additionally keeps track of the MRPF (Most Recent Parent File) for each block, which is a pair of the filename that the block belongs to (or belonged to before logical deletion) and the set of blocks that were allocated to that file.
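The MRPF bookkeeping described above can be represented with a small lookup table; the following sketch uses hypothetical names:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

@dataclass(frozen=True)
class MRPF:
    """Most Recent Parent File: last owning file and the block set it occupied."""
    filename: str
    block_ids: FrozenSet[int]

def record_new_file(table: Dict[int, MRPF], filename: str, block_ids) -> None:
    entry = MRPF(filename, frozenset(block_ids))
    for b in block_ids:      # every block of the newly created file shares one MRPF entry
        table[b] = entry
```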
D. OS / File IO Simulator
This simulator performs the file level / OS level disk management. It introduces random file operations belonging to one of three categories: (1) read/write, (2) create file, (3) delete file. The "create file" operation updates the MRPF of each block that belongs to the new file after the operation.
E. Learning Hyper Parameters
Based on the definition of the hyper-parameters and the performance function, the optimization problem is formulated as:

maximize over λ, σ, ρ, µ:  P
subject to:  λ, σ, ρ, µ ∈ [1, 10],  λ, σ, ρ, µ ∈ ℕ,
and blocks allocated to new files in decreasing order of PF = λ·HF − σ·UF + ρ·SF + µ·LF.
Figure 3. Performance with Model Iteration Number (MIN)
The model has been simulated using two iteration counts:
1) Model Iteration Number (MIN): increments by one for each state change (action) in the learning model.
2) Operation Iteration Number (OIN): increments by one for each file I/O operation performed by the OS simulator.
The number of file I/O operations should exceed the number of model actions, since the learning model only obtains a useful view of the state space after many operations have been applied; to learn the optimal values of the hyper-parameters, we increment MIN once after every 1000 OIN increments. This helps to reach a stable configuration after many I/O operations and to evaluate the disk performance under the current set of hyper-parameter values. The conditions kept for the learning stage are:
1) Disk size: 16 × 16 = 256 blocks.
2) Maximum file size: 20 blocks.
3) Percentage of linked files: 20 (stochastically varying around this value).
4) Minimum disk utilization: 70%.
As in general Q-Learning models, this learning model has an exploration factor ε, which drops exponentially from 1. It decides with what probability a random action takes place: a random action is chosen with probability ε, and the optimal action with probability 1 − ε. As ε decreases, the model increasingly chooses the action with the highest ΔP, the change in the performance function of the learning model. Gradually, as the model explores, ε reduces and the model converges to the optimal state; for this experiment, the learning model is considered converged once ε has decayed to a negligible value. The performance measure, as described earlier, starts from 44.00 (based on Ext4 prioritization factors) and tends to nearly 190 over time; based on the definition of the performance measure, this is an approximately 280% improvement, as shown in Figure 3. The hyper-parameter values converge to (4, 7, 1, 9).

This set of hyper-parameter values is then used in a dynamic setting for block prioritization in the APEX framework. The initial Priority Factor, updated on-the-fly by the learning model and used for block allocation for new files in APEX, is:

PF = 4 · HF − 7 · UF + SF + 9 · LF
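The ε-greedy action selection with exponential decay described above can be sketched as follows (the decay constant and initial value are illustrative, not the values used in the paper):

```python
import math
import random

def choose_action(Q, state, actions, model_iteration, eps0=1.0, decay=1e-3):
    """Explore with probability eps (decaying exponentially with the Model
    Iteration Number), otherwise exploit the action with the highest Q-value."""
    eps = eps0 * math.exp(-decay * model_iteration)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```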
V. IMPLEMENTATION OF APEX USING FUSE

To implement APEX, we used the FUSE framework [17]. FUSE (Filesystem in Userspace) is an interface for user programs to export a file system to the Linux kernel. FUSE mainly consists of two components: the FUSE kernel module and the libfuse library. Libfuse provides the reference implementation for communicating with the FUSE kernel module. It is the most widely used such library, with active support, and allows users to develop, simulate and test their own file systems without writing kernel-level code.

To implement APEX, we need the following structures: (1) Disk, (2) Directory, (3) File, (4) Block. From the disk simulation system, we only required the
Disk, File and Block structures, modified to the FUSE format, together with a new Directory implementation for APEX. Instead of Disk being a 2D array of Blocks, it is now a large ByteBuffer indexed by Blocks. Disk contains the optimal coefficient values of the parameters (found from the RL model), the current and deleted file lists, a Heap of unused blocks and a HashSet of used blocks. File and Directory are derived from an abstract class Path, which has the specifications required for a file system structure. A Directory can recursively have several child Paths (each of which can be a file or a directory). A File has several child Blocks, and a file type (FUSE type and linking factor).

At the block level, the main operations are:
1) Allocate: sets the block's state to used, initializes the block's parameters, updates the parent file pointer and obtains the linking factor (found from the parent file's FUSE FILE INFO).
2) De-allocate: sets the state to unused, and updates the block's parameters and parent pointer.
3) Write: writes the required content to the block, including writing from a buffer at an offset.
4) Read: returns the block's data.
5) Change: updates the individual factors of the block.
6) Update: updates the priority score of the block and changes its position in the unused heap (if applicable).
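A condensed Python sketch of these block-level operations is given below; it mirrors the list above but is purely illustrative (the APEX prototype itself is built on libfuse, and the names here are not those of the actual implementation):

```python
class APEXBlock:
    BLOCK_SIZE = 4096

    def __init__(self, index):
        self.index = index
        self.data = bytearray(self.BLOCK_SIZE)
        self.used, self.parent = False, None
        self.hf, self.uf, self.sf, self.lf = 0, 0, 0.0, 1

    def allocate(self, parent_file):
        """Mark used, initialize parameters, remember the parent and its linking factor."""
        self.used, self.parent = True, parent_file
        self.hf, self.uf = 1, 1
        # LF = 0 for formats that tolerate partial loss (.jpg, .mp3, ...), 1 for ELF-style files
        self.lf = 0 if parent_file.partially_recoverable else 1

    def deallocate(self):
        """Return the block to the unused pool; HF restarts from 0."""
        self.used, self.hf = False, 0

    def write(self, buf, offset=0):
        self.data[offset:offset + len(buf)] = buf

    def read(self):
        return bytes(self.data)
```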
VI. OVERWRITING VIDEO SURVEILLANCE USING CRYPY MALWARE: A CASE STUDY
In order to compare APEX with other works, we use a Fog environment built on FogBus [4] for video surveillance on Raspberry Pi based Fog nodes. As such nodes are usually connected to the cloud via the Internet, many viruses, trojans, etc. are likely to creep into the network. We compare APEX with other recovery specific systems that can be integrated on Edge nodes: AFS [15], FFS [22], ShieldFS [14] and ExtSFR [26].
A. Setup
The machine used for the experiments, a Raspberry Pi 3 Model B, has the following specifications:
1) SoC: Broadcom BCM2837.
2) CPU: 1.2 GHz quad-core ARM Cortex-A53.
3) RAM: 1 GB LPDDR2-900 SDRAM.
4) Storage: 100 GB WD Black HDD or Samsung T5 SSD.

Figure 4. Recovery ratio of Ext4, AFS, FFS, ShieldFS, ExtSFR and APEX for different secondary data sizes: (a) 414 MB, (b) 794 MB.

B. Experiment 1: Recovery ratio performance
In order to compare data recovery performance, similar tests were executed on the APEX file system, the base (Ext4) file system and the other works, and the recovery ratios were compared. For a simple analysis, we keep the file system size for user data limited to 1 GB. First, the primary data was deleted using the CryPy malware [18] (after custom modifications) and secondary data was written to the drive; after this, the primary data was recovered. We perform tests for both HDD and SSD, which give very similar results (negligible difference in recovery ratio), and hence report the average.
1) Test files:
The primary data was kept the same throughout the experiment for consistency and consists of a set of 5 sample videos, each of size 95 MB. The secondary data size was increased in each iteration by predefined constants and was written only after soft deletion of the primary data. Data recovered was measured in terms of the size of the recovered primary data, and also in terms of the number of recovered files that were viewable after recovery.
2) Recovery ratio:
Recovery ratio = (data recovered) / (primary data size).
3) Observations:
Across the different file systems, as shown in Figure 4(a) and Figure 4(b) for secondary data sizes of 414 MB and 794 MB respectively, the APEX file system has the highest recovery ratio. As there is no recovery oriented policy in Ext4, it has the lowest recovery (even 0 MB in the second case). As the surveillance footage was captured such that each video accounts for 1 hour of data, the AFS file system, with its 1 hour delayed deletion, loses 4 out of the 5 files permanently; this remains unchanged in both experiments. The others, including FFS, ShieldFS and ExtSFR, use only file identifiers in blocks and do not consider which files were most recently and frequently used (the video files in our case), so they overwrite their data and hence achieve lower recovery. APEX, on the other hand, allocates new files (i.e. the secondary data) to separate locations, preventing overwriting of the original data, and is thus able to recover the most. APEX improves the recovery ratio by 678% and 32% compared to the base Ext4 and ExtSFR respectively for 414 MB of secondary data, and by 31% compared to ExtSFR for 794 MB of secondary data.
C. Experiment 2: Read-write performance
The APEX file system adds block level prioritized file allocation on top of Ext4, which introduces additional overhead; the FUSE implementation of APEX was therefore compared to the base file system, which in our case is Ext4. To test the read, write and delete speeds of both systems, we used the FileBench file system benchmarking tool [30]. These benchmarks provide read, write and delete speeds for n files, each of size m (where n and m are inputs to the benchmark code). For the current study, random files were sequentially created, read, written and deleted, with n varying from 10 to 1000 and m varying from 1 MB to 1000 MB. Figures 5(a)-5(c) show the results for writing, reading and deleting for the different frameworks using HDD media (which has the higher overhead, so we report these results). The graphs show that the delete speed of all frameworks except AFS is close to that of Ext4. Read/write speeds are lower than Ext4 for all recovery aware systems, but the overhead is lowest for APEX. The overhead in FFS arises primarily from the identifier mapping in blocks being updated periodically; in ShieldFS, the detection algorithm continuously monitors the disk footprint of the FileBench program and consumes bandwidth. Overall, APEX has the least overhead among all implementations.

Figure 5. Write, read and delete speed comparison of Ext4, AFS, FFS, ShieldFS, ExtSFR and APEX: (a) write speed in KB/s, (b) read speed in MB/s, (c) delete speed in MB/s.

D. Experiment 3: CPU and RAM overhead
Due to the various complex background tasks of process I/O detection in ShieldFS, file block identification in FFS and data backup in AFS, the CPU and RAM usage of those frameworks is much higher than that of APEX. ShieldFS uses many sophisticated cryptographic primitives to identify whether a program could be 'potentially malicious', which requires constant computation and memory overhead. ExtSFR has a post-deletion inode tracing and journal checking protocol which needs to scan the whole file system even for very small changes; APEX maintains the state of the files, so only the files of interest need to be checked. FFS maintains a 'forensic file system identifier' for each file and identifies the information needed for recovery using such identifiers; these identifiers are stored on disk and must be fetched when checking for recoverable files, which makes it slow. APEX uses the disk cache information to update the block parameters of the most used blocks more frequently than others, which gives APEX a very small working set at any instant of time (at most 7% in the current tests). It integrates file allocation and energy management to keep CPU and memory consumption lowest, as shown in Figure 6.

Figure 6. CPU and RAM comparison of Ext4, AFS, FFS, ShieldFS, ExtSFR and APEX: (a) CPU usage (%), (b) RAM usage (%).

VII. CONCLUSIONS AND FUTURE WORK
APEX can successfully simulate and provide optimal values of the coefficients in the Priority Factor (PF), which can rank the blocks to optimize recovery performance. The model converges to the PF given in Section IV. Using this factor (dynamically optimized by Q-Learning) to rank all blocks in the disk provides a file allocation system with a higher data recovery ratio: at any point, if we want to recover a file, it will be recovered to the maximum extent possible in this design compared to existing implementations. The current model gives a recovery performance improvement of 280%, where performance is determined as described in Section IV. Experiments in Section VI-B show that APEX improves the recovery ratio by 678% and 32% compared to the base Ext4 file system and ExtSFR (the best among the prior work) respectively. Hence, APEX can improve data recovery with minimal read-write and compute overheads and is therefore well suited for resource-constrained edge devices vulnerable to attacks by data stealing malware.

The proposed work only aims to enhance the recoverability of deleted files. It can be extended to secure file systems against malicious transfer and corruption of files, and to cover other types of file systems, such as distributed file systems. Further, the model can optimize wear levelling in tandem with recovery performance, especially for Flash drives and SSDs. It can also be extended to emerging Non-Volatile Memories like RRAMs (Resistive Random Access Memory) [31] for upcoming energy efficient edge devices. The code developed and used for data recovery and the Q-Learning model, along with simulation results, can be found at: https://github.com/HS-Optimization-with-AI.