APEX: Adaptive Ext4 File System for Enhanced Data Recoverability in Edge Devices
Shreshth Tuli
Indian Institute of Technology Delhi
[email protected]
Shikhar Tuli
Indian Institute of Technology Delhi
[email protected]
Udit Jain
Indian Institute of Technology Delhi
[email protected]
Rajkumar Buyya
The University of Melbourne, Melbourne
[email protected]
Abstract—Recently, the Edge Computing paradigm has gained significant popularity in both industry and academia. With its increased usage in real-life scenarios, the security, privacy and integrity of data in such environments have become critical. Malicious deletion of mission-critical data due to ransomware, trojans and viruses has been a huge menace, and recovering such lost data is an active field of research. As most Edge computing devices have compute and storage limitations, difficult constraints arise in providing an optimal scheme for data protection. These devices mostly use Linux/Unix based operating systems. Hence, this work focuses on extending the Ext4 file system to APEX (Adaptive Ext4): a file system based on a novel on-the-fly learning model that provides an Adaptive Recoverability Aware file allocation platform for efficient post-deletion data recovery and therefore maintains data integrity. Our recovery model and its lightweight implementation allow significant improvement in recoverability of lost data with lower compute, space, time, and cost overheads compared to other methods. We demonstrate the effectiveness of APEX through a case study of overwriting surveillance videos by the CryPy malware on a Raspberry-Pi based Edge deployment and show 678% and 32% higher recovery than Ext4 and the current state-of-the-art file systems. We also evaluate the overhead characteristics and experimentally show that they are lower than those of other related works.
Index Terms—Edge computing, Security, File System, Data Recovery, Data Stealing Malware
I. INTRODUCTION
The Internet of Things (IoT) paradigm enables the integration of different sensors, compute resources and actuators to perceive the external environment and act to provide utility in applications like healthcare, transportation and surveillance, among others [1]. The original idea for providing enhanced Quality of Service (QoS), which is a measure of the performance of such services, was to provide distributed computation and computation on the Cloud [2]. A new computing paradigm, namely 'Fog Computing', leverages resources both at the public cloud and at the edge of the network. It provides resource abundant machines available at multi-hop distance and resource constrained machines close to the user. Several works have shown that Fog computing can provide better and cheaper solutions compared to Cloud-only approaches [3], [4]. Thus, many such IoT systems have recently been realized using Edge/Fog computing frameworks [5]. Due to the increased usage of such devices and frameworks, there has been growing interest in developing efficient techniques to improve the QoS that such systems provide [6].

Edge is considered a nontrivial extension of the cloud; therefore, it is inevitable that the same security and privacy challenges will persist [7]. Many works in the literature show that there exist new security threats in the Edge paradigm which provide opportunities for developing more robust and better systems [8]. One of the most crucial security problems is the loss of critical data due to malicious entities [8]. Some ways used by hackers to corrupt, delete and steal such critical data include data stealing malware, ransomware, trojans, and deleting or overwriting viruses. Most malware attacks have been based on stealing crucial data from the system and requesting ransom payments in return for the data. Unfortunately, according to a recent survey [9], the number of novel ransomwares/trojans has increased up to 50 times in the last decade, amounting to millions of dollars of illicit revenue.

Preventing such attacks has not been popular in IoT, because it requires significant computation and space [10], which drastically increases the cost of IoT deployment and is unfavorable for users. Also, it has been shown that no matter what protection mechanisms are put in place, edge paradigms will be successfully attacked [11]. Thus, detection of and recovery from such attacks is a critical requirement for the Edge Computing domain. Due to the increasing frequency of novel attacks and attack types, detection is quite challenging [12]. Despite such challenges, there exists prior work with high accuracy in detecting such attacks [13], [14], but only a few of these works utilize the ability of policy based allocation to recover from such attacks. The ones that do utilize policy based allocation schemes [14], [15] have significant overheads of read/write latency, computation time and space requirements, and limited recoverability, which are not feasible for Edge nodes. We discuss more limitations of such strategies in
Section II and identify the scope for improvement in terms of proper allocation with mechanisms allowing faster, more portable and more efficient recovery. Our work primarily focuses on recovery in Hard Disk Drives (HDD), Flash media and Solid State Drives (SSD).

Most conventional file systems such as Ext, NTFS and exFAT do not allow users to independently tag blocks/clusters/sectors as deleted or unused [16], where a block on a hard disk is a group of sectors that the operating system can point to. The lack of such freedom forces the kernel to overwrite data at arbitrary locations, reducing recovery. Some optimizations exist but are proprietary and not customizable to user specific needs. This prevents currently available kernels from utilizing the full potential of file allocation and tagging for efficient recovery. Even file allocation and mapping with virtual tables is restricted to specific, fixed algorithms. Allowing these algorithms to be recoverability aware when allocating blocks/clusters to files can improve the amount of data that can be recovered from such file systems. Making these algorithms adaptive and equipping them with a learning model can lead to further optimization absent in current systems.

The proposed file system APEX (Adaptive Ext4) implements an adaptive file allocation policy that supports a wide diversity of platforms due to its portable implementation. It provides a significant improvement in recovery of files with low overheads. It is designed to be lightweight and easy to deploy in Edge/Fog computing frameworks, increasing their reliability and data protection. Another advantage is that it provides improved forensic recovery for criminal investigations to expose evidence and hence catch hackers/invaders. The main contributions of our work are as follows:
• We propose a lightweight, adaptive, portable and efficient file allocation system optimized for higher post-deletion recovery, which is flexible, robust and independent of storage architecture.
• We provide a set of pre-optimized weights that need only slight variation of hyper-parameters depending on usage, and thus a low adaptation time for new scenarios.
• We develop a prototype file system, APEX, and show its efficacy on a real-life scenario of malicious deletion/overwriting of video surveillance footage.
The rest of the paper is organized as follows. In
Section II, we provide related works and compare them with ours. In Section III, we first provide a basic recoverability aware allocation mechanism and describe a heuristic based block ranking method that can optimize post-deletion data recovery. We then improve this heuristic measure by updating it dynamically to prioritize files based on the user's file access characteristics in Section IV, and also provide model level details of a Disk Simulator for learning the weights (hyper-parameters) of the block parameters based on general-user file access characteristics. In Section V, we extend our implementation discussion to the APEX file system using the FUSE (Filesystem in Userspace) framework [17]. In Section VI, we provide a case study of overwriting video surveillance data using the CryPy malware [18] and provide experimental results of the model and comparisons with other works, both for recovery and overheads, to show that APEX outperforms them. Section VII concludes the paper and provides future directions to improve APEX.

II. RELATED WORK
The goal of data recovery is to recover lost data from a disk to the maximum extent. This data might be 'lost' because the disk has been corrupted or because files have been deleted. We focus on the latter aspect of the problem, and specifically on deletion by data stealing/deleting malware. There has been significant work on data recovery at different levels, including the file system, file allocation software, and recovery tools. However, there is still no holistic system that addresses generic, highly recoverable data storage by modifying the allocation policy. Table I provides a brief summary comparing APEX with other systems.

There are two main directions in which work on data recovery has progressed. The first concerns how data is recovered after deletion, and the second concerns allocating data such that later recovery is optimized. Significant parameters for comparison include overheads, on-demand recoverability, adaptivity to different application scenarios, and custom policy employability, among others. There has been work on dynamic file systems and allocation, such as by Ulrich et al. [19], where data is allocated between different drives for optimum resource utilization and distribution. Complex parity generation and comparison methods are used, spanning multiple drives, for improved utilization and recovery performance. However, it lacks priority based allocation and replacement policies optimized for recovery and access time. The popular Andrew File System (AFS) [15] in distributed systems also provides a backup mechanism to recover deleted or lost files for a limited period of time. This is not suitable for Fog nodes due to the limited disk space available (there is a high storage-to-compute cost ratio in Fog framework deployments, along with other communication limitations across the Fog network). Many efforts have also been made in the direction of hardware tweaking and optimization; an example is the work by Hanson [20].

Techniques that tag file blocks with identifiers have been used by Alhussein et al. in [22] and [24], where frameworks like FUSE have been used for the development of forensics oriented file systems. They provide forensic file identifiers at cluster-level file allocation to supply the information needed for file types to be recovered after deletion. As they are limited to file cluster identification and the identification of file types, the amount of recoverable data is limited because they completely ignore file usage characteristics and temporal locality. Another adaptive approach, by Stoica et al. [13], uses a weighted least squares Iterative Adaptive Approach (IAA) to detect and recover missing data, but this works mostly across streams of data signals and is not scalable to actual file systems. We consider it because of its unique adaptive approach to categorizing cluster cells for efficiently allocating them in buffers/disk. Other works, such as by Continella et al. [14] and Baek et al. [25], provide self-healing, ransomware-aware file systems capable of both detection of and recovery from ransomware attacks. They work by analyzing the I/O traces of various processes and use a classifier to detect whether a process is maliciously deleting data. If such a process is discovered, a recovery approach similar to that of copy-on-write file systems is used. This poses significant overheads in terms of disk space and I/O bandwidth requirements and hence is not optimal for Edge nodes.
Another problem is that it has a non-zero detection error for file destructive ransomware, as it has been trained on encryption based ransomware; we tested it on GandCrab and it was not able to identify the attack.

Table I. A comparison of related works with ours. Columns (✓ = property satisfied): recovery specific allocation; low computation overhead; low memory overhead; low disk overhead; low file I/O overhead; selective files can be marked critical; allows custom policies; designed for Edge/Fog/Cloud; adaptive; user specific; cross platform.

Ulrich et al. [19]: ✓
AFS [15]: ✓ ✓ ✓
Hanson [20]: ✓ ✓ ✓ ✓
Breeuwsma et al. [21]: ✓ ✓
Alhussein et al. [22]: ✓ ✓ ✓ ✓ ✓ ✓
Stoica et al. [13]: ✓ ✓ ✓
Huang et al. [23]: ✓ ✓ ✓ ✓ ✓ ✓ ✓
Alhussein et al. [24]: ✓ ✓ ✓ ✓
Baek et al. [25]: ✓ ✓ ✓ ✓
Lee et al. [26]: ✓ ✓ ✓ ✓ ✓ ✓
Continella et al. [14]: ✓ ✓ ✓ ✓ ✓ ✓
APEX [this work]: ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Another technique, by Huang et al. [23], has been proposed for recovering encrypted data from flash drives. They provide a ransomware tolerant SSD with firmware-level recovery, but the main drawback of their approach is that they keep multiple copies of data and hence need significantly more space than required. Another disadvantage is that their approach is specific to SSDs rather than generic storage devices. Lee et al. [26] propose ExtSFR, a scalable file recovery framework for distributed file systems in IoT domains. It uses files' metadata to identify and recover them, but ignores file usage and access characteristics, which limits recoverability. In the aforementioned works, file access characteristics, recovery based allocation strategies, and cluster identifiers are exploited from a narrow perspective, and the capabilities of adaptive, priority based allocation approaches have not been fully leveraged. In addition, most systems overlook the constraints of the edge paradigm and hence are not cost and/or energy efficient for such deployments. APEX ensures data security using adaptive prioritization of blocks in storage media based on recovery heuristics, and provides efficient mechanisms with minimal compute, bandwidth and space overheads.

III. RECOVERABILITY AWARE FILE ALLOCATION
In the previous section, we described the flaws of existing systems and emphasized how a recovery-aware file allocation system can provide a more robust and efficient mechanism for recovering deleted data. Here we provide the details of implementing such a system and later describe how much improvement in recovery it provides. For this work, the threat model is a malicious entity which directly attacks the system to sabotage critical files, while the attack surface is limited to the user application and the kernel is assumed to be secure.
A. Block Parameters and Priority Factor
The four proposed parameters that act as heuristics to rank a block by its priority for allocation are the History Factor (HF), Usage Factor (UF), Spatial Factor (SF) and Linking Factor (LF). The Priority Factor (PF) of a block is a linear combination of these parameters. The weights of these factors are kept as hyper-parameters which are dynamically learned for improved recovery based on the user's file usage characteristics in the APEX model. This priority score is used to sort the unused blocks in a priority queue, which is then used for finding fresh blocks to allocate to a new file. The Priority Factor (PF) is defined as:

PF = λ · HF − σ · UF + ρ · SF + µ · LF

The Priority Factor is periodically calculated for each unused block of the disk. The hyper-parameters λ, σ, ρ, µ are the coefficients of the disk block parameters; they are dynamically updated based on the user's file access characteristics. A cluster/block is either Used or Unused. When a file is created, some unused blocks are allocated and thus move to the Used category. When a file is deleted, the Used blocks that correspond to that file are converted to the Unused category. The different types of files in APEX nomenclature are:
• Used: the file exists on disk.
• Deleted: the file has been deleted but its blocks still exist on disk and can be overwritten (only partial fragments of the file may be present).
• Obsolete: no block of the file exists in the current state of the disk.
The different types of operations are read, write, delete and create (new file). We now define the block parameters and the reasoning behind their importance:
1) History Factor (HF): The history factor accounts for how old a particular file's blocks are, in terms of "delete" or "over-write" operations. Each time some of a file's blocks are deleted or over-written, the HF of the rest of that file's blocks increases by one. At block level, the history factor can be visualized as shown in Figure 1(a); here, the "file" of a block refers to the set of blocks that belonged to the last file of which that block was a part. For the different cases of a block, we have:
• Unused: for every delete/over-write operation on the block's last file, HF is raised by 1.
• Unused to used transition: set HF to 1.
• Used: no change.
• Used to unused transition: set HF to 0.
This parameter exists for each block in the disk. The reason HF is important for recovery is that if blocks of the same file are overwritten, the extent to which the remaining blocks can be recovered decreases. Similar notions for capturing the history of a file are used in other works [15], [25], [27]. When HF increases, recovery of the cluster becomes more difficult, and hence PF should increase, which implies that the scalar constant (λ) should be positive.

Figure 1. Block heuristics: (a) History Factor, (b) Usage Factor.
2) Usage Factor (UF): The usage factor takes into account the usage of a file (and thus of its blocks). It is quantified by the number of read/write operations on that particular file before deletion/over-writing. The higher the usage of a particular file, the more recovery sensitive (important to the user) the file is. The change to the usage factor for each file operation is shown in Figure 1(b). The UF of all blocks in the disk is initialized to 0. For the different cases of a block, we have:
• Used: for each read/write operation on any block of the file, the UF of each of the file's blocks increases by 1.
• Used to unused transition: no change.
• Unused: no change.
• Unused to used transition: set UF to 1 when a new file is created.
Other works [25], [27] also use similar notions of file/block prioritization based on file usage, because frequently used files are considered more important to the user. This means that the PF (the priority for overwriting) should decrease with usage, and hence the scalar constant (σ) carries a negative sign.
3) Spatial Factor (SF): The spatial factor captures the possibility of recovery of the blocks located in the neighborhood of a particular block. Its importance has been shown in other works as well [25], [28] because of its direct effect on file access time. The spatial factor of a block is higher if the overall priority factors of the neighboring blocks are high. Thus, SF is kept as the average PF of the blocks that are physically adjacent. 'Spatial adjacency' depends on the physical characteristics of the medium; for an HDD we may consider the blocks in the same sector as the neighboring ones. As the blocks allocated to a new file are the high priority blocks, a block whose neighbors have high PF should also be replaced so that spatial locality increases; hence for high SF the PF should be high, which shows that the scalar constant (ρ) is positive. The SFs of all blocks are updated after each I/O operation. For the different cases of a block, we have:
• Used: reset to 0.
• Unused: average PF of the nearby blocks.
This factor is initialized to 0. It is very useful for HDDs, and our algorithm can take into account on-the-fly de-fragmentation in such storage devices. However, the introduction of SF should not interfere with the wear leveling algorithms employed by flash drives, and thus this factor is dropped for random access drives. For random access drives and SSDs, optimizing spatial locality is not required.
4) Linking Factor (LF): The linking factor depends on the format of a particular file and captures the extent of recovery possible for specific file formats. Some files, like .jpg, .mp3 and .avi, can be partially recovered even if some blocks are deleted/over-written. Such files should be distinguished from ELF (Executable and Linkable Format) types like .exe, which cannot be recovered even if only one block has been over-written [29]. Unused blocks whose last file is still present and belongs to the non-ELF class of formats have LF = 0; blocks that belonged to files with ELF formats have LF = 1. This factor is initialized to 1, as in the beginning blocks are not linked to any file. The scalar constant (µ) is positive in this case as well. The Priority Factor (PF) depicts the priority of a block for being overwritten by new data: higher-PF blocks are overwritten first, while blocks with low PF are more sensitive to recovery.
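To make the above bookkeeping concrete, the following is a minimal Python sketch (illustrative only; the class and helper names are ours, not the APEX implementation) of per-block state and the priority factor:

```python
from dataclasses import dataclass

@dataclass
class Block:
    """Per-block heuristic state (illustrative sketch)."""
    hf: int = 0      # History Factor: delete/overwrite events on the block's last file
    uf: int = 0      # Usage Factor: read/write operations on the block's last file
    sf: float = 0.0  # Spatial Factor: average PF of physically adjacent blocks
    lf: int = 1      # Linking Factor: 1 for ELF-style formats, 0 for partially recoverable ones
    used: bool = False

def priority_factor(b: Block, lam: float, sigma: float, rho: float, mu: float) -> float:
    # PF = lambda*HF - sigma*UF + rho*SF + mu*LF; higher PF means "overwrite me first"
    return lam * b.hf - sigma * b.uf + rho * b.sf + mu * b.lf

def on_file_deleted(block: Block) -> None:
    # Used -> Unused transition: HF resets to 0 (UF and LF keep their last values)
    block.used, block.hf = False, 0

def on_sibling_overwritten(block: Block) -> None:
    # An unused block loses another block of its former file: recovering the
    # remaining blocks is now less valuable, so HF (and hence PF) increases
    if not block.used:
        block.hf += 1
```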
B. Functional Model Working
We use PF as a heuristic to rank blocks in decreasing priority of being allocated to new files. Block allocation, whenever a new file is created, is thus based on the priority factor: the blocks with the highest priority factors are allocated to the new file, up to the number of blocks it requires. Whenever a file is deleted, its used blocks are moved to the unused set, and this set is again used for allocation to new files, allocating the highest-PF blocks first.

We now show how the ranking based on the priority factor can be improved further by tuning the hyper-parameters λ, σ, ρ, µ dynamically. This tuning is based on the file access characteristics of the user and allows more efficient allocation.
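Continuing the sketch above (same illustrative assumptions), allocation for a new file can be expressed as popping the highest-PF blocks from a heap of unused blocks:

```python
import heapq

def allocate(unused_blocks, n_needed, weights):
    """Pick the n_needed unused blocks with the highest PF (heapq is a min-heap,
    so priorities are negated). 'weights' is the tuple (lambda, sigma, rho, mu)."""
    heap = [(-priority_factor(b, *weights), i, b) for i, b in enumerate(unused_blocks)]
    heapq.heapify(heap)
    allocated = []
    while heap and len(allocated) < n_needed:
        _, _, blk = heapq.heappop(heap)
        blk.used = True
        blk.hf, blk.uf = 1, 1   # unused -> used transition: HF = 1, UF = 1 for the new file
        allocated.append(blk)
    return allocated
```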
IV. HYPER-PARAMETER OPTIMIZATION: ADAPTIVE ALLOCATION MODEL
In the previous section, we described how the priority factor can be used to rank blocks to improve post-deletion data recovery. This heuristic measure is still an approximation, and there is scope for further tuning the hyper-parameters λ, σ, ρ, µ discussed before. This tuning can be made dynamic, based on the user's file access characteristics, which makes the heuristic more precise and leads to better allocation. We next describe a simulated environment, close to a real-life scenario, used to determine the optimum values of the hyper-parameters.

Figure 2. Model description using UML (Unified Modeling Language): the learning module interacts with the File Allocation Table, the OS and the storage media, updating block parameters on each I/O operation and allocating new blocks by priority.
A. Learning Model
We use Q-Learning, a reinforcement learning method, to optimize the set of hyper-parameters. The state of the Q-Learning model is defined by the tuple of hyper-parameters, and an action is an increment/decrement of one of the hyper-parameters. Converging to the optimal set of hyper-parameters is the goal of the model. The model is divided into two parts: the first is the learning model, which optimizes these parameters; the other is the allocation system, which allocates blocks to new files based on the current hyper-parameter set and the corresponding PFs from the learning model.

Considering each I/O operation as one iteration, the learning model updates the coefficients at each iteration and gradually converges to the best set of values. The coefficients are updated frequently because of the dynamism of disk/file accesses. The model is trained on the basis of a performance measure (P), which is a combination of the average recovery ratio of files and the spatial locality (only the recovery ratio for random access drives). Each action changes this measure, and through the learning algorithm the model converges to an optimum hyper-parameter set. In building the prototype file system APEX, we trained the Q-learning model on representative file access rates and a distribution of common file types.

Figure 2 shows a diagrammatic representation of the model and how the learning model interacts with the File Allocation Table (FAT), the OS and the disk. As the state space needs to be finite and the model should converge to a definite solution, each hyper-parameter is restricted to the integer values 1, 2, ..., 10. This range is based on empirical results and convergence constraints, and gives a crude idea of the values required for the best recovery efficiency while keeping the least count of the value scale at 1.
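As a rough illustration (not the authors' code), the state, action set and tabular update of such a Q-learning loop could look as follows; the learning rate and discount factor are arbitrary placeholders:

```python
from collections import defaultdict

# State: the tuple (lambda, sigma, rho, mu); action: +/-1 on one coefficient, clamped to 1..10
ACTIONS = [(i, d) for i in range(4) for d in (-1, +1)]

def apply_action(state, action):
    idx, delta = action
    coeffs = list(state)
    coeffs[idx] = min(10, max(1, coeffs[idx] + delta))
    return tuple(coeffs)

Q = defaultdict(float)

def q_update(state, action, reward, next_state, lr=0.1, gamma=0.9):
    """Tabular Q-learning update; 'reward' is the observed change in performance P."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += lr * (reward + gamma * best_next - Q[(state, action)])
```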
B. Objective Function

The objective function, referred to as the performance P of the learning model, is implemented as a linear combination of the weighted average Recovery Ratio (RR) and a measure of file access time. RR is a standard metric for comparing recovery across file systems [14], [22] and is here weighted by the usage frequency of the files; a measure quantifying the average access time of files on the disk [25], [28] is also a common metric. We define, for each file:
• Recovery Ratio (RR):
  – For executable, object and archived file types: 1 for complete recovery, 0 for partial/no recovery.
  – For files such as text, word processor, multimedia and .pdf files: (data recovered in bytes) / (original file size in bytes) if the meta-data is recovered, else 0 (because the file cannot even be partially read without the meta-information).
• Approximate Access Time (AAT): an approximate measure of the access time of a file, taken as the time of the last read/write operation; when a file is created it is the file creation time.
Now, the performance P is defined as a convex combination of the above terms:

P = α · (Σ over deleted files of RR · UF) / (Σ over deleted files of UF) − β · (Σ over current files of AAT) / (number of current files)
Here, 0 ≤ α ≤ 1, 0 ≤ β ≤ 1 and α + β = 1. The two terms weighted by α and β are not independent. The access time term depends not only on the spatial distribution of files but also on the average size of the current files, which in turn affects the recovery of other files; it also affects recovery because in many spatial distributions the meta-data is erased, which makes a partially recoverable file (i.e. one for which some data blocks still exist) have RR = 0. The recovery ratio term depends on how many blocks of deleted files are present and also on which of them are present (meta-data or data). Different values of α and β suit different scenarios. If we want to optimize recovery only, then α = 1 and β = 0; for example, where edge devices store critical data like video surveillance footage or health monitoring data, which do not require high I/O bandwidth but critically require recoverability. If we want to optimize I/O only, then α = 0 and β = 1; this case arises in edge configurations with high bandwidth data streams where all data is stored on a separate database node or in the cloud and fast I/O is required.
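For reference, the performance measure could be evaluated in a simulator as in the following sketch, where each file object carries its RR, UF and AAT values (names are illustrative, not the paper's code):

```python
def performance(deleted_files, current_files, alpha, beta):
    """P = alpha * (sum RR*UF / sum UF over deleted files)
           - beta * (average AAT over current files), with alpha + beta = 1."""
    uf_sum = sum(f.uf for f in deleted_files)
    recovery = (sum(f.rr * f.uf for f in deleted_files) / uf_sum) if uf_sum else 0.0
    access = (sum(f.aat for f in current_files) / len(current_files)) if current_files else 0.0
    return alpha * recovery - beta * access
```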
C. Disk Simulator

The disk block size in the simulation model is kept the same as the real value, 4 KB. The total disk space is kept at 256 MB, and a file can vary from 16 KB (4 blocks) to 1024 KB/1 MB (256 blocks). Based on the rules for parameter updates, the parameter maps are created for each state, i.e., each hyper-parameter set. The simulator also allocates blocks (based on ranking by PF) whenever there is a call from the OS simulator (explained next). It additionally keeps track of the MRPF (Most Recent Parent File) for each block, which is a pair of the filename that the block belongs to (or belonged to before logical deletion) and the set of blocks that were allocated to that file.
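The MRPF bookkeeping described above can be represented with a small lookup table; the following sketch uses hypothetical names:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

@dataclass(frozen=True)
class MRPF:
    """Most Recent Parent File: last owning file and the block set it occupied."""
    filename: str
    block_ids: FrozenSet[int]

def record_new_file(table: Dict[int, MRPF], filename: str, block_ids) -> None:
    entry = MRPF(filename, frozenset(block_ids))
    for b in block_ids:      # every block of the newly created file shares one MRPF entry
        table[b] = entry
```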
D. OS / File IO Simulator
This simulator performs the file level / OS level disk management. It introduces random file operations belonging to one of three categories: (1) read/write, (2) create file, (3) delete file. The "create file" operation updates the MRPF of each block that belongs to the new file after the operation.
E. Learning Hyper Parameters
Based on the definition of the hyper-parameters and the performance function, the optimization problem is formulated as:

maximize over λ, σ, ρ, µ:  P
subject to:  λ, σ, ρ, µ ∈ [1, 10],  λ, σ, ρ, µ ∈ ℕ,
and blocks allocated to new files in decreasing order of PF = λ·HF − σ·UF + ρ·SF + µ·LF.
Figure 3. Performance with Model Iteration Number (MIN)
The model has been simulated using two iteration counts:
1) Model Iteration Number (MIN): increments by one for each state change (action) in the learning model.
2) Operation Iteration Number (OIN): increments by one for each file I/O operation performed by the OS simulator.
The number of file I/O operations should exceed the number of model actions, since the learning model only obtains a useful view of the state space after many operations have been applied; to learn the optimal values of the hyper-parameters, we increment MIN once after every 1000 OIN increments. This helps to reach a stable configuration after many I/O operations and to evaluate the disk performance under the current set of hyper-parameter values. The conditions kept for the learning stage are:
1) Disk size: 16 × 16 = 256 blocks.
2) Maximum file size: 20 blocks.
3) Percentage of linked files: 20 (stochastically varying around this value).
4) Minimum disk utilization: 70%.
As in general Q-Learning models, this learning model has an exploration factor ε, which drops exponentially from 1. It decides with what probability a random action takes place: a random action is chosen with probability ε, and the optimal action with probability 1 − ε. As ε decreases, the model increasingly chooses the action with the highest ΔP, the change in the performance function of the learning model. Gradually, as the model explores, ε reduces and the model converges to the optimal state; for this experiment, the learning model is considered converged once ε has decayed to a negligible value. The performance measure, as described earlier, starts from 44.00 (based on Ext4 prioritization factors) and tends to nearly 190 over time; based on the definition of the performance measure, this is an approximately 280% improvement, as shown in Figure 3. The hyper-parameter values converge to (4, 7, 1, 9).

This set of hyper-parameter values is then used in a dynamic setting for block prioritization in the APEX framework. The initial Priority Factor, updated on-the-fly by the learning model and used for block allocation for new files in APEX, is:

PF = 4 · HF − 7 · UF + SF + 9 · LF
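The ε-greedy action selection with exponential decay described above can be sketched as follows (the decay constant and initial value are illustrative, not the values used in the paper):

```python
import math
import random

def choose_action(Q, state, actions, model_iteration, eps0=1.0, decay=1e-3):
    """Explore with probability eps (decaying exponentially with the Model
    Iteration Number), otherwise exploit the action with the highest Q-value."""
    eps = eps0 * math.exp(-decay * model_iteration)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```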
V. IMPLEMENTATION OF APEX USING FUSE

To implement APEX, we used the FUSE framework [17]. FUSE (Filesystem in Userspace) is an interface for user programs to export a file system to the Linux kernel. FUSE mainly consists of two components: the FUSE kernel module and the libfuse library. Libfuse provides the reference implementation for communicating with the FUSE kernel module. It is the most widely used such library, with active support, and allows users to develop, simulate and test their own file systems without writing kernel-level code.

To implement APEX, we need the following structures: (1) Disk, (2) Directory, (3) File, (4) Block. From the disk simulation system, we only required the
Disk, File and Block structures, modified to the FUSE format, together with a new Directory implementation for APEX. Instead of Disk being a 2D array of Blocks, it is now a large ByteBuffer indexed by Blocks. Disk contains the optimal coefficient values of the parameters (found from the RL model), the current and deleted file lists, a Heap of unused blocks and a HashSet of used blocks. File and Directory are derived from an abstract class Path, which has the specifications required for a file system structure. A Directory can recursively have several child Paths (each of which can be a file or a directory). A File has several child Blocks, and a file type (FUSE type and linking factor).

At the block level, the main operations are:
1) Allocate: sets the block's state to used, initializes the block's parameters, updates the parent file pointer and obtains the linking factor (found from the parent file's FUSE FILE INFO).
2) De-allocate: sets the state to unused, and updates the block's parameters and parent pointer.
3) Write: writes the required content to the block, including writing from a buffer at an offset.
4) Read: returns the block's data.
5) Change: updates the individual factors of the block.
6) Update: updates the priority score of the block and changes its position in the unused heap (if applicable).
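A condensed Python sketch of these block-level operations is given below; it mirrors the list above but is purely illustrative (the APEX prototype itself is built on libfuse, and the names here are not those of the actual implementation):

```python
class APEXBlock:
    BLOCK_SIZE = 4096

    def __init__(self, index):
        self.index = index
        self.data = bytearray(self.BLOCK_SIZE)
        self.used, self.parent = False, None
        self.hf, self.uf, self.sf, self.lf = 0, 0, 0.0, 1

    def allocate(self, parent_file):
        """Mark used, initialize parameters, remember the parent and its linking factor."""
        self.used, self.parent = True, parent_file
        self.hf, self.uf = 1, 1
        # LF = 0 for formats that tolerate partial loss (.jpg, .mp3, ...), 1 for ELF-style files
        self.lf = 0 if parent_file.partially_recoverable else 1

    def deallocate(self):
        """Return the block to the unused pool; HF restarts from 0."""
        self.used, self.hf = False, 0

    def write(self, buf, offset=0):
        self.data[offset:offset + len(buf)] = buf

    def read(self):
        return bytes(self.data)
```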
VI. OVERWRITING VIDEO SURVEILLANCE USING CRYPY MALWARE: A CASE STUDY
In order to compare APEX with other works, we use a Fog environment built on FogBus [4] for video surveillance on Raspberry Pi based Fog nodes. As such nodes are usually connected to the cloud via the Internet, many viruses, trojans, etc. are likely to creep into the network. We compare APEX with other recovery specific systems that can be integrated on Edge nodes: AFS [15], FFS [22], ShieldFS [14] and ExtSFR [26].
A. Setup
The machine used for the experiments, a Raspberry Pi 3 Model B, has the following specifications:
1) SoC: Broadcom BCM2837.
2) CPU: 1.2 GHz quad-core ARM Cortex-A53.
3) RAM: 1 GB LPDDR2-900 SDRAM.
4) Storage: 100 GB WD Black HDD or Samsung T5 SSD.

Figure 4. Recovery ratio of Ext4, AFS, FFS, ShieldFS, ExtSFR and APEX for different secondary data sizes: (a) 414 MB, (b) 794 MB.

B. Experiment 1: Recovery ratio performance
In order to compare data recovery performance, similar tests were executed on the APEX file system, the base (Ext4) file system and the other works, and the recovery ratios were compared. For a simple analysis, we keep the file system size for user data limited to 1 GB. First, the primary data was deleted using the CryPy malware [18] (after custom modifications) and secondary data was written to the drive; after this, the primary data was recovered. We perform tests for both HDD and SSD, which give very similar results (negligible difference in recovery ratio), and hence report the average.
1) Test files:
The primary data was kept the same throughout the experiment for consistency and consists of a set of 5 sample videos, each of size 95 MB. The secondary data size was increased in each iteration by predefined constants and was written only after soft deletion of the primary data. Data recovered was measured in terms of the size of the recovered primary data, and also in terms of the number of recovered files that were viewable after recovery.
2) Recovery ratio:
Recovery ratio = (data recovered) / (primary data size).
3) Observations:
Across the different file systems, as shown in Figure 4(a) and Figure 4(b) for secondary data sizes of 414 MB and 794 MB respectively, the APEX file system has the highest recovery ratio. As there is no recovery oriented policy in Ext4, it has the lowest recovery (even 0 MB in the second case). As the surveillance footage was captured such that each video accounts for 1 hour of data, the AFS file system, with its 1 hour delayed deletion, loses 4 out of the 5 files permanently; this remains unchanged in both experiments. The others, including FFS, ShieldFS and ExtSFR, use only file identifiers in blocks and do not consider which files were most recently and frequently used (the video files in our case), so they overwrite their data and hence achieve lower recovery. APEX, on the other hand, allocates new files (i.e. the secondary data) to separate locations, preventing overwriting of the original data, and is thus able to recover the most. APEX improves the recovery ratio by 678% and 32% compared to the base Ext4 and ExtSFR respectively for 414 MB of secondary data, and by 31% compared to ExtSFR for 794 MB of secondary data.
C. Experiment 2: Read-write performance
The APEX file system adds block level prioritized file allocation on top of Ext4, which introduces additional overhead; the FUSE implementation of APEX was therefore compared to the base file system, which in our case is Ext4. To test the read, write and delete speeds of both systems, we used the FileBench file system benchmarking tool [30]. These benchmarks provide read, write and delete speeds for n files, each of size m (where n and m are inputs to the benchmark code). For the current study, random files were sequentially created, read, written and deleted, with n varying from 10 to 1000 and m varying from 1 MB to 1000 MB. Figures 5(a)-5(c) show the results for writing, reading and deleting for the different frameworks using HDD media (which has the higher overhead, so we report these results). The graphs show that the delete speed of all frameworks except AFS is close to that of Ext4. Read/write speeds are lower than Ext4 for all recovery aware systems, but the overhead is lowest for APEX. The overhead in FFS arises primarily from the identifier mapping in blocks being updated periodically; in ShieldFS, the detection algorithm continuously monitors the disk footprint of the FileBench program and consumes bandwidth. Overall, APEX has the least overhead among all implementations.

Figure 5. Write, read and delete speed comparison of Ext4, AFS, FFS, ShieldFS, ExtSFR and APEX: (a) write speed in KB/s, (b) read speed in MB/s, (c) delete speed in MB/s.

D. Experiment 3: CPU and RAM overhead
Due to the various complex background tasks of process I/O detection in ShieldFS, file block identification in FFS and data backup in AFS, the CPU and RAM usage of those frameworks is much higher than that of APEX. ShieldFS uses many sophisticated cryptographic primitives to identify whether a program could be 'potentially malicious', which requires constant computation and memory overhead. ExtSFR has a post-deletion inode tracing and journal checking protocol which needs to scan the whole file system even for very small changes; APEX maintains the state of the files, so only the files of interest need to be checked. FFS maintains a 'forensic file system identifier' for each file and identifies the information needed for recovery using such identifiers; these identifiers are stored on disk and must be fetched when checking for recoverable files, which makes it slow. APEX uses the disk cache information to update the block parameters of the most used blocks more frequently than others, which gives APEX a very small working set at any instant of time (at most 7% in the current tests). It integrates file allocation and energy management to keep CPU and memory consumption lowest, as shown in Figure 6.

Figure 6. CPU and RAM comparison of Ext4, AFS, FFS, ShieldFS, ExtSFR and APEX: (a) CPU usage (%), (b) RAM usage (%).

VII. CONCLUSIONS AND FUTURE WORK
APEX can successfully simulate and provide optimal values of the coefficients in the Priority Factor (PF), which can rank the blocks to optimize recovery performance. The model converges to the PF given in Section IV. Using this factor (dynamically optimized by Q-Learning) to rank all blocks in the disk provides a file allocation system with a higher data recovery ratio: at any point, if we want to recover a file, it will be recovered to the maximum extent possible in this design compared to existing implementations. The current model gives a recovery performance improvement of 280%, where performance is determined as described in Section IV. Experiments in Section VI-B show that APEX improves the recovery ratio by 678% and 32% compared to the base Ext4 file system and ExtSFR (the best among the prior work) respectively. Hence, APEX can improve data recovery with minimal read-write and compute overheads and is therefore well suited for resource-constrained edge devices vulnerable to attacks by data stealing malware.

The proposed work only aims to enhance the recoverability of deleted files. It can be extended to secure file systems against malicious transfer and corruption of files, and to cover other types of file systems, such as distributed file systems. Further, the model can optimize wear levelling in tandem with recovery performance, especially for Flash drives and SSDs. It can also be extended to emerging Non-Volatile Memories like RRAMs (Resistive Random Access Memory) [31] for upcoming energy efficient edge devices. The code developed and used for data recovery and the Q-Learning model, along with simulation results, can be found at: https://github.com/HS-Optimization-with-AI.