AST-Based Deep Learning for Detecting Malicious PowerShell
POSTER: AST-Based Deep Learning for Detecting Malicious PowerShell
Gili Rusak, Abdullah Al-Dujaili, Una-May O’Reilly
CSAIL, MIT, [email protected],[email protected],[email protected]
ABSTRACT
With the celebrated success of deep learning, some attempts to develop effective methods for detecting malicious PowerShell programs employ neural nets in a traditional natural language processing setup, while others employ convolutional neural nets to detect obfuscated malicious commands at a character level. While these representations may express salient PowerShell properties, our hypothesis is that tools from static program analysis will be more effective. We propose a hybrid approach combining traditional program analysis (in the form of abstract syntax trees) and deep learning. This poster presents preliminary results of a fundamental step in our approach: learning embeddings for nodes of PowerShell ASTs. We classify malicious scripts by family type and explore embedded program vector representations.
CCS CONCEPTS
• Security and privacy → Malware and its mitigation; • Computing methodologies → Neural networks;
KEYWORDS
powershell scripts; malware; deep learning; abstract syntax trees
ACM Reference Format:
Gili Rusak, Abdullah Al-Dujaili, Una-May O’Reilly. 2018. POSTER: AST-Based Deep Learning for Detecting Malicious PowerShell. In Proceedings of 2018 ACM SIGSAC Conference on Computer & Communications Security (CCS’18). ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3243734.3278496
PowerShell is a popular scripting language and command-line shell. Originally compatible only with Windows, PowerShell has gained a multitude of users over the last several years, especially with its cross-platform and open-source version, PowerShell Core. PowerShell is built on the .NET framework and allows third-party users to write cmdlets and scripts that they can disseminate to others through PowerShell [4]. Along with increasing usage, PowerShell has unfortunately also been subject to malicious attacks through different types of computer viruses [10]. PowerShell scripts can easily be encoded and obfuscated, making it increasingly difficult to detect malicious activity [6]. According to the FireEye Dynamic Threat Intelligence (DTI) cloud, malicious PowerShell attacks have been rising throughout the past year [5].

Detecting these malicious behaviors in PowerShell can be challenging for a number of reasons. Attackers can perform malicious activity without deploying binaries on the attacked machines [10]. Additionally, PowerShell is installed by default on Windows machines. Further, attackers have shifted towards sophisticated obfuscation techniques that make detecting malicious scripts difficult [9]. Notably, attackers use the -EncodedCommand flag to pass Base-64 encoded commands, bypassing the PowerShell execution restrictions.

Recently, emerging research has deployed machine-learning-based models to detect malware in general [1, 7] and malicious PowerShell in particular [5, 6], where deep learning is employed to analyze malicious PowerShell scripts, inspired by natural language understanding and computer vision approaches. Though these approaches may support learning the features necessary to distinguish malicious scripts, with the wide range of obfuscation options used in PowerShell scripts, we speculate that they might overlook some of the rich structural data in the code. We therefore propose to break away from text-based deep learning and to use structure-based deep learning. Our proposition is motivated by the successful use of Abstract Syntax Trees (ASTs) in manually crafting features to detect obfuscated PowerShell scripts [2]. While this use case does consider structural information, manually crafted features can be vulnerable to high-level obfuscation (e.g., AST-based techniques [3]). Therefore, in this paper, we propose to learn representations of PowerShell scripts in an end-to-end deep learning framework based on their parsed ASTs. Specifically, we build on the work of Peng et al. [8] to learn representations (embeddings) for AST nodes. These representations can then be incorporated in any of the tasks associated with PowerShell analysis, including malware detection as shown in Fig. 1.

Figure 1: AST-based deep learning for malicious PowerShell detection. Pipeline stages: PowerShell scripts; parsed abstract syntax trees (ASTs); unsupervised representation learning (neural net); task-specific supervised learning (neural net); output labels (malicious or benign).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CCS ’18, October 15–19, 2018, Toronto, ON, Canada. © 2018 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-5693-0/18/10. https://doi.org/10.1145/3243734.3278496
Deep Learning for PowerShell.
Hendler et al. [6] proposed to use several deep learning models to distinguish benign and malicious PowerShell commands. With a dataset of 6,290 malicious and 60,[…] […] level. According to their results on different architectures (including a 9-layer CNN, a 4-layer CNN, and a long short-term memory net), all of the detectors obtained high AUC levels between 0.985 and 0.[…]. […] commands rather than scripts, which are a more difficult challenge. Moreover, the features are derived from the commands’ textual form, which may not capture the command’s functional semantics and are prone to character frequency tampering.
AST for PowerShell.
Bohannon and Holmes [2] studied obfuscated PowerShell scripts. They presented a baseline character frequency analysis and used cosine similarity to detect obfuscation in PowerShell scripts. They reported promising preliminary results and noted a significant difference between obfuscated and non-obfuscated code. Like [6], the authors ran into the issue of false negatives and suggested taking advantage of PowerShell Abstract Syntax Trees (ASTs), since PowerShell’s API allows for simple AST extraction. Based on the parsed ASTs, the authors crafted 4098 distributional features (e.g., distribution of AST types). The engineered feature vectors led to robust obfuscation classifiers on the test set. Similar to the character frequency tampering challenge in text-based representations, the AST-based distributional features can be vulnerable to AST-based obfuscation [3].
Deep Learning with AST.
Peng et al. [8] developed a technique to build program vector representations, or embeddings, of different abstract syntax node types based on a corpus of ASTs for deep learning approaches. They used nearest-neighbors similarity and k-means clustering to determine the accuracy of their resulting embeddings. Their qualitative and quantitative results suggested that deep learning is a promising direction for program analysis. In this project, we build on the findings of [8] and further study this claim.
To learn a robust representation of PowerShell scripts, we take a hybrid approach combining traditional program analysis and deep learning approaches. We convert the PowerShell scripts to their AST counterparts, and then build embedding vector representations of each AST node type based on a corpus of PowerShell programs.
PowerShell scripts to Abstract Syntax Trees.
The considered dataset was composed of Base-64 encoded PowerShell scripts. Thus, as a preprocessing step, each PowerShell script/command was decoded. Given a decoded PowerShell script, we determined its abstract syntax tree representation by recursively traversing the script’s properties using [object.PSObject.Properties] and storing items of type [System.Management.Automation.Language.Ast]. We stored the parent-child relationships among the AST nodes in depth-first-search order as a text file. There were 62 different AST node types. With multi-core machines, AST generation can be carried out in parallel.
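The decoding step can be sketched in Python as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes the payloads follow the -EncodedCommand convention of Base-64 over UTF-16LE text, and the function names are hypothetical.

```python
import base64


def decode_encoded_command(payload: str) -> str:
    """Decode a Base-64 payload into PowerShell source text.

    Assumption: the payload is Base-64 over UTF-16LE text, which is the
    format the -EncodedCommand flag expects.
    """
    raw = base64.b64decode(payload, validate=True)
    return raw.decode("utf-16-le")


def encode_command(script: str) -> str:
    """Inverse helper, used here only to build a sample payload."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")


sample = encode_command("Write-Output 'hello'")
print(decode_encoded_command(sample))  # prints: Write-Output 'hello'
```

Payloads that use a different text encoding, or that nest several layers of encoding, would need additional handling that this sketch does not attempt.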
Preliminary Analysis of Abstract Syntax Trees.
After collecting the tree structures of our PowerShell scripts corpus, we conducted an exploratory analysis on the ASTs and their statistics. Furthermore, we used a random forest classifier to label a PowerShell script by its malware family type. As will be shown in Section 4, a few simple AST-based features were indicative of the malware family.
Abstract Syntax Trees to Vector Representations.
Having outlined our approach to the problem of malicious PowerShell programs, we herein take a fundamental step towards learning robust AST-based representations. We employed [8, Algorithm 1] on the PowerShell dataset to learn real-valued vector representations of the 62 AST node types. To this end, we parsed each constructed AST into a list of data structures that we refer to as subtrees. A subtree of an AST represents a non-leaf node and its immediate child nodes, each labeled by its type. Next, we shuffled the subtrees to avoid reaching a local minimum specific to a given script. For each subtree, with parent node $p$ and $n$ child nodes $\{c_i\}_{1 \le i \le n}$, define $l_i = \#(c_i)/\#(p)$, where, as in [8], $\#(\cdot)$ counts the leaves under a node. Similar to [8], we define a loss function to measure how well the learnt vectors describe the subtrees. Let $T$ be the number of distinct AST types whose embeddings we are trying to learn. Let $V \in \mathbb{R}^{N_f \times T}$ be the embedding matrix of the AST node types and define $\mathrm{vec}(p) \in \mathbb{R}^{N_f \times 1}$ as the column of $V$ (the embedding vector) that corresponds to the type of node $p$. The same holds for $\{\mathrm{vec}(c_i)\}_{1 \le i \le n}$. Additionally, let $W_l, W_r \in \mathbb{R}^{N_f \times N_f}$ be weight matrices and $b \in \mathbb{R}^{N_f \times 1}$ be a bias vector. Further, define the weight matrix $W_i$ of child node $i$ as

$$W_i = \frac{n-i}{n-1} W_l + \frac{i-1}{n-1} W_r. \tag{1}$$

Let the distance metric $d$ be defined by

$$d = \left\lVert \mathrm{vec}(p) - \tanh\left( \sum_{i=1}^{n} l_i W_i \cdot \mathrm{vec}(c_i) + b \right) \right\rVert. \tag{2}$$

Let $d_c$ be the distance function applied to a negative example of a given subtree, in which $k \le n$ of the child nodes $\{c_i\}_{1 \le i \le n}$ are changed to different AST types. Given the parameters $V, W_l, W_r, b$, we minimized $\max(0, \Delta + d - d_c)$, a margin loss that pushes the distance for a normal subtree below that of a corrupted adversarial subtree. We used the Adam optimizer to find optimal embedding vectors and tuned the hyperparameters $\Delta$ and $k$. By default, $\Delta =$ […], $k =$ […].
Setup.
We utilize a corpus of hand-annotated and thoroughly analyzed malicious PowerShell scripts [9]. This dataset consists of 4,079 known malicious PowerShell scripts annotated and classified based on their family types. These include ShellCode Inject, Powerfun Reverse, and others. The code repository will be made available at https://github.com/ALFA-group.
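To make the subtree construction and the margin loss of Eqs. (1) and (2) concrete, the following pure-Python sketch extracts subtrees from a toy tree and evaluates the loss once with random parameters. It is an illustration under stated assumptions rather than the authors' implementation: the toy tree and its type names are made up, the squared Euclidean norm and the even split of W_l and W_r for a single child are assumed by analogy with [8], the values delta=1.0 and k=1 are placeholders, and no optimizer step (Adam in the poster) is shown.

```python
import math
import random

# Toy AST as nested (node_type, children) tuples. The type names are only
# illustrative and are not taken from the poster's dataset.
TREE = ("ScriptBlockAst", [
    ("PipelineAst", [
        ("CommandAst", [
            ("StringConstantExpressionAst", []),
            ("CommandParameterAst", []),
        ]),
    ]),
])


def count_leaves(node):
    """Number of leaves under a node (a leaf counts as one)."""
    _, children = node
    return 1 if not children else sum(count_leaves(c) for c in children)


def subtrees(node):
    """Yield (parent_type, [(child_type, l_i), ...]) for each non-leaf node."""
    node_type, children = node
    if not children:
        return
    total = count_leaves(node)
    yield node_type, [(c[0], count_leaves(c) / total) for c in children]
    for child in children:
        yield from subtrees(child)


def node_types(node):
    """Set of all node types that occur in the tree."""
    node_type, children = node
    found = {node_type}
    for child in children:
        found |= node_types(child)
    return found


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def distance(parent, children, vec, w_l, w_r, b):
    """Eq. (2), with the squared Euclidean norm assumed."""
    n = len(children)
    pre = list(b)
    for i, (child_type, l_i) in enumerate(children, start=1):
        # Eq. (1); for a single child the weights are split evenly (assumption).
        a_l = (n - i) / (n - 1) if n > 1 else 0.5
        a_r = (i - 1) / (n - 1) if n > 1 else 0.5
        left = matvec(w_l, vec[child_type])
        right = matvec(w_r, vec[child_type])
        for j in range(len(pre)):
            pre[j] += l_i * (a_l * left[j] + a_r * right[j])
    return sum((p - math.tanh(x)) ** 2 for p, x in zip(vec[parent], pre))


def margin_loss(parent, children, params, types, rng, delta=1.0, k=1):
    """max(0, delta + d - d_c), with k children swapped to other types."""
    vec, w_l, w_r, b = params
    corrupted = list(children)
    for idx in rng.sample(range(len(children)), min(k, len(children))):
        others = [t for t in types if t != children[idx][0]]
        corrupted[idx] = (rng.choice(others), children[idx][1])
    d = distance(parent, children, vec, w_l, w_r, b)
    d_c = distance(parent, corrupted, vec, w_l, w_r, b)
    return max(0.0, delta + d - d_c)


rng = random.Random(0)
N_F = 4  # embedding size, chosen arbitrarily for the sketch
types = sorted(node_types(TREE))
vec = {t: [rng.uniform(-0.1, 0.1) for _ in range(N_F)] for t in types}
w_l = [[rng.uniform(-0.1, 0.1) for _ in range(N_F)] for _ in range(N_F)]
w_r = [[rng.uniform(-0.1, 0.1) for _ in range(N_F)] for _ in range(N_F)]
b = [0.0] * N_F
params = (vec, w_l, w_r, b)

samples = list(subtrees(TREE))
rng.shuffle(samples)  # subtrees are shuffled before training, as in the text
losses = [margin_loss(p, cs, params, types, rng) for p, cs in samples]
print(losses)
```

In a training loop, the gradient of this loss with respect to the embeddings, W_l, W_r, and b would be passed to the optimizer; an automatic differentiation library would normally replace the hand-written list arithmetic used here.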
Experiment 1: Malware Family Classification.
As a preliminary experiment, we attempted to classify malicious PowerShell scripts by family type. We used properties of the abstract syntax tree representation to conduct this classification. Specifically, we used only two features: depth and number of nodes per PowerShell AST. We used the family types as the labels of our classifier. Since the dataset suffered from a class-imbalance problem, we weighted the classes when training the classifier (in this case a random forest classifier) based on how many examples each class contained. After hyperparameter tuning on maximum depth, we fit a classifier with a maximum depth of 11. Due to the sparsity of the dataset, we limited our experiment to family types with more than 40 examples per family, resulting in eight different families. We randomly split the data into a 70/30 train/test split. The confusion matrix on the held-out test data is shown in Fig. 2. To our surprise, we found that two naive AST-based features (AST node count and AST depth) were enough to achieve a 3-fold cross-validation accuracy of 85%. Notably, even these very simple features performed well because they are grounded in program analysis. This serves as a motivating example for the effectiveness of ASTs and exemplifies the power of harnessing ASTs to understand program representations.

Figure 2: Heatmap for the confusion matrix results on the held-out test set in the Malware Family Classification experiment.
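The two features used in this experiment are simple to compute from a parsed tree. The sketch below derives them for a toy tree and also shows one common way to weight classes by frequency. It rests on assumptions: the tree is made up, a single node is taken to have depth 1, and the inverse-frequency weighting formula is one plausible choice, since the poster does not state the exact scheme. The random forest itself is not shown.

```python
from collections import Counter

# Toy AST as nested (node_type, children) tuples; the names are illustrative.
TOY_TREE = ("ScriptBlockAst", [
    ("PipelineAst", [
        ("CommandAst", [
            ("StringConstantExpressionAst", []),
            ("CommandParameterAst", []),
        ]),
    ]),
])


def depth_and_count(node):
    """Return (depth, node_count); a single node has depth 1 (assumption)."""
    _, children = node
    if not children:
        return 1, 1
    stats = [depth_and_count(child) for child in children]
    depth = 1 + max(d for d, _ in stats)
    count = 1 + sum(n for _, n in stats)
    return depth, count


def inverse_frequency_weights(labels):
    """Weight each class by total / (number_of_classes * class_count)."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {label: total / (n_classes * c) for label, c in counts.items()}


print(depth_and_count(TOY_TREE))  # -> (4, 5)

# Made-up label list, only to show that rarer families get larger weights.
toy_labels = ["ShellCode Inject"] * 3 + ["Powerfun Reverse"]
print(inverse_frequency_weights(toy_labels))
```

The resulting (depth, node count) pairs would form the feature matrix, and the weights could be handed to any classifier that accepts per-class weights.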
Experiment 2: Learning AST Node Representations.
Extending these results, we build program vector representations of the dataset. As a case study, we analyzed a random sample of 10,000 malicious subtrees from the total of 107,000 subtrees in the malicious PowerShell corpus. This collection contained 37 distinct AST node types comprising 175 unique subtrees. We built the embedding matrix for these node types using the method described earlier. We trained our model for 200 epochs, until the loss stabilized near 0. The qualitative results are summarized in a dendrogram in Fig. 3, which shows which embeddings lie close to one another. Notably, the TryStatement and CatchClause node types are neighbors, as well as ForStatement and DoWhileStatement, and Command and CommandParameter. This is promising since one would expect such node types to serve similar functions in scripts. This preliminary experiment has limitations: for example, one would expect the ForEachStatement to land near the ForStatement as well. Additional training on the full malicious dataset is required to fully assess the validity of these methods. As next steps, we hope to make use of these embeddings to build robust classifiers to classify a malicious script based on family. Afterwards, we will use these embeddings to build robust classifiers to determine whether a given PowerShell script is malicious or not.
Figure 3: Dendrogram of node types and their relationships in the Learning Node Representations experiment.
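One simple way to inspect which node types end up close to each other is a nearest-neighbor lookup over the embedding vectors, sketched below. The vectors are made-up toy values, not the learned embeddings, and cosine similarity is an assumed choice; the poster does not state which metric underlies the dendrogram.

```python
import math

# Made-up 3-dimensional vectors for illustration only; these are NOT the
# embeddings learned in the experiment.
TOY_EMBEDDINGS = {
    "TryStatement": [1.0, 0.1, 0.0],
    "CatchClause": [0.9, 0.2, 0.0],
    "ForStatement": [0.0, 1.0, 0.1],
    "DoWhileStatement": [0.1, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(name, embeddings):
    """Return the other node type whose vector is most similar to `name`."""
    candidates = (other for other in embeddings if other != name)
    return max(candidates, key=lambda o: cosine(embeddings[name], embeddings[o]))


for node_type in TOY_EMBEDDINGS:
    print(node_type, "->", nearest(node_type, TOY_EMBEDDINGS))
```

A dendrogram goes one step further by repeatedly merging the closest groups, so a full reproduction of Fig. 3 would use a hierarchical clustering routine rather than this pairwise lookup.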
Malicious PowerShell scripts have targeted industries including Higher Education, High Tech, Professional and Legal Services, and Healthcare. This paper motivated the use of static program analysis (in the form of abstract syntax trees) to supplement deep learning techniques with rich structural information about the code, instead of relying on text-based representations. We seek to use deep learning in an end-to-end unsupervised framework to identify intrinsic common patterns in our programs, since even ASTs can be obfuscated. We saw that the depth and node count of an AST were enough to distinguish malware families, and we took a first fundamental step in learning representations of PowerShell programs.
ACKNOWLEDGEMENT
This work was supported by the MIT-IBM Watson AI Lab and CSAIL CyberSecurity Initiative. We thank Palo Alto Networks for the dataset.
REFERENCES
[1] Abdullah Al-Dujaili et al. 2018. Adversarial Deep Learning for Robust Detection of Binary Encoded Malware. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security. ACM, 187–197.
[7] Alex Huang et al. 2018. On Visual Hallmarks of Robustness to Adversarial Malware. arXiv preprint arXiv:1805.03553 (2018).
[8] Hao Peng et al. 2015. Building program vector representations for deep learning. In