SAED: Edge-Based Intelligence for Privacy-Preserving Enterprise Search on the Cloud
Sakib (SM) Zobaed∗, Mohsen Amini Salehi∗, and Rajkumar Buyya†
∗High Performance Cloud Computing (HPCC) Lab, School of Computing & Informatics, University of Louisiana at Lafayette, USA
†Cloud Computing & Distributed Systems (CLOUDS) Lab, School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
Email: {sm.zobaed1, amini}@louisiana.edu, [email protected]

Abstract—Cloud-based enterprise search services (e.g., AWS Kendra) have been attracting big data owners by offering convenient and real-time search solutions to them. However, individuals and organizations possessing confidential big data are hesitant to embrace such services due to valid data privacy concerns. In addition, to offer an intelligent search, these services access the user's search history, which further jeopardizes his/her privacy. To overcome the privacy problem, the main idea of this research is to separate the intelligence aspect of the search from its pattern-matching aspect. According to this idea, the search intelligence is provided by an on-premises edge tier, and the shared cloud tier serves only as an exhaustive pattern-matching search utility. We propose the Smartness At Edge (SAED) mechanism that offers intelligence in the form of semantic and personalized search at the edge tier while maintaining privacy of the search on the cloud tier. At the edge tier, SAED uses a knowledge-based lexical database to expand the query and cover its semantics. SAED personalizes the search via an RNN model that can learn the user's interest. A word embedding model is used to retrieve documents based on their semantic relevance to the search query. SAED is generic and can be plugged into existing enterprise search systems, enabling them to offer intelligent and privacy-preserving search without enforcing any change on them. Evaluation results on two enterprise search systems under real settings, verified by human users, demonstrate that SAED can improve the relevancy of the retrieved results for both plain-text and encrypted generic datasets.

Index Terms—Enterprise search; Semantic; Edge; Context-aware
I. INTRODUCTION
The expeditious growth of digitalization has been producing a massive volume of data, known as big data, in both structured and unstructured formats. It is estimated that the vast majority of the generated data is in the unstructured format, produced from various sources, such as organizational documents, emails, web pages, and social networks [1]. Cloud services have been effective in relieving big data owners from the burden of maintaining these data. Recently, cloud providers began offering enterprise search services that enable data owners to semantically search over their datasets in the cloud. For instance, AWS has launched an enterprise search service named AWS Kendra [2] that offers real-time semantic searchability using natural language-based machine learning techniques. Although the cloud services have been fascinating for big data owners [3], there have been numerous privacy violation incidents [4] during recent years that have made individuals and businesses with sensitive data (e.g., healthcare documents) hesitant to fully embrace cloud data management services. In one incident, confidential information of over three billion Yahoo users was exposed [5]. In another incident, information of millions of Verizon customer accounts was exposed from the company's cloud system [5].

Ideally, data owners desire a privacy-preserving cloud service that offers semantic and personalized searchability in a real-time manner, without overwhelming their resource-constrained (thin) client devices (e.g., smartphones). A large body of research has been undertaken on privacy-preserving enterprise search services in the cloud [6], [7], [8], [9], [10], whose goals are to protect the user's sensitive data from internal and external attackers.
However, most of these works fall short in retrieving search results that are semantically relevant to the context and the user's interest (i.e., personalized search) [10], [9]. In addition, these works often rely on the client device and impose significant overhead on it to perform secure query processing or to encrypt/decrypt user documents.

To satisfy all of the aforementioned desires of a particular user, our main idea in this research is to separate the intelligence aspect of the enterprise search from its pattern-matching aspect. According to this idea, we propose to leverage on-premises edge computing [11], [12] to handle the search intelligence and user-side encryption. For that purpose, the edge-based mechanism, called Smartness At Edge (SAED), is developed to extract both contextualized and personalized semantics from the search query and the user's search history. Then, SAED feeds the cloud resources with proactively augmented and encrypted search queries. In this case, the high-end cloud resources are employed only to store encrypted contents and to exhaustively perform pattern matching of the fed query across the entire dataset.

Figure 1 provides a bird's-eye view of the SAED mechanism. On one end, it communicates with the client device(s) to handle the security processing of the user contents and to achieve search intelligence before feeding the query set to the cloud tier. On the other end, SAED communicates with the cloud tier where an existing enterprise search service (e.g.,
Kendra) works with the computing and storage services in the cloud to perform exhaustive pattern matching of the encrypted query set on the uploaded dataset.

Fig. 1: Bird's-eye view of the SAED mechanism in a three-tier architecture to facilitate a smart and privacy-preserving enterprise search service. SAED provides the secure search intelligence on the on-premises edge resources. The high-end storage and compute resources on the cloud tier are utilized by the existing enterprise search systems to exhaustively carry out pattern matching on the entire dataset.

Upon completion of the pattern matching process, the set of resulting documents is retrieved and ordered by SAED with respect to the user's interests. Ultimately, the ordered results are handed over to the user's device.

To identify the actual context of the query and to proactively expand it into a set of contextually-related queries, we leverage WordNet [13], a widely adopted knowledge-based lexical database. However, contextualizing the query cannot help in certain scenarios where the query is short and ambiguous. For instance, considering jaguar as the search query, it can be contextualized to both a car brand and a wild animal. For this type of query, identifying the user's interest can complement the contextualization and navigate the search towards the semantics intended by the user (i.e., achieving personalized search). For that purpose, SAED utilizes a recurrent neural network model to infer the user's interest based on his/her search history.
Although proactive query expansion (i.e., augmenting the user query to a set of queries) is vital to capture the search semantics, not every element of the expanded query set is equally relevant to the original query. As such, SAED assigns a weight to each expanded query that represents its semantic distance to the original query.

In summary, the contributions of this work are as follows:
• We develop the open-source SAED mechanism at the edge tier that offers personalized semantic searchability on existing cloud-based enterprise search services while maintaining data privacy.
• We propose a method to extract the context of a given search query that often appears in the form of a short and incomplete sentence.
• We design a method for proactive query expansion to cover the search semantics with respect to its context.
• We develop a method based on a recurrent neural network model to personalize the search via assigning a weight to each expanded query.
• We evaluate the search accuracy and privacy of SAED via plugging it into existing cloud-based search services.

The rest of the paper is organized as follows. In Section II, we discuss background and related prior works. Then, we provide the architectural details of the SAED mechanism in Section III. In Section IV, we discuss the pluggability of SAED in the context of AWS Kendra. We discuss results and performance analysis in Section V. Finally, Section VI concludes the paper.

II. BACKGROUND AND PRIOR LITERATURE
Several research works have been undertaken in semantic and/or privacy-aware search systems. Here, we introduce some notable mentions and position the contributions of SAED against them.
A. Cloud-Based Enterprise Search Services
Cloud-based enterprise search services, such as AWS Kendra, offer semantic searchability, given that they are provided with plain-text data. That means the semantic ability comes at the cost of compromising the users' data privacy [10], [14], [5]. This is, in fact, the trapdoor that particularly internal attackers can misuse to breach the confidentiality or even the integrity of the users' data. It is this type of attack model that we try to make the cloud-based enterprise search services resistant against. We note that, for encrypted datasets, the current enterprise search services cannot offer anything beyond naïve string matching.

Even for plain-text datasets, our investigations revealed that Kendra covers only ontological semantics in the search and falls short in providing context-aware and personalized semantics. For instance, we tested Kendra's ability to capture context-aware semantics by feeding soccer as a query; the result set contained documents about rugby. In another test, the river bank query returned documents about commercial bank, which indicates the lack of context-awareness in the search.

Alternatively, SAED can offer context-aware and personalized search while maintaining data privacy. It can be plugged into any enterprise search service without enforcing any change on it and enrich its semantic search quality by incorporating context-awareness and personalization.
B. Semantic Representation of Query Keywords
Query expansion is a process that seeks keywords semantically related to a given query to fill the lexical gap between user queries and searchable documents. One of the widely-used methods of query expansion is Pseudo-Relevance Feedback (PRF) [15], [16], which extends an unsuccessful query with various related keywords and then re-ranks the search results to increase the likelihood of retrieving relevant documents. Although the PRF-based approach generally improves retrieval effectiveness, it is sensitive to the quality of the original search results.

Latent semantic analysis [17], latent Dirichlet allocation [18], and neural-based linguistic models [16], [19] are some of the query expansion methods that can obtain the semantic representation of a given query. In these methods, vectors commonly referred to as word embeddings represent words in a low-dimensional semantic space, where the vicinity of words demonstrates the syntactic or semantic similarity between them [20].

Fig. 2: Architectural overview of the SAED system within the edge tier and as part of the three-tier enterprise search service. SAED provides semantic search via identifying the query context and combining that with the user's interests. Then, the Query Expansion and Weighting units of SAED, respectively, incorporate the semantics and assure the relevancy of the results. Solid and dashed lines indicate the interactions from the user to the cloud tier and from the cloud tier to the user, respectively.

However, pre-trained word embedding models, such as Word2vec [20], always generate the same vector representation for an input word, regardless of the context in which the word has appeared. Hence, if any ambiguous keyword is present in a query, the underlying topic of the query cannot be detected.

WordNet [13] is one of the most widely-used and lexically-rich resources in English, utilized to infer the sense of ambiguous words in a given corpus. In WordNet, words with similar meanings are grouped into synonym sets, whereby each set has a semantic and conceptual relationship with the other sets. Song et al. [21] and Nakade et al. [22] evaluate the effectiveness of utilizing WordNet for query expansion on National Institute of Standards and Technology (NIST) and Twitter datasets. They identify important key-phrases of the query and use WordNet to obtain the relevant synonym sets. Later, they utilize the synonym sets to construct the expanded query. Nevertheless, in most of the prior research on query expansion using WordNet (e.g., [23]), the elements of the expanded query set are treated uniformly, which undermines the relevancy and ranking of the result set.
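To make the embedding intuition above concrete, the following toy sketch computes cosine similarity over hand-made 3-dimensional "embeddings" (illustrative stand-ins for the 300-dimensional vectors a real Word2vec model would produce; the words and values are not from any trained model):

```python
import math

# Hand-made toy embeddings -- NOT from a trained model; chosen only so
# that "cloud" and "fog" point in similar directions while "banana" does not.
EMBEDDINGS = {
    "cloud":  [0.9, 0.1, 0.2],
    "fog":    [0.8, 0.2, 0.1],
    "banana": [0.1, 0.9, 0.0],
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 for nearby vectors, near 0.0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(EMBEDDINGS["cloud"], EMBEDDINGS["fog"]))     # high: semantically close
print(cosine(EMBEDDINGS["cloud"], EMBEDDINGS["banana"]))  # low: semantically distant
```

The "vicinity in the semantic space" the text refers to is exactly this: related words yield a higher cosine score than unrelated ones.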
C. Privacy-Preserving Search Systems
In addition to plain-text data, search is also performed over privacy-preserving (encrypted) data, ensuring negligible chances of data leakage. Various searchable encryption-based solutions have been adopted to facilitate search over such data.

Few works at the time of writing have combined the ideas of semantic search and searchable encryption. Works that attempt to provide a semantic search often consider only word similarity instead of true semantics. Li et al. [6] propose a system that can handle minor user typos through a fuzzy keyword search. Moataz et al. [24] use various stemming approaches on terms in the index and query to provide more general matching. Sun et al. [7] present a system that uses an indexing approach over encrypted file metadata and data mining techniques to capture the semantics of queries. This approach, however, builds a semantic network using only the documents that are given to the set and considers only words that are likely to co-occur as semantically related, leaving out many possible synonyms or categorically related terms. Woodworth et al. propose S3BD [10], a secure semantic search system that can search semantically over encrypted confidential big data. They expand the search query by incorporating semantic data extracted blindly from an ontological network. They do not consider context-aware query expansion, which creates confusion for the search system while processing ambiguous or multi-context keywords in a query. Because they perform query processing on client devices, they impose additional computational overhead on the client tier.

III. SAED: SMART EDGE-BASED ENTERPRISE SEARCH SYSTEM
A. Architectural Overview
In this part, we provide a bird's-eye view of the SAED system, which enables intelligent and secure enterprise search in the cloud. The system is structured around three tiers, shown in Figure 1, and explained as follows:
• Client tier (e.g., smartphone, tablet) contains a lightweight application that provides a user interface for uploading documents and for searching over them in the cloud. Datasets are uploaded either by the user or by the organization that owns the data.
• Edge tier extracts representative keywords of the documents being uploaded to the cloud tier and builds an index on the cloud tier. Upon receiving a search query from the client tier, the SAED system on the edge tier offers intelligence by considering the query semantics and the user's interest. The edge tier is located in the client's premises, hence deemed an honest and secure system. To offer a secure enterprise search service, the edge tier encrypts both the uploaded data and the search query. In addition, it decrypts the result set before delivering it back to the client tier.
• Cloud tier contains numerous high-end servers that are utilized for storing (encrypted) data and performing the large-scale computation required to exhaustively search against the index [9], [10]. The index can be clustered based on the underlying topics of its keywords (please refer to our prior works [10], [5] for further details).

In Figure 2, we depict the components of SAED and show the interactions between them. At first, a user-provided search query is received by the Query Handler, which keeps track of the user's search history and initializes the Context Identifier unit, whose job is to extract the context and disambiguate the query phrase. Then, according to the extracted context, the query is proactively expanded by the Query Expansion unit and a query set is constructed. To achieve the personalized search, the Interest Detector unit of SAED leverages the user's search history to recognize his/her interest and weight each element of the query set (i.e., expanded queries) based on its relatedness to the user interest. Once the pattern matching phase is accomplished on the cloud tier, the resulting documents are returned to SAED on the edge tier. Next, the Ranking Unit utilizes the assigned weights to order the retrieved documents based on their relevance to the user's interest and generates a retrieved document list, denoted as Dθ, that is sent to the user's device. In the next parts, we elaborate on each unit of the SAED system.

B. Query Context Identification
Identifying the context of a given search phrase is vital to navigate the search towards the semantics intended by the user. Considering the example of cloud computing as the search query, without proper context identification the returned document set can potentially include documents about sky and climate, whereas an efficient context identifier can recognize the right semantics and navigate the search towards the topics around distributed, edge, fog, and cloud computing. In fact, identifying the context helps the Query Expansion unit form a query set diversified around relevant keywords that semantically represent the search query, subsequently improving the relevancy of the results.

Prior context identification works (e.g., [25], [26], [19]) have the following shortcomings: First, they often assume each keyword has the same importance in the query and recognize the query context via averaging the embeddings of its keywords. However, not all keywords in a query necessarily help in identifying the context. For example, the keyword various in various cloud providers does not bring any significance to the context and can be eliminated. Second, the embedding methods used by the existing works always provide the same representation for a given keyword, irrespective of the underlying context. This is particularly problematic for ambiguous keywords whose meanings vary based on the query context. For instance, the embedding of cloud in the aforementioned example should be different when it is used along with computing as opposed to when it is used along with weather in a given query. Third, existing methods only consider the embeddings of the common keywords, while discarding most of the name-entities (e.g., names and locations) that do not exist in the vocabulary of Word2Vec [13], [27]. For instance, consider best selling books of J.K. Rowling as the query; Book and Sell are identified as the query context and J.K. Rowling is discarded. However, our analysis suggests that the context of a short query phrase often has a contextual association with the discarded name-entities.

To overcome these shortcomings and identify the actual context of a given query, we propose to take a holistic approach and extract the semantics across query keywords, proportionate to the importance of each keyword. The main output of the Context Identification unit is a set of keywords, denoted as C, that collectively represent the context of the query. Specifically, to eliminate unimportant keywords that do not contribute to the semantics of query Q, the Context Identification unit utilizes Yake [28], an unsupervised keyword extractor that discards unimportant keywords of the query. The remaining keywords (i.e., the trimmed query, denoted as the Q′ set) are considered for context identification. To learn the true semantics of Q′, the unit leverages the Lesk algorithm [27] over WordNet to disambiguate each keyword q ∈ Q′. The Lesk algorithm works based on the fact that keywords in a given sentence (query) tend to imply a certain topic. For keyword q, Lesk can determine its true semantics via comparing the dictionary definitions of q against the other keywords in Q′ (i.e., Q′ − {q}). Let c_q be the set of keywords representing the context of q. Then, the context of Q is determined as C = ∪_{∀q∈Q′} c_q. Lastly, the Context Identifier recognizes name-entities from Q using WordNet and considers them as part of the context, but in a separate set, denoted as N. The reason for keeping a separate set is that we apply a different treatment to N and C in the other units of SAED.

ALGORITHM 1: Pseudo-code to detect the context of a given query in the Context Identification unit of SAED.
Input: query Q
Output: C: set of keywords representing the context of Q; N: set of name-entities in Q
Function contextIdentification(Q):
    Q′ ← extract keywords from Q using the Yake alg.
    foreach q ∈ Q do
        if q ∈ Name-entity then
            N ← N ∪ {q}
        else if q ∈ Q′ then
            E_q ← define q based on Q′ − {q} using the Lesk alg.
            c ← extract set of keywords of E_q using the Yake alg.
            C ← C ∪ c
        end
    end
    return C, N
end

Algorithm 1 provides the pseudo-code for identifying the context of an incoming query Q. The outputs of the pseudo-code are two sets, namely C and N, that collectively represent the context of Q. In Step 2 of the pseudo-code, the Yake algorithm is used to filter Q by extracting its important keywords and generating the Q′ set. Name-entities of Q are identified by checking against WordNet and form the set N (Steps 4–6). Next, in Steps 8–12, for each keyword q ∈ Q′, the Lesk algorithm is employed to disambiguate q and find its true definition with respect to the rest of the keywords in Q′. Important keywords of the definitions form the context set (C) for Q.

C. Query Expansion Unit
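The Lesk-based disambiguation at the heart of the Context Identification unit can be illustrated with a minimal, self-contained sketch. The two-sense inventory and glosses below are hand-made stand-ins for WordNet's; SAED uses WordNet's actual glosses and the full Lesk implementation:

```python
# Simplified Lesk-style word-sense disambiguation: each sense of an
# ambiguous keyword carries a gloss (dictionary definition), and the sense
# whose gloss shares the most words with the rest of the query wins.
# The sense inventory here is illustrative, not from WordNet.
SENSES = {
    "bank": {
        "finance": "institution that accepts deposits and lends money",
        "river":   "sloping land beside a body of water such as a river",
    },
}

def lesk(keyword, other_query_words):
    """Pick the sense of `keyword` whose gloss overlaps the query context most."""
    context = set(other_query_words)

    def overlap(sense):
        # Number of words the gloss shares with the rest of the query (Q' - {q}).
        return len(context & set(SENSES[keyword][sense].split()))

    return max(SENSES[keyword], key=overlap)

print(lesk("bank", ["river", "water", "erosion"]))  # -> "river"
print(lesk("bank", ["money", "deposits", "loan"]))  # -> "finance"
```

This mirrors Steps 8–12 of Algorithm 1: the winning gloss supplies the keywords that are then added to the context set C.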
The Query Expansion unit is in charge of proactively expanding the query keywords based on their relevant synonyms that are in line with their identified context. Neglecting the query context and blindly considering all the synonyms, as done in [25], [26], [19], [10], leads to finding irrelevant documents. Accordingly, the unit leverages the context of Q (i.e., C and N) to find only the set of synonyms, denoted as P, that are semantically close to the query context.

Word2Vec [20] is a shallow neural network model that can be trained to generate vector representations of keywords, such that the cosine similarity of two given keywords indicates the semantic similarity between them. Accordingly, to proactively expand each keyword q ∈ Q, the Query Expansion unit instruments Word2Vec, pre-trained on the Google News dataset [29], to form the set of nominated synonyms, denoted as s_q. Let s_q^i be a synonym of q (i.e., s_q^i ∈ s_q). Then, the similarity of s_q^i and the query context, denoted as sim(s_q^i, C), is defined based on the sum of similarities with each element of C, as shown in Equation 1.

sim(s_q^i, C) = Σ_{∀C_j ∈ C} sim(s_q^i, C_j)    (1)

Then, s_q^i is chosen as an element of P only if it is semantically close enough to the query context. To determine sufficient closeness, we require sim(s_q^i, C) to be greater than the mean of the similarities across all nominated synonyms (i.e., sim(s_q^i, C) > µ_{∀q,∀j}(sim(s_q^j, C))). We note that because the elements of C and N represent the context of Q, they too are added to P.

Algorithm 2 provides a high-level pseudo-code for generating the expanded query set P. In Steps 2–7 of the pseudo-code, the synonym set for each q is generated and the similarity between each s_q^i and C is calculated. The similarity values are used to calculate the mean similarity of all nominated synonyms in Step 8. In Steps 9–15, the expanded query set P is formed by including the nominated synonyms whose semantic closeness is greater than µ. Lastly, in Step 16, the set P is expanded by including the context set and the name-entities.

D. User Interest Detection
ALGORITHM 2: Pseudo-code to expand the query based on the context in the Query Expansion unit of SAED.
Input: Q, C, N
Output: P: the expanded query set
Function QueryExpansion(Q, C, N):
    foreach q ∈ Q do
        s_q ← use WordNet to obtain the synonym set of q
        foreach s_q^i ∈ s_q do
            sim(s_q^i, C) ← Σ_{∀C_j ∈ C} sim(s_q^i, C_j)
        end
    end
    µ ← calculate the mean of sim(s_q^j, C) across all q ∈ Q, ∀s_q^j ∈ s_q
    foreach q ∈ Q do
        foreach s_q^i ∈ s_q do
            if sim(s_q^i, C) > µ then
                add s_q^i to set P
            end
        end
    end
    P ← P ∪ C ∪ N
    return P
end

Detecting the user's search interest is essential to deliver personalized search. In SAED, interest detection is achieved by analyzing two factors: (A) the user's search history; and (B) the user's reaction to the retrieved results of prior search queries. The latter can be detected based on the results chosen by the user or the time spent browsing them.

Let ∆′ represent the whole set of resulting documents that are sent to the user and τ represent the documents the user is interested in. We have τ ⊆ ∆′. Accordingly, the user's interest can be derived from the topics of τ. The Interest Detector unit uses an existing document classification model [30], operating based on the Naïve Bayes (NB) method, to determine the topics of τ, denoted as t_τ. We also perform majority voting on t_τ to find the user's main interest. The process is repeated to store the user's n prior search interests. The data is characterized as sequential, as it is harvested from each successful search. By analyzing the user's prior search interests, the edge tier trains a recurrent neural network-based prediction model [31] that can predict the user's search interest. In the case of SAED, since the data does not contain long dependencies, and to keep the model simple and maintain real-timeliness, we feed the harvested user-specific historical search data to a many-to-one vanilla RNN model [32] instead of a stacked (i.e., deeper) model.

E. Weighting Unit
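The majority-voting step of the Interest Detector just described can be sketched as follows. The topic labels are hand-made placeholders; the NB classifier that produces them and the vanilla RNN predictor are not reproduced here:

```python
from collections import Counter

# Topic labels that the document classifier (NB in the paper) would assign
# to the documents the user engaged with after recent searches (hand-made).
clicked_doc_topics = ["sports", "sports", "technology", "sports", "business"]

def main_interest(topics):
    """Majority vote over topics of engaged documents -> the user's main interest."""
    return Counter(topics).most_common(1)[0][0]

print(main_interest(clicked_doc_topics))  # -> "sports"
```

One such interest label is harvested per successful search; the resulting sequence is what SAED feeds to the many-to-one RNN to predict the next search's interest.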
Once SAED learns the user interest, the next step to accomplish a context-aware and personalized enterprise search is to determine the closeness of the contextually-expanded queries (i.e., elements of P) to the user's interest. In fact, not all expanded queries have the same significance in the interpretation of the query. Accordingly, the objective of the Weighting unit is defined as quantifying the closeness of each expanded query to the user's interest. Later, upon completion of the search operation on the cloud tier, the weights are used by the Ranking unit of SAED to prune and sort the result set.

Prior weighting schemes (e.g., [9], [10], [19], [16], [26]) often use a word frequency-based approach (e.g., TF-IDF [10]) and discard the user interests. Alternatively, the weighting procedure of SAED quantifies the importance of each expanded query p ∈ P based on two factors: (A) the type of p, i.e., whether it directly belongs to the context (the C and N sets) or is derived from them; and (B) the semantic similarity of p to the user interest.

In particular, those elements of P that directly represent the query context or name-entities (i.e., ∀p | p ∈ P ∩ (C ∪ N)) explicitly indicate the user's search intention; hence, weighting them should be carried out irrespective of the user interest. A deeper analysis indicates that name-entities that potentially exist in a query represent the search intention; thus, biasing the search results towards them can lead to higher user satisfaction. As such, the highest weight is assigned to ∀p | p ∈ (P ∩ N). The highest weight is determined by the domain expert; however, in the experiments we consider it as η_max = 1. We define the contribution of q ∈ Q as the ratio of the number of keywords added to C because of q (denoted C_q) to the cardinality of C. Let η_p denote the weight of p ∈ P. Then, for those elements of P that are in the query context (i.e., ∀p ∈ (P ∩ C)), η_p is calculated based on the contribution of the query keyword q corresponding to p. Equation 2 formally represents how η_p is calculated.

η_p = η_max · |C_q| / |C|    (2)

The weight assignment for those p that are derived from elements of C, as explained in Section III-C (i.e., ∀p | p ∈ P − (C ∪ N)), is carried out via considering the semantic similarity of p with the user interest θ. That is, η_p = sim(p, θ).

F. Ranking Unit
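The three weighting cases of the previous subsection (η_max for name-entities, Equation 2 for context keywords, similarity to the interest θ for derived synonyms) can be condensed into a short sketch. The sets, contributions, and similarity values below are hand-made placeholders:

```python
def weight(p, C, N, contribution, sim_to_interest, eta_max=1.0):
    """Weight an expanded-query element p per SAED's Weighting unit:
    name-entities get eta_max; context keywords get eta_max * |C_q| / |C|
    (Equation 2); derived synonyms get their similarity to the user interest."""
    if p in N:
        return eta_max
    if p in C:
        return eta_max * contribution[p] / len(C)
    return sim_to_interest[p]

# Illustrative inputs (hand-made): a 3-keyword context, one name-entity,
# per-keyword contributions |C_q|, and a similarity score sim(p, theta).
C = {"cloud", "computing", "distributed"}
N = {"AWS"}
contribution = {"cloud": 2, "computing": 1, "distributed": 1}
sim_to_interest = {"fog": 0.7}

print(weight("AWS", C, N, contribution, sim_to_interest))    # name-entity -> 1.0
print(weight("cloud", C, N, contribution, sim_to_interest))  # context -> 2/3
print(weight("fog", C, N, contribution, sim_to_interest))    # derived synonym -> 0.7
```

Note one simplification: the sketch indexes the contribution |C_q| directly by p, whereas in SAED it belongs to the original query keyword q from which p stems.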
Once the expanded query set P is formed, the cloud tier performs string matching for each p ∈ P across the index structure. We note that, if the user chooses to perform a secure search, the elements of P are encrypted before being delivered to the cloud tier. In addition, in our prior works [5], we proposed methods for the cloud tier to cluster the index structure and perform the pattern matching only on the clusters that are relevant to the query.

The cloud tier returns the resulting document set, denoted as ∆, to the edge tier, where the Ranking unit of SAED ranks the documents based on their relevance and the user's interest and generates a document list, called ∆′, to show to the user. For a document δ_i ∈ ∆, the ranking score, denoted as γ_i, is calculated by aggregating the importance values of each p ∈ P within δ_i, with respect to its weight (η_p). The importance of p in δ_i is conventionally measured based on the TF-IDF score [33]. Accordingly, γ_i is formally calculated based on Equation 3.

γ_i = Σ_{∀p ∈ P} ( η_p · TFIDF(p, δ_i) )    (3)

The TF-IDF score of p in δ_i is defined based on the frequency of p in δ_i versus the inverse document frequency of p across all documents in ∆. Details of calculating the TF-IDF score can be found in [33].

Once the Ranking unit calculates the ranking score for all δ_i ∈ ∆, the documents are sorted in descending order of their scores, and the document list ∆′ is formed and displayed to the user.

IV. SAED AS A PLUGGABLE MODULE TO ENTERPRISE SEARCH SOLUTIONS
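Before moving on, the ranking score of Equation 3 from the previous subsection can be sketched as follows. It uses a plain TF-IDF formulation and hand-made η_p weights; this is an illustration, not SAED's exact implementation (which follows [33]):

```python
import math

def tf_idf(term, doc, corpus):
    """Plain TF-IDF: term frequency in doc times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

def rank_score(doc, corpus, weights):
    """Equation 3: gamma_i = sum over expanded queries p of eta_p * TFIDF(p, doc_i)."""
    return sum(eta * tf_idf(p, doc, corpus) for p, eta in weights.items())

# Tiny illustrative result set (tokenized documents) and eta_p weights.
corpus = [
    ["cloud", "computing", "cloud"],
    ["river", "bank", "water"],
]
weights = {"cloud": 1.0, "computing": 0.5}

scores = [rank_score(d, corpus, weights) for d in corpus]
print(scores)  # the first document matches the weighted queries, so it ranks first
```

Sorting ∆ by these scores in descending order yields the list ∆′ shown to the user.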
The advantage of SAED is being independent from the enterprise search service deployed on the cloud tier. That is, using SAED neither interferes with nor implies any change on the cloud-based enterprise search service. SAED can be plugged into any enterprise search solution. It provides the search smartness on the on-premises edge tier and leaves the cloud tier only for large-scale pattern matching. The whole SAED solution reforms the enterprise search into a semantic, personalized, and confidential service.

In this work, we set SAED to work with both AWS Kendra and S3BD. In the case of using AWS Kendra, the Query Expansion unit sends the expanded query set P to Kendra to search each keyword p against the dataset on the Amazon cloud. The resulting documents are received by SAED and ranked before being delivered to the client tier. In the implementation, we only show the top 10 documents from the resulting list to the user. Similarly, we plugged SAED into S3BD to perform confidential semantic search on the cloud. Because S3BD maintains an encrypted index structure that has to be traversed for each search query, the elements of P had to be encrypted before handing them over to the cloud tier. We also verified SAED when it is used along with AWS Kendra on an encrypted dataset and noticed that SAED can achieve smart search even when Kendra is set to work with encrypted data. The performance measurement and analysis of using SAED along with AWS Kendra and S3BD are elaborated in the next section.

V. PERFORMANCE EVALUATION
A. Experimental Setup
We have developed a fully working version of SAED and made it publicly available on our GitHub page (https://github.com/hpcclab/SAED-Security-At-Edge). To conduct a comprehensive performance evaluation of SAED on the enterprise search solutions, we developed it to work with both S3BD [10] and AWS Kendra [2]. S3BD already has its own query expansion and weighting mechanisms, but we deactivated them and set it to use the expanded queries generated by SAED. In the experiments, the combination of SAED and S3BD is shown as SAED+S3BD. Likewise, the combination of SAED and AWS Kendra is shown as SAED+Kendra.

We evaluated SAED using two different datasets, namely Request For Comments (RFC) and BBC, that have distinct properties and volumes. The reason we chose the RFC dataset is that it is domain-specific and includes documents about the Internet and wireless communication networks.

TABLE I: Benchmark search queries developed for the RFC and BBC datasets.

BBC Dataset                          RFC Dataset
European Commission (EC)             Network Information (NI)
Parliament Archives (PA)             Host Network Configuration (HNC)
Top Camera Phones 2020 (TCP)         Data Transfer (DT)
Credit Card Fraud (CCF)              Service Extension (SE)
Animal Welfare Bill (AWB)            Transport Layer (TL)
Piracy and Copyright Issues (PCI)    Message Authentication (MA)
Car and Property Market (CPM)        Network Access (NA)
Rugby Football League (RFL)          Internet Engineering (IE)
Opera in Vienna (OV)                 Fibre Channel (FC)
Windows Operating System (WOS)       Streaming Media Service (SMS)

Alternatively, the BBC dataset is more diverse. It includes news documents in five distinct categories: politics, entertainment, business, sports, and technology.

To conduct a comprehensive evaluation, we used both systematic metrics and human-based feedback, as elaborated in Section V-C. We deployed and experimented with SAED on a Virtual Machine (VM) within our local edge computing system. The VM had two 10-core 2.8 GHz E5 Xeon processors with 64 GB memory and the Ubuntu 18.04 operating system.
The datasets that we use to carry out the experiments are not featured with any benchmark. Therefore, we needed to develop benchmark queries for the datasets before evaluating the performance of SAED. We developed ten benchmark queries, shown in Table I, for each of the two datasets. The benchmark queries are proactively designed to explore the breadth and depth of the datasets in question. In addition, some of the queries intentionally contain ambiguous keywords to enable us to examine the context detection capability of SAED. For the sake of brevity, we provide one acronym for each benchmark query (see Table I). For each benchmark query, we collected at most the top-20 retrieved documents. Then, the quality of the retrieved documents was measured via both an automated script and human evaluators.

C. Evaluation Metrics
We measure the search relevancy metric to understand how related the resulting documents are to the user's query and how well they meet his/her interests. For the measurement, we use the TREC-Style Average Precision (TSAP) score, described by Mariappan et al. [34]. TSAP provides a qualitative score in a relatively fast manner and without knowledge of the entire dataset [10]. It works based on the precision-recall concept that is commonly used for judging text retrieval systems. The TSAP score is calculated as (1/N) Σ_{i=1}^{N} r_i, where r_i denotes the score for the i-th retrieved document and N denotes the cutoff number (total number of retrieved documents). Since we consider N = 10, we call the scoring metric TSAP@10.

To determine r_i for retrieved document δ′_i ∈ ∆′, we conducted a human-based evaluation. We engaged five volunteer students to judge the relevancy of each retrieved document. For every search query, the volunteers labeled each retrieved document as highly relevant, partially relevant, or irrelevant. After performing majority voting on the provided responses for document i, the value of r_i is determined as follows:

• r_i = 1/i if a document is highly relevant
• r_i = 1/(2i) if a document is partially relevant
• r_i = 0 if a document is irrelevant

We report the TSAP@10 score to show the relevancy of results for each benchmark query. In addition, the mean TSAP score is reported to show the overall relevancy across each dataset. As we set the top 10 documents to be retrieved for each search, the highest possible TSAP@10 score is 0.292 [34].

In addition to the TSAP score, we also measure the mean F-1 score to compare the search quality offered by the SAED-plugged enterprise search solutions against the original enterprise search solutions (i.e., without SAED in place). The F-1 score maintains a balance between the precision and recall metrics, which is useful for unstructured datasets with non-uniform topic distribution.
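As a concrete illustration, the TSAP@10 computation described above can be sketched in a few lines of Python. The label names and the majority-voting helper are our own illustrative choices, not part of the paper's tooling:

```python
from collections import Counter

def majority_label(votes):
    """Most common label among the five volunteers' votes for one document."""
    return Counter(votes).most_common(1)[0][0]

def tsap_at_n(labels, n=10):
    """TSAP@N = (1/N) * sum of r_i, where r_i is 1/i for a highly
    relevant document at rank i, 1/(2i) for a partially relevant one,
    and 0 for an irrelevant one. Ranks are 1-based."""
    score = 0.0
    for i, label in enumerate(labels[:n], start=1):
        if label == "highly":
            score += 1.0 / i
        elif label == "partial":
            score += 1.0 / (2 * i)
    return score / n

# Upper bound: all ten retrieved documents judged highly relevant.
best = tsap_at_n(["highly"] * 10)   # ≈ 0.29, the ceiling cited in the text
```

The ceiling follows directly from the harmonic sum: (1/10) Σ_{i=1}^{10} 1/i ≈ 0.292.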
D. Evaluating Search Relevancy
The purpose of this experiment is to evaluate the search relevancy of enterprise search systems that have SAED plugged into them and compare them against the original (unmodified) systems. To evaluate the personalized search, we assumed technology as the user's interest for both datasets. We note that, in this part, the enterprise search solutions (S3BD and AWS Kendra) are set to work on the plain-text datasets.
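A minimal sketch of the SAED-plugged search flow exercised in these experiments — edge-side expansion, cloud-side pattern matching, edge-side ranking — using toy stand-ins for SAED's Query Expansion and Ranking units. None of the function names below are SAED's actual API:

```python
import hashlib
from collections import Counter

def expand_query(query):
    """Toy stand-in for SAED's Query Expansion unit: here, just the
    individual query keywords (SAED also adds semantically related terms)."""
    return query.lower().split()

def encrypt_keyword(keyword):
    """Toy stand-in for the keyword encryption S3BD's encrypted index
    requires (a deterministic hash, purely for illustration)."""
    return hashlib.sha256(keyword.encode()).hexdigest()

def rank(doc_hits):
    """Toy stand-in for SAED's edge-side Ranking unit: order documents
    by how many expanded keywords matched them."""
    return [doc for doc, _ in Counter(doc_hits).most_common()]

def saed_search(query, cloud_search, encrypted=False, top_k=10):
    """Expand at the edge, pattern-match each keyword on the cloud,
    rank the union of results at the edge, return the top-k documents."""
    expanded = expand_query(query)
    if encrypted:
        expanded = [encrypt_keyword(p) for p in expanded]
    hits = []
    for p in expanded:
        hits.extend(cloud_search(p))   # cloud tier: pattern matching only
    return rank(hits)[:top_k]

# Toy cloud-side inverted index standing in for Kendra/S3BD.
index = {"credit": ["d1", "d2"], "card": ["d1"], "fraud": ["d1", "d3"]}
results = saed_search("Credit Card Fraud", lambda p: index.get(p, []))
```

With an encrypted dataset, the same flow applies after both the index keys and the query keywords are encrypted, which is why the cloud tier never observes plain-text terms.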
S3BD vs SAED+S3BD: Figure 3 shows the TSAP@10 scores for the BBC and RFC datasets for the original S3BD and SAED+S3BD. The horizontal axes in both subfigures show the benchmark queries, and the vertical axes show the search relevancy based on the TSAP@10 score.

In both Figures 3a and 3b, we observe that, for all queries in both datasets, SAED+S3BD outperforms the S3BD system. In addition, we observe that S3BD produces less relevant results for the BBC dataset compared to the RFC dataset. This is because, unlike the RFC dataset, in several cases, the exact keywords of the benchmark queries do not exist in the BBC dataset. The worst case of these issues occurred for the PCI query in S3BD, because its query expansion procedure could not capture the complete semantics. In contrast, SAED+S3BD is able to handle the cases where the exact keyword does not exist in the dataset; thus, we see that it yields remarkably higher relevancy.

Even if we consider PCI as an outlier and exclude it from the analysis in Figure 3a, we still notice that the TSAP@10 score of SAED+S3BD is on average higher than that of S3BD. Although the difference between S3BD and SAED+S3BD is less significant for the RFC dataset (Figure 3b), we still notice some improvement in the TSAP@10 score. This is because RFC is a domain-specific dataset and the exact keywords of the queries can be found in the dataset; hence, making use of smart methods to extract the semantics is not critical to obtaining relevant results. From these results, we can conclude that SAED can be specifically effective for generic datasets where numerous topics exist in the documents.

Fig. 3: Comparing TSAP@10 scores of SAED+S3BD and S3BD systems. Horizontal axes show the benchmark queries. (a) BBC dataset. (b) RFC dataset.

Fig. 4: Comparing TSAP@10 scores obtained from SAED+Kendra versus AWS Kendra in searching the benchmark queries. (a) BBC dataset. (b) RFC dataset.
AWS Kendra vs SAED+Kendra: In Figures 4a and 4b, we report the TSAP@10 scores obtained from AWS Kendra versus SAED+Kendra for the BBC and RFC datasets, respectively. Specifically, in Figure 4a (BBC dataset), a significant improvement is noticed on average in the TSAP@10 score of SAED+Kendra. However, unlike SAED+S3BD, SAED+Kendra does not beat Kendra for all the queries. The reason Kendra outperforms SAED+Kendra for the AWB and CPM queries is that SAED injects extra keywords and sends the expanded query set to AWS Kendra. Then, Kendra returns documents that are related both to the queries and to the expanded keywords. We realized that the Ranking unit of SAED occasionally prioritizes documents that include keywords of the expanded queries instead of those with the original query keywords. Similar to the S3BD experiment, we observe that the improvement of SAED+Kendra over Kendra is less significant for RFC. However, we still obtain an improvement in the TSAP@10 score, according to Figure 4b.
E. Relevancy of Privacy-Preserving Enterprise Search
To examine the efficiency of SAED for privacy-preserving enterprise search systems, we conducted experiments using encrypted BBC and RFC datasets. The encrypted datasets were uploaded to the cloud tier, and the expanded queries were also encrypted and searched on the cloud tier via Kendra.

We use the TSAP@10 score, as shown in Figures 5a and 5b, for the BBC and RFC datasets, respectively. Figure 5a indicates that SAED+Kendra substantially outperforms Kendra for all the benchmark queries. We can see that, for the encrypted dataset, Kendra cannot do anything except pattern matching and returning documents that exactly include the encrypted query. Therefore, searching for several queries (e.g., PA, TCP, CPM) does not retrieve any documents. We notice that, in both systems, the highest TSAP@10 score is in searching EC. The reason is the high number of documents in BBC that contain the exact phrase European commission.

The TSAP@10 scores reported for the RFC dataset in Figure 5b show a clear improvement compared with the BBC dataset. We observe that seven out of ten queries provide equal TSAP@10 scores in both systems. The reason that makes Kendra competitive with SAED+Kendra is the exact availability of the benchmark queries in RFC. However, for HNC and FC, the exact query keywords are not present in the dataset; hence, Kendra fails to find any results.

F. Discussion of the Relevancy Results
Fig. 5: Comparing TSAP@10 scores obtained from SAED+Kendra vs AWS Kendra systems in the encrypted domain. (a) Encrypted BBC dataset. (b) Encrypted RFC dataset.

In Table II, we report the mean F-1 and mean TSAP@10 scores for the SAED-plugged enterprise search systems along with their original versions, utilizing the datasets in both plain-text and encrypted forms. From the table, we notice that, regardless of the enterprise search system being employed, a higher search relevancy is consistently achieved for the RFC dataset as opposed to the BBC dataset.

The search relevancy is consistently improved when SAED+Kendra is used, which raises both the mean F-1 score and the mean TSAP@10 score. Although the original S3BD is the underperformer, using SAED+S3BD improves its mean F-1 and mean TSAP@10 scores as well.
TABLE II: Comparing the mean F-1 and the mean TSAP@10 scores obtained from SAED-plugged enterprise search systems versus their original forms. The highest resulting scores are shown in bold font.

                           BBC                       RFC
Systems                Mean F-1  Mean TSAP@10    Mean F-1  Mean TSAP@10
S3BD                     0.50        0.17          0.80        0.24
SAED+S3BD                0.82        0.25          0.92        —
Kendra                   0.67        0.20          0.88        0.26
SAED+Kendra              —           —             —           —
Kendra (Encry.)          0.31        0.09          0.75        0.22
SAED+Kendra (Encry.)     0.73        0.22          0.90        0.27
In the encrypted domain, we notice that SAED+Kendra offers substantially higher search relevancy for the BBC dataset. As the exact keywords of the given search queries are not present in the encrypted form of the BBC dataset, AWS Kendra fails to perform semantic search and instead does only pattern matching, which makes it an underperformer for this dataset. On the other hand, the search relevancy is also improved for the RFC dataset, as both the mean F-1 and mean TSAP@10 scores increase. This is because most of the queries are present exactly in the dataset, and Kendra retrieves most of the relevant documents by relying only on pattern matching.
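For reference, the mean F-1 figures in Table II combine per-query precision and recall before averaging; a minimal sketch of that computation (our own illustration, not the paper's evaluation script):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mean_f1(per_query_scores):
    """Average F-1 over a list of (precision, recall) pairs, one per query."""
    return sum(f1(p, r) for p, r in per_query_scores) / len(per_query_scores)

# A system that finds all relevant documents but returns half noise:
# precision 0.5, recall 1.0 -> F-1 ≈ 0.67
```

Because the harmonic mean penalizes imbalance, a system that only pattern-matches (high precision, low recall on encrypted data) is pulled down sharply, which is consistent with Kendra's low encrypted-domain scores in Table II.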
G. Evaluating the Search Time
Figure 6 presents the total incurred search time of the experimented queries for each dataset. The search time is calculated as the summation of the elapsed time taken by a query to be processed (e.g., expansion, weighting) and the turnaround time until the result set is received. To eliminate the impact of any randomness in the computing system, we searched each set of experimented queries 10 times and reported the results in the form of box plots. The figure indicates that the S3BD system has the highest search time overhead for both datasets, which could impact real-time searchability in the case of big data. SAED+S3BD incurs less query processing time overhead compared to the original (unmodified) S3BD system. On the other hand, AWS Kendra causes the lowest time overhead for both datasets compared to SAED+Kendra. SAED+Kendra causes more time overhead compared to the original Kendra. However, in the prior set of experiments, we determined that SAED+Kendra achieves a substantially higher search relevancy for most of the queries and, particularly, for datasets with privacy constraints.
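The timing methodology — summing query processing and turnaround time over ten repetitions per query — can be sketched as follows. The search function passed in is a hypothetical placeholder, not Kendra's or S3BD's client:

```python
import statistics
import time

def measure_search_times(search_fn, query, repetitions=10):
    """Elapsed wall-clock time per run, covering both query processing
    (e.g., expansion, weighting) and turnaround until results arrive."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        search_fn(query)                      # placeholder end-to-end search
        samples.append(time.perf_counter() - start)
    return samples

samples = measure_search_times(lambda q: sorted(q.split()), "transport layer")
median_time = statistics.median(samples)      # box plots summarize these samples
```

Repeating each measurement and plotting the distribution, rather than a single run, is what lets the box plots absorb random system noise such as network jitter and scheduling delays.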
Fig. 6: Search time comparison among S3BD, Kendra, SAED+S3BD, and SAED+Kendra systems.

VI. CONCLUSIONS AND FUTURE WORK
A context-aware, personalized, and privacy-preserving enterprise search service is the need of the hour for data owners who wish to use cloud services. Our approach to address this demand was to separate the search intelligence and privacy aspects from the pattern matching aspect. We developed SAED, which achieves privacy and intelligence at the edge tier and leaves the large-scale pattern matching to the cloud tier. SAED is pluggable and can work with any enterprise search solution (e.g., AWS Kendra and S3BD) without dictating any change on them. Utilizing edge computing on the user's premises preserves the user's privacy and makes SAED a lightweight solution. Leveraging recurrent neural network-based prediction models, the WordNet database, and Word2Vec, SAED proactively expands a search query in a proper contextual direction and weights the expanded query set based on the user's interest. In addition, SAED provides the ability to perform semantic search while the data are stored in encrypted form on the cloud. In this case, the existing enterprise search solutions just perform pattern matching without knowing the underlying data. Evaluation results, verified by human users, show that SAED can improve the relevancy of the retrieved results, on average, for both plain-text and encrypted generic datasets.

There are several avenues to improve SAED. One avenue is to cover domain-specific and trendy keywords. Another avenue is to make SAED flexibly deployable on various devices. For instance, when the user is on the move and does not have access to the edge, SAED should shrink to the bare minimum search intelligence, and vice versa.

ACKNOWLEDGMENTS