In-The-Field Monitoring of Functional Calls: Is It Feasible?
Oscar Cornejo, Daniela Briola, Daniela Micucci, Leonardo Mariani
Department of Informatics, Systems and Communication, University of Milano - Bicocca, Milan, Italy
Abstract
Collecting data about the sequences of function calls executed by an application while running in the field can be useful to a number of applications, including failure reproduction, profiling, and debugging. Unfortunately, collecting data from the field may introduce annoying slowdowns that negatively affect the quality of the user experience.

So far, the impact of monitoring has been mainly studied in terms of the overhead that it may introduce in the monitored applications, rather than considering whether the introduced overhead can really be recognized by users. In this paper we take a different perspective, studying to what extent collecting data about sequences of function calls may impact the quality of the user experience, producing recognizable effects. Interestingly, we found that, depending on the nature of the executed operation and its execution context, users may tolerate a non-trivial overhead. This information can potentially be exploited to collect significant amounts of data without annoying users.
Keywords:
Monitoring, dynamic analysis, user experience.
1. Introduction
Behavioral information collected from the field can complement and complete the inherently partial knowledge about applications gained with in-house testing and analysis activities. For instance, observing applications that run in the field can produce data otherwise difficult to obtain, such as information about the behavior of the application when executed with actual production data and within various real execution environments. Indeed, collecting field data is common practice in the area of software experimentation [1, 2, 3], where controlled experiments are performed to evaluate how a change may impact the user experience.
A diversity of data can be collected to study the behavior of software applications [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. In this paper, we focus on sequences of function calls, a specific but extremely common type of data recorded and used by analysis techniques. Studying the behavior of an application in terms of the function calls produced under different circumstances is in fact both common and useful. For example, sequences of function calls extracted from the field can be used to reproduce failures [12], detect malicious behaviors [14], debug applications [15], profile software [9], optimize applications [16], and mine models [17, 18, 19].

Collecting information from the field is challenging since it slows down the application, and this may imply a negative effect on the quality of the user experience. If the slowdowns are frequent, the usability can be compromised up to the point that users may stop using the application. It is thus extremely important to understand how the slowdowns introduced into an application can affect users.

The impact of monitoring has been mainly studied in terms of its relative overhead, that is, by measuring how much the execution time of a given operation is increased due to the presence of the monitor. Although this is important information, it does not reflect how and whether this overhead can be perceived by the users of the application. For instance, increasing by 20% the time that every menu item requires to open may introduce a small but annoying slowdown to operations that should be instantaneous from a user perspective. On the contrary, taking 20% more time on the execution of a query might be acceptable for users, as long as the total time does not exceed their expectation. It is thus important to investigate the relation between the overhead introduced by monitoring techniques and the user experience, to understand how to seamlessly and feasibly collect data from the field.

In our initial study [20], we discovered that a non-trivial overhead can be tolerated by users and that the overhead is tolerated differently depending on the nature of the operation that is executed. This paper extends this initial study considering a larger number of operations exposed to overhead, new experiments to study how the availability of the computational resources may affect overhead, a study based on human subjects, and additional analyses of the empirical data. The results show that function calls can be frequently collected without impacting the user experience, regardless of the availability of the computational resources, but specific operations may require ad-hoc support to be monitored without affecting users. This evidence can be exploited to design better monitoring and analysis procedures running in the field.

This paper is organized as follows. Section 2 describes our experimental setup. Sections 3 and 4 report the results obtained when studying the impact of the overhead on the users with good and poor availability of computational resources, respectively. Section 5 describes the results obtained with our study involving human subjects. Section 6 discusses threats to validity. Section 7 summarizes our findings. Section 8 discusses related work. Section 9 provides final remarks.

2. Experiment Design
This section describes the research questions that we have addressed and the design of the experiments performed to answer them.
The general objective of our study is understanding how collecting field data can affect the user experience. We investigated this question in a specific, although common, scenario, that is, while recording the sequence of function calls executed by applications. We thus organized our study around three main research questions that investigate the impact of monitoring in different conditions.
RQ1 - How is the user experience affected by monitoring function calls?

This research question analyzes the relation between the overhead produced by the monitoring activity and its impact on the user experience. RQ1 is further organized into four sub-research questions:

RQ1a - What is the overhead introduced by monitoring function calls?

RQ1a measures the overhead introduced in an application by the monitoring activity.

RQ1b - What is the impact of monitoring function calls on the user experience?

RQ1b studies if the overhead introduced by monitoring can be recognized by the user of the application.

RQ1c - What is the tolerance of the operations to the introduced overhead?

RQ1c studies how different user operations tolerate overhead before producing slowdowns recognizable by users.

RQ1d - Do failures change the overhead introduced by function calls monitoring?

RQ1d studies if the overhead introduced by the monitor is different in the context of failures.

RQ2 - What is the impact of monitoring function calls when the availability of computational resources is limited?

This research question investigates if, and how much, the overhead produced by collecting function calls changes with the availability of the computational resources. The study focuses on the availability of the two most relevant resources, CPU and memory, as captured by the following two sub-research questions:

RQ2a - What is the impact of CPU availability on the intrusiveness of monitoring?

RQ2a studies how the overhead introduced by monitoring function calls is affected by different levels of CPU utilization.

RQ2b - What is the impact of memory availability on the intrusiveness of monitoring?

RQ2b studies how the overhead introduced by monitoring function calls is affected by different levels of memory utilization.

Since we investigate RQ1 and RQ2 referring to the classification of the System Response Time proposed by Seow [21], we consider the following research question to investigate the alignment between the user behavior and the adopted classification.

RQ3 - How do expert computer users react to the overhead produced by function calls monitoring, compared to the results obtained with RQ1 and RQ2?

This research question analyzes the alignment between the behavior of expert computer users recruited from our CS department and the results reported in RQ1 and RQ2, with a study involving human subjects.
This section presents the design of the experiment that we performed to answer our research questions. To study the impact of monitoring we selected four widely used interactive applications: Notepad++ 6.9.2 (https://notepad-plus-plus.org), Paint.NET 4.0.12, VLC Media Player 2.2.4, and Adobe Reader DC 2019 (http://get.adobe.com/reader).

To collect sequences of function calls from these applications, we instrumented the applications using a probe that we implemented with the Intel Pin Binary Instrumentation Tool (https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool). Pin supports the instrumentation of compiled binaries, including shared libraries that are loaded at runtime, and optimizes performance by automatically in-lining routines that have no control-flow changes [22]. Our probe is a custom plug-in utility written in C++ that intercepts and logs every function call, including nested calls. The probe uses a buffer of 50 MB to store data in memory before saving it to file; a sketch of this buffering strategy is shown at the end of this section. We used this value based on the results we obtained in our preliminary experiment, where 50 MB proved to be the best compromise between CPU and memory consumption [20].

To run each application, we implemented a SikuliX (http://sikulix.com) test case that can be automatically executed to run multiple functionalities of the monitored applications. The test cases simulate rich usage scenarios. For instance, Adobe Reader DC is executed by opening a PDF file, moving inside the document up and down several times, changing the view to full-screen, inserting comments in the text, searching for a specific word in the document, highlighting text, and closing the document. Notepad++ is executed by writing a Java program, opening different files, copying and pasting text in a document, counting the occurrences of a given word, marking the occurrences of a given word, and closing all the opened tabs. Paint.NET is executed by loading an image, resizing it, drawing several shapes and shaded shapes, rotating the image, applying different filters to the image (black and white, sepia), and inverting the colors of the image.

Each test case produces a trace of user operations o_1, ..., o_n; we collect function calls and measure the overhead and its impact on the user experience for every operation o_i. Since our study targets interactive applications, we collect traces composed of user operations. It is possible to precisely distinguish the portion of the trace that corresponds to each operation by exploiting the knowledge of the name of the functions that implement the operations. This information is typically available if the organization that defines the monitoring strategy and the one that implements the application are the same. Otherwise, traces can still be split based on interactions with the GUI, but this implies a more sophisticated analysis of the collected traces.

To respond to RQ1, we only executed the test cases and the monitored applications, that is, no processes were running in addition to the basic operating system processes. To answer RQ2, we selectively saturated computational resources, occupying 60%, 75%, and 90% of both CPU and RAM. We sampled memory occupation linearly because we could not predict at which level observable consequences would arise. We considered saturation up to 90% for both resources, since higher values would not allow us to satisfy the minimum requirements of Pin. To saturate resources in a controlled way we used CPUStress 1.0.0.1 and HeavyLoad 3.4 (https://blogs.msdn.microsoft.com/vijaysk/2012/10/26/tools-to-simulate-cpu-memory-disk-load/). To mitigate any effect due to non-determinism, we repeated each test 5 times and report mean values.
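The buffering strategy mentioned above can be summarized with the following sketch. The actual probe is a Pin plug-in written in C++; this Python rendition only illustrates the buffer-then-flush idea, and the class name, record format, and flush policy are our own illustrative choices.

```python
import io

BUFFER_LIMIT = 50 * 1024 * 1024  # 50 MB, the threshold used by our probe


class CallLogger:
    """Buffers function-call records in memory and flushes them to the log
    file only when the buffer exceeds BUFFER_LIMIT, so that file I/O is
    paid rarely instead of on every intercepted call."""

    def __init__(self, path):
        self.path = path
        self.buffer = io.StringIO()
        self.size = 0

    def on_call(self, function_name, depth):
        # Record one (possibly nested) call; 'depth' tracks nesting level.
        record = f"{'  ' * depth}{function_name}\n"
        self.buffer.write(record)
        self.size += len(record)
        if self.size >= BUFFER_LIMIT:
            self.flush()

    def flush(self):
        # Write the accumulated records and reset the in-memory buffer.
        with open(self.path, "a") as log:
            log.write(self.buffer.getvalue())
        self.buffer = io.StringIO()
        self.size = 0
```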
The overall study implied collecting and processing more than 10,000 samples about operations and their duration, all available at http://github.com/ocornejo/fieldmonitoringfeasibility.

2.3. Measuring Overhead and Its Estimated Impact on the User Experience

Measuring the overhead is straightforward: we measure the difference in the duration of the same operations when executed with and without monitoring. It is more delicate to estimate the effect of the monitor on the user experience. In principle, assessing whether a given overhead may or may not annoy users requires direct user involvement. However, user studies are expensive and can hardly be designed to cope with a volume of samples like the one that we collected, which would require involving users in the evaluation of the duration of thousands of operations.

To estimate the impact of the overhead on users we thus exploited results already available from the human-computer interaction domain, and we strengthened the collected evidence with a human study focusing on a restricted number of cases. In particular, we used the well-known and widely accepted classification proposed by Seow [21] of the System Response Time (SRT, i.e., the time taken by an application to respond to a user request) that can be associated with each operation based on its nature. In this classification, operations are organized according to four categories, which have been derived from direct user engagement:

• Instantaneous: these are the most simple operations that can be performed on an application, such as entering inputs or navigating through menus. Users expect to receive a response within 100-200 ms.

• Immediate: these are operations that are expected to generate acknowledgments or very simple outputs. Users expect to receive a response within 0.5-1 s.

• Continuous: these are operations that are requested to produce results within a short time frame so as not to interrupt the dialog with the user. They are expected to produce a response in 2-5 s at most, depending on the complexity of the operation that is executed. We assume Simple Continuous operations to produce a response within 2-3.5 s and more Complex Continuous operations to produce a response within 3.5-5 s.

• Captive: these are operations requiring some relevant processing for which users will wait for results, but will also give up if a response is not produced within a certain time. These operations are expected to produce a response within 7-10 s.

We refer to the operations whose response time, once monitoring is in place, exceeds the expectation for their category as slow operations. Measuring the number of operations that become slow due to overhead provides an estimate of how often users are likely to be annoyed while using a monitored application. We attribute categories to the operations performed by the tests based on their execution time when no overhead is introduced in the system, considering the lower limit of the execution time of each category. For instance, operations that take at most 100 ms are classified as Instantaneous, while operations that take more than 100 ms but less than 0.5 s are classified as Immediate. If the execution time of an operation executed while the application is monitored exceeds the lower limit of its category, the operation is considered to be slow.

This strategy allows us to use the SRT classification as a continuous scale, using the lowest limit for both the categorization of the operations and the identification of the slow operations. We thus obtained a conservative measure of the slow operations, that is, the real number of slow operations reported by users is likely to be lower than the one reported with this metric.

We use the overhead and the number of slow operations as the main variables to answer our research questions.
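To make the two measures just described concrete: for an operation with baseline duration t_base and monitored duration t_mon, the overhead is (t_mon - t_base) / t_base x 100%, and the operation is flagged as slow when t_mon exceeds the lower limit of the category derived from t_base. The following Python sketch illustrates the procedure; the threshold values encode the lower limits discussed above, while the function names and boundary handling are our own illustrative choices.

```python
# Lower limits (in seconds) of the Seow SRT categories as used in this
# paper; an operation belongs to the first category whose lower limit
# its baseline (unmonitored) duration does not exceed.
LOWER_LIMITS = [
    ("Instantaneous", 0.1),
    ("Immediate", 0.5),
    ("Continuous Simple", 2.0),
    ("Continuous Complex", 3.5),
    ("Captive", 7.0),
]


def overhead(t_base, t_mon):
    """Relative overhead (%) of the monitored duration over the baseline."""
    return (t_mon - t_base) / t_base * 100.0


def category(t_base):
    """SRT category assigned from the unmonitored duration."""
    for name, lower in LOWER_LIMITS:
        if t_base <= lower:
            return name
    return "> Captive"


def is_slow(t_base, t_mon):
    """Conservative check: the operation is slow if its monitored duration
    exceeds the lower limit of the category derived from its baseline."""
    for name, lower in LOWER_LIMITS:
        if t_base <= lower:
            return t_mon > lower
    return False  # beyond Captive the scale defines no further limit
```

For example, category(0.3) returns Immediate, and is_slow(0.3, 0.45) is False because 0.45 s is still below the 0.5 s lower limit of the Immediate category, even though the overhead is 50%.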
3. RQ1 - How is the user experience affected by monitoring function calls?
This section reports the results obtained for each sub-research question RQ1a-RQ1d, and finally discusses the overall results obtained for RQ1. Since the monitored applications are desktop applications, we executed all the experiments on a machine running Windows 7 Pro with a 3.47 GHz Intel Xeon X5690 processor and 4 GB of RAM.

3.1. RQ1a - What is the overhead introduced by monitoring function calls?
Figure 1 shows the overhead that we observed for operations in each category and for each subject application. Note that not all types of operations occur in every application; for instance, Captive operations are present in Paint.NET only.

Figure 1: Overhead per category and application.

The overhead profile per category is quite consistent. In the case of Instantaneous operations the overhead is always close to 0. This is probably due to the nature of Instantaneous operations, which imply the execution of a limited amount of logic and thus produce a limited number of function calls. A similar result can be observed for Immediate operations, where the overhead is small for Adobe Reader DC and Notepad++. Paint.NET represents an exception because its overhead is higher. The overhead profile is again quite consistent across operations in the Continuous Simple and Complex categories, with the overhead ranging between 0% and 200%.

Although there are similarities for operations in the same category even when they belong to different applications, we can also observe exceptions. In fact, there are several outliers represented in the boxplots, with some of them showing very different overhead values compared to the rest of the samples. For example, two Continuous Simple operations in Notepad++ (selecting the Java highlighting and dismissing a save operation) exhibited a high overhead (the two outliers) compared to the other operations, which experienced 100% overhead at most.

Figure 2 shows the percentage of operations in each category affected by overhead levels within specific ranges. Collecting function calls produces an overhead in the interval 0-10% in the majority of the cases (65% of the executed operations). In 8% of the cases, operations are exposed to an overhead between 10% and 30%. In 12% of the cases monitoring produced an overhead in the interval 30-80%, and for less than 15% of the operations the overhead is higher.
Figure 2: Percentage of operations undergoing a specific overhead interval.

We can conclude that the observed behavior within operations of the same category is not significantly different, although specific operations may violate this pattern (Figure 1). Moreover, collecting function calls exposes operations to an overhead that is lower than 10% in the large majority of cases, and is seldom higher than 80% (Figure 2). Estimating if and how much this overhead can be intrusive with respect to the user activity is studied with the next research question.
3.2. RQ1b - What is the impact of monitoring function calls on the user experience?

Adobe Reader DC
Operation category    Total   Inst.   Imm.   Cont. Simple   Cont. Complex   Captive   > Captive   Slow Op. [%]
Instantaneous         55      50      -      -              -               -         -           -
Immediate             15      0       15     0              0               0         0           0
Continuous Simple     90      0       0      69             16              5         0           -
Continuous Complex    15      0       0      0              13              -         -           -
Captive               0       0       0      0              0               0         0           0

Notepad++
Instantaneous         45      40      0      -              -               -         -           -
Immediate             20      0       19     -              -               -         -           -
Continuous Simple     70      0       0      48             -               -         -           -
Continuous Complex    5       0       0      0              5               0         0           0
Captive               0       0       0      0              0               0         0           0

Paint.NET
Instantaneous         35      35      0      0              0               0         0           0
Immediate             25      0       0      -              -               -         -           -
Continuous Simple     55      0       0      45             -               -         -           -
Continuous Complex    40      0       0      0              29              -         -           -
Captive               60      0       0      0              0               57        -           -

VLC Media Player
Instantaneous         30      30      0      0              0               0         0           0
Immediate             0       0       0      0              0               0         0           0
Continuous Simple     125     0       0      99             -               -         -           -
Continuous Complex    30      0       0      0              24              -         -           -
Captive               0       0       0      0              0               0         0           0

Table 1: Slow operations per application.
Table 1 reports the analytical results obtained for the operations recorded as slow in the four subject applications. For each application the table shows the number of operations in each category that have been executed in the experiment and how the operations have been classified once affected by the overhead caused by function calls monitoring. The overhead is not recognizable by users if the category does not change under the monitoring overhead. A perfect result implies having all 0s outside the diagonal values. When an operation changes its category, the table shows what the new category of the operation is. The column > Captive shows the number of operations whose duration is longer than the maximum allowed for a Captive operation. The last column, Slow Op. [%], specifies the percentage of slow operations across all the executions.

Figure 3 visually illustrates how slow operations distribute across operation categories. The last column in each category shows the percentage of slow operations for that category across all subject applications.

Figure 3: Percentage of slow operations with respect to the SRT categories.

The empirical data suggest that Instantaneous operations seldom present a slowdown that affects the user experience: in fact, only 6% of the cases produced a recognizable slowdown. We obtained a similar result for Immediate operations with the exception of Paint.NET, where the slowdown has been significant for every Immediate operation that has been executed. This result is coherent with the exceptional overhead reported for Immediate operations in Paint.NET for RQ1a. It is likely caused by the nature of the Immediate operations in Paint.NET, which execute non-trivial logic (e.g., the operation that closes an image) and are more expensive to monitor.

When the portion of the application logic that is executed increases, the percentage of operations that become slow also increases, as observed for Continuous operations, which in some cases become even slower than Captive operations (see Table 1): for instance, the execution time of five Continuous Simple operations in Notepad++ exceeded the time expected for a Captive operation. The higher cost of monitoring Continuous operations is visible also in Figure 3, where more than 20% of the Continuous operations (both Simple and Complex) have been significantly slowed down on average, compared to Instantaneous and Immediate operations, where about 5% of the operations have been slowed down, if we do not consider those from Paint.NET (which is a special case).

Extremely long tasks, such as Captive operations, seem to tolerate well the overhead caused by function calls monitoring. However, since they are present in one application only, it is hard to distill a more general lesson learnt.
We can conclude that the operations that are likely to be perceived as slowed down are quite limited in number (< 20% overall) and mostly concentrated in the Continuous operations. Moreover, applications that implement small pieces of logic that must be executed quickly, as Paint.NET does, might be particularly hard to monitor; in fact, its Immediate operations have all been significantly slowed down when collecting function calls.

3.3. RQ1c - What is the tolerance of the operations to the introduced overhead?
Since we exposed operations in different categories to various overhead levels, this research question studies how often a certain overhead is the cause of operations exhibiting a response time that is too slow. Figure 4 shows the percentage of operations reported to be slow for overhead within a given range, for operations in all categories.

Figure 4: Percentage of slow operations for different overhead intervals.

In our previous study [20], we identified 30%, 80%, and 180% as interesting overhead values that may produce different reactions by users, so we used these ranges in this study to analyze the collected data. We obtained a similar result with this experiment: an overhead level between 30% and 80% is hard to tolerate for operations in any category with the exception of Instantaneous operations, while overhead values higher than 80% can be prohibitive.
We can conclude that overhead levels up to 30% are not harmful, but higher overhead levels must be introduced wisely, with the exception of Instantaneous operations, which seem to tolerate overhead slightly better than operations in the other categories.

3.4. RQ1d - Do failures change the overhead introduced by function calls monitoring?

This research question investigates if monitoring function calls may affect failing executions differently than regular executions. To compare the impact of monitoring when exactly the same operations terminate correctly or terminate with a failure, we injected faults into our subject applications. To this end, we configured Pin to modify the first instruction of a function if it is a MOV instruction with the AX register as destination. The change consists of multiplying the destination address by a constant value; a sketch of this injection condition is shown at the end of this subsection.

With this process, we obtained two applications failing abruptly (VLC Media Player and Paint.NET) and two applications presenting various misbehaviors (Adobe Reader DC and Notepad++). In the former case, the execution simply stopped prematurely, without producing any noticeable difference in terms of overhead. In the latter case, we obtained misbehaviors such as Adobe Reader DC failing to open files and Notepad++ failing to load graphical elements. We thus collected and analyzed the overhead values for Adobe Reader DC and Notepad++.

Figure 5 shows the percentage of operations in each category affected by overhead levels within specific ranges. The result is very similar to the one presented in Figure 2 for executions that terminate correctly. In particular, collecting function calls during a failure produced an overhead in the interval 0-10% in the majority of the cases for operations in any category (65% of the operations that have been executed). We also observe that 2.91% of the operations produced an overhead in the range 10-30%; 9.27% of the operations produced an overhead in the interval 30-80%; and for less than 20% of the operations the overhead was higher.

Figure 5: Percentage of operations undergoing a specific overhead interval.
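Returning to the injection condition described above, the following sketch checks it against the textual disassembly of a function's first instruction. The real injection is performed in-process through Pin; the textual-disassembly representation is an illustrative simplification, and accepting the wider EAX/RAX aliases of AX is an assumption of this sketch.

```python
def is_injection_target(first_instruction: str) -> bool:
    """True if the instruction is a MOV whose destination is the AX
    register (here also accepting its wider aliases, as an assumption)."""
    mnemonic, _, operands = first_instruction.strip().partition(" ")
    destination = operands.split(",")[0].strip()
    return mnemonic.lower() == "mov" and destination in {"ax", "eax", "rax"}


# The injected fault multiplies the destination address by a constant,
# corrupting the value the function stores through its first MOV.
FAULT_FACTOR = 8  # illustrative constant

assert is_injection_target("mov eax, dword ptr [rbp-8]")
assert not is_injection_target("push rbp")
```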
In summary, failures do not change the cost of function calls monitoring, according to our observations.

3.5. Discussion

Collecting function calls exposed the operations performed in the subject applications to various overhead levels: often below 10% (65.73% of the cases), and sometimes to higher levels (8.48%, 12.02%, and 13.77% of the cases in the ranges 10%-30%, 30%-80%, and above 80%, respectively).
4. RQ2 - What is the impact of monitoring function calls when the availability of computational resources is limited?
In this section we study the impact of the monitoring activity when the computational resources cannot be completely allocated to the monitored applications because they are also allocated to other tasks. We first discuss the impact of CPU availability and then the impact of memory availability. Similarly to RQ1, we study the impact of collecting function calls by analyzing the overhead and studying the number of operations changing category when CPU and RAM are under stress.
4.1. RQ2a - What is the impact of CPU availability on the intrusiveness of monitoring?

Figure 6 shows the system response time (presented in log scale) of the executed operations per operation category. We report timing information considering four CPU load levels: 0%, 60%, 75%, and 90%. The figure includes two types of boxplots: the orange boxplots correspond to the execution time observed when monitoring is in place, while the brown boxplots correspond to the execution time when no monitoring is in place.

Figure 6: Execution time for various CPU load levels per operation category.

The trend is quite similar for all classes of operations with the exception of Immediate operations, which show decreasing values of the overhead for higher CPU load values. We conducted a Kruskal-Wallis test to check if the overhead introduced for a given CPU load and a given class of operations differs from the overhead for the same class of operations exposed to a different CPU load (significance expected for p-value < 0.05). As shown in Table 2, the test reported no statistically significant difference for any operation category.

Treatment             Chi-square   p-value   df
Instantaneous         2.2107       0.5298    3
Immediate             1.1327       0.7692    3
Continuous Simple     3.3914       0.3351    3
Continuous Complex    4.4726       0.2147    3
Captive               1.54         0.6731    3

Table 2: Kruskal-Wallis test results per operation category.
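The statistical check itself is standard; the following sketch runs scipy's implementation of the Kruskal-Wallis test on overhead samples grouped by CPU load level. The data layout and values are hypothetical placeholders, not our actual measurements.

```python
from scipy.stats import kruskal

# Hypothetical overhead samples (%) for one operation category,
# grouped by CPU load level; real data comes from the experiment logs.
overhead_by_load = {
    0:  [3.1, 5.2, 4.8, 2.7],
    60: [4.0, 6.1, 3.9, 5.5],
    75: [5.2, 4.4, 6.0, 3.8],
    90: [4.9, 5.8, 4.1, 6.3],
}

# H0: the overhead distributions are the same across CPU load levels.
stat, p_value = kruskal(*overhead_by_load.values())
print(f"Chi-square = {stat:.4f}, p-value = {p_value:.4f}")
# With p-value >= 0.05 we cannot reject H0, i.e., there is no significant
# difference between load levels (as in Table 2).
```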
We also considered how monitoring affects the number of slow operations per application, shown in Figure 7, and the number of slow operations per operation category, shown in Figure 8.

Figure 7: Percentage of slow operations for various CPU load levels per application.

Figure 8: Percentage of slow operations for various CPU load levels per operation category.

The usage of a loaded CPU already generates a number of slow operations for each application. Adding function calls monitoring further increases the number of operations that are slowed down. We can however notice that the mere addition of monitoring makes the user experience worse by a similar degree across CPU load levels, confirming that the CPU load level is not a significant factor when considering the impact of monitoring. To confirm this intuition we computed the linear regression of the number of slow operations for the instrumented and non-instrumented versions of each application, and considered the difference between the angular coefficients (slopes) of the computed lines. We further considered the percentage of operations with a different classification when the CPU saturates to 100% (the highest saturation possible) based on the computed trends; a sketch of this analysis follows Table 3. Table 3 reports the results. For each application we indicate the difference between the angular coefficients (on the left) and the percentage of operations with a different categorization (on the right).

We can notice that the difference in the increase of the number of slow operations is between 2.66% and 14.33% of the operations, indicating a similar trend (i.e., slope) for the two cases (with and without monitoring). The small positive values of the difference between the coefficients indicate that, when a difference is observed (e.g., for 14.33% of the operations in Paint.NET), the saturation of the CPU increases the number of slow operations by a lower degree when monitoring is active.

The plot of the data per operation category, Figure 8, reveals that Instantaneous operations behave better than the other operations in terms of their ability to tolerate monitoring; in fact, the number of slow operations remains substantially stable across CPU load levels.
       Adobe Reader DC   Notepad++   Paint.NET   VLC Media Player
CPU    -                 -           -           -

Table 3: Trend analysis for CPU.
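The trend analysis described in the text can be reproduced as in the following sketch, assuming the data are (CPU load, percentage of slow operations) pairs for the monitored and unmonitored configurations; the slow-operation percentages here are hypothetical. numpy's polyfit returns the angular coefficient (slope) and intercept of the fitted line, and extrapolating both lines to 100% load gives the divergence reported on the right in Table 3.

```python
import numpy as np

# CPU load levels (%) and hypothetical percentages of slow operations for
# one application, with and without monitoring (illustrative values only).
loads = np.array([0.0, 60.0, 75.0, 90.0])
slow_monitored = np.array([12.0, 25.0, 33.0, 45.0])
slow_baseline = np.array([2.0, 14.0, 23.0, 36.0])

# polyfit with degree 1 returns (slope, intercept) of the fitted line.
slope_mon, intercept_mon = np.polyfit(loads, slow_monitored, 1)
slope_base, intercept_base = np.polyfit(loads, slow_baseline, 1)

# Left value in Table 3: difference between the angular coefficients.
slope_diff = slope_mon - slope_base

# Right value in Table 3: divergence of the two trends extrapolated to
# full CPU saturation (100% load).
diff_at_full_load = (slope_mon * 100.0 + intercept_mon) \
    - (slope_base * 100.0 + intercept_base)

print(f"slope difference = {slope_diff:.4f}")
print(f"extra slow operations at 100% CPU = {diff_at_full_load:.2f}%")
```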
We can conclude that the CPU load level does not significantly affect the intrusiveness of function calls monitoring. In fact, the impact of the addition of monitoring tends to be the same regardless of CPU availability, and when a difference is observed, monitoring turns out to be slightly less intrusive at higher CPU saturation.

4.2. RQ2b - What is the impact of memory availability on the intrusiveness of monitoring?
To better discuss the results for RQ2b, we report the memory usage of each application as summarized in Table 4: the maximum memory consumption observed during the execution of our tests is 356 MB of RAM for Adobe Reader DC, 278 MB for Notepad++, 520 MB for Paint.NET, and 303 MB for VLC Media Player. Figure 9 shows the overhead introduced in the system response time (presented in log scale) per operation category when varying the amount of occupied memory up to 90%. The orange boxplots correspond to the execution time observed when function calls are collected, while the brown boxplots correspond to the execution time when no monitoring is in place.
Figure 9: Execution time for different RAM availability per operation category.
          Adobe Reader DC   Notepad++   Paint.NET   VLC Media Player
Max RAM   356 MB            278 MB      520 MB      303 MB

Table 4: Maximum memory used during experimentation.

Treatment             Chi-square   p-value   df
Instantaneous         3.5831       0.3101    3
Immediate             3.5298       0.3169    3
Continuous Simple     2.1604       0.5398    3
Continuous Complex    1.1438       0.7665    3
Captive               0.1913       0.979     3

Table 5: Kruskal-Wallis test results per operation category.
Similarly to Section 4.1, we checked for statistical differences between groups using a Kruskal-Wallis test (see Table 5), obtaining no significant difference between different levels of RAM load. In particular, the results show a clearly negligible effect of the memory on the overhead; indeed, the overhead is similar for different values of memory occupation.

We also investigated how memory occupation impacts the operations that become slow. Figure 10 shows the number of slow operations per application, while Figure 11 shows the number of slow operations per operation category.

Figure 10: Percentage of slow operations for various RAM load levels per application.

Figure 11: Percentage of slow operations for various RAM load levels per operation category.

The behavior of the applications does not reveal any trend. To confirm this intuition we computed the linear regression of the number of slow operations for the instrumented and non-instrumented versions of each application and considered the difference between the angular coefficients of the computed lines, as done for RQ2a. We further considered the percentage of operations with a different classification when the memory saturates to 100% (the highest saturation possible) based on the computed trends. Table 6 reports the results. For each application we indicate the difference between the angular coefficients (on the left) and the percentage of operations with a different categorization (on the right). We can notice a negligible difference in the coefficients and in the number of slow operations, suggesting a similar trend for the two cases.

The results per operation category confirm the same behavior we observed for CPU utilization: Instantaneous operations better tolerate low availability of the computational resources compared to operations in the other categories. Anyway, memory occupation does not produce relevant effects when analyzing the results per operation category, either.
       Adobe Reader DC   Notepad++   Paint.NET   VLC Media Player
RAM    -                 -           -           -

Table 6: Memory trend analysis.
We can conclude that the memory load level does not affect the intrusiveness of function calls monitoring to a significant degree. In fact, the monitoring overhead tends to be the same regardless of memory availability.

4.3. Discussion
The analysis of the impact of function calls monitoring on the user experience when the availability of the computational resources is limited revealed little influence of the computational resources. As a consequence, the monitoring logic can be activated and deactivated with limited attention to computational resources. Only in the case of CPU saturation higher than 90% should monitoring be avoided, since this could turn the application unresponsive. Finally, results revealed that Instantaneous operations are less sensitive to memory availability compared to other kinds of operations.
5. RQ3 - How do expert computer users react to the overhead produced by function calls monitoring, compared to the results obtained with RQ1 and RQ2?
This research question investigates the coherence between the classification of the operations resulting from the application of the criteria proposed by Seow and the feedback provided by actual users from our CS department, on the applications and operations considered in our study. To this end, we asked a number of users to assess operations of different categories while exposed to a range of overhead values, and we compared the results to the ones obtained with the classification criteria by Seow. In the following, we present the design of the empirical study, the results, and their critical discussion.

Overhead range   Instantaneous   Immediate   Cont. Simple   Cont. Complex   Captive
0-30%            -               -           -              -               -

Table 7: Number of operations for the different combinations of overhead ranges and categories.
We study how actual users perceive the system response time by considering operations exposed to overhead values in the ranges 0-30%, 30-80%, and above 80% (Table 7). To evaluate the consistency between the assessment based on the classification by Seow and the responses provided by the human subjects, we reclassified each operation assessed by users according to the classification by Seow and measured their coherence (a sketch of this measure follows the list below).

Figure 12 shows the results. Each bar is an operation category, the label at the top of the bar shows its classification as running slow or running as expected according to our definition of slow operation, and the percentage shows the number of participants who responded coherently with the label at the top.

Figure 12: Consistency in the evaluation of SRT between our approach and user perception.

Based on these results, we conclude that:

• Our classification strategy and the participants agree on the operations that should not be considered slow. In fact, all the operations labeled as expected are classified in the same way by a percentage of the participants ranging from 75% to 100%.

• Our classification and the participants tend to agree on the continuous operations that should be considered slow. In fact, there is agreement in considering the Continuous Simple operations exposed to more than 80% overhead and the Continuous Complex operations exposed to more than 30% overhead as slow operations.

• The participants tolerate overhead better than revealed by our classification for quick operations. In fact, there is disagreement on the Instantaneous and Immediate operations exposed to overhead higher than 80%.
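The coherence measure can be computed as in the following sketch: for each (category, overhead range) cell we compare the participants' judgments with the Seow-based label and report the fraction that matches. The data shown are hypothetical placeholders, not our participants' answers.

```python
# Hypothetical participant judgments for some (category, overhead range)
# cells; each entry is either "slow" or "as expected".
judgments = {
    ("Continuous Complex", "30-80%"): ["slow", "slow", "as expected", "slow"],
    ("Instantaneous", ">80%"): ["as expected", "as expected", "slow"],
}

# Label derived from the Seow-based classification for the same cells.
seow_labels = {
    ("Continuous Complex", "30-80%"): "slow",
    ("Instantaneous", ">80%"): "slow",
}

for cell, answers in judgments.items():
    coherent = sum(answer == seow_labels[cell] for answer in answers)
    print(f"{cell}: {coherent / len(answers):.0%} coherent with our label")
```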
We can conclude that identifying slow operations based on the Seow classification is conservative: the operations that we identify as running as expected are fine also for the actual users; on the other hand, there might be operations that we consider too slow but that are instead running as expected for the users. In practice, developers following the recommendations resulting from our work incur a negligible risk of introducing noticeable overhead in their applications.

6. Threats to Validity
The main threat to Internal Validity of our empirical investigation is the usage of the system response time categories defined by Seow [21] to identify the operations that have been slowed down up to a level that can be recognized by the users. Involving a significant number of human subjects in the evaluation and asking them to evaluate every individual operation that is executed, in different applications and contexts, is however nearly infeasible. This is why we decided to rely on a well-known categorization of user operations that lets us work with a significant number of samples. To mitigate this issue, we investigated the coherence between our evaluation and the assessment performed by actual users for a subset of the operations and discovered that our findings provide a slightly conservative picture of how users perceive overhead.

Another potential threat is the representativeness of the participants involved in the empirical experiment for RQ3. However, the use of popular applications also known to non-computer experts mitigates the potential bias introduced by the selected subjects.

Another potential threat is the choice of the individual operations that have been used in the study. Although we could have designed the test cases in many different ways, we mitigated the issue of choosing the operations by focusing on the most relevant functionalities of each application, including a large number of operations from most of the categories.

The main threat to External Validity of our empirical evaluation is the generalizability of the results. The study focuses on function calls monitoring for regular desktop applications and, although some results might have a broader applicability, they should be interpreted mainly in that context. Considering other contexts requires the replication of this study.

Another potential threat is related to the sample size of the subjects involved in the empirical experiment for RQ3. To further validate the achieved results, it is necessary to extend the experiment so as to include many more subjects, potentially with different backgrounds.
7. Findings
In this section we summarize the main findings resulting from the empirical experience reported in this paper:

• Overhead up to 30% is likely to be well tolerated by users. Our results show that an overhead up to 30% is seldom the cause of operations recognized as slow. This suggests that enriching applications running in the field with processes that collect data and analyze executions is feasible.

• Collecting sequences of function calls from the field is feasible in most of the cases. Our results show that the actual overhead produced by function calls monitoring is below 30% in the vast majority of the cases. Furthermore, less than 20% of the executed operations are likely to be perceived as slowed down. Although the cases where the impact of monitoring is heavier must be carefully handled, results suggest that extensively collecting data about sequences of function calls from the field is possible.

• Specific operations require special handling of monitoring features. In our experiment we observed operations that presented an exceptionally high overhead. This happened both across categories and specifically in one application, that is, for the Immediate operations present in Paint.NET. This result suggests that applications must be carefully analyzed before being instrumented, so that the overhead introduced by the probes can be properly controlled and these special cases detected.

• Computational resources have little influence on the impact of monitoring. Results show that CPU and RAM availability do not have a significant impact on the relative cost of collecting function calls. It is however true that the cumulative effect of CPU load and monitoring may introduce a prohibitive overhead, reaching a peak of more than 50% of the operations perceived as slowed down, in contrast with the normal impact of monitoring, which affected less than 20% of the operations in the worst case.

• Instantaneous operations are more resilient to overhead. Instantaneous operations demonstrated to tolerate overhead well, also when the CPU is extremely busy. This is probably due to the intrinsic nature of these operations, which can be executed fast almost without interruption, even if the CPU is busy. Moreover, human subjects demonstrated to tolerate particularly well the overhead introduced in Instantaneous operations.

These findings can be exploited by organizations that use experimentation techniques to improve their products and processes. Indeed, they can refine data collection strategies to be less intrusive while collecting significant amounts of data about the behavior of the software. In particular, our findings may impact:

• Requirements engineers, who may exploit field monitoring solutions to profile users and evolve requirements following the usage scenarios discovered in the field. However, in order to be used without impacting the user experience, the overhead introduced by profiling techniques should not exceed 30%.

• Software developers, who may exploit highly-optimized monitoring procedures to collect software behavioral data, discovering how software is actually used in the field and improving their products accordingly. Since computational resources have little influence on the impact of monitoring, environmental conditions should not be a problem for developers adopting monitoring.
• Testers, who may exploit fine-grained monitoring to collect accurate information about the behavior of a deployed product, with a specific focus on failures, to reveal and fix faults earlier. We demonstrated that function calls monitoring is feasible in most of the cases. If applied carefully, it could be of great impact for the software testing community, considering that function calls monitoring has been widely used for several tasks, including debugging and failure reproduction [15, 12].

• DevOps architects, who engineer continuous monitoring solutions that can cost-effectively support continuous deployment, and the development process more in general, as long as they monitor within the boundaries we identified in this study, such as keeping the overhead under 30% and prioritizing Instantaneous operations because of their resilience to monitoring overhead.
8. Related Work
In this section we relate our work to software experimentation, to studies about the impact of SRT delays on the quality of the user experience, and to studies on field monitoring and analysis.

Regarding software experimentation, there are several approaches and studies that exploited software systems to collect actual evidence, especially using field data. A systematic way to collect field data is to perform randomized controlled experiments (e.g., A/B tests), for instance to study how a new feature or a change may impact the user experience. Continuous deployment is a well-known practice that benefits directly from controlled experiments [28]; for instance, companies such as Microsoft reported running more than 200 experiments concurrently every day within their products [1]. Relevantly, Kevic et al. [2] presented an empirical characterization of an experimentation process applied to the Bing web search engine, and Fagerholm et al. [3] provided a model that enables continuous customer experiments aimed at software quality improvement. Our work relates to these studies because it provides evidence that can help engineers design better data collection solutions that do not affect the user experience.

Regarding the perception of the SRT, there are several studies in the context of the research in Human-Computer Interaction (HCI) and controlled experiments. The importance of controlled experiments has been extensively discussed and demonstrated. For instance, Fabijan et al. [28] reported the benefits of online controlled experiments for software development processes, studying how customer and product data could support decisions throughout the product lifecycle.

Killeen et al. [29] demonstrated that users are unlikely to recognize time variations smaller than 20% of the original value. Our results are aligned with this study, since delays up to 30% do not generate slowdowns recognizable by users.

Ceaparu et al. [30] studied how interactions with a personal computer may cause frustration. In their experiment, users were asked to describe in written form the sources of frustration in human-computer interactions. The participants declared that applications not responding in an appropriate amount of time and Web pages taking a long time to load are the main sources of frustration in human-computer interactions, thus confirming the relevance of our investigation.

Other studies stressed user tolerance in specific settings. For instance, Nah et al. [31] analyzed how long users are willing to wait for a Web page to be downloaded. The results of the experiment showed that users start noticing slowdowns after two-second delays and do not tolerate slowdowns longer than 15 s. A threshold of 15 s has been reported as the maximum that can be tolerated before perceiving an interruption in a conversation with an application also in other studies [32, 33]. An experiment conducted by Hoxmeier and DiCesare [34] studied how fixed slowdowns (3, 6, 9, and 12 s) added to interactions may affect user appreciation and perception. Results show that a limit for user tolerance is 12 s and that a linear relationship exists between SRT and user satisfaction.

Kohavi et al. [35] show that slowdowns in Web applications may affect the user experience, causing loss of money for companies. For example, Amazon reported a loss of 1% in sales because of a 100 ms slowdown, and Microsoft, similarly, reported a loss of 1% in user queries when adding a slowdown of one second to their search site. Differently from these studies, our investigation does not aim to identify the maximum overhead that can be tolerated nor the cost to companies, but rather to identify delays and overhead levels that cannot even be recognized by users.

In the scope of monitoring techniques, there are techniques that implement mechanisms to limit the overhead introduced in the monitored system [36]. For instance, distributed monitoring can be used to divide the monitoring workload among several instances of the same application in order to lower the overhead introduced by monitoring activities [27, 26]. Briola et al. [37, 38] and Ancona et al. [39] exploited a similar intuition to cost-efficiently monitor multi-agent systems. Alternatively, information can be collected at run-time only with a given probability or according to a strategy [40]. This strategy has been exploited in the context of debugging [7, 8], program verification [41], and profiling [42]. Finally, monitoring can be optimized by carefully balancing in-memory buffering and saving operations [43].
The results reported in this paper can be exploited by these techniques and by practitioners [44] to further optimize their monitoring strategies, collecting more data without affecting users. Depending on the kind of collected data, monitoring solutions may introduce overhead levels up to 10,000% [12], as is the case for function calls monitoring. Also, slow software is one of the main reasons why users stop using applications, as reported in [30, 34, 33]. Delaying some functionalities too much may cause the loss of users and consequently the failure of the project. The results obtained with our study may help practitioners design context-aware techniques that achieve a better compromise between collecting data and impacting the user experience, as proposed in [41]. Our findings provide initial insights in this direction.
9. Conclusions
Collecting data from the field is extremely useful to discover how applications are actually used and to support software engineering tasks. For instance, several monitoring techniques collect sequences of function calls to reproduce failures [12], detect malicious behaviors [14], debug applications [15], profile software [9], optimize applications [16], and mine models [17, 18]. If retrieving this data is useful, knowing the impact of the monitoring activity on the user experience is also extremely important. In fact, monitoring techniques can be feasibly applied only if they work seamlessly.

This paper presented a study about the impact of function calls monitoring considering both the monitoring overhead and the operations that may be perceived as slowed down by the users. Results show that an overhead up to 30% can likely be introduced in the operations without annoying users and that function calls monitoring often produces an overhead below this limit. We found, however, operations that are slowed down significantly and that require special care when monitored. These findings suggest that monitoring capabilities cannot be introduced blindly, but must be customized to the characteristics of the monitored program. Results also suggest that computational resources (RAM and CPU) have little influence on the impact of monitoring.

Future work consists of exploiting the results obtained with this study to design monitoring techniques that can collect function calls from the field without being recognized by users. We will further consider the categorization of the monitoring metrics we introduced in this work into useful usability groups that can be reused in large-scale experiments, as recommended in the study by Rodden et al. [45]. Also, we will consider applications with longer response times, such as scientific experiments and job processing applications, studying the feasibility of monitoring in these fields.
Acknowledgment
This work has been partially supported by the H2020 Learn project, funded under the ERC Consolidator Grant 2014 program (ERC Grant Agreement n. 646867), and the GAUSS national research project, funded by the MIUR under the PRIN 2015 program (Contract No. 2015KWREMX).
References

[1] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, N. Pohlmann, Online controlled experiments at large scale, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.

[2] K. Kevic, B. Murphy, L. Williams, J. Beckmann, Characterizing experimentation in continuous deployment: a case study on Bing, in: Proceedings of the International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), 2017.

[3] F. Fagerholm, A. S. Guinea, H. Mäenpää, J. Münch, The right model for continuous experimentation, Journal of Systems and Software 123 (2017) 292-305.

[4] Eclipse Community, Eclipse (visited in 2019).

[5] Microsoft, Windows 10,