Integrative Object and Pose to Task Detection for an Augmented-Reality-based Human Assistance System using Neural Networks
Linh Kästner, Leon Eversberg, Marina Mursa and Jens Lambrecht

Linh Kästner, Leon Eversberg, Marina Mursa and Jens Lambrecht are with the Chair Industry Grade Networks and Clouds, Faculty of Electrical Engineering and Computer Science, Berlin Institute of Technology, Berlin, Germany. [email protected]

Abstract — As a result of an increasingly automatized and digitized industry, processes are becoming more complex. Augmented Reality has shown considerable potential in assisting workers with complex tasks by enhancing user understanding and experience with spatial information. However, the acceptance and integration of AR into industrial processes is still limited due to the lack of established methods and tedious integration efforts. Meanwhile, deep neural networks have achieved remarkable results in computer vision tasks and bear great prospects to enrich Augmented Reality applications. In this paper, we propose an Augmented-Reality-based human assistance system to assist workers in complex manual tasks where we incorporate deep neural networks for computer vision tasks. More specifically, we combine Augmented Reality with object and action detectors to make workflows more intuitive and flexible. To evaluate our system in terms of user acceptance and efficiency, we conducted several user studies. We found a significant reduction in time to task completion for untrained workers and a decrease in error rate. Furthermore, we investigated the users' learning curve with our assistance system.
I. INTRODUCTION
Due to the rise of automation and digitization in all areas of industry, the complexity of manufacturing processes has increased in recent years, and tools to handle the rapidly growing amounts of information are needed [1]. Despite the upcoming changes in the industrial workforce, it is unlikely that human labor will become obsolete due to complete automation. Manufacturers look for ways to use technological advancement to assist their employees cognitively and physically rather than replacing them entirely [2]. These assistance systems can help with the global trend of an increasing demand for high-qualification jobs and a decreasing demand for low-qualification jobs [3]. In this context, Augmented-Reality (AR)-based assistance systems have shown promising results such as reducing task execution time, improving product quality or increasing workers' ability to learn new tasks [4]. While AR assistance systems have great potential for improving the collaboration between humans and machines, the implementation of these systems faces technological, environmental and organizational challenges [5] and is still an ongoing field of research. Furthermore, state-of-the-art AR assistance systems often require an additional commissioning step, are restricted to specific areas and do not take advantage of the recent success of computer vision approaches like neural networks [6], [7], [8].
Fig. 1: Proposed pipeline of the industrial AR assistance system

The majority of AR systems still rely on fiducial markers [4], [9] or manual calibration [6], [7], [8]. Additionally, while most AR systems have integrated some form of object detection, the addition of an action detection is not yet established. Moreover, only little attention has been paid to static screens as AR visualization hardware [10] despite their simplicity and potential to employ more processing power compared to handheld devices or head-mounted devices (HMDs).

On this account, we present an AR-based industrial assistance system, which incorporates deep learning (DL) architectures to ensure a more flexible and intuitive user experience. We combine an object detection model with an action detection model to guide untrained users in manual tasks with the objective of promoting knowledge transfer (s. Fig. 1). We tested our assistance system with 30 participants in an exemplary assembly scenario consisting of 9 steps and evaluated the results in terms of task completion time, error rate, user acceptance and learning promotion. The main contributions of this paper are the following:
• A proposal of an industrial AR assistance system for complex manual tasks using deep neural networks relying on RGB camera input only
• The combination of object detection and action detection enabling the system to perform interaction analysis
• A quantitative and qualitative evaluation of our system in terms of time to task completion, error rate and a user acceptance questionnaire
• A novel study on the effect of task repetition on the impact of the AR assistance system

The paper is structured as follows: Sec. II gives an overview of related work. Sec. III presents the methodology. Sec. IV presents the results and evaluation. Finally, Sec. V gives a conclusion of our work.

II. RELATED WORK

Over the last decade there has been a growing interest in industrial AR. The term Augmented Reality was first introduced by Caudell and Mizell [11] in 1992, where they used HMDs in aircraft manufacturing to reduce manufacturing cost and improve workers' efficiency. The largest focus of research has been on the application of manual assembly tasks [4]. Loch et al. [12] created an AR assistance system for manual assembly tasks of Lego bricks. They provide assistance on a screen by overlaying animations over the video feed of a camera. They compared their AR system to video instructions and found an improvement in time to completion and number of errors in participants when using the AR system for the first time. Uva et al. [7] designed a Spatial Augmented Reality (SAR) prototype intended to be used for working stations in the context of smart factories. They used a workbench consisting of a controller, a projector and camera mounted on a frame and a rotating table with fiducial markers. The instructions were projected directly onto the objects to be maintained. Their research shows that SAR instructions have significant benefits compared to paper-based instructions in tasks with high difficulty. The authors note that they did not consider the participants' experience or the learning effect after multiple task repetitions. In contrast to [7], Funk et al. [13] did a long-term analysis of their SAR assistance system. The researchers equipped an assembly line of a car manufacturing company with projectors and depth cameras and tested their prototype over 11 work days with expert and untrained workers.
The depth images were used to detect the picking of boxes in a specific location, and direct feedback was given by a top-mounted projector. Subsequently, the task completion time, error rate and NASA-TLX scores were compared, and no improvements were found when comparing in-situ instructions with no instructions. Instead, it was observed that the assistance system was useful for untrained workers during the learning phase, and the researchers suggest that there is a tipping point after which the assistance system loses its benefits, thus suggesting that AR assistance systems are best for knowledge transfer and for helping untrained workers to learn new workflows. Similar to our work, Chu et al. [8] recently published the full framework of an AR assistance system to assemble Dougong structures. Their AR smartphone app uses object recognition to identify relevant parts of the current assembly step and then helps with the assembly process by displaying AR animations. Object recognition and tracking is handled by the AR library Vuforia, and the display of assembly instructions is implemented with Unity.

While most studies of AR assistance systems are done in a laboratory environment [10], Lorenz et al. [14] analyzed requirements of three different production environments in the real world and conclude that rugged tablet computers are the best hardware choice because of harsh production environments and the worn safety gear of the workers. The researchers point out that HMDs, which are the most researched AR hardware [10], are neither robust nor can they be worn with protective gear. Additionally, Wolfartsberger et al. [15] also noted that the current generation of HMDs still has many disadvantages and recommend using a fixed tablet screen or in-situ projection.

In comparison to the state of the art on AR assistance systems, our work is based on deep neural networks for object detection and human pose detection. The addition of our action detection system enables the AR system to analyze the workers' actions and incorporates the trending field of human-object interaction into assistance systems. In contrast to other works [12], [7], [13], we describe the full framework of our system architecture. Because our machine-learning-based system only relies on RGB video input and a monitor for display, our assistance system is very flexible and complies with the compiled requirements of [14] for harsh industrial environments.

III. METHODOLOGY
Our objective is to provide an intuitive system that aims to increase the efficiency of the user by using the following integrated features: detailed process step descriptions, visual hints using AR visualizations, and video hints by using video tutorials. All these features decrease the time spent looking for additional sources of information, which potentially interrupts the worker's focus and thus decreases their efficiency. Another objective is the prevention of errors or the timely detection of errors such that the user can instantly react. The assistance system aims to reduce the error rate through the following implemented features: action error detection, object identification, step validation and error prevention, realized through detailed instructions, hints and guiding AR elements.
A. System Architecture
Fig. 2: System design of our proposed assistance system

The overall architecture of our proposed assistance system can be observed in Fig. 2. It offers a high-level view of the system and highlights all entities. As illustrated in the system design, the main business logic is divided between a Step Detection and a Scenario Handler. The Step Detection consumes the camera input in the form of frames. The Scenario Handler provides the step requirements, such as the necessary objects and the actions the worker has to perform at the respective steps. This information is sent to the object and action detection models, which run in parallel in order to increase the speed of image processing and therefore assure faster responsiveness.
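The following sketch illustrates, in Python, how such step requirements could be represented for the Scenario Handler. The field names and the helper function are illustrative assumptions rather than the exact schema of our implementation; the example entries are taken from the task instructions in Table I.

SCENARIO = [
    {
        "step": 2,
        "description": "Pick up the drill and make a hole in the board.",
        "required_objects": ["drill", "board"],   # objects the detector must confirm
        "required_action": "drilling",            # action the classifier must observe
    },
    {
        "step": 4,
        "description": "Find the green box and pick a screw.",
        "required_objects": ["green box", "screw"],
        "required_action": "picking",
    },
]

def requirements_for(step_number):
    # Return the requirements the detectors have to confirm for the given step.
    return next(s for s in SCENARIO if s["step"] == step_number)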
We used a Faster R-CNN architecture [16] for object detection with an Inception v2 model [17], while OpenPose [18] together with a neural-network-based frame classifier is utilized for the action detection. After the models process the frame, the results are sent as a tuple to a queue from which the Camera View Controller pulls and displays them to the user. The Scenario Controller is responsible for evaluating these results in order to display possible errors or wrong actions. Furthermore, it is also responsible for displaying AR components to assist the user in fixing these errors and giving supplementary hints. Additionally, in order to improve the decision making accuracy and avoid flickering of data, a smoothing module is employed which runs over the last 10 frames and takes the result with the highest occurrence to be considered truth. In case the step was successfully validated, the Scenario Controller sends a validation event back to the Scenario Handler, which moves to the next step. Once a new step is initiated, the Scenario Controller is notified by the event and displays the description of the new step. In case a step was not validated, an error event is published which notifies the Scenario Handler to display an error/warning message through the Error View.
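A minimal sketch of this smoothing step, assuming the per-frame detector outputs arrive as (object, action) tuples; the class and variable names are chosen for illustration only:

from collections import Counter, deque

class ResultSmoother:
    # Majority vote over a sliding window of the last 10 frame results.
    def __init__(self, window_size=10):
        self.window = deque(maxlen=window_size)

    def update(self, frame_result):
        # Add the latest result and return the value occurring most often
        # within the window, which is treated as the current truth.
        self.window.append(frame_result)
        most_common, _count = Counter(self.window).most_common(1)[0]
        return most_common

smoother = ResultSmoother()
smoothed = smoother.update(("drill", "picking"))  # one detector output per frame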
B. Action Detection
In order to recognize the worker's actions and thereby detect the correctness of the user's actions, we use the skeleton-tracking-based pose detection of OpenPose over a series of frames together with a neural network classifier in our assistance system. The implementation is based on the open source project Realtime-Action-Recognition by Chen et al. [19]. First, we calculate the joint velocities from a series of N=5 frames using the formulas described in equations (1) to (4). Afterwards, the actions are derived by a neural network which outputs the classified action. To retrieve the joints of each frame, the OpenPose algorithm is trained to detect the human skeleton from an image. The input to OpenPose is an RGB image acquired from the camera and the output is the skeleton of the human. The raw skeleton data output of OpenPose is pre-processed with the following three steps proposed by [19] and applied to our use case:

1) Coordinate scaling - As OpenPose uses a different unit for the x coordinate and the y coordinate, the output joint positions must be scaled.
2) Removal of unnecessary joints - Mainly, the movements of arms and hands are relevant for our use case, thus the joints of the head are removed to reduce the vector size.
3) Padding of missing joints - To enhance robustness against inconsistent detections, we incorporate a padding with detections of the previous frames in case joints are missing.

The joint positions are retrieved, and the joint positions from N=5 consecutive frames are considered to calculate the joint velocities and derive the user action. First, the height of the body $H$ is calculated. To this end, the average coordinates $(\bar{x}, \bar{y})$ between the two hips, represented by the coordinate pairs left hip $(x_l, y_l)$ and right hip $(x_r, y_r)$, are computed, and $H$ is taken as the distance between the neck joint $(x_{neck}, y_{neck})$ and this hip center:

$(\bar{x}, \bar{y}) = \left( \frac{x_l + x_r}{2}, \frac{y_l + y_r}{2} \right)$  (1)

$H = \sqrt{(x_{neck} - \bar{x})^2 + (y_{neck} - \bar{y})^2}$  (2)

Subsequently, the normalized joint positions $X_n$ can be calculated using the initial joint coordinates $X$. Finally, the joint velocity $V_i$ for all joints can be calculated, where $i$ indexes the list of all joints and $t_k$ is the frame number from the series of the 5 frames:

$X_n = \frac{X}{H}$  (3)

$V_i = X_n[t_k] - X_n[t_{k-1}]$  (4)

These features are concatenated and given as input to a 3-layer fully connected neural network to classify the user's action.
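The feature computation of equations (1) to (4) can be sketched as follows, assuming each skeleton is an array of (x, y) joint positions with COCO-style OpenPose indices for the neck and hips; the index values and function names are assumptions for illustration, not our exact implementation:

import numpy as np

NECK, RIGHT_HIP, LEFT_HIP = 1, 8, 11  # assumed COCO-style OpenPose joint indices

def normalize(skeleton):
    # Scale all joint positions by the body height H, see Eq. (1)-(3).
    hip_center = (skeleton[LEFT_HIP] + skeleton[RIGHT_HIP]) / 2.0  # Eq. (1)
    height = np.linalg.norm(skeleton[NECK] - hip_center)           # Eq. (2)
    return skeleton / height                                       # Eq. (3)

def joint_velocities(frames):
    # Differences between consecutive normalized frames, see Eq. (4).
    normalized = np.stack([normalize(f) for f in frames])
    return normalized[1:] - normalized[:-1]

frames = [np.random.rand(18, 2) for _ in range(5)]   # N=5 skeletons with 18 joints each
features = joint_velocities(frames).flatten()        # input vector for the classifier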
C. Graphical User Interface

Fig. 3: Graphical user interface with AR components (top); text and video for additional information as well as warnings or error messages (bottom)

Fig. 3 illustrates the most important features of our Graphical User Interface (GUI), which is implemented as a cross-platform application using Kivy. It includes AR components such as virtual hints of relevant areas within the working place and animations that describe the task visually. Text fields are displayed to inform the user about tasks or errors. Additionally, videos can be displayed, which provide additional information. The progress bar at the bottom displays the step a worker is currently in, and the watch counter at the top left shows the time passed in order to provide a reference to the user. To display the AR components and animations, we used an anchor-based calibration, which means that we utilized the position of a detected object as an anchor to display the AR features with respect to that object. This way, we ensure the flexibility of our system by not relying on manually aligned positions. We employ a database where the specific display positions of AR components with respect to the objects are listed accordingly.
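As a minimal sketch of this anchor-based calibration, the centre of a detected object's bounding box can serve as the anchor, and the AR element is drawn at an offset looked up per object; the offset table and function name below are illustrative assumptions rather than our exact database schema:

AR_OFFSETS = {
    # pixel offsets of the AR hint relative to the object anchor (assumed values)
    "drill": (0, -60),
    "green box": (40, 0),
}

def ar_position(object_label, bounding_box):
    # bounding_box is (x_min, y_min, x_max, y_max) from the object detector.
    x_min, y_min, x_max, y_max = bounding_box
    anchor_x, anchor_y = (x_min + x_max) / 2, (y_min + y_max) / 2
    dx, dy = AR_OFFSETS.get(object_label, (0, 0))
    return anchor_x + dx, anchor_y + dy

print(ar_position("drill", (120, 200, 220, 300)))  # -> (170.0, 190.0)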
D. Experiments
In order to evaluate the impact of our system on the worker, we conducted a series of user studies. For that reason, a special assembly scenario consisting of 9 steps was created which reflects all the features of the system, such as object detection, action validation, interface enhancement through AR elements and additional hints through video tutorials. Depending on the scope of the user study, certain features were activated or deactivated in order to study their impact. To create a comparable baseline, we provided a basic version of the system where we disabled all additional features and only kept the descriptions of the 9 steps. Subsequently, we compared this baseline against our assistance system with all assistance features. Table I shows the instructions of the test scenario. A total of 30 participants (13 female, 17 male) from various backgrounds were recruited. The participants were aged from 20 to 45. 17 out of the 30 participants had a technical background. The rest had a business, medical, law or construction related background.

TABLE I: Task instructions
Step 1: Please make sure to have all required tools available on your work table.
Step 2: Pick up the drill and make a hole around 3 cm deep in the board at x = ... cm, y = ... cm, where the short side of the board is considered the x-axis and the longer side the y-axis.
Step 3: Find the screw bit inside the grey box and change the drill bit to a screw bit.
Step 4: Find the green box and pick a screw.
Step 5: Secure the screw with the drill into the previously made hole.
Step 6: Find the wooden board underneath your working station and place the board on the table.
Step 7: Pick the pencil and mark two spots with the pencil at the following coordinates on the board: 1) x = ... cm, y = ... cm and 2) x = ... cm, y = ... cm.
Step 8: Then, measure the distance between the two marks and mark the middle point.
Step 9: Pick up the hacksaw and saw the board in two at the previously made middle mark.

IV. EVALUATION
To test whether the assumptions from the conceptual design were correct and whether all the features led to an improved task completion time, we measured the task completion time as well as the error rate in two modes as described above. In the first mode, denoted as mode 1, all features were removed except for the step instruction text, and the users were asked to complete the scenario. In the second mode, denoted as mode 2, we activated all the features and asked them to complete the tutorial. To eliminate the possibility that the user might have learned the task during the first stage, which could affect the second stage, we split the group in half and asked the first group to begin with mode 1 while the second group began with mode 2 using the AR assistance system.
A. Completion Time
First, we evaluated the time difference between the two modes in order to determine if the assistance system leads to a reduced time to task completion. The results are captured in Fig. 4. It can be observed that a substantial execution time improvement was achieved when using our system. In every case an improvement occurred. In 10% of the cases the improvement is over 2 minutes, which represents 33% of saved time. On average, the assembly duration decreases by 69 seconds or 20%. As the scenario was relatively simple and does not require highly specialized skills, we can expect even better results in the case of very complex scenarios.

Fig. 4: Task completion time

It is worth noting that during the user study, people with slightly higher technical inclinations performed better (as can be noted for participants 3, 4, 8, 19 and 23), requiring less time to complete the tutorial compared to people with other professional backgrounds, for whom the use of GUI elements decreased the assembly time (as can be noted for participants 5, 20, 21 and 28).
B. Error Rate
In the next step, we tracked the number of errors the users made in each iteration while performing the above experiment and compared the two modes against each other. The results are illustrated in Fig. 5. The following instances were considered to be errors: grabbing the wrong tool, missing a tool on the working station, performing a wrong action, moving to the next step although the current task was not fully completed, and placing an object in the wrong position. Only three users managed to complete the tutorial scenario without any errors on the first try. The rest (90%) made at least one error. The maximum number of errors registered for a single person was made by participant 4. During the experimental phase, it was noted that after committing their first mistake users became more alert and started paying more attention to the instructions in an effort to avoid further mistakes. While the average error rate is already small during mode 1 (without AR assistance) with an average of 1.53 errors per user, we still notice a substantial improvement by activating all features in mode 2. Furthermore, one third of users managed to complete the tutorial without any errors during mode 2.

Fig. 5: Error rate
C. One-way ANOVA
We performed a one-way analysis of variance (ANOVA) on the data for task completion time (Fig. 4) and error rate (Fig. 5). For the completion time, we found a highly significant difference between the two groups (p=4.83E-8), therefore rejecting the null hypothesis that both groups were randomly sampled from the same distribution. For the error rate, we found a highly significant difference between the groups as well (p=9.3E-4). The statistical analysis provides strong evidence that our assistance system significantly reduces completion time and error rate in untrained workers when performing the given task for the first time.
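The test itself can be reproduced in a few lines, for example with scipy; the lists below are placeholder values and not the measured data from our study:

from scipy import stats

completion_mode1 = [402, 371, 355, 390, 410]  # placeholder completion times in seconds
completion_mode2 = [310, 295, 300, 320, 305]  # placeholder completion times in seconds

f_statistic, p_value = stats.f_oneway(completion_mode1, completion_mode2)
print(f"F = {f_statistic:.2f}, p = {p_value:.2e}")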
D. User Acceptance
We evaluated the impact of different components of the assistance system by masking out specific components during the experiments.

Fig. 6: User acceptance of specific AR components

As depicted in Fig. 6, AR elements are considered to be the most useful feature of the GUI. The most popular arguments of the participants included phrases such as "offers extra info", "requires less thinking" and "easy to understand". This indicates that the user acceptance of such visual cues is strong. Interestingly, a correlation between "most useful" and "needs improvement" can be observed. Because users put importance on certain features, they also demand that these features work highly accurately. The top arguments for the first two results in the "needs improvement" column were "sometimes elements are slightly off" and "bounding boxes are not always precise". The watch counter was voted the least useful feature with over 50% of the votes. Users described it as "not providing any value", "not noticeable" and "does not help with the assembly". Such feedback leads us to believe that the watch counter as a metric should not be used by the worker himself, but rather by managers in order to assess the efficiency of a worker and judge his progress. In the "bothersome" category, 40% of the questioned users did not find any of the features to be irritating, which shows that the implemented features complement each other and create an effective learning environment. On the other hand, AR elements and object identification were reported as being "too obvious" and "not necessary". However, we believe that these arguments might be influenced by the relatively low complexity of the tutorial scenario and the common tools used therein. We expect such features to be more valuable in more complex and unknown scenarios.
E. Learning Curve
Lastly, in order to assess the users' learning curve, we picked 10 participants out of the original group and asked them to do the same scenario 20 times with and without our assistance system. We then measured the time to complete the task in each iteration. The results are illustrated in Fig. 7. The completion time converges at the 14th iteration while using mode 1 (without AR assistance) and at the 10th iteration in mode 2 (with AR assistance).

Fig. 7: Learning curves of assistance system (blue) and text-based instructions only (green)

Notably, without the usage of the AR assistance, users achieve a faster task completion after convergence is reached, with the best result requiring only 1:13 minutes, while the shortest completion time when using the assistance system is 1:40 minutes. This indicates that the assistance system can help workers learn new workflows faster but decreases the efficiency when the user is already familiar with the task. It is evident that after familiarization with the tasks, completing them without the assistance system results in a faster completion time, as the additional features might become distracting. Furthermore, the scenario was relatively simple. In order to infer more meaningful results, the study has to be extended to more complex tasks in further research. Nevertheless, our study gives empirical evidence that there is a point in time after which AR assistance systems lose their benefits for untrained workers. In our case, this is after 13 iterations of the same task.

V. CONCLUSION

We proposed an AR-based industrial assistance system by incorporating established neural network architectures for computer vision tasks to assist and train workers in manual processes. We combined a robust object detection model with a human pose detection model to recognize the user's actions based only on RGB camera input and display hints and visual cues accordingly. Furthermore, we integrated and evaluated several AR-based components and additional methods such as video input or animations to assess the impact on user experience and understanding. The system was evaluated in a variety of relevant metrics for industrial applications such as time to task completion, error rate, user acceptance and learning curve. We performed a one-way ANOVA and found a highly significant (p<0.001) improvement in time efficiency and error rate. Furthermore, the results indicated a promotion of the learning curve especially for new employees, and the system was highly accepted by participants in the early stages. However, we observed a negative impact with further iterations of the task. This might be attributed to our relatively simple tasks which we conducted for demonstrative purposes. We aspire to extend the user studies to include more complex scenarios and participant groups. Moreover, we aim to add more sophisticated architectures and functionalities to the system to enhance user understanding even further.

REFERENCES

[1] D. Preuveneers and E. Ilie-Zudor, "The intelligent industry of the future: A survey on emerging trends, research challenges and opportunities in industry 4.0,"
Journal of Ambient Intelligence and Smart Environments.
[3] Skills forecast: trends and challenges to 2030. Luxembourg: Publications Office of the European Union, 2018.
[4] L. F. de Souza Cardoso, F. C. M. Q. Mariano, and E. R. Zorzal, "A survey of industrial augmented reality," Computers & Industrial Engineering, vol. 139, p. 106159, Jan. 2020.
[5] T. Masood and J. Egger, "Augmented reality in support of industry 4.0—implementation challenges and success factors," Robotics and Computer-Integrated Manufacturing.
[7] The International Journal of Advanced Manufacturing Technology, vol. 94, no. 1-4, pp. 509–521, Aug. 2017.
[8] C.-H. Chu, C.-J. Liao, and S.-C. Lin, "Comparing augmented reality-assisted assembly functions—a case study on dougong structure," Applied Sciences, vol. 10, no. 10, p. 3383, May 2020.
[9] E. Bottani and G. Vignali, "Augmented reality technology in the manufacturing industry: A review of the last decade," IISE Transactions, vol. 51, no. 3, pp. 284–310, Feb. 2019.
[10] J. Egger and T. Masood, "Augmented reality in support of intelligent manufacturing – a systematic literature review," Computers & Industrial Engineering, vol. 140, p. 106195, Feb. 2020.
[11] T. Caudell and D. Mizell, "Augmented reality: an application of heads-up display technology to manual manufacturing processes," in Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences. IEEE, 1992.
[12] F. Loch, F. Quint, and I. Brishtel, "Comparing video and augmented reality assistance in manual assembly." IEEE, Sep. 2016.
[13] M. Funk, A. Bächler, L. Bächler, T. Kosch, T. Heidenreich, and A. Schmidt, "Working with augmented reality?" in Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments. ACM, Jun. 2017.
[14] M. Lorenz, S. Knopp, and P. Klimant, "Industrial augmented reality: Requirements for an augmented reality maintenance worker support system." IEEE, Oct. 2018.
[15] J. Wolfartsberger, J. Haslwanter, and R. Lindorfer, "Perspectives on assistive systems for manual assembly tasks in industry," Technologies, vol. 7, no. 1, p. 12, Jan. 2019.
[16] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[18] Z. Cao, G. H. Martinez, T. Simon, S.-E. Wei, and Y. A. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[19] F. Chen, "Multi-person real-time action recognition based on human skeleton."