Integrative Object and Pose to Task Detection for an Augmented-Reality-based Human Assistance System using Neural Networks
Linh Kästner, Leon Eversberg, Marina Mursa and Jens Lambrecht

Linh Kästner, Leon Eversberg, Marina Mursa and Jens Lambrecht are with the Chair Industry Grade Networks and Clouds, Faculty of Electrical Engineering and Computer Science, Berlin Institute of Technology, Berlin, Germany. [email protected]

Abstract — As a result of an increasingly automatized and digitized industry, processes are becoming more complex. Augmented Reality has shown considerable potential in assisting workers with complex tasks by enhancing user understanding and experience with spatial information. However, the acceptance and integration of AR into industrial processes is still limited due to the lack of established methods and tedious integration efforts. Meanwhile, deep neural networks have achieved remarkable results in computer vision tasks and bear great prospects to enrich Augmented Reality applications. In this paper, we propose an Augmented-Reality-based human assistance system to assist workers in complex manual tasks where we incorporate deep neural networks for computer vision tasks. More specifically, we combine Augmented Reality with object and action detectors to make workflows more intuitive and flexible. To evaluate our system in terms of user acceptance and efficiency, we conducted several user studies. We found a significant reduction in time to task completion for untrained workers and a decrease in error rate. Furthermore, we investigated the users' learning curve with our assistance system.
I. INTRODUCTION
Due to the rise of automation and digitization in all areas of industry, the complexity of manufacturing processes has increased in recent years, and tools to handle the rapidly growing amounts of information are needed [1]. Despite the upcoming changes in the industrial workforce, it is unlikely that human labor will become obsolete due to complete automation. Manufacturers look for ways to use technological advancement to assist their employees cognitively and physically rather than replacing them entirely [2]. These assistance systems can help with the global trend of an increasing demand for high-qualification jobs and a decreasing demand for low-qualification jobs [3]. In this context, Augmented-Reality (AR)-based assistance systems have shown promising results such as reducing task execution time, improving product quality or increasing workers' ability to learn new tasks [4]. While AR assistance systems have great potential for improving the collaboration between humans and machines, the implementation of these systems faces technological, environmental and organizational challenges [5] and is still an ongoing field of research. Furthermore, state-of-the-art AR assistance systems often require an additional commissioning step, are restricted to specific areas and do not take advantage of the recent success of computer vision approaches like neural networks [6], [7], [8].
Fig. 1: Proposed pipeline of the industrial AR assistance system

The majority of AR systems still rely on fiducial markers [4], [9] or manual calibration [6], [7], [8]. Additionally, while most AR systems have integrated some form of object detection, the addition of an action detection is not yet established. Moreover, only little attention has been paid to static screens as AR visualization hardware [10] despite their simplicity and potential to employ more processing power compared to handheld devices or head-mounted devices (HMDs).

On this account, we present an AR-based industrial assistance system, which incorporates deep learning (DL) architectures to ensure a more flexible and intuitive user experience. We combine an object detection model with an action detection model to guide untrained users in manual tasks with the objective of promoting knowledge transfer (s. Fig. 1). We tested our assistance system with 30 participants in an exemplary assembly scenario consisting of 9 steps and evaluated the results in terms of task completion time, error rate, user acceptance and learning promotion. The main contributions of this paper are the following:
• A proposal of an industrial AR assistance system for complex manual tasks using deep neural networks relying on RGB camera input only
• The combination of object detection and action detection enabling the system to perform interaction analysis
• A quantitative and qualitative evaluation of our system in terms of time to task completion, error rate and a user acceptance questionnaire
• A novel study on the effect of task repetition on the impact of the AR assistance system

The paper is structured as follows: Sec. II gives an overview of related work. Sec. III presents the methodology. Sec. IV presents the results and evaluation. Finally, Sec. V gives a conclusion of our work.

II. RELATED WORK

Over the last decade there has been a growing interest in industrial AR. The term Augmented Reality was first introduced by Caudell and Mizell [11] in 1992, where they used HMDs in aircraft manufacturing to reduce manufacturing cost and improve workers' efficiency. The largest focus of research has been on the application of manual assembly tasks [4]. Loch et al. [12] created an AR assistance system for manual assembly tasks of Lego bricks. They provide assistance on a screen by overlaying animations over the video feed of a camera. They compared their AR system to video instructions and found an improvement in time to completion and number of errors in participants when using the AR system for the first time. Uva et al. [7] designed a Spatial Augmented Reality (SAR) prototype intended to be used for working stations in the context of smart factories. They used a workbench consisting of a controller, a projector and camera mounted on a frame and a rotating table with fiducial markers. The instructions were projected directly onto the objects to be maintained. Their research shows that SAR instructions have significant benefits compared to paper-based instructions in tasks with high difficulty. The authors note that they did not consider the participants' experience or the learning effect after multiple task repetitions. In contrast to [7], Funk et al. [13] did a long-term analysis of their SAR assistance system. The researchers equipped an assembly line of a car manufacturing company with projectors and depth cameras and tested their prototype over 11 work days with expert and untrained workers.
The depth images were used to detect the picking of boxes in a specific location, and direct feedback was given by a top-mounted projector. Subsequently, the task completion time, error rate and NASA-TLX scores were compared, and no improvements were found when comparing in-situ instructions with no instructions. Instead, it was observed that the assistance system was useful for untrained workers during the learning phase, and the researchers suggest that there is a tipping point after which the assistance system loses its benefits, thus suggesting that AR assistance systems are best for knowledge transfer and for helping untrained workers to learn new workflows. Similar to our work, Chu et al. [8] recently published the full framework of an AR assistance system to assemble Dougong structures. Their AR smartphone app uses object recognition to identify relevant parts of the current assembly step and then helps with the assembly process by displaying AR animations. Object recognition and tracking is handled by the AR library Vuforia, and the display of assembly instructions is implemented with Unity.

While most studies of AR assistance systems are done in a laboratory environment [10], Lorenz et al. [14] analyzed requirements of three different production environments in the real world and conclude that rugged tablet computers are the best hardware choice because of harsh production environments and the worn safety gear of the workers. The researchers point out that HMDs, which are the most researched AR hardware [10], are neither robust nor can they be worn with protective gear. Additionally, Wolfartsberger et al. [15] also noted that the current generation of HMDs still has many disadvantages and recommend using a fixed tablet screen or in-situ projection.

In comparison to the state of the art on AR assistance systems, our work is based on deep neural networks for object detection and human pose detection. The addition of our action detection system enables the AR system to analyze the workers' actions and incorporates the trending field of human-object interaction into assistance systems. In contrast to other works [12], [7], [13], we describe the full framework of our system architecture. Because our machine-learning-based system only relies on RGB video input and a monitor for display, our assistance system is very flexible and complies with the compiled requirements of [14] for harsh industrial environments.

III. METHODOLOGY
Our objective is to provide an intuitive system that aims to increase the efficiency of the user by using the following integrated features: detailed process step descriptions, visual hints using AR visualizations, and video hints by using video tutorials. All these features decrease the time spent looking for additional sources of information, which potentially interrupts the worker's focus and thus decreases their efficiency. Another objective is the prevention of errors or the timely detection of errors such that the user can instantly react. The assistance system aims to reduce the error rate through the following implemented features: action error detection, object identification, step validation and error prevention, realized through detailed instructions, hints and guiding AR elements.
A. System Architecture
Fig. 2: System design of our proposed assistance system

The overall architecture of our proposed assistance system can be observed in Fig. 2. It offers a high-level view of the system and highlights all entities. As illustrated in the system design, the main business logic is divided between a Step Detection and a Scenario Handler. The Step Detection consumes the camera input in the form of frames. The Scenario Handler provides the step requirements, such as the necessary objects and the actions the worker has to perform at the respective steps. This information is sent to the object and action detection models, which run in parallel in order to increase the speed of image processing and therefore assure faster responsiveness.
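The following sketch illustrates, in Python, how such step requirements could be represented for the Scenario Handler. The field names and the helper function are illustrative assumptions rather than the exact schema of our implementation; the example entries are taken from the task instructions in Table I.

SCENARIO = [
    {
        "step": 2,
        "description": "Pick up the drill and make a hole in the board.",
        "required_objects": ["drill", "board"],   # objects the detector must confirm
        "required_action": "drilling",            # action the classifier must observe
    },
    {
        "step": 4,
        "description": "Find the green box and pick a screw.",
        "required_objects": ["green box", "screw"],
        "required_action": "picking",
    },
]

def requirements_for(step_number):
    # Return the requirements the detectors have to confirm for the given step.
    return next(s for s in SCENARIO if s["step"] == step_number)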
We used a Faster R-CNN architecture [16] for object detection with an Inception v2 model [17], while OpenPose [18] together with a neural-network-based frame classifier is utilized for the action detection. After the models process the frame, the results are sent as a tuple to a queue from which the Camera View Controller pulls and displays them to the user. The Scenario Controller is responsible for evaluating these results in order to display possible errors or wrong actions. Furthermore, it is also responsible for displaying AR components to assist the user in fixing these errors and giving supplementary hints. Additionally, in order to improve the decision making accuracy and avoid flickering of data, a smoothing module is employed which runs over the last 10 frames and takes the result with the highest occurrence to be considered truth. In case the step was successfully validated, the Scenario Controller sends a validation event back to the Scenario Handler, which moves to the next step. Once a new step is initiated, the Scenario Controller is notified by the event and displays the description of the new step. In case a step was not validated, an error event is published which notifies the Scenario Handler to display an error/warning message through the Error View.
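A minimal sketch of this smoothing step, assuming the per-frame detector outputs arrive as (object, action) tuples; the class and variable names are chosen for illustration only:

from collections import Counter, deque

class ResultSmoother:
    # Majority vote over a sliding window of the last 10 frame results.
    def __init__(self, window_size=10):
        self.window = deque(maxlen=window_size)

    def update(self, frame_result):
        # Add the latest result and return the value occurring most often
        # within the window, which is treated as the current truth.
        self.window.append(frame_result)
        most_common, _count = Counter(self.window).most_common(1)[0]
        return most_common

smoother = ResultSmoother()
smoothed = smoother.update(("drill", "picking"))  # one detector output per frame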
B. Action Detection
In order to recognize the worker's actions and thereby detect the correctness of the user's actions, we use the skeleton-tracking-based pose detection of OpenPose over a series of frames together with a neural network classifier in our assistance system. The implementation is based on the open source project Realtime-Action-Recognition by Chen et al. [19]. First, we calculate the joint velocities from a series of N=5 frames using the formulas described in equations (1) to (4). Afterwards, the actions are derived by a neural network which outputs the classified action. To retrieve the joints of each frame, the OpenPose algorithm is trained to detect the human skeleton from an image. The input to OpenPose is an RGB image acquired from the camera and the output is the skeleton of the human. The raw skeleton data output of OpenPose is pre-processed with the following three steps proposed by [19] and applied to our use case:

1) Coordinate scaling - As OpenPose uses a different unit for the x coordinate and the y coordinate, the output joint positions must be scaled.
2) Removal of unnecessary joints - Mainly, the movements of arms and hands are relevant for our use case, thus the joints of the head are removed to reduce the vector size.
3) Padding of missing joints - To enhance robustness against inconsistent detections, we incorporate a padding with detections of the previous frames in case joints are missing.

The joint positions are retrieved, and the joint positions from N=5 consecutive frames are considered to calculate the joint velocities and derive the user action. First, the height of the body $H$ is calculated. To this end, the average coordinates $(\bar{x}, \bar{y})$ between the two hips, represented by the coordinate pairs left hip $(x_l, y_l)$ and right hip $(x_r, y_r)$, are computed, and $H$ is taken as the distance between the neck joint $(x_{neck}, y_{neck})$ and this hip center:

$(\bar{x}, \bar{y}) = \left( \frac{x_l + x_r}{2}, \frac{y_l + y_r}{2} \right)$  (1)

$H = \sqrt{(x_{neck} - \bar{x})^2 + (y_{neck} - \bar{y})^2}$  (2)

Subsequently, the normalized joint positions $X_n$ can be calculated using the initial joint coordinates $X$. Finally, the joint velocity $V_i$ for all joints can be calculated, where $i$ indexes the list of all joints and $t_k$ is the frame number from the series of the 5 frames:

$X_n = \frac{X}{H}$  (3)

$V_i = X_n[t_k] - X_n[t_{k-1}]$  (4)

These features are concatenated and given as input to a 3-layer fully connected neural network to classify the user's action.
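The feature computation of equations (1) to (4) can be sketched as follows, assuming each skeleton is an array of (x, y) joint positions with COCO-style OpenPose indices for the neck and hips; the index values and function names are assumptions for illustration, not our exact implementation:

import numpy as np

NECK, RIGHT_HIP, LEFT_HIP = 1, 8, 11  # assumed COCO-style OpenPose joint indices

def normalize(skeleton):
    # Scale all joint positions by the body height H, see Eq. (1)-(3).
    hip_center = (skeleton[LEFT_HIP] + skeleton[RIGHT_HIP]) / 2.0  # Eq. (1)
    height = np.linalg.norm(skeleton[NECK] - hip_center)           # Eq. (2)
    return skeleton / height                                       # Eq. (3)

def joint_velocities(frames):
    # Differences between consecutive normalized frames, see Eq. (4).
    normalized = np.stack([normalize(f) for f in frames])
    return normalized[1:] - normalized[:-1]

frames = [np.random.rand(18, 2) for _ in range(5)]   # N=5 skeletons with 18 joints each
features = joint_velocities(frames).flatten()        # input vector for the classifier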
C. Graphical User Interface

Fig. 3: Graphical user interface with AR components (top); text and video for additional information as well as warnings or error messages (bottom)

Fig. 3 illustrates the most important features of our Graphical User Interface (GUI), which is implemented as a cross-platform application using Kivy. It includes AR components such as virtual hints of relevant areas within the working place and animations that describe the task visually. Text fields are displayed to inform the user about tasks or errors. Additionally, videos can be displayed, which provide additional information. The progress bar at the bottom displays the step a worker is currently in, and the watch counter at the top left shows the time passed in order to provide a reference to the user. To display the AR components and animations, we used an anchor-based calibration, which means that we utilized the position of a detected object as an anchor to display the AR features with respect to that object. This way, we ensure the flexibility of our system by not relying on manually aligned positions. We employ a database where the specific display positions of AR components with respect to the objects are listed accordingly.
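As a minimal sketch of this anchor-based calibration, the centre of a detected object's bounding box can serve as the anchor, and the AR element is drawn at an offset looked up per object; the offset table and function name below are illustrative assumptions rather than our exact database schema:

AR_OFFSETS = {
    # pixel offsets of the AR hint relative to the object anchor (assumed values)
    "drill": (0, -60),
    "green box": (40, 0),
}

def ar_position(object_label, bounding_box):
    # bounding_box is (x_min, y_min, x_max, y_max) from the object detector.
    x_min, y_min, x_max, y_max = bounding_box
    anchor_x, anchor_y = (x_min + x_max) / 2, (y_min + y_max) / 2
    dx, dy = AR_OFFSETS.get(object_label, (0, 0))
    return anchor_x + dx, anchor_y + dy

print(ar_position("drill", (120, 200, 220, 300)))  # -> (170.0, 190.0)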
D. Experiments
In order to evaluate the impact of our system on the worker, we conducted a series of user studies. For that reason, a special assembly scenario consisting of 9 steps was created which reflects all the features of the system, such as object detection, action validation, interface enhancement through AR elements and additional hints through video tutorials. Depending on the scope of the user study, certain features were activated or deactivated in order to study their impact. To create a comparable baseline, we provided a basic version of the system where we disabled all additional features and only kept the descriptions of the 9 steps. Subsequently, we compared this baseline against our assistance system with all assistance features. Table I shows the instructions of the test scenario. A total of 30 participants (13 female, 17 male) from various backgrounds were recruited. The participants were aged from 20 to 45. 17 out of the 30 participants had a technical background. The rest had a business, medical, law or construction related background.

TABLE I: Task instructions
Step 1: Please make sure to have all required tools available on your work table.
Step 2: Pick up the drill and make a hole around 3 cm deep in the board at x = ... cm, y = ... cm, where the short side of the board is considered the x-axis and the longer side the y-axis.
Step 3: Find the screw bit inside the grey box and change the drill bit to a screw bit.
Step 4: Find the green box and pick a screw.
Step 5: Secure the screw with the drill into the previously made hole.
Step 6: Find the wooden board underneath your working station and place the board on the table.
Step 7: Pick the pencil and mark two spots with the pencil at the following coordinates on the board: 1) x = ... cm, y = ... cm and 2) x = ... cm, y = ... cm.
Step 8: Then, measure the distance between the two marks and mark the middle point.
Step 9: Pick up the hacksaw and saw the board in two at the previously made middle mark.

IV. EVALUATION
To test whether the assumptions from the conceptual design were correct and whether all the features led to an improved task completion time, we measured the task completion time as well as the error rate in two modes as described above. In the first mode, denoted as mode 1, all features were removed except for the step instruction text, and the users were asked to complete the scenario. In the second mode, denoted as mode 2, we activated all the features and asked them to complete the tutorial. To eliminate the possibility that the user might have learned the task during the first stage, which could affect the second stage, we split the group in half and asked the first group to begin with mode 1 while the second group began with mode 2 using the AR assistance system.
A. Completion Time
First, we evaluated the time difference between the two modes in order to determine if the assistance system leads to a reduced time to task completion. The results are captured in Fig. 4. It can be observed that a substantial execution time improvement was achieved when using our system. In every case an improvement occurred. In 10% of the cases the improvement is over 2 minutes, which represents 33% of saved time. On average, the assembly duration decreases by 69 seconds or 20%. As the scenario was relatively simple and does not require highly specialized skills, we can expect even better results in the case of very complex scenarios.

Fig. 4: Task completion time

It is worth noting that during the user study, people with slightly higher technical inclinations performed better (as can be noted for participants 3, 4, 8, 19 and 23), requiring less time to complete the tutorial compared to people with other professional backgrounds, for whom the use of GUI elements decreased the assembly time (as can be noted for participants 5, 20, 21 and 28).
B. Error Rate
In the next step, we tracked the number of errors the users made in each iteration while performing the above experiment and compared the two modes against each other. The results are illustrated in Fig. 5. The following instances were considered to be errors: grabbing the wrong tool, missing a tool on the working station, performing a wrong action, moving to the next step although the current task was not fully completed, and placing an object in the wrong position. Only three users managed to complete the tutorial scenario without any errors on the first try. The rest (90%) made at least one error. The maximum number of errors registered for a single person was made by participant 4. During the experimental phase, it was noted that after committing their first mistake users became more alert and started paying more attention to the instructions in an effort to avoid further mistakes. While the average error rate is already small during mode 1 (without AR assistance) with an average of 1.53 errors per user, we still notice a substantial improvement by activating all features in mode 2. Furthermore, one third of users managed to complete the tutorial without any errors during mode 2.

Fig. 5: Error rate
C. One-way ANOVA
We performed a one-way analysis of variance (ANOVA) on the data for task completion time (Fig. 4) and error rate (Fig. 5). For the completion time, we found a highly significant difference between the two groups (p=4.83E-8), therefore rejecting the null hypothesis that both groups were randomly sampled from the same distribution. For the error rate, we found a highly significant difference between the groups as well (p=9.3E-4). The statistical analysis provides strong evidence that our assistance system significantly reduces completion time and error rate in untrained workers when performing the given task for the first time.
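The test itself can be reproduced in a few lines, for example with scipy; the lists below are placeholder values and not the measured data from our study:

from scipy import stats

completion_mode1 = [402, 371, 355, 390, 410]  # placeholder completion times in seconds
completion_mode2 = [310, 295, 300, 320, 305]  # placeholder completion times in seconds

f_statistic, p_value = stats.f_oneway(completion_mode1, completion_mode2)
print(f"F = {f_statistic:.2f}, p = {p_value:.2e}")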
D. User Acceptance
We evaluated the impact of different components of the assistance system by masking out specific components during the experiments.

Fig. 6: User acceptance of specific AR components

As depicted in Fig. 6, AR elements are considered to be the most useful feature of the GUI. The most popular arguments of the participants included phrases such as "offers extra info", "requires less thinking" and "easy to understand". This indicates that the user acceptance of such visual cues is strong. Interestingly, a correlation between "most useful" and "needs improvement" can be observed. Because users put importance on certain features, they also demand that these features work highly accurately. The top arguments for the first two results in the "needs improvement" column were "sometimes elements are slightly off" and "bounding boxes are not always precise". The watch counter was voted the least useful feature with over 50% of the votes. Users described it as "not providing any value", "not noticeable" and "does not help with the assembly". Such feedback leads us to believe that the watch counter as a metric should not be used by the worker himself, but rather by managers in order to assess the efficiency of a worker and judge his progress. In the "bothersome" category, 40% of the questioned users did not find any of the features to be irritating, which shows that the implemented features complement each other and create an effective learning environment. On the other hand, AR elements and object identification were reported as being "too obvious" and "not necessary". However, we believe that these arguments might be influenced by the relatively low complexity of the tutorial scenario and the common tools used therein. We expect such features to be more valuable in more complex and unknown scenarios.
E. Learning Curve
Lastly, in order to assess the users' learning curve, we picked 10 participants out of the original group and asked them to do the same scenario 20 times with and without our assistance system. We then measured the time to complete the task in each iteration. The results are illustrated in Fig. 7. The completion time converges at the 14th iteration while using mode 1 (without AR assistance) and at the 10th iteration in mode 2 (with AR assistance).

Fig. 7: Learning curves of assistance system (blue) and text-based instructions only (green)

Notably, without the usage of the AR assistance, users achieve a faster task completion after convergence is reached, with the best result requiring only 1:13 minutes, while the shortest completion time when using the assistance system is 1:40 minutes. This indicates that the assistance system can help workers learn new workflows faster but decreases the efficiency when the user is already familiar with the task. It is evident that after familiarization with the tasks, completing them without the assistance system results in a faster completion time, as the additional features might become distracting. Furthermore, the scenario was relatively simple. In order to infer more meaningful results, the study has to be extended to more complex tasks in further research. Nevertheless, our study gives empirical evidence that there is a point in time after which AR assistance systems lose their benefits for untrained workers. In our case, this is after 13 iterations of the same task.

V. CONCLUSION

We proposed an AR-based industrial assistance system by incorporating established neural network architectures for computer vision tasks to assist and train workers in manual processes. We combined a robust object detection model with a human pose detection model to recognize the user's actions based only on RGB camera input and display hints and visual cues accordingly. Furthermore, we integrated and evaluated several AR-based components and additional methods such as video input or animations to assess the impact on user experience and understanding. The system was evaluated in a variety of relevant metrics for industrial applications such as time to task completion, error rate, user acceptance and learning curve. We performed a one-way ANOVA and found a highly significant (p<0.001) improvement in time efficiency and error rate. Furthermore, the results indicated a promotion of the learning curve especially for new employees, and the system was highly accepted by participants in the early stages. However, we observed a negative impact with further iterations of the task. This might be attributed to our relatively simple tasks which we conducted for demonstrative purposes. We aspire to extend the user studies to include more complex scenarios and participant groups. Moreover, we aim to add more sophisticated architectures and functionalities to the system to enhance user understanding even further.

REFERENCES

[1] D. Preuveneers and E. Ilie-Zudor, "The intelligent industry of the future: A survey on emerging trends, research challenges and opportunities in industry 4.0,"
Journal of Ambient Intelligence and Smart Environments.
[3] Skills forecast: trends and challenges to 2030. Luxembourg: Publications Office of the European Union, 2018.
[4] L. F. de Souza Cardoso, F. C. M. Q. Mariano, and E. R. Zorzal, "A survey of industrial augmented reality," Computers & Industrial Engineering, vol. 139, p. 106159, Jan. 2020.
[5] T. Masood and J. Egger, "Augmented reality in support of industry 4.0—implementation challenges and success factors," Robotics and Computer-Integrated Manufacturing.
[7] The International Journal of Advanced Manufacturing Technology, vol. 94, no. 1-4, pp. 509–521, Aug. 2017.
[8] C.-H. Chu, C.-J. Liao, and S.-C. Lin, "Comparing augmented reality-assisted assembly functions—a case study on dougong structure," Applied Sciences, vol. 10, no. 10, p. 3383, May 2020.
[9] E. Bottani and G. Vignali, "Augmented reality technology in the manufacturing industry: A review of the last decade," IISE Transactions, vol. 51, no. 3, pp. 284–310, Feb. 2019.
[10] J. Egger and T. Masood, "Augmented reality in support of intelligent manufacturing – a systematic literature review," Computers & Industrial Engineering, vol. 140, p. 106195, Feb. 2020.
[11] T. Caudell and D. Mizell, "Augmented reality: an application of heads-up display technology to manual manufacturing processes," in Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences. IEEE, 1992.
[12] F. Loch, F. Quint, and I. Brishtel, "Comparing video and augmented reality assistance in manual assembly." IEEE, Sep. 2016.
[13] M. Funk, A. Bächler, L. Bächler, T. Kosch, T. Heidenreich, and A. Schmidt, "Working with augmented reality?" in Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments. ACM, Jun. 2017.
[14] M. Lorenz, S. Knopp, and P. Klimant, "Industrial augmented reality: Requirements for an augmented reality maintenance worker support system." IEEE, Oct. 2018.
[15] J. Wolfartsberger, J. Haslwanter, and R. Lindorfer, "Perspectives on assistive systems for manual assembly tasks in industry," Technologies, vol. 7, no. 1, p. 12, Jan. 2019.
[16] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[18] Z. Cao, G. H. Martinez, T. Simon, S.-E. Wei, and Y. A. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[19] F. Chen, "Multi-person real-time action recognition based on human skeleton."