Efficient Real-Time Camera Based Estimation of Heart Rate and Its Variability

Amogh Gudi* (VV, TUD)
Marian Bittner* (VV, TUD)
Roelof Lochmans (TU/e)
Jan van Gemert (TUD)

VV: Vicarious Perception Technologies, Amsterdam, The Netherlands
TUD: Delft University of Technology, Delft, The Netherlands
TU/e: Eindhoven University of Technology, Eindhoven, The Netherlands

* Equal contribution.
Abstract
Remote photo-plethysmography (rPPG) uses a remotely placed camera to estimate a person's heart rate (HR). Just as heart rate can provide useful information about a person's vital signs, insights about the underlying physiological and psychological conditions can be obtained from heart rate variability (HRV). HRV is a measure of the fine fluctuations in the intervals between heart beats. However, this measure requires temporally locating heart beats with a high degree of precision. We introduce a refined and efficient real-time rPPG pipeline with novel filtering and motion suppression that not only estimates heart rate more accurately, but also extracts the pulse waveform to time heart beats and measure heart rate variability. This method requires no rPPG-specific training and is able to operate in real-time. We validate our method on a self-recorded dataset under an idealized lab setting, and show state-of-the-art results on two public datasets with realistic conditions (VicarPPG and PURE).
1. Introduction
Human vital signs like heart rate, blood oxygen saturation, and related physiological measures can be measured using a technique called photo-plethysmography (PPG). This technique involves optically monitoring light absorption in tissues that is associated with blood volume changes. Typically, this is done via a contact sensor attached to the skin surface [2]. Remote photo-plethysmography (rPPG) detects the blood volume pulse remotely by tracking changes in the skin reflectance as observed by a camera [9, 31]. In this paper, we present a novel framework for extracting heart rate (HR) and heart rate variability (HRV) from the face.
The process of rPPG essentially involves two steps: detecting and tracking the skin colour changes of the subject, and analysing this signal to compute measures like heart rate, heart rate variability, and respiration rate. Recent advances in computer vision, signal processing, and machine learning have improved the performance of rPPG techniques significantly [9]. Current state-of-the-art methods are able to leverage image processing by deep neural networks to robustly select skin pixels within an image and perform HR estimation [4, 22]. However, this reliance upon heavy machine learning (ML) processes has two primary drawbacks: (i) it necessitates rPPG-specific training of the ML model, thereby requiring the collection of large training sets; (ii) complex models can require significant computation time on CPUs and can thus add a bottleneck to the pipeline and limit real-time utility. Since rPPG analysis is originally a signal processing task, the use of an end-to-end trainable system with no domain knowledge leaves room for improvement in efficiency (e.g., we know that the pulse signal is embedded in average skin colour changes [31, 34, 35], but the ML system has to learn this). We introduce a simplified and efficient rPPG pipeline that performs the full rPPG analysis in real-time. This method achieves state-of-the-art results without needing any rPPG-related training. This is achieved via extracting regions of interest robustly by 3D face modelling, and explicitly reducing the influence of head movement to filter the signal.

While heart rate is a useful output from a PPG/rPPG analysis, finer analysis of the obtained blood volume pulse (BVP) signal can yield further useful measures. One such measure is heart rate variability: an estimate of the variations in the time-intervals between individual heart beats. This measure has high utility in providing insights into the physiological and psychological state of a person (stress levels, anxiety, etc.). While traditionally this measure is obtained based on observations over hours, short and ultra-short duration (≤ 5 min) measurements are also used [20]. Our experiments focus on obtaining ultra-short HRV measures as a proof-of-concept/technology demonstrator for longer duration applications.

The computation of heart rate variability requires temporally locating heart beats with a high degree of accuracy. Unlike HR estimation, where errors in opposite directions average out, HRV analysis is sensitive to even small artefacts, and all errors add up to strongly distort the final measurement. Thus, estimating HRV is a challenging task for rPPG, and it has received relatively little focus in the literature. Our method extracts a clean BVP signal from the input via a two-step wide and narrow band frequency filter to accurately time heart beats and estimate heart rate variability.

Figure 1: An overview of the proposed heart rate and heart rate variability estimation pipeline (left to right). The face in captured webcam images is detected and modelled to track the skin pixels in the region of interest. A single 1-D signal is extracted from the spatially averaged values of these pixels over time. In parallel, 3-D head movements are tracked and used to suppress motion noise. An FFT based wide and narrow band filtering process produces a clean pulse waveform from which peaks are detected. The inter-beat intervals obtained from these peaks are then used to compute heart rate and heart rate variability. The full analysis can be performed in real time (~30 frames per second) on a CPU.
Contributions
We make the following contributions in this work: (i) We present a refined and efficient rPPG pipeline that can estimate heart rate with state-of-the-art accuracy from RGB webcams. This method has the advantage that it does not require any rPPG-specific training, and it can perform its analysis at real-time speeds. (ii) Our method is able to time individual heart beats in the estimated pulse signal to compute heart rate variability. This body of work has received little attention, and we set the first benchmarks on two publicly available datasets. (iii) We provide an in-depth HR and HRV estimation analysis of our method on a self-recorded dataset as well as publicly available datasets with realistic conditions (VicarPPG [28], PURE [24], MAHNOB-HCI [21]). We show state-of-the-art results on VicarPPG and PURE. This also surpasses a previous benchmark set by a deep learning based method on PURE.
2. Related Work
Signal processing based rPPG methods
Since the early work of Verkruysse et al. [31], who showed that heart rate can be measured from recordings made by a consumer grade camera in ambient light, a large body of research has been conducted on the topic. Extensive reviews of these works can be found in [9, 19, 26]. Most published rPPG methods work either by applying skin detection on a certain area in each frame, or by selecting one or multiple regions of interest and tracking their averages over time to generate colour signals. A general division can be made into methods that use blind source separation (ICA, PCA) [14, 15, 16] versus those that use a 'fixed' extraction scheme for obtaining the BVP signal [5, 12, 28, 29, 33]. The blind source separation methods require an additional selection step to extract the most informative BVP signal. To avoid this, we make use of a 'fixed' extraction scheme in our method.

Among the 'fixed' methods, multiple stand out and serve as inspiration and foundation for this work. Tasli et al. [28] presented the first face modelling based signal extraction method and utilized detrending [27] based filtering to estimate BVP and heart rate. The CHROM [5] method uses a ratio of chrominance signals, which are obtained from the RGB channels followed by a skin-tone standardization step. Li et al. [12] proposed an extra illumination rectification step that uses the colour of the background to counter illumination variations. The SAMC [29] method proposes an approach for BVP extraction in which regions of interest are dynamically chosen using self-adaptive matrix completion. The plane-orthogonal-to-skin (POS) [33] method improves on CHROM. It works by projecting RGB signals on a plane orthogonal to a normalized skin tone in normalized RGB space, and combines the resulting signals into a single signal containing the pulsatile information. We take inspiration from Tasli et al. [28] and further build upon POS [33]. We introduce additional signal refinement steps for accurate peak detection to further improve HR and HRV analysis.
Deep learning based rPPG methods
Most recent works have applied deep learning (DL) to extract either the heart rate or the BVP directly from camera images. They rely on the ability of deep networks to learn which areas in the image correspond to heart rate. This way, no prior domain knowledge is incorporated and the system learns rPPG concepts from scratch. DeepPhys [4] is the first such end-to-end method to extract heart and breathing rate from videos. HR-CNN [22] uses two successive convolutional neural networks [11] to first extract a BVP from a sequence of images and then estimate the heart rate from it. Both show state-of-the-art performance on two public datasets and a number of private datasets. Our presented algorithm makes use of an active appearance model [30] to select regions of interest from which to extract a heart rate signal. Due to this, no specific rPPG training is required, while prior domain knowledge is more heavily relied upon.
HRV from PPG/rPPG
Some past methods have also attempted extracting heart rate variability from videos [1, 16, 25]. A good overview is provided by Rodriguez et al. [18]. Because of the way HRV is calculated, it is crucial that single beats are detected with a high degree of accuracy. Methods that otherwise show good performance in extracting HR can be unsuitable for HRV analysis, since they may not provide beat locations. Rodriguez et al. [18] evaluate their baseline rPPG method for HRV estimation. Their method is based on bandpass filtering the green channel from regions of interest. However, their results are only reported on their own dataset, which makes direct comparison difficult. Our method also estimates heart rate variability by obtaining precise temporal beat locations from the filtered BVP signal.
3. Method
We present a method for extracting heart rate and heart rate variability from the face using only a consumer grade webcam. Figure 1 shows an overview of this method along with a summarized description.
The first step in the pipeline includes face finding [32] and fitting an active appearance model (AAM) [30]. This AAM is then used to determine facial landmarks and head orientation. The landmarks are used to define a region of interest (RoI) which only contains pixels on the face belonging to skin. This allows us to robustly track the pixels in this RoI over the course of the whole video. Our RoI consists of the upper region of the face excluding the eyes. An example of this can be seen in Figures 1 and 3. The head orientation is used to measure and track the pitch, roll, and yaw angles of the head per frame. Across all pixels in the RoI, the average for each colour channel (R, G, B) is computed and tracked (concatenated) over frames to create three colour signals.
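As an illustration of this spatial averaging step, the minimal sketch below (Python/NumPy; the frame and roi_mask inputs are hypothetical stand-ins for the AAM-derived RoI) reduces the RoI of one frame to a single (R, G, B) sample; stacking one sample per frame yields the three colour signals:

```python
import numpy as np

def spatial_average(frame, roi_mask):
    """Average the R, G, B values over the skin pixels in the RoI.

    frame: H x W x 3 image array; roi_mask: H x W boolean mask of the
    skin pixels in the upper-face region (both hypothetical inputs).
    """
    skin_pixels = frame[roi_mask]        # N x 3 array of RoI skin pixels
    return skin_pixels.mean(axis=0)      # spatial mean per colour channel
```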
The colour signals and the head orientation angles are tracked over a running window of 8.53 seconds. This window duration corresponds to 256 frames at 30 fps, or 512 frames at 60 fps. All signals are resampled using linear interpolation to counteract variations in the frame rate of the input. They are resampled to either 30 or 60 fps, whichever is closer to the frame rate of the source video. Subsequently, the three colour signals from the R, G and B channels are combined into a single rPPG signal using the POS method [33]. The POS method filters out intensity variations by projecting the R, G and B signals on a plane orthogonal to an experimentally determined normalized skin tone vector. The resulting 2-D signal is combined into a 1-D signal, with one of the input signal dimensions being weighted by an alpha parameter that is the quotient of the standard deviations of the two signals. This ensures that the resulting rPPG signal contains the maximum amount of pulsatile component.
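The POS combination step can be sketched as follows. This is a minimal NumPy rendering of the projection described in [33] (the projection matrix is the one published there), not the exact implementation used in our pipeline:

```python
import numpy as np

def pos_projection(rgb_window):
    """Combine windowed R, G, B traces into one pulse signal (POS [33]).

    rgb_window: 3 x T array of spatially averaged colour signals over
    one running window (already resampled to a fixed frame rate).
    """
    # Temporal normalization removes the DC level of each channel.
    c = rgb_window / rgb_window.mean(axis=1, keepdims=True)
    # Project onto the plane orthogonal to the normalized skin tone.
    s = np.array([[0.0, 1.0, -1.0],
                  [-2.0, 1.0, 1.0]]) @ c        # 2 x T projected signals
    # Alpha-weighted sum maximizes the pulsatile component (alpha is
    # the quotient of the two standard deviations, as described above).
    alpha = s[0].std() / s[1].std()
    return s[0] + alpha * s[1]
```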
Rhythmic motion noise suppression
A copy of the extracted rPPG signal, as well as the head-orientation signals, are converted to the frequency domain using the Fast Fourier Transform. The three resulting head-orientation spectra (one each for pitch, roll, and yaw) are combined into one via averaging. This is then subtracted from the raw rPPG spectrum after amplitude normalization. This way, frequency components having a high value in the head-orientation spectrum are attenuated in the rPPG spectrum. Subsequently, the frequencies outside of the human heart rate range (0.7 - 4 Hz, i.e. 42 - 240 bpm) are removed from the spectra.
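A minimal sketch of this suppression step, assuming equal-length, evenly resampled signals; the exact amplitude normalization is not fully specified in the text, so the scaling below is one plausible choice:

```python
import numpy as np

def suppress_motion(rppg, pitch, roll, yaw, fps):
    """Attenuate rhythmic head-motion frequencies in the rPPG spectrum."""
    spec = np.abs(np.fft.rfft(rppg))
    # Average the three head-orientation magnitude spectra into one.
    motion = np.mean([np.abs(np.fft.rfft(s)) for s in (pitch, roll, yaw)],
                     axis=0)
    # Normalize amplitudes before subtracting (assumed scaling), then
    # clip so the attenuation never produces negative magnitudes.
    motion *= spec.max() / (motion.max() + 1e-12)
    clean = np.clip(spec - motion, 0.0, None)
    # Remove frequencies outside the human heart rate range.
    freqs = np.fft.rfftfreq(rppg.size, d=1.0 / fps)
    clean[(freqs < 0.7) | (freqs > 4.0)] = 0.0
    return clean, freqs
```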
Wide & narrow band filtering
The frequency component with the highest magnitude in the resulting spectrum is then used to determine the passband range of a narrow bandpass filter with a bandwidth of 0.47 Hz. Such a filter can be realized either via an inverse FFT or via a high order bandpass filter (e.g. a ~50th order Butterworth). The selected filter is then applied to the original extracted rPPG signal to produce a noise-free BVP.

To prevent minor shifts in the locations of the crests of each beat over multiple overlapping running windows, the signals from each window are overlap-added with earlier signals [5, 6, 33]. First, the filtered rPPG signal is normalized by subtracting its mean and dividing it by its standard deviation. During resampling of the signal, the number of samples to shift is determined based on the source and resampled frame rates. The signal is then shifted back in time accordingly and added to the previous, already-overlapped signals. Older values are divided by the number of times they have been overlap-added, to ensure all temporal locations lie in the same amplitude range. Over time, a cleaner rPPG signal is obtained from this.

Figure 2: Example of heart rate variability computation (panels: heart rate in bpm, inter-beat intervals in ms, squared successive differences of IBIs in ms²): even when the heart rate (HR) is almost constant, the underlying inter-beat intervals (IBIs) can have many fluctuations. This is detected by rising squared successive differences (SSD), a measure of heart rate variability (HRV).
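To make the narrow band filtering concrete, here is a minimal sketch of the inverse-FFT realization; clean_spec and freqs would come from the motion suppression step above, and the 0.47 Hz bandwidth is the one given in the text:

```python
import numpy as np

def narrow_band_filter(rppg, clean_spec, freqs, bandwidth=0.47):
    """Band-pass the raw rPPG signal around the dominant pulse frequency."""
    f0 = freqs[np.argmax(clean_spec)]       # strongest heart-rate component
    spec = np.fft.rfft(rppg)                # spectrum of the raw rPPG signal
    passband = (freqs >= f0 - bandwidth / 2) & (freqs <= f0 + bandwidth / 2)
    spec[~passband] = 0.0                   # keep only the narrow passband
    return np.fft.irfft(spec, n=rppg.size)  # clean BVP waveform
```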
Once a clean rPPG signal is obtained, we can perform peak detection on it to locate the individual beats in time. From the located beats, heart rate and heart rate variability can be calculated. To do this, we first extract the inter-beat intervals (IBIs) from the signal, which are the time intervals between consecutive beats.
Heart rate calculation
Heart rate is calculated by averaging all IBIs over a time window and computing the inverse of it. That is, $\mathrm{HR}_w = 1/\overline{\mathrm{IBI}}_w$, where $\overline{\mathrm{IBI}}_w$ is the mean of all inter-beat intervals that fall within the time window $w$. The choice of this time window can be based on the user's requirement (e.g. instantaneous HR, long-term HR).
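A sketch of this windowed computation from detected beat timestamps (names are illustrative; times in seconds):

```python
import numpy as np

def heart_rate_bpm(beat_times, t_start, t_end):
    """Heart rate over [t_start, t_end) as the inverse of the mean IBI."""
    ibis = np.diff(beat_times)                       # inter-beat intervals (s)
    midpoints = (beat_times[:-1] + beat_times[1:]) / 2
    window_ibis = ibis[(midpoints >= t_start) & (midpoints < t_end)]
    return 60.0 / window_ibis.mean()                 # 1/IBI, scaled to bpm
```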
Heart rate variability calculation

Multiple metrics can be computed to express the measure of heart rate variability in different units. In this work, we focus on one of the most popular time-domain metrics for summarizing HRV, called the root mean square of successive differences (RMSSD) [10, 13, 18], expressed in units of time. As the name suggests, this is computed by calculating the root mean square of the time differences between adjacent IBIs:

$$\mathrm{RMSSD} = \sqrt{\frac{1}{N-1}\sum_{i=0}^{N-2}\left(\mathrm{IBI}_i - \mathrm{IBI}_{i+1}\right)^2},\qquad(1)$$

where $\mathrm{IBI}_i$ represents the $i$-th inter-beat interval, and $N$ represents the number of IBIs in the sequence. A graphical example of such an HRV calculation is shown in Figure 2.

In addition, we also compute two frequency-domain metrics of HRV, simply known as the low-frequency (LF) and high-frequency (HF) band powers [13] (as well as their ratio), which are commonly used in the rPPG HRV literature [1, 14, 18]. The LF and HF components are calculated using Welch's power spectral density estimation. Since Welch's method expects evenly sampled data, the IBIs are interpolated at a frequency of 2.5 Hz and zero padded to the nearest power of two. The power of each band is calculated as the total power in a region of the periodogram: the LF band from [0.04 to 0.15 Hz], and the HF band from [0.15 to 0.4 Hz]. Details about these metrics can be found in [20]. Both metrics are converted to normalized units by dividing them by the sum of LF and HF.
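The computations above can be sketched as follows (Python/SciPy). The 2.5 Hz interpolation, power-of-two zero padding, and band limits follow the text, while details such as mean removal before Welch's method are assumptions:

```python
import numpy as np
from scipy.signal import welch

def hrv_metrics(beat_times):
    """RMSSD (Eq. 1) and normalized LF/HF powers from beat times (s)."""
    ibis_ms = np.diff(beat_times) * 1000.0           # inter-beat intervals
    rmssd = np.sqrt(np.mean(np.diff(ibis_ms) ** 2))  # Eq. (1)

    # Welch's method expects evenly sampled data: interpolate at 2.5 Hz.
    fs = 2.5
    t_ibi = beat_times[1:]                           # timestamp of each IBI
    t_even = np.arange(t_ibi[0], t_ibi[-1], 1.0 / fs)
    ibi_even = np.interp(t_even, t_ibi, ibis_ms)
    nfft = int(2 ** np.ceil(np.log2(ibi_even.size)))  # zero pad to 2^k
    f, pxx = welch(ibi_even - ibi_even.mean(), fs=fs, nfft=nfft)

    # Total power per band region of the periodogram.
    lf_band = (f >= 0.04) & (f < 0.15)
    hf_band = (f >= 0.15) & (f < 0.40)
    lf = np.trapz(pxx[lf_band], f[lf_band])
    hf = np.trapz(pxx[hf_band], f[hf_band])
    return rmssd, lf / (lf + hf), hf / (lf + hf), lf / hf
```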
4. Experiments and Results
Some example frames from the datasets used in this paper can be seen in Figure 3.

Figure 3: Examples of images from (left to right) the StableSet, VicarPPG, PURE, and MAHNOB-HCI datasets. The example from the VicarPPG dataset shows the RoI overlaid on the face. The subjects in StableSet were physically stabilized using the shown chin rest (face removed for privacy reasons). The subjects in PURE perform deliberate head movements. The videos from MAHNOB-HCI suffer from high compression noise.
StableSet rPPG dataset
To make a proof-of-concept test of our proposed rPPG method, we recorded the StableSet rPPG dataset.* This video dataset consists of 24 subjects recorded at 25 fps in 1920 × 1080 resolution, along with ground-truth measurements. The subjects were recorded while watching emotion-inducing video stimuli as well as playing the game of Tetris at varying difficulty levels. This was done with the intention of inducing HRV changes.

* As part of research work at the Human-Technology Interaction group, Eindhoven University of Technology. We acknowledge and thank dr. Daniel Lakens and the research team for their valuable contributions.
VicarVision rPPG dataset - VicarPPG
The VicarPPG dataset [28] consists of 20 video recordings of 10 unrestrained subjects sitting in front of a webcam (Logitech C920). The subjects were recorded under two conditions: at rest while exhibiting stable heart rates, and in a post-workout condition while exhibiting higher heart rates gradually reducing. The videos were originally recorded at 1280 × 720 resolution with an uneven, variable frame rate ranging from as low as ~5 fps up to 30 fps. The frames were later upsampled and interpolated to a fixed 30 fps frame rate video file. The ground truth was obtained via a finger pulse oximeter device (CMS50E).
Pulse Rate Detection Dataset - PURE
The PURE dataset [24] comprises 60 videos of 10 subjects. Every subject is recorded under 6 different conditions with an increasing degree of head movements, including talking, slow and fast translation, and small and large rotation. The videos were recorded at 30 fps in 640 × 480 resolution with no compression (lossless), and the ground truth was obtained via a pulse oximeter (CMS50E).
MAHNOB-HCI Tagging rPPG Dataset
This dataset consists of 527 videos of 27 subjects, along with a 256 Hz ECG ground truth recording. The videos were recorded at 61 fps in 780 × 580 resolution and compressed to a high degree. To make our results comparable to previous work, we extract the same 30 second video duration from these videos (frames [306 - 2135]) and only analyse these.
To assess the heart rate estimation accuracy of our method, we measure the deviation of the predicted heart rates from the ground truth in all the datasets. We express this deviation in terms of the mean absolute error (MAE) metric in beats per minute (bpm). This metric is the average of the absolute differences between the predicted and true average heart rates obtained within a set time window. To make a fair comparison with other work, we set different time-window sizes for MAE computation per dataset to match the ones used in prior work: 8.5 secs, 15 secs, 30 secs, and 30 secs on StableSet, VicarPPG, PURE, and MAHNOB-HCI respectively. It should be noted that while shorter time-windows require more precise estimation, the choice of this window size did not affect our results significantly.

The summarised results of this heart rate analysis in comparison with other work can be seen in Table 1 and Figure 4, while some qualitative examples can be seen in Figure 5. On the StableSet, our proposed method obtains a low error rate of 1.02 bpm. This high accuracy can be attributed to the fact that the subjects' movements were physically stabilized via a chin-rest (see Figure 3). On the VicarPPG and the PURE datasets, our method outperforms all previous methods by a large margin. This is in spite of the subjects being unrestrained, exhibiting a wide range of changing heart rates, and performing a variety of large and small head movements. The very high accuracy of 0.3 bpm on the PURE dataset can be attributed to the fact that the videos were losslessly encoded and had no compression noise. All the noise was caused purely by head movements, and this was the main failure point of other methods. Our method is able to filter out this motion noise significantly well. Conversely, on closer analysis, we found that the relatively higher average error on the VicarPPG dataset was primarily due to some segments in the videos having very low effective frame rates (~5 fps). This low frame rate approaches the Nyquist limit for human heart rate analysis, a theoretical limitation that says the sampling frequency must be at least twice the highest frequency to be measured. If videos severely suffering from such low frame rate artefacts are excluded (namely 06 WORKOUT and 08 WORKOUT), the error rate drops further.

Figure 4: Scatter plot of the predicted vs ground truth heart rates. Each point represents one video segment in the dataset: (a) StableSet (8.5s segments); (b) VicarPPG (15s segments); (c) PURE (30s segments); (d) MAHNOB-HCI (30s segments). While a high correlation between the ground truth and estimated heart rates can be seen in the first three datasets, the results on MAHNOB-HCI are worse. This can be attributed to its high compression noise.
Methods | StableSet | VicarPPG | PURE | MAHNOB-HCI
Baseline (mean) | ± 6.01 | ± 7.36 | ± 17 | ± 13.9
Signal processing methods:
Ours | 1.02 ± 1.4 | ± 6.32 | 0.3 ± 0.51 | ± 11.89
Basic/EVM [35][28] | - | ± 10.1 | - | -
Tasli et al. [28] | - | ± 7.7 | - | -
ICA [17][7] | - | - | 24.1 ± 30.9 * | -
NMD-HR [7] | - | - | 8.68 ± 24.1 * | -
[34][22] | - | - | 2.44 | 13.84
CHROM [5][22] | - | - | 2.07 | 13.49
LiCVPR [12][22] | - | - | 28.2 | 7.41
Deep learning methods:
HR-CNN [22] | - | - | 1.84 * | 7.25
DeepPhys [4] | - | - | - | 4.57
Head-motion based methods:
Head-motion [8] | - | - | - | ≤

Table 1: A comparison of the performances of various methods in terms of the mean absolute error in beats per minute (bpm). Baseline (mean) represents the accuracy obtained by only predicting the mean heart rate of the dataset. * represents accuracies obtained on slight variations of the full dataset: ICA [17][7], NMD-HR [7], and HR-CNN [22] were tested on an 8, 8, and 4-person subset of the PURE dataset respectively. ≤ represents root mean squared error, which is always greater than or equal to the mean absolute error. The groupings separate the different categories of methods: signal processing, deep learning, and head-motion based methods. Our proposed method obtains a high accuracy on the StableSet, VicarPPG and PURE datasets and outperforms all previous methods. Accuracy on the MAHNOB-HCI dataset is low, similar to most other signal-processing methods, likely due to its high compression noise. Note that the simple mean-predicting baseline obtains an accuracy quite close to most methods on this dataset.

On the MAHNOB-HCI dataset, we see that, similar to the majority of signal processing methods, our method does not achieve a very good accuracy. An interesting observation is that the accuracy produced by almost all methods is close to that of a dummy baseline method that blindly predicts the mean heart rate of the dataset (~71 bpm) for any input. Apart from [12], only the deep learning based methods perform better. This could be an indication that the high compression noise distorts the pulse information in the spatial averages of skin pixels. Deep learning based methods are able to somewhat overcome this, perhaps by learning to detect and filter out the spatial 'pattern' of such compression noise. In addition, deep learning methods might also be implicitly learning to track ballistic head movements of the subjects, since these are also caused by blood pulsations [3, 23]. In fact, the lowest error rate is obtained by the ballistic head movement based method described in [8]. This method does not rely on colour information at all, and hence its prediction is not affected by the underlying compression noise. This suggests a similar conclusion: simple spatial averaging of pixel values is not sufficient to estimate HR accurately in this dataset due to the high compression noise.

The task of assessing heart rate variability is greatly more noise-sensitive than estimating heart rate. To validate our method on this task, we compute the mean absolute error between the time-domain HRV measure, the root mean square of successive differences (RMSSD), of the predicted heart beats and that of the ground truth. In addition, we also report the mean absolute errors of the frequency-domain metrics: low frequency (LF), high frequency (HF) (in normalized units), and their ratio LF/HF. We evaluate HRV on the StableSet, VicarPPG and PURE datasets, all of which contain videos longer than one minute in duration. The results of this analysis can be seen in Table 2 and Figure 8. Similar to the results of the HR analysis, our method predicts HRV with a good degree of accuracy on all three datasets over the length of the full video (1 min to 2.5 min). Based on the HRV literature [20], and considering that the average human heart rate variability is in the range of [19-75] ms RMSSD, error rates less than ~20 ms RMSSD can be considered acceptably accurate.
Since both the VicarPPG and PURE datasets have physiological and physical condition partitions, it is possible to perform a more in-depth analysis of the HR and HRV results from our experiments. These in-depth individual error rates per condition can be seen in Table 3.
VicarPPG Conditions
It can be noticed that while the proposed method is fairly accurate over all conditions, it estimates heart rates in the post-workout condition less accurately than in the rest condition. This can also be observed in the scatter plot of Figure 4b: while the overall average HR is accurate, there are more outliers in the higher HR region. On closer examination, we found the primary reason for this to be that a much larger number of video segments with a very low variable frame rate were present in the post-workout partition. Such low frame rate artefacts affect the estimation of higher HRs more severely, as the Nyquist frequency requirement is also higher. Figure 5 also shows this for an example video from the VicarPPG dataset: the predicted HR follows the ground truth very well except in the starting segment, which has a higher heart rate (but a low frame rate). If these videos are excluded (06 WORKOUT, 08 WORKOUT), the error rates drop significantly, and so does the discrepancy between the two conditions.

PURE Conditions
The PURE dataset conditions are based on the amount and type of movement the subjects perform. As can be seen in Table 3, our method performs almost equally well in all movement conditions. In fact, even large rotations and fast translations of the face are handled just as well as the Steady condition. This is primarily due to the motion noise suppression step and the translation-robustness of the appearance modelling. We are still able to closely track our regions of interest through the movement. The worst performing condition is Talking. This is somewhat expected, as moving the mouth and jaw deforms the face in a non-trivial manner which the appearance model is unable to adapt well to. In addition, repetitive muscle movements in the face during talking can interfere with the observed colour changes.

Figure 5: Examples of heart rate estimation from all datasets (x-axis: time, y-axis: heart rate; top row: good examples, bottom row: bad examples). When conditions are right, the estimated heart rate is able to follow the ground truth closely. The rare errors in StableSet and PURE are due to incorrect face modelling caused by occlusion (chin-rest) and deformation (talking) respectively. Prediction errors in VicarPPG are mostly in the high HR range due to low frame rate artefacts, while on MAHNOB-HCI they are due to compression noise.

Figure 8: Predicted vs ground truth heart rate variability (HRV) in terms of RMSSD (ms). Each point represents one video in the dataset: (a) StableSet (~2.5 min duration); (b) VicarPPG (~1.5 min duration); (c) PURE (~1 min duration). The estimated HRV shows fairly decent correlation with the ground truth.

HRV Metric | StableSet | VicarPPG | PURE
RMSSD (ms) | ± 9.77 | ± 6.97 | ± 11.9
LF & HF (n.u.) | ± 11.31 | ± 8.80 | ± 8.97
LF/HF | ± 1.23 | ± 0.37 | ± 1.46

Table 2: Heart rate variability computation performance of the proposed method in terms of mean absolute error. Based on the average human heart rate variability range, we see that good accuracies are obtained on all datasets.
StableSet Conditions
Unlike the previous datasets, the conditions in the StableSet do not relate to physical (or physically induced) differences, but to the type of stimuli applied to the subjects while being recorded. These include two emotion-inducing videos (sad and happy), and three stress-inducing Tetris-game playing tasks with increasing levels of difficulty. The average HR and HRV estimations in comparison with the ground truth for each condition are shown in Figure 9. While no significant differences can be seen in the average heart rate measurements of the subjects under different conditions, their HRV measurements show some interesting results: the emotional videos and the higher difficulty Tetris games induce a higher HRV in the subjects when compared to the baseline. These results demonstrate the usefulness of heart rate variability in assessing underlying psychological conditions: HRV is able to highlight differences by showing variations under different conditions while HR stays the same.
Condition | VicarPPG: At Rest | VicarPPG: Post Workout | PURE: Steady | PURE: Talking | PURE: Slow Translation | PURE: Fast Translation | PURE: Small Rotation | PURE: Med. Rotation
HR (bpm) | ± 0.72 | ± 8.87 | ± 0.11 | ± 0.76 | ± 0.06 | ± 0.26 | ± 0.56 | ± 0.7
HRV RMSSD (ms) | ± 8.20 | ± 5.39 | ± 7.73 | ± 15.56 | ± 7.67 | ± 8.46 | ± 6.63 | ± 13.59

Table 3: In-depth results of heart rate and heart rate variability analysis on the VicarPPG and PURE datasets for all conditions. Good accuracy is obtained in all movement conditions in the PURE dataset. Talking performs relatively worse, likely due to incorrect face modelling caused by non-trivial facial deformations. The relatively lower performance in the post-workout condition of VicarPPG is due to low frame rate artefacts in the videos.

Figure 9: Comparison of the HRV measure (LF/HF) for all conditions of the StableSet dataset. While the heart rates across all conditions remain constant, HRV is higher for emotional video stimuli and difficult Tetris tasks relative to baselines. This shows the utility of HRV over HR for measuring psychological conditions like stress.

For real-time application, processing speed is just as vital as prediction accuracy. The average CPU processing times of our method and its individual components are listed in Table 4 (on an Intel Xeon E5-1620). We see that the method is comfortably able to perform the full analysis at a good real-time speed for a video resolution of 640 × 480.
Face Finding & Modelling (ms) | Skin Pixel Selection (ms) | rPPG Algorithm (ms) | Total (ms) | Frame Rate
± 17.2 | ± 0.2 | ± 0.2 | ± 18.8 | ~ fps

Table 4: The processing speed of individual components of the proposed pipeline and the total frame rate for an input video (640 × 480 resolution). Note that the bottleneck in the pipeline is face finding and modelling, while the rest require negligible time.
5. Discussion
We were able to obtain successful and promising results from our appearance modelling and signal-processing based rPPG method. The results show that this method is able to obtain high accuracies, surpassing the state-of-the-art on two public datasets (VicarPPG and PURE). We showed that small/large movements and rapidly changing heart rate conditions do not degrade performance. This can be attributed to the appearance modelling and noise suppression steps in the pipeline. Only non-trivial facial deformations (e.g. during talking) proved slightly challenging, but the method still produced sub-1 bpm error rates in these conditions.

This high accuracy was obtained while being efficient: the method's computational costs were very low, and the full pipeline could be executed in real-time on CPUs. This is in contrast to deep learning based methods, whose larger models can potentially form computational bottlenecks in such pipelines. This efficiency is attained by taking advantage of our prior domain knowledge about the rPPG process, which the deep learning methods have to spend computational resources to learn and execute.

This high precision in estimating the pulse signal enables the measurement of heart rate variability (HRV), whose computation is sensitive to noise. HRV is a useful measure: as shown in the results, it can indicate underlying physio/psychological conditions (like stress) where HR is unable to show any difference.

A limitation of this method was observed in the analysis of videos with very high compression rates. The resultant noise distorts the pulse signal almost completely when employing spatial averaging techniques. Deep learning methods like HR-CNN [22] have shown better results in this setting, while they fail to match our method in cases with lower compression. This could be because the network is able to learn the spatial patterns of this compression noise and filter it out, as well as track ballistic head movements and infer heart rate from them. In contrast, in lower compression cases, our prior domain knowledge assumptions perform more accurately. While this makes our method well suited for modern videos, deep learning might be better suited for processing archival videos, which are often subjected to higher compression.
6. Conclusion
This paper demonstrates a refined and efficient appearance modelling and signal processing based pipeline for remote photo-plethysmography. This method is able to estimate both heart rate and heart rate variability using cameras at real-time speeds. This method was validated on multiple public datasets, and state-of-the-art results were obtained on the VicarPPG and PURE datasets. We verify that this method is able to perform equally well under varying movement and physiological conditions. However, while the estimations are precise under ordinary video-compression conditions, high levels of compression noise degrade the accuracy.

References

[1] K. Alghoul, S. Alharthi, H. Al Osman, and A. El Saddik. Heart rate variability extraction from video signals: ICA vs. EVM comparison. IEEE Access, 5:4711-4719, 2017.
[2] J. Allen. Photoplethysmography and its application in clinical physiological measurement. Physiological Measurement, 28(3):R1, 2007.
[3] G. Balakrishnan, F. Durand, and J. Guttag. Detecting pulse from head motions in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3430-3437, 2013.
[4] W. Chen and D. McDuff. DeepPhys: Video-based physiological measurement using convolutional attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 349-365. Springer, Cham, 2018.
[5] G. De Haan and V. Jeanne. Robust pulse rate from chrominance-based rPPG. IEEE Transactions on Biomedical Engineering, 60(10):2878-2886, 2013.
[6] G. De Haan and A. Van Leest. Improved motion robustness of remote-PPG by using the blood volume pulse signature. Physiological Measurement, 35(9):1913, 2014.
[7] H. Demirezen and C. E. Erdem. Remote photoplethysmography using nonlinear mode decomposition. Pages 1060-1064. IEEE, 2018.
[8] M. A. Haque, R. Irani, K. Nasrollahi, and T. B. Moeslund. Heartbeat rate measurement from facial video. IEEE Intelligent Systems, 31(3):40-48, May 2016.
[9] M. Hassan, A. S. Malik, D. Fofi, N. Saad, B. Karasfi, Y. S. Ali, and F. Mériaudeau. Heart rate estimation using facial video: A review. Biomedical Signal Processing and Control, 38:346-360, 2017.
[10] R.-Y. Huang and L.-R. Dung. Measurement of heart rate variability using off-the-shelf smart phones. BioMedical Engineering OnLine, 15(1):11, 2016.
[11] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks, pages 255-258. MIT Press, Cambridge, MA, USA, 1998.
[12] X. Li, J. Chen, G. Zhao, and M. Pietikainen. Remote heart rate measurement from face videos under realistic situations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4264-4271, 2014.
[13] M. Malik, J. T. Bigger, A. J. Camm, R. E. Kleiger, A. Malliani, A. J. Moss, and P. J. Schwartz. Heart rate variability: Standards of measurement, physiological interpretation, and clinical use. European Heart Journal, 17(3):354-381, 1996.
[14] D. McDuff, S. Gontarek, and R. W. Picard. Improvements in remote cardiopulmonary measurement using a five band digital camera. IEEE Transactions on Biomedical Engineering, 61(10):2593-2601, 2014.
[15] D. McDuff, S. Gontarek, and R. W. Picard. Remote detection of photoplethysmographic systolic and diastolic peaks using a digital camera. IEEE Transactions on Biomedical Engineering, 61(12):2948-2954, 2014.
[16] M.-Z. Poh, D. J. McDuff, and R. W. Picard. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Transactions on Biomedical Engineering, 58(1):7-11, 2010.
[17] M.-Z. Poh, D. J. McDuff, and R. W. Picard. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Optics Express, 18(10):10762-10774, 2010.
[18] A. M. Rodríguez and J. Ramos-Castro. Video pulse rate variability analysis in stationary and motion conditions. BioMedical Engineering OnLine, 17(1):11, 2018.
[19] P. V. Rouast, M. T. Adam, R. Chiong, D. Cornforth, and E. Lux. Remote heart rate measurement using low-cost RGB face video: a technical literature review. Frontiers of Computer Science, 12(5):858-872, 2018.
[20] F. Shaffer and J. Ginsberg. An overview of heart rate variability metrics and norms. Frontiers in Public Health, 5:258, 2017.
[21] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic. A multi-modal affective database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing, 3(1), 2012.
[22] R. Spetlik, J. Cech, V. Franc, and J. Matas. Visual heart rate estimation with convolutional neural network. In British Machine Vision Conference (BMVC), 2018.
[23] I. Starr, A. Rawson, H. Schroeder, and N. Joseph. Studies on the estimation of cardiac output in man, and of abnormalities in cardiac function, from the heart's recoil and the blood's impacts; the ballistocardiogram. American Journal of Physiology-Legacy Content, 127(1):1-28, 1939.
[24] R. Stricker, S. Müller, and H.-M. Gross. Non-contact video-based pulse rate measurement on a mobile service robot. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication (Ro-Man), pages 1056-1062. IEEE, 2014.
[25] Y. Sun, S. Hu, V. Azorin-Peris, R. Kalawsky, and S. E. Greenwald. Noncontact imaging photoplethysmography to effectively access pulse rate variability. Journal of Biomedical Optics, 18(6):061205, 2012.
[26] Y. Sun and N. Thakor. Photoplethysmography revisited: from contact to noncontact, from point to imaging. IEEE Transactions on Biomedical Engineering, 63(3):463-477, 2015.
[27] M. P. Tarvainen, P. O. Ranta-Aho, and P. A. Karjalainen. An advanced detrending method with application to HRV analysis. IEEE Transactions on Biomedical Engineering, 49(2):172-175, 2002.
[28] H. E. Tasli, A. Gudi, and M. den Uyl. Remote PPG based vital sign measurement using adaptive facial regions. In IEEE International Conference on Image Processing (ICIP), pages 1410-1414. IEEE, 2014.
[29] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2396-2404, 2016.
[30] H. Van Kuilenburg, M. Wiering, and M. Den Uyl. A model based method for automatic facial expression recognition. In European Conference on Machine Learning, pages 194-205. Springer, 2005.
[31] W. Verkruysse, L. O. Svaasand, and J. S. Nelson. Remote plethysmographic imaging using ambient light. Optics Express, 16(26):21434-21445, 2008.
[32] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, Dec 2001.
[33] W. Wang, A. C. den Brinker, S. Stuijk, and G. de Haan. Algorithmic principles of remote PPG. IEEE Transactions on Biomedical Engineering, 64(7):1479-1491, 2016.
[34] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and W. T. Freeman. Eulerian video magnification for revealing subtle changes in the world. ACM Transactions on Graphics (Proc. SIGGRAPH), 31(4), 2012.
[35] Y. Zhang, S. L. Pintea, and J. C. Van Gemert. Video acceleration magnification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529-537, 2017.
[36] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision (ECCV), 2014.