Weighing Counts: Sequential Crowd Counting by Reinforcement Learning
Liang Liu, Hao Lu, Hongwei Zou, Haipeng Xiong, Zhiguo Cao, Chunhua Shen
School of Artificial Intelligence & Automation, Huazhong University of Science & Technology, China
The University of Adelaide, Australia
{wings, zgcao}@hust.edu.cn

Abstract.
We formulate counting as a sequential decision problem and present a novel crowd counting model solvable by deep reinforcement learning. In contrast to existing counting models that directly output count values, we divide one-step estimation into a sequence of much easier and more tractable sub-decision problems. Such a sequential decision nature corresponds exactly to a physical process in reality: scale weighing. Inspired by scale weighing, we propose a novel 'counting scale' termed LibraNet, where the count value is analogized by weight. By virtually placing a crowd image on one side of a scale, LibraNet (agent) sequentially learns to place appropriate weights on the other side to match the crowd count. At each step, LibraNet chooses one weight (action) from the weight box (the pre-defined action pool) according to the current crowd image features and the weights placed on the scale pan (state). LibraNet is required to learn to balance the scale according to the feedback of the needle (Q values). We show that LibraNet exactly implements scale weighing by visualizing the decision process of how LibraNet chooses actions. Extensive experiments demonstrate the effectiveness of our design choices and report state-of-the-art results on several crowd counting benchmarks, including ShanghaiTech, UCF CC 50 and UCF-QNRF. We also demonstrate good cross-dataset generalization of LibraNet. Code and models are made available at https://git.io/libranet
Keywords: Crowd Counting · Reinforcement Learning
1 Introduction

Counting is a sequential decision process by nature. Dense object counts are not inferred by humans with a simple glance [4]. Instead, humans count objects in a sequential manner, with initial fast counting on apparent objects (large sizes and clear appearance) and gradually slow counting on objects that are hard to
L. Liu and H. Lu contributed equally. Z. Cao is the corresponding author. Part of this work was done when L. Liu was visiting The University of Adelaide.
Accepted to Proc. European Conf. Computer Vision 2020.

Fig. 1.
Counting scale. We implement crowd counting as scale weighing. By virtually placing a crowd image (with 7 people) on the scale, if a 10 g weight is placed on the scale pan, the scale leans to the right; if the 10 g weight is exchanged for a 5 g one, the scale instead leans to the left. Finally, by adding another 2 g weight, the scale is balanced. The total weight on the scale can therefore indicate the number of people in the crowd.

recognize (small sizes or blurred appearance). Such a sequential decision behavior can be modeled by a physical process in reality: scale weighing. In scale weighing, it is easy to choose a weight when the weights placed on the scale are far from the true weight of the stuff. When the placed weights are close to the true weight, small and light weights are carefully chosen until the needle indicates the balance. This process decomposes a difficult problem into a series of much more tractable sub-problems.

Following the same spirit of human counting and scale weighing, we formulate counting as a sequential decision problem and implement it as scale weighing. Indeed, counting objects is like weighing stuff. In the context of crowd counting shown in Fig. 1, the 'stuff' is a crowd image, and the 'weights' are a series of pre-defined value operators. We repeatedly choose counting 'weights' to approximate the ground-truth counts until the scale is balanced. The final image count is simply a summation of the placed 'weights'.

The sequential decision nature of scale weighing makes it suitable to be described as a reinforcement learning (RL) task. We hence propose a Deep Q-Network (DQN) [29]-based solution, LibraNet, to implement scale weighing and apply it to crowd counting as a 'counting scale'. In particular, given a 'stuff', LibraNet outputs a combination of weights step by step.
In each step, a weight (action) is chosen from the weight box (the pre-defined action pool) or removed from the scale pan according to the feedback of the needle (Q values that indicate how to choose the next action). The weighing process continues until LibraNet chooses the 'end' operator. The 'stuff' is the image feature encoded from a crowd image, and the 'end' condition is met when the summation of the weights equals or approximates the ground-truth people count.

We visualize how LibraNet (whose naming is inspired by the zodiac sign) works and illustrate that LibraNet exactly implements scale weighing. We show through extensive experiments why our choices in designing the reward function work well, that LibraNet can be used as a plug-in to existing local count models [47,19], and that LibraNet achieves state-of-the-art performance on three crowd counting datasets, including ShanghaiTech [52],
UCF CC 50 [11], and UCF-QNRF [12]. We also report cross-dataset performance to verify the generalization of LibraNet.

In summary, we show that counting can be interpreted as scale weighing, and we implement scale weighing with LibraNet. To our knowledge, LibraNet is the first approach that uses RL techniques to solve crowd counting.

1.1 Related Work

Crowd Counting. Crowd counting is often tackled as a dense prediction task [24,25]. Solutions range from early attempts that detect pedestrians [7], regress image counts [3], estimate density maps [16], and predict localized counts [5], to recent deep learning-based density map estimation [17], redundant count regression [6,23], instance blob localization [15] and count interval classification [19,48].

Since detection typically failed on small and dense people, regression-based approaches [3,32] were proposed. While early methods alleviated the issues of occlusion and clutter, they ignored spatial information because only the global image count was regressed. This situation eased when the concept of the density map was introduced in [16]. Chen et al. [5] also introduced localized count regression by mining local feature importance and sharing visual information among spatial regions.

With the success of Deep Convolutional Neural Networks (DCNNs), deep crowd counting models emerged. [45] applied a CNN to crowd counting by global count regression. [51] presented a switchable training scheme to estimate the density map and the global count. By contrast, the works of [6,23] adopted redundant counting, where local patches were densely sampled in a sliding-window manner during training, and the image count was obtained by normalizing redundant local counts at inference time. The authors of [20] employed a CRF-based structured feature enhancement module and a dilated multiscale structural similarity loss to address scale variations of crowds.
To alleviate perspective distortion, the work in [35] integrated perspective information into density regression and proposed PACNN for efficient crowd counting. In [15], a network is trained to output a single blob for each person for localization. The work in [44] optimized a residual signal to refine the density map. Instead of direct regression, the authors of [19,48] reformulated counting as a classification problem by discretizing local counts and classifying count intervals.

Most existing models generate crowd counts in one step. This renders difficulties in correcting under- or over-estimated counts. Although one method recurrently refines the density map with a spatial transformer network [21], it does not decompose a hard task into a sequence of easy sub-tasks and thus does not fully leverage the advantage of sequential counting.
Deep Reinforcement Learning. RL [8,31] is one of the fundamental machine learning paradigms. It includes several elements, namely agent, environment, policy, state, action, and reward. It aims to learn policies such that an agent receives the maximum reward when interacting with the environment. Since the work of [28] introduced deep learning into RL, it has received extensive
studies [27,29,34,46]. In particular, RL achieved breakthroughs in a few areas such as Go [37] and real-time strategy games [30,43]. Recently, some deep RL-based methods were also proposed to tackle computer vision tasks, such as object localization [2] and instance segmentation [1]. However, these RL practices in computer vision cannot be directly transferred to crowd counting. A main reason is that there is no principled way to reformulate counting into a sequential decision problem suitable for RL. Inspired by scale weighing, we fill this gap and present the first deep RL-based approach to crowd counting.

Fig. 2. Overview of LibraNet. A CNN backbone first extracts the feature map FV_I of an input image I; each element FV_I^i of FV_I is then sent to a DQN. In the DQN, FV_I^i and a weighing vector W_t^i are concatenated and sent to a 2-layer MLP. The output of the MLP is a 9-dimensional Q-value vector. We choose the action with the maximum Q value and update W_{t+1}^i per Eq. (4). This process repeats until the model chooses 'end' or exceeds the predefined maximum step. The output action vectors can be converted to count intervals by Eq. (5), and the intervals are further remapped to a count map with inverse discretization [19]. The image count of I is acquired by summing the count map.

2 LibraNet

Here we explain LibraNet in detail. Sec. 2.1 introduces the formulation of sequential counting. Sec. 2.2 shows how to deal with this sequential task with Q-learning. Sec. 2.3 explains the network architecture, and Sec. 2.4 presents implementation details. An overview of our method is shown in Fig. 2.
2.1 Formulation of Sequential Counting

While most deep counting networks treat density maps as the regression target [12,26,33,45,52], another line of work pursues the idea of local count modeling and also reports promising results [6,19,23,48]. LibraNet follows this local count paradigm but operates in a sequential manner. In what follows, we present a generalized perspective of local count modeling and show how we reformulate it into sequential learning.
Local Count Regression. Some previous works [6,23,47] consider counting a problem of local count regression, which densely samples an image into a series of local patches and then estimates the per-patch count directly. It amounts to the following optimization problem:

$$\min_\theta \sum_{i \in I} \left| G(i) - N_R^\theta(i) \right| \,, \quad (1)$$

where I is the input image, i denotes a local patch sampled from I, G(i) returns the ground-truth count given i, and N_R^θ is a regression network parameterized by θ.

Local Count Classification. Inspired by local count regression, counting is further formulated as a classification problem [19,48] where local patch counts are discretized into count intervals. This process is defined by:

$$\min_\theta \sum_{i \in I} \left| G(i) - \mathrm{ID}\Big( \arg\max_c N_C^\theta(i, c) \Big) \right| \,, \quad (2)$$

where N_C^θ is a classification network parameterized by θ, c indexes the count intervals, and ID(·) defines an inverse-discretization procedure that recovers the count value from the count interval [19]. More details about discretization and inverse-discretization can be found in the Supplementary Material.

Local Counting by Sequential Decision. Motivated by scale weighing, counting can be transformed into a sequential decision task. We call this a weighing task. Instead of estimating a count value or a count interval directly, the weighing task sequentially chooses a value operation in each step from a pre-defined action pool. The sequential process terminates when the agent chooses the 'end' operation or exceeds the maximum step allowed. This task is defined by:

$$\min_\theta \sum_{i \in I} \left| G(i) - \sum_{t=0}^{T_e} \arg\max_a N_E^\theta\big(i, W_t^i, a\big) \right| \,, \quad (3)$$

where N_E^θ is a sequential decision network parameterized by θ, and a is one of the pre-defined value operations.
T_e = min(t_m, t_e) is the ending step, where t_m is the maximum step and t_e is the step at which the ending operation is chosen. W_t^i is the weight vector that records the chosen weights and is initialized as an all-zero vector. W_t^i takes the form

$$W_{t+1}^i = \begin{cases} \{0, 0, 0, \ldots\} & \text{if } t = 0 \\ W_t^i \,\sharp\, a_t & \text{otherwise} \end{cases} \,, \quad (4)$$

where a_t is the operation chosen at step t, and ♯ is a weight updating operator (see also Eq. (7)). At step T, the count V_T^i of the patch i takes the form

$$V_T^i = \sum_{t=0}^{T} \arg\max_a N_E^\theta\big(i, W_t^i, a\big) = \sum_{t=0}^{T} w_t^i \,, \quad (5)$$

where w_t^i forms W_t^i such that

$$W_t^i = \big(w_0^i, w_1^i, \ldots, w_{t-1}^i, 0, \ldots\big) \,. \quad (6)$$

Overall, the workflow of this weighing task is akin to scale weighing. In each step, the network N_E^θ (scale) evaluates the value difference between the image patch i and the value associated with the weight vector W_t^i (weights); according to the output of the network (needle), the agent chooses an action (add or remove a weight) to adjust V_T^i to approximate the ground-truth patch count G(i) until they are equal (the scale is balanced). We present more details in the sequel.

2.2 Solving the Weighing Task with Q-Learning

We implement Eq. (3) within the framework of Q-learning [29]. The elements of Q-learning include state, action, reward and Q value. They correspond to the scale pan, weights, designed rewards and needle in scale weighing.
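The bookkeeping in Eqs. (4)-(6) can be sketched in a few lines of code. This is an illustrative sketch rather than the authors' implementation; `T_MAX` stands in for the maximum step t_m:

```python
import numpy as np

# Illustrative sketch of Eqs. (4)-(6): the weight vector W stores the value
# operations chosen so far, and the running count is simply their sum.

T_MAX = 8  # stands in for the maximum step t_m

def update_weight_vector(W, t, a_t):
    """Eq. (4)/(7): write the chosen operation a_t into slot t of W."""
    W = W.copy()
    W[t] = a_t
    return W

def accumulated_count(W):
    """Eq. (5): the patch count is the sum of the placed 'weights'."""
    return W.sum()

# Example: start from the all-zero vector and place 10 g, 10 g, then 2 g.
W = np.zeros(T_MAX)
for t, a in enumerate([10, 10, 2]):
    W = update_weight_vector(W, t, a)
print(accumulated_count(W))  # 22.0
```

The sum in Eq. (5) is what makes the analogy to scale weighing exact: the estimate is literally the total of the weights currently on the pan.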
State (Scale Pan). The state depicts the status of the 'two scale pans': the weight vector W_t^i and the image feature FV_I^i. Formally, the state is s = {FV_I^i, W_t^i}. According to [19], the data distribution is often long-tailed in crowd counting datasets with imbalanced samples. Liu et al. [19] show that this issue can be alleviated by quantizing local counts and treating the count intervals as the learning target. We follow this idea to check the balancing condition of the scale.

Action (Weights). In Q-learning, an action is defined to modify the state. Since
FV_I^i is fixed in s once it is extracted, the action is designed to only change W_t^i. We design an action pool in a way similar to the scale weighing system and the money system [42], i.e., a ∈ {−1, −2, −5, −10, +1, +2, +5, +10, end}. It includes 8 value operations and one ending operation (indicating the scale is balanced). Given a new action a_t, W_t^i is modified by the updating operator ♯:

$$W_t^i \,\sharp\, a_t = \{w_0^i, \ldots, w_{t-1}^i, 0, 0, \ldots\} \,\sharp\, a_t = \{w_0^i, \ldots, w_{t-1}^i, a_t, 0, \ldots\} \,. \quad (7)$$

W_t^i thus records which weights are placed on or removed from the scale pan before step t.

Reward Function. A reward scores the value of each action. We define two types of reward: ending reward and intermediate reward. In particular, we use a conventional ending reward and further design three counting-specific rewards: force ending reward, guiding reward, and squeezing reward.

Ending Reward. Following [2], we employ a conventional ending reward to evaluate the value of the 'end' action, defined by

$$R_e(E_{t_e - 1}) = \begin{cases} +\eta_e & \text{if } |E_{t_e - 1}| \le \epsilon \\ -\eta_e & \text{otherwise} \end{cases} \,, \quad (8)$$

where t_e is the step at which the agent chooses the 'end' action, E_{t_e−1} is the absolute error between the ground-truth count G(i) and the accumulated value V_{t_e−1}^i, and ε is the error tolerance. Here η_e = 5 and ε = 0.

Algorithm 1
Training Procedure of LibraNet

1:  Initialize a Buffer ← [ ], the Q-network N_Q^θ, and the backbone network N_b
2:  for each epoch do
3:      Update the target network N_Q^θ̄ ← N_Q^θ
4:      for all images I in the training dataset do
5:          Compute the image feature FV_I ← N_b(I)
6:          for all patches i in image I do
7:              Initialize W_0^i ← {0, 0, ...}
8:              Fetch the ground-truth patch count G(i)
9:              for t ← 0 to T_e do
10:                 Obtain the state s_t ← {FV_I^i, W_t^i}
11:                 Compute the Q value Q_t ← N_Q^θ(s_t)
12:                 Choose an action a_t with the ε-greedy policy
13:                 Compute the reward r according to Sec. 2.2
14:                 Update W_{t+1}^i per Eq. (4)
15:                 Obtain the next state s_{t+1} ← {FV_I^i, W_{t+1}^i}
16:                 Buffer ← (s_t, a_t, s_{t+1}, r)
17:             end for
18:         end for
19:         Sample a batch B from the Buffer to train N_Q^θ per Eq. (16)
20:     end for
21: end for

Considering that it is hard for the agent to choose the 'end' action because of the huge search space, the agent is forced to stop when it exceeds the maximum step allowed. This is described by the force ending reward

$$R_{fe}(E_{t_m}) = \begin{cases} +\eta_e & \text{if } |E_{t_m}| \le \epsilon \\ -\eta_e & \text{otherwise} \end{cases} \,, \quad (9)$$

where E_{t_m} is the absolute error at the maximum step t_m.

Intermediate Reward. In previous works [2,14] that employ deep RL for object localization, an intermediate reward is simply given according to the change of IoU. In counting, an optimal action can be computed to reach the balancing state faster. We thus introduce a guiding reward to push the agent to choose the optimal action, defined by

$$R_g(E_t, E_{t-1}, a_t, a_{gt}) = \begin{cases} \eta_g & \text{if } a_t = a_{gt} \\ \eta_+ & \text{if } E_t < E_{t-1} \\ \eta_- & \text{otherwise} \end{cases} \,, \quad (10)$$

where a_t is the action chosen at step t, and a_{gt} is the optimal action, given by

$$a_{gt} = \arg\min_a \left| G(i) - \big( V_{t-1}^i + a \big) \right| \,. \quad (11)$$

In our implementation, η_g = +3, η_+ = +1, and η_− = −1.

Squeezing Reward. Because of the huge search space, the agent cannot search for actions smoothly. To constrain the estimated value and reach the balancing state faster, we propose a squeezing reward, defined by

$$R_s = \begin{cases} R_g(E_t, E_{t-1}, a_t, a_{gt}) & \text{if } S\big(V_t^i, G(i)\big) = 1 \\ R_{sg}(E_t, E_{t-1}, a_t, a_{gt}) & \text{otherwise} \end{cases} \,, \quad (12)$$

where R_g is the guiding reward (Eq. (10)), and S(V_t^i, G(i)) decides whether V_t^i is out of the tolerance range:

$$S\big(V_t^i, G(i)\big) = \mathrm{sign}\Big( G(i) \times \epsilon - \big( V_t^i - G(i) \big) \Big) \,, \quad (13)$$

where ε is a tolerance range (its sensitivity is analyzed in Table 6). When S(V_t^i, G(i)) = −1, we leverage a squeezed guiding reward to squeeze the estimation into the tolerance range, defined by

$$R_{sg}(E_t, E_{t-1}, a_t, a_{gt}) = \begin{cases} \eta_{sg} & \text{if } a_t = a_{gt} \\ \eta_s & \text{otherwise} \end{cases} \,, \quad (14)$$

where η_{sg} = −1 and η_s = −3. Notice that in this reward function all rewards are negative, so that the agent is encouraged to avoid choosing an action sequence that leads to overestimation.
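Putting Eqs. (8)-(14) together, the reward scheme can be sketched as below. This is a minimal illustration, not the released code; the constants follow the paper, while the tolerance argument `tol` of the squeezing check is an assumed placeholder for ε in Eq. (13).

```python
# Sketch of the reward scheme in Eqs. (8)-(14). Constants follow the paper
# (eta_e = 5, eta_g = +3, eta_+ = +1, eta_- = -1, eta_sg = -1, eta_s = -3);
# `tol` in squeezing_reward is an assumed placeholder for the tolerance.

ACTIONS = [-10, -5, -2, -1, 1, 2, 5, 10]   # the 8 value operations
ETA_E, ETA_G, ETA_P, ETA_N = 5, 3, 1, -1
ETA_SG, ETA_S = -1, -3

def ending_reward(err, eps=0):
    """Eq. (8)/(9): +eta_e if balanced within tolerance eps, else -eta_e."""
    return ETA_E if abs(err) <= eps else -ETA_E

def optimal_action(gt, v_prev):
    """Eq. (11): the value operation that best closes the remaining gap."""
    return min(ACTIONS, key=lambda a: abs(gt - (v_prev + a)))

def guiding_reward(err_t, err_prev, a_t, a_gt):
    """Eq. (10): reward the optimal action most, then any error-decreasing one."""
    if a_t == a_gt:
        return ETA_G
    return ETA_P if err_t < err_prev else ETA_N

def squeezing_reward(err_t, err_prev, a_t, a_gt, v_t, gt, tol=0.5):
    """Eq. (12): fall back to the all-negative squeezed reward (Eq. (14))
    once the estimate overshoots the tolerance range (sign test, Eq. (13))."""
    within = (gt * tol - (v_t - gt)) > 0
    if within:
        return guiding_reward(err_t, err_prev, a_t, a_gt)
    return ETA_SG if a_t == a_gt else ETA_S

# Example: ground truth 7, current value 5 -> the optimal action is +2.
print(optimal_action(7, 5))  # 2
```

Note how the squeezed branch returns only negative rewards, matching the paper's intent that overshooting sequences are discouraged even when individual actions are locally optimal.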
Q Values (Needle). In Q-learning, the Q value of an action estimates the accumulated reward after this action is taken, which takes the form

$$Q(s_t, a_t) = \begin{cases} r & \text{if } a_t = \text{'end'} \text{ or } t = t_m \\ r + \gamma \max_{a'} Q(s_{t+1}, a') & \text{otherwise} \end{cases} \,, \quad (15)$$

where r is the reward coming from either R_e, R_fe, R_g or R_sg, the next state s_{t+1} is acquired after the action a_t is taken at the present state s_t, and γ is the reward discount factor.

2.3 Network Architecture

Here we give an overview of LibraNet (Fig. 2). LibraNet consists of two parts: a feature extraction backbone and a DQN. The backbone comprises the 5 convolutional blocks of VGG16 [38]. It extracts the feature map
FV_I of an image I. Each spatial feature vector FV_I^i in FV_I and its weight vector W_t^i correspond to a 32 × 32 block in the original image. The backbone uses the model trained by [19] and is fixed when training the DQN.

The core of LibraNet is the DQN. Its input is FV_I^i and W_t^i. In each step of the training stage, FV_I^i and W_t^i are concatenated and sent to a two-layer multi-layer perceptron (MLP) with 1024-dimensional hidden units in each layer, and the DQN outputs a 9-dimensional Q-value vector Q_t. An action a_t chosen by the ε-greedy policy (Sec. 2.4) is then applied to W_t^i to obtain W_{t+1}^i (Eq. (4)). The estimation repeats until the 'end' action is chosen or t_m steps are exceeded. The output of the DQN is the weighing vector W_{T_e}^i for each patch i. When the weighing task terminates, V_{T_e}^i is computed according to Eq. (5).

At inference, the agent chooses the action with the maximal Q value to obtain the weighing vector W_{T_e}^i and the weighing value V_{T_e}^i of each patch. Notice that V_{T_e}^i is still a quantized count interval; it needs to be further mapped to a count value with a class-count look-up table [19]. Finally, we sum all patch counts to obtain the image count.

2.4 Implementation Details

Following [29], we use a replay memory buffer [18] to remove correlations in the weighing process. We follow the standard DQN [29] structure, which has a Q-network and a target network. The target network, which computes the target Q value (max_{a'} Q(s_{t+1}, a') + r), is fixed when training the Q-network, and we update it at the beginning of each epoch with the parameters of the Q-network. An ℓ1 loss is used for optimization.
The overall loss is defined by

$$\ell = \sum_{(s_t, a_t, s_{t+1}, r) \in U(B)} \left| r + \gamma \max_{a'} N_Q^{\bar\theta}\big(s_{t+1}, a'\big) - N_Q^\theta(s_t, a_t) \right| / N \,, \quad (16)$$

where N_Q is LibraNet, θ and θ̄ are the parameters of the Q-network and the target network, respectively, r is the reward, γ is the discount factor, and N is the number of sampled transitions.

During training, we follow the ε-greedy policy: a random action is chosen with a probability of ε, and the action with the maximum Q value otherwise. ε starts from 1 and decreases to 0.05. To reduce the computation cost, we update the model every time 100 samples are sent to the buffer. Considering that the maximum quantized count interval is less than 80 and the maximum value operation is +10, the maximum step t_m is set to 8. Algorithm 1 summarizes the training flow. We use SGD with a constant learning rate.

Following [17], we crop 9 patches of 1/4 resolution from each image. These patches are mirrored to double the training set. For the UCF-QNRF dataset [12], we follow BL [26] to limit the shorter side of the image to be less than 2048 pixels and crop 512 × 512 patches for training.
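The target computation in Eqs. (15)-(16) can be sketched as below. This is an illustrative NumPy sketch, not the released code; the discount factor `gamma = 0.9` is an assumption here, since its exact value is not stated in this excerpt.

```python
import numpy as np

# Sketch of the one-step Q-learning target (Eq. (15)) and the l1 batch loss
# (Eq. (16)). The Q-networks are abstracted away: q_next stands for the
# target network's Q-value vector at the next state. gamma is an assumption.

def td_target(r, q_next, done, gamma=0.9):
    """Eq. (15): r at a terminal step ('end' or t = t_m), else
    r + gamma * max_a' Q(s_{t+1}, a') from the (fixed) target network."""
    return r if done else r + gamma * np.max(q_next)

def l1_batch_loss(q_pred, targets):
    """Eq. (16): mean absolute (l1) error over a sampled batch."""
    return np.mean(np.abs(np.asarray(q_pred) - np.asarray(targets)))

# Toy batch: two transitions, one of them terminal.
q_next = np.array([0.2, 1.5, -0.3])
y1 = td_target(r=1.0, q_next=q_next, done=False)   # 1 + 0.9 * 1.5
y2 = td_target(r=5.0, q_next=q_next, done=True)    # terminal: just r
print(l1_batch_loss([2.0, 4.5], [y1, y2]))         # mean |TD error|, ~0.425
```

Freezing the target network that supplies `q_next` (updated once per epoch in Algorithm 1) is the standard DQN trick to stabilize this bootstrapped target.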
3 Experiments

Here we validate the effectiveness of LibraNet, visualize the weighing process, compare it against other state-of-the-art methods, demonstrate its cross-dataset generalization, justify each design choice, and show its generality as a plug-in. We report the mean absolute error (MAE) and the (root) mean squared error (MSE).
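For reference, the two metrics can be computed as follows (a standard implementation; note that 'MSE' in the crowd counting literature conventionally denotes the root of the mean squared error):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error over test images."""
    return np.mean(np.abs(np.asarray(pred, float) - np.asarray(gt, float)))

def mse(pred, gt):
    """Root mean squared error (reported as 'MSE' in crowd counting papers)."""
    d = np.asarray(pred, float) - np.asarray(gt, float)
    return np.sqrt(np.mean(d ** 2))

print(mae([10, 20], [12, 26]))  # 4.0
print(mse([10, 20], [12, 26]))  # ~4.472 (sqrt of 20)
```

MSE penalizes large per-image failures more heavily than MAE, which is why the two metrics can rank methods differently in Tables 1 and 2.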
To understand how LibraNet works, we visualize the inference process of one sample in Fig. 3.

Fig. 3. Visualization of the inference process of LibraNet. (upper right) Visualization of action selection. We estimate the count interval for each 32 × 32 patch of the image. The weighing process is shown from t = 0 to t = 7, and the ground-truth count intervals are shown on the right. For each patch, the lower green number is the accumulated value (the count interval), and the upper number is the value operator, including the value-increasing operators (blue), the value-decreasing operators (dull red), and the ending operator 'E' (yellow). (bottom right) Estimated Q values in each step of the upper-left patch. The red point in each step is the Q value of the chosen action.

It can be seen that, in the first several steps, LibraNet tends to choose actions such that the estimation increases rapidly to approximate the ground truth. This is consistent with the target of the guiding reward (Eq. (10)). When the accumulated value is close to the ground truth, LibraNet begins to choose actions with small values. This is similar to how we weigh an object with a scale. Once the accumulated value equals the ground truth, the weighing process terminates. Notice that, even if the maximum step is reached, LibraNet still produces a relatively accurate estimation due to the force ending reward (Eq. (9)).

Interestingly, we find that the agent chooses positive actions more frequently than negative ones, because i) the initial value is 0, and the target count is either 0 or positive, so the agent tends to choose positive actions to approximate the ground truth, and ii) we design a squeezed guiding reward (Eq. (14)) to avoid overestimation; this reward penalizes overestimation and further decreases the frequency of selecting negative actions.

To further analyze why the agent chooses certain actions, we visualize the Q values of the upper-left patch. The ground-truth count interval is 45, and the agent chooses four consecutive +10 actions, three +1 actions and one 'end' action. The final estimated interval is 43. In the first 4 steps, the Q values excluding
'end' are greater than 0 and have a clear distinction. It means that the agent is confident in its
Table 1. Comparison with state-of-the-art approaches on three crowd counting benchmarks. The lowest errors are boldfaced.

Method          | SHT Part A   | SHT Part B  | UCF-QNRF     | UCF CC 50
                | MAE    MSE   | MAE   MSE   | MAE    MSE   | MAE    MSE
DRSAN [21]      | 69.3   96.4  | 11.1  18.2  | —      —     | 219.2  250.2
CSRNet [17]     | 68.2   115.0 | 10.6  16.0  | —      —     | 266.1  397.5
TEDnet [13]     | 64.2   109.1 | 8.2   12.8  | 113    188   | 249.4  354.5
SPN+L2SM [49]   | 64.2   98.4  | 7.2   11.1  | 104.7  173.6 | 188.4  315.3
BCNet [19]      | 62.8   102.0 | 8.6   16.4  | 118    192   | 239.6  322.2
BL [26]         | 62.8   101.8 | 7.7   12.7  | 88.7   154.8 | 229.3  308.2
CAN [22]        | 62.3   100.0 | 7.8   12.2  | 107    183   | 212.2  …
MBTTBF [39]     | 60.2   94.1  | 8.0   15.5  | 97.5   165.2 | 233.1  300.9
PGCNet [50]     | 57.0   …     | …     …     | …      …     | …      …
LibraNet (ours) | …      …     | …     …     | …      …     | …      …

action selection. After four steps, the accumulated value is 40, which is close to the ground truth. In the last 4 steps, the Q values are less than 0, and the differences between actions are small, which implies the agent is aware of its closeness to the ground truth. To avoid overestimation, the agent becomes cautious so as to avoid a significantly wrong decision. Even if the final weighing value does not strictly equal the ground truth, the estimation is unlikely to shift away from it significantly. We can see that LibraNet follows exactly how a scale weighs an object, which means LibraNet indeed learns what we expect it to learn.
We evaluate our method on three public crowd counting benchmarks: ShanghaiTech, UCF CC 50 and UCF-QNRF.

The ShanghaiTech (SHT) dataset [52] includes 1,198 crowd images with 330,165 head annotations. It has two parts: Part A includes 482 images with varying resolutions collected from the Internet; Part B includes 716 images of the same resolution collected from street surveillance videos. In Part A, 300 images are used for training and the other 182 images for testing. Part B adopts 400 images for training and 316 images for testing. Results are shown in Table 1. We compare our method against 10 state-of-the-art methods and report the best MAE on Part A and comparable performance on Part B.

The UCF CC 50 dataset [11] is a challenging crowd counting dataset with only 50 images. By contrast, it contains 63,705 people annotations, so the scenes are extremely congested. We employ 5-fold cross-validation when reporting the results and also compare LibraNet with other state-of-the-art approaches. The results in Table 1 verify that LibraNet outperforms other competitors and reports the best performance in MAE.

The UCF-QNRF dataset [12] is a recent high-resolution crowd counting dataset, which includes 1,535 images with 1,251,642 annotations. The images are officially split into two parts: 1,201 images for training and 334 for testing. We compare LibraNet with 7 recent methods. The results in Table 1 illustrate that our method outperforms the state of the art in both MAE and MSE.
Table 2. Cross-dataset evaluations on the ShanghaiTech (A and B) and UCF-QNRF (QNRF) datasets.

Method          | A→B         | A→QNRF       | B→A          | B→QNRF       | QNRF→A      | QNRF→B
                | MAE   MSE   | MAE    MSE   | MAE    MSE   | MAE    MSE   | MAE   MSE   | MAE   MSE
MCNN [52]       | 85.2  142.3 | —      —     | 221.4  357.8 | —      —     | —     —     | —     —
D-ConvNet [36]  | 49.1  99.2  | —      —     | 140.4  226.1 | —      —     | —     —     | —     —
SPN+L2SM [49]   | 21.2  38.7  | 227.2  405.2 | 126.8  203.9 | —      —     | 73.4  119.4 | —     —
BCNet [19]      | 20.5  37.9  | 131.9  230.6 | 138.6  230.0 | 240.0  419.6 | 71.3  123.7 | 16.1  26.1
BL [26]         | —     —     | —      —     | —      —     | —      —     | 69.8  123.8 | 15.3  26.5
LibraNet (ours) | …     …     | …      …     | …      …     | …      …     | …     …     | …     …
Table 3. Ablation study on the ShanghaiTech Part A dataset.

Method                  | MAE   | MSE
BCNet [19]              | 62.8  | 102.0
Imitation Learning [10] | 64.7  | 102.8
W/O Guiding             | 149.8 | 261.3
W/O Force Ending        | 62.7  | 104.3
W/O Squeezing           | 63.5  | 102.7
Full Designs            | …     | …
Table 4. GAME on the ShanghaiTech Part A dataset.

Method     | GAME0 | GAME1 | GAME2 | GAME3
BCNet [19] | 62.8  | 73.3  | 87.0  | 116.7
LibraNet   | …     | …     | …     | …
To demonstrate the generalization of LibraNet, we conduct cross-dataset experiments by training the model on one dataset and testing it on another. Results are shown in Table 2. LibraNet shows consistently better generalization performance than other competitors across all transfer settings.
Here we validate the basic design choices of LibraNet on the SHT Part A dataset [52]. The results are shown in Table 3.
Local Accuracy. BCNet is the blockwise classification network proposed by [19]. This is our direct baseline, because LibraNet uses the backbone pretrained by [19]. Besides the image-level error, we also report the Grid Average Mean absolute Error (GAME) [9] in Table 4. GAME assesses patch-level counting accuracy. LibraNet outperforms BCNet on all GAME metrics, which suggests that LibraNet generates more locally accurate patch counts than BCNet. We believe this may be the reason why LibraNet significantly reduces the image-level error.
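For clarity, GAME at level L splits the image into 4^L non-overlapping regions and sums the per-region absolute count errors, so higher levels probe local accuracy. A minimal sketch over density maps (not the authors' evaluation code) is:

```python
import numpy as np

# Sketch of the GAME metric [9]: at level L the map is split into a
# 2^L x 2^L grid and absolute count errors are summed per cell.

def game(pred_density, gt_density, level):
    k = 2 ** level
    h, w = gt_density.shape
    err = 0.0
    for r in range(k):
        for c in range(k):
            rs, re = r * h // k, (r + 1) * h // k
            cs, ce = c * w // k, (c + 1) * w // k
            err += abs(pred_density[rs:re, cs:ce].sum()
                       - gt_density[rs:re, cs:ce].sum())
    return err

# GAME(0) is just the absolute image-count error; local errors that cancel
# globally only show up at higher levels.
gt = np.zeros((4, 4)); gt[0, 0] = 3; gt[3, 3] = 1
pred = np.zeros((4, 4)); pred[0, 0] = 2; pred[3, 3] = 2
print(game(pred, gt, 0))  # 0.0 (both maps sum to 4)
print(game(pred, gt, 1))  # 2.0 (errors of 1 in two quadrants)
```

The toy example illustrates why GAME is the right probe here: the two maps have identical global counts, yet GAME(1) exposes the local misplacement.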
Optimal Action. In Sec. 2.2, we compute the optimal action to reach the balancing state faster. Is it sufficient to learn a weighing model that only chooses the optimal action? To justify this, we build another baseline, 'Imitation Learning' [10], with the following optimization target:
$$\max_\theta \sum_{i \in I} \sum_{t=0}^{T_e} \sum_{a=0}^{A_N} \big[\, a = a_{i,t}^g \,\big] \log\Big( N_M^\theta\big(i, W_t^i, a\big) \Big) \,, \quad (17)$$

where a_{i,t}^g is the optimal action (Eq. (11)) at time t in patch i, A_N is the number of pre-defined actions, N_M^θ is a sequential decision network, and N_M^θ(i, W_t^i, a) computes the probability of the a-th action on the i-th patch. In each step, N_M^θ selects the action with the maximum probability. Results in Table 3 show that learning with only the optimal action is insufficient.

Table 5. Sensitivity analysis of the maximum step on the ShanghaiTech Part A dataset.

Table 6. Sensitivity analysis of the tolerance range on the ShanghaiTech Part A dataset.

Table 7. LibraNet as a plug-in.

Method                            | MAE   | MSE
ImageNet Regression               | 156.2 | 259.9
ImageNet Classification           | 140.4 | 230.3
ImageNet Regression+LibraNet      | 126.6 | 211.1
ImageNet Classification+LibraNet  | 119.7 | 203.4
TasselNetv2† [47]                 | 68.6  | 110.2
TasselNetv2†+LibraNet             | 64.7  | 100.6
Blockwise Classification [19]     | 62.8  | 102.0
Blockwise Classification+LibraNet | …     | …

Designed Rewards. From the 3rd to the 5th rows of Table 3, we present ablative studies on the modified rewards. 'W/O Guiding' means training LibraNet without the guiding reward (Eq. (10)), which simply sets +1 for an error-decreasing action and −1 otherwise.

Parameters Sensitivity. To analyze the impact of the maximum action step t_m, we train LibraNet with t_m ranging from 4 to 16 on the SHT Part A dataset. Results are shown in Table 5. When t_m is not sufficient, LibraNet works poorly, because it cannot reach the neighborhood of the ground truth even if the maximum value operation is chosen in each step. We set t_m = 8 in all other experiments. We also evaluate the effect of the tolerance range (ε) in Eq. (12). Results are shown in Table 6. We observe that LibraNet is not sensitive to this parameter, and the best result is achieved when ε = 0.
Furthermore, we analyze the effect of randomness. Following [41], we run LibraNet 6 times on SHT Part A with different random seeds. The MAE is 56.?±?.8, and the MSE is 97.?±?.3, which suggests that LibraNet is not sensitive to randomness.
Execution Speed. Finally, we report the speed of LibraNet on a platform with an RTX 2060 6 GB GPU and an Intel i7-9750H CPU. It takes 158 ms to process a 1080×720 image, including 142 ms for the backbone and 16 ms for LibraNet. This result illustrates that LibraNet introduces only negligible computational cost.
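The sequential weighing process at the heart of LibraNet can be sketched with a greedy oracle standing in for the learned Q-function: at each step the agent places the pool weight that best balances the scale, stopping once the residual falls within the tolerance range or the maximum step budget is exhausted. The symmetric weight pool and parameter values below are illustrative assumptions, not the paper's learned policy:

```python
# Illustrative sketch of sequential weighing (not the learned DQN): a greedy
# oracle places, at each step, the pool weight that best balances the scale.
ACTIONS = [-5, -2, -1, -0.5, -0.2, -0.1, -0.05, -0.02, -0.01,
           0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5]  # assumed symmetric pool

def weigh(target, max_steps=8, tol=0.05):
    """Accumulate weights until |target - placed| <= tol (the tolerance
    range) or max_steps (the maximum action step t_m) is exhausted."""
    placed, history = 0.0, []
    for _ in range(max_steps):
        if abs(target - placed) <= tol:
            break
        best = min(ACTIONS, key=lambda w: abs(target - (placed + w)))
        placed += best
        history.append(best)
    return placed, history

count, steps = weigh(3.27)   # coarse weights first, then fine ones
```

Replacing the greedy choice with an argmax over learned Q-values, and the oracle residual with image features plus already-placed weights (the state), recovers the LibraNet setup described in the main text.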
LibraNet as a Plug-in. To show that LibraNet is a general idea and that pretraining with [19] is not the only option, we here apply LibraNet as a plug-in to other counting/pretrained models. Results are shown in Table 7.

First, we attach LibraNet to a regression baseline and a classification baseline built on ImageNet-pretrained VGG16 [38]. The VGG16 is fixed and concatenated with a trainable 1×1 convolutional regression head or a 1×1 convolutional classification head, respectively. LibraNet improves both baselines (Table 7), even though ImageNet-pretrained CNN features contain little information relevant to counting [40].

The second model is a regression-based blockwise counter, TasselNetv2† [47]. 'TasselNetv2†+LibraNet' means extracting the feature map with the backbone pretrained by TasselNetv2† and then sending it to the DQN to estimate the count. To adapt to regression-based weighing, where count values are continuous, we modify the pre-defined action pool to a = {−5, −2, −1, −0.5, −0.2, −0.1, −0.05, −0.02, −0.01, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5}. Results show that 'TasselNetv2†+LibraNet' outperforms TasselNetv2†, which illustrates that the idea of scale weighing is also effective for regression-based counters.

Conclusion

In this work, we have introduced a novel sequential decision paradigm for crowd counting, inspired by the behavior of human counting and by scale weighing. We implement scale weighing using deep RL and present a new counting model, LibraNet. Experiments verify the effectiveness of LibraNet and explain how it works. For future work, we plan to extend LibraNet to other regression tasks. We believe that scale weighing is a general idea that may not be limited to counting.
Appendix

A Discretization and Inverse-Quantization

In this section, we describe the generation of counting intervals ('Discretization') and the inverse-quantization [19] in detail.
A.1 Counting Intervals Generation
First, given a map of dotted annotations, it is convolved with a Gaussian kernel to compute the density map D(p) [16], which takes the form

D(p) = \sum_{i=1}^{N} \delta(p - D_i) * G_{\sigma_i}(p),    (18)

where p ∈ I is a pixel in the image I, D_i is the i-th dot annotation of I, and G_{\sigma_i} is a Gaussian kernel with variance \sigma_i. In this paper, we employ the adaptive Gaussian kernel [52], whose variance is defined by

\sigma_i = \beta \bar{d}_i,    (19)

where \bar{d}_i is the average distance between the dot point D_i and its 3 nearest dot points, and \beta is a hyperparameter set to 0.3.

Next, the count value N_i of the i-th patch is computed by integrating the density map over the patch [23]

N_i = \sum_{p_b \in P_i} D(p_b),    (20)

where P_i is the i-th patch of the image I.

Finally, the count value is quantized to obtain the count interval C_i [19]

C_i = Q(N_i) = \begin{cases} 0, & N_i = 0 \\ \max\left(\mathrm{floor}\left(\frac{\log(N_i) - l}{w} + 2\right), 1\right), & \text{otherwise} \end{cases}    (21)

where w is the width of the quantized interval in log space, and l is a hyperparameter meaning that the interval (0, e^l) is treated as an independent class [19]. In this paper, we set w = 0.? and l = −?.
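The discretization of Eq. (21) can be sketched as follows; the values of w and l here are illustrative placeholders, not the paper's settings:

```python
import math

def quantize(count, w=0.1, l=-1.0):
    """Eq. (21): map a patch count to a class index. Class 0 is reserved
    for empty patches; counts in (0, e^l) collapse to class 1 via max(., 1);
    above e^l, classes are uniform in log space with width w."""
    if count == 0:
        return 0
    return max(math.floor((math.log(count) - l) / w + 2), 1)

# Empty patch -> class 0; a tiny count in (0, e^l) -> class 1;
# larger counts land in log-spaced intervals with increasing indices.
classes = [quantize(c) for c in (0, 0.05, 0.5, 1.0, 2.0)]
```

The max(·, 1) term realizes the independent class for (0, e^l): any positive count below e^l would otherwise map to an index below 2 and is clamped to class 1.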
During testing, the count value is recovered from the counting interval by the inverse-quantization IQ [19], i.e.,

N_i = IQ(C_i) = \begin{cases} 0, & C_i = 0 \\ \frac{1}{2}\exp\left(l + w(C_i - 1)\right), & C_i = 1 \\ \frac{1}{2}\left[\exp\left(l + w(C_i - 2)\right) + \exp\left(l + w(C_i - 1)\right)\right], & \text{otherwise} \end{cases}    (22)
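A sketch of the inverse-quantization IQ in Eq. (22), again with illustrative placeholder values for w and l (not the paper's): each class maps back to the midpoint of its count interval.

```python
import math

def inv_quantize(c, w=0.1, l=-1.0):
    """Eq. (22): class 0 -> empty patch; class 1 -> midpoint of (0, e^l);
    class c >= 2 -> midpoint of [e^{l+(c-2)w}, e^{l+(c-1)w})."""
    if c == 0:
        return 0.0
    if c == 1:
        return math.exp(l + w * (c - 1)) / 2
    return (math.exp(l + w * (c - 2)) + math.exp(l + w * (c - 1))) / 2
```

Because each class is recovered as an interval midpoint, IQ is monotone in the class index, and composing it with the discretization of Eq. (21) introduces at most half an interval width of quantization error in log space.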
B More Qualitative Results
Fig. 4. Qualitative results of LibraNet on the ShanghaiTech Part A dataset. From left to right: testing images, density maps with ground-truth counts, and our estimated results (GT/EST: 361/366.9, 384/408.7, 218/222.5).
Fig. 5. Qualitative results of LibraNet on the ShanghaiTech Part B dataset. From left to right: testing images, density maps with ground-truth counts, and our estimated results (GT/EST: 23/23.1, 15/13.4, 26/25.6).
Fig. 6. Qualitative results of LibraNet on the UCF CC 50 dataset. From left to right: testing images, density maps with ground-truth counts, and our estimated results (GT/EST: 648/648.3, 1858/1981.1, 1366/1336.1).
Fig. 7. Qualitative results of LibraNet on the UCF-QNRF dataset. From left to right: testing images, density maps with ground-truth counts, and our estimated results (GT/EST: 2164/2079.0, 1613/1505.7, 1668/1578.4).
Fig. 8. Failure cases. The first two cases, from the ShanghaiTech Part B dataset, show that our method suffers from the training bias caused by the long-tailed distribution, leading to under-estimation. The third and fourth rows demonstrate that LibraNet can be affected by illumination variations. The last two rows illustrate that our method fails on blurry appearance. In each row, from left to right: testing images, density maps with ground-truth counts, and our estimated results (GT/EST: 299/259.8, 239/204.5, 3406/2769.9, 2701/2352.3, 298/151.9, 96/3.7).
Acknowledgement . This work is supported by the Natural Science Foundationof China under Grant No. 61876211 and Grant No. U1913602.
References
1. Araslanov, N., Rothkopf, C.A., Roth, S.: Actor-critic instance segmentation. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8237–8246 (2019)
2. Caicedo, J.C., Lazebnik, S.: Active object localization with deep reinforcement learning. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2488–2496 (2015)
3. Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: Privacy preserving crowd monitoring: Counting people without people models or tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1–7. IEEE (2008)
4. Chattopadhyay, P., Vedantam, R., Selvaraju, R.R., Batra, D., Parikh, D.: Counting everyday objects in everyday scenes. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1135–1144 (2017)
5. Chen, K., Loy, C.C., Gong, S., Xiang, T.: Feature mining for localised crowd counting. In: Proc. British Machine Vision Conference (BMVC). p. 3 (2012)
6. Cohen, J.P., Boucher, G., Glastonbury, C.A., Lo, H.Z., Bengio, Y.: Count-ception: Counting by fully convolutional redundant counting. In: Proc. IEEE International Conference on Computer Vision Workshops (ICCVW). pp. 18–26 (Oct 2017). https://doi.org/10.1109/ICCVW.2017.9
7. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 886–893 (2005)
8. Diuk, C., Cohen, A., Littman, M.L.: An object-oriented representation for efficient reinforcement learning. In: Proc. International Conference on Machine Learning (ICML). pp. 240–247. ACM (2008)
9. Guerrero-Gómez-Olmedo, R., Torre-Jiménez, B., López-Sastre, R., Maldonado-Bascón, S., Oñoro-Rubio, D.: Extremely overlapping vehicle counting. In: Iberian Conference on Pattern Recognition and Image Analysis. pp. 423–431 (2015)
10. Hussein, A., Gaber, M.M., Elyan, E., Jayne, C.: Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR) (2), 1–35 (2017)
11. Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-source multi-scale counting in extremely dense crowd images. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2547–2554 (2013)
12. Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed, S., Rajpoot, N., Shah, M.: Composition loss for counting, density map estimation and localization in dense crowds. In: Proc. European Conference on Computer Vision (ECCV). pp. 532–546 (2018)
13. Jiang, X., Xiao, Z., Zhang, B., Zhen, X., Cao, X., Doermann, D., Shao, L.: Crowd counting and density estimation by trellis encoder-decoder networks. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6133–6142 (2019)
14. Kong, X., Xin, B., Wang, Y., Hua, G.: Collaborative deep reinforcement learning for joint object search. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1695–1704 (2017)
15. Laradji, I.H., Rostamzadeh, N., Pinheiro, P.O., Vazquez, D., Schmidt, M.: Where are the blobs: Counting by localization with point supervision. In: Proc. European Conference on Computer Vision (ECCV). pp. 547–562 (2018)
16. Lempitsky, V., Zisserman, A.: Learning to count objects in images. In: Advances in Neural Information Processing Systems (NIPS). pp. 1324–1332 (2010)
17. Li, Y., Zhang, X., Chen, D.: CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1091–1100 (2018)
18. Lin, L.J.: Reinforcement learning for robots using neural networks. Tech. rep., Carnegie-Mellon Univ Pittsburgh PA School of Computer Science (1993)
19. Liu, L., Lu, H., Xiong, H., Xian, K., Cao, Z., Shen, C.: Counting objects by blockwise classification. IEEE Transactions on Circuits and Systems for Video Technology (2019)
20. Liu, L., Qiu, Z., Li, G., Liu, S., Ouyang, W., Lin, L.: Crowd counting with deep structured scale integration network. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
21. Liu, L., Wang, H., Li, G., Ouyang, W., Lin, L.: Crowd counting using deep recurrent spatial-aware network. In: Proc. International Joint Conference on Artificial Intelligence (IJCAI). pp. 849–855. AAAI Press (2018)
22. Liu, W., Salzmann, M., Fua, P.: Context-aware crowd counting. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5099–5108 (2019)
23. Lu, H., Cao, Z., Xiao, Y., Zhuang, B., Shen, C.: TasselNet: Counting maize tassels in the wild via local counts regression network. Plant Methods (1), 79 (2017)
24. Lu, H., Dai, Y., Shen, C., Xu, S.: Indices matter: Learning to index for deep image matting. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3266–3275 (2019)
25. Lu, H., Dai, Y., Shen, C., Xu, S.: Index networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)
26. Ma, Z., Wei, X., Hong, X., Gong, Y.: Bayesian loss for crowd count estimation with point supervision. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6142–6151 (2019)
27. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: Proc. International Conference on Machine Learning (ICML). pp. 1928–1937 (2016)
28. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
29. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature (7540), 529 (2015)
30. OpenAI: OpenAI Five. https://blog.openai.com/openai-five/ (2018)
31. Riedmiller, M., Gabel, T., Hafner, R., Lange, S.: Reinforcement learning for robot soccer. Autonomous Robots (1), 55–73 (2009)
32. Ryan, D., Denman, S., Fookes, C., Sridharan, S.: Crowd counting using multiple local features. In: 2009 Digital Image Computing: Techniques and Applications. pp. 81–88. IEEE (2009)
33. Sam, D.B., Surya, S., Babu, R.V.: Switching convolutional neural network for crowd counting. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
34. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
35. Shi, M., Yang, Z., Xu, C., Chen, Q.: Revisiting perspective information for efficient crowd counting. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7279–7288 (2019)
36. Shi, Z., Zhang, L., Liu, Y., Cao, X., Ye, Y., Cheng, M.M., Zheng, G.: Crowd counting with deep negative correlation learning. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5382–5390 (2018)
37. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature (7587), 484 (2016)
38. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
39. Sindagi, V.A., Patel, V.M.: Multi-level bottom-top and top-bottom feature fusion for crowd counting. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1002–1012 (2019)
40. Stahl, T., Pintea, S.L., van Gemert, J.C.: Divide and count: Generic object counting by image divisions. IEEE Transactions on Image Processing (2), 1035–1044 (2018)
41. Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: Proc. Thirtieth AAAI Conference on Artificial Intelligence (2016)
42. Van Hove, L.: Optimal denominations for coins and bank notes: In defense of the principle of least effort. Journal of Money, Credit and Banking pp. 1015–1021 (2001)
43. Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature pp. 1–5 (2019)
44. Wan, J., Luo, W., Wu, B., Chan, A.B., Liu, W.: Residual regression with semantic prior for crowd counting. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4036–4045 (2019)
45. Wang, C., Zhang, H., Yang, L., Liu, S., Cao, X.: Deep people counting in extremely dense crowds. In: Proc. ACM International Conference on Multimedia (ACMMM). pp. 1299–1302. ACM (2015)
46. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., De Freitas, N.: Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581 (2015)
47. Xiong, H., Cao, Z., Lu, H., Madec, S., Liu, L., Shen, C.: TasselNetv2: In-field counting of wheat spikes with context-augmented local regression networks. Plant Methods 15 (2019)