Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions
Alexander Egiazarov, Fabio Massimo Zennaro, Vasileios Mavroeidis
Alexander Egiazarov
Digital Security Group, University of Oslo
Oslo, Norway
[email protected]

Fabio Massimo Zennaro
Digital Security Group, University of Oslo
Oslo, Norway
fabiomz@ifi.uio.no

Vasileios Mavroeidis
Digital Security Group, University of Oslo
Oslo, Norway
vasileim@ifi.uio.no
Abstract—Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents such as terrorism, general criminal offences, or even domestic violence. One way of achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis. In this paper we conduct a comparison between a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks detecting firearms via semantic segmentation. We evaluated both models from different points of view, including accuracy, computational and data complexity, flexibility, and reliability. Our results show that a semantic segmentation model provides a considerable amount of flexibility and resilience in the low-data environment compared to classical deep models, although its configuration and tuning present a challenge in achieving the same level of accuracy as an end-to-end model.
Index Terms—weapon detection, firearm detection, firearm segmentation, semantic segmentation, physical security, neural networks, convolutional neural networks (CNNs)
I. INTRODUCTION
Threat detection from live video feeds of firearms, knives, and aggressive behavior can be used in preventing, or rapidly detecting and mitigating, potentially deadly incidents such as terrorism, general criminal offenses, or even domestic violence. One way of achieving this is the use of artificial intelligence and, in particular, machine learning for image analysis to detect weapons that, in many cases, may also be partially concealed, making their discovery a difficult task.

According to the report "Global Study on Homicide" published by the United Nations Office on Drugs and Crime, 54% of homicides in 2017 involved firearms, accounting for 238,804 victims. This shows that firearms are a preferred instrument for committing a crime. In addition, firearms may be used in mass shootings, resulting in multiple losses of life in a single incident, such as in the case of a terrorist act.

In previous work, we presented an approach for firearm detection that makes use of an ensemble of Semantic Convolutional Neural Networks [1]. This approach decomposes a task, such as the detection of a firearm, into a set of smaller tasks, such as the detection of individual component parts of the firearm. We argued that this approach has computational and practical advantages compared to the traditional single monolithic approach, such as requiring fewer computational resources for training the smaller models and the ability to train the individual component-part models in parallel. The results of our previous work demonstrated that the individual networks achieved satisfactory accuracy after being trained on a limited set of data. An important strength of this approach is that the final system relies not only on the performance of the individual networks but also on the ensembling of the results of all networks.

In this paper, we put to a rigorous test our hypotheses about the strengths and weaknesses of an approach based on semantic segmentation.
We perform a series of experimental simulations to assess the accuracy, the flexibility, and the robustness of our model against an end-to-end model based on a single deep network. Our conclusions clearly delineate the advantages, as well as the significant limitations, of our solution.

II. BACKGROUND
In this section, we introduce the problem of weapon detection and we explain how it can be expressed as a machine learning problem. We then recall the principle of semantic segmentation or decomposition, and we describe two approaches to the problem of weapon detection: an approach based on a single deep neural network, and an approach based on an ensemble of simpler semantic neural networks.

A. Weapon Detection and Segmentation
The majority of weapon detection solutions employ classical machine learning methods, in which the object is classified or localized by common computer vision techniques or by monolithic architectures [2]. Despite the popularity of the semantic segmentation approach [3], [4], [5], [6], very few strides have been made in the implementation of weapon detection systems based on semantic decomposition, primarily because it requires a radically different approach to the design of the deep learning models and unique datasets.
B. Machine Learning
Machine learning provides a set of methods and techniques for inferring patterns and relations among data. More formally, in the supervised learning setting, we are given a collection of N data samples x_i, each one with its own label y_i; a supervised learning algorithm allows us to learn a general function f : x_i ↦ y_i that maps data samples onto their respective labels [7]. In the specific case of weapon detection and segmentation, the set of samples x_i will correspond to images, while the set of labels y_i will correspond either to binary values denoting the presence of a weapon in an image or to a box surrounding a weapon within an image.

A versatile algorithm for learning a mapping f is offered by neural networks, layered graphical models that can approximate any function (to an arbitrary degree, given a sufficiently wide or deep architecture) [7]. In the case of images, a special family of neural networks that has proven particularly successful is deep convolutional neural networks (CNNs) [8]. Deep CNNs are neural networks that use convolutional windows to analyze an image and rely on several layers for processing. Thanks to their priors and their complex architecture, deep CNNs are able to learn to discriminate images with high accuracy. The main drawback of this solution lies in its sample requirements and its computational complexity. In order to train a deep CNN, it is necessary to collect a large amount of data and to rely on considerable computational power to process this data.

C. Semantic Segmentation
Semantic segmentation is the general engineering principle of decomposing a single complex task along semantic lines in order to define a set of simpler problems. In the specific case of weapon detection, the application of this principle translates into the decomposition of the hard task of detecting whole weapons into a set of easier image detection problems. Since a whole weapon constitutes a complex object in terms of shape, texture, and orientation within a picture, we considered the possibility of decomposing the problem of detecting a single weapon into the problem of detecting some of its visually prominent component parts, such as the barrel, stock, magazine, and receiver. Each one of these component parts has a simpler shape, a more consistent texture, and a higher degree of orientational invariance than the whole rifle, thus constituting a simpler detection problem.
D. Firearm Detection Based on a Single Neural Network
The basic approach to weapon detection applies the standard paradigm of deep CNNs that has proven successful in image recognition. This paradigm is based on the implementation of a single deep CNN trained end-to-end with labeled data. After proper training, such a network is expected to accurately detect weapons within images. Deep CNNs have the ability to learn complex functions allowing the classification of objects, and they constitute the state of the art in image detection and segmentation. The drawbacks of CNNs are the immediate consequences of their size and complexity. First, in order to fit a deep model described by a high number of parameters, it is necessary to collect a large amount of data. This may be expensive or challenging, as it is in the case of images of weapons. A second challenge is due to the structural complexity of a deep CNN. As a versatile function fitter, a deep CNN is defined by a large set of hyper-parameters. Properly choosing or exploring a subset of all the possible combinations of hyper-parameters is a non-trivial task. A third challenge derives from those mentioned above: as the amount of data and the complexity of a network increase, so does the computational cost of training the network.
E. Firearm Detection Based on Semantic Decomposition
We have proposed an alternative approach to the problem of firearm detection based on the principle of semantic segmentation [1]. Instead of designing a single deep CNN for discriminating a whole weapon within an image, we implemented a set of shallower CNNs, each one tasked with the simpler objective of recognizing a single component part of a weapon. The final decision on the presence or absence of a weapon is then achieved by aggregating the outputs of the smaller networks. This solution allows us to tackle the main drawbacks of a monolithic CNN described in Section II-D. In particular, shallower CNNs demand less data and computational power for training, as well as having a smaller space of hyper-parameters. Moreover, we suggest that relying on the outputs of multiple independent networks may make our solution more reliable in situations where weapons are partly obfuscated within an image. Finally, we may be able to achieve a more robust decision by aggregating multiple outputs, as proposed by the theory of ensemble models [9]. The main weakness of our solution lies in the decomposition itself. In our model, each network learns exclusively to recognize a single component part of a weapon, independently of the others. A solution based on a single deep CNN may model higher-order correlations between the parts, so that detecting one component may help to detect other components.

III. PROBLEM STATEMENT
This work experimentally evaluates and compares our semantic segmentation approach to weapon detection against the standard approach based on a deep CNN.

Earlier results [1] demonstrated that our approach for weapon detection and segmentation achieves a reasonable performance. In [1], it was shown that four small CNNs could be successfully trained for detecting individual component parts of a specific weapon (the AR15). The outputs of the networks could be merged to generate heatmaps and to decide whether a weapon is present or absent.

In this paper, we provide a more careful examination of the proposed semantic segmentation model by comparing it against a single network model resembling more closely the state of the art. In particular, we are interested in studying and comparing the performance of the two models in a regime with a limited amount of data and limited computational power. We define a set of tasks intended to provide a fair comparison between our model and the standard model with different capacities. Our experiments are not meant to compare our solution directly with the state of the art for deciding which model achieves the highest performance. Instead, we carry out a comparison between scaled-down implementations of our solution and state-of-the-art solutions in order to evaluate the strengths and the weaknesses of our model, in particular in terms of accuracy (measured as statistical accuracy), computational and data cost (evaluated in terms of architecture depth), flexibility (expressed in the compositionality of the outputs of our individual networks), and reliability (expressed in the tuning of false positives and false negatives).

Before presenting our simulations, we first describe our models and datasets.

IV. MODELS
In this section we describe the models and the architectures we will use in our evaluations.
A. Semantic Segmentation Model
The aim of the semantic segmentation model is to decompose the hard task of detecting a whole weapon into a set of simpler tasks aimed at detecting only specific component parts of a weapon. In order to successfully recognize an AR15 rifle, we identified four main component parts: stock, magazine, barrel, and receiver (see Figure 1). We selected these components as they are the most visually distinct parts of a firearm.

Fig. 1: Main components of an AR-15 style rifle.

Fig. 2: Semantic Segmentation Model
The semantic segmentation model has been designed as a multi-layered system; for an illustration, refer to Figure 2. At the input level, the model receives an image of arbitrary dimension. On the following patches level, the input image is divided into patches. In order to deal with images of different sizes and ratios, a sliding window algorithm is used to extract patches. The size and the step of the sliding window are set in a user-independent fashion as a ratio between the sides of the image. After extraction, each patch is rescaled to 200 × 200 pixels. On the semantic networks level, patches are fed into the four component CNNs. These networks are defined in a modular way, following [1]; each CNN is constituted by M_sem convolutional layers aimed at performing feature extraction, and N_sem dense layers carrying out the final classification. In the convolutional section, we use layers containing 32 or 64 filters with a default stride of 1x1, with ReLU activation functions, and 2x2 max-pooling [1]. In the dense section, we use fully connected layers with a ReLU activation function, except for the last layer, where we rely on the softmax function to compute the output. Moreover, in the second-to-last layer, we use dropout [10] with a fixed probability p for regularization [1]. Given the relatively small architectures, all the networks are trained independently, in parallel, on their respective labelled data. On the network outputs level, we collect the output of each individual network. Each CNN produces an array of arbitrary length, made up of binary values; each value denotes the presence or the absence of a specific component in each patch. On the network decision level, the vector of binary values outputted by each network is aggregated into a final decision, representing the evaluation of whether a weapon component was present in the original image. Aggregating a binary vector of arbitrary length into a single value may be accomplished using different algorithms, from a voting mechanism to processing these vectors with a dedicated module, such as a recurrent neural network able to manage arbitrary-length inputs. We rely on validated thresholds: we consider the outputs on all the patches, and we use thresholds to decide whether a few positive network outputs on isolated patches constitute a false positive, or whether a concentrated set of positive network outputs flags the presence of a weapon component. Lastly, on the final output level, the binary decisions of the four networks are aggregated into the overall decision of the model about the presence or absence of a weapon. A basic solution would be to simply use a decision algorithm which counts the output of each of the four network modules as a 25% probability of presence. However, this algorithm may be improved by more sophisticated decision algorithms that process the outputs of the individual networks. We illustrate different voting and weighting mechanisms aimed at maximizing the final accuracy of the model.

Fig. 3: Single Network Model
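The sliding-window extraction at the patches level can be sketched as follows. The window and step ratios below are illustrative placeholders, since the paper does not report the exact values; only the resolution-independent ratio mechanism and the 200 × 200 rescaling target come from the text.

```python
def sliding_window_patches(width, height, win_ratio=0.4, step_ratio=0.1):
    """Compute patch bounding boxes for an image of arbitrary size.

    The window size and step are defined as ratios of the image sides,
    so the extraction does not depend on the image resolution.
    win_ratio and step_ratio are assumed values, not taken from the paper.
    """
    win_w, win_h = int(width * win_ratio), int(height * win_ratio)
    step_w = max(1, int(width * step_ratio))
    step_h = max(1, int(height * step_ratio))
    boxes = []
    for y in range(0, height - win_h + 1, step_h):
        for x in range(0, width - win_w + 1, step_w):
            # Each (x, y, w, h) patch would later be rescaled to
            # 200x200 pixels before being fed to the component CNNs.
            boxes.append((x, y, win_w, win_h))
    return boxes

boxes = sliding_window_patches(1000, 600)
```

For a 1000 × 600 image with these ratios, the routine yields a 7 × 7 grid of overlapping 400 × 240 windows; each window would then be cropped and resized before classification.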
B. Single Network Model
As a comparison, we instantiate a solution shaped on the state of the art for weapon detection. We define a deep CNN trained on whole images in order to learn to flag the presence of a weapon inside an image. For an illustration, refer to the model in Figure 3.

Like the semantic segmentation model, the single network model receives at the input level an image of arbitrary dimension. On the following rescaling level, the input image is rescaled to a fixed dimension. On the deep network level, the image is forwarded to a deep CNN made up of M_single convolutional layers and N_single dense layers, with the same hyper-parameters described above for the CNNs in the semantic segmentation model. Finally, on the final output level, we obtain the output of the deep CNN in the form of an (uncalibrated) probability of a firearm being present in the picture.

As discussed in Section III, we do not aim at implementing a cutting-edge architecture for the sake of achieving the best possible performance. In other words, we are not interested in pushing the number and the width of layers M_single and N_single as high as possible. Instead, we will pay attention to the ratio between the number of layers in the semantic segmentation model (M_sem and N_sem) and in the deep CNN (M_single and N_single). This will allow us to evaluate the relative performance of the two models with respect to the computational power or data availability necessary to train deeper models.

V. DATASETS
In this section we discuss the generation and the preparation of the data for our models.
A. Data for the Semantic Segmentation Model
The architecture of the semantic segmentation model requires the definition of a custom dataset for each one of the four CNNs included in the system. Specifically, each CNN has to be trained on a proper dataset that contains positive samples (images of the weapon component that the network is supposed to detect) and negative samples (random images not containing the weapon component that the network is supposed to detect).

To create the datasets for the semantic segmentation model, we assembled a set of images from the public domain. We chose to use publicly available images, instead of synthetically created ones, due to the higher variation of details and combinations of components that are naturally present in the sourced dataset. We visually inspected the original set of images to verify its quality and removed any sample that did not portray the actual chosen firearm model (e.g., obvious toy replicas, other firearm models) or that depicted non-related content. We then extracted positive patches and negative patches for each of the four CNNs. All patches were resized to 200 × 200 pixels. Positive-labelled patches contain the specific component part for each network, while negative-labelled patches contain random images (including background details, clothing, people, and other random objects available in the starting set of images) as well as samples of the other component parts that the network is not supposed to detect. Notice that enriching the negative dataset with component parts that the network should not detect is essential to prevent the network from learning to detect just the color or the texture of rifle parts instead of the actual component part. See Figure 4 for examples of positive-labelled and negative-labelled samples.

Each dataset is partitioned into a training, validation, and testing subset with the respective proportions of 80%, 16%, and 4%, thus yielding 2000 training samples, 400 validation samples, and 100 testing samples.

Each training dataset is then augmented via random modifications of the samples (such as rotation, offset from center, and scale changes).

Fig. 4: Data samples. On the left, a general-purpose negative sample; on the right, a positive sample for the barrel network and a potential negative sample for the receiver, magazine, and buttstock networks.

By adding 3 modified images for each original sample, each training dataset was enlarged to 8000 samples. This procedure provides a bigger variety of data and, by applying augmentation after partitioning our data, we guarantee that the same samples, with and without modification, will not appear in both the training and test datasets. The inclusion of modified images changed the training, validation, and testing set proportions to approximately 95%, 5%, and 1%, respectively.

In the end, each CNN is trained and evaluated on a dataset made up of a training set D^tr_sem of 8000 samples, a validation set D^val_sem of 400 samples, and a test set D^te_sem of 100 samples, all evenly divided into positive and negative samples.
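The partition-then-augment procedure can be sketched as follows; the "modification" is a stand-in for the rotations, offsets, and rescalings used in the paper, and the augmented copies are represented symbolically rather than as transformed images.

```python
import random

def partition_then_augment(samples, n_aug=3, seed=0):
    """Split samples 80/16/4 into train/val/test, then augment only the
    training split by adding n_aug modified copies per original sample.

    Augmenting after partitioning guarantees that no augmented variant of
    a validation or test sample can leak into training.
    """
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    n = len(samples)
    train = samples[: int(0.8 * n)]
    val = samples[int(0.8 * n): int(0.96 * n)]
    test = samples[int(0.96 * n):]
    augmented = []
    for s in train:
        augmented.append(s)
        for k in range(n_aug):
            # Placeholder for a randomly transformed copy of sample s.
            augmented.append((s, "aug%d" % k))
    return augmented, val, test

# 2500 samples reproduce the per-CNN split reported in the paper:
# 2000 train (8000 after augmentation), 400 validation, 100 test.
train, val, test = partition_then_augment(list(range(2500)))
```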
See Table I for a summary of the data and its partitioning.

TABLE I: Dataset for each CNN in the semantic segmentation model.

Positive patches        Training   Validation   Testing   Total
Initial partitioning        1000          200        50    1250
After augmentation          4000          200        50    4250

Negative patches        Training   Validation   Testing   Total
Initial partitioning        1000          200        50    1250
After augmentation          4000          200        50    4250
B. Data for the Single Network Model
For the single neural network model we collected a new dataset. This decision is due to the fact that the deep CNN is meant to be trained to detect a whole weapon within an image, and therefore it cannot be trained on the patches used to train the four component-specific CNNs. Positive-labelled samples are extracted from the original dataset of public domain images; we selected images containing the entirety of the AR15 rifle. As before, this set of images is partitioned into a training dataset of 3000 samples, a validation dataset of 400 samples, and a test dataset of 100 samples. The training dataset is further augmented using the same method used for the semantic segmentation dataset, thus producing a final training dataset of 8000 samples. Negative-labelled samples are extracted from the Indoor Scene Recognition dataset. This dataset contains varied and realistic images of indoor environments, which may resemble the places where an automatic weapon detection system may be deployed. We randomly sub-selected 8000 images for training, 400 for validation, and 100 for testing. We carefully selected images from different categories and, given the abundance of data, we did not perform any augmentation. In the end, the CNN in the single network model is trained and evaluated on a dataset made up of a training set D^tr_single of 16000 samples, a validation set D^val_single of 800 samples, and a test set D^te_single of 200 samples, all evenly divided into positive and negative samples. See Table II for a summary of the data and its partitioning.

TABLE II: Dataset for the CNN in the single network model.

Positive samples        Training   Validation   Testing   Total
Initial partitioning        3000          400       100    3500
After augmentation          8000          400       100    8500

Negative samples        Training   Validation   Testing   Total
Initial partitioning        8000          400       100    8500
VI. EVALUATION
In this section, we present the evaluation of the two models we have described above: the semantic decomposition model and the single network model. Our simulations are meant to compare the two solutions, highlighting the different performances of each module, showing the degrees of freedom in aggregating the individual networks in the semantic decomposition model, and contrasting the results of the two models when trained on very limited sets of data.
A. Simulation 1: Evaluating the CNNs
In this simulation, we train independently all the CNNs of our two models, using different settings for their hyper-parameters. We perform model selection and choose the optimal architecture for each of the CNNs we implemented. For the semantic segmentation model, this simulation runs up to the network outputs level of Figure 2.

a) Protocol: In these simulations we consider different architectures for the individual component-specific CNNs and for the single deep CNN. In particular, we vary the number of convolutional layers (M_sem, M_single) and the number of dense layers (N_sem, N_single) in the set {3, 4, 5}. These values were chosen to include the basic setting for the semantic decomposition model described in [1], and to allow for the exploration of larger models with higher capacity within a modest computational budget for training.

In total, we considered five possible architectures ((M = 3, N = 3), (M = 3, N = 4), (M = 4, N = 3), (M = 4, N = 4), (M = 5, N = 5)) and ten models (five semantic segmentation models and five single network models), leading to the training, validation, and testing of twenty-five networks (four CNNs for each semantic segmentation model and one CNN for each single network model).

Training is performed for 15 iterations, and we use the validation dataset to perform early stopping and select the weight configuration returning the best accuracy on the validation dataset. Notice that in this simulation each network is trained and evaluated independently. In the semantic segmentation model, the four component CNNs are evaluated at the network output level (see Figure 2), therefore ignoring for the moment the network decision level and the final output level.

b) Results: Table III reports the best architectures found in our model selection process (more results are available in the Appendix). For each network we report the architecture, expressed as the number of convolutional layers, M_single or M_sem, and dense layers, N_single or N_sem. We also report the early stopping epoch and the associated accuracy on the validation dataset.
The final performance of each network is expressed in terms of true positives and true negatives over the test dataset.

TABLE III: Best architecture (arch) for each network in the two models, epoch of early stopping (epoch), performance on the validation dataset at the early stopping epoch (acc (val)), and true positives (TP (test)) and true negatives (TN (test)) on the test set.
             Arch   Epoch   Acc (val)   TP (test)   TN (test)
Full AR
Barrels
Magazines
Receivers
Stocks
All the networks achieve a good performance in their training, using architectures of similar complexity. The single network model achieves the best validation performance with M_single convolutional layers and N_single dense layers. Its performance on this dedicated task is, in general, inferior to that of the individual component networks; this result is understandable, as the single network model tackles here a more complex task (recognizing a whole weapon) compared to the networks in the semantic segmentation model (recognizing component parts).

c) Discussion: This simulation allowed us to select the optimal architectural hyper-parameters for our two models. It is likely that, given more training data, larger architectures with stricter regularization, and more computational power, these results could be improved upon. Within the limits we selected in terms of computational power and number of layers, the hyper-parameters we found constitute the optimal solutions for our models, and we will use these hyper-parameters in the following simulations.
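The model-selection protocol of this simulation, early stopping on validation accuracy followed by picking the best architecture, can be sketched in pure Python. The accuracy histories below are invented placeholders for illustration, not results from the paper.

```python
def early_stop_epoch(val_accuracies):
    """Return the epoch (0-indexed) with the best validation accuracy;
    the weight configuration from this epoch is the one kept."""
    return max(range(len(val_accuracies)), key=lambda e: val_accuracies[e])

def select_architecture(results):
    """results maps (m_conv, n_dense) -> list of per-epoch validation
    accuracies. Pick the architecture whose early-stopping accuracy
    is highest."""
    best_arch, best_epoch, best_acc = None, None, -1.0
    for arch, history in results.items():
        e = early_stop_epoch(history)
        if history[e] > best_acc:
            best_arch, best_epoch, best_acc = arch, e, history[e]
    return best_arch, best_epoch, best_acc

# Hypothetical histories for the five (M, N) settings explored above.
histories = {
    (3, 3): [0.70, 0.80, 0.79],
    (3, 4): [0.72, 0.81, 0.85],
    (4, 3): [0.75, 0.83, 0.82],
    (4, 4): [0.74, 0.86, 0.84],
    (5, 5): [0.71, 0.78, 0.80],
}
arch, epoch, acc = select_architecture(histories)
```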
B. Simulation 2: Tuning of the Network Decisions
In the previous simulation we evaluated the performance of individual component networks in detecting weapon components within individual patches. We now estimate the threshold parameters that allow us to combine the outputs over each patch into a final decision. This simulation runs only for the semantic segmentation model, at the network decision level of Figure 2.

a) Protocol: In this simulation we use the optimal CNNs that we have already trained. These networks have been trained so far only to classify patches of images, deciding whether each patch contains the specific component part they were trained on. In order to process a whole image, we need to aggregate the outputs of each CNN over several patches. Given an image x, each network records its network outputs for each patch extracted from the image. The juxtaposition of these network outputs for overlapping patches provides us with a detection heatmap H(x) over the image x.

Using the validation data, we can estimate data-defined thresholds from the heatmaps H(x) in order to return a network decision. We estimate a positive threshold θ_p as the average of the maximum value of the heatmap over the positive images:

    θ_p = E_{x ∈ P} [ max H(x) ],

where P is the set of positive images. Similarly, we evaluate a negative threshold θ_n as the average of the maximum value of the heatmap over the negative images:

    θ_n = E_{x ∈ N} [ max H(x) ],

where N is the set of negative images. Finally, we also define an intermediate threshold θ_i, which is estimated on the combined positive set P and negative set N:

    θ_i = E_{x ∈ P ∪ N} [ max H(x) ].

Notice that, in our case, since the validation dataset is balanced, we have θ_i = (θ_p + θ_n)/2 because of the linearity of the expectation.

Thus, given an image x, each network will process all the patches, compute the heatmap H(x) over the image, evaluate the mean heatmap E[H(x)], and compare it against one of the learned thresholds θ. The output will be a positive decision if E[H(x)] ≥ θ. The gap between thresholds may be used to define an uncertainty region, which may require the intervention of a human supervisor in the loop.
However, in this simulation, we will simply return a negative decision if E[H(x)] < θ.

We compute the thresholds of each component network using the positive and negative images in the validation dataset of the single network model, D^val_single. Notice that in this simulation we do not explicitly evaluate the accuracy of each component network, because the images in the validation dataset D^val_single and test dataset D^te_single of the single network model are labelled in terms of the presence of a whole weapon, and they lack labels about the presence of individual components; indeed, it is not unusual that, in an image containing a weapon, one of the four components may be occluded or hidden; such an image, while being a positive instance of a weapon, would be a negative instance for the occluded component. An overall evaluation of the semantic segmentation model in terms of accuracy is therefore postponed to the next section; here we estimate possible thresholds θ for the network decision level.

b) Results: Table IV reports the thresholds computed on the validation data. As expected, θ_p > θ_n for all the networks; this makes sense, as we would expect positive samples containing instances of a weapon to raise detections in more patches. However, in the magnitude of these thresholds we can observe that certain parts may be easier to detect than others; in particular, the closeness of the two thresholds θ_p, θ_n for the magazine network may point to the fact that correctly discriminating the presence or absence of a magazine may be more difficult than for other components.

TABLE IV: Threshold values for each component network.

             θ_p   θ_n   θ_i
Barrels
Magazines
Receivers
Stocks

c) Discussion:
This simulation allowed us to compute the threshold parameters for our component networks, which allow us to compute the overall decision of each component network. Moreover, the computation of these thresholds has provided us with further insight into the weapon detection problem, highlighting that some weapon component discrimination problems may be harder than others.
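The threshold-estimation procedure of this simulation can be sketched as follows. Each heatmap is represented here as the flat list of a component network's outputs over the patches of one image; the numeric values are toy numbers for illustration, not measurements from the paper.

```python
def estimate_thresholds(heatmaps_pos, heatmaps_neg):
    """Estimate decision thresholds from per-image detection heatmaps.

    theta_p / theta_n are the means of the per-image heatmap maxima over
    positive / negative validation images; theta_i is the mean over
    their union.
    """
    max_pos = [max(h) for h in heatmaps_pos]
    max_neg = [max(h) for h in heatmaps_neg]
    theta_p = sum(max_pos) / len(max_pos)
    theta_n = sum(max_neg) / len(max_neg)
    theta_i = sum(max_pos + max_neg) / (len(max_pos) + len(max_neg))
    return theta_p, theta_n, theta_i

# Toy heatmaps: positive images peak high, negative images stay low.
pos = [[0.1, 0.9, 0.8], [0.2, 0.7, 0.95]]
neg = [[0.1, 0.2, 0.05], [0.3, 0.1, 0.15]]
theta_p, theta_n, theta_i = estimate_thresholds(pos, neg)
```

With equally many positive and negative images, theta_i coincides with the midpoint of theta_p and theta_n, matching the balanced-dataset observation above.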
C. Simulation 3: Comparing the Semantic Segmentation Model and the Single Network Model
Building on the previous simulation, which allowed us to compute a single decision for each network, in this simulation we finally evaluate the overall performance of the semantic segmentation model against the single network model. We consider different aggregation protocols to merge the decisions of the individual networks, and we compare the accuracy of their final decision against the accuracy of the single network model. This simulation runs at the final output level of Figure 2.

a) Protocol:
We run our simulations using the optimal hyper-parameters found in the previous simulations. However, instead of testing the two models independently on their respective datasets, as we did in Simulation 1, we contrast their results on an identical dataset.

A key challenge in this experiment is how to guarantee that the performances of the two models are compared in a fair way. First of all, notice that both our models are trained on similar positive samples coming from a common original dataset of public domain images; we thus assume that both models are provided with similar training information about the object to detect; although negative samples may differ, we expect the training data for positive samples not to be skewed or manipulated so as to provide an advantage to either of the two models. Next, we need to guarantee that the measure of performance is equitable. For this to be the case, we need to evaluate the performance of the two models on the same test cases. Thus, given a test image, the outputs of the two models can be compared in a consistent way.

This comparison is then fair with respect to the data (both models learned from sets of data derived from a common source). However, it may be argued whether our comparison is fair with respect to data processing choices, such as the way outputs in the semantic segmentation model are aggregated or how images are rescaled in the single network model; such choices are specific to each of the two models, and it is therefore hard to guarantee any sort of fairness with respect to them. We think that the best approach would be to consider these choices as hyper-parameters of the two models and investigate how performances would change when varying these additional hyper-parameters; we will not carry out this further investigation in this paper; instead, we will present our conclusions conditional on our assumptions, and leave further investigation for future work.
In the end, we opted to use the test dataset D_te^single that we prepared for the single network model. This may provide a small edge to the single network model, which was trained on data coming from the same distribution, and it constitutes a realistic out-of-distribution challenge for the semantic segmentation model.

For the aggregation of the four decisions of the individual component networks, we start by implementing simple voting rules: the strict majority rule (the final output is positive if at least three out of four networks return a positive decision), the weak majority rule (positive if at least two out of four networks return a positive decision), the unanimity rule (positive if and only if all four networks return a positive decision) and the veto rule (positive if at least one out of four networks returns a positive decision). We also consider the possibility of a weighted vote, in which the weight of each individual component network is scaled with respect to its accuracy; we use the normalized validation accuracy of each network to set the weights. This approach allows us to discount the decisions of weak networks and boost the decisions of networks performing above average. Finally, we measure the performance of each model in terms of accuracy.

b) Results: Table V shows the accuracy of the semantic segmentation model at the final output level, as a function of the different voting rules and the different possible thresholds θ used at the network decision level. In general, we observe that there seems to be no optimal θ for all the aggregation rules. On the contrary, we can observe a correlation between the magnitude of θ and how stringent a rule is.
This makes sense: loose rules (like having 1 network out of 4 detecting a component part suffice to flag a detection) may take advantage of a higher threshold θ to prevent too many false positives; conversely, strict rules (like requiring all 4 out of 4 networks to detect their respective component parts to flag a detection) may operate better with a lower θ that avoids too many false negatives. The type of rule and the magnitude of θ may then be jointly set or optimized as hyper-parameters in order to control the trade-off between precision and recall.

The weighted vote has also been tested. However, due to the almost-uniform performance of the component networks (see Table III), the weights were very close to uniform and we did not observe any significant difference in accuracy. We still hold, though, that in the case of sufficiently different performances, the weighted vote rule may have a positive impact on the overall results.

For a direct comparison with the single network models, these accuracies should be contrasted against the results reported for the Full AR architecture in Table III. The comparison shows that the single network model easily outperforms the semantic segmentation model. On one side, this may be due to the better fit between training and test data in the case of the single network model and to its ability to model correlations between component parts; on the other side, the versatility of the semantic segmentation model, presenting the opportunity to tune thresholds and aggregation rules, offers more control to the designer, but also constitutes a further challenge in the optimization process.

TABLE V: Final accuracy of the semantic segmentation model as a function of the threshold parameter θ.

Rule             | θ = 0                      | θ_n                        | θ_i                         | θ_p
Veto             | TP 100%, TN 27%, Tot 63.5% | TP 99%, TN 49%, Tot 74%    | TP 66%, TN 91%, Tot 78.5%   | TP 79%, TN 77%, Tot 78%
Weak majority    | TP 94%, TN 64%, Tot 79%    | TP 86%, TN 81%, Tot 83.5%  | TP 61%, TN 90%, Tot 75.5%   | TP 39%, TN 95%, Tot 67%
Strict majority  | TP 60%, TN 89%, Tot 74.5%  | TP 44%, TN 95%, Tot 69.5%  | TP 25%, TN 100%, Tot 62.5%  | TP 10%, TN 100%, Tot 55%
Unanimity        | TP 16%, TN 98%, Tot 57%    | TP 9%, TN 100%, Tot 54.5%  | TP 3%, TN 100%, Tot 51.5%   | TP 1%, TN 100%, Tot 50.5%

c) Discussion:
The results in Table V highlight the versatility of the semantic segmentation model in trading off true positives and true negatives; once the individual component networks are trained, their outputs can be aggregated using different rules and thresholds, offering a further level of control to the user. A monolithic architecture made up of a single network does not have this option; while we could introduce a cost-sensitive loss function to trade off precision and recall, any tuning of this loss function would require re-training the whole single network model. However, the flexibility of the semantic segmentation model comes at the cost of a hard combinatorial optimization problem; the new degrees of freedom mean that finding a combination of threshold and rule able to reach the raw level of accuracy of a single network model is not trivial.
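As a concrete illustration, the decision thresholding and the voting rules described above can be sketched as follows. This is a minimal reconstruction, not the code used in our experiments; the scores and the threshold are made-up values, while the weighted-vote example borrows the best validation accuracies from Tables A.I–A.IV.

```python
import numpy as np

def network_decisions(scores, theta):
    """Binarize the scores of the four component networks at threshold theta."""
    return np.asarray(scores) > theta

def aggregate(decisions, rule, weights=None):
    """Merge the four binary component decisions into one final detection."""
    votes = int(np.sum(decisions))
    if rule == "veto":             # positive if at least 1 of 4 networks fires
        return votes >= 1
    if rule == "weak_majority":    # positive if at least 2 of 4 fire
        return votes >= 2
    if rule == "strict_majority":  # positive if at least 3 of 4 fire
        return votes >= 3
    if rule == "unanimity":        # positive iff all 4 fire
        return votes == 4
    if rule == "weighted":         # weighted vote, weights normalized to sum to 1
        w = np.asarray(weights, dtype=float)
        return float(np.dot(w / w.sum(), decisions)) >= 0.5
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical scores for the barrel, magazine, receiver and stock networks
scores = [0.91, 0.35, 0.72, 0.55]
decisions = network_decisions(scores, theta=0.5)    # [True, False, True, True]
print(aggregate(decisions, "strict_majority"))      # True (3 of 4 positive)
print(aggregate(decisions, "unanimity"))            # False

# Weighted vote, scaling each network by its validation accuracy
val_acc = [0.944, 0.941, 0.966, 0.948]
print(aggregate(decisions, "weighted", weights=val_acc))  # True
```

Since the component accuracies are nearly uniform here, the normalized weights are close to 1/4 each and the weighted vote behaves almost like the weak majority rule, consistent with the observation above.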
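By contrast, the only comparable lever for the monolithic model is a cost-sensitive loss. A minimal sketch of a class-weighted binary cross-entropy illustrates the point; the weight `w_pos` is a hypothetical hyper-parameter, and any change to it requires re-training the whole network, unlike the post-hoc tuning of aggregation rules:

```python
import numpy as np

def weighted_bce(y_true, y_pred, w_pos=2.0, eps=1e-7):
    """Binary cross-entropy in which missed positives (false negatives)
    are penalized w_pos times more than false positives; tuning w_pos
    shifts the precision/recall balance of the trained network."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-w_pos * y_true * np.log(y_pred)
                         - (1.0 - y_true) * np.log(1.0 - y_pred)))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(weighted_bce(y_true, y_pred, w_pos=2.0))
```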
D. Simulation 4: Data Comparison of the Models
Finally, we compare the performances of the two models in a low-data regime. In particular, we want to assess the hypothesis that the semantic segmentation model may be a better fit in low-data regimes, since its task of detecting a simple component part is arguably easier than the task of the single network model of detecting a whole weapon. It may be objected that the task of the single network model is not more difficult once we take into account that such a model has the possibility of learning correlations between the component parts of a weapon; yet, we hypothesize that, in a low-data regime, data scarcity makes it challenging and unlikely for the single network model to learn such correlations, leaving the network with the harder task of detecting a weapon as a whole. In this section we put this hypothesis to the test. This simulation again runs up to the network outputs level of Figure 2, like Simulation 1. a) Protocol:
For this experiment we consider again the best performing architectures in terms of accuracy that we discovered in the first simulation (see Table III). Given that the performances on the whole dataset are known, we proceed to re-train and re-evaluate these architectures on random subsamples of the original dataset. We reduce the size of the training sets via random subsampling by considering only 25%, 50%, or 75% of the original dataset, while keeping the size of the test dataset constant. We train and test following the same protocol as Simulation 1.

b) Results: Figure 5 shows how the final accuracy changes when the amount of training data shrinks (see Appendix for more details). Notice that, despite a small increase in performance of the single network model when trained on a 75% limited dataset, the overall trend of this model presents a sharper drop than any of the networks in the semantic segmentation model.

Fig. 5: Results of limited training.

c) Discussion:
Smaller semantic networks proved to be more stable and accurate than the single network model when provided with a reduced set of data. This confirms the hypothesis that, in a low-data regime, without further changes to the architecture and the regularization, smaller solutions may be a safer choice.

VII. ETHICAL CONSIDERATIONS
As in previous work [1], we acknowledge that the application of machine learning models in critical scenarios presents potential ethical challenges. Our work is motivated by the development of systems and tools that may benefit civil society and that may be deployed to prevent violence and loss of lives. However, we are aware that a sensitive technology like weapon detection may find application in other contexts, for instance, within lethal autonomous weapon systems (LAWS). As the authors of this work, we disavow such applications of our work, and in particular we condemn the use of our models in autonomous weapons.

VIII. CONCLUSIONS
In this paper we considered two main approaches to the problem of detecting weapons in images: a standard, monolithic end-to-end approach based on a deep CNN, and an alternative modular approach based on the principle of decomposing a complex problem into a set of smaller and simpler sub-problems. We conducted a set of rigorous experiments to evaluate the two solutions from different points of view, within fixed computational limits. From the point of view of reliability and flexibility, the semantic segmentation model was shown to offer a higher degree of control to the designer: different sub-problems may be identified and solved by small dedicated CNNs, and the ratio between precision and recall may be controlled by changing the way the outputs of the individual component part networks are aggregated. This level of control is not available, by default, in a deep CNN, which automatically generates hierarchies of features, and which optimizes a loss function that does not explicitly account for precision and recall. However, this added degree of freedom of the semantic segmentation model translates into a more challenging optimization problem. Thus, from the point of view of raw performance (in terms of accuracy, for instance), the single network model outperforms the semantic segmentation model thanks to its easier and more direct optimization, whereas the semantic segmentation model requires more fine-tuning of the aggregation parameters. Yet, the semantic segmentation model proved to be more robust in low-data regimes: decreasing the amount of available data has a smaller effect on the semantic segmentation model compared to the single network model, likely because the individual component networks are learning simpler functions that can be fit with less data. In summary, the semantic segmentation model was shown to exhibit useful properties (flexibility, modularity, robustness) directly inherited from the underlying principle on which it was designed.
Its limited accuracy remains, however, a significant obstacle to making this model an alternative to the current deep CNN paradigm. Further work may aim at exploring more rigorous and grounded ways to deal with the problem of optimizing the aggregation process, either treating it as a hyper-parameter exploration problem or trying to solve it using black-box models.

ACKNOWLEDGMENT
This research was supported by the research project Oslo Analytics funded by the Research Council of Norway under Grant No. IKTPLUSS 247648.

https://futureoflife.org/open-letter-autonomous-weapons/

REFERENCES
[1] A. Egiazarov, V. Mavroeidis, F. M. Zennaro, and K. Vishi, "Firearm detection and segmentation using an ensemble of semantic neural networks," arXiv preprint arXiv:2003.00805, 2020.
[2] R. K. Tiwari and G. K. Verma, "A computer vision based framework for visual gun detection using Harris interest point detector," Procedia Computer Science, vol. 54, pp. 703–712, 2015.
[3] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 898–916, May 2011.
[4] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, "Feed-forward semantic segmentation with zoom-out features," CoRR, vol. abs/1412.0774, 2014.
[5] B. Fulkerson, A. Vedaldi, and S. Soatto, "Class segmentation and object localization with superpixel neighborhoods," in IEEE 12th International Conference on Computer Vision (ICCV), pp. 670–677, Sep. 2009.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," CoRR, 2014.
[7] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pp. 1097–1105, Curran Associates Inc., 2012.
[9] L. Rokach, Pattern Classification Using Ensemble Methods, vol. 75. World Scientific, 2010.
[10] N. Srivastava, Improving Neural Networks with Dropout. PhD thesis, University of Toronto, 2013.
APPENDIX
A. Further experimental results
Tables A.I, A.II, A.III, and A.IV report the results of training the component networks of the semantic segmentation model on all the architectures we considered. Table A.V reports the results of training both models on a limited data set.

TABLE A.I: Best accuracy results for the network trained on the barrel component of the AR-15.

Arch   Epoch   Acc (val)   TP (test)   TN (test)
5x5    15      94.4%       97%         94%
       12      92.5%       92%         93%
TABLE A.II: Best accuracy results for the network trained on the magazine component of the AR-15.

Arch   Epoch   Acc (val)   TP (test)   TN (test)
5x5    14      90.7%       92%         96%
       11      93.0%       93%         95%
       12      94.1%       96%         93%
TABLE A.III: Best accuracy results for the network trained on the receiver component of the AR-15.

Arch   Epoch   Acc (val)   TP (test)   TN (test)
5x5    11      96.6%       99%         95%
       15      95.6%       98%         90%
TABLE A.IV: Best accuracy results for the network trained on the stock component of the AR-15.

Arch   Epoch   Acc (val)   TP (test)   TN (test)
5x5    11      94.8%       98%         90%
       10      93.9%       97%         92%
       12      91.7%       96%         88%
TABLE A.V: Accuracy of individual models after limited-set training.

              Training data
Model        25%     50%     75%     100%
Full AR      81%     89%     93.5%   92.5%
Barrels      94%     96%     97%     97%
Magazines    89%     93.5%   90.5%   94.5%
Receivers    93%     94.5%   94.5%   94.5%
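As a side note, the Tot values reported in Table V are consistent with a test set balanced between positive and negative samples, in which case the overall accuracy is simply the mean of the TP and TN rates. A quick sanity check:

```python
def total_accuracy(tp_rate, tn_rate):
    """Overall accuracy on a test set with equally many positives and negatives."""
    return (tp_rate + tn_rate) / 2

# First cell of Table V: TP = 100%, TN = 27% -> Tot = 63.5%
print(total_accuracy(1.00, 0.27))  # 0.635
```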