Deep Multilabel CNN for Forensic Footwear Impression Descriptor Identification
Marcin Budka, Akanda Wahid Ul Ashraf, Scott Neville, Alun Mackrill, Matthew Bennett
Faculty of Science and Technology, Bournemouth University, Fern Barrow, Poole BH12 5BB, UK
e-mail: {aashraf, mbudka, mbennett}@bournemouth.ac.uk
Scott Neville, Alun Mackrill
Bluestar Software Ltd., Fair Cross Offices, Stratfield Saye RG7 2BT, UK
e-mail: {scott.neville, alun.mackrill}@bluestar-software.co.uk
February 11, 2021

Abstract
In recent years deep neural networks have become the workhorse of computer vision. In this paper, we employ a deep learning approach to classify footwear impression features, known as descriptors, for forensic use cases. Within this process, we develop and evaluate an effective technique for feeding downsampled greyscale impressions to a neural network pre-trained on data from a different domain. Our approach relies on a learnable preprocessing layer paired with multiple interpolation methods used in parallel. We empirically show that this technique outperforms using a single type of interpolated image without learnable preprocessing, and that it can help to avoid the computational penalty of high resolution inputs by making more efficient use of low resolution inputs. We also investigate the effect of preserving the aspect ratio of the inputs, which leads to a considerable boost in accuracy without increasing the computational budget with respect to squished rectangular images. Finally, we formulate a set of best practices for transfer learning with greyscale inputs, potentially widely applicable in computer vision tasks ranging from footwear impression classification to medical imaging.
In this work we develop an approach to train a deep Convolutional Neural Network (CNN) to classify features in footwear impressions for use in forensic applications. The features we classify are known as descriptors within the UK footwear forensic units [1, 2, 3, 4] and can be defined as recognisable units within a footwear pattern which can be classified. The descriptors are used by forensic practitioners to describe the makeup of a footwear pattern.

Every footwear impression added to the UK's National Footwear Reference Collection (NFRC) is manually labelled with the descriptors [1]. The NFRC is built on an agreed standard for coding footwear patterns for the different forces in the UK and, at the time of writing, to the best of our knowledge, it is the biggest police-owned collection of footwear impressions in the world. The NFRC footwear pattern collection is updated on a regular basis [5]. The NFRC and the National Footwear Database (NFD) are developed and maintained by Bluestar Software Ltd (BSL) [1]. The NFD is a successor of the NFRC where footwear labels are maintained and added regularly. The NFRC records the custody and crime scene marks, while the NFD facilitates matching with the NFRC footwear patterns. Currently, around 30 out of 43 police forces in England and Wales continuously send or update data in the NFD [5].

The NFRC uses a total of 17 descriptors to identify a footwear impression. Each descriptor is assigned a unique name and code. A shoe print or footwear impression may contain any subset of these descriptors. The locations of the descriptors are divided into two parts: 1) the heel/instep, and 2) the main sole (i.e. top). In this study we do not exploit this location information in any way. A single descriptor can occur multiple times in a shoe print; however, the specific location (other than the heel/instep or main sole) and frequency of the descriptor are not identified or counted.

Each of the 17 descriptors (Table 1) has specific semantics (for the purpose of quick identification by a forensic practitioner rather than a computer), which relate to the name of the descriptor. For example, descriptor D05: 5 sided contains all shapes which are 5 sided; descriptor D09: Text indicates any text that can be found on a shoe print. The number of possible geometric variations is potentially infinite. For example, descriptor D09: Text can be any combination of characters and fonts, while descriptor D05: 5 sided can be a rough pentagon of any shape and form. Two descriptors can overlap, resulting in multiple descriptors arising from a single topological subpattern on a footwear impression. For example, descriptors D09: Text and D10: Logo usually overlap, as many logos contain text. Additionally, among the 17 descriptors, three are subcategories of two main/parent descriptors: D01-01: Wavy and D01-02: Curved-wavy are the subcategories of D01: Bar, and D02-01: Target is the single subcategory of D02: Circular. When labelling a footwear impression with the descriptors, the microscopic patterns of the impression are not usually considered. For example, D12: Texture can contain microscopic patterns which are also D06: 6 sided, but D06: 6 sided is usually not labelled in such cases, as these microscopic patterns are often not reliable and persistent [6]. The sided-shape descriptors (e.g. D03, D04, etc.) do not necessarily have precisely straight lines as sides; some curves and deformations are ubiquitous.
In UK policing, collection of footwear evidence is normally done in two scenarios: 1) collection of detainee footwear in custody, and 2) collection of crime scene marks. The vast majority of footwear impressions captured from detainees in custody follow one of the processes below [7]:

• Inked impressions: captured using a specialist pad and paper kit (sometimes called a 'Bigfoot Kit') [1, 6, 2]. The kit uses a pad with a reactive chemical and specialist paper. The impressions can then be digitised using an office document scanner, if required.

• Ink-less impressions: a specialised footwear impression digital scanner is used to capture the footwear impression without any use of ink. This process produces only a digital copy of the impression, whereas the inked impression also produces a physical copy on paper [1].

Additionally, some UK forces use coloured photographs of the shoe sole as opposed to using one of the impression capturing methods described above [1].
In practice to date, the descriptors are manually identified by experts and are only used as an intermediate step to identify a pattern. Processes vary between police forces; however, when adding an impression to the NFRC, two independent experts individually identify the descriptors. If both experts agree on the set of identified descriptors, the footwear impression image is labelled with them. When there is a disagreement between the two experts, the labelling process involves a panel of experts for further analysis. The accuracy of expert descriptor identification is thought to be 'very high'; however, to the best of our knowledge, no empirical study has quantified this accuracy [1].
The main limitation associated with manually identifying descriptors is the time and cost of human expertise. Although forensic practitioners are able to directly identify many common footwear impressions without the need for classification against the descriptors, classifying rare or new shoe models takes longer. As there are tens of thousands of footwear models, it is impractical for a human expert to accurately identify a specific model using only the descriptors. The NFRC/NFD provides a number of additional searching and ordering features to make identification possible in a practical time span.
D01: Bar – A bar of any type such as straight, angled, curved, including chevrons
D01-01: Wavy – A bar element with more than one directional change
D01-02: Curved-wavy – Any bar shape/element deviating from a straight line with a single rounded directional change, however small the angle of the curved section
D02: Circular – Includes circle, semi-circle, oval, semi-oval, concentric circles, target, tear-drop, stud, crescent
D02-01: Target – Any concentric circle arrangement, whether the centre-most circle is hollow or solid
D03: 3 sided – All types of triangle, including those with one rounded side such as a pie-segment
D04: 4 sided – Square, rectangle, oblong, parallelogram, rhombus, diamond, arrowhead
D05: 5 sided – Usually a regular shaped pentagon, but includes all five-sided shapes
D06: 6 sided – Usually a regular shaped hexagon, but includes all six-sided shapes
D07: Complex – Shapes such as star, arrow, waisted bar, heart and cross, and any other shape with more than six sides
D08: Zigzag – A broken or continuous line that changes direction repeatedly with abrupt right and left turns
D09: Text – Any alpha-numeric characters; may overlap with D10
D10: Logo – A brand or trademark incorporating a symbol, badge, emblem or picture; may overlap with D09
D11: Lattice – A regular, interlocking and/or repeated pattern (aka network, web or trellis); includes brickwork, herring-bone, honeycomb and chicken wire
D12: Textured – Includes predominant stippling, crepe or random patterns added by the manufacturer as part of their design
D13: Hollow – A pattern that has the appearance of a hollow shape, such as a doughnut or frame
D14: Plain – A plain surface with no patterns or texture

Table 1: Footwear descriptors for the UK's National Footwear Reference Collection (NFRC)
These features are generally used in the same way for all searches (looking at frequency/geography of distribution) and therefore take little time to use compared with the time taken to identify descriptors. However, the most frequently worn footwear models are very well known to forensic practitioners and are thus easily labelled by them, without the need for any computer system, or the descriptors. Since the accuracy of human footwear forensics experts has not been empirically evaluated, the automated process cannot be claimed to be as good as, or better than, human experts. Despite this, clear use cases for automatic descriptor identification exist. For example, when a new footwear model is captured, labelling would be completed by an expert, then blindly verified by another. Automatic descriptor identification would be faster and more available for the second check, as the number of available human experts is limited. It could therefore potentially replace the second opinion when adding patterns to the NFRC (see Section 1.2).
The automation of the descriptor analysis can provide rapid identification of the descriptors in a given impression, which in turn will result in faster identification of a shoe model from its print, especially for personnel untrained in footwear analysis. Additionally, the identified descriptors can be used to narrow down the search in a database containing thousands of footwear impressions. An automated approach which is not only capable of identifying the descriptors but can also infer their topological location (e.g. using Grad-CAM [8]) can be further beneficial for training police users. Rapid automatic descriptor identification can be achieved without involving a forensic expert, resulting in faster determination of intelligence. The latter is particularly important, as the suspect then has little time to destroy the evidence and can be questioned sooner (ideally before leaving custody), plausibly increasing the detection rate. In England and Wales there are currently around 25-30 human experts (an estimate without an official source) who can identify the descriptors, whereas there are 123,171 law enforcement personnel [9] who may handle a case where identification of the descriptors is necessary (according to a statistical bulletin published on 18th July 2019 by the Home Office for England and Wales; this number does not include the British Transport Police). Automated descriptor identification can potentially provide such expertise to all law enforcement personnel in the UK.

Due to the large variability in the complex geometric shapes and patterns of a descriptor, a simple template matching algorithm [10] would be suboptimal. Each of the descriptors has an apparent but variable high-level geometric semantics.

Figure 1: Different types of descriptors: D10, D11, D03 on two separate real inked impressions

Figure 1 shows two real-world inked impressions with four descriptors each: D03, D09, D10, D11. As can be seen, although the same descriptors appear on both impressions, their patterns are very distinct. While D10: Logo is an obvious example, the more 'stable' D03: 3 sided also looks quite different: the impression on the right has a D03 with smoother edges, which is also bigger than the D03 found in the left impression. Also note that although both impressions contain D11: Lattice, its appearance is very distinct.

As a result, designing filters by hand to identify the descriptors is not practical. Instead, we take a deep learning based approach, able to automatically learn the filters from the already existing manually labelled dataset.
Training a deep neural network requires estimating a large number of parameters, on the order of hundreds of millions. The matrix arithmetic performed to estimate these parameters is best suited to a Graphics Processing Unit (GPU), due to a GPU's ability to perform highly parallel floating-point operations when compared with a Central Processing Unit (CPU). GPU computational power is still limited, however, and there are other bottlenecks, such as moving data between the main memory and the GPU. As a result, a smaller model with fewer parameters is computationally more efficient than a bigger model with more parameters.

For a fixed base architecture (number of layers and units per layer), the number of computations grows approximately quadratically with the resolution of the input image. Higher resolution images also take up more space in the GPU memory, which tends to be smaller than system memory, limiting the batch size and further reducing the overall training speed. In order to reduce the computational cost and facilitate faster training, the input images are usually downscaled [11, 12]. However, the accuracy of a neural network tends to suffer when image resolution is reduced.

It should also be noted that the benefit of a higher resolution input does not grow indefinitely with resolution, e.g. once the upper bound of the achievable accuracy for a specific domain has been reached. In our case, very small and complex features define a class (see Section 1); thus, the resolution up to which increases remain beneficial is assumed to be higher than in classification tasks where the classes are more apparent.
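To make the approximately quadratic scaling concrete, the following back-of-the-envelope sketch counts the multiply-accumulate operations (MACs) of a single convolutional layer at increasing input resolutions; the layer shape used here is an illustrative assumption, not an actual ResNet-50 layer.

```python
# Back-of-the-envelope cost model: MACs of a single stride-1, 'same'-padded
# k x k convolution. The 64 -> 64 channel layer shape is illustrative only.

def conv2d_macs(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    # one (k * k * c_in)-element dot product per output pixel and channel
    return h * w * c_out * (k * k * c_in)

base = conv2d_macs(224, 224, 64, 64)
for side in (224, 448, 896):
    macs = conv2d_macs(side, side, 64, 64)
    print(f"{side}x{side}: {macs / 1e9:6.2f} GMACs ({macs / base:.0f}x the 224x224 cost)")
# Doubling each side quadruples the pixel count, and with it the cost:
# 224 -> 1x, 448 -> 4x, 896 -> 16x.
```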
Our dataset consists of high-resolution footwear impressions captured via the means discussed in Section 1.1. As discussed in Section 3, in practice the image resolution needs to be reduced, and there are a number of different image interpolation techniques that can be used here (we use the terms interpolation, resampling, downscaling, and resizing interchangeably). In our experiments, we investigate and benchmark various combinations of the following image interpolation techniques; a code sketch after the figures below illustrates how they can be applied:

• Nearest Neighbour interpolation (N) is the least computationally expensive and does not insert new colours into the result. Only the nearest neighbour's pixel intensity is considered, so the estimation function $f$ at a point $(x, y)$ becomes a piecewise function with constant value [13, 14].

• Bilinear interpolation (B) is a linear interpolation over all non-channel dimensions of an image, i.e. for a two-dimensional image it is the interpolation over both the X and Y dimensions [15]. A straight line passing through two points $(x_1, y_1)$ and $(x_2, y_2)$ is the linear interpolant of these two points between $x_1$ and $x_2$. For any point in the range $(x_1, x_2)$, the slopes of the interpolant towards both of these points must be exactly the same, hence the following equation of slopes can be formulated:

$$\frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1} \tag{1}$$

Solving Equation 1 for $y$ gives:

$$y = y_1 \left(\frac{x_2 - x}{x_2 - x_1}\right) + y_2 \left(\frac{x - x_1}{x_2 - x_1}\right) \tag{2}$$

Equation 2 produces interpolation in the X direction. In the case of a two-dimensional image with four points $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$, $Q_{22} = (x_2, y_2)$, the task is to estimate the function $f$ at a point $(x, y)$. In this four-point scenario, linear interpolation in the X direction using Equation 2 gives:

$$f(x, y_1) = \left(\frac{x_2 - x}{x_2 - x_1}\right) f(Q_{11}) + \left(\frac{x - x_1}{x_2 - x_1}\right) f(Q_{21}) \tag{3}$$

$$f(x, y_2) = \left(\frac{x_2 - x}{x_2 - x_1}\right) f(Q_{12}) + \left(\frac{x - x_1}{x_2 - x_1}\right) f(Q_{22}) \tag{4}$$
We can then use Equations 3 and 4 to interpolate in the Y direction in order to estimate $f(x, y)$:

$$f(x, y) = \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2) \tag{5}$$

• Hamming interpolation (H) uses a sinc-approximating kernel obtained by multiplying the well-known sinc function [16] with the Hamming window function [17] (a multiplication which corresponds to a convolution in the frequency domain) [18]. Equation 6 is the Hamming window function over the window interval $(-m, m)$ and Equation 7 is the sinc function:

$$W_{hamming}(x) = 0.54 + 0.46 \cos\left(\frac{\pi x}{m}\right) \tag{6}$$

$$\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x} \tag{7}$$

Although an ideal interpolation technique would not alter any pattern within the image or introduce any artefacts, most interpolation techniques alter some image features and introduce artefacts when used to reduce image resolution [14, 19]. Figures 3, 4, and 5 show how the interpolation techniques discussed above affect the features of a footwear impression image at different resolutions (zooming in is required, as the images embedded in this paper go through arbitrary interpolation applied by your browser or PDF reader). It is apparent from the undersampled images (Figure 3) that different interpolation techniques produce slightly different results. Although these discrepancies are aesthetically undesirable and can hamper the performance of the model, we leverage such differences as an effective image augmentation technique, as described in Section 5.

Figure 2: Original image without any interpolation (zoom in to circumvent distortion introduced by the medium in which this paper is being viewed)

As we can see, all three interpolated images closely resemble the original (Figure 2) at a higher resolution (Figure 5), and at the same time their differences reduce. Comparing the lowest resolution images (Figure 3), it is apparent that they contain D01: Bar elements which appear angled, whereas the original (Figure 2) and higher resolution (Figure 5) impressions have D01: Bar elements which are straight. A study by [11] found that even the mildest quality loss in input images can greatly hamper the performance of a deep learning model. Glorot and Bengio [20] also found neural networks to be susceptible to image noise.

Figure 3: Interpolation samples (N, B, H) with fixed aspect ratio and varying sizes

Figure 4: Interpolation samples (N, B, H) with fixed aspect ratio and varying sizes (zoom in to see the original pattern)

Figure 5: Interpolation samples (N, B, H) with fixed aspect ratio and varying sizes (zoom in to see the original pattern)
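The following minimal sketch shows how such multi-interpolation inputs can be produced, assuming Pillow (version 9.1 or later, for the Image.Resampling enum) and NumPy; the file name and helper function are hypothetical, and the channel ordering anticipates the configurations explored later (Table 2).

```python
# Sketch: build a three-channel input from one greyscale impression by
# downscaling with three different filters. The file name is hypothetical.
import numpy as np
from PIL import Image

FILTERS = {
    "B": Image.Resampling.BILINEAR,
    "H": Image.Resampling.HAMMING,
    "N": Image.Resampling.NEAREST,
}

def make_input(path: str, size=(144, 352), config: str = "BHN") -> np.ndarray:
    """Return an (H, W, 3) array with one interpolated copy per channel.

    `size` is (width, height), as Pillow expects; 144x352 preserves the
    tall-and-thin aspect ratio of a shoeprint rather than squishing it.
    """
    grey = Image.open(path).convert("L")  # ensure a single-channel image
    channels = [np.asarray(grey.resize(size, FILTERS[f])) for f in config]
    return np.stack(channels, axis=-1)    # collate into a 3-channel input

x = make_input("impression.png")          # hypothetical file name
print(x.shape)                            # (352, 144, 3); transpose to
                                          # channel-first before feeding PyTorch
```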
In our experiments, we use ResNet-50, a popular 50-layer CNN architecture with residual connections [21], pre-trained on the ImageNet dataset [22], with a custom head initialised using the Glorot/Xavier initialisation [20] and an optional, learnable preprocessing layer (see below). The head consists of an adaptive pooling layer, followed by two BatchNorm → Dropout → Linear/Dense blocks with a ReLU non-linearity in between. The output layer has a total of 17 neurons with sigmoid activation functions, one per descriptor type.
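A minimal PyTorch sketch of this architecture is given below. The class name is hypothetical, the hidden width (512) and dropout rate (0.5) are assumptions rather than the values used in the experiments, and the Glorot initialisation of the head is omitted for brevity.

```python
# Sketch of the architecture described above (PyTorch / torchvision >= 0.13).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DescriptorNet(nn.Module):
    def __init__(self, n_descriptors: int = 17, hidden: int = 512, p: float = 0.5):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")                 # ImageNet pre-trained
        self.body = nn.Sequential(*list(backbone.children())[:-2])   # drop pool + fc
        self.pool = nn.AdaptiveAvgPool2d(1)   # adaptive pooling admits rectangular inputs
        self.head = nn.Sequential(
            nn.BatchNorm1d(2048), nn.Dropout(p), nn.Linear(2048, hidden),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(hidden), nn.Dropout(p), nn.Linear(hidden, n_descriptors),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.pool(self.body(x)).flatten(1)
        return torch.sigmoid(self.head(z))    # one probability per descriptor

model = DescriptorNet()
probs = model(torch.randn(2, 3, 352, 144))   # rectangular 352x144 inputs work as-is
print(probs.shape)                           # torch.Size([2, 17])
```

During the initial training phase, the pre-trained body can be frozen with `model.body.requires_grad_(False)` and unfrozen for fine-tuning (see the two training phases described next).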
The models are trained using AdamW [23], a stochastic gradient descent based backpropagation algorithm, in two phases:

1. Initial training, where all the ResNet-50 body layers are frozen and only the 2-layer head, as well as the optional preprocessing layer, are trained with a fixed learning rate and weight decay.

2. Fine-tuning, where the whole network is trained using discriminative learning rates [24] and weight decay.

We have experimented with various combinations of the following:

1. Loss function: In addition to the default Binary Cross Entropy (BCE) loss, which in our experiments was always used in a cost-sensitive setting via class weighting (i.e. with the cost of misclassification being inversely proportional to class frequency in the training dataset), we have also used the Soft-F1 loss in an attempt to maximise both precision and recall directly within the model training process (a code sketch appears after this list). The Soft-F1 loss is a simple generalisation of the F1 score, obtained by replacing the numbers of True Positives (TP), False Positives (FP) and False Negatives (FN) with their probabilistic counterparts [25]:
$$TP = \sum_i y_i \hat{y}_i \qquad FP = \sum_i (1 - y_i)\,\hat{y}_i \qquad FN = \sum_i y_i (1 - \hat{y}_i)$$

where $y_i \in \{0, 1\}$ is the label for the $i$-th data instance and $\hat{y}_i \in [0, 1]$ is the model prediction.

2. Channel configuration: All the original input images are greyscale (single-channel), yet the pre-trained model expects RGB/colour inputs (three channels). The simplest and most popular approach to address this discrepancy is to collate three identical copies of the greyscale input. Since this approach seems wasteful, we have instead opted for various compositions of the three-channel input obtained by applying different interpolation techniques (see Section 4) to the high-resolution input image; these are specified in Table 2.
3. Preprocessing layer: For the same reasons as described above, we have included a number of learnable preprocessing layers in our network. The rationale here is that the distribution of greyscale images, particularly when collating three different interpolated versions of each image into a single three-channel input, is different from the distribution of natural RGB images from the ImageNet dataset. The preprocessing layers we have used are shown in Figure 6.
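The following sketch implements the Soft-F1 loss from item 1 above in PyTorch; the epsilon term and the macro-averaging over classes are our assumptions.

```python
# A sketch of the Soft-F1 loss: TP, FP and FN are replaced by their
# probabilistic counterparts and 1 - F1 is minimised.
import torch

def soft_f1_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    """y_pred: (batch, n_classes) probabilities; y_true: same shape, 0/1."""
    tp = (y_true * y_pred).sum(dim=0)          # probabilistic true positives
    fp = ((1 - y_true) * y_pred).sum(dim=0)    # probabilistic false positives
    fn = (y_true * (1 - y_pred)).sum(dim=0)    # probabilistic false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - soft_f1.mean()                  # minimising this maximises soft F1

loss = soft_f1_loss(torch.rand(8, 17), torch.randint(0, 2, (8, 17)).float())
```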
channels   R                   G                   B
B-B-B      Bilinear            Bilinear            Bilinear
B-H-N      Bilinear            Hamming             Nearest Neighbour
B-N-H      Bilinear            Nearest Neighbour   Hamming
H-B-N      Hamming             Bilinear            Nearest Neighbour
H-H-H      Hamming             Hamming             Hamming
H-N-B      Hamming             Nearest Neighbour   Bilinear
N-B-H      Nearest Neighbour   Bilinear            Hamming
N-H-B      Nearest Neighbour   Hamming             Bilinear
N-N-N      Nearest Neighbour   Nearest Neighbour   Nearest Neighbour
Table 2: Compositions of input channels via different combinations of interpolation techniques

Figure 6: Learnable preprocessing layers: (a) cbn_1: 1x1 Conv and BatchNorm, (b) cbn_3: 3x3 Conv and BatchNorm, (c) inc: inception-like transformation, (d) inc_d: dense inception-like transformation
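As an illustration of Figure 6, a sketch of the two simplest preprocessing layers (cbn_1 and cbn_3) follows; keeping the output at three channels, so that the pre-trained ResNet stem remains unchanged, is our assumption about the design.

```python
# A sketch of the two simplest learnable preprocessing layers of Figure 6.
import torch.nn as nn

def cbn(kernel_size: int) -> nn.Sequential:
    """cbn_1 (1x1) or cbn_3 (3x3): a learnable remixing of the three
    interpolated input channels, followed by BatchNorm."""
    return nn.Sequential(
        nn.Conv2d(3, 3, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(3),
    )

cbn_1, cbn_3 = cbn(1), cbn(3)
# Usage sketch: prepend to the pre-trained body, e.g.
#   net = nn.Sequential(cbn_1, pretrained_body, head)
```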
The training dataset consisted of 33,757 greyscale images retrieved from the NFRC, with the class distribution shown in Figure 7a. As can be seen, the classes are dominated by D01, D02, D04 and D07, with D01-01, D02-01, D05 and D14 being the least frequent. The validation set consisted of 1,000 images retrieved from the same database, with the class distribution shown in Figure 7b. The original image resolutions vary over a wide range and are depicted in Figure 8, where the whiskers represent the lower and upper quantiles of the distribution, and outliers have been omitted for presentation clarity.

Figure 7: Class distribution: (a) training set, (b) validation set

Figure 8: Original input image resolution
Table 3 contains the aggregated results of a total of 180 experiments run across 90 different combinations of hyperparameters, as specified in Table 2 and Figure 6. Each experiment was repeated twice with random initialisation, and the average of these two runs was used to construct Table 3. In each of the experiments, we trained a custom head on top of a fixed/frozen ImageNet pre-trained ResNet-50 for 10 epochs, followed by 40 epochs of fine-tuning the whole network. We have opted for the Area Under the Precision-Recall Curve (PRAUC) as the performance metric, in order to decouple the results from class-specific thresholds. PRAUC is a better measure of the performance of a binary classifier than the AUC of the ROC (receiver operating characteristic) curve [26, 27], since ROC is very sensitive to class imbalance, and in our case the class labels are heavily imbalanced.
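For reference, the sketch below shows how a per-descriptor PRAUC of this kind can be computed with scikit-learn; macro-averaging over the 17 descriptors is our assumption about how the aggregate figures in the tables were obtained.

```python
# A sketch of per-descriptor PRAUC: the literal area under the
# precision-recall curve, averaged over classes.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def prauc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true: (n, 17) binary labels; y_score: (n, 17) predicted scores."""
    areas = []
    for k in range(y_true.shape[1]):
        precision, recall, _ = precision_recall_curve(y_true[:, k], y_score[:, k])
        areas.append(auc(recall, precision))  # threshold-free summary per class
    return float(np.mean(areas))

rng = np.random.default_rng(0)
print(prauc(rng.integers(0, 2, (100, 17)), rng.random((100, 17))))
```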
The first thing to notice is that, according to the results, preserving the aspect ratio of the input images (i.e. the 352×144 resolution) always leads to better performance than using squished images (i.e. the 224×224 resolution; see the '∆' column in Table 3). This holds regardless of the type of resampling, preprocessing and loss function used. It appears that preserving the aspect ratio of the original images matters much more than a higher horizontal resolution (i.e. 144 vs 224 pixels) and is the best way of spending a fixed computational budget; the number of input pixels in both cases is approximately equal (224 × 224 = 50,176, 352 × 144 = 50,688). Although this performance difference might be partially attributed to the nature of our inputs (shoe impressions are long and thin, so squishing can introduce significant distortions), the same can be said about many other objects, like people or vehicles. It is hence somewhat surprising that 224×224 is the default transfer learning setting in popular deep learning frameworks [28, 29, 30].

The average difference in PRAUC between the two resolutions (everything else being equal) is 0.0189 (2.8%), while the maximum difference over all 180 runs reaches 0.0415 (6.5%). To put these numbers in context, the average difference between any two randomly initialised runs of each experiment is 0.0051, while the maximum difference is 0.0206. Thus, the observed effect is unlikely to be a random fluctuation. For this reason, we limit further analysis to the results for the 352×144 (and higher) resolutions, preserving the input aspect ratio.

                                BCE loss                       F1 loss
channels   preprocessing   352×144   224×224   ∆         352×144   224×224   ∆
N-N-N      cbn_1           0.7200    0.6969    0.0231    0.6927    0.6712    0.0215
N-N-N      cbn_3           0.7171    0.6968    0.0203    0.6839    0.6741    0.0098
N-N-N      inc             0.7205    0.6990    0.0215    0.6861    0.6706    0.0155
N-N-N      inc_d           0.7215    0.6945    0.0270    0.6938    0.6712    0.0226
N-N-N      no_tfm          0.7135    0.6940    0.0195    0.6902    0.6728    0.0174
H-H-H      cbn_1           0.7113    0.6958    0.0155    0.6876    0.6694    0.0182
H-H-H      cbn_3           0.7172    0.6939    0.0233    0.6852
H-H-H      inc             0.7133    0.6931    0.0202    0.6878    0.6721    0.0157
H-H-H      inc_d
B-B-B      no_tfm
µ

Table 3: Performance based on the Area Under the Precision-Recall Curve (PRAUC). '∆' denotes the difference between the two resolutions. The maximum in each column is shown in bold; the minimum is underlined.
A similar observation can be made when considering the loss function. BCE consistently outperforms the F1 loss, with an average PRAUC difference of 0.0269 and a maximum difference of 0.0374. Despite the attractive theoretical properties of the F1 loss discussed in Section 5, the BCE loss proved to be a much better choice in practice. For this reason we do not consider the F1 loss in the subsequent analysis.

Table 4: Performance for 352×144 input resolution, BCE loss, and various input channel configurations, averaged over preprocessing methods and sorted by PRAUC (columns: rank, channels, PRAUC (↓), max(∆))

Table 5: Performance for 352×144 input resolution, BCE loss, and various preprocessing layers, averaged over input channel configurations and sorted by PRAUC (columns: rank, preprocessing, PRAUC (↓), max(∆))
The average difference in PRAUC among the various combinations of input channels (Table 4) is much less pronounced. The approach of creating an input to the network by 'sandwiching' the outputs of three different interpolation methods, rather than simply creating three identical channels (i.e. B-N-H vs N-N-N), does not seem to affect the PRAUC much, with the difference between the best and worst performing approaches being 0.0041. Nevertheless, it is worth noting that the B-B-B approach, which is the default in existing deep learning frameworks [28, 29, 30], is one of the worst performing. This seems to confirm the intuition that the blurring effect characteristic of bilinear interpolation tends to make discrimination between different types of descriptors more challenging. Another observation is that the influence of input channel ordering on PRAUC can be almost as big as the difference between the best and worst performing approaches (the difference between B-N-H and H-N-B is 0.0039). This is somewhat surprising, as in theory the learnable preprocessing layer should be able to 'swap' the input channel order if needed. However, in the context of the average difference between any two randomly initialised runs of each experiment, which as mentioned earlier was 0.0051, the results given in Table 4 need to be declared inconclusive.
The influence of the learnable preprocessing layer on PRAUC is more substantial. As can be seen in Table 5, the difference between the dense inception-like transformation (inc_d) and no transformation at all (no_tfm) reaches 0.0082. Since no_tfm is the worst performing approach in our experiments, we conclude that using some kind of learnable preprocessing is beneficial.

channels   cbn_1    cbn_3    inc      inc_d    no_tfm   µ        max(∆)
B-B-B
B-H-N
B-N-H
H-B-N      0.7224
H-H-H
H-N-B
N-B-H
N-H-B
N-N-N
µ
max(∆)

Table 6: PRAUC for 352×144 input resolution, BCE loss, and various combinations of pre-first-layer transformations and input channel configurations. The highest score in each column is shown in bold; the highest PRAUC in each row is underlined.

Table 6 presents the breakdown of the results by preprocessing layers and input channel configurations. As can be seen, inc_d gives the highest PRAUC on average (0.7208), is the best preprocessing method for 6 out of 9 input channel configurations, and second best for a further 2. It is harder to identify the best performing channel configuration, as none of them dominates across the different preprocessing layers. However, looking at preprocessing and channel configuration jointly, inc_d with H-H-H gives the highest PRAUC of 0.7259, which is 0.0198 more than the worst performing combination (no_tfm with B-B-B) and 0.0092 more than the average across all the entries in Table 6.
channels   cbn_1    cbn_3    inc      inc_d    µ        max(∆)
B-B-B      0.7450
B-H-N
B-N-H
H-B-N
H-H-H      0.7450
H-N-B
N-B-H
N-H-B
N-N-N
µ
max(∆)

Table 7: PRAUC at the increased input resolution, BCE loss, and various combinations of preprocessing layers and input channel configurations. The highest score in each column is shown in bold; the highest PRAUC in each row is underlined.

In Table 7 we report the results of a similar experiment, this time with the resolution of the input images increased. This led to a significant increase in PRAUC across all tested combinations of preprocessing layers and input channel configurations, with minimum, average and maximum differences of 0.0132, 0.0221 and 0.0337 respectively. As before, inc_d is the dominating preprocessing method, while none of the input channel configurations emerges as a clear winner.
Note that in Table 7, B-B-B is no longer as strongly dominated by the other input channel configurations as was the case at lower input resolutions, due to the blurring effect now being less severe.

channels   inc_d
B-B-B
B-H-N      0.7736
B-N-H
H-B-N
H-H-H
H-N-B
N-B-H
N-H-B
N-N-N
µ
max(∆)

Table 8: PRAUC at the highest tested input resolution, BCE loss, and various input channel configurations. The highest PRAUC is shown in bold.

In our final experiment, we investigated increasing the input resolution further. As shown in Table 8, this resulted in further improvement, with PRAUC gains reaching 0.0280 on average. It is also apparent that with the increase in input resolution the importance of the interpolation method diminishes; the maximum difference among all the interpolation methods is 0.0060, albeit achieved at a significantly increased computational cost.
In Figure 9 we depict the PRAUC of our best model from Table 8, broken down by class/descriptor. As can be seen, some descriptors are relatively easy to classify: D01: Bar, D02: Circular, D04: 4 sided, D07: Complex and D08: Zigzag all achieve the highest PRAUC values, which is to be expected as these descriptors are relatively clear cut (see Table 1). At the other end of the spectrum, D05: 5 sided, followed by D13: Hollow and D01-02: Curved-wavy, are the most challenging. Note that D05 is not only the least frequent in the dataset (see Figure 7; we have counteracted this by using class weighting in the BCE loss), but it is also one of the subtler descriptors in general. As can be seen in Table 1, D05 can for example be a rectangle with one of the corners 'cut off', hence easy to confuse with D04: 4 sided. In a similar vein, D13: Hollow can easily be confused with a circle (D02: Circular), triangle (D03: 3 sided), square/rectangle (D04: 4 sided), etc. Some shapes on an impression can also represent multiple descriptors; for example, D09: Text and D10: Logo will often overlap. Some overlays may be more complex, such as D03: 3 sided and D13: Hollow. There may even be examples of nested overlap, such as D02-01: Target (which implies D02: Circular) and D13: Hollow, as a circle with the centre missing is both a target and hollow. D14: Plain is unusual in that when it applies to part of the shoe, it excludes the other descriptors from that area.
Figure 9: Per-class PRAUC

In order to investigate this issue further, in Figure 10 we show the confusion matrix generated for the validation dataset. For each validation image, if the predicted score for a descriptor which is not present in the image (a false positive) exceeds the score for a descriptor which is in the image (a true positive), then the two are considered confused. An example is given in Table 9. The actual labels are D02, D05 and D06. Since the score for D02 is the highest, 1 would be added to the diagonal entry for this descriptor. However, as the score for D04 (a false positive) is higher than the scores for D05 and D06, these are considered confused (i.e. either or both of D05 and D06 are classified as D04), and hence 1/2 (i.e. one over the number of potentially misclassified descriptors) is added to the entries D05-D04 and D06-D04 of the confusion matrix.
Descriptor   D01   D02   D03   D04   D05   D06
Label         0     1     0     0     1     1
Score

Table 9: Example prediction to illustrate the confusion matrix calculation
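The sketch below (NumPy; the function name is hypothetical) implements this rule as we read it; the case of a true positive being outranked by several false positives is not fully pinned down by the example above, so the pairwise attribution used here is an assumption.

```python
# A sketch of the soft confusion-matrix rule described above. Our reading of
# the weighting: each outranked true-positive descriptor contributes 1/m,
# where m is the number of outranked true positives in that image, so the
# D05/D06 vs D04 example adds 1/2 to both D05->D04 and D06->D04.
import numpy as np

def soft_confusion(y_true: np.ndarray, y_score: np.ndarray) -> np.ndarray:
    """y_true: (n, k) binary labels; y_score: (n, k) scores. Returns (k, k)."""
    k = y_true.shape[1]
    cm = np.zeros((k, k))
    for labels, scores in zip(y_true.astype(bool), y_score):
        present, absent = np.flatnonzero(labels), np.flatnonzero(~labels)
        if present.size == 0:
            continue
        top = present[scores[present].argmax()]
        cm[top, top] += 1                      # highest-scoring true positive
        outranked = [p for p in present
                     if absent.size and (scores[absent] > scores[p]).any()]
        for p in outranked:                    # share the weight equally
            for a in absent[scores[absent] > scores[p]]:
                cm[p, a] += 1.0 / len(outranked)
    return cm
```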
As can be seen in Figure 10, D01 is the most frequently misclassified descriptor, which can be partially explained by its prevalence in the dataset. In Figure 11(a) we show an example of a D01: Bar in the top right corner of the print which has not been detected by our model. At the same time, the model detected D13: Hollow in the locations shaded in orange, although this particular shoeprint impression has not been labelled with D13 by the human expert. However, due to wear, some of the 6 sided shapes (D06) have closed, and now indeed fit the description of D13. Another example of an undetected D01 is given in Figure 11(b), where in addition D11: Lattice has been misclassified as D04: 4 sided in the areas highlighted by the heatmap; the lattice indeed consists of 4 sided 'cells'. In both of these examples it is actually difficult to state which descriptor D01 was confused with; it appears that D01 was simply not detected, yet this would still be recorded in the confusion matrix due to the way in which the matrix was derived.

Figure 10: Confusion matrix

Another descriptor worth looking at is D05, which as mentioned before is the rarest in the dataset and has the lowest PRAUC. D05 is most often misclassified as D06 or D12, while at the same time D03, D04 and D07 are most often misclassified as D05. An example can be seen in Figure 11(c).

It is worth noting that all three examples of errors in Figure 11 have been selected on the basis of the highest loss (i.e. they are as bad as it gets). It is reassuring that these do not result from the model behaving in an unexpected fashion, but are rather due to ambiguity in the inputs that may even cause disagreement between expert users, requiring resolution by an expert panel.

Figure 11: High loss misclassification examples
The descriptor identification task we have approached in this study is of great significance for forensic practitioners in the UK and beyond. The descriptors are an agreed standard for coding footwear patterns, in active use across the different UK police forces. Although a human performance benchmark is not available at this time, our model performs well, with a PRAUC of over 0.77 for the best configuration. The mistakes that the model tends to make are mostly justifiable, either by ambiguity or by overlaps in the input patterns, and are not unlike those an inexperienced human would make. The system that we have built has been deployed for testing by selected police forces.

In the process of building the model, we have experimented with a number of ways of feeding greyscale impressions to the ImageNet (RGB) pre-trained network. Our findings can be summarised as the following 'best practices':

• Preserve the aspect ratio of the input images. This seems particularly important if the object of interest (a shoeprint in our case) has an aspect ratio significantly different from 1 (i.e. 'long and thin' or 'short and fat'). This advice goes against the common practice in the computer vision community of 'squishing' the input images to make them square in order to use ImageNet pre-trained models. This is unnecessary, as current deep learning frameworks allow one to feed rectangular images to ImageNet pre-trained models out of the box.

• Use as high an input resolution as practical. In our experiments, increasing the input resolution always led to a higher PRAUC, albeit at the cost of significantly increased computation, which is an obvious constraint. The original resolution of the input images can also be a limitation, as there is little point in upscaling such images.

• Use different interpolation methods to construct the three input channels from greyscale images. Although the effect of this approach that we have observed was modest, it was positive nevertheless. It also seems that using Nearest Neighbour interpolation for one of the input channels is beneficial, while using three identical channels obtained via bilinear interpolation is detrimental, particularly at lower input resolutions.

• Use a learnable preprocessing layer. In our experiments, these additional computations played a crucial role in adapting greyscale inputs for use with a colour-image pre-trained network, regardless of the interpolation method used. Learnable preprocessing combined with different interpolation methods to construct the three input channels gave the best results.

Acknowledgements
The financial support of Innovate UK via the Knowledge Transfer Partnerships programme is gratefully acknowledged.
References

[1] Bluestar Software Limited. National Footwear Solutions (the NFRC and NFD), 2020. URL https://bluestar-software.co.uk/products/forensic-intelligence/.
[2] Robert Milne. Forensic Intelligence. CRC Press, 2012.
[3] Hannah Larsen and Matthew R Bennett. Recovery of 3D footwear impressions using a range of different techniques. Journal of Forensic Sciences, 2020.
[4] Matthew R Bennett and Marcin Budka. Digital Technology for Forensic Footwear Analysis and Vertebrate Ichnology. Springer, 2018.
[5] Bluestar Software Limited. National Footwear Solutions (the NFRC and NFD), 2020. URL https://bluestar-software.co.uk/products/uk-national-solutions/.
[6] William J Bodziak. Footwear Impression Evidence: Detection, Recovery and Examination. CRC Press, 1999.
[7] National Policing Improvement Agency. Footwear marks recovery manual, 2007. URL http://library.college.police.uk/docs/appref/NPIA-(2007)-Footwear-Marks-Recovery-Manual.pdf.
[8] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618-626, 2017.
[9] Home Office. Police Workforce, England and Wales, 2019. URL https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/831726/police-workforce-mar19-hosb1119.pdf.
[10] Roberto Brunelli. Template Matching Techniques in Computer Vision: Theory and Practice. John Wiley & Sons, 2009.
[11] Michal Koziarski and Boguslaw Cyganek. Impact of low resolution on image recognition with deep neural networks: An experimental study. International Journal of Applied Mathematics and Computer Science, 28(4), 2018.
[12] Suresh Prasad Kannojia and Gaurav Jaiswal. Effects of varying resolution on performance of CNN based image classification: An experimental study. International Journal of Computer Sciences and Engineering, 6(9):451-456, 2018.
[13] Philippe Thévenaz, Thierry Blu, and Michael Unser. Image interpolation and resampling. Handbook of Medical Imaging, Processing and Analysis, 1(1):393-420, 2000.
[14] Philippe Thévenaz, Thierry Blu, and Michael Unser. Interpolation revisited [medical images application]. IEEE Transactions on Medical Imaging, 19(7):739-758, 2000.
[15] PR Smith. Bilinear interpolation of digital images. Ultramicroscopy, 6(2):201-204, 1981.
[16] Philip M Woodward and Ian L Davies. Information theory and inverse probability in telecommunication. Proceedings of the IEE - Part III: Radio and Communication Engineering, 99(58):37-44, 1952.
[17] Ralph Beebe Blackman and John Wilder Tukey. The measurement of power spectra from the point of view of communications engineering - Part I. Bell System Technical Journal, 37(1):185-282, 1958.
[18] Erik HW Meijering, Wiro J Niessen, and Max A Viergever. Quantitative evaluation of convolution-based methods for medical image interpolation. Medical Image Analysis, 5(2):111-126, 2001.
[19] Jeffrey Tsao. Interpolation artifacts in multimodality image registration based on maximization of mutual information. IEEE Transactions on Medical Imaging, 22(7):854-864, 2003.
[20] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[22] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[23] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
[24] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339, 2018.
[25] Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832-840. PMLR, 2017.
[26] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):e0118432, 2015.
[27] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial. How to predict social relationships: physics-inspired approach to link prediction. Physica A: Statistical Mechanics and its Applications, 523:1110-1129, 2019.
[28] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[30] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.