Deep Multilabel CNN for Forensic Footwear Impression Descriptor Identification
Marcin Budka, Akanda Wahid Ul Ashraf, Scott Neville, Alun Mackrill, Matthew Bennett
Faculty of Science and Technology, Bournemouth University, Fern Barrow, Poole BH12 5BB, UK
e-mail: {aashraf, mbudka, mbennett}@bournemouth.ac.uk
Scott Neville, Alun Mackrill
Bluestar Software Ltd., Fair Cross Offices, Stratfield Saye RG7 2BT, UK
e-mail: {scott.neville, alun.mackrill}@bluestar-software.co.uk
February 11, 2021

Abstract
In recent years deep neural networks have become the workhorse of computer vision. In this paper, we employ a deep learning approach to classify footwear impression features, known as descriptors, for forensic use cases. Within this process, we develop and evaluate an effective technique for feeding downsampled greyscale impressions to a neural network pre-trained on data from a different domain. Our approach relies on a learnable preprocessing layer paired with multiple interpolation methods used in parallel. We empirically show that this technique outperforms using a single type of interpolated image without learnable preprocessing, and that it can help to avoid the computational penalty of high resolution inputs by making more efficient use of low resolution inputs. We also investigate the effect of preserving the aspect ratio of the inputs, which leads to a considerable boost in accuracy without increasing the computational budget with respect to squished rectangular images. Finally, we formulate a set of best practices for transfer learning with greyscale inputs, potentially widely applicable in computer vision tasks ranging from footwear impression classification to medical imaging.
In this work we develop an approach to train a deep Convolutional Neural Network (CNN) to classify features in footwear impressions for use in forensic applications. The features we classify are known as descriptors within the UK footwear forensic units [1, 2, 3, 4] and can be defined as recognisable units within a footwear pattern which can be classified. The descriptors are used by forensic practitioners to describe the makeup of a footwear pattern.

Every footwear impression added to the UK's National Footwear Reference Collection (NFRC) is manually labelled with the descriptors [1]. The NFRC is built on an agreed standard for coding footwear patterns for the different forces in the UK and, at the time of writing, to the best of our knowledge, it is the biggest police-owned collection of footwear impressions in the world. The NFRC footwear pattern collection is updated on a regular basis [5]. The NFRC and the National Footwear Database (NFD) are developed and maintained by Bluestar Software Ltd (BSL) [1]. The NFD is a successor of the NFRC where footwear labels are maintained and added regularly. The NFRC records the custody and crime scene marks, while the NFD facilitates matching with the NFRC footwear patterns. Currently, around 30 out of 43 police forces in England and Wales continuously send or update data in the NFD [5].

The NFRC uses a total of 17 descriptors to identify a footwear impression. Each descriptor is assigned a unique name and code. A shoe print or footwear impression may contain any subset of these descriptors. The locations of the descriptors are divided into two parts: 1) the heel/instep, and 2) the main sole (i.e. top). In this study we do not exploit this location information in any way. A single descriptor can occur multiple times in a shoe print; however, the specific location (other than the heel/instep or main sole) and frequency of the descriptor are not identified or counted.

Each of the 17 descriptors (Table 1) has specific semantics (for the purpose of quick identification by a forensic practitioner rather than a computer), which relate to the name of the descriptor. For example, descriptor D05: 5 sided contains all shapes which are 5 sided; descriptor D09: Text indicates any text that can be found on a shoe print. The number of possible geometric variations is potentially infinite. For example, descriptor D09: Text can be any combination of characters and fonts, while descriptor D05: 5 sided can be a rough pentagon of any shape and form. Two descriptors can overlap, resulting in multiple descriptors arising from a single topological subpattern on a footwear impression. For example, descriptors D09: Text and D10: Logo usually overlap, as many logos contain text. Additionally, among the 17 descriptors, three are subcategories of two main/parent descriptors: D01-01: Wavy and D01-02: Curved-wavy are the subcategories of D01: Bar, and D02-01: Target is the single subcategory of D02: Circular. When labelling a footwear impression with the descriptors, the microscopic patterns of the impression are not usually considered. For example, D12: Texture can contain microscopic patterns which are also D06: 6 sided, but D06: 6 sided is usually not labelled in such cases, as these microscopic patterns are often not reliable and persistent [6]. The sided-shape descriptors (e.g. D03, D04, etc.) do not necessarily have precisely straight lines as sides; some curves and deformations are ubiquitous.
In UK policing, collection of footwear evidence is normally done in two scenarios: 1) collection of detainee footwear in custody, and 2) collection of crime scene marks. The vast majority of footwear impressions captured from detainees in custody follow one of the processes below [7]:

• Inked impressions: captured using a specialist pad and paper kit (sometimes called a 'Bigfoot Kit') [1, 6, 2]. The kit uses a pad with a reactive chemical and specialist paper. The impressions can then be digitised using an office document scanner, if required.

• Ink-less impressions: a specialised footwear impression digital scanner is used to capture the footwear impression without any use of ink. This process produces only a digital copy of the impression, whereas the inked impression also produces a physical copy on paper [1].

Additionally, some UK forces use coloured photographs of the shoe sole as opposed to using one of the impression capturing methods described above [1].
In practice to date, the descriptors are manually identified by experts and are only used as an intermediate step to identify a pattern. Processes vary between police forces; however, when adding an impression to the NFRC, two independent experts individually identify the descriptors. If both experts agree on the set of identified descriptors, the footwear impression image is labelled with them. When there is a disagreement between the two experts, the labelling process involves a panel of experts for further analysis. The accuracy of expert descriptor identification is thought to be 'very high'; however, to the best of our knowledge, no empirical study has quantified this accuracy [1].
The main limitation associated with manually identifying descriptors is the time and cost of human expertise. Although forensic practitioners are able to directly identify many common footwear impressions without the need for classification against the descriptors, classifying rare or new shoe models takes longer. As there are tens of thousands of footwear models, it is impractical for a human expert to accurately identify a specific model using only the descriptors. The NFRC/NFD provides a number of additional searching and ordering features to make identification possible in a practical time span.
D01: Bar – A bar of any type such as straight, angled, curved, including chevrons
D01-01: Wavy – A bar element with more than one directional change
D01-02: Curved-wavy – Any bar shape/element deviating from a straight line with a single rounded directional change, however small the angle of the curved section
D02: Circular – Includes circle, semi-circle, oval, semi-oval, concentric circles, target, tear-drop, stud, crescent
D02-01: Target – Any concentric circle arrangement, whether the centre-most circle is hollow or solid
D03: 3 sided – All types of triangle, including those with one rounded side such as a pie-segment
D04: 4 sided – Square, rectangle, oblong, parallelogram, rhombus, diamond, arrowhead
D05: 5 sided – Usually a regular shaped pentagon, but includes all five-sided shapes
D06: 6 sided – Usually a regular shaped hexagon, but includes all six-sided shapes
D07: Complex – Shapes such as star, arrow, waisted bar, heart and cross, and any other shape with more than six sides
D08: Zigzag – A broken or continuous line that changes direction repeatedly with abrupt right and left turns
D09: Text – Any alpha-numeric characters; may overlap with D10
D10: Logo – A brand or trademark incorporating a symbol, badge, emblem or picture; may overlap with D09
D11: Lattice – A regular, interlocking and/or repeated pattern (aka network, web or trellis); includes brickwork, herring-bone, honeycomb and chicken wire
D12: Textured – Includes predominant stippling, crepe or random patterns added by the manufacturer as part of their design
D13: Hollow – A pattern that has the appearance of a hollow shape, such as a doughnut or frame
D14: Plain – A plain surface with no patterns or texture

Table 1: Footwear descriptors for the UK's National Footwear Reference Collection (NFRC)
These features are generally used in the same way for all searches (looking at frequency/geography of distribution) and therefore take little time to use compared with the time taken to identify descriptors. However, the most frequently worn footwear models are very well known to forensic practitioners and are thus easily labelled by them, without the need for any computer system, or the descriptors. Since the accuracy of human footwear forensics experts has not been empirically evaluated, the automated process cannot be claimed to be as good as, or better than, human experts. Despite this, clear use cases for automatic descriptor identification exist. For example, when a new footwear model is captured, labelling would be completed by an expert, then blindly verified by another. Automatic descriptor identification would be faster and more available for the second check, as the number of available human experts is limited. It could therefore potentially replace the second opinion when adding patterns to the NFRC (see Section 1.2).
The automation of the descriptor analysis can provide rapid identification of the descriptors in a given impression, which in turn will result in faster identification of a shoe model from its print, especially for personnel untrained in footwear analysis. Additionally, the identified descriptors can be used to narrow down the search in a database containing thousands of footwear impressions. An automated approach which is not only capable of identifying the descriptors but can also infer their topological location (e.g. using Grad-CAM [8]) can be further beneficial for training police users. Rapid automatic descriptor identification can be achieved without involving a forensic expert, resulting in faster determination of intelligence. The latter is particularly important, as the suspect then has little time to destroy the evidence and can be questioned sooner (ideally before leaving custody), plausibly increasing the detection rate. In England and Wales there are currently around 25-30 human experts (an estimate without an official source) who can identify the descriptors, whereas there are 123,171 law enforcement personnel [9] who may handle a case where identification of the descriptors is necessary (according to a statistical bulletin published on 18th July 2019 by the Home Office for England and Wales; this number does not include the British Transport Police). Automated descriptor identification can potentially provide such expertise to all law enforcement personnel in the UK.

Due to the large variability in the complex geometric shapes and patterns of a descriptor, a simple template matching algorithm [10] would be suboptimal. Each of the descriptors has an apparent but variable high-level geometric semantics.

Figure 1: Different types of descriptors: D10, D11, D03 on two separate real inked impressions

Figure 1 shows two real-world inked impressions with four descriptors each: D03, D09, D10, D11. As can be seen, although the same descriptors appear on both impressions, their patterns are very distinct. While D10: Logo is an obvious example, the more 'stable' D03: 3 sided also looks quite different: the impression on the right has a D03 with smoother edges, which is also bigger than the D03 found in the left impression. Also note that although both impressions contain D11: Lattice, its appearance is very distinct.

As a result, designing filters by hand to identify the descriptors is not practical. Instead, we take a deep learning based approach, able to automatically learn the filters from the already existing manually labelled dataset.
Training a deep neural network requires estimating a large number of parameters, on the order of hundreds of millions. The matrix arithmetic performed to estimate these parameters is best suited to a Graphics Processing Unit (GPU), due to a GPU's ability to perform highly parallel floating-point operations when compared with a Central Processing Unit (CPU). GPU computational power is still limited, however, and there are other bottlenecks, such as moving data between the main memory and the GPU. As a result, a smaller model with fewer parameters is computationally more efficient than a bigger model with more parameters.

For a fixed base architecture (number of layers and units per layer), the number of computations grows approximately quadratically with the resolution of the input image. Higher resolution images also take up more space in the GPU memory, which tends to be smaller than system memory, limiting the batch size and further reducing the overall training speed. In order to reduce the computational cost and facilitate faster training, the input images are usually downscaled [11, 12]. However, the accuracy of a neural network tends to suffer when image resolution is reduced.

It should also be noted that the benefit of a higher resolution input does not grow indefinitely with resolution, e.g. once the upper bound of the achievable accuracy for a specific domain has been reached. In our case, very small and complex features define a class (see Section 1); thus, the resolution up to which increases remain beneficial is assumed to be higher than in classification tasks where the classes are more apparent.
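To make the approximately quadratic scaling concrete, the following back-of-the-envelope sketch counts the multiply-accumulate operations (MACs) of a single convolutional layer at increasing input resolutions; the layer shape used here is an illustrative assumption, not an actual ResNet-50 layer.

```python
# Back-of-the-envelope cost model: MACs of a single stride-1, 'same'-padded
# k x k convolution. The 64 -> 64 channel layer shape is illustrative only.

def conv2d_macs(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    # one (k * k * c_in)-element dot product per output pixel and channel
    return h * w * c_out * (k * k * c_in)

base = conv2d_macs(224, 224, 64, 64)
for side in (224, 448, 896):
    macs = conv2d_macs(side, side, 64, 64)
    print(f"{side}x{side}: {macs / 1e9:6.2f} GMACs ({macs / base:.0f}x the 224x224 cost)")
# Doubling each side quadruples the pixel count, and with it the cost:
# 224 -> 1x, 448 -> 4x, 896 -> 16x.
```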
Our dataset consists of high-resolution footwear impressions captured via the means discussed in Section 1.1. As discussed in Section 3, in practice the image resolution needs to be reduced, and there are a number of different image interpolation techniques that can be used here (we use the terms interpolation, resampling, downscaling, and resizing interchangeably). In our experiments, we investigate and benchmark various combinations of the following image interpolation techniques; a code sketch after the figures below illustrates how they can be applied:

• Nearest Neighbour interpolation (N) is the least computationally expensive and does not insert new colours into the result. Only the nearest neighbour's pixel intensity is considered, so the estimation function $f$ at a point $(x, y)$ becomes a piecewise function with constant value [13, 14].

• Bilinear interpolation (B) is a linear interpolation over all non-channel dimensions of an image, i.e. for a two-dimensional image it is the interpolation over both the X and Y dimensions [15]. A straight line passing through two points $(x_1, y_1)$ and $(x_2, y_2)$ is the linear interpolant of these two points between $x_1$ and $x_2$. For any point in the range $(x_1, x_2)$, the slopes of the interpolant towards both of these points must be exactly the same, hence the following equation of slopes can be formulated:

$$\frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1} \tag{1}$$

Solving Equation 1 for $y$ gives:

$$y = y_1 \left(\frac{x_2 - x}{x_2 - x_1}\right) + y_2 \left(\frac{x - x_1}{x_2 - x_1}\right) \tag{2}$$

Equation 2 produces interpolation in the X direction. In the case of a two-dimensional image with four points $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$, $Q_{22} = (x_2, y_2)$, the task is to estimate the function $f$ at a point $(x, y)$. In this four-point scenario, linear interpolation in the X direction using Equation 2 gives:

$$f(x, y_1) = \left(\frac{x_2 - x}{x_2 - x_1}\right) f(Q_{11}) + \left(\frac{x - x_1}{x_2 - x_1}\right) f(Q_{21}) \tag{3}$$

$$f(x, y_2) = \left(\frac{x_2 - x}{x_2 - x_1}\right) f(Q_{12}) + \left(\frac{x - x_1}{x_2 - x_1}\right) f(Q_{22}) \tag{4}$$
We can then use Equations 3 and 4 to interpolate in the Y direction in order to estimate $f(x, y)$:

$$f(x, y) = \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2) \tag{5}$$

• Hamming interpolation (H) uses a sinc-approximating kernel obtained by multiplying the well-known sinc function [16] with the Hamming window function [17] (a multiplication which corresponds to a convolution in the frequency domain) [18]. Equation 6 is the Hamming window function over the window interval $(-m, m)$ and Equation 7 is the sinc function:

$$W_{hamming}(x) = 0.54 + 0.46 \cos\left(\frac{\pi x}{m}\right) \tag{6}$$

$$\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x} \tag{7}$$

Although an ideal interpolation technique would not alter any pattern within the image or introduce any artefacts, most interpolation techniques alter some image features and introduce artefacts when used to reduce image resolution [14, 19]. Figures 3, 4, and 5 show how the interpolation techniques discussed above affect the features of a footwear impression image at different resolutions (zooming in is required, as the images embedded in this paper go through arbitrary interpolation applied by your browser or PDF reader). It is apparent from the undersampled images (Figure 3) that different interpolation techniques produce slightly different results. Although these discrepancies are aesthetically undesirable and can hamper the performance of the model, we leverage such differences as an effective image augmentation technique, as described in Section 5.

Figure 2: Original image without any interpolation (zoom in to circumvent distortion introduced by the medium in which this paper is being viewed)

As we can see, all three interpolated images closely resemble the original (Figure 2) at a higher resolution (Figure 5), and at the same time their differences reduce. Comparing the lowest resolution images (Figure 3), it is apparent that they contain D01: Bar elements which appear angled, whereas the original (Figure 2) and higher resolution (Figure 5) impressions have D01: Bar elements which are straight. A study by [11] found that even the mildest quality loss in input images can greatly hamper the performance of a deep learning model. Glorot and Bengio [20] also found neural networks to be susceptible to image noise.

Figure 3: Interpolation samples (N, B, H) with fixed aspect ratio and varying sizes

Figure 4: Interpolation samples (N, B, H) with fixed aspect ratio and varying sizes (zoom in to see the original pattern)

Figure 5: Interpolation samples (N, B, H) with fixed aspect ratio and varying sizes (zoom in to see the original pattern)
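The following minimal sketch shows how such multi-interpolation inputs can be produced, assuming Pillow (version 9.1 or later, for the Image.Resampling enum) and NumPy; the file name and helper function are hypothetical, and the channel ordering anticipates the configurations explored later (Table 2).

```python
# Sketch: build a three-channel input from one greyscale impression by
# downscaling with three different filters. The file name is hypothetical.
import numpy as np
from PIL import Image

FILTERS = {
    "B": Image.Resampling.BILINEAR,
    "H": Image.Resampling.HAMMING,
    "N": Image.Resampling.NEAREST,
}

def make_input(path: str, size=(144, 352), config: str = "BHN") -> np.ndarray:
    """Return an (H, W, 3) array with one interpolated copy per channel.

    `size` is (width, height), as Pillow expects; 144x352 preserves the
    tall-and-thin aspect ratio of a shoeprint rather than squishing it.
    """
    grey = Image.open(path).convert("L")  # ensure a single-channel image
    channels = [np.asarray(grey.resize(size, FILTERS[f])) for f in config]
    return np.stack(channels, axis=-1)    # collate into a 3-channel input

x = make_input("impression.png")          # hypothetical file name
print(x.shape)                            # (352, 144, 3); transpose to
                                          # channel-first before feeding PyTorch
```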
In our experiments, we use ResNet-50, a popular 50-layer CNN architecture with residual connections [21], pre-trained on the ImageNet dataset [22], with a custom head initialised using the Glorot/Xavier initialisation [20] and an optional, learnable preprocessing layer (see below). The head consists of an adaptive pooling layer, followed by two BatchNorm → Dropout → Linear/Dense blocks with a ReLU non-linearity in between. The output layer has a total of 17 neurons with sigmoid activation functions, one per descriptor type.
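A minimal PyTorch sketch of this architecture is given below. The class name is hypothetical, the hidden width (512) and dropout rate (0.5) are assumptions rather than the values used in the experiments, and the Glorot initialisation of the head is omitted for brevity.

```python
# Sketch of the architecture described above (PyTorch / torchvision >= 0.13).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DescriptorNet(nn.Module):
    def __init__(self, n_descriptors: int = 17, hidden: int = 512, p: float = 0.5):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")                 # ImageNet pre-trained
        self.body = nn.Sequential(*list(backbone.children())[:-2])   # drop pool + fc
        self.pool = nn.AdaptiveAvgPool2d(1)   # adaptive pooling admits rectangular inputs
        self.head = nn.Sequential(
            nn.BatchNorm1d(2048), nn.Dropout(p), nn.Linear(2048, hidden),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(hidden), nn.Dropout(p), nn.Linear(hidden, n_descriptors),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.pool(self.body(x)).flatten(1)
        return torch.sigmoid(self.head(z))    # one probability per descriptor

model = DescriptorNet()
probs = model(torch.randn(2, 3, 352, 144))   # rectangular 352x144 inputs work as-is
print(probs.shape)                           # torch.Size([2, 17])
```

During the initial training phase, the pre-trained body can be frozen with `model.body.requires_grad_(False)` and unfrozen for fine-tuning (see the two training phases described next).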
The models are trained using AdamW [23], a stochastic gradient descent based backpropagation algorithm, in two phases:

1. Initial training, where all the ResNet-50 body layers are frozen and only the 2-layer head, as well as the optional preprocessing layer, are trained with a fixed learning rate and weight decay.

2. Fine-tuning, where the whole network is trained using discriminative learning rates [24] and weight decay.

We have experimented with various combinations of the following:

1. Loss function: In addition to the default Binary Cross Entropy (BCE) loss, which in our experiments was always used in a cost-sensitive setting via class weighting (i.e. with the cost of misclassification being inversely proportional to class frequency in the training dataset), we have also used the Soft-F1 loss in an attempt to maximise both precision and recall directly within the model training process (a code sketch appears after this list). The Soft-F1 loss is a simple generalisation of the F1 score, obtained by replacing the numbers of True Positives (TP), False Positives (FP) and False Negatives (FN) with their probabilistic counterparts [25]:
$$TP = \sum_i y_i \hat{y}_i \qquad FP = \sum_i (1 - y_i)\,\hat{y}_i \qquad FN = \sum_i y_i (1 - \hat{y}_i)$$

where $y_i \in \{0, 1\}$ is the label for the $i$-th data instance and $\hat{y}_i \in [0, 1]$ is the model prediction.

2. Channel configuration: All the original input images are greyscale (single-channel), yet the pre-trained model expects RGB/colour inputs (three channels). The simplest and most popular approach to address this discrepancy is to collate three identical copies of the greyscale input. Since this approach seems wasteful, we have instead opted for various compositions of the three-channel input obtained by applying different interpolation techniques (see Section 4) to the high-resolution input image; these are specified in Table 2.
3. Preprocessing layer: For the same reasons as described above, we have included a number of learnable preprocessing layers in our network. The rationale here is that the distribution of greyscale images, particularly when collating three different interpolated versions of each image into a single three-channel input, is different from the distribution of natural RGB images from the ImageNet dataset. The preprocessing layers we have used are shown in Figure 6.
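The following sketch implements the Soft-F1 loss from item 1 above in PyTorch; the epsilon term and the macro-averaging over classes are our assumptions.

```python
# A sketch of the Soft-F1 loss: TP, FP and FN are replaced by their
# probabilistic counterparts and 1 - F1 is minimised.
import torch

def soft_f1_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
                 eps: float = 1e-8) -> torch.Tensor:
    """y_pred: (batch, n_classes) probabilities; y_true: same shape, 0/1."""
    tp = (y_true * y_pred).sum(dim=0)          # probabilistic true positives
    fp = ((1 - y_true) * y_pred).sum(dim=0)    # probabilistic false positives
    fn = (y_true * (1 - y_pred)).sum(dim=0)    # probabilistic false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - soft_f1.mean()                  # minimising this maximises soft F1

loss = soft_f1_loss(torch.rand(8, 17), torch.randint(0, 2, (8, 17)).float())
```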
channels   R                   G                   B
B-B-B      Bilinear            Bilinear            Bilinear
B-H-N      Bilinear            Hamming             Nearest Neighbour
B-N-H      Bilinear            Nearest Neighbour   Hamming
H-B-N      Hamming             Bilinear            Nearest Neighbour
H-H-H      Hamming             Hamming             Hamming
H-N-B      Hamming             Nearest Neighbour   Bilinear
N-B-H      Nearest Neighbour   Bilinear            Hamming
N-H-B      Nearest Neighbour   Hamming             Bilinear
N-N-N      Nearest Neighbour   Nearest Neighbour   Nearest Neighbour
Table 2: Compositions of input channels via different combinations of interpolation techniques

Figure 6: Learnable preprocessing layers: (a) cbn_1: 1x1 Conv and BatchNorm, (b) cbn_3: 3x3 Conv and BatchNorm, (c) inc: inception-like transformation, (d) inc_d: dense inception-like transformation
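As an illustration of Figure 6, a sketch of the two simplest preprocessing layers (cbn_1 and cbn_3) follows; keeping the output at three channels, so that the pre-trained ResNet stem remains unchanged, is our assumption about the design.

```python
# A sketch of the two simplest learnable preprocessing layers of Figure 6.
import torch.nn as nn

def cbn(kernel_size: int) -> nn.Sequential:
    """cbn_1 (1x1) or cbn_3 (3x3): a learnable remixing of the three
    interpolated input channels, followed by BatchNorm."""
    return nn.Sequential(
        nn.Conv2d(3, 3, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(3),
    )

cbn_1, cbn_3 = cbn(1), cbn(3)
# Usage sketch: prepend to the pre-trained body, e.g.
#   net = nn.Sequential(cbn_1, pretrained_body, head)
```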
The training dataset consisted of 33,757 greyscale images retrieved from the NFRC, with the class distribution shown in Figure 7a. As can be seen, the classes are dominated by D01, D02, D04 and D07, with D01-01, D02-01, D05 and D14 being the least frequent. The validation set consisted of 1,000 images retrieved from the same database, with the class distribution shown in Figure 7b. The original image resolutions vary over a wide range and are depicted in Figure 8, where the whiskers represent the lower and upper quantiles of the distribution, and outliers have been omitted for presentation clarity.

Figure 7: Class distribution: (a) training set, (b) validation set

Figure 8: Original input image resolution
Table 3 contains the aggregated results of a total of 180 experiments run across 90 different combinations of hyperparameters, as specified in Table 2 and Figure 6. Each experiment was repeated twice with random initialisation, and the average of these two runs was used to construct Table 3. In each of the experiments, we trained a custom head on top of a fixed/frozen ImageNet pre-trained ResNet-50 for 10 epochs, followed by 40 epochs of fine-tuning the whole network. We have opted for the Area Under the Precision-Recall Curve (PRAUC) as the performance metric, in order to decouple the results from class-specific thresholds. PRAUC is a better measure of the performance of a binary classifier than the AUC of the ROC (receiver operating characteristic) curve [26, 27], since ROC is very sensitive to class imbalance, and in our case the class labels are heavily imbalanced.
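For reference, the sketch below shows how a per-descriptor PRAUC of this kind can be computed with scikit-learn; macro-averaging over the 17 descriptors is our assumption about how the aggregate figures in the tables were obtained.

```python
# A sketch of per-descriptor PRAUC: the literal area under the
# precision-recall curve, averaged over classes.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def prauc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true: (n, 17) binary labels; y_score: (n, 17) predicted scores."""
    areas = []
    for k in range(y_true.shape[1]):
        precision, recall, _ = precision_recall_curve(y_true[:, k], y_score[:, k])
        areas.append(auc(recall, precision))  # threshold-free summary per class
    return float(np.mean(areas))

rng = np.random.default_rng(0)
print(prauc(rng.integers(0, 2, (100, 17)), rng.random((100, 17))))
```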
The first thing to notice is that, according to the results, preserving the aspect ratio of the input images (i.e. the 352×144 resolution) always leads to better performance than using squished images (i.e. the 224×224 resolution; see the '∆' column in Table 3). This holds regardless of the type of resampling, preprocessing and loss function used. It appears that preserving the aspect ratio of the original images matters much more than a higher horizontal resolution (i.e. 144 vs 224 pixels) and is the best way of spending a fixed computational budget; the number of input pixels in both cases is approximately equal (224 × 224 = 50,176, 352 × 144 = 50,688). Although this performance difference might be partially attributed to the nature of our inputs (shoe impressions are long and thin, so squishing can introduce significant distortions), the same can be said about many other objects, like people or vehicles. It is hence somewhat surprising that 224×224 is the default transfer learning setting in popular deep learning frameworks [28, 29, 30].

The average difference in PRAUC between the two resolutions (everything else being equal) is 0.0189 (2.8%), while the maximum difference over all 180 runs reaches 0.0415 (6.5%). To put these numbers in context, the average difference between any two randomly initialised runs of each experiment is 0.0051, while the maximum difference is 0.0206. Thus, the observed effect is unlikely to be a random fluctuation. For this reason, we limit further analysis to the results for the 352×144 (and higher) resolutions, preserving the input aspect ratio.

                                BCE loss                       F1 loss
channels   preprocessing   352×144   224×224   ∆         352×144   224×224   ∆
N-N-N      cbn_1           0.7200    0.6969    0.0231    0.6927    0.6712    0.0215
N-N-N      cbn_3           0.7171    0.6968    0.0203    0.6839    0.6741    0.0098
N-N-N      inc             0.7205    0.6990    0.0215    0.6861    0.6706    0.0155
N-N-N      inc_d           0.7215    0.6945    0.0270    0.6938    0.6712    0.0226
N-N-N      no_tfm          0.7135    0.6940    0.0195    0.6902    0.6728    0.0174
H-H-H      cbn_1           0.7113    0.6958    0.0155    0.6876    0.6694    0.0182
H-H-H      cbn_3           0.7172    0.6939    0.0233    0.6852
H-H-H      inc             0.7133    0.6931    0.0202    0.6878    0.6721    0.0157
H-H-H      inc_d
B-B-B      no_tfm
µ

Table 3: Performance based on the Area Under the Precision-Recall Curve (PRAUC). '∆' denotes the difference between the two resolutions. The maximum in each column is shown in bold; the minimum is underlined.
A similar observation can be made when considering the loss function. BCE consistently outperforms the F1 loss, with an average PRAUC difference of 0.0269 and a maximum difference of 0.0374. Despite the attractive theoretical properties of the F1 loss discussed in Section 5, the BCE loss proved to be a much better choice in practice. For this reason we do not consider the F1 loss in the subsequent analysis.

Table 4: Performance for 352×144 input resolution, BCE loss, and various input channel configurations, averaged over preprocessing methods and sorted by PRAUC (columns: rank, channels, PRAUC (↓), max(∆))

Table 5: Performance for 352×144 input resolution, BCE loss, and various preprocessing layers, averaged over input channel configurations and sorted by PRAUC (columns: rank, preprocessing, PRAUC (↓), max(∆))
The average difference in PRAUC among the various combinations of input channels (Table 4) is much less pronounced. The approach of creating an input to the network by 'sandwiching' the outputs of three different interpolation methods, rather than simply creating three identical channels (i.e. B-N-H vs N-N-N), does not seem to affect the PRAUC much, with the difference between the best and worst performing approaches being 0.0041. Nevertheless, it is worth noting that the B-B-B approach, which is the default in existing deep learning frameworks [28, 29, 30], is one of the worst performing. This seems to confirm the intuition that the blurring effect characteristic of bilinear interpolation tends to make discrimination between different types of descriptors more challenging. Another observation is that the influence of input channel ordering on PRAUC can be almost as big as the difference between the best and worst performing approaches (the difference between B-N-H and H-N-B is 0.0039). This is somewhat surprising, as in theory the learnable preprocessing layer should be able to 'swap' the input channel order if needed. However, in the context of the average difference between any two randomly initialised runs of each experiment, which as mentioned earlier was 0.0051, the results given in Table 4 need to be declared inconclusive.
The influence of the learnable preprocessing layer on PRAUC is more substantial. As can be seen in Table 5, the difference between the dense inception-like transformation (inc_d) and no transformation at all (no_tfm) reaches 0.0082. Since no_tfm is the worst performing approach in our experiments, we conclude that using some kind of learnable preprocessing is beneficial.

channels   cbn_1    cbn_3    inc      inc_d    no_tfm   µ        max(∆)
B-B-B
B-H-N
B-N-H
H-B-N      0.7224
H-H-H
H-N-B
N-B-H
N-H-B
N-N-N
µ
max(∆)

Table 6: PRAUC for 352×144 input resolution, BCE loss, and various combinations of pre-first-layer transformations and input channel configurations. The highest score in each column is shown in bold; the highest PRAUC in each row is underlined.

Table 6 presents the breakdown of the results by preprocessing layers and input channel configurations. As can be seen, inc_d gives the highest PRAUC on average (0.7208), is the best preprocessing method for 6 out of 9 input channel configurations, and second best for a further 2. It is harder to identify the best performing channel configuration, as none of them dominates across the different preprocessing layers. However, looking at preprocessing and channel configuration jointly, inc_d with H-H-H gives the highest PRAUC of 0.7259, which is 0.0198 more than the worst performing combination (no_tfm with B-B-B) and 0.0092 more than the average across all the entries in Table 6.
channels   cbn_1    cbn_3    inc      inc_d    µ        max(∆)
B-B-B      0.7450
B-H-N
B-N-H
H-B-N
H-H-H      0.7450
H-N-B
N-B-H
N-H-B
N-N-N
µ
max(∆)

Table 7: PRAUC at the increased input resolution, BCE loss, and various combinations of preprocessing layers and input channel configurations. The highest score in each column is shown in bold; the highest PRAUC in each row is underlined.

In Table 7 we report the results of a similar experiment, this time with the resolution of the input images increased. This led to a significant increase in PRAUC across all tested combinations of preprocessing layers and input channel configurations, with minimum, average and maximum differences of 0.0132, 0.0221 and 0.0337 respectively. As before, inc_d is the dominating preprocessing method, while none of the input channel configurations emerges as a clear winner.
Note that in Table 7, B-B-B is no longer as strongly dominated by the other input channel configurations as was the case at lower input resolutions, due to the blurring effect now being less severe.

channels   inc_d
B-B-B
B-H-N      0.7736
B-N-H
H-B-N
H-H-H
H-N-B
N-B-H
N-H-B
N-N-N
µ
max(∆)

Table 8: PRAUC at the highest tested input resolution, BCE loss, and various input channel configurations. The highest PRAUC is shown in bold.

In our final experiment, we investigated increasing the input resolution further. As shown in Table 8, this resulted in further improvement, with PRAUC gains reaching 0.0280 on average. It is also apparent that with the increase in input resolution the importance of the interpolation method diminishes; the maximum difference among all the interpolation methods is 0.0060, albeit achieved at a significantly increased computational cost.
In Figure 9 we depict the PRAUC of our best model from Table 8, broken down by class/descriptor. As can be seen, some descriptors are relatively easy to classify: D01: Bar, D02: Circular, D04: 4 sided, D07: Complex and D08: Zigzag all achieve the highest PRAUC values, which is to be expected as these descriptors are relatively clear cut (see Table 1). At the other end of the spectrum, D05: 5 sided, followed by D13: Hollow and D01-02: Curved-wavy, are the most challenging. Note that D05 is not only the least frequent in the dataset (see Figure 7; we have counteracted this by using class weighting in the BCE loss), but it is also one of the subtler descriptors in general. As can be seen in Table 1, D05 can for example be a rectangle with one of the corners 'cut off', hence easy to confuse with D04: 4 sided. In a similar vein, D13: Hollow can easily be confused with a circle (D02: Circular), triangle (D03: 3 sided), square/rectangle (D04: 4 sided), etc. Some shapes on an impression can also represent multiple descriptors; for example, D09: Text and D10: Logo will often overlap. Some overlays may be more complex, such as D03: 3 sided and D13: Hollow. There may even be examples of nested overlap, such as D02-01: Target (which implies D02: Circular) and D13: Hollow, as a circle with the centre missing is both a target and hollow. D14: Plain is unusual in that when it applies to part of the shoe, it excludes the other descriptors from that area.
Figure 9: Per-class PRAUC

In order to investigate this issue further, in Figure 10 we show the confusion matrix generated for the validation dataset. For each validation image, if the predicted score for a descriptor which is not present in the image (a false positive) exceeds the score for a descriptor which is in the image (a true positive), then the two are considered confused. An example is given in Table 9. The actual labels are D02, D05 and D06. Since the score for D02 is the highest, 1 would be added to the diagonal entry for this descriptor. However, as the score for D04 (a false positive) is higher than the scores for D05 and D06, these are considered confused (i.e. either or both of D05 and D06 are classified as D04), and hence 1/2 (i.e. one over the number of potentially misclassified descriptors) is added to the entries D05-D04 and D06-D04 of the confusion matrix.
Descriptor   D01   D02   D03   D04   D05   D06
Label         0     1     0     0     1     1
Score

Table 9: Example prediction to illustrate the confusion matrix calculation
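The sketch below (NumPy; the function name is hypothetical) implements this rule as we read it; the case of a true positive being outranked by several false positives is not fully pinned down by the example above, so the pairwise attribution used here is an assumption.

```python
# A sketch of the soft confusion-matrix rule described above. Our reading of
# the weighting: each outranked true-positive descriptor contributes 1/m,
# where m is the number of outranked true positives in that image, so the
# D05/D06 vs D04 example adds 1/2 to both D05->D04 and D06->D04.
import numpy as np

def soft_confusion(y_true: np.ndarray, y_score: np.ndarray) -> np.ndarray:
    """y_true: (n, k) binary labels; y_score: (n, k) scores. Returns (k, k)."""
    k = y_true.shape[1]
    cm = np.zeros((k, k))
    for labels, scores in zip(y_true.astype(bool), y_score):
        present, absent = np.flatnonzero(labels), np.flatnonzero(~labels)
        if present.size == 0:
            continue
        top = present[scores[present].argmax()]
        cm[top, top] += 1                      # highest-scoring true positive
        outranked = [p for p in present
                     if absent.size and (scores[absent] > scores[p]).any()]
        for p in outranked:                    # share the weight equally
            for a in absent[scores[absent] > scores[p]]:
                cm[p, a] += 1.0 / len(outranked)
    return cm
```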
As can be seen in Figure 10, D01 is the most frequently misclassified descriptor, which can be partially explained by its prevalence in the dataset. In Figure 11(a) we show an example of a D01: Bar in the top right corner of the print which has not been detected by our model. At the same time, the model detected D13: Hollow in the locations shaded in orange, although this particular shoeprint impression has not been labelled with D13 by the human expert. However, due to wear, some of the 6 sided shapes (D06) have closed, and now indeed fit the description of D13. Another example of an undetected D01 is given in Figure 11(b), where in addition D11: Lattice has been misclassified as D04: 4 sided in the areas highlighted by the heatmap; the lattice indeed consists of 4 sided 'cells'. In both of these examples it is actually difficult to state which descriptor D01 was confused with; it appears that D01 was simply not detected, yet this would still be recorded in the confusion matrix due to the way in which the matrix was derived.

Figure 10: Confusion matrix

Another descriptor worth looking at is D05, which as mentioned before is the rarest in the dataset and has the lowest PRAUC. D05 is most often misclassified as D06 or D12, while at the same time D03, D04 and D07 are most often misclassified as D05. An example can be seen in Figure 11(c).

It is worth noting that all three examples of errors in Figure 11 have been selected on the basis of the highest loss (i.e. they are as bad as it gets). It is reassuring that these do not result from the model behaving in an unexpected fashion, but are rather due to ambiguity in the inputs that may even cause disagreement between expert users, requiring resolution by an expert panel.

Figure 11: High loss misclassification examples
The descriptor identification task we have approached in this study is of great significance for forensic practitioners in the UK and beyond. The descriptors are an agreed standard for coding footwear patterns, in active use across the different UK police forces. Although a human performance benchmark is not available at this time, our model performs well, with a PRAUC of over 0.77 for the best configuration. The mistakes that the model tends to make are mostly justifiable, either by ambiguity or by overlaps in the input patterns, and are not unlike those an inexperienced human would make. The system that we have built has been deployed for testing by selected police forces.

In the process of building the model, we have experimented with a number of ways of feeding greyscale impressions to the ImageNet (RGB) pre-trained network. Our findings can be summarised as the following 'best practices':

• Preserve the aspect ratio of the input images. This seems particularly important if the object of interest (a shoeprint in our case) has an aspect ratio significantly different from 1 (i.e. 'long and thin' or 'short and fat'). This advice goes against the common practice in the computer vision community of 'squishing' the input images to make them square in order to use ImageNet pre-trained models. This is unnecessary, as current deep learning frameworks allow one to feed rectangular images to ImageNet pre-trained models out of the box.

• Use as high an input resolution as practical. In our experiments, increasing the input resolution always led to a higher PRAUC, albeit at the cost of significantly increased computation, which is an obvious constraint. The original resolution of the input images can also be a limitation, as there is little point in upscaling such images.

• Use different interpolation methods to construct the three input channels from greyscale images. Although the effect of this approach that we have observed was modest, it was positive nevertheless. It also seems that using Nearest Neighbour interpolation for one of the input channels is beneficial, while using three identical channels obtained via bilinear interpolation is detrimental, particularly at lower input resolutions.

• Use a learnable preprocessing layer. In our experiments, these additional computations played a crucial role in adapting greyscale inputs for use with a colour-image pre-trained network, regardless of the interpolation method used. Learnable preprocessing combined with different interpolation methods to construct the three input channels gave the best results.

Acknowledgements
The financial support of Innovate UK via the Knowledge Transfer Partnerships programme is gratefully acknowledged.
References

[1] Bluestar Software Limited. National Footwear Solutions (the NFRC and NFD), 2020. URL https://bluestar-software.co.uk/products/forensic-intelligence/.
[2] Robert Milne. Forensic Intelligence. CRC Press, 2012.
[3] Hannah Larsen and Matthew R Bennett. Recovery of 3D footwear impressions using a range of different techniques. Journal of Forensic Sciences, 2020.
[4] Matthew R Bennett and Marcin Budka. Digital Technology for Forensic Footwear Analysis and Vertebrate Ichnology. Springer, 2018.
[5] Bluestar Software Limited. National Footwear Solutions (the NFRC and NFD), 2020. URL https://bluestar-software.co.uk/products/uk-national-solutions/.
[6] William J Bodziak. Footwear Impression Evidence: Detection, Recovery and Examination. CRC Press, 1999.
[7] National Policing Improvement Agency. Footwear marks recovery manual, 2007. URL http://library.college.police.uk/docs/appref/NPIA-(2007)-Footwear-Marks-Recovery-Manual.pdf.
[8] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618-626, 2017.
[9] Home Office. Police Workforce, England and Wales, 2019. URL https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/831726/police-workforce-mar19-hosb1119.pdf.
[10] Roberto Brunelli. Template Matching Techniques in Computer Vision: Theory and Practice. John Wiley & Sons, 2009.
[11] Michal Koziarski and Boguslaw Cyganek. Impact of low resolution on image recognition with deep neural networks: An experimental study. International Journal of Applied Mathematics and Computer Science, 28(4), 2018.
[12] Suresh Prasad Kannojia and Gaurav Jaiswal. Effects of varying resolution on performance of CNN based image classification: An experimental study. International Journal of Computer Sciences and Engineering, 6(9):451-456, 2018.
[13] Philippe Thévenaz, Thierry Blu, and Michael Unser. Image interpolation and resampling. Handbook of Medical Imaging, Processing and Analysis, 1(1):393-420, 2000.
[14] Philippe Thévenaz, Thierry Blu, and Michael Unser. Interpolation revisited [medical images application]. IEEE Transactions on Medical Imaging, 19(7):739-758, 2000.
[15] PR Smith. Bilinear interpolation of digital images. Ultramicroscopy, 6(2):201-204, 1981.
[16] Philip M Woodward and Ian L Davies. Information theory and inverse probability in telecommunication. Proceedings of the IEE - Part III: Radio and Communication Engineering, 99(58):37-44, 1952.
[17] Ralph Beebe Blackman and John Wilder Tukey. The measurement of power spectra from the point of view of communications engineering - Part I. Bell System Technical Journal, 37(1):185-282, 1958.
[18] Erik HW Meijering, Wiro J Niessen, and Max A Viergever. Quantitative evaluation of convolution-based methods for medical image interpolation. Medical Image Analysis, 5(2):111-126, 2001.
[19] Jeffrey Tsao. Interpolation artifacts in multimodality image registration based on maximization of mutual information. IEEE Transactions on Medical Imaging, 22(7):854-864, 2003.
[20] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[22] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.
[23] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
[24] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339, 2018.
[25] Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832-840. PMLR, 2017.
[26] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):e0118432, 2015.
[27] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial. How to predict social relationships: physics-inspired approach to link prediction. Physica A: Statistical Mechanics and its Applications, 523:1110-1129, 2019.
[28] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[30] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.