Prototypicality effects in global semantic description of objects
Omar Vidal Pino, Erickson Rangel Nascimento, Mario Fernando Montenegro Campos
Universidade Federal de Minas Gerais (UFMG), Brazil
{ovidalp, erickson, mario}@dcc.ufmg.br

Abstract
In this paper, we introduce a novel approach for semantic description of object features based on the prototypicality effects of the Prototype Theory. Our prototype-based description model encodes and stores the semantic meaning of an object, while describing its features using the semantic prototype computed by CNN-classification models. Our method uses semantic prototypes to create discriminative descriptor signatures that describe an object by highlighting its most distinctive features within the category. Our experiments show that: i) our descriptor preserves the semantic information used by the CNN-models in classification tasks; ii) our distance metric can be used as the object's typicality score; iii) our descriptor signatures are semantically interpretable and enable the simulation of the prototypical organization of objects within a category.
1. Introduction
The extraction of relevant image features has been the subject of Computer Vision research for decades. The advent of Convolutional Neural Networks (CNN) made it possible to achieve a visual recognition model with behavior similar to semantic memory [55] for classification tasks [18, 47, 51], and sparked the tendency of semantic processing of images using deep-learning techniques. For several years, hand-crafted features [2, 28, 53] and machine learning [36, 46, 50] were the methods of choice for image feature description tasks. The impressive success of CNN-models spawned numerous CNN-descriptors produced by different approaches that learn effective representations for describing image features [16, 21, 45, 60, 62]. Consequently, representations of image features extracted using deep classification models [18, 47, 51], or using CNN-descriptors, are commonly referred to as semantic features or semantic signatures. The term semantic feature has been extensively studied in the field of linguistic semantics, where it is defined as the representation of the basic conceptual components of the meaning of any lexical item [9].

Figure 1. Motivation and Concepts. Schematic of the prototype-based description model. a) feature extraction; b) object feature recognition; c) categorization; d) object features; e) central semantic meaning of a category. The human visual system is able to observe an object and to build a global semantic description highlighting the object features that make it distinctive within the category. We propose how to simulate this behavior through the processing flow from a) to e).

In the seminal work of Rosch [41], the author analyzed the semantic structure of the meaning of words and introduced the concept of prototype semantics (or Prototype Theory). According to Rosch [41, 43], the representation of a category's semantic meaning is related to the category prototype, particularly for those categories naming natural objects. Some CNN-description models [16, 25, 45, 60, 62] (and semantic description models [4, 15, 21, 40]) stand for the semantic information of image features using a range of different approaches. Nevertheless, none of these models construct their representations by encoding the visual semantic information with the extensive theoretical foundation of Cognitive Science to represent semantic meaning. We rely on cognitive semantics studies related to the Prototype Theory for modeling the central semantic meaning of a category. Our approach uses the representation of the central semantic meaning of a category to simulate human behavior in the object feature description task. In this paper, we propose a novel approach to take on the semantic feature description of objects. We bring to light the Prototype Theory as a theoretical basis to represent the semantic meaning of the image visual information. We develop a prototype-based description model that uses the category's prototype to find a global semantic representation of the basic conceptual components (objects) of the image semantic meaning.
Human beings can learn the most distinctive features of a specific category [30, 52]. These learned features (or properties) are used by the human brain to identify, classify, and describe objects [55]. The Prototype Theory proposes that human beings think of a category in terms of abstract prototypes, defined by the central case of a category [13, 41, 42]. Successful execution of the object recognition and description tasks in the human brain is inherently related to the learned prototype of the object category [32, 41, 42, 61]. This raises the following two questions: i) Can a model of the perception system be developed in which objects are described using the same semantic features that are learned to identify and classify them? ii) How can the category prototype be included in the object's global semantic description? We address these two questions motivated by the human approach of describing objects by highlighting their most distinctive features within the category. Consider a typical human description: a Dalmatian is a dog (semantic meaning) that is distinguished by its unique black or liver colored spotted coat (semantic difference with respect to the central semantic meaning of the dog category). Figure 1 depicts our prototype-based description model. We evaluate our approach using CNN-models on both the MNIST and ImageNet datasets. The experiments show that our prototype-based description model can simulate the prototypical organization of object categories. Furthermore, our descriptor can construct semantic signatures that are discriminative, interpretable, of low dimensionality, and able to encapsulate and retain the meaning of object features.
2. Related works
CNN descriptors.
The CNN descriptor family showed that it is possible for a learning approach to outperform the best techniques based on carefully hand-crafted features [2, 28, 53]. These models differ among themselves in how they compute the descriptors in their deep architectures, similarity functions, and feature extraction methods. Some approaches extract immediate activations of the model as a descriptor signature [6, 8, 14, 27]. Other methods directly learn a measure of similarity to compare image patches using a similarity convolutional network [16, 45, 59, 60]. Siamese networks were used to learn discriminative representations and to learn a similarity metric [16, 60, 62]. The deep model LIFT [59] learns each of the tasks involved in feature management: detection, orientation estimation, and feature description. Lin et al. [25] constructed a compact binary descriptor for efficient object matching based on the features extracted with the VGG16 model [47].
Semantic descriptors and correspondence.
Finding correspondences between different scenes that share similar or semantically related features is a challenging problem. Liu et al. [26] proposed SIFT Flow, which gave rise to the family of semantic flow methods as a solution to the high degree of variation involved in the semantic correspondence challenge [3, 20, 26, 37, 54, 57]. Several of these methods combine their approaches with the extraction of hand-crafted features [28, 53]. Some works [4, 15, 63] use the robustness of CNN-models to train deep learning architectures that address the problem of semantic correspondence. Kim et al. [21] tackled the problem of semantic correspondence by constructing a semantic descriptor. The FCSS descriptor [21] has the property of being robust to intra-class appearance variation due to its local self-similarity (LSS) and its ability to keep the precise localization of deep neural networks. The performance of CNN-models used in description tasks is still not on par with the performance achieved by CNNs used in classification models. In general, CNN descriptors and semantic descriptors are trained to learn their own semantic representations and use different deep learning architectures. Most of these feature description models do not use the discriminative power of the features extracted by the well-known CNN-classification models [18, 47, 51]. Moreover, none of these feature description approaches incorporates the cognitive sciences foundation to introduce meaning into the representations of image features.
Prototype Theory.
The Prototype Theory analyzes the internal structure of categories and introduces the prototype-based concept of categorization. It proposes a representation of categories as heterogeneous and not discrete, where the features and category members do not have the same relevance within the category. Rosch [41] obtained evidence that humans store the semantic meaning of a category based on the degrees of representativeness (typicality) of category members. The author showed that human beings store category knowledge as a semantic organization around the category prototype (prototypical organization) [42]. The prototype or prototypical concept was formally defined as the clear central member of a category [13, 41]. Rosch [42] showed that human beings first learn the core semantic meaning of the object (the prototype) and then its specificities. In this paper, we model the central semantic meaning of a category based on the four types of prototypicality effects [12, 13]: extensional non-equality, intensional non-equality, extensional non-discreteness, and intensional non-discreteness. The prototypicality effects surmise the importance of the distinction between central and peripheral meaning [13].

3. Methodology

Rosch [41, 42] showed that humans learn the central semantic meaning of categories (the prototype) and include it in their cognitive processes. Based on these assumptions, our proposal follows the flow of conceptual processes presented in Figure 1 as a hypothesis for simulating human behavior in object feature description. We propose to describe an object by highlighting the global features that distinguish it within a category. In other words, after recognizing the category to which the object belongs, how do we find the features that distinguish it from the others within the category?
How can we model a global object description with behavior similar to the diagram in Figure 1? To address these issues, and due to their good performance, we use CNN-classification models for feature extraction, recognition, and classification of the visual information received as input (processes a, b, c, and d in Figure 1). The CNN-models, analogous to human memory [10], make associations that keep the knowledge in their connection structures. Our method downloads that knowledge from pre-trained CNN-models into a semantic structure (the semantic prototype) that stands for the central semantic meaning of the learned categories (Figure 1e). Our method proposes a representation (signature) that describes an object, encapsulating the semantic meaning of the extracted features and its semantic differences in relation to the central semantic meaning of the category. In the following sections, we present the part of our method that encapsulates the category's central meaning (the prototype). We also present how to introduce the prototype representation into the semantic description of object features. Figure 2 shows the architecture overview of our prototype-based description model.
The semantic structure, i.e., the central/peripheral meaning of a category, is related to differences of typicality and membership salience among category members (extensional non-equality). The prototype is an "average" of the abstraction of all objects in the category [49]. It summarizes the most representative members (or features) of the category. The combination of the observed features and their relevance for the category enables the grouping of objects by family resemblance (intensional non-equality). This approach justifies the object's position within the semantic structure of the category and allows typical objects to be grouped into the semantic center of the category (prototypical organization).

Let C = {c_1, c_2, ..., c_n} be a finite set of object categories, F = {f_1, f_2, ..., f_m} be a finite set of distinguishing features of an object, and O_{c_i} = {o ∈ O : category(o) = c_i} be the set of objects that share the same category c_i (where O is a universe of objects).

Definition 1.
Semantic prototype.
We call the central meaning of the category c_i ∈ C, the semantic prototype of the c_i-category, or simply the semantic prototype, the "average" and standard deviation of each of the features of all objects within the category, along with a "measure" of the relevance of those features. Formally, the semantic prototype is a 3-tuple P_i = (M_i, Σ_i, Ω_i) where ∀i = 1, ..., n; ∀j = 1, ..., m: i) M_i = [μ_i1, μ_i2, ..., μ_im] is a nonempty m-dimensional vector, where μ_ij is the mean of the j-th feature extracted for only the typical objects of the category c_i ∈ C; ii) Σ_i = [σ_i1, σ_i2, ..., σ_im] is a nonempty m-dimensional vector, where σ_ij is the standard deviation of the j-th feature extracted for only the typical objects of the category c_i ∈ C; iii) Ω_i = [ω_i1, ω_i2, ..., ω_im] is a nonempty m-dimensional vector, where ω_ij is the relevance value of the j-th feature for the category c_i ∈ C.

Definition 2.
Convolutional semantic prototype.
The convolutional semantic prototype of a category c_i ∈ C is a 4-tuple P_i = (M_i, Σ_i, Ω_i, b_i), where M_i, Σ_i are computed using features of category c_i extracted from the fully connected layer of CNN-models; and Ω_i, b_i are the learned parameters of the i-th category in the softmax layer. In the following, we refer to the convolutional semantic prototype of the category simply as the semantic prototype.

Definition 3.
Semantic value.
The semantic meaning of observed features F = {f_1, f_2, ..., f_m} for category c_i ∈ C, the summary value of the observed features F, or simply the semantic value of F in the c_i-category, is the abstract value ẑ = Σ_{j=1}^{m} ω_ij a_j + b_i, where ω_ij ∈ Ω_i and a_j ∈ {F, M_i}. Consequently, the central semantic meaning of the category c_i ∈ C, or summary value of the semantic prototype P_i = (M_i, Σ_i, Ω_i, b_i), is the semantic value ẑ_i = Σ_{j=1}^{m} ω_ij μ_ij + b_i, where ω_ij ∈ Ω_i, μ_ij ∈ M_i, ∀j = 1, ..., m; ∀i = 1, ..., n.

Definition 4.
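For concreteness, the semantic value of Definition 3 is just a weighted sum of the observed features plus the category bias. A minimal NumPy sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def semantic_value(a, omega_i, b_i):
    """Summary value of observed features (Definition 3):
    z_hat = sum_j omega_ij * a_j + b_i, where `a` may be an object's
    feature vector F_o or the prototype mean vector M_i."""
    return float(np.dot(omega_i, a) + b_i)

# The central semantic meaning of a category is the semantic value
# of its prototype mean (toy numbers, for illustration only):
omega = np.array([0.5, -0.25, 1.0])   # relevance weights Omega_i
mu = np.array([2.0, 4.0, 1.0])        # prototype means M_i
z_hat_i = semantic_value(mu, omega, b_i=0.5)   # 0.5*2 - 0.25*4 + 1.0*1 + 0.5 = 1.5
```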
Prototypical distance.
Let o ∈ O_{c_i} be a representative object of category c_i ∈ C, F_o the features of object o, and P_i = (M_i, Σ_i, Ω_i, b_i) the semantic prototype of the category c_i. We define as the prototypical distance between o and P_i the semantic distance:

δ(o, P_i) = Σ_{j=1}^{m} |ω_ij| |f_j − μ_ij|,   (1)

where ω_ij ∈ Ω_i, μ_ij ∈ M_i, and f_j ∈ F_o; ∀j = 1, ..., m; ∀i = 1, ..., n. (Adapted from the semantic distance of the Multiplicative Prototype Model (MPM) [32, 61] and the Generalized Context Model (GCM) [7].)

Definition 5.
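Equation 1 is a weighted L1 distance between an object's features and the prototype means. A NumPy sketch (our own naming, toy numbers):

```python
import numpy as np

def prototypical_distance(f_o, mu_i, omega_i):
    """Semantic distance between an object and its category prototype (Eq. 1):
    delta(o, P_i) = sum_j |omega_ij| * |f_j - mu_ij|."""
    return float(np.sum(np.abs(omega_i) * np.abs(f_o - mu_i)))

f = np.array([1.0, 3.0, 0.0])     # object features F_o
mu = np.array([1.5, 2.0, 0.5])    # prototype means M_i
w = np.array([2.0, -1.0, 4.0])    # relevance weights Omega_i
d = prototypical_distance(f, mu, w)   # 2*0.5 + 1*1.0 + 4*0.5 = 4.0
```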
Distance between objects.
Let o_1, o_2 ∈ O_{c_i} be representative objects of category c_i ∈ C, and F_{o_1}, F_{o_2} the features of objects o_1, o_2, respectively. We define the object distance between o_1 and o_2 as the semantic distance given by:

δ(o_1, o_2) = Σ_{j=1}^{m} |ω_ij| |f_{1j} − f_{2j}|,   (2)

where ω_ij ∈ Ω_i, f_{1j} ∈ F_{o_1} and f_{2j} ∈ F_{o_2}, ∀j = 1, ..., m; ∀i = 1, ..., n. (We introduce the learned weights of CNN-models into the psychological distance between two stimuli defined by Medin [31].)

Figure 2. Overview of the prototype-based description model. Set of steps to transform the visual information received as input into a Global Semantic Descriptor signature. a) input image; b) features extracted using a CNN-classification model; c) classification and category prototype selection; d) global semantic description of the object using the category prototype; e) graphic representation of the Global Semantic Descriptor signature resulting from the dimensionality reduction function f(x); and f) Global Semantic Descriptor signature.

Definition 6.
Feature metric space.
Let F_{c_i} be a nonempty set of all object features of the category c_i ∈ C. Since the distance function δ : F_{c_i} × F_{c_i} → R+ satisfies the axioms of non-negativity, identity of indiscernibles, symmetry, and the triangle inequality, δ is a metric on the feature set F_{c_i}. Consequently, (F_{c_i}, δ) is a metric space, or feature metric space.

Algorithm 1: Prototype Construction
Input: CNN-model Λ, object dataset O, category c_i
Output: category prototype P_i
  O_{c_i} ← {o ∈ O : category(o) = c_i}
  features_block ← {}
  for o ∈ O_{c_i} do
    if o is typical then
      F_o ← Λ.features_of(o)
      features_block ← features_block ∪ F_o
  Ω_i, b_i ← Λ.softmax_weights_learned_of(c_i)
  M_i, Σ_i ← compute_stats(features_block)
  return (M_i, Σ_i, Ω_i, b_i)

Algorithm 2: Global Semantic Descriptor ψ
Input: features F_o, residual vector r, learned weights (Ω_i, b_i)
Output: object signature ψ_o
  meaning ← f(F_o, Ω_i, b_i, meaning)
  difference ← f(r, Ω_i, b_i, other)
  return meaning ⊕ difference

Our approach of object semantic description based on prototypes assumes as the semantic meaning vector the vector z = Ω_i ⊙ F_o + b̄_i, constructed from the element-wise operations used to compute the semantic value (Definition 3). Furthermore, we represent the semantic difference vector as the weighted residual vector r = |F_o − M_i|, composed of the absolute values of the difference of each object feature from each feature of the category prototype. Figure 2 shows an overview of our prototype-based description model. Our Global Semantic Descriptor model requires prior knowledge of the prototypes of each of the CNN-model's categories (the prototypes are computed using Algorithm 1). After the categorization process, we use the corresponding category prototype for the semantic description of object features. Figure 2d) shows graphically how to introduce the category prototype into the global semantic description of the object's features. A drawback of this representation (Figure 2d)) is its high dimensionality, since it is based on the semantic meaning vector (z) and the semantic difference vector (δ = Ω_i ⊙ r). The large dimension of our feature vectors makes their practical use unfeasible in common computer vision tasks such as semantic correspondence [15, 21].

Dimensionality reduction
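Algorithms 1 and 2 can be sketched in Python as follows. This is a simplified sketch: `features_of`, `softmax_params`, and `is_typical` are hypothetical callbacks standing in for the CNN-model Λ and the typicality test, and `reduce_fn` stands in for the transformation f(x) detailed later in Algorithm 3.

```python
import numpy as np

def build_prototype(objects, category, features_of, softmax_params, is_typical):
    """Algorithm 1 sketch: compute the semantic prototype of one category."""
    block = np.stack([features_of(o) for o in objects
                      if o["category"] == category and is_typical(o)])
    omega_i, b_i = softmax_params(category)   # learned softmax-layer parameters
    # (M_i, Sigma_i, Omega_i, b_i): per-feature mean and std over typical members
    return block.mean(axis=0), block.std(axis=0), omega_i, b_i

def global_descriptor(f_o, prototype, reduce_fn):
    """Algorithm 2 sketch: concatenate the meaning and difference signatures."""
    m_i, _, omega_i, b_i = prototype
    meaning = reduce_fn(f_o, omega_i, b_i, kind="meaning")
    difference = reduce_fn(np.abs(f_o - m_i), omega_i, b_i, kind="other")
    return np.concatenate([meaning, difference])
```

The residual |F_o − M_i| passed to the second call is the semantic difference vector r of the text.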
Several dimensionality reduction algorithms, such as PCA [1] and NMF [23], are based on discarding features that do not generate meaningful variation. Although this approach works on some tasks, after applying these algorithms we lose the ability to interpret the data [1]. From the perspective of the Prototype Theory, discarding features is not suitable when applied to the semantic space, due to the absence of necessary and sufficient definitions to categorize an object (intensional non-discreteness). Sometimes discarding features may mean discarding members of the category [13]. For instance, there may be objects within the category that do not have some of the category's typical features (flying is a typical feature of the bird category; however, the penguin is a bird that does not fly).

Figure 3. Dimensionality reduction workflow. The transformation function f(x) converts the high-dimensional semantic description representation into the corresponding global semantic descriptor signature. We show the descriptor signature computation whose taxonomy stands for the semantic meaning of the c_i-category (Property 3iii), as well as the trivial case when the input m-dimensional vector has the same dimension as the χ_{r×r} auxiliary matrix (m = r·r and p = q = r).

We propose a simple transformation f(x) to compress our global semantic description representation of the object's features (Figure 2d) into a global semantic signature (Figure 2f). The final descriptor signature preserves the semantic meaning (Property 1) and the semantic difference (Property 2) present in the first global semantic description representation. Depending on the input values, our descriptor uses the transformation f(x) to construct global semantic signatures with different meanings within the category (Property 3).

The descriptor signature (ψ) is computed by concatenating the corresponding signatures of the semantic meaning vector (z) and the semantic difference vector (δ) with our transformation f(x) (see Algorithm 2). Figure 3 shows the main steps of the f(x) transformation: resizing the input vector into the best configuration of square auxiliary matrices χ_{r×r} and concatenating the output signatures of the flow for each χ_{r×r}; constructing the semantic gradient using the angle matrix (Θ_{r×r}) formed by the position of each feature with respect to the center of χ_{r×r}; and reducing the gradient to 8-bin vectors, similarly to SIFT [28]. Algorithm 3 details the steps.

Descriptor properties

Property 1.
Semantic preservation.
The semantic descriptor signature preserves the semantic value:

∫_0^π ψ = Σ_{k=0}^{|ψ|/2} ψ[k] = ẑ.

Algorithm 3: Dimensionality Reduction f(x)
Input: m-dimensional vector α, Ω_i, b_i, type
Output: semantic signature
  b̄_i ← b_i / m                      // m-dimensional vector b̄_i (b_i = Σ_m b̄_i)
  χ_{r×r} ← shape(r, r)              // set the source matrix (χ) dimension
  find the optimal configuration p, q with p ≡ 0 (mod r), q ≡ 0 (mod r), and m = p·q
  α, Ω_i, b̄_i ← reshape_to_matrix_{p×q}(α, Ω_i, b̄_i)
  compute the angle matrix Θ_{r×r} = angles_from(χ_{r×r})
  signature ← []
  for j = 1, ..., p/r; k = 1, ..., q/r do
    map χ^{jk}_{r×r} onto α, Ω_i, b̄_i
    compute z_i^{jk} using the Hadamard product ⊙:
      z_i^{jk} = Ω_i^{jk} ⊙ α^{jk} + b̄_i^{jk}   if type = meaning
      z_i^{jk} = |Ω_i^{jk}| ⊙ α^{jk}            otherwise
    g^{jk} ← vectors(Θ_{r×r}, |z_i^{jk}|, sign(z_i^{jk}))
    signature^{jk}(l) = Σ g^{jk}(θ), ∀θ ∈ Θ_{r×r} : θ_{l−1} < θ ≤ θ_l, with θ_l = l·π/4, ∀l = 1, ..., 8
    signature ← signature ⊕ signature^{jk}
  return signature

Proof.
To prove this, it suffices to follow backward through the signature-accumulation steps of Algorithm 3:

∫_0^π ψ = Σ_{k=0}^{|ψ|/2} ψ[k] = Σ f(α, Ω_i, b_i, meaning) = Σ_j Σ_k g^{jk} = Σ z = Σ (Ω_i ⊙ α + b̄_i) = ẑ;  α ∈ {M_i, F_o}.

Property 2.
Prototypical distance preservation.
The object signature ψ_o (o ∈ O_{c_i}) preserves the prototypical distance:

∫_π^{2π} ψ_o = Σ_{k=|ψ_o|/2}^{|ψ_o|} ψ_o[k] = δ(o, P_i).

Proof.
Similar to the previous proof (type = other):

∫_π^{2π} ψ = Σ_{k=|ψ|/2}^{|ψ|} ψ(F_o, |F_o − M_i|, Ω_i, b_i)[k] = Σ f(|F_o − M_i|, Ω_i, b_i, other) = Σ_j Σ_k g^{jk} = Σ δ = Σ |Ω_i| ⊙ |F_o − M_i| = δ(o, P_i).

Property 3.
Structural polymorphism.
Our Global Semantic Descriptor has the polymorphic property of describing, with the same structural representation, distinctly different semantic meanings within the c_i-category. Consequently, our descriptor uses the category prototype P_i = (M_i, Σ_i, Ω_i, b_i) to construct different semantic signature taxonomies:
i) an object o ∈ O_{c_i}: ψ_o = ψ(F_o, |F_o − M_i|, Ω_i, b_i);
ii) the central semantic meaning (abstract prototype) of the c_i-category: ψ_{P_i} = ψ(M_i, |M_i − M_i|, Ω_i, b_i) = ψ(M_i, 0, Ω_i, b_i);
iii) the semantic meaning of the c_i-category: ψ_i = ψ(M_i, Σ_i, Ω_i, b_i).

4. Experiments

Datasets.
We conducted our experiments on two benchmark image datasets: MNIST [22] and ImageNet [44]. We used the 60,000 training samples of the MNIST dataset as the archetype for building our prototypes. We also used the ImageNet dataset for building our prototypes of real objects.

Models.
We used a CNN-MNIST model based on the LeNet architecture [22] for digit classification on the MNIST dataset. The CNN-MNIST model was used as the pilot model of our experiments, while the VGG16 model [47] was the basis of our semantic description model. We used VGG16 because its features are the basis of a variety of image processing tasks, such as object detection [38], image annotation [33], video emotion recognition [56], style transfer [11], image alignment [15, 39], clustering, and scene classification [29]. Our prototype-based descriptor model is scalable and can easily be adapted to any other CNN-classification model.
In the experiments, we computed the prototypes with the CNN-MNIST and VGG16 models on the MNIST and ImageNet datasets, respectively. We take as the object features those extracted from the model layer right before the softmax layer (see Feature Layer in Figure 2b). To properly build the proposed semantic prototype, we need typical objects of the category, or some information about the typicality value (or typicality degree) of the objects of a specific category. However, neither of the datasets used provides this information. For this reason, we used as typical objects of a category only those elements that are unequivocally classified as category members (Top 1) by the CNN-classification models. For each category in the datasets, we extracted features of typical members and computed the semantic prototype (see Definition 1) using Algorithm 1.
Prototypical behavior.
Achieving the members' prototypical behavior within the category is one of the motivations and theoretical bases of our work. Nevertheless, there is no established metric to quantify whether our representation correctly captures the category's semantic meaning. This is a consequence of the fact that there is no established metric to robustly evaluate the typicality level of an object for a category; this skill is still reserved for human beings. Our prototype model (semantic prototype + prototypical distance) tries to capture the central semantic meaning of the category. In a way comparable to the human being, we want to simulate that visually typical elements of a category are organized close (based on the prototypical distance metric) to the category prototype.

Figure 4. Top-5 most relevant members of the c_5-category (number five) in the MNIST dataset. a) (from left to right) Top-5 elements closest to the semantic prototype of the c_5-category; the index value represents the position of the object within the c_5-category of the dataset. b) Top-5 elements farthest from the semantic prototype of the c_5-category.

Ranking   Closest (δ_min)       Farthest (δ_max)
          Index      Value      Index      Value
Top 1     …786       3.181      55,886     16.…
Top 2     …144       3.478      9,344      15.…
Top 3     …322       3.807      56,838     15.…
Top 4     …954       3.896      20,976     14.…
Top 5     …588       3.920      19,590     14.…

Table 1. Prototypical distance of the c_5-members presented in Figure 4. The top-5 most relevant elements (closest and farthest) are shown based on our prototypical distance metric (δ). The table also shows the position of each element within the MNIST dataset (index) and its semantic difference (δ value) with respect to the P_5-prototype. (Some digits were lost in extraction and are marked with "…".)

Figure 4 and Table 1 present an example of the semantic meaning captured by our prototype model for members of the number five category in the MNIST dataset. As shown in Figure 4, our proposal finds as typical elements of the number 5 (top-5 closest) the handwritten digits with features that are, undoubtedly, distinctive of the c_5-category. Our model can also find the peripheral meaning of the category. Members with less representative features of the c_5-category, or barely readable ones, are placed in the periphery (top-5 farthest), away from the central meaning, but keeping the features of the category (each still belongs to the category). Our model finds, as a human being would, that such a member can be a 5, but not a typical 5. Based on our experimental results (on the MNIST and ImageNet datasets), we assume that the proposed semantic prototype correctly captures the central semantic meaning of the category. Our prototypical distance influences the arrangement of the elements around the category's semantic prototype. The Top-5 typical objects of the category are positioned close to the prototype, and the Top-5 less typical ones are positioned farther from the semantic center. But does our model organize all category members with this prototypical organization?

Figure 5.
Prototypical organization within categories for the MNIST and ImageNet datasets, respectively. At the top, from left to right, the five elements closest to and furthest from the semantic prototype of each category; the index of the first element is annotated (inside the black box). Note how the signature domain preserves the internal disposition of the category achieved in the feature domain.
Prototypical organization.
Visualizing the semantic position of each category member with respect to the central semantic meaning of the category (the abstract prototype) constitutes a simple approach to see the internal semantic structure of the entire category. The experiments in this section aim to visualize the internal semantic structure of the category using the semantic meaning encapsulated by our model for each category member. First, we need to corroborate that our prototype model can correctly interpret the object features and position the object semantically within the category, keeping a prototypical organization. Second, we want to verify whether the proposed semantic descriptor encodes and preserves the semantic information contained in the object features, while preserving the prototypical organization within the category.

Visualizing the category's internal structure is infeasible in the m-dimensional feature space, since most data visualization techniques are based on discarding features. From the perspective of the Prototype Theory foundations, this approach can be problematic (intensional non-discreteness). For this reason, we used topology techniques to show that our model simulates the prototypical organization within the category.

Let (F_{c_i}, δ) and (R², l1) be metric spaces, and let the map ρ : F_{c_i} → R² be defined by ρ(o ∈ O_{c_i}) = ρ(F_o) = p(ẑ_o, δ(o, P_i)), where F_o are the object features, ẑ_o is the object's semantic value (see Definition 3), δ(o, P_i) is the prototypical distance, the point p(x, y) ∈ R², and l1 is the L1-norm distance. ρ maps the object into the (R², l1) metric space using its semantic value and its prototypical distance. Let o_1, o_2 ∈ O_{c_i}, and p_1 = ρ(o_1), p_2 = ρ(o_2) the mapped points in the (R², l1) metric space. Then the Sum of Absolute Differences (SAD) is l1(p_1, p_2) = l1(ρ(o_1), ρ(o_2)) = |ẑ_1 − ẑ_2| + |δ_1 − δ_2|. Using Definitions 3, 4, and 5, we end up with the expression: l1(p_1, p_2) ≤ 2 δ(o_1, o_2).
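In code, the map ρ simply pairs the two quantities from Definitions 3 and 4 (a NumPy sketch with our own naming):

```python
import numpy as np

def rho(f_o, mu_i, omega_i, b_i):
    """Map an object into (R^2, l1): x = semantic value (Def. 3),
    y = prototypical distance (Def. 4)."""
    z_hat = float(np.dot(omega_i, f_o) + b_i)
    delta = float(np.sum(np.abs(omega_i) * np.abs(f_o - mu_i)))
    return np.array([z_hat, delta])

def l1(p1, p2):
    """Sum of Absolute Differences between two mapped points."""
    return float(np.sum(np.abs(p1 - p2)))
```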
Consequently, for every F_{o_1}, F_{o_2} ∈ F_{c_i} and ε > 0 there exists a φ = ε/2 > 0 such that δ(o_1, o_2) < φ ⇒ l1(ρ(o_1), ρ(o_2)) < ε; that is, ρ is continuous. This means that if ρ(o_1) = p_1, then for every p in a neighborhood of p_1, ρ⁻¹(p) lies in a neighborhood of o_1. Let (ψ_{c_i}, l1) be the metric space of object descriptor signatures. Similarly, using Properties 1 and 2, we can show that the map γ : (ψ_{c_i}, l1) → (R², l1) with γ(ψ_o ∈ ψ_{c_i}) = p(∫_0^π ψ_o, ∫_π^{2π} ψ_o) = p(ẑ_o, δ(o, P_i)) is continuous. Since ρ and γ are continuous, the behavior in the (R², l1) metric space is equivalent to the behavior in the feature metric space (F_{c_i}, δ) and in the descriptor's metric space (ψ_{c_i}, l1).

Figure 5 shows examples of the internal semantic structure of categories mapped using ρ and γ. The experiments demonstrate a prototypical organization within the category in the (R², l1) metric space. Note how the semantic value and the prototypical distance organize all category elements prototypically. The Top-5 most visually representative members of the number five in the (F_{c_i}, δ) metric space (see Figure 4) are the same Top-5 most representative members in the (R², l1) metric space. The Top-5 closest members (in blue) are positioned near the mapped abstract prototype (in black) (see Figure 5). Likewise, the Top-5 less representative members (in red) continue to be positioned in the peripheries. Even with different models and datasets, the internal prototypical organization of the category achieved in the descriptor signature domain (right) is identical to the prototypical organization in the feature domain (left). This means that our descriptor signature preserves in its taxonomy the semantic information contained in the object features.

Signature taxonomies.
Figure 6 shows an example of the signature taxonomies constructed with our descriptor using the CNN-MNIST model. We used the structural polymorphism property of our descriptor (Property 3) to construct signatures of the central semantic meaning (abstract prototype), of the semantic meaning of the category, and of the meaning of a category member. The abstract prototype signature is a degenerate version of the category signature; it can be understood as the numbers distribution (or DNA chain) that stands for the category. The category members have a semantic meaning with a representation similar to the category DNA chain. The semantic difference of the category signature can be understood as the feature boundary of all category members. Consequently, the semantic information encoded in our global semantic descriptor signatures makes it easy to recover the object's semantic information (Properties 1 and 2); it also allows interpreting the object's typicality within the category (typicality_score(o) = 1/δ(o, P_i)).

Figure 6. Taxonomies of the semantic signatures constructed with our descriptor for the c-category in the MNIST dataset. We show the abstract prototype signature and the c-category signature (semantic meaning). In addition, we present descriptor signature examples for two members of the c-category and a member that does not belong to the c-category.

We evaluated the proposed semantic encoding of our Global Semantic Descriptor (GSDP) (version based on the VGG16 model) by comparing our representation against the following global image descriptors: GIST [35], LBP [34], HOG [5], Color64 [24], Color Hist [48], Hu H CH [17, 19, 48], and VGG16 [47]. Yang et al.
[58] showed that when feature representations achieve good metrics in clustering tasks, they generalize well when transferred to other tasks. Based on this assumption, we evaluated our semantic encoding to verify its usefulness and suitability in image clustering tasks.

We used the K-means algorithm to cluster images of the first categories of ImageNet (a fixed number of images per category) using the descriptor signatures. The experiment was conducted incrementally, adding one category (and thus one cluster) per iteration. Table 2 shows a snapshot of the K-means metrics achieved by the selected descriptors on the first categories. Figure 7 shows the behavior of the K-means metrics for the VGG16 and GSDP signatures as the number of clusters (categories) increased in each execution of the algorithm. Our GSDP descriptor keeps the semantic information of the VGG16 signatures (see Figure 5) with a more discriminative representation and an even lower feature dimension (256). The results show that our descriptor encoding significantly outperforms the other global image encodings in terms of clustering metrics. The results achieved in clustering tasks encourage us to evaluate the generalization ability of our semantic representation in other computer vision tasks.

Descriptor    Size   H     C     V     ARI   AMI
GIST          960    0.05  0.05  0.05  0.01  0.05
LBP           512    0.02  0.03  0.03  0.01  0.02
HOG           1960   0.04  0.04  0.04  0.01  0.03
Color64       64     0.12  0.12  0.12  0.04  0.11
Color Hist    512    0.08  0.08  0.08  0.03  0.07
Hu H CH       532    0.04  0.04  0.04  0.01  0.02
VGG16         4096   0.77  0.78  0.77  0.60  0.76
GSDP (Our)    256

Table 2. K-means clustering metrics for each evaluated descriptor. The table shows the K-means measures for clustering the first ImageNet categories: Homogeneity (H), Completeness (C), V-measure (V), Adjusted Rand Index (ARI), and Adjusted Mutual Information (AMI).

Figure 7. Comparing the performance of the VGG16 feature versus our descriptor signature in the clustering task. The figure shows the K-means metrics reached by both representations at each iteration of our experiment.
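The evaluation protocol above (K-means over descriptor signatures, scored with the five clustering metrics of Table 2) can be sketched as follows. This is a hedged illustration using synthetic, well-separated 256-D "signatures" in place of the real GSDP features; only the metric names match the paper's setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics

rng = np.random.default_rng(0)

# Stand-in for descriptor signatures: 3 synthetic "categories",
# 50 samples each, in a 256-D space (the GSDP signature size).
n_cat, n_per, dim = 3, 50, 256
centers = rng.normal(scale=5.0, size=(n_cat, dim))
X = np.vstack([c + rng.normal(size=(n_per, dim)) for c in centers])
y_true = np.repeat(np.arange(n_cat), n_per)

# Cluster the signatures and score against the ground-truth categories.
y_pred = KMeans(n_clusters=n_cat, n_init=10, random_state=0).fit_predict(X)

scores = {
    "H": metrics.homogeneity_score(y_true, y_pred),
    "C": metrics.completeness_score(y_true, y_pred),
    "V": metrics.v_measure_score(y_true, y_pred),
    "ARI": metrics.adjusted_rand_score(y_true, y_pred),
    "AMI": metrics.adjusted_mutual_info_score(y_true, y_pred),
}
for name, s in scores.items():
    print(f"{name}: {s:.2f}")
```

The incremental protocol of the experiment corresponds to repeating this loop while growing `n_cat` by one category per iteration and recording the five scores each time.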
5. Conclusions
We introduced a novel Global Semantic Descriptor based on the foundations of the Prototype Theory. Our prototype-based description model does not need to be trained and is easily adaptable to any existing CNN classification model. As shown in the experiments, our semantic descriptor is discriminative, low-dimensional, encodes the semantic information of the category, and achieves a prototypical organization of the category members. We further showed how to interpret and retrieve the object typicality information encoded in our representation. Our model is a starting point for introducing the theoretical foundations of the Prototype Theory, related to the representation of semantic meaning and the learning of visual concepts, into the CNN-descriptor family.

References

[1] H. Abdi and L. J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459, 2010.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
[3] H. Bristow, J. Valmadre, and S. Lucey. Dense semantic correspondence where every pixel is a classifier. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4024–4031, 2015.
[4] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414–2422, 2016.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893. IEEE, 2005.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning (ICML), pages 647–655, 2014.
[7] W. Estes. Memory storage and retrieval processes in category learning. Journal of Experimental Psychology: General, 115(2):155, 1986.
[8] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769, 2014.
[9] V. Fromkin, R. Rodman, and N. Hyams. An Introduction to Language. Cengage Learning, 2018.
[10] J. M. Fuster. Network memory. Trends in Neurosciences, 20(10):451–459, 1997.
[11] L. Gatys, A. Ecker, and M. Bethge. A neural algorithm of artistic style. Nature Communications, 2015.
[12] D. Geeraerts. Diachronic Prototype Semantics: A Contribution to Historical Lexicology. Oxford University Press, 1997.
[13] D. Geeraerts. Theories of Lexical Semantics. Oxford University Press, 2010.
[14] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 392–407. Springer, 2014.
[15] K. Han, R. S. Rezende, B. Ham, K.-Y. K. Wong, M. Cho, C. Schmid, and J. Ponce. SCNet: Learning semantic correspondence. arXiv preprint arXiv:1705.04043, 2017.
[16] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3279–3286, 2015.
[17] R. M. Haralick, K. Shanmugam, et al. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(6):610–621, 1973.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[19] M.-K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 8(2):179–187, 1962.
[20] J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2307–2314, 2013.
[21] S. Kim, D. Min, B. Ham, S. Jeon, S. Lin, and K. Sohn. FCSS: Fully convolutional self-similarity for dense semantic correspondence. arXiv preprint arXiv:1702.00926, 2017.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
[23] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[24] M. Li. Texture moment for content-based image retrieval. In Multimedia and Expo, 2007 IEEE International Conference on, pages 508–511. IEEE, 2007.
[25] K. Lin, J. Lu, C.-S. Chen, and J. Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1183–1192, 2016.
[26] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(5):978–994, 2011.
[27] J. L. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In Advances in Neural Information Processing Systems, pages 1601–1609, 2014.
[28] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
[29] X. Lu, Y. Yuan, and J. Fang. JM-Net and Cluster-SVM for aerial scene classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2386–2392. AAAI Press, 2017.
[30] A. Martin. The representation of object concepts in the brain. Annu. Rev. Psychol., 58:25–45, 2007.
[31] D. L. Medin and M. M. Schaffer. Context theory of classification learning. Psychological Review, 85(3):207, 1978.
[32] J. P. Minda and J. D. Smith. Comparing prototype-based and exemplar-based accounts of category learning and attentional allocation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(2):275, 2002.
[33] V. N. Murthy, S. Maji, and R. Manmatha. Automatic image annotation using deep learning representations. In Proceedings of the 5th ACM International Conference on Multimedia Retrieval, pages 603–606. ACM, 2015.
[34] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. TPAMI, 24(7):971–987, 2002.
[35] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.
[36] C. B. Perez and G. Olague. Genetic programming as strategy for learning image descriptor operators. Intelligent Data Analysis, 17(4):561–583, 2013.
[37] W. Qiu, X. Wang, X. Bai, A. Yuille, and Z. Tu. Scale-space SIFT flow. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 1112–1119. IEEE, 2014.
[38] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[39] I. Rocco, R. Arandjelovic, and J. Sivic. Convolutional neural network architecture for geometric matching. In Proc. CVPR, volume 2, 2017.
[40] I. Rocco, R. Arandjelović, and J. Sivic. End-to-end weakly-supervised semantic alignment. In CVPR, 2018.
[41] E. Rosch. Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104(3):192, 1975.
[42] E. Rosch. Principles of categorization. In Cognition and Categorization, edited by Eleanor Rosch and Barbara B. Lloyd, pages 27–48, 1978.
[43] E. Rosch and C. B. Mervis. Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7(4):573–605, 1975.
[44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[45] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 118–126, 2015.
[46] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(8):1573–1585, 2014.
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[48] Y.-j. Song, W.-b. Park, D.-w. Kim, and J.-h. Ahn. Content-based image retrieval using new color histogram. In Intelligent Signal Processing and Communication Systems, 2004. ISPACS 2004. Proceedings of 2004 International Symposium on, pages 609–611. IEEE, 2004.
[49] R. J. Sternberg and K. Sternberg. Cognitive Psychology. Nelson Education, 2016.
[50] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(1):66–78, 2012.
[51] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278–4284, 2017.
[52] S. L. Thompson-Schill. Neuroimaging studies of semantic memory: inferring how from where. Neuropsychologia, 41(3):280–292, 2003.
[53] E. Tola, V. Lepetit, and P. Fua. A fast local descriptor for dense matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
[54] E. Trulls, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Dense segmentation-aware descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2890–2897, 2013.
[55] E. Tulving. Coding and representation: searching for a home in the brain. Science of Memory: Concepts, pages 65–68, 2007.
[56] B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal. Video emotion recognition with transferred deep feature encodings. In Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, pages 15–22. ACM, 2016.
[57] H. Yang, W.-Y. Lin, and J. Lu. DAISY filter flow: A generalized discrete approach to dense correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3406–3413, 2014.
[58] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5147–5156, 2016.
[59] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned invariant feature transform. In Proceedings of the European Conference on Computer Vision (ECCV), pages 467–483. Springer, 2016.
[60] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4353–4361, 2015.
[61] S. R. Zaki, R. M. Nosofsky, R. D. Stanton, and A. L. Cohen. Prototype and exemplar accounts of category learning and attentional allocation: A reassessment. Journal of Experimental Psychology: Learning, Memory and Cognition, 29(6):1160–1173, 2003.
[62] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1592–1599, 2015.
[63] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In