Learning Intermediate Features of Object Affordances with a Convolutional Neural Network
Aria Y. Wang ([email protected])
Center for the Neural Basis of Cognition (CNBC), Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, United States
Michael J. Tarr ([email protected])
Department of Psychology, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, United States
Abstract
Our ability to interact with the world around us relies on being able to infer what actions objects afford, often referred to as affordances. The neural mechanisms of object-action associations are realized in the visuomotor pathway, where information about both visual properties and actions is integrated into common representations. However, explicating these mechanisms is particularly challenging in the case of affordances because there is hardly any one-to-one mapping between visual features and inferred actions. To better understand the nature of affordances, we trained a deep convolutional neural network (CNN) to recognize affordances from images and to learn the underlying features, or dimensionality, of affordances. Such features form an underlying compositional structure for the general representation of affordances, which can then be tested against human neural data. We view this representational analysis as the first step towards a more formal account of how humans perceive and interact with the environment.
Keywords: affordance; dataset; convolutional neural network
While interacting with our environment, we naturally infer the functional properties of the objects around us. These properties, typically referred to as affordances, are defined by Gibson (1979) as all of the actions that an object in the environment offers to an observer; for example, kick for a ball and drink for water. Understanding affordances is critical for understanding how humans are able to interact with objects in the world.

In recent years, convolutional neural networks (CNNs) have been successful in performing object recognition on large-scale image datasets (Krizhevsky, Sutskever, & Hinton, 2012). At the same time, convolutional networks trained to recognize objects have been used as feature extractors and can successfully model neural responses as measured by fMRI in human visual cortex (Agrawal, Stansbury, Malik, & Gallant, 2014) or by electrodes in monkey IT cortex (Yamins & DiCarlo, 2016). To understand which visual features of an object are indicative of its affordances, we trained a CNN to recognize the affordable actions of objects in images.
Dataset Collection
Training deep CNNs is known to require large amounts of data, yet available affordance datasets with images and semantic labels are limited at this moment. The only relevant dataset currently available to the public was created by Chao, Wang, Mihalcea, and Deng (2015), and only includes affordance labels for 20 objects from the PASCAL dataset and 90 objects from the COCO dataset. Here we built a large-scale affordance dataset with affordance labels attached to all images in the ImageNet dataset (Deng et al., 2009). This dataset forms a more general representation of the affordance space and allows large-scale end-to-end training from the image space to this affordance space. The dataset collection process is shown in Figure 1. Human labelers were presented with object labels from ImageNet object categories and answered the question “What can you do with that object?”. All answers were then co-registered with WordNet (Miller, 1995) action labels so that our labels could be extended to other datasets. The top five responses from labelers were used as canonical affordance labels for each object. In total, 334 categories of actions were labeled for around 500 object categories. When combined with the image-to-object-label mappings from ImageNet, these affordance labels provided us with the image-to-affordance-label mappings that were used to train our CNN.

Figure 1: Dataset collection. Labelers are given object labels, indicated in green boxes, and assign to them affordance labels, indicated in blue boxes.
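The mapping from labeler responses to training targets can be sketched as follows. This is a minimal illustration assuming a hypothetical JSON file of co-registered responses; the actual collection pipeline may differ.

```python
import json
from collections import Counter

# Hypothetical input: raw labeler responses per ImageNet object category,
# already co-registered with WordNet action labels.
# responses.json: {"n02084071": ["pet", "feed", "walk", ...], ...}
with open("responses.json") as f:
    responses = json.load(f)

# Keep the top five responses per object category as its canonical affordances.
canonical = {
    synset: [action for action, _ in Counter(actions).most_common(5)]
    for synset, actions in responses.items()
}

# Build the affordance vocabulary and a multi-hot target vector per category.
vocab = sorted({a for actions in canonical.values() for a in actions})
index = {a: i for i, a in enumerate(vocab)}

def affordance_vector(synset):
    """Binary vector: 1 if the action is a canonical affordance of the object."""
    vec = [0] * len(vocab)
    for action in canonical[synset]:
        vec[index[action]] = 1
    return vec

# Combined with ImageNet's image-to-synset mapping, every image inherits the
# affordance vector of its object category.
```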
Visualization of Affordance Space
In our affordance dataset, each object was represented by a binary vector indicating whether or not each of the possible actions was available for that object. Each object can then be represented as a point in the affordance space. We used PCA to project these affordance vectors into a 3D space and plotted the object classes as illustrated in Figure 2. In the 3D space created for visualization, the objects appear to be well separated. More specifically, the majority of living things were organized along the top axis; the majority of small household items were organized along the left axis; and transportation tools and machines were organized along the right axis. Human-related categories such as dancer and queen do not belong to any axis and appear as free-floating points in the space.

Figure 2: ImageNet images in the affordance space.
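This projection can be sketched with scikit-learn's PCA, assuming the binary object-by-action matrix has been assembled as above; the file name is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# affordance_matrix: (n_object_categories, n_actions) binary matrix,
# one multi-hot affordance vector per object category (hypothetical file).
affordance_matrix = np.load("affordance_matrix.npy")

pca = PCA(n_components=3)
coords_3d = pca.fit_transform(affordance_matrix)  # (n_object_categories, 3)

# coords_3d can then be scatter-plotted to inspect how object categories
# cluster in affordance space (cf. Figure 2).
print(pca.explained_variance_ratio_)
```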
Results

Network Training
A CNN was trained to predict affordance categories from images. A total of 55 affordances were selected as potential actions after ensuring that each affordance label had at least 8 object categories associated with it (affordances associated with too few object categories were removed). Each object category was placed in the training, validation, or testing set. These sets were exclusive: if an object category appeared in one set, it did not appear in the other two. This separation ensures that the learning of affordances was not based on recognizing the same objects and learning linear mappings between objects and affordances. We used the ResNet18 model (He, Zhang, Ren, & Sun, 2016) (other models such as VGG produced similar results) and trained it using the Adam optimizer (Kingma & Ba, 2014) by minimizing a binary cross-entropy loss. Approximately 630,000 images from ImageNet were used in training, and approximately 71,000 images each were used for validation and testing. The trained CNN was evaluated by computing the average percentage of correctly predicted affordance labels, and the results are reported in Table 1. The trained networks showed significantly better performance compared to the baseline.

Table 1: Training results. “Fine-tuning” indicates that the network was pre-trained to predict image categories, while “Training from Scratch” indicates that the network was initialized with random weights. Baseline accuracy was calculated by always predicting the most frequent categories.

                        Baseline   Fine-tuning   Training from Scratch   Fine-tuning w/ oversampling   Training from Scratch w/ oversampling
Training Accuracy (%)   7.61       80.39         71.42                   87.60                         85.05
Testing Accuracy (%)    6.86       44.62         37.47
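As a concrete illustration of this setup, the following is a minimal PyTorch sketch of multi-label training with a 55-way head on ResNet18; the learning rate, threshold, and data pipeline are placeholders rather than the exact configuration used here.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_AFFORDANCES = 55

# Fine-tuning variant: start from ImageNet-pretrained weights and replace the
# final classification layer with a 55-way multi-label head.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_AFFORDANCES)

criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy over affordance labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, targets):
    """images: (B, 3, 224, 224); targets: (B, 55) multi-hot affordance vectors."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, targets.float())
    loss.backward()
    optimizer.step()
    return loss.item()

def label_accuracy(logits, targets, threshold=0.5):
    """One reading of the evaluation metric: average fraction of affordance
    labels predicted correctly after thresholding the sigmoid outputs."""
    preds = (torch.sigmoid(logits) > threshold).float()
    return (preds == targets).float().mean().item()
```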
Skewed Distribution and Oversampling
Since actions such as “hold” and “grab” would be used on objects much more often than actions such as “thrust”, we obtained an uneven distribution of affordance labels across image categories, as shown in Figure 3. In computer vision, oversampling is a commonly used solution for this problem. However, because of the multi-label nature of the affordance recognition problem, proper oversampling is challenging: less frequently appearing classes need to be oversampled without over-representing the more frequently appearing classes. We used Multi-label Best First Over-sampling (ML-BFO) (Ai et al., 2015) and re-trained the CNN with the resampled data. This produced a considerable increase in prediction performance, as seen in Table 1.

Figure 3: Percentage of object classes assigned to each affordance category.
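To illustrate the difficulty, the sketch below performs naive multi-label oversampling by duplicating images that carry rare labels; it is a deliberately simplified stand-in for, not an implementation of, ML-BFO (Ai et al., 2015).

```python
import numpy as np

def naive_multilabel_oversample(image_ids, label_matrix, target_count):
    """Duplicate samples of rare labels until each label reaches target_count.

    image_ids: list of image identifiers.
    label_matrix: (n_images, n_labels) binary matrix of affordance labels.
    Returns an expanded list of image ids (with repeats).
    """
    rng = np.random.default_rng(0)
    counts = label_matrix.sum(axis=0)
    resampled = list(image_ids)
    for label in np.argsort(counts):  # rarest labels first
        deficit = int(target_count - counts[label])
        if deficit <= 0:
            continue
        candidates = np.where(label_matrix[:, label] == 1)[0]
        extra = rng.choice(candidates, size=deficit, replace=True)
        resampled.extend(image_ids[i] for i in extra)
        # Duplicating these images also inflates their other labels,
        # which is exactly the difficulty that ML-BFO is designed to handle.
    return resampled
```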
Sample Predictions
Figures 4(a)–(d) show images for which the network predicted affordances correctly. However, the presence of distinct features can mislead the network. For example, in Figure 4(e), where white bars stand out in the image, the network predicted “grab” and “drive”, potentially mistaking the image for a bar or a road. Human labelers, on the other hand, knowing that it is an image of a wall, provided labels such as “walk” and “enter”. Since ImageNet contains natural scene images, multiple objects are likely to appear in one image, even though each image is assigned only one object label. Such images confuse both the labelers and the network, and can therefore lead to incorrect affordance recognition, as shown in Figures 4(f) and (g).

Figure 4: Sample predictions. (a)–(d): examples of images with correct affordance predictions (correct labels below each image): (a) care/feed, (b) hang/wear/grab, (c) eat/taste, (d) switch on/off. (e)–(g): examples of images with incorrect affordance predictions (P indicates the CNN prediction, GT the ground truth based on human labeling): (e) P: grab/drive, GT: walk/exit; (f) P: empty/fill, GT: taste; (g) P: hunt/chase, GT: talk/meet.
Visualizing the Learned Representation Space
RDM across Layers
To visualize the representations learned by the network, we randomly sampled 10 images from each of 30 object classes and extracted activations from the network layers. The pairwise correlation distance between network activations was computed for each pair of images at each layer, as shown in Figure 5. The pairwise distance between affordance labels is shown in the bottom-right matrix; this matrix denotes the ground-truth distance in affordance space. Similar patterns begin to emerge in Layer 4 for both the fine-tuned network and the network trained from scratch. Critically, this pattern is not seen for the off-the-shelf network that was not trained on affordances. This demonstrates that our network learns representations that effectively separate different affordance categories.

Figure 5: RDMs across layers (Layer 1 through Layer 4 and the average-pooling layer) for the off-the-shelf pre-trained network, the pre-trained network fine-tuned on affordances (Pretrained AffNet), and the network trained from scratch (AffNet Trained from Scratch), together with the RDMs over object categories and affordance categories.
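For reference, one way to compute such correlation-distance RDMs, assuming layer activations have already been extracted and flattened into a hypothetical activations_by_layer dictionary:

```python
import numpy as np

def rdm(activations):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between the flattened activation patterns of every pair of images.

    activations: (n_images, n_features) array for one network layer.
    """
    return 1.0 - np.corrcoef(activations)

# Example: activations_by_layer is a dict such as
# {"layer1": (300, d1) array, ..., "avgpool": (300, d5) array}
# for 10 images from each of 30 object classes (hypothetical variable).
# rdms = {name: rdm(acts) for name, acts in activations_by_layer.items()}

# The ground-truth RDM is computed the same way from the binary
# affordance vectors of the images' object categories.
```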
t-SNE

Activations from the second-to-last layer of the network trained from scratch were visualized using t-SNE (Maaten & Hinton, 2008), as shown in Figure 6. Images were coarsely split into four groups based on their distinct affordances: living things, vehicles, physical spaces, and small items. In the 2D t-SNE visualization, the representations of living things (green), vehicles (red), and physical spaces (blue) are visibly separable. Small items (yellow), in contrast, span the entire space. The category of small items does not appear well separated, which is likely due to the visualization being limited to two dimensions.

Figure 6: t-SNE visualization of the second-to-last layer of the CNN trained from scratch. Representations of images are coarsely split into four groups based on the images' distinct affordances: living things (green; e.g., meet, feed, water, pet, catch, care), vehicles (red; e.g., drive, operate, decelerate, ride, board), physical spaces (blue; e.g., stand, enter, exit, travel to, walk), and small items (yellow; e.g., fill, carry, open, grab, cover).
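A sketch of this projection using scikit-learn's t-SNE; the activation and group arrays are hypothetical placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# penultimate_acts: (n_images, n_features) activations from the
# second-to-last layer of the network trained from scratch (hypothetical files).
penultimate_acts = np.load("penultimate_acts.npy")
group_ids = np.load("group_ids.npy")  # 0=living, 1=vehicle, 2=space, 3=small item

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    penultimate_acts
)

# Color points by their coarse affordance group, as in Figure 6.
colors = ["green", "red", "blue", "gold"]
for g, color in enumerate(colors):
    mask = group_ids == g
    plt.scatter(coords[mask, 0], coords[mask, 1], c=color, s=5)
plt.show()
```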
Unit Visualization
We were able to visualize the output layer units of the CNN by optimizing in pixel space to determine which images maximally activate a specific unit. Figure 7 shows such visualizations for 6 units from the output layer. The “ride” unit, for example, shows human- and horse-like structures; the “wear” unit shows a coarse clothing pattern and details of common textures often associated with clothing. Similarly, the “climb”, “sit”, and “fill” units show stairs-like, chair-like, and container-like structures, respectively. Interestingly, the “watch” unit shows a preference for dense textures in the center of the image space, which may correlate with image characteristics of objects that are related to watching (e.g., a TV). It should be noted that unit visualization is very limited for capturing the learned intermediate features: interpreting features in a limited 2D space is inherently biased and subjective.
Figure 7: Visualization of 6 last-layer units in the CNN: (a) ride, (b) wear, (c) watch, (d) climb, (e) fill, (f) sit.
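A minimal sketch of this kind of activation maximization (plain gradient ascent on the input with a small L2 penalty); real feature-visualization pipelines typically add stronger regularization and input normalization, which are omitted here.

```python
import torch

def visualize_unit(model, unit_index, steps=200, lr=0.05):
    """Gradient ascent in pixel space to find an image that maximally
    activates one output-layer unit (e.g., the "ride" affordance)."""
    model.eval()
    image = torch.randn(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        activation = model(image)[0, unit_index]
        # Maximize the unit's activation; the L2 penalty keeps pixel values bounded.
        loss = -activation + 1e-4 * image.pow(2).sum()
        loss.backward()
        optimizer.step()
    return image.detach()
```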
Discussion
We successfully trained a CNN to predict affordances from images as a means of learning the underlying dimensionality of object affordances. The intermediate features in the CNN constitute an underlying compositional structure for the representation of affordances.
Future Work
To ensure the objectivity of the affordance labeling, affordance labels for images, as opposed to just object category labels, are currently being collected using Amazon Mechanical Turk. This dataset will be made publicly available after verification.

With a CNN trained for affordance recognition, weights from the intermediate layers can be extracted and used to featurize each image. A model can then be trained to predict the BOLD responses to each image. Correlations between the predicted responses and the true responses can be used to measure model performance. If a linear model is built to perform this task, the model weights could then be used as a proxy to localize where information about affordances is represented in the human brain.

Finally, affordance categories can be split into two large groups: semantically relevant ones, such as “eat”, which require past experience with the objects in question, and non-semantically relevant ones, such as “sit”, which may be inferred directly from the shapes of the objects. If semantic affordances are processed in the brain, top-down information about the objects is potentially necessary to inform an observer about affordable actions, whereas non-semantic affordances would not require top-down information. Given such differences, we may be able to differentiate between top-down and bottom-up visual processing in the human brain using our model; in particular, by distinguishing the brain regions that are engaged in either or both of these two processes.
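One hedged sketch of the proposed encoding analysis, using ridge regression from hypothetical CNN-feature and BOLD arrays; the regularization strength and train/test split are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# features: (n_images, n_cnn_features) activations from an intermediate layer.
# bold:     (n_images, n_voxels) measured BOLD responses to the same images.
# Both file names are hypothetical placeholders.
features = np.load("cnn_features.npy")
bold = np.load("bold_responses.npy")

n_train = int(0.8 * len(features))
model = Ridge(alpha=1.0)
model.fit(features[:n_train], bold[:n_train])
pred = model.predict(features[n_train:])

# Per-voxel correlation between predicted and measured held-out responses;
# the fitted weights can serve as a proxy for where affordance information
# is represented.
true = bold[n_train:]
corr = [np.corrcoef(pred[:, v], true[:, v])[0, 1] for v in range(bold.shape[1])]
```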
Acknowledgments
This project is funded by MH R90DA023426-12, Interdisciplinary Training in Computational Neuroscience (NIH/NIDA).
References
Agrawal, P., Stansbury, D., Malik, J., & Gallant, J. L. (2014). Pixels to voxels: Modeling visual representation in the human brain. arXiv preprint arXiv:1407.5104.

Ai, X., Wu, J., Sheng, V. S., Yao, Y., Zhao, P., & Cui, Z. (2015). Best first over-sampling for multilabel classification. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (pp. 1803–1806).

Chao, Y.-W., Wang, Z., Mihalcea, R., & Deng, J. (2015). Mining semantic affordances of visual object categories. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4259–4267).

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 248–255).

Gibson, J. (1979). The ecological approach to visual perception (pp. 127–143). Boston: Houghton Mifflin.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097–1105).

Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579–2605.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.

Yamins, D. L., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3), 356–365.