Capturing human category representations by sampling in deep feature spaces
Joshua C. Peterson, Jordan W. Suchow, Krisha Aghi, Alexander Y. Ku, Thomas L. Griffiths
Joshua C. Peterson ([email protected]), Jordan W. Suchow ([email protected]), Krisha Aghi ([email protected]), Alexander Y. Ku ([email protected]), Thomas L. Griffiths ([email protected])
Department of Psychology, Helen Wills Neuroscience Institute, University of California, Berkeley, CA 94720 USA
Abstract
Understanding how people represent categories is a core problem in cognitive science. Decades of research have yielded a variety of formal theories of categories, but validating them with naturalistic stimuli is difficult. The challenge is that human category representations cannot be directly observed and running informative experiments with naturalistic stimuli such as images requires a workable representation of these stimuli. Deep neural networks have recently been successful in solving a range of computer vision tasks and provide a way to compactly represent image features. Here, we introduce a method to estimate the structure of human categories that combines ideas from cognitive science and machine learning, blending human-based algorithms with state-of-the-art deep image generators. We provide qualitative and quantitative results as a proof-of-concept for the method's feasibility. Samples drawn from human distributions rival those from state-of-the-art generative models in quality and outperform alternative methods for estimating the structure of human categories.
Keywords: categorization; neural networks; Markov Chain Monte Carlo
Introduction
Categorization is a central problem in cognitive science and concerns why and how we divide the world into discrete units at various levels of abstraction. The biggest challenge for studying human categorization is that the content of mental category representations cannot be directly observed, which has led to the development of laboratory methods for estimating this content from human behavior. Because these methods rely on small sets of artificial stimuli with handcrafted or low-dimensional feature sets, they are ill-suited to the study of categorization as an intelligent process, which is principally motivated by robust human categorization performance in complex ecological settings (Nosofsky et al., 2017).

One of the challenges of applying laboratory methods to realistic stimuli such as natural images is finding a way to represent them. Deep learning models, such as convolutional neural networks, discover features that can be used to represent complex images compactly and perform well on a range of computer vision tasks (LeCun et al., 2015). It may be possible to express human category structure using these features, an idea supported by recent work in cognitive science (Lake et al., 2015; Peterson et al., 2016).

Ideally, experimental methods could be combined with state-of-the-art deep learning models to estimate the structure of human categories with as few assumptions as possible, while avoiding the problem of dataset bias. In what follows, we propose a method that uses a human in the loop to estimate arbitrary distributions over complex feature spaces, adapting an existing experimental paradigm to exploit advances in deep architectures to capture the precise structure of human category representations and iteratively sharpen them. Such knowledge is crucial to forming an ecological theory of intelligent categorization behavior and to providing a ground-truth benchmark to guide future work in machine learning.
Background
Deep neural networks for images
Deep neural networks (DNNs) are modern instantiations of classic multilayer perceptrons and represent a powerful class of machine learning model. DNNs can be trained efficiently through gradient descent and structurally specialized for particular domains (LeCun et al., 2015). In the image domain, deep convolutional neural networks (CNNs; LeCun et al., 1989) excel in classic computer vision tasks, including natural image classification (Krizhevsky et al., 2012). CNNs exploit knowledge of the input domain by learning a hierarchical set of translation-invariant image filters. The resulting representations, real-valued feature vectors, are surprisingly general and outperform other methods in explaining complex human behavior (Lake et al., 2015; Peterson et al., 2016).

Generative Adversarial Networks (GANs; Goodfellow et al., 2014) and Variational Autoencoders (VAEs; Kingma & Welling, 2013) provide a generative approach to modeling the content of natural images. Importantly, though the two approaches differ considerably, each makes use of a network (called a "decoder" or "generator") that learns a deterministic function mapping samples from a known noise distribution p(z) (e.g., a multivariate Gaussian) to samples from the true image distribution p(x). This can be thought of as mapping a relatively low-dimensional feature representation z to a relatively high-dimensional image x. Sampling new images from these networks is as simple as passing Gaussian noise into the learned decoder. In addition, because of its simple form, the resulting latent space z tends to be easy to traverse meaningfully (i.e., an intrinsic linear manifold) and can be readily visualized via the decoder, a property we exploit presently.

Estimating the structure of human categories
Methods for estimating human category templates have existed for some time. In psychophysics, the most popular and well-understood method is known as classification images (CI; Ahumada, 1996).
Figure 1: Deep MCMCP. A current state z and proposal z* (top middle) are fed to a pretrained deep image generator/decoder network (top left). The corresponding decoded images x and x* for the two states are presented to human raters on a computer screen (leftmost arrow and bottom left). Human raters then view the images in an experiment (bottom middle arrow) and act as part of an MCMC sampling loop, choosing between the two states/images in accordance with the Barker acceptance function (bottom right). The chosen image can then be sent to the inference network (rightmost arrow) and decoded in order to select the state for the next trial; however, this step is unnecessary when we know exactly which states correspond to which images.

In the classification images experimental procedure, a human participant is presented with images from two categories, A and B, each with white noise overlaid, and asked to select the stimulus that corresponds to the category in question. On most trials, the participant will obviously select the exemplar generated from the category in question. However, if the added white noise significantly perturbs features of the image that are important to making the distinction, they may fail. Exploiting this, we can estimate the decision boundary from a number of these trials using the simple formula

(n_AA + n_BA) - (n_AB + n_BB),   (1)

where n_XY is the average of the noise across trials where the correct class is X and the observer chooses Y.

Vondrick et al. (2015) used a variation on classification images with deep image representations that could be inverted back to images using an external algorithm. In order to avoid dataset bias introduced by perturbing real class exemplars, white noise in the feature space was used to generate stimuli. In this special case, category templates reduce to n_A - n_B.
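As a concrete sketch, the template in Equation (1) can be computed from logged trial data as follows. The array layout, labels, and function name are illustrative assumptions, not the authors' code:

```python
import numpy as np

def classification_image(noise, true_class, chosen_class):
    """Estimate a classification-image template from noisy 2AFC trials,
    per Equation (1): (n_AA + n_BA) - (n_AB + n_BB).

    noise        : (n_trials, ...) array, the noise shown on each trial
    true_class   : (n_trials,) array of 'A'/'B' correct-class labels
    chosen_class : (n_trials,) array of 'A'/'B' observer choices
    """
    def mean_noise(t, c):
        # average noise over trials with correct class t and choice c
        mask = (true_class == t) & (chosen_class == c)
        return noise[mask].mean(axis=0) if mask.any() else np.zeros(noise.shape[1:])

    return (mean_noise('A', 'A') + mean_noise('B', 'A')) - \
           (mean_noise('A', 'B') + mean_noise('B', 'B'))
```

Intuitively, noise that pushes responses toward "A" accumulates with a positive sign, and noise that pushes toward "B" with a negative sign, so the template points along the observer's internal A-versus-B boundary.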
On each trial of the experiment, participants were asked to select which of two images (inverted from feature noise) most resembled a particular category. Because the feature vectors for all trials were random, thousands of stimuli could be rendered in advance of the experiment using relatively slow methods that require access to large datasets. This early inversion method was applied to the mean feature vectors for thousands of positive choices in the experiments and yielded qualitatively decipherable category template images, as well as better objective classification decision boundaries that were guided by human bias. Under the assumption that human category distributions are Gaussian with equal variance, this method yields a vector that aligns with the nearest-mean decision boundary, although a massive number of human trials is required.

Markov Chain Monte Carlo with People (MCMCP; Sanborn & Griffiths, 2007), an alternative to classification images, is an experimental procedure in which humans act as a valid acceptance function A in the Metropolis-Hastings algorithm, exploiting the fact that Luce's choice axiom, a well-known model of human choice behavior, is equivalent to the Barker acceptance function (see equation in Figure 1). On the first trial, a stimulus x is drawn arbitrarily from the parameter space and compared to a new proposed stimulus x* that is nearby in that parameter space. The participant makes a forced choice as to which is the better exemplar of some category (e.g., dog), acting as the acceptance function A(x*; x). If the initial stimulus is chosen, the Markov chain remains in that state. If the proposed stimulus is chosen, the chain moves to the proposed state. The process then repeats until the chain converges to the target category distribution p(x|c).
In practice, convergence is assessed heuristically, or limited by the number of human trials that can be practically obtained. MCMCP has been successfully employed to capture a number of different mental categories (Sanborn & Griffiths, 2007; Martin et al., 2012), and though these spaces are higher-dimensional than those in previous laboratory experiments, they are still relatively small and artificial compared to real images. Unlike classification images, this method makes no assumptions about the structure of the category distributions and thus can estimate means, variances, and higher-order moments. We therefore take it as a starting point for the current method.

MCMCP in deep feature spaces
The typical MCMCP experiment is effective so long as noise can be added to dimensions in the stimulus parameter space to create meaningful changes in content. In the case of natural images, noise in the space of all pixel intensities is very unlikely to modify the stimulus in meaningful ways. Instead, we propose perturbing images in a deep feature space that captures only essential variation. Since trials in an MCMCP experiment are not independent, we employ real-time, web-accessible generative adversarial networks to render high-quality inversions from their latent features. The mapping from features to images learned by a GAN is deterministic, and therefore MCMCP in the low-dimensional feature space approximates the same process in high-dimensional image space. The resulting judgments (samples) approximate distributions that both derive arbitrary human category boundaries for natural images and can be sampled from to create images, yielding new human-like generative image models. A schematic of this procedure is illustrated in Figure 1.

There are several theoretical advantages to our method over previous efforts. First, MCMCP can capture arbitrary distributions, so it is not as sensitive to the structure of the underlying low-dimensional feature space and should provide better category boundaries than classification images when required. This is important when using various deep feature spaces that were learned with different training objectives and architectures. Second, MCMC inherently spends less time in low-probability regions and should in theory waste fewer trials. Third, because the images are generated online and as a function of the participant's decisions, there is no dataset or sampling bias, and autocorrelation can be addressed by removing temporally adjacent samples from the chain.
Finally, using a deep generator provides drastically clearer samples than shallow reconstruction methods, and it can be trained end-to-end with an inference network that allows us to categorize new images using the learned distribution.
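The MCMCP loop described above can be simulated end to end by replacing the human with a synthetic chooser that applies Luce's choice rule, which is equivalent to the Barker acceptance function, to an assumed target density. Everything below (the Gaussian "category", its prototype, the step size) is a hypothetical stand-in for the unobservable human distribution, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def target_density(z):
    # Stand-in for the unobservable human category distribution p(z | c):
    # an isotropic Gaussian (sd 0.5) around a hypothetical "prototype".
    prototype = np.full_like(z, 0.5)
    return np.exp(-np.sum((z - prototype) ** 2) / (2 * 0.5 ** 2))

def simulated_participant(z_cur, z_prop):
    # Luce's choice rule over the target density; equivalent to the
    # Barker acceptance function: P(choose z*) = p(z*) / (p(z*) + p(z)).
    p_cur, p_prop = target_density(z_cur), target_density(z_prop)
    return rng.random() < p_prop / (p_prop + p_cur)

def mcmcp_chain(z0, n_trials=2000, step=0.25):
    z = np.asarray(z0, dtype=float)
    samples = []
    for _ in range(n_trials):
        z_prop = z + rng.normal(0.0, step, size=z.shape)  # nearby proposal
        if simulated_participant(z, z_prop):              # one 2AFC "trial"
            z = z_prop                                    # chain moves
        samples.append(z.copy())
    return np.array(samples)
```

Because Barker acceptance leaves the target distribution invariant, the retained states (after discarding burn-in and thinning temporally adjacent samples) approximate draws from the category distribution itself.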
Experiments
For our experiments, we explored two image generator networks trained on various datasets. Since even relatively low-dimensional deep image embeddings are large compared to controlled laboratory stimulus parameter spaces, we use a hybrid proposal distribution in which a Gaussian with a low variance is used with probability P and a Gaussian with a high variance is used with probability 1 - P. This allows participants both to refine and to escape nearby modes, but is simple enough to avoid the excessive experimental piloting that more advanced proposal methods often require.

Participants in all experiments completed exactly 64 trials (image comparisons), collectively taking about 5 minutes and containing segments of several chains for multiple categories. The order of the categories and of the chains within those categories was always interleaved. Each participant's set of chains for each category was initialized with the previous participant's final states, resulting in large, multi-participant chains. All experiments were conducted on Amazon Mechanical Turk. If a single image did not load for a single trial, the data for the subject undergoing that trial was discarded entirely, and a new subject was recruited to continue on from the original chain state.

Experiment 1: Initial test with face categories
Methods
We first tested our method using DCGAN (Radford et al., 2015) trained on the Asian Faces Dataset. We chose this dataset because it requires a deep architecture to produce reasonable samples (unlike MNIST, for example), yet it is constrained enough to test-drive our method using a relatively simple latent space. Four chains for each of four categories (male, female, happy, and sad) were used. Proposals were generated from an isometric Gaussian with a standard deviation of 0.25 50% of the time, and 2 otherwise. In addition, we conducted a baseline in which two new initial state proposals were drawn on every trial, independent of previous trials (classification images). The final dataset contained 50 participants and over 3,200 trials (samples) in total for all chains. The baseline classification images (CI) dataset contained the same number of trials and participants.
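A minimal sketch of this hybrid proposal, using the Experiment 1 values (standard deviations 0.25 and 2, each chosen half the time) as defaults; the function name and latent dimensionality are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_proposal(z, p_small=0.5, sigma_small=0.25, sigma_large=2.0):
    """Mixture proposal: a narrow isometric Gaussian with probability
    p_small (local refinement of a mode) and a wide Gaussian otherwise
    (escape to distant modes)."""
    sigma = sigma_small if rng.random() < p_small else sigma_large
    return z + rng.normal(0.0, sigma, size=z.shape)

z = np.zeros(100)            # e.g., a 100-d DCGAN latent vector
z_next = hybrid_proposal(z)  # one proposed state for the next 2AFC trial
```

Because the proposal is a symmetric mixture of symmetric kernels, it leaves the Barker acceptance step unchanged while letting the chain mix at two spatial scales.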
Results
MCMCP chains are visualized using Fisher Linear Discriminant Analysis in Figure 2, along with the resulting averages for each chain and each category. Chain means within a category show interesting variation, yet converge to similar regions in the latent space, as expected. Figure 2 also shows visualizations of the mean faces for both methods in the final two columns. MCMCP means appear to have converged quickly, whereas CI means only moderately resemble their corresponding category (e.g., the MCMCP mean for "happy" is fully smiling, while the CI mean barely reveals teeth). All four CI means appear closer to a mean face, which is what one would expect from averages of noise. We validated this improvement with a human experiment in which 30 participants made forced choices between CI and MCMCP means. The results are reported in Figure 3. MCMCP means are consistently and strongly preferred as representations of each category compared to CI means. This remained true even when an additional 50 participants (100 in total) completed the CI task, yielding twice as many image comparison trials as with MCMCP.
Experiment 2: Larger networks & larger spaces
The results of Experiment 1 show that reasonable category templates can be obtained using our method, yet the complexity of the stimulus space used does not rival that of large object classification networks. In Experiment 2, we tackled a more challenging (and interesting) form of the problem. To do this, we employed a bidirectional generative adversarial network (BiGAN; Donahue et al., 2016) trained on the 1.2 million-image ILSVRC12 dataset (64 × 64 center-cropped). BiGAN includes an inference network, which regularizes the rest of the model and produces unconditional samples competitive with the state of the art. This also allows for the later possibility of comparing human distributions with other networks, as well as assessing machine classification performance on new images based on the granular human biases captured.
Methods
Our generator network was trained given uniform rather than Gaussian noise, which allows us to avoid proposing highly improbable stimuli to participants. Additionally, we avoid proposing states outside of the resulting hypercube by forcing z to wrap around (proposals that travel outside of z are injected back in from the opposite direction by the amount originally exceeded). In particular, we run our MCMC chains through an unbounded state space by redefining each bounded dimension z_k as

z'_k = -sgn(z_k) × [1 - (z_k - floor(z_k))]  if |z_k| > 1,  and  z'_k = z_k  otherwise.   (2)

Proposals were again generated from an isometric Gaussian. Category group 1 included bottle, car, fire hydrant, person, and television, following Vondrick et al. (2015). Group 2 included bird, body of water, fish, flower, and landscape. Each chain was approximately 1,040 states long, and four of these chains were used for each category (approximately 4,160 samples per category), yielding 41,600 samples from 650 participants.

To demonstrate the efficiency and flexibility of our method compared to alternatives, we obtained an equivalent number of trials for all categories using the variant of classification images introduced in Vondrick et al. (2015), with the exception that we used our BiGAN generator instead of the offline inversion previously used. This also serves as an important baseline against which to quantitatively evaluate our method because it estimates the simplest possible template.

Figure 2: Visualizing captured representations. A. Fisher Linear Discriminant projections of all four MCMCP chains for each of the four face categories. The four sets of chains overlap to some degree, but are also well-separated overall. Means of individual chains are closer to other means from the same class than to those of other classes. B. Individual MCMCP chain means.
Figure 3: Human two-alternative forced-choice tasks reveal a strong preference for MCMCP means as representations of a category, even when twice as many trials are used for CI.
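The wrap-around rule of Equation (2), as described verbally (a proposal exceeding a bound of the [-1, 1] hypercube re-enters from the opposite face by the amount exceeded), can be sketched as below. This is an illustrative implementation assuming proposals never overshoot a bound by more than the cube's width:

```python
import numpy as np

def wrap_to_unit_cube(z):
    """Re-inject out-of-bounds proposal coordinates from the opposite
    face of the [-1, 1] hypercube: a coordinate exceeding a bound by d
    re-enters from the other side by d (valid for |z_k| < 3, i.e. small
    proposal steps relative to the cube)."""
    z = np.asarray(z, dtype=float).copy()
    out = np.abs(z) > 1.0
    excess = np.abs(z[out]) - 1.0               # amount beyond the bound
    z[out] = -np.sign(z[out]) * (1.0 - excess)  # enter from opposite side
    return z
```

For example, a proposal coordinate of 1.2 exceeds the upper bound by 0.2 and re-enters at -0.8, keeping the effective state space toroidal so the chain never stalls at the cube's faces.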
Results
The acceptance rate was approximately 50% for both category groups, which is near the common goal for MCMCP experiments. The samples for all ten categories are shown in Figures 5B and D using Fisher Linear Discriminant Analysis. Similar to the face chains, the four chains for each category converge to similar regions in space, largely away from other categories. In contrast, classification images show little separation with so few trials (Figures 5A and C). Previous work suggests that at least an order of magnitude more comparisons may be needed for satisfactory estimation of category means. Our method estimates well-separated category means in a manageable number of trials, allowing the method to scale. This makes sense given that CI compares arbitrary images, potentially wasting many trials, and clearly suffers from a great deal of noise.

Beyond yielding a decision rule, our method additionally produces a density estimate of the entire category distribution. In classification images, only mean template images can be viewed, whereas we are able to visualize several modes of the category distribution. Figure 4 visualizes these modes using the means of each component in a mixture-of-Gaussians density estimate. This produces realistic-looking multimodal mental category templates, which to our knowledge has never been accomplished with respect to natural image categories.
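Mode extraction from the chain samples can be sketched as follows; here plain k-means centroids serve as a cheap, numpy-only stand-in for the mixture-of-Gaussians fit used in the paper, and each returned center is a latent vector that could be decoded into a template image:

```python
import numpy as np

rng = np.random.default_rng(0)

def component_means(samples, k=5, iters=50):
    """Crude stand-in for a mixture-of-Gaussians fit: k-means centroids
    of the MCMCP samples approximate the mixture component means
    ("modes") of the category distribution."""
    # deterministic init: k samples spread evenly through the array
    idx = np.linspace(0, len(samples) - 1, k).astype(int)
    centers = samples[idx].astype(float).copy()
    for _ in range(iters):
        # assign every sample to its nearest center
        d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = samples[labels == j].mean(axis=0)
    return centers

# toy "chain samples": two well-separated clusters in a 3-d latent space
samples = np.vstack([rng.normal(-2.0, 0.1, (100, 3)),
                     rng.normal(2.0, 0.1, (100, 3))])
modes = component_means(samples, k=2)
```

A proper density estimate (e.g., an EM-fitted Gaussian mixture) additionally yields component weights, which is what allows ranking the modes by mixture weight as in Figure 4.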
Efficacy in classifying real images
Improvements of MCMCP over classification images may be both perceptible and detectable, but their practical differences are also worth considering: do they differ significantly on real-world tasks? Moreover, if the representations we learn through MCMCP are good approximations to people's, we would expect them to perform reasonably well in categorizing real images. For this reason, we provide an additional quantitative assessment of the samples we obtained and compare them to classification images (CI) using an external classification task.

To do this, we scraped approximately 500 images from Flickr for each of the ten categories and used them in a classification task. To classify the images using our human-derived samples, we used (1) a nearest-mean decision rule, and (2) a decision rule based on the highest log-probability given by our ten density estimates. For classification images, only a nearest-mean decision rule can be tested. In all cases, decision rules based on our MCMCP-obtained samples overall outperform a nearest-mean decision rule using classification images (see Table 1). In category group 2, the MCMCP density performed best and was more even across classes. In category group 1, nearest-mean using our MCMCP samples did much better than a density estimate or CI-based nearest-mean.

Figure 4: 40 most interpretable mixture component means (modes) taken from the 50 largest mixture weights for each category.
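The two decision rules can be sketched as below. A simple Gaussian kernel density is used here as an illustrative stand-in for the paper's per-category density estimates, and all names and parameters are assumptions:

```python
import numpy as np

def nearest_mean_classify(x, class_means):
    """Rule (1): assign x (e.g., the encoded latent vector of a Flickr
    image) to the class with the nearest mean."""
    names = list(class_means)
    dists = [np.linalg.norm(x - class_means[c]) for c in names]
    return names[int(np.argmin(dists))]

def max_logprob_classify(x, class_samples, bandwidth=1.0):
    """Rule (2): assign x to the class whose density estimate, fit to
    that class's MCMCP samples, gives the highest log-probability."""
    def log_kde(pt, samples):
        # isotropic Gaussian kernel density over the class's samples
        sq = np.sum((samples - pt) ** 2, axis=1) / (2 * bandwidth ** 2)
        return np.log(np.mean(np.exp(-sq)) + 1e-300)
    names = list(class_samples)
    logps = [log_kde(x, class_samples[c]) for c in names]
    return names[int(np.argmax(logps))]
```

The nearest-mean rule uses only the first moment of each category, whereas the log-probability rule exploits the full shape of the sampled distribution, which is why only MCMCP (and not CI) supports the second rule.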
Discussion
Our results demonstrate the potential of our method, which leverages both psychological methods and deep surrogate representations to make the problem of capturing human category representations tractable. The flexibility of our method in fitting arbitrary generative models allows us to visualize multimodal category templates for the first time and to improve on human-based classification performance benchmarks. It is difficult to guarantee that our chains explored enough of the relevant space to actually capture the concepts in their entirety, but the diversity in the modes visualized and the improvement in class separation achieved are positive indications that we are on the right track. Further, the framework we present can be straightforwardly improved as generative image models advance, and a number of known methods for improving the speed, reach, and accuracy of MCMC algorithms can be applied to MCMCP to make better use of costly human trials.

There are several obvious limitations of our method. First, the structure of the underlying feature spaces used may either lack the expressiveness (some features may be missing) or the constraints (too many irrelevant features or possible images waste too many trials) needed to map all characteristics of human mental categories in a practical number of trials. Even well-behaved spaces are very large and require many trials to reach convergence. Addressing this will require continuing exploration of a variety of generative image models. We see our work as part of an iterative refinement process that can yield more granular human observations and inform new deep network objectives and architectures, both of which may yet converge on a proper, yet tractable, model of real-world human categorization.

Figure 5: Categories are better separated by MCMCP representations. Fisher Linear Discriminant projections of A. CI comparisons for each category of group 1, B. samples for MCMCP chains for category group 1, C. CI comparisons for each category of group 2, and D. samples for MCMCP chains for category group 2. For A and C, large dots represent category means. For B and D, large dots represent chain means.

Table 1: Classification performance compared to chance for both category sets (chance is 0.20).

      bird   body of water   fish   flower   landscape   all
MM    .33    .28             .01    .57      .67         .37
MD    .23    .31             .18    .44      .73         .38
CM    .23    .30             .20    .24      .52         .30

      bottle   fire hydrant   car   person   television   all
MM    .15      .11            .32   .77      .73          .42
MD    .25      .26            .56   .19      .50          .35
CM    .28      .15            .62   .12      .13          .26

MM = MCMCP Mean, MD = MCMCP Density, CM = CI Mean
References
Ahumada, A. (1996). Perceptual classification images from Vernier acuity masked by noise. Perception.

Donahue, J., Krähenbühl, P., & Darrell, T. (2016). Adversarial feature learning. arXiv preprint arXiv:1605.09782.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... Bengio, Y. (2014). Generative adversarial networks. In Advances in Neural Information Processing Systems (pp. 2672-2680).

Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Lake, B. M., Zaremba, W., Fergus, R., & Gureckis, T. M. (2015). Deep neural networks predict category typicality ratings for images. In Proceedings of the 37th Annual Conference of the Cognitive Science Society.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541-551.

Martin, J. B., Griffiths, T. L., & Sanborn, A. N. (2012). Testing the efficiency of Markov Chain Monte Carlo with people using facial affect categories. Cognitive Science, 36(1), 150-162.

Nosofsky, R. M., Sanders, C. A., Meagher, B. J., & Douglas, B. J. (2017). Toward the development of a feature-space representation for a complex natural category domain. Behavior Research Methods, 1-27.

Peterson, J., Abbott, J., & Griffiths, T. (2016). Adapting deep network features to capture psychological representations. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.

Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Sanborn, A., & Griffiths, T. L. (2007). Markov Chain Monte Carlo with people. In Advances in Neural Information Processing Systems (pp. 1265-1272).

Vondrick, C., Pirsiavash, H., Oliva, A., & Torralba, A. (2015). Learning visual biases from human imagination. In Advances in Neural Information Processing Systems.