Gibbs Sampling with People
Peter M. C. Harrison* (Max Planck Institute for Empirical Aesthetics, Frankfurt)
Raja Marjieh* (Max Planck Institute for Empirical Aesthetics, Frankfurt)
Federico Adolfi (Max Planck Institute for Empirical Aesthetics, Frankfurt)
Pol van Rijn (Max Planck Institute for Empirical Aesthetics, Frankfurt)
Manuel Anglada-Tort (Max Planck Institute for Empirical Aesthetics, Frankfurt)
Ofer Tchernichovski (Hunter College CUNY, The CUNY Graduate Center)
Pauline Larrouy-Maestri (Max Planck Institute for Empirical Aesthetics, Frankfurt)
Nori Jacoby (Max Planck Institute for Empirical Aesthetics, Frankfurt)

* Equal contribution.
Abstract
A core problem in cognitive science and machine learning is to understand how humans derive semantic representations from perceptual objects, such as color from an apple, pleasantness from a musical chord, or seriousness from a face. Markov Chain Monte Carlo with People (MCMCP) is a prominent method for studying such representations, in which participants are presented with binary choice trials constructed such that the decisions follow a Markov Chain Monte Carlo acceptance rule. However, while MCMCP has strong asymptotic properties, its binary choice paradigm generates relatively little information per trial, and its local proposal function makes it slow to explore the parameter space and find the modes of the distribution. Here we therefore generalize MCMCP to a continuous-sampling paradigm, where in each iteration the participant uses a slider to continuously manipulate a single stimulus dimension to optimize a given criterion such as 'pleasantness'. We formulate both methods from a utility-theory perspective, and show that the new method can be interpreted as 'Gibbs Sampling with People' (GSP). Further, we introduce an aggregation parameter to the transition step, and show that this parameter can be manipulated to flexibly shift between Gibbs sampling and deterministic optimization. In an initial study, we show GSP clearly outperforming MCMCP; we then show that GSP provides novel and interpretable results in three other domains, namely musical chords, vocal emotions, and faces. We validate these results through large-scale perceptual rating experiments. The final experiments use GSP to navigate the latent space of a state-of-the-art image synthesis network (StyleGAN), a promising approach for applying GSP to high-dimensional perceptual spaces. We conclude by discussing future cognitive applications and ethical implications.

Introduction
Humans continuously extract semantic representations from complex perceptual inputs, re-expressing them as meaningful information that can be efficiently communicated to primary cognitive processes such as memory, decision-making, and language [1–3]. Effective semantic representation seems to be a prerequisite for intelligent behavior, and is correspondingly a core topic of study in both cognitive science and machine learning [4–6].

One way of studying semantic representations in humans is to present participants with stimuli that exhaustively sample from a stimulus space (e.g., the space of visible colors) and ask them to evaluate these stimuli (e.g., [7]). Unfortunately, this method works poorly for high-dimensional stimuli whose parameter spaces are too large to explore exhaustively. An alternative approach is to hand-construct stimulus sets to test specific hypotheses about semantic representations (e.g., that slow melodies tend to sound sad, [8]); however, this approach relies heavily on prior domain knowledge, and is poorly suited to exploratory research.

Markov Chain Monte Carlo with People (MCMCP) addresses this problem [9–11]. MCMCP takes as input a stimulus space (e.g., visible colors) and a target semantic category (e.g., 'danger'). In each trial, participants are presented with two stimuli and are asked which comes from the category. By virtue of MCMCP's adaptive procedure, stimulus selection becomes progressively biased towards parts of the stimulus space that represent the category. The resulting process iteratively characterizes the subjective mapping between the stimulus space and the semantic concept for a given participant or participant group. The technique provides a way for cognitive scientists to systematically quantify subjective aspects of perception, for example the way in which participants from a particular musical culture hear certain chords as 'consonant', or the way in which participants have certain subjective ideas of what a 'serious' face looks like. The approach has been shown to outperform reverse correlation, a related non-adaptive method for mapping semantic categories to perceptual spaces [12].

According to the underlying theory, MCMCP converges asymptotically to the participant's internal probabilistic representation of a given semantic category within a stimulus space [9]. However, its practical convergence rate is limited for several reasons. The first concerns the response interface: MCMCP is traditionally limited to binary choice responses, which can only provide a small amount of information per trial (1 bit), much less than the theoretical limit of other response interfaces (e.g., sliders). The second concerns the proposal function that generates successive stimuli: a too-narrow proposal function makes the process slow to find the modes of the distribution, whereas a too-wide proposal function makes it harder for the process to estimate the mode with much precision [13, 14].

Here we present a new technique for addressing these problems, termed Gibbs Sampling with People (GSP). While MCMCP corresponds to a human instantiation of the Metropolis-Hastings MCMC sampler, GSP corresponds to a human instantiation of a Gibbs sampler. Crucially, unlike [15], GSP has participants respond with a continuous slider rather than a binary choice. This has two effects: first, it substantially increases the upper bound of information per trial, and second, it eliminates the need to calibrate a proposal function.
We further show how GSP can be formulated in utility theory, thereby generalizing the approach from discrete to continuous semantic representations, and we show how GSP can be shifted towards deterministic optimization through an aggregation process.

This paper continues with a review of MCMCP and a theoretical account of GSP. We then describe four studies applying GSP to various visual and auditory domains, ranging from simple low-dimensional problems to complex high-dimensional problems parameterized by deep neural networks. These studies include experiments directly implementing GSP and MCMCP, control experiments investigating different hyperparameters, and validation experiments for the generated outputs. All combined, these 25 experiments represent data from 5,178 human participants. Appendices, code, and raw data are hosted at https://doi.org/10.17605/OSF.IO/RZK4S.

Markov Chain Monte Carlo with People (MCMCP)

MCMC is a procedure for sampling from distributions whose normalization constants are impractical to compute directly. It works by constructing a Markov chain whose stationary distribution corresponds to the distribution of interest. The chain is initialized with some state $x$ from the parameter space, and the following steps are then repeated until convergence: (1) sample a candidate $x^*$ for the next state of the Markov chain according to some proposal function $q(x^*; x)$; (2) decide whether to accept this candidate according to an appropriate acceptance function $A(x^*; x)$ constructed to reflect the probability distribution of interest. In the case of a symmetric proposal function, the acceptance function takes a simple form known as the Barker acceptance function and is given by $A(x^*; x) = \pi(x^*) / (\pi(x) + \pi(x^*))$, where $\pi(x)$ is the target distribution [16].

In MCMCP the acceptance function is replaced with a human participant, whose task is to choose between the current state and the candidate state [9]. The trick is then to frame this task such that the participant's choices correspond to the acceptance function for an interesting probability distribution. The solution presented in the original MCMCP paper is to tell the participant that one of the stimuli comes from a class distribution (e.g., cats), and one of them comes from an unknown category. The authors suppose that the participant computes the posterior probability of class membership assuming a locally uniform likelihood for the alternative class, and then selects a stimulus with probability proportional to its posterior probability of class membership. Under these assumptions, the participant's behaviour can be shown to correspond to the classic Barker acceptance function where the underlying probability distribution is the likelihood function for the class being judged.

This formulation is elegant, but it has two important limitations. First, it can only be applied to semantic representations that take categorical forms; the derivation does not make sense for continuous semantic representations (e.g., pleasantness). Second, it assumes that participants make their choices with probabilities equal to their posterior probabilities of class membership (a process termed probability matching) as opposed to the Bayes-optimal strategy of deterministically maximizing this posterior probability [17]. Humans do indeed seem to exhibit probability matching in certain contexts, but a convincing cognitive model ought to explain how this sub-optimal process arises [18].

Here we reformulate MCMCP (and later GSP) without these limitations.
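For concreteness, the following Python sketch simulates the basic MCMCP loop under the noisy-utility account developed below, with a synthetic one-dimensional utility function standing in for the human participant. The utility function, the Gumbel noise scale, and the proposal width are illustrative choices, not settings taken from our experiments; the point is only that maximizing noisy utilities reproduces Barker-style acceptance behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

def utility(x):
    # Illustrative deterministic utility: a bump centred on 0.3 in a 1-D space.
    return -10.0 * (x - 0.3) ** 2

def simulated_choice(x_current, x_proposal, gamma=5.0):
    # Participant model from the utility account: add i.i.d. Gumbel noise to each
    # utility and pick the maximum, which yields Barker/logistic choice probabilities.
    noise = rng.gumbel(scale=1.0 / gamma, size=2)
    utilities = np.array([utility(x_current), utility(x_proposal)]) + noise
    return int(np.argmax(utilities))  # 1 means "accept the proposal"

def mcmcp_chain(n_iter=500, proposal_sd=0.1):
    x = rng.uniform(0.0, 1.0)                        # initial state
    samples = []
    for _ in range(n_iter):
        x_star = x + rng.normal(scale=proposal_sd)   # symmetric Gaussian proposal
        if simulated_choice(x, x_star) == 1:         # human (here: simulated) decision
            x = x_star
        samples.append(x)
    return np.array(samples)

samples = mcmcp_chain()
print(samples[250:].mean())   # samples concentrate near the utility peak (~0.3)
```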
We suppose that the participant is asked to select the stimulus that best matches a given criterion $C$; for example 'select the most pleasant chord' or 'select the color that most resembles lavender'. We suppose that the participant performs this task by extracting a utility value for each stimulus, and selecting the stimulus with the maximum utility [19]. In the case of the class membership tasks typically used in MCMCP, we might hypothesize that the utility corresponds to the subjective likelihood of the stimulus conditioned on the class of interest; however, in the general case the utility function need not necessarily correspond to a probability distribution. The utility value is however assumed to have a deterministic component and a noise component, namely $U_i = \ell_i + n_i$, where $\ell_i$ is the deterministic utility of stimulus $i$, and $n_i$ is the associated noise. This noise component can capture participant-level noise from sensory [20, 21] and cognitive [21, 22] processes, as well as population-level noise corresponding to individual differences in the utility function [19]. In the case where the noise components are i.i.d. and have an extreme value distribution common to discrete choice models, it can be shown that the probability of selecting a given stimulus $s_1$ is equal to $(1 + \exp(-\gamma(\ell_1 - \ell_2)))^{-1}$, where $\gamma^{-1}$ corresponds to the scale parameter of the noise component. If the utility is assigned based on subjective likelihood, $\ell_i = \log p(s_i \mid C)$, then this equation reproduces the Barker acceptance function with target distribution $\pi(s) \propto p^{\gamma}(s \mid C)$. This justifies MCMCP for an optimal observer with a noisy utility function (for proof and discussion see Appendix A.1).

Gibbs Sampling with People (GSP)

Gibbs sampling is an alternative approach for sampling from probability distributions [23], defined as follows. Let $p(z_1, \ldots, z_n)$ be a target distribution over an $n$-dimensional state space from which one would like to sample, and choose a starting vector state $z^{(1)} = (z_1^{(1)}, \ldots, z_n^{(1)})$. Then, circularly update coordinates by sampling from $p(z_k^{(i+1)} \mid z_1^{(i+1)}, \ldots, z_{k-1}^{(i+1)}, z_{k+1}^{(i)}, \ldots, z_n^{(i)})$.

In GSP the participant provides the coordinate updates. This is achieved by presenting the participant with a slider that is associated with the current stimulus dimension $z_k$ and instructing the participant to move the slider to maximize a certain criterion, such as the pleasantness of a sound or the resemblance of a fruit to a strawberry. To analyze the decision step, let us discretize the slider into a set of points $\{z_k^i\}_i$ and let $z_{-k}$ denote the other fixed dimensions. As before, suppose that each point on the slider is associated with a utility that contains both a deterministic component $\ell(z_k^i, z_{-k})$ and a noise component $n_i$, and suppose that the participant chooses the slider location that maximizes the utility. Then, under similar assumptions to those made in the MCMCP case, the probability distribution over slider locations is

$$p(\text{choose } i) = p(z_k^i \mid z_{-k}) = \frac{e^{\gamma \ell(z_k^i, z_{-k})}}{\sum_j e^{\gamma \ell(z_k^j, z_{-k})}} \quad (1)$$

and as the granularity of the slider tends to infinity, the denominator becomes a marginal, and GSP becomes a sampler from $p(z) \propto e^{\gamma \ell(z)}$ (for proof and discussion see Appendix A.2).
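The following sketch illustrates Equation (1) in simulation: a single GSP 'slider' step draws a location from the softmax over a discretized slider, and a full chain sweeps circularly over the stimulus dimensions. The three-dimensional utility function and all numerical settings are illustrative stand-ins for a human participant, not part of our experimental implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def utility(z):
    # Illustrative joint utility over a 3-D stimulus space (stands in for the participant).
    target = np.array([0.2, 0.6, 0.8])
    return -20.0 * np.sum((z - target) ** 2)

def gsp_slider_step(z, k, gamma=5.0, n_points=101):
    # One conditional update: discretize the slider for dimension k and sample a
    # location with probability proportional to exp(gamma * utility), as in Eq. (1).
    grid = np.linspace(0.0, 1.0, n_points)
    utilities = np.array([utility(np.concatenate([z[:k], [g], z[k + 1:]])) for g in grid])
    probs = np.exp(gamma * (utilities - utilities.max()))   # subtract max for stability
    probs /= probs.sum()
    z_new = z.copy()
    z_new[k] = rng.choice(grid, p=probs)
    return z_new

def gsp_chain(n_sweeps=50, n_dims=3):
    z = rng.uniform(size=n_dims)           # random starting stimulus
    samples = []
    for _ in range(n_sweeps):
        for k in range(n_dims):            # circular sweep over stimulus dimensions
            z = gsp_slider_step(z, k)
            samples.append(z.copy())
    return np.array(samples)

print(gsp_chain()[-5:])   # late samples cluster around the high-utility region
```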
As with MCMCP, GSP can be used to explore either categorical or continuous semantic representations. In the former case, the experimenter might ask a question like 'adjust the slider until the image most resembles a dog', and the participant's utility function might correspond to the log probability of the image given the class, $\ell(z) = \log p(z \mid C)$; in this case the sampler's stationary distribution will be proportional to $p^{\gamma}(z \mid C)$. In continuous semantic representations, the utility function may not correspond to a probability distribution, and the interpretation would simply be that the sampler explores different regions of the space in proportion to their exponentiated utility $e^{\gamma \ell(z)}$.

The noise parameter $\gamma^{-1}$ is important for the behavior of the sampler. As $\gamma^{-1} \to 0$, the choice distribution becomes increasingly peaked around the highest-utility item on the slider, thus shifting the sampler into an optimization regime (specifically, coordinate descent). Typically we are interested in minimizing $\gamma^{-1}$, so as to maximize the utility of the samples (mode seeking); however, some noise is still desirable because it helps drive exploration of the utility space.

There are two main ways to reduce the effective noise $\gamma^{-1}$. One approach is to estimate the joint distribution $p(z) \propto e^{\gamma \ell(z)}$ by fitting a kernel density estimate (KDE) to the GSP samples, then simulating $\gamma^{-1} \to 0$ by taking the distribution's mode. For simple distributions, this mode can also be estimated by averaging over samples. However, neither KDEs nor averaging are well-suited to complex high-dimensional spaces, where the joint distribution is hard to estimate reliably.

An alternative approach is to manipulate the Gibbs sampler itself. Specifically, suppose that we collect multiple samples from the conditional distribution of a given step of the Gibbs sampler, $p(\text{choose } i) \propto e^{\gamma \ell(z_k^i, z_{-k})}$, and then simulate $\gamma^{-1} \to 0$ by returning the peak of the one-dimensional KDE from these samples (or potentially the sample mean). This will in turn simulate $\gamma^{-1} \to 0$ for the joint distribution $p(z) \propto e^{\gamma \ell(z)}$ as produced by the Gibbs sampler. The practical advantages of this approach are that (a) we restrict density estimation to a more tractable one-dimensional case, and (b) the same stimulus can be re-used for multiple trials, which can be useful when stimuli are slow to create. We explore both KDE and mean aggregation in this paper.
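A minimal sketch of the within-step aggregation described above: several slider responses for the same conditional are collapsed into a single value, either by averaging or by taking the peak of a one-dimensional KDE (here via scipy's gaussian_kde). The function name and the example responses are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def aggregate_slider_responses(responses, method="kde", grid_size=1000):
    """Collapse several slider responses for one Gibbs step into a single value.

    `responses` is a 1-D array of slider positions from different participants
    answering the same conditional question (same fixed dimensions z_{-k}).
    """
    responses = np.asarray(responses, dtype=float)
    if method == "mean":
        return responses.mean()
    # KDE aggregation: fit a one-dimensional kernel density estimate and return
    # its peak, simulating the low-noise (gamma^-1 -> 0) limit of the conditional.
    kde = gaussian_kde(responses)
    grid = np.linspace(responses.min(), responses.max(), grid_size)
    return grid[np.argmax(kde(grid))]

# Example: ten hypothetical slider settings for one dimension of a colour chain.
print(aggregate_slider_responses([0.61, 0.66, 0.64, 0.70, 0.58, 0.63, 0.65, 0.62, 0.67, 0.64]))
```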
There are several ways to assign the iterations of a GSP or MCMCP chain to human participants. In within-participant chains, the entire chain is completed by just one participant, and the resulting samples reflect the semantic representations of that single participant. In across-participant chains, each iteration comes from a different participant, and the samples then reflect shared semantic representations across participants (Fig. S1). While within-participant chains can theoretically be used to study individual differences, here we focus on studying representations at the level of the participant group, using both chain types and averaging over participants where appropriate.

Researchers interpreting MCMCP and GSP results must think carefully both about the definition of the stimulus space and of the participant group. For example, if the stimulus space only includes male voices, the results may not be generalizable to female voices. Similarly, if the participant group comprises solely US participants, then the results may not be generalizable to Japanese participants. Of course, these issues are by no means limited to MCMCP and GSP, but apply rather to the majority of psychological research. We will revisit these matters below.

A related paradigm with a Gibbs sampler interpretation is serial reproduction, where one participant's imitation of a stimulus becomes the next stimulus in a transmission chain [24–30]. However, serial reproduction is limited to percepts that can be entirely reproduced in a single trial (e.g., spoken sentences, [30]). In contrast, GSP participants only ever have to manipulate one stimulus dimension at a time, even if the stimulus itself is high-dimensional. This allows GSP to explore much richer stimulus spaces. A second related paradigm with a Gibbs sampler interpretation is described by [15], studying subjective randomness by having participants impute missing parts of coin-flip sequences. Our approach differs in soliciting continuous rather than discrete judgments. A third related paradigm is the multidimensional method of adjustment, where participants simultaneously adjust multiple sliders to make a stimulus match a certain criterion (e.g., [31]). GSP differs from the latter in providing a principled way to share the task between participants, and a coherent probabilistic model relating slider movements to the utility function.

Colors

Our first study concerns a particularly low-dimensional perceptual space: color. This kind of perceptual space should provide a useful sanity check for any semantic sampling procedure: if a procedure fails here, it is surely even less likely to succeed in high-dimensional perceptual spaces.

We tested our sampling methods on recovering the colors associated with eight words: 'chocolate', 'cloud', 'eggshell', 'grass', 'lavender', 'lemon', 'strawberry', and 'sunset'. We parameterized color space using the perceptually oriented HSL scheme [32], where each color is encoded as three integers: hue, saturation, and lightness, taking values in [0, 360], [0, 100], and [0, 100] respectively.

The first sampling method was MCMCP, implemented with a Gaussian proposal function of standard deviation 30. The second method was standard GSP. The third method was aggregated GSP, collecting 10 slider responses for each iteration, and propagating the mean response to the next iteration. Each method was evaluated using across-participant chains of length 30, with five chains per color category, and with each chain's starting location sampled from a uniform distribution over the color space (Exp. 1a, 1b, 1c). All participants (N = 422) were recruited from Amazon Mechanical Turk (AMT) and pre-screened with a color-blindness test and a color-vocabulary test before continuing with the online experiment (Appendix C). Each participant contributed up to 40 trials for a given method. In each case, the participant was presented with a word (e.g., 'lavender'), and asked to choose the color that best matched that word with either a binary choice interface (MCMCP) or a slider (GSP) (Fig. 1A).
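As an illustration of how a chain state in this parameterization can be rendered for display, the following sketch converts an (H, S, L) coordinate to a hex color string using Python's standard colorsys module (which uses HLS ordering with unit-interval values). This is a toy rendering example; the actual experiment interface is not reproduced here.

```python
import colorsys

def hsl_state_to_hex(hue, saturation, lightness):
    """Render one point of the colour stimulus space (H in [0, 360], S and L in [0, 100])."""
    # colorsys works with HLS ordering and unit-interval values.
    r, g, b = colorsys.hls_to_rgb(hue / 360.0, lightness / 100.0, saturation / 100.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))

# Example: a state that a GSP chain for 'lavender' might pass through.
print(hsl_state_to_hex(270, 45, 75))
```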
There are several ways that one could evaluate the success of an MCMCP or GSP procedure. Here we follow previous work by having participants rate how well samples match the target category [12], but see Appendix C for an alternative analysis. We elicited c. 5.2 ratings per sample from a new participant group (N = 322, Exp. 1d); participants were presented with the target word from the original chain, and asked to judge how well the color matched this word on a scale from 1 (not at all) to 4 (very much). The results indicate a clear advantage for GSP over MCMCP, with GSP converging faster and on higher ratings; they also show that aggregation robustly improved ratings (Fig. 1B, 1C). Inspecting Fig. 1B, it is clear that many MCMCP samples poorly reflected their semantic category; meanwhile, GSP produced considerably fewer poor samples, and aggregated GSP even fewer. Investigating further, we found that the poor performance of MCMCP persisted when (a) normalizing for the longer duration of GSP trials (Fig. S8), (b) trying different proposal widths (Exp. 1e, 153 participants, Fig. S9), (c) using different questions (Exp. 1f, 190 participants, Fig. S10), (d) implementing within-experiment aggregation (Exp. 1g, 1h, 572 participants), (e) implementing post-hoc aggregation (Fig. S12), and (f) accounting for the trade-off between mode-seeking and exploration (Exp. 1i, 270 participants, Fig. S13). The implication is that, when the stimulus space is well-parameterized, GSP substantially improves sampling quality over MCMCP. In addition, it is clear that aggregation improves sampling quality still more, at the cost of additional participant trials.

As an exercise, it is useful to reflect on how the stimulus space and the participants might constrain the generalizability of these results. The stimulus space presents little problem; every visible color has a close neighbor in the HSL scheme used here. However, the results should not be expected to generalize globally, given well-documented cross-cultural variations in color naming [7].

Figure 1: Sampling color representations. A: MCMCP/GSP instructions. B: Generated samples. C: Task and results for the validation experiment (95% confidence intervals over participants).

Additional methods, results, and demographic information for all experiments are provided in the Appendices.

Emotional prosody

This study concerns a long-standing psychological question: how the way that a sentence is spoken (its prosody) communicates the speaker's emotional state [33]. Prior research mostly depends on recordings of actors expressing particular emotions, but such stereotypical recordings might not fully reflect natural emotion perception [34]. GSP provides a way to study prosody perception without actors, instead generating emotional prosody directly from the perceptual judgments of listeners.
We began with three sentences from the Harvard sentence corpus [35] recorded by a female speaker [36], chosen to facilitate comparison with previous research; these sentences are phonologically balanced and semantically neutral. We defined our stimulus space in terms of seven parametric manipulations, corresponding to duration (speeding up or slowing down the fragment), intensity variation (rate and depth), and pitch (absolute level, range, slope, and F0 perturbation). We explored this space using 220 within-participant GSP chains, each comprising 21 iterations, and each beginning with the original unaltered recording (Exp. 2a). Participants (N = 110) were recruited from AMT, pre-screened with the audio test of [37], and each randomly assigned to either 'anger', 'happiness', or 'sadness' (Fig. 2A). Each participant contributed two chains corresponding to different sentences.

Fig. 2B plots mean feature values for the different emotional categories. Sad speech was marked by long duration, reduced pitch range, shallow pitch slope, and high F0 perturbation. Happiness had short duration, increased mean pitch, shallow pitch slope, and high pitch range. Anger had short duration, low mean pitch, falling pitch slope, and high pitch range. These characterizations are generally consistent with previous research (e.g., [38]). We also observed interesting patterns of feature correlations. For example, we found duration and F0 perturbation to be correlated for sadness (r = .28) but not for the other emotions (anger: r = −.03, happiness: r = .00); in contrast we found that pitch level and pitch slope were positively correlated in all three emotions (Fig. S15). This suggests a new way to explore the perceptual spaces of perceived emotions, contrasting with previous literature that mainly focuses on unique contributions of single dimensions.

We then evaluated the resulting samples with a new participant group (N = 161), who rated how well samples matched each emotion on a four-point scale, producing c. 5.4 ratings per stimulus (Exp. 2c). Ratings increased steadily for the first sweep of the parameter vector and then plateaued with a reliable mean contrast of 0.9 points (Fig. 2C). We also replicated the results with across- instead of within-participant chains (Exp. 2b, 2d, 210 participants, Fig. S14A).

These results imply that GSP is effective for exploring emotional prosody, and for generating emotional stimuli without the confounds of acted recordings. Nonetheless, there are clear ways in which this work could be extended. The stimulus space was defined by a limited set of manipulations, such as mean pitch, pitch slope, and F0 perturbation; this set could be extended to include for example spectral features or more granular pitch manipulations [39, 40]. The stimuli all correspond to English sentences, and the participants were all US participants; our results should not be assumed to generalize outside this cultural context [41, 42]. Moreover, all stimuli were synthesized with a female voice, so the results should not be assumed to generalize to male speakers.

Figure 2: Sampling emotional prosody. A: Overview of the GSP task. B: Mean feature values by iteration. C: Mean 'contrast' ratings, corresponding to the mean rating for the 'correct' emotion minus the mean rating for the 'incorrect' emotions (95% confidence intervals over participants).
Musical chords

Our third study concerns the subjective pleasantness of musical pitch combinations, or chords. For Westerners, this domain is highly multimodal, containing many prototypes of 'pleasant' (or 'consonant') chords. Exhaustively exploring this continuous space is difficult for conventional methods, and so far such research has been limited to single pitch intervals or to specific tuning systems [43–46]. Here we investigate whether GSP can help us to characterize the continuous space of pairs of pitch intervals without restricting stimuli to a given tuning system.

Our stimulus space comprised two continuous intervals, specifying the logarithmic distance from the bass tone in the range 0.5–11 [47]. The standard Western tuning system corresponds to integer coordinates in this space. We explored this space with 50 across-participant GSP chains of length 40, whose starting locations were sampled from a uniform distribution over the stimulus space. The participants (N = 134) were recruited from AMT and pre-screened with the audio test of [37] (Exp. 3a). These participants were instructed to make each chord as 'pleasant' as possible (Fig. 3A). In a subsequent validation experiment, participants (N = 168) rated pleasantness for samples from (a) the empirical distribution and (b) the top five modes of KDEs applied to raw samples from iteration 10 onwards (Exp. 3b). Each condition received 662 ratings with up to 80 ratings per participant.

Ratings increased clearly as a function of iteration, with KDE modes scoring significantly higher than raw samples. The KDEs display a rich structure that replicates and extends prior research (Fig. 3B) [43–46]. In particular, the 1D KDE shows clear integer peaks corresponding to the Western tuning system, with dips at the semitone (1) and tritone (6); the 2D KDE additionally shows peaks at various prototypical sonorities from Western music, such as the major triad (4, 7), the first inversion of the major triad (3, 8), and a dominant seventh chord with omitted third (7, 10) (see e.g., [48]; see also Fig. S18). These results imply that GSP is effective for exploring continuous musical spaces.

Our stimulus space only contained three-tone chords, but of course real music contains many different varieties of chords. Our chord tones were synthesized using artificial harmonic complex tones; though such tones are commonly used in prior research [49], real music contains many different kinds of tones, some of which have different consonance profiles [46]. Moreover, our participant group comprised mostly US and Indian participants, yet consonance perception is known to vary cross-culturally [49]. Future work should explore how our results vary as a function of these variables.

Figure 3: GSP over musical chords. A: Schematic illustration of the experimental task. B: KDE over iterations 10 to 39, with density expressed relative to a uniform distribution, the top 15 modes marked by red dots, all plotted alongside the marginal distribution of lower and upper intervals combined. C: Validation ratings by iteration (95% confidence intervals over responses).
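As an illustration of the chord parameterization described above, the following sketch maps a coordinate of the stimulus space (the distances, in semitone units, of the middle and upper tones above the bass) to three fundamental frequencies. The choice of bass pitch and the equal-temperament reference frequency are assumptions made for the example, and the harmonic complex-tone synthesis itself is omitted.

```python
import numpy as np

def chord_frequencies(lower_interval, upper_interval, bass_midi=48):
    """Map a point (lower, upper) of the chord space to three fundamental frequencies.

    Intervals are expressed in semitone units above the bass tone and need not be
    integers; integer coordinates correspond to standard Western tuning.
    The bass pitch (here MIDI 48, i.e. C3) is an illustrative assumption.
    """
    midi = np.array([bass_midi,
                     bass_midi + lower_interval,
                     bass_midi + upper_interval])
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)   # standard MIDI-to-frequency conversion

print(chord_frequencies(4.0, 7.0))   # the major-triad coordinate reported in the text
```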
Faces

Our final study addresses a particularly high-dimensional domain: images of human faces. Such images would be too high-dimensional for GSP to manipulate in their raw form, so we instead parameterize the stimuli with a generative model. State-of-the-art image synthesis models typically still have high-dimensional parameter spaces, but here we build on recent work showing that the latent space of these models can be effectively navigated using principal component analysis (PCA) [50]. Following [50], we apply this approach to the generative adversarial network 'StyleGAN' [51, 52], pretrained on the FFHQ dataset of faces from Flickr [51], applying PCA to the intermediate latent code (termed w in the original papers). We used the top 10 PCA components to parameterize our stimulus space, allowing these components to vary up to two standard deviations from the mean, and fixing the input latent code (z in the original papers) to the mean to control variability.

We used the resulting generative model to explore subjective stereotypes for the following adjectives: 'attractive', 'fun', 'intelligent', 'serious', 'trustworthy', and 'youthful', with these choices informed by prior literature (e.g., [53]). We constructed 18 across-participant GSP chains of length 50 with uniformly sampled starting locations and three chains for each adjective (Fig. 4A, Exp. 4a). We used 293 US participants from AMT, aggregating 5 trials per iteration using the arithmetic mean. We then evaluated the generated samples with a rating experiment, following the same procedure as the color experiment but collecting c. 52.1 ratings per sample from 179 US participants (Exp. 4b).

The results are illustrated in Fig. 4B–C. The GSP chains converged on highly rated samples remarkably quickly, with one full sweep of the 10 dimensions being sufficient to effectively capture the target categories as evaluated by the validation experiment. This implies that GSP can indeed successfully navigate StyleGAN's generative space. Follow-up experiments found similar success with different dimensionality reduction techniques and aggregation methods (Exp. 4c–f, Appendix F).

Samples from the GSP process will inherit certain biases from the StyleGAN model. For example, if male faces are over-represented in StyleGAN samples, they are likely to be over-represented in the GSP samples; likewise, if StyleGAN samples contain predominantly young female faces and old male faces, then GSP samples for 'youthful' are likely to be biased towards female faces. To examine such biases, we conducted a follow-up experiment analyzing the distribution of different features as subjectively rated by online participants (Exp. 4g, Appendix F). The results indicate that StyleGAN's training dataset already contains significant biases that are propagated through the modeling pipeline, and potentially contribute to the prevalence of white faces in the GSP samples, as well as gender associations for the different targets. These findings indicate the importance of interpreting GSP results in the context of their associated generative models, and of sourcing less biased training datasets for future cognitive applications [54]; though StyleGAN's FFHQ dataset may be more diverse than many competing machine-learning datasets, it is clearly not bias-free.

Figure 4: Sampling facial representations. A: Instructions for the GSP task. B: Results of the validation experiment, including final samples for each target adjective (95% confidence intervals over participants). C: Example GSP chain for 'fun', with samples ordered by iteration.
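The following sketch outlines the GANSpace-style parameterization described at the start of this study: PCA is fitted to a sample of intermediate latents w, and a 10-dimensional GSP state (in units of standard deviations, clipped to ±2) is mapped back to a latent vector for synthesis. The `mapping_network` and `synthesize_image` calls are placeholders for the corresponding StyleGAN operations, and the latent dimensionality and sample count are assumptions; this is a sketch of the approach, not our exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_latent_basis(mapping_network, n_samples=100_000, n_components=10, seed=0):
    # Sample many input latents z, push them through StyleGAN's mapping network to obtain
    # intermediate latents w, and fit PCA to get a low-dimensional basis (cf. GANSpace).
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, 512))    # 512-dim z assumed for illustration
    w = mapping_network(z)                       # placeholder for the real StyleGAN call
    pca = PCA(n_components=n_components).fit(w)
    stds = np.sqrt(pca.explained_variance_)
    return pca.mean_, pca.components_, stds

def state_to_image(state, w_mean, components, stds, synthesize_image):
    # A GSP state is a 10-vector in units of standard deviations, clipped to +/- 2 SD.
    state = np.clip(np.asarray(state, dtype=float), -2.0, 2.0)
    w = w_mean + (state * stds) @ components     # move along the principal directions
    return synthesize_image(w)                   # placeholder for StyleGAN synthesis
```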
Our results will also reflect the stereotypes held by our participant group; repeating this method with different participant groups could yield interesting hypotheses concerning how facial stereotypes vary across different demographics and cultures. Appendix F describes initial experiments in this line with participant groups differentiated by gender and location (Exp. 4d, 4i, 4j).

Conclusion

We have presented GSP, a new technique for extracting semantic representations from human participants. GSP organizes these participants into virtual Gibbs samplers, and thereby generates stimuli from the perceptual space associated with a given semantic representation. We have shown how this technique can recover semantic representations for a variety of perceptual domains, including color, emotional prosody, musical chords, and faces. The richness of the derived representations is compelling, and suggests many future applications in the cognitive and social sciences.

GSP has several features that seem to help it converge quickly on high-quality samples. One is its continuous-slider interface, which can deliver much more information per trial than the binary choice method used by MCMCP. A second is its lack of a tuning parameter, which reduces the resources required to develop a workable experiment. A third is the way in which it manipulates a single stimulus dimension at a time: it is plausible that participants find it easier to evaluate differences between stimuli when the stimuli differ on just a single perceptual dimension.

By formulating GSP and MCMCP in utility theory [19], we enable both methods to be applied to continuous as well as categorical semantic representations, while relaxing assumptions about the participant's prior and response noise. By incorporating aggregation into the conditional part of the Gibbs sampler, we increase the participant-to-stimulus ratio and thereby make GSP practical for stimuli that take a long time to generate, with the useful byproduct of averaging out perceptual noise.

The final study showed how GSP can be used to navigate the latent space of deep neural synthesis models. The important prerequisite is finding a relatively low-dimensional basis for the network for GSP to parameterize; fortunately, it seems that relatively simple techniques such as PCA can sometimes suffice for this task [50]. This approach has clear potential for helping cognitive scientists to study semantic representations in high-dimensional perceptual spaces.

Broader Impact
This research extends the methods available to cognitive scientists who seek to characterize semantic representations in human participants. In particular, the proposed method facilitates studying much richer perceptual spaces (both in terms of dimensionality and in terms of granularity) than can be explored effectively with conventional methods.

Our research group is particularly interested in using GSP to study cross-cultural differences in perception [24, 47, 55]. In this context, exploratory techniques such as GSP are particularly useful, because they can generate valuable cognitive insights without specifying a constrained hypothesis space a priori. Previous work using slider interfaces with cross-cultural populations makes us relatively confident that GSP could be applied cross-culturally [56], as long as sufficient care is taken to ensure that the task is understood properly by the participants. Addressing cross-cultural populations in this way can help to ameliorate cognitive science's longstanding bias towards participants from WEIRD (Western, Educated, Industrial, Rich and Democratic) backgrounds [57].

It is important to identify potential pitfalls in applying GSP, especially when such activities have adverse ethical implications. We give three recommendations below for avoiding such mistakes.
Do not conflate subjective judgments with objective truth.
GSP is a tool for understanding participants' subjective notions of particular semantic concepts. It does not necessarily reveal any objective truth about these concepts. This is particularly relevant in examples like our face study, where GSP is used to characterize perceived intelligence and trustworthiness. For example, GSP may suggest that participants associate glasses-wearing with intelligence: this does not mean that wearing glasses makes someone intelligent, or even that glasses-wearing is necessarily associated with intelligence in the real world. Mistakes of this kind have the potential to perpetuate or amplify dangerous stereotypes in society, especially when the inferences concern race/ethnicity and gender; such an approach has a regrettable history in the now-discredited field of physiognomy. Researchers using our method and related psychological methods should be aware of this negative history, and hold their own work to a higher ethical standard to avoid causing similar harm. Consequently, GSP should not be used as a tool for generating training datasets for machine-learning algorithms, or for fine-tuning the parameters or hyperparameters of such algorithms, unless the researcher makes it absolutely clear that the algorithm is being used to study human stereotypes rather than objective truths.
Analyze, report, and ideally avoid potential biases.
Cognitive scientists must always be sensitive to potential biases in designing their stimuli and recruiting their participants. GSP is no exception to this principle. Our studies include examples of relatively simple and unbiased stimulus spaces (HSL colors; musical triads) as well as examples of relatively complex but potentially biased stimulus spaces (recordings of spoken sentences; images generated by the StyleGAN model). For practical reasons, our studies all used participants recruited from AMT; while this platform provides a relatively diverse participant group compared to the common practice of recruiting psychology students, it clearly does not represent the full diversity of the global population [58], and our results are likely to reflect culturally dependent stereotypes as a result (e.g., the Western preference for musical chords with high harmonicity, [49]). It is imperative that cognitive scientists remain vigilant concerning the potential harms of using non-diverse participant groups, both as regards making incorrect scientific conclusions and as regards perpetuating the under-representation of already marginalized parts of society [57]. We discuss these issues on a case-by-case basis above, but future cognitive work using these methods should examine these issues in greater detail. For example, we did not gather detailed personal information about our participants on variables such as race/ethnicity due to privacy reasons, but it is important that future work studying facial stereotypes takes such variables into account.
Validate findings with rigorous hypothesis-driven experiments.
The power of combining GSP with deep generative models (e.g., StyleGAN) is that it enables the researcher to ask exploratory questions about complex naturalistic stimuli, such as 'what do people think a serious face looks like?' However, the downside of this approach is that the technique is susceptible to inheriting hidden biases from the generative model [59]. It is therefore essential that cognitive research combining GSP with deep generative models should treat the results as exploratory, and ideally validate the results with well-controlled experiments that do not rely on the generative model.

Acknowledgments and Disclosure of Funding
The authors are grateful to David Poeppel for general help and support, and to Alec Mitchell, Jesse Snyder, Jordan Suchow, Matthew Wilkes, and Sally Kleinfeldt for their support of the Dallinger project. We would also like to thank Roya Pakzad for advising us on ethical aspects of the project.
References

[1] J. R. Anderson, The adaptive character of thought. Hillsdale, NJ: Lawrence Erlbaum Associates, 1990.
[2] E. Rosch, "Cognitive representations of semantic categories," Journal of Experimental Psychology: General, vol. 104, no. 3, pp. 192–233, 1975.
[3] M. N. Jones, J. Willits, S. Dennis, and M. Jones, Models of semantic memory, pp. 232–254. NY: Oxford University Press, 2015.
[4] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum, "Topics in semantic representation," Psychological Review, vol. 114, no. 2, pp. 211–244, 2007.
[5] G. Mesnil, A. Bordes, J. Weston, G. Chechik, and Y. Bengio, "Learning semantic representations of objects and their parts," Machine Learning, vol. 94, no. 2, pp. 281–301, 2014.
[6] Q. Wang and K. Chen, "Alternative semantic representations for zero-shot human action recognition," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, (New York, NY), pp. 87–102, Springer, 2017.
[7] P. Kay, B. Berlin, L. Maffi, W. R. Merrifield, and R. Cook, The World Color Survey. Stanford, CA: CSLI Publications, 2009.
[8] P. N. Juslin and E. Lindström, "Musical expression of emotions: Modelling listeners' judgements of composed and performed features," Music Analysis, vol. 29, no. 1-3, pp. 334–364, 2010.
[9] A. Sanborn and T. L. Griffiths, "Markov Chain Monte Carlo with People," in Advances in Neural Information Processing Systems, pp. 1265–1272, 2008.
[10] A. N. Sanborn, T. L. Griffiths, and R. M. Shiffrin, "Uncovering mental representations with Markov chain Monte Carlo," Cognitive Psychology, vol. 60, no. 2, pp. 63–106, 2010.
[11] A. N. Sanborn and T. L. Griffiths, "Exploring the structure of mental representations by implementing computer algorithms with people," in Cognitive Modeling in Perception and Memory: A Festschrift for Richard M. Shiffrin, pp. 212–228, Abingdon, UK: Taylor and Francis, 2015.
[12] J. B. Martin, T. L. Griffiths, and A. N. Sanborn, "Testing the efficiency of Markov chain Monte Carlo with people using facial affect categories," Cognitive Science, vol. 36, no. 1, pp. 150–162, 2012.
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, eds., Markov Chain Monte Carlo in Practice. London, UK: Chapman & Hall, 1996.
[14] A. Hsu, J. Martin, A. Sanborn, and T. Griffiths, "Identifying representations of categories of discrete items using Markov chain Monte Carlo with People," in Proceedings of the Annual Meeting of the Cognitive Science Society, 2012.
[15] T. L. Griffiths, D. Daniels, J. L. Austerweil, and J. B. Tenenbaum, "Subjective randomness as statistical inference," Cognitive Psychology, vol. 103, pp. 85–109, 2018.
[16] A. A. Barker, "Monte Carlo calculations of the radial distribution functions for a proton-electron plasma," Australian Journal of Physics, vol. 18, pp. 119–133, 1965.
[17] N. Vulkan, "An economist's perspective on probability matching," Journal of Economic Surveys, vol. 14, no. 1, pp. 101–118, 2000.
[18] D. R. Shanks, R. J. Tunney, and J. D. McCarthy, "A re-examination of probability matching and rational choice," Journal of Behavioral Decision Making, vol. 15, no. 3, pp. 233–250, 2002.
[19] D. McFadden, "Conditional logit analysis of qualitative choice behaviour," in Frontiers in Econometrics (P. Zarembka, ed.), pp. 105–142, New York, NY: Academic Press, 1974.
[20] Y. Weiss, E. P. Simoncelli, and E. H. Adelson, "Motion illusions as optimal percepts," Nature Neuroscience, vol. 5, no. 6, pp. 598–604, 2002.
[21] X.-X. Wei and A. A. Stocker, "A Bayesian observer model constrained by efficient coding can explain 'anti-Bayesian' percepts," Nature Neuroscience, vol. 18, pp. 1509–1517, 2015.
[22] A. N. Sanborn, T. L. Griffiths, and D. J. Navarro, "A more rational model of categorization," in Proceedings of the 28th Annual Conference of the Cognitive Science Society (R. Sun and N. Miyake, eds.), pp. 726–731, Cognitive Science Society, 2006.
[23] A. E. Gelfand and A. F. M. Smith, "Sampling-based approaches to calculating marginal densities," Journal of the American Statistical Association, vol. 85, no. 410, pp. 398–409, 1990.
[24] N. Jacoby and J. H. McDermott, "Integer ratio priors on musical rhythm revealed cross-culturally by iterated reproduction," Current Biology, vol. 27, no. 3, pp. 359–370, 2017.
[25] V. Kempe, N. Gauvrit, and D. Forsyth, "Structure emerges faster during cultural transmission in children than in adults," Cognition, vol. 136, pp. 247–254, 2015.
[26] J. Xu and T. L. Griffiths, "A rational analysis of the effects of memory biases on serial reproduction," Cognitive Psychology, vol. 60, no. 2, pp. 107–126, 2010.
[27] S. Kirby, H. Cornish, and K. Smith, "Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language," Proceedings of the National Academy of Sciences, vol. 105, no. 31, pp. 10681–10686, 2008.
[28] T. Verhoef, S. Kirby, and B. De Boer, "Emergence of combinatorial structure and economy through iterated learning with continuous acoustic signals," Journal of Phonetics, vol. 43, pp. 57–68, 2014.
[29] P. Edmiston, M. Perlman, and G. Lupyan, "Repeated imitation makes human vocalizations more word-like," Proceedings of the Royal Society B: Biological Sciences, vol. 285, no. 1874, 2018.
[30] B. Braun, G. Kochanski, E. Grabe, and B. S. Rosner, "Evidence for attractors in English intonation," The Journal of the Acoustical Society of America, vol. 119, no. 6, pp. 4006–4015, 2006.
[31] A. M. Grimaud, T. Eerola, and N. Collins, "EmoteControl: A system for live-manipulation of emotional cues in music," in Proceedings of the 14th International Audio Mostly Conference: A Journey in Sound, pp. 111–115, 2019.
[32] G. H. Joblove and D. Greenberg, "Color spaces for computer graphics," in Proceedings of the 5th Annual Conference on Computer Graphics and Interactive Techniques, pp. 20–25, 1978.
[33] R. Banse and K. R. Scherer, "Acoustic profiles in vocal emotion expression," Journal of Personality and Social Psychology, vol. 70, no. 3, pp. 614–636, 1996.
[34] T. Bänziger, G. Hosoya, and K. R. Scherer, "Path models of vocal emotion communication," PLOS ONE, vol. 10, no. 9, 2015.
[35] "IEEE recommended practice for speech quality measurements," tech. rep., IEEE, 1969. ISBN: 9781504402743.
[36] P. Demonte, "HARVARD corpus speech shaped noise and speech-modulated noise for SIN test," 2019. Publisher: University of Salford.
[37] K. J. Woods, M. H. Siegel, J. Traer, and J. H. McDermott, "Headphone screening to facilitate web-based auditory experiments," Attention, Perception, & Psychophysics, vol. 79, no. 7, pp. 2064–2072, 2017.
[38] P. Laukka, H. A. Elfenbein, N. S. Thingujam, T. Rockstuhl, F. K. Iraki, W. Chui, and J. Althoff, "The expression and recognition of emotions in the voice across five nations: A lens model analysis based on acoustic features," Journal of Personality and Social Psychology, vol. 111, no. 5, 2016.
[39] P. N. Juslin and P. Laukka, "Communication of emotions in vocal expression and music performance: Different channels, same code?," Psychological Bulletin, vol. 129, no. 5, pp. 770–814, 2003.
[40] K. R. Scherer, "Acoustic patterning of emotion vocalizations," in The Oxford Handbook of Voice Perception (S. Frühholz and P. Belin, eds.), 2019.
[41] P. Laukka and H. A. Elfenbein, "Cross-cultural emotion recognition and in-group advantage in vocal expression: A meta-analysis," Emotion Review, 2020.
[42] S. Paulmann and A. K. Uskul, "Cross-cultural emotional prosody recognition: Evidence from Chinese and British listeners," Cognition & Emotion, vol. 28, no. 2, pp. 230–244, 2014.
[43] P. M. C. Harrison and M. T. Pearce, "Simultaneous consonance in music perception and composition," Psychological Review, vol. 127, pp. 216–244, 2020.
[44] I. Lahdelma and T. Eerola, "Cultural familiarity and musical expertise impact the pleasantness of consonance/dissonance but not its perceived tension," Scientific Reports, vol. 10, 2020.
[45] R. Plomp and W. J. M. Levelt, "Tonal consonance and critical bandwidth," The Journal of the Acoustical Society of America, vol. 38, pp. 548–560, 1965.
[46] W. A. Sethares, Tuning, timbre, spectrum, scale. London, UK: Springer, 2005.
[47] N. Jacoby, E. A. Undurraga, M. J. McPherson, J. Valdés, T. Ossandón, and J. H. McDermott, "Universal and non-universal features of musical pitch perception revealed by singing," Current Biology, vol. 29, no. 19, pp. 3229–3243, 2019.
[48] D. L. Bowling, D. Purves, and K. Z. Gill, "Vocal similarity predicts the relative attraction of musical chords," Proceedings of the National Academy of Sciences, vol. 115, no. 1, pp. 216–221, 2018.
[49] J. H. McDermott, A. F. Schultz, E. A. Undurraga, and R. A. Godoy, "Indifference to dissonance in native Amazonians reveals cultural variation in music perception," Nature, vol. 535, no. 7613, pp. 547–550, 2016.
[50] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, "GANSpace: Discovering interpretable GAN controls," arXiv, 2020.
[51] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4401–4410, 2019.
[52] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," arXiv, 2019.
[53] L. Brinkman, A. Todorov, and R. Dotsch, "Visualising mental representations: A primer on noise-based reverse correlation in social psychology," European Review of Social Psychology, vol. 28, no. 1, pp. 333–361, 2017.
[54] M. K. Scheuerman, K. Wade, C. Lustig, and J. R. Brubaker, "How we've taught algorithms to see identity: Constructing race and gender in image databases for facial analysis," Proceedings of the ACM on Human-Computer Interaction, vol. 4, 2020.
[55] N. Jacoby, E. H. Margulis, M. Clayton, E. Hannon, H. Honing, J. Iversen, T. R. Klein, S. A. Mehr, L. Pearson, I. Peretz, M. Perlman, R. Polak, A. Ravignani, P. E. Savage, G. Steingo, C. J. Stevens, L. Trainor, S. Trehub, M. Veal, and M. Wald-Fuhrmann, "Cross-cultural work in music cognition: Challenges, insights, and recommendations," Music Perception, vol. 37, no. 3, pp. 185–195, 2020.
[56] B. Sievers, L. Polansky, M. Casey, and T. Wheatley, "Music and movement share a dynamic structure that supports universal expressions of emotion," Proceedings of the National Academy of Sciences, vol. 110, no. 1, pp. 70–75, 2013.
[57] J. Henrich, S. J. Heine, and A. Norenzayan, "Most people are not WEIRD," Nature, vol. 466, no. 7302, p. 29, 2010.
[58] D. Difallah, E. Filatova, and P. Ipeirotis, "Demographics and dynamics of Mechanical Turk workers," in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 135–143, 2018.
[59] A. Grover, K. Choi, R. Shu, and S. Ermon, "Fair generative modeling via weak supervision," arXiv, 2019.

Appendix A Mathematical framework

In this section we present derivations of the formulas presented in the theoretical exposition of the main text. We start from a derivation of the acceptance function of MCMCP based on utility theory, which we then generalize to GSP.
A.1 MCMCP
To analyze the decision step of MCMCP, let us imagine that a participant is presented with two alternatives, $s_1$ and $s_2$, from which they are asked to choose according to some criterion $C$. In the context of utility theory, we suppose that the participant performs this task by extracting a utility value for each stimulus, and selecting the stimulus with the maximum utility. The utility value has two components, a deterministic component and a noise component, namely $U_i = \ell_i + n_i$, where $\ell_i$ is the deterministic utility of stimulus $i$, and $n_i$ is the associated noise. This noise component can capture intra-participant noise from sensory [1, 2] and cognitive [2, 3] processes, as well as inter-participant noise corresponding to individual differences in the utility function [4]. In the current derivation, we assume that the noises are i.i.d. and that they are Gumbel distributed (also known as type I extreme value), $n_i \sim \mathrm{Gumbel}(\mu, \gamma^{-1})$. Gumbel distributions are commonly used in discrete choice models because they approximate Gaussian noise while possessing useful analytic properties [5]. From here, the probability of choosing, say, $s_1$ would be

$$p(\text{choose } s_1) = p(U_1 > U_2) = p(n_2 - n_1 < \ell_1 - \ell_2). \quad (2)$$

To proceed, we recall the following useful property of Gumbel distributions: let $X_1 \sim \mathrm{Gumbel}(\mu_1, \gamma^{-1})$ and $X_2 \sim \mathrm{Gumbel}(\mu_2, \gamma^{-1})$ be two independent variables; then the difference is logistically distributed, namely, $X_1 - X_2 \sim \mathrm{Logistic}(\mu_1 - \mu_2, \gamma^{-1})$. Thus, we see that the right hand side of (2) is simply the cumulative distribution function of the logistic distribution $\mathrm{Logistic}(0, \gamma^{-1})$, from which it follows that

$$p(\text{choose } s_1) = \frac{1}{1 + e^{-\gamma(\ell_1 - \ell_2)}} \quad (3)$$

which is the desired result (see Section 2.1 in the main paper).
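As a quick numerical check of this derivation, the following sketch simulates the noisy-utility choice between two stimuli with i.i.d. Gumbel noise and compares the empirical choice rate with the logistic expression in Equation (3); the utility values and noise scale are arbitrary illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, l1, l2 = 2.0, 0.7, 0.2          # illustrative scale and deterministic utilities
n = 200_000

# Noisy utilities with i.i.d. Gumbel(mu, 1/gamma) noise; mu cancels in the comparison.
u1 = l1 + rng.gumbel(scale=1.0 / gamma, size=n)
u2 = l2 + rng.gumbel(scale=1.0 / gamma, size=n)

empirical = np.mean(u1 > u2)
analytic = 1.0 / (1.0 + np.exp(-gamma * (l1 - l2)))
print(empirical, analytic)             # the two values agree closely
```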
A.2 GSP

Let us now generalize the analysis of MCMCP to GSP. Recall that in the GSP step, a participant is presented with a slider that is associated with an active dimension, say $z_k$, from which they are asked to select a value. To analyze the decision step, let us discretize the slider into a set of points $\{z_k^i\}_i$, and let $z_{-k}$ denote the other fixed dimensions. Similar to the MCMCP case, we assume that the participant extracts a utility value for each stimulus along the slider, namely, $U_i = \ell(z_k^i, z_{-k}) + n_i$, with the noise being i.i.d. and Gumbel distributed, $n_i \sim \mathrm{Gumbel}(\mu, \gamma^{-1})$, and we assume that they choose the alternative with the highest utility. Such a choice model is known in the literature as the multinomial logit [5]. For completeness, let us derive the formula for the probability of choosing the alternative $z_k^i$. We have

$$p(z_k^i \mid z_{-k}) = p\Big(\bigcap_{j \neq i} U_i > U_j\Big) = \int_{-\infty}^{\infty} d\epsilon\, p\Big(\bigcap_{j \neq i} n_j < \ell_i - \ell_j + n_i \,\Big|\, n_i = \epsilon\Big)\, p(n_i = \epsilon)$$
$$= \int_{-\infty}^{\infty} d\epsilon \prod_{j \neq i} p(n_j < \ell_i - \ell_j + \epsilon)\, p(n_i = \epsilon)$$
$$= \gamma \int_{-\infty}^{\infty} d\epsilon\, \exp\Big\{ -\Big(\sum_j e^{-\gamma(\ell_i - \ell_j)}\Big) e^{-\gamma(\epsilon - \mu)} - \gamma(\epsilon - \mu) \Big\} = \frac{e^{\gamma \ell_i}}{\sum_j e^{\gamma \ell_j}}$$

where the last equality follows after substituting $\epsilon' = \gamma(\epsilon - \mu)$ and noticing that the sum over exponentiated utility differences is a positive number, so that the integral identity $\int_{-\infty}^{\infty} dx\, \exp\{-\lambda e^{-x} - x\} = 1/\lambda$ holds. Thus, substituting $\ell_i = \ell(z_k^i, z_{-k})$ we arrive at the desired equation, that is, Equation (1) in the main paper. Notice also that in the case of two alternatives, this derivation recovers the acceptance function of MCMCP.

Both derivations of the MCMCP and GSP choice probabilities relied on two main assumptions regarding the nature of the noise: (a) it is i.i.d., and (b) it is Gumbel distributed. Starting from the latter, notice that the derivation of the GSP choice probability makes it clear how to generalize to other types of noise. Indeed, up to (and including) the third equality, we relied only on the i.i.d. nature of the noise. Moreover, the third equality provides a prescription for how to generalize: for a given choice of noise model, simply plug in the right cumulative distribution function and probability density function of that model. Thus, for Gaussian noise, for example, that is, $n_i \sim \mathcal{N}(\mu, \sigma)$, we have

$$p(z_k^i \mid z_{-k}) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} d\epsilon \prod_{j \neq i} \Phi\left(\frac{\ell_i - \ell_j + \epsilon}{\sigma}\right) e^{-\frac{\epsilon^2}{2\sigma^2}} \quad (4)$$

where $\Phi$ is the normal cumulative distribution function. This is known as the independent probit model [5]. Of course, unlike the Gumbel case, in the Gaussian case this does not result in a closed-form formula. This, however, does not prevent the GSP process from exploring the utility terrain of the model, given the functional similarity between Gumbel and normal distributions.
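Because Equation (4) has no closed form, the following sketch evaluates it by simple quadrature on a grid, confirming that the resulting choice probabilities sum to one and favour higher-utility options; the utilities and noise scale are illustrative.

```python
import numpy as np
from scipy.stats import norm

def probit_choice_probs(utilities, sigma=0.5, grid_half_width=10.0, n_grid=4001):
    """Evaluate Eq. (4) numerically for independent Gaussian utility noise."""
    utilities = np.asarray(utilities, dtype=float)
    eps = np.linspace(-grid_half_width * sigma, grid_half_width * sigma, n_grid)
    probs = []
    for i, l_i in enumerate(utilities):
        others = np.delete(utilities, i)
        # Product over j != i of Phi((l_i - l_j + eps) / sigma), times the noise density.
        cdf_prod = np.prod(norm.cdf((l_i - others[:, None] + eps[None, :]) / sigma), axis=0)
        integrand = cdf_prod * norm.pdf(eps, scale=sigma)
        probs.append(np.trapz(integrand, eps))
    return np.array(probs)

p = probit_choice_probs([0.9, 0.5, 0.1])
print(p, p.sum())   # probabilities favour higher-utility options and sum to ~1
```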
In across-participant chains, each participant only contributes one observation to the chain, so intra-participant correlation never manifests. In within-participant chains, all observations in a given chain come from the same participant, meaning that the i.i.d. assumption remains unviolated, and each chain ends up approximating the underlying utility function for the individual participant. The population-level utility function can then be approximated by aggregating over chains.

Neighboring-point correlation could have a subtle effect on the derivations presented here. Future work could revise our model to include a correlation structure for these points, for example following the correlated multinomial probit model, where noise values are taken from a joint Gaussian distribution with a specified correlation structure [5].

Figure S1: Illustration of different chain designs (within-participants, across-participants, and across-participants with aggregation).
Figure S2: Computational infrastructure used for data collection.

Our derivation also does not take into account context effects, whereby the participant's previous trials influence their responses to the present trial. In particular, it is possible that the utility value for a given stimulus changes when the participant has already experienced that stimulus multiple times. This possibility is particularly high in within-participant chains, where the same participant experiences many stimuli from adjacent steps in the Gibbs sampler; in contrast, across-participant chains mostly avoid this effect by preventing the participant from experiencing multiple stimuli from the same chain (Fig. S1). We tested the strength of this effect in the emotional prosody experiment, conducting both a within-participants and an across-participants version of the same paradigm. We found that the results did not differ materially between the two, implying that memory effects were not a significant problem for this paradigm.

The above GSP derivation also assumes that the participant visits all slider positions. If this assumption is violated, the denominator in the GSP choice probability would cover only a subset of the locations, effectively reducing the granularity of the slider. We can try to minimize this effect experimentally, by forcing the participant to explore a certain amount of the stimulus space before proceeding to the next trial, but it is often impractical to enforce exhaustive exploration. However, we expect that the consequences of this assumption violation are not severe, for two reasons: (a) participants tend to focus on the parts of the slider that contain most of the utility/probability mass, and (b) participants can extrapolate between slider locations to estimate the utility values of intermediate points. Nonetheless, we would like to explore this assumption more in future work.
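For illustration, the following sketch simulates noisy slider responses under the Gumbel model and compares the resulting choice frequencies with the softmax conditional of Equation (1); the utility function, $\gamma$, and the 25-point slider discretization are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def gsp_conditional(utilities, gamma):
    """Equation (1): softmax of the utilities along the discretized slider."""
    e = np.exp(gamma * (utilities - utilities.max()))   # subtract max for numerical stability
    return e / e.sum()

def simulate_slider_choices(utilities, gamma, n_trials=200_000):
    """Empirical choice frequencies when each slider position receives i.i.d. Gumbel noise."""
    noise = rng.gumbel(scale=1.0 / gamma, size=(n_trials, utilities.size))
    picks = np.argmax(utilities + noise, axis=1)
    return np.bincount(picks, minlength=utilities.size) / n_trials

slider = np.linspace(0.0, 1.0, 25)        # 25 slider positions (as in the prosody trials)
utilities = -(slider - 0.7) ** 2          # illustrative unimodal utility along the active dimension
print(np.round(gsp_conditional(utilities, gamma=20.0), 3))
print(np.round(simulate_slider_choices(utilities, gamma=20.0), 3))
```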
Appendix B General methods
B.1 Implementation
We implemented all experiments in PsyNet, our framework (currently under development) for implementing complex experimental paradigms such as GSP and MCMCP. This framework builds on the Dallinger platform for experiment hosting and deployment (https://github.com/Dallinger/Dallinger). Participants interact with the experiment via their web browser, which communicates with a back-end Python server cluster responsible for organizing the timeline of the experiment (Fig. S2). This cluster is mostly managed by Heroku, and comprises a customizable collection of virtual instances that share the experiment management and stimulus generation workload, as well as an encrypted Postgres database instance for storing results. In some experiments we additionally used Amazon Web Services (AWS; https://aws.amazon.com/) S3 storage for hosting stimuli, and an AWS Elastic Compute Cloud (EC2) instance with an NVIDIA K80 GPU for deep neural network synthesis. Code for the implemented experiments can be found at https://doi.org/10.17605/OSF.IO/RZK4S.

B.2 Participants

All participants provided informed consent in accordance with the Max Planck Society Ethics Council approved protocol (application 2018-38). All participants were recruited from Amazon Mechanical Turk (AMT), an online service for crowd-sourcing workers for online tasks. The only universal constraints we placed on recruitment were that participants must be at least 18 years of age and have a 95% or higher approval rate on previous tasks on AMT; this approval criterion is meant to help recruit reliable participants. In some experiments we also constrained the workers to be US residents.

When designing each experiment, we gave each component an estimate of the average time it should take to complete; participants were then paid at a rate of US $9/hour according to how much of the experiment they completed. Importantly, participants were still paid a proportional amount even if they left the experiment early on account of failing a pre-screening task.

A total of 5,178 participants took part in the 25 experiments reported in this paper, excluding those who failed pre-screening tests. For the participants who reported demographic information, self-reported ages ranged from 18 to 89 (M = 35.25, SD = 10.37), and 35.74% identified as female (63.9% male and 0.36% other).

These participants may be further differentiated into two groups: those who participated in the main experiments and those who participated in the validation experiments. These two groups had similar compositions, with the main participant group (Table S1) comprising 2,967 participants (35.95% female, 63.64% male, 0.41% other; ages 18–89, M = 35.25, SD = 10.42), and the validation group (Table S2) comprising 2,211 participants (35.22% female, 64.53% male, 0.25% other; ages 18–74, M = 35.25, SD = 10.24).

The musical chord study also collected additional information about musical expertise. Participants in the main chord experiment reported 0–25 (Med = 2, M = 4.26, SD = 6.23) years of musical experience (i.e., playing an instrument or singing), whereas participants in the corresponding validation experiment reported 0–64 (Med = 2, M = 4.39, SD = 7.57) years of musical experience.

Participant recruitment was managed by PsyNet. For the across-participant chain experiments, we specified a desired number of chains and a desired length for these chains, and participants were then automatically recruited until the chains reached their desired lengths.
For the within-participant chain experiments, we specified a desired number of completed participant sessions, and recruitment continued until this threshold was met. For the rating experiments, we chose a desired number of ratings per experimental condition such that we expected any variation in the resulting condition means to primarily reflect the stochasticity of the original sampler rather than the stochasticity of participant raters. (An experimental condition typically corresponded to one point on a figure, for example the third iteration for the second 'lavender' GSP chain.) As a rule of thumb, we aimed for approximately 150 participants per validation study, scaling this number accordingly when the validation study compared multiple methods. Participants were then automatically recruited until the minimum number of ratings per experimental condition was reached.

Age and gender distributions were computed from all participants who passed the pre-screening tasks, excluding the validation experiments for emotional prosody, for which demographic information was not collected. Participant numbers only include participants who contributed at least one valid trial to the main experiment.

Table S1: Main experiments.

Experiment | Method | Rep. | Dim. | Iter. | Agg. | Chain type | N | Pre-screening | US-only | Validated in
1a Color (MCMCP) | MCMCP | 8 | 3 | 30 | 1 | Across | 57 | CB, CV | No | Exp. 1d, 1h
1b Color (GSP) | GSP | 8 | 3 | 30 | 1 | Across | 53 | CB, CV | No | Exp. 1d, 1h
1c Color (agg. GSP) | GSP | 8 | 3 | 30 | 10 | Across | 312 | CB, CV | No | Exp. 1d, 1h
1e Color (MCMCP proposal) | MCMCP | 8 | 3 | 30 | 1 | Across | 153 | CB, CV | No | -
1f Color (questions) | GSP/MCMCP | 8 | 3 | 30 | 1 | Across | 190 | CB, CV | No | -
1g Color (agg. MCMCP) | MCMCP | 8 | 3 | 30 | 10 | Across | 302 | CB, CV | No | Exp. 1h
2a Prosody (within) | GSP | 3 | 7 | 21 | 1 | Within | 110 | Audio | Yes | Exp. 2c, 2d
2b Prosody (across) | GSP | 3 | 7 | 20 | 1 | Across | 57 | Audio | Yes | Exp. 2d
3a Musical chords | GSP | 1 | 2 | 40 | 1 | Across | 134 | Audio | No | Exp. 3b
4a Faces | GSP | 6 | 10 | 50 | 5 | Across | 293 | CV | Yes | Exp. 4b, 4d, 4g
4c Faces (KDE) | GSP | 6 | 10 | 50 | 5 | Across | 278 | CV | Yes | Exp. 4d
4e Faces (basis) | GSP | 1 | 10 | 30 | 5 | Across | 167 | CV | Yes | Exp. 4f
4h Faces (art) | GSP | 6 | 10 | 50 | 5 | Across | 260 | CV | Yes | -
4i Faces (global, KDE) | GSP | 6 | 10 | 50 | 5 | Across | 269 | CV | No | Exp. 4d
4j Faces (dating) | GSP | 1 | 10 | 30 | 5 | Across | 332 | CV | Yes | -

Note. 'Rep.' indicates the number of semantic representations that were tested; 'Dim.' indicates the dimensionality of the stimulus space; 'Iter.' indicates the number of iterations in each chain; 'Agg.' indicates how many participants contributed to each iteration of the GSP chain; 'N' denotes the number of participants included in the final analysis; 'CB' denotes the color blindness pre-screening task; 'CV' denotes the color vocabulary pre-screening task; 'US-only' indicates whether the participant group was restricted to US residents; 'Exp.' denotes 'Experiment'.

Table S2: Validation experiments.

Experiment | Ratings per participant | Ratings per stimulus | Total stimuli | N | Pre-screening | US-only | Validating
1d Color (original) | 60 | 5.2 | 3,720 | 322 | CB, CV | No | Exp. 1a, 1b, 1c
1h Color (inc. agg. MCMCP) | 60 | 3.3 | 4,960 | 270 | CB, CV | No | Exp. 1a, 1b, 1c, 1g
1i Color (uniform sample) | 60 | 4.2 | 4,000 | 280 | CB, CV | No | Exp. 1a, 1b, 1c, 1g
2c Prosody | 147 | 5.4 | 4,383 | 161 | Audio | Yes | Exp. 2a
2d Prosody | 132 | 4.1 | 4,874 | 153 | Audio | Yes | Exp. 2a, 2b
3b Musical chords | 80 | 16.4 | 820 | 168 | Audio | Yes | Exp. 3a
4b Faces (original) | 80 | 52.1 | 275 | 179 | CV | Yes | Exp. 4a
4d Faces (aggregation, location) | 80 | 25.6 | 815 | 261 | CV | No | Exp. 4a, 4c, 4i
4f Faces (basis) | 59.9 | 4.3 | 260 | 131 | CV | No | Exp. 4e
4g Faces (bias) | 78.9 | 3.2 | 7,056 | 286 | CV | Yes | Exp. 4a

Note. 'N' denotes the number of participants included in the analysis; 'CB' denotes the color blindness pre-screening task; 'CV' denotes the color vocabulary pre-screening task; 'US-only' indicates whether the participant group was restricted to US residents; 'Exp.' denotes 'Experiment'. In the row corresponding to Exp. 4g, the number of stimuli (7,056) corresponds to 7 (the number of questions) multiplied by 1,008 (the number of images).

Figure S3: Example trial from the color blindness pre-screening task.

B.3 Pre-screening tests
A useful technique for improving the quality of data from online participants is to implement pre-screening tests designed to screen out participants likely to deliver low-quality data [6]. Here we used three pre-screening tests in various combinations: a color blindness test, a color vocabulary test, and an audio test. These tests are primarily intended to screen out participants who do not meet certain explicit criteria, such as wearing headphones, but they also help to screen out participants who lack a minimal degree of English comprehension, as well as automated scripts ('bots') masquerading as participants [7].

The color blindness test was derived from the well-known Ishihara color blindness test [8]. Here participants had to respond to six trials where the task was to transcribe a number from an image, with the contrast of the image designed such that the task is difficult to perform if the participant suffers from color perception deficiencies (see Fig. S3 for an example trial). The image was set to disappear after three seconds to encourage quick responses. To pass, participants had to answer at least four of these six trials correctly.

The color vocabulary test was constructed by taking six English color words that require a relatively good vocabulary to understand: 'turquoise', 'magenta', 'granite', 'ivory', 'maroon', and 'navy'. None of these words were used in the other experiments. We associated each word with an RGB definition sourced from Wikipedia; in each of six trials, the participant was shown one of these colors and had to choose which of the six words corresponded to that color (see Fig. S4). The pass threshold was a score of four out of six.

The audio pre-screening task, originally developed in [6], was intended to ensure that participants were wearing headphones and could perceive subtle sound differences. The task has participants perform a three-alternative forced-choice task to identify the quietest of three tones. These tones are constructed to elicit a phase cancellation effect, such that when played over loudspeakers the order of quietness changes, causing the participant to respond incorrectly. Each participant took six such trials; to pass, they had to answer at least four of these six trials correctly.

Figure S4: Example trial from the color vocabulary pre-screening task.

Figure S5: Example trial from the audio pre-screening task.
B.4 Performance incentives
In order to further improve data quality, some of our experiments (specifically, all but the emotional prosody experiments) additionally included a financial incentive for participants to provide high-quality data. Prior to the main part of the experiment, we informed all participants of this incentive using the following text:

"The quality of your responses will be automatically monitored, and you will receive a bonus at the end of the experiment in proportion to your quality score. The best way to achieve a high score is to concentrate and give each trial your best attempt."

We purposefully left the definition of 'quality' vague, so as to avoid encouraging participants to 'game' a particular aspect of response quality. Of course, our tasks were subjective, and so there was no meaningful way to define a high-quality answer a priori. Instead, our approach was to use consistency as a proxy for quality; the rationale is that a participant who takes the task seriously and carefully is likely to deliver consistent responses when administered the same trial multiple times, in contrast to a participant who does not pay attention to the task and simply answers randomly.

We estimated consistency as follows. Once a participant finished all of their 'main' experiment trials, they then received a small number (4–8, depending on the experiment) of trials that repeated randomly selected trials from the earlier part of the experiment. The data from these trials contributed solely to consistency estimation, not to chain construction. In GSP trials and four-point rating trials, consistency was quantified by taking the Spearman correlation between the two sets of answers; for MCMCP trials, consistency was quantified by taking the percentage agreement between the two sets of answers. Participants were then given a small monetary bonus in proportion to the resulting consistency score, ranging from zero dollars for chance performance up to one dollar for perfectly consistent performance.
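For illustration, this consistency scoring can be sketched as follows; the exact mapping from consistency scores to dollar amounts shown here (linear, clipped at chance level) is an illustrative choice rather than the experiment's exact formula, and the function names are ours.

```python
import numpy as np
from scipy.stats import spearmanr

def consistency_bonus_gsp(original, repeated, max_bonus=1.0):
    """Spearman correlation between original and repeated slider/rating responses,
    mapped to a bonus between $0 (chance, rho <= 0) and $max_bonus (rho = 1)."""
    rho, _ = spearmanr(original, repeated)
    if np.isnan(rho):
        rho = 0.0
    return max_bonus * max(rho, 0.0)

def consistency_bonus_mcmcp(original, repeated, max_bonus=1.0):
    """Percentage agreement between two sets of binary choices, rescaled so that
    chance agreement (50%) maps to $0 and perfect agreement to $max_bonus."""
    agreement = np.mean(np.asarray(original) == np.asarray(repeated))
    return max_bonus * max(2 * agreement - 1, 0.0)

print(consistency_bonus_gsp([0.1, 0.5, 0.9, 0.3], [0.3, 0.4, 0.7, 0.2]))   # 0.8
print(consistency_bonus_mcmcp([0, 1, 1, 0], [0, 1, 0, 0]))                 # 0.5
```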
B.5 Chain construction
In all experiments except the emotional prosody experiment, we randomized the starting locations of each chain by randomly sampling from a uniform distribution over the range of permissible feature values. In the case of emotional prosody, we found this randomization problematic because it often led to unrealistic parts of the stimulus space. In this case we therefore initialized each chain at a 'null' state corresponding to the unaltered reference sentence.

In a given trial of a GSP experiment, the participant's slider manipulated exactly one dimension of the stimulus. To counteract any potential biases towards left or right slider directions, we randomized the effective direction of the slider on each trial, such that approximately half of the time the right of the slider corresponded to positive feature values, and the other half of the time it corresponded to negative feature values.

Our experiments implemented both within-participant and across-participant chains (Fig. S1). In within-participant chains, the entire chain is completed by just one participant, and the resulting samples reflect the semantic representations of that single participant. In across-participant chains, each iteration comes from a different participant, and the samples then reflect shared semantic representations across participants.

Across-participant chains are more complex to implement because of the interaction between multiple participants. Each time a participant is ready to take a new trial, it is necessary to scan the different chains in the experiment and identify one that satisfies the following conditions:

1. The chain is not full (i.e., it has not reached its specified quota of iterations);
2. The participant has not already participated in that chain;
3. No other participants have been assigned to that particular iteration of the chain.

The last point – ensuring that multiple participants are not assigned to the same iteration of a chain – is important for the efficiency of data collection, but it can cause problems when a participant claims a particular iteration of the chain and then drops out of the experiment, potentially blocking any future additions to that chain. We therefore implemented a time-out parameter for this experiment, set to 60 seconds, after which the participant's pending trial was invalidated and the chain was unblocked.

In within-participant chains we are free to discard all of a participant's data when they drop out of an experiment partway through. This is not practical in across-participant chains, however, where many subsequent participants might have built on the data previously contributed by this participant. In the latter case, we therefore retain the participant's contributions even when they drop out of the experiment.
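For illustration, the chain-assignment rules and the 60-second time-out can be expressed as a simple in-memory procedure; the actual implementation lives in PsyNet/Dallinger and operates on a Postgres database, so the data structures and function below are illustrative only.

```python
import time

def find_available_chain(chains, participant_id, now=None, timeout=60.0):
    """Return a chain that the given participant may extend, or None.

    Each chain is represented here as a dict with keys:
      'iterations_done' -- number of completed iterations,
      'quota'           -- target chain length,
      'participant_ids' -- set of participants who already contributed,
      'pending'         -- None, or (participant_id, claim_time) for a trial in progress.
    """
    now = time.time() if now is None else now
    for chain in chains:
        # Unblock a chain whose pending trial has timed out (participant dropped out).
        if chain["pending"] is not None and now - chain["pending"][1] > timeout:
            chain["pending"] = None
        if chain["iterations_done"] >= chain["quota"]:      # condition 1: chain is full
            continue
        if participant_id in chain["participant_ids"]:      # condition 2: already contributed
            continue
        if chain["pending"] is not None:                    # condition 3: iteration already claimed
            continue
        chain["pending"] = (participant_id, now)            # claim this iteration
        return chain
    return None
```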
Appendix C Color
C.1 Supplementary methods
We chose eight words, designed to be moderately but not overly familiar to English speakers, that we anticipated would evoke strong color associations. These words were 'chocolate', 'cloud', 'eggshell', 'grass', 'lavender', 'lemon', 'strawberry', and 'sunset'. We then explored the perceptual spaces associated with these eight words using GSP and MCMCP.

We implemented GSP and MCMCP using the Hue, Saturation, Lightness (HSL) color space. We chose this color space over the Red, Green, Blue (RGB) color space because it is generally considered to better reflect how humans perceive color relationships. In this space, each color is encoded as three integers: hue, saturation, and lightness, taking values in [0, 360), [0, 100], and [0, 100] respectively.

MCMCP relies on the specification of a proposal function. In our main experiment, we used a Gaussian distribution with a standard deviation of 30, chosen on the basis of internal piloting, rounding samples to the nearest integer values. In MCMCP the proposal distribution should be symmetric, which can be problematic to satisfy when the sampler reaches the boundaries of the sample space. We addressed this problem by computing the proposal distribution modulo the scale range, such that moving past the top of the scale means returning to the bottom of the scale. This works particularly well for hue, which is already defined as a circular space, with both 0 and 360 corresponding to the color red. It works less well for saturation and lightness, because these are linear scales; however, as we chose target colors occupying central regions of these two scales, we expected that these boundary effects would not materially influence MCMCP's performance.

We implemented simple web interfaces for the MCMCP and GSP tasks. In the MCMCP task, participants were presented with pairs of colors, and had to choose which color best represented a target word (Fig. S6). In the GSP task, participants were presented with a single color that constantly updated to reflect the current position of a slider; participants were then instructed to move the slider to make the color represent a target word as well as possible (Fig. S7).

Table S3: Questions used in Exp. 1f.

Label | MCMCP | GSP
Probability | Choose which colour is most likely to come from the following category. | Adjust the slider to make the color as likely as possible to come from the following category.
Goodness | Choose which colour best matches the following word. | Adjust the slider to match the following word as well as possible.
Typicality | Choose which colour is most typical of the following category. | Adjust the slider to make the color as typical as possible for the following category.

Note. All color experiments except for Exp. 1f used only the 'Goodness' question; Exp. 1f tested all three questions.

For each given sampling method we constructed five across-participant chains per adjective, yielding 40 chains in total. Each chain was filled to a length of 30 states, not including the initial random state. Each participant contributed a maximum of 40 trials to the chains for a given sampling method (Exp. 1a–c; see Table S1 for participant numbers).

The aggregated GSP experiment combined 10 trials for each iteration of the Gibbs sampler (Exp. 1c). These 10 trials were combined using the arithmetic mean in the case of saturation and lightness, and the circular mean in the case of hue. These means were then propagated to the next iteration of the Gibbs sampler.

We also ran five follow-up experiments to better understand the relative performance of GSP and MCMCP (see also Tables S1 and S2):
• We tested MCMCP with five different proposal function standard deviations: 10, 20, 30, 40, and 50, all expressed on the integer color scale (Exp. 1e).
• We reran the MCMCP and GSP experiments with three different kinds of questions designed to probe different notions of utility and category membership (Table S3, Exp. 1f).
• We reran the MCMCP experiment using 10-fold aggregation (Exp. 1g), and validated it alongside the other methods (Exp. 1h).
• We tested 4,000 colors randomly sampled from a uniform distribution over the HSL space using the same rating procedure as Exp. 1d (Exp. 1i).

The validation experiment (Exp. 1d) used the same pre-screening procedure as the chain-construction experiments. A minimum of five ratings were collected for each sample generated in the former experiments, with the constraint that participants could not rate the same stimulus more than once. Participants were assigned pseudo-randomly to stimuli such that the number of ratings accumulated evenly for each stimulus. In each trial, participants were presented with the target word from the original chain, and asked to judge how well the colour matched this word on a scale from 1 (not at all) to 4 (very much). A given participant's ratings were only included in the final tallies if they completed the entire validation experiment. See Table S2 for participant numbers.
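For illustration, the wrap-around hue proposal and the circular-mean aggregation described above can be sketched as follows; the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def propose_hue(current_hue, sd=30):
    """Gaussian proposal computed modulo the hue range, so that moving past 360 wraps to 0."""
    return int(round(current_hue + rng.normal(0, sd))) % 360

def aggregate_hue(hues_deg):
    """Circular mean of hue responses in degrees (used for the 10-fold aggregated GSP chains)."""
    radians = np.deg2rad(hues_deg)
    mean_angle = np.arctan2(np.mean(np.sin(radians)), np.mean(np.cos(radians)))
    return np.rad2deg(mean_angle) % 360

print(propose_hue(350, sd=30))        # may wrap past 360 back to small hue values
print(aggregate_hue([350, 10, 20]))   # ~6.7, not the misleading arithmetic mean of ~126.7
```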
C.2 Supplementary results
In the main paper we identified a clear advantage for GSP over MCMCP, given chains of the same length and the same amount of aggregation. However, we were concerned about several possible confounds, which we will now discuss alongside corresponding analyses.
Claim:
GSP trials are more time-consuming than MCMCP trials. Even if GSP requires fewer trials to achieve good sample quality, if these trials take much longer, then GSP will end up being practically slower than MCMCP.
Fig. S8 plots validation ratings for the different sampling methods, with the horizontal axis now corresponding to the total participant time invested in the respective chains (Exp. 1a–c). We estimated total participant time by taking the iteration number and multiplying it by the median participant time spent on the two different trial types. The results indicate that non-aggregated GSP still clearly outperformed MCMCP despite the longer duration of its individual trials. It is difficult to make a clear statement about the relative performance of aggregated GSP because its profile overlaps minimally with the other two methods; however, the figure implies that non-aggregated GSP outperforms aggregated GSP for the first few iterations, with aggregated GSP then overtaking at a later point. This is consistent with our expectations: the fast-but-noisy non-aggregated GSP can quickly escape its low-probability starting states, but the same noise prevents it from converging as precisely as aggregated GSP in later iterations.

Figure S6: Screenshot from the color MCMCP implementation.

Figure S7: Screenshot from the color GSP implementation.

Figure S8: Mean sample ratings as a function of the participant time invested in chain construction (Exp. 1a–c, 1d), with time plotted on a log scale (95% confidence intervals over participants).
Claim:
MCMCP has a tuning parameter corresponding to the width of the proposal function. Perhaps the relatively poor performance of MCMCP was simply due to the wrong choice of proposal width.
We tabulated samples from the previously described control experiment with different MCMCP proposal widths (Fig. S9, Exp. 1e). There appears to be little difference in overall sample quality across the different proposal widths. As expected, we see that the MCMCP chains with the smallest proposal width (10) only make local adjustments to the color, meaning that once the chain gets close to an appropriate color category, it can be carefully tweaked to resemble this category as well as possible. However, these narrow-proposal chains often fail to approach the appropriate color category in the first place, even after 30 iterations. In contrast, the wide-proposal chains explore the color space quickly, but are unable to make subtle adjustments to match specific categories. The moderate proposal width of 30 provides some compromise between these two behaviors, and seems to be a sensible choice for the MCMCP-GSP comparison.
Claim:
We altered the MCMCP question somewhat to better represent the notion of continuous utility as opposed to category membership. Perhaps this alteration diminished the efficacy of MCMCP in practice.
We tabulated samples from Exp. 1f, which trialled different types of questions for the MCMCP and GSP tasks (Table S3, Fig. S10). Visually inspecting these plots, we struggled to discern any systematic effect of question type on the sample distributions. We do not doubt that subtle differences could be distinguished with the right kind of experiment, but it seems that in practice any such effects are small.
Claim:
We only evaluated MCMCP without aggregation; perhaps MCMCP with aggregation would perform as well as GSP.
We compared validation ratings for aggregated MCMCP against ratings for aggregated GSP, non-aggregated GSP, and non-aggregated MCMCP (Fig. S11). Aggregated MCMCP does outperform non-aggregated MCMCP, but the difference is small compared to the difference between GSP and aggregated GSP. This makes intuitive sense: while aggregated GSP can produce very precise updates at each iteration, aggregated MCMCP can only provide one bit of information at each iteration, placing a fundamental limit on its convergence rate.

Figure S9: Raw color samples for MCMCP with five different standard deviations for the Gaussian proposal function: 10, 20, 30, 40, and 50 (Exp. 1e).
Claim:
A common analysis approach with MCMCP is to generate category prototypes by averaging over many samples. Perhaps MCMCP performs better when using this analysis method.
We recomputed the samples generated by the three methods using instead an incremental aggregation process, generating a summary sample for each iteration and target word by averaging all previous samples from all chains for that word, with iterations 1–6 treated as burn-in samples and hence discarded. The resulting samples are displayed in Fig. S12. The aggregation process clearly improves sample quality for MCMCP and GSP (non-aggregated), but it does not fully solve MCMCP's problem with poor sample quality. Though we do not have participant rating data for these aggregated samples, it is clear that MCMCP failed to converge on appropriate colors for chocolate, eggshell, and lavender. One might further criticize the lavender samples for being too red, the strawberry and sunset samples for not being red enough, and the lemon samples for being not yellow enough. It seems apparent that aggregating over trials does not necessarily resolve the performance issues of MCMCP in this paradigm.
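For illustration, this incremental aggregation can be sketched as follows; it is a simplified reimplementation with placeholder data, and the circular treatment of hue shown earlier is omitted for brevity.

```python
import numpy as np

def incremental_prototypes(samples, burn_in=6):
    """For each iteration t > burn_in, average all samples from iterations burn_in+1..t
    across all chains for a given target word.

    `samples` has shape (n_chains, n_iterations, n_dims)."""
    n_chains, n_iterations, n_dims = samples.shape
    kept = samples[:, burn_in:, :]                 # discard burn-in iterations
    prototypes = []
    for t in range(kept.shape[1]):
        window = kept[:, : t + 1, :].reshape(-1, n_dims)
        prototypes.append(window.mean(axis=0))     # average over chains and iterations so far
    return np.array(prototypes)

chains = np.random.default_rng(3).uniform(0, 100, size=(5, 30, 3))   # placeholder chain data
print(incremental_prototypes(chains).shape)                           # (24, 3)
```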
Claim:
Our original evaluation rewards methods that produce highly prototypical category exemplars; using our utility function metaphor, one might say that the evaluation rewards mode-seeking behavior. However, there is a trade-off between mode seeking and exploration; perhaps GSP is better at mode seeking, but MCMCP is better at exploration.
We estimated a benchmark utility distribution over the stimulus space for each target word using a large-scale rating experiment (Exp. 1i), and then compared the results to the utility distributions estimated by MCMCP and GSP. To provide a visual intuition for the differences between techniques, Fig. S13 plots marginal distributions for hue as estimated by the rating, MCMCP, and GSP experiments, using a generalized additive model for the ratings and a kernel density estimator (KDE) for the MCMCP and GSP distributions, and again treating iterations 1–6 as burn-in samples. Only 'grass', 'lavender', 'lemon', and 'strawberry' are plotted here, because these are the four words with the most interpretable marginals for hue (the remaining adjectives have many very dark or very light samples, in which case differences in hue become imperceptible). From comparing GSP and MCMCP to the ratings, it is apparent that the poor performance of MCMCP is not simply due to having broader peaks, but rather comes from mislocated secondary peaks, for example red for 'grass', orange for 'lavender', blue for 'lemon', and so on. Visually inspecting the raw samples in Fig. 1B supports this impression: many of the MCMCP samples seem to be unrelated to the target category. Incidentally, the figure also helps to visualize the effect of aggregation; we see how aggregation sharpens the GSP peaks into clear unimodal distributions, but fails to provide much improvement for MCMCP. Future work should investigate these differences more systematically, using quantitative assessments of multidimensional distribution similarity and unpacking the potentially non-trivial relationship between sample ratings and utility values.

Figure S10: Raw color samples for MCMCP and GSP with three different kinds of questions as described in Table S3 (Exp. 1f).

Figure S11: Validation results for non-aggregated and aggregated GSP and MCMCP (Exp. 1a, 1b, 1c, 1g, and 1h). The shaded regions indicate 95% confidence intervals over participants.

Figure S12: Colors derived by averaging raw samples from iteration 7 onwards for the different sampling methods (Exp. 1a–c).

Figure S13: Utility distributions as estimated by rating, MCMCP, and GSP experiments, treating iterations 1–6 as burn-in samples (Exp. 1a, 1b, 1c, 1g, 1i).

To summarize, it seems that none of these six considerations substantially affects the main conclusion that GSP outperforms MCMCP for this color estimation task. Nonetheless, each of these issues could certainly be explored in more detail in future work; each perceptual domain is different, and in some cases MCMCP may become the preferred tool as a result.
Appendix D Emotional prosody
D.1 Stimuli
The stimuli were created on the basis of three sentences from the Harvard sentences [9] recorded by a female speaker [10]. These sentences are phonologically balanced and semantically neutral. The stimulus space was then defined through seven continuous acoustic manipulations performed on these sentences. The manipulations were performed using the software Praat [11] and the Python package Parselmouth [12]. Pitch (F0 contour) was extracted using a pitch floor of 100 Hz and a ceiling of 500 Hz (default window size) using the command To Pitch in Parselmouth. Before proceeding we confirmed that all contours were free of any octave jumps. From the Sound and the Pitch objects, we created a Manipulation object using the command To Manipulation. From the Pitch object we extracted the glottal pulses using To PointProcess. The manipulations were then performed in the following order:

1. Pitch level, shifting the pitch contour by a value in the range [−37, 37] Hz.

2. Pitch range, scaling the original pitch range (expressed in Hz) by a value in the range [20, 180]%, using the middle of the original pitch range as the center of the scaling operation.

3. Pitch slope, altering the original sentence's pitch slope by a value in the range [−37, 37] Hz. In our case, the reference sentences always began with a falling slope, and our manipulation was never severe enough to change them to a rising slope. Instead, a positive value of our pitch slope feature indicates a flattened contour, and a negative value indicates a steeply falling contour. We manipulated pitch slope in the following way. We extracted the times of the first ($t_1$) and last ($t_2$) pitch values (ignoring unvoiced segments), and then edited the pitch contour by adding the following linear function $f(t)$ to each pitch value:
$$f(t) = x \cdot \frac{t - t_1}{t_2 - t_1} \qquad (5)$$
where $t$ denotes the time of the point being edited and $x$ denotes the feature value, ranging between −37 Hz and 37 Hz. We achieved this by creating an empty PitchTier object and populating it with the new contour using the command Add point. Finally we replaced the old PitchTier in the Manipulation object with the new one using Replace pitch tier.

4. F0 perturbation is commonly measured as local frequency variation in the F0 contour (jitter), and corresponds approximately to the perceptual impression of hoarseness [13]. We modified F0 perturbation by converting the PointProcess object (representing the glottal pulses) to a Praat Matrix object (representing the time points of the pulses) using To Matrix. We changed the position of the pulses by applying the Praat formula self + randomGauss(0, r), where r was a number between 0 and 0.0001 determining the strength of the perturbation. The Matrix was converted back to a PointProcess with To PointProcess, and the glottal pulses in the Manipulation were replaced using Replace pulses. This follows the algorithm proposed in [14].

5. Duration, allowed to change linearly from 80% to 120% of the original duration. To manipulate the duration we created an empty DurationTier object using the command Create DurationTier. At time 0 we placed a point with the duration value using the command Add point 0 scalar. We then ran Replace duration tier to apply the changes. Note that changing the duration did not affect the overall pitch.

6. Intensity variation, corresponding to a periodic amplitude modulation of the signal. This manipulation was characterized by two parameters which constituted two independent dimensions of the stimulus space: amplitude modulation frequency (ranging from 0–5 Hz) and amplitude modulation depth (ranging from 0.01–10 dB). We implemented this using the operation 'Vibrato and tremolo' as defined in [14] and implemented in Parselmouth.
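For illustration, the duration manipulation (step 5) can be scripted through Parselmouth's generic Praat call interface as follows; the file name and scale value are placeholders, and the remaining manipulations follow the same pattern with the commands listed above.

```python
import parselmouth
from parselmouth.praat import call

def scale_duration(wav_path, duration_scale):
    """Resynthesize a sentence with its duration scaled by `duration_scale` (0.8-1.2 above),
    leaving the pitch contour unchanged."""
    snd = parselmouth.Sound(wav_path)
    manipulation = call(snd, "To Manipulation", 0.01, 100, 500)   # time step, pitch floor/ceiling (Hz)
    tier = call("Create DurationTier", "scaled", snd.xmin, snd.xmax)
    call(tier, "Add point", 0, duration_scale)                    # the 'Add point 0 scalar' step
    call([manipulation, tier], "Replace duration tier")
    return call(manipulation, "Get resynthesis (overlap-add)")

# Example usage (hypothetical file name):
# resynth = scale_duration("reference_sentence.wav", 1.2)
# resynth.save("stimulus.wav", "WAV")
```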
D.2 Procedure
The main chain-construction experiment (Exp. 2a) assigned each participant to one of three different emotions: happiness, sadness, and anger. To ensure that all participants were familiar with the emotional concept, we presented contexts that had been used in previous studies on emotional prosody [15, 16]:
Anger:
Please think of a situation where you experienced a demeaning offense against you and yours. For example, somebody behaves rudely toward you and hinders you from achieving a valued goal. The situation is unexpected and unpleasant, but you have the power to retaliate.
Happiness:
Please think of a situation where you made reasonable progress toward the realization of a goal. For example, you have succeeded in achieving a valued goal. Your success may be due to your own actions, or somebody else's, but the situation is pleasant and you feel active and in control.
Sadness:
Please think of a situation where you experienced an irrevocable loss. For example, you lose someone or something very valuable to you, and you have no way of getting back what you want.

Each participant was randomly assigned to a different emotion (happy, sad, angry). After the headphone-screening task and a short demographic questionnaire, they took a practice trial to familiarize themselves with the slider. They then completed two within-participant chains, corresponding to three alterations of each of the seven dimensions, alternating between the two chains until both were complete. Each chain was initialized with the feature values of the reference sentence. In each trial the participant could choose from 25 stimuli synthesized from 25 equidistant points on the slider.

In the validation experiment (Exp. 2c), each participant rated stimuli for the three emotion words (happiness, sadness, anger) in three corresponding randomly ordered blocks. Each block contained 49 stimuli, which came in four types: (a) raw samples from the GSP chains, (b) samples derived by averaging the last three iterations of the GSP chains, (c) the initial unchanged sentences, and (d) samples corresponding to random feature values. Participants were presented with the same emotional contexts as the participants in the chain-construction experiment, and responded using the same four-point scale as the other experiments ('1. Not at all', '2. A little', '3. Quite a lot', '4. Very much').

We also conducted a control experiment where we switched from within-participant chains to across-participant chains, reducing the number of participants by approximately half because the original experiment proved to have more than sufficient power, and leaving all other experiment parameters unchanged (Exp. 2b). Note that reducing the number of participants should not bias the validation ratings, which only used raw samples rather than samples created by aggregating over participants. Due to a minor implementation error, this experiment only constructed chains of length 20 rather than of length 21.

In the subsequent validation component (Exp. 2d), participants rated three blocks of 44 stimuli: 20 samples from the original within-participant chains, 20 samples from the new across-participant chains, 3 random samples, and one initial unchanged sentence. In all other regards this second validation was identical to the first validation.
D.3 Supplementary results
Fig. S14A shows the results of the within- and across-participant comparison (Exp. 2a, 2b). The resulting feature values are broadly similar between these two experiments, suggesting that memory effects did not substantively contaminate our within-participant chains. This conclusion is supported by the validation experiment, which shows similar contrast scores for both within- and across-participant chains (Fig. S14B, Exp. 2d).

Fig. S14C shows how mean feature values develop over the course of the within-participant experiment (Exp. 2a). Here we can see how most of the development of the feature values occurs over the first sweep of the feature vector (iterations 1–7), after which point the feature values stay broadly similar. Three audio examples from the same sentences in the final iteration of this experiment can be found at https://doi.org/10.17605/OSF.IO/RZK4S in the folder sound-examples-prosody with the file names sad|happy|angry_final_sentence.wav, alongside, for reference, the initial stimulus original_sentence.wav.

GSP also allows us to investigate higher-order structure in perceptual representations. As an illustrative analysis, Fig. S15 plots pairwise correlations for different features in the generated samples (Exp. 2a). For example, we see that duration and F0 perturbation were significantly correlated for sadness (r = .28) but not for the other emotions (anger: r = −.03, happiness: r = .00); in contrast, we see that pitch level and pitch slope were positively correlated for all three emotions. These kinds of higher-order analyses provide a more expressive perspective on prosody features than previous research, which mainly focuses on the independent contributions of single features rather than interactions between features. We intend to explore these kinds of interactions more in future research.

Figure S14: A: Average parameter settings for across- and within-participant chains in iteration 20 (Exp. 2a, 2b, 95% confidence intervals over chains). B: Mean validation contrast for different iterations (Exp. 2d, 95% confidence intervals over participants). Contrast is defined as the difference between the rating for the target emotion and the mean rating for the non-target emotions. C: Average parameter settings for all iterations in within-participant chains (Exp. 2a).

Figure S15: Pearson correlations between parameters in all three emotions (Exp. 2a).

Appendix E Musical chords
E.1 Supplementary methods
This study applied GSP to the perceived pleasantness of musical chords. Each of these chords comprises three tones, and is hence termed a triad. We represented each triad as a pair of numbers, following the 'pitch chord type' representation of [17], which represents each chord tone as a pitch interval in semitones from the bass (i.e., lowest) tone. This representation captures the sense in which human pitch perception is relative (i.e., pitches are heard relative to their recent auditory context) and logarithmic (i.e., perceived pitch distance is approximately proportional to the difference in the logarithm of the frequencies) [18]. Integer values in this representation correspond to the standard 12-tone equal-tempered tuning system of Western music.

We generated chords using Tone.js (https://tonejs.github.io/), a JavaScript library for synthesizing sounds in the client's browser. Each triad was synthesized as three simultaneous complex tones comprising 10 harmonics with amplitudes scaled by 12 dB/octave. These complex tones were presented with an ADSR envelope comprising a linear attack portion of 200 ms and a maximum amplitude of 1.0, an exponential decay portion lasting 100 ms taking the amplitude to 0.8, and a final exponential decay release portion lasting 1 s. The pitch of the bass tone was sampled uniformly and continuously in the logarithmic range G3–F4 (i.e., 196–349 Hz). The other two tones were specified by two continuous intervals in the range [0.5, 11] semitones, with the limits chosen such that the unison (0) and the octave (12) were excluded, to prevent duplicating the pitch class of the bass tone (two tones are said to share the same pitch class if they are separated by an integer multiple of 12 semitones, i.e., an octave). We did, however, allow the two non-bass tones to overlap.

In each trial of the main experiment (Exp. 3a), participants were presented with the following prompt: 'Adjust the slider to match the following word as well as possible: pleasant'. Releasing the slider […]

[…] the ADPclust R package [20] with the number of clusters set a priori to 20, keeping the five resulting centroids with the highest kernel density. In total, this resulted in 20 experimental conditions comprising 820 stimuli.

In each trial of the validation experiment, the participant was assigned to a randomly chosen stimulus from one of the conditions, and was asked to rate how pleasant that stimulus was on a four-level scale: 'Not at all', 'A little', 'Quite a lot', and 'Very much'. Overall we collected 662 ratings for each experimental condition, with each participant contributing up to 80 ratings.

We should note that while in Gibbs sampling it is customary to consider samples in jumps of full coordinate sweeps, here we decided to aggregate data continuously, given the inherent symmetry between the two intervals, so as to improve the quality of the estimated modes. We further exploited that symmetry by folding the data along the x = y line, since reordering a pair of intervals does not alter the generated chord. Fig. S17 shows the raw data, where the x = y symmetry is clearly apparent, and Fig. S16 shows the folded distribution after re-ordering the two intervals.

Figure S16: Kernel density estimates generating the four sets of KDE modes considered in the validation experiment for musical triads (Exp. 3b). The top five modes are indicated in red. The density values are computed relative to a uniform distribution.
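For illustration, the fold-and-estimate step can be sketched as follows; whereas the mode extraction above uses a KDE together with the ADPclust R package, this simplified analogue uses scipy's Gaussian KDE on a grid and simply reports the highest-density grid points without a proper clustering step (the data and bandwidth are placeholders).

```python
import numpy as np
from scipy.stats import gaussian_kde

def folded_kde_modes(lower, upper, n_modes=5, grid_points=200):
    """Fold interval pairs along the x = y line (chord identity is order-invariant),
    fit a 2D Gaussian KDE, and return the highest-density grid points as rough modes.
    (Adjacent grid points are not deduplicated, unlike a proper clustering step.)"""
    pairs = np.sort(np.column_stack([lower, upper]), axis=1)   # fold: smaller interval first
    kde = gaussian_kde(pairs.T)
    grid = np.linspace(0.5, 11, grid_points)
    xx, yy = np.meshgrid(grid, grid)
    mask = xx <= yy                                            # only the folded half-plane
    points = np.vstack([xx[mask], yy[mask]])
    density = kde(points)
    top = np.argsort(density)[::-1][:n_modes]
    return points[:, top].T                                    # (n_modes, 2) interval pairs

rng = np.random.default_rng(4)
lower = rng.normal(4, 0.3, 500)   # illustrative data clustered near a major third...
upper = rng.normal(7, 0.3, 500)   # ...and a perfect fifth above the bass
print(np.round(folded_kde_modes(lower, upper), 2))
```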
Figure S18: Combined marginal distributions for the two intervals, using a sliding window of length 20 (Exp. 3a).

E.2 Supplementary results

In the main paper, we mostly discussed the structure and validation of features and raw samples aggregated across various chains and iterations. Fig. S17 complements this perspective by presenting the trajectory of a typical chain (Exp. 3a). It is clear that the dynamics are far from an optimization regime, where one would expect to see small and converging updates toward some local optimum (e.g., [21]). Instead, the trajectories illustrate the sampling regime of GSP, characterized by big leaps and lack of convergence, scanning the various regions of the space.

Fig. S18 shows the behavior of the combined marginal distributions for the two intervals, computed over a sliding window of length 20 with Gaussian kernels (bandwidth = 0.175 semitones, Exp. 3a). We see that two strong modes emerge at the perfect fifth (7) and the major third (4), alongside other peaks at integers and dips at the semitone (1) and tritone (6), reflecting the standard Western tonal hierarchy [22].

Audio samples of the top 15 KDE modes extracted from iterations 0 (random) and 10–39 can be found at https://doi.org/10.17605/OSF.IO/RZK4S in the folder sound-examples-musical-triads, with the modes arranged in descending order of density (random_seed_top_modes_iter_0_0.wav and top_modes_iter_10_39.wav).

Appendix F Faces
F.1 Supplementary methods
This study used the 'StyleGAN' model of [23, 24] pretrained on the FFHQ dataset of faces from Flickr [23]. This model is a generative adversarial network, comprising a latent vector $z$ sampled from a probability distribution $p(z)$, an input layer that takes a constant input $y_0$, and the other layers $y_i$ taking the previous layer and a non-linear function of $z$ as input:
$$y_i = G_i(y_{i-1}, w), \qquad w = M(z), \qquad (6)$$
where $M$ is an 8-layer multilayer perceptron and the output layer $y_L$ corresponds to an RGB image.

The study depended on participants interactively manipulating principal components of the $w$ vector using a slider. We achieved this by creating an API that took as input a random seed for the latent vector $z$, a vector of principal component values for $w$, and the index of the principal component to be manipulated by the slider. The API then returned a video where the active principal component was incrementally modified through a specified number of standard deviations about the mean, with this API building on code released in [25]. The resulting video was then streamed to the participant's local computer, with the slider selecting between different frames of the video. We hosted the API on an AWS EC2 instance fitted with an NVIDIA K80 GPU.

An important technical issue concerned ensuring that participants didn't have to wait for the relatively slow stimulus generation process. We therefore generated stimuli asynchronously in advance of a given experimental trial, with participants being randomly assigned to the pool of currently available stimuli for each trial. Aggregating multiple responses per step of the GSP process helped in this regard, meaning that a higher throughput of participants could be sustained for a given rate of stimulus production.

The main experiment (Exp. 4a) evaluated six adjectives which we thought could elicit meaningful perceptual associations: 'attractive', 'fun', 'intelligent', 'serious', 'trustworthy', and 'youthful', with these choices informed by prior literature (e.g., [26]). Three across-participant chains were constructed for each of these adjectives, each of length 50 plus the initial random state, resulting in a total of 18 chains. Each step in the chain received five responses from five different participants, which were then aggregated using the arithmetic mean.

Participants were recruited from AMT as before, with the stipulation that they be resident in the US. All participants were pre-screened with the color vocabulary task used previously for the color experiment. After completing a short demographic questionnaire, they took six practice trials to familiarize themselves with the task, then proceeded to the main experiment, where they completed up to 18 trials (one from each chain).

The validation experiment (Exp. 4b) recruited participants in the same manner, and had the participants rate all generated samples from iterations 1–10, 20, 30, 40, and 50. Each participant contributed 80 ratings, under the constraint that they never rated the same sample twice, and with participants being assigned to stimuli such that the number of ratings accumulated equally across stimuli. Data collection was continued until all samples had been rated at least 50 times.

We additionally conducted several follow-up GSP experiments to explore the paradigm further, described below and in Table S1:
1. We tested an alternative aggregation approach, where we summarized the five responses for each item with a KDE (Gaussian kernel, standard deviation of 0.5 in units of PCA standard deviations), and took the mode of the resulting distribution (Exp. 4c, Fig. S20).

2. We tested a small number of alternative methods for constructing a basis for the stimulus space (Exp. 4e, Fig. S24). In addition to the original PCA, we tested sparse PCA using a sparsity parameter of 1.0 (see the alpha parameter of SparsePCA from the scikit-learn package) and independent component analysis (ICA). We also tested the effect of retaining dimensions 71–80 instead of dimensions 1–10 of the PCA solution. In this experiment we only used the adjective 'attractive', and reduced the chain length to 30 iterations. For comparability with the original results of Exp. 4a, all chains were initialized to the same random seeds as in the original experiment.

3. We reran the original experiment but with the StyleGAN model pretrained on a dataset of faces from WikiArt (https://github.com/ak9250/stylegan-art), to illustrate the dataset-dependence of the results (Exp. 4h, Fig. S29).

4. We reran the original experiment using KDE modes and relaxing participant recruitment to accept both US and non-US participants (Exp. 4i, Fig. S20). The resulting participant group was dominated by Indian (c. 50%) participants but also included a high proportion of US participants (c. 40%).

5. We reran the original experiment but asked participants to adjust the slider to 'find the person that you would most like to date', assigning self-reported male and female participants to separate chains so that we could perform a group-difference analysis (Exp. 4j, Fig. S23).

We additionally ran several rating experiments to complement these GSP experiments (Table S2). Exp. 4d collected ratings for the KDE mode experiment (Exp. 4c) and the global participant group experiment (Exp. 4i), as well as collecting ratings for the original experiment (Exp. 4a), with otherwise the same design as the original validation experiment (Exp. 4b), including the US-only criterion (Fig. S20). Exp. 4f used the same approach to collect ratings for the basis experiment (Exp. 4e, Fig. S24).

Exp. 4g used a similar design to investigate biases at different stages of the modeling pipeline (Fig. S25, S26, S27, S28). Stimuli were sourced from several stages of this pipeline: (a) random samples from the StyleGAN's FFHQ training dataset (N = 300); (b) random samples from the StyleGAN model (N = 300); (c) random samples from the StyleGAN model, but only allowing the top 10 principal components to vary (N = 300); and (d) samples from iterations 0, 10, 20, 30, 40, and 50 of the GSP processes from Exp. 4a (N = 108). Instead of asking participants to rate how well the images matched the GSP adjectives, we instead asked participants to answer questions from the following list:

1. What is the gender of the person in the image?
2. Is the person in the image of white ethnicity?
3. Is the person in the image smiling?
4. Is the person in the image wearing a hat?
5. Is the person in the image wearing formal clothes?
6. Is the person in the image wearing glasses?

In each case, the participant was presented with three options: "Male"/"Female"/"Other" in the case of gender, and "Yes"/"No"/"Don't know" in the other cases. We also asked participants to estimate the age in years of the person depicted in the image.
We had two kinds of motivations for choosing these particular evaluations. We chose gender, age, and ethnicity because these are criteria according to which many people experience bias in the real world, and we wanted to understand how these variables were treated by the modeling pipeline. We chose the other four evaluations because they are examples of easily quantified features that seem likely to influence judgments made about the person. Of course, it should be acknowledged that some of these variables are impossible to determine definitively from an image; for example, it is a substantial simplification to treat gender and ethnicity in a categorical way. However, we anticipated that this simplification would be necessary to make the task understandable to the participants, and that the resulting data would nonetheless be informative about the kinds of biases present in the modeling pipeline.

F.2 Supplementary results

Raw samples from the main experiment.
Example raw samples and validation results from the main experiments (Exp. 4a, 4b) are shown in Fig. 4 of the main paper. Fig. S19 illustrates the raw samples in more detail, displaying iterations 0–10, 20, 30, 40, 50 from one chain for each target word. It is clear from both the validation results and the raw samples that the chains make clear progress towards the target category already by the end of the first sweep (10 iterations), and sometimes the resemblance to the target category does not improve noticeably after this point. However, this does not mean that the process converges to a static image after this point: instead, there is a moderate amount of variety in the subsequent faces (see also Fig. S21 and S22). The process is therefore still somewhat in the stochastic sampling regime rather than the deterministic optimization regime.
Validation results for follow-up experiments.
Fig. S20 plots validation results for Exp. 4a (meanaggregation, US-only participants), Exp. 4c (aggregation with KDE modes, US-only participants), andExp. 4i (aggregation with KDE modes, global participants), as collected in Exp. 4d. The broad trendsin the ratings are replicated across the three experiments: typically almost all of the improvementcomes in iterations 1–10, with ratings staying mostly stable after this point. All three experimentsstruggle to capture trustworthiness, which is clearly a particularly subjective judgment to make.Interestingly, there is no evidence that KDE peak-picking outperforms the arithmetic mean as anaggregation technique. Inspecting the raw data and the density estimates, this does not seem to be aconsequence of poorly chosen kernel width or artifacts in the density estimation process. Instead,it seems that the participants’ conditional distributions could typically be approximated well by aunimodal distribution, and hence averaging was a sensible aggregation method.37igure S19: Raw samples from six GSP chains in Exp. 4a (US-only participants, mean aggregation). lllllllllllllllllllllllllllllllll lll lll lll llllllllllllllllllllllllllllllllllll lll lll lll lll lllllllllllllllllllllllllllllllll lll lll lll llllllllllllllllllllllllllllllllllll lll lll lll lll lllllllllllllllllllllllllllllllll lll lll lll llllllllllllllllllllllllllllllllllll lll lll lll lll
Figure S20: Validation results for Exp. 4a, 4c, and 4i, as produced in Exp. 4d. The shaded regions correspond to 95% confidence intervals over participants.
Hints at cross-cultural differences.
Raw samples of six chains from Exp. 4c (aggregation with KDE modes, US-only participants) and Exp. 4i (aggregation with KDE modes, global participants) are displayed in Fig. S21 and S22. It is important not to read too much into these raw samples, as they ultimately come from stochastic distributions and will vary over repeated runs. However, we did notice some suggestive differences between the final samples of the US chains and those of the global chains. Most salient was the fact that all US chains for ‘intelligent’ finished with a Caucasian man, whereas the three final states of the global chains included both a woman and a non-Caucasian man. We also noticed that the global chains were the only ones to include a man as the final ‘attractive’ sample. While some of this variation will be due to chance, the remaining variation will presumably reflect different stereotypes held by the different participant groups. It would be interesting to explore these different stereotypes in more systematic ways.
Figure S21: Raw samples from six GSP chains in Exp. 4c (US-only participants, KDE mode aggregation).
Gender differences.
Exp. 4j provides a second proof of concept for this kind of group-difference approach (Fig. S23). Here participants were split by self-reported gender, and instructed to optimize the slider for a person that they would most like to date. As one might expect, the samples reflect a predominant (but not universal) preference for members of the opposite gender. This in itself may be a trivial result, but it is easy to intuit how one could extrapolate this approach to much more complex and interesting group-difference studies, for example those involving different cross-cultural populations.
Basis construction methods.
Fig. S24 plots validation results for Exp. 4e (exploring different basis construction methods), as collected in Exp. 4f. The results suggest an early advantage for the original PCA technique; however, the discrepancy with sparse PCA and ICA is small, and seems to disappear after more iterations. As would be expected, the version of PCA with components 71–80 performs poorly; in practice, these components contribute very little perceptually speaking (see also [25]). On this basis, there is little evidence to dismiss any one of PCA, sparse PCA, or ICA. Future work should also consider other recently proposed approaches for parameterizing the generative model, for example [27, 28].
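To make the basis-construction comparison concrete, the sketch below shows how the three decompositions could be fitted to a sample of latent vectors with scikit-learn. The number of retained components (10), the 71–80 control condition, and the sparsity setting alpha = 1.0 follow the supplementary methods above; the placeholder latent sample, the 512-dimensional latent size, and the final remapping line are assumptions of this sketch rather than the exact pipeline code.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA, FastICA

# Placeholder for a sample of StyleGAN latent vectors, one row per random draw.
# In the real pipeline these would be latents sampled from the generator.
rng = np.random.default_rng(0)
latents = rng.standard_normal((2000, 512))

# Original basis: top 10 principal components (fit 80 so that the 71-80 control
# condition can be taken from the same solution).
pca = PCA(n_components=80).fit(latents)
basis_pca_top10 = pca.components_[:10]
basis_pca_71_80 = pca.components_[70:80]

# Sparse PCA with sparsity parameter alpha = 1.0 (SparsePCA in scikit-learn).
sparse_pca = SparsePCA(n_components=10, alpha=1.0, random_state=0).fit(latents)
basis_sparse = sparse_pca.components_

# Independent component analysis (with real, non-Gaussian latent data this
# yields more meaningful components than with this Gaussian placeholder).
ica = FastICA(n_components=10, random_state=0).fit(latents)
basis_ica = ica.components_

# A GSP state is then a 10-dimensional coordinate vector; one way to obtain a
# stimulus is to map it back into latent space around the mean latent.
coords = np.zeros(10)
latent = latents.mean(axis=0) + coords @ basis_pca_top10
```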
Bias analyses.
Fig. S25 plots perceived gender in the different datasets evaluated in Exp. 4g. We see that the gender balance is fairly equal between men and women, with perhaps slightly more women than men as the pipeline progresses. Fig. S26 plots perceived age as a function of perceived gender in the same datasets. Looking first at the training dataset, we see that the mean age is close to 30 years, with the male faces tending to be perceived as somewhat older than 30, and the female faces being perceived as slightly younger than 30. This association between age and gender is amplified to a certain extent through the modeling pipeline, even before the PCA process; it seems as if the model is capturing this association and stereotyping it to a certain degree. This relationship has interesting implications for the GSP samples: if female samples tend to be subjectively younger than male samples, and if younger faces tend to be perceived as more attractive, then GSP samples for ‘attractive’ will be biased towards women, even if the participants do not possess any systematic bias for women over men. Likewise, if older faces tend to be perceived as more intelligent, then this relationship between age and gender would be expected to induce a bias in the GSP samples for ‘intelligent’ towards male faces.
There are many other similar biases that one could anticipate affecting the GSP process. To illustrate some of these potential biases, Fig. S27 plots judgments for ethnicity, smiling, hats, formal clothes, and glasses wearing for the four datasets in Exp. 4g, split by gender. We see for example that men are much more likely than women to be portrayed in formal clothes, potentially a further reason why ‘intelligent’ GSP samples tend to favor men. Similarly, men are more likely to be portrayed in glasses, another potential contributor to perceived intelligence. Conversely, women are more likely than men to be smiling, potentially supporting a female bias in the ‘attractive’ and ‘fun’ GSP samples. These hypotheses are consistent with Fig. S28, which shows that perceived intelligence is indeed associated with wearing formal clothes and glasses, and that perceived attractiveness and fun are both associated with smiling. These examples illustrate the complex network of biases that can be inherited from a generative model such as StyleGAN, and highlight the importance of developing more balanced training datasets for future cognitive work in this area.
Figure S22: Raw samples from six GSP chains in Exp. 4i (global participant group, KDE mode aggregation).
Figure S23: Final samples from the first four male and female chains in the dating preferences experiment (Exp. 4j).
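The mediation argument above (an age–gender association in the stimulus space inducing an apparent gender bias in ‘attractive’ samples, even with gender-neutral raters) can be illustrated with a toy simulation. All numbers below are invented purely for illustration and are not estimates from our data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Toy model of the generative space: gender is balanced, but, mirroring the
# qualitative pattern in Fig. S26, sampled female faces tend to be younger.
is_female = rng.random(n) < 0.5
age = np.where(is_female, rng.normal(27, 6, n), rng.normal(33, 6, n))

# A rater who is explicitly gender-neutral: attractiveness depends only on age
# (younger faces rated higher), plus noise.
attractiveness = -0.1 * age + rng.normal(0, 1, n)

# Select the top 5% most 'attractive' faces, as an optimization-like run might.
top = attractiveness > np.quantile(attractiveness, 0.95)
print(f"Proportion female overall:   {is_female.mean():.2f}")
print(f"Proportion female in top 5%: {is_female[top].mean():.2f}")
# The selected set over-represents women even though the rater never used
# gender, because the age-gender association in the space induces the bias.
```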
Figure S24: Validation results for Exp. 4e (exploring different basis construction methods), as collected in Exp. 4f. The shaded regions correspond to 95% confidence intervals over participants.
Figure S25: Perceived gender for faces from different stages of the modeling pipeline, as collected in Exp. 4g.
Figure S26: Perceived age split by gender for faces from different stages of the modeling pipeline, as collected in Exp. 4g. The error bars denote 95% confidence intervals bootstrapped over images.
Figure S27: Evaluations of ethnicity, smiling, hats, formal clothes, and glasses, for faces from different stages of the modeling pipeline, split by gender (Exp. 4g). The error bars denote 95% confidence intervals bootstrapped over images.
Training dataset.
To provide a more intuitive illustration of the method's dependence on the training dataset, Fig. S29 displays final GSP samples from Exp. 4h, which used the StyleGAN model trained on a dataset of portraits from WikiArt (https://github.com/ak9250/stylegan-art). The artistic nature of the WikiArt dataset differs clearly from the photographic nature of the FFHQ dataset, and this is reflected in the GSP samples. Nonetheless, the GSP process still successfully navigates this new space to find samples that subjectively reflect the target adjectives.
Figure S28: Evaluations of ethnicity, smiling, hats, formal clothes, and glasses, for GSP samples evaluated in Exp. 4g. The error bars denote 95% confidence intervals bootstrapped over images.
Figure S29: Final GSP samples from Exp. 4h, which used the StyleGAN model pretrained on a dataset of portraits from WikiArt (https://github.com/ak9250/stylegan-art).
F.3 Conclusion
Our analyses indicate that GSP is an effective tool for exploring the generative space of the StyleGAN model. Here we relied on a simple PCA approach for creating a reduced basis of the generative space, but there are other promising approaches in the literature that could also be applied to this task (e.g., [27, 28]). However, our analyses also indicate that dataset bias is a real and important issue when interpreting the outcomes of this approach. Future work must engage with this problem by studying the kinds of biases inherent in the generative models used, and ideally by finding ways to construct less biased models in the first place (e.g., [29]).
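For readers who want to see how the pieces fit together, a minimal sketch of one GSP sweep over such a reduced basis is given below. The helper functions stand in for the StyleGAN decoder and the participant front end and are hypothetical; the dimensionality (10) and the aggregation of five slider responses by their mean follow the descriptions earlier in this appendix.

```python
import numpy as np

N_DIMENSIONS = 10            # size of the reduced (e.g., PCA) basis
RESPONSES_PER_ITERATION = 5  # slider responses aggregated per dimension update

def render_stimulus(coords, basis, mean_latent):
    """Map a low-dimensional GSP state back to a latent vector; in the real
    pipeline this latent would be decoded to an image by StyleGAN."""
    return mean_latent + coords @ basis

def collect_slider_response(coords, dim, basis, mean_latent, target):
    """Hypothetical placeholder for the participant interface: the participant
    moves a slider that continuously varies dimension `dim` (rendering images
    via render_stimulus) and returns their preferred setting for `target`."""
    raise NotImplementedError("Replace with the actual experiment front end.")

def gsp_sweep(coords, basis, mean_latent, target):
    """One GSP sweep: update each dimension in turn, holding the others fixed,
    aggregating the collected slider responses by their mean (Exp. 4a style)."""
    coords = coords.copy()
    for dim in range(N_DIMENSIONS):
        responses = [collect_slider_response(coords, dim, basis, mean_latent, target)
                     for _ in range(RESPONSES_PER_ITERATION)]
        coords[dim] = np.mean(responses)
    return coords
```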
Appendix references
[1] Y. Weiss, E. P. Simoncelli, and E. H. Adelson, “Motion illusions as optimal percepts,” Nature Neuroscience, vol. 5, no. 6, pp. 598–604, 2002.
[2] X.-X. Wei and A. A. Stocker, “A Bayesian observer model constrained by efficient coding can explain ‘anti-Bayesian’ percepts,” Nature Neuroscience, vol. 18, pp. 1509–1517, 2015.
[3] A. N. Sanborn, T. L. Griffiths, and D. J. Navarro, “A more rational model of categorization,” in Proceedings of the 28th Annual Conference of the Cognitive Science Society (R. Sun and N. Miyake, eds.), pp. 726–731, Cognitive Science Society, 2006.
[4] D. McFadden, “Conditional logit analysis of qualitative choice behaviour,” in Frontiers in Econometrics (P. Zarembka, ed.), pp. 105–142, New York, NY: Academic Press, 1974.
[5] K. E. Train, Discrete Choice Methods with Simulation. Cambridge University Press, 2009.
[6] K. J. Woods, M. H. Siegel, J. Traer, and J. H. McDermott, “Headphone screening to facilitate web-based auditory experiments,” Attention, Perception, & Psychophysics, vol. 79, no. 7, pp. 2064–2072, 2017.
[7] M. Chmielewski and S. C. Kucker, “An MTurk crisis? Shifts in data quality and the impact on study results,” Social Psychological and Personality Science, vol. 11, no. 4, pp. 464–473, 2020.
[8] J. H. Clark, “The Ishihara Test for color blindness,” American Journal of Physiological Optics, vol. 5, pp. 269–276, 1924.
[9] “IEEE recommended practice for speech quality measurements,” tech. rep., IEEE, 1969. ISBN: 9781504402743.
[10] P. Demonte, “HARVARD corpus speech shaped noise and speech-modulated noise for SIN test,” University of Salford, 2019.
[11] P. Boersma and D. Weenink, “Praat: doing phonetics by computer [Computer program],” Version 6.0.37, 2018.
[12] Y. Jadoul, B. Thompson, and B. de Boer, “Introducing Parselmouth: A Python interface to Praat,” Journal of Phonetics, vol. 71, pp. 1–15, 2018.
[13] I. R. Titze, Y. Horii, and R. C. Scherer, “Some technical considerations in voice perturbation measurements,” Journal of Speech, Language, and Hearing Research.
[15] Nature Human Behaviour, vol. 3, no. 4, pp. 369–382, 2019.
[16] P. Laukka, H. A. Elfenbein, N. S. Thingujam, T. Rockstuhl, F. K. Iraki, W. Chui, and J. Althoff, “The expression and recognition of emotions in the voice across five nations: A lens model analysis based on acoustic features,” Journal of Personality and Social Psychology, vol. 111, no. 5, 2016.
[17] P. M. C. Harrison and M. T. Pearce, “Representing harmony in computational music cognition,” PsyArXiv, 2020.
[18] T. Stainsby and I. Cross, “The perception of pitch,” in The Oxford Handbook of Music Psychology (S. Hallam, I. Cross, and M. Thaut, eds.), pp. 47–58, New York, NY: Oxford University Press, 2009.
[19] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[20] X.-F. Wang and Y. Xu, “Fast clustering using adaptive density peak detection,” Statistical Methods in Medical Research, vol. 26, no. 6, pp. 2800–2811, 2017.
[21] N. Jacoby and J. H. McDermott, “Integer ratio priors on musical rhythm revealed cross-culturally by iterated reproduction,” Current Biology, vol. 27, no. 3, pp. 359–370, 2017.
[22] C. L. Krumhansl and E. J. Kessler, “Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys,” Psychological Review, vol. 89, no. 4, pp. 334–368, 1982.
[23] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4401–4410, 2019.
[24] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of StyleGAN,” arXiv, 2019.
[25] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, “GANSpace: Discovering interpretable GAN controls,” arXiv, 2020.
[26] L. Brinkman, A. Todorov, and R. Dotsch, “Visualising mental representations: A primer on noise-based reverse correlation in social psychology,” European Review of Social Psychology, vol. 28, no. 1, pp. 333–361, 2017.
[27] A. Voynov and A. Babenko, “Unsupervised discovery of interpretable directions in the GAN latent space,” arXiv, 2020.
[28] Y. Shen and B. Zhou, “Closed-form factorization of latent semantics in GANs,” arXiv, 2020.
[29] A. Grover, K. Choi, R. Shu, and S. Ermon, “Fair generative modeling via weak supervision,” arXiv.