Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models
aa r X i v : . [ c s . C L ] F e b Understanding the Capabilities, Limitations, andSocietal Impact of Large Language Models
Alex Tamkin ∗ , Miles Brundage ∗ ,Jack Clark † , and Deep Ganguli Stanford University OpenAI AI Index
Introduction
On October 14th, 2020, researchers from OpenAI, the Stanford Institute forHuman-Centered Artificial Intelligence, and other universities convened to dis-cuss open research questions surrounding GPT-3, the largest publicly-discloseddense language model at the time.The meeting took place under Chatham House Rules. Discussants came from avariety of research backgrounds including computer science, linguistics, philos-ophy, political science, communications, cyber policy, and more. Broadly, thediscussion centered around two main questions:1.
What are the technical capabilities and limitations of large lan-guage models?
The discussion touched on several key areas including:the surprising impact of scale on model capabilities, the difficulty in as-sessing whether large language models truly understand language, the im-portance of training models on multiple data modalities, and challengesin aligning model objectives with human values.2.
What are the societal effects of widespread use of large languagemodels?
The discussion touched on several key areas including: difficul-ties in scoping all possible uses (or misuses) of general purpose languagemodels, challenges organizations may face in model deployment, the po-tential for these models to algorithmically spread disinformation, difficul-ties in mitigating model bias (e.g., racial, gender, religious, etc.), and theimpact of language model-based automation on the labor market.While the conversation was collegial and productive, there was a sense of ur-gency to make progress sooner than later in answering these questions. Here, ∗ Equal contribution † Work carried out while employed at OpenAI
1e provide a detailed summary of the discussion organized by the two themesabove. We conclude with a list of potential future research directions inspiredby the discussion.
Scale
GPT-3 is one of the largest publicly-disclosed language models — it has 175billion parameters and was trained on 570 gigabytes of text. For comparison,its predecessor, GPT-2 (which is functionally similar to GPT-3) has 1.5 billionparameters and was trained on 40 gigabytes of text. While GPT-2 displayedsome zero-shot generalization to downstream tasks, GPT-3 further displayed theability to learn more novel tasks when given examples in context. Participantsfound it remarkable that such capabilities emerge merely from scaling modeland training data size.One person remarked that the growth in model capabilities as they scale “feelslike a law of physics or thermodynamics” in its stability and predictability. Sev-eral participants were optimistic that these trends would continue even for mod-els much larger than GPT-3, yielding ever-stronger models capable of more ad-vanced few-shot learning of new skills from a small number of training examples.One participant remarked that the scale of models like GPT-3 was reminiscentof large particle accelerator experiments, which require many people with di-verse backgrounds to execute. For example, when training such large models,different teams with diverse expertise must collaborate to run experiments, buildand maintain the computing infrastructure, develop the algorithms, and con-tinuously interrogate the model’s capabilities for possible problems (e.g., bias,misuse, safety concerns, etc.). The latter point is referred to as “red-teaming”throughout the rest of this document.
Understanding
What constitutes “understanding” in a language model, and does GPT-3 ful-fill this definition? Some leaned towards definitions based on strong notionsof intelligence, which require models to possess intentionality or the ability to Since this is a summary of discussions, rather than a research paper, we do not includereferences. Rather, we hyperlink to relevant papers that were discussed at the workshop. Fora more comprehensive set of references related to some of these issues, we point readers to theoriginal GPT-3 paper and to recent work of Bender and Gebru et al published a few monthsafter this workshop.
Multimodality
Much of the conversation considered the importance of multimodal models —language models trained on text and data from other modalities, e.g., images,audio recordings, etc. Participants largely agreed in their predictions that largemultimodal models will become more prevalent and enable more diverse capabil-3ties. However, some argued that GPT-3 is already trained on multimodal data,in that the training data contains prose, structured data tables, and computercode. Others suggested that the main benefit of multimodal training might beto improve the speed at which models acquire useful capabilities, as the inter-action between different data modalities may provide a stronger learning signalthan each data modality in isolation provides. Finally, some commented thatno single additional modality was critical to language use, given that humansdiffer in the range of sensory modalities they have access to.
Alignment
Participants discussed the need to better align model objectives with humanvalues. For example, one participant mentioned some language models treatall symbols (e.g., nouns, prepositions, numbers, etc.) equally, but humans caremuch more about, for example, incorrectly stating someone’s age than aboutmisplacing a preposition. Several other participants emphasized the importanceand challenge of better optimizing for factual accuracy and robustness to adver-sarial examples. Aligning human and model objectives was seen to be especiallyimportant for “embodied” AI agents which learn through active interaction withtheir environment. Discussants emphasized the dual importance of developingbetter algorithms for “steering” agents towards human values, as well as fos-tering cross-disciplinary collaborations to better clarify what “human values”means, especially given diversity across individuals and communities and theprevalence of bias in available datasets.
Capabilities
GPT-3 has an unusually large set of capabilities, including text summarization,chatbot behavior, search, code generation, and essay generation. One discussantstated that such a large “capability surface” makes it challenging to both scopethe full array of uses (because GPT-3 can take in arbitrary inputs, it is a priori impossible to anticipate all potential behaviors of the model) and to ensuretheir safety to people and societies. Participants noted that, by putting GPT-3behind a controlled-access API, OpenAI is able to constrain the model’s usemore easily than if they open sourced it. However, open questions remain.For example, who gets access and why? How can one provide model access In fact, shortly after the workshop, OpenAI released DALL-E, which is a multimodalversion of GPT-3 trained on both images and text.
4o support a large community to red-team (interrogate the model for potentialmisuse and develop mitigation strategies) at scale?
Deployment
Participants discussed several options for defining and addressing the ethicaland societal challenges of deploying large language models. One suggestionwas to increase the computing resources available to academia so that it wouldbe easier for academics to do research that informs the deployment of largelanguage models. Someone suggested that laws requiring disclosure of when AIis being used to generate text could be helpful in managing the effects of largelanguage models. Another participant asked what metrics might be used toevaluate whether language models are having a societally beneficial effect, andthere was general agreement that this is a challenging but important task.Several participants noted that OpenAI and other organizations will not have amonopoly on large language models forever. Participants suggested that devel-opers may only have a six- to nine-month advantage until others can reproducetheir results. It was widely agreed upon that those on the cutting edge shoulduse their position on the frontier to responsibly set norms in the emerging field.Additionally, some participants pointed out that, due to standard advances intechnology, it will only become easier for other actors to replicate models likeGPT-3 over time. This further suggests the urgency of using the current timewindow, during which few actors possess very large language models, to developappropriate norms and principles for others to follow.
Disinformation
A major discussion point considered the deliberate misuse of language modelsfor purposes such as generating disinformation. More specifically, models likeGPT-3 can be used to create false, misleading, or propagandistic essays, tweets,and news stories de novo . One participant was skeptical about the magnitudeof these likely risks since many previous technologies (e.g. photography andPhotoshop) sparked similar concerns and have already raised societal aware-ness of the risks of disinformation. Furthermore, while automated generation ofdisinformation may be feasible in principle, human labor may still be more cost-effective for such purposes. Others disagreed, and saw automated generation asmuch more cost-effective than training and paying humans to generate disin-formation. Participants agreed that empirically investigating the economics ofautomated vs human generated disinformation is important.Thinking ahead, someone suggested considering a future in which language mod-els can generate text that is not just coherent on commonly discussed topics, but5ighly persuasive on arbitrary topics. Another participant suggested that GPT-3 or other future language models could make disinformation hard or impossibleto detect at the level of content, forcing reliance on metadata by online plat-forms. Relatedly, someone suggested that the existence of systems like GPT-3should spur more use of cryptography to authenticate media.
Bias
GPT-3 exhibits several racial, gender, and religious biases. One discussantanalogized the difficulty of addressing language model bias to the problem ofcontent moderation on online platforms — despite the difficult normative issuesin both cases, there are still some areas of relative consensus and opportunitiesfor mitigation. For example, online platforms agree on the need to address childpornography or egregious threats of violence, and the concept of “protectedclasses” in discrimination law provides a useful initial framework for thinkingabout some language model biases.Several workshop participants noted that it is difficult to define what it means tomitigate bias in large language models in a universal manner, since appropriatelanguage use is highly contextual. One participant noted that all datasets arebiased in some ways, so the challenge is not eliminating all bias but addressingharmful biases according to some set of normative and/or legal criteria. Somesuggested that companies like OpenAI do not have the appropriate standingand should not aim to make such decisions on behalf of society. Someone elseobserved that it is especially difficult to think about mitigating bias for multi-purpose systems like GPT-3 via changes to their training data, since bias istypically analyzed in the context of a particular use cases.Participants discussed a wide variety of possible means of addressing harmfulbiases in language models, including: • Changes to the initial training data to mitigate bias a priori • Training a separate model to filter content generated by a language model • Fine-tuning a large language model on data with desired properties • Tagging data so that the model learns to distinguish among certain formsof content (see e.g. CTRL) • Training models to be more “fact-aware” • Reinforcement learning with human feedback • Leveraging the model’s own knowledge to improve outputs (e.g., withcareful prompt design) 6
Developing more expansive suites of “bias tests” that models can be runthrough prior to deployment • Red-teaming the model at scale by engaging trusted partners to work withthe model and through limited commercial offerings.None of these approaches was considered a panacea. For example, steering amodel with human feedback still raises the question of who the human labelersare or how they should be chosen, and content filters can sometimes underminethe agency of the very groups that they are intended to protect (e.g., marginal-ized groups reclaiming words or phrases that are used as slurs by majoritygroups). One participant argued that keeping a human in the loop of text gen-eration is critical for addressing these issues. Some participants emphasized thatcertain use cases should be avoided given the limitations of existing techniques,and that text generation applications vary widely in terms of open-endednessand risk. For example, detecting regular expressions is much more tractable todo safely than managing a suicide hotline.
Economy
Another theme of the discussion considered the economic implications of modelslike GPT-3. Participants observed that current jobs that involve reading oranalyzing text vary widely in their desirability, with some being more enjoyable(e.g., creative writing or reading and summarizing reports) and others oftenbeing traumatizing or alienating (e.g., content moderation). This raises thequestion of when jobs, or what kinds of jobs, should or shouldn’t be automatedby large language models. One participant suggested that leaving such decisionsup to companies would likely have adverse consequences. Education was alsomentioned as a societal area likely to be affected by large language models,via changes to the essay writing process as well as evaluation of text. Oneparticipant pointed out that providing API access to a variety of groups fromdifferent sectors of society can help provide an early signal of potential societalchanges.
The following research questions were inspired by the discussion: • Can we better understand why language models improve so much withscale? Can this enable us to build models which scale more efficiently?7
What are the limits of scaling? Will scale lead to strong causal reasoning,symbolic manipulation, commonsense understanding, and robustness to awider class of inputs? Or will different techniques be necessary? • How can we understand the limits of what large language models arecapable of? Can we enable models to ask for help or clarification, orabstain when they are unsure? • How can we develop new neural network architectures and algorithms thatenable efficient learning from diverse, multimodal data beyond text? • What are the opportunities and tradeoffs involved in different approachesto steering the outputs of large-scale language models to be more alignedwith human values? • How should access to models like GPT-3 be allocated, balancing consider-ations like security, replicability, and fairness? What kinds of tests do weneed to develop in order to qualify language models like GPT-3 as beingsafe or unsafe for use in particular contexts? • What can academia do to best position itself to develop guardrails for theindustrial development of such models - including advocating for sufficientfunding to replicate the compute resources required to train them? • How can we best foster cross-disciplinary collaboration to understand andmanage the biases in large datasets and model representations of suchdatasets? • How can we best characterize the potential “threat landscape” for suchmodels; e.g., do we need to spend more time worrying about how modelslike this could be used by profit-driven actors to generate lots of low-gradespam, or should we be more worried about state-based actors using modelsto generate persuasive text for use in disinformation campaigns? ••