
30.03.2022

Will OpenAI’s code generation model revolutionize labor? Or imperil it?

OpenAI will collaborate with researchers to determine how Codex, the company’s code generation model, could radically alter the American workforce.

OpenAI announced on Thursday, March 3 that they are seeking researchers to investigate the economic impact of Codex, their code generation model. The call for external collaborators appeared as part of a larger post about the challenges the company has faced in responsibly deploying large language models (LLMs) such as GPT-3.

In a research proposal, OpenAI lays out the motivations and research questions behind studying the economic impact of Codex. The company cites several motivations: ensuring AI safety, factoring economic impacts into decisions about deployment policies and AI system design, establishing a precedent for economic impact research on future AI systems, and ensuring that artificial general intelligence (AGI) systems benefit humanity.

“We need to engage in research about the economic impact of our models today in order to be positioned to assess the safety of developing and releasing more capable systems in the future,” states the announcement.

Codex is a descendant of OpenAI’s GPT-3, a large language model that generates text when given text as input. Codex was fine-tuned on publicly available code so that, given code as input, it can generate working code.
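To make that concrete, here is a minimal sketch of how a developer might have prompted a Codex model through OpenAI’s API at the time. The model name, parameters, and prompt below are illustrative assumptions, not taken from OpenAI’s announcement, and using Codex required access to the beta.

```python
# Illustrative sketch only: prompting a Codex-style model via the openai
# Python library (pre-1.0 API). Model name and parameters are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "# Python 3\n"
    "# Return the n-th Fibonacci number iteratively.\n"
    "def fibonacci(n):\n"
)

response = openai.Completion.create(
    model="code-davinci-002",  # illustrative Codex model name
    prompt=prompt,
    max_tokens=64,
    temperature=0,             # deterministic completion
)

# The model continues the code it was given, e.g. with the function body.
print(response["choices"][0]["text"])
```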

The research OpenAI describes in the proposal will focus on both “direct impacts” and “indirect impacts.” Hypothetical examples of direct impacts of Codex outlined in the proposal include an increase in programmer productivity and a shift in the skills in demand in the labor market. The proposal also gives examples of indirect impacts, such as changes to the price of goods and services and exacerbation of inequities in economic opportunity.

OpenAI’s research proposal cites a lengthy DeepMind paper on ethical and social risks from language models. The paper outlines a taxonomy of six types of risks presented by LLMs (discrimination, exclusion, and toxicity; information hazards; misinformation harms; malicious uses; human-computer interaction harms; and automation, access, and environmental harms), then describes possible paths to mitigating these risks. The authors write, “...the responsibilities for addressing risks fall significantly upon those developing LMs and laying the foundations for their applications.”

GitHub faced backlash when they released Copilot, a code editor extension built on top of Codex that autocompletes code as a programmer writes it.

GitHub announced a technical preview of Copilot in collaboration with OpenAI in June of 2021. Former GitHub CEO Nat Friedman wrote that Copilot “draws context from the code you’re working on, suggesting whole lines or entire functions.” Copilot was trained on billions of lines of code stored in public GitHub repositories.
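As an illustration of the workflow Friedman describes (mine, not actual Copilot output): the programmer types a comment and a function signature, and the extension proposes the rest of the function.

```python
# Hypothetical illustration of a whole-function suggestion.
# A programmer types the comment and the signature...

# Parse a "key=value" config line into a (key, value) pair.
def parse_config_line(line: str) -> tuple[str, str]:
    # ...and the assistant suggests a body along these lines.
    key, _, value = line.strip().partition("=")
    return key.strip(), value.strip()
```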

Following the announcement, the Free Software Foundation published a blog post calling Copilot “unacceptable and unjust.” The FSF notes that developers question whether training Codex on public code repositories constitutes fair use of free software and whether it creates the potential for copyright infringement.

In response to OpenAI’s March 3 tweet sharing their blog post about responsible deployment, Amritha Jayanti, Associate Director of the Technology and Public Purpose Project, tweeted “Why are we celebrating wide scale deployment of powerful and potentially harmful systems without any proper oversight for consumers?”

The aforementioned OpenAI blog post about responsible deployment mentions that “when we first worked on GPT-3, we viewed it as an internal research artifact rather than a production system and were not as aggressive in filtering out toxic training data as we might have otherwise been.” This toxic training data led GPT-3 to reproduce hateful stereotypes and other inappropriate content.

OpenAI addressed this issue by releasing InstructGPT, an updated version of GPT-3 with an additional reinforcement learning step that encourages the model to align its responses to prompts with human raters’ judgements of what is appropriate. The InstructGPT announcement post states, “We believe that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability, and we will continue to push in this direction.”
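A heavily simplified sketch of that human-in-the-loop idea is below. It is a toy built on my own assumptions, not OpenAI’s code: the real system trains a reward model on human raters’ rankings and fine-tunes GPT-3 with reinforcement learning, whereas here the “policy” is just a weighted choice between two canned completions.

```python
# Toy sketch of reinforcement learning from human feedback (not OpenAI's
# code): sample a completion, score it with a stand-in reward model, and
# shift the policy toward completions that human raters would prefer.
import random

# A toy "policy": preference weights over candidate completions for one prompt.
policy = {
    "follow the instruction politely": 1.0,
    "ignore the instruction": 1.0,
}

def reward_model(completion: str) -> float:
    """Stand-in for a reward model trained on human raters' rankings."""
    return 1.0 if "politely" in completion else -1.0

def sample(policy: dict) -> str:
    """Sample a completion in proportion to its current weight."""
    completions, weights = zip(*policy.items())
    return random.choices(completions, weights=weights)[0]

# One simplified training loop: upweight completions the reward model prefers.
for _ in range(200):
    completion = sample(policy)
    policy[completion] *= 1.0 + 0.1 * reward_model(completion)

print(policy)  # the human-preferred completion now carries most of the weight
```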

OpenAI’s blog post announcing the call for research about Codex states, “While we don’t anticipate that the current capabilities of Codex could threaten large-scale economic disruption, future capabilities of code generation and other large language model applications could.” Interested researchers can fill out this form to express interest in collaborating on this effort. 

