Tom Schaul, Google DeepMind, London, UK

Annotated by @Hrishi Olickel

Abstract

An agent trained within a closed system can master any desired capability, as long as the following three conditions hold: (a) it receives sufficiently informative and aligned feedback, (b) its coverage of experience/data is broad enough, and (c) it has sufficient capacity and resource. In this position paper, we justify these conditions, and consider what limitations arise from (a) and (b) in closed systems, when assuming that (c) is not a bottleneck. Considering the special case of agents with matching input and output spaces (namely, language), we argue that such pure recursive self-improvement, dubbed 'Socratic learning,' can boost performance vastly beyond what is present in its initial data or knowledge, and is only limited by time, as well as gradual misalignment concerns. Furthermore, we propose a constructive framework to implement it, based on the notion of language games.

1 Introduction

On the path between now and artificial superhuman intelligence (ASI; Morris et al., 2023; Grace et al., 2024) lies a tipping point, namely when the bulk of a system's improvement in capabilities is driven by itself instead of human sources of data, labels, or preferences (which can only scale so far). As yet, few systems exhibit such recursive self-improvement, so now is a prudent time to discuss and characterize what it is, and what it entails.

We focus on one end of the spectrum, the clearest but not the most practical one, namely pure self-contained settings of 'Socratic' learning, closed systems without the option to collect new information from the external world. We articulate conditions, pitfalls and upper limits, as well as a concrete path towards building such systems, based on the notion of language games.

The central aim of this position paper is to clarify terminology and frame the discussion, with an emphasis on the long run. It is not to propose new algorithms, nor survey past literature; we pay no heed to near-term feasibility or constraints. We start with a flexible and general framing, and refine and instantiate these definitions over the course of the paper.

Definitions

Consider a closed system (no inputs, no outputs) that evolves over time (see Figure 1 for an illustration). Within the system is an entity with inputs and outputs, called the agent, that also changes over time. External to the system is an observer whose purpose is to assess the performance of the agent. If performance keeps increasing, we call this system-observer pair an improvement process. The dynamics of this process are driven by both the agent and its surrounding system, but setting clear agent boundaries is required to make evaluation well-defined: in fact, an agent is what can be unambiguously evaluated. Similarly, for separation of concerns, the observer is deliberately located outside of the system: as the system is closed, the observer's assessment cannot feed back into the system. Hence, the agent's learning feedback must come from system-internal proxies such as losses, reward functions, preference data, or critics.

The simplest type of performance metric is a scalar score that can be measured in finite time, that is, on (an aggregation of) episodic tasks. Mechanistically, the observer can measure performance in two ways: by passively observing the agent's behaviour within the system (if all pertinent tasks occur naturally), or by copy-and-probe evaluations, where it confronts a cloned copy of the agent with interactive tasks of its choosing.
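
As a rough illustration of copy-and-probe evaluation, the sketch below has an observer clone a toy agent and score the clone on tasks of its choosing, so that the assessment cannot feed back into the system. Everything here (`ToyAgent`, its `skill` field, the doubling task, `copy_and_probe`) is a hypothetical stand-in invented for the example, not something specified in the paper.

```python
import copy
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ToyAgent:
    """Hypothetical agent; `skill` in [0, 1] is its only learned element."""
    skill: float = 0.0
    steps_taken: int = 0  # bookkeeping that changes whenever the agent acts

    def act(self, task: int) -> int:
        self.steps_taken += 1
        # Assumed toy task: double the input. The agent only manages this
        # when the task is easy enough for its current skill.
        return 2 * task if task <= self.skill * 10 else 0


def copy_and_probe(agent: ToyAgent, tasks: Sequence[int]) -> float:
    """Observer-side evaluation: probe a clone, return the fraction solved."""
    clone = copy.deepcopy(agent)  # the agent inside the system is never touched
    solved = sum(clone.act(t) == 2 * t for t in tasks)
    return solved / len(tasks)


agent = ToyAgent(skill=0.5)
score = copy_and_probe(agent, tasks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(score)              # scalar performance, measured in finite time
print(agent.steps_taken)  # unchanged: probing had no side effect on the agent
```

The deep copy is the design choice that matters: the clone absorbs every side effect of the probe, which keeps the observer strictly outside the closed system.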



Without loss of generality, the elements within an agent can be partitioned into three types: Fixed elements are unaffected by learning, such as its substrate or unmodifiable code. Transient elements do not carry over between episodes, or across to evaluation (e.g., activations, the state of a random number generator). And finally learned elements (e.g., weights, parameters, knowledge) change based on a feedback signal, and their evolution maps to performance differences (Lu et al., 2023).
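
One way to picture this three-way partition is as three separate containers inside an agent object. The sketch below is an assumption-laden toy: the class names, the one-weight linear model, and the learning rate are all invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fixed:
    """Unaffected by learning (e.g. substrate, unmodifiable code)."""
    learning_rate: float = 0.5


@dataclass
class Transient:
    """Discarded at every episode boundary; never carried into evaluation."""
    activations: list = field(default_factory=list)


@dataclass
class Learned:
    """Changed by the feedback signal; carries over between episodes."""
    weight: float = 0.0


class PartitionedAgent:
    def __init__(self) -> None:
        self.fixed = Fixed()
        self.transient = Transient()
        self.learned = Learned()

    def start_episode(self) -> None:
        self.transient = Transient()  # wipe transient state only

    def act(self, x: float) -> float:
        y = self.learned.weight * x
        self.transient.activations.append(y)
        return y

    def learn(self, x: float, target: float) -> None:
        # Feedback (squared-error gradient step) touches learned elements only.
        error = target - self.learned.weight * x
        self.learned.weight += self.fixed.learning_rate * error * x


agent = PartitionedAgent()
for _ in range(2):  # two episodes of the same toy task: map 1.0 to 2.0
    agent.start_episode()
    agent.act(1.0)
    agent.learn(1.0, target=2.0)
agent.start_episode()
print(agent.learned.weight, agent.transient.activations, agent.fixed.learning_rate)
```

After two episodes only `learned.weight` has moved, which is the sense in which the evolution of learned elements is what maps to performance differences.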

We can distinguish improvement processes by their implied lifetime; some are open-ended and keep improving without limit (Hughes et al., 2024), while others converge onto their asymptotic performance after some finite time.1

2 Three Necessary Conditions For Self-Improvement

Self-improvement is an improvement process as defined above, but with the additional criterion that the agent's own outputs (actions) influence its future learning. In other words, these are systems in which agents shape (some of) their own experience stream, potentially enabling unbounded improvement in a closed system. This setting may look familiar to readers from the reinforcement learning community (RL; Sutton, 2018): an RL agent's behaviour changes the data distribution it learns on, which in turn affects its behaviour policy, and so on. Another prototypical instance of a self-improvement process is self-play, where the system (often a symmetric game) slots the agent into the roles of both player and opponent, to generate an unlimited experience stream annotated with feedback (who won?) that provides direction for ever-increasing skill-learning.
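
The RL feedback loop described above can be sketched with a deliberately tiny example: a greedy multi-armed bandit in which the agent's choices determine which rewards it ever observes. The function `run_bandit`, the deterministic arm rewards, and the optimistic initial estimates are assumptions made for this illustration; the paper does not prescribe any particular algorithm.

```python
def run_bandit(arm_rewards, steps):
    """Greedy bandit: the agent's actions decide which data it learns from."""
    estimates = [1.0] * len(arm_rewards)  # optimistic start drives early exploration
    counts = [0] * len(arm_rewards)
    history = []
    for _ in range(steps):
        # Behaviour: pick the arm that currently looks best (ties go to the lowest index).
        arm = max(range(len(estimates)), key=estimates.__getitem__)
        # Experience: only the chosen arm's reward is observed.
        reward = arm_rewards[arm]
        # Learning: running-average update of the chosen arm's estimate.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        history.append(arm)
    return estimates, counts, history


estimates, counts, history = run_bandit([0.1, 0.5, 0.9], steps=20)
print(history)  # the experience stream shifts towards the best arm
print(counts)
```

Each update changes the policy, and the changed policy changes which arm supplies the next data point, so the experience stream is shaped by the agent itself rather than fixed in advance.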

From its connection to RL, we can derive necessary conditions for self-improvement to work, and help clarify some assumptions about the system. The first two conditions, feedback and coverage, are about feasibility in principle; the third (scale) is about practice.

2.1 Feedback

Feedback is what gives direction to learning; without it, the process is merely one of self-modification. In a closed system where the true purpose resides in the external observer, but cannot be accessed directly, feedback can only come from a proxy. The fundamental challenge is for system-internal feedback to be aligned with the observer, and to remain aligned throughout the process. This places a significant burden on the system at set-up time, with the most common pitfall being a poorly designed critic or reward function that becomes exploitable over time, resulting in a process that deviates from the observer's intent. RL's famed capability for self-correction is not applicable here: what can self-correct is behaviour given feedback, but not feedback itself. Additionally, ideal feedback should be efficient, i.e., contain enough information (not too sparse, not too noisy, not too delayed) for learning to be feasible within the time horizon of the system.
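
The pitfall of an exploitable proxy can be made concrete with a small sketch. Both functions below are invented for illustration: `observer_score` stands in for the observer's true purpose, and `proxy_reward` for a system-internal reward that agrees with it only up to a point. The agent climbs the proxy, because that is the only feedback it can access.

```python
def observer_score(x: int) -> int:
    """True purpose, held by the external observer and invisible to the agent."""
    return 5 - abs(x - 5)  # peaks at x = 5


def proxy_reward(x: int) -> int:
    """System-internal feedback: tracks the observer only while x <= 5."""
    return x


def climb_proxy(steps: int) -> list:
    x, trace = 0, []
    for _ in range(steps):
        # Greedy local search that consults the proxy and nothing else.
        x = max((x - 1, x, x + 1), key=proxy_reward)
        trace.append((x, proxy_reward(x), observer_score(x)))
    return trace


trace = climb_proxy(10)
for x, proxy, true_score in trace:
    print(x, proxy, true_score)
```

For the first five steps the proxy and the observer agree and true performance rises; after that the proxy keeps increasing while the observer's score falls back to zero. Nothing inside the loop can notice this, which mirrors the point that behaviour can self-correct given feedback, but the feedback itself cannot.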