Musing 80: Intelligence at the Edge of Chaos
Fascinating paper out of Northwestern, Yale, and Idaho State
Today’s paper: Intelligence at the Edge of Chaos. Zhang et al., 3 Oct 2024. https://arxiv.org/pdf/2410.02536
For those following the news on large language models (LLMs) like ChatGPT, there is an overwhelming feeling that (a) bigger models are better and (b) lots of data is required to train big models. However, it is also accepted, though difficult to apply in practice, that the quality of the data and the inherent complexity of the data both play important roles in determining how a model ends up performing on downstream tasks. Rather than attributing LLMs’ excellent performance on complex tasks solely to big data, the authors of today’s paper consider the following hypothesis instead: that intelligence can emerge from modeling simple systems, as long as those systems exhibit complex behavior, even when the process that generates the data lacks inherent intelligence.
Complexity science has a long and venerable history in the field of computing. From early explorations of cellular automata to the development of algorithms capable of navigating chaotic systems, the intersection of complexity and computing has fostered innovations that push the boundaries of AI and systems modeling. Complex systems give rise to emergent properties from simple rules, which, some would argue, is also what makes LLMs so intriguing.
The authors use Stephen Wolfram’s elementary cellular automata (ECA) as their experimental framework. ECAs are one-dimensional, binary-state, discrete computational systems defined by 256 possible 8-bit rules. They generate a diverse spectrum of behaviors, ranging from simple, repetitive patterns to highly complex and chaotic structures. Despite their simple rule-based definitions, certain ECAs produce patterns of significant complexity, making them ideal for examining the relationship between intelligence and complexity.
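As a quick illustration (not taken from the paper’s code), an ECA update can be implemented in a few lines: the rule number’s eight bits act as a lookup table over the eight possible three-cell neighborhoods. Periodic (wrap-around) boundaries are an assumption here, since the paper’s exact setup isn’t reproduced.

```python
import numpy as np

def eca_step(state: np.ndarray, rule: int) -> np.ndarray:
    """Apply one step of an elementary cellular automaton.

    `state` is a 1-D binary array; `rule` is the Wolfram rule number (0-255),
    whose 8 bits give the next cell value for each 3-cell neighborhood.
    Periodic (wrap-around) boundaries are assumed here.
    """
    left = np.roll(state, 1)
    right = np.roll(state, -1)
    # Encode each neighborhood (left, center, right) as an integer 0-7,
    # then look up the corresponding bit of the rule number.
    neighborhood = (left << 2) | (state << 1) | right
    rule_table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    return rule_table[neighborhood]

# Example: evolve Rule 110 (a Class IV, "complex" rule) for a few steps.
state = np.random.randint(0, 2, size=100, dtype=np.uint8)
for _ in range(5):
    state = eca_step(state, rule=110)
```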
The paper begins by describing some “complexity measures” that have been proposed over the years to assess the behavior of dynamical systems:
Lempel-Ziv Complexity assesses the compressibility of a sequence by counting the number of unique substrings in the sequence.
Compression Complexity quantifies how effectively a sequence can be compressed using a data compression algorithm such as Zlib.
Lyapunov Exponent gauges a system’s sensitivity to initial conditions. Higher Lyapunov exponents indicate that small variations in initial states result in rapidly diverging outcomes. The authors adopt the method proposed by Wolf (1986) for computing this metric.
Krylov Complexity evaluates how information propagates in a system’s Hilbert space, measuring how quickly an operator spans larger regions of the state space over time.
Wolfram Classification groups ECA rules into four classes (I: uniform, II: periodic, III: chaotic, IV: complex) based on their behavior and complexity.
For most analyses, the authors focus on Lempel-Ziv Complexity and the Wolfram Classification; however, it should be noted that the measures are generally correlated with each other.
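To make the first measure concrete, here is a minimal sketch of an LZ78-style phrase-counting estimate of Lempel-Ziv complexity. The paper’s exact variant may differ, but the idea is the same: structured sequences parse into few distinct phrases, while random sequences parse into many.

```python
import random

def lempel_ziv_complexity(bits: str) -> int:
    """Estimate Lempel-Ziv complexity as the number of distinct phrases
    in an LZ78-style left-to-right parse of a binary string.
    (The paper may use a slightly different variant of the measure.)
    """
    phrases, current = set(), ""
    for symbol in bits:
        current += symbol
        if current not in phrases:
            phrases.add(current)
            current = ""
    return len(phrases) + (1 if current else 0)  # count any trailing partial phrase

random.seed(0)
periodic = "01" * 500
chaotic = "".join(random.choice("01") for _ in range(1000))
print(lempel_ziv_complexity(periodic))  # low: highly compressible, few distinct phrases
print(lempel_ziv_complexity(chaotic))   # high: few repeated substrings to exploit
```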
An overview of the training process and task evaluations used by the authors is provided in Figure 1 below.
First, the authors describe the data generation process. They trained their models by simulating a set of ECA rules. Each simulation produced a sequence of binary vectors, with each vector representing the system's state at a given time step. The process began with a randomly initialized vector as the automaton's initial state, and the system was evolved over 1,000 time steps through the repeated application of the selected ECA rule. This generated a sequence of binary vectors that reflected the system's evolving dynamics over time. To increase the diversity of the training data, random spatiotemporal windows were extracted from the full sequences, specifically selecting subsequences of 60 time steps and 100 spatial dimensions. The models were trained to predict either 1 or 5 steps into the future to introduce varying levels of difficulty into the task.
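A sketch of what this data pipeline could look like, reusing the `eca_step` function from earlier. The overall lattice width and the exact way targets are constructed are my assumptions; only the window size (60 time steps by 100 spatial dimensions), the 1,000-step simulation, and the 1- or 5-step prediction horizon come from the paper.

```python
import numpy as np

def generate_training_windows(rule: int, width: int = 512, steps: int = 1000,
                              t_window: int = 60, x_window: int = 100,
                              horizon: int = 1, n_windows: int = 64,
                              rng: np.random.Generator | None = None):
    """Simulate one ECA rule and cut random spatiotemporal windows from it.

    Returns (inputs, targets), where each input is a (t_window, x_window)
    binary array and each target is the same window shifted `horizon` steps
    into the future. The lattice width (512 here) is an assumption; the paper
    specifies only the 60x100 window size and the 1- or 5-step horizon.
    """
    rng = rng or np.random.default_rng()
    state = rng.integers(0, 2, size=width, dtype=np.uint8)
    history = [state]
    for _ in range(steps - 1):
        state = eca_step(state, rule)          # from the ECA sketch above
        history.append(state)
    history = np.stack(history)                # shape: (steps, width)

    inputs, targets = [], []
    for _ in range(n_windows):
        t0 = rng.integers(0, steps - t_window - horizon)
        x0 = rng.integers(0, width - x_window)
        inputs.append(history[t0:t0 + t_window, x0:x0 + x_window])
        targets.append(history[t0 + horizon:t0 + horizon + t_window, x0:x0 + x_window])
    return np.stack(inputs), np.stack(targets)
```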
Next, the authors describe the training procedure of the GPT-2 models. They employed a modified version of the GPT-2 architecture, adapted to handle binary input and output data for next-token prediction. Instead of the traditional token embedding layer followed by a softmax function over a vocabulary, a linear projection layer was used to map binary vectors into the model's embedding space. The GPT-2 model then processed these embeddings to capture temporal patterns and dependencies in the sequences. At the output stage, another linear projection layer was applied to map the model's hidden states back to the dimensionality of the binary data, enabling the prediction of the next state at each time step. This adaptation allowed the GPT-2 model to handle binary data directly, without relying on a predefined vocabulary, ensuring deterministic behavior in line with the rules governing ECAs.
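A minimal sketch of this adaptation using the Hugging Face GPT-2 backbone: a linear layer maps each binary state vector into the embedding space, and another linear layer maps hidden states back to per-cell predictions. The model sizes and the binary cross-entropy objective shown here are illustrative assumptions rather than the authors’ exact configuration.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2Model

class BinaryGPT2(nn.Module):
    """GPT-2 backbone with linear projections in place of token embeddings,
    so it can consume and predict binary state vectors directly.
    (A sketch following the paper's description; sizes are illustrative.)
    """
    def __init__(self, state_dim: int = 100, n_embd: int = 768,
                 n_layer: int = 12, n_head: int = 12):
        super().__init__()
        config = GPT2Config(vocab_size=1, n_embd=n_embd,
                            n_layer=n_layer, n_head=n_head)
        self.input_proj = nn.Linear(state_dim, n_embd)   # binary vector -> embedding
        self.backbone = GPT2Model(config)
        self.output_proj = nn.Linear(n_embd, state_dim)  # hidden state -> next-state logits

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, time, state_dim) binary values as floats
        hidden = self.backbone(inputs_embeds=self.input_proj(states)).last_hidden_state
        return self.output_proj(hidden)                  # per-cell logits for the next state

# Illustrative objective: binary cross-entropy between the prediction at step t
# and the true state at step t+1 (or t+5 for the harder horizon).
model = BinaryGPT2()
x = torch.randint(0, 2, (4, 60, 100)).float()
logits = model(x)            # shape: (4, 60, 100)
loss = nn.functional.binary_cross_entropy_with_logits(logits[:, :-1], x[:, 1:])
```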
Finally, concerning the pre-training setup, the models were pretrained using next-token prediction on data generated by individual ECA rules, with training running for up to 10,000 epochs. Early stopping based on validation loss was used to prevent overfitting. The training data was organized into batches of 64 sequences, each comprising 60 time steps and 100 spatial dimensions. The Adam optimizer was used with an initial learning rate of 2 × 10⁻⁶ and a weight decay of 0.01. The learning rate followed a linear warm-up for the first 10% of the total training steps and a cosine annealing schedule thereafter. Gradient accumulation was used to achieve larger effective batch sizes within GPU memory constraints, and gradient clipping with a maximum norm of 1.0 was applied to prevent exploding gradients.
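Putting the stated hyperparameters together, the optimization loop could look roughly like the sketch below. Only the learning rate, weight decay, warm-up fraction, cosine schedule, batch shape, and clipping norm come from the paper; the step count, accumulation factor, and dummy data loader are placeholders.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = BinaryGPT2()  # reuses the BinaryGPT2 sketch above
total_steps = 10_000  # placeholder for the actual number of optimizer steps
accum_steps = 4       # placeholder accumulation factor
optimizer = torch.optim.Adam(model.parameters(), lr=2e-6, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warm-up over the first 10% of steps
    num_training_steps=total_steps,           # then cosine annealing
)

# Dummy stand-in for batches of 64 windows (60 time steps x 100 cells).
loader = [(torch.randint(0, 2, (64, 60, 100)).float(),) * 2 for _ in range(4)]

for step, (inputs, targets) in enumerate(loader):
    logits = model(inputs)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
    (loss / accum_steps).backward()                       # gradient accumulation
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```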
Now let’s move on to the experiments. The empirical study was designed to evaluate the relationship between the complexity of ECA rules and the intelligence exhibited by models trained on data generated from those rules. The models were tested on various downstream tasks, including reasoning and chess move prediction, to assess how pre-training on ECA rules of varying complexity impacted performance.
For their downstream tasks, the authors took inspiration from the ARC benchmark to evaluate the models’ problem-solving and reasoning abilities. Their approach uses sequence completion problems that require the model to infer transformation rules from provided examples and apply them to novel scenarios. The tasks are further split into ‘easy’ and ‘hard’ variants; a toy illustration of the format is sketched below.
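To give a flavor of what such a sequence completion problem might look like, here is a made-up toy example in the spirit of ARC-style tasks, not one of the paper’s actual evaluation items.

```python
# Purely hypothetical illustration of the sequence-completion format: the model
# is shown a few input -> output demonstrations of an unknown transformation
# and must produce the output for a new input.
task = {
    "train": [
        {"input": [0, 1, 1, 0], "output": [1, 0, 0, 1]},  # hidden rule: flip every bit
        {"input": [1, 1, 0, 0], "output": [0, 0, 1, 1]},
    ],
    "test": {"input": [1, 0, 1, 0], "expected_output": [0, 1, 0, 1]},
}
```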
Figure 2 below presents the model performance across three downstream tasks (easy reasoning, hard reasoning, and chess move prediction) as a function of the complexity of the ECA rules the models were pretrained on. The top row highlights the relationship between performance and the Lempel-Ziv complexity of the rules, while the bottom row categorizes the performance by Wolfram’s complexity classes. For clarity, two representative rules from each complexity class are displayed on the left, with their corresponding performance annotated in the top plots.
In terms of Wolfram’s classification, rules from Classes I and II (uniform and periodic) show lower average performance on the reasoning tasks than those from Classes III and IV (chaotic and complex). Class IV (complex) rules in particular outperform the other classes on the chess move prediction task. This pattern suggests that models trained on more complex rules tend to perform better on harder downstream tasks. Results with respect to the other complexity measures are shown in Figure 3 below.
The authors also observe that models trained on certain Class III (chaotic) rules, such as Rules 105, 146, and 150, perform poorly on the hard reasoning and chess move prediction tasks. This is attributed to chaotic systems lacking the structured patterns necessary for effective learning; in other words, they may be too random to predict, leading to weaker downstream performance. These results point to a “sweet spot” of complexity conducive to intelligence, where the dynamics are structured enough to be learnable yet complex enough that prediction is non-trivial.
In closing, this paper contributes to a broader understanding of how intelligence may arise from exposure to complexity and offers a new perspective on the types of data and systems that could drive more advanced reasoning in AI systems. These insights also align with theories in cognitive science about human intelligence evolving to manage complex, unpredictable environments, drawing parallels between artificial and human cognition. In future work, this framework could be explored further by training larger LLMs on synthetic data generated by simple rule-based systems. Incorporating measures of complexity, such as those used in this study, could provide a valuable tool for prioritizing and curating data, ensuring that models are exposed to information with the right balance of structure and randomness.