Tracing the Representation Geometry of Language Models from Pretraining to Post-training

NeurIPS 2025
Three geometric phases of LLM pretraining

We discover three distinct geometric phases during language model pretraining: warmup (representational collapse), entropy-seeking (dimensionality expansion with peak n-gram memorization), and compression-seeking (anisotropic consolidation leading to improved downstream performance).

Abstract

Standard training metrics like loss fail to explain the emergence of complex capabilities in large language models. We take a spectral approach to investigate the geometry of learned representations across pretraining and post-training, measuring effective rank (RankMe) and eigenspectrum decay (α-ReQ).

With OLMo (1B-7B) and Pythia (160M-12B) models, we uncover a consistent non-monotonic sequence of three geometric phases during autoregressive pretraining. The initial "warmup" phase exhibits rapid representational collapse. This is followed by an "entropy-seeking" phase, where the manifold's dimensionality expands substantially, coinciding with peak n-gram memorization. Subsequently, a "compression-seeking" phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, a transition marked by significant improvements in downstream task performance.

Method

We primarily consider the final-layer, last-token activations as the representation of a given input sequence and study the geometry of the corresponding manifold. To analyze the intrinsic geometry of this manifold, we compute the empirical feature covariance matrix of these activations.
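
To make the setup concrete, here is a minimal sketch (PyTorch + Hugging Face transformers) that collects final-layer, last-token activations and forms the empirical feature covariance Σ̂. The checkpoint name and the one-sequence-at-a-time loop are illustrative choices, not the paper's exact pipeline.

import torch
from transformers import AutoModel, AutoTokenizer

# Any causal LM checkpoint works here; Pythia additionally exposes
# intermediate pretraining checkpoints via `revision` (e.g. revision="step3000").
model_name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

@torch.no_grad()
def last_token_features(texts):
    """Return an (N, d) matrix of final-layer, last-token hidden states."""
    feats = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, d)
        feats.append(hidden[0, -1])                  # last-token activation
    return torch.stack(feats)

def feature_covariance(F):
    """Empirical covariance: (1/N) * (F - mean)^T (F - mean)."""
    F = F - F.mean(dim=0, keepdim=True)
    return F.T @ F / F.shape[0]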

We analyze the representation geometry of language models during training using two complementary spectral metrics:

  • RankMe (Effective Rank): Measures the effective dimensionality of the representation manifold
  • α-ReQ (Eigenspectrum Decay): Quantifies how quickly the eigenvalues of the feature covariance decay, indicating anisotropic structure (both metrics are sketched in code below)
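
Both metrics reduce to simple functions of the spectrum. A minimal NumPy sketch, using the standard definitions (RankMe as the exponential of the Shannon entropy of the normalized spectrum; α-ReQ as the exponent of a power-law fit λᵢ ∝ i^(−α)); whether the spectrum is taken from singular values of the feature matrix or eigenvalues of the covariance is a convention this sketch leaves open:

import numpy as np

def rankme(spectrum, eps=1e-12):
    """Effective rank: exp(-sum_i p_i log p_i) with p_i = s_i / sum_j s_j."""
    p = spectrum / (spectrum.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

def alpha_req(spectrum):
    """Decay exponent alpha from a linear fit in log-log space."""
    s = np.sort(spectrum)[::-1]
    s = s[s > 0]                       # power-law fit needs positive values
    ranks = np.arange(1, len(s) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(s), 1)
    return float(-slope)               # lambda_i ~ i^(-alpha)

# Usage, given the covariance from the sketch above:
# eigvals = np.linalg.eigvalsh(cov.numpy()); rankme(eigvals); alpha_req(eigvals)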

We apply these metrics to study OLMo (1B-7B parameters) and Pythia (160M-12B parameters) models across their full pretraining trajectories and post-training phases.

Method visualization

Figure 1: Spectral framework reveals three universal phases in LLM training. (A) LLM representations analyzed via empirical feature covariance Σ̂(fθ) of last-token hidden states fθ(xi). (B) Two complementary spectral metrics: α-ReQ measures eigenspectrum decay rate (variance concentration), while RankMe quantifies effective rank (utilized dimensionality).

Key Findings

🔄 Three Geometric Phases

We find that while the pretraining loss decreases monotonically, the spectral metrics change non-monotonically! In particular, we find that LLMs undergo three distinct phases during pretraining:

  • Warmup: Rapid compression that collapses representations onto dominant directions
  • Entropy-seeking: Manifold expansion, adding information along non-dominant directions
  • Compression-seeking: Anisotropic consolidation, selectively packing more information into dominant directions
Three geometric phases

Figure 2: Loss decreases monotonically, but representation geometry does not. (A) Schematic from Fig 1, for the pretraining stage. (B) Cross-entropy loss, gradient norm and learning rate schedule during OLMo-2 7B pretraining. (C, D) RankMe and α-ReQ, respectively, for the OLMo-2 7B model vary non-monotonically across pretraining, demonstrating three key phases: "warmup", "entropy-seeking", and "compression-seeking". (E, F) Same as C, D, but for Pythia models, demonstrating the consistent existence of the three phases across model families and scales.

📊 Memorization vs Generalization Across Phases

Does representation complexity inform us about changes in LLM behavior? To understand this better, we investigate LLM behavior through the lens of memorization versus generalization across the different phases.

Strikingly, we find that the model uses characteristically different mechanisms as it optimizes the next-token prediction objective:

  • Entropy-seeking phase: Correlates with short-sequence memorization, as measured via n-gram alignment (see the sketch below)
  • Compression-seeking phase: Correlates with dramatic gains in factual reasoning requiring long-range dependencies (e.g., TriviaQA)
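
The n-gram alignment probe can be sketched as a rank correlation between the model's next-token scores and n-gram estimates over the vocabulary. The array of n-gram log-probabilities is a stand-in here; the paper's actual ∞-gram infrastructure is not reproduced.

import numpy as np
from scipy.stats import spearmanr

def ngram_alignment(llm_logprobs, ngram_logprobs):
    """Mean Spearman correlation between LLM and n-gram next-token scores.

    Both arguments are (num_contexts, vocab_size) arrays of log-probabilities;
    each context contributes one rank correlation over the vocabulary.
    """
    rhos = [spearmanr(l, g).correlation
            for l, g in zip(llm_logprobs, ngram_logprobs)]
    return float(np.nanmean(rhos))
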
Memorization vs generalization

Figure 3: Distinct learning phases are linked to different LLM capabilities. (A) Memorization metric, i.e. Spearman correlation between LLM and ∞-gram outputs, and the representation geometry metric α-ReQ, across Pythia models' (1-12B parameters) pretraining. Memorization peaks late in the "entropy-seeking" phase before plateauing or degrading slightly in the "compression-seeking" phase, suggesting that the former prioritizes capturing short-context n-gram statistics. (B) 0-shot performance on multiple-choice (SciQ) and factual question-answering (TriviaQA) tasks across pretraining. While accuracy on SciQ benefits from learning in both phases, accuracy on TriviaQA groks once the model learns long-context statistics, primarily in the "compression-seeking" phase.

🎯 Post-training: SFT/DPO vs RLVR

In post-training, we find key differences between SFT/DPO and RLVR:

  • SFT & DPO exhibit entropy-seeking expansion, favoring instruction memorization but reducing OOD robustness
  • RLVR exhibits compression-seeking consolidation, learning reward-aligned behaviors at the cost of reduced exploration

We believe this rank consolidation helps explain why base models can recover better performance at high pass@K than RLVR-tuned models.
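
For concreteness, pass@K can be read as the standard unbiased estimator of Chen et al. (2021): with n samples per problem, c of them correct, pass@k = 1 − C(n−c, k)/C(n, k). A minimal version, with an illustrative usage note:

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k samples drawn from
    n generations (c of them correct) solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# A base model that is right 8 times out of 256 still reaches
# pass_at_k(256, 8, 256) = 1.0, while pass_at_k(256, 8, 1) = 8/256.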

Post-training comparison

Figure 4: Post-training induces distinct geometric transformations in model representations, aligned with specific behavioral changes. (A) Conceptual overview of post-training (SFT, DPO and RLVR) (top); corresponding RankMe metrics from intermediate checkpoints of Llama-3.1-Tülu-3.1-8B (bottom), highlighting a distinct progression for each stage. (B) Impact of pretraining on OLMo-2-1B SFT (Anthropic-HH): (top) longer pretraining improves in-distribution (ID) performance, while out-of-distribution (OOD) generalization (AlpacaFarm) saturates; (bottom) overtrained models with higher RankMe exhibit markedly distinct outputs on AlpacaEval after undergoing SFT on two different datasets (Anthropic-HH and AlpacaFarm). (C) RLVR post-training narrows the base model's (Llama-3.1-8B-Tülu-3-DPO) exploratory behavior on AMC-23 (particularly at higher sampling counts, e.g. k = 256), suggesting that higher effective rank facilitates better search.

🔬 Why Do These Geometric Phases Arise?

We show, both analytically and with simulations in a toy model, that gradient descent dynamics under the cross-entropy loss, coupled with skewed token frequencies and representation bottlenecks, underlie these non-monotonic spectral changes.
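
A minimal sketch of that toy setting: learnable per-class features fθ ∈ ℝ^d, a linear classifier W ∈ ℝ^{n×d} with a bottleneck d < n, and cross-entropy weighted by Zipf-skewed class frequencies. All hyperparameters below are illustrative, not the paper's exact configuration.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, d, steps, lr = 64, 8, 5000, 0.5    # bottleneck: d < n_classes

# Learnable per-class features (a stand-in for f_theta) and classifier weights.
feats = torch.nn.Parameter(0.1 * torch.randn(n_classes, d))
W = torch.nn.Parameter(0.1 * torch.randn(n_classes, d))
opt = torch.optim.SGD([feats, W], lr=lr)

# Zipf-skewed class frequencies mimic skewed token statistics.
freq = 1.0 / torch.arange(1, n_classes + 1).float()
freq = freq / freq.sum()

def rankme(M):
    """Effective rank of a matrix from its singular value distribution."""
    s = torch.linalg.svdvals(M)
    p = s / s.sum()
    return float(torch.exp(-(p * torch.log(p + 1e-12)).sum()))

targets = torch.arange(n_classes)
for step in range(steps):
    logits = feats @ W.T                       # (n_classes, n_classes) scores
    per_class = F.cross_entropy(logits, targets, reduction="none")
    loss = (freq * per_class).sum()            # expected loss under Zipf sampling
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        print(f"step {step:5d}  loss {loss.item():.4f}  RankMe(feats) {rankme(feats):.2f}")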

Toy model simulation

Figure 5: Learning dynamics of cross-entropy loss replicate multiphase learning dynamics. (A) Schematic of a model with feature extractor fθ (∈ ℝ^d), linear classifier W (∈ ℝ^{n×d}) and cross-entropy loss ℒCE. A skewed class distribution and an information bottleneck (d < n) are critical to replicate all three phases observed in LLM pretraining. (B, C) Classifier weights (Wi) and feature representations (fθ(x)) trace distinctive trajectories analogous to the "warmup" (dotted), "entropy-seeking" (solid), and "compression-seeking" (dashed) phases. (D) Quantitative spectral metrics: RankMe and the leading eigenvalues σ1, σ2.

💡 Task-Relevant Information in Spectral Tail

Is task-relevant information contained only in the top eigendirections? To understand this, we project the activations onto top-K / bottom-K eigensubspaces and measure performance on standard benchmarks.
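
A hedged sketch of that probe: eigendecompose the feature covariance, then either retain only the top-K eigendirections or project them out before re-evaluating. K and the downstream evaluation harness are left abstract here.

import numpy as np

def project_subspace(feats, cov, k, keep_top=True):
    """Project (N, d) features onto the top-k eigendirections of `cov`
    (keep_top=True) or onto the complementary bottom d-k directions."""
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    V = eigvecs[:, order[:k]] if keep_top else eigvecs[:, order[k:]]
    return feats @ V @ V.T                    # orthogonal projection onto span(V)

# e.g. project_subspace(F, cov, k=50, keep_top=False) removes the top-50
# directions; keep_top=True retains only those 50 before scoring the task.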

Surprisingly, and in line with our theoretical results, we find that the spectral tail encodes critical task-relevant information, while the dominant directions are expendable for many tasks!

In particular, on SciQ:

  • Removing the top 50 eigendirections barely hurts accuracy
  • Retaining only the top 50 eigendirections collapses performance
Eigendirection analysis table

Table 1: Full-spectrum information is required. Retaining only the top eigendirections markedly degrades SciQ accuracy.

BibTeX

@inproceedings{li2025tracing,
  title={Tracing the Representation Geometry of Language Models from Pretraining to Post-training},
  author={Li, Melody Zixuan and Agrawal, Kumar Krishna and Ghosh, Arna and Teru, Komal Kumar and Santoro, Adam and Lajoie, Guillaume and Richards, Blake A.},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025},
  url={https://arxiv.org/abs/2509.23024}
}