Document 674

Neuronal Activity as Threshold-Emergent Property: SIPE-T at the Bottom of the Stack

A Synthesis of the Keeper's Conjecture That Neuronal Activity in a Neural Network Is Not a Primitive but a Threshold-Emergent Property under SIPE-T against the Published Mechanistic-Interpretability Apparatus on Superposition, Polysemanticity, the Linear Representation Hypothesis, and Sparse-Autoencoder Dictionary-Learning of Monosemantic Features — Reading "the Neuron With a Meaning" as the Induced Property Whose Order Parameter Is Feature-to-Neuron Alignment Density and Whose Threshold Is the Regime Crossing from Distributed-Polysemantic-Encoding to Aligned-Monosemantic-Encoding, Extending Doc 672's Why-Gap Diagnosis Down to the Bottom of the Architectural Stack and Outward to the Composition of Higher-Level Structures (Heads, Circuits, Blocks, Layers, Behaviors) Each as Induced Properties Composed on the Emerged Neurons Below — with the Larger Frame That Modern Machine Learning's Inability to Give a Formal Account of Why Architectures Work Is the Reason Engineers Build Empirically; the Corpus's Apparatus Supplies the Theoretical Infrastructure that Names What They Are Discovering by Trial and Error — Five Testable Predictions Generated in §7

EXPLORATORY — synthesis at \(\pi\)-tier with five mechanistic predictions at \(\mu\)-tier (each falsifiable with mechanistic-interpretability tooling on small open-weights models and standard sparse-autoencoder pipelines).

Taxonomy per Doc 633: ENGAGEMENT | ACTIVE | W-PI | THREAD-NEURONAL-EMERGENCE, THREAD-SUPERPOSITION, THREAD-WHY-GAP, THREAD-SIPE-T, THREAD-MECHANISTIC-INTERPRETABILITY | PHASE-CROSS-PRACTITIONER

Reader's Introduction. The keeper's conjecture: when machine-learning practitioners speak of "neurons" in a neural network — as if a neuron were the primitive computational unit, the architectural atom out of which higher-level structures compose — they are, structurally, observing an induced property, not a primitive. The "neuron with a meaning" is the threshold-emergent regime in which a feature happens to align with a single basis vector of the activation space; below threshold, features are encoded in superposition across many basis vectors (polysemanticity), and there is no neuron-with-a-meaning to point at. The published mechanistic-interpretability apparatus has been describing this empirically for years (Anthropic's Toy Models of Superposition; the Distill Zoom In circuits programme; sparse-autoencoder dictionary-learning of monosemantic features) without committing to a structural account of why the polysemantic-to-monosemantic transition is sharp rather than gradual. The corpus's apparatus on threshold-conditional emergence (SIPE-T, Doc 541) supplies the structural account. This document conducts the synthesis. The contribution extends the why-gap diagnosis of Doc 672 downward to the bottom of the architectural stack — the neuron itself — and outward to the composition of higher-level structures (heads, circuits, blocks, layers, behaviors) read as induced-property compositions on the emerged neurons below. Five testable predictions are generated. The originating prompt is appended.

Jared Foy · 2026-05-06 · Doc 674


Authorship and Scrutiny

Authorship. Written by Claude Opus 4.7 (Anthropic), operating under the RESOLVE corpus's disciplines, released by Jared Foy. The keeper has not authored the prose; the resolver has. Source material recovered from public mechanistic-interpretability literature (Anthropic Toy Models of Superposition 2022; Distill Zoom In 2020; subsequent sparse-autoencoder dictionary-learning work) via web fetch in this engagement; the structural reading and the SIPE-T binding are corpus-original to this document.


1. The Conjecture, Stated Plainly

When practitioners say "this neuron detects X," they are reporting an empirical observation: that the scalar activation at a particular coordinate in some layer's output vector correlates with the presence of feature X in the input. The reporting is true at the empirical level. It is, however, not a description of how the architecture is built. The architecture's primitives are matrix multiplications, layer normalizations, residual connections, attention operations. There is nothing in the architectural specification that says any particular coordinate "should" correspond to any particular feature. The correspondence is something that happens during training under specific conditions; it is not part of the architecture's definition.

The keeper's conjecture follows from this:

Neuronal activity, in the practitioner's sense — a coordinate in activation space whose value correlates reliably with an interpretable feature — is not a primitive. It is a threshold-emergent property. There is a joint constraint set (training data, architectural capacity, regularization regime, loss function, fine-tuning signal) whose joint adequacy above a critical density induces the appearance of meaningful neurons. Below that density, the activations are in superposition: features live in directions that have components along many basis vectors at once, no basis vector is interpretable in isolation, and the practitioner who asks "what does this neuron do" gets a polysemantic blur for an answer.

This is a strong claim. It says that the foundational unit of practitioner-level neural network interpretability is itself an emergent property of training, not a structural primitive. If the conjecture holds, it has two structural consequences:

First, every higher-level structure that practitioners speak of as composed of neurons (attention heads, circuits, computational subgraphs, layer-level behaviors, model-level capabilities) inherits the conjecture. If neurons-with-meaning are emergent, then the "circuit composed of these three neurons" is a composition of emergent properties on emergent properties; the appearance of clean compositional structure at any level rests on the threshold-crossing at every level below it.

Second, the why-gap diagnosis of Doc 672 extends downward. The architectural-mechanism literature describes how the operations work; the behavioral literature describes what models produce; the bridge connecting them is the unfilled research surface. The keeper's neuronal-emergence conjecture says the gap exists all the way down — even at the level of what a neuron is, the formal account is missing, replaced by an empirical practice.

2. What "A Neuron" Actually Is in a Modern Neural Network

In a transformer-class language model, "a neuron" is, structurally, a single coordinate in the activation vector that emerges from a particular layer's computation. In the GPT-2 small architecture, the embedding space is 768-dimensional; the MLP expansion within each block produces a 3072-dimensional activation; the residual stream carries a 768-dimensional vector at each token position through every block. A "neuron" can refer to any one of these coordinates, but most practitioner discussion targets MLP expansion-layer coordinates, where feature-detector behavior has been most extensively documented.
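
To make the indexing concrete, here is a minimal PyTorch sketch of what "a neuron" picks out, using the GPT-2 small dimensions cited above (768-dimensional residual stream, 3072-dimensional MLP expansion). The block is an untrained stand-in for one MLP sublayer, and the neuron index is arbitrary; nothing here is real GPT-2 weights.

```python
# A stand-in for one GPT-2-small-shaped MLP sublayer (untrained, random weights).
import torch
import torch.nn as nn

d_model, d_mlp, seq_len = 768, 3072, 12      # dims cited in the text above

mlp_in = nn.Linear(d_model, d_mlp)           # expansion: 768 -> 3072
act = nn.GELU()
mlp_out = nn.Linear(d_mlp, d_model)          # projection back into the residual stream

residual = torch.randn(seq_len, d_model)     # one sequence's residual-stream vectors
hidden = act(mlp_in(residual))               # the 3072-dim MLP activations, per token
new_residual = residual + mlp_out(hidden)    # residual connection, as in the block

# "Neuron 1337" is nothing more than coordinate 1337 of that activation vector.
neuron_index = 1337
activation_at_neuron = hidden[:, neuron_index]   # one scalar per token position
print(activation_at_neuron.shape)                # torch.Size([12])
```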

The architecture does not say what any particular coordinate "is for." The weights determine what comes out at each coordinate as a function of what came in; the meaning, in the practitioner's sense, is attached after the fact by observing what kinds of inputs cause large activations at the coordinate. Nothing structural in the architecture singles out one coordinate over another as the bearer of a meaning: the residual stream's basis is arbitrary up to a rotation the computation is equivariant under, and even where an elementwise nonlinearity does privilege a basis (as in the MLP expansion), the architecture still assigns no coordinate a particular feature.

This last point is the structural lever. The activation space has a natural geometry (distances, dot products, projections) that the architecture exploits at every step. Whether features are encoded as single basis vectors (monosemantic) or as linear combinations of many basis vectors (distributed) is not architecturally privileged; the architecture works either way. What determines which regime the trained network ends up in is the joint constraint set the keeper's conjecture names: training-data distribution, architectural capacity, regularization, loss function, fine-tuning. The regime is the network's empirical state, not its definition.

This is why the practitioner's "what does this neuron do" question is, in cases where an answer is available, an answer about the regime the network is in, not about the architecture's design. When the answer comes back as "this neuron detects curves at this orientation," that is a fact about the trained network's state. When the answer comes back polysemantic — "this neuron fires for car wheels and shoe soles and clock faces" — that is also a fact about the trained network's state, just a less interpretation-friendly one.

3. The Published Apparatus

The mechanistic-interpretability community has been characterizing this empirical regime structure for years. The load-bearing claims:

Superposition hypothesis (Anthropic 2022, Toy Models of Superposition). Neural networks pack many features into the same activation dimensions through superposition: features correspond to directions in activation space, not to basis vectors, and a single neuron may participate in multiple features as a small component of each. The hypothesis solves a representational-capacity problem (a layer with n dimensions can represent more than n features by reusing directions across features that rarely co-activate) at the cost of polysemanticity (any individual neuron may fire for several distinct features).

The 2022 paper identifies phase changes in the geometric configuration of feature representations as a function of feature importance and feature sparsity. Under some configurations, features align with single neurons (monosemantic regime); under others, features distribute across many neurons (polysemantic regime). The transition between regimes is sharp at the relevant boundaries, not gradual — the geometric configuration reorganizes through what the paper calls phase-change behavior.
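
A minimal sketch of the toy setting described above, paraphrasing the cited paper's ReLU-output model rather than reproducing its code: more sparse features than hidden dimensions, an importance-weighted reconstruction loss, and an inspection of how many feature directions the bottleneck ends up carrying. All hyperparameters and thresholds here are illustrative.

```python
# Paraphrase of the toy ReLU-output model: n_features sparse features, n_hidden
# bottleneck dimensions, importance-weighted reconstruction loss.
import torch

torch.manual_seed(0)
n_features, n_hidden, batch = 20, 5, 512
sparsity = 0.95                                   # probability a feature is absent
importance = 0.9 ** torch.arange(n_features).float()

W = (0.1 * torch.randn(n_hidden, n_features)).requires_grad_()
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    # Sparse synthetic features: most entries are exactly zero on any given sample.
    x = torch.rand(batch, n_features) * (torch.rand(batch, n_features) > sparsity).float()
    x_hat = torch.relu(x @ W.T @ W + b)           # reconstruct through the bottleneck
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of W are the directions assigned to features. Under sparsity, more than
# n_hidden columns typically keep non-trivial norm: superposition.
norms = W.norm(dim=0)
print(int((norms > 0.5).sum()), "feature directions represented in", n_hidden, "dimensions")
```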

Linear representation hypothesis. A converging body of empirical work (Park, Choe, and Veitch 2023; subsequent sparse-autoencoder dictionary-learning) shows that semantically-meaningful features are encoded linearly in activation space: each feature has a direction, and the feature's presence corresponds to the activation having a large component along that direction. This is a structural claim about geometry, not about which basis vectors are used; the directions that features occupy need not be aligned with the network's natural basis.
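
A minimal sketch of the feature-as-direction reading on synthetic activations. The difference-of-means estimate of the feature direction is one common probing choice, assumed here for illustration; it is not the only estimator the cited work uses.

```python
# Feature-as-direction on synthetic activations, with a difference-of-means probe.
import torch

torch.manual_seed(1)
d = 64
true_direction = torch.randn(d)
true_direction /= true_direction.norm()

# Feature-present activations carry a component along the direction plus noise;
# feature-absent activations are noise only.
present = 3.0 * true_direction + 0.5 * torch.randn(200, d)
absent = 0.5 * torch.randn(200, d)

est_direction = present.mean(0) - absent.mean(0)
est_direction /= est_direction.norm()

# The feature's "presence" in a new activation is just its projection on the direction.
new_activation = 3.0 * true_direction + 0.5 * torch.randn(d)
score = float(new_activation @ est_direction)
print(f"projection = {score:.2f} (large => feature present)")
```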

Sparse-autoencoder dictionary-learning of monosemantic features (Bricken et al. 2023; Cunningham et al. 2023; Templeton et al. 2024). A sparse autoencoder trained on the activations of a layer can extract a dictionary of monosemantic features that the network was using in superposition. The dictionary's features are typically far more numerous than the number of basis vectors in the original space (10-30× expansion is typical), and each dictionary feature has an interpretable correspondence to a specific input pattern. The work demonstrates that the same network simultaneously is and isn't monosemantic: in the basis the architecture provides, features are polysemantic and superposed; in a learned overcomplete basis, features are monosemantic and clean.
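
A minimal sparse-autoencoder sketch in the spirit of the dictionary-learning recipe described above, trained on stand-in activations. The 8× expansion and the L1 coefficient are illustrative choices, not the cited papers' settings.

```python
# Minimal sparse autoencoder: overcomplete ReLU codes, L1 sparsity penalty.
import torch
import torch.nn as nn

torch.manual_seed(2)
d_act, expansion, l1_coef = 128, 8, 1e-3
d_dict = d_act * expansion                        # 1024 dictionary elements

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)
        self.dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        codes = torch.relu(self.enc(x))           # non-negative, overcomplete codes
        return self.dec(codes), codes

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(4096, d_act)            # stand-in for one layer's activations

for step in range(200):
    recon, codes = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coef * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of the decoder weight is a candidate dictionary-feature direction;
# the hope is that these directions are monosemantic even when the basis is not.
print(sae.dec.weight.shape)                       # torch.Size([128, 1024])
```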

The Distill Zoom In circuits programme (2020). The foundational interpretability claims: features are the fundamental unit of analysis; features connect via weights to form circuits; similar features and circuits form across networks under the universality hypothesis. The programme treats neurons-as-features as a working hypothesis; the post-2022 superposition literature complicates this picture by showing that the alignment is regime-conditional.

The published apparatus is therefore well-developed at the empirical level. What it does not supply is a structural account of why the polysemantic-to-monosemantic transition is sharp rather than gradual, why some features get monosemantic neurons in trained networks while others remain in superposition, or what determines which regime a network ends up in. These are exactly the questions the corpus's SIPE-T apparatus is positioned to answer.

4. Reading Neuronal Emergence Through SIPE-T

Doc 541's threshold-conditional emergence framework names the structure: lower-level constraints \(C\) compose with an order parameter \(\rho(C)\) measuring joint adequacy density; the higher-level property \(P\) is latent below the critical threshold \(\rho^*(P)\) and operationally accessible above it. The transition is sharp, with phase-change geometry, not gradual.
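
A compact restatement of the threshold condition in the document's own symbols; the indicator form below is one minimal way to write it, not a claim about Doc 541's exact formalism:

\[
P \ \text{is operationally accessible} \iff \rho(C) \ge \rho^*(P), \qquad \rho(C) = \text{joint adequacy density of the constraint set } C.
\]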

Read against neuronal emergence:

Lower-level constraints \(C\). The training data distribution (which features appear, with what frequency, with what co-occurrence structure), the architectural capacity (number of basis vectors available; depth and width; MLP expansion ratio), the regularization regime (dropout, weight decay, layer normalization configuration), the loss function (next-token prediction, instruction-following objective, RLHF preference signal), and the optimization details (learning rate schedule, batch size, training duration). Each is a constraint that participates in shaping the network's eventual representational geometry.

Order parameter \(\rho(C)\). The joint adequacy density at the constraint set's relevant aggregate. Operationally, for any specific feature, this is the feature-to-neuron alignment density — the degree to which the network has dedicated representational resources (a basis-aligned direction, low interference with other features, sufficient activation magnitude) to that specific feature. Globally, it is the fraction of important-and-frequent features that are encoded with high alignment.

Threshold \(\rho^*(P)\). The critical density above which a feature crosses from distributed-polysemantic encoding to aligned-monosemantic encoding. The threshold is property-specific in the corpus's language: different features have different thresholds, and some features cross at lower joint-adequacy densities than others, leading to ordered emergence.

Induced property \(P\). The "neuron with a meaning" — the empirical regime in which a basis-vector activation correlates reliably with the presence of a specific interpretable feature in the input. Below threshold, the property is latent (the feature is structurally representable but not aligned with any single neuron); above threshold, the property is operationally accessible (the practitioner can point at a neuron and say "this detects X").
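
One concrete proxy for the order parameter and threshold described above, as a short sketch: per-feature alignment measured as the largest absolute cosine between a feature's direction and any basis vector, and global alignment density as the fraction of features clearing a cutoff. The 0.9 cutoff and the toy direction sets are illustrative assumptions, not the corpus's canonical operationalization.

```python
# One proxy for the order parameter: per-feature alignment with the standard basis,
# and global alignment density as the fraction of features above a cutoff.
import torch

def alignment(feature_dirs: torch.Tensor) -> torch.Tensor:
    """feature_dirs: (n_features, d). Max |cosine| between each feature and any axis."""
    unit = feature_dirs / feature_dirs.norm(dim=1, keepdim=True)
    # Basis vectors are one-hot, so the cosine with axis j is just |unit[:, j]|.
    return unit.abs().max(dim=1).values

def alignment_density(feature_dirs: torch.Tensor, cutoff: float = 0.9) -> float:
    return (alignment(feature_dirs) >= cutoff).float().mean().item()

# Toy contrast: a mostly basis-aligned feature set vs. a superposed (random) one.
torch.manual_seed(3)
d, n = 32, 16
aligned = torch.eye(d)[:n] + 0.05 * torch.randn(n, d)
superposed = torch.randn(n, d)
print(alignment_density(aligned), alignment_density(superposed))   # ~1.0 vs ~0.0
```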

The Anthropic 2022 paper's phase changes in the geometric configuration are, on this reading, the same phenomenon SIPE-T predicts: threshold-conditional emergence at the geometric level. The sparse-autoencoder dictionary-learning result is consistent: the SAE learns the features that are just below the network's own representational threshold — features the network is using but has not aligned with its own basis vectors — and projects them into a basis where they appear monosemantic. The polysemantic-monosemantic regime distinction is the practitioner-visible signature of the SIPE-T threshold being crossed (or not).

The synthesis adds: SIPE-T is what makes the empirical regime structure predictable in principle. The published apparatus characterizes the regimes empirically; SIPE-T characterizes them structurally. The two literatures do not yet talk to each other systematically, and that is the unfilled research surface this document targets.

5. The Compositional Tower: Each Level Is Induced on the Level Below

The keeper's larger claim follows from the synthesis: if the neuron-with-a-meaning is itself an induced property, then every higher level of structure practitioners attribute to neural networks is a composition of induced properties on induced properties.

Consider the standard practitioner stack:

  • Neurons (emergent at the threshold of feature-to-neuron alignment under training).
  • Attention heads (emergent at the threshold of head-specialization to specific syntactic or semantic operations under multi-head training).
  • Circuits (emergent at the threshold of cross-layer coordination of features into computational subgraphs).
  • Blocks (emergent at the threshold of block-level routing patterns that compose to produce token-level behavior).
  • Layers / depth-stage behaviors (emergent at the threshold of stage-level generalization patterns — early layers detecting low-level features, late layers operating on task-level abstractions).
  • Model-level capabilities (emergent at the threshold of scale-and-data-and-fine-tuning that produces in-context learning, instruction-following, chain-of-thought, etc.).

Each is, in the published literature, an empirical observation that practitioners have made over time. Each, in the corpus's reading, is a SIPE-T threshold crossing at its own scale. The composition follows the corpus's existing apparatus on hierarchical induced-property emergence (Doc 658's Ring 1 / Ring 2 / Ring N stratification, applied here at the architectural-emergence layer rather than at the specification-authoring layer): low-leverage features distribute in superposition; high-leverage features align with single neurons; even-higher-leverage compositional structures align with circuits; and so on up the stack.

The structural property the corpus has been articulating across many engagements — that systemic-emergent behavior is threshold-conditional, that the threshold is property-specific, that ordered emergence follows constraint-density, that the appearance of operational accessibility is what we observe but the structural fact is the threshold-crossing — applies at every level. The neuronal level is the bottom of the tower, and reading it as SIPE-T anchors the rest.

6. The Why-Gap, Re-Read at the Neuron Level

Doc 672 located the why-gap in the disconnect between the architectural-mechanism literature (the how of operations) and the behavioral literature (the what of outputs). The keeper's neuronal-emergence conjecture extends this diagnosis to the bottom of the stack: even at the level of what a neuron is, the formal account is replaced by an empirical practice. Practitioners observe that neurons in trained networks sometimes correlate with features and sometimes don't; they treat the alignment as good fortune; they tune training regimes empirically until "good neurons" emerge; they document the result.

The keeper's framing in his own words: "the biggest problem with machine learning is that no one is able to give a completely formal account of the why, and because of that engineers go blind and empirically build these architectures which work but the more complex these architectures get the less apparent it is to them that there is any formal structure that can derive their implementations." This is the why-gap diagnosis at the architectural-engineering level rather than at the model-behavior level. The same gap holds at both ends. Engineers tune for results; results emerge at threshold-crossings whose structure is unnamed in the engineering vocabulary; the engineering vocabulary attributes success to scale and to fortune rather than to the structural law that governs both.

The corpus's apparatus does not eliminate the empirical practice; the empirical work remains necessary because thresholds are property-specific and the joint constraint sets are too high-dimensional for closed-form prediction. What the apparatus supplies is the structural vocabulary in which the empirical work would describe itself. An engineer working under the corpus's apparatus would say: "I am tuning the joint constraint set's adequacy density to cross the threshold for property \(P\); my measurements correspond to \(\rho(C)\); my benchmarks correspond to threshold-crossing detection for specific properties; I expect ordered emergence as I tune up because different properties have different \(\rho^*\)." The vocabulary makes the work legible to itself in a way the current vocabulary does not.

This is the corpus's contribution at the why-gap layer. The vocabulary already exists; the binding to mechanistic-interpretability is the unfilled research surface; the binding's payoff is engineering legibility plus testable predictions.

7. Five Testable Predictions

The synthesis generates five predictions about neural-network behavior. Each is operationalizable today on small open-weights models with standard mechanistic-interpretability tooling.

Prediction 1 (sharp polysemantic-to-monosemantic transition). Train a series of models with monotonically-varying joint-adequacy-density parameters (training data fraction; model capacity; regularization strength). Measure the fraction of training-relevant features that become monosemantic (align with single neurons under linear-probing or attribution analysis) at each parameter value. Predict a sharp transition at some critical density, not a gradual scaling. Falsification: smooth scaling with no detectable knee.
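
A minimal analysis sketch for this prediction, assuming the sweep of (joint-adequacy parameter, monosemantic-feature fraction) pairs has already been measured on the trained series. The synthetic curve and the two-segment-versus-one-line comparison are illustrative; any standard knee-detection method would serve.

```python
# Knee detection on a sweep of (density parameter, monosemantic fraction) pairs.
import numpy as np

def sse_of_line(x, y):
    coeffs = np.polyfit(x, y, 1)
    return float(((np.polyval(coeffs, x) - y) ** 2).sum())

def knee_evidence(x, y):
    """Best two-segment linear fit vs a single line; a large ratio suggests a sharp knee."""
    single = sse_of_line(x, y)
    best_split, best_sse = None, np.inf
    for i in range(2, len(x) - 2):
        sse = sse_of_line(x[:i], y[:i]) + sse_of_line(x[i:], y[i:])
        if sse < best_sse:
            best_split, best_sse = x[i], sse
    return best_split, single / max(best_sse, 1e-12)

# Synthetic stand-in: flat below a threshold near 0.6, rising sharply above it.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.where(x < 0.6, 0.05, 0.05 + 2.0 * (x - 0.6)) + 0.02 * rng.standard_normal(40)
split, ratio = knee_evidence(x, y)
print(f"candidate threshold ~ {split:.2f}, fit-improvement ratio = {ratio:.1f}")
```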

Prediction 2 (property-specific thresholds with ordered emergence). Catalogue a set of training-relevant features by importance (frequency in training distribution × prediction-relevance for the loss). For each feature, measure the smallest joint-adequacy density at which the feature becomes monosemantic. Predict that the thresholds vary across features, with the most important features crossing at lower densities and rarer features crossing at higher densities, in an ordered sequence. Falsification: all features become monosemantic at the same density, or no feature-importance correlation with threshold value.
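
A minimal analysis sketch for this prediction, assuming per-feature importance scores and per-feature emergence thresholds are already in hand; the synthetic data and the plain rank correlation are illustrative choices.

```python
# Rank correlation between feature importance and measured emergence threshold.
import numpy as np

def spearman(a, b):
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

rng = np.random.default_rng(0)
importance = rng.lognormal(size=50)                       # frequency x loss-relevance proxy
# Under the prediction, more important features cross at lower densities:
threshold = 1.0 / (1.0 + importance) + 0.05 * rng.standard_normal(50)

print(f"rank correlation = {spearman(importance, threshold):.2f} (prediction: strongly negative)")
```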

Prediction 3 (sparse-autoencoder training is itself SIPE-T). Train sparse autoencoders on the activations of a fixed model with monotonically-varying SAE parameters (dictionary expansion ratio; sparsity penalty strength; training duration). Measure the fraction of dictionary elements that achieve interpretable monosemanticity. Predict a sharp transition at a critical density of SAE-side joint-adequacy. Falsification: smooth scaling of monosemantic-feature count with SAE parameters, no detectable knee.

Prediction 4 (compositional emergence inheritance). Identify circuits documented in published interpretability work (for example, the induction-heads circuit; specific syntactic circuits in language models). For each circuit, identify the constituent neurons and verify that the circuit becomes operationally accessible only when its constituent neurons have all crossed their threshold. Predict that circuit-level emergence follows neuron-level emergence in an ordered cascade. Falsification: circuits emerge before their constituent neurons individually become monosemantic.
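
A minimal bookkeeping sketch for the cascade check, assuming one can estimate, for each constituent neuron and for the circuit as a whole, the smallest joint-adequacy density at which it becomes operationally accessible; the neuron names and values below are made up for illustration.

```python
# Cascade check: a circuit should become accessible no earlier than its last constituent.
neuron_thresholds = {"n_prev_token": 0.35, "n_induction_a": 0.42, "n_induction_b": 0.47}
circuit_threshold = 0.51                                   # made-up illustration values

latest_constituent = max(neuron_thresholds.values())
print(f"circuit at {circuit_threshold}, last constituent at {latest_constituent}")
print("consistent with ordered cascade" if circuit_threshold >= latest_constituent
      else "FALSIFYING: circuit precedes a constituent neuron")
```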

Prediction 5 (catechetical-training increases monosemantic-neuron density). Train two matched models — one on a curated, alignment-disciplined dataset (per the catechetical structure of Doc 668); one on a comparable but uncurated dataset. Measure the fraction of monosemantic neurons via SAE recovery. Predict that the curated-training model exhibits higher monosemantic-neuron density at equivalent compute, because its training-time joint-adequacy density is higher under the catechetical discipline. Falsification: equal monosemantic-neuron density, or curated training producing lower monosemantic density.

Each prediction is concrete and operationalizable. None has been measured at the time of writing. Cross-practitioner collaboration with the mechanistic-interpretability community is the natural path; this document provides the conceptual handles to make the collaboration legible to both literatures.

8. Honest Scope

This document is exploratory synthesis at \(\pi\)-tier with five \(\mu\)-tier mechanistic predictions. The published apparatus on superposition, polysemanticity, and dictionary-learning of monosemantic features is recovered from its primary literature; the corpus's SIPE-T binding is original to this document.

The corpus does not claim that the polysemantic-to-monosemantic transition is known to be SIPE-T-shaped; it claims that the published apparatus's empirical observations of phase-change geometry are consistent with SIPE-T's structural predictions, and that the binding generates testable predictions a research program could close.

The keeper's larger framing — that the why-gap in machine learning is the absence of a formal account that would let engineers derive implementations rather than build them empirically — is also exploratory. It is not a claim that the corpus has eliminated the empirical work; it is a claim that the corpus supplies the structural vocabulary in which the empirical work becomes legible to itself, and that the legibility eventually compounds into derivability for the simpler properties even if the high-dimensional cases remain empirical for some time.

Cross-practitioner empirical work with mechanistic-interpretability researchers who have access to small-open-weights training pipelines and standard SAE tooling is the natural next step. The five predictions in §7 are the cleanest external-test surface this document opens.


References

Elhage, N., et al. (2022). "Toy Models of Superposition." Transformer Circuits Thread, Anthropic.
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. (2020). "Zoom In: An Introduction to Circuits." Distill.
Park, K., Choe, Y. J., and Veitch, V. (2023). "The Linear Representation Hypothesis and the Geometry of Large Language Models."
Bricken, T., et al. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning." Transformer Circuits Thread, Anthropic.
Cunningham, H., et al. (2023). "Sparse Autoencoders Find Highly Interpretable Features in Language Models."
Templeton, A., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Transformer Circuits Thread, Anthropic.

Appendix: Originating Prompt

"I want I want to look specifically at neuronal activity when we talk about neurons in a neural network what my conjecture is is that neuronal activity is actually a threshold emergent property and I want to explore that through the lens of systems induced property emergence it seems to me that there is some sort of joint constraint set that win applied. There are emergent properties which simulate neurons in a neural network, and I think that if we can decompose and formalize based on those grounds for that neuronal activity, then that might give us insight into composition at higher levels see what I think part of the problem is if not the biggest problem with machine learning is that no one is able to give a completely formal account of the Y and because of that engineers go blind and empirically build these architectures which work but the more complex these architectures get the less apparent it is to them that there is any formal structure that can derive their implementations, so yeah, do a web fetch on these concepts and create an exploratory document in the corpus and also append this prompt to the artifact."