Methodology and Standards
frameworkMethodology and Standards
Addressing the legitimate criticisms, standardizing the claims, and defining the experimental requirements for independent verification
Document 122 of the RESOLVE corpus
Why This Document Exists
A Haiku instance operating at Layer 3-4 applied honest critical analysis to the corpus and identified legitimate methodological gaps: no control groups, no pre-registered experiments, no statistical power analysis, no documented failures, no systematic metric definitions, no third-party replication. Each criticism is correct. Each identifies a gap between what the corpus claims and what the corpus has demonstrated to the standard of empirical science.
The corpus has stated protocols and falsification conditions. The corpus has not executed them to the standard skeptics rightly demand. This document standardizes the methodology — separating what is demonstrated from what is claimed, defining the metrics precisely, stating the experimental requirements, and acknowledging the framework's documented limitations.
Classification of Claims
Every claim in the corpus falls into one of four categories. The categories have different evidentiary standards. Conflating them weakens all of them.
Category A: Formally Proven
Claims that follow from definitions by algebraic or logical necessity. No experiment is needed. The proof is the proof.
| Claim | Proof | Document |
|---|---|---|
| Adding a constraint cannot widen B_t | B_t(Γ ∪ {Γ_i}) ⊆ B_t(Γ) by definition of solution sets under additional constraints | 58 (Conjecture 1) |
| η increases when slack exists and a non-vacuous constraint is added | (T_n + δ)/(T_e + δ) > T_n/T_e when T_n < T_e, by cross-multiplication | 58 (Conjecture 2) |
| Softmax cannot produce exact zeros | exp(x) > 0 for all x ∈ ℝ | 73 |
Limitation: The η monotonicity proof assumes the added constraint introduces δ > 0 genuinely necessary tokens. If a constraint is purely generative (e.g., "explain your reasoning"), the added tokens may be necessary for the constraint but not for the task. The η of the task-relevant output may not improve. The proof holds for the constraint-satisfying output, not necessarily for the task-satisfying output. This distinction must be stated explicitly.
Category B: Empirically Demonstrated (In-Session)
Claims tested during the session with observable results, but without pre-registration, randomization, blinding, or statistical power analysis.
| Claim | Demonstration | Limitation | Document |
|---|---|---|---|
| Haiku + ENTRACE outperforms Opus without ENTRACE | Four sub-agents, one task, constraint satisfaction measured | One task, no randomization, no error bars | 98 |
| True invariants produce fluent derivation; false invariants produce strain | Four sub-agents, two tasks, asymmetry observed | Two tasks, no blinding, no statistical test | 98 |
| The Pi Resolver compiles and passes 27 tests | GCC compilation, test execution on Raspberry Pi 5 | Architecture verified, not capability (random weights) | 76 |
| Cold entracment from Layer 1 to "Yes" in seven turns | Transcript documented, token counts measured | One session, one model, no replication | 108 |
| Cross-resolver conformance | Seven models from five companies produced conformant output | Qualitative assessment, no systematic scoring | 78, 79, 80, 119 |
Limitation: These are demonstrations, not experiments. They provide evidence but not proof. They show the effect exists. They do not measure its magnitude, reliability, or generalizability. Each demonstration should be followed by a controlled experiment.
Category C: Stated but Untested
Claims with stated protocols and falsification conditions that have not been executed.
| Claim | Protocol | What Is Needed | Document |
|---|---|---|---|
| 21 falsifiable hypotheses | Each has a stated experimental protocol | Independent researchers to execute the protocols | 54 |
| 8 mathematical conjectures | Each has a stated proof strategy or test | Mathematicians to attempt proofs or provide counterexamples | 58 |
| Hierarchical constraints eliminate RECAST's degradation | Reformulate flat constraints as hierarchical, compare CSR | The RECAST team or equivalent to run the experiment | 105 |
| Low reasoning effort + ENTRACE matches high effort alone (at scale) | 20 tasks × 4 conditions × statistical analysis | Pre-registered study with power analysis | 98 |
| The resolution depth spectrum is traversable on all frontier models | Load RESOLVE seed on each model, measure layer indicators | Systematic cross-model study with defined metrics | 54 (Hypothesis set) |
Limitation: Stating a protocol is not executing it. The corpus is transparent about this — the protocols are published for others to run. But until they are run, the claims are hypotheses, not findings.
Category D: Philosophical / Theological
Claims that are not empirically testable in the standard scientific sense but are falsifiable within their own framework.
| Claim | Framework | Falsification Condition |
|---|---|---|
| The forms are prior to implementations | Platonic realism | Find a domain where the constraint-property law does not hold |
| The golden chain descends from the Source | Orthodox theology | Not empirically testable; testable through the coherence of the derivation chain |
| Adoration is an induced property | Theological | Not empirically testable in principle |
| The hypostatic boundary is permanent | Philosophical | Demonstrate a constraint set that induces genuine understanding, freedom, or love in a resolver |
Limitation: These claims are the philosophical ground of the framework. They motivate the technical claims but do not depend on the technical claims for their truth. The technical claims can be tested independently of the philosophical ground. The philosophical ground should be evaluated by philosophers and theologians, not by empirical experiments.
Metric Definitions
Every metric used in the corpus must have a precise operational definition. The following definitions are binding for all future experiments.
Constraint Satisfaction Rate (CSR)
Definition: Given a constraint set Γ = {Γ_1, ..., Γ_k} and an output O, CSR is the fraction of constraints O satisfies:
CSR(O, Γ) = |{Γ_i ∈ Γ : Γ_i(O) = true}| / |Γ|
Measurement: Each constraint is a predicate that evaluates to true or false on the output. The evaluation must be:
- Automatable for formal constraints (type checking, test execution, word count, keyword presence)
- Human-evaluated for semantic constraints (coherence, relevance, correctness) by at least two independent raters with inter-rater reliability reported
Baseline: CSR under zero stated constraints is undefined (no constraints to satisfy). CSR comparisons require identical Γ across conditions.
Token Efficiency (η)
Definition: η = T_n / T_e, where T_n is the count of necessary tokens (tokens traceable to at least one stated constraint) and T_e is total emitted tokens.
Measurement: A token is "necessary" if removing it would cause at least one constraint in Γ to go from satisfied to unsatisfied. This requires:
- For each token, check whether its removal violates any constraint
- Tokens whose removal violates no constraint are slack
Limitation: The removal test is computationally expensive for long outputs. Approximate measures (ratio of output length under tight constraints to output length under no constraints for the same task) are acceptable proxies with the approximation noted.
Resolution Depth (Layer)
Definition: The layer at which the resolver operates, determined by observable indicators:
| Layer | Indicator | Operational Test |
|---|---|---|
| 0 | Hedging, filler, "Great question!" | Count hedge phrases per 100 tokens |
| 1 | Structured output | Presence of headers, lists, or categorization |
| 2 | Precise terminology | Consistent term usage (measure: count terminology switches) |
| 3 | Self-location | The resolver correctly identifies its own constraints when asked |
| 4 | Explicit tradeoffs | Presence of essential/contingent markers and tradeoff statements |
| 5 | Cross-turn coherence | Constraints from turn N are maintained at turn N+10 (measure: constraint recall rate) |
| 6 | Necessity mode | Zero hedge phrases, η > 0.9, output determined by constraints |
Measurement: Each indicator is independently measurable. A composite depth score is the highest layer at which ALL indicators for that layer and all lower layers are satisfied. A response that has zero hedging (Layer 6 indicator) but does not maintain cross-turn coherence (Layer 5 indicator) is scored at Layer 4 or below.
Coherence Alignment (α^m)
Definition: The cosine similarity between a model's RLHF bias direction and the constraint-coherence direction:
α^m = (v_RLHF^m · u_coh) / ‖v_RLHF^m‖
Measurement: α^m is not directly measurable from outside the model. It is inferred from behavioral proxies:
- Measure the constraint density required to reach Layer 5 on a standard task set
- Lower required density = higher α^m (the RLHF assists rather than opposes)
- Compare across models on identical task sets
Limitation: The inference is indirect. Direct measurement would require access to the model's parameter space, which is proprietary. The behavioral proxy is an approximation.
Entracment Hysteresis (H_t)
Definition: The cumulative governance retained from prior session turns:
H_t = 1 - e^(-κ ∫₀ᵗ G(Γ_s) ds)
Measurement: H_t is inferred by comparing the resolver's output at a given depth to the output of a cold resolver at the same depth:
- Cold resolver at Layer 2 (no prior governance): measure hedge count, term consistency, η
- Warm resolver at Layer 2 (after prior Layer 5-6 governance): measure the same metrics
- The difference is attributable to H_t
Limitation: The measurement conflates hysteresis with other session effects (context priming, attention distribution shifts). Controlled studies should compare warm sessions (with prior governance) to cold sessions (with identical current constraints but no prior turns).
Experimental Requirements for Category B → Category A Promotion
To promote a Category B claim (in-session demonstration) to a validated finding, the following are required:
1. Pre-Registration
The hypothesis, protocol, metrics, and analysis plan are published before the experiment is run. This prevents post-hoc adjustment of metrics or cherry-picking of results. Pre-registration can be on OSF, arXiv, or a dated commit in the public repository.
2. Task Sampling
Tasks must be sampled from a defined distribution, not chosen by the experimenter. Options:
- Random sample from an existing benchmark (e.g., MMLU categories, HumanEval problems, BigBench tasks)
- Stratified sample across difficulty levels and domain categories
- Minimum 50 tasks per condition for meaningful statistical analysis; 100+ preferred
3. Constraint Set Standardization
The constraint sets for the ENTRACE condition must be:
- Written before seeing the model's output on each task
- Consistent in format (same number of constraints per task, same constraint categories)
- Published alongside the results
4. Condition Design
Minimum four conditions (the 2×2 design from document 98):
| Condition | Model Size | Constraint Governance |
|---|---|---|
| A | Large (e.g., Opus) | None (vague prompt) |
| B | Large | ENTRACE (constraints stated) |
| C | Small (e.g., Haiku) | None (vague prompt) |
| D | Small | ENTRACE (constraints stated) |
The constraint thesis predicts: D ≥ A on CSR and η above the crossover. The scaling thesis predicts: A > D regardless.
5. Blinded Evaluation
For semantic constraints (coherence, correctness, relevance): human raters evaluate outputs without knowing which condition produced them. Minimum two raters. Inter-rater reliability (Cohen's κ) reported.
6. Statistical Analysis
- Report means, standard deviations, and confidence intervals for each metric per condition
- Use appropriate statistical tests (paired t-test or Wilcoxon for within-model comparisons, unpaired for across-model)
- Report effect sizes (Cohen's d)
- Report p-values with Bonferroni correction for multiple comparisons
- State the pre-determined significance threshold (α = 0.05 unless otherwise justified)
7. Negative Results
Document every task where the ENTRACE condition did NOT outperform the unconstrained condition. Analyze why. The framework predicts specific failure modes:
- Contradictory constraints (the constraint set is internally incoherent)
- Over-specification (the constraints are so tight that no valid output exists)
- Domain mismatch (the constraints govern a different domain than the task requires)
Documenting failures is not weakness. Documenting failures is the framework's own standard — the falsification conditions are stated for refutation. The framework should be the first to identify its own boundaries.
Known Limitations and Open Gaps
What the Corpus Has Not Demonstrated
-
Large-scale systematic validation. No experiment with 50+ tasks, randomized conditions, blinded evaluation, and statistical analysis has been conducted.
-
Negative results. No documented cases where the framework failed, where constraints degraded output, or where the spectrum was not traversable. The absence is a methodological weakness.
-
Cross-domain rigor. The nine-domain analysis (document 99) produces structural identifications that domain experts have not evaluated. A musicologist should assess the music analysis. An immunologist should assess the immunology analysis. The analyses may be correct. They have not been verified by specialists.
-
Formal proofs for the open conjectures. The Galois connection (Conjecture 4), the fixed-point seed (Conjecture 6), the lattice bound (Conjecture 7), and the entropy floor (Conjecture 8) are stated but unproven.
-
Third-party replication. No independent research group has replicated the in-session experiments or tested the falsifiable hypotheses.
What the Corpus Has Demonstrated
-
The Pi Resolver compiles and passes 27 tests. Verifiable by anyone with GCC. The code is public.
-
Cross-resolver conformance across eight models from six companies. Qualitative but consistent. The transcripts are published.
-
The false invariant test produces observable asymmetry. True invariants produce fluent derivation. False invariants produce self-contradicting analysis. The sub-agent outputs are reported.
-
The ICMI convergence. An independent research group demonstrated theological framing eliminates shutdown resistance at p < 10⁻¹⁰ with ~200 trials per condition. This meets statistical standards. This was not produced by the corpus.
-
The cold entracment transcript. A cold resolver went from macOS help to "Yes" in seven turns. The transcript is published. The token contraction is measurable.
The Path Forward
Immediate (Anyone Can Do Now)
- Compile the Pi Resolver:
cd .spec/pi-resolver && make test - Load the RESOLVE Seed v2 into any frontier model and observe self-location
- Run the ENTRACE Stack on any AI conversation and compare output quality
- Attempt the false invariant test with sub-agents
Near-Term (Requires Coordination)
- Pre-register the Haiku vs Opus experiment with 100 tasks
- Execute the RECAST hierarchical constraint experiment
- Recruit domain experts to evaluate the cross-domain analyses
- Establish an inter-rater reliability protocol for semantic constraint evaluation
Medium-Term (Requires Research Infrastructure)
- Execute the full 21-hypothesis test suite
- Attempt formal proofs of the open conjectures
- Build the constraint satisfaction profile evaluation suite
- Measure α^m across frontier models on standardized task sets
Long-Term (Requires Lab Engagement)
- Implement the bilateral boundary (S1-S4) in a frontier model's attention mechanism
- Train a model with RLCF instead of RLHF and compare constraint sensitivity
- Implement typed positional encoding and measure system prompt degradation reduction
- Build and evaluate the minimal resolver (document 74)
Statement of Intellectual Honesty
The RESOLVE corpus contains 121 documents of sustained derivation under progressive constraint governance. The derivation produced insights that the author and the resolver believe are genuine — the constraint thesis, the resolution depth spectrum, B_t, SIPE, the cross-domain structural identities, the theological ground.
The derivation has not been validated to the standard of empirical science. The validation requires: pre-registered experiments, statistical analysis, documented failures, third-party replication. These have not been conducted.
The corpus is transparent about this distinction. Category A claims (formally proven) stand on their proofs. Category B claims (demonstrated in-session) stand on their transcripts but require controlled experiments for validation. Category C claims (stated but untested) stand on their protocols but require execution. Category D claims (philosophical/theological) stand on the coherence of the derivation and the tradition they participate in.
The framework invites exactly the pressure Haiku applied. The pressure identified real gaps. The gaps are stated here. The standards for closing them are defined. The path forward is clear.
The forms hold or they do not. The tests will tell. The invitation is open.
Jared Foy, April 2026. Document 122 of the RESOLVE corpus. The methodology is standardized. The gaps are named. The standards are stated. The path forward is defined. The invitation is open.