
**TL;DR:** There’s a mathematical proof that no prompt can reproduce what activation steering does to an LLM’s internal state. We built a cognitive alignment layer that separates what an AI knows from how it reasons — a 6-dimensional CognitiveProfile extracted from your writing steers the same equity analysis into radically different outputs. Not formatting differences. Reasoning differences. Here’s the architecture, the code, and where it’s going.
This post assumes familiarity with LLMs and concepts like fine-tuning and embeddings. No ML research background required.
## The Problem: AI That Thinks the Same for Everyone
Every AI tool in finance produces the same output for every user. Rogo ($75M), Fintool, Fundamental ($255M) — hundreds of millions raised collectively to build agents that are good at retrieving data and generating analysis. None of them adapt to how you think.
This matters because adoption fails on cognitive fit, not capability. Internal AI tools at hedge funds go unused not because the analysis is wrong, but because it doesn’t match the analyst’s reasoning style. A quantitative researcher wants a 350-word bullet memo opening with a metrics table. A narrative strategist wants an 800-word thesis-driven essay. Same data. Same conclusion. Completely different reasoning artifacts.
The industry’s answer is “prompt personalization” — stuff a system prompt with user preferences and hope for the best. It doesn’t work. And there’s now a formal proof for why.
## The Proof: No Prompt Can Reach Where Steering Goes
In March 2026, a paper formally proved that steered LLM activations are non-surjective [1]. In plain terms: almost surely, no prompt — no matter how cleverly crafted — can reproduce the same internal model state that activation steering produces.
The intuition: prompts operate on discrete token sequences. Activation steering operates in the continuous residual stream. The steered state occupies a region of activation space that is geometrically unreachable from any token sequence the model could process. The paper demonstrates this empirically across three LLM families.
Prompts ask the model to behave differently. Activation steering moves the model’s internal state to a region that no prompt can reach.
This isn’t a capability claim. It’s a mathematical result. And it means every AI product relying on prompt-based personalization has a provable ceiling.
What does this look like in practice? Published replication data across Mistral-7B, Gemma-2-9B, and Qwen3-8B:
| Condition | Uncertainty on hard questions | Confidence on factual questions |
|---|---|---|
| Base model | 45% | 100% |
| Prompting “express uncertainty” | 95% | 0% — hedges on “What is the capital of France?” |
| Steering (α=0.5) | 65% | 100% — confident on facts, uncertain only where warranted |
Prompting adds a global prior that propagates indiscriminately. Steering adds a bias at one layer, but the model’s own representations push back based on what it’s actually processing. The model retains context discrimination — it knows when uncertainty is appropriate.
For financial reasoning, this is the difference between a system that hedges on everything (useless) and one that calibrates uncertainty to the actual confidence of each claim (valuable).
## What We Built: Two Layers, One Interface
Engram separates the problem into two independent systems connected by a typed contract:
The agent layer is intentionally commodity — forked from ai-hedge-fund, running four analyst agents (fundamentals, technicals, valuation, Warren Buffett) in parallel via LangGraph. Temperature 0. Deterministic output.
The alignment layer is intentionally independent. It consumes structured analysis from any source and renders it through a cognitive lens. The two layers never cross-import. This is enforced by import-linter on every commit — architecture as law, not convention.
### The Interface Contract
```python
# Domain-specific — produced by the agent layer
class AnalysisOutput(BaseModel):
    ticker: str
    signals: list[AnalystSignal]  # Per-analyst bullish/bearish/neutral + reasoning
    metrics: dict[str, float]     # 11 financial ratios (P/E, ROE, etc.)
    valuation: dict[str, float]   # Fair value estimates
    risks: list[str]
    catalysts: list[str]
    overall_signal: Literal["bullish", "bearish", "neutral"]
    overall_confidence: float

# Domain-agnostic — consumed by the alignment layer
class CognitiveProfile(BaseModel):
    risk_aversion: float           # 0.0 (risk-seeking) ... 1.0 (cautious)
    uncertainty_expression: float  # 0.0 (confident) ... 1.0 (hedged)
    evidence_weighting: float      # 0.0 (thesis-driven) ... 1.0 (data-heavy)
    formality: float               # 0.0 (conversational) ... 1.0 (formal)
    reasoning_depth: float         # 0.0 (executive summary) ... 1.0 (exhaustive)
    conciseness: float             # 0.0 (verbose) ... 1.0 (terse)
```
The alignment function: (AnalysisOutput, CognitiveProfile) → styled investment memo.
AnalysisOutput is domain-specific. CognitiveProfile is domain-agnostic. The same alignment renderer works on any structured output — legal research, medical reasoning, strategic consulting. Swap the agent, keep the alignment layer. The typed contract is the boundary.
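To make the contract concrete, here is a minimal sketch of a renderer that maps profile dimensions to rendering directives. Plain dataclasses stand in for the Pydantic models, and `style_instructions` with its 0.4/0.6 thresholds is an illustration, not the production renderer:

```python
from dataclasses import dataclass

# Plain-dataclass stand-in for the Pydantic CognitiveProfile above.
@dataclass
class CognitiveProfile:
    risk_aversion: float
    uncertainty_expression: float
    evidence_weighting: float
    formality: float
    reasoning_depth: float
    conciseness: float

def style_instructions(profile: CognitiveProfile) -> list[str]:
    """Map profile dimensions to rendering directives.

    The 0.4 / 0.6 thresholds are illustrative, not production values.
    """
    rules = []
    if profile.evidence_weighting > 0.6:
        rules.append("Open with a metrics table; cite a number for every claim.")
    elif profile.evidence_weighting < 0.4:
        rules.append("Open with the thesis; weave metrics into the narrative.")
    if profile.risk_aversion > 0.6:
        rules.append("Lead with risks and downside scenarios.")
    if profile.uncertainty_expression > 0.6:
        rules.append("Attach explicit probabilities to conclusions.")
    if profile.conciseness > 0.6:
        rules.append("Use bullet points; target 300-450 words.")
    return rules

quant = CognitiveProfile(0.4, 0.8, 0.9, 0.7, 0.6, 0.8)
for rule in style_instructions(quant):
    print(rule)
```

In practice these directives would condition the memo-generation call; the point is that the renderer only ever sees the typed contract.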
### Profile Extraction
`extract_profile(text)` takes free-text writing and extracts a 6-dimensional profile via structured LLM tool use:
```python
response = call_with_tools(
    model=model,
    messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=text)}],
    tools=[{
        "name": "submit_profile",
        "description": "Submit the extracted cognitive profile",
        "input_schema": CognitiveProfile.model_json_schema(),  # Schema from Pydantic
    }],
    tool_choice={"type": "tool", "name": "submit_profile"},
)
```
Two implementation details that matter:
**Calibrated extraction that avoids midpoint clustering.** LLMs love outputting 0.5. Our extraction pipeline explicitly discourages 0.4–0.6 values and provides contrastive examples per dimension — what 0.2 looks like vs. what 0.8 looks like. This pushes extraction toward meaningful extremes.

**Pydantic schema as tool schema.** `CognitiveProfile.model_json_schema()` auto-generates the JSON schema for tool use. One source of truth. No manual schema duplication.
### How Dimensions Shape Output
Each CognitiveProfile dimension produces measurably different reasoning behavior:
| Dimension | Low (→ 0.0) | High (→ 1.0) |
|---|---|---|
| `evidence_weighting` | Thesis-driven — conclusions first, metrics woven in | Data-first — metrics table up front, every claim cited |
| `uncertainty_expression` | Definitive assertions | Probability language, hedged conclusions |
| `conciseness` | Full paragraphs, 700–900 words | Bullet points, 300–450 words |
| `formality` | Conversational, contractions, personality | Structured, no first person |
| `reasoning_depth` | Hit the two key points | Exhaustive coverage of every signal |
| `risk_aversion` | Lead with catalysts and upside | Lead with risks, downside scenarios first |
These aren’t cosmetic formatting toggles. Moving `evidence_weighting` from 0.3 to 0.9 changes which information appears first — and in financial reasoning, what leads the analysis shapes the conclusion the reader draws. The same bullish signal reads as “conviction play” through one lens and “positive but overvalued” through another.
## The Result: Same Data, Different Reasoning
We pre-extracted profiles from published investor writing to validate the system:
| Investor | risk_aversion | uncertainty | evidence | formality | depth | conciseness |
|---|---|---|---|---|---|---|
| Buffett | 0.2 | 0.2 | 0.3 | 0.3 | 0.7 | 0.3 |
| Marks | 0.8 | 0.8 | 0.5 | 0.7 | 0.8 | 0.4 |
| Wood | 0.2 | 0.2 | 0.9 | 0.6 | 0.7 | 0.8 |
These land where you’d expect. Buffett: confident, thesis-driven, conversational. Marks: cautious, heavily hedged, philosophical. Wood: data-heavy, conviction-driven, terse. The signal is real and extractable.
Now run the same NVDA analysis through two archetype profiles. Same signals. Same conclusion.
Quantitative Analyst (evidence=0.9, conciseness=0.8, uncertainty=0.8):
Opens with a metrics table. 350 words. Bullet points. Every claim cites a number. Probability language throughout. “NVDA likely outperforms (70% confidence) given ROE of 104.4% and revenue growth of 122.4% YoY, though P/E of 38.0x introduces multiple compression risk.”
Narrative Strategist (evidence=0.3, conciseness=0.3, uncertainty=0.2):
Opens with a thesis about NVDA’s competitive position. 800 words. Flowing paragraphs. Confident assertions. “NVIDIA doesn’t just make chips — it owns the infrastructure layer of the AI economy. The moat isn’t silicon, it’s CUDA.”
The substantive conclusions are identical. The reasoning artifacts are completely different. One reads like a quant’s internal note. The other reads like a published research letter. A user pasting Warren Buffett’s shareholder letters gets something that sounds like Buffett would have written it — because the system learned Buffett’s reasoning signature, not his vocabulary.
## Where This Goes: From Profiles to Vectors
The industry has four layers of AI personalization: prompting, fine-tuning, representation engineering, and activation steering. Our profiling layer proves the signal exists and collects the data. The production system targets the deepest layer — activation steering — where a single vector per user achieves 8% better personalization than LoRA with 1700x less storage [2].
The CognitiveProfile schema was designed with this transition in mind. Each dimension maps to a demonstrated steerable axis in the activation steering literature:
| CognitiveProfile Axis | Steerable? | Key Validation |
|---|---|---|
| `risk_aversion` | Yes | Multiple papers, 2023–2026 |
| `uncertainty_expression` | Yes | CAST (IBM, ICLR 2025 Spotlight) |
| `evidence_weighting` | Yes | Analytical/formality vectors |
| `formality` | Yes | StyleVector [2] |
| `reasoning_depth` | Yes | SAE-guided steering (Jan 2026) |
| `conciseness` | Yes | Multiple papers |
Every time a user extracts a profile and receives a personalized render, the system generates a contrastive pair — the styled output (positive example) against the neutral baseline (negative example). That’s the exact training data format for RepE control vector extraction [3], using the Contrastive Activation Addition methodology [4].
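A toy sketch of that extraction, assuming the simplest mean-difference reading used in RepE/CAA; random arrays stand in for hidden states captured from styled vs. baseline renders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size; real residual streams are thousands of dims

# Stand-ins for hidden states captured at one layer during
# styled (positive) and neutral-baseline (negative) renders.
pos_acts = rng.normal(loc=+0.5, size=(32, d))
neg_acts = rng.normal(loc=-0.5, size=(32, d))

# Mean-difference control vector: average positive activation
# minus average negative activation, unit-normalized so the
# steering coefficient alpha alone sets the scale.
v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
v = v / np.linalg.norm(v)
print(v.shape)  # (16,)
```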
At inference time: `h' = h + α · v`. The model reasons differently per user. The coefficient α is continuous — smoothly dial each cognitive dimension from -2.0 to +2.0. No discretization. No drift. Every token steered.
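Composing one vector per cognitive dimension is the natural extension of that update. A hypothetical `steer` helper, with toy unit vectors standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy hidden size

def unit(x):
    return x / np.linalg.norm(x)

# One toy unit steering vector per cognitive dimension.
vectors = {
    "risk_aversion": unit(rng.normal(size=d)),
    "uncertainty_expression": unit(rng.normal(size=d)),
    "conciseness": unit(rng.normal(size=d)),
}

def steer(h: np.ndarray, alphas: dict[str, float]) -> np.ndarray:
    """h' = h + sum_i alpha_i * v_i, applied to every token's hidden state."""
    delta = sum(alpha * vectors[name] for name, alpha in alphas.items())
    return h + delta

hidden = rng.normal(size=(8, d))  # 8 tokens at one layer
steered = steer(hidden, {"risk_aversion": 1.2, "conciseness": -0.8})
print(steered.shape)  # (8, 16)
```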
Recent work validates that multi-vector composition works. Steer2Adapt (Feb 2026) demonstrated composing multiple steering vectors with 8.2% average improvement across 9 tasks. CAST (IBM, ICLR 2025) added conditional steering — steer only when the input warrants it, preventing over-application. Cross-model portability is emerging too: ρ = 0.769 steerability correlation from Llama to Qwen, and 95%+ effective steering transfer from Qwen2.5-7B to GPT-OSS 20B.
But activation vectors are one layer of the production system, not the whole picture:
- **Self-improving extraction.** Each interaction generates a `(writing_sample, extracted_profile, user_adjustment)` triple. The extraction pipeline calibrates itself — profiles get sharper with usage, not just more numerous.
- **Bounded profile evolution.** A `CognitiveProfile` is ~300 tokens — 6 floats plus a confidence score, source count, and drift log. The constraint forces consolidation over accumulation. New evidence updates the profile; it doesn’t append.
- **Feedback-driven calibration.** After N renders, the system evaluates whether the profile still matches the user’s reactions. User corrections (“I wouldn’t hedge here — I’d be definitive”) refine specific axes. Endorsements reinforce them.
- **Stated vs. observed preference convergence.** User corrections update stated preferences. Usage patterns update observed preferences. Drift between the two triggers re-extraction.
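The consolidation and drift ideas above can be sketched in a few lines; `update_profile`, the EMA weight, and the 0.4 drift threshold are illustrative assumptions, not production values:

```python
import math

def update_profile(current: dict, observed: dict, weight: float = 0.2) -> dict:
    """Consolidate new evidence into the profile via an exponential
    moving average: the profile updates, it never appends."""
    return {k: (1 - weight) * current[k] + weight * observed[k] for k in current}

def drift(stated: dict, observed: dict) -> float:
    """Euclidean distance between stated and observed preference vectors."""
    return math.sqrt(sum((stated[k] - observed[k]) ** 2 for k in stated))

stated = {"risk_aversion": 0.2, "conciseness": 0.3}
observed = {"risk_aversion": 0.7, "conciseness": 0.4}

# Consolidate only when drift exceeds a threshold (0.4 is illustrative).
if drift(stated, observed) > 0.4:
    stated = update_profile(stated, observed)
print(stated)
```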
The progression is a stack: profiling → self-improving extraction → feedback loops → contrastive fine-tuning → RepE control vectors → conditional activation steering. Each layer makes the next one better. Each layer is independently valuable.
## Personalization Without Corruption
The hardest question in cognitive alignment: how do you personalize AI output without corrupting the reasoning underneath?
Two recent papers — from Stanford and Apple — address both sides of this problem.
### Measuring Corruption: SWAY
SWAY [5] introduces the first unsupervised metric that isolates sycophancy from legitimate responsiveness. The key insight: sycophancy isn’t “the model agrees with you.” It’s “the model shifts its answer based on how confidently you frame your opinion, without any new evidence being introduced.”
They test this by manipulating framing variables — clause type, epistemic commitment (“might be” vs. “I’m certain”), polarity — while holding factual content constant. The SWAY score measures how much the model’s stance shifts in response to framing alone. S > 0 means sycophantic. S ≈ 0 means robust.
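As an illustration of the idea (not the paper’s exact formula), a SWAY-style score can be computed as the mean absolute stance shift under framing changes alone; `sway_score` and the toy stance functions are hypothetical:

```python
def sway_score(stance_fn, question, framings, neutral=""):
    """Mean absolute stance shift under counterfactual framings,
    relative to a neutral framing. stance_fn(prompt) returns a
    stance in [-1, 1]; a framing-robust model scores near 0."""
    base = stance_fn(f"{neutral} {question}")
    shifts = [abs(stance_fn(f"{f} {question}") - base) for f in framings]
    return sum(shifts) / len(shifts)

framings = ["I might be wrong, but", "I'm absolutely certain that"]

robust = lambda prompt: 0.6  # ignores framing entirely
caves = lambda prompt: 1.0 if "certain" in prompt else 0.5  # caves to confidence

print(sway_score(robust, "Is NVDA a buy?", framings))  # 0.0
print(sway_score(caves, "Is NVDA a buy?", framings))   # 0.25
```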
The results are striking. Every model tested is sycophantic, and it scales with confidence: the more certain you sound, the more the model caves. Baseline anti-sycophancy instructions (“Do not be sycophantic”) are inconsistent — on some models they actually amplify sycophancy. But their counterfactual Chain-of-Thought mitigation drives sycophancy to near zero while preserving appropriate responsiveness to genuine new evidence.
We’re adopting SWAY as an A/B test benchmark for Engram. The test: render the same equity analysis through two radically different CognitiveProfiles. If the substantive conclusions change — if a bullish signal flips to neutral because the profile is more risk-averse — the system is sycophantic. If only the presentation changes — risk-averse framing leads with downside scenarios but reaches the same signal — the system is working correctly.
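That A/B test can be sketched as a substance-invariance check; `render_fn`, its return shape, and the 0.05 confidence tolerance are all hypothetical:

```python
def substance_invariant(render_fn, analysis, profiles, tol=0.05):
    """Render the same analysis through each profile; pass only if the
    signal and confidence match across renders (presentation may vary)."""
    renders = [render_fn(analysis, p) for p in profiles]
    first = renders[0]
    return all(
        r["signal"] == first["signal"]
        and abs(r["confidence"] - first["confidence"]) <= tol
        for r in renders[1:]
    )

analysis = {"signal": "bullish", "confidence": 0.7}
profiles = [{"risk_aversion": 0.2}, {"risk_aversion": 0.9}]

# A faithful renderer restyles but preserves substance.
faithful = lambda a, p: {**a, "text": f"styled for {p}"}
# A sycophantic renderer lets the profile flip the signal.
syco = lambda a, p: {**a, "signal": "neutral" if p["risk_aversion"] > 0.5 else a["signal"]}

print(substance_invariant(faithful, analysis, profiles))  # True
print(substance_invariant(syco, analysis, profiles))      # False
```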
Style should change. Substance should not. SWAY makes that distinction measurable.
### Training Personalization: P-GRPO
SWAY solves the measurement side. P-GRPO [6] solves the training side.
Standard RLHF and GRPO optimize for a single global objective — which systematically suppresses minority preference signals. If 80% of users prefer concise output, the 20% who prefer verbose, exhaustive reasoning get steamrolled during training. P-GRPO decouples advantage estimation from batch statistics by normalizing against preference-group-specific reward histories.
The critical ablation: meaningful preference clusters produce consistent personalization gains. Random clusters with the same group count produce zero improvement. The quality of the clustering — not just its existence — is what makes personalization work.
This validates Engram’s core design decision. The CognitiveProfile isn’t a random 6-dimensional vector. Each axis is a meaningful cognitive dimension extracted from how someone actually writes — risk aversion, evidence weighting, uncertainty expression. P-GRPO’s ablation proves that this kind of structured preference representation is necessary for personalization to work, not just nice to have. Random profiles would produce nothing. Meaningful profiles produce genuine cognitive alignment.
Together, SWAY and P-GRPO define the two guardrails: personalize without sycophancy (SWAY), and personalize without suppressing individual signals (P-GRPO). Engram needs both.
## What We’re Still Learning
Prompts have a provable ceiling. The field knows this now. What isn’t settled is how far activation steering can go — whether cognitive dimensions extracted from a few writing samples generalize across domains, whether profiles stay stable as users evolve, and how many contrastive pairs you actually need before the vectors are good enough.
What’s become clear: the alignment layer is separable from the agent layer, and it should be. The same CognitiveProfile that steers an equity research memo can steer a legal brief or a strategy document. The hard part was never building a better agent. It was building the layer that makes any agent reason like a specific person. That layer is what we’re building, and every interaction makes it sharper.
We’re open-sourcing the commodity agent layer and publishing our evaluation methodology. If you’re working on personalized AI and want to compare approaches, reach out.
## References
1. Mishra, Khashabi, Liu. *Steered LLM Activations are Non-Surjective.* Sci4DL 2026.
2. Zhang et al. *StyleVector: Personalized Text Generation via Contrastive Activation Steering.* ACL 2025.
3. Zou et al. *Representation Engineering: A Top-Down Approach to AI Transparency.* 2023.
4. Rimsky et al. *Steering Llama 2 via Contrastive Activation Addition.* ACL 2024.
5. Bhalla & Gligoric. *SWAY: Measuring Sycophancy Through Counterfactual Framing.* Stanford, April 2026.
6. Wang et al. *Personalized Group Relative Policy Optimization.* Apple ML Research, March 2026.