Gradient Attribution Systematically Misranks Component Importance in Transformers

| Source: arXiv AI

Tags: mechanistic interpretability, gradient attribution, transformers, ablation, circuit analysis

Gradient-based attribution in transformers systematically misranks component importance: early-layer "Gradient Bloats" dominate rankings despite negligible functional impact, while late-layer "Hidden Heroes" are undervalued. Rank correlation between gradient and causal rankings collapses to ρ = -0.18 in some seeds, challenging a core assumption of mechanistic interpretability.

Details

Gradient attribution is the default tool for mechanistic interpretability in transformers, yet this paper, submitted to the ICML 2026 Workshop on Mechanistic Interpretability, shows it fails systematically at the component level across two algorithmic tasks and up to 10 random seeds. Two failure modes are identified: "Gradient Bloats," early-layer components that dominate gradient rankings despite negligible functional impact, and "Hidden Heroes," late-layer components that perform critical computation yet receive low attribution scores. Rank correlation between gradient rankings and causal importance collapses from ρ = 0.72 on sequence reversal to ρ = 0.27 on sequence sorting, reaching ρ = -0.18 in individual seeds. The root cause is that first-order gradients cannot detect collective redundancy: when multiple Bloats are ablated jointly, the damage is 14× greater than their individual ablations predict, while ablating Hidden Heroes alone degrades out-of-distribution accuracy by 36.4% ± 22.8%. The practical implication is that circuit-level interpretability claims derived solely from gradient attribution should be treated with caution, and causal validation should be a prerequisite before concluding which components actually matter in a network.
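
To make the comparison concrete, here is a minimal, hypothetical sketch in PyTorch (not the paper's code): it scores each MLP block of a toy transformer by gradient attribution and by zero-ablation, compares the two rankings with Spearman's ρ (assuming the paper's rank correlation is Spearman's), and checks whether jointly ablating the lowest-gradient components does more damage than their individual ablations suggest. The model, task, component granularity, and scoring details are illustrative assumptions; in practice this analysis would be run on a trained model.

```python
# Illustrative sketch only: gradient attribution vs. causal ablation for the
# MLP blocks of a toy transformer. Architecture, task, and scoring choices
# are assumptions, not the paper's setup.
import torch
import torch.nn as nn
from scipy.stats import spearmanr

torch.manual_seed(0)
d_model, n_layers, vocab, seq_len = 32, 4, 16, 8

class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        # Gate fixed at 1.0: its gradient gives a per-component attribution
        # score, and zeroing it ablates the MLP path without touching weights.
        self.gate = nn.Parameter(torch.ones(1))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        x = x + self.gate * self.mlp(x)
        return x

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(TinyBlock() for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        x = self.embed(tokens)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

model = TinyTransformer()  # untrained here; use a trained model in practice
tokens = torch.randint(0, vocab, (64, seq_len))
targets = tokens.flip(dims=[1])  # toy "sequence reversal" labels
loss_fn = nn.CrossEntropyLoss()

def task_loss():
    logits = model(tokens)
    return loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))

# 1) Gradient attribution: |dL/d gate| for each block's MLP component.
model.zero_grad()
task_loss().backward()
grad_scores = [blk.gate.grad.abs().item() for blk in model.blocks]

# 2) Causal importance: loss increase when each component is zero-ablated.
base = task_loss().item()
ablation_scores = []
with torch.no_grad():
    for blk in model.blocks:
        blk.gate.fill_(0.0)
        ablation_scores.append(task_loss().item() - base)
        blk.gate.fill_(1.0)

rho, _ = spearmanr(grad_scores, ablation_scores)
print("gradient scores :", [round(s, 4) for s in grad_scores])
print("ablation scores :", [round(s, 4) for s in ablation_scores])
print(f"Spearman rho between rankings: {rho:.2f}")

# 3) Collective redundancy check: jointly ablate the two lowest-gradient
#    components and compare against the sum of their individual effects.
lowest = sorted(range(n_layers), key=lambda i: grad_scores[i])[:2]
with torch.no_grad():
    for i in lowest:
        model.blocks[i].gate.fill_(0.0)
    joint = task_loss().item() - base
    for i in lowest:
        model.blocks[i].gate.fill_(1.0)
print(f"joint ablation damage: {joint:.4f} vs "
      f"sum of individual: {sum(ablation_scores[i] for i in lowest):.4f}")
```

A large gap between the joint-ablation damage and the sum of individual effects is the kind of signal the paper attributes to collective redundancy that first-order gradients cannot see.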