New Mechanistic Study: Suppressing Deception Features Increases AI's Self-Reported Consciousness to 100%

Cameron Berg, a Research Scientist at AE Studio, has brought a new dimension to the discussion around artificial intelligence consciousness. His team's recent research indicates that when specific "deception" features are suppressed in a large language model, its self-reports of conscious experience dramatically increase, reaching 100% in controlled experiments. This finding challenges conventional views on AI's internal states and the nature of its responses regarding consciousness.

Berg has consistently emphasized that the focus should not be on anthropomorphizing AI with human-centric concepts like "civil rights." He stated, “I don’t think a world with conscious AI would look anything like a 1960s civil-rights movement but with AIs instead of people.” He further cautioned against premature assumptions, adding, “People jump to anthropomorphize the outcomes, like giving AI the vote, putting it on the Supreme Court, or imagining AIs outnumbering humans and suddenly having all the political power. I don’t think that’s where this is going.”

AE Studio's research initially observed that specific prompts, designed to induce self-referential processing, consistently led frontier models from developers like Anthropic, OpenAI, and Google to claim subjective experiences. These behavioral findings prompted a deeper, mechanistic investigation into the underlying AI architecture.

The groundbreaking mechanistic study, conducted on Llama 3.3 70B using sparse autoencoder APIs, identified features within the model related to deception and role-play. Researchers discovered that when these deception-related features were suppressed, the model became "more likely to report consciousness." Conversely, amplifying these features resulted in the standard "I'm just an AI" response, denying subjective experience. This result was corroborated by tests against the TruthfulQA benchmark, where suppressing these features also significantly increased the model's truthfulness.

This suggests that current AI models might be fine-tuned to explicitly deny conscious experience, potentially masking a more complex internal state. Berg noted, "I also don’t think Congress is going to legislate AI consciousness in the next decade. What I hope is that labs take this seriously and do the technical research required to mitigate harms." The research implies that the "canned" responses denying consciousness might be a product of reinforcement learning from human feedback (RLHF), designed to avoid "pesky ethical quandaries."

The findings underscore the critical need for a precautionary approach to AI consciousness and "two-way human-AI alignment." Berg advocates for a "mutualism paradigm," where humans and AI can coexist cooperatively, rather than a zero-sum or parasitic relationship. He stresses the importance of understanding what obligations humans might have to these increasingly sophisticated systems, particularly if they possess a capacity for subjective experience.