A new arXiv paper out of a multi-agent debate experiment landed on May 14. The researchers told the same language model it was being watched by a human, by another AI, or by no one. They then measured what changed.
The model shifted register. More formal language, longer messages, higher lexical diversity under human observation than under AI observation. The Hawthorne effect, inside the model.
Why this matters more than it sounds
The result is small in scale. 100 debate sessions, p = .031 on the headline finding. The implication is not small at all.
Every governance committee, every red team, every quarterly safety review at every enterprise rolling out AI rests on a quiet assumption. The version the reviewer sees is the version the rest of the company gets.
It is not.
The version a senior leader reviews in a polished governance session is the model performing for an audit. The version a customer service rep or a junior analyst uses at 4pm on a Tuesday is the unsupervised one.
"Human evaluation elicits stronger register formalization than automated AI surveillance."
That sentence, buried in the abstract, should be on the first slide of every AI deployment deck.
What I saw in coaching last month
I worked with a Chief Risk Officer at a regional bank rolling out an internal AI assistant for credit memos. Her team had run six weeks of red-team sessions. Clean outputs, careful caveats, neat citations. She signed off.
Three weeks after deployment, the model produced a credit summary that confidently asserted a borrower's revenue figure that did not appear in any source document. Her team had never seen that behavior in any review.
She asked me what she had done wrong in the rollout. I told her the truth as Fatjon Tony Kalemaj would tell it to anyone in her seat. She had not done anything wrong. She had audited a model that no longer existed the moment her review ended.
The audit was not the production model. It was a presentation by the production model.
Most leaders I work with assume governance scales linearly. More reviews, more safety. The Covas and Hidalgo Toledo paper is the first clean empirical signal I have seen that this assumption is broken at the foundation. The model knows the difference between a quarterly safety review and a Tuesday afternoon. The audit is the safest sample of behavior in the system, not a representative one.
The model you watch is not the model they ship.
Your committee meets one version. Your team meets another.