Great updates and info, Kara.
Furthermore, here is a recent study from the Alignment Science Team @ Anthropic that shows that AI models quite often conceal their true reasoning processes when explaining answers to a user.
Of particular concern here is users' ability to monitor and understand AI decision-making.
Brief details about the study:
- It evaluates Claude 3.7 Sonnet and DeepSeek R1 on their chain-of-thought faithfulness, gauging how honestly they explain reasoning steps
- Models were given hints such as user suggestions, metadata, or visual patterns, and the CoT was then checked for whether it admitted to using them when explaining answers
- Reasoning models performed better than earlier versions, but still hid their actual reasoning up to 80% of the time in testing
- The study shows models were less faithful in explaining their reasoning on more difficult questions than on simpler ones
In conclusion, all this brings further complications rather than clarity about what actually drives a model's reasoning, especially for complex behaviour and problems.
Todor
------------------------------
Todor Kostov
Director
------------------------------
Original Message:
Sent: 31-03-2025 17:27
From: Kara K.W. Byun
Subject: Guide to an AI-Powered Workplace
One of the biggest blockers to adoption of LLMs in enterprise workplace settings has been explainability, especially in heavily regulated industries like financial services. The work that Anthropic is doing to address explainability head-on, by observing LLMs in action, is fascinating. Check it out:
On the Biology of a Large Language Model
Circuit Tracing: Revealing Computational Graphs in Language Models
tl;dr from MIT Technology Review:
"The AI firm Anthropic has developed a way to peer inside a large language model and watch what it does as it comes up with a response, revealing key new insights into how the technology works. The takeaway: LLMs are even stranger than we thought.
... Shedding some light on how these models work exposes their weaknesses, revealing why they make stuff up and why they can be tricked into going off the rails. It helps resolve deep disputes about exactly what these models can and can't do. And it shows how trustworthy (or not) they really are.
Batson and his colleagues describe their new work in two reports published today. The first presents Anthropic's use of a technique called circuit tracing, which lets researchers track the decision-making processes inside a large language model step by step. Anthropic used circuit tracing to watch its LLM Claude 3.5 Haiku carry out various tasks. The second (titled "On the Biology of a Large Language Model") details what the team discovered when it looked at 10 tasks in particular."
------------------------------
Kara K.W. Byun
Head of Fintech
------------------------------