Otak · Research note

Recording the Lab

From Dunbar's tape recorders to always-on cognitive capture: what an LLM-equipped revival of in-vivo studies of scientific reasoning could look like.

2026-04-26 Stian Håklev ~12 min read

In the late 1980s, the Canadian cognitive psychologist Kevin Dunbar did something that almost no one before or since has done at scale: he sat in on the weekly meetings of four working molecular biology labs, switched on a tape recorder, and tried to capture the actual cognitive process of science as it unfolded. Not science as written up in journals. Not science as recalled in interviews. Science in vivo — the term he borrowed deliberately from the molecular biologists themselves.[1]

The recordings ran for a year per lab. Dunbar transcribed them by hand, coded every cognitive move he could identify, and asked a question that retrospective interviews and bibliometrics couldn't touch: how does a research group actually think?

The findings became canonical in the cognitive science of science:

- Analogy was everywhere: scientists used analogies in lab meetings far more often than laboratory studies of reasoning had suggested.
- The analogies used in vivo were mostly structural, not surface-level, in contrast to what subjects produce in psychology experiments.
- Analogical distance tracked the goal: near analogies for fixing experiments and methods, more distant ones for generating hypotheses and explaining concepts.
- Unexpected findings were routine rather than rare, and a lab's handling of them drove much of its reasoning.

And then there was the finding that has haunted me since I first read it, the one that connects this whole research program to a much older puzzle about minds:

When Dunbar interviewed scientists individually after the lab meetings, they massively under-reported their use of analogies. The analogies they had used in real time, on tape, simply did not appear in the retrospective accounts. They remembered making the discovery; they did not remember how.

The Gazzaniga link

This is the same machinery as Michael Gazzaniga's "left-hemisphere interpreter." In the chicken-claw / snow-shovel experiment with split-brain patients, the speaking left hemisphere — with no access to what the right hemisphere had seen — instantly confabulated a coherent reason ("you need a shovel to clean out the chicken shed") for behaviour it did not actually understand.[3] Dunbar's scientists are doing a slower, more dignified version of the same thing: their interpreter rewrites the cognitive history of a discovery into a clean post-hoc narrative, and the analogies they actually used — the ones the tape recorder captured — are the first thing to be edited out.

Where the field went next

Dunbar's in-vivo program flowered between roughly 1988 and 2002, then largely stopped. The reason is structural: tape recordings are expensive to transcribe, manual coding requires trained coders with high inter-rater reliability, and N=4 labs is already a heroic effort for a single researcher. Dunbar himself moved into educational neuroscience and fMRI, asking what brain regions activate when subjects reason analogically across semantic distance.[4] His most-cited recent papers are about the neural correlates of analogical reasoning, the creative-stereotype effect on divergent thinking, and the pedagogy of conceptual change — not lab ethnography.

Adjacent traditions kept some of the methodology alive in narrower domains:

- Protocol studies of design in cognitive psychology: Christensen and Schunn coded analogical distance and function in real engineering-design meetings.
- Laboratory ethnography in science and technology studies: Knorr-Cetina's Epistemic Cultures is the cross-disciplinary counterpart.

The two traditions never fully merged. Cognitive psychology kept the coding rigour but lost the field sites. STS kept the field sites but gave up on coding moves like "analogy" or "hypothesis revision" as too reductive.

What changed

Four things, roughly between 2022 and 2026:

~$0.006/min: cost of speaker-diarised transcription with current Whisper / Soniox-class models. Dunbar's 1990 transcription cost, in current dollars, was closer to $50/min.

~human-level: inter-rater reliability achievable by frontier LLMs on most cognitive-coding constructs, given good rubrics and a calibration set. Some constructs still fail (irony, nuanced disagreement); most do not.

N=400: a scale that's now plausible for an in-vivo study, given recording and coding cost. Dunbar managed N=4. The methodology was never the bottleneck; the labour was.

Normalised: always-on recording is socially established (Zoom, Granola, Otter). The Hawthorne-effect cliff is much shallower in 2026 than in 1990, when a tape recorder on the table was an unusual object.
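Both cost and reliability invite back-of-envelope checking. A sketch with invented numbers: the transcription bill for a hypothetical 400-lab, one-meeting-a-week, one-year corpus, and Cohen's kappa (the standard chance-corrected agreement statistic) computed between a human coder and a hypothetical LLM coder on a toy calibration set.

```python
from collections import Counter

# Illustrative cost arithmetic: 400 labs, one hour-long meeting per week, one year.
minutes = 400 * 52 * 60            # 1,248,000 meeting-minutes
cost_now = minutes * 0.006         # ~$7.5k at ~$0.006/min diarised transcription
cost_1990 = minutes * 50.0         # ~$62M at Dunbar-era ~$50/min (current dollars)

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders over the same utterances."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy calibration set: hand codes vs hypothetical LLM codes for six utterances.
human = ["analogy", "hypothesis", "analogy", "evidence", "analogy", "hypothesis"]
llm   = ["analogy", "hypothesis", "analogy", "analogy",  "analogy", "hypothesis"]
kappa = cohens_kappa(human, llm)   # 0.70 on this toy set
```

A kappa computed this way against a hand-coded validation set is how "~human-level" would be operationalised in practice, construct by construct.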

Each of these used to be a reason the in-vivo programme couldn't scale. None of them holds any more. Which raises the obvious next question: what would you actually do?

A study design for 2026

Imagine equipping a research group — with full informed consent, opt-in per recording, an explicit destruction policy, and the researchers themselves having edit rights — with continuous capture across all of:

- lab meetings and ad-hoc conversations (speaker-diarised audio),
- the group's chat channels,
- code commits, time-aligned with everything else.

The pipeline:

Capture: speaker-diarised audio + chat + commits, time-aligned. Output: one unified transcript stream per day.

Cognitive-move tagging: an LLM passes over the transcript, calibrated against a hand-coded validation set, tagging every analogy, hypothesis, contradiction, evidence-evaluation, scope-narrowing, and unexpected-finding moment. Output: a structured event log, with a typed cognitive move and a confidence per utterance.

Idea provenance: every claim that ends up in a paper is traced backward to the meeting in which it was first articulated, and the chain of mutations between then and publication is reconstructed. Output: a provenance graph, idea → meeting → revision → revision → paper.

Confabulation eval: periodically ask researchers structured retrospective questions ("how did you arrive at this hypothesis?", "did anyone push back?") and score the answers against the actual transcript. Output: a per-researcher memory-drift trajectory.

Lens / intervention (optional): an LLM running over the live transcript that can flag epistemic moves to the meeting (a confabulation detector, a missed-analogy prompter, a contradiction surfacer). Output: real-time cognitive scaffolding, with all the obvious risks.
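To make the middle stages concrete, here is a minimal sketch of the data model they imply. Every name and field is hypothetical, not part of any existing pipeline: a typed cognitive-move event for the tagging stage, and a provenance chain that walks a published claim back to its first articulation.

```python
from dataclasses import dataclass
from typing import Optional

MOVE_TYPES = {"analogy", "hypothesis", "contradiction",
              "evidence_evaluation", "scope_narrowing", "unexpected_finding"}

@dataclass
class CognitiveMove:
    """One tagged utterance in the structured event log."""
    utterance_id: str
    speaker: str
    move_type: str       # one of MOVE_TYPES
    confidence: float    # tagger confidence, 0..1

@dataclass
class ProvenanceNode:
    """One articulation of a claim: a meeting, a revision, or the paper itself."""
    claim_text: str
    source_event: str                            # id of the meeting / chat / commit
    parent: Optional["ProvenanceNode"] = None    # earlier articulation, if any

def provenance_chain(node):
    """Walk back from publication to first articulation; return ids oldest-first."""
    chain = []
    while node is not None:
        chain.append(node.source_event)
        node = node.parent
    return list(reversed(chain))

# idea -> meeting -> revision -> paper
first = ProvenanceNode("maybe it's a contamination artefact", "meeting-2026-01-12")
rev = ProvenanceNode("contamination explains the anomaly", "meeting-2026-02-02", first)
paper = ProvenanceNode("the anomaly is a contamination artefact", "paper-v1", rev)
```

The confabulation eval then reduces to comparing a researcher's retrospective answer against the chain `provenance_chain(paper)` actually recorded.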

Questions that become answerable

1. Replication of the Dunbar effects, at scale

Do the analogy-frequency, structural-vs-surface, and distance-by-goal findings replicate at N=400 labs across disciplines? The original studies were tiny; every cognitive scientist who has read them, without wanting to dismiss them, quietly wonders how robust they are. Now they can be checked.

2. Cross-disciplinary cognitive style

Do AI labs, theoretical physics groups, wet-lab biologists, and humanities seminars actually reason differently? Specifically: do they differ in analogy frequency, analogy distance, hypothesis-revision rate, contradiction-handling, error-anomaly response? Dunbar already showed immunology labs differed from molecular biology labs in subtle ways. The matrix is wide open.

3. Memory drift — the confabulation curve

How fast does a scientist's self-report of their own thinking diverge from the recorded transcript? Dunbar showed it's already corrupted within hours. Is the function exponential? Are there individuals who don't drift, or moves that don't get edited out? This is a crisp, measurable, directly Gazzaniga-adjacent psychological finding that nobody has measured cleanly because nobody had the data.
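Measuring the curve is straightforward once the data exists. A sketch with entirely invented numbers: score self-reports against the transcript at increasing delays, then test the exponential hypothesis by fitting recall(t) = exp(-t/τ) in log space with ordinary least squares.

```python
import math

# Hypothetical drift measurements: fraction of tape-recorded cognitive moves a
# researcher still reports accurately, at increasing delays (in hours).
delays = [1, 6, 24, 72, 168]
recall = [0.82, 0.55, 0.31, 0.12, 0.04]

# If drift is exponential, recall(t) = exp(-t / tau), so log(recall) is linear in t.
xs, ys = delays, [math.log(r) for r in recall]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
tau = -1.0 / slope   # characteristic drift time in hours; larger = slower forgetting
```

Residuals from this fit are exactly where the interesting findings would live: individuals who don't drift show up as outliers above the line.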

4. What predicts a discovery, prospectively?

Given the first 80% of a project's lifetime as captured cognitive process, can you predict whether it leads to a publishable result, a null result, or a pivot? Which conversational patterns — specifically — are diagnostic? Lots of analogies? Lots of disagreement? Long silences? Specific kinds of error-handling? This is the kind of question where modern ML could plausibly beat human intuition and generate interpretable hypotheses.
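Whatever the model, its input would be per-project features derived from the event log. A toy sketch, with hypothetical move labels, of turning a tagged move stream into the per-hour rates a classifier would consume:

```python
from collections import Counter

def project_features(move_types, meeting_hours):
    """Turn a project's tagged move stream into per-hour rates: the raw
    material for a prospective discovery predictor."""
    counts = Counter(move_types)
    return {f"{move}_per_hour": count / meeting_hours
            for move, count in counts.items()}

# Hypothetical project: 10 meeting-hours of tagged moves.
stream = ["analogy"] * 12 + ["contradiction"] * 5 + ["hypothesis"] * 8
features = project_features(stream, meeting_hours=10)
# features["analogy_per_hour"] is 1.2; feed the vector into any standard classifier.
```

The modelling itself is routine; the novelty is entirely in having the event log to featurise.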

5. Live cognitive scaffolding

If an LLM running on the live transcript flagged: "you used this same analogy three weeks ago and it didn't work then either", or "this hypothesis contradicts a finding you established in the March 14 meeting", or "no one has pushed back on the lead author for 20 minutes", does the lab make better decisions? This is the most ethically loaded question on the list. It is also the one whose answer would matter most for how science is actually practised.
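All three flags are mechanically checkable against the recorded history; no model is needed at flag time. A minimal rule-based sketch, with hypothetical record fields:

```python
def live_flags(history, current, minutes_since_pushback):
    """Check the current tagged move against the lab's recorded history.
    `history` and `current` are dicts like {"type", "content", "failed", ...}."""
    flags = []
    for past in history:
        if (current["type"] == "analogy" and past["type"] == "analogy"
                and past["content"] == current["content"] and past.get("failed")):
            flags.append("this analogy was used before and didn't work then either")
        if (current["type"] == "hypothesis" and past["type"] == "evidence"
                and past.get("contradicts") == current["content"]):
            flags.append(f"contradicts a finding established in {past['source']}")
    if minutes_since_pushback >= 20:
        flags.append("no one has pushed back on the lead author for 20 minutes")
    return flags

# Toy history: a previously failed analogy, re-used in the current meeting.
history = [{"type": "analogy", "content": "phage as trojan horse", "failed": True}]
current = {"type": "analogy", "content": "phage as trojan horse"}
flags = live_flags(history, current, minutes_since_pushback=25)
```

The hard part is not the rules but the social contract around surfacing them mid-meeting.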

Risks and the obvious objections

There are real ones, and brushing past them would be cheap. The Hawthorne cliff is shallower than in 1990, but it is not gone, and candid disagreement is exactly the behaviour a recorder is most likely to chill. Consent inside a lab is entangled with its power structure: a student "opting in" under a PI who wants the data is not straightforwardly consenting. A year of a group's unfiltered reasoning is a confidentiality liability that needs a genuine destruction policy, not a checkbox. And the LLM coding can fail quietly on precisely the constructs (irony, nuanced disagreement) that matter most for questions about pushback.

Why I keep thinking about this

This site is part of a project called Otak — a personal-knowledge system organised around claims rather than documents, with provenance and evidence chains. The connection to Dunbar's work is direct, even if I didn't see it for a long time. Otak is what you get when you take the in-vivo apparatus and turn it on yourself. Every claim has a source. Every source is timestamped. Retrospective summaries are checkable against the underlying record. The "interpreter" can still confabulate in a paper or a Substack post, but the substrate is not editable by the interpreter; it's a separate, indexed, queryable layer.

What I've described above is the same idea applied to a research group instead of a person. Same structure: claim-mediated, provenance-linked, retrospective-and-live, with the cognitive moves themselves as first-class objects. The Otak codebase — embedding-routed claim placement, argument-link discovery, canonical-finding synthesis — is most of the necessary plumbing. What's missing is the recording layer and the consent architecture, both of which are organisational rather than technical.
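As an illustration of what "embedding-routed claim placement" can mean mechanically (a sketch under my own assumptions, not the actual Otak implementation): embed the new claim, find its nearest existing claim by cosine similarity, and attach it only if the similarity clears a threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route_claim(new_embedding, existing, threshold=0.8):
    """Attach a new claim to its most similar existing claim, or return None
    to start a new cluster. `existing` maps claim-id -> embedding."""
    if not existing:
        return None
    best_id = max(existing, key=lambda cid: cosine(new_embedding, existing[cid]))
    return best_id if cosine(new_embedding, existing[best_id]) >= threshold else None

# Toy 3-d "embeddings" standing in for real model vectors.
claims = {"claim-a": [1.0, 0.0, 0.0], "claim-b": [0.0, 1.0, 0.0]}
target = route_claim([0.9, 0.1, 0.0], claims)   # lands next to "claim-a"
```

Scaled up from a person's claims to a lab's tagged utterances, the same routing step is what would link a meeting's hypothesis to its earlier articulations.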

Dunbar's program failed to scale because the per-hour cost of capture was prohibitive. That cost has now dropped to effectively zero. The interesting question is no longer "can we run the study?" It's "which research groups want to be studied this way, and what would they get out of it that we couldn't get any other way?"

References

  1. Dunbar, K. (1995). How scientists really reason: Scientific reasoning in real-world laboratories. In R. J. Sternberg & J. E. Davidson (Eds.), The Nature of Insight (pp. 365–395). MIT Press.
  2. Dunbar, K. (2001). The analogical paradox: Why analogy is so easy in naturalistic settings yet so difficult in the psychological laboratory. In D. Gentner, K. J. Holyoak & B. Kokinov (Eds.), The Analogical Mind. MIT Press. See also Trends in Cognitive Sciences commentary: PubMed 11477002.
  3. Gazzaniga, M. S., & LeDoux, J. E. (1978). The Integrated Mind. Plenum Press. The chicken-claw / snow-shovel experiment is described on pp. 148–150. See also Left-brain interpreter (Wikipedia) and Volz, L. J. & Gazzaniga, M. S. (2017), Interaction in isolation: 50 years of insights from split-brain research, Brain 140(7), 2051–2060.
  4. Green, A. E., Kraemer, D. J. M., Fugelsang, J. A., Gray, J. R., & Dunbar, K. N. (2010). Connecting long distance: semantic distance in analogical reasoning modulates frontopolar cortex activity. Cerebral Cortex, 20(1), 70–76. PubMed 19383937.
  5. Christensen, B. T., & Schunn, C. D. (2007). The relationship of analogical distance to analogical function and pre-inventive structure: The case of engineering design. Memory & Cognition, 35(1), 29–38.
  6. Dunbar, K. (1997). How scientists think: On-line creativity and conceptual change in science. In T. B. Ward, S. M. Smith & J. Vaid (Eds.), Creative Thought: An Investigation of Conceptual Structures and Processes (pp. 461–493). American Psychological Association.
  7. Dunbar, K., & Klahr, D. (2012). Scientific thinking and reasoning. In K. J. Holyoak & R. G. Morrison (Eds.), The Oxford Handbook of Thinking and Reasoning. Oxford University Press.
  8. Dumas, D., & Dunbar, K. N. (2016). The creative stereotype effect. PLOS ONE, 11(2), e0142567.
  9. Knorr-Cetina, K. (1999). Epistemic Cultures: How the Sciences Make Knowledge. Harvard University Press. Cross-disciplinary ethnographic counterpart to Dunbar's cognitive program.
  10. Gazzaniga, M. S. (2011). Who's in Charge? Free Will and the Science of the Brain. Ecco / HarperCollins. Accessible synthesis of the interpreter theory five decades on.
  11. Dunbar Lab (UMD) — Laboratory for Thinking, Reasoning, Creativity & Educational Neuroscience: education.umd.edu.