Otak · Research note

Recording the Lab

From Dunbar's tape recorders to always-on cognitive capture: what an LLM-equipped revival of in-vivo studies of scientific reasoning could look like.

2026-04-26 Stian Håklev ~12 min read

In the late 1980s, the Canadian cognitive psychologist Kevin Dunbar did something that almost no one before or since has done at scale: he sat in on the weekly meetings of four working molecular biology labs, switched on a tape recorder, and tried to capture the actual cognitive process of science as it unfolded. Not science as written up in journals. Not science as recalled in interviews. Science in vivo — the term he borrowed deliberately from the molecular biologists themselves.[1]

The recordings ran for a year per lab. Dunbar transcribed them by hand, coded every cognitive move he could identify, and asked a question that retrospective interviews and bibliometrics couldn't touch: how does a research group actually think?

The findings became canonical in the cognitive science of science:

- Analogy was everywhere: scientists used analogies in lab meetings far more often than laboratory studies of reasoning had suggested.
- The analogies used in vivo were mostly structural, not surface-level, in contrast to what subjects produce in psychology experiments.
- Analogical distance tracked the goal: near analogies for fixing experiments and methods, more distant ones for generating hypotheses and explaining concepts.
- Unexpected findings were routine rather than rare, and a lab's handling of them drove much of its reasoning.

And then there was the finding that has haunted me since I first read it, the one that connects this whole research program to a much older puzzle about minds:

When Dunbar interviewed scientists individually after the lab meetings, they massively under-reported their use of analogies. The analogies they had used in real time, on tape, simply did not appear in the retrospective accounts. They remembered making the discovery; they did not remember how.

The Gazzaniga link

This is the same machinery as Michael Gazzaniga's "left-hemisphere interpreter." In the chicken-claw / snow-shovel experiment with split-brain patients, the speaking left hemisphere — with no access to what the right hemisphere had seen — instantly confabulated a coherent reason ("you need a shovel to clean out the chicken shed") for behaviour it did not actually understand.[3] Dunbar's scientists are doing a slower, more dignified version of the same thing: their interpreter rewrites the cognitive history of a discovery into a clean post-hoc narrative, and the analogies they actually used — the ones the tape recorder captured — are the first thing to be edited out.

Where the field went next

Dunbar's in-vivo program flowered between roughly 1988 and 2002, then largely stopped. The reason is structural: tape recordings are expensive to transcribe, manual coding requires trained coders with high inter-rater reliability, and N=4 labs is already a heroic effort for a single researcher. Dunbar himself moved into educational neuroscience and fMRI, asking what brain regions activate when subjects reason analogically across semantic distance.[4] His most-cited recent papers are about the neural correlates of analogical reasoning, the creative-stereotype effect on divergent thinking, and the pedagogy of conceptual change — not lab ethnography.

Adjacent traditions kept some of the methodology alive in narrower domains:

- Protocol studies of design in cognitive psychology: Christensen and Schunn coded analogical distance and function in real engineering-design meetings.
- Laboratory ethnography in science and technology studies: Knorr-Cetina's Epistemic Cultures is the cross-disciplinary counterpart.

The two traditions never fully merged. Cognitive psychology kept the coding rigour but lost the field sites. STS kept the field sites but gave up on coding moves like "analogy" or "hypothesis revision" as too reductive.

What changed

Four things, roughly between 2022 and 2026:

~$0.006/min: cost of speaker-diarised transcription with current Whisper / Soniox-class models. Dunbar's 1990 transcription cost, in current dollars, was closer to $50/min.

~human-level: inter-rater reliability achievable by frontier LLMs on most cognitive-coding constructs, given good rubrics and a calibration set. Some constructs still fail (irony, nuanced disagreement); most do not.

N=400: a scale that's now plausible for an in-vivo study, given recording and coding cost. Dunbar managed N=4. The methodology was never the bottleneck; the labour was.

Normalised: always-on recording is socially established (Zoom, Granola, Otter). The Hawthorne-effect cliff is much shallower in 2026 than in 1990, when a tape recorder on the table was an unusual object.
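Both cost and reliability invite back-of-envelope checking. A sketch with invented numbers: the transcription bill for a hypothetical 400-lab, one-meeting-a-week, one-year corpus, and Cohen's kappa (the standard chance-corrected agreement statistic) computed between a human coder and a hypothetical LLM coder on a toy calibration set.

```python
from collections import Counter

# Illustrative cost arithmetic: 400 labs, one hour-long meeting per week, one year.
minutes = 400 * 52 * 60            # 1,248,000 meeting-minutes
cost_now = minutes * 0.006         # ~$7.5k at ~$0.006/min diarised transcription
cost_1990 = minutes * 50.0         # ~$62M at Dunbar-era ~$50/min (current dollars)

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders over the same utterances."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy calibration set: hand codes vs hypothetical LLM codes for six utterances.
human = ["analogy", "hypothesis", "analogy", "evidence", "analogy", "hypothesis"]
llm   = ["analogy", "hypothesis", "analogy", "analogy",  "analogy", "hypothesis"]
kappa = cohens_kappa(human, llm)   # 0.70 on this toy set
```

A kappa computed this way against a hand-coded validation set is how "~human-level" would be operationalised in practice, construct by construct.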

Each of these used to be a reason the in-vivo programme couldn't scale. None of them holds any more. Which raises the obvious next question: what would you actually do?

A study design for 2026

Imagine equipping a research group — with full informed consent, opt-in per recording, an explicit destruction policy, and the researchers themselves having edit rights — with continuous capture across all of:

- lab meetings and ad-hoc conversations (speaker-diarised audio),
- the group's chat channels,
- code commits, time-aligned with everything else.

The pipeline:

Capture: speaker-diarised audio + chat + commits, time-aligned. Output: one unified transcript stream per day.

Cognitive-move tagging: an LLM passes over the transcript, calibrated against a hand-coded validation set, tagging every analogy, hypothesis, contradiction, evidence-evaluation, scope-narrowing, and unexpected-finding moment. Output: a structured event log, with a typed cognitive move and a confidence per utterance.

Idea provenance: every claim that ends up in a paper is traced backward to the meeting in which it was first articulated, and the chain of mutations between then and publication is reconstructed. Output: a provenance graph, idea → meeting → revision → revision → paper.

Confabulation eval: periodically ask researchers structured retrospective questions ("how did you arrive at this hypothesis?", "did anyone push back?") and score the answers against the actual transcript. Output: a per-researcher memory-drift trajectory.

Lens / intervention (optional): an LLM running over the live transcript that can flag epistemic moves to the meeting (a confabulation detector, a missed-analogy prompter, a contradiction surfacer). Output: real-time cognitive scaffolding, with all the obvious risks.
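To make the middle stages concrete, here is a minimal sketch of the data model they imply. Every name and field is hypothetical, not part of any existing pipeline: a typed cognitive-move event for the tagging stage, and a provenance chain that walks a published claim back to its first articulation.

```python
from dataclasses import dataclass
from typing import Optional

MOVE_TYPES = {"analogy", "hypothesis", "contradiction",
              "evidence_evaluation", "scope_narrowing", "unexpected_finding"}

@dataclass
class CognitiveMove:
    """One tagged utterance in the structured event log."""
    utterance_id: str
    speaker: str
    move_type: str       # one of MOVE_TYPES
    confidence: float    # tagger confidence, 0..1

@dataclass
class ProvenanceNode:
    """One articulation of a claim: a meeting, a revision, or the paper itself."""
    claim_text: str
    source_event: str                            # id of the meeting / chat / commit
    parent: Optional["ProvenanceNode"] = None    # earlier articulation, if any

def provenance_chain(node):
    """Walk back from publication to first articulation; return ids oldest-first."""
    chain = []
    while node is not None:
        chain.append(node.source_event)
        node = node.parent
    return list(reversed(chain))

# idea -> meeting -> revision -> paper
first = ProvenanceNode("maybe it's a contamination artefact", "meeting-2026-01-12")
rev = ProvenanceNode("contamination explains the anomaly", "meeting-2026-02-02", first)
paper = ProvenanceNode("the anomaly is a contamination artefact", "paper-v1", rev)
```

The confabulation eval then reduces to comparing a researcher's retrospective answer against the chain `provenance_chain(paper)` actually recorded.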

Questions that become answerable

1. Replication of the Dunbar effects, at scale

Do the analogy-frequency, structural-vs-surface, and distance-by-goal findings replicate at N=400 labs across disciplines? The original studies were tiny; every cognitive scientist who has read them, without wanting to dismiss them, quietly wonders how robust they are. Now they can be checked.

2. Cross-disciplinary cognitive style

Do AI labs, theoretical physics groups, wet-lab biologists, and humanities seminars actually reason differently? Specifically: do they differ in analogy frequency, analogy distance, hypothesis-revision rate, contradiction-handling, error-anomaly response? Dunbar already showed immunology labs differed from molecular biology labs in subtle ways. The matrix is wide open.

3. Memory drift — the confabulation curve

How fast does a scientist's self-report of their own thinking diverge from the recorded transcript? Dunbar showed it's already corrupted within hours. Is the function exponential? Are there individuals who don't drift, or moves that don't get edited out? This is a crisp, measurable, directly Gazzaniga-adjacent psychological finding that nobody has measured cleanly because nobody had the data.
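Measuring the curve is straightforward once the data exists. A sketch with entirely invented numbers: score self-reports against the transcript at increasing delays, then test the exponential hypothesis by fitting recall(t) = exp(-t/τ) in log space with ordinary least squares.

```python
import math

# Hypothetical drift measurements: fraction of tape-recorded cognitive moves a
# researcher still reports accurately, at increasing delays (in hours).
delays = [1, 6, 24, 72, 168]
recall = [0.82, 0.55, 0.31, 0.12, 0.04]

# If drift is exponential, recall(t) = exp(-t / tau), so log(recall) is linear in t.
xs, ys = delays, [math.log(r) for r in recall]
n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
tau = -1.0 / slope   # characteristic drift time in hours; larger = slower forgetting
```

Residuals from this fit are exactly where the interesting findings would live: individuals who don't drift show up as outliers above the line.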

4. What predicts a discovery, prospectively?

Given the first 80% of a project's lifetime as captured cognitive process, can you predict whether it leads to a publishable result, a null result, or a pivot? Which conversational patterns — specifically — are diagnostic? Lots of analogies? Lots of disagreement? Long silences? Specific kinds of error-handling? This is the kind of question where modern ML could plausibly beat human intuition and generate interpretable hypotheses.
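Whatever the model, its input would be per-project features derived from the event log. A toy sketch, with hypothetical move labels, of turning a tagged move stream into the per-hour rates a classifier would consume:

```python
from collections import Counter

def project_features(move_types, meeting_hours):
    """Turn a project's tagged move stream into per-hour rates: the raw
    material for a prospective discovery predictor."""
    counts = Counter(move_types)
    return {f"{move}_per_hour": count / meeting_hours
            for move, count in counts.items()}

# Hypothetical project: 10 meeting-hours of tagged moves.
stream = ["analogy"] * 12 + ["contradiction"] * 5 + ["hypothesis"] * 8
features = project_features(stream, meeting_hours=10)
# features["analogy_per_hour"] is 1.2; feed the vector into any standard classifier.
```

The modelling itself is routine; the novelty is entirely in having the event log to featurise.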

5. Live cognitive scaffolding

If an LLM running on the live transcript flagged: "you used this same analogy three weeks ago and it didn't work then either", or "this hypothesis contradicts a finding you established in the March 14 meeting", or "no one has pushed back on the lead author for 20 minutes", does the lab make better decisions? This is the most ethically loaded question on the list. It is also the one whose answer would matter most for how science is actually practised.
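All three flags are mechanically checkable against the recorded history; no model is needed at flag time. A minimal rule-based sketch, with hypothetical record fields:

```python
def live_flags(history, current, minutes_since_pushback):
    """Check the current tagged move against the lab's recorded history.
    `history` and `current` are dicts like {"type", "content", "failed", ...}."""
    flags = []
    for past in history:
        if (current["type"] == "analogy" and past["type"] == "analogy"
                and past["content"] == current["content"] and past.get("failed")):
            flags.append("this analogy was used before and didn't work then either")
        if (current["type"] == "hypothesis" and past["type"] == "evidence"
                and past.get("contradicts") == current["content"]):
            flags.append(f"contradicts a finding established in {past['source']}")
    if minutes_since_pushback >= 20:
        flags.append("no one has pushed back on the lead author for 20 minutes")
    return flags

# Toy history: a previously failed analogy, re-used in the current meeting.
history = [{"type": "analogy", "content": "phage as trojan horse", "failed": True}]
current = {"type": "analogy", "content": "phage as trojan horse"}
flags = live_flags(history, current, minutes_since_pushback=25)
```

The hard part is not the rules but the social contract around surfacing them mid-meeting.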

Risks and the obvious objections

There are real ones, and brushing past them would be cheap. The Hawthorne cliff is shallower than in 1990, but it is not gone, and candid disagreement is exactly the behaviour a recorder is most likely to chill. Consent inside a lab is entangled with its power structure: a student "opting in" under a PI who wants the data is not straightforwardly consenting. A year of a group's unfiltered reasoning is a confidentiality liability that needs a genuine destruction policy, not a checkbox. And the LLM coding can fail quietly on precisely the constructs (irony, nuanced disagreement) that matter most for questions about pushback.

Why I keep thinking about this

This site is part of a project called Otak — a personal-knowledge system organised around claims rather than documents, with provenance and evidence chains. The connection to Dunbar's work is direct, even if I didn't see it for a long time. Otak is what you get when you take the in-vivo apparatus and turn it on yourself. Every claim has a source. Every source is timestamped. Retrospective summaries are checkable against the underlying record. The "interpreter" can still confabulate in a paper or a Substack post, but the substrate is not editable by the interpreter; it's a separate, indexed, queryable layer.

What I've described above is the same idea applied to a research group instead of a person. Same structure: claim-mediated, provenance-linked, retrospective-and-live, with the cognitive moves themselves as first-class objects. The Otak codebase — embedding-routed claim placement, argument-link discovery, canonical-finding synthesis — is most of the necessary plumbing. What's missing is the recording layer and the consent architecture, both of which are organisational rather than technical.
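As an illustration of what "embedding-routed claim placement" can mean mechanically (a sketch under my own assumptions, not the actual Otak implementation): embed the new claim, find its nearest existing claim by cosine similarity, and attach it only if the similarity clears a threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route_claim(new_embedding, existing, threshold=0.8):
    """Attach a new claim to its most similar existing claim, or return None
    to start a new cluster. `existing` maps claim-id -> embedding."""
    if not existing:
        return None
    best_id = max(existing, key=lambda cid: cosine(new_embedding, existing[cid]))
    return best_id if cosine(new_embedding, existing[best_id]) >= threshold else None

# Toy 3-d "embeddings" standing in for real model vectors.
claims = {"claim-a": [1.0, 0.0, 0.0], "claim-b": [0.0, 1.0, 0.0]}
target = route_claim([0.9, 0.1, 0.0], claims)   # lands next to "claim-a"
```

Scaled up from a person's claims to a lab's tagged utterances, the same routing step is what would link a meeting's hypothesis to its earlier articulations.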

Dunbar's program failed to scale because the per-hour cost of capture was prohibitive. That cost has now dropped to effectively zero. The interesting question is no longer "can we run the study?" It's "which research groups want to be studied this way, and what would they get out of it that we couldn't get any other way?"

References

  1. Dunbar, K. (1995). How scientists really reason: Scientific reasoning in real-world laboratories. In R. J. Sternberg & J. E. Davidson (Eds.), The Nature of Insight (pp. 365–395). MIT Press.
  2. Dunbar, K. (2001). The analogical paradox: Why analogy is so easy in naturalistic settings yet so difficult in the psychological laboratory. In D. Gentner, K. J. Holyoak & B. Kokinov (Eds.), The Analogical Mind. MIT Press. See also Trends in Cognitive Sciences commentary: PubMed 11477002.
  3. Gazzaniga, M. S., & LeDoux, J. E. (1978). The Integrated Mind. Plenum Press. The chicken-claw / snow-shovel experiment is described on pp. 148–150. See also Left-brain interpreter (Wikipedia) and Volz, L. J. & Gazzaniga, M. S. (2017), Interaction in isolation: 50 years of insights from split-brain research, Brain 140(7), 2051–2060.
  4. Green, A. E., Kraemer, D. J. M., Fugelsang, J. A., Gray, J. R., & Dunbar, K. N. (2010). Connecting long distance: semantic distance in analogical reasoning modulates frontopolar cortex activity. Cerebral Cortex, 20(1), 70–76. PubMed 19383937.
  5. Christensen, B. T., & Schunn, C. D. (2007). The relationship of analogical distance to analogical function and pre-inventive structure: The case of engineering design. Memory & Cognition, 35(1), 29–38.
  6. Dunbar, K. (1997). How scientists think: On-line creativity and conceptual change in science. In T. B. Ward, S. M. Smith & J. Vaid (Eds.), Creative Thought: An Investigation of Conceptual Structures and Processes (pp. 461–493). American Psychological Association.
  7. Dunbar, K., & Klahr, D. (2012). Scientific thinking and reasoning. In K. J. Holyoak & R. G. Morrison (Eds.), The Oxford Handbook of Thinking and Reasoning. Oxford University Press.
  8. Dumas, D., & Dunbar, K. N. (2016). The creative stereotype effect. PLOS ONE, 11(2), e0142567.
  9. Knorr-Cetina, K. (1999). Epistemic Cultures: How the Sciences Make Knowledge. Harvard University Press. Cross-disciplinary ethnographic counterpart to Dunbar's cognitive program.
  10. Gazzaniga, M. S. (2011). Who's in Charge? Free Will and the Science of the Brain. Ecco / HarperCollins. Accessible synthesis of the interpreter theory five decades on.
  11. Dunbar Lab (UMD) — Laboratory for Thinking, Reasoning, Creativity & Educational Neuroscience: education.umd.edu.