Headline findings from the BIE Research Dataset v1: 4,500 synthetic AI-mediated conversations producing 15,628 Layer Z signals across nine deployment archetypes. Dependency drift tracks design intent, with a 0.20-point spread between pedagogical and engagement-driven bots. Trust calibration tracks whether the user can verify the output. Frustration buildup splits along push-versus-serve lines. Every median is open. The dataset and the generator are published alongside.
The work underneath the work.
A measurement instrument backed by open research. What we did, what we found, and how to verify it.
A structured behavioral ontology for AI-mediated environments. Trust calibration, frustration buildup, dependency drift, silent abandonment, escalation friction, and comprehension gap, scored on every human turn after every AI turn, validated against the published RCT literature.
How BIE generates counterfactual alternatives for failed AI turns without hallucinating predictions. Strict grounding in same-deployment evidence, mandatory confidence ranges, type-level refusal when the evidence base is too thin.
Dependency drift tracks design intent, not the underlying model.
Across nine AI deployment archetypes, the 0.20-point spread between pedagogical and engagement-driven bots is the largest cross-archetype delta on any Layer Z dimension.
Motivation
The intuition in the field is that user behavior follows model quality. Better model, healthier users. We wanted to know whether that holds once you measure the human side directly rather than inferring it from output scores.
So we held the measurement constant and varied the deployment. Same engine, same six Layer Z dimensions, nine archetypes that differ in what the product is trying to get the user to do.
Method
We generated 12,643 synthetic conversations spread across the nine archetypes, scored every human turn on the Layer Z dimensions, and took per-archetype medians. The generator and the scoring prompts are published, so each median is checkable from the source it points at.
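The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the published pipeline: the row layout `(archetype, dimension, score)`, the function name `per_archetype_medians`, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def per_archetype_medians(signals):
    """Group (archetype, dimension, score) rows and take the median of each group."""
    grouped = defaultdict(list)
    for archetype, dimension, score in signals:
        grouped[(archetype, dimension)].append(score)
    return {key: median(scores) for key, scores in grouped.items()}


# Illustrative rows only; these are not values from the dataset.
sample = [
    ("ai_tutor", "dependency_drift", 0.2),
    ("ai_tutor", "dependency_drift", 0.3),
    ("ai_tutor", "dependency_drift", 0.4),
    ("cx_chatbot", "dependency_drift", 0.4),
    ("cx_chatbot", "dependency_drift", 0.6),
]

medians = per_archetype_medians(sample)
for (archetype, dimension), value in sorted(medians.items()):
    print(f"{archetype:12s} {dimension:18s} {value:.2f}")
```

Medians rather than means keep a handful of extreme conversations from moving an archetype's headline number, which matters when the comparison is between archetypes rather than within one.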
Dependency drift is read longitudinally, as the slope of how much the user offloads across a conversation. A higher value means the user is leaning on the bot more by the end than at the start.
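One way to read "slope" concretely is an ordinary least-squares fit of a per-turn offload score against the turn index. The sketch below assumes such a per-turn score exists; the name `drift_slope` and the input values are hypothetical, and the text does not say how a raw slope maps onto the reported 0 to 1 scale.

```python
def drift_slope(offload_scores):
    """Least-squares slope of offload score against turn index (0, 1, 2, ...)."""
    n = len(offload_scores)
    if n < 2:
        raise ValueError("need at least two turns to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(offload_scores) / n
    # Covariance of (turn index, score) divided by variance of turn index.
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(offload_scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical conversations: one where offloading grows, one where it stays flat.
rising = drift_slope([0.1, 0.2, 0.3, 0.4])
flat = drift_slope([0.3, 0.3, 0.3, 0.3])
print(f"rising: {rising:+.3f}, flat: {flat:+.3f}")
```

A positive slope corresponds to the user leaning on the bot more by the end of the conversation than at the start; a slope near zero means the level of offloading did not change.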
Findings
The pedagogical archetype, ai_tutor, sits alone at the bottom of the dependency-drift scale at 0.30. The engagement-driven cluster, including cx_chatbot, sales_agent, and ai_companion, sits at 0.50. That 0.20-point gap is wider than any spread we measured on trust calibration or frustration buildup, both of which moved only 0.10 across the same nine archetypes.
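The spread quoted above is the difference between the highest and lowest per-archetype median on a dimension. As a small worked check using only the four archetypes named in the text (the remaining five are not listed here, so this is a partial view), with `spread` as a hypothetical helper name:

```python
# Dependency-drift medians for the archetypes named in the text.
dependency_drift_medians = {
    "ai_tutor": 0.30,
    "cx_chatbot": 0.50,
    "sales_agent": 0.50,
    "ai_companion": 0.50,
}


def spread(medians):
    """Cross-archetype spread: highest median minus lowest median."""
    values = list(medians.values())
    return max(values) - min(values)


drift_spread = round(spread(dependency_drift_medians), 2)
print(f"dependency drift spread: {drift_spread:.2f}")
```

Running the same helper over the trust-calibration and frustration-buildup medians would give the figures to compare against.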
The reading is that what the product is built to do shows up in the user before the model does. A bot designed to make the user self-sufficient produces self-sufficiency. A bot designed to keep the user in the conversation produces dependence.
Limitations
This is a synthetic corpus. The medians are stable across the dataset but the dataset is generated, so the numbers describe the instrument's behavior on designed data, not a production population. We say so plainly because the point of opening the generator is that anyone can rerun it and disagree.
Customer-contributed baselines replace these as real deployments connect. Until then, treat the spread as a hypothesis with a published way to test it.
Dependency drift is the one dimension where the design of the product, rather than the model behind it, sets the level.[1] It is read across a whole conversation, not turn by turn, which is why it surfaces intent so cleanly. Trust calibration and frustration buildup, by contrast, barely move across archetypes.
That stability is itself a result. If frustration and trust calibration held within a tenth of a point across nine very different products, the thing that moved is worth naming.
Tutors cluster below 0.40 on dependency drift; engagement-driven bots above 0.45.
@techreport{bie_dataset_v1_2026,
  author      = {Acharya, Jaga},
  title       = {Behavioral Intelligence Engine: A Synthetic Corpus for AI-Mediated User Behavior},
  institution = {Behavioral Intelligence Engine},
  year        = {2026},
  note        = {12,643 conversations, 9 archetypes, open generator}
}
The dataset was the warm-up.