Most apps auto-title conversations from the first few exchanges, before the discussion has gone anywhere. Titles drift toward each other. Eventually, finding an old thread costs more than re-asking, so users re-ask. The history drawer turns into a wastebasket no one opens.
We wanted to understand the lived shape of that decay. How do people actually use these tools? How do they handle the slow accumulation of threads? And which interface changes do they themselves wish for? We designed a small mixed-methods study, a survey at the population level plus interviews at the motivational level, to give the next generation of LLM chat UIs some empirical footing.
Recruitment ran through public social channels in three languages. We deliberately balanced veterans against first-month newcomers: only talking to power users would miss the friction beginners hit on day one; only talking to beginners would miss the structural failures that surface at 100 conversations. Every respondent gave informed consent; interview participants received a privacy briefing before recording.
The survey targeted retrieval behavior as its core construct: when a user wants old content back, do they go dig in the sidebar, or do they open a fresh window and re-ask? Subjective dimensions (interface satisfaction, retrieval difficulty, perceived organization) were measured on 5-point Likert scales.
Interviews were semi-structured. Each participant walked us through their daily management ritual: how they start a thread, when they rename, whether they delete, the hesitation before deletion. We also probed the affective layer, that hard-to-name irritation when the drawer becomes a mess. The closing prompt asked each interviewee to describe (or sketch) their ideal interface.
- **Recruitment:** public posts in CN/EN/ES; veterans and newcomers in deliberate balance, because each group sees a different failure mode of the same UI.
- **Survey:** demographics, usage frequency, and a battery of 5-point scales targeting interface satisfaction, retrieval difficulty, and perceived organization; eleven concrete pain points selectable as a multi-check.
- **Interviews:** semi-structured; each participant demoed their actual sidebar (opening, renaming, deleting, hesitating) and finished by drawing their dream interface.
Over half of the sample has used LLM apps for more than a year. The bulk of respondents have lived through the slow accumulation of threads that produces structural retrieval failure.
Beginner and intermediate users were intentionally included to surface first-touch friction, but the bulk of the sample sits in the long-tail use range where structural problems live.
We expected differentiation: Copilot for code, ChatGPT for writing, Claude for research. The data refused that story. Across the board, whichever app a user prefers becomes a general-purpose tool covering coding, writing, learning, lookup, and analysis.
For each item, we measured the share who selected it as a real pain point in their own use. Two items split sharply by skill:
50% of high-skill users flagged either "auto-titles" or "too many threads"; only 16.7% of casual users did the same.
Experts can articulate which drawer hinge is broken, because 69.7% of them already hold 50+ conversations. Casual users (33.3% past 50) just feel the friction without the vocabulary.
Overall satisfaction lands at 6.35/10 (SD = 1.92), significantly above the scale midpoint (t(47) = 3.07, p < 0.01); subjective frustration sits at 3.08/5 (SD = 1.05), statistically equivalent to neutral. The headline number is "fine." The interview transcripts are not.
Title-accuracy ratings average 2.94 (SD = 1.00), statistically equivalent to the neutral midpoint under TOST (equivalence bounds δ = ±0.5, p = 0.002). One veteran said ChatGPT and Claude titles were "close enough." Another noted that Gemini's mobile voice mode opens a fresh thread per question, while auto-titles are generic enough that a single long thread silently absorbs many topics.
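Both are standard one-sample procedures. A minimal sketch of the pair, on simulated stand-ins for the two survey columns (the midpoints 5.5 and 3 are the scale centers; the arrays here are not the real data):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import DescrStatsW

rng = np.random.default_rng(0)
# Simulated stand-ins for two survey columns (n = 48); the real
# analysis runs on the questionnaire responses.
satisfaction = rng.normal(6.35, 1.92, size=48)    # 10-point overall item
title_accuracy = rng.normal(2.94, 1.00, size=48)  # 5-point accuracy item

# One-sample t-test: is mean satisfaction above the scale midpoint (5.5)?
t, p = stats.ttest_1samp(satisfaction, popmean=5.5)

# TOST: is mean title accuracy within ±0.5 of the neutral point (3)?
# Equivalence holds when the larger of the two one-sided p-values is small.
p_tost, _, _ = DescrStatsW(title_accuracy).ttost_mean(low=2.5, upp=3.5)

print(f"satisfaction: t = {t:.2f}, p = {p:.4f}; title TOST p = {p_tost:.4f}")
```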
Demand for tagging averages M = 4.12 (SD = 0.76, t = 10.2, p < 0.001), the highest item in the battery. In-thread bookmarks come in at M = 3.79 (SD = 0.97, p < 0.001) but with wider spread. Interviews surface concrete asks: prefixed brackets like [work] as ad-hoc tags, Drive-style folders by project name, the ability to star and pin threads.
Multiple interviewees said the same thing differently. The drawer is broken, but the model is what they're paying for. The implication is uncomfortable: as model capability converges across vendors over the next two years, the moat shifts to interface, and the interface is the part nobody has invested in.
Four candidate predictors entered a single logistic regression. Only one came out significant, and not by a small margin.
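A sketch of that model under assumptions: `frustrated` as a binary outcome and four binary pain-point flags, all hypothetical column names standing in for the questionnaire variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 48
# Simulated stand-ins for the four pain-point predictors and the outcome.
df = pd.DataFrame({
    "scatter":     rng.integers(0, 2, n),  # "similar topics scattered"
    "volume":      rng.integers(0, 2, n),  # "too many threads"
    "weak_search": rng.integers(0, 2, n),
    "bad_titles":  rng.integers(0, 2, n),
    "frustrated":  rng.integers(0, 2, n),
})

X = sm.add_constant(df[["scatter", "volume", "weak_search", "bad_titles"]])
fit = sm.Logit(df["frustrated"], X).fit(disp=False)

# Exponentiated coefficients are odds ratios; the study reports
# OR = 12.07 (p = 0.003) for scatter and nulls for the other three.
print(pd.DataFrame({"OR": np.exp(fit.params), "p": fit.pvalues}))
```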
The design implication is unusually clean: for the ~40% of users who feel scatter, optimizing titles or beefing up search is second-order. What helps them is keeping same-topic content together by default.
Volume alone doesn't make people miserable. Repeated retrieval failure inside that volume does. Storing more is fine. Storing more without findability is the failure mode.
Self-rated tech skill explains essentially zero variance in satisfaction (β = 0.067, p = 0.838, R² < 0.1%). The story changes when we replace skill with an "organizational initiative" composite (rename frequency + tidy-up willingness + topic-splitting habit, z-summed).
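A sketch of how the composite and its skill-controlled association can be computed; all column names are hypothetical stand-ins for the survey items:

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 48
# Simulated stand-ins for the three organizing-habit items, self-rated
# skill, and satisfaction; real values come from the survey.
df = pd.DataFrame(rng.normal(3, 1, size=(n, 5)),
                  columns=["rename_freq", "tidy_up", "topic_split",
                           "skill", "satisfaction"])

# "Organizational initiative" = the three habit items z-scored and summed.
items = df[["rename_freq", "tidy_up", "topic_split"]]
df["initiative"] = stats.zscore(items).sum(axis=1)

# Partial correlation of initiative with satisfaction, controlling for
# skill: correlate the residuals after regressing each on skill.
X = sm.add_constant(df["skill"])
r_sat = sm.OLS(df["satisfaction"], X).fit().resid
r_ini = sm.OLS(df["initiative"], X).fit().resid
partial_r, p = stats.pearsonr(r_ini, r_sat)  # study reports partial r = 0.345
print(f"partial r = {partial_r:.3f}, p = {p:.3f}")
```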
A small amount of organizing effort, regardless of the user's technical fluency, reliably yields a smoother retrieval experience. That's a design lever: make organizing low-cost, and the lever is open to everyone.
K-Means clustering (k = 3) on feature-preference scores produced three groups with sharply different shapes. ANOVA F(2, 45) = 34.75, p < 0.001, η² = 0.61. The separation is too clean to be sampling noise.
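The pipeline in outline, on simulated preference scores (the study's ANOVA may have been run per item or on a composite; one column is tested here for illustration):

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Simulated feature-preference matrix (48 respondents x 10 features);
# the real matrix holds each respondent's 5-point ratings.
prefs = rng.normal(3, 1, size=(48, 10))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(prefs)

# One-way ANOVA on one preference item across the 3 clusters, plus
# eta-squared as effect size (study: F(2, 45) = 34.75, eta^2 = 0.61).
groups = [prefs[labels == k, 0] for k in range(3)]
F, p = stats.f_oneway(*groups)
grand = prefs[:, 0].mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_total = ((prefs[:, 0] - grand) ** 2).sum()
print(f"F = {F:.2f}, p = {p:.4f}, eta^2 = {ss_between / ss_total:.2f}")
```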
The three shapes:

- The biggest group wants something close to a search engine over their entire history, treating every past conversation as a queryable document.
- The second treats threads as a reusable function library and wants the most-used items pinned to the top, surfaced by recency-of-use rather than recency-of-creation.
- The third has no appetite for organization: it relies on recent memory and the default reverse-chronological list, and actively rejects manual folders (Kruskal-Wallis H = 12.78, p = 0.002).
Cluster identity is not tied to occupation (Fisher exact p = 0.483, Cramér's V = 0.28). Treat the clusters as a feature-prioritization map, not a demographic taxonomy.
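For reproduction, note that scipy's `fisher_exact` covers only 2×2 tables; an exact R×C test needs Monte Carlo or R's `fisher.test`. The sketch below substitutes a chi-square test alongside Cramér's V, on a hypothetical cluster-by-occupation table:

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Hypothetical 3-cluster x 4-occupation contingency table (counts invented
# for illustration; the study's table is not reproduced here).
table = np.array([[6, 4, 5, 3],
                  [4, 5, 3, 4],
                  [3, 4, 4, 3]])

chi2, p, dof, _ = chi2_contingency(table)
v = association(table, method="cramer")  # study reports V = 0.28
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}, Cramér's V = {v:.2f}")
```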
We correlated each of ten candidate features with overall satisfaction. The pattern is hard to misread: features that hand control back to the user beat features that try to take it away.
| Feature | Correlation with satisfaction | Verdict |
|---|---|---|
| In-thread bookmarks | r = 0.282, p = 0.026 | ★ only significant item |
| Manual folders | r = 0.227, p = 0.06 | marginal, n.s. |
| Auto-grouping | r = 0.018 | TOST: equivalent to null |
| Auto-link related threads | r = -0.004 | TOST: equivalent to null |
| Timeline view | r = 0.051 | TOST: equivalent to null |
| Full-text search | r = -0.047 | TOST: equivalent to null |
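The table reduces to one loop; the feature column names below are hypothetical stand-ins for the survey items:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
n = 48
features = ["bookmarks", "folders", "auto_group",
            "auto_link", "timeline", "search"]  # hypothetical names
# Simulated preference ratings and satisfaction scores.
df = pd.DataFrame(rng.normal(3, 1, size=(n, len(features))), columns=features)
df["satisfaction"] = rng.normal(6.35, 1.92, n)

# Pearson r of each feature-preference item against overall satisfaction.
for f in features:
    r, p = pearsonr(df[f], df["satisfaction"])
    print(f"{f:>10}: r = {r:+.3f}, p = {p:.3f}")
```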
Correlation isn't causation. It could be that already-organized users like both bookmarks and the current UI. But when every "let the system do it" feature flatlines simultaneously, the signal lands cleanly: outsourcing organization to the AI doesn't move the experience. Letting the user mark a single line at the moment they realize it matters does.
Splitting respondents by whether they selected "scattered topics" as a pain reveals an asymmetry. Users without scatter pain rate the pin/top-of-list feature significantly higher than those with it (M = 4.39 vs. 3.95, t = 2.07, df = 44.1, p = 0.022 one-tailed, Cohen's d = 0.59). Pinning is a "nice-to-have" for users who already have their house in order, not a fix for those drowning.
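That comparison is a one-tailed Welch test (unequal variances, hence the fractional df). A sketch with simulated group scores:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)
# Pin-feature ratings split by the "scattered topics" checkbox
# (simulated stand-ins; study: M = 4.39 vs 3.95, d = 0.59).
no_scatter = rng.normal(4.39, 0.7, size=29)
scatter = rng.normal(3.95, 0.8, size=19)

# Welch's t-test, one-tailed in the direction "the no-scatter group
# rates pinning higher".
t, p = ttest_ind(no_scatter, scatter, equal_var=False, alternative="greater")
print(f"t = {t:.2f}, one-tailed p = {p:.3f}")
```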
Synthesized from interview transcripts and cluster membership. Each persona aggregates several real respondents who shared a mental model. They are illustrative, not census categories.
Opens a new thread per coding question, treating each as a reusable function. His system held up at small scale and broke when the threads stacked up, most acutely on Gemini mobile, where every voice question becomes its own session under a generic auto-title.
> "I started re-asking, because finding the original was slower than typing it again." (P-01)
Wants a Spotify-style sidebar: most-used at the top, sorted by frequency. Has migrated to third-party clients (Cherry Studio) for tag and filter support. Has zero appetite for AI-driven categorization.
Uses an LLM as external memory for dissertation work: Derrida one day, Victorian novels the next, chapter revisions in between. Daily use is fine. Pulling a single argument back is brutal.
> "Twenty minutes opening seven threads called 'Literary Analysis,' then I just gave up and re-asked the question." (P-02)
Asks for thread-level organization (tags, folders), but the only feature correlated with satisfaction is sub-thread bookmarking. Currently copies key paragraphs into Notion to build her own index outside the chat app.
Asked what he wants from history management, he is direct: he scrolls until something looks familiar. Reverse-chronological is enough; he usually just wants yesterday's brainstorm or last week's notes.
> "I scroll till I see something I recognize. That's it." (P-03)
Adds 8–10 threads/week, heading toward the ~50-thread breakdown point. Refused folders ("feels like homework"). Said the "related threads" recommendation feels useful, because it asks nothing of him.
The most counter-intuitive finding is how little automation buys. Auto-grouping, auto-linking, timeline view, full-text search, all flatlined under TOST. In-thread bookmarks (r = 0.282) led the table; "initiative" independently predicted satisfaction (partial r = 0.345). Both signals point the same direction: user participation has value of its own.
Instead of silently filing untouched threads into algorithm-shaped folders, give people lightweight manual tools (bookmarking a paragraph, splitting a thread) that activate at the moment of intent.
The interview pattern: a single thread quietly drifts from one topic into another, and the early content gets buried. This is the same failure mode as scatter (OR = 12). It also poisons the model's own context window.
A non-intrusive cue when the system detects a topic shift ("This looks like a new topic. Continue here, or split from this point?") makes splitting the path of least resistance, without punishing users who genuinely want a long thread. Conversational UIs should remember they're closer to "talking to someone" than to "selecting a session."
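One way such a cue could be wired, strictly as a sketch: score each incoming message against a running topic centroid of the thread and offer a split when similarity drops. The `embed()` stub and the 0.55 threshold are placeholders, not measured values:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for any sentence-embedding model (returns a unit vector)."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

class ThreadMonitor:
    """Tracks a running topic centroid and flags likely topic shifts."""

    def __init__(self, threshold: float = 0.55):
        self.centroid = None
        self.threshold = threshold

    def on_message(self, text: str) -> bool:
        """Return True when the UI should offer a 'split from here?' prompt."""
        v = embed(text)
        if self.centroid is None:
            self.centroid = v
            return False
        shifted = float(v @ self.centroid) < self.threshold  # cosine similarity
        # Update the centroid either way; a real UI would reset it
        # if the user accepts the split offer.
        self.centroid = 0.9 * self.centroid + 0.1 * v
        self.centroid /= np.linalg.norm(self.centroid)
        return shifted
```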
Expert pain is a scale problem, not a skill problem. The current interface holds at 10 threads (James's comfort zone) and cracks at 100 (Carlos's crisis). That gap shouldn't be patched late.
From day one: server-side full-text indexing instead of a client-side Ctrl+F that fails on lazy-loaded items; multiple sort views (chronological, frequency, manual folder) instead of one canonical reverse-time list.
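A minimal sketch of what server-side indexing buys, using SQLite's FTS5 as a stand-in (a production service would more likely use Postgres full-text search or a dedicated engine; FTS5 availability depends on the SQLite build):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE messages USING fts5(thread_id, content)")
con.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [("t1", "Derrida and the limits of deconstruction"),
     ("t2", "Victorian novels reading list for chapter two"),
     ("t3", "Refactor the auth middleware into a reusable function")],
)

# Ranked full-text search across every thread, with highlighted snippets;
# none of this depends on which threads the client happens to have loaded.
for thread_id, snip in con.execute(
    "SELECT thread_id, snippet(messages, 1, '[', ']', '…', 8) "
    "FROM messages WHERE messages MATCH ? ORDER BY rank",
    ("derrida",),
):
    print(thread_id, snip)
```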
James won't use folders. Carlos will live and die by them. A single design must serve both. Smart defaults (this week's most-used threads pinned automatically, related-thread suggestions when starting a new prompt) work for the passive group; explicit folders, tags, and pinning serve the active group.
The principle: invisible affordances by default, visible affordances on request.
Capabilities are converging across vendors. Differentiation is migrating to the surface. The vendor that makes its drawer actually usable wins a lasting edge, and right now no one is doing it.
Among all candidates, only "similar topics scattered" reaches significance (OR = 12.07, p = 0.003). Volume, search quality, and titles do not.
Controlling for self-rated skill, organizational initiative still independently explains satisfaction. Skill itself is a non-event.
Auto-grouping, auto-linking, timeline view, full-text search: all four were equivalence-tested to zero against satisfaction.
A low-friction manual tool, applied at the moment a user knows they care, is the single feature whose adoption tracks happiness.
Mediation analysis shows the direct path from thread count to frustration is null once retrieval failure enters the model. Frustration arises from structural retrieval failure inside volume, not volume per se.
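A sketch of that mediation model via statsmodels, with hypothetical variable names (`thread_count` as exposure, `retrieval_failure` as mediator, `frustration` as outcome) and simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.mediation import Mediation

rng = np.random.default_rng(6)
n = 48
# Simulated data with the hypothesized structure: volume drives
# retrieval failure, and retrieval failure drives frustration.
df = pd.DataFrame({"thread_count": rng.poisson(40, n).astype(float)})
df["retrieval_failure"] = 0.02 * df["thread_count"] + rng.normal(size=n)
df["frustration"] = 0.8 * df["retrieval_failure"] + rng.normal(size=n)

outcome = sm.OLS.from_formula(
    "frustration ~ retrieval_failure + thread_count", df)
mediator = sm.OLS.from_formula("retrieval_failure ~ thread_count", df)

# The summary separates ACME (the indirect path through retrieval
# failure) from ADE (the direct path); the study found the direct
# effect of volume null once retrieval failure is modeled.
res = Mediation(outcome, mediator, "thread_count",
                mediator="retrieval_failure").fit(n_rep=500)
print(res.summary())
```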
The findings are best read as a stable directional signal, not a deterministic verdict. We used equivalence tests throughout to distinguish "real null" from "underpowered to detect." Marginal items remain trends only.
Even where initiative correlates with satisfaction, reverse causation is plausible: organized personalities may report higher satisfaction independent of any tool effect.
The tech-skill measure rests on self-evaluation, with the inherent biases of self-report.
Long-term tracking and A/B trials are the right next step: when we actually ship a more user-controllable interface, do organizing behavior and satisfaction move? That's how the agency hypothesis goes from correlation to cause.