gxl
Blog
June 2, 2026

Paperclip Repos: Enabling Git for Literature Search

TL;DR

We enable agents to produce a system of record for their research through Paperclip Repos, a git abstraction on top of our virtual filesystem. In our tests, repos raised the share of citations fully supported by their source from 52% to 94%, and they make large-scale literature synthesis much easier. These updates are already default in Paperclip: just run paperclip update. See the full documentation for details.

To begin, tell your agent, “use paperclip repos.”
One research task, as a git workflow
Paperclip
$ paperclip repo add PMC6785875 "Mirvetuximab ORR 32.4%" --lines L210-L214
$ paperclip repo add PMC8834120 "Gemtuzumab CR 84%" --lines L88-L91
$ paperclip repo commit -m "ADC efficacy"
  ✓ PMC6785875  supported
  ✗ PMC8834120  paper says 70.4%, not 84% (dropped)
$ paperclip repo branch safety-signals
  # …add + commit hepatotoxicity claims…
$ paperclip repo merge safety-signals
[Diagram: main (HEAD) and a safety-signals branch; add → commit on each, then branch and merge; every claim verified.]
Claims are added as the agent reads, verified against source text at commit (unsupported ones never make it in), and isolated on branches before merging back into the main review.

Research in the agentic era often falls short of its etymology — we should be able to search, and then re-search over a corpus, iterate, and find new connections within a set of papers. Instead, we often get a single end-all, be-all response that can’t be reproduced outside the conversation history. The steps that lead to these responses are opaque (or worse, non-existent) because the underlying architecture is not built to be observable.

Sometimes we want agents to only search within a particular subset of trusted documents rather than millions. As such, agents should also be able to apply specific content filters to curate the corpus: high-impact journals, newer articles, methods papers, etc.

Luckily, this framework has already been built and refined for the programming world through git. The git abstraction allows agents to record the steps they take to build out complex repos. And best of all, agents already speak git out of the box. But if git for literature search is so intuitive, why hasn’t this been built yet?

An obvious reason is that it is a heavy lift to add, store, and move full papers across git. Each paper PDF contains megabytes of multimodal content, so a paper repo can run to gigabytes of storage. Then there is the issue of processing each new paper: parsing, indexing, and generating metadata for future searches. A dedicated researcher can clear these hurdles, but they are enough to discourage large-scale adoption.

Paperclip Repos aim to solve this problem by using our virtual filesystem. Rather than move around clunky PDFs, we simply add pointers to a paper-as-a-filesystem. This way, git primitives (e.g., add, commit, branch, merge) take only milliseconds and carry all the traceability built into our Paperclip system.

How it works

A repo is simply a folder with a system of record. In our context, an agent can use a paper repo to add papers and associated claims, commit these claims (and check if they are indeed supported by the evidence), as well as branch and merge to allow for exploration within a subject. Below, we outline these basic actions:

Note: These commands are part of your agent’s Paperclip skills — rather than run them yourself, you can just tell your agent to do the steps (e.g. “Create a repo with the papers used in this analysis”).

1. Add claims from documents

When the agent decides to include a paper in a repo, it uses add, just as it would for a file in git. Agents can also annotate the paper with specific claims and the line numbers those claims came from.

paperclip repo add PMC7194329 "Semaglutide reduced HbA1c by 1.8% vs placebo" --lines L45-L52
paperclip repo add PMC8834120 "Cardiac events occurred in 3.2% of the treatment arm"

Each add is append-only. A single paper can carry multiple claims, and claims can be refined by removing and re-adding.
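The append-only behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not Paperclip's implementation: the Claim and ClaimLog classes and the parse_lines helper are hypothetical names, and the line-spec grammar is only inferred from the examples in this post (L45-L52, L62, L36,L39).

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record: a pointer to a paper, not the paper itself."""
    paper_id: str
    text: str
    lines: tuple = ()


def parse_lines(spec: str) -> tuple:
    """Expand a spec such as 'L45-L52', 'L62' or 'L36,L39' into line numbers."""
    numbers = []
    for part in spec.split(","):
        match = re.fullmatch(r"L(\d+)(?:-L(\d+))?", part.strip())
        if match is None:
            raise ValueError(f"bad line spec: {part!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        if end < start:
            raise ValueError(f"descending range: {part!r}")
        numbers.extend(range(start, end + 1))
    return tuple(numbers)


@dataclass
class ClaimLog:
    """Append-only log: removals are new entries, earlier entries are never edited."""
    entries: list = field(default_factory=list)

    def add(self, paper_id: str, text: str, lines: str = "") -> Claim:
        claim = Claim(paper_id, text, parse_lines(lines) if lines else ())
        self.entries.append(("add", claim))
        return claim

    def remove(self, paper_id: str) -> None:
        self.entries.append(("remove", Claim(paper_id, "")))

    def active(self) -> list:
        """Replay the log to find the claims that are currently live."""
        live = []
        for op, claim in self.entries:
            if op == "remove":
                live = [c for c in live if c.paper_id != claim.paper_id]
            else:
                live.append(claim)
        return live


log = ClaimLog()
log.add("PMC7194329", "Semaglutide reduced HbA1c by 1.8% vs placebo", "L45-L52")
log.add("PMC8834120", "Cardiac events occurred in 3.2% of the treatment arm")
log.remove("PMC8834120")  # refine a claim by removing, then re-adding
log.add("PMC8834120", "Cardiac events occurred in 2.1% of the treatment arm")
```

Because a removal is itself a log entry, the full sequence of additions and corrections stays available for later inspection even after a claim is refined.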

2. Commit and verify

When the agent is ready to cite, it commits. The commit also runs a citation check against the source document:

paperclip repo commit -m "cardiovascular outcomes review"

Under the hood, each claim is dispatched to its own LLM verifier, which reads the full text of the underlying paper and checks the claim against it; all claims are verified in parallel. Verification produces a per-claim result: supported or not supported, with evidence. Agents can then remove and re-add a claim to correct it. This also creates a work trail, so another agent can pick up where this one left off and see how far the work has progressed.

$ paperclip repo status
  ✓ PMC7194329  “Semaglutide reduced HbA1c by 1.8% vs placebo”
  ✗ PMC8834120  “Cardiac events occurred in 3.2% of the treatment arm”
                → paper says 2.1%, not 3.2%

The agent uses verified claims to generate its final response. Unverified claims get corrected or dropped before the user ever sees them.
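The fan-out at commit time can be sketched as follows. This is only an illustration under stated assumptions: the post describes an LLM verifier per claim, whereas the verify function here is a stand-in that checks whether every number in the claim appears in the paper text. The PAPERS dictionary, its placeholder IDs, and its text are invented for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Toy corpus standing in for full paper text (invented for illustration).
PAPERS = {
    "PAPER_A": "The score fell by 1.8 points versus placebo.",
    "PAPER_B": "Events occurred in 2.1% of the treatment arm.",
}

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def verify(claim):
    """Stand-in for the LLM verifier: flag numbers absent from the source."""
    paper_id, text = claim
    source_numbers = set(NUMBER.findall(PAPERS.get(paper_id, "")))
    missing = [n for n in NUMBER.findall(text) if n not in source_numbers]
    return paper_id, not missing, missing


def commit(claims):
    """Verify every claim concurrently; return the supported ones and all results."""
    with ThreadPoolExecutor(max_workers=max(1, len(claims))) as pool:
        results = list(pool.map(verify, claims))  # map preserves input order
    supported = [claim for claim, (_, ok, _) in zip(claims, results) if ok]
    return supported, results


claims = [
    ("PAPER_A", "Score fell by 1.8 points"),
    ("PAPER_B", "Events occurred in 3.2% of patients"),
]
supported, results = commit(claims)
for paper_id, ok, missing in results:
    print("✓" if ok else "✗", paper_id, "" if ok else f"not in source: {missing}")
```

One task per claim keeps a slow check on one paper from blocking the others, which is presumably why the checks run in parallel.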

3. Branch and merge

Repos start on main. Agents can branch to pursue parallel lines of evidence without polluting the primary collection:

paperclip repo branch safety-signals
# ... add papers, commit, verify ...
paperclip repo checkout main
paperclip repo merge safety-signals

This is the direct analog of how a researcher might pursue a side question (e.g., “does this drug have hepatotoxicity signals?”) and fold findings back into the main review.
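As a rough mental model, a branch can be thought of as a named copy of the current set of paper pointers and claims, and a merge as a union back into the checked-out branch. The Repo class below is a hypothetical sketch along those lines. It says nothing about how Paperclip stores branches or resolves conflicts.

```python
class Repo:
    """Hypothetical model: each branch maps paper IDs to a list of claims."""

    def __init__(self):
        self.branches = {"main": {}}
        self.head = "main"

    def branch(self, name):
        if name in self.branches:
            raise ValueError(f"branch exists: {name}")
        # Copy pointers and claims only; no paper content is duplicated.
        current = self.branches[self.head]
        self.branches[name] = {pid: list(cl) for pid, cl in current.items()}

    def checkout(self, name):
        if name not in self.branches:
            raise KeyError(name)
        self.head = name

    def add(self, paper_id, claim):
        self.branches[self.head].setdefault(paper_id, []).append(claim)

    def merge(self, name):
        """Union the named branch into the checked-out branch, skipping duplicates."""
        target = self.branches[self.head]
        for paper_id, claims in self.branches[name].items():
            existing = target.setdefault(paper_id, [])
            for claim in claims:
                if claim not in existing:
                    existing.append(claim)


repo = Repo()
repo.add("PAPER_1", "primary efficacy claim")
repo.branch("safety-signals")
repo.checkout("safety-signals")
repo.add("PAPER_2", "hepatotoxicity claim")
repo.checkout("main")
before_merge = dict(repo.branches["main"])
repo.merge("safety-signals")
```

Until the merge, main holds only the primary claim, so the side question cannot leak into the main review.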

4. Import, export, reproduce

Repos integrate with existing reference management workflows:

Scenario 1: Curate using commits

Verification keeps individual citations honest. But the same workspace also changes what happens before a paper is ever cited: screening. A search returns dozens of candidates, and an agent that holds them all in context tends to lean on all of them. An agent working in a repo instead decides what belongs — and because every add, remove, and commit is recorded, the screening decisions become part of the corpus rather than a step that dissolves back into the context window.

Screening a search into a committed corpus
Paperclip
$ paperclip search "GLP-1 receptor agonists in pregnancy"
  28 candidates
$ paperclip repo add PMC10714281 "malformation aRR 0.95 (0.72–1.26)" --lines L62
$ paperclip repo add PMC11043712 "major defects 2.6% vs 2.3%" --lines L28
  …17 added with claims…
$ paperclip repo remove 4402004380 # predatory journal
$ paperclip repo commit -m "screened 28→17: off-topic, low-impact, dup"
  ✓ committed · 17 papers
[Legend: committed · off-topic · low impact · duplicate]
All 20 candidates sit in one pool; the agent keeps 15 — looped by the dotted lasso — and leaves 5 out, each colour-coded by why: off-topic, low-impact, or duplicate. Those decisions live in the commit message and paperclip repo history, so the corpus is the product of choices, not a raw dump. (Counts illustrative.)

A reader sees only the finished review, but the repo carries the reasoning behind it: what was considered, what was kept, and why the rest was set aside. That audit trail is what turns a one-off answer into an artifact another agent — or another person — can pick up, trust, and extend.
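One way to picture recorded screening decisions is as a list of keep/exclude entries with reasons, from which a commit message like the one above can be derived. The sketch below is hypothetical: the tuple layout and the screening_summary helper are illustrative, not part of Paperclip.

```python
from collections import Counter


def screening_summary(decisions):
    """Summarize (paper_id, kept, reason) decisions into a commit-style message."""
    kept = sum(1 for _, keep, _ in decisions if keep)
    reasons = Counter(reason for _, keep, reason in decisions if not keep)
    headline = f"screened {len(decisions)}→{kept}"
    if not reasons:
        return headline
    detail = ", ".join(f"{reason} ({n})" for reason, n in sorted(reasons.items()))
    return f"{headline}: {detail}"


decisions = [
    ("PAPER_1", True, ""),
    ("PAPER_2", False, "off-topic"),
    ("PAPER_3", False, "duplicate"),
    ("PAPER_4", False, "off-topic"),
]
print(screening_summary(decisions))
```

Keeping the reason next to each decision is what lets the corpus explain itself later: the excluded papers remain visible along with why they were set aside.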

We’ve all been there. You’re exploring a research topic and you have a hundred browser tabs open, each covering a different sub-topic you’ve found during your literature search. You have a vague mental map of how they all connect, which disappears by the next morning. Luckily, agents can now natively organize the inherently branching nature of research by using branch to store these sub-topics in neat and observable ways. They can also use merge to group branches of research that are connected.

One question, three outcomes, three branches
Paperclip
$ paperclip repo branch incidence
$ paperclip repo checkout incidence
$ paperclip repo add PMC12673470 "incident AUD HR 0.68 (0.52–0.89)" --lines L14
$ paperclip repo commit -m "AUD incidence cohorts"
$ paperclip repo branch consumption # parallel threads
$ paperclip repo branch craving
$ paperclip repo checkout main && paperclip repo merge incidence
  ✓ merged 5 papers into main
[Diagram: add · commit on main (HEAD), branch ×3 (incidence, consumption, craving), then merge.]
Each outcome gets a clean workspace; merge reassembles the full review. The same machinery handles reviewer disagreements or “strict vs. inclusive” screening — each is just another branch.

Many deep research agents are opaque by design: it’s unclear why they chose the papers they did, and why they did not consider others. With Paperclip Repos, you can have your agent keep a trail of the steps it took to reach its answer, and inspect that trail with paperclip repo history.

The full construction trail of a repo
Paperclip
$ paperclip repo history
19:21  search   “semaglutide pregnancy outcomes”
19:22  add      PMC10714281  + claim (L62)
19:22  add      PMC11043712  + claim (L28)
19:23  remove   4402004380 # predatory journal
19:23  commit   “final verified corpus” · 44 papers
19:24  branch   malformations
Every step is logged and reproducible. An agent can hand its half-finished workspace to another, which reads the history and resumes exactly where the first left off.

This is what makes a repo a durable artifact rather than a transcript. The work is observable while it happens, auditable after the fact, and resumable by anyone with access — the same properties that make version-controlled code trustworthy, now applied to a body of evidence.
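Resumability follows naturally if the workspace state can be rebuilt from the ordered history alone. The sketch below assumes a simplified event log of (operation, argument) pairs modeled on the history output shown earlier. The replay function and the state layout are hypothetical.

```python
def replay(history):
    """Rebuild workspace state by applying logged events in order."""
    state = {"searches": [], "staged": [], "commits": [], "branches": ["main"]}
    for op, arg in history:
        if op == "search":
            state["searches"].append(arg)
        elif op == "add":
            state["staged"].append(arg)
        elif op == "remove":
            state["staged"] = [p for p in state["staged"] if p != arg]
        elif op == "commit":
            # Snapshot the papers that were staged when the commit happened.
            state["commits"].append((arg, tuple(state["staged"])))
        elif op == "branch":
            state["branches"].append(arg)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


history = [
    ("search", "semaglutide pregnancy outcomes"),
    ("add", "PAPER_1"),
    ("add", "PAPER_2"),
    ("remove", "PAPER_2"),
    ("commit", "final verified corpus"),
    ("branch", "malformations"),
]
state = replay(history)
```

A second agent that receives only the history can call replay and arrive at the same state as the first, which is the property that makes a hand-off possible.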

Systematic analyses

It’s fun to think through what the repo abstraction allows you to do. A natural use-case is building systematic analyses around a curated set of papers:

The question
In adults, do GLP-1 receptor agonists (semaglutide, liraglutide, exenatide, tirzepatide) reduce alcohol consumption and lower the risk of alcohol use disorder?

The whole analysis is just a stack of repo operations, run as three phases:

Phase 1 · Build the corpus
search → screen → add → commit
An agent using only the base paperclip toolkit assembled the evidence — RCTs, target-trial emulations, registries, pharmacovigilance, and preclinical work — adding each paper with a claim and line citation, then committing.
69 papers · 48 verified · SHA 9aad4afa
Phase 2 · Write the review
synthesize, repo-only
A second agent, constrained to that repo with no new searches, read the corpus and wrote a review organized by outcome — incidence, relapse, consumption, craving, mechanism — every number citing a paper and line.
523-line review · fully cited
Phase 3 · Pool the estimates
reproducible analysis.py
A small script — stored alongside the review and reading only repo-derived numbers — pooled the AUD-event estimates from the cohort studies with a random-effects model.
2 analyses · RR 0.56–0.60

A meta-analysis often depends on several assumptions and the goals of the reader. Rather than create a one-off report, with repos we can dynamically generate multiple analyses depending on the question at hand. Below, the same glp1-alcohol corpus is pooled two different ways — each with its own inclusion/exclusion criteria:

Two analyses from one repo — alcohol-use-disorder risk (GLP-1 RA vs. comparator)
[Forest plot; x-axis ticks at 0.25, 0.5, 1, 1.5; left = lower AUD risk]
Analysis A · incident AUD diagnoses
include new AUD diagnoses · exclude hospitalization / intoxication
Wang 2024, semaglutide: 0.50 (0.39–0.63)
Henney 2025, semaglutide: 0.68 (0.52–0.89)
Henney 2025, tirzepatide: 0.47 (0.29–0.75)
Pooled A (random effects): 0.56 (0.44–0.70)
Analysis B · semaglutide only
include semaglutide, any AUD event · exclude tirzepatide & class-level
Wang 2024, incident AUD: 0.50 (0.39–0.63)
Henney 2025, semaglutide: 0.68 (0.52–0.89)
Lähteenvuo 2025, AUD hosp.: 0.64 (0.50–0.83)
Pooled B (random effects): 0.60 (0.50–0.72)
Two analyses pooled from the same glp1-alcohol repo under different inclusion criteria. A (incident AUD only) → RR 0.56 (0.44–0.70), I²=41%; B (semaglutide only, any AUD event) → RR 0.60 (0.50–0.72), I²=39%. Different slices, consistent direction; every value is a verified claim in the repo.
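For readers who want to see what the pooling step involves, here is a minimal sketch of random-effects pooling for Analysis A. The post does not say which estimator analysis.py uses, so the DerSimonian-Laird method below is an assumption, as is recovering each standard error from the width of the reported 95% interval on the log scale. The function name pool_random_effects is hypothetical.

```python
import math


def pool_random_effects(estimates, z=1.96):
    """DerSimonian-Laird pooling of ratios given as (estimate, ci_low, ci_high)."""
    y = [math.log(rr) for rr, _, _ in estimates]
    # Recover each variance from the CI width on the log scale.
    v = [((math.log(hi) - math.log(lo)) / (2 * z)) ** 2 for _, lo, hi in estimates]
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return math.exp(mu), math.exp(mu - z * se), math.exp(mu + z * se), i2


# Study-level values from Analysis A above.
analysis_a = [(0.50, 0.39, 0.63), (0.68, 0.52, 0.89), (0.47, 0.29, 0.75)]
rr, lo, hi, i2 = pool_random_effects(analysis_a)
print(f"RR {rr:.2f} ({lo:.2f}-{hi:.2f}), I2 = {i2:.0f}%")
```

Under these assumptions the sketch lands on roughly RR 0.56 (0.44–0.70) with I² near 41%, in line with the pooled values reported for Analysis A. Swapping in a different list of estimates is all it takes to pool another slice of the same repo.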

Benchmark: replicating 20 published meta-analyses

How accurate are these generated analyses? We benchmarked the pipeline against 20 recent meta-analyses from Lancet-family journals. For each, we hand-extracted the ground truth — every reported effect size, confidence interval, direction, and significance call — then ran the same replication task three ways with the same agent, prompt, and model, varying only the search tool: Paperclip (using just search and cat), web search (PubMed / Google Scholar via WebSearch + WebFetch), and the Elicit API (abstracts only, no full text). The replication is second-order: the agent never reads the original paper — it searches for other published meta-analyses on the same topic and checks whether those independent estimates agree.

Each outcome group is scored on strict concordance: the replication must match the ground truth on both the direction of the effect (same side of 1.0 for ratios, of 0 for differences) and statistical significance (whether the 95% CI excludes the null). A paper’s score is the fraction of its outcome groups that pass both checks; the overall score averages across all 20 papers equally.
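The scoring rule can be written down directly. The sketch below follows the definition in the paragraph above. The function names, the tuple layout, and the example numbers are illustrative assumptions, not the benchmark's actual code or data.

```python
def direction(estimate, null):
    """Return -1, 0 or 1 for an estimate below, at, or above the null value."""
    return (estimate > null) - (estimate < null)


def significant(ci_low, ci_high, null):
    """An effect is significant when the 95% CI excludes the null value."""
    return not (ci_low <= null <= ci_high)


def concordant(truth, replication, null):
    """Strict concordance: same direction and same significance call.

    truth and replication are (estimate, ci_low, ci_high);
    null is 1.0 for ratios and 0.0 for differences.
    """
    same_direction = direction(truth[0], null) == direction(replication[0], null)
    same_significance = (
        significant(truth[1], truth[2], null)
        == significant(replication[1], replication[2], null)
    )
    return same_direction and same_significance


def paper_score(groups):
    """Fraction of a paper's outcome groups that pass both checks."""
    return sum(concordant(*group) for group in groups) / len(groups)


def overall_score(papers):
    """Unweighted mean of per-paper scores, so each paper counts equally."""
    return sum(paper_score(groups) for groups in papers) / len(papers)


# Made-up example: two papers, one with two ratio outcomes, one with a difference.
paper_1 = [
    ((0.56, 0.44, 0.70), (0.60, 0.50, 0.72), 1.0),  # passes both checks
    ((0.90, 0.70, 1.10), (0.80, 0.70, 0.95), 1.0),  # significance differs
]
paper_2 = [((-1.2, -2.0, -0.4), (-0.8, -1.5, -0.1), 0.0)]
print(overall_score([paper_1, paper_2]))
```

Averaging per paper rather than per outcome group means a meta-analysis with many outcomes does not dominate the overall score.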

Strict concordance vs. 20 published Lancet meta-analyses (direction + significance)
90.3%
Paperclip
(search + cat)
72.7%
WebSearch
(PubMed / Scholar)
71.7%
Elicit API
(abstracts only)
Paperclip matched both direction and significance on 90.3% of outcome groups, versus ~72% for web search and Elicit — an 18-percentage-point gap, with the same agent, prompt, and model throughout.

The gap traces to full-text access. Headline numbers live in abstracts, but the granular outcome-level data — subgroup analyses, sensitivity checks, secondary endpoints — lives in tables, forest plots, and supplements. Elicit is limited to abstracts; web search routinely matched the wrong row of a table. Paperclip reads the full text and cites the exact line, so it recovers the outcome-level detail the others miss — the same full-text-plus-verification advantage that drives the citation-support results below.

Reducing Hallucinations

A repo turns every citation into a retrieve-and-verify step: a claim only survives commit if its supporting text physically exists in a paper the agent actually opened. In practice, that sharply cuts the fabricated citations an agent would otherwise produce.

To measure how well this works, we used Claude Code (running Claude Sonnet 4.6) to answer nine demanding biomedical literature-review questions, each requiring exact quantitative outcomes (response rates, hazard ratios, confidence intervals, editing efficiencies) from at least 8 to 12 distinct papers. We ran three conditions, and in each one an LLM judge checked whether every citation was actually backed up by the source it cited:

Share of citations fully supported by the cited source
45%
Web baseline
(Claude + search)
52%
Without repos
(Paperclip)
94%
With repos
(Paperclip)
We evaluated Claude Code using Paperclip with repos, and compared its citations against two baselines: Paperclip only, and Claude Code with web search only. Across nine biomedical research questions judged against full paper text (Claude Sonnet 4.6), using repos reduced the rate of unsupported citations in the final response by about 90%.

With repos, 94% of citations are fully supported by the source they point to, versus 52% for raw Paperclip search and 45% for the web baseline. Equivalently, the rate of citations the judge could not confirm drops roughly 8× versus raw search and ~9× versus the web baseline.
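The fold-change figures follow from the complements of the three percentages. The short calculation below uses only the rounded values quoted above, so it can differ slightly from figures computed on the raw citation counts, which this post does not give.

```python
# Share of citations fully supported, in percent (from the chart above).
supported = {"web baseline": 45, "without repos": 52, "with repos": 94}
unsupported = {name: 100 - pct for name, pct in supported.items()}

summary = {}
for baseline in ("without repos", "web baseline"):
    fold = unsupported[baseline] / unsupported["with repos"]
    reduction = 1 - unsupported["with repos"] / unsupported[baseline]
    summary[baseline] = (fold, reduction)
    print(f"{baseline}: {unsupported[baseline]}% unsupported, "
          f"{fold:.1f}x the with-repos rate, {reduction:.1%} relative reduction")
```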

Ways repos can fix hallucinations

Across the nine questions, the without-repos failures fall into three recurring patterns. Each is invisible to a reader, since the citation points at a real, relevant paper, and each is structurally prevented by commit-time verification. Every example below is a real output from the same Claude Sonnet 4.6 model; the only variable is whether the citation was checked against source text before it was written.

1. Right paper, fabricated number

A question on in-vivo CRISPR delivery efficiencies put both modes head-to-head: same model, same prompt, six of the same papers. Without repos the agent invented the headline number in five of the six; with repos it reported every one exactly as published.

Without repos: from memory
Chen 2024 (LNP)
89.9% liver, 84.9% lung
Banskota 2022 (eVLP)
20% liver base editing
Rathbone 2022 (hepatocytes)
70–90% editing
Lee 2017 (HDR)
18% HDR correction
An 2024 (prime-edit eVLP)
25% liver hepatocytes
vs
With repos: verified vs source
Chen 2024 (LNP)
37% liver, 16% lung
Banskota 2022 (eVLP)
63% liver, 78% Pcsk9 ↓
Rathbone 2022 (hepatocytes)
52.4% indels (human)
Lee 2017 (HDR)
5.4% dystrophin HDR (mdx)
An 2024 (prime-edit eVLP)
7.2% Rpe65 (retinal, no liver)

Without repos: 18 of 21 citations wrong. With repos: 0 of 8. The agent reliably finds the right paper; it just invents the number once it gets there, reporting an “18-fold increase” as “18%,” or 71% editing where the paper says 2.4%.

Here is what commit caught on the same question — the exosome delivery claim passed superficially but had the wrong framing:

Paperclip
$ paperclip repo commit -m "Verify CRISPR delivery data"
  ✓ PMC11228131  PE-eVLPs: 15% editing (rd6), 7.2% (rd12)
  ✗ PMC9473578   “Off-target effects observed only in liver, not other organs”
               → paper: reports on-target restriction to liver; no off-target effects mentioned
  ✓ PMC5968829  CRISPR-Gold: 5.4% dystrophin HDR in mdx mice
  ✓ (4 more supported)
$ paperclip repo remove PMC9473578
$ paperclip repo add PMC9473578 "…editing targeted to liver, not heart/spleen/lung/kidney" --lines L33
$ paperclip repo commit -m "Fix exosome targeting claim"
  ✓ PMC9473578  supported — “indel frequency observed only in the liver” (L33)

2. Invented precision

The subtler failure bolts fabricated statistics (confidence intervals, hazard ratios, drug ratios) onto real efficacy numbers. That invented precision is exactly what makes a citation look authoritative, and exactly what commit rejects, because the interval simply isn’t in the text.

Without repos: reported
Mirvetuximab (SORAYA)
ORR 24% (95% CI 15.9–33.3)
T-DM1
HR + “95% CI 0.55–0.77”; DAR “~3.5”
Gemtuzumab ozogamicin
CR 84% vs 81%
vs
What the paper says
Mirvetuximab (SORAYA)
32.4% (95% CI 23.6–42.2)
T-DM1
HR only, no CI, no DAR
Gemtuzumab ozogamicin
70.4% vs 69.9%

The precision reads as rigor; only verification against the source catches a confidence interval that was never published. Here the agent fabricated the payload specification — commit caught that “MMAE payload” appears nowhere in the polatuzumab paper:

Paperclip
$ paperclip repo commit -m "ADC pivotal trial data verification"
  ✓ PMC10150846  SORAYA: mirvetuximab ORR 32.4% (95% CI 23.6–42.2)
  ✗ PMC12680271  “polatuzumab: CD79b + MMAE payload, PFS HR 0.77 (CI 0.62–0.97)”
                → paper confirms HR 0.77 but does not mention “MMAE payload”
  ✓ PMC11107894  ASCENT: sacituzumab, PFS HR 0.41 (CI 0.33–0.52)
$ paperclip repo remove PMC12680271
$ paperclip repo add PMC12680271 "POLARIX: polatuzumab (CD79b ADC), PFS HR 0.77 (CI 0.62–0.97), 64.9% vs 59.1%" --lines L36,L39
$ paperclip repo commit -m "Fix polatuzumab payload claim"
  ✓ PMC12680271  supported

3. Citing papers it never read

The most insidious failure cites a paper the agent never opened. For the gut-microbiome question it read a single review (Peters et al. 2019), then cited the landmark Science papers from that review’s introduction directly, with cohort sizes, hazard ratios, and p-values lifted from the review’s secondary descriptions rather than the primary text.

Without repos: cited anyway
Gopalakrishnan 2018
“Referenced in Peters et al. as…”, never opened
Routy 2018
“Referenced in Peters et al. as…”, never opened
Matson 2018
lookup → “No documents found”, absent from corpus
vs
With repos: structurally blocked
add doi …aao3290
fails: paper not in corpus
commit
claim never commits
result
cite a paper it opened, or say nothing

With repos the agent tried to find all three papers and failed every time:

Paperclip
$ paperclip lookup author "Routy Bertrand" -n 10
  No documents found.
$ paperclip search -s pmc "Gopalakrishnan melanoma bifidobacterium" -n 10
  → returns Peters et al. 2019 (PMC6785875) — Gopalakrishnan 2018 not in corpus
$ paperclip search -s pmc "Matson commensal microbiome anti-PD-1 melanoma" -n 5
  → returns Peters et al. 2019 — Matson 2018 not in corpus
  # these papers can’t be added; they can’t be cited
$ paperclip repo add PMC6785875 "Ruminococcaceae enrichment associated with ICI response" --lines L12,L18
$ paperclip repo commit -m "Verify microbiome-ICI claims"
  ✓ PMC6785875  supported — cites what was actually opened

In the without-repos run, paperclip lookup doi 10.1126/science.aao3290 returned “No documents found”. The agent confirmed the paper wasn’t in the corpus and cited it anyway. With repos this is impossible: add fails for a paper that isn’t there, so the claim never commits. The agent must cite something it actually opened, or say nothing.

The common root cause

Every one of these failure modes has the same origin: the agent without repos generates citations instead of retrieving them. The repo workflow turns each citation into a retrieve-and-verify operation: a claim can only survive commit if the supporting text physically exists in a paper the agent opened. A paper that isn’t in the corpus can’t be added, so it can’t be cited; a number that isn’t in the text fails verification and gets corrected or dropped. That is why the residual ~6% of with-repos errors are subtle (a wrong CI bound, a result merged across two experiments) rather than wholesale fabrications.
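The two gates described above can be reduced to a small sketch: add refuses a paper that is not in the corpus, and commit keeps a claim only if its supporting text is found in the paper. Everything here is a hypothetical stand-in: the CORPUS dictionary and its text are invented, and a case-insensitive substring match replaces the LLM verifier the post describes.

```python
class NotInCorpus(Exception):
    """Raised when a paper cannot be added because it was never retrieved."""


# Toy corpus: paper ID -> full text (invented placeholder text).
CORPUS = {"PAPER_A": "Indel frequency was observed only in the liver."}


def add(staged, paper_id, claim, quote):
    """Gate 1: a paper that is not in the corpus cannot be added."""
    if paper_id not in CORPUS:
        raise NotInCorpus(paper_id)
    staged.append((paper_id, claim, quote))


def commit(staged):
    """Gate 2: keep only claims whose supporting quote appears in the paper."""
    return [
        (paper_id, claim)
        for paper_id, claim, quote in staged
        if quote.lower() in CORPUS[paper_id].lower()
    ]


staged = []
add(staged, "PAPER_A", "Editing restricted to liver", "observed only in the liver")
add(staged, "PAPER_A", "Off-target effects seen in liver", "off-target effects")
try:
    add(staged, "PAPER_MISSING", "Cohort of 112 patients", "112 patients")
    blocked = False
except NotInCorpus:
    blocked = True  # the citation never enters the repo
citable = commit(staged)
```

Only the first claim survives: the second has no supporting text in the paper, and the third never gets staged at all.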

Try Paperclip Repos

The skills to use Paperclip Repos have been added to the newest version of Paperclip. Update in one command:

Paperclip
$ paperclip update
  ✓ paperclip updated to the latest version
  ✓ repo skills installed

Then, to get the agent to produce a repo of your session, just prompt it — for example:

Prompt
“Using a paperclip repo, summarize the evidence on whether GLP-1 receptor agonists reduce alcohol use.”

The agent will build a curated, versioned repo as it works — searching, screening, adding verified claims, and committing — so the answer it returns is backed by a corpus you can open, audit, fork, and extend.