Paperclip Repos: Enabling Git for Literature Search
We enable agents to produce a system of record for their research through Paperclip Repos, a git abstraction on top of our virtual filesystem. We find that repos are great for reducing hallucinations and make large-scale literature synthesis much easier. These updates are already default in Paperclip — just run paperclip update. See the full documentation for details.
[Figure: … commit (unsupported ones never make it in), and isolated on branches before merging back into the main review.]
Research in the agentic era often falls short of its etymology: we should be able to search, and then re-search over a corpus, iterate, and find new connections within a set of papers. Instead, we often get a single end-all, be-all response that can’t be reproduced outside the conversation history. The steps that lead to these responses are opaque (or worse, non-existent) because the underlying architecture is not built to be observable.
Sometimes we want agents to only search within a particular subset of trusted documents rather than millions. As such, agents should also be able to apply specific content filters to curate the corpus: high-impact journals, newer articles, methods papers, etc.
Luckily, this framework has already been built and refined for the programming world through git. The git abstraction allows agents to record the steps they take to build out complex repos. And best of all, agents already speak git out of the box. But if git for literature search is so intuitive, why hasn’t this been built yet?
An obvious reason is that it is a heavy lift to add, store, and move full papers across git. Each paper PDF contains megabytes of multimodal content, making paper repos potentially gigabytes in storage. Then, there is the issue of processing each new paper: parsing, indexing, and generating metadata for future searches. These hurdles are technically feasible for a dedicated researcher, but enough to discourage large-scale adoption.
Paperclip Repos aim to solve this problem by using our virtual filesystem. Rather than move around clunky PDFs, we simply add pointers to a paper-as-a-filesystem. This way, git primitives (e.g., add, commit, branch, merge) take only milliseconds and carry all the traceability built into our Paperclip system.
How it works
A repo is simply a folder with a system of record. In our context, an agent can use a paper repo to add papers and associated claims, commit these claims (and check if they are indeed supported by the evidence), as well as branch and merge to allow for exploration within a subject. Below, we outline these basic actions:
Note: These commands are part of your agent’s Paperclip skills — rather than run them yourself, you can just tell your agent to do the steps (e.g. “Create a repo with the papers used in this analysis”).
1. Add claims from documents
When the agent decides to include a paper in a repo, it uses add, just as it would for a file in git. Agents can also annotate the paper with specific claims and the line numbers those claims came from.
paperclip repo add PMC7194329 "Semaglutide reduced HbA1c by 1.8% vs placebo" --lines L45-L52
paperclip repo add PMC8834120 "Cardiac events occurred in 3.2% of the treatment arm"
Each add is append-only. A single paper can carry multiple claims, and claims can be refined by removing and re-adding.
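To make "append-only" concrete, here is a minimal sketch of how staged claims could be modeled. The Claim and StagingArea names are hypothetical and are not Paperclip's actual storage format; the point is only that a removal is recorded as a new log entry rather than an edit to an old one, and that the current set of claims is recovered by replaying the log.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Claim:
    """One staged claim: a pointer to a paper plus the asserted text."""
    paper_id: str                            # e.g. "PMC7194329"
    text: str                                # the claim as the agent wrote it
    lines: Optional[Tuple[int, int]] = None  # inclusive line span, if given


class StagingArea:
    """Hypothetical append-only staging log (not Paperclip's real storage)."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, Claim]] = []  # ("add" | "remove", claim)

    def add(self, paper_id: str, text: str, lines=None) -> Claim:
        claim = Claim(paper_id, text, lines)
        self._log.append(("add", claim))
        return claim

    def remove(self, claim: Claim) -> None:
        # A removal is a new entry; earlier entries are never rewritten.
        self._log.append(("remove", claim))

    def staged(self) -> List[Claim]:
        """Replay the log to get the claims that are currently staged."""
        current: List[Claim] = []
        for op, claim in self._log:
            if op == "add":
                current.append(claim)
            elif claim in current:
                current.remove(claim)
        return current


stage = StagingArea()
first = stage.add("PMC7194329", "Semaglutide reduced HbA1c by 1.8% vs placebo", (45, 52))
stage.add("PMC8834120", "Cardiac events occurred in 3.2% of the treatment arm")

# Refine a claim by removing it and re-adding it; the narrower line span
# (45, 50) is an illustrative value only.
stage.remove(first)
stage.add("PMC7194329", "Semaglutide reduced HbA1c by 1.8% vs placebo", (45, 50))
```

After these four operations the log holds four entries, while staged() returns only the two claims that are still live.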
2. Commit and verify
When the agent is ready to cite, it commits. Commit also runs a citation check against the source document:
paperclip repo commit -m "cardiovascular outcomes review"
Under the hood, each claim is dispatched to its own LLM verifier, which reads the full text of the underlying paper and checks the claim against it; all claims are verified in parallel. Verification produces a per-claim result: supported or not supported, with evidence. Agents can then correct unsupported claims by re-adding them. This also creates a work trail that lets another agent pick up where this one left off and understand its progress.
The agent uses verified claims to generate its final response. Unverified claims get corrected or dropped before the user ever sees them.
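The fan-out described above, one verifier per claim, all running concurrently, can be sketched as follows. This is an illustration under stated assumptions, not Paperclip's implementation: verify_claim is a stand-in that uses a verbatim substring match where the real system uses an LLM, and the paper text and IDs are toy placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, NamedTuple, Optional, Tuple


class Verdict(NamedTuple):
    claim: str
    supported: bool
    evidence: Optional[str]  # the matching source line, if any


def verify_claim(claim: str, paper_text: str) -> Verdict:
    """Stand-in verifier. The real check is an LLM reading the full text;
    here a claim is supported only if a source line contains it verbatim
    (case-insensitive), which is far cruder."""
    for line in paper_text.splitlines():
        if claim.lower() in line.lower():
            return Verdict(claim, True, line.strip())
    return Verdict(claim, False, None)


def commit(claims: List[Tuple[str, str]], papers: Dict[str, str]) -> List[Verdict]:
    """Verify every (paper_id, claim) pair concurrently, one task per claim."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(verify_claim, text, papers[pid]) for pid, text in claims]
        # Collect in submission order so results line up with the input.
        return [f.result() for f in futures]


# Toy corpus and claims, made up for illustration.
papers = {"paper-A": "Methods.\nThe treatment reduced the outcome by 10%.\nDiscussion."}
claims = [
    ("paper-A", "reduced the outcome by 10%"),
    ("paper-A", "reduced the outcome by 25%"),
]
results = commit(claims, papers)
supported = [v.claim for v in results if v.supported]
```

Only the entries in supported would go on to the final response; the rest are candidates for correction or removal.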
3. Branch and merge
Repos start on main. Agents can branch to pursue parallel lines of evidence without polluting the primary collection:
paperclip repo branch safety-signals
# ... add papers, commit, verify ...
paperclip repo checkout main
paperclip repo merge safety-signals
This is the direct analog of how a researcher might pursue a side question (e.g., “does this drug have hepatotoxicity signals?”) and fold findings back into the main review.
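Because a repo holds pointers to papers rather than the papers themselves, branching and merging can be thought of as operations on small sets of IDs. The sketch below is a hypothetical model of that idea (PointerRepo is not a Paperclip class), and it assumes, as the command sequence above suggests, that creating a branch also switches to it.

```python
from typing import Dict, Set


class PointerRepo:
    """Hypothetical model: a branch is just a named set of paper IDs."""

    def __init__(self) -> None:
        self.branches: Dict[str, Set[str]] = {"main": set()}
        self.current = "main"

    def add(self, paper_id: str) -> None:
        self.branches[self.current].add(paper_id)

    def branch(self, name: str) -> None:
        # Assumption: branching copies the current pointers and switches
        # to the new branch, mirroring the command sequence in the post.
        self.branches[name] = set(self.branches[self.current])
        self.current = name

    def checkout(self, name: str) -> None:
        if name not in self.branches:
            raise KeyError(f"unknown branch: {name}")
        self.current = name

    def merge(self, name: str) -> None:
        # Merging pointers is a set union; no paper content is moved.
        self.branches[self.current] |= self.branches[name]


# Paper IDs are reused from the earlier examples purely as placeholders.
repo = PointerRepo()
repo.add("PMC7194329")
repo.branch("safety-signals")
repo.add("PMC8834120")
repo.checkout("main")
main_before_merge = set(repo.branches["main"])
repo.merge("safety-signals")
```

Until the merge, main is unaffected by anything added on safety-signals, which is the isolation property the branch workflow relies on.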
4. Import, export, reproduce
Repos integrate with existing reference management workflows:
- Import from .bib or .ris files to bootstrap from an existing library.
- Export to BibTeX, RIS, CSV, or Markdown for downstream use in manuscripts or other tools.
- Reproduce any research workflow with paperclip repo history, which logs every search, map, and filter the agent performed, making the full chain of reasoning auditable and repeatable.
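One way to picture a history that is both auditable and repeatable is an ordered log of operations that can be serialized and replayed. The History class, its field names, and the handler functions below are all hypothetical; the actual format of paperclip repo history is not described here.

```python
import json
from typing import Any, Callable, Dict, List


class History:
    """Hypothetical append-only log of the operations an agent performed."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def record(self, op: str, **args: Any) -> None:
        self.entries.append({"step": len(self.entries) + 1, "op": op, "args": args})

    def dump(self) -> str:
        """Serialize as JSON Lines so the trail can be stored or shared."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.entries)

    def replay(self, handlers: Dict[str, Callable[..., Any]]) -> List[Any]:
        """Re-run every logged step, in order, through the given handlers."""
        return [handlers[e["op"]](**e["args"]) for e in self.entries]


history = History()
history.record("search", query="GLP-1 alcohol use disorder")  # illustrative query
history.record("filter", min_year=2020)                       # illustrative filter

# Stub handlers stand in for real search and filter operations.
handlers = {
    "search": lambda query: f"search:{query}",
    "filter": lambda min_year: f"filter:>={min_year}",
}
replayed = history.replay(handlers)
```

Auditing means reading the dumped log; reproducing means replaying it against real handlers instead of the stubs.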
Scenario 1: Curate using commits
Verification keeps individual citations honest. But the same workspace also changes what happens before a paper is ever cited: screening. A search returns dozens of candidates, and an agent that holds them all in context tends to lean on all of them. An agent working in a repo instead decides what belongs — and because every add, remove, and commit is recorded, the screening decisions become part of the corpus rather than a step that dissolves back into the context window.
[Figure: … paperclip repo history, so the corpus is the product of choices, not a raw dump. (Counts illustrative.)]
A reader sees only the finished review, but the repo carries the reasoning behind it: what was considered, what was kept, and why the rest was set aside. That audit trail is what turns a one-off answer into an artifact another agent, or another person, can pick up, trust, and extend.
We’ve all been there. You’re exploring a research topic and you have a hundred browser tabs open, each covering a different sub-topic you’ve found during your literature search. You have a vague mental map of how they all connect, which disappears by the next morning. Luckily, agents can now natively organize the inherently branching nature of research by using branch to store these sub-topics in neat and observable ways. They can also use merge to group branches of research that are connected.
[Figure: … merge reassembles the full review. The same machinery handles reviewer disagreements or “strict vs. inclusive” screening; each is just another branch.]
Many deep research agents are opaque by design: it’s unclear why they chose the papers they did, or why they did not consider others. With Paperclip Repos, you can have your agent keep a trail of the steps it took to reach its answer, and inspect that trail with paperclip repo history.
This is what makes a repo a durable artifact rather than a transcript. The work is observable while it happens, auditable after the fact, and resumable by anyone with access — the same properties that make version-controlled code trustworthy, now applied to a body of evidence.
Systematic analyses
It’s fun to think through what the repo abstraction allows you to do. A natural use-case is building systematic analyses around a curated set of papers:
The whole analysis is just a stack of repo operations, run as three phases:
[Figure: … paperclip toolkit assembled the evidence (RCTs, target-trial emulations, registries, pharmacovigilance, and preclinical work), adding each paper with a claim and line citation, then committing. Code panel: analysis.py]
A meta-analysis often depends on several assumptions and the goals of the reader. Rather than create a one-off report, with repos we can dynamically generate multiple analyses depending on the question at hand. Below, the same glp1-alcohol corpus is pooled two different ways, each with its own inclusion/exclusion criteria:
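As a rough illustration of how one corpus can yield more than one pooled estimate, the sketch below filters the same list of study records under two inclusion criteria and pools each subset. Several things here are assumptions: the post does not say which pooling model was used, so this uses textbook fixed-effect inverse-variance pooling on the log risk-ratio scale with Higgins' I²; the study records are made-up toy numbers, not the glp1-alcohol corpus; and the function and field names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# Toy study records with made-up numbers; NOT the glp1-alcohol corpus.
STUDIES: List[Dict] = [
    {"id": "s1", "drug": "semaglutide", "outcome": "incident", "rr": 0.50, "lo": 0.35, "hi": 0.71},
    {"id": "s2", "drug": "liraglutide", "outcome": "incident", "rr": 0.62, "lo": 0.45, "hi": 0.85},
    {"id": "s3", "drug": "semaglutide", "outcome": "any", "rr": 0.65, "lo": 0.50, "hi": 0.85},
    {"id": "s4", "drug": "semaglutide", "outcome": "incident", "rr": 0.58, "lo": 0.40, "hi": 0.84},
]


def select(studies: List[Dict], include: Callable[[Dict], bool]) -> List[Dict]:
    """Apply one inclusion criterion to the shared corpus."""
    return [s for s in studies if include(s)]


def pool(studies: List[Dict]) -> Tuple[float, float, float, float]:
    """Fixed-effect inverse-variance pooling of log risk ratios.

    Returns (pooled RR, 95% CI low, 95% CI high, I^2 in percent).
    """
    logs = [math.log(s["rr"]) for s in studies]
    # Standard error recovered from the width of the 95% CI on the log scale.
    ses = [(math.log(s["hi"]) - math.log(s["lo"])) / (2 * 1.96) for s in studies]
    weights = [1 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, logs)) / total
    se_pooled = math.sqrt(1 / total)
    # Cochran's Q and Higgins' I^2 (floored at zero).
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, logs))
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return (
        math.exp(pooled),
        math.exp(pooled - 1.96 * se_pooled),
        math.exp(pooled + 1.96 * se_pooled),
        i2,
    )


subset_a = select(STUDIES, lambda s: s["outcome"] == "incident")  # analysis A
subset_b = select(STUDIES, lambda s: s["drug"] == "semaglutide")  # analysis B
result_a = pool(subset_a)
result_b = pool(subset_b)
```

The corpus is fixed and only the inclusion predicate changes, so each additional analysis is a new filter over the same records rather than a new literature search.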
[Figure: … glp1-alcohol repo under different inclusion criteria. A (incident AUD only) → RR 0.56 (0.44–0.70), I²=41%; B (semaglutide only, any AUD event) → RR 0.60 (0.50–0.72), I²=39%. Different slices, consistent direction; every value is a verified claim in the repo.]
Benchmark: replicating 20 published meta-analyses
How accurate are these generated analyses? We benchmarked the pipeline against 20 recent meta-analyses from Lancet-family journals. For each, we hand-extracted the ground truth — every reported effect size, confidence interval, direction, and significance call — then ran the same replication task three ways with the same agent, prompt, and model, varying only the search tool: Paperclip (using just search and cat), web search (PubMed / Google Scholar via WebSearch + WebFetch), and the Elicit API (abstracts only, no full text). The replication is second-order: the agent never reads the original paper — it searches for other published meta-analyses on the same topic and checks whether those independent estimates agree.
Each outcome group is scored on strict concordance: the replication must match the ground truth on both the direction of the effect (same side of 1.0 for ratios, of 0 for differences) and statistical significance (whether the 95% CI excludes the null). A paper’s score is the fraction of its outcome groups that pass both checks; the overall score averages across all 20 papers equally.
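The scoring rule above can be written down directly. The sketch below is one reading of it, with hypothetical names (Estimate, concordant, paper_score, overall_score) and toy estimates that are not taken from the benchmark.

```python
from typing import List, NamedTuple, Tuple


class Estimate(NamedTuple):
    point: float
    ci_low: float
    ci_high: float
    null: float  # 1.0 for ratios, 0.0 for differences


def direction(e: Estimate) -> int:
    """-1 if below the null, +1 if above, 0 if exactly on it."""
    return (e.point > e.null) - (e.point < e.null)


def significant(e: Estimate) -> bool:
    """True when the 95% CI excludes the null value."""
    return e.null < e.ci_low or e.null > e.ci_high


def concordant(truth: Estimate, replication: Estimate) -> bool:
    """Strict concordance: direction and significance must both match."""
    return (direction(truth) == direction(replication)
            and significant(truth) == significant(replication))


def paper_score(pairs: List[Tuple[Estimate, Estimate]]) -> float:
    """Fraction of a paper's outcome groups that pass both checks."""
    return sum(concordant(t, r) for t, r in pairs) / len(pairs)


def overall_score(papers: List[List[Tuple[Estimate, Estimate]]]) -> float:
    """Unweighted mean of per-paper scores, so every paper counts equally."""
    return sum(paper_score(p) for p in papers) / len(papers)


# Toy (ground truth, replication) pairs, made up for illustration.
match = (Estimate(0.80, 0.70, 0.92, 1.0), Estimate(0.85, 0.74, 0.97, 1.0))
lost_significance = (Estimate(0.90, 0.80, 0.99, 1.0), Estimate(0.93, 0.82, 1.05, 1.0))
flipped_direction = (Estimate(-1.2, -2.0, -0.4, 0.0), Estimate(0.3, -0.5, 1.1, 0.0))

papers = [[match, lost_significance], [flipped_direction], [match]]
score = overall_score(papers)
```

Averaging per-paper fractions, rather than pooling all outcome groups, keeps a paper with many outcome groups from dominating the overall score.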
The gap traces to full-text access. Headline numbers live in abstracts, but the granular outcome-level data — subgroup analyses, sensitivity checks, secondary endpoints — lives in tables, forest plots, and supplements. Elicit is limited to abstracts; web search routinely matched the wrong row of a table. Paperclip reads the full text and cites the exact line, so it recovers the outcome-level detail the others miss — the same full-text-plus-verification advantage that drives the citation-support results below.
Reducing Hallucinations
A repo turns every citation into a retrieve-and-verify step: a claim only survives commit if its supporting text physically exists in a paper the agent actually opened. In practice, that sharply cuts the fabricated citations an agent would otherwise produce.
To measure how well this works, we used Claude Code (running Claude Sonnet 4.6) to answer nine demanding biomedical literature-review questions, each requiring exact quantitative outcomes (response rates, hazard ratios, confidence intervals, editing efficiencies) from at least 8 to 12 distinct papers. We ran three conditions, and in each one an LLM judge checked whether every citation was actually backed up by the source it cited:
- Without repos: the agent searches and reads with Paperclip, then writes citations from working memory.
- With repos: the agent uses the add/commit/verify workflow described above.
- Web baseline: Claude with web search and fetch tools, no Paperclip.
With repos, 94% of citations are fully supported by the source they point to, versus 52% for raw Paperclip search and 45% for the web baseline. Equivalently, the rate of citations the judge could not confirm drops roughly 8× versus raw search and ~9× versus the web baseline.
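The 8× and ~9× figures follow from the three support rates, assuming that every citation not counted as fully supported is one the judge could not confirm (the post does not break this down further):

```python
# Percent of citations fully supported, as reported above.
supported = {"with_repos": 94, "raw_search": 52, "web_baseline": 45}

# Assumption: anything not fully supported counts as "could not confirm".
unconfirmed = {name: 100 - pct for name, pct in supported.items()}

reduction_vs_raw = unconfirmed["raw_search"] / unconfirmed["with_repos"]    # 48 / 6
reduction_vs_web = unconfirmed["web_baseline"] / unconfirmed["with_repos"]  # 55 / 6
```

Here 48 / 6 is exactly 8, and 55 / 6 is about 9.2, which matches the rounded figures quoted above.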
Ways repos can fix hallucinations
Across the nine questions, the without-repos failures fall into three recurring patterns. Each is invisible to a reader, since the citation points at a real, relevant paper, and each is structurally prevented by commit-time verification. Every example below is a real output from the same Claude Sonnet 4.6 model; the only variable is whether the citation was checked against source text before it was written.
1. Right paper, fabricated number
A question on in-vivo CRISPR delivery efficiencies put both modes head-to-head: same model, same prompt, six of the same papers. Without repos the agent invented the headline number in five of the six; with repos it reported every one exactly as published.
Without repos: 18 of 21 citations wrong. With repos: 0 of 8. The agent reliably finds the right paper; it just invents the number once it gets there, reporting an “18-fold increase” as “18%,” or 71% editing where the paper says 2.4%.
Here is what commit caught on the same question — the exosome delivery claim passed superficially but had the wrong framing:
2. Invented precision
The subtler failure bolts fabricated statistics (confidence intervals, hazard ratios, drug ratios) onto real efficacy numbers. That invented precision is exactly what makes a citation look authoritative, and exactly what commit rejects, because the interval simply isn’t in the text.
The precision reads as rigor; only verification against the source catches a confidence interval that was never published. Here the agent fabricated the payload specification — commit caught that “MMAE payload” appears nowhere in the polatuzumab paper:
3. Citing papers it never read
The most insidious failure cites a paper the agent never opened. For the gut-microbiome question it read a single review (Peters et al. 2019), then cited the landmark Science papers from that review’s introduction directly, with cohort sizes, hazard ratios, and p-values lifted from the review’s secondary descriptions rather than the primary text.
With repos the agent tried to find all three papers and failed every time:
paperclip lookup doi 10.1126/science.aao3290 returned “No documents found”. The agent confirmed the paper wasn’t in the corpus and cited it anyway. With repos this is impossible: add fails for a paper that isn’t there, so the claim never commits. The agent must cite something it actually opened, or say nothing.
The common root cause
Every one of these failure modes has the same origin: the agent without repos generates citations instead of retrieving them. The repo workflow turns each citation into a retrieve-and-verify operation: a claim can only survive commit if the supporting text physically exists in a paper the agent opened. A paper that isn’t in the corpus can’t be added, so it can’t be cited; a number that isn’t in the text fails verification and gets corrected or dropped. That is why the residual ~6% of with-repos errors are subtle (a wrong CI bound, a result merged across two experiments) rather than wholesale fabrications.
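The two gates described above, "a paper that isn't in the corpus can't be added" and "a number that isn't in the text fails verification", can be sketched in a few lines. This is a deliberately crude stand-in: the real verifier is an LLM reading the full text, whereas this version only checks that each numeric token in the claim appears somewhere in the source. The function names, the PaperNotFound exception, and the toy sentence in the corpus are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches integers or decimals, with an optional trailing percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


class PaperNotFound(LookupError):
    """Raised when a claim points at a paper that is not in the corpus."""


def add(corpus: Dict[str, str], staged: List[Tuple[str, str]],
        paper_id: str, claim: str) -> None:
    # Gate 1: a paper that was never retrieved cannot be cited.
    if paper_id not in corpus:
        raise PaperNotFound(paper_id)
    staged.append((paper_id, claim))


def commit(corpus: Dict[str, str],
           staged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    # Gate 2: keep a claim only if every number in it appears in the source.
    # Plain substring matching is crude ("2.4%" would match inside "12.4%").
    kept = []
    for paper_id, claim in staged:
        numbers = NUMBER.findall(claim)
        if all(n in corpus[paper_id] for n in numbers):
            kept.append((paper_id, claim))
    return kept


# Toy source sentence, made up to echo the "18-fold" vs "18%" failure.
corpus = {"paper-A": "Editing efficiency showed an 18-fold increase over baseline."}
staged: List[Tuple[str, str]] = []
add(corpus, staged, "paper-A", "an 18-fold increase")
add(corpus, staged, "paper-A", "an 18% increase")
kept = commit(corpus, staged)

try:
    add(corpus, staged, "doi:10.1126/science.aao3290", "cohort size and hazard ratio")
    missing_paper_rejected = False
except PaperNotFound:
    missing_paper_rejected = True
```

In this toy run the "18%" claim is dropped because the source only says "18-fold", and the claim against a paper outside the corpus never reaches the staging list.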
Try Paperclip Repos
The skills to use Paperclip Repos have been added to the newest version of Paperclip. Update in one command:
paperclip update
Then, to get the agent to produce a repo of your session, just prompt it — for example:
The agent will build a curated, versioned repo as it works — searching, screening, adding verified claims, and committing — so the answer it returns is backed by a corpus you can open, audit, fork, and extend.