Field note
LLM agents in quantitative finance, where they actually pay off
A working note on where coding agents earn their keep in equity research, forensic accounting, and DCF construction, and where they still do not.
The seductive pitch for large language models in finance is that they read 10-Ks fast. The more interesting reality is what they let you do on top of that: build small, disciplined agents that compose research workflows a single analyst could not. This note is about where those agents actually earn their keep, based on running them against the full Russell 3000 over the past year as part of Leviathan.
The honest summary is that LLM agents are remarkable at one thing in finance, mediocre at a second, and still genuinely bad at a third. The trick is knowing which is which.
What works: forensic accounting at scale
The clearest win is forensic accounting. The literature has spent thirty years building structured scoring models for earnings manipulation, balance-sheet quality, and accruals abuse. The two best known are Beneish’s M-Score and Sloan’s accruals anomaly. Both are deterministic. Both rely on a fixed set of line items from 10-K filings.
The bottleneck has never been the math. It has been the extraction. Pulling the inputs for the Days Sales in Receivables Index, Asset Quality Index, and Sales Growth Index out of an unstructured 10-K for one company takes a junior analyst maybe twenty minutes if the filing is well behaved. Across the Russell 3000, that is a full-time job for a small team. Across the past decade of filings for the same universe, it is impossible.
Agents flip this. The pipeline is mechanical:
- Fetch the 10-K from EDGAR.
- Extract the income statement, balance sheet, and cash flow statement into a structured table.
- Compute the eight Beneish ratios.
- Output a structured record with provenance for every cell.
The Beneish M-Score is then a closed-form combination: a fixed intercept plus a fixed weight on each of the eight ratios.
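For reference, this is the eight-variable form as published by Beneish (1999), with the commonly cited coefficients. It is quoted from the published model, not from the Leviathan codebase:

```latex
\begin{aligned}
M = {} & -4.84 + 0.920\,\mathrm{DSRI} + 0.528\,\mathrm{GMI} + 0.404\,\mathrm{AQI} + 0.892\,\mathrm{SGI} \\
       & + 0.115\,\mathrm{DEPI} - 0.172\,\mathrm{SGAI} + 4.679\,\mathrm{TATA} - 0.327\,\mathrm{LVGI}
\end{aligned}
```

A firm whose seven index ratios all equal 1 (no year-over-year change) and whose TATA is 0 scores -2.48, which makes a convenient sanity check for any implementation.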
The agent does not invent the formula. The agent extracts the inputs reliably and at scale. That is the entire trick.
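A minimal sketch of that split, assuming the eight ratios have already been extracted. The coefficients are the published Beneish (1999) values. The `Cell` record and its `source` field are a hypothetical schema for illustration, not Leviathan's actual one.

```python
from dataclasses import dataclass
from typing import Mapping

# Published coefficients of the eight-variable Beneish (1999) model.
BENEISH_INTERCEPT = -4.84
BENEISH_COEFFS = {
    "DSRI": 0.920,   # Days Sales in Receivables Index
    "GMI": 0.528,    # Gross Margin Index
    "AQI": 0.404,    # Asset Quality Index
    "SGI": 0.892,    # Sales Growth Index
    "DEPI": 0.115,   # Depreciation Index
    "SGAI": -0.172,  # SG&A Index
    "TATA": 4.679,   # Total Accruals to Total Assets
    "LVGI": -0.327,  # Leverage Index
}


@dataclass(frozen=True)
class Cell:
    """One extracted value plus its provenance (hypothetical schema)."""
    value: float
    source: str  # e.g. filing identifier plus statement and line label


def m_score(ratios: Mapping[str, Cell]) -> float:
    """Deterministic step: intercept plus weighted sum of the eight ratios."""
    missing = set(BENEISH_COEFFS) - set(ratios)
    if missing:
        # Refuse to score a partial record rather than silently defaulting.
        raise ValueError(f"missing ratios: {sorted(missing)}")
    return BENEISH_INTERCEPT + sum(
        coef * ratios[name].value for name, coef in BENEISH_COEFFS.items()
    )
```

The scoring function never touches the filing. Everything uncertain lives in how the `Cell` values were produced, which is where the review effort should go.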
Across 1,847 filings processed in our internal benchmark, the agent matched a human analyst on M-Score inputs in 94 percent of cases. The remaining 6 percent were almost entirely non-standard filings (foreign issuers, restated filings, segment reorganizations).
The error mode worth flagging: the agent is good at pulling line items the filer labeled. It is bad at synthesizing a line item the filer did not break out. Total Accruals To Total Assets (TATA) is a real-world example. If the filer rolled accruals into a different line, the agent will quietly pick the wrong number. The fix is to flag low-confidence cells and route them to human review, not to pretend the agent can always reconstruct what the filer obscured.
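One way to sketch that routing rule, assuming the extractor reports a per-cell confidence and whether the value came from a line the filer explicitly labeled. Both fields, and the 0.9 threshold, are illustrative assumptions rather than a description of any real extractor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

REVIEW_THRESHOLD = 0.9  # illustrative cutoff, to be tuned against a labeled sample


@dataclass(frozen=True)
class ExtractedCell:
    name: str
    value: Optional[float]    # None when the extractor found nothing
    confidence: float         # extractor-reported score in [0, 1] (assumed)
    labeled_by_filer: bool    # False when the value was synthesized


def needs_review(cell: ExtractedCell, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Missing, synthesized, or low-confidence cells all go to a human."""
    return (
        cell.value is None
        or not cell.labeled_by_filer
        or cell.confidence < threshold
    )


def route(cells: List[ExtractedCell]) -> Tuple[List[ExtractedCell], List[ExtractedCell]]:
    """Split cells into (accepted, review_queue), preserving input order."""
    accepted = [c for c in cells if not needs_review(c)]
    review_queue = [c for c in cells if needs_review(c)]
    return accepted, review_queue
```

Note that a synthesized cell is routed to review even when the model reports high confidence. The quiet wrong number described above is exactly the case where self-reported confidence is least trustworthy.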
What works less cleanly: DCF construction
Discounted cash flow modeling is the next tier down. The model is deterministic in principle. The inputs are not.
A textbook DCF needs:
- Five to ten years of revenue projections.
- Operating margin assumptions across the projection window.
- Reinvestment rate (net capex plus working capital change as a fraction of after-tax EBIT).
- A terminal growth rate.
- A weighted average cost of capital.
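For reference, the standard way these inputs combine, written as a sketch rather than as Leviathan's exact specification. Here R_t is revenue, m_t the operating margin, rho_t the reinvestment rate (assumed to be measured against after-tax operating income), g the terminal growth rate, and N the length of the projection window. The tax rate tau is an added symbol, since taxes are not in the list above.

```latex
\begin{aligned}
\mathrm{FCFF}_t &= R_t \, m_t \, (1 - \tau)\,(1 - \rho_t) \\
\mathrm{TV}_N &= \frac{\mathrm{FCFF}_{N+1}}{\mathrm{WACC} - g}, \qquad g < \mathrm{WACC} \\
V_0 &= \sum_{t=1}^{N} \frac{\mathrm{FCFF}_t}{(1 + \mathrm{WACC})^{t}} + \frac{\mathrm{TV}_N}{(1 + \mathrm{WACC})^{N}}
\end{aligned}
```

The terminal value often accounts for a large share of V_0, which is why the two inputs in its denominator deserve the most scrutiny.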
Agents are excellent at pulling the historical base rates from filings and consensus estimates. They are mediocre at picking the forward assumptions. The forward assumptions are where the actual investing skill lives.
The pattern that holds up at Leviathan is a hybrid: the agent produces a base-case DCF with conservative, mechanically derived assumptions (revenue growth fades to nominal GDP, margins revert to peer median, WACC from Damodaran), and surfaces every assumption explicitly with its source. The human analyst then either accepts the assumptions or replaces them with their own view, with the agent recomputing instantly.
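A minimal sketch of such a base case, under stated assumptions: growth and margin move linearly to their targets over the window, the tax and reinvestment rates are held flat, and the terminal value uses the Gordon growth formula. The `Assumption` record, the dictionary keys, and the linear fade are illustrative choices, not Leviathan's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Assumption:
    value: float
    source: str  # where the number came from, shown to the analyst


def linear_path(start: float, end: float, years: int) -> List[float]:
    """Values for years 1..years, moving linearly and landing on `end`."""
    return [start + (end - start) * t / years for t in range(1, years + 1)]


def base_case_dcf(revenue0: float, a: Dict[str, Assumption], years: int = 10) -> float:
    """Enterprise value from a fade-to-target base case."""
    wacc = a["wacc"].value
    g_term = a["terminal_growth"].value
    if wacc <= g_term:
        raise ValueError("WACC must exceed terminal growth")
    growth = linear_path(a["growth_now"].value, g_term, years)
    margin = linear_path(a["margin_now"].value, a["peer_margin"].value, years)
    # Share of operating income left after tax and reinvestment.
    keep = (1 - a["tax_rate"].value) * (1 - a["reinvestment_rate"].value)

    revenue, value = revenue0, 0.0
    for t, (g, m) in enumerate(zip(growth, margin), start=1):
        revenue *= 1 + g
        value += revenue * m * keep / (1 + wacc) ** t

    # Gordon growth terminal value on year N+1 cash flow, discounted N years.
    terminal_fcff = revenue * (1 + g_term) * margin[-1] * keep
    value += terminal_fcff / (wacc - g_term) / (1 + wacc) ** years
    return value


def assumption_sheet(a: Dict[str, Assumption]) -> List[str]:
    """One line per assumption with its source, for the analyst to accept or override."""
    return [f"{name}: {x.value:.4f} [{x.source}]" for name, x in sorted(a.items())]
```

Overriding an assumption is then just replacing one dictionary entry and calling `base_case_dcf` again, which is what makes the recompute loop effectively instant.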
This is not “AI does DCF.” It is the AI doing the bookkeeping of a DCF and the human doing the judgment. The unbundling matters. The bookkeeping is 80 percent of the wall-clock time. The judgment is 100 percent of the value.
Damodaran’s implied equity risk premium is the canonical input here. As of this writing it sits near 4.6 percent, which has not moved much in the past six months despite the rate cycle. An agent that pulls this number monthly and propagates it across the portfolio’s DCFs is doing real work even if it never picks a stock.
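The propagation step can be sketched as follows, assuming cost of equity comes from CAPM and WACC from the usual capital-structure weighting. The `CapitalStructure` fields and function names are hypothetical, and the numbers in the usage note are placeholders, not market data.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CapitalStructure:
    beta: float
    equity_weight: float          # E / (D + E), between 0 and 1
    pre_tax_cost_of_debt: float
    tax_rate: float


def wacc(cs: CapitalStructure, risk_free: float, erp: float) -> float:
    """CAPM cost of equity blended with after-tax cost of debt."""
    cost_of_equity = risk_free + cs.beta * erp
    after_tax_debt = cs.pre_tax_cost_of_debt * (1 - cs.tax_rate)
    return cs.equity_weight * cost_of_equity + (1 - cs.equity_weight) * after_tax_debt


def propagate_erp(
    portfolio: Dict[str, CapitalStructure], risk_free: float, erp: float
) -> Dict[str, float]:
    """Recompute every holding's WACC from one refreshed ERP reading."""
    return {ticker: wacc(cs, risk_free, erp) for ticker, cs in portfolio.items()}
```

Calling `propagate_erp(portfolio, risk_free=0.04, erp=0.046)` after each monthly refresh yields a fresh discount rate per ticker, which each DCF can then pick up without anyone retyping a number.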
The error mode here is false precision. A DCF with thirty inputs feels rigorous. If twenty of them are model-generated guesses dressed up as analysis, the rigor is theater. The fix is the same as in forensic accounting: surface confidence, distinguish derived numbers from sourced numbers, and never let a model-generated forecast masquerade as an analyst conviction.
What is still bad: thesis generation
The third tier is where the honest answer becomes uncomfortable. LLM agents are still not very good at writing actual investment theses.
They can summarize. They can pattern-match. They can produce a plausible-looking three-paragraph bull case. What they cannot do reliably is identify the asymmetric bet that an experienced investor would. The good theses I have read in the past year, the ones that materially outperformed, almost all hinged on a single non-consensus insight: a structural shift the market was mispricing, a balance sheet item the sell-side was misreading, a moat that was widening when the narrative said it was eroding.
Models do not generate these. They average. The training distribution is dominated by sell-side notes, which are themselves dominated by consensus. The output regresses to the mean of what has already been said.
The use case that works is thesis stress testing, not thesis generation. Give an agent your draft thesis and ask it to argue the other side with citations. The model is good at finding the counterargument because the counterargument is somewhere in its training distribution. It is just rarely surfaced alongside the bull case.
What this implies for tooling
The implication for anyone building research tools in finance: stop pitching “AI that picks stocks.” Start building infrastructure that lets a small team behave like a large one. The unit economics shift dramatically when one analyst can run quality screens across 3,000 names instead of 30.
The composition that has been working:
- Deterministic models (Beneish, Altman, Sloan) for screening.
- LLM-driven extraction pipelines to feed those models inputs at scale.
- Hybrid DCF tools where the agent does the plumbing and the human does the judgment.
- Citation-grounded research notes where every number traces back to a filing or a curated source.
- No autonomous trade execution. Ever.
That last point matters more than it sounds. The cost of a wrong forensic accounting flag is a wasted analyst hour. The cost of an autonomous trade based on a hallucinated signal is uncapped. The asymmetry is too violent to ignore.
The honest summary
Coding agents and LLM agents are now legitimately useful in equity research. The use cases that pay off are the unglamorous ones: extraction, structured analysis, summary, stress testing. The use cases that get the press are the ones that still do not work: autonomous thesis generation, end-to-end portfolio management, anything that requires non-consensus judgment.
The market eventually figures out which is which. The teams that figure it out first ship.
That is the bet.