When the Past Stops Predicting the Future
A property insurer can look perfectly solvent right up until the week it isn't. The numbers on the page are fine. The reserves are funded. Then a hurricane comes ashore, the claims arrive, and it turns out the company had been quietly underestimating what it owed for years. Between 2017 and 2023, this happened over and over in Florida and Louisiana. Insurer after insurer went under. Not because they were reckless, but because the math they trusted had stopped describing the world.
I want to explain why that math fails, what we found when we tried to fix it, and the company we ended up building as a result.
Start with the basic problem an insurer has. It sells you a policy this year. You might file a claim this year, or next year, or five years from now. Even after you file it, the final cost can take years to settle. So at any moment the insurer owes money it hasn't paid yet, on claims that have already happened. Figuring out how much that is has a name: loss reserving. It sounds like accounting trivia. It isn't. The reserve number is what regulators check to decide whether a company can keep operating. Set it too low and you're insolvent and don't know it. It is, in a real sense, the number that decides whether the company lives.
The standard way to estimate it is older than most of the people doing it. You build something called a loss triangle. It's a table where each row is a year in which claims were incurred, and each column shows how much had been paid as of one year later, two years later, and so on. The triangle shows you how losses develop: how a year's claims grow from a partial number to a final one as time passes. Then you assume the future will develop the same way the past did, and you extrapolate the rows that aren't finished yet. That's the Chain Ladder method, and actuaries have used some version of it since the 1960s. Bornhuetter-Ferguson, Cape Cod, Mack: the other classical methods are refinements, but they all lean on the same load-bearing assumption. They assume the pattern by which losses develop stays roughly stable from year to year.
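To make the mechanics concrete, here is a minimal Chain Ladder sketch in Python. The triangle values are invented for illustration, and the function names (`development_factors`, `complete_triangle`, `reserves`) are my own, not from any actuarial library. It uses volume-weighted age-to-age factors, which is one common variant of the method.

```python
# Cumulative paid losses (made-up numbers, for illustration only).
# Rows: accident years. Columns: development age in years (1, 2, 3, 4).
# None marks cells that have not been observed yet.
triangle = [
    [1000.0, 1500.0, 1650.0, 1700.0],
    [1100.0, 1650.0, 1815.0, None],
    [1200.0, 1800.0, None, None],
    [1300.0, None, None, None],
]


def development_factors(tri):
    """Volume-weighted age-to-age factors.

    For each pair of adjacent columns, divide the sum of the later column
    by the sum of the earlier one, using only rows where both are observed.
    """
    factors = []
    for j in range(len(tri[0]) - 1):
        rows = [r for r in tri if r[j] is not None and r[j + 1] is not None]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def complete_triangle(tri):
    """Fill each unobserved cell by applying the factors to the cell before it."""
    factors = development_factors(tri)
    completed = [list(row) for row in tri]
    for row in completed:
        for j in range(1, len(row)):
            if row[j] is None:
                row[j] = row[j - 1] * factors[j - 1]
    return completed


def reserves(tri):
    """Reserve per accident year: projected ultimate minus latest observed value."""
    result = []
    for original, full in zip(tri, complete_triangle(tri)):
        latest = [v for v in original if v is not None][-1]
        result.append(full[-1] - latest)
    return result


factors = development_factors(triangle)
total_reserve = sum(reserves(triangle))
```

Notice where the stability assumption lives: every unfinished row is projected with factors averaged over older rows. If the most recent years develop differently, nothing in this calculation will say so.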
For most of the history of insurance, that assumption was fine. The whole discipline is built on the idea that the past is a good guide to the future. For fire, theft, and fender-benders, it mostly is.
Climate change broke it. When hurricanes get more intense and more frequent, the pattern by which losses develop doesn't stay stable. It shifts. A bad year is no longer a rare draw from the same old distribution. The distribution itself has moved. And here is the part that should bother you. The classical methods can't tell. They will keep producing a confident number while the ground moves underneath them. They don't fail loudly. They fail silently. That's what makes them dangerous. A model that's a little bit wrong and says so is manageable. A model that's badly wrong and stays serene is how a company ends up insolvent on paper one quarter and in receivership the next.
So the real problem isn't accuracy. Everyone frames it as accuracy. Get a better number, shave the error down a few points. But the arithmetic was never the hard part. Chain Ladder is, frankly, not hard arithmetic. The hard part is noticing the moment when the arithmetic stops applying. The technical name for that moment is a structural break. It's a point where the process generating your data changes, so the model fit to the old data is now describing something that no longer exists. A model that can notice the break is worth more than a model that's marginally more accurate in a world that isn't moving. Because in a world that isn't moving, you barely needed the model. You needed it for exactly the moment it tends to fail.
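A toy example of what "noticing" can mean. This is a textbook one-sided CUSUM detector, not the detector from the paper, and the link ratios, the slack `k`, and the threshold `h` are all made-up values chosen for illustration. It watches a series of first-to-second-year development factors, one per accident year, and raises an alarm when they drift upward from an older baseline.

```python
def cusum_break(series, baseline_n, k, h):
    """One-sided CUSUM for an upward shift in the mean.

    baseline_n: number of initial observations treated as the "old world".
    k: slack, so small deviations above the baseline mean are ignored.
    h: alarm threshold on the accumulated excess.
    Returns the index of the first alarm, or None if no alarm fires.
    """
    baseline_mean = sum(series[:baseline_n]) / baseline_n
    excess = 0.0
    for i in range(baseline_n, len(series)):
        # Accumulate deviations above baseline; reset to zero when they vanish.
        excess = max(0.0, excess + (series[i] - baseline_mean - k))
        if excess > h:
            return i
    return None


# Hypothetical link ratios: stable near 1.50, then shifted upward.
link_ratios = [1.50, 1.52, 1.48, 1.51, 1.49, 1.50, 1.62, 1.66, 1.70, 1.68]

alarm_index = cusum_break(link_ratios, baseline_n=6, k=0.03, h=0.15)

# What an all-years average would use, versus what recent years look like.
all_years_average = sum(link_ratios) / len(link_ratios)
recent_average = sum(link_ratios[-4:]) / 4
```

In this toy series the alarm fires on the second shifted year, while the all-years average has only crept partway toward the new level. An averaged factor is wrong without saying so. A detector says something.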
That reframing is the whole idea, and it's what the research paper is about. My co-author Thomas Mbrice and I, with a faculty mentor at Stony Brook advising, asked a narrow question. Can a machine detect these structural breaks earlier and more reliably than the classical methods? We used an LSTM, a kind of neural network that processes a sequence one step at a time and carries a memory of what it has seen. That makes it natural for data where the order and the history matter, as they do in a loss triangle. We trained it on more than fifteen years of regulatory triangle data from over eighty insurers in Florida and Louisiana. And we fed it the thing the classical methods ignore entirely: the weather. Hurricane intensity indices and sea-surface temperatures from NOAA, so the model could see the physical driver of the shift, not just its aftermath in the claims.
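For readers who haven't met an LSTM, here is a deliberately tiny sketch of the recurrence: a single hidden unit, written in plain Python. The weights are arbitrary and untrained, and the three input features per step (a normalized loss figure, a hurricane intensity index, a sea-surface temperature anomaly) are hypothetical stand-ins. The model in the paper is larger and differs in its details; this only shows how the state is carried from one step to the next.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, params):
    """One step of a single-unit LSTM.

    x: list of input features for this step.
    h, c: hidden state and cell state (the "memory") from the previous step.
    params: for each gate, a tuple (input weights, recurrent weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return sum(w * xi for w, xi in zip(w_x, x)) + w_h * h + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much old memory to keep
    o = sigmoid(pre("output"))       # how much memory to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


# Arbitrary, untrained weights (assumption: for illustration only).
params = {
    "input": ([0.5, 0.3, 0.2], 0.1, 0.0),
    "forget": ([0.1, -0.2, -0.1], 0.2, 1.0),
    "output": ([0.4, 0.1, 0.1], 0.1, 0.0),
    "candidate": ([0.6, 0.5, 0.3], 0.2, 0.0),
}

# Hypothetical per-step features: [loss, hurricane index, SST anomaly].
sequence = [
    [0.2, 0.1, 0.0],
    [0.3, 0.1, 0.1],
    [0.9, 0.8, 0.6],
]


def run(seq):
    """Feed the sequence one step at a time, carrying h and c forward."""
    h, c = 0.0, 0.0
    for x in seq:
        h, c = lstm_step(x, h, c, params)
    return h, c


h_final, c_final = run(sequence)
```

Feeding the same three steps in reverse order gives a different final state, which is the point: the network's output depends on the history and its order, not just on the set of values it saw.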
Two pieces mattered more than the architecture. The first is a custom loss function. When the model senses a regime shift, it weights recent data more heavily, so it adapts to the new world instead of being anchored to an average of a world that's gone. The second is a theoretical framework. It was not enough to build something that worked on the data we happened to have. Catastrophes are rare almost by definition, and a method that only looks good because it got lucky with which hurricanes landed in the test window is worthless. So we put structural-break detection in probabilistic terms and derived formal performance guarantees that hold even when the catastrophe events in the test window are sparse. We validated against real events, Hurricanes Ian and Ida, to check that the detector fired when it should have.
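To give a feel for the first piece, here is a sketch of one way a loss can lean on recent data when a shift is suspected. This is not the loss function from the paper. The blending rule, the `shift_prob` input, and the `decay` parameter are assumptions I'm making for illustration.

```python
def regime_aware_mse(preds, targets, shift_prob, decay=0.5):
    """Weighted mean squared error over a time-ordered series.

    shift_prob: the model's belief (0 to 1) that a regime shift has occurred.
      At 0, every observation gets equal weight (ordinary MSE).
      At 1, an observation `age` steps old gets weight decay**age.
      In between, the two weightings are blended linearly.
    """
    n = len(preds)
    weights = []
    for t in range(n):
        age = n - 1 - t  # 0 for the most recent observation
        weights.append((1.0 - shift_prob) + shift_prob * decay ** age)
    weighted_errors = sum(
        w * (p - y) ** 2 for w, p, y in zip(weights, preds, targets)
    )
    return weighted_errors / sum(weights)


# The model is right about the old years and wrong about the newest one.
preds = [1.0, 1.0, 1.0, 1.0]
targets = [1.0, 1.0, 1.0, 2.0]

loss_no_shift = regime_aware_mse(preds, targets, shift_prob=0.0)
loss_shift = regime_aware_mse(preds, targets, shift_prob=1.0)
```

With no suspected shift, the one recent miss is diluted by three old hits. With a suspected shift, the same miss counts for roughly twice as much, so training pressure moves toward fitting the new regime.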
Now the honest part, because this is the sentence people skim past and I don't want you to. This is a white paper and a research program, not a deployed result. We hypothesize a 15 to 20 percent improvement in reserve accuracy on catastrophe-exposed years. That number is a target. It's grounded in the prior literature and in the convergence results we derive, but it is not something we have demonstrated in production. I'd rather tell you that plainly than dress up a hypothesis as a finding. The contribution of the paper is the framing and the math. It gives a way to think about reserving as a problem of detecting change, with guarantees that survive the rarity of the thing you're trying to detect. [1]
The paper makes the case that a machine can watch reserving for the moment the world shifts. That was the result I cared about going in. But working on it surfaced something I cared about more, which had nothing to do with neural networks. The entire reserving workflow is still done by hand. An actuary pulls the triangles, cleans them, picks a method, runs it in a spreadsheet, reconciles it against the regulatory rules, and writes up an opinion. It takes weeks, and most of those weeks are not insight. They're plumbing.
So we built the thing that does the plumbing. That's BlackScholes, what we call the world's first AI actuary. You drop in your loss triangles. It parses and validates them, helps you select the model that best fits the data, runs the reserve calculation, runs the regulatory compliance checks, lets you ask it questions about what it found in plain language, and generates the reserve-opinion report in one click. The research is the proof that this can be done well. The company is the application. I built it with Thomas and with Samuel Katsaros, who shaped the product and the company even though he wasn't on the paper.
The bet underneath BlackScholes is bigger than reserving, though. Reserving is the wedge. It's a painful, well-defined, deeply manual job that a careful machine can do faster and, eventually, more safely. But once a machine can do reserving, the question stops being whether software can help the actuary. It becomes a different question. What is the smallest insurance company you could run if software did most of the actuarial work? We think the answer is smaller than anyone currently believes. The plan is to start with the AI actuary and, over time, become the first AI-native insurance company.
But I keep coming back to the silent-failure point, because it's the part that generalizes past insurance. The most dangerous models aren't the ones that are obviously wrong. They're the ones that stay confident while the world moves out from under them. And the longer they've been right, the more you trust them at exactly the wrong moment. The most useful thing you can teach a model is not how to be more accurate. It's how to notice when it has stopped being accurate at all.
[1] If you want the actual math, the LSTM setup, the regime-aware loss function, and the convergence guarantees under sparse catastrophe events, it's all in the paper on arXiv.