Arguing with Turing is a loosing game (or why predicting some things is hard)

FiveThirtyEight’s 2016 election forecast and Nassim Taleb

Back in August 2016, Nassim Nicholas Taleb and Nate Silver had a massive and a very public run-in on Twitter. Taleb – the author of “Black Swan” and “Antifragile” and Nate – the founder of FiveThirtyEight were arguably the two most famous statisticians alive back at the time, so their spite ended up attracting quite a lot of attention.

The bottom line of the spat was that for Taleb, the FiveThirtyEight 2016 US presidential election model was overly confident. As the events in the race between Clinton and Trump swung the vote intentions one way or another (mostly in response to the quips from Trump), Nate’s model’s forecasts had too large swings in its predictions and confidence intervals to be considered a proper forecast.

Taleb did have a point. Data-based forecasts, be they based on the polls alone or more considerations, are supposed to be successive estimations of a well-defined variable vector – aka voting splits per state, translating in electoral college votes for each candidate on the day of the election. Any subsequent estimation of that variable vector with more data (such as fresher polls) is supposed to be a refinement of the prior estimation. We expect the incertitude range of the subsequent estimations to be smaller than the one of the prior estimation and to be included into the prior predictions incertitude range – at least most of the time, modulo expected errors.

In the vein of the central thesis of “Black Swan”, Taleb’s shtick for 2016 election models – and especially the FiveThrityEight’s one – was that they were prone to overconfidence they had in their predictions and underestimated the probability of rare events that would throw them off. As such, according to him, the only correct estimation technique would be a very conservative one – giving every candidate 50/50% chance of willing right until the night of the election, when it would jump to 100%.

Image
Nassim Taleb’s “Rigorous” “forecasting”

The thing is that Nassim Nicholas Taleb is technically correct here. And Nate Silver was actually very much aware of that himself before Taleb showed up.

You see, the whole point of Nate Solver creating the FiveThirtyEight website and going for rigorous statistics to predict games and election outcomes was specifically the fact that mainstream analysts had hard time with probabilities and would go either for a 50/50 (too close to call), or a 100% sure, never in-between (link). The reality is that there is a whole world in between and that such rounding errors would cost you dearly. As anyone dealing with inherently probabilistic events would painfully learn, for an 80% win bet, you still loose 20% of the time – and have to pay dearly for it.

Back in 2016, Nate’s approach was to make forecasts based on the current state of the polls and the estimation of how much that could be expected to change if the candidates were doing about the same thing as the candidates in past elections. Assuming the candidates don’t do anything out of the ordinary and the environment doesn’t change drastically, his prediction would do well and behave like an actual forecast.

Nate Silver| Talks at Google| The Signal and The Noise
Nate explains what is a forecast is to him

The issue is that you can never enter the same river twice. Candidates will try new things – especially if models show them as loosing if they don’t do anything out of the ordinary.

Hence Taleb is technically correct. Nate Silver’s model predictions will always be overly confident and overly sensible to changes resulting from candidate’s actions or previously unseen changes in the environment.

Decidability, Turing, Solomonoff and Gödel

However Taleb’ technical correctness is completely besides the point. Unless you are God himself (or Solomonoff’s Demon – more about that in a second), you can’t make forecasts that are accurately calibrated.

In fact, any forecast you can make is either trivial or epistemologically wrong – provably.

Let’s for a second imagine ourselves physicists and get a “spherical cow in vacuum” extremely simplified model of the world. All the events are 100% deterministic, everyone obeys well-defined known rules, based on which they will make their decisions, and the starting conditions are known. Basically, if we had a sufficiently powerful computer, we can run the perfect simulation of the universe and come up with the exact prediction of the state of the world and hence of the election result.

Forecast in this situation seems trivial, no? Well, actually no. As Alan Turing – the father of computing – have proved in 1936, unless you run the simulation, in general you cannot predict even if a process (such as voter making up his mind) will be over by the election day. Henry Gordon Rice was even more radical and in 1951 has proven a theorem that can be summarized as “All non-trivial properties of simulation runs cannot be predicted”.

In computer science, forecasting if a process will be over is called halting problem and is known as a prime example of a problem that is undecidable (aka, there is no method to predict the outcome in advance). For those of you with an interest in mathematics, you might have noted a relationship to Gödel’s incompleteness theorem – stating that undecidable problems will exist for any set of postulates that are complex enough to embed the basic algebra.

Things only go downhill once we add some probabilities into the mix.

For those of you who have done some physics, you probably have recognized the extremely simplified model of the world above as Laplace’s demon and have been screaming on the screen about the quantum mechanics and how that’s impossible. You are absolutely correct.

That was one of the reasons that pushed Ray Solomonoff (also known for the Kolmogrov-Solomonoff-Chaitin) to create something he called Algorithmic Probability and a probabilistic pendant to Laplace’s demon – Solomonoff’s demon.

In the Algorithmic probability frameowrlk, any possible theory about the world that can exist has some non-nul probability associated to it and is able to predict something about the state of the world. And to perform an accurate prediction about the probability of the event, you need to calculate the probability of that event according to all theories, weighted by the probability of each theory:

Single event prediction according to Solomonoff (conditional probabilities)
E is event, H is a theory, sum is over all possible theories

Fortunately for us, Solomonoff also provided a (Bayesian) way of learning the probabilities of events according to theories and probabilities of theories themselves, based on the prior events:

Learning from a single new data point according to Solomonoff
D is new data, H is a theory, sum is over all possible theories T, but excluding theory H

Solomonoff was nice enough to provide us with two critical results about his learning. First, that it was admissible – aka optimal. In other terms, there is no other learning mode that could do better. The second – that it was uncomputable. Given that a single learning or prediction step requires an iteration over the entire infinite set of all possible theories, only a being from a thought experiment – Solomonoff’s demon – is actually able to do it.

This means that any computable prediction methodology is necessarily incomplete. In other terms, it is guaranteed not to take in account enough theories for its predictions to be accurate. It’s not a bug or an error – it’s a law.

So when Nassim Taleb says that Nate Silver’s model overestimates its confidence, while technically correct, is also completely trivial and tautological. Worse than that. Solomonoff also proved that in general case, we can’t evaluate by how much our predictions are off. We can’t possibly quantify what we are leaving on the table by using only a finite subset of theories about the universe.

The fact that we cannot evaluate in advance by how much we are off in our predictions basically means that all of Taleb’s own recommendations about investments are basically worthless. Yes his strategy will do better when there are huge bad events that market consensus did not expect. It will however do much worse in case they don’t happen.

Basically, for any forecasts, the, best we can do is something along the lines of Nate Silver’s now-cast + some knowledge gleaned from the most frequent events that occurred in the past.

Think about it. Things that could radically change the outcome of the 2016 election included:

  • An antifa militia killed Donald Trump
  • An alt-right militia killed Hilary
  • An alt-right terrorist group inspired by Trump’s speeches blew up the One World Trade Center, causing massive outrage against him
  • Trump making an off-hand remark about grabbing someone’s something that would put on his back 50% of the potential electorate
  • It rains in a swing state, meaning 1000 less democratic voters decide last second to note vote because Hillary’s victory is assured
  • FBI director publicly speaks about Hillary’s emails scandal once again, painting her as corrupt politician and causing a massive discouragement among potential democrat voters
  • Aliens show up and abduct everyone.
  • ….

Some of them were predictable since they happened before (rain) and were built into the model’s uncertainty. Other factors were impossible to predict. In fact, if in November 2015 you would have claimed that Donald Trump will become the next president due to the last-minute FBI intervention, you would have been referred to the nearest mental health specialist.

2020 Election

So why write about that spat now, in 2020 – years after the fact?

First, it’s a presidential election year in the US, again, but this time around even with more potential for unexpected turns. Just as in 2016, Nate’s models is drawing criticism, except this time for underestimating Trump’s chances to win instead of overestimating. While Nate and the FiveThrityEight team are trying to account for the new events and be exceedingly clear about what their predictions are, they are limited in what they can reliably quantify. His model – provably – cannot predict all the extraordinary circumstances that still can happen in the upcoming weeks.

There is still time for the Trump campaign to discourage enough female vote by painting Biden as rapist – especially if FBI reports in the week before the election about an ongoing investigation. There is still time for a miracle vaccine to be pushed to the market. This is not an ordinary elections. Many thing can happen and none of them is exactly predictable. There is still room for new modes for voter suppression to pop up and for Amy Barrett to be nominated to make sure the Supreme court will judge the suppression legal.

And no prediction can account for them all. Nate and the FiveThirtyEight team are the best applied statisticians in the politics since Robert Abelson, but they can’t decide the undecideable.

2020 COVID 19 models

Second – and perhaps most importantly – 2020 was a year of COVID 19 and forecasts associated to it. Back in March and April, you had to be either extremely lazy or very epistemologically humble to not try to build your own models of COVID19 spread or not to criticize other’s models.

Some takes and models were such an absolute hot garbage that the fact they could come from people I have respected and admired in the past plunged me into a genuine existential dread.

But putting those aside, even excellent models and knowledgeable experts were constantly off – and by quite a lot as well. Let’s take an example of expert predictions from March 16th about how many total cases they thought the CDC will have reported on March 29th in the US (from FiveThrityEigth for a change). Remind yourself – March 16th was the time when only 3500 cases were reported in the US, but Italy was already out of the ICU beds and was not resuscitating anyone above 65, France was going into confinement and China was still mid-lockdown.

Consensus: ~ 10 000 – 20 000 cases total with 80% upper bound at 81 000.
Reality: 139 000 cases

Yeah. CDC reported about 19 000 cases on the March 29th alone. Total was almost 10 times higher than the predicted range and almost the double of the predicted interval. But the reason they were off wasn’t because experts are wrong and their models are bad. Just as the election models, they can’t account for factors that are impossible to predict. In this specific case, what made all the difference was that CDC’s own COVID 19 tests were defective, and the FDA didn’t start issuing emergency use authorizations for other manufacturer’s tests until around the 16th of March. The failure of the best-financed, most reliable and most up-to-date public health agency to produce reliable tests for the world-threatening pandemics for two months would have been an insane hypothesis to make given the prior track record of the CDC.

More generally, all the attempts to predict the evolution of the COVID19 pandemic in the developed countries ran into the same problem. While we were able to learn rapidly how the virus was spreading and what needed to be done to limit its spread, the actions that would be taken by people and governments of developed countries were a total wild card. Compared to the election forecasts, we don’t have any data from prior pandemics in developed countries to build and calibrate models upon.

Nowhere is it more visible than in the state-of-the-art models. Here is Youyang Gu’s COVID19 projections forecasting website, using a methodology similar to the one of Nate Silver uses to predict the election outcomes (parameters fitting based on prior experience + some mechanistic insight about the underlying process). It is arguable the best performing one. And yet, it failed spectacularly, even a month in advance.

Youyang Gu’s forecast for deaths/day in early July
Youyang Gu’s forecast for deaths/day in early August. His July prediction was off by about 50-100%. I also wouldn’t be as optimistic about November.

And once again, it’s not the fault of Youyang Gu. A large part of that failure is imputable to Florida and other Southern states doing everything to under-report the number of cases and deaths attributed to COVID19.

The model is doing as well as it could, given what we know about the disease and how people have behaved themselves in the past. It’s just that no model can solve the halting problem of “when will the government decide to do something about the pandemic, what it will be and what people will decide to do about it”. Basically the “interventions” part of projections from Richard Neher’s lab COVID19-scenarios is as of now undecidable.

Once you know the interventions, predicting the rest reliably is doable.

Another high-profile example of expert model failure is the Global Health Security Index ranking for countries best prepared for pandemic from October 2019. US and the UK received respectively the best and second best scores for pandemic preparedness. Yet, if we look today (end of October 2020) and the COVID19 per capita among developed countries, they are a close second and third worst performers. They are not an exception. France ranked better than Germany. Yet it has 5 times the number of deaths. China, Vietnam and New Zealand – all thought to be around the 50/195 mark are now the best performers, with a whooping 200, 1700 and 140 times less deaths per capita than the US – the supposed best performer. All because of difference in the leadership approach and how decisive and unrelentless the response to the pandemic from the government was.

GHS index is not useless – it was just based on the previously implicit expectation that governments of countries affected by the outbreaks would aggressively intervene and do all in their capacity to stop the pandemic. After all, USA in the past went as far as to collaborate with the USSR, at the height of the Cold War to to develop and distribute the polio vaccine. It was after all what was observed during the 2010 Swine flu pandemic and during the HIV pandemic (at least when it became clear it was hitting everyone, not just the minorities) and in line with what WHO was seeing in developing countries hit by pandemics that did have a government.

Instead of a conclusion

Forecasting is hard – especially when it is quantitative. In most of cases, predicting what will happen deterministically is impossible. Probabilistic predictions can only base themselves on what happened in the past frequently enough to allow them to be properly calibrated.

They cannot – provably – account for all the possible events, even if we had a perfectly deterministic and known model of the world. We can’t even estimate what the models are leaving on the table when they try to perform their predictions.

So the next time you criticize an expert or modeler and point a shortcoming in their model, think for a second. Are you criticizing them for not having solved the halting problem? Not having bypassed the Rice theorem? Built an axiome set that does not account for a previously un-encountered event? Would the model you want require a Laplace or a Solomonoff Demon to actually run it? Are you – in the end – betting against Turing?

Because if the answer is yes, you are – provably – going to lose.

===========================

A couple of Notes

An epistemologically wrong prediction, can be right, but for the wrong reason. A bit like “even the stopped clock shows the correct time twice a day”. You can pull a one-off, maybe a second one if you are really lucky, but that’s about it. That’s one of the reasons for which any good model can only take in account events that has occurred sufficiently frequently in the past. To insert them into the model, we need to quantify their effect magnitude, uncertainty and the probability of occurrence. Many pundits using single-parameter models in 2016 election to predict a certain and overwhelming win for Hillary learnt it the hard way.

Turing’s halting problem and Rice theorem have some caveats. It is in fact possible to predict some properties of algorithms and programs, assuming they are well-behaved in some sense. In a lot of cases it involves them not having loops or having a finite and small number of branches. There is an entire field of computer science dedicated to developing methods to prove things about algorithms and to writing algorithms in which properties of interest are “trivial” and can be predicted.

Another exception to Turing’s halting problem and Rice’s theorem is given by the law of large number. If we have seen a large number of algorithms and seen the properties of their outcomes, we can make reasonably good predictions about the statistics of those properties in the future. For instance we cannot compute the trajectories of individual molecules in a volume, but we can have a really good idea about their temperature. Similarly, we can’t predict if a given person will become infected before a given day or if they will die of an infection, but we can predict the average number of infections and the 95% confidence interval for a population as a whole – assuming we did see enough outcomes and know the epidemiological interventions and adherence. That’s the reason, for instance, why we are pretty darn sure about how to limit COVID19 spread.

Image
Swiss Cheese defense in depth against COVID19 by Ian M. Mackay

To push the point above further, scientists can often make really good prediction from existing data. Basically, that’s how scientific process work – only theories with confirmed good predictive value are allowed to survive. For instance, back in February we didn’t know for sure what was the probability of the hypothesis “COVID19 spreads less as temperature and humidity increases”, nor what would be the effect on spread. We had however a really good idea that interventions such as masks, hand washing, social distancing, contact tracing, at-risk contacts quarantine, rapid testing and isolation would be effective – because of all we learnt about viral upper respiratory tract diseases over the last century and the complete carnage of the theories that were not good enough at allowing us to predict what would happen next.

There is however an important caveat to the “large number” approach. Before the large-scale laws kick in and modeler’s probabilistic models become quantitative, stochastic fluctuations – for which models really can’t account – dominate and stochastic events – such as super spreader events – play an outsized role, making forecasts particularly brittle. For instance, in France the early super spreader event in Mulhouse (mid-February) meant that Alsace-Moselle took the brunt of the ICU deaths in the first COVID19 wave, followed closely by the cramped-up and cosmopolitan Paris region. One single event, one single case. Predicting it and its effect would have been impossible, even with China-grade mass survelliance.

Mobilising against a pandemic - France's Napoleonic approach to covid-19 |  Europe | The Economist
March COVID19 deaths in France from The Economis

Instead of epilogue:

The theoretical part of this post was inspired by the conversations with my colleagues at the IC department of EPFL. If you are interested in learning more about how algorithms permeate our daily life or hear a more rigorous and coherent version of my ramblings about decidability, information, Laplace-Bayes-Solomonoff and how automated models (aka ML/AI mesh with all of that), Lê Nguyên Hoang and El Mahid El Mhamdi have you covered both with their YouTube series (here, here and if you speak french here) and books (mostly here, and if you speak french here and here as well).

The rest of this post was inspired by my own past research as a quantitative systems biologist and experience with building/criticizing models molecular mechanisms driving diseases or underlying biological functions to see them fail a reality test due to a process thought to be irrelevant and be reminded of Dobzhansky’s “Nothing in Biology Makes Sense Except in the Light of Evolution”. Now I am a CS researcher, investigating how to build best possible models of distributed and self-learning systems.

Problems with a major programming language version bump (Python 2>3)

After about 10 years after the initial Python 3 release and about six months after the end of Python 2 support I have finally bumped my largest and longest-running project to Python 3. Or at least I think so. Until I find some other bug in a rare execution path.

BioFlow is a python project of mine that I have been on-and-off running and maintaining since 2013 – by now almost 7 years. Heavily dependent on high-performance scientific computing libraries and python libaries providing bindings to them (cough scikits.sparse cough), despite Python 3 being out for a couple of years by the time I started working on it none of the libraries I depended supported it yet. So I got on with Python 2 and rolled for it for a number of years. By that time, with several refactors, feature creep and optimization, with about 6.5 k LOC, 2.5 k Lines of comments, 665 commits over 7 years and a solid 30% of test coverage, it is a middle-of-the road python work-horse library.

As many other people running scientific computing libraries I did see a number of things impacting the code: bugs being introduced into the libraries I depended on (hello library version pinning), performance degradation due to anti-spectre attacks on Intel CPUs, libraries disappearing for good (RIP bulbs), databases discontinuing support for the means accessing them I was using (why, oh why neo4j did you drop REST) or host system just crapping itself trying to install the old Fortran libraries that have not yet been properly packaged for it (hello Docker).

Overall, it taught me a number of things about programming craftsmanship, writing quality code and debugging code I forgot the details about. But that’s a topic for another post – back to Python 2 to 3 transition.

Python 2 was working just fine for me, but with its end of life coming near, proper async support and type hinting being added to it, a switch to Python 3 seemed like a logical thing to do to ensure a long-term support.

After several attempts to keep a codebase in Python 2 consistent with Python 3

So I forked off a 2to3 branch and ran the 2to3 script in the main library. At first it seemed that it should have solved most of issues:

  • print xyz was turned into print(xyz)
  • dict.iteritems() was turned into dict.items()
  • izip became zip
  • dict.keys() when fed to an enumerator was turned into list(dict.keys())
  • reader.next() was turned into next(reader)

So I gladly tried to start running my test suite, only to discover that it was completely broken:

  • string.lower("XYZ") now was “XYZ".lower()
  • file("fname", 'w') was now an open("fname", 'w')
  • but sometimes also open("fname", 'wr')
  • and sometimes open("fname", 'rt') or open("fname", 'rb') or open("fname, 'wb'), depending purely on the ingesting library
  • AssertDictEqual or assertItemsEqual (a staple in my unit test suite) disappeared into thin air (guess assertCountEqual will now have to do…)
  • wtf is even with pickle dumps ????

Not to be forgotten that to switch to Python 3 I had to unfreeze dependencies for the libraries I was building on top, which came with its own cans of worms:

  • object.properties[property] now became an object._properties[property] in one of the libraries I heavily depended on (god bless whoever invented Ctrl-F and PyCharm for it’s context-aware class/object usage/definition search)
  • json dumps all of a sudden now require an explicit encoding, just as hashlib digests

And finally, after running for a couple of weeks my library, some previously un-executed branch triggered a bunch of exception arising from the fact that in Python 2 / meant an integer division, unless a float was involved, whereas for Python 3 / is always a float division and an // is needed to trigger an integer division.

I can be in part blamed for those issues. A code with complete unit test coverage would have caught all of the exceptions in the unit-test phase and the integration tests would have caught problems in rare codepaths.

The problem is that no real-life library have a total unit-test or coverage library. Python 3 transition trench warfare hell have killed a number of popular python projects – for instance Gourmet recipe manager (I used to use myself). For hell’s sake, even DropBox, who employs Guido himself and runs a multi-billion business on an almost pure Python stack waited until end 2018 and took about a year to roll-over.

The reality is that the debugging of a major language version bump is **really** different from anything a codebase encounters in its lifetime.

When you write a new feature, you test it out as you develop. Bugs appear as you add lines of code and you can track them down. When a dependencies craps out, the bugs that appear are related to it. It is possible to wrap it and isolate the difference in its response to calls. Debugging is localized and traceable. When you refactor, you change the model of the problem and code organization in your head. The bugs that appear are once again directly triggered by your code modifications.

When the underlying language changes, the bugs appear **everywhere**. You don’t know which line of code could be at the bugged one, and you miss bugs because some bugs obscure other bugs. So you have to do pass after pass after pass of your entire codebase, spending weeks and months tracking exceptions as they pop up and never sure if you have corrected all the bugs yet. It is hard, because you need to load the entire codebase in your head to search for bugs, be aware of the corner cases. It is demoralizing, because you are just trying to get to the point where your code already was, without improving it in any way possible.

It is pretty much a trench warfare hell – stuck in the same place, without clear advantage gained by debugging excursions at the limit of your mental capacities. It is unsurprising that a number of projects never made it to Python 3, especially niche ones made by non-developers and for non-developers – the kind of projects that made Python 2 a loved, universal language that surely would have a library that could solve your niche problem. The problem is so severe in the scientific community, that there is a serious conversation in Nature about starting to use Python 2.7 to maximise projects reproductibility, given it is guaranteed it will never change/

What could have been improved? As a rank-and-file (non-professional) developer of a niche, moderately complex library here’s a couple of things that would have my life **a lot** easier while bumping the Python version:

  • Provide a tool akin to 2to3 and make it default path. It was far from perfect – sure. But it hammered out the bulk of the differences and allowed to code to at least start executing and me – to start catching bugs.
  • Unlike 2to3, it needs to annotate potential problems in the code it could not resolve. 'rt' vs 'rb' was a direct consequence for the text vs byte separation in Python 3 and it was clear problems will arise with that. Same thing for / vs //. 2to3 should have at least high-lighted potential for problems. For me my workflow, adding a # TODO: potential conflict that needs resolution would have gone a loooooong way.
  • Better even, roll out a syntax change in the old language version that will allow the developer to explicitly resolve the ambiguity so that the automated upgrade tools can get more out of the library
  • Don’t touch the unittest functions. They are the lifeblood of the debugging of the library after the language bump. If they bail out, getting them to work would require figuring out how the code they are covering works once again and defeats their purpose.
  • Make sure that the most wide-spread libraries in your ecosystem have performed a roll-over before pushing others to do the same.
  • Those libraries need to provide a “bump” version: aka with exactly the same call syntax from the users code, they would return exactly the same results both in the previous and the new version of the language. Aka the libraries should not be bumping their own major version at the same time they bump the supported langage version.