MODERATOR NOTE: coronavirus resources & updates: https://wolfr.am/coronavirus
-- you have earned Featured Contributor Badge
Your exceptional post has been selected for our editorial column Staff Picks http://wolfr.am/StaffPicks and Your Profile is now distinguished by a Featured Contributor Badge and is displayed on the Featured Contributor Board.
This post has been listed in the main resource-hub COVID-19 thread: https://wolfr.am/coronavirus in the section Computational Publications. Please feel free to add your own comment on that discussion pointing to this post ( https://community.wolfram.com/groups/-/m/t/1906954 ) so many more interested readers will become aware of your excellent work. Thank you for your effort!
Oh dear. This model seems to be using RATE information as its source - which leads to the usual problems with noisy inputs.
No, the fits are done with the cumulative case data, then the fit is used to predict what the daily rate changes would be according to the fit of the cumulative case data up to that point. Growth rate changes are what will be noticed in the news. The fits raise the interesting question that the early data may not be very good as the rates derived from the daily log differences of the reported cases do not fit any pattern. Here are the plots for New Jersey including today's data.
Using the default fitting, which uses Norm to minimize the fit residuals, the early data are not adding much to the fit as shown by the last log plot. Either the dynamics are changing after the peak or the early data are not very good. The plots of the early daily rates show a chaotic pattern suggesting the problem may be with the early data. Using the default fitting method to the cumulative data limits the influence of the early data points, so hopefully the predictive value gets better with more data, but the data from South Korea looked a little peculiar at the end, but the late case increases were not huge even though they failed to fit the model.
I looked over the three plots above and I'm guessing that there was a logarithmic fit, judging by the log ordinate for the case plot. If you differentiated your logistic function, you SHOULD have seen the linear scaled case rate growing in the usual bell shaped curve for a maximum. Your "Daily Log case differences " did not show the desired shape, so you did not have the opportunity to refit a non linear regression to the post-peak case data - to show the departure of the logistic parameters at later dates to a higher asymptotic level. Let me show you the Oklahoma deaths and death rate.
Below is the Wolfram Alpha plot of daily Death rate Day 1 is March 23rd, peak occurs around April 7th
I notice that when I did the same for Cases and Case Rate, the daily fit for Cases needed daily revision after the day for the Peak case rate. When I discarded the pre-peak date data points and ran a new regression, the new best fit parameters showed a considerably higher plateau, suggestiing the bell-shape time series is not in fact symmetrical.
Sincerely Brian Whatcott Altus.
Robert, you asked about my data source. I use Worldometer.info. It provides data for a number of countries, and includes US by state - but no smaller divisions. I collect it daily. With a little effort, you can recapture prior data by clicking data points, as I recall https://www.worldometers.info/coronavirus/country/us/?fbclid=IwAR0fWg98jsNTSUX0_f2s0s7xnPE1pdClvUGyphK7YC_l24od34kMmdnLyGE
Brian W
Brian,
Just clicked on the link, they are now reporting 45318, deaths for 4/22. I don't know how often they update, but it eventually may be showing up there as well. Are you analyzing with Mathematica?We are getting to an interesting point in the epidemic, which should decide the fate of the logistic model. The logistic model requires an exponential decline, which is easiest to show in the log plot of the daily differences.
In the picture the orange dots are the daily differences of the raw data and the curve is the calculated daily difference from the logistic fit to the cumulative data as a continuous curve which is very close to the first derivative curve. If the points don't follow something close to a log linear decline, I don't think the logistic model holds, but with data anomalies it may be difficult to track. I am finding that the death data is lagging the case data by 7 to 8 days. In the end the death data may be more reliable, especially if people start reporting antibody tests as positive cases.
Bob
Robert, please take a look at the notebook below, which generously copied from yours, analysing both positives and number of deaths. Two new ideas are:
Dietmar, I believe you hit several nails on the head with your critique of logistical models. They necessarily provide the same time constant for the asymptotic approach as for the exponential increase. Moreover, when the daily growth rate becomes linear, the differential of the logistic function puts the middle of the straight line growth as the peak of the daily rate. I am pleased by an improved model function that David Bowman at Georgetown proposed. At the cost of a fourth, fitting parameter, the exponential growth time constant is uncoupled from the time constant of the asymptotic approach. DEATHS = k1 / (1 + aexp(ab*(starting day number - present day number)))^(1/c) It too suffers - this time from its virtue: the trailing exponential can easily follow a short term hollow - so that I find it advantageous to train the non-linear fitting with a fictitious value for a datum several days in the future, which best matches the recent early data points of the slowing daily numbers - so that the fit does not hunt unnecessarily to find the least squares fit of the linear data. Sadly the only data plot where this is presently useful is US Deaths, which appear to be turning back: other plots I maintain are currently best served by straight line models, which provide a direct value for daily rates.
Dietmar,
I am not sure what to make of the dynamic the moving average introduces, but mid March most of the datasets show a jump in cases for one day, but that wouldn't affect deaths. I think probably the cyclic phenomenon is introduced by using a linear moving average on data that is more or less growing exponentially. I think if you want to do smoothing that is consistent with the data, you should try an ExponentialMovingAverage, I have also looked at the parameters as the data accumulates. The L parameter in general keeps growing, while k decreases. My best answer is that the early data doesn't have any particular shape to it, rather it is chaotic -- you could fit most anything to it. There is probably good reason for this. Public health officials were playing catch up; the spread of the virus was much wider than was realized. So initially there was a mix of unrestrained viral spread and minimal quarantine. So we are seeing a situation that doesn't fit any model. Eventually the quarantine effort becomes more effective, but it may not become perfect and behave as a leaky quarantine. In the end the South Korea data, showed a persistent growth not at all consistent with the logistic model. Notebook attached.
Another interesting way to look at models is that the logistic model is the same as multiplying L times any unimodal statistical distribution. I am also attaching a notebook showing what fitting with that idea shows. Basically if we want to predict when it will end, The tail behavior of the distribution is what is important, and it is the right tail that will be important, because it doesn't really matter if timing screwed up the initial data accumulation. The ultimate conclusion I think is that you cannot make predictions unless you know in advance the shape of the entire curve. If the tail behavior is exponential, then the logistic distribution should eventually succeed, but it wasn't in S. Korea.
One thing that probably is predictable is that once there is a clear decline in case rates, it should eventually end, and we may be able to detect the shape of the decline. Right now we are probably still too close to the peak in the U.S.
The attached notebooks are not annotated, but you clearly will be able to understand them.
On your last remark, yes, that is the issue with trying to predict the future: It's really hard to do in advance; it's much easier to "predict" (=explain) the future once it's not the future anymore, but that's not as helpful in our current situation. This is also the difficulty with those SIR/SEIR/etc. "first-principle" models. They are really pretty, and work alright once all the parameters are known with sufficient accuracy. Unfortunately it turns out that we typically have a good grasp of those parameters only after the epidemic is over.
Agreed. Most of the epidemiologic models are focused on the dynamics of the disease and recovery rates. Effective quarantine, however, can shut down an epidemic very rapidly, and the epidemiologic models also require transmission assumptions that are not known for a new disease.
I do suspect the S. Korea finale was unusual, and that we should be able to predict the rate of decline from the data pretty soon, then modelling with the tail behavior of statistical distributions might prove useful. If the decline is faster than exponential, then the gamma distribution offers the range of tail behaviors between exponential and normal distributions, if it is slower, then a Pareto distribution could model it. If there are recurrent new small outbreaks, which might be what happened in S. Korea, then no mathematical model will succeed.
I just updated the notebook I posted with the latest data. This does not look good. Looks like the United States is turning the corner, the wrong way…
US Data going the wrong way?
About that step change in Deaths: I notice Worldometers updated US DEATHS about an hour ago - local time, actually 04:17 GMT tomorrow! Here's what that one and only decaying plot looks like now: Their usual update happens around noon. Looks like that double exponential may go back to its former linear growth....
Day #48 of my data was rather like that: a step increase when NY State and several others decided to include non-hospital cases and fatalities in their totals. I excluded two prior days data - which is itself a kind of massaging of the input - an original sin of sorts.
As far as critiquing models is concerned, clearly, capturing a complex spatially distributed process in a simple localized model is only ever going to produce a rough approximation. Those logistic fits, specifically, are motivated by a description of epidemics on the basis of a localized model (we used to call these "lumped parameter models") which result in a fairly simple (set of) ordinary differential equation(s), and, in their simplest case, the logistic function as a solution. All of these, very much including those SIR/SEIR/… models, suffer from the assumption of spatially uniform parameters.
In contrast, in particular in the US right now, we have widely varying population densities, chaotic testing, and wildly varying levels of compliance with social distancing rules. While I would therefore expect individual hot spots to be more easily described with localized models, that's near-impossible for the US as a whole.
I wish people would spend more effort on much more realistic approaches such as Wolfram's agent-based networks. Yes, those are enormously computation-intensive (and data-hungry), but models of that nature could at least clarify some crucial questions that are poorly understood at this point. Take just the question of the efficacy of social distancing, which is clearly hugely important: Current approaches take an enormous economic toll, and at some point the question needs to be asked (well, it's being asked right now) if and how these rules should be relaxed. Yet, to be perfectly frank, there is basically no real understanding, none at all, of what the effects of those social distancing rules w.r.t. cases and deaths could be. Sure, people babble about how "social distancing works", and it probably does, to some degree, but there is no quantitative understanding. Agent-based networks could provide some insight, but that will require a massive effort. Are there any groups who are doing such simulations?
Dietmar, forgive me for pigeon-holing this response as an "Argumentum Ad Vericundiam"; specifically, the physics teacher approach. This joins "aerodynamic lift as impulse/momentum of air molecules with due attention to inhomogeneities from micro gusts" and "aerodynamic drag as a function of velocity" [in this case because the calculus offers reasonable solutions, where "drag as a function of velocity squared" needs one to fall back on iterative methods for decent sizes and shapes.] Which is a slightly derogatory way of suggesting that there are emergent properties in building blocks where chaotic conditions are subsumed. No insult intended....
Robert, interesting cerebration on the shape of the asymptotic function. I have only jumped to a two exponential model in the last few days for US Deaths - the only plot that seems to support such a plot which I show here. At least one other modeler was dismayed by a step change in reporting data. The last such change (for Worldometer data sets) was the addition of Federal sources previously disregarded. I was able to see that the last few days is still best modeled by this double exponential, rather than the current default, which would be a straight-line growth. Sincerely Brian W
I was specifically thinking of this article/post. Some interesting ideas there, but you'll have to run this on some fairly high-powered cluster to be able to study networks and parameter sets large enough to start providing guidance for the real word.
For the fun of it, here is a notebook looking at a fit using an exponential startup with a linear tail:
What is somewhat interesting about this fit is that it's been fairly stable for more than a month now (use the slider in the dynamic plots to see how the fit has been changing over time as more data has become available). I do wonder if we can find a plausible model that would generate this kind of distribution. I vaguely remember from many years back obtaining similar behavior when I was working on hyperbolic one-D transport-reaction equations that we had obtained for fixed-bed absorption/desorption processes.
Clearly the data are showing a different structure late in the epidemic than was seen in the early data. I was looking yesterday at Massachusetts' website where they have the case rates per 100000 for each town. The case rates vary wildly with very high rates for low income areas where there are large immigrant populations and English isn't spoken in many homes. Today MA just made its raw data available. MA Data I have just started to look at it, but it is detailed enough that it might provide some insight into how the data unfolded, I am thinking right now that since most of the control measures required dissemination of information, different communities processed the information at different rates, with the low income areas lagging very badly in some states. If that is true we may have a mixture of models, but we may yet see an exponential tail if quarantine measures eventually succeed.
I lived in MA for forty years so I am familiar with the communities. I now live in AZ where the counties are very large, but the two counties which are mostly Indian Reservation have per capita case rates seven times higher than the rest of the state. NY and NJ maybe just starting to approach the tail behavior, I think it will be exponential, but rates are still varying daily when I fit the last ten days of new case results.
Yes, of course, that linear behavior we currently see cannot go on forever: At the very least, once we're getting closer to 100% of the population infected things will have to asymptote.
However, at this point, even taking into account that the number of people infected may be an order of magnitude higher than what the number of known (=tested) positives imply -- currently at about 1.4M, so ten times that number makes 14M -- we currently have only a few percent of the population infected.
The picture I have in my mind, from the remark in my previous post, is one of an "infection front" spatially moving through the population at constant speed, which may produce the linear growth in case numbers we are seeing now. This may go on like this for another couple of months. On the other hand, the good news is that the behavior is linear, thus the number of new cases per day is constant, and if things remain that way they will remain manageable.
There is another interesting distribution with two power tails, the double Pareto lognormal distribution. I have it implemented in Mathematica if you would like to fool around with it. The paper describing it is attached.
Robert, I found the paper you forwarded interesting, at first blush. I'll take the time to explore it well. Thank you! Brian W
If you use Mathematica I wrote a package for the lognormal double Pareto distribution. It was written for version 8, but I checked it this afternoon and it still works. I used it to go through every equation in that paper in a notebook. Let me know if you would like those files.
I was originally interested in it for stock market volatility measures, but they are modeled as well by the generalized extreme value distribution with one less parameter.
Robert,
I am interested in lognormal double Pareto distribution to be used for Stock trading. Can you share the associated Mathematica package.
Alan
Alan,
The zip file contains the package. If you use Install... in the file menu to load it, some limited documentation will be available. The file ReedPaper.nb uses the package to check all the math in the Reed paper noted above. DPLNTest.nb shows some of the basic functions, and calculates LMoments.
I couldn't get the zip file to load as an attachment, so I put it here:
Let me know when you have it and I will remove it from the link.
It was written nine years ago for version 8. I did test it with the two notebooks above and it still seems to work. Let me know if you have any questions
I have downloaded the DPLM.ZIP from the ftp. Thanks