Forum: >>> Magnum BBS <<<

Looking at Covid19 data

From root@21:1/5 to All on Mon May 25 20:33:15 2020

I have been analyzing the New York Times Covid19 data from:

https://github.com/nytimes/covid-19-data/archive/master.zip

and I believe I have found some interesting things. I would
be willing to share my findings with this newsgroup if
anyone is interested.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From RosemontCrest@21:1/5 to root on Mon May 25 15:01:12 2020

On 5/25/2020 1:33 PM, root wrote:

I have been analyzing the New York Times Covid19 data from:

https://github.com/nytimes/covid-19-data/archive/master.zip

and I believe I have found some interesting things. I would
be willing to share my findings with this newsgroup if
anyone is interested.

As moderator of the misc.consumers newsgroup, I represent all readers by stating that no one is interested in your conjecture. Seriously, you
don't need permission from anyone. Fire away.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From RosemontCrest@21:1/5 to root on Mon May 25 16:25:05 2020

On 5/25/2020 4:18 PM, root wrote:

RosemontCrest wrote:

As moderator of the misc.consumers newsgroup, I represent all readers by
stating that no one is interested in your conjecture. Seriously, you
don't need permission from anyone. Fire away.

I would like to know that at least a few people are interested
in the subject, lest I become one of those pitiable people
who post their poetry and such.

My wife and girlfriend are both interested. So, including me, that makes
"a few."

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From root@21:1/5 to RosemontCrest on Mon May 25 23:42:26 2020

RosemontCrest wrote:

On 5/25/2020 4:18 PM, root wrote:

RosemontCrest wrote:

As moderator of the misc.consumers newsgroup, I represent all readers by stating that no one is interested in your conjecture. Seriously, you
don't need permission from anyone. Fire away.

I would like to know that at least a few people are interested
in the subject, lest I become one of those pitiable people
who post their poetry and such.

My wife and girlfriend are both interested. So, including me, that makes
"a few."

OK, but let me know if you and yours lose interest.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From root@21:1/5 to RosemontCrest on Mon May 25 23:18:13 2020

RosemontCrest wrote:

As moderator of the misc.consumers newsgroup, I represent all readers by stating that no one is interested in your conjecture. Seriously, you
don't need permission from anyone. Fire away.

I would like to know that at least a few people are interested
in the subject, lest I become one of those pitiable people
who post their poetry and such.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From root@21:1/5 to root on Tue May 26 00:27:25 2020

root wrote:

OK, but let me know if you and yours lose interest.

Let me put one thing out there before going into details:
I conclude the NYT data are seriously flawed and should
not be used for policy decisions. Among other things
I assert that most of the recent new "cases" stem
from recounting old cases.

My approach to looking at the data begins with a mathematical
model in order to guide me. The model isn't very complicated.

Every localized contagion begins with an exponential growth.
The number of new cases is proportional to the number of
infected people. If we represent the number of infected
people as I(t), a function of time, the initial stage of
the contagion looks like:

dI/dt = R * I

whose solution is an exponential, I(t) = I(0) * exp(R*t).

As time progresses, however, it becomes harder for the
contagion to find new people to infect. So the original
equation has to be modified to this:

dI/dt = R * I * (N-I)

where N is the number of people that can be infected, and R
is another constant. I choose to rewrite that equation as:

dI/dt = R * I * (1-I/N)

where the last term can be seen as the probability that a
person selected at random will be uninfected, and the factor
of N has been absorbed into R.

Another term has to be added to the equation to account for
infected people coming into the area from elsewhere. I want
to refer to I as internally generated infections and E(t) as
infected people that enter the area. At least one such person
has to enter in order to begin the infection, since nothing
can happen when I starts at zero.

So now the equation becomes:

dI/dt = R * I * (1-I/N) + E(t)

Casting this equation into a form applicable to the NYT
data, where I corresponds to the cases so far, it becomes:

(daily new cases) = R * (cases so far) * (1 - (cases so far)/N) + (incoming cases)
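
A minimal Python sketch of the daily form of the equation, stepping one day at a time. The function name, the parameter names, and the numbers (25% daily growth, N = 10,000, one imported case on day 0) are hypothetical and chosen only for illustration; this is not the software used for the results in this thread.

```python
def simulate(rate, capacity, days, initial=0.0, incoming=None):
    """Step (daily new cases) = R*I*(1 - I/N) + E forward one day at a time."""
    if incoming is None:
        incoming = lambda day: 0.0  # no imported cases unless told otherwise
    cumulative = [initial]
    for day in range(days):
        cases = cumulative[-1]
        new_cases = rate * cases * (1.0 - cases / capacity) + incoming(day)
        cumulative.append(cases + new_cases)
    return cumulative


# Hypothetical run: a single imported case on day 0 starts the outbreak.
curve = simulate(rate=0.25, capacity=10_000, days=120,
                 incoming=lambda day: 1.0 if day == 0 else 0.0)
print(round(curve[10], 1), round(curve[60]), round(curve[-1]))

# With no imported case the count never leaves zero.
flat = simulate(rate=0.25, capacity=10_000, days=30)
```

The early part of `curve` grows by roughly 25% per day, and the growth flattens as the count approaches N.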

As far as the math goes the worst is over. Please hang in there
if you can.

Typically you can look at the external cases as if they result
from a process similar to radioactivity: Random clicks that
happen at some average rate but are not otherwise predictable.
The accumulated count from such a process grows linearly in
time and, at some point, will be swamped by the exponential
growth of the internal cases.
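
The comparison between steady random arrivals and exponential growth can be sketched in Python as follows. This is only an illustration under assumed numbers (2 imported cases per day on average, R = 0.2 per day); the function names are hypothetical, and the arrivals are generated as a Poisson process, which is the usual model for radioactive clicks.

```python
import math
import random


def external_arrivals(mean_per_day, days, seed=0):
    """Cumulative count of random arrivals at a constant average rate."""
    rng = random.Random(seed)
    total, series = 0, []
    for _ in range(days):
        # Count the arrivals in one day by adding exponential waiting
        # times until the day is used up (a Poisson process).
        elapsed = rng.expovariate(mean_per_day)
        while elapsed < 1.0:
            total += 1
            elapsed += rng.expovariate(mean_per_day)
        series.append(total)
    return series


def internal_growth(rate, days):
    """Pure exponential growth exp(R*t), sampled at the end of each day."""
    return [math.exp(rate * (day + 1)) for day in range(days)]


# Hypothetical numbers, chosen only to show the crossover.
external = external_arrivals(mean_per_day=2.0, days=60)
internal = internal_growth(rate=0.2, days=60)
crossover = next(day for day in range(60) if internal[day] > external[day])
print("internal growth overtakes the imported cases on day", crossover + 1)
```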

During the initial stages of the contagion, however, E(t)
cannot be ignored. Even so, for most of my work with the
NYT data I ignored E(t) altogether.

My actual model includes a time (about 5 days) over which
an infected person is not contagious, and another period
(about 28 days) after which an infected person is no
longer contagious. Neither of these two refinements is
important.
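
One possible way to add the two time windows to a day-by-day model is sketched below in Python. The details of the actual model are not given in this post, so everything here is an assumption: new infections on a given day are taken to be proportional to the number of people infected at least 5 days but fewer than 28 days earlier, and the names and the rate are hypothetical.

```python
def simulate_with_windows(rate, capacity, days, latent=5,
                          infectious_until=28, seed_cases=1.0):
    """Daily new infections when only part of the infected are contagious."""
    new_cases = [seed_cases]  # day 0 holds the imported seed case
    for day in range(1, days):
        total = sum(new_cases)
        # Contagious today: infected at least `latent` days ago and
        # fewer than `infectious_until` days ago.
        first = max(0, day - infectious_until + 1)
        last = day - latent
        contagious = sum(new_cases[first:last + 1]) if last >= 0 else 0.0
        new_cases.append(rate * contagious * (1.0 - total / capacity))
    return new_cases


# Hypothetical parameters, for illustration only.
cases = simulate_with_windows(rate=0.3, capacity=10_000, days=200)
print(round(sum(cases)))
```

In this sketch nothing happens for the first four days after the seed case, because nobody is contagious yet.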

My main interest in looking at the data was to find
a way to estimate how many people are actually infected
when some number are reported. I will go into my efforts
in that direction another time.

One last point for this chapter: since I can't show you
plots of the data I will have to describe what the data
show. If you have access to tools to manipulate data
and show results I hope you can follow along.

This is a sample of what I have to offer. Please let
me know if you still have any interest.

Thank you.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob F@21:1/5 to root on Mon May 25 20:15:07 2020

That's too much. You can stop now. We certainly don't want to see
another newsgroup destroyed by unrelated politics.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From RosemontCrest@21:1/5 to Bob F on Wed May 27 14:15:28 2020

Pay no mind to Bob F. It actively participates in off-topic, political discussions on alt.home.repair, thus destroying that newsgroup with
unrelated politics. Notice that it is the one who introduced politics to
this discussion about scientific data analysis.

I am interested to see more of your findings. Are you willing to share
your data?

Thank you.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From root@21:1/5 to RosemontCrest on Thu May 28 00:25:26 2020

RosemontCrest wrote:

Pay no mind to Bob F. It actively participates in off-topic, political discussions on alt.home.repair, thus destroying that newsgroup with
unrelated politics. Notice that it is the one who introduced politics to
this discussion about scientific data analysis.

I am interested to see more of your findings. Are you willing to share
your data?

The data I work with is publicly available at:

https://github.com/nytimes/covid-19-data/archive/master.zip

It includes data for the US as a whole, every state and
territory, and every county. Lots to chew on.
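
For readers who prefer a script to a spreadsheet, here is a minimal Python sketch of reading the county file and totalling cumulative cases per state and day. It assumes the county file in the archive carries the header `date,county,state,fips,cases,deaths`; the function name is hypothetical, and the four sample rows are California rows from late February used only as a stand-in for the real file.

```python
import csv
import io
from collections import defaultdict


def cumulative_cases_by_state(lines):
    """Sum the cumulative county counts into one series per state."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(lines):
        totals[row["state"]][row["date"]] += int(row["cases"])
    return {state: sorted(by_date.items()) for state, by_date in totals.items()}


# A tiny stand-in for the county file; pass an open file object instead
# to process the full data set.
sample = io.StringIO(
    "date,county,state,fips,cases,deaths\n"
    "2020-02-25,Humboldt,California,06023,1,0\n"
    "2020-02-25,San Francisco,California,06075,3,0\n"
    "2020-02-26,Humboldt,California,06023,1,0\n"
    "2020-02-26,Solano,California,06095,11,0\n"
)
series = cumulative_cases_by_state(sample)
print(series["California"])
```

The same function accepts any iterable of text lines, so `open(path, newline="")` on the unzipped county file works in place of `sample`.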

I don't want to post if there is no interest.

I will abandon the approach I started to take in favor
of a more important and easier thread to follow.

The above source reports a daily account of the number of
reported cases and the number of deaths. For some time I have
wondered how many people have actually been infected when some
lesser number is reported. Let's say that when X cases have
been reported there are really M*X people infected. Can I
squeeze M out of the reported data? I think I can.

Here is the basic plan:

1. For any given data set, compute the daily differences of the
   number of cases.
2. Divide these daily differences by the corresponding number
   of cases.
3. Compute the variance and SD of the daily differences.
   There is a trend to the daily differences, and that trend
   has to be removed or corrected before computing the SD.
4. Compute the expected variance and SD of the daily
   differences. I will show how the expected SD is computed
   below.
5. The ratio of the first SD to the second SD is my
   best estimate of M.
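
Steps 1 and 3 of the plan can be sketched in Python as follows. The post does not say how the trend is removed, so the straight-line least-squares fit used here is an assumption, and the function names and the sample numbers are hypothetical.

```python
import statistics


def daily_differences(cumulative):
    """Step 1: new cases per day from a cumulative series."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def detrend_linear(values):
    """Subtract a least-squares straight line (an assumed trend model)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to fit a line")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
             / sum((x - x_mean) ** 2 for x in range(n)))
    return [y - (y_mean + slope * (x - x_mean)) for x, y in enumerate(values)]


def observed_sd(cumulative):
    """Step 3: SD of the daily differences after removing the trend."""
    return statistics.pstdev(detrend_linear(daily_differences(cumulative)))


# Hypothetical cumulative counts, for illustration only.
print(observed_sd([100, 110, 125, 135, 152, 160, 178]))
```

A series whose daily differences lie exactly on a straight line gives an SD of zero, so whatever is left after detrending is the scatter the plan is trying to measure.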

I was motivated to consider this approach because the
SD of the daily differences was too large to be
explained.

In step 2 we divided the daily differences by the
number of cases to-date. What does this number
mean? Let C be the number of cases and deltaC be
the daily change. I assert that deltaC/C is the
probability that one of the C cases will infect
a new person in the next day. This is a binomial
probability (p) and, for a large value of C, we can
approximate the distribution of the number of new cases
by a normal distribution with SD = sqrt(p*(1-p)*C).
Typically this is a few hundred cases. In contrast,
the SD from step 3 is a few thousand cases, and
the ratio (M) is a number on the order of 10.
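
As a worked example of the expected SD and the ratio, here is a short Python sketch. The function names and the input numbers (one million cases to date, 20,000 new cases in a day, an observed SD of 1,400) are hypothetical and chosen to give round results; they are not taken from the NYT data.

```python
import math


def expected_sd(cases_to_date, new_cases):
    """Binomial SD sqrt(p*(1-p)*C) with p = deltaC / C."""
    p = new_cases / cases_to_date
    return math.sqrt(p * (1.0 - p) * cases_to_date)


def estimate_m(observed_sd_value, cases_to_date, new_cases):
    """Ratio of the observed SD to the expected binomial SD."""
    return observed_sd_value / expected_sd(cases_to_date, new_cases)


# p = 0.02, so the expected SD is sqrt(0.02 * 0.98 * 1,000,000) = 140 cases.
print(expected_sd(1_000_000, 20_000))
# An observed SD of 1,400 cases would then give M = 10.
print(estimate_m(1_400, 1_000_000, 20_000))
```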

I have computed the values for each of the states
and territories and the value for the US is 14.5 or
so. There is a discrepancy in that number which
I am investigating. Whatever the value of M,
the lethality of the SARS-CoV-2 virus as determined
by deaths/cases is reduced by the factor M. If
M were 14.5 and deaths/cases was 4% then the
revised lethality would be about 0.276%, which is less
than 3 times that of ordinary seasonal flu.
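
The arithmetic can be checked with a few lines of Python. The seasonal flu figure of 0.1% is an assumption implied by the comparisons in this thread rather than a number from the data, and the function name is hypothetical.

```python
def effective_lethality(deaths_per_case_percent, m):
    """Reported deaths/cases (in percent) divided by the undercount factor M."""
    return deaths_per_case_percent / m


revised = effective_lethality(4.0, 14.5)
flu_percent = 0.1  # assumed seasonal flu lethality, in percent
print(round(revised, 3))                # 0.276
print(round(revised / flu_percent, 2))  # 2.76
```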

The number M is vitally important.

There are still some things about the procedure that
bother me. I use my own software for all this, but
I have a friend who uses Excel to do the computations
at his end.

If you are familiar with Excel you can easily bring
up the data and have a look for yourself. Get back
here if you have any questions, and if you tire
of this let me know as well.

Thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From RosemontCrest@21:1/5 to root on Wed May 27 22:02:34 2020

Thank you for the link. I remain interested and hope that others express interest. Presenting more findings and discussion may garner more
interest. Please continue to pursue and share your endeavor.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From root@21:1/5 to RosemontCrest on Thu May 28 14:44:29 2020

RosemontCrest wrote:

Thank you for the link. I remain interested and hope that others express interest. Presenting more findings and discussion may garner more
interest. Please continue to pursue and share your endeavor.

Thank you for your interest.

Here are the results of my work so far:

The data columns are: state or territory, M = (actual cases)/(reported cases), effective lethality in percent, number of reported cases, and number of data points.

State M Lethality(%) Cases Points
Alabama 9.22984 0.412334 15650 74
Alaska 1.94678 1.00473 416 75
Arizona 7.68512 0.637109 16783 121
Arkansas 13.7653 0.142299 6180 76
California 9.23045 0.43338 99925 122
Colorado 6.42828 0.858365 24553 82
Connecticut 10.1035 0.903221 41303 79
Delaware 7.95252 0.465357 9066 76
DistrictofColumbia 5.05766 1.05321 8334 80
Florida 12.7563 0.34465 52247 86
Georgia 12.6857 0.34598 42066 85
Guam 7.72439 0.068437 1139 72
Hawaii 1.84357 1.45675 633 81
Idaho 3.56433 0.836378 2699 74
Illinois 21.224 0.208173 113486 123
Indiana 5.96245 1.03078 32856 81
Iowa 9.32092 0.280614 17999 79
Kansas 15.0902 0.145086 9352 80
Kentucky 12.638 0.357062 9175 81
Louisiana 19.0675 0.378296 38252 78
Maine 3.41531 1.11136 2109 75
Maryland 11.9939 0.404626 48290 82
Massachusetts 9.44957 0.727614 93693 115
Michigan 12.8618 0.744241 55040 77
Minnesota 5.48437 0.777822 21969 81
Mississippi 6.67879 0.706156 13731 76
Missouri 5.69143 0.999333 12437 80
Montana 3.51638 0.949924 479 74
Nebraska 9.30838 0.135461 12619 99
Nevada 5.35033 0.93381 8059 82
NewHampshire 4.00271 1.25849 4231 85
NewJersey 13.1519 0.549124 155764 83
NewMexico 4.55819 1.00166 7130 76
NewYork 10.298 0.769915 368669 86
NorthCarolina 10.9867 0.302079 24188 84
NorthDakota 3.4634 0.632874 2422 76
NorthernMarianaIslands 4.54384 2.00071 22 59
Ohio 6.95347 0.887367 33006 78
Oklahoma 5.10908 1.00832 6137 81
Oregon 3.91037 0.96379 3967 88
Pennsylvania 10.2271 0.702067 72873 81
PuertoRico 5.50629 0.723253 3324 74
RhodeIsland 7.26354 0.595135 14210 86
SouthCarolina 5.24323 0.821752 10416 81
SouthDakota 7.7962 0.140552 4653 77
Tennessee 7.61852 0.216634 20960 82
Texas 10.6637 0.256482 57541 104
Utah 3.30713 0.349506 8622 91
Vermont 2.58748 2.18303 967 80
VirginIslands 4.82072 1.80381 69 73
Virginia 16.514 0.195645 39342 80
Washington 10.8357 0.475859 21278 126
WestVirginia 6.0884 0.667746 1854 70
Wisconsin 8.9827 0.369586 15923 111
Wyoming 3.836 0.38478 850 76

US 19.6869 0.302295 1.6701e+06 125

The overall average for the states is M=8, meaning 8 actual cases
for every reported case, and a lethality of 0.7% which is about
seven times as lethal as seasonal flu. While I don't want to
minimize a 0.7% death rate, it is a lot better than a 4%-6% rate.

These numbers differ by a factor of sqrt(2) from my earlier summary
owing to a correction for trend removal in the difference data.

As I said above:

There are still some things about the procedure that
bother me.

What bothers me is what I see as systematic reporting vagaries and
errors. These two problems are most evident in the US data. I
encourage anyone interested to look at the US data. The daily
differences show a significant weekly pattern, which has been
reported in the Wall Street Journal. This, along with a linear
downward trend, must be rectified before the data can be used.
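
One common way to suppress a weekly reporting pattern is a 7-day moving average, sketched below in Python. The post does not say which correction was actually applied, so this is an assumed technique, and the function name and the sample numbers are hypothetical.

```python
def trailing_weekly_mean(daily, window=7):
    """Average each day with the six days before it."""
    return [sum(daily[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(daily))]


# Hypothetical daily counts with a weekend dip repeated for four weeks.
week = [10, 12, 11, 13, 12, 6, 5]
smoothed = trailing_weekly_mean(week * 4)
print(smoothed[:3])
```

Because every 7-day window contains each weekday exactly once, a purely weekly pattern averages out to a constant.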

As you can see, the US data stand out with a very high M
value of 19.7 or so. Using a (model dependent) analysis I
conclude that the US data suffer from an accumulation of
recounted cases.

Although this does not prove recounting, take a look at
two consecutive days in the data for California:

2020-02-25 Humboldt California 06023 1 0
2020-02-25 LosAngeles California 06037 1 0
2020-02-25 Orange California 06059 1 0
2020-02-25 Sacramento California 06067 1 0
2020-02-25 SanDiego California 06073 1 0
2020-02-25 SanFrancisco California 06075 3 0
2020-02-25 SantaClara California 06085 2 0
2020-02-25 Solano California 06095 1 0 <<<<<<

2020-02-26 Humboldt California 06023 1 0
2020-02-26 LosAngeles California 06037 1 0
2020-02-26 Marin California 06041 1 0
2020-02-26 Napa California 06055 1 0
2020-02-26 Orange California 06059 1 0
2020-02-26 Sacramento California 06067 3 0
2020-02-26 SanDiego California 06073 1 0
2020-02-26 SanFrancisco California 06075 3 0
2020-02-26 SantaClara California 06085 2 0
2020-02-26 Solano California 06095 11 0 <<<<<<
2020-02-26 Sonoma California 06097 1 0

The last two entries on each line are accumulated cases and deaths
for each county.

On the first day California has a total of 11 cases and Solano
County has one of those 11. On the following day Solano has
11 cases, Sacramento picked up 2, and Marin, Napa, and Sonoma
each appear with 1. Solano has a prison hospital and my guess
is that previously counted victims were transferred to Solano
and recounted. There should be a way to keep track of
who the victims are and not recount them if they move.
All those recount errors in all the states accumulate
in the US data.
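
The two California days listed above can be checked with a short Python sketch. The counts are copied from the rows shown; the variable names are only illustrative.

```python
# Cumulative cases per county, copied from the two days listed above.
day1 = {"Humboldt": 1, "LosAngeles": 1, "Orange": 1, "Sacramento": 1,
        "SanDiego": 1, "SanFrancisco": 3, "SantaClara": 2, "Solano": 1}
day2 = {"Humboldt": 1, "LosAngeles": 1, "Marin": 1, "Napa": 1, "Orange": 1,
        "Sacramento": 3, "SanDiego": 1, "SanFrancisco": 3, "SantaClara": 2,
        "Solano": 11, "Sonoma": 1}

# Day-over-day change per county; a county absent on day 1 counts as zero.
changes = {county: day2[county] - day1.get(county, 0) for county in day2}
increases = {county: delta for county, delta in changes.items() if delta > 0}

print(sum(day1.values()), sum(day2.values()))  # 11 26
print(increases)
```

Solano accounts for 10 of the 15 cases added to the state total between the two days.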

I don't have much more to say on the data. I'm willing
to answer any questions you have if you want to look
at the data yourself.

Thanks for reading.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
