On Thu, 27 Aug 2020 18:59:01 -0700 (PDT), Cosine <[email protected]> wrote:
Hi:
Suppose we have 3 new methods of medical screening and we want to know whether: 1) any of them perform better than the existing standard method, and 2) the order of their performances, i.e., the best, the 2nd, and the 3rd.
It has lately come to the attention of everyone following coronavirus
that COST and EASE of ADMINISTRATION are also relevant
to whether a screening method should be considered to
perform "better."
We test them by using the same set of samples and we use the following metrics for evaluating their performances: accuracy (AC), sensitivity (SE), and specificity (SP).
See the references that Bruce gives. "Accuracy" is a combination
of sensitivity and specificity, with a sliding scale across a range
of cutoffs. The graph of that scale is called the ROC curve.
What is recognized less often is that measures of reliability are
also conditioned by (the characteristics of) the particular
sample, starting with the variability of the range of scores
observed on the measures.
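To make the trade-off between sensitivity and specificity concrete, here is a minimal Python sketch. The scores, disease labels, and cutoffs are made-up illustration data, and the function name screening_metrics is hypothetical; it assumes that a score at or above the cutoff is called "positive."

```python
def screening_metrics(scores, diseased, cutoff):
    """Return (sensitivity, specificity, accuracy) for one cutoff.

    Assumption: a score >= cutoff is called a positive screen.
    """
    tp = fp = tn = fn = 0
    for score, has_disease in zip(scores, diseased):
        called_positive = score >= cutoff
        if called_positive and has_disease:
            tp += 1
        elif called_positive and not has_disease:
            fp += 1
        elif not called_positive and has_disease:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)          # true positives among the diseased
    specificity = tn / (tn + fp)          # true negatives among the healthy
    accuracy = (tp + tn) / len(scores)    # all correct calls
    return sensitivity, specificity, accuracy


# Hypothetical data: 10 samples, 5 diseased and 5 healthy.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
diseased = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

# Each cutoff gives one point on the ROC curve (SE against 1 - SP).
for cutoff in (3, 5, 7):
    se, sp, ac = screening_metrics(scores, diseased, cutoff)
    print(f"cutoff={cutoff}: SE={se:.2f} SP={sp:.2f} AC={ac:.2f}")
```

Moving the cutoff trades one kind of error for the other, and the accuracy figure also depends on how many diseased cases happen to be in the sample.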
Now we have many comparisons to do, and it seems that this would raise an issue of false positive.
One way to solve this issue is to divide the alpha value by the number of tests to impose more stringent criteria on each of the tests regarding the false positive. That is, instead of using the original alpha (e.g., 5%), we use the corrected one:
alpha1 = alpha0/N; where N is the number of tests.
One convention I like for comparing multiple groups is to perform
the overall, repeated-measures F-test first. If that test is not "significant", then no further testing is performed.
I forget what tests are used for comparing ROC curves, or whether
they follow the F-distribution or the chi-squared.
I suppose it might end up as a t, for two groups, if it is simply
the Area under the Curve. One test is said to be "stochastically
dominant" over the other if it gives better results across the whole
range (which need not be the case).
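As a rough illustration of the Area under the Curve, the sketch below computes it as the fraction of (diseased, healthy) pairs in which the diseased case gets the higher score, with ties counted as one half. This is the rank-based form of the AUC, not a significance test for comparing two curves. The two score lists are invented, and the function name auc is hypothetical.

```python
def auc(scores, diseased):
    """Area under the ROC curve via pairwise comparison.

    Fraction of (diseased, healthy) pairs where the diseased case
    scores higher; a tie counts as half a win.
    """
    pos = [s for s, d in zip(scores, diseased) if d]
    neg = [s for s, d in zip(scores, diseased) if not d]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores from two methods on the same 10 samples.
diseased = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
method_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
method_b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

print("AUC, method A:", auc(method_a, diseased))
print("AUC, method B:", auc(method_b, diseased))
```

A larger AUC for one method does not by itself mean that method is better at every cutoff; that stronger claim is the "dominance" condition.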
Then we conduct the student t-test to see if any of the tests would be statistically significant.
Paired t-testing, surely. That is usually contrasted with Student's
t-test, where the latter is the grouped (not paired) test.
Using paired t-tests is proper followup for group comparisons
after an overall repeated-measures F shows a difference.
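For what the paired test actually computes, here is a small sketch using only the Python standard library. The per-sample numbers are invented, and paired_t is a hypothetical helper name. It returns the t statistic and degrees of freedom only; getting a p-value requires the t distribution (for example, scipy.stats.ttest_rel does the whole calculation if SciPy is available).

```python
import math
import statistics


def paired_t(new, standard):
    """Paired t statistic for two measurements on the same samples.

    Works on the per-sample differences, so the pairing is used
    rather than treating the two columns as independent groups.
    """
    diffs = [a - b for a, b in zip(new, standard)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample SD (n - 1 denominator)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1                       # statistic, degrees of freedom


# Hypothetical per-sample results for one new method and the standard.
new_method = [5, 7, 6, 8, 9]
standard = [4, 6, 6, 6, 8]

t_stat, df = paired_t(new_method, standard)
print(f"t = {t_stat:.3f} on {df} df")
```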
How many t-tests are you proposing? You have 3 tests if
you are comparing each of the 3 new methods only to the
standard method.
What decision do you want to make? If you are only
interested in finding something superior to Standard,
you could prescribe a 1-tailed test... though many people
do NOT like one-tailed testing, as a matter of principle or
of superstition.
But now we have some questions:
1) what is the value of N?
You have said that N is the number of tests. As I said, if
you are only /interested in/ the comparisons to Standard,
you have 3 tests. How serious were you about ranking them?
- You are unlikely to get results that show one test much
better than Standard, and another test even better than THAT.
So people will be apt to show (and expect to see) the ranking
without any stringent tests between the others.
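The arithmetic for the two choices of N can be sketched as follows, assuming alpha0 = 0.05 as in the question. The helper name bonferroni_alpha is hypothetical; the "all pairs" count assumes that the three new methods and the standard are each compared with every other.

```python
from math import comb


def bonferroni_alpha(alpha0, n_tests):
    """Per-test alpha under the correction alpha1 = alpha0 / N."""
    return alpha0 / n_tests


alpha0 = 0.05
n_vs_standard = 3           # each new method against the standard only
n_all_pairs = comb(4, 2)    # every pair among 3 new methods + standard

for label, n_tests in (("vs. Standard only", n_vs_standard),
                       ("all pairwise", n_all_pairs)):
    alpha1 = bonferroni_alpha(alpha0, n_tests)
    print(f"{label}: N = {n_tests}, alpha1 = {alpha1:.4f}")
```

Insisting on a strict ranking doubles the number of tests here, which is part of why the ranking is usually just reported descriptively.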
2) by reducing the alpha from alpha0 to alpha1, we have made it more difficult for each of the tests to be significant; wouldn't this increase the rate of false negatives? If so, how do we resolve this issue?
If you want to be strict in preserving the error level, you plan
your experiment in advance with large enough N (here, the sample
size, not the number of tests) that any difference large enough to
be interesting will yield a significant test result,
for whichever test results you intend to present.
Larger N is how you "resolve the issue" of insufficient power.
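A rough planning calculation along these lines, using a normal approximation for a two-sided paired comparison, might look like the sketch below. The effect size (delta), the SD of the paired differences, and the 80% power target are all assumed values for illustration, and pairs_needed is a hypothetical helper; a real design would use the t distribution and figures from pilot data.

```python
from math import ceil
from statistics import NormalDist


def pairs_needed(delta, sd_diff, alpha, power=0.80):
    """Approximate number of paired samples for a two-sided test.

    Normal approximation:
        n = ((z_{1-alpha/2} + z_{power}) * sd_diff / delta) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sd_diff / delta) ** 2)


# Assumed planning values: smallest interesting difference is half an SD.
delta, sd_diff, alpha0, n_tests = 0.5, 1.0, 0.05, 3

n_plain = pairs_needed(delta, sd_diff, alpha0)
n_corrected = pairs_needed(delta, sd_diff, alpha0 / n_tests)
print("samples needed at alpha0:        ", n_plain)
print("samples needed at alpha0/N (N=3):", n_corrected)
```

The corrected alpha asks for more samples to keep the same power, which is the cost of controlling the false-positive rate across several tests.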
--
Rich Ulrich