
If I have multiple related measures, each with a "low" p-value that does not quite meet the threshold of significance, could the fact that these measures show a consistent trend (cautiously) justify rejecting the null hypothesis?

For example, I am comparing 100 students' performance on multiple exams. Of these students, 20 are theorized to be at high risk of failure due to external factors. I analyze their performance on 10 exams and note that they appear to have worse average scores than their counterparts. However, the comparison tests (Mann-Whitney U tests) return p-values between .03 and .20, depending on the exam. In aggregate, would it be fair to (cautiously) conclude that the null hypothesis can be rejected? The general trend seems to show these students performing worse than their counterparts, even though the majority of the comparison tests do not reach significance.

I have heard some researchers believe this to be acceptable. However, I have yet to find any literature on this topic.

  • Sidenote: isn't it normal for a population to contain students who perform worse than their counterparts, so that, given enough measurements, the difference can eventually be measured as significant? Is comparison with the other students a good way to identify high risk of failure? Commented Jul 12 at 8:44
  • This question is about combining p-values, which has been covered in several ways in the literature and on this website, but you have the raw data, not just the p-values. You should be able to tackle this without combining p-values; for example, you can consider the average performance and the error in the estimate of that average. Commented Jul 12 at 8:46

1 Answer


Interesting question, let me break it down into a few distinct problems:

## Do multiple borderline significant results indicate overall significance?

The answer to this one is simple. No.

If anything, multiple tests inflate the overall chance of a false positive, so you should be extra wary of 'borderline significant' results, especially with the range you mention: a $p$-value of $0.20$ is a completely unsurprising result under the null hypothesis.
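To make that concrete (this back-of-the-envelope calculation is mine, not part of the original answer): if the 10 per-exam tests were independent, which they are not here since the same students take every exam, so treat this as a rough upper bound, the chance of at least one false positive at $\alpha = 0.05$ would be

$$1 - (1 - 0.05)^{10} \approx 0.40.$$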

The results from multiple related analyses, sharing the same outcome, can be combined (through a meta-analytic model) and could result in an overall lower $p$-value. However, there is no need to combine $p$-values if you have the original data of each analysis (see the third problem).
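Purely for illustration, here is a minimal sketch of what combining $p$-values would look like with Fisher's method via SciPy; the $p$-values below are hypothetical, and the method assumes the tests are independent, which does not hold when the same students appear in every comparison:

```python
# Hypothetical per-exam p-values; Fisher's method assumes independent tests,
# which does NOT hold when the same students appear in every comparison.
from scipy.stats import combine_pvalues

p_values = [0.03, 0.07, 0.12, 0.20]
stat, p_combined = combine_pvalues(p_values, method="fisher")
print(p_combined)
```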

## Are multiple, related trends in the same direction more convincing?

If we stop caring about null-hypothesis significance testing for a moment, the question (and answer) is quite different, especially since in the example you saw a negative trend on 10 different exams. Yes, of course repeatedly seeing the same trend makes us more convinced that there might be an overall tendency.

So how can we reconcile the apparent contradiction? I think this ties into the third problem:

## Is there an alternative to running separate analyses?

Let's take the example from the question again: if we change the response variable from "performance on exam $x$" to simply "performance on an exam," then we can turn this into a multiple regression problem where the exam $x$ is just one of the explanatory variables.

You need to account for the fact that these are not independent measurements (the same student now appears multiple times in the data set, once for each exam), but there are plenty of ways to do that, such as RM-ANOVA or (more flexibly) mixed models and GEEs; a minimal sketch of the mixed-model version is given below.
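As a rough illustration (not from the original answer), here is a minimal mixed-model sketch in Python with statsmodels. The data are simulated, and all column names (`student`, `exam`, `high_risk`, `score`) and the assumed 5-point group difference are made up for the example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_students, n_exams = 100, 10

# Long format: one row per (student, exam) pair.
students = np.repeat(np.arange(n_students), n_exams)
exams = np.tile(np.arange(n_exams), n_students)
high_risk = (students < 20).astype(int)            # first 20 students flagged high-risk
score = (70 - 5 * high_risk                        # assumed true group difference of 5 points
         + rng.normal(0, 3, n_students)[students]  # per-student random intercept
         + rng.normal(0, 8, n_students * n_exams)) # exam-level noise

df = pd.DataFrame({"student": students, "exam": exams,
                   "high_risk": high_risk, "score": score})

# Random intercept per student accounts for repeated measures; exam enters as a
# fixed effect so between-exam differences are not attributed to the risk group.
model = smf.mixedlm("score ~ C(exam) + high_risk", df, groups=df["student"])
result = model.fit()
print(result.summary())
```

The coefficient on `high_risk` then summarizes the overall group difference across all 10 exams in a single test, which is what combining everything into one model buys you.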

There are many advantages to combining multiple models into one. This has been discussed several times on this site, for example: 1, 2, 3, 4. The usual answer is that combining is better: you end up with a more powerful comparison that is more likely to detect such an 'overall' trend than multiple separate tests, each with its own false positive rate that you would also have to take into account.

(If you need help with this approach, create a new question where you explain the type of outcome you have.)

  • The first answer should be 'it depends'. Of course one can combine p-values from multiple independent experiments, and the resulting p-value may be smaller than the individual ones. The link that you refer to is about multiple comparisons, not about combining p-values. Commented Jul 12 at 8:07
  • @SextusEmpiricus, good point, I updated the answer. Commented Jul 12 at 8:16
  • How did you get the nice formatting and big font? – Peter Flom Commented Jul 12 at 9:26
  • @PeterFlom It's markdown, single hashtag for chapter, double for paragraph, etc. You can see how to type it by editing the answer! Commented Jul 12 at 9:38
  • I disagree that multiple testing inflates the FPR. With alpha = 0.05, you expect actual negatives to be called false positives 5% of the time, regardless of how many tests you run. Multiple hypothesis testing increases the number of false positives, the probability of making any false positive call (the FWER), and the proportion of incorrect positive calls (the FDR), but not the probability of making a Type I error in any individual test (the FPR, controlled by alpha). Commented Jul 12 at 13:46
