*P*-value, which is obtained from performing a statistical test. Such a streamlined practice provides a clear-cut yes-or-no decision on whether the results can be published, and is considered standard procedure.

*Plant Pathology Journal*, I came across an appreciable number of manuscripts that could be improved in terms of study design, statistical analysis, and visualization of experimental data and results. Here I first provide a survey of papers published in the *Plant Pathology Journal*, examining how basic and fundamental statistical concepts and techniques were used, and how results were visualized, explained, and interpreted. Based on this survey, I then propose conceptual and practical guidelines that can be followed in preparing a manuscript for submission to the journal. I also provide resources and further reading wherever appropriate for those who want to delve into individual topics.

### Survey of Published Papers in the *Plant Pathology Journal*

*P*-values from statistical tests, (3) likely use of pseudo-replication, and (4) lack of clarity in explaining the experimental set-up, particularly how replicates were made and how statistical tests were performed (Fig. 1).

*P*-values as ranges (for example, *P* < 0.05 or *P* < 0.01). Although such notation is common practice, I suggest that specifying the exact *P*-value is a much better practice. The survey also revealed that some studies appear to have run their statistical tests on pseudo-replicated data sets (Fig. 1C). Pseudo-replication occurs when replicates (or observations) that are not independent of each other are counted as independent and all included in statistical testing, incorrectly inflating the sample size. However, it was difficult to ascertain the use of pseudo-replication with great certainty because a clear description of the experimental set-up was lacking in more than half of the papers examined (Fig. 1D). Many papers stated that three or more replicates were used, but it was not clear whether a distinction between biological and technical replicates was made in choosing the data sets for the statistical tests.

### Standard Deviation vs. Standard Error

*single* sample. It may be used to show the accuracy of your measurement or experiment, as a low SD indicates close clustering of your data around the sample mean. In contrast, SE describes variability across *multiple samples* of a population. It therefore tells us how accurately our sample reflects the whole population. Since we usually have only a single sample, however, SE must be estimated from the single sample in our hands. SE decreases as sample size (i.e., the number of values in the sample) increases, indicating that the larger your sample is, the more precise your estimate of the population is. Unlike SD, SE is useful in hypothesis testing, since it helps judge how representative your sample is when drawing any conclusion about the actual population you are interested in.
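The two quantities are easy to confuse in practice, so here is a minimal sketch in Python (the measurement values are made up for illustration; this is not the code from the Supplementary Material):

```python
import numpy as np

# A single hypothetical sample of measurements (e.g., lesion diameters, mm)
sample = np.array([4.8, 5.1, 5.3, 4.9, 5.2, 5.0])
n = len(sample)

# Standard deviation: spread of the values within this one sample
sd = sample.std(ddof=1)  # ddof=1 gives the unbiased sample SD

# Standard error: estimated variability of the sample MEAN across
# hypothetical repeated samples; it shrinks as the sample size grows
se = sd / np.sqrt(n)

print(f"mean = {sample.mean():.2f}, SD = {sd:.3f}, SE = {se:.3f}")
```

Note that SE is always smaller than SD for n > 1, which is one reason error bars based on SE look deceptively tight; the figure legend should always state which of the two is plotted.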

### Pseudo-replication

*n* = 3) and varying standard deviations (Fig. 2A).

*t*-test) and reporting the *P*-value. First, the researcher runs a *t*-test for each biological replicate (*n* = 3), comparing the control and treatment groups, and finds that all three *P*-values are below 0.05. This seems to suggest to the researcher that there is a statistically significant difference between the control and treatment groups, and that the finding is reproducible. Accordingly, the researcher decides to report the means, standard errors, and *P*-value obtained from the first biological replicate only (panel 1 in Fig. 2A and B). Alternatively, the researcher may decide to aggregate all the data points across the three biological replicates (*n* = 9) and run a *t*-test on them. This would usually result in much smaller *P*-values (more significant!) despite a smaller discrepancy between the control and treatment means than in the first scenario (compare Fig. 2B and C). Lastly, the researcher may take the individual means of the biological replicates (*n* = 3) and use them to run a *t*-test. This would lead to a larger *P*-value than the one obtained in the second scenario due to the smaller sample size. The question now is: which would be the correct practice?
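The three analysis choices can be sketched as follows in Python. The simulation parameters (control mean 3, treatment mean 5, SD 1, three technical measurements per biological replicate) are illustrative assumptions, not the exact values from the figure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Three biological replicates, each with three technical measurements
control = [rng.normal(3, 1, size=3) for _ in range(3)]
treatment = [rng.normal(5, 1, size=3) for _ in range(3)]

# Scenario 1: a separate t-test per biological replicate (n = 3 each)
p_each = [stats.ttest_ind(c, t).pvalue for c, t in zip(control, treatment)]

# Scenario 2: pool all technical measurements (pseudo-replication, n = 9)
p_pooled = stats.ttest_ind(np.concatenate(control),
                           np.concatenate(treatment)).pvalue

# Scenario 3: t-test on the means of the biological replicates (n = 3)
p_means = stats.ttest_ind([c.mean() for c in control],
                          [t.mean() for t in treatment]).pvalue

print(p_each, p_pooled, p_means)
```

Pooling in scenario 2 treats technical measurements as if they were independent, inflating n and typically deflating the *P*-value, whereas scenario 3 takes the biological replicate as the unit of independence.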

### *P*-value

The null hypothesis of the *t*-test is that both groups have identical means. The *t*-test then calculates the probability of seeing the observed data, assuming the null hypothesis is true. This probability is the *P*-value. If the *P*-value is below a certain threshold, the null hypothesis is rejected and the observed difference in means is declared 'significant', which is often denoted by an asterisk(s). This is the way most such null hypothesis significance testing (NHST) is performed, regardless of the type of statistical test employed. The *P*-value is certainly a useful way to summarize study results and to provide a basis for a dichotomous decision. It should be noted, however, that the *P*-value is not a measure of how right your hypothesis is, or how significant the difference is. Rather, it is a measure of how unlikely the observed difference would be if there were no actual difference between the groups. So the *P*-value should not be considered a measure of the size of the effect. There has been fierce debate about the use of the *P*-value and its influence on science (Goodman, 1999, 2001). Despite its much-debated problems, the use of the *P*-value is so widespread and prevalent that it is almost impossible to publish without it.

*P*-value becomes problematic. One problem with the *P*-value is the arbitrary nature of its cutoff value. For demonstration purposes, I again generated random datasets with values drawn from normal distributions with means of 3 and 5, respectively (Supplementary Material 1). When the standard deviation is 0.5 (Fig. 3A), the *t*-test gives a *P*-value of 0.026, which is below the commonly used cutoff of 0.05, supporting the conclusion that the means of the control and treatment groups come from different populations (note again that you would obtain a slightly different *P*-value whenever you run the code). As I increase the standard deviation to 1 and 2 (Fig. 3B and C, respectively), the *P*-values from the *t*-test increase up to 0.05. This clearly shows how use of an arbitrary cutoff in determining statistical significance can be misleading. Decisions can be even more complicated when, for example, marginal *P*-values such as *P* = 0.048 and *P* = 0.052 are obtained. Is a *P*-value of 0.048 significant, while a *P*-value of 0.052 is not?
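The dependence of the *P*-value on within-group variability can be reproduced with a few lines of Python (a sketch mirroring the setup described above, not the code from the Supplementary Material; exact *P*-values will differ on every run because the draws are random):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Control centered at 3, treatment at 5 (n = 3 each, as in the text);
# the P-value depends strongly on the within-group SD
ps = {}
for sd in (0.5, 1.0, 2.0):
    a = rng.normal(3, sd, size=3)
    b = rng.normal(5, sd, size=3)
    ps[sd] = stats.ttest_ind(a, b).pvalue
    print(f"SD = {sd}: P = {ps[sd]:.3f}")
```

As the SD grows while the true means stay fixed at 3 and 5, the *P*-value tends to drift across the 0.05 line, even though nothing about the underlying effect has changed.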

*P*-values are also subject to the experimental design and the nature of the experiment. To illustrate this point, I ran additional simulations (Fig. 3D-G). Fig. 3D shows that running a *t*-test on two groups (control and treatment) of numbers randomly drawn from the identical normal distribution (mean = 3 and SD = 0.5) can result in declaring a significant difference (cutoff of 0.05) between the two groups just by chance in approximately 5% of cases, regardless of sample size. In contrast, running a *t*-test on two groups of numbers drawn from different normal distributions (one with mean = 3 and SD = 0.5, and the other with mean = 5 and SD = 0.5) shows that the *P*-value is larger than 0.05 in a considerable number of tests when the sample size is 3 (left panel). Increasing the sample size in this case ensures that all the tests detect the difference between the two groups (right panel). However, when variability (the standard deviation of the normal distribution) within samples increases (SD = 1 and 1.5 for Fig. 3F and G, respectively), the *t*-test often fails to detect the difference between the two groups, although this is mitigated by a larger sample size (right panels of Fig. 3F and G). These results show that *P*-values should be interpreted with great care in the context of the experimental design (e.g., sample size) and the nature of the experiment (e.g., large variability inherent to the type of experiment). A statistically insignificant difference, therefore, does not mean there is no difference at all.
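The roughly 5% by-chance "significance" rate under identical distributions (the scenario of Fig. 3D) is straightforward to verify by simulation; a minimal Python sketch under the stated parameters (mean 3, SD 0.5, n = 3 per group):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests = 2000

# Both groups drawn from the SAME distribution (mean 3, SD 0.5):
# with a 0.05 cutoff, about 5% of tests come out "significant" by chance
false_pos = 0
for _ in range(n_tests):
    a = rng.normal(3, 0.5, size=3)
    b = rng.normal(3, 0.5, size=3)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_pos += 1

rate = false_pos / n_tests
print(f"false-positive rate ~ {rate:.3f}")
```

This is simply the type I error rate working as designed: the 0.05 cutoff guarantees that one in twenty tests of a true null will be declared significant.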

*P*-values are not measures of effect size, so similar *P*-values do not always mean similar effects. Suppose that I see two groups that are different, and the associated *P*-value supports this conclusion (*t*-test). How meaningful, then, is a *P*-value as low as 1 × 10^−50 in this case? It would be much easier to understand what this question really implies if I rephrase it as follows: is it more significant than 1 × 10^−10, or conversely, is it less significant than 1 × 10^−100? As mentioned above, the *P*-value is not an indication of effect size but just a measure of how unlikely your data are under the null hypothesis. This suggests that our propensity to look for a difference in significance should be replaced by a check for the significance of the difference. I recommend that readers take a look at some of the efforts toward this shift from a dichotomy based on the *P*-value to more quantitative and Bayesian reasoning by visiting https://www.estimationstats.com/#/ (Ho et al., 2019) and https://www.sumsar.net/best_online/ (Kruschke, 2013).
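The decoupling of the *P*-value from effect size is easy to demonstrate: hold the effect fixed and let the sample size grow. In the Python sketch below (parameters are my own illustrative choices, not from the article), Cohen's d is used as a standard effect-size measure alongside the *P*-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# The same modest effect (mean 3 vs 3.5, SD 1) yields ever-smaller
# P-values as n grows, while the effect size itself stays put
results = {}
for n in (10, 100, 1000):
    a = rng.normal(3.0, 1, size=n)
    b = rng.normal(3.5, 1, size=n)
    p = stats.ttest_ind(a, b).pvalue
    # Cohen's d: mean difference scaled by the pooled SD
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    results[n] = (p, d)
    print(f"n = {n}: P = {p:.2e}, Cohen's d = {d:.2f}")
```

A *P*-value of 1 × 10^−50 may therefore reflect a trivial effect measured on an enormous sample, which is exactly why reporting an effect size (with a confidence or credible interval) alongside the *P*-value is more informative than the *P*-value alone.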

### Concluding Remarks

*Plant Pathology Journal*. In particular, I strongly recommend the following: avoid pseudo-replication, use as many biological replicates (not technical replicates) as possible, and present *P*-values candidly and interpret them carefully. I believe that being conscious of these issues and trying to avoid such mistakes and errors is an important first step toward improving the reproducibility and quality of the work published in the journal. Such efforts should be made both by the authors of a manuscript and by the reviewers who evaluate it. To that end, I provide a list of recommended readings to help those who are eager to learn more (Altman and Bland, 2005; Diaba-Nuhoho and Amponsah-Offeh, 2021; Huber, 2019; Kass et al., 2016; Lazic, 2019; Madden et al., 2015; Nuzzo, 2014).