Peer Review & Its Consequences
Essay: "The Detrimental Effects of Peer Review’s Flaws on Scientific Literature"
no drug rant in this one :(
Outline: historical background and definition of peer review, an overview of the publication process and how peer review fits into it, analysis of its effectiveness and flaws, real-world examples of its consequences drawn from cited scientific studies, and a proposed solution, namely switching back to an open review standard.
I had to submit this essay to my writing proficiency portfolio in undergrad. In summary, I aim to provide a cursory introduction to peer review, focusing on the flaws of the process and why they must be thoroughly scrutinized and revised, with an open review standard posed as a viable remedy. Hopefully this provides some insight into the state of the scientific world (abysmal). You can read a snippet of my essay on this wiki page under “Examples”: Data Dredging
The most convincing evidence in this essay against the current peer review process concerns the adverse effect of blinded review on the publication process: the main argument is that blinding incentivizes dishonesty and makes the process more tedious while remaining ineffective at reducing bias, whereas an open review standard incentivizes higher-quality review and eases the process.
Here’s a question I’d like to address first:
Q: Lots of good points, but one question: what does p-hacking have to do with peer review? As you said, it’s incentivized by the arbitrary requirement of a 0.05 significance threshold. People could set a higher or lower one and explain their reasoning, and the peers could either reject or accept it. The 0.05 value is a scientific convention but could easily be changed, which they have done in some areas of particle and quantum physics, I think. I don’t see the connection.
A: Imposing the necessity of the null ritual encourages peers to falsify data to increase the chances of publication. The review process doesn’t catch this most of the time. Not only that, but it’s also limiting when it comes to hypothesis testing. Using the null ritual makes no sense without predetermined alternative hypotheses, and more emphasis should be placed on avoiding overfitting than on strict conformity to the null, but this isn’t incentivized. More can be found in “Mindless Statistics.”
Without further ado, here’s the essay!
The Detrimental Effects of Peer Review’s Flaws on Scientific Literature
Peer review has become essential for researchers in the scientific field: it is a required step in the research publication process, and it determines whether a piece of scientific literature is treated as legitimate. However, it is flawed enough to affect the integrity of the research process itself; therefore, readers should exercise criticism toward all scientific literature rather than granting acclaim to a text based primarily on the reputability of the journal that published it or the number of citations it contains. Peer review is deeply flawed, has had a major negative impact on the health of the general population, and should be the subject of deep scrutiny and revision.
Peer review has not always been part of scientific publishing, yet it is treated as a source of great authority despite being a relatively modern, flawed development. Peer review is defined as the process by which researchers subject their scholarly work to scrutiny and evaluation by other experts in the same field (Kelly, 227). Proponents of the current methodology, such as Ray Spier, use its supposedly extensive history as support for its effectiveness. However, an analysis of his sources, all secondary, indicates that its purported origins in ancient Greece are speculative and that the documented practices do not conform to the definition of peer review; the current scholarly consensus places its origins much more recently, in the 17th century. Peer review dates to 1665, when scientists of the Royal Society of London began publishing the journal Philosophical Transactions, which employed a procedure close to peer review (Spier, 357). Before this, it was common for researchers to consult colleagues informally for review (Spier, 357), much like open review. The Society later came to function as an editorial gatekeeper for the publication of scientific knowledge, and much as its intended use of peer review was to selectively choose studies that conformed to its exclusive publication standards, modern peer review still functions this way.
Peer review is the primary determinant of whether a paper is published, but before submitting a manuscript for review, a researcher must first understand the research process. One follows the scientific method by posing a question, reaching out to a supervisor for expert guidance, performing experimentation, and finally submitting a manuscript to a journal for publication. A successful manuscript is vetted by editorial staff for quality before being sent out for peer review and, after further scrutiny, published. Editorial approval depends not only on the credibility of a manuscript’s sources but also on how well its subject matter fits the journal’s standards of publication, and very few manuscripts make it past this step (Kelly, 230). Peer review is also mediated by the editorial staff, with reviewers providing editors with recommendations regarding the quality of the manuscript (Kelly, 230).
To determine what should be published, editors rely on the input of experts in the field, called peers, and this communication is constrained by measures, such as credibility checks and conflict-of-interest declarations, that do not properly address concerns about bias. These measures theoretically reduce bias, but they are not effective in practice: human psychology works against objective decision-making, and the system’s flaws are routinely exploited to increase the likelihood of publication. The peer review stage has become a mandatory step in publication because, in theory, it increases trust in science by encouraging high-quality research (research with valid, original data that has significance) and by maintaining the integrity and authenticity of research (Kelly, 229). However, critics argue that this instead encourages editor bias by shifting the focus primarily to the “significance” of a manuscript: how well it conforms to the journal’s agenda rather than its impact on the scientific field. Researchers weigh this conformity heavily when structuring a manuscript, but it is subjective, relying on an editor’s personal preference, which invites bias. To combat this, reviews are often “blinded,” meaning anonymity is introduced to prevent one or more parties from identifying another during the review stage. However, this anonymity is detrimental to the quality of review.
While the scientific community acknowledges the advantages of double-blinded peer review over single-blinded review, both forms of blinding carry disadvantages that open review avoids. In a single-blinded review, peers know the identity of the researcher but not vice versa; in a double-blinded review, neither party is identifiable; in open review, both peers and researchers know each other’s identities. Foxe and Bolam, prominent members of an open-review-based journal, make cogent points in advocating for a more open standard of review. Editors worry, for example, that removing anonymity suppresses frank discussion and increases bias; however, while anonymity gives reviewers room for honesty, it comes at the expense of review quality, since reviewers have no prospect of public recognition after publication (Foxe and Bolam, 1125), which fosters unaccountability. It is also impossible for the researchers behind a manuscript to be completely unidentifiable; often a researcher’s identity can be determined simply by looking at the reference list, because all experimentation builds on past experimentation (Foxe and Bolam, 1126). Because of this lack of transparency, much of the peer-editor discourse is reduced to social dynamics and petty squabbles over a few trivial lines of text, which delays the publication cycle, and these delays in turn harm the timely dissemination of up-to-date information (Foxe and Bolam, 1125). All these factors combined make double-blinded review just as ineffective as single-blinded review, only more tedious. A study by Godlee et al. found that blinding had no effect on the rate at which reviewers detected experimental errors (237), a significant result suggesting that blinding measures are unlikely to improve the quality of review. An obvious solution is to do away with blinded review altogether and adopt open review as the standard, which could increase the number of high-quality reviews and make biases publicly addressable, thereby decreasing delays caused by trivial back-and-forth. Frontiers and BioMed Central are just two examples of successful journals that operate on open review principles (Foxe and Bolam, 1126).
Not only is the current peer review system plagued by bias, but by imposing the necessity of the null ritual it also encourages peers to falsify data, violating the integrity of research. Null hypothesis significance testing, also called the null ritual, is used to verify the data integrity of a manuscript by ruling out chance. Researchers first choose a null hypothesis, the hypothesis that any observed difference in the data is due to chance alone, and attempt to disprove it. The null hypothesis can be understood as a commonly accepted baseline claim, such as “The Earth is Round (is true),” which experimentation and significance testing aim to disprove. Although a null hypothesis can be true or false, it is impossible to prove, so researchers attempt to disprove it using the p-value, which statistician Gerd Gigerenzer defines as the probability of obtaining the observed data (or more extreme data), assuming the null hypothesis is true (595). An arbitrary, standardized threshold of 5% (p = 0.05) is used to determine whether the results are significant, meaning that any p-value under 5% leads researchers to reject, or “disprove,” the null hypothesis. However, because studies with significant results are published at a higher rate than those with non-significant results, researchers are often incentivized to engage in “p-hacking”: collecting or selectively analyzing data until the p-value falls below 5% and non-significant results become significant (false positives), then stopping data collection (Head et al., 1). Such a false positive is a Type I error, in which the researcher incorrectly rejects a true null hypothesis, and it occurs less often with a larger sample size (Banerjee et al., 129). Decreasing the sample size is therefore one tactic used to increase false positives in p-hacking, an observation echoed by Ioannidis; likewise, the greater the number of variables selected for testing, the higher the tendency for a research study to be faulty (Ioannidis, 0698). A common p-hacking tactic is to measure many variables and cherry-pick which to report (Head et al., 1), even if further collection would demonstrate non-significance.
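To make the “stop collecting once p dips below 0.05” tactic concrete, here is a minimal simulation of my own (not taken from Head et al.; the starting sample size, batch size, cap, and trial count are arbitrary illustrative choices). Even though the simulated data contain no real effect, re-testing after every batch and stopping at the first “significant” result produces false positives far more often than the nominal 5%.

```python
# Minimal sketch of optional-stopping p-hacking under a true null effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_trial(start_n=20, batch=10, max_n=200, alpha=0.05):
    """Return True if a 'significant' result is reached even though H0 is true."""
    data = rng.normal(0, 1, start_n)           # the true effect is exactly zero
    while len(data) <= max_n:
        p = stats.ttest_1samp(data, 0).pvalue  # test the sample mean against 0
        if p < alpha:
            return True                        # stop collecting and "publish"
        data = np.append(data, rng.normal(0, 1, batch))
    return False

trials = 2000
hits = sum(optional_stopping_trial() for _ in range(trials))
print(f"false positive rate with optional stopping: {hits / trials:.1%}")
```

With a fixed sample size the rate would sit near 5%; peeking and stopping early is what inflates it.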
P-hacking plays an essential role in the non-reproducibility of published studies, and the prevalence of non-reproducibility implies that many published studies are inaccurate or fraudulent, with the peer review process providing the incentive. Science relies on reproducibility: the ability of other researchers to replicate a study and obtain the same results as the original. If a study cannot be reproduced, it is likely non-significant or fraudulent. According to data from the Reproducibility Project, only 36% of the published studies analyzed produced significant results when replicated (Open Science Collaboration, 943). The data show that most published studies are flawed in some way, and some understanding of statistics and p-hacking tactics is required to see why. The p-curve, the distribution of significant p-values across a set of studies, helps distinguish replicable from non-replicable findings (Simonsohn et al., 1); p-hacking tactics such as adding unnecessary variables distort this curve, pushing p-values just below the significance threshold (Head et al., 3). This creates a façade of replicability when in fact there is none. Another impactful way to flatten the p-curve is to control for gender. An analysis by Simonsohn et al. of a study by Bruns and Ioannidis (2016) demonstrates this: when Bruns and Ioannidis dropped the gender control, the reported t-value fell from t = 9.29 to t = 0.88, showing a non-causal effect where a causal one had previously been recorded (Simonsohn et al., 3). This matters because higher t-values (such as t > 2.8) correspond to lower p-values, so by controlling for gender one can artificially inflate the t-value and thereby artificially deflate the p-value. Failing to report all dependent variables is another tactic for decreasing the p-value, and it too is prevalent: according to a survey by John et al., over 50% of researchers who engaged in questionable practices admitted to not reporting all dependent variables and to stopping data collection once significance was reached (Head et al., 11). Combined, these tactics can increase the chance of a false positive result from 5% to over 60% (Replicability-Index). While researchers are incentivized to engage in these practices, peers are incentivized as well when attempting to replicate these studies, resulting in a more inaccurate consensus among peer reviewers and an increased likelihood that fraudulent data is published.
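For intuition about the t-to-p relationship mentioned above, here is a small sketch of my own; the degrees of freedom are an arbitrary assumption (the Bruns and Ioannidis models would have their own), chosen only to make the numbers concrete.

```python
# Converting t-statistics into two-sided p-values (df=100 is an illustrative choice).
from scipy import stats

def two_sided_p(t, df=100):
    """Two-sided p-value for a t-statistic with the given degrees of freedom."""
    return 2 * stats.t.sf(abs(t), df)

for t in (0.88, 2.8, 9.29):  # the t-values discussed above
    print(f"t = {t:>5}: p ≈ {two_sided_p(t):.4f}")
# t = 0.88 gives p ≈ 0.38 (nowhere near significant); t = 9.29 gives p far below 0.05
```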
The modern p-value convention derives from the null ritual, itself the result of a fundamental misunderstanding of statistics by experts, and this ritual has been standardized by peer reviewers. Gerd Gigerenzer’s premise in “Mindless Statistics” is that the null ritual arises from conflating Fisher’s and Neyman-Pearson’s incompatible theories into an “inconsistent hybrid” that should not be standardized (588). The first step of the null ritual, proposing a null hypothesis, derives from Fisher’s theory; it tests only whether chance explains the data and, unlike what Neyman-Pearson proposed, specifies no alternative hypotheses (Gigerenzer, 591). The second step, adopting a fixed 5% threshold for rejecting the null hypothesis, was proposed by neither, with Neyman and Pearson holding that the significance level should depend on factors such as sample size rather than on a fixed standard (Gigerenzer, 591). Both Fisher and Neyman-Pearson required an exact report of the p-value rather than a binary report centered on the value 0.05, and both discouraged its standardization (Gigerenzer, 590-591). According to Feynman, “to report a significant result and reject the null in favor of an alternative hypothesis is meaningless unless the alternative hypothesis has been stated before the data was obtained” (Gigerenzer, 602). This routinized hypothesis testing and the absence of pre-specified alternative hypotheses, practices which impose rigid constraints on statistical testing and compromise data integrity, are the product of a deeply flawed amalgamation of conflicting ideas and should be replaced.
Fraudulent studies mislead the media, damage the scientific field, and harm the general population, and these p-hacking tactics, combined with the unwavering trust people place in peer review and scientific consensus, directly contribute to fraud. Science relies on consensus, and a large number of peer reviewers in agreement is often enough to increase the credibility of a manuscript, but consensus is not necessarily an indicator of reproducibility or credibility. One study of biomedical journals, using data spanning 1990 to 2015, found that in 2015 only 20% of researchers performed up to 94% of reviews (Kovanis et al., 1), meaning a disproportionately small pool of peers reviews most of the literature, creating an inaccurate representation of scientific “consensus.” The façade of replicability created by artificially flattening the p-curve often leads peers to vet such manuscripts, increasing the number of fraudulent studies in publication. One example is the chocolate weight-loss hoax conducted by journalist John Bohannon, who later explained publicly in a Gizmodo article that the study was deliberately fraudulent, run as a social experiment. The study spread widely through media outlets around 2015, with many people believing, against their better judgement, the claim that eating a chocolate bar every day would cause them to lose weight. It was run by a Ph.D. researcher under the banner of the “Institute of Diet and Health” and made it through journal review, and even after publication many media outlets remained in consensus that it was a novel, beneficial study (Bohannon). According to Bohannon, the key tactic he employed to push the p-value below 0.05 was to measure 18 different variables when testing. Another example of broad consensus built on flawed data is the overcounting of COVID-19 deaths across countries early in the pandemic, which resulted in part from false-positive testing (Ioannidis, 581). By including deaths unrelated to COVID-19 in the counts, more false positives were generated, creating the illusion that COVID-19 had caused more deaths than it actually had; many of these deaths were more accurately linked to external causes such as inappropriate hospital conditions and rushed pandemic measures (Ioannidis, 586). Within cancer biology, around 90% of published studies are non-replicable, with only 6 of 53 showing significant results, according to Reproducibility Project data (Wen et al., 619). This has a drastic impact on scientific literature and public health: if the vast majority of studies cannot be replicated, then a significant amount of cancer research is flawed and progress within the field has been slowed, even stalled. And this does not apply only to health and cancer research, but to every field where peer review is a requirement for publication. The majority is not always right; in fact, it is often wrong.
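Returning to the 18 variables in Bohannon’s hoax: as a rough back-of-the-envelope check of my own (it assumes the measured variables were independent, which the Gizmodo article does not state), the chance that at least one of 18 truly null measures crosses the 0.05 threshold by luck alone is already around 60%.

```python
# Probability that at least one of 18 independent null comparisons comes out
# "significant" at p < 0.05 purely by chance (independence is my assumption).
p_at_least_one = 1 - (1 - 0.05) ** 18
print(f"{p_at_least_one:.2f}")  # ≈ 0.60
```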
The idea that broad consensus creates truth is false, as evidenced by the current crisis in peer review. As seen in Bohannon’s chocolate hoax, broad consensus and media coverage led to the study’s popularity and acceptance despite the whole thing being fraudulent. Meanwhile, the COVID-19 excess-death study by Ioannidis remains largely unpopular, going against the prevailing media consensus, despite Ioannidis being a highly influential researcher and the study being well documented, well referenced, based on a large sample, logically sound, and highly significant. The cancer replication data analyzed by Wen et al. are consistent with the Reproducibility Project’s broader finding that very few published studies are replicable, yet most of those studies were published with high peer consensus. All published scientific studies depend on the peer review process, even those that critique it, and science itself rests on many assumptions: that consensus reflects truth, that a sample accurately represents the field, or that the data were gathered correctly and remained intact. One might at first conclude that the for-profit structure of peer review is the root cause of its ills, but upon further examination the root cause appears to be unquestioning trust in, and rigid demands to conform to, the way science is currently conducted. Adopting open review frameworks, for-profit or not, seems so far to be the most effective challenge to the stringent and often bureaucratic demands of peer review; however, open review is not yet the standard and is only a recent re-emergence. While blinded review, the null ritual, and a non-critical approach to scientific consensus remain standard practice, peer review will continue to be flawed.
Further reading: Mindless Statistics by Gerd Gigerenzer, Why Most Published Research Findings Are False by John P. A. Ioannidis
Bibliography
Kelly, Jacalyn, et al. “Peer Review in Scientific Publications: Benefits, Critiques, & A Survival Guide.” eJIFCC, vol. 25, no. 3, 2014, pp. 227-243.
Spier, Ray. “The History of the Peer-Review Process.” Trends in Biotechnology, vol. 20, no. 8, 2002, pp. 357-358, doi:10.1016/S0167-7799(02)01985-6.
Gigerenzer, Gerd. “Mindless Statistics.” The Journal of Socio-Economics, vol. 33, 2004, pp. 587-606.
Foxe, John J., and Paul Bolam. “Open Review and the Quest for Increased Transparency in Neuroscience Publication.” European Journal of Neuroscience, vol. 45, no. 9, 2017, pp. 1125-1126, doi:10.1111/ejn.13541.
Godlee, F., et al. “Effect on the Quality of Peer Review of Blinding Reviewers and Asking Them to Sign Their Reports.” JAMA, vol. 280, no. 3, 1998, pp. 237-240, doi:10.1001/jama.280.3.237.
Banerjee, Amitav, et al. “Hypothesis Testing, Type I and Type II Errors.” Industrial Psychiatry Journal, vol. 18, no. 2, 2009, pp. 127-131, doi:10.4103/0972-6748.62274.
Head, Megan L., et al. “The Extent and Consequences of P-Hacking in Science.” PLOS Biology, vol. 13, no. 3, 2015, pp. 1-15, doi:10.1371/journal.pbio.1002106.
Ioannidis, John P. A. “Why Most Published Research Findings Are False.” PLOS Medicine, vol. 2, no. 8, 2005, pp. 0696-0701, doi:10.1371/journal.pmed.0020124.
Open Science Collaboration. “Estimating the Reproducibility of Psychological Science.” Science, vol. 349, no. 6251, 2015, aac4716, doi:10.1126/science.aac4716.
Simonsohn, Uri, et al. “P-Curve Won’t Do Your Laundry, but It Will Distinguish Replicable from Non-Replicable Findings in Observational Research: Comment on Bruns & Ioannidis (2016).” PLOS ONE, vol. 14, no. 3, 2019, pp. 1-5, doi:10.1371/journal.pone.0213454.
Replicability-Index. “Estimating the False Positive Risk in Psychological Science.” Replicability-Index, 2021, https://replicationindex.com/2021/12/15/estimating-the-false-positive-risk-in-psychological-science/.
Kovanis, Michail, et al. “The Global Burden of Journal Peer Review in the Biomedical Literature: Strong Imbalance in the Collective Enterprise.” PLOS ONE, vol. 11, no. 11, 2016, pp. 1-14, doi:10.1371/journal.pone.0166387.
Bohannon, John. “I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here’s How.” Gizmodo, 27 May 2015, https://gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800.
Wen, Haijun, et al. “On the Low Reproducibility of Cancer Studies.” National Science Review, vol. 5, no. 5, 2018, pp. 619-624, doi:10.1093/nsr/nwy021.
Ioannidis, John P. A. “Over- and Under-Estimation of COVID-19 Deaths.” European Journal of Epidemiology, vol. 36, 2021, pp. 581-588, doi:10.1007/s10654-021-00787-9.
Since I received more discussion on this (thank you!), I'll add a few things:
The null ritual convention incentivizes p-hacking, which in turn incentivizes peer reviewers to accept more studies that align with "nice-sounding," p-hacked results. The review process itself is most definitely affected by p-hacking in this regard, though I agree the convention is much easier to change than the system of peer review. However, I think the overall peer review system is in need of upheaval because of the aforementioned point. By "nice-sounding," I mean in more of a political sense, as in: "this study shows that eating dark chocolate every day makes you skinny, we'd like to promote that in our journal, so let's accept more studies that seemingly replicate it."