Beneath The Replication Crisis

“Psychologists are simply, on an absolute scale, dullards... They seem to feel, many of them, that all we need to do is consolidate our scientific gains. Their self-confidence astonishes me. For these gains seem to me puny, and scientific psychology seems to me ill-founded. At any time the whole psychological applecart might be upset. Let them beware!” (Gibson, 1967, p. 142)

Psychology is declared to be in crisis. The reliability of thousands of studies has been called into question by failures to replicate their results. As a damning example, Aarts et al. (2015) conducted replications of 100 experiments reported in papers published in three high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they found that only one-third to one-half of the original findings were also observed in the replication study. Powerful studies have failed to replicate famous psychological findings such as the ego depletion effect (Hagger et al., 2016), the so-called “marshmallow experiment” on delayed gratification (Watts, Duncan and Quan, 2018), and neonatal imitation (Oostenbroek et al., 2016).

Responses to the crisis vary but can be categorised broadly as either denial or acceptance. Among the latter, the most commonly proposed solution is greater statistical or methodological rigour: increasing sample sizes (IntHout, Ioannidis, Borm, and Goeman, 2015), tightening the threshold for statistical significance (Benjamin et al., 2018; see also Amrhein, Korner-Nievergelt, and Roth, 2017), retiring the concept of significance testing completely (McShane, Gal, Gelman, Robert, and Tackett, 2019, among 800 other signatories, but see Ioannidis, 2019), using confidence intervals instead of significance (Cumming, 2012), or switching to Bayesian methods (Dienes, 2011; Maxwell, Lau and Howard, 2015; Etz and Vandekerckhove, 2016).
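To make the logic of the first two proposals concrete, here is a minimal sketch of the standard Bayes' rule calculation for the share of significant results that reflect a true effect. It is not taken from any of the papers cited above. The function name and every number in it are assumptions chosen purely for illustration, in particular the assumed proportion of tested hypotheses that are true.

```python
def positive_predictive_value(prior, power, alpha):
    """Share of statistically significant results that reflect a true effect.

    prior: ASSUMED proportion of tested hypotheses that are actually true
    power: probability of detecting a true effect (rises with sample size)
    alpha: significance threshold, i.e. the false-positive rate under the null
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative numbers only: none of these are estimates from the literature.
scenarios = {
    "low power, alpha = .05": (0.10, 0.35, 0.05),
    "high power, alpha = .05": (0.10, 0.90, 0.05),
    "low power, alpha = .005": (0.10, 0.35, 0.005),
}

for label, (prior, power, alpha) in scenarios.items():
    ppv = positive_predictive_value(prior, power, alpha)
    print(f"{label}: {ppv:.2f}")
```

Under these assumed inputs, both larger samples (higher power) and a stricter threshold (lower alpha) raise the share of significant findings that are true. Neither changes the prior, which in this sketch is simply supplied by hand.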

More radical writers suggest this problem may have institutional dimensions. They propose better statistical education (Hughes, 2018) and measures for handling fraud and publishing replications (Begley and Ioannidis, 2015), and some suggest the problem lies with perverse incentives from grants and scientific bodies (Lilienfeld, 2017). While these more expansive criticisms seem more appropriate for the scope of the problem, they have not yet met with success. Pre-registration, a method of holding researchers accountable for their work by getting them to publish their methods before they produce their findings, has not been the silver bullet many hoped (e.g. Cumming, 2014; Shiffrin, Börner and Stigler, 2018; Hughes, 2018). Moreover, as criticism becomes broader and more institutional, the question of why only some fields (i.e. psychology and the social sciences) seem vulnerable to non-replication becomes more pertinent. If the same incentive structure exists across academia, why is it that only some fields produce very high levels of positive results compared to other fields (Fanelli, 2010; Fanelli, Costas and Ioannidis, 2017)? There is demonstrably a problem with institutional science, but some fields are affected more than others (see figure 1), and the reason for this must be inherent to those fields.

Fig 1: Proportion of papers reporting positive results by subject, from Fanelli (2010)

Responses that deny the crisis in some sense are rarer, though they can still be high-profile. Some object to the term ‘crisis’ as an overstatement, suggesting that the spate of irreproducible results actually demonstrates psychology’s strength, not its weakness (Fanelli, 2013; Stroebe and Hewstone, 2015; Gilbert, King, Pettigrew and Wilson, 2016; Shiffrin, Börner and Stigler, 2018; Fanelli, 2018; see also Barrett’s 2015 NYT op-ed). Science is supposed to be about falsifying the way to truth, so what’s the problem with finding that some results aren’t true? (There is a strange tension in this thought, which both denies the crisis per se and points to the crisis as proof of rude health.) However, this type of denial mischaracterises the crisis. It is not that false studies are being rejected, as is proper, but that the ground beneath psychology’s feet is giving way. When Border et al. (2019), using a sample of over 600,000 participants, found no evidence for the 5-HTTLPR genotype affecting depression, they overturned the findings of 450 inadequately-conducted studies (and a burgeoning part of the pharmaceutical industry). A moderately viral blog post from Scott Alexander summarised the situation well:

“…[W]hat bothers me isn’t just that people said 5-HTTLPR mattered and it didn’t. It’s that we built whole imaginary edifices, whole castles in the air on top of this idea of 5-HTTLPR mattering. We ‘figured out’ how 5-HTTLPR exerted its effects, what parts of the brain it was active in, what sorts of things it interacted with, how its effects were enhanced or suppressed by the effects of other imaginary depression genes. This isn’t just an explorer coming back from the Orient and claiming there are unicorns there. It’s the explorer describing the life cycle of unicorns, what unicorns eat, all the different subspecies of unicorn, which cuts of unicorn meat are tastiest, and a blow-by-blow account of a wrestling match between unicorns and Bigfoot.” (Alexander, 2019)

The repudiation of studies here and there is falsification; when entire fields and bedrock studies are thrown into doubt, this is a crisis.

Many of these same sceptics claim that the problem is media scare stories about the crisis, which could cause young academics to lose faith or expose vulnerabilities to anti-scientific conservatives (Oreskes, 2018; Jamieson, 2018; Hughes, 2018). One might think this was further reason to treat the matter seriously rather than dismiss it. John Ioannidis, the pre-eminent investigator of the current reproducibility crisis (e.g. Ioannidis et al., 2005, 2014), notes the current complacency of many scientists:

“Even well-intentioned academics, perceiving an attack on science, may be tempted to take an unproductive, hand-waving defensive position: ‘we have no problem with reproducibility’, ‘everything is fine’, ‘science is making progress’.” (Ioannidis, 2018, p. 2)

This prickliness on the sceptical side suggests that the sense of crisis is as real among academics as it is in the media and industry.

More convincingly, other sceptics note that if psychology is in crisis today, it has been in crisis almost from the beginning. Declarations of crisis themselves have a history in psychology going back to the late 19th century (Willy, 1897: see Mülberger, 2011) and can be found throughout the 20th century, particularly in the 1920s and 1970s (Goertzen, 2008; Giner-Sorolla, 2012; Sturm and Mülberger, 2011). The observation that psychological studies fail to replicate is not a new one (see Lykken, 1991); the famed paper from Cohen, The Earth Is Round (p < .05), mocked statistical malpractice and null-hypothesis significance testing in psychology a quarter of a century before the time of my writing, and Cohen himself pointed out that he was preceded by 40 years of similar complaints (Cohen, 1994; see also Westland, 1978; Meehl, 1978).

Even if this point is intended as a deflection (a chronic crisis is a contradiction in terms), it provides a useful reconceptualisation of the problem as a recurrent issue that predates the rise of modern institutional and methodological practices. This ‘crisis’ is an episodic outburst within a chronic condition, so historical investigations of previous crises have some connection with the current replication crisis and deserve examination.

The historical crisis


Historical descriptions of crisis in psychology are occupied primarily with the proliferation of psychological schools. It is a theme of the first recorded declaration of crisis (Willy, 1897; see Mülberger, 2011), of Bühler’s Die Krise der Psychologie (1927), and of Vygotsky’s Historical Meaning of the Crisis in Psychology (1927). This ‘embarrassment of riches’, as both Bühler and Willy refer to it, is considered by these writers to be the result of a fundamental division between subjectivist/idealist and objectivist/materialist paradigms. Where Vygotsky contrasted Pavlovian reflexology and Freudian psychoanalysis as examples of these two trends, Koffka contrasted Titchener and phenomenology. Today we could perhaps contrast neuroscience with cognitivism’s ‘radical subjectivism’. Without a central theory capable of resolving this contradiction, the proliferation of different schools is inevitable. Vygotsky claims that to draw these schools together requires a historically and culturally conscious approach (a dialectical paradigm), against which he contrasts a nascent empiricism within psychology:

“The vast majority of contemporary psychological investigations write out the last decimal point with great care and precision in answer to a question that is stated fundamentally incorrectly.” (Vygotsky, 1927, p.258)

Empiricism is the other symptom of this contradiction: empirical findings must be used to guide exploration in lieu of the missing theory. Of course, this data is not itself atheoretical, but comes with assumptions that reflect the circumstances in which it was gathered. These assumptions come to stand in for the role of theory in psychology, as Danziger argues:

“…[P]sychology appears to be unique in the degree to which statistical inference has come to dominate the investigation of theoretically postulated relationships. In this discipline it is generally assumed without question that the only valid way to test theoretical claims is by the use of statistical inference. This assumption is associated with an implicit belief in the theory-neutrality of the techniques employed… Faith in this methodology certainly unites a much larger number of research psychologists than does any kind of commitment to a particular theoretical framework. It is surely the most serious candidate for the status of a generally accepted puzzle solving paradigm in modern psychology.” (Danziger, 1985, p.3)

Increasingly, then, psychology has come to admit only data amenable to statistical testing, which, as Danziger goes on to argue elsewhere, structures the investigations of psychology in much the same way as a theory does in other subjects:

“Psychology as a whole found it peculiarly difficult to achieve agreement… and hence to generate a product that represented the truth about its subject matter by common consent. The imposition of a quantitative structure on its knowledge base had seemed to offer a resolution of these difficulties…

Using statistical significance tests as the standard technique for the corroboration of psychological hypotheses meant that theories were generally regarded as ‘confirmed’ if a very weak logical complement, ‘chance’, or ‘the null hypothesis’ could be disconfirmed. This travesty of scientific method certainly allowed the growth of a major research industry that offered employment to many, even though its products can now be seen as having contributed nothing of either practical or theoretical value.” (Danziger, 1990, pp.148-155)

We can see this empiricism in several places in modern psychology. Any undergraduate course today includes quantitative research methods throughout, while different approaches to the subject (cognitivism, behaviourism, psychoanalysis, etc.) come and go. Tellingly, we can also see this empiricism at the earliest point of the replication crisis in psychology, the publishing of Bem’s (2011) ‘precognition’ experiments. Bem ran several standard psychological experiments backwards, and found statistically significant effects of stimuli that were presented only after responses had been asked for. This study was published to much controversy in the Journal of Personality & Social Psychology, whose editors claimed to have no grounds for dismissing it theoretically, since it met their guidelines for peer review. As their editorial stated:

“We openly admit that the reported findings conflict with our own beliefs about causality and that we find them extremely puzzling. Yet, as editors we were guided by the conviction that this paper—as strange as the findings may be—should be evaluated just as any other manuscript on the basis of rigorous peer review. Our obligation as journal editors is not to endorse particular hypotheses but to advance and stimulate science through a rigorous review process.” (Judd and Gawronski, 2011)

This is a clear declaration that the work of psychology is published (or perishes) not due to any commitment to a hypothesis but due to meeting methodological criteria (see also Wilson and Golonka’s post on this editorial). Any work, no matter how incoherent (even work that violates physical causality), could form part of psychology’s evidence base, provided it met those standards. The unique vulnerability of psychology to the replication crisis is that scientific work is guided only by this kind of empiricism, not by theory (see also the posts by van Rooij and by Denny Borsboom). Combined with industry-wide pressures to publish, the replication crisis was inevitable.

Reconstructing the subject

The centrality of statistical methodology to psychology, providing much the same function as a theory, explains why the replication crisis affected psychology in particular. This also explains why the replication crisis shares many characteristics of a theoretical crisis as described by Kuhn (1962) (repudiating old evidence, provoking a hostile response from the ‘old guard’). But since it is framed as, and composed of, methodological failures, it is treated primarily as a methodological problem, and this hampers efforts to resolve the crisis on the theoretical level.

The analysis of historical crises in psychology shows that if psychologists choose to limit their responses to this crisis to statistical, methodological, or even institutional corrections, they are following in the footsteps of their forebears by substituting technical questions for theoretical ones. As Danziger notes in Constructing the Subject, echoing Gibson’s assessment from 1967:

“… Confrontations in depth were the exception rather than the rule. Typically, the original confrontation was rendered harmless by its transformation… With this transformation the potentially disturbing features of the original critique were successfully avoided and the dominant style of investigative practice could continue to be taken for granted. The issue became trivialized to a form of question that could be investigated within the framework of traditional practices. This was the usual fate of controversies that threatened to upset the methodological applecart.” (Danziger, 1990, p. 248)

In 1990, perhaps Danziger’s words might have been received more frostily, but in the light of the replication crisis I think his words deserve serious consideration. It is common to see psychologists shy away from or deride conceptual or philosophical matters in favour of ‘getting on’, being ‘pragmatic’, thereby transforming the theoretical problem into a technical one. I find little to disagree with in Machado, Lourenço and Silva (2000), who say that “conceptual investigations have been dismissed as philosophical speculation alien or even inimical to science, as misguided attempts to circumvent empirical research, a sort of shortcut in the path to the truth, or as armchair speculation about the meaning of words.” (p. 26)

We can see an example of this aversion to theoretical matters from high-profile psychologists: Tomasello, Carpenter, and Liszkowski (2007) might wish to discuss the matter of joint knowledge through pointing, but clarify that “[they] are not attempting to address the large and complex philosophical literature on the nature of mutual knowledge nor the philosophical use of the word ‘know’.” Ironically, the authors’ disavowal of any theoretical commitments in this fashion only serves to blind them to their unexamined assumptions about the phenomenon of joint attention (see Racine, Leavens, Susswein, and Wereha, 2008, for a discussion).

Alternatively, see Whitehead and Rendell (2015), who spend the opening chapter of their book on cetacean culture pouring scorn on the social scientists who offer criticism of their conception of ‘culture’, and feel their definition (socially transferred behaviours across individuals) allows them to ‘get on’ to look at empirical data. Despite this pragmatism, Whitehead and Rendell never seem to escape the problems inherent in their definition being so loose, repeatedly returning throughout the book to the controversy it causes, without the evidence they cite providing any further clarification. In miniature, Whitehead and Rendell provide psychology with an example of the futility of pressing ahead without conceptual clarity.

The replication crisis, if nothing else, has shown that productivity is not intrinsically valuable. Much of what psychology has produced has been shown, empirically, to be a waste of time, effort, and money. As Gibson put it: our gains are puny, our science ill-founded. It is hard to see what psychology, as a subject, has to lose from a period of theoretical confrontation. The ultimate response to the replication crisis will determine whether this bout is postponed or not.
