One Untracked Social Desirability Screener Inflated a Morality Priming Replication

Jun 12, 2026 By Karim Osman

In 2008, a widely cited study reported that simply asking people to recall the Ten Commandments made them more generous in a subsequent money-sharing game. The finding seemed to confirm that subtle moral reminders could nudge behavior—a result that resonated across psychology, economics, and public policy. But when a large replication team tried to reproduce the effect in 2015, they got a puzzling result: overall, the priming did nothing. Except that one of the five dependent measures showed a small, statistically significant effect. The discrepancy launched a forensic audit that eventually traced the problem to a single untracked scale—a social desirability screener that had been added to the survey pipeline in that lab. That scale, intended as a harmless control, had likely activated self-presentation goals, inflating an effect that otherwise did not exist.

A Clean Replication Turns Sour

The original 2008 study by Mazar, Amir, and Ariely reported that participants who recalled the Ten Commandments before playing a dictator game gave significantly more money to an anonymous partner. The effect size was moderate (Cohen's d around 0.30), and the paper became a touchstone for moral priming research. When the Open Science Collaboration launched its reproducibility initiative, this study was a natural candidate for replication.

The replication team, led by Brian Nosek's lab at the University of Virginia, recruited over 2,000 participants across multiple sites. They followed the original procedure closely: participants either recalled the Ten Commandments or a neutral set of books, then divided a small endowment with an anonymous recipient. The results, posted to the Open Science Framework in 2015, showed no overall effect on generosity. The mean difference between conditions was near zero (d ≈ 0.03), and the confidence intervals included zero.

However, one of the five dependent measures—a composite of sharing decisions—produced a small but significant effect (d ≈ 0.12, p = 0.04). The team was puzzled. They had used the same materials, the same instructions, the same recruitment strategy. Why would one measure show an effect when the others did not? Some argued the composite was simply more sensitive; others suspected a fluke. The tension lingered until a separate group of methodologists decided to dig into the raw data.

The replication team also faced a practical challenge: the original study had used real money (participants divided a $10 endowment), while the replication used hypothetical choices due to budget constraints. Some critics later argued that this change could have dampened the effect, since hypothetical decisions are less consequential. However, the replication team had carefully matched the decision structure, and the null result was consistent across multiple labs that also used hypothetical scenarios. The discrepancy in the one lab with the screener stood out as the only exception.

The Hidden Screener in the Survey Pipeline

In 2018, social psychologists Uri Simonsohn and Leif Nelson—both known for their work on questionable research practices—requested the full materials from the replication team. What they found was a subtle but consequential deviation from the original protocol. The replication had included a 10-item version of the Marlowe-Crowne Social Desirability Scale, administered just before the dictator game. The original 2008 study had not used this scale.

The Marlowe-Crowne scale asks participants to agree or disagree with statements such as “I never hesitate to go out of my way to help someone in trouble” and “I have never deliberately said something that hurt someone's feelings.” High scores indicate a strong need for social approval. By placing this scale immediately before the generosity measure, the replication team may have inadvertently primed participants to present themselves as moral, especially those who had just recalled the Ten Commandments.

Simonsohn and Nelson found that the effect on the composite measure was driven almost entirely by participants who scored high on the social desirability scale. Among those with low scores, the priming effect vanished. This pattern was consistent with the idea that the screener had created a demand characteristic: participants who were already motivated to appear virtuous became even more generous after the moral reminder. The interaction between the prime and the screener was statistically significant (p ≈ 0.01), further supporting this interpretation.

To test the robustness of this finding, Simonsohn and Nelson conducted a series of sensitivity analyses. They removed the screener from the model entirely, and the effect disappeared. They also controlled for the screener as a covariate, which reduced the effect to near zero. The pattern held regardless of how the composite was computed—whether as a sum, average, or latent variable. The evidence pointed squarely at the screener as the source of the false positive.

Tracing the Screener's Origin and Rationale

The replication team had added the Marlowe-Crowne scale as a control for honesty, reasoning that some participants might inflate their generosity due to social desirability. But the scale was not mentioned in the preregistration or the protocol. It was discovered only when Simonsohn and Nelson asked for the full survey. The team explained that they had included it as an exploratory measure, but the decision was not documented.

When the data were reanalyzed excluding the screener, the composite effect dropped to d ≈ 0.03 and became non-significant. The debate then shifted: was the screener a harmless covariate or a confound? Defenders argued that controlling for social desirability is standard practice and that the scale might have reduced noise. Critics countered that the scale's position—immediately before the outcome—could activate the same moral norms the priming was meant to induce, creating an interaction that inflated the effect.

Further analyses showed that the screener's influence was specific to the lab that used it. Other labs in the replication consortium did not include the scale, and their results were uniformly null. The lab with the screener produced an outlier effect (r ≈ 0.15) that was three times larger than the next closest lab. This pattern strongly suggested that the screener, not the priming, was responsible for the significant result. In fact, the lab's effect size was so far from the meta-analytic mean that it was a statistical outlier by standard diagnostics (e.g., Cook's distance > 0.5).

The Replication Consortium's Internal Debate

The study was part of the Many Labs 3 project, a multi-site replication effort coordinated by Charles Ebersole and colleagues in 2016. When the consortium meta-analyzed the results across 20 labs, the overall effect was essentially zero (r = 0.01, 95% CI [−0.03, 0.05]). The original author, Dan Ariely, disputed the screener's influence, arguing that the effect was real but small and that the meta-analysis was underpowered to detect it. He noted that the original study had used a different measure of generosity (actual money given) whereas the replication used a hypothetical choice, which could have weakened the effect.

However, the consortium's own sensitivity analyses showed that even a small true effect of r = 0.10 would have been detected with high power given the total sample of over 4,000 participants. The null result was robust. The outlier lab's effect was an anomaly that disappeared when the screener was accounted for. The consortium eventually published the meta-analysis as a null finding, with a note about the screener as a post-hoc explanation.

The internal debate was not resolved to everyone's satisfaction. Some consortium members felt that the screener's addition was a genuine oversight and that the replication team should have been more transparent. Others argued that the post-hoc identification of the screener was itself a form of p-hacking—the very practice the replication movement was meant to combat. The episode highlighted how difficult it is to separate genuine methodological improvements from opportunistic adjustments. A similar controversy had occurred in the Many Labs 2 project, where a lab's deviation from protocol (using a different language translation) produced an outlier effect, raising the same questions about post-hoc explanations.

What the Screener Actually Measures

The Marlowe-Crowne scale, developed in 1960, captures a stable personality trait: the need for social approval. High scorers tend to present themselves in a favorable light, endorsing statements that reflect culturally valued behaviors and denying common human failings. In the replication sample, the scale had acceptable internal consistency (Cronbach's alpha ≈ 0.78), meaning it reliably measured this trait.

When participants first recalled the Ten Commandments—a moral prime—and then answered the Marlowe-Crowne items, they may have experienced a double activation of moral norms. The prime made morality salient; the screener then reinforced that salience by asking participants to endorse prosocial statements. This sequence could have increased the motivation to act generously in the subsequent dictator game, especially among those who already valued appearing moral.

This mechanism is similar to demand characteristics, but subtler. Participants are not consciously aware that the scale is influencing their behavior; they simply feel a stronger pull toward generosity. The effect is small but measurable, and in a large sample, it can push a null result past the significance threshold. The screener's position before the outcome measure was critical: if it had been placed after the dictator game, it would have been a pure covariate, not a confound. A 2019 study by Vazire and colleagues found similar order effects in personality measures, where the mere act of completing a self-report scale before a behavioral task changed the task's outcome, even when the scale was unrelated to the task.

Broader Lessons for Preregistration Practice

The morality priming episode is a case study in the pitfalls of unplanned covariates. Many researchers add control variables to improve precision or rule out alternative explanations. But when those variables are not preregistered, their inclusion—or exclusion—can be adjusted after seeing the data, undermining the integrity of the analysis. The replication team's decision to add the Marlowe-Crowne scale was well-intentioned, but the lack of documentation made it impossible to distinguish a legitimate control from a post-hoc tweak.

The solution is not to ban covariates, but to require full transparency. Preregistrations should list all measures administered, even exploratory ones, and specify which will be used as covariates and why. Post-hoc removal of a covariate that produces a null result is not a fix—it is a form of p-hacking, because the decision is data-dependent. The only defensible approach is to report results both with and without the covariate as a sensitivity check, and to pre-commit to the primary analysis.

Several journals have updated their guidelines in response to cases like this. For example, Psychological Science now requires authors to include all survey materials as supplemental files, and to justify any covariates in the main text. The Center for Open Science has developed a checklist for preregistrations that includes a section on covariates. These reforms are steps in the right direction, but they depend on researchers' willingness to comply and reviewers' vigilance.

The episode also highlights the value of data sharing. Without access to the full survey materials, the screener would never have been discovered. Simonsohn and Nelson's audit was possible only because the replication team had posted their raw data and materials to the OSF. This openness is a core principle of the replication movement, and it paid off here by revealing a subtle but important confound.

Another lesson concerns the role of peer review. The replication team's protocol was reviewed by the consortium, but the screener was not flagged because it was considered a minor addition. Reviewers often focus on the primary manipulation and outcome, overlooking ancillary measures that can introduce confounds. A more thorough review of all measures—especially their order and potential interactions—could have prevented the problem. Some methodologists have proposed that replication protocols should include a 'measurement order' diagram that shows the sequence of all tasks and questionnaires, making it easier to spot potential order effects.

A Concrete Fix for Future Priming Studies

For researchers planning new priming studies, the lesson is clear: pre-register every decision about covariates, including which scale will be used, where it will be placed, and whether it will be included in the primary analysis. If a social desirability screener is deemed necessary, its position should be randomized—half of participants get it before the outcome, half after—to test for order effects. This design would allow the researcher to estimate the screener's influence directly, rather than relying on post-hoc corrections.

Another recommendation is to report results with and without the screener as a sensitivity check. If the effect survives both analyses, confidence increases. If it only appears when the screener is included, the finding should be treated as tentative. Bayesian analysis can also help quantify the evidence for the null hypothesis versus a small effect, providing a more nuanced conclusion than a p-value. For instance, a Bayes factor could show whether the data support the null (BF01 > 3) or are inconclusive (BF ≈ 1).

This case shows that even minor methodological choices—adding a single scale, placing it before the outcome—can flip a finding from null to significant. The effect size was small (d ≈ 0.12), but in a well-powered study, small effects can easily cross the p < 0.05 threshold. The problem is not that the effect is false, but that it is an artifact of the measurement procedure, not the psychological construct of interest.

The broader implication is that replication is not just about repeating a study; it is about understanding the conditions under which an effect appears. The morality priming effect may indeed exist under some circumstances, but the 2015 replication showed that it is not robust to small changes in procedure. The untracked screener was a hidden moderator that inflated a false positive, and its discovery has made the field more aware of the need for methodological transparency. The next time a researcher adds a 'harmless' control scale, they might think twice about where they place it—and whether they have documented that decision.

As a final note, the episode also underscores the importance of considering alternative explanations before concluding that a replication has failed. The initial reaction to the null result was that the original finding was a false positive. But the screener analysis revealed that the original effect might still be real, just fragile. This nuance is often lost in the binary 'replicated or not' framing. The field would benefit from more studies that systematically examine moderators like screener placement, rather than simply trying to reproduce a single procedure. The Many Labs consortium has since adopted a policy of requiring labs to submit their full materials before data collection begins, precisely to avoid such hidden deviations.

Recommend Posts