Background
Statistics reflux
Leiden PhD student Tim van der Zee discovered a striking number of errors in a world-famous psychologist’s articles. "We have to ask ourselves how things like this keep happening."
Bart Braun
Thursday 23 February 2017

Brian Wansink is an American professor who studies how, what and especially why people eat, and he’s remarkably successful in his field. His creative and media-savvy research designs have produced numerous papers, popular science books that sell like hot cakes and concepts that are even more famous than his name.

Remember that study with soup bowls that refilled from the bottom, which revealed that people chomp away mindlessly instead of paying attention to when they are full? He won an Ig Nobel Prize for it, the award for amusing science. And how about the idea that you feel full sooner eating from a small plate than from a big one? Thanks to Brian Wansink, dieticians everywhere recommend down-sizing your crockery. People eat more popcorn if you give them a larger tub of it, yet adamantly insist that the quantity they ate had nothing to do with the size of the tub; Brian Wansink’s group at Cornell University demonstrated that this even happens with stale, two-week-old popcorn.

Late last year, he blogged about a Turkish guest researcher who had joined his group. Wansink had a data set from a field study at an all-you-can-eat pizza restaurant. Some guests had been given a discount voucher, and Wansink’s idea was that people would eat less if the food was cheaper. The hypothesis was not confirmed, but perhaps they could winkle something else of interest out of the data? His own post-doctoral researchers weren’t interested, but the visitor wanted to give it a try. And so, thanks to her hard work and willingness to "make hay when the sun is shining", she could put her name to four articles on the pizza study!

Actually, that is already where things started to go wrong. The academic world, which does not approve of spreading your research out over as many short papers as possible, calls this behaviour "salami slicing". Hauling dragnets through your data after the study, in the hope of finding some connection or other, is also frowned upon. Statisticians call it "p-hacking", because you are fishing for a nice, low probability value, p. To use such a connection in an article, you have to make up a story afterwards about how it came about, whereas you are supposed to formulate that story before you start your research. There’s a scornful term for that too: "HARKing", short for Hypothesising After the Results are Known.
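To make the statistical point concrete, here is a minimal sketch (my own illustration, not taken from the pizza papers or from Wansink’s data) of what the dragnet approach does to pure noise: test enough unrelated variables and a few will look "significant" at p < .05 by chance alone, which is exactly why a story invented afterwards proves nothing.

```python
# Illustrative sketch of p-hacking on pure noise (hypothetical data, not Wansink's).
# Twenty unrelated "outcome" variables are tested against a random grouping;
# by chance alone, some tests come out "significant" at p < .05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_diners, n_outcomes = 100, 20

group = rng.integers(0, 2, size=n_diners)            # e.g. voucher vs. no voucher
outcomes = rng.normal(size=(n_diners, n_outcomes))   # pure noise: no real effects

significant = []
for i in range(n_outcomes):
    _, p = ttest_ind(outcomes[group == 0, i], outcomes[group == 1, i])
    if p < 0.05:
        significant.append((i, round(p, 3)))

print(f"{len(significant)} of {n_outcomes} noise variables look 'significant':", significant)
```

Run repeatedly, a script like this typically flags about one spurious "finding" per twenty tests, which is just what the 5 per cent threshold predicts.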

Many scientists regard these practices as relatively minor sins. Leiden has some scientists who are guilty of them, too. You really shouldn’t do it, but all too often, scientists are judged by the number of articles they have published. So some do it regardless.

Leiden PhD student Tim van der Zee read Wansink’s blog and took a closer look at the four pizza papers. "I quickly realised that things didn’t tally. If you do your research like this, you can’t produce an article of high quality." The diners at the pizza restaurant had been asked to fill in a questionnaire indicating how full they felt, and that was linked to the amount of pizza they had eaten. However, the restaurant had put other dishes out on the buffet, like pasta, that also make diners feel full.

"So I checked whether the numbers were right, and talked about it with some researchers I know. We noticed so many mistakes and impossibilities, in the end, we decided to write a paper on it." The pre-print was recently published on the publication site PeerJ titled: Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab. It’s evident that the authors had a lot of fun in writing the article and have a few digs at Wansink in the footnotes. The authors’ names are ordered by age as "an approximation of the total amount of pizza consumed during their lives."The appendix listed some 150 discovered mistakes all together.

Yes. One hundred and fifty. "Most of the errors are really not so bad on their own, and there’s always a possibility that you copy something wrong, or whatever," Van der Zee explains. "It’s mostly the quantity of mistakes that’s worrying. Sometimes, articles are withdrawn because they contain four or five such errors. In one chart, we marked all the incorrect figures in red. It looks like a Christmas tree, because more than half of them are wrong. It’s almost physically impossible to make so many mistakes."

Van der Zee is a researcher who actually enjoys statistics and understands what statistics software does; this type of researcher is rarer than outsiders might suppose. But you don’t even have to be one to find a good number of those "faulty figures".

"Your sample size, for instance, should be the same size throughout the article, and if you base two articles on the same study, both articles should use the same figures," he says to illustrate the point.

Moreover, there is something called a "granularity error", which occurs when you use averages. "Let’s say I ask two people how full they feel, and to express that as a whole number on a scale from one to ten. An average, in such a case, can never be 3.85: it’s either a whole number or it ends in .5. It’s immediately obvious with two people, but the same applies to larger groups, up to a hundred: there are numbers that can never be an average. It’s a very basic principle. The main reason we decided to publish our story now is to show other scientists that they can check these things very easily." Van der Zee’s co-authors, Jordan Anaya and Nick Brown, have already developed a program for such checks, GRIM, which is freely available on GitHub.
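As an illustration of the principle Van der Zee describes, the check boils down to asking whether a reported mean could possibly arise from whole-number responses given the sample size. The sketch below is my own minimal version of that idea, not the actual GRIM code.

```python
# Minimal sketch of a GRIM-style granularity check (illustration only, not the real GRIM tool):
# given a reported mean and sample size n, test whether any sum of n whole-number
# responses could produce that mean after rounding.
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if reported_mean is achievable as the rounded mean of n integers."""
    target = round(reported_mean, decimals)
    base = int(n * reported_mean)
    # Try the integer sums closest to n * mean and see if any reproduces the reported mean.
    for total in (base - 1, base, base + 1):
        if round(total / n, decimals) == target:
            return True
    return False

# Two diners rating fullness as whole numbers: 3.85 is impossible, 3.5 is fine.
print(grim_consistent(3.85, n=2))  # False
print(grim_consistent(3.50, n=2))  # True
```

The published GRIM tool handles rounding conventions and edge cases more carefully, but the core test is no more complicated than this.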

"In theory, we’re not accusing Wansink of anything," Van der Zee adds. "We can demonstrate that his values are not consistent, impossible even. We can’t draw any conclusions to explain why they’re inconsistent, though." So, did the peer reviewers of the journals in which the Cornell group published their articles drop the ball here? "Absolutely. Each of those articles were reviewed by at least two people, and they should have noticed something. If the sample size changes, their alarm bells really should have gone off. You don’t need any complicated algorithms to see it."

Initially, Wansink tried to dodge Van der Zee’s requests to see the data set, but he was more accommodating after the publication on PeerJ. He ate humble pie on his blog and apologised to psychology in general, and more specifically to the journals that published the articles; the data will be made public so that it becomes clear how things could go so wrong.

One thing he has not done is point a finger at the Turkish researcher. A smart move, because another seven articles with similar errors produced by his group surfaced recently. "It’s the right response, though I’m particularly interested in specific behaviour that will guarantee the quality of the scientific literature. But this is a good start," says Van der Zee.

"It’s not the first time something like this happens, and it’s not the only time", he continues. "We need to ask ourselves why things like this keep happenin. The larger picture is that scientists are generally under pressure to publish lots of papers. One of the lessons we can learn from this case is that we should reward quality, not quantity. I myself am a great fan of pre-registration: before you begin, you record your questions, how you intend to study them and which analysis you intend to use. Then it’s subjected to a peer review that assesses your method. This will prevent people sifting through a dataset interminably until they find something they can publish, which won’t produce any reliable knowledge.’