
Bothered by non-monotonicity? Here’s ONE QUICK TRICK to make you happy.

We’re often modeling non-monotonic functions. For example, performance at just about any task increases with age (babies can’t do much!) and then eventually decreases (dead people can’t do much either!). Here’s an example from a few years ago:

A function g(x) that increases and then decreases can be modeled by a quadratic, or some more general functional form that allows different curvatures on the left and right sides of the peak, or some constrained nonparametric family.

Here’s another approach: an additive model, of the form g(x) = g1(x) + g2(x), where g1 is strictly increasing and g2 is strictly decreasing. This sort of model gets us away from the restrictions of the quadratic family—it’s trivial for g1 and g2 to have different curvatures—but it is also a conceptual step forward in that it implies two different models, one for the process that causes the increase and one for the process that causes the decrease. This makes sense, in that typically the increasing and decreasing processes are completely different.
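To make this concrete, here’s a minimal sketch (my own toy example, not anything from the original analysis) using one arbitrary parametric choice for the two components: a saturating-exponential increase plus a linear decline. The functional forms and numbers are just illustrative assumptions; any pair of monotone components with different curvatures would make the same point.

```python
# Toy illustration of g(x) = g1(x) + g2(x) with g1 increasing and g2 decreasing.
# The particular parametric forms here are arbitrary choices for the sketch.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

def g(x, a, b, c):
    g1 = a * (1 - np.exp(-b * x))   # strictly increasing for a, b > 0
    g2 = -c * x                     # strictly decreasing for c > 0
    return g1 + g2

# Fake data: performance rises with age, peaks, then declines.
age = np.linspace(0, 80, 200)
y = g(age, a=10, b=0.05, c=0.05) + rng.normal(0, 0.5, size=age.size)

# Fit the additive model; the bounds keep each component monotone in its direction.
params, _ = curve_fit(g, age, y, p0=[5, 0.1, 0.1], bounds=(0, np.inf))
print("estimated a, b, c:", np.round(params, 3))
```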

This is an example of the Pinocchio principle.

P.S. The original title of this post was “Additive models for non-monotonic functions,” but that seemed like the most boring thing possible, so I decided to go clickbait.

“Dynamically Rescaled Hamiltonian Monte Carlo for Bayesian Hierarchical Models”

Aki points us to this paper by Tore Selland Kleppe, which begins:

Dynamically rescaled Hamiltonian Monte Carlo (DRHMC) is introduced as a computationally fast and easily implemented method for performing full Bayesian analysis in hierarchical statistical models. The method relies on introducing a modified parameterisation so that the re-parameterised target distribution has close to constant scaling properties, and thus is easily sampled using standard (Euclidian metric) Hamiltonian Monte Carlo. Provided that the parameterisations of the conditional distributions specifying the hierarchical model are “constant information parameterisations” (CIP), the relation between the modified- and original parameterisation is bijective, explicitly computed and admit exploitation of sparsity in the numerical linear algebra involved. CIPs for a large catalogue of statistical models are presented, and from the catalogue, it is clear that many CIPs are currently routinely used in statistical computing. A relation between the proposed methodology and a class of explicitly integrated Riemann manifold Hamiltonian Monte Carlo methods is discussed. The methodology is illustrated on several example models, including a model for inflation rates with multiple levels of non-linearly dependent latent variables.

I don’t have time to read this paper right now—too busy cleaning out my damn inbox!—but I’m just posting here so I won’t forget to look at it at some point. Maybe it could solve some of the problems I was talking about at today’s Stan meeting?

The gaps between 1, 2, and 3 are just too large.

Someone who wishes to remain anonymous points to a new study by David Yeager et al. on educational mindset interventions (link from Alex Tabarrok) and asks:

On the blog we talk a lot about bad practice and what not to do. Might this be an example of how *to do* things? Or did they just get lucky? The theory does not seem any stronger than for myriad other too-noisy-to-say-anything studies.

My reply: Hey, I actually was involved in that project a bit! I don’t remember the details but I did help them in some way, I think at the design stage. I haven’t looked at all the details but they seemed to be doing all the right things, including careful measurements, connection to theory, and analysis of intermediate outcomes.

My correspondent also asks:

Also, if we need 65 random schools and 12,000 students to do a study, I fear that most researchers could not do research. Is it pointless to do small studies? I fear they are throwing out the baby with the bathwater.

My reply:

You don’t need such a large sample size if you collect enough data on each individual student and class. Don’t forget, you can learn from N=1 in a good qualitative study—after all, where do you think the ideas for all these interventions came from? I do think, though, that the sloppy-measurement, small-N study is not such a good idea: in that case, it’s all bathwater with no baby inside.

What we really need are bridges between the following three things:

1. Qualitative research, could be N=1 or could be larger N, but the point is to really understand what’s going on in individual cases.

2. Quantitative research with careful measurement, within-person comparisons, and large N.

3. The real world. Whatever people are doing when they’re not doing research.

The gaps between 1, 2, and 3 are just too large.

British journalists not running corrections and talking about putting people in the freezer

I happened to be reading an old issue of Private Eye (a friend subscribes and occasionally gives me some old copies) and came across this, discussing various misinformation regarding a recent crime that had been reported by a London tabloid columnist named Rod Liddle (no relation to the famous statistician, I assume):

Here is “what REALLY happened”, Liddle informed readers: “A drunken Polish bloke gets into an argument with a 15-year-old black kid. He pushes him and calls him ‘n*****’. The kid responds with a single punch. The Polish man, Arek Jozwik, is floored and killed.”

Not so: the youth who killed Jozwik with a single punch was white. But why worry about a fact? Almost a fortnight later, Liddle and the Sun website have yet to correct it.

I did a quick search on the Sun website and did not find the article (I searched for *Jozwik Liddle* and found nothing, then I searched for Jozwik alone and found a bunch of articles, but nothing by Liddle), which suggests that they removed it rather than correcting it. But maybe it’s been corrected and I just don’t know where to look. In the meantime, I found an old link to Liddle’s article here, and here it is on the Internet archive.

This interested me because we’ve been hearing a lot about problems with trust in the news in the U.S., so it’s good to remember that other countries have it a lot worse. I get annoyed when David Brooks pushes fake statistics at New York Times readers and then never runs a correction, but that’s nothing compared to what they do across the pond.

P.S. Also in the same issue:

After it emerged that Evening Standard editor George Osborne had told friends he wants Theresa May “chopped up in bags in my freezer”, fellow politicians queued up to criticize his choice of words.

This one was bizarre because all of the criticism had to do with Osborne (formerly the second-most-powerful person in the U.K. government) using this language to refer to a woman. I guess England’s the kind of place where it’s ok for a leading politician to talk about chopping someone up, if that someone is a man?

Also the bit about the “bags in the freezer.” What’s up with these people? You’d think killing someone would be enough, but then you have to chop up the body just to make sure? And then that’s not enough, you have to keep the chopped up body in your freezer? That’s one creepy country, where a major politician could talk this way.

Against Winner-Take-All Attribution

This is the anti-Wolfram.

I did not design or write the Stan language. I’m a user of Stan. Lots of people designed and wrote Stan, most notably Bob Carpenter (designed the language and implemented lots of the algorithms), Matt Hoffman (came up with the Nuts algorithm), and Daniel Lee (put together lots of the internals of the program). Also Jiqiang Guo and Ben Goodrich (Rstan), Michael Betancourt (improvements in Nuts), Jonah Gabry (Shinystan), and lots of others. As always, it’s hard to write these lists because whenever you stop, you’re excluding others.

Anyway, my primary role in Stan is “user.” The role of user is important. But I did not design the language, I did not write the language, and I did not come up with most of the algorithms that it uses. Other people did that. We have a great team (including lots of people who have no connection to Columbia University) and it’s great that we can work together in this way.


[edit from Bob Carpenter:

We have 30+ core developers now, which I figure is about 10 full-time equivalents. Also, dozens of people not on that list make contributions.

If you want to see what exactly people contributed in the way of code, see the stan-dev GitHub organization.

It’s harder to trace the ideas, but almost all of the bigger ideas and implementations were highly collaborative. It’s one of the most fun aspects of the project. I still love watching Andrew “use” Stan in our various applied projects.

Ben Goodrich did all of our earlier multivariate stuff, including the LKJ prior, the Cholesky factor codings with Jacobians etc. It was a real hurdle for me to try to understand this stuff and help get it implemented in Stan (you can see the math in the chapter of the reference manual on transformed variables). We were a very small team at the time (basically me, Matt, Daniel Lee, and Ben).

Some of the biggies not on Andrew’s list include our differential equation functionality (the first crude version of which was built by me and Daniel Lee, and since largely driven by Sebastian Weber, Michael Betancourt and more recently, Charles Margossian and Yi Zhang, along with ongoing input from Daniel Lee).

Then there’s our multi-core functionality (largely driven by Sebastian Weber with contemporary API design by Sean Talts). That just came out in Stan 2.18.

The other biggie recently is our GPU functionality (largely driven by Steve Bronder, Erik Strumbelj, and Rok Cesnovar, with a revised API design by Sean Talts [I hope you see the trend here—Sean’s also taking over the design of the Stan 3 language]). This will be in Stan 2.19, but already exists on branches and is beginning to be merged into the math library development branch on GitHub.

Another biggie which was contentious when we introduced it was all the diagnostics, the HMC-specific versions of which were driven by Michael Betancourt (yes, you can thank him for the divergence warnings [that wasn’t sarcastic—you really should thank him for sticking to his guns and insisting we include it even if it scared off users]). More recently, the gang (Michael, Andrew Gelman [we have two Andrews], Aki, Sean, Dan Simpson [we have two Dans]) came out with the simulation-based calibration algorithm (again, not Stan-specific, but a fantastic idea that required a lot of cleanup).

Then there’s model comparison advice, which isn’t really about Stan per se, but is largely derived from work by Aki Vehtari et al. with Jonah doing most of the coding and writing up of results.

Marcus Brubaker implemented L-BFGS and did a lot of the early efficient matrix implementations and helped with our low-level memory design at the assembly language generation level. Alp Kucukelbir designed ADVI and had a lot of help from Dustin Tran and Daniel Lee on implementation.

We also need a big shout out for Daniel Lee and Sean Talts, without whom our builds and continuous integration systems would’ve crumbled to dust ages ago. This is super thankless work that doesn’t take up a lot of their time in any given month, but which adds up to a huge effort over time.

We also have Allen Riddell (and more recently Ari Hartikainen) working on PyStan, and a host of other people working on the other (simpler) interfaces. Michael Betancourt built most of CmdStan. Ari and Allen are also working on an http server version of Stan.

Rob Trangucci’s done a lot of great work on making matrix operations more efficient. Krzysztof Sakrejda’s been doing a ton of work on interface and I/O design with Allen and Sean.

And of course, we now have Aki Vehtari (the closer) guiding our work on GPs (along with Rob Trangucci, Michael Betancourt, Andrew, etc.) and various sorts of diagnostics (with Jonah Gabry and Sean Talts and Dan Simpson). Not to mention, Aki organized StanCon Helsinki, which was a blast (sorry I didn’t have time to sit down and talk with everyone—I hadn’t even met half a dozen of our developers who were there, so I was very busy catching up). Dan Simpson’s currently working on adding efficient marginalizations for GLMs a la INLA and Matthias Vákár is working on efficient GLM functions (which Rok and Erik are then going to supercharge with GPUs).

We’ve also been concentrating a lot on workflow, diagnostics, etc., with extensive effort from Andrew Gelman, Michael Betancourt, Aki Vehtari, Jonah Gabry, and Dan Simpson, along with the rest of us trying to implement the recommendations and keep up with best practices. Michael’s case studies, in particular, should be required reading. I believe Jonah and crew are in Cardiff at the moment “reading” their RSS paper on Bayesian workflow. We plan to turn that into a book length monograph with full details and a couple of long examples.

Most recently, Ben Bales has been knocking every pitch thrown to him out of the park. He’s built the really neat adjoint-Jacobian product formulation of multivariate autodiff and figured out all the parameter pack stuff that’s going to let us massively simplify a ton of our functional interfaces. Mitzi Morris has been paying down all the technical debt I incurred writing the first version of the parser.

And let’s not forget Paul Bürkner, who built brms, which is another interface like rstanarm (which I believe was largely built by Ben Goodrich and Jonah Gabry, but even I can’t keep track of who actually builds what).

I could go on for pages, but I hope everyone gets the picture that this is a hugely collaborative effort!

end edit]

“We continuously increased the number of animals until statistical significance was reached to support our conclusions” . . . I think this is not so bad, actually!

For some reason, people have recently been asking me what I think of this journal article which I wrote about months ago . . . so I’ll just repeat my post here:

Jordan Anaya pointed me to this post, in which Casper Albers shared this snippet from a recently published article in Nature Communications:

The subsequent twitter discussion is all about “false discovery rate” and statistical significance, which I think completely misses the point.

The problems

Before I get to why I think the quoted statement is not so bad, let me review various things that these researchers seem to be doing wrong:

1. “Until statistical significance was reached”: This is a mistake. Statistical significance does not make sense as an inferential or decision rule.

2. “To support our conclusions”: This is a mistake. The point of an experiment should be to learn, not to support a conclusion. Or, to put it another way, if they want support for their conclusion, that’s fine, but that has nothing to do with statistical significance.

3. “Based on [a preliminary data set] we predicted that about 20 units are sufficient to statistically support our conclusions”: This is a mistake. The purpose of a pilot study is to demonstrate the feasibility of an experiment, not to estimate the treatment effect.

OK, so, yes, based on the evidence of the above snippet, I think this paper has serious problems.

Sequential data collection is ok

That all said, I don’t have a problem, in principle, with the general strategy of continuing data collection until the data look good.

I’ve thought a lot about this one. Let me try to explain here.

First, the Bayesian argument, discussed for example in chapter 8 of BDA3 (chapter 7 in earlier editions). As long as your model includes the factors that predict data inclusion, you should be ok. In this case, the relevant variable is time: If there’s any possibility of time trends in your underlying process, you want to allow for that in your model. A sequential design can yield a dataset that is less robust to model assumptions, and a sequential design changes how you’ll do model checking (see chapter 6 of BDA), but from a Bayesian standpoint, you can handle these issues. Gathering data until they look good is not, from a Bayesian perspective, a “questionable research practice.”

Next, the frequentist argument, which can be summarized as, “What sorts of things might happen (more formally, what is the probability distribution of your results) if you as a researcher follow a sequential data collection rule?”

Here’s what will happen. If you collect data until you attain statistical significance, then you will attain statistical significance, unless you have to give up first because you run out of time or resources. But . . . so what? Statistical significance by itself doesn’t tell you anything at all. For one thing, your result might be statistically significant in the unexpected direction, so it won’t actually confirm your scientific hypothesis. For another thing, we already know the null hypothesis of zero effect and zero systematic error is false, so we know that with enough data you’ll find significance.
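To put some numbers on that, here’s a rough simulation (my own sketch, not anything from the paper under discussion): a small but nonzero true effect, data collected in batches, a two-sided test after each batch, and stopping as soon as p < 0.05. The effect size, batch size, and cap are arbitrary choices.

```python
# Sequential stopping rule: keep collecting data and re-testing until |z| > 1.96.
import numpy as np

rng = np.random.default_rng(2)
true_effect, sd = 0.02, 1.0        # an effect of 0.1 is "reasonably large" on this scale
batch, max_n = 500, 100_000

stop_n, stop_est = [], []
for sim in range(20):
    y = np.empty(0)
    while y.size < max_n:
        y = np.append(y, rng.normal(true_effect, sd, batch))
        z = y.mean() / (y.std(ddof=1) / np.sqrt(y.size))
        if abs(z) > 1.96:          # "statistical significance was reached"
            break
    stop_n.append(y.size)
    stop_est.append(y.mean())

print("median n at stopping:   ", int(np.median(stop_n)))
print("median |estimate| then: ", round(float(np.median(np.abs(stop_est))), 3))
```

The point is not the exact numbers but that the stopping rule guarantees a “significant” result without making it informative.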

Now, suppose you run your experiment a really long time and you end up with an estimated effect size of 0.002 with a standard error of 0.001 (on some scale in which an effect of 0.1 is reasonably large). Then (a) you’d have to say whatever you’ve discovered is trivial, (b) it could easily be explained by some sort of measurement bias that’s crept into the experiment, and (c) in any case, if it’s 0.002 on this group of people, it could well be -0.001 or -0.003 on another group. So in that case you’ve learned nothing useful, except that the effect almost certainly isn’t large—and that thing you’ve learned has nothing to do with the statistical significance you’ve obtained.

Or, suppose you run an experiment a short time (which seems to be what happened here) and get an estimate of 0.4 with a standard error of 0.2. Big news, right! No. Enter the statistical significance filter and type M errors (see for example section 2.1 here). That’s a concern. But, again, it has nothing to do with sequential data collection. The problem would still be there with a fixed sample size (as we’ve seen in zillions of published papers).
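Here’s a similarly quick sketch (mine, with made-up numbers) of that significance filter: with a true effect of 0.05 and a standard error of 0.2, the estimates that happen to clear the significance bar exaggerate the true effect many times over.

```python
# The "statistical significance filter" / type M error, in a fixed-n setting.
import numpy as np

rng = np.random.default_rng(3)
true_effect, se = 0.05, 0.2
est = rng.normal(true_effect, se, size=1_000_000)   # sampling distribution of the estimate
signif = np.abs(est / se) > 1.96

print("share reaching significance:       ", round(signif.mean(), 3))
print("mean |estimate| given significance:", round(np.abs(est[signif]).mean(), 3))
print("exaggeration ratio (type M):       ", round(np.abs(est[signif]).mean() / true_effect, 1))
```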

Summary

Based on the snippet we’ve seen, there are lots of reasons to be skeptical of the paper under discussion. But I think the criticism based on sequential data collection misses the point. Yes, sequential data collection gives the researchers one more forking path. But correcting for this with some sort of type 1 error or false discovery adjustment rule is essentially impossible, and it would be pointless even if it could be done, as such corrections are all about the uninteresting null hypothesis of zero effect and zero systematic error. Better to just report and analyze the data and go from there—and recognize that, in a world of noise, you need some combination of good theory and good measurement. Statistical significance isn’t gonna save your ass, no matter how it’s computed.

P.S. Clicking through, I found this amusing article by Casper Albers, “Valid Reasons not to participate in open science practices.” As they say on the internet: Read the whole thing.

The internet has no memory. Also, I’d be happy if the terms “false discovery,” “statistical significance,” “false positive,” and “false negative” were never to be heard again.

And here’s my post from a few years ago on stopping rules and Bayesian analysis.

Robert Heinlein vs. Lawrence Summers

Thomas Ball writes:

In this article about Nabokov and the influence of John Dunne’s theories on him (and others in the period l’entre deux guerres) you can see intimations of Borges’ story The Garden of Forking Paths….

The article in question is by Nicholson Baker. Nicholson Baker! It’s great to see that he’s still writing. I feel kinda bad for him though, as to my mind he suffers from what might be called George V. Higgins syndrome: his first book was his best. Don’t get me wrong, I’m a huge fan of both Baker and Higgins; it just happens that their first books were their most characteristic as well as their best efforts.

Anyway, here’s Baker:

Speak, Memory was written, so it seems, under the influence of an aeronautical engineer and avid fly fisherman named John W. Dunne. . . . Dunne’s book, published in 1927, was called An Experiment With Time, and it went into several editions. “I find it a fantastically interesting book,” wrote H.G. Wells in a huge article in The New York Times. Yeats, Joyce, and Walter de la Mare brooded over its implications, and T.S. Eliot’s publishing firm, Faber, brought the book out in paperback in 1934, right about the time when Eliot was writing “Burnt Norton,” all about how time present is contained in time past and time future, and vice versa.

And there’s more:

Dunne’s Experiment seems to have become one of the secret wellsprings, or wormholes, of twentieth-century literature. J.B. Priestley believed that An Experiment With Time was “one of the most curious and perhaps most important books of the age,” and he built several plays around it. C.E.M. Joad, the philosopher and radio personality, said of the book: “It can be recommended to everybody who wishes to learn how to anticipate his own future.” C.S. Lewis wrote a short story, “The Dark Tower,” using Dunne’s ideas. J.R.R. Tolkien found the book helpful as he imagined Middle Earth’s elven dreamtime. Agatha Christie wrote that it gave her a “truer knowledge of serenity than I had ever obtained before.” “Everybody in England is talking about J.W. Dunne, the man who made dreams popular,” reported a newspaper columnist in 1935, though he warned that the innumerable geometrical charts would drive the reader “loco.” Robert Heinlein cited Dunne’s theory in his novella “Elsewhen” in 1941. In 1940, Jorge Luis Borges reviewed the book. “Dunne assures us that in death we will finally learn how to handle eternity,” Borges wrote. “He states that the future, with its details and vicissitudes, already exists.”

Graham Greene comes up too.

But here’s the name that really stood out in the above list: science fiction writer Robert Heinlein, who’s come up from time to time on the blog. I happen to be reading The Puppet Masters, a Heinlein novel from the 1950s that I picked up in paperback format—I looove those old-time pocket books that really fit in my pocket!—and came across this line:

“Listen, son—most women are damn fools and children. But they’ve got more range than we’ve got. The brave ones are braver, the good ones are better—and the vile ones are viler. . . .”

This quote comes out of the mouth of one of the characters, but it’s a character who’s celebrated for his wisdom, so I think it’s safe to guess that Heinlein agreed with it.

I’m not trying to pick on Heinlein for his retro social attitudes. Rather, the opposite: it’s when authors let their guard down that they reveal interesting aspects of their life and times. It’s the Speed Racer principle: Sometimes the most interesting aspect of a cultural product is not its overt content but rather its unexamined assumptions.

What struck me about the above quote is how it goes in the opposite of current received wisdom about men and women, the view, associated with former U.S. Treasury Secretary Lawrence Summers, that men are more variable than women, the “wider tails” theory, which is said to explain why there are more male geniuses and more male imbeciles, more male heroes and more male villains, etc. Heinlein’s quote above says the opposite (on the moral, not the intellectual, dimension, but I think the feeling is the same).

My point here is not to use Heinlein to shoot down Summers (or vice versa). Rather, it’s just interesting how received wisdom can change over time. What seemed like robust common sense back in the 1950s has turned around 180 degrees just a few decades later.

StanCon 2018 Helsinki tutorial videos online

StanCon 2018 Helsinki tutorial videos are now online at the Stan YouTube channel.

List of tutorials at StanCon 2018 Helsinki

  • Basics of Bayesian inference and Stan, parts 1 + 2, Jonah Gabry & Lauren Kennedy
  • Hierarchical models, parts 1 + 2, Ben Goodrich
  • Stan C++ development: Adding a new function to Stan, parts 1 + 2, Bob Carpenter, Sean Talts & Mitzi Morris
  • Ordinary differential equation (ODE) models in Stan, Daniel Lee
  • Productization of Stan, Eric Novik, Markus Ojala, Tom Nielsen, Anna Kircher
  • Model assessment and selection, Aki Vehtari

Abstracts for the tutorials are available at the conference website.

Talk videos will be edited and divided into individual talks this week.

A.I. parity with the West in 2020

Someone just sent me a link to an editorial by Ken Church, in the journal Natural Language Engineering (who knew that journal was still going? I’d have thought open access would’ve killed it). The abstract of Church’s column says of China,

There is a bold government plan for AI with specific milestones for parity with the West in 2020, major breakthroughs by 2025 and the envy of the world by 2030.

Something about that plan sounded familiar. Then I remembered the Japanese Fifth Generation project. Here’s Ehud Shapiro, writing a trip report for ACM  35 years ago:

As part of Japan’s effort to become a leader in the computer industry, the Institute for New Generation Computer Technology has launched a revolutionary ten-year plan for the development of large computer systems which will be applicable to knowledge information processing systems. These Fifth Generation computers will be built around the concepts of logic programming. In order to refute the accusation that Japan exploits knowledge from abroad without contributing any of its own, this project will stimulate original research and will make its results available to the international research community.

My Ph.D. thesis, circa 1989, was partly on logic programming, as was my first book in 1992 (this post isn’t by Andrew, just in case you hadn’t noticed). Unfortunately, by the time my book came out, the field was pretty much dead, not that it had ever really been alive in the United States. As an example of how poorly it was regarded in the U.S., my first grant proposal to the U.S. National Science Foundation, circa 1990, was rejected with a review that literally said it was “too European.”

Isaac Newton : Alchemy :: Michael Jordan : Golf

Not realizing the domain-specificity of their successes.

How to set up a voting system for a Hall of Fame?

Micah Cohen writes:

Our company is establishing a Hall of Fame and I am on a committee to help set it up which involved figuring out the voting system to induct a candidate. We have modeled it somewhat off of the voting for the Baseball Hall of Fame.

The details in short:
· Up to 40 candidates,
· 600 voters
· Each elector has up to 10 votes
· A candidate has to have 75% of the votes to get inducted

Our current projected model:
· Up to 20 candidates
· About 100-120 voters (let’s say 100)
· Each elector has up to 3 votes
· A candidate has to have 75% of the votes to get inducted

The last 2 points are the variables in question that we need help with: How many votes should each elector get and what percentage should the candidate have to have to get inducted?

We don’t want to make it too easy and don’t want to make it too hard. Our thought is to have 2-5 people inducted per year but we want to avoid having 0, no one, inducted.

We will assume that each candidate has an equal chance of being voted in so it’s not weighted at all.

My initial thought was to increase the number of votes that each elector can have to the same ratio as the baseball system (10 votes for up to 40 candidates) so we could increase to up to 5 votes for the 20 candidates. Or should we keep it at 3 candidates and decrease the percent that the candidate would need? The other factor would be if there are less than 20 candidates, let’s say 15, how would this all change?

With all that being said, is there a way to find out what the right number of votes each elector should have and what the percentage should be?

Is there a way to visually see this in a graph where we could plug in the variables to see how it would change the probability of no one being elected or how many would be elected in a year?

My reply: I have no specific ideas here but I have four general suggestions:

1. To get a sense of what numbers would work for you, I recommend simulating fake data and trying out various ideas. Your results will only be as good as your fake data; still, I think this can give a lot of insight (see the sketch just after this list).

2. No need for a fixed rule, right? I’d recommend starting with a tough threshold for getting into the hall of fame, and then if not enough people are getting inducted each year, you can loosen your rules. This only works in one direction: If your rules are too loose, you can’t retroactively tighten them and exclude people you’ve already honored. But if you start out too tight, it shouldn’t be hard to loosen up.

3. This still leaves the question of how to set the thresholds for the very first year: if you set the bar too high at first, you could well have zero people inducted. One way you could handle this is to put the percentage threshold on a sliding scale so that if you have just 0 or 1 people passing, you lower the threshold until at least 2 people get in.

4. Finally, think about the motivations of voters, strategic voting, etc. The way the rules are set up can affect how people will vote.
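Here’s the kind of fake-data simulation I have in mind for suggestion 1. The voter-behavior model (a latent popularity for each candidate plus voter-specific noise) is entirely my own assumption; the point is just that once you write down any such model, you can read off the probability that nobody gets inducted and the expected number of inductees under different thresholds.

```python
# Fake-data sketch for the Hall of Fame voting rules; the voter model is assumed.
import numpy as np

rng = np.random.default_rng(4)

def simulate(n_candidates=20, n_voters=100, votes_per_voter=3,
             threshold=0.75, voter_noise=1.0, n_sims=2000):
    counts = []
    for _ in range(n_sims):
        popularity = rng.normal(0, 1, n_candidates)   # latent appeal of each candidate
        # Each voter perceives popularity with idiosyncratic noise and votes for their top picks.
        perceived = popularity + rng.normal(0, voter_noise, (n_voters, n_candidates))
        top = np.argsort(-perceived, axis=1)[:, :votes_per_voter]
        votes = np.bincount(top.ravel(), minlength=n_candidates)
        counts.append(int(np.sum(votes / n_voters >= threshold)))
    return np.array(counts)

for thr in (0.75, 0.60, 0.50):
    counts = simulate(threshold=thr)
    p_nobody = np.mean(counts == 0)
    print(f"threshold {thr:.0%}: P(nobody inducted) = {p_nobody:.2f}, "
          f"mean inducted per year = {counts.mean():.2f}")
```

Swapping in different numbers of candidates, voters, votes per voter, and amounts of voter agreement gives the kind of what-if numbers (and graphs) the question asks about.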

Hey—take this psychological science replication quiz!

Rob Wilbin writes:

I made this quiz where people try to guess ahead of time which results will replicate and which won’t in order to give them a more nuanced understanding of replication issues in psych. Based on this week’s Nature replication paper.

It includes quotes and p-values from the original study if people want to use them, and we offer some broader lessons on what kinds of things replicate and which usually don’t.

You can try the quiz yourself. Also, I have some thoughts about the recent replication paper and its reception—I’m mostly happy with how the paper was covered in the news media, but I think there are a few issues that were missed. I’ll discuss that another time. For now, enjoy the quiz.

John Hattie’s “Visible Learning”: How much should we trust this influential review of education research?

Dan Kumprey, a math teacher at Lake Oswego High School, Oregon, writes:

Have you considered taking a look at the book Visible Learning by John Hattie? It seems to be permeating and informing reform in our K-12 schools nationwide. Districts are spending a lot of money sending their staffs to conferences by Solution Tree to train their schools to become PLC communities which also use an RTI (Response To Intervention) model. Their powerpoint presentations prominently feature John Hattie’s work. Down the chain, then, if all of these school districts attending are like mine, their superintendents, assistant superintendents, principals, and vice principals are constantly quoting John Hattie’s work to support their initiatives, because they clearly see it as a powerful tool.

I am asking not as a proponent or opponent of Hattie’s work. I’m asking as a high school math teacher who found that there does not seem to have been much critical analysis of his work (except by Arne Kåre Topphol and Pierre-Jérôme Bergeron, as far as I can tell from a cursory search.) This seems strange given its ubiquitous impact on educational leaders’ plans for district and school-wide changes that affect many students and teachers. An old college wrestling teammate of mine, now a statistician, encouraged me to ask you about this.

The reason educational leaders have latched onto this book so much, I believe, is Hattie’s synthesis of over 1,000 meta-analyses. This is, no doubt, a very appealing thing. I’m glad to see educational leaders using data to inform their decisions, but I’m not glad to see them treating it as an educational research bible, of sorts. I wonder about the statistical soundness (and hence value) of synthesizing so many studies of so many designs. I wonder about a book where there’s only two statistics primarily used, one of them incorrectly. And, finally, I wonder about these things b/c this book is functioning as fuel for educational Professional Development conferences over multiple years in multiple states (i.e., it’s a significant component in a very profitable market) as well as the primary resource used by administrators in individual districts to affect change, often without teachers as change-agents. Regardless of these concerns, I also appreciate conversations the book elicits, and am open to the notion that perhaps there are some sound statistical conclusions from the book, ignoring Hattie’s misuse of the CLE stats. (Similarly, I should note, I like a lot about the RTI model that Solution Tree teaches/sells.) I’m sending you this email from a place of curiosity, not of cynicism.

My reply: I’ve not heard of this book by Hattie. I’m setting this down here as a placeholder, and if I have a chance to look at the Hattie book before the scheduled posting date, six months from now, I’ll give my impressions below. Otherwise, maybe some of you commenters know something about it?

“Identification of and correction for publication bias,” and another discussion of how forking paths is not the same thing as file drawer

Max Kasy and Isaiah Andrews sent along this paper, which begins:

Some empirical results are more likely to be published than others. Such selective publication leads to biased estimates and distorted inference. This paper proposes two approaches for identifying the conditional probability of publication as a function of a study’s results, the first based on systematic replication studies and the second based on meta-studies. For known conditional publication probabilities, we propose median-unbiased estimators and associated confidence sets that correct for selective publication. We apply our methods to recent large-scale replication studies in experimental economics and psychology, and to meta-studies of the effects of minimum wages and de-worming programs.

I sent them these comments:

1. This recent discussion might be relevant. My quick impression is that this sort of modeling (whether parametric or nonparametric; I can see the virtues of both approaches) could be useful in demonstrating the problems of selection in a literature, and setting some lower bound on how bad the selection could be, but not so valuable if one tried to use it to come up with some sort of corrected estimate. In that way, it’s similar to our work on type M and type S errors, which are more of a warning than a solution. I think your view may be similar, in that you mostly talk about biases, without making strong claims about the ability to use your method to correct these biases.

2. In section 2.1 you have your model where “a journal receives a stream of studies” and then decides which one to publish. I’m not sure how important this is for your mathematical model, but it’s my impression that most of the selection occurs within papers: a researcher gets some data and then has flexibility to analyze it in different ways, to come up with statistically significant conclusions or not. Even preregistered studies are subject to a lot of flexibility in interpretation of data analysis; see for example here.

3. Section 2.1 of this paper may be of particular interest to you as it discusses selection bias and the overestimation of effect sizes in a study that was performed by some economists. It struck me as ironic that economists, who are so aware of selection bias in so many ways, have been naive in taking selected point estimates without recognizing their systematic problems. It seems that, for many economists, identification and unbiased estimation (unconditional on selection) serve as talismans which provide a sort of aura or blessing to an entire project, allowing the researchers to turn off their usual skepticism. Sad, really.

4. You’ll probably agree with this point too: I think it’s a terrible attitude to say that a study replicates “if both the original study and its replication find a statistically significant effect in the same direction.” Statistical significance is close to meaningless at the best of times, but it’s particularly ridiculous here, in that using such a criterion is just throwing away information and indeed compounding the dichotomization that causes so many problems in the first place.

5. I notice you cite the paper of Gilbert, King, Pettigrew, and Wilson (2016). I’d be careful about citing that paper, as I don’t think the authors of that paper knew what they were doing. I think that paper was at best sloppy and misinformed, and at worst a rabble-rousing, misleading bit of rhetoric, and I don’t recommend citing it. If you do want to refer to it, I really think you should point out that its arguments are what academics call “tendentious” and what Mark Twain called “stretchers,” and you could refer to this post by Brian Nosek and Elizabeth Gilbert explaining what was going on. As I’m sure you’re aware, the scientific literature gets clogged with bad papers, and I think we should avoid citing bad papers uncritically, even if they appear in what are considered top journals.

Kasy replied:

1) We wouldn’t want to take our corrections too literally, either, but we believe there is some value in saying “if this is how selection operates, this is how you would perform corrected frequentist inference (estimation & confidence sets)”
Of course our model is not going to capture all the distortions that are going on, and so the resulting numbers should not be taken as literal truth, necessarily.

2) We think of our publication probability function p(.) as a “reduced form” object which is intended to capture selectivity by both researchers and journals. Our discussion is framed for concreteness in terms of journal decisions, but for our purposes journal decisions are indistinguishable from researcher decisions.
In terms of assessing the resulting biases, I think it doesn’t really matter who performs the selection?

3) Agreed; there seems to be some distortion in the emphasis put by empirical economists.
Though we’d also think that the focus on “internal validity” has led to improvements in empirical practice relative to earlier times?

4) Agree completely. We attempt to clarify this point in Section 3.3 of our paper.
The key point to us seems to be that, depending on the underlying distribution of true effects across studies, pretty much any value for “Probability that the replication Z > 1.96, given that the original Z > 1.96” is possible (with a lower bound of 0.025), even in the absence of any selectivity.

Regarding Kasy’s response to point 2, “it doesn’t really matter who performs the selection?”, I’m not so sure.

Here’s my concern: in the “file drawer” paradigm, there’s some fixed number of results that are streaming through, and some get selected for publication. In the “forking paths” paradigm, there’s a virtually unlimited number of possible findings, so it’s not clear that it makes sense to speak of a latent distribution of results.

One other thing: in the “file drawer” paradigm, the different studies are independent. In the “forking paths” paradigm, the different possible results are all coming from the same data so they don’t represent so much additional information.
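For what it’s worth, here’s a toy version of that file-drawer setup (my own numbers, not the paper’s): a stream of studies, each much more likely to be published when the result is statistically significant, and the resulting bias in the published estimates.

```python
# Toy file-drawer selection: publication probability depends on significance.
import numpy as np

rng = np.random.default_rng(5)
n_studies = 100_000
true_effects = rng.normal(0.1, 0.1, n_studies)   # assumed latent distribution of true effects
se = 0.15
estimates = rng.normal(true_effects, se)

significant = np.abs(estimates / se) > 1.96
pub_prob = np.where(significant, 1.0, 0.1)       # insignificant results mostly stay in the drawer
published = rng.random(n_studies) < pub_prob

print("mean true effect (all studies):", round(true_effects.mean(), 3))
print("mean published estimate:       ", round(float(estimates[published].mean()), 3))
print("mean unpublished estimate:     ", round(float(estimates[~published].mean()), 3))
```

Under the forking-paths story, the analogous simulation is much harder to even write down, which is part of the point above.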

3 recent movies from the 50s and the 70s

I’ve been doing some flying, which gives me the opportunity to see various movies on that little seat-back screen. And some of these movies have been pretty good:

Logan Lucky. Pure 70s. Kinda like how Stravinsky did those remakes of Tchaikovsky etc. that were cleaner than the originals, Soderbergh, in Logan Lucky and earlier in The Limey, recreated that Seventies look and feel. The Limey had the visual style, the washed-out look of the L.A. scenes in all those old movies. Logan Lucky had the 70s-style populist thing going, Burt Reynolds, Caddyshack, the whole deal.

La La Land. I half-watched it—I guess I should say, I half-listened to it, on the overnight flight. I turned it on, plugged myself in, and put on the blindfold so I could sleep. A couple times I woke up in the middle of the night and restarted it. Between these three blind viewings, I pretty much heard the whole thing. On the return flight I actually watched the damn thing and then the plot all made sense. It was excellent, just beautiful. The actual tunes were forgettable, but maybe that was part of the design. Like Logan Lucky, this was a retro movie—in this case, from the Fifties—but better than the originals on which it was modeled.

Good Time. I’d never heard of this one. This was the most intense movie I’ve ever seen. Also pure 70s, but not like Logan Lucky, more like a cross between The French Connection and Dog Day Afternoon. Almost all the action takes place in Queens. Really intense—did I say that already?

StanCon Helsinki streaming live now (and tomorrow)

We’re streaming live right now!

Timezone is Eastern European Summer Time (EEST) +0300 UTC

Here’s a link to the full program [link fixed].

There have already been some great talks and they’ll all be posted with slides and runnable source code after the conference on the Stan web site.

Some clues that this study has big big problems

Paul Alper writes:

This article from the New York Daily News, reproduced in the Minneapolis Star Tribune, is so terrible in so many ways. Very sad commentary regarding all aspects of statistics education and journalism.

The news article, by Joe Dziemianowicz, is called “Study says drinking alcohol is key to living past 90,” with subheading, “When it comes to making it into your 90s, booze actually beats exercise, according to a long-term study,” and it continues:

The research, led by University of California neurologist Claudia Kawas, tracked 1,700 nonagenarians enrolled in the 90+ Study that began in 2003 to explore impacts of daily habits on longevity.

Researchers discovered that subjects who drank about two glasses of beer or wine a day were 18 percent less likely to experience a premature death, the Independent reports.

Meanwhile, participants who exercised 15 to 45 minutes a day cut the same risk by 11 percent. . . .

Other factors were found to boost longevity, including weight. Participants who were slightly overweight — but not obese — cut their odds of an early death by 3 percent. . . .

Subjects who kept busy with a daily hobby two hours a day were 21 percent less likely to die early, while those who drank two cups of coffee a day cut that risk by 10 percent.

At first, this seems like reasonable science reporting. But right away there are a couple flags that raise suspicion, such as the oddly specific “15 to 45 minutes a day”—what about people who exercise more or less than that?—and the bit about “overweight — but not obese.” It’s harder than you might think to estimate nonlinear effects. In this case the implication is not just nonlinearity but nonmonotonicity, and I’m starting to worry that the researchers are fishing through the data looking for patterns. Data exploration is great, but you should realize that you’ll be dredging up a lot of noise along with your signal. As we’ve said before, correlation (in your data) does not even imply correlation (in the underlying population, or in future data).
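Here’s a little simulation (mine, not based on the 90+ Study’s data) of what that dredging looks like: with an outcome that is pure noise and a few dozen yes/no habits, some apparent “effects” look notable in one sample and vanish in a fresh one.

```python
# Fishing across many habit categories with a pure-noise outcome.
import numpy as np

rng = np.random.default_rng(6)
n, n_habits = 1700, 40                       # roughly the 90+ Study's sample size
habits = rng.integers(0, 2, (n, n_habits))   # fake yes/no daily habits
outcome = rng.normal(0, 1, n)                # outcome unrelated to any habit

def habit_effects(habits, outcome):
    # difference in mean outcome between people with and without each habit
    return np.array([outcome[habits[:, j] == 1].mean() -
                     outcome[habits[:, j] == 0].mean() for j in range(habits.shape[1])])

effects = habit_effects(habits, outcome)
top = np.argsort(-np.abs(effects))[:3]
print("largest apparent effects:", np.round(effects[top], 3))

# "Replicate" with fresh data: the same habits show nothing special.
new_outcome = rng.normal(0, 1, n)
print("same habits, fresh data: ", np.round(habit_effects(habits, new_outcome)[top], 3))
```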

The claims produced by the 90+ Study can also be criticized on more specific grounds. Alper points to this news article by Michael Joyce, who writes:

their survey [found] that drinking the equivalent of two beers or two glasses of wine per day was associated with 18% fewer deaths, it also found that daily exercise of around 15 to 45 minutes was only associated with 11% fewer premature deaths.

TechTimes opted to blend these two findings into a single whopper of a headline:

Drinking Alcohol Helps Better Than Exercise If You Want To Live Past 90 Years Old

Not only is this language unjustified in referring to a study that can only show association, not causation, but the survey did not directly compare alcohol and exercise. So the headline is very misleading. . . .

Other reported findings of the study included:

– being slightly overweight (not obese) was associated with 3% fewer early deaths

– being involved in a daily hobby two hours a day was associated with a 21 % lower rate of premature deaths

– drinking two cups of coffee a day was associated with a 10% lower rate of early death

But these are observations and nothing more. Furthermore, they are based on self-reporting by the study subjects. That’s a notoriously unreliable way to get accurate information regarding people’s daily habits or behaviors.

Just after we published this piece we heard back from Dr. Michael Bierer, MD, MPH — one of our regular contributors — who we had reached out to for comment . . .:

Observational studies that demonstrate benefits to people engaged in a certain activity — in this case drinking — are difficult to do well. That’s because the behavior in question may co-vary with other features that predict health outcomes.

For example, those who abstain from alcohol completely may do so for a variety of reasons. In older adults, perhaps that reason is taking a medication that makes alcohol dangerous; such as anticoagulants, psychotropics, or aspirin. So not drinking might be a marker for other health conditions that themselves are associated — weakly or not-so-weakly — with negative outcomes. Or, abstaining may signal a history of problematic drinking and the advice to cut back. Likewise, there are many health conditions (like liver disease) that are reasons to abstain.

Conversely, moderate drinking might be a marker for more robust health. There is an established link between physical activity and drinking alcohol. People who take some alcohol may simply have more social contacts than those who abstain, and pro-social behaviors are linked to health.

P.S. I’d originally titled this post, “In Watergate, the saying was, ‘It’s not the crime, it’s the coverup.’ In science reporting, it’s not the results, it’s the hype.” But I changed the title to avoid the association with criminality. One thing I’ve said a lot is that, in science, honesty and transparency are not enough: You can be a scrupulous researcher, but if your noise overwhelms your signal and you’re using statistical methods (such as selection on statistical significance) that emphasize and amplify noise, you can end up with junk science. Which, when put through the hype machine, becomes hyped junk science. Gladwell bait. Freakonomics bait. NPR bait. PNAS bait.

So, again:

(1) If someone points out problems with your data and statistical procedures, don’t assume they’re saying you’re dishonest.

(2) If you are personally honest, just trying to get at the scientific truth, accept that concerns about “questionable research practices” might apply to you too.

Old school

Maciej Cegłowski writes:

About two years ago, the Lisp programmer and dot-com millionaire Paul Graham wrote an essay entitled Hackers and Painters, in which he argues that his approach to computer programming is better described by analogies to the visual arts than by the phrase “computer science”.

When this essay came out, I was working as a computer programmer, and since I had also spent a few years as a full-time oil painter, everybody who read the article and knew me sent along the hyperlink. I didn’t particularly enjoy the essay . . . but it didn’t seem like anything worth getting worked up about. Just another programmer writing about what made him tick. . . .

But the emailed links continued, and over the next two years Paul Graham steadily ramped up his output while moving definitively away from subjects he had expertise in (like Lisp) to topics like education, essay writing, history, and of course painting. Sometime last year I noticed he had started making bank from an actual print book of collected essays, titled (of course) “Hackers and Painters”. I felt it was time for me to step up.

So let me say it simply – hackers are nothing like painters.

Cegłowski continues:

It’s surprisingly hard to pin Paul Graham down on the nature of the special bond he thinks hobbyist programmers and painters share . . . The closest he comes to a clear thesis statement is at the beginning “Hackers and Painters”:

[O]f all the different types of people I’ve known, hackers and painters are among the most alike. What hackers and painters have in common is that they’re both makers.

To which I’d add, what hackers and painters don’t have in common is everything else.

Ouch. Cegłowski continues:

The fatuousness of the parallel becomes obvious if you think for five seconds about what computer programmers and painters actually do.

– Computer programmers cause a machine to perform a sequence of transformations on electronically stored data.

– Painters apply colored goo to cloth using animal hairs tied to a stick.

It is true that both painters and programmers make things, just like a pastry chef makes a wedding cake, or a chicken makes an egg. But nothing about what they make, the purposes it serves, or how they go about doing it is in any way similar.

Start with purpose. With the exception of art software projects (which I don’t believe Graham has in mind here) all computer programs are designed to accomplish some kind of task. Even the most elegant of computer programs, in order to be considered a program, has to compile and run . . .

The only objective constraint a painter has is making sure the paint physically stays on the canvas . . .

Why does Graham bring up painting at all in his essay? Most obviously, because Graham likes to paint, and it’s natural for us to find connections between different things we like to do. But there’s more to it: as Cegłowski discusses, painting also has a certain street-cred (he talks about it in terms of what can “get you laid,” but I think it’s more general than that). So if someone says that what he does is kinda like painting, I do think that part of this is an attempt to share in the social status that art has.

Cegłowski’s post is from 2005, and it’s “early blogging” in so many ways, from the length and tone, to the references to old-school internet gurus such as Paul Graham and Eric Raymond, to the occasional lapses in judgment. (In this particular example, I get off Cegłowski’s train when he goes on about Gödel, Escher, Bach, a book that I positively hate, not so much for itself as for how overrated it was.)

Old-school blogging. Good stuff.

“To get started, I suggest coming up with a simple but reasonable model for missingness, then simulate fake complete data followed by a fake missingness pattern, and check that you can recover your missing-data model and your complete data model in that fake-data situation. You can then proceed from there. But if you can’t even do it with fake data, you’re sunk.”

Alex Konkel writes on a topic that never goes out of style:

I’m working on a data analysis plan and am hoping you might help clarify something you wrote regarding missing data. I’m somewhat familiar with multiple imputation and some of the available methods, and I’m also becoming more familiar with Bayesian modeling like in Stan. In my plan, I started writing that we could use multiple imputation or Bayesian modeling, but then I realized that they might be the same. I checked your book with Jennifer Hill, and the multiple imputation chapter says that if outcome data are missing (which is the only thing we’re worried about) and you use a Bayesian model, “it is trivial to use this model to, in effect, impute missing values at each iteration”. Should I read anything into that “in effect”? Or is the Bayesian model identical to imputation?

A second, trickier question: I expect that most of our missing data, if we have any, can be considered missing at random (e.g., the data collection software randomly fails). One of our measures, though, is the participant rating their workload at certain times during task performance. If the participant misses the prompt and fails to respond, it could be that they just missed the prompt and nothing unusual was happening, and so I would consider that data to be missing at random. But participants could also fail to respond because the task is keeping them very busy, in which case the data are not missing at random but in fact reflect a high workload (and likely higher than other times when they do respond). Do you have any suggestions on how to handle this situation?

My reply:

to paragraph 1: Yes, what we’re saying is that if you just exclude the missing cases and fit your model, this is equivalent to the usual statistical inference, assuming missing at random. And then if you want inference for those missing values, you can impute them conditional on the fitted model. If the outcome data are missing not at random, though, then more modeling must be done.

to paragraph 2: Here you’d want to model this. You’d have a model, something like logistic regression, where probability of missingness depends on workload. This could be fit in a Stan model. It could be that strong priors would be needed; you can run into trouble trying to fit such models from data alone, as such fits can depend uncomfortably on distributional assumptions.

Konkel continues:

For the Bayesian/imputation part: Is there a difference between what you describe, which I think would be what’s described in 11.1 of the Stan manual, and trying to combine the missing and non-missing data into a single vector, more like 11.3? We wouldn’t be interested in the value of the missing data, per se, so much as maximizing our number of observations in order to examine differences across experimental conditions.

For the workload part: I understand what you’re saying in principle, but I’m having a little trouble imagining the implementation. I would model the probability of missingness depending on workload, but the data that’s missing is the workload! We’re running an experiment, so we don’t have a lot of other predictors with which to model/stand in for workload. Essentially we just have the experimental condition, although maybe I could jury-rig more predictors out of the details of the experiment in the moment when the workload response is supposed to occur…

My reply:

Those sections of the Stan manual are all about how to fold missing data into larger data structures. We’re working on allowing missing data types in Stan so that this hashing is done automatically—but until that’s done, yes, you’ll need to do this sort of trick to model missing data.

But in my point 1 above, if your missingness is only in the outcome variable in your regression, and you assume missing at random, then you can simply exclude the missing cases entirely from your analysis, and then you can impute them later in the generated quantities block. Then it’s there in the generated quantities block that you’ll combine the observed and imputed data.

For the workload model: Yes, you’re modeling the probability of missingness given something that’s missing in some cases. That’s why you’d need a model with informative priors. From a Bayesian standpoint, the model is fit all at once, so it can all be done.

To get started, I suggest coming up with a simple but reasonable model for missingness, then simulate fake complete data followed by a fake missingness pattern, and check that you can recover your missing-data model and your complete data model in that fake-data situation. You can then proceed from there. But if you can’t even do it with fake data, you’re sunk.
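Here’s a first step of that fake-data check, sketched for the workload example above. The numbers and the logistic missingness model are my own assumptions; the idea is just to simulate complete data, impose a missingness pattern that depends on the workload itself, and see what a naive complete-case analysis gives before trying to recover the truth with a full model.

```python
# Fake complete data plus a fake missingness pattern for the workload example.
import numpy as np

rng = np.random.default_rng(7)
n = 2000
condition = rng.integers(0, 2, n)                 # experimental condition
workload = rng.normal(5 + 2 * condition, 1.5)     # fake complete workload ratings

# Missingness model: busier participants are more likely to miss the prompt.
p_miss = 1 / (1 + np.exp(-(workload - 7)))        # logistic in the workload itself (not at random)
observed = rng.random(n) > p_miss

print("true mean workload:          ", round(float(workload.mean()), 2))
print("complete-case mean workload: ", round(float(workload[observed].mean()), 2))
print("share of prompts missed:     ", round(1 - observed.mean(), 2))
# The gap between the two means is why the missingness needs to be modeled
# (e.g., in Stan with informative priors); recovering the truth from the observed
# data alone is the next step of the fake-data check.
```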

Bayesian model comparison in ecology

Conor Goold writes:

I was reading this overview of mixed-effect modeling in ecology, and thought you or your blog readers may be interested in their last conclusion (page 35):

Other modelling approaches such as Bayesian inference are available, and allow much greater flexibility in choice of model structure, error structure and link function. However, the ability to compare among competing models is underdeveloped, and where these tools do exist, they are not yet accessible enough to non-experts to be useful.

This strikes me as quite odd. The paper discusses model selection using information criteria and model averaging in quite some detail, and it is confusing that the authors dismiss the Bayesian analogues (I presume they are aware of DIC, WAIC, LOO etc. [see chapter 7 of BDA3 and this paper — ed.]) as being ‘too hard’ when parts of their article would probably also be too hard for non-experts.

In an area in which small sample sizes are common, I’d argue that effort to explain Bayesian estimation in hierarchical models would have been very worthwhile (e.g. estimation of variance components, more accurate estimation of predictor coefficients using informative priors/variable selection).

In general, I find the ‘Bayesian reasoning is too difficult for non-experts’ argument pretty tiring, especially when it’s thrown in at the end of a paper like this!

Along these lines, I used to get people telling me that I couldn’t use Bayesian methods for applied problems because people wouldn’t stand for it. Au contraire, I’ve used Bayesian methods in many different applied fields for a long time, ever since my first work in political science in the 1980s, and nobody’s ever objected to it. If you don’t want to use some statistical method (Bayesian or otherwise) cos you don’t like it, fine; give your justification and go from there. But don’t ever say not to use a method out of a purported concern that some third party will object. That’s so bogus. Stand behind your own choices.