10. On the OSF/OkCupid Data Dump: A Batman Analogy

Today, my worlds collide: I’m blogging about sexual/relationship science, scientific methodology, and Batman–three things I love talking about. I never thought that a cause to discuss the three, simultaneously, would manifest in a single issue. But here we are:

The post above was retweeted by a Twitter buddy, and my first reaction was likely the same one had by many sexual/relationship scientists: “You mean I can finally use people’s answers to all those insanely interesting sexuality questions on OKC in my research!?” The result:


But then I went to the OSF repository where the data are located to read the authors’ description/justification of how the data were collected (tl;dr the authors wrote a program to “scrape” user content, after accessing OkCupid), and my elation quickly began to fade.

For one, usernames and locations were included in the dataset; users, in other words, were wholly identifiable. Already, alarm bells were beginning to sound in my head, and then I read this:

Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.


In other words, the authors attempt to pass off their decision to share all this personalized data in an open repository as an ethically trivial matter. And before you leap to accuse me of putting words in the authors’ mouths, do note that they opted to use the term “merely”, an adverb that, in this case, communicates the data-dump is just a small/insignificant matter of openly sharing data–a scientific value the authors espouse, at length, earlier in their document–and nothing more (i.e., no ethical dilemma here worth talking about).

As the Twitterverse is clearly demonstrating, however, there is nothing “mere” about this case of open data sharing on the OSF. Before launching into what the heck this has to do with Batman, I’d like to first give some credit where credit is due, to Emily G on Twitter. Check out her tweets (many more, beyond the one below) and blog about this issue–so much of it is spot on.

The OKC-OSF Data Dump, Analogized as the Plot of The Dark Knight

If you’ve never seen The Dark Knight, this analogy is likely to fall flat (I’m not going to go into enough detail of the plot to catch you up to speed)–so skip this section. But if you have seen The Dark Knight, then hear me out: I think this situation maps pretty well onto the major plot points of The Dark Knight.

The OSF in this analogy is Harvey Dent, the charming DA who has his heart set on one thing, and one thing only–bringing justice to Gotham (okay, and maybe Maggie Gyllenhaal, though I never understood why they replaced Katie Holmes with her…). Likewise, the OSF seems intent on achieving one goal that is similarly broad, but of crucial importance–increasing the transparency of science.

Harvey Dent: Champion of Justice…and Open Science

Throughout the movie, Dent represents a beacon of hope for the city of Gotham to achieve justice through totally legitimate means; he nearly single-handedly puts most of the mob in jail at one point, and legally (vs. Batman’s vigilante antics). But this glorification comes at a price: it is understood that Dent can never afford to become morally or legally compromised, lest it undo all that he laboured for. As the Mayor of Gotham describes:

The public likes you. That’s the only reason that this might fly, but it’s all on you. They’re all gonna come after you, now. Politicians, journalists, cops. Anyone whose wallet is about to get tight. Are you up for it? You better be. Cause if they get anything on you, those criminals are put back on the streets, followed swiftly by you and me.

So if the OSF is analogous to Dent, I see the battle for justice in Gotham as analogous to the battle for transparency in science. The OSF is such a powerful force in the open science movement, and the COS folks have worked hard to create a case for open data, methods, and preregistration, while building safeguards into the OSF (e.g., private components, embargoes on revealing preregistrations, etc.) to allay the concerns of those who are skeptical/cynical of open science. But given how central the OSF has become to open science methods, like Dent, I feel as though the OSF cannot afford to become compromised by pursuing open science (in this case, open data) by any (read: unethical) means.

The consequences of such a moral blow could be severe to the broader open science movement. Dent, for example, loses his mind when his girlfriend is killed by the mob, and goes on a murderous rampage–after some encouragement from the Joker–in order to get even with those he feels are responsible. But Dent is eventually killed, and Batman is left with a decision to make: either take the blame for all of Dent’s crimes, or suffer all of the hard-earned justice against the mob being undone. Batman chooses the former while discussing the matter with Jim Gordon:

Commissioner James Gordon: The Joker won. All of Harvey’s prosecutions, everything he fought for: undone. Whatever chance you gave us of fixing our city dies with Harvey’s reputation. We bet it all on him. The Joker took the best of us and tore him down. People will lose hope.

Batman: They won’t. They must never know what he did.

Commissioner James Gordon: Five dead, two of them cops—you can’t sweep that up.

Batman: But the Joker cannot win. Gotham needs its true hero.

We are betting a lot on the OSF, and the open-science movement in psychology can ill afford its most substantial contributor to be compromised by aiding and abetting those attempting to unethically share data that is not theirs to share. I don’t care that it makes cool data “open”; the cost to other cornerstone scientific values–ethics in data collection–is just too steep. Opponents of open science will simply point to such a large ethical failing as evidence for why science needs to stay closed, and if we don’t handle this ethical dilemma appropriately, they may have a point.

So the OKC-OSF Data-Dump is Unethical? 

Some think the data-dump isn’t unethical; others think that it could have been ethical, provided certain stipulations were met (e.g., if data were de-identified beforehand); I take a more morally rigid position–the dataset was unethically collected and shared, and should be removed from the OSF now.

Why? Here’s a growing list–I’ll be sure to update this as I think about the issue more.

1. OKC users didn’t consent to their data being used in this way, or shared. And actually, elements of the OKC Terms & Conditions (T&C) and Privacy Policy (PP) seem to suggest quite the opposite. Consider the following excerpts from both (emphasis added by me):

The nature of this Website promotes the sharing of personal information by users with other users

By accessing this Website, you agree to use any personal information provided to you by other users of this Website in a lawful and responsible manner. You further agree that you will not use personal information about other users of this Website for any reason without the express prior consent of the user that has provided such information to you.

Okay, so the authors of the data-dump shared the personal information of others with a helluva lot more than just other OKC users; they didn’t seek IRB approval (more on that later) and seem rather flippant about whether they have violated any terms of use for OKC’s data (see tweet below), which suggests they did not use this personal information responsibly (and perhaps not legally either); and I’m certain they did not receive the express prior consent of all the folks whose data they just shared with the world.

But it gets better (err, worse), once you read the following from the PP:

We do not share your personal information with others except as indicated in this Privacy Policy or when we inform you and give you an opportunity to opt out of having your personal information shared. We may share personal information with…

The PP goes on to list a number of groups, none of whom are affiliates of the authors of the data-dump. So OKC users signed up for an online dating service, and in doing so, consented to only share their data with other users of the service and OKC and their affiliates–and could opt out if desired–but the authors of the data-dump thought that it was fine to ignore what the OKC Users consented to, and instead shared the data with the world, which OKC Users did not consent to. So not cool. To me, this is argument #1, 2, 3, 4 & 5 in any debate on the ethics of this data-dump–I find it hard to even get past this one issue in order to humour other elements of the ethics at play here…

2. OKC did not consent to having its company’s data used this way. Not that I care too much about the wellbeing of OKC as a company (I’m much more concerned about the wellbeing of the users), but judging by the T&C, it looks as though the dumping and use of this data will put the authors (and data-analyzers) on the wrong side of OKC’s legal team:

So long as you comply with these Terms of Use, you are authorized to access, use and make a limited number of copies of information and materials available on this Website only for purposes of your personal use in order to learn more about Humor Rainbow or its products and services, or to otherwise communicate with Humor Rainbow or utilize its services. Any copies made by you must retain without modification any and all copyright notices and other proprietary marks. The pages and content on this Website may not be copied, distributed, modified, published, or transmitted in any other manner, including use for creative work or to sell or promote other products. Violation of this restriction may result in infringement of intellectual property and contractual rights of Humor Rainbow or third parties which is prohibited by law and could result in substantial civil and criminal penalties.

Unless you can argue that an academic paper is neither a “creative work” (doubtful), nor a “product” (especially doubtful once publishers take over copyright and start marketing/selling your paper), it seems pretty clear that the data-dump (and use of the data) constitute foul play according to OKC’s T&C.

3. This disclosure could lead to very real harm, and it doesn’t matter that somebody other than psychologists could have found/disclosed these data. Alex Etz and Emily G (who also brings in the point on the consent of OKC Users being ignored) cut to the heart of this matter, with their tweets:

In response to the possibility of harm being done to vulnerable OKC users, some have argued, “Well, someone who wants to do those people harm could have created a profile and looked up that information themselves…”. Very true. However, (1) that isn’t addressing the matter of whether the data-dump was ethical, per se; (2) now the process of seeking out and doing harm to vulnerable OKC users has been made a little easier, because all of those profiles have been aggregated in a neat-and-tidy dataset; and (3) I am reluctant to get behind the idea that psychologists should feel okay profiting and advancing their careers based on a dataset that is ethically marred, only to espouse an attitude of, “¯\_(ツ)_/¯ well, if we don’t analyze this data, someone else will.”

To me, this element of the ethics of the OKC-OSF data-dump feels all too close to the APA Torture Scandal, during which psychologists offered a similar justification for their involvement in the CIA torture of terror suspects. I’d like psychology, as a discipline, to aspire to a higher moral standard.

4. There was no IRB involvement, and a ton of conflict of interest. As far as I can tell from the authors’ document, and the Twitter discourse surrounding the dataset, there was no IRB involvement in vetting the process of scraping and sharing the data. I welcome being corrected on this point if I am wrong, but if I am not, this is simply appalling research conduct for the year 2016. The problems of consent and possible harm are so clear, and the legal standing of this data-dump is so hazy, that getting an IRB to vet your proposed research seems about as close to an ethics no-brainer as it gets. But should IRBs fail, at least journal editors are able to act as a last bastion for vetting the ethical conduct of research that is to appear in their journals… Except that in this case, the authors published the notice of their data-dump in a journal where one of the authors is the Editor. Fantastic. In other words, there appears to have been no oversight or impartial third-party chain of accountability to attest that these data were collected ethically.

Look, I get feeling excited about a research idea–especially when you are about to tap into a source of data that no one has used yet–and wanting to jump into data collection and analysis as quickly as possible. But in my opinion, it is in these sorts of novel data collection efforts that IRB-oversight is the most important. In the coming weeks, for example, I am going to be pre-registering and announcing a data collection effort for a new study; when I described the idea to my advisor, she said that it “sounds crazy”. And it is. So what did we do? We met with someone who used to serve on the IRB to talk about what sort of issues we should be mindful of in collecting the sensitive data we would be attempting to collect. And then I spent over a month working on the most difficult IRB application I’ve ever had to prepare. A couple of months in IRB limbo, and guess what? We are finally IRB approved; new and exciting studies can receive IRB approval–it just may take some time.

The authors list “open science” as a keyword of their paper, but they clearly fail to grasp that true open science is transparent at all stages of research–including the evaluation of ethics. Open scientists should strive to make their research process transparent from start to finish; picking and choosing what phases of science to be “open” during seems no better than p-hacking.

So What Now? 

As of now, it appears as though some steps have been taken to put the dumped OKC data behind a layer of protection on the OSF:

But if I had my druthers, the OSF would remove this datafile now, before any more external pressures (e.g., OKC lawyering up to tackle the authors/the OSF) can be applied to make the OSF look reactively, as opposed to proactively, ethical. Further distribution of this dataset compromises the Harvey Dent-ness of the OSF, and invites serious questions about the merits of an open science movement that is willing to compromise ethics in order to get more data for psychologists to analyze. I don’t want that. I like the OSF serving as the White Knight of the open science movement in psychology–and in other disciplines too. And I worry that if the OSF doesn’t take a strong stand on the OKC data-dump and remove it now, in its entirety, regardless of what protections the authors are willing to put in place post-hoc, then the OSF will have condoned and thereby incentivized a system of open-data in which researchers collect and post data, and ask questions about the ethics of doing so after. That is not the version of open science that I signed up for.

It’s like my former graduate instructor Chris Crandall used to say: there are many different values involved in science. Sometimes they are aligned, but oftentimes they compete, and so there are trade-offs to any approach of uncovering and communicating scientific findings. In the case of the OKC-OSF Data Dump, I hope that we, as a discipline, won’t place so much value on open sharing of data that we forget the importance of data collection ethics.

I’ll leave it with a final Twitter quote from Emily G:



4 thoughts on “10. On the OSF/OkCupid Data Dump: A Batman Analogy”

  1. I’m genuinely unsure what the implications of this are for the OSF. If they get into the business of having ethical standards, whose standards should those be? This case is already arguably fairly borderline (I personally think that what the authors did was pretty crappy, but I don’t think the “data is all online anyway for those who care to look” argument will always be completely without merit in the general case), and it’s easy to think of variants that cause considerably less harm. There’s also the example of the study that Facebook ran a year or so (?) ago, where they (IIRC) very slightly manipulated people’s emotions via selective presentation of their feed, or something: what if those data had been placed on OSF? Where will the lines be drawn?


    • Well, I think one notable difference is that the OKC “study” was ostensibly run by “scientists” (I’m using both of those terms rather loosely, and more than a little contemptuously), whereas the Facebook experiment was run by a corporation on their own proprietary service. I think OSF is an outlet for the former (scientists) and not for the latter (corporations). And as for what standards should be taken up/where the lines should be drawn, I think as a bare-minimum, OSF could consider asking scientists to confirm that data from their original studies were collected with IRB approval, or if using scraped data, that the data were public or otherwise accessed and collected appropriately (e.g., through a service-approved API).


      • Hmmm. IRBs are mostly about protecting the institution’s butt; and apparently Kirkegaard et al. didn’t get/need IRB approval for this study (at least, I presume that’s the case, given that Aarhus has initially declined to take any action). IIRC, the guys at Facebook had some kind of internal sign-off on it? And I would say they were definitely doing science.

        Also, anyone without an affiliation doesn’t have an IRB. That can be a big problem when you have to fill in a box to certify that you’ve obeyed all the rules and the system doesn’t understand that the rules can’t apply to you.

        I’m also not sure your scraping clause would have helped here. Again, while I don’t defend Kirkegaard et al.’s actions from a moral point of view, the worst they have done here, legally, is violate the terms of service of a web site; typically, the penalty for that is being denied access subsequently. If I were an OKC user I would assume that OKC had done the necessary work to stop this kind of scraping (it really isn’t hard). (Maybe Kirkegaard et al. are doing OKC’s gay members a favour: suppose the intelligence services of a homophobic repressive government had scraped these data first?)

        I’m most interested in the general case. If OSF is genuinely to be a neutral repository, any conditions it imposes on the data it stores will have an effect on that neutrality. In the same way, whoever OSF’s hosting company is presumably doesn’t impose conditions, other than perhaps legality, on the activities of its customers. Once you start making any decisions about what’s legitimate content, the slippery slope begins, and it often doesn’t end up working in favour of the good guys. (Indeed, once you start to care about content A, you open yourself up to legal problems vis-à-vis content B through Z, because you can no longer claim that you don’t care. I seem to recall Compuserve lost a case 20 or more years ago because they moderated certain forums, so the judge ruled that users were entitled to assume that all forums were moderated, whereas another online services provider basically put up a notice saying “Here be dragons” and after that, users were on their own.)

        OSF is such a great idea that I would be unhappy to see it start having rules about “appropriateness” of content. I think this study would have been conducted, and the results put out there somewhere, had OSF not existed. I think a better solution to this kind of thing is to call out the authors for their behaviour, which seems to be doing pretty well in this case. In my fairly extensive experience of dealing with people (outside academia) who make creative end-runs around rules, there is no technical solution and often no procedural solution either; sometimes the price of a really cool thing is that shit occasionally happens, and it’s not always worth preventing that occasional shit because the mitigation has costs too. YMMV. 🙂


