Today, my worlds collide: I’m blogging about sexual/relationship science, scientific methodology, and Batman–three things I love talking about. I never thought that a cause to discuss the three, simultaneously, would manifest in a single issue. But here we are:
The post above was retweeted by a Twitter buddy, and my first reaction was likely the same as that of many sexual/relationship scientists: “You mean I can finally use people’s answers to all those insanely interesting sexuality questions on OKC in my research!?” The result:
But then I went to the OSF repository where the data are located to read the authors’ description/justification of how the data were collected (tl;dr the authors wrote a program to “scrape” user content, after accessing OkCupid), and my elation quickly began to fade.
For one, usernames and locations were included in the dataset; users, in other words, were wholly identifiable. Already, alarm bells were beginning to sound in my head, and then I read this:
Some may object to the ethics of gathering and releasing this data. However, all the data found in the dataset are or were already publicly available, so releasing this dataset merely presents it in a more useful form.
In other words, the authors attempt to pass off their decision to share all this personalized data in an open repository as an ethically trivial matter. And before you leap to accuse me of putting words in the authors’ mouths, do note that they opted to use the term “merely”, an adverb that, in this case, communicates that the data-dump is just a small/insignificant matter of openly sharing data–a scientific value the authors espouse, at length, earlier in their document–and nothing more (i.e., no ethical dilemma here worth talking about).
As the Twitterverse is clearly demonstrating, however, there is nothing “mere” about this case of open data sharing on the OSF. Before launching into what the heck this has to do with Batman, I’d like to first give some credit where credit is due, to Emily G on Twitter. Check out her tweets (many more, beyond the one below) and blog about this issue–so much of it is spot on.
The OKC-OSF Data Dump, Analogized as the Plot of The Dark Knight
If you’ve never seen The Dark Knight, this analogy is likely to fall flat (I’m not going to go into enough detail of the plot to catch you up to speed)–so skip this section. But if you have seen The Dark Knight, then hear me out: I think this situation maps pretty well onto the major plot points of The Dark Knight.
The OSF in this analogy is Harvey Dent, the charming DA who has his heart set on one thing, and one thing only–bringing justice to Gotham (okay, and maybe Maggie Gyllenhaal, though I never understood why they replaced Katie Holmes with her…). Likewise, the OSF seems intent on achieving one goal that is similarly broad, but of crucial importance–increasing the transparency of science.
Throughout the movie, Dent represents a beacon of hope for the city of Gotham to achieve justice through totally legitimate means; he nearly single-handedly puts most of the mob in jail at one point, and legally (v. Batman’s vigilante antics). But this glorification comes at a price: it is understood that Dent can never afford to become morally or legally compromised, lest it undo all that he laboured for. As the Mayor of Gotham describes:
The public likes you. That’s the only reason that this might fly, but it’s all on you. They’re all gonna come after you, now. Politicians, journalists, cops. Anyone whose wallet is about to get tight. Are you up for it? You better be. Cause if they get anything on you, those criminals are put back on the streets, followed swiftly by you and me.
So if the OSF is analogous to Dent, I see the battle for justice in Gotham as analogous to the battle for transparency in science. The OSF is such a powerful force in the open science movement, and the COS folks have worked hard to create a case for open data, methods, and preregistration, while building safeguards into the OSF (e.g., private components, embargoes on revealing preregistrations, etc.) to allay the concerns of those who are skeptical/cynical of open science. But given how central the OSF has become to open science methods, I feel as though the OSF, like Dent, cannot afford to become compromised by pursuing open science (in this case, open data) by any (read: unethical) means.
The consequences of such a moral blow could be severe to the broader open science movement. Dent, for example, loses his mind when his girlfriend is killed by the mob, and goes on a murderous rampage–after some encouragement from the Joker–in order to get even with those he feels are responsible. But Dent is eventually killed, and Batman is left with a decision to make: either take the blame for all of Dent’s crimes, or suffer all of the hard-earned justice against the mob being undone. Batman chooses the former while discussing the matter with Jim Gordon:
Commissioner James Gordon: The Joker won. All of Harvey’s prosecutions, everything he fought for: undone. Whatever chance you gave us of fixing our city dies with Harvey’s reputation. We bet it all on him. The Joker took the best of us and tore him down. People will lose hope.
Batman: They won’t. They must never know what he did.
Commissioner James Gordon: Five dead, two of them cops—you can’t sweep that up.
Batman: But the Joker cannot win. Gotham needs its true hero.
We are betting a lot on the OSF, and the open-science movement in psychology can ill afford to have its most substantial contributor compromised by aiding and abetting those attempting to unethically share data that is not theirs to share. I don’t care that it makes cool data “open”; the cost to other cornerstone scientific values–ethics in data collection–is just too steep. Opponents of open science will simply point to such a large ethical failing as evidence for why science needs to stay closed, and if we don’t handle this ethical dilemma appropriately, they may have a point.
So the OKC-OSF Data-Dump is Unethical?
Some think the data-dump isn’t unethical; others think that it could have been ethical, provided certain stipulations were met (e.g., if data were de-identified beforehand); I take a more morally rigid position–the dataset was unethically collected and shared, and should be removed from the OSF now.
Why? Here’s a growing list–I’ll be sure to update this as I think about the issue more.
The nature of this Website promotes the sharing of personal information by users with other users
By accessing this Website, you agree to use any personal information provided to you by other users of this Website in a lawful and responsible manner. You further agree that you will not use personal information about other users of this Website for any reason without the express prior consent of the user that has provided such information to you.
But it gets better (err, worse), once you read the following from the Privacy Policy (PP):
The PP goes on to list a number of groups, none of whom are affiliates of the authors of the data-dump. So OKC users signed up for an online dating service, and in doing so, consented to only share their data with other users of the service and OKC and their affiliates–and could opt out if desired–but the authors of the data-dump thought that it was fine to ignore what the OKC Users consented to, and instead shared the data with the world, which OKC Users did not consent to. So not cool. To me, this is argument #1, 2, 3, 4 & 5 in any debate on the ethics of this data-dump–I find it hard to even get past this one issue in order to humour other elements of the ethics at play here…
2. OKC did not consent to having its company’s data used this way. Not that I care too much about the wellbeing of OKC as a company (I’m much more concerned about the wellbeing of the users), but judging by the T&C, it looks as though the dumping and use of this data will put the authors (and data-analyzers) on the wrong side of OKC’s legal team:
Unless you can argue that an academic paper is neither a “creative work” (doubtful), nor a “product” (especially doubtful once publishers take over copyright and start marketing/selling your paper), it seems pretty clear that the data-dump (and use of the data) constitute foul play according to OKC’s T&C.
3. This disclosure could lead to very real harm, and it doesn’t matter that somebody other than psychologists could have found/disclosed these data. Alex Etz and Emily G (who also brings in the point on the consent of OKC Users being ignored) cut to the heart of this matter, with their tweets:
In response to the possibility of harm being done to vulnerable OKC users, some have argued, “Well, someone who wants to do those people harm could have created a profile and looked up that information themselves…”. Very true. However, (1) that isn’t addressing the matter of whether the data-dump was ethical, per se; (2) now the process of seeking out and doing harm to vulnerable OKC users has been made a little easier, because all of those profiles have been aggregated in a neat-and-tidy dataset; and (3) I am reluctant to get behind the idea that psychologists should feel okay profiting and advancing their careers based on a dataset that is ethically marred, only to espouse an attitude of, “¯\_(ツ)_/¯ well, if we don’t analyze this data, someone else will.”
To me, this element of the ethics of the OKC-OSF data-dump feels all too close to the APA Torture Scandal, during which psychologists offered a similar justification for their involvement in the CIA torture of terror suspects. I’d like psychology, as a discipline, to aspire to a higher moral standard.
4. There was no IRB involvement, and a ton of conflict of interest. As far as I can tell from the authors’ document, and the Twitter discourse surrounding the dataset, there was no IRB involvement in vetting the process of scraping and sharing the data. I welcome being corrected on this point if I am wrong, but if I am not, this is simply appalling research conduct for the year 2016. The problems of consent and possible harm are so clear, and the legal standing of this data-dump is so hazy, that getting an IRB to vet your proposed research seems about as close to an ethics no-brainer as it gets. But should IRBs fail, at least journal editors are able to act as a last bastion for vetting the ethical conduct of research that is to appear in their journals… Except that in this case, the authors published the notice of their data-dump in a journal where one of the authors is the Editor. Fantastic. In other words, there appears to have been no oversight or impartial third-party chain of accountability to attest that these data were collected ethically.
Look, I get feeling excited about a research idea–especially when you are about to tap into a source of data that no one has yet used–and wanting to jump into data collection and analysis as quickly as possible. But in my opinion, it is in these sorts of novel data collection efforts that IRB oversight is the most important. In the coming weeks, for example, I am going to be pre-registering and announcing a data collection effort for a new study; when I described the idea to my advisor, she said that it “sounds crazy”. And it is. So what did we do? We met with someone who used to serve on the IRB to talk about what sort of issues we should be mindful of in collecting the sensitive data we would be attempting to collect. And then I spent over a month working on the most difficult IRB application I’ve ever had to prepare. A couple of months in IRB limbo, and guess what? We are finally IRB approved; new and exciting studies can receive IRB approval–it just may take some time.
The authors list “open science” as a keyword of their paper, but they clearly fail to grasp that true open science is transparent at all stages of research–including the evaluation of ethics. Open scientists should strive to make their research process transparent from start to finish; picking and choosing what phases of science to be “open” during seems no better than p-hacking.
So What Now?
As of now, it appears as though some steps have been taken to put the dumped OKC data behind a layer of protection on the OSF:
But if I had my druthers, the OSF would remove this datafile now, before any more external pressures (e.g., OKC lawyering up to tackle the authors/the OSF) can be applied to make the OSF look reactively, as opposed to proactively, ethical. Further distribution of this dataset compromises the Harvey Dent-ness of the OSF, and invites serious questions about the merits of an open science movement that is willing to compromise ethics in order to get more data for psychologists to analyze. I don’t want that. I like the OSF serving as the White Knight of the open science movement in psychology–and in other disciplines too. And I worry that if the OSF doesn’t take a strong stand on the OKC data-dump and remove it now, in its entirety, regardless of what protections the authors are willing to put in place post-hoc, then the OSF will have condoned and thereby incentivized a system of open-data in which researchers collect and post data, and ask questions about the ethics of doing so after. That is not the version of open science that I signed up for.
It’s like my former graduate instructor Chris Crandall used to say: there are many different values involved in science. Sometimes they are aligned, but oftentimes they compete, and so there are trade-offs to any approach of uncovering and communicating scientific findings. In the case of the OKC-OSF Data Dump, I hope that we, as a discipline, won’t place so much value on open sharing of data that we forget the importance of data collection ethics.
I’ll leave it with a final Twitter quote from Emily G: