Parler Wasn’t Hacked, and Scraping Is Not a Crime
Thanks to its association with the attack on the U.S. Capitol, Parler, a social media network popular with supporters of former President Trump, became a household name just in time to vanish from the web. In that narrow window between Jan. 6 and Jan. 10, between the attack and Amazon’s decision to pull the hosting services on which the app relied, an independent researcher who goes by @donk_enby moved to copy as much of the site’s data as she could—preserving a critical body of knowledge about what, exactly, happened at the Capitol. That archive has now underpinned a wave of revealing data journalism, including an extensive ProPublica analysis of live video from the event and a detailed Gizmodo map of the location metadata tied to those posts. But the initial collection effort was shadowed by misconceptions about how archivists were able to access so much of Parler’s data, prompting some too-casual intimations that the app was “hacked”—especially insofar as donk_enby obtained data that the app’s users believed they had deleted.
It’s worth handling carefully the sort of language that can get a person sued or prosecuted, and the Justice Department has, in fact, tried unsuccessfully to prosecute similar conduct under the federal anti-hacking statute: the Computer Fraud and Abuse Act (CFAA). But what donk_enby seems to have done was really just scraping—automating the collection of the same information that a user with no special privileges could have retrieved by hand—and it can’t be said often or clearly enough that scraping is not a crime. That’s very far from saying, of course, that this conduct carries no risk; unfortunately, the fact that the best reading of the CFAA doesn’t punish this conduct hasn’t been enough to protect scrapers from litigation, or even criminal charges. (As Swift on Security put it, this argument “is not legal advice to enumerate random APIs.”) But the value of the Parler archive highlights, in that vein, the importance of shaking the clouds that still hang over techniques on which journalists and researchers have every right to rely.
Going by the most detailed reports, donk_enby’s effort was almost comically straightforward. According to Wired, for instance, Parler imposed no password-like limits on who could use its API to query its site (or how often) to retrieve a particular piece of user content, and all of that user content was hosted at a series of sequential URLs in the order that each piece had been posted. The “hack,” then, just entailed systematically downloading whatever happened to be posted at each public-facing URL. That haul turned out to include content that Parler users had “deleted” but that the company continued to host at addresses that anyone could visit.
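To make the mechanics concrete, here is a minimal sketch of what enumerating sequential addresses looks like. It is an illustration under stated assumptions, not a reconstruction of donk_enby's actual tooling: the in-memory FAKE_HOST table, the example.invalid addresses, and the fetch and enumerate_posts helpers are all hypothetical, and the sketch makes no network requests. The one feature it borrows from the reporting is that a "deleted" post is only flagged as deleted and still resolves at its address.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

# Hypothetical stand-in for a site that serves posts at sequential addresses.
# Assumption for illustration: "deleted" is only a flag on the record, and
# the address keeps resolving for anyone who asks for it.
FAKE_HOST: Dict[str, dict] = {
    "https://example.invalid/posts/1": {"body": "first post", "deleted": False},
    "https://example.invalid/posts/2": {"body": "second post", "deleted": True},
    "https://example.invalid/posts/4": {"body": "fourth post", "deleted": False},
}


def fetch(url: str) -> Optional[dict]:
    """Return whatever the (fake) host serves at url, or None if nothing is there."""
    return FAKE_HOST.get(url)


def enumerate_posts(
    base: str,
    start: int,
    stop: int,
    fetcher: Callable[[str], Optional[dict]] = fetch,
) -> Iterator[Tuple[int, dict]]:
    """Visit base/start through base/stop in order and yield whatever resolves."""
    for post_id in range(start, stop + 1):
        post = fetcher(f"{base}/{post_id}")
        if post is not None:
            yield post_id, post


# No credentials and no guessing of secrets: just counting upward.
archive = dict(enumerate_posts("https://example.invalid/posts", 1, 5))
```

Nothing in the loop distinguishes a "deleted" post from a live one. The requester sees only what the host chooses to return at each address, which is the point of the argument that follows.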
If these facts sound familiar to Lawfare readers, it’s likely because of their resemblance to United States v. Auernheimer, a high-profile CFAA prosecution that foundered on venue grounds in 2014. (Andrew Auernheimer was represented by, inter alia, frequent Lawfare contributor and computer crime scholar Orin Kerr.) In that case, Auernheimer, the troll better known as weev, was charged with “unauthorized access” to AT&T’s systems because he and a collaborator had discovered the logic behind the URLs the company used to host certain log-in pages for customers who owned iPads. When visited, each user’s unique page would prepopulate with that user’s email address. By incrementing through possible URLs, then, Auernheimer was able to obtain thousands of email addresses associated with AT&T customers.
Auernheimer’s conviction was ultimately thrown out on the basis that he was prosecuted in the wrong venue, but not before sparking a remarkable debate about whether you can be charged with a federal crime for visiting a URL that the website owner thinks will be difficult to find or guess. And perhaps surprisingly, there remains little case law that directly addresses the question.
What the CFAA forbids, among other things, is “access[ing] a computer without authorization or exceed[ing] authorized access.” But the statute famously fails to define “authorization,” offering only the (unhelpful) clarification that “exceeds authorized access” means “to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter.” The import of that second definition is currently before the Supreme Court in Van Buren v. United States (a case involving a police officer’s misuse of a database in violation of his employer’s policies), and the ruling in that matter may provide some further guidance. But for lack of an authoritative interpretation, the statute’s bad drafting has fed years of uncertainty about its scope, uncertainty with a very real human cost and a chilling effect on journalists and researchers.
The most familiar variation on that debate asks whether interacting with a site in a way that violates its terms of service also violates the CFAA. Should authorization be measured by what your credentials will, practically, let you access or by what the site owner says you may do with them? (Van Buren’s counsel framed this as a question of whether access for a forbidden purpose is “unauthorized”; Kerr maintains that the appropriate question is whether the CFAA’s conception of “authorization” incorporates contract-based restrictions or only code-based ones.) You could try to claim that the statute’s text adopts one approach or another, but it would be—as such claims often are—a bit of a fiction. Instead, the discussion tends to turn to whether punishing terms-of-service violations would delegate too much power to website owners and provide too little notice of what the law actually prohibits—or, from the other direction, whether excluding those violations leaves unpunished too much conduct that strikes us as intuitively wrongful.
On this issue, considerable warring precedent exists in the lower courts along with considerable argument about the impact of one interpretation or another on scraping. Sites, after all, routinely purport to forbid scraping, whether to prevent competitors from collecting information about their operations or to bar the kind of newsgathering that might surface unflattering, aggregate insights into how a platform operates—say, how many racially discriminatory advertisements it hosts. Since criminalizing terms-of-service violations would plainly chill that kind of reporting, a media coalition represented by my colleagues at the Reporters Committee and attorneys at Paul Weiss filed a brief highlighting as much in Van Buren, as did the data journalism outfit The Markup.
It’s easy to miss, though, that the question raised in the Parler and Auernheimer situations is a bit different, and it’s not inevitable that a decision in Van Buren will resolve it. Auernheimer didn’t violate any particular policy on the use of AT&T’s site. Neither, as far as I’m aware, did donk_enby violate any particular Parler term of service by retrieving “deleted” posts. At most, they violated the site owners’ tacit expectations about how their sites would be used (namely, not by visiting random URLs to see what might happen to be there). Precedent on that twist on the CFAA question—who, if anyone, is “authorized” to visit an unprotected website whose existence the owner hasn’t advertised?—is thinner on the ground than the terms-of-service variation. Many of us routinely violate terms of service, knowingly or not (say, by letting someone else use your account on a site that forbids password-sharing). URL enumeration is somewhat rarer.
The much better answer, though, is that you can’t be held liable just for visiting the obscure URL, as the Reporters Committee argued in an amicus brief last week. That filing came in a case in California state court, where the City of Fullerton has sued a set of local bloggers for visiting cityoffullerton.com/outbox—a public-records Dropbox that the city chose to host at that address without a password, on the misplaced expectation that no one would visit it.
The city was able to persuade a trial court that the debate over violating a “public” site’s rules was beside the point on the theory that the Dropbox folder was never available to the public in that sense. It leveraged a simplified conception of how law-abiding users move around the internet: by visiting addresses that they already know or else by clicking over to new pages from those known pages. Or as the city described its position, “One could not navigate to the Dropbox account on the internet, nor from the City’s public website”—cityoffullerton.com.
Except, of course, that one could, just as Auernheimer was able to navigate to AT&T’s log-in pages and donk_enby was able to navigate to Parler’s deleted content—by visiting the URL.
To the extent this fact pattern seems like a difficult one, it’s likely because of a gap between the norms of internet architecture and ordinary user behavior. There’s no serious question that the point of hosting something at a URL is to provide it to people who visit the address; URLs are used to make access available, not to restrict it. Still, most users happen not to visit arbitrary addresses out of curiosity, and it can take a moment to explain why you would.
But a contrary rule would make very little sense. For one, it’s not clear how a user could tell before visiting an address that the owner privately thinks its contents should be secret. Suppose I know that a conference I attended last year hosted its schedule at conference.org/2020-program, and I visit conference.org/2021-program to see if the new one has been posted yet. Should I be subject to liability or even prosecution if it turns out the conference hadn’t intended to advertise its speaker roster yet? (This is roughly what recently happened to Intel, which initially claimed that its quarterly earnings report “was hacked” and later conceded that there had been no breach: the company had put the information on a publicly accessible website without meaning to.)
Whatever vagueness concerns arise when liability turns on illegible terms of service, they pale compared to the problem of tying punishment to expectations that site owners hold in the privacy of their hearts. In practice, fear of violating a ban on retrieving resources that were tacitly meant to be hidden would chill the use of any technique for navigating the web that could, potentially, retrieve them. Even if Parler had no rules against scraping, for instance, the presence of deleted posts would have amounted to a minefield of strict liability for anyone who scraped them.
By the same token, a rule against visiting nonobvious URLs would risk outlawing the search operators that all major search engines provide to make their tools more effective for users. Searching for “site:rcfp.org filetype:pdf,” for instance, will retrieve PDF files hosted on the Reporters Committee’s website even if those files would be difficult or impossible to navigate to by hand. There wouldn’t be anything nefarious about that search; it might be a good way of jumping directly to primary documents, like court filings, as opposed to the case pages that narrate them, especially if the site’s organization has changed since they were originally posted. But on the “you may not visit obscure URLs” theory, that same search applied to another domain would risk violating the law. The telecom firm TerraCom threatened to sue Scripps in 2013 on just that basis, in a transparent effort to quash reporting that the company had mishandled the personal data of individuals who participated in its Lifeline subsidized cell phone program.
Endorsing that balance of consequences would make for very poor cost-benefit analysis. On one side of the ledger is the social worth of all of these techniques for navigating the open internet: the journalism that scraping underpins, every bad-faith breach that security research prevents and so on. On the other side is the website owner’s interest in hosting “private” content at a publicly accessible URL and in papering over that strange decision’s consequences with the threat of criminal sanctions. On-point precedent or no, then, this ought to be an easy case. Which do we value more? The singular archival resource the Parler data represents should highlight what we would lose in privileging the second.
It’s understandable to express concern, as many have, that this scrape nevertheless carried legal risk. The CFAA’s ambiguity has given its enforcers too much latitude to impose unjustified costs on others, whether through a prosecution that ultimately fails or a civil suit that never proceeds to judgment. But in the absence of some judicial resolution, it bears pressing at every opportunity the argument that this research is legal, especially when its value to the public is on prominent display. The Parler archive has already underpinned important journalism on the reality of the Capitol riots; without it, there would be a critical gap in our understanding of Jan. 6, and a legal rule that would require that result is not a very good one.
Because Parler was scraped, not hacked, and scraping is not a crime.