In Defense of Bulk Surveillance: It Works

By Nicholas Weaver
Wednesday, September 9, 2015, 12:51 PM

I personally believe the NSA's systems for bulk surveillance represent a direct attack on the Internet and everyone who uses it.  The mere presence of these systems is a threat to democracy, only mitigated by the intense level of professionalism demonstrated by the NSA (a subject for a future essay).

But at the same time, if I were in charge of the NSA I would have, without hesitation, built the same systems.  These systems are reasonably easy to understand, as the underlying technology of Internet surveillance is effectively equivalent to both Network Intrusion Detection Systems (NIDS) and Chinese-style Internet censorship.

Why would I build them?  Because, simply put, this approach works for the NSA's objectives.

The idea behind Internet surveillance is not about looking for "needles in a haystack" but rather providing a capability to "pull threads": starting with some initial piece of interest, such as a phone number, a name, a keyword, a webpage visit, or a hunch, the analyst then seeks to follow the digital history.  But for this flow to work, the systems must already bulk record all the history that may possibly matter.

The primary systems start with an initial filter, either performed by the cooperating ISP or the NSA's own equipment.  This filter eliminates the large, uninteresting bulk flows, such as streaming videos, which occupy a huge amount of the network traffic but provide effectively no actionable intelligence.  The rest gets ingested into the primary acquisition systems.

The data feed then goes into a load balancer, which spreads the traffic across a cluster of computers, with probably 10 machines for each 10 Gbps network connection.  These systems perform an initial reassembly and decide whether each flow is another uninteresting bulk flow or deserves further analysis.  Everything that passes this filter is both recorded (with a retention time of roughly 5 days) and passed through a "metadata" analysis pass.
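To make the load-balancing step concrete, here is a minimal sketch of one common way to spread flows across a cluster: hash each connection's endpoints so both directions of a flow land on the same machine, which is what reassembly needs.  This is an illustration of the general technique, not a description of the NSA's actual balancer; the function names and the choice of SHA-256 are my assumptions.

```python
import hashlib

NUM_WORKERS = 10  # assumption: the rough "10 machines per 10 Gbps link" guess above


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Sort the two endpoints so both directions of a flow yield the same key."""
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{proto}|{a[0]}:{a[1]}|{b[0]}:{b[1]}"


def pick_worker(src_ip, src_port, dst_ip, dst_port, proto="tcp", workers=NUM_WORKERS):
    """Map a flow to a worker index in [0, workers) using a stable hash."""
    key = flow_key(src_ip, src_port, dst_ip, dst_port, proto).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % workers


# Both directions of the same connection go to the same worker.
forward = pick_worker("10.0.0.1", 51000, "192.0.2.7", 80)
reverse = pick_worker("192.0.2.7", 80, "10.0.0.1", 51000)
print(forward == reverse)
```

A stable hash matters here: a worker can only reassemble a flow if it sees every packet of that flow.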

The term "metadata" is both precise and misleading.  It is misleading if one thinks of metadata under Smith v. Maryland (the court decision that says phone metadata has less privacy protection because it is information freely given to the phone company): there is no expectation that the network would record or even care about this information.  Instead, it is "content derived metadata", small pieces of information extracted from the network flow itself such as the subject of an email or who is the author of a Word document.  Calling it "metadata" is only correct from a technical, not legal, perspective.

The metadata-extraction process begins by reassembling the network traffic and applying code to generate metadata "fingerprints".  Some metadata is generic, such as "request is for this URL", "all HTTP headers in a request", "sender of an email", "this request is from an iPhone", or "this is a vBulletin Private Message".  Such fingerprints define generally useful information which may or may not be relevant for an analyst.

But the fingerprints can be more powerful, such as "does the email body contain one of these predefined keywords", "what is the username embedded in this particular website", "is there a reference to a .onion URL", "is there a message body encrypted with 'Mojahaden Secrets'?".  The results of all these fingerprints go into a MySQL database on the wiretap system.
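A fingerprint in this sense is just a small function that looks at a reassembled request and emits zero or more (name, value) rows.  The sketch below shows that shape for a few of the generic fingerprints described above.  Everything in it is an assumption for illustration: the rule names, the keyword list, the rough .onion pattern, and the table layout are hypothetical, and SQLite stands in for the MySQL database only so the example is self-contained.

```python
import re
import sqlite3

ONION_RE = re.compile(r"\b[a-z2-7]{16,56}\.onion\b")  # rough pattern, illustrative only
KEYWORDS = {"example-keyword"}  # hypothetical keyword list


def parse_request(raw):
    """Split a reassembled HTTP request into path, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    method, path, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "path": path, "headers": headers, "body": body}


def fp_url(req):
    return [("url", req["headers"].get("host", "") + req["path"])]


def fp_iphone(req):
    return [("device", "iphone")] if "iPhone" in req["headers"].get("user-agent", "") else []


def fp_onion(req):
    return [("onion_ref", match) for match in ONION_RE.findall(req["body"])]


def fp_keywords(req):
    body = req["body"].lower()
    return [("keyword", word) for word in sorted(KEYWORDS) if word in body]


FINGERPRINTS = [fp_url, fp_iphone, fp_onion, fp_keywords]


def new_store():
    """Create the per-tap metadata table (SQLite as a stand-in for MySQL)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE metadata (src_ip TEXT, name TEXT, value TEXT)")
    return db


def extract_metadata(raw, src_ip, db):
    """Run every fingerprint over one request and store the resulting rows."""
    req = parse_request(raw)
    rows = [(src_ip, name, value) for fp in FINGERPRINTS for name, value in fp(req)]
    db.executemany("INSERT INTO metadata (src_ip, name, value) VALUES (?, ?, ?)", rows)
    return rows
```

The point of the design is that adding a new kind of metadata means adding one more small function to the list, with no change to the capture or storage code.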

In order to access this data, an analyst has a "federated search" interface: on a central site, the analyst specifies a query to run over the metadata stored on some or all of the wiretaps.  This approach handles the "flood of data" problem: instead of moving all the data to the analyst, the analyst's searches go to the data.
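The federated-search idea can be sketched in a few lines: each tap keeps its own database, the central site sends the same query to each one, and only the matching rows travel back.  This is a toy model under my own assumptions (in-memory SQLite databases in place of remote MySQL instances, a hypothetical three-column schema, and made-up tap names), not the real interface.

```python
import sqlite3


def make_tap(rows):
    """Build one tap's local metadata store (in-memory stand-in for a remote site)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE metadata (src_ip TEXT, name TEXT, value TEXT)")
    db.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    return db


def federated_search(taps, name, value):
    """Run the same query at every tap and merge only the hits."""
    hits = []
    for tap_name, db in taps.items():
        cursor = db.execute(
            "SELECT src_ip, value FROM metadata WHERE name = ? AND value = ?",
            (name, value),
        )
        hits.extend((tap_name, src_ip, val) for src_ip, val in cursor)
    return hits


taps = {
    "tap-a": make_tap([("192.0.2.1", "device", "iphone"), ("192.0.2.2", "url", "forum.example/")]),
    "tap-b": make_tap([("198.51.100.9", "device", "iphone")]),
}
results = federated_search(taps, "device", "iphone")
print(results)
```

In a real deployment the loop body would be a network call to each site, but the shape is the same: the query is small, the data stays put, and only results move.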

Some data still ends up centralized.  When the taps see particular tracking cookies (from advertisements or social networks), the presence seems to be recorded in a central "big data" datastore that retains data for a year.  Another analysis process looks at usernames embedded in web pages, creating a mapping of "login cookie to user" for various sites.  Finally, this datastore also includes "cookie correlation", linking tracking and login cookies: if two different cookies (such as ones from Yahoo and DoubleClick) are seen from the same system as part of the same pageview, the database records that the two tracking cookies refer to the same browser.  This database effectively acts as a global identification and tracking system: for every user, what IPs did they use at what time and what are their tracking cookies.
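Cookie correlation is naturally a graph problem: cookies seen together in one pageview get linked, and links are transitive.  One standard way to maintain such groupings is a union-find (disjoint-set) structure, sketched below.  I am not claiming this is how the real datastore works; the class name and the cookie strings are hypothetical.

```python
class CookieGraph:
    """Union-find over cookie identifiers; each set represents one browser."""

    def __init__(self):
        self.parent = {}

    def find(self, cookie):
        """Return the representative cookie for this cookie's set."""
        self.parent.setdefault(cookie, cookie)
        root = cookie
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self.parent[cookie] != root:
            next_cookie = self.parent[cookie]
            self.parent[cookie] = root
            cookie = next_cookie
        return root

    def observe_pageview(self, cookies):
        """Link every cookie seen from one system in the same pageview."""
        cookies = list(cookies)
        if not cookies:
            return
        first_root = self.find(cookies[0])
        for other in cookies[1:]:
            self.parent[self.find(other)] = first_root

    def same_browser(self, a, b):
        return self.find(a) == self.find(b)


graph = CookieGraph()
graph.observe_pageview(["yahoo:B=abc", "doubleclick:id=123"])
graph.observe_pageview(["doubleclick:id=123", "site:login=alice"])
print(graph.same_browser("yahoo:B=abc", "site:login=alice"))
```

Note how the Yahoo cookie and the login cookie were never seen together, yet end up linked through the shared advertising cookie.  That transitivity is what turns scattered sightings into a single identity.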

Finally, the NSA's systems support "attack by name".  The analyst can specify a target's tracking cookies and, when a different wiretap sees a request, this special tap arranges for another computer to "shoot" an exploit into the target's traffic, directly compromising the target.

This approach works.  For example, finding all Jihobbiests is a single query away: "Show all vBulletin private messages with a Mojahaden Secrets encrypted payload".  The analyst can then access the "full take" for any given address to understand a target's activity, such as retrieving email sent from the target’s computer or viewing his web surfing.  This can also help find an associated tracking cookie, a thread of information that reveals the target’s address usage history.  If the target failed to use a VPN, this now gives the target’s movements around the world.

Perhaps the most powerful option is for the analyst to create another fingerprint rule, which the analyst can apply to both future traffic and all previously recorded traffic.  For example, the analyst could extract all Microsoft Office documents authored by the target, no matter where they were seen in the world.
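The retrospective part is what a rolling "full take" buffer buys you: a rule added today can be run over everything still retained, then kept active for new traffic.  Here is a minimal sketch of that behavior, assuming a simple time-ordered buffer and rules expressed as named predicate functions.  The class, its methods, and the byte-string matching are all hypothetical simplifications.

```python
from collections import deque

RETENTION_SECONDS = 5 * 24 * 3600  # the rough 5-day retention figure mentioned earlier


class FullTakeBuffer:
    """Rolling store of recent traffic that supports rules added after the fact."""

    def __init__(self, retention=RETENTION_SECONDS):
        self.retention = retention
        self.records = deque()  # (timestamp, payload), oldest first
        self.rules = []         # (rule_name, predicate) pairs

    def ingest(self, timestamp, payload):
        """Expire old records, store the new one, return names of matching rules."""
        while self.records and self.records[0][0] < timestamp - self.retention:
            self.records.popleft()
        self.records.append((timestamp, payload))
        return [name for name, predicate in self.rules if predicate(payload)]

    def add_rule(self, name, predicate):
        """Register a rule for future traffic and return its hits on retained traffic."""
        self.rules.append((name, predicate))
        return [(ts, payload) for ts, payload in self.records if predicate(payload)]


buf = FullTakeBuffer(retention=100)  # short retention so the example is easy to follow
buf.ingest(0, b"old doc Author: target")
buf.ingest(50, b"doc Author: target")
buf.ingest(60, b"doc Author: someone")
buf.ingest(120, b"unrelated")  # the record from time 0 has now aged out

past_hits = buf.add_rule("by_target", lambda p: b"Author: target" in p)
future_hits = buf.ingest(130, b"new Author: target")
print(past_hits, future_hits)
```

The record from time 0 is gone by the time the rule is written, which illustrates the limit of the technique: retrospective analysis only reaches as far back as the retention window.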

The uses extend way beyond terrorism.  It was this basic flow, used to identify and then exploit network administrators, that enabled the NSA and GCHQ to penetrate Belgacom.  The same flow, with a fingerprint for trade-related keywords in email bodies, allowed New Zealand's GCSB to intercept WTO vote-related emails.  And it enabled a chat-room intercept of an Anonymous member, indicating a URL visit, to identify that person, find their Facebook account, and map their online activities.  On a more theoretical level, it almost certainly enabled the NSA to know the perpetrators behind the Sony hack, and offers a unique ability to analyze communication networks encrypted with PGP.

From a pure effectiveness viewpoint, I can’t think of a better concept.  It enables attributing traffic to individuals, efficiently isolating any items of interest, following threads of information, retrospective analysis, and targeted exploitation.  The biggest problem from an effectiveness standpoint is probably secrecy.  The NSA's flow could easily support many more US government interests if this flow (and therefore effectively all derived data) weren't segregated into TS//SCI compartments.

Unfortunately, there exists a huge flaw: it is not particularly difficult to implement.  Any foreign power that can install a tap can run this style of analysis.  In my next article, I'll discuss my own experience building a hobby version of an NSA-style surveillance suite, and thus why the US needs to take the lead in "going dark": protecting network traffic against bulk surveillance and targeted attack.  For others can do unto us as we have already done unto them.