Thursday, February 2, 2017

How to enable/disable privacy protection in Google Analytics (it's easy to get wrong!)

In my survey last year of ARL library web services, I found that 72% of them used Google Analytics. So it's not surprising that a common response to my article about leaking catalog searches to Amazon was to wonder whether the same thing is happening with Google Analytics.

The short answer is "It Depends". It might be OK to use Google Analytics on a library search facility, if the following things are true:
  1. The library trusts Google on user privacy. (Many do.)
  2. Google is acting in good faith to protect user privacy and is not acting under legal compulsion to act otherwise. (We don't really know.)
  3. Google Analytics is correctly doing what its documentation says it does and is not being circumvented by the rest of Google. (It doesn't always work that way.)
  4. The library has implemented Google Analytics correctly to enable user privacy.
There's an entire blog post to write about each of the first three conditions, but I have only so many hours in a day. Given that many libraries have decided that the benefits of using Google Analytics outweigh the privacy risks, the rest of this post concerns only this last condition. Of the 72% of ARL libraries that use Google Analytics, I find that only 19% of them have implemented Google Analytics with privacy-protection features enabled.

So, if you care about library privacy but can't do without Google Analytics, read on!

Google Analytics has a lot of configuration options, which is why webmasters love it. For the purposes of user privacy, however, there are just two configuration options to pay attention to: the "IP Anonymization" option and the "Display Features" option.

IP Anonymization says to Google Analytics "please don't remember the exact IP address of my users". According to Google, enabling this mode masks the least significant bits of the user's IP address before the IP address is used or saved. Since many users can be identified by their IP address, this prevents anyone from discovering the search history for a given IP address. But remember, Google is still sent the IP address, and we have to trust that Google will obscure the IP address as advertised, and not save it in some log somewhere. Even with the masked IP address, it may still be possible to identify a user, particularly if a library serves a small number of geographically dispersed users.
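
Here's a rough sketch of what that masking amounts to (my illustration of the idea, not Google's actual code): for an IPv4 address, the last octet is zeroed before the address is stored.
    // Illustration only -- not Google's code. Per Google's documentation, with
    // anonymizeIp enabled the last octet of an IPv4 address is set to zero
    // before the address is used or stored.
    function maskIPv4(ip) {
        var parts = ip.split('.');
        parts[3] = '0';              // zero out the least significant octet
        return parts.join('.');
    }
    maskIPv4('192.0.2.45');          // returns "192.0.2.0"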

"Display Features" says to Google to that you don't care about user privacy, and it's OK to track your users all to hell so that you can get access to "demographic" information. To understand what's happening, it's important to understand the difference between "first-party" and "third-party" cookies, and how they implicate privacy differently.

Out of the box, Google Analytics uses "first party" cookies to track users. So if you deploy Google Analytics on your "library.example.edu" server, the tracking cookie will be attached to the library.example.edu hostname. Google Analytics will have considerable difficulty connecting user number 1234 on the library.example.edu domain with user number 5678 on the "sci-hub.info" domain, because the user ids are chosen randomly for each hostname. But if you turn on Display Features, Google will connect the two user ids via a third party tracking cookie from its Doubleclick advertising service. This enables both you and Google to know more about your users. Anyone with access to Google's data will be able to connect the catalog searches saved for user number 1234 to that user's searches on any website that uses Google advertising or any site that has Display Features turned on.
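
To see why first-party tracking doesn't connect users across hostnames, here's an illustrative sketch (mine, not Google's actual code) of how a first-party client id gets chosen:
    // Illustration only -- not Google's code. Analytics stores a client id in the
    // first-party "_ga" cookie. The id is essentially a random number plus a
    // timestamp, chosen independently on each hostname, so the id set on
    // library.example.edu says nothing about the id the same visitor gets on
    // any other domain.
    function makeClientId() {
        var randomPart = Math.floor(Math.random() * 2147483647);  // random 31-bit integer
        var firstVisit = Math.floor(Date.now() / 1000);            // first-visit time, in seconds
        return randomPart + '.' + firstVisit;                      // e.g. "1033501218.1486000000"
    }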

IP Anonymization and Display Features can be configured in three ways, depending on how Google Analytics is deployed. The instructions here apply to the "Universal Analytics" script. You can tell a site uses Universal Analytics because its pages execute a javascript named "analytics.js". An older "classic" version of Google Analytics uses a script named "ga.js"; its configuration is similar to Universal's. More complex websites may use Google Tag Manager to deploy and configure Google Analytics.

Google Analytics is usually deployed on a web page by inserting a script element that looks like this:
<script>
    (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
    (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
    m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('send', 'pageview');
</script>
IP Anonymization and Display Features are turned on with extra lines in the script:
    ga('create', 'UA-XXXXX-Y', 'auto');
    ga('require', 'displayfeatures');  // starts tracking users across sites
    ga('set', 'anonymizeIp', true); // makes it harder to identify the user from logs
    ga('send', 'pageview');
The Google Analytics Admin console also allows you to turn on cross-site user tracking, though the privacy impact of what you're doing is not made clear. In the "Data Collection" item of the Tracking Info pane, look at the toggle switches for "Remarketing" and "Advertising Reporting Features". If these are switched to "ON", then you've enabled cross-site tracking and your users can expect no privacy.

Turning on IP anonymization is not quite as easy as turning on cross-site tracking. You have to add it explicitly in your script or turn it on in Google Tag Manager (where you won't find it unless you know what to look for!).

To check whether cross-site tracking has been turned on in your institution's Google Analytics, use the procedures I described in my article on How to check if your library is leaking catalog searches to Amazon. First, clear the cookies for your website, then load your site and look at the "Sources" tab in Chrome developer tools. If there's a resource from "stats.g.doubleclick.net", then your website is asking Google to track your users across sites. If your institution is a library, you should not be telling Google to track your users across sites.
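
If you'd rather check from the javascript console than eyeball the Sources tab, here's a rough sketch (assuming a browser that supports the Resource Timing API). It looks for two things: resources loaded from stats.g.doubleclick.net, and the aip=1 flag that the anonymizeIp setting adds to Google Analytics hits.
    // Rough sketch -- run in the console after loading a search page on your site.
    var resources = performance.getEntriesByType('resource').map(function (r) { return r.name; });
    // Any resource from stats.g.doubleclick.net means cross-site tracking is on.
    console.log('cross-site tracking:',
        resources.some(function (url) { return url.indexOf('stats.g.doubleclick.net') !== -1; }));
    // Universal Analytics hits go to google-analytics.com/collect; they carry "aip=1"
    // in the query string when IP anonymization has been set.
    resources.filter(function (url) { return url.indexOf('google-analytics.com/collect') !== -1; })
        .forEach(function (url) {
            console.log(/[?&]aip=1/.test(url) ? 'anonymized hit:' : 'hit without aip flag:', url);
        });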

Bottom line: if you use Google Analytics, always remember that Google is fundamentally an advertising company and it will seldom guide you towards protecting your users' privacy.

Thursday, January 26, 2017

Policy-based Privacy is Over


Yesterday, President Donald Trump issued an executive order to enhance "Public Safety in the Interior of the United States".

Of interest here is section 14:
Sec. 14.  Privacy Act.  Agencies shall, to the extent consistent with applicable law, ensure that their privacy policies exclude persons who are not United States citizens or lawful permanent residents from the protections of the Privacy Act regarding personally identifiable information.  
What this means is that executive-branch agencies, including their websites, libraries, and information systems, may not use privacy policies to protect users other than US citizens and green card holders. Since websites, libraries, and information systems typically don't keep track of users' citizenship status, this makes it very difficult to have any privacy policy at all.

Note that this executive order does not apply to the Library of Congress, an organ of the legislative branch of the US government. Nevertheless, it demonstrates the vulnerability of policy-based privacy. Who's to say that Congress won't enact the same restrictions for the legislative branch? Who's to say that Congress won't enact the same restrictions on any website, library, or information system that operates in multiple states?

Lawyering privacy won't work any more. Librarianing privacy won't work any more. We need to rely on engineers to build privacy into our websites, libraries and information systems. This is possible. Engineers have tools such as strong cryptography that allow privacy to be built into systems without compromising functionality. It's not that engineers are immune from privacy-breaking mandates, but it's orders of magnitude more difficult to outlaw privacy engineering than it is to invalidate privacy policies. A system that doesn't record what a user does can't produce user activity records. Some facts are not alternativable. Math trumps Trump.

Friday, January 13, 2017

Google's "Crypto-Cookies" are tracking Chrome users

Ordinary HTTP cookies are used in many ways to make the internet work. Cookies help websites remember their users. A common use of cookies is for authentication: when you log into a website, the reason you stay logged in is a cookie that contains your authentication info. Every request you make to the website includes this cookie; the website then knows to grant you access.

But there's a problem: someone might steal your cookies and hijack your login. This is particularly easy for thieves if your communication with the website isn't encrypted with HTTPS. To address the risk of cookie theft, the security engineers of the internet have been working on ways to protect these cookies with strong encryption. In this article, I'll call these "crypto-cookies", a term not used by the folks developing them. The Chrome user interface calls them Channel IDs.


Development of secure "crypto-cookies" has not been a straight path. A first approach, called "Origin Bound Certificates", has been abandoned. A second approach, "TLS Channel IDs", has been implemented, then superseded by a third approach, "TLS Token Binding" (nicknamed "TokBind"). If you use the Chrome web browser, your connections to Google web services take advantage of TokBind for most, if not all, Google services.

This is excellent for security, but might not be so good for privacy; 3rd party content is the culprit. It turns out that Google has not limited crypto-cookie deployment to services like GMail and Youtube that have log-ins. Google hosts many popular utilities whose domains don't set conventional tracking cookies. Font libraries such as Google Fonts, javascript libraries such as jQuery, and app frameworks such as Angular are all hosted on Google servers. Many websites load these resources from Google for convenience and fast load times. In addition, Google utility scripts such as Analytics and Tag Manager are delivered from separate domains so that users are only tracked across websites if so configured. But with Google Chrome (and Microsoft's Edge browser), every user who visits any website using Google Analytics, Google Tag Manager, Google Fonts, jQuery, Angular, etc. is subject to tracking across websites by Google. According to Princeton's OpenWPM project, more than half of all websites embed content hosted on Google servers.
Top 3rd-party content hosts, from Princeton's OpenWPM. Note that most of the hosts labeled "Non-Tracking Content" are at this time subject to "crypto-cookie" tracking.


While using 3rd party content hosted by Google was always problematic for privacy-sensitive sites, the impact on privacy was blunted by two factors – caching and statelessness. If a website loads fonts from fonts.gstatic.com, or style files from fonts.googleapis.com, the files are cached by the browser and only loaded once per day. Before the rollout of crypto-cookies, Google had no way to connect one request for a font file with the next – the request was stateless; the domains never set cookies. In fact, Google says:
Use of Google Fonts is unauthenticated. No cookies are sent by website visitors to the Google Fonts API. Requests to the Google Fonts API are made to resource-specific domains, such as fonts.googleapis.com or fonts.gstatic.com, so that your requests for fonts are separate from and do not contain any credentials you send to google.com while using other Google services that are authenticated, such as Gmail. 
But if you use Chrome, your requests for these font files are no longer stateless. Google can follow you from one website to the next, without using conventional tracking cookies.

It gets worse. Crypto-cookies aren't yet recognized by privacy plugins like Privacy Badger, so you can be tracked even though you're trying not to be. The TokBind RFC also includes a feature called "Referred Token Binding", which is meant to allow federated authentication (so you can sign into one site and be recognized by another). In the hands of the advertising industry, this will get used to share crypto-cookies across domains.

To be fair, there's nothing in the crypto-cookie technology itself that makes the privacy situation any different from the status quo. But as the tracking mechanism moves into the web security layer, control of tracking is moved away from application layers. It's entirely possible that the parts of Google running services like gstatic.com and googleapis.com have not realized that their infrastructure has started tracking users. If so, we'll eventually see the tracking turned off.  It's also possible that this is all part of Google's evil master plan for better advertising, but I'm guessing it's just a deployment mistake.

So far, not many companies have deployed crypto-cookie technology on the server side. In addition to Google and Microsoft, I find a few advertising companies that are using it. Chrome and Edge are the only client-side implementations I know of.

For now, web developers who are concerned about user privacy can no longer ignore the risks of embedding third party content. Web users concerned about being tracked might want to use Firefox for a while.
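
One partial defense for web developers is to stop embedding Google-hosted utilities altogether. Here's a minimal sketch (the local paths are placeholders for copies you host yourself):
<!-- Instead of loading jQuery and fonts from Google's servers, like this... -->
<!--   <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script> -->
<!--   <link href="https://fonts.googleapis.com/css?family=Roboto" rel="stylesheet"> -->
<!-- ...serve copies from your own host, so the requests never reach Google at all: -->
<script src="/js/jquery-3.1.1.min.js"></script>
<link href="/css/fonts.css" rel="stylesheet">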

Notes:

  1. This blog is hosted on a Google service, so assume you're being watched. Hi Google!
  2. OS X Chrome saves the crypto-cookies in an SQLite file at "~/Library/Application Support/Google/Chrome/Default/Origin Bound Certs". 
  3. I've filed bug reports/issues for Google Fonts, Google Chrome, and Privacy Badger. 
  4. Dirk Balfanz, one of the engineers behind TokBind, has a really good website that explains the ins and outs of what I call crypto-cookies.


Thursday, December 22, 2016

How to check if your library is leaking catalog searches to Amazon

I've been writing about privacy in libraries for a while now, and I get a bit down sometimes because progress is so slow. I've come to realize that part of the problem is that the issues are sometimes really complex and  technical; people just don't believe that the web works the way it does, violating user privacy at every opportunity.

Content embedded in websites is a huge source of privacy leakage in library services. Cover images can be particularly problematic. I've written before that, without meaning to, many libraries send data to Amazon about the books a user is searching for; cover images are almost always the culprit. I've been reporting this issue to the library automation companies that enable this, but a year and a half later, nothing has changed. (I understand that "discovery" services such as Primo/Summon even include config checkboxes that make this easy to do; the companies say this is what their customers want.)

Two indications that a third-party cover image is a privacy problem are:
  1. the provider sets tracking cookies on the hostname serving the content.
  2. the provider collects personal information, for example as part of commerce. 
For example, covers served by Amazon send a bonanza of actionable intelligence to Amazon.

Here's how to tell if your library is sending Amazon your library search data.

Setup

You'll need a web browser equipped with developer tools; I use Chrome. Firefox should work, too.

Log into Amazon.com. They will give you a tracking cookie that identifies you. If you buy something, they'll have your credit card number, your physical and electronic addresses, records about the stuff you buy, and a big chunk of your web browsing history on websites that offer affiliate linking. These cookies are used to optimize the advertisements you're shown around the web.

To see your Amazon cookies, go to Preferences > Settings. Click "Show advanced settings..." (It's hiding at the bottom.)

Click the "Content settings..." button.

Now click the "All cookies and site data" button.

in the "Search cookies" box, type "amazon". Chances are, you'll see something like this.

I've got 65 cookies for "amazon.com"!

If you remove all the cookies and then go back to Amazon, you'll get 15 fresh cookies, most of them set to last for 20 years. Amazon knows who I am even if I delete all the cookies except "x-main".

Test the Library

Now it's time to find a library search box. For demonstration purposes, I'll use Harvard's "Hollis" catalog. I would get similar results at 36 different ARL libraries, but Harvard has lots of books and returns plenty of results. In the past, I've used What to expect as my search string, but just to make a point, I'll use Killing Trump, a book that Bill O'Reilly hasn't written yet.

Once you've executed your search, choose View > Developer > Developer Tools.

Click on the "Sources" tab and to see the requests made of "images.amazon.com". Amazon has returned 1x1 clear pixels for three requested covers. The covers are requested by ISBN. But that's not all the information contained in the cover request.

To see the cover request, click on the "Network" tab and hit reload. You can see that the cover images were requested by a javascript called "primo_library_web". (Hollis is an instance of Ex Libris' Primo discovery service.)

Now click on the request you're interested in. Look at the request headers.


There are two of interest, the "Cookie" and the "Referer".

The "Cookie" sent to Amazon is this:
x-main="oO@WgrX2LoaTFJeRfVIWNu1Hx?a1Mt0s";
skin=noskin; session-token="bcgYhb7dksVolyQIRy4abz1kCvlXoYGNUM5gZe9z4pV75B53o/4Bs6cv1Plr4INdSFTkEPBV1pm74vGkGGd0HHLb9cMvu9bp3qekVLaboQtTr+gtC90lOFvJwXDM4Fpqi6bEbmv3lCqYC5FDhDKZQp1v8DlYr8ZdJJBP5lwEu2a+OSXbJhfVFnb3860I1i3DWntYyU1ip0s="; x-wl-uid=1OgIBsslBlOoArUsYcVdZ0IESKFUYR0iZ3fLcjTXQ1PyTMaFdjy6gB9uaILvMGaN9I+mRtJmbSFwNKfMRJWX7jg==; ubid-main=156-1472903-4100903;
session-id-time=2082787201l;
session-id=161-0692439-8899146
Note that Amazon can tell who I am from the x-main cookie alone. In the privacy biz, this is known as "PII" or personally identifiable information.

The "Referer" sent to Amazon is this:
http://hollis.harvard.edu/primo_library/libweb/action/search.do?fn=search&ct=search&initialSearch=true&mode=Basic&tab=everything&indx=1&dum=true&srt=rank&vid=HVD&frbg=&tb=t&vl%28freeText0%29=killing+trump&scp.scps=scope%3A%28HVD_FGDC%29%2Cscope%3A%28HVD%29%2Cscope%3A%28HVD_VIA%29%2Cprimo_central_multiple_fe&vl%28394521272UI1%29=all_items&vl%281UI0%29=contains&vl%2851615747UI0%29=any&vl%2851615747UI0%29=title&vl%2851615747UI0%29=any
To put this plainly, my entire search session, including my search string killing trump, is sent to Amazon, along with my personal information, whether I like it or not. I don't know what Amazon does with this information. I assume that if a government actor wants my search history, they will get it from Amazon without much fuss.

I don't like it.

Rant

[I wrote a rant; but I decided to save it for a future post if needed.] Anyone want a Cookie?

Notes 12/23/2016:


  1. As Keith Jenkins noted, users can configure Chrome and Safari to block 3rd party cookies. Firefox won't block Amazon cookies, however. And some libraries advise users not to block 3rd party cookies because doing so can cause problems with proxy authentication.
  2. If Chrome's network panel tells you "Provisional headers are shown" this means it doesn't know what request headers were really sent because another plugin is modifying headers. So if you have HTTPS Everywhere, Ghostery, Adblock, or Privacy Badger installed, you may not be able to use Chrome developer tools to see request headers. Thanks to Scott Carlson for the heads up.
  3. Cover images from Google leak similar data; as does use of Google Analytics. As do Facebook Like buttons. Et cetera.
  4. Thanks to Sarah Houghton for suggesting that I write this up.

Friday, October 14, 2016

Maybe IDPF and W3C should *compete* in eBook Standards

A controversy has been brewing in the world of eBook standards. The International Digital Publishing Forum (IDPF) and the World Wide Web Consortium (W3C) have proposed to combine. At first glance, this seems a sensible thing to do; IDPF's EPUB work leans heavily on W3C's HTML5 standard, and IDPF has been over-achieving with limited infrastructure and resources.

Not everyone I've talked to thinks the combination is a good idea. In the publishing world, there is fear that the giants of the internet who dominate the W3C will not be responsive to the idiosyncratic needs of more traditional publishing businesses. On the other side, there is fear that the work of IDPF and Readium on "Lightweight Content Protection" (a.k.a. Digital Rights Management) will be another step towards "locking down the web". (See the controversy about "Encrypted Media Extensions".)

What's more, a peek into the HTML5 development process reveals a complicated history. The HTML5 that we have today derives from a group of developers (the WHATWG) who got sick of the W3C's processes and dependencies and broke away from the W3C. Politics above my pay grade occurred, and the breakaway effort was folded back into the W3C as a "Community Group". So now we have two slightly different versions of HTML: the "standard" HTML5 and WHATWG's HTML "Living Standard". That's also why HTML5 omitted much of the W3C's Semantic Web development work, such as RDFa.

Amazon (not a member of either IDPF or W3C) is the elephant in the room. They take advantage of IDPF's work in a backhanded way. Instead of supporting the EPUB standard in their Kindle devices, they use proprietary formats under their exclusive control. But they accept EPUB files in their content ingest process and thus extract huge benefit from EPUB standardization. This puts the advancement of EPUB in a difficult position. New features added to EPUB have no effect on the majority of ebook users because Amazon just converts everything to a proprietary format.

Last month, the W3C published its vision for eBook standards, in the form of an innocuously titled "Portable Web Publications Use Cases and Requirements". For whatever reason, this got rather limited notice or comment, considering that it could be the basis for the entire digital book industry. Incredibly, the word "ebook" appears not once in the entire document. "EPUB" appears just once, in the phrase "This document is also available in this non-normative format: ePub". But read the document, and it's clear that "Portable Web Publication" is intended to be the new standard for ebooks. For example, the PWP (can we just pronounce that "puup"?) "must provide the possibility to switch to a paginated view". The PWP (say it, "puup") needs a "default reading order", i.e. a table of contents. And of course the PWP has to support digital rights management: "A PWP should allow for access control and write protections of the resource." Under the oblique requirement that "The distribution of PWPs should conform to the standard processes and expectations of commercial publishing channels," we discover that this means "Alice acquires a PWP through a subscription service and downloads it. When, later on, she decides to unsubscribe from the service, this PWP becomes unavailable to her." So make no mistake, PWP is meant to be EPUB 4 (or maybe ePub4, to use the non-normative capitalization).

There's a lot of unalloyed good stuff there, too. The issues of making web publications work well offline (an essential ingredient for archiving them) are technical, difficult and subtle, and W3C's document does a good job of flushing them out. There's a good start (albeit limited) on archiving issues for web publications. But nowhere in the statement of "use cases and requirements" is there a use case for low cost PWP production or for efficient conversion from other formats, despite the statement that PWPs "should be able to make use of all facilities offered by the [Open Web Platform]".

The proposed merger of IDPF and W3C raises the question: who gets to decide what "the ebook" will become? It's an important question, and the answer eventually has to be open rather than proprietary. If a combined IDPF and W3C can get the support of Amazon in open standards development, then everyone will benefit. But if not, a divergence is inevitable. The publishing industry needs to sustain their business; for that, they need an open standard for content optimized to feed supply chains like Amazon's. I'm not sure that's quite what W3C has in mind.

I think ebooks are more important than just the commercial book publishing industry. The world needs ways to deliver portable content that don't run through the Amazon tollgates. For that we need innovation that's as unconstrained and disruptive as the rest of the internet. The proposed combination of IDPF and W3C needs to be examined for its effects on innovation and competition.

Philip K. Dick's Mr. Robot is one of the stories in Imagination Stories of Science and Fantasy, January 1953. It is available as an ebook from Project Gutenberg and from GITenberg.
My guess is that Amazon is not going to participate in open ebook standards development. That means that two different standards development efforts are needed. Publishers need a content markup format that plays well with whatever Amazon comes up with. But there also needs to be a way for the industry to innovate and compete with Amazon on ebook UI and features. That's a very different development project, and it needs a group more like WHATWG to nurture it. Maybe the W3C can fold that sort of innovation into its unruly stable of standards efforts.

I worry that by combining with IDPF, the W3C work on portable content will be chained to the supply-chain needs of today's publishing industry, and no one will take up the banner of open innovation for ebooks. But it's also possible that the combined resources of IDPF and W3C will catalyze the development of open alternatives for the ebook of tomorrow.

Is that too much to hope?