Monday, September 15, 2014

Analysis of Privacy Leakage on a Library Catalog Webpage

My post last month about privacy on library websites, and the surrounding discussion on the Code4Lib list, prompted me to do a focused investigation, which I presented at last week's Code4Lib-NYC meeting.

I looked at a single web page from the NYPL online catalog. I used Chrome developer tools to trace all the requests my browser made in the process of building that page. The catalog page in question is for The Communist Manifesto. It's here: http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto .

You can imagine how reading this work might have been of interest to government investigators during the early fifties when Sen. Joe McCarthy was at the peak of his power. Note that, following good search-engine-optimization practice, the URL embeds the title of the resource being looked at.

I chose the NYPL catalog as my example, not because it's better or worse than any other library catalog with respect to privacy, but because it's exemplary. The people building it are awesome, and the results are top-notch. I happen to know the organization is working on making privacy improvements. Please don't take my investigation to be a criticism of NYPL. But it was Code4Lib-NYC, after all.

As an example of how far ahead of the curve the NYPL catalog is, note that the webpage offers links to free downloads at Project Gutenberg. The Communist Manifesto is in the public domain, so any library catalog that tells you that no ebook is available is lying. The majority of library catalogs today lie about this.

So here are the results.

In building the Communist Manifesto catalog page, my browser contacts 11 different hosts from 8 different companies.
  • nypl.secure.bibliocommons.com
  • cdn.bibliocommons.com
  • api.bookish.com
  • contentcafe2.btol.com
  • www.google-analytics.com
  • www.googletagmanager.com
  • cdn.foxycart.com
  • idreambooks.com
  • ws.sharethis.com
  • wd-edge.sharethis.com
  • b.scorecardresearch.com
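A tally like the one above can be reproduced from a HAR file exported from the Network panel of the Chrome developer tools. The sketch below is only an illustration, not the procedure actually used for this post: the function names are hypothetical, and grouping hosts by the last two labels of the hostname is a rough stand-in for a proper public-suffix lookup.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def hosts_from_har(har_text):
    """Return the sorted, de-duplicated hostnames requested in a HAR export."""
    har = json.loads(har_text)
    hosts = set()
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


def group_by_domain(hosts):
    """Group hostnames by their last two labels (naive; assumes .com-style names)."""
    groups = defaultdict(list)
    for host in hosts:
        domain = ".".join(host.split(".")[-2:])
        groups[domain].append(host)
    return dict(groups)


# The eleven hosts listed above.
CATALOG_HOSTS = [
    "nypl.secure.bibliocommons.com",
    "cdn.bibliocommons.com",
    "api.bookish.com",
    "contentcafe2.btol.com",
    "www.google-analytics.com",
    "www.googletagmanager.com",
    "cdn.foxycart.com",
    "idreambooks.com",
    "ws.sharethis.com",
    "wd-edge.sharethis.com",
    "b.scorecardresearch.com",
]

domains = group_by_domain(CATALOG_HOSTS)
print(len(CATALOG_HOSTS), "hosts,", len(domains), "domains")  # 11 hosts, 9 domains
```

Domain names alone give nine groups; the two Google domains belong to a single company, which is how the count comes down to eight companies.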
Each of these hosts is informed of the address of the web page that generates the request. They are told, essentially, "this user is looking at our Communist Manifesto page". Some of the hosts need this information to deliver the services they contribute. Others get the same information via the "referer" header generated as part of the HTTP protocol. If the catalog were served with the more secure protocol "HTTPS", browsers would not send the referer header to any host contacted over plain HTTP.
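The referer behavior can be written down as a small rule. This is a sketch that assumes the common browser default, in which the header is withheld only when a page loaded over HTTPS requests a resource over plain HTTP; the function name and the beacon path are made up for illustration.

```python
from urllib.parse import urlsplit


def sends_referer(page_url, resource_url):
    """Approximate the browser default: omit Referer only on an HTTPS-to-HTTP downgrade."""
    page_scheme = urlsplit(page_url).scheme
    resource_scheme = urlsplit(resource_url).scheme
    return not (page_scheme == "https" and resource_scheme == "http")


page = "http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto"
beacon = "http://b.scorecardresearch.com/beacon.js"  # hypothetical path

print(sends_referer(page, beacon))  # True: an HTTP page leaks its URL to the tracker
```

Under this rule, a third-party script that is itself loaded over HTTPS would still receive the page address, so HTTPS on the catalog only hides the URL from hosts reached over plain HTTP and from the networks in between.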

The first of these is Bibliocommons. I've written about Bibliocommons before. They host the NYPL catalog "in the cloud". I'm not particularly concerned about Bibliocommons with respect to privacy, because they contract directly with NYPL, and I'm pretty sure that contracts are in place that bind Bibliocommons to the privacy policies in place at NYPL. But since HTTP is used rather than HTTPS, every host between me and the bibliocommons server can see and capture the URL of the web page I'm looking at. At the moment, I'm using the wifi in a Paris cafe, so the hosts that can see that are in the proxad.net, aas6453.net, level3.net, firehost.com and other domains. I don't know what they do with my browsing history.

I've previously written about the NYPL's use of the Bookish recommendation engine. The BTOL.com link is for Baker & Taylor's "Content Cafe" service, which provides book covers for library catalogs. I'm guessing (but don't know for sure) that these offerings have privacy policies that are aware of the privacy expectations of library users.

Yes, Google is one of the companies that NYPL tells about my web browsing. I'm pretty sure that Google knows who I am. A careful look at the Google Analytics privacy policy suggests that they can't share my browsing history outside Google. Unless required to by law.

Foxycart is not a company I was familiar with. They provide the shopping cart technology that lets me buy a book from the NYPL website and benefit them with part of the proceeds. I've been in favor of enabling such commerce on library sites because libraries need to do it to participate fully in the modern reading ecosystem. But it's still controversial in the library world.

Foxycart's privacy policy, like all privacy policies ever written, takes your privacy very seriously. Some excerpts:
When you visit this website, some information, such as the site that referred you to us, your IP and email address, and navigational and purchase information, may be collected automatically as part of the site’s operation. This information is used to generate user profiles and to personalize the web site to your particular interests. 
The information collected online is stored indefinitely and is used for various purposes. 
Cookies offer you many conveniences. They allow FoxyCart.com LLC, and certain third party content providers, to recognize information, and so can determine what content is best suited to your needs.  
We also reserve the right to disclose your personal information if required to do so by law, or in the good faith belief that such action is reasonably necessary to comply with legal process, respond to claims, or protect the rights, property or safety of our company, employees, customers or the public.

Here I need to explain about cookies. When a website gives you a cookie, it acquires the ability to track you across all the websites that company serves. This can be a great convenience for you. When you fill out a credit card form with your name and address, Foxycart can remember it for you so you don't have to type it in again when you come back to order something else. You might find that creepy if the last order you placed was on a porn site. But while NYPL hasn't told FoxyCart anything that could identify you personally, your interaction with FoxyCart is such that you may well choose to identify yourself. And all that information is stored forever. And FoxyCart can pass that information to all the Sen. Joe McCarthys of 2020. As well as certain 3rd party content providers. FoxyCart probably doesn't give away your information today, but will they even be around in 2020?
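To make the mechanism concrete, here is a toy model of how a cookie lets one company link visits across the sites it serves. It is a simulation under simple assumptions, not a description of FoxyCart's actual system: the class, the method names, and the second URL are all hypothetical.

```python
import uuid
from collections import defaultdict


class ThirdPartyTracker:
    """Toy model of a third-party host embedded on many sites."""

    def __init__(self):
        self.history = defaultdict(list)  # cookie id -> pages seen

    def handle_request(self, cookie, referer):
        """Log the referring page under the visitor's cookie and return that cookie."""
        if cookie is None:
            cookie = uuid.uuid4().hex  # first visit: issue a new identifier
        self.history[cookie].append(referer)
        return cookie


tracker = ThirdPartyTracker()
cookie = None  # the browser's stored cookie for the tracker's domain
for page in [
    "http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto",
    "http://shop.example/checkout",  # hypothetical second site using the same tracker
]:
    cookie = tracker.handle_request(cookie, page)

# Both visits end up filed under one identifier.
print(tracker.history[cookie])
```

Nothing here identifies the reader by name; that link is made the moment the reader types a name and address into a form served by the same company.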

IdreamBooks syndicates book reviews. I don't know anything about them, and their homepage doesn't seem to have a privacy policy.

ScorecardResearch "conducts research by collecting Internet web browsing data and then uses that data to help show how people use the Internet, what they like about it, and what they don’t." They probably know whether I like ScorecardResearch. Their cookie is set by the ShareThis software.

ShareThis was one of the companies I mentioned in my last post. ShareThis provides social sharing buttons for the NYPL catalog. They also take your privacy very seriously. Some more excerpts:
In addition to the sharing service offered directly to users, the technology we use to assist with user sharing also allows us to gather information from publisher Web sites that include our ShareThis Sharing Icon or use our advertising technology, and enables ShareThis and our partner publishers and advertisers to use the value of the shared content and other information gathered through our technology to facilitate the delivery of relevant, targeted advertising (the ShareThis Services). 
we also receive certain non-personally identifiable information (e.g., demographic information such as zip code) from our advertisers, ad network and publisher partners, and we may combine this information with what we have collected. We also collect information from third-party Web sites with whom you have registered, like social networks, that those third parties make publicly available. 
While using the ShareThis Services, We may place third party advertisers’ and publishers’ cookies and pixels on their behalf regarding Usage Information. 
We are not responsible for the information practices of these third parties and the cookies placed by ShareThis on behalf of those third parties.
So ShareThis turns out to be in the business of advertising. They use your browsing behavior over thousands of websites to help advertisers target advertising and content to you. That scene in Minority Report where Tom Cruise gets personalized ads on the billboards he walks by? That's what ShareThis is helping to make happen today, and the NYPL website is helping them.
[Image: Ad Mall from Minority Report]
They do this by cookie-sharing. In addition to setting a sharethis.com cookie, they set cookies for other companies, so those companies also get to know what you're reading. And when they do this, they enable other companies to connect your browsing behavior at NYPL with information you've provided to social networks. The result is that a company selling Karl Marx merch could target ads at you based on your browsing of the Communist Manifesto catalog page.

But it's not like ShareThis is completely promiscuous. Their privacy agreement limits their cookie sharing to an exclusive group of advertising companies. Here's the beginning of the list:
  • 33across.png
  • accuen.png
  • Adap.png
  • adaramedia.com
  • adblade.com
  • addthis.com
  • adroll.com
  • aggregateknowledge.com
  • appnexus.com
  • atlassolutions.com
  • AudienceScience.com
That's just the A's.

In 1972, Zoia Horn, a librarian at Bucknell University, was jailed for almost three weeks for refusing to testify at the trial of the Harrisburg 7 concerning the library usage of one of the defendants. That was a long time ago. No longer is there a need to put librarians in jail.



Wednesday, August 13, 2014

Libraries are Giving Away the User-Privacy Store

AddThis makes some really nice widgets. Here are some for sharing this blogpost:

ShareThis is another company that does pretty much the same thing. Their share buttons are down at the end of the post. AddThis is bigger. It provides "behavioral, contextual, and interest based data that spans across hundreds of content categories and topics, reaching 1.7 billion uniques a month."

The widgets help users share your content. At the same time, AddThis and ShareThis widgets help a publisher figure out who is sharing what, while distributing the content into other websites. To do this, they track users and see what sort of websites they like. They can also work with advertising networks to improve the relevancy of ads shown to users. The user tracking works by setting user cookies, or "web beacons", that enable the tracking of users across websites. In the case of AddThis, users are also tracked using "Canvas Fingerprinting", a technique that works even when a user blocks cookie tracking. ProPublica recently wrote about this technology, calling it the "Online Tracking Device that's Nearly Impossible to Block".

Here's what the ShareThis Privacy Policy says:
In some cases, if you have chosen to make PII (like your name) publicly available through third party sites like social networks, we may seek your consent to use that PII in connection with services we offer in conjunction with our partners. We will not disclose your PII without your consent.
We and our publisher, advertiser and ad network partners also use this data for other related purposes (for example, to do research regarding the results of our online advertising campaigns or to better understand the interests or activities of users of the ShareThis Services).
Similarly, AddThis says:
When an End User downloads a page that contains an AddThis Button, we may deploy a cookie on our own behalf or on behalf of our data partners, to record information about how an End User uses the web, such as the web search that landed the End User on a particular page or categories of the End User's interests. We may use the Data to target advertising toward the End User or authorize others to do the same. 
Many websites are using Google Analytics to measure usage; they let Google track their users in the same way (the website I run, Unglue.it, uses Google Analytics). However, the Analytics terms of service seem not to allow Google to share the collected data as freely as AddThis and ShareThis do.

Both AddThis and ShareThis assert in the legal terms that they mustn't collect usage information from children, so if children use your site, you're not supposed to use these services. Google Analytics does not have this restriction, which presumably means they can't use their data to advertise to children.

Together with "Cookie Syncing" and "Evercookies", the cumulative effect of all this tracking is that website users can be pretty comprehensively tracked, and if need be, identified, whether they like it or not. In exchange for deploying the trackers, websites get access to the valuable pool of information about their users.

Matt Mullenweg (of WordPress) has an interesting perspective:
services like AddThis and ShareThis will always spy on and tag your audience when you use their widgets, and you should avoid them if you care about that sort of thing.
This puts libraries in somewhat of a quandary. Traditionally, libraries have been havens of privacy for their users. Librarians have famously gone to jail for their refusal to turn over circulation records to law enforcement. But it seems that libraries are not much protecting their users from the sort of information gathering done by AddThis, ShareThis, and Google. For example, New York Public Library uses Google Analytics and ShareThis. OCLC and Worldcat use AddThis. My own public library catalog (hosted by BCCLS) sets cookies for AddThis. I suppose they don't consider that their websites could be directed at children. Even the American Library Association's webpage extolling the importance of privacy in libraries makes use of Google Analytics. (Ironically, the link to a website privacy policy is broken on that page!)

It's true that these trackers are very common; even WhiteHouse.gov has employed AddThis buttons. But it seems to me that if libraries still think that user privacy is valuable in this age of social media, they need to rethink their use of web user tracking companies. What disturbs me most is that there hasn't been much public discussion about the future role of privacy in library websites, even as it's rapidly being lost.

Update (Aug 15): AddThis says they're not using canvas fingerprinting and have terminated their test of it. I don't think this really changes the cost/benefit analysis for libraries. It remains true that libraries that use AddThis or ShareThis are allowing a third party to track their patrons' catalog browsing (not just their social sharing), under terms which permit the companies to use the data for advertising purposes. Use of Google Analytics allows Google to do the same tracking, but does not appear to permit use for advertising. Either way, libraries need to make informed choices and communicate those choices to their users. Same for Facebook "Like" buttons. Commercial sites, obviously, have different priorities and responsibilities.

Update (Aug 19): There are a number of free open-source solutions available both for social sharing and for analytics. There's a very useful discussion of these issues on Hacker News.



Thursday, July 31, 2014

Don't Bother Reading "Acts of the Apostles"

Read Biodigital instead.

After reviewing John Sundman's Biodigital, I promised to report back after reading Acts of the Apostles, which shares about 60% of its text.

It's very unusual for a lay reader to have access to two versions of a book in this way. Biodigital is partly the result of the sort of editorial work that goes on behind the scenes of publishing, and to read Acts is to become aware of sausage making that is usually invisible.

The bottom line is that Biodigital is a much better book. You won't miss anything if you skip Acts. While there's a lot of tightening here and there, there are two big changes which lead me to urge you to set aside Acts.

The first is Gordon Biersch, which has been removed from the book. Gordon Biersch opened in 1988 on Emerson Street in Palo Alto, California. I remember when it opened; it was a revelation. The beer was pretty good, and the food was designed to go with the beer. Today, this sort of place has a name: "gastro-pub", but back in 1988, that word didn't exist, at least in the vocabulary of grad students like me. Yuppies flocked to the place and by the time Sundman was writing Acts, it signified everything good and bad about Silicon Valley. But since then, Gordon Biersch has gone all Vegas. No really, the founders were bought out by money from Las Vegas. Today, there's a Gordon Biersch gastropub in 34 places where restaurants are allowed to brew beer, including 4 in Taiwan. It's owned by the same company that owns "Rock Bottom" brewpubs.

In Biodigital, the events that occurred at Gordon Biersch have been moved a mile or so southeast to Antonio's Nut House. Antonio's is still around. Like everything else in the area, it's changed, but it's not like Silicon Valley changed into Las Vegas. It's like Sun Microsystems changed into Google. I went and had a beer there when I was visiting earlier this month. I took pictures. Google maps has a walk-through view.







The other big change is the book's depiction of Bartlett Aubrey. Bartlett, the estranged wife of hero Nick Aubrey, is supposed to be a brilliant molecular biologist, but in Acts, she mostly has big breasts. It's not a realistic portrait at all, more of an adolescent fantasy character. In Biodigital, references to Bartlett's breasts are cut by 50%, and I swear that's not why I thought the character was a lot smarter than in Acts.

So, support your local author. Or your local beer bar. Better yet, do both at the same time.


Thursday, July 10, 2014

"Subtleism" is a Useful Word

Allison Kaptur has written about the last of Hacker School's lightweight social rules: "No Subtle -isms":
Our last social rule, "No subtle -isms," bans subtle racism, sexism, homophobia, transphobia, and other kinds of bias. Like the first three rules, it's targeting subtle, accidental, mildly hurtful behavior. This rule isn't targeting slurs, harassment, or threats. These kinds of severe violations would have consequences, up to and including expelling someone from Hacker School. 
Breaking the fourth social rule, like breaking any other social rule, is an accident and a small thing. In theory, someone should be able to say "Hey, that was subtly sexist," get the response "Oops, sorry!" and move on just as easily as if they'd well-actually'ed. In practice, people are less likely to point out when this rule is broken, and more likely to be defensive if they were the rule-breaker. We'd like to change this.
When this was explained to me by Hacker School Co-Founder Sonali Sridhar, I thought it was brilliant, but I heard "subtle -ism" as a single word, "subtleism". "Subtleism" conveyed to me the concept that something could be harmless by itself, but multiplied by a thousand could be oppressive. So for example, using "you guys" for the second person plural when both men and women are included, is never meant to be sexist, and is rarely taken the wrong way. But an ocean of hundreds or even thousands of tiny, insignificant locutions like "you guys" can drown even a strong swimmer.

The reason subtleism is a useful word is that it can convey forgiveness in a context of working together to create a culture that is supportive of a diverse team. Reminding someone of a subtleism doesn't need to be a "shaming ritual"; after all, everyone uses subtleisms all the time. Compare the word "micro-aggression", which is used as an accusation or a lamentation.

Also, the word we should be using for the second person plural is "youse".

Sunday, June 29, 2014

Is Freemium Really Open Access?

Should the term "Open Access" be restricted to materials with licenses that allow redistribution, like Creative Commons licenses? Or, as some advocate, only materials that allow remixing and commercial re-use, like CC-BY and CC-BY-SA?

I had lunch today with folks from OpenEditions, a French publishing organization whose ebook effort I've been admiring for a while. They're here in Las Vegas at the American Library Association Conference, hoping to get libraries interested in the 1,428 ebooks they have on their platform. (Booth 1437!)

Of those, 1,076 books are in a program they call "Open Access Freemium". You can read these books on the OpenEditions website for free, without having to register or anything. You can even embed them into your blog. So for example, here's Opinion Mining et Sentiment Analysis by Dominique Boullier and Audrey Lohard:



So is it OpenAccess™?

In this freemium model, the main product that's being sold is access to the downloadable ebook, whether PDF or EPUB. For libraries, a subscription allows for unlimited access with IP address authentication along with additional services. Creative Commons licenses, all of which allow for format conversion, wouldn't work for this business model because the free HTML could easily be converted into EPUB and PDF. They have their own license; you can read it here.

This is clearly not completely open, but there's no doubt that it's usefully open. For me, the biggest problem is that if OpenEditions goes away for some reason (business, politics, natural disaster, or stupidity), then the ebooks disappear. Similarly, if OpenEditions' policies change or URLs move, they could break the embed.

On the plus side, OpenEditions have convinced a group of normally conservative publishers of the advantages of creating usefully open versions of over a thousand books. It's a step in the right direction.