Thursday, December 22, 2016

How to check if your library is leaking catalog searches to Amazon

I've been writing about privacy in libraries for a while now, and I get a bit down sometimes because progress is so slow. I've come to realize that part of the problem is that the issues are sometimes really complex and technical; people just don't believe that the web works the way it does, violating user privacy at every opportunity.

Content embedded in websites is a huge source of privacy leakage in library services. Cover images can be particularly problematic. I've written before that, without meaning to, many libraries send data to Amazon about the books a user is searching for; cover images are almost always the culprit. I've been reporting this issue to the library automation companies that enable it, but a year and a half later, nothing has changed. (I understand that "discovery" services such as Primo/Summon even include config checkboxes that make this easy to do; the companies say this is what their customers want.)

Two indications that a third-party cover image is a privacy problem are:
  1. the provider sets tracking cookies on the hostname serving the content.
  2. the provider collects personal information, for example as part of commerce. 
For example, covers served by Amazon send a bonanza of actionable intelligence to Amazon.
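Here's a minimal sketch, in Python, of the request a browser assembles when a catalog page embeds an Amazon-hosted cover. The URLs and cookie value are made-up placeholders; the point is which headers ride along automatically.

```python
# Sketch: the request a browser makes for a cover image embedded in a
# catalog page. URLs and cookie values below are made-up placeholders.
import urllib.request

# The page the user is looking at -- note the search string in the URL.
catalog_page = "https://catalog.example.edu/search?q=some+book+title"
# A cover image embedded in that page, requested by ISBN.
cover_url = "https://images.amazon.com/images/P/0316017922.01.MZZZZZZZ.jpg"

req = urllib.request.Request(cover_url)
# The browser fills these in automatically: Referer names the embedding
# page, and Cookie carries whatever amazon.com cookies the user already holds.
req.add_header("Referer", catalog_page)
req.add_header("Cookie", 'x-main="EXAMPLE-IDENTIFYING-TOKEN"')

print(req.get_header("Referer"))  # the full catalog search URL leaks
print(req.get_header("Cookie"))   # the identifying cookie rides along
```

Note that the library never "sends" anything to Amazon itself; merely embedding the image makes the user's browser do it.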

Here's how to tell if your library is sending Amazon your library search data.

Setup

You'll need a web browser equipped with developer tools; I use Chrome. Firefox should work, too.

Log into Amazon.com. They will give you a tracking cookie that identifies you. If you buy something, they'll have your credit card number, your physical and electronic addresses, records about the stuff you buy, and a big chunk of your web browsing history on websites that offer affiliate linking. These cookies are used to optimize the advertisements you're shown around the web.

To see your Amazon cookies, go to Preferences > Settings. Click "Show advanced settings..." (It's hiding at the bottom.)

Click the "Content settings..." button.

Now click the "All cookies and site data" button.

In the "Search cookies" box, type "amazon". Chances are, you'll see something like this.

I've got 65 cookies for "amazon.com"!

If you remove all the cookies and then go back to Amazon, you'll get 15 fresh cookies, most of them set to last for 20 years. Amazon knows who I am even if I delete all the cookies except "x-main".

Test the Library

Now it's time to find a library search box. For demonstration purposes, I'll use Harvard's "Hollis" catalog. I would get similar results at 36 different ARL libraries, but Harvard has lots of books and returns plenty of results. In the past, I've used What to expect as my search string, but just to make a point, I'll use Killing Trump, a book that Bill O'Reilly hasn't written yet.

Once you've executed your search, choose View > Developer > Developer Tools.

Click on the "Sources" tab to see the requests made of "images.amazon.com". Amazon has returned 1x1 clear pixels for three requested covers. The covers are requested by ISBN. But that's not all the information contained in the cover request.

To see the cover request, click on the "Network" tab and hit reload. You can see that the cover images were requested by a javascript called "primo_library_web" (Hollis is an instance of Ex Libris' Primo discovery service.)

Now click on the request you're interested in. Look at the request headers.


There are two of interest, the "Cookie" and the "Referer".

The "Cookie" sent to Amazon is this:
x-main="oO@WgrX2LoaTFJeRfVIWNu1Hx?a1Mt0s";
skin=noskin; session-token="bcgYhb7dksVolyQIRy4abz1kCvlXoYGNUM5gZe9z4pV75B53o/4Bs6cv1Plr4INdSFTkEPBV1pm74vGkGGd0HHLb9cMvu9bp3qekVLaboQtTr+gtC90lOFvJwXDM4Fpqi6bEbmv3lCqYC5FDhDKZQp1v8DlYr8ZdJJBP5lwEu2a+OSXbJhfVFnb3860I1i3DWntYyU1ip0s="; x-wl-uid=1OgIBsslBlOoArUsYcVdZ0IESKFUYR0iZ3fLcjTXQ1PyTMaFdjy6gB9uaILvMGaN9I+mRtJmbSFwNKfMRJWX7jg==; ubid-main=156-1472903-4100903;
session-id-time=2082787201l;
session-id=161-0692439-8899146
Note that Amazon can tell who I am from the x-main cookie alone. In the privacy biz, this is known as "PII" or personally identifiable information.

The "Referer" sent to Amazon is this:
http://hollis.harvard.edu/primo_library/libweb/action/search.do?fn=search&ct=search&initialSearch=true&mode=Basic&tab=everything&indx=1&dum=true&srt=rank&vid=HVD&frbg=&tb=t&vl%28freeText0%29=killing+trump&scp.scps=scope%3A%28HVD_FGDC%29%2Cscope%3A%28HVD%29%2Cscope%3A%28HVD_VIA%29%2Cprimo_central_multiple_fe&vl%28394521272UI1%29=all_items&vl%281UI0%29=contains&vl%2851615747UI0%29=any&vl%2851615747UI0%29=title&vl%2851615747UI0%29=any
To put this plainly, my entire search session, including my search string killing trump is sent to Amazon, alongside my personal information, whether I like it or not. I don't know what Amazon does with this information. I assume if a government actor wants my search history, they will get it from Amazon without much fuss.
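Here's how little effort it takes a recipient to pull the search string out of that Referer; a short Python sketch using the Hollis URL (abbreviated) from above:

```python
# Sketch: extracting the user's search string from a leaked Referer header.
from urllib.parse import urlparse, parse_qs

referer = ("http://hollis.harvard.edu/primo_library/libweb/action/search.do"
           "?fn=search&ct=search&mode=Basic&tab=everything&vid=HVD"
           "&vl%28freeText0%29=killing+trump")

params = parse_qs(urlparse(referer).query)
# Primo puts the user's query in the "vl(freeText0)" parameter.
search_string = params["vl(freeText0)"][0]
print(search_string)  # → killing trump
```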

I don't like it.

Rant

[I wrote a rant; but I decided to save it for a future post if needed.] Anyone want a Cookie?

Notes 12/23/2016:


  1. As Keith Jenkins noted, users can configure Chrome and Safari to block 3rd party cookies. Firefox won't block Amazon cookies, however. And some libraries advise users not to block 3rd party cookies because doing so can cause problems with proxy authentication.
  2. If Chrome's network panel tells you "Provisional headers are shown", it means Chrome doesn't know what request headers were really sent because a browser extension is modifying them. So if you have HTTPS Everywhere, Ghostery, Adblock, or Privacy Badger installed, you may not be able to use Chrome developer tools to see request headers. Thanks to Scott Carlson for the heads up.
  3. Cover images from Google leak similar data; as does use of Google Analytics. As do Facebook Like buttons. Et cetera.
  4. Thanks to Sarah Houghton for suggesting that I write this up.

Update 3/23/2017:

There's good news in the comments!

Friday, October 14, 2016

Maybe IDPF and W3C should *compete* in eBook Standards

A controversy has been brewing in the world of eBook standards. The International Digital Publishing Forum (IDPF) and the World Wide Web Consortium (W3C) have proposed to combine. At first glance, this seems a sensible thing to do; IDPF's EPUB work leans heavily on W3C's HTML5 standard, and IDPF has been over-achieving with limited infrastructure and resources.

Not everyone I've talked to thinks the combination is a good idea. In the publishing world, there is fear that the giants of the internet who dominate the W3C will not be responsive to the idiosyncratic needs of more traditional publishing businesses. On the other side, there is fear that the work of IDPF and Readium on "Lightweight Content Protection" (a.k.a. Digital Rights Management) will be another step towards "locking down the web" (see the controversy about "Encrypted Media Extensions").

What's more, a peek into the HTML5 development process reveals a complicated history. The HTML5 that we have today derives from a group of developers (the WHATWG) who got sick of the W3C's processes and dependencies and broke away from W3C. Politics above my pay grade occurred and the breakaway effort was folded back into W3C as a "Community Group". So now we have two slightly different versions of HTML, the "standard" HTML5 and WHATWG's HTML "Living Standard". That's also why HTML5 omitted much of W3C's Semantic Web development work such as RDFa.

Amazon (not a member of either IDPF or W3C) is the elephant in the room. They take advantage of IDPF's work in a backhanded way. Instead of supporting the EPUB standard in their Kindle devices, they use proprietary formats under their exclusive control. But they accept EPUB files in their content ingest process and thus extract huge benefit from EPUB standardization. This puts the advancement of EPUB in a difficult position. New features added to EPUB have no effect on the majority of ebook users because Amazon just converts everything to a proprietary format.

Last month, the W3C published its vision for eBook standards, in the form of an innocuously titled "Portable Web Publications Use Cases and Requirements". For whatever reason, this got rather limited notice or comment, considering that it could be the basis for the entire digital book industry. Incredibly, the word "ebook" appears not once in the entire document. "EPUB" appears just once, in the phrase "This document is also available in this non-normative format: ePub". But read the document, and it's clear that "Portable Web Publication" is intended to be the new standard for ebooks. For example, the PWP (can we just pronounce that "puup"?) "must provide the possibility to switch to a paginated view". The PWP (say it, "puup") needs a "default reading order", i.e. a table of contents. And of course the PWP has to support digital rights management: "A PWP should allow for access control and write protections of the resource." Under the oblique requirement that "The distribution of PWPs should conform to the standard processes and expectations of commercial publishing channels." we discover that this means "Alice acquires a PWP through a subscription service and downloads it. When, later on, she decides to unsubscribe from the service, this PWP becomes unavailable to her." So make no mistake, PWP is meant to be EPUB 4 (or maybe ePub4, to use the non-normative capitalization).

There's a lot of unalloyed good stuff there, too. The issues of making web publications work well offline (an essential ingredient for archiving them) are technical, difficult and subtle, and W3C's document does a good job of flushing them out. There's a good start (albeit limited) on archiving issues for web publications. But nowhere in the statement of "use cases and requirements" is there a use case for low cost PWP production or for efficient conversion from other formats, despite the statement that PWPs "should be able to make use of all facilities offered by the [Open Web Platform]".

The proposed merger of IDPF and W3C raises the question: who gets to decide what "the ebook" will become? It's an important question, and the answer eventually has to be open rather than proprietary. If a combined IDPF and W3C can get the support of Amazon in open standards development, then everyone will benefit. But if not, a divergence is inevitable. The publishing industry needs to sustain their business; for that, they need an open standard for content optimized to feed supply chains like Amazon's. I'm not sure that's quite what W3C has in mind.

I think ebooks are more important than just the commercial book publishing industry. The world needs ways to deliver portable content that don't run through the Amazon tollgates. For that we need innovation that's as unconstrained and disruptive as the rest of the internet. The proposed combination of IDPF and W3C needs to be examined for its effects on innovation and competition.

Philip K. Dick's Mr. Robot is one of the stories in Imagination Stories of Science and Fantasy, January 1953. It is available as an ebook from Project Gutenberg and from GITenberg.
My guess is that Amazon is not going to participate in open ebook standards development. That means that two different standards development efforts are needed. Publishers need a content markup format that plays well with whatever Amazon comes up with. But there also needs to be a way for the industry to innovate and compete with Amazon on ebook UI and features. That's a very different development project, and it needs a group more like WHATWG to nurture it. Maybe the W3C can fold that sort of innovation into its unruly stable of standards efforts.

I worry that by combining with IDPF, the W3C work on portable content will be chained to the supply-chain needs of today's publishing industry, and no one will take up the banner of open innovation for ebooks. But it's also possible that the combined resources of IDPF and W3C will catalyze the development of open alternatives for the ebook of tomorrow.

Is that too much to hope?

Wednesday, September 7, 2016

Start Saying Goodbye to eBook Pagination

Book pages may be the most unfortunate things ever invented by the reading-industrial complex. No one knows the who or how of their invention. The Egyptians and the Chinese didn't need pages because they sensibly wrote in vertical lines. It must have been the Greeks who invented and refined the page.
Egyptian scroll held in the UNE Antiquities Museum (CC BY unephotos)
In my imagination, some scribes invented the page in a dark and damp scriptorium after arguing about landscape versus portrait on their medieval iScrolls. They didn't worry about user experience. The debate must have been ergonomics versus cognitive load. Opening the scroll side-to-side allowed the monk to rest comfortably when the scribing got boring, and the additional brainwork of figuring out how to start a new column was probably a great relief from monotony. That and drop-caps. The codex probably came about when the scribes ran out of white-out.
Scroll of the Book of Esther, Seville, Spain
Technical debt from this bad decision lingers. Consider the horrors engendered by pagination:
  • We have to break pages in the middle of sentences?!?!?!? Those exasperating friends of yours who stop their sentences in mid-air due to lack of interest or memory probably work as professional paginators. 
  • When our pagination leaves just one line of a paragraph at the top of a page, you have what's known as a widow. Pagination is sexist as well as exasperating.
  • Don't you hate it when a wide table is printed sideways? Any engineer can see this is a kludgy result of choosing the wrong paper size.
To be fair, having pages is sometimes advantageous.
  • You can put numbers on the pages. This allows an entire class of students to turn to the same page. It also allows textbook companies to force students to buy the most recent edition of their exorbitantly priced textbooks. The ease of shifting page numbers spares the textbook company the huge expense of making actual revisions to the text.
  • Pages have corners, convenient for folding.
  • You can tell often-read pages in a book by looking for finger-grease accumulation on the page-edges. I really hope that stuff is just finger-grease.
  • You can rip out important pages. Because you can't keep a library book forever.
  • Without pages in books, how would you press flowers?
  • With some careful bending, you can make a cute heart shape.
Definition of love by Billy Rowlinson, on Flickr; CC-BY 

While putting toes in the water of our ebook future, we still cling to pages like Linus and his blankie. At first, this was useful. Users who had never seen an ebook could guess how they worked. Early e-ink based e-reading devices had great contrast and readability but slow refresh rates. "Turning" a page was a giant hack that turned a technical liability of slow refresh into a whizzy dissolve feature. Apple's iBooks app for the iPad appeared at the zenith of skeuomorphic UI design fashion and its too-cute page-turn animation is probably why the DOJ took Apple to court. (Anti-trust?? give me a break!)

But seriously, automated pagination is hard. I remember my first adventures with TeX in the late '80s: half my time was spent wrestling with unfortunate, inexplicable pagination and equation bounding boxes. (The other half was spent being mesmerized at seeing my own words in typeset glory.)

The thing that started me on this rant is the recent publication of the draft EPUB 3.1 specification, which has nothing wrong with it but makes me sad anyway. It's sad because the vast majority of ebook lovers will never be able to take advantage of all the good things in it. And it's not just because of Amazon and its proprietary Kindle formats. It's the tug of war between the page-oriented past of books and the web-oriented future of ebooks. EPUB's role is to leverage web standards while preserving the publishing industry's investment in print-compatible text. Mission accomplished, as much as you can expect.

What has not been part of EPUB's mission is to leverage the web's amazingly rapid innovation in user interfaces. EPUB is essentially a website packaged into a compressed archive. But over the last eight years, innovations in "responsive" web reading UI, driven by the need of websites to work on both desktop and mobile devices, have been magical. Tap, scroll and swipe are now universally understood and websites that don't work that way seem buggy or weird. Websites adjust to your screen size and orientation. They're pretty easy to implement, because of javascript/css frameworks such as Bootstrap and Foundation. They're perfect for ebooks, except... the affordances provided by these responsive design frameworks conflict with the built-in affordances of ebook readers (such as pagination). The result has been that, from the UI point of view, EPUBs are zipped up turn-of-the-century websites with added pagination.

Which is what makes me sad. Responsive, touch-based web designs, not container-paginated EPUBs, are the future of ebooks. The first step (which Apple took two years ago) is to stop resisting the scroll, and start saying goodbye to pagination.

Sunday, July 31, 2016

Entitled: The art of naming without further elaboration or qualification.

As I begin data herding for our project Mapping the Free Ebook Supply Chain, I've been thinking a lot about titles and subtitles, and it got me wondering: what effect do subtitles have on usage, and should the open-access status of a book affect its naming strategy? We are awash in click-bait titles for articles on the web; should ebook titles be clickbaity, too? To what extent should ebook titles be search-engine optimized, and which search engines should they be optimized for?

Here are some examples of titles that I've looked at recently, along with my non-specialist's reactions:
Title: Bigger than You: Big Data and Obesity
Subtitle: An Inquiry toward Decelerationist Aesthetics
The title is really excellent; it gives a flavor of what the book's about and piques my interest because I'm curious what obesity and big data might have to do with each other. The subtitle is a huge turn-off. It screams "you will hate this book unless you already know about decelerationist aesthetics" (and I don't).
(from Punctum Books)



Title: Web Writing
Subtitle: Why and How for Liberal Arts Teaching and Learning
The title is blah and I'm not sure whether the book consists of web writing or is something about how to write for or about the web. The subtitle at least clues me in to the genre, but fails to excite me. It also suggests to me that the people who came up with the name might not be experts in writing coherent, informative and effective titles for the web.
From University of Michigan Press



Title: DOOM
Subtitle: SCARYDARKFAST

If I saw the title alone I would probably mistake it for something it's not. An apocalyptic novel, perhaps. And why is it all caps? The subtitle is very cool though, I'd click to see what it means.
From Digital Culture Books




It's important to understand how title metadata gets used in the real world. Because the title and subtitle get transported in different metadata fields, using a subtitle cedes some control over title presentation to the websites that display it. For example, Unglue.it's data model has a single title field, so if we get both title and subtitle in a metadata feed, we squash them together in the title field. Unless we don't. Because some of our incoming feeds don't include the subtitle. Different websites do different things. Amazon uses the full title but some sites omit the subtitle until you get to the detail page. So you should have a good reason to use a subtitle as opposed to just putting the words from the subtitle in the title field. DOOM: SCARYDARKFAST is a much better title than DOOM. (The DOOM in the book turns out to be the game DOOM, which I would have guessed from the all-caps if I had ever played DOOM.) And you can't depend on sites preserving your capitalization; Amazon presents several versions of DOOM: SCARYDARKFAST.
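The squashing logic is roughly this (a hypothetical sketch; the function and its edge cases are mine, not Unglue.it's actual code):

```python
# Sketch: combining separate title/subtitle metadata fields into the single
# title field that some sites (like Unglue.it) use for display.
def squash_title(title, subtitle=None):
    """Join title and subtitle with ': ', unless the subtitle is missing
    or already contained in the title."""
    if not subtitle or subtitle.lower() in title.lower():
        return title
    return f"{title}: {subtitle}"

print(squash_title("DOOM", "SCARYDARKFAST"))  # → DOOM: SCARYDARKFAST
print(squash_title("Web Writing"))            # → Web Writing
```

Since you can't control which version a given site displays, putting the important words in the title field itself is the safer bet.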

Another thing to think about is the "marketing funnel". This is the idea that in order to make a sale or to have an impact, your product has to pass through a sequence of hurdles, each step yielding a market that's a fraction of the previous step's. So for ebooks, you have to first get them selected into channels, each of which might be a website. Then a fraction of users searching those websites might see your ebook's title (or cover), for example in a search result. Then a fraction of those users might decide to click on the title to see a detail page, at which point there had better be an abstract or the potential reader becomes a non-reader.

Having reached a detail page, some fraction of potential readers (or purchase agents) will be enticed to buy or download the ebook. Any "friction" in this process is to be avoided. If you're just trying to sell the ebook, you're done. But if you're interested in impact, you're still not done, because even if a potential reader has downloaded the ebook, there's no impact until the ebook gets used. The title and cover continue to be important because the user is often saving the ebook for later use. If the ebook doesn't open to something interesting and useful, a free ebook will often be discarded or put aside.

Bigger than You's strong title should get it the clicks, but the subtitle doesn't help much at any step of the marketing funnel. "Aesthetics" might help it in searches; it's possible that even the book's author has never ever entered "Decelerationist" as a search term. The book's abstract, not the subtitle, needs to do the heavy lifting of driving purchases or downloads.

The first sentence of "Web Writing" suggests to me that a better title might have been "Rebooting how we think about the Internet in higher education". But check back in a couple of months. Once we start looking at the data on usage, we might find that what I've written here is completely wrong, and that Web Writing was the best title of them all!

Notes:
1. The title of this blog post is the creation of Adrian Short, who seems to have left Twitter.

Wednesday, June 22, 2016

Wiley's Fake Journal of Constructive Metaphysics and the War on Automated Downloading


Suppose you were a publisher and you wanted to get the goods on a pirate who was downloading your subscription content and giving it away for free. One thing you could try is to trick the pirate into downloading fake content loaded with spy software and encoded information about how the downloading was being done. Then you could verify when the fake content was turning up on the pirate's website.

This is not an original idea. Back in the glory days of Napster, record companies would try to fill the site with bad files which somehow became infested with malware. Peer-to-peer networks evolved trust mechanisms to foil bad-file strategies.

I had hoped that the emergence of Sci-Hub as an efficient, though unlawful, distributor of scientific articles would not provoke scientific publishers to do things that could tarnish the reputation of their journals. I had hoped that publishers would get their acts together and implement secure websites so that they could be sure that articles were getting to their real subscribers. Silly me.

In a series of tweets, Rik Smith-Unna noted with dismay that the Wiley Online Library was using "fake DOIs" as "trap URLs": URLs in links invisible to human users. A poorly written web spider or crawler would try to follow the link, triggering a revocation of the user's access privileges. (For not obeying the website's terms of service, I suppose.)

Gabriel J. Gardner of Cal State Long Beach has reported his library's receipt of a scary email from Wiley stating:
Wiley has been investigating activity that uses compromised user credentials from institutions to access proxy servers like EZProxy (or, in some cases, other types of proxy) to then access IP-authenticated content from the Wiley Online Library (and other material). We have identified a compromised proxy at your institution as evidenced by the log file below. 
We will need to restrict your institution’s proxy access to Wiley Online Library if we do not receive confirmation that this has been remedied within the next 24 hours.  

I've been seeing these trap URLs in scholarly journals for almost 20 years now. Two years ago they reappeared in ACS journals. They're rarely well thought out, and from talking with publishers who have tried them, they don't work as intended. The Wiley trap URLs exhibit several mistakes in implementation.
  1. Spider trap URLs are useful for detecting bots that ignore robot exclusions. But Wiley's robots.txt document doesn't exclude the trap URLs, so "well-behaved" spiders, such as googlebot, are also caught. As a result, the fake Wiley page is indexed in Google, and because of the way Google aggregates the weight of links, it's actually a rather highly ranked page.
  2. The download URLs for the fake article don't download anything, but instead return a 500 error code whenever an invalid pseudo-DOI is presented to the site. This is a site misconfiguration that can cause problems with link checking or link verification software.
  3. Because the fake URLs look like Wiley DOI's, they could cause confusion if circulated. Crossref discourages this.
  4. The trap URLs as implemented by Wiley can be used for malicious attacks. With a list of trap URLs, it's trivial to craft an email or a web page that causes a victim's browser to request them all; since the trap URLs trigger service suspensions, an attacker can cut off a target's access just by sending an email.
  5. Apparently, Wiley used a special cookie to block the downloading. Have they not heard of sessions?
  6. The blocks affected both subscription and open-access content. Umm, do I need to explain the concept of "Open Access"?
  7. It's just not a smart idea (even on April Fools!) for a reputable publisher to create fake article pages for "Constructive Metaphysics in Theories of Continental Drift." (Warning: until Wiley realizes their ineptness, this link may trigger unexpected behavior. Use Tor.) It's an insult to both geophysicists and philosophers. And how does the University of Bradford feel about hosting a fictitious Department of Geophysics???
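To illustrate the first mistake: a well-behaved spider consults robots.txt before following any link, so a trap URL only distinguishes good bots from bad ones if it's excluded there. A minimal sketch, with a made-up publisher domain and paths:

```python
# Sketch: the robots.txt check a compliant crawler performs. If the trap
# path were excluded (Wiley's wasn't), googlebot would never request it.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /fake-article-trap/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler skips the excluded trap...
print(parser.can_fetch("*", "https://publisher.example.com/fake-article-trap/doi"))  # False
# ...while real content remains fetchable; only a robots-ignoring bot
# ever hits the trap.
print(parser.can_fetch("*", "https://publisher.example.com/real-article/doi"))  # True
```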


Instead of trap URLs, online businesses that need to detect automated activity have developed elaborate and effective mechanisms to do so. Automated downloads are a billion-dollar problem for the advertising industry in particular. So advertisers, advertising networks, and market research companies use coded, downloaded javascripts and Flash scripts to track and monitor both users and bots. I've written about how these practices are inappropriate in library contexts. In comparison, the trap URLs being deployed by Wiley are sophomoric and a technical embarrassment.

If you visit the Wiley fake article page now, you won't get an article. You get a full dose of monitoring software. Wiley uses a service called Qualtrics Site Intercept to send you "Creatives" if you meet targeting criteria. But you'll also get that if you access Wiley's Online Library's real articles, along with sophisticated trackers from Krux Digital, Grapeshot, Jivox, Omniture, Tradedesk, Videology and Neustar.

Here's the letter I'd like libraries to start sending publishers:
[Library] has been investigating activity that causes spyware from advertising networks to compromise the privacy of IP-authenticated users of the [Publisher] Online Library, a service for which we have been billed [$XXX,XXX]. We have identified numerous third party tracking beacons and monitoring scripts infesting your service as evidenced by the log file below. 
We will need to restrict [Publisher]'s access to our payment processes if we do not receive confirmation that this has been remedied within the next 24 hours.  
Notes:
  1. Here's another example of Wiley cutting off access because of fake URL clicking. The implication that Wiley has stopped using trap URLs seems to be false.
  2. Some people have suggested that the "fake DOIs" are damaging the DOI system. Don't worry, they're not real DOI's and have not been registered. The DOI system is robust against this sort of thing; it's still disrespectful.
Update June 23:
  1. Tom Griffin, a spokesman for Wiley, has posted a denial to LIBLICENCE which has a tenuous grip on reality.
  2. Smith-Unna has posted a point-by-point response to Griffin's denial in the form of a gist.

Monday, May 23, 2016

97% of Research Library Searches Leak Privacy... and Other Disappointing Statistics.


...But first, some good news. Among the 123 members of the Association of Research Libraries, there are four libraries with almost secure search services that don't send clickstream data to Amazon, Google, or any advertising network. Let's now sing the praises of libraries at Southern Illinois University, University of Louisville, University of Maryland, and University of New Mexico for their commendable attention to the privacy of their users. And it's no fault of their own that they're not fully secure. SIU fails to earn a green lock badge because of mixed content issues in the CARLI service; while Louisville, Maryland and New Mexico miss out on green locks because of the weak cipher suite used by OCLC on their Worldcat Local installations. These are relatively minor issues that are likely to get addressed without much drama.

Over the weekend, I decided to try to quantify the extent of privacy leakage in public-facing library services by studying the search services of the 123 ARL libraries. These are the best funded and most prestigious libraries in North America, and we should expect them to positively represent libraries. I went to each library's on-line search facility and did a search for a book whose title might suggest to an advertiser that I might be pregnant. (I'm not!) I checked to see whether the default search linked to by the library's home page (as listed on the ARL website) was delivered over a secure connection (HTTPS). I checked for privacy leakage of referer headers from cover images by using Chrome developer tools (the sources tab). I used Ghostery to see if the library's online search used Google Analytics or not. I also noted whether advertising network "web beacons" were placed by the search session.
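The by-hand checks could be roughed out in code like this (the pattern list and function are my own simplification; real pages pull in trackers via javascript, so browser developer tools catch more than a static scan does):

```python
# Sketch: scan a catalog page's HTML for third parties it pulls in, plus
# an HTTPS check on the page URL. Patterns are illustrative, not exhaustive.
import re

TRACKER_PATTERNS = {
    "Amazon cover images": r"images(-\w+)?\.(ssl-images-)?amazon\.com",
    "Google Analytics":    r"google-analytics\.com",
    "Facebook beacon":     r"facebook\.(com|net)",
    "DoubleClick":         r"doubleclick\.net",
}

def privacy_leaks(html, page_url):
    leaks = [name for name, pat in TRACKER_PATTERNS.items()
             if re.search(pat, html)]
    if not page_url.startswith("https://"):
        leaks.append("no HTTPS")
    return leaks

sample = '<img src="http://images.amazon.com/images/P/0316017922.jpg">'
print(privacy_leaks(sample, "http://catalog.example.edu/search"))
# → ['Amazon cover images', 'no HTTPS']
```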

72% of the ARL libraries let Google look over the shoulder of every click by every user, by virtue of the pervasive use of Google Analytics. Given the commitment to reader privacy embodied by the American Library Association's code of ethics, I'm surprised this is not more controversial. ALA even sponsors workshops on "Getting Started with Google Analytics". To paraphrase privacy advocate and educator Dorothea Salo, the code of ethics does not say:
We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted, except for Google Analytics.
While it's true that Google has a huge stake in maintaining the trust of users in their handling of personal information, and people seem to trust Google with their most intimate secrets, it's also true that Google's privacy policy puts almost no constraints on what Google (itself) can do with the information they collect. They offer strong commitments not to share personally identifiable information with other entities, but they are free to keep and use personally identifiable information. Google can associate Analytics-tracked library searches with personally identifiable information for any user that has a Google account; libraries cannot be under the illusion that they are uninvolved with this data collection if they benefit from Google Analytics. (Full disclosure: many of the web sites I administer also use Google Analytics.)

80% of the ARL libraries provide their default discovery tools to users without the benefit of a secure connection. This means that any network provider in the path between the library and the user can read and alter the query, and the results returned to the user. It also means that when a user accesses the library over public wifi, such as in a coffee shop, the user's clicks are available for everyone else in the coffee shop to look at, and potentially to tamper with. (The Digital Library Privacy Pledge is not having the effect we had hoped for, at least not yet.)
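The HTTPS part of the survey can also be run in bulk: request the plain-HTTP form of each catalog address and see whether the server redirects you to an https:// URL. Here's a rough stdlib-only sketch; the hostname is a placeholder, and note that a thorough survey would also have to catch meta-refresh and JavaScript redirects, which this misses.

```python
import urllib.request

def redirects_to_https(host, timeout=10):
    """Fetch http://host and report whether we end up on an https:// URL.

    urlopen follows HTTP redirects automatically, so geturl() gives
    the final address after any redirect chain.
    """
    try:
        with urllib.request.urlopen("http://" + host, timeout=timeout) as resp:
            return resp.geturl().startswith("https://")
    except OSError:
        # Unreachable or misconfigured hosts count as "not secure" here.
        return False

# Placeholder hostname for illustration only:
print(redirects_to_https("catalog.example.edu"))
```

A server that answers over plain HTTP without redirecting, or that can't be reached at all, reports False; only a redirect to an https:// address counts as secure by default.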

28% of ARL libraries enrich their catalog displays with cover images sourced from Amazon.com. Because of privacy leakage in referer headers, this means that a user's searches for library books are available for use by Amazon when Amazon wants to sell that user something. It's not clear whether libraries realize this is happening; some may not know that their catalog enrichment service sources its cover images from Amazon.

13% of ARL libraries help advertisers (other than Google) target their ads by allowing web beacons to be placed on their catalog web pages. Whether the beacons are from Facebook, DoubleClick, AddThis or ShareThis, advertisers track individual users, often in a personally identifiable way. Searches on these library catalogs are available to the ad networks to maximize the value of advertising placed throughout their networks.

Much of the privacy leakage I found in my survey occurs beyond the control of librarians. There are IT departments, vendor-provided services, and incumbent bureaucracies involved. Important library services appear to be unavailable in secure versions. But specific, serious privacy leakage problems that I've discussed with product managers and CTOs of library automation vendors have gone unfixed for more than a year. I'm getting tired of it.

The results of my quick survey for each of the 123 ARL libraries are available as a Google Sheet. There are bound to be a few errors, and I'd love to be able to make changes as privacy leaks get plugged and websites become secure, so feel free to leave a comment.

Friday, April 1, 2016

April Fools is Cancelled This Year

Since The Onion dropped its fake news format in January in favor of serious reporting, it's become clear that the web's April Fools' Day would be very different this year. Why make stuff up when real life is so hard to believe?

All my ideas for satirical blog posts seemed too sadly realistic. After people thought my April 1 post last year was real, all my ideas for fake posts about false privacy and the All Writs Act seemed cruel. I thought about doing something about power inequity in libraries and publishing, but then all my crazy imaginings came true on the ACRL SCHOLCOMM list.

So no April Fools post on Go To Hellman this year. Except for this one, of course.

Monday, March 21, 2016

Sci-Hub, LibGen, and Total Information Awareness


"Good thing downloads NOT trackable!" was one twitter response to my post imagining a skirmish in the imminent scholarly publishing copyright war.

"You wish!" I responded.

Sooner or later, such illusions of privacy will fail spectacularly, and people will get hurt.

I had been in no hurry to see what the Sci-Hub furor was about. After writing frequently about piracy in the ebook industry, I figured that Sci-Hub would be just another copyright-flouting, adware-infested Russian website. When I finally took a look, I saw that Sci-Hub is a surprisingly sophisticated website that does a good job of facilitating evasion of research article paywalls. It styles itself as "the first pirate website in the world to provide mass and public access to tens of millions of research papers" and aspires to the righteous liberation of knowledge. David Rosenthal has written a rather comprehensive overview of the controversy surrounding it.

I also observed how easy it would be to track all the downloads being made via Sci-Hub. Today's internet is an environment where someone is tracking everything, and in the case of Sci-Hub, everything is being tracked.

My follow-up article was going to describe all the places that could track downloads via Sci-Hub, and how easy it would be to obtain a list of individuals who had downloaded or uploaded a Sci-Hub article, in violation of the laws currently governing copyright. But Sci-Hub is not doing things in the usual way of pirate websites. They're actually working to improve user privacy. Around the time of my last post, they implemented HTTPS (SSLLabs grade: B) on their website. So instead of inducing users to announce their downloading activity to fellow WiFi users and every ISP on the planet, which is what Sci-Hub was doing in February, today Sci-Hub only registers download activity with Yandex Metrics, the Russian equivalent of Google Analytics.

As long as you trust a Russian internet company to NEVER monetize data about you by selling it to people with more money than good sense, you're not being betrayed by Sci-Hub. Unless the data SOMEHOW falls into the wrong hands.

There are more ways to track Sci-Hub downloads. Many of the downloads facilitated by Sci-Hub are fulfilled by LibGen.io a.k.a. "Library Genesis". LibGen is doing things in the usual way of pirate websites. The LibGen site does NOT support encryption, and it makes money by running advertising served by Google. As a result, Google gets informed of every LibGen download, and if a user has ever registered with Google, then Google knows exactly who they are, what they've downloaded and when they downloaded it. So to get a big list of downloaders, you'd just need to get Google to fork it over.

History suggests that copyright owners will eventually try to sue or otherwise monetize downloaders, and will be successful. In today's ad-network-created Total Information Awareness environment, it might even be a viable business model.

The best solution for a user wanting to download articles privately is to use the Tor Browser and Sci-Hub's onion address, http://scihub22266oqcxt.onion. Onion addresses provide encryption all the way to the destination, and since Sci-Hub uses LibGen's onion address for linking, neither connection can be snooped by the network. Google and Yandex still get informed of all download activity, but the Tor Browser hides the user's identity from them. ...Unless the user slips up and reveals their identity to another web site while using Tor.

Since .onion addresses don't use the DNS system (they won't work outside the Tor network), they won't be affected by legal attacks on the .io registrar. If you use the Sci-Hub.io address in the Tor Browser, your downloads from LibGen.io can be monitored (and perhaps tampered with) by inquisitive exit nodes, so be sure to use the .onion address for privacy and security. I would also recommend using "medium-high" security mode (Onion > Privacy and Security Settings).

It might also be a good idea to use the Tor Browser if you want to read research articles in private, even in journals you've paid for; medical journals seem to be the worst of the bunch with respect to privacy.

If publishers begin to take Sci-Hub countermeasures seriously (Library Loon has a good summary of the horribles to expect) there will be more things to worry about. PDFs can be loaded with privacy attacks in many ways, ranging from embedded security exploits to usage-monitoring links.

This isn't going to be fun for anyone.

Monday, March 7, 2016

Inside a 2016 Big Deal Negotiation...


Dramatis Personae: 
  • A Sales Representative from STM Corporation
  • An Acquisitions Librarian at Prestige University.

STM Corp Sales Rep: It's so nice to see you! We have some exciting news about your Big Deal renewal contract!

PU Acquisitions Librarian: Actually, I'm afraid we have some bad news for you. The Acquisitions Committee has had to make some cutbacks...

Sales Rep: I'm sorry to hear that. In fact, we also have some disturbing data to show you.

Librarian: We've been studying our usage data, and STM Corp's journals aren't seeing the usage we'd expected.

Sales Rep: Funny you should mention that, because STM Corp's Big Deal service has implemented a new "Total Information Awareness" (TIA) system that will answer all your usage questions. The TIA system monitors usage of our articles however they are acquired, and pinpoints the users, whoever and wherever they are. Our customers have been wanting this information for years, and now we can provide it.

Librarian: Now that's interesting. We've been discussing whether that sort of data could improve our services, but as librarians we need to respect the privacy of our users.

Sales Rep: Of course! And as publishers, we need to protect our services from unauthorized access and piracy.

Librarian: ... and our license agreements oblige us to respond to those concerns.

Sales Rep: I'm so glad you understand! But the TIA has exposed some disturbing information about journal usage on your campus.

Librarian: Yes, usage is dropping. That's what we wanted to discuss with you.

Sales Rep: Actually, total usage is increasing. It's just licensed usage that's dropping. Illicit usage is going through the roof!

Librarian: What do you mean?

Sales Rep: Have you heard of a website called Sci-Hub?

Librarian: [suppressing smile] Why yes...

Sales Rep: It seems that students and faculty on your campus have been accessing our articles via Sci-Hub quite a lot, and have been uploading...

Librarian: [starting to worry] We would never condone that! Using articles from Sci-Hub is likely copyright infringement in our jurisdiction. And uploading articles would be a violation of our campus policies!

Sales Rep: Exactly! Which is why we wanted you to see this data.

Librarian: [scanning several pages] But... but this is a list of hundreds of our students and faculty, including some of our most prominent scientists!

Sales Rep: [grinning] ... each of them potentially facing hundreds of thousands of dollars of statutory damages for copyright infringement. Even career-ending litigation. It's such a blessing for you that we would never pursue legal actions that would hurt a good customer like Prestige U. Now about your renewal...

Librarian: Where did this list come from?

Sales Rep: As I said before, STM Corp's "Total Information Awareness" system monitors usage of our articles and pinpoints the users. You said before you had some bad news for us?

Librarian: Umm... we need to make some cutbacks.

Sales Rep: [smug] Well, then you'll be happy to know that we're limiting your big deal price to just a 19% increase over last time.

Librarian: [non-gendered expression of profound despair] ... and our Dean who's been using Sci-Hub?

Sales Rep: Sci-Hub? never heard of it.

Librarian: [resigned] OK, send us the invoice.

[Everything in this drama is fictitious except Sci-Hub and TIA. More next time.]

Thursday, February 18, 2016

The Impact of Bitcoin on Fried Chicken Recipe Archives

Bitcoin is magic. Not the technology, but the hype machine behind it. You've probably heard that Bitcoin technology is going to change everything from banking to fried chicken recipes, from copyright to genome research. Like any good hype machine, Bitcoin's hype whips amazing facts together with plausible nonsense to make a perfect soufflé.

ChickenCoin (Comoros 25 francs 1982)
CC BY-NC-ND  by edelweisscoins
The hype cycle is not Bitcoin's fault. Bitcoin is a masterful and probably successful attack on a problem that many thought was impossible. Bitcoin creates a decentralized, open, transparent and secure way to maintain a monetary transaction ledger. The reason this is so hard is the money. Money creates strong incentives for cheating, hacking, and subverting the ledger, and it's really hard to create an open system that has no centralized authority yet is hard to attack. The genius of Bitcoin is to cleverly exploit that incentive to power its distributed ledger engine, often referred to as "the blockchain". Anyone can create bitcoin for themselves and add it to the ledger, but to get others to accept your addition, you have to play by the rules and do work to add to the ledger. This process is called "mining".

If you're building a fried chicken recipe archive, (let's call it FriChiReciChive) there's good news and bad news. The bad news is that fried chicken is a terrible fuel for a blockchain ledger. No one mines for fried chicken. The good news is that very few nation-states care about your fried chicken recipes. Defending your recipe archive against cheating, hacking, attack and subversion will not require heroic bank-vault tactics.

That's not to say you can't learn from Bitcoin and its blockchain. Bitcoin is cleverly assembled from mature technologies that each seemed impossible not long ago. Your legacy recipe system was probably built in the days of archive horses and database buggies; if you're building a new one it probably would be a good idea to have a set of modern tools.

What are these tools? Here are a few of them:
  1. Big storage. It's easy to forget how much storage is available today. The current size of the bitcoin blockchain, the ledger of every bitcoin transaction ever made, is only 56 GB. That's about one iPhone of storage. The cheapest MacBook Pro comes with 128 GB, which is more than you can imagine. Amazon Web Services offers 500GB of storage for $15 per month. Your job in making FriChiReciChive a reality is to imagine how to make use of all that storage. Suppose the average fried chicken recipe is a thousand words. That's about 10 thousand bytes. With 500GB and a little math, you can store 50 million fried chicken recipes.

    Momofoku Fried Chicken
    CC BY-NC by gandhu
    Having trouble imagining 50 million chicken recipes? You could try a recipe a minute, and it would take you 95 years to try them all. That would be a poor use of your time on earth, and it would be a poor use of 500 GB. So forget about deleting old recipes and start thinking about the FriChi info you could be keeping. How about recording every portion of fried chicken ever prepared, and who ate it? This is possible today. If you're working on an archive of books, you could record every time someone reads a book. Trust me, Amazon is trying to do that.

    Occasionally, you'll hear that you can store information directly in Bitcoin's blockchain. That's possible, but you probably don't want to, because of cost. The current cost of adding a megabyte (about one block) to the Bitcoin blockchain is 25 BTC. At current exchange rates, that's about $10 per kB. That cost is borne by the global Bitcoin system, and it pays for the power consumed by Bitcoin miners. For comparison, AWS will charge you about 36 microcents per year to store a kilobyte. The blockchain does more than S3, but not 30 million times more.

  2. Moroccan Chicken Hash
    CC BY-NC-ND by mmm-yoso
    Cryptographic hashes. Bitcoin makes pervasive use of cryptographic hashes to build and access its blockchain. Cryptographic hashes have some amazing properties. For example, you can use a hash to identify a digital document of any kind. Whether it's a password or a video of a cat, you can compute a hash and use the hash to identify the cat video. Flip a single bit in your fried chicken recipe, and you get a completely different hash. That's why Bitcoin uses hashes to identify transactions. And you can't make the chicken from the hash. Yes, that's why it's called a hash.

    Once you have the hash of a digital object, you've made it tamper-proof. If someone makes a change in your recipe, or your cat video, or your software object, the hash of the thing will be completely different, and you'll be able to tell that it's been tampered with. So you never need to let anyone mess with Granny's fried chicken recipe.

  3. Hash chains. Once you've computed hashes for everything in FriChiReciChive, you probably think, "What good is it to hash a recipe? If someone can change the recipe, someone can change the hash, too." Bitcoin solves this problem by hashing the hashes! Each new data block contains the hash of the previous block. Which contains the hash of the block before that! Etc., etc., all the way back to Satoshi's first block. Of course, this trick of chaining the block hashes was not invented by Bitcoin. And a chain isn't the only way to play this trick. A structure known as a Merkle tree (after its inventor) lays out the hash chains in a tree topology. So by using Merkle trees of fried chicken recipes, you can make the hash of a new recipe depend on every previous recipe. If someone wanted to mess with Granny, they'd have Nana to mess with too, not to mention the Colonel!

  4. Jingu-Galen Ginkgo Festival Fried Chicken
    CC BY-NC-ND by mawari
    Cryptographic signatures. If you're still paying attention, you might be thinking, "Hey! What's to stop Satoshi from rewriting the whole blockchain?" And here's where cryptographic signatures come in. Every blockchain transaction gets signed by someone using public key cryptography. The signature works like a hash, and there are chains of signatures. Each signature can be verified using a public key, but without the owner's private key, any change to the block will cause the verification to fail. The result is that the blockchain can't be altered without the cooperation of everyone who has signed blocks in the chain.

    Here's where FriChiReciChive is much easier to secure than Bitcoin. You don't need a lot of people participating to make the recipe ledger secure enough for the largest fried chicken attack you can imagine.

  5. Peer-to-peer. Perhaps the cleverest part of Bitcoin is the way it arbitrates contention for the privilege of adding blocks: it doles out that privilege to the "miners" who are best at solving a puzzle. Arbitration is needed because otherwise everyone would add blocks to earn themselves bitcoin. The puzzle solving turns out to be expensive because of the energy used to power the puzzle-solving computers. Peer-to-peer networks that share databases don't need this type of arbitration. While contention for blocks in Bitcoin has been constantly rising, contention for slots in distributed fried chicken data storage should remain negligible for the foreseeable future.

  6. Charleston: Husk - Crispy Southern Fried
    Chicken Skins CC BY-NC-ND by wallyg
    Zero-knowledge proofs. Once everyone's fried chicken meals are recorded in FriChiReciChive, you might suppose that fried chicken privacy would be a thing of the past. The truth is much more complicated. Bitcoin offers a non-intuitive mix of anonymity and transparency. All of its transactions are public, but identities can be private. This is possible because in today's world, knowledge can be one-directional and partitioned in bizarre ways, like Voldemort's soul in the horcruxes. For example, you could build FriChiReciChive in such a way that Bob or Alice could ask what they ate on Christmas but Eve would never know it, even with the entire fried chicken ledger.

  7. One more thing. Bitcoin appears to have solved a really hard problem by deploying mature digital tools in clever ways that give participants incentive to make the system work. When you're figuring out how to build FriChiReciChive, or solving whatever problem you might have, chances are you'll have a different set of really hard problems with a different set of participants and incentives. By adding the set of tools I've discussed here, you may be able to devise a new solution to your hard problem.
Bon appetit. Your soufflé will be délicieux!
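The hash machinery in items 2 and 3, and the mining in item 5, are easy to play with. Here's a toy sketch using only Python's standard library: a tamper-evident hash, a hash chain, and a miniature proof-of-work puzzle. The recipes and the four-zero difficulty are made up for illustration; Bitcoin's real target is astronomically harder, and its block format is far more involved.

```python
import hashlib

def h(data: bytes) -> str:
    """SHA-256 hex digest: a fixed-size fingerprint of any digital object."""
    return hashlib.sha256(data).hexdigest()

# 1. Flip a single character in the recipe and the hash changes completely.
granny = b"Granny's fried chicken: flour, paprika, buttermilk, 12 min at 350F"
assert h(granny) != h(granny.replace(b"12 min", b"13 min"))

# 2. A hash chain: each entry's hash covers the previous hash, so
# altering any old recipe changes every hash that comes after it.
def chain(recipes):
    prev = ""
    hashes = []
    for r in recipes:
        prev = h(prev.encode() + r)
        hashes.append(prev)
    return hashes

ledger = chain([granny, b"Nana's fried chicken: cornmeal crust"])

# 3. A toy proof-of-work: search for a nonce that makes the block's hash
# start with four zeros, much as Bitcoin miners search for hashes below
# a (vastly smaller) target.
def mine(block: bytes, difficulty: int = 4) -> int:
    nonce = 0
    while not h(block + str(nonce).encode()).startswith("0" * difficulty):
        nonce += 1
    return nonce

nonce = mine(granny)
print(h(granny + str(nonce).encode())[:4])  # → 0000
```

Note how the incentives differ: finding the nonce takes thousands of hash attempts, but verifying it takes exactly one, which is what lets everyone else cheaply check a miner's work.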

Added note (2/18/2016): Jason Griffey points out that I have conflated Bitcoin and the "Blockchain". That's true, but I think partly justified. First of all, there are a number of "altcoins" that make use of modified versions of the bitcoin software to achieve various goals. The differences are interesting but mostly not relevant to a discussion of what you should learn from Bitcoin. There are rather narrow applications for blockchain-based distributed databases; these are well discussed by Coin Sciences founder Gideon Greenspan.

Wednesday, January 13, 2016

Not using HTTPS on your website is like sending your users outside in just their underwear.

#ALAMW16 exhibits,
viewed from the escalator
This past weekend, I spent 3 full days talking to librarians, publishers, and library vendors about making the switch to HTTPS. The Library Freedom Project staffed a table in the exhibits at the American Library Association Midwinter meeting. We had the best location we could possibly wish for, and we (Alison Macrina, Nima Fatemi, Jennie Rose Halperin and myself) talked ourselves hoarse with everyone interested in privacy in libraries, which seemed to be everyone. We had help from Jason Griffey and Andromeda Yelton (who were next to us, showing off the cutest computers in town for the "Measure the Future" project).

Badass librarians with
framed @snowden tweet.
We had stickers, we had handouts. We had ACLU camera covers and 3D-printed logos. We had new business cards. We had a framed tweet from @Snowden praising @libraryfreedom and "Badass Librarians", who were invited to take selfies.
Apart from helping to raise awareness about internet privacy, talking to lots of real people can help hone a message. Some people didn't really get encryption, and a few were all "What??? Libraries don't use encrypted connections???" By the end of the first day, I had the message down to one sentence:
Not using HTTPS on your website is like sending your users outside in just their underwear.
Because, if you don't use HTTPS, people can see everything, and though there's nothing really WRONG with not wearing clothes outside, we live in a society where wearing them is, by custom, the respectful thing. There are many excellent reasons to preserve our users' privacy, but many of those reasons tend to highlight the needs of other people. The opposing viewpoint is often "Privacy is a thing of the past, just get over it" or "I don't have anything to hide, so why should I work hard so you can keep all your dirty secrets?" But most people don't think wearing clothes is a thing of the past; a connection between encrypted connections and nice clothes just normalizes the normal.

We've previously used the analogy that HTTP is like sending postcards while HTTPS is like sending notes in envelopes. This is a harder analogy to use in a 30 second explainer because you have to make a second argument that websites shouldn't be sent on postcards.

We need to craft better slogans because there's a lot of anti-crypto noise trying to apply an odor of crime and terrorism to good privacy and security practices. The underwear argument is effective against that - I don't know anyone who isn't at least a bit creeped out by the "unclothing" done by the TSA's full body scanners.

No Pants Subway Ride 2015: cosmetic trierarchs CC BY-NC-ND by captin_nod

Maybe instead of green lock icons for HTTPS, browser software could display some sort of flesh-tone nudity icon for unencrypted HTTP connections. That might change user behavior rather quickly. I don't know about you but I never lose sleep over door locks, but I do have nightmares about going out without my pants!

Saturday, January 2, 2016

The Best eBook of 2015: "What is Code?"

When the Compact Disc was being standardized, its capacity was set to accommodate the length of Beethoven's Ninth Symphony, reportedly at the insistence of Sony executive Norio Ohga. In retrospect it seems obvious that a media technology should adapt to media art wherever possible, not vice versa. This is less possible when new media technologies enable new forms of creation, but that's what makes them so exciting.

I've been working primarily on ebooks for the past 5 years, mostly because I'm excited by the new possibilities they enable. I'm impressed - and excited - when ebooks do things that can't be done in print books, partly because ebooks often can't capture the most innovative uses of ink on paper.

Looking back on 2015, there was one ebook more than any other that demonstrated the possibilities of the ebook as an art form, while at the same time being fun, captivating, and awe-inspiring: Paul Ford's What Is Code?

Unfortunately, today's ebook technology standards can't fully accommodate this work. The compact disc of ebooks can store only four and a half movements of Beethoven's Ninth. That makes me sad.

You might ask, how does What Is Code? qualify as an ebook if it doesn't quite work on a kindle or your favorite ebook app? What Is Code? was conceived and developed as an HTML5 web application for Business Week magazine, not with the idea of making an ebook. Nonetheless, What Is Code? uses the forms and structures of traditional books. It has a title page. It has chapters, sections, footnotes and a table of contents which support a linear narrative. It has marginal notes, figures and asides.

Despite its bookishness, it's hard to imagine What Is Code? in print. Although the share buttons and video embeds are mostly adornments for the text, the activity modules are core to the book's exposition. The book is about code, and by bringing code to life, the reader becomes immersed in the book's subject matter. There's a greeter robot that waves and seems to know the reader, showing off the ebook's "intelligence". The "how do you type an 'A'" activity in section 2.1 is a script worth a thousand words, and the "mousemove" activity in section 6.2 is a revelation even to an experienced programmer. If all that weren't enough, there's a random, active background that manages to soothe more than it distracts.

Even with its digital doodads, What Is Code? can be completely self contained and portable. To demonstrate this, I've packaged it up and archived it at Internet Archive; you can download with this link (21MB).  Once you've downloaded it, unzip it and load the "index.html" file into a modern browser. Everything will work fine, even if you turn off your internet connection. What Is Code? will continue to work after Business Week disappears from the internet (or behind the most censorious firewall). [1]

I was curious how much of What Is Code? could be captured in a standard EPUB ebook file. I first tried making an EPUB version 2 file with Calibre. The result was not as lame as I thought it would be, but stripped of interactivity, it seemed like a photocopy of a sticker book - the story's there, but the fun, not so much. Same with the Kindle version.

I hoped that more of the scripts would work in an EPUB 3 file. This is more or less the same as the zipped HTML file I made, but I was unable to get it to display properly in iBooks despite 2 days of trying. Perhaps someone more experienced with javascript in EPUB 3 could manage it. The display in Calibre was a bit better. Readium, the flagship software for EPUB 3, just sat there spinning a cursor. It seems that the scripts handling the vertical swipe convention of the web conflict with the more bookish pagination attempted by iBooks.

The stand-alone HTML zip archive that I made addresses most of the use cases behind EPUB. The text is reflowable and user-adjustable. Elements adjust nicely to the size of the screen from laptop to smartphone. The javascript table of contents works the same as in an ebook reader. Accessibility could be improved, but that's mostly a matter of following accessibility standards that aren't specific to ebooks.

My experimentation with the code behind What Is Code? illustrates another exciting aspect of making books into ebooks. Code and digital text can use open licenses [2] that permit others to use, re-use, and learn from What Is Code?. The entire project archive is hosted on GitHub and to date has been enhanced 671 times by 29 different contributors. There have been 187 forks (like mine) of the project. I think this collaborative creation process will be second nature to the ebook of the future.

There have been a number of proposals for portable HTML5 web archive formats for ebook technology moving forward. Among these are "5DOC" and W3C's "Portable Web Platform". As far as I can tell, these proposals aren't getting much traction or developer support. To succeed, the format has to be very lightweight and useful, or be supported by at least 2 of Amazon, Apple, and Google. I hope someone succeeds at this.

Whatever happens, I hope there's room for Easter Eggs in the future of the ebook. There's a "secret" keyboard combination that triggers a laugh-out-loud Easter Egg in What Is Code? And if you know how to look at What Is Code?'s javascript console, you'll see a message that's an appropriate ending for this post:


Best of 2015, don't you agree?

[1] To get everything in What Is Code? to work without an internet connection, I needed to add a small number of remotely loaded resources and fix a few small javascript bugs specific to loading from a file. (If you must know, pushing to the document.history of a file isn't allowed.) The YouTube embed is blank, of course, and a horrible, gratuitous Flash widget needed to be excised. You can see the details on GitHub.

[2] In this case, the Apache License and the Creative Commons By-NC-ND License.