Wednesday, June 22, 2016

Wiley's Fake Journal of Constructive Metaphysics and the War on Automated Downloading


Suppose you were a publisher and you wanted to get the goods on a pirate who was downloading your subscription content and giving it away for free. One thing you could try is to trick the pirate into downloading fake content loaded with spy software and encoded information about how the downloading was being done. Then you could verify when the fake content was turning up on the pirate's website.

This is not an original idea. Back in the glory days of Napster, record companies would try to fill the site with bad files which somehow became infested with malware. Peer-to-peer networks evolved trust mechanisms to foil bad-file strategies.

I had hoped that the emergence of Sci-Hub as an efficient, though unlawful, distributor of scientific articles would not provoke scientific publishers to do things that could tarnish the reputation of their journals. I had hoped that publishers would get their acts together and implement secure websites so that they could be sure that articles were getting to their real subscribers. Silly me.

In a series of tweets, Rik Smith-Unna noted with dismay that the Wiley Online Library was using "fake DOIs" as "trap URLs", URLs in links invisible to human users. A poorly written web spider or crawler would try to follow the link, triggering a revocation of the user's access privileges. (For not obeying the website's terms of service, I suppose.)

Gabriel J. Gardner of Cal State Long Beach has reported his library's receipt of a scary email from Wiley stating:
Wiley has been investigating activity that uses compromised user credentials from institutions to access proxy servers like EZProxy (or, in some cases, other types of proxy) to then access IP-authenticated content from the Wiley Online Library (and other material). We have identified a compromised proxy at your institution as evidenced by the log file below. 
We will need to restrict your institution’s proxy access to Wiley Online Library if we do not receive confirmation that this has been remedied within the next 24 hours.  

I've been seeing these trap URLs in scholarly journals for almost 20 years now. Two years ago they reappeared in ACS journals. They're rarely well thought out, and from talking with publishers who have tried them, they don't work as intended. The Wiley trap URLs exhibit several mistakes in implementation.
  1. Spider trap URLs are useful for detecting bots that ignore robot exclusions. But Wiley's robots.txt document doesn't exclude the trap URLs, so "well-behaved" spiders, such as Googlebot, are also caught. As a result, the fake Wiley page is indexed in Google, and because of the way Google aggregates the weight of links, it's actually a rather highly ranked page.
  2. The download URLs for the fake article don't download anything, but instead return a 500 error code whenever an invalid pseudo-DOI is presented to the site. This is a site misconfiguration that can cause problems with link checking or link verification software.
  3. Because the fake URLs look like Wiley DOIs, they could cause confusion if circulated. Crossref discourages this.
  4. The trap URLs as implemented by Wiley can be used for malicious attacks. With a list of trap URLs, it's trivial to craft an email or a web page that causes the user to request the full list of trap URLs. When the trap URLs trigger service suspensions this gives you the ability to trigger a suspension by sending the target an email.
  5. Apparently, Wiley used a special cookie to block the downloading. Have they not heard of sessions?
  6. The blocks affected both subscription and open-access content. Umm, do I need to explain the concept of "Open Access"?
  7. It's just not a smart idea (even on April Fools!) for a reputable publisher to create fake article pages for "Constructive Metaphysics in Theories of Continental Drift". (Warning: until Wiley realizes their ineptness, this link may trigger unexpected behavior. Use Tor.) It's an insult to both geophysicists and philosophers. And how does the University of Bradford feel about hosting a fictitious Department of Geophysics???
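
The first mistake in the list is easy to see with Python's standard `urllib.robotparser`. This is only a sketch: both robots.txt files and the trap URL are invented for illustration and are not Wiley's actual files or paths. It shows that a polite crawler will fetch a trap URL unless robots.txt tells it not to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that says nothing about the trap path.
ROBOTS_WITHOUT_TRAP_RULE = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical robots.txt that does exclude the trap path.
ROBOTS_WITH_TRAP_RULE = """\
User-agent: *
Disallow: /admin/
Disallow: /doi/trap/
"""

# Invented trap URL; the host and path are assumptions for illustration.
TRAP_URL = "https://publisher.example/doi/trap/10.9999/fake.0001"


def polite_bot_may_fetch(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that obeys robots.txt is allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Without an exclusion, a well-behaved spider walks straight into the trap.
    print(polite_bot_may_fetch(ROBOTS_WITHOUT_TRAP_RULE, TRAP_URL))
    # With an exclusion, only bots that ignore robots.txt would be caught.
    print(polite_bot_may_fetch(ROBOTS_WITH_TRAP_RULE, TRAP_URL))
```

A trap only separates rude bots from polite ones in the second case; in the first, it catches everybody.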


Instead of trap URLs, online businesses that need to detect automated activity have developed elaborate and effective mechanisms to do so. Automated downloads are a billion dollar problem for the advertising industry in particular. So advertisers, advertising networks, and market research companies use coded, downloaded JavaScript and Flash scripts to track and monitor both users and bots. I've written about how these practices are inappropriate in library contexts. In comparison, the trap URLs being deployed by Wiley are sophomoric and a technical embarrassment.

If you visit the Wiley fake article page now, you won't get an article. You get a full dose of monitoring software. Wiley uses a service called Qualtrics Site Intercept to send you "Creatives" if you meet targeting criteria. But you'll also get that if you access the Wiley Online Library's real articles, along with sophisticated trackers from Krux Digital, Grapeshot, Jivox, Omniture, Tradedesk, Videology and Neustar.

Here's the letter I'd like libraries to start sending publishers:
[Library] has been investigating activity that causes spyware from advertising networks to compromise the privacy of IP-authenticated users of the [Publisher] Online Library, a service for which we have been billed [$XXX,XXX]. We have identified numerous third party tracking beacons and monitoring scripts infesting your service as evidenced by the log file below.
We will need to restrict [Publisher]'s access to our payment processes if we do not receive confirmation that this has been remedied within the next 24 hours.  
Notes:
  1. Here's another example of Wiley cutting off access because of fake URL clicking. The implication that Wiley has stopped using trap URLs seems to be false.
  2. Some people have suggested that the "fake DOIs" are damaging the DOI system. Don't worry, they're not real DOIs and have not been registered. The DOI system is robust against this sort of thing; it's still disrespectful.
Update June 23:
  1. Tom Griffin, a spokesman for Wiley, has posted a denial to LIBLICENSE that has a tenuous grip on reality.
  2. Smith-Unna has posted a point-by-point response to Griffin's denial in the form of a gist.

Monday, May 23, 2016

97% of Research Library Searches Leak Privacy... and Other Disappointing Statistics.


...But first, some good news. Among the 123 members of the Association of Research Libraries, there are four libraries with almost secure search services that don't send clickstream data to Amazon, Google, or any advertising network. Let's now sing the praises of libraries at Southern Illinois University, University of Louisville, University of Maryland, and University of New Mexico for their commendable attention to the privacy of their users. And it's no fault of their own that they're not fully secure. SIU fails to earn a green lock badge because of mixed content issues in the CARLI service, while Louisville, Maryland and New Mexico miss out on green locks because of the weak cipher suite used by OCLC on their WorldCat Local installations. These are relatively minor issues that are likely to get addressed without much drama.

Over the weekend, I decided to try to quantify the extent of privacy leakage in public-facing library services by studying the search services of the 123 ARL libraries. These are the best funded and most prestigious libraries in North America, and we should expect them to positively represent libraries. I went to each library's on-line search facility and did a search for a book whose title might suggest to an advertiser that I might be pregnant. (I'm not!) I checked to see whether the default search linked to by the library's home page (as listed on the ARL website) was delivered over a secure connection (HTTPS). I checked for privacy leakage of referer headers from cover images by using Chrome developer tools (the sources tab). I used Ghostery to see if the library's online search used Google Analytics or not. I also noted whether advertising network "web beacons" were placed by the search session.
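
The survey itself was done by hand with browser tools, but the four checks could be scripted along these lines. This is a minimal sketch, not the method actually used: the host suffix lists are illustrative assumptions (a real audit would lean on a maintained tracker list such as the one behind Ghostery), and `audit_search_page` is a hypothetical helper that takes the search page URL plus the list of resource URLs the page requested.

```python
from urllib.parse import urlsplit

# Illustrative host suffixes only; these are assumptions, not the survey's rules.
GOOGLE_ANALYTICS = ("google-analytics.com",)
AMAZON_IMAGES = ("images-amazon.com",)
AD_BEACONS = ("doubleclick.net", "facebook.com", "addthis.com", "sharethis.com")


def host_matches(url, suffixes):
    """True if the URL's host equals, or is a subdomain of, any listed suffix."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in suffixes)


def audit_search_page(search_url, resource_urls):
    """Summarize the four things checked for each library's search page."""
    return {
        "https": urlsplit(search_url).scheme == "https",
        "google_analytics": any(host_matches(u, GOOGLE_ANALYTICS) for u in resource_urls),
        "amazon_covers": any(host_matches(u, AMAZON_IMAGES) for u in resource_urls),
        "ad_beacons": any(host_matches(u, AD_BEACONS) for u in resource_urls),
    }


if __name__ == "__main__":
    # Invented catalog URL and resource requests, for illustration only.
    report = audit_search_page(
        "http://catalog.library.example/search?q=pregnancy",
        [
            "http://ecx.images-amazon.com/images/P/0000000000.jpg",
            "http://www.google-analytics.com/analytics.js",
        ],
    )
    print(report)
```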

72% of the ARL libraries let Google look over the shoulder of every click by every user, by virtue of the pervasive use of Google Analytics. Given the commitment to reader privacy embodied by the American Library Association's code of ethics, I'm surprised this is not more controversial. ALA even sponsors workshops on "Getting Started with Google Analytics". To paraphrase privacy advocate and educator Dorothea Salo, the code of ethics does not say:
We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted, except for Google Analytics.
While it's true that Google has a huge stake in maintaining the trust of users in their handling of personal information, and people seem to trust Google with their most intimate secrets, it's also true that Google's privacy policy puts almost no constraints on what Google (itself) can do with the information they collect. They offer strong commitments not to share personally identifiable information with other entities, but they are free to keep and use personally identifiable information. Google can associate Analytics-tracked library searches with personally identifiable information for any user that has a Google account; libraries cannot be under the illusion that they are uninvolved with this data collection if they benefit from Google Analytics. (Full disclosure: many of the web sites I administer also use Google Analytics.)

80% of the ARL libraries provide their default discovery tools to users without the benefit of a secure connection. This means that any network provider in the path between the library and the user can read and alter the query, and the results returned to the user. It also means that when a user accesses the library over public wifi, such as in a coffee shop, the user's clicks are available for everyone else in the coffee shop to look at, and potentially to tamper with. (The Digital Library Privacy Pledge is not having the effect we had hoped for, at least not yet.)

28% of ARL libraries enrich their catalog displays with cover images sourced from Amazon.com. Because of privacy leakage in referer headers, this means that a user's searches for library books are available for use by Amazon when Amazon wants to sell that user something. It's not clear whether libraries know this is happening, or whether they simply don't realize that their catalog enrichment service uses cover images sourced from Amazon.
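
To make the referer leak concrete, here is a small sketch of what a browser hands to the image host. It assumes the "no-referrer-when-downgrade" behavior that browsers generally applied by default at the time, and that the page sets no referrer policy of its own. The function name and all URLs are invented for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def referer_sent(page_url: str, resource_url: str) -> Optional[str]:
    """Referer value sent with a subresource request, or None if withheld.

    Models only the "no-referrer-when-downgrade" policy (an assumption about
    the browser's default; a page can override it with its own policy).
    """
    page = urlsplit(page_url)
    resource = urlsplit(resource_url)
    # The one case where the browser stays quiet: secure page, insecure resource.
    if page.scheme == "https" and resource.scheme != "https":
        return None
    # Otherwise the full page URL goes along, query string included;
    # only the fragment is stripped.
    return urlunsplit((page.scheme, page.netloc, page.path, page.query, ""))


if __name__ == "__main__":
    # The search terms in the query string travel to the cover-image host.
    print(referer_sent(
        "http://catalog.library.example/search?q=what+to+expect#results",
        "http://images.bookseller.example/cover.jpg",
    ))
```

Note that moving the catalog to HTTPS does not close this leak by itself: under this policy, an HTTPS catalog page loading an HTTPS cover image still sends the full search URL.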

13% of ARL libraries help advertisers (other than Google) target their ads by allowing web beacons to be placed on their catalog web pages. Whether the beacons are from Facebook, DoubleClick, AddThis or ShareThis, advertisers track individual users, often in a personally identifiable way. Searches on these library catalogs are available to the ad networks to maximize the value of advertising placed throughout their networks.

Much of the privacy leakage I found in my survey occurs beyond the control of librarians. There are IT departments, vendor-provided services, and incumbent bureaucracies involved. Important library services appear to be unavailable in secure versions. But specific, serious privacy leakage problems that I've discussed with product managers and CTOs of library automation vendors have gone unfixed for more than a year. I'm getting tired of it.

The results of my quick survey for each of the 123 ARL libraries are available as a Google Sheet. There are bound to be a few errors, and I'd love to be able to make changes as privacy leaks get plugged and websites become secure, so feel free to leave a comment.

Friday, April 1, 2016

April Fools is Cancelled This Year

Since the Onion dropped their fake news format in January in favor of serious reporting, it's become clear that the web's April Fools Day would be very different this year. Why make stuff up when real life is so hard to believe?

All my ideas for satirical blog posts seemed too sadly realistic. After people thought my April 1 post last year was real, all my ideas for fake posts about false privacy and the All Writs Act seemed cruel. I thought about doing something about power inequity in libraries and publishing, but then all my crazy imaginings came true on the ACRL SCHOLCOMM list.

So no April Fools post on Go To Hellman this year. Except for this one, of course.

Monday, March 21, 2016

Sci-Hub, LibGen, and Total Information Awareness


"Good thing downloads NOT trackable!" was one twitter response to my post imagining a skirmish in the imminent scholarly publishing copyright war.

"You wish!" I responded.

Sooner or later, such illusions of privacy will fail spectacularly, and people will get hurt.

I had been in no hurry to see what the Sci-Hub furor was about. After writing frequently about piracy in the ebook industry, I figured that Sci-Hub would be just another copyright-flouting, adware-infested Russian website. When I finally took a look, I saw that Sci-Hub is a surprisingly sophisticated website that does a good job of facilitating evasion of research article paywalls. It styles itself as "the first pirate website in the world to provide mass and public access to tens of millions of research papers" and aspires to the righteous liberation of knowledge. David Rosenthal has written a rather comprehensive overview of the controversy surrounding it.

I also observed how easy it would be to track all the downloads being made via Sci-Hub. Today's internet is an environment where someone is tracking everything, and in the case of Sci-Hub, everything is being tracked.

My follow-up article was going to describe all the places that could track downloads via Sci-Hub, and how easy it would be to obtain a list of individuals who had downloaded or uploaded a Sci-Hub article – in violation of the laws currently governing copyright. But Sci-Hub is not doing things in the usual way of pirate websites. They're actually working to improve user privacy. Around the time of my last post, they implemented HTTPS (SSLLabs grade: B) on their website. So instead of inducing users to announce their downloading activity to fellow WiFi users and every ISP on the planet, which is what Sci-Hub was doing in February, today Sci-Hub only registers download activity with Yandex Metrics, the Russian equivalent of Google Analytics.

As long as you trust a Russian internet company to NEVER monetize data about you by selling it to people with more money than good sense, you're not being betrayed by Sci-Hub. Unless the data SOMEHOW falls into the wrong hands.

There are more ways to track Sci-Hub downloads. Many of the downloads facilitated by Sci-Hub are fulfilled by LibGen.io a.k.a. "Library Genesis". LibGen is doing things in the usual way of pirate websites. The LibGen site does NOT support encryption, and it makes money by running advertising served by Google. As a result, Google gets informed of every LibGen download, and if a user has ever registered with Google, then Google knows exactly who they are, what they've downloaded and when they downloaded it. So to get a big list of downloaders, you'd just need to get Google to fork it over.

History suggests that copyright owners will eventually try to sue or otherwise monetize downloaders, and will be successful. In today's ad-network-created Total Information Awareness environment, it might even be a viable business model.

The best solution for a user wanting to download articles privately is to use the Tor Browser and Sci-Hub's onion address, http://scihub22266oqcxt.onion. Onion addresses provide encryption all the way to the destination, and since Sci-Hub uses LibGen's onion address for linking, neither connection can be snooped by the network. Google and Yandex still get informed of all download activity, but the Tor Browser hides the user's identity from them. ...Unless the user slips up and reveals their identity to another web site while using Tor.

Since .onion addresses don't use the DNS system (they won't work outside the Tor network), they won't be affected by legal attacks on the .io registrar. If you use the Sci-Hub.io address in the Tor Browser, your downloads from LibGen.io can be monitored (and perhaps tampered with) by inquisitive exit nodes, so be sure to use the .onion address for privacy and security. I would also recommend using "medium-high" security mode (Onion > Privacy and Security Settings).
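
The rule of thumb in the last two paragraphs can be written down as a tiny check. This is a simplified sketch with a made-up function name: it only asks whether a Tor exit node could read (and alter) the content of a request, and it ignores everything else that can go wrong.

```python
from urllib.parse import urlsplit


def exit_node_can_read(url: str) -> bool:
    """True if a Tor exit node could see or tamper with this request's content."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.endswith(".onion"):
        # Onion services are reached inside the Tor network: no exit node involved.
        return False
    # For ordinary sites the exit node relays the traffic. HTTPS hides the
    # content from it (though not the hostname); plain HTTP hides nothing.
    return parts.scheme != "https"


if __name__ == "__main__":
    for url in (
        "http://scihub22266oqcxt.onion/",
        "http://libgen.io/",
        "https://sci-hub.io/",
    ):
        print(url, exit_node_can_read(url))
```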

It might also be a good idea to use the Tor Browser if you want to read research articles in private, even in journals you've paid for; medical journals seem to be the worst of the bunch with respect to privacy.

If publishers begin to take Sci-Hub countermeasures seriously (Library Loon has a good summary of the horribles to expect) there will be more things to worry about. PDFs can be loaded with privacy attacks in many ways, ranging from embedded security exploits to usage-monitoring links.

This isn't going to be fun for anyone.

Monday, March 7, 2016

Inside a 2016 Big Deal Negotiation...


Dramatis Personae: 
  • A Sales Representative from STM Corporation
  • An Acquisitions Librarian at Prestige University.

STM Corp Sales Rep: It's so nice to see you! We have some exciting news about your Big Deal renewal contract!

PU Acquisitions Librarian: Actually, I'm afraid we have some bad news for you. The Acquisitions Committee has had to make some cutbacks...

Sales Rep: I'm sorry to hear that. In fact, we also have some disturbing data to show you.

Librarian: We've been studying our usage data, and STM Corp's journals aren't seeing the usage we'd expected.

Sales Rep: Funny you should mention that, because STM Corp's Big Deal service has implemented a new "Total Information Awareness (TIA)" system that will answer all your usage questions. The TIA system monitors usage of our articles however they are acquired, and pinpoints the users, whoever and wherever they are. Our customers have been wanting this information for years, and now we can provide it.

Librarian: Now that's interesting. We've been discussing whether that sort of data could improve our services, but as librarians we need to respect the privacy of our users.

Sales Rep: Of course! And as publishers, we need to protect our services from unauthorized access and piracy.

Librarian: ... and our license agreements oblige us to respond to those concerns.

Sales Rep: I'm so glad you understand! But the TIA has exposed some disturbing information about journal usage on your campus.

Librarian: Yes, usage is dropping. That's what we wanted to discuss with you.

Sales Rep: Actually, total usage is increasing. It's just licensed usage that's dropping. Illicit usage is going through the roof!

Librarian: What do you mean?

Sales Rep: Have you heard of a website called Sci-Hub?

Librarian: [suppressing smile] Why yes...

Sales Rep: It seems that students and faculty on your campus have been accessing our articles via Sci-Hub quite a lot, and have been uploading...

Librarian: [starting to worry] We would never condone that! Using articles from Sci-Hub is likely copyright infringement in our jurisdiction. And uploading articles would be a violation of our campus policies!

Sales Rep: Exactly! Which is why we wanted you to see this data.

Librarian: [scanning several pages] But.. but this is a list of hundreds of our students and faculty, including some of our most prominent scientists!

Sales Rep: [grinning] ... each of them potentially facing hundreds of thousands of dollars of statutory damages for copyright infringement. Even career-ending litigation. It's such a blessing for you that we would never pursue legal actions that would hurt a good customer like Prestige U. Now about your renewal...

Librarian: Where did this list come from?

Sales Rep: As I said before, STM Corp's "Total Information Awareness" system monitors usage of our articles and pinpoints the users. You said before you had some bad news for us?

Librarian: Umm... we need to make some cutbacks.

Sales Rep: [smug] Well, then you'll be happy to know that we're limiting your big deal price to just a 19% increase over last time.

Librarian: [non-gendered expression of profound despair] ... and our Dean who's been using Sci-Hub?

Sales Rep: Sci-Hub? Never heard of it.

Librarian: [resigned] OK, send us the invoice.

[Everything in this drama is fictitious except Sci-Hub and TIA. More next time.]