Friday, April 1, 2016

April Fools is Cancelled This Year

Since the Onion dropped their fake news format in January in favor of serious reporting, it's become clear that the web's April Fools Day would be very different this year. Why make stuff up when real life is so hard to believe?

All my ideas for satirical blog posts seemed too sadly realistic. After people thought my April 1 post last year was real, my ideas for fake posts about false privacy and the All Writs Act seemed cruel. I thought about doing something about power inequity in libraries and publishing, but then all my crazy imaginings came true on the ACRL SCHOLCOMM list.

So no April Fools post on Go To Hellman this year. Except for this one, of course.

Monday, March 21, 2016

Sci-Hub, LibGen, and Total Information Awareness

"Good thing downloads NOT trackable!" was one Twitter response to my post imagining a skirmish in the imminent scholarly publishing copyright war.

"You wish!" I responded.

Sooner or later, such illusions of privacy will fail spectacularly, and people will get hurt.

I had been in no hurry to see what the Sci-Hub furor was about. After writing frequently about piracy in the ebook industry, I figured that Sci-Hub would be just another copyright-flouting, adware-infested Russian website. When I finally took a look, I saw that Sci-Hub is a surprisingly sophisticated website that does a good job of facilitating evasion of research article paywalls. It styles itself as "the first pirate website in the world to provide mass and public access to tens of millions of research papers" and aspires to the righteous liberation of knowledge. David Rosenthal has written a rather comprehensive overview of the controversy surrounding it.

I also observed how easy it would be to track all the downloads being made via Sci-Hub. Today's internet is an environment where someone is tracking everything, and in the case of Sci-Hub, everything is being tracked.

My follow-up article was going to describe all the places that could track downloads via Sci-Hub, and how easy it would be to obtain a list of individuals who had downloaded or uploaded a Sci-Hub article – in violation of the laws currently governing copyright. But Sci-Hub is not doing things in the usual way of pirate websites. They're actually working to improve user privacy. Around the time of my last post, they implemented HTTPS (SSLLabs grade: B) on their website. So instead of inducing users to announce their downloading activity to fellow WiFi users and every ISP on the planet, which is what Sci-Hub was doing in February, today Sci-Hub only registers download activity with Yandex Metrics, the Russian equivalent of Google Analytics.

As long as you trust a Russian internet company to NEVER monetize data about you by selling it to people with more money than good sense, you're not being betrayed by Sci-Hub. Unless the data SOMEHOW falls into the wrong hands.

There are more ways to track Sci-Hub downloads. Many of the downloads facilitated by Sci-Hub are fulfilled by LibGen, a.k.a. "Library Genesis". LibGen is doing things in the usual way of pirate websites. The LibGen site does NOT support encryption, and it makes money by running advertising served by Google. As a result, Google gets informed of every LibGen download, and if a user has ever registered with Google, then Google knows exactly who they are, what they've downloaded and when they downloaded it. So to get a big list of downloaders, you'd just need to get Google to fork it over.

History suggests that copyright owners will eventually try to sue or otherwise monetize downloaders, and will be successful. In today's ad-network-created Total Information Awareness environment, it might even be a viable business model.

The best solution for a user wanting to download articles privately is to use the Tor Browser and Sci-Hub's onion address, http://scihub22266oqcxt.onion. Onion addresses provide encryption all the way to the destination, and since Sci-Hub uses LibGen's onion address for linking, neither connection can be snooped by the network. Google and Yandex still get informed of all download activity, but the Tor Browser hides the user's identity from them. ...Unless the user slips up and reveals their identity to another web site while using Tor.

Since .onion addresses don't use the DNS system (they won't work outside the Tor network), they won't be affected by legal attacks on the .io registrar. If you use the regular, non-onion address in the Tor Browser, your downloads can be monitored (and perhaps tampered with) by inquisitive exit nodes, so be sure to use the .onion address for privacy and security. I would also recommend using "medium-high" security mode (Onion > Privacy and Security Settings).
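
The reasoning in the last few paragraphs can be condensed into a small sketch. It is only a rough classifier under simplifying assumptions: the function name `exposure` and the `example.org` URLs are made up for illustration, and it ignores everything that happens inside the page itself (ad networks, analytics scripts, and so on).

```python
from urllib.parse import urlsplit


def exposure(url):
    """Roughly summarize what the network can learn about a download URL.

    Simplified model of the argument above, not a security audit:
    - .onion addresses are resolved inside Tor, so no DNS lookup is made
      and traffic is encrypted all the way to the destination.
    - Plain http:// traffic to ordinary hosts is readable by anyone on
      the path, including a Tor exit node if Tor is in use.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    is_onion = host.endswith(".onion")
    encrypted = is_onion or parts.scheme == "https"
    return {
        "uses_dns": not is_onion,
        "encrypted_to_destination": encrypted,
        "content_readable_by_exit_node": not encrypted,
    }


if __name__ == "__main__":
    for candidate in (
        "http://scihub22266oqcxt.onion",       # onion address from the post
        "https://example.org/article.pdf",     # hypothetical HTTPS site
        "http://example.org/article.pdf",      # hypothetical unencrypted site
    ):
        print(candidate, exposure(candidate))
```

Even in the best case, the sites you visit still see the request; the sketch only covers what the network in between can observe.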

It might also be a good idea to use the Tor Browser if you want to read research articles in private, even in journals you've paid for; medical journals seem to be the worst of the bunch with respect to privacy.

If publishers begin to take Sci-Hub countermeasures seriously (Library Loon has a good summary of the horribles to expect) there will be more things to worry about. PDFs can be loaded with privacy attacks in many ways, ranging from embedded security exploits to usage-monitoring links.

This isn't going to be fun for anyone.

Monday, March 7, 2016

Inside a 2016 Big Deal Negotiation...

Dramatis Personae: 
  • A Sales Representative from STM Corporation
  • An Acquisitions Librarian at Prestige University.

STM Corp Sales Rep: It's so nice to see you! We have some exciting news about your Big Deal renewal contract!

PU Acquisitions Librarian: Actually, I'm afraid we have some bad news for you. The Acquisitions Committee has had to make some cutbacks...

Sales Rep: I'm sorry to hear that. In fact, we also have some disturbing data to show you.

Librarian: We've been studying our usage data, and STM Corp's journals aren't seeing the usage we'd expected.

Sales Rep: Funny you should mention that, because STM Corp's Big Deal service has implemented a new "Total Information Awareness (TIA)" system that will answer all your usage questions. The TIA system monitors usage of our articles however they are acquired, and pinpoints the users, whoever and wherever they are. Our customers have been wanting this information for years, and now we can provide it.

Librarian: Now that's interesting. We've been discussing whether that sort of data could improve our services, but as librarians we need to respect the privacy of our users.

Sales Rep: Of course! And as publishers, we need to protect our services from unauthorized access and piracy.

Librarian: ... and our license agreements oblige us to respond to those concerns.

Sales Rep: I'm so glad you understand! But the TIA has exposed some disturbing information about journal usage on your campus.

Librarian: Yes, usage is dropping. That's what we wanted to discuss with you.

Sales Rep: Actually, total usage is increasing. It's just licensed usage that's dropping. Illicit usage is going through the roof!

Librarian: What do you mean?

Sales Rep: Have you heard of a website called Sci-Hub?

Librarian: [suppressing smile] Why yes...

Sales Rep: It seems that students and faculty on your campus have been accessing our articles via Sci-Hub quite a lot, and have been uploading...

Librarian: [starting to worry] We would never condone that! Using articles from Sci-Hub is likely copyright infringement in our jurisdiction. And uploading articles would be a violation of our campus policies!

Sales Rep: Exactly! Which is why we wanted you to see this data.

Librarian: [scanning several pages] But.. but this is a list of hundreds of our students and faculty, including some of our most prominent scientists!

Sales Rep: [grinning] ... each of them potentially facing hundreds of thousands of dollars of statutory damages for copyright infringement. Even career-ending litigation. It's such a blessing for you that we would never pursue legal actions that would hurt a good customer like Prestige U. Now about your renewal...

Librarian: Where did this list come from?

Sales Rep: As I said before, STM Corp's "Total Information Awareness" system monitors usage of our articles and pinpoints the users. You said before you had some bad news for us?

Librarian: Umm... we need to make some cutbacks.

Sales Rep: [smug] Well, then you'll be happy to know that we're limiting your big deal price to just a 19% increase over last time.

Librarian: [non-gendered expression of profound despair] ... and our Dean who's been using Sci-Hub?

Sales Rep: Sci-Hub? Never heard of it.

Librarian: [resigned] OK, send us the invoice.

[Everything in this drama is fictitious except Sci-Hub and TIA. More next time.]

Thursday, February 18, 2016

The Impact of Bitcoin on Fried Chicken Recipe Archives

Bitcoin is magic. Not the technology, but the hype machine behind it. You've probably heard that Bitcoin technology is going to change everything from banking to fried chicken recipes, from copyright to genome research. Like any good hype machine, Bitcoin's whips amazing facts together with plausible nonsense to make a perfect soufflé.

ChickenCoin (Comoros 25 francs 1982)
CC BY-NC-ND  by edelweisscoins
The hype cycle is not Bitcoin's fault. Bitcoin is a masterful and probably successful attack on a problem that many thought was impossible. Bitcoin creates a decentralized, open, transparent and secure way to maintain a monetary transaction ledger. The reason this is so hard is because of the money. Money creates strong incentives for cheating, hacking, subverting the ledger, and it's really hard to create an open system that has no centralized authority yet is hard to attack. The genius of bitcoin is to cleverly exploit that incentive to power its distributed ledger engine, often referred to as "the blockchain". Anyone can create bitcoin for themselves and add it to the ledger, but to get others to accept your addition, you have to play by the rules and do work to add to the ledger. This process is called "mining".
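
To make "do work to add to the ledger" a little more concrete, here is a toy proof-of-work sketch. It is a deliberate simplification: real Bitcoin mining hashes a structured block header with double SHA-256 against a network-adjusted target, while the `mine` and `verify` functions below are hypothetical names that just search for a number (a "nonce") that makes a single SHA-256 hash start with enough zero bits.

```python
import hashlib


def _meets_target(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """True if SHA-256(block_data + nonce) has `difficulty_bits` leading zero bits."""
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


def mine(block_data: bytes, difficulty_bits: int = 12) -> int:
    """Brute-force search for a nonce; this loop is the 'work'."""
    nonce = 0
    while not _meets_target(block_data, nonce, difficulty_bits):
        nonce += 1
    return nonce


def verify(block_data: bytes, nonce: int, difficulty_bits: int = 12) -> bool:
    """Checking a claimed nonce takes one hash, however long the search took."""
    return _meets_target(block_data, nonce, difficulty_bits)


if __name__ == "__main__":
    data = b"Alice pays Bob 1 coin"
    nonce = mine(data)
    print("found nonce", nonce, "valid:", verify(data, nonce))
```

The asymmetry is the point: finding a nonce takes thousands of tries on average even at this toy difficulty, but anyone can check the answer with a single hash. Each extra bit of difficulty doubles the expected work.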

If you're building a fried chicken recipe archive, (let's call it FriChiReciChive) there's good news and bad news. The bad news is that fried chicken is a terrible fuel for a blockchain ledger. No one mines for fried chicken. The good news is that very few nation-states care about your fried chicken recipes. Defending your recipe archive against cheating, hacking, attack and subversion will not require heroic bank-vault tactics.

That's not to say you can't learn from Bitcoin and its blockchain. Bitcoin is cleverly assembled from mature technologies that each seemed impossible not long ago. Your legacy recipe system was probably built in the days of archive horses and database buggies; if you're building a new one it probably would be a good idea to have a set of modern tools.

What are these tools? Here are a few of them:
  1. Big storage. It's easy to forget how much storage is available today. The current size of the bitcoin blockchain, the ledger of every bitcoin transaction ever made, is only 56 GB. That's about one iPhone of storage. The cheapest MacBook Pro comes with 128 GB, which is more than you can imagine. Amazon Web Services offers 500GB of storage for $15 per month. Your job in making FriChiReciChive a reality is to imagine how to make use of all that storage. Suppose the average fried chicken recipe is a thousand words. That's about 10 thousand bytes. With 500GB and a little math, you can store 50 million fried chicken recipes.

    Momofoku Fried Chicken
    CC BY-NC by gandhu
    Having trouble imagining 50 million chicken recipes? You could try a recipe a minute and it would take you 95 years to try them all. That would be a poor use of your time on earth, and it would be a poor use of 500 GB. So forget about deleting old recipes and start thinking about the FriChi info you could be keeping. How about recording every portion of fried chicken ever prepared, and who ate it? This is possible today. If you're working on an archive of books, you could record every time someone reads a book. Trust me, Amazon is trying to do that.

    Occasionally, you'll hear that you can store information directly in Bitcoin's blockchain. That's possible, but you probably don't want to do that because of cost. The current cost of adding a MB (about 1 block) to the bitcoin blockchain is 25 BTC. At current exchange rates, that's about $10 per kB. That cost is borne by the global Bitcoin system, and it pays for the power consumed by Bitcoin miners. For comparison, at the $15 per month for 500GB quoted above, AWS will charge you about 0.36 microdollars (36 microcents) per year to store a kilobyte. The blockchain does more than S3, but not 30 million times more.

  2. Cryptographic hashes. Bitcoin makes pervasive use of cryptographic hashes to build and access its blockchain. Cryptographic hashes have some amazing properties. For example, you can use a hash to identify a digital document of any kind. Whether it's a password or a video of a cat, you can compute a hash and use the hash to identify it. Flip a single bit in your fried chicken recipe, and you get a completely different hash. That's why bitcoin uses hashes to identify transactions. And you can't make the chicken from the hash. Yes, that's why it's called a hash.

    Moroccan Chicken Hash
    CC BY-NC-ND by mmm-yoso

    Once you have the hash of a digital object, you've made it tamper-evident. If someone makes a change in your recipe, or your cat video, or your software object, the hash of the thing will be completely different, and you'll be able to tell that it's been tampered with. So you never need to let anyone mess with Granny's fried chicken recipe.

  3. Hash chains. Once you've computed hashes for everything in FriChiReciChive, you probably think, "What good is it to hash a recipe? If someone can change the recipe, someone can change the hash, too." Bitcoin solves this problem by hashing the hashes! Each new data block contains the hash of the previous block. Which contains a hash of the block before that! And so on, all the way back to Satoshi's first block. Of course, this trick of chaining the block hashes was not invented by Bitcoin. And a chain isn't the only way to play this trick. A structure known as a Merkle tree (after its inventor) lays out the hash chains in a tree topology. So by using Merkle trees of fried chicken recipes, you can make the hash of a new recipe depend on every previous recipe. If someone wanted to mess with Granny, they'd have Nana to mess with too, not to mention the Colonel!

  4. Cryptographic signatures. If you're still paying attention, you might be thinking, "Hey! What's to stop Satoshi from rewriting the whole blockchain?" And here's where cryptographic signatures come in. Every blockchain transaction gets signed by someone using public key cryptography. The signature works like a hash, and there are chains of signatures. Each signature can be verified using a public key, but without the owner's private key, any change to the block will cause the verification to fail. The result is that the blockchain can't be altered without the cooperation of everyone who has signed blocks in the chain.

    Jingu-Galen Ginkgo Festival Fried Chicken
    CC BY-NC-ND by mawari

    Here's where FriChiReciChive is much easier to secure than Bitcoin. You don't need a lot of people participating to make the recipe ledger secure enough for the largest fried chicken attack you can imagine.

  5. Peer-to-peer. Perhaps the cleverest part of Bitcoin is the way that it arbitrates contention for the privilege of adding blocks. It uses puzzle solving to dole out this privilege to the "miners" (puzzle-solvers) who are best at solving the puzzle. Arbitration is needed because otherwise everyone could add blocks and award themselves Bitcoin. The puzzle solving turns out to be expensive because of the energy used to power the puzzle-solving computers. Peer-to-peer networks which share databases don't need this type of arbitration. While the contention for blocks in Bitcoin has been constantly rising, the contention for slots in distributed fried chicken data storage should keep dropping for the foreseeable future.

  6. Zero-knowledge proofs. Once everyone's Fried Chicken meals are recorded in FriChiReciChive, you might suppose that fried chicken privacy would be a thing of the past. The truth is much more complicated. Bitcoin offers a non-intuitive mix of anonymity and transparency. All of its transactions are public, but identities can be private. This is possible because in today's world, knowledge can be one-directional and partitioned in bizarre ways, like Voldemort's soul in the horcruxes. For example, you could build FriChiReciChive in such a way that Bob or Alice could ask what they ate on Christmas but Eve would never know it, even with the entire fried chicken ledger.

    Charleston: Husk - Crispy Southern Fried Chicken Skins
    CC BY-NC-ND by wallyg

  7. One more thing. Bitcoin appears to have solved a really hard problem by deploying mature digital tools in clever ways that give participants incentive to make the system work. When you're figuring out how to build FriChiReciChive, or solving whatever problem you might have, chances are you'll have a different set of really hard problems with a different set of participants and incentives. By adding the set of tools I've discussed here, you may be able to devise a new solution to your hard problem.
Bon appetit. Your soufflé will be délicieux!
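
For the curious, tools 2 and 3 (hashes and hash chains) fit in a few lines. This is a minimal sketch, not how Bitcoin itself lays out blocks: the names `entry_hash`, `build_chain` and `first_bad_entry`, the all-zero `GENESIS` value, and the dictionary layout are assumptions made for illustration, and a plain chain is used rather than a Merkle tree.

```python
import hashlib

# Placeholder "previous hash" for the very first entry (an assumption of this sketch).
GENESIS = "0" * 64


def entry_hash(prev_hash: str, recipe: str) -> str:
    """Hash a recipe together with the hash of the entry before it."""
    return hashlib.sha256((prev_hash + "\n" + recipe).encode("utf-8")).hexdigest()


def build_chain(recipes):
    """Build a list of entries where each hash depends on all earlier recipes."""
    chain = []
    prev = GENESIS
    for recipe in recipes:
        current = entry_hash(prev, recipe)
        chain.append({"recipe": recipe, "prev": prev, "hash": current})
        prev = current
    return chain


def first_bad_entry(chain):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        if entry["prev"] != prev or entry_hash(prev, entry["recipe"]) != entry["hash"]:
            return index
        prev = entry["hash"]
    return None


if __name__ == "__main__":
    chain = build_chain(["Granny's recipe", "Nana's recipe", "The Colonel's recipe"])
    print(first_bad_entry(chain))      # None: nothing has been touched
    chain[0]["recipe"] = "Granny's recipe, now with kale"
    print(first_bad_entry(chain))      # 0: the stored hash no longer matches
```

Note that a tamperer who also recomputes Granny's hash only moves the problem one entry down the chain, because Nana's entry still points at the old hash. Rewriting the whole chain consistently is exactly what signatures (tool 4) are there to prevent.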

Added note (2/18/2016): Jason Griffey points out that I have conflated Bitcoin and the "Blockchain". That's true, but I think partly justified. First of all, there are a number of "altcoins" that make use of modified versions of the bitcoin software to achieve various goals. The differences are interesting but mostly not relevant to a discussion of what you should learn from Bitcoin. There are rather narrow applications for blockchain-based distributed databases; these are well discussed by Coin Sciences founder Gideon Greenspan.
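
One more added note, for readers who like to check the math in the "Big storage" item. The sketch below just redoes the arithmetic with the figures quoted there ($15 per month for 500GB, 10 thousand bytes per recipe, $10 per kB on the blockchain). It assumes decimal units (1 kB = 1000 bytes, 1 GB = 10^9 bytes) and a 365-day year; the constant and function names are made up for illustration.

```python
BYTES_PER_RECIPE = 10_000             # "about 10 thousand bytes" per recipe
STORAGE_BYTES = 500 * 10**9           # 500 GB, decimal units assumed
AWS_DOLLARS_PER_MONTH = 15            # quoted price for 500 GB
BLOCKCHAIN_DOLLARS_PER_KB = 10        # quoted cost of blockchain storage
MINUTES_PER_YEAR = 60 * 24 * 365


def recipe_capacity() -> int:
    """How many recipes fit in the quoted storage."""
    return STORAGE_BYTES // BYTES_PER_RECIPE


def years_to_try_all() -> float:
    """Years needed at one recipe per minute, with no breaks."""
    return recipe_capacity() / MINUTES_PER_YEAR


def aws_dollars_per_kb_year() -> float:
    """Yearly AWS cost of storing one kilobyte."""
    kilobytes = STORAGE_BYTES / 1000
    return AWS_DOLLARS_PER_MONTH * 12 / kilobytes


def blockchain_premium() -> float:
    """How many times more the blockchain costs per kilobyte."""
    return BLOCKCHAIN_DOLLARS_PER_KB / aws_dollars_per_kb_year()


if __name__ == "__main__":
    print(recipe_capacity(), "recipes")
    print(round(years_to_try_all(), 1), "years")
    print(aws_dollars_per_kb_year(), "dollars per kB per year")
    print(round(blockchain_premium() / 1e6, 1), "million times more")
```

With these inputs the capacity comes out at 50 million recipes, the tasting marathon at a little over 95 years, AWS at roughly 0.36 millionths of a dollar per kilobyte per year, and the blockchain premium at a bit under 30 million.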

Wednesday, January 13, 2016

Not using HTTPS on your website is like sending your users outside in just their underwear.

#ALAMW16 exhibits,
viewed from the escalator
This past weekend, I spent 3 full days talking to librarians, publishers, and library vendors about making the switch to HTTPS. The Library Freedom Project staffed a table in the exhibits at the American Library Association Midwinter meeting. We had the best location we could possibly wish for, and we (Alison Macrina, Nima Fatemi, Jennie Rose Halperin and myself) talked our voices hoarse with anyone interested in privacy in libraries, which seemed to be everyone. We had help from Jason Griffey and Andromeda Yelton (who were next to us, showing off the cutest computers in town for the "Measure the Future" project).

Badass librarians with
framed @snowden tweet.
We had stickers, we had handouts. We had ACLU camera covers and 3D-printed logos. We had new business cards. We had a framed tweet from @Snowden praising @libraryfreedom and "Badass Librarians", who were invited to take selfies.
Apart from helping to raise awareness about internet privacy, talking to lots of real people can help hone a message. Some people didn't really get encryption, and a few were all "What??? Libraries don't use encrypted connections???" By the end of the first day, I had the message down to the one sentence:
Not using HTTPS on your website is like sending your users outside in just their underwear.
Because, if you don't use HTTPS, people can see everything, and though there's nothing really WRONG with not wearing clothes outside, we live in a society where wearing them is, by custom, the respectful thing. There are many excellent reasons to preserve our users' privacy, but many of the reasons tend to highlight the needs of other people. The opposing viewpoint is often "Privacy is a thing of the past, just get over it" or "I don't have anything to hide, so why work hard so you can keep all your dirty secrets?" But most people don't think wearing clothes is a thing of the past; a connection made between encrypted connections and nice clothes just normalizes the normal.

We've previously used the analogy that HTTP is like sending postcards while HTTPS is like sending notes in envelopes. This is a harder analogy to use in a 30 second explainer because you have to make a second argument that websites shouldn't be sent on postcards.

We need to craft better slogans because there's a lot of anti-crypto noise trying to apply an odor of crime and terrorism to good privacy and security practices. The underwear argument is effective against that - I don't know anyone that isn't at least a bit creeped out by the "unclothing" done by the TSA's full body scanners.

No Pants Subway Ride 2015: cosmetic trierarchs CC BY-NC-ND by captin_nod

Maybe instead of green lock icons for HTTPS, browser software could display some sort of flesh-tone nudity icon for unencrypted HTTP connections. That might change user behavior rather quickly. I don't know about you, but I never lose sleep over door locks. I do have nightmares about going out without my pants!