Thursday, December 31, 2015

A New Year's Resolution for Publishers and Libraries: Switch to HTTPS

The endorsement list for the Library Digital Privacy Pledge of 2015-2016 is up and ready to add the name of your organization. We added the "-2016" part, because various things took longer than we thought.

Everything takes longer than you think it will. Web development, business, committee meetings, that blog post. Over the past few months, I've talked to all sorts of people about switching to HTTPS. Librarians, publishers, technologists. Library directors, CEOs, executive editors, engineering managers. Everyone wants to do it, but there are difficulties and complications, many of them small and some of them sticky. It's clear that we all have to work together to make this transition happen.

The list will soon get a lot longer, because a lot of people wanted to meet about it at the ALA Midwinter meeting just 1 week away OMG it's so soon! Getting it done is the perfect New Year's resolution for everyone in the world of libraries.

Here's what you can do:

If you're a Publisher...

... you probably know you need to make the switch, if for no other reason than the extra search engine ranking. By the end of the year, don't be surprised if non-secure websites look unprofessional, which is not what a publisher wants to project.

If you're a Librarian...

... you probably recognize the importance of user privacy, but you're at the mercy of your information and automation suppliers. If those publishers and suppliers haven't signed the pledge, go and ask them why not. And where you control a service, make it secure!

If you're a Library Technology Vendor...

... here's your opportunity to be a hero. You can now integrate security and privacy into your web solution without the customer paying for certificates. So what are you waiting for?

If you're a Library user...

... ask your library if their services are secure and private. Ask publishers if their services are immune to eavesdropping and corruption. If those services are delivered without encryption, the answer is NO!

Everything takes longer than you think it will. Until it happens faster than you can imagine. Kids grow up so fast!

Tuesday, December 22, 2015

xISBN: RIP


When I joined OCLC in 2006 (via acquisition), one thing I was excited about was the opportunity to make innovative uses of OCLC's vast bibliographic database. And there was an existence proof that this could be done, it was a neat little API that had been prototyped in OCLC's Office of Research: xISBN.

xISBN was an example of a microservice- it offered a small piece of functionality and it did it very fast. Throw it an ISBN, and it would give you back a set of related ISBNs. Ten years ago, microservices and mashups were all the rage. So I was delighted when my team was given the job of "productizing" the xISBN service- moving it out of research and into the marketplace.

Last week,  I was sorry to hear about the imminent shutdown of xISBN. But it got me thinking about the limitations of services like xISBN and why no tears need be shed on its passing.

The main function of xISBN was to say "Here's a group of books that are sort of the same as the book you're asking about." That summary instantly tells you why xISBN had to die, because any time a computer tells you something "sort of", it's a latent bug. Because where you draw the line between something that's the same and something that's different is a matter of opinion and depends on the use you want to make of the distinction. For example, if you ask for A Study in Scarlet, you might be interested in a version in Chinese, or you might be interested to get a paperback version, or you might want to get Sherlock Holmes compilations that included A Study in Scarlet. For each  question you want a slightly different answer. If you are a developer needing answers to these questions, you would combine xISBN with other information services to get what you need.

Today we have better ways to approach this sort of problem. Serious developers don't want a microservice, they want richly "Linked Data". In 2015, most of us can all afford our own data crunching big-data-stores-in-the-cloud and we don't need to trust algorithms we can't control. OCLC has been publishing rather nice Linked Data for this purpose. So, if you want all the editions for Cory Doctorow's Homeland, you can "follow your nose" and get all the data you need.

  1. First you look up the isbn at http://www.worldcat.org/isbn/9780765333698
  2. which leads you to http://www.worldcat.org/oclc/795174333.jsonld (containing a few more isbns
  3. you can follow the associated "work" record: http://experiment.worldcat.org/entity/work/data/1172568223
  4. which yields a bunch more ISBNs.

It's a lot messier than xISBN, but that's mostly because the real world is messy. Every application requires a different sort of cleaning up, and it's not all that hard.

If cleaning up the mess seems too intimidating, and you just want light-weight ISBN hints from a convenient microservice, there's always "thingISBN". ThingISBN is a data exhaust stream from the LibraryThing catalog. To be sustainable, microservices like xISBN need to be exhaust streams. The big cost to any data service is maintaining the data, so unless maintaining that data is in the engine block of your website, the added cost won't be worth it. But if you're doing it anyway, dressing the data up as a useful service costs you almost nothing and benefits the environment for everyone. Lets hope that OCLC's Linked Data services are of this sort.

In thinking about how I could make the data exhaust from Unglue.it more ecological, I realized that a microservice connecting ISBNs to free ebook files might be useful. So with a day of work, I added the "Free eBooks by ISBN" endpoint to the Unglue.it api.

xISBN, you lived a good micro-life. Thanks.

Wednesday, November 11, 2015

Using Let's Encrypt to Secure an Elastic Beanstalk Website

Since I've been pushing the library and academic publishing community to implement HTTPS on all their informations services, I was really curious to see how the new Let's Encrypt (LE) certificate authority is really working, with its "general availability" date imminent. My conclusion is that "general availability" will not mean "general usability" right away; its huge impact will take six months to a year to arrive. For now, it's really important for the community to put our developers to work on integrating Let's Encrypt into our digital infrastructure.

I decided to secure the www.gitenberg.org website as my test example. It's still being developed, and it's not quite ready for use, so if I screwed up it would be no disaster. Gitenberg.org is hosted using Elastic Beanstalk (EB) on Amazon Web Services (AWS), which is a popular and modern way to build scaleable web services. The servers that Elastic Beanstalk spins up have to be completely configured in advance- you can't just log in and write some files. And EB does its best to keep servers serving. It's no small matter to shut down a server and run some temporary server, because EB will spin up another server to handle rerouted traffic. These characteristics of  Elastic Beanstalk exposed some of the present shortcomings and future strengths of the Let's Encrypt project.

Here's the mission statement of the project:
Let’s Encrypt is a free, automated, and open certificate authority (CA), run for the public’s benefit.
While most of us focus on the word "free", the more significant word here is "automated":
Automatic: Software running on a web server can interact with Let’s Encrypt to painlessly obtain a certificate, securely configure it for use, and automatically take care of renewal.
Note that the objective is not to make it painless for website administrators to obtain a certificate, but to enable software to get certificates. If the former is what you want, in the near term, then I strongly recommend that you spend some money with one of the established certificate authorities. You'll get a certificate that isn't limited to 90 days, as the LE certificates are, you can get a wildcard certificate, and you'll be following the manual procedure that your existing web server software expects you to be following.

The real payoff for Let's Encrypt will come when your web server applications start expecting you to use the LE methods of obtaining security certificates. Then, the chore of maintaining certificates for secure web servers will disappear, and things will just work. That's an outcome worth waiting for, and worth working towards today.

So here's how I got Let's Encrypt working with Elastic Beanstalk for gitenberg.org.

The key thing to understand here is that before Let's Encrypt can issue me a certificate, I have to prove to them that I really control the hostname that I'm requesting a certificate for. So the Let's Encrypt client has to be given access to a "privileged" port on the host machine designated by DNS for that hostname. Typically, that means I have to have root access to the server in question.

In the future, Amazon should integrate a Let's Encrypt client with their Beanstalk Apache server software so all this is automatic, but for now we have to use the Let's Encrypt "manual mode". In manual mode, the Let's Encrypt client generates a cryptographic "challenge/response", which then needs to be served from the root directory of the gitenberg.org web server.

Even running Let's Encrypt in manual mode required some jumping through hoops. It won't run on Mac OSX. It doesn't yet support the flavor of Linux used by Elastic Beanstalk, so it does no good configuring Elastic Beanstalk to install it there. Instead I used the Let's Encrypt Docker container, which works nicely, and I ran a Docker-Machine inside "virtualbox" on my Mac.

Having configured Docker, I ran
docker run -it --rm -p 443:443 -p 80:80 --name letsencrypt \    
-v "/etc/letsencrypt:/etc/letsencrypt" \
-v "/var/lib/letsencrypt:/var/lib/letsencrypt" \
quay.io/letsencrypt/letsencrypt:latest -a manual -d www.gitenberg.org \
--server https://acme-v01.api.letsencrypt.org/directory auth
 
(the --server option requires your domain to be whitelisted during the beta period.) After paging through some screens asking for my email address and permission to log my IP address, the client responded with
Make sure your web server displays the following content at http://www.gitenberg.org/.well-known/acme-challenge/8wBDbWQIvFi2bmbBScuxg4aZcVbH9e3uNrkC4CutqVQ before continuing:
8wBDbWQIvFi2bmbBScuxg4aZcVbH9e3uNrkC4CutqVQ.hZuATXmlitRphdYPyLoUCaKbvb8a_fe3wVj35ISDR2A
To do this, I configured a virtual directory "/.well-known/acme-challenge/" in the Elastic Beanstalk console with a mapping to a "letsencrypt/" directory in my application (configuration page, software configuration section, static files section.). I then made a file named  "8wBDbWQIvFi2bmbBScuxg4aZcVbH9e3uNrkC4CutqVQ" with the specified content in my letsencrypt directory, committed the change with git, and deployed the application with the elastic beanstalk command line interface. After waiting for the deployment to succeed, I checked that http://www.gitenberg.org/.well-known/acme-challenge/8wBD... responded correctly, and then hit <enter>. (Though the LE client tells you that the MIME type "text/plain" MUST be sent, elastic beanstalk sets no MIME header, which is allowed.)

And SUCCESS!
IMPORTANT NOTES:  - Congratulations! Your certificate and chain have been saved at    /etc/letsencrypt/live/www.gitenberg.org/fullchain.pem. Your cert    will expire on 2016-02-08. To obtain a new version of the    certificate in the future, simply run Let's Encrypt again.
...except since I was running Docker inside virtualbox on my Mac, I had to log into the docker machine and copy three files out of that directory (cert.pem, privkey.pem, and chain.pem). I put them in my local <.elasticbeanstalk> directory. (See this note for a better way to do this.)

The final step was to turn on HTTPS in elastic beanstalk. But before doing that, I had to upload the three files to my AWS Identity and Access Management Console. To do this, I needed to use the aws command line interface, configured with admin privileges. The command was
aws iam upload-server-certificate \ --server-certificate-name gitenberg-le \ --certificate-body file://<.elasticbeanstalk>/cert.pem \ --private-key file://<.elasticbeanstalk>/privkey.pem \ --certificate-chain file://<.elasticbeanstalk>/chain.pem
One more trip to the Elastic Beanstalk configuration console (network/load balancer section), and gitenberg.org was on HTTPS.


Given that my sys-admin skills are rudimentary, the fact that I was able to get Let's Encrypt to work suggests that they've done a pretty good job of making the whole process simple. However, the documentation I needed was non-existent, apparently because the LE developers want to discourage the use of manual mode. Figuring things out required a lot of error-message googling. I hope this post makes it easier for people to get involved to improve that documentation or build support for Let's Encrypt into more server platforms.

(Also, given that my sys-admin skills are rudimentary, there are probably better ways to do what I did, so beware.)

If you use web server software developed by others, NOW is the time to register a feature request. If you are contracting for software or services that include web services, NOW is the time to add a Let's Encrypt requirement into your specifications and contracts. Let's Encrypt is ready for developers today, even if it's not quite ready for rank and file IT administrators.

Update (11/12/2015):
I was alerted to the fact that while https://www.gitenberg.org was working, https://gitenberg.org was failing authentication. So I went back and did it again, this time specifying both hostnames. I had to guess at the correct syntax. I also tested out the suggestion from the support forum to get the certificates saved in may mac's filesystem. (It's worth noting here that the community support forum is an essential and excellent resource for implementers.)

To get the multi-host certificate generated, I used the command:
docker run -it --rm -p 443:443 -p 80:80 --name letsencrypt \
-v "/Users/<my-mac-login>/letsencrypt/etc/letsencrypt:/etc/letsencrypt" \
-v "/Users/<my-mac-login>/letsencrypt/etc/letsencrypt/var/lib/letsencrypt:/var/lib/letsencrypt" \
-v "/Users/<my-mac-login>/letsencrypt/var/log/letsencrypt:/var/log/letsencrypt" \
quay.io/letsencrypt/letsencrypt:latest -a manual \
-d www.gitenberg.org -d gitenberg.org \
--server https://acme-v01.api.letsencrypt.org/directory auth
This time, I had to go through the challenge/response procedure twice, once for each hostname.

With the certs saved to my filesystem, the upload to AWS was easier:
aws iam upload-server-certificate \
--server-certificate-name gitenberg-both \
--certificate-body file:///Users/<my-mac-login>/letsencrypt/etc/letsencrypt/live/www.gitenberg.org/cert.pem \
--private-key file:///Users/<my-mac-login>/letsencrypt/etc/letsencrypt/live/www.gitenberg.org/privkey.pem \
--certificate-chain file:///Users/<my-mac-login>/letsencrypt/etc/letsencrypt/live/www.gitenberg.org/chain.pem
And now, traffic on both hostnames is secure!

Resources I used:

Update 12/6/2015:  Let's Encrypt is now in public beta, anyone can use it. I've added details about creating the virtual directory in response to a question on twitter.

Update 4/21/2016: When it came time for our second renewal, Paul Moss took a look at automating the process. If you're interested in doing this, read his notes.

Update 7/7/2016: Tony Gutierrez has an LE recipe for NodeJS on EB

Thursday, October 22, 2015

This is NOT a Portrait of Mary Astell

Not Mary Astell, by Sir Joshua Reynolds
Ten years ago, the University of Calgary Press published a very fine book by Christine Mason Sutherland called The Eloquence of Mary Astell, which focused on the proto-feminist's contributions as a rhetorician. The cover for the book featured a compelling image using a painted sketch from 1760-1765 by the master English portraitist Sir Joshua Reynolds, currently in Vienna's Kunsthistorisches Museum and known as Bildnisstudie einer jungen Dame (Study for the portrait of a young woman).

Cover images from books circulate widely on the internet. They are featured in online bookstores, they get picked up by search engines. Inevitably, they get re-used and separated from their context. Today (2015) "teh Internetz" firmly believe that the cover image is a portrait of Mary Astell.

For example:

If you look carefully, you'll see that the image most frequently used is the book cover with the title inexpertly removed.

But the painting doesn't depict Mary Astell. It was done 30 years after her death. In her book, Sutherland notes (page xii):
No portrait of her remains, but such evidence as we have suggests that she was not particularly attractive. Lady Mary Wortley Montagu’s granddaughter records her as having been “in outward form [...] rather ill-favoured and forbidding,” though Astell was long past her youth when this observation was made

Wikipedia has successfully resisted the misattribution.

A contributing factor for the confusion about Mary Astell's image is the book's failure to attribute the cover art. Typically a cover description is included in the front matter of the book. According to the Director of the University of Calgary Press, Brian Scrivener, proper attribution would certainly be done in a book produced today. Publishers now recognize that metadata is increasingly the cement that makes books part of the digital environment. Small presses often struggle to bring their back lists up to date, and publishers both large and small have "metadata debt" from past oversights, mergers, reorganizations and lack of resources.

Managing cover art and permissions for included graphics is often an expensive headache for digital books, particularly for Open Access works. I've previously written about the importance of clear licensing statements and front matter in ebooks. It's unfortunate when public domain art is not recognized as such, as in Eloquence, but nobody's perfect.

The good news is that University of Calgary Press has embraced Open Access ebooks in a big way. The Eloquence of Mary Astell and 64 other books are already available, making Calgary one of the world's leading publishers of Open Access ebooks. Twelve more are in the works.

You can find Eloquence at the Calgary University Press website (including the print edition), Unglue.itDOAB, and Internet Archive. Mary Astell's 1706 pamphlet Reflections Upon Marriage can be found at the Internet Archive and at the University of Pennsylvania's Celebration of Women Writers.

And maybe in 2025, teh internetz will know all about Sir Joshua Reynold's famous painting, Not Mary Astell. Happy Open Access Week!

Saturday, September 26, 2015

Weaponization of Library Resources

This post needs a trigger warning. You probably think the title indicates that I've gone off the deep end, or that this is one of my satirical posts. But read on, and I think you'll agree with me, we need to make sure that library resources are not turned into weapons. I'll admit that sounds ludicrous, but it won't after you learn about "The Great Cannon" and "QUANTUM".

But first, some background. Most of China's internet connects to the rest of the world through what's known in the rest of the world as "the Great Firewall of China". Similar to network firewalls used for most corporate intranets, the Great Firewall is used as a tool to control and monitor internet communications in and out of China. Websites that are deemed politically sensitive are blocked from view inside China. This blocking has been used against obscure and prominent websites alike. The New York Times, Google, Facebook and Twitter have all been blocked by the firewall.

When web content is unencrypted, it can be scanned at the firewall for politically sensitive terms such as "June 4th", a reference to the Tiananmen Square protests, and blocked at the webpage level. China is certainly not the only entity that does this; many school systems in the US do the same sort of thing to filter content that's considered inappropriate for children. Part of my motivation for working on the "Library Digital Privacy Pledge" is that I don't think libraries and publishers who provide online content to them should be complicit in government censorship of any kind.

Last March, however China's Great Firewall was associated with an offensive attack. To put it more accurately, software co-located with China's Great Firewall turned innocent users of  unencrypted websites into attack weapons. The targets of the attack were "GreatFire.org", a website that works to provide Chinese netizens a way to evade the surveillance of the Great Firewall, and GitHub.com, the website that hosts code for hundreds of thousand of programmers, including those supporting Greatfire.org.

Here's how the Great Cannon operated  In August, Bill Marczak and co-workers from Berkeley, Princeton and Citizen Lab presented their findings on the Great Cannon at the 5th USENIX Workshop on Free and Open Communications on the Internet.
The Great Cannon acted as a "man-in-the-middle"[*] to intercept the communications of users outside china with servers inside china. Javascripts that collected advertising and usage data for Baidu, the "Chinese Google", were replaced with weaponized javascripts. These javascripts, running in the browsers of internet users outside China, then mounted the denial-of-service attack on Greatfire.org and Github.
China was not the first to weaponized unencrypted internet traffic. Marczak et. al. write:
Our findings in China add another documented case to at least two other known instances of governments tampering with unencrypted Internet traffic to control information or launch attacks—the other two being the use of QUANTUM by the US NSA and UK’s GCHQ.[reference] In addition, product literature from two companies, FinFisher and Hacking Team, indicate that they sell similar “attack from the Internet” tools to governments around the world [reference]. These latest findings emphasize the urgency of replacing legacy web protocols like HTTP with their cryptographically strong counterparts, such as HTTPS.
It's worth thinking about how libraries and the resources they offer might be exploited by a man-in-the-middle attacker. Science journals might be extremely useful in targeting espionage scripts at military facilities, for example. A saboteur might alter reference technical information used by a chemical or pharmaceutical company with potentially disastrous consequences. It's easy to see why any publisher that wants its information to be perceived as reliable has no choice but to start encrypting their services now.

The unencrypted services of public libraries are attractive targets for other sorts of mischief, ironically because of their users' trust in them and because they have a reputation for protecting privacy. Think about how many users would enter their names, phone numbers, and last four digits of their social security numbers if a library website seemed to ask for it. When a website is unencrypted, it's possible for "man-in-the-middle" attacks to insert content into an unencrypted web page coming from a library or other trusted website. An easy way for an attacker to get into position to execute such an attack is to spoof a wifi network, for example in a cafe or other public space, such as a library. It doesn't help if only a website's login is encrypted if an attacker can easily insert content into the unencrypted parts of the website.

To be clear, we don't know that libraries and the type of digital resources they offer are being targeted for weaponization, espionage or other sorts of mischief. Unfortunately, the internet offers a target-rich environment of unencrypted websites.

I believe that libraries and their suppliers need to move swiftly to take the possibility off the table and help lead the way to a more secure digital environment for us all.

Tuesday, September 8, 2015

Hey, Google! Move Blogspot to HTTPS now!

Since I've been supporting a Library Privacy Pledge to implement HTTPS, I've made an inventory of the services I use myself, to make sure that all the services I use will by HTTPS by the end of 2016. The main outlier: THIS BLOG!

This is odd, because Google, the owner of Blogger and Blogspot, has made noise about moving its services to HTTPS, marking HTTP pages as non-secure, and is even giving extra search engine weight to webpages that use HTTPS.

I'd like to nudge Google, now that it's remade its logo and everything, to get their act together on providing secure service for Blogger. So I set the "description" of my blog to "Move Blogspot to HTTPS NOW." If you have a blog on Blogspot, you can do the same. Go to your control panel and click settings. "description" is the second setting at the top. Depending on the design of your page, it will look like this:


So Google, if you want to avoid a devastating loss of traffic when I move Go-To-Hellman to another platform on January 1, 2017, you better get cracking. Consider yourself warned.

Update 10/26/2015. The merciless pressure from the Go-To-Hellman blog worked. Blogger now supports HTTPS.

Sunday, August 30, 2015

Update on the Library Privacy Pledge

The Library Privacy Pledge of 2015, which I wrote about previously, has been finalized. We got a lot of good feedback, and the big changes have focused on the schedule.

Now, any library , organization or company that signs the pledge will have 6 months to implement HTTPS from the effective date of their signature. This should give everyone plenty of margin to do a good job on the implementation.

We pushed back our launch date to the first week of November December. That's when we'll announce the list of "charter signatories". If you want your library, company or organization to be included in the charter signatory list, please send an e-mail to pledge@libraryfreedomproject.org.

The Let's Encrypt project will be launching soon. They are just one certificate authority that can help with HTTPS implementation.

I think this is an very important step for the library information community to take, together. Let's make it happen.

Here's the finalized pledge:

The Library Freedom Project is inviting the library community - libraries, vendors that serve libraries, and membership organizations - to sign the "Library Digital Privacy Pledge of 2015". For this first pledge, we're focusing on the use of HTTPS to deliver library services and the information resources offered by libraries. It’s just a first step: HTTPS is a privacy prerequisite, not a privacy solution. Building a culture of library digital privacy will not end with this 2015 pledge, but committing to this first modest step together will begin a process that won't turn back.  We aim to gather momentum and raise awareness with this pledge; and will develop similar pledges in the future as appropriate to advance digital privacy practices for library patrons.

We focus on HTTPS as a first step because of its timeliness. The Let's Encrypt initiative of the Electronic Frontier Foundation will soon launch a new certificate infrastructure that will remove much of the cost and technical difficulty involved in the implementation of HTTPS, with general availability scheduled for September. Due to a heightened concern about digital surveillance, many prominent internet companies, such as Google, Twitter, and Facebook, have moved their services exclusively to HTTPS rather than relying on unencrypted HTTP connections. The White House has issued a directive that all government websites must move their services to HTTPS by the end of 2016. We believe that libraries must also make this change, lest they be viewed as technology and privacy laggards, and dishonor their proud history of protecting reader privacy.

The 3rd article of the American Library Association Code of Ethics sets a broad objective:

We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.
It's not always clear how to interpret this broad mandate, especially when everything is done on the internet. However, one principle of implementation should be clear and uncontroversial:
Library services and resources should be delivered, whenever practical, over channels that are immune to eavesdropping.

The current best practice dictated by this principle is as following:
Libraries and vendors that serve libraries and library patrons, should require HTTPS for all services and resources delivered via the web.

The Pledge for Libraries:

1. We will make every effort to ensure that web services and information resources under direct control of our library will use HTTPS within six months. [ dated______ ]

2. Starting in 2016, our library will assure that any new or renewed contracts for web services or information resources will require support for HTTPS by the end of 2016.

The Pledge for Service Providers (Publishers and Vendors):

1. We will make every effort to ensure that all web services that we (the signatories) offer to libraries will enable HTTPS within six months. [ dated______ ]

2. All web services that we (the signatories) offer to libraries will default to HTTPS by the end of 2016.

The Pledge for Membership Organizations:

1. We will make every effort to ensure that all web services that our organization directly control will use HTTPS within six months. [ dated______ ]

2. We encourage our members to support and sign the appropriate version of the pledge.

There's a FAQ available, too. The pledge is now posted on the Library Freedom Project website. (updated 9/14/2015)


Sunday, July 26, 2015

Library Privacy and the Freedom Not To Read

One of the most difficult privacy conundrums facing libraries today is how to deal with the data that their patrons generate in the course of using digital services. Commercial information services typically track usage in detail, keep the data indefinitely, and regard the data as a valuable asset. Data is used to make many improvements, often to personalize the service to best meet the needs of the user. User data can also be monetized; as I've written here before, many companies make money by providing web services in exchange for the opportunity to track users and help advertisers target them.

A Maginot Line fortification. Photo from the US Army.
The downside to data collection is its impact on user privacy, something that libraries have a history of defending, even at the risk of imprisonment. Since the Patriot Act, many librarians have believed that the best way to defend user privacy against legally sanctioned intrusion is to avoid collecting any sensitive data. But as libraries move onto the web, that defense seems more and more like a Maginot Line, impregnable, but easy to get around. (I've written about an effort to shore up some weak points in library privacy defenses.)

At the same time, "big data" has clouded the picture of what constitutes sensitive data. The correlation of digital library use with web activity outside the library can impact privacy in ways that never would occur in a physical library. For example, I've found that many libraries unknowingly use Amazon cover images to enrich their online catalogs, so that even a user who is completely anonymous to the library ends up letting Amazon know what books they're searching for.

Recently, I've been serving on the Steering Committee of an initiative of NISO to try to establish a set of principles that libraries, providers of services to libraries, and publishers can use to support privacy patron privacy. We held an in-person meeting in San Francisco at the end of July. There was solid support from libraries, publishers and service companies for improving reader privacy, but some issues were harder than others. The issues around data collection and use attracted the widest divergence in opinion.

One approach that was discussed centered on classifying different types of data depending on the extent to which they impact user privacy. This also the approach taken by most laws governing privacy of library records. They mostly apply only to "Personally Identifiable Information" (PII), which usually would mean a person's name, address, phone number, etc., but sometimes is defined to include the user's IP address. While it's important to protect this type of information, in practice this usually means that less personal information lacks any protection at all.

I find that the data classification approach is another Maginot privacy line. It encourages the assumption that collection of demographics data – age, gender, race, religion, education, profession, even sexual orientation – is fair game for libraries and participants in the library ecosystem. I raised some eyebrows when I suggested that demographic groups might deserve a level of privacy protection in libraries, just as individuals do.

OCLC's Andrew Pace gave an example that brought this home for us all. When he worked as a librarian at NC State, he tracked usage of the books and other materials in the collection. Every library needs to do this for many purposes. He noticed that materials placed on reserve for certain classes received little or no usage, and he thought that faculty shouldn't be putting so many things on reserve, effectively preventing students not taking the class from using these materials. And so he started providing usage reports to the faculty.

In retrospect, Andrew pointed out that, without thinking much about it, he might have violated the privacy of students by informing their teachers that that they weren't reading the assigned materials. After all, if a library wants to protect a user's right to read, they also have to protect the right not to read. Nobody's personally identifiable information had been exposed, but the combination of library data – a list of books that hadn't circulated – with some non-library data – the list of students enrolled in a class and the list of assigned reading – had intersected in a way that exposed individual reading behavior.

What this example illustrates is that libraries MUST collect at least SOME data that impinges on reader privacy. If reader privacy is to be protected, a "privacy impact assessment" must be made on almost all uses of that data.  In today's environment, users expect that their data signals will be listened to and their expressed needs will be accommodated. Given these expectations, building privacy in libraries is going to require a lot of work and a lot of thought.

Sunday, July 12, 2015

The Library Digital Privacy Pledge

I've been busy since my last post! We've created the Free Ebook Foundation, which will be the home for Unglue.it and GITenberg. I helped with the NISO "Consensus Framework to Support Patron Privacy in Digital Library and Information Systems", which I'll write more about soon. And some coding.


But I've also become a volunteer for the Library Freedom Project, run by radical librarian Alison Macrina. The project I'm working on is the "Library Digital Privacy Pledge."

The Library Digital Privacy Pledge is a result of discussions on several listservs about how libraries and the many organizations that serve libraries could work cooperatively to (putting it bluntly) start getting our shit together with regard to patron privacy.

I've talked to a lot of people about privacy in digital libraries, and there's remarkable unity about its importance. There's also a lot of confusion about some basic web privacy technology, like HTTPS. My view is that HTTPS sets a foundation for all the other privacy work that needs doing in libraries.

Someone asked me why I'm so passionate about working on this. After a bit of thought, I realized that the one thing that gives me the most satisfaction in my professional life is eliminating bugs. I hate bugs. Using HTTP for library services is a bug.

The draft of the Library Digital Privacy Pledge is open for comment and improvement  for a few more weeks. We want all sorts of stakeholders to have  a chance to improve it. The current text (July 12, 2015) is as follows: 

The Library Digital Privacy Pledge of 2015

The Library Freedom Project is inviting the library community - libraries, vendors that serve libraries, and membership organizations - to sign the "Library Digital Privacy Pledge of 2015". For this first pledge, we're focusing on the use of HTTPS to deliver library services and the information resources offered by libraries. Building a culture of library digital privacy will not end with this 2015 pledge, but committing to this first modest step together will begin a process that won't turn back.  We aim to gather momentum and raise awareness with this pledge; and will develop similar pledges in the future as appropriate to advance digital privacy practices for library patrons.
We focus on HTTPS as a first step because of its timeliness. At the end of July the Let's Encrypt initiative of the Electronic Frontier Foundation will launch a new certificate infrastructure that will remove much of the cost and technical difficulty involved in the implementation of HTTPS, with general availability scheduled for September. Due to a heightened concern about digital surveillance, many prominent internet companies, such as Google, Twitter, and Facebook, have moved their services exclusively to HTTPS rather than relying on unencrypted HTTP connections. The White House has issued a directive that all government websites must move their services to HTTPS by the end of 2016. We believe that libraries must also make this change, lest they be viewed as technology and privacy laggards, and dishonor their proud history of protecting reader privacy.
The 3rd article of the American Library Association Code of Ethics sets a broad objective:
We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.


It's not always clear how to interpret this broad mandate, especially when the everything is done on the internet. However, one principle of implementation should be clear and uncontroversial:
Library services and resources should be delivered, whenever practical, over channels that are immune to eavesdropping.
The current best practice dictated by this principle is as following:
Libraries and vendors that serve libraries and library patrons, should require HTTPS for all services and resources delivered via the web.

The Pledge for Libraries:

1. All web services and resources that this library directly controls will use HTTPS by the end of 2015.
2. Starting in 2016, this library will not sign or renew any contracts for web services or information resources that do not commit to use HTTPS by the end of 2016.

The Pledge for Service Providers (Publishers and Vendors):

1. All web services that we (the signatories) control will enable HTTPS by the end of 2015.
2. All web services that we (the signatories) offer will require HTTPS by the end of 2016.

The Pledge for Membership Organizations:

1. All web services that this organization directly controls will use HTTPS by the end of 2015.
2. We encourage our members to support and sign the appropriate version of the pledge.

Schedule:

This document will be open for discussion and modification until finalized by July 27, 2015. The finalized pledge will be published on the website of the Library Freedom Project. We expect a number of discussions to take place at the Annual Conference of the American Library Association and associated meetings.
The Library Freedom Project will broadly solicit signatures from libraries, vendors and publishers.
In September, in coordination with the Let's Encrypt project, the list of charter signatories will be made announced and broadly publicized to popular media.

FAQ

Q: What is HTTPS and what do we need to implement it?
A: When you use the web, your browser software communicates with a server computer through the internet. The messages back and forth pass through a series of computers (network nodes) that work together to pass messages. Depending on where you and the server are, there might be 5 computers in that chain, or there might be 50, each possibly owned by a different service provider. When a website uses HTTP, the content of these messages is open to inspection by each intermediate computer- like a postcard sent through the postal system, as well as by any other computer that shares a network those computers. If you’re connecting to the internet over wifi in a coffee shop, everyone else in the coffee shop can see the messages, too.


When a website uses HTTPS, the messages between your browser software and the server are encrypted so that none of the intermediate  network nodes can see the content of the messages. It’s like sending sealed envelopes through the postal system.


Your web site and other library services may be sending sensitive patron data across the internet: often bar codes and passwords, but sometimes also catalog searches, patron names, contact information, and reading records. This kind of data ought to be inside a sealed envelope, not exposed on a postcard.


Most web server software supports HTTPS, but to implement it, you’ll need to get a certificate signed by a recognized authority. The certificate is used to verify that you are who you say you are. Certificates have added cost to HTTPS, but the Electronic Frontier Foundation is implementing a certificate authority that will give out certificates at no charge. To find out more, go to Let’s Encrypt.


Q: Why the focus on HTTPS?
A: We think this issue should not be controversial and is relatively easy to explain. Libraries understand that circulation information can’t be sent to patron on postcards. Publishers don’t want their content scooped up by unauthorized entities. Service providers don’t want to betray the trust of their customers.
Q. How can my library/organization/company add our names to the list of signatories?
A. Email us at pledge@libraryfreedomproject.org. Please give us contact info so we can verify your participation.
Q. Is this the same as HTTPS Everywhere?
A. No, that's a browser plug-in which enforces use of HTTPS.
Q. My Library won't be able to meet the implementation deadline. Can we add our name to the list once we've completed implementation?
A. Yes.
Q. A local school uses an internet filter that blocks https websites to meet legal requirements. Can we sign the pledge and continue to serve them?
A. Most of the filtering solutions include options that will whitelist important services. Work with the school in question to implement a work-around.


Q. What else can I read about libraries using HTTPS?
A. The Electronic Frontier Foundation has published What Every Librarian Needs to Know About HTTPS
Q. How do I know if I have implemented HTTPS correctly?
A. The developers behind the “Let’s Encrypt” initiative are ensuring that best practices are used in setting up the HTTPS configuration.  If you are deploying HTTPS on your own, we encourage you to use the Qualys SSL Labs SSL Server Test service to review the performance of your implementation.  You should strive for at least a “B” rating with no major security vulnerabilities identified in the scan.


Q. Our library subscribes to over 200 databases only a fraction of them currently delivered via https. We might be able to say we will not sign new contracts but the renewal requirement could be difficult for an academic library like ours. Can we sign the pledge?
A. No one is going to penalize libraries that aren’t able to comply 100% with their pledge. One way to satisfy the ethical imperatives of the pledge would be to clearly label for users the small number of insecure library resources that remain after 2016 as being subject to surveillance.


Q. I/We can contribute to the effort in a way that isn’t covered well by the pledges. Can I add another pledge?

A. We want to keep this simple, but we welcome your support. email us with your individualized statement, and we may include it on our website when signatories are announced.

Wednesday, June 10, 2015

Protect Reader Privacy with Referrer Meta Tags

Back when the web was new, it was fun to watch a website monitor and see the hits come in. The IP address told you the location of the user, and if you turned on the referer header display, you could see what the user had been reading just before.  There was a group of scientists in Poland who'd be on my site regularly- I reported the latest news on nitride semiconductors, and my site was free. Every day around the same time, one of the Poles would check my site, and I could tell he had a bunch of sites he'd look at in order. My site came right after a Russian web site devoted to photographs of unclothed women.

The original idea behind the HTTP referer header (yes, that's how the header is spelled) was that webmasters like me needed it to help other webmasters fix hyperlinks. Or at least that was the rationalization. The real reason for sending the referer was to feed webmaster narcissism. We wanted to know who was linking to our site, because those links were our pats on the back. They told us about other sites that liked us. That was fun. (Still true today!)

The fact that my nitride semiconductor website ranked up there with naked Russian women amused me; reader privacy issues didn't bother me because the Polish scientist's habits were safe with me.


Twenty years later, the referer header seems like a complete privacy disaster. Modern web sites use resources from all over the web, and a referer header, including the complete URL of the referring web page, is sent with every request for those resources. The referer header can send your complete web browsing log to websites that you didn't know existed.

Privacy leakage via the referrer header plagues even websites that ostensibly believe in protecting user privacy, such as those produced by or serving libraries. For example, a request to the WorldCat page for What you can expect when you're expecting  results in the transmission of referer headers containing the user's request to the following hosts:
  • http://ajax.googleapis.com
  • http://www.google.com (with tracking cookies)
  • http://s7.addthis.com (with tracking cookies)
  • http://recommender.bibtip.de
None of the resources requested from these third parties actually need to know what page the user is viewing, but WorldCat causes that information to be sent anyway. In principle, this could allow advertising networks to begin marketing diapers to carefully targeted WorldCat users. (I've written about AddThis and how they sell data about you to advertising networks.)

It turns out there's an easy way to plug this privacy leak in HTML5. It's called the referrer meta tag. (Yes, that's also spelled correctly.)

The referrer meta tag is put in the head section of an HTML5 web page. It allows the web page to control the referer headers sent by the user's browser. It looks like this:

<meta name="referrer" content="origin" />

If this one line were used on WorldCat, only the fact that the user is looking a WorldCat page would be sent to Google, AddThis, and BibTip. This is reasonable, library patrons typically don't expect their visits to a library to be private; they do expect that what they read there should be private.

Because use of third party resources is often necessary, most library websites leak lots of privacy in referer headers. The meta referrer policy is a simple way to stop it. You may well ask why this isn't already standard practice. I think it's mostly lack of awareness. Until very recently, I had no idea that this worked so well. That's because it's taken a long time for browser vendors to add support. Although Chrome and Safari have been supporting the referrer meta tag for more than two years; Firefox only added it in January of 2015. Internet Explorer will support it with the Windows 10 release this summer. Privacy will still leak for users with older browser software, but this problem will gradually go away.

There are 4 options for the meta referrer tag, in addition to the "origin" policy. The origin policy sends only the host name for the originating page.

For the strictest privacy, use

<meta name="referrer" content="no-referrer" />

If you use this sitting, other websites won't know you're linking to them, which can be a disadvantage in some situations. If the web page links to resources that still use the archaic "referer authentication", they'll break.

 The prevailing default policy for most browsers is equivalent to

<meta name="referrer" content="no-referrer-when-downgrade" />

"downgrade" here refers to http links in https pages.

If you need the referer for your own website but don't want other sites to see it you can use

<meta name="referrer" content="origin-when-cross-origin" />

Finally, if you want the user's browser to send the full referrer, no matter what, and experience the thrills of privacy brinksmanship, you can set

<meta name="referrer" content="unsafe-url" />

Widespread deployment of the referrer meta tag would be a big boost for reader privacy all over the web. It's easy to implement, has little downside, and is widely deployable. So let's get started!

Links: