Have Some Soup
In my last post I mentioned having a large archive of SOPA-related news stories; that archive is now public. You can download the 4 GB collection of WARCs as a torrent here:
http://turk.floodnetwork.biz/SOPA.torrent
I only captured a single page for each news story. In many cases, this was enough; however, for some stories (a few opinion pieces in the New York Times, for example), it means I missed the second page. Live and learn, I guess.
If you prefer to browse and/or don’t want to deal with expanding and indexing WARCs (totally understandable, as AFAICT no good desktop tool for that exists yet), I have made the archive available on the Web using the Internet Archive’s Wayback software:
http://wayback.at.ninjawedding.org/
The current version of Wayback has fairly poor search functionality (you have to search by URL prefix, which isn’t particularly good for browsing), so I’ve built an index of pages in the archive. You can browse by page title here:
http://www.ninjawedding.org/sopa/stories.html
Each story link will take you to the archived copy of that page on wayback.at.ninjawedding.org.
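For the curious, here is a rough sketch of how a title index like this could be put together. It is not the script I actually used; it assumes you have already pulled (URL, capture timestamp, HTML) tuples out of the WARCs somehow, and the `WAYBACK_BASE` value and the `/timestamp/url` link layout are placeholders that may not match a given Wayback deployment.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical base URL; the path layout of a real Wayback install may differ.
WAYBACK_BASE = "http://wayback.example.invalid/wayback"


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the page title with whitespace collapsed, or None if absent."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def build_index(pages):
    """Build an HTML list from (url, timestamp, html_text) tuples, sorted by title."""
    entries = []
    for url, timestamp, html_text in pages:
        # Fall back to the URL when a page has no usable title.
        title = extract_title(html_text) or url
        link = f"{WAYBACK_BASE}/{timestamp}/{url}"
        entries.append((title, link))
    entries.sort(key=lambda entry: entry[0].lower())
    items = "\n".join(
        f'<li><a href="{escape(link, quote=True)}">{escape(title)}</a></li>'
        for title, link in entries
    )
    return f"<ul>\n{items}\n</ul>"
```

Sorting case-insensitively by title keeps the list easy to scan, and falling back to the URL means pages without a title still show up in the index.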
I’m totally up for better ways to search this archive, like indexing it via Solr or some such, but this was TSTTCPW (the simplest thing that could possibly work).
I’m not sure how long I’ll keep wayback.at.ninjawedding.org up and running. The host is a micro virtual machine on Amazon EC2, so it suffers from nasty CPU cycle starvation on long-running tasks. For some classes of webapps this isn’t a problem, but I have noticed that Wayback can be afflicted by this when running indexing tasks. I would move up to a small or large EC2 instance, but those quickly become very expensive. I’ll look into other hosting arrangements; my host has a feature in beta-testing that looks like it might work.
Anyway, enjoy.