Instapaper an Entire WordPress Blog
I love Instapaper, the web service that converts long-form articles into beautifully typeset documents for offline reading on your iPhone or iPad. Many blogs follow a chronological progression, or tell a linear story that doesn’t go out of date over time (i.e., not a tech blog). I was thinking it would be cool to have an ebook of these blogs, but why bother figuring out the EPUB format when Instapaper is right there?
The problem is that Instapaper assumes you are saving a single page, not blog posts split across many, many pages. So I needed a way to combine them into one monolithic file. I couldn’t find any free programs that do this, so I wrote a little Python script I call Cheeser. Cheeser downloads WordPress blog posts in chronological order. Why is it called Cheeser? From the Urban Dictionary definition of Cheeser:
In the gaming world, a person who repeatedly performs the same moves in fighting games (such as in Soul Caliber, Street Fighter, etc) in order to win.
Cheeser goes through a WordPress site post by post, using the trick of WordPress’s ‘p=100’ URL parameter, where 100 is the post number. It’s repetitive and relies on a lame trick, so it’s a cheeser. The code for Cheeser is at the end of this post. I tested it on Python 2.6. You will need the Beautiful Soup library for Python as well.
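To make the ‘p=’ trick concrete, here is a minimal sketch of how the post URLs are built. It is written for Python 3 rather than the Python 2.6 the script was tested on, and the helper name `post_urls` is made up for illustration. Like the script's `range` call, the end value is not included.

```python
def post_urls(base_url, start_post, end_post):
    """Yield (post_id, url) pairs for post ids start_post .. end_post - 1."""
    for post_id in range(start_post, end_post):
        # WordPress looks a post up by its numeric id via the ?p= parameter
        yield post_id, base_url + str(post_id)


# Hypothetical usage: the first three ids from the script's default settings
urls = list(post_urls("http://www.domodomo.com?p=", 3, 6))
for post_id, url in urls:
    print(post_id, url)
```

Not every id maps to a published post, which is why the script catches `HTTPError` and skips ahead.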
To get Cheeser working, edit the constants at the top of the script to meet your needs. In WordPress templates, usually there is one DIV container that holds the content of each post. The default search settings in Cheeser should find this DIV for most WordPress blogs.
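As a rough illustration of what the three search constants do, the sketch below pulls the first matching DIV out of a page. It swaps Beautiful Soup for Python 3's standard-library `html.parser`, so it is not how Cheeser itself works. The names `FirstMatchExtractor` and `find_post` are hypothetical, and matching any one of the space-separated class names is an assumption of this sketch that may differ from Beautiful Soup's matching.

```python
import html
from html.parser import HTMLParser


class FirstMatchExtractor(HTMLParser):
    """Collect the HTML of the first <element> whose attribute contains value."""

    def __init__(self, element, attr, value):
        super().__init__()
        self.element, self.attr, self.value = element, attr, value
        self.depth = 0     # nesting depth inside the matched element
        self.done = False  # set once the matched element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            values = (dict(attrs).get(self.attr) or "").split()
            if tag != self.element or self.value not in values:
                return
            self.depth = 1
        elif tag == self.element:
            self.depth += 1  # nested element of the same kind
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied through unchanged
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append("</%s>" % tag)
        if tag == self.element:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            # Text arrives unescaped, so escape it again for output
            self.parts.append(html.escape(data, quote=False))


def find_post(html_text, element="div", attr="class", value="post"):
    """Return the first matching element as a string, or None if not found."""
    parser = FirstMatchExtractor(element, attr, value)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts) if parser.done else None


sample = (
    '<html><body><div id="nav">menu</div>'
    '<div class="post hentry"><h2>Title</h2>'
    '<div class="entry"><p>Hello &amp; welcome</p></div></div>'
    '<div class="post">second</div></body></html>'
)
print(find_post(sample))
```

If your blog's theme wraps posts in a different element or class, those three arguments are the only things that should need changing.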
Once you have your huge HTML file, you will need to upload it somewhere on the web where Instapaper can see it. Once it’s there, browse to it, click your Instapaper ‘Read Later’ button, and you’re golden.
Oh, one caveat: Cheeser currently isn’t smart about encoding, and assumes the blog is in UTF-8.
One other note: I know this scraping stuff is considered gauche by some. Of course, you shouldn’t scrape sites to redistribute them, sell them, or make spam sites from them. If I were you, I would take down the HTML file as soon as it has been Instapapered and synced to your device. You don’t want the Google bot finding it. At the very least, use robots.txt to make it unsearchable.
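One possible way around the UTF-8 assumption is to read the charset from the server's Content-Type header and fall back to UTF-8 when none is given. This is only a sketch in Python 3, not something Cheeser does. The helper names are invented for illustration, and it ignores any charset declared inside the HTML itself.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present
    return msg.get_content_charset() or default


def decode_page(raw_bytes, content_type):
    """Decode a downloaded page using the header's charset, else UTF-8."""
    charset = charset_from_content_type(content_type)
    try:
        return raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back to UTF-8, replacing bad bytes
        return raw_bytes.decode("utf-8", errors="replace")


# Hypothetical example: a Latin-1 page that a UTF-8-only reader would garble
text = decode_page("caf\u00e9".encode("latin-1"), "text/html; charset=ISO-8859-1")
print(text)
```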
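If you go the robots.txt route, you can sanity-check your rules before uploading with Python 3's standard `urllib.robotparser`. The directory `/cheeser/` and the example.com URLs below are placeholders, not anything from my setup. Keep in mind that robots.txt has to sit at the root of the site and is only a request that well-behaved crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt hiding one directory from all crawlers
ROBOTS_TXT = """\
User-agent: *
Disallow: /cheeser/
"""


def is_blocked(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text disallows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


file_blocked = is_blocked(ROBOTS_TXT, "http://example.com/cheeser/domodomo.html")
home_blocked = is_blocked(ROBOTS_TXT, "http://example.com/index.html")
print(file_blocked, home_blocked)
```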
Here’s the code.
import time, urllib2
from BeautifulSoup import BeautifulSoup
from urllib2 import HTTPError
# Base URL of the wordpress blog you will be downloading
BLOG_URL = 'http://www.domodomo.com?p='
# Save end result to this HTML file
SAVE_FILE = '/Users/ianfitzpatrick/code/cheeser/domodomo.html'
# Break between downloading each page, in seconds.
SLEEP_LENGTH = 30
# Range of posts to download
START_POST = 3
END_POST = 100
# These three constants are how Cheeser finds the element that contains the blog post
ELEMENT = 'div'
SELECTOR_TYPE = 'class'
SELECTOR_VALUE = 'post'
TITLE = "Domo Domo - Ian Fitzpatrick's Project Log"
HEAD = """
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' dir='ltr' lang='en-US'>
<head profile='http://gmpg.org/xfn/11'>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
"""
HEAD += '<title>' + TITLE + '</title></head><body><h1>' + TITLE + '</h1>'
TAIL = '</body></html>'
def write_output(output):
    outfile = open(SAVE_FILE, 'a')
    outfile.write(output.encode('utf-8'))
    outfile.close()
# Write document header
write_output(HEAD)
for i in range(START_POST, END_POST):
    try:
        page = urllib2.urlopen(BLOG_URL + unicode(i))
        soup = BeautifulSoup(page, fromEncoding='utf-8')
        output = soup.find(ELEMENT, { SELECTOR_TYPE : SELECTOR_VALUE })
        write_output(unicode(output))
        print 'Reading blog post ' + str(i)
    except HTTPError:
        print 'Skipping blog post ' + str(i)
        continue
    time.sleep(SLEEP_LENGTH)
# Write document tail
write_output(TAIL)