Instapaper an Entire WordPress Blog

I love Instapaper, the web service that converts long-form articles into beautifully typeset documents for offline reading on your iPhone or iPad. Many blogs follow a chronological progression, or tell a story in a linear way that doesn’t date over time (i.e. not a tech blog). I was thinking it would be cool to have an ebook of these blogs, but why bother figuring out the EPUB format when Instapaper is right there?

The problem is Instapaper assumes you are saving a single page, not a blog split across many, many pages. So I needed a way to combine the posts into one monolithic file. I couldn’t find any free programs that do this, so I wrote a little Python script I call Cheeser. Cheeser downloads WordPress blog posts in chronological order. Why is it called Cheeser? From the Urban Dictionary definition of “cheeser”:

In the gaming world, a person who repeatedly performs the same moves in fighting games (such as in Soul Caliber, Street Fighter, etc) in order to win.

Cheeser goes through a WordPress site post by post, using WordPress’s ‘?p=100’ URL trick, where 100 is the post ID. So it’s repetitive and it relies on a lame trick: it’s a cheeser. The code for Cheeser is at the end of this post. I tested it on Python 2.6. You will also need the Beautiful Soup library for Python (version 3, which the script imports as BeautifulSoup).
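Here’s a rough sketch of the ‘?p=’ trick on its own, written for Python 3 rather than the Python 2.6 the script was tested on. The helper names post_url and post_urls are made up for illustration; they aren’t part of Cheeser.

```python
from urllib.parse import urlencode


def post_url(blog_root, post_id):
    """Return the ?p=N address for a single post ID."""
    return blog_root.rstrip("/") + "/?" + urlencode({"p": post_id})


def post_urls(blog_root, first_id, last_id):
    """Return the addresses for every ID from first_id to last_id, inclusive."""
    return [post_url(blog_root, i) for i in range(first_id, last_id + 1)]
```

Not every ID in the range will belong to a published post, which is why the script below catches HTTPError and moves on.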

To get Cheeser working, edit the constants at the top of the script to suit your blog. WordPress templates usually wrap the content of each post in a single DIV container. The default search settings in Cheeser (a div with the class ‘post’) should find this DIV for most WordPress blogs.
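To show what that search is doing, here’s a sketch that pulls out the first matching element using only Python 3’s standard library. Cheeser itself uses Beautiful Soup for this; ElementGrabber and extract_element are hypothetical names, and matching any one of several space-separated class names is my own assumption about what is convenient, not a description of how Beautiful Soup matches.

```python
from html.parser import HTMLParser


class ElementGrabber(HTMLParser):
    """Collect the markup of the first element with a given tag and attribute value."""

    def __init__(self, element, selector_type, selector_value):
        # Keep entities such as &amp; intact so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.element = element
        self.selector_type = selector_type
        self.selector_value = selector_value
        self.depth = 0      # how many nested `element` tags are currently open
        self.done = False   # True once the matched element has been closed
        self.parts = []

    def _matches(self, tag, attrs):
        if tag != self.element:
            return False
        value = dict(attrs).get(self.selector_type) or ""
        return self.selector_value in value.split()

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if not self._matches(tag, attrs):
                return
            self.depth = 1
        elif tag == self.element:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> never change the nesting depth.
        if self.depth > 0 and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        self.parts.append("</%s>" % tag)
        if tag == self.element:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0 and not self.done:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth > 0 and not self.done:
            self.parts.append("&#%s;" % name)


def extract_element(html_text, element, selector_type, selector_value):
    """Return the first matching element as a string, or None if there is none."""
    grabber = ElementGrabber(element, selector_type, selector_value)
    grabber.feed(html_text)
    grabber.close()
    return "".join(grabber.parts) if grabber.parts else None
```

Calling extract_element(page_html, 'div', 'class', 'post') mirrors the default ELEMENT, SELECTOR_TYPE and SELECTOR_VALUE settings. If your theme wraps posts in something else, those three values are what you would change.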

Once you have your huge HTML file, you will need to upload it somewhere on the web where Instapaper can see it. Once it’s there, browse to it, click your Instapaper ‘Read Later’ button, and you’re golden.

Oh, one caveat: Cheeser currently isn’t smart about encoding, and assumes the blog is in UTF-8.
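One way to be a little smarter about it would be to read the charset the server declares in its Content-Type header, and only fall back to UTF-8 when none is given. This is a Python 3 sketch and not something Cheeser does; charset_from_content_type is a made-up name.

```python
from email.message import Message


def charset_from_content_type(header, default="utf-8"):
    """Return the charset named in a Content-Type header, or `default` if absent."""
    if not header:
        return default
    msg = Message()
    msg["Content-Type"] = header
    # get_content_charset() returns the charset in lowercase, or None if missing.
    return msg.get_content_charset() or default
```

The result could then be used to decode the downloaded page before searching it. A blog can still declare one encoding and serve another, so treat this as a best guess.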

One other note: I know this scraping stuff is considered gauche by some. Of course, you shouldn’t scrape sites to redistribute them, sell them, or make spam sites from them. If I were you, I would take down the HTML file as soon as you have saved it to Instapaper and synced it to your device. You don’t want the Google bot finding it. At the very least, use robots.txt to keep it out of search results.
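If you go the robots.txt route, you can sanity-check your rules with Python 3’s standard library before uploading anything. The file name domodomo.html and the example.com address below are placeholders, and can_crawl is a made-up helper. Keep in mind that robots.txt is only a request that well-behaved crawlers honor, and that it has to live at the root of the site hosting the file.

```python
from urllib.robotparser import RobotFileParser

# Example rules: ask every crawler to stay away from the generated file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /domodomo.html
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if these robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, can_crawl(ROBOTS_TXT, "Googlebot", "http://example.com/domodomo.html") should come back False, while other pages on the site stay crawlable.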

Here’s the code.

import time, urllib2
from BeautifulSoup import BeautifulSoup
from urllib2 import HTTPError

# Base URL of the wordpress blog you will be downloading
BLOG_URL = 'http://www.domodomo.com/?p='

# Save end result to this HTML file
SAVE_FILE = '/Users/ianfitzpatrick/code/cheeser/domodomo.html'

# Break between downloading each page, in seconds.
SLEEP_LENGTH = 30

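# Range of post IDs to download. The loop uses range(), so it stops
# one short of END_POST: these values fetch posts 3 through 99.
START_POST = 3
END_POST = 100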

#  These three constants are how Cheeser finds the element that contains the blog post
ELEMENT = 'div'
SELECTOR_TYPE = 'class'
SELECTOR_VALUE = 'post'

TITLE = "Domo Domo - Ian Fitzpatrick's Project Log"

HEAD = """
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml' dir='ltr' lang='en-US'>
<head profile='http://gmpg.org/xfn/11'>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
"""

HEAD += '<title>' + TITLE + '</title></head><body><h1>' + TITLE + '</h1>'

TAIL = '</body></html>'

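def write_output(output):
    # The file is opened in append mode, so delete any old copy of
    # SAVE_FILE before re-running or the new output lands after it.
    outfile = open(SAVE_FILE, 'a')
    outfile.write(output.encode('utf-8'))
    outfile.close()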

# Write document header
write_output(HEAD)

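for i in range(START_POST, END_POST):
    try:
        page = urllib2.urlopen(BLOG_URL + unicode(i))
        soup = BeautifulSoup(page, fromEncoding='utf-8')
        output = soup.find(ELEMENT, { SELECTOR_TYPE : SELECTOR_VALUE })

        # find() returns None when the page has no matching element;
        # skip it rather than writing the text "None" into the file.
        if output is None:
            print 'No post content found in blog post ' + str(i)
        else:
            write_output(unicode(output))
            print 'Read blog post ' + str(i)

    except HTTPError:
        # urlopen raises HTTPError (e.g. a 404) for IDs with no page.
        # Note that 'continue' also skips the pause below.
        print 'Skipping blog post ' + str(i)
        continue

    time.sleep(SLEEP_LENGTH)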

# Write document tail
write_output(TAIL)
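
The script above is Python 2. As a rough sketch of how the assembly step might look in Python 3, here is the same head + posts + tail idea with the downloading left out. build_document and save_document are hypothetical names, and fetch_post stands in for whatever function you write to download one post and extract its DIV (returning None when there is nothing to extract).

```python
import html
import time


def build_document(title, post_ids, fetch_post, delay_seconds=30):
    """Join the extracted posts into one HTML page and return it as a string."""
    safe_title = html.escape(title)
    parts = [
        "<!DOCTYPE html>\n<html lang='en-US'>\n<head>\n"
        "<meta charset='UTF-8' />\n"
        "<title>%s</title></head><body><h1>%s</h1>" % (safe_title, safe_title)
    ]
    for post_id in post_ids:
        post_html = fetch_post(post_id)
        if post_html is not None:
            parts.append(post_html)
        # Pause after every request, whether or not a post was found.
        time.sleep(delay_seconds)
    parts.append("</body></html>")
    return "\n".join(parts)


def save_document(path, document):
    """Write the page as UTF-8, replacing any earlier copy."""
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(document)
```

Building the whole page in memory and writing it once avoids the append-mode surprise of ending up with two documents in one file after a second run.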
