2008-01-24

Using Google for File Sharing

Disclaimer: I am not responsible for how you use this information. I didn't even create this information, but it is there nonetheless.

2008-01-21

Screen Scraping With BeautifulSoup

This is an article on how to screen scrape the Astronomy Picture of the Day. The script is written in the Python programming language and depends on the BeautifulSoup module.

The Astronomy Picture of the Day is a nifty little service from NASA that publishes a nice space-related picture each day. They have something like 9 years' worth of pictures that you can download, which I will discuss in a future post. For now, let's focus on scraping out one picture at a time and automating that process, so you can save your pointing and clicking for another place and time.

The site for the picture of the day is here: http://apod.nasa.gov/apod/astropix.html. If you click the picture, it will give you the full-sized version, which is the one we want. While you are on that page, make sure to view the page's HTML source, because we will use it in this project. Shown below is a picture of the source HTML to be used for scraping.



The source code for the scraper is shown here:


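#!/usr/bin/python
# Python 2 script; requires the BeautifulSoup (version 3) module.
from BeautifulSoup import BeautifulSoup
import urllib2, os, re
from urlparse import urljoin

# Today's picture page, and the base URL that relative links resolve against.
url='http://apod.nasa.gov/apod/astropix.html'
baseUrl='http://apod.nasa.gov/apod/'

# Fetch the page and parse it.
html=urllib2.urlopen(url).read()
soup=BeautifulSoup(html)

# The full-sized picture is the first link whose href starts with "image".
image=soup.find('a', href=re.compile('^image'))
imageLink=urljoin(baseUrl, image['href'])

# Download with wget; replace <dir> with your own download directory.
os.system("wget -P <dir>/apod/2008 --quiet -U Mozilla/5.0 " + imageLink)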
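
If you want to see the link-matching step on its own, without hitting the network, here is a rough sketch of the same idea. It is not the script above: it assumes Python 3 and uses only the standard library (html.parser and urllib.parse) in place of BeautifulSoup. The SAMPLE_HTML snippet and its file names are made up for illustration, and FirstImageLink and find_image_url are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = 'http://apod.nasa.gov/apod/'

# Hypothetical stand-in for the page source; the file names are invented.
SAMPLE_HTML = '''
<html><body>
<a href="archivepix.html">Archive</a>
<a href="image/0801/example_big.jpg">
<img src="image/0801/example_small.jpg" alt="example"></a>
</body></html>
'''


class FirstImageLink(HTMLParser):
    """Remember the href of the first <a> tag whose href starts with 'image'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Only look at <a> tags, and stop looking once a match is stored.
        if tag != 'a' or self.href is not None:
            return
        href = dict(attrs).get('href')
        if href and href.startswith('image'):
            self.href = href


def find_image_url(html, base_url=BASE_URL):
    """Return the absolute URL of the first image link, or None if there is none."""
    parser = FirstImageLink()
    parser.feed(html)
    if parser.href is None:
        return None
    # Resolve the relative href against the base URL, as the script does.
    return urljoin(base_url, parser.href)


print(find_image_url(SAMPLE_HTML))
```

Returning None when no link matches is a choice made for this sketch, so the caller can skip the download when the page has no image link.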