Quote Database Screen Scraper

I wanted to make fortune(6) cookie files out of the major internet quote databases. So I came up with a Bash script that uses lynx and sed to do it. It includes throttling by default because it’s not nice to suck up all the site’s bandwidth.

#!/bin/bash
# Scrapes a qdb and dumps output to a file
# and tokenizes it into a fortune cookie file
#
LYNX="/usr/bin/lynx"
DEFSLEEP="30"
PNUM_DELIM="%pnum%"

# parse command line args
# look for -s to specify sleep_int period
# look for -p to specify number of pages
# treat everything else like a URL
# %pnum% is page number delimiter
sleep_int=$DEFSLEEP
URLS=()
ARGV=("$@")
for (( thisarg = 0; thisarg < ${#ARGV[*]}; thisarg++ ))
do
	arg="${ARGV[$thisarg]}"
	if [ "$arg" = "-s" ]
	then
		thisarg=$(( $thisarg + 1 ))
		sleep_int=${ARGV[$thisarg]}
	elif [ "$arg" = "-p" ]
	then
		thisarg=$(( $thisarg + 1 ))
		pages=${ARGV[$thisarg]}
	else
		URLS=("${URLS[@]}" "$arg")
	fi
done

LYNXOPTS="-accept_all_cookies -assume_charset=iso-88591 -nolist -dump"

# debug
echo ${URLS[@]}
echo "Sleep is $sleep_int"

for url in "$URLS"
do
	name=`echo $url | cut -d / -f 3`
	mkdir "$name"

	for (( page = 1; page <= $pages; page++ ))
	do
		n_url=`echo $url | sed -e "s/$PNUM_DELIM/$page/g"`
		# debug
		echo "$LYNX $LYNXOPTS $n_url > $name/page_$page"
		$LYNX $LYNXOPTS $n_url > $name/page_$page
		sleep $sleep_int
	done

	# postprocess scraped text files
	cd "$name"

	# remove quote IDs and put in fortune delimiters
	for file in *
	do
		sed -e "s/^\s*#.*$/%/g" < "$file" >> "$name.txt"
	done

	# remove page navigation links
	sed -e "s/^.*[1-9][0-9]* of [1-9][0-9]*.*$//g" < "$name.txt" > "$name"

	# make the database
	strfile "$name"
	rm "$name.txt"

done
Advertisements
  1. Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: