Never been to CodeSnippets before?

Snippets is a public source code repository. Easily build up your personal collection of code snippets, categorize them with tags / keywords, and share them with the world (or not, you can keep them private!)

About this user

chochem

1 total

Snip - extract a named element from an html file using bash

This is a primitive way of achieving the kind of data extraction that is more commonly associated with true XML for any reasonably modern html file (i.e. that it is well-formed and makes proper use of the id property). The purpose is mainly to get simple, yet fast and efficient text browsing, especially useful for quick look-ups and the like, e.g. dictionaries, thesauruses (thesauri?), encyclopedias etc. Since the data you're interested in is usually put into a specific element, text browsing is often greatly enhanced by extracting the element in question and discarding the rest. You run the script by specifying an element in the standard css way (element#id) and the file which is to be 'parsed', and the script responds by spitting out the element (and only that element) through html2text which does a really nice job of turning html code into legible console text.

EDIT: Added a quick check for the presence/absence of the element type in the line (before the grep operations) - greatly increases speed with large elements like #content on wikipedia.

#! /bin/bash

printhelp () {
echo "snip is a simple bash html cutter that works by extracting a specific element 
from an html file and feeding it to html2text. It presupposes wellformed html
and that you know the kind of element you want and it's id.

Syntax:
snip <element  type>#<element id> <file to parsed>

Example:
snip div#bodyContent /tmp/index.html
"
exit
}

quitter () {
echo "Element id not found. Quitting."; exit
}

[ "$1" = "-h" -o "$1" = "--help" -o "$1" = "" ] && printhelp

elementtype="$(echo $1 | cut -d '#' -f 1)"
id="$(echo $1 | cut -d '#' -f 2)"
htmlfile="$2"
thebegin=$(grep -nioE "id=\"$id\"" $htmlfile | cut -d ':' -f 1)
# echo $thebegin
[ -n "$thebegin" ] || quitter

sed -n ${thebegin}p "$htmlfile" | sed -re "s/^.*id=\"$id\"/<$elementtype id=\"$id\"/g" > /tmp/snipfile
sed -n $(($thebegin+1)),\$p "$htmlfile"  >> /tmp/snipfile

i=0
element=0
cat /tmp/snipfile | while read line; do
	let i++
	if [[ "$line" =~ "$elementtype" ]]; then
		elementbegincount="$(echo $line | grep -io "<$elementtype" | grep -c .)"
		elementendcount="$(echo $line | grep -io "</$elementtype" | grep -c .)"
		element=$(($element+$elementbegincount-$elementendcount))
		if [ "$element" -le 0 ]; then
			sed -n 1,${i}p /tmp/snipfile | html2text
			exit
		fi
	fi
done


As an example of how the script can be put to use, here's my Wikipedia lookup (the script above is referred to as 'snip' here):

#! /bin/bash

useragent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071619 Firefox/3.0.1"
if wget -q -U "$useragent" -O /tmp/wpfile "http://en.wikipedia.org/wiki/Special:Search?search=$*"; then
	clear
	echo "Page downloaded..."
	snip div#content /tmp/wpfile | less
else
	echo "No connection, sorry. Please try again."
fi

1 total