
cmdparser - parse command line arguments

Author: jv
License: The MIT License, Copyright (c) 2008 jv
Description: a basic regex-based command line parser for use in bash scripts (Mac OS X); an alternative to the builtin getopts command (cf. help getopts); use at your own risk
Usage: /path/to/script_with_cmdparser -a -b -c -f file
Related links: Process positional parameters non-destructively in Bash and ws - search the web from the command line (an example of using cmdparser)
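
For comparison, here is roughly what the builtin getopts mentioned in the description offers. It handles single-character options only, which is the gap cmdparser tries to fill. This is a minimal sketch, not part of cmdparser: the function name parse_with_getopts is hypothetical and the option letters are taken from the usage line above.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of cmdparser.
parse_with_getopts() {
  local OPTIND=1 opt          # reset getopts state for each call
  local a=0 b=0 c=0 file=""
  # the leading ":" selects silent error reporting; "f:" means -f takes an argument
  while getopts ":abcf:" opt; do
    case "$opt" in
      a) a=1 ;;
      b) b=1 ;;
      c) c=1 ;;
      f) file="$OPTARG" ;;
      :) printf 'option -%s requires an argument\n' "$OPTARG" >&2; return 1 ;;
      \?) printf 'unknown option: -%s\n' "$OPTARG" >&2; return 1 ;;
    esac
  done
  shift $((OPTIND - 1))       # drop the options that were parsed
  printf 'a=%s b=%s c=%s file=%s rest=%s\n' "$a" "$b" "$c" "$file" "$*"
}

parse_with_getopts -a -b -c -f file name1 name2
```

Multi-character options such as -cc or -flag1=arg cannot be expressed in a getopts option string, so they still need a parser like the one below.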

#!/bin/bash

export PATH=/usr/bin:/bin:/usr/sbin:/sbin
export IFS=$' \t\n'

# create a fake command line
set -- -abcc -c -zz -flag1="" -flag2=arg -flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' -flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces  / * + ` \ ! ' -flag9 ~/Desktop/*.txt filename1 filename2 filename3

#set -- -abcc -c -zz -flag1="" -flag2=arg -flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' -flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces / * + ` \ ! ' -flag9 '~/Desktop/*.txt' filename1 filename2 filename3


printf "%s\n" "$@" | nl
#printf "%s" "$@"$'\n' | nl
#printf "%s" "${@/%/ }" | nl


: <<-'COMMENT'

# copy & paste examples for the command line

echo "filename1" "filename2" "filename3" | ~/Desktop/cmdparser.txt -abcc -c -zz -flag1 arg -flag2=arg --flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' --flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces / * + ` \ ! ' -flag9 ~/Desktop/*.txt -

echo "filename1" "filename2" "filename3" | ~/Desktop/cmdparser.txt -abcc -c -zz -flag1 arg -flag2=arg --flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' --flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces / * + ` \ ! ' -flag9 '~/Desktop/*.txt' -

COMMENT



# cmdparser

usage="usage: $(/usr/bin/basename "$0") [-a] [-b] [-c] [-cc] [-zz] [-flag1 arg] [-flag2 'arg1 arg2 ...'] [-flag3=arg] [-flag4=\"arg1 arg2\"] ..."


# define the names of flags as a regular expression
# flags are command line options that require arguments

flags="(flag1|flag2|flag3|flag4|flag5|flag6|flag7|flag8|flag9)"


# define the names of switches as a regular expression
# Switches are command line options that do not take arguments.
# Make sure multi-char switches precede single-char switches in the regular expression.
# Note that the regular expression contains neither the special read-from-stdin switch "-" 
# nor the special end-of-options switch "--".

switches="(cc|zz|a|b|c)"  


declare flag1 flag2 flag3 flag4 flag5 flag6 flag7 flag8 flag9                # flags
declare -i a=0 b=0 c=0 cc=0 zz=0                                            # switches
                         
declare argstr argvar argvar_escaped char flagvar optstr piped pipedstr       # script variables
declare -i optid pipedvar

# piped="piped" will be used for variable creation 
# example: piped="piped"; pipedstr="piped arg"; eval $piped='"$(echo "$pipedstr")"'; echo "$piped"

piped="piped"

# default value is set to "no pipe"
pipedvar=0
pipedstr=""

# if /dev/stdin has a size greater than zero ...
if [[ -s /dev/stdin ]]; then pipedstr="$(</dev/stdin)"; fi 

if [[ $# -eq 0 ]] && [[ -z "$pipedstr" ]]; then
  printf "\n%s\n\n%s\n\n" 'No arguments specified!' "$usage" 1>&2
  exit 1
fi 

if [[ $# -eq 0 ]] && [[ -n "$pipedstr" ]]; then
  eval $piped='"${pipedstr}"'  
  pipedvar=1
fi 

# if there are command line arguments ...
# Note that $pipedvar may still be set to 1 below if the special read-from-stdin switch "-" is given.

if [[ $pipedvar -eq 0 ]]; then

   optstr=" "  
   optid=0

   while [[ -n "$optstr" ]]; do     

      # try to extract valid flags or switches from positional parameter $1
      # $1 gets shifted afterwards (cf. help shift)

      optstr="$(printf "%s" "$1" | /usr/bin/egrep -e "^--?${flags}$")"

      if [[ -n "$optstr" ]]; then optid=1; fi
      if [[ -z "$optstr" ]]; then optid=2; optstr="$(printf "%s" "$1" | /usr/bin/egrep -e "^--?${switches}$")"; fi
      if [[ -z "$optstr" ]]; then optid=3; optstr="$(printf "%s" "$1" | /usr/bin/egrep -e "^--?${switches}+$")"; fi
      if [[ -z "$optstr" ]]; then optid=4; optstr="$(printf "%s" "$1" | /usr/bin/egrep -e "^--?(${flags}=.*|${flags}[^[:space:]]+)$")"; fi

      if [[ -z "$optstr" ]]; then  
         if [[ "$1" = "-" ]] && [[ "$@" = "-" ]]; then  
            optid=5
            optstr="-" 
         elif [[ -n "$(printf "%s" "${@/%/ }" | /usr/bin/egrep -e "[[:space:]]--?(${flags}|${switches})")" ]]; then 
            # append a space to each command line argument
            argstr="$(printf "%s" "${@/%/ }")"
            printf "\n%s\x21\n\n%s\n\n%s\n\n" "Undefined non-option string: ${1} is followed by a legal flag or switch" "${argstr}" "$usage" 1>&2
            exit 1
         fi
      fi

      if [[ "$1" = "--" ]]; then shift; break; fi     # -- marks end of options

      if [[ -z "$optstr" ]]; then break; fi     # no further flags or switches to process


      # flag followed by space (example: -f file)
      if [[ $optid -eq 1 ]]; then 

         if [[ -z "$2" ]]; then
            printf "%s\n%s\n" "no argument given to flag: ${1}" "$usage" 1>&2
            exit 1
         fi 

         flagvar="${1##*-}"     # remove leading - or --
         argvar="$2"
         eval $flagvar='"${argvar}"'
         shift 2     # shift positional parameters $1 & $2 (that is, a flag plus its argument)
         continue

      # single switch (example: -a)
      elif [[ $optid -eq 2 ]]; then
         flagvar="${1##*-}"
         eval $flagvar='"1"'
         shift
         continue
  
      # combined switch (example: -abcc)
      elif [[ $optid -eq 3 ]]; then
         flagvar="${1##*-}"
         while [[ -n "$flagvar" ]]; do
            char="$(printf "%s" "$flagvar" | /usr/bin/sed -E "s/^${switches}.*$/\1/")"
            eval $char='"1"'
            flagvar="$(printf "%s" "$flagvar" | /usr/bin/sed -E "s/^${switches}//")"
         done
         shift
         continue

      # flag without following space (example: -ffile)
      elif [[ $optid -eq 4 ]]; then 

: <<-'COMMENT'

         argvar="$(printf "%s" "$1" | /usr/bin/sed -E "s/^\-\-?${flags}\=?//")"

         argvar2="${argvar//\\\\/\\\\}"       # escape \  (for Bash version 2.05b.0(1)-release)
         #argvar2="${argvar//\\/\\\\}"          # escape \  

         flagvar="${1%${argvar2}}"          # remove escaped $argvar string
         flagvar="${flagvar%=}"             # remove trailing =
         flagvar="${flagvar##*-}"           # remove leading - or --
         eval $flagvar='"${argvar}"'
         shift
         continue

COMMENT

         # alternative: no string escaping necessary
         #argvar="$(printf "%s" "$1" | /usr/bin/sed -E "s/^\-\-?${flags}\=?//")"
         #flagvar="$(printf "%s" "$1" | /usr/bin/sed -E -n -e "s/^\-\-?${flags}\=?.*$/\\1/p")"

         argvar="$(printf "%s" "${1##*-}" | /usr/bin/sed -E "s/^${flags}\=?//")"
         flagvar="$(printf "%s" "${1##*-}" | /usr/bin/sed -E -n -e "s/^${flags}\=?.*$/\\1/p")"

         eval $flagvar='"${argvar}"'
         shift
         continue


      # the special read-from-stdin switch "-"
      elif [[ $optid -eq 5 ]]; then 
         pipedvar=1
         eval $piped='"${pipedstr}"'
         shift
         break

      fi

      # remove positional parameter $1 from "$@"
      shift

   done

fi   # if [[$pipedvar -eq 0 ]]; then ...


echo 

printf "%s\t%s\n" "a:" "${a}"
printf "%s\t%s\n" "b:" "${b}"
printf "%s\t%s\n" "c:" "${c}"
printf "%s\t%s\n" "cc:" "${cc}"
printf "%s\t%s\n" "zz:" "${zz}"
printf "%s\t%s\n" "flag1:" "${flag1}"
printf "%s\t%s\n" "flag2:" "${flag2}"
printf "%s\t%s\n" "flag3:" "${flag3}"
printf "%s\t%s\n" "flag4:" "${flag4}"
printf "%s\t%s\n" "flag5:" "${flag5}"
printf "%s\t%s\n" "flag6:" "${flag6}"
printf "%s\t%s\n" "flag7:" "${flag7}"
printf "%s\t%s\n" "flag8:" "${flag8}"
printf "%s\t%s\n" "flag9:" "${flag9}"

echo


if [[ $pipedvar -eq 1 ]] && [[ -z "$@" ]]; then 
   echo "remaining string-piped: ${piped}"
else 
   echo "remaining string: ${@}"
fi

echo

if [[ $flag9 == '~/Desktop/*.txt' ]]; then printf "%s\n" ~/Desktop/*.txt | nl; fi

echo

exit 0
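
The script above assigns to variables whose names are only known at run time by going through eval. The same indirect assignment can be sketched with printf -v, which avoids eval. This assumes a bash that supports printf -v (3.1 or later, as far as I know, so not the 2.05b release mentioned in the comments); the helper name set_var is hypothetical.

```shell
#!/bin/bash
# set_var NAME VALUE: assign VALUE to the variable called NAME without eval.
# Hypothetical helper, not part of cmdparser.
set_var() {
  printf -v "$1" '%s' "$2"
}

flagvar="flag8"
argvar='Yet another argument with spaces / * + ` \ ! '
set_var "$flagvar" "$argvar"
printf '[%s]\n' "$flag8"
```

Because the value is passed as a printf argument rather than re-parsed as shell code, characters such as backticks and backslashes need no escaping.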



The non-destructive version of cmdparser does not modify the command line arguments or their number ($@ and $#):
#!/bin/bash

# create a fake command line
#set -- -abcc -c -zz -flag1="" -flag2=arg$'\n'plus_newline -flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' -flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces / * + ` \ !' -flag9 ~/Desktop/*.txt filename1 filename2 filename3

#set -- -abcc -c -zz -flag1="" -flag2=arg$'\n'plus_newline -flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' -flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces / * + ` \ !' -flag9 '~/Desktop/*.txt' filename1 filename2 filename3


printf "%s\n" "$@" | nl
#printf "%s" "$@"$'\n' | nl
#printf "%s" "${@/%/ }" | nl


: <<-'COMMENT'

# copy & paste examples

echo "filename1" "filename2" "filename3" | ~/Downloads/Mac-OS-X-bash-scripts/bash-cmdparser/cmdparser-non-destructive-1.txt -abcc -c -zz -flag1 arg -flag2=arg$'\n'plus_newline --flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' --flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces / * + ` \ !' -flag9 ~/Desktop/*.txt -

echo "filename1" "filename2" "filename3" | ~/Downloads/Mac-OS-X-bash-scripts/bash-cmdparser/cmdparser-non-destructive-1.txt -abcc -c -zz -flag1 arg -flag2=arg$'\n'plus_newline --flag3="arg" -flag4='arg1=*,arg2=?,arg3=!' -flag5 '(arg1|arg2|arg3)' -flag6 'arg1=ag,arg2=bg,arg3=cg' --flag7 An\ argument\ with\ spaces\! -flag8='Yet another argument with spaces / * + ` \ !' -flag9 '~/Desktop/*.txt' -

COMMENT


echo

echo "Number of positional parameters: ${#}"

echo


# cmdparser

export PATH=/usr/bin:/bin:/usr/sbin:/sbin
export IFS=$' \t\n'


# non-builtin commands used
# cf. man builtin
declare basename=/usr/bin/basename egrep=/usr/bin/egrep sed=/usr/bin/sed


# define the names of flags as a regular expression
# flags are command line options that require arguments

flags="(flag1|flag2|flag3|flag4|flag5|flag6|flag7|flag8|flag9)"


# define the names of switches as a regular expression
# Switches are command line options that do not take arguments.
# Make sure multi-char switches precede single-char switches in the regular expression.
# Note that the regular expression contains neither the special read-from-stdin switch "-" 
# nor the special end-of-options switch "--".

switches="(cc|zz|a|b|c)"  


usage="usage: $(${basename} "$0") [-a] [-b] [-c] [-cc] [-zz] [-flag1 arg] [-flag2 'arg1 arg2 ...'] [-flag3=arg] [-flag4=\"arg1 arg2\"] ..."

declare flag1 flag2 flag3 flag4 flag5 flag6 flag7 flag8 flag9                # flags
declare -i a=0 b=0 c=0 cc=0 zz=0                                            # switches
                         
declare argn argstr argvar argvar_escaped char flagvar optstr piped pipedstr       # script variables
declare -i optid pipedvar

# piped="piped" will be used for variable creation 
# example: piped="piped"; pipedstr="piped arg"; eval $piped='"$(echo "$pipedstr")"'; echo "$piped"

piped="piped"

# default value is set to "no pipe"
pipedvar=0
pipedstr=""

# if /dev/stdin has a size greater than zero ...
if [[ -s /dev/stdin ]]; then pipedstr="$(</dev/stdin)"; fi 

if [[ $# -eq 0 ]] && [[ -z "$pipedstr" ]]; then
  printf "\n%s\n\n%s\n\n" 'No arguments specified!' "$usage" 1>&2
  exit 1
fi 

if [[ $# -eq 0 ]] && [[ -n "$pipedstr" ]]; then
  eval $piped='"${pipedstr}"'  
  pipedvar=1
fi 

# if there are command line arguments ...
# Note that $pipedvar may still be set to 1 below if the special read-from-stdin switch "-" is given

if [[ $pipedvar -eq 0 ]]; then

   optstr=" "  
   optid=0


   # processing one positional parameter at a time without modifying $# or $@
   # Process positional parameters non-destructively in Bash, http://codesnippets.joyent.com/posts/show/1706

   for (( i=1; i <= $#; i++ )); do 


      argn="${@:${i}:1}"     # current positional parameter
                             # "${@:(${i}+1):1}": the positional parameter following the current one
                             # "${@:${i}}": all positional parameters starting with the current one


      if [[ ${argn:0:1} != '-' ]]; then break; fi   # every flag or switch has to have a leading -

      optstr="$(printf "%s" "${argn}" | ${egrep} -e "^--?${flags}$")"

      if [[ -n "$optstr" ]]; then optid=1; fi
      if [[ -z "$optstr" ]]; then optid=2; optstr="$(printf "%s" "${argn}" | ${egrep} -e "^--?${switches}$")"; fi
      if [[ -z "$optstr" ]]; then optid=3; optstr="$(printf "%s" "${argn}" | ${egrep} -e "^--?${switches}+$")"; fi
      if [[ -z "$optstr" ]]; then optid=4; optstr="$(printf "%s" "${argn}" | ${egrep} -e "^--?(${flags}=.*|${flags}[^[:space:]]+)$")"; fi

      if [[ -z "$optstr" ]]; then  
         if [[ "${argn}" = "-" ]] && [[ "${@:${i}}" = "-" ]]; then  
            optid=5
            optstr="-" 

         elif [[ -n "$(printf "%s" "${@:${i}/%/ }" | ${egrep} -e "[[:space:]]--?(${flags}|${switches})")" ]]; then 
            # create argstr by appending a space to each command line argument
            argstr="$(printf "%s" "${@:${i}/%/ }" )"
            printf "\n%s\x21\n\n%s\n\n%s\n\n" "Undefined non-option string: ${argn} is followed by a legal flag or switch" "${argstr}" "$usage" 1>&2
            exit 1
         fi
      fi

      if [[ "${argn}" = "--" ]]; then break; fi     # -- marks end of options

      if [[ -z "$optstr" ]]; then break; fi     # no further flags or switches to process


      # flag followed by space (example: -f file)
      if [[ $optid -eq 1 ]]; then 

         if [[ -z "${@:(${i}+1):1}" ]]; then
            printf "%s\n%s\n" "no argument given to flag: ${argn}" "$usage" 1>&2
            exit 1
         fi 

         flagvar="${argn##*-}"     # remove leading dashes
         argvar="${@:(${i}+1):1}"
         eval $flagvar='"${argvar}"'
         let "i += 1"     # skip argument of current flag in next for loop
         continue

      # single switch (example: -a)
      elif [[ $optid -eq 2 ]]; then
         flagvar="${argn##*-}"
         eval $flagvar='"1"'
         continue
  
      # combined switch (example: -abcc)
      elif [[ $optid -eq 3 ]]; then
         flagvar="${argn##*-}"
         while [[ -n "$flagvar" ]]; do
            char="$(printf "%s" "$flagvar" | ${sed} -E "s/^${switches}.*$/\1/")"
            eval $char='"1"'
            flagvar="$(printf "%s" "$flagvar" | ${sed} -E "s/^${switches}//")"
         done
         continue

      # flag without following space (example: -ffile)
      elif [[ $optid -eq 4 ]]; then 

: <<-'COMMENT'

         argvar="$(printf "%s" "${argn}" | ${sed} -E "s/^\-\-?${flags}\=?//")"

         argvar2="${argvar//\\\\/\\\\}"       # escape \  (for Bash version 2.05b.0(1)-release)
         #argvar2="${argvar//\\/\\\\}"          # escape \  

         flagvar="${argn%${argvar2}}"         # remove escaped $argvar string
         flagvar="${flagvar%=}"               # remove trailing =
         flagvar="${flagvar##*-}"             # remove leading - or --
         eval $flagvar='"${argvar}"'
         continue

COMMENT

        # alternative: no string escaping required
         #argvar="$(printf "%s" "${argn}" | ${sed} -E "s/^\-\-?${flags}\=?//")"
         #flagvar="$(printf "%s" "${argn}" | ${sed} -E -n -e "s/^\-\-?${flags}\=?.*$/\\1/p")"

         argvar="$(printf "%s" "${argn##*-}" | ${sed} -E "s/^${flags}\=?//")"
         flagvar="$(printf "%s" "${argn##*-}" | ${sed} -E -n -e "s/^${flags}\=?.*$/\\1/p")"

         eval $flagvar='"${argvar}"'
         continue


      # the special read-from-stdin switch "-"
      elif [[ $optid -eq 5 ]]; then 
         pipedvar=1
         eval $piped='"${pipedstr}"'
         break

      fi

   done   # for loop

fi   # if [[$pipedvar -eq 0 ]]; then ...


echo 

printf "%s\t%s\n" "a:" "${a}"
printf "%s\t%s\n" "b:" "${b}"
printf "%s\t%s\n" "c:" "${c}"
printf "%s\t%s\n" "cc:" "${cc}"
printf "%s\t%s\n" "zz:" "${zz}"
printf "%s\t%s\n" "flag1:" "${flag1}"
printf "%s\t%s\n" "flag2:" "${flag2}"
printf "%s\t%s\n" "flag3:" "${flag3}"
printf "%s\t%s\n" "flag4:" "${flag4}"
printf "%s\t%s\n" "flag5:" "${flag5}"
printf "%s\t%s\n" "flag6:" "${flag6}"
printf "%s\t%s\n" "flag7:" "${flag7}"
printf "%s\t%s\n" "flag8:" "${flag8}"
printf "%s\t%s\n" "flag9:" "${flag9}"

echo


if [[ $pipedvar -eq 1 ]] && [[ -z "$@" ]]; then 
   echo "remaining string-piped: ${piped}"
else 
   echo "remaining string: ${@}"
fi

echo

echo "Number of positional parameters: ${#}"

echo

if [[ $flag9 == '~/Desktop/*.txt' ]]; then printf "%s\n" ~/Desktop/*.txt | nl; fi

echo

exit 0
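
The non-destructive loop above relies on indexing into "$@" instead of calling shift. Stripped of the parsing logic, the idea can be sketched as follows; the function name list_args is hypothetical.

```shell
#!/bin/bash
# Walk the positional parameters by index; $# and "$@" stay untouched.
# Hypothetical helper, not part of cmdparser.
list_args() {
  local i
  for (( i = 1; i <= $#; i++ )); do
    # "${@:i:1}" expands to the i-th positional parameter only
    printf '%d: %s\n' "$i" "${@:i:1}"
  done
  printf 'count after loop: %d\n' "$#"
}

list_args -a 'two words' file
```

Since nothing is shifted away, code that runs after the loop can still see the complete original command line.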


Further information:

- Command-line argument
- In the Beginning... was the Command Line
- Handling Command Line Arguments
- Utility Conventions - Utility Argument Syntax (POSIX)
- Utility Conventions - Utility Syntax Guidelines (POSIX)
- GNU coding standards: 4.7 Standards for Command Line Interfaces
- bash-getopts
- Bash Shell my_getopts
- Parsing arguments for your shell script
- Bash: parsing arguments with 'getopts'
- More Power with Bash Getopts
- Getopt and getopts
- Option-ize your shell scripts
- Emulating getopt
- Positional Parameters
- Parsing Command Line Options in Shell Scripts
- Command Line Processing in Cocoa
- ddcli: An Objective-C Command Line Helper
- Arg_parser

Snip - extract a named element from an html file using bash

This is a primitive way of getting, from any reasonably modern html file (i.e. one that is well-formed and makes proper use of the id attribute), the kind of data extraction that is more commonly associated with true XML. The purpose is mainly simple yet fast and efficient text browsing, especially useful for quick look-ups in dictionaries, thesauruses (thesauri?), encyclopedias etc. Since the data you're interested in is usually put into a specific element, text browsing is often greatly enhanced by extracting the element in question and discarding the rest. You run the script by specifying an element in the standard css way (element#id) and the file to be parsed, and the script responds by spitting out that element (and only that element) through html2text, which does a really nice job of turning html code into legible console text.

EDIT: Added a quick check for the presence/absence of the element type in the line (before the grep operations) - greatly increases speed with large elements like #content on wikipedia.

#! /bin/bash

printhelp () {
echo "snip is a simple bash html cutter that works by extracting a specific element 
from an html file and feeding it to html2text. It presupposes well-formed html
and that you know the kind of element you want and its id.

Syntax:
snip <element type>#<element id> <file to be parsed>

Example:
snip div#bodyContent /tmp/index.html
"
exit
}

quitter () {
echo "Element id not found. Quitting."; exit
}

[ "$1" = "-h" -o "$1" = "--help" -o "$1" = "" ] && printhelp

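elementtype="$(echo "$1" | cut -d '#' -f 1)"
id="$(echo "$1" | cut -d '#' -f 2)"
htmlfile="$2"
# line number of the first line carrying the id (file name quoted so paths with spaces work)
thebegin=$(grep -nioE "id=\"$id\"" "$htmlfile" | head -n 1 | cut -d ':' -f 1)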
# echo $thebegin
[ -n "$thebegin" ] || quitter

sed -n ${thebegin}p "$htmlfile" | sed -re "s/^.*id=\"$id\"/<$elementtype id=\"$id\"/g" > /tmp/snipfile
sed -n $(($thebegin+1)),\$p "$htmlfile"  >> /tmp/snipfile

i=0
element=0
cat /tmp/snipfile | while read line; do
	let i++
	if [[ "$line" =~ "$elementtype" ]]; then
		elementbegincount="$(echo $line | grep -io "<$elementtype" | grep -c .)"
		elementendcount="$(echo $line | grep -io "</$elementtype" | grep -c .)"
		element=$(($element+$elementbegincount-$elementendcount))
		if [ "$element" -le 0 ]; then
			sed -n 1,${i}p /tmp/snipfile | html2text
			exit
		fi
	fi
done
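
The loop above finds the end of the element by keeping a running balance of opening and closing tags per line. A stand-alone sketch of just that counting step, assuming well-formed html with tags that do not span lines; the function name closing_line is hypothetical and it reads from stdin rather than /tmp/snipfile.

```shell
#!/bin/bash
# closing_line TAG: read html on stdin and print the 1-based number of the
# line on which the first <TAG ...> element is closed again.
# Hypothetical helper, not part of snip.
closing_line() {
  local tag="$1" line opens closes depth=0 n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    # grep -o prints each match on its own line; grep -c . counts those lines
    opens=$(printf '%s\n' "$line" | grep -io "<${tag}[ >]" | grep -c . || true)
    closes=$(printf '%s\n' "$line" | grep -io "</${tag}>" | grep -c . || true)
    depth=$((depth + opens - closes))
    # only lines that contain the tag can close the element
    if [ "$opens" -gt 0 ] || [ "$closes" -gt 0 ]; then
      if [ "$depth" -le 0 ]; then printf '%s\n' "$n"; return 0; fi
    fi
  done
  return 1
}

printf '%s\n' '<div id="x">' '<div>inner</div>' 'text' '</div>' 'after' | closing_line div
```

The nested div on the second line opens and closes on the same line, so the balance only returns to zero at the outer closing tag.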


As an example of how the script can be put to use, here's my Wikipedia lookup (the script above is referred to as 'snip' here):

#! /bin/bash

useragent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071619 Firefox/3.0.1"
if wget -q -U "$useragent" -O /tmp/wpfile "http://en.wikipedia.org/wiki/Special:Search?search=$*"; then
	clear
	echo "Page downloaded..."
	snip div#content /tmp/wpfile | less
else
	echo "No connection, sorry. Please try again."
fi
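
The lookup script places the search words into the URL as typed. If a query contains characters such as & or #, percent-encoding it first may help. A minimal sketch in plain bash, assuming ASCII input and a bash with printf -v; the function name urlencode is hypothetical.

```shell
#!/bin/bash
# urlencode STRING: percent-encode everything except unreserved characters.
# Hypothetical helper; only intended for ASCII input.
urlencode() {
  local s="$1" out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out="${out}${c}" ;;
      # "'$c" makes printf use the numeric code of the character
      *) printf -v hex '%%%02X' "'$c"; out="${out}${hex}" ;;
    esac
  done
  printf '%s\n' "$out"
}

urlencode 'command line & shell'
```

In the script above the result would replace $* in the search URL, for example search=$(urlencode "$*").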

Serve php within .htm

In your .htaccess file (optionally only in a specific folder), add this line so that .htm files are parsed as PHP. This works on TxD accounts.
AddType application/x-httpd-php .htm .php

Example using xmlparser's saxdriver to parse huge files


#!/usr/bin/env ruby
## To run this, call run_amazon_import(datafile), where datafile is the base name of the file to parse; it is later opened as:
## ("#{RAILS_ROOT}/data/" + datafile + ".xml")
## This is hard coded to look at Item elements, and in this example
## parses out the ASIN as @@product_id and ItemAttributes/Title as @@name
## see check_position_space(name,ch)

require 'xml/saxdriver'
@flag_item  = false

  @@finaldata = []
  @@vars = []
  @@positionSpace = []
  @@currentName = []
def reset_vals
  @@product_id = nil
  @@name = nil
end
def check_position_space(name,ch)
  # for each value within an Item we check whether
  # @@positionSpace (a concatenation of the names of the enclosing elements)
  # equals the path we are looking for; if so, store the text in a class
  # variable
  if @@positionSpace == 'ASIN'
    @@product_id =  ch
  elsif @@positionSpace == 'ItemAttributesTitle'
  	# if I did this again, I would build @@positionSpace
  	# with / between names in startElement so it would be similar to other
  	# ruby xml naming schemes, so:
  	# @@positionSpace == 'ItemAttributesTitle' would be:
  	# @@positionSpace == 'ItemAttributes/Title'
  	@@name = ch
  end
end

class TestHandler < XML::SAX::HandlerBase
  attr_accessor :data
  def startDocument
    @@data = []
  end
  def startElement(name, attr)
    @flag_item = true if name == 'Item'
    @@positionSpace = '' if name == 'Item'
    if @flag_item == true and name != 'Item'
        @@positionSpace = @@positionSpace + name
    elsif name == 'Item'
      reset_vals
    end
    @@currentName = name
  end
  def endElement(name)
    if @flag_item == true and name != 'Item'
        lenName = name.length
        @@positionSpace = @@positionSpace[0, @@positionSpace.length - lenName]
    end
    if name == 'Item'
      @@finaldata  << @@data.to_s
      @@data = []
	  ## Here I would have a fully parsed Item and do something with it
    end
    @flag_item = false if name == 'Item'
  end
  def characters(ch, start, length)
    check_position_space(@@currentName, ch[start, length])
  end
end

def run_amazon_import(datafile)
  @@datafile = datafile
  p = XML::SAX::Helpers::ParserFactory.makeParser("XML::Parser::SAXDriver")
  h = TestHandler.new
  p.setDocumentHandler(h)
  p.setDTDHandler(h)
  p.setEntityResolver(h)
  p.setErrorHandler(h)

  begin
    p.parse("#{RAILS_ROOT}/data/" + datafile + ".xml")
  rescue XML::SAX::SAXParseException
    p(["ParseError", $!.getSystemId, $!.getLineNumber, $!.getMessage])
  end
end

xml to objects

Accessing an XML tree as ordinary objects

For this you can use XSD::Mapping from the standard library:

require 'xsd/mapping'
people = XSD::Mapping.xml2obj(File.read("people.xml"))
people.person[2].name # => "name3"


If a tag name contains a hyphen, you can do this: people['foo-bar']

To convert the object tree back into XML, use the method XSD::Mapping.obj2xml