How to parse XML in Bash?












114














Ideally, what I would like to be able to do is:



cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt





























  • 1




    unix.stackexchange.com/questions/83385/… || superuser.com/questions/369996/…
    – Ciro Santilli 新疆改造中心 六四事件 法轮功
    Oct 7 '15 at 10:57
















xml bash xhtml shell xpath







asked May 21 '09 at 15:36 by asdfasdfasdf














15 Answers


















133














This is really just an explanation of Yuzem's answer, but I didn't feel this much editing should be done to someone else's answer, and comments don't allow formatting, so...



rdom () { local IFS=\> ; read -d \< E C ;}


Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}


Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of it automatically being split on spaces, tabs or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:



<tag>value</tag>


The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That then gets split by the IFS into the two fields 'tag' and 'value'. read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.



Now here is his while loop, cleaned up a bit to match the above:



while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo $CONTENT
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>


and the following loop:



while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml


You should get:



 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => item-apple-iso@2x.png
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>


So if we wrote a while loop like Yuzem's:



while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo $CONTENT
    fi
done < input.xml


We'd get a listing of all the files in the S3 bucket.



EDIT
If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:



read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}


Otherwise, any line splitting you do later in the script will be messed up.



EDIT 2
To split out attribute name/value pairs you can augment the read_dom() like so:



read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}


Then write your function to parse and get the data you want like this:



parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}


Then while you read_dom call parse_dom:



while read_dom; do
    parse_dom
done


Then given the following example markup:



<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>


You should get this output:



$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789


EDIT 3: Another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:



read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}


I don't see any reason why that shouldn't work.

























  • 2




If you make IFS (the input field separator) global you should reset it back to its original value at the end; I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
    – chad
    Jul 25 '12 at 15:24








  • 3




    cool answer !!!!
    – mtk
    Oct 23 '12 at 20:15






  • 21




    Just because you can write your own parser, doesn't mean you should.
    – Stephen Niedzielski
    Apr 23 '13 at 21:49






  • 2




    @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
    – chad
    Oct 11 '13 at 14:27






  • 5




Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=\> read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
    – William Pursell
    Nov 27 '13 at 16:47



















57














You can do that very easily using only bash.
You only have to add this function:



rdom () { local IFS=\> ; read -d \< E C ;}


Now you can use rdom like read but for html documents.
When called, rdom will assign the element to variable E and the content to var C.



For example, to do what you wanted to do:



while rdom; do
    if [[ $E = title ]]; then
        echo $C
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


























  • could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
    – Alex Gray
    Jul 4 '11 at 2:14






  • 1




    alex, I clarified Yuzem's answer below...
    – chad
    Aug 13 '11 at 20:04






  • 1




    Cred to the original - this one-liner is so freakin' elegant and amazing.
    – maverick
    Dec 5 '13 at 22:06






  • 1




great hack, but I had to use double quotes like echo "$C" to prevent shell expansion and to correctly interpret line endings (depends on the encoding)
    – user311174
    Jan 16 '14 at 10:32






  • 3




Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you don't have much time, but it can never be called a good solution.
    – peterh
    Feb 1 at 16:34



















50














Command-line tools that can be called from shell scripts include:





  • 4xpath - command-line wrapper around Python's 4Suite package

  • XMLStarlet

  • xpath - command-line wrapper around Perl's XPath library


  • Xidel - Works with URLs as well as files. Also works with JSON


I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
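
For instance, here is a minimal sketch of the xmllint route for the title-extraction case from the question (file names taken from the question; this assumes an xmllint built with --xpath support, and uses local-name() to sidestep the XHTML default namespace):

# Grab the <title> text without registering a namespace prefix.
xmllint --xpath 'string(//*[local-name()="title"])' xhtmlfile.xhtml > titleOfXHTMLPage.txt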

























  • 2




    Where can I download 'xpath' or '4xpath' from ?
    – Opher
    Apr 15 '11 at 14:47






  • 3




    yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
    – David
    Nov 22 '11 at 0:34






  • 2




    sudo apt-get install libxml-xpath-perl
    – Andrew Wagner
    Nov 23 '12 at 12:37



















19














You can use the xpath utility. It's installed with Perl's XML::XPath package.



Usage:



/usr/bin/xpath [filename] query


or XMLStarlet. To install it on openSUSE use:



sudo zypper install xmlstarlet


or try cnf xml on other platforms.
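
Once installed, a minimal sketch of what the question asks for could look like this (the -N option binds a prefix for the XHTML default namespace; adjust the URI if your document declares a different one):

xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" \
    -t -v '/x:html/x:head/x:title' xhtmlfile.xhtml > titleOfXHTMLPage.txt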

























  • 5




    Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
    – Bruno von Paris
    Feb 8 '13 at 15:26










  • On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
    – tripleee
    Jul 27 '16 at 8:47






  • 2




    On Ubuntu/Debian apt-get install xmlstarlet
    – rubo77
    Dec 24 '16 at 0:48



















7














This is sufficient...



xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt


























  • Thanks, quick and did the job for me
    – Miguel Mota
    May 18 '16 at 23:03



















5














Starting from chad's answer, here is a complete working solution for parsing XML, with proper handling of comments, using just two little functions (more than two, but you can mix them all). I don't say chad's version didn't work at all, but it had too many issues with badly formatted XML files: you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.



The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone who needs to parse XML without complex tools, whether perl, python or anything else. As for me, I cannot install cpan or perl modules on the old production OS I'm working on, and python isn't available.



First, a definition of the XML terms used in this post:



<!-- comment... -->
<tag attribute="value">content...</tag>


EDIT: updated functions, which now handle:




  • Websphere xml (xmi and xmlns attributes)

  • must have a compatible terminal with 256 colors

  • 24 shades of grey

  • compatibility added for IBM AIX bash 3.2.16(1)


The functions: first is xml_read_dom, which is called recursively by xml_read:



xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi

# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}


and the second one :



xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true

[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true

while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break

if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags, so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test ""$${attribute}""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test ""$${attributes}""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
else
eval echo "$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}


and lastly, the rtrim, trim and echo2 (to stderr) functions:



rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }


Colorization:



oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:



set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produces ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'


How to load all that stuff:



Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)



If not, just copy/paste everything on the command line.



How does it work:



xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags


With Debug mode (-d) comments and parsed attributes are printed to stderr





























  • I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
    – khmarbaise
    Mar 5 '14 at 8:37










  • Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
    – khmarbaise
    Mar 5 '14 at 8:47










  • sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
    – scavenger
    Apr 8 '15 at 3:36





















4














I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.



My XML::Twig Perl module comes with such a tool: xml_grep. You would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt (the -t option gives you the result as text instead of XML).



































    4














    Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
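
As a rough sketch of the idea (the path below is illustrative, not taken from a real run): xml2 flattens the document into one path=value line per node, which ordinary line-oriented tools can then filter:

xml2 < xhtmlfile.xhtml | sed -n 's|^/html/head/title=||p' > titleOfXHTMLPage.txt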



































      4














Another command-line tool is my new Xidel. Unlike the already mentioned xpath/xmlstarlet, it also supports XPath 2 and XQuery.



      The title can be read like:



      xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt


      And it also has a cool feature to export multiple variables to bash. For example



      eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )


      sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.



























      • This is exactly what I needed! :)
        – Thomas Daugaard
        Oct 18 '13 at 9:04



















      2














Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.
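
For the question's case that might look like the following (the calling convention differs between versions of the xpath script; newer Debian/Ubuntu builds of libxml-xpath-perl expect -e before the expression and accept -q to quiet warnings, while older builds take the file name first):

xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
# or, with the newer calling convention:
xpath -q -e '/html/head/title/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt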



































        2














After some research into translating between the Linux and Windows formats of file paths in XML files, I found interesting tutorials and solutions on:




        • General informations about XPaths

        • Amara - collection of Pythonic tools for XML

        • Develop Python/XML with 4Suite (2 parts)



































          2














          While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.



          Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.



          Example 1



#!/usr/bin/env python
import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]

# a hack allowing to access the
# default namespace (if defined) via the 'p:' prefix
# E.g. given a default namespace such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
# an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
    ns['p'] = ns.pop(None)
# end of hack

for e in tree.xpath(xpath_expression, namespaces=ns):
    if isinstance(e, str):
        print(e)
    else:
        print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))


          lxml can be installed with pip install lxml. On ubuntu you can use sudo apt install python-lxml.



          Usage



          python xpath.py myfile.xml "//mynode"


          lxml also accepts a URL as input:



          python xpath.py http://www.feedforall.com/sample.xml "//link"



          Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.






          Example 2



          A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:



          pom.xml:



<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modules>
    <module>cherries</module>
    <module>bananas</module>
    <module>pears</module>
  </modules>
</project>


          module_extractor.py:



from lxml import etree

for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
    print(e.text)




























          • This is awesome when you either want to avoid installing extra packages or don't have access to. On a build machine, I can justify an extra pip install over apt-get or yum call. Thanks!
            – E. Moffat
            Oct 31 at 0:12



















          0














Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and in the variable assignments, so that:



rdom () { local IFS=\> ; read -d \< E C ;}


          becomes:



rdom () { local IFS=\< ; read -d \> C E ;}


          If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
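
A minimal sketch of the reversed reader in the same kind of loop as the earlier answers (variable names follow Yuzem's convention; this assumes simple, well-formed XML):

rdom () { local IFS=\< ; read -d \> C E ;}

# E holds the tag and C the text that preceded it; the document's final
# closing tag is now reached because reading stops on '>' rather than '<'.
while rdom; do
    printf '%s => %s\n' "$E" "$C"
done < input.xml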



































            0














This works if you want XML attributes:



            $ cat alfa.xml
            <video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>

            $ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh

            $ . ./alfa.sh

            $ echo "$stream"
            H264_400.mp4




































              -1














              Introduction



Thank you very much for the earlier answers. The question title is ambiguous: the asker says they want to parse XML, but what they actually want to parse is XHTML. Though similar, they are definitely not the same, so it was hard to come up with a solution for exactly what was asked. I also want to admit I couldn't find out how to look specifically for /html/head/title. That said, I'm not satisfied with some of the earlier answers, since some answerers are reinventing the wheel unnecessarily when the asker never said downloading a package was forbidden; I don't understand the unnecessary coding at all. I specifically want to repeat what a person in this thread already said: just because you can write your own parser, doesn't mean you should - @Stephen Niedzielski. In programming, the easiest and shortest way is by rule the one to prefer; never make anything more complex than needed. The solution has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. It is possible that the wrong title element could be selected, for example if the <body> tags came before the <head> tags and contained a <title> tag, but that is very, very unlikely.



              TLDR/Solution



For the general path to the solution, thank you @Grisha, @Nat (How to parse XML in Bash?).



For removing XML tags, thank you @Johnsyweb (How to remove XML tags from Unix command line?).



              1. Install the "package" xmlstarlet



              2. Execute in bash xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt






              share|improve this answer





















                Your Answer






                StackExchange.ifUsing("editor", function () {
                StackExchange.using("externalEditor", function () {
                StackExchange.using("snippets", function () {
                StackExchange.snippets.init();
                });
                });
                }, "code-snippets");

                StackExchange.ready(function() {
                var channelOptions = {
                tags: "".split(" "),
                id: "1"
                };
                initTagRenderer("".split(" "), "".split(" "), channelOptions);

                StackExchange.using("externalEditor", function() {
                // Have to fire editor after snippets, if snippets enabled
                if (StackExchange.settings.snippets.snippetsEnabled) {
                StackExchange.using("snippets", function() {
                createEditor();
                });
                }
                else {
                createEditor();
                }
                });

                function createEditor() {
                StackExchange.prepareEditor({
                heartbeatType: 'answer',
                autoActivateHeartbeat: false,
                convertImagesToLinks: true,
                noModals: true,
                showLowRepImageUploadWarning: true,
                reputationToPostImages: 10,
                bindNavPrevention: true,
                postfix: "",
                imageUploader: {
                brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                allowUrls: true
                },
                onDemand: true,
                discardSelector: ".discard-answer"
                ,immediatelyShowMarkdownHelp:true
                });


                }
                });














                draft saved

                draft discarded


















                StackExchange.ready(
                function () {
                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f893585%2fhow-to-parse-xml-in-bash%23new-answer', 'question_page');
                }
                );

                Post as a guest















                Required, but never shown
























                15 Answers
                15






                active

                oldest

                votes








                15 Answers
                15






                active

                oldest

                votes









                active

                oldest

                votes






                active

                oldest

                votes









                133














                This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...



                rdom () { local IFS=> ; read -d < E C ;}


                Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                }


                Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:



                <tag>value</tag>


                The first call to read_dom get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.



                Now his while loop cleaned up a bit to match the above:



                while read_dom; do
                if [[ $ENTITY = "title" ]]; then
                echo $CONTENT
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


                The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



                Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



                <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
                <Name>sth-items</Name>
                <IsTruncated>false</IsTruncated>
                <Contents>
                <Key>item-apple-iso@2x.png</Key>
                <LastModified>2011-07-25T22:23:04.000Z</LastModified>
                <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
                <Size>1785</Size>
                <StorageClass>STANDARD</StorageClass>
                </Contents>
                </ListBucketResult>


                and the following loop:



                while read_dom; do
                echo "$ENTITY => $CONTENT"
                done < input.xml


                You should get:



                 => 
                ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
                Name => sth-items
                /Name =>
                IsTruncated => false
                /IsTruncated =>
                Contents =>
                Key => item-apple-iso@2x.png
                /Key =>
                LastModified => 2011-07-25T22:23:04.000Z
                /LastModified =>
                ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
                /ETag =>
                Size => 1785
                /Size =>
                StorageClass => STANDARD
                /StorageClass =>
                /Contents =>


                So if we wrote a while loop like Yuzem's:



                while read_dom; do
                if [[ $ENTITY = "Key" ]] ; then
                echo $CONTENT
                fi
                done < input.xml


                We'd get a listing of all the files in the S3 bucket.



                EDIT
                If for some reason local IFS=> doesn't work for you and you set it globally, you should reset it at the end of the function like:



                read_dom () {
                ORIGINAL_IFS=$IFS
                IFS=>
                read -d < ENTITY CONTENT
                IFS=$ORIGINAL_IFS
                }


                Otherwise, any line splitting you do later in the script will be messed up.



                EDIT 2
                To split out attribute name/value pairs you can augment the read_dom() like so:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local ret=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $ret
                }


                Then write your function to parse and get the data you want like this:



                parse_dom () {
                if [[ $TAG_NAME = "foo" ]] ; then
                eval local $ATTRIBUTES
                echo "foo size is: $size"
                elif [[ $TAG_NAME = "bar" ]] ; then
                eval local $ATTRIBUTES
                echo "bar type is: $type"
                fi
                }


                Then while you read_dom call parse_dom:



                while read_dom; do
                parse_dom
                done


                Then given the following example markup:



                <example>
                <bar size="bar_size" type="metal">bars content</bar>
                <foo size="1789" type="unknown">foos content</foo>
                </example>


                You should get this output:



                $ cat example.xml | ./bash_xml.sh 
                bar type is: metal
                foo size is: 1789


                EDIT 3 another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local RET=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $RET
                }


                I don't see any reason why that shouldn't work






                share|improve this answer



















                • 2




                  If you make IFS (the input field separator) global you should reset it back to its original value at the end, I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shbang is #!/bin/sh) or it's an ancient version of bash.
                  – chad
                  Jul 25 '12 at 15:24








                • 3




                  cool answer !!!!
                  – mtk
                  Oct 23 '12 at 20:15






                • 21




                  Just because you can write your own parser, doesn't mean you should.
                  – Stephen Niedzielski
                  Apr 23 '13 at 21:49






                • 2




                  @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
                  – chad
                  Oct 11 '13 at 14:27






                • 5




                  Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
                  – William Pursell
                  Nov 27 '13 at 16:47
















                133














                This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...



                rdom () { local IFS=> ; read -d < E C ;}


                Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                }


                Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:



                <tag>value</tag>


                The first call to read_dom get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.



                Now his while loop cleaned up a bit to match the above:



                while read_dom; do
                if [[ $ENTITY = "title" ]]; then
                echo $CONTENT
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


                The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



                Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



                <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
                <Name>sth-items</Name>
                <IsTruncated>false</IsTruncated>
                <Contents>
                <Key>item-apple-iso@2x.png</Key>
                <LastModified>2011-07-25T22:23:04.000Z</LastModified>
                <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
                <Size>1785</Size>
                <StorageClass>STANDARD</StorageClass>
                </Contents>
                </ListBucketResult>


                and the following loop:



                while read_dom; do
                echo "$ENTITY => $CONTENT"
                done < input.xml


                You should get:



                 => 
                ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
                Name => sth-items
                /Name =>
                IsTruncated => false
                /IsTruncated =>
                Contents =>
                Key => item-apple-iso@2x.png
                /Key =>
                LastModified => 2011-07-25T22:23:04.000Z
                /LastModified =>
                ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
                /ETag =>
                Size => 1785
                /Size =>
                StorageClass => STANDARD
                /StorageClass =>
                /Contents =>


                So if we wrote a while loop like Yuzem's:



                while read_dom; do
                if [[ $ENTITY = "Key" ]] ; then
                echo $CONTENT
                fi
                done < input.xml


                We'd get a listing of all the files in the S3 bucket.



                EDIT
                If for some reason local IFS=> doesn't work for you and you set it globally, you should reset it at the end of the function like:



                read_dom () {
                ORIGINAL_IFS=$IFS
                IFS=>
                read -d < ENTITY CONTENT
                IFS=$ORIGINAL_IFS
                }


                Otherwise, any line splitting you do later in the script will be messed up.



                EDIT 2
                To split out attribute name/value pairs you can augment the read_dom() like so:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local ret=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $ret
                }


                Then write your function to parse and get the data you want like this:



                parse_dom () {
                if [[ $TAG_NAME = "foo" ]] ; then
                eval local $ATTRIBUTES
                echo "foo size is: $size"
                elif [[ $TAG_NAME = "bar" ]] ; then
                eval local $ATTRIBUTES
                echo "bar type is: $type"
                fi
                }


                Then while you read_dom call parse_dom:



                while read_dom; do
                parse_dom
                done


                Then given the following example markup:



                <example>
                <bar size="bar_size" type="metal">bars content</bar>
                <foo size="1789" type="unknown">foos content</foo>
                </example>


                You should get this output:



                $ cat example.xml | ./bash_xml.sh 
                bar type is: metal
                foo size is: 1789


                EDIT 3 another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local RET=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $RET
                }


                I don't see any reason why that shouldn't work






                share|improve this answer



















                • 2




                  If you make IFS (the input field separator) global you should reset it back to its original value at the end, I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shbang is #!/bin/sh) or it's an ancient version of bash.
                  – chad
                  Jul 25 '12 at 15:24








                • 3




                  cool answer !!!!
                  – mtk
                  Oct 23 '12 at 20:15






                • 21




                  Just because you can write your own parser, doesn't mean you should.
                  – Stephen Niedzielski
                  Apr 23 '13 at 21:49






                • 2




                  @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
                  – chad
                  Oct 11 '13 at 14:27






                • 5




                  Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
                  – William Pursell
                  Nov 27 '13 at 16:47














                133












                133








                133






                This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...



                rdom () { local IFS=> ; read -d < E C ;}


                Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                }


                Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:



                <tag>value</tag>


                The first call to read_dom get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.



                Now his while loop cleaned up a bit to match the above:



                while read_dom; do
                if [[ $ENTITY = "title" ]]; then
                echo $CONTENT
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


                The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



                Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



                <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
                <Name>sth-items</Name>
                <IsTruncated>false</IsTruncated>
                <Contents>
                <Key>item-apple-iso@2x.png</Key>
                <LastModified>2011-07-25T22:23:04.000Z</LastModified>
                <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
                <Size>1785</Size>
                <StorageClass>STANDARD</StorageClass>
                </Contents>
                </ListBucketResult>


                and the following loop:



                while read_dom; do
                echo "$ENTITY => $CONTENT"
                done < input.xml


                You should get:



                 => 
                ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
                Name => sth-items
                /Name =>
                IsTruncated => false
                /IsTruncated =>
                Contents =>
                Key => item-apple-iso@2x.png
                /Key =>
                LastModified => 2011-07-25T22:23:04.000Z
                /LastModified =>
                ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
                /ETag =>
                Size => 1785
                /Size =>
                StorageClass => STANDARD
                /StorageClass =>
                /Contents =>


                So if we wrote a while loop like Yuzem's:



                while read_dom; do
                if [[ $ENTITY = "Key" ]] ; then
                echo $CONTENT
                fi
                done < input.xml


                We'd get a listing of all the files in the S3 bucket.



                EDIT
                If for some reason local IFS=> doesn't work for you and you set it globally, you should reset it at the end of the function like:



                read_dom () {
                ORIGINAL_IFS=$IFS
                IFS=>
                read -d < ENTITY CONTENT
                IFS=$ORIGINAL_IFS
                }


                Otherwise, any line splitting you do later in the script will be messed up.



                EDIT 2
                To split out attribute name/value pairs you can augment the read_dom() like so:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local ret=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $ret
                }


                Then write your function to parse and get the data you want like this:



                parse_dom () {
                if [[ $TAG_NAME = "foo" ]] ; then
                eval local $ATTRIBUTES
                echo "foo size is: $size"
                elif [[ $TAG_NAME = "bar" ]] ; then
                eval local $ATTRIBUTES
                echo "bar type is: $type"
                fi
                }


                Then while you read_dom call parse_dom:



                while read_dom; do
                parse_dom
                done


                Then given the following example markup:



                <example>
                <bar size="bar_size" type="metal">bars content</bar>
                <foo size="1789" type="unknown">foos content</foo>
                </example>


                You should get this output:



                $ cat example.xml | ./bash_xml.sh 
                bar type is: metal
                foo size is: 1789


                EDIT 3 another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local RET=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $RET
                }


                I don't see any reason why that shouldn't work






• If you make IFS (the input field separator) global you should reset it back to its original value at the end, I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash. – chad, Jul 25 '12 at 15:24

• cool answer !!!! – mtk, Oct 23 '12 at 20:15

• Just because you can write your own parser, doesn't mean you should. – Stephen Niedzielski, Apr 23 '13 at 21:49

• @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects – chad, Oct 11 '13 at 14:27

• Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=\< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.) – William Pursell, Nov 27 '13 at 16:47


                You can do that very easily using only bash.
                You only have to add this function:



rdom () { local IFS=\> ; read -d \< E C ;}


                Now you can use rdom like read but for html documents.
                When called rdom will assign the element to variable E and the content to var C.



                For example, to do what you wanted to do:



                while rdom; do
                if [[ $E = title ]]; then
                echo $C
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt





• could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output? – Alex Gray, Jul 4 '11 at 2:14

• alex, I clarified Yuzem's answer below... – chad, Aug 13 '11 at 20:04

• Cred to the original - this one-liner is so freakin' elegant and amazing. – maverick, Dec 5 '13 at 22:06

• great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding) – user311174, Jan 16 '14 at 10:32

• Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XMLs are simple enough and you don't have too much time, but it can't ever be called a good solution. – peterh, Feb 1 at 16:34


                Command-line tools that can be called from shell scripts include:





                • 4xpath - command-line wrapper around Python's 4Suite package

                • XMLStarlet

                • xpath - command-line wrapper around Perl's XPath library


                • Xidel - Works with URLs as well as files. Also works with JSON


                I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
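
For instance, the title extraction from the question can be done with a small stylesheet and xsltproc (a sketch; get-title.xsl is just a placeholder name):

cat > get-title.xsl <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>
  <!-- print the text of <title>, whether or not it sits in the XHTML namespace -->
  <xsl:template match="/">
    <xsl:value-of select="//xhtml:title | //title"/>
  </xsl:template>
</xsl:stylesheet>
EOF

xsltproc get-title.xsl xhtmlfile.xhtml > titleOfXHTMLPage.txt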






• Where can I download 'xpath' or '4xpath' from ? – Opher, Apr 15 '11 at 14:47

• yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary. – David, Nov 22 '11 at 0:34

• sudo apt-get install libxml-xpath-perl – Andrew Wagner, Nov 23 '12 at 12:37


You can use the xpath utility. It's installed with the Perl XML-XPath package.



                Usage:



                /usr/bin/xpath [filename] query


                or XMLStarlet. To install it on opensuse use:



                sudo zypper install xmlstarlet


                or try cnf xml on other platforms.
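
With XMLStarlet installed, the title query from the question looks roughly like this (the -N mapping is only needed if the file declares the XHTML namespace):

xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -v '/x:html/x:head/x:title' -n xhtmlfile.xhtml > titleOfXHTMLPage.txt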






• Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers). – Bruno von Paris, Feb 8 '13 at 15:26

• On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration. – tripleee, Jul 27 '16 at 8:47

• On Ubuntu/Debian apt-get install xmlstarlet – rubo77, Dec 24 '16 at 0:48


                This is sufficient...



                xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
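
If the xpath wrapper shipped with your system expects a different calling convention, xmllint gives much the same one-liner (the local-name() test sidesteps the XHTML namespace):

xmllint --xpath 'string(//*[local-name()="title"])' xhtmlfile.xhtml > titleOfXHTMLPage.txt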





• Thanks, quick and did the job for me – Miguel Mota, May 18 '16 at 23:03


Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, with just 2 little functions (more than 2, but you can mix them all). I don't say chad's one didn't work at all, but it had too many issues with badly formatted XML files: so you have to be a bit trickier to handle comments and misplaced spaces/CR/TAB/etc.



The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools using perl, python or anything else. As for me, I cannot install cpan nor perl modules for the old production OS I'm working on, and python isn't available.



First, a definition of the XML terms used in this post:



                <!-- comment... -->
                <tag attribute="value">content...</tag>


EDIT: updated functions, with:




                • Websphere xml (xmi and xmlns attributes)

                • must have a compatible terminal with 256 colors

                • 24 shades of grey

                • compatibility added for IBM AIX bash 3.2.16(1)


The functions: first is xml_read_dom, which is called in a loop by xml_read:



                xml_read_dom() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
                CR=$?
                [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
                TAG_NAME=${ENTITY%%[[:space:]]*}
                [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
                TAG_NAME=${TAG_NAME%%:*}
                ATTRIBUTES=${ENTITY#*[[:space:]]}
                ATTRIBUTES="${ATTRIBUTES//xmi:/}"
                ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
                fi

                # when comments sticks to !-- :
                [ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

                # http://tldp.org/LDP/abs/html/string-manipulation.html
                # INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
                [ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
                return $CR
                }


                and the second one :



                xml_read() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
                ITSACOMMENT=false
                local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
                local TMP LOG LOGG
                LIGHT=false
                FORCE_PRINT=false
                XAPPLY=false
                MULTIPLE_ATTR=false
                XAPPLIED_COLOR=g
                TAGPRINTED=false
                GETCONTENT=false
                PROSTPROCESS=cat
                Debug=${Debug:-false}
                TMP=/tmp/xml_read.$RANDOM
                USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                ${nn[2]} -c = NOCOLOR${END}
                ${nn[2]} -d = Debug${END}
                ${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
                ${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
                ${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
                ${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

                ! (($#)) && echo2 "$USAGE" && return 99
                (( $# < 2 )) && ERROR nbaram 2 0 && return 99
                # getopts:
                while getopts :cdlpx:a: _OPT 2>/dev/null
                do
                {
                case ${_OPT} in
                c) PROSTPROCESS="${DECOLORIZE}" ;;
                d) local Debug=true ;;
                l) LIGHT=true; XAPPLIED_COLOR=END ;;
                p) FORCE_PRINT=true ;;
                x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
                a) XATTRIBUTE="${OPTARG}" ;;
                *) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
                esac
                }
                done
                shift $((OPTIND - 1))
                unset _OPT OPTARG OPTIND
                [ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

                fileXml=$1
                tag=$2
                (( $# > 2 )) && shift 2 && attributes=$*
                (( $# > 1 )) && MULTIPLE_ATTR=true

                [ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
                $XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
                # nb attributes == 1 because $MULTIPLE_ATTR is false
                [ "${attributes}" == "content" ] && GETCONTENT=true

                while xml_read_dom; do
                # (( CR != 0 )) && break
                (( PIPESTATUS[1] != 0 )) && break

                if $ITSACOMMENT; then
                # oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
                # elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
                if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
                elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
                fi
                $Debug && echo2 "${N}${COMMENTS}${END}"
                elif test "${TAG_NAME}"; then
                if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
                if $GETCONTENT; then
                CONTENT="$(trim "${CONTENT}")"
                test ${CONTENT} && echo "${CONTENT}"
                else
                # eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
                eval local $ATTRIBUTES
                $Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
                if test "${attributes}"; then
                if $MULTIPLE_ATTR; then
                # we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
                ! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
                for attribute in ${attributes}; do
                ! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
                if eval test ""$${attribute}""; then
                test "${tag2print}" && ${print} "${tag2print}"
                TAGPRINTED=true; unset tag2print
                if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
                eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
                else
                eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
                fi
                fi
                done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
                else
                if eval test ""$${attributes}""; then
                if $XAPPLY; then
                eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
                else
                eval echo "$${attributes}" && eval unset ${attributes}
                fi
                fi
                fi
                else
                echo eval $ATTRIBUTES >>$TMP
                fi
                fi
                fi
                fi
                unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
                done < "${fileXml}" | ${PROSTPROCESS}
                # http://mywiki.wooledge.org/BashFAQ/024
                # INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
                if [ -s "$TMP" ]; then
                $FORCE_PRINT && ! $LIGHT && cat $TMP
                # $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
                . $TMP
                rm -f $TMP
                fi
                unset ITSACOMMENT
                }


                and lastly, the rtrim, trim and echo2 (to stderr) functions:



                rtrim() {
                local var=$@
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                trim() {
                local var=$@
                var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                echo2() { echo -e "$@" 1>&2; }


                Colorization:



Oh, and you will need some neat colorizing dynamic variables to be defined first, and exported, too:



                set -a
                TERM=xterm-256color
                case ${UNAME} in
                AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
                ;;
                esac
                # 24 shades of grey:
                for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
                # another way of having an array of 5 shades of grey:
                declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
                # piped decolorization:
                DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'


                How to load all that stuff:



Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash).



                If not, just copy/paste everything on the command line.
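
For example, if you keep them in plain files, sourcing them once per shell is enough (the directory and file names below are only placeholders):

# load the helpers; adjust the path to wherever you saved them
for f in colors.sh trim.sh xml_read_dom.sh xml_read.sh; do
. "$HOME/lib/bash/$f"
done

xml_read server.xml title content   # then call them as shown in the usage section below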



                How does it work:



                xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                -c = NOCOLOR
                -d = Debug
                -l = LIGHT (no "attribute=" printed)
                -p = FORCE PRINT (when no attributes given)
                -x = apply a command on an attribute and print the result instead of the former value, in green color
                (no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

                xml_read server.xml title content # print content between <title></title>
                xml_read server.xml Connector port # print all port values from Connector tags
                xml_read server.xml any port # print all port values from any tags


With Debug mode (-d), comments and parsed attributes are printed to stderr.
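
Since that debug trace goes to stderr, you can keep it apart from the extracted values, for example:

xml_read -d server.xml Connector port 2> xml_read.debug   # values on stdout, debug trace in xml_read.debug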






• I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0? – khmarbaise, Mar 5 '14 at 8:37

• Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ... – khmarbaise, Mar 5 '14 at 8:47

• sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;) – scavenger, Apr 8 '15 at 3:36


                starting from the chad's answer, here is the COMPLETE working solution to parse UML, with propper handling of comments, with just 2 little functions (more than 2 bu you can mix them all). I don't say chad's one didn't work at all, but it had too much issues with badly formated XML files: So you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.



                The purpose of this answer is to give ready-2-use, out of the box bash functions to anyone needing parsing UML without complex tools using perl, python or anything else. As for me, I cannot install cpan, nor perl modules for the old production OS i'm working on, and python isn't available.



                First, a definition of the UML words used in this post:



                <!-- comment... -->
                <tag attribute="value">content...</tag>


                EDIT: updated functions, with handle of:




                • Websphere xml (xmi and xmlns attributes)

                • must have a compatible terminal with 256 colors

                • 24 shades of grey

                • compatibility added for IBM AIX bash 3.2.16(1)


                The functions, first is the xml_read_dom which's called recursively by xml_read:



                xml_read_dom() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
                local ENTITY IFS=>
                if $ITSACOMMENT; then
                read -d < COMMENTS
                COMMENTS="$(rtrim "${COMMENTS}")"
                return 0
                else
                read -d < ENTITY CONTENT
                CR=$?
                [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
                TAG_NAME=${ENTITY%%[[:space:]]*}
                [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
                TAG_NAME=${TAG_NAME%%:*}
                ATTRIBUTES=${ENTITY#*[[:space:]]}
                ATTRIBUTES="${ATTRIBUTES//xmi:/}"
                ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
                fi

                # when comments sticks to !-- :
                [ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

                # http://tldp.org/LDP/abs/html/string-manipulation.html
                # INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
                [ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
                return $CR
                }


                and the second one :



                xml_read() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
                ITSACOMMENT=false
                local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
                local TMP LOG LOGG
                LIGHT=false
                FORCE_PRINT=false
                XAPPLY=false
                MULTIPLE_ATTR=false
                XAPPLIED_COLOR=g
                TAGPRINTED=false
                GETCONTENT=false
                PROSTPROCESS=cat
                Debug=${Debug:-false}
                TMP=/tmp/xml_read.$RANDOM
                USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                ${nn[2]} -c = NOCOLOR${END}
                ${nn[2]} -d = Debug${END}
                ${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
                ${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
                ${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
                ${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

                ! (($#)) && echo2 "$USAGE" && return 99
                (( $# < 2 )) && ERROR nbaram 2 0 && return 99
                # getopts:
                while getopts :cdlpx:a: _OPT 2>/dev/null
                do
                {
                case ${_OPT} in
                c) PROSTPROCESS="${DECOLORIZE}" ;;
                d) local Debug=true ;;
                l) LIGHT=true; XAPPLIED_COLOR=END ;;
                p) FORCE_PRINT=true ;;
                x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
                a) XATTRIBUTE="${OPTARG}" ;;
                *) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
                esac
                }
                done
                shift $((OPTIND - 1))
                unset _OPT OPTARG OPTIND
                [ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

                fileXml=$1
                tag=$2
                (( $# > 2 )) && shift 2 && attributes=$*
                (( $# > 1 )) && MULTIPLE_ATTR=true

                [ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
                $XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
                # nb attributes == 1 because $MULTIPLE_ATTR is false
                [ "${attributes}" == "content" ] && GETCONTENT=true

                while xml_read_dom; do
                # (( CR != 0 )) && break
                (( PIPESTATUS[1] != 0 )) && break

                if $ITSACOMMENT; then
                # oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
                # elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
                if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
                elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
                fi
                $Debug && echo2 "${N}${COMMENTS}${END}"
                elif test "${TAG_NAME}"; then
                if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
                if $GETCONTENT; then
                CONTENT="$(trim "${CONTENT}")"
                test ${CONTENT} && echo "${CONTENT}"
                else
                # eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
                eval local $ATTRIBUTES
                $Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
                if test "${attributes}"; then
                if $MULTIPLE_ATTR; then
                # we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
                ! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
                for attribute in ${attributes}; do
                ! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
                if eval test ""$${attribute}""; then
                test "${tag2print}" && ${print} "${tag2print}"
                TAGPRINTED=true; unset tag2print
                if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
                eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
                else
                eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
                fi
                fi
                done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
                else
                if eval test ""$${attributes}""; then
                if $XAPPLY; then
                eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
                else
                eval echo "$${attributes}" && eval unset ${attributes}
                fi
                fi
                fi
                else
                echo eval $ATTRIBUTES >>$TMP
                fi
                fi
                fi
                fi
                unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
                done < "${fileXml}" | ${PROSTPROCESS}
                # http://mywiki.wooledge.org/BashFAQ/024
                # INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
                if [ -s "$TMP" ]; then
                $FORCE_PRINT && ! $LIGHT && cat $TMP
                # $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
                . $TMP
                rm -f $TMP
                fi
                unset ITSACOMMENT
                }


                and lastly, the rtrim, trim and echo2 (to stderr) functions:



                rtrim() {
                local var=$@
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                trim() {
                local var=$@
                var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                echo2() { echo -e "$@" 1>&2; }


                Colorization:



oh and you will need some neat colorizing dynamic variables to be defined first, and exported, too:



                set -a
                TERM=xterm-256color
                case ${UNAME} in
                AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
                ;;
                esac
                # 24 shades of grey:
                for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
                # another way of having an array of 5 shades of grey:
                declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
                # piped decolorization:
                DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'


                How to load all that stuff:



                Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)



                If not, just copy/paste everything on the command line.



                How does it work:



                xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                -c = NOCOLOR
                -d = Debug
                -l = LIGHT (no "attribute=" printed)
                -p = FORCE PRINT (when no attributes given)
                -x = apply a command on an attribute and print the result instead of the former value, in green color
                (no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

                xml_read server.xml title content # print content between <title></title>
                xml_read server.xml Connector port # print all port values from Connector tags
                xml_read server.xml any port # print all port values from any tags


                With Debug mode (-d) comments and parsed attributes are printed to stderr
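
To give a rough idea of what to expect, here is a minimal sketch using a hypothetical server.xml (the exact output depends on the color variables defined above):

cat > server.xml <<'EOF'
<Server>
  <Connector port="8080" protocol="HTTP/1.1"/>
  <Connector port="8443" protocol="AJP/1.3"/>
</Server>
EOF

xml_read server.xml Connector port
# should print the port values, one per line:
# 8080
# 8443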






edited Feb 1 at 16:40 – peterh
answered Jan 29 '14 at 12:44 – scavenger


• I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
  – khmarbaise
  Mar 5 '14 at 8:37

• Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
  – khmarbaise
  Mar 5 '14 at 8:47

• sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
  – scavenger
  Apr 8 '15 at 3:36
                4














I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.



                My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt (the -t option gives you the result as text instead of xml)
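
As a small sketch, given a hypothetical xhtmlfile.xhtml containing <html><head><title>My Page</title></head><body/></html>, the command above should print something like:

xml_grep -t '/html/head/title' xhtmlfile.xhtml
# My Page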






answered May 21 '09 at 15:43 – mirod
                        4














                        Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
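
As a rough sketch of that line-oriented format (hypothetical input piped in, output as produced by a typical xml2 build):

echo '<html><head><title>My Page</title></head></html>' | xml2
# /html/head/title=My Page

The flat output can then be filtered with the usual grep/sed/cut pipeline.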






answered Nov 7 '09 at 15:31 – simon04
                                4














Another command line tool is my new Xidel. It also supports XPath 2 and XQuery, unlike the already mentioned xpath/xmlstarlet.



                                The title can be read like:



                                xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt


                                And it also has a cool feature to export multiple variables to bash. For example



                                eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )


                                sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.
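
For example, after the eval line above has run, the exported values are ordinary shell variables:

echo "title: $title"
echo "number of images: $imgcount"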






answered Mar 27 '13 at 0:27 – BeniBela


• This is exactly what I needed! :)
  – Thomas Daugaard
  Oct 18 '13 at 9:04
                                2














Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.
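
As a sketch (the exact flags vary between versions of the xpath script shipped with XML::XPath / libxml-xpath-perl):

xpath -q -e '/html/head/title/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt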






answered May 21 '09 at 15:39 – alamar
                                        2














After some research into translating file paths between Linux and Windows formats inside XML files, I found interesting tutorials and solutions on:


• General information about XPath

• Amara - collection of Pythonic tools for XML

• Develop Python/XML with 4Suite (2 parts)






answered Oct 24 '10 at 1:00 – user485380
                                                2














                                                While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.



                                                Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.



                                                Example 1



#!/usr/bin/env python
import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]

# a hack allowing access to the
# default namespace (if defined) via the 'p:' prefix
# E.g. given a default namespace such as 'xmlns="http://maven.apache.org/POM/4.0.0"',
# an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
    ns['p'] = ns.pop(None)
# end of hack

for e in tree.xpath(xpath_expression, namespaces=ns):
    if isinstance(e, str):
        print(e)
    else:
        print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))


                                                lxml can be installed with pip install lxml. On ubuntu you can use sudo apt install python-lxml.



                                                Usage



                                                python xpath.py myfile.xml "//mynode"


                                                lxml also accepts a URL as input:



                                                python xpath.py http://www.feedforall.com/sample.xml "//link"



                                                Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.






                                                Example 2



                                                A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:



                                                pom.xml:



                                                <?xml version="1.0" encoding="UTF-8"?>
                                                <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
                                                <modules>
                                                <module>cherries</module>
                                                <module>bananas</module>
                                                <module>pears</module>
                                                </modules>
                                                </project>


                                                module_extractor.py:



from lxml import etree

for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
    print(e.text)





                                                • This is awesome when you either want to avoid installing extra packages or don't have access to. On a build machine, I can justify an extra pip install over apt-get or yum call. Thanks!
                                                  – E. Moffat
                                                  Oct 31 at 0:12
















                                                2














                                                While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.



                                                Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.



                                                Example 1



                                                #!/usr/bin/env python
                                                import sys
                                                from lxml import etree

                                                tree = etree.parse(sys.argv[1])
                                                xpath_expression = sys.argv[2]

                                                # a hack allowing to access the
                                                # default namespace (if defined) via the 'p:' prefix
                                                # E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
                                                # an XPath of '//p:module' will return all the 'module' nodes
                                                ns = tree.getroot().nsmap
                                                if ns.keys() and None in ns:
                                                ns['p'] = ns.pop(None)
                                                # end of hack

                                                for e in tree.xpath(xpath_expression, namespaces=ns):
                                                if isinstance(e, str):
                                                print(e)
                                                else:
                                                print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))


                                                lxml can be installed with pip install lxml. On ubuntu you can use sudo apt install python-lxml.



                                                Usage



                                                python xpath.py myfile.xml "//mynode"


                                                lxml also accepts a URL as input:



                                                python xpath.py http://www.feedforall.com/sample.xml "//link"



                                                Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.






                                                Example 2



                                                A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:



                                                pom.xml:



                                                <?xml version="1.0" encoding="UTF-8"?>
                                                <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
                                                <modules>
                                                <module>cherries</module>
                                                <module>bananas</module>
                                                <module>pears</module>
                                                </modules>
                                                </project>


                                                module_extractor.py:



                                                from lxml import etree
                                                for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
                                                print(e.text)





                                                share|improve this answer























                                                • This is awesome when you either want to avoid installing extra packages or don't have access to. On a build machine, I can justify an extra pip install over apt-get or yum call. Thanks!
                                                  – E. Moffat
                                                  Oct 31 at 0:12







































Yuzem's method can be improved by reversing the order of the < and > signs in the rdom function and in the variable assignments, so that:

rdom () { local IFS=\> ; read -d \< E C ;}

becomes:

rdom () { local IFS=\< ; read -d \> C E ;}

If the parsing is not done this way, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
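Here is a minimal sketch of the reversed function in use (the input file name and the title element are just for illustration, and the < and > are escaped so the shell does not treat them as redirections). With this variant the first variable receives the text that preceded the tag, so the title text shows up on the iteration that reads the closing /title entity:

read_dom () { local IFS=\< ; read -d \> CONTENT ENTITY ;}

while read_dom; do
    # ENTITY holds the tag just read; CONTENT holds the text that came before it,
    # so the text of <title>...</title> arrives together with '/title'
    if [[ $ENTITY = "/title" ]]; then
        echo "$CONTENT"
    fi
done < xhtmlfile.xhtml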






answered Jan 24 '13 at 0:46 – michaelmeyer








































This works if you want XML attributes:



                                                        $ cat alfa.xml
                                                        <video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>

                                                        $ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh

                                                        $ . ./alfa.sh

                                                        $ echo "$stream"
                                                        H264_400.mp4
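A rough breakdown of the sed expression (reusing the alfa.xml from above): sed is invoked with '.' as the substitution delimiter, so the first command deletes the leading run of non-space characters (the <video token) and the second deletes the trailing />. What is left is a line of name="value" pairs, which happens to be valid shell assignment syntax, so sourcing it defines the variables:

# s.[^ ]*..  is  s/[^ ]*//   -> drops '<video'
# s./>..     is  s/\/>//     -> drops the trailing '/>'
# leftover: ' server="asdf.com" stream="H264_400.mp4" cdn="limelight"'
sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
. ./alfa.sh
echo "$server" "$cdn"    # asdf.com limelight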





answered Jun 16 '12 at 16:53, edited Jan 3 '17 at 19:34 – Steven Penny










































Introduction

Thank you very much for the earlier answers. The question title is ambiguous: it asks how to parse XML, while what the asker actually wants is to parse XHTML. Though the two are similar, they are definitely not the same, which makes it hard to give a solution that matches the question exactly. I hope the solution below will still do; I admit I could not work out how to target /html/head/title specifically. I am also not satisfied with some of the earlier answers, since several of them reinvent the wheel unnecessarily even though the asker never said that installing a package was forbidden. I want to repeat what a person in this thread already said: "Just because you can write your own parser, doesn't mean you should" – @Stephen Niedzielski. As a rule, prefer the simplest and shortest approach, and never make anything more complex than needed. The solution has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. One caveat: if another title element exists and is selected first, the result will be wrong – for example, if the <body> tags come before the <head> tags and contain a <title> tag – but that is very, very unlikely.

TLDR/Solution

For the general approach, thanks to @Grisha and @Nat, How to parse XML in Bash?

For removing the XML tags, thanks to @Johnsyweb, How to remove XML tags from Unix command line?

1. Install the xmlstarlet package.

2. Execute in bash: xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt
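As a quick sanity check (the xhtml file below is made up for illustration; the output is what I would expect from the command above, printed to the terminal instead of redirected to the file):

$ cat xhtmlfile.xhtml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>My page title</title></head>
<body><p>Some text</p></body>
</html>

$ xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g'
My page title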






answered Oct 15 at 11:10 – propatience

























