How to parse XML in Bash?


























Ideally, what I would like to be able to do is:



cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt



















































xml bash xhtml shell xpath






asked May 21 '09 at 15:36 – asdfasdfasdf

  • 1




    unix.stackexchange.com/questions/83385/… || superuser.com/questions/369996/…
    – Ciro Santilli 新疆改造中心 六四事件 法轮功
    Oct 7 '15 at 10:57














15 Answers






































This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else's answer, and comments don't allow formatting, so...



rdom () { local IFS=\> ; read -d \< E C ;}


Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}


Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to '>'. That means that when data is read, instead of automatically being split on space, tab or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, stop when you see a '<' character (-d is the delimiter flag). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:



<tag>value</tag>


The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That is then split by the IFS into the two fields 'tag' and 'value'. read then assigns the variables like ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. read then assigns the variables like ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.
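
If you want to see those three reads in action, here is a minimal sketch (not part of the original answer; it assumes the read_dom function defined above is already loaded in your shell, and the numbered echoes are only there for illustration):

{
    read_dom; echo "1: ENTITY='$ENTITY' CONTENT='$CONTENT'"
    read_dom; echo "2: ENTITY='$ENTITY' CONTENT='$CONTENT'"
    read_dom; echo "3: ENTITY='$ENTITY' CONTENT='$CONTENT'"
} <<< '<tag>value</tag>'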



Now here is his while loop, cleaned up a bit to match the above:



while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo $CONTENT
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>


and the following loop:



while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml


You should get:



 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => item-apple-iso@2x.png
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>


So if we wrote a while loop like Yuzem's:



while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo $CONTENT
    fi
done < input.xml


We'd get a listing of all the files in the S3 bucket.



EDIT
If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function, like:



read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}


Otherwise, any line splitting you do later in the script will be messed up.
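
As a quick illustration of that breakage (a hypothetical snippet, not part of the original answer): once IFS is globally '>', ordinary word splitting on spaces stops happening until you restore it:

IFS=\>
sentence="one two three"
printf '%s\n' $sentence    # prints "one two three" as a single word
IFS=$' \t\n'               # restore the default
printf '%s\n' $sentence    # now splits into three words, one per line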



EDIT 2
To split out attribute name/value pairs you can augment the read_dom() like so:



read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}


Then write your function to parse and get the data you want like this:



parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}


Then, in your read_dom loop, call parse_dom:



while read_dom; do
    parse_dom
done


Then given the following example markup:



<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>


You should get this output:



$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789


EDIT 3: Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:



read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}


I don't see any reason why that shouldn't work.

























  • 2




    If you make IFS (the input field separator) global, you should reset it back to its original value at the end; I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
    – chad
    Jul 25 '12 at 15:24








  • 3




    cool answer !!!!
    – mtk
    Oct 23 '12 at 20:15






  • 21




    Just because you can write your own parser, doesn't mean you should.
    – Stephen Niedzielski
    Apr 23 '13 at 21:49






  • 2




    @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
    – chad
    Oct 11 '13 at 14:27






  • 5




    Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=\< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
    – William Pursell
    Nov 27 '13 at 16:47

































You can do that very easily using only bash.
You only have to add this function:



rdom () { local IFS=\> ; read -d \< E C ;}


Now you can use rdom like read, but for HTML documents.
Each time it is called, rdom will assign the element to the variable E and the content to the variable C.



For example, to do what you wanted to do:



while rdom; do
    if [[ $E = title ]]; then
        echo $C
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


























  • could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
    – Alex Gray
    Jul 4 '11 at 2:14






  • 1




    alex, I clarified Yuzem's answer below...
    – chad
    Aug 13 '11 at 20:04






  • 1




    Cred to the original - this one-liner is so freakin' elegant and amazing.
    – maverick
    Dec 5 '13 at 22:06






  • 1




    great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding)
    – user311174
    Jan 16 '14 at 10:32






  • 3




    Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you don't have too much time, but it can never be called a good solution.
    – peterh
    Feb 1 at 16:34

































Command-line tools that can be called from shell scripts include:





  • 4xpath - command-line wrapper around Python's 4Suite package

  • XMLStarlet

  • xpath - command-line wrapper around Perl's XPath library


  • Xidel - Works with URLs as well as files. Also works with JSON


I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
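
For example, a minimal xmllint sketch of the title extraction from the question (note that a namespaced XHTML document may need a namespace-aware query or the --html parser instead):

# Uses libxml2's xmllint; works directly on documents without a default namespace.
xmllint --xpath '/html/head/title/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt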

























  • 2




    Where can I download 'xpath' or '4xpath' from ?
    – Opher
    Apr 15 '11 at 14:47






  • 3




    yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
    – David
    Nov 22 '11 at 0:34






  • 2




    sudo apt-get install libxml-xpath-perl
    – Andrew Wagner
    Nov 23 '12 at 12:37

































You can use the xpath utility. It's installed with Perl's XML-XPath package.



Usage:



/usr/bin/xpath [filename] query


or XMLStarlet. To install it on openSUSE use:



sudo zypper install xmlstarlet


or try cnf xml on other platforms.
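
For instance, a sketch of the title extraction with XMLStarlet (the binary may be named xml or xmlstarlet depending on the distribution):

# Print the text of /html/head/title; "_" is XMLStarlet's prefix for the
# document's default namespace, which XHTML files usually declare.
xmlstarlet sel -t -v '/_:html/_:head/_:title' xhtmlfile.xhtml > titleOfXHTMLPage.txt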

























  • 5




    Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
    – Bruno von Paris
    Feb 8 '13 at 15:26










  • On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
    – tripleee
    Jul 27 '16 at 8:47






  • 2




    On Ubuntu/Debian apt-get install xmlstarlet
    – rubo77
    Dec 24 '16 at 0:48

































This is sufficient...



xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt


























  • Thanks, quick and did the job for me
    – Miguel Mota
    May 18 '16 at 23:03

































Starting from chad's answer, here is a more complete working solution for parsing XML, with proper handling of comments, using just two little functions (more than two, actually, but you can mix them all together). I'm not saying chad's version didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit more careful when handling comments and misplaced spaces/CR/TAB/etc.



The purpose of this answer is to give ready-to-use, out-of-the-box Bash functions to anyone who needs to parse XML without complex tools in Perl, Python or anything else. As for me, I cannot install CPAN or Perl modules on the old production OS I'm working on, and Python isn't available.



First, a definition of the XML terms used in this post:



<!-- comment... -->
<tag attribute="value">content...</tag>


EDIT: updated functions, which now handle:




  • Websphere xml (xmi and xmlns attributes)

  • must have a compatible terminal with 256 colors

  • 24 shades of grey

  • compatibility added for IBM AIX bash 3.2.16(1)


The functions: the first is xml_read_dom, which is called recursively by xml_read:



xml_read_dom() {
  # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
  local ENTITY IFS=\>
  if $ITSACOMMENT; then
    read -d \< COMMENTS
    COMMENTS="$(rtrim "${COMMENTS}")"
    return 0
  else
    read -d \< ENTITY CONTENT
    CR=$?
    [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
    TAG_NAME=${ENTITY%%[[:space:]]*}
    [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
    TAG_NAME=${TAG_NAME%%:*}
    ATTRIBUTES=${ENTITY#*[[:space:]]}
    ATTRIBUTES="${ATTRIBUTES//xmi:/}"
    ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
  fi

  # when comments stick to !-- :
  [ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

  # http://tldp.org/LDP/abs/html/string-manipulation.html
  # INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
  # [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
  [ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
  return $CR
}


and the second one:



xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true

[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true

while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break

if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
else
eval echo "$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}


and lastly, the rtrim, trim and echo2 (to stderr) functions:



rtrim() {
  local var=$@
  var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
  echo -n "$var"
}
trim() {
  local var=$@
  var="${var#"${var%%[![:space:]]*}"}"   # remove leading whitespace characters
  var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
  echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }


Colorization:



oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:



set -a
TERM=xterm-256color
case ${UNAME} in
  AIX|SunOS)
    M=$(${print} '\033[1;35m')
    m=$(${print} '\033[0;35m')
    END=$(${print} '\033[0m')
  ;;
  *)
    m=$(tput setaf 5)
    M=$(tput setaf 13)
    # END=$(tput sgr0)   # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
    END=$(${print} '\033[0m')
  ;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'


How to load all that stuff:



Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)



If not, just copy/paste everything on the command line.



How does it work:



xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags


With Debug mode (-d) comments and parsed attributes are printed to stderr





























  • I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
    – khmarbaise
    Mar 5 '14 at 8:37










  • Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
    – khmarbaise
    Mar 5 '14 at 8:47










  • sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
    – scavenger
    Apr 8 '15 at 3:36



































I am not aware of any pure shell XML parsing tool, so you will most likely need a tool written in another language.



My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt (the -t option gives you the result as text instead of xml)

















































    Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
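
    As a rough sketch of what that looks like (the exact paths below are illustrative, not taken from the original answer):

    # xml2 flattens the document into /path=value lines that are easy to grep/cut.
    xml2 < xhtmlfile.xhtml | grep '^/html/head/title=' | cut -d= -f2-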

















































      Another command line tool is my new Xidel. It also supports XPath 2 and XQuery, unlike the already mentioned xpath/xmlstarlet.



      The title can be read like:



      xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt


      And it also has a cool feature to export multiple variables to bash. For example



      eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )


      sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.



























      • This is exactly what I needed! :)
        – Thomas Daugaard
        Oct 18 '13 at 9:04

































      Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.
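
      Invocation details differ between versions, but a typical call looks roughly like this (the -q/-e flags are from the Debian libxml-xpath-perl wrapper; older versions take the file name first and the query second, so treat this as a sketch):

      # -q: quiet (print only the result), -e: the XPath expression to evaluate
      xpath -q -e '/html/head/title/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt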

















































        After some research into translating between Linux and Windows formats of the file paths in XML files, I found interesting tutorials and solutions on:




        • General informations about XPaths

        • Amara - collection of Pythonic tools for XML

        • Develop Python/XML with 4Suite (2 parts)

















































          While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.



          Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.



          Example 1



          #!/usr/bin/env python
          import sys
          from lxml import etree

          tree = etree.parse(sys.argv[1])
          xpath_expression = sys.argv[2]

          # a hack allowing to access the
          # default namespace (if defined) via the 'p:' prefix
          # E.g. given a default namespace such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
          # an XPath of '//p:module' will return all the 'module' nodes
          ns = tree.getroot().nsmap
          if ns.keys() and None in ns:
              ns['p'] = ns.pop(None)
          # end of hack

          for e in tree.xpath(xpath_expression, namespaces=ns):
              if isinstance(e, str):
                  print(e)
              else:
                  print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))


          lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python-lxml.



          Usage



          python xpath.py myfile.xml "//mynode"


          lxml also accepts a URL as input:



          python xpath.py http://www.feedforall.com/sample.xml "//link"



          Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.
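
          For example, with the pom.xml from Example 2 below, the modules could be listed like this:

          python xpath.py pom.xml "//p:module"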






          Example 2



          A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:



          pom.xml:



          <?xml version="1.0" encoding="UTF-8"?>
          <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
          <modules>
          <module>cherries</module>
          <module>bananas</module>
          <module>pears</module>
          </modules>
          </project>


          module_extractor.py:



          from lxml import etree
          for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
              print(e.text)




























          • This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
            – E. Moffat
            Oct 31 at 0:12

































          Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and in the variable assignments, so that:



          rdom () { local IFS=\> ; read -d \< E C ;}


          becomes:



          rdom () { local IFS=\< ; read -d \> C E ;}


          If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
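
          Here is a small hypothetical comparison of the two variants on a one-tag document (not part of the original answer); only the inverted version's loop ever sees the closing /tag:

          rdom_orig () { local IFS=\> ; read -d \< E C ;}
          rdom_inv ()  { local IFS=\< ; read -d \> C E ;}

          echo '--- original ---'
          while rdom_orig; do echo "E='$E' C='$C'"; done <<< '<tag>value</tag>'
          echo '--- inverted ---'
          while rdom_inv; do echo "E='$E' C='$C'"; done <<< '<tag>value</tag>'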

















































            This works if you are wanting XML attributes:



            $ cat alfa.xml
            <video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>

            $ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh

            $ . ./alfa.sh

            $ echo "$stream"
            H264_400.mp4


















































              Introduction



              Thank you very much for the earlier answers. The question's headline is ambiguous: it asks how to parse XML, when what the asker actually wants is to parse XHTML. Though the two are similar, they are definitely not the same, so it was hard to come up with a solution for exactly what was asked. I'll admit I couldn't find out how to look specifically for /html/head/title.

              That said, I'm not satisfied with the earlier answers, since some of the answerers are reinventing the wheel unnecessarily when the asker never said that downloading a package is forbidden. I don't understand the unnecessary coding at all, and I specifically want to repeat what a person in this thread already said: "Just because you can write your own parser, doesn't mean you should" - @Stephen Niedzielski. Regarding programming, the easiest and shortest way is as a rule the one to prefer; never make anything more complex than needed.

              The solution below has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. It is possible that if another title element existed and were selected you would get a bad result (for example, if the <body> tags came before the <head> tags and the <body> contained a <title> tag), but that is very, very unlikely.



              TLDR/Solution



              On general path for solution, thank you @Grisha, @Nat, How to parse XML in Bash?



              On removing xml tags, thank you @Johnsyweb, How to remove XML tags from Unix command line?



              1. Install the "package" xmlstarlet



              2. Execute in bash xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt



























                Your Answer






                StackExchange.ifUsing("editor", function () {
                StackExchange.using("externalEditor", function () {
                StackExchange.using("snippets", function () {
                StackExchange.snippets.init();
                });
                });
                }, "code-snippets");

                StackExchange.ready(function() {
                var channelOptions = {
                tags: "".split(" "),
                id: "1"
                };
                initTagRenderer("".split(" "), "".split(" "), channelOptions);

                StackExchange.using("externalEditor", function() {
                // Have to fire editor after snippets, if snippets enabled
                if (StackExchange.settings.snippets.snippetsEnabled) {
                StackExchange.using("snippets", function() {
                createEditor();
                });
                }
                else {
                createEditor();
                }
                });

                function createEditor() {
                StackExchange.prepareEditor({
                heartbeatType: 'answer',
                autoActivateHeartbeat: false,
                convertImagesToLinks: true,
                noModals: true,
                showLowRepImageUploadWarning: true,
                reputationToPostImages: 10,
                bindNavPrevention: true,
                postfix: "",
                imageUploader: {
                brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                allowUrls: true
                },
                onDemand: true,
                discardSelector: ".discard-answer"
                ,immediatelyShowMarkdownHelp:true
                });


                }
                });














                draft saved

                draft discarded


















                StackExchange.ready(
                function () {
                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f893585%2fhow-to-parse-xml-in-bash%23new-answer', 'question_page');
                }
                );

                Post as a guest















                Required, but never shown
























                15 Answers
                15






                active

                oldest

                votes








                15 Answers
                15






                active

                oldest

                votes









                active

                oldest

                votes






                active

                oldest

                votes









                133














                This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...



                rdom () { local IFS=> ; read -d < E C ;}


                Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                }


                Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:



                <tag>value</tag>


                The first call to read_dom get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.



                Now his while loop cleaned up a bit to match the above:



                while read_dom; do
                if [[ $ENTITY = "title" ]]; then
                echo $CONTENT
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


                The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



                Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



                <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
                <Name>sth-items</Name>
                <IsTruncated>false</IsTruncated>
                <Contents>
                <Key>item-apple-iso@2x.png</Key>
                <LastModified>2011-07-25T22:23:04.000Z</LastModified>
                <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
                <Size>1785</Size>
                <StorageClass>STANDARD</StorageClass>
                </Contents>
                </ListBucketResult>


                and the following loop:



                while read_dom; do
                echo "$ENTITY => $CONTENT"
                done < input.xml


                You should get:



                 => 
                ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
                Name => sth-items
                /Name =>
                IsTruncated => false
                /IsTruncated =>
                Contents =>
                Key => item-apple-iso@2x.png
                /Key =>
                LastModified => 2011-07-25T22:23:04.000Z
                /LastModified =>
                ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
                /ETag =>
                Size => 1785
                /Size =>
                StorageClass => STANDARD
                /StorageClass =>
                /Contents =>


                So if we wrote a while loop like Yuzem's:



                while read_dom; do
                if [[ $ENTITY = "Key" ]] ; then
                echo $CONTENT
                fi
                done < input.xml


                We'd get a listing of all the files in the S3 bucket.



                EDIT
                If for some reason local IFS=> doesn't work for you and you set it globally, you should reset it at the end of the function like:



                read_dom () {
                ORIGINAL_IFS=$IFS
                IFS=>
                read -d < ENTITY CONTENT
                IFS=$ORIGINAL_IFS
                }


                Otherwise, any line splitting you do later in the script will be messed up.



                EDIT 2
                To split out attribute name/value pairs you can augment the read_dom() like so:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local ret=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $ret
                }


                Then write your function to parse and get the data you want like this:



                parse_dom () {
                if [[ $TAG_NAME = "foo" ]] ; then
                eval local $ATTRIBUTES
                echo "foo size is: $size"
                elif [[ $TAG_NAME = "bar" ]] ; then
                eval local $ATTRIBUTES
                echo "bar type is: $type"
                fi
                }


                Then while you read_dom call parse_dom:



                while read_dom; do
                parse_dom
                done


                Then given the following example markup:



                <example>
                <bar size="bar_size" type="metal">bars content</bar>
                <foo size="1789" type="unknown">foos content</foo>
                </example>


                You should get this output:



                $ cat example.xml | ./bash_xml.sh 
                bar type is: metal
                foo size is: 1789


                EDIT 3 another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local RET=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $RET
                }


                I don't see any reason why that shouldn't work






                share|improve this answer



















                • 2




                  If you make IFS (the input field separator) global you should reset it back to its original value at the end, I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shbang is #!/bin/sh) or it's an ancient version of bash.
                  – chad
                  Jul 25 '12 at 15:24








                • 3




                  cool answer !!!!
                  – mtk
                  Oct 23 '12 at 20:15






                • 21




                  Just because you can write your own parser, doesn't mean you should.
                  – Stephen Niedzielski
                  Apr 23 '13 at 21:49






                • 2




                  @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
                  – chad
                  Oct 11 '13 at 14:27






                • 5




                  Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
                  – William Pursell
                  Nov 27 '13 at 16:47
















                133














                This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...



                rdom () { local IFS=> ; read -d < E C ;}


                Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                }


                Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:



                <tag>value</tag>


                The first call to read_dom get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.



                Now his while loop cleaned up a bit to match the above:



                while read_dom; do
                if [[ $ENTITY = "title" ]]; then
                echo $CONTENT
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


                The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



                Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



                <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
                <Name>sth-items</Name>
                <IsTruncated>false</IsTruncated>
                <Contents>
                <Key>item-apple-iso@2x.png</Key>
                <LastModified>2011-07-25T22:23:04.000Z</LastModified>
                <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
                <Size>1785</Size>
                <StorageClass>STANDARD</StorageClass>
                </Contents>
                </ListBucketResult>


                and the following loop:



                while read_dom; do
                echo "$ENTITY => $CONTENT"
                done < input.xml


                You should get:



                 => 
                ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
                Name => sth-items
                /Name =>
                IsTruncated => false
                /IsTruncated =>
                Contents =>
                Key => item-apple-iso@2x.png
                /Key =>
                LastModified => 2011-07-25T22:23:04.000Z
                /LastModified =>
                ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
                /ETag =>
                Size => 1785
                /Size =>
                StorageClass => STANDARD
                /StorageClass =>
                /Contents =>


                So if we wrote a while loop like Yuzem's:



                while read_dom; do
                if [[ $ENTITY = "Key" ]] ; then
                echo $CONTENT
                fi
                done < input.xml


                We'd get a listing of all the files in the S3 bucket.



                EDIT
                If for some reason local IFS=> doesn't work for you and you set it globally, you should reset it at the end of the function like:



                read_dom () {
                ORIGINAL_IFS=$IFS
                IFS=>
                read -d < ENTITY CONTENT
                IFS=$ORIGINAL_IFS
                }


                Otherwise, any line splitting you do later in the script will be messed up.



                EDIT 2
                To split out attribute name/value pairs you can augment the read_dom() like so:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local ret=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $ret
                }


                Then write your function to parse and get the data you want like this:



                parse_dom () {
                if [[ $TAG_NAME = "foo" ]] ; then
                eval local $ATTRIBUTES
                echo "foo size is: $size"
                elif [[ $TAG_NAME = "bar" ]] ; then
                eval local $ATTRIBUTES
                echo "bar type is: $type"
                fi
                }


                Then while you read_dom call parse_dom:



                while read_dom; do
                parse_dom
                done


                Then given the following example markup:



                <example>
                <bar size="bar_size" type="metal">bars content</bar>
                <foo size="1789" type="unknown">foos content</foo>
                </example>


                You should get this output:



                $ cat example.xml | ./bash_xml.sh 
                bar type is: metal
                foo size is: 1789


                EDIT 3 another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local RET=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $RET
                }


                I don't see any reason why that shouldn't work






                share|improve this answer



















                • 2




                  If you make IFS (the input field separator) global you should reset it back to its original value at the end, I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shbang is #!/bin/sh) or it's an ancient version of bash.
                  – chad
                  Jul 25 '12 at 15:24








                • 3




                  cool answer !!!!
                  – mtk
                  Oct 23 '12 at 20:15






                • 21




                  Just because you can write your own parser, doesn't mean you should.
                  – Stephen Niedzielski
                  Apr 23 '13 at 21:49






                • 2




                  @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
                  – chad
                  Oct 11 '13 at 14:27






                • 5




                  Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
                  – William Pursell
                  Nov 27 '13 at 16:47














                133












                133








                133






                This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...



                rdom () { local IFS=> ; read -d < E C ;}


                Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                }


                Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:



                <tag>value</tag>


                The first call to read_dom get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT=. The fourth call will return a non-zero status because we've reached the end of file.



                Now his while loop cleaned up a bit to match the above:



                while read_dom; do
                if [[ $ENTITY = "title" ]]; then
                echo $CONTENT
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt


                The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).



                Now given the following (similar to what you get from listing a bucket on S3) for input.xml:



                <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
                <Name>sth-items</Name>
                <IsTruncated>false</IsTruncated>
                <Contents>
                <Key>item-apple-iso@2x.png</Key>
                <LastModified>2011-07-25T22:23:04.000Z</LastModified>
                <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
                <Size>1785</Size>
                <StorageClass>STANDARD</StorageClass>
                </Contents>
                </ListBucketResult>


                and the following loop:



                while read_dom; do
                echo "$ENTITY => $CONTENT"
                done < input.xml


                You should get:



                 => 
                ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
                Name => sth-items
                /Name =>
                IsTruncated => false
                /IsTruncated =>
                Contents =>
                Key => item-apple-iso@2x.png
                /Key =>
                LastModified => 2011-07-25T22:23:04.000Z
                /LastModified =>
                ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
                /ETag =>
                Size => 1785
                /Size =>
                StorageClass => STANDARD
                /StorageClass =>
                /Contents =>


                So if we wrote a while loop like Yuzem's:



                while read_dom; do
                if [[ $ENTITY = "Key" ]] ; then
                echo $CONTENT
                fi
                done < input.xml


                We'd get a listing of all the files in the S3 bucket.



                EDIT
                If for some reason local IFS=> doesn't work for you and you set it globally, you should reset it at the end of the function like:



                read_dom () {
                ORIGINAL_IFS=$IFS
                IFS=>
                read -d < ENTITY CONTENT
                IFS=$ORIGINAL_IFS
                }


                Otherwise, any line splitting you do later in the script will be messed up.



                EDIT 2
                To split out attribute name/value pairs you can augment the read_dom() like so:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local ret=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $ret
                }


                Then write your function to parse and get the data you want like this:



                parse_dom () {
                if [[ $TAG_NAME = "foo" ]] ; then
                eval local $ATTRIBUTES
                echo "foo size is: $size"
                elif [[ $TAG_NAME = "bar" ]] ; then
                eval local $ATTRIBUTES
                echo "bar type is: $type"
                fi
                }


                Then while you read_dom call parse_dom:



                while read_dom; do
                parse_dom
                done


                Then given the following example markup:



                <example>
                <bar size="bar_size" type="metal">bars content</bar>
                <foo size="1789" type="unknown">foos content</foo>
                </example>


                You should get this output:



                $ cat example.xml | ./bash_xml.sh 
                bar type is: metal
                foo size is: 1789


                EDIT 3 another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:



                read_dom () {
                local IFS=>
                read -d < ENTITY CONTENT
                local RET=$?
                TAG_NAME=${ENTITY%% *}
                ATTRIBUTES=${ENTITY#* }
                return $RET
                }


                I don't see any reason why that shouldn't work






                answered Aug 13 '11 at 17:36
                chad

                • 2




                  If you make IFS (the input field separator) global, you should reset it back to its original value at the end; I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
                  – chad
                  Jul 25 '12 at 15:24








                • 3




                  cool answer !!!!
                  – mtk
                  Oct 23 '12 at 20:15






                • 21




                  Just because you can write your own parser, doesn't mean you should.
                  – Stephen Niedzielski
                  Apr 23 '13 at 21:49






                • 2




                  @Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
                  – chad
                  Oct 11 '13 at 14:27






                • 5




                  Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
                  – William Pursell
                  Nov 27 '13 at 16:47



























                57














                You can do that very easily using only bash.
                You only have to add this function:



                rdom () { local IFS=\> ; read -d \< E C ;}


                Now you can use rdom like read but for html documents.
                When called rdom will assign the element to variable E and the content to var C.



                For example, to do what you wanted to do:



                while rdom; do
                if [[ $E = title ]]; then
                echo $C
                exit
                fi
                done < xhtmlfile.xhtml > titleOfXHTMLPage.txt





                answered Apr 9 '10 at 14:13
                Yuzem





















                • could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
                  – Alex Gray
                  Jul 4 '11 at 2:14






                • 1




                  alex, I clarified Yuzem's answer below...
                  – chad
                  Aug 13 '11 at 20:04






                • 1




                  Cred to the original - this one-liner is so freakin' elegant and amazing.
                  – maverick
                  Dec 5 '13 at 22:06






                • 1




                  great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding)
                  – user311174
                  Jan 16 '14 at 10:32






                • 3




                  Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XMLs are simple enough and you don't have much time, but it can't ever be called a good solution.
                  – peterh
                  Feb 1 at 16:34



























                50














                Command-line tools that can be called from shell scripts include:





                • 4xpath - command-line wrapper around Python's 4Suite package

                • XMLStarlet

                • xpath - command-line wrapper around Perl's XPath library


                • Xidel - Works with URLs as well as files. Also works with JSON


                I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
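
                For example, extracting the title from the question with xmllint might look like this (a sketch; --xpath needs a reasonably recent libxml2, and if the file declares the XHTML namespace you have to work around it, e.g. with local-name()):

                # plain XML, no default namespace:
                xmllint --xpath 'string(/html/head/title)' xhtmlfile.xhtml > titleOfXHTMLPage.txt
                # XHTML with its default namespace declared:
                xmllint --xpath "string(//*[local-name()='title'])" xhtmlfile.xhtml > titleOfXHTMLPage.txt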






                answered May 21 '09 at 18:18
                Nat



















                • 2




                  Where can I download 'xpath' or '4xpath' from ?
                  – Opher
                  Apr 15 '11 at 14:47






                • 3




                  yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
                  – David
                  Nov 22 '11 at 0:34






                • 2




                  sudo apt-get install libxml-xpath-perl
                  – Andrew Wagner
                  Nov 23 '12 at 12:37



























                19














                You can use the xpath utility. It's installed with Perl's XML::XPath package.



                Usage:



                /usr/bin/xpath [filename] query


                or XMLStarlet. To install it on openSUSE use:



                sudo zypper install xmlstarlet


                or try cnf xml on other platforms.
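
                For the title example from the question, the XMLStarlet call might look like this (a sketch; the namespace binding is only needed if the file declares the XHTML namespace):

                xmlstarlet sel -t -v '/html/head/title' xhtmlfile.xhtml
                # with the XHTML default namespace declared in the file:
                xmlstarlet sel -N x='http://www.w3.org/1999/xhtml' -t -v '/x:html/x:head/x:title' xhtmlfile.xhtml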






                answered Apr 24 '12 at 15:03
                Grisha



















                • 5




                  Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
                  – Bruno von Paris
                  Feb 8 '13 at 15:26










                • On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
                  – tripleee
                  Jul 27 '16 at 8:47






                • 2




                  On Ubuntu/Debian apt-get install xmlstarlet
                  – rubo77
                  Dec 24 '16 at 0:48



























                7














                This is sufficient...



                xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
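
                Note that the calling convention varies between xpath implementations; with the Debian/Ubuntu xpath from libxml-xpath-perl, the equivalent would typically be:

                xpath -q -e '/html/head/title/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt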





                answered Jan 5 '15 at 10:33
                teknopaul





















                • Thanks, quick and did the job for me
                  – Miguel Mota
                  May 18 '16 at 23:03



























                5














                Starting from chad's answer, here is a COMPLETE working solution to parse XML, with proper handling of comments, using just 2 little functions (more than 2, but you can mix them all). I'm not saying chad's version didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit more careful to handle comments and misplaced spaces/CR/TAB/etc.



                The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone who needs to parse XML without complex tools written in Perl, Python or anything else. As for me, I cannot install cpan or Perl modules on the old production OS I'm working on, and Python isn't available.



                First, a definition of the XML terms used in this post:



                <!-- comment... -->
                <tag attribute="value">content...</tag>


                EDIT: updated functions, which now handle:




                • Websphere xml (xmi and xmlns attributes)

                • must have a compatible terminal with 256 colors

                • 24 shades of grey

                • compatibility added for IBM AIX bash 3.2.16(1)


                The functions: first is xml_read_dom, which is called repeatedly by xml_read:



                xml_read_dom() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
                local ENTITY IFS=\>
                if $ITSACOMMENT; then
                read -d \< COMMENTS
                COMMENTS="$(rtrim "${COMMENTS}")"
                return 0
                else
                read -d \< ENTITY CONTENT
                CR=$?
                [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
                TAG_NAME=${ENTITY%%[[:space:]]*}
                [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
                TAG_NAME=${TAG_NAME%%:*}
                ATTRIBUTES=${ENTITY#*[[:space:]]}
                ATTRIBUTES="${ATTRIBUTES//xmi:/}"
                ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
                fi

                # when comments sticks to !-- :
                [ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

                # http://tldp.org/LDP/abs/html/string-manipulation.html
                # INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
                [ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
                return $CR
                }


                and the second one :



                xml_read() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
                ITSACOMMENT=false
                local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
                local TMP LOG LOGG
                LIGHT=false
                FORCE_PRINT=false
                XAPPLY=false
                MULTIPLE_ATTR=false
                XAPPLIED_COLOR=g
                TAGPRINTED=false
                GETCONTENT=false
                PROSTPROCESS=cat
                Debug=${Debug:-false}
                TMP=/tmp/xml_read.$RANDOM
                USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                ${nn[2]} -c = NOCOLOR${END}
                ${nn[2]} -d = Debug${END}
                ${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
                ${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
                ${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
                ${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

                ! (($#)) && echo2 "$USAGE" && return 99
                (( $# < 2 )) && ERROR nbaram 2 0 && return 99
                # getopts:
                while getopts :cdlpx:a: _OPT 2>/dev/null
                do
                {
                case ${_OPT} in
                c) PROSTPROCESS="${DECOLORIZE}" ;;
                d) local Debug=true ;;
                l) LIGHT=true; XAPPLIED_COLOR=END ;;
                p) FORCE_PRINT=true ;;
                x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
                a) XATTRIBUTE="${OPTARG}" ;;
                *) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
                esac
                }
                done
                shift $((OPTIND - 1))
                unset _OPT OPTARG OPTIND
                [ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

                fileXml=$1
                tag=$2
                (( $# > 2 )) && shift 2 && attributes=$*
                (( $# > 1 )) && MULTIPLE_ATTR=true

                [ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
                $XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
                # nb attributes == 1 because $MULTIPLE_ATTR is false
                [ "${attributes}" == "content" ] && GETCONTENT=true

                while xml_read_dom; do
                # (( CR != 0 )) && break
                (( PIPESTATUS[1] != 0 )) && break

                if $ITSACOMMENT; then
                # oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
                # elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
                if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
                elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
                fi
                $Debug && echo2 "${N}${COMMENTS}${END}"
                elif test "${TAG_NAME}"; then
                if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
                if $GETCONTENT; then
                CONTENT="$(trim "${CONTENT}")"
                test ${CONTENT} && echo "${CONTENT}"
                else
                # eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
                eval local $ATTRIBUTES
                $Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
                if test "${attributes}"; then
                if $MULTIPLE_ATTR; then
                # we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags so then we print the matching tags found
                ! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
                for attribute in ${attributes}; do
                ! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
                if eval test ""$${attribute}""; then
                test "${tag2print}" && ${print} "${tag2print}"
                TAGPRINTED=true; unset tag2print
                if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
                eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
                else
                eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
                fi
                fi
                done
                # this trick prints a newline only if attributes have been printed during the loop:
                $TAGPRINTED && ${print} "\n" && TAGPRINTED=false
                else
                if eval test ""$${attributes}""; then
                if $XAPPLY; then
                eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
                else
                eval echo "$${attributes}" && eval unset ${attributes}
                fi
                fi
                fi
                else
                echo eval $ATTRIBUTES >>$TMP
                fi
                fi
                fi
                fi
                unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
                done < "${fileXml}" | ${PROSTPROCESS}
                # http://mywiki.wooledge.org/BashFAQ/024
                # INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
                if [ -s "$TMP" ]; then
                $FORCE_PRINT && ! $LIGHT && cat $TMP
                # $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
                $FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
                . $TMP
                rm -f $TMP
                fi
                unset ITSACOMMENT
                }


                and lastly, the rtrim, trim and echo2 (to stderr) functions:



                rtrim() {
                local var=$@
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                trim() {
                local var=$@
                var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                echo2() { echo -e "$@" 1>&2; }


                Colorization:



                oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:



                set -a
                TERM=xterm-256color
                case ${UNAME} in
                AIX|SunOS)
                M=$(${print} '\033[1;35m')
                m=$(${print} '\033[0;35m')
                END=$(${print} '\033[0m')
                ;;
                *)
                m=$(tput setaf 5)
                M=$(tput setaf 13)
                # END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
                END=$(${print} '\033[0m')
                ;;
                esac
                # 24 shades of grey:
                for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
                # another way of having an array of 5 shades of grey:
                declare -a colorNums=(238 240 243 248 254)
                for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
                # piped decolorization:
                DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'


                How to load all that stuff:



                Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)



                If not, just copy/paste everything on the command line.
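
                For example, if you keep the functions in a file (a hypothetical path), sourcing it in your shell or script is enough:

                . /path/to/xml_read_functions.sh
                xml_read server.xml title content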



                How does it work:



                xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                -c = NOCOLOR
                -d = Debug
                -l = LIGHT (no "attribute=" printed)
                -p = FORCE PRINT (when no attributes given)
                -x = apply a command on an attribute and print the result instead of the former value, in green color
                (no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

                xml_read server.xml title content # print content between <title></title>
                xml_read server.xml Connector port # print all port values from Connector tags
                xml_read server.xml any port # print all port values from any tags


                With Debug mode (-d) comments and parsed attributes are printed to stderr
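
                For instance, reusing the server.xml example above, a debug run would look like:

                xml_read -d server.xml Connector port    # port values on stdout, comments and parsed attributes traced on stderr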





























                • I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
                  – khmarbaise
                  Mar 5 '14 at 8:37










                • Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
                  – khmarbaise
                  Mar 5 '14 at 8:47










                • sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
                  – scavenger
                  Apr 8 '15 at 3:36


















                starting from the chad's answer, here is the COMPLETE working solution to parse UML, with propper handling of comments, with just 2 little functions (more than 2 bu you can mix them all). I don't say chad's one didn't work at all, but it had too much issues with badly formated XML files: So you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.



                The purpose of this answer is to give ready-2-use, out of the box bash functions to anyone needing parsing UML without complex tools using perl, python or anything else. As for me, I cannot install cpan, nor perl modules for the old production OS i'm working on, and python isn't available.



                First, a definition of the UML words used in this post:



                <!-- comment... -->
                <tag attribute="value">content...</tag>


                EDIT: updated functions, with handle of:




                • Websphere xml (xmi and xmlns attributes)

                • must have a compatible terminal with 256 colors

                • 24 shades of grey

                • compatibility added for IBM AIX bash 3.2.16(1)


                The functions, first is the xml_read_dom which's called recursively by xml_read:



                xml_read_dom() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
                local ENTITY IFS=>
                if $ITSACOMMENT; then
                read -d < COMMENTS
                COMMENTS="$(rtrim "${COMMENTS}")"
                return 0
                else
                read -d < ENTITY CONTENT
                CR=$?
                [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
                TAG_NAME=${ENTITY%%[[:space:]]*}
                [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
                TAG_NAME=${TAG_NAME%%:*}
                ATTRIBUTES=${ENTITY#*[[:space:]]}
                ATTRIBUTES="${ATTRIBUTES//xmi:/}"
                ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
                fi

                # when comments sticks to !-- :
                [ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

                # http://tldp.org/LDP/abs/html/string-manipulation.html
                # INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
                [ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
                return $CR
                }


                and the second one :



                xml_read() {
                # https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
                ITSACOMMENT=false
                local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
                local TMP LOG LOGG
                LIGHT=false
                FORCE_PRINT=false
                XAPPLY=false
                MULTIPLE_ATTR=false
                XAPPLIED_COLOR=g
                TAGPRINTED=false
                GETCONTENT=false
                PROSTPROCESS=cat
                Debug=${Debug:-false}
                TMP=/tmp/xml_read.$RANDOM
                USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                ${nn[2]} -c = NOCOLOR${END}
                ${nn[2]} -d = Debug${END}
                ${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
                ${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
                ${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
                ${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

                ! (($#)) && echo2 "$USAGE" && return 99
                (( $# < 2 )) && ERROR nbaram 2 0 && return 99
                # getopts:
                while getopts :cdlpx:a: _OPT 2>/dev/null
                do
                {
                case ${_OPT} in
                c) PROSTPROCESS="${DECOLORIZE}" ;;
                d) local Debug=true ;;
                l) LIGHT=true; XAPPLIED_COLOR=END ;;
                p) FORCE_PRINT=true ;;
                x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
                a) XATTRIBUTE="${OPTARG}" ;;
                *) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
                esac
                }
                done
                shift $((OPTIND - 1))
                unset _OPT OPTARG OPTIND
                [ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

                fileXml=$1
                tag=$2
                (( $# > 2 )) && shift 2 && attributes=$*
                (( $# > 1 )) && MULTIPLE_ATTR=true

                [ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
                $XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
                # nb attributes == 1 because $MULTIPLE_ATTR is false
                [ "${attributes}" == "content" ] && GETCONTENT=true

                while xml_read_dom; do
                # (( CR != 0 )) && break
                (( PIPESTATUS[1] != 0 )) && break

                if $ITSACOMMENT; then
                # oh wait it doesn't work on IBM AIX bash 3.2.16(1):
                # if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
                # elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
                if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
                elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
                fi
                $Debug && echo2 "${N}${COMMENTS}${END}"
                elif test "${TAG_NAME}"; then
                if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
                if $GETCONTENT; then
                CONTENT="$(trim "${CONTENT}")"
                test ${CONTENT} && echo "${CONTENT}"
                else
                # eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
                eval local $ATTRIBUTES
                $Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
                if test "${attributes}"; then
                if $MULTIPLE_ATTR; then
                # we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
                ! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
                for attribute in ${attributes}; do
                ! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
                if eval test ""$${attribute}""; then
                test "${tag2print}" && ${print} "${tag2print}"
                TAGPRINTED=true; unset tag2print
                if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
                eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
                else
                eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
                fi
                fi
                done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
                else
                if eval test ""$${attributes}""; then
                if $XAPPLY; then
                eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
                else
                eval echo "$${attributes}" && eval unset ${attributes}
                fi
                fi
                fi
                else
                echo eval $ATTRIBUTES >>$TMP
                fi
                fi
                fi
                fi
                unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
                done < "${fileXml}" | ${PROSTPROCESS}
                # http://mywiki.wooledge.org/BashFAQ/024
                # INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
                if [ -s "$TMP" ]; then
                $FORCE_PRINT && ! $LIGHT && cat $TMP
                # $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
                . $TMP
                rm -f $TMP
                fi
                unset ITSACOMMENT
                }


                and lastly, the rtrim, trim and echo2 (to stderr) functions:



                rtrim() {
                local var=$@
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                trim() {
                local var=$@
                var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
                var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
                echo -n "$var"
                }
                echo2() { echo -e "$@" 1>&2; }


                Colorization:



                oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:



                set -a
                TERM=xterm-256color
                case ${UNAME} in
                AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
                ;;
                *)
                m=$(tput setaf 5)
                M=$(tput setaf 13)
                # END=$(tput sgr0) # issue on Linux: it can produces ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
                ;;
                esac
                # 24 shades of grey:
                for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
                # another way of having an array of 5 shades of grey:
                declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
                # piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'


                How to load all that stuff:



                Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)



                If not, just copy/paste everything on the command line.
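
A third option, if you prefer keeping them in a file (the name xml_read.sh below is only an example, not something from the original functions), is to source that file into your current shell:

. /path/to/xml_read.sh    # loads xml_read, xml_read_dom and the helper functions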



                How does it work:



                xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
                -c = NOCOLOR
                -d = Debug
                -l = LIGHT (no "attribute=" printed)
                -p = FORCE PRINT (when no attributes given)
                -x = apply a command on an attribute and print the result instead of the former value, in green color
                (no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

                xml_read server.xml title content # print content between <title></title>
                xml_read server.xml Connector port # print all port values from Connector tags
                xml_read server.xml any port # print all port values from any tags


                With Debug mode (-d) comments and parsed attributes are printed to stderr
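
As a quick sanity check (the snippet below is an assumed example file, not part of the original answer), given a server.xml containing

<Connector port="8080" protocol="HTTP/1.1" redirectPort="8443"/>

xml_read server.xml Connector port should simply print 8080, and xml_read server.xml any port would print the port value of every tag that carries one.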






                share|improve this answer























                edited Feb 1 at 16:40









                peterh

                6,095154667














                answered Jan 29 '14 at 12:44









                scavenger

                13515
















                • I'm trying to use the above two functions which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0?
                  – khmarbaise
                  Mar 5 '14 at 8:37










                • Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
                  – khmarbaise
                  Mar 5 '14 at 8:47










                • sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
                  – scavenger
                  Apr 8 '15 at 3:36




















                4














I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.



My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt (the -t option gives you the result as text instead of XML).






                share|improve this answer


























                    answered May 21 '09 at 15:43









                    mirod

                    14.4k33762



























                        4














                        Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
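
As a rough sketch of how that output can be combined with standard tools (the element path below assumes a plain XHTML structure; adjust it to your file):

xml2 < xhtmlfile.xhtml | grep '^/html/head/title=' | cut -d= -f2- > titleOfXHTMLPage.txt

xml2 prints one /path/to/element=value (or /path/to/element/@attribute=value) line per node, which is what makes the grep above possible.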






                        share|improve this answer


























                            answered Nov 7 '09 at 15:31









                            simon04

                            1,4091215



























                                4














Another command line tool is my new Xidel. It also supports XPath 2 and XQuery, unlike the already mentioned xpath/xmlstarlet.



                                The title can be read like:



                                xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt


                                And it also has a cool feature to export multiple variables to bash. For example



                                eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )


                                sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.
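
(After that eval they are ordinary shell variables, so for example echo "Title: $title, images: $imgcount" works as expected.)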






                                share|improve this answer





















                                answered Mar 27 '13 at 0:27









                                BeniBela

                                12.4k32440
















                                • This is exactly what I needed! :)
                                  – Thomas Daugaard
                                  Oct 18 '13 at 9:04


















                                2














Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.
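
Invocation differs between versions of that script, so treat the flags below as an assumption and check xpath --help or the man page on your system first; on many current installations something like this works:

xpath -q -e '/html/head/title/text()' xhtmlfile.xhtml > titleOfXHTMLPage.txt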






                                share|improve this answer


























                                    answered May 21 '09 at 15:39









                                    alamar

                                    10.7k24873



























                                        2














After some research on translating between Linux and Windows formats of file paths in XML files, I found interesting tutorials and solutions on:




• General information about XPath

                                        • Amara - collection of Pythonic tools for XML

                                        • Develop Python/XML with 4Suite (2 parts)






                                        share|improve this answer


























                                            answered Oct 24 '10 at 1:00









                                            user485380

                                            291



























                                                2






                                                While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.



                                                Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.



                                                Example 1



                                                #!/usr/bin/env python
                                                import sys
                                                from lxml import etree

                                                tree = etree.parse(sys.argv[1])
                                                xpath_expression = sys.argv[2]

                                                # a hack allowing to access the
                                                # default namespace (if defined) via the 'p:' prefix
                                                # E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
                                                # an XPath of '//p:module' will return all the 'module' nodes
                                                ns = tree.getroot().nsmap
                                                if ns.keys() and None in ns:
                                                ns['p'] = ns.pop(None)
                                                # end of hack

                                                for e in tree.xpath(xpath_expression, namespaces=ns):
                                                if isinstance(e, str):
                                                print(e)
                                                else:
                                                print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))


                                                lxml can be installed with pip install lxml. On ubuntu you can use sudo apt install python-lxml.



                                                Usage



                                                python xpath.py myfile.xml "//mynode"


                                                lxml also accepts a URL as input:



                                                python xpath.py http://www.feedforall.com/sample.xml "//link"



                                                Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.






                                                Example 2



                                                A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:



                                                pom.xml:



                                                <?xml version="1.0" encoding="UTF-8"?>
                                                <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
                                                <modules>
                                                <module>cherries</module>
                                                <module>bananas</module>
                                                <module>pears</module>
                                                </modules>
                                                </project>


                                                module_extractor.py:



                                                from lxml import etree
                                                for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
                                                print(e.text)





                                                share|improve this answer














                                                While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.



                                                Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.



                                                Example 1



                                                #!/usr/bin/env python
                                                import sys
                                                from lxml import etree

                                                tree = etree.parse(sys.argv[1])
                                                xpath_expression = sys.argv[2]

                                                # a hack allowing to access the
                                                # default namespace (if defined) via the 'p:' prefix
                                                # E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
                                                # an XPath of '//p:module' will return all the 'module' nodes
                                                ns = tree.getroot().nsmap
                                                if ns.keys() and None in ns:
                                                ns['p'] = ns.pop(None)
                                                # end of hack

                                                for e in tree.xpath(xpath_expression, namespaces=ns):
                                                if isinstance(e, str):
                                                print(e)
                                                else:
                                                print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))


                                                lxml can be installed with pip install lxml. On ubuntu you can use sudo apt install python-lxml.



                                                Usage



                                                python xpath.py myfile.xml "//mynode"


                                                lxml also accepts a URL as input:



                                                python xpath.py http://www.feedforall.com/sample.xml "//link"



                                                Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.






                                                Example 2



                                                A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:



                                                pom.xml:



                                                <?xml version="1.0" encoding="UTF-8"?>
                                                <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
                                                <modules>
                                                <module>cherries</module>
                                                <module>bananas</module>
                                                <module>pears</module>
                                                </modules>
                                                </project>


                                                module_extractor.py:



                                                from lxml import etree
                                                for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
                                                print(e.text)






                                                share|improve this answer














                                                share|improve this answer



                                                share|improve this answer








                                                edited May 18 at 22:19

























                                                answered Oct 25 '17 at 14:53









                                                ccpizza

                                                12.1k48183
















                                                • This is awesome when you either want to avoid installing extra packages or don't have access to. On a build machine, I can justify an extra pip install over apt-get or yum call. Thanks!
                                                  – E. Moffat
                                                  Oct 31 at 0:12





























                                                0














Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and the variable assignments, so that:



rdom () { local IFS=\> ; read -d \< E C ;}


                                                becomes:



rdom () { local IFS=\< ; read -d \> C E ;}


                                                If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
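A minimal sketch of how the inverted variant might be used to pull the title out of the question's xhtmlfile.xhtml (the loop is not part of the original answer; note that with this ordering the element text arrives in C together with the closing tag in E):

rdom () { local IFS=\< ; read -d \> C E ;}

while rdom; do
    if [[ $E == "/title" ]]; then
        echo "$C"    # the title text was read just before its closing tag
        break
    fi
done < xhtmlfile.xhtml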






                                                share|improve this answer


























                                                    answered Jan 24 '13 at 0:46









                                                    michaelmeyer

                                                    4,41232129



























                                                        0














This works if you want XML attributes: the sed strips the leading element name and the trailing />, leaving only the key="value" pairs, which can then be sourced as shell variable assignments (only suitable for simple, single-element input whose attribute values are safe to evaluate as shell code):



                                                        $ cat alfa.xml
                                                        <video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>

                                                        $ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh

                                                        $ . ./alfa.sh

                                                        $ echo "$stream"
                                                        H264_400.mp4
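Continuing the same session, the other attributes of alfa.xml become shell variables as well, which illustrates what sourcing the generated alfa.sh does:

$ echo "$server" "$cdn"
asdf.com limelight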





                                                        share|improve this answer




























                                                            edited Jan 3 '17 at 19:34

























                                                            answered Jun 16 '12 at 16:53









                                                            Steven Penny

                                                            1



























                                                                -1














                                                                Introduction



Thank you very much for the earlier answers. The question title is ambiguous: it asks how to parse XML, while the asker actually wants to parse XHTML. Though similar, the two are definitely not the same, which made it hard to produce a solution that matches exactly what was asked; I hope the one below will still do. I admit I could not work out how to target /html/head/title specifically. That said, I am not satisfied with some of the earlier answers, since several of them reinvent the wheel even though the asker never said installing a package was forbidden. I want to repeat what a person in this thread already said: Just because you can write your own parser, doesn't mean you should - @Stephen Niedzielski. As a rule, prefer the simplest and shortest approach and never make anything more complex than needed. The solution has been tested successfully on Windows 10 > Windows Subsystem for Linux > Ubuntu. It could pick the wrong element if another title element exists and is selected first, for example if the <body> tags come before the <head> tags and contain a <title> tag, but that is very, very unlikely.



                                                                TLDR/Solution



For the general approach, thanks to @Grisha and @Nat, How to parse XML in Bash?



For removing the XML tags, thanks to @Johnsyweb, How to remove XML tags from Unix command line?



1. Install the xmlstarlet package.

2. Execute in bash (see the annotated example below):

xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt
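A minimal sketch of both steps, assuming a Debian/Ubuntu system (the installer command and package name may differ on other distributions):

# install the xmlstarlet command-line XML toolkit
sudo apt install xmlstarlet

# print every title element regardless of its default namespace,
# keep the first match and strip the remaining tags
xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt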






                                                                share|improve this answer


























                                                                    answered Oct 15 at 11:10









                                                                    propatience

                                                                    414



