How to parse XML in Bash?
Ideally, what I would like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -E 's%^<title>|</title>$%%g' > titleOfXHTMLPage.txt
xml bash xhtml shell xpath
1
unix.stackexchange.com/questions/83385/… || superuser.com/questions/369996/…
– Ciro Santilli 新疆改造中心 六四事件 法轮功
Oct 7 '15 at 10:57
edited May 29 '14 at 3:30 by Steven Penny
asked May 21 '09 at 15:36 by asdfasdfasdf
15 Answers
This is really just an explanation of Yuzem's answer, but I didn't feel this much editing should be done to someone else's answer, and comments don't allow formatting, so...
rdom () { local IFS=\> ; read -d \< E C ;}
Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
}
Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of it automatically being split on space, tab or newlines, it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:
<tag>value</tag>
The first call to read_dom
gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag
and CONTENT=value
. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag
and CONTENT=
. The fourth call will return a non-zero status because we've reached the end of file.
Now his while loop cleaned up a bit to match the above:
while read_dom; do
if [[ $ENTITY = "title" ]]; then
echo "$CONTENT"
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
Now given the following (similar to what you get from listing a bucket on S3) for input.xml
:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>sth-items</Name>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>item-apple-iso@2x.png</Key>
<LastModified>2011-07-25T22:23:04.000Z</LastModified>
<ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
<Size>1785</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
and the following loop:
while read_dom; do
echo "$ENTITY => $CONTENT"
done < input.xml
You should get:
=>
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => item-apple-iso@2x.png
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>
So if we wrote a while
loop like Yuzem's:
while read_dom; do
if [[ $ENTITY = "Key" ]] ; then
echo "$CONTENT"
fi
done < input.xml
We'd get a listing of all the files in the S3 bucket.
EDIT
If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:
read_dom () {
ORIGINAL_IFS=$IFS
IFS=\>
read -d \< ENTITY CONTENT
IFS=$ORIGINAL_IFS
}
Otherwise, any line splitting you do later in the script will be messed up.
EDIT 2
To split out attribute name/value pairs you can augment the read_dom()
like so:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local ret=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $ret
}
Then write your function to parse and get the data you want like this:
parse_dom () {
if [[ $TAG_NAME = "foo" ]] ; then
eval local $ATTRIBUTES
echo "foo size is: $size"
elif [[ $TAG_NAME = "bar" ]] ; then
eval local $ATTRIBUTES
echo "bar type is: $type"
fi
}
Then while you read_dom
call parse_dom
:
while read_dom; do
parse_dom
done
Then given the following example markup:
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
You should get this output:
$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
EDIT 3: Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local RET=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $RET
}
I don't see any reason why that shouldn't work.
2
If you make IFS (the input field separator) global you should reset it back to its original value at the end; I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
– chad
Jul 25 '12 at 15:24
3
cool answer !!!!
– mtk
Oct 23 '12 at 20:15
21
Just because you can write your own parser, doesn't mean you should.
– Stephen Niedzielski
Apr 23 '13 at 21:49
2
@Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
– chad
Oct 11 '13 at 14:27
5
Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=\< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
– William Pursell
Nov 27 '13 at 16:47
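A minimal sketch of that per-invocation scoping, applied to read_dom (assuming bash, since read -d is a bashism):

```shell
# Same behavior as read_dom above, but IFS is set only for this one read call,
# so there is no need for 'local' or for restoring a global IFS afterwards.
read_dom () { IFS=\> read -d \< ENTITY CONTENT ;}

printf '<tag>value</tag>' | {
  read_dom    # 1st read: the empty chunk before the first '<'
  read_dom    # 2nd read: 'tag>value', split on '>' into ENTITY and CONTENT
  echo "ENTITY=$ENTITY CONTENT=$CONTENT"   # -> ENTITY=tag CONTENT=value
}
```

This mirrors the common IFS= read -r line idiom for reading whole lines.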
You can do that very easily using only bash.
You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read but for html documents.
When called rdom will assign the element to variable E and the content to var C.
For example, to do what you wanted to do:
while rdom; do
if [[ $E = title ]]; then
echo "$C"
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
– Alex Gray
Jul 4 '11 at 2:14
1
alex, I clarified Yuzem's answer below...
– chad
Aug 13 '11 at 20:04
1
Cred to the original - this one-liner is so freakin' elegant and amazing.
– maverick
Dec 5 '13 at 22:06
1
great hack, but I had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding)
– user311174
Jan 16 '14 at 10:32
3
Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you do not have much time, but it can never be called a good solution.
– peterh
Feb 1 at 16:34
Command-line tools that can be called from shell scripts include:
- 4xpath - command-line wrapper around Python's 4Suite package
- XMLStarlet
- xpath - command-line wrapper around Perl's XPath library
- Xidel - works with URLs as well as files; also works with JSON
I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
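As a concrete sketch of the xmllint route (assuming a libxml2 xmllint recent enough to have --xpath; the sample file below is made up for the demo):

```shell
# Hypothetical sample input for the demo
cat > xhtmlfile.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>My Title</title></head><body/></html>
EOF

# string(...) returns just the text content; local-name() matches <title>
# regardless of the XHTML default namespace, which plain /html/head/title would miss.
xmllint --xpath 'string(//*[local-name()="title"])' xhtmlfile.xhtml > titleOfXHTMLPage.txt
cat titleOfXHTMLPage.txt   # -> My Title
```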
2
Where can I download 'xpath' or '4xpath' from ?
– Opher
Apr 15 '11 at 14:47
3
yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
– David
Nov 22 '11 at 0:34
2
sudo apt-get install libxml-xpath-perl
– Andrew Wagner
Nov 23 '12 at 12:37
You can use xpath utility. It's installed with the Perl XML-XPath package.
Usage:
/usr/bin/xpath [filename] query
or XMLStarlet. To install it on opensuse use:
sudo zypper install xmlstarlet
or try cnf xml
on other platforms.
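For example, to solve the original question with XMLStarlet (a sketch; the -N flag binds the XHTML default namespace to a prefix so the XPath can match, and the sample file is made up for the demo):

```shell
# Hypothetical sample input for the demo
cat > xhtmlfile.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>My Title</title></head><body/></html>
EOF

# sel = select mode; -t -v = output the value of the XPath expression
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" \
  -t -v '/x:html/x:head/x:title' xhtmlfile.xhtml > titleOfXHTMLPage.txt
cat titleOfXHTMLPage.txt   # -> My Title
```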
5
Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
– Bruno von Paris
Feb 8 '13 at 15:26
On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
– tripleee
Jul 27 '16 at 8:47
2
On Ubuntu/Debian: apt-get install xmlstarlet
– rubo77
Dec 24 '16 at 0:48
This is sufficient...
xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
Thanks, quick and did the job for me
– Miguel Mota
May 18 '16 at 23:03
Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, with just 2 little functions (more than 2, but you can mix them all). I don't say chad's one didn't work at all, but it had too many issues with badly formatted XML files: so you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools using perl, python or anything else. As for me, I cannot install cpan nor perl modules on the old production OS I'm working on, and python isn't available.
First, a definition of the XML words used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handling of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions: first is xml_read_dom, which is called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "\${${XAPPLIED_COLOR}}\$($XCOMMAND \$${attribute})${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "${g}\$($XCOMMAND \$${attributes})" && eval unset ${attributes}
else
eval echo "\$${attributes}" && eval unset ${attributes}
fi
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
Colorization:
oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0 ?
– khmarbaise
Mar 5 '14 at 8:37
Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.
My XML::Twig Perl module comes with such a tool: xml_grep
, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt
(the -t
option gives you the result as text instead of xml)
Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
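A sketch of what that looks like (exact output may vary by version, but xml2 emits one path=value line per text node, which grep/sed then handle easily):

```shell
# Flatten the document to path=value lines, then pick out the one we want.
printf '<html><head><title>My Title</title></head></html>' | xml2
# expected to print something like:
#   /html/head/title=My Title

# So the title extraction from the question becomes:
printf '<html><head><title>My Title</title></head></html>' \
  | xml2 | sed -n 's%^/html/head/title=%%p'
```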
Another command-line tool is my new Xidel. It also supports XPath 2 and XQuery, unlike the already mentioned xpath/xmlstarlet.
The title can be read like:
xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
And it also has a cool feature to export multiple variables to bash. For example
eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )
sets $title
to the title and $imgcount
to the number of images in the file, which should be as flexible as parsing it directly in bash.
This is exactly what I needed! :)
– Thomas Daugaard
Oct 18 '13 at 9:04
Well, you can use xpath utility. I guess perl's XML::Xpath contains it.
After some research for translation between Linux and Windows formats of the file paths in XML files I found interesting tutorials and solutions on:
- General information about XPaths
- Amara - collection of Pythonic tools for XML
- Develop Python/XML with 4Suite (2 parts)
While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.
Here is a python script which uses lxml
for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.
Example 1
#!/usr/bin/env python
import sys
from lxml import etree
tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]
# a hack allowing to access the
# default namespace (if defined) via the 'p:' prefix
# E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
# an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
ns['p'] = ns.pop(None)
# end of hack
for e in tree.xpath(xpath_expression, namespaces=ns):
if isinstance(e, str):
print(e)
else:
print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))
lxml
can be installed with pip install lxml
. On ubuntu you can use sudo apt install python-lxml
.
Usage
python xpath.py myfile.xml "//mynode"
lxml
also accepts a URL as input:
python xpath.py http://www.feedforall.com/sample.xml "//link"
Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.
Example 2
A one-off script which serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modules>
<module>cherries</module>
<module>bananas</module>
<module>pears</module>
</modules>
</project>
module_extractor.py:
from lxml import etree
for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
print(e.text)
This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
– E. Moffat
Oct 31 at 0:12
add a comment |
Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and the variable assignments, so that:
rdom () { local IFS=\> ; read -d \< E C ;}
becomes:
rdom () { local IFS=\< ; read -d \> C E ;}
If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while
loop.
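A quick demonstration of the difference, assuming bash (read -d is a bashism): with the reversed version, the closing tag of the last element still comes through before read hits end-of-file.

```shell
# Reversed version: split on '<', stop reading at '>'
rdom () { local IFS=\< ; read -d \> C E ;}

while rdom; do
  printf 'E=%s C=%s\n' "$E" "$C"
done <<'EOF'
<tag>value</tag>
EOF
# -> E=tag C=
# -> E=/tag C=value   (the closing tag is reached)
```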
This works if you are wanting XML attributes:
$ cat alfa.xml
<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>
$ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
$ . ./alfa.sh
$ echo "$stream"
H264_400.mp4
Introduction
Thank you very much for the earlier answers. The question headline is very ambiguous: it asks how to parse xml, when what the asker actually wants is to parse xhtml. Though they are similar, they are definitely not the same, and since xml and xhtml are not the same it was very hard to come up with a solution that does exactly what was asked for. I hope the solution below will still do; I'll admit I couldn't find out how to look specifically for /html/head/title. That said, I'm not satisfied with the earlier answers, since some of the answerers are reinventing the wheel unnecessarily when the asker never said it was forbidden to download a package. I don't understand the unnecessary coding at all. I specifically want to repeat what a person in this thread already said: "Just because you can write your own parser, doesn't mean you should" - @Stephen Niedzielski. Regarding programming: the easiest and shortest way is as a rule to be preferred; never make anything more complex than needed. The solution has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. It's possible that if another title element existed and were selected, the result would be wrong, sorry for that possibility: for example, if the <body> tags came before the <head> tags and contained a <title> tag, but that's very, very unlikely.
TLDR/Solution
On general path for solution, thank you @Grisha, @Nat, How to parse XML in Bash?
On removing xml tags, thank you @Johnsyweb, How to remove XML tags from Unix command line?
1. Install the "package" xmlstarlet
2. Execute in bash xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f893585%2fhow-to-parse-xml-in-bash%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
15 Answers
15
active
oldest
votes
15 Answers
15
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...
rdom () { local IFS=> ; read -d < E C ;}
Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:
read_dom () {
local IFS=>
read -d < ENTITY CONTENT
}
Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:
<tag>value</tag>
The first call to read_dom
get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag
and CONTENT=value
. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag
and CONTENT=
. The fourth call will return a non-zero status because we've reached the end of file.
Now his while loop cleaned up a bit to match the above:
while read_dom; do
if [[ $ENTITY = "title" ]]; then
echo $CONTENT
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom
function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
Now given the following (similar to what you get from listing a bucket on S3) for input.xml
:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>sth-items</Name>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>item-apple-iso@2x.png</Key>
<LastModified>2011-07-25T22:23:04.000Z</LastModified>
<ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
<Size>1785</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
and the following loop:
while read_dom; do
echo "$ENTITY => $CONTENT"
done < input.xml
You should get:
=>
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => item-apple-iso@2x.png
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>
So if we wrote a while
loop like Yuzem's:
while read_dom; do
if [[ $ENTITY = "Key" ]] ; then
echo $CONTENT
fi
done < input.xml
We'd get a listing of all the files in the S3 bucket.
EDIT
If for some reason local IFS=>
doesn't work for you and you set it globally, you should reset it at the end of the function like:
read_dom () {
ORIGINAL_IFS=$IFS
IFS=>
read -d < ENTITY CONTENT
IFS=$ORIGINAL_IFS
}
Otherwise, any line splitting you do later in the script will be messed up.
EDIT 2
To split out attribute name/value pairs you can augment the read_dom()
like so:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}
Then write your function to parse and get the data you want like this:
parse_dom () {
if [[ $TAG_NAME = "foo" ]] ; then
eval local $ATTRIBUTES
echo "foo size is: $size"
elif [[ $TAG_NAME = "bar" ]] ; then
eval local $ATTRIBUTES
echo "bar type is: $type"
fi
}
Then while you read_dom, call parse_dom:
while read_dom; do
parse_dom
done
Then given the following example markup:
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
You should get this output:
$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
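For reference, here is the whole EDIT 2 pipeline as one self-contained script, with the example markup inlined as a here-document (keep in mind that eval on attribute text is unsafe on untrusted input):

```shell
#!/usr/bin/env bash
read_dom () {
  local IFS=\>
  read -d \< ENTITY CONTENT
  local ret=$?
  TAG_NAME=${ENTITY%% *}      # word before the first space
  ATTRIBUTES=${ENTITY#* }     # everything after the first space
  return $ret
}

parse_dom () {
  if [[ $TAG_NAME = "foo" ]]; then
    eval local "$ATTRIBUTES"            # sets $size, $type from the tag
    echo "foo size is: $size"
  elif [[ $TAG_NAME = "bar" ]]; then
    eval local "$ATTRIBUTES"
    echo "bar type is: $type"
  fi
}

output=$(while read_dom; do parse_dom; done <<'XML'
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
XML
)
printf '%s\n' "$output"
```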
EDIT 3: Another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}
I don't see any reason why that shouldn't work.
2
If you make IFS (the input field separator) global you should reset it back to its original value at the end, I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
– chad
Jul 25 '12 at 15:24
3
cool answer !!!!
– mtk
Oct 23 '12 at 20:15
21
Just because you can write your own parser, doesn't mean you should.
– Stephen Niedzielski
Apr 23 '13 at 21:49
2
@Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
– chad
Oct 11 '13 at 14:27
5
Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
– William Pursell
Nov 27 '13 at 16:47
edited May 23 '17 at 11:55
Community♦
answered Aug 13 '11 at 17:36
chad
You can do that very easily using only bash.
You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read but for html documents.
When called rdom will assign the element to variable E and the content to var C.
For example, to do what you wanted to do:
while rdom; do
if [[ $E = title ]]; then
echo $C
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
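A quick way to try it without touching files on disk (the markup here is made-up sample data; note the backslash-escaped \> and \<, without which the shell would treat them as redirections):

```shell
#!/usr/bin/env bash
rdom () { local IFS=\> ; read -d \< E C ;}

title=$(printf '<html><head><title>Page Title</title></head></html>' |
  while rdom; do
    if [[ $E = title ]]; then
      printf '%s\n' "$C"
      break
    fi
  done)
echo "$title"
```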
could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
– Alex Gray
Jul 4 '11 at 2:14
1
alex, I clarified Yuzem's answer below...
– chad
Aug 13 '11 at 20:04
1
Cred to the original - this one-liner is so freakin' elegant and amazing.
– maverick
Dec 5 '13 at 22:06
1
great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding)
– user311174
Jan 16 '14 at 10:32
3
Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you don't have much time, but it can't be called a good solution ever.
– peterh
Feb 1 at 16:34
answered Apr 9 '10 at 14:13
Yuzem
Command-line tools that can be called from shell scripts include:
- 4xpath - command-line wrapper around Python's 4Suite package
- XMLStarlet
- xpath - command-line wrapper around Perl's XPath library
- Xidel - Works with URLs as well as files. Also works with JSON
I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
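For instance, a minimal sketch of the xsltproc route: a hypothetical title.xsl stylesheet that emits only the title's text, run against a made-up input file (guarded so it is a no-op where xsltproc is not installed):

```shell
#!/usr/bin/env bash
# Hypothetical sample input (no XHTML namespace, to keep the XPath plain).
cat > xhtmlfile.xhtml <<'XHTML'
<html><head><title>My Page</title></head><body></body></html>
XHTML

# Tiny stylesheet: text output, value of /html/head/title only.
cat > title.xsl <<'XSL'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="/html/head/title"/>
  </xsl:template>
</xsl:stylesheet>
XSL

if command -v xsltproc >/dev/null; then
  xsltproc title.xsl xhtmlfile.xhtml   # prints: My Page
fi
```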
2
Where can I download 'xpath' or '4xpath' from ?
– Opher
Apr 15 '11 at 14:47
3
yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
– David
Nov 22 '11 at 0:34
2
sudo apt-get install libxml-xpath-perl
– Andrew Wagner
Nov 23 '12 at 12:37
edited Jan 10 '17 at 21:10
RustyTheBoyRobot
answered May 21 '09 at 18:18
Nat
You can use the xpath utility. It's installed with the Perl XML-XPath package.
Usage:
/usr/bin/xpath [filename] query
or XMLStarlet. To install it on openSUSE, use:
sudo zypper install xmlstarlet
or try cnf xml
on other platforms.
5
Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
– Bruno von Paris
Feb 8 '13 at 15:26
On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
– tripleee
Jul 27 '16 at 8:47
2
On Ubuntu/Debian: apt-get install xmlstarlet
– rubo77
Dec 24 '16 at 0:48
edited Jul 6 '12 at 16:19
mtk
answered Apr 24 '12 at 15:03
Grisha
This is sufficient...
xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
Thanks, quick and did the job for me
– Miguel Mota
May 18 '16 at 23:03
answered Jan 5 '15 at 10:33
teknopaul
Starting from Chad's answer, here is a COMPLETE working solution to parse XML, with proper handling of comments, using just two little functions (more than two, but you can mix them all). I'm not saying Chad's version didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit trickier to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone who needs to parse XML without complex tools in Perl, Python or anything else. In my case, I cannot install CPAN or Perl modules on the old production OS I'm working on, and Python isn't available.
First, a definition of the XML terms used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, which now handle:
- WebSphere XML (xmi and xmlns attributes)
- a terminal with 256-color support is required
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions: first is xml_read_dom, which is called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one:
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test "${CONTENT}" && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags, so we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${!XAPPLIED_COLOR}\"$($XCOMMAND \$${attribute})\"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND \$${attributes})" && eval unset ${attributes}
else
eval echo "\$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
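A quick sanity check of the trim helper (redefined here so the snippet is self-contained; printf replaces echo -n for portability):

```shell
# Pure parameter-expansion trimming: no subshells, no external commands.
trim() {
    local var=$*
    var="${var#"${var%%[![:space:]]*}"}"   # strip leading whitespace
    var="${var%"${var##*[![:space:]]}"}"   # strip trailing whitespace
    printf '%s' "$var"
}

printf '[%s]\n' "$(trim '   hello world   ')"   # → [hello world]
```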
Colorization:
Oh, and you will need some neat colorizing dynamic variables defined first, and exported, too:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
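Stripped of options, colors, and comment handling, the core trick both functions build on fits in a few lines: set IFS to `>` and read up to the next `<`, so each read yields one tag name and its text content. A minimal sketch, with a hypothetical inline document:

```shell
# One tag per iteration: ENTITY gets the tag name, CONTENT the text after it.
read_dom() {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

xml='<html><head><title>Hello</title></head></html>'
while read_dom; do
    if [ "$ENTITY" = "title" ]; then
        echo "$CONTENT"   # → Hello
    fi
done <<< "$xml"
```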
I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0
?
– khmarbaise
Mar 5 '14 at 8:37
Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
add a comment |
starting from the chad's answer, here is the COMPLETE working solution to parse UML, with propper handling of comments, with just 2 little functions (more than 2 bu you can mix them all). I don't say chad's one didn't work at all, but it had too much issues with badly formated XML files: So you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-2-use, out of the box bash functions to anyone needing parsing UML without complex tools using perl, python or anything else. As for me, I cannot install cpan, nor perl modules for the old production OS i'm working on, and python isn't available.
First, a definition of the UML words used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handle of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions, first is the xml_read_dom which's called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=>
if $ITSACOMMENT; then
read -d < COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d < ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test ""$${attribute}""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed durint the loop:
$TAGPRINTED && ${print} "n" && TAGPRINTED=false
else
if eval test ""$${attributes}""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
else
eval echo "$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
Colorization:
oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '33[1;35m')
m=$(${print} '33[0;35m')
END=$(${print} '33[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produces ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '33[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "33[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "33[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
I'm trying to use the above two functions which produces the following:./read_xml.sh: line 22: (-1): substring expression < 0
?
– khmarbaise
Mar 5 '14 at 8:37
Line 22:[ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
add a comment |
starting from the chad's answer, here is the COMPLETE working solution to parse UML, with propper handling of comments, with just 2 little functions (more than 2 bu you can mix them all). I don't say chad's one didn't work at all, but it had too much issues with badly formated XML files: So you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-2-use, out of the box bash functions to anyone needing parsing UML without complex tools using perl, python or anything else. As for me, I cannot install cpan, nor perl modules for the old production OS i'm working on, and python isn't available.
First, a definition of the UML words used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handle of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions, first is the xml_read_dom which's called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=>
if $ITSACOMMENT; then
read -d < COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d < ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test ""$${attribute}""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed durint the loop:
$TAGPRINTED && ${print} "n" && TAGPRINTED=false
else
if eval test ""$${attributes}""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
else
eval echo "$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, using just two little functions (more than two, but you can mix them all). I don't say chad's didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools using perl, python or anything else. As for me, I cannot install cpan nor perl modules for the old production OS I'm working on, and python isn't available.
First, a definition of the XML terms used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handle of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions, first is the xml_read_dom which's called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags, so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "\${${XAPPLIED_COLOR}}\"$($XCOMMAND \$${attribute})\"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND \$${attributes})" && eval unset ${attributes}
else
eval echo "\$${attributes}" && eval unset ${attributes}
fi
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
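As a quick sanity check, the helpers behave like this (the function bodies are reproduced verbatim so the snippet runs on its own; note it relies on bash treating `local var=$@` as an assignment, so the argument's whitespace survives into `var`):

```shell
#!/bin/bash
# rtrim/trim exactly as defined above, copied here for a standalone check
rtrim() { local var=$@; var="${var%"${var##*[![:space:]]}"}"; echo -n "$var"; }
trim()  { local var=$@; var="${var#"${var%%[![:space:]]*}"}"; var="${var%"${var##*[![:space:]]}"}"; echo -n "$var"; }

rtrim '  padded  '; echo '|'   # -> "  padded|" (leading spaces kept, trailing removed)
trim  '  padded  '; echo '|'   # -> "padded|"
```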
Colorization:
You will also need some neat colorizing dynamic variables, defined first and exported:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
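Incidentally, the temp-file dance near the end of xml_read (write `eval` lines to $TMP, then source it) exists because variables set inside a `while ... done | postprocess` pipeline run in a subshell and vanish afterwards (BashFAQ/024, linked in the code). A minimal standalone illustration of the pitfall and the same workaround:

```shell
#!/bin/bash
count=0
printf 'a\nb\n' | while read -r line; do count=$((count + 1)); done
echo "$count"   # prints 0: the loop ran in a subshell, so the increment was lost

# the workaround used by xml_read: persist the state in a file, then source it
TMP=$(mktemp)
printf 'a\nb\n' | { n=0; while read -r line; do n=$((n + 1)); done; echo "count=$n" > "$TMP"; }
. "$TMP" && rm -f "$TMP"
echo "$count"   # prints 2
```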
edited Feb 1 at 16:40
peterh
answered Jan 29 '14 at 12:44
scavenger
I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0
– khmarbaise
Mar 5 '14 at 8:37
Line 22:[ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.
My XML::Twig Perl module comes with such a tool: xml_grep
, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt
(the -t
option gives you the result as text instead of xml)
answered May 21 '09 at 15:43
mirod
Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
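xml2 emits one path=value line per node (e.g. /html/head/title=Hi for a title element), so ordinary line tools can finish the job. Since xml2 may not be installed everywhere, the line below is a hand-written stand-in for its output, not the tool itself:

```shell
# simulated xml2 output for <html><head><title>Hi</title></head></html>,
# then a sed filter that keeps only the title value
printf '%s\n' '/html/head/title=Hi' | sed -n 's|^/html/head/title=||p'   # -> Hi
```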
answered Nov 7 '09 at 15:31
simon04
Another command line tool is my new Xidel. Unlike the already mentioned xpath/xmlstarlet, it also supports XPath 2 and XQuery.
The title can be read like:
xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
And it also has a cool feature to export multiple variables to bash. For example
eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )
sets $title
to the title and $imgcount
to the number of images in the file, which should be as flexible as parsing it directly in bash.
This is exactly what I needed! :)
– Thomas Daugaard
Oct 18 '13 at 9:04
answered Mar 27 '13 at 0:27
BeniBela
Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.
answered May 21 '09 at 15:39
alamar
After some research into translating between the Linux and Windows formats of the file paths in XML files, I found interesting tutorials and solutions on:
- General information about XPaths
- Amara - collection of Pythonic tools for XML
- Develop Python/XML with 4Suite (2 parts)
answered Oct 24 '10 at 1:00
user485380
While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.
Here is a python script which uses lxml
for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.
Example 1
#!/usr/bin/env python
import sys
from lxml import etree
tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]
# a hack allowing to access the
# default namespace (if defined) via the 'p:' prefix
# E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
# an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
ns['p'] = ns.pop(None)
# end of hack
for e in tree.xpath(xpath_expression, namespaces=ns):
if isinstance(e, str):
print(e)
else:
print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))
lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python-lxml.
Usage
python xpath.py myfile.xml "//mynode"
lxml
also accepts a URL as input:
python xpath.py http://www.feedforall.com/sample.xml "//link"
Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.
Example 2
A one-off script which serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modules>
<module>cherries</module>
<module>bananas</module>
<module>pears</module>
</modules>
</project>
module_extractor.py:
from lxml import etree
for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
print(e.text)
This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
– E. Moffat
Oct 31 at 0:12
edited May 18 at 22:19
answered Oct 25 '17 at 14:53
ccpizza
Yuzem's method can be improved by inverting the order of the <
and >
signs in the rdom
function and the variable assignments, so that:
rdom () { local IFS=\> ; read -d \< E C ;}
becomes:
rdom () { local IFS=\< ; read -d \> C E ;}
If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while
loop.
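To see the inverted variant in action, here is a minimal, bash-specific sketch (the `extract_title` wrapper is a hypothetical name, not part of the original answer). With `IFS=\<` and `read -d \>`, C receives the text that preceded the tag and E the tag name, so the closing tag carries the element's content:

```shell
#!/bin/bash
# rdom with the inverted delimiters, as suggested above
rdom () { local IFS=\< ; read -d \> C E ;}

# hypothetical helper: print the text of the first <title> element on stdin
extract_title() {
    while rdom; do
        # E holds the tag name, C the text content that preceded it,
        # so the closing tag is where the element's text is available
        if [ "$E" = "/title" ]; then
            printf '%s\n' "$C"
            return
        fi
    done
}

echo '<html><head><title>Hello</title></head></html>' | extract_title   # -> Hello
```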
answered Jan 24 '13 at 0:46
michaelmeyer
This works if you are wanting XML attributes:
$ cat alfa.xml
<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>
$ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
$ . ./alfa.sh
$ echo "$stream"
H264_400.mp4
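The same transcript, runnable end to end in a throwaway directory. Note that sourcing generated shell code like this is only safe for trusted input, since anything the sed output contains is executed in your shell:

```shell
#!/bin/bash
cd "$(mktemp -d)"
printf '<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>\n' > alfa.xml
# drop the leading '<video' token, then the trailing '/>', leaving bare var="..." assignments
sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
. ./alfa.sh     # runs the assignments in the current shell
echo "$stream"  # -> H264_400.mp4
```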
edited Jan 3 '17 at 19:34
answered Jun 16 '12 at 16:53
Steven Penny
Introduction
Thank you very much for the earlier answers. The question headline is ambiguous: the asker says they want to parse xml when what they actually want is to parse xhtml. Though the two are similar, they are definitely not the same, and since xml and xhtml aren't the same it was very hard to come up with a solution for exactly what was asked. I hope the solution below will still do; I admit I couldn't find out how to look specifically for /html/head/title. I'm also not satisfied with some of the earlier answers, since some answerers reinvent the wheel unnecessarily when the asker never said that downloading a package is forbidden. I specifically want to repeat what a person in this thread already said: Just because you can write your own parser, doesn't mean you should - @Stephen Niedzielski. Regarding programming, the easiest and shortest way is as a rule to prefer; never make anything more complex than needed. The solution has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. If another title element existed and were selected, the result would be wrong; sorry for that possibility. Example: if the <body> tags come before the <head> tags and the <body> tags contain a <title> tag, but that's very, very unlikely.
TLDR/Solution
On general path for solution, thank you @Grisha, @Nat, How to parse XML in Bash?
On removing xml tags, thank you @Johnsyweb, How to remove XML tags from Unix command line?
1. Install the "package" xmlstarlet
2. Execute in bash xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt
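The two steps above can be sketched end to end (a hedged example, assuming xmlstarlet is installed and is recent enough to bind the document's default namespace to the `_` prefix; the file content below is an invented sample, not from the question):

```shell
# Create a minimal namespaced XHTML file to extract the title from.
cat > xhtmlfile.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hello</title></head>
  <body><p>text</p></body>
</html>
EOF

# -m matches every title element, -c copies it, -n prints a newline;
# head keeps the first match and sed strips the surrounding tags.
xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml \
  | head -1 | sed -e 's/<[^>]*>//g'
```

For this sample the pipeline should print the bare title text, Hello.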
answered Oct 15 at 11:10
propatience
414