How to parse XML in Bash?
Ideally, what I would like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -E 's%^<title>|</title>$%%g' > titleOfXHTMLPage.txt
xml bash xhtml shell xpath
1
unix.stackexchange.com/questions/83385/… || superuser.com/questions/369996/…
– Ciro Santilli 新疆改造中心 六四事件 法轮功
Oct 7 '15 at 10:57
edited May 29 '14 at 3:30 by Steven Penny
asked May 21 '09 at 15:36 by asdfasdfasdf
15 Answers
This is really just an explanation of Yuzem's answer, but I didn't feel this much editing should be done to someone else's answer, and comments don't allow formatting, so...
rdom () { local IFS=\> ; read -d \< E C ;}
Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
}
Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of it automatically being split on space, tab or newlines, it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:
<tag>value</tag>
The first call to read_dom
gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag
and CONTENT=value
. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag
and CONTENT=
. The fourth call will return a non-zero status because we've reached the end of file.
Now his while loop cleaned up a bit to match the above:
while read_dom; do
if [[ $ENTITY = "title" ]]; then
echo "$CONTENT"
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
Now given the following (similar to what you get from listing a bucket on S3) for input.xml
:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>sth-items</Name>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>item-apple-iso@2x.png</Key>
<LastModified>2011-07-25T22:23:04.000Z</LastModified>
<ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
<Size>1785</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
and the following loop:
while read_dom; do
echo "$ENTITY => $CONTENT"
done < input.xml
You should get:
=>
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => item-apple-iso@2x.png
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>
So if we wrote a while
loop like Yuzem's:
while read_dom; do
if [[ $ENTITY = "Key" ]] ; then
echo "$CONTENT"
fi
done < input.xml
We'd get a listing of all the files in the S3 bucket.
EDIT
If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:
read_dom () {
ORIGINAL_IFS=$IFS
IFS=\>
read -d \< ENTITY CONTENT
IFS=$ORIGINAL_IFS
}
Otherwise, any line splitting you do later in the script will be messed up.
EDIT 2
To split out attribute name/value pairs you can augment the read_dom()
like so:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local ret=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $ret
}
Then write your function to parse and get the data you want like this:
parse_dom () {
if [[ $TAG_NAME = "foo" ]] ; then
eval local $ATTRIBUTES
echo "foo size is: $size"
elif [[ $TAG_NAME = "bar" ]] ; then
eval local $ATTRIBUTES
echo "bar type is: $type"
fi
}
Then while you read_dom
call parse_dom
:
while read_dom; do
parse_dom
done
Then given the following example markup:
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
You should get this output:
$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
EDIT 3: Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local RET=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $RET
}
I don't see any reason why that shouldn't work.
2
If you make IFS (the input field separator) global you should reset it back to its original value at the end; I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
– chad
Jul 25 '12 at 15:24
3
cool answer !!!!
– mtk
Oct 23 '12 at 20:15
21
Just because you can write your own parser, doesn't mean you should.
– Stephen Niedzielski
Apr 23 '13 at 21:49
2
@Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
– chad
Oct 11 '13 at 14:27
5
Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=\< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
– William Pursell
Nov 27 '13 at 16:47
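A minimal sketch of that per-invocation scoping, applied to read_dom (assuming bash, since read -d is a bashism):

```shell
# Same behavior as read_dom above, but IFS is set only for this one read call,
# so there is no need for 'local' or for restoring a global IFS afterwards.
read_dom () { IFS=\> read -d \< ENTITY CONTENT ;}

printf '<tag>value</tag>' | {
  read_dom    # 1st read: the empty chunk before the first '<'
  read_dom    # 2nd read: 'tag>value', split on '>' into ENTITY and CONTENT
  echo "ENTITY=$ENTITY CONTENT=$CONTENT"   # -> ENTITY=tag CONTENT=value
}
```

This mirrors the common IFS= read -r line idiom for reading whole lines.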
You can do that very easily using only bash.
You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read but for html documents.
When called rdom will assign the element to variable E and the content to var C.
For example, to do what you wanted to do:
while rdom; do
if [[ $E = title ]]; then
echo "$C"
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
– Alex Gray
Jul 4 '11 at 2:14
1
alex, I clarified Yuzem's answer below...
– chad
Aug 13 '11 at 20:04
1
Cred to the original - this one-liner is so freakin' elegant and amazing.
– maverick
Dec 5 '13 at 22:06
1
great hack, but I had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding)
– user311174
Jan 16 '14 at 10:32
3
Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you do not have much time, but it can never be called a good solution.
– peterh
Feb 1 at 16:34
Command-line tools that can be called from shell scripts include:
- 4xpath - command-line wrapper around Python's 4Suite package
- XMLStarlet
- xpath - command-line wrapper around Perl's XPath library
- Xidel - works with URLs as well as files; also works with JSON
I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
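As a concrete sketch of the xmllint route (assuming a libxml2 xmllint recent enough to have --xpath; the sample file below is made up for the demo):

```shell
# Hypothetical sample input for the demo
cat > xhtmlfile.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>My Title</title></head><body/></html>
EOF

# string(...) returns just the text content; local-name() matches <title>
# regardless of the XHTML default namespace, which plain /html/head/title would miss.
xmllint --xpath 'string(//*[local-name()="title"])' xhtmlfile.xhtml > titleOfXHTMLPage.txt
cat titleOfXHTMLPage.txt   # -> My Title
```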
2
Where can I download 'xpath' or '4xpath' from ?
– Opher
Apr 15 '11 at 14:47
3
yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
– David
Nov 22 '11 at 0:34
2
sudo apt-get install libxml-xpath-perl
– Andrew Wagner
Nov 23 '12 at 12:37
You can use xpath utility. It's installed with the Perl XML-XPath package.
Usage:
/usr/bin/xpath [filename] query
or XMLStarlet. To install it on opensuse use:
sudo zypper install xmlstarlet
or try cnf xml
on other platforms.
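For example, to solve the original question with XMLStarlet (a sketch; the -N flag binds the XHTML default namespace to a prefix so the XPath can match, and the sample file is made up for the demo):

```shell
# Hypothetical sample input for the demo
cat > xhtmlfile.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>My Title</title></head><body/></html>
EOF

# sel = select mode; -t -v = output the value of the XPath expression
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" \
  -t -v '/x:html/x:head/x:title' xhtmlfile.xhtml > titleOfXHTMLPage.txt
cat titleOfXHTMLPage.txt   # -> My Title
```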
5
Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
– Bruno von Paris
Feb 8 '13 at 15:26
On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
– tripleee
Jul 27 '16 at 8:47
2
On Ubuntu/Debian: apt-get install xmlstarlet
– rubo77
Dec 24 '16 at 0:48
This is sufficient...
xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
Thanks, quick and did the job for me
– Miguel Mota
May 18 '16 at 23:03
Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, with just 2 little functions (more than 2, but you can mix them all). I don't say chad's one didn't work at all, but it had too many issues with badly formatted XML files: so you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools using perl, python or anything else. As for me, I cannot install cpan nor perl modules on the old production OS I'm working on, and python isn't available.
First, a definition of the XML words used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handling of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions: first is xml_read_dom, which is called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "\${${XAPPLIED_COLOR}}\$($XCOMMAND \$${attribute})${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "${g}\$($XCOMMAND \$${attributes})" && eval unset ${attributes}
else
eval echo "\$${attributes}" && eval unset ${attributes}
fi
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
Colorization:
oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0 ?
– khmarbaise
Mar 5 '14 at 8:37
Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.
My XML::Twig Perl module comes with such a tool: xml_grep
, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt
(the -t
option gives you the result as text instead of xml)
Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
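A sketch of what that looks like (exact output may vary by version, but xml2 emits one path=value line per text node, which grep/sed then handle easily):

```shell
# Flatten the document to path=value lines, then pick out the one we want.
printf '<html><head><title>My Title</title></head></html>' | xml2
# expected to print something like:
#   /html/head/title=My Title

# So the title extraction from the question becomes:
printf '<html><head><title>My Title</title></head></html>' \
  | xml2 | sed -n 's%^/html/head/title=%%p'
```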
Another command-line tool is my new Xidel. It also supports XPath 2 and XQuery, unlike the already mentioned xpath/xmlstarlet.
The title can be read like:
xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
And it also has a cool feature to export multiple variables to bash. For example
eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )
sets $title
to the title and $imgcount
to the number of images in the file, which should be as flexible as parsing it directly in bash.
This is exactly what I needed! :)
– Thomas Daugaard
Oct 18 '13 at 9:04
Well, you can use xpath utility. I guess perl's XML::Xpath contains it.
After some research for translation between Linux and Windows formats of the file paths in XML files I found interesting tutorials and solutions on:
- General information about XPaths
- Amara - collection of Pythonic tools for XML
- Develop Python/XML with 4Suite (2 parts)
While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.
Here is a python script which uses lxml
for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.
Example 1
#!/usr/bin/env python
import sys
from lxml import etree
tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]
# a hack allowing to access the
# default namespace (if defined) via the 'p:' prefix
# E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
# an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
ns['p'] = ns.pop(None)
# end of hack
for e in tree.xpath(xpath_expression, namespaces=ns):
if isinstance(e, str):
print(e)
else:
print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))
lxml
can be installed with pip install lxml
. On ubuntu you can use sudo apt install python-lxml
.
Usage
python xpath.py myfile.xml "//mynode"
lxml
also accepts a URL as input:
python xpath.py http://www.feedforall.com/sample.xml "//link"
Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.
Example 2
A one-off script which serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modules>
<module>cherries</module>
<module>bananas</module>
<module>pears</module>
</modules>
</project>
module_extractor.py:
from lxml import etree
for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
print(e.text)
This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
– E. Moffat
Oct 31 at 0:12
add a comment |
Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and the variable assignments, so that:
rdom () { local IFS=\> ; read -d \< E C ;}
becomes:
rdom () { local IFS=\< ; read -d \> C E ;}
If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while
loop.
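A quick demonstration of the difference, assuming bash (read -d is a bashism): with the reversed version, the closing tag of the last element still comes through before read hits end-of-file.

```shell
# Reversed version: split on '<', stop reading at '>'
rdom () { local IFS=\< ; read -d \> C E ;}

while rdom; do
  printf 'E=%s C=%s\n' "$E" "$C"
done <<'EOF'
<tag>value</tag>
EOF
# -> E=tag C=
# -> E=/tag C=value   (the closing tag is reached)
```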
This works if you are wanting XML attributes:
$ cat alfa.xml
<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>
$ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
$ . ./alfa.sh
$ echo "$stream"
H264_400.mp4
Introduction
Thank you very much for the earlier answers. The question headline is very ambiguous: it asks how to parse xml, when what the asker actually wants is to parse xhtml. Though they are similar, they are definitely not the same, and since xml and xhtml are not the same it was very hard to come up with a solution that does exactly what was asked for. I hope the solution below will still do; I'll admit I couldn't find out how to look specifically for /html/head/title. That said, I'm not satisfied with the earlier answers, since some of the answerers are reinventing the wheel unnecessarily when the asker never said it was forbidden to download a package. I don't understand the unnecessary coding at all. I specifically want to repeat what a person in this thread already said: "Just because you can write your own parser, doesn't mean you should" - @Stephen Niedzielski. Regarding programming: the easiest and shortest way is as a rule to be preferred; never make anything more complex than needed. The solution has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. It's possible that if another title element existed and were selected, the result would be wrong, sorry for that possibility: for example, if the <body> tags came before the <head> tags and contained a <title> tag, but that's very, very unlikely.
TLDR/Solution
On general path for solution, thank you @Grisha, @Nat, How to parse XML in Bash?
On removing xml tags, thank you @Johnsyweb, How to remove XML tags from Unix command line?
1. Install the "package" xmlstarlet
2. Execute in bash xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f893585%2fhow-to-parse-xml-in-bash%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
15 Answers
15
active
oldest
votes
15 Answers
15
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is really just an explaination of Yuzem's answer, but I didn't feel like this much editing should be done to someone else, and comments don't allow formatting, so...
rdom () { local IFS=> ; read -d < E C ;}
Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:
read_dom () {
local IFS=>
read -d < ENTITY CONTENT
}
Okay so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data instead of automatically being split on space, tab or newlines it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d for deliminator flag). What is read is then split using the IFS and assigned to the variable ENTITY and CONTENT. So take the following:
<tag>value</tag>
The first call to read_dom
get an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split then by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag
and CONTENT=value
. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag
and CONTENT=
. The fourth call will return a non-zero status because we've reached the end of file.
Now his while loop cleaned up a bit to match the above:
while read_dom; do
if [[ $ENTITY = "title" ]]; then
echo $CONTENT
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
The first line just says, "while the read_dom functionreturns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echos the content of the tag. The four line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom
function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
Now given the following (similar to what you get from listing a bucket on S3) for input.xml
:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>sth-items</Name>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>item-apple-iso@2x.png</Key>
<LastModified>2011-07-25T22:23:04.000Z</LastModified>
<ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
<Size>1785</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
and the following loop:
while read_dom; do
echo "$ENTITY => $CONTENT"
done < input.xml
You should get:
=>
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => item-apple-iso@2x.png
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>
So if we wrote a while
loop like Yuzem's:
while read_dom; do
if [[ $ENTITY = "Key" ]] ; then
echo $CONTENT
fi
done < input.xml
We'd get a listing of all the files in the S3 bucket.
EDIT
If for some reason local IFS=>
doesn't work for you and you set it globally, you should reset it at the end of the function like:
read_dom () {
ORIGINAL_IFS=$IFS
IFS=>
read -d < ENTITY CONTENT
IFS=$ORIGINAL_IFS
}
Otherwise, any line splitting you do later in the script will be messed up.
EDIT 2
To split out attribute name/value pairs you can augment the read_dom()
like so:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $ret
}
Then write your function to parse and get the data you want like this:
parse_dom () {
if [[ $TAG_NAME = "foo" ]] ; then
eval local $ATTRIBUTES
echo "foo size is: $size"
elif [[ $TAG_NAME = "bar" ]] ; then
eval local $ATTRIBUTES
echo "bar type is: $type"
fi
}
Then while you read_dom, call parse_dom:
while read_dom; do
parse_dom
done
Then given the following example markup:
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
You should get this output:
$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
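For reference, here is the whole EDIT 2 pipeline as one self-contained script, with the example markup inlined as a here-document (keep in mind that eval on attribute text is unsafe on untrusted input):

```shell
#!/usr/bin/env bash
read_dom () {
  local IFS=\>
  read -d \< ENTITY CONTENT
  local ret=$?
  TAG_NAME=${ENTITY%% *}      # word before the first space
  ATTRIBUTES=${ENTITY#* }     # everything after the first space
  return $ret
}

parse_dom () {
  if [[ $TAG_NAME = "foo" ]]; then
    eval local "$ATTRIBUTES"            # sets $size, $type from the tag
    echo "foo size is: $size"
  elif [[ $TAG_NAME = "bar" ]]; then
    eval local "$ATTRIBUTES"
    echo "bar type is: $type"
  fi
}

output=$(while read_dom; do parse_dom; done <<'XML'
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
XML
)
printf '%s\n' "$output"
```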
EDIT 3: Another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}
I don't see any reason why that shouldn't work.
2
If you make IFS (the input field separator) global you should reset it back to its original value at the end, I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is because either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash.
– chad
Jul 25 '12 at 15:24
3
cool answer !!!!
– mtk
Oct 23 '12 at 20:15
21
Just because you can write your own parser, doesn't mean you should.
– Stephen Niedzielski
Apr 23 '13 at 21:49
2
@Alastair see github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects
– chad
Oct 11 '13 at 14:27
5
Assigning IFS in a local variable is fragile and not necessary. Just do: IFS=< read ..., which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using read to parse xml, and I believe doing so is fraught with peril and ought to be avoided.)
– William Pursell
Nov 27 '13 at 16:47
edited May 23 '17 at 11:55
Community♦
answered Aug 13 '11 at 17:36
chad
You can do that very easily using only bash.
You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read but for html documents.
When called rdom will assign the element to variable E and the content to var C.
For example, to do what you wanted to do:
while rdom; do
if [[ $E = title ]]; then
echo $C
exit
fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
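A quick way to try it without touching files on disk (the markup here is made-up sample data; note the backslash-escaped \> and \<, without which the shell would treat them as redirections):

```shell
#!/usr/bin/env bash
rdom () { local IFS=\> ; read -d \< E C ;}

title=$(printf '<html><head><title>Page Title</title></head></html>' |
  while rdom; do
    if [[ $E = title ]]; then
      printf '%s\n' "$C"
      break
    fi
  done)
echo "$title"
```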
could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output?
– Alex Gray
Jul 4 '11 at 2:14
1
alex, I clarified Yuzem's answer below...
– chad
Aug 13 '11 at 20:04
1
Cred to the original - this one-liner is so freakin' elegant and amazing.
– maverick
Dec 5 '13 at 22:06
1
great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding)
– user311174
Jan 16 '14 at 10:32
3
Parsing XML with grep and awk is not okay. It may be an acceptable compromise if the XML is simple enough and you don't have much time, but it can't be called a good solution ever.
– peterh
Feb 1 at 16:34
answered Apr 9 '10 at 14:13
Yuzem
Command-line tools that can be called from shell scripts include:
- 4xpath - command-line wrapper around Python's 4Suite package
- XMLStarlet
- xpath - command-line wrapper around Perl's XPath library
- Xidel - Works with URLs as well as files. Also works with JSON
I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
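For instance, a minimal sketch of the xsltproc route: a hypothetical title.xsl stylesheet that emits only the title's text, run against a made-up input file (guarded so it is a no-op where xsltproc is not installed):

```shell
#!/usr/bin/env bash
# Hypothetical sample input (no XHTML namespace, to keep the XPath plain).
cat > xhtmlfile.xhtml <<'XHTML'
<html><head><title>My Page</title></head><body></body></html>
XHTML

# Tiny stylesheet: text output, value of /html/head/title only.
cat > title.xsl <<'XSL'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="/html/head/title"/>
  </xsl:template>
</xsl:stylesheet>
XSL

if command -v xsltproc >/dev/null; then
  xsltproc title.xsl xhtmlfile.xhtml   # prints: My Page
fi
```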
2
Where can I download 'xpath' or '4xpath' from ?
– Opher
Apr 15 '11 at 14:47
3
yes, a second vote/request - where to download those tools, or do you mean one has to manually write a wrapper? I'd rather not waste time doing that unless necessary.
– David
Nov 22 '11 at 0:34
2
sudo apt-get install libxml-xpath-perl
– Andrew Wagner
Nov 23 '12 at 12:37
edited Jan 10 '17 at 21:10
RustyTheBoyRobot
answered May 21 '09 at 18:18
Nat
You can use the xpath utility. It's installed with the Perl XML-XPath package.
Usage:
/usr/bin/xpath [filename] query
or XMLStarlet. To install it on openSUSE, use:
sudo zypper install xmlstarlet
or try cnf xml
on other platforms.
5
Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers).
– Bruno von Paris
Feb 8 '13 at 15:26
On many systems, the xpath which comes preinstalled is unsuitable for use as a component in scripts. See e.g. stackoverflow.com/questions/15461737/… for an elaboration.
– tripleee
Jul 27 '16 at 8:47
2
On Ubuntu/Debian: apt-get install xmlstarlet
– rubo77
Dec 24 '16 at 0:48
edited Jul 6 '12 at 16:19
mtk
answered Apr 24 '12 at 15:03
Grisha
This is sufficient...
xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
Thanks, quick and did the job for me
– Miguel Mota
May 18 '16 at 23:03
answered Jan 5 '15 at 10:33
teknopaul
Starting from Chad's answer, here is a COMPLETE working solution to parse XML, with proper handling of comments, using just two little functions (more than two, but you can mix them all). I'm not saying Chad's version didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit trickier to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone who needs to parse XML without complex tools in Perl, Python or anything else. In my case, I cannot install CPAN or Perl modules on the old production OS I'm working on, and Python isn't available.
First, a definition of the XML terms used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, which now handle:
- WebSphere XML (xmi and xmlns attributes)
- a terminal with 256-color support is required
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions: first is xml_read_dom, which is called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one:
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test "${CONTENT}" && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags, so we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${!XAPPLIED_COLOR}\"$($XCOMMAND \$${attribute})\"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND \$${attributes})" && eval unset ${attributes}
else
eval echo "\$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
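A quick sanity check of the trim helper (redefined here so the snippet is self-contained; printf replaces echo -n for portability):

```shell
# Pure parameter-expansion trimming: no subshells, no external commands.
trim() {
    local var=$*
    var="${var#"${var%%[![:space:]]*}"}"   # strip leading whitespace
    var="${var%"${var##*[![:space:]]}"}"   # strip trailing whitespace
    printf '%s' "$var"
}

printf '[%s]\n' "$(trim '   hello world   ')"   # → [hello world]
```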
Colorization:
Oh, and you will need some neat colorizing dynamic variables defined first, and exported, too:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
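Stripped of options, colors, and comment handling, the core trick both functions build on fits in a few lines: set IFS to `>` and read up to the next `<`, so each read yields one tag name and its text content. A minimal sketch, with a hypothetical inline document:

```shell
# One tag per iteration: ENTITY gets the tag name, CONTENT the text after it.
read_dom() {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

xml='<html><head><title>Hello</title></head></html>'
while read_dom; do
    if [ "$ENTITY" = "title" ]; then
        echo "$CONTENT"   # → Hello
    fi
done <<< "$xml"
```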
I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0
?
– khmarbaise
Mar 5 '14 at 8:37
Line 22: [ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
add a comment |
starting from the chad's answer, here is the COMPLETE working solution to parse UML, with propper handling of comments, with just 2 little functions (more than 2 bu you can mix them all). I don't say chad's one didn't work at all, but it had too much issues with badly formated XML files: So you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-2-use, out of the box bash functions to anyone needing parsing UML without complex tools using perl, python or anything else. As for me, I cannot install cpan, nor perl modules for the old production OS i'm working on, and python isn't available.
First, a definition of the UML words used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handle of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions, first is the xml_read_dom which's called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=>
if $ITSACOMMENT; then
read -d < COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d < ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test ""$${attribute}""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed durint the loop:
$TAGPRINTED && ${print} "n" && TAGPRINTED=false
else
if eval test ""$${attributes}""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
else
eval echo "$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
Colorization:
oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '33[1;35m')
m=$(${print} '33[0;35m')
END=$(${print} '33[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produces ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '33[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "33[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "33[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
I'm trying to use the above two functions which produces the following:./read_xml.sh: line 22: (-1): substring expression < 0
?
– khmarbaise
Mar 5 '14 at 8:37
Line 22:[ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
add a comment |
starting from the chad's answer, here is the COMPLETE working solution to parse UML, with propper handling of comments, with just 2 little functions (more than 2 bu you can mix them all). I don't say chad's one didn't work at all, but it had too much issues with badly formated XML files: So you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-2-use, out of the box bash functions to anyone needing parsing UML without complex tools using perl, python or anything else. As for me, I cannot install cpan, nor perl modules for the old production OS i'm working on, and python isn't available.
First, a definition of the UML words used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handle of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions, first is the xml_read_dom which's called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=>
if $ITSACOMMENT; then
read -d < COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d < ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no "attribute=" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test ""$${attribute}"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's usefull only for "any" tags so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test ""$${attribute}""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "${${XAPPLIED_COLOR}}"$($XCOMMAND $${attribute})"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" ""$${attribute}"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed durint the loop:
$TAGPRINTED && ${print} "n" && TAGPRINTED=false
else
if eval test ""$${attributes}""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND $${attributes})" && eval unset ${attributes}
else
eval echo "$${attributes}" && eval unset ${attributes}
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, using just two little functions (more than two, but you can mix them all). I don't say chad's didn't work at all, but it had too many issues with badly formatted XML files, so you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.
The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools using perl, python or anything else. As for me, I cannot install cpan nor perl modules for the old production OS I'm working on, and python isn't available.
First, a definition of the XML terms used in this post:
<!-- comment... -->
<tag attribute="value">content...</tag>
EDIT: updated functions, with handle of:
- Websphere xml (xmi and xmlns attributes)
- must have a compatible terminal with 256 colors
- 24 shades of grey
- compatibility added for IBM AIX bash 3.2.16(1)
The functions, first is the xml_read_dom which's called recursively by xml_read:
xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
read -d \< COMMENTS
COMMENTS="$(rtrim "${COMMENTS}")"
return 0
else
read -d \< ENTITY CONTENT
CR=$?
[ "x${ENTITY:0:1}x" == "x/x" ] && return 0
TAG_NAME=${ENTITY%%[[:space:]]*}
[ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
TAG_NAME=${TAG_NAME%%:*}
ATTRIBUTES=${ENTITY#*[[:space:]]}
ATTRIBUTES="${ATTRIBUTES//xmi:/}"
ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi
# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0
# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}
and the second one :
xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]} -c = NOCOLOR${END}
${nn[2]} -d = Debug${END}
${nn[2]} -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]} -p = FORCE PRINT (when no attributes given)${END}
${nn[2]} -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]} (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"
! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
case ${_OPT} in
c) PROSTPROCESS="${DECOLORIZE}" ;;
d) local Debug=true ;;
l) LIGHT=true; XAPPLIED_COLOR=END ;;
p) FORCE_PRINT=true ;;
x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
a) XATTRIBUTE="${OPTARG}" ;;
*) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0
fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true
[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true
while xml_read_dom; do
# (( CR != 0 )) && break
(( PIPESTATUS[1] != 0 )) && break
if $ITSACOMMENT; then
# oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
# elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
fi
$Debug && echo2 "${N}${COMMENTS}${END}"
elif test "${TAG_NAME}"; then
if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
if $GETCONTENT; then
CONTENT="$(trim "${CONTENT}")"
test ${CONTENT} && echo "${CONTENT}"
else
# eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
eval local $ATTRIBUTES
$Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
if test "${attributes}"; then
if $MULTIPLE_ATTR; then
# we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags, so then we print the matching tags found
! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
for attribute in ${attributes}; do
! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
if eval test "\"\$${attribute}\""; then
test "${tag2print}" && ${print} "${tag2print}"
TAGPRINTED=true; unset tag2print
if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
eval ${print} "%s%s " "${attribute2print}" "\${${XAPPLIED_COLOR}}\"$($XCOMMAND \$${attribute})\"${END}" && eval unset ${attribute}
else
eval ${print} "%s%s " "${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
fi
fi
done
# this trick prints a CR only if attributes have been printed during the loop:
$TAGPRINTED && ${print} "\n" && TAGPRINTED=false
else
if eval test "\"\$${attributes}\""; then
if $XAPPLY; then
eval echo "${g}$($XCOMMAND \$${attributes})" && eval unset ${attributes}
else
eval echo "\$${attributes}" && eval unset ${attributes}
fi
fi
fi
fi
else
echo eval $ATTRIBUTES >>$TMP
fi
fi
fi
fi
unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
$FORCE_PRINT && ! $LIGHT && cat $TMP
# $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
$FORCE_PRINT && $LIGHT && sed -r 's/[^"]*(["][^"]*["][,]?)[^"]*/\1 /g' $TMP
. $TMP
rm -f $TMP
fi
unset ITSACOMMENT
}
and lastly, the rtrim, trim and echo2 (to stderr) functions:
rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}" # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}" # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }
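As a quick sanity check, the helpers behave like this (the function bodies are reproduced verbatim so the snippet runs on its own; note it relies on bash treating `local var=$@` as an assignment, so the argument's whitespace survives into `var`):

```shell
#!/bin/bash
# rtrim/trim exactly as defined above, copied here for a standalone check
rtrim() { local var=$@; var="${var%"${var##*[![:space:]]}"}"; echo -n "$var"; }
trim()  { local var=$@; var="${var#"${var%%[![:space:]]*}"}"; var="${var%"${var##*[![:space:]]}"}"; echo -n "$var"; }

rtrim '  padded  '; echo '|'   # -> "  padded|" (leading spaces kept, trailing removed)
trim  '  padded  '; echo '|'   # -> "padded|"
```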
Colorization:
You will also need some neat colorizing dynamic variables, defined first and exported:
set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
M=$(${print} '\033[1;35m')
m=$(${print} '\033[0;35m')
END=$(${print} '\033[0m')
;;
*)
m=$(tput setaf 5)
M=$(tput setaf 13)
# END=$(tput sgr0) # issue on Linux: it can produce ^[(B instead of ^[[0m, more likely when using screenrc
END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} "\033[38;5;$((232 + i))m")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)
If not, just copy/paste everything on the command line.
How does it work:
xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
-c = NOCOLOR
-d = Debug
-l = LIGHT (no "attribute=" printed)
-p = FORCE PRINT (when no attributes given)
-x = apply a command on an attribute and print the result instead of the former value, in green color
(no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)
xml_read server.xml title content # print content between <title></title>
xml_read server.xml Connector port # print all port values from Connector tags
xml_read server.xml any port # print all port values from any tags
With Debug mode (-d) comments and parsed attributes are printed to stderr
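Incidentally, the temp-file dance near the end of xml_read (write `eval` lines to $TMP, then source it) exists because variables set inside a `while ... done | postprocess` pipeline run in a subshell and vanish afterwards (BashFAQ/024, linked in the code). A minimal standalone illustration of the pitfall and the same workaround:

```shell
#!/bin/bash
count=0
printf 'a\nb\n' | while read -r line; do count=$((count + 1)); done
echo "$count"   # prints 0: the loop ran in a subshell, so the increment was lost

# the workaround used by xml_read: persist the state in a file, then source it
TMP=$(mktemp)
printf 'a\nb\n' | { n=0; while read -r line; do n=$((n + 1)); done; echo "count=$n" > "$TMP"; }
. "$TMP" && rm -f "$TMP"
echo "$count"   # prints 2
```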
edited Feb 1 at 16:40
peterh
answered Jan 29 '14 at 12:44
scavenger
I'm trying to use the above two functions, which produces the following: ./read_xml.sh: line 22: (-1): substring expression < 0
– khmarbaise
Mar 5 '14 at 8:37
Line 22:[ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...
– khmarbaise
Mar 5 '14 at 8:47
sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;)
– scavenger
Apr 8 '15 at 3:36
I am not aware of any pure shell XML parsing tool. So you will most likely need a tool written in another language.
My XML::Twig Perl module comes with such a tool: xml_grep
, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt
(the -t
option gives you the result as text instead of xml)
answered May 21 '09 at 15:43
mirod
Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
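xml2 emits one path=value line per node (e.g. /html/head/title=Hi for a title element), so ordinary line tools can finish the job. Since xml2 may not be installed everywhere, the line below is a hand-written stand-in for its output, not the tool itself:

```shell
# simulated xml2 output for <html><head><title>Hi</title></head></html>,
# then a sed filter that keeps only the title value
printf '%s\n' '/html/head/title=Hi' | sed -n 's|^/html/head/title=||p'   # -> Hi
```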
answered Nov 7 '09 at 15:31
simon04
Another command line tool is my new Xidel. Unlike the already mentioned xpath/xmlstarlet, it also supports XPath 2 and XQuery.
The title can be read like:
xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
And it also has a cool feature to export multiple variables to bash. For example
eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )
sets $title
to the title and $imgcount
to the number of images in the file, which should be as flexible as parsing it directly in bash.
This is exactly what I needed! :)
– Thomas Daugaard
Oct 18 '13 at 9:04
answered Mar 27 '13 at 0:27
BeniBela
Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.
answered May 21 '09 at 15:39
alamar
After some research into translating between the Linux and Windows formats of the file paths in XML files, I found interesting tutorials and solutions on:
- General information about XPaths
- Amara - collection of Pythonic tools for XML
- Develop Python/XML with 4Suite (2 parts)
answered Oct 24 '10 at 1:00
user485380
While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.
Here is a python script which uses lxml
for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.
Example 1
#!/usr/bin/env python
import sys
from lxml import etree
tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]
# a hack allowing to access the
# default namespace (if defined) via the 'p:' prefix
# E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
# an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
ns['p'] = ns.pop(None)
# end of hack
for e in tree.xpath(xpath_expression, namespaces=ns):
if isinstance(e, str):
print(e)
else:
print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))
lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python-lxml.
Usage
python xpath.py myfile.xml "//mynode"
lxml
also accepts a URL as input:
python xpath.py http://www.feedforall.com/sample.xml "//link"
Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.
Example 2
A one-off script which serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:
pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modules>
<module>cherries</module>
<module>bananas</module>
<module>pears</module>
</modules>
</project>
module_extractor.py:
from lxml import etree
for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
print(e.text)
This is awesome when you either want to avoid installing extra packages or don't have access to them. On a build machine, I can justify an extra pip install over an apt-get or yum call. Thanks!
– E. Moffat
Oct 31 at 0:12
edited May 18 at 22:19
answered Oct 25 '17 at 14:53
ccpizza
Yuzem's method can be improved by inverting the order of the <
and >
signs in the rdom
function and the variable assignments, so that:
rdom () { local IFS=\> ; read -d \< E C ;}
becomes:
rdom () { local IFS=\< ; read -d \> C E ;}
If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while
loop.
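To see the inverted variant in action, here is a minimal, bash-specific sketch (the `extract_title` wrapper is a hypothetical name, not part of the original answer). With `IFS=\<` and `read -d \>`, C receives the text that preceded the tag and E the tag name, so the closing tag carries the element's content:

```shell
#!/bin/bash
# rdom with the inverted delimiters, as suggested above
rdom () { local IFS=\< ; read -d \> C E ;}

# hypothetical helper: print the text of the first <title> element on stdin
extract_title() {
    while rdom; do
        # E holds the tag name, C the text content that preceded it,
        # so the closing tag is where the element's text is available
        if [ "$E" = "/title" ]; then
            printf '%s\n' "$C"
            return
        fi
    done
}

echo '<html><head><title>Hello</title></head></html>' | extract_title   # -> Hello
```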
answered Jan 24 '13 at 0:46
michaelmeyer
This works if you are wanting XML attributes:
$ cat alfa.xml
<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>
$ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
$ . ./alfa.sh
$ echo "$stream"
H264_400.mp4
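The same transcript, runnable end to end in a throwaway directory. Note that sourcing generated shell code like this is only safe for trusted input, since anything the sed output contains is executed in your shell:

```shell
#!/bin/bash
cd "$(mktemp -d)"
printf '<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>\n' > alfa.xml
# drop the leading '<video' token, then the trailing '/>', leaving bare var="..." assignments
sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh
. ./alfa.sh     # runs the assignments in the current shell
echo "$stream"  # -> H264_400.mp4
```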
edited Jan 3 '17 at 19:34
answered Jun 16 '12 at 16:53
Steven Penny
Introduction
Thank you very much for the earlier answers. The question headline is ambiguous: the asker says they want to parse xml when what they actually want is to parse xhtml. Though the two are similar, they are definitely not the same, and since xml and xhtml aren't the same it was very hard to come up with a solution for exactly what was asked. I hope the solution below will still do; I admit I couldn't find out how to look specifically for /html/head/title. I'm also not satisfied with some of the earlier answers, since some answerers reinvent the wheel unnecessarily when the asker never said that downloading a package is forbidden. I specifically want to repeat what a person in this thread already said: Just because you can write your own parser, doesn't mean you should - @Stephen Niedzielski. Regarding programming, the easiest and shortest way is as a rule to prefer; never make anything more complex than needed. The solution has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. If another title element existed and were selected, the result would be wrong; sorry for that possibility. Example: if the <body> tags come before the <head> tags and the <body> tags contain a <title> tag, but that's very, very unlikely.
TLDR/Solution
On general path for solution, thank you @Grisha, @Nat, How to parse XML in Bash?
On removing xml tags, thank you @Johnsyweb, How to remove XML tags from Unix command line?
1. Install the "package" xmlstarlet
2. Execute in bash xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt
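The two steps above can be sketched end to end (a hedged example, assuming xmlstarlet is installed and is recent enough to bind the document's default namespace to the `_` prefix; the file content below is an invented sample, not from the question):

```shell
# Create a minimal namespaced XHTML file to extract the title from.
cat > xhtmlfile.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hello</title></head>
  <body><p>text</p></body>
</html>
EOF

# -m matches every title element, -c copies it, -n prints a newline;
# head keeps the first match and sed strips the surrounding tags.
xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml \
  | head -1 | sed -e 's/<[^>]*>//g'
```

For this sample the pipeline should print the bare title text, Hello.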
answered Oct 15 at 11:10
propatience
414