Python element tree - extract text from element, stripping tags

With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?

For example, say I have the following:

<tag>
  Some <a>example</a> text
</tag>

I want to return Some example text. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.

edited Jan 2 at 8:23 by Franck Dernoncourt
asked Oct 14 '13 at 21:53 by Trent Bing

IIRC BeautifulSoup has some simple ways to take care of that...
– Wayne Werner
Oct 14 '13 at 21:54

Like this
– Wayne Werner
Oct 14 '13 at 21:55

If possible, I'd like to avoid using additional external libraries
– Trent Bing
Oct 14 '13 at 21:57

Undoubtedly it would be incorrect (I think) because regex is bad for XML, but you could try something like re.sub(r'<.*?>', '', text).
– Wayne Werner
Oct 14 '13 at 21:59

python xml-parsing elementtree


3 Answers

If you are running Python 3.2+ (or 2.7), you can use itertext().

itertext() creates a text iterator that loops over this element and all subelements, in document order, and returns all inner text:

import xml.etree.ElementTree as ET

xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'
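The question's original XML spans several lines, so the joined text keeps the surrounding newlines and indentation. A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace to a single space (the helper name element_text and its normalize flag are made up for this illustration):

```python
import xml.etree.ElementTree as ET


def element_text(element, normalize=True):
    """Return all text inside element with the tags removed.

    Hypothetical helper: joins the pieces from itertext() and, if
    normalize is true, collapses each run of whitespace to one space.
    """
    text = ''.join(element.itertext())
    if normalize:
        # str.split() with no arguments splits on any whitespace run
        text = ' '.join(text.split())
    return text


xml = """<tag>
  Some <a>example</a> text
</tag>"""
root = ET.fromstring(xml)

print(repr(element_text(root, normalize=False)))  # raw text, whitespace kept
print(element_text(root))                         # whitespace collapsed
```

Collapsing whitespace is a design choice: it is fine for prose-like content, but it would damage text where spacing matters (preformatted blocks, for instance).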

If you are running an older version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly as above:

import xml.etree.ElementTree as ET

# implementation of .itertext() as shipped with Python 2.7
def itertext(self):
    tag = self.tag
    # skip comments and processing instructions, whose tag is not a string
    if not isinstance(tag, basestring) and tag is not None:
        return
    if self.text:
        yield self.text
    for e in self:
        for s in e.itertext():
            yield s
        if e.tail:
            yield e.tail

# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext

xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'

edited Nov 13 at 17:33 by Tomalak
answered Oct 14 '13 at 22:07 by Benjamin Toueg

Thank you, was searching for this for a while!
– CodeMonkey
Jun 2 '16 at 11:03


As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order.

However, recent-enough versions (including the ones in the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current released versions of both ElementTree and lxml on PyPI) can do this for you automatically if you pass method='text' to the tostring function:

>>> from xml.etree import ElementTree
>>> s = '''<tag>
...   Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> ElementTree.tostring(t, method='text')
'\n  Some example text\n'

If you also want to strip whitespace from the text, you'll need to do so manually. In your simple case, that's easy:

>>> ElementTree.tostring(t, method='text').strip()
'Some example text'
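The session above shows plain str output, as on Python 2. On Python 3, tostring returns bytes by default; passing encoding='unicode' makes it return a str instead. A short sketch of the same steps as a script, using only the standard library (the variable name raw is just for illustration):

```python
from xml.etree import ElementTree

s = '''<tag>
  Some <a>example</a> text
</tag>'''

# Parse the string into an Element, then serialize only its text content.
t = ElementTree.fromstring(s)

# encoding='unicode' asks for a str rather than bytes on Python 3.
raw = ElementTree.tostring(t, encoding='unicode', method='text')

print(repr(raw))    # text with the original newlines and indentation
print(raw.strip())  # leading and trailing whitespace removed
```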

In more complicated cases, however, where you want to strip out whitespace within intermediate tags, you'll probably have to fall back on recursively processing the texts and tails. That's not too hard; you just have to remember to deal with the possibility that the attributes may be None. For example, here's a skeleton you can hook your own code on:

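def textify(t):
    s = []  # collected text fragments
    if t.text:
        s.append(t.text)
    for child in t:  # iterate directly; getchildren() is deprecated
        s.append(textify(child))
    if t.tail:
        # text that follows t's closing tag (None for the root)
        s.append(t.tail)
    return ''.join(s)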

This version only works when text and tail are guaranteed to be a str or None. For trees you build up manually, that's not guaranteed to be true.
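As one possible way to fill in that skeleton for the "strip whitespace inside intermediate tags" case, here is a minimal sketch. It assumes text and tail are always str or None, and that joining the pieces with a single space is acceptable; the name text_fragments and the sample document are invented for the example:

```python
import xml.etree.ElementTree as ET


def text_fragments(element):
    """Yield stripped, non-empty text pieces of element in document order.

    Hypothetical helper; assumes .text and .tail are str or None.
    """
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        # Text inside the child comes first...
        yield from text_fragments(child)
        # ...then the text between the child's end tag and the next sibling.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


doc = '<tag>\n  Some <a>  example </a> text\n  <b>\n    more\n  </b>\n</tag>'
root = ET.fromstring(doc)
print(' '.join(text_fragments(root)))
```

Unlike the skeleton, this handles each child's tail in the parent loop, so calling it on a subelement does not pick up text that follows that subelement. Joining with spaces also inserts a space at every tag boundary, which is wrong for markup that splits a word (for example ex<b>am</b>ple).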

edited Oct 14 '13 at 22:19
answered Oct 14 '13 at 21:59 by abarnert

There is also a very simple solution if you can use XPath. It relies on XPath axes; more about them can be found here.

Suppose a node (such as a div) contains text as well as other nodes (such as a, center, or another div) that hold text themselves, or contains only text, and we want all the text inside that div. The following XPath selects it: current_element.xpath("descendant-or-self::*/text()").extract(). The result is a list of all the text pieces within the current element, with any inner tags stripped.

What's nice about this is that no recursive function is needed: XPath takes care of everything (using recursion internally, but for us it's as clean as it gets).

Here is a Stack Overflow question concerning this proposed solution.

answered Sep 22 at 11:49 by Michal

n.b.: This applies only to lxml. The xml.etree package does not know enough XPath to do this.
– Tomalak
Nov 13 at 16:56

