Python element tree - extract text from element, stripping tags
With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?
For example, say I have the following:

    <tag>
      Some <a>example</a> text
    </tag>

I want to return `Some example text`. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.
python xml-parsing elementtree
IIRC BeautifulSoup has some simple ways to take care of that... – Wayne Werner, Oct 14 '13 at 21:54
Like this – Wayne Werner, Oct 14 '13 at 21:55
If possible, I'd like to avoid using additional external libraries. – Trent Bing, Oct 14 '13 at 21:57
Undoubtedly it would be incorrect (I think) because regex is bad for XML, but you could try something like `re.sub(r'<.*?>', '', text)`. – Wayne Werner, Oct 14 '13 at 21:59
asked Oct 14 '13 at 21:53 by Trent Bing; edited Jan 2 at 8:23 by Franck Dernoncourt
3 Answers
If you are running under Python 3.2+, you can use `itertext`.

`itertext` creates a text iterator which loops over this element and all subelements, in document order, and returns all inner text:

    import xml.etree.ElementTree as ET

    xml = '<tag>Some <a>example</a> text</tag>'
    tree = ET.fromstring(xml)
    print(''.join(tree.itertext()))
    # -> Some example text

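The snippet above uses a single-line input, whereas the markup in the question spans several lines. As a rough sketch of how the same call behaves there, the following lists the individual fragments that `itertext()` yields and then collapses runs of whitespace. The variable names (`xml_text`, `fragments`, `normalized`) are only illustrative, and collapsing whitespace this way is one possible choice, not something the answer prescribes.

```python
import xml.etree.ElementTree as ET

# Multi-line input, similar to the markup in the question.
xml_text = """<tag>
  Some <a>example</a> text
</tag>"""

root = ET.fromstring(xml_text)

# itertext() yields each text and tail fragment in document order.
fragments = list(root.itertext())

# Join the fragments, then collapse all whitespace runs into single spaces.
normalized = ' '.join(''.join(fragments).split())
print(normalized)
```

Splitting on whitespace and re-joining also removes the leading and trailing newlines, so no separate `strip()` call is needed.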
If you are running an older version of Python that lacks `itertext()`, you can reuse its implementation by attaching it to the `Element` class, after which you can call it exactly as above:

    import xml.etree.ElementTree as ET

    # implementation of .itertext() as shipped with Python 2.7
    # (relies on basestring, so this backport is Python 2 only)
    def itertext(self):
        tag = self.tag
        if not isinstance(tag, basestring) and tag is not None:
            return
        if self.text:
            yield self.text
        for e in self:
            for s in e.itertext():
                yield s
            if e.tail:
                yield e.tail

    # if necessary, monkey-patch the Element class
    if 'itertext' not in ET.Element.__dict__:
        ET.Element.itertext = itertext

    xml = '<tag>Some <a>example</a> text</tag>'
    tree = ET.fromstring(xml)
    print(''.join(tree.itertext()))
    # -> Some example text

Thank you, was searching for this for a while! – CodeMonkey, Jun 2 '16 at 11:03
As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all `text` and `tail` attributes in the correct order.

However, recent-enough versions (including the ones in the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current released versions of both `ElementTree` and `lxml` on PyPI) can do this for you automatically in the `tostring` method (output shown as on Python 2.7):

    >>> from xml.etree import ElementTree
    >>> s = '''<tag>
    ...  Some <a>example</a> text
    ... </tag>'''
    >>> t = ElementTree.fromstring(s)
    >>> ElementTree.tostring(t, method='text')
    '\n Some example text\n'

If you also want to strip whitespace from the text, you'll need to do so manually. In your simple case, that's easy:

    >>> ElementTree.tostring(t, method='text').strip()
    'Some example text'

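On Python 3, `tostring` returns `bytes` by default, so the session above would print a `b'...'` literal. A minimal sketch of getting a `str` instead, assuming the standard-library `xml.etree.ElementTree` and passing `encoding='unicode'` (the variable names here are illustrative):

```python
import xml.etree.ElementTree as ET

xml_text = """<tag>
 Some <a>example</a> text
</tag>"""
root = ET.fromstring(xml_text)

# Default encoding: the result is a bytes object on Python 3.
as_bytes = ET.tostring(root, method='text')

# encoding='unicode' asks for a str instead of bytes.
as_str = ET.tostring(root, encoding='unicode', method='text')

print(type(as_bytes).__name__)
print(as_str.strip())
```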
In more complicated cases, however, where you want to strip out whitespace within intermediate tags, you'll probably have to fall back on recursively processing the `text`s and `tail`s. That's not too hard; you just have to remember to deal with the possibility that the attributes may be `None`. For example, here's a skeleton you can hook your own code on:

    def textify(t):
        s = []
        if t.text:
            s.append(t.text)
        # iterate over the element directly; getchildren() is deprecated
        for child in t:
            s.append(textify(child))
        if t.tail:
            s.append(t.tail)
        return ''.join(s)

This version only works when `text` and `tail` are guaranteed to be a `str` or `None`. For trees you build up manually, that's not guaranteed to be true.
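To illustrate that last caveat, here is a hedged sketch of a variant that coerces non-string values with `str()`. The name `textify_safe` and its `include_tail` parameter are hypothetical, introduced only for this example; unlike the skeleton above, it also leaves out the tail of the element it is first called on.

```python
import xml.etree.ElementTree as ET

def textify_safe(elem, include_tail=False):
    """Concatenate text and tails, coercing non-string values with str()."""
    parts = []
    if elem.text is not None:
        parts.append(str(elem.text))
    for child in elem:
        # A child's tail belongs to the parent's content, so include it.
        parts.append(textify_safe(child, include_tail=True))
    if include_tail and elem.tail is not None:
        parts.append(str(elem.tail))
    return ''.join(parts)

# A hand-built tree whose text is not a str.
root = ET.Element('tag')
root.text = 'Total: '
count = ET.SubElement(root, 'n')
count.text = 42          # an int, which a plain ''.join() would reject
count.tail = ' items'

print(textify_safe(root))
```

Whether to coerce, skip, or reject such values depends on the application; coercion is just the simplest choice to show.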
There is also a very simple solution if you can use XPath. It relies on XPath axes: more about them can be found here.

Suppose a node (such as a `div`) contains text of its own as well as other nodes with text inside (such as `a`, `center`, or another `div`), or contains only text, and we want to select all the text in that `div`. This can be done with the following XPath: `current_element.xpath("descendant-or-self::*/text()").extract()`. (The trailing `.extract()` is the Scrapy selector form; with plain `lxml`, `xpath()` already returns the list of strings.) The result is a list of all the text pieces within the current element, with any inner tags stripped.

What's nice about it is that no recursive function is needed; XPath takes care of all of it (using recursion internally, but from our side it is as clean as it can be).

Here is a Stack Overflow question concerning this proposed solution.
n.b.: This applies only to `lxml`. The `xml.etree` package does not know enough XPath to do this. – Tomalak, Nov 13 at 16:56
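As a rough sketch of the XPath approach with that caveat in mind: the function below uses `lxml` when it is installed and otherwise falls back to the standard library's `itertext()`, since `xml.etree` cannot evaluate this expression. The names `text_fragments` and `XPATH_ALL_TEXT` are made up for this example, and the fallback is an assumption about what is acceptable when `lxml` is missing.

```python
import xml.etree.ElementTree as ET

XPATH_ALL_TEXT = "descendant-or-self::*/text()"

def text_fragments(xml_text):
    """Return the text pieces under the root element, in document order."""
    try:
        from lxml import etree  # third-party; needed for the XPath route
    except ImportError:
        # Standard-library fallback: itertext() walks text and tails in order.
        return list(ET.fromstring(xml_text).itertext())
    root = etree.fromstring(xml_text)
    # lxml returns string results for text() nodes; convert to plain str.
    return [str(piece) for piece in root.xpath(XPATH_ALL_TEXT)]

pieces = text_fragments('<tag>Some <a>example</a> text</tag>')
print(''.join(pieces))
```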
answered Oct 14 '13 at 22:07 by Benjamin Toueg (the itertext answer); edited Nov 13 at 17:33 by Tomalak
answered Oct 14 '13 at 21:59 by abarnert (the tostring answer); edited Oct 14 '13 at 22:19
answered Sep 22 at 11:49 by Michal (the XPath answer)