python - html - how to modify code by converting text outside of a tag into a tag
How to replace/convert/correct a string representing a tag into a tag?
I have the example below, where I need to clean up parts of the markup and convert escaped strings like &lt;/div&gt; into proper tags:

    html = """
    <html>
    <body>
    <div>
    &lt;/div&gt; <----- how to convert the line into </div>
    <div class="first_class">
    <h1 id="Header_1">
    Header_1
    </h1>
    </div>
    </body>
    </html>
    """

I tried:

    soup = BeautifulSoup(html, "lxml")
    tag = soup.find(text="&lt;")
    tag.replace_with("<")
    print(soup.prettify())

but this logic doesn't work: find doesn't pick up the string. The fact that the text is outside of any tag makes it more difficult. How can this be achieved?
python html beautifulsoup
asked Nov 16 '18 at 2:30 by Chris
Did you try soup.find(text="<")? The string was encoded in the original HTML, but BeautifulSoup should have decoded it when parsing, and therefore uses the decoded version when matching in find.
– Lie Ryan
Nov 16 '18 at 5:50
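The comment's point can be checked without BeautifulSoup. The sketch below is only an illustration, not BeautifulSoup's own code path: it uses the standard library's HTMLParser with a hypothetical TextCollector helper to show that a parser hands over text with the entities already decoded. Assuming BeautifulSoup behaves the same way, as the comment says, the text to search for is the literal "<" rather than "&lt;".

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every text node the parser reports."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<div>\n&lt;/div&gt;\n</div>")
collector.close()

text = "".join(collector.chunks)
print(repr(text))        # '\n</div>\n'
print("&lt;" in text)    # False: the entity no longer exists after parsing
print("</div>" in text)  # True: the decoded characters are what remain
```

The escaped tag survives parsing as ordinary text, which is why it ends up outside any tag object and why replacing it with a real tag is easier to do on the raw string before parsing.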
3 Answers
Using str.replace:

    In [3]: print(html.replace('&lt;', '<').replace('&gt;', '>'))
    <html>
    <body>
    <div>
    </div>
    <div class="first_class">
    <h1 id="Header_1">
    Header_1
    </h1>
    </div>
    </body>
    </html>

To load the result into BeautifulSoup from a file, open the file first, replace the malformed text, and then pass the contents to BeautifulSoup. Something like this:

    import bs4

    with open('malformed.html') as f:
        malformed = f.read()
    # Fix the escaped brackets on the raw string, before parsing.
    html = malformed.replace('&lt;', '<').replace('&gt;', '>')
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = bs4.BeautifulSoup(html, 'html.parser')
answered Nov 16 '18 at 3:14 by aydow (edited Nov 17 '18 at 23:01)
@aydow That works on the self-contained example. However, when I load the HTML from a file into BeautifulSoup first and then try to replace, I get the error 'NoneType' object is not callable. Do you know how to work around that?
– Chris
Nov 17 '18 at 0:54
@Chris See the updated answer.
– aydow
Nov 17 '18 at 23:01
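The order of operations in the updated answer can be sketched with the standard library alone. The file name, its contents, and the fix_escaped_tags helper below are made up for illustration. The replacement runs on the plain string read from disk, before any parser sees it. The 'NoneType' object is not callable error is likely what happens when .replace is called on the parsed soup object instead: BeautifulSoup treats an unknown attribute name as a lookup for a child tag of that name, finds none, and returns None.

```python
import tempfile
from pathlib import Path


def fix_escaped_tags(text: str) -> str:
    """Turn the two escaped angle brackets back into literal ones."""
    return text.replace("&lt;", "<").replace("&gt;", ">")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file standing in for the real input.
    path = Path(tmp) / "malformed.html"
    path.write_text("<div>\n&lt;/div&gt;\n<h1>Header_1</h1>\n", encoding="utf-8")

    raw = path.read_text(encoding="utf-8")  # a plain str, not a soup object
    fixed = fix_escaped_tags(raw)

print(fixed)
# The repaired string is what would then be handed to BeautifulSoup.
```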
I think you need a function to decode them, such as unescape from the standard html module.

    # HTMLParser().unescape did the same job but was removed in Python 3.9.
    from html import unescape

    html = """
    <html>
    <body>
    <div>
    &lt;/div&gt; <----- how to convert the line into </div>
    <div class="first_class">
    <h1 id="Header_1">
    Header_1
    </h1>
    </div>
    </body>
    </html>
    """
    print(unescape(html))

Output:

    <html>
    <body>
    <div>
    </div> <----- how to convert the line into </div>
    <div class="first_class">
    <h1 id="Header_1">
    Header_1
    </h1>
    </div>
    </body>
    </html>
answered Nov 16 '18 at 5:41 by kcorlidy
@kcorlidy The logic doesn't work when the HTML is first parsed into BeautifulSoup, like html=(BeautifulSoup(open('C:\FolderTest.html'), 'html.parser')). Error: a bytes-like object is required, not 'str'
– Chris
Nov 17 '18 at 0:01
I ran html=(BeautifulSoup(open('C:\FolderTest.html'), 'html.parser')) but did not get that error. If you want to read the file as bytes, use open('C:\FolderTest.html', 'rb'). By the way, you must close the file when you have finished reading it. Use with open('Test.html', "rb") as fd: html = BeautifulSoup(fd.read(), 'html.parser')
– kcorlidy
Nov 17 '18 at 2:07
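The error quoted above is the message Python raises when a str is tested against bytes data. The failing code is not shown in full, so this is an assumption, but one plausible cause is that the content passed to unescape was bytes (for example from a file opened with 'rb'). A minimal sketch, with made-up sample data and an assumed UTF-8 encoding:

```python
from html import unescape

# Stand-in for what open(path, "rb").read() would return.
raw_bytes = b"<div>\n&lt;/div&gt;\n"

error = None
try:
    unescape(raw_bytes)  # unescape expects str, so bytes are rejected
except TypeError as exc:
    error = str(exc)
    print("TypeError:", exc)

text = raw_bytes.decode("utf-8")  # assumed encoding of the file
fixed = unescape(text)
print(fixed)
```

Decoding to str first (or opening the file in text mode) keeps unescape and str.replace working on the type they expect.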
Try using regular expressions instead. Something like:

    html = re.sub("&lt;", "<", html)

for less-than and

    html = re.sub("&gt;", ">", html)

for greater-than. Make sure you import re first.
Edit: for reference on how to use re.sub, see https://lzone.de/examples/Python%20re.sub
Edit2: After some further research, it seems that str.replace() is faster, so you may want to use that instead.
answered Nov 16 '18 at 2:51 by jwoff (edited Nov 16 '18 at 6:06)
@jwoff This doesn't work when the HTML is loaded from the file into BeautifulSoup first, some operations are performed, and the replacement is attempted at the end. I tried converting the BeautifulSoup object to a string with str(html), replacing, and converting back to BeautifulSoup for nicely formatted output, but there were some small unexpected changes to the structure of the code.
– Chris
Nov 17 '18 at 1:01
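One case where a regular expression may earn its place over chained str.replace calls is replacing several entities in a single pass. The sketch below is an illustration under assumptions: the ENTITIES table and the restore_entities helper are made up here, and "&amp;" is added to the two entities from the thread only to show the difference. Chained replaces can decode the same text twice, while one pass over the input cannot.

```python
import re

# Assumed mapping; the thread itself only discusses &lt; and &gt;.
ENTITIES = {"&amp;": "&", "&lt;": "<", "&gt;": ">"}
PATTERN = re.compile("|".join(map(re.escape, ENTITIES)))


def restore_entities(text: str) -> str:
    """Replace every known entity in one pass over the text."""
    return PATTERN.sub(lambda m: ENTITIES[m.group(0)], text)


print(restore_entities("<div>\n&lt;/div&gt;\n"))

sample = "a &amp;lt; b"  # an escaped "&lt;" that should stay escaped
single = restore_entities(sample)
chained = sample.replace("&amp;", "&").replace("&lt;", "<")
print(single)   # a &lt; b
print(chained)  # a < b  (decoded twice)
```

For just the two bracket entities, the order of the replaces does not matter and str.replace remains the simpler choice, as Edit2 suggests.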