Jsoup does not parse title tag correctly
I have an example HTML document. When I parse it with Jsoup, the title tag ends up as a child node of the body tag instead of the head tag. Is this a bug?
The test program and the example HTML document follow.
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) throws IOException {
        // Read test.html from the classpath into a String, decoding it as UTF-8
        String s = FileUtils.readFileToString(new File(Test.class.getResource("test.html").getFile()), StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(s);
        // Unexpectedly, the title is found under body rather than head
        System.out.println(doc.select("body > title").first());
        System.out.println(doc.select("head > title").first());
    }
}
example html document
jsoup
asked Nov 16 '18 at 7:59
AlpacaMan
1 Answer
I tried the code above with one modification:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        // Same markup as the shared test.html, inlined as a String literal
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" +
                "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
                "<head>\n" +
                " <title>404</title>\n" +
                "</head>\n" +
                "<body>\n" +
                "</body>\n" +
                "</html>\n";
        Document document = Jsoup.parse(html);
        System.out.println(document.select("head > title").text());
        System.out.println(document.select("title").text());
        System.out.println(document.select("body > title").text());
    }
}
Here I define a String called html and initialize it with the HTML you shared via OneDrive, and it works as expected. Selecting head > title prints 404, and so does selecting title on its own. Selecting body > title prints an empty line, because no such element exists.
I guess the error comes from how the HTML is loaded: either the loading step changes the document's structure or the file is read incorrectly. Check this line and consider replacing it:
String s = FileUtils.readFileToString(new File(Test.class.getResource("test.html").getFile()), StandardCharsets.UTF_8);
To solve your problem in this particular case, you can select the title tag directly, because it is going to be unique in the whole HTML. Jsoup selectors do not require you to spell out every element from the root down to the one you want; you can select an element directly if it has any kind of identifier.
If you need more on Jsoup's CSS selectors, take a look at the JSOUP CSS Selection Syntax examples.
Hope this helped you! If you need anything else feel free to ask!
I guess it is because of the BOM in the UTF-8 encoded HTML document. When I replaced the BOM with a space while debugging, Jsoup worked as expected.
– AlpacaMan
Nov 19 '18 at 1:27
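If the BOM is indeed the cause, as the comment above suggests, one way to handle it is to drop a leading U+FEFF from the decoded text before it reaches Jsoup.parse. Java's UTF-8 decoder keeps the BOM as the first character of the String, so the parser presumably sees a stray character ahead of the doctype. The following is a minimal sketch using only the JDK; BomUtil and stripBom are hypothetical names for illustration, not part of jsoup or commons-io, and the file contents are simulated with a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of jsoup or commons-io.
class BomUtil {

    // Returns the text without a leading byte order mark, if one is present.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Simulate a UTF-8 file saved with a BOM (bytes EF BB BF).
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] body = "<html><head><title>404</title></head><body></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] raw = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, raw, 0, bom.length);
        System.arraycopy(body, 0, raw, bom.length, body.length);

        // Decoding as UTF-8 keeps the BOM as the first character.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println("starts with BOM: " + (decoded.charAt(0) == '\uFEFF'));

        // After stripping, the text starts with the markup itself.
        String cleaned = stripBom(decoded);
        System.out.println("starts with markup: " + cleaned.startsWith("<html>"));
        // cleaned can now be handed to Jsoup.parse(cleaned).
    }
}
```

In the question's program this would mean calling stripBom(s) on the result of FileUtils.readFileToString before parsing. Another option that may be worth trying is the Jsoup.parse(File, String) overload, which lets jsoup read the file itself; whether it handles the BOM depends on the jsoup version in use.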
edited Nov 16 '18 at 19:16
Zoe
answered Nov 16 '18 at 19:14
alvarobartt