Jsoup does not parse title tag correctly

I have an example HTML document. When I parse it with Jsoup, the title tag ends up as a child node of the body tag instead of the head tag. Is this a bug?
Below are the test program and the example HTML document.

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) throws IOException {
        // Read test.html from the classpath as UTF-8 text
        String s = FileUtils.readFileToString(new File(Test.class.getResource("test.html").getFile()), StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(s);
        System.out.println(doc.select("body > title").first());
        System.out.println(doc.select("head > title").first());
    }
}

example html document

asked Nov 16 '18 at 7:59

AlpacaMan

768

1 Answer
I've tried the code above but with a modification:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) {
        // Same markup as the shared document, inlined as a string literal
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" +
                "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
                "<head>\n" +
                "  <title>404</title>\n" +
                "</head>\n" +
                "<body>\n" +
                "</body>\n" +
                "</html>\n";

        Document document = Jsoup.parse(html);
        System.out.println(document.select("head > title").text());
        System.out.println(document.select("title").text());
        System.out.println(document.select("body > title").text());
    }
}

Here I define a String called html and initialize it with the HTML you shared via OneDrive, and it works as expected. Selecting the head tag and then the nested title tag prints 404, and so does selecting the title tag alone. Selecting the body tag and then the nested title tag prints an empty line, because no such element exists.

I guess the error comes from the way the HTML is loaded: either the loading step modifies the document's structure or the file is read incorrectly. So check this piece of code and consider replacing it:

String s = FileUtils.readFileToString(new File(Test.class.getResource("test.html").getFile()), StandardCharsets.UTF_8);
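One way to check whether the reading step is at fault is to look at the first character of the decoded string. The sketch below uses only the JDK; the class name BomCheck and the helper startsWithBom are made up for illustration, and the byte array is a stand-in for the contents of test.html, which may differ. It relies on the fact that Java's UTF-8 decoder keeps a leading byte order mark as the character U+FEFF instead of dropping it.

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {

    // Hypothetical helper: true if the decoded text begins with U+FEFF (the byte order mark).
    static boolean startsWithBom(String text) {
        return !text.isEmpty() && text.charAt(0) == '\uFEFF';
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 file saved with a BOM (bytes EF BB BF) in front of the markup.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'h', 't', 'm', 'l', '>'};
        String decoded = new String(withBom, StandardCharsets.UTF_8);

        System.out.println(startsWithBom(decoded));  // the BOM survives decoding
        System.out.println(startsWithBom("<html>")); // plain markup has no BOM
    }
}
```

Running the same check on the string s from the question would show whether the file carries a BOM.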

To solve your problem in this particular case, you can select the title tag directly, because it should be unique in the whole HTML document. A Jsoup selector does not have to spell out every element from the root down to the one you want; you can select an element directly if it has some kind of identifier.

For more on Jsoup's CSS selectors, take a look at the Jsoup CSS selection syntax examples.

Hope this helped you! If you need anything else feel free to ask!

edited Nov 16 '18 at 19:16

Zoe

11.3k73976

answered Nov 16 '18 at 19:14

alvarobartt

12418

I guess it is because of the BOM of the UTF-8 encoded HTML document. When I changed the BOM to a space while debugging, Jsoup worked as expected.

– AlpacaMan
Nov 19 '18 at 1:27
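If the BOM is indeed the cause, a plausible explanation is that U+FEFF at the start of the string counts as text before the doctype, so an HTML5-style parser opens the body early and then places the title there. A minimal sketch of removing it before parsing follows. It assumes the file was read into a String as in the question; BomStripper and stripBom are hypothetical names, not part of Jsoup or Commons IO.

```java
public class BomStripper {

    private static final char BOM = '\uFEFF';

    // Hypothetical helper: returns the text without a leading byte order mark, if there is one.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Stand-in for the string read from test.html, with a BOM in front of the doctype.
        String raw = "\uFEFF<!DOCTYPE html><html><head><title>404</title></head><body></body></html>";
        String cleaned = stripBom(raw);

        System.out.println(cleaned.startsWith("<!DOCTYPE")); // markup now starts at the doctype
        System.out.println(raw.length() - cleaned.length()); // number of characters removed
    }
}
```

The cleaned string can then be passed to Jsoup.parse(String) as in the question's test program.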
