Jsoup does not parse title tag correctly












I have an example HTML document. When I parse it with Jsoup, the title tag ends up as a child node of the body tag instead of the head tag. Is this a bug?
The test program and the example HTML document follow.



import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) throws IOException {
        // Read test.html from the classpath as a UTF-8 string
        String s = FileUtils.readFileToString(new File(Test.class.getResource("test.html").getFile()), StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(s);
        System.out.println(doc.select("body > title").first());
        System.out.println(doc.select("head > title").first());
    }
}


example html document










      jsoup






      asked Nov 16 '18 at 7:59









      AlpacaMan
























          1 Answer














          I've tried the code above but with a modification:



          import org.jsoup.Jsoup;
          import org.jsoup.nodes.Document;

          public class Main {

              public static void main(String[] args) {
                  // The markup shared in the question, inlined as a string literal
                  String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" +
                          "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
                          "<head>\n" +
                          " <title>404</title>\n" +
                          "</head>\n" +
                          "<body>\n" +
                          "</body>\n" +
                          "</html>\n";

                  Document document = Jsoup.parse(html);
                  System.out.println(document.select("head > title").text());
                  System.out.println(document.select("title").text());
                  System.out.println(document.select("body > title").text());
              }
          }


          Here I define a String called "html" and initialize it with the HTML you shared via OneDrive, and it works as expected. Selecting the head tag and then its nested title tag prints 404, and so does selecting just the title tag. Selecting the body tag and then a nested title tag prints a blank line, because no such element exists.



          My guess is that the error comes from the way the HTML is loaded: either the read alters the document's structure or the file is read incorrectly. So check this piece of code and consider replacing it:



          String s = FileUtils.readFileToString(new File(Test.class.getResource("test.html").getFile()), StandardCharsets.UTF_8);
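One way to check what that read actually produced is to dump the first few code points of the loaded string, since invisible leading characters do not show up when the string is simply printed. This is only a minimal sketch using the standard library; the class name `LeadingChars`, the helper `describeLeadingChars`, and the sample input are hypothetical and not taken from the question.

```java
public class LeadingChars {

    // Formats the first `count` code points of `text` as space-separated U+XXXX tokens.
    public static String describeLeadingChars(String text, int count) {
        StringBuilder out = new StringBuilder();
        text.codePoints().limit(count).forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input: an invisible character followed by the start of a doctype
        String s = "\uFEFF<!DOCTYPE html>";
        System.out.println(describeLeadingChars(s, 3)); // U+FEFF U+003C U+0021
    }
}
```

If the first token is anything other than the `<` of the doctype (U+003C), the string handed to Jsoup is not the markup you see in an editor.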


          To solve your problem in this particular case you can just select the title tag, because it should be unique in the whole HTML document. Jsoup selectors do not require you to spell out every element from the root down to the one you want; you can select an element directly if it has some kind of identifier.



          For more on Jsoup's CSS selectors, take a look at the JSOUP CSS Selection Syntax examples.



          Hope this helped you! If you need anything else feel free to ask!






          edited Nov 16 '18 at 19:16









          Zoe











          answered Nov 16 '18 at 19:14









          alvarobartt













          • I guess it is because of the BOM of the UTF-8 encoded HTML document. When I changed the BOM to a space while debugging, Jsoup worked as expected.

            – AlpacaMan
            Nov 19 '18 at 1:27
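If the BOM is indeed the cause, as the comment above suggests, one possible workaround is to drop a leading U+FEFF from the string before handing it to `Jsoup.parse`. The following is a minimal sketch under that assumption, using only the standard library; the class name `BomStripper`, the helper `stripBom`, and the shortened sample markup are hypothetical.

```java
public class BomStripper {

    // U+FEFF is the Unicode byte order mark
    private static final char BOM = '\uFEFF';

    // Returns `text` without a single leading BOM, or unchanged if there is none.
    public static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Hypothetical input: shortened markup with a BOM in front
        String raw = "\uFEFF<html><head><title>404</title></head><body></body></html>";
        String clean = stripBom(raw);
        System.out.println(clean.startsWith("<html>")); // true
        // `clean` could then be passed to Jsoup.parse(clean)
    }
}
```

Unlike replacing the BOM with a space, this leaves the rest of the document untouched and does nothing when no BOM is present.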


















