Python pdfminer.six doubles or triples each line in lt_obj.get_text()

I'm seeing strange behavior with pdfminer.six.
I'm trying to parse paragraphs from contract documents.
On many documents everything works fine, but on some, each line comes back doubled or tripled when I call lt_obj.get_text() to extract the text of a paragraph.

Example (anonymized):
PDF Example

And what I get from print(lt_obj.get_text()) is:

"Line1 Content

Line1 Content

Line2 Content

Line2 Content

Line3 Content

Line3 Content

Line4 Content

Line4 Content

Line5 Content

Line5 Content"

I thought it had something to do with recursively visiting the layout objects that contain text, but as you can see in my code, I turned the recursion off and still get these results.
This does not happen in all my documents, only in some. But when it happens, it affects the whole document and all its paragraphs.
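One possible explanation, which I have not confirmed for these files, is that the PDF itself draws the same text more than once at almost the same position (some generators do this to fake bold or shadow effects). A rough way to check would be to group the text lines by their text and position. The sketch below is only an illustration: `find_overlapping_duplicates` and `tolerance` are names I made up, and it assumes each object exposes `bbox` and `get_text()` the way pdfminer's text line objects do.

```python
from collections import defaultdict


def find_overlapping_duplicates(text_lines, tolerance=1.0):
    """Return groups of lines that share the same text and roughly the same origin.

    Hypothetical helper: expects objects with a .bbox tuple (x0, y0, x1, y1)
    and a .get_text() method. Positions are bucketed by `tolerance`, so two
    copies that fall on either side of a bucket boundary can be missed.
    """
    groups = defaultdict(list)
    for line in text_lines:
        x0, y0, _x1, _y1 = line.bbox
        key = (
            line.get_text().strip(),
            round(x0 / tolerance),
            round(y0 / tolerance),
        )
        groups[key].append(line)
    # Keep only the keys that were seen more than once.
    return {key: members for key, members in groups.items() if len(members) > 1}
```

If this returns non-empty groups for the affected documents, the duplication is probably already present in the page content and not caused by how the layout is traversed.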

My code:

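    from pdfminer.layout import LTFigure, LTTextBox, LTTextLine


    def parse_layout(layout, page_counter, doc):
        # Walk the top-level layout objects of one page.
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                text = lt_obj.get_text()
                print(text)
            elif isinstance(lt_obj, LTFigure):
                # Recursion into figures is switched off on purpose.
                pass
                # parse_layout(lt_obj, page_counter, doc)  # Recursive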
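Until the cause is clear, a stopgap could be to collapse the repeated lines after calling get_text(). This is a minimal sketch, assuming the copies are always identical and directly follow each other as in the output above; `collapse_repeated_lines` is a hypothetical helper, not part of pdfminer.six.

```python
def collapse_repeated_lines(text):
    """Drop blank lines and any line equal to the previously kept line."""
    kept = []
    previous = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank separator lines
        if stripped == previous:
            continue  # same as the last kept line: treat as a duplicate
        kept.append(stripped)
        previous = stripped
    return "\n".join(kept)
```

It would be called as collapse_repeated_lines(lt_obj.get_text()). Note that it also removes lines that are legitimately repeated in the contract, so it hides the symptom and does not fix the cause.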

asked Nov 21 '18 at 11:55

Milipp

112

python pdfminer
