Python pdfminer.six doubles or triples each line in lt_obj.get_text()
I am seeing strange behavior with pdfminer.six.
I'm trying to parse paragraphs from contract documents.
On many documents everything works fine, but on some others this happens:
I use lt_obj.get_text() to extract the text of a paragraph, and each line comes back doubled or tripled.
Example (anonymized):
PDF Example
And what I get from print(lt_obj.get_text()) is:
"Line1 Content
Line1 Content
Line2 Content
Line2 Content
Line3 Content
Line3 Content
Line4 Content
Line4 Content
Line5 Content
Line5 Content"
I thought it had something to do with recursively descending into the lt_objects that contain text, but as you can see in my code, I turned that off and still get these results.
However, this does not happen in all my documents, only in some. When it does happen, it affects the whole document and all its paragraphs.
My code:
from pdfminer.layout import LTFigure, LTTextBox, LTTextLine

def parse_layout(layout, page_counter, doc):
    # Walk the top-level layout objects of one page.
    for lt_obj in layout:
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            text = lt_obj.get_text()
            print(text)
        elif isinstance(lt_obj, LTFigure):
            pass
            # parse_layout(lt_obj, page_counter, doc)  # Recursive (disabled)
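One possible stopgap, independent of what causes the repetition, is to post-process the string returned by get_text() and drop any line that exactly repeats the line directly before it. The sketch below is only an illustration: collapse_repeated_lines is a hypothetical helper, not part of pdfminer.six, and it assumes the repeated copies are adjacent and identical apart from surrounding whitespace. It would also collapse lines that are legitimately repeated in the document, so it hides the symptom rather than explaining it.

```python
def collapse_repeated_lines(text):
    """Drop lines that exactly repeat the line immediately before them.

    Hypothetical helper for illustration; not a pdfminer.six function.
    """
    result = []
    previous = None
    for line in text.splitlines():
        # Compare with surrounding whitespace stripped, since repeated
        # copies may differ only in trailing spaces.
        key = line.strip()
        if key and key == previous:
            continue  # same content as the preceding line: skip it
        result.append(line)
        previous = key
    return "\n".join(result)


# Stand-in for the string returned by lt_obj.get_text().
sample = "Line1 Content\nLine1 Content\nLine2 Content\nLine2 Content\nLine2 Content\n"
print(collapse_repeated_lines(sample))
```

In parse_layout this could be applied as text = collapse_repeated_lines(lt_obj.get_text()) before printing. Blank lines are left untouched, so paragraph spacing is preserved.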
python pdfminer
asked Nov 21 '18 at 11:55
Milipp