Parsing unstructured data to pandas data frame

I currently have the following data structure in a pandas DataFrame, after importing a *.txt file via read_csv:

    label        text
0   ###24293578  NaN
1   INTRO        Some text...
2   METHODS      Some text...
3   METHODS      Some text...
4   METHODS      Some text...
5   RESULTS      Some text...
6   ###24854809  NaN
7   BACKGROUND   Some text...
8   INTRO        Some text...
9   METHODS      Some text...
10  METHODS      Some text...
11  RESULTS      Some text...
12  ###25165090  NaN
13  BACKGROUND   Some text...
14  METHODS      Some text...
...

What I would like to achieve is a running id for each row, taken from the preceding row marked with "###":

id        label       text
24293578  INTRO       Some text...
24293578  METHODS     Some text...
24293578  ...         ...
24854809  BACKGROUND  Some text...
24854809  ...         ...
25165090  BACKGROUND  Some text...
25165090  ...         ...

I currently use the following code to transform the data:

m = df['label'].str.contains("###", na=False)
df['new'] = df['label'].where(m).ffill()
df = df[df['label'] != df['new']].copy()
df['label'] = df.pop('new').str.lstrip('#') + ' ' + df['label']
df[['id','area']] = df['label'].str.split(' ',expand=True)
df = df.drop(columns=['label'])
df

Out:

    text            id          area
1   Some text...    24293578    OBJECTIVE
...
6   Some text...    24854809    BACKGROUND
...

It does the job, but I feel this isn't the best approach. Is there a way to write the code more cleanly or make it more efficient? I'm also curious whether a function could be embedded directly into the read_csv step.

Thank you!

asked Nov 9 at 17:30

Christopher

3351619

pandas indexing transformation


1 Answer
accepted

You can do it in three steps:

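# Copy the label into a new id column on the rows where text is null
# (the "###" marker rows) and strip the '#'. All other rows get NaN.
df['id'] = df.loc[df['text'].isnull(), 'label'].str.strip('#')

# Forward-fill the id down to the rows that follow each marker row.
# Assigning the result back avoids calling an inplace method on a column
# selection, which recent pandas versions warn about.
df['id'] = df['id'].ffill()

# Remove the marker rows (the rows where text is null)
df.dropna(subset=['text'], inplace=True)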



>>> df

         label          text        id
1        INTRO  Some text...  24293578
2      METHODS  Some text...  24293578
3      METHODS  Some text...  24293578
4      METHODS  Some text...  24293578
5      RESULTS  Some text...  24293578
7   BACKGROUND  Some text...  24854809
8        INTRO  Some text...  24854809
9      METHODS  Some text...  24854809
10     METHODS  Some text...  24854809
11     RESULTS  Some text...  24854809
13  BACKGROUND  Some text...  25165090
14     METHODS  Some text...  25165090
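
For anyone who wants to try this without the original file, the same idea can be wrapped in a small function that returns a new frame and leaves the input untouched. This is only a sketch and not part of the answer above: the sample data and the function name attach_ids are made up for illustration, and it finds the marker rows by their "###" prefix instead of by a null text value, which assumes that no ordinary label starts with "###".

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the frame in the question
df = pd.DataFrame({
    "label": ["###24293578", "INTRO", "METHODS",
              "###24854809", "BACKGROUND", "RESULTS"],
    "text": [np.nan, "Some text...", "Some text...",
             np.nan, "Some text...", "Some text..."],
})


def attach_ids(frame):
    """Return a copy in which every data row carries the id of the
    closest preceding '###' marker row; marker rows are dropped."""
    # Assumption: marker rows are exactly those whose label starts with '###'
    is_marker = frame["label"].str.startswith("###", na=False)

    # Keep the label only on marker rows, strip the '#', then forward-fill
    ids = frame["label"].where(is_marker).str.lstrip("#").ffill()

    # Add the id column, drop the marker rows, and reorder the columns
    out = frame.assign(id=ids)
    return out.loc[~is_marker, ["id", "label", "text"]]


result = attach_ids(df)
print(result)
```

Because nothing is modified in place, the original df can still be inspected afterwards, and the column order matches the layout asked for in the question (id, label, text).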

answered Nov 9 at 17:36

sacul

27k41638

Thanks, that seems perfect!
– Christopher
Nov 9 at 18:05
