Parsing unstructured data to pandas data frame
I currently have the following data structure in a pandas DataFrame, after importing a *.txt file via read_csv:
    label        text
0   ###24293578  NaN
1   INTRO        Some text...
2   METHODS      Some text...
3   METHODS      Some text...
4   METHODS      Some text...
5   RESULTS      Some text...
6   ###24854809  NaN
7   BACKGROUND   Some text...
8   INTRO        Some text...
9   METHODS      Some text...
10  METHODS      Some text...
11  RESULTS      Some text...
12  ###25165090  NaN
13  BACKGROUND   Some text...
14  METHODS      Some text...
...
What I would like to achieve is a running id for each row, taken from the preceding line marked with "###":
id        label       text
24293578  INTRO       Some text...
24293578  METHODS     Some text...
24293578  ...         ...
24854809  BACKGROUND  Some text...
24854809  ...         ...
25165090  BACKGROUND  Some text...
25165090  ...         ...
I currently use the following code to transform the data:
m = df['label'].str.contains("###", na=False)
df['new'] = df['label'].where(m).ffill()
df = df[df['label'] != df['new']].copy()
df['label'] = df.pop('new').str.lstrip('#') + ' ' + df['label']
df[['id','area']] = df['label'].str.split(' ',expand=True)
df = df.drop(columns=['label'])
df
Out:
   text          id        area
1  Some text...  24293578  OBJECTIVE
...
6  Some text...  24854809  BACKGROUND
...
It does the job, but I feel this isn't the best approach. Is there a way to write the code more cleanly, or to make it more efficient? I'm also curious whether such a function could be embedded directly into the read_csv step.
Thank you!
pandas indexing transformation
asked Nov 9 at 17:30
Christopher
1 Answer
accepted
You can do it in three steps:
# Copy the label into id on the marker rows (where text is null) and strip the '#'.
# All other rows get NaN.
df['id'] = df.loc[df['text'].isnull(), 'label'].str.strip('#')
# Forward-fill the id down to the rows that follow each marker.
# Assign the result back instead of using inplace=True on a column selection.
df['id'] = df['id'].ffill()
# Remove the marker rows (the rows where text is null)
df = df.dropna(subset=['text'])
>>> df
    label       text          id
1   INTRO       Some text...  24293578
2   METHODS     Some text...  24293578
3   METHODS     Some text...  24293578
4   METHODS     Some text...  24293578
5   RESULTS     Some text...  24293578
7   BACKGROUND  Some text...  24854809
8   INTRO       Some text...  24854809
9   METHODS     Some text...  24854809
10  METHODS     Some text...  24854809
11  RESULTS     Some text...  24854809
13  BACKGROUND  Some text...  25165090
14  METHODS     Some text...  25165090
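The same idea can be wrapped in a small function that leaves the input frame untouched. This is only a sketch under a few assumptions: the names `attach_ids`, `marker` and `sample` are made up for illustration, the sample frame is a shortened copy of the data in the question, and marker rows are detected by their "###" prefix rather than by a null text (a swap that may be safer if a real row can have missing text).

```python
import pandas as pd


def attach_ids(df: pd.DataFrame, marker: str = "###") -> pd.DataFrame:
    """Return a new frame with an 'id' column taken from the marker rows."""
    # True on rows such as "###24293578", False everywhere else
    is_marker = df["label"].str.startswith(marker, na=False)
    # Keep the label only on marker rows, cut off the prefix, then
    # carry each id forward to the rows that follow it
    ids = df["label"].where(is_marker).str[len(marker):].ffill()
    # Attach the ids and drop the marker rows themselves
    out = df.assign(id=ids)[~is_marker]
    return out[["id", "label", "text"]].reset_index(drop=True)


# Hypothetical sample, shortened from the question
sample = pd.DataFrame(
    {
        "label": ["###24293578", "INTRO", "METHODS",
                  "###24854809", "BACKGROUND", "RESULTS"],
        "text": [None, "Some text...", "Some text...",
                 None, "Some text...", "Some text..."],
    }
)

result = attach_ids(sample)
print(result)
```

The ids stay strings here; if integers are wanted, `result["id"].astype(int)` should work as long as every id is purely numeric.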
answered Nov 9 at 17:36
sacul
Thanks, that seems perfect!
– Christopher
Nov 9 at 18:05
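On the side question about doing this during the read step: as far as I know, read_csv has no hook that carries state from one row to the next, so one option is to parse the file yourself and build the frame afterwards. The sketch below is a guess at how that could look, not a tested recipe for the real file: the function name `read_labelled`, the tab separator and the in-memory sample are all assumptions, since the actual file layout is not shown in the question.

```python
import io

import pandas as pd


def read_labelled(handle, marker="###", sep="\t"):
    """Build an (id, label, text) frame from an open text file.

    Assumes marker lines look like "###24293578" and every other
    non-empty line looks like "LABEL<sep>text".
    """
    rows = []
    current_id = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(marker):
            # Remember the id for all rows up to the next marker
            current_id = line[len(marker):].strip()
            continue
        # Split only at the first separator, so the text may contain it
        label, _, text = line.partition(sep)
        rows.append((current_id, label, text))
    return pd.DataFrame(rows, columns=["id", "label", "text"])


# Hypothetical file content standing in for the real *.txt file
sample_file = io.StringIO(
    "###24293578\n"
    "INTRO\tSome text...\n"
    "METHODS\tSome text...\n"
    "\n"
    "###24854809\n"
    "BACKGROUND\tSome text...\n"
)

frame = read_labelled(sample_file)
print(frame)
```

With a real file this would be called as `with open(path) as fh: frame = read_labelled(fh)`, where `path` is whatever file was being passed to read_csv.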