Parsing unstructured data to pandas data frame























I currently have the following data structure in a pandas DataFrame, after importing a *.txt file via read_csv:



    label        text
0   ###24293578  NaN
1   INTRO        Some text...
2   METHODS      Some text...
3   METHODS      Some text...
4   METHODS      Some text...
5   RESULTS      Some text...
6   ###24854809  NaN
7   BACKGROUND   Some text...
8   INTRO        Some text...
9   METHODS      Some text...
10  METHODS      Some text...
11  RESULTS      Some text...
12  ###25165090  NaN
13  BACKGROUND   Some text...
14  METHODS      Some text...
...
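
To make the example reproducible without the original file, a frame of the same shape can be built by hand. This is only a sketch: the text values are the same placeholders as in the listing above, and only the first 15 rows are covered.

```python
import numpy as np
import pandas as pd

# Labels copied from the listing above; rows starting with "###" are id markers.
labels = [
    "###24293578", "INTRO", "METHODS", "METHODS", "METHODS", "RESULTS",
    "###24854809", "BACKGROUND", "INTRO", "METHODS", "METHODS", "RESULTS",
    "###25165090", "BACKGROUND", "METHODS",
]

# Marker rows carry no text; every other row gets a placeholder string.
texts = [np.nan if lab.startswith("###") else "Some text..." for lab in labels]

df = pd.DataFrame({"label": labels, "text": texts})
print(df)
```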


What I would like to achieve is a running id on each row, taken from the preceding line marked with "###":



id        label       text
24293578  INTRO       Some text...
24293578  METHODS     Some text...
24293578  ...         ...
24854809  BACKGROUND  Some text...
24854809  ...         ...
25165090  BACKGROUND  Some text...
25165090  ...         ...


I currently use the following code to transform the data:



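# Flag the marker rows whose label contains "###"
m = df['label'].str.contains("###", na=False)
# Copy the marker labels into a helper column and carry them forward
df['new'] = df['label'].where(m).ffill()
# Drop the marker rows themselves
df = df[df['label'] != df['new']].copy()
# Prefix each label with its id, then split the pair into separate columns
df['label'] = df.pop('new').str.lstrip('#') + ' ' + df['label']
df[['id','area']] = df['label'].str.split(' ',expand=True)
df = df.drop(columns=['label'])
df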


Out:



    text          id        area
1   Some text...  24293578  INTRO
...
7   Some text...  24854809  BACKGROUND
...


It does the job, but I feel this isn't the best approach. Is there a way to write the code more cleanly or make it more efficient? I'm also curious whether such a function could be embedded directly into the read_csv step.



Thank you!

















Tags: pandas, indexing, transformation
















asked Nov 9 at 17:30 by Christopher
























1 Answer (accepted)










You can do it in three steps:



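# Copy the label into a new id column where text is null, and strip out the '#'.
# All other rows will be NaN.
df['id'] = df.loc[df['text'].isnull(), 'label'].str.strip('#')

# Forward-fill the id down to the rows that follow each marker.
# Assigning the result back avoids ffill(inplace=True) on a column selection,
# which newer pandas versions warn about.
df['id'] = df['id'].ffill()

# Remove the rows where text is null (the marker rows)
df.dropna(subset=['text'], inplace=True)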

>>> df
    label       text          id
1   INTRO       Some text...  24293578
2   METHODS     Some text...  24293578
3   METHODS     Some text...  24293578
4   METHODS     Some text...  24293578
5   RESULTS     Some text...  24293578
7   BACKGROUND  Some text...  24854809
8   INTRO       Some text...  24854809
9   METHODS     Some text...  24854809
10  METHODS     Some text...  24854809
11  RESULTS     Some text...  24854809
13  BACKGROUND  Some text...  25165090
14  METHODS     Some text...  25165090
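
On the read_csv part of the question: as far as I know, read_csv has no option that does this during parsing, so one way to keep everything in a single step is to wrap the read and the three steps in a small function. The sketch below is only an illustration. The helper name `read_sections`, the tab separator, and the missing header row are assumptions, since the question does not show how the file is laid out.

```python
import io

import pandas as pd


def read_sections(source, sep="\t"):
    """Read a marker-delimited file and attach the running id to each row.

    Hypothetical helper: the separator and the absence of a header row
    are assumptions, not details given in the question.
    """
    df = pd.read_csv(source, sep=sep, header=None, names=["label", "text"])

    # Marker rows are the ones without text; their label holds the id.
    is_marker = df["text"].isna()

    # Take the id from marker rows only, then carry it down to the rows below.
    ids = df["label"].where(is_marker).str.lstrip("#").ffill()

    # Attach the id and drop the marker rows themselves.
    return df.assign(id=ids).loc[~is_marker, ["id", "label", "text"]]


# Small in-memory stand-in for the *.txt file.
sample = io.StringIO(
    "###24293578\n"
    "INTRO\tSome text...\n"
    "METHODS\tSome text...\n"
    "###24854809\n"
    "BACKGROUND\tSome text...\n"
)

result = read_sections(sample)
print(result)
```

The function returns a new frame instead of modifying one in place, so it can be called directly on a file path as well as on a buffer like the one above.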


























Thanks, that seems perfect!
– Christopher, Nov 9 at 18:05











answered Nov 9 at 17:36 by sacul











