Summarizing the contents of a text file











I have a text file like this example:



chrX    7970000    8670000   3  2   7   7   RPS6KA6   4
chrX 7970000 8670000 3 2 7 7 SATL1 3
chrX 7970000 8670000 3 2 7 7 SH3BGRL 4
chrX 7970000 8670000 3 2 7 7 VCX2 1
chrX 86580000 86980000 1 1 1 5 KLHL4 2
chrX 87370000 88620000 4 4 11 11 CPXCR1 2
chrX 87370000 88620000 4 4 11 11 FAM9A 2
chrX 89050000 91020000 11 6 10 13 FAM9B 3
chrX 89050000 91020000 11 6 10 13 PABPC5 2


I want to count the number of times each line is repeated, comparing only the 1st, 2nd and 3rd columns.
In the output there would be 5 columns. The first 3 columns are the same as in the input (each distinct combination appears only once). The 4th column holds, on a single line, all the values from the 8th column of the original file for the repeated lines. The 5th column is the number of times the first 3 columns are repeated in the original file.

In short: columns 4, 5, 6, 7 and 9 of the input file are not needed for the output file.
In the expected output, for example, the line chrX 7970000 8670000 is repeated 4 times, so the 5th column is 4 and the 4th column is RPS6KA6,SATL1,SH3BGRL,VCX2. As you can see, the values in the 4th column are comma-separated.



Here is the expected output:



chrX    7970000 8670000 RPS6KA6,SATL1,SH3BGRL,VCX2  4
chrX 86580000 86980000 KLHL4 1
chrX 87370000 88620000 CPXCR1,FAM9A 2
chrX 89050000 91020000 FAM9B,PABPC5 2


I am trying to do that in Python and wrote the following code:



file = open("myfile.txt", 'rb')
infile = []
for line in file:
    infile.append(line)
count = 0
final = []
for i in range(len(infile)):
    count += 1
    if infile[i-1] == infile[i]
        final.append(infile[0,1,2,7, count])


This code does not return what I want. Do you know how to fix it?










Tags: python

asked Nov 8 at 22:30 by elly, edited Nov 8 at 22:34 by martineau
























          3 Answers
          An alternative solution:

          from collections import defaultdict

          summary = defaultdict(list)

          # Input and collate: key on columns 1-3, collect column 8
          with open('myfile.txt', 'r') as fp:
              for line in fp:
                  items = line.strip().split()
                  key, data = (items[0], items[1], items[2]), items[7]
                  summary[key].append(data)

          # Output
          for keys, entries in summary.items():
              print('{keys}\t{entries} {count}'.format(
                  keys=' '.join(keys),
                  entries=','.join(entries),
                  count=len(entries)))


          With Python 2.7, this produces the output:

          chrX 7970000 8670000    RPS6KA6,SATL1,SH3BGRL,VCX2 4
          chrX 89050000 91020000    FAM9B,PABPC5 2
          chrX 87370000 88620000    CPXCR1,FAM9A 2
          chrX 86580000 86980000    KLHL4 1


          With Python 3.6, the output is:

          chrX 7970000 8670000    RPS6KA6,SATL1,SH3BGRL,VCX2 4
          chrX 86580000 86980000    KLHL4 1
          chrX 87370000 88620000    CPXCR1,FAM9A 2
          chrX 89050000 91020000    FAM9B,PABPC5 2


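          The output order differs between the two Python versions because dictionaries (and by extension defaultdicts) in Python 3.6 preserve the order in which keys are inserted. This is an implementation detail of CPython 3.6 and a language guarantee from Python 3.7 onwards; in Python 2.7 the order is arbitrary.
          It wasn't clear from your description whether the ordering is important.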
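
If the order does matter, one option is to sort the groups explicitly before printing instead of relying on dictionary order. This is only a sketch: the `group_rows` and `sorted_groups` helper names and the inline sample rows are made up for illustration, and it assumes columns 2 and 3 are always integers.

```python
from collections import defaultdict


def group_rows(rows):
    """Collect the column 8 values under a (col1, col2, col3) key."""
    summary = defaultdict(list)
    for items in rows:
        summary[tuple(items[:3])].append(items[7])
    return summary


def sorted_groups(summary):
    """Return (key, names) pairs ordered by chromosome name, then numeric start and end."""
    return sorted(summary.items(),
                  key=lambda kv: (kv[0][0], int(kv[0][1]), int(kv[0][2])))


# Hypothetical sample rows, deliberately out of order
sample = [
    "chrX 89050000 91020000 11 6 10 13 FAM9B 3".split(),
    "chrX 7970000 8670000 3 2 7 7 RPS6KA6 4".split(),
    "chrX 89050000 91020000 11 6 10 13 PABPC5 2".split(),
    "chrX 7970000 8670000 3 2 7 7 SATL1 3".split(),
]

for key, names in sorted_groups(group_rows(sample)):
    print('{}\t{}\t{}'.format(' '.join(key), ','.join(names), len(names)))
```

Note that the chromosome name is compared as plain text here, so with several chromosomes chr10 would sort before chr2.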



          The main reason I think your version doesn't work is that the expression infile[0,1,2,7, count] doesn't do what you think it does.



          It seems you expect it to extract the 0th, 1st, 2nd and 7th columns from a line. However, infile is a list of whole lines, and a comma-separated index like this is passed to the list as a tuple, which raises a TypeError. Python doesn't know about the columns in your data anyway; each line is just a string of characters.



          In my version I use the split method on each line. With no arguments it separates the line wherever there are spaces or tabs, i.e. it splits the data into columns.
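
To make the difference concrete, here is a small sketch using the first line from the question. It uses only built-in string and list operations; the variable names are illustrative.

```python
line = "chrX    7970000    8670000   3  2   7   7   RPS6KA6   4\n"

# Indexing the raw string gives single characters, not columns
first_char = line[0]

# split() with no arguments splits on any run of whitespace (spaces or tabs)
items = line.split()
key = (items[0], items[1], items[2])   # columns 1-3
gene = items[7]                        # column 8

# A comma-separated index is a tuple, which a list does not accept
try:
    items[0, 1, 2, 7]
    error = None
except TypeError as exc:
    error = exc

print(first_char, key, gene, type(error).__name__)
```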






          • Nice explanation of different sort order depending on Python version! :)
            – lmichelbacher
            Nov 9 at 12:11


















          This should do what you want:



          from collections import defaultdict  # 1

          with open('file.txt') as f:
              lines = [line.rstrip().split() for line in f]  # 2

          counter = defaultdict(list)  # 3
          for line in lines:
              counter[(line[0], line[1], line[2])].append(line[7])  # 4

          for key, value in counter.items():  # 5
              print('{} {} {}'.format(' '.join(key), ','.join(value), len(value)))  # 6


          Explanation:




          1. Import defaultdict from the standard-library collections module, which gives us a dictionary with a default value for missing keys

          2. Read the whole input file, remove the newline at the end of each line and split it into parts (on whitespace)

          3. Make a dictionary whose values are empty lists by default for any key

          4. Go through the lines and populate the dictionary


            1. Columns 1-3 are the key

            2. The character sequence in column 8 of each line is appended to the list for that key (if we hadn't used a defaultdict with list this would be more complicated)



          5. Iterate over the dictionary's key-value pairs

          6. Print the output, joining the data structures into the desired format.
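
For comparison, this is roughly what step 4 looks like with a plain dict, which is the extra bookkeeping that defaultdict(list) saves. It is a sketch only: `group_plain` is a hypothetical helper name and the rows are a small inline sample rather than the real file.

```python
def group_plain(rows):
    """Group column 8 by columns 1-3 using an ordinary dict."""
    groups = {}
    for row in rows:
        key = (row[0], row[1], row[2])
        if key not in groups:      # defaultdict(list) makes this check unnecessary
            groups[key] = []
        groups[key].append(row[7])
    return groups


# Hypothetical sample rows taken from the question's data
rows = [
    "chrX 87370000 88620000 4 4 11 11 CPXCR1 2".split(),
    "chrX 87370000 88620000 4 4 11 11 FAM9A 2".split(),
    "chrX 86580000 86980000 1 1 1 5 KLHL4 2".split(),
]

for key, value in group_plain(rows).items():
    print('{} {} {}'.format(' '.join(key), ','.join(value), len(value)))
```

A one-line alternative to the explicit check is `groups.setdefault(key, []).append(row[7])`.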


          Hope this helps 🙂.






          • snap - except I do it line by line rather than reading the entire file in one go.
            – Tony Suffolk 66
            Nov 8 at 22:57










          • Great minds think alike ;).
            – lmichelbacher
            Nov 9 at 12:09


















          This is a nice opportunity to use pandas. You can open and process your file like this:



          import pandas as pd
          # open file: columns are whitespace-separated and there is no header row
          df = pd.read_csv('myfile.txt', sep=r'\s+', header=None)
          # group and apply functions
          df = df.groupby([0, 1, 2])[7].agg([('count', 'size'),
                                             ('genes', lambda col: ', '.join(col))
                                             ]).reset_index()
          # rename columns
          df = df.rename({0: 'chromosome', 1: 'start_region', 2: 'end_region'}, axis=1)
          # save new file
          df.to_csv('newfile.txt', sep='\t', index=False, header=True)


          This creates a DataFrame that looks like this:



                0         1         2   3  4   5   6        7  8
          0  chrX   7970000   8670000   3  2   7   7  RPS6KA6  4
          1  chrX   7970000   8670000   3  2   7   7    SATL1  3
          2  chrX   7970000   8670000   3  2   7   7  SH3BGRL  4
          3  chrX   7970000   8670000   3  2   7   7     VCX2  1
          4  chrX  86580000  86980000   1  1   1   5    KLHL4  2
          5  chrX  87370000  88620000   4  4  11  11   CPXCR1  2
          6  chrX  87370000  88620000   4  4  11  11    FAM9A  2
          7  chrX  89050000  91020000  11  6  10  13    FAM9B  3
          8  chrX  89050000  91020000  11  6  10  13   PABPC5  2


          Now, using built-in functions we can groupby on columns [0, 1, 2] and apply functions over the groups, resulting in:



                0         1         2  count                          genes
          0  chrX   7970000   8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
          1  chrX  86580000  86980000      1                          KLHL4
          2  chrX  87370000  88620000      2                  CPXCR1, FAM9A
          3  chrX  89050000  91020000      2                  FAM9B, PABPC5


          This groups the data and adds the columns we are interested in:



          ('count', 'size') creates the column count using the function size, i.e. the number of rows in each group
          ('genes', lambda col: ', '.join(col)) creates the column genes using a lambda function that joins the values of the grouped column into one string.



          This is what the final file will look like:



          chromosome  start_region  end_region  count  genes
          chrX        7970000       8670000     4      RPS6KA6, SATL1, SH3BGRL, VCX2
          chrX        86580000      86980000    1      KLHL4
          chrX        87370000      88620000    2      CPXCR1, FAM9A
          chrX        89050000      91020000    2      FAM9B, PABPC5


          If you have any questions, come visit the pandas tag.






            First answer: answered Nov 8 at 22:56 by Tony Suffolk 66












            • Nice explanation of different sort order depending on Python version! :)
              – lmichelbacher
              Nov 9 at 12:11


















            • Nice explanation of different sort order depending on Python version! :)
              – lmichelbacher
              Nov 9 at 12:11
















            Nice explanation of different sort order depending on Python version! :)
            – lmichelbacher
            Nov 9 at 12:11




            Nice explanation of different sort order depending on Python version! :)
            – lmichelbacher
            Nov 9 at 12:11












            up vote
            2
            down vote













            This should do what you want:



            from collection import defaultdict # 1

            lines = [line.rstrip().split() for line in open('file.txt').readlines()] # 2

            counter = defaultdict(list) # 3
            for line in lines:
            counter[(line[0], line[1], line[2])].append(line[7]) # 4

            for key, value in counter.iteritems(): # 5
            print '{} {} {}'.format(' '.join(key), ','.join(value), len(value)) # 6


            Explanation:




            1. We're going to use a handy library that gives us a dictionary with a default value

            2. Read the whole input file, remove the new line at the end and split into parts (on white space)

            3. Make a dictionary whose values are empty lists by default for any key

            4. Go through the lines and populate the dictionary


              1. Columns 1-3 are the key

              2. For each character sequence in column 8, we append it to the list (if we hadn't used a defaultdict with list this would be more complicated)



            5. Iterate the dictionary's key-value pairs

            6. Print the output, joining the data structures to the desired format.


            Hope this helps 🙂.






            share|improve this answer

















            • 1




              snap - except I do it line by line rather than reading the entirely file in in one go.
              – Tony Suffolk 66
              Nov 8 at 22:57










            • Great minds think alike ;).
              – lmichelbacher
              Nov 9 at 12:09















            up vote
            2
            down vote













            This should do what you want:



            from collection import defaultdict # 1

            lines = [line.rstrip().split() for line in open('file.txt').readlines()] # 2

            counter = defaultdict(list) # 3
            for line in lines:
            counter[(line[0], line[1], line[2])].append(line[7]) # 4

            for key, value in counter.iteritems(): # 5
            print '{} {} {}'.format(' '.join(key), ','.join(value), len(value)) # 6


            Explanation:




            1. We're going to use a handy library that gives us a dictionary with a default value

            2. Read the whole input file, remove the new line at the end and split into parts (on white space)

            3. Make a dictionary whose values are empty lists by default for any key

            4. Go through the lines and populate the dictionary


              1. Columns 1-3 are the key

              2. For each character sequence in column 8, we append it to the list (if we hadn't used a defaultdict with list this would be more complicated)



            5. Iterate the dictionary's key-value pairs

            6. Print the output, joining the data structures to the desired format.


            Hope this helps 🙂.






            share|improve this answer

















            • 1




              snap - except I do it line by line rather than reading the entirely file in in one go.
              – Tony Suffolk 66
              Nov 8 at 22:57










            • Great minds think alike ;).
              – lmichelbacher
              Nov 9 at 12:09























            answered Nov 8 at 22:50









            lmichelbacher











            up vote
            0
            down vote













            This is a nice opportunity to use pandas. You can open your file like this:



            import pandas as pd
            # open the whitespace-delimited file; it has no header row
            df = pd.read_csv('myfile.txt', sep=r'\s+', header=None)
            # group and apply functions
            df = df.groupby([0, 1, 2])[7].agg([('count', 'size'),
                                               ('genes', lambda col: ', '.join(col))
                                               ]).reset_index()
            # rename columns
            df = df.rename({0: 'chromosome', 1: 'start_region', 2: 'end_region'}, axis=1)
            # save new file (tab-separated)
            df.to_csv('newfile.txt', sep='\t', index=False, header=True)


            This creates a DataFrame that looks like this:



                  0         1         2   3  4   5   6        7  8
            0  chrX   7970000   8670000   3  2   7   7  RPS6KA6  4
            1  chrX   7970000   8670000   3  2   7   7    SATL1  3
            2  chrX   7970000   8670000   3  2   7   7  SH3BGRL  4
            3  chrX   7970000   8670000   3  2   7   7     VCX2  1
            4  chrX  86580000  86980000   1  1   1   5    KLHL4  2
            5  chrX  87370000  88620000   4  4  11  11   CPXCR1  2
            6  chrX  87370000  88620000   4  4  11  11    FAM9A  2
            7  chrX  89050000  91020000  11  6  10  13    FAM9B  3
            8  chrX  89050000  91020000  11  6  10  13   PABPC5  2


            Now, using built-in functions we can groupby on columns [0, 1, 2] and apply functions over the groups, resulting in:



                  0         1         2  count                          genes
            0  chrX   7970000   8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
            1  chrX  86580000  86980000      1                          KLHL4
            2  chrX  87370000  88620000      2                  CPXCR1, FAM9A
            3  chrX  89050000  91020000      2                  FAM9B, PABPC5


            This groups the data and adds the two columns we are interested in:



            ('count', 'size') creates the column count using the function size
            ('genes', lambda col: ', '.join(col)) creates the column genes using the lambda function that just joins the grouped column together.
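As a self-contained check of the grouping step, the sketch below runs it on the sample rows from the question, read from an in-memory string instead of a file. It uses keyword-style named aggregation, which pandas supports from version 0.25 onward, as an alternative spelling of the `(name, function)` tuples; the variable names `raw` and `summary` are illustrative assumptions.

```python
import io

import pandas as pd

# Sample rows taken from the question
raw = """\
chrX 7970000 8670000 3 2 7 7 RPS6KA6 4
chrX 7970000 8670000 3 2 7 7 SATL1 3
chrX 7970000 8670000 3 2 7 7 SH3BGRL 4
chrX 7970000 8670000 3 2 7 7 VCX2 1
chrX 86580000 86980000 1 1 1 5 KLHL4 2
chrX 87370000 88620000 4 4 11 11 CPXCR1 2
chrX 87370000 88620000 4 4 11 11 FAM9A 2
chrX 89050000 91020000 11 6 10 13 FAM9B 3
chrX 89050000 91020000 11 6 10 13 PABPC5 2
"""

# header=None gives the integer column labels 0..8 used in the answer
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

summary = (
    df.groupby([0, 1, 2])[7]
    # each keyword names an output column and gives its aggregation
    .agg(count="size", genes=lambda col: ", ".join(col))
    .reset_index()
)
print(summary)
```

Replacing `", ".join` with `",".join` would give the comma-separated form (no space) that the question asks for.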



            This is what the final file will look like:



            chromosome  start_region  end_region  count                          genes
                  chrX       7970000     8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
                  chrX      86580000    86980000      1                          KLHL4
                  chrX      87370000    88620000      2                  CPXCR1, FAM9A
                  chrX      89050000    91020000      2                  FAM9B, PABPC5


            If you have any questions, come visit the pandas tag.
















                answered Nov 9 at 5:25









                Alex
