Summarizing the contents of a text file

I have a text file like this example:

chrX    7970000    8670000   3  2   7   7   RPS6KA6   4
chrX    7970000    8670000   3  2   7   7     SATL1   3
chrX    7970000    8670000   3  2   7   7   SH3BGRL   4
chrX    7970000    8670000   3  2   7   7      VCX2   1
chrX   86580000   86980000   1  1   1   5     KLHL4   2
chrX   87370000   88620000   4  4  11  11    CPXCR1   2
chrX   87370000   88620000   4  4  11  11     FAM9A   2
chrX   89050000   91020000  11  6  10  13     FAM9B   3
chrX   89050000   91020000  11  6  10  13    PABPC5   2

I want to count how many times each combination of the 1st, 2nd and 3rd columns is repeated.
The output should have 5 columns. The first 3 columns are the same as in the input, but each combination appears only once. The 4th column holds all the values from the 8th column of the original file for that combination, comma-separated on a single line. The 5th column is the number of times the first 3 columns are repeated in the original file.

In short: columns 4, 5, 6, 7 and 9 of the input file are not needed for the output.
For example, chrX 7970000 8670000 is repeated 4 times in the input, so the 5th column is 4 and the 4th column is RPS6KA6,SATL1,SH3BGRL,VCX2.

Here is the expected output:

chrX    7970000 8670000 RPS6KA6,SATL1,SH3BGRL,VCX2  4
chrX    86580000    86980000    KLHL4   1
chrX    87370000    88620000    CPXCR1,FAM9A    2
chrX    89050000    91020000    FAM9B,PABPC5    2

I am trying to do that in Python and wrote the following code:

file = open("myfile.txt", 'rb')
infile = []
for line in file:
    infile.append(line)
    count = 0
    final = []
    for i in range(len(infile)):
        count += 1
        if infile[i-1] == infile[i]
            final.append(infile[0,1,2,7, count])

This code does not return what I want. Do you know how to fix it?

edited Nov 8 at 22:34

martineau

64.5k887171

asked Nov 8 at 22:30

elly

213


python


3 Answers

An alternative solution:

from collections import defaultdict

summary = defaultdict(list)

# Input and collate: key on the first three columns, collect column 8
with open('myfile.txt', 'r') as fp:
    for line in fp:
        items = line.strip().split()
        key, data = (items[0], items[1], items[2]), items[7]
        summary[key].append(data)

# Output: key columns, a tab, the comma-joined entries, and their count
for keys, entries in summary.items():
    print('{keys}\t{entries} {count}'.format(
          keys=' '.join(keys),
          entries=','.join(entries),
          count=len(entries)))

With Python 2.7, this produces the output:

chrX 7970000 8670000    RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 89050000 91020000  FAM9B,PABPC5 2
chrX 87370000 88620000  CPXCR1,FAM9A 2
chrX 86580000 86980000  KLHL4 1

With Python 3.6, the output is:

chrX 7970000 8670000    RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 86580000 86980000  KLHL4 1
chrX 87370000 88620000  CPXCR1,FAM9A 2
chrX 89050000 91020000  FAM9B,PABPC5 2

The output order differs between the two Python versions because dictionaries (and by extension defaultdicts) in Python 3.6 preserve the order in which keys are inserted (an implementation detail of CPython 3.6 that became a language guarantee in Python 3.7).
It wasn't clear from your description whether the ordering was important.
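If the order does matter, one option that behaves the same on both versions is to sort the keys before printing. The following is only a sketch under assumptions: the `summary` dictionary is filled by hand here as a stand-in for the one built from the file, `format_rows` is a hypothetical helper name, and the start and end columns are assumed to be integers that should be compared numerically.

```python
from collections import defaultdict

def format_rows(summary):
    """Return output lines sorted by chromosome, then numeric start and end."""
    ordered = sorted(summary.items(),
                     key=lambda kv: (kv[0][0], int(kv[0][1]), int(kv[0][2])))
    return ['{}\t{}\t{}'.format('\t'.join(key), ','.join(genes), len(genes))
            for key, genes in ordered]

# Hand-built stand-in for the dictionary collated from the file,
# deliberately inserted out of order.
summary = defaultdict(list)
summary[('chrX', '89050000', '91020000')] += ['FAM9B', 'PABPC5']
summary[('chrX', '7970000', '8670000')] += ['RPS6KA6', 'SATL1', 'SH3BGRL', 'VCX2']
summary[('chrX', '86580000', '86980000')] += ['KLHL4']

for row in format_rows(summary):
    print(row)
```

Sorting costs an extra pass over the keys, but it removes any dependence on how the interpreter happens to order dictionary entries.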

The main reason I think your version wouldn't work is that your expression infile[0,1,2,7, count] doesn't do what you think it does.

It seems you expect it to extract the 0th, 1st, 2nd and 7th columns from your line. However, a list cannot be indexed this way in Python (the comma-separated values are treated as a single tuple index, which raises a TypeError), and Python doesn't know about the columns in your data anyway: each line is just a string of characters.

In my version I use the 'split' method on each line, which separates the line wherever the spaces/tabs are, i.e. it splits the data into columns.
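To make the difference concrete, here is a small sketch using one line from the sample data. It shows the tuple-style index failing on a list, and then the columns being picked out after a `split`. The name `pick_columns` is a hypothetical helper introduced only for this illustration.

```python
line = "chrX    7970000    8670000   3  2   7   7   RPS6KA6   4\n"
infile = [line]

# A comma-separated index is a single tuple, which a list does not accept.
try:
    infile[0, 1, 2, 7]
    error = None
except TypeError as exc:
    error = exc

def pick_columns(text, indices=(0, 1, 2, 7)):
    """Split a line on whitespace and return the requested columns."""
    items = text.split()
    return [items[i] for i in indices]

print(type(error).__name__)
print(pick_columns(line))
```

Calling `split()` with no arguments splits on any run of spaces or tabs, so the uneven column padding in the file does not matter.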

answered Nov 8 at 22:56

Tony Suffolk 66

4,1131833

Nice explanation of different sort order depending on Python version! :)
– lmichelbacher
Nov 9 at 12:11


This should do what you want:

from collections import defaultdict # 1

with open('file.txt') as fp:
    lines = [line.rstrip().split() for line in fp] # 2

counter = defaultdict(list) # 3
for line in lines:
    counter[(line[0], line[1], line[2])].append(line[7]) # 4

for key, value in counter.items(): # 5
    print('{} {} {}'.format(' '.join(key), ','.join(value), len(value))) # 6

Explanation:

1. We're going to use a handy library that gives us a dictionary with a default value
2. Read the whole input file, remove the new line at the end and split into parts (on white space)
3. Make a dictionary whose values are empty lists by default for any key
4. Go through the lines and populate the dictionary:
   Columns 1-3 are the key.
   For each character sequence in column 8, we append it to the list (if we hadn't used a defaultdict with list this would be more complicated).
5. Iterate the dictionary's key-value pairs
6. Print the output, joining the data structures to the desired format.
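To illustrate the remark in step 4, this is roughly what the collation looks like with a plain dict instead of a defaultdict. It is only a sketch: the rows are typed in by hand from the sample data rather than read from a file, and `collate` is a hypothetical function name.

```python
def collate(rows):
    """Group column 8 values by the (col1, col2, col3) key using a plain dict."""
    counter = {}
    for row in rows:
        key = (row[0], row[1], row[2])
        # Without defaultdict, the empty list has to be created explicitly
        # the first time a key is seen.
        if key not in counter:
            counter[key] = []
        counter[key].append(row[7])
    return counter

rows = [
    "chrX 87370000 88620000 4 4 11 11 CPXCR1 2".split(),
    "chrX 87370000 88620000 4 4 11 11 FAM9A 2".split(),
    "chrX 86580000 86980000 1 1 1 5 KLHL4 2".split(),
]
print(collate(rows))
```

The extra membership check is the only difference; `defaultdict(list)` simply performs it for you.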

Hope this helps 🙂.

answered Nov 8 at 22:50

lmichelbacher

2,70011932

snap - except I do it line by line rather than reading the entire file in one go.
– Tony Suffolk 66
Nov 8 at 22:57

Great minds think alike ;).
– lmichelbacher
Nov 9 at 12:09


This is a nice opportunity to use pandas. You can open your file like this:

import pandas as pd

# open file: whitespace-separated, no header row, so columns are numbered 0-8
df = pd.read_csv('myfile.txt', sep=r'\s+', header=None)

# group and apply functions
df = df.groupby([0,1,2])[7].agg([('count', 'size'),
                                 ('genes', lambda col: ', '.join(col))
                                ]).reset_index()

# rename columns
df = df.rename({0: 'chromosome', 1: 'start_region', 2: 'end_region'}, axis=1)

# save new file
df.to_csv('newfile.txt', sep='\t', index=False, header=True)

This creates a DataFrame that looks like this:

      0         1         2   3  4   5   6        7  8
0  chrX   7970000   8670000   3  2   7   7  RPS6KA6  4
1  chrX   7970000   8670000   3  2   7   7    SATL1  3
2  chrX   7970000   8670000   3  2   7   7  SH3BGRL  4
3  chrX   7970000   8670000   3  2   7   7     VCX2  1
4  chrX  86580000  86980000   1  1   1   5    KLHL4  2
5  chrX  87370000  88620000   4  4  11  11   CPXCR1  2
6  chrX  87370000  88620000   4  4  11  11    FAM9A  2
7  chrX  89050000  91020000  11  6  10  13    FAM9B  3
8  chrX  89050000  91020000  11  6  10  13   PABPC5  2

Now, using built-in functions we can groupby on columns [0, 1, 2] and apply functions over the groups, resulting in:

      0         1         2  count                          genes
0  chrX   7970000   8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
1  chrX  86580000  86980000      1                          KLHL4
2  chrX  87370000  88620000      2                  CPXCR1, FAM9A
3  chrX  89050000  91020000      2                  FAM9B, PABPC5

This groups the data and adds the columns we are interested in:

('count', 'size') creates the column count using the function size.
('genes', lambda col: ', '.join(col)) creates the column genes using a lambda function that joins the values of the grouped column together.
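For readers who want to see the group-then-aggregate idea without pandas, roughly the same result can be sketched with `itertools.groupby` from the standard library. This is not how pandas works internally, and it assumes that rows sharing a key are adjacent, as they are in the sample; `summarize` is a hypothetical name and the rows are typed in by hand.

```python
from itertools import groupby

def summarize(rows):
    """Yield (chromosome, start, end, count, genes) per run of equal keys."""
    for key, group in groupby(rows, key=lambda r: tuple(r[:3])):
        genes = [r[7] for r in group]
        # 'size' corresponds to len(genes); the lambda corresponds to the join.
        yield key + (len(genes), ', '.join(genes))

rows = [
    "chrX 7970000 8670000 3 2 7 7 RPS6KA6 4".split(),
    "chrX 7970000 8670000 3 2 7 7 SATL1 3".split(),
    "chrX 86580000 86980000 1 1 1 5 KLHL4 2".split(),
]
for record in summarize(rows):
    print(record)
```

Unlike the pandas `groupby`, `itertools.groupby` only merges consecutive rows, so unsorted input would need to be sorted by the key first.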

This is what the final file will look like:

chromosome  start_region  end_region  count                          genes
      chrX       7970000     8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
      chrX      86580000    86980000      1                          KLHL4
      chrX      87370000    88620000      2                  CPXCR1, FAM9A
      chrX      89050000    91020000      2                  FAM9B, PABPC5

If you have any questions, come visit the pandas tag.

answered Nov 9 at 5:25

Alex

649520
