Summarizing the contents of a text file
I have a text file like this example:
chrX 7970000 8670000 3 2 7 7 RPS6KA6 4
chrX 7970000 8670000 3 2 7 7 SATL1 3
chrX 7970000 8670000 3 2 7 7 SH3BGRL 4
chrX 7970000 8670000 3 2 7 7 VCX2 1
chrX 86580000 86980000 1 1 1 5 KLHL4 2
chrX 87370000 88620000 4 4 11 11 CPXCR1 2
chrX 87370000 88620000 4 4 11 11 FAM9A 2
chrX 89050000 91020000 11 6 10 13 FAM9B 3
chrX 89050000 91020000 11 6 10 13 PABPC5 2
I want to count how many times each line is repeated, comparing only the 1st, 2nd and 3rd columns. The output should have 5 columns. The first 3 columns are the same as in the input (each combination appearing only once). The 4th column holds, on a single line, all the values from the 8th column of the original file for the repeated lines. The 5th column is the number of times the first 3 columns are repeated in the original file.
In short: columns 4, 5, 6, 7 and 9 of the input file are not needed for the output. We count the number of lines in which the first 3 columns are the same, so in the output file the first 3 columns are the same as in the input file (but appear only once), the 5th column is the number of times the line is repeated, and the 4th column is all the values from the 8th column of the repeated lines.
For example, the line chrX 7970000 8670000 is repeated 4 times, so the 5th column is 4 and the 4th column is RPS6KA6,SATL1,SH3BGRL,VCX2. As you can see, the values in the 4th column are comma-separated.
Here is the expected output:
chrX 7970000 8670000 RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 86580000 86980000 KLHL4 1
chrX 87370000 88620000 CPXCR1,FAM9A 2
chrX 89050000 91020000 FAM9B,PABPC5 2
I am trying to do that in Python and wrote the following code:
file = open("myfile.txt", 'rb')
infile = []
for line in file:
    infile.append(line)
count = 0
final = []
for i in range(len(infile)):
    count += 1
    if infile[i-1] == infile[i]
        final.append(infile[0,1,2,7, count])
This code does not return what I want. Do you know how to fix it?
python
edited Nov 8 at 22:34
martineau
asked Nov 8 at 22:30
elly
3 Answers
An alternative solution:
from collections import defaultdict

summary = defaultdict(list)

# Input and collate
with open('myfile.txt', 'r') as fp:
    for line in fp:
        items = line.strip().split()
        key, data = (items[0], items[1], items[2]), items[7]
        summary[key].append(data)

# Output
for keys, entries in summary.items():
    print('{keys}\t{entries} {count}'.format(
        keys=' '.join(keys),
        entries=','.join(entries),
        count=len(entries)))
With Python 2.7, this produces the output:
chrX 7970000 8670000 RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 89050000 91020000 FAM9B,PABPC5 2
chrX 87370000 88620000 CPXCR1,FAM9A 2
chrX 86580000 86980000 KLHL4 1
With Python 3.6, the output is:
chrX 7970000 8670000 RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 86580000 86980000 KLHL4 1
chrX 87370000 88620000 CPXCR1,FAM9A 2
chrX 89050000 91020000 FAM9B,PABPC5 2
The output order differs between the two Python versions because dictionaries (and, by extension, defaultdicts) in Python 3.6 preserve the order in which keys are inserted. In CPython 3.6 this is an implementation detail; the language guarantees it from Python 3.7 onward.
It wasn't clear from your description if the ordering was important.
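If the ordering does matter and the code has to run on an older Python, one possible approach is to collect the groups in a collections.OrderedDict, which remembers insertion order on any version. The sketch below is not part of the answer above; the function name summarize_in_order and the in-memory sample list are made up for illustration.

```python
from collections import OrderedDict


def summarize_in_order(lines):
    """Group column 8 by columns 1-3, keeping the first-seen order of keys."""
    summary = OrderedDict()  # keeps insertion order on any Python version
    for line in lines:
        items = line.split()
        if not items:
            continue  # skip blank lines
        key = (items[0], items[1], items[2])
        # setdefault creates the empty list the first time a key is seen
        summary.setdefault(key, []).append(items[7])
    return ['{} {} {}'.format(' '.join(key), ','.join(genes), len(genes))
            for key, genes in summary.items()]


# Hypothetical sample rows, deliberately interleaved
sample = [
    "chrX 87370000 88620000 4 4 11 11 CPXCR1 2",
    "chrX 7970000 8670000 3 2 7 7 VCX2 1",
    "chrX 87370000 88620000 4 4 11 11 FAM9A 2",
]
for row in summarize_in_order(sample):
    print(row)
```

Keys come out in the order they were first encountered, regardless of the Python version.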
The main reason I think your version doesn't work is that the expression infile[0,1,2,7, count] doesn't do what you think it does.
It seems you expect it to extract the 0th, 1st, 2nd and 7th columns from your line. However, a list cannot be indexed this way (Python treats 0,1,2,7, count as a single tuple and raises a TypeError), and Python doesn't know about the columns in your data anyway: to Python, each line is just a string of characters.
In my version I use the split method on each line, which separates the line wherever there are spaces or tabs, i.e. it splits the data into columns.
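The difference between the two ways of indexing can be seen on a single line. This is only a small illustration using one row from the question; the variable names are made up for the example.

```python
line = "chrX 7970000 8670000 3 2 7 7 RPS6KA6 4\n"

# Several comma-separated positions inside [] form a tuple index,
# which strings and lists do not accept.
try:
    line[0, 1, 2, 7]
    tuple_index_error = None
except TypeError as exc:
    tuple_index_error = exc

# split() with no arguments splits on any run of whitespace
# (spaces, tabs, the trailing newline) and returns a list of columns.
columns = line.split()
key = tuple(columns[:3])   # columns 1-3 of the file
gene = columns[7]          # column 8 of the file
print(key, gene)
```

After split(), ordinary integer indices and slices select whole columns rather than single characters.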
answered Nov 8 at 22:56
Tony Suffolk 66
Nice explanation of different sort order depending on Python version! :)
– lmichelbacher
Nov 9 at 12:11
This should do what you want:
from collections import defaultdict  # 1

with open('file.txt') as fp:
    lines = [line.rstrip().split() for line in fp]  # 2

counter = defaultdict(list)  # 3
for line in lines:
    counter[(line[0], line[1], line[2])].append(line[7])  # 4

for key, value in counter.items():  # 5
    print('{} {} {}'.format(' '.join(key), ','.join(value), len(value)))  # 6
Explanation:
1. We're going to use a handy library that gives us a dictionary with a default value.
2. Read the whole input file, remove the newline at the end of each line and split it into parts (on whitespace).
3. Make a dictionary whose values are empty lists by default for any key.
4. Go through the lines and populate the dictionary:
   - Columns 1-3 are the key.
   - For each character sequence in column 8, we append it to the list (if we hadn't used a defaultdict with list, this would be more complicated).
5. Iterate over the dictionary's key-value pairs.
6. Print the output, joining the data structures into the desired format.
Hope this helps 🙂.
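To make the remark about defaultdict concrete, here is a rough sketch of what step 4 might look like with a plain dict. The function name group_without_defaultdict and the sample rows are hypothetical and only serve the comparison.

```python
def group_without_defaultdict(lines):
    """Same grouping as step 4, but with a plain dict."""
    grouped = {}
    for line in lines:
        parts = line.split()
        key = (parts[0], parts[1], parts[2])
        if key not in grouped:
            grouped[key] = []  # the check defaultdict(list) would do for us
        grouped[key].append(parts[7])
    return grouped


# Hypothetical sample rows taken from the question's input
rows = [
    "chrX 87370000 88620000 4 4 11 11 CPXCR1 2",
    "chrX 87370000 88620000 4 4 11 11 FAM9A 2",
    "chrX 86580000 86980000 1 1 1 5 KLHL4 2",
]
print(group_without_defaultdict(rows))
```

The extra membership test on every line is the part that defaultdict(list) makes unnecessary.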
answered Nov 8 at 22:50
lmichelbacher
snap - except I do it line by line rather than reading the entire file in one go.
– Tony Suffolk 66
Nov 8 at 22:57
Great minds think alike ;).
– lmichelbacher
Nov 9 at 12:09
This is a nice opportunity to use pandas. You can open and summarize your file like this:
import pandas as pd

# open file (whitespace-separated, no header row, so columns are named 0..8)
df = pd.read_csv('myfile.txt', sep=r'\s+', header=None)

# group and apply functions
df = df.groupby([0, 1, 2])[7].agg([('count', 'size'),
                                   ('genes', lambda col: ', '.join(col))
                                   ]).reset_index()

# rename columns
df = df.rename({0: 'chromosome', 1: 'start_region', 2: 'end_region'}, axis=1)

# save new file
df.to_csv('newfile.txt', sep='\t', index=False, header=True)
This creates a DataFrame that looks like this:
      0         1         2   3  4   5   6        7  8
0  chrX   7970000   8670000   3  2   7   7  RPS6KA6  4
1  chrX   7970000   8670000   3  2   7   7    SATL1  3
2  chrX   7970000   8670000   3  2   7   7  SH3BGRL  4
3  chrX   7970000   8670000   3  2   7   7     VCX2  1
4  chrX  86580000  86980000   1  1   1   5    KLHL4  2
5  chrX  87370000  88620000   4  4  11  11   CPXCR1  2
6  chrX  87370000  88620000   4  4  11  11    FAM9A  2
7  chrX  89050000  91020000  11  6  10  13    FAM9B  3
8  chrX  89050000  91020000  11  6  10  13   PABPC5  2
Now, using built-in functions, we can groupby on columns [0, 1, 2] and apply functions over the groups, resulting in:
      0         1         2  count  genes
0  chrX   7970000   8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
1  chrX  86580000  86980000      1  KLHL4
2  chrX  87370000  88620000      2  CPXCR1, FAM9A
3  chrX  89050000  91020000      2  FAM9B, PABPC5
What this does is group the data and add the columns we are interested in:
- ('count', 'size') creates the column count using the function size
- ('genes', lambda col: ', '.join(col)) creates the column genes using the lambda function that just joins the grouped column together.
This is what the final file will look like:
chromosome start_region end_region count genes
chrX 7970000 8670000 4 RPS6KA6, SATL1, SH3BGRL, VCX2
chrX 86580000 86980000 1 KLHL4
chrX 87370000 88620000 2 CPXCR1, FAM9A
chrX 89050000 91020000 2 FAM9B, PABPC5
If you have any questions, come visit the pandas tag.
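For readers without pandas, the group-then-aggregate idea described in this answer can be approximated with the standard library's itertools.groupby. This is a different technique from the pandas code, and it only merges adjacent rows, so the sketch assumes the input is already ordered so that rows with the same first three columns sit next to each other (as in the question's example). The function name summarize_sorted and the in-memory data list are hypothetical.

```python
from itertools import groupby


def summarize_sorted(lines):
    """Group adjacent rows by columns 1-3 and aggregate column 8."""
    rows = [line.split() for line in lines if line.strip()]
    result = []
    # groupby yields one (key, group) pair per run of consecutive equal keys
    for key, group in groupby(rows, key=lambda r: tuple(r[:3])):
        genes = [r[7] for r in group]
        result.append((key[0], key[1], key[2], len(genes), ', '.join(genes)))
    return result


# The question's example input, held in memory for illustration
data = [
    "chrX 7970000 8670000 3 2 7 7 RPS6KA6 4",
    "chrX 7970000 8670000 3 2 7 7 SATL1 3",
    "chrX 7970000 8670000 3 2 7 7 SH3BGRL 4",
    "chrX 7970000 8670000 3 2 7 7 VCX2 1",
    "chrX 86580000 86980000 1 1 1 5 KLHL4 2",
    "chrX 87370000 88620000 4 4 11 11 CPXCR1 2",
    "chrX 87370000 88620000 4 4 11 11 FAM9A 2",
    "chrX 89050000 91020000 11 6 10 13 FAM9B 3",
    "chrX 89050000 91020000 11 6 10 13 PABPC5 2",
]
for row in summarize_sorted(data):
    print('\t'.join(str(field) for field in row))
```

If the rows are not already grouped, they would need to be sorted by the first three columns before calling groupby.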
answered Nov 9 at 5:25
Alex