How to remove lines that start with the same letters (sequence) in a txt file?

    #!/usr/bin/env python

    FILE_NAME = "testprecomb.txt"
    NR_MATCHING_CHARS = 5

    lines = set()
    with open(FILE_NAME, "r") as inF:
        for line in inF:
            line = line.strip()
            if line == "": continue
            beginOfSequence = line[:NR_MATCHING_CHARS]
            if not (beginOfSequence in lines):
                print(line)
                lines.add(beginOfSequence)

This is the code I have right now, but it is not working. I have a file containing lines of DNA, some of which start with the same sequence (pattern of letters). I need to write code that finds all lines of DNA that start with the same letters (perhaps the same 10 characters) and deletes one of them.

Example (issue):

    CCTGGATGGCTTATATAAGAT***GTTAT***
    ***GTTAT***ATAATATACCACCGGGCTGCTT
    ***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT

What I need as the result after one is taken out of the file:

    CCTGGATGGCTTATATAAGAT***GTTAT***
    ***GTTAT***ATAATATACCACCGGGCTGCTT
    (no third line)

python

edited Nov 20 '18 at 18:09 by eyllanesc
asked Nov 20 '18 at 18:08 by Alpa Luca

  • 1
    It looks like you want lines.append(line) instead of lines.add(beginOfSequence)
    – ritlew
    Nov 20 '18 at 18:12

  • 4
    What's the issue? I got the output you show as correct.
    – Filip Młynarski
    Nov 20 '18 at 18:18

  • Traceback (most recent call last): File "./RemoveDuplicate.py", line 14, in <module> lines.append(line) AttributeError: 'set' object has no attribute 'append' @FilipMłynarski
    – Alpa Luca
    Nov 20 '18 at 18:22

  • Change lines = set() to lines = []. And keep in mind that if you store whole lines instead of the beginnings of lines in your lines list, the code won't work properly.
    – Filip Młynarski
    Nov 20 '18 at 18:25
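
The point made in the comments (the prefix check itself already gives the expected result, and the set should hold prefixes while a list holds the kept lines) can be checked in memory. This is only a sketch: the function name dedupe_by_prefix and the sample list are illustrative, and it assumes the *** markers in the question are highlighting rather than part of the data.

```python
def dedupe_by_prefix(lines, n_chars=5):
    """Return the lines whose first n_chars characters have not been seen before."""
    seen_prefixes = set()  # holds prefixes only, never whole lines
    kept = []              # a list keeps the input order; a set would not
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        prefix = line[:n_chars]
        if prefix not in seen_prefixes:
            seen_prefixes.add(prefix)
            kept.append(line)
    return kept


# The three sequences from the question, with the *** markers removed (assumption).
sample = [
    "CCTGGATGGCTTATATAAGATGTTAT",
    "GTTATATAATATACCACCGGGCTGCTT",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
]

for line in dedupe_by_prefix(sample):
    print(line)
```

With the sample above, the first two sequences are printed and the third is dropped because it shares the prefix GTTAT with the second.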
















2 Answers
I think your set logic is correct. You are just missing the portion that will save the lines you want to write back into the file. I am guessing you tried this with a separate list that you forgot to add here since you are using append somewhere.



    FILE_NAME = "sample_file.txt"
    NR_MATCHING_CHARS = 5

    lines = set()
    output_lines = []  # keep track of lines you want to keep
    with open(FILE_NAME, "r") as inF:
        for line in inF:
            line = line.strip()
            if line == "": continue
            beginOfSequence = line[:NR_MATCHING_CHARS]
            if not (beginOfSequence in lines):
                output_lines.append(line + '\n')  # add line to list, newline needed since we will write to file
                lines.add(beginOfSequence)
    print(output_lines)

    with open(FILE_NAME, 'w') as f:
        f.writelines(output_lines)  # write it out to the file

answered Nov 20 '18 at 18:35 by LeKhan9

  • Hi, first I would like to say thank you so much for your help! If some of my strands are longer than others, would there be an easy way for me to tell Python to remove the shorter of the matching strands?
    – Alpa Luca
    Nov 20 '18 at 18:54

  • Well, it would depend on how many of the shorter ones you want to remove. For example: A, AB, and ABC all share the prefix. Do you only want A? You can try storing all the matches into a list value in a dictionary instead. Something like d[beginOfSequence] = [line1, line2, ...]. At the end of your iteration, just scoop out the top x short ones from each dictionary list.
    – LeKhan9
    Nov 20 '18 at 19:15
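
One way to read the dictionary suggestion above is sketched here. It is a simplification, not the exact structure from the comment: instead of keeping a list of all matches per prefix, it keeps only the longest line seen so far for each prefix, on the assumption that "remove the shorter of the matching strands" means keeping exactly one, the longest. The function name keep_longest_per_prefix and the sample data are illustrative.

```python
def keep_longest_per_prefix(lines, n_chars=5):
    """For each distinct prefix, keep only the longest line that starts with it."""
    longest = {}  # prefix -> longest line so far; dicts keep insertion order (Python 3.7+)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        prefix = line[:n_chars]
        # Strict ">" means the earlier line wins when two lines tie on length.
        if prefix not in longest or len(line) > len(longest[prefix]):
            longest[prefix] = line
    return list(longest.values())


# The three sequences from the question, with the *** markers removed (assumption).
sample = [
    "CCTGGATGGCTTATATAAGATGTTAT",
    "GTTATATAATATACCACCGGGCTGCTT",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
]

for line in keep_longest_per_prefix(sample):
    print(line)
```

Here the first and third sequences are printed, since the third is the longer of the two that start with GTTAT. If you want to drop only some of the shorter strands, the list-per-prefix layout from the comment is the better fit.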





















Your approach has a few problems. First, I would avoid naming file variables inF, as this can be confused with inf. Descriptive names are better: testFile, for instance. Also, testing for empty strings using equality misses a few important edge cases (what if line is None, for instance?); use the not keyword instead. As for your actual problem, you're not actually doing anything based on that set membership:

    FILE_NAME = "testprecomb.txt"
    NR_MATCHING_CHARS = 5

    prefixCache = set()  # prefixes already seen
    data = []            # lines kept so far
    with open(FILE_NAME, "r") as testFile:
        for line in testFile:
            line = line.strip()
            if not line:
                continue
            beginOfSequence = line[:NR_MATCHING_CHARS]
            if (beginOfSequence in prefixCache):
                continue
            else:
                print(line)
                data.append(line)
                prefixCache.add(beginOfSequence)
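
The snippet above collects the kept lines in data but stops before writing them anywhere. A minimal sketch of one way to finish the job follows, assuming you would rather write to a separate output file than overwrite the input. The function name write_deduped and its parameters are hypothetical, not part of the answer.

```python
def write_deduped(src_path, dst_path, n_chars=5):
    """Copy src_path to dst_path, skipping lines whose prefix was already seen.

    Returns the number of lines written. src_path and dst_path must be
    different files: opening the input with "w" would truncate it.
    """
    seen_prefixes = set()
    written = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            prefix = line[:n_chars]
            if prefix in seen_prefixes:
                continue  # a line with this prefix was already written
            seen_prefixes.add(prefix)
            dst.write(line + "\n")
            written += 1
    return written
```

A call such as write_deduped("testprecomb.txt", "testprecomb_deduped.txt") (the second name is made up) leaves the original file untouched, which makes it easy to compare the two before replacing anything.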






answered Nov 20 '18 at 18:50 by Woody1193

  • Hi, first I would like to say thank you so much for your help! If some of my strands are longer than others, would there be an easy way for me to tell Python to remove the shorter of the matching strands?
    – Alpa Luca
    Nov 20 '18 at 19:02