How to remove lines that start with the same letters (sequence) in a txt file?

#!/usr/bin/env python

FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5

lines = set()
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            print(line)
            lines.add(beginOfSequence)

This is the code I have right now, but it is not working. I have a file containing lines of DNA that sometimes start with the same sequence (pattern of letters). I need to write code that finds all lines of DNA that start with the same letters (perhaps the same 10 characters) and deletes one of those lines from the file.

Example (issue):

CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT

***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT

What I need as result after one is taken out of file:

CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT

(no third line)

edited Nov 20 '18 at 18:09 by eyllanesc

asked Nov 20 '18 at 18:08 by Alpa Luca

It looks like you want lines.append(line) instead of lines.add(beginOfSequence).

– ritlew
Nov 20 '18 at 18:12

What is the issue? I get the output you showed as correct.

– Filip Młynarski
Nov 20 '18 at 18:18

Traceback (most recent call last): File "./RemoveDuplicate.py", line 14, in <module> lines.append(line) AttributeError: 'set' object has no attribute 'append' @FilipMłynarski

– Alpa Luca
Nov 20 '18 at 18:22

Change lines = set() to lines = []. And keep in mind that if you store whole lines instead of just the beginnings of lines in your lines list, the code won't work properly.

– Filip Młynarski
Nov 20 '18 at 18:25


python


2 Answers

I think your set logic is correct. You are just missing the part that saves the lines you want to keep so they can be written back to the file. I am guessing you tried this with a separate list that you forgot to include here, since you are calling append somewhere.

FILE_NAME = "sample_file.txt"
NR_MATCHING_CHARS = 5

lines = set()
output_lines = []  # keep track of lines you want to keep
with open(FILE_NAME, "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "": continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if not (beginOfSequence in lines):
            output_lines.append(line + '\n')  # add line to list; newline needed since we will write to file
            lines.add(beginOfSequence)
print(output_lines)

with open(FILE_NAME, 'w') as f:
    f.writelines(output_lines)  # write it out to the file

answered Nov 20 '18 at 18:35 by LeKhan9

Hi, first I would like to say thank you so much for your help! If some of my strands are longer than others, would there be an easy way for me to tell Python to remove the shorter of the matching strands?

– Alpa Luca
Nov 20 '18 at 18:54

Well, it would depend on how many of the shorter ones you want to remove. For example, A, AB, and ABC all share the prefix A. Do you only want A? You can try storing all the matches in a list value in a dictionary instead, something like d[beginOfSequence] = [line1, line2, ...]. At the end of your iteration, just scoop out the x shortest ones from each dictionary list.

– LeKhan9
Nov 20 '18 at 19:15
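
The dictionary idea in the comment above could be sketched roughly as follows. This is only an illustration: the function name keep_longest_per_prefix and its parameters are made up for this example, and it assumes that "remove the shorter" means keeping just the single longest line for each prefix (when two lines are equally long, the one seen first stays).

```python
def keep_longest_per_prefix(lines, nr_matching_chars=5):
    """Return one line per prefix: the longest line seen for that prefix.

    Hypothetical helper, not taken from the answer. Blank lines are skipped,
    and the output follows the order in which each prefix first appeared.
    """
    longest = {}  # prefix -> longest line so far (dicts keep insertion order in Python 3.7+)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        prefix = line[:nr_matching_chars]
        # Strict '>' means the first line wins when lengths are equal.
        if prefix not in longest or len(line) > len(longest[prefix]):
            longest[prefix] = line
    return list(longest.values())


# Sample data modelled on the question (without the *** emphasis markers).
sample = [
    "CCTGGATGGCTTATATAAGATGTTAT",
    "GTTATATAATATACCACCGGGCTGCTT",
    "",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
]
result = keep_longest_per_prefix(sample)
print(result)
```

Because the function accepts any iterable of strings, an open file object can be passed in directly, for example keep_longest_per_prefix(inF) inside the with block. Keeping only one line per prefix in the dictionary, rather than a list of all matches, is enough when just the longest strand is wanted.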


Your approach has a few problems. First, I would avoid naming the file variable inF, as it can be confused with inf. Descriptive names are better: testFile, for instance. Also, prefer if not line over comparing with "": it is the idiomatic test for an empty string (after strip(), line is always a str, so both tests behave the same here). As for your actual problem, you only print each line that passes the set-membership test and never store it, so there is nothing to write back to the file:

FILE_NAME = "testprecomb.txt"
NR_MATCHING_CHARS = 5

prefixCache = set()
data = []  # lines to keep, in file order
with open(FILE_NAME, "r") as testFile:
    for line in testFile:
        line = line.strip()
        if not line:
            continue
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if (beginOfSequence in prefixCache):
            continue
        else:
            print(line)
            data.append(line)
            prefixCache.add(beginOfSequence)
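
To actually take the duplicate lines out of a file, the collected data still has to be written somewhere. Here is a minimal sketch, assuming the result should go to a separate output file rather than overwrite the input. The function name dedupe_by_prefix and the file name testprecomb_dedup.txt are placeholders invented for this example.

```python
import os
import tempfile


def dedupe_by_prefix(in_path, out_path, nr_matching_chars=5):
    """Copy in_path to out_path, keeping only the first line for each prefix.

    Hypothetical helper: returns the kept lines so the caller can inspect them.
    """
    seen_prefixes = set()
    kept = []
    with open(in_path, "r") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue
            prefix = line[:nr_matching_chars]
            if prefix in seen_prefixes:
                continue
            seen_prefixes.add(prefix)
            kept.append(line)
    with open(out_path, "w") as dst:
        # One kept line per row; write nothing at all if no lines survived.
        dst.write("\n".join(kept) + ("\n" if kept else ""))
    return kept


# Self-contained demo in a temporary directory, so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "testprecomb.txt")
    out_path = os.path.join(tmp, "testprecomb_dedup.txt")
    with open(in_path, "w") as f:
        f.write("CCTGGATGGC\n\nGTTATATAAT\nGTTATATAGT\n")
    kept = dedupe_by_prefix(in_path, out_path)
    with open(out_path) as f:
        written = f.read()
print(kept)
```

Writing to a second file keeps the original data intact if the prefix length turns out to be wrong; once the output looks right, the new file can replace the old one.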

answered Nov 20 '18 at 18:50 by Woody1193

Hi, first I would like to say thank you so much for your help! If some of my strands are longer than others, would there be an easy way for me to tell Python to remove the shorter of the matching strands?

– Alpa Luca
Nov 20 '18 at 19:02
