Why is opening and iterating over a file handle over twice as fast in Python 2 as in Python 3?



























I can't work out why it's so much faster to parse this file in Python 2.7 than in Python 3.6. I've found this pattern on both macOS and Arch Linux independently. Can others replicate it? Is there an explanation?



Warning: the code snippet writes a ~2GB file



Timings:



$ python2 test.py 
5.01580309868
$ python3 test.py
10.664075019994925


Code for test.py:



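import os
from timeit import timeit

SEQ_LINE = 'ATCGN' * 80 + '\n'

# Build the ~2GB FASTA-like test file once: 1,000,000 records,
# each a header line followed by five 401-byte sequence lines.
if not os.path.isfile('many_medium.fa'):
    with open('many_medium.fa', 'w') as out_f:
        for i in range(1000000):
            out_f.write('>{}\n'.format(i))
            for _ in range(5):
                out_f.write(SEQ_LINE)

def f():
    # Iterate over every line without doing any work per line.
    with open('many_medium.fa') as in_f:
        for line in in_f:
            pass

print(timeit('f()', setup='from __main__ import f', number=5))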
































  • See stackoverflow.com/questions/46415568/…

    – user2357112
    Sep 27 '18 at 21:19











  • (It's not quite the same, since you're not on Windows, but it's closely related.)

    – user2357112
    Sep 27 '18 at 21:26






  • Aside from newline handling, the big obvious thing is that Python 3 is doing Unicode decoding. That's pretty expensive.

    – user2357112
    Sep 27 '18 at 23:41






  • Skip decoding entirely by opening the file in 'rb' mode, but everything that uses the data will have to be able to handle bytestrings on Python 3.

    – user2357112
    Sep 28 '18 at 17:21






  • @user2357112 Thanks, opening with rb does indeed account for the majority of the time difference!

    – Chris_Rands
    Oct 1 '18 at 8:24






















python python-3.x python-2.7 file parsing
















asked Sep 27 '18 at 21:16









Chris_Rands

























2 Answers



























Because in Python 2, the standard open() call creates a far simpler file object than the Python 3 open() call does. The Python 3 open call is the same thing as io.open(), and the same framework is available on Python 2.
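The claim that the built-in open() and io.open() are one and the same on Python 3 can be checked directly. This is a minimal sketch meant to be run under Python 3; the variable name is only for illustration:

```python
import io

# On Python 3 the built-in open() is the very same function object as io.open().
same_function = io.open is open
print(same_function)
```

On Python 2 the same comparison is False, because the built-in open() there returns the older, simpler file type.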



To make this a fair comparison, you'd have to add the following line to the top of your test:



from io import open


With that change, the timings on Python 2 go from 5.5 seconds to 37 seconds. Compared to that figure, the 11 seconds Python 3 takes on my system to run the test really is much, much faster.



So what is happening here? The io library offers much more functionality than the old Python 2 file object:




  • File objects returned by open() consist of up to 3 layers of composed functionality, allowing you to control buffering and text handling.

  • Support for non-blocking I/O streams

  • A consistent interface across a wide range of streams

  • Much more control over the universal newline translation feature.

  • Full Unicode support.


That extra functionality comes at a performance price.
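The layering mentioned in the list above can be inspected on any text-mode file object. The sketch below assumes Python 3; the helper name describe_layers and the scratch file are hypothetical and exist only for this demonstration:

```python
import os
import tempfile

def describe_layers(path):
    """Return the class names of the text, buffered, and raw layers."""
    with open(path, "r", encoding="utf-8") as text_layer:
        buffered_layer = text_layer.buffer   # binary buffering layer
        raw_layer = buffered_layer.raw       # unbuffered OS-level file
        return (
            type(text_layer).__name__,
            type(buffered_layer).__name__,
            type(raw_layer).__name__,
        )

# Hypothetical scratch file, used only for this demonstration.
fd, demo_path = tempfile.mkstemp(suffix=".fa")
with os.fdopen(fd, "w", encoding="utf-8") as out_f:
    out_f.write(">0\nATCGN\n")

layers = describe_layers(demo_path)
os.remove(demo_path)
print(layers)  # ('TextIOWrapper', 'BufferedReader', 'FileIO')
```

Every line read in text mode passes through all three layers, which is where the decoding and newline handling happen.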



But your Python 2 test reads byte strings, newlines are always translated to \n, and the file object the code is working with is pretty close to the OS-supplied file primitive, with all the downsides. In Python 3, you usually want to process data from files as text, so opening a file in text mode gives you a file object that decodes the binary data to Unicode str objects.



So how can you make things go 'faster' on Python 3? That depends on your specific use case, but you have some options:




  • For text-mode files, disable universal newline handling, especially when handling a file that uses line endings that differ from the platform standard. Set the newline parameter to the expected newline character sequence, like \n. Binary mode only supports \n as the line separator.

  • Process the file as binary data, and don't decode to str. Alternatively, decode as Latin-1, a straight one-to-one mapping from byte to codepoint. This is also an option when your data is ASCII-only, since Latin-1 skips the error check that the bytes are in the range 0-127 rather than 0-255.


When using mode='rb', Python 3 can easily match the Python 2 timings; the test takes only 5.05 seconds on my system, using Python 3.7.



Using latin-1 as the codec vs. UTF-8 (the usual default) makes only a small difference; UTF-8 can be decoded very efficiently. But it could make a difference for other codecs. You generally want to set the encoding parameter explicitly, and not rely on the default encoding used.
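To compare the options discussed above on your own machine, something like the following could be used. It is only a sketch: it assumes Python 3, writes a much smaller stand-in file (roughly 10 MB) instead of many_medium.fa, and the helper count_lines and the variant labels are made up for illustration. No timings are quoted here because they depend heavily on the system.

```python
import os
import tempfile
from timeit import timeit

# Hypothetical, much smaller stand-in for many_medium.fa (roughly 10 MB).
SEQ_LINE = "ATCGN" * 80 + "\n"
fd, path = tempfile.mkstemp(suffix=".fa")
with os.fdopen(fd, "w", encoding="ascii", newline="\n") as out_f:
    for i in range(5000):
        out_f.write(">{}\n".format(i))
        for _ in range(5):
            out_f.write(SEQ_LINE)

def count_lines(**open_kwargs):
    """Iterate over the file with the given open() arguments."""
    with open(path, **open_kwargs) as in_f:
        return sum(1 for _ in in_f)

variants = {
    "text, default newline handling": dict(mode="r", encoding="utf-8"),
    "text, newline='\\n'": dict(mode="r", encoding="utf-8", newline="\n"),
    "text, latin-1, newline='\\n'": dict(mode="r", encoding="latin-1", newline="\n"),
    "binary": dict(mode="rb"),
}

line_counts = {}
for label, kwargs in variants.items():
    line_counts[label] = count_lines(**kwargs)
    seconds = timeit(lambda: count_lines(**kwargs), number=3)
    print("{:32s} {:.3f}s".format(label, seconds))

os.remove(path)
```

All four variants see the same number of lines; only the amount of work done per line differs.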






























































edited Nov 19 '18 at 23:46

answered Nov 18 '18 at 11:58

Martijn Pieters







































I did some research and came across an article by Nelson Minar that explains the differences between Python 2 and Python 3 file reading:

  • Python 3 is ~1.7x slower reading bytes line by line than Python 2

  • In Python 2, reading lines with Unicode is hella slow. About 7x slower than reading Unicode all at once. And Unicode lines are 70x slower than byte lines!

  • In Python 3, reading lines with Unicode is quite fast. About as fast as reading the file all at once. But only if you use the built-in open, not codecs.

  • In Python 3, codecs is really slow for reading line by line. Avoid.

The article continues:

Python 3 UTF-8 decoding is significantly faster than Python 2. And it’s probably best to stick with the stock open() call in Py3, not codecs. It may be slower in some circumstances but it’s the recommended option going further and the difference isn’t enormous.

According to the SO answer that @user2357112 linked:

When you open a file in Python in text mode (the default), it uses what it calls "universal newlines" (introduced with PEP 278, but somewhat changed later with the release of Python 3). What universal newlines means is that regardless of what kind of newline characters are used in the file, you'll see only \n in Python. So a file containing foo\nbar would appear the same as a file containing foo\r\nbar or foo\rbar (since \n, \r\n and \r are all line ending conventions used on some operating systems at some time).
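The translation described in that quote can be seen with a small experiment. This sketch assumes Python 3 and uses a throwaway temporary file; the variable names are illustrative only:

```python
import os
import tempfile

# Write one file that mixes all three line-ending conventions.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out_f:
    out_f.write(b"foo\nbar\r\nbaz\rqux")

# Default text mode (newline=None): every line ending becomes "\n".
with open(path, "r", encoding="ascii") as in_f:
    translated = in_f.read()

# newline="": lines still end at any convention, but nothing is translated.
with open(path, "r", encoding="ascii", newline="") as in_f:
    untranslated = in_f.read()

# newline="\n": only "\n" ends a line, and nothing is translated.
with open(path, "r", encoding="ascii", newline="\n") as in_f:
    only_lf_lines = in_f.readlines()

os.remove(path)

print(repr(translated))     # 'foo\nbar\nbaz\nqux'
print(repr(untranslated))   # 'foo\nbar\r\nbaz\rqux'
print(only_lf_lines)        # ['foo\n', 'bar\r\n', 'baz\rqux']
```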




The solution mentioned in that answer is to open the file in binary mode to avoid the conversion:

open('many_medium.fa', "rb")

My tests indicated a massive difference in speed, but Python 2 still seemed to be slightly faster. There does not seem to be a way to avoid this, as it's handled by Python's interpreter.
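Reading in binary mode means every line is a bytes object, so code that consumes the lines has to compare against bytes literals and decode only where text is really needed. A minimal sketch, assuming Python 3 and a FASTA-like layout similar to the question's test file; the helper read_headers and the scratch file are hypothetical:

```python
import os
import tempfile

def read_headers(path):
    """Collect FASTA header names, decoding only the header lines."""
    headers = []
    with open(path, "rb") as in_f:
        for line in in_f:                  # each line is a bytes object
            if line.startswith(b">"):      # compare against a bytes literal
                headers.append(line[1:].rstrip(b"\r\n").decode("ascii"))
    return headers

# Hypothetical scratch file, used only for this demonstration.
fd, demo_path = tempfile.mkstemp(suffix=".fa")
with os.fdopen(fd, "wb") as out_f:
    out_f.write(b">0\nATCGN\nATCGN\n>1\nATCGN\n")

headers = read_headers(demo_path)
os.remove(demo_path)
print(headers)  # ['0', '1']
```

The sequence lines are never decoded here, so the cost of Unicode decoding is paid only for the short header lines.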






              share|improve this answer




























                6














                Did some research and came across this article by Nelson Minar that explains what the difference is between python2 and python3 file reading.





                • Python 3 is ~1.7x little slower reading bytes line by line than Python 2


                • In Python 2, reading lines with Unicode is hella slow. About 7x slower than reading Unicode all at once. And Unicode lines are 70x slower than byte lines!


                • In Python 3, reading lines with Unicode is quite fast. About as fast as reading the file all at once. But only if you use the built-in open, not codecs.


                • In Python 3, codecs is really slow for reading line by line. Avoid.





                And continues to say:




                Python 3 UTF-8 decoding is significantly faster than Python 2. And it’s probably best to stick with the stock open() call in Py3, not codecs. It may be slower in some circumstances but it’s the recommended option going further and the difference isn’t enormous.




                According to the SO answer that @user2357112 linked:




                When you open a file in Python in text mode (the default), it uses what it calls "universal newlines" (introduced with PEP 278, but somewhat changed later with the release of Python 3). What universal newlines means is that regardless of what kind of newline characters are used in the file, you'll see only n in Python. So a file containing foonbar would appear the same as a file containing foornbar or foorbar (since n, rn and r are all line ending conventions used on some operating systems at some time).




                The solution mentioned in this answer is to open the file in byte mode to avoid the conversion:



                open('many_medium.fa', "r+b")


                My tests indicated a massive difference in the speed, but python2 still seemed to be slightly faster. There does not seem to be a way to avoid this, as it's handled by python's interpreter.






                share|improve this answer


























                  6












                  6








                  6






























                  answered Nov 14 '18 at 17:42









                  Jersh






























