classification
Title: Newline skipped in "for line in file"
Type: Stage:
Components: Library (Lib) Versions: Python 2.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: amonthei, brett.cannon, doerwalter, mark-roberts, runedevik
Priority: normal Keywords:

Created on 2007-01-16 16:56 by amonthei, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg31036 - (view) Author: Andy Monthei (amonthei) Date: 2007-01-16 16:56
When processing huge fixed-block files, about 7000 bytes wide and several hundred thousand lines long, some pairs of lines get read as one long line with no line break when using "for line in file:". The problem is even worse with the fileinput module: reading in five or six huge files consisting of 4.8 million records causes several hundred pairs of lines to be read as single lines. When a newline is skipped, it is usually followed by several more skipped newlines in the next few hundred lines. I have not noticed any other characters being skipped, only the line break.

O.S. Windows (5, 1, 2600, 2, 'Service Pack 2')
Python 2.5
msg31037 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2007-01-16 22:33
Do you happen to have a sample you could upload that triggers the bug?
msg31038 - (view) Author: Andy Monthei (amonthei) Date: 2007-01-17 21:58
I cannot upload the files that trigger this because of the data that is in them, but I am working on getting around that.

In my data, line 617391 of a fixed-block file 6990 bytes wide gets read in together with the line after it. The line break where the bug happens is 0d0a (same as the others), so I am wondering if it is a buffer issue where the line break falls at the edge of a buffer. However, no other characters are ever missed. The file is 888420 lines in total and this happens in four spots.

I will hopefully have a file to send soon.
msg31039 - (view) Author: Mark Roberts (mark-roberts) Date: 2007-01-18 05:24
What are the minimum and maximum widths of the lines?  This problem is of particular interest to me.
msg31040 - (view) Author: Mark Roberts (mark-roberts) Date: 2007-01-18 07:12
I don't know if this helps: I spent the last little while creating and reading random files that all (seemingly) matched the description you gave us.  None of these files failed to read properly (i.e., each had the right number of rows, and each line seemingly had the right length; definitely no doubled lines).

Perusing the file source code, I found a detailed discussion of fgets vs fgetc for finding the next line in the file.  Have you tried reading the file with fp.read(8192) or similar?  Hopefully you're able to reproduce the bug with scrubbed data (because I couldn't construct random data to do so).  Good luck.
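
For anyone wanting to try the fp.read(8192) idea, here is a minimal sketch, written for Python 3 rather than the 2.5 discussed here. It counts lines by reading fixed-size binary chunks, so the count can be compared with the one from "for line in file:". The function name count_lines_chunked is only illustrative and is not part of any library.

```python
def count_lines_chunked(path, chunk_size=8192):
    """Count lines by counting b"\\n" bytes in fixed-size binary chunks."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines
```

Because the file is opened in binary mode, no newline translation takes place, so this count depends only on the 0x0a bytes actually present in the file.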
msg31041 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-01-18 09:23
Are you using any of the unicode reading features (i.e. codecs.EncodedFile etc.) or are you using plain open() for reading the file?
msg31042 - (view) Author: Andy Monthei (amonthei) Date: 2007-01-18 15:34
I am using open() for reading the file, no other features. I have also had fileinput.input(fileList) compound the problem.  Each file that this has happened to is a fixed-block file either 6990 or 7700 bytes wide, but I think this is insignificant. When looking at the file in a hex editor everything looks fine, and a small Java program using a buffered reader gives me the correct line count when Python does not.

I'm sure using something like fp.read(8192) might temporarily solve my problem, but I will keep working on getting a file I can upload.

msg31043 - (view) Author: Andy Monthei (amonthei) Date: 2007-01-20 22:53
I have had no luck creating random data to reproduce the problem, which leads me to conclude that it was the data itself.  Using a hex editor I find no problem with the line breaks.

The data that triggers this bug is transferred several times before it gets to me. It originates on a Unix box, then goes to an IBM mainframe, then to my Windows machine, with many updates along the way. It may be an EBCDIC/ASCII conversion or possibly something to do with the mainframe-to-PC transfer. Whatever it is, it's in the data itself.

The only thing that bothers me is that Java somehow is not affected by this bad data.
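
One way to check line breaks across a whole file, rather than spot-checking in a hex editor, is to tally the kinds of line terminators it contains. The following is a rough Python 3 sketch, assuming the interesting cases are CRLF pairs, lone LF bytes and lone CR bytes; the name newline_census is hypothetical.

```python
def newline_census(path, chunk_size=1 << 20):
    """Tally CRLF pairs, lone LF bytes and lone CR bytes in a file."""
    counts = {"crlf": 0, "lone_lf": 0, "lone_cr": 0}
    pending_cr = False  # True if the previous chunk ended with b"\r"
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            # Re-attach a CR held back from the previous chunk so that a
            # CRLF pair split across two chunks is still seen as a pair.
            if pending_cr:
                chunk = b"\r" + chunk
            pending_cr = chunk.endswith(b"\r")
            if pending_cr:
                chunk = chunk[:-1]
            crlf = chunk.count(b"\r\n")
            counts["crlf"] += crlf
            counts["lone_lf"] += chunk.count(b"\n") - crlf
            counts["lone_cr"] += chunk.count(b"\r") - crlf
    # A CR that was the very last byte of the file has no LF after it.
    if pending_cr:
        counts["lone_cr"] += 1
    return counts
```

For a file that is supposed to use 0d0a throughout, any nonzero lone_lf or lone_cr count would point at records worth inspecting more closely.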
msg31044 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2007-01-21 00:46
Well, with Andy saying he can't reproduce the problem I am going to close as invalid.

Andy, if you ever happen to be able to upload data that triggers it, then please re-open this bug.
msg31045 - (view) Author: Rune Devik (runedevik) Date: 2007-06-27 10:00
Hi

I have the same problem with a huge file (8 GB) containing long lines. Sometimes two lines are merged into one, and when I rerun the test script that reads the file it is always the same lines that are merged. The merging also seems to happen more frequently towards the end of the file. I tried to reproduce it with a smaller data set (the 10 lines before the two lines that get merged, the two merged lines themselves, and the 10 lines after them), but I was not able to. However, if you open this huge file in "rb" mode instead of "r" mode, everything works as it should and no lines are merged at all! If I copy the file over to Linux and rerun the test script, no lines are merged (regardless of whether the mode is "r" or "rb"), so this is Windows specific and might have something to do with the adding of \r\n if only \n is found when the file is opened in "r" mode, maybe? I have reproduced it on both Python 2.3.5 and 2.5c1, on both Windows XP and Windows 2003.

More stats on the input file in both "r" mode and "rb" mode below:

Input file size: 8 695 828 KB

fp = open(file, "r"):
  - total number of lines read:  668909
  - length of the longest line:  13179792
  - length of the shortest line: 89
  - 56 lines contain the content of two lines
  - Always just two lines that are merged into one! 
  - Always the same lines that are merged rerunning the test on the same file. 

fp = open(file, "rb"):
  - total number of lines read:  668965
  - length of the longest line:  13179793
  - length of the shortest line: 90
  - no lines merged
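
Numbers like the ones above can be gathered with a short script along these lines. This is a sketch for Python 3, not the test script actually used here; line_stats is an illustrative name, and the latin-1 encoding is an assumption made only so that arbitrary bytes decode without errors in "r" mode. Note that text mode in Python 3 is implemented differently from the "r" mode of Python 2.x discussed in this report, so it may not reproduce the merged lines.

```python
def line_stats(path, mode="rb"):
    """Return line count and longest/shortest line length for one mode."""
    # Assumption: latin-1 maps every byte to a character, so text mode
    # never raises a decoding error on arbitrary data.
    kwargs = {} if "b" in mode else {"encoding": "latin-1"}
    total = 0
    longest = 0
    shortest = None
    with open(path, mode, **kwargs) as fp:
        for line in fp:
            total += 1
            n = len(line)
            longest = max(longest, n)
            shortest = n if shortest is None else min(shortest, n)
    return {"lines": total, "longest": longest, "shortest": shortest}
```

Calling line_stats(path, "r") and line_stats(path, "rb") on the same file and comparing the "lines" values shows whether the two modes disagree. A one-character difference in line lengths between the modes is expected for CRLF files, since text mode translates \r\n to \n.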

Regards,
Rune Devik
History
Date                 User      Action  Args
2022-04-11 14:56:22  admin     set     github: 44473
2007-01-16 16:56:09  amonthei  create