
classification
Title: reading very large files
Type: behavior
Stage: test needed
Components: Interpreter Core
Versions: Python 2.6
process
Status: closed
Resolution: wont fix
Dependencies: 1672853
Superseder: Newline skipped in "for line in file" for huge file (issue 1744752)
Assigned To:
Nosy List: JosephArmbruster, Richard.Christen@unice.fr, amaury.forgeotdarc, josiahcarlson, richardchristen, terry.reedy, tim.peters
Priority: high
Keywords:

Created on 2006-03-16 17:21 by richardchristen, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg27792 - (view) Author: christen (richardchristen) Date: 2006-03-16 17:21
I work on the human genome.
I extracted words from chromosomes using a suffix tree
(C code compiled for 64-bit and run on a Sun machine with
300 GB of RAM, since my suffix tree requires 150 GB of RAM
for chromosome 1, the largest one).

This produced some files larger than 5 GB, for example one
with 163763326 lines for chromosome 4, the one currently
being analyzed.

Using Python 2.4.2 on a 32-bit Windows computer (1.5 GB
RAM), I read this file line by line with either

for li in file:
    do something

or

li=file.readline()
while li!='':
    do something
    li=file.readline()

I ran into problems, seemingly around the 4 GB boundary.
After the first problematic line, for some lines (not all),
li contained the correct content followed by the first word
of the next line (see below).

As a result, a simple copy loop

file1=open('1')
file2=open('2','w')
li=file1.readline()
while li!='':
    file2.write(li)
    li=file1.readline()
file1.close()
file2.close()

produced a second file of only 163754385 lines, 8941 fewer
than the original. The problem lines were "seemingly
random", i.e. not consecutive, and the last line was OK.
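One way to confirm a line-count discrepancy like the one above (163763326 lines in, 163754385 lines out) without relying on text-mode line splitting is to count newline bytes in binary mode and compare that with the number of lines a text-mode loop yields. The following is only a minimal sketch written for Python 3, not the code used in the report; the helper names, the chunk size, and the latin-1 encoding are assumptions chosen for illustration.

```python
import os
import tempfile

def count_newlines_binary(path, chunk_size=1 << 20):
    """Count newline bytes by reading fixed-size binary chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total

def count_lines_text(path, encoding="latin-1"):
    """Count lines the way the report does: iterate in text mode."""
    with open(path, "r", encoding=encoding) as f:
        return sum(1 for _ in f)

# Small demo on a temporary file with Windows-style line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"AAAA:\t1\r\nCCCC:\t2\r\nGGGG:\t3\r\n")
    demo_path = tmp.name

binary_count = count_newlines_binary(demo_path)
text_count = count_lines_text(demo_path)
os.remove(demo_path)
print(binary_count, text_count)
```

If the file ends with a newline, the two counts should be equal. A text-mode count that is lower than the binary count would point at merged lines of the kind described here.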


The same code on the same file ran fine on my 64-bit
dual-core OS X machine, even though it uses the default
Python 2.2.3 and "file Python" shows it is a Mach-O
executable ppc, i.e. a 32-bit app.

Everything was run from the command line.


The first file looks like this:
...
TCAGCCACAGCAGAAAGTGA:\t33240 551212 751185
TCAGCCACAGCAGAAAGTGC:\t131324047
TCAGCCACAGCACTGTGTTA:\t61641912
....

The second file contains lines like this one:
TCAGCCACAGCAGAAAGTGC:\t131324047TCAGCCACAGCAGAAGAAGA:

which is 'first line' + 'first word of next line'.

PS1: UEdit reads the big file without problems on the
Windows machine, so the OS itself is not the problem. (Also,
I transferred the big file from the Windows machine to the
Mac; if the file had been corrupted, it would also have been
corrupted on the Mac.)
PS2: I tried Python 2.3.5 on Windows and got the same
problem.
PS3: If needed, I can run the same test on a similar file
for chromosome 8, which is slightly below the 4 GB limit
(3.99 GB).
PS4: I think I remember doing similar parsing on a
single-CPU Linux Athlon 64 machine a month ago, with no trouble.
msg27793 - (view) Author: Josiah Carlson (josiahcarlson) * (Python triager) Date: 2006-03-18 00:35

Sounds like an issue with file objects on certain platforms
not being able to handle offsets of 2**32 or larger.  I
personally have read and written files > 4 GB on the Windows
platform, but I seem to recall having issues on 32-bit Linux
some time in the past.
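To make the 2**32 figure concrete: an offset stored in 32 unsigned bits can only represent values below 4 GiB, so a larger offset would wrap around if it were truncated. The snippet below is purely an arithmetic illustration of that wraparound; it does not claim to show what any particular C library actually does, and the helper name is hypothetical.

```python
FOUR_GIB = 2 ** 32  # first offset that no longer fits in 32 unsigned bits

def wrap_to_uint32(offset):
    """Return the value a 32-bit unsigned integer would hold for offset."""
    return offset & 0xFFFFFFFF

five_gib_offset = 5 * 1024 ** 3          # an offset inside a >5 GB file
wrapped = wrap_to_uint32(five_gib_offset)

print(FOUR_GIB == 4 * 1024 ** 3)   # True
print(wrapped == 1024 ** 3)        # True: 5 GiB wraps around to 1 GiB
```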
msg27794 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-03-18 02:33

"windows 32-computer" is too vague.  Which operating system
(Win95, Win98, WinME, NT, Win2K, WinXP), and which
filesystem (FAT, FAT32, NTFS)?

Are you sure this is a text file?  If it's a binary file,
then  all sorts of bad things can happen opening it in text
mode (which your sample code does).
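For readers unfamiliar with the distinction, the sketch below shows how the same bytes look when read in binary mode versus text mode. It is written for Python 3, where newline translation is done by Python's own io layer; in the Python 2 versions discussed in this report, text-mode translation on Windows was left to the C runtime. The file contents and variable names are made up for illustration.

```python
import os
import tempfile

# Write two lines with Windows-style (CR LF) line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\r\nsecond\r\n")
    path = tmp.name

# Binary mode: bytes come back exactly as stored on disk.
with open(path, "rb") as f:
    raw_lines = f.readlines()

# Text mode with newline=None: CR LF is translated to a single "\n".
with open(path, "r", encoding="ascii", newline=None) as f:
    text_lines = f.readlines()

os.remove(path)
print(raw_lines)   # [b'first\r\n', b'second\r\n']
print(text_lines)  # ['first\n', 'second\n']
```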
msg27795 - (view) Author: christen (richardchristen) Date: 2006-03-18 07:29

In reply to the previous comment:

"Are you sure this is a text file?"
Yes, I made it myself.
Besides, I transferred it from the Unix machine to the
Windows one by FTP, converting the end-of-line characters to
the Windows convention. I checked with "type myfile" that the
control character was indeed changed. Also, as I mentioned, I
manually checked the awkward lines with Uedit, in both ASCII
and hex modes.

"windows 32-computer" is too vague.
I agree, I should have been more specific:
System: Microsoft Windows 2000 Professional
Version 5.0.2195 Service Pack 4 version 2195
Motherboard: ASUSTek
System Model A7N8X-E
BIOS Phoenix AwardBIOS v6-00PG
Memory 1.5 GB
Swap 2.4 GB

File System NTFS

Best Regards
msg27796 - (view) Author: christen (richardchristen) Date: 2007-07-02 07:11
In 2006, I reported a bug on 32-bit Windows when reading very large files: python-Bugs-1451466

I have now tried a 64-bit Windows machine with Python 2.5,
and I find the same bug.

For very large files (the two I tried were around 7-8 GB), the end of line is sometimes not taken into account.

The file is fine: viewed in hex, the end-of-line characters are perfectly OK at the place where the parser goes wrong.
Everything seems to be OK with the same script on my Mac OS X machine.

Example:
The original file reads:
###########################
.........
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-
1|ENSE00001440927|105829519|105829650|-1|1
         (132 letters)

Database: Homo_sapiens.NCBI36.45.dna.chromosome17 
           1 sequences; 78,774,742 total letters
...............
###########################

In hex:
###########################
...
c5bd3500h: 32 2E 0D 0A 0D 0A 51 75 65 72 79 3D 20 31 30 7C ; 2.....Query= 10|
c5bd3510h: 45 4E 53 47 30 30 30 30 30 32 30 33 32 38 38 7C ; ENSG00000203288|
c5bd3520h: 70 73 65 75 64 6F 67 65 6E 65 7C 31 30 35 38 32 ; pseudogene|10582
c5bd3530h: 39 34 31 36 7C 31 30 35 38 32 39 36 35 30 7C 2D ; 9416|105829650|-
c5bd3540h: 0D 0A 31 7C 45 4E 53 45 30 30 30 30 31 34 34 30 ; ..1|ENSE00001440
c5bd3550h: 39 32 37 7C 31 30 35 38 32 39 35 31 39 7C 31 30 ; 927|105829519|10
c5bd3560h: 35 38 32 39 36 35 30 7C 2D 31 7C 31 0D 0A 20 20 ; 5829650|-1|1..  
c5bd3570h: 20 20 20 20 20 20 20 28 31 33 32 20 6C 65 74 74 ;        (132 lett
c5bd3580h: 65 72 73 29 0D 0A 0D 0A 44 61 74 61 62 61 73 65 ; ers)....Database
c5bd3590h: 3A 20 48 6F 6D 6F 5F 73 61 70 69 65 6E 73 2E 4E ; : Homo_sapiens.N
c5bd35a0h: 43 42 49 33 36 2E 34 35 2E 64 6E 61 2E 63 68 72 ; CBI36.45.dna.chr
c5bd35b0h: 6F 6D 6F 73 6F 6D 65 31 37 20 0D 0A 20 20 20 20 ; omosome17 ..    
c5bd35c0h: 20 20 20 20 20 20 20 31 20 73 65 71 75 65 6E 63 ;        1 sequenc
c5bd35d0h: 65 73 3B 20 37 38 2C 37 37 34 2C 37 34 32 20 74 ; es; 78,774,742 t
c5bd35e0h: 6F 74 61 6C 20 6C 65 74 74 65 72 73 0D 0A 0D 0A ; otal letters....
...
#######################################
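The dump above can be checked mechanically: decoding the hex column of a row shows whether the CR LF pair (0D 0A) is really present before the line that the parser merges. The helper below is a small sketch for Python 3; the function name is hypothetical, and it assumes rows in exactly the "OFFSETh: HEX BYTES ; ascii" layout shown in the dump.

```python
def parse_dump_row(row):
    """Split a row of the form 'OFFSETh: HEX BYTES ; ascii' into (offset, bytes)."""
    offset_part, rest = row.split(":", 1)
    hex_part = rest.split(";", 1)[0]
    offset = int(offset_part.strip().rstrip("h"), 16)
    return offset, bytes.fromhex(hex_part.strip())

# Row copied from the dump: it starts with the CR LF that ends the
# "Query= ..." line, followed by the start of the "1|ENSE..." line.
row = "c5bd3540h: 0D 0A 31 7C 45 4E 53 45 30 30 30 30 31 34 34 30 ; ..1|ENSE00001440"
offset, data = parse_dump_row(row)

print(hex(offset))   # 0xc5bd3540
print(data)          # b'\r\n1|ENSE00001440'
```

The decoded bytes begin with b"\r\n", which agrees with the statement that the end-of-line characters are intact in the file itself.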


Demo: Python script:
#############################
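# Python 2 script, as run under Python 2.5 on Windows
import os.path
initial_dir=r'D:\human_exons\chr17'
fichier=os.path.join(initial_dir, '10_17.out')
fichin=open(fichier)    # text mode, as in the original report
ok=0       # set to 1 once the target line has been seen
i=0        # running line counter
query=''   # most recent 'Query= ' line
for li in fichin:
    i+=1
    if li.startswith('Query= '):
        query=li
    elif li.startswith('1|ENSE00001440927|105829519|105829650|-1|1'):
        ok=1
    # ok is never reset, so this prints every line from the match onward
    if ok==1:
        print i
        print query
        print li

fichin.close()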
################################

Output:
160968087
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-

1|ENSE00001440927|105829519|105829650|-1|1         (132 letters)

160968088
Query= 10|ENSG00000203288|pseudogene|105829416|105829650|-

In fact, the line reported as number 160968087 should be line 160981763.
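The counter in the output above is 13676 lines short (160981763 - 160968087), which means that many line endings were swallowed before the target line. To locate the first place where this happens, one option is to read the file in binary mode and in text mode in lockstep and report the first line number where the two disagree. This is only a diagnostic sketch for Python 3 (where the problem is not expected to reproduce); the function name and the latin-1 encoding are assumptions.

```python
import os
import tempfile

def first_mismatch(path, encoding="latin-1"):
    """Return the 1-based number of the first line where text-mode and
    binary-mode reads disagree, or None if they agree everywhere."""
    with open(path, "rb") as raw, \
         open(path, "r", encoding=encoding, newline=None) as text:
        line_number = 0
        while True:
            raw_line = raw.readline()
            text_line = text.readline()
            if not raw_line and not text_line:
                return None
            line_number += 1
            # What text mode should return if it only translates CR LF.
            expected = raw_line.decode(encoding).replace("\r\n", "\n")
            if text_line != expected:
                return line_number

# Demo on a small, well-formed file with CR LF line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Query= 10|x\r\n1|ENSE|y\r\n   (132 letters)\r\n")
    demo_path = tmp.name

result = first_mismatch(demo_path)
os.remove(demo_path)
print(result)  # None: both reads agree on every line
```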



####################################
Computer 
Dell Precision PWS690 2 CPU dual core
Intel Xeon
5160 @ 3.00GHz
2.99 GHz, 16.0 GB of RAM

Microsoft Windows XP
Professional x64 Edition
Version 2003
Windows [Version 5.2.3790]

#####################################

Richard Christen
msg63154 - (view) Author: Joseph Armbruster (JosephArmbruster) Date: 2008-03-01 01:04
I believe this may be related to issue 1672853.

http://bugs.python.org/issue1672853
msg63441 - (view) Author: Joseph Armbruster (JosephArmbruster) Date: 2008-03-10 13:00
Note: If this issue is related to 1672853, I ran through the test code
provided in the issue recently and it appeared to pass for both the
trunk and 2.5 maint.
msg105542 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-05-11 20:49
Is this still an issue for 2.7?
msg105583 - (view) Author: christen (Richard.Christen@unice.fr) Date: 2010-05-12 12:30
I have no idea, because:
- I am using 2.5 (Windows) or 2.6 (2.5 because of old code that I
compiled to be compatible with 2.5, not 2.6)
- I am using open(file, 'U'), which solved the problem under Windows, and
the problem does not exist on Linux
Best,
Richard
msg116719 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-17 20:22
issue1744752 describes why it's probably a bug in the C library.
Possible workarounds are to open the files in universal mode, to use io.open(), or to switch to Python 3!
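As an illustration of the io.open() workaround, the sketch below iterates over a file through the io module, which performs newline translation itself instead of relying on the C library's text mode. io.open() is available from Python 2.6 onward, and in Python 3 it is the same function as the built-in open(). The helper name and the latin-1 encoding are assumptions for the example. Note that the 'U' mode flag mentioned earlier in this thread was removed in Python 3.11, where newline=None gives the equivalent behavior.

```python
import io
import os
import tempfile

def iter_lines(path, encoding="latin-1"):
    """Yield lines via io.open(), which does its own newline handling."""
    with io.open(path, "r", encoding=encoding, newline=None) as f:
        for line in f:
            yield line

# Demo on a temporary file with Windows-style line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line one\r\nline two\r\n")
    demo_path = tmp.name

lines = list(iter_lines(demo_path))
os.remove(demo_path)
print(lines)  # ['line one\n', 'line two\n']
```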
History
Date User Action Args
2022-04-11 14:56:15  admin  set  github: 43040
2010-09-17 20:22:25  amaury.forgeotdarc  set  status: open -> closed; nosy: + amaury.forgeotdarc; messages: + msg116719; superseder: Newline skipped in "for line in file" for huge file; resolution: wont fix
2010-05-12 12:30:50  Richard.Christen@unice.fr  set  nosy: + Richard.Christen@unice.fr; messages: + msg105583
2010-05-11 20:49:45  terry.reedy  set  nosy: + terry.reedy; messages: + msg105542
2009-03-30 06:32:45  ajaksu2  set  dependencies: + Error reading files larger than 4GB; type: behavior; stage: test needed; versions: + Python 2.6, - Python 2.5
2008-03-10 13:00:50  JosephArmbruster  set  messages: + msg63441
2008-03-01 01:04:08  JosephArmbruster  set  nosy: + JosephArmbruster; messages: + msg63154
2006-03-16 17:21:35  richardchristen  create