This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: 'Plus' filemode exposes uninitialized memory on win32
Type: Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: clintonroy, corydodt, exarkun, paul_g, tim.peters
Priority: normal Keywords:

Created on 2006-01-01 00:06 by corydodt, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (12)
msg27198 - (view) Author: Cory Dodt (corydodt) Date: 2006-01-01 00:06
(Note: I'm using cygwin zsh, hence the prompts.  I am
using standard, python.org Python for these tests.)

% echo abcdef > foo
% python
Python 2.3.5 (#62, Feb  8 2005, 16:23:02) [MSC v.1200
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for
more information.
>>> f = file('foo','r+b')
>>> f.write('ghi')
>>> f.read()
'\x00x\x01\x83\x00\xe8\x03\x00\x00\xff\xff\xff\xff\x00\x00\x00\x00e\x00\x00i\x01
\x00d\x00\x00\x83\x01\x00Fd\x01\x00S\x00S\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0
0\x00\x00\x00[...lots and lots and lots of
uninitialized memory deleted...]\x00\x00\x00\x00\x00\
x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
x00\x00\x00\x00abcdef\n'
>>> f.close()
>>>
msg27199 - (view) Author: Clinton Roy (clintonroy) Date: 2006-01-01 05:38
Hi Cory, I don't think r+ mode will create the file if it
doesn't exist, so at a guess I think what you're seeing is
the actual contents of a file named foo that was already on
disk, not junk. If you delete the file foo and run your test
again, you should get an error to that effect.

cheers,
msg27200 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-01-01 06:06
This is actually pilot error (not a bug!), although it's
subtle:  Python uses the platform C I/O implementation, and
in standard C mixing reads with writes yields undefined
behavior unless a file-positioning operation (typically a
seek()) occurs between switching from reading to writing (or
vice versa); here from the C standard:

    When a file is opened with update mode (’+’ as the
    second or third character in the above list of mode
    argument values), both input and output may be
    performed on the associated stream. However, output
    shall not be directly followed by input without an
    intervening call to the fflush function or to a file
    positioning function (fseek, fsetpos, or rewind), and
    input shall not be directly followed by output
    without an intervening call to a file positioning
    function, unless the input operation encounters
    end-of-file.

In other words, the result of running your sample code is
undefined:  nothing is guaranteed about its behavior, which
both can and does vary across platforms.

If you want defined behavior, then, for example, add

>>> f.seek(0)

between your write() and read() calls.
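As a concrete sketch of the pattern Tim describes (in modern Python 3 syntax, with open() in place of the old file() builtin; the temp-file setup is just for illustration), the defined-behavior version of the original session looks like this:

```python
import os
import tempfile

# Interpose a positioning call (seek) whenever switching between writing
# and reading on a file opened in an update ('+') mode.
path = os.path.join(tempfile.mkdtemp(), "foo")
with open(path, "wb") as f:
    f.write(b"abcdef\n")

with open(path, "r+b") as f:
    f.write(b"ghi")   # overwrites the first three bytes
    f.seek(0)         # positioning call: the following read is now defined
    data = f.read()

print(data)  # b'ghidef\n'
```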
msg27201 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2006-01-01 22:08
I think Cory was aware of the underlying requirement to
interpose a seek operation between the write and read
operations, but was concerned about the consequences of not
doing so.  Python normally protects one from doing things
that are *too* dangerous: I guess it's unclear to him (and
perhaps others) whether the current behavior of file is just
(relatively) obscure or if it could lead to real problems
(exposing sensitive data, crashing the process).  It seems
like the latter is somewhat unlikely in practice, but since
the behavior is unspecified, it seems like it *could* happen.

I guess since Tim closed this, he thinks it's not too
dangerous.  In this case, the documentation could probably
stand to be improved somewhat.  The section
(<http://python.org/doc/lib/built-in-funcs.html#built-in-funcs>)
that documents the various modes which can be used to open a
file could be updated to include a warning along the lines of
that in the C standard.  It should probably explicitly state
which file methods can be used to satisfy this requirement,
since it's not clear otherwise except by reading the
implementation of the file type (one could guess, from
<http://python.org/doc/lib/bltin-file-objects.html#bltin-file-objects>
that file.flush() and file.seek() are suitable, but the
documentation for these only says "like stdio ...", so you
can't be completely sure).
msg27202 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-01-02 00:24
There's nothing Python can do about this short of
implementing I/O itself.  That isn't likely.  If someone is
truly bothered by this behavior on Windows (or any other
platform), they need to convince Microsoft (or other
relevant C vendors) to change _their_ stdio implementation
-- Python inherits the platform C's behavior, quirks and all.

I'll note that I'm not bothered by it.  It's one of those
"doctor, doctor, it hurts when I do this!" kinds of
pseudo-problems, IMO:  so don't do that.  It's not like
hostile user input can "trick" a Python application into
doing I/O operations in an undefined order.
msg27203 - (view) Author: Cory Dodt (corydodt) Date: 2006-01-02 00:43
Tim - at a minimum this should be documented; even if it's
just a link to the ANSI C documentation.  Python is not ANSI
C; we shouldn't expect the Python user to seek out ANSI C
documentation.  Want me to open a separate doc bug?  The
current doc only says this (about the file builtin):
"""
Modes 'r+', 'w+' and 'a+' open the file for updating (note
that 'w+' truncates the file). Append 'b' to the mode to
open the file in binary mode, on systems that differentiate
between binary and text files (else it is ignored). If the
file cannot be opened, IOError is raised.
"""

Either here, or perhaps in section 2.3.9, a clear
description should be given of how to properly operate a +
mode file.  Failing that, a pointer to ANSI C documentation
so the user can read about it on their own (and so the user
knows that this behavior conforms to the underlying platform
API in every ugly detail).

I'm also dubious that this exposed memory is innocuous, but
I'll defer to your expertise on that one.
msg27204 - (view) Author: Paul G (paul_g) Date: 2006-01-02 21:32
i think there's a bit of confusion here as to what exactly
the problem is.

ansi c says that for files fopen()ed for reading and writing
(ie r+, w+ etc), you must issue an fflush(), fseek(),
fsetpos(), or rewind() between a read and a write. the
exception to this is if the read last read EOF.

the behaviour we are seeing using python file objects:

with glibc:
1. read + write + read result in no data being returned by
the last read. this is the case regardless of whether we do
f.readlines()+f.writelines()+f.readlines() or
f.read()+f.write()+f.read(). this does not conform to
expected behaviour (as per ansi c and glibc fopen(3)),
because at least in the latter (read() with no size
parameter) case, python docs promise to stop at EOF,
triggering the exception ansi c/glibc make to the
intervening synchronization with file positioning requirement.

with msvcrt:
1. in the f.read()+f.write()+f.read() case, the f.write()
generates an IOError. this deviates from ansi c, but is in
line with msdn docs.
2. in the f.readlines()+f.writelines()+f.readlines() case,
you see the type of results quoted in the bug submission.
this deviates from ansi c if you expect readlines() to read
EOF, but is still in line with msdn docs.
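For contrast, a version of the readlines()+writelines()+readlines() sequence with the required positioning calls behaves the same everywhere (a sketch in Python 3 syntax; the file contents are illustrative):

```python
import os
import tempfile

# The same readlines()/writelines()/readlines() sequence, but with seek()
# calls between each change of I/O direction, so the behavior is defined
# on every libc rather than varying between glibc and msvcrt.
path = os.path.join(tempfile.mkdtemp(), "f")
with open(path, "wb") as f:
    f.write(b"one\ntwo\n")

with open(path, "r+b") as f:
    before = f.readlines()        # [b'one\n', b'two\n']
    f.seek(0, os.SEEK_END)        # position at end before writing
    f.writelines([b"three\n"])
    f.seek(0)                     # and reposition before reading back
    after = f.readlines()

print(after)  # [b'one\n', b'two\n', b'three\n']
```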

there are 3 issues here:

1. if we give users a high level interface for file i/o, as
we do by giving them a File object, should we expect them to
research, be aware of and deal with the requirements
imposed by the low level implementation used? if it is
reasonable to require that when they use read() and write(),
is it still reasonable to require it when they use
readlines() and writelines()?

2. if we expect users to be aware of ansi c requirements for
file stream usage and deal with them, is it reasonable to
expect them to deal with the differences in libc
implementations, including the differing requirements they
impose and differing failure modes being seen? should we not
attempt to present an ansi c compliant interface to them,
performing workarounds as is necessary on a given platform
(or libc make, as is the case here)? we certainly do that in
some cases (but not in this one) based on my brief reading
of Objects/fileobject.c.

3. if we leave users to deal with this mess, should we not
at least document this in some fashion? whether it be a
detailed explanation or just a pointer to look at the
appropriate docs, or even just a mention that they should be
reading up on fopen(), since that is the underlying
implementation behind file objects. is it reasonable to expect
folks for whom python is their first language, as some folks
seem to promote python, to figure all of this out when they
haven't the foggiest about ansi c?


to recap, the real issue, imo, seems to be that we shouldn't
be exposing users to this, rather than the funky results of
not doing this right.

there are 4 options for dealing with this:

1. do nothing (what tim currently favours, it appears)
2. document this to some extent
3. make this work the same across all libcs
4. perform the synchronization (fflush, fsetpos etc
depending on libc) for the user, behind the scenes, if we
see a write coming in and the previous op was a read.

the latter option, from the perspective of "this is exactly
what a high level interface should do for the user", makes
the most sense to me. but then, maybe that's why i'm not a
python core dev ;)

cheers,
-p
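Option 4 could be sketched as a thin wrapper that remembers the last I/O direction and inserts the required positioning call automatically. This is a hypothetical illustration in Python 3 syntax; "SyncedFile" is an invented name, not anything Python ships, and (as discussed in the reply below) this can't be done reliably at the Python level anyway, since C extensions can operate on the same stream without Python's knowledge.

```python
import os
import tempfile

# Hypothetical sketch of option 4: track the last I/O direction and insert
# the required positioning call behind the scenes when it changes.
class SyncedFile:
    def __init__(self, path, mode):
        self._f = open(path, mode)
        self._last = None  # 'r', 'w', or None (no sync needed)

    def read(self, *args):
        if self._last == 'w':
            self._f.seek(self._f.tell())  # sync before switching to reading
        self._last = 'r'
        return self._f.read(*args)

    def write(self, data):
        if self._last == 'r':
            self._f.seek(self._f.tell())  # sync before switching to writing
        self._last = 'w'
        return self._f.write(data)

    def seek(self, *args):
        self._last = None  # an explicit seek satisfies the requirement too
        return self._f.seek(*args)

    def close(self):
        self._f.close()

# Usage: the original bug report's write-then-read sequence, now defined.
path = os.path.join(tempfile.mkdtemp(), "foo")
with open(path, "wb") as f:
    f.write(b"abcdef\n")

sf = SyncedFile(path, "r+b")
sf.write(b"ghi")
rest = sf.read()   # wrapper seeks behind the scenes; reads the remainder
sf.close()
print(rest)  # b'def\n'
```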
msg27205 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-01-03 00:08
paul_g's option #1 is the only that takes no work, so is the
only one I'm volunteering for <0.5 wink>.  Docs would be
good.  #4 is in general impossible short of Python
implementing its own I/O -- "the last" operation done on a C
stream isn't necessarily visible to the Python
implementation (extensions can and do perform their own I/O
on C streams directly via platform C stdio calls -- Python
has no way to know about that now even in theory).

BTW, I don't understand:

"""1. in the f.read()+f.write()+f.read() case, the f.write()
generates an IOError. this deviates from ansi c, but is in
line with msdn docs."""

All behavior in that case is explicitly not defined by ANSI
C if there isn't a file-positioning operation too between
the read() and write(), and again between the write() and
read().  Raising an exception is fine by ANSI C in that
case.  So is a segfault.  So is reading nothing, or reading
a terabyte, or wiping the disk clean, etc:  nothing is
defined about it.
msg27206 - (view) Author: Paul G (paul_g) Date: 2006-01-03 00:22
i'll comment about the rest later, but re not understanding:

here is what ansi says: "If the file has been opened for
read/write, a read may not be followed by a write.
Or vice versa, without first calling either fflush(),
fseek(), fsetpos(), or rewind().
Unless EOF was the last character read by fread(). "

note the last sentence. python docs say that f.read() with
no size parameter will read all data until it hits EOF. this
means that any ansi c compliant implementation should
perform synchronization when you fread() an EOF. glibc
fopen(3) man page states that it follows this; it does not.
msvcrt docs do not state that it follows this; it does not.

so glibc promises ansi c compliance, but does not deliver.
msvcrt does not promise ansi c compliance and doesn't
deliver either, but at least it behaves as advertised in
this respect.

make sense?

-p
msg27207 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-01-03 00:55
"make sense?"

So far as it goes, yes -- thanks.  At a higher level ;-),
I've been slinging Python for 15 years and have never had
any trouble with this stuff because I never push on the end
cases (I always seek when switching between reading and
writing, even if only by using the peculiar-looking
f.seek(f.tell()), and regardless of the platform du jour).

Pushing beyond that doesn't interest me.

Note that Python's file_read() (in fileobject.c) is already
much more complicated than simply calling C's fread(). 
Because of this, it may be that Python is adding strange end
case behavior beyond what the platform C would exhibit if
the latter were used directly.
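The peculiar-looking idiom Tim mentions can be sketched like this (Python 3 syntax): f.seek(f.tell()) repositions the stream to the offset it is already at, which changes nothing about the data but satisfies the C standard's requirement for a positioning call between direction switches.

```python
import os
import tempfile

# Demonstrating the f.seek(f.tell()) idiom: a no-op repositioning that
# makes each read/write direction switch defined behavior.
path = os.path.join(tempfile.mkdtemp(), "demo")
with open(path, "w+b") as f:
    f.write(b"hello ")
    f.seek(f.tell())    # positioning call before switching to reading
    tail = f.read()     # we're at end of file, so nothing comes back
    f.seek(f.tell())    # and again before switching back to writing
    f.write(b"world")
    f.seek(0)
    whole = f.read()

print(tail, whole)  # b'' b'hello world'
```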
msg27208 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2006-01-03 01:02
BTW, the standard actually says:

    and input shall not be directly followed by output
    without an intervening call to a file positioning
    function, unless the input operation encounters
    end-of-file.

In your f.read() + f.write() example, that doesn't happen. 
It _would_ happen if the sequence were f.read() + f.read() +
f.write() instead.  It's the second f.read() that
"encounters end-of-file".  The first f.read() merely reads
_up to_ EOF, leaving the file pointer _at_ EOF.

On Windows, that appears to work fine, too:

C:\Code>echo abc > foo
C:\Code>\python24\python.exe
Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit
(Intel)] on win32
...
>>> f = open('foo', 'r+b')
>>> f.read()
'abc \r\n'
>>> f.read() # _this_ read "encounters EOF"
''
>>> f.write('xyz')
>>> f.seek(0)
>>> f.read()
'abc \r\nxyz'
>>>
msg27209 - (view) Author: Paul G (paul_g) Date: 2006-01-03 01:39
i haven't encountered this edge case either, in my c days or
otherwise. for those who are familiar with this (and in
python's case, realize what the underlying implementation
is), it simply wouldn't occur to them to *omit* flushes.

this issue actually cropped up in a unit test in twisted
which was failing on windows and not failing elsewhere.

the explanation of read() and why this isn't working as i
initially expected makes sense. however, this is what python
docs say:

"Read at most size bytes from the file (less if the read
hits EOF before obtaining size bytes). If the size  argument
is negative or omitted, read all data until EOF is reached.
The bytes are returned as a string object. An empty string
is returned when EOF is encountered immediately. (For
certain files, like ttys, it makes sense to continue reading
after an EOF is hit.) "

this states, absolutely unequivocally, that the first read
does 'encounter' or 'hit' EOF.

as i stated previously, the correct solution in my view is
to handle this for the users. however, in light of the issue
of extensions doing their own thing, doing this would
require making extensions use python's fread wrapper. this
is unlikely (understatement *ahem*) to happen.

as such, this becomes a documentation issue. users should be
made aware that they are expected to deal with this.
ideally, they would also be told what to do. wording in the
read() docs should be corrected to remove any implication
that EOF actually gets encountered in the sense of ansi c
until "" is returned.

make sense?

-p
History
Date                 User      Action  Args
2022-04-11 14:56:14  admin     set     github: 42745
2006-01-01 00:06:19  corydodt  create