
classification
Title: Problems with urllib2 read()
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.6
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ajaksu2, andyshorts, ironfroggy, jjlee, lucas_malor, maenpaa, orsenthil, pitrou
Priority: low Keywords:

Created on 2007-03-16 16:00 by lucas_malor, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (12)
msg31544 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-03-16 16:00
urllib2 objects opened with urlopen() do not have the seek() method that file objects have, so reading only some bytes from an opened URL is practically impossible.

An example: I tried to open a URL and check if it's a gzip file, and if IOError is raised, read it as a plain file (to do this I applied the #1675951 patch: https://sourceforge.net/tracker/index.php?func=detail&aid=1675951&group_id=5470&atid=305470 )

But after I try to open the file as gzip, if it's not a gzip file the current position in the urllib object is on the second byte, so read() returns the data from the 3rd byte to the last. You can't check the header of the file before storing it on disk. Well, so what is urlopen() for? If I must store the file from the URL on disk and reload it, I can use urlretrieve() ...
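
For illustration, a minimal sketch of the pattern described above (the URL is a placeholder; per the report, reading a gzip stream from a urlopen() object also needs the #1675951 patch):

import gzip
import urllib2

urlobj = urllib2.urlopen("someurl")
try:
    contents = gzip.GzipFile(fileobj=urlobj).read()
except IOError:
    # GzipFile has already consumed the first bytes of the stream while
    # checking the magic number; without seek() there is no way to rewind,
    # so this read() misses the start of the file.
    contents = urlobj.read()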
msg31545 - (view) Author: Zacherates (maenpaa) Date: 2007-03-20 02:43
I'd contend that this is not a bug:
 * If you need to seek, you can wrap the file-like object in a StringIO (which is what urllib would have to do internally, thus incurring the StringIO overhead for all clients, even those that don't need the functionality).
 * You can check the type of the response content before you try to uncompress it via the Content-Encoding header of the response.  The meta-data is there for a reason.

Check http://www.diveintopython.org/http_web_services/gzip_compression.html for a rather complete treatment of your use-case.
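
A hedged sketch combining those two suggestions (the URL is a placeholder; the header handling follows the linked recipe):

import gzip
import StringIO
import urllib2

response = urllib2.urlopen("someurl")
body = response.read()
if response.info().get("Content-Encoding") == "gzip":
    # the server declared the body as gzip-compressed: wrap the bytes we
    # already read in a seekable StringIO and decompress from there
    body = gzip.GzipFile(fileobj=StringIO.StringIO(body)).read()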
msg31546 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-03-20 08:59
> If you need to seek, you can wrap the file-like object in a
> StringIO (which is what urllib would have to do internally
> [...] )

I think it's really a bug, or at least non-pythonic behaviour.
I use the method you describe, but it must be done manually,
and I don't see why. Without this "trick" you can't handle
URL and file objects interchangeably, because they don't work
in the same manner. I don't think it would be too complicated
for urllib to use an internal StringIO object when I must call
seek() or other file-like methods.

> You can check the type of the response content before you try
> to uncompress it via the Content-Encoding header of the
> response

It's not a generic solution

(thanks anyway for suggested solutions :) )
msg31547 - (view) Author: Zacherates (maenpaa) Date: 2007-03-21 01:39
> I use the method you wrote, but this must be done manually,
> and I don't know why.
read() is a stream processing method, whereas seek() is a random access processing method.  HTTP resources are in essence streams so they implement read() but not seek().  Trying to shoehorn a stream to act like a random access file has some rather important technical implications.  For example: what happens when an HTTP resource is larger than available memory and we try to maintain a full featured seek() implementation?

> so what is urlopen() for?
Fetching a webpage or RSS feed and feeding it to a parser, for example.

StringIO is a class that was designed to implement feature complete, random access, file-like object behavior that can be wrapped around a stream.  StringIO can and should be used as an adapter for when you have a stream that you need random access to.  This allows designers the freedom to simply implement a good read() implementation and let clients wrap the output in a StringIO if needed.

If in your application you always want random access and you don't have to deal with large files:
import StringIO, urllib2

def my_urlopen(*args, **kwargs):
    return StringIO.StringIO(urllib2.urlopen(*args, **kwargs).read())
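
A hypothetical usage, assuming the wrapper above:

f = my_urlopen("someurl")
magic = f.read(2)   # peek at the first two bytes
f.seek(0)           # rewinding works because the data lives in a StringIO
contents = f.read()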

Python makes delegation trivially easy.

In essence, urlfiles (the result of urllib2.urlopen()) and regular files (the result of open()) behave differently because they implement different interfaces.  If you use the common interface (read), then you can treat them equally.  If you use the specialized interface (seek, tell, etc.) you'll have trouble.  The solution is to wrap the general object in a specialized object that implements the desired interface: StringIO.
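
To make the "common interface" point concrete, a small hypothetical helper that only ever calls read(), and therefore works identically on open() files and urlopen() responses:

import hashlib

def sha1_of(fileobj):
    # relies only on the stream interface: read() in chunks until exhausted
    h = hashlib.sha1()
    while True:
        chunk = fileobj.read(8192)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()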
msg31548 - (view) Author: Calvin Spealman (ironfroggy) Date: 2007-04-26 13:55
I have to agree that this is not a bug. HTTP responses are streams, not random access files. Adding seek() would have disastrous performance penalties. If you think the workaround is too complicated, I can't understand why.
msg31549 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-04-26 20:41
In my opinion it's not complicated, it's convoluted. I must use two objects
to handle one data stream.

Furthermore it's a waste of resources. I must copy the data to another object.
Luckily in my script I download and handle only small files. But what if a
Python program must handle big files?

If seek() can't be used (an exception is raised), urllib could use a
sequential access method.
msg31550 - (view) Author: Zacherates (maenpaa) Date: 2007-04-27 03:36
> In my opinion it's not complicated, it's convoluted. I must use two
> object to handle one data stream.

seek() is not a stream operation. It is a random access operation (file-like != stream). If you were only trying to use stream operations then you wouldn't have these problems.   

Each class provides a separate functionality; urllib gets the file while StringIO stores it.  The fact that these responsibilities are given to different classes should not be surprising, since they represent separately useful concepts that abstract different things.  It's not convoluted, it's good design.  If every class tried to do everything, pretty soon you're adding solve_my_business_problem_using_SOA() to __builtins__ and nobody wants that.


> Furthermore it's a waste of resources. I must copy data to another
> object. Luckily in my script I download and handle only little files. But what if
> a python program must handle big files?

This is exactly why urllib *doesn't* provide seek. Deep down in the networking library there's a socket with an 8 KiB buffer talking to the HTTP server. No matter how big the file you're getting with urllib, once that buffer is full the socket starts dropping packets.

To provide seek(), urllib would need to keep an entire copy of the file that was retrieved (or provide mark()/seek(), but those have wildly different semantics from the seek() we're used to in Python, and besides they're too Java).  This works fine if you're only working with small files, but you raise a good point: "But what if a python program must handle big files?".  What about really big files (say a Knoppix DVD ISO)?  Sure you could use urlretrieve, but what if urlretrieve is implemented in terms of urlopen?

Sure urllib could implement seek (with the same semantics as file.seek()) but that would mean breaking urllib for any resource big enough that you don't want the whole thing in memory.
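
For the big-file case the stream interface is already enough; a hedged sketch of saving a large response to disk in fixed-size chunks (URL and filename are placeholders), which is roughly what urlretrieve does:

import urllib2

src = urllib2.urlopen("someurl")
dst = open("somefile", "wb")
while True:
    chunk = src.read(8192)   # at most one buffer's worth in memory at a time
    if not chunk:
        break
    dst.write(chunk)
dst.close()
src.close()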


>> You can check the type of the response content before you try
>> to uncompress it via the Content-Encoding header of the
>> response

> It's not a generic solution

The point of this suggestion is not that this is the be all and end all solution, but that code that *needs* seek can probably be rewritten so that it does not.  Either that or you could implement BufferedReader with the methods mark() and seek() and wrap the result of urlopen.
msg31551 - (view) Author: Lucas Malor (lucas_malor) Date: 2007-04-27 09:26
If you don't want the entire file in memory, there are two solutions:

----------
import urllib

urlobj = urllib.urlopen("someurl")
header = urlobj.read(1)
# some other operations (no other urlobj.read())

contents = header + urlobj.read()

----------

I don't think it's a --good-- solution, because some other programmers can do other read() operations and mess up the result.

The other solution is:

----------
import urllib

def readall(x):
  url = x.geturl()
  x.close()
  try:
    y = urllib.urlopen(url)
  except IOError:
    return None
  return y.read()


urlobj = urllib.urlopen("someurl")
header = urlobj.read(1)
# some other operations

contents = readall(urlobj)

----------

This is still a bad solution (two calls to the server for the same file).

On the contrary, I'm pretty sure that using sequential access this can be done without these workarounds.

(anyway, one thing I don't understand: can't HTTP delegate the seek() to the server?)
msg31552 - (view) Author: Zacherates (maenpaa) Date: 2007-04-27 11:57
> import urllib

> urlobj = urllib.urlopen("someurl")
> header = urlobj.read(1)
> # some other operations (no other urlobj.read())

> contents = header + urlobj.read()

This is effectively buffering the output, which is a perfectly acceptable solution...  although I'd write like this:

import urllib

class BufferedReader(object):
    def __init__(self, fileobj, buffsize=8192):
        self._fileobj = fileobj
        self._recorded = ""   # bytes read since the last mark()
        self._replay = ""     # bytes to be re-read after a seek()

    def mark(self, maxbytes=8192):
        self._recorded = ""   # start recording from this point

    def seek(self):
        self._replay = self._recorded + self._replay   # rewind to the mark
        self._recorded = ""

    def read(self, size=-1):
        if size < 0:
            data, self._replay = self._replay + self._fileobj.read(), ""
        else:
            data, self._replay = self._replay[:size], self._replay[size:]
            if len(data) < size:
                data += self._fileobj.read(size - len(data))
        self._recorded += data
        return data

br = BufferedReader(urllib.urlopen("someurl"))
br.mark()
header = br.read(1)
br.seek()
contents = br.read()

That way you store all the bytes that have been read, rather than hoping nobody calls read().


> On the contrary I'm pretty sure using a sequential access this can be done
> without doing these workarounds.

Right now sequential access is provided without keeping a copy in memory.  The issue arises when you want random access, however; urlobjs have no indication as to whether you're going to call seek().  As such, to provide the method they must assume you will call it.  Therefore, regardless of whether seek() is actually called or not, a copy must be kept to offer the *possibility* that it can be called.

You can work around this by offering the degenerate seek() provided by BufferedReader, but that's functionality that belongs in its own class anyway.


> anyway I don't understand a thing: HTTP can't delegate the server to
> seek() the file?

For one thing, it's not supported by the standard.  For another, it would be a waste of server resources and bandwidth, and to top it off it would be really slow... even slower than using StringIO.  HTTP resources are not simply files served up by httpd, they can also be dynamically generated content... How is an HTTP server supposed to seek backward and forward in a page that is programmatically generated? Go try and tell web developers that they need to keep a copy of every page requested indefinitely, in case you want to send a SEEK request.




HTTP resources are not local.  To treat them as local you must make them local by putting them in a container, such as StringIO, a buffer or a local file. It's that simple.  To try and abstract this fact away would result in major performance issues, or unreliability, or both.
msg31553 - (view) Author: AndyShorts (andyshorts) Date: 2007-06-18 22:28
While a newbie to Python, I would like to point out that RFC 2616 (the HTTP/1.1 spec) does allow byte ranges to be requested, and these could be used to mimic seek() etc. In order to do so, though, the client must keep track of the file pointer itself.

However, given my experience of real-world servers, you are never sure the target will support them - it need not - and even if it does, proxies (both visible and transparent) are free to utterly trash these requests as they see fit. I've learned this the hard way while writing HTTP client software :-)
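
For illustration, a hedged sketch of such a byte-range request with urllib2 (the URL is a placeholder; a server that ignores the Range header answers 200 instead of 206):

import urllib2

req = urllib2.Request("someurl")
req.add_header("Range", "bytes=0-1")   # ask for the first two bytes only
resp = urllib2.urlopen(req)
if resp.code == 206:     # Partial Content: the server honoured the range
    header = resp.read()
else:                    # 200 OK: the full body came back anyway
    header = resp.read(2)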

The only safe thing to do with HTTP data is treat it as a stream and if you want to seek through it then your choices are:

 a. Buffer it locally and access it that way
 b. Keep opening and closing the resource and reading through to where you want to be.

imo the urllib.urlopen acts in the only sane way possible.

[btw this is my first post so sorry if this is OT - though I notice this thread has gone into torpor]
msg81795 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-12 18:21
I think the bug should be closed as invalid. seek() should only be
implemented by genuinely seekable streams, which HTTP responses aren't.
msg81845 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-13 01:20
Anyone against closing?
History
Date User Action Args
2022-04-11 14:56:23  admin  set  github: 44732
2009-02-20 01:53:10  ajaksu2  set  status: pending -> closed
    resolution: not a bug
    stage: test needed -> resolved
2009-02-18 01:52:21  ajaksu2  set  status: open -> pending
    priority: normal -> low
2009-02-13 01:20:40  ajaksu2  set  nosy: + jjlee
    messages: + msg81845
2009-02-12 18:21:33  pitrou  set  nosy: + pitrou
    messages: + msg81795
2009-02-12 18:12:33  ajaksu2  set  stage: patch review -> test needed
2009-02-12 18:12:19  ajaksu2  set  messages: - msg81792
2009-02-12 18:11:57  ajaksu2  set  nosy: + ajaksu2, orsenthil
    stage: patch review
    type: behavior
    messages: + msg81792
    versions: + Python 2.6, - Python 2.5
2007-03-16 16:00:20  lucas_malor  create