This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: httplib patch to make _read_chunked() more robust
Type: Stage:
Components: None Versions: Python 2.5
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: georg.brandl, jjlee, vila
Priority: normal Keywords: patch

Created on 2006-01-20 20:26 by jjlee, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
wo.patch jjlee, 2006-01-20 20:26
chunk.patch jjlee, 2006-02-06 01:24
Messages (13)
msg49362 - (view) Author: John J Lee (jjlee) Date: 2006-01-20 20:26
To reproduce:

import urllib2
print urllib2.urlopen("http://66.117.37.13/").read()


The attached patch "fixes" the hang, but that patch is
not acceptable because it also removes the .readline()
and .readlines() methods on the response object
returned by urllib2.urlopen().

The patch seems to demonstrate that the problem is
caused by the (ab)use of socket._fileobject in
urllib2.AbstractHTTPHandler (I believe this hack was
introduced when urllib2 switched to using
httplib.HTTPConnection).

Not sure yet what the actual problem is...
msg49363 - (view) Author: John J Lee (jjlee) Date: 2006-01-21 22:10
Logged In: YES 
user_id=261020

In fact the commit message for rev 36871 states the real
reason _fileobject is used (handling chunked encoding),
showing my workaround is even more harmful than I thought. 
Moreover, doing a urlopen on 66.117.37.13 shows the response
*is* chunked.

The problem seems to be caused by httplib failing to find a
CRLF at the end of the chunked response, so the loop at the
end of _read_chunked() never terminates.  Haven't looked in
detail yet, but I'm guessing a) it's the server's fault and
b) httplib should work around it.


Here's the commit message from 36871:


Fix urllib2.urlopen() handling of chunked content encoding.

The change to use the newer httplib interface admitted the
possibility
that we'd get an HTTP/1.1 chunked response, but the code
didn't handle
it correctly.  The raw socket object can't be pass to
addinfourl(),
because it would read the undecoded response.  Instead,
addinfourl()
must call HTTPResponse.read(), which will handle the decoding.

One extra wrinkle is that the HTTPReponse object can't be
passed to
addinfourl() either, because it doesn't implement readline() or
readlines().  As a quick hack, use socket._fileobject(), which
implements those methods on top of a read buffer. 
(suggested by mwh)

Finally, add some tests based on test_urllibnet.

Thanks to Andrew Sawyers for originally reporting the
chunked problem.
msg49364 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 00:38
Logged In: YES 
user_id=261020

First, expanding a bit on what I wrote on 2006-01-21: The
problem does relate to chunked encoding, and is unrelated to
urllib2's use of _fileobject.  My hack to remove use of
socket._fileobject from urllib2 merely breaks handling of
chunked encoding by cutting httplib.HTTPResponse out of the
picture.  The problem is seen in urllib2 in recent Pythons
thanks to urllib2 switching to use of httplib.HTTPConnection
and HTTP/1.1, hence chunked encoding is allowed.  urllib
still uses httplib.HTTP, hence HTTP/1.0, so is unaffected.
To reproduce with httplib:
import httplib
conn = httplib.HTTPConnection("66.117.37.13")
conn.request("GET", "/", headers={"Connection": "close"})
r1 = conn.getresponse()
print r1.read()
The Connection: close is required -- if it's not there the
server doesn't use chunked transfer-encoding.
I verified with a packet sniffer that the problem is that
this server does not send the final trailing CRLF required
by section 3.6.1 of RFC 2616.  However, that section also
says that trailers (trailing HTTP headers) MUST NOT be sent
by the server unless either a TE header was present and
indicated that trailers are acceptable (httplib does not
send the TE header), or the trailers are optional metadata
and may be discarded by the client.  So, I propose the
attached patch to httplib (chunk.patch) as a work-around.
msg49365 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 01:24
Logged In: YES 
user_id=261020

Oops, fixed chunk.patch to .strip() before comparing to "".
msg49366 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 20:36
Logged In: YES 
user_id=261020

I missed the fact that, if the connection will not close at
the end of the transaction, the behaviour should not change
from what's currently in SVN (we should not assume that the
chunked response has ended unless we see the proper
terminating CRLF).  I intend to upload a slightly modified
patch that tests for self._will_close, and behaves accordingly.
msg49367 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 21:18
Logged In: YES 
user_id=261020

Conservative or not, I see no utility in changing the
default, and several major harmful effects: old code breaks,
and people have to pore over the specs to figure out why
"urlopen() doesn't work".
msg49368 - (view) Author: John J Lee (jjlee) Date: 2006-02-06 21:20
Logged In: YES 
user_id=261020

Please ignore last comment (posted to wrong tracker item).
msg49369 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-05-03 05:17
Logged In: YES 
user_id=849994

Are you still working on your slightly modified patch?
msg49370 - (view) Author: John J Lee (jjlee) Date: 2006-05-03 18:57
Logged In: YES 
user_id=261020

I *hope* to get back to it soon.  But if anybody beats me to
it, that's fine :-)

One problem: I don't understand the need for
HTTPConnection._safe_read(), rather than checking for an
EINTR resulting from the recv() call (or WSAEINTR on
Windows).  Can anybody explain that?
msg49371 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-03-14 07:00
Ping!
msg49372 - (view) Author: John J Lee (jjlee) Date: 2007-03-14 20:38
Haven't forgotten, just haven't done it...

I never did figure out what should be done re EINTR.
msg62847 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-02-23 23:58
This seems to have been fixed in r60747.
msg76767 - (view) Author: John J Lee (jjlee) Date: 2008-12-02 19:27
For the record, I think my worry in msg49366 was a non-issue: the only
time .readline() will return "" is when the connection has closed (so I
think the fix that eventually applied is correct).
History
Date User Action Args
2022-04-11 14:56:15adminsetgithub: 42815
2008-12-02 19:27:34jjleesetmessages: + msg76767
2008-02-23 23:58:03georg.brandlsetstatus: open -> closed
resolution: out of date
messages: + msg62847
2008-01-05 14:03:10vilasetnosy: + vila
2006-01-20 20:26:50jjleecreate