classification
Title: Heavy revisions to urllib2 howto
Type:
Stage:
Components: Documentation
Versions: Python 2.5

process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: akuchling
Nosy List: akuchling, jjlee
Priority: normal
Keywords: patch

Created on 2006-05-01 19:50 by jjlee, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name (uploaded by, date): description
reformatted.rst (jjlee, 2006-05-01 19:50): Reformatted original text of Doc/howto/urllib2.rst
edited.rst (jjlee, 2006-05-01 19:59): Revised version of Doc/howto/urllib2.rst
Messages (3)
msg50166 - Author: John J Lee (jjlee) Date: 2006-05-01 19:50
Lots of people have been complaining about lack of
urllib2 docs (though I'm never quite sure what people
are looking for, being too familiar with all the
details), so a tutorial may well be a useful addition.
 I'm sure you'll understand that my brutal criticism
:-) is intended to make it even more useful.

Michael: feel free to make further revisions, but
unless you have major objections I suggest that this is
checked in first, then we make any further changes
after that by uploading patches on SF for review (I
haven't stepped back and re-read it with a fresh mind,
and no doubt it would be useful for somebody to do that).
 Editing this took me quite a while, and if I can help
it I don't want to go through too many revisions or
argue about the details before anything gets fixed!-).
 I've taken the liberty of mentioning myself as a
reviewer somewhere at the end of the document :-)

Important: I reformatted paragraphs to max 70 character
width (it's conventional, and plain-text diffs are
especially painful to read otherwise, though admittedly
diffs are never great for paragraphs anyway... I hope
emacs didn't muck up any ReST syntax).  I've uploaded
just that formatting change as reformatted.rst (which
also removes trailing whitespace from all lines).  This
should be done in a separate initial commit of course.
 For this reason, I've uploaded the whole document for
both reformatted (reformatted.rst) and edited versions
(edited.rst) rather than using patches.

I've made all of the changes I discuss below, *with the
exception of* the missing example of GET with
urlencoded data that's really needed (search for XXX in
the comments below) -- that should just need a few lines.

BTW, it would be a really fantastic idea to turn the
whole document into a valid doctest (I know I'm myself
almost incapable of writing correct examples unless I
do something like that).  All that would require of
course is adding a few >>>s and ...s and running it
through doctest.testfile until it stops complaining ;-)
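
A rough sketch of what that could look like, written against
Python 3 (where urllib2 became urllib.request and
urllib.urlencode moved to urllib.parse).  HOWTO_TEXT and
check_document are hypothetical stand-ins for the real
document and whatever script would check it:

```python
import doctest
import os
import tempfile

# Hypothetical stand-in for the howto: prose plus interpreter sessions.
HOWTO_TEXT = """\
Encoding form values
====================

>>> from urllib.parse import urlencode
>>> urlencode([("name", "Somebody Here"), ("language", "Python")])
'name=Somebody+Here&language=Python'
"""


def check_document(text):
    """Write text to a temporary file and run it through doctest.testfile."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "howto.rst")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        # module_relative=False lets us pass an ordinary filesystem path.
        return doctest.testfile(path, module_relative=False)


results = check_document(HOWTO_TEXT)
print(results)
```

doctest.testfile returns a (failed, attempted) result, so a
zero failure count means every example in the text ran as
written.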



Now a list explaining and justifying the changes I made:


Spelling / paragraph structure etc. fixes.  I won't
list these.


Most importantly, you seem a bit unsure who your
audience is.  For example, on headers -- you explain
that "HTTP is based on requests and responses", but
dive into User-Agent without actually mentioning what a
header is.  In my changes, I ended up adding brief
explanations of the concepts for people new to or fuzzy
about HTTP, but didn't go into details of
implementation.  For example, introducing the concept
of "HTTP header", but not explaining how HTTP
implements them "on the wire" (though in fact I think
it would be a good thing to add one example that showed
an HTTP request and pointed out the request line, the
headers and the data, since that makes everything very
concrete and easy to grasp for newbies).
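
Something along these lines might serve as that one
example.  It is only a sketch: format_http_request is a
hypothetical helper, and the host, path, form values and
User-Agent value are made up for illustration.

```python
def format_http_request(method, path, headers, body=b""):
    """Return the bytes of a simple HTTP/1.1 request."""
    # First comes the request line: method, path and protocol version.
    request_line = "%s %s HTTP/1.1" % (method, path)
    # Then one "Name: value" line per header.
    header_lines = ["%s: %s" % (name, value) for name, value in headers]
    # A blank line separates the headers from the data (the body).
    head = "\r\n".join([request_line] + header_lines) + "\r\n\r\n"
    return head.encode("ascii") + body


body = b"name=Somebody+Here&language=Python"
raw = format_http_request(
    "POST",
    "/cgi-bin/register.cgi",
    [
        ("Host", "www.example.com"),
        ("User-Agent", "Python-urllib/2.5"),
        ("Content-Type", "application/x-www-form-urlencoded"),
        ("Content-Length", str(len(body))),
    ],
    body,
)
print(raw.decode("ascii"))
```

The printed text shows the three parts in order: the request
line, the headers, then the data after the blank line.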


Removed link to external howto on cookie handling. 
Despite the description ("How to handle cookies, when
fetching web pages with Python."), this actually spends
most of its time discussing what conditional imports
are needed if you want to be maximally compatible
across libraries and older versions of Python.  While
that is certainly useful for people who need that, I
think this is rather obscure and distracting detail
that seems out of place being referenced from the
Python 2.5 documentation, even in a howto.  Perhaps
some general statement that further tutorials are
available on your site?  Referencing your basic auth
tutorial seems fine.


You limit mention of urllib2.urlopen(url) to a
footnote, and in the text of the tutorial itself, you
say: """urllib2 mirrors this by having you form a
``Request``""" .  That's not true: a string URL is
fine, as you explain in the footnote.  That seems an
inaccuracy with no obvious didactic payoff.  In the
footnote, you say:

"""You *can* fetch URLs directly with urlopen, without
using a request object. It's more explicit, and
therefore more Pythonic, to use ``urllib2.Request``
though. It also makes it easier to add headers to your
request."""

I find that bizarre!  Why is urlopen(url) unpythonic??
 On the contrary, using an extra object for no reason
*does* seem unpythonic to me.  I rewrote this a bit.
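
To illustrate that both spellings do the same job, here is a
small sketch using Python 3's urllib.request (the successor
to urllib2).  The data: URL is an assumption made purely so
the example runs without a network; an http: URL would be
passed the same way.

```python
import urllib.request

# A data: URL keeps the sketch runnable offline; it stands in for an
# ordinary http: URL.
url = "data:text/plain,hello"

# Passing the URL string directly.
with urllib.request.urlopen(url) as response:
    direct = response.read()

# Wrapping it in a Request first, which only pays off when there is
# data or a header to attach.
with urllib.request.urlopen(urllib.request.Request(url)) as response:
    via_request = response.read()

print(direct == via_request)
```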


You needlessly assign the_url = "http:...", then
request = Request(the_url) -- why not a single line? 
Where it's useful to do that (i.e. in the more
complicated examples), I've s/the_url/url/, since I
object to chaff like "the_" in variable names ;-)


Your discussion of Request implies that it only
represents HTTP requests.  Fixed that.
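
As a quick illustration (Python 3 names, hypothetical URLs,
nothing is actually fetched), a Request can be built for any
URL scheme, and it records which one it was given:

```python
import urllib.request

# Hypothetical URLs; the objects are only constructed, never opened.
urls = [
    "http://www.example.com/",
    "ftp://ftp.example.com/pub/README",
    "file:///etc/hosts",
]

requests = [urllib.request.Request(url) for url in urls]
schemes = [request.type for request in requests]
print(schemes)
```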


Use of the word "handle" to talk about response objects
is unfortunate for two reasons: First, many objects in
Python are "handles" in some sense ("object reference"
semantics), so it's too vague to be a helpful name. 
Second, it's particularly unfortunate to use the word
"handle" when urllib2 makes heavy use of "handler"
objects that "handle" requests.  The fact that methods
on these handlers often return your "handles" only
makes things more confusing!  s/handle/response/


"""Sometimes you want to **POST** data to a CGI (Common
Gateway Interface) [#]_ or other web application"""

It's clear to us old hands what you mean here, but in a
tutorial at the level you seem to have picked we
probably shouldn't expect the reader to have all these
concepts straight, so being sloppy here is bad.

 - By "a CGI" I'm guessing you mean "a CGI
script/program".  Also, the whole sentence is unclear
whether you're talking about a web application in the
abstract, or some concrete CGI script.  I certainly
remember being very confused about this kind of thing
as a newbie.

 - "...or other web application" implies that all POSTs
go to web applications.  That's using "web application"
in a broader sense than it's usually understood.

 - You introduce "POST" without explanation.  Would be
nice to say "send data" instead of "POST", then explain
POST.

I rewrote this bit to try to address those points.



Re POST: """This is what your browser does when you
fill in a FORM on the web"""

That needed qualifying: form submission can also
result in a GET.


I added a bit on side-effects and GET/POST.


"""You may be mimicking a FORM submission, or
transmitting data to your own application."""

This reads oddly to me.  I know what you're getting at
(forms are not part of HTTP), but surely if you are
submitting form data you're not "mimicking" form
submission, you *are* submitting a form.  And in an
English sentence the "or" reads as an "exclusive or";
with that in mind: In what sense does form submission
*not* involve "transmitting data to your own
application"?  Reworded and s/FORM/HTML form/, since
we're talking about the abstract thing rather than
specifically about the HTML element.


"""In either case the data needs to be encoded for safe
transmission over HTTP"""

Arbitrary binary data does not need to be URL-encoded.
 Rephrased.


"""The encoding is done using a function from the
``urllib`` library *not* from ``urllib2``. ::"""

This is not true in general even for HTML forms.  For
example, HTML form file upload data is not encoded in
this way.  There are more obscure cases, too.  Noted this.
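
For the common case, a sketch of the encoding step might look
like this in Python 3, where urlencode lives in urllib.parse
rather than urllib.  The form fields and URL are made up.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields and URL.
values = {"name": "Somebody Here", "language": "Python"}
url = "http://www.example.com/cgi-bin/register.cgi"

# urlencode produces application/x-www-form-urlencoded text; Python 3
# wants bytes for the request body, hence the encode step.
data = urllib.parse.urlencode(values).encode("ascii")
request = urllib.request.Request(url, data)

print(data)
# Supplying data is what turns the request into a POST.
print(request.get_method())
```

File uploads use multipart/form-data instead, which urlencode
does not produce.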


The quoted User-Agent string was out-of-date.  Fixed,
noting that it changes with each minor Python version.
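
A sketch of how to see the current default and override it,
using Python 3's urllib.request.  The URL and the replacement
agent string are invented for the example.

```python
import urllib.request

# The default header list on a freshly built opener.
opener = urllib.request.build_opener()
default_agent = dict(opener.addheaders)["User-agent"]
print(default_agent)

# Overriding it for one request.
request = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "ExampleClient/0.1"},
)
# Request stores header names capitalized, so look it up that way.
print(request.get_header("User-agent"))
```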


Headers / data : I added a bit of explanatory context
to tell people what we're about to explain, and break
up paragraphs / add sections to clarify the structure.
 Also explained the concept of "HTTP header", as I
noted above.


XXX example needed on GET with urlencoded data (as it's
written ATM, this would go immediately before the
"Headers" section).
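
One possible shape for that missing example, sketched with
Python 3 names (urllib.parse.urlencode, urllib.request).  The
search URL and parameters are hypothetical.

```python
import urllib.parse
import urllib.request

# Hypothetical search URL and parameters.
base_url = "http://www.example.com/search"
query = urllib.parse.urlencode({"q": "urllib2 howto", "page": 2})

# For a GET the encoded values go into the URL itself, after a "?".
full_url = base_url + "?" + query
request = urllib.request.Request(full_url)

print(full_url)
# No data argument was given, so this remains a GET.
print(request.get_method())
```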


"""Coping With Errors"""

"Handling exceptions" seems more accurate.  Not all
HTTP status codes for which urllib2 raises an exception
involve HTTP error responses.  The text is also
confused on this point, so I rewrote it.


Errors: I believe urlopen can still actually raise
socket.error.  This is a bug, but I haven't dared to
submit a patch to fix it, fearing
backwards-compatibility issues.  I guess it should
probably be documented :-( But I suppose we should
discuss that in a separate tracker item, rather than
adding it to your howto straight away.


You mention IOError.  Without a motivating use case I
don't know why you mention this.  Since I'm not really
sure what the use case for this subclassing was ever
intended to be :-) I removed this example: feel free to
add it back if you know of a use or can get Jeremy
Hylton to explain it to you ;-)


Re URLError : you imply that the only reason for
URLError to be raised is failure to connect to the
server.  This is often the cause, but certainly not always.
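
A minimal sketch of the usual exception handling, assuming
Python 3 (urllib.error holds the exception classes there).
describe_fetch is a hypothetical helper, and the file: URL is
deliberately one that should not exist, to show URLError
raised with no server involved at all.

```python
import urllib.error
import urllib.request


def describe_fetch(url):
    """Try to open url and return a short description of the outcome."""
    try:
        with urllib.request.urlopen(url) as response:
            return "ok: %d bytes" % len(response.read())
    except urllib.error.HTTPError as e:
        # HTTPError subclasses URLError, so it has to be caught first.
        return "HTTP status %d" % e.code
    except urllib.error.URLError as e:
        return "URLError: %s" % (e.reason,)


# No connection is attempted here, yet URLError is still raised.
missing = describe_fetch("file:///no/such/file/for/this/sketch")
print(missing)
```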


For HTTP status codes, you refer to a document that
states "This is a historic document and is not accurate
anymore".  RFC 2616 is authoritative, and IMHO fairly
readable on error codes.  Removed the reference to the
other document.


"""As of Python 2.5 a dictionary like this one has
become part of ``urllib2``."""

In fact, this was moved to httplib.  The reference to
"HTTPBaseServer" (sic) is interesting: I think the copy
in httplib should be removed, since it's already there
in BaseHTTPServer (albeit missing 306, but that is
unused) -- would you mind filing a patch, Michael?

Your listing differed from BaseHTTPServer and from RFC
2616, so I replaced it with the BaseHTTPServer copy.


"""shows all the defined response codes"""

These are only those defined by RFC 2616 of course:
other standards can and do define other response status
codes (e.g. DAV).  Clarified this.
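
For reference, a sketch of where the two copies live in
Python 3: httplib became http.client and BaseHTTPServer
became http.server.

```python
import http.client
import http.server

# Code -> short reason phrase (httplib.responses in Python 2.5).
reason = http.client.responses[404]
print(reason)

# Code -> (short phrase, longer explanation): the BaseHTTPServer copy.
short, long_text = http.server.BaseHTTPRequestHandler.responses[404]
print(short)
print(long_text)
```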


"""When an error is raised the server responds by
returning an http error code *and* an error page."""

This is sloppy: HTTP doesn't define "raising" an error,
so it can't respond to one.  Fixed.


httplib.HTTPMessage

Reworded to avoid implying it's *always* going to be
this concrete class.


"""In versions of Python prior to 2.3.4 it wasn't safe
to iterate over the object directly, so you should
iterate over the list returned by ``msg.keys()``
instead."""

Is this appropriate advice in the 2.5 docs?  I removed
this (am I too harsh on this point?).
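
A short sketch of looking at response headers in a current
Python, iterating directly over the object that info()
returns.  It uses Python 3's urllib.request, and the data:
URL is an assumption so that no network is needed.

```python
import urllib.request

# A data: URL stands in for a real page so this runs offline.
with urllib.request.urlopen("data:text/plain,hello") as response:
    headers = response.info()

# Iterating directly over the message yields the header names.
for name in headers:
    print(name, "=", headers[name])
```

The code relies only on the mapping-style interface, not on
the concrete class of the headers object.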


"""Openers and handlers are slightly esoteric parts of
**urllib2**."""

I don't want to scare people off: they're easy to use
(if not to write).  Removed this.


I added a tiny bit more on what handlers do.


Changed the text to avoid implying that build_opener()
is the only way to create openers.


Don't refer to ``opener`` in those typewriter-font ReST
backticks, since that seems a little misleading: it's
not a Python class name (unfortunately the class is
named OpenerDirector, which rather clashes with the use
of the name "opener" of course, but personally I'm with
you in preferring "opener").


Wrote a bit more about opener construction.
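
As an illustration of the two routes (Python 3 names; the
data: handler is chosen only so the sketch runs offline), an
opener can come from build_opener() or be assembled by hand
from an empty OpenerDirector:

```python
import urllib.request

# Convenience route: the default handlers plus anything passed in.
default_opener = urllib.request.build_opener()

# Manual route: start empty and add only what is wanted.
bare_opener = urllib.request.OpenerDirector()
bare_opener.add_handler(urllib.request.DataHandler())

with bare_opener.open("data:text/plain,hello") as response:
    body = response.read()

print(body)
print(len(default_opener.handlers), len(bare_opener.handlers))
```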


Changed realm name to make it clear it may contain spaces.


Changed references to URI to URL in discussion of
authentication -- seems an irrelevant and distracting
distinction here.


I edited the basic auth description a little.


Comments conventionally come *before* the code they refer
to, not after.  Fixed that, removed an over-obvious
comment or two (even in docs, "create the handler"
seems redundant if that's *all* it says), and fixed
the curious line breaks.


"""The only reason to explicitly supply these to
``build_opener`` (which chains handlers provided as a
list), would be to change the order they appear in the
chain."""

I don't know of a use case for that in the case of the
handlers you list.  Also, that doesn't actually work:
handler ordering is determined by sorting.  Removed this.
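
A small sketch of the sorting behaviour, using Python 3's
urllib.request.  EarlyHandler and LateHandler are
hypothetical classes, and handler_order is the attribute
CPython's BaseHandler uses for the comparison.

```python
import urllib.request


class EarlyHandler(urllib.request.BaseHandler):
    handler_order = 100  # lower values sort first

    def http_open(self, request):
        return None  # decline, so the next handler gets a turn


class LateHandler(urllib.request.BaseHandler):
    handler_order = 900

    def http_open(self, request):
        return None


opener = urllib.request.OpenerDirector()
# Added in the "wrong" order on purpose.
opener.add_handler(LateHandler())
opener.add_handler(EarlyHandler())

order = [type(handler).__name__ for handler in opener.handlers]
print(order)
```

The order the handlers were added in makes no difference to
where they end up in the chain.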


"""One thing not to get bitten by is that the
``top_level_url`` in the code above *must not* contain
the protocol - the ``http://`` part. So if the URL we
are trying to access is"""

This is not correct usage (though I can see why it
worked); removed it.  Admittedly, urllib2 auth was the
subject of quite a few bug fixes recently (I seem to
have just found yet another one five minutes ago, in
fact :-( ), so the situation pre-2.5 was certainly
messy.  However, I advise against trying to document
the old bugs!  Note that I haven't given examples of
"sub-URLs" since the RFC (2617) isn't clear to me on
this point, and I haven't yet tested whether urllib2
gets it right according to de-facto standards (as
defined by browsers, Apache, etc.)  for "sub-URLs" of
the one passed to .add_password().  It's on the list...
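
For the straightforward case, a sketch of the setup with the
full URL, scheme included, passed to add_password().  It uses
Python 3's urllib.request; the realm, URL and credentials are
all made up, and the lookup uses the exact same URL, so it
says nothing about "sub-URLs".

```python
import urllib.request

# Hypothetical realm (spaces are fine), URL and credentials.
realm = "Example Realm With Spaces"
top_level_url = "http://www.example.com/private/"

password_mgr = urllib.request.HTTPPasswordMgr()
# The URL is passed whole, with the http:// part.
password_mgr.add_password(realm, top_level_url, "user", "secret")

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(handler)

# Looking the credentials up again needs no network access.
found = password_mgr.find_user_password(realm, top_level_url)
print(found)
```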


In your note explaining that HTTPS proxies are not
supported, you use "caution" rather than "note", which
conveys the strange implication to me that this lack of
support is somehow a consequence of using your previous
recipe for switching off proxy handling (or am I weird
in reading it that way??).  s/caution/note/


""".. [#] Possibly some of this tutorial will make it
into the standard library docs for versions of Python
after 2.4.1."""

Removed this.


Whew!
msg50167 - Author: John J Lee (jjlee) Date: 2006-05-01 19:59
(I guess if I had any sense in me, I would have uploaded
those comments as an attachment instead of pasting them into
the summary -- sorry.)

I'm uploading the revised document now.
msg50168 - Author: A.M. Kuchling (akuchling) (Python committer) Date: 2006-05-07 17:13
Edited.rst has been committed; thanks!
History
Date                 User   Action  Args
2022-04-11 14:56:17  admin  set     github: 43307
2006-05-01 19:50:49  jjlee  create