classification
Title: chr, ord, unichr documentation updates
Type: Stage:
Components: Documentation Versions: Python 2.4
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: fdrake Nosy List: fdrake, lemburg, mike_j_brown, rhettinger, terry.reedy
Priority: normal Keywords: patch

Created on 2004-10-31 07:25 by mike_j_brown, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
libfuncs.tex.diff mike_j_brown, 2005-01-19 11:00 libfuncs.tex diff (updated 19 Jan 2005)
Messages (13)
msg47201 - (view) Author: Mike Brown (mike_j_brown) Date: 2004-10-31 07:25
The attached diff may be applied against v1.175 of
libfuncs.tex --
http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/dist/src/Doc/lib/libfuncs.tex?content-type=text%2Fplain&rev=1.175


chr(): A str is not in any particular encoding, so
don't talk about ASCII, which does not apply to
arguments > 127 anyway. Also make reference to unichr().

ord(): A str is not in any particular encoding, so
don't talk about ASCII. Describe what the return value
represents for each type of string (str, unicode), and
mention the TypeError that will be raised on narrow
unicode builds of Python.

unichr(): Mention the restrictions on the argument
depending on whether Python was built with wide or
narrow unicode.

The precedent in unicode() is to refer to str objects
as "8-bit strings", so the wording of the above changes
was chosen accordingly.
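
A minimal sketch of the encoding point follows. It is written in Python 3 syntax (an assumption; this report is against Python 2.4), where bytes stands in for the 8-bit string and str for the unicode object. The helper name `interpret` is made up for the illustration.

```python
# Sketch only: Python 3 syntax, with bytes standing in for the 2.x 8-bit string.
# `interpret` is a hypothetical helper, not part of any library.

def interpret(value, encoding):
    """Decode the single byte `value` under the named encoding."""
    return bytes([value]).decode(encoding)

# Values below 128 agree across ASCII-based encodings.
low = {enc: interpret(97, enc) for enc in ("ascii", "latin-1", "cp437")}

# Values above 127 are not ASCII at all; what they stand for depends
# entirely on which encoding the reader assumes.
high_latin1 = interpret(233, "latin-1")
high_cp437 = interpret(233, "cp437")

# ASCII itself has nothing to say about 233.
try:
    interpret(233, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The same number 233 comes out as two different characters under latin-1 and cp437, and is rejected outright by the ascii codec, which is why the proposed wording avoids tying chr() and ord() to ASCII.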
msg47202 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2004-10-31 07:38
The attachment didn't make it.  Try again.

And, FWIW, I think the documentation is perfectly clear as
is.  Though the ASCII reference is not strict, I think
taking it out would be a mistake.  Though many encodings are
possible, there is a strong relationship between the number
97 and the letter 'a'.  Mentioning ASCII makes that
relationship clear.

IOW, I -1 on changing it until a new bytes type is introduced.
msg47203 - (view) Author: Mike Brown (mike_j_brown) Date: 2004-10-31 07:51
That kind of resistance to using accurate, strict
terminology just perpetuates common misunderstandings about
the relationship between characters and encodings.
msg47204 - (view) Author: Mike Brown (mike_j_brown) Date: 2004-10-31 08:23
Also note that I did not suggest removing the example with
the letter "a". I just suggested removing the reference to
"ASCII" in particular.

Ideally, IMHO, the documentation for sequence types is where
one should mention the strong association between strings
and ASCII. It currently doesn't even really describe what a
string or Unicode string is. It should state that
non-Unicode strings are an abstraction in which each member
of the sequence is a "character" that is actually an 8-bit
value, as in Standard C, intended to represent a character
in an arbitrary encoding, and that there is an _informal_
convention, in documentation, of referring to these values
as being ASCII values, in part due to the notational
conventions of string literals, such as using "\t", "\n",
and "\r" to represent decimal values 9, 10, and 13,
respectively (associations that only make sense in ASCII or
ASCII-based encodings), and in part because it is easier to
talk about the lower 128 values in terms of their ASCII
equivalents (e.g. "chr(97) produces the string 'a'").
Likewise, the unicode type could be described as being an
abstraction of 16-bit ("narrow") or 32-bit ("wide") code
units, depending on how Python was built, and so on... I
would see making such unambiguous statements to be a
reasonable alternative to just deleting mentions of ASCII
from the library docs, although I think making all of the
changes would be best, as people already have preconceived
notions of what a 'string' is and I know from experience
that they tend to not worry about straightening out their
understanding of such nuances until they get burned by
assumptions built around statements like "ord() gives you
the ASCII value".
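
The informal convention described above can be shown with a short sketch. It uses Python 3 syntax (an assumption; the discussion is about Python 2.4), and the cp037 EBCDIC codec is picked only as an example of an encoding that is not ASCII-based.

```python
# Sketch only: Python 3 syntax.

# The escape notation names fixed numeric values, whatever the encoding.
escapes = {"\\t": ord("\t"), "\\n": ord("\n"), "\\r": ord("\r")}

# chr() and ord() are inverses over the whole 8-bit range; neither one
# consults an encoding.
round_trip = all(ord(chr(i)) == i for i in range(256))

# Under an EBCDIC code page the number 97 names a different character,
# so "97 means 'a'" holds only for ASCII and ASCII-based encodings.
ebcdic_97 = bytes([97]).decode("cp037")
```

The association between 97 and 'a' is therefore a property of the encoding assumed by the reader, not of the value stored in the string.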
msg47205 - (view) Author: Mike Brown (mike_j_brown) Date: 2004-10-31 18:17
Oops, didn't mean to remove the assignment to fdrake when
adding previous comment.
msg47206 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-11-01 11:11
The new wording is indeed better than the old one. +1 on that
change.

However, you should use the term "code point" consistently
and perhaps add a footnote explaining the difference between
code point, glyph and character (Unicode strings are arrays
of code points - not characters).

Another note: I don't particularly like the terms "narrow"
and "wide" Unicode builds. If possible, these terms should be
replaced by the more accurate technical terms UCS2 and UCS4,
since the error messages relating to this difference also
mention these technical terms rather than narrow or wide builds.
msg47207 - (view) Author: Mike Brown (mike_j_brown) Date: 2004-11-02 06:56
You're right re: UCS2/UCS4. I can work up another patch.

I think you know this, but "code point" is not accurate
UTR#17-conformant terminology, as it just refers to the
single integer number from the code space that is available
to Unicode (0x0-0xD7FF and 0xE000-0x10FFFF), bearing in mind
that not all code points correspond to characters (all those
whose hex values end in FFFE and FFFF, for example).
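
As a rough sketch of which code points those are, in Python 3 syntax (an assumption; this thread concerns Python 2.4): `is_noncharacter` is a hypothetical helper, and the U+FDD0..U+FDEF block it also checks comes from the Unicode Standard rather than from this discussion.

```python
# Sketch only: Python 3 syntax. `is_noncharacter` is a made-up name.

def is_noncharacter(cp):
    """Return True if code point `cp` is a Unicode noncharacter."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    if cp & 0xFFFE == 0xFFFE:
        return True
    # A contiguous block in the BMP (from the Unicode Standard, not this thread).
    return 0xFDD0 <= cp <= 0xFDEF

# Scan the whole code space: 17 planes x 2, plus the 32 in the BMP block.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
```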

If we are just talking about what a Unicode string is in a
general sense, we say it is just a sequence of characters --
a character being a unit like, say, "Latin small letter z",
or "plus sign", in a writing system ("script") like
Latin/Roman, Cyrillic, Hiragana, etc.

If we are talking about what the unicode type is in Python,
to be accurate, we should say it is a sequence of UCS2 or
UCS4 "code values", depending on how Python was compiled,
and note that in its printable representation, the unicode
type displays, for characters outside the ASCII range, the
"code points" represented by those code values. It does this
using the same syntax as for string literals, but treats
surrogate pairs of code values as being representative of a
single code point (e.g., a unicode object consisting of code
value 0xD800 followed by 0xDC00 is printably represented by
u'\U00010000' even though it's still a string of length 2 in
both UCS2 and UCS4 builds of Python).
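
The relationship between the pair 0xD800, 0xDC00 and the code point 0x10000 is plain arithmetic. A minimal sketch follows, in Python 3 syntax (an assumption; the behaviour described above is that of Python 2.4). The function names are made up for the illustration, and the utf-16-be codec is used only as a cross-check.

```python
# Sketch only: Python 3 syntax. Function names are hypothetical.

def to_surrogates(cp):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    # Top 10 bits go in the high surrogate, bottom 10 in the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogates(0x10000)

# Cross-check against the standard codec: two 16-bit code values.
encoded = "\U00010000".encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big")
         for i in range(0, len(encoded), 2)]
```

Two code values stand for one code point here, which is the distinction between code values and code points drawn above.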

Is there a recommendation for how to refer unambiguously to
an instance of a unicode type? Is it a "unicode object"? How
about an instance of the str type? Is it an "8-bit string"?
I notice we say "byte string" a lot but apparently not
everyone is happy about that.
msg47208 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2005-01-19 04:52
Is the patch here finished, or was additional work needed?
msg47209 - (view) Author: Mike Brown (mike_j_brown) Date: 2005-01-19 06:42
I was just waiting for someone to answer my question about
terminology. (1) Is there a recommendation for how to refer
unambiguously to an instance of a unicode type? Is it a
"unicode object"? (2) How about an instance of the str type?
Is it an "8-bit string"? I notice we say "byte string" a lot
but apparently not everyone is happy about that.
msg47210 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2005-01-19 06:59
Ah, ok, here's some answers, then:

(1)  "unicode object" is right.

(2) I'm happy with either "8-bit string" or "byte string",
so whichever you find makes more sense in context is good.
msg47211 - (view) Author: Mike Brown (mike_j_brown) Date: 2005-01-19 10:50
Thanks. I've attached a new copy of the patch, with minor
substitutions made (UCS2 and UCS4 instead of narrow and wide,
mainly).
msg47212 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2005-01-24 06:22
I strongly prefer byte string to 8-bit string both because the
former is easier to think/say and because it is more
accurate. 8 bits, or rather, 256 different possible values, is a
minimum but not a maximum. If, for instance, Python were
ported to old machines with 6-bit chars, it would likely use
12-bit bytes (double machine bytes) with code similar to
UCS2 (double 8-bit byte) unicode builds. And, given that
there are no bit operations on the bytes of a byte string, the
machine implementation in terms of bits is not really
relevant.
msg47213 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2005-08-23 04:35
The portion of this that applies to the ord() documentation
has been committed; the remainder of this patch is no longer
necessary due to other changes to the documentation.

Relevant portion committed to Doc/lib/libfuncs.tex revisions
1.188, 1.175.2.8.
History
Date User Action Args
2022-04-11 14:56:07 admin set github: 41108
2004-10-31 07:25:37 mike_j_brown create