Issue 432401: unicode encoding error callbacks

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/34615

classification

Title:	unicode encoding error callbacks
Type:		Stage:
Components:	Interpreter Core	Versions:

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	doerwalter, lemburg
Priority:	high	Keywords:	patch

Created on 2001-06-12 13:43 by doerwalter, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
diff.txt	doerwalter, 2001-06-12 13:43
diff.txt	doerwalter, 2001-06-12 18:59	Revised patch
diff.txt	doerwalter, 2001-06-13 13:57	Patch V3: Renamed the encode function to include "unicode". a few fixes in Lib/encodings
diff.txt	doerwalter, 2001-07-12 11:03	Patch V4: (enc, uni, pos, state) -> (out, newpos) communication and speedups
diff.txt	doerwalter, 2001-07-23 17:03	Patch V5: Unicode encoding error handling callback registry
diff.txt	doerwalter, 2001-07-27 03:55	Patch V6: Encoding and decoding for string and unicode done
diff7.txt	doerwalter, 2002-04-18 19:22
test_codeccallbacks.py	doerwalter, 2002-04-18 19:24	test script
diff8.txt	doerwalter, 2002-05-01 17:57	PyUnicode_EncodeDecimal done
diff9.txt	doerwalter, 2002-05-16 19:06
diff10.txt	doerwalter, 2002-05-29 20:50
test_codeccallbacks.py	doerwalter, 2002-05-29 20:51	test script which catches the bugs fixed in diff10.txt
diff11.txt	doerwalter, 2002-05-30 16:30
speedtest.py	doerwalter, 2002-05-30 16:31	test speed for encoding
diff12.txt	doerwalter, 2002-07-24 18:55
test_codeccallbacks.py	doerwalter, 2002-07-24 19:04	test script for diff12.txt
test_codeccallbacks.py	doerwalter, 2002-07-26 15:41	Adds parameter and result test for the callbacks
diff13.txt	doerwalter, 2002-08-14 10:23

Messages (42)
msg36773 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-06-12 13:43
This patch adds unicode error handling callbacks to the encode functionality. With this patch it's possible to not only pass 'strict', 'ignore' or 'replace' as the errors argument to encode, but also a callable function, that will be called with the encoding name, the original unicode object and the position of the unencodable character. The callback must return a replacement unicode object that will be encoded instead of the original character. For example replacing unencodable characters with XML character references can be done in the following way. u"aäoöuüß".encode( "ascii", lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos]) )
msg36774 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-06-12 14:29
Logged In: YES user_id=38388 Thanks for the patch -- it looks very impressive !. I'll give it a try later this week. Some first cosmetic tidbits: * please don't place more than one C statement on one line like in: """ + unicode = unicode2; unicodepos = unicode2pos; + unicode2 = NULL; unicode2pos = 0; """ * Comments should start with a capital letter and be prepended to the section they apply to * There should be spaces between arguments in compares (a == b) not (a==b) * Where does the name "...Encode121" originate ? * module internal APIs should use lower case names (you converted some of these to PyUnicode_...() -- this is normally reserved for APIs which are either marked as potential candidates for the public API or are very prominent in the code) One thing which I don't like about your API change is that you removed the Py_UNICODE*data, int size style arguments -- this makes it impossible to use the new APIs on non-Python data or data which is not available as Unicode object. Please separate the errors.c patch from this patch -- it seems totally unrelated to Unicode. Thanks.
msg36775 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-06-12 16:32
Logged In: YES user_id=89016 > * please don't place more than one C statement on one line > like in: > """ > + unicode = unicode2; unicodepos = > unicode2pos; > + unicode2 = NULL; unicode2pos = 0; > """ OK, done! > * Comments should start with a capital letter and be > prepended > to the section they apply to Fixed! > * There should be spaces between arguments in compares > (a == b) not (a==b) Fixed! > * Where does the name "...Encode121" originate ? encode one-to-one, it implements both ASCII and latin-1 encoding. > * module internal APIs should use lower case names (you > converted some of these to PyUnicode_...() -- this is > normally reserved for APIs which are either marked as > potential candidates for the public API or are very > prominent in the code) Which ones? I introduced a new function for every old one, that had a "const char errors" argument, and a few new ones in codecs.h, of those PyCodec_EncodeHandlerForObject is vital, because it is used to map for old string arguments to the new function objects. PyCodec_RaiseEncodeErrors can be used in the encoder implementation to raise an encode error, but it could be made static in unicodeobject.h so only those encoders implemented there have access to it. > One thing which I don't like about your API change is that > you removed the Py_UNICODEdata, int size style arguments > -- > this makes it impossible to use the new APIs on non-Python > data or data which is not available as Unicode object. I look through the code and found no situation where the Py_UNICODE/int version is really used and having two (PyObject )s (the original and the replacement string), instead of UNICODE/int and PyObject made the implementation a little easier, but I can fix that. > Please separate the errors.c patch from this patch -- it > seems totally unrelated to Unicode. PyCodec_RaiseEncodeErrors uses this the have a \Uxxxx with four hex digits. I removed it. I'll upload a revised patch as soon as it's done.
msg36776 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-06-12 16:56
Logged In: YES user_id=89016 > One thing which I don't like about your API change is that > you removed the Py_UNICODEdata, int size style arguments > -- > this makes it impossible to use the new APIs on non-Python > data or data which is not available as Unicode object. Another problem is, that the callback requires a Python object, so in the PyObject version, the refcount is incref'd and the object is passed to the callback. The Py_UNICODE*/int version would have to create a new Unicode object from the data.
msg36777 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-06-12 18:00
Logged In: YES user_id=38388 About the Py_UNICODE*data, int size APIs: Ok, point taken. In general, I think we ought to keep the callback feature as open as possible, so passing in pointers and sizes would not be very useful. BTW, could you summarize how the callback works in a few lines ? About _Encode121: I'd name this _EncodeUCS1 since that's what it is ;-) About the new functions: I was referring to the new static functions which you gave PyUnicode_... names. If these are not supposed to turn into non-static functions, I'd rather have them use lower case names (since that's how the Python internals work too -- most of the times).
msg36778 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-06-12 18:59
Logged In: YES user_id=89016 How the callbacks work: A PyObject * named errors is passed in. This may by NULL, Py_None, 'strict', u'strict', 'ignore', u'ignore', 'replace', u'replace' or a callable object. PyCodec_EncodeHandlerForObject maps all of these objects to one of the three builtin error callbacks PyCodec_RaiseEncodeErrors (raises an exception), PyCodec_IgnoreEncodeErrors (returns an empty replacement string, in effect ignoring the error), PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode replacement character to signify to the encoder that it should choose a suitable replacement character) or directly returns errors if it is a callable object. When an unencodable character is encounterd the error handling callback will be called with the encoding name, the original unicode object and the error position and must return a unicode object that will be encoded instead of the offending character (or the callback may of course raise an exception). U+FFFD characters in the replacement string will be replaced with a character that the encoder chooses ('?' in all cases). The implementation of the loop through the string is done in the following way. A stack with two strings is kept and the loop always encodes a character from the string at the stacktop. If an error is encountered and the stack has only one entry (during encoding of the original string) the callback is called and the unicode object returned is pushed on the stack, so the encoding continues with the replacement string. If the stack has two entries when an error is encountered, the replacement string itself has an unencodable character and a normal exception raised. When the encoder has reached the end of it's current string there are two possibilities: when the stack contains two entries, this was the replacement string, so the replacement string will be poppep from the stack and encoding continues with the next character from the original string. If the stack had only one entry, encoding is finished. (I hope that's enough explanation of the API and implementation) I have renamed the static ...121 function to all lowercase names. BTW, I guess PyUnicode_EncodeUnicodeEscape could be reimplemented as PyUnicode_EncodeASCII with a \uxxxx replacement callback. PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors, PyCodec_ReplaceEncodeErrors are globally visible because they have to be available in _codecsmodule.c to wrap them as Python function objects, but they can't be implemented in _codecsmodule, because they need to be available to the encoders in unicodeobject.c (through PyCodec_EncodeHandlerForObject), but importing the codecs module might result in an endless recursion, because importing a module requires unpickling of the bytecode, which might require decoding utf8, which ... (but this will only happen, if we implement the same mechanism for the decoding API) I have not touched PyUnicode_TranslateCharmap yet, should this function also support error callbacks? Why would one want the insert None into the mapping to call the callback? A remaining problem is how to implement decoding error callbacks. In Python 2.1 encoding and decoding errors are handled in the same way with a string value. But with callbacks it doesn't make sense to use the same callback for encoding and decoding (like codecs.StreamReaderWriter and codecs.StreamRecoder do). Decoding callbacks have a different API. Which arguments should be passed to the decoding callback, and what is the decoding callback supposed to do?
msg36779 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-06-12 19:18
Logged In: YES user_id=89016 One additional note: It is vital that errors is an assignable attribute of the StreamWriter. Consider the XML example: For writing an XML DOM tree one StreamWriter object is used. When a text node is written, the error handling has to be set to codecs.xmlreplace_encode_errors, but inside a comment or processing instruction replacing unencodable characters with charrefs is not possible, so here codecs.raise_encode_errors should be used (or better a custom error handler that raises an error that says "sorry, you can't have unencodable characters inside a comment") BTW, should we continue the discussion in the i18n SIG mailing list? An email program is much more comfortable than a HTML textarea! ;)
msg36780 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-06-13 08:05
Logged In: YES user_id=38388 > How the callbacks work: > > A PyObject * named errors is passed in. This may by NULL, > Py_None, 'strict', u'strict', 'ignore', u'ignore', > 'replace', u'replace' or a callable object. > PyCodec_EncodeHandlerForObject maps all of these objects to > one of the three builtin error callbacks > PyCodec_RaiseEncodeErrors (raises an exception), > PyCodec_IgnoreEncodeErrors (returns an empty replacement > string, in effect ignoring the error), > PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode > replacement character to signify to the encoder that it > should choose a suitable replacement character) or directly > returns errors if it is a callable object. When an > unencodable character is encounterd the error handling > callback will be called with the encoding name, the original > unicode object and the error position and must return a > unicode object that will be encoded instead of the offending > character (or the callback may of course raise an > exception). U+FFFD characters in the replacement string will > be replaced with a character that the encoder chooses ('?' > in all cases). Nice. > The implementation of the loop through the string is done in > the following way. A stack with two strings is kept and the > loop always encodes a character from the string at the > stacktop. If an error is encountered and the stack has only > one entry (during encoding of the original string) the > callback is called and the unicode object returned is pushed > on the stack, so the encoding continues with the replacement > string. If the stack has two entries when an error is > encountered, the replacement string itself has an > unencodable character and a normal exception raised. When > the encoder has reached the end of it's current string there > are two possibilities: when the stack contains two entries, > this was the replacement string, so the replacement string > will be poppep from the stack and encoding continues with > the next character from the original string. If the stack > had only one entry, encoding is finished. Very elegant solution ! > (I hope that's enough explanation of the API and implementation) Could you add these docs to the Misc/unicode.txt file ? I will eventually take that file and turn it into a PEP which will then serve as general documentation for these things. > I have renamed the static ...121 function to all lowercase > names. Ok. > BTW, I guess PyUnicode_EncodeUnicodeEscape could be > reimplemented as PyUnicode_EncodeASCII with a \uxxxx > replacement callback. Hmm, wouldn't that result in a slowdown ? If so, I'd rather leave the special encoder in place, since it is being used a lot in Python and probably some applications too. > PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors, > PyCodec_ReplaceEncodeErrors are globally visible because > they have to be available in _codecsmodule.c to wrap them as > Python function objects, but they can't be implemented in > _codecsmodule, because they need to be available to the > encoders in unicodeobject.c (through > PyCodec_EncodeHandlerForObject), but importing the codecs > module might result in an endless recursion, because > importing a module requires unpickling of the bytecode, > which might require decoding utf8, which ... (but this will > only happen, if we implement the same mechanism for the > decoding API) I think that codecs.c is the right place for these APIs. _codecsmodule.c is only meant as Python access wrapper for the internal codecs and nothing more. One thing I noted about the callbacks: they assume that they will always get Unicode objects as input. This is certainly not true in the general case (it is for the codecs you touch in the patch). I think it would be worthwhile to rename the callbacks to include "Unicode" somewhere, e.g. PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but then it points out the application field of the callback rather well. Same for the callbacks exposed through the _codecsmodule. > I have not touched PyUnicode_TranslateCharmap yet, > should this function also support error callbacks? Why would > one want the insert None into the mapping to call the callback? 1. Yes. 2. The user may want to e.g. restrict usage of certain character ranges. In this case the codec would be used to verify the input and an exception would indeed be useful (e.g. say you want to restrict input to Hangul + ASCII). > A remaining problem is how to implement decoding error > callbacks. In Python 2.1 encoding and decoding errors are > handled in the same way with a string value. But with > callbacks it doesn't make sense to use the same callback for > encoding and decoding (like codecs.StreamReaderWriter and > codecs.StreamRecoder do). Decoding callbacks have a > different API. Which arguments should be passed to the > decoding callback, and what is the decoding callback > supposed to do? I'd suggest adding another set of PyCodec_UnicodeDecode...() APIs for this. We'd then have to augment the base classes of the StreamCodecs to provide two attributes for .errors with a fallback solution for the string case (i.s. "strict" can still be used for both directions). > One additional note: It is vital that errors is an > assignable attribute of the StreamWriter. It is already ! > Consider the XML example: For writing an XML DOM tree one > StreamWriter object is used. When a text node is written, > the error handling has to be set to > codecs.xmlreplace_encode_errors, but inside a comment or > processing instruction replacing unencodable characters with > charrefs is not possible, so here codecs.raise_encode_errors > should be used (or better a custom error handler that raises > an error that says "sorry, you can't have unencodable > characters inside a comment") Sure. > BTW, should we continue the discussion in the i18n SIG > mailing list? An email program is much more comfortable than > a HTML textarea! ;) I'd rather keep the discussions on this patch here -- forking it off to the i18n sig will make it very hard to follow up on it. (This HTML area is indeed damn small ;-)
msg36781 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-06-13 13:57
Logged In: YES user_id=89016 > > [...] > > raise an exception). U+FFFD characters in the replacement > > string will be replaced with a character that the encoder > > chooses ('?' in all cases). > > Nice. But the special casing of U+FFFD makes the interface somewhat less clean than it could be. It was only done to be 100% backwards compatible. With the original "replace" error handling the codec chose the replacement character. But as far as I can tell none of the codecs uses anything other than '?', so I guess we could change the replace handler to always return u'?'. This would make the implementation a little bit simpler, but the explanation of the callback feature a lot simpler. And if you still want to handle an unencodable U+FFFD, you can write a special callback for that, e.g. def FFFDreplace(enc, uni, pos): if uni[pos] == "\ufffd": return u"?" else: raise UnicodeError(...) > > The implementation of the loop through the string is done > > in the following way. A stack with two strings is kept > > and the loop always encodes a character from the string > > at the stacktop. If an error is encountered and the stack > > has only one entry (during encoding of the original string) > > the callback is called and the unicode object returned is > > pushed on the stack, so the encoding continues with the > > replacement string. If the stack has two entries when an > > error is encountered, the replacement string itself has > > an unencodable character and a normal exception raised. > > When the encoder has reached the end of it's current string > > there are two possibilities: when the stack contains two > > entries, this was the replacement string, so the replacement > > string will be poppep from the stack and encoding continues > > with the next character from the original string. If the > > stack had only one entry, encoding is finished. > > Very elegant solution ! I'll put it as a comment in the source. > > (I hope that's enough explanation of the API and > implementation) > > Could you add these docs to the Misc/unicode.txt file ? I > will eventually take that file and turn it into a PEP which > will then serve as general documentation for these things. I could, but first we should work out how the decoding callback API will work. > > I have renamed the static ...121 function to all lowercase > > names. > > Ok. > > > BTW, I guess PyUnicode_EncodeUnicodeEscape could be > > reimplemented as PyUnicode_EncodeASCII with a \uxxxx > > replacement callback. > > Hmm, wouldn't that result in a slowdown ? If so, I'd rather > leave the special encoder in place, since it is being used a > lot in Python and probably some applications too. It would be a slowdown. But callbacks open many possiblities. For example: Why can't I print u"gürk"? is probably one of the most frequently asked questions in comp.lang.python. For printing Unicode stuff, print could be extended the use an error handling callback for Unicode strings (or objects where __str__ or tp_str returns a Unicode object) instead of using str() which always returns an 8bit string and uses strict encoding. There might even be a sys.setprintencodehandler()/sys.getprintencodehandler() > [...] > I think it would be worthwhile to rename the callbacks to > include "Unicode" somewhere, e.g. > PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but > then it points out the application field of the callback > rather well. Same for the callbacks exposed through the > _codecsmodule. OK, done (and PyCodec_XMLCharRefReplaceUnicodeEncodeErrors really is a long name ;)) > > I have not touched PyUnicode_TranslateCharmap yet, > > should this function also support error callbacks? Why > > would one want the insert None into the mapping to call > > the callback? > > 1. Yes. > 2. The user may want to e.g. restrict usage of certain > character ranges. In this case the codec would be used to > verify the input and an exception would indeed be useful > (e.g. say you want to restrict input to Hangul + ASCII). OK, do we want TranslateCharmap to work exactly like encoding, i.e. in case of an error should the returned replacement string again be mapped through the translation mapping or should it be copied to the output directly? The former would be more in line with encoding, but IMHO the latter would be much more useful. BTW, when I implement it I can implement patch #403100 ("Multicharacter replacements in PyUnicode_TranslateCharmap") along the way. Should the old TranslateCharmap map to the new TranslateCharmapEx and inherit the "multicharacter replacement" feature, or should I leave it as it is? > > A remaining problem is how to implement decoding error > > callbacks. In Python 2.1 encoding and decoding errors are > > handled in the same way with a string value. But with > > callbacks it doesn't make sense to use the same callback > > for encoding and decoding (like codecs.StreamReaderWriter > > and codecs.StreamRecoder do). Decoding callbacks have a > > different API. Which arguments should be passed to the > > decoding callback, and what is the decoding callback > > supposed to do? > > I'd suggest adding another set of PyCodec_UnicodeDecode... () > APIs for this. We'd then have to augment the base classes of > the StreamCodecs to provide two attributes for .errors with > a fallback solution for the string case (i.s. "strict" can > still be used for both directions). Sounds good. Now what is the decoding callback supposed to do? I guess it will be called in the same way as the encoding callback, i.e. with encoding name, original string and position of the error. It might returns a Unicode string (i.e. an object of the decoding target type), that will be emitted from the codec instead of the one offending byte. Or it might return a tuple with replacement Unicode object and a resynchronisation offset, i.e. returning (u"?", 1) means emit a '?' and skip the offending character. But to make the offset really useful the callback has to know something about the encoding, perhaps the codec should be allowed to pass an additional state object to the callback? Maybe the same should be added to the encoding callbacks to? Maybe the encoding callback should be able to tell the encoder if the replacement returned should be reencoded (in which case it's a Unicode object), or directly emitted (in which case it's an 8bit string)? > > One additional note: It is vital that errors is an > > assignable attribute of the StreamWriter. > > It is already ! I know, but IMHO it should be documented that an assignable errors attribute must be supported as part of the official codec API. Misc/unicode.txt is not clear on that: """ It is not required by the Unicode implementation to use these base classes, only the interfaces must match; this allows writing Codecs as extension types. """
msg36782 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-06-13 15:49
Logged In: YES user_id=89016 Guido van Rossum wrote in python-dev: > True, the "codec" pattern can be used for other > encodings than Unicode. But it seems to me that the > entire codecs architecture is rather strongly geared > towards en/decoding Unicode, and it's not clear > how well other codecs fit in this pattern (e.g. I > noticed that all the non-Unicode codecs ignore the > error handling parameter or assert that > it is set to 'strict'). I noticed that too. asserting that errors=='strict' would mean that the encoder is not able to deal in any other way with unencodable stuff than by raising an error. But that is not the problem here, because for zlib, base64, quopri, hex and uu encoding there can be no unencodable characters. The encoders can simply ignore the errors parameter. Should I remove the asserts from those codecs and change the docstrings accordingly, or will this be done separately?
msg36783 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-06-13 17:00
Logged In: YES user_id=38388 On your comment about the non-Unicode codecs: let's keep this separated from the current patch. Don't have much time today. I'll comment on the other things tomorrow.
msg36784 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-06-22 20:51
Logged In: YES user_id=38388 Sorry to keep you waiting, Walter. I will look into this again next week -- this week was way too busy...
msg36785 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-07-10 12:29
Logged In: YES user_id=38388 Ok, here we go... > > > raise an exception). U+FFFD characters in the > replacement > > > string will be replaced with a character that the > encoder > > > chooses ('?' in all cases). > > > > Nice. > > But the special casing of U+FFFD makes the interface > somewhat > less clean than it could be. It was only done to be 100% > backwards compatible. With the original "replace" > error > handling the codec chose the replacement character. But as > far as I can tell none of the codecs uses anything other > than '?', True. > so I guess we could change the replace handler > to always return u'?'. This would make the implementation a > little bit simpler, but the explanation of the callback > feature a lot simpler. Go for it. > And if you still want to handle > an unencodable U+FFFD, you can write a special callback for > that, e.g. > > def FFFDreplace(enc, uni, pos): > if uni[pos] == "\ufffd": > return u"?" > else: > raise UnicodeError(...) > > > ...docs... > > > > Could you add these docs to the Misc/unicode.txt file ? I > > will eventually take that file and turn it into a PEP > which > > will then serve as general documentation for these things. > > I could, but first we should work out how the decoding > callback API will work. Ok. BTW, Barry Warsaw already did the work of converting the unicode.txt to PEP 100, so the docs should eventually go there. > > > BTW, I guess PyUnicode_EncodeUnicodeEscape could be > > > reimplemented as PyUnicode_EncodeASCII with a \uxxxx > > > replacement callback. > > > > Hmm, wouldn't that result in a slowdown ? If so, I'd > rather > > leave the special encoder in place, since it is being > used a > > lot in Python and probably some applications too. > > It would be a slowdown. But callbacks open many > possiblities. True, but in this case I believe that we should stick with the native implementation for "unicode-escape". Having a standard callback error handler which does the \uXXXX replacement would be nice to have though, since this would also be usable with lots of other codecs (e.g. all the code page ones). > For example: > > Why can't I print u"gürk"? > > is probably one of the most frequently asked questions in > comp.lang.python. For printing Unicode stuff, print could be > extended the use an error handling callback for Unicode > strings (or objects where __str__ or tp_str returns a > Unicode object) instead of using str() which always returns > an 8bit string and uses strict encoding. There might even > be a > sys.setprintencodehandler()/sys.getprintencodehandler() There already is a print callback in Python (forgot the name of the hook though), so this should be possible by providing the encoding logic in the hook. > > > I have not touched PyUnicode_TranslateCharmap yet, > > > should this function also support error callbacks? Why > > > would one want the insert None into the mapping to > call > > > the callback? > > > > 1. Yes. > > 2. The user may want to e.g. restrict usage of certain > > character ranges. In this case the codec would be used to > > verify the input and an exception would indeed be useful > > (e.g. say you want to restrict input to Hangul + ASCII). > > OK, do we want TranslateCharmap to work exactly like > encoding, > i.e. in case of an error should the returned replacement > string again be mapped through the translation mapping or > should it be copied to the output directly? The former would > be more in line with encoding, but IMHO the latter would > be much more useful. It's better to take the second approach (copy the callback output directly to the output string) to avoid endless recursion and other pitfalls. I suppose this will also simplify the implementation somewhat. > BTW, when I implement it I can implement patch #403100 > ("Multicharacter replacements in > PyUnicode_TranslateCharmap") > along the way. I've seen it; will comment on it later. > Should the old TranslateCharmap map to the new > TranslateCharmapEx > and inherit the "multicharacter replacement" feature, > or > should I leave it as it is? If possible, please also add the multichar replacement to the old API. I think it is very useful and since the old APIs work on raw buffers it would be a benefit to have the functionality in the old implementation too. [Decoding error callbacks] > > > A remaining problem is how to implement decoding error > > > callbacks. In Python 2.1 encoding and decoding errors > are > > > handled in the same way with a string value. But with > > > callbacks it doesn't make sense to use the same > callback > > > for encoding and decoding (like > codecs.StreamReaderWriter > > > and codecs.StreamRecoder do). Decoding callbacks have > a > > > different API. Which arguments should be passed to the > > > decoding callback, and what is the decoding callback > > > supposed to do? > > > > I'd suggest adding another set of PyCodec_UnicodeDecode... > () > > APIs for this. We'd then have to augment the base classes > of > > the StreamCodecs to provide two attributes for .errors > with > > a fallback solution for the string case (i.s. "strict" > can > > still be used for both directions). > > Sounds good. Now what is the decoding callback supposed to > do? > I guess it will be called in the same way as the encoding > callback, i.e. with encoding name, original string and > position of the error. It might returns a Unicode string > (i.e. an object of the decoding target type), that will be > emitted from the codec instead of the one offending byte. Or > it might return a tuple with replacement Unicode object and > a resynchronisation offset, i.e. returning (u"?", 1) > means > emit a '?' and skip the offending character. But to make > the offset really useful the callback has to know something > about the encoding, perhaps the codec should be allowed to > pass an additional state object to the callback? > > Maybe the same should be added to the encoding callbacks to? > Maybe the encoding callback should be able to tell the > encoder if the replacement returned should be reencoded > (in which case it's a Unicode object), or directly emitted > (in which case it's an 8bit string)? I like the idea of having an optional state object (basically this should be a codec-defined arbitrary Python object) which then allow the callback to apply additional tricks. The object should be documented to be modifyable in place (simplifies the interface). About the return value: I'd suggest to always use the same tuple interface, e.g. callback(encoding, input_data, input_position, state) -> (output_to_be_appended, new_input_position) (I think it's better to use absolute values for the position rather than offsets.) Perhaps the encoding callbacks should use the same interface... what do you think ? > > > One additional note: It is vital that errors is an > > > assignable attribute of the StreamWriter. > > > > It is already ! > > I know, but IMHO it should be documented that an assignable > errors attribute must be supported as part of the official > codec API. > > Misc/unicode.txt is not clear on that: > """ > It is not required by the Unicode implementation to use > these base classes, only the interfaces must match; this > allows writing Codecs as extension types. > """ Good point. I'll add that to the PEP 100.
msg36786 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-07-12 11:03
Logged In: YES user_id=89016 > > [...] > > so I guess we could change the replace handler > > to always return u'?'. This would make the > > implementation a little bit simpler, but the > > explanation of the callback feature a lot > > simpler. > > Go for it. OK, done! > [...] > > > Could you add these docs to the Misc/unicode.txt > > > file ? I will eventually take that file and turn > > > it into a PEP which will then serve as general > > > documentation for these things. > > > > I could, but first we should work out how the > > decoding callback API will work. > > Ok. BTW, Barry Warsaw already did the work of converting > the unicode.txt to PEP 100, so the docs should eventually > go there. OK. I guess it would be best to do this when everything is finished. > > > > BTW, I guess PyUnicode_EncodeUnicodeEscape > > > > could be reimplemented as PyUnicode_EncodeASCII > > > > with \uxxxx replacement callback. > > > > > > Hmm, wouldn't that result in a slowdown ? If so, > > > I'd rather leave the special encoder in place, > > > since it is being used a lot in Python and > > > probably some applications too. > > > > It would be a slowdown. But callbacks open many > > possiblities. > > True, but in this case I believe that we should stick with > the native implementation for "unicode-escape". Having > a standard callback error handler which does the \uXXXX > replacement would be nice to have though, since this would > also be usable with lots of other codecs (e.g. all the > code page ones). OK, done, now there's a PyCodec_EscapeReplaceUnicodeEncodeErrors/ codecs.escapereplace_unicodeencode_errors that uses \u (or \U if x>0xffff (with a wide build of Python)). > > For example: > > > > Why can't I print u"gürk"? > > > > is probably one of the most frequently asked > > questions in comp.lang.python. For printing > > Unicode stuff, print could be extended the use an > > error handling callback for Unicode strings (or > > objects where __str__ or tp_str returns a Unicode > > object) instead of using str() which always > > returns an 8bit string and uses strict encoding. > > There might even be a > > sys.setprintencodehandler()/sys.getprintencodehandler () > > There already is a print callback in Python (forgot the > name of the hook though), so this should be possible by > providing the encoding logic in the hook. True: sys.displayhook > [...] > > Should the old TranslateCharmap map to the new > > TranslateCharmapEx and inherit the > > "multicharacter replacement" feature, > > or should I leave it as it is? > > If possible, please also add the multichar replacement > to the old API. I think it is very useful and since the > old APIs work on raw buffers it would be a benefit to have > the functionality in the old implementation too. OK! I will try to find the time to implement that in the next days. > [Decoding error callbacks] > > About the return value: > > I'd suggest to always use the same tuple interface, e.g. > > callback(encoding, input_data, input_position, state) -> > (output_to_be_appended, new_input_position) > > (I think it's better to use absolute values for the > position rather than offsets.) > > Perhaps the encoding callbacks should use the same > interface... what do you think ? This would make the callback feature hypergeneric and a little slower, because tuples have to be created, but it (almost) unifies the encoding and decoding API. ("almost" because, for the encoder output_to_be_appended will be reencoded, for the decoder it will simply be appended.), so I'm for it. I implemented this and changed the encoders to only lookup the error handler on the first error. The UCS1 encoder now no longer uses the two-item stack strategy. (This strategy only makes sense for those encoder where the encoding itself is much more complicated than the looping/callback etc.) So now memory overflow tests are only done, when an unencodable error occurs, so now the UCS1 encoder should be as fast as it was without error callbacks. Do we want to enforce new_input_position>input_position, or should jumping back be allowed? > > > > One additional note: It is vital that errors > > > > is an assignable attribute of the StreamWriter. > > > > > > It is already ! > > > > I know, but IMHO it should be documented that an > > assignable errors attribute must be supported > > as part of the official codec API. > > > > Misc/unicode.txt is not clear on that: > > """ > > It is not required by the Unicode implementation > > to use these base classes, only the interfaces must > > match; this allows writing Codecs as extension types. > > """ > > Good point. I'll add that to the PEP 100. OK. Here's is the current todo list: 1. implement a new TranslateCharmap and fix the old. 2. New encoding API for string objects too. 3. Decoding 4. Documentation 5. Test cases I'm thinking about a different strategy for implementing callbacks (see http://mail.python.org/pipermail/i18n-sig/2001- July/001262.html) We coould have a error handler registry, which maps names to error handlers, then it would be possible to keep the errors argument as "const char " instead of "PyObject ". Currently PyCodec_UnicodeEncodeHandlerForObject is a backwards compatibility hack that will never go away, because it's always more convenient to type u"...".encode("...", "strict") instead of import codecs u"...".encode("...", codecs.raise_encode_errors) But with an error handler registry this function would become the official lookup method for error handlers. (PyCodec_LookupUnicodeEncodeErrorHandler?) Python code would look like this: --- def xmlreplace(encoding, unicode, pos, state): return (u"&#%d;" % ord(uni[pos]), pos+1) import codec codec.registerError("xmlreplace",xmlreplace) --- and then the following call can be made: u"äöü".encode("ascii", "xmlreplace") As soon as the first error is encountered, the encoder uses its builtin error handling method if it recognizes the name ("strict", "replace" or "ignore") or looks up the error handling function in the registry if it doesn't. In this way the speed for the backwards compatible features is the same as before and "const char error" can be kept as the parameter to all encoding functions. For speed common error handling names could even be implemented in the encoder itself. But for special one-shot error handlers, it might still be useful to pass the error handler directly, so maybe we should leave error as PyObject , but implement the registry anyway?
msg36787 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-07-13 11:26
Logged In: YES user_id=38388 > > > > > BTW, I guess PyUnicode_EncodeUnicodeEscape > > > > > could be reimplemented as PyUnicode_EncodeASCII > > > > > with \uxxxx replacement callback. > > > > > > > > Hmm, wouldn't that result in a slowdown ? If so, > > > > I'd rather leave the special encoder in place, > > > > since it is being used a lot in Python and > > > > probably some applications too. > > > > > > It would be a slowdown. But callbacks open many > > > possiblities. > > > > True, but in this case I believe that we should stick with > > the native implementation for "unicode-escape". Having > > a standard callback error handler which does the \uXXXX > > replacement would be nice to have though, since this would > > also be usable with lots of other codecs (e.g. all the > > code page ones). > > OK, done, now there's a > PyCodec_EscapeReplaceUnicodeEncodeErrors/ > codecs.escapereplace_unicodeencode_errors > that uses \u (or \U if x>0xffff (with a wide build > of Python)). Great ! > > [...] > > > Should the old TranslateCharmap map to the new > > > TranslateCharmapEx and inherit the > > > "multicharacter replacement" feature, > > > or should I leave it as it is? > > > > If possible, please also add the multichar replacement > > to the old API. I think it is very useful and since the > > old APIs work on raw buffers it would be a benefit to have > > the functionality in the old implementation too. > > OK! I will try to find the time to implement that in the > next days. Good. > > [Decoding error callbacks] > > > > About the return value: > > > > I'd suggest to always use the same tuple interface, e.g. > > > > callback(encoding, input_data, input_position, > state) -> > > (output_to_be_appended, new_input_position) > > > > (I think it's better to use absolute values for the > > position rather than offsets.) > > > > Perhaps the encoding callbacks should use the same > > interface... what do you think ? > > This would make the callback feature hypergeneric and a > little slower, because tuples have to be created, but it > (almost) unifies the encoding and decoding API. ("almost" > because, for the encoder output_to_be_appended will be > reencoded, for the decoder it will simply be appended.), > so I'm for it. That's the point. Note that I don't think the tuple creation will hurt much (see the make_tuple() API in codecs.c) since small tuples are cached by Python internally. > I implemented this and changed the encoders to only > lookup the error handler on the first error. The UCS1 > encoder now no longer uses the two-item stack strategy. > (This strategy only makes sense for those encoder where > the encoding itself is much more complicated than the > looping/callback etc.) So now memory overflow tests are > only done, when an unencodable error occurs, so now the > UCS1 encoder should be as fast as it was without > error callbacks. > > Do we want to enforce new_input_position>input_position, > or should jumping back be allowed? No; moving backwards should be allowed (this may be useful in order to resynchronize with the input data). > Here's is the current todo list: > 1. implement a new TranslateCharmap and fix the old. > 2. New encoding API for string objects too. > 3. Decoding > 4. Documentation > 5. Test cases > > I'm thinking about a different strategy for implementing > callbacks > (see http://mail.python.org/pipermail/i18n-sig/2001- > July/001262.html) > > We coould have a error handler registry, which maps names > to error handlers, then it would be possible to keep the > errors argument as "const char " instead of "PyObject ". > Currently PyCodec_UnicodeEncodeHandlerForObject is a > backwards compatibility hack that will never go away, > because > it's always more convenient to type > u"...".encode("...", "strict") > instead of > import codecs > u"...".encode("...", codecs.raise_encode_errors) > > But with an error handler registry this function would > become the official lookup method for error handlers. > (PyCodec_LookupUnicodeEncodeErrorHandler?) > Python code would look like this: > --- > def xmlreplace(encoding, unicode, pos, state): > return (u"&#%d;" % ord(uni[pos]), pos+1) > > import codec > > codec.registerError("xmlreplace",xmlreplace) > --- > and then the following call can be made: > u"äöü".encode("ascii", "xmlreplace") > As soon as the first error is encountered, the encoder uses > its builtin error handling method if it recognizes the name > ("strict", "replace" or "ignore") or looks up the error > handling function in the registry if it doesn't. In this way > the speed for the backwards compatible features is the same > as before and "const char error" can be kept as the > parameter to all encoding functions. For speed common error > handling names could even be implemented in the encoder > itself. > > But for special one-shot error handlers, it might still be > useful to pass the error handler directly, so maybe we > should leave error as PyObject , but implement the > registry anyway? Good idea ! One minor nit: codecs.registerError() should be named codecs.register_errorhandler() to be more inline with the Python coding style guide.
msg36788 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-07-23 17:03
Logged In: YES user_id=89016 New version of the patch with the error handling callback registry. > > OK, done, now there's a > > PyCodec_EscapeReplaceUnicodeEncodeErrors/ > > codecs.escapereplace_unicodeencode_errors > > that uses \u (or \U if x>0xffff (with a wide build > > of Python)). > > Great! Now PyCodec_EscapeReplaceUnicodeEncodeErrors uses \x in addition to \u and \U where appropriate. > > [...] > > But for special one-shot error handlers, it might still be > > useful to pass the error handler directly, so maybe we > > should leave error as PyObject *, but implement the > > registry anyway? > > Good idea ! > > One minor nit: codecs.registerError() should be named > codecs.register_errorhandler() to be more inline with > the Python coding style guide. OK, but these function are specific to unicode encoding, so now the functions are called: codecs.register_unicodeencodeerrorhandler codecs.lookup_unicodeencodeerrorhandler Now all callbacks (including the new ones: "xmlcharrefreplace" and "escapereplace") are registered in the codecs.c/_PyCodecRegistry_Init so using them is really simple: u"gürk".encode("ascii", "xmlcharrefreplace")
msg36789 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2001-07-27 03:55
Logged In: YES user_id=89016 Changing the decoding API is done now. There are new functions codec.register_unicodedecodeerrorhandler and codec.lookup_unicodedecodeerrorhandler. Only the standard handlers for 'strict', 'ignore' and 'replace' are preregistered. There may be many reasons for decoding errors in the byte string, so I added an additional argument to the decoding API: reason, which gives the reason for the failure, e.g.: >>> "\\U1111111".decode("unicode_escape") Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 8: truncated \UXXXXXXXX escape >>> "\\U11111111".decode("unicode_escape") Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 9: illegal Unicode character For symmetry I added this to the encoding API too: >>> u"\xff".encode("ascii") Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeError: encoding 'ascii' can't decode byte 0xff in position 0: ordinal not in range(128) The parameters passed to the callbacks now are: encoding, unicode, position, reason, state. The encoding and decoding API for strings has been adapted too, so now the new API should be usable everywhere: >>> unicode("a\xffb\xffc", "ascii", ... lambda enc, uni, pos, rea, sta: (u"<?>", pos+1)) u'a<?>b<?>c' >>> "a\xffb\xffc".decode("ascii", ... lambda enc, uni, pos, rea, sta: (u"<?>", pos+1)) u'a<?>b<?>c' I had a problem with the decoding API: all the functions in _codecsmodule.c used the t# format specifier. I changed that to O! with &PyString_Type, because otherwise we would have the problem that the decoding API would must pass buffer object around instead of strings, and the callback would have to call str() on the buffer anyway to access a specific character, so this wouldn't be any faster than calling str() on the buffer before decoding. It seems that buffers aren't used anyway. I changed all the old function to call the new ones so bugfixes don't have to be done in two places. There are two exceptions: I didn't change PyString_AsEncodedString and PyString_AsDecodedString because they are documented as deprecated anyway (although they are called in a few spots) This means that I duplicated part of their functionality in PyString_AsEncodedObjectEx and PyString_AsDecodedObjectEx. There are still a few spots that call the old API: E.g. PyString_Format still calls PyUnicode_Decode (but with strict decoding) because it passes the rest of the format string to PyUnicode_Format when it encounters a Unicode object. Should we switch to the new API everywhere even if strict encoding/decoding is used? The size of this patch begins to scare me. I guess we need an extensive test script for all the new features and documentation. I hope you have time to do that, as I'll be busy with other projects in the next weeks. (BTW, I have't touched PyUnicode_TranslateCharmap yet.)
msg36790 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-08-16 10:53
Logged In: YES user_id=38388 I think we ought to summarize these changes in a PEP to get some more feedback and testing from others as well. I'll look into this after I'm back from vacation on the 10.09. Given the release schedule I am not sure whether this feature will make it into 2.2. The size of the patch is huge and probably needs a lot of testing first.
msg36791 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2001-09-20 10:38
Logged In: YES user_id=38388 I am postponing this patch until the PEP process has started. This feature won't make it into Python 2.2. Walter, you may want to reference this patch in the PEP.
msg36792 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-03-05 16:49
Logged In: YES user_id=38388 Walter, are you making any progress on the new scheme we discussed on the mailing list (adding an error handler registry much like the codec registry itself instead of trying to redo the complete codec API) ?
msg36793 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-03-07 01:29
Logged In: YES user_id=89016 I started from scratch, and the current state is this: Encoding mostly works (except that I haven't changed TranslateCharmap and EncodeDecimal yet) and most of the decoding stuff works (DecodeASCII and DecodeCharmap are still unchanged) and the decoding callback helper isn't optimized for the "builtin" names yet (i.e. it still calls the handler). For encoding the callback helper knows how to handle "strict", "replace", "ignore" and "xmlcharrefreplace" itself and won't call the callback. This should make the encoder fast enough. As callback name string comparison results are cached it might even be faster than the original. The patch so far didn't require any changes to unicodeobject.h, stringobject.h or stringobject.c
msg36794 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-03-07 23:09
Logged In: YES user_id=89016 I'm think about extending the API a little bit: Consider the following example: >>> "\\u1".decode("unicode-escape") Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 2: truncated \uXXXX escape The error message is a lie: Not the '1' in position 2 is the problem, but the complete truncated sequence '\\u1'. For this the decoder should pass a start and an end position to the handler. For encoding this would be useful too: Suppose I want to have an encoder that colors the unencodable character via an ANSI escape sequences. Then I could do the following: >>> import codecs >>> def color(enc, uni, pos, why, sta): ... return (u"\033[1m<%d>\033[0m" % ord(uni[pos]), pos+1) ... >>> codecs.register_unicodeencodeerrorhandler("color", color) >>> u"aäüöo".encode("ascii", "color") 'a\x1b[1m<228>\x1b[0m\x1b[1m<252>\x1b[0m\x1b[1m<246>\x1b [0mo' But here the sequences "\x1b[0m\x1b[1m" are not needed. To fix this problem the encoder could collect as many unencodable characters as possible and pass those to the error callback in one go (passing a start and end+1 position). This fixes the above problem and reduces the number of calls to the callback, so it should speed up the algorithms in case of custom encoding names. (And it makes the implementation very interesting ;)) What do you think?
msg36795 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-03-08 15:15
Logged In: YES user_id=38388 Sounds like a good idea. Please keep the encoder and decoder APIs symmetric, though, ie. add the slice information to both APIs. The slice should use the same format as Python's standard slices, that is left inclusive, right exclusive. I like the highlighting feature !
msg36796 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-03-08 17:31
Logged In: YES user_id=89016 What should replace do: Return u"?" or (end-start)*u"?"
msg36797 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-03-08 18:36
Logged In: YES user_id=38388 Hmm, whatever it takes to maintain backwards compatibility. Do you have an example ?
msg36798 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-03-15 17:06
Logged In: YES user_id=89016 For encoding it's always (end-start)*u"?": >>> u"ää".encode("ascii", "replace") '??' But for decoding, it is neither nor: >>> "\\Ux\\U".decode("unicode-escape", "replace") u'\ufffd\ufffd' i.e. a sequence of 5 illegal characters was replace by two replacement characters. This might mean that decoders can't collect all the illegal characters and call the callback once. They might have to call the callback for every single illegal byte sequence to get the old behaviour. (It seems that this patch would be much, much simpler, if we only change the encoders)
msg36799 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-03-15 17:19
Logged In: YES user_id=89016 So this means that the encoder can collect illegal characters and pass it to the callback. "replace" will replace this with (end-start)*u"?". Decoders don't collect all illegal byte sequences, but call the callback once for every byte sequence that has been found illegal and "replace" will replace it with u"?". Does this make sense?
msg36800 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-04-17 10:21
Logged In: YES user_id=89016 Another note: the patch will change the meaning of charmap encoding slightly: currently "replace" will put a ? into the output, even if ? is not in the mapping, i.e. codecs.charmap_encode(u"c", "replace", {ord("a"): ord ("b")}) will return ('?', 1). With the patch the above example will raise an exception. Off course with the patch many more replace characters can appear, so it is vital that for the replacement string the mapping is done. Is this semantic change OK? (I guess all of the existing codecs have a mapping ord("?")->ord("?"))
msg36801 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-04-17 19:40
Logged In: YES user_id=38388 Sorry for the late response. About the difference between encoding and decoding: you shouldn't just look at the case where you work with Unicode and strings, e.g. take the rot-13 codec which works on strings only or other codecs which translate objects into strings and vice-versa. Error handling has to be flexible enough to handle all these situations. Since the codecs know best how to handle the situations, I'd make this an implementation detail of the codec and leave the behaviour undefined in the general case. For the existing codecs, backward compatibility should be maintained, if at all possible. If the patch gets overly complicated because of this, we may have to provide a downgrade solution for this particular problem (I don't think replace is used in any computational context, though, since you can never be sure how many replacement character do get inserted, so the case may not be that realistic). Raising an exception for the charmap codec is the right way to go, IMHO. I would consider the current behaviour a bug. For new codecs, I think we should suggest that replace tries to collect as much illegal data as possible before invoking the error handler. The handler should be aware of the fact that it won't necessarily get all the broken data in one call. About the codec error handling registry: You seem to be using a Unicode specific approach here. I'd rather like to see a generic approach which uses the API we discussed earlier. Would that be possible ? In that case, the codec API should probably be called codecs.register_error('myhandler', myhandler). Does that make sense ? BTW, the patch which uses the callback registry does not seem to be available on this SF page (the last patch still converts the errors argument to a PyObject, which shouldn't be needed anymore with the new approach). Can you please upload your latest version ? Note that the highlighting codec would make a nice example for the new feature. Thanks.
msg36802 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-04-17 20:50
Logged In: YES user_id=89016 > About the difference between encoding > and decoding: you shouldn't just look > at the case where you work with Unicode > and strings, e.g. take the rot-13 codec > which works on strings only or other > codecs which translate objects into > strings and vice-versa. unicode.encode encodes to str and str.decode decodes to unicode, even for rot-13: >>> u"gürk".encode("rot13") 't\xfcex' >>> "gürk".decode("rot13") u't\xfcex' >>> u"gürk".decode("rot13") Traceback (most recent call last): File "<stdin>", line 1, in ? AttributeError: 'unicode' object has no attribute 'decode' >>> "gürk".encode("rot13") Traceback (most recent call last): File "<stdin>", line 1, in ? File "/home/walter/Python-current- readonly/dist/src/Lib/encodings/rot_13.py", line 18, in encode return codecs.charmap_encode(input,errors,encoding_map) UnicodeError: ASCII decoding error: ordinal not in range (128) Here the str is converted to unicode first, before encode is called, but the conversion to unicode fails. Is there an example where something else happens? > Error handling has to be flexible enough > to handle all these situations. Since > the codecs know best how to handle the > situations, I'd make this an implementation > detail of the codec and leave the > behaviour undefined in the general case. OK, but we should suggest, that for encoding unencodable characters are collected and for decoding seperate byte sequences that are considered broken by the codec are passed to the callback: i.e for decoding the handler will never get all broken data in one call, e.g. for "\\u30\\Uffffffff".decode("unicode-escape") the handler will be called twice (once for "\\u30" and "truncated \\u escape" as the reason and once for "\\Uffffffff" and "illegal character" as the reason.) > For the existing codecs, backward > compatibility should be maintained, > if at all possible. If the patch gets > overly complicated because of this, > we may have to provide a downgrade solution > for this particular problem (I don't think > replace is used in any computational context, > though, since you can never be sure how > many replacement character do get > inserted, so the case may not be > that realistic). > > Raising an exception for the charmap codec > is the right way to go, IMHO. I would > consider the current behaviour a bug. OK, this is implemented in PyUnicode_EncodeCharmap now, and collecting unencodable characters works too. I completely changed the implementation, because the stack approach would have gotten much more complicated when unencodable characters are collected. > For new codecs, I think we should > suggest that replace tries to collect > as much illegal data as possible before > invoking the error handler. The handler > should be aware of the fact that it > won't necessarily get all the broken > data in one call. OK for encoders, for decoders see above. > About the codec error handling > registry: You seem to be using a > Unicode specific approach here. > I'd rather like to see a generic > approach which uses the API > we discussed earlier. Would that be possible? The handlers in the registry are all Unicode specific. and they are different for encoding and for decoding. I renamed the function because of your comment from 2001-06-13 10:05 (which becomes exceedingly difficult to find on this long page! ;)). > In that case, the codec API should > probably be called > codecs.register_error('myhandler', myhandler). > > Does that make sense ? We could require that unique names are used for custom handlers, but for the standard handlers we do have name collisions. To prevent them, we could either remove them from the registry and require that the codec implements the error handling for those itself, or we could to some fiddling, so that u"üöä".encode("ascii", "replace") becomes u"üöä".encode("ascii", "unicodeencodereplace") behind the scenes. But I think two unicode specific registries are much simpler to handle. > BTW, the patch which uses the callback > registry does not seem to be available > on this SF page (the last patch still > converts the errors argument to a > PyObject, which shouldn't be needed > anymore with the new approach). > Can you please upload your > latest version? OK, I'll upload a preliminary version tomorrow. PyUnicode_EncodeDecimal and PyUnicode_TranslateCharmap are still missing, but otherwise the patch seems to be finished. All decoders work and the encoders collect unencodable characters and implement the handling of known callback handler names themselves. As PyUnicode_EncodeDecimal is only used by the int, long, float, and complex constructors, I'd love to get rid of the errors argument, but for completeness sake, I'll implement the callback functionality. > Note that the highlighting codec > would make a nice example > for the new feature. This could be part of the codec callback test script, which I've started to write. We could kill two birds with one stone here: 1. Test the implementation. 2. Document and advocate what is possible with the patch. Another idea: we could have as an example a decoding handler that relaxes the UTF-8 minimal encoding restriction, e.g. def relaxedutf8(enc, uni, startpos, endpos, reason, data): if uni[startpos:startpos+2] == u"\xc0\x80": return (u"\x00", startpos+2) else: raise UnicodeError(...)
msg36803 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-04-18 19:22
Logged In: YES user_id=89016 OK, here is the current version of the patch (diff7.txt). PyUnicode_EncodeDecimal and PyUnicode_TranslateCharmap are still missing.
msg36804 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-04-18 19:24
Logged In: YES user_id=89016 And here is the test script (test_codeccallbacks.py)
msg36805 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-04-20 15:34
Logged In: YES user_id=89016 A new idea for the interface between the codec and the callback: Maybe we could have new exception classes UnicodeEncodeError, UnicodeDecodeError and UnicodeTranslateError derived from UnicodeError. They have all the attributes that are passed as an argument tuple in the current version: string: the original string start: the start position of the unencodable characters/undecodable bytes end: the end position+1 of the unencodable characters/undecodable bytes. reason: the a string, that explains, why the encoding/decoding doesn't work. There is no data object, because when a codec wants to pass extended information to the callback it can do this via a derived class. It might be better to move these attributes to the base class UnicodeError, but this might have backwards compatibility problems. With this method we really can have one global registry for all callbacks, because for callback names that must work with encoding and decoding and translating (i.e. "strict", "replace" and "ignore"), the callback can check which type of exception was passed, so "replace" can e.g. look like this: def replace(exc): if isinstance(exc, UnicodeDecodeError): return ("?", exc.end) else: return (u"?"(exc.end-exc.start), exc.end) Another possibility would be to do the commucation callback->codec by assigning to attributes of the exception object. The resyncronisation position could even be preassigned to end, so the callback only needs to specify the replacement in most cases: def replace(exc): if isinstance(exc, UnicodeDecodeError): exc.replacement = "?" else: exc.replacement = u"?"(exc.end-exc.start) As many of the assignments can now be done on the C level without having to allocate Python objects (except for the replacement string and the reason), this version might even be faster, especially if we allow the codec to reuse the exception object for the next call to the callback. Does this make sense, or is this to fancy?
msg36806 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-05-01 17:57
Logged In: YES user_id=89016 OK, PyUnicode_EncodeDecimal is done (diff8.txt), but as the errors argument can't be accessed from Python code, there's not much testing for this.
msg36807 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-05-16 19:06
Logged In: YES user_id=89016 OK, PyUnicode_TranslateCharmap is finished too. As the errors argument is again not exposed to Python it can't really be tested. Should we add errors as an optional argument to unicode.translate?
msg36808 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-05-29 20:50
Logged In: YES user_id=89016 This new version diff10.txt fixes a memory overwrite/reallocation bug in PyUnicode_EncodeCharmap and moves the error handling out of PyUnicode_EncodeCharmap. A new version of the test script is included too.
msg36809 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-05-30 16:30
Logged In: YES user_id=89016 diff11.txt fixes two refcounting bugs in codecs.c. speedtest.py is a little test script, that checks to speed of various string/encoding/error combinations.
msg36810 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-07-24 18:55
Logged In: YES user_id=89016 diff12.txt finally implements the PEP293 specification (i.e. using exceptions for the communication between codec and handler)
msg36811 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-07-24 19:04
Logged In: YES user_id=89016 Attached is a new version of the test script. But we need more tests. UTF-7 is completely untested and using codecs that pass wrong arguments to the handler and handler that return wrong or out of bounds results is untested too.
msg36812 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-07-26 15:41
Logged In: YES user_id=89016 The attached new version of the test script add test for wrong parameter passed to the callbacks or wrong results returned from the callback. It also add tests to the long string tests for copies of the builtin error handlers, so the codec does not recognize the name and goes through the general callback machinery. UTF-7 decoding still has a flaw inherited from the current implementation: >>> "+xxx".decode("utf-7") Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-3: unterminated shift sequence *>>> "+xxx".decode("utf-7", "ignore") u'\uc71c' The decoder should consider the whole sequence "+xxx" as undecodable, so "Ignore" should return an empty string. Currently the correct sequence will be passed to the callback, but the faulty sequence has already been emitted to the result string.
msg36813 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-08-14 10:23
Logged In: YES user_id=89016 This new version diff13.txt moves the initialization of codec.strict_errors etc. from Modules/_codecsmodule.c to Lib/codecs.py. The error logic for the accessor function is inverted (now its 0 for success and -1 for error). Updated the prototypes to use the new PyAPI_FUNC macro. Enhanced the docstrings for str.(de\|en)code and unicode.encode. There seems to be a new string decoding function PyString_DecodeEscape in current CVS. This function has to be updated too.
msg36814 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2002-09-02 13:19
Logged In: YES user_id=89016 Checked in as: (this is diff13.txt + the test script + dodumentation in two TeX files) Doc/lib/libcodecs.tex 1.11 Doc/lib/libexcs.tex 1.49 Include/codecs.h 2.5 Include/pyerrors.h 2.58 Lib/codecs.py 1.27 Lib/test/test_codeccallbacks.py 1.1 Misc/NEWS 1.476 Modules/_codecsmodule.c 2.15 Objects/stringobject.c 2.186 Objects/unicodeobject.c 2.167 Python/codecs.c 2.15 Python/exceptions.c 1.35

History
Date	User	Action	Args
2022-04-10 16:04:07	admin	set	github: 34615
2001-06-12 13:43:03	doerwalter	create