
classification
Title: msgfmt cannot cope with BOM - improve error message
Type: behavior Stage: needs patch
Components: Demos and Tools, Unicode Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: loewis Nosy List: cito, eric.araujo, loewis, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2007-04-10 20:58 by cito, last changed 2022-04-11 14:56 by admin.

Files
File name Uploaded Description
msgfmt.diff cito, 2007-04-10 20:58 review
Messages (9)
msg31755 - (view) Author: Christoph Zwerschke (cito) * Date: 2007-04-10 20:58
If a .po file has a BOM (byte order mark) at the beginning, as is often the case for UTF-8 files created on Windows, msgfmt.py complains about a syntax error.

The attached patch fixes this problem.
msg31756 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-04-11 16:07
Martin, is this your code?
msg31757 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-04-11 22:13
It's my code, but I will need to establish first whether it's a bug. That depends on what the PO specification says, and, if it is silent on the matter, what GNU gettext does.
msg31758 - (view) Author: Christoph Zwerschke (cito) * Date: 2007-04-12 09:10
It may well be that GNU gettext also chokes on a BOM, because they aren't used under Linux. But I think as a Python tool it should be more Windows-tolerant. The annoying thing is that you get a syntax error, but everything looks right because the BOM is usually invisible. Such error messages are really frustrating. Either the BOM should be silently ignored (as in the patch) or the users should get a friendly error message asking them to save the file without BOM. If GNU behaves badly to Windows users, that's not an excuse to do the same. They are already suffering enough because of their (or their bosses') bad choice of OS ;-)

msg70042 - (view) Author: Christoph Zwerschke (cito) * Date: 2008-07-19 16:17
Small improvement of the patch: Instead of hardcoding the BOM as
'\xef\xbb\xbf', we should use codecs.BOM_UTF8.
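As a rough illustration of that suggestion (this is not the attached patch, and the helper name `strip_utf8_bom` is hypothetical), stripping a leading BOM with the `codecs.BOM_UTF8` constant could look like this:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; not part of msgfmt.py.
    """
    if data.startswith(codecs.BOM_UTF8):
        # codecs.BOM_UTF8 is the byte string b'\xef\xbb\xbf'
        return data[len(codecs.BOM_UTF8):]
    return data


po_bytes = codecs.BOM_UTF8 + b'msgid ""\nmsgstr ""\n'
cleaned = strip_utf8_bom(po_bytes)
```

Using the named constant avoids repeating the magic bytes and makes the intent obvious to readers of the code.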
msg125940 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-10 22:18
Extract of the Unicode standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature".

See also the following section explaining issues with the UTF-8 BOM:
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

I agree that Python should handle (UTF-8) BOM to read a CSV file (#7185), because the file format is common on Windows.

But msgfmt is a UNIX tool: I would expect Python to behave like the original msgfmt tool and fail with a fatal error on the BOM "invisible character". How do you explain to a user that msgfmt fails but msgfmt.py does not?

About the patch: *ignoring* the BOM is not a good idea. The BOM announces the encoding (e.g. UTF-8): if a Content-Type header announces another encoding, you should raise an error.
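A minimal sketch of such a consistency check, assuming the raw bytes of the PO file are available and that the charset is declared in the usual `Content-Type: text/plain; charset=...` header line. The function name `check_bom_matches_charset` and the regular expression are illustrative assumptions, not code from msgfmt.py:

```python
import codecs
import re


def check_bom_matches_charset(data: bytes) -> None:
    """Raise ValueError if a UTF-8 BOM conflicts with the declared charset.

    Hypothetical helper for illustration; not part of msgfmt.py.
    """
    if not data.startswith(codecs.BOM_UTF8):
        return  # no BOM, nothing to cross-check
    # Assumption: the charset appears as "charset=NAME" in the PO header.
    match = re.search(rb"charset=([\w.-]+)", data)
    if match is None:
        return  # no declared charset to compare against
    declared = match.group(1).decode("ascii")
    try:
        normalized = codecs.lookup(declared).name
    except LookupError:
        normalized = declared.lower()
    if normalized != "utf-8":
        raise ValueError(
            "file starts with a UTF-8 BOM but declares charset=%s" % declared
        )


conflicting = codecs.BOM_UTF8 + (
    b'msgid ""\n'
    b'msgstr ""\n'
    b'"Content-Type: text/plain; charset=ISO-8859-1\\n"\n'
)
try:
    check_bom_matches_charset(conflicting)
except ValueError as exc:
    message = str(exc)
```

`codecs.lookup()` is used only to normalize spelling variants such as `UTF-8` and `utf8` to the same codec name.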
msg125941 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-10 22:19
See also issue #7651: "Python3: guess text file charset using the BOM".
msg290519 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-26 09:47
Corresponding GNU gettext issue [1] was closed as "Not a Bug".

[1] https://savannah.gnu.org/bugs/?18345
msg290524 - (view) Author: Christoph Zwerschke (cito) * Date: 2017-03-26 10:53
> Corresponding GNU gettext issue [1] was closed as "Not a Bug".

Though I think the rationale given there pointing to RFC3629 section 6 is wrong, since that section explicitly refers to Internet protocols, but PO files are not an Internet protocol.

Anyway, if silently ignoring BOMs is considered a bad idea, then at least there should be a more helpful error message. Because the BOM is invisible, users - who may not even be aware that something like a BOM exists or that their editor saves files with a BOM - may be frustrated by the current error message because they don't see any invalid character when they open the PO file in their editor. A more explicit error message like "PO files should not be saved with a byte order mark" might point users in the right direction.

After all, these tools are supposed to be used directly by human beings on the command line. Who said that command line tools must not be user friendly?
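The explicit error message proposed above could be sketched roughly as follows. The function name `reject_utf8_bom` and the exact wording are assumptions for illustration only; this is not how msgfmt.py currently reports the problem:

```python
import codecs


def reject_utf8_bom(data: bytes, filename: str) -> bytes:
    """Return data unchanged, or raise a descriptive error if it starts with a BOM.

    Hypothetical helper for illustration; not part of msgfmt.py.
    """
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError(
            "%s: PO files should not be saved with a byte order mark (BOM); "
            "please re-save the file as UTF-8 without a BOM" % filename
        )
    return data


try:
    reject_utf8_bom(codecs.BOM_UTF8 + b'msgid ""\n', "messages.po")
except ValueError as exc:
    friendly = str(exc)
```

Naming the file and the invisible character in the message tells the user what to change even though the editor shows nothing wrong.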
History
Date User Action Args
2022-04-11 14:56:23 admin set github: 44827
2021-04-22 15:26:29 iritkatriel set title: msgfmt cannot cope with BOM -> msgfmt cannot cope with BOM - improve error message
    resolution: not a bug ->
    versions: + Python 3.11, - Python 3.1, Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6
2017-03-26 10:53:38 cito set status: pending -> open
    messages: + msg290524
    versions: + Python 3.4, Python 3.5, Python 3.6
2017-03-26 09:47:55 serhiy.storchaka set status: open -> pending
    nosy: + serhiy.storchaka
    messages: + msg290519
    resolution: not a bug
2011-01-10 22:19:48 vstinner set nosy: loewis, rhettinger, cito, vstinner, eric.araujo
    messages: + msg125941
2011-01-10 22:18:38 vstinner set nosy: loewis, rhettinger, cito, vstinner, eric.araujo
    messages: + msg125940
2011-01-06 17:03:44 pitrou set nosy: + vstinner
    stage: test needed -> needs patch
    versions: + Python 2.7, Python 3.2, Python 3.3, - Python 2.6
2010-06-11 14:58:50 eric.araujo set nosy: + eric.araujo
2009-05-15 02:21:09 ajaksu2 set versions: + Python 2.6, Python 3.1, - Python 2.5
    nosy: loewis, rhettinger, cito
    components: + Unicode
    keywords: + patch
    type: behavior
    stage: test needed
2008-07-19 16:17:29 cito set messages: + msg70042
2007-04-10 20:58:04 cito create