classification
Title: email.Utils.encode doesn't obey rfc2047
Type: Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: barry Nosy List: barry, tsarna
Priority: normal Keywords:

Created on 2002-05-06 18:13 by tsarna, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (2)
msg10662 - Author: Ty Sarna (tsarna) Date: 2002-05-06 18:13
The email.Utils.encode function has two bugs, which
are somewhat related -- it fails to deal with long
input strings in two different ways.

First, newlines are not allowed in the middle of
rfc2047 encoded-words (per section 2: "[...] white
space characters MUST NOT appear between components of
an 'encoded-word'"). The _bencode and _qencode routines
that the encode function uses include newlines (or
"=\n"'s for quopri) in their output, and the encode
function doesn't remove them. Try encoding a long
string with 'q' for example. The resulting output will
contain one or more "=\n"'s, and the
email.Utils.decode function will not be able to parse it.

Patch:

*** Utils.py.orig       Mon May  6 13:17:05 2002
--- Utils.py    Mon May  6 13:18:16 2002
***************
*** 98,106 ****
      """Encode a string according to RFC 2047."""
      encoding = encoding.lower()
      if encoding == 'q':
!         estr = _qencode(s)
      elif encoding == 'b':
!         estr = _bencode(s)
      else:
          raise ValueError, 'Illegal encoding code: ' + encoding
      return '=?%s?%s?%s?=' % (charset.lower(), encoding, estr)
--- 98,106 ----
      """Encode a string according to RFC 2047."""
      encoding = encoding.lower()
      if encoding == 'q':
!         estr = _qencode(s).replace('=\n','')
      elif encoding == 'b':
!         estr = _bencode(s).replace('\n','')
      else:
          raise ValueError, 'Illegal encoding code: ' + encoding
      return '=?%s?%s?%s?=' % (charset.lower(), encoding, estr)

NOTE: The .replace()-ing should NOT be done in _bencode
and _qencode, because they're used in other places where
their current behaviour is fine/expected.
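
To see the line-break behaviour described above with today's standard library, here is a small Python 3 sketch. It uses base64.encodebytes and quopri.encodestring as stand-ins for the old _bencode and _qencode helpers (an assumption; the original helpers are not reproduced here), and the 100-byte payload is made up for illustration. It shows plain quoted-printable, not the full rfc2047 Q encoding.

```python
import base64
import quopri

# Illustrative payload (an assumption): 100 high-bit Latin-1 bytes.
payload = ("\u00e9" * 100).encode("iso-8859-1")

b64 = base64.encodebytes(payload)   # wraps its output with "\n"
qp = quopri.encodestring(payload)   # wraps its output with "=\n" soft breaks

# Both encoders insert line breaks once the input is long enough.
assert b"\n" in b64
assert b"=\n" in qp

# Stripping the breaks gives text that could sit inside one encoded-word.
b64_clean = b64.replace(b"\n", b"")
qp_clean = qp.replace(b"=\n", b"")

# The cleaned text still decodes to the original bytes.
assert base64.b64decode(b64_clean) == payload
assert quopri.decodestring(qp_clean) == payload
```

Stripping in the caller, as in the patch, leaves the encoders themselves untouched for code that relies on wrapped output.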


Second problem: rfc2047 specifies that an encoded-word
may be no longer than 75 characters (see section 2).
Also, in the case of, say, a From: header with high-bit
characters in the sender's name, you really want to
encode only the name, not the whole line, so that dumb
mail programs are able to recognize the email address
in the line without having to understand rfc2047.

Proposed solution: rename the existing encode function
(with the above patch applied) to encode_word. Add a new
encode function that splits the input string into a
list of words and whitespace runs.  Words are encoded
individually using encode_word() iff they are not pure
ascii. The results are then concatenated back with
original whitespace.

This still leaves the possibility that a single word,
when encoded, is longer than 75 characters. The
recommended practice in rfc2047 is to use multiple
encoded words separated by CRLF SPACE (or in our case,
"\n ").


Here is code that implements the above:

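import re

# Split into words and whitespace runs; the capturing group keeps the runs.
wsplit = re.compile('([ \n\t]+)').split


def encode(s, charset='iso-8859-1', encoding='q'):
    encoding = encoding.lower()
    o = []

    # max encoded-word length per rfc2047 section 2 is 75
    # 75 - len("=?" + "?" + "?" + "?=") == 69
    max_enc_text = 69 - len(charset) - len(encoding)
    if encoding == 'q':
        # 3 bytes per character worst case
        safe_wlen = max_enc_text // 3
    elif encoding == 'b':
        # base64 emits 4 characters per 3 input bytes, so round down
        # to whole groups to stay within the budget
        safe_wlen = (max_enc_text // 4) * 3
    else:
        # unknown encoding: encode_word() will raise ValueError if reached
        safe_wlen = max_enc_text

    for w in wsplit(s):
        if not w:
            # the split yields '' when s starts or ends with whitespace
            continue
        if w[0] in " \n\t":
            o.append(w)
            continue
        try:
            w.encode('ascii')
        except UnicodeError:
            # not pure ascii: emit chunks that each fit in one encoded-word
            ew = encode_word(w, charset, encoding)
            while len(ew) > 75:
                o.append(encode_word(w[:safe_wlen], charset, encoding) + "\n ")
                w = w[safe_wlen:]
                ew = encode_word(w, charset, encoding)
            o.append(ew)
        else:
            o.append(w)

    return ''.join(o)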
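
The length budget behind the 75-character limit can be checked on its own. The following Python 3 sketch is only an illustration: the helper names max_payload_chars and safe_input_len are hypothetical, and base64.b64encode stands in for whatever 'b' encoder is actually used.

```python
import base64

MAX_ENCODED_WORD = 75  # rfc2047 section 2 limit


def max_payload_chars(charset, encoding):
    """Characters left for encoded text inside '=?charset?encoding?text?='."""
    overhead = len("=?") + len(charset) + len("?") + len(encoding) + len("?") + len("?=")
    return MAX_ENCODED_WORD - overhead


def safe_input_len(charset, encoding):
    """Largest input length that always fits in one encoded-word."""
    budget = max_payload_chars(charset, encoding)
    if encoding == "q":
        return budget // 3        # worst case: every byte becomes "=XX"
    if encoding == "b":
        return (budget // 4) * 3  # base64: 4 output chars per 3 input bytes
    raise ValueError("Illegal encoding code: " + encoding)


budget = max_payload_chars("iso-8859-1", "b")
n = safe_input_len("iso-8859-1", "b")
# The encoded form of n bytes must fit in the budget.
assert len(base64.b64encode(b"\xe9" * n)) <= budget
```

With charset iso-8859-1 the budget is 58 characters, which allows 19 input bytes for 'q' and 42 for 'b'. Scaling the budget by 6/8 without rounding down to whole base64 groups would allow 43 bytes, which encode to 60 characters and overshoot the budget.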
msg10663 - Author: Barry A. Warsaw (barry) (Python committer) Date: 2002-06-29 02:01

Ty, is it worth patching up email.Utils.encode() given its
deprecation and the existence of the Header class?  I tend
to think not (there should be only one way to do it).

Is Header vulnerable to the same problems?  If so, please
submit a new bug report with a test case.  Please also
attach diffs and patches as attachments instead of in the
bug report because otherwise SF will mess up the indentation.
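
A test case along those lines might look like the sketch below. It is written for Python 3's email.header module, not the 2002 code base, and the check_header helper, the "Subject" header name, the maxlinelen of 76 and the sample text are all assumptions chosen for illustration. It reports the longest encoded-word without asserting a limit, so the result can be compared against the 75-character rule by hand.

```python
import re
from email.header import Header, decode_header, make_header

# Shape of an encoded-word: =?charset?encoding?text?= with no whitespace.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def check_header(text, charset, maxlinelen=76):
    """Encode text with Header and report on the resulting encoded-words."""
    header = Header(text, charset, maxlinelen=maxlinelen, header_name="Subject")
    encoded = header.encode()
    words = encoded.split()  # encoded-words are separated by folding whitespace
    return {
        "well_formed": all(ENCODED_WORD.fullmatch(w) for w in words),
        "longest_word": max(len(w) for w in words),
        "round_trips": str(make_header(decode_header(encoded))) == text,
    }


# Illustrative input: one long word made only of non-ASCII characters.
report = check_header("\u00e9" * 120, "iso-8859-1")
assert report["well_formed"] and report["round_trips"]
print(report["longest_word"])  # compare against the rfc2047 limit of 75
```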

BTW, you might want to check Python 2.3's CVS since there
have been a lot of updates lately.

Thanks, I'm closing this one.
History
Date                 User    Action  Args
2022-04-10 16:05:18  admin   set     github: 36564
2002-05-06 18:13:36  tsarna  create