Issue 594893: printing email object deletes whitespace

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37023

classification

Title:	printing email object deletes whitespace
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 2.3

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	barry	Nosy List:	barry, skip.montanaro
Priority:	normal	Keywords:

Created on 2002-08-14 04:59 by skip.montanaro, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name	Uploaded	Description
email.zip	skip.montanaro, 2002-08-14 04:59

Messages (10)
msg11918 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2002-08-14 04:59
In certain situations when printing email Message objects (I think), whitespace in headers disappears. The attached zip file demonstrates this problem. In email.orig, there is a line break followed by a TAB in the X-Vm-v5-Data header at the end of the first continuation line. In email.new, which was generated by printing an email.Message object, the line break and TAB are gone, but no SPACE was inserted in their place.
This example is from a larger program which reads in a Unix mailbox like so:
    msgdict = {}
    i = 0
    for msg in mailbox.PortableUnixMailbox(f, email.Parser.Parser().parse):
        subj = msg["subject"]
        item = msgdict.get(subj) or []
        item.append((i, msg))
        msgdict[subj] = item
        i += 1
runs through msgdict and deletes a bunch of messages matching various criteria, then prints out those which remain, retaining the relative order they had in the original mailbox:
    msglist = []
    for val in msgdict.values():
        msglist.extend(val)
    msglist.sort()
    for i, msg in msglist:
        print msg
email.orig was plucked from the input mailbox and email.new from the output mailbox.
msg11919 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2002-08-29 14:45
Hmmm... Sometimes seems to add whitespace as well. Here's an example using the X-Face: header:
Before:
    X-Face: $LeJ8}Gzj%b'dmF:@bMiTrpT\|UL=3O!CG~3;}dS[43`qefo('''9?B=2a0uB4u+a)$"DYl S
After:
    X-Face: $LeJ8}Gzj%b'dmF:@bMiTrpT\|UL=3O!CG~3; }dS[43`qefo('''9?B=2a0uB4u+a)$"DYlS
msg11920 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2002-09-10 17:29
Skip, you've got two difficult examples here. RFC 2822 recommends splitting lines at "the highest syntactic level" possible, but that differs depending on the semantics of the header. By default, Header._split_ascii() splits first on semicolons (for multiple-parameter headers) and then on whitespace. Your two examples exploit weaknesses in this algorithm.
In the first case, X-VM... has the syntax of a lisp expression. A coarser way to look at the contents would be to try to keep "-delimited strings without line breaks. The email package doesn't know anything about either of these syntactic levels.
In the second case, you actually have X-Face data which contains a semicolon, so the split mentioned above does the wrong thing in this case.
I'm not sure what the best answer is. We can't hardcode too much syntactic information into the Header class. Do we need some kind of registration/callback mechanism so that applications can create their own tokenization routines for providing non-breaking tokens to the _split_ascii() method? Yeesh. I'm up for suggestions.
I can add a hack so that at least the X-VM header doesn't lose information when printed, but it's just a hack, so I'm not sure what the best solution is.
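The weakness described above can be illustrated with a toy sketch. This is not the email package's Header._split_ascii(); the function name naive_fold, the 40-column limit, and the sample values are all assumptions made up for illustration. It only mimics the "semicolons first, then whitespace" preference and shows why opaque data containing a semicolon comes back changed after unfolding.

```python
def naive_fold(value, maxlen=40):
    """Toy folding: break after ';' when present, otherwise at whitespace.

    Hypothetical helper for illustration only, not the email package's code.
    """
    if len(value) <= maxlen:
        return [value]
    if ";" in value:
        # Treat every semicolon as a high-level syntactic break.
        parts = value.split(";")
        return [p.strip() + ";" for p in parts[:-1]] + [parts[-1].strip()]
    # Fall back to whitespace: greedily pack words up to maxlen.
    lines, current = [], ""
    for word in value.split():
        candidate = word if not current else current + " " + word
        if current and len(candidate) > maxlen:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def unfold(lines):
    """Model unfolding: each continuation line contributes one space."""
    return " ".join(lines)


# A parameter-style header survives the round trip.
structured = 'text/plain; charset="us-ascii"; format=flowed; delsp=yes'
# Made-up opaque data that merely happens to contain semicolons.
opaque = "abc;def" * 10

print(unfold(naive_fold(structured)) == structured)
print(unfold(naive_fold(opaque)) == opaque)
```

The second comparison fails because the toy splitter breaks after a semicolon that was never followed by whitespace, so unfolding inserts a space that the original value did not contain. That is the same kind of change as in the X-Face example.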
msg11921 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2002-09-10 19:51
Hmmm... How can RFC 2822 presume to know anything about the syntax of X-* headers? Perhaps they should just be left alone...
msg11922 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2002-09-10 20:04
It doesn't, it just suggests that when wrapping a line:
    [...] folding SHOULD be limited to placing the CRLF at higher-level syntactic breaks. For instance, if a field body is defined as comma-separated values, it is recommended that folding occur after the comma separating the structured items in preference to other places where the field could be folded, even if it is allowed elsewhere.
So it's really up to the application in most cases to define what the higher-level syntactic breaks should be. Problem is, the email package currently has no way for applications to tell it what to do for particular headers, so email tries a couple of simplistic generalized splitting algorithms (semi's then whitespace).
Wild thought: allow each header to be assigned a splitting tokenizer method which does the "higher-level syntactic breaks". Tricky bits are to provide a usable API (where? in the Generator or in Message?), and what to do about encoded headers vs. ascii headers.
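As a rough sketch of what such a per-header tokenizer hook might look like: nothing below exists in the email package. The names register_tokenizer, whitespace_tokens, comma_tokens, and fold are hypothetical, and the sketch ignores encoded headers entirely. It also normalizes whitespace between tokens to a single space, so it is not lossless.

```python
import re

# Hypothetical registry: lowercased header name -> tokenizer function.
_TOKENIZERS = {}


def register_tokenizer(header_name, tokenizer):
    """Associate a tokenizer with a header; its tokens are never broken apart."""
    _TOKENIZERS[header_name.lower()] = tokenizer


def whitespace_tokens(value):
    """Default: any run of whitespace is an acceptable break point."""
    return value.split()


def comma_tokens(value):
    """Break only after commas, keeping each comma with its item."""
    return [item.strip() for item in re.findall(r"[^,]+,?", value) if item.strip()]


def fold(header_name, value, maxlen=78):
    """Greedily pack tokens into lines of at most maxlen where possible."""
    tokenizer = _TOKENIZERS.get(header_name.lower(), whitespace_tokens)
    lines = []
    current = header_name + ":"
    for token in tokenizer(value):
        if len(current) + 1 + len(token) > maxlen:
            lines.append(current)
            current = " " + token  # continuation lines start with a space
        else:
            current += " " + token
    lines.append(current)
    return "\n".join(lines)


register_tokenizer("Keywords", comma_tokens)
value = "alpha beta, gamma delta, epsilon zeta"
print(fold("Keywords", value, maxlen=30))
print(fold("X-Other", value, maxlen=30))
```

With the comma tokenizer registered, the Keywords header breaks only after a comma, while the unregistered header falls back to whitespace and is broken in the middle of an item.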
msg11923 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2002-09-10 20:24
A slightly less wild idea - why not just suppress all folding/reformatting for X-* headers and instead always emit the raw header value that was in the original message? That should solve the problem in the short term and allow you to come up with a suitable API for the longer term.
msg11924 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2002-09-10 20:32
Why treat just the X-* header specially?
BTW, the reason headers are wrapped in the first place is that RFC 2822 specifies hard and soft limits to header lengths. I think the hard limit is 998 characters, but it is recommended that no header be longer than 78 characters without wrapping. OTOH, a header like the X-VM-... header is for internal use only, so it's probably never used outside of your own applications.
Note that you can suppress all wrapping by setting the maxheaderlen argument in the Generator's constructor to some outrageously large value (try 2000). Maybe a negative value should indicate that no wrapping of any headers be done? (Maybe limited to just non-encoded headers?)
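A minimal sketch of the maxheaderlen suggestion, written against the current Python 3 module layout (email.generator and email.message rather than the email.Generator spelling of the time). The header name X-Demo-Data, its value, and the flatten helper are made up for illustration.

```python
import io
from email.generator import Generator
from email.message import Message


def flatten(msg, maxheaderlen):
    """Serialize msg to text, wrapping headers at maxheaderlen columns."""
    buf = io.StringIO()
    Generator(buf, mangle_from_=False, maxheaderlen=maxheaderlen).flatten(msg)
    return buf.getvalue()


# Hypothetical header, long enough to need wrapping at 78 columns.
value = " ".join("token%02d" % i for i in range(30))
msg = Message()
msg["X-Demo-Data"] = value
msg.set_payload("body\n")

# Keep only the header block (everything before the first blank line).
wrapped_headers = flatten(msg, 78).split("\n\n")[0]
unwrapped_headers = flatten(msg, 2000).split("\n\n")[0]

print(wrapped_headers)
print(unwrapped_headers)
```

With the default-sized limit the header spans several lines. With the very large limit it stays on one line, so the folding code never has to choose a break point.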
msg11925 - (view)	Author: Skip Montanaro (skip.montanaro) *	Date: 2002-09-10 20:45
> Why treat just the X-* header specially?
Because of all the possible headers they are the ones we know the least about format-wise. From my selfish perspective, they are the ones I am having the most trouble with... ;-)
I'd be happy to experiment with the maxheaderlen argument. I wasn't aware it existed. Will that also solve the problem of space getting deleted?
msg11926 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2002-09-10 20:52
> Will that also solve the problem of space getting deleted?
I'm not sure, give it a try! :) If not, then I think we'll add maxheaderlen=-1 to mean do no wrapping or filling of header values (which should take care of the problem).
msg11927 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2003-03-10 16:59
Skip, I'm finally getting back to this. In the latest CVS of the email pkg (which will be 2.5), Header.encode() has an optional splitchars argument. If you were to load the X- header data into a Header instance and print it with splitchars='', then you should suppress splitting. Can you look to see if the semantics and API are appropriate for you?
I'm closing the bug because I suspect you've long worked around it. If you don't care any more, you can just leave it closed <wink>.
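The splitchars suggestion can be tried with a sketch like the following, using the Python 3 spelling email.header.Header. The sample value is made up. Whether splitchars='' fully suppresses splitting in a given Python version is an assumption to check by running it, not something this sketch guarantees.

```python
from email.header import Header

# Made-up value with semicolons and spaces, longer than the 78-column default.
value = " ".join("item%02d;" % i for i in range(20))

header = Header(value)

# Default split characters: the value is folded onto several lines.
default_folded = header.encode()

# Empty splitchars, as suggested in the message above.
unsplit = header.encode(splitchars="")

print(default_folded)
print(unsplit)
```

Comparing the two printed results shows whether any line breaks remain when no split characters are given.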

History
Date	User	Action	Args
2022-04-10 16:05:35	admin	set	github: 37023
2002-08-14 04:59:58	skip.montanaro	create