classification
Title: split() breaks no-break spaces
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.4

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: doerwalter, effbot, hyeshik.chang, lemburg, maxim_razin, sjoerd
Priority: normal
Keywords:

Created on 2005-12-26 15:03 by maxim_razin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg27152 - Author: MvR (maxim_razin) Date: 2005-12-26 15:03
string.split(), str.split() and unicode.split() without
parameters split strings at the no-break space (U+00A0)
character.  This character is specifically intended not
to be a split boundary.

>>> u"Hello\u00A0world".split()
[u'Hello', u'world']
msg27153 - Author: Fredrik Lundh (effbot) * (Python committer) Date: 2005-12-29 20:42
split isn't a word-wrapping split, so I'm not sure that's
the right place to fix this.  ("no-break space" is
whitespace, according to the Unicode standard, and split
breaks on whitespace).
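
The classification can be checked directly. The following is a minimal sketch, assuming Python 3 (the thread itself uses Python 2 syntax) and only the standard unicodedata module:

```python
import unicodedata

nbsp = "\u00a0"

# U+00A0 belongs to the Unicode general category "Zs" (space separator).
category = unicodedata.category(nbsp)

# Because of that, str.isspace() reports it as whitespace.
is_space = nbsp.isspace()

# split() without arguments therefore treats it as a separator.
parts = "Hello\u00a0world".split()

print(category, is_space, parts)
```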
msg27154 - Author: Hyeshik Chang (hyeshik.chang) * (Python committer) Date: 2005-12-30 00:30
The Python documentation says that it splits on "whitespace
characters", not "breaking characters", so the current
behavior is correct according to the documentation.
Moreover, the string methods depend heavily on the ctype
functions in libc, so we can't give the NBSP special
treatment.

However, I do feel the need for a splitting function that
is aware of which characters are breaking and which are not.
How about adding it as unicodedata.split()?
msg27155 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-12-30 12:35
What's wrong with the following?

import sys, unicodedata
spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode)
                  if unicodedata.category(unichr(c)) == "Zs" and c != 160)
foo.split(spaces)
msg27156 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-12-30 13:06
Maxim, you are right that \xA0 is a no-break space.
However, as the others already mentioned, the .split()
method defaults to breaking a string on whitespace
characters, not breakable whitespace characters. The intent
is not a typographical one, but originates from the desire
to quickly tokenize a string.

If you'd rather see a different set of whitespace
characters used, you can pass such a template string to the
.split() method (Walter gave an example).

Closing this as "Won't fix".
msg27157 - Author: Sjoerd Mullender (sjoerd) * (Python committer) Date: 2006-01-02 10:48
Walter and MAL, did you actually try that workaround?  It
doesn't work:
>>> import sys, unicodedata
>>> spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode)
...                   if unicodedata.category(unichr(c)) == "Zs" and c != 160)
>>> foo = u"Hello\u00A0world"
>>> foo.split(spaces)
[u'Hello\xa0world']

That's because split() treats the whole separator argument
as a single separator, rather than splitting on any one of
the characters in it.
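
The difference can be reproduced in isolation. A small sketch, assuming Python 3 string semantics (which match the Python 2 behaviour in this respect): split(sep) looks for the whole of sep as one delimiter, whereas strip(chars) treats its argument as a set of characters.

```python
text = "a, b;c"

# split() treats the argument as ONE multi-character separator:
# the substring ", ;" never occurs in text, so nothing is split.
whole = text.split(", ;")

# Only an exact occurrence of the full separator splits the string.
exact = "a, ;b".split(", ;")

# strip(), by contrast, removes any of the given characters from both ends.
stripped = " ;a, b;c, ".strip(", ;")

print(whole, exact, stripped)
```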
msg27158 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-01-02 11:13
Oops. You're right, Sjoerd.

Still, you could achieve the splitting by using a regular
expression that is built from the set of characters
fetched from the Unicode database and then using the
.split() method of the re object.
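
One way this suggestion might look, as a sketch rather than a definitive implementation. It assumes Python 3 (chr and range instead of unichr and xrange), and split_breaking is a hypothetical helper name that does not appear in the thread. Like the earlier snippet, it considers only the "Zs" category and excludes only U+00A0.

```python
import re
import sys
import unicodedata

# Collect every "Zs" (space separator) character except U+00A0 NO-BREAK SPACE.
# Assumption carried over from the thread: only U+00A0 is excluded, so other
# non-breaking spaces such as U+202F still act as separators here. Tab and
# newline are control characters (category "Cc") and are not included.
breaking_spaces = "".join(
    chr(c)
    for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == "Zs" and c != 0xA0
)

# A character class matching one or more of the collected characters.
splitter = re.compile("[" + re.escape(breaking_spaces) + "]+")


def split_breaking(text):
    """Split on runs of breaking spaces, dropping empty fields (hypothetical helper)."""
    return [part for part in splitter.split(text) if part]


print(split_breaking("Hello\u00a0world and  more"))
```

Unlike passing the character string to .split() directly, the character class matches any single one of the collected characters, so U+00A0 stays inside its word while ordinary spaces still separate fields.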

msg27159 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2006-01-03 10:33
Seems I confused strip() with split(). I *did* try that
workaround, and it did what I expected: It *didn't* split on
U+00A0 ;)

If we want to fix this discrepancy, we could add methods
stripchars() (as a synonym for strip()) and stripstring(),
as well as splitchars() and splitstring() (as a synonym for
split()).
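
For illustration, the proposed character-set split can be written as a plain function. splitchars below is a hypothetical helper named after the proposal, not an existing str method. The sketch assumes Python 3 and that empty fields between adjacent separators should be dropped, as split() without arguments does.

```python
def splitchars(text, chars):
    """Split text on any single character found in chars (hypothetical helper)."""
    parts = []
    current = []
    for ch in text:
        if ch in chars:
            # A separator ends the current field; empty fields are skipped.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the last field if the text did not end with a separator.
    if current:
        parts.append("".join(current))
    return parts


print(splitchars("Hello\u00a0world, again", " ,"))
```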
msg27160 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-01-03 11:07
No. 

These things are application-scope details and should thus
be implemented in the application rather than as a method on
an object.

The methods always work on whitespace and that's clearly
defined.
History
Date                 User         Action  Args
2022-04-11 14:56:14  admin        set     github: 42731
2005-12-26 15:03:58  maxim_razin  create