classification
Title: Undocumented implicit strip() in split(None) string method
Type: Stage:
Components: Documentation Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: calvin, jimjjewett, rhettinger, terry.reedy, tim.peters, yohell
Priority: normal Keywords:

Created on 2005-01-19 15:04 by yohell, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (12)
msg23981 - (view) Author: YoHell (yohell) Date: 2005-01-19 15:04
Hi! 

I noticed that the string method split() first does an
implicit strip() before splitting when it's used with
no arguments or with None as the separator (sep in the
docs). There is no mention of this implicit strip() in
the docs.

Example 1:
s = " word1 word2 "

s.split() then returns ['word1', 'word2'] and not ['',
'word1', 'word2', ''] as one might expect.

WHY IS THIS BAD?

1. Because it's undocumented. See:
http://www.python.org/doc/current/lib/string-methods.html#l2h-197

2. Because it may lead to unexpected behavior in programs. 
Example 2:
FASTA sequence headers are one-line descriptors of
biological sequences and have this form: 
">" + Identifier + whitespace + free text description.

Let sHeader be a Python string containing a FASTA
header. One could then use the following syntax to
extract the identifier from the header:

sID = sHeader[1:].split(None, 1)[0]

However, this does not work if sHeader contains a
faulty FASTA header where the identifier is missing or
consists of whitespace. In that case sID will contain
the first word of the free text description, which is
not the desired behavior. 
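
The pitfall described above can be reproduced with a short sketch. The header strings and the helper name fasta_identifier are illustrative assumptions, not part of any FASTA library. The helper shows one possible way to detect a missing identifier, by checking whether whitespace directly follows the ">".

```python
# Hypothetical FASTA headers, made up for illustration.
good_header = ">seq1 Homo sapiens hemoglobin"
bad_header = "> Homo sapiens hemoglobin"  # identifier is missing

# The idiom from the report: leading whitespace is skipped, so the first
# word of the description is returned when the identifier is absent.
naive_good = good_header[1:].split(None, 1)[0]
naive_bad = bad_header[1:].split(None, 1)[0]


def fasta_identifier(header):
    """Return the identifier, or None if it is missing (hypothetical helper)."""
    body = header[1:]
    # An identifier must start immediately after the ">" character.
    if not body or body[0].isspace():
        return None
    return body.split(None, 1)[0]


print(naive_good)                    # seq1
print(naive_bad)                     # Homo
print(fasta_identifier(bad_header))  # None
```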

WHAT SHOULD BE DONE?

The implicit strip() should be removed, or at least
programmers should be given the option to turn it off.
At the very least it should be documented so that
programmers have a chance of adapting their code to it.

Thank you for an otherwise splendid language!
/Joel Hedlund
Ph.D. Student
IFM Bioinformatics
Linköping University
msg23982 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2005-01-19 16:56

I think the docs for split() under "String Methods" are quite 
clear:

"""
...

If sep is not specified or is None, a different splitting 
algorithm is applied. Words are separated by arbitrary length 
strings of whitespace characters (spaces, tabs, newlines, 
returns, and formfeeds). Consecutive whitespace delimiters 
are treated as a single delimiter ("'1 2 3'.split()" 
returns "['1', '2', '3']"). Splitting an empty string returns "['']". 
"""

This won't change, because mountains of code rely on this 
behavior -- it's probably the single most common use case 
for .split().
msg23983 - (view) Author: YoHell (yohell) Date: 2005-01-20 10:15

In reply to tim_one:
> I think the docs for split() under "String Methods" are quite 
> clear:

On the contrary, my friend, and here's why:

> """
> ...
> If sep is not specified or is None, a different splitting
> algorithm is applied. 

This sentence does not say that whitespace will be
implicitly stripped from the edges of the string.

> Words are separated by arbitrary length strings of whitespace 
> characters (spaces, tabs, newlines, returns, and formfeeds). 

Neither does this one.

> Consecutive whitespace delimiters are treated as a single
> delimiter ("'1 2 3'.split()" returns "['1', '2', '3']"). 

And not that one.

> Splitting an empty string returns "['']".
> """

And that last one does not mention it either. In fact, there
is no mention in the docs of how separators on edges of
strings are treated by the split method. And furthermore,
there is no mention that s.split(sep) treats them
differently when sep is None than it does otherwise. Example:

>>> ",2,".split(',')
['', '2', '']
>>> " 2 ".split()
['2']

This inconsistent behavior is not in line with how
beautifully thought out the Python language is otherwise,
and how brilliantly everything else is documented on the
http://python.org/doc/ documentation pages. 
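
For readers who want whitespace splitting that keeps empty strings at the edges, the way an explicit separator does, one possible approach is re.split from the standard library. This is a sketch of an alternative, not a description of what str.split() itself does.

```python
import re

# An explicit separator keeps empty strings at the edges.
comma_parts = ",2,".split(",")

# No separator: whitespace at the edges is dropped.
default_parts = " 2 ".split()

# re.split on runs of whitespace keeps the empty edge strings.
regex_parts = re.split(r"\s+", " 2 ")

print(comma_parts)    # ['', '2', '']
print(default_parts)  # ['2']
print(regex_parts)    # ['', '2', '']
```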

> This won't change, because mountains of code rely on this 
> behavior -- it's probably the single most common use case 
> for .split().

I thought as much. However, it would be really easy for
an admin to add a line of documentation to .split() to
explain this. That would certainly help make me a happier
man, and hopefully others too.

Cheers guys!
/Joel
msg23984 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-01-20 14:04

What new wording do you propose to be added?
msg23985 - (view) Author: Jim Jewett (jimjjewett) Date: 2005-01-20 14:28

Replacing the quoted line:

"""
...

If sep is not specified or is None, a different splitting 
algorithm is applied. First whitespace (spaces, tabs, 
newlines, returns, and formfeeds) is stripped from both 
ends.   Then words are separated by arbitrary length 
strings of whitespace characters. Consecutive whitespace 
delimiters are treated as a single delimiter ("'1 2 3'.split()" 
returns "['1', '2', '3']"). Splitting an empty (or whitespace-
only) string returns "['']".
"""
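
The strip-then-split model in the proposed wording can be written out as a rough sketch. The function name split_on_whitespace is hypothetical, and the regular expression only approximates the built-in whitespace test. The guard returns an empty list for empty or whitespace-only input, which is what str.split() returns in current Python 3.

```python
import re


def split_on_whitespace(s):
    """Model str.split() with no separator as strip-then-split (a sketch)."""
    stripped = s.strip()
    # Guard: re.split would return [''] for an empty string.
    if not stripped:
        return []
    return re.split(r"\s+", stripped)


sample = "  1 \t 2\n3  "
print(split_on_whitespace(sample))  # ['1', '2', '3']
print(sample.split())               # ['1', '2', '3']
```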
msg23986 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-01-20 14:50

The proposed wording is fine.

If there are no objections or concerns, I'll apply it soon.
msg23987 - (view) Author: YoHell (yohell) Date: 2005-01-20 14:59

Brilliant, guys!

Thanks again for a superb scripting language, and with
documentation to match!

Take care!
/Joel Hedlund
msg23988 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2005-01-24 07:15

To me, the removal of whitespace at the ends (stripping) is 
consistent with the removal (or collapsing) of extra 
whitespace in between so that .split() does not return empty 
words anywhere.  Consider:

>>> ',1,,2,'.split(',')
['', '1', '', '2', '']

If ' 1  2 '.split() were to return null strings at the beginning 
and end of the list, then to be consistent, it should also put 
one in the middle.  One can get this by being explicit (mixed 
WS can be handled by translation):

>>> ' 1  2 '.split(' ')
['', '1', '', '2', '']
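
The translation step mentioned above might look like the following sketch in Python 3, where str.maketrans builds the table. The thread predates this API, so a modern interpreter is assumed here. Each ASCII whitespace character is mapped to a plain space before splitting on ' '.

```python
# Map each ASCII whitespace character to a plain space.
WHITESPACE = "\t\n\r\f\v"
TO_SPACES = str.maketrans(WHITESPACE, " " * len(WHITESPACE))

mixed = " 1\t 2\n"
normalized = mixed.translate(TO_SPACES)

# Splitting on an explicit space keeps the empty strings.
parts = normalized.split(" ")
print(parts)  # ['', '1', '', '2', '']
```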

Having said this, I also agree that the extra words proposed 
by jj are helpful.

BUG??  In 2.2, splitting an empty or whitespace only string 
produces an empty list [], not a list with a null word [''].

>>> ''.split()
[]
>>> '   '.split()
[]

which is what I see as consistent with the rest of the no-null-
word behavior.  Has this changed since?  (Yes, must 
upgrade.)  I could find no indication of such change in either 
the tracker or CVS.
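
As a quick check on a current Python 3 interpreter (an assumption, since the message above refers to 2.2), the empty-input cases can be compared side by side. The variable names are only for illustration.

```python
# No separator: empty and whitespace-only strings yield an empty list.
empty_default = "".split()
blank_default = "   ".split()

# Explicit separator: an empty string yields one empty word.
empty_explicit = "".split(",")
blank_explicit = "   ".split(" ")

print(empty_default)   # []
print(blank_default)   # []
print(empty_explicit)  # ['']
print(blank_explicit)  # ['', '', '', '']
```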
msg23989 - (view) Author: Bastian Kleineidam (calvin) Date: 2005-01-24 12:51

This should probably also be added to rsplit()?
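
A small sketch suggests that rsplit() with no separator treats whitespace at the edges the same way as split(). The sample string is made up for illustration.

```python
s = " word1\tword2 \n"

# Both methods drop whitespace at the edges and collapse runs in between.
left_to_right = s.split()
right_to_left = s.rsplit()

print(left_to_right)  # ['word1', 'word2']
print(right_to_left)  # ['word1', 'word2']
```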
msg23990 - (view) Author: YoHell (yohell) Date: 2006-11-07 14:06

I'm opening this again, since the docs still don't reflect
the behavior of the method. 

from the docs:
"""
If sep is not specified or is None, a different splitting
algorithm is applied. First, whitespace characters (spaces,
tabs, newlines, returns, and formfeeds) are stripped from
both ends. 
"""

This is not true when maxsplit is given.

Example:

>>> " foo bar ".split(None)
['foo', 'bar']
>>> " foo bar ".split(None, 1)
['foo', 'bar ']

Whitespace is obviously not stripping whitespace from the
ends of the string before splitting the rest of the string. 

msg23991 - (view) Author: YoHell (yohell) Date: 2006-11-07 14:11

*resubmission: grammar corrected*

I'm opening this again, since the docs still don't reflect
the behavior of the method. 

from the docs:
"""
If sep is not specified or is None, a different splitting
algorithm is applied. First, whitespace characters (spaces,
tabs, newlines, returns, and formfeeds) are stripped from
both ends. 
"""

This is not true when maxsplit is given.

Example:
>>> " foo bar ".split(None)
['foo', 'bar']
>>> " foo bar ".split(None, 1)
['foo', 'bar ']

Whitespace is obviously not stripped from the ends before
the rest of the string is split.
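
The effect of maxsplit can be seen from both directions in a short sketch. The final line shows one possible workaround, an explicit strip() before splitting. It is an illustration only, not a recommendation from the docs.

```python
s = " foo bar "

no_limit = s.split(None)            # edges dropped
limited = s.split(None, 1)          # trailing whitespace kept in the remainder
limited_right = s.rsplit(None, 1)   # leading whitespace kept in the remainder

# Stripping explicitly first removes whitespace at both ends.
stripped_first = s.strip().split(None, 1)

print(no_limit)        # ['foo', 'bar']
print(limited)         # ['foo', 'bar ']
print(limited_right)   # [' foo', 'bar']
print(stripped_first)  # ['foo', 'bar']
```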
msg23992 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-01-06 02:16
I think the current wording is clear enough and that further attempts to specify corner cases will only make the docs harder to understand and less useful.
History
Date                 User    Action  Args
2022-04-11 14:56:09  admin   set     github: 41462
2005-01-19 15:04:27  yohell  create