This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Popen4 wait() fails sporadically with threads
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: nnorwitz Nosy List: atila-cheops, gjb1002, loewis, nnorwitz, tskogan
Priority: normal Keywords:

Created on 2005-04-15 14:27 by tskogan, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
popen_bug.py tskogan, 2005-04-15 14:27 Script to reproduce error.
popen2.patch nnorwitz, 2006-03-23 08:41 v1
popen2.patch nnorwitz, 2006-03-24 05:17 v2 - using __del__
popen2.diff loewis, 2006-03-24 07:40 v3
Messages (12)
msg25022 - (view) Author: Taale Skogan (tskogan) Date: 2005-04-15 14:27
Calling wait() on a popen2.Popen4 object fails 
intermittently with the error

Traceback (most recent call last):
  ...
  File "/usr/local/lib/python2.3/popen2.py", line 90, in wait
    pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

when using threads. 

The problem seems to be a race condition when a thread 
calls wait() on a popen2.Popen4 object. This also apllies 
to Popen3 objects. 

The constructor of Popen4. calls _cleanup() which calls 
poll() which calls the system call waitpid() for all acitve 
child processes. If another thread calls poll() before the 
current thread calls wait() on it's child process and the 
child process has terminated, the child process is no 
longer waitable and the second call to wait() fails.

Code to replicate this behavoir is attached in popen_bug.
py.

Solution: Popen4 and Popen3 should be threadsafe.

Related modules: A seemingly related error occurs with 
Popen from the new subprocess module. Use the -s 
option in the popen_bug.py script to test this. 

Tested on Linux RedHat Enterprise 3 for Python 2.3.3, 
Python 2.3.5 and Python 2.4.1 and Solaris for Python 2.
4.1. The error did not occur on a RedHat 7.3 machine 
with Python 2.3.5. See the attached file popen_bug.py for 
details on the platforms.

msg25023 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-03-23 08:41
Logged In: YES 
user_id=33168

The attached patch fixes the problem for me.  It also
addresses another issue where wait could be called from
outside the popen2 module.  I'm not sure this is the best
solution.  I'm not sure there really is a good solution. 
Perhaps it's best to allow an exception to be raised?
msg25024 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-03-24 00:37
Logged In: YES 
user_id=21627

I don't understand why you are setting self.sts to 0 if wait
fails: most likely, there was a simultaneous call to .poll,
which should have set self.sts to the real return value. So
we should return that instead.

I think the whole issue can be avoid if we use resurrection:
If __del__ would put unwaited objects into _active, rather
than __init__, it would not happen that _cleanup polls a pid
which a thread still intends to wait for. In fact, it would
be sufficient to only put the pid into _active (avoiding the
need for resurrection).

If then a thread calls poll explicitly, and another calls
wait, they deserve to lose (with ECHILD). I would claim the
same error exists if part of the application calls
os.wait[3,4](), and another calls .wait later - they also
deserve the exception.

With that approach, I don't think further thread
synchronization would be needed.
msg25025 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-03-24 05:17
Logged In: YES 
user_id=33168

I agree with your comment about setting self.sts to 0.  That
was the problem I alluded to on python-dev.

Although I dislike __del__, this does seem like an
appropriate place to do the modification of _active.

Note that currently all os.error's are swallowed in poll().
 I'm not sure if that was the best idea, but that's the
current interface.  wait() does *not* catch any exceptions.

I wasn't really sure what to do about threads.  The threads
could always manage their calls into a popen object like you
propose (don't try to handle simultaneous calls to poll and
wait).  Another question I had was if popen should be
deprecated in favor of subprocess?

I've attached a patch which I think implements your
suggestion.  It also seems to fix all the problems.
msg25026 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-03-24 07:40
Logged In: YES 
user_id=21627

This looks all fine.

As a further issue, I think _cleanup should also clean
processes which already have been waited on. So if waitpid
gives ECHILD (in _cleanup), I think the object should get
removed from _active - otherwise, it would stay there
forever. Of course, care is then need to avoid __del__
adding it back to _active.

Putting these all together, I propose v3 of the patch.

Another aspect that puzzles me is the repeated test that
waitpid() really returns the pid you asked for. How could it
not? If it fails, you get an os.error.
msg25027 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-03-24 07:56
Logged In: YES 
user_id=33168

It makes sense to remove from _active on ECHILD.

I wondered the same thing about waitpid(), but left it as it
was.  I don't believe it's possible for waitpid to return
any pid other than what you ask for unless the O/S is very,
very broken.

This patch is fine with me, feel free to check it in.  BTW,
nice comments and precondition checks in the test.
msg25028 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-03-24 08:16
Logged In: YES 
user_id=21627

Committed as 43286. I also added .cmd to Popen4.
msg25029 - (view) Author: cheops (atila-cheops) Date: 2006-04-10 14:55
Logged In: YES 
user_id=1276121

see patch # 1467770 for subprocess.py library module
msg25030 - (view) Author: Geoffrey Bache (gjb1002) Date: 2007-07-10 15:38
Did this get fixed in subprocess.py? The patches all seem to be for popen2.

I have been observing similar problems in subprocess.py, so I downloaded the test script and ran it with the -s option. It didn't work out of the box, I had to pass "shell=True" to the subprocess.Popen before it did anything at all. Which also made me wonder if the subprocess variant of this problem got forgotten.

Having done that, I have observed failures using Python 2.4.4 on Red Hat EL3 Linux, and also using Python 2.4.3 on Red Hat EL4 linux. Most of the time it works, sometimes it hangs forever, and sometimes we get something that look like this:

Started 20 threads
Exception in thread Thread-19:
Traceback (most recent call last):
  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "popen_bug.py", line 53, in run
    pipe.wait()
  File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
    pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

P.S. Googling for "[Errno 10] No child processes" suggests others have this problem. There have been long discussions on the Zope list as to why some people on Linux get exceptions that look like this, for example.

msg25031 - (view) Author: Geoffrey Bache (gjb1002) Date: 2007-07-10 16:00
As an additional note, I have also reproduced the popen problem using Python 2.4.4, though only once.
It seems to occur more frequently in subprocess.py.
msg25032 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-07-10 21:50
If you are seeing a bug in subprocess, please report it separately. This one has been fixed.

Please don't assume that it is the "same" problem as the one reported here, unless you have a working patch that proves that it is indeed a similar problem.
msg25033 - (view) Author: Geoffrey Bache (gjb1002) Date: 2007-07-11 12:46
From the original description:

"Related modules: A seemingly related error occurs with
Popen from the new subprocess module. Use the -s
option in the popen_bug.py script to test this."

I can be confident it is the same problem as I have used the exact 
same script to reproduce it, and got the exact same stack trace.

As I added, I have also reproduced the "popen" problem using Python 2.4.4 once, though
this is much harder to reproduce. So I claim this is not in fact fixed in all cases.

I can report separately if you really want, but it seems odd when the symptoms and test case
are the same.
History
Date User Action Args
2022-04-11 14:56:10adminsetgithub: 41862
2005-04-15 14:27:54tskogancreate