Issue 1041645
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2004-10-06 18:15 by benson_basis, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files

| File name | Uploaded | Description |
|---|---|---|
| ts.zip | benson_basis, 2004-10-06 18:15 | Sample application that demonstrates |
Messages (18)

msg22604 | Author: benson margulies (benson_basis) | Date: 2004-10-06 18:15
The PyGILState_Ensure mechanism appears to have a built-in heap-corrupting race condition. If a new thread calls PyGILState_Ensure, the code allocates a new 'tstate' for it on the heap. This allocation is not protected by any lock, so multiple racing threads can hit the heap at the same time and corrupt it. I have observed this with both 2.3 and 2.4a3.

I will attach a sample application. The application is Win32, but should be easy enough to adapt to Unix if someone cares to. Since the stated purpose of this mechanism, in PEP 311, is to allow any old thread to call into Python, I believe that the usage model here is legitimate.

To watch this explode, run the attached with arguments like, oh, "1 100 40" against the debug Python build.
---

msg22605 | Author: Nick Coghlan (ncoghlan) | Date: 2004-10-06 23:31

The attached program calls PyEval_InitThreads and Py_Initialize in the wrong order.
---

msg22606 | Author: Nick Coghlan (ncoghlan) | Date: 2004-10-06 23:32

I haven't checked whether that would fix the problem, since I'm currently on Linux.
---

msg22607 | Author: benson margulies (benson_basis) | Date: 2004-10-06 23:49

The documentation very specifically states that the order doesn't matter. Further, I've read the code, and the problem is unprotected data structures in the Python debug heap, which won't be any more protected in the other order. I'll switch them and report back, just in case.
---

msg22608 | Author: benson margulies (benson_basis) | Date: 2004-10-06 23:56

On the one hand, you are correct that changing the order improves the behavior. On the other hand, the doc is as I recalled it, asserting that you can call it first. Perhaps this should be registered as a doc problem? Quoting the documentation:

> This is a no-op when called for a second time. It is safe to call this function before calling Py_Initialize().
---

msg22609 | Author: Nick Coghlan (ncoghlan) | Date: 2004-10-07 00:15

Disregard the previous comment regarding PyEval_InitThreads and Py_Initialize - I was reading an older version of the documentation, which suggested the order of the calls mattered.

This does leak the constructed strings (no call to Py_XDECREF(trash)), but that shouldn't cause anything too catastrophic. However, the call to PyGILState_Release without a preceding call to PyGILState_Ensure looks dubious (it ends up deleting the main thread's thread state, which seems unhealthy).

Do the same symptoms occur when using the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS macros around the call to run_group()? (These macros are recommended when you *know* you already have the GIL. The PyGILState functions are for when you don't know whether you have the GIL or not.)
---

msg22610 | Author: Nick Coghlan (ncoghlan) | Date: 2004-10-07 00:43

*rereads bug description and comments* I think I'm on the right page now.

The offending call would appear to be PyMem_NEW inside PyThreadState_New. In a release build, this resolves into a direct call to malloc(), but in a debug build it resolves to _PyObject_DebugMalloc. Which is what you said in the bug report - I just read it incorrectly. Sorry for the confusion.

I just remembered that this came up a few weeks ago on python-dev, with a proposed solution (changing this particular call to use malloc() directly): http://mail.python.org/pipermail/python-dev/2004-September/048683.html
---

msg22611 | Author: Nick Coghlan (ncoghlan) | Date: 2004-10-07 00:46

I'd suggest assigning the bug report to Tim Peters (aka tim_one), to see if he still agrees with the fix he proposed last month.
---

msg22612 | Author: benson margulies (benson_basis) | Date: 2004-10-07 01:57

The rather funny-looking initialization was recommended by the author of the PEP. Once InitThreads is called, the GIL is held. To get it unheld, you have to release it. Here's his explanation:

> OK - here is my explanation, which you can adapt.
>
> A simple application is one that initializes Python, and then executes some Python code. This Python code may itself create other threads, or may indirectly cause other applications to "call back" into Python, but once the main thread returns, all the code has finished - your application terminates. With these applications, your init code need take no further action - just initialize Python and execute your code. Only your extension functions or callback entry points need manage the GIL state. python.exe is an example of this kind of app - it initializes Python, then executes the code specified on the command line.
>
> More complex applications may need to initialize Python and perform a little bit of initialization, but then do nothing else on that thread. An example would be where that initialization code bootstraps a few threads, then terminates immediately. The application, however, continues to live until explicitly terminated by the running code. In this scenario, you must use the following code - this code calls into Python. Once this call has been made, the thread still holds the GIL. This thread must release the GIL before other threads can work.
>
> [Technical aside: This works in the first case because the main thread, which holds the lock, continues to run. When it calls out to external libraries, and at periodic times during execution, the thread lock is unlocked, allowing other threads to run. Once the main thread returns, it does still hold the lock - but as the application is terminating, that is OK: no other threads are competing for the lock. In our complicated example, the main thread returning does *not* indicate application termination, so the lock must be manually dropped.]

That doesn't make a huge amount of sense, but I hope it is useful.

All this being said, I have two contradictory bits of evidence: 1) we all agree that the code, as written, would fail due to the unsafe debug malloc; 2) when I reverse the calls, it produces a very different situation. I think, perhaps, that the GIL never gets unlocked properly, and then instead of crashing everything just gets stuck somehow.

The change to call malloc seems pretty reasonable to me. Some tests in the tree seem called for on this. If order is not supposed to matter, the test should test both orders.
---

msg22613 | Author: benson margulies (benson_basis) | Date: 2004-10-07 02:24

I'm afraid that something is more seriously wrong here than just the malloc in Ensure. Here's what I did in our production code: I protected all calls to Ensure with a mutex. The result was a similar problem with the debug heap, happening inside code that was executing under the GIL. Given the structure of our code, I'm reasonably confident that there is no rogue thread running around outside the GIL. In fact, at the time of the explosion here, the other threads are all waiting their turn to get the GIL.

```
btpython23_d.dll!PyObject_Malloc(unsigned int nbytes=98) Line 581 + 0x6 C
btpython23_d.dll!_PyObject_DebugMalloc(unsigned int nbytes=82) Line 990 + 0x9 C
btpython23_d.dll!PyString_FromStringAndSize(const char * str=0x03176384, int size=50) Line 78 + 0xc C
btpython23_d.dll!string_slice(PyStringObject * a=0x03176368, int i=0, int j=50) Line 1006 + 0x17 C
btpython23_d.dll!PySequence_GetSlice(_object * s=0x03176368, int i1=0, int i2=50) Line 1218 + 0x12 C
btpython23_d.dll!apply_slice(_object * u=0x03176368, _object * v=0x00000000, _object * w=0x02fb7acc) Line 3806 + 0x11 C
btpython23_d.dll!eval_frame(_frame * f=0x031aeeb0) Line 1391 + 0x11 C
btpython23_d.dll!PyEval_EvalCodeEx(PyCodeObject * co=0x030c9130, _object * globals=0x03101ab8, _object * locals=0x00000000, _object * * args=0x0323650c, int argcount=2, _object * * kws=0x00000000, int kwcount=0, _object * * defs=0x030eed0c, int defcount=2, _object * closure=0x00000000) Line 2663 + 0x9 C
btpython23_d.dll!function_call(_object * func=0x0316eea8, _object * arg=0x032364f8, _object * kw=0x00000000) Line 509 + 0x40 C
btpython23_d.dll!PyObject_Call(_object * func=0x0316eea8, _object * arg=0x032364f8, _object * kw=0x00000000) Line 1755 + 0xf C
btpython23_d.dll!PyEval_CallObjectWithKeywords(_object * func=0x0316eea8, _object * arg=0x032364f8, _object * kw=0x00000000) Line 3346 + 0x11 C
btpython23_d.dll!PyErr_Warn(_object * category=0x0306da10, char * message=0x1e14df70) Line 627 + 0xf C
btpython23_d.dll!builtin_apply(_object * self=0x00000000, _object * args=0x03238a78) Line 80 + 0x10 C
btpython23_d.dll!PyCFunction_Call(_object * func=0x0305fb38, _object * arg=0x03238a78, _object * kw=0x00000000) Line 73 + 0xb C
btpython23_d.dll!call_function(_object * * * pp_stack=0x0364f2b0, int oparg=2) Line 3439 + 0xf C
btpython23_d.dll!eval_frame(_frame * f=0x0314a310) Line 2116 + 0xd C
btpython23_d.dll!PyEval_EvalCodeEx(PyCodeObject * co=0x03178868, _object * globals=0x03174818, _object * locals=0x00000000, _object * * args=0x0d558f0c, int argcount=2, _object * * kws=0x00000000, int kwcount=0, _object * * defs=0x00000000, int defcount=0, _object * closure=0x00000000) Line 2663 + 0x9 C
btpython23_d.dll!function_call(_object * func=0x0320f198, _object * arg=0x0d558ef8, _object * kw=0x00000000) Line 509 + 0x40 C
btpython23_d.dll!PyObject_Call(_object * func=0x0320f198, _object * arg=0x0d558ef8, _object * kw=0x00000000) Line 1755 + 0xf C
btpython23_d.dll!instancemethod_call(_object * func=0x0320f198, _object * arg=0x0d558ef8, _object * kw=0x00000000) Line 2432 + 0x11 C
btpython23_d.dll!PyObject_Call(_object * func=0x0321fd38, _object * arg=0x030728f8, _object * kw=0x00000000) Line 1755 + 0xf C
btpython23_d.dll!PyObject_CallMethod(_object * o=0x0317a4f8, char * name=0x64c453f4, char * format=0x64c453f0, ...) Line 1844 + 0xf C
btrexcoreC230.dll!BT_REX_LP_PyProxy::Run(BT_Blackboard * ref=0x02eacc88) Line 136 + 0x1b C++
btrexcoreC230.dll!BT_Blackboard::Run() Line 166 + 0x29 C++
btrexcoreC230.dll!BT_REX_ContextImp::ProcessUTF16Buffer(const unsigned short * inbuf=0x00ab78bc, unsigned int inlen=673, int lid=30) Line 609 C++
basistechnology.rlp.dll!BasisTechnology.RLP.ContextImp.ProcessBuffer(String* data = "South African answers U.S. message in a bottle. JOHANNESBURG 08-22-1996 A South African boy is writing back to an American girl whose message in a bottle he found washed up on President Nelson Mandela's old prison island. But Carlo Hoffmann, an 11-year-old jailer's son who found the bottle on the beach at Robben Island off Cape Town after winter storms, will send his letter back by ordinary mail on Thursday, the post office said. It will be sent for free. Danielle Murray from Sandusky, Ohio, the same age as her new penfriend, asked for a reply from whoever received the message she flung on its journey months ago on the other side of the Atlantic Ocean. ", BasisTechnology.RLP.LanguageID __unnamed001 = ENGLISH) Line 84 + 0x33 bytes C++
ThreadStress.exe!BasisTechnology.RLPUnitTest.Worker.Go() Line 63 C#
```
---

msg22614 | Author: Nick Coghlan (ncoghlan) | Date: 2004-10-07 02:35

I'm well out of my depth now - I definitely recommend assigning the bug report to Tim. (I'd do it, but I can't edit bugs I didn't create.)
---

msg22615 | Author: Tim Peters (tim.peters) | Date: 2004-10-09 04:52

Random:

- Arguments "1 100 40" don't make sense on my box, as MAXIMUM_WAIT_OBJECTS is 64, so the call to WaitForMultipleObjects() fails immediately when passed 100 handles.
- The failures in a debug build are much more dramatic if the memory leak is repaired (as Nick said, "trash" leaks; freeing the memory again is much more effective at stressing the heap).
- Is benson_basis the KSR Benson M? If so, you'll appreciate that debugging someone else's thread problems is a lot harder than resolving the Banach-Tarski Paradox <wink>.
- While I was a fan of Mark's new thread API, I've never used it, so it may take some time to figure out what this is all about. Alas, spare time is in short supply.
---

msg22616 | Author: Tim Peters (tim.peters) | Date: 2004-10-09 05:04

For the record, in a release build this reliably dies, on a hyper-threaded 3.2GHz box, given arguments "200 64 100", with:

```
Fatal Python error: auto-releasing thread-state, but no thread-state for this thread
```

With respect to this failure, it doesn't matter in which order Py_Initialize() and PyEval_InitThreads() are done.
---

msg22617 | Author: Tim Peters (tim.peters) | Date: 2004-10-10 00:40

One problem here is that the portable TLS implementation isn't actually thread-safe - find_key() can get fatally confused if another thread is deleting a key at the same time. That appears to account for the release-build crashes in my immediately preceding comment. That should be repaired soon.
---

msg22618 | Author: Tim Peters (tim.peters) | Date: 2004-10-10 02:00

A fix for find_key() was checked in, and the test program (repaired to plug the memory leaks) afterwards passed a long release-build run with arguments "5000 64 1000".
---

msg22619 | Author: benson margulies (benson_basis) | Date: 2004-10-10 02:24

I am indeed the benson in question. Hi there. I seem doomed to be in the thread business. b i m 2 0 0 4 splat b a s i s t e c h blot c o m

I'll try to backport these fixes into our 2.3 codebase. I'd appreciate your view on the following question: given a choice between protecting all of our calls into Python with a mutex, and on the other hand picking up these fixes and using the GIL, does Python manage to internally unlock enough to make it worth the bother?
---

msg22620 | Author: Tim Peters (tim.peters) | Date: 2004-10-10 02:50

Hi, Benson! Long time.

Quickie on your question: it depends on your app. In general, the Python implementation releases the GIL around (and only around) potentially blocking C calls, when it's safe to do so. So think I/O here - operations on files and sockets, and miscellaneous operations like sleep() and select(). That can be extremely helpful if you're doing I/O, and more helpful the longer the I/O latency. If you're doing pure computation, it doesn't help at all - unless you're also using, e.g., a C or Fortran extension module that releases the GIL around its internal computations.
---

msg22621 | Author: Tim Peters (tim.peters) | Date: 2004-10-10 02:58

OK, PyInterpreterState_New() and PyThreadState_New() were changed to call the platform malloc()/free() directly, and that appears to fix the remaining problems in a debug build. Two debug runs of the test program, with memory leaks repaired, using arguments "5000 64 100", completed without incident. One did initialization in the order Py_Initialize(); PyEval_InitThreads(); and the other in the order PyEval_InitThreads(); Py_Initialize(). I saw no difference in behavior between those runs.

There's not enough info in the 2004-10-06 22:24 comment to know what that was about, so I'm closing this bug report. If you continue to see problems, please open a new bug report (or reopen this one - but this one is pretty "busy" already).

Note that the problem with find_key() was horrid, and could lead to arbitrary memory corruption.
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:56:07 | admin | set | github: 40988 |
| 2004-10-06 18:15:38 | benson_basis | create | |