Issue495401
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python Developer's Guide.
Created on 2001-12-20 13:24 by doerwalter, last changed 2022-04-10 16:04 by admin. This issue is now closed.
Files
File name | Uploaded | Description
---|---|---
PythonSegFault.txt | doerwalter, 2001-12-20 13:24 | build log
unicode.diff | loewis, 2001-12-30 01:44 |
unicode2.diff | loewis, 2001-12-30 10:26 |
time_utf8.zip | loewis, 2002-02-11 17:07 |
unicode3.diff | loewis, 2002-04-11 06:26 |
utim3.py | loewis, 2002-04-11 06:27 |
Messages (32)
---|---|---|---
msg8350 - (view) | Author: Walter Dörwald (doerwalter) * | Date: 2001-12-20 13:24 | |
The build process segfaults with the current CVS version when using --with-pymalloc. System is SuSE Linux 7.0:

> uname -a
Linux amazonas 2.2.16-SMP #1 SMP Wed Aug 2 20:01:21 GMT 2000 i686 unknown
> gcc -v
Reading specs from /usr/lib/gcc-lib/i486-suse-linux/2.95.2/specs
gcc version 2.95.2 19991024 (release)

Attached is the complete build log. |
|||
msg8351 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2001-12-22 14:42 | |
Logged In: YES user_id=21627 Did that produce a core file? If so, can you attach a gdb backtrace as well? |
|||
msg8352 - (view) | Author: Walter Dörwald (doerwalter) * | Date: 2001-12-23 12:24 | |
Logged In: YES user_id=89016 Unfortunately no core file was generated. Can I somehow force core file generation? |
|||
msg8353 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2001-12-28 22:30 | |
Logged In: YES user_id=6380 My prediction: this is irreproducible. Lowering the priority accordingly. |
|||
msg8354 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2001-12-29 09:23 | |
Logged In: YES user_id=21627 At least I cannot reproduce it, on SuSE 7.3. Can you retry this, building from a clean source tree (no .o files, no build directory)? |
|||
msg8355 - (view) | Author: Walter Dörwald (doerwalter) * | Date: 2001-12-29 10:53 | |
Logged In: YES user_id=89016 OK, I did a "make distclean", which removed the .o files and the build directory, and redid:

./configure --enable-unicode=ucs4 --with-pymalloc && make && make altinstall

The build process still crashes in the same spot:

Compiling /usr/local/lib/python2.2/test/test_urlparse.py ...
make: *** [libinstall] Segmentation fault

I also retried with a fresh untarred Python-2.2.tgz. This shows the same behaviour. |
|||
msg8356 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2001-12-29 14:52 | |
Logged In: YES user_id=6380 Aha! The --enable-unicode=ucs4 is more suspicious than the --with-pymalloc. I had missed that info when this was first reported. Not that I'm any closer to solving it... :-( |
|||
msg8357 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2001-12-29 16:00 | |
Logged In: YES user_id=21627 Ok, I can reproduce it now; I did not 'make install' before. Here is a gdb backtrace:

#0 _PyCore_ObjectMalloc (nbytes=33) at Objects/obmalloc.c:417
#1 0x805885c in PyString_FromString (str=0x816c6e8 "checkJoin") at Objects/stringobject.c:136
#2 0x805d772 in PyString_InternFromString (cp=0x816c6e8 "checkJoin") at Objects/stringobject.c:3640
#3 0x807c9c6 in com_addop_varname (c=0xbfffe87c, kind=0, name=0x816c6e8 "checkJoin") at Python/compile.c:939
#4 0x807de24 in com_atom (c=0xbfffe87c, n=0x816c258) at Python/compile.c:1478
#5 0x807f01c in com_power (c=0xbfffe87c, n=0x816c8b8) at Python/compile.c:1846
#6 0x807f545 in com_factor (c=0xbfffe87c, n=0x816c898) at Python/compile.c:1975
#7 0x807f56c in com_term (c=0xbfffe87c, n=0x816c878) at Python/compile.c:1985
#8 0x807f6bc in com_arith_expr (c=0xbfffe87c, n=0x816c858) at Python/compile.c:2020
#9 0x807f7dc in com_shift_expr (c=0xbfffe87c, n=0x816c838) at Python/compile.c:2046
#10 0x807f8fc in com_and_expr (c=0xbfffe87c, n=0x816c818) at Python/compile.c:2072
#11 0x807fa0c in com_xor_expr (c=0xbfffe87c, n=0x816c7f8) at Python/compile.c:2094
...

The access that crashes is *(block **)bp, since bp is 0x2. Not sure how that happens; I'll investigate (but would appreciate a clue). It seems that the pool chain got corrupted. |
|||
msg8358 - (view) | Author: Tim Peters (tim.peters) * | Date: 2001-12-29 18:43 | |
Logged In: YES user_id=31435 Ouch. Boosted priority back to 5, since Martin can reproduce it. Alas, where pymalloc got called *from* is almost certainly irrelevant -- we're seeing the end result of earlier corruption. Note that pymalloc is unusually sensitive to off-by-1 stores, since the chunks it hands out are contiguous (there's no hidden bookkeeping padding between them).

Plausible: an earlier bogus store went beyond the end of its allocated chunk, overwriting the "next free block" pointer at the start of a previously free()'ed chunk of the same size (rounded up to a multiple of 8; 40 bytes in this case). At the time this blows up, bp is supposed to point to a previously free()'ed chunk of size 40 bytes (if there were none free()'ed and available, the earlier "pool != pool->nextpool" guard should have failed). The first 4 bytes (let's simplify by assuming this is a 32-bit box) of the free chunks link the free chunks together, most recently free()'ed at the start of the (singly linked) list. So the code at this point is intent on returning bp, and "pool->freeblock = *(block **)bp" is setting the 40-byte-chunk list header's idea of the *next* available 40-byte chunk. But bp is bogus.

The value of bp is gotten out of the free list headers, the static array usedpools. This mechanism is horridly obscure, an array of pointer pairs that, in effect, capture just the first two members of the pool_header struct, once for each chunk size. It's possible that someone is overwriting usedpools[4 + 4]->freeblock directly with 2, but that seems unlikely. More likely is that a free() operation linked a 40-byte chunk into the list headed at usedpools[4 + 4]->freeblock correctly, and a later bad store overwrote the first 4 bytes of the free()'ed block with 2. Then the "pool->freeblock = *(block **)bp" near the start of an unexceptional pymalloc would copy the 2 into the list header's freeblock without complaint.

The error wouldn't show up until a subsequent malloc tried to use it. So that's one idea to get closer to the cause: add code to dereference pool->freeblock, before the "return (void *) bp". If that blows up earlier, then the first four bytes of bp were corrupted, and that gives you a useful data breakpoint address for the next run. If it doesn't blow up earlier, the corruption will be harder to find, but let's count on being lucky at first <wink>. |
|||
msg8359 - (view) | Author: Michael Hudson (mwh) | Date: 2001-12-29 21:29 | |
Logged In: YES user_id=6656 I don't know if these are helpful observations or not, but anyway: (1) it doesn't core without the --enable-unicode=ucs4 option (2) if you just run "make altinstall" the library files are installed *and compiled* before the dynamically linked modules are built. Then we don't crash. (3) if you run "make altinstall" again, we crash. If you initially ran "make && make install", we crash. (4) when we crash, it's not long after the unicode tests are compiled. Are these real clues or just red herrings? I'm afraid I can't tell :( |
|||
msg8360 - (view) | Author: Michael Hudson (mwh) | Date: 2001-12-29 22:01 | |
Logged In: YES user_id=6656 Hmm. I now think that the stuff about extension modules is almost certainly a red herring. What I said about "make && make altinstall" vs "make altinstall" still seems to be true, though. If you compile with --with-pydebug, you crash right at the end of the second (-O) run of compileall.py -- I suspect this is something else, but it might not be. |
|||
msg8361 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2001-12-29 23:13 | |
Logged In: YES user_id=21627 It's a Heisenbug. I found that even the slightest modifications to the Python source make it come and go, or appear at different places. On my system, the crashes normally occur in the first run (.pyc). So I don't think the order of make invocations is the source of the problem. It's likely as Tim says: somebody overwrites memory somewhere. Unfortunately, even though I can reproduce crashes for the same pool, for some reason, my gdb memory watches don't trigger. Tim's approach of checking whether the value came from following the free list did not help, either: the bug disappeared under the change. |
|||
msg8362 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2001-12-30 01:44 | |
Logged In: YES user_id=21627 I've found one source of troubles, see the attached unicode.diff. Guido's intuition was right; it was a UCS-4 problem: EncodeUTF8 would over-allocate 3*size bytes, but can actually write 4*size in the worst case, which occurs in test_unicode. I'll leave the patch for review and experiments; it fixes the problem for me. The existing adjustment for surrogates is pointless, IMO: for the surrogate pair, it will allocate 6 bytes of UTF-8 in advance, which is more than actually needed. |
|||
msg8363 - (view) | Author: Tim Peters (tim.peters) * | Date: 2001-12-30 03:54 | |
Logged In: YES user_id=31435 Good eye, Martin! It's clearly possible for the unpatched code to write beyond the memory allocated. The one thing that doesn't jibe is that you said bp is 0x2, which means 3 of its 4 bytes are 0x00, but UTF-8 doesn't produce 0 bytes except for one per \u0000 input character. Right? So, if this routine is the cause, where are the 0 bytes coming from? (It could be test_unicode sets up a UTF-8 encoding case with several \u0000 characters, but if so I didn't stumble into it.) Plausible: when a new pymalloc "page" is allocated, the 40-byte chunks in it are *not* linked together at the start. Instead a NULL pointer is stored at just the start of "the first" 40-byte chunk, and pymalloc-- on subsequent mallocs --finds that NULL and incrementally carves out additional 40-byte chunks. So as a startup-- but not a steady-state --condition, the "next free block" pointers will very often be NULLs, and then if this is a little-endian machine, writing a single 2 byte at the start of a free block would lead to a bogus pointer value of 0x2. About a fix, I'm in favor of junking all the cleverness here, by allocating size*4 bytes from the start. It's overallocating in all normal cases already, so we're going to incur the expense of cutting the result string back anyway; how *much* we overallocate doesn't matter to speed, except that if we don't have to keep checking inside the loop, the code gets simpler and quicker and more robust. The loop should instead merely assert that cbWritten <= cbAllocated at the end of each trip. Had this been done from the start, a debug build would have assert-failed a few nanoseconds after the wild store. |
|||
msg8364 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2001-12-30 10:26 | |
Logged In: YES user_id=21627 I disabled threading, which fortunately gave me memory watchpoints back. Then I noticed that the final *p=0 corrupted a non-NULL freeblock pointer, slightly decreasing it. Then, following the freeblock pointer, freeblock was changed to a bogus block, which had its next pointer as garbage. I had to trace this from the back, of course. As for overallocation, I wonder whether the UTF-8 encoding should overallocate at all. unicode2.diff is a modification where it runs over the string twice, counting the number of needed bytes the first time. This is likely slower (at least if no reallocations occur), but doesn't waste that much memory (I notice that pymalloc will never copy objects when they shrink, to return the extra space). |
|||
msg8365 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2001-12-31 12:48 | |
Logged In: YES user_id=38388 I like the unicode.diff, but not the unicode2.diff. Instead of fixing the UTF-8 codec here we should fix the pymalloc implementation, since overallocation is a common thing to do and not only used in codecs. (Note that all Python extensions will start to use pymalloc too once we enable it by default.) I don't know whether it's relevant, but I found that core dumps can easily be triggered by mixing the various memory allocation APIs in Python and the C lib. The bug may not only be related to the UTF-8 codec but may also linger in some other extension modules. |
|||
msg8366 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2001-12-31 13:46 | |
Logged In: YES user_id=21627 MAL, I'm 100% positive that the crash on my system was caused by the UTF-8 encoding; I've seen it in the debugger overwrite memory that it doesn't own. As for unicode.diff: Tim has proposed that this should not be done, but that 4*size should be allocated in advance. What do you think? On unicode2.diff: If pymalloc was changed to shrink the memory, it would have to copy the original string, since it would likely be in a different size class. This is less efficient than the approach taken in unicode2.diff. What specifically is it that you dislike about first counting the memory requirements? It actually simplifies the code. Notice that the current code is still buggy with regard to surrogates. If there is a high surrogate, but not a low one, it will write bogus UTF-8 (with no lead byte). This is fixed in unicode2.diff as well. |
|||
msg8367 - (view) | Author: Tim Peters (tim.peters) * | Date: 2001-12-31 19:36 | |
Logged In: YES user_id=31435 Martin, I like your second patch fine, but would like it a lot better with assert(p - PyString_AS_STRING(v) == cbAllocated); at the end. Else a piddly change in either loop can cause more miserably hard-to-track-down problems. |
|||
msg8368 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2002-02-06 18:12 | |
Logged In: YES user_id=38388 I've checked in a patch which fixes the memory allocation problem. Please give it a try and tell me whether this fixes your problem, Walter. If so, I'd suggest closing the bug. |
|||
msg8369 - (view) | Author: Walter Dörwald (doerwalter) * | Date: 2002-02-07 09:58 | |
Logged In: YES user_id=89016 I tried the current CVS and make altinstall runs to completion. |
|||
msg8370 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2002-02-07 22:59 | |
Logged In: YES user_id=21627 I still think the current algorithm has serious problems as it is based on overallocation, and that it should be replaced with an algorithm that counts the memory requirements upfront. This will be particularly important for pymalloc, but will also avoid unnecessary copies for many other malloc implementations. |
|||
msg8371 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2002-02-08 22:27 | |
Logged In: YES user_id=38388 Good news, Walter. Martin: As I explained in an earlier comment, pymalloc needs to be fixed to better address overallocation. The two-pass logic would avoid overallocation, but at a high price. Copying memory (if at all needed) is likely to be *much* faster. The UTF-8 codec has to be as fast as possible since it is one of the most used codecs in Python's Unicode implementation. Also note that I have reduced overallocation to 2*size in the codec. I suggest closing the bug report. |
|||
msg8372 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2002-02-11 17:06 | |
Logged In: YES user_id=21627 Marc: Please have a look at pymalloc; it cannot be "fixed". It is in the nature of a pool allocator that you have to copy when you want to move between pools - or you have to waste the extra space. I agree that UTF-8 coding needs to be fast; that's why I wrote this patch. I've revised it to fit the current implementation, and to add the assert that Tim has requested (unicode3.diff). For the test case time_utf8.zip (which is a UTF-8 converted Misc/ACKS), the version that first counts the size is about 10% faster on my system (Linux glibc 2.2.4) (see timings inside time_utf8.py; #592 is the patched version). So the price for counting the size turns out to be negligible, and is more than compensated for by the reduction of calls to the memory management system. |
|||
msg8373 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2002-02-11 17:49 | |
Logged In: YES user_id=38388 I get different timings (note that you have to use time.clock() for benchmarks, not time.time()): without your patch: 0.470 seconds with your patch: 0.960 seconds This is on Linux with pgcc 2.95.2, glibc 2.2, without pymalloc (which is the normal configuration). |
|||
msg8374 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2002-02-11 18:04 | |
Logged In: YES user_id=21627 time.clock vs. time.time does not make a big difference on an unloaded machine (except time.time has a higher resolution). Can you please run the test 10x more often? I then get 12.520 clocks with CVS python, glibc 2.2.4, gcc 2.95, and 10.890 with my patch. |
|||
msg8375 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2002-02-11 18:42 | |
Logged In: YES user_id=38388 Ok, with 100000 loops and time.clock() I get: 4.690 - 4.710 without your patch, 9.560 - 9.570 with your patch (again, without pymalloc and the same compiler/machine on SuSE 7.1). |
|||
msg8376 - (view) | Author: Tim Peters (tim.peters) * | Date: 2002-02-11 20:06 | |
Logged In: YES user_id=31435 time.time() sucks for benchmarking on Windows (updates at about 18Hz). Running the test as-is, MSVC6 and Win98SE, it's 1.3 seconds with current CVS, and 1.7 with unicode3.diff. The quantization error in Windows time.time() is > 0.05 seconds, so no point pretending there are 3 significant digits there; luckily(?), it's apparent there's a major difference with just 2 digits. MAL, are you still using an AMD box? In a decade, nobody else has ever reproduced the timing results you see <wink>. |
|||
msg8377 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2002-02-11 20:50 | |
Logged In: YES user_id=38388 Tim: Yes, I'm still all AMD based... it's an Athlon 1200 I'm running. PGCC (the Pentium GCC group's version) has a special AMD optimization mode for the Athlon, which is what I'm using. Somebody has to hold up the flag against the Wintel camp ;-) Hmm, I could do the same tests on my notebook, which runs on one of those Inteliums. Maybe tomorrow... |
|||
msg8378 - (view) | Author: Tim Peters (tim.peters) * | Date: 2002-02-11 21:06 | |
Logged In: YES user_id=31435 MAL, cool -- I saw a major slowdown using the patch too, but not nearly as dramatic as you saw, so was curious about what could account for that. Chip, compiler and OS can all have major effects. I assume Martin is using a Pentium box, so assuming your notebook is running Linux too, it would be good to get another LinTel datapoint. |
|||
msg8379 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2002-02-13 12:39 | |
Logged In: YES user_id=38388 Tim, I ran the test on my notebook and guess what: when compiling Python with -mcpu=pentium I get 58 vs. 59.8 (with / without patch); when compiling Python with -mcpu=i686, it's 65.8 vs. 60.17 (with / without patch). No idea what chip is used in the notebook. It's pretty old, though. I used gcc 2.95.2 and some old SuSE Linux version (glibc 2). Would be interesting to check this on a modern Pentium 4 machine and maybe a more recent Sun SPARC machine. Oh yes, and your Cray, of course ;-) |
|||
msg8380 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2002-04-11 06:26 | |
Logged In: YES user_id=21627 I've revised the patch unicode3.diff to the current code base (it still includes the assertion at the end); I've also added a new timing test (utim3.py) which considers the following cases:

- a string consisting of only spaces
- a string consisting of only spaces, and a single character that needs three bytes in UTF-8
- a string consisting only of characters that need three bytes.

For all three cases, it tests various sizes of the string, both below and above the pymalloc threshold. For the current CVS (unicodeobject.c 2.136: MAL's change to use a variable overalloc), I get:

10 spaces 20.060
100 spaces 2.600
200 spaces 2.030
1000 spaces 0.930
10000 spaces 0.690
10 spaces, 3 bytes 23.520
100 spaces, 3 bytes 3.730
200 spaces, 3 bytes 2.470
1000 spaces, 3 bytes 0.980
10000 spaces, 3 bytes 0.690
30 bytes 24.800
300 bytes 5.220
600 bytes 3.830
3000 bytes 2.480
30000 bytes 2.230

With unicode3.diff, I get:

10 spaces 19.940
100 spaces 3.260
200 spaces 2.340
1000 spaces 1.650
10000 spaces 1.450
10 spaces, 3 bytes 21.420
100 spaces, 3 bytes 3.410
200 spaces, 3 bytes 2.420
1000 spaces, 3 bytes 1.660
10000 spaces, 3 bytes 1.450
30 bytes 22.260
300 bytes 5.830
600 bytes 4.700
3000 bytes 3.740
30000 bytes 3.540

So it appears that unicode3.diff is more efficient for short strings, and overallocation is more efficient for long strings. Since overallocation may waste memory, I still recommend integrating unicode3.diff. |
|||
msg8381 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2002-04-20 13:46 | |
Logged In: YES user_id=21627 I have now applied unicode3.diff as unicodeobject.c 2.140, adding the assertion that the allocated and the used number of bytes are identical. If this problem ever occurs again, the assertion will indicate what the problem is. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:04:48 | admin | set | github: 35799 |
2001-12-20 13:24:34 | doerwalter | create |