This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: A simple 3-4% speed-up for PCs
Type: Stage:
Components: Interpreter Core Versions: Python 2.4
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: tim.peters Nosy List: arigo, rhettinger, tim.peters
Priority: normal Keywords: patch

Created on 2004-04-28 18:33 by arigo, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name        Uploaded                  Description
oparg-opt.diff   arigo, 2004-04-28 18:33   eval_frame opcode/oparg optimizations
asm-locals.diff  arigo, 2004-05-10 10:48   Put the two main locals into registers
sp0.diff         arigo, 2004-06-17 07:35   Re-generated patch, without 386-specific optimizations
Messages (10)
msg45864 - Author: Armin Rigo (arigo) (Python committer) Date: 2004-04-28 18:33
The result of a few experiments looking at the assembler produced by gcc for eval_frame():

* on PCs, reading the argument as a single unsigned short instead of as two separate bytes is a good win.

* oparg is more "local" with this patch: its value doesn't need to be saved across an iteration of the main loop, allowing it to live in a register only.

* added an explicit "case STOP_CODE:" so that the switch starts at 0 instead of 1 -- that's one instruction less with gcc.

* moving the argument fetch into the start of each case whose operation expects an argument does not seem to pay off, even though it removes the unpredictable branch "if (HAS_ARG(op))".
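A toy dispatch loop may make the first three points easier to picture. It is a sketch under stated assumptions, not eval_frame(): the opcode values, their meanings, and the function name run() are invented here; only the convention that opcodes at or above HAVE_ARGUMENT carry a two-byte little-endian argument follows CPython 2.x. Whether the explicit case 0 really saves an instruction depends on the compiler, as noted above for gcc.

```c
#include <assert.h>

/* Invented opcode numbers for this sketch; only the "opcodes >= HAVE_ARGUMENT
   carry a two-byte little-endian argument" convention mirrors CPython 2.x. */
enum { STOP_CODE = 0, POP_TOP = 1, ADD_ARG = 90, PUSH_ARG = 100 };
#define HAVE_ARGUMENT 90
#define HAS_ARG(op) ((op) >= HAVE_ARGUMENT)

/* Toy dispatch loop: returns the top of the stack when STOP_CODE is reached
   (0 if the stack is empty), or -1 on an unknown opcode. */
long run(const unsigned char *next_instr)
{
    long stack[16];
    long *stack_pointer = stack;

    for (;;) {
        int opcode = *next_instr++;
        int oparg = 0;  /* scoped to one iteration, so it never has to be
                           preserved across the loop and can stay in a register */
        if (HAS_ARG(opcode)) {          /* the unpredictable branch */
            oparg = (next_instr[1] << 8) + next_instr[0];
            next_instr += 2;
        }
        switch (opcode) {
        case STOP_CODE:   /* explicit case 0: the case range starts at 0 */
            return stack_pointer > stack ? stack_pointer[-1] : 0;
        case POP_TOP:
            assert(stack_pointer > stack);
            --stack_pointer;
            break;
        case ADD_ARG:
            assert(stack_pointer > stack);
            stack_pointer[-1] += oparg;
            break;
        case PUSH_ARG:
            assert(stack_pointer < stack + 16);
            *stack_pointer++ = oparg;
            break;
        default:
            return -1;
        }
    }
}
```

For instance, the byte sequence {PUSH_ARG, 5, 0, ADD_ARG, 3, 1, STOP_CODE} returns 5 + 0x0103 = 264.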

This patch should be timed on other platforms to make sure that it doesn't slow things down. If it does, the unsigned-short read of the argument could be checked in on its own, since it is compiled only where shorts are 2 bytes in little-endian order.
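The unsigned-short read and its compile-time guard can be sketched as follows. This is an illustration under assumptions, not the patch itself: it uses memcpy instead of casting next_instr to an unsigned short pointer, which sidesteps the unaligned-access problem raised later in this thread, and it assumes the gcc/clang predefined macros __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ for the endianness test. The names read_arg_bytes(), read_arg_short() and READ_ARG are hypothetical.

```c
#include <limits.h>
#include <string.h>

/* Portable form: assemble the little-endian argument from two bytes. */
unsigned int read_arg_bytes(const unsigned char *p)
{
    return ((unsigned int)p[1] << 8) | p[0];
}

/* Single 16-bit read.  memcpy is safe at any alignment, and optimizing
   compilers typically turn it into one load instruction. */
unsigned int read_arg_short(const unsigned char *p)
{
    unsigned short s;
    memcpy(&s, p, sizeof s);
    return s;
}

/* Use the short read only where unsigned short is 16 bits and the byte
   order is little-endian (assumes gcc/clang's __BYTE_ORDER__ macros;
   any other compiler silently gets the portable form). */
#if USHRT_MAX == 0xFFFF && defined(__BYTE_ORDER__) && \
    __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define READ_ARG(p) read_arg_short(p)
#else
#define READ_ARG(p) read_arg_bytes(p)
#endif
```

Both paths return the same value; the guard only decides which instruction sequence the compiler sees.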

By the way, does anyone know why 'stack_pointer' isn't a 'register' local? I bet it would make a difference on PowerPC, for example, with compilers that care about this keyword.
msg45865 - Author: Armin Rigo (arigo) (Python committer) Date: 2004-04-28 21:02

stack_pointer isn't a register because its address is taken in two places. This is a really bad idea for optimization. Instead of &stack_pointer, we should do:

PyObject **sp = stack_pointer;
/* ... pass &sp wherever &stack_pointer was used ... */
stack_pointer = sp;

I'm pretty sure this simple change along with a 'register' declaration of stack_pointer gives a good speed-up on all architectures with plenty of registers.
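The copy-in/copy-out idea above can be sketched in plain C. The helper push_via() and the function sum_pushed() are hypothetical: push_via() stands in for the two places that needed &stack_pointer, and long stands in for PyObject *. Note that C forbids applying & to a variable declared register, so the temporary is what makes the register declaration legal at all.

```c
/* Hypothetical helper standing in for code that needed &stack_pointer:
   it pushes a value through a pointer-to-pointer. */
static void push_via(long **spp, long value)
{
    *(*spp)++ = value;
}

/* Pushes up to 32 values via push_via(), then pops and sums them. */
long sum_pushed(const long *values, int n)
{
    long stack[32];
    register long *stack_pointer = stack;  /* &stack_pointer would not compile */
    long total = 0;
    int i;

    for (i = 0; i < n && i < 32; i++) {
        long *sp = stack_pointer;   /* copy out to an addressable temporary */
        push_via(&sp, values[i]);   /* only sp's address escapes */
        stack_pointer = sp;         /* copy back */
    }
    while (stack_pointer > stack)
        total += *--stack_pointer;
    return total;
}
```

Because only sp's address escapes, the compiler remains free to keep stack_pointer itself in a register between those calls.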

For PCs I've experimented with forcing one or two locals into specific registers, with the gcc syntax asm("esi"), asm("ebx"), etc. Forcing stack_pointer and next_instr gives another 3-4% improvement.

The next step is to see if this can be done with #if's for common compilers besides gcc.
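One way to keep the gcc-only syntax behind an #if is sketched below. The macro names REG_SP and REG_NEXT, the function sum_bytes(), and the choice of esi/edi are assumptions for illustration; clang is excluded to keep the sketch conservative, and every other compiler falls back to a plain register declaration. Recent gcc documentation describes specifying registers for the operands of extended asm as the only supported use of local register variables, so outside that case the compiler may still move the value.

```c
/* GNU C explicit register variables, only for gcc on 32-bit x86.
   The register names are an assumption for this sketch. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
#define REG_SP   __asm__("esi")
#define REG_NEXT __asm__("edi")
#else
#define REG_SP                    /* other compilers: plain `register` */
#define REG_NEXT
#endif

/* Hypothetical loop touching both "hot" pointers on every pass. */
long sum_bytes(const unsigned char *first_instr, int n)
{
    long stack[1] = {0};
    register const unsigned char *next_instr REG_NEXT = first_instr;
    register long *stack_pointer REG_SP = stack;

    while (n-- > 0)
        *stack_pointer += *next_instr++;
    return stack[0];
}
```

The #else branch is the one most builds will see, which keeps the code portable while letting gcc on x86 pin the two variables.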
msg45866 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2004-04-28 23:45

With MSVC++ 6.0 under WinME on a Pentium III, there is no
change in timing (measurements accurate within 0.25%).

I wonder if the speedup from retrieving the unsigned short
is offset by alignment penalties when the starting address
is odd.
msg45867 - Author: Armin Rigo (arigo) (Python committer) Date: 2004-05-10 10:48

The short trick might be a bit fragile.  For example, the current patch would incorrectly use it on machines where unaligned accesses are forbidden.

I isolated the other issue I talked about (making stack_pointer a register variable) in a separate patch.  This patch alone is clearly safe.  It should give a bit of speed-up on any machine but it definitely gives 5% on PCs with gcc by forcing the two most important local variables into specific registers.  (If someone knows the corresponding syntax for other compilers, it can be added in the #if.)
msg45868 - Author: Armin Rigo (arigo) (Python committer) Date: 2004-05-10 14:26

Tested on a MacOSX box, the patch also gives a 5% speed-up
there. Allowing stack_pointer to be in a register is a very
good idea. (All tests were run with Pystone.)
msg45869 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2004-05-12 15:27

Tim, I remember you having some opinions about this sort of
optimization. Will you take a brief look at Armin's latest
patch?
msg45870 - Author: Armin Rigo (arigo) (Python committer) Date: 2004-05-21 15:44

The part with gcc-specific keywords is useful but arguably
scary; the other part of the patch (stack_pointer not
being assignable to a register) fixes a definite
performance bug, in my opinion. I'd even suggest
back-porting that one to 2.3. As far as I can tell, Apple
is more likely to ship its next MacOSX release with the
latest 2.3 than with 2.4.
msg45871 - Author: Armin Rigo (arigo) (Python committer) Date: 2004-06-17 07:35

The patch no longer applies cleanly, so here it is again. This one is minimalistic and does not contain the 386-specific register tweaks. It only allows two variables (stack_pointer and oparg) to be stored in registers instead of on the stack on machines that have enough of them. I still regard this as a small performance bugfix and suggest back-porting it.
msg45872 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2004-06-17 08:50

Very nice.  Code passes review, passes regression tests, and
the timings were confirmed (Pentium III running Win ME with
MSVC++ 6.0).  Please apply.

Though this patch is very clean, we do not backport
performance tweaks.  The only exception would be to repair
devastatingly bad performance.  Let this be some incentive
to step up to Py2.4.
msg45873 - Author: Armin Rigo (arigo) (Python committer) Date: 2004-06-17 10:23

Checked in as ceval.c 2.400.

Let's forget about the GCC-specific extension and close the patch.
History
Date                 User   Action  Args
2022-04-11 14:56:03  admin  set     github: 40192
2004-04-28 18:33:16  arigo  create