This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: heapq should provide iterators: itersmallest, iterlargest
Type: enhancement Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: rhettinger, scoder
Priority: normal Keywords:

Created on 2005-03-07 14:15 by scoder, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg54410 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2005-03-07 14:15
The current C-based heapq implementation provides
implementations of min-heaps and max-heaps for the
"nsmallest" and "nlargest" functions. I would like to
see them used to provide iterators over the
smallest/largest items of a heap: "itersmallest(heap)"
and "iterlargest(heap)".

Why:

The functions nsmallest() and nlargest() are efficient
(and sufficient) if n is known. However, if n is *not*
known in advance, it would still be more efficient than
sorting to run heapify() over a list and then have an
iterator call heappop() on each step. While this is
easily implemented for min-heaps using heappop, max-heaps
are not part of the Python implementation; currently,
they are only implemented in C.
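The pattern described above, heapify once and then pop lazily, can be sketched in a few lines with the existing heapq functions (itersmallest is the proposed name, not an existing function):

```python
import heapq
from itertools import islice

def itersmallest(data):
    """Yield the items of the list *data* in ascending order.

    Note the side effect discussed below: the list is heapified
    and drained in place as the iterator is consumed.
    """
    heapq.heapify(data)            # O(n), in place
    while data:
        yield heapq.heappop(data)  # O(log n) per item

# Taking only the first k items costs O(n + k log n), which beats a
# full O(n log n) sort when k turns out to be small.
print(list(islice(itersmallest([5, 1, 9, 3, 7]), 2)))  # -> [1, 3]
```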

Possible counter-arguments:

The straightforward iterator implementation will
mutate the heap. This could be seen as a
side effect; it should, however, be enough to document
that. Similar caveats are documented for mutating
sequences during iteration.
msg54411 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-03-07 16:22

The idea of introducing a heapiter() function was
unsuccessfully proposed on python-dev:

 
http://mail.python.org/pipermail/python-dev/2004-June/045348.html

The use cases were rare and the pure Python solution was
both easy and fast.

If you need C coded max-heaps, that could be a separate
feature request.  There would need to be sufficient use
cases to warrant building out the module's API.
msg54412 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2005-03-08 09:20

"easy and fast" was the python solution for min-heaps, you
mean. Their API is sufficient for a few-line iterator
implementation. The possible solutions for max-heaps are
comparatively ugly and generally less efficient, the best
ways being either a custom per-case solution
(decorate-useheap-undecorate) or a generic indirection
through an item wrapper class that mirrors the __le__
operator with the additional overhead of python's method
call infrastructure.
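The wrapper-class indirection mentioned above can be sketched as follows (iterlargest is the proposed name, and _Reversed is a hypothetical helper; note that the heapq machinery in current Python only requires __lt__, so the wrapper mirrors that comparison):

```python
import heapq

class _Reversed:
    """Hypothetical wrapper that inverts comparisons, turning the
    min-heap machinery into a max-heap.  Each comparison pays a
    Python-level method call, which is the overhead complained
    about above."""
    __slots__ = ("item",)

    def __init__(self, item):
        self.item = item

    def __lt__(self, other):           # heapq only needs __lt__
        return other.item < self.item  # reversed ordering

def iterlargest(data):
    """Yield the items of *data* in descending order."""
    heap = [_Reversed(x) for x in data]  # decorate
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap).item   # undecorate

print(list(iterlargest([4, 1, 7, 3])))  # -> [7, 4, 3, 1]
```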

I don't think max-heaps are any less common than
min-heaps. It just doesn't show as clearly, since custom
solutions are easy enough to write each and every time.

The major issue with the current heapq implementation is the
design choice of not making it a class. I agree that it is a
major advantage to have it operate on generic lists, but it
comes at the price of preventing easy integration of keyword
arguments as in sort() and sorted(). A usage like

h = heap(myIterator, reverse=True)
for item in h: print item

would be so handy...

For the use cases: as I said before, nsmallest/nlargest are
enough in many cases, but they actually handle the very
special case where n is known. On the other hand, iterators
have become a major first-class construct in Python, and I
consider iteration over the ordered elements of a heap a
very common use case. The step towards an augmented API that
provides efficient iterators for both min-heaps and
max-heaps would underline Python's shift towards iteration
and considerably simplify code that currently suffers from
custom solutions.

And again: most of the code is already there.

Another possible solution: what about a module function
heapified(seq_or_iter, key=..., cmp=..., reverse=...)
in the style of sorted(), but returning an iterator?
The advantage over sorted() would be higher efficiency
when the iterator is thrown away after a small number of
items.
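A minimal sketch of what such a heapified() could look like on top of the existing heapq functions (the name, signature, and the _Reverse helper are hypothetical, and the obsolete cmp= argument is omitted):

```python
import heapq
from itertools import islice

class _Reverse:
    """Hypothetical helper: inverts comparisons so the min-heap
    machinery yields largest-first."""
    __slots__ = ("key",)

    def __init__(self, key):
        self.key = key

    def __lt__(self, other):
        return other.key < self.key

def heapified(iterable, key=None, reverse=False):
    """Sketch of the proposed heapified(): the keyword interface of
    sorted(), but returning a lazy iterator instead of a list."""
    if key is None:
        key = lambda item: item
    wrap = _Reverse if reverse else (lambda k: k)
    # Decorate with (ordering key, tie-breaking index, item) so the
    # items themselves never have to be comparable.
    heap = [(wrap(key(item)), i, item) for i, item in enumerate(iterable)]
    heapq.heapify(heap)               # O(n)
    while heap:
        yield heapq.heappop(heap)[2]  # O(log n) per item

# Cheaper than sorted() when only a few items are consumed:
print(list(islice(heapified(range(100), reverse=True), 3)))  # -> [99, 98, 97]
```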

Just an idea...
msg54413 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-03-08 16:15

I am -1 on this feature request. The design of the module
(as functions on lists rather than as a class with
encapsulation) does not mesh well with these proposals.

Building out the API with maxheap functions makes the module
harder to learn, and it introduces a risk of a user
mismatching functions (e.g. heappop with maxheapify).
Maxheap functionality can already be obtained by decorating
with a class which redefines __cmp__.

The case for an iterator wrapper is not strong.  It was not
well received when proposed on python-dev. The risks of
mutating the heap during iteration are greater than the
equivalent situation for lists because there is no simple
way to verify that the heap condition has remained intact
between iteration steps. If truly needed, an iterator around
the existing minheap is trivial to write.

Most user needs are met by sort(), nlargest(), nsmallest(),
the existing minheap functions, or the existing minheap
functions applied to decorated data.

IMO, introducing iterators and maxheaps does not add enough
value to warrant complicating the module and introducing new
risks of misuse (altering the heap condition during iteration
or mismatching minheap and maxheap functions).

The story would be somewhat different if the heaps had
originally been encapsulated in a class.  If they had,
iterators could be added and given protection from mutation.
Also, if there were a heap class, the API would easily
support key= and reverse= options.  But in the absence of a
heap class, it would be a mistake to force these buildouts.

The OP's corner use case presented on comp.lang.python was
met with solutions using the existing toolset.  The corner
case was wanting a maxheap to read all data into memory,
heapify it, and extract n largest elements AND n is not
known in advance AND n is small enough that sort(data) was
deemed to have too many comparisons.  Not knowing n in
advance arose because the data contained duplicate records
which the OP was unwilling to filter out in advance with a
single-pass O(n) step using a dictionary to condense
equivalence groups to a single element.
History
Date User Action Args
2022-04-11 14:56:10  admin  set  github: 41663
2005-03-07 14:15:01  scoder  create