This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: itertools.imerge: merge sequences
Type: enhancement Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: jneb, rhettinger, schoepflin
Priority: normal Keywords:

Created on 2005-04-18 12:11 by jneb, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg54443 - (view) Author: Jurjen N.E. Bos (jneb) * Date: 2005-04-18 12:11
(For the itertools library, so Python 2.2 and up)
This is a suggested addition to itertools, proposed name imerge.
usage: imerge(seq0, seq1, ..., [key=<key function>])
result: imerge assumes the sequences are all in sorted order, and 
produces a iterator that returns pairs of the form (value, index),
where value is a value of one of the sequences, and index is the 
index number of the given sequence.
The output the imerge is in sorted order (taking into account the 
key function), so that identical values in the sequences will be 
produced from left to right.
The code is surprisingly short, making use of the builtin heap 
module.
(You may disagree with my style of argument handling; feel free to 
optimize it.)
def imerge(*iterlist, **key):
	"""Merge a sequence of sorted iterables.
	
	Returns pairs [value, index] where each value comes from 
iterlist[index], and the pairs are sorted
	if each of the iterators is sorted.
	Hint use groupby(imerge(...), operator.itemgetter(0)) to get 
the items one by one.
	"""
	if key.keys() not in ([], ["key"]): raise TypeError, "Excess 
keyword arguments for imerge"
	key = key.get("key", lambda x:x)
	from heapq import heapreplace, heappop
	#initialize the heap containing (inited, value, index, 
currentItem, iterator)
	#this automatically makes sure all iterators are initialized, 
then run, and finally emptied
	heap = [(False, None, index, None, iter(iterator)) for index, 
iterator in enumerate(iterlist)]
	while heap:
		inited, item, index, value, iterator = heap[0]
		if inited: yield value, index
		try: item = iterator.next()
		except StopIteration: heappop(heap)
		else: heapreplace(heap, (True, key(item), index, item, 
iterator))

If you find this little routine worth its size, please put it into 
itertools.

- Jurjen
msg54444 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-04-18 22:43
Logged In: YES 
user_id=80475

I had previously looked at an imerge() utility and found
that it had only a single application (isomorphic to lazy
mergesorting) and that the use cases were dominated by the
in-memory alternative:  sorted(chain(*iterlist)).

Short of writing an external mergesort, what applications
did you have in mind?  What situations have you encountered
where you have multiple sources of sorted data being
generated on the fly (as opposed to already being
in-memory), have needed one element at a time sequential
access to a combined sort of that data, needed that combined
sort only once, and could not afford to have the dataset
in-memory?
msg54445 - (view) Author: Jurjen N.E. Bos (jneb) * Date: 2005-04-19 08:19
Logged In: YES 
user_id=446428

Well, I was optimizing a piece of code with reasonbly long sorted lists (in 
memory, I agree) that were modified in all kinds of ways. I did not want 
the nlogn behaviour of sort, so I started writing a merge routine.
I found out that the boundary cases of a merge implementation are a 
mess, until I disccovered the heap trick. Then I decided to clean it up 
and and put it up for a library routine.
The fact that it uses iterators is obnly to make it more general, not 
specifically for the "lazy" properties.
- Jurjen
msg54446 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-04-19 18:58
Logged In: YES 
user_id=80475

For your specific application, it is better to use sorted().
 When the underlying data consists of long runs of
previously ordered data, sorted() will take advantage of
that ordering and run in O(n) time.  In contrast, using a
heap will unnecessarily introduce O(n log n) behavior and
not exploit the underlying data order.

Recommend that you close this request.  This discussion thus
far confirms the original conclusion that imerge() use cases
are dominated by sorted(chain(*iterlist)) which gives code
that is shorter, faster, and easier to understand.
msg54447 - (view) Author: Markus Schöpflin (schoepflin) Date: 2005-08-24 09:03
Logged In: YES 
user_id=91733

I was looking for exactly this functionality in itertools,
so I would think it a good addition to the module.

I have an application, which exposes messages in (possibly
huge) log files via an iterator interface, and imerge
allowed me to easily put together a little utility that
merges log files from different sources according to the
timestamp of the message.

Markus
History
Date User Action Args
2022-04-11 14:56:10adminsetgithub: 41871
2005-04-18 12:11:24jnebcreate