Issue672491
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2003-01-22 15:28 by loewis, last changed 2022-04-10 16:06 by admin. This issue is now closed.
Messages (10) | |||
---|---|---|---|
msg14172 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2003-01-22 15:28 | |
In Python 2.[012], the code import re exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)") print exp.match("namespace").lastgroup prints "NCName". In Python 2.3a1, it prints "None". The problem is that last index is 2, instead of 1, as it should be. |
|||
msg14173 - (view) | Author: Greg Chapman (glchapman) | Date: 2003-02-05 01:56 | |
Logged In: YES user_id=86307 I believe the discrepancy was deliberately introduced in revision 2.84 of _sre.c. I agree with you that lastindex should return the the index of the matching group with the rightmost closing parenthesis (perhaps some elaboration in the docs is also in order). If this is the correct interpretation, two places need to be patched: 1) the handling of SRE_OP_MARK needs to be reverted to the 2.22 code and 2) the code in the lastmark_restore function needs to be tweaked so that lastindex is not accidentally set to the last matched group entered. Thinking further though, given a (contrived) pattern like this: re.match('((x))y(?:(a)b|ac)', 'xyac') what should lastindex be? I assume 1, given the definition above (lastindex = matching group with rightmost close parens). In 2.22 it is 3, since group 3 matched before the branch failed at the 'b'. In 2.3a1 it is 2, since lastindex is restored (after the branch fails) using the saved lastmark. Anyway, if it should be 1, then I think _sre.c will have to save lastindex as well as lastmark when processing the three opcodes which may end up calling lastmark_restore. |
|||
msg14174 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2003-04-19 07:32 | |
Logged In: YES user_id=21627 Assigning to Gustavo, since he wrote 2.84. Gustavo, can you confirm that your change broke this feature? Can you also agree that it should be fixed? |
|||
msg14175 - (view) | Author: Gustavo Niemeyer (niemeyer) * | Date: 2003-04-19 21:03 | |
Logged In: YES user_id=7887 Martin, the lastgroup/lastindex handling was quite broken in Python < 2.3. We can easily see this if testing your example in the following way: >>> import re >>> exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)") >>> match = exp.match("namespace") >>> match.groups() ('namespace', 'e') >>> match.groupdict() {'NCName': 'namespace'} This has the same result in any python you execute. This also demonstrates that the group numbering has always been assigned by the opening parenthesis. About the None result, that's also correct. In the example, we notice that the second unnamed group is correctly matching 'e'. Accordingly to the sre module documentation, that's how lastgroups should behave: lastgroup The name of the last matched capturing group, or None if the group didn't have a name, or if no group was matched at all. In the case above, the group didn't have a name. If we check lastindex, we'll see it's correctly set to "2". Greg, your example is correctly showing one of the bugs in lastgroup/index handling in Python < 2.3. OTOH, the current result of 2 is right, given the "count on open" rule, which was always used. Here's an example in 2.3: >>> re.match('((x))y(?:(a)b|ac)', 'xyac').groups() ('x', 'x', None) >>> re.match('((x))y(?:(a)b|ac)', 'xyac').lastindex 2 (notice that groups always start in 1, as group 0 is the "whole match" group) Martin, if you agree, please close the bug. If you have any doubts, it'll be a pleasure to explain. |
|||
msg14176 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2003-04-19 21:25 | |
Logged In: YES user_id=21627 Gustavo, I agree that the numbering of groups is and should be by opening parentheses, i.e. that group 1 is <NCName>, and group 2 is unnamed. I also agree that *if* lastindex is 2, lastgroup should be None. I still think this value is incorrect, though. It is illogical, unhelpful, and incompatible with earlier versions. lastindex should be 1, and lastgroup should be "NCName". It is illogical because group 1 ends *after* group 2 ends, as the closing parenthesis of group 1 is after the closing parenthesis of group 2. It is unhelpful because one of the primary purposes of lastgroup is to find efficiently the entire match if you have a list of top-level alternatives, such as you do in a scanner. With Python <2.3, you can put named alternatives into a regex, and use lastindex to find out the alternative. Indeed, this is what sre.Scanner does; I believe you have broken that class as well. It is incompatible because earlier Python versions behaved differently. As a result of this change, PyXML does not work with Python 2.3. |
|||
msg14177 - (view) | Author: Gustavo Niemeyer (niemeyer) * | Date: 2003-04-19 22:02 | |
Logged In: YES user_id=7887 Why do you think lastindex is incorrect? Isn't 2 the lastindex? >>> import re >>> exp = re.compile("(?P<NCName>[a-zA-Z_](\w|[_.-])*)") >>> match = exp.match("namespace") >>> match.group(0) 'namespace' >>> match.group(1) 'namespace' >>> match.group(2) 'e' It works like this in all Python versions. Also, if you agree that group 2 is unnamed, shouldn't lastindex be 2? It actually *is* the lastindex, as you see above. It is incompatible with old versions because old versions were broken, and didn't follow what was documented. It was indeed a side effect of a bug. For example, is it logical that if we remove the second group, lastindex is still 1 (in Python 2.2)? >>> exp = re.compile("(?P<NCName>[a-zA-Z_])") >>> print exp.match("namespace").lastindex 1 > It is illogical because group 1 ends *after* group 2 ends, > as the closing parenthesis of group 1 is after the closing > parenthesis of group 2. How this changes anything? As we agreed, groups are numbered by the opening parenthesis. Hummm.. perhaps you think that the old behavior was to show what group had the last closing parenthesis? No.. that wasn't the case either. Look at this example, in Python 2.2: >>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex 4 >>> re.compile("(a(b)?)((c)d)?").match("abce").groups() ('ab', 'b', None, None) In Python 2.3: >>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex 2 >>> re.compile("(a(b)?)((c)d)?").match("abce").groups() ('ab', 'b', None, None) > It is unhelpful because one of the primary purposes of > lastgroup is to find efficiently the entire match if you > have a list of top-level alternatives I don't understand this. If you want the entire match, just use group 0. Did I misunderstand it? Can you show me some example? Can you show me how I've broken the Scanner? If PyXML is broken, it trusted in an undocumented and broken feature. Should we maintain it broken to avoid breaking PyXML? |
|||
msg14178 - (view) | Author: Greg Chapman (glchapman) | Date: 2003-04-19 22:24 | |
Logged In: YES user_id=86307 I'll just add my two cents here. Gustavo, I think given your defintion of lastindex, there's no reason for the state to track a seperate lastindex field; the correct value could easily be determined by examining the mark array after matching is completed (in the process of initializing the Match object). On the other hand, I don't think there is an obvious way of determining lastindex given the 2.22 definition without either examining the compiled pattern or tracking it as the pattern executes. I also agree that it is an incompatible change. Although the implementation was partly broken, I think the clear intent of the 2.22 code was to report the matching group with the last closing parens. FYI, I posted a patch here to revert back to the previous behavior: http://www.python.org/sf/712900 You two may want to look at it to see if it looks like it's on the right track. Here is what a python with this patch reports on this example from Gustavo: >>> re.compile("(a(b)?)((c)d)?").match("abce").lastindex 1 As you can see, it reports the correct value for lastindex given the 2.22 definition. |
|||
msg14179 - (view) | Author: Gustavo Niemeyer (niemeyer) * | Date: 2003-04-19 23:26 | |
Logged In: YES user_id=7887 > As you can see, it reports the correct value for lastindex given > the 2.22 definition. The problem here is defining what's the 2.2 definition. I've checked with other examples, and it looks like your definition is correct in all cases where the shown bug is not acting. If that's the case, the documentation is *very* misleading, and should be fixed. |
|||
msg14180 - (view) | Author: Gustavo Niemeyer (niemeyer) * | Date: 2003-04-19 23:33 | |
Logged In: YES user_id=7887 Concluding: - I agree that my implementation is uncompliant with the previous behavior, and I think it should be modified to behave like before. - There was a bug which was fixed by my implementation, and should be fixed when porting to the old behavior. - The documentation is very misleading, and seems closer to my implementation than the old behavior. This should be fixed. |
|||
msg14181 - (view) | Author: Gustavo Niemeyer (niemeyer) * | Date: 2003-04-20 00:59 | |
Logged In: YES user_id=7887 I backed out the changes made in 2.84 which changed the behavior, and maintained the changes which fixed the bug. Applied as: Modules/_sre.c: 2.90 I'm also including some examples in the lastindex documentation, explaining how it works. Sorry about the trouble this may have caused. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:06:10 | admin | set | github: 37819 |
2003-01-22 15:28:53 | loewis | create |