classification
Title: robotparser only applies first applicable rule
Components: Library (Lib)

process
Status: closed
Resolution: not a bug
Assigned To: skip.montanaro
Nosy List: calvin, f8dy, skip.montanaro
Priority: normal

Created on 2003-02-20 18:55 by f8dy, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Files
File name | Uploaded | Description
690214.patch | f8dy, 2003-02-20 19:04 | Patch for robotparser.py
Messages (3)
msg14708 - Author: Mark Pilgrim (f8dy) Date: 2003-02-20 18:55
robotparser.py::RobotFileParser::can_fetch 
currently returns the result of the first applicable rule.  It 
should loop through all rules looking for anything that 
disallows access.  For example, if your first rule applies 
to 'wget' and 'python' and disallows access to /dir1/, and 
your second rule is a 'python' rule that disallows access 
to /dir2/, robotparser will falsely claim that python is 
allowed to access /dir2/.

Patch against current source attached.
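
The reported behavior can be sketched as follows. This is only an illustration, not the attached patch: it uses Python 3's `urllib.robotparser` (the successor of the `robotparser` module discussed here), and the robots.txt content and page paths are assumptions made up to mirror the example in the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mirroring the report: a shared record for
# 'wget' and 'python', followed by a second record for 'python' only.
ROBOTS_TXT = """\
User-agent: wget
User-agent: python
Disallow: /dir1/

User-agent: python
Disallow: /dir2/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Only the first record whose User-agent matches is consulted, so the
# second 'python' record (Disallow: /dir2/) never takes effect.
dir1_allowed = rp.can_fetch("python", "/dir1/page.html")
dir2_allowed = rp.can_fetch("python", "/dir2/page.html")
print(dir1_allowed, dir2_allowed)
```

Here the `/dir1/` path is refused while the `/dir2/` path is reported as allowed, which is the outcome the report describes.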
msg14709 - Author: Bastian Kleineidam (calvin) Date: 2003-03-03 11:46
Mark, if you dive into
http://www.robotstxt.org/wc/norobots-rfc.txt you'll note
that the first matching user-agent line as well as the first
matching allow or disallow line must be obeyed by the robot
(see 3.2.1 and 3.2.2).

Now, I am not opposed to disobeying the above rfc, but there
are other arguments against your patch:
a) it breaks current implementations of robots.txt
(potentially disallowing access to sites)
b) your problem is easily solved by moving Disallow and/or
User-Agent lines to the top

Therefore my count is -1 for this patch.

Cheers, Bastian
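
One way to read suggestion (b) is to give 'python' a single record, placed first, that carries both Disallow lines. The sketch below assumes that interpretation; the robots.txt content and paths are hypothetical, and it again uses Python 3's `urllib.robotparser` rather than the original `robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical reordered robots.txt: the 'python' record comes first and
# lists every path that agent must avoid; 'wget' keeps its own record.
REORDERED_TXT = """\
User-agent: python
Disallow: /dir1/
Disallow: /dir2/

User-agent: wget
Disallow: /dir1/
"""

reordered = RobotFileParser()
reordered.parse(REORDERED_TXT.splitlines())

# The first matching record now contains both rules for 'python'.
python_dir1 = reordered.can_fetch("python", "/dir1/page.html")
python_dir2 = reordered.can_fetch("python", "/dir2/page.html")
# 'wget' matches only the second record, so /dir2/ stays open to it.
wget_dir2 = reordered.can_fetch("wget", "/dir2/page.html")
print(python_dir1, python_dir2, wget_dir2)
```

With this layout the first-match behavior stays untouched and the file itself expresses the intended restrictions.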
msg14710 - Author: Skip Montanaro (skip.montanaro) (Python triager) Date: 2003-03-06 08:27
Closing as it appears robotparser's behavior matches the rfc as Bastian
indicated.
History
Date | User | Action | Args
2022-04-10 16:06:57 | admin | set | github: 38016
2003-02-20 18:55:13 | f8dy | create |