
PHP: Parsing robots.txt


If you're writing any kind of script that involves fetching HTML pages or files from another server, you really need to make sure that you follow netiquette - the "unofficial rules defining proper behaviour on the Internet".

This means that your script needs to:

  1. identify itself using the User Agent string including a URL;
  2. check the site's robots.txt file to see if they want you to have access to the pages in question; and
  3. not flood their server with too-frequent, repetitive or otherwise unnecessary requests.

If you don't meet these requirements then don't be surprised if they retaliate by blocking your IP address and/or filing a complaint. This article presents methods for achieving the first two goals, but the third is up to you.
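The third requirement can be met with something as simple as a pause between requests. A minimal sketch (the throttle function and its five-second default are illustrative, not part of the scripts below):

```php
<?php
// Sketch: enforce a minimum interval between requests to each host.
// $lastRequest maps host name => timestamp of the previous request.
function throttle(array &$lastRequest, $host, $minInterval = 5)
{
  if(isset($lastRequest[$host])) {
    $elapsed = time() - $lastRequest[$host];
    if($elapsed < $minInterval) {
      sleep($minInterval - $elapsed); // wait out the remainder
    }
  }
  $lastRequest[$host] = time();
}
```

Calling throttle($seen, 'www.example.net') before each fetch then guarantees at least five seconds between hits on the same host.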

Setting a User Agent

Before using any of the PHP file functions on a remote server you should decide on and set a sensible User Agent string. There are no real restrictions on what this can be, but some commonality is beginning to emerge.

A widely recognised format is the name of your agent followed by a contact URL in parentheses - for example: NameOfAgent (http://www.example.net).

The detail you provide should be proportionate to the amount of activity you're going to generate on the targeted sites/servers. The NameOfAgent value should be chosen with care as there are a lot of established user agents and you don't want to have to change it later. Check your server log files and our directory of user agents for examples.

Once you've settled on a name, using it is as simple as adding the following line to the start of your script:

<?php ini_set('user_agent', 'NameOfAgent (http://www.example.net)'); ?>

By passing a User Agent string with all requests you run less risk of your IP address being blocked, but you also take on some extra responsibility. People will want to know why your script is accessing their site. They may also expect it to follow any restrictions defined in their robots.txt file...
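As an aside, if you'd rather not change the setting globally with ini_set, the same User Agent string can be supplied per request through a stream context (a sketch; contexts are accepted by file(), fopen() and file_get_contents()):

```php
<?php
// Sketch: supply the User-Agent per request via a stream context
// instead of setting it globally with ini_set.
$context = stream_context_create([
  'http' => ['user_agent' => 'NameOfAgent (http://www.example.net)'],
]);
// then: $html = file_get_contents('http://www.example.net/', false, $context);
```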

Parsing robots.txt

That brings us to the purpose of this article - how to fetch and parse a robots.txt file.

The following script is useful if you only want to fetch one or two pages from a site (to check for links to your site for example). It will tell you whether a given user agent can access a specific page.

If you're building a search engine spider or intend to download a lot of files then you should implement a caching mechanism so that the robots.txt file only needs to be fetched once every day or so.
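A very simple cache along those lines might look like this (the file-naming scheme, cache directory and one-day lifetime are illustrative):

```php
<?php
// Sketch: keep a local copy of each robots.txt and re-fetch it only
// when the copy is more than a day old.
function cached_robotstxt($host, $cacheDir = '/tmp', $maxAge = 86400)
{
  $cacheFile = $cacheDir . '/robots-' . md5($host) . '.txt';
  if(!file_exists($cacheFile) || time() - filemtime($cacheFile) > $maxAge) {
    $contents = @file_get_contents("http://{$host}/robots.txt");
    // cache an empty file on failure so we don't re-fetch on every call
    file_put_contents($cacheFile, ($contents === false) ? '' : $contents);
  }
  // same line-array format as the file() call in the script below
  return file($cacheFile);
}
```

The @file(...) call in robots_allowed could then be replaced with a call to cached_robotstxt($parsed['host']).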

<?php
// Original PHP code by Chirp Internet: www.chirp.com.au
// Please acknowledge use of this code by including this header.

function robots_allowed($url, $useragent=false)
{
  // parse url to retrieve host and path
  $parsed = parse_url($url);

  $agents = array(preg_quote('*'));
  if($useragent) $agents[] = preg_quote($useragent, '/');
  $agents = implode('|', $agents);

  // location of robots.txt file
  $robotstxt = @file("http://{$parsed['host']}/robots.txt");

  // if there isn't a robots.txt, then we're allowed in
  if(empty($robotstxt)) return true;

  $rules = array();
  $ruleApplies = false;

  foreach($robotstxt as $line) {
    // skip blank lines
    if(!$line = trim($line)) continue;

    // following rules only apply if User-agent matches $useragent or '*'
    if(preg_match('/^\s*User-agent: (.*)/i', $line, $match)) {
      $ruleApplies = preg_match("/($agents)/i", $match[1]);
    }

    if($ruleApplies && preg_match('/^\s*Disallow:(.*)/i', $line, $regs)) {
      // an empty rule implies full access - no further tests required
      if(!$regs[1]) return true;
      // add rules that apply to array for testing
      $rules[] = preg_quote(trim($regs[1]), '/');
    }
  }

  foreach($rules as $rule) {
    // check if page is disallowed to us
    if(preg_match("/^$rule/", $parsed['path'])) return false;
  }

  // page is not disallowed
  return true;
}
?>

This script is designed to parse a well-formed robots.txt file with no in-line comments. Each call to the script will result in the robots.txt file being downloaded again. A missing robots.txt file or a Disallow statement with no argument will result in a return value of true.

If the script is failing you might try removing the @ error-suppression operator before the file() call to see if any errors are being generated.

The script can be called as follows:

$canaccess = robots_allowed("http://www.example.net/links.php");
$canaccess = robots_allowed("http://www.example.net/links.php", "NameOfAgent");

or, in practice:

<?php
$url = "http://www.example.net/links.php";
if(robots_allowed($url, "NameOfAgent")) {
  // access granted
  $tmp = file_get_contents($url);
} else {
  // access disallowed
}
?>

If you don't pass a value for the second parameter then the script will only check for global rules - those under '*' in the robots.txt file. If you do pass the name of an agent then the script also finds and applies rules specific to that agent.

For more information on the robots.txt format, see www.robotstxt.org.

Allowing for 404 errors and the Allow directive

The following modified code has been supplied by Eric at LinkUp.com. It fixes a bug where a missing (404) robots.txt file would result in a false return value. It also adds extra code to cater for the Allow directive now recognised by some search engines.

The 404 checking requires the cURL module to be compiled into PHP, and we haven't tested the Allow directive parsing ourselves, but I'm sure it works. Please report any transcription errors.

<?php
// Original PHP code by Chirp Internet: www.chirp.com.au
// Adapted to include 404 and Allow directive checking by Eric at LinkUp.com
// Please acknowledge use of this code by including this header.

function robots_allowed($url, $useragent=false)
{
  // parse url to retrieve host and path
  $parsed = parse_url($url);

  $agents = array(preg_quote('*'));
  if($useragent) $agents[] = preg_quote($useragent, '/');
  $agents = implode('|', $agents);

  // location of robots.txt file, only pay attention to it if the server says it exists
  if(function_exists('curl_init')) {
    $handle = curl_init("http://{$parsed['host']}/robots.txt");
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    $response = curl_exec($handle);
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    if($httpCode == 200) {
      $robotstxt = explode("\n", $response);
    } else {
      $robotstxt = false;
    }
    curl_close($handle);
  } else {
    $robotstxt = @file("http://{$parsed['host']}/robots.txt");
  }

  // if there isn't a robots.txt, then we're allowed in
  if(empty($robotstxt)) return true;

  $rules = array();
  $ruleApplies = false;

  foreach($robotstxt as $line) {
    // skip blank lines
    if(!$line = trim($line)) continue;

    // following rules only apply if User-agent matches $useragent or '*'
    if(preg_match('/^\s*User-agent: (.*)/i', $line, $match)) {
      $ruleApplies = preg_match("/($agents)/i", $match[1]);
      continue;
    }

    if($ruleApplies) {
      list($type, $rule) = explode(':', $line, 2);
      $type = trim(strtolower($type));
      // add rules that apply to array for testing
      $rules[] = array(
        'type' => $type,
        'match' => preg_quote(trim($rule), '/'),
      );
    }
  }

  $isAllowed = true;
  $currentStrength = 0;

  foreach($rules as $rule) {
    // check if page hits on a rule
    if(preg_match("/^{$rule['match']}/", $parsed['path'])) {
      // prefer longer (more specific) rules and Allow trumps Disallow if rules same length
      $strength = strlen($rule['match']);
      if($currentStrength < $strength) {
        $currentStrength = $strength;
        $isAllowed = ($rule['type'] == 'allow');
      } elseif($currentStrength == $strength && $rule['type'] == 'allow') {
        $currentStrength = $strength;
        $isAllowed = true;
      }
    }
  }

  return $isAllowed;
}
?>


Another option for the last section might be to first sort the $rules by length and then only check the longest ones for an Allow or Disallow directive as they will override any shorter rules.
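That sort-first idea might be sketched as follows (untested against the original; the $rules entries use the same 'type'/'match' structure built by the modified script above):

```php
<?php
// Sketch: sort rules longest-first so the most specific rule wins,
// preferring Allow over Disallow when two matching rules are the same length.
// $rules entries: array('type' => 'allow'|'disallow', 'match' => quoted pattern)
function path_allowed(array $rules, $path)
{
  usort($rules, function($a, $b) {
    $d = strlen($b['match']) - strlen($a['match']); // longest first
    if($d != 0) return $d;
    return strcmp($a['type'], $b['type']); // 'allow' sorts before 'disallow'
  });
  foreach($rules as $rule) {
    // the first (most specific) matching rule decides the outcome
    if(preg_match("/^{$rule['match']}/", $path)) {
      return $rule['type'] == 'allow';
    }
  }
  return true; // no rule matched - access permitted
}
```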

Previously robots.txt could only be used to Disallow spiders from accessing specific directories, or the whole website. The Allow directive allows you to then grant access to specific subdirectories that would otherwise be blocked by Disallow rules.

You should be careful using this, however, as it's not part of the original standard and not all search engines will understand it. On the other hand, if you're running a web spider, taking Allow rules into account will give you access to more pages.
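For example, a robots.txt file combining the two directives might read as follows (the paths are illustrative):

```
User-agent: *
Disallow: /private/
Allow: /private/public/
```

A spider that understands Allow can fetch pages under /private/public/ while staying out of the rest of /private/; one that doesn't will treat the whole of /private/ as off limits.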



User Comments


8 April, 2017

Hey,

I've just released a new library for robots.txt policy checking. You can have a look at it here: github.com/hugsbrugs/php-robots-txt

25 April, 2016

Replacing '$' with ".*$" is not correct: the dollar sign ($) matches the end of the string.
For example, to block URLs that end with .asp:
Disallow: /*.asp$

4 November, 2014

Your way of building the regex rules is incorrect: you use preg_quote, which adds slashes before * and $.
I've tried to fix it:

$rule = addcslashes(trim($regs[1]), "/\+?[^](){}=!<>|:-");
$rule = str_replace("*", ".*", $rule);
if ($rule[mb_strlen($rule)-1] == '$')
  $rule = rtrim($rule, '$') . ".*$";
else
  $rule .= ".*";

Please reply to me by email if I'm wrong.

p.s.: sorry for my English, I'm Russian.

7 September, 2014

Hi. Thanks for the article on parsing robots.txt.

Please note that this parsing code does not work with User-agent groups, i.e. several User-agent lines sharing a single group of Disallow rules:

User-agent: Googlebot
User-agent: bingbot
Disallow: /private/

This is "standard" for robots.txt. See www.robotstxt.org/orig.html#format or developers.google.com/webmasters/control-crawl-index/docs/robots_txt
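This is a fair point. The parsing loop could be made group-aware by only resetting the matching state when a run of User-agent lines ends. A rough, untested sketch (variable names follow the scripts above):

```php
<?php
// Sketch: collect Disallow rules while honouring grouped User-agent lines.
// Within a run of consecutive User-agent lines, any match keeps the group
// active; a rule line ends the run, and the next User-agent line resets it.
function parse_group_aware(array $robotstxt, $agents)
{
  $rules = [];
  $ruleApplies = false;
  $inGroup = false; // true while consecutive User-agent lines are being read
  foreach($robotstxt as $line) {
    if(!$line = trim($line)) continue;
    if(preg_match('/^\s*User-agent: (.*)/i', $line, $match)) {
      $matches = (bool) preg_match("/($agents)/i", $match[1]);
      $ruleApplies = $inGroup ? ($ruleApplies || $matches) : $matches;
      $inGroup = true;
      continue;
    }
    $inGroup = false; // a non-User-agent line ends the run
    if($ruleApplies && preg_match('/^\s*Disallow:(.*)/i', $line, $regs)) {
      $rules[] = preg_quote(trim($regs[1]), '/');
    }
  }
  return $rules;
}
```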

9 September, 2013

Great script thanks!

I ran into some trouble when there wasn't a / on the end of a URL. So, for example, www.test.net/css was returning true when /css/ was actually forbidden in robots.txt.

This code fixed it...

// if the path part (/css) doesn't end in /
if(substr($parsed['path'], -1) != "/") {
  // whack a / on the end
  $parsed['path'] = $parsed['path'] . "/";
}

I'm sure it's not bullet proof, but it worked for my tiny example.
Great code

4 March, 2013

I found an important error. Wikipedia uses this - check de.wikipedia.org/robots.txt: on line 11 they use '*' to apply the rules to all Google Ads bots, but your script recognises this as a 'User-agent: *' line.
Nice script anyway,
Greets

Interesting. I can't find any evidence that "Mediapartners-Google*" is actually a valid entry in robots.txt for the "User-agent" line.

The original robots.txt protocol recommends a "case insensitive substring match of the name without version information", so the asterisk serves no purpose. The only valid use of '*' is to match all user agents.

Trying to block "Mediapartners-Google" is in any case pointless as that user agent only visits websites that display AdSense ads, and then it's required to give it access.

For the robots.txt parser to cater for this I would either strip out the '*' when it follows or precedes other characters after User-agent, or ignore those rules completely.
