Controlling the Spiders
The following information describes how to keep search engine spiders from indexing pages and directories in your site. You can also restrict access by users. If you want to restrict access to a whole site, however, contact NIS.
Hiding Pages
The META ROBOTS tag allows you to tell search engine spiders whether to index your page and whether to follow its links. Not all spiders recognize the ROBOTS tag; however, Google does. You can use this tag to hide pages on the BU Web.
If you want to hide a page so that it does not appear in search results, add the following tag to the head section of the page.
<meta name="robots" content="noindex">
If you also want to instruct the spider not to follow the links on the page, you can add the "nofollow" value:
<meta name="robots" content="noindex,nofollow">
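To confirm that a page carries the directives you intended, you can read them back from its HTML. The following is a minimal sketch using Python's standard html.parser module; the class name RobotsMetaParser, the helper robots_directives, and the sample page are illustrative assumptions, not part of any BU or Google tool.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html_text):
    """Return the set of robots directives found in html_text (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


# Sample page invented for illustration.
page = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>'
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)  # True True
```

Because tag names, attribute names, and values are compared in lowercase, the sketch treats <META NAME="ROBOTS" ...> and <meta name="robots" ...> the same way.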
Removing Pages from the Cache
Google automatically takes a "snapshot" of each page it crawls and caches it. The cached copy lets the search engine highlight the search terms on text-heavy pages, so users can find relevant information quickly, and lets it retrieve pages for users if the site's server temporarily fails. Users can access the cached version by choosing the "Cached" link on the search results page. If you do not want your content to be accessible through Google's cache, you can use the NOARCHIVE meta tag. Place this in the head section of your pages:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
This tag will tell robots not to archive the page. Google will continue to index and follow links from the page, but will not present cached material to users.
If you want to allow other robots to archive your content, but prevent Google's robots from caching, you can use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">
Note that the change will occur the next time Google crawls the page containing the NOARCHIVE tag (typically about once a month). If you want the change to take effect sooner than this, you must contact NIS to request immediate removal of archived content.
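The difference between the ROBOTS and GOOGLEBOT tags can be illustrated with a short Python sketch. It assumes that a crawler obeys both the tags addressed to "robots" and the tags addressed to its own name, and that the two sets of directives are simply merged; the function effective_directives and the crawler name "otherbot" are hypothetical.

```python
def effective_directives(meta_tags, crawler):
    """Combine generic and crawler-specific robots directives.

    meta_tags: iterable of (name, content) pairs taken from a page's META tags.
    crawler:   the name the crawler looks for in META tags, e.g. "googlebot".
    Assumption: directives for "robots" and for the crawler are merged.
    """
    wanted = {"robots", crawler.lower()}
    directives = set()
    for name, content in meta_tags:
        if name.strip().lower() in wanted:
            directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )
    return directives


# A page that carries only the Google-specific tag.
tags = [("GOOGLEBOT", "NOARCHIVE")]
print(effective_directives(tags, "googlebot"))  # {'noarchive'}
print(effective_directives(tags, "otherbot"))   # set()
```

Under these assumptions, only Googlebot sees the NOARCHIVE directive, so other robots remain free to archive the page.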
Hiding Directories
There is a standard method for excluding robot crawlers that involves a "robots.txt" file. Rules in this file can prevent Googlebot or other crawlers from visiting all or part of your site. Each group of rules names the crawler it applies to; Googlebot has a user-agent of "Googlebot". In addition, Googlebot understands some extensions to the robots.txt standard: Disallow patterns may include * to match any sequence of characters, and patterns may end in $ to indicate that the pattern must match the end of the URL. For example, to prevent Googlebot from crawling files that end in .gif, you may use the following robots.txt entry:
User-agent: Googlebot
Disallow: /*.gif$
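To see which URLs a Disallow pattern covers, the * and $ extensions can be translated into a regular expression. The following Python sketch is one way to do that, not Googlebot's actual implementation; the function names and the /private/ directory are assumptions made for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Disallow pattern with * and a trailing $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any sequence of characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # A trailing $ in the pattern means the match must reach the end of the path.
    return re.compile(body + (r"\Z" if anchored else ""))


def is_disallowed(path, patterns):
    """Return True if path matches any Disallow pattern, starting at the path's first character."""
    return any(pattern_to_regex(p).match(path) is not None for p in patterns if p)


# "/*.gif$" comes from the entry above; "/private/" is a hypothetical directory.
rules = ["/*.gif$", "/private/"]
for path in ["/images/logo.gif", "/images/logo.gif?size=2",
             "/private/notes.html", "/index.html"]:
    print(path, is_disallowed(path, rules))
```

In this sketch /images/logo.gif and /private/notes.html are reported as disallowed, while /images/logo.gif?size=2 is not, because the $ requires the path to end in .gif. A pattern without * or $, such as /private/, matches every path that begins with it, which is how a whole directory is hidden.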
Remember, changing your server's robots.txt file or changing the META tags on its pages will not cause an immediate change in the results Google returns. Your changes are likely to take a while to appear, because they reach users only after Google builds its next index of the web.