Robots exclusion standard
A standard used to advise web crawlers and scrapers not to index a web page or site.
The robots exclusion standard (also called the robots exclusion protocol or robots.txt protocol) is a convention for telling Web crawlers and other Web robots which parts of a Web site they may visit.
To give robots instructions about which pages of a Web site they may access, site owners put a text file called robots.txt in the main directory of their Web site, for example http://www.example.com/robots.txt.[1] This text file tells robots which parts of the site they can and cannot access. However, robots can ignore the robots.txt file, and malicious (bad) robots often do.[2] If the robots.txt file does not exist, Web robots assume that they may visit all parts of the site.
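A well-behaved crawler reads this file and checks each URL against its rules before fetching the page. As a minimal sketch of that check, Python's standard urllib.robotparser module can be used; the robot name MyBot, the site www.example.com, and the rules below are placeholders for illustration, not part of the standard itself:

    import urllib.robotparser

    # Parse an example set of rules. A real crawler would instead call
    # set_url() and read() to fetch the site's live robots.txt file.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /private/",
    ])

    # Check whether a robot named "MyBot" may fetch each page.
    print(rp.can_fetch("MyBot", "http://www.example.com/index.html"))         # True
    print(rp.can_fetch("MyBot", "http://www.example.com/private/page.html"))  # False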
Examples of robots.txt files
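The following short files illustrate the most common directives; the directory names are placeholders. This file allows every robot to visit every page, because the wildcard * matches all robots and the empty Disallow rule blocks nothing:

    User-agent: *
    Disallow:

This file tells every robot to stay out of the whole site:

    User-agent: *
    Disallow: /

This file tells every robot to keep out of two specific directories:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/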