Robot Exclusions
By default, SiteSucker honors robots.txt exclusions, the Robots META tag, and the X-Robots-Tag HTTP header.
The robots.txt file (downloaded as robots.ssl.txt when HTTPS is used) lets website administrators declare which parts of a site are off-limits to robots like SiteSucker. For example, an administrator can disallow access to private and temporary directories because they do not want the pages in those areas downloaded.
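For example, a robots.txt file that keeps all robots out of a site's private and temporary directories might look like the following (the directory names are illustrative):

    User-agent: *
    Disallow: /private/
    Disallow: /tmp/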
The robots.txt file can also include the URLs of sitemaps, which SiteSucker will download unless the Delete robots.txt option is turned on in the Preferences. Although sitemaps list a site's URLs, SiteSucker cannot scan sitemaps for those URLs.
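A sitemap reference in robots.txt is a single line giving the sitemap's full URL, for example (the URL is illustrative):

    Sitemap: https://www.example.com/sitemap.xml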
SiteSucker also honors the Crawl-delay directive in robots.txt. This directive specifies the number of seconds a robot should wait between successive requests to the same server. If it is present in the robots.txt file, SiteSucker will pause between requests accordingly.
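For example, the following robots.txt record (illustrative) asks all robots to wait ten seconds between requests:

    User-agent: *
    Crawl-delay: 10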
In addition, SiteSucker honors the Robots META tag and the X-Robots-Tag HTTP header, which can be used to request that links on specific pages not be followed by robots.
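For example, a page can make this request with a META tag in its HTML, or a server can send the equivalent HTTP header (typical usage shown; a site's actual directives may differ):

    <meta name="robots" content="nofollow">

    X-Robots-Tag: nofollow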
For the most part, this behavior can be overridden with the Ignore Robot Exclusions option under the General settings. However, robots.txt directives aimed specifically at SiteSucker are always honored.
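For example, a record addressed to SiteSucker by name, such as the illustrative entry below (assuming the site uses the user-agent name SiteSucker), is honored even when Ignore Robot Exclusions is turned on:

    User-agent: SiteSucker
    Disallow: /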
Warning: Ignoring robot exclusions is not recommended. Robot exclusions are usually put in place for a good reason and should be obeyed.