Limitations

SiteSucker is a relatively simple program with a number of limitations.

SiteSucker examines only CSS and HTML for URLs. When it analyzes HTML, it looks for URLs in the style attribute and in the following tags:

  • <a>
  • <area>
  • <audio>
  • <base>
  • <body>
  • <embed>
  • <frame>
  • <iframe>
  • <img>
  • <input>
  • <link>
  • <meta>
  • <object>
  • <script>
  • <source>
  • <style>
  • <table>
  • <tbody>
  • <td>
  • <tfoot>
  • <th>
  • <thead>
  • <tr>
  • <track>
  • <video>

If a link is specified elsewhere in an HTML file, SiteSucker will not see it.
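
To illustrate what tag-based scanning implies, here is a minimal Python sketch of a scanner that only looks at the tags listed above and at the style attribute. It is not SiteSucker's actual code. The set of URL-bearing attributes (URL_ATTRS), the CSS_URL pattern, and the names LinkCollector and collect_urls are all assumptions made for this example, and CSS inside <style> elements is left out for brevity.

```python
import re
from html.parser import HTMLParser

# Tags copied from the list above.
SCANNED_TAGS = {
    "a", "area", "audio", "base", "body", "embed", "frame", "iframe", "img",
    "input", "link", "meta", "object", "script", "source", "style", "table",
    "tbody", "td", "tfoot", "th", "thead", "tr", "track", "video",
}

# Assumption: the text does not say which attributes are read,
# so this set is only a plausible guess for illustration.
URL_ATTRS = {"href", "src", "background", "data", "poster"}

# Matches url(...) references inside a CSS declaration.
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


class LinkCollector(HTMLParser):
    """Collects URLs from the scanned tags and from style attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "style":
                # The style attribute is scanned on any tag.
                self.urls.extend(CSS_URL.findall(value))
            elif tag in SCANNED_TAGS and name in URL_ATTRS:
                self.urls.append(value)


def collect_urls(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


sample = (
    '<a href="a.html">x</a>'
    '<div data-url="hidden.html" style="background: url(bg.png)"></div>'
    '<img src="i.png">'
)
print(collect_urls(sample))  # ['a.html', 'bg.png', 'i.png']
```

In this sketch, hidden.html is never collected because it sits in an attribute of a tag that is not scanned, which is the kind of link the paragraph above says will be missed.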

SiteSucker does not scan PDFs, Flash files (.swf), QuickTime movie files (.mov), or other media files for embedded links.

SiteSucker ignores JavaScript entirely. Any link specified within JavaScript will not be seen by SiteSucker, and the file it points to will not be downloaded. (If the Log Warnings option is on in the settings, SiteSucker will include a warning in the log file for any page that uses JavaScript.)

By default, SiteSucker honors robots.txt exclusions, the Robots META tag, and the X-Robots-Tag HTTP header. Therefore, any directories or pages disallowed by robot exclusions will not be downloaded by SiteSucker. See the SiteSucker manual for more information about robot exclusions.
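
As a rough illustration of how a robots.txt exclusion filters a download list, the following sketch uses Python's standard urllib.robotparser module. It is not how SiteSucker is implemented. The robots.txt content, the example.com URLs, the user-agent name ExampleCrawler, and the helper allowed_urls are hypothetical, and the Robots META tag and X-Robots-Tag header are not covered.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt that excludes one directory for all crawlers.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

urls = [
    "https://example.com/index.html",
    "https://example.com/private/notes.html",
]
print(allowed_urls(ROBOTS, "ExampleCrawler", urls))
# ['https://example.com/index.html']
```

The page under /private/ is dropped from the list, which mirrors the behavior described above for disallowed directories and pages.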