Limitations

SiteSucker is a relatively simple program with several limitations.

SiteSucker examines only CSS and HTML for URLs. When SiteSucker analyzes HTML, it looks for URLs in the style attribute and in the following tags:

  • <a>
  • <area>
  • <audio>
  • <base>
  • <body>
  • <embed>
  • <form>
  • <frame>
  • <iframe>
  • <img>
  • <input>
  • <link>
  • <meta>
  • <object>
  • <script>
  • <source>
  • <style>
  • <table>
  • <tbody>
  • <td>
  • <tfoot>
  • <th>
  • <thead>
  • <tr>
  • <track>
  • <video>

If a link is specified elsewhere in an HTML file, SiteSucker will not see it.
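To illustrate what tag-based scanning can and cannot see, here is a minimal Python sketch. It is not SiteSucker's implementation. The tag set is copied from the list above, but the set of URL-bearing attributes (URL_ATTRS), the choice to scan the style attribute on every tag, and the names LinkCollector and collect_links are assumptions made for this example.

```python
from html.parser import HTMLParser
import re

# Tags taken from the list above.
SCANNED_TAGS = {
    "a", "area", "audio", "base", "body", "embed", "form", "frame", "iframe",
    "img", "input", "link", "meta", "object", "script", "source", "style",
    "table", "tbody", "td", "tfoot", "th", "thead", "tr", "track", "video",
}
# Assumption: which attributes carry URLs is a guess made for this sketch.
URL_ATTRS = {"href", "src", "action", "background", "data", "poster"}
# Matches url(...) references inside CSS text, with or without quotes.
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)")


class LinkCollector(HTMLParser):
    """Collects URLs from the scanned tags and from style attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "style":
                # Assumption: the style attribute is scanned on any tag.
                self.urls.extend(CSS_URL.findall(value))
            elif tag in SCANNED_TAGS and name in URL_ATTRS:
                self.urls.append(value)


def collect_links(html):
    """Return the URLs a tag-based scanner of this kind would find."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.urls
```

In this sketch, a URL stored in a custom attribute such as data-href, or assigned inside an onclick handler, is never collected, which mirrors the limitation described above.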

SiteSucker does not scan PDFs, Flash files (swf), QuickTime movie files (mov), or other media files for embedded links.

SiteSucker ignores JavaScript entirely. Any URL specified within JavaScript will not be seen by SiteSucker and will not be downloaded. (If the Log Warnings option is on in the Log settings, SiteSucker includes a warning in the log file for any page that uses JavaScript.) If JavaScript is embedded in HTML text, you might be able to extract URLs from it by using the Patterns setting, the JavaScript setting in the Webpage settings, or a script. See Settings in the SiteSucker manual for more information.
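As a rough illustration of pattern-based extraction, the Python sketch below pulls URL-like string literals out of a fragment of JavaScript with a regular expression. It does not reproduce SiteSucker's Patterns setting. The expression, the list of file extensions, and the function name urls_in_script are assumptions chosen for this example, and a real page may need a different pattern.

```python
import re

# Assumption: URLs appear as quoted string literals that end in one of
# these file extensions. Real scripts may build URLs in other ways.
JS_URL = re.compile(r"""["']([^"'\s]+\.(?:html?|php|pdf|jpe?g|png|gif))["']""")


def urls_in_script(js_text):
    """Return quoted, URL-like strings found in a piece of JavaScript."""
    return JS_URL.findall(js_text)
```

A pattern like this finds only URLs written out literally. URLs assembled at run time, for example by concatenating strings, remain invisible to it.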

SiteSucker largely ignores forms. It does download the action URL from a form tag and the src URL from an input tag, and you can use the web browser to submit forms. See Password-protected Sites in the SiteSucker manual for more information.

SiteSucker cannot download videos. SiteSucker Pro, however, can download embedded videos, including embedded YouTube and Vimeo videos. See Frequently Asked Questions in the SiteSucker manual for more information.

By default, any directories or files disallowed by robot exclusions will not be downloaded by SiteSucker. See Robot Exclusions in the SiteSucker manual for more information.
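For readers unfamiliar with robot exclusions, the Python sketch below shows how robots.txt rules filter a list of URLs, using the standard library's urllib.robotparser module. It illustrates the general mechanism only, not how SiteSucker applies it. The function name allowed_urls, the user-agent string "ExampleBot", and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    # parse() takes the lines of a robots.txt file that was already fetched.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules: every crawler is barred from /private/.
rules = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/report.html",
]
permitted = allowed_urls(rules, "ExampleBot", candidates)
```

With these rules, only the index.html URL remains in permitted. The file under /private/ is filtered out.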

SiteSucker gathers a large amount of information when downloading a site. When the site is very large, SiteSucker could exhaust all available memory before the entire site is downloaded. When downloading a very large site, you also risk running out of storage space or being constrained by file system limitations.