URL

The URL settings offer options that affect URLs and provide a way to specify which URLs should be included in or excluded from the download.

The URL pane of the Settings dialog provides the following controls:

Check All Links

Check this box to have SiteSucker check all links in all downloaded HTML files — including links to files that the app is not downloading — and log any errors that occur. With this option turned on, SiteSucker may report many errors that you normally would not see. This setting is intended as a debugging tool for web designers who want to see if their own sites have any bad links.

To minimize the time it takes to check all links, set the Filter setting under the File Type settings to Allow Specified File Types with nothing checked, so that only HTML and CSS are downloaded. Then set the File Modification setting under the General settings to Delete After Analysis, which deletes HTML and CSS files after they are downloaded and analyzed.

Scan Comments for URLs

Check this box to have SiteSucker scan comments for URLs in all downloaded HTML files. Normally, SiteSucker ignores comments. This option is useful when tags are included in comments so that they can be used by Internet Explorer or JavaScript.

Treat Ambiguous URLs as Folders

Check this box to have SiteSucker treat ambiguous URLs as folders. If a URL does not end with a ‘/’ or a file extension, SiteSucker considers it to be ambiguous. For example, if this option is on and SiteSucker downloads a webpage from https://www.example.com/directory, the webpage will be saved at www.example.com/directory/index.html in the destination folder. If this option is off, the webpage will be saved at www.example.com/directory.html in the destination folder. See File Names for more information about how SiteSucker names downloaded files.
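The naming rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not SiteSucker's implementation. The function names is_ambiguous and local_path are hypothetical, the check looks only at the path part of the URL, and URLs that are not ambiguous are passed through unchanged because their naming is described under File Names.

```python
import posixpath
from urllib.parse import urlsplit


def is_ambiguous(url: str) -> bool:
    """Return True when the URL path ends in neither '/' nor a file extension."""
    path = urlsplit(url).path
    # Assumption: a URL with no path at all is not treated as ambiguous here.
    if path == "" or path.endswith("/"):
        return False
    last_segment = path.rsplit("/", 1)[-1]
    return posixpath.splitext(last_segment)[1] == ""


def local_path(url: str, ambiguous_as_folder: bool) -> str:
    """Map an ambiguous URL to a path relative to the destination folder."""
    parts = urlsplit(url)
    path = parts.path
    if is_ambiguous(url):
        # Option on: the URL becomes a folder holding index.html.
        # Option off: the URL becomes an .html file.
        path += "/index.html" if ambiguous_as_folder else ".html"
    return parts.netloc + path


print(local_path("https://www.example.com/directory", True))
print(local_path("https://www.example.com/directory", False))
```

With the option on, the sketch yields www.example.com/directory/index.html; with it off, www.example.com/directory.html, matching the example in the text.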

Download Links in PDFs

Check this box to have SiteSucker download the files that PDF documents link to, if other settings allow it. Since PDF links cannot be localized, the PDF documents themselves are never modified.


URL Constraint

Use this control to limit downloaded files to those at a specific site, those within a specific directory, or those having a specific URL. This option works in conjunction with the Include and Exclude URL settings and the General settings. SiteSucker provides the following URL constraints:

  • None - SiteSucker downloads the file specified in the URL text box, every file that it links to, every file that those files link to, and so on, regardless of site. Be aware that this option could result in a very large download if it is allowed to run unchecked.
  • Host - SiteSucker limits the download to those files on the host of the original file being downloaded. For example, if the URL is https://www.example.com/directory/home.html, this setting limits the download to those URLs beginning with http://www.example.com or https://www.example.com.
  • Host + 1 - SiteSucker limits the download to those files on the host of the original file being downloaded (just like the Host option), plus one level of files on other domains that are linked from files on the original host.
  • Subdomains - SiteSucker limits the download to those files within the second-level domain and all subdomains of the original file being downloaded. Continuing the example given for Host, this setting will download URLs beginning with http://www.example.com, https://images.example.com, http://guide.example.com, or https://example.com.
  • Directory - SiteSucker only downloads those files that are within the directory of the original file being downloaded. For example, if you are downloading https://www.example.com/directory/home.html or https://www.example.com/directory/ using this setting, SiteSucker will only download files in the directory named “directory”. But if you are downloading https://www.example.com/directory, SiteSucker will download all files from www.example.com unless the Treat Ambiguous URLs as Folders setting is on, in which case SiteSucker will only download files in the directory named “directory”.
  • URL Settings - SiteSucker only downloads the file specified in the URL text box and any files that have URLs allowed by the Include and Exclude URL settings.
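The Host, Subdomains, and Directory constraints can be approximated with simple checks on the parsed URL. The following Python sketch is an illustration under stated assumptions, not SiteSucker's code. All function names are hypothetical, the second-level domain is taken to be the last two labels of the host name (which is wrong for suffixes such as co.uk), and the Directory check is assumed to require the same host. Host + 1 is omitted because it depends on link depth rather than on the URL alone.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Lower-cased host name of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def host_ok(original: str, candidate: str) -> bool:
    """Host: same host as the original file, over http or https."""
    scheme = urlsplit(candidate).scheme
    return scheme in ("http", "https") and _host(candidate) == _host(original)


def subdomains_ok(original: str, candidate: str) -> bool:
    """Subdomains: the second-level domain itself or any subdomain of it."""
    # Assumption: the second-level domain is the last two labels of the host.
    base = ".".join(_host(original).split(".")[-2:])
    host = _host(candidate)
    return host == base or host.endswith("." + base)


def directory_prefix(original: str, ambiguous_as_folder: bool) -> str:
    """Path prefix that a candidate must share under the Directory constraint."""
    path = urlsplit(original).path or "/"
    if path.endswith("/"):
        return path
    parent, _, last = path.rpartition("/")
    if "." not in last and ambiguous_as_folder:
        return path + "/"   # ambiguous URL treated as a folder
    return parent + "/"     # otherwise use the containing directory


def directory_ok(original: str, candidate: str,
                 ambiguous_as_folder: bool = False) -> bool:
    """Directory: same host (assumed) and a path inside the original's directory."""
    prefix = directory_prefix(original, ambiguous_as_folder)
    return host_ok(original, candidate) and urlsplit(candidate).path.startswith(prefix)
```

For the original URL https://www.example.com/directory, directory_prefix returns "/" with the ambiguous-URL option off and "/directory/" with it on, which mirrors the difference described under Directory.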

Include and Exclude URL Settings

The Include and Exclude URL settings work in conjunction with the URL Constraint setting and the General settings according to the following rules:

  1. If this is the original URL (that is, the URL specified in the URL text box), then the file is downloaded.
  2. Otherwise, if the URL begins with one of the strings or matches one of the regular expressions in the Exclude table, then the file is not downloaded.
  3. Otherwise, if the URL meets the requirements of the current URL Constraint setting, then the file is allowed to download.
  4. Otherwise, if the URL begins with one of the strings or matches one of the regular expressions in the Include table, then the file is allowed to download.
  5. Otherwise, if the Always Download HTML and CSS option in the General settings is on and the URL references an HTML or CSS file type, then the file is allowed to download.
  6. Otherwise, if the Include Supporting Files option in the General settings is on and the URL references a non-HTML file type, then the file is allowed to download.
  7. Otherwise, the file is not downloaded.
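The seven rules form a first-match-wins chain, which can be sketched as a single Python function. This is a reading of the rules as written, not SiteSucker's implementation. The UrlSettings class, the kind argument ("html", "css", or "other"), and the meets_constraint callback are hypothetical stand-ins for the app's settings, and Python's re module stands in for the ICU regular expression engine.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Row = Tuple[str, bool]  # (string or pattern, Regex box checked)


@dataclass
class UrlSettings:
    original_url: str
    meets_constraint: Callable[[str], bool]  # stand-in for URL Constraint
    exclude: List[Row] = field(default_factory=list)
    include: List[Row] = field(default_factory=list)
    always_download_html_css: bool = False
    include_supporting_files: bool = False


def row_matches(url: str, rows: List[Row]) -> bool:
    """True if the URL begins with a string row or fully matches a regex row."""
    for text, is_regex in rows:
        if is_regex:
            if re.fullmatch(text, url):
                return True
        elif url.startswith(text):
            return True
    return False


def url_allowed(url: str, kind: str, s: UrlSettings) -> bool:
    """Apply rules 1 to 7 in order; kind is 'html', 'css', or 'other'."""
    if url == s.original_url:                                     # rule 1
        return True
    if row_matches(url, s.exclude):                               # rule 2
        return False
    if s.meets_constraint(url):                                   # rule 3
        return True
    if row_matches(url, s.include):                               # rule 4
        return True
    if s.always_download_html_css and kind in ("html", "css"):    # rule 5
        return True
    if s.include_supporting_files and kind != "html":             # rule 6
        return True
    return False                                                  # rule 7
```

Because the Exclude check comes before the constraint and Include checks, an excluded URL is rejected even when it would otherwise satisfy the constraint; only the original URL bypasses the Exclude table.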

Ultimately, a file that has a URL allowed by the URL settings will only be downloaded if the robot exclusions, Limit settings, and File Type settings allow it to download.

In the tables under the Include and Exclude tabs, enter absolute URLs (that is, URLs beginning with http:// or https://) or regular expression patterns. URLs should be entered as they appear in the Safari address and search field, i.e., without encoding except for characters from the ISO-8859-1 extended character set and spaces (which are encoded as %20).

If you are adding a regular expression, check the corresponding Regex box. When using regular expressions, the pattern must match the entire URL. For example, to match any URL that contains “logout”, enter the regular expression “.*logout.*”. SiteSucker currently supports the pattern syntax specified by ICU, which is described at Regular Expressions - ICU Documentation.
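The whole-URL requirement is easy to overlook, so here is a small Python illustration of the difference. Python's re module is used only as a stand-in for ICU. The two engines differ in places, but this pattern uses only constructs that behave the same in both. The helper name matches_entire_url and the sample URL are hypothetical.

```python
import re


def matches_entire_url(pattern: str, url: str) -> bool:
    """True only if the pattern matches the URL from start to end."""
    return re.fullmatch(pattern, url) is not None


url = "https://www.example.com/account/logout?next=home"

# "logout" alone matches only a fragment, so it fails a whole-URL match.
print(matches_entire_url("logout", url))

# Wrapping it in ".*" lets the pattern span the whole URL.
print(matches_entire_url(".*logout.*", url))
```

The first call prints False and the second prints True, which is why a pattern intended to catch a substring needs “.*” on both sides.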

To add a row to the table, click the Plus button, enter the URL or pattern, and press ↩.

To remove rows from the table, select them in the table and click the Minus button.

To modify a row, double-click on a string in the table, enter a new string, and press ↩.