The Webpage section of the Settings dialog provides the following controls:
Text Encoding
Use this control to specify the text encoding for webpages. SiteSucker will read and save all webpages using the specified text encoding. If it is set to Default, SiteSucker will try to detect the webpage's text encoding. This setting is ignored when reading or saving webpages that were previously downloaded.
Check All Links
Check this box to have SiteSucker check all links in all downloaded HTML files (including links to files that you are not downloading) and log any errors that occur. With this option turned on, SiteSucker may report many errors that you normally wouldn't see. This setting is intended as a debugging tool for web designers who want to find bad links on their own sites.
To minimize the time it takes to check all links, set the Filter setting under the File Type settings to Allow Specified File Types with nothing checked so that only HTML and CSS are downloaded, and set the File Modification setting under the General settings to None.
Only Follow Image Links
Check this box to have SiteSucker only follow image links, that is, links that you would navigate in a web browser by clicking on an image. This is useful when you want to download galleries that are navigated by clicking on thumbnails.
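As an illustration of what an "image link" is, the following Python sketch collects the targets of anchors that wrap an image. It is not SiteSucker's implementation; the class and function names are hypothetical, and it assumes an image link is simply an `<a>` element that contains an `<img>` element.

```python
from html.parser import HTMLParser


class ImageLinkCollector(HTMLParser):
    """Collect href values of <a> elements that contain an <img> element."""

    def __init__(self):
        super().__init__()
        self._open_hrefs = []   # hrefs of the <a> elements currently open
        self.image_links = []   # hrefs whose anchor wraps at least one image

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Remember the anchor's target (None if it has no href).
            self._open_hrefs.append(dict(attrs).get("href"))
        elif tag == "img" and self._open_hrefs:
            # An image inside an open anchor marks that anchor as an image link.
            href = self._open_hrefs[-1]
            if href and href not in self.image_links:
                self.image_links.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_hrefs:
            self._open_hrefs.pop()


def image_links(html):
    """Return the targets of all image links found in the HTML text."""
    collector = ImageLinkCollector()
    collector.feed(html)
    return collector.image_links
```

For a thumbnail gallery, `image_links` would return the pages or full-size images behind the thumbnails and skip ordinary text links.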
Scan Comments for URLs
Include Supporting Files
Check this box to have SiteSucker include supporting files in the download. When this option is on, SiteSucker will download non-HTML files (such as style sheets and images) even if they are not allowed by the current Path settings or they exceed the Maximum Number of Levels under the Limit settings.
Download Using Web Views
data- and do not contain uppercase characters.
To add a custom data attribute, click the button, enter the name of the attribute, and press return.
To remove custom data attributes, select them in the table and click the button.
To modify a custom data attribute, double-click on its name in the table, enter a new name, and press return. All names in the table must be unique.
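The naming rules for custom data attributes can be sketched as a small validity check. This is a simplified illustration rather than SiteSucker's own validation: the function name is hypothetical, and the rule is assumed from the HTML standard (a name starts with `data-`, has at least one character after the hyphen, and contains no ASCII uppercase letters), combined with the uniqueness requirement stated above.

```python
import re

# Assumed rule: "data-" followed by at least one character, with no
# whitespace and no ASCII uppercase letters. Real attribute names exclude
# a few more characters; this sketch ignores them.
_DATA_ATTRIBUTE = re.compile(r"data-[^\sA-Z]+")


def can_add(name, existing_names):
    """Return True if name looks like a valid, not-yet-listed data attribute."""
    return _DATA_ATTRIBUTE.fullmatch(name) is not None and name not in existing_names
```

For example, `can_add("data-src", [])` is accepted, while `data-Src` (uppercase) and a second `data-src` (duplicate) are rejected.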
Settings under this tab allow you to use regular expressions to replace text in HTML files and/or extract URLs from HTML text.
To replace text in HTML files, enter a search pattern and a substitution template for the text you would like to replace. If text is found that matches the search pattern, it will be altered in accordance with the substitution template. The template specifies what is to be used to replace each match, with the back-reference $0 representing the matched text, $1 representing the contents of the first capture group, and so on. To delete the matched text, enter a back-reference that doesn't have a corresponding capture group.
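The behavior of a substitution template can be sketched in Python. This is an emulation, not SiteSucker's code: Python's `re` module differs from ICU in places, the helper names are hypothetical, and only single-digit back-references (`$0` to `$9`) are handled. A back-reference with no corresponding capture group expands to nothing, which is how a template such as `$1` deletes the match when the pattern has no groups.

```python
import re

_BACKREF = re.compile(r"\$(\d)")


def expand(template, match):
    """Expand $0..$9 in template; a reference to a missing group becomes ''."""
    def group_text(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            return ""                     # no such capture group
        return match.group(index) or ""   # group exists but may not have matched
    return _BACKREF.sub(group_text, template)


def replace_all(pattern, template, text):
    """Replace every match of pattern in text according to template."""
    return re.sub(pattern, lambda m: expand(template, m), text)


# $1 is the first capture group.
print(replace_all(r"<b>(.*?)</b>", "<strong>$1</strong>", "a <b>x</b> b"))
# $0 is the whole match.
print(replace_all(r"\d+", "[$0]", "n 42"))
# The pattern has no groups, so $1 is empty and the comment is deleted.
print(replace_all(r"<!--.*?-->", "$1", "a<!-- c -->b"))
```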
To extract URLs from HTML text, enter a search pattern and a URL template that specifies a URL that SiteSucker should download. The URL template is ignored if it doesn't include a back-reference or it produces a blank URL.
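URL extraction can be sketched the same way. The code below is an approximation under stated assumptions: it uses Python's `re` rather than ICU, the function names are hypothetical, and the `loadPage` call in the usage note is an invented example of script text. It follows the two rules above: a template without a back-reference is ignored, and blank results are dropped.

```python
import re

_BACKREF = re.compile(r"\$(\d)")


def expand(template, match):
    """Expand $0..$9 in template; a reference to a missing group becomes ''."""
    def group_text(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            return ""
        return match.group(index) or ""
    return _BACKREF.sub(group_text, template)


def extract_urls(pattern, url_template, text):
    """Return the URLs produced by url_template for each match of pattern."""
    if not _BACKREF.search(url_template):
        return []                      # no back-reference: template is ignored
    urls = []
    for match in re.finditer(pattern, text):
        url = expand(url_template, match).strip()
        if url:                        # blank URLs are ignored
            urls.append(url)
    return urls
```

With the text `<script>loadPage('gallery/2.html', 3);</script>`, the pattern `loadPage\('([^']+)'`, and the template `$1`, the sketch yields `gallery/2.html`.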
These search patterns are applied after any pre-analysis script is run, but before SiteSucker scans HTML files for URLs. Search patterns are applied in the order in which they appear in the list, and you can rearrange them by dragging them in the list. Furthermore, if you include both a substitution template and a URL template, text is replaced before URLs are extracted. The pattern syntax currently supported is that specified by ICU. ICU regular expressions are described in the Regular Expressions section of the ICU User Guide.
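The ordering rules can be made concrete with a short Python sketch. It is a model of the described behavior, not SiteSucker's implementation: the names are hypothetical, Python's `re` stands in for ICU, and it assumes that URL extraction for a rule runs the same pattern over the text as it stands after that rule's replacement.

```python
import re

_BACKREF = re.compile(r"\$(\d)")


def expand(template, match):
    """Expand $0..$9 in template; a reference to a missing group becomes ''."""
    def group_text(ref):
        index = int(ref.group(1))
        return (match.group(index) or "") if index <= match.re.groups else ""
    return _BACKREF.sub(group_text, template)


def apply_rules(html, rules):
    """Apply (pattern, substitution_template, url_template) rules in list order.

    Either template may be None. Returns the rewritten HTML and the
    extracted URLs.
    """
    urls = []
    for pattern, substitution, url_template in rules:
        regex = re.compile(pattern)
        if substitution is not None:
            # Replacement happens first...
            html = regex.sub(lambda m: expand(substitution, m), html)
        if url_template is not None and _BACKREF.search(url_template):
            # ...then extraction (assumed to see the replaced text).
            for match in regex.finditer(html):
                url = expand(url_template, match).strip()
                if url:
                    urls.append(url)
    return html, urls


fix_extension = (r"\.htm\b", ".html", None)
find_pages = (r'data-page="([^"]+\.html)"', None, "$1")
page = '<div data-page="a.htm"></div>'

print(apply_rules(page, [fix_extension, find_pages]))  # URL found after the fix
print(apply_rules(page, [find_pages, fix_extension]))  # no URL: wrong order
```

Because the second rule only matches names ending in `.html`, it finds a URL only when the extension fix runs first, which is why the order of rows in the list matters.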
As an example, in the image shown above, SiteSucker is instructed to do the following:
- extract a URL from the first argument of the
- replace URLs that have a certain query string with the same URL without the query string.
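A rule of the second kind might look like the following Python sketch. The actual patterns in the image are not reproduced here; the `ver` query parameter and the function name are invented for illustration. In SiteSucker's table the substitution template would be written `$1`, while Python's `re.sub` spells the same back-reference `\1`.

```python
import re

# Hypothetical rule: capture a quoted href/src value up to "?ver=..." and
# keep only the captured part, dropping the query string.
_VERSIONED_URL = re.compile(r'((?:href|src)="[^"?]+)\?ver=[^"]*')


def strip_version_query(html):
    """Remove a '?ver=...' query string from href and src attribute values."""
    return _VERSIONED_URL.sub(r"\1", html)
```

URLs with other query strings, such as `page.php?id=3`, are left untouched because the pattern only matches the `ver` parameter.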
To add a row to the table, click the button, enter the search pattern, substitution template, and/or URL template, and press return.
To remove rows from the table, select them in the table and click the button.
To modify a row, double-click on a string in the table, enter a new string, and press return.