Webpage

The Webpage screen provides the following settings:

Check All Links

Switch this on to have SiteSucker check all links in all downloaded HTML files, including links to files that you are not downloading, and log any errors that occur. With this option on, SiteSucker may report many errors that you normally wouldn't see. This setting is intended as a debugging tool for web designers who want to check their own sites for bad links.

Scan Comments for URLs

Switch this on to have SiteSucker scan comments for URLs. Normally, SiteSucker ignores comments. This option is useful when tags are included in comments so that they can be used by Internet Explorer or JavaScript.
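
To illustrate the kind of markup this setting targets, the following minimal Python sketch collects URLs that appear only inside an Internet Explorer conditional comment. It is not SiteSucker's implementation; the sample page, the class name CommentUrlCollector, and the simplified attribute pattern are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: the second style sheet exists only inside a conditional comment.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="main.css">
<!--[if lt IE 9]><link rel="stylesheet" href="ie-fixes.css"><![endif]-->
</head><body></body></html>"""

# Simplified pattern for href/src values (assumes double-quoted attribute values).
URL_ATTRIBUTE = re.compile(r'(?:href|src)="([^"]*)"')


class CommentUrlCollector(HTMLParser):
    """Collects URLs found inside HTML comments, which a normal scan would skip."""

    def __init__(self):
        super().__init__()
        self.comment_urls = []

    def handle_comment(self, data):
        # data is the comment text without the <!-- and --> delimiters.
        self.comment_urls.extend(URL_ATTRIBUTE.findall(data))


collector = CommentUrlCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.comment_urls)
```

A scan that ignores comments would find only main.css here; scanning comments also finds ie-fixes.css.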

Include Supporting Files

Switch this on to have SiteSucker include supporting files in the download. When this option is on, SiteSucker will download non-HTML files (such as style sheets and images) even if the current Path settings do not allow them or the Maximum Levels limit in the Limit settings has been exceeded.

Download Using Web Views

Switch this on to have SiteSucker download HTML using hidden web views. When this option is on, SiteSucker will load each HTML file into a hidden web view and then extract the HTML from the web view after the page is loaded. This can be useful if the file uses JavaScript to modify the HTML as the page is loading or if the file is an XML file that can be converted into HTML by the web view.

Save Delay (seconds)

Use this control to specify how long to wait before saving a webpage after the web view reports that it has finished loading. Some webpages take longer to load because their content is generated using JavaScript. This setting gives such webpages additional time to finish loading before they are saved. This control is only enabled if the Download Using Web Views setting is on.

Text Encoding

Use this control to specify the text encoding for webpages. SiteSucker will read and save all webpages using the specified text encoding. If it is set to Default, SiteSucker will try to detect the webpage's text encoding. This setting is ignored when reading or saving webpages that were previously downloaded, when the Download Using Web Views setting is on, or when viewing webpages using the browser.


Custom Data Attributes

Use the Custom Data Attributes screen to specify custom data attributes that SiteSucker should scan for URLs. Introduced in HTML5, custom data attributes store extra information, usually for the page's JavaScript, in standard HTML tags. Data attribute names begin with data- and do not contain uppercase characters.
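
As a rough illustration, the Python sketch below checks the naming rule described above and collects the values of a configured custom data attribute as candidate URLs. It is only a sketch and not SiteSucker's code; the function and class names, the data-src attribute, and the sample tag are hypothetical. The requirement of at least one character after data- comes from the HTML specification rather than from this page.

```python
from html.parser import HTMLParser


def is_custom_data_attribute_name(name):
    """Return True if name starts with 'data-' and has no ASCII uppercase letters."""
    return (
        name.startswith("data-")
        and len(name) > len("data-")  # at least one character after the hyphen
        and not any("A" <= ch <= "Z" for ch in name)
    )


class DataAttributeUrlCollector(HTMLParser):
    """Collects the values of the configured data attributes as candidate URLs."""

    def __init__(self, attribute_names):
        super().__init__()
        self.attribute_names = set(attribute_names)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.attribute_names and value:
                self.urls.append(value)


# Hypothetical lazy-loading image: the real image URL is stored in data-src.
SAMPLE_HTML = '<img src="placeholder.gif" data-src="photos/full.jpg" data-caption="Sunset">'

collector = DataAttributeUrlCollector(["data-src"])
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.urls)
```

Only data-src is treated as a URL source in this example; data-caption is ignored because it was not listed.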

If you tap the Edit button in the Custom Data Attributes screen, SiteSucker displays a toolbar with the following buttons:

Delete

Deletes the selected custom data attributes.

Add

Allows you to add a new custom data attribute.

Edit

Allows you to edit the selected custom data attribute.


Patterns

Use the Patterns screen to specify regular expressions to replace text in HTML files and/or extract URLs from HTML text.

[Image: Patterns screen]

To replace text in HTML files, enter a search pattern and a substitution template for the text you would like to replace. If text is found that matches the search pattern, it will be altered in accordance with the substitution template. The template specifies what is to be used to replace each match, with the back-reference $0 representing the matched text, $1 representing the contents of the first capture group, and so on. To delete the matched text, enter a back-reference that doesn't have a corresponding capture group.
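
The effect of $0 and $1 in a substitution template can be approximated in Python. This is a sketch under stated assumptions: Python's re module is not ICU, so edge cases may differ, and the helper names icu_template_to_python and replace_text, the sample HTML, and both patterns are made up for the example.

```python
import re


def icu_template_to_python(template):
    # ICU-style templates write $0, $1, ...; Python's re module expects the
    # \g<0>, \g<1>, ... form. Simplified: escaped dollar signs are not handled.
    return re.sub(r"\$(\d+)", r"\\g<\1>", template)


def replace_text(html, search_pattern, substitution_template):
    """Replace every match of search_pattern using an ICU-style template."""
    return re.sub(search_pattern, icu_template_to_python(substitution_template), html)


html = "<p>Visit <b>SiteSucker</b> today.</p>"

# $1 stands for the contents of the first capture group.
with_strong = replace_text(html, r"<b>(.*?)</b>", "<strong>$1</strong>")
print(with_strong)

# $0 stands for the whole matched text.
with_em = replace_text(html, r"SiteSucker", "<em>$0</em>")
print(with_em)
```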

To extract URLs from HTML text, enter a search pattern and a URL template that specifies a URL that SiteSucker should download. The URL template is ignored if it doesn't include a back-reference or it produces a blank URL.
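
The two conditions under which a URL template is ignored can be sketched as follows. This is an illustrative Python approximation, not SiteSucker's implementation; the function name extract_urls, the sample script, and the cdn.example.com address are hypothetical.

```python
import re

# Matches ICU-style back-references such as $0 or $1 in a template.
BACK_REFERENCE = re.compile(r"\$(\d+)")


def extract_urls(html, search_pattern, url_template):
    """Expand url_template for each match and return the non-blank results."""
    # A template without any back-reference is ignored.
    if not BACK_REFERENCE.search(url_template):
        return []
    # Python writes back-references as \g<n> rather than $n.
    python_template = BACK_REFERENCE.sub(r"\\g<\1>", url_template)
    urls = []
    for match in re.finditer(search_pattern, html):
        url = match.expand(python_template).strip()
        if url:  # a blank URL is ignored
            urls.append(url)
    return urls


# Hypothetical page that names a media file only inside a script.
html = '<script>var player = {file: "media/intro.mp4"};</script>'
pattern = r'file:\s*"([^"]*)"'

print(extract_urls(html, pattern, "$1"))
print(extract_urls(html, pattern, "https://cdn.example.com/$1"))
print(extract_urls(html, pattern, "media/fixed.mp4"))  # no back-reference
```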

These search patterns are applied before SiteSucker scans HTML files for URLs. Search patterns are applied in the order in which they appear in the list, and you can rearrange them by dragging them in the list. If a pattern includes both a substitution template and a URL template, text is replaced before URLs are extracted. Patterns use the ICU regular expression syntax, which is described at Regular Expressions - ICU User Guide.

As an example, in the image shown above, SiteSucker is instructed to do the following:

  1. extract a URL from the first argument of the javascript:openWin() function and then
  2. replace URLs that have a certain query string with the same URL without the query string.
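
The two steps can be approximated in Python as shown below. The actual patterns in the image are not reproduced here, so both regular expressions, the sessionid query string, and the sample HTML are hypothetical stand-ins, and Python's re module is used in place of ICU.

```python
import re

# Hypothetical page with one JavaScript pop-up link and one link carrying a query string.
html = (
    '<a href="javascript:openWin(\'gallery/photo1.html\', 400, 300)">Photo</a>\n'
    '<a href="article.html?sessionid=abc123">Article</a>'
)

# Step 1 (hypothetical pattern): capture the first argument of javascript:openWin().
# With a URL template of $1, the captured text becomes a URL to download.
OPEN_WIN = r"javascript:openWin\('([^']*)'"
extracted_urls = [match.group(1) for match in re.finditer(OPEN_WIN, html)]

# Step 2 (hypothetical pattern): keep the URL (group 1) and drop the query string.
# Python's \g<1> plays the role of $1 in the substitution template.
SESSION_QUERY = r'(href="[^"?]*)\?sessionid=[^"]*'
cleaned_html = re.sub(SESSION_QUERY, r"\g<1>", html)

print(extracted_urls)
print(cleaned_html)
```

In this sketch the first link is left untouched by step 2 because it contains no query string, while the second link becomes href="article.html".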

If you tap the Edit button in the Patterns screen, SiteSucker displays a toolbar with the following buttons:

Delete

Deletes the selected patterns.

Add

Allows you to add a new pattern.

Edit

Allows you to edit the selected pattern.