Webpage
The Webpage settings let you customize how webpages are downloaded, analyzed, and modified.
The Webpage screen provides the following settings:
Download Using Web Views
Switch this on to have SiteSucker download HTML using hidden web views. When this option is on, SiteSucker will load each webpage into a hidden web view and then extract the HTML from the web view after the page is loaded. This can be useful when webpages are built using JavaScript or when the webpage is an XML file that can be converted into HTML by the web view.
Save Delay
Use this control to specify how long to delay saving a webpage after it has supposedly finished loading in a web view. Some webpages may take longer to load because their content is generated using JavaScript. This setting provides additional time for webpages to finish loading before they are saved. This control is only enabled if the Download Using Web Views setting is on.
Text Encoding
Use this control to specify the text encoding for webpages. SiteSucker will read and save all webpages using the specified text encoding. If it is set to Default, SiteSucker will try to detect the webpage’s text encoding. This setting is ignored when reading or saving webpages that were previously downloaded.
Custom Data Attributes
Use the Custom Data Attributes screen to specify custom data attributes that SiteSucker should scan for URLs. Introduced in HTML5, custom data attributes store extra information, usually for the page’s JavaScript, in standard HTML tags. Data attribute names begin with data- and do not contain uppercase characters.
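To illustrate what scanning custom data attributes for URLs involves, here is a minimal Python sketch. It is not SiteSucker's implementation; the class name DataAttributeScanner, the function scan_data_attributes, and the example attribute names are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataAttributeScanner(HTMLParser):
    """Collects URLs found in a chosen set of data-* attributes (hypothetical helper)."""

    def __init__(self, base_url, attribute_names):
        super().__init__()
        self.base_url = base_url
        # Keep only valid names: they begin with "data-" and contain no uppercase characters.
        self.attribute_names = {
            name for name in attribute_names
            if name.startswith("data-") and name == name.lower()
        }
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports attribute names in lowercase.
        for name, value in attrs:
            if name in self.attribute_names and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def scan_data_attributes(html, base_url, attribute_names):
    scanner = DataAttributeScanner(base_url, attribute_names)
    scanner.feed(html)
    scanner.close()
    return scanner.urls
```

For example, scanning a page for data-src would pick up the image URL in a tag such as <img data-src="images/a.png"> that a scan limited to standard attributes would miss.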
If you tap the Edit button in the Custom Data Attributes screen, SiteSucker displays a toolbar with the following buttons:
- Delete: Deletes the selected custom data attributes.
- Edit: Allows you to edit the selected custom data attribute.
- Add: Allows you to add a new custom data attribute.
Patterns
You can use the Patterns screen to do the following:
- Specify regular expressions that modify HTML files so that downloaded webpages display and behave correctly.
- Extract and download URLs embedded in JavaScript or nonstandard HTML that SiteSucker would otherwise overlook.
- Include or exclude URLs by extracting them from HTML text.
- Detect problems with downloaded HTML files that require the file to be downloaded again.
To replace text in HTML files, set the action to Substitute Early or Substitute Late and enter a search pattern and a substitution template for the text you would like to replace. If text is found that matches the search pattern, it is altered in accordance with the substitution template. The template specifies what should be used to replace each match, with the back-reference $0 representing the matched text, $1 representing the contents of the first capture group, and so on. To delete the matched text, enter a blank template.
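The effect of a search pattern and substitution template can be approximated in Python. This is only a sketch: Python's re module is not ICU, so pattern syntax differs in places, and the helper names icu_template_to_python and substitute, along with the sessionid example, are hypothetical.

```python
import re


def icu_template_to_python(template):
    # Simplification: backslashes in the template are treated as literal text.
    escaped = template.replace("\\", "\\\\")
    # Rewrite $0, $1, ... as Python's \g<0>, \g<1>, ... group references.
    return re.sub(r"\$(\d+)", r"\\g<\1>", escaped)


def substitute(html, pattern, template):
    """Replace every match of pattern using a $n-style template."""
    return re.sub(pattern, icu_template_to_python(template), html)


# Strip a (hypothetical) session query string from links, keeping the rest via $1.
cleaned = substitute(
    '<a href="page.html?sessionid=abc123">x</a>',
    r'(href="[^"?]*)\?sessionid=[^"]*',
    "$1",
)
print(cleaned)  # <a href="page.html">x</a>

# A blank template deletes the matched text.
print(substitute("a<!--ad-->b", r"<!--.*?-->", ""))  # ab
```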
To extract a URL from HTML text, set the action to Extract URL and enter a search pattern and a URL template for the URL you would like to extract. If text is found that matches the search pattern, SiteSucker extracts the URL specified by the URL template, converts it to an absolute URL if necessary, and then tries to download the URL. The extracted URL is ignored if the action produces a blank URL or a URL that is identical to the template. SiteSucker also ignores the extracted URL if it is not allowed by the other settings.
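As a rough illustration of the Extract URL steps described above (match, expand the template, skip blank or unchanged results, convert to an absolute URL), consider the following Python sketch. The function names and the window.location example are assumptions, Python's re syntax stands in for ICU, and the check against the other settings is not modelled.

```python
import re
from urllib.parse import urljoin


def expand_template(match, template):
    """Expand $0, $1, ... in template using the groups of match."""
    def lookup(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            return ""  # reference to a group the pattern does not define
        return match.group(index) or ""
    return re.sub(r"\$(\d+)", lookup, template)


def extract_urls(html, page_url, pattern, url_template):
    urls = []
    for match in re.finditer(pattern, html):
        candidate = expand_template(match, url_template)
        # Ignore blank results and results identical to the template.
        if not candidate or candidate == url_template:
            continue
        # Convert relative URLs to absolute ones using the page's own URL.
        urls.append(urljoin(page_url, candidate))
    return urls


page = "<script>window.location = 'gallery/page2.html';</script>"
found = extract_urls(
    page,
    "https://example.com/a/index.html",
    r"window\.location\s*=\s*'([^']*)'",
    "$1",
)
print(found)  # ['https://example.com/a/gallery/page2.html']
```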
To include a URL extracted from HTML text, set the action to Include URL and enter a search pattern and a URL template for the URL you would like to include. If text is found that matches the search pattern, SiteSucker extracts the URL specified by the URL template, converts it to an absolute URL if necessary, and then adds it to any other URLs from the Include list in the URL settings. The extracted URL is ignored if the action produces a blank URL or a URL that is identical to the template. This action does not alter the URL settings displayed in the app.
To exclude a URL extracted from HTML text, set the action to Exclude URL and enter a search pattern and a URL template for the URL you would like to exclude. If text is found that matches the search pattern, SiteSucker extracts the URL specified by the URL template, converts it to an absolute URL if necessary, and then adds it to any other URLs from the Exclude list in the URL settings. The extracted URL is ignored if the action produces a blank URL or a URL that is identical to the template. This action does not alter the URL settings displayed in the app.
To retry downloading an HTML file if a pattern is found in the HTML text, set the action to Retry If Found and enter a search pattern. If text is found that matches the search pattern, SiteSucker will try to download the HTML file again. To retry downloading an HTML file if a pattern is not found in the HTML text, set the action to Retry If Missing and enter a search pattern. If text matching the search pattern is not found in the file, SiteSucker will try to download the HTML file again. If SiteSucker has exhausted the number of attempts specified in the Request settings, the HTML file is deleted and a 259 (corrupt file) error is logged. The template value is ignored for these actions.
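The retry logic described above might be sketched as follows. This is an assumed model, not SiteSucker's code: the fetch callback, the rules list, and the max_attempts parameter (standing in for the limit in the Request settings) are all hypothetical, and Python's re syntax stands in for ICU.

```python
import re

CORRUPT_FILE_ERROR = 259  # the error code named in the text above


def needs_retry(html, rules):
    """rules is a list of (action, pattern) pairs."""
    for action, pattern in rules:
        found = re.search(pattern, html) is not None
        if action == "Retry If Found" and found:
            return True
        if action == "Retry If Missing" and not found:
            return True
    return False


def download_with_retries(fetch, rules, max_attempts):
    """Call fetch() until the HTML passes every rule or attempts run out."""
    for _ in range(max_attempts):
        html = fetch()
        if not needs_retry(html, rules):
            return html, None
    # Attempts exhausted: the file is discarded and an error is reported.
    return None, CORRUPT_FILE_ERROR
```

A typical use would be a Retry If Missing rule with a pattern such as </html> to catch truncated downloads, or a Retry If Found rule that matches a placeholder the server returns while the real page is still being generated.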
The Retry If Found and Retry If Missing actions are applied right after SiteSucker reads the downloaded HTML file. The Extract URL, Include URL, Exclude URL, and Substitute Early actions are applied after the Retry If Found and Retry If Missing actions, but before SiteSucker scans the HTML file for URLs. The Substitute Late action is applied after the HTML file is localized. Otherwise, actions are applied in the order in which they appear in the list, and you can rearrange them by dragging them in the list when editing. Patterns use the ICU regular expression syntax, which is described at Regular Expressions | ICU Documentation.
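One way to read the ordering rules above is as three phases, with list order preserved inside each phase. The Python sketch below encodes that reading; the PHASES table and the application_order function are illustrative assumptions, not part of SiteSucker.

```python
# Phase 0: right after the file is read.
# Phase 1: before the file is scanned for URLs.
# Phase 2: after the file is localized.
PHASES = {
    "Retry If Found": 0,
    "Retry If Missing": 0,
    "Extract URL": 1,
    "Include URL": 1,
    "Exclude URL": 1,
    "Substitute Early": 1,
    "Substitute Late": 2,
}


def application_order(patterns):
    """patterns is a list of (action, search_pattern) pairs in list order."""
    # sorted() is stable, so entries in the same phase keep their list order.
    return sorted(patterns, key=lambda entry: PHASES[entry[0]])


listed = [
    ("Substitute Late", "a"),
    ("Extract URL", "b"),
    ("Retry If Missing", "c"),
    ("Substitute Early", "d"),
    ("Retry If Found", "e"),
]
for action, _ in application_order(listed):
    print(action)
```

Under this reading, dragging a pattern within the list only changes its position relative to other patterns in the same phase.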
As an example, in the image shown above (where the first line is the search pattern, the second line is the template, and the third line is the action), SiteSucker is instructed to do the following:
- extract a URL from the first argument of the javascript:openWin() function and then
- replace URLs that have a certain query string with the same URL without the query string.
If you tap the Edit button in the Patterns screen, SiteSucker displays a toolbar with the following buttons:
- Delete: Deletes the selected patterns.
- Edit: Allows you to edit the selected pattern.
- Add: Allows you to add a new pattern.