The Webpage settings let you customize how webpages are downloaded, analyzed, and modified.
The Webpage screen provides the following settings:
Download Using Web Views
Text Encoding
Use this control to specify the text encoding for webpages. SiteSucker will read and save all webpages using the specified text encoding. If it is set to Default, SiteSucker will try to detect the webpage’s text encoding. This setting is ignored when reading or saving webpages that were previously downloaded.
Custom Data Attributes
Custom data attribute names begin with data- and do not contain uppercase characters.
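The naming rule can be sketched as a small check. This is an illustration only, not SiteSucker's own validation: the function name `is_custom_data_attribute` is hypothetical, and the requirement of at least one character after the `data-` prefix is an assumption taken from the HTML definition of custom data attributes.

```python
def is_custom_data_attribute(name: str) -> bool:
    """Return True if name looks like a valid custom data attribute name."""
    prefix = "data-"
    # Assumption: at least one character must follow the "data-" prefix.
    has_prefix = name.startswith(prefix) and len(name) > len(prefix)
    # The name must not contain any uppercase characters.
    has_uppercase = any(ch.isupper() for ch in name)
    return has_prefix and not has_uppercase


print(is_custom_data_attribute("data-src"))   # True
print(is_custom_data_attribute("data-Src"))   # False (uppercase character)
print(is_custom_data_attribute("src"))        # False (no data- prefix)
```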
If you tap the Edit button in the Custom Data Attributes screen, SiteSucker displays a toolbar with the following buttons:
Deletes the selected custom data attributes.
Allows you to edit the selected custom data attribute.
Allows you to add a new custom data attribute.
Patterns
Use the Patterns screen to specify regular expressions to replace text in HTML files or extract URLs from HTML text.
To replace text in HTML files, set the Template Type to Early Substitution or Late Substitution and enter a search pattern and a substitution template for the text you would like to replace. If text is found that matches the search pattern, it is altered in accordance with the substitution template. The template specifies what should be used to replace each match, with the back-reference $0 representing the matched text, $1 representing the contents of the first capture group, and so on. To delete the matched text, enter a blank template.
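The way a substitution template expands can be sketched as follows. This is a simplified illustration, not SiteSucker's implementation: it uses Python's `re` module, whose pattern syntax differs from ICU in places, and the helper `apply_substitution` is a hypothetical name that handles only plain `$0`, `$1`, ... back-references (no escaping rules).

```python
import re


def apply_substitution(text: str, search_pattern: str, template: str) -> str:
    """Replace each match of search_pattern using a $0/$1-style template."""

    def expand(match: re.Match) -> str:
        # Swap every $n in the template for the text of capture group n;
        # group 0 is the whole match. Unmatched groups become empty text.
        return re.sub(
            r"\$(\d+)",
            lambda ref: match.group(int(ref.group(1))) or "",
            template,
        )

    return re.sub(search_pattern, expand, text)


html = "<p>Hello <b>world</b></p>"

# $1 inserts the contents of the first capture group.
print(apply_substitution(html, r"<b>(.*?)</b>", "<strong>$1</strong>"))
# <p>Hello <strong>world</strong></p>

# $0 inserts the entire matched text.
print(apply_substitution(html, r"<b>(.*?)</b>", "[$0]"))
# <p>Hello [<b>world</b>]</p>

# A blank template deletes the matched text.
print(apply_substitution(html, r"<b>.*?</b>", ""))
# <p>Hello </p>
```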
To extract URLs from HTML text, set the Template Type to URL Extraction and enter a search pattern and a URL template that specifies a URL that SiteSucker should download. The URL template is ignored if it produces a blank URL or a URL that is identical to the template.
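URL extraction, including the rule that blank or unchanged results are ignored, might look like the sketch below. It is an approximation in Python rather than ICU, and every name in it is an assumption made for illustration: the helper `extract_urls`, the JavaScript function `openPage`, and the sample URLs.

```python
import re


def extract_urls(html: str, search_pattern: str, url_template: str) -> list[str]:
    """Build one URL per match from a $0/$1-style URL template."""
    urls = []
    for match in re.finditer(search_pattern, html):
        # Expand $n back-references with the matching capture groups.
        url = re.sub(
            r"\$(\d+)",
            lambda ref: match.group(int(ref.group(1))) or "",
            url_template,
        )
        # Ignore blank URLs and URLs identical to the template itself.
        if url and url != url_template:
            urls.append(url)
    return urls


html = (
    "<a onclick=\"openPage('gallery/1.html')\">One</a> "
    "<a onclick=\"openPage('')\">Empty</a>"
)
pattern = r"openPage\('([^']*)'\)"

print(extract_urls(html, pattern, "$1"))
# ['gallery/1.html']  (the second match yields a blank URL and is skipped)

print(extract_urls(html, pattern, "https://example.com/static.html"))
# []  (the result equals the template, so it is ignored)
```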
The URL Extraction and Early Substitution search patterns are applied before SiteSucker scans HTML files for URLs. The Late Substitution search pattern is applied after HTML files are localized. Search patterns are applied in the order in which they appear in the list, and the order of search patterns can be rearranged by dragging them in the list when editing. The pattern syntax currently supported is that specified by ICU, which is described at Regular Expressions - ICU Documentation.
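Because patterns run in list order, rearranging them can change the result. The sketch below illustrates this with two hypothetical patterns whose templates are plain text; `apply_in_order` is an illustrative helper written with Python's `re` module, not part of SiteSucker.

```python
import re


def apply_in_order(html: str, patterns: list[tuple[str, str]]) -> str:
    """Apply (search_pattern, template) pairs one after another."""
    for search_pattern, template in patterns:
        # Each pattern sees the output of the pattern before it.
        html = re.sub(search_pattern, template, html)
    return html


patterns = [
    (r"colour", "color"),  # first: normalize the spelling
    (r"color", "shade"),   # second: rename the normalized word
]

print(apply_in_order("colour", patterns))
# shade  (the second pattern matches the output of the first)

print(apply_in_order("colour", list(reversed(patterns))))
# color  ("color" does not occur in "colour", so only one pattern fires)
```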
As an example, in the image shown above, SiteSucker is instructed to do the following:
- extract a URL from the first argument of the
- replace URLs that have a certain query string with the same URL without the query string.
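A substitution of the second kind, removing a particular query string, could be written along these lines. The pattern from the image is not reproduced here; the `ver=` query string and the function name `strip_version_query` are hypothetical. Note that Python spells the back-reference `\1`, where an ICU template would use `$1`.

```python
import re

# Hypothetical search pattern: an href whose query string starts with "ver=".
# Capture group 1 holds everything before the "?".
SEARCH_PATTERN = r'(href="[^"?]+)\?ver=[^"]*'


def strip_version_query(html: str) -> str:
    """Replace matching URLs with the same URL minus the query string."""
    # r"\1" is Python's equivalent of the ICU template "$1".
    return re.sub(SEARCH_PATTERN, r"\1", html)


print(strip_version_query('<link href="style.css?ver=1.2" rel="stylesheet">'))
# <link href="style.css" rel="stylesheet">

print(strip_version_query('<a href="page.html?id=3">'))
# <a href="page.html?id=3">  (other query strings are left alone)
```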
If you tap the Edit button in the Patterns screen, SiteSucker displays a toolbar with the following buttons:
Deletes the selected patterns.
Allows you to edit the selected pattern.
Allows you to add a new pattern.