Webpage
The Webpage settings let you customize how webpages are downloaded, analyzed, and modified.
The Webpage pane of the Settings dialog provides the following controls:
Text Encoding
Use this control to specify the text encoding for webpages. SiteSucker will read and save all webpages using the specified text encoding. If it is set to Default, SiteSucker will try to detect the webpage’s text encoding. This setting is ignored when reading or saving webpages that were previously downloaded.
Download Using Web Views
Check this box to have SiteSucker download HTML using hidden web views. When this option is on, SiteSucker will load each webpage into a hidden web view and then extract the HTML from the web view after the page is loaded. This can be useful when webpages are built using JavaScript or when the webpage is an XML file that can be converted into HTML by the web view.
Create PDF
Check this box to have SiteSucker download a site as a PDF document. Each webpage is rendered as a single page in the document, and the width and minimum height of each PDF page are determined by the Web View Size setting. The order of pages in the document will be somewhat random due to the nonlinear nature of websites. Internet links are preserved in the document, and links to downloaded pages are modified so that the user is sent to the appropriate location within the document. Only webpages and other files that can be displayed in a web browser are included in the document. This control is only enabled if the Download Using Web Views setting is on. When this setting is on, other settings that do not apply are ignored.
Save Delay
Use this control to specify how long to delay saving a webpage after it has supposedly finished loading in a web view. Some webpages may take longer to load because their content is generated using JavaScript. This setting provides additional time for webpages to finish loading before they are saved. This control is only enabled if the Download Using Web Views setting is on.
Web View Size
Use this control to specify the size of hidden web views. The web view width can affect the appearance of downloaded webpages and PDF documents created by SiteSucker. The web view height determines the minimum page height when creating PDF documents. This control is only enabled if the Download Using Web Views setting is on.
Attributes
Settings under this tab allow you to specify custom data attributes that SiteSucker should scan for URLs. Introduced in HTML5, custom data attributes store extra information, usually for the page’s JavaScript, in standard HTML tags. Data attribute names begin with data- and do not contain uppercase characters.
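As a rough illustration of the naming rule, the JavaScript sketch below checks whether a candidate name would qualify as a custom data attribute. The function name isValidDataAttributeName is hypothetical and not part of SiteSucker. The check covers the two rules stated above, plus the assumptions that at least one character follows the prefix and that names contain no whitespace.

```javascript
// Hypothetical helper: rough check of a custom data attribute name.
// Rules applied: starts with "data-", has at least one more character,
// and contains no uppercase ASCII letters or whitespace.
function isValidDataAttributeName(name) {
  return /^data-[^A-Z\s]+$/.test(name);
}

console.log(isValidDataAttributeName("data-full-src")); // true
console.log(isValidDataAttributeName("data-fullSrc"));  // false (uppercase "S")
```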
To add a custom data attribute, click the button, enter the name of the attribute, and press ↩.
To remove custom data attributes, select them in the table and click the button.
To modify a custom data attribute, double-click on its name in the table, enter a new name, and press ↩. All names in the table must be unique.
Patterns
Settings under this tab allow you to do the following:
- Specify regular expressions that modify HTML files so that downloaded webpages display and behave correctly.
- Extract and download URLs embedded in JavaScript or nonstandard HTML that SiteSucker would otherwise overlook.
- Detect problems with downloaded HTML files that require the files to be downloaded again.
To replace text in HTML files, set the action to Substitute Early or Substitute Late and enter a search pattern and a substitution template for the text you would like to replace. If text is found that matches the search pattern, it is altered in accordance with the substitution template. The template specifies what should be used to replace each match, with the back-reference $0 representing the matched text, $1 representing the contents of the first capture group, and so on. To delete the matched text, leave the template blank.
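To make the search pattern and template mechanics concrete, here is a minimal JavaScript sketch that approximates a substitution outside SiteSucker. It uses JavaScript's own regular expression engine rather than ICU, so syntax details can differ; in particular, JavaScript writes the whole match as $& where ICU templates use $0, and the sketch translates between the two. The function name substitute, the sample HTML, and the pattern are all hypothetical.

```javascript
// Approximates a Substitute action using JavaScript's regex engine (not ICU).
// ICU templates use $0 for the whole match; JavaScript uses $& instead.
function substitute(html, searchPattern, template) {
  const jsTemplate = template.replace(/\$0/g, "$$&"); // "$$&" yields a literal "$&"
  return html.replace(new RegExp(searchPattern, "g"), jsTemplate);
}

const sampleHtml = '<a href="page.html?session=abc123">Next</a>';

// Hypothetical pattern: capture the URL up to the query string and drop the rest.
const cleaned = substitute(sampleHtml, '(href="[^"?]*)\\?[^"]*"', '$1"');
console.log(cleaned); // <a href="page.html">Next</a>

// A blank template deletes the matched text.
console.log(substitute(sampleHtml, "\\?session=[a-z0-9]+", ""));
```

Testing a pattern this way before entering it in the table can catch mistakes early, though a pattern that works here should still be checked against ICU's syntax.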
To extract URLs from HTML text, set the action to Extract URL and enter a search pattern and a URL template for the URL you would like to extract. If text is found that matches the search pattern, SiteSucker will try to download the URL specified by the URL template. The URL template is ignored if it produces a blank URL or a URL that is identical to the template.
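The behavior described for Extract URL can be sketched in JavaScript as follows. This is an approximation under stated assumptions, not SiteSucker's implementation: it uses JavaScript's regex engine instead of ICU, supports only single-digit back-references, and the function name extractUrls, the sample link, and the pattern are hypothetical.

```javascript
// Sketch of Extract URL: expand the URL template for every match and
// discard results that are blank or identical to the template itself.
function extractUrls(html, searchPattern, urlTemplate) {
  const urls = [];
  for (const match of html.matchAll(new RegExp(searchPattern, "g"))) {
    // Replace $0, $1, ... with the corresponding part of this match.
    const url = urlTemplate.replace(/\$(\d)/g, (_, n) => match[Number(n)] ?? "");
    if (url !== "" && url !== urlTemplate) {
      urls.push(url);
    }
  }
  return urls;
}

// Hypothetical link that opens a page through a JavaScript function call.
const linkHtml = `<a href="javascript:openWin('gallery/photo1.html', 640, 480)">Photo</a>`;
const found = extractUrls(linkHtml, "javascript:openWin\\('([^']*)'", "$1");
console.log(found); // [ 'gallery/photo1.html' ]
```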
To retry downloading an HTML file if a pattern is found in the HTML text, set the action to Retry If Found and enter a search pattern. If text matching the search pattern is found in the file, SiteSucker will try to download the HTML file again. To retry downloading an HTML file if a pattern is not found in the HTML text, set the action to Retry If Missing and enter a search pattern. If text matching the search pattern is not found in the file, SiteSucker will try to download the HTML file again. If SiteSucker has exhausted the number of attempts specified in the Request settings, the HTML file is deleted and a 259 (corrupt file) error is logged. The template value is ignored for these actions.
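The retry rules can be read as a simple loop, sketched below in JavaScript. The names needsRetry, downloadWithRetries, and fetchHtml are hypothetical, and the sketch assumes that the attempt limit counts the first download as well; SiteSucker's actual bookkeeping may differ.

```javascript
// Returns true if any retry rule says the file must be downloaded again.
function needsRetry(html, rules) {
  return rules.some(({ action, pattern }) => {
    const found = new RegExp(pattern).test(html);
    if (action === "Retry If Found") return found;
    if (action === "Retry If Missing") return !found;
    throw new Error(`Not a retry action: ${action}`);
  });
}

// fetchHtml is a stand-in for the download step; it receives the attempt number.
function downloadWithRetries(fetchHtml, rules, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const html = fetchHtml(attempt);
    if (!needsRetry(html, rules)) {
      return { html, attempts: attempt };
    }
  }
  // Attempts exhausted: the file is discarded and error 259 is reported.
  return { error: 259, attempts: maxAttempts };
}

// Example: a placeholder page is served twice before the real content arrives.
const pages = ["Please wait...", "Please wait...", "<html>Real content</html>"];
const retryRules = [{ action: "Retry If Found", pattern: "Please wait" }];
console.log(downloadWithRetries((n) => pages[n - 1], retryRules, 3));
```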
The Extract URL and Substitute Early actions are applied after any pre-analysis script is run, but before SiteSucker scans the HTML file for URLs. The Retry If Found and Retry If Missing actions occur just before the Extract URL and Substitute Early actions are applied. The Substitute Late action occurs after the HTML file is localized. Actions are applied in the order in which they appear in the table, and the order of actions can be rearranged by dragging them in the table. The pattern syntax currently supported is that specified by ICU, which is described at Regular Expressions - ICU Documentation.
As an example, in the image shown above, SiteSucker is instructed to do the following:
- Extract a URL from the first argument of the javascript:openWin() function.
- Replace URLs that have a certain query string with the same URL without the query string.
To add a row to the table, click the button, set the Action, enter the Search Pattern and Template, and press ↩.
To remove rows from the table, select them in the table and click the button.
To modify a row, double-click on a string in the table, enter a new string, and press ↩.
JavaScript
Settings under this tab allow you to inject JavaScript into hidden web views after the page finishes loading but before other sub-resources finish loading. This text box is only enabled if the Download Using Web Views setting is on.
This feature can be used to perform any number of tasks before SiteSucker saves a webpage. For example, it can be used to click on buttons that modify a webpage before SiteSucker saves it; click on links that download attachments; extract obscure URLs from a webpage; or rename files.
In the image shown above, the script calls the moreImagesFunction() after the window has loaded and calls it again whenever webpage changes occur. When the moreImagesFunction() is called, the element with the “trending” ID is clicked and a message is sent back to SiteSucker requesting a five-second delay before saving the webpage. This JavaScript makes it possible to load all the images on a particular webpage before it is saved.
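A script matching that description might look like the sketch below. It is a reconstruction from the prose, not the script from the image. The use of a MutationObserver to detect webpage changes is an assumption, and moreImagesFunction takes its document and message handlers as parameters here only so that it can be exercised outside a web view.

```javascript
// Clicks the element with the "trending" ID and asks SiteSucker for extra time.
function moreImagesFunction(doc, handlers) {
  const trending = doc.getElementById("trending");
  if (trending) {
    trending.click(); // trigger loading of additional images
  }
  // Request a five-second delay before the webpage is saved.
  handlers.delay.postMessage(5);
}

// Only wire things up when running inside a web view.
if (typeof window !== "undefined" && window.webkit) {
  const run = () => moreImagesFunction(document, window.webkit.messageHandlers);
  window.addEventListener("load", run); // after the window has loaded
  // Assumed mechanism for "whenever webpage changes occur".
  new MutationObserver(run).observe(document.documentElement, {
    childList: true,
    subtree: true,
  });
}
```

Because each click can itself change the page, the function keeps running until clicking no longer produces new content, and every run renews the delay request.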
SiteSucker provides a number of message handlers that can be used to pass information from JavaScript back to the application. To call a message handler from JavaScript, include the function window.webkit.messageHandlers.<message name>.postMessage(<argument>) in your code. The message names recognized by SiteSucker, their use, and their arguments are discussed in the table below.
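If the same script is also tested in an ordinary browser, where these handlers do not exist, a small guard avoids errors. The wrapper below is a hypothetical convenience, not something SiteSucker provides; it assumes that a handler which is not registered simply appears as undefined.

```javascript
// Hypothetical wrapper: posts a message only if the named handler exists.
// Returns true when the message was sent, false otherwise.
function sendToSiteSucker(name, argument, win = globalThis.window) {
  const handler = win?.webkit?.messageHandlers?.[name];
  if (!handler) {
    return false; // not running inside SiteSucker, or unknown message name
  }
  handler.postMessage(argument);
  return true;
}

// Inside a SiteSucker web view this writes to the log; elsewhere it does nothing.
sendToSiteSucker("log", "Script started");
```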
Message Name | Description |
---|---|
delay | Allows you to add a delay before saving a webpage. The function argument should be an integer or a floating-point number that specifies the delay in seconds. Essentially, this function resets the Save Delay setting to the value passed into the function without modifying the setting in the SiteSucker document. |
log | Allows you to write a message to SiteSucker’s log file. The function argument should be a string. This function can be useful when debugging your JavaScript. |
rename | Allows you to rename files. The function argument should be an array containing two strings: a search pattern followed by a substitution template. These strings are added temporarily to the Replace table under the Path settings and are removed after the document stops downloading. You could use this feature to rename files with the text content of a button or anchor. |
save, saveLast | Allows you to save the current contents of the web view. Some webpages, such as those containing tables or image galleries, update using JavaScript or other means when an element is clicked but don’t change the page URL. You can use the save message handler to save the contents of the web view, update the page using JavaScript, save the contents again, and so on until you call the saveLast message handler. The function argument should be a string containing a unique name for the webpage. These files are saved regardless of the Maximum Number of Levels setting. A procedure for using this feature is outlined below this table. |
url | Allows you to pass a URL back to SiteSucker. The function argument should be a string that specifies an absolute or relative URL. SiteSucker will then try to download the URL without localizing the URL on the webpage. If you want to localize the URL, you will have to do it yourself using JavaScript or using the Patterns setting. |
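As an illustration of the url and log handlers, the sketch below reports URLs stored in the option values of a navigation menu, a place a link scanner might overlook. The select.nav selector and the function name reportMenuUrls are hypothetical, and the document and handlers are passed in as parameters only so the function can be tried outside a web view.

```javascript
// Hypothetical example: pass URLs held in <option value="..."> elements
// back to SiteSucker, then log how many were reported.
function reportMenuUrls(doc, handlers) {
  let count = 0;
  for (const option of doc.querySelectorAll("select.nav option")) {
    if (option.value) {
      handlers.url.postMessage(option.value); // absolute or relative URL
      count += 1;
    }
  }
  handlers.log.postMessage(`Reported ${count} menu URL(s)`);
  return count;
}

// Only run automatically when inside a web view.
if (typeof window !== "undefined" && window.webkit) {
  reportMenuUrls(document, window.webkit.messageHandlers);
}
```

URLs reported this way are downloaded but not localized, so the menu on the saved page would still point at the original locations unless it is rewritten separately.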
The following is an example of the JavaScript code, with comments, used to save a series of pages from a single web view:
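One way such a script might look is sketched below. It is an illustration built from the description of the save and saveLast handlers, not SiteSucker's own listing. The "next" element ID, the page count, the page names, and the one-second pause are all assumptions, and the document and handlers are passed in as parameters only so the function can be tried outside a web view.

```javascript
// Hypothetical sketch: save several states of a single web view.
// Assumes a button with ID "next" that updates the page without changing its URL.
async function saveAllPages(doc, handlers, pageCount, pause) {
  for (let page = 1; page <= pageCount; page++) {
    const name = `gallery-page-${page}`; // must be unique for each saved page
    if (page < pageCount) {
      handlers.save.postMessage(name);     // save the current contents
      doc.getElementById("next").click();  // update the page in place
      await pause(1000);                   // assumed time for the update to finish
    } else {
      handlers.saveLast.postMessage(name); // final save ends the series
    }
  }
}

// Simple timer-based pause, in milliseconds.
const pauseFor = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Only run automatically when inside a web view.
if (typeof window !== "undefined" && window.webkit) {
  saveAllPages(document, window.webkit.messageHandlers, 3, pauseFor);
}
```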
After the last page is saved, SiteSucker will analyze all the saved pages.