ucoreScraper

HTML parsing, CSS selectors, and web scraping toolkit.

Strict Path Resolution: Unnarize enforces strict sandboxing. All file paths used in this library are resolved relative to the directory of the executing script, regardless of where the unnarize command is run. Absolute paths are typically rebased or rejected to prevent sandbox escape.
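To illustrate the rule, the sketch below uses a hypothetical script location and working directory; the exact handling of absolute paths (rebase vs. reject) is implementation-defined and should be verified against your build.

```
// Script located at:  /home/user/project/scrape.unna
// Invoked from a different directory, e.g.:
//   cd /tmp && unnarize /home/user/project/scrape.unna

// The relative path resolves against the script's directory,
// i.e. /home/user/project/out/page.html -- NOT /tmp/out/page.html.
ucoreScraper.download("https://example.com", "out/page.html");
```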

API Reference

| Function | Returns | Description |
|---|---|---|
| `fetch(url)` | string | Download HTML content from a URL |
| `download(url, filepath)` | bool | Save HTML to a local file |
| `select(html, selector)` | Array | Select elements from an HTML string |
| `parseFile(path, selector)` | Array | Read a file and select elements |
| `parse(html, [debug])` | nil | Parse HTML (`debug` prints the DOM tree) |

CSS Selector Reference

| Selector | Example | Description |
|---|---|---|
| Element | `div`, `p`, `a` | Match by tag name |
| Class | `.classname` | Match by class attribute |
| ID | `#myid` | Match by id attribute |
| Descendant | `div p` | Match `p` inside `div` (any nesting level) |
| Combined | `div.container` | Match `div` with class "container" |
// Examples of CSS selectors
var html = ucoreScraper.fetch("https://example.com");

// Select all paragraphs
var paragraphs = ucoreScraper.select(html, "p");

// Select by class
var articles = ucoreScraper.select(html, ".article");

// Select by ID
var header = ucoreScraper.select(html, "#main-header");

// Descendant selector: links inside nav
var navLinks = ucoreScraper.select(html, "nav a");

Element Structure

Selected elements are returned as an Array of Maps. Each element Map contains:

| Key | Type | Description |
|---|---|---|
| `tag` | string | Tag name (`div`, `p`, `a`, `span`, etc.) |
| `text` | string | All inner text content (recursive) |
| `attributes` | Map | Element attributes (`href`, `class`, `id`, `src`, etc.) |
var links = ucoreScraper.select(html, "a");

// Access first link
var link = links[0];
print(link["tag"]);                      // "a"
print(link["text"]);                     // "Click here"
print(link["attributes"]["href"]);       // "https://example.com"
print(link["attributes"]["class"]);      // "btn primary"
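Because elements are plain Maps, they can be filtered with ordinary control flow. The sketch below keeps only links that carry non-empty inner text; it assumes every matched `a` element has an `href` attribute, which a real page may not guarantee.

```
var links = ucoreScraper.select(html, "a");

// Print only the links that actually have visible text
var i = 0;
while (i < length(links)) {
    if (length(links[i]["text"]) > 0) {
        print(links[i]["text"] + " -> " + links[i]["attributes"]["href"]);
    }
    i = i + 1;
}
```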

Detailed Function Reference

fetch(url)

Downloads HTML content from a URL and returns it as a string. Uses curl internally with redirect following (-L).

var html = ucoreScraper.fetch("https://en.wikipedia.org/wiki/Main_Page");
print("Downloaded " + length(html) + " bytes");

// Check for success (empty string = failure)
if (length(html) == 0) {
    print("Download failed!");
}
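Since failure is signaled by an empty string rather than an error, a simple retry loop can be layered on top. This is a sketch using only constructs shown elsewhere in this document; the URL and retry count are placeholders.

```
var url = "https://example.com";
var html = "";
var attempt = 0;

// Retry up to 3 times; an empty result means the fetch failed
while (attempt < 3) {
    html = ucoreScraper.fetch(url);
    if (length(html) > 0) {
        attempt = 3;            // success: leave the loop
    } else {
        attempt = attempt + 1;  // failure: try again
    }
}

if (length(html) == 0) {
    print("All fetch attempts failed");
}
```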

download(url, filepath)

Saves HTML content directly to a local file. Returns true on success. Automatically creates directories if needed (--create-dirs).

// Download to current directory
ucoreScraper.download("https://example.com", "page.html");

// Download to nested directory (auto-created)
ucoreScraper.download("https://example.com", "data/pages/example.html");

// Check success
if (ucoreScraper.download(url, "output.html")) {
    print("Saved successfully");
} else {
    print("Download failed");
}

select(html, selector)

Parses an HTML string and returns all elements matching the selector. Best when the HTML is already in memory, for example straight from fetch().

var html = ucoreScraper.fetch("https://news.ycombinator.com");

// Get all story titles
var titles = ucoreScraper.select(html, ".titleline");

var i = 0;
while (i < length(titles)) {
    print((i + 1) + ". " + titles[i]["text"]);
    i = i + 1;
}

parseFile(path, selector)

Reads an HTML file from disk, parses it, and returns matching elements. Efficient for large files or repeated processing.

// Download once, process multiple times
ucoreScraper.download("https://en.wikipedia.org/wiki/Countries", "countries.html");

// Extract different elements
var tables = ucoreScraper.parseFile("countries.html", "table");
var links = ucoreScraper.parseFile("countries.html", "a");
var headlines = ucoreScraper.parseFile("countries.html", ".mw-headline");

print("Tables: " + length(tables));
print("Links: " + length(links));
print("Headlines: " + length(headlines));

parse(html, [debug])

Parses HTML and returns nil. When debug is true, it also prints the DOM tree structure, which is useful for diagnosing why a selector matches nothing.

var html = "<div><p>Hello</p></div>";

// Normal parse (returns nil)
ucoreScraper.parse(html);

// Debug mode: prints DOM tree
ucoreScraper.parse(html, true);
// Output:
// DOCUMENT
//   ELEMENT: div
//     ELEMENT: p
//       TEXT: Hello

Complete Web Scraping Example

print("=== Wikipedia Scraper ===");

// 1. Download the page
var url = "https://en.wikipedia.org/wiki/List_of_programming_languages";
var file = "programming_langs.html";

print("Downloading...");
ucoreScraper.download(url, file);

// 2. Extract all language links
var links = ucoreScraper.parseFile(file, "#mw-content-text a");
print("Found " + length(links) + " links");

// 3. Print first 10 language names
var i = 0;
var max = 10;
if (length(links) < max) { max = length(links); }

while (i < max) {
    var link = links[i];
    var href = link["attributes"]["href"];
    var text = link["text"];
    
    print((i+1) + ". " + text + " -> " + href);
    i = i + 1;
}

print("=== Done ===");

Performance

Benchmarked on 370KB Wikipedia HTML (Intel i5-1135G7):

| Operation | Speed | Note |
|---|---|---|
| parseFile (class selector) | 339 ops/sec | ~3ms per parse |
| parseFile (table rows) | 212 ops/sec | ~5ms per parse |
| select (in-memory) | 128 ops/sec | ~8ms per parse |
| parseFile (1000+ links) | 83 ops/sec | ~12ms per parse |

Run Examples

# Run from examples/corelib/scraper/
cd examples/corelib/scraper
../../../bin/unnarize stress_test.unna
../../../bin/unnarize benchmark.unna