ucoreScraper

HTML parsing, CSS selectors, and web scraping toolkit.

Strict Path Resolution: Unnarize enforces strict sandboxing. All file paths used in this library are resolved relative to the directory of the executing script, regardless of where the unnarize command is run. Absolute paths are typically rebased or rejected to prevent sandbox escape.
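To illustrate the rule, the sketch below uses a hypothetical script location and working directory; the exact handling of absolute paths (rebase vs. reject) is implementation-defined and should be verified against your build.

```
// Script located at:  /home/user/project/scrape.unna
// Invoked from a different directory, e.g.:
//   cd /tmp && unnarize /home/user/project/scrape.unna

// The relative path resolves against the script's directory,
// i.e. /home/user/project/out/page.html -- NOT /tmp/out/page.html.
ucoreScraper.download("https://example.com", "out/page.html");
```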

API Reference

| Function | Returns | Description |
|---|---|---|
| `fetch(url)` | string | Download HTML content from a URL |
| `download(url, filepath)` | bool | Save HTML to a local file |
| `select(html, selector)` | Array | Select elements from an HTML string |
| `parseFile(path, selector)` | Array | Read a file and select elements |
| `parse(html, [debug])` | nil | Parse HTML (`debug` prints the DOM tree) |

CSS Selector Reference

| Selector | Example | Description |
|---|---|---|
| Element | `div`, `p`, `a` | Match by tag name |
| Class | `.classname` | Match by class attribute |
| ID | `#myid` | Match by id attribute |
| Descendant | `div p` | Match `p` inside `div` (any nesting level) |
| Combined | `div.container` | Match `div` with class "container" |
// Examples of CSS selectors
var html = ucoreScraper.fetch("https://example.com");

// Select all paragraphs
var paragraphs = ucoreScraper.select(html, "p");

// Select by class
var articles = ucoreScraper.select(html, ".article");

// Select by ID
var header = ucoreScraper.select(html, "#main-header");

// Descendant selector: links inside nav
var navLinks = ucoreScraper.select(html, "nav a");

Element Structure

Selected elements are returned as an Array of Maps. Each element Map contains:

| Key | Type | Description |
|---|---|---|
| `tag` | string | Tag name (`div`, `p`, `a`, `span`, etc.) |
| `text` | string | All inner text content (recursive) |
| `attributes` | Map | Element attributes (`href`, `class`, `id`, `src`, etc.) |
var links = ucoreScraper.select(html, "a");

// Access first link
var link = links[0];
print(link["tag"]);                      // "a"
print(link["text"]);                     // "Click here"
print(link["attributes"]["href"]);       // "https://example.com"
print(link["attributes"]["class"]);      // "btn primary"
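Because elements are plain Maps, they can be filtered with ordinary control flow. The sketch below keeps only links that carry non-empty inner text; it assumes every matched `a` element has an `href` attribute, which a real page may not guarantee.

```
var links = ucoreScraper.select(html, "a");

// Print only the links that actually have visible text
var i = 0;
while (i < length(links)) {
    if (length(links[i]["text"]) > 0) {
        print(links[i]["text"] + " -> " + links[i]["attributes"]["href"]);
    }
    i = i + 1;
}
```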

Detailed Function Reference

fetch(url)

Downloads HTML content from a URL and returns it as a string. Uses curl internally with redirect following (-L).

var html = ucoreScraper.fetch("https://en.wikipedia.org/wiki/Main_Page");
print("Downloaded " + length(html) + " bytes");

// Check for success (empty string = failure)
if (length(html) == 0) {
    print("Download failed!");
}
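Since failure is signaled by an empty string rather than an error, a simple retry loop can be layered on top. This is a sketch using only constructs shown elsewhere in this document; the URL and retry count are placeholders.

```
var url = "https://example.com";
var html = "";
var attempt = 0;

// Retry up to 3 times; an empty result means the fetch failed
while (attempt < 3) {
    html = ucoreScraper.fetch(url);
    if (length(html) > 0) {
        attempt = 3;            // success: leave the loop
    } else {
        attempt = attempt + 1;  // failure: try again
    }
}

if (length(html) == 0) {
    print("All fetch attempts failed");
}
```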

download(url, filepath)

Saves HTML content directly to a local file. Returns true on success. Automatically creates directories if needed (--create-dirs).

// Download to current directory
ucoreScraper.download("https://example.com", "page.html");

// Download to nested directory (auto-created)
ucoreScraper.download("https://example.com", "data/pages/example.html");

// Check success
if (ucoreScraper.download(url, "output.html")) {
    print("Saved successfully");
} else {
    print("Download failed");
}

select(html, selector)

Parses an HTML string and returns all elements matching the selector. Best when the HTML is already in memory, for example straight from fetch().

var html = ucoreScraper.fetch("https://news.ycombinator.com");

// Get all story titles
var titles = ucoreScraper.select(html, ".titleline");

var i = 0;
while (i < length(titles)) {
    print((i + 1) + ". " + titles[i]["text"]);
    i = i + 1;
}

parseFile(path, selector)

Reads an HTML file from disk, parses it, and returns matching elements. Efficient for large files or repeated processing.

// Download once, process multiple times
ucoreScraper.download("https://en.wikipedia.org/wiki/Countries", "countries.html");

// Extract different elements
var tables = ucoreScraper.parseFile("countries.html", "table");
var links = ucoreScraper.parseFile("countries.html", "a");
var headlines = ucoreScraper.parseFile("countries.html", ".mw-headline");

print("Tables: " + length(tables));
print("Links: " + length(links));
print("Headlines: " + length(headlines));

parse(html, [debug])

Parses HTML and returns nil. When debug is true, it also prints the DOM tree structure, which is useful for diagnosing why a selector matches nothing.

var html = "<div><p>Hello</p></div>";

// Normal parse (returns nil)
ucoreScraper.parse(html);

// Debug mode: prints DOM tree
ucoreScraper.parse(html, true);
// Output:
// DOCUMENT
//   ELEMENT: div
//     ELEMENT: p
//       TEXT: Hello

Complete Web Scraping Example

print("=== Wikipedia Scraper ===");

// 1. Download the page
var url = "https://en.wikipedia.org/wiki/List_of_programming_languages";
var file = "programming_langs.html";

print("Downloading...");
ucoreScraper.download(url, file);

// 2. Extract all language links
var links = ucoreScraper.parseFile(file, "#mw-content-text a");
print("Found " + length(links) + " links");

// 3. Print first 10 language names
var i = 0;
var max = 10;
if (length(links) < max) { max = length(links); }

while (i < max) {
    var link = links[i];
    var href = link["attributes"]["href"];
    var text = link["text"];
    
    print((i+1) + ". " + text + " -> " + href);
    i = i + 1;
}

print("=== Done ===");

Performance

Benchmarked on 370KB Wikipedia HTML (Intel i5-1135G7):

| Operation | Speed | Note |
|---|---|---|
| parseFile (class selector) | 339 ops/sec | ~3ms per parse |
| parseFile (table rows) | 212 ops/sec | ~5ms per parse |
| select (in-memory) | 128 ops/sec | ~8ms per parse |
| parseFile (1000+ links) | 83 ops/sec | ~12ms per parse |

Run Examples

# Run from examples/corelib/scraper/
cd examples/corelib/scraper
../../../bin/unnarize stress_test.unna
../../../bin/unnarize benchmark.unna