# ucoreScraper

HTML parsing, CSS selectors, and web scraping toolkit.
> **Strict Path Resolution:** Unnarize enforces strict sandboxing. All file paths used by this library are resolved relative to the directory of the executing script, regardless of where the unnarize command is run. Absolute paths are typically rebased or rejected to prevent sandbox escape.
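A minimal sketch of what this means in practice (the script path and URL below are illustrative, not part of the library):

```
// Suppose this script is saved as /home/user/project/scrape.unna.
// No matter which directory you invoke unnarize from, the relative
// path below resolves against the script's own directory:
ucoreScraper.download("https://example.com", "cache/page.html");
// The file lands at /home/user/project/cache/page.html,
// not in the shell's current working directory.
```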
## API Reference

| Function | Returns | Description |
|---|---|---|
| fetch(url) | string | Download HTML content from a URL |
| download(url, filepath) | bool | Save HTML to a local file |
| select(html, selector) | Array | Select elements from an HTML string |
| parseFile(path, selector) | Array | Read a file and select elements |
| parse(html, [debug]) | nil | Parse HTML (debug mode prints the DOM tree) |
## CSS Selector Reference
| Selector | Example | Description |
|---|---|---|
| Element | div, p, a | Match by tag name |
| Class | .classname | Match by class attribute |
| ID | #myid | Match by id attribute |
| Descendant | div p | Match p inside div (any level) |
| Combined | div.container | Match div with class "container" |
```
// Examples of CSS selectors
var html = ucoreScraper.fetch("https://example.com");

// Select all paragraphs
var paragraphs = ucoreScraper.select(html, "p");

// Select by class
var articles = ucoreScraper.select(html, ".article");

// Select by ID
var header = ucoreScraper.select(html, "#main-header");

// Descendant selector: links inside nav
var navLinks = ucoreScraper.select(html, "nav a");
```
## Element Structure

Selected elements are returned as an Array of Maps. Each element Map contains:

| Key | Type | Description |
|---|---|---|
| tag | string | Tag name (div, p, a, span, etc.) |
| text | string | All inner text content (recursive) |
| attributes | Map | Element attributes (href, class, id, src, etc.) |
```
var links = ucoreScraper.select(html, "a");

// Access the first link
var link = links[0];
print(link["tag"]);                  // "a"
print(link["text"]);                 // "Click here"
print(link["attributes"]["href"]);   // "https://example.com"
print(link["attributes"]["class"]);  // "btn primary"
```
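Combining this element structure with a loop gives a small sketch for walking every match; it assumes only the Map keys listed above and the length() helper used elsewhere in these docs:

```
// Print the text and destination of every link on the page
var links = ucoreScraper.select(html, "a");
var i = 0;
while (i < length(links)) {
    var link = links[i];
    // Each element is a Map with "tag", "text", and "attributes" keys
    print(link["text"] + " -> " + link["attributes"]["href"]);
    i = i + 1;
}
```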
## Detailed Function Reference

### fetch(url)

Downloads HTML content from a URL and returns it as a string. Uses curl internally with redirect following (-L).
```
var html = ucoreScraper.fetch("https://en.wikipedia.org/wiki/Main_Page");
print("Downloaded " + length(html) + " bytes");

// Check for success (an empty string means failure)
if (length(html) == 0) {
    print("Download failed!");
}
```
### download(url, filepath)

Saves HTML content directly to a local file. Returns true on success. Parent directories are created automatically if needed (curl's --create-dirs).
```
// Download to the current directory
ucoreScraper.download("https://example.com", "page.html");

// Download to a nested directory (auto-created)
ucoreScraper.download("https://example.com", "data/pages/example.html");

// Check success
if (ucoreScraper.download("https://example.com", "output.html")) {
    print("Saved successfully");
} else {
    print("Download failed");
}
```
### select(html, selector)

Parses an HTML string and returns all matching elements. Best for in-memory processing.
```
var html = ucoreScraper.fetch("https://news.ycombinator.com");

// Get all story titles
var titles = ucoreScraper.select(html, ".titleline");
var i = 0;
while (i < length(titles)) {
    print((i + 1) + ". " + titles[i]["text"]);
    i = i + 1;
}
```
### parseFile(path, selector)

Reads an HTML file from disk, parses it, and returns matching elements. Efficient for large files or repeated processing.
```
// Download once, process multiple times
ucoreScraper.download("https://en.wikipedia.org/wiki/Countries", "countries.html");

// Extract different elements
var tables = ucoreScraper.parseFile("countries.html", "table");
var links = ucoreScraper.parseFile("countries.html", "a");
var headlines = ucoreScraper.parseFile("countries.html", ".mw-headline");
print("Tables: " + length(tables));
print("Links: " + length(links));
print("Headlines: " + length(headlines));
```
### parse(html, [debug])

Parses HTML and optionally prints the DOM tree structure. Useful for debugging selector issues.
```
var html = "<div><p>Hello</p></div>";

// Normal parse (returns nil)
ucoreScraper.parse(html);

// Debug mode: prints the DOM tree
ucoreScraper.parse(html, true);
// Output:
// DOCUMENT
//   ELEMENT: div
//     ELEMENT: p
//       TEXT: Hello
```
## Complete Web Scraping Example
```
print("=== Wikipedia Scraper ===");

// 1. Download the page
var url = "https://en.wikipedia.org/wiki/List_of_programming_languages";
var file = "programming_langs.html";
print("Downloading...");
ucoreScraper.download(url, file);

// 2. Extract all language links
var links = ucoreScraper.parseFile(file, "#mw-content-text a");
print("Found " + length(links) + " links");

// 3. Print the first 10 language names
var i = 0;
var max = 10;
if (length(links) < max) { max = length(links); }
while (i < max) {
    var link = links[i];
    var href = link["attributes"]["href"];
    var text = link["text"];
    print((i + 1) + ". " + text + " -> " + href);
    i = i + 1;
}
print("=== Done ===");
```
## Performance

Benchmarked on a 370 KB Wikipedia HTML page (Intel i5-1135G7):
| Operation | Speed | Note |
|---|---|---|
| parseFile (class selector) | 339 ops/sec | ~3ms per parse |
| parseFile (table rows) | 212 ops/sec | ~5ms per parse |
| select (in-memory) | 128 ops/sec | ~8ms per parse |
| parseFile (1000+ links) | 83 ops/sec | ~12ms per parse |
## Run Examples

```sh
# Run from examples/corelib/scraper/
cd examples/corelib/scraper
../../../bin/unnarize stress_test.unna
../../../bin/unnarize benchmark.unna
```