This dataset is a real-world collection of web pages for research on the automatic extraction of structured data (e.g., attribute-value pairs of entities) from the Web. We hope it can serve as a useful benchmark for evaluating and comparing different methods for structured web data extraction.
Contents of the Dataset
The dataset currently contains:
- 8 verticals with diverse semantics;
- 80 web sites (10 per vertical);
- 124,291 web pages (200~2,000 per web site), each containing a single data record with detailed information of an entity;
- 32 attributes (3~5 per vertical), each with carefully labeled ground-truth values in every web page.
The goal of structured data extraction is to automatically identify the values of these attributes from web pages.
The involved verticals are summarized as follows:
||model, price, engine, fuel_economy
||title, author, isbn_13, publisher, publication_date
||model, price, manufacturer
||title, company, location, date_posted
||title, director, genre, mpaa_rating
||name, team, height, weight
||name, address, phone, cuisine
||name, phone, website, type
Format of Web Pages
Each web page in the dataset is stored as one .htm file (in UTF-8 encoding) where the first tag encodes the source URL of the page.
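As a minimal sketch of how the source URL might be recovered from a page, the snippet below finds the first tag in the file and returns the first http(s) URL inside it. The function names `read_page` and `source_url` are hypothetical, and the sketch assumes only that the URL appears as text somewhere inside that first tag; the exact tag name and attribute are not specified here, so adjust the patterns to the actual files.

```python
import re
from typing import Optional

# Assumption: the first element tag (not a comment, doctype, or closing tag)
# carries the source URL somewhere in its attributes.
_FIRST_TAG = re.compile(r"<[^!?/][^>]*>")
_URL = re.compile(r"https?://[^\s\"'<>]+")


def read_page(path: str) -> str:
    """Read one .htm file; 'utf-8-sig' also tolerates a leading BOM."""
    with open(path, encoding="utf-8-sig") as handle:
        return handle.read()


def source_url(page_text: str) -> Optional[str]:
    """Return the first http(s) URL found in the first tag, or None."""
    first_tag = _FIRST_TAG.search(page_text)
    if first_tag is None:
        return None
    url = _URL.search(first_tag.group(0))
    return url.group(0) if url else None
```

A call such as `source_url(read_page("0000.htm"))` (the file name is illustrative) would then give the page's original address, or `None` if the first tag holds no URL.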
Format of Ground-truth Files
For each web site, the page-level ground-truth of attribute values was labeled using handcrafted regular expressions and is stored in .txt files (in UTF-8 encoding) named "<vertical>-<site>-<attribute>.txt".
In each such file:
- The first line stores the names of vertical, site, and attribute, separated by TAB characters ('\t').
- The second line stores statistics (separated by TABs) for the corresponding site and attribute, including:
  - the total number of pages,
  - the number of pages containing attribute values,
  - the total number of attribute values contained in the pages,
  - the number of unique attribute values.
- Each remaining line stores the ground-truth information (separated by TABs) of one page, in the following order:
  - page ID,
  - the number of attribute values in the page,
  - attribute values ("<NULL>" in case of non-existence).
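The layout described above can be read with a short parser. The following is a sketch under stated assumptions, not an official loader: the names `GroundTruth` and `parse_ground_truth` are hypothetical, and it assumes that multiple values of one page are stored as separate TAB-separated fields after the value count.

```python
from dataclasses import dataclass
from typing import Dict, List

NULL_MARKER = "<NULL>"  # placeholder used when a page has no value


@dataclass
class GroundTruth:
    vertical: str
    site: str
    attribute: str
    total_pages: int
    pages_with_values: int
    total_values: int
    unique_values: int
    values_by_page: Dict[str, List[str]]


def parse_ground_truth(text: str) -> GroundTruth:
    """Parse the content of one ground-truth .txt file."""
    # Drop a possible BOM, split on newlines only, and skip blank lines.
    raw_lines = text.lstrip("\ufeff").split("\n")
    lines = [ln.rstrip("\r") for ln in raw_lines if ln.strip()]

    # Line 1: vertical, site, attribute.
    vertical, site, attribute = lines[0].split("\t")[:3]
    # Line 2: the four statistics, in the documented order.
    stats = [int(field) for field in lines[1].split("\t")[:4]]

    # Remaining lines: page ID, value count, then the values themselves.
    values_by_page: Dict[str, List[str]] = {}
    for line in lines[2:]:
        fields = line.split("\t")
        page_id, count = fields[0], int(fields[1])
        values = [v for v in fields[2:] if v != NULL_MARKER]
        values_by_page[page_id] = values[:count]

    return GroundTruth(vertical, site, attribute, *stats, values_by_page)
```

Keeping page IDs as strings preserves any leading zeros, which makes it easier to match each entry back to its .htm file.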
Notes on Ground-truth Labeling
- The ground-truth labeling was conducted at the DOM-node level. More specifically, the candidate attribute values in a web page are the non-empty strings contained in the text nodes of the corresponding DOM tree.
- One page (although containing a single data record) may contain multiple distinct values for an attribute (e.g., multiple authors of a book, multiple granularity levels of addresses).
- Currently, when a text node mixes the values of multiple attributes, its full string value is labeled with each of these attributes if no cleaner alternative node is available.
- Before being stored in .txt files, the raw attribute values were refined by removing redundant separators (e.g., ' ', '\t', '\n').
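To compare extracted strings against the ground-truth, it can help to enumerate a page's candidate values in the same spirit as the notes above. The sketch below uses Python's standard `html.parser` to collect non-empty text runs and collapses whitespace runs into single spaces. It is only an approximation: the class and function names are hypothetical, skipping `script` and `style` content is an assumption, and the exact refinement rules used when the dataset was built may differ.

```python
from html.parser import HTMLParser
from typing import List


class TextNodeCollector(HTMLParser):
    """Collect non-empty, whitespace-normalized text runs from a page."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.nodes: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        # Collapse runs of ' ', '\t', '\n' into one space and trim the ends.
        cleaned = " ".join(data.split())
        if cleaned:
            self.nodes.append(cleaned)


def candidate_values(html: str) -> List[str]:
    """Return the candidate attribute values (text strings) of one page."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes
```

An extractor's output for a page can then be checked by testing whether each predicted string appears in the ground-truth values for that page.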
How to Download
You can access the latest version of this dataset (associated with ground-truth, sample code, website list, and readme file) from
We would appreciate it if you cite the following paper when using the dataset:
Qiang Hao, Rui Cai, Yanwei Pang, and Lei Zhang. "From One Tree to a Forest: a Unified Solution for Structured Web Data Extraction". In Proc. of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pp. 775-784, Beijing, China, July 24-28, 2011.
If you have questions about this dataset, please contact Qiang Hao (firstname.lastname@example.org,