Entity Resolution for Hidden Web Data

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Entity Resolution for Hidden Web Data"

By

Miss Xiaoheng Xie


Abstract

Entity resolution (ER) identifies and merges records judged to represent 
the same real-world entity. With the development of the Internet, ER for 
hidden Web data has become increasingly important in many real-world 
applications, such as online search engines and web data integration. 
Hidden Web data often originates from different data sources that 
usually have different schemas. As a consequence, there is no single most 
efficient way to compare and merge records from different schemas. 
Moreover, existing techniques that put all records together under a 
unified schema are often not suitable.

In this thesis, we investigate ER methods for hidden Web data using a 
multi-schema approach. That is, we keep the data under the original 
schemas instead of placing them under a unified schema. Based on this 
multi-schema structure, we propose a pair-wise ER method called 
validity-ensured and order-sensitive (VEOS). In the remainder of the 
thesis, we first propose two techniques for improving the performance of 
the VEOS method. Since duplicates within the same data source may 
adversely affect recall, the first technique applies an expanding window 
to VEOS to improve recall. To reduce the number of record pair 
comparisons, our second technique separates the records in large data 
sources into several blocks, so that only records in blocks with the 
same key values need to be compared. We then propose an efficient ER 
method for online query data integration, which self-trains the schema 
fields (attributes) to set appropriate weights, so that more 
representative attributes are used in the ER process.
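
The blocking idea mentioned above can be illustrated with a minimal 
sketch. This is generic key-based blocking, not the VEOS algorithm from 
the thesis; the record fields, the function names (block_by_key, 
candidate_pairs) and the choice of "year" as the blocking key are all 
assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_fn):
    """Group records into blocks that share the same blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks


def candidate_pairs(records, key_fn):
    """Yield only the record pairs that fall inside the same block."""
    for block in block_by_key(records, key_fn).values():
        yield from combinations(block, 2)


# Hypothetical records; the fields and values are invented for this sketch.
records = [
    {"id": 1, "title": "Data Integration", "year": 2010},
    {"id": 2, "title": "data integration", "year": 2010},
    {"id": 3, "title": "Entity Resolution", "year": 2012},
    {"id": 4, "title": "Entity resolution", "year": 2012},
    {"id": 5, "title": "Web Search", "year": 2012},
]

# Assumed blocking key: the publication year.
pairs = [(a["id"], b["id"])
         for a, b in candidate_pairs(records, lambda r: r["year"])]

# Number of comparisons without blocking, for contrast.
all_pairs = len(list(combinations(records, 2)))
```

In this toy input, blocking on the year leaves 4 candidate pairs to 
compare instead of the 10 pairs an exhaustive comparison would need. A 
poorly chosen key can separate true duplicates into different blocks, so 
the key choice trades comparison cost against recall.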

Through extensive experiments on real online data sets from different 
domains and on synthetic data sets, we demonstrate the scalability of 
the ER algorithms, the efficiency of the advanced VEOS approaches, and 
the effectiveness of our proposed ER method for online querying.


Date:			Thursday, 6 September 2012

Time:			2:00pm – 4:00pm

Venue:			Room 3501
 			Lifts 25/26

Chairman:		Prof. Weichuan Yu (ECE)

Committee Members:	Prof. Frederick Lochovsky (Supervisor)
 			Prof. Dik-Lun Lee
 			Prof. Qiong Luo
 			Prof. Rong Zheng (ISOM)
                        Prof. Felix Naumann (Univ. of Potsdam)


**** ALL are Welcome ****