Spring 2006 HKUST Database Seminar





March 24,  2006


Wei Feng

Holistic Schema Matching for Web Query Interfaces  

One significant part of today's Web is Web databases, which can dynamically provide information in
response to user queries. To help users submit queries to different Web databases, the query interface
matching problem needs to be addressed. To solve this problem, we propose a new complex schema matching 
approach, Holistic Schema Matching (HSM). By examining the query interfaces of real Web databases, we 
observe that attribute matchings can be discovered from attribute-occurrence patterns. For example, 
First Name often appears together with Last Name while it is rarely co-present with Author in the Books 
domain. Thus, we design a count-based greedy algorithm to identify which attributes are more likely to 
be matched in the query interfaces. In particular, HSM can identify both simple matching, i.e., 1:1 
matching, and complex matching, i.e., 1:n or m:n matching, between attributes. Our experiments show 
that HSM can discover both simple and complex matchings accurately and efficiently on real data sets.
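The abstract's core observation — attributes that frequently co-occur in one interface (First Name / Last Name) are unlikely to be synonyms, while attributes that rarely co-occur (First Name / Author) are matching candidates — can be illustrated with a small count-based scoring sketch. This is not the HSM algorithm itself, only a hedged illustration of the co-occurrence intuition; the scoring formula and the example interfaces are invented for the demonstration.

```python
from itertools import combinations

def matching_scores(interfaces):
    """Score attribute pairs by negative co-occurrence: attributes that
    rarely appear together in the same interface (e.g. 'First Name' vs
    'Author') are candidate matches, while frequent co-presence
    (e.g. 'First Name' with 'Last Name') suggests distinct attributes.
    The formula here is illustrative, not the one used by HSM."""
    attrs = sorted({a for itf in interfaces for a in itf})
    count = {a: sum(a in itf for itf in interfaces) for a in attrs}
    scores = {}
    for a, b in combinations(attrs, 2):
        co = sum(a in itf and b in itf for itf in interfaces)
        # high individual counts with low co-occurrence -> likely synonyms
        scores[(a, b)] = count[a] * count[b] - co * len(interfaces)
    return scores

# toy query interfaces from a hypothetical Books domain
interfaces = [
    {"Title", "Author"},
    {"Title", "First Name", "Last Name"},
    {"Author", "Subject"},
    {"First Name", "Last Name"},
]
scores = matching_scores(interfaces)
```

A greedy matcher in the spirit of the talk would then repeatedly pick the highest-scoring pair and merge it into a (possibly complex, 1:n) matching.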


March 31st,  2006


Dr. Yan Liu

Automatic Feature Selection in Video Understanding 


The rapid growth and wide applications of digital video data have led to a significant need for video
understanding. Unfortunately, the gap between high-level concepts and low-level features and the high
time cost of video analysis are two major obstacles to efficient video data management. Feature
selection, which can improve the prediction performance of predictors, yield faster and more
cost-effective predictors, and provide a better understanding of the data, is introduced to address these
two problems. But applying existing automatic feature selection algorithms to video data is impractical
because of the unrealistic amount of computation time required. So far, most feature selection techniques in
video applications are based on researchers' intuition, although human interaction cannot keep pace with the
dramatic growth of video data and the varied requirements of different users.

In this talk, I propose and demonstrate four low-cost feature selection algorithms that address the
problems of high dimensionality, complex hypotheses, and sparse training data in classification, with
particular applications in video segmentation, categorization, and retrieval. On both artificial datasets
and real problems from several different domains, these methods match or exceed the accuracy of existing
methods while running several orders of magnitude faster.
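The abstract does not spell out Dr. Liu's four algorithms, but the general idea of cheap, filter-style feature selection can be sketched: rank features by how much information they carry about the class labels and keep the top few, avoiding any expensive wrapper search. The feature names and data below are invented for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete lists."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        mi += pj * math.log2(pj / ((px[x] / n) * (py[y] / n)))
    return mi

def select_features(columns, labels, k=2):
    """Filter-style selection: rank feature columns by MI with the labels
    and keep the top k -- one cheap pass, no wrapper search."""
    ranked = sorted(columns,
                    key=lambda name: mutual_information(columns[name], labels),
                    reverse=True)
    return ranked[:k]

# hypothetical binarized low-level video features
columns = {
    "motion": [1, 1, 1, 0, 1, 0],   # fairly informative
    "noise":  [0, 1, 0, 1, 0, 1],   # nearly uninformative
    "color":  [1, 1, 1, 0, 0, 0],   # perfectly informative here
}
labels = [1, 1, 1, 0, 0, 0]
```

Filter methods like this are a standard way to keep the cost of selection linear in the number of features, which is the kind of efficiency concern the talk targets.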
Dr. Yan Liu is an assistant professor in the Department of Computing at The Hong Kong Polytechnic University. Her research interests include machine learning, multimedia understanding, and e-learning. She received her PhD from the Computer Science Department of Columbia University and her M.S. from the Business School of Nanjing

April 21st, 2006


Dr. Yufei Tao


  Privacy Preserving Publication with Generalization 


  Privacy preservation is a serious concern in publication of personal data, and has attracted
considerable attention from the database community.  In the first part of the talk, we will review "k-
anonymous generalization",  an important methodology underlying the existing solutions in the 
literature. We also explain the drawbacks of this methodology. In particular, k-anonymous 
generalization exerts the same amount of privacy preservation for all persons, without catering for
 their concrete needs. The consequence is that we may be offering insufficient protection to a subset
 of people, while applying excessive privacy control to another subset. 

The rest of the talk will be devoted to introducing a novel generalization framework based on the 
concept of "personalized anonymity". This technique performs the minimum generalization for satisfying
 everybody's requirements, and thus, retains the largest amount of information from the 
microdata. We also present the results of our theoretical study that provide valuable insight into the
 behavior of alternative solutions. These results mathematically reveal the circumstances where the 
previous work fails to protect privacy, and establish the superiority of the proposed technique. 

The content of the talk will appear in SIGMOD 2006.
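
The k-anonymity property reviewed in the first part of the talk is easy to state concretely: after generalization, every combination of quasi-identifier values must be shared by at least k records. The sketch below checks this property on a toy generalized table; the column names and values are hypothetical, and the talk's own contribution (personalized anonymity) replaces the single global k with per-person requirements.

```python
from collections import Counter

def is_k_anonymous(table, quasi_identifiers, k):
    """True iff every combination of quasi-identifier values appears
    in at least k records of the (generalized) table."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers)
                     for row in table)
    return all(c >= k for c in groups.values())

# hypothetical published table: Age generalized to ranges,
# Zip generalized to a prefix; Disease is the sensitive attribute
published = [
    {"Age": "20-30", "Zip": "527**", "Disease": "flu"},
    {"Age": "20-30", "Zip": "527**", "Disease": "cold"},
    {"Age": "30-40", "Zip": "528**", "Disease": "ulcer"},
    {"Age": "30-40", "Zip": "528**", "Disease": "flu"},
]
```

Because k is global, this table protects everyone equally (k = 2) — exactly the one-size-fits-all behavior the talk criticizes before introducing personalized anonymity.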
Dr. Yufei Tao is an Assistant Professor in the Department of Computer Science, the City University of Hong Kong. He holds a PhD from the Hong Kong University of Science and Technology, and did his postdoc at Carnegie Mellon University, USA. Yufei received the Hong Kong Young Scientist Award in 2002. He enjoys solving database problems, particularly those related to privacy preservation and spatial/temporal databases.

April 28th, 2006


David Yang

An Efficient Approach to Support Querying Secure Outsourced XML Information

Data security is well recognized as a vital issue in an information system supported in an
outsourced environment. However, most conventional XML encryption proposals treat confidential parts
of an XML document as whole blocks of text and apply encryption algorithms directly to them. As a
result, queries involving the encrypted parts cannot be processed efficiently.
This talk focuses on XQEnc, a novel approach to supporting queries over encrypted XML. XQEnc is based on
two important techniques: vectorization and skeleton compression. Essentially, vectorization, which is a
generalization of the columns of a relational table, makes use of the basic paths of an XML tree to label
the data values. Skeleton compression collapses redundant paths into a multiplicity attribute. Our
analysis and experimental study show that XQEnc achieves both better query efficiency and more robust
security compared with conventional methods. As an application, we show how XQEnc can be realized with
relational techniques to enable secure XML data outsourcing.
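
The vectorization idea — treating each root-to-leaf path as a "column" whose values are stored together — can be sketched in a few lines. This is only an illustration of the concept, not XQEnc's actual encoding; in the real system each vector would then be encrypted as a unit while the shared skeleton is compressed separately.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def vectorize(xml_text):
    """Group text values by their root-to-leaf path, analogous to the
    columns of a relational table: values under the same path end up
    in one vector that could be encrypted (or queried) as a unit."""
    root = ET.fromstring(xml_text)
    vectors = defaultdict(list)

    def walk(node, path):
        path = path + "/" + node.tag
        if node.text and node.text.strip():
            vectors[path].append(node.text.strip())
        for child in node:
            walk(child, path)

    walk(root, "")
    return dict(vectors)

doc = ("<books><book><title>DB</title><price>10</price></book>"
       "<book><title>XML</title><price>20</price></book></books>")
```

A path query such as `/books/book/title` then touches only one vector instead of decrypting the whole document, which is the efficiency gain the talk is after.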

July 28th, 2006


Prof. Ben Choi

Building a Better Search Engine


This presentation will outline several new techniques for building a better search engine. New algorithms have been developed for automatically classifying and clustering Web pages to allow users to quickly find information on the Web. A new text indexing scheme has been developed for speeding up keyword search. And a new parallel and distributed architecture has been developed to handle massive search requests. Several techniques under development will also be outlined, including Web crawling methods for finding and updating Web pages and page ranking methods for ordering search results. These techniques have been implemented and tested on an object-oriented database, the experience of which will also be shared. In addition, two patent applications have been filed based on some of the new techniques.
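
The abstract does not describe Prof. Choi's indexing scheme, but the baseline it speeds up — an inverted index mapping keywords to posting sets — can be sketched briefly as context. The page data below is invented for the example.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of page ids containing it, so a
    query is answered by intersecting small posting sets instead of
    scanning every page."""
    index = defaultdict(set)
    for pid, text in pages.items():
        for word in text.lower().split():
            index[word].add(pid)
    return index

def search(index, query):
    """Conjunctive keyword search: intersect the postings of all terms."""
    postings = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

pages = {1: "parallel search engine", 2: "distributed search", 3: "web crawling"}
idx = build_index(pages)
```

Any faster text-indexing scheme, like the one the talk promises, has to beat this structure on index size, build time, or query latency.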

Prof. Ben Choi, Ph.D. & Pilot: He is an Associate Professor in Computer Science at Louisiana Tech University and is also a pilot of airplanes and helicopters. He worked in the computer industry as a System Performance Engineer at Lucent Technologies and as a Computer Consultant. He received his Bachelor's, Master's, and Ph.D. degrees all from The Ohio State University, in Electrical Engineering, Computer Engineering, and Computer Science. His work has included associative memory, parallel computing, and machine learning. His research interests are developing software and hardware methods for building intelligent machines and abstracting the Universe as a Computer.

August 4th, 2006


Yiping Ke


Mining Quantitative Correlated Patterns Using an Information-Theoretic Approach


Existing research on mining quantitative databases mainly focuses on mining associations. However,
 mining associations is too expensive to be practical in many cases. In this talk, we study mining 
correlations from quantitative databases and show that it is a more effective approach than mining 
associations. We propose a new notion of Quantitative Correlated Patterns (QCPs), which is founded on 
two formal concepts, mutual information and all-confidence. We first devise a normalization on mutual 
information and apply it to QCP mining to capture the dependency between the attributes. We further 
adopt all-confidence as a quality measure to control, at a finer granularity, the dependency between
 the attributes with specific quantitative intervals. We also propose a supervised method to combine 
the consecutive intervals of the quantitative attributes based on mutual information, such that the 
interval combining is guided by the dependency between the attributes. We develop an algorithm, 
QCoMine, to efficiently mine QCPs by utilizing normalized mutual information and all-confidence to
perform a two-level pruning. Our experiments verify the efficiency of QCoMine and the quality of the
mined QCPs.
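
Of the two measures the talk builds on, all-confidence has a particularly simple definition: the support of an itemset divided by the largest support of any single item in it, so a high value means every item strongly implies the rest. The sketch below computes it over toy transactions; it illustrates only the measure, not the QCoMine algorithm or its normalization of mutual information, and the transactions are invented.

```python
def all_confidence(transactions, itemset):
    """supp(itemset) / max_{i in itemset} supp({i}): close to 1 means
    each single item almost always appears with the whole set."""
    n = len(transactions)
    supp = sum(itemset <= t for t in transactions) / n
    max_item = max(sum(i in t for t in transactions) for i in itemset) / n
    return supp / max_item

# hypothetical transactions (quantitative attributes would first be
# discretized into interval items before this measure is applied)
txns = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b"}]
```

Using such a measure as a pruning threshold is what lets a miner like QCoMine discard weakly dependent attribute combinations early, at a finer granularity than support alone.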