RESEARCH INTERESTS OF MY GROUP

"We are like dwarfs on the shoulders of giants, so that we can see more than they, and things at a greater distance." (John of Salisbury)
 

Disclaimer: To facilitate better understanding of our work, we make some of our papers embedded in hyperlinks. However, it is no guarantee that they are the final version or the published version. Copyright and all rights are retained by authors or by other copyright holders.

 

Briefly, we have been investigating the following four specific areas. The following are some highlights of our past work.

 

Database Modelling:

 

This is related to establish a database model and applications that manage tree-structured data, graph-structured data, sequence data, temporal data, uncertain data, etc. The advancement of classical data dependencies such as functional dependencies (FDs) in different application domains are also related.

 

We employed the Ordered Relational Model (ORM) as a unifying framework for capturing the semantics of the information in a large class of database applications, in addition to studying ordered data dependencies (see journal papers \ref{is99}, \ref{tods01}).

An important reason for why the ORM is useful is due to the fact that, in general, the intuition of incomparability can be explicitly captured by means of partial ordering, which has been demonstrated to be useful in those applications involving incomplete, fuzzy, temporal and tree-structured information (see journal paper \ref{jdm01}).

On other data modelling, we continue to work on modelling vague data and to address the role of user preferences in database models. The results have been regularly published in the ER conference, which is a major conference in data modeling (see conference papers \ref{er04b},\ref{er05},\ref{er06},\ref{er07a},\ref{er07b},\ref{er07c}). We further develop methods in mining preference in terms of the notion of bucket order \ref{sigmod08}.

 

Query Processing of Structured Data:

 

This is related to managing XML or other graph data such as modeling, querying, and indexing. Some interesting issues such as XML compression, XML watermarking, and Web searching are relevant.

 

XML is a de-facto standard for exchanging and presenting information on the Web. However, XML data are also recognized as verbose, since they heavily inflate the data size due to repeated tags and structures. We mainly tackled the so-called size problem of XML documents, which hinders their practical usage and substantially increases the costs of storing, processing and exchanging the data (see journal paper \ref{wwwj05}).

A novel storage technique called DTD Tree and SAX Event Stream Parsing (DSP) was developed to support compressed XML data. The DSP technique was designed for efficient compression of the XML documents that conform to a given DTD without involving user expertise.  The main feature of the DSP technique is that it supports querying without the need for full decompression (see journal papers \ref{kais05}, \ref{wwwj06}).

We also developed a novel SIT-indexing technique on compressed XML data, which is able to achieve a satisfactory compression quality. The querying performance of using SIT-indexing in processing XML queries over compressed documents is also found to be competitive compared to the state-of-the-art techniques (see conference papers \ref{edbt04}, \ref{caise06}).

To further demonstrate the benefits of the compression technique, we worked on the XML dissemination problem over the Internet, where the speed of information delivery can be rather slow in a client-server architecture that consists of a large number of geographically spanned users who access a large amount of correlated XML information (see journal paper \ref{jdm07} and conference papers \ref{sigir07}, \ref{dasfaa07c}, \ref{dasfaa06}).

In processing queries on graph databases, we developed a novel indexing technique that constructs a nested inverted-index, called the FG-index, based on the set of Frequent subGraphs (FGs for short). Given a graph query that is an FG in the database, FG-index returns the exact set of query answers without performing candidate verification. On the other hand, FG-index produces a candidate answer set that is close to the exact answer set when the query is an infrequent graph
. Since an infrequent graph means the graph occurs in only a small number of graphs in the database, the number of subgraph isomorphism tests is small (see conference paper \ref{sigmod07} and journal paper \ref{tods08b}).

 

Web Searching:

 

Searching information via a search engine is crucial for both casual users and web applications (see journal paper \ref{cnj05}). It is also important to consider user preference in searching, since the semantics of a search query may vary for different users. We employed some machine learning techniques such as SVM, co-training and naive Bayes to develop an effective framework of adaptive searching for HTML and XML data.

 

I   Adaptive Web Searching

We explored ranking techniques that are able to incorporate users' implicit feedback in order to achieve personalization in web searching. The outcome framework is able to cater to diversified searching needs. We addressed search engine personalization and presented a new approach to mining a user's preferences on the search results from clickthrough data and using the discovered preferences to adapt the search engine's ranking function for improving search quality. A new preference mining technique called SpyNB was developed to discover more accurate preferences than existing algorithms do (see journal papers \ref{toit07}, \ref{tkde08b}and conference paper \ref{webkdd04}, \ref{dasfaa04}). We maintain a publicly accessible clickthrough analyzing engine (CTA) which captures the clickthroughs that a user performs on the search results and identifies and group together related queries that are relevant to the user's interest.

 

II  Adaptive XML Searching

We employed tagged keywords as an XML search query for searching XML data. The problem of effective merging of user-preferred fragments in order to present quality search results was also studied. The outcome framework is able to cater to diversified searching needs when given mixed XML data sources. This research is directly related to the development of effective XML search engines. In order to deal with the diversified nature of XML documents as well as user preferences, we proposed a novel Multi-Ranker Model (MRM), which is able to abstract a spectrum of important XML properties and adapt the features to different XML search needs. The MRM is composed of three ranking levels. The lowest level consists of two categories of similarity and granularity features. At the intermediate level, we define four tailored XML Rankers (XRs), which consist of different lower level features and have different strengths in searching XML fragments. We demonstrated that the trained MRM is able to bring out the strengths of the XRs in order to adapt to different preferences and queries (see journal papers \ref{vldb07}, \ref{is07}).

 

Data Mining:

 

Data Mining can be viewed as querying on a database in order to identify relationship of data. Advanced query languages that support effective mining and searching are also our interests.

 

I   Mining Frequent Itemsets and Association Rules

It is well-recognized that the main factor that hinders the applications of Frequent Itemsets (FIs for short) and Association Rules (ARs for short) is the huge number of FIs and ARs returned by the mining process (see journal paper \ref{kais06}).

We proposed effective solutions that present concise mining results by eliminating the redundancy in both the set of FIs and the set of ARs (see journal papers \ref{dami07}, \ref{jiis07} and conference papers \ref{icdm06}, \ref{pakdd06}).

First, we studied how to eliminate the redundancy in the set of FIs. Two major works in this direction are Closed FIs (CFIs for short) and Maximal FIs (MFIs for short). The set of CFIs is a lossless representation of FIs but the number of CFIs is still too large, while the number of MFIs is small but MFIs lose the frequency information of FIs. We proposed
δ-Tolerance CFIs (δ-TCFIs) to bridge the gap between the lossless CFIs and the lossy MFIs. The notion of δ-tolerance relaxes the restrictive definition of CFIs to remove a significantly greater amount of redundancy, while it still retains the frequency information of FIs.

Second, we developed an effective solution that eliminates the redundancy in the set of ARs based on the set of δ-TCFIs, proposed
δ-Tolerance ARs (δ-TARs for short), and devised a set of inference rules. We established the interesting results that the set of δ-TARs is a non-redundant representation of ARs, while the set of ARs that is derived from the δ-TARs by the inference rules is sound and complete.

II  Mining in Quantitative Databases and Graph Databases

We are interested in adopting information-theoretic and statistical approaches to tackling mining problems in quantitative and graph databases (see journal papers \ref{tods08}, \ref{kais07b}, conference papers \ref{kdd07}, \ref{kdd06} and the synthetic graph data generator GrapGen).

Mining associations in quantitative databases is known to be too expensive to be practical. We tackled this long-standing problem in the field by mining correlations from quantitative databases and showed that it is a more effective approach than mining associations. We proposed a new notion of Quantitative Correlated Patterns (QCPs), which is founded on two formal concepts, mutual information and all-confidence (see journal paper \ref{tods08a}).

First, we devised a normalized form of mutual information and applied it to QCP mining to capture the dependency between the attributes. The notion of all-confidence is employed as a quality measure to control, at a finer granularity, the dependency between the attributes with specific quantitative intervals. Second, we adopted a supervised method to combine the consecutive intervals of the quantitative attributes based on mutual information, such that the interval combining is guided by the dependency between the attributes.

In addition to tackling the mining problem in quantitative databases, we proposed a new problem of correlation mining in the context of graph databases, called Correlated Graph Search (CGS) (see journal paper \ref{tkde08a}).

CGS adopts Pearson's correlation coefficient as a correlation measure to take into consideration the occurrence distributions of graphs. However, the problem poses significant challenges, since every subgraph of a graph in the database is a candidate but the number of subgraphs is exponential. We derived necessary conditions that set bounds on the occurrence probability of a candidate in the database. With this result, an efficient algorithm that operates on a much smaller projected database was devised and thus a significantly smaller set of candidates was obtained. To further improve the efficiency, we developed heuristic rules and applied them to the candidate set to further reduce the search space.

 

Software Development of the above mentioned work

 

Please go the following web page: Software Downloads

 

 

Some interesting research problems are starting from asking simple questions. Just share our experience: 

 

Eg 1.     XML Data Compression

 

How to make XML data "smaller" but still queriable?

 

Eg 2.     XML Merging with Quality

 

How to ensure XML + XML = a better XML?

 

Eg 3.     Mining Languages for Graphs

 

How to know some part of a large graph is "very related" to another part?

 

Eg 4.     Adaptive Web Searching

 

How to get what I really want from the Web?