"We are like dwarfs on the shoulders of giants, so that we can see more than
they, and things at a greater distance." (John of Salisbury)
Disclaimer: To facilitate better understanding of our work, we make some of our papers embedded in hyperlinks. However, it is no guarantee that they are the final version or the published version. Copyright and all rights are retained by authors or by other copyright holders.
Briefly, we have been investigating the following four specific areas. The following are some highlights of our past work.
Database Modelling:
This is related to establish a database model and applications that manage tree-structured data, graph-structured data, sequence data, temporal data, uncertain data, etc. The advancement of classical data dependencies such as functional dependencies (FDs) in different application domains are also related.
We employed the Ordered Relational Model (ORM)
as a unifying framework for capturing the semantics of the information in a
large class of database applications, in addition to studying ordered data
dependencies (see journal papers \ref{is99},
\ref{tods01}).
An important reason for why the ORM is useful is due to the fact that, in
general, the intuition of incomparability can be explicitly captured by means of
partial ordering, which has been demonstrated to be useful in those applications
involving incomplete, fuzzy, temporal and tree-structured information (see
journal paper \ref{jdm01}).
On other data modelling, we continue to work on modelling vague data and
to address the role of user preferences in database models. The results have been
regularly published in the ER conference, which is a major conference in data
modeling (see conference papers \ref{er04b},\ref{er05},\ref{er06},\ref{er07a},\ref{er07b},\ref{er07c}).
We further develop methods in mining preference in terms of the notion of bucket
order \ref{sigmod08}.
Query Processing of Structured Data:
This is related to managing XML or other graph data such as modeling, querying, and indexing. Some interesting issues such as XML compression, XML watermarking, and Web searching are relevant.
XML is a de-facto standard
for exchanging and presenting information on the Web. However, XML data are also recognized as verbose, since
they heavily inflate the data size due to repeated tags and structures. We
mainly tackled the so-called size problem of XML documents, which hinders
their practical usage and substantially increases the costs of storing,
processing and exchanging the data (see journal paper
\ref{wwwj05}).
A novel storage technique called DTD Tree and SAX Event Stream Parsing (DSP) was
developed to support compressed XML data.
The DSP technique was designed for efficient compression of the XML documents that
conform to a given DTD without involving user expertise. The main
feature of the DSP technique is that it supports querying
without the need for full decompression (see journal papers
\ref{kais05},
\ref{wwwj06}).
We also developed a novel SIT-indexing technique on compressed XML data, which
is able to achieve a satisfactory compression quality. The querying performance
of using SIT-indexing in processing XML queries over compressed documents is
also found to be competitive compared to the state-of-the-art techniques (see
conference papers \ref{edbt04},
\ref{caise06}).
To further demonstrate the benefits of the compression technique, we worked on
the XML dissemination problem over the Internet, where the speed of information
delivery can be rather slow in a client-server architecture that consists of a large
number of geographically
spanned users who access a large amount of correlated XML information (see
journal paper \ref{jdm07} and conference papers
\ref{sigir07},
\ref{dasfaa07c},
\ref{dasfaa06}).
In processing queries on graph databases, we developed a novel indexing technique
that constructs a nested inverted-index, called the FG-index, based on the
set of Frequent subGraphs (FGs for short). Given a graph query that is an FG in
the database, FG-index returns the exact set of query answers without performing
candidate verification. On the other hand, FG-index produces
a candidate answer set that is close to the exact answer set when the query is
an infrequent graph. Since an
infrequent graph means the graph occurs in only a small number of graphs in the
database, the number of subgraph isomorphism tests is small (see conference
paper \ref{sigmod07} and journal paper
\ref{tods08b}).
Web Searching:
Searching information via a search engine is crucial for both casual users and web applications (see journal paper \ref{cnj05}). It is also important to consider user preference in searching, since the semantics of a search query may vary for different users. We employed some machine learning techniques such as SVM, co-training and naive Bayes to develop an effective framework of adaptive searching for HTML and XML data.
I Adaptive Web Searching
We explored ranking techniques that are able to incorporate users' implicit
feedback in order to achieve personalization in web searching. The outcome
framework is able to cater to diversified searching needs. We addressed search
engine personalization
and presented a new approach to mining a user's preferences on the search
results from clickthrough data and using the discovered
preferences to adapt the search engine's ranking function for improving search
quality. A new preference mining technique called SpyNB was developed to
discover more accurate preferences than existing algorithms do (see journal
papers \ref{toit07},
\ref{tkde08b}and conference paper
\ref{webkdd04},
\ref{dasfaa04}). We maintain a publicly
accessible clickthrough analyzing engine (CTA)
which captures the clickthroughs that a user performs on the search results and
identifies and group together related queries that are relevant to the user's
interest.
II Adaptive XML Searching
We employed tagged keywords as an XML search query for searching XML data.
The problem of effective merging of user-preferred fragments in order to present
quality search results was also studied. The outcome framework is able to cater
to diversified searching needs when given mixed XML data sources. This research
is directly related to the development of effective XML search engines. In order to deal with the diversified nature of
XML documents as well as user preferences, we proposed a novel Multi-Ranker
Model (MRM), which is able to abstract a spectrum of important XML properties
and adapt the features to different XML search needs. The MRM is composed of
three ranking levels. The lowest level consists of two categories of similarity
and granularity features. At the intermediate level, we define four tailored XML
Rankers (XRs), which consist of different lower level features and have
different strengths in searching XML fragments. We demonstrated that the trained
MRM is able to bring out the strengths of the XRs in order to adapt to different
preferences and queries (see journal papers
\ref{vldb07},
\ref{is07}).
Data Mining:
Data Mining can be viewed as querying on a database in order to identify relationship of data. Advanced query languages that support effective mining and searching are also our interests.
I Mining Frequent Itemsets and
Association Rules
It is well-recognized that the main factor that hinders the applications of
Frequent Itemsets (FIs for short) and Association Rules (ARs for short) is the
huge number of FIs and ARs returned by the mining process (see journal paper
\ref{kais06}).
We proposed effective solutions that present concise mining results by
eliminating the redundancy in both the set of FIs and the set of ARs (see
journal papers \ref{dami07},
\ref{jiis07} and conference papers
\ref{icdm06},
\ref{pakdd06}).
First, we studied how to eliminate the redundancy in the set of FIs. Two major
works in this direction are Closed FIs (CFIs for short) and Maximal FIs (MFIs
for short). The set of CFIs is a lossless representation of FIs but the number of
CFIs is still too large, while the number of MFIs is small but MFIs lose the
frequency information of FIs. We proposed
δ-Tolerance CFIs (δ-TCFIs)
to bridge the gap between the lossless CFIs and the lossy MFIs. The notion of δ-tolerance
relaxes the restrictive definition of CFIs to remove a significantly greater
amount of redundancy, while it still retains the frequency information of FIs.
Second, we developed an effective solution that eliminates the redundancy in the
set of ARs based on the set of δ-TCFIs,
proposed
δ-Tolerance
ARs (δ-TARs for short), and devised a set of inference rules. We
established the interesting results that the set of δ-TARs
is a non-redundant representation of ARs, while the set of ARs that is derived
from the δ-TARs
by the inference rules is sound and complete.
II Mining in Quantitative Databases and Graph Databases
We are interested in adopting information-theoretic and statistical approaches to
tackling mining problems in quantitative and graph databases (see journal papers
\ref{tods08}, \ref{kais07b}, conference
papers \ref{kdd07},
\ref{kdd06} and the synthetic graph data
generator GrapGen).
Mining associations in quantitative databases is known to be too expensive to be
practical. We tackled this long-standing problem in the field by mining
correlations from quantitative databases and showed that it is a more effective
approach than mining associations. We proposed a new notion of Quantitative
Correlated Patterns (QCPs), which is founded on two formal concepts, mutual
information and all-confidence (see journal paper
\ref{tods08a}).
First, we devised a normalized form of mutual information and applied it to QCP mining to capture the dependency between the attributes. The notion of
all-confidence is employed as a quality measure to control, at a finer
granularity, the dependency between the attributes with specific quantitative
intervals. Second, we adopted a supervised method to combine the consecutive
intervals of the quantitative attributes based on mutual information, such that
the interval combining is guided by the dependency between the
attributes.
In addition to tackling the mining problem in quantitative databases, we proposed
a new problem of correlation mining in the context of graph databases, called
Correlated Graph Search (CGS) (see journal paper
\ref{tkde08a}).
CGS adopts Pearson's correlation coefficient as a correlation measure to take
into consideration the occurrence distributions of graphs. However, the problem
poses significant challenges, since every subgraph of a graph in the database is
a candidate but the number of subgraphs is exponential. We derived necessary
conditions that set bounds on the occurrence probability of a candidate in the
database. With this result, an efficient algorithm that operates on a much
smaller projected database was devised and thus a significantly smaller set of
candidates was obtained. To further improve the efficiency, we developed
heuristic rules and applied them to the candidate set to further reduce the
search space.
Software Development of the above mentioned work
Please go the following web page: Software Downloads
Some interesting research problems are starting from asking simple questions. Just share our experience:
Eg 1. XML Data Compression
How to make XML data "smaller" but still queriable?
Eg 2. XML Merging with Quality
How to ensure XML + XML = a better XML?
Eg 3. Mining Languages for Graphs
How to know some part of a large graph is "very related" to another part?
Eg 4. Adaptive Web Searching
How to get what I really want from the Web?