Tutorials - Monday April 16, 2001

Tutorial

Content

Morning Session : 09:00 - 12:30

I

An Introduction to MARS

Dr Dan Steinberg, CEO of Salford Systems, USA

II

Static and Dynamic Data Mining Using Advanced Machine Learning Methods

Professor Ryszard S. Michalski, George Mason University, USA

III

Sequential Pattern mining: From Shopping History Analysis to Weblog Mining and DNA Mining

Professor Jiawei Han and Jian Pei, Simon Fraser University, Canada

Afternoon Session : 14:00 - 17:30

IV

Recent Advances in Data Mining Algorithms for Large Databases

Dr Rajeev Rastogi and Dr Kyuseok Shim, USA and Korea

V

Web Mining for E-Commerce

Professor Jaideep Srivastava, University of Minnesota, USA

 

Description

Tutorial I - An Introduction to MARS

Presenter : Dr Dan Steinberg, CEO of Salford Systems, USA

MARS is one of several modern regression tools that can help analysts quickly develop superior predictive models. Suited for linear and logistic regression, MARS automates the model specification search, including variable selection, variable transformation, interaction detection, missing value handling, and model validation.

Created by Stanford's Jerome H. Friedman, one of the developers of CART(r), MARS is a non-parametric modelling tool that is equally adept at developing simple or highly non-linear models. MARS rapidly separates effects that are applicable to an entire data set from those that apply only to specific subsets, automatically tracking non-linear effects with spline basis functions. Models enhanced with MARS-created variables are typically far more accurate than hand-crafted models.

This tutorial will cover the key MARS concepts and illustrate how to get the most out of MARS.

Details include :

  • Parametric vs. Non-parametric Regression
  • Splines as Method for Non-parametric Regression
  • Splines as resolution of functional-form specification problem
  • MARS Modelling Overview
  • MARS Approach to Spline Creation: Hockey-Stick Basis Functions
  • MARS Forward Model Building Stage: Discovering the Best Basis Functions
  • Tensor products: how MARS finds interactions
  • Model Pruning Stage: Determination of Optimal Model
  • Overriding the MARS selection: choosing alternative models
  • Overview of results from experimental evaluation studies
  • Multinomial Logit (Discrete Choice Models)
  • Survival Analysis
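
The hockey-stick basis functions and tensor products named in the outline can be illustrated with a minimal sketch. This is not the MARS software or its API: the function names, the example knot, and the coefficients below are hypothetical, and the forward search and pruning stages are omitted entirely.

```python
def hinge(x, knot, direction=1):
    """Hockey-stick basis function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    """
    return max(0.0, direction * (x - knot))


def interaction(x1, knot1, x2, knot2):
    """Tensor product of two hinges: non-zero only where both are active."""
    return hinge(x1, knot1) * hinge(x2, knot2)


def predict(x, intercept, terms):
    """Evaluate a one-variable additive model.

    terms is a list of (coefficient, knot, direction) tuples (assumed layout).
    """
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)


# Hypothetical model: flat at 2.0 below x = 3, rising with slope 1.5 above it.
model = [(1.5, 3.0, +1)]
print(predict(1.0, 2.0, model))  # 2.0
print(predict(5.0, 2.0, model))  # 5.0
```

Each hinge is zero on one side of its knot, which is how a term can apply to only a subset of the data while the intercept and other terms apply everywhere.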

 

Biography of the speaker

Dan Steinberg received his Ph.D. in Econometrics from Harvard University. Dr Steinberg is the founder of Salford Systems, a data mining software development and consulting company. Starting in 1982 he led Salford in producing add-on procedures for the SAS software system and subsequently developed a series of advanced statistical procedures for PCs, including modules for survival analysis, discrete choice modelling, and multinomial logistic regression. In 1991 he began research on data mining technologies, focusing on CART(r) decision trees. Working in close collaboration with the CART originators at Stanford University and UC Berkeley, he developed a series of enhancements and extensions to CART, including a new graphical interface, additional splitting rules, improved accuracy for regression trees, and new diagnostic reports. More recently he has been working on several next generation data mining tools including MARS (Multivariate Adaptive Regression Splines), tree-based cluster analysis, database hot spot detection, and hybrid methods combining multiple data mining methods.

In the consulting arena, via Salford Systems, Dr. Steinberg has conducted high profile data mining and market research projects for AT&T, Chase Bank, General Motors, American Express, several European telecommunications firms, and other Fortune 100 firms. Specific topics have included bankruptcy scoring, predicting home mortgage refinance, response models for direct mail promotions, and new tree-based customer segmentation schemes. In addition to data mining, in the early 1990s he pioneered a series of proprietary tools and methods for sophisticated modelling of consumer choice behaviour. Using these tools and working closely with leading management consulting firms, he directed key strategy studies in the telecommunications, banking, and retail industries.

Dr. Steinberg is the author of CART software documentation and the Japanese language book "Applied Tree Based Methods Using CART" with co-authors at Meiji University. This book was awarded the 1999 Nikkei QC Literature Prize, awarded by the Deming Committee for excellence in statistical literature contributing to QC practice and management. His papers have appeared in scholarly journals including The American Economic Review, Journal of Econometrics, The American Statistician, Communications in Statistics, Marketing Letters and the Social Science Computer Review. He has given 2-day CART tutorials across the US, Australia, and Japan, and has been a featured speaker for the Direct Marketing Association, the American Statistical Association, and DCI's Data Warehouse Conferences.

Salford was a two-time winner in the KDDCup 2000 data mining/web mining competition. Using CART and MARS, Salford won first place for most accurate predictive model (predicting which brands a web visitor would view). Salford also won first place in the category of Best CRM analysis. The competition drew registrants from around the world including most of the major data mining software and consulting companies.

 

Tutorial II - Static and Dynamic Data Mining Using Advanced Machine Learning Methods

Presenter: Professor Ryszard S. Michalski, George Mason University, USA

This tutorial will present basic concepts and recent methods for mining static and dynamic (temporal) databases that are based on advanced machine learning approaches. The tutorial will start with defining fundamental concepts underlying data mining and knowledge discovery, and then review recent advances in machine learning that are directly relevant to static and dynamic data mining. We will discuss problems of competence and efficiency of the methods, and their applicability to temporal databases in which patterns are not stable but evolve in time. We will review, in particular, recent progress on the application of natural induction to data mining.

In contrast to conventional data mining systems that are concerned primarily with high predictive accuracy of discovered patterns, natural induction systems also stress the ease of the patterns' interpretation and understandability. Presented ideas will be illustrated by examples of their application to pattern discovery in a large medical database and in a temporal database characterizing behaviour of computer users.

 

Biography of the speaker

Ryszard S. Michalski is Planning Research Corporation Chaired Professor of Information Technology, Computer Science and Systems Engineering, Director of GMU Machine Learning and Inference Laboratory, and Affiliate Scientist at the Institute of Computer Science, Polish Academy of Sciences in Warsaw. He is a cofounder of the field of machine learning, and the originator of several ideas/research subareas, such as conceptual clustering, constructive induction, variable-valued logic, variable-precision logic (with Patrick Winston, MIT), two-tiered concept representation, multistrategy task-adaptive learning, inferential theory of learning, and learnable evolution model.

Dr. Michalski's educational background includes studies at the Cracow and Warsaw Universities of Technology, an M.S. degree from St. Petersburg Polytechnical University, and a Ph.D. degree from the University of Silesia in Poland (1969). Before emigrating to the United States in 1970, he was a research scientist at the Polish Academy of Sciences. From 1970 to 1987, Dr. Michalski was on the faculty of the University of Illinois at Urbana-Champaign, initially as a Research Professor, and then became Full Professor of Computer Science and Medical Information Science, and Director of Artificial Intelligence Laboratory. In 1988, he moved with his research group to George Mason University in Fairfax, VA (Washington, D.C. metropolitan area).

Dr. Michalski has been working on topics of machine learning since 1966, when he developed (in collaboration with J. Karpinski) an early learning system for the recognition of handwritten alpha-numeric characters. He invented algorithm AQ for solving the general covering problem, which became the basis of many machine learning programs. He originated research on constructive induction and conceptual clustering; developed a computational theory of inductive learning; invented variable-valued logic; and co-developed a computational theory of human plausible reasoning (with Alan Collins from BBN, Cambridge, MA). Collaborating with James Sinclair, a plant pathologist at the University of Illinois, he developed the first agricultural expert system, and the first practical expert system that learned its decision rules from examples. Dr. Michalski is a cofounder of the Journal of Machine Learning, and a co-organizer of the first several international machine learning conferences. He has lectured extensively worldwide, and held visiting professorships at major universities in the U.S., including MIT, CMU and the University of Wisconsin, as well as abroad, specifically, in Belgium, Great Britain, Italy and France.

Dr. Michalski's research interests include machine learning and inference, inductive databases, data mining and knowledge discovery, applications of machine learning to computer vision, and intelligent agents. He has authored, co-authored or co-edited 14 books, some of which have received worldwide acclaim, and more than 350 publications in his areas of interest.

 

Tutorial III - Sequential Pattern Mining: From Shopping History Analysis to Weblog Mining and DNA Mining

Presenters: Professor Jiawei Han and Jian Pei, Simon Fraser University, Canada

Sequential pattern mining, i.e., discovering frequent sub-sequences in sequence databases, is an important data mining task. Although many efficient sequential pattern mining techniques have been developed in the last 7-8 years, they have been published in scattered places. The tutorial will present a comprehensive overview of this theme and discuss its applications.

The survey will cover a wide spectrum of techniques and applications. We will first give an introduction to what sequential pattern mining is and show a road map for sequential pattern mining. Then, we will illustrate an essential technique, Apriori, and its extensions for mining sequential patterns. The Apriori principle as well as several previous methods, including AprioriAll, GSP and SPADE, will be covered. After that, we will discuss mining sequential patterns with recently proposed pattern-growth methods, including FreeSpan and PrefixSpan. We will also review some research going beyond sequential pattern mining, such as mining episodes, global partial orders, cyclic association rules and partial periodicity.
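
As a rough illustration of the Apriori principle applied to sequences (every sub-sequence of a frequent sequence must itself be frequent), the following sketch mines patterns level by level. It is a simplified assumption-laden example, not AprioriAll, GSP or any other method named above: each sequence element is a single item rather than an itemset, and the function names and toy database are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def mine_sequential_patterns(database, min_support):
    """Level-wise search: grow frequent k-patterns into (k+1)-candidates."""
    items = sorted({item for seq in database for item in seq})
    level = {(i,): support((i,), database) for i in items}
    level = {p: s for p, s in level.items() if s >= min_support}
    frequent_items = [p[0] for p in level]
    result = {}
    while level:
        result.update(level)
        # Apriori pruning: keep a candidate only if its suffix of length k
        # is already known to be frequent.
        candidates = [p + (i,) for p in level for i in frequent_items
                      if (p + (i,))[1:] in level]
        level = {}
        for c in candidates:
            s = support(c, database)
            if s >= min_support:
                level[c] = s
    return result


# Toy shopping histories (hypothetical data).
db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "c")]
print(mine_sequential_patterns(db, min_support=3))
```

With a minimum support of 3, the toy database yields the single items a, b and c plus the two-item patterns (a, c) and (b, c); no three-item candidate survives pruning, so the search stops there.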

As an important component of the survey, we will address applications of sequential pattern mining techniques. We will discuss extending sequential pattern mining to Weblog mining, e.g., mining Weblog access patterns and making predictions based on them. We will also touch on a hot topic: extending sequential pattern mining to DNA mining. Some examples, such as finding nucleotide repeats and modelling DNA sequences by sequential classification, will be analyzed, and some currently influential methods for the analysis of DNA sequences will be reviewed. The survey will conclude with a summary of achievements, promises, and research problems for sequential pattern mining.

Both database and data mining researchers and practitioners may find the survey interesting. While the section of the audience with such a background would benefit most from this tutorial, the material would give newcomers an overall picture of the important landmarks in this field, and inspire them to learn more. The tutorial will be informative and educational to database and data mining researchers who work on the theme of sequential pattern mining and its applications.

 

Biography of the speakers

Jiawei Han (Ph.D., Univ. of Wisconsin at Madison, 1985) is Director of the Intelligent Database Systems Research Laboratory and Professor in the School of Computing Science, Simon Fraser University, Canada. He has conducted research in the areas of data mining, data warehousing, spatial data mining, Web mining, multimedia data mining, deductive and object-oriented databases, and logic programming, with over 150 journal and conference publications. He is a project leader of the Canada NCE/IRIS-3 project "Building, Querying, Analyzing, and Mining Data Warehouses on the Internet" (1998-2002). He has served or is currently serving on the program committees of over 50 international conferences and workshops, including SIGMOD'99, SIGKDD'99 (tutorial chairman), SIGMOD'2000 (demo chairman), EDBT'2000, VLDB'2000, SIAMDM'2001 (PC co-chairman), SIGKDD'2001 (Best paper award chairman), and PAKDD'2001 (conference co-chairman). He has also been serving as an editor for Data Mining and Knowledge Discovery, and Journal of Intelligent Information Systems. He is a co-author of the book "Data Mining: Concepts and Techniques", published by Morgan Kaufmann (2000).

 

Jian Pei is a Ph.D. candidate at Simon Fraser University. He received his B.Eng. and M.Eng. from Shanghai Jiaotong University, China, and was a Ph.D. candidate at Peking University, China, before joining SFU. He has published many research papers in China and received the best student paper award at the China Database Conference in 1998. He delivered many tutorials on different themes when he was in China. Since joining SFU, he has published research papers at PAKDD'2000, SIGMOD'2000, DMKD'2000, KDD'2000 and ICDE'2001. He has served on the program committees of DOLAP'00, workshops in PAKDD'00, PAKDD'01 and WISE'00. He has also served as a reviewer, referee or external referee for international journals and conferences, including IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, SIGMOD'00, EDBT'00, SSDBM'00, COMAD'00, etc.

 

Tutorial IV - Recent Advances in Data Mining Algorithms for Large Databases

Presenters: Dr Rajeev Rastogi and Dr Kyuseok Shim, USA and Korea

A large number of corporations have invested heavily in information technology to manage their businesses more effectively, and vast amounts of critical business data have been stored in database systems. The volume of this data is expected to grow considerably in the near future. Yet many organizations have been unable to collect valuable insights from the data to guide their marketing strategy, investment and management policies. One of the reasons for this is that most information is stored implicitly in the large amounts of data. Fortunately, new and sophisticated techniques being developed in the area of data mining can help companies leverage their data more effectively and extract insightful information from their data.

This tutorial describes the fundamental algorithms for data mining, many of which have been proposed in recent years. These techniques include association rules, correlation, causal relationship, clustering, outlier detection, similar time sequences, similar images, sequential patterns and classification. In addition, since we will cover technical material in some degree of depth, the audience will get a good exposure to the results in the area, and also future research directions.

Details include :

  • Overview and discussion on data mining techniques developed for large databases
  • Association Rules and Sequential Patterns
  • Bayesian Network
  • Classification, covering PUBLIC, BOAT, Rain-Forest, SLIQ and SPRINT algorithms as well as nearest neighbour and Bayesian classifiers
  • Clustering, covering CURE, ROCK, CLARANS, DBSCAN, BIRCH and CLIQUE algorithms
  • Similar Time Sequences and Similar Images, covering the QBIC, WBIIS and WALRUS systems developed for similar-image retrieval
  • Outlier Detection algorithms
  • Other Applications and Future Research

 

Biography of Speakers

Dr. Rajeev Rastogi is the Director of the Internet Management Research Department at Bell Laboratories, Lucent Technologies. He received the B. Tech degree in Computer Science from the Indian Institute of Technology, Bombay in 1988, and the masters and Ph.D. degrees in Computer Science from the University of Texas, Austin, in 1990 and 1993, respectively. He joined Bell Laboratories in Murray Hill, New Jersey, in 1993 and became a Distinguished Member of Technical Staff (DMTS) in 1998.

Rajeev Rastogi is active in the field of databases and has served as a program committee member for several conferences in the area. His writings have appeared in a number of ACM and IEEE publications and other professional conferences and journals. His research interests include database systems, storage systems, knowledge discovery and network management. His most recent research has focused on the areas of network management, data mining, high-performance transaction systems, continuous-media storage servers, tertiary storage systems, and multi-database transaction management.

 

Dr. Kyuseok Shim is an Assistant Professor at the Korea Advanced Institute of Science and Technology (KAIST) in Korea. He is also currently a Technical Advisory Board Member of WISEngine Incorporated. Before joining KAIST, he was a member of technical staff (MTS) in the Database Systems Research Department of Bell Laboratories and was one of the key contributors to the Serendip data mining project at Bell Laboratories. Before that, he worked on the Quest Data Mining project at IBM Almaden Research Centre and contributed to IBM Intelligent Miner for Data. He also worked as a summer intern for two summers at Hewlett Packard Laboratories. He received the B.S. degree in Electrical Engineering from Seoul National University in 1986, and the M.S. and Ph.D. degrees in Computer Science from the University of Maryland, College Park in 1988 and 1993, respectively.

Kyuseok Shim has been working in the area of databases, focusing on data mining, data warehousing, query processing and query optimisation, XML and semi-structured data. He has been an Advisory Committee Member for ACM SIGKDD. He has published several research papers in prestigious database conferences and journals. He has also served as a program committee member for the ICDE'97, KDD'98, SIGMOD'99, SIGKDD'99, ICDE'00, VLDB'00 and SIGKDD'01 conferences. He presented a data mining tutorial with Rajeev Rastogi at the CIKM'99, ICDE'99 and SIGKDD'99 conferences. He was co-chair of the 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

 

Tutorial V - Web Mining for E-Commerce

Presenter: Professor Jaideep Srivastava, University of Minnesota, USA

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. The Web is revolutionizing the way businesses interact with each other (business to business, i.e. B2B) and with each customer (business to customer, i.e. B2C), and the way people interact with each other, either through an intermediary (customer to customer, i.e. C2C) or without (person to person, i.e. P2P). It has introduced entirely new ways of doing commerce, including, for example, auctions and reverse auctions. It has also made it imperative for organizations and companies to optimize their electronic business. Knowledge about the Web user is fundamental for the establishment of viable e-business solutions. Web mining is the application of data mining techniques to acquire this knowledge for e-business. Typical concerns in e-business include improved cross-sells, up-sells, personalized ads, targeted assortments, improved conversion rates, and measurements of the effectiveness of actions.

This tutorial will (a) introduce the concept of Web Mining and discuss some of the key challenges, (b) present various solutions to overcome these challenges, (c) discuss the business impact of using Web Mining, and (d) present some case studies from well known E-commerce organizations.

 

Biography of Speaker

Jaideep Srivastava received his Bachelors in Computer Science from IIT Kanpur, India, in 1983. He obtained his Masters and Doctorate, both in Computer Science, from the University of California - Berkeley, in 1985 and 1988, respectively. For over 15 years he has been active as an accomplished, experienced, energetic technology leader, with a strong mix of innovation, management and communication abilities. He has a proven track record of rapidly building and managing fast-paced cross-functional teams in a wide variety of environments, including internet startups (Yodlee.com and Lancet Software), the leading e-tailer (Amazon.com), and in the research and development environment (University of Minnesota). He is experienced in delivering large-scale software systems in resource constrained environments with severe deadline requirements. He has very strong oral and written communication skills, and experience in software engineering processes and methodologies.

Since May 2000, Dr. Srivastava has been the Senior Director of Engineering at Yodlee.com, where he is in charge of all database activities. He has built a database and applications department from the ground up, which now has 7 engineers and analysts. He is responsible for the technical and business vision for database and applications; working with marketing, business development and customers to define and sell product offerings; and deciding the strategic positioning of database application offerings that appropriately address consumer security and privacy concerns.

Prior to this, Dr. Srivastava was the Chief Database Architect for Amazon.com, where he headed the efforts in applying database, data warehousing and data mining technologies for customer relationship management (CRM), which is one of the key elements of Amazon.com's vision of being the most customer focused company in the world. In addition to this, he applied various database technologies to a wide range of e-business functions, including supply chain management, fraud detection, finance, etc.

From 1997 through 1999, Dr. Srivastava was the Chief Technology Officer for Lancet Software, a data warehousing/data-mining company of 25 people. His responsibilities included market analysis and strategy definition, input to product development, and employee training on emerging technologies. He initiated and led the development of a suite of data cleaning products based on his original research. He established and managed an offshore software development partnership for Lancet, with Netcom Systems & Software Ltd., India.

Upon getting his Ph.D. from Berkeley, Dr. Srivastava spent 10 years at the University of Minnesota, going from tenure track Assistant Professor to tenured Full Professor during this period. He established and led a database and multimedia research laboratory, which graduated 17 Ph.D. and 35 M.S. students. In this period, he authored or co-authored over 125 papers in refereed journals and conferences, and as invited book chapters. A series of software systems were built and deployed as part of various military systems. In support of this research and development activity, Dr. Srivastava secured research grants totalling over $2 million from federal agencies and the industry. In addition he participated in a number of successful collaborative research/infrastructure grant efforts. Dr. Srivastava has been elected as a senior member of the IEEE for his fundamental research contributions to the fields of databases and multimedia systems. He serves on the editorial boards of several IEEE and ACM journals. Dr. Srivastava has served as the federal government's expert witness in a nationally significant tax case, which involved expressing his opinion on what constitutes computer science research and development. This case is being published in the law journal as a reference case.

During his academic life, Dr. Srivastava founded Data Engineering Technologies, a company that provides technology consulting and software services to industry and the government. Its clients include the US Army, Cargill Inc., the government of Chile, and Tata Infotech. He served as its President from 1995 through 1999.

Information about Dr. Srivastava's published research is available at: http://www.cs.umn.edu/Research/mmdbms/, http://www.cs.umn.edu/Research/websift/ and http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/s/Srivastava:Jaideep.html

 

Tutorial VI - From Evolving Single Neural Networks to Evolving Ensembles (Cancelled due to non-availability)