Exceptions in the cube (11/21)

Sunita Sarawagi: Explaining Differences in Multidimensional Aggregates. VLDB 1999: 42-53

The Data Cube (11/14)

Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, Hamid Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery 1 (1):29-53, 1997.

Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, Sunita Sarawagi: On the Computation of Multidimensional Aggregates. VLDB 1996: 506-521

S. Sarawagi, R. Agrawal, N. Megiddo: "Discovery-driven exploration of OLAP data cubes", Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998.
Expanded version available as IBM Research Report RJ 10102 (91918), January 1998.

Class was cancelled 11/7.

Outlier Detection (10/24 and 10/31)

Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim: Efficient algorithms for mining outliers from large data sets. Proceedings of the ACM SIGMOD Conference, 2000.

Breunig M. M., Kriegel H.-P., Ng R., Sander J.: LOF: Identifying Density-Based Local Outliers, Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2000), Dallas, TX, 2000.

Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem, M. Hernandez and S. Stolfo,
Journal of Data Mining and Knowledge Discovery, 1997.

An Application of the EM Algorithm: Clustering (10/17)

Jeff A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models (1998).

EM Algorithm (10/3, Guest lecture by Lillian Lee)

Classic reference, the paper that started it all: A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-38. Cornell people can get this paper online, although printing it is a pain (it is scanned in, and some restrictions apply).

A useful reference is Michael Collins' (UPenn) exam paper The EM Algorithm. http://www.cis.upenn.edu/~mcollins/papers/wpeII.4.ps 

 

Outlier Detection (9/26/2000)

Edwin M. Knorr and Raymond T. Ng. "Finding Intensional Knowledge of Distance-Based Outliers", Proc. VLDB, Edinburgh, Scotland, September 7-10, 1999, pp. 211-222.

Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. "Distance-Based Outliers: Algorithms and Applications", The VLDB Journal, 8(3), February, 2000, pp. 237-253.
Conference version of this paper: Edwin M. Knorr and Raymond T. Ng. "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proceedings of the 24th VLDB Conference, New York, August 24-27, 1998, pp. 392-403.

Edwin M. Knorr and Raymond T. Ng. "A Unified Notion of Outliers: Properties and Computation", Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, August 14-17, 1997, AAAI Press, pp. 219-222.

 

Decision Tree Construction (9/12/2000)

J. E. Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RAINFOREST - A Framework for Fast Decision Tree Construction of Large Datasets. In Proceedings of the Twenty-fourth International Conference on Very Large Data Bases, New York, New York, 1998.

J. E. Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT -- Optimistic Decision Tree Construction. In Proceedings of the 1999 SIGMOD Conference, Philadelphia, Pennsylvania, 1999.

Johannes Gehrke, Wei-Yin Loh, and Raghu Ramakrishnan. Data Mining with Decision Trees. (Slides and References.) Tutorial at the 1999 SIGKDD Conference, San Diego, California, 1999.

 

Sequence Analysis (9/5/2000)

R. Srikant, R. Agrawal: "Mining Sequential Patterns: Generalizations and Performance Improvements", Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
Expanded version available as IBM Research Report RJ 9994, December 1995.

R. Agrawal, R. Srikant: "Mining Sequential Patterns", Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
Expanded version available as IBM Research Report RJ 9910, October 1994.

Nevill-Manning, C.G. and Witten, I.H. (1997) "Identifying Hierarchical Structure in Sequences: A linear-time algorithm," Journal of Artificial Intelligence Research, 7, 67-82.

 

Market Basket Analysis (8/30/2000)

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
Expanded version available as IBM Research Report RJ 9839, June 1994.

R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases", Proc. of the ACM SIGMOD Conference on Management of Data, Seattle, Washington, 85-93, June 1998.

R. J. Bayardo Jr., R. Agrawal, and D. Gunopulos. "Constraint-Based Rule Mining in Large, Dense Databases", Proc. of the 15th Int'l Conf. on Data Engineering, 188-197, Sydney, Australia, March 1999.
Expanded version available as IBM Research Report RJ 10146, July 1999.