CS 478 Machine Learning
Proposal Format and Project Suggestions


1 Page Project Proposal Due: Tuesday, March 28
Final Project Report Due: Friday, May 5

General Information and Resources

The items below are just suggestions. Please feel free to propose your own idea for a final project. I encourage you to submit the proposal well before the March 28 deadline if you already know what you want to work on. I'll give you feedback within a day or so of receiving the proposal. If you need help choosing a project, consider coming to office hours to go over some options. The only real restriction is that the project involve doing some research in machine learning rather than writing a survey or discussion paper.

For the projects, you can use any machine learning systems available - you don't have to write all (or any) of the code yourself. Your paper should make clear what code you wrote yourself, borrowed, extended, etc.

Check the course web site for pointers to existing ML code and data sets.

Proposal Format

The project proposal should be at most one page in length. It should describe:

  • the problem that you'll investigate
  • the techniques that you'll explore to solve the problem
  • the method for evaluating the work (e.g. what data sets will you use? what baselines will you use for comparison?)

Project Ideas

    1. Construct a data set for a novel problem of interest to you and experiment with it using several of the available learning systems. Consider improvements to one of those systems motivated by its results on your data set.
    2. Implement and compare various methods for dealing with noisy data in ID3, such as the chi-square statistic, reduced-error pruning, or minimum description length.
    3. Implement one or more of the algorithms discussed in class (e.g. the standard Bayesian probability approach to classification, instance-based nearest-neighbor learning, some type of neural network). Compare them to ID3 or to each other on multiple data sets.
    4. Implement and extend a learning algorithm discussed in class. Evaluate on one or more data sets.
    5. Experiment with various multiple-model (voting) methods, such as bagging or boosting, applied to different learning methods (a minimal bagging sketch appears after this list).
    6. Implement and test recent "cotraining" methods. Cotraining combines unlabeled data with labeled data in order to perform supervised classification with a minimum amount of supervision; it does this by exploiting information in the unlabeled examples (a rough cotraining loop is sketched after this list).
    7. Implement ID5, an incremental version of ID3. Experiment with it on different data sets or make (and evaluate) some extensions.
    8. Develop or experiment with methods for learning Bayesian networks.
    9. Experiment with or enhance any of the available propositional rule-learning systems (e.g. extend the language, handle noise or missing data, etc.).
    10. Experiment with RIPPER, an efficient rule-learning system especially suited to problems with a large number of features, such as text categorization.
    11. Implement and experiment with a version of the Winnow system, a perceptron-like learner with provably faster convergence in domains with many irrelevant features (a sketch of the Winnow update rule appears after this list).
    12. Develop and experiment with various active-learning methods, which actively select the examples used for training in order to learn more from fewer labeled examples (an uncertainty-sampling sketch appears after this list).
    13. Implement and test a version of the CN2 system for learning decision lists (a variant of DNF).
    14. Code for the inductive logic programming systems GOLEM and FOIL should be available on the web. Find the papers that describe them, download the code, and compare the two systems to each other.
    15. Experiment with partial matching or probabilistic interpretations of inductively learned rules and compare to a purely logical interpretation.
    16. Add the ability to dynamically shift the bias of an existing system and experiment with it. Such a system would start with a strong bias (e.g. conjunctive concepts only) and then, if that fails, shift to a more weakly biased language.
    17. Read about PAC analysis and apply it to an interesting concept language.
    18. Implement, enhance, and/or experiment with some neural-network learning algorithms.
    19. Data is available here at Cornell for a number of natural language learning problems: part-of-speech tagging, parsing, information extraction, noun phrase coreference, and named entity identification. Choose one of these tasks and apply one or more learning methods to it.
    20. Implement COBWEB, a conceptual clustering system.
    21. Enhance COBWEB to handle numerical data and do some interesting experimentation.
    22. Change COBWEB to deal with noise as described by Fisher in IJCAI-89.
    23. Implement, experiment with, and/or extend the AUTOCLASS Bayesian system for clustering.
    24. Implement a small version of a scientific discovery system such as BACON, GLAUBER, STAHL, or DALTON.
    25. Get the distributed versions of the PRODIGY or SOAR systems (available by FTP from CMU) and experiment with them.
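
For item 5, here is a minimal sketch of bagging in Python. It assumes a generic learner interface - make_learner() returning a fresh object with fit(xs, ys) and predict(x) methods - and those names are illustrative only, not any particular library's API; your own ensemble code will likely look different.

    import random
    from collections import Counter

    def bag_fit(examples, make_learner, n_models=25):
        """examples: list of (features, label) pairs; make_learner() is assumed
        to return a fresh learner with fit(xs, ys) and predict(x) methods."""
        models = []
        for _ in range(n_models):
            # Bootstrap sample: draw |examples| points with replacement.
            sample = [random.choice(examples) for _ in examples]
            learner = make_learner()
            learner.fit([x for x, _ in sample], [y for _, y in sample])
            models.append(learner)
        return models

    def bag_predict(models, x):
        # Plurality vote over the ensemble's predictions.
        votes = Counter(m.predict(x) for m in models)
        return votes.most_common(1)[0][0]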
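
For item 6, the following sketch shows one possible cotraining loop in the spirit of Blum and Mitchell: two classifiers, each trained on a different feature "view", label the pooled unlabeled examples they are most confident about, and those examples are added to the shared labeled set. The two-view data layout and the fit / predict_proba learner interface are assumptions made for illustration, not a fixed recipe.

    def cotrain(labeled, unlabeled, make_learner, rounds=10, grow=5):
        """labeled: list of ((view1, view2), y); unlabeled: list of (view1, view2).
        make_learner() is assumed to return a learner with fit(xs, ys) and
        predict_proba(x) returning a dict mapping each label to a probability."""
        labeled, pool = list(labeled), list(unlabeled)
        for _ in range(rounds):
            if not pool:
                break
            h1, h2 = make_learner(), make_learner()
            h1.fit([v1 for (v1, _), _ in labeled], [y for _, y in labeled])
            h2.fit([v2 for (_, v2), _ in labeled], [y for _, y in labeled])

            def most_confident(h, view):
                # Score every pooled example on one view; keep the top few.
                scored = []
                for idx, views in enumerate(pool):
                    probs = h.predict_proba(views[view])
                    label, conf = max(probs.items(), key=lambda kv: kv[1])
                    scored.append((conf, idx, label))
                return sorted(scored, reverse=True)[:grow]

            picks = most_confident(h1, 0) + most_confident(h2, 1)
            chosen = {}
            for conf, idx, label in picks:
                chosen.setdefault(idx, label)   # keep one label per example
            labeled += [(pool[idx], label) for idx, label in chosen.items()]
            pool = [x for i, x in enumerate(pool) if i not in chosen]
        return labeled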
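
For item 11, here is a minimal sketch of the basic Winnow update rule over Boolean features, using the usual multiplicative promotion/demotion on mistakes. The parameter values and the choice of threshold are just common textbook defaults, not a prescription.

    def train_winnow(examples, n_features, alpha=2.0, epochs=10):
        """examples: list of (x, y) pairs, x a 0/1 feature vector, y in {0, 1}."""
        w = [1.0] * n_features          # all weights start at 1
        theta = float(n_features)       # a common choice of threshold: n
        for _ in range(epochs):
            for x, y in examples:
                score = sum(w[i] for i in range(n_features) if x[i])
                y_hat = 1 if score >= theta else 0
                if y_hat == y:
                    continue            # mistake-driven: no update when correct
                for i in range(n_features):
                    if x[i]:
                        # Promote active weights on a missed positive,
                        # demote them on a false positive.
                        w[i] = w[i] * alpha if y == 1 else w[i] / alpha
        return w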
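
For item 12, a simple starting point is uncertainty sampling: repeatedly train on the current labeled set, then ask for the label of the pooled example the model is least sure about. The oracle(x) callback below stands in for whatever labeling source you have, and the learner interface is the same assumed fit / predict_proba convention as in the sketches above.

    def uncertainty_sampling(labeled, pool, make_learner, oracle, budget=20):
        """labeled: list of (x, y); pool: unlabeled feature vectors;
        oracle(x) -> y supplies a label on request (e.g. a human annotator)."""
        labeled, pool = list(labeled), list(pool)
        for _ in range(budget):
            if not pool:
                break
            learner = make_learner()
            learner.fit([x for x, _ in labeled], [y for _, y in labeled])
            # Query the example whose top predicted label has the lowest probability.
            x_query = min(pool, key=lambda x: max(learner.predict_proba(x).values()))
            pool.remove(x_query)
            labeled.append((x_query, oracle(x_query)))
        final = make_learner()
        final.fit([x for x, _ in labeled], [y for _, y in labeled])
        return final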