CS 478 Machine Learning
Proposal Format and Project Suggestions
1 Page Project Proposal Due: Tuesday, March 28
Final Project Report Due: Friday, May 5
General Information and Resources
The items below are
just suggestions. Please feel free to propose your own idea for a
final project. I encourage you to submit the proposal well before the
March 28 deadline if you already know what you want to work on. I'll
give you feedback within a day or so of submitting the proposal. If
you need help in choosing a project, consider coming to office hours
to go over some options. The only real restriction is that the project
involve doing some research in machine learning rather than
writing a survey or discussion paper.
For the projects, you can
use any machine learning systems available - you don't have to write
all (or even any) of the code yourself. Your paper should make clear what
code you wrote yourself, what you borrowed, and what you extended.
Check the course web site for pointers to existing ML code and data
sets.
Proposal Format
The project proposal should be at most
one page in length. It should describe:
- the problem that you'll investigate
- the techniques that you'll explore to solve the problem
- the method for evaluating the work (e.g. what data sets will you
use? what baselines will you use for comparison?)
Project Ideas
- Construct a data set for a novel problem of interest to you and experiment
with it using several of the available learning systems. Consider improvements
to some system motivated by results on one of these datasets.
- Implement and compare various methods for dealing with noisy data in ID3,
such as the chi-square statistic, reduced-error pruning, and minimum description length.
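To make the first of these concrete, here is a rough sketch of the chi-square statistic for testing whether a split is significant; the function name and the layout of the counts are illustrative, not from any particular ID3 implementation:

```python
def chi_square(branch_counts):
    """branch_counts: per-branch class counts, e.g. [[8, 2], [1, 9]]
    for a two-way split over two classes. A small statistic suggests
    the split fits the data no better than chance, so the subtree
    below it is a candidate for pruning as noise."""
    n_classes = len(branch_counts[0])
    class_totals = [sum(b[c] for b in branch_counts) for c in range(n_classes)]
    total = sum(class_totals)
    stat = 0.0
    for branch in branch_counts:
        branch_total = sum(branch)
        for c in range(n_classes):
            # Expected count under independence of branch and class.
            expected = branch_total * class_totals[c] / total
            if expected > 0:
                stat += (branch[c] - expected) ** 2 / expected
    return stat
```

A split whose statistic falls below the critical value for the chosen significance level (3.84 at the 95% level with one degree of freedom) would be rejected as noise.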
- Implement one or more of the algorithms discussed in class
(e.g. standard Bayesian probability approach to classification,
instance-based nearest-neighbor learning, some type of neural
network). Compare to ID3 or each other on multiple data sets.
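As a sense of scale, instance-based nearest-neighbor classification can be sketched in a few lines; `knn_classify` is an illustrative name, and Euclidean distance is just one reasonable choice:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among the k nearest training
    examples under Euclidean distance.

    train: list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Most of the project work would go into the experimental comparison, not the classifier itself.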
- Implement and extend a learning algorithm discussed in class.
Evaluate on one or more data sets.
- Experiment with various multiple model (voting) methods such as bagging or
boosting applied to different learning methods.
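A minimal sketch of the bagging half of this idea, assuming a `base_learner` function that trains on a sample and returns a predictor (the interface and names are illustrative):

```python
import random
from collections import Counter

def bagged_predict(train, query, base_learner, n_models=11, seed=0):
    """Train `n_models` copies of `base_learner` on bootstrap
    resamples of `train` and return the majority-vote prediction.

    base_learner(sample) must return a function that maps a query
    to a predicted label."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Bootstrap: draw len(train) examples with replacement.
        sample = [rng.choice(train) for _ in range(len(train))]
        model = base_learner(sample)
        votes[model(query)] += 1
    return votes.most_common(1)[0][0]
```

Boosting differs in that later models are trained on reweighted data and votes are weighted, so it needs a bit more machinery than this.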
- Implement and test recent "cotraining" methods. Cotraining
combines unlabeled data with labeled data in order to perform
supervised classification with a minimum amount of supervision; it
does this by exploiting the information in the unlabeled examples.
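The cotraining loop above can be sketched as a toy, assuming two single-feature "views" of each example, a nearest-centroid classifier per view, and at least two classes in the initial labeled set; all names and the confidence measure are illustrative, not from any particular paper:

```python
def centroid_fit(labeled, view):
    """Mean feature value per class for one view (0 or 1)."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x[view]
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def cotrain(labeled, unlabeled, rounds=5):
    """Each round, the classifier for each view labels the single
    unlabeled example it is most confident about and adds it to the
    shared labeled pool, so the two views teach each other."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                break
            cents = centroid_fit(labeled, view)

            def margin(x):
                # Confidence: gap between the two nearest centroids.
                d = sorted(abs(x[view] - c) for c in cents.values())
                return d[1] - d[0]

            best = max(unlabeled, key=margin)
            guess = min(cents, key=lambda y: abs(best[view] - cents[y]))
            labeled.append((best, guess))
            unlabeled.remove(best)
    return labeled
```

A real project would use stronger per-view classifiers and label several confident examples per round rather than one.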
- Implement (or obtain) ID5, an incremental version of
ID3. Experiment on different data sets, or make (and evaluate) some
extensions.
- Develop or experiment with methods for learning Bayesian networks.
- Experiment with or enhance any of the available propositional
rule-learning systems (e.g. extend the language, deal with noise,
missing data etc.).
- Experiment with RIPPER, an efficient rule-learning system
especially suited to problems with a large number of features, such
as text categorization.
- Implement and experiment with a version of the Winnow system, a
perceptron-like algorithm with provably faster convergence in domains
with many irrelevant features.
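The core Winnow update (multiplicatively promote or demote the weights of active features on each mistake) can be sketched as follows; this follows the common formulation with the threshold set to the number of features, though details vary across presentations:

```python
def winnow_train(examples, n_features, alpha=2.0):
    """Mistake-driven Winnow training.

    examples: list of (x, y) with x a 0/1 feature vector and y in
    {0, 1}. Predict 1 when the weighted sum of active features
    exceeds the threshold n_features."""
    w = [1.0] * n_features
    theta = float(n_features)
    for x, y in examples:
        pred = 1 if sum(w[i] for i in range(n_features) if x[i]) > theta else 0
        if pred == 0 and y == 1:
            # Promotion: false negative, boost active features.
            for i in range(n_features):
                if x[i]:
                    w[i] *= alpha
        elif pred == 1 and y == 0:
            # Demotion: false positive, shrink active features.
            for i in range(n_features):
                if x[i]:
                    w[i] /= alpha
    return w
```

Because the updates are multiplicative, weights of irrelevant features stay small while relevant ones grow exponentially fast, which is the source of the convergence guarantee.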
- Develop and experiment with various active-learning methods, which
actively select the examples used for training in order to learn more
from fewer examples.
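One common active-learning strategy, pool-based uncertainty sampling, picks the unlabeled example the current classifier is least sure about. A sketch, again using an illustrative nearest-centroid classifier with at least two classes:

```python
def most_uncertain(pool, centroids):
    """Return the unlabeled point whose distances to its two nearest
    class centroids are most nearly equal, i.e. the point closest to
    the current decision boundary.

    pool: list of 1-d feature values; centroids: dict label -> value."""
    def margin(x):
        d = sorted(abs(x - c) for c in centroids.values())
        return d[1] - d[0]
    return min(pool, key=margin)
```

The selected point would be sent to a human for labeling, added to the training set, and the classifier retrained before the next query.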
- Implement and test a version of the CN2 system for learning
decision lists (a variant of DNF).
- Code for the inductive logic programming systems GOLEM and FOIL
should be available on the web. Find the papers that describe them,
download the code, and compare the systems to each other.
- Experiment with partial matching or probabilistic interpretations of
inductively learned rules and compare to a purely logical interpretation.
- Add the ability to dynamically shift the bias of an existing
system and experiment with it. Such a system would start with a strong
bias (e.g. conjunctive concepts only) and then, if that fails, shift
to a more weakly biased language.
- Read about PAC analysis and apply it to an interesting concept language.
- Implement, enhance, and/or experiment with some neural-network learning
algorithms.
- Data is available here at Cornell for a number of natural
language learning problems: part-of-speech tagging, parsing,
information extraction, noun phrase coreference, and named entity
identification.
- Implement COBWEB, a conceptual clustering system.
- Enhance COBWEB to handle numerical data and do some interesting
experimentation.
- Change COBWEB to deal with noise as described by Fisher in IJCAI-89.
- Implement, experiment with, and/or extend the AUTOCLASS Bayesian system for
clustering.
- Implement a small version of a scientific discovery system such as BACON,
GLAUBER, STAHL, or DALTON.
- Get the distributed versions of the PRODIGY or SOAR systems (available by
FTP from CMU) and experiment with them.