1-Page Project Proposal Due: Nov. 2, 2006
Final Project Report Due: 5 PM, December 15, 2006
Bioinformatics
There is an active local group of researchers working on data mining for bioinformatics. In particular, a group is competing in a scientific competition, called MouseFunc I, on predicting the function of mouse genes from various sources of information about the genes, and is interested in help. Local contacts are Prof. Edward Marcotte in biochemistry
(marcotte@icmb.utexas.edu) and ECE Ph.D. student Chase Krumpleman (krump@lans.ece.utexas.edu).

Below is another project idea from Prof. Edward Marcotte and biology post-doc Christine Vogel (cvogel@mail.utexas.edu); contact them for details: We are interested in a particular type of genetic interaction called 'synthetic lethal interactions', in which either gene can be removed individually without affecting the cell, but the removal of both kills the cell. In particular, the hubs in a synthetic lethal interaction network are important: they compensate for the loss of many other genes. I suspect these may often be genes critical for suppressing tumor formation, although we haven't formally tested this idea. A good project would be to try to predict which genes are hubs in these networks--that is, to learn what properties separate hubs from non-hubs (with features based on expression data, interaction data, functions, etc.). This would require experimenting with various classifiers on functional genomics data. If anyone got nice results on this, we have a collaborator in England who would be willing to experimentally assay the genes for their participation in synthetic lethal interactions, so the project could continue beyond the class with real results.
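As a rough illustration of framing hub prediction as binary classification, the sketch below trains a simple logistic regression on synthetic stand-in data. The features (interaction degree, an expression statistic) and all numbers are invented placeholders; a real project would use functional genomics features and a proper train/test split.

```python
import numpy as np

# Hypothetical sketch: hub vs. non-hub prediction as binary classification.
# Synthetic "genes": hubs (label 1) tend to have more interaction partners.
rng = np.random.default_rng(0)
n = 200
degree = rng.poisson(lam=np.r_[np.full(100, 3.0), np.full(100, 9.0)])
expression_var = rng.normal(0.0, 1.0, n) + np.r_[np.zeros(100), np.ones(100)]
X = np.column_stack([degree, expression_var]).astype(float)
y = np.r_[np.zeros(100), np.ones(100)]  # 1 = hub, 0 = non-hub

# Standardize features, then fit logistic regression by gradient descent.
X = (X - X.mean(0)) / X.std(0)
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(hub)
    w -= 0.5 * (X.T @ (p - y) / n)          # gradient step on weights
    b -= 0.5 * (p - y).mean()               # gradient step on bias

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (pred == y).mean()
```

In practice one would compare several classifiers (decision trees, SVMs, etc.) rather than commit to a single model.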
Computer Graphics
Below is an idea from CS Prof. Okan Arikan (okan@cs.utexas.edu) for applying machine learning to computer graphics. Please contact him for more details. Graphics applications (films, games, etc.) require very detailed digital content such as geometric models, textures, and animations. Creating such content often involves repetitive operations. For example, if we're creating an old, decrepit house, the artist would have to create the same dusty, eroded appearance (with spider webs distributed accordingly) everywhere in the house. The idea of this project is to learn a model of the user's edits and generalize these edits to cut down the creation time for digital environments. In our old, decrepit house example, if we let the user create a single room of this house, can we learn from this what it means to be old and decrepit and apply it to the rest of the house automatically? Can we learn and help the user as the user is performing these edits? Can we ever reach a state where, after sufficient use of this system, we can develop a model of the user's desired appearance and recreate it from very simple inputs on novel environments?
Computer Architecture
There is a large architecture project in the department called TRIPS. CS Prof. Kathryn McKinley (mckinley@cs.utexas.edu) has proposed two potential topics for machine learning research in this area. Please contact her for details. 1. We have a large amount of data from simulated annealing of different schedules on TRIPS. We examined this data by hand to extract scheduling heuristics. It would be interesting to see if learning could do better. 2. I would also like to use the same framework to generate register bank allocation and scheduling data from simulated annealing, and see if we can derive register bank assignment heuristics, either independently or together with schedules.
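One simple way to extract a human-readable heuristic from annealing output, sketched below, is to fit a decision stump (a single-feature threshold rule) to records pairing instruction features with the placement the annealer chose. The feature names and the synthetic labeling rule are invented for illustration only; real data would come from the TRIPS scheduler.

```python
import random

# Hypothetical sketch: learn a scheduling rule from simulated-annealing data.
# Each record pairs (critical_path_length, num_consumers) with whether the
# annealer scheduled the instruction early (1) or late (0). The labeling
# rule below is a made-up stand-in for real annealing behavior.
random.seed(1)
data = []
for _ in range(300):
    cp = random.uniform(0, 10)          # critical path length (invented scale)
    consumers = random.randint(0, 5)    # number of consuming instructions
    early = 1 if cp + consumers > 7 else 0
    data.append(((cp, consumers), early))

# Exhaustively search for the best single-feature threshold rule.
best = (0, 0.0, -1.0)  # (feature index, threshold, training accuracy)
for feat in (0, 1):
    for thresh in sorted({x[feat] for x, _ in data}):
        acc = sum((x[feat] > thresh) == bool(y) for x, y in data) / len(data)
        if acc > best[2]:
            best = (feat, thresh, acc)
feat, thresh, acc = best
```

A stump is the weakest possible learner; the point is that even it yields an interpretable rule ("schedule early if critical path length exceeds t") that can be compared against hand-extracted heuristics. Decision trees would be a natural next step.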
Business Applications
Prof. Maytal Saar-Tsechansky (Maytal.Saar-Tsechansky@mccombs.utexas.edu) in the School of Business works in machine learning and data mining and has proposed the following topics. Please see her for details.

Active Information Acquisition: Predictive models play a dominant role in numerous business intelligence tasks. A critical factor affecting the knowledge captured by such a model is the quality of the information, i.e., the training data, from which the model is induced. For many tasks, potentially pertinent information is not immediately available, but can be acquired at a cost. Traditionally, information acquisition and modeling are addressed independently: data are collected irrespective of the modeling objectives. However, information acquisition and predictive modeling are in fact mutually dependent: newly acquired information affects the model induced from the data, and the knowledge captured by the model can help determine what new information would be most useful to acquire. Information acquisition policies take advantage of this relationship to produce acquisition schedules. An acquisition schedule is a ranking of potential information acquisitions, in this case currently unknown feature values or missing labels (target variables). An ideal acquisition schedule would rank most highly those acquisitions that would yield the largest improvement in model quality per unit cost. Prior work proposed algorithms for the acquisition of either class labels (i.e., dependent/target variables) of training examples, or of feature values. In this project we propose and evaluate new comprehensive approaches to information acquisition that rank the acquisition of both labels and feature values that may be missing. We will evaluate the new approaches on several business data sets where feature values and class labels are costly to acquire.
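A minimal sketch of an acquisition schedule, under the assumption that uncertainty per unit cost is a reasonable proxy for expected improvement: candidates are ranked by how uncertain the current model is about them, divided by what the acquisition costs. The candidate ids, probabilities, and costs below are invented placeholders.

```python
# Hypothetical sketch of an acquisition schedule: rank candidate label
# acquisitions by model uncertainty per unit cost.
candidates = [
    # (example id, predicted P(positive), acquisition cost) -- all invented
    ("a", 0.52, 1.0),
    ("b", 0.95, 1.0),
    ("c", 0.50, 4.0),
    ("d", 0.70, 0.5),
]

def score(p, cost):
    # Uncertainty is 1 at p = 0.5 (model has no idea) and 0 at p in {0, 1}
    # (model is certain); normalize by the acquisition cost.
    uncertainty = 1.0 - abs(2.0 * p - 1.0)
    return uncertainty / cost

schedule = sorted(candidates, key=lambda t: score(t[1], t[2]), reverse=True)
ranked_ids = [t[0] for t in schedule]
```

Note that cheap, uncertain examples rise to the top; a confident prediction ("b") is ranked last even at low cost. A real policy would estimate expected model improvement rather than raw uncertainty.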
Information acquisition for compliance management domains: Compliance management pertains to tasks such as selecting tax reports to be audited, or health care claims to be scrutinized for fraud. Non-compliance is a substantial source of revenue loss that afflicts a variety of industries. The U.S. Department of Health and Human Services reported in 2001 that Medicare alone lost $11.9 billion to fraud or other improper payments to providers in 2000. The Internal Revenue Service reported recently that the gap between taxes paid and taxes owed was between $312 billion and $353 billion in 2001 (the latest year for which figures are available), with about one sixth of this amount eventually collected. Substantial losses have also been reported by the auto insurance industry, adding billions of dollars to auto premiums each year. In all these scenarios, at each point in time, an analyst must decide whether to audit a case in order to recover or prevent a revenue loss. Such decisions are increasingly being guided by predictive models built from historical audit outcomes. Due to the vast transaction volume, review of all cases that might be related to fraud is prohibitively expensive. Thus effective predictive models that identify strong leads for further investigation are essential. However, there exists a hurdle shared by all compliance management sectors that renders model induction in this domain particularly challenging. As in all supervised learning settings, in order to induce a model it is necessary for training examples to be labeled, meaning that the value of the target variable (e.g., whether or not a particular claim is fraudulent) must be acquired. However, audits are carefully chosen to target only cases that are predicted to have a high probability of being fraudulent (and thus leading to revenue recovery). This leads to a severely biased training sample, which cripples model induction and revenue collection.
Such models do not detect new pockets of non-compliance effectively. Furthermore, rather than helping improve the model, new audit outcomes merely reinforce existing perceptions and do not provide useful information.
An alternative approach is to complement the biased sample with additional audits carefully selected to significantly improve the model itself (rather than to avoid imminent losses). Because audits are costly, it is essential to devise a selective information acquisition mechanism that identifies informative audits that will particularly improve model performance for a given acquisition cost. Existing active learning policies assume either that no data is available ex-ante and that the sample is constructed exclusively by the active learner, or that a representative sample of the underlying population is available ex-ante. These assumptions are fundamental to the acquisition policies' subsequent actions and are severely violated in the compliance management domain due to the biased audit data. Thus the active learner ought to leverage knowledge of this bias to help identify new, informative acquisitions that can improve the model's performance. We have obtained a real data set of companies' sales tax audits which we will use in this study to evaluate new policies. The data include information about firms in a given state, sales tax audit results, and the amounts of money paid by companies following the audits.
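One way to make a selection policy bias-aware, sketched below, is to score each candidate audit by model uncertainty down-weighted by how well-represented similar cases already are among past audits. Past audits cluster at high predicted risk, so under-explored regions win. All scores and numbers are invented stand-ins for model outputs and audit history.

```python
import math

# Hypothetical sketch: choosing audits that counteract a biased sample.
# Past audits cluster at high risk scores (the bias described above).
past_audits = [0.9, 0.85, 0.8, 0.95, 0.88]
candidates = {"x": 0.9, "y": 0.5, "z": 0.2}  # firm id -> model risk score

def representation(score, sample, bandwidth=0.1):
    # Kernel density of the biased audit sample at this score: high values
    # mean many similar cases were already audited.
    return sum(math.exp(-((score - s) / bandwidth) ** 2) for s in sample)

def priority(score):
    uncertainty = 1.0 - abs(2.0 * score - 1.0)  # 1 at 0.5, 0 at 0 or 1
    return uncertainty / (1.0 + representation(score, past_audits))

chosen = max(candidates, key=lambda k: priority(candidates[k]))
```

Here "y" (risk 0.5) is chosen: it is maximally uncertain and sits far from the over-audited high-risk region, so auditing it tells the model the most.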
Data Mining
The following suggestions come from Prof. Inderjit Dhillon's data mining research group.
Non-negative matrix factorization: Non-negative matrix factorization attempts to find a factorization of a matrix A into the product of two matrices B and C, where the entries of B and C are non-negative. This has recently been used in unsupervised learning for various applications. One project possibility is to code up and use non-negative matrix factorization for some application; possibilities include text analysis (modelling topics in text), image/face processing, or problems in bioinformatics. A second idea would be to explore sparse non-negative matrix factorization, which has been recently proposed. The contact for these projects is Suvrit Sra (suvrit@cs.utexas.edu).

Power-Law Graphs: Much large-scale graph data has node degrees that are power-law distributed (a standard example is a web graph). Clustering such graphs is somewhat difficult due to these degree distributions; one project idea would be to explore clustering such graphs. Some work has been done recently using minimum balanced cuts to cluster these graphs, and there has been promise in using normalized cuts in this domain. One option would be to compare Graclus, software for computing the normalized cut in a graph, to other graph clustering methods to determine the most effective ways to compute clusters in power-law graphs. The contact for this project is Brian Kulis (kulis@cs.utexas.edu).

Experiments with SVMs: Several projects are possible for support vector machines. One possibility is to find a good application, create a support vector machine implementation, and perform experiments. Another possibility is to compare different existing implementations of SVMs (including SVM code developed at UT) in terms of speed and accuracy. A third idea is to explore different methods of regularization for SVMs (this would require some knowledge of optimization) and compare performance.
Contacts for this project are Dongmin Kim (dmkim@cs.utexas.edu), Suvrit Sra (suvrit@cs.utexas.edu), and Brian Kulis (kulis@cs.utexas.edu).
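The non-negative matrix factorization idea above can be sketched in a few lines using multiplicative updates, which keep B and C non-negative by construction. The matrix A here is random for illustration; in an application it would be, e.g., a term-document matrix.

```python
import numpy as np

# Minimal NMF sketch via multiplicative updates: approximate A ~ B @ C
# with all entries of B and C non-negative.
rng = np.random.default_rng(0)
A = rng.random((20, 15))  # placeholder data matrix, entries in [0, 1)
k = 4                     # number of latent factors

B = rng.random((20, k)) + 0.1  # small offset avoids starting at zero
C = rng.random((k, 15)) + 0.1
eps = 1e-9                     # guards against division by zero

for _ in range(200):
    # Each update multiplies by a non-negative ratio, so non-negativity
    # is preserved and the Frobenius reconstruction error never increases.
    C *= (B.T @ A) / (B.T @ B @ C + eps)
    B *= (A @ C.T) / (B @ C @ C.T + eps)

error = np.linalg.norm(A - B @ C) / np.linalg.norm(A)
```

For a topic-modelling application, the columns of B would be interpreted as topics and the columns of C as per-document topic weights; sparse variants add a penalty to these updates.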