Beruflich Dokumente
Kultur Dokumente
Lesley Farmer
California State University Long Beach
lfarmer@csulb.edu
503
The list of statistics programs obtained was on the website of the 3.1 Courses by Discipline
American Statistical Association at http://www.sci.csueastbay. Once the data mining courses were identified by discipline, the
edu/~mwatnik/statlist. The keywords from statistics programs number of departments offering them was determined. That 2009
offering data mining courses were: neural network, decision tree, data are reported in Table 1. The courses are listed by
and data mining. The keyword count for statistics was obtained departments because of the marked variation in courses by
from schools with graduate programs in statistics, but math department. Note that the computer science/engineering
departments were excluded because of the extremely small departments offer the most graduate data mining courses followed
number of data mining courses in mathematics outside of schools by business school offerings.
offering a graduate degree in statistics.
The list of library and information science programs was from the Table 1. Number of U.S. University Departments Offering
American Library Association, http://www.ala.org/ala/education Data Mining Courses by Discipline
careers/education/accreditedprograms/index.cfm. The keywords
associated with data mining courses were: informatics, Discipline # of depts. % of total # of depts.
information retrieval, information management, knowledge with data in discipline offering
management, knowledge discovery in databases, competitive mining data mining courses
intelligence, bibliometrics, biometrics, bibliomining and data courses
mining. Business 83 17.6%
2.2 Top Resources and Publications By Computer 187 48.8%
Discipline Science/Engineering
Information about the most commonly used books and software Statistics 46 28.0%
by discipline was collected from course syllabi and instructors
replies to email in 2009. A reply specifying book(s), software, or Library/Information 15 30.0%
both was counted as a response. There is no further breakdown of Science
the percent who responded with each piece of information
because the counts from the two sources, syllabi and emails, were
not kept separately.
3.2 Keywords by Discipline
Keywords obtained from university catalog titles, course listings
The average number of articles per year from 1990 to September and descriptive words relating to data mining courses from each
2009 was calculated for the disciplines: business, computer of the four major disciplines are shown in Table 2. Keyword
science/engineering, statistics, and library/information science. overlap between disciplines is surprisingly infrequent.
The phrase data mining in the abstracts of journals, books, and
conference proceedings was used to search the business database Table 2. Keywords by Discipline
ABI Inform Complete, the computer science/engineering database
Compendex, the statistics database Current Index to Statistics, and Business Computer Statistics Library/Info
the library/information science databases Library Literature and Sci./ Science
Information Science and Library, Information Science & Engineering
Technology Abstracts. It should be noted that in the Current
Index to Statistics, which is the main database for statistics, there business adaptive association/ automatic
was no specific identifier for abstracts (as in the other two intelligence computation link analysis extracting
databases), so title/keywords was the closest option. This is very
likely the reason for a lower number of results found in the competitive artificial clustering (K bibliometrics
statistics search. If one looks at the number of articles divided by advantage intelligence means, near-
the number of departments in a particular discipline having data est neighbors)
mining courses, one can compare articles/department across
CRM database/ data decision trees bibliomining
disciplines to compare publication productivity. The list of
warehouse
departments was obtained from the same list as keywords to
identify data mining courses in the different disciplines. database intelligent genetic biometrics
mgmt. agents algorithms
3. RESULTS
There were 75 business faculty surveyed by email and 48 systems
responded (64%) providing information on data mining books or database knowledge machine business
related software. Of the 235 computer science/engineering decision discovery in learning intelligence
faculty surveyed who were teaching data mining courses, 127 making databases
responded (54%). For inquiries from statistics departments, 31 of
data machine model competitive
the 44 surveyed responded (a 70% email response rate of either
warehouse learning validation intelligence
book or software or both). All library/information science
programs had online information. Although the degree of decision multidimensio neural content mining
response combined both texts and software, the text titles and the support nality(data networks/
type of software were recorded separately. systems cubes) fuzzy logic
intelligent neural nonparametric database
enterprise networks/ learning management
504
neurocomputi
ng processing
knowledge text mining pattern database
mgmt./ recognition decision
discovery making
mgmt.
information support vector data warehouse
systems machines
market- training/testin decision
basket g dataset support
analysis
OLAP unsupervised fuzzy logic
learning
quantitative health
methods informatics
informatics
information
mgmt.
information
retrieval
Figure 3. Computer Science/Engineering Data Mining
knowledge Software by Brand Name (n=118)
mgmt.
S+
knowledge 10%
disc/database
Ggobi
quantitative 10% SAS
methods 35%
text mining
other
3.3 Software by Discipline 15%
In the email responses from academicians in each discipline,
numerous types of data mining software were reported. These are
presented as proportions in Figures 2, 3, 4, and 5.
Oracle SPSS
5% R
5%
30%
(none used)
7% SAS
34% Figure 4. Statistics Data Mining Software by Brand Name
SQL (n=18)
7%
S+
Access 10%
11%
Ggobi
10% SAS
35%
other Excel
12% 19%
other
Figure 2. Business Data Mining Software by Brand Name 15%
(n=57)
R
30%
505
3.4 Books by Discipline 3.5 Data Mining Articles by Discipline
Data mining books vary by discipline largely because their focus The average annual number of published data mining articles by
and applications differ. Some of the leading books as identified discipline from 1990 through mid-2009 is listed in Figure 6. Note
in 2009 are listed below by discipline. Only the leading books are that the average per year increase over the decade in data mining
listed. The Russell and Norvig title was the most popular, used articles in business journals was nearly two-fold, fifteen-fold in
more than twice as often as the next most cited textbook, by Duda, computer science/engineering journals, seven-fold in
et al. library/information science articles, and there was little change in
Business the number of statistics journals. As mentioned previously, there
Witten, I., and Frank, E. 2005. Data Mining: Practical Machine was no specific identifier for abstracts in the main database for
Learning Tools and Techniques. Morgan Kaufmann, Burlington, statistics (as in the other databases), so title/keywords was used as
MA. the closest option. Again, this is the likely the reason for the
Berry, M., and Linoff, G. 2004. Data Mining Techniques: For lower number of results found within statistics.
Marketing, Sales, and Customer Relationship Management.
Wiley, New York. Data Mining Articles by Discipline
Olson, D., and Shi, Y. 2005. Introduction to Business Data
(refereed journals, also proceedings and books)
Mining. McGraw-Hill, Columbus, OH.
Marakas, G. 2002. Modern Data Warehousing, Mining, and
Visualization. Prentice Hall, Upper Saddle River, NJ. 2000 1754
506
Library/information science reflect a broad range of perspectives: to choose. Instead, it appears that textbook choice depends on the
from logical architecture of data for mining to field-specific specific course objectives and content focus, the academic lens
applications of data mining (especially health and business). determining the title to be used. It would be useful to survey
Generally, courses blend theory and practice. Data mining is also faculty as to the basis for their textbook choice.
considered a viable research methodology in library/information The number of articles over time varies by discipline. Business
science, in which case it is were more likely to be offered at the published the greatest number before the year 2000, but the rate
doctorate level than at the masters. In no case is data mining a leveled in the 21st century. By contrast, the library/information
required course in library/information science, although Syracuse science article publication rate has shown a continuing rise,
University and Wayne State offered specializations in data increasing a little over threefold from the late 1990s to the early
management, which included data mining as an elective 2000s and then a bit over twice as many in the past five years.
Beyond the term data mining, each discipline generated unique Computer science articles rose dramatically (over tenfold) from
associated terms. Business terms focused on decision-making, the late 1990s to the early 2000s, and continued to rise by nearly
management, and competition. Computer science/engineering 50% in the past five years.
used more technology-related and intelligence-related terms. A potential limitation in organizing data mining articles by
Statistics used more methodological terms. Library/information discipline is that database aggregators may not have captured all
science terms had the greatest variation, from fuzzy logic to text relevant publications. It should be noted that another
mining, but most terms were associated with applications (e.g., interpretation involving data mining articles by discipline is that
bibiometrics, health informatics, and information management). the database aggregators may vary. In addition, deeper
The greatest overlap existed between business and investigation into the quality of the articles would also shed light
library/information science due to decision-making methodology on the extent of scholarly contributions.
and management issues.
Data mining software varied by discipline. SAS was the dominant 5. CONCLUSION
software used in the business and statistics departments. Statistics Data mining courses in the U.S. are available in various academic
had the most stable set of software brands. Matlab and C++ were disciplines, and the overall field is rapidly expanding. Evidence
the most frequently cited software in computer presented in the figures and tables makes this abundantly clear.
science/engineering courses for data mining. Computer Detailed information concerning overlapping emphases in data
programming languages, in general, were used by a majority of mining disciplines has not been reported heretofore and deserves
those courses. SPSS, SQL, and Excel were the dominant software attention. Certain other academic areas include data mining
used in library/information science courses. It appears that the courses and have associated texts and software. Nonetheless, the
choice of tools depended on the status of the databases to be four disciplinary fields described in this review cover the major
utilized. One might assume that courses where computer academic areas at this time. The emerging picture reveals a blend
programming software was used would address both database of theory and practice that reflects each academic discipline rather
creation as well as data mining. Software also reflected the type of than a unified system. Hopefully, a productive merging of data
data needed, such as SPSS vs. RefEVAL or TextQuest. In mining approaches through increased cross-disciplinary research
addition, the choice of software might also reflect the technical can develop and advance all these related fields. The rate of
sophistication within the academic community, with business change in the data mining field is so rapid that the information is
using the least complicated software and computer likely to be measurably different in the next ten to twenty years.
science/engineering and statistics using the most complex
products. 6. REFERENCES
A good deal of overlap exists in textbook choices across [1] Berry, M., and Linoff, G. 2004. Data Mining Techniques for
disciplines--and in some cases within disciplines, especially for Marking, Sales and Customer Support (2nd ed.). Wiley, New
library/information science. Tan, et al.s Introduction to Data York.
Mining was used in computer science/engineering and in [2] Duda, R., Hart, P., and Stork, D. 2000. Pattern Classification
statistics, Han and Kambers Data Mining was used in computer (2nd ed.). Wiley-Interscience, New York.
science/engineering and library/information science, and Witten [3] Hastie, T., Tibshirani, R., and Friedman, J. 2009. The
and Franks Data Mining was used in business and Elements of Statistical Learning (2nd ed.). Springer. New
library/information science. Russell and Norvigs Artificial York.
Intelligence was by far the most popular computer [4] Larose, D. 2005. Discovering Knowledge in Data. Wiley-
science/engineering textbook. Han and Kamber was the favorite Interscience, Hoboken, NJ.
title in library/information science, although Shortliffe and [5] Olson, D. and Shi, Y. 2006. Introduction to Business Data
Ciminos Biomedical Informatics was the standard textbook for Mining. McGraw-Hill, Columbus, OH.
health informatics within library/information science. The picture [6] Roiger, R., and Geatz, M. 2003. Data Mining. Addison-
that emerges shows little agreement on standard textbooks except Wesley, Boston.
in computer science/engineering. In specialized subsets of the
field, such as biometrics, few titles may be available from which
507