Scientific Data Analysis

Applications of Expert Systems to Scientific Data Analysis
Kevin G. Goebel
Michael Mager
Principles of Knowledge Acquisition & Retrieval
CSSIE 482, Winter 2000
Revision 2.0
March 12, 2000
1 Abstract
Scientific data analysis carries a high data maintenance overhead. Further, successful research and analysis currently
depends on the painstaking, laborious effort of an expert researcher. Technology exists that can facilitate research and
analysis on scientific data. Unfortunately, this technology exists as discrete elements, not as a complete system.
Our goal is to prove the concept of, and develop the framework for, a knowledge-based, neuroinformatic data
management system: an expert system. In this report, we describe a prototype expert system and present its context.
This expert system will transcend traditional database management systems to function as a “research assistant” by
applying rules about the data, operations, research goals and hardware constraints.
2 Domain Description
One of the major challenges in neuroscience research is managing the issues associated with the complexity of the data
generated during experiments. In response to this challenge, a cross-disciplinary field (“Neuroinformatics”) has arisen
within the biomedical information processing community to apply computer technology to research and collaboration
in neuroscience.
A variety of research has been performed in the area of scientific database management systems. These systems
focus on helping researchers classify and manipulate their data. However, such systems generally require the
researcher to understand the types of analysis required for the data, to understand the unique constraints of specific
data formats, and to develop plans to perform the analysis.
These traditional systems fall short of researchers’ needs and of what currently available technology can provide.
While the currently implemented applications that store, classify, organize, maintain, process and present data are
adequate to their respective tasks, there is no mechanism to “intelligently” combine these aspects of scientific data
analysis. This forces the researcher to perform countless hours of trial-and-error analysis, pursuing inappropriate or
illogical analysis avenues.
To illustrate the complexity of these analysis issues, and to construct our proposed solution, we use, as our ex-
ample case, research performed and presented by Dr. B. Stiber in her paper, “Categorization of Gerbil auditory fiber
responses” (Stiber, et al.).
2.1 Research Tasks

In general, researchers must follow some form of the following steps:
establish a context and a goal
collect or reuse experiment data
identify appropriate operations based on data type and the analysis goal
perform appropriate operations on the data
document or model the results of the operations
make the results and the data available for independent verification
Examining one aspect of our example case, we found a graphical representation of processed data; specifically,
a scatter plot of two characteristics of an experiment, represented as a time domain envelope. That plot appears as
figure 1 below (Stiber 279); the time domain envelope appears as figure 2 below (Stiber 278).
In our example case, the raw data is time series data of an auditory fiber’s response to stimulus. The time domain
envelope characterizes a reverse correlation of the raw, time series data. The scatter plot illustrates the relationship
between the standard deviation and the skewness of all of the time domain envelopes whose peak frequency is 1034 Hz
from 10 subjects. We describe the data in more detail in section 2.2.
For our purposes it was important to note that the researchers performed specific, classifiable operations on raw
data, in an identifiable order, to generate intermediate results, which in turn generated the plot. Further, this scatter
plot is only one component of the complete analysis process.
Figure 1: [plot not reproduced; axes: Coefficient of Variation (horizontal), Skewness (vertical)]
Figure 2: [plot not reproduced; axes: Time (msec) (horizontal), Filter Amplitude (vertical)]
The following list illustrates the “chain” of operations the researchers performed on the raw data to generate this
graph (Note: each experiment contains time series data that represents the stimulus response of one filter).
1. Perform a reverse correlation procedure on raw, time series data.
2. Locate the peaks of the absolute value of the filter.
3. Characterize the time domain envelope using two methods. These two methods yield four characteristics: order
and alpha, and standard deviation and skewness.
4. Set operation-dependent and statistical value constraints on the data set. This determines whether or not data
will be included for consideration.
5. Graph the relationship of the skewness of a time domain envelope against its standard deviation for all filters at
a statistically significant peak frequency.
All of these steps serve to generate and execute a plan that answers the researchers’ analysis question: Can significant,
systematic variation in auditory fiber response, sufficient to categorize filters according to the type of information to
which they are selective, be found, given that there is variation from animal to animal?
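The chain of operations listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual procedure: reverse correlation is simplified to an average of the stimulus samples preceding each spike, the absolute value of the filter stands in for the time domain envelope, only the standard deviation and skewness characterization is shown, and the function names and the `min_peak` screening threshold are hypothetical.

```python
import math

def reverse_correlate(stimulus, spike_indices, window):
    """Step 1 (simplified): average the `window` stimulus samples preceding each spike."""
    segments = [stimulus[i - window:i] for i in spike_indices if i >= window]
    if not segments:
        return []
    return [sum(column) / len(segments) for column in zip(*segments)]

def envelope_moments(filt, dt_msec):
    """Step 3 (assumed method): treat |filter| as a density over time.

    Returns (standard deviation, skewness); (0.0, 0.0) for a degenerate envelope.
    """
    weights = [abs(a) for a in filt]
    total = sum(weights)
    if total == 0:
        return 0.0, 0.0
    times = [i * dt_msec for i in range(len(weights))]
    mean = sum(t * w for t, w in zip(times, weights)) / total
    variance = sum((t - mean) ** 2 * w for t, w in zip(times, weights)) / total
    std = math.sqrt(variance)
    if std == 0:
        return 0.0, 0.0
    third = sum((t - mean) ** 3 * w for t, w in zip(times, weights)) / total
    return std, third / std ** 3

def scatter_points(experiments, window, dt_msec, min_peak):
    """Chain steps 1-5 and return the (std, skewness) pairs a scatter plot would show."""
    points = []
    for stimulus, spike_indices in experiments:
        filt = reverse_correlate(stimulus, spike_indices, window)  # step 1
        if not filt:
            continue
        peak = max(abs(a) for a in filt)                           # step 2 (global peak only)
        if peak < min_peak:                                        # step 4 (hypothetical constraint)
            continue
        points.append(envelope_moments(filt, dt_msec))             # steps 3 and 5
    return points
```

Each function consumes the output of the previous one, which is the property the expert system has to reason about when it assembles a plan.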
2.2 Data
Neuroinformatic research generates vast quantities of information that is extraordinarily diverse in scale, base and
algorithmic processing. In his paper, “Logos: A Computational Framework for Neuroinformatics Research”, Dr.
Michael Stiber describes the characteristics of neuroinformatic data this way: Quantities of data currently range up to
multi-petabyte levels. The data itself are diverse, including scalar, vector (from 1 to 4 dimensions), volumetric (up to 4
dimensional spatiotemporal), topological, and symbolic, structured knowledge. Spatial scales range from Angstroms
to meters, while temporal scales go from microseconds to decades. Base data vary greatly from individual to indi-
vidual, and results computed can change with improvements in algorithms, data collection techniques, or underlying
methods.
That description emphasizes the necessity for a researcher to be well informed about their domain. A researcher
must select and manipulate data in an informed manner to analyze it efficiently. A researcher must possess
knowledge about the data constraints to apply appropriate operations successfully. This need calls for explicit support
for knowledge about the data, that is, metadata.
This information is frequently stored haphazardly in an almost folkloric manner. Information is embedded in the
file names of computerized data stores. Information is kept in margin notes in research notebooks. Information is kept
on Post-It notes clinging to the edges of CRTs in basement laboratories. Recent developments in database management
systems address this problem, but the resulting information store is too frequently not transmittable and not extensible.
The structure of our example case data, which is the description of the response of an auditory cell to various
stimuli, takes the form of a list of data points. Without the associated metadata, this data is useless. Table 1 describes
our example data.
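As an illustration, the metadata elements of Table 1 could be captured in a structured record instead of file names and margin notes. The sketch below is hypothetical: the class and field names are ours, the sampling rate unit (Hz) is an assumption because the table does not state one, and the filename parsing assumes that `##` in the `DDMMYY##` format is a two-digit experiment number.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass
class Stimulus:
    type: str             # e.g. "band-limited Gaussian white noise"
    amplitude: str        # "high" or "low"
    annotation: str = ""  # notes about the stimulus

@dataclass
class ExperimentMetadata:
    filename: str             # DDMMYY## as described in Table 1
    unit_number: int          # ordinal identifying the individual cell fiber
    stimulus: Stimulus
    sampling_rate_hz: float   # assumption: the table gives no unit
    annotation: str = ""      # notes about the experiment
    accuracy: str = ""        # optional comments about accuracy

    def experiment_date(self) -> date:
        """Decode the DDMMYY prefix of the filename."""
        return datetime.strptime(self.filename[:6], "%d%m%y").date()

    def experiment_number(self) -> int:
        """Decode the trailing two-digit experiment number."""
        return int(self.filename[6:8])

example = ExperimentMetadata(
    filename="12030001",
    unit_number=3,
    stimulus=Stimulus(type="band-limited Gaussian white noise", amplitude="high"),
    sampling_rate_hz=20000.0,  # made-up value for illustration
)
```

With a record like this, the date and experiment number no longer have to be decoded by hand from the file name.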
Table 1: Auditory fiber response experiment metadata
Element               Description
Filename              An eight-character alphanumeric string that refers to the date of the experiment and
                      the experiment number. Format: DDMMYY##
Unit number           An ordinal identifying the individual cell fiber.
Stimulus (category)   Stimulus possesses sub-elements, which follow.
  type                For our example case: band-limited Gaussian white noise with two different amplitude ranges.
  amplitude           Either high or low. High: -70dBv. Low: -90dBv.
  annotation          A field for notes about the stimulus.
Sampling rate         The rate at which samples were taken.
Annotation            A field for notes about the experiment.
Accuracy              A field for comments about the accuracy (optional).
2.3 Operations
As mentioned, each data set possesses elements that constrain the type and degree of operations that can be applied
to it. Conversely, some operations accept only specific types of data. It is therefore necessary to maintain
information about operations on data in the same way that we maintain information about data.
We classified operations into the following function types:
Transform: converts data to an equivalent representation. No data loss occurs from the operation. The operation is
reversible.
Input:
Output:
Characteristic Generator: derivation of a more succinct representation from a single data source.
Input: single data source.
Output: scalar, vector.
Summary Generator: derivation of a more succinct representation from multiple data sources.
Input: multiple data sources.
Output: scalar, vector.
Database Operation: any “traditional” database function, such as a query.
Input: characteristic constraint.
Output: data sources.
Data Screening: derivation of a subset.
Input: constraint on individual points, data set.
Output: derived data set, where output ⊆ input.
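The function types above amount to typed descriptions of operations, which is what makes chaining decidable. The following Python sketch shows one possible way to record them and to test whether one operation can feed another. The `Operation` class, `can_chain` and `screen` are hypothetical names; the two example operations and their type labels are borrowed from the CLIPS facts in Appendix A.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operation:
    name: str
    kind: str         # function type, e.g. "characteristic-generator"
    input_type: str
    output_type: str

def can_chain(producer: Operation, consumer: Operation) -> bool:
    """A producer can feed a consumer when its output type is the consumer's input type."""
    return producer.output_type == consumer.input_type

def screen(data_set, keep):
    """Data Screening: the result is always a subset of the input data set."""
    return [point for point in data_set if keep(point)]

fit = Operation("fit-prob-density-fn", "function-generator",
                input_type="set-scalar", output_type="prob-density-fn")
skewness = Operation("skewness-of-prob-density-fn", "characteristic-generator",
                     input_type="prob-density-fn", output_type="scalar")
```

Here `can_chain(fit, skewness)` holds because the fitted density is exactly what the skewness generator accepts, while the reverse pairing is rejected.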
3 Problem Definition
An expert system can extend the scope of bioinformatic data collection, manipulation, presentation and sharing by
introducing, directly into the system, the researcher’s analysis expertise. The capabilities of an expert system would
allow researchers to pose system-level queries, which the expert system would then map to queries of individual
experiment data, combined with operation functions to produce a chain of functions that respond to those queries.
This enables mapping cellular-level data through operation constructs that model how individual cells contribute to
system-level function.
Our goal is to design and implement a prototype expert system that demonstrates the feasibility of applying
knowledge-based techniques to the data-handling requirements of scientific data analysis using neuroscientific data
and analysis as the model for complex scientific data analysis.
4 Functional Specification
Following are statements describing the functionality of our prototype expert system.
The system shall query the user to determine the analysis goal. The analysis goal is the system-level goal of the
researcher, one that represents generalized questions about the effect of, and relationship to, cellular-level data.
The system shall identify candidate operations. Candidate operations are those functions that conform to con-
straints of the goal, conform to constraints of the data and that are members of the process that will achieve the
analysis goal.
The system shall evaluate candidate operations and, if necessary, query the user for additional information.
The system shall generate a plan of “chained” operations. The system shall link operations in appropriate order,
“chaining” one to the other to process cellular-level data to respond to the system-level query.
At the primary level of completion, the system shall present the generated plan of chained operations to the user.
At the secondary level of completion, the system shall execute the plan. The system shall make function calls to
external applications, such as MATLAB, to execute the “chained” operations. The system shall store the analytical
results returned by any external applications.
At the tertiary level of completion, the system shall display the results of the executed plan. The system shall
incorporate the results returned by external applications in a graphical model, plot or other transform that will allow
the user an alternative perspective of the results.
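The plan-generation requirement above can be illustrated with a small backward-chaining sketch in Python. It is a simplification under stated assumptions, not the prototype itself: operations are linked only by the semantic label of what they produce and need, the `FUNCTIONS` table and the helper names are hypothetical, and the operation names are taken from the facts in Appendix A.

```python
# function name -> (what it produces, what it needs; None means raw experiment data)
FUNCTIONS = {
    "std-dev-of-prob-density-fn": ("std-dev", "prob-density-fn"),
    "skewness-of-prob-density-fn": ("skewness", "prob-density-fn"),
    "fit-prob-density-fn": ("prob-density-fn", "time-domain-envelope"),
    "extract-time-domain-envelope-data-set": ("time-domain-envelope", None),
}

def plan_for(semantic, level, plan):
    """Backward-chain from a required output toward raw data, recording plan levels."""
    for name, (produces, needs) in FUNCTIONS.items():
        if produces == semantic:
            plan[name] = max(level, plan.get(name, 0))
            if needs is not None:
                plan_for(needs, level + 1, plan)
    return plan

def scatterplot_plan(var_x, var_y):
    """Return operation names in execution order (deepest plan level first)."""
    plan = {"graph-2D-scatterplot": 1}
    for variable in (var_x, var_y):
        plan_for(variable, 2, plan)
    return sorted(plan, key=lambda name: -plan[name])

steps = scatterplot_plan("std-dev", "skewness")
```

The plan is built from the goal backward, but it is executed from the deepest level forward, which matches the order in which a plan would be presented to the user.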
5 Design
The project focused primarily on “pre-design” activities, so most of the project effort was devoted to increasing our
understanding of the problem domain. As we explored the domain in greater depth, we came to appreciate its
complexity. Our efforts shifted to analysis and prototype development activities focused on identifying the
underlying design issues, which include the following:
Definition of requirements for system-level query input
Definition of requirements for converting system-level query to goal-rule
Definition of function knowledge
Understanding specification of function chaining rules
Definition of system-level result outputs
This approach led to an iterative prototype process oriented toward increased understanding of design issues. This
approach is discussed in more detail in the next section.
The CLIPS constructs from our most recent prototype appear as Appendix A.
6 Mitigation
Because of the complexity of this domain, our development quickly became an iterative process. We did not, initially,
understand the subtleties of function interaction. It was essential to proceed with an overly simplified construct as a
trial of our design logic. Our initial constructs were inadequate for full analysis plan construction, but allowed us to
examine two critical issues: function interaction and constraint propagation.
Our design goal of chaining functions from input to output required that we understand not only the input or
output type, but also criteria constraining instances of each function. For example, users may constrain the system to
only operations on standard deviation and skewness, but the system can employ a single function to process standard
deviation, skewness, mean, median and mode. The user’s constraint must be identified and maintained by the system
and applied at the appropriate time during the plan generation process.
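The constraint example above can be made concrete with a short Python sketch. It assumes a single hypothetical function, `summarize`, that computes all five statistics, and a second hypothetical function, `apply_user_constraint`, that stands in for the system remembering the user's constraint and applying it after the operation has run. The skewness formula used here (third central moment divided by the cubed population standard deviation) is one common definition, chosen for illustration.

```python
import statistics

def summarize(values):
    """One operation that yields every statistic, whether or not the user asked for it."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    third = sum((v - mean) ** 3 for v in values) / len(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std-dev": std,
        "skewness": third / std ** 3 if std else 0.0,
    }

def apply_user_constraint(results, wanted):
    """Keep only the characteristics the user constrained the analysis to."""
    return {name: value for name, value in results.items() if name in wanted}

all_stats = summarize([1, 2, 2, 3])
selected = apply_user_constraint(all_stats, {"std-dev", "skewness"})
```

The operation itself is unaware of the constraint; the system must carry it through plan generation and apply it to the operation's output.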
The complexity of this project, and our limited development time frame, forced us to narrow our scope. We limited
analysis to a single goal: displaying the scatter plot of skewness against standard deviation. Further, we chose
to implement only the operations necessary to reach that goal. No other functions are defined in our construct and our
functions are goal-specific, not general.
Again, because of our tight development time frame, we selected CLIPS over JESS as our expert system shell.
We had experience with CLIPS that we didn’t have with JESS. As a result, we limited our system to a textual user
interface.
7 Future Work
Even with this successful demonstration of concept, there is much more to be done. Functions remain to be generalized
within the gerbil auditory fiber research. Our base construct can be extended to accommodate additional research
analysis goals and methods.
True modularity and analytical functionality can be more easily implemented with JESS. We see a need to expand
the user interface to incorporate a graphical form that will allow users to more easily convey analysis goals and
identify user constraints. JESS allows greater connectivity and platform accessibility than CLIPS. Additionally, the
system requires further extension to interface with MATLAB and other external analysis applications.
References
B.Z. Stiber, E.R. Lewis, M.D. Stiber and K.R. Henry, Categorization of Gerbil auditory fiber responses,
Neurocomputing 26-27 (1999) 277-283.
M.D. Stiber, G.A. Jacobs and D. Swanberg, Logos: A Computational Framework for Neuroinformatics Research,
Proceedings of the International Conference on Scientific and Statistical Database Management (1997).
A Appendix
;================================================================================
; gerbil 1-5 00308 1445.CLP
; last revision: 3-Mar-2000
;================================================================================
;================================================================================
; deftemplates
;================================================================================
(deftemplate response (slot question) (slot answer))
(deftemplate goal (slot name) (slot num-variables))
(deftemplate goal-constraint (slot goal) (slot criteria) (slot value))
; concept not currently used
; (deftemplate plan-op (slot category))
(deftemplate graph-2D-func
   (slot name)
   (slot output-semantic)
   (slot output-type)
   (slot num-variables)
   (slot var-x-type)
   (slot var-y-type)
   (slot var-criteria))
(deftemplate plan-op-graph-2D-func
   (slot name)
   (slot plan-level)
   (slot output-user)
   (slot var-x-required-type)
   (slot var-x-semantic)
   (slot var-y-required-type)
   (slot var-y-semantic)
   (slot match-key))
; NOTE: the output-semantic slots in the templates below do not appear in the
; extracted listing; they are added here because the facts and rules use them.
(deftemplate char-gen
   (slot name)
   (slot output-semantic)
   (slot output-type)
   (slot input-type))
(deftemplate plan-op-char-gen
   (slot name)
   (slot plan-level)
   (slot output-user)
   (slot output-semantic)
   (slot input-criteria))
(deftemplate func-gen
   (slot name)
   (slot output-semantic)
   (slot output-type)
   (slot input-type))
(deftemplate plan-op-func-gen
   (slot name)
   (slot plan-level)
   (slot output-user)
   (slot output-semantic)
   (slot input-type)
   (slot input-criteria))
(deftemplate data-screen
   (slot name)
   (slot output-semantic)
   (slot output-type)
   (slot input-type)
   (slot screen-type))
(deftemplate plan-op-data-screen
   (slot name)
   (slot plan-level)
   (slot output-user)
   (slot output-semantic)
   (slot input-type))
;================================================================================
; deffacts
;================================================================================
(deffacts func-list
   (graph-2D-func
      (name graph-2D-scatterplot)
      (output-semantic scatterplot)
      (output-type relationship)
      (num-variables 2)
      (var-x-type scalar)
      (var-y-type scalar)
      (var-criteria match-key))
   (char-gen
      (name std-dev-of-prob-density-fn)
      (output-semantic std-dev)
      (output-type scalar)
      (input-type prob-density-fn))
   (char-gen
      (name skewness-of-prob-density-fn)
      (output-semantic skewness)
      (output-type scalar)
      (input-type prob-density-fn))
   (func-gen
      (name fit-prob-density-fn)
      (output-semantic prob-density-fn)
      (output-type prob-density-fn)
      (input-type set-scalar))
   (data-screen
      (name extract-time-domain-envelope-data-set)
      (output-semantic time-domain-envelope)
      (output-type set-scalar)
      (input-type set-scalar)
      (screen-type extreme-values))
)
;(deffacts goal-info
; (goal-constraint
; (goal relationship)
; (criteria var-x)
; (value std-dev))
; (goal-constraint
; (criteria var-y)
; (value skewness))
; (goal-constraint
; (criteria variable-relationship)
; (value match-key))
; (goal-constraint
; (criteria match-key)
; (value time-domain-envelope))
; (goal-constraint
; (criteria source-data-filter)
; (value peak-frequency))
(deffacts init
   (goal-plan-level 0)
   (yes-answer yes YES)
   (subcategory time-domain-envelope prob-density-fn)
   (criteria relationship source-data-filter)
   (criteria relationship match-key)
   (criteria relationship variable-relationship)
   (criteria relationship var-y)
   (criteria relationship var-x)
   (max-level 0))
;================================================================================
; defrules
;================================================================================
;--------------------------------------------------------------------------------
; user interface question rules
;________________________________________________________________________________
(defrule ask-analysis-goal
   (declare (salience +10))
   (initial-fact)
   =>
   (printout t crlf crlf crlf crlf crlf crlf crlf crlf crlf crlf)
   (printout t crlf "Scientific Data Analysis Expert System" crlf)
   (printout t "Enter your analysis goal." crlf)
   (printout t crlf "The following choices are allowed:" crlf)
   (printout t " relationship" crlf crlf)
   (bind ?reply (read))
   (assert (response (question goal) (answer ?reply))))
(defrule ask-goal-relationship-num-variables
   (response (question goal) (answer relationship))
   =>
   (printout t crlf "Scientific Data Analysis Expert System" crlf)
   (printout t "Enter the number of variables you want to relate." crlf)
   (printout t crlf "The following choices are allowed:" crlf)
   (printout t " 1" crlf)
   (printout t " 2" crlf crlf)
   ; read added: ?reply is otherwise unbound in the extracted listing
   (bind ?reply (read))
   (assert (response (question num-variables) (answer ?reply))))
(defrule create-goal
   (response (question goal) (answer ?curGoal))
   (response (question num-variables) (answer ?curNumVar))
   =>
   (assert (goal (name ?curGoal) (num-variables ?curNumVar))))
(defrule goal-relationship-constraints
   (goal (name relationship))
   (criteria relationship ?x)
   (yes-answer $?curAnswers)
   =>
   (printout t crlf "Do you want to enter a " ?x " criteria? (yes/no)" crlf)
   ; read added: ?reply is otherwise unbound in the extracted listing
   (bind ?reply (read))
   (if (member$ (lowcase ?reply) ?curAnswers) then
      (assert (response (question ?x) (answer yes)))
    else
      (assert (response (question ?x) (answer no)))))
(defrule goal-relationship-constraint-value
   (goal (name relationship))
   (criteria relationship ?x)
   (response (question ?x) (answer yes))
   =>
   (printout t crlf "What is the value for the " ?x " criteria?" crlf)
   ; read added: ?reply is otherwise unbound in the extracted listing
   (bind ?reply (read))
   (assert (goal-constraint (goal relationship) (criteria ?x) (value ?reply))))
;________________________________________________________________________________
; function chaining rules
;________________________________________________________________________________
; NOTE: lines marked "assumed" are missing from the extracted listing and are
; reconstructed here so that every plan operation carries a plan level.
(defrule select-graph-2D-func-from-goal
   (goal (name ?curGoal) (num-variables ?curGoalNumVars))
   (goal-constraint (goal ?curGoal) (criteria variable-relationship) (value ?curVarRel))
   (goal-constraint (goal ?curGoal) (criteria ?curVarRel) (value ?curVarRelVal))
   (goal-constraint (goal ?curGoal) (criteria var-x) (value ?curVar-x))
   (goal-constraint (goal ?curGoal) (criteria var-y) (value ?curVar-y))
   (graph-2D-func
      (name ?nextFnName)
      (output-type ?curGoal)
      (var-x-type ?xType)
      (var-y-type ?yType)
      (var-criteria ?curVarRel))
   (goal-plan-level ?curPlanLevel)
   =>
   (assert (plan-op-graph-2D-func
      (name ?nextFnName)
      (plan-level (+ ?curPlanLevel 1))
      (output-user ?curGoal)
      (var-x-required-type ?xType)
      (var-x-semantic ?curVar-x)
      (var-y-required-type ?yType)
      (var-y-semantic ?curVar-y)
      (match-key ?curVarRelVal))))
(defrule select-char-gen-from-graph-2D-func-for-var-x
   (plan-op-graph-2D-func
      (name ?curFnName)
      (plan-level ?curPlanLevel)
      (var-x-required-type ?curXType)
      (var-x-semantic ?curVarX)
      (match-key ?curMK))
   (char-gen
      (name ?nextFnName)
      (output-semantic ?curVarX)
      (output-type ?curXType)
      (input-type ?nextInType))
   =>
   (assert (plan-op-char-gen
      (name ?nextFnName)
      (plan-level (+ ?curPlanLevel 1))   ; assumed
      (output-user ?curFnName)
      (output-semantic ?curVarX)
      (input-criteria ?curMK))))
(defrule select-char-gen-from-graph-2D-func-for-var-y
   (plan-op-graph-2D-func
      (name ?curFnName)
      (plan-level ?curPlanLevel)         ; assumed
      (var-y-required-type ?curYType)
      (var-y-semantic ?curVarY)
      (match-key ?curMK))
   (char-gen
      (name ?nextFnName)
      (output-semantic ?curVarY)
      (output-type ?curYType))
   =>
   (assert (plan-op-char-gen
      (name ?nextFnName)
      (plan-level (+ ?curPlanLevel 1))   ; assumed
      (output-user ?curFnName)           ; assumed
      (output-semantic ?curVarY)
      (input-criteria ?curMK))))
(defrule select-func-gen-from-char-gen
   (plan-op-char-gen
      (name ?curFnName)
      (plan-level ?curPlanLevel)         ; assumed
      (input-criteria ?curInCriteria))
   (subcategory ?curInCriteria ?nextOutType)
   (func-gen
      (name ?nextFnName)
      (output-semantic ?nextOutSemantic)
      (output-type ?nextOutType)
      (input-type ?nextInType))          ; assumed
   =>
   (assert (plan-op-func-gen
      (name ?nextFnName)
      (plan-level (+ ?curPlanLevel 1))
      (output-user char-gen)
      (output-semantic ?nextOutSemantic)
      (input-type ?nextInType)
      (input-criteria ?curInCriteria))))
(defrule select-data-screen-from-func-gen
   (plan-op-func-gen
      (name ?curFnName)
      (plan-level ?curPlanLevel)         ; assumed
      (input-criteria ?curInCriteria)
      (input-type ?curInType))
   (data-screen
      (name ?nextFnName)
      (output-semantic ?curInCriteria)
      (output-type ?curInType)
      (input-type ?nextInType))          ; assumed
   =>
   (assert (plan-op-data-screen
      (name ?nextFnName)
      (plan-level (+ ?curPlanLevel 1))   ; assumed
      (output-user ?curFnName)           ; assumed
      (output-semantic ?curInCriteria)
      (input-type ?nextInType))))
;________________________________________________________________________________
; control & administration rules
;________________________________________________________________________________
(defrule check-max-level
   (declare (salience -1))
   (or (plan-op-graph-2D-func (plan-level ?curOpLevel))
       (plan-op-char-gen (plan-level ?curOpLevel))
       (plan-op-func-gen (plan-level ?curOpLevel))
       (plan-op-data-screen (plan-level ?curOpLevel)))
   ?x <- (max-level ?curMaxLevel)
   =>
   (if (> ?curOpLevel ?curMaxLevel) then
      (retract ?x)
      (assert (max-level ?curOpLevel))))
;--------------------------------------------------------------------------------
; printing rules
;--------------------------------------------------------------------------------
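; NOTE: the salience declarations on print-plan-header and change-print-level
; are not in the extracted listing. They are added so that printing waits for
; the final max-level and each level is printed before the level is decremented.
(defrule print-plan-header
   (declare (salience -5))
   (goal (name ?curGoal))
   (max-level ?curMaxLevel&:(> ?curMaxLevel 0))
   =>
   (assert (print-level ?curMaxLevel))
   (printout t crlf "The following plan will satisfy your analysis goal of: ")
   (printout t ?curGoal crlf))
(defrule print-plan-level
   (print-level ?curMaxLevel)
   (or (plan-op-graph-2D-func
          (name ?curFnName)
          (plan-level ?curMaxLevel)
          (output-user ?curOutUser))
       (plan-op-char-gen
          (name ?curFnName)
          (plan-level ?curMaxLevel)
          (output-user ?curOutUser))
       (plan-op-func-gen
          (name ?curFnName)
          (plan-level ?curMaxLevel)
          (output-user ?curOutUser))
       (plan-op-data-screen
          (name ?curFnName)
          (plan-level ?curMaxLevel)
          (output-user ?curOutUser)))
   =>
   (printout t crlf "Execute the " ?curFnName " function;" crlf)
   (printout t " use the results in the " ?curOutUser " function." crlf))
(defrule change-print-level
   (declare (salience -10))
   ?x <- (print-level ?curLevel&:(> ?curLevel 0))
   =>
   (retract ?x)
   (assert (print-level (- ?curLevel 1))))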