
Access and Organize Data with MATLAB

Graham Dudgeon, MEng, PhD
Industry Manager, MathWorks

graham.dudgeon@mathworks.com

© 2013 The MathWorks, Inc.

Overview

Challenges and Solutions

Access data from multiple sources including SQL databases, data historians, instrumentation, and files
    Examples: SQL database connection, URL file reading, and reading multiple files
Work with data too large to fit into system memory
    Binary and text files
Write and test functions that read in industry-specific text file formats
    Examples: Log ASCII Standard (.LAS) files and power system formats (IEEE CDF and others)
Organize multiple data sets into single data containers using Data Tables
Visualize and interact with data
Automate the detection and classification of events
    Threshold detection and stability classification
Summary

From Data to Decisions & Design

Action: Decisions & Design
    Reporting & Apps, Scalable Deployment, Design Optimization
Knowledge: Understanding
    Analytics, Frequency & Time-domain, Predictive Analytics, Extrapolation
Information: Organization
    Filtering, Signal Analysis, Data Reduction, Plotting
Data: Observation and Access
    Sensing, Collecting, Health Status, Data Acquisition
Physical Sensors

From Data to Decisions & Design

Database Access
    Financial Data, ODBC, JDBC, HDFS (Hadoop)
File I/O
    Text, Spreadsheet, XML, CDF/HDF, Image, Audio, Video, Geospatial, Web content
Hardware Access
    Data acquisition, Image capture, GPU, Lab instruments
Communication Protocols
    CAN (Controller Area Network), DDS (Data Distribution Service), OPC (OLE for Process Control), XCP (Universal Measurement and Calibration Protocol)

From Data to Decisions & Design

Data Processing
    Convert, Sync, Clean, Reduce

From Data to Decisions & Design

Visualization

From Data to Decisions & Design

Exploratory Analysis
    Derived metrics, events, conditions

[Figure: scatter plot matrix of MPG, Acceleration, Displacement, Weight, and Horsepower]

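Reading in Multiple Files

Files with equivalent formats and well-ordered file names can be read in using a for-loop. Speed advantages can be gained by using parfor, which runs the iterations in parallel when Parallel Computing Toolbox is available. Each file must contain the same number of rows for the column assignments below to work.

    parfor l = 1:no_files
        fid = fopen(['data',num2str(l),'.txt']);   % data1.txt, data2.txt, ...
        ww = textscan(fid,'%f %f');                % two numeric columns per file
        fclose(fid);
        time(:,l) = ww{1};                         % first column
        data(:,l) = ww{2};                         % second column
    end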
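The loop above can be tried end to end with a few generated files. The following is a minimal sketch, assuming three small two-column files written to a temporary folder; the folder, the file count, the row count, and the file contents are illustrative assumptions, not part of the original example. It uses a plain for-loop with preallocation and checks that each file opened.

```matlab
% Sketch: write three small two-column text files, then read them back.
% Folder, file count, and contents are assumptions made for illustration.
folder = tempname;
mkdir(folder);
no_files = 3;
no_rows = 5;
for k = 1:no_files
    t = (0:no_rows-1)';
    fid = fopen(fullfile(folder, sprintf('data%d.txt', k)), 'w');
    fprintf(fid, '%f %f\n', [t, k*t]');      % one "time value" pair per line
    fclose(fid);
end

% Preallocate, then read every file with the same format string.
time = zeros(no_rows, no_files);
data = zeros(no_rows, no_files);
for k = 1:no_files
    fname = fullfile(folder, sprintf('data%d.txt', k));
    fid = fopen(fname, 'r');
    if fid == -1
        error('Could not open %s', fname);
    end
    ww = textscan(fid, '%f %f');
    fclose(fid);
    time(:,k) = ww{1};
    data(:,k) = ww{2};
end
rmdir(folder, 's');                          % remove the temporary files
```

Replacing the second for with parfor should give the same result, since time and data are indexed only by the loop variable.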

Classification

Taking an example of dynamic responses, classify the responses automatically into an appropriate category.
Create a categorical array to allow logical indexing on a categorical basis.

[Figure: example responses labeled unstable, stable, and neutral]
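One way to picture this step is shown below. This is a sketch only: the simulated responses, the 2-second comparison windows, and the 5% tolerance are assumptions chosen for illustration, and the slides do not state which classification rule was used. Each response is labeled by comparing its peak amplitude late in the record with its peak amplitude early in the record.

```matlab
% Sketch: classify simulated responses as stable, neutral, or unstable.
% Signals, window lengths, and tolerance are illustrative assumptions.
t = linspace(0, 10, 1001)';                  % 10 s sampled every 0.01 s
sigma = [-0.3, 0, 0.3, -0.1];                % assumed growth rates
osc = repmat(sin(2*pi*t), 1, numel(sigma));  % 1 Hz oscillation per column
resp = exp(t*sigma) .* osc;                  % each column is one response

tol = 0.05;                                  % band treated as "no change"
names = cell(1, numel(sigma));
for k = 1:numel(sigma)
    early = max(abs(resp(t <= 2, k)));       % peak amplitude, first 2 s
    late  = max(abs(resp(t >= 8, k)));       % peak amplitude, last 2 s
    ratio = late / early;
    if ratio > 1 + tol
        names{k} = 'unstable';
    elseif ratio < 1 - tol
        names{k} = 'stable';
    else
        names{k} = 'neutral';
    end
end

% Categorical array with a fixed category set, then logical indexing.
labels = categorical(names, {'stable', 'neutral', 'unstable'});
unstable_resp = resp(:, labels == 'unstable');
num_stable = nnz(labels == 'stable');
```

The categorical array keeps the labels compact and lets expressions such as labels == 'unstable' select whole groups of responses at once.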

Working With Data Too Large To Fit Into System Memory

Use memory mapping to point to the data, and format the data such that sections of interest are easily indexed. In the Format specification, the leading dimensions give the size of one section (in bytes for uint8 text data) and the last dimension gives the number of sections. The third index into Data.x selects the section number.

Text File

    h1 = memmapfile('large_file.txt','Format',{'uint8',[14015 1000 10e3],'x'});
    sec = char(reshape(h1.Data.x(:,:,1),1,[]));   % section 1 as one character vector
    qq = textscan(sec,'%f');
    ww = reshape(qq{:},1001,1000);

Binary File

    h2 = memmapfile('travelTime.dat','Format',{'double',[1911 1201 1000],'x'});
    ww = h2.Data.x(:,:,1);   % section 1
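The section indexing can be checked on a file small enough to inspect by hand. The sketch below assumes a temporary binary file holding two sections of 3-by-4 doubles; the file name and dimensions are made up for illustration. Only the requested section is read through the map.

```matlab
% Sketch: memory-map a small binary file and pull out one section.
% File name and dimensions are assumptions made for illustration.
fname = [tempname '.dat'];
A = reshape(1:24, [3 4 2]);              % two sections of 3-by-4 values
fid = fopen(fname, 'w');
fwrite(fid, A, 'double');                % written in column-major order
fclose(fid);

% Map the file as [rows cols sections] and index the second section.
m = memmapfile(fname, 'Format', {'double', [3 4 2], 'x'});
section2 = m.Data.x(:,:,2);

clear m                                  % release the map before deleting
delete(fname);
```

Because the file is written in column-major order, the mapped array has the same layout as A, so section2 matches A(:,:,2).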

High Level Format of a Text File

Header Section 1
Header Section 2
...
Header Section n
Data Section 1
Data Section 2
...
Data Section n

Sections are separated by row delimiters, which may change for each section.

Working with Variable Column Lengths (1)

Where a file format supports variable column lengths for a data section, one approach to reading the data is as follows.

Read in the column headers using fgetl and textscan:

    >> line = fgetl(fid)

    line =
    ~A DEPTH ILM ILD

Working with Variable Column Lengths (2)

    >> col_heads = textscan(line,'%s');
    >> col_heads{:}

    ans =
        '~A'
        'DEPTH'
        'ILM'
        'ILD'

    >> col_heads = col_heads{1}(2:end); % strip off the '~A'

Working with Variable Column Lengths (3)

Use repmat to create a format string to read in the data using textscan:

    >> num_cols = numel(col_heads); % number of columns
    >> format = repmat('%f',1,num_cols)

    format =
    %f%f%f

    >> data = textscan(fid,format); % retrieve all the measured data

Working with Variable Column Lengths (4)

Use the column headers to create data structure field names using dynamic field names, and use a loop to place the data columns under the correct field:

    for l = 1:num_cols
        data1.(col_heads{l}) = data{l};
    end

    data1.(col_heads{1})  ->  data1.DEPTH
    data1.(col_heads{2})  ->  data1.ILM
    data1.(col_heads{3})  ->  data1.ILD
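The four steps can be run together on a small test file. The sketch below assumes a temporary file containing only a '~A' header line and three rows of made-up numbers; a real LAS file has further sections before the data, which this sketch does not handle. The format string is named fmt here so that it does not shadow the built-in format function.

```matlab
% Sketch: read a data section whose column count is set by its header.
% File contents are made-up values used only for illustration.
fname = [tempname '.las'];
fid = fopen(fname, 'w');
fprintf(fid, '~A DEPTH ILM ILD\n');
rows = [100.0 1.5 2.5; 100.5 1.6 2.6; 101.0 1.7 2.7];
fprintf(fid, '%.1f %.2f %.2f\n', rows');
fclose(fid);

fid = fopen(fname, 'r');
header = fgetl(fid);                       % '~A DEPTH ILM ILD'
col_heads = textscan(header, '%s');
col_heads = col_heads{1}(2:end);           % drop the '~A' marker
num_cols = numel(col_heads);
fmt = repmat('%f', 1, num_cols);           % one %f per column
data = textscan(fid, fmt);
fclose(fid);
delete(fname);

% Place each column under a field named after its header.
data1 = struct();
for k = 1:num_cols
    data1.(col_heads{k}) = data{k};
end
```

Dynamic field names only work when each header is a valid MATLAB identifier. Headers containing characters such as '.' or '-' would need to be cleaned first.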

Condition a Line of Text that Contains Different Delimiters and Different Substring Identifiers (1)

Files may contain combinations of delimiters that serve the same purpose, such as whitespace, tab, or comma to separate column entries. There may also be substrings that are enclosed by unique substring identifiers, such as single or double quotation marks.

Raw lines:

    9 1, 10.000 "NAME9" 0.000, 0.000 1 ' BUS09 '
    10 , 1 80.000 , 'NAME10' 0.000,, 1 ' BUS10 '

Conditioned lines:

    9 1 10.000 NAME9 0.000 0.000 1 BUS09
    10 1 80.000 NAME10 0.000 0.000 1 BUS10

Condition a Line of Text that Contains Different Delimiters and Different Substring Identifiers (2)

Use regular expression replacement to identify and replace delimiters and add characters as appropriate.

    >> str1 = regexprep(str1,',\s*,',', 0.000 ,');

Before:

    9 1, 10.000 "NAME9" 0.000, 0.000 1 ' BUS09 '
    10 , 1 80.000 , 'NAME10' 0.000,, 1 ' BUS10 '

After:

    9 1, 10.000 "NAME9" 0.000, 0.000 1 ' BUS09 '
    10 , 1 80.000 , 'NAME10' 0.000, 0.000 , 1 ' BUS10 '

Condition a Line of Text that Contains Different Delimiters and Different Substring Identifiers (3)

Use regular expressions to identify substrings and sprintf to replace the substring with a conditioned version.

    >> [start_idx,end_idx] = regexp(str2,'"\s*\w*\s*"');

Before:

    9 1, 10.000 "NAME9" 0.000, 0.000 1 ' BUS09 '
    10 , 1 80.000 , 'NAME10' 0.000, 0.000 , 1 ' BUS10 '

After:

    9 1, 10.000 NAME9 0.000, 0.000 1 ' BUS09 '
    10 , 1 80.000 , 'NAME10' 0.000, 0.000 , 1 ' BUS10 '
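The conditioning steps can be chained on a single line of text. The sketch below is one possible arrangement, not the presenter's implementation: the input string is an assumed example, and the quoted substrings are unwrapped with a regexprep token in place of the regexp and sprintf approach described above.

```matlab
% Sketch: condition one line with mixed delimiters and quoted substrings.
% The input line is an assumed example.
str = '10 , 1 80.000 , ''NAME10'' 0.000,, 1 '' BUS10 ''';

% 1. Fill an empty field (two commas with only whitespace between them).
str = regexprep(str, ',\s*,', ', 0.000 ,');

% 2. Unwrap quoted substrings, dropping the quotes and inner padding.
str = regexprep(str, '[''"]\s*(\w+)\s*[''"]', '$1');

% 3. Turn the remaining commas into whitespace delimiters.
str = strrep(str, ',', ' ');

% 4. Split on whitespace into a row of tokens.
tokens = textscan(str, '%s');
tokens = tokens{1}';
```

Step 2 assumes the quoted names contain only word characters. A name with embedded spaces would need a broader pattern.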

Synchronize Data to a Common Axis

Merge tables together. Popular joins:

    Inner
    Full Outer
    Left Outer
    Right Outer

[Figure: diagrams of Inner Join, Full Outer Join, and Left Outer Join]

Full Outer Join

First Data Set

    Key A    B
    1        1.1
    4        1.4
    7        1.7
    9        1.9

Second Data Set

    Key X    Y      Z
    1        0.1    0.2
    3        0.3    0.4
    5        0.5    0.6
    7        0.7    0.8

Joined Data Set

    Key    B      Y      Z
    1      1.1    0.1    0.2
    3      NaN    0.3    0.4
    4      1.4    NaN    NaN
    5      NaN    0.5    0.6
    7      1.7    0.7    0.8
    9      1.9    NaN    NaN
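The joined data set above can be reproduced with the table data type and outerjoin. This is a minimal sketch, assuming both tables name their key variable Key, which is a simplification of the "Key A" and "Key X" labels on the slide, so that the keys merge into a single column.

```matlab
% Sketch: full outer join of two tables on a shared key variable.
% Naming both key variables "Key" is an assumption made for simplicity.
Key = [1; 4; 7; 9];
B = [1.1; 1.4; 1.7; 1.9];
T1 = table(Key, B);                     % first data set

Key = [1; 3; 5; 7];
Y = [0.1; 0.3; 0.5; 0.7];
Z = [0.2; 0.4; 0.6; 0.8];
T2 = table(Key, Y, Z);                  % second data set

% Rows with no match on either side are kept and filled with NaN.
J = outerjoin(T1, T2, 'MergeKeys', true);
```

An inner join on the same tables, innerjoin(T1, T2), would keep only the keys present in both data sets, here 1 and 7.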

Techniques to Handle Missing Data

List-wise deletion
    Unbiased estimates
    Reduces sample size

Implementation options
    Built in to many MATLAB functions
    Manual filtering
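Manual filtering for list-wise deletion can be as short as one logical index. The sketch below assumes a small numeric matrix in which NaN marks a missing value; the data are made up for illustration.

```matlab
% Sketch: list-wise deletion by manual filtering.
% The matrix is made-up data; NaN marks a missing observation.
X = [1.0 2.0;
     NaN 3.0;
     4.0 5.0;
     6.0 NaN];

complete = ~any(isnan(X), 2);    % true for rows with no missing values
X_clean = X(complete, :);        % drop every row that has a NaN
col_means = mean(X_clean, 1);    % statistics on the reduced sample
```

Two of the four rows are dropped here, which shows the reduction in sample size noted above.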

Techniques to Handle Missing Data

Substitution: replace missing data points with a reasonable approximation
    Easy to model
    Too important to exclude
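One common substitution is linear interpolation between neighboring valid samples. The sketch below assumes a short, evenly sampled signal with two missing points. The data are made up, and linear interpolation is only one of several reasonable approximations.

```matlab
% Sketch: substitute missing samples by linear interpolation.
% The signal is made-up data; NaN marks a missing sample.
t = (0:5)';
y = [0; 1; NaN; 3; NaN; 5];

ok = ~isnan(y);                                   % valid samples
y_filled = y;
y_filled(~ok) = interp1(t(ok), y(ok), t(~ok), 'linear');
```

Missing points at the very start or end of the record lie outside the range of valid samples and would stay NaN with this call unless an extrapolation option is given.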

Summary

Challenges and Solutions

Access data from multiple sources including SQL databases, data historians, instrumentation, and files
    Examples: SQL database connection, URL file reading, and reading multiple files
Work with data too large to fit into system memory
    Binary and text files
Write and test functions that read in industry-specific text file formats
    Examples: Log ASCII Standard (.LAS) files and power system formats (IEEE CDF and others)
Organize multiple data sets into single data containers using Data Tables
Visualize and interact with data
Automate the detection and classification of events
    Threshold detection and stability classification

Find Out More

Get answers to your questions
    E-mail the presenters at webinars@mathworks.com
    Include the webinar title and date in your e-mail
View recorded webinars
    www.mathworks.com/recordedwebinars
Visit MATLAB Central
    www.mathworks.com/matlabcentral
Contact a MathWorks sales representative
    In North America, call 508-647-7000
    In other locations, visit www.mathworks.com/webcontact for contact information