
Past Experience:

I worked on a project called TRACDS (Temporal Relationships Among Clusters for Massive Data Streams). In general, data scientists develop data stream clustering algorithms that do not consider the temporal order of events, which results in a loss of temporal information after clustering. To counter this problem, [1] introduced a data stream clustering process using an evolving Markov chain, where each state represents a cluster.
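To make the idea concrete, here is a minimal sketch (not the actual TRACDS implementation) of an evolving Markov chain that treats cluster assignments as states and counts transitions between consecutive points, so the temporal order survives clustering; assign_cluster is a hypothetical placeholder for whatever online clustering step produces a cluster id.

```python
# Minimal sketch (not the TRACDS implementation): an evolving Markov chain
# where each cluster acts as a state and transition counts between consecutive
# points preserve the temporal order. `assign_cluster` is a hypothetical
# placeholder for any online clustering step that returns a cluster id.
from collections import defaultdict

class EvolvingMarkovChain:
    def __init__(self):
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.current_state = None

    def update(self, point, assign_cluster):
        state = assign_cluster(point)      # cluster id doubles as the Markov state
        if self.current_state is not None:
            self.transitions[self.current_state][state] += 1
        self.current_state = state

    def transition_prob(self, a, b):
        total = sum(self.transitions[a].values())
        return self.transitions[a][b] / total if total else 0.0
```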

For temporal data the team chose hurricane data. Many accurate models have been developed for predicting track movement, but intensity prediction is still inaccurate. So we proposed hurricane intensity prediction based on weighted feature learning and an extensible Markov model. We used Java to build the algorithms and R for testing and visualization. I was the first person on the team to analyze and understand the data. It was in an unstructured format, and it took a couple of days to understand the data and its dimensionality. The interesting point is that not all parameters are useful for predicting intensity. So we applied a weighted feature learning method, where we initially used a traditional genetic algorithm to assign the necessary weights to the features, and later extended it to different weighted feature learning methods. I also implemented the existing algorithm called SHIPS, which was developed in Python. Our method outperformed the existing method; to evaluate, we applied absolute error and k-fold cross-validation. The developed model has been used by NOAA along with other intensity prediction models. This is the major achievement of our team in forecasting hurricane intensity.
1. Yu Su, Sudheer Chelluboina, Michael Hahsler, and Margaret H. Dunham, "A New Data Mining Model for Hurricane Intensity Prediction," 2nd IEEE ICDM Workshop on Knowledge Discovery from Climate Data, Proceedings of the 2010 IEEE.
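To make the evaluation step concrete, here is a minimal sketch, assuming a feature matrix X, intensity targets y, and a GA-produced weight vector (all placeholder names; the actual project code was written in Java and R). It only illustrates weighting the features and scoring with mean absolute error over k folds.

```python
# Minimal sketch: k-fold cross-validation with mean absolute error for an
# intensity regressor trained on weighted features. X, y, and weights are
# illustrative placeholders, not the original Java/R implementation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def evaluate_weighted_features(X, y, weights, n_splits=5):
    """Scale each feature by its learned weight, then report mean MAE over k folds."""
    Xw = X * weights  # element-wise feature weighting (assumed output of the GA)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    errors = []
    for train_idx, test_idx in kf.split(Xw):
        model = LinearRegression().fit(Xw[train_idx], y[train_idx])
        preds = model.predict(Xw[test_idx])
        errors.append(mean_absolute_error(y[test_idx], preds))
    return np.mean(errors)
```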

My research was on trajectory analysis, which involves segmenting trajectories, pattern mining, and anomaly detection. I will discuss some of my work in a broad way, but I can't go into in-depth details as I still need to publish this work for my PhD. Analysis of trajectory data creates opportunities for finding behavioral patterns that can be used in various applications. For any moving object analysis or time series analysis, segmentation is the preliminary step to extract the needed information from the data. My approach to segmenting trajectories uses the formation of sequential oblique envelopes [2]. The data involved comes from the global positioning system (GPS), where a large amount of spatio-temporal data can be easily obtained. I used three different datasets to test my model: hurricane data, taxi data, and simulated data. The reasons for choosing these datasets: the hurricane dataset has very few data points, as each data point is recorded every 6 hours, so each track has a minimal number of points; after segmenting the tracks and grouping them with a distance-based clustering method, not many group patterns are identified for analysis. The taxi data requires preprocessing because of damaged GPS devices or improper GPS signals where the location is not clearly captured. It has millions of data points, where parallelly moving objects can be classified using a neighborhood distance measure; since it has so many data points, it can also be used to evaluate the performance of the proposed algorithm. Finally, the simulated dataset is used to check segment quality using the mean squared error. Moreover, this algorithm was compared with the existing MDL approach, and the proposed algorithm outperformed it. Applying some geometrical approaches can refine this methodology, that is, the overall performance can be brought to O(n) instead of O(n log n). This can impact moving object management systems: once the segments are formed and an incoming trajectory is overlaid on them, we can analyze the moving object's behavior, which is essentially driver behavior.
2. Sudheer Chelluboina and Michael Hahsler, "Trajectory Segmentation Using Oblique Envelopes," submitted to the 4th IEEE Workshop IRI-DIM, Proceedings of the 2015 IEEE, San Francisco, CA, USA.
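As a rough illustration of the mean squared error check on segment quality mentioned above, here is a minimal sketch under the assumption that a candidate segment is approximated by the straight line between its endpoints. This is a generic quality check, not the oblique envelope method itself, and the function name is a placeholder.

```python
# Rough sketch: score segment quality by the mean squared perpendicular error
# of points to the straight line joining the segment's endpoints. This is a
# generic check, not the oblique envelope method from [2].
import numpy as np

def segment_mse(points):
    """points: array of shape (n, 2), e.g. projected (x, y) GPS coordinates."""
    points = np.asarray(points, dtype=float)
    start, end = points[0], points[-1]
    dx, dy = end - start
    length = np.hypot(dx, dy)
    if length == 0:
        return float(np.mean(np.sum((points - start) ** 2, axis=1)))
    # Perpendicular distance of each point to the line through start and end.
    dist = np.abs(dx * (points[:, 1] - start[1]) - dy * (points[:, 0] - start[0])) / length
    return float(np.mean(dist ** 2))
```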

Methodology:
What are methods by which you could reduce the computational cost of fitting a classification model? Feel free to share real-world experience.
Dimensionality reduction:
In general, more features bring more information and possibly higher accuracy, but for classification models more features also make the classifier harder to train; this is known as the curse of dimensionality. Practically, the number of samples needed per feature increases exponentially with the number of features.
In my TRACDS project we did not build a classification model, but dimensionality reduction showed better results compared to using all the features. The most popular dimensionality reduction methods are LCA and PCA.
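As a minimal illustration of the PCA route, assuming a numeric feature matrix X and labels y (placeholder names), a classifier fitted on the reduced representation is much cheaper to train than one fitted on the full feature set.

```python
# Minimal sketch: reduce a numeric feature matrix to a few principal components
# before fitting a classifier, cutting training cost when dimensionality is high.
# X and y are illustrative placeholders.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(
    StandardScaler(),                # PCA is sensitive to feature scale
    PCA(n_components=0.95),          # keep components explaining 95% of the variance
    LogisticRegression(max_iter=1000),
)
# clf.fit(X, y)                      # fit on the reduced representation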
Feature Selection:
This is used to find subsets of the original features using filter and wrapper methods. For filtering we can use information gain, and for the wrapper a genetic algorithm. We can even use weighted features, as I mentioned previously for one of my published papers. This will reduce the computational cost too.
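A minimal sketch of the filter side, using mutual information as the information-gain style score to keep only the top-k features; X, y, and k are placeholders.

```python
# Minimal sketch: filter-style feature selection that keeps the k features with
# the highest mutual information (an information-gain style score) with the label.
# X, y, and k are illustrative placeholders.
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=10)
# X_reduced = selector.fit_transform(X, y)   # fewer columns -> cheaper classifier fit
```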
Feature Extraction:
This produces a reduced set of features, created by mapping the multidimensional space into a space with fewer features. It can be formally defined as transforming the original feature space into a reduced feature space.
In my Verizon project, where I had access to healthcare data, we applied filters for feature extraction. If I remember correctly, we mostly used demographic information and medical records to predict the likelihood of diseases; that is, the feature extraction was used for risk prediction for insurance companies.

Suppose that you have access to the following data for all trips on Uber (date and time, trip origin (lat, long), destination (lat, long), travel distance, Uber car type, driver ID). Describe a sampling methodology that would produce independent and identically distributed (IID) samples for the travel distance. How would you demonstrate that the data is IID?
From the definition, the travel distance variable is independent and identically distributed if the samples have the same probability distribution and each reading is independent of the others.
For the sample data to be independent and identically distributed:

The center of the data should not show an increasing, decreasing, or systematic trend.
The variance of the data should not show an increasing, decreasing, or systematic trend.
The observations should be independent of each other.

Now if we look at the data observations, the features are independent of each other; that is, date and time, trip origin, destination, Uber car type, and driver ID will differ from trip to trip, and the travel distance does not directly depend on any of these features. But we know that the travel distance can be calculated from the origin and destination parameters.
From my prior knowledge, the central limit theorem roughly tells us that the result is Gaussian. As we know, a Gaussian variable is specified by its mean and variance, and knowing these two quantities characterizes the probability distribution. So, to produce the IID sample, I compute its mean and variance.
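A minimal sketch of that step, assuming the trips are held in a pandas DataFrame called trips with a travel_distance column (both names are illustrative): draw a simple random sample of travel distances and summarize it by mean and variance.

```python
# Minimal sketch: simple random sample of travel distances, summarized by mean
# and variance. `trips` and `travel_distance` are placeholder names.
import pandas as pd

def sample_travel_distance(trips: pd.DataFrame, n: int = 1000, seed: int = 42):
    sample = trips["travel_distance"].sample(n=n, random_state=seed)
    return sample.mean(), sample.var()
```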
These analysis methodologies require showing that the sample is IID. In general there is no single standard testing procedure to confirm that data meets the IID assumption, so it is mainly based on assumptions. It is extremely important to verify that the data satisfy IID before analyzing it or trying other methods. To check that the data sample meets IID we can make use of plots: first construct a plot of the data in its original order and look for trends in the variance and median; if the data do not look nonstationary, then produce some random-order plots, and if these look rougher than the original-order plot we conclude that the data is nonstationary or dependent.
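A minimal sketch of such a plot-based check, assuming the travel distances are a pandas Series called distances in chronological order (a placeholder name): plot the data in arrival order next to a random reshuffle, with a rolling mean overlay to expose trends.

```python
# Minimal sketch: compare the data in chronological order against a random
# reshuffle, with a rolling mean overlay to spot trends. A rolling .std()
# overlay would expose variance trends the same way. `distances` is a
# placeholder pandas Series of travel distances in chronological order.
import matplotlib.pyplot as plt

def iid_check_plots(distances, window=200):
    fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharey=True)
    axes[0].plot(distances.values, lw=0.5)
    axes[0].plot(distances.rolling(window).mean().values, lw=2, label="rolling mean")
    axes[0].set_title("Original (chronological) order")
    axes[0].legend()

    shuffled = distances.sample(frac=1.0, random_state=0).reset_index(drop=True)
    axes[1].plot(shuffled.values, lw=0.5)
    axes[1].plot(shuffled.rolling(window).mean().values, lw=2, label="rolling mean")
    axes[1].set_title("Random order")
    axes[1].legend()
    plt.tight_layout()
    plt.show()
```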
In one of my projects I designed the confidence interval for the hurricane intensity prediction model. Before calculating the confidence interval it is necessary to test whether the data meets IID, because if the data is not identically distributed then the samples do not share the same probability distribution function. That is, when confidence intervals are computed from the average of the data observations, data that is not IID leads to inaccurate outcomes.

Suppose your team will design a display to aid drivers in finding optimal locations when waiting for passengers. How would you approach this? What analytic techniques would you use and how would this impact the visual interface?
The main reason I chose this question is that my next paper aims to answer it.
It is necessary to integrate several methods to find the optimal location when waiting for passengers. Here I would use trajectory partitioning, clustering, sequential analysis, and visualization.

First and foremost, the given data should be preprocessed to avoid pitfalls and remove unknown data.

From the data, each trip should be partitioned based on existing methodologies or the method I proposed for segmenting trajectories. This is required to know the road network. This step is not strictly required, but it reveals the tracks where trips are made, where two or more cars move in parallel, or where trips are more frequent.

Now overlay the data points with the origin location of each trip. Here the data to be considered is the time and origin. For the time attribute I create hourly labels; for example, the time between 11:00 AM and 12:00 PM is labeled 11, so this attribute has 24 labels for 24 hours. These labels can be made finer or coarser based on the requirement, and the resulting probabilities depend on how many labels are used.

Now the clustering process, considering each observation as <time_label, lat, long>. We apply distance-based hierarchical clustering, so the algorithm groups the origins by distance and, on top of that, by time_label. With this we can estimate the probability of finding passengers at a particular location at a particular time (a minimal sketch is given after this list). We can set the probability thresholds based on our assumptions.


Once we have these clusters we can overlay the probability values on the driver's screen based on their current location, giving the driver information about the likelihood of picking up a passenger there.

Having mobile app building experience, I know the values calculated by the system can be presented in an understandable way. For example, the app can give the driver options to check the probability of passengers taking a trip at a particular location or at the current location. This is for users who are not used to visual representations.

These values can also be represented visually as a heat map on the selected location: the higher the probability, the darker the grid color; the lower the probability, the lighter the grid color. Once you tap on a location, it zooms in further to show the exact variation in probabilities. The implementation is also simple, as existing map tools can do the job.
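As referenced in the clustering step above, here is a minimal sketch of how the <time_label, lat, long> observations could be grouped and turned into rough pickup probabilities. The column names, distance threshold, and linkage choice are illustrative assumptions, not values from the paper in progress.

```python
# Minimal sketch: label trip origins by hour, group them with distance-based
# hierarchical clustering, and turn cluster sizes into rough pickup
# probabilities per hour. Column names and the distance threshold are
# illustrative assumptions.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

def pickup_probabilities(trips: pd.DataFrame, dist_threshold: float = 0.01):
    """trips needs columns: pickup_time (datetime), origin_lat, origin_lon."""
    trips = trips.copy()
    trips["time_label"] = trips["pickup_time"].dt.hour       # 24 hourly labels

    results = []
    for hour, group in trips.groupby("time_label"):
        coords = group[["origin_lat", "origin_lon"]].to_numpy()
        if len(coords) < 2:
            continue
        # Distance-based hierarchical clustering of origins within this hour.
        labels = fcluster(linkage(coords, method="average"),
                          t=dist_threshold, criterion="distance")
        counts = pd.Series(labels).value_counts()
        for cluster_id, count in counts.items():
            centroid = coords[labels == cluster_id].mean(axis=0)
            results.append({
                "time_label": hour,
                "lat": centroid[0],
                "lon": centroid[1],
                # Share of this hour's pickups falling in the cluster, used as a
                # rough probability for the heat map overlay.
                "probability": count / len(coords),
            })
    return pd.DataFrame(results)
```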
