
Statistical and Analytical Software Development

Sachu Thomas Isaac

PROBLEM STATEMENT
With the enhancement of IT infrastructure and better information collection over the past several decades, organizations have at their disposal a very deep store of information. Most organizations today are looking for ways to mine this information in order to:

- Improve their products and services
- Enhance customer relations
- Enhance their top and bottom lines

There are established software packages that cater to the various statistical requirements of the industry, such as IBM SPSS, SAS, R, and S-Plus. However, growing challenges such as:

- the volume of data,
- the speed of analysis required, and
- the complexity of analysis

create a need for alternative solutions that improve on these packages.

PURPOSE
The goal is to build a fully functioning end-user application that is more efficient and faster than the available software, especially for Big Data analysis. The application will also include various data mining tools, providing business analysts with a whole new interface.

SOLUTION FRAMEWORK

Raw Data → Transformed Data → Analysis → Output

THE UNDERLYING CONCEPT


We use the concept of distributed computing and the MapReduce programming paradigm by incorporating Apache's Hadoop framework into our application. We will also incorporate HBase, which provides BigTable-like capabilities on top of the Hadoop core. The software aims for simplicity of use by a business user, combined with sophistication in computation and capabilities.
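As a hedged illustration of how HBase sits on top of Hadoop, an HBase deployment is typically pointed at HDFS through its `hbase-site.xml` configuration (the host name `namenode` and port below are placeholders, not values from this project):

```xml
<configuration>
  <!-- Store HBase data in HDFS rather than the local filesystem -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:9000/hbase</value>
  </property>
  <!-- Run HBase in fully distributed mode across the cluster -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
```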

THE 4 VS OF BIG DATA

Volume, Velocity, Variety, and Veracity.

HOW IS IT DIFFERENT FROM THE AVAILABLE SOFTWARE

- It is based on open-source software such as Hadoop and HBase, and it is platform independent since it is written in Java.
- It needs fewer programming constructs, by introducing more user-friendly functions.
- Distributed computing makes it much faster, as it uses the computing power of every machine in a cluster.
- It will be able to handle petabytes of data.

GENERAL CONCEPTS TACKLED IN THIS APPLICATION

Data Manipulation
- Loading of data onto HDFS (Hadoop Distributed File System)
- Extracting the metadata information
- Merging the various datasets
- Sorting the various datasets

Data Processing
- Implementation of common statistical functions: frequency, mean, median, mode
- Univariate/bivariate analysis
- Correlations
- Graph generation
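To make the statistical functions concrete, here is a minimal self-contained sketch of mean, median, and mode in plain Java (the class and method names are illustrative, not the application's actual API):

```java
import java.util.*;
import java.util.stream.*;

public class BasicStats {
    // Arithmetic mean of the values
    static double mean(List<Double> xs) {
        return xs.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }

    // Middle value of the sorted list (average of the two middle values if even-sized)
    static double median(List<Double> xs) {
        List<Double> s = xs.stream().sorted().collect(Collectors.toList());
        int n = s.size();
        return n % 2 == 1 ? s.get(n / 2) : (s.get(n / 2 - 1) + s.get(n / 2)) / 2.0;
    }

    // Most frequent value (ties resolved by the max of the frequency map)
    static double mode(List<Double> xs) {
        Map<Double, Long> freq = xs.stream()
            .collect(Collectors.groupingBy(x -> x, LinkedHashMap::new, Collectors.counting()));
        return Collections.max(freq.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Double> data = List.of(2.0, 4.0, 4.0, 5.0, 10.0);
        System.out.println(mean(data));   // 5.0
        System.out.println(median(data)); // 4.0
        System.out.println(mode(data));   // 4.0
    }
}
```

In the real application these computations would run as distributed jobs over HDFS rather than on an in-memory list, but the per-function logic is the same.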

MAPREDUCE PARADIGM
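The MapReduce paradigm splits a computation into a map phase, which emits key/value pairs from each input record, and a reduce phase, which aggregates all values sharing a key. As a minimal sketch of the idea in plain Java (this deliberately does not use the Hadoop API; all names are our own), the classic word-count example looks like:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for every word in a line
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "Reduce" phase: sum the emitted counts for each distinct word
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data big analysis", "data mining");
        Map<String, Integer> counts = reduce(input.stream().flatMap(WordCountSketch::map));
        System.out.println(counts.get("big"));  // 2
        System.out.println(counts.get("data")); // 2
    }
}
```

In Hadoop, the map and reduce functions run in parallel across the cluster, with the framework handling data partitioning, shuffling of pairs by key, and fault tolerance.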

CONCLUSION
The main aim of this endeavor is to develop the application and conduct a comparative study against the existing software packages in terms of speed of computation, efficiency, and reliability.

The application is designed as a generic model, that is, it should be easily extensible to various business domains to meet their requirements.
