
SEMINAR REPORT

ON
PYTHON LIBRARIES FOR DATA SCIENCE

Submitted To:-
Computer Science Deptt.

Submitted By:-
Drishti Gupta
8716113
CSE (6th Sem.)
PYTHON

1. Introduction

Python is an interpreted, high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. In July 2018, Van Rossum stepped down as the leader of the language community after about 30 years.

Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.

Python interpreters are available for many operating systems. CPython, the reference
implementation of Python, is open source software and has a community-based
development model, as do nearly all of Python's other implementations. Python and
CPython are managed by the non-profit Python Software Foundation.

Python has a simple, easy-to-learn syntax that emphasizes readability, which reduces the cost of program maintenance.

Also, Python supports modules and packages, which encourages program modularity and
code reuse.
1.1 Advantages of Using Python

The diverse application of the Python language is a result of the combination of features
which give this language an edge over others. Some of the benefits of programming in
Python include:

1. Presence of Third Party Modules:

The Python Package Index (PyPI) contains numerous third-party modules that enable Python to interact with most other languages and platforms.

2. Extensive Support Libraries:

Python provides a large standard library covering areas such as internet protocols, string operations, web services tools and operating system interfaces. Many frequently used programming tasks are already implemented in the standard library, which significantly reduces the amount of code that has to be written.

3. Open Source and Community Development:

The Python language is developed under an OSI-approved open source license, which makes it free to use and distribute, including for commercial purposes.

Further, its development is driven by a community that collaborates on its code through conferences and mailing lists, and maintains its numerous modules.

4. Learning Ease and Support Available:

Python offers excellent readability and an uncluttered, simple-to-learn syntax that helps beginners pick up the language. The code style guidelines in PEP 8 provide a set of rules to facilitate consistent formatting of code. Additionally, the wide user base and active developer community have produced a rich bank of internet resources that encourages continued development and adoption of the language.
5. User-friendly Data Structures:

Python has built-in list and dictionary data structures which can be used to construct fast runtime data structures (see the sketch after this list). Further, Python also provides dynamic high-level data typing, which reduces the amount of support code that is needed.

6. Productivity and Speed:

Python has a clean object-oriented design, provides enhanced process control capabilities, and possesses strong integration and text processing capabilities along with its own unit testing framework, all of which contribute to its speed and productivity. Python is considered a viable option for building complex multi-protocol network applications.
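
As a minimal illustration of point 5 (with made-up values), lists and dictionaries need no declarations and can grow and nest freely at runtime:

    # A list: ordered, dynamically sized, may hold mixed types.
    marks = [72, 85, 90]
    marks.append(95)            # grows at runtime, no declaration needed

    # A dictionary: a fast key-to-value mapping; values can be any object.
    record = {"name": "A. Student", "marks": marks}   # illustrative values
    record["average"] = sum(marks) / len(marks)

    print(record["average"])    # 85.5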
2. Data Science

“Data science” is about as broad a term as they come. It may be easiest to describe it by listing its more concrete components:

1) Data exploration & analysis:-

 Included here: Pandas; NumPy; SciPy; a helping hand from Python’s Standard
Library.

2) Data visualization:- A pretty self-explanatory name. Taking data and turning it into
something colorful.

 Included here: Matplotlib; Seaborn; Datashader; others.

3) Classical machine learning:- Conceptually, we could define this as any supervised or unsupervised learning task that is not deep learning (see below). Scikit-learn is far and away the go-to tool for implementing classification, regression, clustering, and dimensionality reduction, while StatsModels is less actively developed but still has a number of useful features.

 Included here: Scikit-Learn, StatsModels.

4) Deep learning:- This is a subset of machine learning that is seeing a renaissance, and
is commonly implemented with Keras, among other libraries. It has seen monumental
improvements over the last ~5 years, such as AlexNet in 2012, which was the first
design to incorporate consecutive convolutional layers.

 Included here: Keras, TensorFlow, and a whole host of others.

5) Data storage and big data frameworks:- Big data is best defined as data that is
either literally too large to reside on a single machine, or can’t be processed in the
absence of a distributed environment. The Python bindings to Apache technologies
play heavily here.

 Included here: Apache Spark; Apache Hadoop; HDFS; Dask; h5py/pytables.


3. Most Common Libraries Used in Data Science
3.1 NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among
other things:

 a powerful N-dimensional array object


 sophisticated (broadcasting) functions
 tools for integrating C/C++ and Fortran code
 useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined. This allows NumPy to
seamlessly and speedily integrate with a wide variety of databases.

NumPy is licensed under the BSD license, enabling reuse with few restrictions. The core functionality of NumPy is its ndarray (n-dimensional array) data structure. These arrays are strided views on memory. In contrast to Python's built-in list data structure (which, despite the name, is a dynamic array), these arrays are homogeneously typed: all elements of a single array must be of the same type. NumPy also has built-in support for memory-mapped arrays.

[Figure: a 3-dimensional NumPy array]


Here are some functions defined in the NumPy library:

1. zeros(shape[, dtype, order]) - Return a new array of the given shape and type, filled with zeros.
2. array(object[, dtype, copy, order, subok, ndmin]) - Create an array.
3. asarray(a[, dtype, order]) - Convert the input to an array.
4. asanyarray(a[, dtype, order]) - Convert the input to an ndarray, but pass ndarray subclasses through.
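
A minimal sketch of the functions listed above; the printed values are illustrative:

    import numpy as np

    # zeros: a new array of the given shape, filled with zeros
    z = np.zeros((2, 3), dtype=np.float64)

    # array: create an ndarray from a Python object (here, a nested list)
    a = np.array([[1, 2, 3], [4, 5, 6]])

    # asarray: convert input to an ndarray; no copy if it already is one
    b = np.asarray(a)
    print(b is a)               # True: the same array is passed through

    # homogeneous typing: every element of an array shares one dtype
    print(a.dtype, a.shape)     # e.g. int64 (2, 3); dtype is platform-dependent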

NumPy will help you manage multi-dimensional arrays very efficiently. Maybe you won't use it directly, but since the concept is a crucial part of data science, many other libraries (well, almost all of them) are built on NumPy. Simply put: without NumPy you won't be able to use Pandas, Matplotlib, SciPy or Scikit-Learn. That's why you need it first.
3.2 Pandas

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics, and analytics. In this section, we will look at the various features of Pandas and how to use them in practice.

The name Pandas is derived from "panel data", an econometrics term for multidimensional structured data sets.

In 2008, developer Wes McKinney started developing Pandas when in need of a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was mostly used for data munging and preparation; it contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze.


3.2.1 Key Features of Pandas

 Fast and efficient DataFrame object with default and customized indexing.
 Tools for loading data into in-memory data objects from different file formats.
 Data alignment and integrated handling of missing data.
 Reshaping and pivoting of data sets.
 Label-based slicing, indexing and subsetting of large data sets.
 Columns from a data structure can be deleted or inserted.
 Group by data for aggregation and transformations.
 High-performance merging and joining of data.
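
A brief sketch of several of these features in action, on a small made-up DataFrame:

    import numpy as np
    import pandas as pd

    # a DataFrame with customized, label-based indexing
    df = pd.DataFrame(
        {"sales": [250.0, 300.0, np.nan, 450.0],
         "region": ["N", "S", "N", "S"]},
        index=["Q1", "Q2", "Q3", "Q4"],
    )

    # integrated handling of missing data: fill the gap with the mean
    df["sales"] = df["sales"].fillna(df["sales"].mean())

    # label-based slicing and subsetting
    print(df.loc["Q1":"Q2", ["sales"]])

    # group-by for aggregation
    print(df.groupby("region")["sales"].sum())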
3.3 Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail gallery.

For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.

The best and most widely known Python data visualization library is Matplotlib. I wouldn't say it's easy to use... but if you keep the four or five most commonly used code blocks for basic line charts and scatter plots handy, you can create your charts pretty fast.
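
As one such reusable block, here is a minimal pyplot sketch (the data and file name are illustrative) that produces a line chart and a scatter plot in a few lines:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 10, 100)

    # a line chart and a scatter plot via the MATLAB-like pyplot interface
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.scatter(x[::10], np.sin(x[::10]), color="red", label="samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.savefig("sine.png")     # or plt.show() in an interactive session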
3.4 SciPy

SciPy is a scientific computing library for application developers and engineers. However, you still need to know the difference between the SciPy library and the SciPy stack: the SciPy library contains modules for optimization, linear algebra, integration, and statistics.

3.4.1 Features of SciPy

The main feature of the SciPy library is that it is built on NumPy and makes extensive use of NumPy arrays.

In addition, SciPy provides efficient numerical routines such as optimization, numerical integration, and many others through its specific submodules.

All the functions in all submodules of SciPy are well documented.

3.4.2 Where Is SciPy Used?

SciPy is a library that uses NumPy to solve mathematical problems. SciPy uses NumPy arrays as its basic data structure and comes with modules for various commonly used tasks in scientific programming.

Tasks such as linear algebra, integration (calculus), ordinary differential equation solving, and signal processing are handled easily by SciPy.
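
A minimal sketch of two such submodules, scipy.integrate and scipy.optimize, applied to toy problems with known answers:

    import numpy as np
    from scipy import integrate, optimize

    # numerical integration: integrate sin(x) from 0 to pi (exact answer: 2)
    area, abserr = integrate.quad(np.sin, 0, np.pi)
    print(area)                 # ~2.0

    # optimization: minimize (x - 3)^2 + 1, whose minimum is at x = 3
    result = optimize.minimize(lambda x: (x - 3) ** 2 + 1, x0=0.0)
    print(result.x)             # ~[3.0]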
3.5 Scikit-Learn

Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The original codebase was later rewritten by other developers. In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took leadership of the project and made the first public release on 1 February 2010. Of the various scikits, scikit-learn as well as scikit-image were described as "well-maintained and popular" in November 2012. As of 2018, scikit-learn is under active development.

Scikit-learn is largely written in Python, with some core algorithms written in Cython to achieve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR.[10]

3.5.1 Advantages of using Scikit-Learn:


 Scikit-learn provides a clean and consistent interface to tons of different models.
 It provides you with many options for each model, but also chooses sensible defaults.
 Its documentation is exceptional, and it helps you to understand the models as well as
how to use them properly.
 It is also actively being developed.
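
To illustrate the consistent interface, here is a minimal sketch using the Iris dataset bundled with scikit-learn; the choice of RandomForestClassifier is only an example, and any other estimator would expose the same fit()/predict() methods:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # every scikit-learn model exposes the same fit()/predict() interface
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier()        # sensible defaults out of the box
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))      # mean accuracy on held-out data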
3.6 Keras

Primary Intent: Developing and training deep learning models, deep learning research

Secondary Intent(s): Working with image and text data

Considered to be one of the coolest machine learning Python libraries, Keras offers an easier
mechanism for expressing neural networks. It also features great utilities for compiling
models, processing datasets, visualizing graphs, and much more.

Written in Python, Keras has the ability to run on top of CNTK, TensorFlow, and Theano.
The Python machine learning library is developed with a primary focus on allowing fast
experimentation. All Keras models are portable.

Compared to other Python machine learning libraries, Keras is slow, because it first builds a computational graph using the backend infrastructure and then uses that graph to perform operations. Even so, Keras is very expressive and flexible for doing innovative research.

3.6.1 Highlights:

 Being completely Python-based makes it easier to debug and explore


 Modular in nature
 Neural network models can be combined for developing more complex models
 Runs smoothly on both CPU and GPU
 Supports almost all models of a neural network, including convolutional, embedding,
fully connected, pooling, and recurrent
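
A minimal sketch of the Keras Sequential API (the layer sizes are illustrative, and the TensorFlow backend is assumed):

    from tensorflow import keras    # Keras can also run on CNTK or Theano

    # a small fully connected network built with the modular Sequential API
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

    # compiling attaches an optimizer and a loss function to the model
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()                 # prints the layer-by-layer structure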
3.7 Seaborn

Primary Intent: Data visualization, making statistical graphics in Python

Secondary Intent(s): None

Basically a data visualization library for Python, Seaborn is built on top of the Matplotlib
library. Also, it is closely integrated with Pandas data structures. The Python data
visualization library offers a high-level interface for drawing attractive as well as informative
statistical graphs.

The main aim of Seaborn is to make visualization a vital part of exploring and understanding
data. Its dataset-oriented plotting functions operate on arrays and data-frames containing
whole datasets. The library is ideal for examining relationships among multiple variables.

Seaborn internally performs all the important semantic mapping and statistical aggregation
for producing informative plots. The Python data visualization library also has tools for
choosing among color palettes that aid in revealing patterns in a dataset.

3.7.1 Highlights:

 Automatic estimation as well as the plotting of linear regression models


 Comfortable views of the overall structure of complex datasets
 Eases building complex visualizations using high-level abstractions for structuring
multi-plot grids
 Options for visualizing bivariate or univariate distributions
 Specialized support for using categorical variables
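
A minimal sketch of dataset-oriented plotting, using the sample "tips" dataset that ships with Seaborn (loading it requires an internet connection); it also shows the automatic regression fit mentioned in the first highlight:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # a dataset-oriented plotting function: pass a whole DataFrame plus
    # column names; Seaborn handles the semantic mapping and regression fit
    tips = sns.load_dataset("tips")     # sample data bundled with Seaborn
    sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker")
    plt.savefig("tips.png")             # or plt.show() interactively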
3.8 TensorFlow

Primary Intent: Developing, training, and designing deep learning models

Secondary Intent(s): Performing numerical computation

Anybody involved in machine learning projects using Python must have, at least, heard of
TensorFlow. Developed by Google, it is an open source symbolic math library for numerical
computation using data flow graphs.

The mathematical operations in a typical TensorFlow data flow graph are represented by the
graph nodes. The graph edges, on the other hand, represent the multidimensional data arrays,
a.k.a. tensors, that flow between the graph nodes.

TensorFlow flaunts a flexible architecture. It allows Python developers to deploy computation to one or more CPUs or GPUs in a desktop, mobile device, or server without rewriting code. The core of TensorFlow itself is written in C and C++.

Widely used Google products like Google Photos and Google Voice Search are built using TensorFlow. The library has a complicated front end for Python: the Python code is compiled into a graph and then executed on the TensorFlow distributed execution engine.

3.8.1 Highlights:

 Allows training multiple neural networks on multiple GPUs, making models very efficient for large-scale systems
 Easily trainable on CPU and GPU for distributed computing
 Flexibility in its operability, meaning TensorFlow offers the option of taking out the parts that you want and leaving the parts that you don't
 Great level of community and developer support
 Unlike other data science Python libraries, TensorFlow simplifies the process of
visualizing each and every part of the graph
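
A minimal sketch of the data flow graph described above, written against the TensorFlow 1.x-style graph API (available in TensorFlow 2.x via tf.compat.v1); the matrices are illustrative:

    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()        # use graph mode on TensorFlow 2.x

    # nodes are mathematical operations; edges carry tensors between them
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0], [2.0]])
    product = tf.matmul(a, b)           # a node producing [[5.0], [11.0]]

    # the graph only runs when handed to the execution engine via a session
    with tf.Session() as sess:
        print(sess.run(product))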
