Competitive Analysis API Report

B.K.
BIRLA INSTITUTE OF ENGINEERING &
TECHNOLOGY,PILANI (RAJ.)
SESSION 2015
CERTIFICATE
This is to certify that the project entitled "API to perform competitive analysis of
E-commerce sites" has been submitted to the Rajasthan Technical University, Kota, in
fulfillment of the requirement for the award of the degree of Bachelor of Technology in
Information Technology by the following students of final year B.Tech. (Information Technology).
Gaurav Kumawat (11EBKIT012)
Vishesh Mishra (11EBKIT063)
Mr. Shridhar Dandin (HOD, IT Deptt.)
Guide: Mrs. Sonam Mittal Sahu
CONTENTS
1. Abstract
1.1. Workflow
1.2. Base Language
1.3. Input
1.4. Output
2. Introduction
2.1. Background
2.2. Objectives
2.3. Purpose, Scope & Applicability
3. Tools & Environments used
4. Python
4.1. Python Features
4.2. Statements & Control Flow
4.3. Expressions
4.4. Methods
4.5. Mathematics
4.6. Libraries
4.7. Development Environments
5. Numpy
6. Python Data Analysis Library
7. JSON
8. Difflib
8.1. Sequence Matcher
9. Comma Separated Values (CSV)
10. Design Document
10.1. Modularization Details
11. Source Code
12. Testing
12.1. Unit Testing
12.2. Integration Testing
12.3. System Testing
13. Reports
13.1. Unit Testing
13.2. Integration Testing
13.3. System Testing
14. Input & Output Screens
15. Conclusion
16. Limitations of the Project
17. Future Applications of the Project
18. References
TEAM INFORMATION
MEMBERS :
Gaurav Kumawat (11EBKIT012)

Branch : Information Technology
Vishesh Mishra (11EBKIT063)
Branch : Information Technology
1. ABSTRACT
This API is system software that helps a Business Intelligence team devise market
strategies. Its core is the product-matching algorithm, which matches products across
e-commerce platforms and gives analytical insight into the market capture of a product
or a company against its competitors. For instance, a well-known company such as
Flipkart may want to compare the selling price of one of its products, say the
'IPhone 6', with the prices of its competitors. The same product may be listed under
different names on different platforms: on Amazon it might be 'i-phone6 black' and on
Snapdeal 'white I-phone/6 32gb'. Searching and sorting such listings by hand is messy
and time-consuming. Our API is designed to address this problem and to reduce the
overall effort many times over.
1.1. WORKFLOW
Development Phase: Analyze the problem statement, research it, and synthesize a
theoretical viewpoint. Develop the algorithms.
Implementation Phase: Turn the theoretical concepts into working software.
Evaluation Phase: Test the algorithms and verify the results. Measure the recall and
precision of the output.
1.2. BASE LANGUAGE: Python.
1.3. INPUT: Web-crawled data, dataset (JSON format).
1.4. OUTPUT: Business Intelligence report, CSV report.
2. INTRODUCTION
2.1. BACKGROUND
There are numerous e-commerce sites on the web, and many products are sold on several
of them at different prices. It can therefore be quite time-consuming for an e-commerce
company to compare its product information with that of other companies and perform
competitive analysis. It would be much easier for the company if it could compare the
prices and other information of products listed on different sites in a single place.
2.2. OBJECTIVES
To develop, implement and evaluate an API to perform competitive analysis on
various e-commerce business platforms.
2.3. PURPOSE, SCOPE AND APPLICABILITY
The purpose of this project is to reduce the user's workload of comparing the price of
a product across different e-commerce sites. There is a lot of scope in this area, since
the number of e-commerce sites, many of them selling common products, is expected to
grow sharply. Our API addresses this problem by providing a platform that helps an
e-commerce company compare its product data with that of other e-commerce sites and
stay competitive by adjusting prices accordingly. Currently the API can be somewhat
complex to use, since it is still in the development stage and the algorithm is not yet
100% efficient, but it can be improved further in future. It is applicable wherever
there is a large number of e-commerce sites and it would be difficult to analyze the
product information of common products manually.
3. TOOLS AND ENVIRONMENT USED

Environment:
Operating system: Windows 7/8
Development environment: Python 2.7.9 IDLE
Tools: Python modules, which include:
NumPy
pandas
csv
json
difflib (which provides the SequenceMatcher class)
operator
Other tools used: Visual C++ for Python 2.7 (VCForPython27), PythonGUI.
Dependencies
Python 2.7
csv
operator
NumPy
pandas
json
difflib
4. PYTHON (PROGRAMMING LANGUAGE)

Python is a high-level, interpreted, interactive and object-oriented scripting language.
Python is designed to be highly readable. It frequently uses English keywords where
other languages use punctuation, and it has fewer syntactical constructions than
other languages.
Python is Interpreted: Python is processed at runtime by the interpreter. You
do not need to compile your program before executing it. This is similar to
PERL and PHP.
Python is Interactive: You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
Python is Object-Oriented: Python supports the object-oriented style or
technique of programming that encapsulates code within objects.
Python is a Beginner's Language: Python is a great language for
beginner-level programmers and supports the development of a wide range of
applications, from simple text processing to WWW browsers to games.
4.1.
PYTHON FEATURES
Python's features include:
Easy-to-learn: Python has few keywords, a simple structure, and a clearly
defined syntax. This allows the student to pick up the language quickly.
Easy-to-read: Python code is more clearly defined and visible to the eyes.
Easy-to-maintain: Python's source code is fairly easy to maintain.
A broad standard library: The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode: Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
Portable: Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.
Extendable: You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.
Databases: Python provides interfaces to all major commercial databases.
GUI Programming: Python supports GUI applications that can be created
and ported to many system calls, libraries and windows systems, such as
Windows MFC, Macintosh, and the X Window system of Unix.
Scalable: Python provides a better structure and support for large programs
than shell scripting.
Apart from the above-mentioned features, Python has many other good features. A few
are listed below:
It supports functional and structured programming methods as well as OOP.
It can be used as a scripting language or can be compiled to byte-code for
building large applications.
It provides very high-level dynamic data types and supports dynamic type
checking.
It supports automatic garbage collection.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
4.2.
STATEMENTS AND CONTROL FLOW
Python's statements include (among others):
The if statement, which conditionally executes a block of code, along
with else and elif (a contraction of else-if).
The for statement, which iterates over an iterable object, capturing each
element to a local variable for use by the attached block.
The while statement, which executes a block of code as long as its condition
is true.
The try statement, which allows exceptions raised in its attached code block
to be caught and handled by except clauses; it also ensures that clean-up code in
a finally block will always be run regardless of how the block exits.
The class statement, which executes a block of code and attaches its local
namespace to a class, for use in object-oriented programming.
The def statement, which defines a function or method.
The with statement (from Python 2.5), which encloses a code block within a
context manager (for example, acquiring a lock before the block of code is run
and releasing the lock afterwards, or opening a file and then closing it),
allowing RAII-like behavior.
The pass statement, which serves as a NOP. It is syntactically needed to
create an empty code block.
The assert statement, used during debugging to check for conditions that
ought to apply.
The yield statement, which returns a value from a generator function. From
Python 2.5, yield is also an operator. This form is used to implement coroutines.
The import statement, which is used to import modules whose functions or
variables can be used in the current program.
The print statement, which was changed to the print() function in Python 3.
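Several of the statements listed above can be seen working together in one short program. The following is only an illustrative sketch: the function name read_prices and the sample input are made up for this example, and the code is written so that it also runs on Python 3.

```python
import io


def read_prices(stream):
    """Yield one float per non-empty line; skip lines that are not numbers."""
    for line in stream:              # for: iterate over an iterable object
        line = line.strip()
        if not line:                 # if: conditional execution
            continue
        try:                         # try/except: handle bad input
            yield float(line)        # yield: makes this a generator function
        except ValueError:
            pass                     # pass: an explicit no-op block


# with: the context manager closes the stream when the block exits
with io.StringIO(u"499.0\n\nn/a\n520.5\n") as stream:
    prices = list(read_prices(stream))

assert len(prices) == 2              # assert: a sanity check while debugging
print(prices)
```

The blank line and the non-numeric line are skipped, so only the two valid prices remain.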
4.3.
EXPRESSIONS
Python expressions are similar to languages such as C and Java:
Addition, subtraction, and multiplication are the same, but the behavior of
division differs (see Mathematics for details). Python also added the ** operator
for exponentiation.
In Python, == compares by value, in contrast to Java, where it compares objects by
reference. (Value comparisons in Java use the equals() method.)
Python's is operator may be used to compare object identities (comparison by
reference). Comparisons may be chained, for example a <= b <= c .
Python uses the words and , or , not for its boolean operators rather than the
symbolic && , || , ! used in Java and C.
Python has a type of expression termed a list comprehension. Python 2.4
extended list comprehensions into a more general expression termed
a generator expression.[40]
Anonymous functions are implemented using lambda expressions; however,
these are limited in that the body can only be a single expression.
Conditional expressions in Python are written as x if c else y [57] (different in
order of operands from the ?: operator common to many other languages).
Python makes a distinction between lists and tuples. Lists are written as [1, 2,
3] , are mutable, and cannot be used as the keys of dictionaries (dictionary keys
must be immutable in Python). Tuples are written as (1, 2, 3) , are immutable and
thus can be used as the keys of dictionaries, provided all elements of the tuple are
immutable. The parentheses around the tuple are optional in some contexts.
Tuples can appear on the left side of an equal sign; hence a statement like x, y =
y, x can be used to swap two variables.
Python has a "string format" operator % . This functions analogously
to printf format strings in C, e.g. "foo=%s bar=%d" % ("blah", 2) evaluates
to "foo=blah bar=2" . In Python 3 and 2.6+, this was supplemented by
the format() method of the str class, e.g. "foo={0} bar={1}".format("blah", 2) .
Python has various kinds of string literals:
Strings delimited by single or double quotation marks. Unlike in Unix
shells, Perl and Perl-influenced languages, single quotation marks and double
quotation marks function identically. Both kinds of string use the backslash
( \ ) as an escape character and there is no implicit string interpolation such
as "$foo" .
Triple-quoted strings, which begin and end with a series of three single
or double quotation marks. They may span multiple lines and function
like here documents in shells, Perl and Ruby.
Raw string varieties, denoted by prefixing the string literal with an r .
No escape sequences are interpreted; hence raw strings are useful where
literal backslashes are common, such as regular expressions and Windows-style
paths. Compare " @ -quoting" in C#.
Python has index and slice expressions on lists, denoted
as a[key] , a[start:stop] or a[start:stop:step] . Indexes are zero-based, and
negative indexes are relative to the end. Slices take elements from the start index
up to, but not including, the stop index. The third slice parameter,
called step or stride, allows elements to be skipped and reversed. Slice indexes
may be omitted, for example a[:] returns a copy of the entire list. Each element of
a slice is a shallow copy.
In Python, a distinction between expressions and statements is rigidly enforced, in
contrast to languages such as Common Lisp, Scheme, or Ruby. This leads to some
duplication of functionality. For example:
List comprehensions vs. for -loops
Conditional expressions vs. if blocks
The eval() vs. exec() built-in functions (in Python 2, exec is a statement);
the former is for expressions, the latter is for statements.
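The expression forms described in this section can be tried out on a few product titles. This is a small sketch only; the titles come from the abstract, while the price in the last line is an invented placeholder value.

```python
titles = ["iPhone 6 Black", "  white I-phone/6 32gb ", "Moto G"]

# List comprehension, and a generator expression passed to sum()
cleaned = [t.strip().lower() for t in titles]
total_length = sum(len(t) for t in cleaned)

# Conditional expression: x if c else y
label = "long" if total_length > 30 else "short"

# Tuple assignment swaps two variables without a temporary
first, second = cleaned[0], cleaned[1]
first, second = second, first

# Indexing and slicing: zero-based, negative indexes count from the end
last = cleaned[-1]
reversed_copy = cleaned[::-1]

# Both formatting styles produce the same string
old_style = "foo=%s bar=%d" % ("blah", 2)
new_style = "foo={0} bar={1}".format("blah", 2)

# A tuple can be a dictionary key; a list cannot (the price is a made-up value)
price_by_site = {("flipkart", "iphone 6"): 49999}
```

Trying to use the list ["flipkart", "iphone 6"] as the key instead raises a TypeError, because lists are mutable and therefore unhashable.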
4.4.
METHODS
Methods on objects are functions attached to the object's class; the
syntax instance.method(argument) is, for normal methods and functions, syntactic
sugar for Class.method(instance, argument) . Python methods have an
explicit self parameter to access instance data, in contrast to the
implicit self (or this ) in some other object-oriented programming languages
(e.g. C++, Java, Objective-C, or Ruby).
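The equivalence of the two call forms can be checked directly. In this sketch the class Product, its method discounted and the numbers are hypothetical and exist only for illustration.

```python
class Product(object):
    def __init__(self, title, price):
        self.title = title
        self.price = price

    def discounted(self, percent):
        # self is an explicit parameter: the instance the method was called on
        return self.price * (100 - percent) / 100.0


phone = Product("IPhone 6", 50000)

via_instance = phone.discounted(10)        # usual call syntax
via_class = Product.discounted(phone, 10)  # the same call written out in full
```

Both calls return the same value, because the first form is rewritten into the second by the interpreter.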
4.5.
MATHEMATICS
Python has the usual C arithmetic operators (+, -, *, /, %). It also has ** for
exponentiation, e.g. 5**3 == 125 and 9**.5 == 3.0 and a new matrix multiply
operator @ coming in 3.5.[61]
The behavior of division has changed significantly over time.[62]
In Python 2, / is "classic division": it is integer division if both operands are
integers, and floating point division otherwise. Integer division rounds towards
negative infinity, e.g. 7 / 3 == 2 and -7 / 3 == -3 .
Python 2.2 introduced the floor division operator // . So 7 // 3 == 2 ,
-7 // 3 == -3 , 7.5 // 3 == 2.0 and -7.5 // 3 == -3.0 . Adding from
__future__ import division causes a module to use Python 3.0 rules for division
(see next).
Python 3.0 changes / to always be floating point division. In Python terms, the
pre-3.0 / is "classic division", the 3.0 / is "true division", and // is "floor division".
Rounding towards negative infinity, though different from most languages, adds
consistency. For instance, it means that the equation (a+b) // b == a // b + 1 is always
true. It also means that the equation b * (a // b) + a % b == a is valid for both positive
and negative values of a . However, maintaining the validity of this equation means
that while the result of a % b is, as expected, in the half-open interval [0,b),
where b is a positive integer, it has to lie in the interval (b,0] when b is negative.[63]
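The two identities and the sign of the remainder can be verified with a few integers. The helper names below are chosen for this sketch; only // and % are used, so the results are the same on Python 2.2+ and Python 3.

```python
def division_identity_holds(a, b):
    """Check b * (a // b) + a % b == a for one pair of integers."""
    return b * (a // b) + a % b == a


def shift_identity_holds(a, b):
    """Check (a + b) // b == a // b + 1 for one pair of integers."""
    return (a + b) // b == a // b + 1


pairs = [(a, b) for a in (-7, 7) for b in (-3, 3)]
identities = [division_identity_holds(a, b) and shift_identity_holds(a, b)
              for a, b in pairs]

# The remainder takes the sign of the divisor b
remainders = [7 % 3, -7 % 3, 7 % -3, -7 % -3]

floor_results = [7 // 3, -7 // 3, 7.5 // 3, -7.5 // 3]
```

With a positive divisor the remainders fall in [0, b); with a negative divisor they fall in (b, 0].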
Python provides a round function for rounding floats to integers. Versions before 3
use round-away-from-zero: round(0.5) is 1.0, round(-0.5) is -1.0.[64] Python 3
uses round-to-even: round(1.5) is 2, round(2.5) is 2.[65] The Decimal type/class in
module decimal (since version 2.4) provides exact numerical representation and
several rounding modes.
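When the rounding rule matters, for example for prices, the rounding mode can be chosen explicitly with the decimal module instead of relying on the version-dependent round function. A minimal sketch, with values picked only for illustration:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

price = Decimal("2.5")

# Same input, two different rounding modes
half_up = price.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
half_even = price.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

# Decimal arithmetic is exact where binary floats are not
exact_sum = Decimal("0.1") + Decimal("0.2")
float_sum = 0.1 + 0.2
```

Here half_up is 3 and half_even is 2, and exact_sum equals Decimal("0.3") while float_sum is only approximately 0.3.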
Python allows boolean expressions with multiple equality relations in a manner that is
consistent with general usage in mathematics. For example, the expression a < b <
c tests whether a is less than b and b is less than c . C-derived languages interpret
this expression differently: in C, the expression would first evaluate a < b , resulting
in 0 or 1, and that result would then be compared with c .[66]
Due to Python's extensive mathematics library, it is frequently used as a scientific
scripting language to aid in problems such as data processing and manipulation.
4.6. LIBRARIES
Python has a large standard library, commonly cited as one of Python's greatest
strengths,[67] providing tools suited to many tasks. This is deliberate and has been
described as a "batteries included"[26] Python philosophy. For Internet-facing
applications, a large number of standard formats and protocols (such
as MIME and HTTP) are supported. Modules for creating graphical user interfaces,
connecting to relational databases, pseudorandom number generators, arithmetic with
arbitrary precision decimals,[68] manipulating regular expressions, and doing unit
testing are also included.
Some parts of the standard library are covered by specifications (for example,
the WSGI implementation wsgiref follows PEP 333[69]), but the majority of the
modules are not. They are specified by their code, internal documentation, and test
suite (if supplied). However, because most of the standard library is cross-platform
Python code, there are only a few modules that must be altered or completely
rewritten by alternative implementations.
The standard library is not essential to run Python or embed Python within an
application. Blender 2.49, for instance, omits most of the standard library.
As of January 2015, the Python Package Index, the official repository of third-party
software for Python, contains more than 54,000 packages offering a wide range of
functionality, including:
graphical user interfaces, web frameworks, multimedia, databases, networking
and communications
test frameworks, automation and web scraping, documentation tools, system
administration
scientific computing, text processing, image processing
4.7.
DEVELOPMENT ENVIRONMENTS
Most Python implementations (including CPython) can function as a command line
interpreter, for which the user enters statements sequentially and receives the results
immediately (REPL). In short, Python acts as a shell.
Other shells add capabilities beyond those in the basic interpreter,
including IDLE and IPython. While generally following the visual style of the Python
shell, they implement features like auto-completion, retention of session state, and
syntax highlighting. In addition to standard desktop Python IDEs (integrated
development environments), there are also browser-based IDEs, Sage (intended for
developing science and math-related Python programs), and a browser-based IDE and
hosting environment, PythonAnywhere.
5. NUMPY
NumPy is the fundamental package for scientific computing with Python. It contains
among other things:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data-types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
NumPy is an extension to the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large library of
high-level mathematical functions to operate on these arrays. The ancestor of NumPy,
Numeric, was originally created by Jim Hugunin with contributions from several
other developers. In 2005, Travis Oliphant created NumPy by incorporating features
of the competing Numarray into Numeric, with extensive modifications. NumPy
is open source and has many contributors.
6. PYTHON DATA ANALYSIS LIBRARY

pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming language.
Library Highlights
A fast and efficient DataFrame object for data manipulation with
integrated indexing;
Tools for reading and writing data between in-memory data structures
and different formats: CSV and text files, Microsoft Excel, SQL databases,
and the fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain
automatic label-based alignment in computations and easily manipulate messy
data into an orderly form;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, fancy indexing, and subsetting of large
data sets;
Columns can be inserted and deleted from data structures for size
mutability;
Aggregating or transforming data with a powerful group by engine
allowing split-apply-combine operations on data sets;
High performance merging and joining of data sets;
Hierarchical axis indexing provides an intuitive way of working with
high-dimensional data in a lower-dimensional data structure;
Time series functionality: date range generation and frequency
conversion, moving window statistics, moving window linear regressions, date
shifting and lagging. Even create domain-specific time offsets and join time
series without losing data;
Highly optimized for performance, with critical code paths written
in Cython or C.
Python with pandas is in use in a wide variety of academic and
commercial domains, including Finance, Neuroscience, Economics, Statistics,
Advertising, Web Analytics, and more.
pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with relational or labeled data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real
world data analysis in Python. Additionally, it has the broader goal of becoming the
most powerful and flexible open source data analysis / manipulation tool
available in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data:
Tabular data with heterogeneously-typed columns, as in an SQL table
or Excel spreadsheet
Ordered and unordered (not necessarily fixed-frequency) time series
data
Arbitrary matrix data (homogeneously typed or heterogeneous) with
row and column labels
Any other form of observational / statistical data sets. The data actually
need not be labeled at all to be placed into a pandas data structure
Here are just a few of the things that pandas does well:
Easy handling of missing data (represented as NaN) in floating point
as well as non-floating point data
Size mutability: columns can be inserted and deleted from DataFrame
and higher dimensional objects
Automatic and explicit data alignment: objects can be explicitly
aligned to a set of labels, or the user can simply ignore the labels and
let Series, DataFrame, etc. automatically align the data for you in
computations
Powerful, flexible group by functionality to perform split-apply-combine
operations on data sets, for both aggregating and transforming
data
Make it easy to convert ragged, differently-indexed data in other
Python and NumPy data structures into DataFrame objects
Intelligent label-based slicing, fancy indexing, and subsetting of large
data sets
Intuitive merging and joining data sets
Flexible reshaping and pivoting of data sets
Hierarchical labeling of axes (possible to have multiple labels per
tick)
Robust IO tools for loading data from flat files (CSV and delimited),
Excel files, databases, and saving / loading data from the
ultrafast HDF5 format
Time series-specific functionality: date range generation and
frequency conversion, moving window statistics, moving window
linear regressions, date shifting and lagging, etc.
Many of these principles are here to address the shortcomings frequently experienced
using other languages / scientific research environments. For data scientists, working
with data is typically divided into multiple stages: munging and cleaning data,
analyzing / modeling it, then organizing the results of the analysis into a form suitable
for plotting or tabular display. pandas is the ideal tool for all of these tasks.
Some other notes
pandas is fast. Many of the low-level algorithmic bits have been
extensively tweaked in Cython code. However, as with anything else,
generalization usually sacrifices performance. So if you focus on one
feature for your application you may be able to create a faster
specialized tool.
pandas is a dependency of statsmodels, making it an important part of
the statistical computing ecosystem in Python.
pandas has been used extensively in production in financial
applications.
7. JSON (JAVASCRIPT OBJECT NOTATION)

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy
for humans to read and write. It is easy for machines to parse and generate. It is based
on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd
Edition - December 1999. JSON is a text format that is completely language
independent but uses conventions that are familiar to programmers of the C-family of
languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others.
These properties make JSON an ideal data-interchange language.
JSON is built on two structures:
A collection of name/value pairs. In various languages, this is realized as
an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array,
vector, list, or sequence.
These are universal data structures. Virtually all modern programming languages
support them in one form or another. It makes sense that a data format that is
interchangeable with programming languages also be based on these structures.
JSON IN PYTHON 2.X
JSON support is provided by the json module; its default behavior is described here.
Transformation from JSON to Python and back is symmetrical (only formatting is
lost).
Transformation from Python to JSON and back is not: for example, byte strings come
back as unicode objects and tuples come back as lists.
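The asymmetry of the round trip is easy to observe with json.dumps and json.loads. The record below is a hypothetical example; its field names follow the crawled data used later in the source code.

```python
import json

# Python -> JSON -> Python: some type information is lost on the way
record = {"title": "IPhone 6", "stock": True, "sizes": (16, 64), 1: "numeric key"}
restored = json.loads(json.dumps(record))

# JSON -> Python -> JSON: the data survives, only formatting may change
json_text = '{"title": "IPhone 6", "sizes": [16, 64]}'
round_trip = json.dumps(json.loads(json_text))
```

After the round trip the tuple (16, 64) has become the list [16, 64] and the integer key 1 has become the string "1", so restored is not equal to record. On Python 2 the strings additionally come back as unicode objects.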
OPERATOR (FUNCTIONAL INTERFACE TO BUILT-IN OPERATORS)
Purpose: Functional interface to built-in operators.
Functional programming using iterators occasionally requires creating small functions
for simple expressions. Sometimes these can be expressed as lambda functions, but
some operations do not need to be implemented with custom functions at all.
The operator module defines functions that correspond to built-in operations for
arithmetic and comparison, as well as sequence and dictionary operations.
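As a brief illustration, itemgetter can replace a lambda as a sort key and mul can replace a hand-written multiplication function. The offers and their prices below are invented sample data, not results from the project.

```python
from functools import reduce
from operator import itemgetter, mul

offers = [
    {"Source": "Amazon", "Selling_Price": 48999},
    {"Source": "Flipkart", "Selling_Price": 47499},
    {"Source": "SnapDeal", "Selling_Price": 48250},
]

# itemgetter("Selling_Price") behaves like: lambda offer: offer["Selling_Price"]
cheapest_first = sorted(offers, key=itemgetter("Selling_Price"))
sources = [itemgetter("Source")(offer) for offer in cheapest_first]

# mul behaves like: lambda x, y: x * y
product = reduce(mul, [2, 3, 4])
```

The sorted list starts with the cheapest offer, and product is 24.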
8. DIFFLIB
This module provides classes and functions for comparing sequences. It can be used,
for example, for comparing files, and can produce difference information in various
formats, including HTML and context and unified diffs. For comparing directories
and files, see also the filecmp module.
8.1. SEQUENCE MATCHER
The class difflib.SequenceMatcher is a flexible class for comparing pairs of sequences
of any type, so long as the sequence elements are hashable. The basic algorithm
predates, and is a little fancier than, an algorithm published in the late 1980s by
Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The idea
is to find the longest contiguous matching subsequence that contains no "junk"
elements (the Ratcliff and Obershelp algorithm doesn't address junk). The same idea is
then applied recursively to the pieces of the sequences to the left and to the right of
the matching subsequence. This does not yield minimal edit sequences, but does tend to
yield matches that "look right" to people.
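Applied to the product titles from the abstract, SequenceMatcher gives a similarity score between 0 and 1. The sketch below is not the project's matching code: the normalize helper, which lower-cases a title and removes punctuation and spaces before comparison, is an assumption added for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lower-case a title and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def similarity(a, b):
    """Similarity of two titles in [0, 1], computed on the normalized text."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same = similarity("IPhone 6", "i-phone6")
other = similarity("IPhone 6", "Moto G")

# quick_ratio() is a cheaper upper bound on ratio()
matcher = SequenceMatcher(None, normalize("IPhone 6"), normalize("Moto G"))
upper_bound = matcher.quick_ratio()
```

The two spellings of the same product normalize to the same string and score 1.0, while the unrelated title scores far lower. Because quick_ratio() only gives an upper bound, a threshold applied to it can accept pairs that ratio() would reject.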
9. COMMA-SEPARATED VALUES (CSV)

A comma-separated values (CSV) (also sometimes called character-separated
values) file stores tabular data (numbers and text) in plain-text form. Plain text means
that the file is a sequence of characters, with no data that has to be interpreted as
binary numbers. A CSV file consists of any number of records, separated by line
breaks of some kind; each record consists of fields, separated by some other character
or string, most commonly a literal comma or tab. Usually, all records have an
identical sequence of fields.
"CSV" refers to any file that:[2][4]
1. is plain text using a character set such as ASCII, various Unicode character
sets (e.g. UTF-8), EBCDIC, or Shift JIS,
2. consists of records (typically one record per line),
3. with the records divided into fields separated by delimiters (typically a single
reserved character such as comma, semicolon, or tab; sometimes the delimiter
may include optional spaces),
4. where every record has the same sequence of fields.
Within these general constraints, many variations are in use. Therefore "CSV" files
are not entirely portable. Nevertheless, the variations are fairly small, and many
implementations allow users to preview the first few lines of the file (which is feasible
because it is plain text), and then specify the delimiter character(s), quoting rules, etc.
If a particular CSV file's variations fall outside what a particular receiving program
supports, it is often feasible to examine and edit the file by hand or write a script or
program to fix the problem.
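A receiving program can handle such variations by stating the delimiter explicitly when it reads the file. The following sketch uses the csv module on in-memory text; the semicolon-delimited sample data is invented, and the code is written for Python 3 (on Python 2.7 the csv module works on byte strings, so files are opened in binary mode instead).

```python
import csv
import io

# Semicolon-delimited input in which one field contains the delimiter
text = u'title;price\niPhone 6;49999\n"Case; black";299\n'

rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))

# Write the same records back out with the default comma delimiter
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(rows)
lines = out.getvalue().splitlines()
```

The quoting rules keep "Case; black" together as one field even though it contains the delimiter. All values are read as strings, so prices must be converted before any numeric comparison.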
10. DESIGN DOCUMENT
10.1. MODULARIZATION DETAILS
The source code uses several libraries (also called modules): numpy, json, csv,
difflib and pandas. The project takes data in JSON format as input and shows the
output on the output screen, i.e. the Python shell. The output is also stored
in a CSV file.
INPUT DATA FORMAT: JSON
OUTPUT DATA FORMAT: CSV / Python shell
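The input side of this design can be sketched as follows, assuming that the data file holds one JSON object per line, as the source code in the next section reads it. The function name group_by_source and the sample lines are hypothetical.

```python
import json
from collections import defaultdict


def group_by_source(lines):
    """Parse one JSON object per line and group the records by 'source'."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:           # malformed line: skip it
            continue
        if not isinstance(record, dict) or "source" not in record:
            continue                 # not a product record
        groups[record["source"]].append(record)
    return groups


sample_lines = [
    '{"title": "IPhone 6", "source": "Flipkart", "available_price": 47499}',
    'not json',
    '',
    '{"title": "i-phone6 black", "source": "Amazon", "available_price": 48999}',
    '{"title": "Moto G", "source": "Flipkart", "available_price": 9999}',
]
groups = group_by_source(sample_lines)
counts = dict((source, len(records)) for source, records in groups.items())
```

Grouping with a dictionary keyed by source avoids one hard-coded list per competitor, so a new site needs no code change.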
11. SOURCE CODE
'''
Created on 7-May-2015
@author: gaurav & vishesh
'''
import csv
import operator
import numpy as np
from collections import defaultdict
from operator import itemgetter
import json
import pandas as pd
from difflib import SequenceMatcher as SM
class CodingTest:
    '''
    Loads crawled product records and matches products across competitors.
    '''
    def __init__(self):
        '''
        initialization of the variables.
        '''
        self.dataFile = "C:\\Python27\\project\\data.txt"
        self.csvToList = list()
        self.dataList = list()
        self.frame = pd.DataFrame()
        self.SnapDeal = list()
        self.Flipkart = list()
        self.Amazon = list()
        self.gg = {}
        self.values = dict()
        self.competitorList = ['SnapDeal','Flipkart','Amazon']
        self.mylist = list()
    def createDataDict(self,data,a,mylist):
        '''
        Append the fields of one crawled record to mylist and return mylist.
        The argument a (the source name) is not used.
        '''
        title = data['title']
        mrp = data['mrp']
        source = data['source']
        url = data['url']
        stock = data['stock']
        selling_price = data['available_price']
        mylist.append({'Title':title,'MRP':mrp,'Source':source,
                       'URL':url,'Stock_Status':stock,'Selling_Price': selling_price})
        return mylist
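    def filterData(self):
        '''
        Read the data file line by line and return the parsed records with
        surrounding whitespace stripped from keys and string values.
        '''
        dataList = list()
        self.loadFile = open(self.dataFile, "r")
        for line in self.loadFile:
            try:
                data = json.loads(line)
            except ValueError:
                # skip lines that are not valid JSON
                continue
            # json.loads returns unicode strings, so str.strip cannot be mapped over them
            a = dict((k.strip(), v.strip() if isinstance(v, basestring) else v)
                     for k, v in data.items())
            #print a
            # list.append returns None, so its result must not be assigned back
            dataList.append(a)
        self.loadFile.close()
        return dataList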
    def loadDataFile(self):
        '''
        function to load data file.
        Returns three lists of product dictionaries: (SnapDeal, Flipkart, Amazon).
        '''
        SnapDeal = list()
        Flipkart = list()
        Amazon = list()
        self.loadFile = open(self.dataFile, "r")
        print 'file is ok'
        for line in self.loadFile:
            try:
                data = json.loads(line)
                if data['source'] == 'SnapDeal':
                    self.createDataDict(data,'SnapDeal',SnapDeal)
                elif data['source'] == 'Flipkart':
                    self.createDataDict(data,'Flipkart',Flipkart)
                elif data['source'] == 'Amazon':
                    self.createDataDict(data,'Amazon',Amazon)
            except (ValueError, KeyError, TypeError):
                # skip malformed lines and records with missing fields
                continue
        self.loadFile.close()
        # return the lists themselves so the result is defined even if a source has no records
        return (SnapDeal,Flipkart,Amazon)
    def productMatching(self):
        '''
        For every SnapDeal title, find the most similar Flipkart title and
        write the matched Flipkart records to a CSV file.
        '''
        a = list()   # SnapDeal titles
        d = list()   # Flipkart titles
        e = list()   # best (score, SnapDeal title, Flipkart title) per SnapDeal product
        r = list()   # Flipkart records that were matched
        Snapdeal,Flipkart,Amazon = self.loadDataFile()
        biglist1_indexed = {}   # Flipkart title -> Flipkart record
        for item in Snapdeal:
            a.append(item["Title"])
        for item in Flipkart:
            d.append(item['Title'])
            biglist1_indexed[(item["Title"])] = item
        for i in a:
            c = list()
            for k in d:
                s = SM(None,i,k)
                j = s.quick_ratio()
                if(j >= .70):#you can adjust the threshold as per your requirement.
                    c.append((j,i,k))
            c.sort(reverse = True)
            if c:
                e.append(c[0])
        print 'List containing the top matching results with respective matching quotient'
        for s in e:
            print s
        for i in biglist1_indexed:
            for t in e:
                if (t[2] == i):
                    r.append(biglist1_indexed[i])
        if not r:
            # nothing matched, so there is no header row to derive from r[0]
            print 'no matches found'
            return
        keys = r[0].keys()
        with open('C:\\Python27\\project\\aa.csv', 'wb') as output_file:
            dict_writer = csv.DictWriter(output_file, keys)
            dict_writer.writeheader()
            dict_writer.writerows(r)
        print 'done'
        # TODO: make a dictionary which contains the attributes of snapdeal also
    def createCSVReport(self):
        columns = ['Flipkart Product Title', 'Flipkart Pro. Stock Status',
                   'Flipkart Pro MRP', 'Flipkart Selling Price']
        # NOTE: mydict is not defined in this listing and must be supplied.
        with open('C:\\Python27\\project\\aa.csv', 'w') as f:
            for key, value in mydict.items():
                f.write('{0},{1}\n'.format(key, value))

    def comupteRecallAndPrecision(self):
        # Placeholder: recall and precision are not computed in this listing.
        return (recall, precision)
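The listing leaves comupteRecallAndPrecision as a stub. A minimal sketch of how the two figures could be computed is given below. It assumes that a hand-labelled set of correct (Snapdeal title, Flipkart title) pairs is available; the function name, argument names and sample titles are hypothetical, and the code is written for Python 3 rather than the Python 2.7 used in the listing.

```python
def compute_recall_and_precision(predicted_pairs, true_pairs):
    """Return (recall, precision) for a set of predicted title pairs.

    predicted_pairs: pairs produced by the matcher.
    true_pairs: hand-labelled correct pairs (assumed to exist).
    """
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    correct = predicted & truth
    # Guard against division by zero when either set is empty.
    recall = len(correct) / len(truth) if truth else 0.0
    precision = len(correct) / len(predicted) if predicted else 0.0
    return recall, precision


# Hypothetical example data.
predicted = [("sony xperia z", "sony xperia z black"),
             ("moto g", "moto e")]
truth = [("sony xperia z", "sony xperia z black"),
         ("moto g", "moto g 16gb"),
         ("nexus 5", "google nexus 5")]
recall, precision = compute_recall_and_precision(predicted, truth)
print(recall, precision)
```

Recall is the share of the correct pairs that the matcher found; precision is the share of the reported pairs that are correct.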
12. TESTING
12.1. UNIT TESTING

In computer programming, unit testing is a software testing method by which
individual units of source code, sets of one or more computer program modules
together with associated control data, usage procedures, and operating procedures, are
tested to determine whether they are fit for use.
Here we have treated the functions as individual modules or units and executed
each of them separately in order to test it.
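As an illustration of treating a function as a unit, the sketch below tests a small stand-alone matching helper with Python's unittest module. The helper best_match and the sample titles are hypothetical and are not part of the project's source; the code targets Python 3.

```python
import unittest
from difflib import SequenceMatcher


def best_match(title, candidates, threshold=0.70):
    """Return the candidate most similar to title, or None if none reaches threshold."""
    scored = [(SequenceMatcher(None, title, c).quick_ratio(), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored)[1] if scored else None


class BestMatchTest(unittest.TestCase):
    def test_identical_title_is_matched(self):
        self.assertEqual(best_match("moto g", ["moto g", "nexus 5"]), "moto g")

    def test_unrelated_titles_give_no_match(self):
        self.assertIsNone(best_match("moto g", ["xyz"]))

    def test_empty_candidate_list(self):
        self.assertIsNone(best_match("moto g", []))
```

Saved in a file, these tests can be run with python -m unittest.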
12.2. INTEGRATION TESTING

Integration testing (sometimes called integration and testing, abbreviated I&T) is
the phase in software testing in which individual software modules are combined
and tested as a group. It occurs after unit testing and before validation testing.
Here we have tested the functions working together, i.e. in integrated form.
12.3. SYSTEM TESTING
System testing of software or hardware is testing conducted on a complete, integrated
system to evaluate the system's compliance with its specified requirements. System
testing falls within the scope of black box testing, and as such, should require no
knowledge of the inner design of the code or logic.
Here we have tested the system as a whole, and the results are as expected and quite
satisfactory.
13. REPORTS
13.1. UNIT TESTING
Tested units: competitorList, createCSVReport, values, myList, productMatching.

The unit testing report is shown above.
13.2. INTEGRATION TESTING
System integration was tested by running productMatching(), the most important
function, which is linked to and integrated with the other functions.
productMatching() contains the product matching algorithm, which is the most
important aspect of this program.
13.3. SYSTEM TESTING
The integrated system was tested as a whole by running it in its production
environment. The system testing report is shown above in the screenshot.
14. INPUT AND OUTPUT SCREENS
14.1. INPUT
This is the input data in JSON format, which was crawled from the internet. The data
contains information on the products of various e-commerce sites (Flipkart,
Snapdeal, Amazon).
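A minimal sketch of reading such input follows. It assumes one JSON object per line with a source field and a Title field, as used in the source code of Section 11; the sample records and the price field are made up for illustration, and the code targets Python 3.

```python
import json
from collections import defaultdict

# Hypothetical sample of the crawled input: one JSON object per line.
sample_lines = [
    '{"source": "Flipkart", "Title": "Sony Xperia Z", "price": "30000"}',
    '{"source": "SnapDeal", "Title": "Xperia Z", "price": "29500"}',
    'not valid json',
    '{"source": "Amazon", "Title": "Sony Xperia Z (Black)", "price": "29999"}',
]


def group_by_source(lines):
    """Parse JSON lines and group the records by their 'source' field."""
    groups = defaultdict(list)
    for line in lines:
        try:
            record = json.loads(line)
            groups[record["source"]].append(record)
        except (ValueError, KeyError, TypeError):
            # Skip lines that are malformed or lack a 'source' field.
            continue
    return groups


groups = group_by_source(sample_lines)
for source, records in sorted(groups.items()):
    print(source, [r["Title"] for r in records])
```

Malformed lines are skipped rather than aborting the whole load, which mirrors the try/except used in loadDataFile.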
14.2. OUTPUT
Output at threshold = 0.70

Using quick_ratio()
Output at threshold = 0.80

Using quick_ratio()
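
The effect of the threshold can be sketched as follows. The titles are invented for illustration and the helper matches_at is hypothetical; only SequenceMatcher.quick_ratio() and the two threshold values come from the report. The code targets Python 3.

```python
from difflib import SequenceMatcher

# Hypothetical titles from two sites.
snapdeal_titles = ["xperia z", "moto g 16gb"]
flipkart_titles = ["sony xperia z", "moto g", "nexus 5"]


def matches_at(threshold):
    """Return the best Flipkart title per Snapdeal title at the given threshold."""
    result = []
    for s_title in snapdeal_titles:
        scored = [(SequenceMatcher(None, s_title, f_title).quick_ratio(), f_title)
                  for f_title in flipkart_titles]
        score, best = max(scored)
        # Report the best candidate only if it reaches the threshold.
        if score >= threshold:
            result.append((s_title, best, round(score, 2)))
    return result


loose = matches_at(0.70)
strict = matches_at(0.80)
print(len(loose), len(strict))
```

Because the best candidate for each title is fixed, raising the threshold can only remove matches from the result, never add them.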
15. CONCLUSION
This API is system software that helps a Business Intelligence team devise market
strategies. Its main task is the product-matching algorithm, which matches products
across various e-commerce platforms and gives analytical insight into the market
capture of a product or a company against its competitors. The API is designed to
address the problem of comparing products across sites and reduces the overall effort
many times over. It can be used by e-commerce companies to perform competitive
analysis. Companies can use the results generated by this project to shape future
market strategies, which could help them capture more of the market and thereby
increase profits. Use of this software by e-commerce companies should lead to better
strategy and planning, and should encourage software developers to pay more attention
to this area, helping business organizations perform data analysis more efficiently
and in one place.
16. LIMITATIONS OF THE PROJECT
The software has its own set of limitations. Foremost is the accuracy of the
algorithm, which is currently below 100%. Product names vary between sites: for
example, a product listed as sony xperia z on Flipkart may be listed as xperia z on
Snapdeal. In such cases the algorithm may fail to match the products or may give
unexpected results. To solve this problem, more advanced mathematical techniques
such as probability models could be applied in the algorithm.
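One simple direction, short of a full probability model, is to compare titles word by word instead of character by character. The sketch below is an assumption about how this could be done and is not part of the project; token_overlap is a hypothetical helper, written for Python 3. It scores the share of the shorter title's words that also occur in the other title, so xperia z is fully contained in sony xperia z.

```python
def token_overlap(title_a, title_b):
    """Fraction of the shorter title's words that also appear in the other title."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    # Order the two word sets by size so the smaller one is the reference.
    shorter, longer = sorted((words_a, words_b), key=len)
    return len(shorter & longer) / len(shorter)


print(token_overlap("Sony Xperia Z", "xperia z"))   # 1.0
print(token_overlap("Sony Xperia Z", "moto g"))     # 0.0
```

Such a measure also gives a full score to xperia z against a different model whose name merely contains those words, so it would need to be combined with other checks rather than used on its own.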
A GUI has not yet been developed and the software is still in its infancy, so it is
not very user friendly. At present it can be operated only by people who have some
knowledge of the internal structure and working of the API.
17. FUTURE APPLICATIONS OF THE PROJECT
The application currently matches the products along with their data and outputs the
data in the desired format in one place. However, the accuracy of the algorithm used
in the application is not 100%. In future the algorithm can be improved so that it
produces results with accuracy approaching 100%.
The application can also be given a proper GUI, which would make it more intuitive,
interactive and user friendly.