TECHNOLOGY, PILANI (RAJ.)
SESSION 2015
CERTIFICATE
This is to certify that the project entitled "API to perform competitive analysis of Ecommerce sites" has been submitted to the Rajasthan Technical University, Kota, in fulfillment
of the requirement for the award of the degree of Bachelor of Technology in Information
Technology by the following students of final year B.Tech. (Information Technology).
Vishesh Mishra(11EBKIT063)
(HOD, IT Deptt.)
Guide
Mrs. Sonam Mittal Sahu
CONTENTS
1 Abstract
1.1 Workflow
1.2 Base Language
1.3 Input
1.4 Output
2 Introduction
2.1 Background
2.2 Objectives
2.3
4 Python
4.1 Python Features
4.2
4.3 Expressions
4.4 Methods
4.5 Mathematics
4.6 Libraries
4.7 Development Environments
5 NumPy
JSON
8 Difflib
8.1 Sequence Matcher
9
10 Design Document
10.1 Modularization Details
11 Source Code
12 Testing
12.1 Unit Testing
12.2 Integration Testing
12.3 System Testing
13 Reports
13.1 Unit Testing
13.2 Integration Testing
13.3 System Testing
15 Conclusion
18 References
TEAM INFORMATION
MEMBERS :
1. ABSTRACT
This API is system software that helps a Business Intelligence team devise market
strategies. Its main component is the product-matching algorithm, which matches
products across e-commerce platforms and gives analytical insight into the market
capture of a product or a company against its competitors. For instance, a well-known
company such as Flipkart may want to compare the selling price of one of its products,
say 'IPhone 6', with its competitors' prices. The same product may be named
differently on different platforms: on Amazon it might be 'i-phone6 black' and on
Snapdeal 'white I-phone/6 32gb'. Searching and sorting such listings by hand is messy
and time-consuming. Our API is designed to address this problem and reduces the
overall effort many times over.
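The naming problem described above can be illustrated with a short sketch. It is written for Python 3 (the project itself targets Python 2.7), and the helper name normalize_title and the rule of dropping every non-alphanumeric character are assumptions made for illustration, not part of the project's code.

```python
import re


def normalize_title(title):
    """Lower-case a product title and strip separators (hypothetical helper)."""
    # Keep only letters and digits, so that "I-phone/6" and "IPhone 6"
    # collapse to the same character stream.
    return re.sub(r"[^a-z0-9]", "", title.lower())


# The three spellings quoted in the abstract.
titles = ["IPhone 6", "i-phone6 black", "white I-phone/6 32gb"]
normalized = [normalize_title(t) for t in titles]

for original, cleaned in zip(titles, normalized):
    print(original, "->", cleaned)
```

After this step all three titles share the same core substring, which is what makes a later similarity comparison feasible; colour and storage words still differ and have to be handled by the matching step.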
1.1. WORKFLOW
1.2. BASE LANGUAGE
1.3. INPUT
1.4. OUTPUT
2. INTRODUCTION
2.1. BACKGROUND
Numerous e-commerce sites exist on the web, and many products are sold on several of
them at different prices. Comparing its product information with that of other
companies and performing competitive analysis can therefore be quite time-consuming
for an e-commerce company. The task would be much easier if the company could
compare the prices and other information of products listed on different sites in a
single location.
2.2. OBJECTIVES
2.3.
could be a bit complex, since it is still in the development stage and the algorithm
used is not yet 100% efficient, but it could be improved in future. It is applicable
in scenarios where there are a large number of e-commerce sites and it would be
difficult to analyze the product information of common products manually.
Other tools used: Visual C++ for Python 2.7 (VCForPython27), PythonGUI.
Dependencies
Python 2.7
csv
operator
NumPy
pandas
json
difflib
Python is Interactive: You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
4.1. PYTHON FEATURES
Easy-to-read: Python code is clearly defined and visible to the eyes.
A broad standard library: The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
Portable: Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.
Extendable: You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.
Scalable: Python provides a better structure and support for large programs
than shell scripting.
Apart from the above-mentioned features, Python has a long list of good features; a
few are listed below:
It provides very high-level dynamic data types and supports dynamic type
checking.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
4.2.
The for statement, which iterates over an iterable object, capturing each
element to a local variable for use by the attached block.
The while statement, which executes a block of code as long as its condition
is true.
The try statement, which allows exceptions raised in its attached code block
to be caught and handled by except clauses; it also ensures that clean-up code in
a finally block will always be run regardless of how the block exits.
The class statement, which executes a block of code and attaches its local
namespace to a class, for use in object-oriented programming.
The with statement (from Python 2.5), which encloses a code block within a
context manager (for example, acquiring a lock before the block of code is run
and releasing the lock afterwards, or opening a file and then closing it),
allowing RAII-like behavior.
The assert statement, used during debugging to check for conditions that
ought to apply.
The yield statement, which returns a value from a generator function. From
Python 2.5, yield is also an operator. This form is used to implement coroutines.
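Several of the statements listed above can be seen working together in one small sketch. It is written for Python 3; the names read_numbers and Tracker are hypothetical and exist only for this illustration.

```python
def read_numbers(lines):
    """Generator: yield every line that parses as an integer."""
    for line in lines:              # for: iterate over any iterable
        try:
            yield int(line)         # yield: hand one value back to the caller
        except ValueError:
            continue                # except: skip lines that are not numbers


class Tracker:
    """Minimal context manager, used here to show the with statement."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.events.append("exit")  # runs even if the block raises
        return False                # do not swallow exceptions


tracker = Tracker()
with tracker:                       # with: setup and clean-up around a block
    total = 0
    for n in read_numbers(["1", "x", "2"]):
        total += n
    assert total == 3               # assert: check a condition while debugging
```

The clean-up in __exit__ plays the role that a finally block would play in an explicit try statement.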
4.3. EXPRESSIONS
Addition, subtraction, and multiplication work as they do in C and Java, but the
behavior of division differs (see Mathematics for details). Python also added the **
operator for exponentiation.
In Python, == compares by value. (Value comparisons in Java use the equals() method.)
Python uses the words and, or, not for its boolean operators rather than the
symbolic &&, ||, ! used in Java and C.
Python makes a distinction between lists and tuples. Lists are written as [1, 2, 3],
are mutable, and cannot be used as the keys of dictionaries (dictionary keys
must be immutable in Python). Tuples are written as (1, 2, 3), are immutable and
thus can be used as the keys of dictionaries, provided all elements of the tuple are
immutable. The parentheses around the tuple are optional in some contexts.
Tuples can appear on the left side of an equal sign; hence a statement like x, y =
y, x can be used to swap two variables.
Triple-quoted strings, which begin and end with a series of three single
or double quotation marks. They may span multiple lines and function
like here documents in shells, Perl and Ruby.
Python has array index and array slicing expressions on lists, denoted as a[key],
a[start:stop] or a[start:stop:step]. Indexes are zero-based, and negative indexes are
relative to the end. Slices take elements from the start index up to, but not
including, the stop index. The third slice parameter, called step or stride, allows
elements to be skipped and reversed. Slice indexes may be omitted, for example a[:]
returns a copy of the entire list. Each element of a slice is a shallow copy.
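The list, tuple and slicing rules described in this section can be checked with a few lines of Python. This is a minimal sketch with made-up values; none of the variable names come from the project.

```python
a = [10, 20, 30, 40, 50]

first_two = a[0:2]      # start index included, stop index excluded
last = a[-1]            # negative indexes count from the end
every_other = a[::2]    # a step of 2 skips elements
backwards = a[::-1]     # a negative step reverses the list
copy = a[:]             # omitted indexes: shallow copy of the whole list

point = (1, 2)                      # tuple: immutable, so hashable
labels = {point: "tuple as key"}    # usable as a dictionary key

try:
    labels[[1, 2]] = "list as key"  # list: mutable, so unhashable
except TypeError as err:
    error_name = type(err).__name__

x, y = 1, 2
x, y = y, x                         # tuple unpacking swaps the two values
```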
In Python, a distinction between expressions and statements is rigidly enforced, in
contrast to languages such as Common Lisp, Scheme, or Ruby. This leads to some
duplication of functionality. For example:
The eval() vs. exec() built-in functions (in Python 2, exec is a statement);
the former is for expressions, the latter is for statements.
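The split between expressions and statements shows up directly in these two built-ins. The sketch below uses Python 3, where exec is a function; the strings being evaluated are arbitrary examples.

```python
# eval evaluates a single expression and returns its value.
value = eval("2 + 3 * 4")

# exec runs statements; names they define land in the given namespace.
namespace = {}
exec("total = 0\nfor n in range(4):\n    total += n", namespace)

# An assignment is a statement, so eval rejects it.
try:
    eval("x = 1")
    rejected = False
except SyntaxError:
    rejected = True
```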
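4.4. METHODS
Methods on objects are functions attached to the object's class; the syntax
instance.method(argument) is, for normal methods and functions, syntactic sugar for
Class.method(instance, argument). Python methods have an explicit self parameter to
access instance data, in contrast to the implicit self (or this) in some other
object-oriented programming languages (e.g. C++, Java, Objective-C, or Ruby).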
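A short sketch makes the explicit self parameter concrete. The Product class and its discounted method are hypothetical and are not taken from the project's source.

```python
class Product:
    def __init__(self, title, price):
        # self is the instance being initialised; it is passed explicitly.
        self.title = title
        self.price = price

    def discounted(self, percent):
        """Return the price after a percentage discount."""
        return self.price * (100 - percent) / 100.0


phone = Product("IPhone 6", 500)

via_instance = phone.discounted(10)         # usual call syntax
via_class = Product.discounted(phone, 10)   # what the call above stands for
```

Both calls run the same function; the first form simply supplies phone as the self argument automatically.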
4.5. MATHEMATICS
Python has the usual C arithmetic operators (+, -, *, /, %). It also has ** for
exponentiation, e.g. 5**3 == 125 and 9**.5 == 3.0, and a matrix multiplication
operator @, introduced in Python 3.5.[61]
The behavior of division has changed significantly over time.[62]
Python 2.1 and earlier use the C division behavior. The / operator is integer
division if both operands are integers, and floating point division otherwise.
Integer division rounds towards 0, e.g. 7 / 3 == 2 and -7 / 3 == -2 .
Python 3.0 changes / to always be floating point division. In Python terms, the
pre-3.0 / is "classic division", the 3.0 / is "real division", and // is "floor division".
Rounding towards negative infinity, though different from most languages, adds
consistency. For instance, it means that the equation (a+b) // b == a // b + 1 is always
true. It also means that the equation b * (a // b) + a % b == a is valid for both positive
and negative values of a . However, maintaining the validity of this equation means
that while the result of a % b is, as expected, in the half-open interval [0,b),
where b is a positive integer, it has to lie in the interval (b,0] when b is negative.[63]
Python provides a round function for rounding floats to integers. Versions before 3
use round-away-from-zero: round(0.5) is 1.0, round(-0.5) is -1.0.[64] Python 3 uses
round-to-even: round(1.5) is 2, round(2.5) is 2.
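The division and rounding rules in this section can be verified interactively. The sketch below assumes Python 3, so / is always floating-point division and round() rounds halves to the nearest even integer; under Python 2.7 the results of 7 / 2 and round(0.5) differ as described in the text.

```python
a, b = -7, 3

floor_quotient = a // b     # floor division rounds towards negative infinity
remainder = a % b           # the remainder takes the sign of the divisor
true_quotient = 7 / 2       # Python 3: / is floating-point division

# The identities quoted in the text also hold for negative operands.
identity_holds = (b * (a // b) + a % b == a)
shift_holds = ((a + b) // b == a // b + 1)

# With a negative divisor the remainder lies in the interval (b, 0].
negative_divisor_remainder = 7 % -3

# Python 3 rounds halves to the nearest even integer.
rounded = [round(0.5), round(1.5), round(2.5)]
```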
4.6. LIBRARIES
Python has a large standard library, commonly cited as one of Python's greatest
strengths,[67] providing tools suited to many tasks. This is deliberate and has been
described as a "batteries included"[26] Python philosophy. For Internet-facing
applications, a large number of standard formats and protocols (such
as MIME and HTTP) are supported. Modules for creating graphical user interfaces,
connecting to relational databases, pseudorandom number generators, arithmetic with
arbitrary precision decimals,[68] manipulating regular expressions, and doing unit
testing are also included.
Some parts of the standard library are covered by specifications (for example,
the WSGI implementation wsgiref follows PEP 333[69]), but the majority of the
modules are not. They are specified by their code, internal documentation, and test
suite (if supplied). However, because most of the standard library is cross-platform
Python code, there are only a few modules that must be altered or completely
rewritten by alternative implementations.
The standard library is not essential to run Python or embed Python within an
application. Blender 2.49, for instance, omits most of the standard library.
As of January 2015, the Python Package Index, the official repository of third-party
software for Python, contains more than 54,000 packages offering a wide range of
functionality.
4.7. DEVELOPMENT ENVIRONMENTS
Several shells add capabilities beyond those in the basic interpreter,
including IDLE and IPython. While generally following the visual style of the Python
shell, they implement features like auto-completion, retention of session state, and
syntax highlighting. In addition to standard desktop Python IDEs (integrated
development environments), there are also browser-based IDEs: Sage (intended for
developing science and math-related Python programs) and PythonAnywhere, a
browser-based IDE and hosting environment.
5. NUMPY
NumPy is the fundamental package for scientific computing with Python.
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
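As a rough illustration of both uses, the sketch below builds a plain numeric array and a structured array with a user-defined data type. The prices, titles and field names are invented for the example and do not come from the project's data.

```python
import numpy as np

# Rows are products, columns are three sites (illustrative numbers only).
prices = np.array([[499.0, 510.0, 505.0],
                   [299.0, 289.0, 305.0]])

cheapest = prices.min(axis=1)              # lowest price per product
spread = prices.max(axis=1) - cheapest     # price gap per product

# A user-defined record type: NumPy as a container of generic data.
record = np.dtype([("title", "U20"), ("mrp", "f8")])
catalog = np.array([("IPhone 6", 550.0), ("Xperia Z", 320.0)], dtype=record)
mean_mrp = catalog["mrp"].mean()           # column access by field name
```

Operations such as min and max run over whole rows at once, which is what makes array code shorter and faster than an explicit Python loop.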
NumPy is an extension to the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large library of
high-level mathematical functions to operate on these arrays. The ancestor of NumPy,
Numeric, was originally created by Jim Hugunin with contributions from several
other developers.
Tools for reading and writing data between in-memory data structures
and different formats: CSV and text files, Microsoft Excel, SQL databases,
and the fast HDF5 format;
Columns can be inserted and deleted from data structures for size
mutability;
Time series functionality: date range generation and frequency conversion;
pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with relational or labeled data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real
world data analysis in Python. Additionally, it has the broader goal of becoming the
most powerful and flexible open source data analysis / manipulation tool
available in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data:
Any other form of observational / statistical data sets. The data actually
need not be labeled at all to be placed into a pandas data structure
Here are just a few of the things that pandas does well:
Powerful, flexible group by functionality to perform split-apply-combine
operations on data sets, for both aggregating and transforming data
Robust IO tools for loading data from flat files (CSV and delimited),
Excel files, databases, and saving / loading data from the
ultrafast HDF5 format
Many of these principles are here to address the shortcomings frequently experienced
using other languages / scientific research environments. For data scientists, working
with data is typically divided into multiple stages: munging and cleaning data,
analyzing / modeling it, then organizing the results of the analysis into a form suitable
for plotting or tabular display. pandas is the ideal tool for all of these tasks.
These are universal data structures. Virtually all modern programming languages
support them in one form or another. It makes sense that a data format that is
interchangeable with programming languages also be based on these structures.
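In Python the two structures map onto dict and list. The sketch below parses line-delimited JSON of the kind the project's loader reads; the field names title, source and mrp appear in the project's source, while the sample values and the tags field are made up for illustration.

```python
import json

lines = [
    '{"title": "IPhone 6", "source": "Flipkart", "mrp": 500, "tags": ["phone", "apple"]}',
    'this line is not JSON',
]

records = []
for line in lines:
    try:
        records.append(json.loads(line))   # JSON object -> dict, array -> list
    except ValueError:                     # raised for malformed JSON
        continue                           # skip lines that cannot be parsed

# Serialising and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(records[0]))
```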
8. DIFFLIB
This module provides classes and functions for comparing sequences. It can be used,
for example, for comparing files, and can produce difference information in various
formats, including HTML and context and unified diffs. For comparing directories
and files, see also the filecmp module.
8.1. SEQUENCE MATCHER
class difflib.SequenceMatcher
This is a flexible class for comparing pairs of sequences of any type, so long as the
sequence elements are hashable. The basic algorithm predates, and is a little fancier
than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The idea is to find the longest contiguous
matching subsequence that contains no "junk" elements (the Ratcliff and Obershelp
algorithm doesn't address junk). The same idea is then applied recursively to the
pieces of the sequences to the left and to the right of the matching subsequence. This
does not yield minimal edit sequences, but does tend to yield matches that "look
right" to people.
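The behaviour described above can be observed on two product titles. This is a minimal sketch in Python 3 with example strings chosen for illustration; the project's own code calls quick_ratio() and keeps pairs scoring at least 0.70.

```python
from difflib import SequenceMatcher

a = "sony xperia z"
b = "xperia z black"

matcher = SequenceMatcher(None, a, b)   # None: no element is treated as junk

# The longest contiguous block shared by both strings.
block = matcher.find_longest_match(0, len(a), 0, len(b))
longest = a[block.a:block.a + block.size]

# ratio() is 2*M/T: M matched characters, T total length of both strings.
ratio = matcher.ratio()

# quick_ratio() is a cheaper upper bound on ratio().
upper_bound = matcher.quick_ratio()

print(longest, ratio, upper_bound)
```

Because quick_ratio() only bounds the true ratio from above, a pair that passes a quick_ratio() threshold can still score lower with ratio(); using ratio() for the final decision is the stricter choice.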
Within these general constraints, many variations are in use. Therefore "CSV" files
are not entirely portable. Nevertheless, the variations are fairly small, and many
implementations allow users to preview the first few lines of the file (which is feasible
because it is plain text), and then specify the delimiter character(s), quoting rules, etc.
If a particular CSV file's variations fall outside what a particular receiving program
supports, it is often feasible to examine and edit the file by hand or write a script or
program to fix the problem.
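Python's csv module supports this preview-then-specify workflow. The sketch below, written for Python 3, guesses the delimiter of a small semicolon-separated sample with csv.Sniffer and then reads it; the sample rows are invented for the example.

```python
import csv
import io

# A "CSV" file that uses semicolons instead of commas.
sample = "title;mrp;source\nIPhone 6;500;Flipkart\nXperia Z;320;SnapDeal\n"

# Guess the dialect from the text, restricted to a few likely delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
rows = list(csv.reader(io.StringIO(sample), dialect))

# When the delimiter is already known it can be given directly instead.
rows_explicit = list(csv.reader(io.StringIO(sample), delimiter=";"))
```

Sniffing is a heuristic and can fail on very short or irregular samples, so passing the delimiter explicitly is the safer option whenever the file's format is known.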
10. DESIGN DOCUMENT
11. SOURCE CODE
'''
Created on 7-May-2015
'''
import csv
import operator
import numpy as np
from collections import defaultdict
from operator import itemgetter
import json
import pandas as pd
from difflib import SequenceMatcher as SM


class CodingTest:
    '''
    Loads crawled product records and matches product titles across sites.
    '''

    def __init__(self):
        '''
        initialization of the variables.
        '''
        self.dataFile = "C:\\Python27\\project\\data.txt"
        self.csvToList = list()
        self.dataList = list()
        self.frame = pd.DataFrame()
        self.SnapDeal = list()
        self.Flipkart = list()
        self.Amazon = list()
        self.gg = {}
        self.values = dict()
        self.competitorList = ['SnapDeal', 'Flipkart', 'Amazon']
        self.mylist = list()

    def createDataDict(self, data, a, mylist):
        # Copy the fields of one crawled record into a flat dictionary
        # and append it to the list kept for its source site.
        title = data['title']
        mrp = data['mrp']
        source = data['source']
        url = data['url']
        stock = data['stock']
        selling_price = data['available_price']
        mylist.append({'Title': title, 'MRP': mrp, 'Source': source,
                       'URL': url, 'Stock_Status': stock,
                       'Selling_Price': selling_price})
        return mylist

    def filterData(self):
        dataList = list()
        self.loadFile = open(self.dataFile, "r")
        for line in self.loadFile:
            try:
                data = json.loads(line)
            except ValueError:
                # skip lines that are not valid JSON
                continue
            # Assumption: every parsed record is collected; the rest of
            # this loop body is incomplete in the source.
            dataList.append(data)
        self.loadFile.close()
        return dataList
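    def loadDataFile(self):
        '''
        function to load data file.
        '''
        SnapDeal = list()
        Flipkart = list()
        Amazon = list()
        self.loadFile = open(self.dataFile, "r")
        print 'file is ok'
        for line in self.loadFile:
            try:
                data = json.loads(line)
                # Assumption: the SnapDeal branch is restored by symmetry
                # with the Flipkart and Amazon branches below.
                if data['source'] == 'SnapDeal':
                    self.createDataDict(data, 'SnapDeal', SnapDeal)
                elif data['source'] == 'Flipkart':
                    self.createDataDict(data, 'Flipkart', Flipkart)
                elif data['source'] == 'Amazon':
                    self.createDataDict(data, 'Amazon', Amazon)
            except (ValueError, KeyError):
                # malformed line or record with a missing field
                continue
        self.loadFile.close()
        # Return the lists themselves so the result is defined even
        # when a site has no records.
        return (SnapDeal, Flipkart, Amazon)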
    def productMatching(self):
        a = list()   # titles to match
        d = list()   # candidate titles
        e = list()   # best match per title: (ratio, title, candidate)
        f = list()
        r = list()   # full records of the matched candidates
        j = 0.0
        Snapdeal, Flipkart, Amazon = self.loadDataFile()
        biglist1_indexed = {}
        # Assumption: the loops that fill these lists are truncated in
        # the source. SnapDeal titles are matched against Flipkart
        # titles, and Amazon titles are collected in f.
        for item in Snapdeal:
            a.append(item['Title'])
        for item in Flipkart:
            d.append(item['Title'])
            biglist1_indexed[item['Title']] = item
        for item in Amazon:
            f.append(item['Title'])
        for i in a:
            c = list()
            for k in d:
                s = SM(None, i, k)
                j = s.quick_ratio()
                if j >= .70:  # you can adjust the threshold as per your requirement.
                    c.append((j, i, k))
            # keep only the highest-scoring candidate for this title
            c.sort(reverse=True)
            if c:
                e.append(c[0])
        print 'List containing the top matching results with respective matching quotient'
        for s in e:
            print s
        for i in biglist1_indexed:
            for t in e:
                if t[2] == i:
                    r.append(biglist1_indexed[i])
        if r:  # r[0] would fail when nothing matched
            keys = r[0].keys()
            with open('C:\\Python27\\project\\aa.csv', 'wb') as output_file:
                dict_writer = csv.DictWriter(output_file, keys)
                dict_writer.writeheader()
                dict_writer.writerows(r)
        print 'done'
        '''
        make a dictionary which contains the attributes of snapdeal also
        '''
    def createCSVReport(self, mydict):
        # Assumption: mydict (the mapping to be written) is passed in;
        # the original listing referenced it without defining it here.
        columns = ['Flipkart Product Title', 'Flipkart Pro. Stock Status',
                   'Flipkart Pro MRP', 'Flipkart Selling Price']
        with open('C:\\Python27\\project\\aa.csv', 'w') as f:
            for key, value in mydict.items():
                f.write('{0},{1}\n'.format(key, value))

    def comupteRecallAndPrecision(self):
        return(recall,precision)
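The listing above does not show how recall and precision are obtained. One common way to compute them for a matching task is sketched below in Python 3. The function name, the representation of a match as a pair of titles, and the sample pairs are all assumptions made for illustration, not the project's method.

```python
def recall_and_precision(predicted, expected):
    """Return (recall, precision) for a set of predicted match pairs.

    predicted: pairs the matcher reported
    expected:  pairs known to be correct (hand-labelled, hypothetical)
    """
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    # Guard against empty sets to avoid division by zero.
    recall = correct / float(len(expected)) if expected else 0.0
    precision = correct / float(len(predicted)) if predicted else 0.0
    return recall, precision


predicted = {("iphone 6", "i-phone6"), ("xperia z", "xperia z1")}
expected = {("iphone 6", "i-phone6"),
            ("xperia z", "sony xperia z"),
            ("moto g", "motorola moto g")}

recall, precision = recall_and_precision(predicted, expected)
```

Precision is the share of reported matches that are correct, while recall is the share of true matches that were found; raising the similarity threshold usually trades recall for precision.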
12. TESTING
13. REPORTS
System integration was tested by running productMatching(), the most important
function, which is linked and integrated with the other functions.
productMatching() contains the product-matching algorithm, which is the most
important aspect of this program.
The integrated system was tested as a whole by running it in its production
environment. The system testing report is shown above in the screenshot.
14.
14.1. INPUT
This is the input data in JSON format, which was crawled from the internet. The data
is the product information of various e-commerce sites (Flipkart, Snapdeal, Amazon).
14.2. OUTPUT
15. CONCLUSION
This API is system software that helps a Business Intelligence team devise market
strategies. Its main component is the product-matching algorithm, which matches
products across e-commerce platforms and gives analytical insight into the market
capture of a product or a company against its competitors. The API is designed to
reduce the effort of such comparisons many times over, and e-commerce companies can
use it for competitive analysis. They can use the results generated by this project
to plan market strategies that help them capture more of the market and thereby
increase profits. Use of this software by e-commerce companies should lead to better
strategy and planning, and should encourage software developers to pay more attention
to this area, helping business organizations perform data analysis more efficiently
and in one place.
16.
The software has its own set of limitations. The foremost is the efficiency of the
algorithm, which is currently not 100% efficient. Product names vary between sites:
for example, a product listed as 'sony xperia z' on Flipkart may be listed as
'xperia z' on Snapdeal. In such cases the algorithm may not function efficiently and
may give unexpected results. To solve this problem, higher-level mathematical
concepts such as probability models can be applied in the algorithm.
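One simple way to make matching less sensitive to a missing brand word is to compare titles word by word instead of character by character. The sketch below, in Python 3, is only an illustration of that idea; it is not the probabilistic model mentioned above, and the function name and sample titles are hypothetical.

```python
def token_containment(a, b):
    """Share of the shorter title's words that also occur in the longer one."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    # Compare from the point of view of the title with fewer words.
    if len(words_a) <= len(words_b):
        shorter, longer = words_a, words_b
    else:
        shorter, longer = words_b, words_a
    return len(shorter & longer) / float(len(shorter))


same_product = token_containment("sony xperia z", "xperia z")
different_product = token_containment("sony xperia z", "sony bravia tv")
```

A score of this kind could be combined with the character-level ratio, for example by requiring both to exceed a threshold; any such thresholds would have to be tuned on real data.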
The GUI of this project has not yet been developed and the software is still in its
infancy, which is why it is not very user-friendly. At present it can be operated
only by people who have some knowledge of the internal structure and working of the
API.
17.
The application currently matches the products along with their data and outputs the
data in the desired format in one place. However, the algorithm used in the
application is not 100% efficient. In future the algorithm can be improved so that it
produces more accurate results, with 100% accuracy as the goal.
The application can also be made more intuitive, interactive and user-friendly by
building a proper GUI for it.
18. REFERENCES
https://www.python.org/
en.wikipedia.org/wiki/Python_(programming_language)
stackoverflow.com/
www.lfd.uci.edu/~gohlke/pythonlibs/
www.codeproject.com/