
B.K. BIRLA INSTITUTE OF ENGINEERING & TECHNOLOGY, PILANI (RAJ.)
SESSION 2015

CERTIFICATE

This is to certify that the project entitled "API to Perform Competitive Analysis of E-commerce Sites" has been submitted to the Rajasthan Technical University, Kota, in fulfillment
of the requirement for the award of the degree of Bachelor of Technology in Information
Technology by the following students of final year B.Tech. (Information Technology).

Gaurav Kumawat (11EBKIT012)
Vishesh Mishra (11EBKIT063)

Mr. Shridhar Dandin
(HOD, IT Deptt.)

Guide
Mrs. Sonam Mittal Sahu


CONTENTS

1. Abstract
1.1. Workflow
1.2. Base Language
1.3. Input
1.4. Output
2. Introduction
2.1. Background
2.2. Objectives
2.3. Purpose, Scope & Applicability
3. Tools & Environments Used
4. Python
4.1. Python Features
4.2. Statements & Control Flow
4.3. Expressions
4.4. Methods
4.5. Mathematics
4.6. Libraries
4.7. Development Environments
5. Numpy
6. Python Data Analysis Library
7. JSON
8. Difflib
8.1. Sequence Matcher
9. Comma Separated Values (CSV)
10. Design Document
10.1. Modularization Details
11. Source Code
12. Testing
12.1. Unit Testing
12.2. Integration Testing
12.3. System Testing
13. Reports
13.1. Unit Testing
13.2. Integration Testing
13.3. System Testing
14. Input & Output Screens
15. Conclusion
16. Limitations of the Project
17. Future Applications of the Project
18. References

TEAM INFORMATION

MEMBERS:

Gaurav Kumawat (11EBKIT012)
Branch: Information Technology

Vishesh Mishra (11EBKIT063)
Branch: Information Technology

1. ABSTRACT
This API is a software system that helps a Business Intelligence team devise
market strategies. Its core is a product-matching algorithm that matches the same
product across different e-commerce platforms and gives analytical insight into the
market position of a product or a company against its competitors. For instance,
suppose a well-known company such as Flipkart wants to compare the selling price of
one of its products, say 'IPhone 6', with its competitors' prices. The same product
may be listed under a different name on each platform: on Amazon it might appear as
'i-phone6 black' and on Snapdeal as 'white I-phone/6 32gb'. Searching and sorting
such listings by hand is messy and time-consuming. Our API is designed to address
this problem and to reduce the overall effort manyfold.

1.1. WORKFLOW

Development Phase: Analyze the problem statement, carry out research and synthesize a
theoretical viewpoint; develop the algorithms.
Implementation Phase: Turn the theoretical concepts into working software code.
Evaluation Phase: Test the algorithms and verify the results; compute recall and
precision for the output.

1.2. BASE LANGUAGE: Python.

1.3. INPUT: Web-crawled data, dataset (JSON format).

1.4. OUTPUT: Business Intelligence report, CSV report.

2. INTRODUCTION

2.1. BACKGROUND

There are numerous e-commerce sites on the web, and many products are sold on
several of these sites at different prices. It can therefore be quite time-consuming
for an e-commerce company to compare its product information with that of other
companies and perform competitive analysis. It would be much easier for the company
if it could compare the prices and other information of products listed on different
sites in a single location.

2.2. OBJECTIVES

To develop, implement and evaluate an API to perform competitive analysis on
various e-commerce business platforms.

2.3. PURPOSE, SCOPE AND APPLICABILITY

The purpose of this project is to reduce the user's workload of comparing the price
of a product across different e-commerce sites. There is a lot of scope in this area,
since the number of e-commerce sites is expected to grow quickly and many of them
will sell common products. Our API addresses this problem by providing a platform
that helps an e-commerce company compare its product data with that of various other
e-commerce sites and stay competitive by adjusting its prices accordingly. Currently
the API may be somewhat complex to use, since it is still in the development stage
and the algorithm used is not yet 100% accurate, but it can be improved further in
future. It is applicable in scenarios where there are a large number of e-commerce
sites and it would be difficult to analyze the product information of common products
manually.

3. TOOLS AND ENVIRONMENT USED

Environment:
Operating system used: Windows 7/8
Development environment: Python 2.7.9 IDLE
Tools: Python modules, which include:
numpy
pandas
csv
json
difflib (including its SequenceMatcher class)
operator

Other tools used: Visual C++ for Python 2.7 (VCForPython27), PythonGUI.

Dependencies

Python 2.7
csv
operator
numpy
pandas
json
difflib

4. PYTHON (PROGRAMMING LANGUAGE)

Python is a high-level, interpreted, interactive and object-oriented scripting language.
Python is designed to be highly readable. It frequently uses English keywords where
other languages use punctuation, and it has fewer syntactical constructions than
other languages.

Python is Interpreted: Python is processed at runtime by the interpreter. You
do not need to compile your program before executing it. This is similar to
PERL and PHP.

Python is Interactive: You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.

Python is Object-Oriented: Python supports the object-oriented style or
technique of programming that encapsulates code within objects.

Python is a Beginner's Language: Python is a great language for
beginner-level programmers and supports the development of a wide range of
applications, from simple text processing to WWW browsers to games.

4.1. PYTHON FEATURES

Python's features include:

Easy-to-learn: Python has few keywords, a simple structure, and a clearly
defined syntax. This allows the student to pick up the language quickly.

Easy-to-read: Python code is clearly laid out and easy on the eyes.

Easy-to-maintain: Python's source code is fairly easy to maintain.

A broad standard library: The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.

Interactive Mode: Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.

Portable: Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.

Extendable: You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.

Databases: Python provides interfaces to all major commercial databases.

GUI Programming: Python supports GUI applications that can be created
and ported to many system calls, libraries and windows systems, such as
Windows MFC, Macintosh, and the X Window system of Unix.

Scalable: Python provides a better structure and support for large programs
than shell scripting.

Apart from the above-mentioned features, Python has a long list of good features; a few
are listed below:

It supports functional and structured programming methods as well as OOP.

It can be used as a scripting language or can be compiled to byte-code for
building large applications.

It provides very high-level dynamic data types and supports dynamic type
checking.

It supports automatic garbage collection.

It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

4.2. STATEMENTS AND CONTROL FLOW

Python's statements include (among others):

The if statement, which conditionally executes a block of code, along
with else and elif (a contraction of else-if).

The for statement, which iterates over an iterable object, capturing each
element to a local variable for use by the attached block.

The while statement, which executes a block of code as long as its condition
is true.

The try statement, which allows exceptions raised in its attached code block
to be caught and handled by except clauses; it also ensures that clean-up code in
a finally block will always be run regardless of how the block exits.

The class statement, which executes a block of code and attaches its local
namespace to a class, for use in object-oriented programming.

The def statement, which defines a function or method.

The with statement (from Python 2.5), which encloses a code block within a
context manager (for example, acquiring a lock before the block of code is run
and releasing the lock afterwards, or opening a file and then closing it),
allowing RAII-like behavior.

The pass statement, which serves as a NOP. It is syntactically needed to
create an empty code block.

The assert statement, used during debugging to check for conditions that
ought to apply.

The yield statement, which returns a value from a generator function. From
Python 2.5, yield is also an operator. This form is used to implement coroutines.

The import statement, which is used to import modules whose functions or
variables can be used in the current program.

The print statement, which was changed to the print() function in Python 3.
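The statements listed above can be seen working together in a short sketch. It is only an illustration and not part of the project code: the function names read_titles and to_price and the sample values are hypothetical, and it is written for Python 3, whereas the project itself targets Python 2.7.

```python
import io


def read_titles(lines):
    """Yield the stripped, non-empty lines of an iterable of strings."""
    for line in lines:                # for: iterate over an iterable
        title = line.strip()
        if not title:                 # if: skip blank lines
            continue
        yield title                   # yield: makes this function a generator


def to_price(text):
    """Convert text to a float price, or return None if it is not numeric."""
    try:
        return float(text)
    except ValueError:                # raised by float() for non-numeric text
        return None


# with: the in-memory file (a stand-in for a real data file) is closed
# automatically when the block exits.
with io.StringIO(u"iPhone 6 \n\n Moto G\n") as handle:
    titles = list(read_titles(handle))

assert titles == ["iPhone 6", "Moto G"]   # assert: check an expected condition
prices = [to_price(p) for p in ["45000", "n/a"]]
```

Catching only ValueError in to_price, rather than using a bare except, keeps unrelated errors visible while still tolerating malformed price strings.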

4.3. EXPRESSIONS

Python expressions are similar to those of languages such as C and Java:

Addition, subtraction, and multiplication are the same, but the behavior of
division differs (see Mathematics for details). Python also added the ** operator
for exponentiation.

In Python, == compares by value, in contrast to Java, where it compares by
reference. (Value comparisons in Java use the equals() method.)
Python's is operator may be used to compare object identities (comparison by
reference). Comparisons may be chained, for example a <= b <= c.

Python uses the words and, or, not for its boolean operators rather than the
symbolic &&, ||, ! used in Java and C.

Python has a type of expression termed a list comprehension. Python 2.4
extended list comprehensions into a more general expression termed
a generator expression.[40]

Anonymous functions are implemented using lambda expressions; however,
these are limited in that the body can only be a single expression.

Conditional expressions in Python are written as x if c else y [57] (different in
order of operands from the ?: operator common to many other languages).

Python makes a distinction between lists and tuples. Lists are written as [1, 2,
3], are mutable, and cannot be used as the keys of dictionaries (dictionary keys
must be immutable in Python). Tuples are written as (1, 2, 3), are immutable and
thus can be used as the keys of dictionaries, provided all elements of the tuple are
immutable. The parentheses around the tuple are optional in some contexts.
Tuples can appear on the left side of an equal sign; hence a statement like x, y =
y, x can be used to swap two variables.

Python has a "string format" operator %. This functions analogously
to printf format strings in C, e.g. "foo=%s bar=%d" % ("blah", 2) evaluates
to "foo=blah bar=2". In Python 3 and 2.6+, this was supplemented by
the format() method of the str class, e.g. "foo={0} bar={1}".format("blah", 2).

Python has various kinds of string literals:

Strings delimited by single or double quotation marks. Unlike in Unix
shells, Perl and Perl-influenced languages, single quotation marks and double
quotation marks function identically. Both kinds of string use the backslash
( \ ) as an escape character and there is no implicit string interpolation such
as "$foo".

Triple-quoted strings, which begin and end with a series of three single
or double quotation marks. They may span multiple lines and function
like here documents in shells, Perl and Ruby.

Raw string varieties, denoted by prefixing the string literal with an r.
No escape sequences are interpreted; hence raw strings are useful where
literal backslashes are common, such as regular expressions and Windows-style
paths. Compare "@-quoting" in C#.

Python has index and slice expressions on lists, denoted
as a[key], a[start:stop] or a[start:stop:step]. Indexes are zero-based, and
negative indexes are relative to the end. Slices take elements from the start index
up to, but not including, the stop index. The third slice parameter,
called step or stride, allows elements to be skipped and reversed. Slice indexes
may be omitted, for example a[:] returns a copy of the entire list. Each element of
a slice is a shallow copy.

In Python, a distinction between expressions and statements is rigidly enforced, in
contrast to languages such as Common Lisp, Scheme, or Ruby. This leads to some
duplication of functionality. For example:

List comprehensions vs. for-loops

Conditional expressions vs. if blocks

The eval() vs. exec() built-in functions (in Python 2, exec is a statement);
the former is for expressions, the latter is for statements.
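Several of the expression forms described in this section are shown side by side in the sketch below. The product titles, prices and variable names are made-up illustration values, not data from the project.

```python
titles = ["  iPhone 6 ", "Moto G", "   "]

# List comprehension: strip each title and keep the non-empty ones.
cleaned = [t.strip() for t in titles if t.strip()]

# The same result written as a for-loop with an if block.
cleaned_loop = []
for t in titles:
    if t.strip():
        cleaned_loop.append(t.strip())

# Conditional expression: value_if_true if condition else value_if_false.
label = "match" if 0.82 >= 0.70 else "no match"

# Tuples are hashable, so they can be dictionary keys; lists cannot.
price = {("Flipkart", "iPhone 6"): 45000}

# Tuple assignment swaps two variables without a temporary.
x, y = 1, 2
x, y = y, x

# Slices: start is included, stop is excluded, a negative step reverses.
first_two = cleaned[0:2]
reversed_titles = cleaned[::-1]

# String formatting with the % operator and with str.format().
line_a = "title=%s price=%d" % ("iPhone 6", 45000)
line_b = "title={0} price={1}".format("iPhone 6", 45000)
```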

4.4. METHODS

Methods on objects are functions attached to the object's class; the
syntax instance.method(argument) is, for normal methods and functions, syntactic
sugar for Class.method(instance, argument). Python methods have an
explicit self parameter to access instance data, in contrast to the
implicit self (or this) in some other object-oriented programming languages (e.g. C++,
Java, Objective-C, or Ruby).
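A minimal sketch of the equivalence between instance.method(argument) and Class.method(instance, argument) follows. The Product class and its discount method are hypothetical and exist only for this illustration.

```python
class Product(object):
    """A hypothetical product record used only for this illustration."""

    def __init__(self, title, price):
        # self is the explicit first parameter that refers to the instance.
        self.title = title
        self.price = price

    def discount(self, percent):
        """Return the price after taking off the given percentage."""
        return self.price * (100 - percent) / 100.0


phone = Product("iPhone 6", 45000)

# The usual call syntax ...
via_instance = phone.discount(10)
# ... is shorthand for calling the function on the class and passing
# the instance explicitly as self.
via_class = Product.discount(phone, 10)
```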

4.5. MATHEMATICS

Python has the usual C arithmetic operators (+, -, *, /, %). It also has ** for
exponentiation, e.g. 5**3 == 125 and 9**.5 == 3.0, and a new matrix multiply
operator @ coming in 3.5.[61]
The behavior of division has changed significantly over time.[62]

Python 2.1 and earlier use the C division behavior. The / operator is integer
division if both operands are integers, and floating point division otherwise.
Integer division rounds towards 0, e.g. 7 / 3 == 2 and -7 / 3 == -2.

Python 2.2 changes integer division to round towards negative infinity,
e.g. 7 / 3 == 2 and -7 / 3 == -3. The floor division // operator is introduced. So 7
// 3 == 2, -7 // 3 == -3, 7.5 // 3 == 2.0 and -7.5 // 3 == -3.0. Adding from
__future__ import division causes a module to use Python 3.0 rules for division
(see next).

Python 3.0 changes / to always be floating point division. In Python terms, the
pre-3.0 / is "classic division", the 3.0 / is "real division", and // is "floor division".

Rounding towards negative infinity, though different from most languages, adds
consistency. For instance, it means that the equation (a+b) // b == a // b + 1 is always
true. It also means that the equation b * (a // b) + a % b == a is valid for both positive
and negative values of a. However, maintaining the validity of this equation means
that while the result of a % b is, as expected, in the half-open interval [0,b),
where b is a positive integer, it has to lie in the interval (b,0] when b is negative.[63]
Python provides a round function for rounding floats to integers. Versions before 3
use round-away-from-zero: round(0.5) is 1.0, round(-0.5) is -1.0.[64] Python 3
uses round-to-even: round(1.5) is 2, round(2.5) is 2.[65] The Decimal type/class in
module decimal (since version 2.4) provides exact numerical representation and
several rounding modes.
Python allows boolean expressions with multiple equality relations in a manner that is
consistent with general usage in mathematics. For example, the expression a < b <
c tests whether a is less than b and b is less than c. C-derived languages interpret
this expression differently: in C, the expression would first evaluate a < b, resulting
in 0 or 1, and that result would then be compared with c.[66]
Due to Python's extensive mathematics library, it is frequently used as a scientific
scripting language to aid in problems such as data processing and manipulation.
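The division, modulo and comparison rules above can be checked directly. The sketch below uses only the // and % operators and the decimal module, so it behaves the same on Python 2.2+ and Python 3; the variable names are chosen for this illustration.

```python
from decimal import Decimal

# Floor division rounds towards negative infinity.
floor_results = (7 // 3, -7 // 3, 7.5 // 3, -7.5 // 3)

# b * (a // b) + a % b == a holds for every sign combination,
# and the sign of a % b follows the sign of b.
identity_holds = all(
    b * (a // b) + a % b == a
    for a in (7, -7)
    for b in (3, -3)
)
remainders = (7 % 3, -7 % 3, 7 % -3, -7 % -3)

# Chained comparison: 3 > 2 > 1 means (3 > 2) and (2 > 1).
chained = 3 > 2 > 1
# The C reading evaluates (3 > 2) first and then compares that result with 1.
c_style = (3 > 2) > 1

# Decimal represents decimal fractions exactly; binary floats do not.
exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
inexact = 0.1 + 0.2 == 0.3
```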

4.6. LIBRARIES

Python has a large standard library, commonly cited as one of Python's greatest
strengths,[67] providing tools suited to many tasks. This is deliberate and has been
described as a "batteries included"[26] Python philosophy. For Internet-facing
applications, a large number of standard formats and protocols (such
as MIME and HTTP) are supported. Modules for creating graphical user interfaces,
connecting to relational databases, pseudorandom number generators, arithmetic with
arbitrary precision decimals,[68] manipulating regular expressions, and doing unit
testing are also included.
Some parts of the standard library are covered by specifications (for example,
the WSGI implementation wsgiref follows PEP 333[69]), but the majority of the
modules are not. They are specified by their code, internal documentation, and test
suite (if supplied). However, because most of the standard library is cross-platform
Python code, there are only a few modules that must be altered or completely
rewritten by alternative implementations.
The standard library is not essential to run Python or embed Python within an
application. Blender 2.49, for instance, omits most of the standard library.
As of January 2015, the Python Package Index, the official repository of third-party
software for Python, contains more than 54,000 packages offering a wide range of
functionality, including:

graphical user interfaces, web frameworks, multimedia, databases, networking
and communications

test frameworks, automation and web scraping, documentation tools, system
administration

scientific computing, text processing, image processing

4.7. DEVELOPMENT ENVIRONMENTS

Most Python implementations (including CPython) can function as a command line
interpreter, for which the user enters statements sequentially and receives the results
immediately (REPL). In short, Python acts as a shell.
Other shells add capabilities beyond those in the basic interpreter,
including IDLE and IPython. While generally following the visual style of the Python
shell, they implement features like auto-completion, retention of session state, and
syntax highlighting. In addition to standard desktop Python IDEs (integrated
development environments), there are also browser-based IDEs: Sage (intended for
developing science and math-related Python programs) and PythonAnywhere, a
browser-based IDE and hosting environment.

5. NUMPY
NumPy is the fundamental package for scientific computing with Python. It contains
among other things:

a powerful N-dimensional array object

sophisticated (broadcasting) functions

tools for integrating C/C++ and Fortran code

useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data-types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
NumPy is an extension to the Python programming language, adding support for
large, multi-dimensional arrays and matrices, along with a large library of
high-level mathematical functions to operate on these arrays. The ancestor of NumPy,
Numeric, was originally created by Jim Hugunin with contributions from several
other developers. In 2005, Travis Oliphant created NumPy by incorporating features
of the competing Numarray into Numeric, with extensive modifications. NumPy
is open source and has many contributors.

6. PYTHON DATA ANALYSIS LIBRARY

pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python programming language.
Library Highlights

A fast and efficient DataFrame object for data manipulation with
integrated indexing;

Tools for reading and writing data between in-memory data structures
and different formats: CSV and text files, Microsoft Excel, SQL databases,
and the fast HDF5 format;

Intelligent data alignment and integrated handling of missing data: gain
automatic label-based alignment in computations and easily manipulate messy
data into an orderly form;

Flexible reshaping and pivoting of data sets;

Intelligent label-based slicing, fancy indexing, and subsetting of large
data sets;

Columns can be inserted and deleted from data structures for size
mutability;

Aggregating or transforming data with a powerful group by engine
allowing split-apply-combine operations on data sets;

High performance merging and joining of data sets;

Hierarchical axis indexing provides an intuitive way of working with
high-dimensional data in a lower-dimensional data structure;

Time series-functionality: date range generation and frequency
conversion, moving window statistics, moving window linear regressions, date
shifting and lagging. Even create domain-specific time offsets and join time
series without losing data;

Highly optimized for performance, with critical code paths written
in Cython or C.

Python with pandas is in use in a wide variety of academic and
commercial domains, including Finance, Neuroscience, Economics, Statistics,
Advertising, Web Analytics, and more.

pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with relational or labeled data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real
world data analysis in Python. Additionally, it has the broader goal of becoming the
most powerful and flexible open source data analysis / manipulation tool
available in any language. It is already well on its way toward this goal.
pandas is well suited for many different kinds of data:

Tabular data with heterogeneously-typed columns, as in an SQL table
or Excel spreadsheet

Ordered and unordered (not necessarily fixed-frequency) time series
data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with
row and column labels

Any other form of observational / statistical data sets. The data actually
need not be labeled at all to be placed into a pandas data structure

Here are just a few of the things that pandas does well:

Easy handling of missing data (represented as NaN) in floating point
as well as non-floating point data

Size mutability: columns can be inserted and deleted from DataFrame
and higher dimensional objects

Automatic and explicit data alignment: objects can be explicitly
aligned to a set of labels, or the user can simply ignore the labels and
let Series, DataFrame, etc. automatically align the data for you in
computations

Powerful, flexible group by functionality to perform split-apply-combine
operations on data sets, for both aggregating and transforming
data

Make it easy to convert ragged, differently-indexed data in other
Python and NumPy data structures into DataFrame objects

Intelligent label-based slicing, fancy indexing, and subsetting of large
data sets

Intuitive merging and joining data sets

Flexible reshaping and pivoting of data sets

Hierarchical labeling of axes (possible to have multiple labels per
tick)

Robust IO tools for loading data from flat files (CSV and delimited),
Excel files, databases, and saving / loading data from the
ultrafast HDF5 format

Time series-specific functionality: date range generation and
frequency conversion, moving window statistics, moving window
linear regressions, date shifting and lagging, etc.

Many of these principles are here to address the shortcomings frequently experienced
using other languages / scientific research environments. For data scientists, working
with data is typically divided into multiple stages: munging and cleaning data,
analyzing / modeling it, then organizing the results of the analysis into a form suitable
for plotting or tabular display. pandas is the ideal tool for all of these tasks.

Some other notes

pandas is fast. Many of the low-level algorithmic bits have been
extensively tweaked in Cython code. However, as with anything else,
generalization usually sacrifices performance. So if you focus on one
feature for your application you may be able to create a faster
specialized tool.

pandas is a dependency of statsmodels, making it an important part of
the statistical computing ecosystem in Python.

pandas has been used extensively in production in financial
applications.

7. JSON (JAVASCRIPT OBJECT NOTATION)


JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy
for humans to read and write. It is easy for machines to parse and generate. It is based
on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd
Edition - December 1999. JSON is a text format that is completely language
independent but uses conventions that are familiar to programmers of the C-family of
languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others.
These properties make JSON an ideal data-interchange language.
JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as
an object, record, struct, dictionary, hash table, keyed list, or associative array.

An ordered list of values. In most languages, this is realized as an array,
vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages
support them in one form or another. It makes sense that a data format that is
interchangeable with programming languages also be based on these structures.

JSON IN PYTHON 2.X

JSON support is provided by the json module; its default behavior is described here.
Transformation from JSON to Python and back is symmetrical (only formatting is
lost).
Transformation from Python to JSON and back is not: for example,
strings are converted to unicode objects.
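The round-trip behavior described above can be sketched with the standard json module. The field names title, mrp, source and stock are the ones read by the project's source code; the record values are invented for this illustration. The sketch is written for Python 3, where the asymmetry shows up with tuples (in Python 2 it additionally affects str, which comes back as unicode).

```python
import json

# One line of a JSON dataset (the values are made up for illustration).
line = '{"title": " iPhone 6 ", "mrp": 53500, "source": "Flipkart", "stock": true}'

# JSON to Python: an object becomes a dict, true becomes True,
# a number becomes an int.
record = json.loads(line)

# JSON -> Python -> JSON -> Python gives back an equal dict.
encoded = json.dumps(record, sort_keys=True)
round_trip = json.loads(encoded)

# Python -> JSON -> Python is not symmetrical: a tuple is written
# as a JSON array and comes back as a list.
original = {"sizes": (16, 64)}
back = json.loads(json.dumps(original))
```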

OPERATOR: FUNCTIONAL INTERFACE TO BUILT-IN OPERATORS

Functional programming using iterators occasionally requires creating small functions
for simple expressions. Sometimes these can be expressed as lambda functions, but
some operations do not need to be implemented with custom functions at all.
The operator module defines functions that correspond to built-in operations for
arithmetic and comparison, as well as sequence and dictionary operations.
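As a small illustration of how operator functions replace hand-written lambdas, the sketch below sorts and selects records with itemgetter. The record keys Title and Selling_Price follow the project's source code; the values and the scored-match tuples are hypothetical.

```python
from functools import reduce
from operator import itemgetter, mul

records = [
    {"Title": "iPhone 6", "Selling_Price": 45000},
    {"Title": "Moto G", "Selling_Price": 12000},
]

# itemgetter("Selling_Price") behaves like lambda r: r["Selling_Price"].
by_price = sorted(records, key=itemgetter("Selling_Price"))
cheapest = by_price[0]["Title"]

# itemgetter(0) picks the score out of (score, title) tuples.
matches = [(0.74, "iphone 5s"), (0.91, "iphone 6 black")]
best = max(matches, key=itemgetter(0))

# mul is the function form of the * operator.
product = reduce(mul, [2, 3, 4])
```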

8. DIFFLIB
This module provides classes and functions for comparing sequences. It can be used
for example, for comparing files, and can produce difference information in various
formats, including HTML and context and unified diffs. For comparing directories
and files, see also the filecmp module.

class difflib.SequenceMatcher

8.1. SEQUENCE MATCHER

This is a flexible class for comparing pairs of sequences of any type, so long as the
sequence elements are hashable. The basic algorithm predates, and is a little fancier
than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the
hyperbolic name "gestalt pattern matching". The idea is to find the longest contiguous
matching subsequence that contains no "junk" elements (the Ratcliff and Obershelp
algorithm doesn't address junk). The same idea is then applied recursively to the
pieces of the sequences to the left and to the right of the matching subsequence. This
does not yield minimal edit sequences, but does tend to yield matches that "look
right" to people.
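The following sketch shows how SequenceMatcher can score product titles such as the ones mentioned in the abstract. It is an illustration under stated assumptions rather than the project's implementation: the helper best_match is hypothetical, it lower-cases titles before comparing, and it uses ratio(), whereas the project's source code calls quick_ratio() (an upper bound on ratio()) with a 0.70 threshold.

```python
from difflib import SequenceMatcher


def best_match(title, candidates, threshold=0.0):
    """Return (ratio, candidate) for the most similar candidate title.

    Returns None when no candidate reaches the threshold.
    """
    scored = []
    for candidate in candidates:
        # None means no junk filter; comparison is case-insensitive here.
        matcher = SequenceMatcher(None, title.lower(), candidate.lower())
        ratio = matcher.ratio()          # 2 * matches / total length, in [0, 1]
        if ratio >= threshold:
            scored.append((ratio, candidate))
    return max(scored) if scored else None


listings = ["i-phone6 black", "Moto G 2nd Gen"]
best = best_match("IPhone 6", listings)
nothing = best_match("IPhone 6", ["Moto G 2nd Gen"], threshold=0.70)
identical = SequenceMatcher(None, "iphone 6", "iphone 6").ratio()
```

Here best picks 'i-phone6 black' for 'IPhone 6', while the unrelated listing alone falls below the 0.70 threshold and yields None. The right threshold depends on the data and has to be tuned against known matches.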

9. COMMA-SEPARATED VALUES (CSV)


A comma-separated values (CSV) (also sometimes called character-separated
values) file stores tabular data (numbers and text) in plain-text form. Plain text means
that the file is a sequence of characters, with no data that has to be interpreted as
binary numbers. A CSV file consists of any number of records, separated by line
breaks of some kind; each record consists of fields, separated by some other character
or string, most commonly a literal comma or tab. Usually, all records have an
identical sequence of fields.
"CSV" refers to any file that:[2][4]
1. is plain text using a character set such as ASCII, various Unicode character
sets (e.g. UTF-8), EBCDIC, or Shift JIS,
2. consists of records (typically one record per line),
3. with the records divided into fields separated by delimiters (typically a single
reserved character such as comma, semicolon, or tab; sometimes the delimiter
may include optional spaces),
4. where every record has the same sequence of fields.

Within these general constraints, many variations are in use. Therefore "CSV" files
are not entirely portable. Nevertheless, the variations are fairly small, and many
implementations allow users to preview the first few lines of the file (which is feasible
because it is plain text), and then specify the delimiter character(s), quoting rules, etc.
If a particular CSV file's variations fall outside what a particular receiving program
supports, it is often feasible to examine and edit the file by hand or write a script or
program to fix the problem.
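Quoting is the main reason to use a CSV library instead of joining fields with commas by hand: a product title may itself contain a comma. The sketch below writes and re-reads two records with the standard csv module, using an in-memory buffer in place of a file. The column names follow the project's source code, the values are invented, and the code is written for Python 3.

```python
import csv
import io

rows = [
    {"Title": "iPhone 6, 16GB", "Source": "Flipkart", "Selling_Price": 45000},
    {"Title": "Moto G", "Source": "SnapDeal", "Selling_Price": 12000},
]

# Write a header row followed by one record per line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Source", "Selling_Price"])
writer.writeheader()
writer.writerows(rows)

# Read the text back: each record becomes a mapping keyed by the header.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
header_line = buffer.getvalue().splitlines()[0]
```

The title containing a comma is quoted on output and restored intact on input, while numeric fields come back as strings because CSV carries no type information.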

10. DESIGN DOCUMENT

10.1. MODULARIZATION DETAILS

In the source code we have used several libraries (also called modules): numpy,
json, csv, difflib and pandas. The project takes data in JSON format as input and
shows the output on the output screen, i.e. the Python shell. The output is also
stored in a CSV file.
INPUT DATA FORMAT: JSON
OUTPUT DATA FORMAT: CSV / Python shell

11.

SOURCE CODE

'''
Created on 7-May-2015

@author: gaurav & vishesh


'''

import csv
import operator
import numpy as np
from collections import defaultdict
from operator import itemgetter
import json
import pandas as pd
from difflib import SequenceMatcher as SM

26| P a g e

class CodingTest:
'''
'''
def __init__(self):

'''
initialization of the variables.
'''

self.dataFile = "C:\\Python27\\project\\data.txt"
self.csvToList = list()
self.dataList = list()
self.frame

= pd.DataFrame()

self.SnapDeal = list()
self.Flipkart = list()
self.Amazon

= list()

self.gg = {}
self.values = dict()
self.competitorList = ['SnapDeal','Flipkart','Amazon']

self.mylist = list()

def createDataDict(self,data,a,mylist):
title = data['title']
mrp = data['mrp']
source = data['source']
url = data['url']
stock = data['stock']
selling_price = data['available_price']
mylist.append({'Title':title,'MRP':mrp,'Source':source,
'URL':url,'Stock_Status':stock,'Selling_Price': selling_price})
return mylist

def filterData(self):
dataList = list()
self.loadFile = open(self.datafile, "r")
for line in self.loadFile:
try:
data = json.loads(line)
#print data

27| P a g e

a = dict(map(str.strip,x) for x in data.items())


print a
dataList = dataList.append(data)
except:
pass

#print dataList[0:4]

def loadDataFile(self):
'''
function to load data file.
'''
SnapDeal = list()
Flipkart = list()
Amazon = list()
self.loadFile = open("C:\\Python27\\project\\data.txt", "r")
print 'file is ok'

for line in self.loadFile:


try:
data = json.loads(line)
#data = dict(map(str.strip,x) for x in data.items())
#datalist = datalist.append(data)
if data['source'] == 'SnapDeal':
snapdeal = self.createDataDict(data,'SnapDeal',SnapDeal)

if data['source'] == 'Flipkart':
flipkart = self.createDataDict(data,'Flipkart',Flipkart)

if data['source'] == 'Amazon':
amazon = self.createDataDict(data,'Amazon',Amazon)

else:
pass
except:
continue
return(snapdeal,flipkart,amazon)

28| P a g e

    def productMatching(self):
        a = list()  # SnapDeal titles
        d = list()  # Flipkart titles
        e = list()  # best (ratio, SnapDeal title, Flipkart title) per SnapDeal product
        f = list()  # Amazon titles (not matched yet)
        r = list()  # Flipkart records that were matched
        Snapdeal, Flipkart, Amazon = self.loadDataFile()
        biglist1_indexed = {}  # Flipkart title -> Flipkart record

        for item in Snapdeal:
            a.append(item["Title"])

        for item in Flipkart:
            d.append(item['Title'])
            biglist1_indexed[item["Title"]] = item

        #for item in Amazon:
        #    f.append(item['Title'])

        for i in a:
            c = list()
            for k in d:
                s = SM(None, i, k)
                j = s.quick_ratio()
                if j >= .70:  # you can adjust the threshold as per your requirement.
                    c.append((j, i, k))
            # keep only the best-scoring Flipkart title for this SnapDeal title
            c.sort(reverse=True)
            if c:
                e.append(c[0])

        print 'List containing the top matching results with respective matching quotient'
        for s in e:
            print s

        # collect the full Flipkart record of every matched title
        for i in biglist1_indexed:
            for t in e:
                if t[2] == i:
                    r.append(biglist1_indexed[i])

        if not r:
            print 'no matches found'
            return

        keys = r[0].keys()
        with open('C:\\Python27\\project\\aa.csv', 'wb') as output_file:
            dict_writer = csv.DictWriter(output_file, keys)
            dict_writer.writeheader()
            dict_writer.writerows(r)
        print 'done'
        # Unfinished: make a dictionary which contains the attributes of
        # snapdeal also.
        #for t in e:
        #    if t[1] ==

    def createCSVReport(self, mydict):
        # mydict: mapping whose key/value pairs are written as two-column rows
        columns = ['Flipkart Product Title', 'Flipkart Pro. Stock Status',
                   'Flipkart Pro MRP', 'Flipkart Selling Price']
        with open('C:\\Python27\\project\\aa.csv', 'w') as f:
            for key, value in mydict.items():
                f.write('{0},{1}\n'.format(key, value))

    def comupteRecallAndPrecision(self):
        # Placeholder: recall and precision are not computed yet.
        recall = None
        precision = None
        return (recall, precision)
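The comupteRecallAndPrecision function is only a stub in the source. A minimal sketch of how the two figures could be computed is given below, assuming a hand-labelled set of correct (SnapDeal title, Flipkart title) pairs is available. The function name precision_recall and the sample pairs are hypothetical, and the sketch is written for Python 3 rather than the Python 2.7 used in the project.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for a set of predicted product matches.

    Both arguments are iterables of (snapdeal_title, flipkart_title) tuples.
    """
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    correct = len(predicted & truth)
    # precision: share of predicted matches that are correct
    precision = correct / len(predicted) if predicted else 0.0
    # recall: share of the true matches that were found
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical data: two of three predictions are right, two true pairs are missed.
predicted = [("A1", "B1"), ("A2", "B2"), ("A3", "B9")]
truth = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")]
precision, recall = precision_recall(predicted, truth)
print(precision, recall)
```

In the project, the predicted pairs would correspond to the (SnapDeal title, Flipkart title) parts of the tuples collected in the list e of productMatching.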

12. TESTING

12.1. UNIT TESTING


In computer programming, unit testing is a software testing method by which
individual units of source code, sets of one or more computer program modules
together with associated control data, usage procedures, and operating procedures, are
tested to determine whether they are fit for use.
Here we have treated the functions as individual modules or units and executed each
of them individually in order to test it.
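As an illustration of testing one function in isolation, the sketch below exercises a standalone copy of createDataDict with Python's unittest module. The copy drops self and the unused a argument, and the sample record is made up, so this is an example under stated assumptions (written for Python 3), not the project's actual test procedure.

```python
import unittest


def create_data_dict(data, mylist):
    """Standalone copy of createDataDict: append one renamed record."""
    mylist.append({
        'Title': data['title'],
        'MRP': data['mrp'],
        'Source': data['source'],
        'URL': data['url'],
        'Stock_Status': data['stock'],
        'Selling_Price': data['available_price'],
    })
    return mylist


class CreateDataDictTest(unittest.TestCase):
    def setUp(self):
        # Hypothetical record; the values are illustrative only.
        self.record = {
            'title': 'Sony Xperia Z', 'mrp': '38990', 'source': 'Flipkart',
            'url': 'http://example.com/p/1', 'stock': 'In Stock',
            'available_price': '30990',
        }

    def test_appends_one_renamed_record(self):
        result = create_data_dict(self.record, [])
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0]['Title'], 'Sony Xperia Z')
        self.assertEqual(result[0]['Selling_Price'], '30990')

    def test_missing_field_raises_key_error(self):
        del self.record['mrp']
        with self.assertRaises(KeyError):
            create_data_dict(self.record, [])
```

Saved to a file, the tests can be run with `python -m unittest <file name>`.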

12.2. INTEGRATION TESTING


Integration testing (sometimes called integration and testing, abbreviated I&T) is
the phase in software testing in which individual software modules are combined
and tested as a group. It occurs after unit testing and before validation testing.
Here we have tested the functions working together, i.e. in integrated form.

12.3. SYSTEM TESTING

System testing of software or hardware is testing conducted on a complete, integrated
system to evaluate the system's compliance with its specified requirements. System
testing falls within the scope of black box testing, and as such, should require no
knowledge of the inner design of the code or logic.
Here we have tested the system as a whole, and the results are as expected and quite
satisfactory.

13. REPORTS

13.1. UNIT TESTING

Tested units: competitorList, createCSVReport, values, myList, productMatching.


The unit testing report is shown above.

13.2. INTEGRATION TESTING

System integration was tested by running the most important function,
productMatching(), which is linked and integrated with the other functions.
productMatching() contains the product matching algorithm, which is the most
important aspect of this program.

13.3. SYSTEM TESTING

The integrated system was tested as a whole by running it in its production
environment. The system testing report is shown above in the screenshot.

14. INPUT AND OUTPUT SCREENS

14.1. INPUT

This is the input data in JSON format, which was crawled from the internet. The data
contains information about the products of various e-commerce sites (Flipkart,
SnapDeal, Amazon).
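For illustration, the sketch below parses a few lines in the one-JSON-object-per-line layout that loadDataFile reads and groups the titles by source. The field names are those accessed in createDataDict; the sample values and URLs are made up, and the code is written for Python 3.

```python
import json
from collections import defaultdict

# Made-up lines in the one-object-per-line layout; the last one is malformed.
sample_lines = [
    '{"title": "Sony Xperia Z", "mrp": "38990", "source": "Flipkart", '
    '"url": "http://example.com/f/1", "stock": "In Stock", '
    '"available_price": "30990"}',
    '{"title": "Xperia Z", "mrp": "38990", "source": "SnapDeal", '
    '"url": "http://example.com/s/1", "stock": "In Stock", '
    '"available_price": "31500"}',
    'not valid json',
]

by_source = defaultdict(list)
for line in sample_lines:
    try:
        record = json.loads(line)
    except ValueError:
        continue  # skip lines that are not valid JSON
    by_source[record['source']].append(record['title'])

print(dict(by_source))
```

Lines that cannot be parsed are skipped, mirroring the behaviour of loadDataFile.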

14.2. OUTPUT

Output at threshold = 0.70, using quick_ratio()

Output at threshold = 0.80, using quick_ratio()
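The two runs differ only in the cut-off applied to the similarity score. The sketch below shows the effect of the threshold on a pair of made-up titles. The helper best_match is hypothetical and written for Python 3; SequenceMatcher.quick_ratio() is the difflib call used in productMatching, and difflib documents it as an upper bound on ratio().

```python
from difflib import SequenceMatcher


def best_match(title, candidates, threshold):
    """Return (score, candidate) with the highest quick_ratio() that reaches
    the threshold, or None when no candidate qualifies."""
    scored = [(SequenceMatcher(None, title, c).quick_ratio(), c)
              for c in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored) if scored else None


# Made-up titles for illustration.
flipkart_titles = ["moto g 16gb black", "moto e 8gb"]
at_070 = best_match("moto g 16gb", flipkart_titles, 0.70)
at_080 = best_match("moto g 16gb", flipkart_titles, 0.80)
print(at_070)
print(at_080)
```

With these titles both candidates clear 0.70 and the closer one is selected, while neither clears 0.80, so the stricter run reports no match. A higher threshold therefore trades missed matches for fewer false ones.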

15. CONCLUSION

This API is system software that helps a Business Intelligence team devise market
strategies. Its main task is the product-matching algorithm, which matches products
across various e-commerce platforms and gives analytical insight into the market
capture of a product or a company against its competitors. By automating this
comparison, the API reduces the overall effort considerably. E-commerce companies
can use it for competitive analysis and use the results generated by this project to
plan future market strategies, which could help them capture more market share and
thereby increase profits. Wider use of such software by e-commerce companies should
lead to better strategies and planning, and should encourage software developers to
pay more attention to this area, helping business organizations perform data analysis
more efficiently and in one place.

16. LIMITATIONS OF THE PROJECT

The software has its own set of limitations. The foremost is the accuracy of the
matching algorithm, which is currently below 100%. Product names vary between sites:
for example, a product listed as "sony xperia z" on Flipkart may be listed as
"xperia z" on SnapDeal. In such cases the algorithm may fail to match the products or
may give unexpected results. To solve this problem, more advanced mathematical
concepts, such as probability models, could be applied in the algorithm.
The GUI of this project has not been developed yet and the software is still in its
infancy. As a result it is not very user friendly and can be operated only by people
with some knowledge of the internal structure and working of the API.
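The Xperia example can be made concrete. The sketch below scores the two titles with difflib's character-level ratio() and, for comparison, with a simple word-overlap measure. The word-overlap measure is not part of the project and is not the probability model mentioned above; it is shown only as one possible direction, and both helper names are hypothetical (Python 3).

```python
from difflib import SequenceMatcher


def char_ratio(a, b):
    """Character-level SequenceMatcher ratio of two lower-cased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_containment(a, b):
    """Share of the shorter title's words that also occur in the other title."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


char_score = char_ratio("sony xperia z", "xperia z")
token_score = token_containment("sony xperia z", "xperia z")
print(round(char_score, 2), token_score)
```

The character-level score is 16/21 (about 0.76), so this pair is accepted at a threshold of 0.70 but rejected at 0.80, whereas the word-overlap score is 1.0. Word overlap has its own weakness: "xperia z" would score equally well against a different model such as "sony xperia z ultra", so it would need to be combined with other signals.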

17. FUTURE APPLICATIONS OF THE PROJECT

The application currently matches the products along with their data and outputs the
data in the desired format in one place. However, the accuracy of the algorithm used
in the application is below 100%. In future the algorithm can be improved so that it
produces more accurate results, ideally with 100% accuracy.
The application can also be given a proper GUI, which would make it more intuitive,
interactive and user friendly.

