
CHAPTER 1

INTRODUCTION

1.1 Necessity of Caching Technology in World Wide Web


In rural areas, internet speeds are very low, with frequent disconnections, typically in the range of 30-50 kbps. Rural people are less privileged and cannot afford high-speed broadband, and broadband companies do not find enough customers there to provide connectivity profitably. If this situation persists, village people can never be brought into the mainstream of development that their urban counterparts enjoy. For example, a village youth would never consider his village a comfortable place from which to apply through various job sites. We are going to develop a facility that enhances the speed of internet access in villages while still using conventional dial-up lines, which usually deliver only 30% to 50% of their promised 128 kbps.
We have planned to do it in two ways:
(i) implement the cache in a way that gives minimum collisions, i.e. with a universal hash function, and
(ii) keep in the cache only those websites that give a better hit ratio than the existing ones.

All these concepts are based on the assumption that the requirements of rural people are modest: they do not access the internet arbitrarily, but stick to their necessities. For example, the requirements of villagers vary between health, agriculture, jobs, education and governmental services; they rarely use the internet beyond these.
Isaacman and Martonosi (2008) have discussed this issue extensively. According to them, as the internet becomes more pervasive in everyday life in the developed world, those who wish to be competitive in modern markets must have access to its information. Those unable to harness the internet's vast resources will be disadvantaged through a lack of powerful tools for communication, healthcare, and education. The United Nations has established bringing technology to developing regions as one of its Millennium Development Goals to be achieved by 2015 for precisely this reason. Though lack of connectivity is rarely an issue in North America, worldwide internet penetration rates (defined here as the percentage of the population with access to an internet connection point and the knowledge to use it) are just over 20%, and in some regions below 5%. Clearly, the disparity in internet connectivity needs to be addressed, but major stumbling blocks exist to internet penetration in the developing world. The infrastructure to support full connectivity is often non-existent. The cost and logistics involved in laying wires to remote villages in developing regions turn the last-mile problem into something far greater: often, a remote village can be a bus ride of many hours from the nearest feasible internet access point. However, the advent of wireless technology has allowed us to move beyond wired solutions. Though even wireless solutions suffer from distance and infrastructure limitations (e.g., maximum ranges, tower height restrictions), they offer hope of bringing the internet to remote villages.
This requirement leads us to the concept of web caching, a widely used term today. Although the urban world enjoys broadband, and providers promise bandwidths of several Mbps, web caching is still required: websites of search engines such as Google implement web caching in some form or other, and it is web caching that lets a provider promise higher effective bandwidth. Be it the intermittent internet of the villages or the high-speed broadband of urban areas, web caching is a necessity.
Many forms of web caching are prevalent in the market, and research continues to make them better. There are many standard techniques to implement web caching; some are application specific and some are generalized, which emphasizes the importance of the concept and drew our interest to the topic. Application-specific web caching mechanisms are designed to cater to a specific need, e.g. a specific environment or a configuration based on particular hardware, and are not suitable for a different scenario; sometimes such systems even degrade the efficiency of other environments. So we set out to make web caching more dynamic and suitable for different environments. This can happen only if we make the web caching system somewhat intelligent, i.e. a system that, in a sense, reads the user's mind and acts accordingly. A natural way to make a system intelligent is to apply Genetic Algorithms or similar techniques. This keeps the design simple yet increases the efficiency of the system, and eliminates the need for any architectural reconfiguration. In other words, it enhances the internal behaviour of the system, in our case the web cache.

1.2 Proposed Work


We are going to develop an intelligent web caching method: a facility that can be useful for villages as well as for broadband. It would enhance the speed of internet access in villages while still using conventional dial-up lines, which usually deliver only 30% to 50% of their promised 128 kbps, and it can also cater to broadband connections that promise speeds of several Mbps. In this way it differs from contemporary systems, which are application specific.
We have planned to do it in two ways:
(i) implement the cache in a way that gives minimum collisions, i.e. with a universal hash function, and
(ii) keep in the cache only those websites that give a better hit ratio than the existing ones.
As noted in Section 1.1, these concepts rest on the assumption that the requirements of rural people are modest, largely health, agriculture, jobs, education and governmental services, so they rarely use the internet beyond their necessities.
We have observed that the cache is the key to all of these techniques, so improving cache performance improves their efficiency. A cache is inherently a hash table in functionality and configuration, so we concentrate on improving the performance of a hash table, letting the same principles apply to a cache as well. Our work is based on the following system model.

Fig. 1.1 Web pages being hashed.


To improve hash performance, we have used Gene Expression Programming (GEP). It is a technique in the line of Genetic Algorithms and Genetic Programming that works on both genotype and phenotype: the genotype is an individual chromosome, while the phenotype is the expression tree derived from that chromosome.
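The genotype/phenotype split above can be sketched in code. The following is a toy illustration, not the thesis implementation: the genotype is a flat string of symbols and the phenotype is an expression tree decoded from it breadth-first; the symbol set and helper names are our own illustrative choices.

```python
# Toy GEP decoder: flat chromosome string (genotype) -> expression tree (phenotype).
OPS = {"+": 2, "-": 2, "*": 2}   # function symbols and their arities

def decode(chromosome):
    """Build a nested-tuple expression tree from a flat chromosome string."""
    nodes = [[s, []] for s in chromosome]
    i = 1                          # index of the next unconsumed symbol
    queue = [nodes[0]]
    while queue and i < len(nodes):
        node = queue.pop(0)
        for _ in range(OPS.get(node[0], 0)):   # operands have arity 0
            child = nodes[i]
            i += 1
            node[1].append(child)
            queue.append(child)
    def as_tuple(n):
        return n[0] if not n[1] else (n[0],) + tuple(as_tuple(c) for c in n[1])
    return as_tuple(nodes[0])

# "+*abcd": root '+' takes '*' and 'a'; '*' takes 'b' and 'c'; 'd' is unused
# tail material, which is what keeps every chromosome syntactically valid.
print(decode("+*abcd"))   # ('+', ('*', 'b', 'c'), 'a')
```

The unused tail symbols are the characteristic GEP feature: mutation anywhere in the string still yields a valid tree.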

1.3 Methodology
We assume that the websites accessed are named w1, w2, w3, ..., wn, and that the individual web pages of w1 are named w11, w12, ..., w1m, and similarly w21, w22, w23, etc. Each incoming web page is indexed according to its IP address; each IP address is a number that can be expressed in binary form. We assume the index numbers are consecutive, so that each new website is given the next binary number in succession. For example, if a website Wxy is represented by a binary number m, then the next website Wxy+1 is represented by the next consecutive number m+1, so that substitution operations in the chromosomes are traceable and always yield a valid website address. Let P be the set of websites, the population of our study. Each time a new website arrives, let the population change to P1. The hit ratio of the current population is studied each time, and the best possible population is chosen at the end.
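The consecutive-indexing assumption above can be sketched in a few lines. This is an illustrative stand-in (the names and binary width are our own): every newly seen website receives the next integer index, so if Wxy maps to m, the next new website maps to m+1.

```python
# Sketch of consecutive indexing: each new name gets the next integer index.
index_of = {}                      # website/page name -> index

def index(name):
    if name not in index_of:
        index_of[name] = len(index_of)   # next number in succession
    return index_of[name]

for page in ["w11", "w12", "w21", "w11"]:
    print(page, format(index(page), "04b"))   # the index in binary form
```

Repeated names keep their original index, so the mapping stays traceable across generations.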

1.3.1 Improving collisions

The indexes of the pages are used for hashing. The index of a page is converted into a range of integers called keys, say [0, M-1], which has to be hashed into another range of numbers [0, N-1]. For a hash function h chosen at random from a universal family H,

Pr( h(m) = h(n) ) <= 1/N ----- (F1)

where Pr(E) is the probability of event E.


In [9], it has been shown that a universal hash function gives the minimum collision. So we use a universal hash function here, which has the following form:

h p,q (r) = ((p*r + q) mod x) mod N ----- (F2)

where x is a prime number with M <= x < 2M, and p, q are random integers with 0 < p < x and 0 < q < x. Ultimately, GEP has been chosen as the basis of this project. We have developed the following algorithm to carry out the task; the chromosome chosen is the above hash function (F2).
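The hash function (F2) can be written out directly. The following is a minimal sketch under the stated constraints (x prime with M <= x < 2M, random p and q with 0 < p, q < x); the values of M, N and the sample keys are illustrative choices of ours.

```python
# Direct implementation of F2: h(r) = ((p*r + q) mod x) mod N.
import random

M, N = 1000, 97              # key range [0, M-1] and hash table size
x = 1009                     # a prime satisfying M <= x < 2M
p = random.randrange(1, x)   # 0 < p < x
q = random.randrange(1, x)   # 0 < q < x

def h(r):
    return ((p * r + q) % x) % N   # F2

keys = random.sample(range(M), 150)
slots = [h(r) for r in keys]
print("distinct slots used:", len(set(slots)),
      "collisions:", len(slots) - len(set(slots)))
```

Drawing p and q afresh picks a new member of the universal family; over that family, any fixed pair of distinct keys collides with probability at most 1/N, as in (F1).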

1.3.2 Steps of the algorithm:

1. The IP address of a valid website is converted into an index, represented by the symbol r.
2. N represents the size of the hash table.
3. h(r) is the position into which the website is placed.
4. p, q and x represent intermediate values.
5. r is chosen from among a population P and is subjected to regression.
6. The hit ratio of the current population is calculated.
7. Crossover and mutation are performed by changing the values of p, q and x.
8. Steps 5-7 are repeated for a certain number of generations, after which a best set of individuals is determined.

We developed the above algorithm to implement the project. It incorporates GEP in a new manner, in line with GA and GP but different from both. Our aim in this thesis is to explore the potential of GEP in terms of its efficiency and simplicity.
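The steps above can be sketched as an evolutionary loop. This is a hedged simplification, not the thesis code: each individual is a (p, q, x) triple for F2, and fitness rewards fewer collisions over a sample of website indices (standing in for the hit-ratio measurement of step 6). The population sizes, mutation step and selection scheme are our own illustrative choices, and we do not enforce primality of x here; the search simply favours triples that hash well.

```python
# Evolve (p, q, x) of F2, h(r) = ((p*r + q) mod x) mod N, to minimise collisions.
import random

N = 64                                     # step 2: hash table size
KEYS = random.sample(range(10_000), 400)   # step 1: indices r of websites

def collisions(ind):                       # step 6 stand-in: fewer collisions
    p, q, x = ind                          # correlates with a better hit ratio
    slots = [((p * r + q) % x) % N for r in KEYS]
    return len(slots) - len(set(slots))

def crossover(a, b):                       # step 7: one-point crossover
    cut = random.randrange(1, 3)
    return a[:cut] + b[cut:]

def mutate(ind):                           # step 7: perturb one of p, q, x
    i = random.randrange(3)
    ind = list(ind)
    ind[i] = max(1, ind[i] + random.randint(-50, 50))
    return tuple(ind)

pop = [tuple(random.randrange(1, 10_000) for _ in range(3)) for _ in range(30)]
for _ in range(40):                        # step 8: a fixed no. of generations
    pop.sort(key=collisions)
    elite = pop[:10]                       # best individuals survive
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(20)]
best = min(pop, key=collisions)
print("best (p, q, x):", best, "with", collisions(best), "collisions")
```

With 400 keys and 64 slots, at least 336 collisions are unavoidable by the pigeonhole principle; the search drives the count toward that floor.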

1.4 Literature Survey

Collaborative caching and pre-fetching on the internet is the basis of this research. The potential of these techniques is discussed elaborately by Isaacman and Martonosi [1], who coined the term in the context of the largely intermittent internet of remote villages. The distributed collaborative caching for proxy servers discussed by Kasbekar and Desai [2] proposes distributed proxy servers and WWW clients, implemented on a network of connected Sun workstations running Solaris 2.5.1. This system writes a wrapper for the client. If the reply from the remote server indicates no change in the document since the last update date supplied, the document received from the user stub is sent to the requesting client. On the other hand, if the remote server sends the document, it is forwarded to the requesting client and the proxy's index of locally cached documents is updated.
The success and popularity of social network systems such as del.icio.us, Facebook, MySpace, and YouTube has generated many interesting and challenging problems for the research community. Li, Guo and Zhao [3] discuss, among others, the problem of discovering social interests shared by groups of users. The main challenge comes from the difficulty of detecting and representing users' interests. Existing approaches are all based on the online connections of users and so cannot identify the common interests of users who have no online connections. Their paper suggests a social interest discovery approach based on user-generated tags. The authors have developed an Internet Social Interest Discovery system, ISID, which can effectively cluster similar documents by interest topic and discover user communities with common interests, whether or not their members have any online connections. CoCache, discussed by Qian, Xu, Zhou, and Zhou [4], is a query processing system based on collaborative caching technology in a peer-to-peer environment. It differs from existing P2P systems in that both the caching process and query processing are fully decentralized, based on a distributed hash table (DHT) scheme called CON (Coordinator Overlay Network); query answering performance is improved greatly with low overhead for maintaining CON. Systematic P2P Aided Cache Enhancement, or SPACE, a new collaboration scheme that lets the clients in the computer cluster of a high-performance computing facility share their caches with each other, has been discussed by the authors in [5]. Here the clients create an environment that gives the perception of a large pseudo-global cache by exchanging information through gossip messages. If a request cannot be served from the local cache, it is looked up in the pseudo-global cache without involving any file manager or central server. The collaboration is achieved in a distributed manner and is designed on a peer-to-peer computing model. Dominguez-Sal, Larriba-Pey and Surdeanu [6] have designed a multilayer collaborative cache for question answering. In this work a multi-class maximum entropy classifier is used to map each question into a known answer-type taxonomy, and another algorithm is used to retrieve the queries. The question set contains questions randomly selected using a Zipf distribution. Two different protocols are proposed for the management of the multilayer, distributed cache. Xu, Liu, Li and Jia [7] have discussed caching and pre-fetching for web content distribution.

Zhang, Lee, Ma and Zhou [13] have discussed the pre-fetching coordinator (PFC), a multi-level independent pre-fetching scheme in which an intermediate layer of intelligence is placed between the upper- and lower-level strategies for pre-fetching and cache replacement. They implemented four well-known prefetching algorithms used in real systems, P-Block Read Ahead (RA), Linux kernel prefetching, SARC and AMP, and showed that when the four algorithms are applied to a two-level storage system, the addition of PFC can improve system performance by up to 35%. Tang, Zhang and Chanson [14] compare the various caching algorithms designed for transcoding proxies and propose an adaptive algorithm that dynamically selects an appropriate management policy; their experimental results show that it significantly outperforms algorithms that cache only transcoded or only untranscoded objects. Kipruto, Tan, Musau and Mushi [15] implemented web caching within the proxy web caching architecture using Squid running on the Windows operating system. To preserve a consistent cache population and to optimize cacheable objects and results of the e-learning content, their implementation adopts a Genetic Algorithm (GA) approach. With the adoption of GA or GEP, one can avoid the necessity of using multiple or multi-level caches. Davison [16] elaborately explains various web caching methods and resources. Rabinovich and Spatscheck [17] have highlighted the term replication. According to them, replication is a method that enhances scalability from the server side: it creates and maintains distributed copies of content under the control of the content provider, and client requests are sent to the nearest and least busy server. A decentralized peer-to-peer web cache system called Squirrel has been explained extensively by Iyer, Rowstron and Druschel [19]. There are many works relating to cooperative web caching and workload characterization, but this paper demonstrates that it is efficient to adopt a peer-to-peer web caching system in a corporate LAN situated in a single geographical location, as it uses no extra hardware or administration while remaining fault-tolerant. A differential service architecture has been designed and implemented by Venketesh, Sivanandam and Manigandan [22]. This model achieves service differentiation with improved hit rate, using a separate replacement algorithm for each class based on its requirement specifications. Du and Subhlok [23] have evaluated the performance of cooperative web caching with Web Polygraph, which provides an environment for simulating desired workloads and gathering performance statistics. Web Polygraph employs a single caching system; the performance of the cache is evaluated from the Polygraph processes, which generate multiple request streams within the same global URL space. Beaumont [24] has used the expiration times of websites to calculate the hit ratios of web caches. He considers a Zipf distribution for the probability of a website being requested and shows that the size of a website has the least significance on hit ratio. Summary Cache is a protocol designed by Fan, Cao, Almeida and Broder [25]; in this protocol, each proxy keeps a summary of the cache directory of every participating proxy and checks the summaries for potential hits before sending any queries. In [26], Suresh Babu and Srivastava have designed a system in which the browser maintains a database storing all site names; whenever a user visits a site, the browser automatically updates the user profile, which includes the user's interests and the time and date of visiting the site. Patel, Patel and Parikh [27] have developed an algorithm that uses server log files in the W3C file format. Each raw log file is parsed, the access count of each web page is calculated, and pages that fall below a threshold value are removed; the remaining web pages are the candidates stored in the cache. The term cache pollution has been used by Ali and Shamsuddin in [28] to mean a situation where the cache contains objects that will not be used in the near future. The authors have proposed an intelligent approach that uses a neuro-fuzzy system to predict which web pages may be re-accessed later, so that unwanted objects are removed efficiently to make space for new ones. Patil and Pawar [29] have discussed intelligent predictive caching and static prefetching in web proxy servers. Ali, Shamsuddin and Ismail have surveyed various web caching and prefetching techniques [30]. In Umapathi, Aramutham and Raja [31], a Pre-fetch Enhanced Caching Algorithm (PEC) has been discussed. In this work the cache is divided into two parts: the first serves as the cache proper while the other is a pre-fetch buffer. If a requested document is in the pre-fetch buffer, it is moved to the caching partition under the LRU policy; in the meantime, future documents are predicted and stored in the pre-fetch partition. Balasundaram and Akilan [32] have discussed two LRU techniques, namely Class-based LRU (C-LRU) and Pass Down C-LRU (PDC-LRU): each web page is first placed in a fit region, but when the cache becomes full, instead of being removed under the LRU scheme it is passed down into the unfit region. Che, Tung and Wang [33] have viewed a cache as a low-pass filter and designed a hierarchical web caching model. The hierarchy is a k-level tree with a single cache C0 at the root and k leaf nodes of size Ck at the leaf level; the traces available at any level of the tree are the filtered traces of the lower levels, and each cache is identified by a unique request arrival time. This approach is compared with traditional uncooperative hierarchical models.
After discussing all the above traditional techniques, we have observed that the cache is the key to all of them, so improving the performance of the cache improves the efficiency of all these techniques. A cache is inherently a hash table in functionality and configuration [41], so we emphasize improving the performance of a hash table so that the same principles are applicable to a cache as well. We made the following observations regarding the functioning of a hash table: it can be improved in two ways, (1) by reducing collisions to a minimum and (2) by increasing the hit ratio considerably. Estebanez, Cesar and Ribagorda [9] have implemented hash functions using Genetic Programming (GP); there, fitness is evaluated by hashing an input and the same input with a single bit flipped, and calculating the Hamming distance between the two results. Safdari and Joshi [10] have implemented universal hash functions using a Genetic Algorithm (GA) and have proved that a universal hash function gives the minimum collision. Going further along the Genetic Algorithm path, we came across Gene Expression Programming, first introduced by Ferreira [11]. According to her, GEP is similar to GA and GP in that it selects individuals from populations according to fitness and introduces genetic variation using genetic operators; the difference lies in the nature of the individuals. Skaruz and Seredynski [35] have discussed anomaly detection methods for web applications using Gene Expression Programming. The security of an organization depends on an effective intrusion detection system, and GEP makes the system intelligent enough to detect currently known attacks as well as those that may occur in the future. Liu, English and Pohl [38] propose a GEP-based data mining method to obtain formulas for the reliability of the C(k, n: F) system. Wang and Lu have proved that GATree algorithms improve performance over traditional LRU-based algorithms.

1.5 Contribution of this Thesis


We have discussed in the above paragraphs how different methods have been implemented by different authors to improve caching technology in the field of web mining. Our observation is that the performance of a cache can be improved further by keeping in store only those websites that give the best hit ratio, thus reducing the miss rate to a minimum.
For this, we have considered the fact that a cache is similar in structure to a hash table, so improving a cache means improving the performance of a hash. Safdari and Joshi [10] have proved that a hash function gives the best performance, in terms of minimum collisions, when it is implemented as a universal hash function. We argue that since the hash table we propose to implement is not of variable size, increased collisions mean loss of data; a universal hash function provides minimum collisions, hence a better hit ratio, ensuring that the maximum number of websites is stored in the cache.
This thesis implements a universal hash function and selects as training data the set of keys that gives minimum collisions. Then, using it as our initial population, we apply Gene Expression Programming to find the fittest set of individuals that should remain in the cache, so as to provide a greater hit ratio in future.


CHAPTER 2
OVERVIEW OF CACHING TECHNOLOGY

2.1 Cache: The Definition


A cache is a part of the memory system of a digital computer. Excluding registers, it holds second position in the memory hierarchy in terms of speed of access and bandwidth. Among the principles involved in cache technology are:
(i) Spatial locality: given an access to a particular location in memory, there is a high probability that other accesses will be made to that or neighbouring locations within the lifetime of the program.
(ii) Sequentiality: given that a reference has been made to a particular location s, it is likely that a reference to location s+1 will be made within the next several references.
Fig. 2.1 The memory hierarchy (Level 0: cache; Level 1: main memory and map control; Level 2: backup storage).

A cache is a component that transparently stores data so that future requests for that data can be served faster. The data stored in a cache might be values computed earlier or duplicates of original values stored elsewhere. If requested data is contained in the cache (a cache hit), the request can be served by simply reading the cache, which is comparatively fast. Otherwise (a cache miss), the data has to be recomputed or fetched from its original storage location, which is comparatively slow. Hence, the more requests that can be served from the cache, the faster the overall system performs.
To be cost-efficient and to enable efficient use of data, caches are relatively small. Nevertheless, caches have proven themselves in many areas of computing because access patterns in typical computer applications exhibit locality of reference. References exhibit temporal locality if recently requested data is requested again; they exhibit spatial locality if data is requested that is physically stored close to data already requested.

Fig. 2.2 Block diagram of a CPU cache

Sometimes the cache is implemented inside the CPU; it is then called the Level 1 cache.

2.2 Examples of Collaborative Caches

CoCache
A peer may take different roles in a CoCache network: a requester is a peer who issues one or more queries; a source peer is a peer whose database is accessible by others; a caching peer is a requester who caches the result(s) of its queries or subqueries. Both source peers and caching peers are called providers. A coordinator is a peer in charge of maintaining information about a specific query expression, including the providers that can supply data to answer the query, the locality information of those providers, and the coordinators corresponding to the sub- and super-expressions. The coordinators are also responsible for coordinating the requesters to determine which part of the data is cached by which peer.

SPACE: A lightweight collaborative caching for clusters


Systematic P2P Aided Cache Enhancement, or SPACE, a new collaboration scheme that lets the clients in the computer cluster of a high-performance computing facility share their caches with each other, has been discussed by Akon and Islam [5]. The collaboration is achieved in a distributed manner and is designed on a peer-to-peer computing model. The objective is to provide (1) a decentralized solution and (2) near-optimal performance with reasonably low overhead. Simulation results demonstrate the performance of the proposed scheme; in addition, they show that SPACE evenly distributes workloads among participants and entirely eliminates the need for a central cache manager.
Here, the clients create an environment that gives the perception of a large pseudo-global cache. To accomplish this, the clients exchange information about their local caches through gossip messages, and the disseminated information is used to unify the smaller client caches into the pseudo-global cache. If a request cannot be served from the local cache, it is looked up in the rest of the pseudo-global cache without involving the file server or any central manager. When a locally unavailable block is found in the pseudo-global cache, the clients cooperate in fetching it. Moreover, through this coordination, the performance of a busy cache can be improved by properly utilizing a nearby idle cache: an idle cache not only helps the busy cache in fetching blocks, but also preserves critical blocks to reduce the cost of retrieval from mass storage. In addition, the clients introduce replicas of frequently accessed blocks. Replication of such blocks often reduces the bottleneck at the central server, distributes the service load among clients, and increases the chance of hits in the local as well as the pseudo-global cache. When the system gets busy, however, the clients coordinate in an elimination process to remove one or more replicas and make space for newly introduced blocks. Owing to the collaboration, a client acts as a requester of services from other clients and at the same time as a service point for other clients, so the load of the system is distributed among the participants. In this scheme, the data server receives a service request only if the request cannot be served by the pseudo-global cache, i.e., a miss happens in both the local and the global cache. To achieve this coordination, a peer-to-peer (P2P) client partnership is proposed to model the collaborative cache. In this partnership the client-server relation blurs, and cooperation among peers (i.e., the clients of the file server) emerges to provide a higher number of hits in the pseudo-global cache. This approach yields three additional fundamental benefits: (1) low maintenance cost, (2) easy integration with existing software platforms, and (3) an easy development platform. The authors have shown that the proposed scheme reasonably approximates the ideal global LRU caching policy, which has an instantaneous view of all the caches in the system. The results also show that the scheme performs better than existing centralized solutions, and that the message communication and memory overhead of the maintenance operations are fairly low.

2.3 Cache Organization & Prefetching


There are two types of cache organization: demand-fetch and prefetch. A demand-fetch cache brings a new memory locality into the cache only when a processor reference is not found in the current cache contents (a miss occurs). A prefetch cache attempts to anticipate the locality about to be requested by the processor and prefetches it into the cache.

2.3.1 Database caching:

Database caching means saving frequently used information in an easily accessible area, usually main memory. Caching has three main goals: reducing disk access, reducing computation (i.e. CPU utilization), and reducing the time it takes a user to see a result. It does all this at the expense of RAM, and the trade-off is almost always worth it.
In multi-tier architectures, the application tier and the data tier reside on different hosts. As commercial databases are heavyweight, it is not practically feasible to host the application and the database on the same machine. Many lightweight databases available in the market can be used to cache data from the commercial databases. Caching helps a database application in the following ways:
(1) Scalability: distributes the query workload from the backend to multiple cheap front-end systems.
(2) Flexibility: each cache hosts different parts of the backend data, e.g., the data of platinum customers is cached while that of ordinary customers is not.
(3) Availability: provides continued service for applications that depend only on cached tables, even if the backend server is unavailable.
(4) Performance: responds fast because of data locality, and smooths out load peaks by avoiding round trips between the middle tier and the data tier.
While database caches ease the way data is stored and fetched, they require the system to have the following features:
(5) Updateable cache tables: the cache solution should not be read-only, which would limit its usage to a small segment of non-real-time applications.
(6) Bi-directional updates: for updateable caches, updates that happen in the cache should propagate to the database, and any updates that happen directly on the target database should come to the cache automatically.
Synchronous and asynchronous update propagation:
Updates on a cache table are propagated to the target database in two modes. Synchronous mode ensures that once a database operation completes, the update has been applied at the target database as well. In asynchronous mode, updates to the target database are delayed until all database operations are complete. Synchronous mode gives high cache consistency and suits real-time applications; asynchronous mode gives high throughput and suits non-real-time applications.
Transparent failover: there should be no service outage when the caching platform fails; client connections should be routed to the target database.
Few or no application changes: support for standard interfaces such as JDBC and ODBC should let the application work seamlessly without code changes, and all stored procedure calls should be routed to the target database so that they don't need to be migrated.
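The two propagation modes above can be contrasted in a short sketch. This is an illustration with made-up names: `backing` stands in for the target database, synchronous mode writes through to it immediately, and asynchronous mode queues the update for a later flush.

```python
# Synchronous (write-through) vs asynchronous (write-behind) propagation.
from collections import deque

backing = {}                 # stands in for the target database
cache = {}
pending = deque()            # queued updates for asynchronous mode

def update_sync(key, value):
    cache[key] = value
    backing[key] = value     # propagated before the operation completes

def update_async(key, value):
    cache[key] = value
    pending.append((key, value))   # delayed until a later flush

def flush():
    while pending:
        k, v = pending.popleft()
        backing[k] = v

update_sync("a", 1)          # visible in backing immediately
update_async("b", 2)         # in the cache, but not yet in backing
flush()                      # now propagated
```

The trade-off mirrors the text: the synchronous path keeps the cache and database consistent at the cost of a round trip per update, while the asynchronous path batches work for throughput at the risk of a consistency window.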

2.3.2 Types of Database Caching:


In a database there are three basic types of caching: query results, query plans, and relations. The first, query result caching, simply means storing in memory the exact output of a SELECT query for the next time somebody performs that exact same SELECT query. Thus, if 800 people run "SELECT * FROM foo", the database executes it for the first person, saves the results, and simply reads the cache for the next 799 requests. This saves the database from doing any disk access, practically removes CPU usage, and speeds up the query.
The second, query plan caching, involves saving the results of the optimizer, which is
responsible for figuring out exactly "how" the database is going to fetch the requested
data. This type of caching usually involves a "prepared" query, which has almost all
of the information needed to run the query, with the exception of one or more
"placeholders" (spots that are populated with variables at a later time). The query
could involve non-prepared statements as well. Thus, if someone prepares the
query "SELECT flavor FROM foo WHERE size=?", and then executes it by sending
in 300 different values for "size", the prepared statement is run through the optimizer,
the resulting path is stored into the query plan cache, and the stored path is used for
the 300 execute requests. Because the path is already known, the optimizer does not
need to be called, which saves the database CPU and time.
The third, relation caching, simply involves putting the entire relation (usually a table
or index) into memory so that it can be read quickly. This saves disk access, which
basically means that it saves time. (This type of caching also can occur at the OS
level, which caches files).Those are the three basic types of caching, ways of
implementing each are discussed below. Each one should complement the other, and a
query may be able to use one, two, or all three of the caches.

I. Query result caching:


A query result cache is only used for SELECT queries that involve a relation (i.e. not
for "SELECT version"). Each cache entry has the following fields: the query itself,
the actual results, a status, an access time, an access number, and a list of all included
columns. (The column list actually tells as much information as needed to uniquely
identify it, i.e. schema, database, table, and column). The status is merely an indicator
of whether or not this cached query is valid. It may not be, because it may be
invalidated for a user within a transaction but still be of use to others. When a select
query is processed, it is first parsed apart into a basic common form, stripping
whitespace, standardizing case, etc., in order to facilitate an accurate match. Note that
no other pre-processing is really required, since we are only interested in exact
matches that produce the exact same output. An advanced version of this would
ideally be able to use the cached output of "SELECT bar, baz FROM foo" when it
receives the query "SELECT baz, bar FROM foo", but that will require some
advanced parsing. Possible, but probably not something to attempt in the first iteration
of a query caching function. :) If there is a match (via a simple strcmp at first) and the
status is marked as "valid", then the database simply uses the stored output, updates
the access time and count, and exits. This should be extremely fast, as no disk access
is needed, and almost no CPU. The complexity of the query will not matter either: a
simple query will run just as fast as something with 12 sorts and 28 joins. If a query is
*not* already in the cache, then after the results are found and delivered to the user,
the database will try and store them for the next appearance of that query. First, the
size of the cache will be compared to the size of the query+output, to see if there is
room for it. If there is, the query will be saved, with a status of valid, a time of 'now', a
count of 1, a list of all affected columns found by parsing the query, and the total size
of the query+output. If there is no room, then it will try to delete one or more to make
room. Deleting can be done based on the oldest access time, smallest access count, or
size of the query output. Some balance of the first two would probably work best,
with the access time being the most important. Everything will be configurable, of
course. Whenever a table is changed, the cache must be checked as well. A list of all
columns that were actually changed is computed and compared against the list of
columns for each query. At the first sign of a match, the query is marked as "invalid."
This should happen before the changes are made to the table itself. We do not delete
the query immediately since this may be inside of a transaction, and subject to
rollback. However, we do need to mark it as invalid for the current user inside the
current transaction: thus, the status flag. When the transaction is committed, all
queries that have an "invalid" flag are deleted, and then the tables are changed. Since
the only time a query can be flagged as "invalid" is inside your own transaction, the
deletion can be done very quickly.
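The behaviour described above can be sketched as follows. This is an illustrative sketch, not an actual database implementation; the entry fields (results, columns, status, access time, access count), the whitespace/case normalization, the column-based invalidation and the deferred deletion at commit mirror the description, but all names are assumptions.

```python
import re
import time

class QueryResultCache:
    """Toy query result cache with normalization, LRU eviction and invalidation."""
    def __init__(self, max_entries=100):
        self.entries = {}                 # normalized query -> entry dict
        self.max_entries = max_entries

    @staticmethod
    def normalize(sql):
        # collapse whitespace and standardize case to facilitate exact matches
        return re.sub(r"\s+", " ", sql.strip()).lower()

    def lookup(self, sql):
        e = self.entries.get(self.normalize(sql))
        if e and e["valid"]:
            e["atime"], e["count"] = time.time(), e["count"] + 1
            return e["results"]
        return None

    def store(self, sql, results, columns):
        if len(self.entries) >= self.max_entries:
            # evict the entry with the oldest access time to make room
            oldest = min(self.entries, key=lambda k: self.entries[k]["atime"])
            del self.entries[oldest]
        self.entries[self.normalize(sql)] = {
            "results": results, "columns": set(columns),
            "valid": True, "atime": time.time(), "count": 1}

    def invalidate(self, changed_columns):
        # mark (not delete) entries whose column lists overlap the change
        for e in self.entries.values():
            if e["columns"] & set(changed_columns):
                e["valid"] = False

    def commit(self):
        # at commit time, all entries flagged "invalid" are actually deleted
        self.entries = {k: e for k, e in self.entries.items() if e["valid"]}

cache = QueryResultCache()
cache.store("SELECT * FROM foo", [("a", 1)], ["foo.a"])
print(cache.lookup("select  *  from foo"))   # hit: normalization makes it an exact match
cache.invalidate(["foo.a"])
print(cache.lookup("SELECT * FROM foo"))     # None: marked invalid by the table change
```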

II. Query plan caching: If a query is not cached, then it "falls through" to the
next level of caching, the query plans. This can either be automatic or strictly on a
user-requested format (i.e. through the prepare-execute paradigm). The latter is
probably better, but it also would not hurt much to store non-explicitly prepared
queries in this cache as long as there is room. This cache has a field for the query
itself, the plan to be followed (i.e. scan this table, that index, sort the results, then
group them), the columns used, the access time, the access count, and the total size. It
may also want a simple flag of "prepared or non-prepared", where prepared indicates
an explicitly prepared statement that has placeholders for future values. A good
optimizer will actually change the plan based on the values plugged in to the prepared
queries, so that information should become a part of the query itself as needed, and
multiple queries may exist to handle different inputs. In general, most of the inputs
will be similar enough to use the same path (e.g. "SELECT flavor FROM foo
WHERE size=?" will usually be executed with a simple numeric value for its
placeholder). If a match *is* found, then the database can use the stored path, and not
have to bother calling up the optimizer to figure it out. It then updates the access time,
the access count, and continues as normal. If a match was *not* found, then it might
possibly want to be cached. Certainly, explicit prepares should always be cached.
Non-explicitly prepared queries (those without placeholders) can also be cached. In
theory, some of this will also be in the result cache, so that should be checked as well:
if it is there, there is no reason to put it here. Prepared queries should always have priority over
non-prepared and the rest of the rules above for the result query should also apply,
with a caveat that things that would affect the output of the optimizer (e.g.
vacuuming) should also be taken into account when deleting entries.
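The "fall through" to the plan cache can be sketched as a toy example; `plan_for` stands in for the real optimizer pass and the plan string is purely illustrative.

```python
class PlanCache:
    """Toy query plan cache keyed by the prepared query template."""
    def __init__(self):
        self.plans = {}
        self.optimizer_calls = 0

    def plan_for(self, template):
        # pretend this is the expensive optimizer pass
        self.optimizer_calls += 1
        return f"path derived from: {template}"

    def execute(self, template, value):
        if template not in self.plans:         # falls through to the optimizer once
            self.plans[template] = self.plan_for(template)
        plan = self.plans[template]            # stored path is reused thereafter
        return (plan, value)

pc = PlanCache()
for size in range(300):                        # 300 executes, one optimizer call
    pc.execute("SELECT flavor FROM foo WHERE size=?", size)
print(pc.optimizer_calls)   # 1
```

This mirrors the prepare/execute paradigm: the plan is computed on the first execution and the 300 subsequent executes only bind a new placeholder value.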

III. Relation caching


The final cache is the relation itself, and simply involves putting the entire relation
into memory. This cache has a field for the name of the relation, the table info itself,
the type (indexes should ideally be cached more than tables, for example), the access
time, and the access number. Loading could be done automatically, but most likely
should be done according to a flag on the table itself or as an explicit command by the
user.

2.4 Web caching:


Web browsers and web proxy servers employ web caches to store previous
responses from web servers, such as web pages. Web caches reduce the amount of
information that needs to be transmitted across the network, as information previously
stored in the cache can often be re-used. This reduces bandwidth and processing
requirements of the web server, and helps to improve responsiveness for users of the
web. Web browsers employ a built-in web cache, but some internet service providers
or organizations also use a caching proxy server, which is a web cache that is shared
among all users of that network. Another form of cache is P2P caching, where the
files most sought for by peer-to-peer applications are stored in an ISP cache to
accelerate P2P transfers. Similarly, decentralized equivalents exist, which allow
communities to perform the same task for P2P traffic, e.g. Corelli. There are three
kinds of web caches: 1) Client Caching, 2) Proxy Caches, 3) Gateway caches.


Client Caching: It is built into the web browser and stores the contents of web sites
near the client side.
Proxy Caches: A proxy cache is installed near the Web users, say within an
enterprise. Users in the enterprise are told to configure their browsers to use the proxy.
Requests for objects from a website are intercepted and handled by the proxy cache. If
they are not in the cache, the proxy gets them from another cache or from the website
itself.
Gateway caches: These are installed in the gateways connecting different networks.

2.4.1 Differentiated Service Architecture of Web caching:


Requests for a web object are classified into different classes with each class holding
its own queue [22]. Here the term object lifetime refers to the amount of time the
object resides in cache. Object lifetime is a metric of the hit ratio of the object and
provides a measure to the amount of service received by the object at the cache.
Moreover, the classifications of objects depend upon the mechanisms deployed by the
web server QoS schemes. The objectives of resource allocation and service
differentiation can be achieved by employing different classification policies.
Classifications for the proposed service differentiation can be categorized into
1. Source - based: Source hostname / IP Address
2. Content - based: File type (HTML, Images, Video etc)
3. Economics based: Payment by object owner
4. Popularity based: Object Popularity
5. Size- based: Object size

Desired relative hit rate for each class is assigned based on service differentiation
policy. This architecture assigns value based on previous measurements taken without
differentiation criteria. The desired relative hit rates of all classes are normalized
such that they sum to 1.


Relative hit rate is measured as:

Ri = Hi / (H1 + H2 + H3 + ... + Hn)

Difference between measured relative hit rate and desired relative hit rate is used
to adjust the space allocated to particular class. To evaluate this differentiated model
that provides QoS, the architecture has been implemented in widely used Squid Proxy
Server. Squid is an open-source, high performance, Internet Proxy Cache that services
HTTP requests on behalf of clients. Squid maintains cache of documents that are
requested to avoid refetching from the web server if another client makes the same
request. Hit rate (H) is considered an important parameter in measuring the Cache
efficiency.
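The relative hit rate and the space-adjustment feedback can be sketched as follows. The proportional adjustment rule shown here is an assumption for illustration, not the rule used in the Squid implementation; the class names are hypothetical.

```python
def relative_hit_rates(hits):
    """Ri = Hi / (H1 + H2 + ... + Hn) for each class i."""
    total = sum(hits.values())
    return {c: h / total for c, h in hits.items()}

def adjust_space(space, desired, measured, step=0.1):
    """Nudge each class's cache share toward its desired relative hit rate."""
    new = {c: space[c] * (1 + step * (desired[c] - measured[c])) for c in space}
    total = sum(new.values())
    return {c: s / total for c, s in new.items()}  # renormalize: shares sum to 1

hits = {"html": 120, "image": 60, "video": 20}     # hypothetical per-class hit counts
measured = relative_hit_rates(hits)
print(measured["html"])  # 0.6
desired = {"html": 0.5, "image": 0.3, "video": 0.2}
shares = adjust_space({"html": 1/3, "image": 1/3, "video": 1/3}, desired, measured)
print(round(sum(shares.values()), 6))  # 1.0
```

A class whose measured relative hit rate exceeds its desired rate (here "html") loses a little cache space, and the under-served classes gain it.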

2.4.2 Transcoding at proxy caches


Proxy caching is a popular approach to enhance the performance of web content
delivery. The proxies are usually deployed much closer to the clients than the content
servers. Frequently accessed objects are stored at the proxies so that client requests
can be satisfied without having to contact the distant content servers. Multimedia
contents may need to be converted to a form suitable for display on the target client
depending on the availability of resources. This conversion process is called
transcoding. Possible forms of transcoding include lowering the bit rate of a media
stream by reducing the image resolution, size and/or frame rate, converting a media
stream from one encoding format to another and a combination of these. Transcoding
can be performed at either the content servers or proxies.
Providing transcoding service at the proxies has introduced new challenges on proxy
caching. The transcoding process is typically computation intensive. With
transcoding, considerable amount of CPU resource at the proxy is required to
complete a request. Thus, in addition to the objective of reducing network traffic
addressed by existing caching schemes the potential bottleneck of proxy CPU needs
to be considered as well. An effective caching scheme for transcoding proxies should
deal with network and CPU demands in an integrated fashion.

2.5 Algorithms for transcoding proxies:


Several caching algorithms for transcoding proxies are discussed. These are of two
types: static and dynamic. Since the web workload changes with time, we discuss
only the dynamic algorithms.

2.5.1 Full Version Only (FVO) Algorithm


A common goal of existing caching algorithms for traditional web pages is to reduce
the network traffic between the content servers and the proxy [11]. The Full Version
Only (FVO) algorithm is a simple extension of this class of algorithms. FVO ignores
the existence of transcoded object versions and only caches full object versions. To
reduce network demand, the caching gain of a full object version is determined by its
bandwidth consumption. Consider a full object version v. Let f(v) be the access
frequency of v. Let b(v) and l(v) be the bandwidth requirement to fetch v from the
content server and the session duration of v, respectively. The caching gain of v is
given by f(v) · b(v) · l(v). To maximize the total caching gain, FVO tries to cache full
object versions with the highest gain density (i.e., the caching gain per unit cache
space occupied)

Dnet(v) = f(v) · b(v) · l(v) / s(v)

where s(v) is the size of v.

2.5.2 Transcoded Version Only (TVO) Algorithm


The Transcoded Version Only (TVO) algorithm aims at reducing CPU demand by
only caching the transcoded object versions. The caching gain of a transcoded object
version is defined as its CPU consumption to transcode. Suppose v is a transcoded
object version and vF is the corresponding full version. Let c(v) be the required
proportion of CPU power to transcode vF to v. Then the caching gain of v is given by
f(v) · c(v) · l(v), where l(v) = l(vF) is the session duration of v. Similar to FVO, to
maximize the total caching gain, TVO tries to cache transcoded object versions with
the highest gain density

Dcpu(v) = f(v) · c(v) · l(v) / s(v)

where s(v) is the size of v.

2.5.3 Adaptive Caching Algorithm


Both the FVO and TVO algorithms have advantages and drawbacks. FVO tries to
reduce network demand by caching full object versions. However, it puts much more
CPU demand on the transcoding proxy as a transcoded object version needs to be
produced on-the-fly every time it is requested. On the other hand, TVO reduces
computation demand by caching the transcoded object versions. The drawback is that
multiple versions of the same objects may have to be cached, resulting in possibly
higher cache miss ratio.
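The two gain densities can be computed directly from the formulas above; the object parameters below (frequency, bandwidth, CPU share, duration, size) are hypothetical values chosen only to illustrate the comparison an adaptive scheme would make.

```python
def d_net(f, b, l, s):
    """FVO gain density: frequency x bandwidth x session duration per byte cached."""
    return f * b * l / s

def d_cpu(f, c, l, s):
    """TVO gain density: frequency x transcoding CPU share x duration per byte cached."""
    return f * c * l / s

# Hypothetical object: the full version is large but expensive to fetch;
# the transcoded version is small but costs CPU to produce on the fly.
full = dict(f=10, b=2.0, l=30, s=4000)
transcoded = dict(f=10, c=0.4, l=30, s=800)

print(round(d_net(full["f"], full["b"], full["l"], full["s"]), 6))
print(round(d_cpu(transcoded["f"], transcoded["c"], transcoded["l"], transcoded["s"]), 6))
```

An adaptive algorithm can rank every candidate version by the density relevant to the currently scarce resource (network bandwidth or proxy CPU) and cache the versions with the highest density first.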


CHAPTER 3
GENE EXPRESSION PROGRAMMING CONCEPT

3.1 The Terminologies


Gene expression programming (GEP) is [11], like genetic algorithms (GAs)
and genetic programming (GP), a genetic algorithm, as it uses populations of
individuals, selects them according to fitness, and introduces genetic variation using
one or more genetic operators. The fundamental difference between the three
algorithms resides in the nature of the individuals: in GAs the individuals are linear
strings of fixed length (chromosomes); in GP the individuals are nonlinear entities of
different sizes and shapes (parse trees); and in GEP the individuals are encoded as
linear strings of fixed length (the genome or chromosomes) which are afterwards
expressed as nonlinear entities of different sizes and shapes (i.e., simple diagram
representations or expression trees).
The structural organization of GEP genes is better understood in terms of open
reading frames (ORFs). In biology, an ORF, or coding sequence of a gene, begins
with the start codon, continues with the amino acid codons, and ends at a
termination codon. However, a gene is more than the respective ORF, with sequences
upstream from the start codon and sequences downstream from the stop codon.
Although in GEP the start site is always the first position of a gene, the termination
point does not always coincide with the last position of a gene. It is common for GEP
genes to have non coding regions downstream from the termination point.
Consider, for example, the algebraic expression:

√((a+b)*(c-d))        (3.1)

This can also be represented as a diagram or ET:

Fig. 3.1 Expression Tree


where Q represents the square-root function. This kind of diagram representation is
in fact the phenotype of GEP individuals, the genotype being easily inferred from the
phenotype as follows:
01234567
Q*+-abcd

(3.2)

which is the straightforward reading of the ET from left to right and from top to
bottom. Expression (3.2) is an ORF, starting at Q (position 0) and terminating at d
(position 7). These ORFs were named K-expressions (from the Karva language, the
name chosen for the language of GEP). Note that this ordering differs from both the
postfix and prefix expressions used in different GP implementations with arrays or
stacks. The inverse process, that is, the translation of a K-expression into an ET, is
also very simple. Consider the following K-expression:


01234567890
Q*+*a*Qaaba

(3.3)

Looking only at the structure of GEP ORFs, it is difficult or even impossible to see
the advantages of such a representation, except perhaps for its simplicity and
elegance. However, when ORFs are analyzed in the context of a gene, the advantages
of such representation become obvious. As stated previously, GEP chromosomes have
fixed length and are composed of one or more genes of equal length. Therefore the
length of a gene is also fixed. Thus, in GEP, what varies is not the length of genes
(which is constant), but the length of the ORFs. Indeed, the length of an ORF may be
equal to or less than the length of the gene. In the first case, the termination point
coincides with the end of the gene, and in the second case, the termination point is
somewhere upstream from the end of the gene. So, what is the function of these non-
coding regions in GEP genes? They are, in fact, the essence of GEP and evolvability,
for they allow modification of the genome using any genetic operator without
restrictions, always producing syntactically correct programs without the need for a
complicated editing process or highly constrained ways of implementing genetic
operators. Indeed, this is the paramount difference between GEP and previous GP
implementations, with or without linear genomes.
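The level-by-level translation of a K-expression into an ET can be sketched as follows; the `ARITY` table and `decode` helper are illustrative names, and the breadth-first reading matches the left-to-right, top-to-bottom convention described above.

```python
# Arities of the function set; any symbol not listed is a terminal (arity 0).
ARITY = {"Q": 1, "*": 2, "+": 2, "-": 2, "/": 2}

def decode(kexp):
    """Translate a K-expression into an infix string, reading level by level."""
    nodes = [[sym, []] for sym in kexp]
    i, j = 1, 0                          # i: next unused symbol, j: node being wired
    while j < i:                         # only symbols inside the ORF get children
        need = ARITY.get(nodes[j][0], 0)
        nodes[j][1] = nodes[i:i + need]  # children are the next `need` symbols
        i += need
        j += 1

    def to_infix(node):
        sym, kids = node
        if not kids:
            return sym
        if len(kids) == 1:               # Q denotes the square-root function
            return f"sqrt({to_infix(kids[0])})"
        return f"({to_infix(kids[0])}{sym}{to_infix(kids[1])})"

    return to_infix(nodes[0]), i         # i is the ORF length

# A gene whose ORF (Q*+-abcd) is shorter than the gene itself:
expr, orf_len = decode("Q*+-abcdbaa")
print(expr)      # sqrt(((a+b)*(c-d)))
print(orf_len)   # 8: the trailing symbols form the non-coding region
```

Because children are simply the next unused symbols in the string, any gene whose tail contains only terminals decodes to a syntactically valid tree, which is exactly why GEP operators never need an editing step.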

3.2 GEP Implementation


We apply GEP in our project. In this thesis we have chosen GEP as our
platform. It has the following advantages over conventional techniques such as GA
and GP.
Important facts about GEP:
1. GEP encodes an expression as a genome, whereas the older techniques represent
a single entity, a bit string, as the chromosome.
2. GEP always produces valid ETs, whereas GA or GP may reject some entities.
3. GEP can solve relatively complex problems with a relatively small population.
We chose the following settings.

3.2.1 ET of the Fitness Function: It is the hash function (F2) given above. The
resulting ET has mod at its root, with children mod and N; the inner mod has
children + and x; + has children * and q; and * has children p and r.

Fig. 3.2 The Expression Tree for the Fitness Function

The equivalent K-expression from the above ET reads as follows:
mod mod N + x * q p r        (F3)
If the IP address itself were subjected to GEP operations, the crossover and mutations
performed on it would produce many invalid entities, thereby increasing the number
of iterations required to reach the optimal value. Moreover, we require that the IP
address remain unchanged throughout the process. So we have replaced each IP
address by an integer value, i.e. r.

3.3 Creating population & the Chromosome:


In [9], it has been explained how a universal hash function can be implemented
using a genetic algorithm, and it has been shown that a universal hash function gives
minimum collisions. We have designed our algorithm using GEP instead.
The combination of p, q and x would be treated as the chromosome for our
experiment. Let the initial population start with the following.

Population (initial values of p):

00000000000000000000000000000001
00000000000000000000000000000010
00000000000000000000000000000011
00000000000000000000000000000100
00000000000000000000000000000101
00000000000000000000000000000110
00000000000000000000000000000110
00000000000000000000000000000111
00000000000000000000000000000101
00000000000000000000000000000100
The chromosome = each combination of p and q.

Hash code (= r) = 0 to 24
M = 25
X_array (= x): M < x <= 2M, i.e. the x values are
29, 31, 37, 41, 43, 47
Choose N = 20
H(r) = ((p * r + q) mod x) mod N
Putting the first values of p, r, q and x, we get H(r) = 2. Now keeping p and q the
same, apply the formula with the different values of x. The details are tabulated below.
Table 3.1 (The construction of the Universal hash table)

p   q   r    x    p*r+q (A)   A mod x (B)   B mod N (N=20)
1   2   0    29       2           2              2
1   2   0    31       2           2              2
1   2   0    37       2           2              2
1   2   0    41       2           2              2
1   2   0    43       2           2              2
1   2   0    47       2           2              2
1   2   1    29       3           3              3
1   2   1    31       3           3              3
1   2   1    37       3           3              3
1   2   1    41       3           3              3
1   2   1    47       3           3              3
... (the rows for r = 2 to 23 follow the same pattern: A = r + 2 never exceeds
26, so A mod x = A for every x in the range, and the hash value is (r + 2) mod 20,
which wraps to 0 at r = 18) ...
1   2   24   29      26          26              6
1   2   24   31      26          26              6
1   2   24   37      26          26              6
1   2   24   41      26          26              6
1   2   24   43      26          26              6
1   2   24   47      26          26              6
The chromosome = the combination of columns p and q.

Fitness = No. of filled buckets / (No. of collisions + 1)
For this chromosome the number of filled buckets is 20, and five of the 25 hash codes
repeat already-filled buckets; since the cache was already full when they occurred,
these are not counted as real collisions. Taking the effective count as 0 gives
Fitness = 20 / (0 + 1) = 20.
We performed the above analysis manually to observe that a universal hash function
can give the minimum number of collisions (almost zero) with the maximum number
of buckets filled (full). But we cannot always do it manually, so we apply GEP to
obtain the best value of the function F2 by performing GEP operations.
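The manual analysis above can be reproduced mechanically; this sketch recounts the filled buckets and collisions for the chromosome p = 1, q = 2 over the 25 hash codes (the variable names are illustrative).

```python
def H(p, q, r, x, N=20):
    """The universal hash function (F2) from the text."""
    return ((p * r + q) % x) % N

p, q, x = 1, 2, 29               # first chromosome; any x in {29..47} gives the same values
buckets = {}
for r in range(25):              # hash codes r = 0 .. 24
    h = H(p, q, r, x)
    buckets[h] = buckets.get(h, 0) + 1

filled = len(buckets)                              # distinct buckets hit
collisions = sum(c - 1 for c in buckets.values())  # repeats once values wrap past 19
print(filled, collisions)        # 20 5
fitness = filled / (0 + 1)       # wrap-around repeats treated as non-real collisions
print(fitness)                   # 20.0
```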

3.4 The Tools & Sample Data: We used the following tools for our experiment.
1. Code::Blocks 10.05
2. DTREG and
3. GeneXproTools 4.0
Sample data File:
A sample of history files was obtained which is as follows.
Each of the web sites below has a particular IP address. From our project's point of
view, we convert each of them into an index. The index is an arbitrary number,
because it is only a representation of the actual site; hence, we have taken the
numbers 1001 to 1015 for our input file.

Fig. 3.3 The collected website history over the month of August 2011

Table 3.2 (A sample of 15 collected web site history pages)

Web address                                                                    Index
1.  https://login.tikona.in/userportal                                         1001
2.  http://www.google.com/firefox?client=firefox                               1002
3.  http://www.orkut.com/Logout?msg=0&hl=en-US                                 1003
4.  http://www.google.co.in/accounts/                                          1004
5.  http://www.orkut.co.in/Main#Home                                           1005
6.  http://uk.yahoo.com/                                                       1006
7.  http://1.254.254.254/?N=1314542648179                                      1007
8.  https://www.irctc.co.in/cgi-bin/bv60.dll/irctc/booking/planner.do?ReturnB  1008
9.  https://www.ir...ankResponse=true&ErrorTemplate                            1009
10. http://n.admagnet.net/d/pc/?AwMNCgAUDgxUXltfWl9JCh                         1010
11. http://www.indianrail.gov.in/dont_Know_Station_Code.html                   1011
12. http://uk.mc290.mail.yahoo.com/mc/welcome?.gx=1&.tm=1314454235&.rand=1l5oigv20vkup#_pg=showFolder;_ylc  1012
13. http://www.gamesfreak.net/games/Grand-Prix-Go_4208.html                    1013
14. http://news.indiaagainstcorruption.org/?p=3520                             1014
15. https://www.irctc.co.in/                                                   1015

This table is obtained from the above history page.

3.4.1 Genetic parameters


Initial population size = 15
Length of head h=6
Length of tail t = 6 * (2-1) + 1 = 7
Length of Gene = h + t = 6 + 7 =13
No. of genes per chromosome =2
Length of Chromosome = 2 * 13 = 26

3.5 About DTREG:


It is predictive modeling software. DTREG (pronounced D-T-Reg) builds
classification and regression decision trees, neural networks, support vector machine
(SVM), GMDH polynomial networks, gene expression programs, K-Means
clustering, discriminant analysis and logistic regression models that describe data
relationships and can be used to predict values for future observations. DTREG also
has full support for time series analysis.
DTREG accepts a dataset consisting of a number of rows, with a column for each
variable. One of the variables is the target variable whose value is to be modeled
and predicted as a function of the predictor variables. DTREG analyzes the data and
generates a model showing how best to predict the values of the target variable based
on values of the predictor variables.

Classes of variables used in DTREG


Target variable: The target variable is the variable whose values are to be modeled
and predicted by other variables. It is analogous to the dependent variable (i.e., the
variable on the left of the equal sign) in linear regression. There must be one and only
one target variable.
Predictor variable: A predictor variable is a variable whose values will be used to
predict the value of the target variable. It is analogous to the independent variables
(i.e., the variables on the right side of the equal sign) in linear regression. There must
be at least one predictor variable specified, and there may be many predictor
variables. If more than one predictor variable is specified, DTREG will determine
how the predictor variables can be combined to best predict the values of the target
variable. For time series analysis, DTREG can automatically generate lag variables.
Weight variable: Optionally, you can specify a weight variable. If a weight variable
is specified, it must be a numeric (continuous) variable whose values are greater than or
equal to 0 (zero). The value of the weight variable specifies the weight given to a row
in the dataset. For example, a weight value of 2 would cause DTREG to give twice as
much weight to a row as it would to rows with a weight of 1; the effect on model
training is the same as two occurrences of the row in the dataset. Weight values may
be real (non-integer) values such as 2.5. A weight value of 0 (zero) causes the row to
be ignored. If you do not specify a weight variable, all rows are given equal weight.
An integer weight value has the same effect on model training as duplicating rows the
equivalent number of times in the training data. Since the goal of model training is to
tune parameters to minimize the overall error (or variance) of the training data,
weighted (or duplicated) rows that are misclassified add a greater amount to the total
error than un-weighted rows, so they have an increased influence on the model.
Types of Variables
Variables may be of two types: continuous and categorical.
Continuous variables with ordered values -- A continuous variable has numeric
values such as 1, 2, 3.14, -5, etc. The relative magnitude of the values is significant
(e.g., a value of 2 indicates twice the magnitude of 1). Examples of continuous
variables are blood pressure, height, weight, income, age, and probability of illness.
Some programs call continuous variables "ordered", "ordinal", "interval" or
"monotonic" variables. If a variable is numeric and the values indicate relative
magnitude or order, then the variable should be declared as continuous even if the
numbers are discrete and do not form a continuous scale.
Categorical variables with unordered values -- A categorical variable has values
that function as labels rather than as numbers. Some programs call categorical
variables nominal variables. For example, a categorical variable for gender might use
the value 1 for male and 2 for female. The actual magnitude of the value is not
significant; coding male as 7 and female as 3 would work just as well. As another
example, marital status might be coded as 1 for single, 2 for married, 3 for divorced
and 4 for widowed. DTREG allows you to use non-numeric (character string) values
for categorical variables. So your dataset could have the strings "Male" and "Female"
or "M" and "F" for a categorical gender variable. Because categorical values are
stored and compared as string values, a categorical value of "001" is different than a
value of "1". In contrast, values of 001 and 1 would be equal for continuous variables.


3.5.1 Creating a New Project

To create a new project, the leftmost icon on the toolbar is clicked. Project wizard
screens guide the user through setting up the project.

The Evolution property page contains parameters that control evolution operations
such as mutation and recombination.
Mutation and inversion rates
Mutation rate: This is the probability that a symbol (variable, function or constant)
in a gene will be mutated during each generation. Symbols in the head of a gene can
be replaced by variables, functions and constants (if constants are used); symbols in
the tail of the gene can be replaced only by variables and constants.
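The head/tail restriction on mutation can be sketched as follows. This is an illustrative sketch, not a GeneXproTools internal: the symbol sets match the fitness function used earlier in this thesis, and the `mutate` helper is a hypothetical name.

```python
import random

# Hypothetical symbol sets, matching the fitness function's ET above
FUNCTIONS = ["+", "-", "*", "mod"]
TERMINALS = ["p", "q", "r", "x", "N"]

def mutate(gene, head_len, rate, rng):
    """Point mutation: head symbols may become functions or terminals,
    tail symbols only terminals, so every mutant still decodes to a valid ET."""
    out = []
    for i, sym in enumerate(gene):
        if rng.random() < rate:
            pool = FUNCTIONS + TERMINALS if i < head_len else TERMINALS
            sym = rng.choice(pool)
        out.append(sym)
    return out

# A gene with head length 6 and tail length 7, as in the parameters above
gene = ["mod", "mod", "N", "+", "x", "*", "q", "p", "r", "r", "p", "q", "x"]
mutant = mutate(gene, head_len=6, rate=0.3, rng=random.Random(7))
print(len(mutant) == len(gene))                 # True: gene length is fixed
print(all(s in TERMINALS for s in mutant[6:]))  # True: tail holds only terminals
```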
Inversion rate: This is the probability that the inversion operation will be performed
on a chromosome. Inversion selects a random starting symbol in a gene and a random
ending symbol. All of the symbols between the starting and ending points are then
reversed in order.
Transposition rates
Transposition is the process of moving a sequence of symbols in a gene from one
location to another. Some types of transposition allow sequences of symbols to be
moved from one gene to another gene in the same chromosome.
IS transposition rate: This is the probability that Insertion Sequence Transposition will
be applied to a chromosome. Source and destination genes are selected in the
chromosome; the source gene may be the same as the destination. Starting and ending
symbol positions are selected in the source gene. The starting point may be in the
head or tail section of the gene, and the selected section may span the head and tail.
The destination, insertion point is selected in the head of the destination gene, but it is
not allowed to be the first (root) symbol of the gene, and the selection length is
restricted so that it will remain entirely in the head of the destination gene. The
selected sequence of symbols is then inserted into the destination gene, and any
symbols following the insertion point that are in the head of the destination gene are
moved right to make room of the insertion. Symbols shifted out of the head by the
insertion are discarded.
RIS transposition rate: This is the probability that Root Insertion Sequence
Transposition will be applied to a chromosome. A random scan point is selected in the
head of a gene beyond the first (root) symbol of the gene. The process then scans


forward looking for a function symbol. If no function is found, RIS transposition does
nothing. If a function is found, a random ending point is selected beyond the starting
point but in the head of the gene. The symbols in the selected range are then inserted
at the beginning (root) of the gene. Symbols pushed out of the head by the insertion
are discarded.
Gene transposition rate: This is the probability that Gene Transposition will be applied
to a chromosome. A random gene that is not the first gene of a chromosome is
selected. This gene is then inserted as the first gene of the chromosome. The gene
being inserted is removed from its original location, and the genes preceding it are
moved over to make room for the insertion at the head of the chromosome. So the
length of the chromosome is not changed.
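Gene transposition can be sketched as follows (a minimal illustration, representing a chromosome as a vector of gene strings; the function name is an assumption):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Move the gene at index idx (idx > 0) to the front of the chromosome;
// the chromosome length is unchanged.
std::vector<std::string> geneTranspose(std::vector<std::string> genes, int idx) {
    std::string g = genes[idx];
    genes.erase(genes.begin() + idx);
    genes.insert(genes.begin(), g);
    return genes;
}
```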
Recombination rates
During Recombination, two chromosomes are randomly selected, and genetic material
is exchanged between them to produce two new chromosomes. It is analogous to the
process that occurs when two individuals are bred, and the offspring share genetic
material from both parents.
One-point rate: This is the probability that one-point recombination will be applied to a
chromosome. Two parent chromosomes are randomly selected and paired together. A
split point is selected anywhere in the chromosomes (any gene and any position in a
gene head or tail). The symbols in the parents from the split point to the ends of the
chromosomes are then exchanged between the parents. Note that all chromosomes
have the same number of symbols, so no symbols are lost during the exchange.
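One-point recombination can be sketched as follows (a minimal illustration over symbol strings; the function name is an assumption):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Exchange the symbols of two equal-length chromosomes from the split
// point to the end; since lengths match, no symbols are lost.
std::pair<std::string, std::string>
onePoint(std::string a, std::string b, std::size_t split) {
    for (std::size_t i = split; i < a.size(); ++i)
        std::swap(a[i], b[i]);
    return {a, b};
}
```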
Two-point rate: This is the probability that two-point recombination will be applied to a
chromosome. Two parent chromosomes are randomly selected and paired together.
Two recombination points are selected in the chromosomes. The symbols between the
starting and ending recombination points are then exchanged between the parent
genes.
Gene recombination rate: This is the probability that gene recombination will be
applied to a chromosome. Two parent chromosomes are randomly selected and paired
together. A random gene is selected and exchanged between the parent chromosomes.


DTREG provides a full implementation of the Gene Expression Programming
algorithm developed by Candida Ferreira (Ferreira 2006). Here are some of the
features of DTREG's implementation:
* Continuous and categorical target variables
* Automatic handling of categorical predictor variables
* A large library of functions that you can select for inclusion in the model
* Mathematical and logical (AND, OR, NOT, etc.) function generation
* Choice of many fitness functions
* Both static linking functions and evolving homeotic genes
* Fixed and random constants
* Nonlinear regression to optimize constants
* Parsimony pressure to optimize the size of functions
* Automatic algebraic simplification of the combined function
* Several forms of validation including cross-validation and hold-out
* Computation of the relative importance of predictor variables
* Automatic generation of C or C++ source code for the functions
* Multi-CPU execution for multiple target categories and cross-validation
In ordinary mathematical regression, the procedure is given the form of the function
to be fitted to the data. This could be a linear function for linear regression or a
general mathematical function for nonlinear regression. The regression procedure
computes the optimal values of parameters for the function to make the function fit a
data set as well as possible, but the regression procedure does not alter the form of the
function. For example, a linear regression problem with two variables has the form:
y = a + bx
where x is the independent variable, y is the dependent variable, and a and b are
parameters whose values are to be computed by the regression algorithm. This type of
procedure is classified as parametric regression because the goal is to estimate
parameters for a function whose form is known (or assumed).


Gene Expression Programming is an elegant solution to the expression-mutation
problem, discovered in 1999 by Candida Ferreira (Ferreira 2006). Ferreira devised
a system for encoding expressions that allows fast application of a wide variety of
mutation and cross-breeding techniques while guaranteeing that the resulting
expression will always be syntactically valid. This approach is called Gene
Expression Programming (GEP). Experiments have shown that GEP is 100 to 60,000
times faster than older genetic algorithms.

3.5.3 Genes
A gene consists of a fixed number of symbols encoded in the Karva language. A gene
has two sections, the head and the tail. The head is used to encode functions for the
expression. The tail is a reservoir of extra terminal symbols that can be used if there
aren't enough terminals in the head to provide arguments for the functions. Thus, the
head can contain functions, variables and constants, but the tail can contain only
variables and constants (i.e. terminals). The number of symbols in the head of a gene
is specified as a parameter for the analysis. The number of symbols in the tail is
determined by the equation
t = h * (MaxArg - 1) + 1
where t is the number of symbols in the tail, h is the number of symbols in the head,
and MaxArg is the maximum number of arguments required by any function that is
allowed to be used in the expression. For example, if the head length is 6 and the
allowable set of functions consists of binary operators (+, -, *, /), then the tail length
is:
t = 6 * (2 - 1) + 1 = 7
The purpose of the tail is to provide a reservoir of terminal symbols (variables and
constants) that can be used as arguments for functions in the head if there aren't
enough terminals in the head.
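The tail-length equation above can be written directly as a small helper (a sketch; the function name is an assumption):

```cpp
#include <cassert>

// Tail length from head length h and the maximum function arity MaxArg:
// t = h * (MaxArg - 1) + 1
int tailLength(int h, int maxArg) {
    return h * (maxArg - 1) + 1;
}
```

For the worked example in the text, a head of 6 symbols with only binary operators gives a tail of 7 symbols.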
Chromosomes and Linking Functions


A chromosome consists of one or more genes. The number of genes in a chromosome
is a parameter for the analysis. If there is more than one gene in a chromosome, then a
linking function is used to join the genes in the final function. The linking function
can be static or evolving.
For example, consider a chromosome with two genes having the K-expressions:
Gene 1: *ab
Gene 2: /cd
If + is used as the static linking function, then the combined expression is the tree
rooted at +, which is equivalent to (a * b) + (c / d).
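The linking of the two genes can be sketched as follows (a minimal illustration; the function names `gene1`, `gene2` and `linked` are assumptions used only to mirror the example):

```cpp
#include <cassert>

// Gene 1 encodes the K-expression *ab, gene 2 encodes /cd; a static
// linking function + joins their values into (a * b) + (c / d).
double gene1(double a, double b) { return a * b; }
double gene2(double c, double d) { return c / d; }
double linked(double a, double b, double c, double d) {
    return gene1(a, b) + gene2(c, d);
}
```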

3.6 Parsimony Pressure and Expression Simplification


If two expressions do an equally good job of fitting a data set, the simpler expression
is usually preferred. For symbolic regression, complexity is measured by the number
of symbols and functions in the expression. Gene expression programming has two
techniques for selecting simpler expressions over more complex ones.
The first approach is to adjust the fitness scores of individuals so that fitness is
reduced by an amount proportional to the complexity of the expression. This penalty
for complexity is called parsimony pressure. See page 95 for information about how
to adjust how much parsimony pressure is applied. While parsimony pressure is
effective at guiding evolution toward simpler expressions, experiments have shown
that parsimony pressure may hinder the process of evolving toward greater fitness. It
is not uncommon for more complex expressions to do a better job of fitting than less
complex ones, so pushing evolution to favor simpler expressions may increase the


number of generations required to find a solution, or it may make it impossible to find
a good solution. If parsimony pressure is used, you also should build a model with it
turned off, and verify that the simpler solution does not lose significant accuracy.
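The fitness adjustment described above can be sketched as a simple penalty (an assumed form for illustration; DTREG's actual penalty formula is not given in this text):

```cpp
#include <cassert>

// Fitness is reduced by an amount proportional to the expression's
// complexity (its symbol count); `pressure` sets the penalty weight.
double parsimonyFitness(double rawFitness, int symbolCount, double pressure) {
    return rawFitness - pressure * symbolCount;
}
```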
The second approach to finding parsimonious solutions is to divide the task into two
phases: (1) primary training without parsimony pressure, and (2) secondary training
which uses parsimony pressure. Since the primary training is done without parsimony
pressure, evolution can focus on finding the most accurate model as quickly as
possible. Once primary training is finished, a second round of training begins using
the final population from primary training as the starting population for the secondary
training.
During secondary training, parsimony pressure is used to try to find a simpler
expression that is at least as good as the best one found during primary training. While
secondary training is being performed, the primary goal is still to improve accuracy,
and the secondary goal is to find simpler expressions. So a simpler expression will be
selected only if its accuracy meets or exceeds the best accuracy previously found. If a
more accurate expression is found, it is used even if the result is an increase in
complexity. So it is possible that during the secondary training complexity could
actually increase in order to improve accuracy. But experiments have shown that this
rarely happens, and secondary training usually results in simpler expressions. Since
there is never any risk of losing accuracy with this approach, and it may result in a
simpler and possibly more accurate expression, it is recommended.
Algebraic Simplification
DTREG includes a sophisticated procedure for performing algebraic simplification on
expressions after gene expression programming has evolved the best expressions.
This simplification does not alter the mathematical meaning of expressions; it just
does simplifications such as grouping common terms and simplifying identities. Here
are some examples of simplifications that it can perform:
a + b + a + a + b + a + a => 5a + 2b


(a + b) / (b + a) => 1
a AND NOT a => 0
Optimization of Random Constants
In addition to functions and variables, expressions can contain constants. You can
specify a set of explicit constants, and you can allow DTREG to generate and evolve
random constants. While evolution can do a good job of finding an expression that fits
data well, it is difficult for evolution to come up with exact values for real constants.
DTREG provides an optional final step to the GEP process to refine the values of
random constants. If this option is enabled, DTREG uses a sophisticated nonlinear
regression algorithm to refine the values of the random constants. This optimization is
performed after evolution has developed the functional form and linking and
simplification have been performed. DTREG uses a model/trust-region technique
along with an adaptive choice of the model Hessian. The algorithm is essentially a
combination of Gauss-Newton and Levenberg-Marquardt methods; however, the
adaptive algorithm often works much better than either of these methods alone.
If nonlinear regression does not improve the accuracy of the model, the original
model is used. So there is no risk of losing accuracy by using this option.

Chapter 4
EXPERIMENT AND RESULT ANALYSIS


4.1 Genetic Operations in C++
Code::Blocks 10.05 was used to run the program for GEP operations such as crossover
and mutation in C++.
The program is:
#include <iostream>
#include <conio.h>
#include <cstdlib>
#include <cstring>
#include <ctime>
using namespace std;
int NOT(int a)
{
if(a==1)
return 0;else
if (a==0)
return 1; else
cout<<"improper binary value... put only 1 or 0";
return 0;
}
int power(int n)
{
int prod=1, i;
i= n;
for(int r=0;r<i;r++)
{
prod = prod*2;
}


return prod;
}
void mutation(int a)
{
int i = a,j=0;
int arr1[12],arr2[12];
do{
arr1[j]= i%2;
i= (i/2);
++j;
} while(i!=1);
arr1[j] = 1;
for(int n=j+1;n<12;n++)
{
arr1[n]=0;
}
cout<<"\nBefore Mutation.........\n";
for(int k=11,l=0;k>=0;k--,l++)
{
arr2[l]=arr1[k];
cout<<arr2[l];
}
cout<<"\n";
int pos;
pos=6;
cout<< "\nAfter mutation at position "<<pos<<endl;
int l;
l= NOT(arr2[pos]);
for(int x=0;x<pos;x++)
cout<<arr2[x];
cout<<l;
for(int x=pos+1;x<12;x++)


cout<<arr2[x];
}
void delay(int n)
{
for(int i=0;i<n;i++)
{
}
}
void CrossOver(int a, int b)
{
int i = a,j=0;
int arr1[12],arr2[12], arr3[12], arr4[12],arr5[12],arr6[12];
char ch;
ch = getch();
if(ch=='y'){
cout<<"\n\n\nGoing to exit from console.......\n\n";
exit(0);
}
else if(ch=='c')
{
do{
arr1[j]= i%2;
i= (i/2);
++j;
} while(i!=1);
arr1[j] = 1;
for(int n=j+1;n<12;n++)
{
arr1[n]=0;
}


cout<<"\t\nBefore cross over.........\n";


for(int k=11,l=0;k>=0;k--,l++)
{
arr2[l]=arr1[k];
cout<<arr2[l];
}
cout<<"\n";
int x =b, y=0,w,z;
do
{
arr3[y] = x%2;
x=x/2;
++y;
}while(x!=1);
arr3[y]=1;
for(int m=y+1;m<12;m++)
{
arr3[m]=0;
}
for( w=11,z=0;w>=0;w--,z++)
{
arr4[z]=arr3[w];
cout<<arr4[z];
}
static int pos=1, temp,q;
cout<<"\nAfter cross over At Position "<<pos<<"\n";
if(pos==12)
pos=0;
pos++;


for(q=pos;q<12;q++)
{
temp=arr2[q];
arr2[q] = arr4[q];
arr4[q]=temp;
}
int sum_K=0,sum_R=0;
for(int p=0,m=11;p<=11;p++,m--)
{
cout<<arr2[p];
arr5[m]=arr2[p];
}
//cout<<"\nContents of array5 = ";
for(int m=0;m<=11;m++)
{
int y= power(m);
arr5[m] = arr5[m]*y;
sum_K=sum_K+arr5[m];
}
cout<<"\n";
cout<<sum_K;
cout<<"\n";
for(int r=0, n=11;r<=11;r++,n--)
{
cout<<arr4[r];
arr6[n]=arr4[r];
}
for(int n=0;n<=11;n++)
{
int y= power(n);
arr6[n]=arr6[n]*y;
//cout<<arr5[p];


sum_R=sum_R+arr6[n];
}
cout<<"\n";
cout<<sum_R;
static int val=0;
char ch1=getch();
if (ch1=='t')
{
cout<<"\nPerforming mutation on "<<val<<" th No. after Cross Over....";
if(val==0)
mutation(sum_K);
else if(val==1)
mutation(sum_R);
val++;
if(val>=2)
val=0;
}
else
CrossOver(sum_K, sum_R);
delay(500000000);
}
else exit(0);
}
int main()
{
CrossOver(135,245);
//mutation(56);
return 0;
}
Output :


Fig. 4.1.1 Output of the basic steps of crossover

Fig. 4.1.2 Output iterations of GA


4.2 Program Showing GEP is Faster than GA: This program demonstrates the
steps of each evolution. In our project, according to the principles of GEP, the fitness function
is the chromosome itself, so each evolution depicts one possible combination of the
following chromosomes.
Evolution 1: + * % p q r x n p q r x n
Evolution 2: * % p q r x n p q r x n +
Evolution 3: % p q r x n p q r x n + * ... and so on. To simplify the structure and to
evaluate the computations, we have taken the following expressions in the program. The
expressions are not visible to the outside, but predict as if each chromosome is either selected
or discarded. The basic principle in this GEP program is to discard those chromosomes that
cannot produce a valid expression tree, in other words, each chromosome that does not
start with a function symbol such as +, * or %.
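This selection rule can be sketched as a small validity check (an illustrative helper, not part of the listed program; the function name is an assumption):

```cpp
#include <cassert>
#include <string>

// Keep a chromosome only if its first symbol is a function (+, * or %);
// a terminal at the root cannot produce a valid expression tree.
bool validChromosome(const std::string& chrom) {
    return !chrom.empty() &&
           (chrom[0] == '+' || chrom[0] == '*' || chrom[0] == '%');
}
```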
#include <iostream>
#include <cstdlib>
#include <conio.h>
#include <cmath>
#include <cstdio>
using namespace std;
int p, q, r, x, n;
int funct1(){ return (p+q+r+x+n); }
int funct2(){ return ((p+r)*(p+r)); }
int funct3(){ return (p+r+p+q); }
int funct4(){ return (p+r+p+x); }

int funct5(){ return (p+r+q+x); }
int funct6(){ return (p+r+x+n); }
int funct7(){ return (p+r+p+n); }
int funct8(){ return (p+r+r+x); }
int funct9(){ return (p+r+q+n); }
int funct10(){ return (p*r+p*r); }
int funct11(){ return (p*r+p+r); }
int funct12(){ return (p*r+p+q); }
int funct13(){ return (p*r+p+x); }
int funct14(){ return (p*r+p+n); }
int funct15(){ return (p*r+r+q); }
int funct16(){ return (p*r+r+x); }
int funct17(){ return (p*r+r+n); }
int funct18(){ return (p*r+q+x); }
int funct19(){ return (p*r+q+n); }
int funct20(){ return (p*r+x+n); }
int funct21(){ return (p*r+p*r); }
int funct22(){ return (p*r+p*q); }
int funct23(){ return (p*r+p*x); }

int funct24(){ return (p*r+p*n); }
int funct25(){ return (p*r+r*q); }
int funct26(){ return (p*r+q*x); }
int funct27(){ return (p*r+r*x); }
int funct28(){ return (p*r+r*n); }
int funct29(){ return (p*r+p%n); }
int funct30(){ return (p*r+p%r); }
int funct31(){ return (p*r+p%q); }
int funct32(){ return (p*r+r%q); }
int funct33(){ return (p*r+r%x); }
int funct34(){ return (p*r+r%n); }
int funct35(){ return (p*r+r%p); }
int funct36(){ return (p*r+q%p); }
int funct37(){ return (p*r+q%r); }
int funct38(){ return (p*r+q%x); }
int funct39(){ return (p*r+q%n); }
int funct40(){ return (p*r+x%n); }
int funct41(){ return (p*r+x%p); }
int funct42(){ return (p*r+x%q); }

int funct43(){ return (p*r+x%r); }
int funct44(){ return (p*r+n%p); }
int funct45(){ return (p*r+n%q); }
int funct46(){ return (p*r+n%x); }
int funct47(){ return (p*r+n%r); }
int funct48(){ return (p*r+q*x); }
int funct49(){ return (p*r+q*n); }
int funct50(){ return (p*r+q*r); }
int funct51(){ return (p*r+x*n); }
int funct52(){ return (p*r+q%n); }
int funct53(){ return (p*r*(p+q)); }
int funct54(){ return (p*r*(x+n)); }
int funct55(){ return (p*r*(p+n)); }
int funct56(){ return (p*r*(r+x)); }
int funct57(){ return (p*r*(q+n)); }
int funct58(){ return (p*r*(p*r)); }
int funct59(){ return (p*r+q%n); }
int funct60(){ return (p*r*(p*x)); }
int funct61(){ return (p*r*(p*n)); }

int funct62(){ return (p*r*(r*q)); }
int funct63(){ return (p*r*(q*x)); }
int funct64(){ return (p*r*r*x); }
int funct65(){ return (p*r*r*n); }
int funct66(){ return (p*r*r%n); }
int funct67(){ return (p*r*p%n); }
int funct68(){ return (p*r*p%q); }
int funct69(){ return (p*r*r%q); }
int funct70(){ return (p*r*r%x); }
int funct71(){ return (p*r*r%p); }
int funct72(){ return (p*r*q%p); }
int funct73(){ return (p*r*q%n); }
int funct74(){ return (p*r*q%x); }
int funct75(){ return (p*r*q%r); }
int funct76(){ return (p*r*x%q); }
int funct77(){ return (p*r*x%r); }
int funct78(){ return (p*r*x%n); }
int funct79(){ return (p*r*x%p); }

int funct80(){ return (p*r*(n%p)); }
int funct81(){ return (p*r*(n%q)); }
int funct82(){ return (p*r*(n%x)); }
int funct83(){ return (p*r*(n%r)); }
int funct84(){ return (p*r*(q*n)); }
int funct85(){ return (p*r*x*n); }
int funct86(){ return (p*r*p*q); }
int funct87(){ return (p*r+q%n); }
int funct88(){ return (p+q+p+q); }
int funct89(){ return (p+q+x+n); }
int funct90(){ return (p+q+x+r); }
int funct91(){ return (p+q+r+n); }
int funct92(){ return (p+q+q+x); }
int funct93(){ return (p+q+q+n); }
int funct94(){ return (p+q+p*q); }
int funct95(){ return (p+q+p*x); }
int funct96(){ return (p+q+p*n); }
int funct97(){ return (p+q+r*q); }
int funct98(){ return (p+q+r*x); }

int funct99(){ return (p+q+q*x); }
int funct100(){ return (p+q+q*n); }
int funct101(){ return (p+q+r*n); }
int funct102(){ return (p+q+x*n); }
int funct103(){ return (p+q+p%q); }
int funct104(){ return (p+q+q%r); }
int funct105(){ return (p+q+q%x); }
int funct106(){ return (p+q+q%n); }
int funct107(){ return (p+q+p%r); }
int funct108(){ return (p+q+p%x); }
int funct109(){ return (p+q+p%n); }
int funct110(){ return (p+q+r%p); }
int funct111(){ return (p+q+r%q); }
int funct112(){ return (p+q+r%x); }
int funct113(){ return (p+q+r%n); }
int funct114(){ return (p+q+x%p); }
int funct115(){ return (p+q+x%q); }
int funct116(){ return (p+q+x%n); }
int funct117(){ return (p+q+x%r); }

int funct118(){ return (p+q+n%p); }
int funct119(){ return (p+q+n%q); }
int funct120(){ return (p+q+n%r); }
int funct121(){ return (p+q+n%x); }
int funct122(){ return (p+x+p+r); }
int funct123(){ return (p+x+p+n); }
int funct124(){ return (p+x+p+q); }
int funct125(){ return (p+x+q+r); }
int funct126(){ return (p+x+r+x); }
int funct127(){ return (p+x+r+n); }
int funct128(){ return (p+x+q+x); }
int funct129(){ return (p+x+q+n); }
int funct130(){ return (p+x+x+n); }
int funct131(){ return (p+x+p*r); }
int funct132(){ return (p+x+p*q); }
int funct133(){ return (p+x+p*n); }
int funct134(){ return (p+x+r*q); }
int funct135(){ return (p+x+r*x); }
int funct136(){ return (p+x+r*n); }

int funct137(){ return (p+x+p%q); }
int funct138(){ return (p+x+p%r); }
int funct139(){ return (p+x+p%n); }
int funct140(){ return (p+x+p%x); }
int funct141(){ return (p+x+r%q); }
int funct142(){ return (p+x+r%n); }
int funct143(){ return (p+x+q*x); }
int funct144(){ return (p+x+q*n); }
int funct145(){ return (p+x+x*n); }
int funct146(){ return (p+x+q%x); }
int funct147(){ return (p+x+x%n); }
int funct148(){ return (p+x+q%n); }
int funct149(){ return (p+x+n%p); }
int funct150(){ return (p+x+n%q); }
int funct151(){ return (p+x+n%r); }
int funct152(){ return (p+x+n%x); }
int funct153(){ return (p+x+q%r); }
int funct154(){ return (p+x+r%p); }
int funct155(){ return (p*r+q%p); }

int funct156(){ return (p+n+p+r); }
int funct157(){ return (p+n+p+q); }
int funct158(){ return (p+n+p+x); }
int funct159(){ return (p+n+p+n); }
int funct160(){ return (p+n+r+q); }
int funct161(){ return (p+n+r+x); }
int funct162(){ return (p+n+r+n); }
int funct163(){ return (p+n+q+x); }
int funct164(){ return (p+n+q+n); }
int funct165(){ return (p+n+x+n); }
int funct166(){ return (p+n+p*r); }
int funct167(){ return (p+n+p*q); }
int funct168(){ return (p+n+p*n); }
int funct169(){ return (p+n+r*q); }
int funct170(){ return (p+n+r*x); }
int funct171(){ return (p+n+r*n); }
int funct172(){ return (p+n+p%r); }
int funct173(){ return (p+n+p%q); }
int funct174(){ return (p+n+p%x); }

int funct175(){ return (p+n+p%n); }
int funct176(){ return (p+n+r%q); }
int funct177(){ return (p+n+r%x); }
int funct178(){ return (p+n+r%n); }
int funct179(){ return (p+n+q*x); }
int funct180(){ return (p+n+q*n); }
int funct181(){ return (p+n+q*r); }
int funct182(){ return (p+n+x*n); }
int funct183(){ return (p+n+q%x); }
int funct184(){ return (p+n+q%r); }
int funct185(){ return (p+n+q%p); }
int funct186(){ return (p+n+q%n); }
int funct187(){ return (p+n+x%p); }
int funct188(){ return (p+n+x%q); }
int funct189(){ return (p+n+x%r); }
int funct190(){ return (p+n+x%n); }
int funct191(){ return (p+n+r%p); }
int funct192(){ return (r+q+p+r); }
int funct193(){ return (r+q+p+q); }

int funct194(){ return (r+q+p+x); }
int funct195(){ return (r+q+p+n); }
int funct196(){ return (r+q+r+q); }
int funct197(){ return (r+q+r+x); }
int funct198(){ return (r+q+r+n); }
int funct199(){ return (r+q+q+x); }
int funct200(){ return (r+q+q+n); }
int funct201(){ return (r+q+x+n); }
int funct202(){ return (r+q+p*r); }
int funct203(){ return (r+q+p*q); }
int funct204(){ return (r+q+p*x); }
int funct205(){ return (r+q+p*n); }
int funct206(){ return (r+q+q*r); }
int funct207(){ return (r+q+q*x); }
int funct208(){ return (r+q+q*n); }
int funct209(){ return (r+q+r*x); }
int funct210(){ return (r+q+r*n); }
int funct211(){ return (r+q+x*n); }
int funct212(){ return (r+q+p%r); }

int funct213(){ return (r+q+p%q); }
int funct214(){ return (r+q+p%x); }
int funct215(){ return (r+q+p%n); }
int funct216(){ return (r+q+q%p); }
int funct217(){ return (r+q+q%r); }
int funct218(){ return (r+q+q%x); }
int funct219(){ return (r+q+q%n); }
int funct220(){ return (r+q+r%p); }
int funct221(){ return (r+q+r%q); }
int funct222(){ return (r+q+r%x); }
int funct223(){ return (r+q+r%n); }
int funct224(){ return (r+q+x%p); }
int funct225(){ return (r+q+x%q); }
int funct226(){ return (r+q+x%r); }
int funct227(){ return (r+q+x%n); }
int funct228(){ return (r+q+n%p); }
int funct229(){ return (r+q+n%q); }
int funct230(){ return (r+q+n%r); }
int funct231(){ return (r+q+n%x); }

int funct232(){ return (r+x+p+r); }
int funct233(){ return (r+x+p+x); }
int funct234(){ return (r+x+p+n); }
int funct235(){ return (r+x+p+q); }
int funct236(){ return (r+x+q+r); }
int funct237(){ return (r+x+q+x); }
int funct238(){ return (r+x+q+n); }
int funct239(){ return (r+x+r+x); }
int funct240(){ return (r+x+r+n); }
int funct241(){ return (r+x+x+n); }
int funct242(){ return (r+x+p*q); }
int funct243(){ return (r+x+p*r); }
int funct244(){ return (r+x+p*x); }
int funct245(){ return (r+x+p*n); }
int funct246(){ return (r+x+q*r); }
int funct247(){ return (r+x+q*x); }
int funct248(){ return (r+x+q*n); }
int funct249(){ return (r+x+r*x); }
int funct250(){ return (r+x+r*n); }

int funct251(){ return (r+x+x*n); }
int funct252(){ return (r+x+p%q); }
int funct253(){ return (r+x+p%r); }
int funct254(){ return (r+x+p%x); }
int funct255(){ return (r+x+p%n); }
int funct256(){ return (r+x+q%p); }
int funct257(){ return (r+x+q%r); }
int funct258(){ return (r+x+q%x); }
int funct259(){ return (r+x+q%n); }
int funct260(){ return (r+x+r%p); }
int funct261(){ return (r+x+r%q); }
int funct262(){ return (r+x+r%x); }
int funct263(){ return (r+x+r%n); }
int funct264(){ return (r+x+x%p); }
int funct265(){ return (r+x+x%q); }
int funct266(){ return (r+x+x%r); }
int funct267(){ return (r+x+x%n); }
int funct268(){ return (r+x+n%p); }

int funct269(){ return (r+x+n%q); }
int funct270(){ return (r+x+n%r); }
int funct271(){ return (r+x+n%x); }
int funct272(){ return (r+n+p+q); }
int funct273(){ return (r+n+p+r); }
int funct274(){ return (r+n+p+x); }
int funct275(){ return (r+n+p+n); }
int funct276(){ return (r+n+q+r); }
int funct277(){ return (r+n+q+x); }
int funct278(){ return (r+n+q+n); }
int funct279(){ return (r+n+r+x); }
int funct280(){ return (r+n+r+n); }
int funct281(){ return (r+n+x+n); }
int funct282(){ return (r+n+p*q); }
int funct283(){ return (r+n+p*r); }
int funct284(){ return (r+n+p*x); }
int funct285(){ return (r+n+p*n); }
int funct286(){ return (r+n+q*r); }
int funct287(){ return (r+n+q*x); }

int funct288(){ return (r+n+q*n); }
int funct289(){ return (r+n+r*x); }
int funct290(){ return (r+n+r*n); }
int funct291(){ return (r+n+x*n); }
int funct292(){ return (r+n+p%q); }
int funct293(){ return (r+n+p%r); }
int funct294(){ return (r+n+p%x); }
int funct295(){ return (r+n+p%n); }
int funct296(){ return (r+n+q%p); }
int funct297(){ return (r+n+q%r); }
int funct298(){ return (r+n+q%x); }
int funct299(){ return (r+n+q%n); }
int funct300(){ return (r+n+r%p); }
int funct301(){ return (r+n+r%q); }
int funct302(){ return (r+n+r%x); }
int funct303(){ return (r+n+r%n); }
int funct304(){ return (r+n+q%p); }
int funct305(){ return (r+n+q%r); }
int funct306(){ return (r+n+q%x); }

int funct307(){ return (r+n+q%n); }
int funct308(){ return (r+n+x%p); }
int funct309(){ return (r+n+x%q); }
int funct310(){ return (r+n+x%r); }
int funct311(){ return (r+n+x%n); }
int funct312(){ return (r+n+n%r); }
int funct313(){ return (r+n+n%p); }
int funct314(){ return (r+n+n%q); }
int funct315(){ return (r+n+n%x); }
int funct316(){ return (r+n+n%x); }
int funct317(){ return (r+n+n%x); }
int funct318(){ return (r+n+n%x); }
int funct319(){ return (r+n+n%x); }
int funct320(){ return (r+n+n%x); }
int funct321(){ return (r+n+n%x); }
int funct322(){ return (r+n+n%x); }
int funct323(){ return (r+n+n%x); }
int funct324(){ return (r+n+n%x); }
int funct325(){ return (r+n+n%x); }

int funct326(){ return (r+n+n%x); }
int funct327(){ return (r+n+n%x); }
int funct328(){ return (r+n+n%x); }
int funct329(){ return (r+n+n%x); }
int funct330(){ return (r+n+n%x); }
int funct795(){ return (p%x+x%n); }
int funct1015(){ return (q*n+x%n); }
int funct1115(){ return (q%n+x%n); }
int funct2015(){ return (((p*r+q)%x)%n); }

int main()
{
int i, a[324];
int f;
p=1, q=2, r=1001, x=19, n=20;
f= (((p*r+q)%x)%n);
a[0]=funct1();
a[1]=funct2();


a[2]=funct3();
a[3]=funct4();
a[4]=funct5();
a[5]=funct6();
a[6]=funct7();
a[7]=funct8();
a[8]=funct9();
a[9]=funct10();
a[10]=funct11();
a[11]=funct12();
a[12]=funct13();
a[13]=funct14();
a[14]=funct15();
a[15]=funct16();
a[16]=funct17();
a[17]=funct18();
a[18]=funct19();
a[19]=funct20();
a[20]=funct21();


a[21]=funct22();
a[22]=funct23();
a[23]=funct24();
a[24]=funct25();
a[25]=funct26();
a[26]=funct27();
a[27]=funct28();
a[28]=funct29();
a[29]=funct30();
a[30]=funct31();
a[31]=funct32();
a[32]=funct33();
a[33]=funct34();
a[34]=funct35();
a[35]=funct36();
a[36]=funct37();
a[37]=funct38();
a[38]=funct39();
a[39]=funct40();


a[40]=funct41();
a[41]=funct42();
a[42]=funct43();
a[43]=funct44();
a[44]=funct45();
a[45]=funct46();
a[46]=funct47();
a[47]=funct48();
a[48]=funct49();
a[49]=funct50();
a[50]=funct51();
a[51]=funct52();
a[52]=funct53();
a[53]=funct54();
a[54]=funct55();
a[55]=funct56();
a[56]=funct57();
a[57]=funct58();
a[58]=funct59();


a[59]=funct60();
a[60]=funct61();
a[61]=funct62();
a[62]=funct63();
a[63]=funct64();
a[64]=funct65();
a[65]=funct66();
a[66]=funct67();
a[67]=funct68();
a[68]=funct69();
a[69]=funct70();
a[70]=funct71();
a[71]=funct72();
a[72]=funct73();
a[73]=funct74();
a[74]=funct75();
a[75]=funct76();
a[76]=funct77();
a[77]=funct78();


a[78]=funct79();
a[79]=funct80();
a[80]=funct81();
a[81]=funct82();
a[82]=funct83();
a[83]=funct84();
a[84]=funct85();
a[85]=funct86();
a[86]=funct87();
a[87]=funct88();
a[88]=funct89();
a[89]=funct90();
a[90]=funct91();
a[91]=funct92();
a[92]=funct93();
a[93]=funct94();
a[94]=funct95();
a[95]=funct96();
a[96]=funct97();


a[97]=funct98();
a[98]=funct99();
a[99]=funct100();
a[100]=funct101();
a[101]=funct102();
a[102]=funct103();
a[103]=funct104();
a[104]=funct105();
a[105]=funct106();
a[106]=funct107();
a[107]=funct108();
a[108]=funct109();
a[109]=funct110();
a[110]=funct111();
a[111]=funct112();
a[112]=funct113();
a[113]=funct114();
a[114]=funct115();
a[115]=funct116();


a[116]=funct117();
a[117]=funct118();
a[118]=funct119();
a[119]=funct120();
a[120]=funct121();
a[121]=funct122();
a[122]=funct123();
a[123]=funct124();
a[124]=funct125();
a[125]=funct126();
a[126]=funct127();
a[127]=funct128();
a[128]=funct129();
a[129]=funct130();
a[130]=funct131();
a[131]=funct132();
a[132]=funct133();
a[133]=funct134();
a[134]=funct135();


a[135]=funct136();
a[136]=funct137();
a[137]=funct138();
a[138]=funct139();
a[139]=funct140();
a[140]=funct141();
a[141]=funct142();
a[142]=funct143();
a[143]=funct144();
a[144]=funct145();
a[145]=funct146();
a[146]=funct147();
a[147]=funct148();
a[148]=funct149();
a[149]=funct150();
a[150]=funct151();
a[151]=funct152();
a[152]=funct153();
a[153]=funct154();


a[154]=funct155();
a[155]=funct156();
a[156]=funct157();
a[157]=funct158();
a[158]=funct159();
a[159]=funct160();
a[160]=funct161();
a[161]=funct162();
a[162]=funct163();
a[163]=funct164();
a[164]=funct165();
a[165]=funct166();
a[166]=funct167();
a[167]=funct168();
a[168]=funct169();
a[169]=funct170();
a[170]=funct171();
a[171]=funct172();
a[172]=funct173();


a[173]=funct174();
a[174]=funct175();
a[175]=funct176();
a[176]=funct177();
a[177]=funct178();
a[178]=funct179();
a[179]=funct180();
a[180]=funct181();
a[181]=funct182();
a[182]=funct183();
a[183]=funct184();
a[184]=funct185();
a[185]=funct186();
a[186]=funct187();
a[187]=funct188();
a[188]=funct189();
a[189]=funct190();
a[190]=funct191();
a[191]=funct192();


a[192]=funct193();
a[193]=funct194();
a[194]=funct195();
a[195]=funct196();
a[196]=funct197();
a[197]=funct198();
a[198]=funct199();
a[199]=funct200();
a[200]=funct201();
a[201]=funct202();
a[202]=funct203();
a[203]=funct204();
a[204]=funct205();
a[205]=funct206();
a[206]=funct207();
a[207]=funct208();
a[208]=funct209();
a[209]=funct210();
a[210]=funct211();


a[211]=funct212();
a[212]=funct213();
a[213]=funct214();
a[214]=funct215();
a[215]=funct216();
a[216]=funct217();
a[217]=funct218();
a[218]=funct219();
a[219]=funct220();
a[220]=funct221();
a[221]=funct222();
a[222]=funct223();
a[223]=funct224();
a[224]=funct225();
a[225]=funct226();
a[226]=funct227();
a[227]=funct228();
a[228]=funct229();
a[229]=funct230();


a[230]=funct231();
a[231]=funct232();
a[232]=funct233();
a[233]=funct234();
a[234]=funct235();
a[235]=funct236();
a[236]=funct237();
a[237]=funct238();
a[238]=funct239();
a[239]=funct240();
a[240]=funct241();
a[241]=funct242();
a[242]=funct243();
a[243]=funct244();
a[244]=funct245();
a[245]=funct246();
a[246]=funct247();
a[247]=funct248();
a[248]=funct249();


a[249]=funct250();
a[250]=funct251();
a[251]=funct252();
a[252]=funct253();
a[253]=funct254();
a[254]=funct255();
a[255]=funct256();
a[256]=funct257();
a[257]=funct258();
a[258]=funct259();
a[259]=funct260();
a[260]=funct261();
a[261]=funct262();
a[262]=funct263();
a[263]=funct264();
a[264]=funct265();
a[265]=funct266();
a[266]=funct267();
a[267]=funct268();


a[268]=funct269();
a[269]=funct270();
a[270]=funct271();
a[271]=funct272();
a[272]=funct273();
a[273]=funct274();
a[274]=funct275();
a[275]=funct276();
a[276]=funct277();
a[277]=funct278();
a[278]=funct279();
a[279]=funct280();
a[280]=funct281();
a[281]=funct282();
a[282]=funct283();
a[283]=funct284();
a[284]=funct285();
a[285]=funct286();
a[286]=funct287();


a[287]=funct288();
a[288]=funct289();
a[289]=funct290();
a[290]=funct291();
a[291]=funct292();
a[292]=funct293();
a[293]=funct294();
a[294]=funct295();
a[295]=funct296();
a[296]=funct297();
a[297]=funct298();
a[298]=funct299();
a[299]=funct300();
a[300]=funct301();
a[301]=funct302();
a[302]=funct303();
a[303]=funct304();
a[304]=funct305();
a[305]=funct306();


a[306]=funct307();
a[307]=funct308();
a[308]=funct309();
a[309]=funct310();
a[310]=funct311();
a[311]=funct312();
a[312]=funct313();
a[313]=funct314();
a[314]=funct315();
a[315]=funct316();
a[316]=funct317();
a[317]=funct318();
a[318]=funct319();
a[319]=funct320();
a[320]=funct795();
a[321]=funct1015();
a[322]=funct1115();
a[323]=funct2015();
for(i=0;i<324;i++)
{
    if (a[i]==f)
        cout<<"Target is found after "<<i<<"th iteration \n";
}

cout<<"exit from console.......\n";

return 0;
}
Output:

Fig. 4.2.1. Output of GEP iterations.


The program is designed in a very simple way. Each individual function has been written separately, since no single loop can read a chromosome string (which is essentially a character string) and perform a mathematical computation (retaining an integer output) over it.
This output shows that we obtain the desired result after a much smaller number of iterations; here it is around the 129th iteration, out of a possible maximum of 2015 iterations. In the case of the Genetic Algorithm, by contrast, the chromosome is a combination of arbitrary bits, so the iteration count can grow far beyond these numbers without guaranteeing a valid result. A GEP, on the other hand, always ensures a valid result.

4.3 Simulating GEP Using the DTREG Tool: This is the first step towards initiating the project. It is done by single-clicking the DTREG icon and then selecting New Project. The following screen will come up. It asks for a name for the project, followed by the location of the input file. The input file is a CSV (comma-separated values) file, a simple plain-text tabular format that can also be opened as an Excel spreadsheet.


Fig. 4.3 Creating a project in DTREG

4.3.1 Choosing the model


Fig. 4.3.1 Choosing the model


The model chosen is a Gene Expression Programming model. Other models are available, namely a normal predictive model and a time-series model. The variables are then initialized.

4.3.2 Setting Variables


Fig. 4.3.2 Setting variables.


After setting the variable types, the project is run and the analysis report is generated as follows. The target is the output value that we want to evaluate. A predictor variable is an independent variable (e.g. p, q) used in the function. Variables are further typed as categorical or continuous: a categorical variable takes a discrete set of values, whereas a continuous variable can take any value within a range.

4.3.3 Analysis Report


Fig. 4.3.3 Analysis Report


It analyzes the input parameters with respect to the expected value, in this case the
value H(r).


Fig. 4.3.4 Analysis report, continued.


4.4 Data Modeling with GeneXproTools


It performs everything from data loading to model creation and analysis with a minimum number of steps. It can be used in many different fields, such as risk management, financial services, marketing and most scientific research areas, to extract knowledge from multivariate data, giving insights into hidden patterns and relationships. GeneXproTools creates models for Function Finding, Logistic Regression, Classification, Time Series Prediction and Logic Synthesis problems. The models can be exported to 16 different programming languages, ready to be integrated with backend systems, or the included scoring system can be used to process new data. GeneXproTools has a modern and clean interface that lets the user choose between a hands-off, minimal-fuss approach and advanced, fine-grained operation. It also brings the power of modern learning algorithms to the desktop, integrating the latest advances in evolutionary computation with standard statistical measures and techniques, producing best-of-breed models and giving an edge over using either approach separately. GeneXproTools is also an extensible platform that lets you create your own fitness functions and two types of custom functions (UDFs and DDFs) using JavaScript. It is also possible to add fitness functions written in other languages when wrapped in a COM component. Finally, the code-generation infrastructure can easily be extended with the addition of new grammars, which are simple XML files.
With the above experimental data set, GeneXproTools gives the following results.


4.4.1 Data:

Fig. 4.4.1 The Data


It is the file containing the input values, i.e. the values of p, q, r, x etc., and it looks similar to an Excel file. The distribution of p is then plotted, along with p vs. F(r).


4.4.2 General settings:

Fig. 4.4.2 The General Settings


Some more settings are required before proceeding to the final analysis. These include parameters such as the number of chromosomes, the head size of each chromosome, the number of genes and the linking function.


4.4.3 Chosen Functions

Fig. 4.4.3 Chosen functions


Addition, multiplication and floating-point remainder (mod) are the chosen functions. The evolution is based on these functions.


4.4.4 Run

Fig. 4.4.4 The evolution of the function


This slide shows how the evolution takes place. It evolves until the maximum fitness is achieved. Here the number of generations is 1802 and the best fitness value is 954.6923. The graph of the target value (in yellow) is close to the graph of the model (in green), with an error of only 1 out of 15. Since the target function closely matches the model, we confirm that the fitness function we have developed yields the required hash function, which in turn implements the desired hash table with minimum collisions.

4.4.5 History


Fig. 4.4.5 History of valid models.


This slide shows the models selected for evolution and their corresponding fitness. The crossover, mutation and selection procedures applied during evolution sometimes give rise to undesirable outputs, which are rejected. Only those models that give rise to valid ETs are selected. All the models listed above produced valid ETs, so they were selected and subjected to regression in the next generation.

4.4.6 Results


Fig. 4.4.6 Result of evolution


The result slide shows the locations where the hash keys would be placed. The locations given by the target function and by the model are compared: they match in 14 of the 15 locations, giving an error of (1/15) ≈ 0.0667.


CHAPTER 5

CONCLUSION AND FUTURE WORK

5.1 Conclusion:
A key performance measure for the World Wide Web is the speed with which
content is served to users. As traffic on the Web increases, users are faced with
increasing delays and failures in data delivery. Web caching is one of the key
strategies that has been explored to improve performance. Caching has been
employed to improve the efficiency and reliability of data delivery over the Internet. A
nearby cache can serve a (cached) page quickly even if the originating server is
swamped or the network path to it is congested. While this argument gives the self-interested user a motivation to exploit caches, it is worth noting that widespread use of caches also serves a general good: if requests are intercepted by nearby caches, fewer of them reach the source server, reducing server load and network traffic to the benefit of all users.
Applying hashing to integers is an established technique, but applying hashing to enhance the functionality of a cache that stores web pages is a new approach which needs extensive experimentation and careful observation. The future scope of this project therefore lies in further experiments using various kinds of websites, such as job, educational and e-governance sites.

5.2 Scope for future work


Gene Expression Programming is a subcategory of, and a mixed representation combining, the Genetic Algorithm and Genetic Programming, yet it is much faster and more accurate than either. In this project, we have applied Gene Expression Programming to enhancing a hash function which represents a web cache. In our experiment, we have taken a cluster of computer systems in rural areas, called the nodes, which are connected to a slow internet connection and at the same time connected to each other through a high-speed network. Each node has a local cache, and the nodes can access each other's caches when a miss occurs in their own. In this scenario, the village community which uses the system has very limited requirements: they mostly need the internet for healthcare, education, jobs and land usage. So our improved cache is filled with related websites meant to fulfill the needs of the villagers. It is not optimized to incorporate the needs of other communities with varied interests, such as people living in metros. Therefore, the future scope of our project lies in developing improved cache functionality for the different usage patterns of different groups of people. For example, some people are interested in research, some in politics, some in government services, and some in social networking sites. Our project would throw light on improving web cache functionality for the varied sets of websites used by groups of users with different ranges of interests.


REFERENCES

1. Sibren Isaacman and Margaret Martonosi, "Potential for Collaborative Caching and Prefetching in Largely-Disconnected Villages", WiNS-DR'08, September 19, 2008, San Francisco, California, USA. ACM 978-1-60558-190-3/08/09.

2. Mangesh Kasbekar and Vikram Desai, "Distributed Collaborative Caching for Proxy Servers", www.ra.ethz.ch/cdstore/www6/Posters/761/761_DCC.HTM
3. Xin Li, Lei Guo and Yihong (Eric) Zhao, "Tag-based Social Interest Discovery", WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.
4. Weining Qian, Linhao Xu, Shuigeng Zhou and Aoying Zhou, "CoCache: Query Processing Based on Collaborative Caching in P2P Systems", homepage.fudan.edu.cn/~wnqian/publications/DASFAA05P2P.pdf
5. Mursalin Akon, Towhidul Islam, Xuemin Shen and Ajit Singh, "SPACE: A Lightweight Collaborative Caching for Clusters", Peer-to-Peer Netw. Appl. (2010) 3:83-99, DOI 10.1007/s12083-009-0047-5.
6. Dominguez-Sal, D., Larriba-Pey, J., and Surdeanu, M., "A Multi-layer Collaborative Cache for Question Answering", in Proceedings of Euro-Par 2007, 295-306.


7. Jianliang Xu, Jiangchuan Liu, Bo Li, Xiaohua Jia, "Caching and Prefetching for Web Content Distribution", IEEE, July-Aug. 2004, Vol. 6, Issue 4, pp. 54-59, ISSN 1521-9615, DOI 10.1109/MCSE.2004.5.

8. Sarina Sulaiman, Siti Mariyam Shamsuddin, Ajith Abraham, "Data Warehousing for Rough Web Caching and Pre-fetching", GrC 2009, IEEE International Conference on Granular Computing, pp. 443-448.
9. Safdari, M., Joshi, R., "Evolving Universal Hash Functions Using Genetic Algorithms", IEEE, FCC, 3-5 April 2009, pp. 84-87, ISBN 978-0-7695-3591-3.
10. Jaroslaw Skaruz and Franciszek Seredynski, "Anomaly Detection in Web Applications Using Gene Expression Programming", Recent Advances in Intelligent Information Systems, ISBN 978-83-60434-59-8, pages 389-398, iis.ipipan.waw.pl/2009/proceedings/iis09-38.pdf
11. C. Ferreira, "GEP: Mathematical Modeling by an Artificial Intelligence", online version.
12. Michael J. Flynn, Computer Architecture: Pipelined and Parallel Processor Design, ISBN 978-81-7319-100-8.
13. Zhe Zhang, Kyuhyung Lee, Xiaosong Ma, Yuanyuan Zhou, "PFC: Transparent Optimization of Existing Prefetching Strategies for Multi-level Storage Systems", ICDCS '08, IEEE, ISBN 978-0-7695-3172-4.
14. Tang, Zhang, Chanson, "Streaming Media Caching Algorithms for Transcoding Proxies", http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.6017.


15. Kipruto, Tan, Musau, Mushi, "Using Genetic Algorithms to Optimize Web Caching in Multimedia-Integrated e-Learning Content", International Journal of Digital Content Technology and its Applications, Volume 5, Number 8, August 2011.
16. Brian D. Davison, "A Web Caching Primer", reprinted from IEEE Internet Computing, Volume 5, Number 4, July/August 2001, pages 38-45.
17. Hai Liu, Maobian Chen, "Evaluation of Web Caching Consistency", 2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), IEEE, ISBN 978-1-4244-6539-2.
18. Michael Rabinovich and Oliver Spatscheck, Web Caching and Replication, Addison Wesley, 1st edition (2002), ISBN 0-201-61570-3.
19. Sitaram Iyer, Antony Rowstron, Peter Druschel, "Squirrel: A Decentralized Peer-to-Peer Web Cache", 21st ACM Symposium on Principles of Distributed Computing (PODC 2002).
20. Pawan Kumar Choudhary and Kishor S. Trivedi, "Performance Evaluation of Web Cache", www.ee.duke.edu/~pkc4/webcache.pdf
21. Sarina Sulaiman, Siti Mariyam Shamsuddin and Ajith Abraham, "Rough Web Caching", www.softcomputing.net/sarina-rs.pdf
22. P. Venketesh, S.N. Sivanandam, S. Manigandan, "Enhancing QoS in Web Caching Using Differentiated Services", International Journal of Computer Science and Applications, Vol. III, No. I, pp. 78-91, Technomathematics Research Foundation.
23. Ping Du, Jaspal Subhlok, "Evaluation of Performance of Cooperative Web Caching with Web Polygraph", 2002.iwcw.org/papers/18500200.pdf


24. Leland R. Beaumont, "Calculating Web Cache Hit Ratios", www.content-networking.com/papers/web-caching-zipf.pdf
25. Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder, "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol", IEEE/ACM Transactions on Networking, Vol. 8, No. 3, June 2000.
26. G.N.K. Suresh Babu and S.K. Srivatsa, "Web Caching Using Randomized Algorithm in Web Personalization", Journal of Theoretical and Applied Information Technology, 2005-2009 JATIT.
27. Dharmendra Patel, Atul Patel and Kalpesh Parikh, "Preprocessing Algorithm of Prediction Model for Web Caching and Prefetching", International Journal of Information Technology and Knowledge Management, July-December 2011, Volume 4, No. 2, pp. 343-345.
28. Waleed Ali, Siti Mariyam Shamsuddin, "Integration of Least Recently Used Algorithm and Neuro-Fuzzy System into Client-side Web Caching", International Journal of Computer Science and Security, 3 (1), pp. 1-15, ISSN 1985-1553.
29. J. B. Patil, B. V. Pawar, "Integrating Intelligent Predictive Caching and Static Prefetching in Web Proxy Servers", International Journal on Computer Science and Engineering (IJCSE), ISSN 0975-3397, Vol. 3, No. 2, Feb 2011.
30. Waleed Ali, Siti Mariyam Shamsuddin, and Abdul Samad Ismail, "A Survey of Web Caching and Prefetching", Int. J. Advance. Soft Comput. Appl., Vol. 3, No. 1, March 2011, ISSN 2074-8523, ICSRS Publication, 2011, www.i-csrs.org
31. C. Umapathi, M. Aramuthan, and K. Raja, "Enhancing Web Services Using Predictive Caching", International Journal of Research and Reviews in Information Sciences (IJRRIS), Vol. 1, No. 3, September 2011, ISSN 2046-6439.


32. S. R. Balasundaram, S. Akhilan, "Significance of Object Size Based Caching Algorithms for Personalized Web Applications", International Journal of the Computer, the Internet and Management, Vol. 18, No. 1 (January-April 2010), pp. 9-14.
33. Hao Che, Ye Tung, and Zhijun Wang, "Hierarchical Web Caching Systems: Modeling, Design and Experimental Results", IEEE Journal on Selected Areas in Communications, Vol. 20, No. 7, September 2002.
34. Kaikuo Xu, Changjie Tang, Rong Tang, Yintian Liu, Jie Zuo, Jun Zhu, "Application of Gene Expression Programming to Real Parameter Optimization", IEEE 2008, DOI 10.1109/ICNC.2008.511.
35. Jaroslaw Skaruz and Franciszek Seredynski, "Anomaly Detection in Web Applications Using Gene Expression Programming", Recent Advances in Intelligent Information Systems, ISBN 978-83-60434-59-8, pages 389-398.
36. Jaroslaw Skaruz, Franciszek Seredynski, "Detecting Web Application Attacks with Use of Gene Expression Programming", 978-1-4244-2959-2/09, IEEE 2009.
37. Chi Zhou, Weimin Xiao, Thomas M. Tirpak, and Peter C. Nelson, "Evolving Accurate and Compact Classification Rules with Gene Expression Programming", IEEE Transactions on Evolutionary Computation, Vol. 7, No. 6, December 2003.
38. Yanchao Liu, John English, and Edward Pohl, "Application of Gene Expression Programming in the Reliability of Consecutive-k-out-of-n: F Systems with Identical Component Reliabilities", in D.S. Huang, L. Heutte, and M. Loog (Eds.): ICIC 2007, CCIS 2, pp. 217-224, Springer-Verlag Berlin Heidelberg 2007.
39. Rui Wang, Jing Lu, "The Improvement of Replacement Method for Web Caching", Proceedings of the Third International Symposium on Computer Science and Computational Technology (ISCSCT '10), Jiaozuo, P. R. China, 14-15 August 2010, pp. 304-307, ISBN 978-952-5726-10-7.


40. Athena Vakali, "A Genetic Algorithm Scheme for Web Replication and Caching", http://www.csd.auth.gr/teachers/vakali.html
41. en.wikipedia.org/wiki/Hash_table
Appendix I
List of Publications:

1. National Conference on "Information & Communication Technology: Opportunities & Challenges in 21st Century", NCICT, Birla Institute of Technology, Noida, Feb. 2011.

2. International Conference on Advances in Computing and Communication, ICACC-2011, NIT Hamirpur, April 2011.
[Conference certificate: National Conference on "Information & Communication Technology: Opportunities & Challenges in 21st Century", organized by Birla Institute of Technology, A-7, Sector-1, NOIDA - 201301 (U.P.). Papers presented:
1. Web-Mining Enabled Services in Web Mining using Advanced DSA Spatial LBS Case Study
2. Improving Web Caching Mechanisms using Gene Expression Programming
3. Web Mining: Ranking Metrics Method]

TECHNOLOGY


Appendix II
CURRICULUM VITAE
Profile Summary
10 years of job experience comprising both academia and industry.
1 year of software-industry experience in Application, Systems and Client/Server environments using C, C++ and the Linux/AIX UNIX operating systems.
9 years of experience as Faculty in the Department of Computer Science and Engineering.
Started as Lecturer in University College of Engineering, Burla, Odisha in June 2000.
Subsequently joined various institutions such as G.H. Institute of Technology & Management, Puri, BBDIT, Ghaziabad and SRM University, Modinagar.
Worked as Software Engineer in Accenture Services Pvt Ltd from August 2007 to August 2008.

Education:
B.E. in Computer Science & Engineering from University College of Engineering, Sambalpur University, with 69.7%, April 2000.
Qualified GATE 2000 with 76.92 percentile.

Publications:
1. National Conference on "Information & Communication Technology: Opportunities & Challenges in 21st Century", NCICT, Birla Institute of Technology, Noida, Feb. 2011.
2. International Conference on Advances in Computing and Communication, ICACC-2011, NIT Hamirpur, April 2011, on the topic "Improving Web Caching Mechanisms Using Gene Expression Programming".
