
COSC1300

Web Server Performance


Tired of having to make coffee while you wait for a home page to download?
"... how the proper integration of several new technologies can make page downloads 20%-400% faster and reduce Web-generated Internet traffic by as much as 50%." Soon, you may be down to one cup a day!

W3C, "Recommendations Reduce World Wide Wait"

Table Of Contents

In this chapter, we cover web caching.

Proxy Caching
Introduction
Where to Cache
Controlling the Cache
Cache Replacement Algorithms
Cache Hierarchies

Proxy Caching
Battling the "World Wide Wait"

Introduction
Definition of Proxy Caching
"Internet Object Caching is a way to store requested Internet Objects (i.e., data available
through http, ftp and gopher protocols) on a system closer to the requesting site than the
source. Web browsers can then use the local cache as a proxy HTTP server, reducing
access time as well as reducing bandwidth consumption."
http://squid-cache.org/Doc/FAQ/FAQ-1.html

Proxy servers work by intercepting requests for documents or files and checking whether they hold a local copy of the requested object. If a current copy exists, it is returned to the client. If the object is not in the cache, or the cached copy is deemed no longer current, a fresh copy is obtained via the web. That object is then forwarded on to the client, and a copy is kept locally so that the next computer to request the same object can obtain it more swiftly. Caching can take place at the browser or at the server.
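The interception logic described above can be sketched as follows. The fixed time-to-live and the `fetch_from_origin` callable are illustrative assumptions, not features of any particular proxy product.

```python
import time

class ProxyCache:
    """Minimal sketch of a proxy cache's hit/miss decision."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}          # url -> (body, time_stored)

    def get(self, url, fetch_from_origin):
        entry = self.store.get(url)
        if entry is not None:
            body, stored_at = entry
            if time.time() - stored_at < self.ttl:
                return body      # cache hit: return the local copy
        # miss (or stale copy): fetch fresh, keep it for the next client
        body = fetch_from_origin(url)
        self.store[url] = (body, time.time())
        return body
```

The second client to ask for the same URL is served from the local store, which is exactly the saving the text describes.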

The diagram above shows the relationship between a client (i.e. a web browser) and a cache. The cache intercepts web requests and either returns an object it has already stored, or passes the request on to the location named in the original request.

The technology used within a cache is very similar to that used in any web server; however, there are some subtle differences. Proxy caches do not work with the same efficiency as a web server, and all requests to a cache must be made with a complete URL rather than a relative path (which is fine when a browser talks directly to the origin server).

Why?
Security
Beyond the obvious savings in download time and bandwidth costs, proxy servers can also play a valuable part in a company's security policy. For instance, a network can be configured so that the only device permitted to make HTTP requests is the proxy server; every other computer must route its HTTP requests through it. This serves to:
1. Reduce the risk of attack on individual PCs and on the network as a whole, because only a single machine makes external requests, which is far simpler and more robust to administer.
2. Allow filtering of the sites that users of the proxy can access - for example, requests for documents that are not cached might be rejected.

The clients (browsers) in the above diagram are behind a firewall: only the cache can make requests through the firewall; the individual clients do not have this privilege.

Speed
Several factors reduce the efficiency with which your browser can retrieve a web page: DNS lookups for URLs, slow response times of web servers, the size of the object being retrieved, and general network congestion. Together they make the network leg of any data transfer the slowest of all. In addition, web servers that only support HTTP/1.0 will generally behave more slowly than those that support HTTP/1.1. The notion of sidestepping all of these issues has immediate appeal: keeping a copy of the objects you want closer to the client avoids many of these pitfalls.
Bandwidth
As well as decreasing response times, caching can cut costs: many organizations pay for their Internet connection based on data volume rather than length of connection. Objects retrieved from within the organization, rather than from the wild, ultimately save the organization money and increase the efficiency with which it conducts its online affairs.

Checkpoint Questions
1. What are the three main benefits of using a cache?

COSC1300 - Lecture Notes
Web Servers and Web Technology

Copyright 2000 RMIT Computer Science
All Rights Reserved


Where to cache?
There are two approaches to caching: browser caching and server caching (or proxy caching). Both use the same approach of intercepting requests for Internet objects and checking for a valid local copy already stored.
Browser Caching
Browsers can be configured to keep local copies of the files you browse on your own hard disk. They use simple algorithms and offer only minor configuration options. Internet Explorer lets you specify when cached pages are checked for updates, but provides no control over whether caching uses the disk cache or the memory cache.
Cache configuration panel from Internet Explorer 5

Internet Explorer lets users choose when cached pages are checked for updates, from the following options.
Every Time You Start IE - If you access a page you've previously visited, the browser checks only once per session whether that page has been updated; at all other times it takes the page from the cache. For most types of web activity this setting is sufficient, and it is the recommended one.
Every Visit to the Page - Every time you access a web page, the browser checks whether it has changed. If the server's copy is no newer than the cached copy, the page is retrieved from the cache; if it has been modified more recently, it is retrieved directly from its source. This setting is harmless but usually unnecessary.
Never - The browser never checks whether a page has been updated and always uses the cached page. This setting is not recommended.
The Cache configuration panel from Netscape 4.08

Netscape goes a step further by letting you set how much RAM and disk space are available for use as a cache. Storing objects in memory gives better retrieval performance than storing them on disk, although, as always, system constraints limit what can be achieved. This is as valid on a server as it is on a browser.
The main benefit of caching locally over caching on a server is that local caching eliminates any network hops.

Server Caching / Proxy Server


Server caching follows the same scheme: saving files that have been requested and then intercepting future web requests to see whether that particular request has been stored locally. However, server caches can provide a much higher level of service, for several reasons. The most relevant is that they deal with requests from many more clients and therefore hold a much larger and more varied pool of web objects to supply to those clients. In addition, server caches have more room in which to store objects, so the chance that your requested page is already in the cache improves markedly. Finally, server caches are likely to be installed on computers tuned for running a cache: hard disk space and the amount of information held in memory at any one time are configured to optimise the performance of the cache.

A further benefit is that server caches can form what are known as cache farms: networks of server caches working together to capture the greatest possible variety and quantity of objects. Companies like Bigpond offer the services of their web farms at a price.
Cache farms are discussed in more detail later.

Checkpoint Questions
1. What benefits does browser caching provide?
2. What benefits does networked caching provide?


Controlling how a Web Page is cached


As authors of web pages, we sometimes want to control how long our pages remain current, and sometimes whether they are cached at all.
In many cases, the HTTP header information accompanying a page indicates whether it is still valid. For instance, HTTP/1.0 and HTTP/1.1 responses can carry an Expires header that caching systems use to determine the freshness of a document (see the Hypertext Transfer Protocol section of the Web Servers notes). The administrator of the cache has little control over these types of rules. By default, some pages are not cached at all: authenticated pages are typically not cached, nor are objects requested via the Secure Sockets Layer (SSL).
Meta tags have also been used by page designers to control how their pages are cached, but this approach is somewhat flawed: proxy caches rarely parse the HTML contents of a document and look only at the header information. A page designer who wants to manage when and how a page is cached should therefore use HTTP headers rather than HTML meta tags.
HTTP/1.0 offered only limited control over caching; HTTP/1.1 introduced a new set of Cache-Control response headers that give the page designer much greater control. Some of these directives are listed below.
max-age=[seconds] - the length of time, in seconds, for which an object is considered fresh
public - marks a response as cacheable
no-cache - the item must not be served from the cache without first being revalidated with the origin server
must-revalidate - the cache must obey any freshness directives given for the object

Your browser settings will also affect whether the document is refreshed; quite commonly you may find yourself explicitly telling the browser to fetch a fresh copy of the document from the server.
HTTP/1.1 200 OK
Date: Fri, 30 Oct 1998 13:19:41 GMT
Server: Apache/1.3.3 (Unix)
Cache-Control: max-age=3600, must-revalidate
Expires: Fri, 30 Oct 1998 14:19:41 GMT
Last-Modified: Mon, 29 Jun 1998 02:28:12 GMT
ETag: "3e86-410-3596fbbc"
Content-Length: 1040
Content-Type: text/html

Example of HTTP response headers for a web page that contains caching instructions.
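Working from headers like those above, a cache can decide freshness by comparing an object's age against max-age. The sketch below is a deliberately simplified parser that handles only the max-age directive shown; real caches also honour Expires, ETag validation, and the other directives.

```python
import re

def max_age_seconds(cache_control):
    """Extract the max-age value (in seconds) from a Cache-Control header."""
    m = re.search(r"max-age=(\d+)", cache_control)
    return int(m.group(1)) if m else None

def is_fresh(cache_control, age_seconds):
    """An object is fresh while its age is below the max-age limit."""
    limit = max_age_seconds(cache_control)
    return limit is not None and age_seconds < limit
```

With the example header `Cache-Control: max-age=3600, must-revalidate`, an object 30 minutes old is fresh, while one 2 hours old must be revalidated.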

When is a page no longer valid?


In an ideal world, a cache could keep copies of every document that is likely to be requested more than once. That is an unlikely scenario, however, as hard disk space constraints eventually come into play. A cache is limited in the quantity of information it can keep; it can't just keep collecting local copies of pages forever. How do we determine what to keep and how long to keep it?
When the cache is full, the caching system enforces one of the replacement policies described in the next section to determine which pages to remove from the cache.

Checkpoint Questions
1. Why do we need to control how our web pages are cached, and how do we as web
developers manage it?

Cache Replacement Algorithms

In this section, we cover cache replacement algorithms.

Not Recently Used (NRU) Caching


A very simple caching algorithm marks each cached object with a read bit and a written bit, placing it in one of four classes:
(1) Not read, not written
(2) Not read, written
(3) Read, not written
(4) Read and written
Periodically (say, every few clock ticks), the read bits are reset.
When the cache is full, an object in the lowest-numbered non-empty class is evicted.
The premise is that modified but unread data should be removed in preference to read, unmodified data. This rests on the assumption that data is read more often than it is written, which is certainly true in web caching.
The attraction is simplicity; the disadvantage is that it is certainly not optimal.
NRU caching is infrequently used in practice.
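The four-class scheme can be sketched as follows, assuming each cache entry carries a read bit and a written bit:

```python
def nru_class(read, written):
    """Class (1) not read/not written ... class (4) read and written."""
    return 1 + (2 if read else 0) + (1 if written else 0)

def nru_victim(entries):
    """Pick a victim from the lowest-numbered non-empty class.

    entries: dict mapping key -> (read_bit, written_bit)
    """
    return min(entries, key=lambda k: nru_class(*entries[k]))
```

Note that an entry that was written but never read (class 2) is evicted before one that was read but never written (class 3), matching the premise above.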

First-In First-Out (FIFO)


The idea is to remove the oldest data: the first item added to the cache is the first removed.
This does not typically work well, since age has little to do with frequency of use; the first item added to the cache is frequently the most popular. This is often true of web pages and web traffic.
FIFO is not used in an unmodified form in practical caching.
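A minimal FIFO cache might look like this (values are elided; only the insertion order of keys matters for eviction):

```python
from collections import OrderedDict

class FIFOCache:
    """Evicts the oldest entry first, regardless of how often it is used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()   # insertion-ordered keys

    def access(self, key):
        if key not in self.store:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)   # evict the oldest entry
            self.store[key] = True
```

Accessing an existing key changes nothing: that indifference to popularity is precisely the weakness described above.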

Second-Chance
Second-chance is just like FIFO, but data is marked (flagged) each time it is used, and the data is managed as a queue.
When evicting, we inspect the oldest item in the queue:
if it is unmarked, we evict it
if it is marked, we unmark it, move it to the back of the queue as if newly added, and move on to the next-oldest item
The basic principle is to look for old data that has not been referenced for a period of time.
Second-chance is inefficient because we must maintain a queue of data ordered by age and reorganise it frequently. An alternative is to arrange the data in a circular queue and move a "hand" around it; the hand points to the oldest item.
If the hand points to an unmarked item, that item is evicted. If it points to a marked item, it is unmarked and the hand moves on.
This "clock" scheme differs from second-chance only in its implementation.
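The clock-hand variant can be sketched as follows. The choice to insert new entries unmarked is one common convention, not the only one:

```python
class ClockCache:
    """Second-chance eviction via a circular buffer and a sweeping hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []       # circular buffer of cached keys
        self.marked = {}     # key -> "referenced since last sweep" bit
        self.hand = 0        # points at the oldest eviction candidate

    def access(self, key):
        if key in self.marked:
            self.marked[key] = True            # re-use: earn a second chance
            return
        if len(self.keys) < self.capacity:
            self.keys.append(key)              # room left: just add it
        else:
            # sweep: clear marks until an unmarked victim is found
            while self.marked[self.keys[self.hand]]:
                self.marked[self.keys[self.hand]] = False
                self.hand = (self.hand + 1) % self.capacity
            del self.marked[self.keys[self.hand]]
            self.keys[self.hand] = key         # replace the victim in place
            self.hand = (self.hand + 1) % self.capacity
        self.marked[key] = False               # new entries start unmarked
```

Because marked entries are unmarked in place rather than physically requeued, no queue reorganisation is needed, which is the implementation advantage noted above.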

Least-Recently Used (LRU)


A good observation about web data is that data accessed frequently is likely to be accessed again in the near future. It is also true that data that has not been used for a long period is unlikely to be used in the near future.
The idea of LRU is to evict the data that has gone unused for the longest time. This has a cost: we must maintain information about every item stored in the cache.
One approach is to maintain a central counter that is incremented each time data is accessed. Whenever an item is accessed, the current counter value is stored with it. The item with the lowest stored value is the least recently used and will be evicted.
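The central-counter approach can be sketched as:

```python
class LRUCache:
    """LRU via a central counter: lowest stamp = least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.stamp = {}      # key -> counter value at last access
        self.counter = 0

    def access(self, key):
        self.counter += 1
        if key not in self.stamp and len(self.stamp) >= self.capacity:
            victim = min(self.stamp, key=self.stamp.get)
            del self.stamp[victim]           # evict the least recently used
        self.stamp[key] = self.counter       # record this access
```

Finding the minimum stamp is a linear scan here; production caches typically use a doubly linked list or ordered map to make eviction constant-time, at the same bookkeeping cost per access.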

Checkpoint Questions
1. Using the FIFO policy, determine the state of the cache after the following values are accessed: 0 1 2 3 4 5 6 3 4 5 6 2 3 4.
2. Using the LRU policy, determine the state of the cache after the same access sequence.
3. Using the second-chance policy, determine the state of the cache after the same access sequence.



Cache Farms and Cache Hierarchies


As mentioned briefly earlier, one of the great benefits of server caching is that caches can be combined with other caches, making a much wider variety and larger quantity of cached objects available for fast retrieval, because a larger user base supplies requests to the proxies. Another benefit is that if the proxies are intelligent enough to communicate with one another, it becomes possible to manage the duplication of data, or even to balance load by sharing resources between different proxies.
A cache hierarchy is a kind of pyramid with many caches lower down and fewer caches near the top. Network traffic diminishes as you head down the hierarchy, but there are costs in replicated disk storage, as each object is cached at every level of the hierarchy it passes through.

This situation improves if we increase the intelligence and cooperation of our network of caches. The paradigm then moves away from a pyramid in which information flows one way towards a more parallel scheme, with many caches on similar levels and much richer communication between them. ICP (Internet Cache Protocol), which grew out of the Harvest cache research project, uses UDP to communicate between caches. Requests for information are passed between caches using UDP; when the requesting cache receives a positive ICP response, it makes a formal HTTP request for the document to the cache that answered positively, otherwise it goes out to the Internet and requests the document from its original location. The net result of this more informed conversation is that redundancy of stored objects decreases, and load sharing between the components of the cache farm becomes much more viable.
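The ICP exchange above reduces to simple decision logic. The sketch below deliberately ignores ICP's real binary UDP packet format and models sibling queries as plain function calls; the names are illustrative assumptions:

```python
def resolve(url, siblings, fetch_http, fetch_origin):
    """Query siblings first; fall back to the origin server.

    siblings: list of (name, has_object) pairs, where has_object(url)
    stands in for an ICP query/reply over UDP.
    """
    for name, has_object in siblings:
        if has_object(url):                  # ICP "HIT" reply
            return fetch_http(name, url)     # full HTTP request to that cache
    return fetch_origin(url)                 # every sibling missed
```

The cheap UDP round-trips decide *where* to send the single expensive HTTP request, which is what makes load sharing across the farm viable.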

Checkpoint Questions
1. What benefits does a cache farm have over a cache hierarchy?

