
Running a Reverse Proxy with Apache: http://www.apachetutor.org/admin/reverseproxies


In 2003, Nick Kew released a new module that complements Apache's mod_proxy and is essential
for reverse-proxying. Since then he has received regular questions and requests for help on
proxying with Apache. In this article he attempts to give a comprehensive overview of proxying
and mod_proxy_html.

This article was originally published at ApacheWeek in January 2004, and moved to ApacheTutor
with minor updates in October 2006. The current revision was made in October 2009 and
incorporates updates in Apache 2.2 and mod_proxy_html 3.1.


A proxy server is a gateway for users to the Web at large. Users configure
the proxy in their browser settings, and all HTTP requests are routed via
the proxy. Proxies are typically operated by ISPs and network
administrators, and serve several purposes: for example,

- to speed access to the Web by caching pages fetched, so that popular
  pages don't have to be re-fetched for every user who views them.
- to enable controlled access to the Web for users behind a firewall.
- to filter or transform web content.

A reverse proxy is a gateway for servers, and enables one web server to
provide content from another transparently. As with a standard proxy, a
reverse proxy may serve to improve performance of the web by caching;
this is a simple way to mirror a website. Load-balancing a heavy-duty
application, or protecting a vulnerable one, are other common uses. But
the most common reason to run a reverse proxy is to enable controlled
access from the Web at large to servers behind a firewall.

The proxied server may be a webserver itself, or it may be an application
server using a different protocol, or an application server with just
rudimentary HTTP that needs to be shielded from the web at large. Since
2004, reverse proxying has been the preferred method of deploying
Java/Tomcat applications on the Web, replacing the old mod_jk (itself a
special-purpose reverse proxy module).


The standard Apache module mod_proxy supports both types of proxy
operation. Under Apache 1.x, mod_proxy only supported HTTP/1.0, but
from Apache 2.0, it supports HTTP/1.1. This distinction is particularly
important in a proxy, because one of the most significant changes between
the two protocol versions is that HTTP/1.1 introduces rich new cache
control mechanisms.

Apache 2.2 brings major improvements over Apache 2.0 in both proxying
and caching, and is also the first version to support load-balancing as
standard. If you are using an older Apache version, it is strongly
recommended that you upgrade.

On Unix-family platforms, you have a choice of MPM. A
threaded MPM (Worker or Event) is likely to perform best in a proxy,
especially if you need to support large numbers of clients. If you have an
application that is not compatible with a threaded MPM, you may want to
consider putting that on a different server (which could be another Apache
instance on the same hardware), unless your load is too low to matter.

The Apache Proxy Modules

So far, we have spoken loosely of mod_proxy. However, it's a little more
complicated than that. In keeping with Apache's modular architecture,
mod_proxy is itself modular, and a typical proxy server will need to
enable several modules. Those relevant to proxying and this article
include:

- mod_proxy: The core module deals with proxy infrastructure and
  configuration and managing a proxy request.
- mod_proxy_http: This handles fetching documents with HTTP and
  HTTPS.
- mod_proxy_ftp: This handles fetching documents with FTP.
- mod_proxy_connect: This handles the CONNECT method for
  secure (SSL) tunneling.
- mod_proxy_ajp: This handles the AJP protocol for Tomcat and
  similar backend servers.
- mod_proxy_balancer: This implements clustering and load-balancing
  over multiple backends.
- mod_cache, mod_disk_cache, mod_mem_cache: These deal with
  managing a document cache. To enable caching requires mod_cache
  and one or both of disk_cache and mem_cache.
- mod_proxy_html: This rewrites HTML links into a proxy's address
  space.
- mod_xml2enc: This supports internationalisation (i18n) on behalf of
  mod_proxy_html and other markup-filtering modules.
- mod_headers: This modifies HTTP request and response headers.
- mod_deflate: This negotiates compression with clients and backends.

Having mentioned the modules, I'm going to ignore caching for the
remainder of this article. You may want to add it if you are concerned
about the load on your network or origin servers, but the details are
outside the scope of this article. I'm also going to ignore all non-HTTP
protocols, and load balancing.

Building Apache for Proxying


Note: if you are installing Apache from a package, you will just need to
install packages for Apache, libxml2 and third-party modules according to
your distributor's conventions, which may differ from what is described
here.

Most of the above modules are included in the core Apache distribution.
They can easily be enabled in the Apache build process. For example:

$ ./configure --enable-so --enable-mods-shared="proxy cache ssl
$ make
# make install

Of course, you may want other build options too, and you could just as
well build the modules as static.

If you are adding proxying to an existing installation, you should use apxs
instead:

# apxs -c -i [module-name].c
noting that mod_proxy itself is in two source files
(mod_proxy.c and proxy_util.c).

This leaves mod_proxy_html and mod_xml2enc, which are third-party
modules, and require a third-party library libxml2. At the time of writing,
libxml2 is installed as standard or packaged for most operating systems
(except Windows - see below). If you don't have it, you can download it
from xmlsoft.org and install it yourself. For the purposes of this article,
we'll assume libxml2 is installed as /usr/lib/libxml2.so, with headers in
/usr/include/libxml2/libxml/.

1. Check libxml2 is installed. The version should no longer be an
   issue, but note that versions before 2.5.10 had a bug that could
   cause a denial of service (DoS) in mod_proxy_html, and version 2.6
   is required for some error recovery when parsing data containing
   invalid byte sequences.
2. Download mod_proxy_html and mod_xml2enc from
   http://apache.webthing.com/
3. Build mod_proxy_html and mod_xml2enc with apxs:

   # apxs -c -I/usr/include/libxml2 -I. -i mod_proxy_html.c
   # apxs -c -I/usr/include/libxml2 -I. -i mod_xml2enc.c

Company example.com has a website at www.example.com, which has a
public IP address and DNS entry, and can be accessed from anywhere on
the Internet.

The company also has a couple of application servers which have private
IP addresses and unregistered DNS entries, and are inside the firewall.
The application servers are visible within the network, including to the
webserver, as "internal1.example.com" and "internal2.example.com". But
because they have no public DNS entries, anyone looking at
internal1.example.com from outside the company network will get a "no
such host" error.

A decision is taken to enable Web access to the application servers. But
they should not be exposed to the Internet directly; instead they should be
integrated with the webserver, so that
http://www.example.com/app1/any-path-here is mapped internally to
http://internal1.example.com/any-path-here and
http://www.example.com/app2/other-path-here is mapped internally to
http://internal2.example.com/other-path-here. This is a typical
reverse-proxy situation.

Configuring the Proxy

As with any modules, the first thing to do is to load them in httpd.conf
(this is not necessary if we build them statically into Apache).

LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
#LoadModule proxy_ftp_module modules/mod_proxy_ftp.so
#LoadModule proxy_connect_module modules/mod_proxy_connect.so
LoadModule headers_module modules/mod_headers.so
LoadModule deflate_module modules/mod_deflate.so
LoadFile /usr/lib/libxml2.so
LoadModule xml2enc_module modules/mod_xml2enc.so
LoadModule proxy_html_module modules/mod_proxy_html.so

For Windows users this is slightly different: you'll need to load libxml2.dll
rather than libxml2.so, and you'll probably need to load iconv.dll and
zlib.dll as prerequisites to libxml2 (you can download them from
zlatkovic.com, the same site that maintains Windows binaries of libxml2).
The LoadFile directive is the same.

Of course, you may not need all the modules. Two that are not required in
our typical scenario are shown commented out above.

Having loaded the modules, we can now configure the Proxy. But before
doing so, we have an important security warning:

Do not set "ProxyRequests On". Setting ProxyRequests On turns your
server into an Open Proxy. There are 'bots scanning the Web for open
proxies. When they find you, they'll start using you to route around blocks
and filters to access questionable or illegal material. At worst, they might
be able to route email spam through your proxy. Your legitimate traffic
will be swamped, and you'll find your server getting blocked by things
like family filters.

Of course, you may also want to run a forward proxy with appropriate
security measures, but that lies outside the scope of this article. The author
runs both forward and reverse proxies on the same server (but under
different Virtual Hosts).

The fundamental configuration directive to set up a reverse proxy is
ProxyPass. We use it to set up proxy rules for each of the application
servers:

ProxyPass /app1/ http://internal1.example.com/
ProxyPass /app2/ http://internal2.example.com/

The [P] flag to mod_rewrite offers an alternative to ProxyPass, but this is
more complex, and may in some instances degrade performance by
making it impossible for Apache to use persistent proxy connections.
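
For comparison, the mod_rewrite form of the first ProxyPass rule might be sketched as follows. This assumes mod_rewrite is loaded alongside mod_proxy and mod_proxy_http; the pattern is illustrative only and has not been taken from a tested configuration.

```apache
# Assumption: LoadModule rewrite_module modules/mod_rewrite.so is present
RewriteEngine On
# Capture everything after /app1/ and hand the rewritten URL
# to mod_proxy via the [P] flag.
RewriteRule ^/app1/(.*)$ http://internal1.example.com/$1 [P]
```

Unless you need the pattern-matching power of mod_rewrite, the plain ProxyPass form above remains the simpler choice.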

Now as soon as Apache re-reads the configuration (the recommended way
to do this is with "apachectl graceful"), proxy requests will work, so
http://www.example.com/app1/some-path maps to
http://internal1.example.com/some-path as required.


However, this is not the whole story. ProxyPass just sends traffic straight
through. So when the application servers generate references to
themselves (or to other internal addresses), they will be passed straight
through to the outside world, where they won't work.

For example, an HTTP redirection often takes place when a user (or
author) forgets a trailing slash in a URL. So a request for
http://www.example.com/app1/foo proxies to
http://internal1.example.com/foo, which generates a response:

HTTP/1.1 302 Found
Location: http://internal1.example.com/foo/
(etc)

But from the outside world, the net effect of this is a "No such host" error.
The proxy needs to re-map the Location header to its own address space
and return a valid URL:

HTTP/1.1 302 Found
Location: http://www.example.com/app1/foo/

The command to enable such rewrites in the HTTP Headers is
ProxyPassReverse. The Apache documentation suggests the form:

ProxyPassReverse /app1/ http://internal1.example.com/
ProxyPassReverse /app2/ http://internal2.example.com/

However, there is a slightly more complex alternative form that I
recommend as more robust:

<Location /app1/>
ProxyPassReverse /
</Location>
<Location /app2/>
ProxyPassReverse /
</Location>

Note: this currently fails due to a regression in mod_proxy. As a
workaround, if you have a balancer, the ProxyPassReverse balancer:///
form does the right thing. Note too that the three slashes
are not a typo! Without a balancer, please apply the patch from the
bug report or use the other form.

The reason for recommending this is that a problem arises with some
application servers. Suppose for example we have a redirect:

HTTP/1.1 302 Found
Location: /some/path/to/file.html

This is a violation of the HTTP protocol and so should never happen:
HTTP/1.1 as specified in RFC 2616 only permits absolute URLs in
Location headers. However, it is also a source of much confusion, not
least because the CGI spec has a similar Location header with different
semantics where relative paths are allowed. There are a lot of broken
servers out there! In this instance, the first form of ProxyPassReverse
will return the incorrect response

HTTP/1.1 302 Found
Location: /some/path/to/file.html

which, even allowing for error-correcting browsers, is outside the Proxy's
address space and won't work. The second form fixes this to

HTTP/1.1 302 Found
Location: /app2/some/path/to/file.html

which is still broken, but will at least work in error-correcting browsers.
Most browsers will deal with this.

If your backend server uses cookies, you may also need the
ProxyPassReverseCookiePath and ProxyPassReverseCookieDomain
directives. These are similar to ProxyPassReverse, but deal with the
different form of cookie headers. These require mod_proxy from Apache
2.2 (recommended), or a patched version of 2.0.
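
As a rough sketch of how the cookie directives might look for the first application server, assume the backend sets cookies with domain=internal1.example.com and path=/. Those cookie attributes are assumptions for illustration; check what your own backend actually sends.

```apache
<Location /app1/>
    # Rewrite the domain attribute in Set-Cookie headers
    # from the internal host to the public one.
    ProxyPassReverseCookieDomain internal1.example.com www.example.com
    # Rewrite the path attribute so the browser sends the
    # cookie back for URLs under the proxied prefix.
    ProxyPassReverseCookiePath / /app1/
</Location>
```

Both directives take the internal value first and the public value second.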

As we have seen, ProxyPassReverse remaps URLs in the HTTP headers
to ensure they work from outside the company network. There is,
however, a separate problem when links appear in HTML pages served.
Consider the following cases:

1. <a href="somefile.html">This link will be resolved by the browser
   and will work correctly.</a>
2. <a href="/otherfile.html">This link will be resolved by the browser
   to http://www.example.com/otherfile.html, which is incorrect.</a>
3. <a href="http://internal1.example.com/">This link will resolve to
   "no such host" for the browser.</a>

The same problem of course applies to included content such as images,
stylesheets, scripts or applets, and other contexts where URLs occur in
HTML.

To fix this requires us to parse the HTML and rewrite the links. This is the
purpose of mod_proxy_html. It works as an output filter, parsing the
HTML and rewriting links as it is served. Two basic configuration
directives are required to set it up:

ProxyHTMLEnable On
    This activates mod_proxy_html (and mod_xml2enc if available) for
    the request, and enables ProxyHTMLURLMap and other directives.
ProxyHTMLURLMap from-pattern to-pattern [flags] [cond]
    In its basic form, this has a similar purpose and semantics to
    ProxyPassReverse. Additionally, an extended form is available to
    enable search-and-replace rewriting of URLs within Scripts and
    Stylesheets.

Note that this is a change from earlier versions of mod_proxy_html
and of this tutorial. The old method is deprecated. The reason for the
change is that ProxyHTMLEnable configures both mod_proxy_html and
mod_xml2enc and ensures they interact correctly: a task that would
otherwise be far more complex.

How it works

mod_proxy_html is based on a SAX parser: specifically the HTMLparser
module from libxml2 running in SAX mode (any other parse mode would
of course be very much slower, especially for larger documents). It has
full knowledge of all URI attributes that can occur in HTML 4 and
XHTML 1. Whenever a URL is encountered, it is matched against
applicable ProxyHTMLURLMap directives. If it starts with any
from-pattern, that will be rewritten to the to-pattern. Rules are applied in
the reverse order to their appearance in httpd.conf, and matching stops as
soon as a match is found.

Here's how we set up a reverse proxy for HTML. Firstly, full links to the
internal servers should be rewritten regardless of where they arise, so we
have:

ProxyHTMLURLMap http://internal1.example.com /app1
ProxyHTMLURLMap http://internal2.example.com /app2

Note that in this instance we omitted the "trailing" slash. Since the
matching logic is starts-with, we use the minimal matching pattern. We
have now globally fixed case 3 above.

Case 2 above requires a little more care. Because the link doesn't include
the hostname, the rewrite rule must be context-sensitive. As with
ProxyPassReverse above, we deal with that using <Location>:

<Location /app1/>
ProxyHTMLURLMap / /app1/
</Location>
<Location /app2/>
ProxyHTMLURLMap / /app2/
</Location>

Debugging your Proxy Configuration

The above is a simple case taken from mod_proxy_html version 1. With
the more complex URL mapping and rewriting enabled by Version 2, you
may need a bit of help setting up a complex ruleset, perhaps involving a
series of complex regexps, chained and blocking rules, etc. To help with
setting up and troubleshooting your rulesets, mod_proxy_html 2 provides
a "debug" mode, in which all the 'interesting' things it does are written to
the Apache error log. To analyse and fix your rulesets, set

ProxyHTMLLogVerbose On
LogLevel Info (or LogLevel Debug)

Now run your testcases through your rulesets, and examine the Apache
error log for details of exactly how each was processed.

Do not leave ProxyHTMLLogVerbose On for normal use. Although the
effect is marginal, it is an overhead.

Extended URL Mapping

The previous section sets up remapping of HTML URLs, but leaves any
URL encountered in a Stylesheet or Script untouched. mod_proxy_html
doesn't parse Javascript or CSS, so dealing with URLs in them requires
text-based search-and-replace. This is enabled by the directive
ProxyHTMLExtended On.

Because the extended mode is text-based, it can no longer guarantee to
match exact URLs. It's up to you to devise matching rules that can pick
out URLs, just as if you were writing an old-fashioned Perl or PHP
regexp-based filter (though of course it's still massively more efficient
than performing search-and-replace on an entire document in-memory).
To help with this, ProxyHTMLExtended supports both simple text-based
and regular expression search-and-replace, according to the flags. You can
also use the flags to specify rules separately for HTML links, scripting
events, and embedded scripts and stylesheets.

A second key consideration with extended URL mapping is that whereas
an HTML link contains exactly one URL, a script or stylesheet may
contain many. So instead of stopping after a successful match, the
processor will apply all applicable mapping rules. This can be stopped
with the L (last) flag.
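
To make this concrete, here is a hedged sketch of an extended ruleset for the first application server. The regexp rule targets CSS-style url(...) references and is adapted to this article's hostnames as an illustration. The flag meanings in the comments follow my reading of the mod_proxy_html documentation, in which h, e and c each exclude a context rather than select it, so verify them against the version you have installed.

```apache
<Location /app1/>
    ProxyHTMLEnable On
    ProxyHTMLExtended On
    # Plain starts-with rule, as in the basic configuration
    ProxyHTMLURLMap / /app1/
    # R: regular expression match-and-replace
    # i: case-insensitive
    # h, e: skip HTML links and scripting events, so the rule
    #       applies only to embedded script and style sections
    ProxyHTMLURLMap url\(http://internal1\.example\.com([^\)]*)\) url(http://www.example.com/app1$1) Rihe
</Location>
```

Running with ProxyHTMLLogVerbose On is a sensible way to confirm which rules actually fire on your own pages.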

Dealing with multimedia content

We just set up a proxy to parse and where necessary correct HTML. But
of course, the web isn't just HTML. Surely feeding non-HTML content
through an HTML parser is at best inefficient, if not totally broken?

Yes indeed. mod_proxy_html deals with that by checking the
Content-Type header, and removing itself from the processing chain when
a document is not HTML (text/html) or XHTML
(application/xhtml+xml). This happens in the filter initialisation phase,
before any data are processed by the filter.

But that still leaves a problem. Consider compressed HTML:

Content-Type: text/html
Content-Encoding: gzip

Feeding that into an HTML parser is clearly broken!

There are two solutions to this. One is to uncompress the incoming data
with mod_deflate. Uncompressing and compressing content radically
reduces network traffic, but increases the processor load on the proxy. It is
worthwhile if and only if bandwidth between the proxy and the backend is
at a premium: this is common on the 'net at large, but unlikely to be the
case on a company internal network.

SetOutputFilter INFLATE;DEFLATE

(note that ProxyHTMLEnable correctly inserts the proxy-html filter
between INFLATE and DEFLATE).

The alternative solution is to refuse to support compression. Stripping any
Accept-Encoding request header does the job. So invoking mod_headers,
we add a directive

RequestHeader unset Accept-Encoding

This should only apply to the Proxy, so we put it inside our <Location>
containers.

A similar situation arises in the case of encrypted (https) content. But in
this case, there is no such workaround: if we could decrypt the data to
process it then so could any other man-in-the-middle, and the security
would be worthless. This can only be circumvented by installing mod_ssl
and a certificate on the proxy, so that the actual secure session is between
the browser and the proxy, not the origin server.
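
A minimal sketch of that arrangement might look like this, assuming mod_ssl is loaded and the server listens on port 443. The certificate and key paths are placeholders, not real locations.

```apache
# Assumptions: LoadModule ssl_module modules/mod_ssl.so, and "Listen 443"
<VirtualHost *:443>
    ServerName www.example.com
    SSLEngine on
    # Placeholder paths: substitute your own certificate and key
    SSLCertificateFile    /path/to/www.example.com.crt
    SSLCertificateKeyFile /path/to/www.example.com.key

    # The secure session ends at the proxy; the backend is
    # reached over plain HTTP inside the firewall.
    ProxyPass /app1/ http://internal1.example.com/
</VirtualHost>
```

Because the proxy sees the decrypted response, the HTML rewriting described above can be applied inside this virtual host in the usual way.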

We are now in a position to write a complete configuration for our reverse
proxy. Here is a bare minimum that ignores extended URL mapping:

LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule headers_module modules/mod_headers.so
LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so
LoadModule xml2enc_module modules/mod_xml2enc.so

ProxyRequests off
ProxyPass /app1/ http://internal1.example.com/
ProxyPass /app2/ http://internal2.example.com/
ProxyHTMLURLMap http://internal1.example.com /app1
ProxyHTMLURLMap http://internal2.example.com /app2

<Location /app1/>
ProxyPassReverse /
ProxyHTMLEnable On
ProxyHTMLURLMap / /app1/
RequestHeader unset Accept-Encoding
</Location>

<Location /app2/>
ProxyPassReverse /
ProxyHTMLEnable On
ProxyHTMLURLMap / /app2/
RequestHeader unset Accept-Encoding
</Location>

Of course, there's more than one way to do it. Our configuration would
actually have been simpler if we'd used Virtual Hosts for each application
server. But that takes you beyond the realm of Apache configuration and
into DNS. If you don't fully understand that (or if you think "why can't I
see my domain" is a webserver question), then please don't try using
virtual hosts for this.
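
For readers who are comfortable with the DNS side, a virtual-host version for the first application server might be sketched like this. The public name app1.example.com is hypothetical and would need a DNS entry pointing at the proxy.

```apache
# Hypothetical public hostname; on Apache 2.2 a matching
# "NameVirtualHost *:80" line is also needed.
<VirtualHost *:80>
    ServerName app1.example.com
    ProxyPass        / http://internal1.example.com/
    ProxyPassReverse / http://internal1.example.com/
    ProxyHTMLEnable On
    # Only absolute links to the internal host need rewriting
    ProxyHTMLURLMap http://internal1.example.com http://app1.example.com
    RequestHeader unset Accept-Encoding
</VirtualHost>
```

The simplification comes from mapping the backend at the root of its own hostname: links such as /otherfile.html then resolve correctly without a context-sensitive rule.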

NOTE

If you are using a mod_proxy_html version older than 3.1, there is no
ProxyHTMLEnable directive, and you'll have to insert the filter with
Apache's standard directives: for example

SetOutputFilter proxy-html

Caching

We haven't dealt with caching in this article. In a company-intranet
situation, the connection from the proxy to the application servers is the
local LAN, which is probably fast and has ample capacity. In such cases,
caching at the proxy will have little effect, and can probably be omitted.

If we want to cache pages, we can of course do so with mod_cache, but
that is beyond the scope of this article.


Load Balancing

If the backend is a computationally heavy application, we may
wish to spread the load across multiple machines. Apache enables this
with mod_proxy_balancer. Current development versions (and future
stable 2.4.x versions) have additional clustering and monitoring
support.
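
As an illustration only, a two-member balancer for the first application might be sketched as follows. The cluster name app1cluster and the second backend internal1b.example.com are hypothetical and are not part of the scenario above.

```apache
# Assumption: LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
# (Apache 2.4 additionally needs a mod_lbmethod_* module and mod_slotmem_shm)
<Proxy balancer://app1cluster>
    BalancerMember http://internal1.example.com
    # Hypothetical second backend serving the same application
    BalancerMember http://internal1b.example.com
</Proxy>
ProxyPass /app1/ balancer://app1cluster/
```

Requests under /app1/ are then distributed across the listed members instead of going to a single backend.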

Content Transformation and Aggregation

mod_proxy_html is one of many modules that can be deployed on a proxy
to rewrite contents on the fly. Other examples include more
general-purpose content transformation, aggregation (e.g. with server-side
includes or edge-side includes), XML processing such as XInclude and
XSLT, and even embedded scripting and database queries.

Filtering and Security

A reverse proxy is not the natural place for a "family filter", but is ideal
for defining access controls and imposing security restrictions. We could,
for example, configure the proxy to recognise a custom header from an
origin server and block content based on it. This delegates control to the
application servers.
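
A simple access control on the proxy could be sketched like this, restricting the second application to one client network. The address range is a documentation-only placeholder, and the syntax shown is the Apache 2.2 form.

```apache
<Location /app2/>
    # Apache 2.2 syntax (mod_authz_host); Apache 2.4 uses
    # "Require ip" for the same purpose.
    Order deny,allow
    Deny from all
    # Placeholder range standing in for a trusted network
    Allow from 192.0.2.0/24
</Location>
```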

If you are interested in this subject, another third-party module,
mod_security, offers powerful and sophisticated protection.

(Q) Where can I get the software?
(A) Most of it from the obvious place, http://httpd.apache.org/.
mod_proxy_html and mod_xml2enc are available from
http://apache.webthing.com/.
libxml2 is available from http://xmlsoft.org/. Windows users should read
libxml2.dll for libxml2.so, and can obtain it together with the prerequisites
iconv.dll and zlib.dll from Igor Zlatkovic's site, http://www.zlatkovic.com/.
Finally, mod_security is available from http://www.modsecurity.org/.
(Q) Can I get binaries of the software?
(A) If there's no link at the websites above, ask the provider of your
operating system or distribution. The author can compile it on different
platforms but does not provide a free compilation service.
(Q) What is httpd.conf? My apache has different configuration files.
(A) Some distribution packagers mess about with the Apache configuration.
If this applies to you, the details should be documented by your distributor,
and have nothing to do with Apache itself! Substitute your distribution's
choice of configuration file for httpd.conf in the above discussion, or create
your own proxy.conf file and Include it.
(Q) You mentioned apxs and apachectl. Where do I find them?
(A) They're part of a standard Apache installation (except on Windows). If
you don't have them or can't find them, that's a problem with your
installation. The easiest solution is probably to download a complete
Apache from httpd.apache.org.
(Q) Does mod_proxy_html deal with Javascript links?
(A) From mod_proxy_html 2.0, yes!
(Q) The proxy appears to change my HTML?
(A) It doesn't really, but it may appear to. Here are the possible causes:

1. Changing the FPI (the <!DOCTYPE ...> line) may affect some
browsers. FIX: set the doctype explicitly if this bothers you.


2. mod_proxy_html has the side-effect of transforming content to utf-8
(Unicode) encoding. This should not be a problem: utf-8 is
well-supported by browsers, and offers comprehensive support for
internationalisation. If it appears to cause a problem, that's almost
certainly a bug in the application server, or possibly a misconfigured
browser. FIX: filter through mod_charset_lite to your chosen
charset.

3. mod_proxy_html will perform some minor normalisations. If your
HTML includes elements that are closed implicitly, it will explicitly
close them. In other words:

<body>
<p>Hello, World!
</body>

will be transformed to

<body>
<p>Hello, World!</p>
</body>

If this affects the rendition in your browser, it almost certainly means
you are using malformed HTML and relying on error-correction in a
browser. FIX: validate your HTML! The online Page Valet service
will both validate and show your markup normalised by the DTD,
while a companion tool AccessValet will show markup normalised by
the same parser used in the proxy, and highlight other problems.
Both are available at http://valet.webthing.com/

Owner niq, Last Updated: Fri Nov 29 08:18:25 2013.
