
One of the more complex and perhaps least commonly understood issues we've faced here on the Wolfgang Digital SEO team over the past 12 months has been the rise in prominence, frequency and general havoc incurred at the hands of the dreaded spider trap, or crawler trap as they're sometimes referred to.
A spider trap is something webmasters should strive to avoid at all costs, as it's essentially a death knell to your site's ability to be crawled and indexed, which in turn impacts negatively on overall organic visibility, rankings and, ultimately, your site's ability to generate revenue; it's kind of a big deal when contextualised as such!

So, in order to ensure you have a solid understanding of the potential impact of a spider trap, it's important that we give an overview of what a spider trap is, how to identify one and how to diagnose one, but before we weave that particular web (pardon the spidery pun!), let's take a step back and understand the fundamental reason why any of this is important to us as website optimisers, business owners and marketers; it's all down to the concept of crawl budget and how you can influence website performance in the context of maintaining crawl efficiency through effective URL management.

What is crawl budget and why is it important?


Google and the other search engines have invested significant capital in creating these
wonderful search engines that have become ingrained in our day-to-day lives; gone are the
days when we dusted off an old Encyclopaedia Britannica for authoritative answers to our daily
queries or the innocent early-internet days of simply Asking Jeeves and hoping he served us up
a treat! Search engines are big business, and with SEO spend set to skyrocket to $80bn in the
US alone by 2020, chances are they're here to stay, in one form or another.
With the ever-evolving Google algorithm at the forefront of everything we in the SEO industry try to build our best practices around, it's fair to assume that there's a significant cost associated with Big G maintaining this vast, RankBrain-driven beast. Needless to say, the bots/crawlers/spiders employed by Lord Googlebot to crawl our sites, index our content and, ultimately, display them to our target audiences cost money to run. Hosted within vast server networks dotted throughout the globe, there is a financial cost associated with physically providing the bandwidth expended by these spiders as they crawl through the web on a continual basis.
Fundamentally, if a website isn't optimised to allow spider access to its infrastructure in an efficient manner, a spider will inevitably reach a point at which it hits its allocated bandwidth allowance for a given site and moves on to the next website on its crawl schedule, rather than hanging around indefinitely on the same site, hoping it finds a way through a myriad of issues before coming out the other side with a coherent understanding of what it's just scanned.

Naturally, if there's been an issue whereby a spider couldn't complete the task at hand and ceased crawling a site because its crawl budget had been reached, then the affected site could be deemed less than optimal in the eyes of the crawler and could be downgraded as a result (when it comes to ranking the site versus, say, a competitor site which the crawler has no issues accessing).

Furthermore, if a website has serious crawl issues, then some very important product, category
or informational pages may never see the light of day in the SERPs if a spider was unable to
reach them to begin with.

To put a slightly weird personification around this concept, consider the following scenario:

You're walking around a brand new multi-storey shopping centre, browsing through the first couple of shops, deciding which shops you like and might tell your friends about and which ones you won't want to visit again.
Then suddenly, just as you're getting a feel for the place, the shutters come crashing down on the stores. You can't go any further into this wonderful new shopping complex, the security guards block your path and you have no idea what's in the remaining shops or how you might spend the budget for the shopping trip you've planned.
Naturally, you make a beeline for the nearest exit, perhaps with a sour taste in your mouth, a little less likely to return, and still with no clue what's beyond the shutters you reached so unexpectedly. You take your hard-earned cash to the neighbouring shopping centre, which welcomes you with open arms and meets all your shopping needs, and you tell your friends all about how great this place is!
In this scenario (in case I've lost you already!), you are the crawler, the shops are the pages of a
website and your friends are the SERPs; can you guess who played the role of the spider trap
yet?! Those pesky shutters/security guards of course!

Crawl budget, in its purest sense, is the number of times a search engine crawler visits your website during a given timeframe, and it is heavily influenced by how easily the crawler can navigate the site. For example, if Googlebot typically visits your website approximately X times per month, we can assume with a good degree of confidence that that figure is your monthly crawl budget for Googlebot, although this is by no means set in stone.

It's important to recognise that estimated crawl budget can evolve over time. Many other factors such as PageRank (the kind that Google still most definitely uses internally despite it being discontinued as a publicly-available toolbar metric) and server host load play a role in crawl budget too, as stated by former friend-to-SEOs Matt Cutts in a revealing interview on the topic with Eric Enge a couple of years back.
A quick scan of your web logs, or of the Crawl Stats report in Search Console, will help you gain insight into what your average crawl budget for Google may be at a given point in time.

Here's an example of what calculating estimated crawl budget looks like for our very own spanking new Wolfgang Digital website:

Looking at the above stats, if we multiply the daily average of 275 by 30, we can deduce that the average monthly crawl budget for this website from Google is 8,250, meaning we now have an estimate of the number of pages we can expect Google to crawl, index and rank within a given timeframe.

If the number of crawlable assets (URLs) we want to rank for is significantly higher (or lower) than this amount, it typically means we need to look further at crawl budget optimisation as a priority issue from a technical SEO perspective. Thankfully, we're alright for now, but we regularly encounter scenarios in which websites are struggling with their information architecture (IA), and oftentimes the root cause can be determined with a simple Screaming Frog crawl configured to emulate Googlebot.

Bottom line is, if your site can't get indexed, it can't get ranked. Don't let crawl budget play a part in how a spider treats your site; ensure you're optimising to allow spiders to enter your site, visit what's important and exit the site left in no uncertain terms that you've pulled out all the stops to make it a smooth, logical experience for them.

What is a spider trap and how do I fix it?


A spider trap, bluntly portrayed as an unreasonable security guard or inanimate shutter above, is basically what stands in the way of a spider accessing your website, having a lovely time in there and coming away with nothing but happy memories and some nice indexable pages to rank in the SERPs. The harsh reality instead is that the spider gets trapped in a grim, never-ending loop within a section of your site, which causes all sorts of trouble and eventually forces it to give up and move on due to pre-programmed bandwidth allowance limits.
Once you have access to the basic tools required to perform a site crawl, you're ready to commence your crawl analysis and determine whether a spider trap exists on your site.

They're particularly prevalent on ecommerce sites due to the nature of large inventory management setups and fancy UX-led filtering configurations, but they can be found on pretty much any site with dynamically-served content. Big sites will naturally have more bandwidth allocated to them by search engine spiders, but that doesn't mean they are less susceptible to being caught in a trap; quite the contrary, in fact. Similarly, smaller sites can be equally affected by these issues if they're not nipped in the bud at an early stage.

Sometimes, a trap is evident at a very early stage, with a loop kicking in with less than half the
site crawled.

Here's an example of a very early stage spider trap on a very small site, where the crawl hits the 8th asset and then loops back through URLs 1-7 before stumbling back over 8 at just 6.4% of crawl completion; this means a staggering 93.6% of this small website is not being crawled, indexed or ranked:

The following crawl completion index illustrates what a mid-stage spider trap looks like: the spiders have crawled a significant chunk of this large site but can't get any further, effectively leaving two fifths of this large ecommerce site on the shelf in terms of potential web visibility:

Other times, it can kick in really late in the day, with almost the entire site crawled, yet the
crawl will simply never end due to the crawler continually falling through the infinite loop
inflicted by an intricate faceted navigation configuration or a simple coding error.

This can be particularly disheartening: just when you think the crawl is almost complete, with only 161 URLs remaining from over 40k, it starts skipping back up around the 200 mark and never quite falls below the 100 mark towards completion.

This kind of trap will likely have less of an impact in the sense that the majority of the site has been crawled, but it'll undoubtedly cause crawlers to look unfavourably on the site in question and perhaps reduce crawl budget to avoid running through similar issues in future crawls. Best not to leave such things to chance in the hope that crawlers deem things OK and index most of what they've found; we say let's nip these traps in the bud and make crawl efficiency a non-issue once and for all!

There are four main causes of spider traps which we've encountered in recent times, each with varying degrees of complexity to both identify and diagnose; let's begin with the simpler ones, moving through to the proper headache scenarios for us SEOs (not to mention the poor spiders, who I'm sure by now many of you are visualising as a hairy little Google-critter funnelling its way down the interweb towards your beloved domain!).

If you're more versed in using crawler tools then you may still have some burning questions such as "why will this crawl never end?" or "why is smoke emanating from my brand new Lenovo?"; this guide should hopefully put your mind at ease and put you on the path towards an optimal crawler experience for all.

1. The Infinite Calendar Trap


A calendar trap is perhaps the only trap that isn't the product of a fault or major technical oversight on any dev or webmaster's part. The pages served are, in theory, legitimate URLs that do serve an ultimate purpose or function; the only issue is that as time, by its very definition, is infinite, then so too are any URLs that relate to time!

How to Spot an Infinite Calendar Trap


This is probably the easiest kind of spider trap to understand, spot and address. If you have a calendar on your site that enables a young parent to navigate to and potentially book an event for their newborn baby's graduation ball, or worse still, a trip to Disneyland in 3016, then it's likely your site has a calendar trap!

A crawler will never reach the end of these calendar pages unless a system is in place to
manage a reasonable cut-off date, which of course will need to be revisited once that date
becomes reasonable in future!

How to Fix an Infinite Calendar Trap


There are a number of relatively straightforward fixes to ensure this is never an issue for your site. Using a "noindex, nofollow" meta tag on pages beyond a reasonable date range is one option, whilst employing the robots.txt file to disallow any date-specific URLs beyond a certain timeframe is another, although this route is uncommon.
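As a rough sketch of both options (the paths, parameter name and cut-off year below are purely hypothetical; the exact URL structure depends on your calendar plugin), the meta tag goes in the <head> of far-future date pages, while the robots.txt rules block those URLs from being crawled at all. Bear in mind that a URL blocked via robots.txt won't have its meta tag read, so pick one approach per URL:

<meta name="robots" content="noindex, nofollow">

User-agent: *
# One Disallow per far-future year, or a pattern covering the date parameter
Disallow: /calendar/2030/
Disallow: /events/*?month=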

Most out-of-the-box Web 2.0 calendar plugins and self-build functionality guidelines come pre-built with these crawling considerations in mind, but the issue does crop up now and again on older sites with tonnes of other legacy issues; there's a handy host constraint solution for calendar traps outlined in the JIRA archives should the need arise.

2. The Never-Ending URL Trap


A never-ending URL trap can be found on pretty much any website and is not confined to particular industry sectors or to transactional versus non-transactional domains; it is usually just the product of a malformed relative URL or poorly-implemented server-side URL rewrite rules.

How to Spot a Never-Ending URL Trap


It's very uncommon to see the results of this particular trap within a web browser or indeed within the SERPs, as they're often buried deep within a site's IA and can often be the very reason why some content beyond the point of a never-ending URL is not actually indexed for a user to find. These generally only become evident with the use of a crawl tool such as the excellent Screaming Frog or Xenu Link Sleuth.

You can tell something's up when you start to notice that a) the crawl is hitting a stumbling block and looping back on itself, as per above, and b) some really funky-looking URLs start cropping up in the crawl dashboard with tens, hundreds, even thousands of crazy-looking directories appended to them. Here's an example of a recent case of this exact trap encountered on an ecommerce client's site:
Note the lack of a leading slash on the relative link: enough to fool the spider into thinking there is an infinite series of directories nested beneath this one, thus sending the crawl into overdrive on an infinite URL loop! A simplified, hypothetical version of the offending markup is shown below.
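To make the mechanism concrete (the file name here is made up for illustration, and the loop only occurs because the server keeps answering the nested paths with a 200 rather than a 404), compare the two link formats:

<a href="lkbennett/shoes.html">Shoes</a>
<!-- relative: from /lkbennett/ this resolves to /lkbennett/lkbennett/shoes.html, then /lkbennett/lkbennett/lkbennett/shoes.html, and so on -->

<a href="/lkbennett/shoes.html">Shoes</a>
<!-- root-relative: always resolves to the same URL, no matter which page links to it -->

Adding the leading slash (or better still, using the full absolute URL) stops the loop dead.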

This is just one of the many reasons we're not fans of relative links here at Wolfgang; absolute URLs are the order of the day, but that's a whole other topic for a different day! Yoast's Joost de Valk has a detailed blog post expressing the rather extreme view that relative URLs should be forbidden for web developers, citing spider traps as one of the key risks associated with their use, while Ruth Burr Reedy has an excellent Whiteboard Friday session covering why relative URLs can be an SEO's worst nightmare; well worth a watch to help develop a further understanding of this topic as a whole.
Relative linking issues aside, some other reasons you might find an infinite URL string cropping up in a crawl environment include poorly configured URL rewrite rules from a previous website migration project, or malformed query parameters which ignore large sections of a URL string due to server-side dynamic URL serving, e.g. if someone types www.yoursite.com/this-is-a-completely-made-up-url but the server still returns a 200 response code within a crawl instead of a 404. The solution remains pretty much the same across the board: a properly maintained URL management infrastructure in conjunction with correct server response code handling.
How to Fix a Never-Ending URL Trap
Once you've noticed this kind of trap occurring, you can use the sort functionality within the crawler tool to sort by URL length; find the longest URL and you'll then find the root source of the issue. In the above case, we were able to isolate the culprit as residing somewhere within the source code of the lkbennett directory.

It's then a matter of sifting through the source code of the page in question and looking for anomalies. It turns out that at the very root of this spider issue lay a very simple mistake: a tiny typo on line 2354 of the code, within a relative URL link configuration:

Considering there are over 1300 links on the page in question and any one of these could have
potentially been the cause, it was a proverbial needle in a stack of needles, but an experienced
eye can spot these issues fairly quickly. Big relief that it wasn't a much, much worse issue; once
flagged, this was put to bed in a matter of minutes and the site was back firing on all cylinders
in no time!

Failing a manual weeding process, there are some more technical ways of addressing the situation, either by disallowing the offending parameters or directories with the robots.txt file or by adding server-side rules to enforce a maximum URL length. Both of these approaches require some savvy programming skills, but the net aim is for non-existent URLs to correctly serve 404 response codes rather than being passed off as 200 (OK) pages for infinite bot consumption.
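As a rough sketch of the robots.txt route (the directory name is taken from the example above; the pattern will differ per site), you can block the recursive path while the underlying link is being fixed:

User-agent: *
Disallow: /lkbennett/lkbennett/

And on the server side, an Apache rewrite rule along these lines can force overly long request paths to return a 404 (the 200-character threshold is an arbitrary figure you'd tune to your own URL structure; nginx has an equivalent):

RewriteEngine On
# Return a 404 for any request path longer than 200 characters
RewriteCond %{REQUEST_URI} ^.{200,}$
RewriteRule .* - [R=404,L]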

If you're fortunate enough to have access to a dev team with the skills to implement these workarounds, they should also be well-equipped to build a more permanent resolution to the issue which led us here in the first place, namely in the form of a rebuild or fully-functional URL rewrite exercise.
3. The Session ID Trap
A session ID spider trap is generally found on larger ecommerce websites where the need has
been established for more granular user session tracking, channel attribution and cross-sell
between established SBUs without the desire for over-reliance on cookies as the primary data
collection instrument.

How to Spot a Session ID Trap


This kind of trap can usually be picked up pretty quickly by crawling your website and looking at
the list of crawled URLs for something like this:
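For illustration, a crawled URL carrying a session ID might look something like this (the domain and ID are hypothetical):

https://www.example.com/products/garden-furniture;jsessionid=1A2B3C4D5E6F7A8B9C0D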

Commonly-identified symptoms of the session ID trap are the appearance of tags like 'jsessionid', 'sid', 'affid' or similar within the URL strings as a crawl unfolds, with the same IDs recurring beyond a point where the spider can successfully move on to the next ID-laden URL string.

Other times, as established sites test new user and cookie tracking methods (without prior knowledge of how this might impact crawl efficiency), a likely spider trap issue can be identified almost instantaneously within the browser if a constant parameter string is appended to previously normal-looking URLs:
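A hypothetical example of what that can look like in the address bar, using the 'opt=' parameter discussed below:

https://www.example.com/category/sub-category?opt=a1b2c3d4e5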

In each of these very different cases, the use of 'jsessionid' and 'opt=' are attempts to track user sessions without the use of a traditional, cookie-based attribution model, and not something we've ever been overly keen to recommend, as the propensity for error is so high; we're yet to see it implemented without some major downsides.
The logic behind the implementation of session ID tracking is that if a particular URL visited by a user doesn't have a session ID associated with it, the user is redirected to the same URL with the session ID appended to the end of it, as per the 'opt=' example above. If the requested URL does have the requisite session ID attributed to it, then the server will load the relevant webpage, but it also appends the session ID to each and every internal link on the same page; this is where it all starts to unravel from a spider's perspective, as so much can (and often does) go wrong from here onwards that we'd call for this practice to be avoided in its entirety where possible!

The fact of the matter is, there are almost always a couple of links that slip through the net here and aren't correctly assigned the specific session ID needed for this tracking setup to function smoothly. If the structure contains just a single internal link without the relevant session ID due to incorrect implementation, then that link will effectively generate a brand new session ID each time it's followed. This means that, from a spider's PoV, it arrives at a new version of the entire website tracked under a brand new session ID each time this rogue link is accessed. Confused yet? Imagine what the poor spider makes of it all!

How to Fix a Session ID Trap


To diagnose this kind of issue during an initial site crawl, it may be necessary to first exclude the offending parameters within the crawl tool itself so that the crawl can be completed. More problematic parameters can be picked up as exclusions are added, until crawl completion is finally possible. For the above example, excluding the problematic 'opt=', 'so=' and 'returnurl' parameters from the Screaming Frog crawl can look something like this:
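As a rough sketch (Screaming Frog's Exclude feature takes regular expressions matched against the full URL, so the exact patterns will depend on your own parameter names), the exclusion list might read:

.*opt=.*
.*so=.*
.*returnurl.*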
The ideal solution here is that this logic be removed from the site entirely, but if a band-aid
solution is required, then replicating these exclusions within the robots.txt file should ensure
that Googlebot and other search engine spiders can now get beyond the trap and start indexing
some of the more important URLs we want the site to have organic visibility for.
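Replicated within robots.txt, that band-aid might look something along these lines (again, the parameter names are taken from the example above; test carefully, as broad patterns like these can also match legitimate URLs):

User-agent: *
Disallow: /*opt=
Disallow: /*so=
Disallow: /*returnurl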

Managing URL parameters within Google Search Console can also help with the task of instructing Googlebot on specific ways to deal with specific parameters, with options to "crawl", "don't crawl", or "let Googlebot decide" using active/passive rules within the dropdown boxes available beside each parameter instance.

However, as you'll notice when you log in to this area of GSC, a prominent warning of "Use this feature only if you're sure how parameters work. Incorrectly excluding URLs could result in many pages disappearing from search" welcomes webmasters upon arrival:
Handle with care on this one, as it's probably one of the most advanced sections within the entire Google webmaster suite. Be sure to read up on the official guidelines on the topic if this is a rabbit hole you're keen to explore the depths of.
If your site is falling victim to this issue, the only sure-fire way to fully address this kind of spider trap is to completely remove every single instance of session IDs from all hyperlinks and all internal redirect rules across the entire site. Care must be taken to ensure that every instance is removed, as if even one remains, the crawl will still throw up a potentially unlimited number of URLs when it reaches the point of error within the IA.

4. The Faceted Navigation Trap


This particular spider trap is the one we've found the most challenging to address here at Wolfgang HQ, not only because the solution is often hard to implement but also because, in theory, faceted navigation is one of the greatest UX features on any modern ecommerce website; it allows users to filter deep within a site's menu structure to find what they really want, fast.

As highlighted by ClickZ in this great take on how to make robots cry with faceted navigation, its presence on a site is an almost universally positive experience for humans.
Two of the most important benefits of faceted navigation are outlined as:

Facets permit users to combine selections to zero in on results.


Facets permit users to make those selections in any order.
The problem is, when there's a massive list of items on a site, and they can all be sorted by a large number of filtering options, then the potential for all sorts of bizarre URL permutations is quite staggering if not managed correctly from the outset.

How to Spot a Faceted Navigation Trap


If your website offers users a range of different products with many different ways to navigate
towards finding these items, it may well be susceptible to this kind of spider trap.

Looking for elongated URL strings, various recurring filtering tags and a never-ending loop within a crawler tool is again the tell-tale indicator of whether or not your site is configured to handle faceted navigation in an SEO-friendly manner.

Common sorting labels such as colour, size, price, or number of products per page are just
some of the many filter tags that can create issues for a crawler upon visiting your site. Crawl
issues around faceted navigation generally start to arise when it becomes evident to a spider
that it is possible to mix, match and/or combine various filter types.

For example, using the case of a hugely popular Irish DIY store below, if a user is looking to purchase some paint for an upcoming home decoration project, then it's useful that he or she can navigate through to desired options using filter items such as colour, brand and price.

The issue arises when the same user can then also sort this filtered result by number of products per page, the category of paint, a maximum price range, a minimum price range and so on and so forth. Users should be able to filter by a couple of different, important filters, but not by all of them simultaneously.

During the initial phases of our technical auditing process on this website, our crawler was set
to emulate Googlebot using the excellent user-agent settings in Screaming Frog.

It quickly became apparent that a spider trap was in play, meaning that the spider was essentially being sent on an infinite, never-ending loop through a series of filters as a result of the faceted navigation structure of the site. This resulted in long URL strings such as www.website.ie/kitchen-and-bathroom/kitchen?FilterCategoryID=8399&BrandID=3455&Page=1&PriceHigh=2147483647&PriceLow=0&SortBy=1&q=&ProductsPerPage=9 being churned out, only to be sent through yet another iteration of the filtering process, leaving our crawl running for days on end without ever reaching completion.
We knew it was a big site, but eyebrows began to rise with over 20,000,000 URLs crawled!

This scenario throws up a potentially endless (well, in the multiple billions, anyway!) number of permutations that a user can tweak to get their ultra-tailored result, which may ultimately just be a single, very long URL with a single product result displayed on its corresponding web page.

With the greatest UX intentions at heart, this faceted nav structure has failed to take a very fundamental SEO consideration into account: how on earth does a search engine fetch, render, parse, index and rank these multiple billions of URLs?! (The answer is, it doesn't; it gives up and moves on at some point, as covered in our crawl budget analogy above!)

How to Fix a Faceted Navigation Trap


Again, applying exclusions to the crawl tool will be necessary just to complete the crawl in
question.
In order to ensure we could exclude insignificant filters without negatively impacting important category or brand listings, we first needed to understand which filters could be dropped from the crawl process and which could remain, all the while testing to ensure that major landing pages, categories and product listings were not affected.

After a period of consultation with the client, we were able to determine which filters pertained to content that is important to rank for (brands and categories) and which ones are superfluous in the context of SEO (size, number of products per page, maximum price, minimum price, etc.).

An exclusion of that sort looks like this within a site's robots.txt file:

Disallow: /*PriceHigh
Disallow: /*PriceLow
Disallow: /*SortBy
Disallow: /*ProductsPerPage
Disallow: /*searchresults

Robust testing is required with each exclusion inserted; exclude too much and some important
listings could be blitzed from the SERPs, exclude too little and the problem may only be partially
addressed.

It's important to be aware that the net result of this exclusion process will ultimately be fewer pages indexed in Google and other search engines, but the remaining listings will relate to the more important pages on the site, duplication will be reduced and, most importantly, the spiders will be better positioned to crawl the entire site and re-prioritise their understanding of it, free from the risk of a spider trap. It should result in a win-win situation for both spiders and webmasters (and subsequently, business owners).

The best form of defence against this kind of issue borne of faceted nav is to avoid it to begin with! In the above case, applying the recommended exclusions and rebuilding the site's mega menu in HTML5 in place of a JavaScript menu helped tame this particular spider trap, and the site is reaping the rewards from an organic traffic perspective.

That's not to say that JS menus can't prove effective, but they need to be set up in a way that doesn't allow multi-faceted layers to populate within dynamic URL strings. More is not always best: giving users choice is always a good thing, but there comes a point where logic needs to prevail and a case needs to be built around the pros and cons of having another four or five filter options that won't really influence purchase at the end of the day.

When using JavaScript to serve faceted navigation, it's important to decide how and what to serve the spiders, and in which format. Rather than just serving unreadable JS tags, we recommend using a pre-rendering mechanism such as Prerender.io in order to make the transition through the JS as smooth as possible for the predominantly HTML-savvy spiders.
If exclusions and/or rebuilds aren't feasible options, then adding canonical tags to offending URL strings can certainly help with the indexation side of things and will help avoid thin content and potential site penalisation at the hands of Google Panda, but it does not address the fact that these awful, elongated URLs first need to be crawled in order for the canonical directive to be picked up and respected by the spiders.
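For reference (reusing the illustrative DIY store URL from above; the protocol and exact paths are assumptions), the canonical tag sits in the <head> of the filtered page and points back to the clean category URL:

<link rel="canonical" href="https://www.website.ie/kitchen-and-bathroom/kitchen" />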

OK, so the spider traps have been removed; now what?!
Now that you've gained a deeper understanding of the potential causes of and solutions to these four main common spider traps, we hope you can begin to appreciate the importance of embracing the spiders for enhanced overall performance of your website from an SEO perspective.

There's also another pretty interesting instance of spider traps arising from keyword search indexing that's worth investigating if none of the above spider trap issues are the root of your crawling woes. We've yet to encounter this issue on any of our clients' sites, hence the lack of detailed coverage; fingers crossed that remains the case for some time to come!
If you've conducted any of the changes suggested above to remedy your spidering woes, we'd highly recommend you use both Google Search Console and Bing Webmaster Tools to resubmit your XML sitemaps and allow the new changes to be rendered and indexed; you'll be amazed at how your estimated crawl budget tends to rise within a couple of months of the spiders having a better grasp of a website in a post-trap environment! Using Google's Fetch and Render option on the fly as site changes are made is a sure-fire way to test success rates and ensure all is well from a crawl perspective.

Ensuring that internal linking is well utilised throughout your site and that you're creating fresh content are two sure-fire ways to ensure that Googlebot understands which areas of your site are the most important to you in this fresh new, spider-friendly IA you've enabled. Continual reminders through smart use of content, links and sitemaps will help ensure topic priority is sculpted accordingly.

Finally, we'd recommend embedding a process whereby you've got a clearly-defined schedule for periodic SEO crawling mapped out for your site. As illustrated in detail above, there are a lot of moving parts when it comes to URL management and indexing, so it makes sense to run a crawl every couple of days, weeks or months, depending on the size of the site and the frequency of change it undergoes. If you're an existing Wolfgang client, you can rest safe in the knowledge that we're on the case if your site does ever experience the wrath of a spider trap, but if any of our readers are keen on learning more about our SEO methodologies, you can always contact us and we'll gladly crawl your site for you!
Do you have any crazy crawl issues to report, or have you seen an instance of a spider trap on your own site or a site you manage? If so, please don't be a stranger in the comments box below or indeed on social; we'd love to hear your feedback or answer any queries you may have in relation to taming the spiders on your domain!
