Fighting spam: Spam-resistant web statistics
A few words in advance
These instructions assume that you already know how to record the data generated
when a web server is accessed and how to process it so that usable statistics
can be presented to other users. This requires that you are able to configure
Apache to pass its log data to a script that splits the data into distinct
fields, which are in turn recorded in a MySQL database. It is furthermore
necessary to read that database elsewhere and process the data it contains so
that it can subsequently be displayed by a web browser.
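As a reminder, such a pipeline is typically wired up with Apache's CustomLog
directive, which can pipe log lines directly into a program; a minimal sketch,
assuming a hypothetical logging script at /usr/local/bin/log-to-mysql:

    CustomLog "|/usr/local/bin/log-to-mysql" combined

Each line the script reads on standard input is then split into fields and
inserted into the MySQL database.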
These instructions describe how to modify a script that transmits the
statistics to a web browser so that search engines can no longer process the
displayed links and therefore no longer record the target pages in their
indexes.
Referrer stats and spam
Most people have probably had their fair share of this phenomenon: As soon as
users are offered an option to post a message somewhere (in the broader sense
web stats are messages as well, because they provide information on where a
visitor came from, from which IP address he is visiting the site, which browser
he is using, etc.), spammers inevitably show up and contaminate the affected
platform with unwanted entries. Web statistics are no exception, and so-called
referrer spam in particular presents a problem.
Here the Referer header used by HTTP is manipulated by a visitor in such a
fashion that the URL of a site to be spamvertised is transmitted instead of the
URL of the site he actually came from, thereby making it show up in the
statistics. The spammer's goal in doing so is twofold: First, he wants to give
visitors an opportunity to reach the spamvertised page and thus make it known
in the first place; second, this is coupled with the promise of having the
spamvertised site included in the indexes of search engines, which in turn is
supposed to raise the site's profile even further. To some extent another goal
is to have the PageRank of an affected site propagated to the spamvertised
site, so that search engines consider it relevant and move it to a better
position within the search results – possibly supplanting other pages that are
considerably more relevant. The only thing of interest to a spammer is how he
can make the site he is spamvertising known to as broad an audience as possible
as quickly as possible. He generally takes an interest in other sites only
insofar as he can abuse them to his own ends.
Unfortunately, the spammers' work is made easier by the fact that the scripts
used for creating the stats are of a quite straightforward design and therefore
provide no defensive measures at all that could prevent pages referred to in
the statistics from being included in the indexes of search engines.
The reference problem
The crux of the problem is the fact that a link by definition leads somewhere.
This is normally expressed by means of the href attribute, which specifies the
destination of a link. It is exactly this that aids the spammers: Any address
set via this standard attribute can be analyzed effortlessly, and the
destination of the link normally isn't checked, either. This way a search
engine can find any linked site and include it in its index, which makes
spamvertised sites known much more quickly. The spammer thus reaches his goal
in virtually no time.
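To illustrate, a conventional statistics script emits each referrer as a plain
link along these lines (the address is a made-up placeholder) – the href value
is exactly what a crawler harvests:

    <a href="http://www.example.com/">http://www.example.com/</a>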
The search engine problem
If search engines are allowed to analyze referrer stats, they are confronted
with a problem as well: Among various genuinely useful sites there is very
often a lot of junk, which a search engine cannot tell apart at first glance.
Recognizing spamvertised sites and subsequently removing them requires
sufficiently sophisticated algorithms as well as manual intervention. These are
unnecessary processes that could easily be avoided; the work and computing time
involved could be put to much better use for other tasks, such as regularly
pruning the search engine's index or processing genuinely meaningful pages. In
the end, spammers therefore also harm the operators of search engines.
Consequences for the referring site
But this is just the beginning: Things become particularly bad when a search
engine suddenly classifies an otherwise benign site as a source of spam because
its referrer stats are swamped with links to spamvertised sites. This
inevitably has consequences for the referring domain, which is suddenly
downgraded or, in the extreme case, removed from the search results altogether.
In this way spammers harm the operators of these web sites, who are cut off
from traffic that search engines would otherwise have directed to them. If
companies are adversely affected by such machinations, there is always the
danger that this becomes noticeable as a loss of revenue.
On top of that, things can become particularly problematic for the operators of
affected sites: Since spam is generally frowned upon, it may happen in next to
no time that visitors who perceive a site to be a source of spam leave and
start looking elsewhere instead.
Ineffective nofollow
In order to render this kind of spamming pointless, the rel attribute was
extended so that it can be specified per link whether or not the PageRank of
the referring site may be propagated to the page being referred to, but this
method has proven to be ineffective. The page being referred to won't be rated
any higher, but the spammer's primary goal is reached nonetheless: The page
remains reachable, and it is still included in the indexes of search engines.
No resulting decrease in the appearance of spam could be observed.
The only positive effect of the value nofollow is that the referring site won't
be penalized merely for referring to spamvertised sites and so retains its
original PageRank – but that still doesn't prevent search engines from
classifying a site as a source of spam when too many links to spamvertised
sites show up.
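For reference, a minimal sketch of a link marked this way (the address is a
placeholder):

    <a href="http://www.example.com/" rel="nofollow">http://www.example.com/</a>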
How JavaScript comes into play
In order to thoroughly thwart at least one goal of referrer spam, an entirely
different approach is needed than setting the link to the destination page by
means of the href attribute. To begin with, this attribute is set to a value of
javascript:void(0);, which effectively prevents search engines from analyzing
the link in question: From the search engine's point of view it doesn't even
contain a valid link target!
One could argue, though, that one could simply omit the href attribute and thus
place an empty link, but then the problem arises that such a dead link appears
as normal text instead of as a link. Clicking on it triggers no action, just as
planned, but visitors can no longer recognize it as a link. So this is not a
viable method.
One could of course set the content of this attribute to #, but this empty
anchor causes the page to be recorded under two distinct addresses, something
that should be avoided as far as possible! With javascript:void(0);, in
contrast, the search engine recognizes the link target as invalid and doesn't
set out to look for potential targets, and the problem caused by the empty
anchor is avoided as well.
One would nonetheless want to allow visitors to reach the destination page if
they so desire when they click on the respective links. This requires that the
expected behavior of these links be emulated with the aid of JavaScript, so
that the browser directs a visitor to the destination page as normal when he
clicks on it. This can be achieved by means of the onclick attribute, which is
inserted into the link by the script that creates the statistics. A link so
modified appears in the HTML code along the following lines (the address shown
is a placeholder):
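    <a href="javascript:void(0);"
       onclick="window.location.href='http://www.example.com/';">http://www.example.com/</a>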
The event handler so inserted tells the browser to execute the JavaScript
included therein, whose sole task is to load the document located at the
specified address as if it were an absolutely normal link. From a search
engine's point of view, however, this link is a dud: Since the given address
isn't a valid target, there is nothing of interest for the search engine, and
so it never sees the page being referenced.
If so desired, the script may be extended so that a link can be created as
usual – that is, the destination address is set normally within the href
attribute – when the link in question is marked accordingly in the database.
However, you should do this only after sufficient verification, because it
permits a search engine to reach the destination page again, which is exactly
what we don't want in the case of spamvertised sites. When the destination page
turns out to be clean, though, this extension lets search engines find it, and
your PageRank can also be propagated to the page being referenced – which in
turn benefits the operator of the site being referenced.
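A minimal sketch of the two variants such an extension might emit, depending on
a hypothetical verified flag stored alongside each referrer in the database
(addresses are placeholders):

    <!-- verified referrer: a normal link that search engines may follow -->
    <a href="http://www.example.com/">http://www.example.com/</a>

    <!-- unverified referrer: the search-engine-proof variant -->
    <a href="javascript:void(0);"
       onclick="window.location.href='http://www.example.com/';">http://www.example.com/</a>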
Should unwanted links nonetheless show up in the statistics, you may easily
remove them from the database without their presence having caused any harm
beforehand.
When search engines don't see spamvertised sites any more...
In this case you have won and thoroughly thwarted one primary goal of any
spammer: getting the spamvertised pages included in the indexes of search
engines. A visitor may still reach the pages being referenced, but since the
search engines don't see these sites, such pages become harder and harder to
find. This reduces the number of visitors to spamvertised sites and denies the
spammers their sense of success, since the larger audience never materializes.
On top of that, it also spares the search engines pages that nobody needs,
avoiding unnecessary work steps, and the space not taken up by nonsensical
pages remains available for more useful content.
Implementing a search-engine-proof reporting function
To make the task of finding undesired referrers easier, you also have the
option of including a reporting function that permits your visitors to report
any suspicious or unwanted links. Reported links should be marked appropriately
in your database, and preferably you are notified by e-mail whenever links have
been reported. However, you have to take care to prevent search engines from
following these reporting links and thereby spuriously reporting referrers.
To achieve this you may use the same method that prevents search engines from
easily following unchecked links, as sketched below. It is also helpful to
implement an option for disabling the reporting function for individual links,
should you discover that a link is being reported for no reason – for example
by some joker, or by people who, for whatever reason, want to see to it that a
particular page is no longer referenced.
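A minimal sketch of such a reporting link, using the same technique; the script
name report.php and its id parameter are assumptions for illustration, not part
of the original setup:

    <a href="javascript:void(0);"
       onclick="window.location.href='report.php?id=42';">Report this link</a>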