Fighting spam: Spam-free guest book
Conspicuous patterns
Spam usually stands out because of some very characteristic traits. Its
structure is particularly interesting: very often the message isn't formatted
at all but turns out to be a run-on text written as a single line. The
distribution of links within the text is also of interest: in roughly one out
of two cases they are dispersed arbitrarily throughout the text, and in about
20% of all cases they appear as a compact list and so catch the eye.
Whereas a semantic analysis of the text (to expose nonsensical prose containing
equally nonsensical links) is very difficult to implement, if it can be done at
all, link lists can be detected much more easily, because the links in them
stand out simply by their proximity to each other. Spam is also conspicuous
when the links are present in multiple variants to accommodate different means
of transmission. That looks something like this (here shown as a cleartext
address, as BBcode, and as HTML code):
http://www.example.com/viagra.html
[url]http://www.example.com/viagra.html[/url]
<a href="http://www.example.com/viagra.html">Viagra</a>
Such a construct is generally suspicious, because there is normally no good
reason to set links in this manner. The format is usually adapted to the medium
used to publish the message: on a web page, (X)HTML is the natural choice; in a
forum or guest book that offers BBcode, the BBcode construct would be used; and
in normal, unformatted text the link would simply appear in plaintext. When you
encounter this combination, you can be pretty sure that you are dealing with
some sort of nonsense.
An overly long list of links is also very suspicious. Before and/or after it
there is often some accompanying text that has nothing to do with what the
spammers are actually up to and only serves as a deception, so that the
operator of the guest book doesn't realize what is really going on. If you
encounter an overly long list of links, you can be all but certain that you are
dealing with spam.
But even when the links aren't close to each other, a conspicuous pattern can
nonetheless be discerned. All you have to do is count the links and relate that
number to the length of the text: too many links for too little text is another
indication of spam.
You can also deal with such strange links by checking them for peculiarities,
that is, for whether particular terms show up in them. This is easily done with
the aid of regular expressions.
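A minimal sketch of both checks, assuming PHP with PCRE (the variable $message as well as the thresholds and terms are illustrative assumptions, not fixed values):

    // relate the number of links to the length of the plain text
    $linkcount  = preg_match_all('#https?://#i', $message, $dummy);
    $textlength = strlen(strip_tags($message));
    if ($textlength > 0 && $linkcount / $textlength > 0.01) {
        // more than one link per 100 characters of text – very likely spam
    }

    // check whether particular terms show up in a link target
    if (preg_match('#https?://\S*(?:viagra|casino|pharma)#i', $message)) {
        // a known spam term appears in a link – reject the entry
    }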
Nonsensical data
Other conspicuous indicators are nonsensical strings of characters at the
beginning or the end of the title or the message. Here, too, regular
expressions that look for such peculiar strings will help you. Once you find
something, you can reject the message in question, because it is most likely
spam; there is simply no reason for using nonsensical character sequences in a
regular message.
Such sequences are usually meant to throw spam-filtering software off the track
so that it lets the spammy message pass, but that can easily be turned against
the spammers with a regular expression.
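What exactly counts as "nonsensical" can only be decided heuristically; one possible sketch in PHP (the character classes and thresholds are assumptions, and for simplicity the whole field is scanned rather than just its beginning and end):

    // six or more consonants in a row, or a block of 30 or more letters and
    // digits without a break, hardly ever occur in regular text
    $gibberish = '/[b-df-hj-np-tv-z]{6,}|[a-z\d]{30,}/i';
    if (preg_match($gibberish, $title) || preg_match($gibberish, $message)) {
        // most likely a nonsensical filler string – reject the entry
    }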
What you should definitely do is analyze how the text is formatted. Some spam
bots enter multi-line data into fields that by definition only take a single
line. Since there is no reason to do so, this is a strong indication that the
message in question should be rejected. It may also happen that lines in a
multi-line text are excessively long, which on the one hand can severely break
the layout of a page and on the other hand, again, makes no sense.
Line breaks can be easily detected with this regular expression:
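A minimal sketch, assuming PHP with PCRE (the variable name $oneline_data is the one used in the explanation below):

    if (preg_match('/[\n\r]/', $oneline_data)) {
        // a line break inside single-line data – reject the message
    }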
Here $oneline_data denotes a string taken from a single-line input field, and the regular expression [\n\r] looks for a line break (both CR and LF are searched for; each of the two characters scores a hit on its own). If a hit is scored, you can reject the message in question.
Another red flag are texts that make no sense. Valid words are strung together and interspersed with links, but a costly semantic analysis would be required to detect this type of spam: it requires, firstly, considerable knowledge of the language being used and, secondly, a sophisticated algorithm to analyze the text. This is a computationally intensive method whose cost is out of all proportion to its usefulness. Here it is better to forego such an algorithm and instead review suspicious messages by hand and discard them manually if necessary.
Strange links
Do invisible links suddenly show up in your guest book? Does a link appear
completely out of place, or does it break the layout of the page? In that case
something is most likely amiss, because what would be the point of putting such
peculiar links in a guest book?
Invisible links are usually not meant to be followed by a user; they are
instead a means of getting search engines to the targeted page. This can be
achieved in various ways, for example by placing a link on a very narrow
character such as a space and suppressing the underline typical of a link.
Alternatively, a style definition or a class can be assigned that suppresses
the display of the link, or a combination of both: a link placed on a space
could, for example, be given the style definition visibility: hidden; so that
it looks like a normal space and not like a link. Elements with
visibility: hidden; aren't actually shown, but the space for them is
nonetheless reserved, so that a more or less large gap remains.
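Such a hidden link might, for example, look like this (the target address is taken from the example above):

    <a href="http://www.example.com/viagra.html" style="visibility: hidden;"> </a>

The link text is a single space, and the style definition additionally hides the element completely; only a small gap remains in the surrounding text.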
However, search engines nowadays don't like this kind of cloaking and will
penalize sites on which such links show up, or even exclude them from their
index entirely. This limits the effect of this type of spam to a relatively
short time frame, but it is the site abused for spamming that takes the damage,
which alone is reason enough to counter it.
You can detect this type of link this way:
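A sketch, assuming PHP with PCRE; the pattern matches a link whose content consists of nothing but whitespace and, possibly, further tags, in other words a link without any visible text:

    if (preg_match('#<a\b[^>]*>(?:\s|&nbsp;|<[^>]+>)*</a>#i', $message)) {
        // a link without visible text – most likely a hidden spam link
    }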
This construct may appear awkward at first glance, and one could argue that it would suffice to look for nothing but blanks inside the link. That check, however, could be bypassed by adding further elements inside the link which then no longer satisfy the condition that only blanks are present. This construct prevents exactly that.
Another nuisance are links whose text is one long run-on string of characters.
This is annoying in more than one respect: for one thing the text flow is
severely disrupted, e.g. by the link running out of the window, and even where
that is not the case such a link can badly scramble the text and so disrupt
reading fluency.
Here it helps to count the characters between the opening and the closing tag
and insert a line break if necessary, or alternatively to reject the text to be
published altogether. After all, there is no reason to create such unwieldy
links.
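A possible sketch in PHP; the limit of 80 characters is an assumption and should be chosen to suit the layout of the page:

    if (preg_match_all('#<a\b[^>]*>(.*?)</a>#is', $message, $matches)) {
        foreach ($matches[1] as $linktext) {
            // 80 or more characters without a single space cannot be wrapped
            if (preg_match('/\S{80,}/', strip_tags($linktext))) {
                // insert a line break here, or reject the entry altogether
            }
        }
    }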
One could carry this to extremes by marking up large passages or even the
entire text as a link. This practically screams that the author wants visitors
to click on the link at all costs. Since that is not what a guest book entry is
for, it is advisable to reject such a text altogether.
The method of hiding a link within another link, by contrast, is much more
subtle and the most difficult to discover, because there are no obvious
telltale signs such as style definitions that hide the link. A link whose text
consists entirely of blanks is being used here again, but since the surrounding
text is itself a link, it is unlikely to be noticed on closer inspection –
unless someone runs the page through a validator, which will immediately
complain.
With HTML, browsers attempt to display even those pages whose code is invalid
according to the DTD, and it is precisely this laxness that spammers happily
exploit.
XHTML, in contrast, makes this scheme more difficult, but it cannot prevent it
entirely. This, however, is less a problem of the definition of XHTML than a
consequence of limitations of the XML it is based on: direct nesting of links
may be forbidden, but indirect nesting (placing inside a link an element that
may itself contain a link) defeats this protection.
If you need to check for a link within a link, this construct is helpful:
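A sketch, assuming PHP with PCRE; the pattern scores a hit as soon as a second opening <a> tag appears before the first one has been closed:

    if (preg_match('#<a\b[^>]*>(?:(?!</a>).)*<a\b#is', $message)) {
        // a link nested inside another link – reject the entry
    }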
Checked this way, links can no longer be hidden inside other links that surround them – and even if a spammer came up with the idea of opening a link, properly closing it, inserting the empty link and then continuing the original link after it, the aforementioned method for recognizing links whose text consists of blanks will defeat him.
Suspicious words
But even if a document turns out to be syntactically and logically
unsuspicious, the devil can still be in the details. What about a guest book
post that contains merely a single link, but one that could pack a wallop? Once
the other tests have been done, you can scan both the link target and the text
describing it for conspicuous words that frequently appear in spam. You just
have to disassemble the link first:
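A sketch, assuming PHP with PCRE; $link is assumed to hold the complete <a> … </a> element, and after the match $parts[1] contains the link target and $parts[2] the link text:

    if (preg_match('#<a\b[^>]*href\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>#is',
                   $link, $parts)) {
        $target   = $parts[1];   // the address the link points to
        $linktext = $parts[2];   // the text describing the link
    }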
This splits a link into its target and its link text to begin with. Afterwards
both can be scanned for suspicious words. However, this may first require
translating any encoded characters back into cleartext, which is done with a
regular expression that looks for encoded characters and converts them back.
Spammers occasionally encode suspicious words to hide them from content
analysis; by turning the encoded characters back into cleartext you defeat this
trick, and the result can then be scanned for suspicious words as usual.
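One way of doing this, sketched in PHP (the variable names follow the example above; %xx URL encoding and HTML character references are handled separately):

    // translate %xx sequences in the link target back into plain characters
    $target = preg_replace_callback('/%([0-9a-f]{2})/i',
        function ($m) { return chr(hexdec($m[1])); },
        $target);
    // translate numeric and named character references in the link text
    $linktext = html_entity_decode($linktext, ENT_QUOTES);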
The other starting point is the link text, which you also obtain with this technique. Even when the link target doesn't give any leads, the link text can still indicate that something is amiss. Some spammers, however, try to make things difficult once again and insert HTML elements, e.g. <span> … </span>, into the link text to throw off a lexical analysis. You can counter that by simply removing everything from the link text that looks like an HTML tag, so that only the plain text remains to be subjected to textual analysis.
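In PHP this can again be done with a short regular expression (a sketch; $linktext as above):

    // remove everything that looks like an HTML tag, keep only the plain text
    $linktext = preg_replace('/<[^>]+>/', '', $linktext);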
However, you must pay attention to how you proceed with the textual analysis. On the one hand a single occurrence of a suspicious word doesn't necessarily mean that you are dealing with spam, although if such words pile up, that is a fairly sure sign of it. On the other hand you need to make certain that you don't spuriously detect a spammy word where there is nothing to be found. Similar to Mr Turski's example for German (in the paragraph Inhaltsfilter zur Erkennung von Spam, i.e. content filters for detecting spam), take the word pill: if you constructed a regular expression based on this word alone, the comparison would be spuriously triggered by the word overspill, even though that word is absolutely benign. You can avoid this problem by extending the regular expression a little:
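A sketch, assuming PHP; $text stands for the string to be checked:

    if (preg_match('/\bpill/i', $text)) {
        // "pill" at the beginning of a word – treat the entry as suspicious
    }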
The \b in front of the search term makes the expression score a hit only if the term has no prefix. The i behind the expression additionally makes it case insensitive, so that how the word is capitalized doesn't matter and the check cannot be defeated by a jumble of capital and lower-case letters.
The only problem that could come up is that words merely containing the search
term are no longer recognized this way and would have to be checked for
explicitly. In return, this avoids false positives.
Alternatively, you can define exceptions very easily by resorting to an
assertion and rewriting the regular expression accordingly:
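A sketch in PHP; a negative look-behind assertion is one way to express this:

    if (preg_match('/(?<!overs)pill/i', $text)) {
        // "pill", but not as part of "overspill" – treat the entry as suspicious
    }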
This variant finds every occurrence of pill except where it is preceded by overs, which would yield the aforementioned word overspill – and that is exactly what you want to exclude from the search. You can of course extend the assertion to exempt further benign words, e.g. spill, so that they don't produce false positives either. You just add these exceptions as alternatives within the assertion, so that every alternative is checked to rule out a false positive. If none applies, you can deal with the spammer in the usual way.