Fighting spam: Spam-free guest book
Mode of operation of spam bots
A spam bot usually operates according to a certain pattern. It follows links
from one page to another and looks for possible forms. If it finds one, it
reads it in, analyzes it and saves both the address of the page in question and
the form for future use.
When spam is to be dispersed, the previously saved forms are used to propagate
this digital junk. The bot fills in a validly formatted mail address (one that
is either invalid or belongs to someone other than the spammer), possibly a
home page, a subject and the message itself. Since a program rather than a
human is at work here, this happens in next to no time, with network latency
being the only bottleneck.
Operation honeypot
This very behavior of spam bots can be turned against them. All you have to do is add one or two additional fields to the form that are not meant to accept any data but to mislead the bot. However, they must be named in a way that makes the bot react to them.
If your trap is supposed to target mail addresses, you need to name it
accordingly (e. g. mail, email
etc.) and label it with a matching text to raise the bot's interest in said
field.
The form field would look like this in this case:
Then add the following to your CSS:
It is of utmost importance to hide this field via CSS instead of declaring it
with type="hidden", because bots copy fields hidden
in this fashion 1:1 without filling them in, and your trap would lose its
effect. Hidden via CSS, the input field appears to be a normal
field and is taken into account by the bot. But beware!
Do not define this style information directly in the input field. Instead,
create a class definition in an external CSS file that you include in your
document; afterwards all you have to do is set this class on a container
surrounding the input field in order to make it disappear.
This prevents a bot from potentially associating the suppression with the fake
input field.
The effect of this is twofold: In a normal CSS-capable browser both this input
field and the associated label are hidden, so users never get to see them. In
browsers that cannot interpret CSS (e. g. the text-mode browser Lynx), the
label tells users immediately that they must not enter anything here. Since
the description is directly linked to the input field as a label, any
assistive technology for visually impaired users can
forward this information so that these users know what's going on. This in
turn fulfills the criterion of accessibility.
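As a rough illustration of the markup described above, the following Python sketch builds the decoy field and the matching style rule as strings. The class name `extra`, the field name `email`, the label text and the use of `display: none` are assumptions chosen for this example, not the original listing.

```python
from html import escape

# Hypothetical class name; the rule belongs in the external style sheet,
# never in a style attribute on the field itself.
TRAP_CLASS = "extra"
TRAP_CSS = f".{TRAP_CLASS} {{ display: none; }}"


def render_trap_field(name: str = "email",
                      label: str = "E-mail (please leave this field empty)") -> str:
    """Return a container holding the decoy label and input field."""
    field = escape(name, quote=True)
    return (
        f'<div class="{TRAP_CLASS}">'
        f'<label for="{field}">{escape(label)}</label>'
        # An ordinary text field: type="hidden" would defeat the trap.
        f'<input type="text" name="{field}" id="{field}" value="" />'
        f'</div>'
    )
```

The container carries only the class, so nothing in the field itself hints that it is suppressed.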
In addition to that you have to add a field that actually takes the e-mail
address. You may name it as you please; the name does not have to give any
indication of its use, which becomes particularly interesting with
required fields (which an e-mail address should be). This turns
this field into yet another tripwire for spam bots since they usually leave it
empty if they cannot determine what they are dealing with.
It is furthermore of utmost importance to mark required fields accordingly,
both in the field's label and the input field itself. The former can always be
done, the latter requires a document type that supports
ARIA.
However, you shouldn't make the mistake of marking these fields directly in the
XHTML source code; this is best done with JavaScript. Assistive technologies
nowadays can handle JavaScript, so they recognize a modified label
and an ARIA attribute set on the
input field. Spam bots won't see this modification,
because they aren't equipped with a JavaScript engine: it would make the bot
far too cumbersome.
It is furthermore important that you modify the script that evaluates your form
so that it checks your traps. It is entirely sufficient to see whether
something has been entered into them. Once you find something there you
immediately know that something is fishy, and you can reject the attempt. If
this check is passed, additional tests are still required, because these traps
cannot screen out the rarely occurring manual spam.
A simple check could be implemented like this:
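As a minimal sketch in Python, assuming the submitted form arrives as a plain dictionary of strings: the decoy name `email` and the deliberately meaningless name `q7` for the real address field are hypothetical, as is the function name.

```python
# Hypothetical field names: the decoy must stay empty, the real
# (required) address field must be filled in.
TRAP_FIELDS = ("email",)
REQUIRED_FIELDS = ("q7",)


def passes_traps(form: dict) -> bool:
    """Return False as soon as one of the tripwires has been touched."""
    # Anything entered into a decoy field gives the bot away.
    if any(form.get(name, "").strip() for name in TRAP_FIELDS):
        return False
    # Bots tend to leave fields empty whose purpose they cannot guess.
    if any(not form.get(name, "").strip() for name in REQUIRED_FIELDS):
        return False
    return True
```

A submission such as `{"email": "", "q7": "someone@example.org"}` passes, while one that fills in `email` is rejected before any more expensive test runs.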
You can put traps like these anywhere within your form; however, they should
match the context in which they appear. If you set up a trap for the mail
address, the decoy should be next to the real input field, because fields that
are too far apart could raise suspicion and risk being ignored by spam bots
in the future, causing this trap to lose its effect.
However, the mail address doesn't have to be the only spot that can be
tripwired. Depending on the use of the form other opportunities for tripwires
could present themselves, which are all constructed the same way as for the
mail address.
These simple measures can screen out the majority of spam without having to apply other checks. Whatever remains can be scrutinized further without overly straining the processor.
Sanity checks
Once you have the input, you should by all means apply some sanity checks. On
some occasions single-line fields are filled in with multi-line data – however,
this is not permitted, and you should reject the messages in question.
You should furthermore perform additional sanity checks, e. g. whether the
format of both the mail address and the URL of the home page is valid and
whether the named servers exist. You also have the opportunity to check the
server names against a DNSBL. Some blacklist providers offer lists that contain
domain names linked to spam instead of IP addresses. Messages containing
dubious domains can thus be rejected as well.
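The checks above might be sketched in Python as follows. The function names are hypothetical, the address pattern is deliberately loose rather than a full validator, and the `resolver` parameter exists only so that the lookup can be replaced in tests; the DNSBL query is left out because it depends on the list you choose.

```python
import re
import socket
from urllib.parse import urlsplit

# Loose shape check only: something@host.tld without spaces or a second "@".
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_single_line(value: str) -> bool:
    """Single-line fields must not contain line breaks."""
    return "\n" not in value and "\r" not in value


def looks_like_email(value: str) -> bool:
    return EMAIL_RE.fullmatch(value) is not None


def looks_like_http_url(value: str) -> bool:
    try:
        parts = urlsplit(value)
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:  # raised for malformed URLs
        return False


def host_exists(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the name resolves; needs network access by default."""
    try:
        return bool(resolver(hostname, None))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```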
Imposing time limits
You can also have time work for you. A normal human cannot post a message
within a split second. However, since spammers are impatient folk, intent on
posting as many messages as possible in virtually no time, many of them are
going to read the form and attempt to post a message right afterwards. A time
limit catches this: if the entry is posted too soon, the attempt is blocked
right away. Normally a threshold of ten seconds is feasible.
But the opposite can be seen as well, that is, a form is read exactly once and
used repeatedly to post spam messages. This, however, is rarely done (to be
honest, no attempt like this has been seen here so far), but should such
attempts appear, another time limit can thwart them as well: No user is going
to take more than half an hour to publish a message, and if someone does, it is
most probably an old form that is now being used repeatedly to post spam.
Sites often use a session ID to implement this timer, together with all
the management overhead of adding these session IDs to a database, pruning
expired entries, etc. However, this can be implemented far more easily: All you
have to do is add a data field to your form that is declared with
type="hidden", so that it is not shown to the
user although it is present. This field holds the time at which the form was
created. Since this field is explicitly marked as hidden, a spam bot will leave
it untouched. It doesn't matter at all what it is named. You can even
confound the bot by adding an offset to the time stamp that is known only to
you, which makes any manipulation much easier to detect. It is
furthermore not necessary to encrypt this value. Firstly this saves computing
time, and secondly encryption would prove counterproductive when you intend to
use this time stamp to lead the spam bots down the garden path.
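A minimal sketch of such a time stamp check in Python could look like this. The offset value and the function names are made up for illustration; the limits of ten seconds and half an hour are the ones named above.

```python
import time

SECRET_OFFSET = 1_234_567  # hypothetical offset, known only to the server
MIN_SECONDS = 10           # anything faster was not typed by a human
MAX_SECONDS = 30 * 60      # anything older is most likely a stored form


def make_timestamp_field(now=None) -> str:
    """Value for the type="hidden" field when the form is generated."""
    now = time.time() if now is None else now
    return str(int(now) + SECRET_OFFSET)


def timestamp_is_plausible(field_value, now=None) -> bool:
    """Check the submitted value against both time limits."""
    now = time.time() if now is None else now
    try:
        created = int(field_value) - SECRET_OFFSET
    except (TypeError, ValueError):
        return False  # missing or manipulated value
    age = now - created
    return MIN_SECONDS <= age <= MAX_SECONDS
```

Because the offset is simply added, a bot that substitutes the real current time produces a value that looks far too old and fails the check.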
When you impose a time limit, by all means notify your users that filling out the form automatically can cause problems. You may additionally indicate whether the form is ready to be used or not, which is best implemented in JavaScript.
Checkbox acting as a doorknob
Of course XHTML has additional elements at its disposal that enable you to
integrate additional nasty tricks. You can, for example, add a checkbox (e. g.
at the lower end of the form). Since bots tend to activate such checkboxes, you
may turn this behavior against them as well. Just proceed in the same fashion
as with the fake input field, that is, create the checkbox, label it in a way
that someone seeing it doesn't accidentally click on it, and hide the entire
construct via CSS, as you did with the fake input field.
All that remains is incorporating a check for the checkbox into your script so
that any attempt to post something with the checkbox checked can be
instantaneously rejected.
If you want to be particularly nasty you can even turn this definition upside
down and declare the checkbox a required field which you don't add to the
XHTML source but insert by means of JavaScript, also using JavaScript to
set a label explaining what is expected of the user. This way spam bots don't
see the field and promptly get caught in this trap: If this field isn't
submitted, the checkbox wasn't checked, either because someone simply
forgot about it or because a spam bot couldn't see it.
However, it is imperative in this case that the user
be able to see the checkbox. The only flaw in this: Browsers that don't support
JavaScript find themselves locked out.
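Both checkbox variants can be sketched in Python with the same dictionary-style form data. The sketch relies on the fact that browsers submit a checkbox only when it is checked; the names `newsletter` and `confirm` and the function name are hypothetical.

```python
DECOY_CHECKBOX = "newsletter"  # hidden via CSS, must never be checked
REQUIRED_CHECKBOX = "confirm"  # inserted by JavaScript, must be checked


def checkbox_checks_pass(form: dict, use_required_checkbox: bool = False) -> bool:
    """An unchecked checkbox is simply absent from the submitted data."""
    if DECOY_CHECKBOX in form:
        return False  # only a bot would tick the hidden box
    if use_required_checkbox and REQUIRED_CHECKBOX not in form:
        return False  # forgotten by the user, or never seen by a bot
    return True
```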
Run into the tar pit!
However, you aren't done yet with your spam traps. If you really want to
get on a spammer's nerves, you can respond to illegal input by
delaying the output of the XHTML document. Once you discover the problem, you
can impose a delay before sending back any data to the caller, and even when
you start to send the data you have the option to output it in dribs and drabs
in order to waste as much time as possible. Spammers absolutely hate any loss
of time, and even though you have already been pestered, you can stop a bot
from accessing other servers this way for as long as it is busy with you.
However, you have to pull something from your bag of tricks for this.
The following code takes care of outputting in dribs and drabs:
It is important that you send the header data to the browser immediately
to prevent it from signaling a timeout! Transmission of the document proper can
then be slowed down arbitrarily.
If you are interested, feel free to
try such a tar pit out.
But beware: the loading time tends to be very long!
There are in principle two variants: You can either assemble the XHTML code in
a variable and then output these data with a delay, or you simply initiate a
secondary process within your script that serves as intermediary between the
XHTML output and your web server. Here the second variant is described since it
requires minimal effort if you want to adapt an already existing script. All
you need to do is insert a file handle into the print
statements instead of rewriting them altogether. You additionally insert a
fork statement and check for the result since both
processes take on different roles.
When your spam traps have been triggered, you first open a pipe for
communication between the two processes that are responsible for the output.
Then you branch off another process within your script by means of
fork.
Now you have duplicated the script, so to speak, and depending on which side of
the call to fork you are returning on, different code
sequences must be executed if you don't want to do the same work twice. The
return value of the function tells you which side you are on: Zero means that
you are in the child process, and a nonzero value denotes the parent process.
This can be checked with an if statement.
In the parent process the first thing to do is to close the read handle
since you only want to output data here, and once that is done, you can write
your XHTML to the pipe. Once you are done with that, simply close the handle
you have written to and instruct the parent process to wait for its child
process.
In the child process, meanwhile, you close the write handle and put the child
to sleep for half a minute, thereby defining the initial delay for the caller.
You can safely assume that you are dealing with a spam bot when your traps have
been triggered, so you can very well waste its time.
Once this initial delay has passed, you start to output the XHTML code
character by character with a delay of one second between individual
characters. The entire transmission thus takes one second per
character, and if the document is sufficiently long, that adds up to anywhere
from several minutes to a few hours: the longer the document, the more time you
are stealing from the spammers.
At the end of the child process close the read handle and be sure to terminate
the child with exit(0), since it
would otherwise go on to execute the same code as the parent process outside of
the block of the child process.
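The steps above can be sketched in Python, assuming a POSIX system on which `os.fork` and `os.pipe` are available. The function name and its parameters are hypothetical; the default delays are the half minute and the one second named above, and the headers are assumed to have been sent before the function is called.

```python
import os
import time


def tarpit_output(document: bytes, out_fd: int,
                  initial_delay: float = 30.0, char_delay: float = 1.0) -> None:
    """Pass `document` through a forked child that emits it byte by byte."""
    read_fd, write_fd = os.pipe()  # parent writes, child reads
    pid = os.fork()
    if pid == 0:
        # Child process: reads from the pipe and trickles the data out.
        try:
            os.close(write_fd)
            time.sleep(initial_delay)  # initial delay for the caller
            while True:
                byte = os.read(read_fd, 1)
                if not byte:  # end of file: the parent closed its end
                    break
                os.write(out_fd, byte)
                time.sleep(char_delay)
            os.close(read_fd)
        finally:
            # Leave immediately so the child never runs the parent's code.
            os._exit(0)
    # Parent process: only writes to the pipe, then waits for the child.
    os.close(read_fd)
    with os.fdopen(write_fd, "wb") as pipe_out:
        pipe_out.write(document)
    os.waitpid(pid, 0)
```

In a CGI-style script `out_fd` would be the descriptor of standard output. Any buffered output, including the headers, should be flushed before the call, because otherwise the child inherits a copy of the unwritten buffer. The child uses `os._exit` rather than a normal exit so that it skips the interpreter's cleanup handlers.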
Conclusion
Sometimes the simple methods prove to be extremely effective, especially when
done right. In contrast to a captcha they can
easily be added to your XHTML code and seamlessly integrate into it. Any
evaluation happens by means of simple checks within your script that evaluates
your form anyway, entirely without user or session IDs, and on
top of that you can handle everything on your server without having to involve
third parties. And since everything is available in plaintext, it poses no
problems for visually impaired users, because all reading
aids can analyze the document and inform their users of what they are dealing
with.
In contrast to that, a captcha firstly cannot be integrated easily, because
something has to be downloaded from another server, and if you don't want to
implement this yourself, you have to
resort to the services of third parties. However, there are multiple
communication paths that have to be taken into account: You have to
download the captcha once from the server of the provider to display it.
Since an image is included, visually impaired users are faced with a problem:
How on earth are they supposed to recognize what is being displayed? Even folks
with good sight have severe problems deciphering these images, so how are folks
with bad eyesight supposed to deal with this?
One also needs a communication path from the server providing the form to the
server from which the captcha originated to determine whether the input was
correct in the first place. This, however, requires a sophisticated data
exchange, because the captcha has to transmit whatever has been entered in
its input field to the server from which it originated. The script evaluating
the form then has to contact the server providing the captcha to find out
whether the input was good. That means that several data transmissions are
going back and forth here.
These spam traps prove that more complex doesn't automatically mean safer. Quite the opposite: The more complex a mechanism, the more easily it can usually be manipulated. In contrast to that you can implement these methods rather easily, and the entire evaluation takes place on your server. This way you don't have to send any information to third parties, and everything you acquire stays on your server. These security measures therefore cannot be tampered with from the outside.
Bonus: Captcha with inverse logic
You have read that right: When done right you can very well use captchas to
keep bots out without getting on your users' nerves.
If you are now wondering: “He's got to be completely off
his rocker! First he's telling us that captchas should be avoided, but now he
wants to make us believe that it still works?”, you can rest assured:
Captchas are still not necessary for setting up
effective spam traps, but with a little ploy you can fool a spam bot so
thoroughly that its programmer absolutely doesn't know any more what's going
on.
All you need to do is include a captcha (self-made or from third parties, it
doesn't matter) that can be cracked by OCR software with a little effort,
insert the necessary input fields (it's supposed to appear like a real captcha
after all), label it appropriately so that nobody triggers the trap if he gets
to see it and hide everything with a CSS definition, just like the fake input
field.
If the captcha field is empty, you either aren't dealing with a bot, or the bot
has left it empty for whatever reason. However, if you find something here, all
you need to do is treat it like the fake input field mentioned above and reject
the transmitted message. And if you really feel like it, you can further
penalize the bot for solving the captcha correctly by enforcing an extended
delay. You can then also resort to some
gaslighting
and claim that the captcha has been solved incorrectly although the solution
was correct, to cause additional confusion.
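As a small sketch of this inverse logic in Python: the function name, the two delay values and the wording of the message are assumptions made for this example, and how the delay is actually enforced is left to the rest of the script.

```python
def handle_decoy_captcha(answer: str, solution: str,
                         base_delay: int = 30, penalty_delay: int = 120):
    """Return (accept, delay_in_seconds, message) for the hidden captcha."""
    answer = answer.strip()
    if not answer:
        # Nobody filled in the hidden captcha: carry on with the other checks.
        return True, 0, ""
    message = "The captcha was solved incorrectly."
    if answer == solution:
        # A correct solution means OCR was used: extend the delay and
        # claim failure anyway.
        return False, penalty_delay, message
    return False, base_delay, message
```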