SPAMOUFLAGE AND SPAMJECTS
so, let’s get the easy question out of the way first – why do so many spam messages have weird-yet-compelling subject lines?
well, i can’t shed any light on why they’re compelling, but i can say what they’re trying to do…they’re trying to defeat best-of-breed, automated, anti-spam filters by disguising themselves as potentially interesting messages. those subject lines are "spamouflage." (i wish i could lay claim to this term – Wired beat me to it.)
this probably comes as no surprise. some of you might even have said, ‘duh’ when you read the explanation above.
the better question is, why do these subject lines confuse spam filters?
it turns out that a lot of the war on spam is fought in statistical trenches, as it were. specifically, many best-of-breed anti-spam filters today use what are called Bayesian statistical filters in order to spot spam and weed it out. i’ll skip the statistics and more in-depth mathematics – you can read Paul Graham’s excellent anti-spam treatise to get more info.
the idea is simple – use the language of spam to identify it. after all, spam is always selling something, and the list of things being sold is pretty short (viagra, porn, low-interest mortgage loans, Nigerian letter scams, etc.). as a result, the language used in spam should be fundamentally different from the language used in everyday mail you receive from friends or co-workers (unless your friends try to sell you things on a regular basis). with this idea in mind, email software that uses Bayesian filtering does something like this:
- Build up a dictionary of words contained in your email (spam and non-spam); next to each word, indicate how likely it is that the associated word is found in a message that you have called spam. (this is done during the ‘training’ period in which your email software is figuring out what your typical messages look like).
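to make the training step a little more concrete, here’s a minimal sketch in Python. it is not how any particular mail client does it. the function names (tokenize, train), the crude tokenizer, and the clamping of scores to the 0.01 to 0.99 range are all my own assumptions, loosely in the spirit of Paul Graham’s write-up.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-like tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9$'!-]+", text.lower())


def train(spam_messages, good_messages):
    """Return {token: rough estimate of how likely a message with it is spam}."""
    # Count how many messages of each kind contain each token at least once.
    spam_counts = Counter(t for m in spam_messages for t in set(tokenize(m)))
    good_counts = Counter(t for m in good_messages for t in set(tokenize(m)))
    n_spam = max(len(spam_messages), 1)
    n_good = max(len(good_messages), 1)

    dictionary = {}
    for token in set(spam_counts) | set(good_counts):
        spam_rate = spam_counts[token] / n_spam
        good_rate = good_counts[token] / n_good
        p = spam_rate / (spam_rate + good_rate)
        # Assumption: clamp so no single token is ever treated as certain.
        dictionary[token] = min(0.99, max(0.01, p))
    return dictionary


# Tiny made-up training set, purely for illustration.
spam_dictionary = train(
    spam_messages=["cheap viagra now", "low-interest mortgage loans now"],
    good_messages=["my blog runs on php", "lunch now?"],
)
```

with this toy training set, a word that only ever shows up in spam lands at the top of the range, a word that only shows up in your own mail lands at the bottom, and a word seen in both ends up somewhere in between.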
Once the dictionary has been built, try to identify spam automatically as follows:
- When a new message arrives, break it up into "words" (tokens, in computer geek lingo).
- Look up each token in your spam-dictionary, and find the likelihood that this word belongs to a spam message. if the word isn’t in the dictionary, assign it an arbitrary, fixed spam probability (usually something around 40%, i.e., 4 times out of 10 this word would appear in spam).
- Use Bayes’ theorem to compute the probability that the entire message is spam, based on an analysis of the most interesting words (i.e., those whose scores sit farthest from the middle: the words most likely to be in spam, and the words least likely to be).
- For messages that cross some threshold of spam probability (e.g., 90% probability the given message is spam), mark them as spam; let everything else pass through.
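the steps above can be sketched in a few lines of Python. treat this as an illustration under stated assumptions, not a real filter: the example dictionary and its scores are made up, the names (spam_probability, is_spam) are mine, the cutoff of 15 "interesting" tokens is an assumption, and the combining formula is the simple naive-Bayes form that treats every word as independent.

```python
import math
import re

UNKNOWN_TOKEN_PROB = 0.4  # fixed score for tokens missing from the dictionary
SPAM_THRESHOLD = 0.9      # messages scoring above this get marked as spam
MAX_INTERESTING = 15      # assumption: how many tokens to keep per message


def tokenize(text):
    """Split text into lowercase word-like tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9$'!-]+", text.lower())


def spam_probability(message, dictionary):
    """Combine per-token scores into one spam probability for the message."""
    probs = [dictionary.get(t, UNKNOWN_TOKEN_PROB) for t in set(tokenize(message))]
    # "Interesting" means far from the neutral 0.5, in either direction.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:MAX_INTERESTING]
    spam_like = math.prod(probs)
    good_like = math.prod(1 - p for p in probs)
    return spam_like / (spam_like + good_like)


def is_spam(message, dictionary):
    return spam_probability(message, dictionary) > SPAM_THRESHOLD


# Hypothetical dictionary, as if produced by an earlier training period.
example_dictionary = {
    "viagra": 0.99, "mortgage": 0.95, "free": 0.9,
    "php": 0.01, "blog": 0.02, "lunch": 0.05,
}

print(is_spam("free viagra and a low mortgage", example_dictionary))
print(is_spam("new php post on my blog", example_dictionary))
```

the first message should be flagged and the second should pass, since each is dominated by words the dictionary has strong opinions about.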
ok, maybe it’s not quite that simple. but it does explain why you get all of these weird words in spam…actually, they’re not weird per se – they’re just uncommon. as a result, they are often not in the dictionary your software uses to compute the probability that a message is spam. this also goes for words that are hacks (like vi^gr^), which will often be missing from your spam dictionary. the software doesn’t know what to do with these words and assigns them a middle-of-the-road spam probability. pile up enough of them and they drag the message’s overall score down, which increases the likelihood that your software will think the message is valid.
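here’s a small Python sketch of why the trick works. everything in it is an assumption for illustration: the two-word dictionary, the 0.4 score for unknown words, and the simple naive-Bayes combining formula (real filters differ in the details).

```python
import math

UNKNOWN_TOKEN_PROB = 0.4  # middle-of-the-road score for unknown tokens


def combined_spam_probability(token_probs):
    """Naive-Bayes combination of per-token spam scores."""
    spam_like = math.prod(token_probs)
    good_like = math.prod(1 - p for p in token_probs)
    return spam_like / (spam_like + good_like)


# Hypothetical dictionary that only knows two classic spam words.
known = {"viagra": 0.99, "cheap": 0.9}


def score(tokens):
    return combined_spam_probability(
        [known.get(t, UNKNOWN_TOKEN_PROB) for t in tokens]
    )


plain = score(["cheap", "viagra"])                            # roughly 0.999
disguised = score(["ch3ap", "vi^gr^", "zephyr", "quixotic"])  # roughly 0.16
# One known spam word padded with ten words the filter has never seen.
padded = score(["viagra"] + [f"oddword{i}" for i in range(10)])  # roughly 0.63

print(plain, disguised, padded)
```

the plain message sails past a 90% threshold, the fully disguised one scores well under half, and even a single giveaway word gets dragged below the threshold once it’s surrounded by enough unknowns.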
so, the mystery of the odd subject lines has been explained. the same answer goes for those large paragraphs of text you sometimes see at the bottom of spam emails – they try to confuse the filter as well (since filters use the subject, headers, and message text to determine if something is spam). spammers actually use a whole host of other tricks (see the Fieldguide to Spam for an interesting list).
NOTE 1: false positives and false negatives
one thing to be aware of as far as Bayesian filters go…they were designed to give as close to NO false positives as possible (i.e., legitimate mail marked as spam). this is essentially a requirement, since no one ever really bothers to look through that spambox containing the 437 spams they got today…they just delete them and hope for the best. false negatives, on the other hand, are tolerable (but kept to a minimum). after all, if five spams get through out of 1000 messages, that’s manageable.
NOTE 2: Bayesian filters evolve and are tailored to you as an individual user.
these two facts are important, even critical, since spam changes over time, and since the average language content of every person’s email is different. for example, i might receive email from friends containing words like PHP and blog, whereas others might not. as the spammers try to adapt their messages to spoof the filters, the filters are constantly evolving themselves to meet the new onslaught. in order for spammers to get their messages through, they have to insert content in their message that is basically indistinguishable from your personal email.
WHY SPAMMERS BOTHER
economics, pure and simple. it costs about $250 to send 1 million spams ($0.00025 per message), and response rates are about 1 in 1000. paper bulk mail costs about $0.25 per message with a 1 in 20 success rate. do the math – per message, spam is 1,000 times cheaper than paper bulk mail, and even with its far lower response rate it works out to about $0.25 per response versus $5 for paper, or 20 times cheaper; it’s all about volume.
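if you’d rather check the arithmetic than take my word for it, here’s a quick Python sketch that works only from the raw figures quoted above ($250 per million spams, 1 in 1000 responses, $0.25 per paper piece, 1 in 20 responses). the variable names are mine, and the numbers are only as good as those rough estimates.

```python
# Raw figures quoted in the text (rough estimates, not measured data).
spam_cost_per_message = 250 / 1_000_000  # dollars
spam_response_rate = 1 / 1000
paper_cost_per_message = 0.25            # dollars
paper_response_rate = 1 / 20

# Cost to get one response = cost per message / fraction that respond.
spam_cost_per_response = spam_cost_per_message / spam_response_rate
paper_cost_per_response = paper_cost_per_message / paper_response_rate

per_message_ratio = paper_cost_per_message / spam_cost_per_message
per_response_ratio = paper_cost_per_response / spam_cost_per_response

print(f"cost per message:  spam ${spam_cost_per_message:.5f}, paper ${paper_cost_per_message:.2f}")
print(f"cost per response: spam ${spam_cost_per_response:.2f}, paper ${paper_cost_per_response:.2f}")
print(f"paper vs. spam: {per_message_ratio:.0f}x per message, {per_response_ratio:.0f}x per response")
```

the per-response figure is the one that matters to the spammer, since it already accounts for how few people bite.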
HOW DEEP THE RABBIT HOLE GOES
once i read about Bayesian filters, my curiosity was piqued. i decided to learn more. spam spam spam and more spam! what i discovered is that the anti-spammers are working at least as hard as the spammers, and that the lines between good and evil are not as clearly drawn as one might think. there’s definitely a war going on, but there are more than two sides, and there’s lots of collateral damage (e.g., people unfairly nailed by so-called blacklists that can block an entire range of IP addresses from getting through corporate mail filters).
i was going to write about all of the different stuff going on, but you’d click away well before i got through the introduction…there’s WAY too much info. hopefully, the links below will lead to some interesting information, should you wish to go further down the rabbit hole.
Wired articles (just a few…there are many more)
- Random Acts of Spamness
- Spam Wars: Filters Strike Back
- When the Spam Hits the Blogs
- Can Spam or New Can of Worms?
Paul Graham’s Writings
- A Plan for Spam (required reading)
- Better Bayesian Filtering
- Filters that Fight Back
- Filters v. Blacklists
CAN SPAM Act
- Markov Chains and Text Generation
- Bayes’ Theorem
- Controllable Regex Mutilator (CRM)
- Pew Charitable Trust: Spam Survey Results
- MAPS and Real-time Black Hole Lists
The post above is not meant as an endorsement of any of the legislation, authors, or organizations listed. i leave it to you, the reader, to make up your own mind regarding the participants and casualties in the spamwars.