How Publishers Can Protect Themselves from Bots That Steal Content

September 22, 2015

The following opinion piece is a guest post and does not necessarily reflect the opinions of this publication.

Before the digital age, content theft was a labor-intensive business. Then computers enabled theft by cut-and-paste. Today, the content theft business is almost entirely automated – and probably happening on your website right now.

Welcome to the age of theft by bot.

What is a bot?

Bots are small, automated programs that come in “good” and “bad” forms. Good bots are deployed by web indexers like Google and Bing, and are an essential component of online business. Bad bots, on the other hand, are designed to seek out and steal specific content by “scraping” it from sites and transmitting it to a third party. Why spend time and money creating your own content when you can steal someone else’s for free?

Content theft is just the beginning

Content that’s stolen and repackaged for sale by a third party hurts branding, SEO results (your content is no longer seen as unique by search engines) and customer trust. One company that can’t afford to have its brand damaged by bad bots is Whitepages.com, where Andy Hibbler, vice president of brand advertising and partnerships, promises clients that the company will “serve honest impressions.”

But the problems don’t stop at theft.

While they’re scraping content, bots overload servers, slow or crash sites, and generally impede business – which is how publishers discover they’re under attack.

“We would get 200 simultaneous requests for long-tail pages that rarely got much traffic,” said Kurt Freytag, head of product at CrunchBase, the leading platform that brings together data on companies and people behind them. “It was totally unlike user traffic.”

Similar impacts were being felt at GuideStar, the world’s largest source of information on nonprofits, where data architect Shane Ward was “getting lots of alarms and alerts, and the site was going down.”
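The pattern Freytag describes, a sudden burst of requests for pages that normally see little traffic, can be spotted with a simple sliding-window counter. The following is a minimal sketch, not how CrunchBase or GuideStar detected their attacks. The class name, the one-second window and the threshold of 200 are assumptions chosen for illustration, and timestamps are assumed to arrive in order.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flag paths that receive many requests inside a short time window.

    Hypothetical helper for illustration; the window and threshold
    are assumptions and would need tuning against real traffic.
    """

    def __init__(self, window_seconds=1.0, threshold=200):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._hits = defaultdict(deque)  # path -> timestamps still in window

    def record(self, path, timestamp):
        """Record one request; return True if the path is now in a burst."""
        hits = self._hits[path]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) >= self.threshold


# Usage: 200 requests for one long-tail page within a fifth of a second.
detector = BurstDetector(window_seconds=1.0, threshold=200)
flags = [detector.record("/organization/obscure-co", 100.0 + i * 0.001)
         for i in range(200)]
print(flags[-1])  # the 200th request trips the detector
```

A counter like this only raises an alarm once the burst is under way, which matches the publishers' experience of discovering the problem through alerts.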

Publishers relying on advertising for revenue must deliver accurate traffic reports to their customers.

“One of the key ways we show value to our customers is by demonstrating how much activity we’re driving to their open jobs and profiles,” said Ryan Aylward, CTO at Glassdoor, a web platform that brings together employers and potential recruits. Redirected ad responses, artificial click inflation and other forms of digital ad fraud add up to a multibillion-dollar business for bot operators.

Digital publishers present a big target

In digital publishing, the site content is the business, and it has to be monetized. Our 2015 Bad Bot Landscape Report found that bad bots represent more than 32 percent of digital publishers’ site traffic, compared with 18 percent for good bots. The other 50 percent of traffic remains gratifyingly human, and that’s an important factor when considering how to block bad bot activity.

Blocking bots without impacting users – or ad deliveries

IP address blocking is usually the first reaction to uncovering a bot problem. But, as Freytag at CrunchBase quickly discovered, that “takes a great deal of effort and playing whack-a-mole with IP addresses.” Bot deployers cycle through blocks of IP addresses, which may well include legitimate users, so blocking those would only exacerbate the problem.
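The whack-a-mole problem can be illustrated with a short sketch. It assumes IPv4 traffic, and the function names, the request limit and the addresses (taken from a range reserved for documentation) are all hypothetical. A bot that rotates through 50 addresses stays under a per-address limit, while blocking the whole address block also shuts out a legitimate reader who shares it.

```python
from collections import Counter
import ipaddress


def ips_over_limit(request_ips, limit):
    """Return individual addresses that exceed the request limit."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > limit}


def networks_over_limit(request_ips, limit, prefix=24):
    """Return address blocks (IPv4 /24 by default) that exceed the limit."""
    counts = Counter(
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for ip in request_ips
    )
    return {str(net) for net, n in counts.items() if n > limit}


# Hypothetical traffic: a bot rotating through 50 addresses, 10 requests
# each, plus one human reader in the same /24 block.
bot = [f"203.0.113.{i}" for i in range(1, 51) for _ in range(10)]
human = ["203.0.113.200"] * 3
traffic = bot + human

print(ips_over_limit(traffic, limit=100))       # no single address stands out
print(networks_over_limit(traffic, limit=100))  # the shared block does
```

Blocking the flagged block would stop the bot's 500 requests, but it would also block the human at 203.0.113.200, which is the trade-off described above.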

Glassdoor’s Aylward also tried the home-grown approach, but “we were only stopping the bad bot activity after it happened.” Over at GuideStar, Ward had a security appliance in place, but that was only inspecting packets, so he still found himself “responding to alerts in the middle of the night” — not to mention the 20 hours a month his team spent “chasing ghosts.”

More than 40 percent of today’s bots closely mimic human behavior, so manual review of web traffic or a firewall-based approach is a non-starter. That’s why companies must focus their resources and attention on separating good bots from bad bots, and bad bots from humans.
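One common way to separate good bots from impostors, not described in this article, is to check a visitor that claims to be a search crawler with a reverse DNS lookup followed by a forward lookup. The sketch below assumes IPv4 addresses. The hostname suffixes are an assumption to confirm against each search engine's current documentation, and the function name and the injectable lookup parameters are hypothetical, added so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for Google and Bing crawlers; verify these
# against the search engines' own documentation before relying on them.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def _reverse(ip):
    """Return the primary hostname registered for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    """Return the list of IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(ip, reverse_lookup=_reverse, forward_lookup=_forward):
    """Return True only if reverse and forward DNS agree on a trusted host."""
    try:
        hostname = reverse_lookup(ip)
        if not hostname.lower().endswith(TRUSTED_SUFFIXES):
            return False
        # The forward lookup must map back to the same address; otherwise
        # anyone controlling their own reverse DNS could claim the name.
        return ip in forward_lookup(hostname)
    except OSError:
        # Lookup failures are treated as "not verified".
        return False
```

A check like this only confirms well-known crawlers. It says nothing about whether the remaining traffic is a bad bot or a human, which still requires the behavioral filtering discussed in this section.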

Each of the publishers mentioned above knew what their traffic ought to look like, and that’s the first step in creating a solution that filters traffic to meet individual business needs. Once the non-human traffic has been isolated, each business can determine which validation tools are appropriate.

To resolve the bad bot problem, some companies have developed in-house tools to identify and block bad bots, but they soon find that malicious bots keep evolving and that keeping up requires dedicated resources and a full-time detection effort. Fortunately, companies and IT professionals can now deploy specific, proactive bot defenses that continuously fend off attacks, freeing staff to devote their time and resources to other business needs.

Rami Essaid is CEO and cofounder of Distil Networks, a bot detection and mitigation company.
