Excluding Robots, Bots, Spiders & Crawlers from Logfile data with Urchin 6
“60,000 out of 73,000 file downloads were from bots”
Web analytics based on web server log file data using Urchin 6 web analytics software provides some very useful points of information that Google Analytics and other tag-based web analytics solutions simply can’t deliver. However, one of the biggest challenges with logfile-based analytics is pollution of the data by “bots” – search engine robots, crawlers, spiders, scrapers, and the like. I’ve found that data generated by non-human activity can easily account for 60% of the hits in your logfiles, and if you don’t exclude it, the reports built on your server logs can be off by that margin. Typical installations of popular logfile analysis tools like Webtrends, AWStats, Webalizer, and even most Urchin installations won’t exclude robotically-generated data by default.
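To get a feel for how big the bot share is before any filtering, you can scan the raw logfile yourself. Here’s a minimal sketch in Python, assuming an Apache “combined” log format (user-agent as the last quoted field) and a small, purely illustrative set of bot signatures – a real robots list would be much longer:

```python
import re

# Illustrative (not exhaustive) bot signatures found in user-agent strings.
BOT_PATTERN = re.compile(r'(bot|crawler|spider|slurp)', re.IGNORECASE)

# In Apache "combined" format, the user-agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def bot_share(lines):
    """Count how many log lines carry a bot-looking user-agent."""
    total = bots = 0
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip malformed lines
        total += 1
        if BOT_PATTERN.search(match.group(1)):
            bots += 1
    return bots, total

sample = [
    '1.2.3.4 - - [10/Oct/2010:13:55:36 -0700] "GET /file.pdf HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '5.6.7.8 - - [10/Oct/2010:13:55:40 -0700] "GET /file.pdf HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]
bots, total = bot_share(sample)
print(f"{bots} of {total} hits were from bots")
```

Running something like this over the full logfile is how you’d arrive at a figure like “60,000 of 73,000 hits” before ever loading the data into an analytics tool.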
Old vs. New Filtering Options
I was working on a project for a client recently using Urchin 6 in our cloud-based hosting environment and needed to process some old logfile data to cross-analyze and validate Google Analytics data. Out of 73,000 hits for download files in the log data, 60,000 were from bots. That’s a problem! So, I thought, “I have to exclude all those bots.” In versions of Urchin prior to 6.6 there was always a “robots report,” but no easy way to exclude robots. Well, I took a look at the filtering options in our hosted version of Urchin 6.6.02 and found a convenient filtering field for “robot_agent”. This field contains the user-agent for hits that were generated by a bot. Nice!
Creating the anti-bot filter
So, I created a simple filter: exclude all hits where “robot_agent” matches “.*” (i.e. any value). After applying the filter and re-processing the data (yep – re-processing, you can’t do that with Google Analytics! That’s one reason I love backing up my Google Analytics data to our analytics data warehouse), the reports were now completely free of bot-generated data.
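If your tool doesn’t have a robot_agent field to filter on, you can get the same effect by stripping bot lines out of the logfile before processing it. This is a hedged sketch, not Urchin’s actual mechanism – the signature list is illustrative, and `strip_bots` is a hypothetical helper name:

```python
import re

# Illustrative bot signatures; a production robots list would be far longer.
# Matches the user-agent (last quoted field in Apache "combined" format).
BOT_UA = re.compile(r'"[^"]*(bot|crawler|spider|slurp)[^"]*"\s*$', re.IGNORECASE)

def strip_bots(lines):
    """Return only the log lines whose user-agent does NOT look like a bot."""
    return [line for line in lines if not BOT_UA.search(line)]

sample = [
    '1.2.3.4 - - [10/Oct/2010:13:55:36 -0700] "GET /file.pdf HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '5.6.7.8 - - [10/Oct/2010:13:55:40 -0700] "GET /file.pdf HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]
clean = strip_bots(sample)
print(f"kept {len(clean)} of {len(sample)} lines")
```

You’d run the cleaned output through your analyzer of choice; the advantage of doing it inside Urchin, as above, is that the raw logs stay untouched and you can always re-process with a different filter.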
Nice, clean, pristine logfile data without bot pollution!
Next… analyzing and making sense of all the data.