Web Analytics.
As a legal marketing agency in Calgary, AB, we often get asked by our clients about the accuracy of traffic analytics. While this article was written back in 2008, it is still valid today.
It's one thing to collect website metrics, but making heads or tails of it all is another matter altogether. There are a few web stat programs out there that decipher server logfiles in a somewhat coherent fashion, but they all output metrics in a different way. Understanding the differences can be very useful - notably when your boss is asking you for specifics on the company traffic report. How do you interpret the information, and what do all the discrepancies mean? How do these programs read logfiles? There are some great free, open source server tools out there to help us wade through the data and extract measurable results. I'll go over what I know about how logfiles are interpreted, and I'll briefly cover 3 popular logfile statistic programs - Webalizer, AWStats and Analog.
How Server Logs are Collected - Common Log Format
Web servers keep a log of what they are doing, and they usually do it by logging events in plain text files. Each time someone or something asks for a web page - or any component on that page, like a graphic - the server writes another line in the logfile to represent that request. Errors and unsuccessful requests are also logged, so the details pile up. Raw logs are ugly and a little tricky for humans to read. Thankfully, logfile analysis programs such as Webalizer, AWStats and Analog (among many others) are around to help us interpret the results.
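To give a sense of what the server actually records, here is a minimal sketch in Python - the log line is an invented example, and the regular expression only covers the basic Common Log Format fields:

```python
import re

# A made-up example of a Common Log Format (CLF) entry:
# host ident authuser [date] "request" status bytes
sample_line = '203.0.113.7 - - [12/Mar/2008:10:15:32 -0700] "GET /index.html HTTP/1.1" 200 5120'

# One group per CLF field; the quoted request is split into method, path, protocol.
clf_pattern = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

match = clf_pattern.match(sample_line)
if match:
    print(match.group('host'))    # 203.0.113.7
    print(match.group('path'))    # /index.html
    print(match.group('status'))  # 200
```

Every graphic, stylesheet and error on a page produces another line just like this one, which is why raw logs grow so quickly.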
Limitations
Servers are limited by the fact that they cannot distinguish between a human being and a robot (search spider, email harvester, or other). The server only sees remote IPs making individual requests - a remote IP address connects, sends a request, receives a response and then it disconnects. The server records this instance in the logfile. The thing to understand here is that the source of each individual request looks the same to the server. It won't tell a human from a robot. The statistical results are therefore open to interpretation.
Another limitation is that unique IPs don't always represent a unique individual (or entity) looking at your website. An IP address can represent an array of possibilities: a robot scanning your site, a single human, or an entire group of people behind a single IP. The opposite can also be true, where an individual might revisit your website later in the day, but the Internet Service Provider has assigned this person a 'new' IP... now the server logs 2 IPs, so the logfile analyzer registers 2 unique visitors when there was only one. Ultimately, 'unique IP' metrics are really only 'best guesses', where a 'visit' is based on the assumption that a single IP address represents a single user.
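As a rough illustration of how naive that assumption is, here is a minimal sketch of the 'unique visitors = unique IPs' count that basic reports rely on - the sample entries are invented, and in practice the lines would come from the server's logfile:

```python
from collections import Counter

# Naive 'unique visitors' count: one IP address == one visitor.
# This is exactly the assumption that breaks down with proxies,
# shared offices and ISPs that hand out a new address mid-day.
log_lines = [
    '203.0.113.7 - - [12/Mar/2008:10:15:32 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.7 - - [12/Mar/2008:10:15:33 -0700] "GET /logo.gif HTTP/1.1" 200 2048',
    '198.51.100.9 - - [12/Mar/2008:14:02:10 -0700] "GET /index.html HTTP/1.1" 200 5120',
]

hits_per_ip = Counter(line.split()[0] for line in log_lines)

print("unique IPs ('visitors'):", len(hits_per_ip))
print(hits_per_ip.most_common())
```

If the second entry in that list were the same person returning on a new ISP-assigned address, the report would still show two visitors.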
Lastly, web browser caching also presents a problem for logfile analysis. If a person revisits a page, the second request will often be retrieved from the browser's cache, and so no request will be received by the web server. Therefore, that visitor's path through the site is lost. Web servers can be configured to stop this but many are not.
Alternative formats used to improve assumptions
There are non-standard log formats that logfile analyzers can work with. One popular option is the 'combined' log format, where the basics of the CLF are used in conjunction with 'User-Agent' and 'Referrer' fields. A 'User-Agent' could be a browser like Firefox, Internet Explorer, Safari, etc... The 'Referrer' represents the web page that directed the user to your website - it could be a Google search results page, a blog or another website. Unfortunately, these extras can also be misleading, since both User-Agent and Referrer values can be modified and/or spoofed, resulting in erroneous logfile stat reports. But in general, it cleans up a little bit of the mess.
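To show what the extra fields look like, here is a minimal sketch of pulling the Referrer and User-Agent out of a combined-format entry - the log line itself is invented for illustration:

```python
import re

# Invented example of a 'combined' log format entry: the CLF fields
# followed by the quoted Referrer and User-Agent values.
sample_line = (
    '203.0.113.7 - - [12/Mar/2008:10:15:32 -0700] '
    '"GET /services.html HTTP/1.1" 200 7340 '
    '"http://www.google.com/search?q=calgary+web+design" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0"'
)

combined_pattern = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = combined_pattern.match(sample_line)
if match:
    print(match.group('referrer'))    # which page sent the visitor
    print(match.group('user_agent'))  # which browser (or robot) they claim to be
```

Remember that both of those quoted strings are supplied by the client, which is why spoofed values can slip straight into your reports.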
For more info on how Apache logs in either the Common Log Format or the Combined Log Format, check out the Apache documentation on log files.
Popular open source logfile analyzers
Webalizer is a free web server logfile analysis program, distributed under the GNU General Public License. It comes standard with most Linux (or Apache) based hosting accounts and basically provides a detailed report in HTML format. It was designed to be run from the command line or as a cron job. Most web hosting companies assume you'll just be looking at the HTML report via a provided URL, so they often default to running daily cron jobs. Webalizer supports the CLF and Combined log formats. Incidentally, the last Webalizer release seems to have been back in 2013, so it appears that not much has been done on Webalizer since then.

AWStats is also a free web server logfile analysis program, distributed under the GNU General Public License. AWStats is actually a Perl script (awstats.pl), which parses your server's logfiles and generates reports. AWStats also supports the CLF and Combined log formats (and more). The development of this program seems to be alive and kicking, with the latest AWStats 6.8 release in November 2007.
AWStats compiles statistics for unique visitors by looking at 'pages' (not IPs). This is significant in that many visitors are behind a proxy server when they surf (ie: AOL users). When an AOL user hits your site, it's possible that several hosts (several IP addresses) are used to reach your web site (ie: one proxy server to download the web 'page' and 2 other servers to download all the images) - logging 3 unique visitors (IP based) when really only 1 visitor explored your site. So AWStats considers only downloaded web pages to count unique visitors. This decreases the margin of error... but a level of error still exists, since some websites use 'frames', which are in effect a combination of pages.
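A minimal sketch of the idea, counting only page requests rather than every hit - the extension list and sample entries are assumptions for illustration, and AWStats' own logic is considerably more involved:

```python
# Count visits per page view rather than per raw request, so that
# images, stylesheets and other page components don't inflate the numbers.
NON_PAGE_EXTENSIONS = ('.gif', '.jpg', '.jpeg', '.png', '.css', '.js', '.ico')

log_lines = [
    '203.0.113.7 - - [12/Mar/2008:10:15:32 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.8 - - [12/Mar/2008:10:15:33 -0700] "GET /logo.gif HTTP/1.1" 200 2048',
    '203.0.113.9 - - [12/Mar/2008:10:15:33 -0700] "GET /photo.jpg HTTP/1.1" 200 40960',
]

page_requests = []
for line in log_lines:
    path = line.split('"')[1].split()[1]          # the requested URL
    if not path.lower().endswith(NON_PAGE_EXTENSIONS):
        page_requests.append(line.split()[0])     # keep the requesting IP

# Three proxy IPs made requests, but only one actual page was viewed.
print("raw unique IPs:", len({line.split()[0] for line in log_lines}))  # 3
print("page-based visits:", len(page_requests))                         # 1
```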
AWStats also keeps a list of known search engine spiders and uses it to separate crawler activity from human traffic (it also reports hits on your robots.txt file, if you have one in place). Note however that many crawlers ignore robots.txt, present themselves as ordinary browsers and go unnoticed - they are then counted as human visitors... more error.
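The underlying idea is simple matching on the User-Agent field. Here is a minimal sketch - the spider list and the requests are invented for illustration, and the real robots databases are far larger:

```python
# Very small, invented list of User-Agent substrings for known crawlers.
KNOWN_SPIDERS = ('googlebot', 'slurp', 'msnbot', 'crawler', 'spider')

requests = [
    ('203.0.113.7', 'Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0'),
    ('198.51.100.9', 'Googlebot/2.1 (+http://www.google.com/bot.html)'),
    ('192.0.2.44', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'),
]

humans, robots = [], []
for ip, user_agent in requests:
    if any(name in user_agent.lower() for name in KNOWN_SPIDERS):
        robots.append(ip)
    else:
        humans.append(ip)   # includes any robot that lied about its User-Agent

print("human visits:", len(humans))  # 2
print("robot visits:", len(robots))  # 1
```

A crawler that sends a browser-like User-Agent sails straight through the filter, which is exactly the source of error described above.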
Analog is considered by some to be the best web server logfile analysis tool around, although it has not undergone further development since 2004. It may therefore be more relevant to die-hards inclined to configure it to their liking. Analog tries to output almost every metric possible, but it can be a bit overwhelming, as it is all (by default) on a single page.
Conclusion
Logfile analysis is open to interpretation - it can be used to look for trends, and it is useful for extracting which pages are being viewed. Filtered logfiles are a powerful metric, but they must be interpreted with care. This introduces a few questions: how then can we better interpret our findings? How can we minimize the margin for error? What are the alternatives to logfile analysis? One solution, using 'cookies' and JavaScript, is Google Analytics.
Written in 2008 by Dave Taillefer of ICONA Calgary web design and development