WebDevelopersJournal.com: Tips on Web Page Design, HTML and Graphics

Analyzing Log Files

How to set up a comprehensive statistics monitoring service.

by Cliff Wootton

Who's coming to your site? What are they doing there? Where are they coming from? What are the main things log file analysis can tell you? What are the best programs? What features do you need to look for? What things do you need to consider (platforms, etc.)?
November 8, 1998

Bruce asked me to put down some ideas about log analysis. Apart from this being a book length subject, it's an area that is fraught with many gotchas and traps for the unwary. Believe me, I know. I've been plenty unwary in the past and gotten bitten in the soft fleshy places by the log analysis dragon a couple of times before now. You just need to be aware of some of the pitfalls is all.

Here I'll tell you what sort of analysis you can do on the basic logging streams that you get out of a Web server and how to clean the log files up so that when you process them with an analyser like WebTrends or Analog you get more meaningful results.

In this article, I'm going to try and gloss over some of the specifics of setting this up. If necessary, we'll do another article later on and revisit this at a deeper level and work through some practical examples then. It's a lengthy subject and this piece is going to turn out long enough as it is.

OK, so you just got the job as webmaster for your sales department's Web site. They want you to put up a whole load of material, and on top of that you are taking on an existing raft of software including an ad server. The sales manager stops by and says that they are going to need to gather some stats and present them at the monthly board meeting, and you have to prepare them in time on a regular basis.

Creating the content is a breeze because you've done a lot of that before. You've found the configuration file for the Apache web server but it looks a bit geeky to you. There are all sorts of helpful comments, but where are you going to get the statistical information from? Is there somewhere your Apache server is saving it so all you gotta do is mail it to the boss?

Well sorry, but it's probably not that easy. It ain't really that hard either, so long as you have a little time and patience to set it up right.

Right now I'm going to take it for granted that you already know some stuff about the Web and maybe just a little more than being able to knit a piece of HTML. Like for instance you know what an IP address is and what a port and socket are. You don't? Well maybe we need to have another article about that sometime. For now, assume an IP address is like a postal address for your office block and a port is like an in-tray on your desk as opposed to the in-tray on the desk next to yours. You expect your messages to arrive in your in-tray. If they go in someone else's, you won't see them even though they arrived at the right postal address.

Your web server is going to listen on port 80 at one or several IP addresses for requests arriving from browsers. It's going to process those requests and send out responses. This is what we call the request-response loop and there are opportunities to inject content into various places on that loop. It's good to find out as much as you can about what goes on during that loop and to record it so you can pick it apart later on. Trying to reconstruct what really happened is like archeology. The more evidence there is, the more likely you are to get to the truth. This is especially important with dynamically generated Web sites.

I'm also going to make some assumptions about the kind of system that you are using. All right I'm biased. I do like the Apache web server on a UNIX platform. Pretty much everything I'm suggesting here applies to any webserver platform but you may need to read some manuals to figure out where you gotta look for the conf file or its equivalent.

If you look inside the configuration file that your Apache server is using, you should see some lines that describe the three main kinds of log files. One is the error log, where Apache writes any error messages, such as those triggered by a request for a non-existent URL. Another (if you have it turned on) lists the page people were looking at just before they requested one of yours. This is called a referrer log. The main one is the access log. This records the requests that arrive at your web server and contains some information about what happened when each request was served.

There is a lot of valuable information recorded here, even when you use the default set up. You can add more to it to make it even more useful.

Every time someone connects to your web server, it figures out what their IP address is. There are some problems with the way it does this, mostly because of the way people get connected to the Internet in the first place. There are two main problems. The first is when a dial-up account is granted a floating IP address. That means each time you dial in, your PC is given a different IP value. This happens when IP addresses are handed out with something called DHCP. If you see this in your control panel settings then it's likely you won't always have the same IP address. The second problem is when people come through firewalls that surround their systems. This normally happens when you use an office machine. The only IP address that your web server can work out is the IP address of the firewall system. This means that everyone coming to your site from behind that firewall looks like they are living at the same address.

The consequence of those problems is that when someone visits your site two days running they might have a different IP address each time. Or two people visiting at the same time might share the same IP address. Because of this, you cannot rely on the IP address on its own to tell you how many visitors are coming to your site.

To work round this, you could configure your Apache server to record the user agent value for every request. That is an additional piece of information that your web browser sends so that in theory you can send out customized HTML for each type of browser.

If you expect to be audited by the ABC or BPA for example, you will need to record that information. It doesn't wholly solve the shared firewall IP address problem but it does help some, because it's likely that not everyone behind a firewall will have exactly the same version of web browser. It also helps you eliminate non-human driven traffic (robots and spiders), which you must do to satisfy the auditors.

The only sure-fire way to identify users uniquely and to tell if they are coming back is to drop a permanent and unique cookie on their box. This requires some active CGI type software to ensure that you detect users who don't yet have a cookie and gives you a way to generate a unique one-time value.
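
As a rough sketch of that cookie logic (this is not the software described in the article; the cookie name visitor_id and the ten year lifetime are assumptions made purely for illustration), a CGI-style helper in Python might check the incoming Cookie header and mint a new value only when none is present:

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"            # hypothetical cookie name
TEN_YEARS = 10 * 365 * 24 * 60 * 60   # assumed "permanent" lifetime, in seconds


def ensure_visitor_cookie(cookie_header):
    """Return (visitor_id, set_cookie_line).

    cookie_header is the raw Cookie: request header (may be empty or None).
    set_cookie_line is None when the visitor already carries our cookie.
    """
    jar = SimpleCookie()
    if cookie_header:
        jar.load(cookie_header)
    if COOKIE_NAME in jar:
        return jar[COOKIE_NAME].value, None

    # New visitor: generate a random one-time value and ask the browser to keep it.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = TEN_YEARS
    return visitor_id, out.output(header="Set-Cookie:")
```

The second return value is a complete Set-Cookie header line that the script would write out ahead of the page body, and only for visitors who arrived without the cookie.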

Some folks will block cookies because they think they will catch a virus or think that the cookie is some kind of security breach. This is being a bit unfair to cookies, which never actually did anyone any harm but became the patsy for a couple of hacks that duped cookies into carrying information back to a web server for them. Cookies themselves are quite harmless.

Well, now at least we can detect whether our user has been here before, so long as we enhance the access log format to include the cookie contents as well. So that's two additions we need to make to the default log content: cookies and user agents.
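
To make the identification rule concrete, here is a minimal sketch in Python. It assumes each log entry has already been reduced to an (IP address, user agent, cookie field) triple, that an empty cookie field is logged as "-", and that the cookie is called visitor_id; all of those are illustrative assumptions rather than anything fixed by Apache. The cookie is preferred when present, and the IP address plus user agent pair is the fallback:

```python
from http.cookies import SimpleCookie


def visitor_key(ip, user_agent, cookie_header, cookie_name="visitor_id"):
    """Return the best available identity for one log entry."""
    if cookie_header and cookie_header != "-":
        jar = SimpleCookie()
        jar.load(cookie_header)
        if cookie_name in jar:
            return ("cookie", jar[cookie_name].value)
    # No usable cookie: fall back to the weaker IP + user agent pairing.
    return ("ip+agent", ip, user_agent)


def count_visitors(entries):
    """Count distinct visitors in an iterable of (ip, agent, cookie) triples."""
    return len({visitor_key(*entry) for entry in entries})
```

Two people behind one firewall with different browsers count as two visitors here, and one person whose dial-up address changes still counts once as long as the cookie survives.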

There's another addition to the log format that's worth considering. If you have a dynamic web site served with some middleware such as WebObjects or Cold Fusion, your server log will record the names of the templates that were requested. It won't necessarily record anything that you can detect as the specific content that got painted into that template. There may be some vestigial traces in the URL parameters depending on how your middleware works. You can get round this so long as your middleware is smart enough to be able to write an additional header into the outgoing response. If you can, add a header called 'Logging_data:' and associate some values with it that describe the dynamic page content. Then, back in your Apache webserver's configuration file (httpd.conf), add an item to the log format to include that header in the logging stream as well. Apache logs a response header with an item of the form %{Logging_data}o.

Now, with the other information that was already being recorded, you have almost everything you need.

Remember that I said there were three types of logs. You can reduce this to two if you merge the referrer data into the access log. That lets you turn off the referrer log. It's a good idea to do this because it's very difficult otherwise to correlate the entries in the referrer log with those in the access log. By adding the referrer data to the custom log format, you eliminate this problem too.

Here is an example log format that you can use:

LogFormat "%h %l %u %t \"%r\" %>s %b \| \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"" combined

Here is a list of all the items we are now recording in the log:

IP address
hostname (not usually activated for performance reasons)
identd (not usually activated as the other end usually doesn't support it)
Date and time
Request method
Request path
Request protocol
Response status
Response content size
Referrer path
User agent
Cookie values
Logging header (once you have added it to the log format as described earlier)

And here is an example log entry:

195.238.161.136 - - [06/Nov/1998:14:54:33 +0000] "GET /img/navigation/top_nav/jamba_dips_stat.gif HTTP/1.0" 200 743 | "/navigation/top_nav/jamba_dips_stat.html HTTP/1.0" "Mozilla/4.05 [en] (Win95; I)" "Cookie data here"
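
Before any filtering can happen, each line has to be split back into its fields. The following Python sketch shows one way to do that for the log format above. The regular expression and the field names are my own choices, not part of Apache, and the sketch assumes no field contains an embedded double quote:

```python
import re

# One named group per field of the custom log format shown above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'\|\s*"(?P<referrer>[^"]*)"\s*'
    r'"(?P<agent>[^"]*)"\s*'
    r'"(?P<cookie>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field holds "METHOD path PROTOCOL"; pad in case parts are missing.
    parts = entry.pop("request").split()
    entry["method"], entry["path"], entry["protocol"] = (parts + ["", "", ""])[:3]
    entry["status"] = int(entry["status"])
    # Apache logs "-" when no body bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry
```

Lines that do not match come back as None, so they can be counted and inspected separately instead of silently skewing the statistics.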

You can work out a great deal from this log content if you apply some filtering and processing tools.

For example, you can determine where people are located by using nslookup on the IP address and resolving back to their full whois specification. You would want to do this after you had de-duplicated all the hits and checked them against previous days' log data.
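
A minimal Python sketch of the lookup step follows. It only does the reverse DNS part (the equivalent of nslookup on an address), not the whois query, and it de-duplicates the addresses first so each one is looked up once. The lookup parameter is an addition of mine so that the resolver can be swapped out; by default it uses the standard library's socket.gethostbyaddr:

```python
import socket


def resolve_hosts(ip_addresses, lookup=socket.gethostbyaddr):
    """Map each distinct IP address to a hostname, or None if it will not resolve."""
    names = {}
    for ip in set(ip_addresses):        # de-duplicate: one lookup per address
        try:
            # gethostbyaddr returns (hostname, alias_list, address_list)
            names[ip] = lookup(ip)[0]
        except OSError:                 # covers socket.herror and socket.gaierror
            names[ip] = None
    return names
```

Keeping the resulting mapping from one day to the next would avoid repeating lookups for addresses already seen in earlier logs.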

Taking the user agent value, you can tell whether robots and search engines are visiting your site. From the referrer data, you can count the number of visits to the site and determine how people linked to it. This is good for testing whether ad banners you have placed in other sites are working or whether some reciprocal link is active. You can tell if your site has just won an award for example. The other interesting data in the referrer information is the contents of the search string that someone might have used to find your site in Yahoo or Alta Vista for example. This is helpful in working out whether any seeding of search engines is working.
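
Both checks can be sketched in a few lines of Python. The query parameter names and the robot substrings below are assumptions for illustration only; each search engine uses its own parameter name, and a real robot list would come from your auditor or your own observations:

```python
from urllib.parse import urlsplit, parse_qs

SEARCH_PARAMS = ("q", "p", "query")                  # assumed parameter names
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")  # assumed substrings


def search_terms(referrer):
    """Return the search string carried in a referrer URL, or None."""
    query = parse_qs(urlsplit(referrer).query)
    for name in SEARCH_PARAMS:
        if name in query:
            return query[name][0]
    return None


def looks_like_robot(user_agent):
    """Crude guess at whether a user agent string belongs to a robot."""
    agent = user_agent.lower()
    return any(hint in agent for hint in ROBOT_HINTS)
```

Substring matching like this will miss robots that pose as ordinary browsers, so treat it as a first pass only.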

The user agent value is also worth looking at carefully because you can see whether it is worth customising your site to work efficiently with a particular browser or platform. It's also a helpful additional piece of data to give your sales team if they are selling ads on your site.

By following the referrer data and chaining that to the request data, you can, within all the hits from a single IP address during a session, track the user's journey through your site. This means you can determine where they entered your site and also figure out where they got so bored with it that they left.
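
Here is a simplified Python sketch of journey tracking. Note the swap: instead of chaining referrers to requests, it just orders each visitor's page requests by timestamp, which is easier to show but cannot untangle two browser windows open at once. The (visitor, timestamp, path) tuple layout is an assumption for illustration:

```python
from collections import defaultdict


def journeys(hits):
    """Group (visitor, timestamp, path) hits into per-visitor paths in time order."""
    by_visitor = defaultdict(list)
    for visitor, timestamp, path in hits:
        by_visitor[visitor].append((timestamp, path))
    return {
        visitor: [path for _, path in sorted(sequence)]
        for visitor, sequence in by_visitor.items()
    }


def entry_and_exit_pages(hits):
    """Return {visitor: (first page requested, last page requested)}."""
    return {
        visitor: (paths[0], paths[-1])
        for visitor, paths in journeys(hits).items()
    }
```

Counting how often each page turns up as the exit page across all visitors points at the places where people get bored and leave.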

Beware, because this is a point at which I am going to introduce you to the difference between hits and page impressions. Many people quote their hit numbers as if they were page requests. Counting hits is a meaningless currency. A highly framed site with many graphic components will generate vastly more hits than a flat structured site with very few graphics. Quoting hits may suggest that one gets more traffic than the other does when in fact the more popular site may be the one with fewer hits.

You need to filter out all the hits that cannot be counted as page impressions. It's best to identify that single particular thing about a page that is unique and identifiable. A good technique is to locate that one content item and include the word 'Page' in its name. You need to make sure you don't include the word 'Page' in any other URLs that you want to discard.

Throw away all your hits on graphics, framesets, navbars etc. so that all you are left with is genuine pages. The ABC and BPA define a valid page impression as that one thing that was served as a result of the user making an active request. You cannot count pushed content or client-pulled content unless it was human triggered. Anything that is automatically refreshed must be counted exactly once, although you can count non-displayed items such as sounds since they were requested by a user interaction.

Make sure that all you have is pages and then throw that through WebTrends or Analog or whatever other log analyser you want to use. Unless you do this thinning operation first, you are going to see misleading results plotted in your web report. WebTrends will consistently tell you that the most popular exit item was an ad banner for example. That is because it will be requested after the request that serves the last genuine page. Since that page probably contains the ad, it follows that the ad must be served after the page itself. Once you have a properly filtered page impression log, you can do as much analysis on it as you like.

If you are really serious about this process, you need to carefully eliminate any pages served to robots. The auditing company will give you a list against which you should filter. You should also remove any redirects or failed requests. Basically you can only count requests whose status value is in the 100 or 200 range. The 300 range is for redirects or relocations and the 400 range indicates a serving error. Errors in the 500 range indicate that your Apache web server is having some severe problems. Thankfully these are pretty rare.
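
The thinning rules described so far can be sketched as a single filter in Python. It assumes each log entry is a dict with path, status and agent keys (my own field names), that genuine pages carry the word 'Page' in their URL as suggested earlier, and that the robot list is an exact-match set of user agent strings:

```python
def is_page_impression(entry, robot_agents=()):
    """True if a parsed log entry counts as a genuine page impression."""
    status_ok = 100 <= entry["status"] < 300   # drop redirects and errors
    is_page = "Page" in entry["path"]          # naming convention for real pages
    is_robot = entry["agent"] in robot_agents  # auditor-supplied list
    return status_ok and is_page and not is_robot


def thin(entries, robot_agents=()):
    """Keep only the entries that count as page impressions."""
    return [entry for entry in entries if is_page_impression(entry, robot_agents)]
```

The thinned list is what you would write back out as a log file and hand to the analyser.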

You should be able to take that thinned page impression log now and extract the dynamic page data and any static page names to generate an editorial report. Your content creation staff will want to know when a feature starts to become less popular so they can replace it with another. Keeping track of this can keep your users interested. New feature items will follow a similar interest curve. You should be able to predict the exact right time when it's good to push a feature deeper down in your site and replace it with a new one. Getting this into a good rhythm can gradually ramp up your page impressions. Reliable log analysis is key to making a success of this.

Your sales people (if you are selling ads) will want to know page impression counts on a daily basis. This suggests that you will want to rotate your logs once a day. Midnight is a good time to do this. You can automate this by writing a script that can regenerate your Apache server config file every night so that the log file names get a date extension to their name. Then, having rewritten the conf, you can send a signal to the Apache main parent server process to re-read its config file.
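
A sketch of that nightly script in Python might look like the following. It assumes the log files are named on CustomLog and ErrorLog lines as plain, unquoted paths, and that the location of Apache's pid file is known to you; the function names are my own. Sending SIGHUP to the parent process makes Apache re-read its configuration and reopen its logs:

```python
import datetime
import os
import re
import signal


def dated_config(conf_text, day):
    """Return conf_text with a .YYYYMMDD suffix on each CustomLog/ErrorLog path."""
    stamp = day.strftime("%Y%m%d")

    def restamp(match):
        directive, path, rest = match.group(1), match.group(2), match.group(3)
        path = re.sub(r"\.\d{8}$", "", path)   # strip yesterday's stamp, if any
        return f"{directive}{path}.{stamp}{rest}"

    pattern = r"(?m)^([ \t]*(?:CustomLog|ErrorLog)[ \t]+)(\S+)(.*)$"
    return re.sub(pattern, restamp, conf_text)


def reload_apache(pid_file):
    """Signal the parent Apache process (pid read from pid_file) to restart."""
    with open(pid_file) as handle:
        pid = int(handle.read().strip())
    os.kill(pid, signal.SIGHUP)
```

Run from cron at midnight, the script would write the result of dated_config back to the conf file and then call reload_apache.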

Once you have grabbed and thinned the daily log file, you can count some metrics from it. There are probably about 40 different values that are useful and which you can distill from a daily log. Store these away safely because you'll need them later. You ought to archive the log files for safe keeping too.

Bear in mind that you will want to do weekly reporting and may want to do monthly reporting. Some values such as page impressions can be added together. Users per week cannot. They need to be aggregated. That means you merge the logs for a whole week and run the same deduplication process on the user identification values (cookies or IP and user agent). Likewise some monthly statistics may need to be aggregated.
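
The difference between adding and aggregating is easy to see in a small Python sketch. It assumes each day has been boiled down to a page impression count plus the set of visitor identification values seen that day (that data layout is my own choice for illustration):

```python
def weekly_totals(days):
    """days: list of (page_impressions, set_of_visitor_ids), one per day.

    Returns (total page impressions, distinct visitors over the period).
    """
    impressions = sum(count for count, _ in days)   # impressions simply add up
    visitors = set()
    for _, ids in days:
        visitors |= ids                             # union removes repeat visitors
    return impressions, len(visitors)
```

With two days of (100, {"a", "b"}) and (150, {"b", "c"}), the impressions total 250 but the visitor count is 3, not the 4 you would get by adding the daily user counts.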

Record the daily, weekly and monthly values in a database or flat file structure.

Then, you can extract and format these values to generate the necessary reports. It's possible to generate formatted SYLK files for loading straight into a spreadsheet. You could also generate RTF files to load into Word. PDF files are also a possibility although they are a bit more difficult to create.

The daily reports can be generated every morning before people get to work, the weekly ones every Monday and the monthly ones right before the Board meeting your Sales manager originally asked you about.

To recap, the processing on the logs goes like this:

Re-write conf file with date stamps.
Kick over the web server at midnight.
Compress and archive yesterday's logs at about 10 past midnight.
Generate thinned page impression log at about 1:00 am.
Distill the daily information and store it in the database at about 4:00 am.
On Mondays, aggregate and distill the weekly data.
On the first of each month, aggregate and distill the monthly data.
At about 7:00 am every day, generate the daily report and mail it out.
On Mondays at 7:00 am generate and mail out the weekly report.
On the first of each month, at 7:00 am, generate and mail out the monthly report.
Optionally, on a quarterly basis or as often as needed, generate your audit certificate.

You may need to upload a copy of the raw log file every day to the auditor's server so that they can check your claims. You should also download their list of robot user agents and feed that back into your log automation. They will very likely seed your log with hits that only they can identify. This is so that they can ensure you have sent a genuine log file. You wouldn't want to try and synthesize one anyway, would you?

Oh yeah, another point in closing. Is it possible to do all this and still have a life outside of running a web site? Not all folks doing log analysis are train spotters in their spare time but it probably helps to be kinda picky about the small details and finer points.

To put it another way, can this be completely automated? I sure hope so. In fact I know it can. I've developed three generations of these logging systems, on some very high traffic sites (more than a million hits a day). Almost all of that was automated, including some heuristic routines to learn about new spiders. I'm about to implement the fourth and most comprehensive one to date. It's going to be a real doozy. Completely automated and hands off throughout as far as it's possible to be. Zero maintenance. I LIKE that!!!