parselog

Copyright 2000 - 2005, Stuart Udall

overview
important bits
installation
configuration and startup
controls and methods
issues and limitations
planned improvements
revision history
latest version

version 1.57: May 10, 2005


 
  overview

PARSELOG converts a webserver logfile into Comma Separated Value (CSV) format. The contents of the logfile can then be loaded into a number of database and spreadsheet applications for additional analysis. The information produced by PARSELOG is best described as a basis for further research; the actual utility of the numbers depends on who is looking for what. While processing, PARSELOG also calculates several statistics.

PARSELOG supports the Common Logfile Format (CLF), but prefers the Extended NCSA/combined format (which includes referrer and useragent data). If useragent and referrer information is not supplied, PARSELOG treats all traffic as human. However, the resulting information is of limited value, as these two fields are central to further analysis.
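
As an illustration of the conversion described above, here is a minimal Python sketch that parses one CLF or combined-format line and writes it out as CSV. It is not PARSELOG's own code (PARSELOG is a compiled BASIC program); the regular expression, the field names and the column order are assumptions made for the example, and the CSV layout PARSELOG actually produces may differ.

```python
import csv
import io
import re

# CLF entry, optionally followed by the two quoted fields (referrer and
# useragent) that the combined format adds. Illustrative pattern only.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)

# Hypothetical column order for the output.
FIELDS = ["host", "ident", "user", "time", "request",
          "status", "size", "referrer", "agent"]


def parse_line(line):
    """Return the fields of one log entry, or None if the line is not one."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    # Plain CLF has no referrer or useragent; those columns become empty.
    return [match.group(name) or "" for name in FIELDS]


def to_csv(lines):
    """Convert raw log lines to CSV text, skipping anything unrecognised."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\r\n")
    for line in lines:
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)
    return out.getvalue()
```

Lines that do not match the pattern, such as email headers, simply return None from parse_line and are left out of the CSV output.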

PARSELOG skips email headers; it can handle a whole bunch of logfiles exported from an email reader to a single file, To:, From: lines and all.

PARSELOG filters referrals by source: search engine, webmailer, or referring URL. That is, if a user arrived on the site from a search engine, their visit will be logged to the search engine logfile; if they clicked on a link from inside their web-based inbox, their visit will be logged to the webmail logfile; while if they followed a link from a static page, their visit will be logged to the referring URL logfile. PARSELOG can also filter referrals from selected "local" referring URLs to a separate file.
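
The routing described above can be sketched in Python as a series of substring tests. The signature strings, the category names and the order in which the lists are checked are all assumptions made for illustration; PARSELOG reads its real signatures from the INI files, and its precedence rules are not documented here.

```python
# Hypothetical signature lists, for illustration only.
SEARCH_ENGINES = ["google.", "altavista.", "search.yahoo."]
WEBMAILERS = ["hotmail.", "mail.yahoo."]
LOCAL_REFERRERS = ["mysite.example/links"]


def classify_referrer(referrer):
    """Return the (hypothetical) category a referral would be filed under."""
    ref = referrer.lower()
    if ref in ("", "-"):
        return "direct"      # no referrer: bookmark, typed address, etc.
    if any(sig in ref for sig in LOCAL_REFERRERS):
        return "local"
    if any(sig in ref for sig in WEBMAILERS):
        return "webmail"
    if any(sig in ref for sig in SEARCH_ENGINES):
        return "search"
    return "url"             # any other referring page
```

Because the tests are plain substring matches, a short or careless signature can capture far more traffic than intended.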

PARSELOG filters spiders, monitors, harvesters and unrecognised User Agents to separate logfiles. All other useragents are treated as human.

PARSELOG currently supports over 200 distinct useragents, search engines and webmailers. The lists are completely user-definable (via text-based INI file), to a maximum of 200 each.

PARSELOG filters HTTP POST operations to a separate logfile. In order to make reports and statistics based on POST meaningful, PARSELOG first filters calls to FrontPage Server Extensions from the logfile (which utilise this POST function).

PARSELOG filters 403, 404 and 405 errors to separate logfiles, while filtering all other errors to the errors logfile.
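
A minimal Python sketch of this routing follows. The logfile names are hypothetical, and treating every status of 400 or above as an "error" is an assumption; the documentation does not say exactly which codes PARSELOG counts as errors.

```python
def error_log_for(status):
    """Map an HTTP status code to a hypothetical error logfile name.

    Returns None for responses that are not treated as errors.
    """
    if status in (403, 404, 405):
        return f"{status}.csv"   # one logfile per tracked error code
    if status >= 400:
        return "errors.csv"      # all other errors share one logfile
    return None
```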

PARSELOG extrapolates bookmarks by using a predefined statistic, the marketshare for Internet Explorer. This is a fuzzy number at the best of times, however it is certain that other browsers account for some other traffic. Therefore, PARSELOG allows a configurable extrapolation factor to multiply measured Internet Explorer 4 and 5 bookmarking activity to predict actual user bookmarking activity. The default marketshare is 70%; therefore 100 Internet Explorer favorites probably means that 130 users bookmarked the site, in total.

PARSELOG uses a similar approach to calculate the number of times the website was accessed from an inbox. It does this by counting the number of referrals from web-based inboxes, and then multiplying this by a configurable extrapolation factor, in order to predict actual email referral activity. The default factor is 50%; therefore 100 webmail referrals probably means that 150 users visited the site by following a link in an email in their inbox, in total.
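
Both worked examples above (100 becoming 130 at 70%, and 100 becoming 150 at 50%) are consistent with multiplying the measured count by 1 + (1 - share). The Python sketch below uses that reading; it is inferred from the two examples only and is not confirmed as PARSELOG's internal formula. Note that dividing by the share instead would give a different answer (100 / 0.70 is about 143).

```python
def extrapolate(measured, share):
    """Scale a measured count by 1 + (1 - share), rounded to a whole number.

    `share` is a fraction between 0 and 1 (0.70 for 70%).
    """
    if not 0 < share <= 1:
        raise ValueError("share must be greater than 0 and at most 1")
    return round(measured * (1 + (1 - share)))


bookmarks = extrapolate(100, 0.70)   # 130, matching the bookmark example
webmail = extrapolate(100, 0.50)     # 150, matching the webmail example
```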

PARSELOG calculates several other statistics (see below); these are streamed every n hits to the stats logfile.

PARSELOG generates a text-based report.

PARSELOG runs in "automatic" mode only; runtime options are configured using the INI file.


 
  important bits

  • requires Windows 9x/ME/2K/XP
  • requires logfile to be in CLF or NCSA Extended/combined format
  • requires logfile to be CR/LF terminated
  • This program is LICENSED SOFTWARE and may not be copied or distributed without prior written permission of the author.
  • Please see the license agreement included with the software for the complete terms and conditions of use of the software.


 
  installation

  1. run the self-extracting distribution archive

 
  configuration and startup
  1. edit PARSELOG.INI and change the settings, as appropriate. Mostly, the defaults will be fine, however ensure you change the file to reflect your domainname and logfile location. Below is a summary of general INI settings. The useragent and referrer signatures are defined in their respective sections.

    Note: as of version 1.50, PARSELOG expects the logfile to be called history.log, and expects to find it in the target directory (defined below). As of 1.51, PARSELOG will process parsenow.log in preference to history.log, should it find it in the target directory. PARSELOG will no longer work with logfiles of any other name, or in any other location.

    domainname      the name of your domain. Don't include http:// or www, simply the domainname such as cyberdelix.net is fine.
    targetdir       the name of the directory in which to find the logfile, and where to create reports
    logrules        the name of the file in which to find the strings used to analyse the raw logfile
    IEmshare        the marketshare of Internet Explorer
    webmailshare    the marketshare of webmailers
    makesearch      (yes or no) - create a log of referrals from search engines?
    makespiders     (yes or no) - create a log of visits from spiders?
    samplerate      the rate at which statistics are sampled to the stats logfile

  2. You may also wish to edit the [localreferrers] section, which allows you to filter referring URLs you control to a separate file. This feature is handy if a remote URL sends visitors to your site often, but you don't want those stats counted with other referrals, perhaps because the remote site is another of your sites, or you placed or paid for the link.

  3. You may also wish to edit the [excludedreferrers] section, which allows you to filter referring URLs you do not wish analysed to the bitbucket. The hit is still logged to the visitor logfile, but it's not logged in either the referring URL logfile, or the local referrers logfile.

  4. You may also wish to edit the logrules INI file. This is a separate file, by default called webrules.ini, which contains the strings parselog uses to analyse the raw logfile. The name and location of this file is specified by PARSELOG.INI (above).
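
To make the settings above concrete, here is a Python sketch that reads an INI file of this general shape using the standard configparser module. Only the setting names and the [localreferrers] and [excludedreferrers] section names come from this document. The [general] section name, the sample values, the use of whole percentages for the marketshares, and the one-string-per-line layout of the referrer sections are all assumptions, so check the PARSELOG.INI shipped with the program for the real layout.

```python
import configparser

# Hypothetical file contents, for illustration only.
SAMPLE_INI = """
[general]
domainname = cyberdelix.net
targetdir = C:\\LOGS
logrules = webrules.ini
IEmshare = 70
webmailshare = 50
makesearch = yes
makespiders = no
samplerate = 10

[localreferrers]
mysite.example/links

[excludedreferrers]
noisy.example
"""


def load_settings(text):
    """Parse the INI text into a plain dictionary of typed settings."""
    # allow_no_value lets a section hold bare strings, one per line.
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.read_string(text)
    general = parser["general"]   # assumed section name
    return {
        "domainname": general["domainname"],
        "targetdir": general["targetdir"],
        "logrules": general["logrules"],
        "iemshare": general.getint("IEmshare"),
        "webmailshare": general.getint("webmailshare"),
        "makesearch": general.getboolean("makesearch"),
        "makespiders": general.getboolean("makespiders"),
        "samplerate": general.getint("samplerate"),
        "localreferrers": list(parser["localreferrers"]),
        "excludedreferrers": list(parser["excludedreferrers"]),
    }


settings = load_settings(SAMPLE_INI)
```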

Note: PARSELOG will check for REPORT.TXT in the target directory, and if it exists, attempt to read previous statistics from it. PARSELOG will then add to these the results of the current analysis. In this way, PARSELOG only ever analyses the most recent data. To force PARSELOG to start from zero, erase REPORT.TXT before starting PARSELOG.

Note: If your local referrer logfile grows large, you should adjust the [localreferrers] and/or [excludedreferrers] sections of the INI file. A large number of entries in this logfile indicates both a misconfiguration and slow operation of PARSELOG.

Note: PARSELOG runs faster if search and spider filtering is turned off, and if the samplerate is increased. However, increasing the samplerate reduces the accuracy (increases the graininess) of graphs produced using the streaming stats logfile.

Note: the streaming stats file contains a snapshot of the various calculated ratios. PARSELOG adds to this file every n hits, with n being a number equal to the value of the samplerate INI setting. This file can then be opened in Excel, or similar, and a graph generated. However, Excel supports a maximum of 32767 unique dataitems. Increase the samplerate setting if you find you have too much data. For example, doubling the samplerate from 10 to 20 would halve the size of the streaming stats file.

Note: to disable streaming stats altogether, set the samplerate to 0. No other measurements are affected - they are simply not recalculated and streamed to file every n hits.
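
The effect of the samplerate setting on the size of the streaming stats file can be sketched in Python as follows. The function names are hypothetical. The sketch also assumes one row is written per sample and that each row counts as one Excel data item, which is a simplification of the limit mentioned above.

```python
def snapshot_points(total_hits, samplerate):
    """Return the hit counts at which a stats snapshot would be written."""
    if samplerate <= 0:
        return []   # a samplerate of 0 disables streaming stats
    return list(range(samplerate, total_hits + 1, samplerate))


def min_samplerate(total_hits, limit=32767):
    """Smallest samplerate that keeps the number of snapshots within limit."""
    return max(1, -(-total_hits // limit))   # ceiling division


rows_at_10 = len(snapshot_points(1000, 10))   # 100 snapshots
rows_at_20 = len(snapshot_points(1000, 20))   # 50 snapshots: half the size
```

For a logfile of one million hits, min_samplerate(1000000) returns 31 under these assumptions.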


 
  controls and methods

  • type PARSELOG from the command line

    This will cause PARSELOG to start. There are no commandline parameters.

Note: PARSELOG will not start without PARSELOG.INI in the current directory.

The parser will discard anything it doesn't recognise as a logfile entry, logging it to discard.log. Everything else is parsed.

If any of the CSV files already exist, they will be appended to. If they do not exist, they will be created. PARSELOG does not make backups.

PARSELOG will immediately exit at end, without displaying a message - check the report or processed logfiles to see what it did.

If PARSELOG was aborted by user mid-analysis, this fact will be noted in the resultant report. If it aborts mid-analysis for some other reason, the report will not be created at all.

about the hits by humans/total hits (%) statistic

This is total hits minus hits by spiders, monitors and harvesters, expressed as a percentage of total hits. Spiders are indexing utilities used by search engines to maintain their databases. Monitors are programs that periodically reload a given page. They can be run either by companies providing a service to others, or by users directly. Harvesters are programs which sniff email addresses from webpages, for later use in spam.

about the mystery agents/total hits (%) statistic

This is total human hits minus hits from unrecognised user agents, expressed as a percentage of total hits. This is essentially a measure of error in the total hits by humans statistic. A rising trend indicates a growing need to reconfigure the useragent section of the INI file.

about the extrapolated bookmarks/human hit (%) statistic

This is the ratio of bookmarks by humans to total hits by humans, expressed as a percentage (bookmarking activity by robots, harvesters and monitors is discarded). A falling trend indicates users are arriving on the site, but are not bookmarking it.

about the referrals by remote page/referrals by search (%) statistic

This is the ratio of referrals by search to referrals by URL, expressed as a percentage. There are four ways a user arrives at your site: directly (eg. via a bookmark, by typing it in, etc), via their web-based inbox, via a search engine, or by following a link from another page.

Referring URLs containing strings listed in the local referrers section of the INI file are logged to the local referrers logfile, and are excluded from this ratio. A local referrer is usually a page you control which acts as a gateway to the website you are analysing. Local referrer filtering allows you to separate external inbound links you are responsible for from those maintained by third parties.

about the extrapolated referrals by email/referrals by page (%) statistic

This is the ratio of referrals from webmailers (eg. Hotmail) to referrals from other sites (excluding search engines), expressed as a percentage.

about the legal posts/human hit (%) statistic

This is the total number of times a human visitor successfully POSTed a form, expressed as a percentage of total hits. A rising trend indicates increasing levels of user interaction.

about the 404 errors/total hits (%) statistic

This is total hits minus 404 errors, expressed as a percentage of total hits. A value of 94% means that for every 100 page requests, 6 pages were missing.
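
The 404 statistic above and the hits by humans statistic share the same form: total hits minus some excluded count, as a percentage of total hits. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and returning 0 for an empty logfile is an assumption.

```python
def share_remaining(total_hits, excluded_hits):
    """Return (total - excluded) / total as a percentage."""
    if total_hits <= 0:
        return 0.0   # assumption: report 0 when there are no hits
    return 100.0 * (total_hits - excluded_hits) / total_hits


found = share_remaining(100, 6)       # 94.0: 6 of 100 requests were 404s
humans = share_remaining(1000, 250)   # 75.0: 250 hits by robots excluded
```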


 
  issues and limitations

PARSELOG was mentioned on a German-language website; apparently a process called PARSELOG.EXE was open up to 4 times on one machine, and was crashing on another. Please be assured I have no idea what PARSELOG.EXE this is. As my PARSELOG.EXE is a 16-bit DOS-mode compiled BASIC program, it knows nothing about multiple threads, etc. OTOH, as it is a 16-bit DOS-mode compiled BASIC program, that might indeed explain why it's having some problems on (probably) newer systems, although it has crunched over a gigabyte of raw CLF on my local Windows 98 box without a problem. More at reger24.de.

All statistics have inherent levels of error, and the statistics generated by PARSELOG are no exception. When working with statistics, keep your grains of salt handy at all times.

The accuracy of the extrapolated bookmarks statistic is dependent upon a fuzzy input, and is thus of limited value. Avoid setting the IE marketshare to a ridiculous setting (eg., negative ;-) if you value the accuracy of this statistic.

Because PARSELOG uses string-matching to determine what sort of traffic is being analysed, it's possible to include your own site as a remote referrer, or even a search engine, or class all Mozillas as DIIbots, or something equally inane. This could be useful, but more likely, it will be confusing. Take care when reconfiguring the strings in the INI file - small changes can make a big difference.

PARSELOG requires logfiles to be formatted with CR/LF. If your logfile is not CR/LF terminated, preparse it to CR/LF using a third-party utility before calling PARSELOG.
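
If no such utility is to hand, the conversion is simple enough to script. The Python sketch below is one way to do it; the function names are hypothetical. It reads the whole file into memory, so very large logfiles would need a chunked approach instead.

```python
def to_crlf(data: bytes) -> bytes:
    """Normalise LF-only or mixed line endings to CR/LF."""
    # Collapse existing CR/LF pairs to LF first so they are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def convert_file(src_path, dst_path):
    """Write a CR/LF-terminated copy of src_path to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_crlf(data))
```

Working in binary mode avoids any automatic newline translation by the operating system.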

PARSELOG converts commas found in multiple IP addresses to underscores.

PARSELOG generates CSV files. These are easier to work with than CLF, but they are still noisy. A separate program, CSV2HTML, can convert these CSV-formatted files to HTML documents, encapsulating the CSV data in a table.

PARSELOG supports up to 200 entries in any [section] of the INI file. Any additional entries are ignored.

PARSELOG can analyse between 2.1 and 2.2 billion logfile entries, at maximum.


 
  planned improvements

  • add start-analysis-at-datetime, stop-analysis-at-datetime
  • filter news to newslog?
  • support unix-style textfiles (LF only)
  • parse-from-archive
  • log rotation
  • duplicate detection
  • add anonymisers and proxies?
  • re-order preparsed logfile chronologically
  • average number of hits/day, referrals/day, searches/day etc

 
  revision history

March 11, 2000       0.01   initial development; basic logfile decoding/CSV encoding routines
March 14, 2000       0.02   bugfixes
March 15, 2000       0.03   documentation
November 3, 2000     0.04   added preparsing, visitor/spider/personal/other agent filtering, search/referrer filtering, 304/404/other error filtering
November 12, 2000    1.0    added bookmark extrapolation, and human/robot, search/referrer, and hits/error ratios
November 21, 2000    1.01   added bookmarks/hit and unknown agent ratios; added webmail filtering
January 10, 2001     1.02   added stats sampling; database update; bugfixes; SFX distribution
February 25, 2001    1.03   database update
August 12, 2001      1.47   added makesearch and makespider toggles; added extrapolated webmail referrals and webmail/referrers ratio; added POST counter and legal posts/human hit ratio; bugfixes; database update
November 26, 2001    1.48   bugfixes; database update
January 31, 2002     1.49   added domainname setting; bugfixes; database update
March 4, 2002        1.50   added ability to disable stats sampling; performance tweaks; database update
September 8, 2002    1.51   added caching of old results; separated logrules and parselog.ini
January 15, 2004     1.52   bugfixes; database update; recompiled both EXE's with a slightly faster compiler
February 21, 2004    1.53   bugfixes
May 4, 2004          1.54   added support for parking services and advertisers; added logging
June 9, 2004         1.55   added ability to disable local referrer logfile
?                    1.56   bugfix
May 10, 2005         1.57   bugfix, database update, commercial release