Webalizer Documentation

=========================================================================
The Webalizer - A web server log file analysis tool
Copyright 1997-2008 by Bradford L. Barrett (brad@mrunix.net)

CONTENTS:
  1. What is The Webalizer
  2. Incremental Processing
  3. Definitions
  4. General Configuration
  5. Top Table Options
  6. Show All
  7. Hide Options
  8. Group Options
  9. Referrers
  10. Notes on Character Escaping
  11. Search String Analysis
  12. Notes on Visits/Entry/Exit Figures
  13. Known Issues
  14. Final Notes
  15. Adhost Notes

What is The Webalizer?
----------------------
The Webalizer is a web server log file analysis program which produces
usage statistics in HTML format for viewing with a browser.  The results
are presented in both columnar and graphical format, which facilitates
interpretation.  Yearly, monthly, daily and hourly usage statistics are
presented, along with the ability to display usage by site, URL, referrer,
user agent (browser) and country (user agent and referrer are only
available if your web server produces Combined log format files).


Incremental Processing
----------------------
Some special precautions need to be taken when using the incremental
run capability of The Webalizer.  Configuration options should not be
changed between runs, as that could cause corruption of the internal
stored data.  For example, changing the MangleAgents level will cause
different representations of user agents to be stored, producing invalid
results in the user agents section of the report.  If you need to change
configuration options, do it on the 1st day of the month, before processing
for that month begins at about 3:00am on the 2nd.  You may set your
configuration at any time if you choose the "Reconfigure For 1st Day Of
Next Month" option.  Your new configuration will be stored until the 1st
day of the month, and then automatically installed.


Definitions
-----------
Hits

  Any request made to the server which is logged is considered a 'hit'.
The requests can be for anything... html pages, graphic images, audio
files, cgi scripts, etc...  Each valid line in the server log is
counted as a hit.  This number represents the total number of requests
that were made to the server during the specified report period.

Files

  Some requests made to the server require that the server then send
something back to the requesting client, such as an HTML page or graphic
image.  When this happens, it is considered a 'file' and the files
total is incremented.  The relationship between 'hits' and 'files' can
be thought of as 'incoming requests' and 'outgoing responses'.

Pages

  Pages are, well, pages!  Generally, any HTML document, or anything
that generates an HTML document, would be considered a page.  This
does not include the other stuff that goes into a document, such as
graphic images, audio clips, etc...  This number represents the number
of 'pages' requested only, and does not include the other 'stuff' that
is in the page.  What actually constitutes a 'page' can vary from
server to server.  The default action is to treat anything with the
extension '.htm', '.html', '.txt' or '.asp' as a page.  A lot of sites will
probably define other extensions, such as '.phtml', '.php3' and '.pl'
as pages as well.  Some people consider this number as the number of
'pure' hits... I'm not sure if I totally agree with that viewpoint.
Some other programs (and people :) refer to this as 'Pageviews'.
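
As a rough illustration of the hit and page counts described above, here is
a minimal Python sketch.  It is not The Webalizer's own code.  The extension
list is taken from the defaults mentioned in this section, and the function
names (is_page, count_hits_and_pages) are made up for this example.

```python
from urllib.parse import urlsplit

# Page extensions assumed from the defaults listed in this section.
PAGE_EXTENSIONS = (".htm", ".html", ".txt", ".asp")

def is_page(url, extensions=PAGE_EXTENSIONS):
    """Return True if the URL path ends in one of the page extensions."""
    path = urlsplit(url).path.lower()   # ignore any ?query part
    return path.endswith(tuple(extensions))

def count_hits_and_pages(urls):
    """Every logged request is a hit; only page-type URLs count as pages."""
    urls = list(urls)
    pages = sum(1 for url in urls if is_page(url))
    return len(urls), pages

requests = ["/index.html", "/logo.gif", "/notes.txt", "/cgi-bin/form.pl?id=3"]
print(count_hits_and_pages(requests))  # (4, 2)
```

Adding '.pl' to the extension list would make the last request count as a
page as well, which is the kind of site-specific choice described above.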

Sites

  Each request made to the server comes from a unique 'site', which can
be referenced by a name or ultimately, an IP address.  The 'sites'
number shows how many unique IP addresses made requests to the server
during the reporting time period.  This DOES NOT mean the number of
unique individual users (real people) that visited, which is impossible
to determine using just logs and the HTTP protocol (however, this
number might be about as close as you will get).

Visits

  Whenever a request is made to the server from a given IP address (site),
the amount of time since the previous request from that address (if any)
is calculated.  If the time difference is greater than a preconfigured
'visit timeout' value (or the address has never made a request before),
it is considered a 'new visit', and this total is incremented (both for
the site, and the IP address).  The default timeout value is 30 minutes
(entered as 'mmss', ie: 3000 = 30 mins and 00 seconds), so if a user visits
your site at 1:00 in the afternoon, and then returns at 3:00, two visits
would be registered.  Note: in the 'Top Sites' table, the visits total
should be discounted on 'Grouped' records, and thought of as the "Minimum
number of visits" that came from that grouping instead.  Note: Visits only
occur on PageType requests, that is, for any request whose URL is one of
the 'page' types defined with the PageType option.  Due to the limitations
of the HTTP protocol, log rotations and other factors, this number should
not be taken as absolutely accurate; rather, it should be considered a
pretty close "guess".

KBytes

  The KBytes (kilobytes) value shows the amount of data, in KB, that was
sent out by the server during the specified reporting period.  This value
is generated directly from the log file, so it is up to the webserver to
produce accurate numbers in the logs (some web servers do stupid things
when it comes to reporting the number of bytes).  In general, this should
be a fairly accurate representation of the amount of outgoing traffic the
server had, regardless of the web server's reporting quirks.

Note: A kilobyte is 1024 bytes, not 1000 :) 
      A megabyte is 1024 kilobytes, or     1,048,576 bytes.
      A gigabyte is 1024 megabytes, or 1,073,741,824 bytes.

Top Entry and Exit Pages

  The Top Entry and Exit Pages give a rough estimate of what URL's
are used to enter your site, and what the last pages viewed are.
Because of limitations in the HTTP protocol, log rotations, etc...
these numbers should be considered a good "rough guess" of the actual
figures; however, they will give a good indication of the overall trend
in where users come into, and exit, your site.


General Configuration
---------------------
ReportTitle   This specifies the title to use for the generated reports.
              It is used in conjunction with the hostname (unless blank)
              to produce the final report titles.  If not defined, the
              default of ": " is used.

VisitTimeout  Set the 'visit timeout' value.  Visits are determined by
              looking at the time difference between the current and last
              request made by a specific site.  If the difference in time
              is greater than the visit timeout value, the request is
              considered a new visit.  The value must be in the form of
              HHMMSS, leading zeros suppressed.  The default value of
              30 minutes (3000) should be fine for most.

PageType      Allows you to define the 'page' type extension.  Normally,
              people consider HTML and cgi scripts as 'pages'.  This
              option allows you to specify what extensions you consider
              a page.  Default is 'htm*' and 'cgi' for web logs, and
              'txt' for ftp logs.

GraphLegend   Enable/disable the display of color coded legends on the
              produced graphs.  Default is 'yes', to display them.

CountryGraph  This keyword is used to either enable or disable the creation
              and display of the Country Usage graph.  Values may be either
              'yes' or 'no', with the default being 'yes'.

HourlyGraph   This keyword is used to either enable or disable the creation
              and display of the Hourly Usage graph.  Values may be either
              'yes' or 'no', with the default being 'yes'.

HourlyStats   This keyword is used to either enable or disable the creation
              and display of the Hourly Usage statistics table.  Values may
              be either 'yes' or 'no', with the default being 'yes'.

MangleAgents  The MangleAgents keyword specifies the level of user agent
              name mangling, if any.  There are 6 levels that may be specified,
              each producing a different level of detail displayed.  Level 5
              displays only the browser name (MSIE or Mozilla) and the major
              version number.  Level 4 adds the minor version (single
              decimal place).  Level 3 adds the minor version to two decimal
              places.  Level 2 will also add any sub-level designation
              (such as Mozilla/3.01Gold or MSIE 3.0b).  Level 1 will also
              attempt to add the system type.  The default level 0 will
              leave the user agent field unmodified and produces the
              greatest amount of detail.
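
The VisitTimeout entry above can be illustrated with a short Python sketch.
It is only a sketch under stated assumptions: the function names are
hypothetical, the input is assumed to be (address, timestamp) pairs for page
requests already sorted by time, and the real program may differ in detail.

```python
from datetime import datetime, timedelta

def visit_timeout(value):
    """Convert an HHMMSS value (leading zeros suppressed) to a timedelta."""
    digits = str(int(value)).zfill(6)            # 3000 -> "003000"
    return timedelta(
        hours=int(digits[0:2]),
        minutes=int(digits[2:4]),
        seconds=int(digits[4:6]),
    )

def count_visits(requests, timeout):
    """Count visits from (address, timestamp) pairs sorted by time.

    A request starts a new visit when the address has not been seen
    before, or when more than `timeout` has passed since its last request.
    """
    last_seen = {}
    visits = 0
    for address, when in requests:
        previous = last_seen.get(address)
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[address] = when
    return visits

log = [
    ("10.0.0.1", datetime(2004, 9, 3, 13, 0)),   # first request: new visit
    ("10.0.0.1", datetime(2004, 9, 3, 13, 10)),  # within the timeout
    ("10.0.0.1", datetime(2004, 9, 3, 15, 0)),   # after the timeout: new visit
]
timeout = visit_timeout(3000)
print(timeout)                     # 0:30:00
print(count_visits(log, timeout))  # 2
```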


Top Table Options
-----------------
TopAgents     This allows you to specify how many "Top" user agents are
              displayed in the "Top User Agents" table.  The default
              is 15.  If you do not want to display user agent statistics,
              specify a value of zero (0).  The display of user agents
              will only work if your web server includes this information
              in its log file (ie: a combined log format file). Note:
              Agent is another name for Web Browser.

TopUsers      These are users who had to enter a username and password
              to view a specific page or directory.

TopCountries  This allows you to specify how many "Top" countries are
              displayed in the "Top Countries" table.  The default is
              50.  If you want to disable the countries table, specify
              a value of zero (0).

TopReferrers  This allows you to specify how many "Top" referrers are
              displayed in the "Top Referrers" table.  The default is
              30.  If you want to disable the referrers table, specify
              a value of zero (0).  The display of referrer information
              will only work if your web server includes this information
              in its log file (ie: a combined log format file).

Show All      The All* keywords allow the display of all URL's, Sites,
              Referrers, User Agents, Search Strings and Usernames.  If
              enabled, a separate HTML page will be created, and a link
              will be added to the bottom of the appropriate "Top" table.
              There are a couple of conditions for this to occur.  First,
              there must be more items than will fit in the "Top" table
              (otherwise it would just be duplicating what is already
              displayed).  Second, the listing will only show those items
              that are normally visible, which means it will not show any
              hidden items.  Grouped entries will be listed first, followed
              by individual items.  The value for these keywords can be
              either 'yes' or 'no', with the default being 'no'.  Please be
              aware that these pages can be quite large in size,
              particularly the sites page, and separate pages are generated
              for each month, which can consume quite a lot of disk space
              depending on the traffic to your site.

TopSites      This allows you to specify how many "Top" sites are
              displayed in the "Top Sites" table.  The default is 30.
              If you want to disable the sites table, specify a value
              of zero (0).

TopKSites     Identical to TopSites, except for the 'by KByte' table.
              Default is 10.

TopURLs       This allows you to specify how many "Top" URL's are
              displayed in the "Top URL's" table.  The default is 30.
              If you want to disable the URL's table, specify a value
              of zero (0).
              Normally, The Webalizer scans for and strips the string
              "index." from URL's before processing them.  This turns a
              URL such as /somedir/index.html into just /somedir/ which
              is really the same URL.


TopKURLs      Identical to TopURLs, except for the 'by KByte' table.
              Default is 10.

TopEntry      Allows you to specify how many "Top Entry Pages" are
              displayed in the table.  The default is 10.  If you
              want to disable the table, specify a value of zero (0).

TopExit       Allows you to specify how many "Top Exit Pages" are
              displayed in the table.  The default is 10.  If you
              want to disable the table, specify a value of zero (0).

TopSearch     Allows you to specify how many "Top Search Strings" are
              displayed in the table.  The default is 20.  If you
              want to disable the table, specify a value of zero (0).
              Only works if using combined log format (ie: contains
              referrer information).
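
The "index." stripping mentioned under TopURLs above could look roughly like
the following Python sketch.  The function name and the exact rule (only the
last path component is examined) are assumptions made for illustration, not
a description of the program's internals.

```python
def strip_index(url, index_alias="index."):
    """Drop a trailing 'index.*' file name so /dir/index.html becomes /dir/."""
    head, sep, last = url.rpartition("/")
    if sep and last.startswith(index_alias):
        return head + sep
    return url

print(strip_index("/somedir/index.html"))  # /somedir/
print(strip_index("/somedir/page.html"))   # /somedir/page.html
```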


Hide Options
------------
The following options take a string argument to use as a comparison
for matching.  The string argument can be plain text, or plain text
that either starts or ends with the wildcard character '*'.

For Example:

Given the string "yourmama/was/here", the arguments "was", "*here" and
"your*" will all produce a match.


Hide Object Keywords
--------------------
These keywords allow you to hide user agents, referrers, sites and
URL's from the various "Top" tables.  The values for these keywords
are the same as those used in their command line counterparts.  You
can specify as many of these as you want without limit.  Refer to the
"Hide Options" section above for a description of the string
formatting used as the value.  Values cannot exceed 80 characters in
length.  Hide* keywords can have a leading or trailing wildcard '*'.
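
The matching rule described under "Hide Options" (plain text, or text with a
leading or trailing '*') can be sketched in Python as follows.  The function
name is hypothetical, and case-sensitive comparison is an assumption.

```python
def matches(pattern, text):
    """Match `text` against a plain, leading-'*' or trailing-'*' pattern."""
    if pattern.startswith("*"):
        return text.endswith(pattern[1:])      # "*here": match at the end
    if pattern.endswith("*"):
        return text.startswith(pattern[:-1])   # "your*": match at the start
    return pattern in text                     # "was": match anywhere

for pattern in ("was", "*here", "your*"):
    print(pattern, matches(pattern, "yourmama/was/here"))  # True each time
```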

HideAgent     This allows specified user agents to be hidden from the
              "Top User Agents" table.  Not very useful, since there are
              a zillion different names that browsers go by today,
              but could be useful if there is a particular user agent
              (ie: robots, spiders, realaudio, etc..) that hits your
              site frequently enough to make it into the top user agent
              listing.  This keyword is useless if 1) your log file does
              not provide user agent information or 2) you disable the
              user agent table.

HideReferrer  This allows you to hide specified referrers from the
              "Top Referrers" table.  Normally, you would only specify
              your own web server to be hidden, as it is usually the
              top generator of references to your own pages.  Of course,
              this keyword is useless if 1) your log file does not include
              referrer information or 2) you disable the top referrers
              table.

HideSite      This allows you to hide specified sites from the "Top
              Sites" table.  Normally, you would only specify your own
              web server or other local machines to be hidden, as they
              are usually the highest hitters of your web site, especially
              if you have their browsers' home pages pointing to it.

HideURL       This allows you to hide URL's from the "Top URL's" table.
              Normally, this is used to hide items such as graphic files,
              audio files or other 'non-html' files that are transferred
              to the visiting user.


Group Object Keywords
---------------------
The Group* keywords allow object grouping based on Site, URL, Referrer
and User Agent.  Combined with the Hide* keywords, you can customize
exactly what will be displayed in the 'Top' tables.  For example, to
only display totals for a particular directory, use a GroupURL and HideURL
with the same value (ie: '/help/*').  Group processing is only done after
the individual record has been fully processed, so name mangling and
site total updates have already been performed.  Because of this, groups
are not counted in the main site total (as that would cause duplication).
Groups can be displayed in bold and shaded as well.  Grouped records are
not, by default, hidden from the report.  This allows you to display a
grouped total, while still being able to see the individual records, even
if they are part of the group.  If you want to hide the detail records,
follow the Group* directive with a Hide* one using the same value.  There
are no command line switches for these keywords.  The Group* keywords also
accept an optional label to be displayed instead of the actual value used.
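
To make the interplay of Group* and Hide* more concrete, here is a small
Python sketch of the '/help/*' example.  All names are hypothetical, and the
rule that a record joins only the first group it matches is an assumption
made for this illustration.

```python
def matches(pattern, text):
    """Plain text matches anywhere; '*' anchors the other end."""
    if pattern.startswith("*"):
        return text.endswith(pattern[1:])
    if pattern.endswith("*"):
        return text.startswith(pattern[:-1])
    return pattern in text

def summarize(hits_by_url, group_patterns=(), hide_patterns=()):
    """Return (group totals, visible records, overall total)."""
    groups = {pattern: 0 for pattern in group_patterns}
    visible = {}
    for url, hits in hits_by_url.items():
        # Assumption: a record is added to the first matching group only.
        for pattern in group_patterns:
            if matches(pattern, url):
                groups[pattern] += hits
                break
        # Hidden records are left out of the table but still counted below.
        if not any(matches(p, url) for p in hide_patterns):
            visible[url] = hits
    total = sum(hits_by_url.values())   # group totals are not added again
    return groups, visible, total

hits = {"/help/a.html": 5, "/help/b.html": 3, "/index.html": 10}
groups, visible, total = summarize(hits, ["/help/*"], ["/help/*"])
print(groups)   # {'/help/*': 8}
print(visible)  # {'/index.html': 10}
print(total)    # 18
```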

GroupReferrer Allows grouping Referrers.  Can be handy for some of the
              major search engines that have multiple host names a
              referral could come from.

GroupURL      This keyword allows grouping URL's. Useful for grouping
              complete directory trees.

GroupSite     This keyword allows grouping Sites.  Most used for
              grouping top level domains and unresolved IP addresses
              for local dial-ups, etc...

GroupAgent    Groups User Agents.  A handy example of how you could use
              this one is to use "Mozilla" and "MSIE" as the values for
              GroupAgent and HideAgent keywords.  Make sure you put the
              "MSIE" one first.


GroupShading  Allows shading of table rows for groups.  Value can be
              'yes' or 'no', with the default being 'yes'.

GroupHighlight Allows bolding of table rows for groups.  Value can be
               'yes' or 'no', with the default being 'yes'.

--------------------------------------------------------------------------


Notes on Referrers
------------------
Referrers are weird critters... They take many shapes and forms, which makes
them much harder to analyze than a typical URL, which at least has some
standardization.  What is contained in the referrer field of your log
files varies depending on many factors, such as what site did the referral,
what type of system it comes from and how the actual referral was generated.
Why is this?  Well, because a user can get to your site in many ways... They
may have your site bookmarked in their browser, they may simply type your
site's URL into their browser, they could have clicked on a link on some
remote web page or they may have found your site from one of the many search
engines and site indexes found on the web.  The Webalizer attempts to deal
with all this variation in an intelligent way by doing certain things to
the referrer string which makes it easier to analyze.  Of course, if your
web server doesn't provide referrer information, you probably don't really
care and are asking yourself why you are reading this section...

Most referrers will take the form of "http://somesite.com/somepage.html",
which is what you will get if the user clicks on a link somewhere on the
web in order to get to your site.  Some will be a variation of this, and
look something like "file:/some/such/sillyname", which is a reference from
an HTML document on the user's local machine.  Several variations of this
can be used, depending on what type of system the user has, if he/she is on
a local network, the type of network, etc...  To complicate things even
more, dynamic HTML documents and HTML documents that are generated by
cgi scripts or external programs produce lots of extra information which
is tacked on to the end of the referrer string in an almost infinite number
of ways.  If the user just typed your URL into their browser or clicked on
a bookmark, there won't be any information in the referrer field, and it
will take the form "-".

In order to handle all these variations, The Webalizer parses the referrer
field in a certain way.  First, if the referrer string begins with "http",
it assumes it is a normal referral and converts the "http://" and following
hostname to lowercase in order to simplify hiding if desired.  For example,
the referrer "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will become
"http://www.myhost.com/This/Is/A/HTML/Document.html".  Notice that only the
"http://" and hostname are converted to lower case... The rest of the
referrer field is left alone.  This follows standard convention, as the
actual method (HTTP) and hostname are always case insensitive, while the
document name portion is case sensitive.
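
The lowercasing step described above can be sketched in Python like this.
The function name is hypothetical and the code is only an approximation of
the behaviour described, not the program's actual parser.

```python
def lowercase_scheme_and_host(referrer):
    """Lowercase 'http://host' but leave the document path untouched."""
    if not referrer.lower().startswith("http"):
        return referrer                  # not treated as a normal referral
    scheme, sep, rest = referrer.partition("://")
    if not sep:
        return referrer
    host, slash, path = rest.partition("/")
    return scheme.lower() + sep + host.lower() + slash + path

print(lowercase_scheme_and_host(
    "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html"))
# http://www.myhost.com/This/Is/A/HTML/Document.html
```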

Referrers that come from search engines, dynamic HTML documents, cgi
scripts and other external programs usually tack on additional information
that was used to create the page.  A common example of this can be found
in referrals that come from search engines and site indexes common on the
web.  Sometimes, these referrer URL's can be several hundred characters
long and include all the information that the user typed in to search for
your site.  The Webalizer deals with this type of referrer by stripping
off all the query information, which starts with a question mark '?'.
The Referrer "http://search.yahoo.com/search?p=usa%26global%26link" will
be converted to just "http://search.yahoo.com/search".
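
Stripping the query information amounts to cutting the string at the first
question mark.  A minimal Python sketch (the function name is made up for
this example):

```python
def strip_query(referrer):
    """Remove the query string, i.e. everything from the first '?' on."""
    return referrer.split("?", 1)[0]

print(strip_query("http://search.yahoo.com/search?p=usa%26global%26link"))
# http://search.yahoo.com/search
```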

When a user comes to your site by using one of their bookmarks or by
typing in your URL directly into their browser, the referrer field is
blank, and looks like "-".  Most sites will get more of these referrals
than any other type.  The Webalizer converts this type of referral into
the string "- (Direct Request)".  This is done in order to make it easier
to hide via a command line option or configuration file option.  This is
because the character "-" is a valid character elsewhere in a referrer
field, and if not turned into something unique, could not be hidden without
possibly hiding other referrers that shouldn't be.
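
The following Python sketch shows why the rename helps.  It assumes a plain
(no wildcard) hide string is matched anywhere in the referrer; the host name
my-site.example and the function name are invented for the illustration.

```python
DIRECT_REQUEST = "- (Direct Request)"

def label_referrer(referrer):
    """Rename the blank referrer '-' so it can be matched unambiguously."""
    return DIRECT_REQUEST if referrer == "-" else referrer

referrers = ["-", "http://my-site.example/page.html"]
labelled = [label_referrer(r) for r in referrers]

# A bare "-" used as a hide string would match both entries...
print([r for r in referrers if "-" in r])
# ...while the renamed string matches only the direct request.
print([r for r in labelled if DIRECT_REQUEST in r])
```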


Notes on Character Escaping
---------------------------
The HTTP protocol defines certain ways that URL's can look and behave.  To
some extent, referrer fields follow most of the same conventions.  Character
escaping is a technique by which non-printable or other non-ASCII (and even
some ASCII) characters can be used in a URL.  This is done by placing the
hexadecimal value of the character in the URL, preceded by a percent sign '%'.
Since Hex values are made up of ASCII characters, any character can be
escaped to ensure only printable ASCII characters are present in the URL.
Some systems take this concept to the extreme and escape all sorts of stuff,
even characters that don't need to be escaped.  To deal with this, The
Webalizer will un-escape URL's and referrers before they are processed.  For
example, the URL "/www.mrunix.net/%7Ebrad/resume.html" is the same URL as
"/www.mrunix.net/~brad/resume.html", a very common form of a URL to access
users' web pages.  If the URL's were not un-escaped, they would be treated as
two separate documents, even though they are really one and the same.
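
Un-escaping can be tried out with the Python standard library, whose
urllib.parse.unquote function decodes %XX sequences.  This only illustrates
the idea; it is not how The Webalizer itself is implemented, and the wrapper
name unescape_url is made up for the example.

```python
from urllib.parse import unquote

def unescape_url(url):
    """Decode %XX escapes so equivalent URL's compare as equal."""
    return unquote(url)

escaped = "/www.mrunix.net/%7Ebrad/resume.html"
plain = "/www.mrunix.net/~brad/resume.html"

print(unescape_url(escaped))           # /www.mrunix.net/~brad/resume.html
print(unescape_url(escaped) == plain)  # True
```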


Search String Analysis
----------------------
The Webalizer will do a minimal analysis on referrer strings that
it finds, looking for well known search string patterns.  Most of
the major search engines are supported, such as yahoo, altavista,
lycos, etc...  Unfortunately, search engines are always changing
their internal/CGI query formats, new search engines are coming on
line every day, and the ability to detect _all_ search strings is
nearly impossible.  However, it should be accurate enough to give
a good indication of what users were searching for when they stumbled
across your site.
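
A minimal Python sketch of this kind of analysis is shown below.  The table
of host-name fragments and query parameters is an assumption made for the
example (search engines change their formats, as noted above), and the
function name is hypothetical.  Note that the search string has to be read
before the query part of the referrer is stripped.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed mapping of host-name fragment -> query parameter holding the
# search terms.  Illustrative only; real engines may use other names.
SEARCH_ENGINES = {
    "yahoo.": "p",
    "altavista.": "q",
    "lycos.": "query",
}

def search_string(referrer):
    """Return the search terms found in a referrer, or None."""
    parts = urlsplit(referrer)
    host = (parts.hostname or "").lower()
    for fragment, parameter in SEARCH_ENGINES.items():
        if fragment in host:
            values = parse_qs(parts.query).get(parameter)
            if values:
                return values[0].lower()
    return None

print(search_string("http://search.yahoo.com/search?p=usa%26global%26link"))
# usa&global&link
```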


Notes on Visits/Entry/Exit Figures
----------------------------------
The majority of data analyzed and reported on by The Webalizer is
as accurate and correct as possible based on the input log file.
However, due to the limitations of the HTTP protocol, the use of
firewalls, proxy servers, multi-user systems, the rotation of your
log files, and a myriad of other conditions, some of these numbers
cannot be calculated with absolute accuracy.  In particular,
Visits, Entry Pages and Exit Pages are subject to random errors
due to the above and other conditions.  The reason for this is
twofold: 1) Log files are finite in size and time interval, and
2) There is no way to distinguish multiple individual users apart
given only an IP address.  Because log files are finite, they have
a beginning and ending, which can be represented as a fixed time
period.  There is no way of knowing what happened previous to this
time period, nor is it possible to predict future events based on
it.  Also, because it is impossible to distinguish individual users
apart, multiple users that have the same IP address all appear to
be a single user, and are treated as such.  This is most common where
corporate users sit behind a proxy/firewall to the outside world,
and all requests appear to come from the same location (the address
of the proxy/firewall itself).  Dynamic IP assignment (used with
dial-up internet accounts) also presents a problem, since the same
user will appear to come from multiple places.

For example, suppose two users visit your server from XYZ company,
which has their network connected to the internet by a proxy server
'fw.xyz.com'.  All requests from the network look as though they
originated from 'fw.xyz.com', even though they were really initiated
from two separate users on different PC's.  The Webalizer would
see these requests as from the same location, and would record only
1 visit, when in reality, there were two.  Because entry and exit
pages are calculated in conjunction with visits, this situation
would also only record 1 entry and 1 exit page, when in reality,
there should be 2.
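
The proxy example can be reproduced with a small Python sketch.  It assumes
page requests given as (address, timestamp, url) tuples sorted by time and a
30 minute visit timeout; the function name and the URLs are invented for the
illustration.

```python
from datetime import datetime, timedelta

def entry_and_exit_pages(requests, timeout=timedelta(minutes=30)):
    """Return (entries, exits) for (address, timestamp, url) page requests.

    The first page of each visit is an entry page and the last page of
    each visit is an exit page.
    """
    last = {}                 # address -> (timestamp, url) of latest request
    entries, exits = [], []
    for address, when, url in requests:
        previous = last.get(address)
        if previous is None:
            entries.append(url)
        elif when - previous[0] > timeout:
            exits.append(previous[1])   # the earlier visit has ended
            entries.append(url)
        last[address] = (when, url)
    # Whatever each address requested last closes its final visit.
    exits.extend(url for _, url in last.values())
    return entries, exits

noon = datetime(2004, 9, 3, 12, 0)
log = [
    ("fw.xyz.com", noon, "/index.html"),                            # user A
    ("fw.xyz.com", noon + timedelta(minutes=2), "/products.html"),  # user B
    ("fw.xyz.com", noon + timedelta(minutes=5), "/contact.html"),   # user A
]
print(entry_and_exit_pages(log))
# (['/index.html'], ['/contact.html'])
```

Two people were browsing, but because both appear as 'fw.xyz.com' only one
entry page and one exit page are recorded.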

As another example, say a single user at XYZ company is surfing
around your website.  They arrive at 11:52pm the last day of
the month, and continue surfing until 12:30am, which is now a
new day (in a new month).  Since a common practice is to rotate
(save then clear) the server logs at the end of the month, you
now have the user's visit logged in two different files (current
and previous months).  Because of this (and the fact that the
Webalizer clears history between months), the first page the
user requests after midnight will be counted as an entry page.
This is unavoidable, since it is the first request seen from that
particular IP address in the new month.

For the most part, the numbers shown for visits, entry and exit
pages are pretty good 'guesses', even though they may not be 100%
accurate.  They do provide a good indication of overall trends,
and shouldn't be far enough off from the real numbers to matter much.
You should probably consider them as the 'minimum' amount possible,
since the actual (real) values should always be equal or greater
in all cases.


Known Issues
------------
 o Performance.  The Hide* and Group* configuration options can cause
    a performance decrease if lots of them are used.  The reason for
    this is that every log record must be scanned for each item in
    each list.  For example, if you are Hiding 20 objects and Grouping
    20 more, each record is scanned, at most, 40 times (20+20).
    On really large log files, this can have a profound impact.  It
    is recommended that you use as few of these configuration options
    as you can, as it will greatly improve performance.


Final Notes
-----------
A lot of time and effort went into making The Webalizer, and to ensure that
the results are as accurate as possible.  If you find any abnormalities or
inconsistent results, bugs, errors, omissions or anything else that doesn't
look right, please let me know so I can investigate the problem or correct
the error.  This goes for the minimal documentation as well.
Suggestions for future versions are also welcome and appreciated.


A word about the "Cumulative" report:  (Webalizer)
==================================================
Adhost has taken great pains to gather as much statistical information
as possible for all of each year, and to try to compile that into a useful
Webalizer report of your year-to-date activity.  The reports are quite
accurate, reporting every file access to your web site, and will be
updated daily.

For a few clients, usually sites hosted on non-unix servers, there may
be a few "holes" in your stats for the first months of each year.  Logs
either were corrupted, missing, or just not available for all days.
This might also have occurred if your site changed servers or was
temporarily disabled for any reason.  This will be obvious in the
Cumulative reports where you can see a graph of the individual day's
accesses.

If you see such a hole, DON'T PANIC!  It doesn't mean that your site
was down for that day, it just means we were unable to recover the
logfiles for that day.  Our servers are checked continuously, 24/7,
and we are paged if there is a problem.  We pride ourselves on our
"up time" and respond quickly to problems to maintain that reputation.

If you have questions about a particular day, please email your
question to stats@adhost.com.

Richard Stockton
Senior Webmaster
Adhost Internet Advertising
4-Oct-1999
Updated 3-Sep-2004