webalizer
NAME

webalizer - A web server log file analysis tool.
SYNOPSIS

webalizer [ option ... ] [ log-file ]

webazolver [ option ... ] [ log-file ]
DESCRIPTION

The Webalizer is a web server log file analysis
program which produces usage statistics in HTML format for
viewing with a browser. The results are presented in both
columnar and graphical format, which facilitates
interpretation. Yearly, monthly, daily and hourly usage
statistics are presented, along with the ability to display
usage by site, URL, referrer, user agent (browser),
username, search strings, entry/exit pages, and country
(some information may not be available if not present in the
log file being processed).

The Webalizer supports CLF (common log format)
log files, as well as Combined log formats as defined
by NCSA and others, and variations of these which it
attempts to handle intelligently. In addition, the
Webalizer also supports wu-ftpd xferlog
formatted log files, allowing analysis of ftp servers, and
squid proxy logs. Logs may also be compressed, via
gzip. If a compressed log file is detected, it will
be automatically uncompressed while it is read. Compressed
logs must have the standard gzip extension of
.gz.

webazolver is normally just a symbolic link to the
webalizer. When run as webazolver, only DNS
file creation/updates are performed, and the program will
exit once complete. All normal options and configuration
directives are available, however many will not be used. In
addition, a DNS cache file must be specified. If the number
of DNS children processes to use is not specified,
webazolver will default to 5.

This documentation applies to The Webalizer Version 2.01
RUNNING THE WEBALIZER

The Webalizer was designed to be run from a Unix
command line prompt or as a crond(8) job. Once
executed, the general flow of the program is:

A default configuration file is scanned for. A file named
webalizer.conf is searched for in the current
directory, and if found, its configuration data is parsed.
If the file is not present in the current directory, the
file /etc/webalizer.conf is searched for and, if
found, is used instead.

Any command line arguments given to the program are parsed.
This may include the specification of a configuration file,
which is processed at the time it is encountered.

If a log file was specified, it is opened and made ready for
processing. If no log file was given, STDIN is used
for input. If the log filename '-' is specified,
STDIN will be forced.

If an output directory was specified, the program does a
chdir(2) to that directory in preparation for
generating output. If no output directory was given, the
current directory is used.

If a non-zero number of DNS Children processes was
specified, they will be started, and the specified log file
will be processed, creating or updating the specified DNS
cache file.

If no hostname was given, the program attempts to get the
hostname using a uname(2) system call. If that fails,
localhost is used.

A history file is searched for in the current directory
(output directory) and read if found. This file keeps totals
for previous months, which are used in the main
index.html HTML document. Note: The file
location can now be specified with the HistoryName
configuration option.

If incremental processing was specified, a data file is
searched for and loaded if found, containing the 'internal
state' data of the program at the end of a previous run.
Note: The file location can now be specified with the
IncrementalName configuration option.

Main processing begins on the log file. If the log spans
multiple months, a separate HTML document is created for
each month.

After main processing, the main index.html page is
created, which has totals by month and links to each month's
HTML document.

A new history file is saved to disk, which includes totals
generated by The Webalizer during the current run.

If incremental processing was specified, a data file is
written that contains the 'internal state' data at the end
of this run.
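The first and third steps of the flow above (locating a default configuration file and choosing the input source) can be sketched in Python. This is only an illustration of the documented search order, not the program's own code; the function names find_default_config and choose_input are hypothetical.

```python
import os


def find_default_config(cwd, etc_dir="/etc", name="webalizer.conf"):
    """Return the path of the default configuration file, or None.

    The current directory is checked first; the copy under etc_dir is
    only used when the current directory has none.
    """
    for directory in (cwd, etc_dir):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def choose_input(log_file):
    """Return 'STDIN' when no log file or '-' was given, else the filename."""
    if log_file is None or log_file == "-":
        return "STDIN"
    return log_file
```

A file found in the current directory always wins, so a per-site webalizer.conf can shadow the system-wide one in /etc.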
INCREMENTAL PROCESSING

Version 1.2x of The Webalizer adds incremental run
capability. Simply put, this allows processing large log
files by breaking them up into smaller pieces, and
processing these pieces instead. What this means in real
terms is that you can now rotate your log files as often as
you want, and still be able to produce monthly usage
statistics without the loss of any detail. Basically, The
Webalizer saves and restores all internal data in a
file named webalizer.current. This allows the program
to 'start where it left off' so to speak, and allows the
preservation of detail from one run to the next. The data
file is placed in the current output directory, and is a
plain ASCII text file that can be viewed with any standard
text editor. Its location and name may be changed using the
IncrementalName configuration keyword.

Some special precautions need to be taken when using the
incremental run capability of The Webalizer.
Configuration options should not be changed between runs, as
that could cause corruption of the internal data stored. For
example, changing the MangleAgents level will cause
different representations of user agents to be stored,
producing invalid results in the user agents section of the
report. If you need to change configuration options, do it
at the end of the month after normal processing of the
previous month and before processing the current month. You
may also want to delete the webalizer.current file.

The Webalizer also attempts to prevent data
duplication by keeping track of the timestamp of the last
record processed. This timestamp is then compared to current
records being processed, and any records that were logged
previous to that timestamp are ignored. This, in theory,
should allow you to re-process logs that have already been
processed, or process logs that contain a mix of
processed/not yet processed records, and not produce
duplication of statistics. The only time this may break is
if you have duplicate timestamps in two separate log
files: any records in the second log file that have the
same timestamp as the last record in the previous log file
processed will be discarded as if they had already been
processed. There are lots of ways to prevent this, however;
for example, stopping the web server before rotating logs
will prevent this situation. This setup also necessitates
that you always process logs in chronological order,
otherwise data loss will occur as a result of the timestamp
compare.
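The timestamp comparison described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the program's actual implementation: the function name skip_processed is hypothetical, records are assumed to be (timestamp, line) pairs, and saved_stamp stands for the timestamp restored from the previous run (None on a first run).

```python
from datetime import datetime


def skip_processed(records, saved_stamp):
    """Drop records at or before the timestamp saved by the previous run.

    Returns the records that are kept and the newest timestamp seen,
    which would be saved for the next run.
    """
    kept = []
    newest = saved_stamp
    for stamp, line in records:
        # Records logged before the saved timestamp, or sharing it,
        # are treated as already processed.
        if saved_stamp is not None and stamp <= saved_stamp:
            continue
        kept.append((stamp, line))
        if newest is None or stamp > newest:
            newest = stamp
    return kept, newest
```

The sketch also shows the edge case mentioned above: a record in the second log that carries the same timestamp as the last record of the first log is dropped, even though it was never counted.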
REVERSE DNS LOOKUPS

The Webalizer supports reverse DNS lookups through a DNS
cache file that is either created/updated at run-time,
or has been previously created, either by a previous run of
the webalizer, or by running the stand-alone version,
webazolver. In order to perform reverse DNS lookups,
a DNSCache filename must be specified. In order to
create/update the cache file at run-time, the
DNSChildren number must be non-zero. The
DNSChildren value specifies the number of children
processes to fork, each of which will perform reverse DNS
lookups in order to create/update the DNS cache file. See
the file DNS.README for additional information.
COMMAND LINE OPTIONS

The Webalizer supports many different configuration options
that will alter the way the program behaves and generates
output. Most of these can be specified on the command line,
while some can only be specified in a configuration file.
The command line options are listed below, with references
to the corresponding configuration file keywords.

Display all available command line options and exit
program.

Display program version and exit program.

Debug. Display debugging information for errors and
warnings.

IgnoreHist. Ignore history. USE WITH CAUTION.
This will cause The Webalizer to ignore any previous
monthly history file only. Incremental data (if present) is
still processed.

Incremental. Preserve internal data between runs.

Quiet. Suppress informational messages. Does not
suppress warnings or errors.

ReallyQuiet. Suppress all messages including warnings
and errors.

TimeMe. Force display of timing information at end of
processing.

Use configuration file file.

HostName. Use the hostname name.

OutputDir. Use output directory dir.

ReportTitle. Use name for report title.

LogType. Specify log type to be processed. Value can
be either clf, ftp or squid format. If
not specified, will default to CLF format. FTP
logs must be in standard wu-ftpd xferlog format.

FoldSeqErr. Fold out of sequence log records back
into analysis, by treating as if they were the same
date/time as the last good record. Normally, out of sequence
log records are simply ignored.

CountryGraph. Suppress country graph.

HourlyGraph. Suppress hourly graph.

HTMLExtension. Defines HTML file extension to use. If
not specified, defaults to html. Do not include the
leading period.

HourlyStats. Suppress hourly statistics.

GraphLegend. Suppress color coded graph legends.

GraphLines. Specify number of background lines.
Default is 2. Use zero ('0') to disable the lines.

PageType. Specify file extensions that are considered
pages. Sometimes referred to as pageviews.

VisitTimeout. Specify the Visit timeout period.
Specified in number of seconds. Default is 1800 seconds (30
minutes).

IndexAlias. Use the filename name as an
additional alias for index.*.

MangleAgents. Mangle user agent names according to
the mangle level specified by num. Mangle levels
are:

5 Browser name and major version.

4 Browser name, major and minor version.

3 Browser name, major version, minor version to two
decimal places.

2 Browser name, major and minor versions and
sub-version.

1 Browser name, version and machine type if
possible.

0 All information (left unchanged).

GroupDomains. Automatically group sites by domain.
The grouping level specified by num can be thought of
as 'the number of dots' to display in the grouping. The
default value of 0 disables any domain grouping.

DNSCache. Use the DNS cache file name.

DNSChildren. Use num DNS children processes to
perform DNS lookups, either creating or updating the DNS
cache file. Specify zero (0) to disable cache file
creation/updates. If given, a DNS cache filename must be
specified.

HideAgent. Hide user agents matching name.

HideReferrer. Hide referrer matching name.

HideSite. Hide site matching name.

HideAllSites. Hide all individual sites (only display
groups).

HideURL. Hide URL matching name.

TopAgents. Display the top num user agents table.

TopReferrers. Display the top num referrers table.

TopSites. Display the top num sites table.

TopURLs. Display the top num URLs table.

TopCountries. Display the top num countries table.

TopEntry. Display the top num entry pages table.

TopExit. Display the top num exit pages table.
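The 'number of dots' wording used for GroupDomains above can be illustrated with a short Python sketch. It shows one reading of that description and is not the program's own code: the function name group_domain is hypothetical, and leaving numeric addresses ungrouped is an assumption made for this example.

```python
def group_domain(hostname, level):
    """Return the domain a host would be grouped under, or None.

    level is the number of dots kept, counted from the right-hand end
    of the name. A level of 0 disables grouping.
    """
    if level <= 0:
        return None
    labels = hostname.rstrip(".").split(".")
    # Assumption: unresolved numeric addresses are not grouped.
    if all(label.isdigit() for label in labels):
        return None
    # Keeping level dots means keeping level + 1 labels.
    return ".".join(labels[-(level + 1):])
```

With this reading, a host named cust1.tnt.mia.uu.net is grouped under uu.net at level 1 and under mia.uu.net at level 2.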
CONFIGURATION FILES

Configuration files are standard ascii(7) text files
that may be created or edited using any standard editor.
Blank lines and lines that begin with a pound sign ('#') are
ignored. Any other lines are considered to be configuration
lines, and have the form "Keyword Value", where
the Keyword is one of the currently available configuration
keywords defined below, and 'Value' is the value to assign
to that particular option. Any text found after the keyword
up to the end of the line is considered the keyword's value,
so you should not include anything after the actual value on
the line that is not actually part of the value being
assigned. The file sample.conf provided with the
distribution contains lots of useful documentation and
examples as well.

General Configuration Keywords

Use log file named name. If none specified,
STDIN will be used.

Specify log file type as name. Values can be either
web, squid or ftp, with the default
being web.

Create output in the directory dir. If none
specified, the current directory will be used.

Filename to use for history file. Relative to output
directory unless absolute name is given (i.e. starts with
'/'). Defaults to webalizer.hist in the standard
output directory.

Use the title string name for the report title. If
none specified, use the default of (in English)
"Usage Statistics for ".

Set the hostname for the report as name. If none
specified, an attempt will be made to gather the hostname
via a uname(2) system call. If that fails,
localhost will be used.

Use https:// on links to URLs, instead of the default
http://, in the 'Top URL's' table.

Suppress informational messages. Warning and Error messages
will not be suppressed.

Suppress all messages, including Warning and Error
messages.

Print extra debugging information on Warnings and
Errors.

Force timing information at end of processing.

Use GMT (UTC) time instead of local timezone
for reports.

Ignore previous monthly history file. USE WITH
CAUTION. Does not prevent Incremental file
processing.

Fold out of sequence log records back into analysis by
treating them as if they had the same date/time as the last
good record. Normally, out of sequence log records are
ignored.

CountryGraph ( yes | no )

Display Country Usage Graph in output report.

Display Daily Graph in output report.

Display Daily Statistics in output report.

Display Hourly Graph in output report.

Display Hourly Statistics in output report.

Define the file extensions to consider as a page. If
a file is found to have the same extension as name,
it will be counted as a page (sometimes called a
pageview).

Allows the color coded graph legends to be
enabled/disabled.

Specify the number of background reference lines displayed
on the graphs produced. Disable by using zero ('0'),
default is 2.

Specifies the visit timeout value. Default is 1800
seconds (30 minutes). A visit is determined by looking
at the difference in time between the current and last
request from a specific site. If the difference is greater
or equal to the timeout value, the request is counted as a
new visit. Specified in seconds.

Use name as an additional alias for index.*.

Mangle user agent names based on mangle level num.
See the -M command line switch for mangle levels and
their meaning. The default is 0, which doesn't mangle
user agents at all.

SearchEngine name variable

Allows the specification of search engines and their query
strings. The name is the name to match against the
referrer string for a given search engine. The
variable is the cgi variable that the search engine
uses for queries. See the sample.conf file for
example usage with common search engines.

Enable Incremental mode processing.

Filename to use for incremental data. Relative to output
directory unless an absolute name is given (i.e. starts with
'/'). Defaults to webalizer.current in the standard
output directory.

Filename to use for the DNS cache. Relative to output
directory unless an absolute name is given (i.e. starts with
'/').

Number of children DNS processes to run in order to
create/update the DNS cache file. Specify zero (0) to
disable.

Display the top num User Agents table. Use zero to
disable.

Create separate HTML page with All User Agents.

Display the top num Referrers table. Use zero to
disable.

AllReferrers ( yes | no )

Create separate HTML page with All Referrers.

Display the top num Sites table. Use zero to
disable.

Display the top num Sites (by KByte) table. Use zero
to disable.

Create separate HTML page with All Sites.

Display the top num URLs table. Use zero to
disable.

Display the top num URLs (by KByte) table. Use zero
to disable.

Create separate HTML page with All URLs.

Display the top num Countries in the table. Use zero
to disable.

Display the top num Entry Pages in the table. Use
zero to disable.

Display the top num Exit Pages in the table. Use zero
to disable.

Display the top num Search Strings in the table. Use
zero to disable.

AllSearchStr ( yes | no )

Create separate HTML page with All Search Strings.

Display the top num Usernames in the table. Use zero
to disable. Usernames are only available if using http based
authentication.

Create separate HTML page with All Usernames.

Hide/Ignore/Group/Include Keywords

Hide User Agents that match name.

Hide Referrers that match name.

Hide Sites that match name.

HideAllSites ( yes | no )

Hide all individual sites. This causes only grouped sites to
be displayed.

Hide URLs that match name.

Hide Usernames that match name.

Ignore User Agents that match name.

Ignore Referrers that match name.

Ignore Sites that match name.

Ignore URLs that match name.

Ignore Usernames that match name.

Group User Agents that match name. Display
Label in 'Top Agent' table if given (instead of
name).

GroupReferrer name [Label]

Group Referrers that match name. Display Label
in 'Top Referrer' table if given (instead of
name).

Group Sites that match name. Display Label in
'Top Site' table if given (instead of
name).

Automatically group sites by domain. The value num
specifies the level of grouping, and can be thought of as
the 'number of dots' to be displayed. The default value of
0 disables domain grouping.

Group URLs that match name. Display Label in
'Top URL' table if given (instead of
name).

Group Usernames that match name. Display Label
in 'Top Usernames' table if given (instead of
name).

Force inclusion of sites that match name. Takes
precedence over Ignore* keywords.

Force inclusion of URLs that match name. Takes
precedence over Ignore* keywords.

Force inclusion of Referrers that match name. Takes
precedence over Ignore* keywords.

Force inclusion of User Agents that match name. Takes
precedence over Ignore* keywords.

Force inclusion of Usernames that match name. Takes
precedence over Ignore* keywords.

Defines the HTML file extension to use. Default is
html. Do not include the leading period!

Insert text at the very beginning of the generated
HTML file. Defaults to a standard html 3.2 DOCTYPE
record.

Insert text within the <HEAD></HEAD>
block of the HTML file.

Insert text in HTML page, starting with the
<BODY> tag. If used, the first line must be a
<BODY ...> tag. Multiple lines may be
specified.

Insert text at top (before horizontal rule) of HTML
pages. Multiple lines may be specified.

Insert text at bottom of the HTML page. The
text is top and right aligned within a table column
at the end of the report.

Insert text at the very end of the HTML page. If not
specified, the default is to insert the ending </BODY>
and </HTML> tags. If used, you must supply
these tags yourself.

The Webalizer allows you to export processed data to other
programs by using tab delimited text files. The
Dump* commands specify which files are to be written,
and where.

Save dump files in directory name. If not specified,
the default output directory will be used. Do not specify a
trailing slash ('/').

Use name as the filename extension for dump files. If
not given, the default of tab will be used.

Print a column header as the first record of the
file.

Dump the sites data to a tab delimited file.

Dump the url data to a tab delimited file.

DumpReferrers ( yes | no )

Dump the referrer data to a tab delimited file. This data is
only available if using a log that contains referrer
information (i.e. a combined format web log).

Dump the user agent data to a tab delimited file. This data
is only available if using a log that contains user agent
information (i.e. a combined format web log).

Dump the username data to a tab delimited file. This data is
only available if processing a wu-ftpd xferlog or a web log
that contains http authentication information.

DumpSearchStr ( yes | no )

Dump the search string data to a tab delimited file. This
data is only available if processing a web log that contains
referrer information and had search string information
present.
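The "Keyword Value" line format described at the start of this section can be sketched with a small Python parser. This is an illustration of the stated rules only, not the program's parser; the function name parse_config is hypothetical, and ignoring leading whitespace before a '#' is an assumption.

```python
def parse_config(text):
    """Parse configuration text into a list of (keyword, value) pairs.

    Blank lines and lines beginning with '#' are skipped. Everything
    after the keyword, up to the end of the line, becomes the value.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        # Every line is kept in order, so repeated keywords are not collapsed.
        entries.append((keyword, value))
    return entries
```

Because the value runs to the end of the line, a line such as "HideSite localhost # local" yields the value "localhost # local"; a trailing remark is not treated as a comment.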
FILES

Default configuration file. Is searched for in the current
directory and if not found, in the /etc/
directory.

Monthly history file for previous 12 months. (can be
changed)

Current state data file (Incremental processing). (can be
changed)

Various monthly HTML output files produced.
(extension can be changed)

Various monthly image files used in the reports.

Monthly tab delimited text files. (extension can be
changed)
BUGS

Report bugs to brad@mrunix.net.
COPYRIGHT

Copyright (C) 1997-2000 by Bradford L. Barrett. Distributed
under the GNU GPL. See the files "COPYING"
and "Copyright", supplied with all
distributions for additional information.
AUTHOR

Bradford L. Barrett
<brad@mrunix.net>
The Webalizer - A web server log file analysis tool
-- README
Copyright 1997-2000 by Bradford L. Barrett (brad@mrunix.net)
Distributed under the GNU GPL. See the files "COPYING" and
"Copyright" supplied with the distribution for additional info.
What is The Webalizer?
----------------------
The Webalizer is a web server log file analysis program which produces
usage statistics in HTML format for viewing with a browser. The results
are presented in both columnar and graphical format, which facilitates
interpretation. Yearly, monthly, daily and hourly usage statistics are
presented, along with the ability to display usage by site, URL, referrer,
user agent (browser), search string, entry/exit page, username and country
(some information is only available if supported and present in the log
files being processed). Processed data may also be exported into most
database and spreadsheet programs that support tab delimited data formats.
The Webalizer supports CLF (common log format) log files, as well as
Combined log formats as defined by NCSA and others, and variations
of these which it attempts to handle intelligently. In addition, wu-ftpd
xferlog formatted logs and squid proxy logs are supported.
Gzip compressed logs may now be used as input directly. Any log filename
that ends with a '.gz' extension will be assumed to be in gzip format and
uncompressed on the fly as it is being read. In addition, the Webalizer
also supports DNS lookup capabilities if enabled at compile time. See
the file DNS.README for additional information.
This documentation applies to The Webalizer Version 2.01
Running the Webalizer
---------------------
The Webalizer was designed to be run from a Unix command line prompt or
as a cron job. There are several command line options which will modify
the results it produces, and configuration files can be used as well.
The format of the command line is:
webalizer [options ...] [log-file]
Where 'options' can be one or more of the supported command line
switches described below. 'log-file' is the name of the log file
to process (see below for more detailed information). If a dash
("-") is specified for the log-file name, STDIN will be used.
Once executed, the general flow of the program follows:
o A default configuration file is scanned for. A file named
'webalizer.conf' is searched for in the current directory, and if
found, its configuration data is parsed. If the file is not
present in the current directory, the file '/etc/webalizer.conf'
is searched for and, if found, is used instead.
o Any command line arguments given to the program are parsed. This
may include the specification of a configuration file, which is
processed at the time it is encountered.
o If a log file was specified, it is opened and made ready for
processing. If no log file was given, or the filename '-' is
specified on the command line, STDIN is used for input.
o If an output directory was specified, the program does a 'chdir' to
that directory in preparation for generating output. If no output
directory was given, the current directory is used.
o If a non-zero number of DNS Children processes was specified, they
will be started, and the specified log file will be processed,
either creating or updating the specified DNS cache file.
o If no hostname was given, the program attempts to get the hostname
using a uname system call. If that fails, 'localhost' is used.
o A history file is searched for. This file keeps previous month
totals used on the main index.html page. The default file is
named 'webalizer.hist', kept in the specified output directory,
but this may be changed using the "HistoryName" configuration file
keyword.
o If incremental processing was specified, a data file is searched for
and loaded if found, containing the 'internal state' data of the
program at the end of a previous run. The default file is named
'webalizer.current', kept in the specified output directory, but this
may be changed using the "IncrementalName" configuration file keyword.
o Main processing begins on the log file. If the log spans multiple
months, a separate HTML document is created for each month.
o After main processing, the main 'index.html' page is created, which
has totals by month and links to each month's HTML document.
o A new history file is saved to disk, which includes totals generated
by The Webalizer during the current run.
o If incremental processing was specified, a data file is written that
contains the 'internal state' data at the end of this run.
Incremental Processing
----------------------
Version 1.2x of The Webalizer adds incremental run capability. Simply
put, this allows processing large log files by breaking them up into
smaller pieces, and processing these pieces instead. What this means
in real terms is that you can now rotate your log files as often as you
want, and still be able to produce monthly usage statistics without the
loss of any detail. This is accomplished by saving and restoring all
relevant internal data to a disk file between runs. Doing so allows the
program to 'start where it left off' so to speak, and allows the
preservation of detail from one run to the next.
Some special precautions need to be taken when using the incremental
run capability of The Webalizer. Configuration options should not be
changed between runs, as that could cause corruption of the internal
stored data. For example, changing the MangleAgents level will cause
different representations of user agents to be stored, producing invalid
results in the user agents section of the report. If you need to change
configuration options, do it at the end of the month after normal
processing of the previous month and before processing the current month.
You may also want to delete the 'webalizer.current' file (or
whatever name was specified using the "IncrementalName" configuration
option).
The Webalizer also attempts to prevent data duplication by keeping
track of the timestamp of the last record processed. This timestamp
is then compared to current records being processed, and any records
that were logged previous to that timestamp are ignored. This, in
theory, should allow you to re-process logs that have already been
processed, or process logs that contain a mix of processed/not yet
processed records, and not produce duplication of statistics. The
only time this may break is if you have duplicate timestamps in two
separate log files... any records in the second log file that do have
the same timestamp as the last record in the previous log file processed,
will be discarded as if they had already been processed. There are
lots of ways to prevent this however, for example, stopping the web
server before rotating logs will prevent this situation. This setup
also necessitates that you always process logs in chronological order,
otherwise data loss will occur as a result of the timestamp compare.
Output Produced
---------------
The Webalizer produces several reports (html) and graphics for each
month processed. In addition, a summary page is generated for the
current and previous months (up to 12), a history file is created
and, if incremental mode is used, a data file holding the current
month's processed data is written. The exact location and names of
these files can be changed using configuration files and command line
options. The files produced (default names) are:
index.html - Main summary page (extension may be changed)
usage.png - Yearly graph displayed on the main index page
usage_YYYYMM.html - Monthly summary page (extension may be changed)
usage_YYYYMM.png - Monthly usage graph for specified month/year
daily_usage_YYYYMM.png - Daily usage graph for specified month/year
hourly_usage_YYYYMM.png - Hourly usage graph for specified month/year
site_YYYYMM.html - All sites listing (if enabled)
url_YYYYMM.html - All urls listing (if enabled)
ref_YYYYMM.html - All referrers listing (if enabled)
agent_YYYYMM.html - All user agents listing (if enabled)
search_YYYYMM.html - All search strings listing (if enabled)
webalizer.hist - Previous month history (may be changed)
webalizer.current - Incremental Data (may be changed)
site_YYYYMM.tab - tab delimited sites file
url_YYYYMM.tab - tab delimited urls file
ref_YYYYMM.tab - tab delimited referrers file
agent_YYYYMM.tab - tab delimited user agents file
user_YYYYMM.tab - tab delimited usernames file
search_YYYYMM.tab - tab delimited search string file
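The YYYYMM naming pattern in the list above can be sketched in Python. This only illustrates how the default names are composed for one month; the function name monthly_file_names and the dictionary keys are hypothetical, and only a few of the listed files are shown.

```python
def monthly_file_names(year, month, html_ext="html", dump_ext="tab"):
    """Build some of the default per-month output names for one month."""
    stamp = f"{year:04d}{month:02d}"  # the YYYYMM part of each name
    return {
        "summary": f"usage_{stamp}.{html_ext}",
        "monthly_graph": f"usage_{stamp}.png",
        "daily_graph": f"daily_usage_{stamp}.png",
        "hourly_graph": f"hourly_usage_{stamp}.png",
        "sites_dump": f"site_{stamp}.{dump_ext}",
    }
```

For September 2000, for example, the monthly summary page would be named usage_200009.html.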
The yearly (index) report shows statistics for a 12 month period, and
links to each month. The monthly report has detailed statistics for
that month with additional links to any URL's and referrers found.
The various totals shown are explained below.
Hits
Any request made to the server that is logged is considered a 'hit'.
The requests can be for anything... html pages, graphic images, audio
files, CGI scripts, etc... Each valid line in the server log is
counted as a hit. This number represents the total number of requests
that were made to the server during the specified report period.
Files
Some requests made to the server require that the server then send
something back to the requesting client, such as an HTML page or graphic
image. When this happens, it is considered a 'file' and the files
total is incremented. The relationship between 'hits' and 'files' can
be thought of as 'incoming requests' and 'outgoing responses'.
Pages
Pages are, well, pages! Generally, any HTML document, or anything
that generates an HTML document, would be considered a page. This
does not include the other stuff that goes into a document, such as
graphic images, audio clips, etc... This number represents the number
of 'pages' requested only, and does not include the other 'stuff' that
is in the page. What actually constitutes a 'page' can vary from
server to server. The default action is to treat anything with the
extension '.htm', '.html' or '.cgi' as a page. A lot of sites will
probably define other extensions, such as '.phtml', '.php3' and '.pl'
as pages as well. Some people consider this number as the number of
'pure' hits... I'm not sure if I totally agree with that viewpoint.
Some other programs (and people :) refer to this as 'Pageviews'.
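The extension test described above can be sketched in Python. It is a minimal illustration that assumes the default extensions named in this section ('.htm', '.html' and '.cgi'); the function name is_page is hypothetical, and ignoring the query string and letter case are assumptions made for this example.

```python
import posixpath
from urllib.parse import urlsplit

DEFAULT_PAGE_TYPES = ("htm", "html", "cgi")


def is_page(url, page_types=DEFAULT_PAGE_TYPES):
    """Return True when the URL's file extension is one of the page types."""
    path = urlsplit(url).path  # drop any query string
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in page_types
```

A site that serves '.php3' documents would pass its own tuple, for example is_page("/docs/page.php3", ("htm", "html", "php3")).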
Sites
Each request made to the server comes from a unique 'site', which can
be referenced by a name or ultimately, an IP address. The 'sites'
number shows how many unique IP addresses made requests to the server
during the reporting time period. This DOES NOT mean the number of
unique individual users (real people) that visited, which is impossible
to determine using just logs and the HTTP protocol (however, this
number might be about as close as you will get).
Visits
Whenever a request is made to the server from a given IP address
(site), the amount of time since the previous request by that address
(if any) is calculated. If the time difference is greater than a
pre-configured 'visit timeout' value (or the address has never made a
request before), it is considered a 'new visit', and this total is
incremented (both for the site, and the IP address). The default timeout value is 30
minutes (can be changed), so if a user visits your site at 1:00 in
the afternoon, and then returns at 3:00, two visits would be registered.
Note: in the 'Top Sites' table, the visits total should be discounted
on 'Grouped' records, and thought of as the "Minimum number of visits"
that came from that grouping instead. Note: Visits only occur on
PageType requests, that is, for any request whose URL is one of the
'page' types defined with the PageType option. Due to the limitation
of the HTTP protocol, log rotations and other factors, this number
should not be taken as absolutely accurate, rather, it should be
considered a pretty close "guess".
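The visit rule above can be sketched in a few lines of Python. This is
only an illustration of the timeout logic, not the program's own code;
the function name count_visits and the input format (a time-sorted list
of (timestamp in seconds, IP address) pairs, already limited to page
requests) are assumptions made for the example.

```python
VISIT_TIMEOUT = 1800  # seconds; the documented default of 30 minutes

def count_visits(requests, timeout=VISIT_TIMEOUT):
    """Count visits from (timestamp_seconds, ip_address) page requests.

    `requests` is assumed to be sorted by timestamp and to contain
    page-type requests only.
    """
    last_seen = {}   # ip_address -> timestamp of its previous request
    visits = 0
    for timestamp, ip in requests:
        previous = last_seen.get(ip)
        # A new visit: first request ever, or the gap exceeds the timeout.
        if previous is None or timestamp - previous > timeout:
            visits += 1
        last_seen[ip] = timestamp
    return visits
```

With the example from the text, a request at 1:00 PM (46800 seconds into
the day) and another at 3:00 PM (54000) from the same address are 7200
seconds apart, which exceeds the timeout, so two visits are counted.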
KBytes
The KBytes (kilobytes) value shows the amount of data, in KB, that
was sent out by the server during the specified reporting period. This
value is generated directly from the log file, so it is up to the
web server to produce accurate numbers in the logs (some web servers
do stupid things when it comes to reporting the number of bytes). In
general, this should be a fairly accurate representation of the amount
of outgoing traffic the server had, regardless of the web server's
reporting quirks.
Note: A kilobyte is 1024 bytes, not 1000 :)
Top Entry and Exit Pages
The Top Entry and Exit tables give a rough estimate of what URL's
are used to enter your site, and what the last pages viewed are.
Because of limitations in the HTTP protocol, log rotations, etc...
these numbers should be considered a good "rough guess" of the actual
values, but they will give a good indication of the overall trend in
where users come into, and exit, your site.
Command Line Options
--------------------
The Webalizer supports many different configuration options that will
alter the way the program behaves and generates output. Most of these
can be specified on the command line, while some can only be specified
in a configuration file. The command line options are listed below,
with references to the corresponding configuration file keywords.
--------------------------------------------------------------------------
General Options
---------------
-h Display all available command line options and exit program.
-v Display program version and exit program.
-d Display additional 'debugging' information for errors and
warnings produced during processing. This normally would
not be used except to determine why you are getting all those
errors and wanted to see the actual data. Normally The
Webalizer will just tell you it found an error, not the
actual data. This option will display the data as well.
Config file keyword: Debug
-F Specify that the log being used is a ftp log. Normally, the
Webalizer expects to find a valid CLF or Combined format
web server log file. This option allows you to process wu-ftpd
xferlogs as well.
Config file keyword: LogType
-f Fold out of sequence log records back into analysis, by
treating them as if they were the same date/time as the
last good record. Normally, out of sequence log records
are ignored. If you run apache, don't worry about this.
Config file keyword: FoldSeqErr
-i Ignore history file. USE WITH CAUTION. This causes The
Webalizer to ignore any existing history file produced from
previous runs and generate its output from scratch. The
effect will be as if The Webalizer is being run for the
first time, and any previous statistics will be lost from the
main index.html (yearly) web page (although the HTML
documents, if any, will not be deleted).
Config file keyword: IgnoreHist
-p Preserve state (incremental processing). This allows the
processing of partial logs in increments. At the end of
the program, all relevant internal data is saved, so that
it may be restored the next time the program is run. This
allows sites that must rotate their logs more than once a
month to still be able to use The Webalizer, and not worry
about having to gather and feed an entire month's logs to
the program at the end of the month. See the section on
"Incremental Processing" below for additional information.
The default is to not perform incremental processing. Use
this command line option to enable the feature.
Config file keyword: Incremental
-q Quiet mode. Normally, The Webalizer will produce various
messages while it runs letting you know what it's doing.
This option will suppress those messages. It should be
noted that this WILL NOT suppress errors and warnings, which
are output to STDERR.
Config file keyword: Quiet
-Q ReallyQuiet mode. This allows suppression of _all_ messages
generated by The Webalizer, including warnings and errors.
Useful when The Webalizer is run as a cron job.
Config file keyword: ReallyQuiet
-T Display timing information. The Webalizer keeps track of the
time it begins and ends processing, and normally displays the
total processing time at the end of each run. If quiet mode
(-q or 'Quiet yes' in configuration file) is specified, this
information is not displayed. This option forces the display
of timing totals if quiet mode has been specified, otherwise
it is redundant and will have no effect.
Config file keyword: TimeMe
-c file This option specifies a configuration file to use. Configuration
files allow greater control over how The Webalizer behaves, and
there are several ways to use them. As of version 0.98, The
Webalizer searches for a default configuration file in the
current directory named "webalizer.conf", and if not found,
will search in the /etc/ directory for a file of the same name.
In addition, you may specify a configuration file to use with
this command line option.
-n name This option specifies the hostname for the reports generated.
The hostname is used in the title of all reports, and is also
prepended to URL's in the reports. This allows The Webalizer
to be run on log files for 'virtual' web servers or web servers
that are different than the machine the reports are located on,
and still allows clicking on the URL's to go to the proper
location. If a hostname is not specified, either on the
command line or in a configuration file, The Webalizer attempts
to determine the hostname using a 'uname' system call. If this
fails, "localhost" will be used as the hostname.
Config file keyword: HostName
-o dir This option specifies the output directory for the reports.
If not specified here or in a configuration file, the current
default directory will be used for output.
Config file keyword: OutputDir
-x name This option allows the generated pages to have an extension
other than '.html', which is the default. Do not include the
leading period ('.') when you specify the extension.
Config file keyword: HTMLExtension
-P name Specify the file extensions for 'pages'. Pages (sometimes
called 'PageViews') are normally html documents and CGI
scripts that display the whole page, not just parts of it.
Some systems will need to define a few more, such as 'phtml',
'php3' or 'pl' in order to have them counted as well. The
default is 'htm*' and 'cgi' for web logs and 'txt' for ftp.
Config file keyword: PageType
-t name This option specifies the title string for all reports. This
string is used, in conjunction with the hostname (if not blank)
to produce the actual title. If not specified, the default of
"Usage Statistics for" will be used.
Config file keyword: ReportTitle
-Y Suppress Country graph. Normally, The Webalizer produces
country statistics in both Graph and Columnar forms. This
option will suppress the Country Graph from being generated.
Config file keyword: CountryGraph
-G Suppress hourly graph. Normally, The Webalizer produces
hourly statistics in both Graph and Columnar forms. This
option will suppress the Hourly Graph only from being generated.
Config file keyword: HourlyGraph
-H Suppress Hourly statistics. Normally, The Webalizer produces
hourly statistics in both Graph and Columnar forms. This
option will suppress the Hourly Statistics table only from
being generated.
Config file keyword: HourlyStats
-L Disable Graph Legends. The color coded legends displayed on
the in-line graphs can be disabled with this option. The
default is to display the legends.
Config file keyword: GraphLegend
-l num Graph Lines. Specify the number of background reference
lines displayed on the in-line graphics produced. The default
is 2 lines, but the value can range anywhere from zero ('0') for
no lines, up to 20 lines (looks funny!).
Config file keyword: GraphLines
-P name Page type. This is the extension of files you consider to
be pages for Pages calculations (sometimes called 'pageviews').
The default is 'htm*' and 'cgi' (plus whatever HTMLExtension
you specified if it is different). Don't use a period!
Config file keyword: PageType
-m num Specify a 'visit timeout'. Visits are calculated by looking at
the time difference between the current and last request made
by a specific host. If the difference is greater than the
visit timeout value, the request is considered a new visit.
This value is specified in number of seconds. The default
is 30 minutes (1800).
Config file keyword: VisitTimeout
-M num Mangle user agent names. Normally, The Webalizer will keep
track of the user agent field verbatim. Unfortunately, there are
a ton of different names that user agents go by, and the field
also reports other items such as machine type and OS used. For
Example, Netscape 4.03 running on Windows 95 will report a
different string than Netscape 4.03 running on Windows NT, so even
though they are the same browser type, they will be considered
as two totally different browsers by The Webalizer. For that
matter, Netscape 4.0 running on Windows NT will report different
names if one is run on an Alpha and the other on an Intel
processor! Internet Exploder is even worse, as it reports itself
as if it were Netscape and you have to search the given string a
little deeper to discover that it is really MSIE! In order to
consolidate generic browser types, this option will cause The
Webalizer to 'mangle' the user agent field. There are 6 levels
(0 through 5) that can be
specified, each producing different levels of detail. Level 5
displays only the browser name (MSIE or Mozilla) and the major
version number. Level 4 will also display the minor version
number (single decimal place). Level 3 will display the minor
version number to two decimal places. Level 2 will add any
sub-level designation (such as Mozilla/3.01Gold or MSIE 3.0b).
Level 1 will also attempt to add the system type. The default
Level 0 will disable name mangling and leave the user agent
field unmodified, producing the greatest amount of detail.
Config file keyword: MangleAgents
-g num This option allows you to specify the level of domain name
grouping to be performed. The numeric value represents the
level of grouping, and can be thought of as the 'number of
dots' to be displayed. The default value of 0 disables any
domain name grouping.
Config file keyword: GroupDomains
-D name This allows the specification of a DNS Cache file name. This
filename MUST be specified if you have dns lookups enabled
(using the -N command line switch or DNSChildren configuration
keyword). The filename is relative to the default output
directory if an absolute path is not specified (ie: starts
with a leading '/'). This option is only available if DNS
support was enabled at compile time, otherwise an 'Invalid
Keyword' error will be generated. See the DNS.README file
for additional information regarding DNS lookups.
-N num Number of DNS child processes to use for reverse DNS lookups.
If specified, a DNSCache name MUST be specified also. If you
do not wish a DNS cache file to be generated, specify a value
of zero ('0') to disable it. This does not prevent using an
existing cache file, only the generation of one at run time.
See the DNS.README file for additional information regarding
DNS lookups.
Hide Options
------------
The following options take a string argument to use as a comparison
for matching. Except for the IndexAlias option, the string argument
can be plain text, or plain text that either starts or ends with the
wildcard character '*'.
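The matching rule just described can be sketched in Python as follows.
The function name matches is made up for this illustration, and the
sketch assumes a case-sensitive comparison, which the text here does not
specify. It is not The Webalizer's own matching code.

```python
def matches(pattern, value):
    """Match `value` against a hide-style pattern.

    '*text' matches values ending in text, 'text*' matches values
    starting with text, and plain text matches anywhere in the value.
    """
    if pattern.startswith("*"):
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return pattern in value
```

Applied to the example strings used in this section, matches("was",
"yourmama/was/here"), matches("*here", "yourmama/was/here") and
matches("your*", "yourmama/was/here") all return True.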
For Example:
Given the string "yourmama/was/here", the arguments "was", "*here" and
"your*" will all produce a match.
-a name This option allows hiding of user agents (browsers) from the
"Top User Agents" table in the report. This option really
isn't too useful as there are a zillion different names that
current browsers go by, depending where they were obtained,
however you might have some particular user agents that hit
your site a lot that you would like to exclude from the list.
You must have a web server that includes user agents in its
log files for this option to be of any use. In addition, it
is also useless if you disable the user agent table in the
report (see the -A command line option or "TopAgents"
configuration file keyword). You can specify as many of these
as you want on the command line. The wildcard character '*'
can be used either in front of or at the end of the string.
(ie: Mozilla/4.0* would match anything that starts with the
string "Mozilla/4.0").
Config file keyword: HideAgent
-r name This option allows hiding of referrers from the "Top Referrer"
table in the report. Referrers are URL's, either on your own
local site or a remote site, that referred the user to a URL
on your web server. This option is normally used to hide
your own server from the table, as your own pages are usually
the top referrers to your own pages (well, you get the idea).
You must have a web server that includes referrer information
in the log files for this option to be of any use. In addition,
it is also useless if you disable the referrers table in the
report (see the -R command line option or "TopReferrers"
configuration file keyword). You can specify as many of these
as you like on the command line.
Config file keyword: HideReferrer
-s name This option allows hiding of sites from the "Top Sites" table
in the report. Normally, you will only want to hide your own
domain name from the report, as it usually is one of the top
sites to visit your web server. This option is of no use if
you disable the top sites table in the report (see the -S
command line option or "TopSites" configuration file option).
Config file keyword: HideSite
-X This causes all individual sites to be hidden, so that
only grouped sites are displayed in the report.
Config file keyword: HideAllSites
-u name This option allows hiding of URL's from the "Top URL's" table
in the report. Normally, this option is used to hide images,
audio files and other objects your web server dishes out that
would otherwise clutter up the table. This option is of no
use if you disable the top URL's table in the report (see the
-U command line option or "TopURLs" configuration file keyword).
Config file keyword: HideURL
-I name This option allows you to specify additional index.html aliases.
The Webalizer usually strips the string 'index.' from URL's
before processing, which has the effect of turning a URL such
as /somedir/index.html into just /somedir/ which is really the
same URL and should be treated as such. This option allows you
to specify _additional_ strings that are to be treated the same
way. Use with care, improper use could cause unexpected results.
For example, if you specify the alias string of 'home', a URL
such as /somedir/homepages/brad/home.html would be converted
into just /somedir/ which probably isn't what was intended.
This option is useful if your web server uses a different default
index page other than the standard 'index.html' or 'index.htm',
such as 'home.html' or 'homepage.html'. The string specified
is searched for _anywhere_ in the URL, so "home.htm" would
turn both "/somedir/home.htm" and "/somedir/home.html" into
just "/somedir/". Go easy on this one, each string specified
will be scanned for in EVERY log record, so if you specify a
bunch of these, you will notice degraded performance. Wildcards
are not allowed on this one.
Config file keyword: IndexAlias
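The behaviour described for the -I option amounts to cutting the URL off
at the first place an alias string is found. The Python sketch below
reproduces the examples from the text under that reading; the function
name strip_index_alias is hypothetical, and the order in which several
aliases are tried is an assumption of this sketch.

```python
def strip_index_alias(url, aliases=()):
    """Truncate `url` at the first occurrence of 'index.' or an alias.

    'index.' is always checked, followed by any additional aliases in
    the order given. The string is searched for anywhere in the URL.
    """
    for alias in ("index.",) + tuple(aliases):
        position = url.find(alias)
        if position != -1:
            return url[:position]
    return url
```

This shows why an alias such as 'home' is risky: it is found inside
'homepages', so /somedir/homepages/brad/home.html is cut back to
/somedir/, while the more specific alias 'home.htm' yields
/somedir/homepages/brad/.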
Table Size Options
------------------
-e num This option specifies the number of entries to display in the
"Top Entry Pages" table. To disable the table, use a value of
zero (0).
Config file keyword: TopEntry
-E num This option specifies the number of entries to display in the
"Top Exit Pages" table. To disable the table, use a value of
zero (0).
Config file keyword: TopExit
-A num This option specifies the number of entries to display in the
"Top User Agents" table. To disable the table, use a value of
zero (0).
Config file keyword: TopAgents
-C num This option specifies the number of entries to display in the
"Top Countries" table. To disable the table, use a value of
zero (0).
Config file keyword: TopCountries
-R num This option specifies the number of entries to display in the
"Top Referrers" table. To disable the table, use a value of
zero (0).
Config file keyword: TopReferrers
-S num This option specifies the number of entries to display in the
"Top Sites" table. To disable the table, use a value of
zero (0).
Config file keyword: TopSites
-U num This option specifies the number of entries to display in the
"Top URL's" table. To disable the table, use a value of
zero (0).
Config file keyword: TopURLs
--------------------------------------------------------------------------
CONFIGURATION FILES
-------------------
The Webalizer allows configuration files to be used in order to simplify
life for all. There are several ways that configuration files are accessed
by the Webalizer. When The Webalizer first executes, it looks for a
default configuration file named "webalizer.conf" in the current directory,
and if not found there, will look for "/etc/webalizer.conf". In addition,
configuration files may be specified on the command line with the '-c'
option. There are lots of different ways you can combine the use of
configuration files and command line options to produce various results.
The Webalizer always looks for and reads configuration options from a
default configuration file before doing anything else. Because of this,
you can override options found in the default file by use of additional
configuration files specified on the command line or command line options
themselves. If you specify a configuration file on the command line, you
can override options in it by additional command line options which follow.
For example, most users will probably want to create the default file
/etc/webalizer.conf and place options in it to specify the hostname, log
file, table options, etc... At the end of the month when a different log
file is to be used (the end of month log), you can run The Webalizer as
usual, but put the different filename on the end of the command line, which
will override the log file specified in the configuration file. It should
be noted that you cannot override some configuration file options by the
use of command line arguments. For example, if you specify "Quiet yes" in
a configuration file, you cannot override this with a command line argument,
as the command line option only _enables_ the feature (-q option).
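The override order described above can be pictured as a series of layers,
each one replacing values set by the layer before it. The Python sketch
below is only an illustration of that idea; the function name
resolve_options and the sample values are made up for this example and
are not part of The Webalizer.

```python
def resolve_options(default_conf, *later_sources):
    """Merge option dictionaries; later sources override earlier ones."""
    options = dict(default_conf)
    for source in later_sources:
        options.update(source)
    return options

# Hypothetical values, for illustration only.
default_conf = {
    "HostName": "www.example.com",
    "LogFile": "/var/log/access_log",
    "Quiet": "yes",
}
# A different log file named at the end of the command line.
command_line = {"LogFile": "/var/log/access_log.1"}

options = resolve_options(default_conf, command_line)
```

Here the log file from the command line replaces the one from the default
file, while 'Quiet' stays 'yes': as noted above, the -q option can only
enable quiet mode, so no command line layer can set it back to 'no'.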
The configuration files are standard ASCII text files that may be created
or edited using any standard editor. Blank lines and lines that begin
with a pound sign ('#') are ignored. Any other lines are considered to
be configuration lines, and have the form "Keyword Value", where the
'Keyword' is one of the currently available configuration keywords defined
below, and 'Value' is the value to assign to that particular option. Any
text found after the keyword up to the end of the line is considered the
keyword's value, so you should not include anything after the actual value
on the line that is not actually part of the value being assigned. The
file "sample.conf" provided with the distribution contains lots of useful
documentation and examples as well. It should be noted that you do not
have to use any configuration files at all, in which case, default values
will be used (which should be sufficient for most sites).
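The file format just described is simple enough to sketch a reader for it
in Python. This is a minimal illustration under a few assumptions, not
the parser used by The Webalizer: the function name parse_config is
hypothetical, leading whitespace before a keyword or '#' is tolerated,
and results are returned as a list of pairs because some keywords may
appear more than once.

```python
def parse_config(text):
    """Parse 'Keyword Value' lines into a list of (keyword, value) pairs."""
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        keyword = parts[0]
        # Everything after the keyword, up to end of line, is the value.
        value = parts[1] if len(parts) > 1 else ""
        entries.append((keyword, value))
    return entries
```

Because the whole remainder of the line is taken as the value, a trailing
comment such as "OutputDir /var/www/usage # reports" would become part of
the value, which is why nothing should follow the actual value on a line.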
--------------------------------------------------------------------------
General Configuration Keywords
------------------------------
LogFile This defines the log file to use. It should be a fully qualified
name (ie: contain the path), but relative names will work as
well. If not specified, the logfile defaults to STDIN.
LogType This specifies the log file type being used. Normally, The
Webalizer processes web logs in either CLF or Combined format.
You may also process wu-ftpd xferlog formatted logs, or squid
proxy logs by setting the appropriate type using this keyword.
Values may be either 'clf', 'ftp' or 'squid'. Ensure that you
specify the proper file type, otherwise you will be presented
with a long stream of 'invalid record' messages ;)
Command line argument: -F
OutputDir This defines the output directory to use for the reports. If
it is not specified, the current directory is used.
Command line argument: -o
HistoryName Allows specification of a history path/filename if desired.
The default is to use the file named 'webalizer.hist', kept
in the normal output directory (OutputDir above). Any name
specified is relative to the normal output directory unless
an absolute path name is given (ie: starts with a '/').
ReportTitle This specifies the title to use for the generated reports.
It is used in conjunction with the hostname (unless blank)
to produce the final report titles. If not defined, the
default of "Usage Statistics for" is used.
Command line argument: -t
HostName This defines the hostname. The hostname is used in the
report title as well as being prepended to URL's in the
"Top URL's" table. This allows The Webalizer to be run
on "virtual" web servers, or servers that do not reside
on the local machine, and allows clicking on the URL to
go to the right place. If not specified, The Webalizer
attempts to get the hostname via a 'uname' system call,
and if that fails, will default to "localhost".
Command line argument: -n
UseHTTPS Causes the links in the 'Top URL's' table to use 'https://'
instead of the default 'http://' prefix. Not much use if
you run a mix of secure/insecure servers on your machine.
Only useful if you run the analysis on a secure server's
logs, and want the links in the table to work properly.
Quiet This allows you to enable or disable informational messages
while The Webalizer is running. The values for this keyword can be
either 'yes' or 'no'. Using "Quiet yes" will suppress these
messages, while "Quiet no" will enable them. The default
is 'no' if not specified, which will allow The Webalizer
to display informational messages. It should be noted that
this option has no effect on Warning or Error messages that
may be generated, as they go to STDERR.
Command line argument: -q
TimeMe This allows you to display timing information regardless of
any "quiet mode" specified. Useful only if you did in fact
tell the webalizer to be quiet either by using the -q command
line option or the "Quiet" keyword, otherwise timing stats
are normally displayed anyway. Values may be either 'yes'
or 'no', with the default being 'no'.
Command line argument: -T
GMTTime This keyword allows timestamps to be displayed in GMT (UTC)
time instead of local time. Normally The Webalizer will
display timestamps in the time-zone of the local machine
(ie: PST or EDT). This keyword allows you to specify the
display of timestamps in GMT (UTC) time instead. Values
may be either 'yes' or 'no'. Default is 'no'.
Debug This tells The Webalizer to display additional information
when it encounters Warnings or Errors. Normally, The
Webalizer will just tell you it found a bad record or
field. This option will enable the display of the actual
data that produced the Warning or Error as well. Useful
only if you start getting lots of Warnings or Errors and
want to determine the cause. Values may be either 'yes'
or 'no', with the default being 'no'.
Command line argument: -d
IgnoreHist This suppresses the reading of a history file. USE WITH
EXTREME CAUTION as the history file is how The Webalizer
keeps track of previous months. The effect of this option
is as if The Webalizer was being run for the very first
time, and any previous data is discarded. Values may be
either 'yes' or 'no', with the default being 'no'.
Command line argument: -i
FoldSeqErr Allows log records that are out of sequence to be folded
back into the analysis, by treating them as if they had
the same date/time as the last good record. Normally,
out of sequence log records are simply ignored. If you
run apache, don't worry about this.
VisitTimeout Set the 'visit timeout' value. Visits are determined by
looking at the time difference between the current and last
request made by a specific site. If the difference in time
is greater than the visit timeout value, the request is
considered a new visit. The value is in number of seconds,
and defaults to 30 minutes (1800).
Command line argument: -m
PageType Allows you to define the 'page' type extension. Normally,
people consider HTML and CGI scripts as 'pages'. This
option allows you to specify what extensions you consider
a page. Default is 'htm*' and 'cgi' for web logs, and
'txt' for ftp logs.
Command line argument: -P
GraphLegend Enable/disable the display of color coded legends on the
produced graphs. Default is 'yes', to display them.
Command line argument: -L
GraphLines Specify the number of background reference lines to display
on produced graphs. The default is 2. To disable the use
of background lines, use zero ('0').
Command line argument: -l
CountryGraph This keyword is used to either enable or disable the creation
and display of the Country Usage graph. Values may be either
'yes' or 'no', with the default being 'yes'.
Command line argument: -Y
DailyGraph This keyword is used to either enable or disable the creation
and display of the Daily Usage graph. Values may be either
'yes' or 'no', with the default being 'yes'.
DailyStats This keyword is used to either enable or disable the creation
and display of the Daily Usage statistics table. Values may
be either 'yes' or 'no', with the default being 'yes'.
HourlyGraph This keyword is used to either enable or disable the creation
and display of the Hourly Usage graph. Values may be either
'yes' or 'no', with the default being 'yes'.
Command line argument: -G
HourlyStats This keyword is used to either enable or disable the creation
and display of the Hourly Usage statistics table. Values may
be either 'yes' or 'no', with the default being 'yes'.
Command line argument: -H
IndexAlias This allows additional 'index.html' aliases to be defined.
Normally, The Webalizer scans for and strips the string
"index." from URL's before processing them. This turns a
URL such as /somedir/index.html into just /somedir/ which
is really the same URL. This keyword allows _additional_
names to be treated in the same fashion for sites that use
different default names, such as "home.html". The string
is scanned for anywhere in the URL, so care should be used
if and when you define additional aliases. For example,
if you were to use an alias such as 'home', the URL
/somedir/homepages/brad/home.html would be turned into just
/somedir/ which probably isn't the intended result. Instead,
you should have specified 'home.htm' which would correctly
turn the URL into /somedir/homepages/brad/ like intended.
It should also be noted that specified aliases are scanned
for in EVERY log record... A bunch of aliases will noticeably
degrade performance as each record has to be scanned for
every alias defined. You don't have to specify 'index.' as
it is always the default.
Command line argument: -I
MangleAgents The MangleAgents keyword specifies the level of user agent
name mangling, if any. There are 6 levels that may be specified,
each producing a different level of detail displayed. Level 5
displays only the browser name (MSIE or Mozilla) and the major
version number. Level 4 adds the minor version (single
decimal place). Level 3 adds the minor version to two decimal
places. Level 2 will also add any sub-level designation
(such as Mozilla/3.01Gold or MSIE 3.0b). Level 1 will also
attempt to add the system type. The default level 0 will
leave the user agent field unmodified and produces the
greatest amount of detail.
Command line argument: -M
SearchEngine This keyword allows specification of search engines and
their query strings. Search strings are obtained from
the referrer field in the record, and in order to work
properly, the Webalizer needs to know what query strings
different search engines use. The SearchEngine keyword allows
you to specify the search engine and the query string from
which to parse the search string. The line is formatted
as: "SearchEngine engine-string query-string" where
'engine-string' is a substring for matching the search
engine with, such as "yahoo.com" or "altavista". The
'query-string' is the unique query string that is added
to the URL for the search engine, such as "search=" or
"MT=" with the actual search strings appended to the
end. There is no command line option for this keyword.
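To make the roles of 'engine-string' and 'query-string' concrete, here is
a small Python sketch of how a search string could be pulled out of a
referrer. It is an illustration only: the function name
extract_search_string, the example referrer URL, and the choice to stop
at the next '&' and decode '+' and '%xx' escapes are all assumptions, not
a description of The Webalizer's internals.

```python
from urllib.parse import unquote_plus

def extract_search_string(referrer, engine_string, query_string):
    """Return the search string found in `referrer`, or None.

    `engine_string` is a substring identifying the search engine and
    `query_string` is the marker (such as "search=") that precedes the
    search terms in the URL.
    """
    if engine_string not in referrer:
        return None
    start = referrer.find(query_string)
    if start == -1:
        return None
    start += len(query_string)
    end = referrer.find("&", start)      # stop at the next parameter
    if end == -1:
        end = len(referrer)
    return unquote_plus(referrer[start:end])
```

For a hypothetical referrer of
"http://search.example.com/find?search=web+log+stats&page=2", the
configuration line "SearchEngine example.com search=" would correspond to
extract_search_string(referrer, "example.com", "search="), giving
"web log stats".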
Incremental This allows incremental processing to be enabled or disabled.
Incremental processing allows processing partial logs without
the loss of detail data from previous runs in the same month.
This feature saves the 'internal state' of the program so that
it may be restored in following runs. See the section above
titled "Incremental Processing" for additional information.
The value may be 'yes' or 'no', with the default being 'no'.
Command line argument: -p
IncrementalName
Allows specification of the incremental data filename if
desired. Normally, the file named 'webalizer.current' is
used, kept in the standard output directory. If specified,
filenames are relative to the standard output directory,
unless an absolute name is given (ie: starts with '/').
DNSCache Specifies the DNS cache filename. This name is relative
to the default output directory unless an absolute name
is given (ie: starts with '/'). See the DNS.README file
for additional information.
DNSChildren The number of DNS children processes to run in order to
create/update the DNS cache file. If specified, the DNS
cache filename must also be specified (see above). Use
a value of zero ('0') to disable. See the DNS.README
file for additional information.
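Several keywords above (HistoryName, IncrementalName and DNSCache) share
the same rule: a name is taken relative to the output directory unless it
starts with '/'. A minimal Python sketch of that rule follows; the
function name resolve_name and the example paths are hypothetical.

```python
import posixpath

def resolve_name(output_dir, name):
    """Return `name` unchanged if absolute, else relative to `output_dir`."""
    if name.startswith("/"):
        return name                      # absolute path: use as given
    return posixpath.join(output_dir, name)
```

For example, with an output directory of /var/www/usage, the default
history name 'webalizer.hist' resolves to /var/www/usage/webalizer.hist,
while a name such as /var/lib/webalizer/dns_cache.db is used as is.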
Top Table Keywords
------------------
TopAgents This allows you to specify how many "Top" user agents are
displayed in the "Top User Agents" table. The default
is 15. If you do not want to display user agent statistics,
specify a value of zero (0). The display of user agents
will only work if your web server includes this information
in its log file (ie: a combined log format file).
Command line argument: -A
AllAgents Will cause a separate HTML page to be generated for all
normally visible User Agents. A link will be added to
the bottom of the "Top User Agents" table if enabled.
Value can be either 'yes' or 'no', with 'no' being the
default.
TopCountries This allows you to specify how many "Top" countries are
displayed in the "Top Countries" table. The default is
30. If you want to disable the countries table, specify
a value of zero (0).
Command line argument: -C
TopReferrers This allows you to specify how many "Top" referrers are
displayed in the "Top Referrers" table. The default is
30. If you want to disable the referrers table, specify
a value of zero (0). The display of referrer information
will only work if your web server includes this information
in its log file (ie: a combined log format file).
Command line argument: -R
AllReferrers Will cause a separate HTML page to be generated for all
normally visible Referrers. A link will be added to the
"Top Referrers" table if enabled. Value can be either
'yes' or 'no', with 'no' being the default.
TopSites This allows you to specify how many "Top" sites are
displayed in the "Top Sites" table. The default is 30.
If you want to disable the sites table, specify a value
of zero (0).
Command line argument: -S
TopKSites Identical to TopSites, except for the 'by KByte' table.
Default is 10. No command line switch for this one.
AllSites Will cause a separate HTML page to be generated for all
normally visible Sites. A link will be added to the
bottom of the "Top Sites" table if enabled. Value can
be either 'yes' or 'no', with 'no' being the default.
TopURLs This allows you to specify how many "Top" URL's are
displayed in the "Top URL's" table. The default is 30.
If you want to disable the URL's table, specify a value
of zero (0).
Command line argument: -U
TopKURLs Identical to TopURLs, except for the 'by KByte' table.
Default is 10. No command line switch for this one.
AllURLs Will cause a separate HTML page to be generated for all
normally visible URLs. A link will be added to the
bottom of the "Top URLs" table if enabled. Value can
be either 'yes' or 'no', with 'no' being the default.
TopEntry Allows you to specify how many "Top Entry Pages" are
displayed in the table. The default is 10. If you
want to disable the table, specify a value of zero (0).
Command line argument: -e
TopExit Allows you to specify how many "Top Exit Pages" are
displayed in the table. The default is 10. If you
want to disable the table, specify a value of zero (0).
Command line argument: -E
TopSearch Allows you to specify how many "Top Search Strings" are
displayed in the table. The default is 20. If you
want to disable the table, specify a value of zero (0).
Only works if using combined log format (ie: contains
referrer information).
TopUsers This allows you to specify how many "Top" usernames are
displayed in the "Top Usernames" table. Usernames are
only available if you use http authentication on your
web server, or when processing wu-ftpd xferlogs. The
default value is 20. If you want to disable the Username
table, specify a value of zero (0).
AllUsers Will cause a separate HTML page to be generated for all
normally visible usernames. A link will be added to the
bottom of the "Top Usernames" table if enabled. Value
can be either 'yes' or 'no', with 'no' being the default.
AllSearchStr Will cause a separate HTML page to be generated for all
normally visible Search Strings. A link will be added
to the bottom of the "Top Search Strings" table if
enabled. Value can be either 'yes' or 'no', with 'no'
being the default.
Hide Object Keywords
--------------------
These keywords allow you to hide user agents, referrers, sites, URL's
and usernames from the various "Top" tables. The values for these keywords
are the same as those used in their command line counterparts. You
can specify as many of these as you want without limit. Refer to the
section above on "Command Line Options" for a description of the string
formatting used as the value. Values cannot exceed 80 characters in
length.
HideAgent This allows specified user agents to be hidden from the
"Top User Agents" table. Not very useful, since there are
a zillion different names that browsers go by today,
but could be useful if there is a particular user agent
(ie: robots, spiders, real-audio, etc..) that hits your
site frequently enough to make it into the top user agent
listing. This keyword is useless if 1) your log file does
not provide user agent information or 2) you disable the
user agent table.
Command line argument: -a
HideReferrer This allows you to hide specified referrers from the
"Top Referrers" table. Normally, you would only specify
your own web server to be hidden, as it is usually the
top generator of references to your own pages. Of course,
this keyword is useless if 1) your log file does not include
referrer information or 2) you disable the top referrers
table.
Command line argument: -r
HideSite This allows you to hide specified sites from the "Top
Sites" table. Normally, you would only specify your own
web server or other local machines to be hidden, as they
are usually the highest hitters of your web site, especially
if you have their browsers' home page pointing to it.
Command line argument: -s
HideAllSites This allows hiding all individual sites from the display,
which can be useful when a lot of groupings are being
used (since grouped records cannot be hidden). It is
particularly useful in conjunction with the GroupDomains
feature, but can be useful in other situations as well.
Value can be either 'yes' or 'no', with 'no' the default.
Command line argument: -X
HideURL This allows you to hide URL's from the "Top URL's" table.
Normally, this is used to hide items such as graphic files,
audio files or other 'non-html' files that are transferred
to the visiting user.
Command line argument: -u
HideUser This allows you to hide Usernames from the "Top Usernames"
table. Usernames are only available if you use http based
authentication on your web server.
Group Object Keywords
---------------------
The Group* keywords allow object grouping based on Site, URL, Referrer,
User Agent and Usernames. Combined with the Hide* keywords, you can
customize exactly what will be displayed in the 'Top' tables. For example,
to only display totals for a particular directory, use a GroupURL and
HideURL with the same value (ie: '/help/*'). Group processing is only
done after the individual record has been fully processed, so name mangling
and site total updates have already been performed. Because of this, groups
are not counted in the main site total (as that would cause duplication).
Groups can be displayed in bold and shaded as well. Grouped records are
not, by default, hidden from the report. This allows you to display a
grouped total, while still being able to see the individual records, even
if they are part of the group. If you want to hide the detail records,
follow the Group* directive with a Hide* one using the same value. There
are no command line switches for these keywords. The Group* keywords also
accept an optional label to be displayed instead of the actual value used.
This label should be separated from the value by at least one whitespace
character, such as a space or tab character. See the sample.conf file
for examples.
GroupReferrer Allows grouping Referrers. Can be handy for some of the
major search engines that have multiple host names a
referral could come from.
GroupURL This keyword allows grouping URL's. Useful for grouping
complete directory trees.
GroupSite This keyword allows grouping Sites. Most used for
grouping top level domains and unresolved IP addresses
for local dial-ups, etc...
GroupAgent Groups User Agents. A handy example of how you could use
this one is to use "Mozilla" and "MSIE" as the values for
GroupAgent and HideAgent keywords. Make sure you put the
"MSIE" one first.
GroupDomains Allows automatic grouping of domains. The numeric value
represents the level of grouping, and can be thought of
as 'the number of dots' to display. A 1 will display
second level domains only (xxx.xxx), a 2 will display
third level domains (xxx.xxx.xxx) etc... The default
value of 0 disables any domain grouping.
Command line argument: -g
GroupUser Allows grouping of usernames. Combined with a group
name, this can be handy for displaying statistics on
a particular group of users without displaying their
real usernames.
GroupShading Allows shading of table rows for groups. Value can be
'yes' or 'no', with the default being 'yes'.
GroupHighlight Allows bolding of table rows for groups. Value can be
'yes' or 'no', with the default being 'yes'.
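The 'number of dots' rule described for GroupDomains above can be pictured
with a small Python sketch. This is only an illustration of the idea, not
The Webalizer's own code: the function name group_domain is hypothetical,
and the handling of short hostnames is an assumption. Numeric (unresolved)
addresses are not treated specially here.

```python
from typing import Optional


def group_domain(hostname: str, level: int) -> Optional[str]:
    """Return the domain a hostname would be grouped under.

    level is the GroupDomains value, i.e. the number of dots to keep.
    A level of 0 (the default) disables grouping, so None is returned.
    """
    if level <= 0:
        return None
    labels = hostname.lower().split(".")
    keep = level + 1  # one dot separates two labels, two dots three, etc.
    if len(labels) <= keep:
        # Assumption: a hostname that is already short enough is kept whole.
        return ".".join(labels)
    return ".".join(labels[-keep:])
```

With this sketch, "www.cs.example.edu" groups under "example.edu" at level 1
and under "cs.example.edu" at level 2.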
Ignore/Include Object Keywords
------------------------------
These keywords allow you to completely ignore log records when generating
statistics, or to force their inclusion regardless of ignore criteria.
Records can be ignored or included based on site, URL, user agent, referrer
and username. Be aware that by choosing to ignore records, the accuracy of
the generated statistics becomes skewed, making it impossible to produce
an accurate representation of load on the web server. These keywords
behave identically to the Hide* keywords above, where the value can have
a leading or trailing wildcard '*'. These keywords, like the Hide* ones,
have an absolute limit of 80 characters for their values. These keywords
do not have any command line switch counterparts, so they may only be
specified in a configuration file. It should also be pointed out that
using the Ignore/Include combination to selectively exclude an entire
site while including a particular 'chunk' is _extremely_ inefficient,
and should be avoided. Try grep'ing the records into a separate file
and process it instead.
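As a rough illustration of the leading/trailing wildcard matching mentioned
above, here is a minimal Python sketch. It assumes that a leading '*' means
"match the end of the string", a trailing '*' means "match the beginning",
and a value with no wildcard matches anywhere in the string; that last rule
is an assumption taken from the "Command Line Options" description, and the
function name pattern_matches is hypothetical.

```python
def pattern_matches(pattern: str, value: str) -> bool:
    """Simplified Hide*/Ignore*/Include* style matching."""
    if pattern.startswith("*"):
        # Leading wildcard: the rest of the pattern must end the value.
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        # Trailing wildcard: the rest of the pattern must start the value.
        return value.startswith(pattern[:-1])
    # Assumption: no wildcard means the pattern may appear anywhere.
    return pattern in value
```

For example, the value '/help/*' would match '/help/index.html', and
'*.gif' would match '/images/logo.gif' but not '/images/logo.gif.html'.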
IgnoreSite This allows specified sites to be completely ignored from
the generated statistics.
IgnoreURL This allows specified URL's to be completely ignored from
the generated statistics. One use for this keyword would
be to ignore all hits to a 'temporary' directory where
development work is being done, but is not accessible to
the outside world.
IgnoreReferrer This allows records to be ignored based on the referrer
field.
IgnoreAgent This allows specified User Agent records to be completely
ignored from the statistics. May be useful if you really
don't want to see all those hits from MSIE :)
IgnoreUser This allows specified username records to be completely
ignored from the statistics. Usernames can only be used
if you use http authentication on your server.
IncludeSite Force the record to be processed based on hostname. This
takes precedence over the Ignore* keywords.
IncludeURL Force the record to be processed based on URL. This takes
precedence over the Ignore* keywords.
IncludeReferrer Force the record to be processed based on referrer.
This takes precedence over the Ignore* keywords.
IncludeAgent Force the record to be processed based on user agent.
This takes precedence over the Ignore* keywords.
IncludeUser Force the record to be processed based on username.
Usernames are only available if you use http based
authentication on your server. This takes precedence over
the Ignore* keywords.
Dump Object Keywords
--------------------
The Dump* Keywords allow text files to be generated that can then be used
for import into most database, spreadsheet and other external programs.
The file is a standard tab delimited text file, meaning that each column
is separated by a tab (0x09) character. A header record may be included
if required, using the 'DumpHeader' keyword. Since these files contain
all records that have been processed, including normally hidden records,
an alternate location for the files can be specified using the 'DumpPath'
keyword, otherwise they will be located in the default output directory.
DumpPath Specifies an alternate location for the dump files. The
default output location will be used otherwise. The value
is the path portion to use, and normally should be an
absolute path (ie: has a leading '/' character), however
relative path names can be used as well, and will be
relative to the output directory location.
DumpExtension Allows the dump filename extensions to be specified. The
default extension is "tab", however may be changed with
this option.
DumpHeader Allows a header record to be written as the first record
of the file. Value can be either 'yes' or 'no', with
the default being 'no'.
DumpSites Dump tab delimited sites file. Value can be either 'yes'
or 'no', with the default being 'no'. The filename used
is site_YYYYMM.tab (YYYY=year, MM=month).
DumpURLs Dump tab delimited url file. Value can be either 'yes' or
'no', with the default being 'no'. The filename used is
url_YYYYMM.tab (YYYY=year, MM=month).
DumpReferrers Dump tab delimited referrer file. Value can be either
'yes' or 'no', with the default being 'no'. Filename
used is ref_YYYYMM.tab (YYYY=year, MM=month). Referrer
information is only available if present in the log
file (ie: combined web server log).
DumpAgents Dump tab delimited user agent file. Value can be either
'yes' or 'no', with the default being 'no'. Filename
used is agent_YYYYMM.tab (YYYY=year, MM=month). User
agent information is only available if present in the
log file (ie: combined web server log).
DumpUsers Dump tab delimited username file. Value can be either
'yes' or 'no', with the default being 'no'. Filename
used is user_YYYYMM.tab (YYYY=year, MM=month). The
username data is only available if processing a wu-ftpd
xferlog or http authentication is used on the web server
and that information is present in the log.
DumpSearchStr Dump tab delimited search string file. Value can be
either 'yes' or 'no', with the default being 'no'.
Filename used is search_YYYYMM.tab (YYYY=year, MM=month).
The search string data is only available if referrer
information is present in the log being processed and
recognized search engines were found and processed.
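To make the dump file naming and the DumpPath rules above concrete, the
following Python sketch builds the expected location of a dump file. It is
an illustration only: the function name dump_file_path and its parameters
are hypothetical, and the sketch simply follows the rules stated above
(name_YYYYMM.extension, with a relative DumpPath taken relative to the
output directory).

```python
import posixpath
from typing import Optional


def dump_file_path(output_dir: str, dump_path: Optional[str],
                   kind: str, year: int, month: int,
                   extension: str = "tab") -> str:
    """Build the path of a dump file such as site_YYYYMM.tab."""
    filename = f"{kind}_{year:04d}{month:02d}.{extension}"
    if not dump_path:
        # No DumpPath given: the default output directory is used.
        return posixpath.join(output_dir, filename)
    # posixpath.join discards output_dir when dump_path is absolute,
    # and appends dump_path to it when dump_path is relative.
    return posixpath.join(output_dir, dump_path, filename)
```

For March 2001, a 'site' dump with no DumpPath and an output directory of
/var/www/usage would be /var/www/usage/site_200103.tab.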
HTML Generation Keywords
------------------------
These keywords allow you to customize the HTML code that The Webalizer
produces, such as adding a corporate logo or links to other web pages.
You can specify as many of these keywords as you like, and they will be
used in the order that they are found in the file. Values cannot exceed
80 characters in length, so you may have to break long lines up into two
or more lines. Apart from HTMLExtension, there are no command line
counterparts to these keywords.
HTMLExtension Allows generated pages to use something other than the
default 'html' extension for the filenames. Do not
include the leading period ('.') when you specify the
extension.
Command line argument: -x
HTMLPre Allows code to be inserted at the very beginning of the
HTML files. Defaults to the standard HTML 3.2 DOCTYPE
record. Be careful not to include any HTML here, as it
is inserted _before_ the <HTML> tag in the file. Use it
for server-side scripting capabilities, such as php3, to
insert scripting files and other directives.
HTMLHead Allows you to insert HTML code between the <HEAD></HEAD>
block. There is no default. Useful for adding scripts
to the HTML page, such as Javascript or php3, or even
just for adding a few META tags to the document.
HTMLBody This keyword defines HTML code to be placed immediately
after the <HEAD> section of the report, just before the
title and "summary period/generated on" lines. If used,
the first HTMLBody line MUST include a <BODY> tag. Put
whatever else you want in subsequent lines, but keep in
mind the placement of this code in relation to the title
and other aspects of the web page. Some typical uses
are to change the page colors and possibly add a corporate
logo (graphic) in the top right. If not specified, a
default <BODY> tag is used that defines page color, text
color and link colors (see "sample.conf" file for example).
HTMLPost This keyword defines HTML code that is placed after the
title and "summary period/generated on" lines, just before
the initial horizontal rule <HR> tag. Normally this keyword
isn't needed, but is provided in case you included a large
graphic or some other weird formatting tag in the HTMLBody
section that needs to be cleaned up or terminated before the
main report section.
HTMLTail This keyword defines HTML code that is placed at the bottom
right side of the report. It is inserted in a <TABLE> section
between table data <TD>..</TD> tags, and is top and right
aligned within the table. Normally this keyword is used to
provide a link back to your home page or insert a small
graphic at the bottom right of the page.
HTMLEnd This allows insertion of closing code, at the very end of
the page. The default is to put the closing </BODY> and
</HTML> tags. If specified, you _must_ specify these tags
yourself.
--------------------------------------------------------------------------
Notes on Web Log Files
----------------------
The Webalizer supports CLF log formats, which should work for just
about everyone. If you want User Agent or Referrer information, you
need to make sure your web server supplies this information in its
log file, and in a format that the Webalizer can understand. While
The Webalizer will try to handle many of the subtle variations in
log formats, some will not work at all. Most web servers output
CLF format logs by default. For Apache, in order to produce the
proper log format, add the following to the httpd.conf file:
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\""
This instructs the Apache web server to produce a 'combined' log
that includes the referrer and user agent information on the end of
each record, enclosed in quotes (This is the standard recommended
by both Apache and NCSA). Netscape and other web servers have
similar capabilities to alter their log formats. (Note: the above
works for Apache servers up to V1.2. V1.3 and higher have additional
ways to specify log formats; refer to the included documentation.)
Notes on FTP Log Files
----------------------
The Webalizer now supports ftp logs produced by wu-ftpd and others, as
a standard 'xferlog'. To process an ftp log, you must either use the
-Ff command line option or have "LogType ftp" in your configuration file.
Support for additional formats may be forthcoming, however a future
version of the Webalizer is in the works that will allow user defined
log formats, so this will become a non-issue. It is recommended that
you create a separate configuration file for ftp analysis, since the
values used for your web server will most likely not be suited for ftp
log analysis (ie: page types, hostname, etc.. should be different).
Because of the difference in web and ftp logs, there are a few limitations:
o Because there is no concept of a 'response code' in the ftp world, response
codes are restricted to either 200 (OK) or 206 (Partial Content), based
on the completion status found in xferlog (for wu-ftpd, 'i'=incomplete
and will generate a 206, 'c'=complete and will generate a 200). If your
ftp server doesn't supply the completion status, all requests will be
assigned a response code of 200. This allows the usage graph to display
all transfer requests (hits), and how many of those completed in success
(files - ie: 200 response codes).
o Page totals won't accurately reflect reality, since there isn't really
the concept of a 'page' in regards to ftp services. I have found that
setting the PageType value to "README", "FIRST", etc... seems to work
fairly well however, and will give a pretty good indication of how
many 'non-binary' files were requested. Of course, the content of your
ftp site will be different, so your results may vary.
o Visit totals also won't accurately reflect reality, since visits are
triggered on PageType requests (see above). What you usually wind up
with is visits=sites in most cases.
o Entry/Exit pages will not be calculated for ftp logs.
o For obvious reasons, referrers and user agents are not supported.
o You _cannot_ analyze both web and ftp logs at the same time; they must
be done separately in different runs.
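The response code rule from the first limitation above can be sketched in a
few lines of Python. This is a simplified illustration rather than The
Webalizer's actual parser; the function name ftp_response_code is
hypothetical, and the caller is assumed to have already extracted the
completion status field from the xferlog record.

```python
from typing import Optional


def ftp_response_code(completion_status: Optional[str]) -> int:
    """Map an xferlog completion status to a pseudo response code.

    'i' (incomplete) gives 206 (Partial Content); 'c' (complete) gives
    200 (OK). A missing status is also treated as 200.
    """
    if completion_status == "i":
        return 206
    return 200
```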
Notes on Referrers
------------------
Referrers are weird critters... They take many shapes and forms, which makes
them much harder to analyze than a typical URL, which at least has some
standardization. What is contained in the referrer field of your log
files varies depending on many factors, such as what site did the referral,
what type of system it comes from and how the actual referral was generated.
Why is this? Well, because a user can get to your site in many ways... They
may have your site bookmarked in their browser, they may simply type your
site's URL into their browser, they could have clicked on a link on some
remote web page or they may have found your site from one of the many search
engines and site indexes found on the web. The Webalizer attempts to deal
with all this variation in an intelligent way by doing certain things to
the referrer string which makes it easier to analyze. Of course, if your
web server doesn't provide referrer information, you probably don't really
care and are asking yourself why you are reading this section...
Most referrers will take the form of "http://somesite.com/somepage.html",
which is what you will get if the user clicks on a link somewhere on the
web in order to get to your site. Some will be a variation of this, and
look something like "file:/some/such/sillyname", which is a reference from
an HTML document on the user's local machine. Several variations of this can
be used, depending on what type of system the user has, if he/she is on
a local network, the type of network, etc... To complicate things even
more, dynamic HTML documents and HTML documents that are generated by
CGI scripts or external programs produce lots of extra information which
is tacked on to the end of the referrer string in an almost infinite number
of ways. If the user just typed your URL into their browser or clicked on
a bookmark, there won't be any information in the referrer field, and it will
take the form "-".
In order to handle all these variations, The Webalizer parses the referrer
field in a certain way. First, if the referrer string begins with "http",
it assumes it is a normal referral and converts the "http://" and following
hostname to lowercase in order to simplify hiding if desired. For example,
the referrer "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will become
"http://www.myhost.com/This/Is/A/HTML/Document.html". Notice that only the
"http://" and hostname are converted to lower case... The rest of the
referrer field is left alone. This follows standard convention, as the
actual method (HTTP) and hostname are always case insensitive, while the
document name portion is case sensitive.
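The lower-casing step described above can be sketched in Python as follows.
This is an illustration of the idea only, not the actual implementation;
the function name lowercase_scheme_and_host is hypothetical.

```python
def lowercase_scheme_and_host(referrer: str) -> str:
    """Lower-case the "http://" prefix and hostname, leaving the rest alone."""
    if not referrer.lower().startswith("http"):
        return referrer  # not a normal referral, leave untouched
    scheme_end = referrer.find("://")
    if scheme_end < 0:
        return referrer
    # The hostname ends at the first '/' after "://", or at end of string.
    host_end = referrer.find("/", scheme_end + 3)
    if host_end < 0:
        host_end = len(referrer)
    return referrer[:host_end].lower() + referrer[host_end:]
```

Applied to "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html", the sketch
returns "http://www.myhost.com/This/Is/A/HTML/Document.html".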
Referrers that came from search engines, dynamic HTML documents, CGI
scripts and other external programs usually tack on additional information
that is used to create the page. A common example of this can be found
in referrals that come from search engines and site indexes common on the
web. Sometimes, these referrer URLs can be several hundred characters
long and include all the information that the user typed in to search for
your site. The Webalizer deals with this type of referrer by stripping
off all the query information, which starts with a question mark '?'.
The Referrer "http://search.yahoo.com/search?p=usa%26global%26link" will
be converted to just "http://search.yahoo.com/search".
When a user comes to your site by using one of their bookmarks or by
typing in your URL directly into their browser, the referrer field is
blank, and looks like "-". Most sites will get more of these referrals
than any other type. The Webalizer converts this type of referral into
the string "- (Direct Request)". This is done in order to make it easier
to hide via a command line option or configuration file option. This is
because the character "-" is a valid character elsewhere in a referrer
field, and if not turned into something unique, could not be hidden without
possibly hiding other referrers that shouldn't be.
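The query stripping and the direct request conversion described above can
be sketched together in Python. Again, this is only an illustration under
the rules stated in this section; the function name simplify_referrer is
hypothetical.

```python
def simplify_referrer(referrer: str) -> str:
    """Strip query information and label direct requests."""
    if referrer == "-":
        # Blank referrer: bookmark or URL typed directly into the browser.
        return "- (Direct Request)"
    # Drop everything from the first '?' onward.
    return referrer.split("?", 1)[0]
```

Applied to "http://search.yahoo.com/search?p=usa%26global%26link", the
sketch returns "http://search.yahoo.com/search".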
Notes on Character Escaping
---------------------------
The HTTP protocol defines certain ways that URL's can look and behave. To
some extent, referrer fields follow most of the same conventions. Character
escaping is a technique by which non-printable or other non-ASCII (and even
some ASCII) characters can be used in a URL. This is done by placing the
Hexadecimal value of the character in the URL, preceded by a percent sign '%'.
Since Hex values are made up of ASCII characters, any character can be
escaped to ensure only printable ASCII characters are present in the URL.
Some systems take this concept to the extreme and escape all sorts of stuff,
even characters that don't need to be escaped. To deal with this, The
Webalizer will un-escape URL's and referrers before processing them. For
example, the URL "/www.mrunix.net/%7Ebrad/resume.html" is the same URL as
"/www.mrunix.net/~brad/resume.html", a very common form of a URL to access
users' web pages. If the URL's were not un-escaped, they would be treated as
two separate documents, even though they are really one and the same.
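The effect of un-escaping can be reproduced with Python's standard library,
which is used here purely as an illustration. The Webalizer has its own
un-escape routine, and details such as the handling of non-ASCII bytes may
differ from what urllib.parse.unquote does. The helper name same_document
is hypothetical.

```python
from urllib.parse import unquote


def same_document(url_a: str, url_b: str) -> bool:
    """Compare two URLs after un-escaping %XX sequences."""
    return unquote(url_a) == unquote(url_b)


escaped = "/www.mrunix.net/%7Ebrad/resume.html"
plain = "/www.mrunix.net/~brad/resume.html"
unescaped = unquote(escaped)  # %7E is the hex value for '~'
```

Here unescaped is equal to plain, so both forms would be counted as the
same document.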
Search String Analysis
----------------------
The Webalizer will do a minimal analysis on referrer strings that
it finds, looking for well known search string patterns. Most of
the major search engines are supported, such as Yahoo!, Altavista,
Lycos, etc... Unfortunately, search engines are always changing
their internal/CGI query formats, new search engines are coming on
line every day, and the ability to detect _all_ search strings is
nearly impossible. However, it should be accurate enough to give
a good indication of what users were searching for when they stumbled
across your site. Note: as of version 1.31, search engines can now
be specified within a configuration file. See the sample.conf file
for examples of how to specify additional search engines.
Notes on Visits/Entry/Exit Figures
----------------------------------
The majority of data analyzed and reported on by The Webalizer is
as accurate and correct as possible based on the input log file.
However, due to the limitation of the HTTP protocol, the use of
firewalls, proxy servers, multi-user systems, the rotation of your
log files, and a myriad of other conditions, some of these numbers
cannot be calculated with absolute accuracy. In particular,
Visits, Entry Pages and Exit Pages are subject to random errors
due to the above and other conditions. The reason for this is
twofold, 1) Log files are finite in size and time interval, and
2) There is no way to distinguish multiple individual users apart
given only an IP address. Because log files are finite, they have
a beginning and ending, which can be represented as a fixed time
period. There is no way of knowing what happened previous to this
time period, nor is it possible to predict future events based on
it. Also, because it is impossible to distinguish individual users
apart, multiple users that have the same IP address all appear to
be a single user, and are treated as such. This is most common where
corporate users sit behind a proxy/firewall to the outside world,
and all requests appear to come from the same location (the address
of the proxy/firewall itself). Dynamic IP assignment (used with
dial-up internet accounts) also presents a problem, since the same
user will appear to come from multiple places.
For example, suppose two users visit your server from XYZ company,
which has their network connected to the Internet by a proxy server
'fw.xyz.com'. All requests from the network look as though they
originated from 'fw.xyz.com', even though they were really initiated
from two separate users on different PC's. The Webalizer would
see these requests as from the same location, and would record only
1 visit, when in reality, there were two. Because entry and exit
pages are calculated in conjunction with visits, this situation
would also only record 1 entry and 1 exit page, when in reality,
there should be 2.
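The proxy example above can be reproduced with a simplified visit counter
in Python. This is a sketch under stated assumptions, not The Webalizer's
algorithm: it assumes a visit is keyed on the requesting host alone and
ends after 30 minutes (1800 seconds) of inactivity, and it counts every
request rather than only page requests. The function name count_visits
and the timeout parameter are hypothetical.

```python
def count_visits(requests, timeout=1800):
    """Count visits from (host, timestamp_in_seconds) pairs in time order."""
    last_seen = {}
    visits = 0
    for host, timestamp in requests:
        previous = last_seen.get(host)
        # A new visit starts the first time a host is seen, or when the
        # host has been idle for longer than the timeout.
        if previous is None or timestamp - previous > timeout:
            visits += 1
        last_seen[host] = timestamp
    return visits


# Two different people behind the same proxy, a minute apart.
proxy_requests = [("fw.xyz.com", 0), ("fw.xyz.com", 60), ("fw.xyz.com", 90)]
```

Because both users share the address 'fw.xyz.com', count_visits reports a
single visit for proxy_requests, even though two people were involved.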
As another example, say a single user at XYZ company is surfing
around your website. They arrive at 11:52pm the last day of
the month, and continue surfing until 12:30am, which is now a
new day (in a new month). Since a common practice is to rotate
(save then clear) the server logs at the end of the month, you
now have the user's visit logged in two different files (current
and previous months). Because of this (and the fact that the
Webalizer clears history between months), the first page the
user requests after midnight will be counted as an entry page.
This is unavoidable, since it is the first request seen by that
particular IP address in the new month.
For the most part, the numbers shown for visits, entry and exit
pages are pretty good 'guesses', even though they may not be 100%
accurate. They do provide a good indication of overall trends,
and shouldn't be far enough off from the real numbers to matter much.
You should probably consider them as the 'minimum' amount possible,
since the actual (real) values should always be equal or greater
in all cases.
Exporting Webalizer Data
------------------------
The Webalizer now has the ability to dump all object tables to tab
delimited ascii text files, which can then be imported into most
popular database and spreadsheet programs. The files are not normally
produced, as on some sites they could become quite large, and are only
enabled by the use of the Dump* configuration keywords. The filename
extensions default to '.tab' however may be changed using the
'DumpExtension' keyword. Since this data contains all items, even
those normally hidden, it may not be desirable to have them located
in the output directory where they may be visible to normal web users.
For this reason, the 'DumpPath' configuration keyword is available,
and allows the placement of these files somewhere outside the normal
web server document tree. An optional 'header' record may be written
to these files as well, and is useful when the data is to be imported
into a spreadsheet; databases will not normally need the header. If
enabled, the header is simply the column names as the first record of
the file, tab separated.
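As an illustration of how such a file might be consumed, the Python sketch
below reads a tab delimited dump with the standard csv module. The column
names and sample values shown are made up for the example and are not
taken from an actual dump file; the function name read_dump is also
hypothetical.

```python
import csv
import io


def read_dump(stream, has_header):
    """Read a tab delimited dump; return (header, rows).

    header is None when the file was written without DumpHeader.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip empty lines
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows


# Hypothetical sample content with a header record.
sample = io.StringIO("Hits\tURL\n12\t/index.html\n3\t/help/faq.html\n")
header, rows = read_dump(sample, has_header=True)
```

For a real file, an open file object (for example one opened on
url_YYYYMM.tab) would be passed in place of the in-memory sample.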
Log files and The Webalizer
---------------------------
Most sites will choose to have The Webalizer run from cron at specified
intervals. Care should be taken to ensure that data is not lost as a
result of log file rotations. A suggested practice is to rotate your
web server logs at the end of each month as close to midnight as possible,
then have The Webalizer process the 'end of month' log file before running
statistics on the new, current log. On our systems, a shell script called
'rotate_logs' is run at midnight, the end of each month. This script file
looks like:
------------------------- file: rotate_logs ------------------------------
#!/bin/sh
# halt the server
kill `cat /var/lib/httpd/logs/httpd.pid`
# define backup names
OLD_ACCESS_LOG=/var/lib/httpd/logs/old/access_log.`date +%y%m%d-%H%M%S`
OLD_ERROR_LOG=/var/lib/httpd/logs/old/error_log.`date +%y%m%d-%H%M%S`
# make end of month copy for analyzer
cp /var/lib/httpd/logs/access_log /var/lib/httpd/logs/access_log.backup
# move files to archive directory
mv /var/lib/httpd/logs/access_log "$OLD_ACCESS_LOG"
mv /var/lib/httpd/logs/error_log "$OLD_ERROR_LOG"
# restart web server
/usr/sbin/httpd
# compress the archived files
/bin/gzip $OLD_ACCESS_LOG
/bin/gzip $OLD_ERROR_LOG
------------------------- end of file ------------------------------------
This script first stops the web server using a 'kill' command. Apache
keeps the PID of the server in the file httpd.pid, so we use it as the
argument for the kill. Next, it defines some names for the backup files,
which are basically the name of the files with the date and time appended
to the end of them. It then makes a copy of the log file, appended with
'.backup' in the log directory, moves the current log files to an archive
directory (/var/lib/httpd/logs/old) and restarts the server. This setup
allows the web server to be down for the minimum amount of time needed,
which is important for busy sites. If you don't want to stop the server,
you can remove the initial 'kill' command, and replace the '/usr/sbin/httpd'
line with a "kill -1 `cat /var/lib/httpd/logs/httpd.pid`" command instead.
On most web servers, this will cause a restart of the server and create
the new log files in the process...
At this point, we have made copies of the previous month's logs, the web
server is going about its business as usual, and we have all the time in
the world to do any other additional processing we want. The last two
lines of the script compress the archived logs using the GNU zip program
(gzip). Remember, we still have a copy of the log which we can now run
The Webalizer on without having to do any further processing.
Next, we define two crontab entries. The first runs the above 'rotate_logs'
script at midnight at the end of the month. The second runs The Webalizer
on the '.backup' log file created above at 5 minutes after midnight. This
gives other end of month processing jobs a chance to run so we don't bog
the system down too much. If you have lots of end of month stuff going on,
you can change the timing to suit your needs. The crontab entries look
something like:
------------------------- crontab entries --------------------------------
# Rotate web server logs and run monthly analysis
0 0 1 * * /usr/local/adm/rotate_logs
5 0 1 * * /usr/bin/webalizer -Q /var/lib/httpd/logs/access_log.backup
------------------------- end of crontab ---------------------------------
As you can see, the log rotations occur at midnight, and the analysis
is done at 5 minutes after. Once you verify that The Webalizer ran
successfully, the access_log.backup file can be deleted as it isn't
needed any more. If you need to re-run the analysis, you still have
the compressed archive copy that the shell script created. In order
for the above analysis to work properly, you should have already
created an /etc/webalizer.conf configuration file suitable for your
site, or otherwise specify configuration options or a configuration
file on the crontab command line above.
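As a sketch of the last point, the analysis entry could name its
configuration file explicitly with the '-c' switch. The paths below are
the ones used earlier in this example and should be adjusted for your
site:

```
# Same schedule as above, but with the configuration file given explicitly
5 0 1 * * /usr/bin/webalizer -Q -c /etc/webalizer.conf /var/lib/httpd/logs/access_log.backup
```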
If you want The Webalizer to be run more often than once a month, you
can specify additional crontab entries to do this as well. Care should
be taken however to ensure that The Webalizer is not running when the
end of month processing above occurs, or unpredictable results may
happen (such as an inability to rotate the logs due to a file lock).
The easiest way is to run it on the half hour with a crontab entry like:
30 * * * * /usr/bin/webalizer
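If you would rather not rely on timing alone, one possible safeguard is
a small lock wrapper that both the rotation job and the analysis job
run through. This is only a sketch and is not part of The Webalizer. The
function name 'run_locked' and the lock path are made up for
illustration, and it relies on 'mkdir' being atomic:

```shell
#!/bin/sh
# Hypothetical lock directory; mkdir is atomic, so only one job can create it.
LOCKDIR=${LOCKDIR:-/tmp/webalizer.lock}

# run_locked COMMAND [ARGS...]
# Run COMMAND only if no other job currently holds the lock.
# Returns 75 (an arbitrary "try again later" code) when the lock is taken,
# otherwise the exit status of COMMAND.
run_locked() {
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "another job holds $LOCKDIR, skipping" >&2
        return 75
    fi
    "$@"
    status=$?
    rmdir "$LOCKDIR"
    return $status
}
```

A job that is killed while holding the lock leaves the directory behind,
so it would have to be removed by hand before the next run.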
Language Support
----------------
Version 1.0x of The Webalizer added language support. This
support is only provided at compile time in the form of an
include file containing all the strings used by The Webalizer.
The source distribution contains all language files that were
available at the time, with English being the default as
that is the only human language I speak fluently, and me
Espanol es muy malo. Several people have already indicated
the desire to do translations into various languages, and as
I receive the language files, I will make them available via
ftp at ftp://ftp.mrunix.net/pub/webalizer/lang. Unless there
happens to be a binary distribution in the language you need,
you will need to grab the source distribution and compile the
program yourself. See the file INSTALL that comes in the source
distribution for information on how to use a language other than
English.
It should also be noted that the GD graphics library, used to
produce the in-line graphics in the output HTML, doesn't
support extended character sets, so if you are translating
the language file, you will no doubt encounter this problem.
New: You can now specify the language to use when you are building
the program from source, using the configure script. Just add
--with-language=language_name , where 'language_name' is the
name of a valid language file in the /lang/ directory. For
example, --with-language=french will build using French as
the default language. You should consult the INSTALL file
for additional information on building the program from source.
Known Issues
------------
o Memory Usage. The Webalizer makes liberal use of memory for internal
data structures during analysis. Lack of real physical memory will
noticeably degrade performance by doing lots of swapping between memory
and disk. One user who had a rather large log file noticed that The
Webalizer took over 7 hours to run with only 16 Meg of memory. Once
memory was increased, the time was reduced to a few minutes.
o Performance. The Hide*, Group*, Ignore*, Include* and IndexAlias
configuration options can cause a performance decrease if lots of
them are used. The reason for this is that every log record must
be scanned for each item in each list. For example, if you are
Hiding 20 objects, Grouping 20 more, and Ignoring 5, each record
is scanned, at most, 46 times (20+20+5 + an IndexAlias scan).
On really large log files, this can have a profound impact. It
is recommended that you use as few of these configuration
options as you can, as it will greatly improve performance.
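The arithmetic above can be sketched as a pair of small shell
functions. The function names are made up for illustration, and the
one million record log is only an example figure:

```shell
#!/bin/sh
# max_scans HIDE GROUP IGNORE
# Worst-case number of list comparisons per log record: one for each
# Hide, Group and Ignore entry, plus a single IndexAlias scan.
max_scans() {
    hide=$1; group=$2; ignore=$3
    echo $(( hide + group + ignore + 1 ))
}

# total_scans RECORDS PER_RECORD
# Worst-case comparisons for a whole log file.
total_scans() {
    records=$1; per_record=$2
    echo $(( records * per_record ))
}

max_scans 20 20 5          # prints 46
total_scans 1000000 46     # prints 46000000
```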
Final Notes
-----------
A lot of time and effort went into making The Webalizer, and to ensure that
the results are as accurate as possible. If you find any abnormalities or
inconsistent results, bugs, errors, omissions or anything else that doesn't
look right, please let me know so I can investigate the problem or correct
the error. This goes for the minimal documentation as well. Suggestions
for future versions are also welcome and appreciated.
The Webalizer - A log file analysis program
-- DNS information
The webalizer now has the ability to perform reverse DNS lookups. This
document attempts to explain how it works and some things that you should
be aware of when using the DNS lookup features.
Note: The Reverse DNS feature may be enabled or disabled at compile
time. It is enabled by using the -DUSE_DNS compiler switch, or
by specifying '--enable-dns' when 'configure' is run. DNS lookups
are disabled by default.
Another Note: DNS lookups will not work under Windows yet, see the
README.WIN file for more information.
How it works
------------
DNS lookups are made against a DNS cache file containing IP addresses
and resolved names. If the IP address is not found in the cache file,
it will be left as an IP address. In order for this to happen, a
cache file MUST be specified when the Webalizer is run, either using
the '-D' command line switch, or a "DNSCache" configuration file
keyword. If no cache file is specified, no attempts to perform DNS
lookups will be done. The cache file can be made in two different ways.
1) You can have the Webalizer pre-process the specified log file at
run-time, creating the cache file before processing the log file
normally. This is done by setting the number of DNS Children
processes to run, either by using the '-N' command line switch or
the "DNSChildren" configuration keyword. This will cause the
Webalizer to spawn the specified number of processes which will
be used to do reverse DNS lookups. Generally, a larger number
of processes will result in faster resolution of the log; however,
if set too high it may cause overall system degradation. A setting
of between 5 and 20 should be acceptable, and there is a maximum
limit of 100. If used, a cache filename MUST be specified also,
using either the '-D' command line switch, or the "DNSCache"
configuration keyword. Using this method, normal processing will
continue only after all IP addresses have been processed, and the
cache file is created/updated.
2) You can pre-process the log file as a standalone process, creating
the cache file that will be used later by the Webalizer. This is
done by running the Webalizer with a name of 'webazolver' (i.e., the
name 'webazolver' is a symbolic link to 'webalizer') and specifying
the cache filename (either with '-D' or DNSCache). If the number
of child processes is not given, the default of 5 will be used. In
this mode, the log will be read and processed, creating a DNS cache
file or updating an existing one, and the program will then exit
without any further processing.
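For the first method, the two keywords named above might appear in a
configuration file as follows. This is a sketch that assumes the usual
keyword/value layout of webalizer.conf, and the values are illustrative
only:

```
# Cache file to use, and number of child processes for run-time lookups
DNSCache      dns_cache.db
DNSChildren   10
```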
Run-time DNS cache file creation/update
---------------------------------------
The creation/update of a DNS cache file at run-time occurs as follows:
1) The log file is read, creating a list of all IP addresses that are
not already cached and need to be resolved.
2) The specified number of children processes are forked, and are used
to perform DNS lookups.
3) Each IP address is given, one at a time, to the next available child
process until all IP addresses have been processed. Each child will
update the cache file when a name is found.
4) Once all IP addresses have been processed and the cache file updated,
the Webalizer will process the log normally. Each record it finds
that has an unresolved IP address will be looked up in the cache file
to see if a hostname is available (ie: was previously found).
Because there may be a significant amount of time between the initial
unresolved IP list and normal processing, the Webalizer should not be
run against live log files (i.e., a log file that is actively being written
to by a server), otherwise there may be additional records present that
were not resolved.
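Step 1 above can be sketched in shell as follows. This is not how the
Webalizer implements it. For illustration, the sketch assumes that the
host is the first field of each log record and that the cache is a
plain text file with one "ip hostname" pair per line, which is not the
real cache file format. The function name is hypothetical:

```shell
#!/bin/sh
# unresolved_ips LOG CACHE
# Print each distinct numeric IPv4 address that appears as the first
# field of LOG and has no entry in CACHE, in order of first appearance.
unresolved_ips() {
    log=$1; cache=$2
    awk '
        # Records from the first file (the cache) are remembered by address.
        FILENAME == ARGV[1] { cached[$1] = 1; next }
        # Records from the log: keep numeric hosts not cached and not yet seen.
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && !($1 in cached) && !seen[$1]++ { print $1 }
    ' "$cache" "$log"
}
```

The resulting list is what would be handed out, one address at a time,
to the child processes in step 3.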
Stand-Alone DNS cache file creation/update
------------------------------------------
The creation/update of the DNS cache file, when run in stand-alone mode,
occurs as follows:
1) The log file is read, creating a list of all IP addresses that are
not already cached and need to be resolved.
2) The specified number of children processes are forked, and are used
to perform DNS lookups. If the number of processes was not specified,
the default of 5 will be used.
3) Each IP address is given, one at a time, to the next available child
process until all IP addresses have been processed. Each child will
update the cache file when a name is found.
4) Once all IP addresses have been processed and the cache file updated,
the program will terminate without any further processing.
Larger sites may prefer to use a stand-alone process to create the DNS
cache file, and then run the Webalizer against the cache file. This
allows a single cache file to be used for many virtual hosts, and reduces
the processing needed if many sites are being processed. The Webalizer
can be used in stand alone mode by running it as 'webazolver'. When
run in this fashion, it will only create the cache file and then exit
without any further processing. A cache filename MUST be specified,
however unlike when running the Webalizer normally, the number of child
processes does not have to be given (will default to 5). All normal
configuration and command line options are recognized; however, many
of them will simply be ignored. This allows the use of a standard
configuration file for both normal use and stand alone use.
Examples:
---------
webalizer -c test.conf -N 10 -D dns_cache.db /var/log/my_www_log
This will use the configuration file 'test.conf' to obtain normal
configuration options such as hostname and output directory. It
will then either create or update the file 'dns_cache.db' in the
default output directory (using 10 child processes) based on the
IP addresses it finds in the log /var/log/my_www_log, and then
process that log file normally.
webalizer -o out -D dns_cache.db /var/log/my_www_log
This will process the log file /var/log/my_www_log, resolving IP
addresses from the cache file 'dns_cache.db' found in the default
output directory "out". The cache file must be present as it will
not be created with this command.
for i in /var/log/*/access_log; do
webazolver -N 20 -D /var/lib/dns_cache.db $i
done
The above is an example of how to run through multiple log files
creating a single DNS cache file. This might typically be used on
a larger site that has many virtual hosts, each keeping its log
files in a separate directory. It will process each access_log it
finds in /var/log/* and create a cache file (/var/lib/dns_cache.db).
This cache file can then be used to process the logs normally
with the Webalizer.
for i in /etc/webalizer/*.conf; do webalizer -c $i -D /etc/cache.db; done
This will process each configuration file found in /etc/webalizer,
using the DNS cache file /etc/cache.db. This will also typically be
used on a larger site with multiple hosts. Each configuration file
will specify a site-specific log file, hostname, output directory, etc.
The cache file used will typically be created using a command similar
to the one previous to this example.
Considerations
--------------
Processing of live log files is discouraged, as any log records written
between the time of DNS resolution and normal processing will not have
been resolved.
Cached DNS addresses have a TTL (time to live) of 3 days. This may be
changed at compile time by editing the dns_resolv.h header file and
changing the value for DNS_CACHE_TTL.
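To get a feel for what a 3 day TTL means in practice, the check can be
sketched in shell as below. The function name is made up, timestamps
are assumed to be in seconds since the epoch, and whether the real code
treats an entry that is exactly 3 days old as expired is an assumption
of this sketch:

```shell
#!/bin/sh
# Three days expressed in seconds, mirroring the documented default TTL.
TTL_SECONDS=$(( 3 * 24 * 60 * 60 ))   # 259200

# is_expired STORED_AT NOW
# Succeeds (exit status 0) when an entry stored at STORED_AT is older
# than the TTL at time NOW; both arguments are epoch seconds.
is_expired() {
    [ $(( $2 - $1 )) -gt "$TTL_SECONDS" ]
}
```

An expired address would simply be treated as not cached and looked up
again the next time the cache file is updated.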
There is an absolute maximum of 100 child processes that may be created;
however, the actual number of children should be significantly less than
the maximum. Typical usage should be between 5 and 20.
If you are using STDIN for the input stream (log file) and have run-time
DNS cache file creation/update enabled, the program will exit after the
cache file has been created/updated and no output will be produced. If
you must use STDIN for the input log, you will need to process the stream
twice, once to create/update the cache file, and again to produce the
reports.
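The two passes could be wrapped in a small helper along these lines.
This is a sketch only. The function name 'two_pass' and the example
paths are made up, and it assumes that neither pass is given a log file
name (so both read STDIN) and that the configuration used for the
second pass does not enable run-time DNS children:

```shell
#!/bin/sh
# Command names can be overridden for dry runs; defaults are the real tools.
WEBAZOLVER=${WEBAZOLVER:-webazolver}
WEBALIZER=${WEBALIZER:-webalizer}

# two_pass CACHE PRODUCER [ARGS...]
# Run PRODUCER twice: its first output feeds the DNS cache pass, its
# second output feeds the normal report pass.
two_pass() {
    cache=$1; shift
    "$@" | "$WEBAZOLVER" -D "$cache" || return 1
    "$@" | "$WEBALIZER" -D "$cache"
}

# Example (hypothetical paths):
#   two_pass /var/lib/dns_cache.db gzip -dc /var/log/old/access_log.gz
```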
Special thanks to Henning P. Schmiedehausen <hps@tanstaafl.de> for the
original dns-resolver code he submitted, which was the basis for this
implementation.