$Id: README 76 2007-11-09 18:14:43Z decibel $

Here's some real quick-and-dirty explanation of how RRS works and what
you need to do to set it up.

There are two tables that define how RRS will operate. The rrs table defines
characteristics of each RRS. In the default configuration, these
RRSs show data from the past hour, 4 hours, 12 hours, day, week, month, and
year.

The other configuration table is source. This table describes the different
data sources RRS will aggregate. There is an example definition that is
commented out in rrs.sql.

When rrs.update() is run, it first gets a list of all the sources. For each
source it then gets a list of rrs RRSs, in rrs_id order. This is the order it
will process the RRSs in, and it's important that a parent is processed before
it's children. An RRS with a NULL parent will pull data from base data table
that you're aggregating out of. An RRS with a parent specified will pull data
from that RRS.

Here's details on what each field in the source table means:

source_id is an integer used to reference each source. It's a serial, and
shouldn't be touched.

source_name is a unique name for each source. It is how you should reference
the source table should you need to.

insert_table is the table that aggregated data will be inserted into. Note that
you should define a foreign key constraint on this table referencing
rrs.bucket(bucket_id) ON DELETE CASCADE. If you don't do this, results will
pile up in insert_table.

source_table is the table that raw data will be pulled from.

source_timestamptz_field is the field in the source table that will be used for
the time aggregation. It must be of type timestamptz.

group_clause is the clause that will be used to group data when aggregating.

insert_aggregate_fields is the list of fields in *insert_table* that will be
inserted into

primary_aggregate is the SELECT clause that will be used when inserting data
from *source_table*.

rrs_aggregate is the SELECT clause that will be used when inserting data from
*insert_table*.

Finally, here's a detailed description of how update() works:

First, update rrs.bucket. Delete any old buckets, and add new buckets as required.

For each source...

For each RRS (ORDER BY rrs_id)...

If source.parent is NULL, then
INSERT INTO *insert_table* (bucket_id, *group_clause*, *insert_aggregate_fields*)
    SELECT bucket_id, *group_clause*, *primary_aggregate*
        FROM *source_table*, rrs.bucket
        GROUP BY rrs.bucket_id, *group_clause*

Otherwise,
INSERT INTO *insert_table* (bucket_id, *group_clause*, *insert_aggregate_fields*)
    SELECT bucket_id, *group_clause*, *rrs_aggregate*
        FROM *insert_table*, rrs.bucket
        GROUP BY rrs.bucket_id, *group_clause*

I've ommitted the details of how data is grouped into time buckets from the
SELECT statements, but that is part of the code that isn't configurable. 

PRESENTATION
------------
As you can see, the table we're inserting summarized data into won't actually
contain timestamps. Instead, it contains references to the buckets table, so
you'll want to join to that when presenting your data. But joining directly to
the bucket table will sometimes present extra data points, compared to what
rrs.keep_buckets is set to. Instead, you should join to the bucket_trimmed
view:

CREATE OR REPLACE VIEW public.database AS
	SELECT b.rrs_id, b.end_time, b.previous_end_time, d.*
		FROM rrs.bucket_trimmed b LEFT JOIN rrs_data.database d USING( bucket_id )
;

Note also that the view uses a LEFT JOIN. This would be the normal use case for
when you are graphing data; it ensures that you'll get at least one data point
for each bucket, which can help with graphing when some actual data points are
missing. To ensure 100% full coverage, you would also want to left-join to
tables containing all values for all your GROUP BY fields. In this example,
database_id is the grouping, so our view would become:

CREATE OR REPLACE VIEW public.database AS
	SELECT b.rrs_id, b.end_time, b.previous_end_time, d.*
		FROM rrs.bucket_trimmed b
            CROSS JOIN raw.databases
            LEFT JOIN rrs_data.database d USING( database_id, bucket_id )
;

THROTTLING
----------
The throttling system works by processing limited time periods of raw input
data. Previously, all data that was available would be processed at once.

Any time RRS detects that it is 'behind', it will engage the throttling code.
There is a setting ('desired run time', see below) that controls the maximum
amount of time that RRS should run when throttling. Note that this isn't a hard
limit, and that the commit time at the end of the run can not be accounted for.
This commit time can be quite substantial, and 'desired run time' must be
adjusted to account for it.

The throttling code will take a look at the history of run time for runs when
the throttling code was active. This is stored in the history_data_interval
table. In addition to logging how long each run took, this table also logs what
data interval was used. Using these two pieces of information, RRS calculates
the ratio of data interval to run time, and multiplies that by the desired run
time. This provides a new target data interval. This is done for each row of
history_data_interval (it's actually done in the
history_data_interval_run_time_v view), and the average of the target data
interval (next_data_interval in the view) is used as the amount of data to
process in this run.

It's not quite this simple though. Because this is a very crude method of
estimation, it's easy for it to produce numbers that are very out of wack.
Because of this, limits are imposed. There are hard limits for data interval,
both minimum and maximum. This ensures that RRS always process at least some
data, and that RRS doesn't process too much data. There is also a slew-rate
limit, that prevents the data interval from being increased by more than 3x
what it was for the last run.

SETTINGS
--------
There are some settings that control the operation of RRS:

decibel=# select * from setting;
         setting_name         |  setting   
------------------------------+------------
 desired run time             | 30 seconds
 minimum data interval        | 00:08:20
 maximum data interval        | 13:53:20
 history length               | 100
 history_data_interval length | 20


'Data interval' refers to the interval of data that will be processed from an
input table. RRS updates in chunks by limiting the amount of data that will be
processed from the source table.

desired run time                This is the maximum amount of time you would like a run to take.
minimum data interval           The minimum amount of source data to process at one time.
maximum data interval           The maximum amount of source data to process at one time.
history length                  How many runs of RRS to keep history for.
history_data_interval length    How many throttled runs of RRS to keep history for.

Initially, these aren't set at all. The first time update is run, it (well,
actually calculate_run_time(..) and log_time(..)) will set them to the
following default values:

desired run time                30 seconds
minimum data interval           desired rut time * 10
maximum data interval           minimum data interval * 100
history length                  100
history_data_interval length    20

These defaults are reasonable for most hardware if RRS is being run once a
minute.

The functions setting_get(setting name (text)) and setting_set(setting name
(text), setting value (text) are provided for getting and changing settings.
You may also operate on the table directly. For example, the following might be
good if you onlyrun RRS every 10 minutes:

DELETE FROM rrs.setting;
SELECT rrs.setting_set('desired run time', '6 minutes'); -- Remember to allow for commit time
SELECT rrs.setting_set('minimum data interval', '10 minutes');

Note that RRS will always process at least one bucket interval on a run, no
matter what the throttling minimum is. This ensures that at least some work is
done.

TROUBLESHOOTING
---------------

If it's taking a long time to run update(), you've probably run into a
configuration problem of some sort. These steps will help identify the issue:

SET client_min_messages = debug;
SELECT update_buckets(); -- This adds all the buckets. It typically runs OK
SELECT update();

By changing client_min_messages, you'll now have the actual queries that RRS is
running against the system. You should take the last one displayed (the one
that's taking a long time), and do an EXPLAIN on it. This should point you to
the problem. Note that it's critical that your raw data table has an index on
the timestamp column that you're joining on.
