
pg_snapclone_tools
------------------

This module is a collection of functions designed to allow manipulation
of MVCC snapshots for read only transactions. The purpose is to allow
some read only operations to operate in parallel, specifically...

 * psql running in parallel
 * pg_dump running in parallel
 * a consistent pg_dump on all databases in a cluster

This module is designed to work with PostgreSQL 8.1, 8.2 and 8.3
We expect this mechanism to exist internally within PostgreSQL 8.4,
so these functions will *never* be available in a main Postgres
distro. Use them in your programs with this in mind.

*** WARNING:  Abuse of the code contained here can lead to SQL returning
wrong answers, provide security holes and worse. Treat these
functions with the caution and respect due to malignant biohazards.
Properly handled, this can be medicine. Poorly handled, plague.

These functions are provided to assist you with the upgrade 
process to bigger and better releases of Postgres. Happy Upgrading!

Simon Riggs	simon@2ndquadrant.com


Snapshot Cloning
----------------

Usage procedure looks like this:

1. Connect 1 or more sessions that will be cloned sessions

2. Connect the master session

3. Start a Read Only, Serializable transaction on clone sessions
	BEGIN ISOLATION LEVEL SERIALIZABLE;

   Run a single, simple command to set the snapshot for the transaction.
	select true;		# or similar

   The output from this last command is not important, as long
   as it does not produce an ERROR. If you skip this step then the
   transaction snapshot is not taken until the point we run
   the command to clone the master session's snapshot, which is
   then too late.

   Don't issue any other commands on these sessions (yet)

4. Start a Serializable transaction on master session
	BEGIN ISOLATION LEVEL SERIALIZABLE;

   Don't issue any other commands on this session (yet)

5. Publish the master session's snapshot, do not disconnect
	select pg_snapclone_publish();

   which returns the MasterPid for others to access. This value
   must then be passed using some form of IPC to other sessions.
   In threaded programs this will be straightforward.

   In step (3) we needed to issue an SQL command to fix the 
   snapshot. We don't need to do this here because the command
   to publish the snapshot does it automatically for us, so 
   no need to add additional steps.

6. Clone the snapshot by running this in *each* cloned session
	select pg_snapclone_clone(MasterPid);

   The cloned sessions now have identical MVCC snapshots to the
   master session.

7. **Now** perform read-only operations on the cloned sessions. 

   There is not any easy way to judge what is truly a read-only 
   operation, since functions may have side-effects, sequences may 
   be incremented etc..

   Issuing COPY tablename TO commands is mostly safe, as long
   as we access the table directly and do not use a SELECT statement.

	COPY tablename TO ...

   Even COPY TO commands execute the output functions of their
   datatypes, so custom datatypes may still cause this to fail
   if the output functions do more than just format the data. 
   There are no known publicly available datatypes that do this.
   If you know of one, please report it immediately.

   You can issue more than one command within this transaction,
   just as you would normally.

   When you have finished issuing read-only commands you are still
   not able to issue writes, so you cannot write a row to a table
   to say the work is done. All communication between sessions
   must be performed client-side.

8. ROLLBACK the cloned sessions, just in case writes did occur

9. ROLLBACK the master session, just in case writes did occur

10. Clean up the saved snapshot from the master session
	select pg_snapclone_unpublish();
	This can be performed in a different transaction, if desired.

Simplified Process
------------------

A utility program called pg_snapclone exists which will execute steps 
2, 3, 4, 5, 9, 10

The simplified process is then

1. Connect 1 or more sessions that will be cloned sessions

2. Execute pg_snapclone (by default writes a file called pg_snapclone.pid)

3. Clone the snapshot by running this in *each* cloned session

	i) read pg_snapclone.pid to find out the MasterPid

	ii) Execute select pg_snapclone_clone(MasterPid);

   The cloned sessions now have identical MVCC snapshots to the
   master session.

4. **Now** perform read-only operations on the cloned sessions. 

   There is not any easy way to judge what is truly a read-only 
   operation, since functions may have side-effects, sequences may 
   be incremented etc..

   Issuing COPY tablename TO commands is mostly safe, as long
   as we access the table directly and do not use a SELECT statement.

	COPY tablename TO ...

   Even COPY TO commands execute the output functions of their
   datatypes, so custom datatypes may still cause this to fail
   if the output functions do more than just format the data. 
   There are no known publicly available datatypes that do this.
   If you know of one, please report it immediately.

   You can issue more than one command within this transaction,
   just as you would normally.

   When you have finished issuing read-only commands you are still
   not able to issue writes, so you cannot write a row to a table
   to say the work is done. All communication between sessions
   must be performed client-side.

5. ROLLBACK the cloned sessions, just in case writes did occur

6. Now delete the snapshot file (pg_snapclone.pid). pg_snapclone
   will automatically finish as soon as the file is gone.



What it does
------------

pg_snapclone_publish() will publish the data from the MVCC snapshot on
the master into a small cookie file accessible by the server. The
snapshot file will be created in a new sub-directory of the server
called "pg_snapclone". Files within this directory will be called
"snapshot.XXXX" where XXXX is the process id (pid) of the master session.
Only one snapshot can be published per session; older snapshots are
useless and will be overwritten.

pg_snapclone_clone() will read the snapshot file and overwrite the cloned
session's SerializableSnapshot, an important internal data structure
that is the heart of the MVCC technique within PostgreSQL.

There are a number of checks in place to prevent abuse of this
technique. They are not the only ones needed, but if you carefully
examine the SQL statements to be executed then it may be possible to use 
this successfully. Testing and strong validation is highly recommended.

Restrictions
------------

The master and cloned sessions can be connected to different databases.

The master and cloned sessions must be the same effective userid. The
master session should stay connected until the cloned sessions complete
as a precaution against other complex effects.

The size of the snapshot file in bytes is roughly the same size as the
snapshot takes in memory within each backend. pg_snapclone could
potentially grow in size over time if pg_snapclone_unpublish() is
not executed, or sessions are shutdown or crash during use. In that case
snapshots can be removed by pg_snapclone_unpublish(integer). 

On a typical system, snapshot files will be about 1kB each, possibly
as few as 50 bytes. On a very busy server with large configuration that 
might increase, but more than 10kB is unlikely.

Design
------

Access to shared memory areas and LWLocks from extensions was only
introduced in Postgres 8.2 onwards, so that was not an option that
could be used for a mechanism that would work with 8.1. 

The data is held internally to avoid security issues, malicious or 
careless use of these techniques.

Note that when running this on Postgres 8.3, the TransactionIds are
not set until a write occurs. So both the master and the cloned
sessions will not appear in the snapshots.

Utility Functions
-----------------

Two other utility functions exist

pg_snapclone_unpublish(integer)

Allows removal of specific snapshots, identified by pid number.
You can remove your own published snapshots this way, if you choose.
Anybody can run this command to remove an old stale snapshot, as long
as a backend with that pid is not currently running.

pg_snapclone_clean_directory(void)

An expensive function which removes all stale snapshot files
from the pg_snapclone directory. Don't run this too frequently.

It is also possible to simply remove files from pg_snapclone by manually
issued OS commands, but that isn't the sort of thing we encourage with 
data directories, so it is best avoided. Superuser only.
