  
-------- -------- -------- -------- -------- -------- -------- -------- --------

      BLASTgres -- Biosequence data management in PostgreSQL
      http://www.blastgres.org/
  
      Copyright (C) 2005, 2006  Ruey-Lung Hsiao,  D.S. Parker
      (rlhsiao@cs.ucla.edu, stott@cs.ucla.edu)
  
      UCLA Center for Computational Biology
      635 Charles Young Drive South, Suite 225
      Los Angeles, CA 90095-7332  USA
  
      UCLA Computer Science Dept.
      3532 Boelter Hall
      Los Angeles, CA 90095-1596  USA
  
      This is free software; you can redistribute it and/or
      modify it under the terms of the GNU General Public License
      as published by the Free Software Foundation; either version 2 of
      the License, or (at your option) any later version.
  
      This is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
      See the GNU General Public License for more details.
  
      You should have received a copy of the GNU General Public
      License along with this; if not, write to the
      Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
      Boston, MA  02110-1301  USA

-------- -------- -------- -------- -------- -------- -------- -------- --------


This overview gives a quick summary of BLASTgres' extensions for PostgreSQL.
It is not a full user manual, but just a sketch of the functions provided.
For further information, see the BLASTgres documentation.


There are three new datatypes provided by BLASTgres:

   -- "range"
             A "range" is an object consisting of two integer points
                        (lower bound, upper bound)
             representing the interval between these two specified points.

   -- "loc"
             A "loc" consists of an identifier and a range:
                        (identifier, range)
             Here an identifier is typically a sequence identifier.

   -- "hit"
             A "hit" is an HSP result from the BLAST program --
             a high-scoring sequence pair (high-score alignment).

The most important is the "range" type, on which everything else in
this extension depends.

The "loc" type is similar in many ways to the "range" type, so this
overview covers only the "range" type.

The "hit" type is the result of using the BLAST() function described
at the end of this overview.
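
As a rough mental model only (this is not BLASTgres code), the "range" and
"loc" types can be pictured as the small Python records below. The class
names, the parse_range helper, and the identifier "seq5027" are assumptions
made for illustration; the 'lower .. upper' text form is the one that
appears in the query output shown later in this overview.

```python
import re
from typing import NamedTuple


class Range(NamedTuple):
    """Illustrative stand-in for the "range" type: (lower bound, upper bound)."""
    lower: int
    upper: int


class Loc(NamedTuple):
    """Illustrative stand-in for the "loc" type: (identifier, range)."""
    identifier: str
    range: Range


def parse_range(text: str) -> Range:
    """Parse the 'lower .. upper' text form, for example '23 .. 45'."""
    match = re.fullmatch(r"\s*(-?\d+)\s*\.\.\s*(-?\d+)\s*", text)
    if match is None:
        raise ValueError(f"not a range literal: {text!r}")
    return Range(int(match.group(1)), int(match.group(2)))


segment = parse_range("23 .. 45")
location = Loc("seq5027", segment)  # "seq5027" is a made-up sequence identifier
```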


============================
FUNCTIONS DEFINED FOR RANGES
============================

A number of functions/operations can be applied to pairs of ranges.
Assume that $1 and $2 are ranges:


Boolean-valued functions:

    range_over_left( $1, $2 )
          true if $1 and $2 overlap on the left hand side of $2; otherwise false.

    range_over_right( $1, $2 )
          true if $1 and $2 overlap on the right hand side of $2; otherwise false.
        
    range_right( $1, $2 )
          true if the lower bound of $1 is greater than the upper bound of $2;
          otherwise false.

    range_left( $1, $2 )
          true if the upper bound of $1 is less than the lower bound of $2;
          otherwise false.

    range_lt( $1, $2 )
          true if the lower bound of $1 is less than the lower bound of $2.
          If the lower bounds are equal, true if the upper bound of $1 is
          less than the upper bound of $2; otherwise false.

    range_le( $1, $2 )
          true if the lower bound of $1 is less than the lower bound of $2.
          If the lower bounds are equal, true if the upper bound of $1 is
          less than or equal to the upper bound of $2; otherwise false.

    range_gt( $1, $2 )
          true if the lower bound of $1 is greater than the lower bound
          of $2.  If the lower bounds are equal, true if the upper bound
          of $1 is greater than the upper bound of $2; otherwise false.

    range_ge( $1, $2 )
          true if the lower bound of $1 is greater than the lower bound
          of $2.  If the lower bounds are equal, true if the upper bound
          of $1 is greater than or equal to the upper bound of $2;
          otherwise false.

    range_contains( $1, $2 )
          true if the lower bound of $1 is less than or equal to the
          lower bound of $2 and the upper bound of $1 is greater than or
          equal to the upper bound of $2; otherwise false.

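    range_contained( $1, $2 )
          used by the ~ operator listed below; by its name, presumably
          equivalent to range_contains($2,$1)

    range_overlap( $1, $2 )
          true if $1 overlaps $2; otherwise false.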

    range_same( $1, $2 )
          true if $1 and $2 represent the same interval range.

    range_different( $1, $2 )
          true if $1 and $2 represent different interval ranges.

    range_meet( $1, $2 )
          true if the upper bound of $1 is equal to the lower bound of $2.

    range_met_by( $1, $2 )
          equivalent to range_meet($2,$1).

    range_start( $1, $2 )
          true if the lower bound of $1 is equal to that of $2.

    range_started_by( $1, $2 )
          equivalent to range_start($2,$1).

    range_finish( $1, $2 )
          true if the upper bound of $1 is equal to that of $2, and
          the lower bound of $1 is less than or equal to that of $2.

    range_finished_by( $1, $2 )
          equivalent to range_finish($2,$1).
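
The bound-based definitions above translate almost word for word into
code. The following Python sketch models a few of them; it is not the
BLASTgres implementation, and the Range class is an illustrative stand-in.
Note that the lt/le/gt/ge rules ("compare lower bounds, break ties on
upper bounds") are exactly lexicographic comparison of the pair
(lower, upper).

```python
from typing import NamedTuple


class Range(NamedTuple):
    """Illustrative stand-in for a BLASTgres range with integer bounds."""
    lower: int
    upper: int


def range_left(a: Range, b: Range) -> bool:
    # $1 lies entirely to the left of $2.
    return a.upper < b.lower


def range_right(a: Range, b: Range) -> bool:
    # $1 lies entirely to the right of $2.
    return a.lower > b.upper


def range_lt(a: Range, b: Range) -> bool:
    # Lower bounds first, upper bounds break ties (lexicographic order).
    return (a.lower, a.upper) < (b.lower, b.upper)


def range_le(a: Range, b: Range) -> bool:
    return (a.lower, a.upper) <= (b.lower, b.upper)


def range_gt(a: Range, b: Range) -> bool:
    return (a.lower, a.upper) > (b.lower, b.upper)


def range_ge(a: Range, b: Range) -> bool:
    return (a.lower, a.upper) >= (b.lower, b.upper)


def range_meet(a: Range, b: Range) -> bool:
    return a.upper == b.lower


def range_met_by(a: Range, b: Range) -> bool:
    return range_meet(b, a)


def range_start(a: Range, b: Range) -> bool:
    return a.lower == b.lower


a = Range(1, 24)
b = Range(25, 60)
c = Range(1, 30)
print(range_left(a, b))   # True: 24 < 25
print(range_lt(a, c))     # True: lower bounds tie, 24 < 30
print(range_start(a, c))  # True: both start at 1
```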



        Binary operator     Equivalent query
     ------------------------------------------------

         $1   <   $2        range_lt($1,$2)

         $1   <=  $2        range_le($1,$2)

         $1   >   $2        range_gt($1,$2)

         $1   >=  $2        range_ge($1,$2)

         $1   <<  $2        range_left($1,$2)

         $1   >>  $2        range_right($1,$2)

         $1   &<  $2        range_over_left($1,$2)

         $1   &>  $2        range_over_right($1,$2)

         $1   &&  $2        range_overlap($1,$2)

         $1   =   $2        range_same($1,$2)

         $1   <>  $2        range_different($1,$2)

         $1   @   $2        range_contains($1,$2)

         $1   ~   $2        range_contained($1,$2)

     ------------------------------------------------


Range-valued functions:

    range_minus( $1, $2 )
          yields the range with lower bound  ($1.lower - $2.lower)
                            and upper bound  ($1.upper - $2.upper)

    range_plus( $1, $2 )
          yields the range with lower bound  ($1.lower + $2.lower)
                            and upper bound  ($1.upper + $2.upper)

    range_toseg( $1, $2 )
          yields the range with lower bound  $1
                            and upper bound  $2
          (here $1 and $2 are integers rather than ranges)

    range_maxmin( $1, $2 )
          yields the range with lower bound  max($1.lower,$2.lower)
                            and upper bound  min($1.upper,$2.upper)

    range_minmax( $1, $2 )
          yields the range with lower bound  min($1.lower,$2.lower)
                            and upper bound  max($1.upper,$2.upper)

    range_union( $1, $2 )
          yields the range with lower bound  min($1.lower,$2.lower)
                            and upper bound  max($1.upper,$2.upper)

    range_inter( $1, $2 )
          yields the range with lower bound  max($1.lower,$2.lower)
                            and upper bound  min($1.upper,$2.upper)
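
The min/max formulas for range_union and range_inter can be checked with
a few lines of Python. This is only a model of the formulas as written
above, with an illustrative Range class; it does not show how BLASTgres
itself reports the intersection of two ranges that do not overlap.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """Illustrative stand-in for a BLASTgres range with integer bounds."""
    lower: int
    upper: int


def range_union(a: Range, b: Range) -> Range:
    # Smallest single range covering both inputs (including any gap between them).
    return Range(min(a.lower, b.lower), max(a.upper, b.upper))


def range_inter(a: Range, b: Range) -> Range:
    # Part common to both inputs, following the formula literally.
    return Range(max(a.lower, b.lower), min(a.upper, b.upper))


overlapping = (Range(1, 24), Range(20, 60))
disjoint = (Range(1, 10), Range(20, 30))

print(range_union(*overlapping))  # Range(lower=1, upper=60)
print(range_inter(*overlapping))  # Range(lower=20, upper=24)
print(range_union(*disjoint))     # Range(lower=1, upper=30)
print(range_inter(*disjoint))     # Range(lower=20, upper=10)
```

For disjoint inputs the literal formula gives a lower bound greater than
the upper bound, so callers may want to test range_overlap first.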


Integer-valued functions:

    range_size( $1 )
          returns the length of the range object.

    range_upper( $1 )
          returns the upper bound of the range object.

    range_lower( $1 )
          returns the lower bound of the range object.



Aggregate functions of ranges:

   coalescing( $1, $2, $3 )     returns a set of range objects
   coalescing( $1, $2, $3, $4 ) returns a set of range objects

        - $1, $2, $3 are of type text.
        - $4 is of type int4.

        - $1 is the name of a table.
        - $2 is the name of a range field/attribute in that table.
        - $3 is the predicate used to SELECT (i.e., filter)
                    the field values to be processed.
        - $4 is the allowed gap size (zero if omitted).

        The result is a 1-column table of "coalesced" range values
        (ranges that are obtained by combining ranges that overlap --
         except that gaps are tolerated up to the specified gap size).

        - Example:  Suppose we have the table  DEMO1 declared by
         
          CREATE TABLE DEMO1 ( subjectid INTEGER, segment range );

          that contains the ranges below:

              SELECT * FROM DEMO1;
    
               subjectid |  segment
              -----------+--------------
               5027      |  '23 .. 45'
               5027      |  '46 .. 100'
               5232      |  '1 .. 24'
               5232      |  '25 .. 60'
               5232      |  '63 .. 100'  
         

          Then the result of the query

              SELECT * FROM coalescing( 'DEMO1', 'segment', 'subjectid=5232', 1 );

          will be the following table:

               interval     
               --------
               '1 .. 60'
               '63 .. 100'    

          Here '1 .. 60' is the result of coalescing '1 .. 24' and
          '25 .. 60', while '63 .. 100' is returned unchanged because
          its distance from '1 .. 60' exceeds the allowed gap size of 1.
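
The coalescing step itself can be sketched in Python as a sort followed by
a single merging pass. This is not the BLASTgres implementation. In
particular, the rule used here (merge when next.lower - previous.upper is
at most the gap size) is an assumption, chosen because it reproduces the
example above; the exact gap rule is not stated in this overview.

```python
def coalesce(ranges, gap=0):
    """Merge (lower, upper) pairs that overlap or lie within `gap` of each other.

    Assumed rule: a range joins the previous merged range when
    its lower bound minus the previous upper bound is <= gap.
    """
    merged = []
    for lower, upper in sorted(ranges):
        if merged and lower - merged[-1][1] <= gap:
            # Extend the current merged range; keep the larger upper bound.
            prev_lower, prev_upper = merged[-1]
            merged[-1] = (prev_lower, max(prev_upper, upper))
        else:
            # Too far from the previous range: start a new one.
            merged.append((lower, upper))
    return merged


# The three segments of subjectid 5232 from the DEMO1 example.
segments = [(1, 24), (25, 60), (63, 100)]
print(coalesce(segments, gap=1))  # [(1, 60), (63, 100)]
```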


   partition( $1, $2, $3, $4, $5 ) returns a set of records

        - $1, $2, $3, $4, $5 are of type text.

        - $1 is the name of a table.
        - $2, $3, $4, $5 each give the name of a field/attribute in that table.

        - $2 is a field giving a query id.
        - $3 is a field giving a range for that query id.
        - $4 is a field giving a subject id.
        - $5 is a field giving a range for that subject id.

        - this function partitions a set of intervals into disjoint chunks.
          In other words, if two intervals overlap, their intersection becomes
          one chunk, and their remainders (if any) also become (disjoint) new chunks.
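
The chunking idea can be illustrated on bare intervals with the Python
sketch below. It is a simplified model, not the BLASTgres function: it
ignores the query/subject ids and the record layout that partition()
returns, and it assumes closed integer intervals, so that a chunk ending
at position n is followed by one starting at n+1.

```python
def partition_chunks(intervals):
    """Split (lower, upper) integer intervals into disjoint chunks."""
    # A new chunk can begin at every lower bound and just past every upper bound.
    cuts = sorted({lo for lo, _ in intervals} | {hi + 1 for _, hi in intervals})
    chunks = []
    for start, next_start in zip(cuts, cuts[1:]):
        end = next_start - 1
        # Keep the chunk only if some input interval covers it.
        if any(lo <= start and end <= hi for lo, hi in intervals):
            chunks.append((start, end))
    return chunks


# Two overlapping intervals: the intersection and both remainders become chunks.
print(partition_chunks([(1, 10), (5, 15)]))  # [(1, 4), (5, 10), (11, 15)]
```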


=============
BlastHitTable
=============

BlastHitTable is a datatype provided solely for holding Blast results.
That is, Blast-related functions return objects of this type.
The datatype is defined as follows:

CREATE TYPE BlastHitTable AS
(
   queryid TEXT,
   subjectid TEXT,
   identity FLOAT,
   length INTEGER,
   mismatches INTEGER,
   gap INTEGER,
   qinterval range,
   sinterval range,
   evalue FLOAT,
   bitscore FLOAT
);

Make sure you have the range type defined in your database.
See the BLASTgres installation instructions.
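
The field list resembles the 12-column tabular output of BLAST (query id,
subject id, percent identity, alignment length, mismatches, gap openings,
query start, query end, subject start, subject end, e-value, bit score),
with each start/end pair folded into one range. Assuming that
correspondence, which this overview does not state explicitly, one row
could be modeled in Python as follows. The class and function names and
the sample line are made up for illustration.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """Illustrative stand-in for a BLASTgres range with integer bounds."""
    lower: int
    upper: int


class BlastHit(NamedTuple):
    """Mirror of the BlastHitTable fields, in the same order."""
    queryid: str
    subjectid: str
    identity: float
    length: int
    mismatches: int
    gap: int
    qinterval: Range
    sinterval: Range
    evalue: float
    bitscore: float


def parse_tabular_hit(line: str) -> BlastHit:
    """Convert one tab-separated, 12-column BLAST hit line into a BlastHit."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return BlastHit(
        f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
        Range(int(f[6]), int(f[7])),   # query start/end -> qinterval
        Range(int(f[8]), int(f[9])),   # subject start/end -> sinterval
        float(f[10]), float(f[11]),
    )


# Made-up sample line, not real BLAST output.
sample = "query1\tsubjectA\t98.00\t50\t1\t0\t1\t50\t101\t150\t2e-20\t93.5"
hit = parse_tabular_hit(sample)
print(hit.qinterval, hit.sinterval)
```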


=============================
BLAST-RELATED TABLE FUNCTIONS
=============================

Each of the following functions returns a set of hits (a BlastHitTable):

blast_sequence( $1 )
blast_sequence( $1, $2 )

        - $1, $2 are of type text

        - $1 is an actual sequence string, or other input permitted
            by the Blast server.

        - $2 is the Blast database name ('nr' if omitted).

        - example:

          SELECT * FROM blast_sequence(
            'ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCC',
            'nr');


blast_sequence_p( $1, $2 )

        - $1, $2 are of type text

        - $1 represents the actual sequence data or any input
             that is allowed by the Blast server.

        - $2 is the name of a table that contains Blast parameters
             and their corresponding values.

        - this function expects the table specified by $2
          to have exactly two fields. The first field holds the Blast
          parameter names, and the second field holds their values.

          For all possible parameters, please refer to:
          http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
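
          To make the two-field layout concrete, the Python sketch below
          treats each (parameter, value) row as one key/value pair of a
          URL API request. How BLASTgres actually assembles its request
          is not described here, so this mapping is an assumption. CMD,
          QUERY, PROGRAM, DATABASE and EXPECT are parameter names from
          the NCBI URL API; the rows and the function name are
          hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical contents of a two-field parameter table: (parameter, value).
param_rows = [
    ("PROGRAM", "blastn"),
    ("DATABASE", "nr"),
    ("EXPECT", "10"),
]


def build_put_query(sequence, rows):
    """Build the query string of a URL API 'Put' (submit search) request."""
    params = {"CMD": "Put", "QUERY": sequence}
    params.update(rows)  # each table row adds or overrides one parameter
    return urlencode(params)


print(build_put_query("ACGT", param_rows))
# CMD=Put&QUERY=ACGT&PROGRAM=blastn&DATABASE=nr&EXPECT=10
```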


blast( $1, $2 )
blast( $1, $2, $3 )

        - $1, $2, $3 are of type text

        - $1 is the table name 
        - $2 is the sequence field name
        - $3 is the name of the Blast database ('nr' if omitted)

        - this function Blasts all sequences in table $1
          against the Blast database.


blast_p( $1, $2, $3 )

        - $1, $2, $3 are of type text

        - $1 is the table name
        - $2 is the sequence field name
        - $3 is the name of the table holding the Blast parameters

        - this function Blasts all the sequence data in the specified
          table against the Blast database with the Blast parameters
          specified by table $3.

