Simple Database Algorithms

 www.partow.net  .: Home :.   .: Links :.   .: Search :.   .: Contact :. 


Description

In this section, I attempt to explain the problems of complexity that occur during joins within queries from relational databases, and in a final solutions introduce a selection of join algorithms that can be used in various situations to reduce the complexity of query oriented joins.

The Problem

The term join means to perform an "algebraic" cross product on 2 sets of possibly differing tuples via common primary keys. In simple terms it means that the join operation will maintain the intersection of the two sets and filter out everything that is unique. This kind of operation is a very valuable operation in database queries. It is one of the ultimate forms of relational quantification, the problem with joining is that intuitive algorithms for joining have a very high algorithmic complexity which can lead to an explosion in time requirements for simple queries on large databases.

An Example

To demonstrate where a join can be used, lets assume there is a university that has both male and female students, the university also provides a large variety of subjects, such as art, history, programming, mathematics etc...

We would like to execute a query on the university's database. The query will be how many female students take both history and mathematics. The solution to this problem is to use a divide and conquer technique and to hence break the query down into smaller simpler sub-queries. In this case one query for the history course database and one query for the mathematics course database (these queries can be executed in parallel). Obviously each query will be in the form of: "extract all females from the course"

The next step would be to use a common primary key which will be available in both the result from the history course query and mathematics course query which in most cases will be the student identification number. The join will literally be done using the student ID and traversing through the result of one query (ie: history course query) taking each result and comparing it to each element in the other query (ie: mathematics course query), if a common primary key is found (meaning there is a common student ID in both query results) it means the student represented by that primary key is doing both history and mathematics, and because the initial queries where based on gender automatically filtering out males it can further be inferred that the student is female.

Quary Join Diagram - Copyright Arash Partow


The problem here is that the join algorithm mentioned above is naive and has a quadratic complexity (O(n²)). This means that the algorithm will increase its running time by the square of the increase of the first query multiplied by the increase of the second query. This level of processing requirement is not acceptable for database management systems which are required to service multiple queries with results instantly.

On The Road To A Solution

Analysis of the join process and its relevant principles from an algorithmic point of view leads one to conclude the following:

  • Over Processing

    In the naive algorithm for joins as described above, the tuples in each set (query result) are processed more than is required by the minimum number of necessary scans for a join to take place. An optimal solution would eradicate the over processing from the naive solution.

  • No Order In Result Of Joins

    The naive algorithm when completed returns a final column of tuples. However the column has no ordering as far as the primary key is concerned, in theory you could do the join and then sort the resulting tuple, with a nice sorting algorithm such as quick-sort, however further analysis proves that sorting of the final result can be accomplished during the join itself, hence not requiring a final sort of the resulting column. An optimal solution would attempt to consolidate series of processing such as sorting and grouping into the one algorithm hence reducing overall processing complexity.

A series of pre-processing steps can be taken in order to ensure the efficiency of the join

  • Sorting Of The Initial Query Columns

    Basically this means that the resulting columns from the queries which later on will be joined should be sorted according to the common primary key. This gives the naive algorithm the ability to only search a small section of the target column rather than having to search the entire target column for joining a particular tuple in the source column.

    There are a few issues to contend with in this optimization, mostly to do with the style of sorting, different sorting algorithms have different complexities, they also have different memory requirements, some much more than others for example quick sort. Also one may want the elements in the columns to be sorted in a relative fashion, this comes under the umbrella of stable and volatile sorting. Usually stable sorting algorithms have high algorithmic complexities where as volatile sorting algorithms such as quick sort and merge sort tend to have low algorithmic complexities. Ultimately the requirements decide what kind of sorting methodology will be adequate.

  • Usage Of Hash Tables For Quick Look-ups

    In this optimization, one of the columns are placed into a hash-table and the other column is queried against the hash-table, this results in look-ups of complexity O(1) with a maximum number of look-ups being the size of the queried column. In this implementation it would be better to hash the larger column and query with the shorter one, because the aim is to reduce the over-all number of hash-table queries.

    The main disadvantage with this solution is that it is optimized for speed, and in doing so is very memory hungry. 20 years ago this may have been an issue, however with the continual decrease in the cost of memory units, such an issue can be resolved simply by adding more memory to the DBMS.

In the database world, there have been a number of algorithms developed for joining queries in an efficient manner, many of these algorithms can be noted as being derived from one of the following base algorithms:

  • Simple Nested Loop Join

    For every element in the first column one complete traversal in the second column occurs. This leads to a quadratic algorithmic complexity (O(n^2)), this method is highly inefficient.

  • Sort Merge Join

    In this algorithm both query columns are both sorted. Traversal of the first column occurs against the second column. Because of the sorted nature of both columns repetitive looping of the target column is not required. instead a progressive traversal through both columns occurs with no more than ONE scan per element in each column. The best solution using this algorithm has a complexity of along the lines of O(nlogn)

  • Block Nested Loop Joins

    To be completed...

  • Hash Join

    To be completed...

General Join Conditions

A join can occur over several attributes within a tuple. This can lead to instantenous joining on serveral properties requiring less overall joins to be completed. Hash-join and sort-merge joins would not be the best choice for applications that require simultaneous joins on more than one attribute, a better suited algorithm would the best the Nested Loop Join.


Simple Database Algorithms License

Free use of the Simple Database Algorithms library is permitted under the guidelines and in accordance with the most current version of the "Common Public License."


Compatability

The Simple Database Algorithms C implementation is compatible with the following C compiler:

  • GNU Compiler Collection (3.3.1-x+)
  • Intel® C Compiler (8.x+)

Download




Copyright Arash Partow