In this section, I attempt to explain the problems of complexity that occur during
joins within queries from relational databases, and in a final solutions introduce
a selection of join algorithms that can be used in various situations to reduce the
complexity of query oriented joins.
The Problem
The term join means to perform an "algebraic" cross product on 2 sets of
possibly differing tuples via common primary keys. In simple terms it means that the
join operation will maintain the intersection of the two sets and filter out everything
that is unique. This kind of operation is a very valuable operation in database queries.
It is one of the ultimate forms of relational quantification, the problem with joining
is that intuitive algorithms for joining have a very high algorithmic complexity which
can lead to an explosion in time requirements for simple queries on large databases.
An Example
To demonstrate where a join can be used, lets assume there is a university that has both
male and female students, the university also provides a large variety of subjects, such
as art, history, programming, mathematics etc...
We would like to execute a query on the university's database. The query will be how many
female students take both history and mathematics. The solution to this problem is to use
a divide and conquer technique and to hence break the query down into smaller simpler
sub-queries. In this case one query for the history course database and one query for the
mathematics course database (these queries can be executed in parallel). Obviously each
query will be in the form of: "extract all females from the course"
The next step would be to use a common primary key which will be available in both the
result from the history course query and mathematics course query which in most cases
will be the student identification number. The join will literally be done using the
student ID and traversing through the result of one query (ie: history course query)
taking each result and comparing it to each element in the other query (ie: mathematics
course query), if a common primary key is found (meaning there is a common student ID
in both query results) it means the student represented by that primary key is doing both
history and mathematics, and because the initial queries where based on gender automatically
filtering out males it can further be inferred that the student is female.
The problem here is that the join algorithm mentioned above is naive and has a quadratic
complexity (O(n²)). This means that the algorithm will increase its running time
by the square of the increase of the first query multiplied by the increase of the second
query. This level of processing requirement is not acceptable for database management systems
which are required to service multiple queries with results instantly.
On The Road To A Solution
Analysis of the join process and its relevant principles from an algorithmic
point of view leads one to conclude the following:
Over Processing
In the naive algorithm for joins as described above, the tuples in each set
(query result) are processed more than is required by the minimum number of
necessary scans for a join to take place. An optimal solution would eradicate
the over processing from the naive solution.
No Order In Result Of Joins
The naive algorithm when completed returns a final column of tuples. However
the column has no ordering as far as the primary key is concerned, in theory
you could do the join and then sort the resulting tuple, with a nice sorting
algorithm such as quick-sort, however further analysis proves that sorting of
the final result can be accomplished during the join itself, hence not requiring
a final sort of the resulting column. An optimal solution would attempt to
consolidate series of processing such as sorting and grouping into the one
algorithm hence reducing overall processing complexity.
A series of pre-processing steps can be taken in order to ensure the efficiency of
the join
Sorting Of The Initial Query Columns
Basically this means that the resulting columns from the queries which later on
will be joined should be sorted according to the common primary key. This gives
the naive algorithm the ability to only search a small section of the target
column rather than having to search the entire target column for joining a
particular tuple in the source column.
There are a few issues to contend with in this optimization, mostly to do with
the style of sorting, different sorting algorithms have different complexities,
they also have different memory requirements, some much more than others for
example quick sort. Also one may want the elements in the columns to be sorted
in a relative fashion, this comes under the umbrella of stable and volatile
sorting. Usually stable sorting algorithms have high algorithmic complexities
where as volatile sorting algorithms such as quick sort and merge sort tend to
have low algorithmic complexities. Ultimately the requirements decide what kind
of sorting methodology will be adequate.
Usage Of Hash Tables For Quick Look-ups
In this optimization, one of the columns are placed into a hash-table and the
other column is queried against the hash-table, this results in look-ups of
complexity O(1) with a maximum number of look-ups being the size of the queried
column. In this implementation it would be better to hash the larger column
and query with the shorter one, because the aim is to reduce the over-all number
of hash-table queries.
The main disadvantage with this solution is that it is optimized for speed, and
in doing so is very memory hungry. 20 years ago this may have been an issue,
however with the continual decrease in the cost of memory units, such an issue
can be resolved simply by adding more memory to the DBMS.
In the database world, there have been a number of algorithms developed for joining
queries in an efficient manner, many of these algorithms can be noted as being
derived from one of the following base algorithms:
Simple Nested Loop Join
For every element in the first column one complete traversal in the second column
occurs. This leads to a quadratic algorithmic complexity (O(n^2)), this method
is highly inefficient.
Sort Merge Join
In this algorithm both query columns are both sorted. Traversal of the first column
occurs against the second column. Because of the sorted nature of both columns
repetitive looping of the target column is not required. instead a progressive
traversal through both columns occurs with no more than ONE scan per element in
each column. The best solution using this algorithm has a complexity of along the
lines of O(nlogn)
Block Nested Loop Joins
To be completed...
Hash Join
To be completed...
General Join Conditions
A join can occur over several attributes within a tuple. This can lead to instantenous
joining on serveral properties requiring less overall joins to be completed. Hash-join and
sort-merge joins would not be the best choice for applications that require simultaneous joins
on more than one attribute, a better suited algorithm would the best the Nested Loop Join.
Simple Database Algorithms License
Free use of the Simple Database Algorithms library is permitted under the guidelines and in accordance with the most current
version of the "Common Public License."
Compatability
The Simple Database Algorithms C implementation is compatible with the following C compiler: