Swimming around in Relational Databases

Relational Databases

Let’s focus on the way a database handles a SQL query:

First of all, there’s an important concept called time complexity. Time complexity is used to see how long an algorithm will take for a given amount of data. What matters is not the amount of data itself, but the way the number of operations grows as the amount of data increases.

Let’s look at how the number of operations grows for a few common complexity classes when processing a specific amount of data:

To process about 2000 elements:

O(1)  = 1 operation

O(log(n)) = 7 operations

O(n) = 2 000 operations

O(n*log(n)) = 14 000 operations

O(n^2) = 4 000 000 operations

Even though the difference between O(1) and O(n^2) seems like a lot, you will lose at most around 2 ms – just the time it takes to blink. Current processors can handle hundreds of millions of operations per second, which is why performance and optimization are not a priority in many IT projects.

It’s important to understand this concept when dealing with a huge number of elements, for example about 1 000 000 of them.

Now the difference between O(1) and O(n^2) grows to 1 operation versus 1 000 000 000 000 operations. You’d probably have time to get a coffee while that runs. This is where performance optimizations matter.
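To make the growth concrete, here is a small Python sketch (my own, not from the post) that prints rough operation counts for both data sizes; I take log as the natural logarithm, which lands in the same ballpark as the post’s rounded figures:

import math

# Rough operation counts for each complexity class.
# Assumption: log is the natural logarithm; the post's figures are rounded ballpark values.
def operation_counts(n):
    return {
        "O(1)": 1,
        "O(log(n))": round(math.log(n)),
        "O(n)": n,
        "O(n*log(n))": round(n * math.log(n)),
        "O(n^2)": n ** 2,
    }

for n in (2_000, 1_000_000):
    print(n, operation_counts(n))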

Let’s break down what these different time complexities mean:

O(1) – A search in a good hash table

O(log(n)) – A search in a well-balanced tree

O(n) – A search in an array

O(n*log(n)) – The very best sorting algorithms

O(n^2) – A bad sorting algorithm
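As a rough illustration (my own sketch, not from the post), here is what those three lookups look like in Python; a dict stands in for the hash table, and binary search on a sorted list stands in for a balanced tree, since the standard library has no tree type:

import bisect

data = list(range(2000))                 # a sorted array of 2 000 elements
index = {value: True for value in data}  # a hash table keyed by value
target = 1337

# O(1) on average: hash the key and jump straight to its bucket
found_in_hash = target in index

# O(log(n)): binary search on the sorted array (stand-in for a balanced tree)
pos = bisect.bisect_left(data, target)
found_in_tree = pos < len(data) and data[pos] == target

# O(n): scan the array element by element
found_in_array = any(value == target for value in data)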

Now, a time complexity can also be given for different scenarios:

-Average case scenario

-Best case scenario

-Worst case scenario (usually the one considered)

There are also other kinds of complexity, such as an algorithm’s memory consumption or its disk I/O consumption.

There are, of course, worse complexities than n^2, such as:

n^4 = That sucks

3^n = Even worse

n! (factorial) = You’ll never get the results, even with a low number of elements

n^n = Consider another field of profession

When you need to sort a collection, you call the sort() function. There are many good sorting algorithms; one of them is the merge sort.

Let’s say we need to merge two sorted arrays of size N/2 into an N-element sorted array. This operation is called a merge.
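As a rough sketch (mine, not the post’s code), merging two already-sorted halves takes about N operations: at each step you take the smaller of the two front elements.

def merge(left, right):
    # Merge two already-sorted lists into one sorted list in about N steps.
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Take whichever front element is smaller
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One half is exhausted; append whatever remains of the other
    result.extend(left[i:])
    result.extend(right[j:])
    return result

print(merge([1, 4, 7, 9], [2, 3, 8, 10]))  # [1, 2, 3, 4, 7, 8, 9, 10]

With that operation in hand, the merge sort itself looks like this in pseudocode: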

array mergeSort(array a)
   if (length(a) == 1)
      return a;
   end if

   // division phase: split the array into 2 equally sized halves
   [left_array, right_array] := split_into_2_equally_sized_arrays(a);

   // recursive calls: sort each half
   array new_left_array := mergeSort(left_array);
   array new_right_array := mergeSort(right_array);

   // merging the 2 small sorted arrays into a big one
   array result := merge(new_left_array, new_right_array);
   return result;
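For a runnable equivalent, here is a minimal Python sketch of the same idea (my own; it leans on heapq.merge from the standard library for the merge step):

import heapq

def merge_sort(a):
    # Base case: an array of 0 or 1 elements is already sorted
    if len(a) <= 1:
        return a
    # Division phase: split into 2 equally sized halves
    middle = len(a) // 2
    left = merge_sort(a[:middle])
    right = merge_sort(a[middle:])
    # Sorting phase: merge the two sorted halves into one sorted array
    return list(heapq.merge(left, right))

print(merge_sort([8, 3, 5, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]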

This kind of algorithm is called divide and conquer. The division phase is where the array is divided into smaller arrays. The sorting phase is where the small arrays are merged back together to form bigger, sorted arrays. Let’s go more in depth into this merge sort, using an 8-element array as the example:

In the division phase, the array is divided into unitary (one-element) arrays in three steps. The formal number of steps is log(N): since N = 8, log(N) = 3 (log base 2 here). Each step divides the array in two, so the number of steps equals the number of times you can divide the initial array by two; that is exactly the definition of a logarithm. Notice how everything gets broken down into smaller and smaller pieces.

Now, in the sorting phase, you start out with the unitary arrays. During each step you apply multiple merges, and the overall cost per step is N = 8 operations. The first step has 4 merges costing 2 operations each. The second step has 2 merges costing 4 operations each. The third and final step has 1 merge costing 8 operations.

Since there are log(N) steps and each step costs N operations, the overall cost is N * log(N) operations.
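A quick sanity check of that arithmetic in Python (mine; log is taken as base 2 here to match the step count):

import math

N = 8
steps = int(math.log2(N))     # number of division/merge steps: log2(8) = 3
cost_per_step = N             # each step moves all N elements through its merges
print(steps * cost_per_step)  # 24 operations in total, i.e. N * log(N)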

So why is this algorithm so powerful?

  • You can modify it to reduce the memory footprint, so that instead of creating new arrays you modify the input array directly.
  • You can also modify it to use disk space together with a small amount of memory while avoiding a huge disk I/O penalty. The idea is to load into memory only the parts that are currently being processed. This matters when you need to sort a multi-gigabyte table with only a 100-megabyte memory buffer. This type of algorithm is called external sorting.
  • You can modify it to run on multiple processes/servers.
  • This sorting algorithm is used in virtually all databases.

More database-related topics coming soon!
