Home >>Distributed DBMS Tutorial >DDBMS - Relational Algebra for Query

# DDBMS - Relational Algebra for Query

### Relational Algebra for Query Optimization

It is initially scanned, parsed and validated when a query is placed. An internal representation of the query, such as a query tree or a query graph, is then generated. To extract results from the database tables, alternative execution methods are then devised. The method of selecting the most suitable query processing execution strategy is called query optimization.

### Query Optimization Issues in DDBMS

Query optimization in DDBMS is a crucial task. As the number of alternative strategies will increase exponentially due to the following factors, the complexity is high.

• The presence of a certain number of fragments.
• Distribution through different sites of the fragments or tables.
• The speed of the links for communication.
• Disparity in capabilities in local processing.

Hence, the goal in a distributed system is sometimes to find a successful query processing execution strategy rather than the right one. The time for a query to operate is the sum of the following −

• Time for databases to communicate queries.
• Time for local query fragments to be executed.
• Time for data to be compiled from different sites.
• Time to show the application 's performance.

### Query Processing

Query processing is a collection of all activities starting from the placement of the query to the display of the query results. As seen in the following diagram, the steps are-

### Relational Algebra

Relational algebra defines the core set of relational database model operations. A relational algebra expression forms a sequence of relational algebra operations. The result of this expression reflects the result of a query from a database.

The basic operations are –

• Projection
• Selection
• Union
• Intersection
• Minus
• Join

Projection

Projection operation displays a subset of fields of a table. This gives a vertical partition of the table.

Syntax in Relational Algebra

$$\pi_{<{AttributeList}>}{(<{Table Name}>)}$$

For example, let us consider the following Student database −

### STUDENT

Roll_No Name Course Semester Gender
2 Ankur BCA 1 Male
4 Mohit MCA 1 Male
5 Azam BCA 2 Male
6 Anshika BCA 1 Female
8 Shivani MCA 1 Female

We will use the following relational algebra expression if we want to show the names and courses of all students.

$$\pi_{Name,Course}{(STUDENT)}$$

Selection

The selection operation shows a subset of tuples that satisfy those conditions in a table. This gives the table a horizontal partition.

Syntax in Relational Algebra

$$\sigma_{<{Conditions}>}{(<{Table Name}>)}$$

For example, we can use the following relational algebra expression in the Student table if we want to show the information of all students who have selected MCA courses.

$$\sigma_{Course} = {\small "BCA"}^{(STUDENT)}$$

Combination of Projection and Selection Operations

We need a combination of projection and selection operations for most queries. These expressions can be written in two ways:

• Use of projection and selection sequence operations.
• Use Operation Rename to generate intermediate results.

To show the names of all female BCA course students , for example-

$$\pi_{Name}(\sigma_{Gender = \small "Female" AND \: Course = \small "BCA"}{(STUDENT)})$$
$$FemaleBCAStudent \leftarrow \sigma_{Gender = \small "Female" AND \: Course = \small "BCA"} {(STUDENT)}$$
$$Result \leftarrow \pi_{Name}{(FemaleBCAStudent)}$$
• Expression of relational algebra using projection and selection sequence operations
• Relational expression of algebra using the rename function to generate intermediate results

Union

If P is the result of an operation, and Q is the result of another operation, then the union of P and Q ($p \cup Q$) is the set of all tuples without duplicates in either P or Q, or both.

To reveal all students who are either in semester 1 or in the BCA course, for example-

$$Sem1Student \leftarrow \sigma_{Semester = 1}{(STUDENT)}$$
$$BCAStudent \leftarrow \sigma_{Course = \small "BCA"}{(STUDENT)}$$
$$Result \leftarrow Sem1Student \cup BCAStudent$$


Intersection

If P is the result of an operation and Q is the result of another operation, then the P and Q intersection ($p \cap Q$) is the set of all the tuples in both P and Q.

Provided the following two schemas, for example,-

EMPLOYEE

EmpID Name City Department Salary

PROJECT

PId City Department Status

To display the names of all cities where a project is located and also an employee resides −

$$CityEmp \leftarrow \pi_{City}{(EMPLOYEE)}$$
$$CityProject \leftarrow \pi_{City}{(PROJECT)}$$
$$Result \leftarrow CityEmp \cap CityProject$$


Minus

P-Q is the set of all tuples that are in P and not in Q, if P is the result of an operation and Q is the result of another operation.

For example, to list all departments that do not have an ongoing project (status projects = ongoing) –

$$AllDept \leftarrow \pi_{Department}{(EMPLOYEE)}$$
$$ProjectDept \leftarrow \pi_{Department} (\sigma_{Status = \small "ongoing"}{(PROJECT)})$$
$$Result \leftarrow AllDept - ProjectDept$$


Join

The Join operation combines the related tuples of two separate tables (query results) into one single table.

For example , consider two schemas in a bank database, Customer and Branch, as follows.

CUSTOMER

CustID AccNo TypeOfAc BranchID DateOfOpening

BRANCH

To list the employee details along with branch details −

$$Result \leftarrow CUSTOMER \bowtie_{Customer.BranchID=Branch.BranchID}{BRANCH}$$

### Translating SQL Queries into Relational Algebra

Until optimization, SQL queries are converted into equivalent relational algebra expressions. A query is initially broken down into smaller blocks of queries. The equivalent relational algebra expressions are translated from these blocks. Optimization involves optimizing each block and then optimizing the whole of the query.

Examples

Let us consider the following schemas −

EMPLOYEE

EmpID Name City Department Salary

PROJECT

PId City Department Status

WORKS

EmpID PID Hours

Example 1

We write the SQL query to show the details of all employees who earn a salary LESS than the average salary.

SELECT * FROM EMPLOYEE
WHERE SALARY < ( SELECT AVERAGE(SALARY) FROM EMPLOYEE ) ;


This query contains one nested sub-query. So, this can be broken down into two blocks.

The inner block is −

SELECT AVERAGE(SALARY)FROM EMPLOYEE ;

If the result of this query is AvgSal, then outer block is −

SELECT * FROM EMPLOYEE WHERE SALARY < AvgSal;

Relational algebra expression for inner block −

$$AvgSal \leftarrow \Im_{AVERAGE(Salary)}{EMPLOYEE}$$

Relational algebra expression for outer block −

$$\sigma_{Salary <{AvgSal}>}{EMPLOYEE}$$

Example 2

We write a SQL query to show the project ID and status of all employee 'Arun Kumar' projects.

SELECT PID, STATUS FROM PROJECT
WHERE PID = ( SELECT FROM WORKS  WHERE EMPID = ( SELECT EMPID FROM EMPLOYEE
WHERE NAME = 'ARUN KUMAR'));


This query contains two nested sub-queries. Thus, can be broken down into three blocks, as follows −

SELECT EMPID FROM EMPLOYEE WHERE NAME = 'ARUN KUMAR';
SELECT PID FROM WORKS WHERE EMPID = ArunEmpID;
SELECT PID, STATUS FROM PROJECT WHERE PID = ArunPID;


(Here ArunEmpID and ArunPID are the results of inner queries)

Relational algebra expressions for the three blocks are −

$$ArunEmpID \leftarrow \pi_{EmpID}(\sigma_{Name = \small "Arun Kumar"} {(EMPLOYEE)})$$
$$ArunPID \leftarrow \pi_{PID}(\sigma_{EmpID = \small "ArunEmpID"} {(WORKS)})$$
$$Result \leftarrow \pi_{PID, Status}(\sigma_{PID = \small "ArunPID"} {(PROJECT)})$$

Computation of Relational Algebra Operators

It is necessary to analyse relational algebra operators in several different ways, and any alternative is called an access road.

The alternative to computation relies on three main factors:

• Operator type
• Available memory
• Disk structures

The time for a relational algebra operation to be performed is the sum of-

• Time to process the tuples.
• Time to fetch the tuples of the table from disk to memory.

Because the time a tuple is processed is much shorter than the time it takes to get the tuple from the storage , especially in a distributed system, disk access is very often considered to be the metric for calculating the cost of relational expression.

### Computation of Selection

The selection operation estimation depends on the complexity of the selection situation and the availability of indexes for the table attributes.

Depending on the indexes, the computation alternatives below are

• No Index − If the table is unsorted and has no indexes, all the disk blocks in the table are checked by the selection process. Each block is put into the memory and each tuple in the block is examined to see if the selection condition is satisfied. If the condition is met, it is shown as an output. Since each tuple is brought into memory and each tuple is processed, this is the costliest approach.
• B+ Tree Index − The B+ Tree index is built on most database systems. If the selection condition is based on a field that is the key of this B+ Tree index, this index will be used to obtain the results. Processing selection statements with complex conditions, however, can require a greater number of accesses to the disk block and, in some cases, a full scan of the table.
• Hash Index-If hash indexes are used and their main field is used in the selection situation, it becomes a easy method to obtain tuples using the hash index. In order to find the address of a bucket where the key value corresponding to the hash value is stored, a hash index uses a hash function. The hash feature is executed and the bucket address is found in order to find a key value in the index. It looks for the key values in the bucket. If a match is found, the actual tuple is fetched into the memory from the disk block.

### Computation of Joins

Each tuple in P must be compared with each tuple in Q to test whether the join condition is satisfied when we want to join two tables, say P and Q. If the condition is satisfied, the corresponding tuples are concatenated, duplicate fields are deleted and added to the result relationship. This is, consequently, the most costly operation.

The popular approaches for computing joins are −

Nested-loop Approach

This is the conventional join approach. It can be illustrated through the following pseudocode (Tables P and Q, with tuples tuple_p and tuple_q and joining attribute a) −

For each tuple_p in P
For each tuple_q in Q
If tuple_p.a = tuple_q.a Then
Concatenate tuple_p and tuple_q and append to Result
End If
Next tuple_q
Next tuple-p


Sort-merge Approach

In this method, based on the joining attribute, the two tables are individually sorted and then the sorted tables are merged. As the number of records is very high and can not be accommodated in the memory, external sorting methods are introduced. Once the individual tables are sorted, one page is brought into the memory of each of the sorted tables, merged based on the joining attribute, and the joined tuples are written out.

Hash-join Approach

This approach consists of three steps: the phase of partitioning and the phase of probing. Tables P and Q are broken into two sets of disjointed partitions in the partitioning phase. A common feature of hash is agreed upon. For assigning tuples to partitions, this hash function is used. Tuples in a partition of P are contrasted with tuples in the corresponding partition of Q during the testing process. They are written out if they match, then.