Tech Point Fundamentals

Friday, May 27, 2022

SQL Interview Questions and Answers - Part 01


Are you preparing for the SQL Interview? If yes, then you are at the right place. This is the SQL Interview Questions and Answers article series. Here we will see Top 100+ SQL Interview Questions with Answers. 

Please visit our YouTube channel for interview and other videos via the link below:




Please read the complete Design Pattern, C#, MVC, WebAPI, and .Net Framework Interview Questions and Answers article series here.




Introduction


This is the 1st part of the SQL Interview Questions and Answers article series. Each part will contain eight to ten SQL Interview Questions with Answers. Please find the complete SQL Interview Questions list here.





SQL Interview Questions and Answers - Part 01


Q01. What is Normalization in SQL?

Database Normalization or Data Normalization is the process of structuring a relational database to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd in 1970. 

Normalization includes creating tables and establishing relationships between those tables according to normalization rules. Normalization is the process of reorganizing data in a database so that it meets two basic requirements:

  • There is no redundancy of data; all data is stored in only one place.
  • Data dependencies are logical; all related data items are stored together.

Normalization is a database design technique that is used to take a relational database table to a higher normal form. The process is progressive: a higher level of database normalization cannot be achieved unless the previous levels have been satisfied.

There is a series of standard rules known as Normal Forms which are used to normalize the data. Different Normal Forms are applied, based on the requirement, to normalize the database.

A fully normalized database allows its structure to be extended to accommodate new types of data without changing the existing structure too much. As a result, applications interacting with the database are minimally affected.



Advantage of Normalization:

  1. Normalization is a systematic way of organizing the data in the database.
  2. Normalization divides larger tables into smaller tables and links them using relationships, so that operations work only on the required tables (see the sketch after this list).
  3. Normalization is used to minimize data redundancy.
  4. Normalization is used to eliminate undesirable characteristics like insertion, update, and deletion anomalies.
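For illustration, here is a minimal sketch of normalization (all table and column names are hypothetical): a single Orders table that repeats the customer details in every row is split into two related tables, so each customer fact is stored only once.

-- Unnormalized design: customer details are repeated in every order row
CREATE TABLE OrdersUnnormalized
(
    OrderId      INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    CustomerCity VARCHAR(100),
    OrderDate    DATE
);

-- Normalized design: customer data is stored once and referenced via a foreign key
CREATE TABLE Customers
(
    CustomerId   INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    CustomerCity VARCHAR(100)
);

CREATE TABLE Orders
(
    OrderId    INT PRIMARY KEY,
    CustomerId INT NOT NULL REFERENCES Customers(CustomerId),
    OrderDate  DATE
);

With this structure, a change to a customer's city is made in exactly one place, and the insertion, update, and deletion anomalies of the single-table design disappear.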



Q02. What are the different Normal Forms (NF) in SQL?

In 1970, E.F. Codd proposed the relational data model and introduced the concept of normalization and what is now known as the normal form.

There are a few rules for database normalization, and each rule is called a "Normal Form". If the first rule is observed, the database is said to be in "first normal form". If the first three rules are observed, the database is considered to be in "third normal form".

There are mainly four types of normal forms, i.e., 1NF, 2NF, 3NF, and BCNF. Mostly, 3NF relations are free of insert, update, and delete anomalies. The normal forms beyond 4NF are mainly of academic interest, as the problems they exist to solve rarely appear in practice.

Whereas the second, third, and Boyce–Codd normal forms are concerned with functional dependencies, 4NF is concerned with a more general type of dependency known as a multivalued dependency.



Unnormalized Form (UNF): 

UNF is also known as the Unnormalized Form or Non-First Normal Form (N1NF). UNF is a database data model which does not meet any of the conditions of database normalization defined by the relational model. Database systems that support unnormalized data are sometimes called non-relational or NoSQL databases.



First Normal Form (1NF): 

Codd defined 1NF in 1970. To satisfy the First Normal Form, each column of a table must have a single value. Columns that contain sets of values or nested records are not allowed.

"A relation is in 1NF if it contains an atomic value."

This means each table cell should contain a single value and each and every record needs to be unique.

The 1NF says that an attribute of a table cannot hold multiple values. The first normal form disallows the multi-valued attribute, composite attribute, and their combinations.
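As a quick sketch (hypothetical table and column names), a table that stores several phone numbers in one cell violates 1NF; the repeating values are moved to a separate table with one atomic value per row.

-- Violates 1NF: the PhoneNumbers column holds multiple values in a single cell
CREATE TABLE EmployeeContactsBad
(
    EmployeeId   INT PRIMARY KEY,
    EmployeeName VARCHAR(100),
    PhoneNumbers VARCHAR(200)   -- e.g. '111-1111, 222-2222'
);

-- 1NF: one atomic phone number per row
CREATE TABLE EmployeePhones
(
    EmployeeId  INT,
    PhoneNumber VARCHAR(20),
    PRIMARY KEY (EmployeeId, PhoneNumber)
);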



Second Normal Form (2NF): 

Codd defined the second normal form (2NF) and third normal form (3NF) in 1971.

"A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functional dependent on the primary key. i.e A relation is in 2NF if it is in 1NF and every non-prime attribute of the relation is dependent on the whole of every candidate key."

A relation is in the second normal form if it fulfills the following two requirements:

  1. It is in its first normal form.
  2. It does not have any non-prime attribute that is functionally dependent on any proper subset of any candidate key of the relation. A non-prime attribute of a relation is an attribute that is not a part of any candidate key of the relation.

It says to create separate tables for sets of values that apply to multiple records and to relate these tables with a foreign key. Records should not depend on anything other than a table's primary key. A functional dependency on part of any candidate key is a violation of 2NF.
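A minimal sketch of a 2NF violation (hypothetical names): in an order-items table keyed by (OrderId, ProductId), ProductName depends only on ProductId, i.e., on part of the candidate key, so it is moved to its own table.

-- Violates 2NF: ProductName depends only on ProductId, a part of the composite key
CREATE TABLE OrderItemsBad
(
    OrderId     INT,
    ProductId   INT,
    ProductName VARCHAR(100),
    Quantity    INT,
    PRIMARY KEY (OrderId, ProductId)
);

-- 2NF: the partially dependent attribute moves to the Products table
CREATE TABLE Products
(
    ProductId   INT PRIMARY KEY,
    ProductName VARCHAR(100)
);

CREATE TABLE OrderItems
(
    OrderId   INT,
    ProductId INT REFERENCES Products(ProductId),
    Quantity  INT,
    PRIMARY KEY (OrderId, ProductId)
);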



Third Normal Form (3NF): 

3NF was originally defined by E. F. Codd in 1971. A database relation is said to meet third normal form standards if all the attributes (database columns) are functionally dependent on solely the primary key. 

Codd defined this as a relation in the second normal form where all non-prime attributes depend only on the candidate keys and do not have a transitive dependency on another key. 

A relation will be in 3NF if it is in 2NF and no transitive dependency exists. If there is no transitive dependency for non-prime attributes, then the relation is in third normal form. 3NF is used to reduce data duplication. It is also used to achieve data integrity.

"A relation will be in 3NF if it is in 2NF and does not contain any transitive partial dependency."

This says that values in a record that do not depend directly on that record's primary key do not belong in the table, so those values should be moved out into a separate table.

Transitive Functional Dependency: A transitive functional dependency exists when changing a non-key column might cause another non-key column to change.
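A minimal sketch of a transitive dependency (hypothetical names): DepartmentName depends on DepartmentId, which in turn depends on the key EmployeeId, so DepartmentName is moved into its own table.

-- Violates 3NF: EmployeeId -> DepartmentId -> DepartmentName (transitive dependency)
CREATE TABLE EmployeesBad
(
    EmployeeId     INT PRIMARY KEY,
    EmployeeName   VARCHAR(100),
    DepartmentId   INT,
    DepartmentName VARCHAR(100)
);

-- 3NF: non-key attributes depend only on the key
CREATE TABLE Departments
(
    DepartmentId   INT PRIMARY KEY,
    DepartmentName VARCHAR(100)
);

CREATE TABLE Employees
(
    EmployeeId   INT PRIMARY KEY,
    EmployeeName VARCHAR(100),
    DepartmentId INT REFERENCES Departments(DepartmentId)
);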



Boyce–Codd Normal Form (BCNF) or 3.5 NF:

Codd later realized that 3NF did not eliminate all undesirable data anomalies and developed a stronger version to address this in 1974, known as Boyce–Codd normal form.

Codd and Raymond F. Boyce defined the Boyce–Codd normal form (BCNF) in 1974. BCNF is the advanced version of 3NF and is stricter than 3NF. Even when a database is in third normal form, anomalies can still result if it has more than one overlapping candidate key.

If a relational schema is in BCNF then all redundancy based on functional dependency has been removed,  although other types of redundancy may still exist.  A 3NF table that does not have multiple overlapping candidate keys is guaranteed to be in BCNF.

"For BCNF, the table should be in 3NF, and for every FD, LHS is super key"

A relational schema R is in Boyce–Codd normal form if and only if for every one of its dependencies X → Y, at least one of the following conditions hold:

  • X → Y is a trivial functional dependency (Y ⊆ X)
  • X is a superkey for schema R.
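A classic textbook-style sketch (hypothetical names): the key of the first table is (StudentId, Subject), and each professor teaches exactly one subject, so the FD Professor → Subject holds but Professor is not a super key. The table is in 3NF (Subject is a prime attribute) yet violates BCNF, so it is decomposed.

-- 3NF but not BCNF: Professor -> Subject, and Professor is not a super key
CREATE TABLE StudentSubjectBad
(
    StudentId INT,
    Subject   VARCHAR(50),
    Professor VARCHAR(50),
    PRIMARY KEY (StudentId, Subject)
);

-- BCNF decomposition: every FD now has a super key on its left-hand side
CREATE TABLE ProfessorSubjects
(
    Professor VARCHAR(50) PRIMARY KEY,
    Subject   VARCHAR(50)
);

CREATE TABLE StudentProfessors
(
    StudentId INT,
    Professor VARCHAR(50) REFERENCES ProfessorSubjects(Professor),
    PRIMARY KEY (StudentId, Professor)
);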



Fourth Normal Form (4NF):

The Fourth Normal Form was introduced by Ronald Fagin in 1977. 4NF is the next level of normalization after the Boyce–Codd normal form (BCNF). 4NF is concerned with a more general type of dependency known as a multivalued dependency.

The Fourth Normal Form comes into the picture when Multi-valued Dependency occurs in any relation. 

"A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued dependency."

For a table to satisfy the Fourth Normal Form, it should satisfy the following two conditions: 

  1. It should be in the Boyce-Codd Normal Form.
  2. And, the table should not have any Multi-valued Dependency.

A table is said to have a multi-valued dependency if the following conditions are true:

  1. For a dependency A → B, if for a single value of A, multiple values of B exist, then the table may have a multi-valued dependency.
  2. Also, a table should have at least 3 columns for it to have a multi-valued dependency.
  3. And, for a relation R(A, B, C), if there is a multi-valued dependency between A and B, then B and C should be independent of each other.

If all these conditions are true for any relation(table), it is said to have a multi-valued dependency.

A trivial multivalued dependency X →→ Y is one where either Y is a subset of X, or X and Y together form the whole set of attributes of the relation. A functional dependency is a special case of multivalued dependency. In a functional dependency X → Y, every x determines exactly one y, never more than one.
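A minimal sketch of a multi-valued dependency (hypothetical names): a student's courses and hobbies are independent multi-valued facts, so keeping them in one table forces every combination to be stored; 4NF splits them apart.

-- Violates 4NF: StudentId ->-> Course and StudentId ->-> Hobby are independent
CREATE TABLE StudentCourseHobbyBad
(
    StudentId INT,
    Course    VARCHAR(50),
    Hobby     VARCHAR(50),
    PRIMARY KEY (StudentId, Course, Hobby)
);

-- 4NF: one table per independent multi-valued fact
CREATE TABLE StudentCourses
(
    StudentId INT,
    Course    VARCHAR(50),
    PRIMARY KEY (StudentId, Course)
);

CREATE TABLE StudentHobbies
(
    StudentId INT,
    Hobby     VARCHAR(50),
    PRIMARY KEY (StudentId, Hobby)
);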



Fifth Normal Form (5NF) or Project-Join Normal Form (PJ/NF):

The fifth normal form (5NF), also known as project-join normal form (PJ/NF), is a level of database normalization designed to reduce redundancy in relational databases recording multi-valued facts by isolating semantically related multiple relationships. 

The fifth normal form was first described by Ronald Fagin in his 1979 conference paper "Normal Forms and Relational Database Operators".

"A table is said to be in the 5NF if and only if every non-trivial join dependency in that table is implied by the candidate keys."

A relation is in 5NF only if:

  • It is in 4NF
  • It does not contain any join dependency
  • And joining is lossless

This means "a table is in 5th Normal Form only if it is in 4NF and it cannot be decomposed into any number of smaller tables without loss of data."




Q03. What is DeNormalization and why do we need it?

Denormalization is a strategy used on a previously-normalized database to increase performance. Denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. 

Denormalization is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. 

Denormalization is a technique in which we add duplicate data to one or more tables. With the help of this, we can avoid costly joins in a relational database. It speeds up read-oriented data retrieval performance in a relational database. 



DeNormalization vs UnNormalized Form:

A denormalized data model is not the same as a data model that has not been normalized, and denormalization should only take place after a satisfactory level of normalization has taken place and any required constraints and/or rules have been created to deal with the inherent anomalies in the design. 

Denormalization differs from the unnormalized form in that denormalization benefits can only be fully realized on a data model that is otherwise normalized.

Advantages of Denormalization:

  1. Minimizing the need for joins and reducing the number of tables
  2. Simplifying the SQL queries that are used to retrieve data
  3. Precomputing derived values
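As a minimal sketch (assuming the hypothetical Customers/Orders tables from Q01), denormalization can copy CustomerName onto Orders so that read queries avoid the join, at the cost of keeping the copy in sync on writes.

-- Add a redundant copy of the customer name to the Orders table
ALTER TABLE Orders ADD CustomerName VARCHAR(100) NULL;

-- Populate the redundant column from the normalized source
UPDATE o
SET    o.CustomerName = c.CustomerName
FROM   Orders AS o
JOIN   Customers AS c ON c.CustomerId = o.CustomerId;

-- Read queries no longer need the join to Customers
SELECT OrderId, OrderDate, CustomerName
FROM   Orders;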




Q04. What is the 8KB Page Concept in SQL?

The page is the fundamental unit of data storage in SQL Server. A page in SQL Server is always 8 KB in size. Whenever data is written, it is first placed into an 8 KB page (a fixed-size chunk of storage), and data is read back from these pages as well.

In SQL Server, all the data rows are written on pages, and all data pages are the same size, i.e., 8 KB.

Since 1 KB = 1024 bytes, the size of a SQL page is 8 KB = 8192 bytes. This means SQL Server databases have 128 pages per megabyte.

There are different types of pages such as data pages, index pages, system pages, IAM pages, etc. A data page is the page type that SQL Server actually uses to store the table data.

The index pages contain index references about where the data is, and finally, there are system pages that store a variety of metadata about the organization of the data (PFS, GAM, SGAM, IAM, DCM, BCM pages).






A page is divided into 3 sections:

Page Header:  Each page begins with a 96-byte header that is used to store system information about the page. This information includes the page number, page type, the amount of free space on the page, and the allocation unit ID of the object that owns the page.

Data: This contains actual rows of data that were stored by users. Data rows are put on the page serially, starting immediately after the header. The maximum amount of data and overhead that is contained in a single row on a page is 8,060 bytes (8-KB). 

Offset Table: A row offset table starts at the end of the page, and each row offset table contains one entry for each row on the page. Each row offset entry records how far the first byte of the row is from the start of the page.

Thus, the function of the row offset table is to help SQL Server locate rows on a page very quickly. The entries in the row offset table are in reverse sequence from the sequence of the rows on the page.
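To see a page header and its row offset table in practice, the undocumented (but widely used) DBCC PAGE command can dump a page; on SQL Server 2019 and later, sys.dm_db_page_info offers a documented alternative. The database name, file id, and page id below are placeholders for illustration.

-- Send DBCC output to the client, then dump a page (db, file id, page id, print option)
DBCC TRACEON (3604);
DBCC PAGE ('MyDatabase', 1, 232, 3);

-- SQL Server 2019+: read page header details from a documented function
SELECT page_type_desc, slot_count, free_bytes
FROM   sys.dm_db_page_info(DB_ID('MyDatabase'), 1, 232, 'DETAILED');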



Data Types and 8KB Page:

The maximum amount of data and overhead that is contained in a single row on a page is 8,060 bytes (8-KB). This restriction is relaxed for tables that contain varchar, nvarchar, varbinary, or sql_variant columns. 

When the total row size of all fixed and variable columns in a table exceeds the 8,060-byte limitation, SQL Server dynamically moves one or more variable-length columns to pages in the ROW_OVERFLOW_DATA allocation unit, starting with the column with the largest width. 

When a column is moved to a page in the ROW_OVERFLOW_DATA allocation unit, a 24-byte pointer on the original page in the IN_ROW_DATA allocation unit is maintained.

IN_ROW Data: Holds a partition of a heap or index. It is a B-tree or heap allocation unit. The IN_ROW data holds the data of static (fixed-length) data types.

LOB Data: LOB_DATA Holds large object (LOB) data types, such as XML, VARBINARY(max), and VARCHAR(max).

ROW_OVERFLOW_DATA: Holds variable-length data stored in VARCHAR, NVARCHAR, VARBINARY, or SQL_VARIANT columns that exceed the 8060 Byte row size limit.
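As a sketch, the allocation units actually used by a table can be checked with sys.dm_db_index_physical_stats; 'dbo.MyTable' is a placeholder name.

-- Shows whether the table uses IN_ROW_DATA, ROW_OVERFLOW_DATA, or LOB_DATA pages
SELECT index_id, alloc_unit_type_desc, page_count, record_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'DETAILED');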

Please watch the SQL 8KB Page video here for more details and a live demo.



Q05. What is the IAM Page in SQL?

IAM stands for Index Allocation Map. IAM pages hold the information about where all the pages of a table or index are stored in SQL Server.

Extents are the basic unit in which space is managed. An extent is a collection of eight physically contiguous pages, i.e., 64 KB. Extents help efficiently manage pages, and all pages are organized into extents.

A GAM (Global Allocation Map) page can track about 4 GB worth of space, and GAM pages are repeated at 4 GB intervals. An IAM page tracks the page/extent allocation in the GAM interval of a partition for a specific allocation unit of a table.



An IAM page maps the extents in a 4-GB part of a database file used by an allocation unit. In SQL Server 2005 and later, there are three types of allocation units: IN_ROW_DATA, LOB_DATA, and ROW_OVERFLOW_DATA.

An IAM page has a header that indicates the starting extent of the range of extents mapped by the IAM page. The IAM page also has a large bitmap in which each bit represents one extent. IAM pages are linked in a chain per allocation unit.

When the SQL Server Database Engine has to insert a new row and no space is available on the current page, it uses the IAM and PFS pages to find a page to allocate, or, for a heap or a Text/Image page, a page with sufficient space to hold the row. The SQL Server Database Engine uses the IAM pages to find the extents allocated to the allocation unit.
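To see the IAM pages of a table, the undocumented (but commonly used) DBCC IND command lists every page allocated to it; IAM pages appear with PageType = 10. The names below are placeholders.

-- db name, table name, index id (-1 = all indexes and the heap)
DBCC IND ('MyDatabase', 'dbo.MyTable', -1);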

Please watch the IAM Page video here.



Q06. What is Heap in SQL?

A heap is a table without any clustered index. In a Heap, data is stored without any underlying order. One or more nonclustered indexes can be created on tables stored as a heap. Data is stored in the heap without specifying an order.

When rows are inserted into a heap, there is no way to ensure where the pages will be written nor are those pages guaranteed to remain in the same order as the table is written to or when maintenance is performed against it.

Usually, data is initially stored in the order in which the rows are inserted into the table, but the Database Engine can move data around in the heap to store the rows efficiently; so the data order cannot be predicted. To guarantee the order of rows returned from a heap, you must use the ORDER BY clause. To specify a permanent logical order for storing the rows, create a clustered index on the table so that the table is not a heap.



When a table is stored as a heap, individual rows are identified by reference to an 8-byte row identifier (RID) consisting of the file number, data page number, and slot on the page (FileID:PageID:SlotID). If a table is a heap and does not have any nonclustered indexes, then the entire table must be read (a table scan) to find any row. SQL Server cannot seek a RID directly on the heap. 

Heaps have one row in sys.partitions, with index_id = 0, for each partition used by the heap. By default, a heap has a single partition. When a heap has multiple partitions, each partition has a heap structure that contains the data for that specific partition. The column first_iam_page in the sys.system_internals_allocation_units system view points to the first IAM page in the chain of IAM pages that manage the space allocated to the heap in a specific partition.
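A minimal sketch (hypothetical table name): creating a table without a clustered index produces a heap, which can be confirmed in the catalog views by the HEAP row with index_id = 0.

-- No PRIMARY KEY or clustered index, so this table is stored as a heap
CREATE TABLE dbo.StagingData
(
    Id      INT IDENTITY(1,1),
    Payload VARCHAR(200)
);

-- Heaps show type_desc = 'HEAP' and index_id = 0
SELECT t.name, i.type_desc, p.index_id, p.rows
FROM   sys.tables     AS t
JOIN   sys.indexes    AS i ON i.object_id = t.object_id
JOIN   sys.partitions AS p ON p.object_id = i.object_id AND p.index_id = i.index_id
WHERE  t.name = 'StagingData';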



Depending on the data types in the heap, each heap structure will have one or more allocation units to store and manage the data for a specific partition like IN_ROW_DATA, LOB_DATA, or ROW_OVERFLOW_DATA.

Table scans or serial reads of a heap can be performed by scanning the IAM pages to find the extents that are holding pages for the heap. Because the IAM represents extents in the same order that they exist in the data files, this means that serial heap scans progress sequentially through each file.

A heap table has no root, intermediate, or leaf-level pages; it has data pages only. Each page read from the IAM is a data page and can be processed. But rows on a data page of a heap table can contain forwarding pointers, which cause out-of-order data access. A Table Scan always uses the Index Allocation Map (IAM) to find all pages allocated to the heap table, similar to the “RowStore – Allocation Order Scan” of an Index Scan operator. 








Use of Heap:

Heaps can be used as staging tables for large, unordered insert operations. Because data is inserted without enforcing a strict order, the insert operation is usually faster than the equivalent insert into a clustered index. If the heap's data will be read and processed into a final destination, it may be useful to create a narrow nonclustered index that covers the search predicate used by the read query. Data is retrieved from a heap in order of data pages, but not necessarily the order in which data was inserted. However,

  1. Do not use a heap when the data is frequently returned in sorted order. A clustered index on the sorting column could avoid the sorting operation.
  2. Do not use a heap when the data is frequently grouped together. Data must be sorted before it is grouped, and a clustered index on the sorting column could avoid the sorting operation.
  3. Do not use a heap when ranges of data are frequently queried from the table. A clustered index on the range column will avoid sorting the entire heap.
  4. Do not use a heap when there are no nonclustered indexes and the table is large unless you intend to return the entire table content without any specified order. In a heap, all rows of the heap must be read to find any row.
  5. Do not use a heap if the data is frequently updated. If an update makes a record need more space than is currently available on its data page, the record has to be moved to a data page that has enough free space, leaving a forwarding pointer behind (see the sketch after this list).
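As a sketch of the last point (using the hypothetical heap dbo.StagingData from above), forwarded records in a heap can be detected with sys.dm_db_index_physical_stats and removed by rebuilding the heap.

-- forwarded_record_count is reported for heaps (index_id = 0) in DETAILED mode
SELECT forwarded_record_count, avg_fragmentation_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.StagingData'), 0, NULL, 'DETAILED');

-- Rebuilding the heap removes the forwarding pointers
ALTER TABLE dbo.StagingData REBUILD;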

Please watch the HEAP vs Clustered Table video here.



Q07. What is the difference between Table Scan vs Index Scan vs Index Seek?

An index scan or table scan is when SQL Server has to scan the data or index pages to find the appropriate records.  A scan is the opposite of a seek, where a seek uses the index to pinpoint the records that are needed to satisfy the query.



Table Scan:

A table scan happens when a lack of indexes, or incorrect indexes, causes SQL Server to process more data to find the records that meet the search criteria. So a table scan is the reading of every row in a table and is caused by queries that don’t properly use indexes.

A table that has no clustered index is known as a heap, and a heap without a suitable nonclustered index is always read with a table scan. But a table that has a clustered index may also be scanned (as a clustered index scan) if there is no proper index for the query.

When the query engine performs a table scan, it starts from the physical beginning of the table and goes through every row in the table. If a row matches the criteria, it is included in the result set. Table scans on large tables take an excessive amount of time and cause performance problems.



Index Scan:

If your table has a clustered index and you are firing a query that needs all or most of the rows, i.e., a query without a WHERE or HAVING clause, then it uses an index scan.

It works similarly to the table scan: during the query optimization process, the query optimizer looks at the available indexes and chooses the best one, based on the information provided in your joins and WHERE clause, along with the statistical information the database keeps. Once the right index is chosen, the SQL query processor or engine navigates the tree structure to the point of data that matches your criteria and again extracts only the records it needs.

The main difference between a table scan and an index scan is that, because data is sorted in the index tree, the query engine knows when it has reached the end of the range it is looking for. It can then end the query, or move on to the next range of data, as necessary.

There are two types of Index Scans: Clustered Index scans and Non-Clustered Index scans.



Index Seek:

When your search criterion matches an index well enough that the index can navigate directly to a particular point in your data, that's called an index seek. It is the fastest way to retrieve data in a database. The index seeks are also a great sign that your indexes are being properly used.

This happens when you specify a condition in the WHERE clause which has a respective index. Seek uses the index to pinpoint the records that are needed to satisfy the query. 

Index Seek is also of two types: Clustered Index Seek and Non-Clustered Index Seek.
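For illustration, the following sketches assume a hypothetical dbo.Orders table with a clustered index on OrderId and a nonclustered index on CustomerId; the actual operator chosen can be confirmed in the execution plan.

-- Likely a clustered index scan: no filter, all rows are needed
SELECT OrderId, OrderDate FROM dbo.Orders;

-- Likely an index seek: the WHERE clause matches the index on CustomerId
SELECT OrderId, OrderDate FROM dbo.Orders WHERE CustomerId = 42;

-- Likely a scan despite the filter: wrapping the column in a function makes the predicate non-sargable
SELECT OrderId, OrderDate FROM dbo.Orders WHERE YEAR(OrderDate) = 2022;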

Please watch the Table Scan vs Clustered Index Scan video here.




Q08. What is Fragmentation or Index Fragmentation in SQL?

Fragmentation can be defined as any condition that causes more than the optimal amount of disk I/O to be performed while accessing a table, or causes that disk I/O to take longer. Optimal performance of SELECT queries occurs when the data pages of tables are as contiguous as possible and the pages are packed as fully as possible. Fragmentation breaks this rule and reduces the performance of the queries.

Fragmentation can happen at two levels. One is file-system-level fragmentation, which is called logical or physical disk fragmentation, and the other is index-level fragmentation.



Logical/Physical Disk Fragmentation:

Logical fragmentation is the fragmentation of the database files in the file system itself, like any other file. This occurs when the file system is not able to allocate contiguous space for the database file. As a result, the disk head has to move back and forth to read from the database files. SQL Server is completely unaware of this kind of fragmentation, and it is not possible to measure logical disk fragmentation using any script.

To remove logical fragmentation, we can use the Windows defragmentation tool, but note that we need to stop SQL Server while running the defragmentation tool. Otherwise, the defragmentation tool will skip the database files because they are in use by SQL Server.

Index Level Fragmentation:

Index-level fragmentation is also of two types: internal fragmentation and external fragmentation. If index-level fragmentation is high, it may prevent the optimizer from using the available indexes in an optimal way.



Internal Index Fragmentation:

Internal fragmentation is measured by the average page fullness of the index, i.e., page density. A page that is 100% full has no internal fragmentation. Internal fragmentation occurs when there is empty space in the index pages, and this can happen due to insert/update/delete DML operations.

Every index page can hold a certain number of records based on the size of the index row, but that does not guarantee that the page always holds the maximum number of records. An index page that is only 90% full may still be full in terms of records: the remaining 10% of the page may not be enough to hold one more record.

Internal fragmentation increases I/O. When you run queries that scan part of or the complete table/index, internal fragmentation on that table/index causes additional page reads. Internal fragmentation also reduces the efficiency of the buffer cache, because when indexes have internal fragmentation, the same data needs more pages and therefore more space in the buffer pool.



External Index Fragmentation:

External Fragmentation happens when the logical order of the pages does not match the physical order of the pages. External fragmentation refers to the lack of correlation between the logical sequence of an index and its physical sequence. 

It is measured as the percentage of out-of-order pages in the leaf pages of an index. An out-of-order page is a page for which the next physical page allocated to the index is not the page pointed to by the next-page pointer in the current leaf page.

External Fragmentation can happen due to various reasons:

  • While allocating pages for a new table, SQL Server allocates pages from mixed extents until the object reaches 8 pages. So there is a possibility of the first 8 pages coming from 8 different extents. 
  • When all records are deleted from a page, the page is de-allocated from the index, but because the de-allocation of pages does not happen immediately, this creates gaps and increases fragmentation. Please watch the truncate vs delete video here for more details.
  • Once the object has reached 8 pages in size, SQL Server starts allocating uniform extents to the object. While allocating a uniform extent to an index, the next sequential extent after the current extent might already be allocated to other objects/indexes.
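As a sketch (index and table names are placeholders), both kinds of index fragmentation can be measured with sys.dm_db_index_physical_stats, and then reduced by reorganizing or rebuilding the index.

-- avg_fragmentation_in_percent ~ external (logical) fragmentation
-- avg_page_space_used_in_percent ~ page density (internal fragmentation)
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.avg_page_space_used_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'SAMPLED') AS ps
JOIN   sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

-- Commonly quoted guidance: roughly 5-30% fragmentation -> REORGANIZE, above 30% -> REBUILD
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REORGANIZE;
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD;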


To Be Continued Part-02...


Thanks for visiting this page. Please follow and join us on LinkedIn, Facebook, Telegram, Quora, YouTube, Twitter, Pinterest, Tumblr, and VK for regular updates.

    
