SELECT a.index_id, b.name, a.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID('AdventureWorks'), OBJECT_ID('Production.Product'), NULL, NULL, NULL) AS a
INNER JOIN sys.indexes AS b
ON a.object_id = b.object_id
AND a.index_id = b.index_id
Thursday, August 23, 2007
Isolation level
Transactions specify an isolation level that defines the degree to which one transaction must be isolated from resource or data modifications made by other transactions.
A transaction always gets an exclusive lock on any data it modifies, and holds that lock until the transaction completes, regardless of the isolation level set for that transaction.
For read operations, transaction isolation levels primarily define the level of protection from the effects of modifications made by other transactions.
READ UNCOMMITTED
Specifies that statements can read rows that have been modified by other transactions but not yet committed.
Transactions running at the READ UNCOMMITTED level do not issue shared locks to prevent other transactions from modifying data read by the current transaction. READ UNCOMMITTED transactions are also not blocked by exclusive locks that would prevent the current transaction from reading rows that have been modified but not committed by other transactions. When this option is set, it is possible to read uncommitted modifications, which are called dirty reads. Values in the data can be changed and rows can appear or disappear in the data set before the end of the transaction. This option has the same effect as setting NOLOCK on all tables in all SELECT statements in a transaction. This is the least restrictive of the isolation levels.
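A minimal sketch of how this level is set, borrowing the authors table from the pubs examples later in this post:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
BEGIN TRANSACTION
-- May return dirty rows that other transactions later roll back
SELECT au_fname, au_lname FROM authors
COMMIT TRANSACTION

-- Equivalent per-table hint:
SELECT au_fname, au_lname FROM authors WITH (NOLOCK)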
READ COMMITTED
Specifies that statements cannot read data that has been modified but not committed by other transactions. This prevents dirty reads. Data can be changed by other transactions between individual statements within the current transaction, resulting in nonrepeatable reads or phantom data. This option is the SQL Server default.
REPEATABLE READ
Specifies that statements cannot read data that has been modified but not yet committed by other transactions and that no other transactions can modify data that has been read by the current transaction until the current transaction completes.
Shared locks are placed on all data read by each statement in the transaction and are held until the transaction completes. This prevents other transactions from modifying any rows that have been read by the current transaction. Other transactions can insert new rows that match the search conditions of statements issued by the current transaction. If the current transaction then retries the statement it will retrieve the new rows, which results in phantom reads. Because shared locks are held to the end of a transaction instead of being released at the end of each statement, concurrency is lower than the default READ COMMITTED isolation level. Use this option only when necessary.
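A hedged two-session sketch of the phantom problem described above, again using the pubs authors table purely for illustration:

-- Session 1
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
BEGIN TRANSACTION
SELECT au_lname FROM authors WHERE state = 'CA'   -- first read; shared locks are held
-- Session 2 (another connection) is blocked from updating those rows,
-- but it can still INSERT a new author with state = 'CA'
SELECT au_lname FROM authors WHERE state = 'CA'   -- may now include the phantom row
COMMIT TRANSACTION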
SNAPSHOT
Specifies that data read by any statement in a transaction will be the transactionally consistent version of the data that existed at the start of the transaction. The transaction can only recognize data modifications that were committed before the start of the transaction. Data modifications made by other transactions after the start of the current transaction are not visible to statements executing in the current transaction. The effect is as if the statements in a transaction get a snapshot of the committed data as it existed at the start of the transaction.
Except when a database is being recovered, SNAPSHOT transactions do not request locks when reading data. SNAPSHOT transactions reading data do not block other transactions from writing data. Transactions writing data do not block SNAPSHOT transactions from reading data.
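Snapshot isolation must first be enabled at the database level; a minimal sketch, using pubs only as an illustrative database name:

ALTER DATABASE pubs SET ALLOW_SNAPSHOT_ISOLATION ON

SET TRANSACTION ISOLATION LEVEL SNAPSHOT
BEGIN TRANSACTION
-- Reads the committed version of the data as of the start of the transaction;
-- no shared locks are taken, so writers are not blocked
SELECT au_fname, au_lname FROM authors
COMMIT TRANSACTION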
SERIALIZABLE
Specifies that:
Statements cannot read data that has been modified but not yet committed by other transactions.
No other transactions can modify data that has been read by the current transaction until the current transaction completes.
Other transactions cannot insert new rows with key values that would fall in the range of keys read by any statements in the current transaction until the current transaction completes.
Range locks are placed in the range of key values that match the search conditions of each statement executed in a transaction. This blocks other transactions from updating or inserting any rows that would qualify for any of the statements executed by the current transaction. This means that if any of the statements in a transaction are executed a second time, they will read the same set of rows. The range locks are held until the transaction completes. This is the most restrictive of the isolation levels because it locks entire ranges of keys and holds the locks until the transaction completes. Because concurrency is lower, use this option only when necessary. This option has the same effect as setting HOLDLOCK on all tables in all SELECT statements in a transaction.
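A minimal sketch of the two equivalent forms mentioned above, with the pubs authors table standing in as an example:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION
-- Key-range locks cover the qualifying range; other sessions cannot insert
-- a matching row until this transaction completes
SELECT au_lname FROM authors WHERE state = 'CA'
COMMIT TRANSACTION

-- Same effect with HOLDLOCK on the tables in a transaction's SELECT statements:
BEGIN TRANSACTION
SELECT au_lname FROM authors WITH (HOLDLOCK) WHERE state = 'CA'
COMMIT TRANSACTION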
Concurrency control theory has two classifications for the methods of instituting concurrency control:
Pessimistic concurrency control
A system of locks prevents users from modifying data in a way that affects other users. After a user performs an action that causes a lock to be applied, other users cannot perform actions that would conflict with the lock until the owner releases it. This is called pessimistic control because it is mainly used in environments where there is high contention for data, where the cost of protecting data with locks is less than the cost of rolling back transactions if concurrency conflicts occur.
Optimistic concurrency control
In optimistic concurrency control, users do not lock data when they read it. When a user updates data, the system checks to see if another user changed the data after it was read. If another user updated the data, an error is raised. Typically, the user receiving the error rolls back the transaction and starts over. This is called optimistic because it is mainly used in environments where there is low contention for data, and where the cost of occasionally rolling back a transaction outweighs the costs of locking data when read.
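As a hedged sketch of that read-then-check pattern (the Widget table and its rowversion column are hypothetical, not something from this post):

CREATE TABLE Widget (
    WidgetID int IDENTITY(1,1) PRIMARY KEY,
    Name nvarchar(50) NOT NULL,
    RowVer rowversion
)

DECLARE @id int, @originalVer binary(8)
SET @id = 1

-- Read phase: capture the version with the data; no locks are held afterwards
SELECT @originalVer = RowVer FROM Widget WHERE WidgetID = @id

-- Write phase: succeed only if nobody changed the row since it was read
UPDATE Widget
SET Name = N'Updated name'
WHERE WidgetID = @id AND RowVer = @originalVer

IF @@ROWCOUNT = 0
    RAISERROR('Row was changed by another user; roll back and retry.', 16, 1)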
If a data storage system has no concurrency control, users could see the following side effects:
Lost updates.
Lost updates occur when two or more transactions select the same row and then update the row based on the value originally selected. Each transaction is unaware of the other transactions. The last update overwrites updates made by the other transactions, which results in lost data.
Uncommitted dependency (dirty read).
Uncommitted dependency occurs when a second transaction selects a row that is being updated by another transaction. The second transaction is reading data that has not been committed yet and may be changed by the transaction updating the row.
Inconsistent analysis (nonrepeatable read).
Inconsistent analysis occurs when a second transaction accesses the same row several times and reads different data each time. Inconsistent analysis is similar to uncommitted dependency in that another transaction is changing the data that a second transaction is reading. However, in inconsistent analysis, the data read by the second transaction was committed by the transaction that made the change. Also, inconsistent analysis involves multiple reads (two or more) of the same row, and each time the information is changed by another transaction; thus, the term nonrepeatable read.
Phantom reads.
Phantom reads occur when an insert or delete action is performed against a row that belongs to a range of rows being read by a transaction. The transaction's first read of the range of rows shows a row that no longer exists in the second or succeeding read as a result of a deletion by a different transaction. Similarly, the transaction's second or succeeding read shows a row that did not exist in the original read as the result of an insertion by a different transaction.
The following table shows the concurrency side effects allowed by the different isolation levels.
Isolation level     Dirty read   Nonrepeatable read   Phantom
Read uncommitted    Yes          Yes                  Yes
Read committed      No           Yes                  Yes
Repeatable read     No           No                   Yes
Snapshot            No           No                   No
Serializable        No           No                   No
Query Plan Joins
SQL Server employs three types of join operations:
Nested loops joins
If one join input is small (fewer than 10 rows) and the other join input is fairly large and indexed on its join columns, an index nested loops join is the fastest join operation because it requires the least I/O and the fewest comparisons.
The nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in the graphical execution plan) and one as the inner (bottom) input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.
A nested loops join is effective if the outer input is small and the inner input is preindexed and large.
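A hedged sketch using the pubs tables from the join examples later in this post; the OPTION hint only forces the optimizer's physical join choice so the operator can be inspected in the execution plan:

USE pubs
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors AS a INNER JOIN publishers AS p
ON a.city = p.city
OPTION (LOOP JOIN)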
Merge joins
If the two join inputs are not small but are sorted on their join column (for example, if they were obtained by scanning sorted indexes), a merge join is the fastest join operation. If both join inputs are large and the two inputs are of similar sizes, a merge join with prior sorting and a hash join offer similar performance. However, hash join operations are often much faster if the two input sizes differ significantly from each other.
The merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. The query optimizer typically scans an index, if one exists on the proper set of columns, or it places a sort operator below the merge join.
Because each input is sorted, the Merge Join operator gets a row from each input and compares them. For example, for inner join operations, the rows are returned if they are equal. If they are not equal, the lower-value row is discarded and another row is obtained from that input. This process repeats until all rows have been processed.
A merge join can be an expensive choice if sort operations are required. However, if the data volume is large and the desired data can be obtained presorted from existing B-tree indexes, merge join is often the fastest available join algorithm.
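The same illustrative query forced to a merge join; if no suitable sorted indexes exist, the plan will show sort operators below the join:

USE pubs
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors AS a INNER JOIN publishers AS p
ON a.city = p.city
OPTION (MERGE JOIN)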
Hash joins
Hash joins can efficiently process large, unsorted, nonindexed inputs. They are useful for intermediate results in complex queries because:
Intermediate results are not indexed (unless explicitly saved to disk and then indexed) and often are not suitably sorted for the next operation in the query plan.
Query optimizers estimate only intermediate result sizes. Because estimates can be very inaccurate for complex queries, algorithms to process intermediate results not only must be efficient, but also must degrade gracefully if an intermediate result turns out to be much larger than anticipated.
The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.
In-Memory Hash Join
The hash join first scans or computes the entire build input and then builds a hash table in memory. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If the entire build input is smaller than the available memory, all rows can be inserted into the hash table. This build phase is followed by the probe phase. The entire probe input is scanned or computed one row at a time, and for each probe row, the hash key's value is computed, the corresponding hash bucket is scanned, and the matches are produced.
Grace Hash Join
If the build input does not fit in memory, a hash join proceeds in several steps. This is known as a grace hash join. Each step has a build phase and probe phase. Initially, the entire build and probe inputs are consumed and partitioned (using a hash function on the hash keys) into multiple files. Using the hash function on the hash keys guarantees that any two joining records must be in the same pair of files. Therefore, the task of joining two large inputs has been reduced to multiple, but smaller, instances of the same tasks. The hash join is then applied to each pair of partitioned files.
Recursive Hash Join
If the build input is so large that inputs for a standard external merge would require multiple merge levels, multiple partitioning steps and multiple partitioning levels are required. If only some of the partitions are large, additional partitioning steps are used for only those specific partitions. In order to make all partitioning steps as fast as possible, large, asynchronous I/O operations are used so that a single thread can keep multiple disk drives busy.
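And the same illustrative query forced to a hash join; the optimizer picks the smaller input (here most likely publishers) as the build input:

USE pubs
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors AS a INNER JOIN publishers AS p
ON a.city = p.city
OPTION (HASH JOIN)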
Wednesday, August 22, 2007
Join Story
There are three types of joins: inner, outer, and cross.
In addition, there are three types of outer joins: left, right, and full.
Inner Joins
In an inner join, records from two tables are combined and added to a query's results only if the values of the joined fields meet certain specified criteria. If you use an inner join to combine the authors and publishers tables based on their city and state columns, the result would be a list of all authors who live in the same city as a publisher:
USE pubs
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors AS a INNER JOIN publishers AS p
ON a.city = p.city
AND a.state = p.state
ORDER BY a.au_lname ASC, a.au_fname ASC
au_fname au_lname pub_name
Abraham Bennet Algodata Infosystems
Cheryl Carson Algodata Infosystems
Outer Joins
An outer join returns all rows from at least one of the joined tables, whether or not there's a matching row in the other table. The ON clause supplements the data rather than filtering it. The three types of outer joins, left, right, and full, indicate which table supplies the main data.
Left Outer Joins
When you use a left outer join to combine two tables, all the rows from the left-hand table are included in the results. So, for the authors and publishers tables, the result will include a list of all authors along with a publisher's name column. If a publisher exists in the author's city, it's listed. Otherwise, the field in the publisher's column is set to NULL:
USE pubs
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors a LEFT OUTER JOIN publishers p
ON a.city = p.city
ORDER BY p.pub_name ASC, a.au_lname ASC, a.au_fname ASC
au_fname au_lname pub_name
Reginald Blotchet-Halls NULL
Michel DeFrance NULL
Innes del Castillo NULL
Ann Dull NULL
Marjorie Green NULL
Morningstar Greene NULL
Burt Gringlesby NULL
Sheryl Hunter NULL
Livia Karsen NULL
Charlene Locksley NULL
Stearns MacFeather NULL
Heather McBadden NULL
Michael O'Leary NULL
Sylvia Panteley NULL
Albert Ringer NULL
Anne Ringer NULL
Meander Smith NULL
Dean Straight NULL
Dirk Stringer NULL
Johnson White NULL
Akiko Yokomoto NULL
Abraham Bennet Algodata Infosystems
Cheryl Carson Algodata Infosystems
Right Outer Joins
A right outer join is conceptually the same as a left outer join except that all the rows from the right-hand table are included in the results. They can be included more than once, if more than one author exists in the publisher's city. If no author lives in the publisher's city, the author name fields are set to NULL:
USE pubs
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors AS a RIGHT OUTER JOIN publishers AS p
ON a.city = p.city
ORDER BY p.pub_name ASC, a.au_lname ASC, a.au_fname ASC
au_fname au_lname pub_name
Abraham Bennet Algodata Infosystems
Cheryl Carson Algodata Infosystems
NULL NULL Binnet & Hardley
NULL NULL Five Lakes Publishing
NULL NULL GGG&G
NULL NULL Lucerne Publishing
NULL NULL New Moon Books
NULL NULL Ramona Publishers
NULL NULL Scootney Books
Full Outer Join
As you might have gathered, a full outer join retrieves all the rows from both joined tables. It returns all of the paired rows where the join condition is true, plus the unpaired rows from each table concatenated with NULL rows from the other table. You usually won't want to use one of these.
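For completeness, a sketch in the same style as the other examples (output not shown; it would combine the rows of the left and right outer join results above):

USE pubs
SELECT a.au_fname, a.au_lname, p.pub_name
FROM authors AS a FULL OUTER JOIN publishers AS p
ON a.city = p.city
ORDER BY p.pub_name ASC, a.au_lname ASC, a.au_fname ASC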
Cross Join
A cross join returns not the sum but the product of two tables. Each row in the left-hand table is matched up with each row in the right-hand table. It's the set of all possible row combinations, without any filtering, as shown here:
USE pubs
SELECT au_fname, au_lname, pub_name
FROM authors CROSS JOIN publishers
ORDER BY au_lname DESC
The resultset contains 184 rows (authors has 23 rows, and publishers has 8; therefore, 23 × 8 = 184).
The first 11 rows are:
au_fname au_lname pub_name
Akiko Yokomoto New Moon Books
Akiko Yokomoto Binnet & Hardley
Akiko Yokomoto Algodata Infosystems
Akiko Yokomoto Five Lakes Publishing
Akiko Yokomoto Ramona Publishers
Akiko Yokomoto GGG&G
Akiko Yokomoto Scootney Books
Akiko Yokomoto Lucerne Publishing
Johnson White New Moon Books
Johnson White Binnet & Hardley
Johnson White Algodata Infosystems
However, if you add a WHERE clause (like WHERE authors.city = publishers.city), a cross join functions as an inner join—it uses the condition to filter all possible row combinations down to the ones you want:
USE pubs
SELECT au_fname, au_lname, pub_name
FROM authors CROSS JOIN publishers
WHERE authors.city = publishers.city
ORDER BY au_lname DESC
au_fname au_lname pub_name
Cheryl Carson Algodata Infosystems
Abraham Bennet Algodata Infosystems