SQL Performance Explained by Markus Winand
Databases expose a declarative interface - SQL - which promises that you need only specify what needs to be done, without worrying about how it is done. When it comes to performance, however, the developer does need to worry, because poorly written queries perform poorly. Since neither the database admin nor an external consultant has insight into the codepaths of the application, it is up to the developer to understand them and optimise the queries the application will use. Typically that's done by indexing.
- Chapter 1 - Anatomy of an index
- Chapter 2 - Scalability
- Chapter 4 - Joins
- Chapter 5 - Clustering Data
- Chapter 6 - Sorting and Grouping
- Chapter 7 - Partial results
- Chapter 8 - Modifying data
Chapter 1 - Anatomy of an index
An index is a distinct data structure in the database. It requires disk space of its own, where it keeps a copy of the indexed data. Its purpose is to make queries faster.
- Data structures: Balanced search tree (B-tree) in addition to a Doubly Linked List
- The leaf nodes of the B-tree are connected to each other, forming a doubly linked list. This allows us to traverse all indexed rows quickly, and it makes insertions and deletions fast. The physical location of each node doesn't matter, because the linked list maintains the logical ordering.
- Each node contains one or more index entries. Each node is stored in a single database block or page (the smallest storage unit of the database, typically a few kB).
- Each index entry contains the value of the indexed column(s) and a ROWID, which points to the location of the row in the table.
- The rows of the database are stored in a heap structure and are unordered.
- Each branch node of the B-tree refers to multiple leaf nodes. For each leaf node, the highest value in the leaf node and a pointer to the leaf node are stored.
- The next layer is built above the first level branch nodes.
- The procedure is repeated until all branch nodes of a layer fit into a single node, the root node.
- The B-tree is balanced because its depth is equal at all points.
- The tree depth grows very slowly compared to the number of leaf nodes - logarithmically. This means that accessing any leaf node is fast.
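The logarithmic growth is easy to sketch in Python. The fan-out of 285 entries per node is a hypothetical figure for illustration; real values depend on block size and key width:

```python
import math

# Hypothetical fan-out: index entries per node. Real values depend on
# block size and key width; a few hundred per 4-8 kB block is plausible.
FANOUT = 285

def tree_depth(rows, fanout=FANOUT):
    # Levels needed so that fanout**depth >= rows.
    return max(1, math.ceil(math.log(rows, fanout)))

depths = {rows: tree_depth(rows) for rows in (10_000, 1_000_000, 100_000_000)}
```

Growing the table by a factor of 10,000 adds only two levels - which is why tree traversal stays fast even for huge tables.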
Slow Indexes, part 1
An index lookup happens in three steps, visible as these operations in the execution plan:
INDEX UNIQUE SCAN - tree traversal only; finds at most one matching index entry.
INDEX RANGE SCAN - tree traversal, followed by a scan along the leaf-node linked list, starting at the first match.
TABLE ACCESS BY ROWID - retrieves each matching row from the table.
The first step has an upper bound on the number of operations; the next two don't, and they can be very slow if there are many matching rows.
A poorly phrased where clause will access much more of the table than it has to.
CREATE TABLE employees (
    employee_id   NUMBER         NOT NULL,
    first_name    VARCHAR2(1000) NOT NULL,
    last_name     VARCHAR2(1000) NOT NULL,
    date_of_birth DATE           NOT NULL,
    phone_number  VARCHAR2(1000) NOT NULL,
    CONSTRAINT employees_pk PRIMARY KEY (employee_id)
);
Consider a simple query like
SELECT first_name, last_name FROM employees WHERE employee_id = 123. Looking at the output of the query optimizer, which transforms a statement into an execution plan, we see exactly what the database will do. In this case it does an
INDEX UNIQUE SCAN, because the primary key index is unique. On a non-unique index it would do an
INDEX RANGE SCAN instead, which is potentially slower if there are multiple matching rows.
- #FunFact - One of the reasons for using non-unique indexes for primary keys is deferrable constraints. As opposed to regular constraints, which are validated during statement execution, the database postpones the validation of deferrable constraints until the transaction is committed. Deferred constraints are required for inserting data into tables with circular dependencies.
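The same distinction shows up in SQLite's EXPLAIN QUERY PLAN output. A small sketch using Python's built-in sqlite3 module with a trimmed-down version of the employees table - the plan wording differs from Oracle's, but an indexed equality search versus a full scan is clearly visible:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE employees (
    employee_id INTEGER NOT NULL,
    first_name  TEXT    NOT NULL,
    last_name   TEXT    NOT NULL)""")
con.execute("CREATE UNIQUE INDEX employees_pk ON employees (employee_id)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Equality on the unique index: tree traversal, at most one match.
indexed = plan("SELECT first_name FROM employees WHERE employee_id = 123")
# No index on last_name: the whole table is scanned.
unindexed = plan("SELECT first_name FROM employees WHERE last_name = 'WINAND'")
```

The first plan mentions the index by name; the second is a plain scan.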
Concatenated indexes
Also known as composite, multi-column or combined indexes.
Suppose the company described above undergoes a merger, after which employee_id is no longer unique. However, the pair (employee_id, subsidiary_id) is.
Searching for a particular employee remains the same. The database uses
INDEX UNIQUE SCAN and all is right with the world. However, if we want to find all the employees in a subsidiary, the index (employee_id, subsidiary_id) would be unsuitable. The query
SELECT first_name, last_name FROM employees WHERE subsidiary_id = 20; doesn't use the index, it uses
FULL TABLE SCAN.
It is clear why: in the index, rows with equal subsidiary_id aren't stored next to each other. The fix is simple - the index should be (subsidiary_id, employee_id). There's no downside to this unless you wanted to query by employee_id alone, which no longer makes sense after the merger. To execute the previous query, the database then uses an
INDEX RANGE SCAN, as expected.
In simple terms - an index (col1, col2, col3) can be used when searching by col1 alone, searching by col1 + col2 and searching by col1 + col2 + col3. Searches by combinations of col2 and col3 alone would not be supported.
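This leftmost-prefix rule is easy to verify with SQLite's EXPLAIN QUERY PLAN; a sketch using Python's sqlite3 module with the fixed (subsidiary_id, employee_id) column order:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees "
            "(employee_id INTEGER, subsidiary_id INTEGER, last_name TEXT)")
# The fixed column order: subsidiary_id first.
con.execute("CREATE UNIQUE INDEX employees_pk "
            "ON employees (subsidiary_id, employee_id)")

def plan(sql):
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column alone: the index is usable.
by_first_column = plan("SELECT last_name FROM employees WHERE subsidiary_id = 20")
# Trailing column alone: no usable prefix, so the table is scanned.
by_second_only = plan("SELECT last_name FROM employees WHERE employee_id = 123")
```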
- #FunFact - a
FULL TABLE SCAN can be the most efficient operation in some cases, in particular when retrieving a large part of the table. This is partly due to the overhead of the index lookup itself, which a TABLE ACCESS FULL operation does not incur. But it is mostly because an index lookup reads one block after the other, as the database does not know which block to read next until the current block has been processed. A FULL TABLE SCAN must get the entire table anyway, so the database can read larger chunks at a time (multi-block read). Although the database reads more data, it might need to execute fewer read operations.
- #GeneralRule - avoid queries in production that use a
FULL TABLE SCAN; they should generally use an appropriate index. Note, however, that using an index does not automatically mean a statement is executed in the best way possible.
- #GeneralRule - the fewer indexes a table has, the better. This leads to better update, delete and insert performance. More indexes, on the other hand, can lead to better select performance.
Second query to consider -
SELECT first_name, last_name, subsidiary_id, phone_number FROM employees WHERE last_name = 'WINAND' AND subsidiary_id = 30;. This query developed performance problems after we changed the column order in the employees_pk index. Currently its execution plan uses the employees_pk index. By looking at the execution plan from before the index change, either with an optimizer hint (
SELECT /*+ NO_INDEX(EMPLOYEES EMPLOYEES_PK) */ first_name...) or by dropping the index, we see that it did a
FULL TABLE SCAN earlier. With the index, it goes something like this: it does an
INDEX RANGE SCAN, finding all rows that match that particular subsidiary_id. Then it fetches each of those rows with an individual
TABLE ACCESS BY ROWID call, filters by last_name and returns the result. It turns out it was faster to simply read the whole table and filter by last_name.
- Rule-based optimizers (RBO) generate the execution plan using a hard-coded rule set. Rule-based optimizers are less flexible and are seldom used today.
- Cost-based optimizers (CBO) generate many execution plan variations and calculate a cost value for each plan. The cost calculation is based on the operations in use and the estimated row numbers. In the end the cost value serves as the benchmark for picking the “best” execution plan.
- A cost-based optimizer uses statistics about tables, columns, and indexes. Most statistics are collected on the column level: the number of distinct values, the smallest and largest values (data range), the number of NULL occurrences and the column histogram (data distribution).
- The most important statistical value for a table is its size (in rows and blocks).
- The most important index statistics are the tree depth, the number of leaf nodes, the number of distinct keys and the clustering factor.
- The optimizer uses these values to estimate the selectivity of the where clause predicates.
Function-based indexes
If you're going to use a function to define an index, the function should be pure and declared DETERMINISTIC (Oracle) or IMMUTABLE (PostgreSQL).
Bind parameters
Instead of putting the value into the SQL statement, a placeholder like ?, :name or @name is used. There are two advantages to doing this:
- Security. This protects against SQL injection automatically. If you do not use a bind parameter but put the search term directly into the SQL statement, you must take other precautions against SQL injection attacks!
- Performance. Some databases have an execution plan cache. If the exact same statement text is executed again, the cached execution plan is reused. With non-parameterized queries, the statement text changes whenever the values change, so the execution plan needs to be recalculated each time; for a parameterized query, the cached execution plan can be reused.
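A minimal sketch of bind parameters with Python's sqlite3 module, using the ? placeholder style; the hypothetical O'BRIEN row shows why naive string concatenation would already break on quoting alone:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (employee_id INTEGER, last_name TEXT)")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [(123, "WINAND"), (124, "O'BRIEN")])

# The ? placeholder keeps the value out of the statement text: no quoting
# or injection problems, and the text stays identical across executions,
# which is what lets an execution plan cache reuse the plan.
name = "O'BRIEN"  # would break naive string concatenation
found = con.execute(
    "SELECT employee_id FROM employees WHERE last_name = ?", (name,)
).fetchone()
```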
The same execution plan is not necessarily the best one for a particular query every time. For the query
SELECT first_name, last_name FROM employees WHERE subsidiary_id = 30;, it's better to use a
FULL TABLE SCAN rather than the index, because subsidiary 30 is large (many matching rows). For a smaller subsidiary, it's better to use the index. The statistics collected by the cost-based optimizer help it make this decision.
- #GeneralRule - Index for equality first—then for ranges. You'll end up scanning less.
Indexing LIKE filters
LIKE filters can only use the characters before the first wildcard during tree traversal. The remaining characters are just filter predicates that do not narrow the scanned index range. A single LIKE expression can therefore contain two predicate types:
- the part before the first wildcard as an access predicate;
- the other characters as a filter predicate.
The more selective the prefix before the first wildcard is, the smaller the scanned index range becomes. That, in turn, makes the index lookup faster.
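A sketch of the access-predicate idea using Python's sqlite3 module. SQLite only applies its LIKE optimization under specific collation settings, so the prefix is spelled out here as the equivalent explicit range; the leading-wildcard query has no usable prefix and falls back to a scan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (last_name TEXT)")
con.execute("CREATE INDEX emp_name ON employees (last_name)")

def plan(sql):
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# LIKE 'WIN%' narrows the scanned index range exactly like this explicit range:
prefix_range = plan("SELECT last_name FROM employees "
                    "WHERE last_name >= 'WIN' AND last_name < 'WIO'")
# A leading wildcard offers no prefix to traverse the tree with:
leading_wild = plan("SELECT last_name FROM employees "
                    "WHERE last_name LIKE '%INA%'")
```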
In general, using a single composite index is better than using multiple indexes for a query. In the latter case, much more CPU and memory is required to first scan both indexes and second combine the two results.
Partial indexes
CREATE INDEX messages_todo ON messages (receiver) WHERE processed = 'N' - a partial index that indexes only unprocessed messages.
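SQLite happens to support the same partial-index syntax, so the effect can be sketched with Python's sqlite3 module (table and column names taken from the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE messages "
            "(receiver INTEGER, processed TEXT, payload TEXT)")
# Same partial-index definition as in the text above.
con.execute("CREATE INDEX messages_todo ON messages (receiver) "
            "WHERE processed = 'N'")

def plan(sql):
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# The query repeats the index's WHERE condition, so the index is usable.
todo = plan("SELECT payload FROM messages "
            "WHERE receiver = 42 AND processed = 'N'")
```

A query that omits the processed = 'N' condition could not use this index, since rows with other processed values aren't in it.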
It's good to use queries that will hit the execution plan cache; that's why we use queries with bind params. However, if we don't know in advance which filters will be supplied, one example of a "smart" way to avoid writing dynamic SQL is
SELECT first_name, last_name, subsidiary_id, employee_id FROM employees WHERE ( subsidiary_id = :sub_id OR :sub_id IS NULL ) AND ( employee_id = :emp_id OR :emp_id IS NULL ) AND ( UPPER(last_name) = :name OR :name IS NULL )
This is dumb af because the db has no idea which index to use, so no matter which params are supplied, it will use a
FULL TABLE SCAN. Congratulations, you got the execution plan from the cache, but it won't use an index.
- #GeneralRule - Use dynamic SQL if you need dynamic where clauses.
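A sketch of the dynamic-SQL alternative in Python: only the predicates the caller actually supplies end up in the statement text, and every value still travels as a bind parameter, so each filter combination gets its own well-optimizable statement:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees "
            "(employee_id INTEGER, subsidiary_id INTEGER, last_name TEXT)")
con.execute("INSERT INTO employees VALUES (123, 30, 'WINAND')")

def find_employees(con, subsidiary_id=None, employee_id=None):
    """Build only the predicates the caller actually supplied, while every
    value still travels as a bind parameter."""
    clauses, params = [], []
    if subsidiary_id is not None:
        clauses.append("subsidiary_id = ?")
        params.append(subsidiary_id)
    if employee_id is not None:
        clauses.append("employee_id = ?")
        params.append(employee_id)
    sql = "SELECT last_name FROM employees"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return con.execute(sql, params).fetchall()

hit = find_employees(con, subsidiary_id=30)
```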
Chapter 2 - Scalability
- The ability of a system to handle increased load
- The ability of a system to grow to handle increased load
By Data Volume
It is necessary to see how a query's execution time changes as the amount of data in the table grows. This is not usually apparent on development machines. Even when a query uses an index and the execution plan shows that index, its execution time might rise much faster than that of another query using a different index on the same table. This is not because there is such a thing as a "slow index"; the cause lies elsewhere.
He gives an example of a query whose execution time increases because of a filter predicate. In his words, "Filter predicates are like unexploded ordnance devices. They can explode at any time." A filter predicate is a filter applied after the db has already fetched a set of rows. Ideally, we'd like the db to return exactly what's required, but sometimes the index doesn't support that, and it returns a larger number of rows that then need to be filtered.
Even if your development environment has a full copy of the production DB, it might still be much faster on your machine because of a lack of background load.
- #GeneralRule - Careful execution plan inspection yields more confidence than superficial benchmarks.
Distributed systems and CAP
Maintaining strict consistency in a distributed system requires a synchronous coordination of all write operations between the nodes. This principle has two unpleasant side effects:
- it adds latencies and increases response times
- it reduces the overall availability because multiple members must be available at the same time to complete a write operation.
Consistency between both systems is still a desirable goal that is often achieved using the two-phase commit (2PC) protocol. This protocol establishes global transactions that deliver the well-known "all-or-nothing" behavior across multiple databases. Completing a global transaction is only possible if all contributing members are available, so it reduces the overall availability. In case of a network partition, the system ceases to function. Traditional database systems like Postgres, MySQL and Spanner are thus available and consistent (AC).
The more nodes a distributed system has, the more troublesome strict consistency becomes. Maintaining strict consistency is almost impossible if the system has more than a few nodes. There are two approaches to this:
- CP (consistency and partition tolerance) - when partitioned, these systems become read-only or refuse to respond to any requests rather than be inconsistent and permit some users to see old data while others see fresh data. e.g. HBase, Redis and Bigtable.
- AP (availability and partition tolerance) - dropping strict consistency solves the availability problem and eliminates the increased response time. The basic idea is to reestablish global consistency after completing the write operation on a subset of the nodes. This approach leaves just one problem unsolved: it is impossible to prevent conflicts if two nodes accept contradictory changes. Consistency is eventually reached by handling conflicts, not by preventing them. In that context, consistency means that all nodes have the same data - it is not necessarily the correct or best data. e.g. Cassandra, Riak and Dynamo.
Accessing the 4 levels of a B-tree could take 4 disk seeks, which is fine if it's done once, but awful if a db is forced to do it hundreds or thousands of times for a single query. (I'm not sure about this, because the entire B-tree should be in memory, ideally.)
Chapter 4 - Joins
A join operation transforms data from a normalized model to a denormalized form that suits a specific purpose. Joining is particularly sensitive to disk seek latencies because it combines scattered data fragments. Proper indexing is again the best solution to reduce response times. The correct index however depends on which of the three common join algorithms is used for the query.
There is, however, one thing that is common to all join algorithms: they process only two tables at a time. A SQL query with more tables requires multiple steps: first building an intermediate result set by joining two tables, then joining the result with the next table and so forth.
Even though the join order has no impact on the correctness of the final result, it still affects performance. The optimizer will therefore evaluate all possible join order permutations and select the best one. That means that just optimizing a complex statement can itself become a performance problem. The more tables to join, the more execution plan variants to evaluate - mathematically speaking, n! (factorial growth) - though this is mitigated when using bind parameters, because the cached plan is reused instead of being re-optimized on every execution.
An antipattern - nested selects
It's like a nested loop: an outer or driving query fetches the results from one table, and a second query runs for each row returned by the driving query. If the driving query returns N rows, this executes N further selects, N+1 queries in all. This is made worse by the network latency between the application and the DB.
Some poorly written ORMs are susceptible to generating queries like this. Always enable SQL logging and look at the actual SQL commands being generated by the ORM.
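A minimal sketch of the N+1 pattern against an in-memory SQLite database, with hypothetical authors/books tables, counting the statements issued and comparing against a single join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO books VALUES (1, 'B1'), (1, 'B2'), (2, 'B3');
""")

# N+1 pattern: one driving query plus one query per driving row.
queries = 1
result_nested = []
for author_id, name in con.execute("SELECT id, name FROM authors").fetchall():
    titles = con.execute("SELECT title FROM books WHERE author_id = ?",
                         (author_id,)).fetchall()
    queries += 1
    result_nested.extend((name, t) for (t,) in titles)

# The same result with a single join: one statement instead of N+1.
result_join = con.execute(
    "SELECT a.name, b.title FROM authors a "
    "JOIN books b ON b.author_id = a.id ORDER BY a.id, b.title").fetchall()
```

With 2 driving rows this already issues 3 statements; against a remote database each one pays a network round trip.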
Join Algo 1 - Nested loops
Basically like nested selects, but without the network latencies, because it's done inside the database itself.
Join Algo 2 - Hash join
The candidate rows from the smaller side (the side with fewer rows) are loaded into a hash table so they can be found in O(1) time rather than O(log N) time. To improve performance, only the columns used in the where clause need to be indexed.
It's possible to optimize this join by selecting fewer columns so the hash table will be smaller. The constraint isn't the number of rows but the memory used.
Note: not supported by MySQL (hash joins were only added in MySQL 8.0.18).
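A toy illustration of the idea in plain Python, joining two lists of dicts (the table and column names are made up for the example):

```python
from collections import defaultdict

def hash_join(smaller, larger, key):
    """Build a hash table over the smaller input, then probe it once per
    row of the larger input: O(N + M) instead of O(N * M)."""
    table = defaultdict(list)
    for row in smaller:
        table[row[key]].append(row)
    for row in larger:
        for match in table.get(row[key], ()):
            yield {**match, **row}

subsidiaries = [{"subsidiary_id": 30, "sub_name": "ACME DE"}]
employees = [{"subsidiary_id": 30, "employee_id": 123},
             {"subsidiary_id": 31, "employee_id": 124}]
joined = list(hash_join(subsidiaries, employees, "subsidiary_id"))
```

Keeping only the needed columns in `smaller` is exactly the "smaller hash table" optimization mentioned above.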
Join Algo 3 - Sort merge
Both sides of the join are already sorted by the join predicates. They are joined "like a zipper". To improve performance, only the columns used in the where clause need to be indexed.
Although the sort-merge join performs very well once the inputs are sorted, it is hardly used because sorting both sides is very expensive. The hash join, on the other hand, needs to preprocess only one side.
The strength of the sort-merge join emerges if the inputs are already sorted. This is possible by exploiting the index order to avoid the sort operations entirely.
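The zipper analogy, sketched in plain Python over two inputs already sorted on the join key (this simplified version assumes the left side's keys are unique, as with a primary key):

```python
def merge_join(left, right, key):
    """Both inputs must already be sorted on `key`. This simplified version
    assumes the keys in `left` are unique (e.g. a primary key side)."""
    out, i = [], 0
    for l in left:
        # Advance past right-side rows below the current left key ...
        while i < len(right) and right[i][key] < l[key]:
            i += 1
        # ... then emit the block of right-side rows that match it.
        j = i
        while j < len(right) and right[j][key] == l[key]:
            out.append({**l, **right[j]})
            j += 1
    return out

subsidiaries = [{"subsidiary_id": 30, "sub_name": "ACME DE"},
                {"subsidiary_id": 31, "sub_name": "ACME AT"}]
employees = [{"subsidiary_id": 30, "employee_id": 123},
             {"subsidiary_id": 30, "employee_id": 125},
             {"subsidiary_id": 31, "employee_id": 124}]
joined = merge_join(subsidiaries, employees, "subsidiary_id")
```

Each side is read once, front to back - which is why pre-sorted (indexed) inputs make this algorithm shine.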
Chapter 5 - Clustering Data
There are many meanings of cluster.
A data cluster means storing rows that are frequently accessed together close to each other, so that accessing them requires minimal I/O operations.
“Let’s use a cluster to improve database performance” could refer to this or it could refer to a computer cluster.
The simplest data cluster in an SQL database is the row. Databases store all columns of a row in the same database block if possible.
Indexes allow one to cluster data. The basis for this was already explained in Chapter 1, “Anatomy of an Index”: the index leaf nodes store the indexed columns in an ordered fashion so that similar values are stored next to each other. That means that indexes build clusters of rows with similar values. This capability to cluster data is so important that I refer to it as the second power of indexing.
SELECT first_name, last_name, subsidiary_id, phone_number FROM employees WHERE subsidiary_id = ? AND UPPER(last_name) LIKE '%INA%';. The second where clause can't be improved with an index, but the first one can. What would be ideal is if all the rows for a subsidiary_id were stored together in the table, in one or more contiguous database blocks. Then, with one (or a few) I/O operations, it's possible to retrieve all the rows you're interested in. The performance of this query depends on minimizing the number of I/O operations triggered by
TABLE ACCESS BY ROWID. So it depends on the clustering of the rows.
Index only scan
If all the required columns are in the index, the table doesn't need to be accessed at all. An index that covers a query completely, preventing the table access, is called a covering index. The term is misleading, however, because it sounds like an index property. The phrase index-only scan correctly suggests that it is an execution plan operation.
The index-only scan is an aggressive indexing strategy. Do not design an index for an index-only scan on suspicion only, because it unnecessarily uses memory and increases the maintenance effort needed for update statements.
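SQLite labels an index-only scan as a COVERING INDEX in its plan output, which makes the effect easy to see with Python's sqlite3 module (hypothetical sales table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales "
            "(sale_id INTEGER, subsidiary_id INTEGER, eur_value REAL)")
con.execute("CREATE INDEX sales_sub_eur ON sales (subsidiary_id, eur_value)")

def plan(sql):
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Every referenced column is in the index: index-only scan.
covered = plan("SELECT SUM(eur_value) FROM sales WHERE subsidiary_id = 30")
# sale_id is not in the index, so the table itself must be read.
not_covered = plan("SELECT SUM(eur_value) FROM sales WHERE sale_id = 1")
```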
Chapter 6 - Sorting and Grouping
Sorting can only be done once all the rows are in memory. However, the sort can be avoided entirely if the column on which the result is ordered is indexed, since the index is already sorted.
Queries that use an index in the where clause return their rows in index order automatically. Specifying an order by on the same column therefore costs nothing extra: the database skips the sort step and you get pipelined, i.e. streaming, execution.
If the database uses a sort operation even though you expected a pipelined execution, it can have two reasons:
- the execution plan with the explicit sort operation has a better cost value
- the index order in the scanned index range does not correspond to the order by clause.
This works even if the sort order is descending, because the index's leaf nodes form a doubly linked list that can be traversed in either direction.
It is possible to specify the sort order of each index column while creating the index (though not in MySQL; descending indexes were only added in MySQL 8.0).
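Both behaviors show up in SQLite's plan output: an ORDER BY satisfied by the index (even descending) adds no sort step, while ordering by an unindexed column adds an explicit "USE TEMP B-TREE FOR ORDER BY". A sketch with Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_date TEXT, eur_value REAL)")
con.execute("CREATE INDEX sales_date ON sales (sale_date)")

def plan_steps(sql):
    return [r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql)]

# Index order satisfies the ORDER BY (even DESC): no sort step appears.
piped = plan_steps("SELECT sale_date FROM sales ORDER BY sale_date DESC")
# No index on eur_value: an explicit sort step is added.
sorted_plan = plan_steps("SELECT sale_date FROM sales ORDER BY eur_value")
```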
Chapter 7 - Partial results
- #GeneralRule - always inform the database if you want limited results. This is done using
LIMIT on MySQL and PG.
If you want to implement "infinite scrolling" but your sorting doesn't use an index, the database has to keep the full sorted result buffered after sorting.
Pagination can be accomplished by
OFFSET on MySQL and PG. Like
SELECT * FROM sales ORDER BY sale_date DESC LIMIT 10 OFFSET 10;. Oracle and SQL Server define different keywords for this. Note that finding rows at a higher offset requires scanning past all the previous rows, so the response time grows with the offset.
Instead of a row number (offset), you use the last value of the previous page to specify the lower bound (keyset pagination). This has a huge performance benefit: the database really does skip the preceding rows, rather than reading and discarding them.
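A sketch of both pages with Python's sqlite3 module, seeking below the last value seen instead of using OFFSET. (Real keyset pagination also needs a unique tiebreaker column in case of duplicate sale_date values; the sample data here is unique by construction.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_id INTEGER, sale_date TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(i, f"2024-01-{i:02d}") for i in range(1, 8)])

PAGE = 3
page1 = con.execute(
    "SELECT sale_id, sale_date FROM sales "
    "ORDER BY sale_date DESC LIMIT ?", (PAGE,)).fetchall()

# Keyset pagination: seek below the last value seen instead of OFFSET.
last_seen = page1[-1][1]
page2 = con.execute(
    "SELECT sale_id, sale_date FROM sales WHERE sale_date < ? "
    "ORDER BY sale_date DESC LIMIT ?", (last_seen, PAGE)).fetchall()
```

With an index on sale_date, the second query is a plain range scan regardless of how deep the page is.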
Chapter 8 - Modifying data
- Inserts can't benefit from indexes. Each extra index slows down insert performance. Writing a new record to the heap doesn't take much time - any table block will do. Writing to an index is slower, because the entry has to be inserted in a particular place and the tree may need rebalancing afterwards.
- Not having indexes makes a big difference to insert performance. If a lot of data is being inserted into a table with multiple indexes, it can make sense to drop the indexes, insert the data and then recreate the indexes.
- A delete is like a select followed by the actual deletion. The select part is helped by an index; the deletion itself then removes the row from the heap and removes its entries from all the indexes. It's quickest with one index (assuming the where clause uses it), and gets slower with each additional index.
- Note: The PostgreSQL database only keeps the version information (=visibility information) on the table level: deleting a row just sets the “deleted” flag in the table block. PostgreSQL’s delete performance therefore does not depend on the number of indexes on the table. The physical deletion of the table row and the related index maintenance is carried out only during the VACUUM process.
truncate table. This command has the same effect as delete without where except that it deletes all rows in one shot. It is very fast but has two important side effects: (1) it does an implicit commit (exceptions: PostgreSQL and SQL Server); (2) it does not execute any triggers.
- An update is like a select, after which every index that contains a changed column needs to be updated. The optimal case is having only one index; response times increase with extra indexes.
- Sometimes ORMs generate bad statements that set every column, even unchanged ones, which ends up updating all indexes.
TODO, especially Postgres