Selectivity threshold for a non-covering index
Assume you have a table with about 300 000 rows, and an indexed column ‘col1′ with only 9 distinct values. Now you got a query like ’select * from t1 where col1 = const’. The questions are
- when the index is faster to full table scan and vice versa?
- does MySQL use the optimal plan by default?
These questions became very relevant now that QOT got server access and is able to gather various table metrics including selectivity. Besides index selectivity the threshold value obviously depends on the storage engine used, so for me it is also interesting to see how our PBXT engine compares to others in this aspect. Namely to InnoDB - an engine with similar transactional properties and MyISAM - a very fast engine for read-only scenarios.
For the test I took the employees test database. The database contains dept_emp table which serves to maintain a many-to-many relation between departments and employees - there could be many employees in a department and an employee at different times can work in different departments. The query would be to select all employees who ever worked in a given department.
mysql> show create table dept_emp \G
*************************** 1. row ***************************
Table: dept_emp
Create Table: CREATE TABLE `dept_emp` (
`emp_no` int(11) NOT NULL,
`dept_no` char(4) NOT NULL,
`from_date` date NOT NULL,
`to_date` date NOT NULL,
PRIMARY KEY (`emp_no`,`dept_no`),
KEY `emp_no` (`emp_no`),
KEY `dept_no` (`dept_no`),
CONSTRAINT `FOREIGN_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE,
CONSTRAINT `FOREIGN_2` FOREIGN KEY (`dept_no`) REFERENCES `departments` (`dept_no`) ON DELETE CASCADE
) DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
the only index that can be theoretically used here is dept_no. It’s selectivity:
mysql> select dept_no, count(*) from dept_emp group by dept_no with rollup;
+---------+----------+
| dept_no | count(*) |
+---------+----------+
| d001 | 20211 |
| d002 | 17346 |
| d003 | 17786 |
| d004 | 73485 |
| d005 | 85707 |
| d006 | 20117 |
| d007 | 52245 |
| d008 | 21126 |
| d009 | 23580 |
| NULL | 331603 |
+---------+----------+
10 rows in set (0.17 sec)
For the query I will use the least selective index value ‘d005′. The query and the default plan:
mysql> explain select * from dept_emp where dept_no = 'd005' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: dept_emp
type: ref
possible_keys: dept_no
key: dept_no
key_len: 4
ref: const
rows: 86069
Extra: Using where
1 row in set (0.00 sec)
For all 3 engines MySQL returns different selectivity estimates but always prefers ref access to full scan. This is a bit strange as the index is non-covering. The index-less query and plan:
mysql> explain select * from dept_emp ignore index (dept_no) where dept_no = 'd005' \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: dept_emp
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 331603
Extra: Using where
1 row in set (0.00 sec)
Test results:

Quite predictably index ref access is slower for all engines as it requires more random disk reads which eat considerable amount of time. So the default MySQL plan is sub-optimal.
Another interesting result is that MyISAM index scan is much slower that the others (doesn’t MyISAM have leaf-to-leaf pointers?). Quite predictably InnoDB table and index times make much less diference than for others. I’m pleased to notice that PBXT shows the best reading time both in table and index scan. I believe main reasons for this are the PBXT’s fixed record format and memory-mapped file access.

Memory-bound case is much more interesting. MyISAM table scan is the fastest one here. But remember to achieve it you will need to explicitly disable index access and this is what QOT will soon detect and do for you
. What’s much more interesting is that for PBXT index ref access is faster than table scan and is almost as fast as MyISAM table scan. As we are not touching disk random reads don’t anymore make any difference to sequential scans and here come into play other factors. For example PBXT indexes keep direct pointers to records (i.e. not to rows), so version resolution path for them is shorter than for table scans. (hmm, can we avoid row-index file read if the sweeper has no work to do?). Another interesting detail (provided that the engines have very different architecures) is that InnoDB and PBXT table scans take the same time.
The bottom line is that in disk-bound scenario table scan is predictably faster. Unlike other engines in memory-bound case PBXT’s index ref access is faster than table scan providing optimal performance for the default MySQL plan. For other engines you have to watch your selectivity and query plans to achieve optimal results. For those who’s lazy enough to do it him/her self there’s a new version of QOT coming out soon that will evaluate the optimal execution plan.
Interesting read! I am following this project, I really like it.
Thnx!