group by优化原理
============
首先看一个生产上group by优化的案例:
-- 优化前SQL ,使用紧凑索引扫描 :执行3s
SELECT taskUniqueId,
max(reportTime) AS reportTime
FROM task_log_info
WHERE reportTime > '2024-04-07'
GROUP BY taskUniqueId
-- 优化后,使用松散索引扫描:执行30ms左右
SELECT a.taskUniqueId,
reportTime
FROM task_log_info a
JOIN
(SELECT taskUniqueId,
max(id) AS id
FROM task_log_info
GROUP BY taskUniqueId ) tmp
ON a.id=tmp.id
AND reportTime>='2024-04-07'
-- 需要注意的是:id和reporttime字段值具有相关性的情况才可以这样修改
两个group by使用的同样一个索引,但是效率却相差很多,可能你一眼就知道其内部的原理
anyway,有兴趣的可以往后看看,欢迎指出文章不足之处~
在本文的参考链接中也都详细说明了group by的优化原理,本文只是加以总结和结合实际加入自己的理解进行整理
准备
对于group by,可以使用索引进行优化,主要包含两个方面:
1、避免使用临时表
- 没有index的情况下需要扫描所有的数据,并使用临时表来保存分组值,且默认情况下,会对分组后的结果进行排序
- 有index的情况下,由于索引的有序性,每个组的数据在一起,可以更快速地进行分组,且避免了排序
2、使用Loose index scan,避免扫描所有的索引记录(有一定的使用条件)
测试数据
CREATE TABLE t2 (
id INT AUTO_INCREMENT,
c1 CHAR(64) NOT NULL,
c2 CHAR(64) NOT NULL,
c3 CHAR(64) NOT NULL,
c4 CHAR(64) NOT NULL,
PRIMARY KEY(id),
KEY c1_c2_c3_idx (c1, c2,c3)
) ENGINE=INNODB;
INSERT INTO t2 VALUES (null,'a','b','a','a'), (null,'a','b','a','a'),
(null,'a','c','a','a'), (null,'a','c','a','a'),
(null,'a','d','b','b'), (null,'a','b','b','b'),
(null,'d','b','c','c'), (null,'e','b','c','c'),
(null,'f','c','d','d'), (null,'k','c','d','d'),
(null,'y','d','y','y'), (null,'f','b','f','y'),
(null,'a','b','a','a'), (null,'a','b','a','a'),
(null,'a','c','a','a'), (null,'a','c','a','a'),
(null,'a','d','b','b'), (null,'a','b','b','b'),
(null,'d','b','c','c'), (null,'e','b','c','c'),
(null,'f','c','d','d'), (null,'k','c','d','d'),
(null,'y','d','y','y'), (null,'f','b','f','y');
-- 收集统计信息,否则可能影响测试
ANALYZE TABLE t2;
无索引的情况
- 不使用索引的group by
mysql> explain select c4,count(*) from t2 group by c4 order by null;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| 1 | SIMPLE | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 24 | 100.00 | Using temporary |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
1 row in set, 1 warning (0.00 sec)
mysql> explain select c4,count(*) from t2 group by c4;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+---------------------------------+
| 1 | SIMPLE | t2 | NULL | ALL | NULL | NULL | NULL | NULL | 24 | 100.00 | Using temporary; Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+---------------------------------+
1 row in set, 1 warning (0.00 sec)
从执行计划Extra: Using temporary可以看到这里使用到了临时表
- 使用索引的group by,没有使用到临时表
mysql> explain select c1,count(*) from t2 group by c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t2 | NULL | index | c1_c2_c3_idx | c1_c2_c3_idx | 768 | NULL | 24 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
Extra:Using index,且type为index,表示全索引扫描
这种情况下,如果表数据量很大,还是会比较耗时
有索引的情况
对于使用索引的情况,对于索引的访问有两种算法
-
Loose Index Scan:松散索引扫描
- 不需要扫描所有的索引key,根据分组前缀(group by字段)跳跃扫描部分
- 执行计划Extra有:Using index for group-by
-
Tight Index Scan:紧凑索引扫描
-
需要扫描范围或全部的索引key
- 执行计划Extra有:Using index
另外还有一种两种算法结合使用的方式:
- 在松散索引扫描的成本大于紧凑索引扫描时可能用到,后文说明
下面是两条SQL分别使用Loose Index Scan和Tight Index Scan:
mysql> explain SELECT c1,MIN(c2) FROM t2 GROUP BY c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| 1 | SIMPLE | t2 | NULL | range | c1_c2_c3_idx | c1_c2_c3_idx | 256 | NULL | 7 | 100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> explain SELECT c1,COUNT(*) FROM t2 GROUP BY c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t2 | NULL | index | c1_c2_c3_idx | c1_c2_c3_idx | 768 | NULL | 24 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> SELECT c1,MIN(c2),COUNT(*) FROM t2 GROUP BY c1;
+----+---------+----------+
| c1 | MIN(c2) | COUNT(*) |
+----+---------+----------+
| a | b | 12 |
| d | b | 2 |
| e | b | 2 |
| f | b | 4 |
| k | c | 2 |
| y | d | 2 |
+----+---------+----------+
6 rows in set (0.00 sec)
第一条SQL使用Loose Index Scan,示意图为::
第二条SQL使用Tight Index Scan,示意图为:
Loose Index Scan
跳跃扫描部分索引key,而不需要扫描全部
例如:
- 执行计划Extra为:Using index for group-by,表示使用松散索引扫描
mysql> explain SELECT c1,MIN(c2) FROM t2 GROUP BY c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| 1 | SIMPLE | t2 | NULL | range | c1_c2_c3_idx | c1_c2_c3_idx | 256 | NULL | 7 | 100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)
使用场景,可大致归为两类:
1、当需要获取每个分组的某条记录,而非对全部记录做聚合运算时可能会用到Loose Index Scan,比如:
- min()、max():获取每个分组的最小值或最大值
- agg(distinct):count(distinct)、sum(distinct)、avg(distinct)
注意:如果sql语句中既有min\max中的1-2个,也有count(distinct)\sum(distinct)\avg(distinct)中的1-3个时,无法用到Loose index,两组分别出现的时候才可能会用到
2、distinct 可以转换为group by进行处理
其他的必要条件:
- 查询基于一个表
- group by字段满足索引的最左匹配原则,例如,一个表有(c1,c2,c3)的索引,sql为group by (c1,c2)时可以用到索引,sql为(c1,c3)时无法用到索引
- 聚合函数使用的列,必须包含在索引上;且使用多个聚合函数时,必须使用相同的字段,且group by 字段+聚合函数字段也必须满足最左匹配原则
- 索引中字段必须是全字段索引,而不能是前缀索引,例如 INDEX(c1(10))
以上条件结合索引的结构就很好理解了
另外,在选择是否使用Loose Index Scan时,也会收到SQL、统计信息、成本等因素的影响
因此,针对当前表结构,列举几个常见的可以使用到Loose Index Scan的SQL:
场景1
-- MIN()、MAX()
SELECT c1,MIN(c2),MAX(c2) FROM t2 GROUP BY c1;
SELECT c1,c2,MAX(c3),MIN(c3) FROM t2 WHERE c2 > 'k' GROUP BY c1, c2;
SELECT c1,c2,MAX(c3),MIN(c3) FROM t2 WHERE c3 > 'k' GROUP BY c1, c2;
SELECT c1,c2,c3,MAX(id) FROM t2 GROUP BY c1,c2,c3;
SELECT c1, c2 FROM t2 WHERE c3 = 'd' GROUP BY c1, c2;
-- 以下几种情况在当前数据量和数据分布下没有用到,和成本计算有关,结合后文成本对比的章节改变数据量和数据分布测试出来
SELECT c1,c2,MAX(c3),MIN(c3) FROM t2 WHERE c1>='k' and c2 > 'f' GROUP BY c1, c2;
SELECT DISTINCT c1, c2 FROM t2 where c1>'k';
SELECT c1,c2,count(distinct c3) FROM t2 where c1>='k' and c2>'k' GROUP BY c1,c2;
-- count(distinct)、sum(distinct)、avg(distinct)
SELECT c1,count(distinct c2,c3) FROM t2 GROUP BY c1;
SELECT c1,c2,sum(distinct c3) FROM t2 GROUP BY c1,c2;
SELECT c1,c2,sum(distinct c3) FROM t2 where c2>'k' GROUP BY c1,c2;
场景2
SELECT DISTINCT c1, c2 FROM t2;
-- 无法使用到Loose Index Scan
SELECT c1,count(distinct c2,c3),sum(distinct c2) FROM t2 GROUP BY c1;
Tight Index Scan
对于无法使用到Loose Index Scan的一些group by,在满足索引最左匹配原则情况下可能会用到Tight Index Scan
该种方式实际上是范围索引扫描或全部索引扫描,数据量大的情况下性能仍然可能会比较差,但是相比无索引还是可以避免使用临时表和全表扫描,在某些情况下有一定的优化作用
两种算法的结合
对于AGG(DISTINCT),即SUM|COUNT|AVG(distinct),可能会出现使用松散索引扫描成本大于紧凑��引扫描的情况
两种方式在引擎层主要包含的成本:
-
Loose Index Scan
- 读取分组的第一条记录,得到分组前缀
- 根据分组前缀读取分组的第一条或最后一条记录返回给server层
-
Tight Index Scan
-
从引擎层读取数据,返回给server层
- server层判断是否符合where条件的记录,并根据聚合函数进行处理
可以看到,对于引擎层的访问,Loose Index Scan的成本有可能会高于Tight Index Scan的,且在MySQL中,引擎层读取数据页的成本常数是1,Server层判断一条记录的成本常数是0.2,在某些情况下,Loose Index Scan的成本会高于Tight Index Scan
至于MIN/MAX为什么不会出现Loose Index Scan成本>Tight Index Scan成本,我理解只有到组内值都是唯一的情况下才会出现吧,那这样也没有必要去分组求最值了
比如:
- 当分组较多,但组内的记录数并不多或唯一值较高的情况,对于每一个分组,都需要扫描两次,能跳过的记录数很少的情况。即Loose Index Scan在分组字段的选择性相对不太高,组内的数据量相对较多的情况更适用
例如,如下SQL:
select count(distinct c1,c2) from t2;
在当前的测试数据中,松散扫描的成本还是要低于紧凑扫描,执行计划如下:
mysql> explain select count(distinct c1,c2) from t2;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| 1 | SIMPLE | t2 | NULL | range | c1_c2_c3_idx | c1_c2_c3_idx | 512 | NULL | 10 | 100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.01 sec)
新建一个相同表结构的表,插入下面的测试数据:
INSERT INTO t2 VALUES (null,'a','b','a','a'), (null,'a','b','a','a'),
(null,'k','c','a','a'), (null,'k','g','a','a'),
(null,'a','d','b','b'), (null,'a','b','b','b'),
(null,'d','b','c','c'), (null,'e','b','c','c'),
(null,'f','c','d','d'), (null,'k','c','d','d'),
(null,'y','d','y','y'), (null,'f','b','f','y'),
(null,'j','b','a','a'), (null,'m','b','a','a'),
(null,'z','c','a','a'), (null,'t','c','a','a'),
(null,'x','d','b','b'), (null,'x','b','b','b'),
(null,'d','b','c','c'), (null,'e','b','c','c'),
(null,'f','c','d','d'), (null,'k','c','d','d'),
(null,'y','d','y','y'), (null,'f','b','f','y');
-- 其数据分布
mysql> select c1,c2,count(*) from t2 group by c1,c2;
+----+----+----------+
| c1 | c2 | count(*) |
+----+----+----------+
| a | b | 3 |
| a | d | 1 |
| d | b | 2 |
| e | b | 2 |
| f | b | 2 |
| f | c | 2 |
| j | b | 1 |
| k | c | 3 |
| k | g | 1 |
| m | b | 1 |
| t | c | 1 |
| x | b | 1 |
| x | d | 1 |
| y | d | 2 |
| z | c | 1 |
+----+----+----------+
15 rows in set (0.00 sec)
执行计划Extra:Using index for group-by (scanning)
mysql> explain select count(distinct c1,c2) from t2;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------------------------------+
| 1 | SIMPLE | t2 | NULL | range | c1_c2_c3_idx | c1_c2_c3_idx | 512 | NULL | 16 | 100.00 | Using index for group-by (scanning) |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------------------------------+
1 row in set, 1 warning (0.00 sec)
该方式可以理解为Loose Index Scan的扩展或两种方式的结合:
- 索引顺序扫描的同时进行去重
其执行的示意图如下:
最后,再回到文章开头的案例,其执行计划如下:其核心就是将紧凑索引扫描转化为了松散索引扫描
- 优化前执行计划:
- 优化后执行计划:
小结
对于group by可以使用索引进行优化,Loose Index Scan相对于Tight Index Scan在一些情况下可以大大减少扫描的行数,使用Loose Index Scan时,执行计划Extra有:Using index for group-by
在Loose Index Scan的成本大于Tight Index 一些情况Loose Index Scan的一些情况下,可能可以用到两者的结合的方式,执行计划中Extra有:Using index for group-by (scanning)
Loose Index Scan更适用于分组内重复值相对较多,分组个数相对较少的情况
可根据实际情况进行优化
参考链接
MySQL 怎么用索引实现 group by?-鸿蒙开发者社区-51CTO.COM
原文链接: https://juejin.cn/post/7369534106604355621
文章收集整理于网络,请勿商用,仅供个人学习使用,如有侵权,请联系作者删除,如若转载,请注明出处:http://www.cxyroad.com/17426.html