1. 首页
  2. 后端

group by优化原理

  group by优化原理

============

首先看一个生产上group by优化的案例:
-- 优化前SQL ,使用紧凑索引扫描 :执行3s
SELECT taskUniqueId,
         max(reportTime) AS reportTime
FROM task_log_info
WHERE reportTime > '2024-04-07'
GROUP BY  taskUniqueId



-- 优化后,使用松散索引扫描:执行30ms左右
SELECT a.taskUniqueId,
        reportTime
FROM task_log_info a
JOIN
    (SELECT taskUniqueId,
        max(id) AS id
    FROM task_log_info
    GROUP BY  taskUniqueId ) tmp
    ON a.id=tmp.id
        AND reportTime>='2024-04-07'

-- 需要注意的是:id和reporttime字段值具有相关性的情况才可以这样修改

两个group by使用的同样一个索引,但是效率却相差很多,可能你一眼就知道其内部的原理

anyway,有兴趣的可以往后看看,欢迎指出文章不足之处~

在本文的参考链接中也都详细说明了group by的优化原理,本文只是加以总结和结合实际加入自己的理解进行整理

准备

对于group by,可以使用索引进行优化,主要包含两个方面:

1、避免使用临时表

  • 没有index的情况下需要扫描所有的数据,并使用临时表来保存分组值,且默认情况下,会对分组后的结果进行排序
  • 有index的情况下,由于索引的有序性,每个组的数据在一起,可以更快速地进行分组,且避免了排序

2、使用Loose index scan,避免扫描所有的索引记录(有一定的使用条件)

测试数据

CREATE TABLE t2 (
  id INT AUTO_INCREMENT,
  c1 CHAR(64) NOT NULL,
  c2 CHAR(64) NOT NULL,
  c3 CHAR(64) NOT NULL,
  c4 CHAR(64) NOT NULL,
  PRIMARY KEY(id),
  KEY c1_c2_c3_idx (c1, c2,c3)
) ENGINE=INNODB;


INSERT INTO t2 VALUES (null,'a','b','a','a'), (null,'a','b','a','a'),
                      (null,'a','c','a','a'), (null,'a','c','a','a'),
                      (null,'a','d','b','b'), (null,'a','b','b','b'),
                      (null,'d','b','c','c'), (null,'e','b','c','c'),
                      (null,'f','c','d','d'), (null,'k','c','d','d'),
                      (null,'y','d','y','y'), (null,'f','b','f','y'),
                      (null,'a','b','a','a'), (null,'a','b','a','a'),
                      (null,'a','c','a','a'), (null,'a','c','a','a'),
                      (null,'a','d','b','b'), (null,'a','b','b','b'),
                      (null,'d','b','c','c'), (null,'e','b','c','c'),
                      (null,'f','c','d','d'), (null,'k','c','d','d'),
                      (null,'y','d','y','y'), (null,'f','b','f','y');    

-- 收集统计信息,否则可能影响测试
ANALYZE TABLE t2;

无索引的情况

  • 不使用索引的group by
mysql> explain select c4,count(*) from t2 group by c4 order by null;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra           |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   24 |   100.00 | Using temporary |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain select c4,count(*) from t2 group by c4;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                           |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+---------------------------------+
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   24 |   100.00 | Using temporary; Using filesort |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+---------------------------------+
1 row in set, 1 warning (0.00 sec)

从执行计划Extra: Using temporary可以看到这里使用到了临时表

  • 使用索引的group by,没有使用到临时表
mysql> explain select c1,count(*) from t2 group by c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key          | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t2    | NULL       | index | c1_c2_c3_idx  | c1_c2_c3_idx | 768     | NULL |   24 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)

Extra:Using index,且type为index,表示全索引扫描

这种情况下,如果表数据量很大,还是会比较耗时

有索引的情况

对于使用索引的情况,对于索引的访问有两种算法

  • Loose Index Scan:松散索引扫描

    • 不需要扫描所有的索引key,根据分组前缀(group by字段)跳跃扫描部分
    • 执行计划Extra有:Using index for group-by
    • Tight Index Scan:紧凑索引扫描

    • 需要扫描范围或全部的索引key

    • 执行计划Extra有:Using index

另外还有一种两种算法结合使用的方式:

  • 在松散索引扫描的成本大于紧凑索引扫描时可能用到,后文说明

下面是两条SQL分别使用Loose Index Scan和Tight Index Scan:

mysql> explain SELECT c1,MIN(c2) FROM t2 GROUP BY c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key          | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | c1_c2_c3_idx  | c1_c2_c3_idx | 256     | NULL |    7 |   100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)


mysql> explain SELECT c1,COUNT(*) FROM t2 GROUP BY c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key          | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t2    | NULL       | index | c1_c2_c3_idx  | c1_c2_c3_idx | 768     | NULL |   24 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)


mysql> SELECT c1,MIN(c2),COUNT(*) FROM t2 GROUP BY c1;
+----+---------+----------+
| c1 | MIN(c2) | COUNT(*) |
+----+---------+----------+
| a  | b       |       12 |
| d  | b       |        2 |
| e  | b       |        2 |
| f  | b       |        4 |
| k  | c       |        2 |
| y  | d       |        2 |
+----+---------+----------+
6 rows in set (0.00 sec)

第一条SQL使用Loose Index Scan,示意图为::

group by优化原理

第二条SQL使用Tight Index Scan,示意图为:

group by优化原理

Loose Index Scan

跳跃扫描部分索引key,而不需要扫描全部

例如:

  • 执行计划Extra为:Using index for group-by,表示使用松散索引扫描
mysql> explain SELECT c1,MIN(c2) FROM t2 GROUP BY c1;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key          | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | c1_c2_c3_idx  | c1_c2_c3_idx | 256     | NULL |    7 |   100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

使用场景,可大致归为两类:

1、当需要获取每个分组的某条记录,而非对全部记录做聚合运算时可能会用到Loose Index Scan,比如:

  • min()、max():获取每个分组的最小值或最大值
  • agg(distinct):count(distinct)、sum(distinct)、avg(distinct)

注意:如果sql语句中既有min\max中的1-2个,也有count(distinct)\sum(distinct)\avg(distinct)中的1-3个时,无法用到Loose index,两组分别出现的时候才可能会用到

2、distinct 可以转换为group by进行处理

其他的必要条件:

  • 查询基于一个表
  • group by字段满足索引的最左匹配原则,例如,一个表有(c1,c2,c3)的索引,sql为group by (c1,c2)时可以用到索引,sql为(c1,c3)时无法用到索引
  • 聚合函数使用的列,必须包含在索引上;且使用多个聚合函数时,必须使用相同的字段,且group by 字段+聚合函数字段也必须满足最左匹配原则
  • 索引中字段必须是全字段索引,而不能是前缀索引,例如 INDEX(c1(10))

以上条件结合索引的结构就很好理解了

另外,在选择是否使用Loose Index Scan时,也会收到SQL、统计信息、成本等因素的影响

因此,针对当前表结构,列举几个常见的可以使用到Loose Index Scan的SQL:

场景1
-- MIN()、MAX()
SELECT c1,MIN(c2),MAX(c2) FROM t2 GROUP BY c1;
SELECT c1,c2,MAX(c3),MIN(c3)  FROM t2 WHERE c2 > 'k' GROUP BY c1, c2;
SELECT c1,c2,MAX(c3),MIN(c3)  FROM t2 WHERE c3 > 'k' GROUP BY c1, c2;
SELECT c1,c2,c3,MAX(id) FROM t2  GROUP BY c1,c2,c3;  
SELECT c1, c2 FROM t2 WHERE c3 = 'd' GROUP BY c1, c2;
-- 以下几种情况在当前数据量和数据分布下没有用到,和成本计算有关,结合后文成本对比的章节改变数据量和数据分布测试出来
SELECT c1,c2,MAX(c3),MIN(c3)  FROM t2 WHERE c1>='k' and c2 > 'f' GROUP BY c1, c2;  
SELECT DISTINCT c1, c2 FROM t2 where c1>'k';  
SELECT c1,c2,count(distinct c3) FROM t2 where c1>='k' and c2>'k' GROUP BY c1,c2;  



-- count(distinct)、sum(distinct)、avg(distinct)
SELECT c1,count(distinct c2,c3) FROM t2 GROUP BY c1;
SELECT c1,c2,sum(distinct c3) FROM t2 GROUP BY c1,c2;
SELECT c1,c2,sum(distinct c3) FROM t2 where c2>'k' GROUP BY c1,c2;



场景2
SELECT DISTINCT c1, c2 FROM t2;




-- 无法使用到Loose Index Scan
SELECT c1,count(distinct c2,c3),sum(distinct c2) FROM t2 GROUP BY c1;

Tight Index Scan

对于无法使用到Loose Index Scan的一些group by,在满足索引最左匹配原则情况下可能会用到Tight Index Scan

该种方式实际上是范围索引扫描或全部索引扫描,数据量大的情况下性能仍然可能会比较差,但是相比无索引还是可以避免使用临时表和全表扫描,在某些情况下有一定的优化作用

两种算法的结合

对于AGG(DISTINCT),即SUM|COUNT|AVG(distinct),可能会出现使用松散索引扫描成本大于紧凑��引扫描的情况

两种方式在引擎层主要包含的成本:

  • Loose Index Scan

    • 读取分组的第一条记录,得到分组前缀
    • 根据分组前缀读取分组的第一条或最后一条记录返回给server层
    • Tight Index Scan

    • 从引擎层读取数据,返回给server层

    • server层判断是否符合where条件的记录,并根据聚合函数进行处理

可以看到,对于引擎层的访问,Loose Index Scan的成本有可能会高于Tight Index Scan的,且在MySQL中,引擎层读取数据页的成本常数是1,Server层判断一条记录的成本常数是0.2,在某些情况下,Loose Index Scan的成本会高于Tight Index Scan

至于MIN/MAX为什么不会出现Loose Index Scan成本>Tight Index Scan成本,我理解只有到组内值都是唯一的情况下才会出现吧,那这样也没有必要去分组求最值了

比如:

  • 当分组较多,但组内的记录数并不多或唯一值较高的情况,对于每一个分组,都需要扫描两次,能跳过的记录数很少的情况。即Loose Index Scan在分组字段的选择性相对不太高,组内的数据量相对较多的情况更适用

例如,如下SQL:

select count(distinct c1,c2) from t2;

在当前的测试数据中,松散扫描的成本还是要低于紧凑扫描,执行计划如下:

mysql> explain select count(distinct c1,c2) from t2;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys | key          | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | c1_c2_c3_idx  | c1_c2_c3_idx | 512     | NULL |   10 |   100.00 | Using index for group-by |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.01 sec)

新建一个相同表结构的表,插入下面的测试数据:

INSERT INTO t2 VALUES (null,'a','b','a','a'), (null,'a','b','a','a'),
                      (null,'k','c','a','a'), (null,'k','g','a','a'),
                      (null,'a','d','b','b'), (null,'a','b','b','b'),
                      (null,'d','b','c','c'), (null,'e','b','c','c'),
                      (null,'f','c','d','d'), (null,'k','c','d','d'),
                      (null,'y','d','y','y'), (null,'f','b','f','y'),
                      (null,'j','b','a','a'), (null,'m','b','a','a'),
                      (null,'z','c','a','a'), (null,'t','c','a','a'),
                      (null,'x','d','b','b'), (null,'x','b','b','b'),
                      (null,'d','b','c','c'), (null,'e','b','c','c'),
                      (null,'f','c','d','d'), (null,'k','c','d','d'),
                      (null,'y','d','y','y'), (null,'f','b','f','y'); 

-- 其数据分布
mysql> select c1,c2,count(*) from t2 group by c1,c2;
+----+----+----------+
| c1 | c2 | count(*) |
+----+----+----------+
| a  | b  |        3 |
| a  | d  |        1 |
| d  | b  |        2 |
| e  | b  |        2 |
| f  | b  |        2 |
| f  | c  |        2 |
| j  | b  |        1 |
| k  | c  |        3 |
| k  | g  |        1 |
| m  | b  |        1 |
| t  | c  |        1 |
| x  | b  |        1 |
| x  | d  |        1 |
| y  | d  |        2 |
| z  | c  |        1 |
+----+----+----------+
15 rows in set (0.00 sec)

执行计划Extra:Using index for group-by (scanning)

mysql> explain select count(distinct c1,c2) from t2;
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key          | key_len | ref  | rows | filtered | Extra                               |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | c1_c2_c3_idx  | c1_c2_c3_idx | 512     | NULL |   16 |   100.00 | Using index for group-by (scanning) |
+----+-------------+-------+------------+-------+---------------+--------------+---------+------+------+----------+-------------------------------------+
1 row in set, 1 warning (0.00 sec)

该方式可以理解为Loose Index Scan的扩展或两种方式的结合:

  • 索引顺序扫描的同时进行去重

其执行的示意图如下:

group by优化原理

最后,再回到文章开头的案例,其执行计划如下:其核心就是将紧凑索引扫描转化为了松散索引扫描

  • 优化前执行计划:

group by优化原理

  • 优化后执行计划:

group by优化原理

小结

对于group by可以使用索引进行优化,Loose Index Scan相对于Tight Index Scan在一些情况下可以大大减少扫描的行数,使用Loose Index Scan时,执行计划Extra有:Using index for group-by

在Loose Index Scan的成本大于Tight Index 一些情况Loose Index Scan的一些情况下,可能可以用到两者的结合的方式,执行计划中Extra有:Using index for group-by (scanning)

Loose Index Scan更适用于分组内重复值相对较多,分组个数相对较少的情况

可根据实际情况进行优化

参考链接

dev.mysql.com/doc/refman/…

dev.mysql.com/blog-archiv…

MySQL 怎么用索引实现 group by?-鸿蒙开发者社区-51CTO.COM

www.oreilly.com/library/vie…

原文链接: https://juejin.cn/post/7369534106604355621

文章收集整理于网络,请勿商用,仅供个人学习使用,如有侵权,请联系作者删除,如若转载,请注明出处:http://www.cxyroad.com/17426.html

QR code