使用EXPLAIN进行SQL调优

本文档介绍了SQL查询缓慢的常见原因,并描述了如何使用EXPLAIN来解决针对查询中的问题。

问题类型

全表扫描

查询缓慢的最常见原因是SELECT语句的全表扫描和索引的不正确使用。

数据准备

create table cust_info(
    cust_no       string primary key,
    cust_name     varchar(30) not null,
    cust_card_no  varchar(18),  
    cust_phoneno  decimal(15),
    cust_address  varchar(30),
    cust_type     varchar(10)
 );
 

根据不在主键或任何索引中的列从大型表中检索少量行时,性能通常会很差:

SELECT * FROM cust_info WHERE cust_name = '刘明';
  cust_no  | cust_name |    cust_card_no    | cust_phoneno | cust_address | cust_type
-----------+-----------+--------------------+--------------+--------------+------------
  14435551 | 刘明      | 431256197306265320 |  15534343555 | 山西临汾     | 质押
  14435552 | 刘明      | 371452199303034312 |  18967756743 | 陕西延安     | 信用
  14435553 | 刘明      | 52112119860621421X |  15833355455 | 湖北武汉     | 抵押
(3 rows)

Time: 442ms total (execution 1ms / network 1ms)

要了解此查询执行不佳的原因,使用EXPLAIN

explain SELECT * FROM cust_info WHERE cust_name = '刘明';
                                       info
-----------------------------------------------------------------------------------
  distribution: full
  vectorized: false

  • filter
  │ estimated row count: 1
  │ filter: cust_name = e'\U00005218\U0000660E'
  │
  └── • scan
        estimated row count: 5 (100% of the table; stats collected 8 minutes ago)
        table: cust_info@primary
        spans: FULL SCAN

spans: FULL SCAN展示,如果列cust_name上没有索引,hubble会扫描表的每一行,cust_info按主键排序,直到找到具有正确cust_name值的行。

解决方法:按照索引过滤

要加快此查询,在cust_name上添加二级索引:

create index on cust_info (cust_name);

查询现在将返回得更快:

SELECT * FROM cust_info WHERE cust_name = '刘明';
  cust_no  | cust_name |    cust_card_no    | cust_phoneno | cust_address | cust_type
-----------+-----------+--------------------+--------------+--------------+------------
  14435551 | 刘明      | 431256197306265320 |  15534343555 | 山西临汾     | 质押
  14435552 | 刘明      | 371452199303034312 |  18967756743 | 陕西延安     | 信用
  14435553 | 刘明      | 52112119860621421X |  15833355455 | 湖北武汉     | 抵押
(3 rows)

Time: 6ms total (execution 1ms / network 1ms)

要了解性能提高的原因,使用EXPLAIN查看新的查询计划:

explain SELECT * FROM cust_info WHERE cust_name = '刘明';
                                       info
-----------------------------------------------------------------------------------
  distribution: local
  vectorized: false

  • index join
  │ estimated row count: 1
  │ table: cust_info@primary
  │
  └── • scan
        estimated row count: 1 (20% of the table; stats collected 18 minutes ago)
        table: cust_info@cust_info_cust_name_idx
        spans: [/e'\U00005218\U0000660E' - /e'\U00005218\U0000660E']

有索引后,查询可以直接跳转到相关值,然后返回整行。

解决方案:通过存储额外列的二级索引进行过滤

例如,经常检索用户名和卡号

SELECT cust_name,cust_card_no FROM cust_info WHERE cust_name = '刘明';
 cust_name |    cust_card_no
------------+---------------------
  刘明      | 431256197306265320
  刘明      | 371452199303034312
  刘明      | 52112119860621421X
  
(3 rows)

Time: 4ms total (execution 2ms / network 0ms)
  

有了当前的二级索引cust_name,hubble仍然需要扫描一级索引来获取卡号

删除并重新创建索引cust_name,这次将cust_card_no值存储在索引中

drop index cust_info_cust_name_idx;
create index on cust_info (cust_name) STORING (cust_card_no);

现在cust_card_no值存储在cust_name上的索引中,hubble只需要扫描该索引:

explain SELECT cust_name,cust_card_no FROM cust_info WHERE cust_name = '刘明';
                                     info
-------------------------------------------------------------------------------
  distribution: local
  vectorized: false

  • scan
    estimated row count: 3 (60% of the table; stats collected 14 minutes ago)
    table: cust_info@cust_info_cust_name_idx
    spans: [/e'\U00005218\U0000660E' - /e'\U00005218\U0000660E']
SELECT cust_name,cust_card_no FROM cust_info WHERE cust_name = '刘明';
 cust_name |    cust_card_no
------------+---------------------
  刘明      | 431256197306265320
  刘明      | 371452199303034312
  刘明      | 52112119860621421X
  
(3 rows)

Time: 2ms total (execution 2ms / network 0ms)
  

这样会使查询更快

低效连接

基于成本的优化器无法执行查找连接,因为该查询没有可用的表主键前缀,因此必须读取整个表并搜索匹配项,从而导致速度慢询问:

EXPLAIN SELECT * FROM emp JOIN dept on  emp.deptno = dept.deptno limit 1;
                                         info
----------------------------------------------------------------------------------------
  distribution: full
  vectorized: false

  • limit
  │ estimated row count: 1
  │ count: 1
  │
  └── • hash join
      │ estimated row count: 27
      │ equality: (deptno) = (deptno)
      │
      ├── • scan
      │     estimated row count: 27 (100% of the table; stats collected 3 minutes ago)
      │     table: emp@primary
      │     spans: FULL SCAN
      │
      └── • scan
            estimated row count: 5 (100% of the table; stats collected 2 minutes ago)
            table: dept@primary
            spans: FULL SCAN

解决方法:提供主键允许lookup join

为了加快查询速度,可以提供主键以允许基于成本的优化器执行查找连接而不是散列连接:

EXPLAIN SELECT * FROM emp JOIN dept on  emp.deptno = dept.deptno and emp.empno=dept.empno limit 1;
                                         info
---------------------------------------------------------------------------------------
  distribution: full
  vectorized: false

  • limit
  │ estimated row count: 0
  │ count: 1
  │
  └── • lookup join
      │ estimated row count: 0
      │ table: emp@primary
      │ equality: (empno) = (empno)
      │ equality cols are key
      │ pred: deptno = deptno
      │
      └── • scan
            estimated row count: 5 (100% of the table; stats collected 2 minutes ago)
            table: dept@primary
            spans: FULL SCAN

连接来自不同表的数据

在连接来自不同表的数据时,二级索引也很重要。

要计算在给定区间的员工数。

SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
  count
---------
     17
(1 row)

Time: 90ms total (execution 90ms / network 0ms)

要了解发生了什么,使用EXPLAIN查看查询计划:

explain SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
                                              info
-------------------------------------------------------------------------------------------------
  distribution: full
  vectorized: false

  • group (scalar)
  │ estimated row count: 1
  │
  └── • distinct
      │ estimated row count: 7
      │ distinct on: empno
      │
      └── • hash join
          │ estimated row count: 9
          │ equality: (loc) = (loc)
          │
          ├── • filter
          │   │ estimated row count: 9
          │   │ filter: (deptno >= 20) AND (deptno <= 40)
          │   │
          │   └── • scan
          │         estimated row count: 27 (100% of the table; stats collected 52 minutes ago)
          │         table: emp@primary
          │         spans: FULL SCAN
          │
          └── • filter
              │ estimated row count: 2
              │ filter: (deptno >= 20) AND (deptno <= 40)
              │
              └── • scan
                    estimated row count: 5 (100% of the table; stats collected 51 minutes ago)
                    table: dept@primary
                    spans: FULL SCAN
(31 rows)

Time: 2ms total (execution 1ms / network 0ms)

hubble首先进行全表扫描dept以获取指定行,然后进行另一次全表扫描emp以查找匹配的行并计算计数。

解决方案:在存储连接键的条件上创建二级索引

CREATE INDEX ON dept(deptno) STORING (loc);
SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
  count
---------
     17
(1 row)

Time: 7ms total (execution 90ms / network 0ms)

可见,添加二级索引减少了查询时间。

要了解性能提高的原因,请再次使用EXPLAIN查看新的查询计划:

explain SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
                                            info
---------------------------------------------------------------------------------------------
  distribution: full
  vectorized: false

  • group (scalar)
  │ estimated row count: 1
  │
  └── • distinct
      │ estimated row count: 0
      │ distinct on: empno
      │
      └── • hash join
          │ estimated row count: 0
          │ equality: (loc) = (loc)
          │
          ├── • scan
          │     estimated row count: 27 (100% of the table; stats collected 10 minutes ago)
          │     table: emp@primary
          │     spans: FULL SCAN
          │
          └── • scan
                estimated row count: 3 (60% of the table; stats collected 8 minutes ago)
                table: dept@dept_deptno_idx
                spans: [/20 - /40]

hubble现在开始使用dept@dept_deptno_idx索引来检索数据,而无需扫描整个表。