本文档介绍了SQL查询缓慢的常见原因,并描述了如何使用EXPLAIN
来解决针对查询中的问题。
查询缓慢的最常见原因是SELECT
语句的全表扫描和索引的不正确使用。
数据准备
create table cust_info(
cust_no string primary key,
cust_name varchar(30) not null,
cust_card_no varchar(18),
cust_phoneno decimal(15),
cust_address varchar(30),
cust_type varchar(10)
);
根据不在主键或任何索引中的列从大型表中检索少量行时,性能通常会很差:
SELECT * FROM cust_info WHERE cust_name = '刘明';
cust_no | cust_name | cust_card_no | cust_phoneno | cust_address | cust_type
-----------+-----------+--------------------+--------------+--------------+------------
14435551 | 刘明 | 431256197306265320 | 15534343555 | 山西临汾 | 质押
14435552 | 刘明 | 371452199303034312 | 18967756743 | 陕西延安 | 信用
14435553 | 刘明 | 52112119860621421X | 15833355455 | 湖北武汉 | 抵押
(3 rows)
Time: 442ms total (execution 1ms / network 1ms)
要了解此查询执行不佳的原因,使用EXPLAIN
:
explain SELECT * FROM cust_info WHERE cust_name = '刘明';
info
-----------------------------------------------------------------------------------
distribution: full
vectorized: false
• filter
│ estimated row count: 1
│ filter: cust_name = e'\U00005218\U0000660E'
│
└── • scan
estimated row count: 5 (100% of the table; stats collected 8 minutes ago)
table: cust_info@primary
spans: FULL SCAN
spans: FULL SCAN
展示,如果列cust_name
上没有索引,hubble会扫描表的每一行,cust_info
按主键排序,直到找到具有正确cust_name
值的行。
解决方法:按照索引过滤
要加快此查询,在cust_name
上添加二级索引:
create index on cust_info (cust_name);
查询现在将返回得更快:
SELECT * FROM cust_info WHERE cust_name = '刘明';
cust_no | cust_name | cust_card_no | cust_phoneno | cust_address | cust_type
-----------+-----------+--------------------+--------------+--------------+------------
14435551 | 刘明 | 431256197306265320 | 15534343555 | 山西临汾 | 质押
14435552 | 刘明 | 371452199303034312 | 18967756743 | 陕西延安 | 信用
14435553 | 刘明 | 52112119860621421X | 15833355455 | 湖北武汉 | 抵押
(3 rows)
Time: 6ms total (execution 1ms / network 1ms)
要了解性能提高的原因,使用EXPLAIN
查看新的查询计划:
explain SELECT * FROM cust_info WHERE cust_name = '刘明';
info
-----------------------------------------------------------------------------------
distribution: local
vectorized: false
• index join
│ estimated row count: 1
│ table: cust_info@primary
│
└── • scan
estimated row count: 1 (20% of the table; stats collected 18 minutes ago)
table: cust_info@cust_info_cust_name_idx
spans: [/e'\U00005218\U0000660E' - /e'\U00005218\U0000660E']
有索引后,查询可以直接跳转到相关值,然后返回整行。
解决方案:通过存储额外列的二级索引进行过滤
例如,经常检索用户名和卡号
SELECT cust_name,cust_card_no FROM cust_info WHERE cust_name = '刘明';
cust_name | cust_card_no
------------+---------------------
刘明 | 431256197306265320
刘明 | 371452199303034312
刘明 | 52112119860621421X
(3 rows)
Time: 4ms total (execution 2ms / network 0ms)
有了当前的二级索引cust_name
,hubble仍然需要扫描一级索引来获取卡号
删除并重新创建索引cust_name
,这次将cust_card_no
值存储在索引中
drop index cust_info_cust_name_idx;
create index on cust_info (cust_name) STORING (cust_card_no);
现在cust_card_no
值存储在cust_name
上的索引中,hubble只需要扫描该索引:
explain SELECT cust_name,cust_card_no FROM cust_info WHERE cust_name = '刘明';
info
-------------------------------------------------------------------------------
distribution: local
vectorized: false
• scan
estimated row count: 3 (60% of the table; stats collected 14 minutes ago)
table: cust_info@cust_info_cust_name_idx
spans: [/e'\U00005218\U0000660E' - /e'\U00005218\U0000660E']
SELECT cust_name,cust_card_no FROM cust_info WHERE cust_name = '刘明';
cust_name | cust_card_no
------------+---------------------
刘明 | 431256197306265320
刘明 | 371452199303034312
刘明 | 52112119860621421X
(3 rows)
Time: 2ms total (execution 2ms / network 0ms)
这样会使查询更快
基于成本的优化器无法执行查找连接,因为该查询没有可用的表主键前缀,因此必须读取整个表并搜索匹配项,从而导致速度慢询问:
EXPLAIN SELECT * FROM emp JOIN dept on emp.deptno = dept.deptno limit 1;
info
----------------------------------------------------------------------------------------
distribution: full
vectorized: false
• limit
│ estimated row count: 1
│ count: 1
│
└── • hash join
│ estimated row count: 27
│ equality: (deptno) = (deptno)
│
├── • scan
│ estimated row count: 27 (100% of the table; stats collected 3 minutes ago)
│ table: emp@primary
│ spans: FULL SCAN
│
└── • scan
estimated row count: 5 (100% of the table; stats collected 2 minutes ago)
table: dept@primary
spans: FULL SCAN
解决方法:提供主键允许lookup join
为了加快查询速度,可以提供主键以允许基于成本的优化器执行查找连接而不是散列连接:
EXPLAIN SELECT * FROM emp JOIN dept on emp.deptno = dept.deptno and emp.empno=dept.empno limit 1;
info
---------------------------------------------------------------------------------------
distribution: full
vectorized: false
• limit
│ estimated row count: 0
│ count: 1
│
└── • lookup join
│ estimated row count: 0
│ table: emp@primary
│ equality: (empno) = (empno)
│ equality cols are key
│ pred: deptno = deptno
│
└── • scan
estimated row count: 5 (100% of the table; stats collected 2 minutes ago)
table: dept@primary
spans: FULL SCAN
在连接来自不同表的数据时,二级索引也很重要。
要计算在给定区间的员工数。
SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
count
---------
17
(1 row)
Time: 90ms total (execution 90ms / network 0ms)
要了解发生了什么,使用EXPLAIN
查看查询计划:
explain SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
info
-------------------------------------------------------------------------------------------------
distribution: full
vectorized: false
• group (scalar)
│ estimated row count: 1
│
└── • distinct
│ estimated row count: 7
│ distinct on: empno
│
└── • hash join
│ estimated row count: 9
│ equality: (loc) = (loc)
│
├── • filter
│ │ estimated row count: 9
│ │ filter: (deptno >= 20) AND (deptno <= 40)
│ │
│ └── • scan
│ estimated row count: 27 (100% of the table; stats collected 52 minutes ago)
│ table: emp@primary
│ spans: FULL SCAN
│
└── • filter
│ estimated row count: 2
│ filter: (deptno >= 20) AND (deptno <= 40)
│
└── • scan
estimated row count: 5 (100% of the table; stats collected 51 minutes ago)
table: dept@primary
spans: FULL SCAN
(31 rows)
Time: 2ms total (execution 1ms / network 0ms)
hubble首先进行全表扫描dept
以获取指定行,然后进行另一次全表扫描emp
以查找匹配的行并计算计数。
解决方案:在存储连接键的条件上创建二级索引
CREATE INDEX ON dept(deptno) STORING (loc);
SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
count
---------
17
(1 row)
Time: 7ms total (execution 90ms / network 0ms)
可见,添加二级索引减少了查询时间。
要了解性能提高的原因,请再次使用EXPLAIN
查看新的查询计划:
explain SELECT count(DISTINCT emp.empno) FROM emp INNER JOIN dept ON emp.loc = dept.loc WHERE dept.deptno BETWEEN 20 AND 40;
info
---------------------------------------------------------------------------------------------
distribution: full
vectorized: false
• group (scalar)
│ estimated row count: 1
│
└── • distinct
│ estimated row count: 0
│ distinct on: empno
│
└── • hash join
│ estimated row count: 0
│ equality: (loc) = (loc)
│
├── • scan
│ estimated row count: 27 (100% of the table; stats collected 10 minutes ago)
│ table: emp@primary
│ spans: FULL SCAN
│
└── • scan
estimated row count: 3 (60% of the table; stats collected 8 minutes ago)
table: dept@dept_deptno_idx
spans: [/20 - /40]
hubble现在开始使用dept@dept_deptno_idx
索引来检索数据,而无需扫描整个表。