数据查询语言
规则
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ OFFSET count [ ROW | ROWS ] ]
[ LIMIT { count | ALL } | FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
from_item
是以下的一种:
table_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
或
from_item join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
join_type
包括:
[ INNER ] JOIN
LEFT [ OUTER ] JOIN
RIGHT [ OUTER ] JOIN
FULL [ OUTER ] JOIN
CROSS JOIN
grouping_element
包括:
()
expression
GROUPING SETS ( ( column [, ...] ) [, ...] )
CUBE ( column [, ...] )
ROLLUP ( column [, ...] )
描述
查询语法,由一个表或多个表查询数据。
详细解释
WITH子句
用于定义在查询时使用的命名关系,它允许扁平化的嵌套查询或简化子查询。例如,以下查询是等效的:
SELECT a, b
FROM (
SELECT a, MAX(b) AS b FROM t GROUP BY a
) AS x;
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;
也可以用于多子查询语句:
WITH
t1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),
t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)
SELECT t1.*, t2.*
FROM t1
JOIN t2 ON t1.a = t2.a;
并且可以是一个链式:
WITH
x AS (SELECT a FROM t),
y AS (SELECT a AS b FROM x),
z AS (SELECT b AS c FROM y)
SELECT c FROM z;
GROUP BY子句
GROUP BY
将SELECT语句的输出分成包含匹配值的行组,简单的GROUP BY
包含由列组成的表达式,也可以按位置选择输出列的序号(从1开始)。
以下查询是等效的,都是按nationkey
对查询进行分组,第一个使用序号,第二个使用列名:
SELECT count(*), nationkey FROM customer GROUP BY 2;
SELECT count(*), nationkey FROM customer GROUP BY nationkey;
GROUP BY
语句也可以按不在SELECT
内的字段进行分组:
SELECT count(*) FROM customer GROUP BY mktsegment;
所有的输出都必须是select中的字段或者是聚合函数。
复杂的GROUP BY
语句,支持使用GROUPING SETS
,CUBE
,ROLLUP
语法,允许对需要的多组列进行聚合分析,复杂分组操作不支持对输入列组成的表达式进行分组,只允许使用列名和序号。
复杂分组操作通常等价于一个联合的所有简单分组的表达式,当然,由于连接源的不同,有些等价表达式可能并不适用。
GROUPING SETS
允许指定分组的列的多个列表,不属于分组列的指定子列表的列将被设置为NULL。
SELECT * FROM shipping;
origin_city | origin_zip | destination_city | destination_zip | package_weight
--------------+------------+------------------+-----------------+-----------------
beijing | 100000 | tianjin | 300000 | 12
tianjin | 300010 | shanghai | 200000 | 60
beijing | 100100 | chongqing | 400000 | 38
guangzhou | 510000 | tianjin | 300100 | 14
beijing | 100012 | shanghai | 200000 | 78
beijing | 100000 | tianjin | 300010 | 40
(6 rows in set)
分组例子:
SELECT origin_city, origin_zip, destination_city, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_city),
(origin_city, origin_zip),
(destination_city));
origin_city | origin_zip | destination_city | _col3
+-------------+------------+------------------+-------+
beijing | NULL | NULL | 168
beijing | 100100 | NULL | 38
tianjin | 300010 | NULL | 60
guangzhou | NULL | NULL | 14
beijing | 100012 | NULL | 78
beijing | 100000 | NULL | 52
tianjin | NULL | NULL | 60
guangzhou | 510000 | NULL | 14
NULL | NULL | shanghai | 138
NULL | NULL | chongqing | 38
NULL | NULL | tianjin | 66
(11 rows in set)
逻辑上等同于下面的语法:
SELECT origin_city, NULL, NULL, sum(package_weight)
FROM shipping GROUP BY origin_city
UNION ALL
SELECT origin_city, origin_zip, NULL, sum(package_weight)
FROM shipping GROUP BY origin_city, origin_zip
UNION ALL
SELECT NULL, NULL, destination_city, sum(package_weight)
FROM shipping GROUP BY destination_city;
使用UNION ALL将从基础表中读取三次数据,而使用GROUPING SETS
只需要读取一次。
CUBE
CUBE
为给定的列集生成所有可能的分组集(即幂集)。例如,查询:
SELECT origin_city, destination_city, sum(package_weight)
FROM shipping
GROUP BY CUBE (origin_city, destination_city);
等同于:
SELECT origin_city, destination_city, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_city, destination_city),
(origin_city),
(destination_city),
());
得到结果:
origin_city | destination_city | _col2
+-------------+------------------+-------+
tianjin | NULL | 60
NULL | chongqing | 38
NULL | NULL | 242
guangzhou | NULL | 14
tianjin | shanghai | 60
beijing | chongqing | 38
guangzhou | tianjin | 14
NULL | shanghai | 138
NULL | tianjin | 66
beijing | shanghai | 78
beijing | tianjin | 52
beijing | NULL | 168
(12 rows in set)
ROLLUP
为给定的列集生成所有可能的分类汇总。例如,查询:
SELECT origin_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ROLLUP (origin_city, origin_zip);
等同于
SELECT origin_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS ((origin_city, origin_zip), (origin_city), ());
得到结果
origin_city | origin_zip | _col2
+------------+------------+------+
beijing | 100000 | 52
tianjin | NULL | 60
beijing | 100100 | 38
guangzhou | NULL | 14
beijing | NULL | 168
beijing | 100012 | 78
NULL | NULL | 242
guangzhou | 510000 | 14
tianjin | 300010 | 60
(9 rows in set)
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY
GROUPING SETS ((origin_city, destination_city)),
ROLLUP (origin_zip);
可以重写为:
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY
GROUPING SETS ((origin_city, destination_city)),
GROUPING SETS ((origin_zip), ());
等同于:
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_city, destination_city, origin_zip),
(origin_city, destination_city));
输出结果:
origin_city | destination_city | origin_zip | _col3
+------------+------------------+------------+------+
beijing | tianjin | 100000 | 52
beijing | tianjin | NULL | 52
guangzhou | tianjin | NULL | 14
beijing | chongqing | 100100 | 38
guangzhou | tianjin | 510000 | 14
beijing | chongqing | NULL | 38
tianjin | shanghai | NULL | 60
tianjin | shanghai | 300010 | 60
beijing | shanghai | NULL | 78
beijing | shanghai | 100012 | 78
(10 rows in set)
ALL
和DISTINCT
确定重复分组集是否每个都产生不同的输出行。当在同一查询中组合多个复杂分组集时,这尤其有用。例如,以下查询:
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ALL
CUBE (origin_city, destination_city),
ROLLUP (origin_city, origin_zip);
等同于:
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_city, destination_city, origin_zip),
(origin_city, origin_zip),
(origin_city, destination_city, origin_zip),
(origin_city, origin_zip),
(origin_city, destination_city),
(origin_city),
(origin_city, destination_city),
(origin_city),
(origin_city, destination_city),
(origin_city),
(destination_city),
());
而如果使用DISTINCT
:
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY DISTINCT
CUBE (origin_city, destination_city),
ROLLUP (origin_city, origin_zip);
则等同于:
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_city, destination_city, origin_zip),
(origin_city, origin_zip),
(origin_city, destination_city),
(origin_city),
(destination_city),
());
分组操作返回转换为十进制的位集,指示分组中存在哪些列。它必须与GROUPING SETS
, ROLLUP
, CUBE
或 GROUP BY
一起使用,其参数必须与相应GROUPING SETS
, ROLLUP
, CUBE
或 GROUP BY
子句中引用的列完全匹配。
为了计算特定行的结果位集,位被分配给参数列,最右边的列是最低有效位。对于给定的分组,如果分组中包含相应的列,则位设置为0,否则设置为1。例如,考虑下面的查询:
SELECT origin_city, origin_zip, destination_city, sum(package_weight),
grouping(origin_city, origin_zip, destination_city)
FROM shipping
GROUP BY GROUPING SETS (
(origin_city),
(origin_city, origin_zip),
(destination_city));
结果:
origin_city | origin_zip | destination_city | _col3 | _col4
+------------+------------+------------------+-------+------+
beijing | NULL | NULL | 168 | 3
beijing | 100000 | NULL | 52 | 1
guangzhou | NULL | NULL | 14 | 3
guangzhou | 510000 | NULL | 14 | 1
tianjin | NULL | NULL | 60 | 3
beijing | 100012 | NULL | 78 | 1
NULL | NULL | tianjin | 66 | 6
NULL | NULL | chongqing | 38 | 6
NULL | NULL | shanghai | 138 | 6
tianjin | 300010 | NULL | 60 | 1
beijing | 100100 | NULL | 38 | 1
(11 rows in set)
HAVING子句
HAVING子句与聚合函数和GROUP BY子句一起使用,以控制选择哪些组。HAVING子句消除不满足给定条件的组。在计算组和聚合之后有过滤器组。
以下示例查询customer表并选择帐户余额大于指定值的组:
SELECT count(*), c_mktsegment, c_nationkey,
CAST(sum(c_acctbal) AS bigint) AS totalbal
FROM customer
GROUP BY c_mktsegment, c_nationkey
HAVING sum(c_acctbal) > 542666087
ORDER BY totalbal DESC;
结果:
_col0 | c_mktsegment | c_nationkey | totalbal
+-------+--------------+-------------+----------+
120838 | AUTOMOBILE | 20 | 543268712
120339 | BUILDING | 10 | 543056590
120760 | HOUSEHOLD | 6 | 543017181
(3 rows in set)
UNION | INTERSECT | EXCEPT子句
UNION INTERSECT EXCEPT都是set操作. 这些子句用于将多个select语句的结果组合到单个结果集中:
query UNION [ALL | DISTINCT] query
query INTERSECT [DISTINCT] query
query EXCEPT [DISTINCT] query
参数ALL
或DISTINCT
控制最终结果集中包含哪些行。如果指定了参数ALL
,则即使行相同,也包括所有行。如果指定了参数DISTINCT
,则组合结果集中只包含唯一的行。如果两者都未指定,则行为默认为DISTINCT
。INTERSECT
或EXCEPT
不支持ALL
参数。
除非通过括号明确指定顺序,否则将从左到右处理多个集合操作。此外,INTERSECT
比EXCEPT
和UNION
优先级更高。也就是说,A UNION B INTERSECT C EXCEPT D
,等同于A UNION (B INTERSECT C) EXCEPT D
。
UNION
连接操作,返回两个结果合并的集合。
SELECT 13
UNION
SELECT 42;
_col0
+-------+
13
42
(2 rows in set)
SELECT 13
UNION
SELECT * FROM (VALUES 42, 13);
_col0
+-------+
42
13
(2 rows in set)
SELECT 13
UNION ALL
SELECT * FROM (VALUES 42, 13);
_col0
+-------+
42
13
13
(3 rows in set)
由上面的三个例子可以看出,如果不带ALL
实际结果进行了合并去重。
INTERSECT
返回两个结果集的交集。也就是说返回两个结果集中都有的结果。
SELECT * FROM (VALUES 13, 42)
INTERSECT
SELECT 13;
_col0
+-------+
13
(1 row in set)
EXCEPT
返回第一个集合的相对补集。即在第一个集合中存在,在第二个集合中不存在的值。
SELECT * FROM (VALUES 13, 42)
EXCEPT
SELECT * FROM (VALUES 13, 45);
_col0
+-------+
42
(1 row in set)
ORDER BY子句
ORDER BY
子句用于给结果集进行排序。
ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]
ORDER BY
子句在GROUP BY
或HAVING
之后,在OFFSET
,LIMIT
或FETCH FIRST
之前计算。默认NULL排在后边,与升降序无关。
OFFSET子句
OFFSET
用于跳过结果集的多个行。
如果存在ORDER BY
,则将对排序后的结果集计算,并跳过指定行,结果仍然是排序的。
SELECT n_name FROM nation ORDER BY n_name OFFSET 22;
n_name
+----------------+
UNITED KINGDOM
UNITED STATES
VIETNAM
(3 rows in set)
当跳过的行数大于结果集的大小,将返回空。
LIMIT
或FETCH FIRST
子句
LIMIT
或FETCH FIRST
子句限制结果集返回的行数。
LIMIT { count | ALL }
FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }
下面的示例查询一个大表,但是LIMIT
子句限制输出只有五行(因为查询缺少order by,所以返回的行是任意的):
SELECT o_orderdate FROM orders LIMIT 5;
o_orderdate
+-------------+
1994-02-12
1997-07-11
1994-06-18
1992-01-19
1996-09-03
(5 rows in set)
LIMIT ALL
等同于LIMIT
。
FETCH FIRST
子句支持FIRST
或NEXT
关键字以及ROW
或ROWS
关键字。这些关键字是等效的,关键字的选择对查询执行没有影响。
如果在FETCH FIRST
子句中未指定计数,则默认为1:
SELECT o_orderdate FROM orders FETCH FIRST ROW ONLY;
o_orderdate
+-------------+
1994-02-12
(1 row in set)
如果存在OFFSET
子句,则LIMIT
或者FETCH FIRST
在其后进行计算。
SELECT * FROM (VALUES 5, 2, 4, 1, 3) t(x) ORDER BY x OFFSET 2 LIMIT 2;
x
+---+
3
4
(2 rows in set)
对于FETCH FIRST
子句,参数ONLY
或WITH TIES
控制结果集中包含哪些行。
如果只指定了参数,则结果集将限制为由计数确定的行数。
如果指定了WITH TIES
的参数,则需要存在ORDER BY
子句。结果集由同一组的所有行组成,这些行与ORDER BY
子句中的排序所建立的最后一个行(“ties”)相同。结果集排序:
SELECT n_name, n_regionkey FROM nation ORDER BY n_regionkey FETCH FIRST ROW WITH TIES;
n_name | n_regionkey
+----------+-------------+
ETHIOPIA | 0
MOROCCO | 0
KENYA | 0
ALGERIA | 0
MOZAMBIQUE| 0
(5 rows in set)
TABLESAMPLE
包含的采样方法有:
选择每一行作为表样本,其概率为样本百分比。当使用BERNOULLI方法对一个表进行采样时,将扫描该表的所有物理块,并跳过某些行(基于采样百分比与运行时计算的随机值之间的比较)。
结果中包含的行的概率与任何其他行无关。这不会减少从磁盘读取采样表所需的时间。如果进一步处理采样输出,可能会影响总查询时间。
此采样方法将表划分为数据的逻辑段,并以此粒度对表进行采样。此采样方法要么选择特定数据段中的所有行,要么跳过它(基于采样百分比与运行时计算的随机值之间的比较)。
系统采样中选择的行将取决于使用的catalog。例如,当与Hubble一起使用时,它取决于数据上的布局方式。这种方法不能保证独立的抽样概率。
这两个方法都不允许返回的行数有确定的界限。
例子:
SELECT *
FROM nation TABLESAMPLE BERNOULLI(25);
SELECT *
FROM nation TABLESAMPLE SYSTEM(75);
与join结合使用:
SELECT o.*, i.*
FROM orders o TABLESAMPLE SYSTEM (10)
JOIN lineitem i TABLESAMPLE BERNOULLI (40)
ON o.o_orderkey = i.l_orderkey;
UNNEST
UNNEST
可用于将ARRAY
或MAP
展开为关系。ARRAY展开为单列,MAP展开为两列(键、值)。UNNEST
还可以与多个参数一起使用,在这种情况下,它们将展开为多个列,行数与最高基数参数相同(其他列用空值填充)。UNNEST
可以有WITH ORDINALITY
子句,在这种情况下,会在末尾添加一个附加的有序列。UNNEST
通常与JOIN
一起使用,可以引用JOIN
左侧关系中的列。
SELECT numbers, animals, n, a
FROM (
VALUES
(ARRAY[2, 5], ARRAY['dog', 'cat', 'bird']),
(ARRAY[7, 8, 9], ARRAY['cow', 'pig'])
) AS x (numbers, animals)
CROSS JOIN UNNEST(numbers, animals) AS t (n, a);
+-----------+------------------+------+------+
| numbers | animals | n | a |
+-----------+------------------+------+------+
| [2, 5] | [dog, cat, bird] | 2 | dog |
| [2, 5] | [dog, cat, bird] | 5 | cat |
| [2, 5] | [dog, cat, bird] | NULL | bird |
| [7, 8, 9] | [cow, pig] | 7 | cow |
| [7, 8, 9] | [cow, pig] | 8 | pig |
| [7, 8, 9] | [cow, pig] | 9 | NULL |
+-----------+------------------+------+------+
(6 rows in set)
有序列:
SELECT numbers, n, a
FROM (
VALUES
(ARRAY[2, 5]),
(ARRAY[7, 8, 9])
) AS x (numbers)
CROSS JOIN UNNEST(numbers) WITH ORDINALITY AS t (n, a);
+-----------+---+---+
| numbers | n | a |
+-----------+---+---+
| [2, 5] | 2 | 1 |
| [2, 5] | 5 | 2 |
| [7, 8, 9] | 7 | 1 |
| [7, 8, 9] | 8 | 2 |
| [7, 8, 9] | 9 | 3 |
+-----------+---+---+
(5 rows in set)
JOIN
JOIN允许组合来自多个关系的数据。
CROSS JOIN
CROSS JOIN
返回两个关系的笛卡尔积(所有组合)。
下面两个查询等价:
SELECT *
FROM nation
CROSS JOIN region;
SELECT *
FROM nation, region;
如果一个表有25行,另一个表有5行,使用CROSS JOIN
将返回125行结果。
LATERAL
FROM
子句中出现的子查询前面可以有关键字lateral。这允许它们引用前面FROM
提供的列。
SELECT name, x, y
FROM nation,
CROSS JOIN LATERAL (SELECT name || ' :-' AS x),
CROSS JOIN LATERAL (SELECT x || ')' AS y)
如果两个关系有同名列需要使用别名。
SELECT nation.name, region.name
FROM nation
CROSS JOIN region;
SELECT n.name, r.name
FROM nation AS n
CROSS JOIN region AS r;
SELECT n.name, r.name
FROM nation n
CROSS JOIN region r;
子查询是由查询组成的表达式。当子查询引用子查询之外的列时,它是相关的。从逻辑上讲,将为周围查询中的每一行计算子查询。因此,在子查询的任何单个计算期间,引用的列都将是常量。
并非所有标准的子查询都支持。
EXISTS
子查询返回子查询存在的行。
SELECT name
FROM nation
WHERE EXISTS (SELECT * FROM region WHERE region.regionkey = nation.regionkey)
IN
子查询返回子查询中存在的行,子查询必须是单列。
SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)
SELECT name
FROM nation
WHERE regionkey = (SELECT max(regionkey) FROM region)
PREPARE
支持对limit,OFFSET,FETCH FIRST中使用变量。
-- offset limit 样例
PREPARE my_sql FROM
SELECT * FROM item offset ? limit ?;
EXECUTE my_sql USING 1, 3;
-- offset FETCH 样例
PREPARE my_sql FROM
SELECT * FROM item OFFSET ? ROWS FETCH NEXT ? ROWS ONLY;
EXECUTE my_sql USING 1, 3;
支持在create schema
中绑定变量。
PREPARE my_createschema FROM
CREATE SCHEMA hubbleap.tpcds WITH ( LOCATION = ? );
EXECUTE my_createschema USING '/hubble/data/tpcds';
查看表的分区:
SELECT * FROM hubble.default."testfq01$partitions"