DQL

数据查询语言

SELECT

规则

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ OFFSET count [ ROW | ROWS ] ]
[ LIMIT { count | ALL } | FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]

from_item是以下的一种:

table_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]

from_item join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]

join_type包括:

[ INNER ] JOIN
LEFT [ OUTER ] JOIN
RIGHT [ OUTER ] JOIN
FULL [ OUTER ] JOIN
CROSS JOIN

grouping_element包括:

()
expression
GROUPING SETS ( ( column [, ...] ) [, ...] )
CUBE ( column [, ...] )
ROLLUP ( column [, ...] )

描述

查询语法,由一个表或多个表查询数据。

详细解释

WITH子句

用于定义在查询时使用的命名关系,它允许扁平化的嵌套查询或简化子查询。例如,以下查询是等效的:

SELECT a, b
FROM (
  SELECT a, MAX(b) AS b FROM t GROUP BY a
) AS x;

WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;

也可以用于多子查询语句:

WITH
  t1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),
  t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)
SELECT t1.*, t2.*
FROM t1
JOIN t2 ON t1.a = t2.a;

并且可以是一个链式:

WITH
  x AS (SELECT a FROM t),
  y AS (SELECT a AS b FROM x),
  z AS (SELECT b AS c FROM y)
SELECT c FROM z;

GROUP BY子句

GROUP BY将SELECT语句的输出分成包含匹配值的行组,简单的GROUP BY包含由列组成的表达式,也可以按位置选择输出列的序号(从1开始)。

以下查询是等效的,都是按nationkey对查询进行分组,第一个使用序号,第二个使用列名:

SELECT count(*), nationkey FROM customer GROUP BY 2;

SELECT count(*), nationkey FROM customer GROUP BY nationkey;

GROUP BY语句也可以按不在SELECT内的字段进行分组:

SELECT count(*) FROM customer GROUP BY mktsegment;

所有的输出都必须是select中的字段或者是聚合函数。

复杂的GROUP BY语句,支持使用GROUPING SETS,CUBEROLLUP语法,允许对需要的多组列进行聚合分析,复杂分组操作不支持对输入列组成的表达式进行分组,只允许使用列名和序号。

复杂分组操作通常等价于一个联合的所有简单分组的表达式,当然,由于连接源的不同,有些等价表达式可能并不适用。

  • GROUPING SETS

允许指定分组的列的多个列表,不属于分组列的指定子列表的列将被设置为NULL。

SELECT * FROM shipping;
  origin_city | origin_zip | destination_city | destination_zip | package_weight
--------------+------------+------------------+-----------------+-----------------
  beijing     | 100000     | tianjin          | 300000          |             12
  tianjin     | 300010     | shanghai         | 200000          |             60 
  beijing     | 100100     | chongqing        | 400000          |             38
  guangzhou   | 510000     | tianjin          | 300100          |             14
  beijing     | 100012     | shanghai         | 200000          |             78
  beijing     | 100000     | tianjin          | 300010          |             40

(6 rows in set)

分组例子:

SELECT origin_city, origin_zip, destination_city, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_city),
    (origin_city, origin_zip),
    (destination_city));
  origin_city | origin_zip | destination_city | _col3
+-------------+------------+------------------+-------+
  beijing     | NULL       | NULL             |   168
  beijing     | 100100     | NULL             |    38
  tianjin     | 300010     | NULL             |    60
  guangzhou   | NULL       | NULL             |    14
  beijing     | 100012     | NULL             |    78
  beijing     | 100000     | NULL             |    52
  tianjin     | NULL       | NULL             |    60
  guangzhou   | 510000     | NULL             |    14
  NULL        | NULL       | shanghai         |   138
  NULL        | NULL       | chongqing        |    38
  NULL        | NULL       | tianjin          |    66

(11 rows in set)

逻辑上等同于下面的语法:

SELECT origin_city, NULL, NULL, sum(package_weight)
FROM shipping GROUP BY origin_city
UNION ALL
SELECT origin_city, origin_zip, NULL, sum(package_weight)
FROM shipping GROUP BY origin_city, origin_zip
UNION ALL
SELECT NULL, NULL, destination_city, sum(package_weight)
FROM shipping GROUP BY destination_city;

使用UNION ALL将从基础表中读取三次数据,而使用GROUPING SETS只需要读取一次。

  • CUBE

CUBE为给定的列集生成所有可能的分组集(即幂集)。例如,查询:

SELECT origin_city, destination_city, sum(package_weight)
FROM shipping
GROUP BY CUBE (origin_city, destination_city);

等同于:

SELECT origin_city, destination_city, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_city, destination_city),
    (origin_city),
    (destination_city),
    ());

得到结果:

  origin_city | destination_city | _col2 
+-------------+------------------+-------+
  tianjin     | NULL             |    60
  NULL        | chongqing        |    38
  NULL        | NULL             |   242
  guangzhou   | NULL             |    14
  tianjin     | shanghai         |    60
  beijing     | chongqing        |    38
  guangzhou   | tianjin          |    14
  NULL        | shanghai         |   138
  NULL        | tianjin          |    66
  beijing     | shanghai         |    78
  beijing     | tianjin          |    52
  beijing     | NULL             |   168

(12 rows in set)
  • ROLLUP

为给定的列集生成所有可能的分类汇总。例如,查询:

SELECT origin_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ROLLUP (origin_city, origin_zip);

等同于

SELECT origin_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS ((origin_city, origin_zip), (origin_city), ());

得到结果

 origin_city | origin_zip | _col2 
+------------+------------+------+
 beijing     | 100000     |    52
 tianjin     | NULL       |    60
 beijing     | 100100     |    38
 guangzhou   | NULL       |    14
 beijing     | NULL       |   168
 beijing     | 100012     |    78
 NULL        | NULL       |   242
 guangzhou   | 510000     |    14
 tianjin     | 300010     |    60

(9 rows in set)
  • 组合多个grouping语法
SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY
    GROUPING SETS ((origin_city, destination_city)),
    ROLLUP (origin_zip);

可以重写为:

SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY
    GROUPING SETS ((origin_city, destination_city)),
    GROUPING SETS ((origin_zip), ());

等同于:

SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_city, destination_city, origin_zip),
    (origin_city, destination_city));

输出结果:

 origin_city | destination_city | origin_zip | _col3 
+------------+------------------+------------+------+
 beijing     | tianjin          | 100000     |    52
 beijing     | tianjin          | NULL       |    52
 guangzhou   | tianjin          | NULL       |    14
 beijing     | chongqing        | 100100     |    38
 guangzhou   | tianjin          | 510000     |    14
 beijing     | chongqing        | NULL       |    38
 tianjin     | shanghai         | NULL       |    60
 tianjin     | shanghai         | 300010     |    60
 beijing     | shanghai         | NULL       |    78
 beijing     | shanghai         | 100012     |    78

(10 rows in set)

ALLDISTINCT确定重复分组集是否每个都产生不同的输出行。当在同一查询中组合多个复杂分组集时,这尤其有用。例如,以下查询:

SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY ALL
    CUBE (origin_city, destination_city),
    ROLLUP (origin_city, origin_zip);

等同于:

SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_city, destination_city, origin_zip),
    (origin_city, origin_zip),
    (origin_city, destination_city, origin_zip),
    (origin_city, origin_zip),
    (origin_city, destination_city),
    (origin_city),
    (origin_city, destination_city),
    (origin_city),
    (origin_city, destination_city),
    (origin_city),
    (destination_city),
    ());

而如果使用DISTINCT

SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY DISTINCT
    CUBE (origin_city, destination_city),
    ROLLUP (origin_city, origin_zip);

则等同于:

SELECT origin_city, destination_city, origin_zip, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
    (origin_city, destination_city, origin_zip),
    (origin_city, origin_zip),
    (origin_city, destination_city),
    (origin_city),
    (destination_city),
    ());
  • GROUPING操作

分组操作返回转换为十进制的位集,指示分组中存在哪些列。它必须与GROUPING SETS, ROLLUP, CUBEGROUP BY一起使用,其参数必须与相应GROUPING SETS, ROLLUP, CUBEGROUP BY子句中引用的列完全匹配。

为了计算特定行的结果位集,位被分配给参数列,最右边的列是最低有效位。对于给定的分组,如果分组中包含相应的列,则位设置为0,否则设置为1。例如,考虑下面的查询:

SELECT origin_city, origin_zip, destination_city, sum(package_weight),
       grouping(origin_city, origin_zip, destination_city)
FROM shipping
GROUP BY GROUPING SETS (
        (origin_city),
        (origin_city, origin_zip),
        (destination_city));

结果:

 origin_city | origin_zip | destination_city | _col3 | _col4 
+------------+------------+------------------+-------+------+
 beijing     | NULL       | NULL             |   168 |     3
 beijing     | 100000     | NULL             |    52 |     1
 guangzhou   | NULL       | NULL             |    14 |     3
 guangzhou   | 510000     | NULL             |    14 |     1
 tianjin     | NULL       | NULL             |    60 |     3
 beijing     | 100012     | NULL             |    78 |     1
 NULL        | NULL       | tianjin          |    66 |     6
 NULL        | NULL       | chongqing        |    38 |     6
 NULL        | NULL       | shanghai         |   138 |     6
 tianjin     | 300010     | NULL             |    60 |     1
 beijing     | 100100     | NULL             |    38 |     1

(11 rows in set)

HAVING子句

HAVING子句与聚合函数和GROUP BY子句一起使用,以控制选择哪些组。HAVING子句消除不满足给定条件的组。在计算组和聚合之后有过滤器组。

以下示例查询customer表并选择帐户余额大于指定值的组:

SELECT count(*), c_mktsegment, c_nationkey,
       CAST(sum(c_acctbal) AS bigint) AS totalbal
FROM customer
GROUP BY c_mktsegment, c_nationkey
HAVING sum(c_acctbal) > 542666087
ORDER BY totalbal DESC;

结果:

 _col0  | c_mktsegment | c_nationkey | totalbal  
+-------+--------------+-------------+----------+
 120838 | AUTOMOBILE   |          20 | 543268712
 120339 | BUILDING     |          10 | 543056590
 120760 | HOUSEHOLD    |           6 | 543017181

(3 rows in set)

UNION | INTERSECT | EXCEPT子句

UNION INTERSECT EXCEPT都是set操作. 这些子句用于将多个select语句的结果组合到单个结果集中:

query UNION [ALL | DISTINCT] query
query INTERSECT [DISTINCT] query
query EXCEPT [DISTINCT] query

参数ALLDISTINCT控制最终结果集中包含哪些行。如果指定了参数ALL,则即使行相同,也包括所有行。如果指定了参数DISTINCT,则组合结果集中只包含唯一的行。如果两者都未指定,则行为默认为DISTINCTINTERSECTEXCEPT不支持ALL参数。

除非通过括号明确指定顺序,否则将从左到右处理多个集合操作。此外,INTERSECTEXCEPTUNION优先级更高。也就是说,A UNION B INTERSECT C EXCEPT D,等同于A UNION (B INTERSECT C) EXCEPT D

  • UNION

连接操作,返回两个结果合并的集合。

SELECT 13
UNION
SELECT 42;
 _col0 
+-------+
    13
    42

(2 rows in set)
SELECT 13
UNION
SELECT * FROM (VALUES 42, 13);
 _col0 
+-------+
    42
    13
(2 rows in set)
SELECT 13
UNION ALL
SELECT * FROM (VALUES 42, 13);
 _col0 
+-------+
    42
    13
    13

(3 rows in set)

由上面的三个例子可以看出,如果不带ALL实际结果进行了合并去重。

  • INTERSECT

返回两个结果集的交集。也就是说返回两个结果集中都有的结果。

SELECT * FROM (VALUES 13, 42)
INTERSECT
SELECT 13;
 _col0 
+-------+
   13

(1 row in set)
  • EXCEPT

返回第一个集合的相对补集。即在第一个集合中存在,在第二个集合中不存在的值。

SELECT * FROM (VALUES 13, 42)
EXCEPT
SELECT * FROM (VALUES 13, 45);
 _col0 
+-------+
    42

(1 row in set)

ORDER BY子句

ORDER BY子句用于给结果集进行排序。

ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]

ORDER BY子句在GROUP BYHAVING之后,在OFFSET,LIMITFETCH FIRST之前计算。默认NULL排在后边,与升降序无关。

OFFSET子句

OFFSET用于跳过结果集的多个行。

如果存在ORDER BY,则将对排序后的结果集计算,并跳过指定行,结果仍然是排序的。

SELECT n_name FROM nation ORDER BY n_name OFFSET 22;
     n_name     
+----------------+
 UNITED KINGDOM
 UNITED STATES
 VIETNAM

(3 rows in set)

当跳过的行数大于结果集的大小,将返回空。

LIMITFETCH FIRST子句

LIMITFETCH FIRST子句限制结果集返回的行数。

LIMIT { count | ALL }
FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }

下面的示例查询一个大表,但是LIMIT子句限制输出只有五行(因为查询缺少order by,所以返回的行是任意的):

SELECT o_orderdate FROM orders LIMIT 5;
 o_orderdate 
+-------------+
 1994-02-12
 1997-07-11
 1994-06-18
 1992-01-19
 1996-09-03
(5 rows in set)

LIMIT ALL等同于LIMIT

FETCH FIRST子句支持FIRSTNEXT关键字以及ROWROWS关键字。这些关键字是等效的,关键字的选择对查询执行没有影响。

如果在FETCH FIRST子句中未指定计数,则默认为1:

SELECT o_orderdate FROM orders FETCH FIRST ROW ONLY;

 o_orderdate 
+-------------+
 1994-02-12

(1 row in set)

如果存在OFFSET子句,则LIMIT或者FETCH FIRST在其后进行计算。

SELECT * FROM (VALUES 5, 2, 4, 1, 3) t(x) ORDER BY x OFFSET 2 LIMIT 2;
 x 
+---+
 3
 4
(2 rows in set)

对于FETCH FIRST子句,参数ONLYWITH TIES控制结果集中包含哪些行。

如果只指定了参数,则结果集将限制为由计数确定的行数。

如果指定了WITH TIES的参数,则需要存在ORDER BY子句。结果集由同一组的所有行组成,这些行与ORDER BY子句中的排序所建立的最后一个行(“ties”)相同。结果集排序:

SELECT n_name, n_regionkey FROM nation ORDER BY n_regionkey FETCH FIRST ROW WITH TIES;
  n_name   | n_regionkey
+----------+-------------+
 ETHIOPIA  |           0
 MOROCCO   |           0
 KENYA     |           0
 ALGERIA   |           0
 MOZAMBIQUE|           0

(5 rows in set)

TABLESAMPLE

包含的采样方法有:

  • BERNOULLI

选择每一行作为表样本,其概率为样本百分比。当使用BERNOULLI方法对一个表进行采样时,将扫描该表的所有物理块,并跳过某些行(基于采样百分比与运行时计算的随机值之间的比较)。

结果中包含的行的概率与任何其他行无关。这不会减少从磁盘读取采样表所需的时间。如果进一步处理采样输出,可能会影响总查询时间。

  • SYSTEM

此采样方法将表划分为数据的逻辑段,并以此粒度对表进行采样。此采样方法要么选择特定数据段中的所有行,要么跳过它(基于采样百分比与运行时计算的随机值之间的比较)。

系统采样中选择的行将取决于使用的catalog。例如,当与Hubble一起使用时,它取决于数据上的布局方式。这种方法不能保证独立的抽样概率。

这两个方法都不允许返回的行数有确定的界限。

例子:

SELECT *
FROM nation TABLESAMPLE BERNOULLI(25);

SELECT *
FROM nation TABLESAMPLE SYSTEM(75);

与join结合使用:

SELECT o.*, i.*
FROM orders o TABLESAMPLE SYSTEM (10)
JOIN lineitem i TABLESAMPLE BERNOULLI (40)
  ON o.o_orderkey = i.l_orderkey;

UNNEST

UNNEST可用于将ARRAYMAP展开为关系。ARRAY展开为单列,MAP展开为两列(键、值)。UNNEST还可以与多个参数一起使用,在这种情况下,它们将展开为多个列,行数与最高基数参数相同(其他列用空值填充)。UNNEST可以有WITH ORDINALITY子句,在这种情况下,会在末尾添加一个附加的有序列。UNNEST通常与JOIN一起使用,可以引用JOIN左侧关系中的列。

SELECT numbers, animals, n, a
FROM (
  VALUES
    (ARRAY[2, 5], ARRAY['dog', 'cat', 'bird']),
    (ARRAY[7, 8, 9], ARRAY['cow', 'pig'])
) AS x (numbers, animals)
CROSS JOIN UNNEST(numbers, animals) AS t (n, a);
+-----------+------------------+------+------+
|  numbers  |     animals      |  n   |  a   |
+-----------+------------------+------+------+
| [2, 5]    | [dog, cat, bird] |    2 | dog  |
| [2, 5]    | [dog, cat, bird] |    5 | cat  |
| [2, 5]    | [dog, cat, bird] | NULL | bird |
| [7, 8, 9] | [cow, pig]       |    7 | cow  |
| [7, 8, 9] | [cow, pig]       |    8 | pig  |
| [7, 8, 9] | [cow, pig]       |    9 | NULL |
+-----------+------------------+------+------+
(6 rows in set)

有序列:

SELECT numbers, n, a
FROM (
  VALUES
    (ARRAY[2, 5]),
    (ARRAY[7, 8, 9])
) AS x (numbers)
CROSS JOIN UNNEST(numbers) WITH ORDINALITY AS t (n, a);
+-----------+---+---+
|  numbers  | n | a |
+-----------+---+---+
| [2, 5]    | 2 | 1 |
| [2, 5]    | 5 | 2 |
| [7, 8, 9] | 7 | 1 |
| [7, 8, 9] | 8 | 2 |
| [7, 8, 9] | 9 | 3 |
+-----------+---+---+
(5 rows in set)

JOIN

JOIN允许组合来自多个关系的数据。

  • CROSS JOIN

CROSS JOIN返回两个关系的笛卡尔积(所有组合)。

下面两个查询等价:

SELECT *
FROM nation
CROSS JOIN region;

SELECT *
FROM nation, region;

如果一个表有25行,另一个表有5行,使用CROSS JOIN将返回125行结果。

  • LATERAL

FROM子句中出现的子查询前面可以有关键字lateral。这允许它们引用前面FROM提供的列。

SELECT name, x, y
FROM nation,
CROSS JOIN LATERAL (SELECT name || ' :-' AS x),
CROSS JOIN LATERAL (SELECT x || ')' AS y)
  • 别名

如果两个关系有同名列需要使用别名。

SELECT nation.name, region.name
FROM nation
CROSS JOIN region;

SELECT n.name, r.name
FROM nation AS n
CROSS JOIN region AS r;

SELECT n.name, r.name
FROM nation n
CROSS JOIN region r;
  • 子查询

子查询是由查询组成的表达式。当子查询引用子查询之外的列时,它是相关的。从逻辑上讲,将为周围查询中的每一行计算子查询。因此,在子查询的任何单个计算期间,引用的列都将是常量。

并非所有标准的子查询都支持。

  • EXISTS子查询

返回子查询存在的行。

SELECT name
FROM nation
WHERE EXISTS (SELECT * FROM region WHERE region.regionkey = nation.regionkey)
  • IN子查询

返回子查询中存在的行,子查询必须是单列。

SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)
  • 标准子查询
SELECT name
FROM nation
WHERE regionkey = (SELECT max(regionkey) FROM region)

PREPARE

支持对limit,OFFSET,FETCH FIRST中使用变量。

-- offset limit 样例
PREPARE my_sql FROM
SELECT * FROM item offset ? limit ?;
EXECUTE my_sql USING 1, 3;
-- offset FETCH 样例
PREPARE my_sql FROM
SELECT * FROM item OFFSET ? ROWS FETCH NEXT ? ROWS ONLY;
EXECUTE my_sql USING 1, 3;

支持在create schema中绑定变量。

PREPARE my_createschema FROM
CREATE SCHEMA hubbleap.tpcds WITH ( LOCATION = ? );
EXECUTE my_createschema USING '/hubble/data/tpcds';

查看表的分区:

SELECT * FROM hubble.default."testfq01$partitions"