前言

除了上一篇中提到的使用 Shell 处理日志等文本数据,我们在日常工作中更多时候需要处理各种结构化数据。

存储结构化数据的组件通常而言是关系型数据库,各种各样的业务数据和统计数据通常都会存放到关系型数据库中。

SQL 语言作为和关系型数据库交互的语言,是每个后端程序员或者数据分析人员必备的技能。

并且,除了传统关系型数据库(MySQL, PostgreSQL),或者嵌入型数据库(SQLite)等常规意义上的数据库可以使用 SQL 外。例如 ElasticSearch、Druid、Flink 等组件也可以通过 SQL 来交互(参考elasticsearch-sqlDruid 文档Flink文档),HIVE 和 Spark SQL 也同样使用 SQL 作为其交互的语言,像 Python 中 Pandas 等数据处理框架也同样支持 SQL 语法。

数据库所支持的 SQL 为所谓的 ansi SQL (MySQL 所支持的和 ansi 有少量区别,参考文档),而后面举例的一些列组件中所支持的 “SQL” 为借鉴了最初 SQL 的设计理念,针对特定的查询场景所定制的实现。

大部分场景下,我们使用数据库,编写 SQL 就能完成任务。在其他的一些情况下,我们同样可以根据之前编写 SQL 的经验来学习新的组件。所以可以说 SQL 是一名数据分析人员必备的技能。

而在关系型数据库中,使用最多的为 MySQL 。接下来我们以公开数据集为例,总结一下 MySQL 在做数据分析时常用的一些方法和需要注意的地方。

准备

我们基于 MySQL 5.6 版本,以及 MySQL 官方公开的 employees 数据集,来进行接下来的探索。

首先需要安装 MySQL 5.6, Ubuntu 用户参考 https://gist.github.com/Voronenko/31161ab292c7967fcd38c092335a99e1, mac 用户参考dev-on-mac#setup-mysql

在成功安装 MySql 后,需要启动 mysqld 服务,然后从 https://github.com/datacharmer/test_db 上 clone 我们要用到的数据集。

git clone https://github.com/datacharmer/test_db
cd test_db
mysql -uroot < employees.sql

浏览数据

首先打开 mysql client,连接上 mysql server: mysql -uroot

查看数据库

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| employees          |
| mysql              |
| performance_schema |
| test               |
+--------------------+
5 rows in set (0.02 sec)

选择数据库

mysql> use employees

列出所有表

mysql> show tables;
+----------------------+
| Tables_in_employees  |
+----------------------+
| current_dept_emp     |
| departments          |
| dept_emp             |
| dept_emp_latest_date |
| dept_manager         |
| employees            |
| salaries             |
| titles               |
+----------------------+
8 rows in set (0.00 sec)

查看表定义

使用 \G 在 SQL 语句结尾,可以改变结果展示的样式。

mysql> show create table employees\G
*************************** 1. row ***************************
       Table: employees
Create Table: CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

浏览表中部分数据

接下来我们参考一下 employees 表和 salaries 表中的数据。

mysql> select * from employees limit 10;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date  |
+--------+------------+------------+-----------+--------+------------+
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |
|  10002 | 1964-06-02 | Bezalel    | Simmel    | F      | 1985-11-21 |
|  10003 | 1959-12-03 | Parto      | Bamford   | M      | 1986-08-28 |
|  10004 | 1954-05-01 | Chirstian  | Koblick   | M      | 1986-12-01 |
|  10005 | 1955-01-21 | Kyoichi    | Maliniak  | M      | 1989-09-12 |
|  10006 | 1953-04-20 | Anneke     | Preusig   | F      | 1989-06-02 |
|  10007 | 1957-05-23 | Tzvetan    | Zielinski | F      | 1989-02-10 |
|  10008 | 1958-02-19 | Saniya     | Kalloufi  | M      | 1994-09-15 |
|  10009 | 1952-04-19 | Sumant     | Peac      | F      | 1985-02-18 |
|  10010 | 1963-06-01 | Duangkaew  | Piveteau  | F      | 1989-08-24 |
+--------+------------+------------+-----------+--------+------------+
10 rows in set (0.01 sec)
mysql> select * from salaries limit 10;
+--------+--------+------------+------------+
| emp_no | salary | from_date  | to_date    |
+--------+--------+------------+------------+
|  10001 |  60117 | 1986-06-26 | 1987-06-26 |
|  10001 |  62102 | 1987-06-26 | 1988-06-25 |
|  10001 |  66074 | 1988-06-25 | 1989-06-25 |
|  10001 |  66596 | 1989-06-25 | 1990-06-25 |
|  10001 |  66961 | 1990-06-25 | 1991-06-25 |
|  10001 |  71046 | 1991-06-25 | 1992-06-24 |
|  10001 |  74333 | 1992-06-24 | 1993-06-24 |
|  10001 |  75286 | 1993-06-24 | 1994-06-24 |
|  10001 |  75994 | 1994-06-24 | 1995-06-24 |
|  10001 |  76884 | 1995-06-24 | 1996-06-23 |
+--------+--------+------------+------------+
10 rows in set (0.00 sec)

查询

查看薪水表中最近记录中的十个时间

mysql> select from_date from salaries group by from_date order by from_date desc limit 10;
+------------+
| from_date  |
+------------+
| 2002-08-01 |
| 2002-07-31 |
| 2002-07-30 |
| 2002-07-29 |
| 2002-07-28 |
| 2002-07-27 |
| 2002-07-26 |
| 2002-07-25 |
| 2002-07-24 |
| 2002-07-23 |
+------------+
10 rows in set (1.37 sec)

查看最近的时间段的薪水排行

通过上面的查询,我们可以看到数据库中 salaries 表的最新年份为 2002 年,我们可以查看一下 2002 年最高的 10 个薪水数据。

mysql> select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 10;
+--------+--------+------------+------------+
| emp_no | salary | from_date  | to_date    |
+--------+--------+------------+------------+
|  43624 | 158220 | 2002-03-22 | 9999-01-01 |
|  47978 | 155709 | 2002-07-14 | 9999-01-01 |
| 253939 | 155513 | 2002-04-11 | 9999-01-01 |
| 109334 | 155190 | 2002-02-11 | 9999-01-01 |
|  80823 | 154459 | 2002-02-22 | 9999-01-01 |
| 493158 | 154376 | 2002-05-05 | 9999-01-01 |
| 237542 | 152687 | 2002-04-08 | 9999-01-01 |
| 279776 | 150740 | 2002-06-06 | 9999-01-01 |
|  46439 | 150345 | 2002-05-15 | 9999-01-01 |
|  66793 | 150052 | 2002-06-16 | 9999-01-01 |
+--------+--------+------------+------------+
10 rows in set (0.83 sec)

上面用到了 YEAR() 函数,在编写 SQL 语句的过程中,我们会经常和 MySQL 的 Datetime 和 Date 两种时间相关的数据类型打交道,需要常常翻阅 MySQL 的文档–Date and Time Functions

如上面的 YEAR() 函数会返回 date 类型中年份的数据,和 2002 比较即可筛选出 2002 年的数据。

查看2002年薪水最多的5位员工

mysql> select * FROM (select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 5) AS A left join employees AS B on A.emp_no = B.emp_no\G
*************************** 1. row ***************************
    emp_no: 43624
    salary: 158220
 from_date: 2002-03-22
   to_date: 9999-01-01
    emp_no: 43624
birth_date: 1953-11-14
first_name: Tokuyasu
 last_name: Pesch
    gender: M
 hire_date: 1985-03-26
*************************** 2. row ***************************
    emp_no: 47978
    salary: 155709
 from_date: 2002-07-14
   to_date: 9999-01-01
    emp_no: 47978
birth_date: 1956-03-24
first_name: Xiahua
 last_name: Whitcomb
    gender: M
 hire_date: 1985-07-18
*************************** 3. row ***************************
    emp_no: 253939
    salary: 155513
 from_date: 2002-04-11
   to_date: 9999-01-01
    emp_no: 253939
birth_date: 1957-12-03
first_name: Sanjai
 last_name: Luders
    gender: M
 hire_date: 1987-04-15
*************************** 4. row ***************************
    emp_no: 109334
    salary: 155190
 from_date: 2002-02-11
   to_date: 9999-01-01
    emp_no: 109334
birth_date: 1955-08-02
first_name: Tsutomu
 last_name: Alameldin
    gender: M
 hire_date: 1985-02-15
*************************** 5. row ***************************
    emp_no: 80823
    salary: 154459
 from_date: 2002-02-22
   to_date: 9999-01-01
    emp_no: 80823
birth_date: 1963-01-21
first_name: Willard
 last_name: Baca
    gender: M
 hire_date: 1985-02-26
5 rows in set (0.82 sec)

因为我们需要 employees 表中的 first_name 和 last_name 数据,所以我们需要使用 left join 或者 right join,将 salaries 表中 emp_no 与 employees 中 emp_no 相同的行拼接。

关于 SQL join 有一篇通过可视化帮助理解的文章非常值得一读https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

join

在 “查看2002年薪水最多的5位员工” 的场景下,我们先选择出了 2002 年薪水最多的10条数据,而 employees 表为一个大表。所以我们应该把 employees 放在右边然后使用 left join,避免将 employees 大量无关的 emp_no 选择出来。

使用 EXPLAIN 可以看到不同 join 写法下 mysql 执行的查询操作不尽相同。

mysql> EXPLAIN select * from employees AS A left join (select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 10) AS B on A.emp_no = B.emp_no where B.emp_no is not NULL;
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref      | rows    | Extra                       |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL    | NULL    | NULL     |      10 | Using where                 |
|  1 | PRIMARY     | A          | eq_ref | PRIMARY       | PRIMARY | 4       | B.emp_no |       1 | NULL                        |
|  2 | DERIVED     | salaries   | ALL    | NULL          | NULL    | NULL    | NULL     | 2838426 | Using where; Using filesort |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
3 rows in set (0.01 sec)
mysql> EXPLAIN select * FROM (select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 10) AS A left join employees AS B on A.emp_no = B.emp_no;
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref      | rows    | Extra                       |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL    | NULL    | NULL     |      10 | NULL                        |
|  1 | PRIMARY     | B          | eq_ref | PRIMARY       | PRIMARY | 4       | A.emp_no |       1 | NULL                        |
|  2 | DERIVED     | salaries   | ALL    | NULL          | NULL    | NULL    | NULL     | 2838426 | Using where; Using filesort |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
3 rows in set (0.00 sec)

统计

统计2002年各部门雇佣人数排名

mysql> select * from (select dept_no,count(distinct emp_no) as cnt from dept_emp where YEAR(to_date) >= 2002 group by dept_no order by cnt desc limit 10) B left join departments C on B.dept_no = C.dept_no;
+---------+-------+---------+--------------------+
| dept_no | cnt   | dept_no | dept_name          |
+---------+-------+---------+--------------------+
| d005    | 62962 | d005    | Development        |
| d004    | 54577 | d004    | Production         |
| d007    | 38653 | d007    | Sales              |
| d009    | 17982 | d009    | Customer Service   |
| d008    | 15814 | d008    | Research           |
| d001    | 15154 | d001    | Marketing          |
| d006    | 14924 | d006    | Quality Management |
| d003    | 13209 | d003    | Human Resources    |
| d002    | 12726 | d002    | Finance            |
+---------+-------+---------+--------------------+
9 rows in set (0.40 sec)

统计2002年各部门的平均薪水

SELECT avg(A.salary) as avg_salary, B.dept_no ,C.dept_name FROM
salaries AS A  LEFT JOIN  dept_emp AS B ON A.emp_no = B.emp_no LEFT JOIN departments as C ON B.dept_no = C.dept_no
WHERE YEAR(A.from_date) >= 2002
GROUP BY dept_no ORDER BY avg_salary desc limit 10;
mysql> SELECT avg(A.salary) as avg_salary, B.dept_no ,C.dept_name FROM
    -> salaries AS A  LEFT JOIN  dept_emp AS B ON A.emp_no = B.emp_no LEFT JOIN departments as C ON B.dept_no = C.dept_no
    -> WHERE YEAR(A.from_date) >= 2002
    -> GROUP BY dept_no ORDER BY avg_salary desc limit 10;
+------------+---------+--------------------+
| avg_salary | dept_no | dept_name          |
+------------+---------+--------------------+
| 89480.3720 | d007    | Sales              |
| 80512.3990 | d001    | Marketing          |
| 79257.8962 | d002    | Finance            |
| 68792.2218 | d008    | Research           |
| 68434.1278 | d004    | Production         |
| 68335.6932 | d005    | Development        |
| 67657.3375 | d009    | Customer Service   |
| 66012.1669 | d006    | Quality Management |
| 64485.9964 | d003    | Human Resources    |
+------------+---------+--------------------+
9 rows in set (0.98 sec)

所有记录中部门的平均薪水状况

mysql> SELECT avg(A.salary) as avg_salary, B.dept_no ,C.dept_name FROM
    -> salaries AS A  LEFT JOIN  dept_emp AS B ON A.emp_no = B.emp_no LEFT JOIN departments as C ON B.dept_no = C.dept_no
    -> GROUP BY dept_no ORDER BY avg_salary desc limit 10;
+------------+---------+--------------------+
| avg_salary | dept_no | dept_name          |
+------------+---------+--------------------+
| 80667.6058 | d007    | Sales              |
| 71913.2000 | d001    | Marketing          |
| 70489.3649 | d002    | Finance            |
| 59665.1817 | d008    | Research           |
| 59605.4825 | d004    | Production         |
| 59478.9012 | d005    | Development        |
| 58770.3665 | d009    | Customer Service   |
| 57251.2719 | d006    | Quality Management |
| 55574.8794 | d003    | Human Resources    |
+------------+---------+--------------------+
9 rows in set (6.39 sec)

参考阅读

https://www.kaggle.com/dimarudov/data-analysis-using-sql

https://dev.mysql.com/doc/refman/5.6/en/