数据分析的七种武器-sql
前言
除了上一篇中提到的使用 Shell 处理日志等文本数据,我们在日常工作中更多时候需要处理各种结构化数据。
存储结构化数据的组件通常而言是关系型数据库,各种各样的业务数据和统计数据通常都会存放到关系型数据库中。
SQL 语言作为和关系型数据库交互的语言,是每个后端程序员或者数据分析人员必备的技能。
并且,除了传统关系型数据库(MySQL, PostgreSQL),或者嵌入型数据库(SQLite)等常规意义上的数据库可以使用 SQL 外。例如 ElasticSearch、Druid、Flink 等组件也可以通过 SQL 来交互(参考elasticsearch-sql、 Druid 文档、Flink文档),HIVE 和 Spark SQL 也同样使用 SQL 作为其交互的语言,像 Python 中 Pandas 等数据处理框架也同样支持 SQL 语法。
数据库所支持的 SQL 为所谓的 ansi SQL (MySQL 所支持的和 ansi 有少量区别,参考文档),而后面举例的一些列组件中所支持的 “SQL” 为借鉴了最初 SQL 的设计理念,针对特定的查询场景所定制的实现。
大部分场景下,我们使用数据库,编写 SQL 就能完成任务。在其他的一些情况下,我们同样可以根据之前编写 SQL 的经验来学习新的组件。所以可以说 SQL 是一名数据分析人员必备的技能。
而在关系型数据库中,使用最多的为 MySQL 。接下来我们以公开数据集为例,总结一下 MySQL 在做数据分析时常用的一些方法和需要注意的地方。
准备
我们基于 MySQL 5.6 版本,以及 MySQL 官方公开的 employees 数据集,来进行接下来的探索。
首先需要安装 MySQL 5.6, Ubuntu 用户参考 https://gist.github.com/Voronenko/31161ab292c7967fcd38c092335a99e1, mac 用户参考dev-on-mac#setup-mysql。
在成功安装 MySql 后,需要启动 mysqld 服务,然后从 https://github.com/datacharmer/test_db 上 clone 我们要用到的数据集。
git clone https://github.com/datacharmer/test_db
cd test_db
mysql -uroot < employees.sql
浏览数据
首先打开 mysql client,连接上 mysql server: mysql -uroot
查看数据库
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| employees |
| mysql |
| performance_schema |
| test |
+--------------------+
5 rows in set (0.02 sec)
选择数据库
mysql> use employees
列出所有表
mysql> show tables;
+----------------------+
| Tables_in_employees |
+----------------------+
| current_dept_emp |
| departments |
| dept_emp |
| dept_emp_latest_date |
| dept_manager |
| employees |
| salaries |
| titles |
+----------------------+
8 rows in set (0.00 sec)
查看表定义
使用 \G
在 SQL 语句结尾,可以改变结果展示的样式。
mysql> show create table employees\G
*************************** 1. row ***************************
Table: employees
Create Table: CREATE TABLE `employees` (
`emp_no` int(11) NOT NULL,
`birth_date` date NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` date NOT NULL,
PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
浏览表中部分数据
接下来我们参考一下 employees 表和 salaries 表中的数据。
mysql> select * from employees limit 10;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 |
| 10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21 |
| 10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28 |
| 10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01 |
| 10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 |
| 10006 | 1953-04-20 | Anneke | Preusig | F | 1989-06-02 |
| 10007 | 1957-05-23 | Tzvetan | Zielinski | F | 1989-02-10 |
| 10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15 |
| 10009 | 1952-04-19 | Sumant | Peac | F | 1985-02-18 |
| 10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24 |
+--------+------------+------------+-----------+--------+------------+
10 rows in set (0.01 sec)
mysql> select * from salaries limit 10;
+--------+--------+------------+------------+
| emp_no | salary | from_date | to_date |
+--------+--------+------------+------------+
| 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 10001 | 66074 | 1988-06-25 | 1989-06-25 |
| 10001 | 66596 | 1989-06-25 | 1990-06-25 |
| 10001 | 66961 | 1990-06-25 | 1991-06-25 |
| 10001 | 71046 | 1991-06-25 | 1992-06-24 |
| 10001 | 74333 | 1992-06-24 | 1993-06-24 |
| 10001 | 75286 | 1993-06-24 | 1994-06-24 |
| 10001 | 75994 | 1994-06-24 | 1995-06-24 |
| 10001 | 76884 | 1995-06-24 | 1996-06-23 |
+--------+--------+------------+------------+
10 rows in set (0.00 sec)
查询
查看薪水表中最近记录中的十个时间
mysql> select from_date from salaries group by from_date order by from_date desc limit 10;
+------------+
| from_date |
+------------+
| 2002-08-01 |
| 2002-07-31 |
| 2002-07-30 |
| 2002-07-29 |
| 2002-07-28 |
| 2002-07-27 |
| 2002-07-26 |
| 2002-07-25 |
| 2002-07-24 |
| 2002-07-23 |
+------------+
10 rows in set (1.37 sec)
查看最近的时间段的薪水排行
通过上面的查询,我们可以看到数据库中 salaries 表的最新年份为 2002 年,我们可以查看一下 2002 年最高的 10 个薪水数据。
mysql> select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 10;
+--------+--------+------------+------------+
| emp_no | salary | from_date | to_date |
+--------+--------+------------+------------+
| 43624 | 158220 | 2002-03-22 | 9999-01-01 |
| 47978 | 155709 | 2002-07-14 | 9999-01-01 |
| 253939 | 155513 | 2002-04-11 | 9999-01-01 |
| 109334 | 155190 | 2002-02-11 | 9999-01-01 |
| 80823 | 154459 | 2002-02-22 | 9999-01-01 |
| 493158 | 154376 | 2002-05-05 | 9999-01-01 |
| 237542 | 152687 | 2002-04-08 | 9999-01-01 |
| 279776 | 150740 | 2002-06-06 | 9999-01-01 |
| 46439 | 150345 | 2002-05-15 | 9999-01-01 |
| 66793 | 150052 | 2002-06-16 | 9999-01-01 |
+--------+--------+------------+------------+
10 rows in set (0.83 sec)
上面用到了 YEAR()
函数,在编写 SQL 语句的过程中,我们会经常和 MySQL 的 Datetime 和 Date 两种时间相关的数据类型打交道,需要常常翻阅 MySQL 的文档–Date and Time Functions。
如上面的 YEAR() 函数会返回 date 类型中年份的数据,和 2002 比较即可筛选出 2002 年的数据。
查看2002年薪水最多的5位员工
mysql> select * FROM (select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 5) AS A left join employees AS B on A.emp_no = B.emp_no\G
*************************** 1. row ***************************
emp_no: 43624
salary: 158220
from_date: 2002-03-22
to_date: 9999-01-01
emp_no: 43624
birth_date: 1953-11-14
first_name: Tokuyasu
last_name: Pesch
gender: M
hire_date: 1985-03-26
*************************** 2. row ***************************
emp_no: 47978
salary: 155709
from_date: 2002-07-14
to_date: 9999-01-01
emp_no: 47978
birth_date: 1956-03-24
first_name: Xiahua
last_name: Whitcomb
gender: M
hire_date: 1985-07-18
*************************** 3. row ***************************
emp_no: 253939
salary: 155513
from_date: 2002-04-11
to_date: 9999-01-01
emp_no: 253939
birth_date: 1957-12-03
first_name: Sanjai
last_name: Luders
gender: M
hire_date: 1987-04-15
*************************** 4. row ***************************
emp_no: 109334
salary: 155190
from_date: 2002-02-11
to_date: 9999-01-01
emp_no: 109334
birth_date: 1955-08-02
first_name: Tsutomu
last_name: Alameldin
gender: M
hire_date: 1985-02-15
*************************** 5. row ***************************
emp_no: 80823
salary: 154459
from_date: 2002-02-22
to_date: 9999-01-01
emp_no: 80823
birth_date: 1963-01-21
first_name: Willard
last_name: Baca
gender: M
hire_date: 1985-02-26
5 rows in set (0.82 sec)
因为我们需要 employees 表中的 first_name 和 last_name 数据,所以我们需要使用 left join 或者 right join,将 salaries 表中 emp_no 与 employees 中 emp_no 相同的行拼接。
关于 SQL join 有一篇通过可视化帮助理解的文章非常值得一读https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
在 “查看2002年薪水最多的5位员工” 的场景下,我们先选择出了 2002 年薪水最多的10条数据,而 employees 表为一个大表。所以我们应该把 employees 放在右边然后使用 left join,避免将 employees 大量无关的 emp_no 选择出来。
使用 EXPLAIN 可以看到不同 join 写法下 mysql 执行的查询操作不尽相同。
mysql> EXPLAIN select * from employees AS A left join (select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 10) AS B on A.emp_no = B.emp_no where B.emp_no is not NULL;
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10 | Using where |
| 1 | PRIMARY | A | eq_ref | PRIMARY | PRIMARY | 4 | B.emp_no | 1 | NULL |
| 2 | DERIVED | salaries | ALL | NULL | NULL | NULL | NULL | 2838426 | Using where; Using filesort |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
3 rows in set (0.01 sec)
mysql> EXPLAIN select * FROM (select * from salaries where YEAR(from_date) >= 2002 order by salary desc limit 10) AS A left join employees AS B on A.emp_no = B.emp_no;
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10 | NULL |
| 1 | PRIMARY | B | eq_ref | PRIMARY | PRIMARY | 4 | A.emp_no | 1 | NULL |
| 2 | DERIVED | salaries | ALL | NULL | NULL | NULL | NULL | 2838426 | Using where; Using filesort |
+----+-------------+------------+--------+---------------+---------+---------+----------+---------+-----------------------------+
3 rows in set (0.00 sec)
统计
统计2002年各部门雇佣人数排名
mysql> select * from (select dept_no,count(distinct emp_no) as cnt from dept_emp where YEAR(to_date) >= 2002 group by dept_no order by cnt desc limit 10) B left join departments C on B.dept_no = C.dept_no;
+---------+-------+---------+--------------------+
| dept_no | cnt | dept_no | dept_name |
+---------+-------+---------+--------------------+
| d005 | 62962 | d005 | Development |
| d004 | 54577 | d004 | Production |
| d007 | 38653 | d007 | Sales |
| d009 | 17982 | d009 | Customer Service |
| d008 | 15814 | d008 | Research |
| d001 | 15154 | d001 | Marketing |
| d006 | 14924 | d006 | Quality Management |
| d003 | 13209 | d003 | Human Resources |
| d002 | 12726 | d002 | Finance |
+---------+-------+---------+--------------------+
9 rows in set (0.40 sec)
统计2002年各部门的平均薪水
SELECT avg(A.salary) as avg_salary, B.dept_no ,C.dept_name FROM
salaries AS A LEFT JOIN dept_emp AS B ON A.emp_no = B.emp_no LEFT JOIN departments as C ON B.dept_no = C.dept_no
WHERE YEAR(A.from_date) >= 2002
GROUP BY dept_no ORDER BY avg_salary desc limit 10;
mysql> SELECT avg(A.salary) as avg_salary, B.dept_no ,C.dept_name FROM
-> salaries AS A LEFT JOIN dept_emp AS B ON A.emp_no = B.emp_no LEFT JOIN departments as C ON B.dept_no = C.dept_no
-> WHERE YEAR(A.from_date) >= 2002
-> GROUP BY dept_no ORDER BY avg_salary desc limit 10;
+------------+---------+--------------------+
| avg_salary | dept_no | dept_name |
+------------+---------+--------------------+
| 89480.3720 | d007 | Sales |
| 80512.3990 | d001 | Marketing |
| 79257.8962 | d002 | Finance |
| 68792.2218 | d008 | Research |
| 68434.1278 | d004 | Production |
| 68335.6932 | d005 | Development |
| 67657.3375 | d009 | Customer Service |
| 66012.1669 | d006 | Quality Management |
| 64485.9964 | d003 | Human Resources |
+------------+---------+--------------------+
9 rows in set (0.98 sec)
所有记录中部门的平均薪水状况
mysql> SELECT avg(A.salary) as avg_salary, B.dept_no ,C.dept_name FROM
-> salaries AS A LEFT JOIN dept_emp AS B ON A.emp_no = B.emp_no LEFT JOIN departments as C ON B.dept_no = C.dept_no
-> GROUP BY dept_no ORDER BY avg_salary desc limit 10;
+------------+---------+--------------------+
| avg_salary | dept_no | dept_name |
+------------+---------+--------------------+
| 80667.6058 | d007 | Sales |
| 71913.2000 | d001 | Marketing |
| 70489.3649 | d002 | Finance |
| 59665.1817 | d008 | Research |
| 59605.4825 | d004 | Production |
| 59478.9012 | d005 | Development |
| 58770.3665 | d009 | Customer Service |
| 57251.2719 | d006 | Quality Management |
| 55574.8794 | d003 | Human Resources |
+------------+---------+--------------------+
9 rows in set (6.39 sec)
参考阅读
https://www.kaggle.com/dimarudov/data-analysis-using-sql
https://dev.mysql.com/doc/refman/5.6/en/