The Seven Weapons of Data Analysis: Hive
Setting Up the Lab Environment
Lab environment: an Alibaba Cloud server running Ubuntu 16.04
# lsb_release -a
LSB Version: core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial
We install Hive via Docker. First install Docker CE, using Alibaba Cloud's registry mirror service to speed up image pulls.
sudo apt-get update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"
sudo apt-get update
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo mkdir -p /etc/docker
# Log in to the Aliyun console to copy your mirror accelerator URL.
sudo tee /etc/docker/daemon.json <<-'EOF'
{
"registry-mirrors": ["https://xxx.mirror.aliyuncs.com"]
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
# docker pull parrotstream/hive
docker pull bde2020/hive:2.3.2-postgresql-metastore
Clone the Docker Compose configuration for Hive and start it:
# git clone https://github.com/parrot-stream/docker-hive.git
git clone https://github.com/big-data-europe/docker-hive
pip install docker-compose
sudo python -m easy_install --upgrade pyOpenSSL
cd docker-hive
docker network create parrot_default
docker-compose up -d # Hive needs PostgreSQL to store its metadata
Use docker-compose to log in to hive-server and run a statement as a test (on Windows, invoke it from PowerShell):
# docker-compose exec hive-server bash
# /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name |
+----------------+
| default |
+----------------+
Preparing the Data
Use axel to download the Hacker News comment dataset and decompress it:
axel -n 10 https://files.pushshift.io/hackernews/HNI_2018-05.bz2
bzip2 -dk HNI_2018-05.bz2
Inspect the file's format with head:
head -n 1 HNI_2018-05
{"by":"burntwater","id":16963601,"parent":16962078,"retrieved_on":1528401048,"text":"The New York State fee by itself is $200. Unless I'm missing something, you paid significantly more than that for the publication requirement (it was around $1,000, give or take a couple hundred, last I checked).","time":1525132803,"type":"comment"}
Because the file is in JSON format, convert it to CSV with the jq tool before importing it into Hive.
jq -r '[ .by, (.id|tostring), (.parent|tostring), (.retrieved_on|tostring), .text, .type, (.time|tostring) , (.time| strftime("%Y-%m-%d"))] | join("||") ' HNI_2018-05 > HNI_2018-05.csv
This command converts the JSON into the format shown below. Hive loads data as plain text, so you must choose a field delimiter, preferably one that rarely appears in the data itself.
Hive tables are also usually partitioned, which requires a date string; here the original unix timestamp is formatted into a dt field.
Caprinicus||16963610||16962203||1528401049||And that they grew to this point before imposing this "curation". It seems to be a recurring theme with these companies to grow to a monopoly with a place for relatively free discourse and then to do this as soon as they feel safe.||comment||1525132915||2018-05-01
This produces the file HNI_2018-05.csv.
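The same transformation can be sketched in Python without jq. This is a rough equivalent, not the author's pipeline: field order and the `||` delimiter mirror the jq filter above, and UTC is assumed for the date (jq's strftime also works in UTC).

```python
import json
from datetime import datetime, timezone

def hn_json_to_row(line: str) -> str:
    """Convert one Hacker News JSON record into a ||-delimited row,
    in the same field order as the jq filter above, with a trailing
    dt column formatted from the unix timestamp (UTC)."""
    rec = json.loads(line)
    ts = rec["time"]
    dt = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    fields = [
        rec.get("by", ""),
        str(rec["id"]),
        str(rec["parent"]),
        str(rec["retrieved_on"]),
        rec.get("text", ""),
        rec["type"],
        str(ts),
        dt,
    ]
    return "||".join(fields)
```

In practice jq is far faster for a multi-gigabyte dump; the sketch is only meant to make the field mapping explicit.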
Once the file is generated, create a shared directory and mount it into the hive-server container.
Edit the docker-compose.yml file under docker-hive and add a volume to hive-server that maps the local data directory:
  hive-server:
    image: bde2020/hive:2.3.2-postgresql-metastore
    env_file:
      - ./hadoop-hive.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    ports:
      - "10000:10000"
    volumes:
      - ./data:/data
Copy the CSV data into the shared directory and restart Hive:
docker-hive# cp ../HNI_2018-05.csv data/
docker-compose stop
docker-compose up -d
Create the corresponding Hive table:
hive> CREATE TABLE HN_COMMENTS (comment_by STRING, comment_id INT, parent INT,
retrieved_on INT, text STRING, type STRING,
time INT,dt STRING)
row format delimited fields terminated by '||';
Load the data from the CSV into Hive:
LOAD DATA LOCAL INPATH '/data/HNI_2018-05.csv' OVERWRITE INTO TABLE HN_COMMENTS;
Check that the import succeeded:
> select text from HN_COMMENTS limit 1;
Create a partitioned table:
CREATE TABLE HN_COMMENTS_DT (comment_by STRING, comment_id INT, parent INT,
retrieved_on INT, text STRING, type STRING,
time INT) partitioned by (dt STRING);
Enable dynamic partitioning (hive.exec.dynamic.partition defaults to true in recent Hive versions; it is set explicitly here for safety):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Load the original table into the partitioned table, partition by partition:
insert into table HN_COMMENTS_DT partition(dt)
select * from HN_COMMENTS;
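Conceptually, dynamic partitioning routes each row into a partition keyed by its dt value, which can be illustrated by bucketing rows in Python (an illustration of the idea only, not how Hive implements it):

```python
from collections import defaultdict

def partition_by_dt(rows):
    """Group rows (tuples whose last element is the dt string) into
    per-partition buckets, the way dynamic partitioning routes rows
    by the trailing partition column."""
    parts = defaultdict(list)
    for row in rows:
        *cols, dt = row       # dt is the last column, as in the INSERT above
        parts[dt].append(tuple(cols))
    return dict(parts)
```

This is why the partition column must come last in the SELECT: Hive takes the trailing column(s) as the partition key.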
Common Statements
List the available functions:
show functions;
For example, convert a unix timestamp to a date string:
SELECT from_unixtime(1525350800)
+----------------------+
| _c0 |
+----------------------+
| 2018-05-03 12:33:20 |
+----------------------+
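The Python equivalent of from_unixtime is straightforward. Note that Hive's from_unixtime formats in the server's local time zone; the sketch below assumes UTC:

```python
from datetime import datetime, timezone

def from_unixtime(ts: int) -> str:
    # Hive's from_unixtime uses the session time zone; UTC is assumed here.
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(from_unixtime(1525350800))  # 2018-05-03 12:33:20
```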
Show a table's definition:
desc hn_comments;
Show the partitions:
show partitions HN_COMMENTS_DT;
+----------------------------------------------------+
| partition |
+----------------------------------------------------+
| dt=2018-05-01 |
| dt=2018-05-06 |
| dt=2018-05-08 |
| dt=2018-05-09 |
| dt=2018-05-11 |
| dt=2018-05-12 |
| dt=2018-05-15 |
| dt=2018-05-16 |
| dt=2018-05-23 |
| dt=2018-05-24 |
| dt=2018-05-29 |
+----------------------------------------------------+
Hive tables store large volumes of data and are usually partitioned by time (hour, day, week, and so on). Before querying a Hive table, check its partitions with this command and filter by partition in your queries to avoid wasted computation.
Starting the Analysis
Find the users with the most comments:
> SELECT comment_by, COUNT(*) as cnt
FROM HN_COMMENTS WHERE comment_by!=""
GROUP BY comment_by
order BY cnt desc limit 10;
+----------------+-------+
| comment_by | cnt |
+----------------+-------+
| </code></pre> | 1706 |
| jacquesm | 798 |
| rbanffy | 751 |
| icebraining | 549 |
| dragonwriter | 546 |
| pjmlp | 437 |
| dang | 425 |
| gowld | 423 |
| s73v3r_ | 420 |
| TeMPOraL | 405 |
+----------------+-------+
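The GROUP BY / ORDER BY / LIMIT pattern above maps directly onto a Counter in Python (a sketch over in-memory values; in Hive the aggregation runs distributed):

```python
from collections import Counter

def top_commenters(comment_by_values, n=10):
    """Count comments per author, skipping empty author names, and
    return the n most frequent: the Counter analogue of
    GROUP BY comment_by ORDER BY cnt DESC LIMIT n."""
    counts = Counter(by for by in comment_by_values if by != "")
    return counts.most_common(n)
```

The stray `</code></pre>` entry in the output above suggests some comment text leaked into the comment_by column during the text import, which is the kind of artifact the delimiter warning earlier is about.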
References
https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb