Setting Up the Experiment Environment

Experiment environment: an Alibaba Cloud server running Ubuntu 16.04.

# lsb_release -a
LSB Version:    core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.4 LTS
Release:        16.04
Codename:       xenial

Hive will be installed via Docker. First install docker-ce, using Alibaba Cloud's registry mirror service to speed up image pulls.

sudo apt-get update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Use this release's codename ($(lsb_release -cs) resolves to xenial on 16.04).
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo mkdir -p /etc/docker
# Log in to the Alibaba Cloud console to get your personal mirror accelerator URL.
sudo tee /etc/docker/daemon.json <<-'EOF'
{
  "registry-mirrors": ["https://xxx.mirror.aliyuncs.com"] 
}
EOF

sudo systemctl daemon-reload
sudo systemctl restart docker
# docker pull parrotstream/hive
docker pull bde2020/hive:2.3.2-postgresql-metastore

Pull the Docker Compose configuration for Hive and start it.

# git clone https://github.com/parrot-stream/docker-hive.git
git clone https://github.com/big-data-europe/docker-hive
pip install docker-compose
sudo python -m easy_install --upgrade pyOpenSSL
cd docker-hive
# docker network create parrot_default  # only needed for the parrot-stream setup
docker-compose up -d # Hive needs PostgreSQL to store its metadata

Use docker-compose to log in to the hive-server container and run a statement as a smoke test (on Windows this needs to be invoked from PowerShell).

# docker-compose exec hive-server bash
# /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+

Preparing the Data

Use axel to download the Hacker News comment dataset for testing, then decompress it.

axel -n 10 https://files.pushshift.io/hackernews/HNI_2018-05.bz2
bzip2 -dk HNI_2018-05.bz2

Use head to inspect the file format.

head -n 1 HNI_2018-05
{"by":"burntwater","id":16963601,"parent":16962078,"retrieved_on":1528401048,"text":"The New York State fee by itself is $200. Unless I&#x27;m missing something, you paid significantly more than that for the publication requirement (it was around $1,000, give or take a couple hundred, last I checked).","time":1525132803,"type":"comment"}

Since the file is in JSON format, use jq to convert it to a delimited text ("csv") format before importing it into Hive.

jq -r  '[ .by, (.id|tostring), (.parent|tostring), (.retrieved_on|tostring), .text, .type, (.time|tostring) , (.time| strftime("%Y-%m-%d"))] | join("||") ' HNI_2018-05 > HNI_2018-05.csv

This command converts the JSON into the format shown below. Because Hive imports data as plain text, a field delimiter must be chosen, ideally one that rarely appears in the data itself.

Hive tables are also usually partitioned, which requires a date string; here the original unix timestamp is formatted into a dt field.

Caprinicus||16963610||16962203||1528401049||And that they grew to this point before imposing this &quot;curation&quot;. It seems to be a recurring theme with these companies to grow to a monopoly with a place for relatively free discourse and then to do this as soon as they feel safe.||comment||1525132915||2018-05-01

This produces the file HNI_2018-05.csv.
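The jq conversion above can be mirrored in Python, which is handy for sanity-checking individual records before a full run. A minimal sketch (to_row is a hypothetical helper; gmtime matches jq's strftime, which formats in UTC):

```python
import json
import time

def to_row(line, sep="||"):
    """Mirror the jq command above: by, id, parent, retrieved_on,
    text, type, time, plus a dt date derived from the unix timestamp."""
    c = json.loads(line)
    dt = time.strftime("%Y-%m-%d", time.gmtime(c["time"]))  # UTC, like jq's strftime
    fields = [c.get("by", ""), str(c["id"]), str(c.get("parent", "")),
              str(c["retrieved_on"]), c.get("text", ""), c["type"],
              str(c["time"]), dt]
    return sep.join(fields)

# The sample record shown earlier, with the text shortened.
sample = ('{"by":"burntwater","id":16963601,"parent":16962078,'
          '"retrieved_on":1528401048,"text":"...","time":1525132803,'
          '"type":"comment"}')
print(to_row(sample))
```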

Once the file is generated, create a shared directory and mount it into the hive-server container.

Edit the docker-compose.yml file under docker-hive to add a volume to hive-server, mapping the data directory under the current directory.

  hive-server:
    image: bde2020/hive:2.3.2-postgresql-metastore
    env_file:
      - ./hadoop-hive.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    ports:
      - "10000:10000"
    volumes:
      - ./data:/data

Copy the csv data into the shared directory and restart Hive.

docker-hive# mkdir -p data && cp ../HNI_2018-05.csv data/
docker-compose stop
docker-compose up -d

Create the corresponding Hive table.

hive> CREATE TABLE HN_COMMENTS (comment_by STRING, comment_id INT, parent INT,
retrieved_on INT, text STRING, type STRING, time INT, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '||';
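One caveat worth flagging: to my knowledge, Hive's default LazySimpleSerDe honors only the first character of a multi-character delimiter, so '||' behaves like '|', and every '||' in the data yields an empty column between real fields (a true multi-character delimiter needs something like the contrib MultiDelimitSerDe). A quick Python check shows what a single-character splitter actually sees:

```python
# Simulate a single-character field delimiter applied to a "||"-joined row.
row = "Caprinicus||16963610||16962203"
print(row.split("|"))  # empty strings appear between the real fields
```

The first column still lines up, which is why the comment_by queries later in this article work, but the numeric columns after it shift.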

Import the csv data into Hive.

LOAD DATA LOCAL INPATH '/data/HNI_2018-05.csv' OVERWRITE INTO TABLE HN_COMMENTS;

Verify that the import succeeded.

> select text from HN_COMMENTS limit 1;

Create a partitioned table.

CREATE TABLE HN_COMMENTS_DT (comment_by STRING, comment_id INT, parent INT, 
retrieved_on INT, text STRING, type STRING,
time INT) partitioned by (dt STRING);

Enable dynamic partitioning.

set hive.exec.dynamic.partition.mode=nonstrict;

Load the original table into the partitioned table by partition; since dt is the last column of HN_COMMENTS, SELECT * lines up with the dynamic partition column.

insert into table HN_COMMENTS_DT partition(dt) 
select * from HN_COMMENTS;
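Conceptually, dynamic partitioning routes each row into a partition chosen from the value of its partition column, which must come last in the SELECT list. A toy Python sketch of the idea (made-up rows, not the real schema):

```python
from collections import defaultdict

# (comment_by, time, dt) triples; dt plays the partition column.
rows = [("alice", 1525132803, "2018-05-01"),
        ("bob",   1525219203, "2018-05-02"),
        ("carol", 1525132900, "2018-05-01")]

partitions = defaultdict(list)
for *cols, dt in rows:  # the partition column is last, as in the INSERT above
    partitions[dt].append(cols)

print(sorted(partitions))  # ['2018-05-01', '2018-05-02']
```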

Useful Statements

List the available functions.

show functions;

For example, convert a unix timestamp to a date string.

SELECT from_unixtime(1525350800);
+----------------------+
|         _c0          |
+----------------------+
| 2018-05-03 12:33:20  |
+----------------------+
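For a quick cross-check outside Hive, the same conversion in Python (note that Hive's from_unixtime formats in the server's session timezone; the sketch below pins UTC):

```python
from datetime import datetime, timezone

ts = 1525350800
# Format the unix timestamp as a UTC date-time string.
print(datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
# 2018-05-03 12:33:20
```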

Show a table's definition.

desc hn_comments;

List the partitions.

show partitions HN_COMMENTS_DT;

+----------------------------------------------------+
|                     partition                      |
+----------------------------------------------------+
| dt=2018-05-01                                      |
| dt=2018-05-06                                      |
| dt=2018-05-08                                      |
| dt=2018-05-09                                      |
| dt=2018-05-11                                      |
| dt=2018-05-12                                      |
| dt=2018-05-15                                      |
| dt=2018-05-16                                      |
| dt=2018-05-23                                      |
| dt=2018-05-24                                      |
| dt=2018-05-29                                      |
+----------------------------------------------------+

Hive tables store large volumes of data and are usually partitioned by time (hour, day, week, and so on). Before querying a Hive table, it is best to check the partitions with this command and restrict queries to specific partitions (e.g. WHERE dt='2018-05-01') to avoid scanning data needlessly.

Starting the Analysis

Find the users who posted the most comments.

> SELECT comment_by, COUNT(*) AS cnt
FROM HN_COMMENTS WHERE comment_by != ""
GROUP BY comment_by
ORDER BY cnt DESC LIMIT 10;

+----------------+-------+
|   comment_by   |  cnt  |
+----------------+-------+
| </code></pre>  | 1706  |
| jacquesm       | 798   |
| rbanffy        | 751   |
| icebraining    | 549   |
| dragonwriter   | 546   |
| pjmlp          | 437   |
| dang           | 425   |
| gowld          | 423   |
| s73v3r_        | 420   |
| TeMPOraL       | 405   |
+----------------+-------+
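(The </code></pre> entry at the top is most likely residue from comment text or broken rows rather than a real user.) The GROUP BY above can be sanity-checked without Hive; a minimal Python sketch of the same aggregation over inline sample records (made-up data, not the real file):

```python
import json
from collections import Counter

# Inline sample standing in for lines of HNI_2018-05.
lines = [
    '{"by":"alice","type":"comment"}',
    '{"by":"bob","type":"comment"}',
    '{"by":"alice","type":"comment"}',
    '{"by":"","type":"comment"}',
]
counts = Counter(json.loads(l).get("by", "") for l in lines)
counts.pop("", None)  # mirror the WHERE comment_by != "" filter
print(counts.most_common(10))  # [('alice', 2), ('bob', 1)]
```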

References

https://github.com/fhoffa/notebooks/blob/master/analyzing%20hacker%20news.ipynb