Hive in Detail (3)

Hive

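Data Types

The struct type

struct: a structure type that corresponds to an object in Java. Each field has a name and a type, and query results display a struct value in a JSON-like form.

Example

Raw data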

a tom,19,male amy,18,female
b bob,18,male john,18,male
c lucy,19,female lily,19,female
d henry,18,male david,19,male

Statements

-- Create the table
create table groups (
    group_id string,
    mem_a    struct<name:string, age:int, gender:string>,
    mem_b    struct<name:string, age:int, gender:string>
) row format delimited
    fields terminated by ' '
    collection items terminated by ',';
-- Load the data
load data local inpath '/opt/hive_data/infos' into table groups;
-- Query the data
select * from groups;
-- Get the information of member a
select mem_a from groups;
-- Get the name of member a
select mem_a.name from groups;
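
To try struct field access without loading a file, a struct value can also be built inline. This is only a sketch: it assumes a Hive version that accepts a select without a from clause, and it uses the built-in named_struct function. The aliases s and t are illustrative names, not part of the example above.

```sql
-- Build one struct value inline and read its fields with dot notation.
select s.name, s.age
from (
    select named_struct('name', 'tom', 'age', 19, 'gender', 'male') as s
) t;
```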

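Operators and Functions

Overview

Hive provides a rich set of operators and functions for processing and analyzing data. In Hive, operators and functions can be treated as a single category.
To list all the functions available in Hive, run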
```
show functions;
```

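To see the description of a particular function, use

-- Brief description
desc function sum;
-- Detailed description
desc function extended sum;

Hive also lets users define their own functions.
In Hive, a function cannot stand alone: it must be combined with other keywords (for example select) to form a statement.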

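Getting-Started Examples

Example 1: given a string that represents a date, such as '2024-03-25', extract the year from it

-- Approach 1: split the string on -, take the first element of the array, and cast it to int
select cast(split('2024-03-25', '-')[0] as int);
-- Approach 2: regular expression with capture groups
select cast(regexp_extract('2024-03-25', '(.*)-(.*)-(.*)', 1) as int);
-- Approach 3: the built-in year function extracts the year directly; year, month and day must be separated by -
select year('2024-03-25');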

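Example 2: given a string that represents a date, such as '2024/03/25', extract the year from it

-- Approach 1: split on /
select cast(split('2024/03/25', '/')[0] as int);
-- Approach 2: regular expression with capture groups
select cast(regexp_extract('2024/03/25', '(.*)/(.*)/(.*)', 1) as int);
-- Approach 3: replace / with - first, then extract the year with the year function
select year(regexp_replace('2024/03/25', '/', '-'));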
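
Two further variations on Example 2 are sketched below. They are not from the examples above. The first relies on the year always occupying the first four characters. The second assumes the built-in unix_timestamp(string, pattern) and from_unixtime(bigint, format) functions and parses the date with an explicit pattern.

```sql
-- Take the first four characters; only valid while the year is always written with 4 digits up front.
select cast(substr('2024/03/25', 1, 4) as int);
-- Parse the string with an explicit date pattern, then format only the year part.
select cast(from_unixtime(unix_timestamp('2024/03/25', 'yyyy/MM/dd'), 'yyyy') as int);
```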

Common Functions

The nvl function

nvl(v1, v2): checks whether v1 is null. If v1 is not null, it returns v1; if v1 is null, it returns v2.

Example

Raw data

1 Adair 800
2 David 600
3 Danny 1000
4 Ben 500
5 Grace
6 Cathy 700
7 Kite
8 Will 600
9 Thomas 800
10 Tony 1000

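Statements

-- Create the table
create table rewards (
    id     int,
    name   string,
    reward double
) row format delimited fields terminated by ' ';
-- Load the data
load data local inpath '/opt/hive_data/rewards' into table rewards;
-- Query the data
select * from rewards;
-- Compute the average reward per person
-- avg is an aggregate function; aggregate functions skip null values (count(*) is the exception, since it counts rows),
-- so the commented-out query below would average only over the people who actually received a reward
-- select avg(reward) from rewards;
select avg(if(reward is not null, reward, 0.0)) from rewards;
-- The same with nvl
select avg(nvl(reward, 0)) from rewards;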
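
The effect of nvl on an average can be checked on a tiny inline data set. The sketch below assumes a Hive version that allows select without a from clause inside a union all subquery. The column r and the alias t are made-up names for illustration.

```sql
-- Three rows: 800, null, 600.
select avg(r)         as avg_skipping_null,  -- (800 + 600) / 2 = 700
       avg(nvl(r, 0)) as avg_null_as_zero    -- (800 + 0 + 600) / 3, roughly 466.67
from (
    select 800 as r
    union all
    select cast(null as int) as r
    union all
    select 600 as r
) t;
```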

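The case-when function

Similar to the switch-case construct in Java: it selects a result according to different conditions.

Example

Raw data (columns: id, name, department, gender; 财务 = finance, 技术 = technology, 男 = male, 女 = female)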

1 bob 财务 男
2 bruce 技术 男
3 cindy 技术 女
4 david 财务 男
5 eden 财务 男
6 frank 财务 男
7 grace 技术 女
8 henry 技术 男
9 iran 技术 男
10 jane 财务 女
11 kathy 财务 女
12 lily 技术 女

Statements

-- Create the table
create table employers (
    id         int,
    name       string,
    department string,
    gender     string
) row format delimited fields terminated by ' ';
-- Load the data
load data local inpath '/opt/hive_data/employers' into table employers;
-- Query the data
select *
from employers;
-- Requirement: count the men and women in each department
-- Column aliases: 部门 = department, 男 = male, 女 = female
-- Approach 1: sum(if())
select department                   as `部门`,
       sum(if(gender = '男', 1, 0)) as `男`,
       sum(if(gender = '女', 1, 0)) as `女`
from employers
group by department;
-- Approach 2: sum(case-when)
select department                                   as `部门`,
       sum(case gender when '男' then 1 else 0 end) as `男`,
       sum(case gender when '女' then 1 else 0 end) as `女`
from employers
group by department;
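
The queries above use the simple form of case, which compares one expression against a list of values. A searched case, where each when carries its own condition, is sketched below as a third variant. It relies on case returning null when no branch matches and there is no else, and on count ignoring nulls. The aliases male_count and female_count are illustrative.

```sql
-- Searched case: each "when" holds a full boolean condition.
-- Without an else branch the result is null, which count() does not count.
select department,
       count(case when gender = '男' then 1 end) as male_count,
       count(case when gender = '女' then 1 end) as female_count
from employers
group by department;
```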

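The explode function

explode takes an array or a map as its argument. Given an array, it emits each element as its own row, producing a single column; given a map, it emits one row per entry, with the key and the value in two separate columns.

Example: word count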

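-- Create a directory
dfs -mkdir /words;
-- Copy the file into this directory
dfs -cp /txt/words.txt /words;
-- Check the data
dfs -ls /words;
-- Create the table
-- Note: the data already exists on HDFS, so an external table should be created
create external table words (
    line array<string>
) row format delimited
    collection items terminated by ' '
    location '/words';
-- Query the data
select * from words;
-- Requirement: count how many times each word occurs in this file
-- Approach
-- Step 1: turn the elements of the array into a column, one element per row
select explode(line)
from words;
-- Step 2: count the occurrences of each word
-- Basic structure: select x, count(x) from tableName group by x;
select w, count(w) from (
  select explode(line) as w from words
) t1 group by w;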
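
The two behaviours of explode described at the start of this section can be tried on literal values, without any table. This is a sketch that assumes a Hive version that allows select without a from clause. The aliases item, k and v are illustrative.

```sql
-- An array produces one row per element, in a single column.
select explode(array('a', 'b', 'c')) as item;
-- A map produces one row per entry, with the key and the value in two columns.
select explode(map('x', 1, 'y', 2)) as (k, v);
```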

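Column to Rows

Column-to-row conversion, as the name suggests, splits the values held in one column into multiple rows. The most important function in this process is explode.

Example 1

Raw data (movie title, then its genres separated by /; 动作 = action)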

沙丘2 剧情/动作/科幻/冒险
被我弄丢的你 剧情/爱情
堡垒 剧情/悬疑/历史
热辣滚烫 剧情/喜剧
新威龙杀阵 动作/惊悚
周处除三害 动作/犯罪

Statements

-- Create the table
create table movies (
    name  string,       -- movie title
    kinds array<string> -- movie genres
) row format delimited
    fields terminated by ' '
    collection items terminated by '/';
-- Load the data
load data local inpath '/opt/hive_data/movies' into table movies;
-- Query the data
select * from movies;
-- Requirement: find all action movies ('动作' = action)
-- Syntax: lateral view function(expr) tableAlias as colAlias
-- Column-to-row conversion is also known as 'exploding a column'
select name, k
from movies lateral view explode(kinds) ks as k
where k = '动作';

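Example 2

Raw data (name, personality traits, hobbies; the three fields are separated by tabs; 开朗 = cheerful, 打游戏 = playing video games)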

bob 开朗,活泼   打游戏,打篮球
david   开朗,幽默   看电影,打游戏
lucy    大方,开朗   看电影,听音乐
jack    内向,大方   听音乐,打游戏

Statements

-- Create the table
create table persons (
    name       string,        -- name
    characters array<string>, -- personality traits
    hobbies    array<string>  -- hobbies
) row format delimited
    fields terminated by '\t'
    collection items terminated by ',';
-- Load the data
load data local inpath '/opt/hive_data/persons' into table persons;
-- Query the data
select * from persons;
-- Find the people who are cheerful ('开朗') and like playing video games ('打游戏')
select name, c, h
from persons
         lateral view explode(characters) cs as c
         lateral view explode(hobbies) hs as h
where c = '开朗'
  and h = '打游戏';
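
With two lateral views, each person is expanded into every combination of trait and hobby before the filter runs. When only a membership test is needed, the built-in array_contains function avoids the expansion. This is an alternative sketch, not the approach used above.

```sql
-- Test membership directly on the arrays instead of exploding them.
select name
from persons
where array_contains(characters, '开朗')
  and array_contains(hobbies, '打游戏');
```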

Rows to Column

Row-to-column conversion merges the values from multiple rows into a single column.

Example

select * from students_tmp;
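-- Put the students of the same grade and class together
-- collect_list and collect_set both merge values into one array
-- The difference: collect_list keeps duplicate values, while collect_set removes them
-- concat_ws(separator, elements) joins the elements with the given separator into a single string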
select grade                               as `年级`,
       class                               as `班级`,
       concat_ws(', ', collect_list(name)) as `学生`
from students_tmp
group by grade, class;
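
The difference between collect_list and collect_set shows up when the input contains duplicates. The sketch below uses a small inline data set rather than students_tmp. It assumes a Hive version that allows select without a from clause inside a union all subquery, and the names n and t are illustrative.

```sql
-- Input rows: 'tom', 'amy', 'tom'.
select size(collect_list(n)) as list_size,  -- 3: duplicates are kept
       size(collect_set(n))  as set_size    -- 2: duplicates are removed
from (
    select 'tom' as n
    union all
    select 'amy' as n
    union all
    select 'tom' as n
) t;
```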

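Custom Functions

Custom UDF: define a class. In Hive 1.x and Hive 2.x the class extends UDF; in Hive 3.x the UDF class is deprecated, so the class should extend GenericUDF instead.
Custom UDTF: define a class that extends GenericUDTF.
Package the classes into a jar, then upload the jar to HDFS.

Creating the function in Hive

-- Basic syntax
create function function_name
    as 'package.ClassName'
    using jar 'path of the jar on HDFS';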
-- UDF
create function indexOf
    as 'com.fesco.AuthUDF'
    using jar 'hdfs://hadoop01:9000/F_Hive-1.0-SNAPSHOT.jar';
-- UDTF
create function splitLine
    as 'com.fesco.AuthUDTF'
    using jar 'hdfs://hadoop01:9000/F_Hive-1.0-SNAPSHOT.jar';

-- Test
select indexOf('welcome', 'm');
select splitLine('welcome to big data', ' ');

Dropping a function
```
drop function indexOf;
```
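
While a UDF is still being developed, a session-scoped temporary function can be more convenient than a permanent one. The sketch below assumes the jar is on the local file system of the machine running the Hive client. The path /opt/hive_data/F_Hive-1.0-SNAPSHOT.jar and the function name index_of_tmp are hypothetical; only the class name comes from the example above.

```sql
-- Hypothetical local path: replace it with wherever the jar actually lives.
add jar /opt/hive_data/F_Hive-1.0-SNAPSHOT.jar;
-- A temporary function exists only for the current session.
create temporary function index_of_tmp as 'com.fesco.AuthUDF';
select index_of_tmp('welcome', 'm');
-- Remove it explicitly, or let it disappear when the session ends.
drop temporary function if exists index_of_tmp;
```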

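Window Functions

Overview

Window functions (also called windowing functions) restrict the range of rows that a calculation operates on.
Basic syntax
```
analytic_function over(partition by column order by column [desc/asc] rows between frame_start and frame_end)
```
1. partition by divides the rows into groups
2. order by sorts the rows within each group
3. rows between x and y specifies the range of rows to process

  Keywords
  preceding: rows before the current row
  following: rows after the current row
  unbounded: no boundary
  current row: the current row
4. Example: suppose row 5 is currently being processed
  1. 2 preceding and current row: from two rows before up to the current row, i.e. rows 3 to 5
  2. current row and 3 following: the current row and the 3 rows after it, i.e. rows 5 to 8
  3. unbounded preceding and current row: from the first row of the partition to the current row
  4. current row and unbounded following: from the current row to the last row of the partition
5. Analytic functions fall roughly into three groups
  1. Aggregate functions, such as sum and avg
  2. Offset functions, including lag and lead; ntile, which splits the rows into buckets, is grouped here as well
  3. Ranking functions, including row_number, rank, and dense_rank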

Example

Raw data

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

Creating the table

-- Create the table
create table orders
(
    name       string,
    order_date string,
    cost       int
) row format delimited fields terminated by ',';
-- Load the data
load data local inpath '/opt/hive_data/orders' into table orders;

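Requirement 1: for each customer, list the order details together with the total amount spent up to each order date

-- Approach:
-- 1. The query is per customer, so partition by the customer name
-- 2. Sort the orders by date
-- 3. The total amount spent is needed, so use sum
-- 4. "Up to the current order date" means processing the rows from the first row to the current row
select *,
       sum(cost) over (partition by name order by order_date rows between unbounded preceding and current row ) as total_cost
from orders;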
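
The other frame shapes and function groups listed in the overview can be tried on the same orders table. The query below is a sketch rather than part of the original requirement. The column aliases cost_around, prev_cost, rn, rk and drk are illustrative.

```sql
select name,
       order_date,
       cost,
       -- Sliding frame: the previous row, the current row and the next row of the same customer.
       sum(cost) over (partition by name order by order_date
                       rows between 1 preceding and 1 following) as cost_around,
       -- Offset function: cost of the customer's previous order; null for the first order.
       lag(cost, 1) over (partition by name order by order_date) as prev_cost,
       -- Ranking functions: position of the order among the customer's orders, most expensive first.
       -- They differ only in how ties are numbered.
       row_number() over (partition by name order by cost desc)  as rn,
       rank()       over (partition by name order by cost desc)  as rk,
       dense_rank() over (partition by name order by cost desc)  as drk
from orders;
```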

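Supplement: Regex Capture Groups

Overview

In a regular expression, a part enclosed in () is called a capture group, and a capture group can be treated as a single unit.
Capture groups are numbered automatically, starting from 1. A group's number is determined by the order in which its opening parenthesis appears, counting from left to right.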
```
Example: (AB(C(D)E)F(G))
1	AB(C(D)E)F(G)
2	C(D)E
3	D
4	G
```
Within a regular expression, a capture group can be referenced by its number in the form \n. For example, \1 refers to capture group 1.
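
The numbering and referencing rules can be checked directly in Hive. This is a sketch under two assumptions: backslashes are doubled inside Hive string literals, and the replacement string of regexp_replace follows the Java convention of referring to groups as $1, $2 and so on.

```sql
-- Numbering follows the opening parentheses: group 2 of this pattern is C(D)E.
select regexp_extract('ABCDEFG', '(AB(C(D)E)F(G))', 2);  -- CDE
-- \1 inside the pattern refers back to whatever group 1 matched (here: a doubled character).
select 'hello' rlike '(.)\\1';  -- true, because of "ll"
-- In the replacement string, groups are referenced as $1, $2, $3.
select regexp_replace('2024/03/25', '([0-9]+)/([0-9]+)/([0-9]+)', '$1-$2-$3');  -- 2024-03-25
```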