Hive中默认使用单字节分隔符来加载文本数据,例如逗号、制表符、空格等等,默认的分隔符为\001。根据不同文件的不同分隔符,可以通过在创建表时使用 row format delimited fields terminated by ‘单字节分隔符’ 来指定文件中的分割符,确保正确将表中的每一列与文件中的每一列实现一一对应的关系。
//程序入口 publicstaticvoidmain(String[] args)throws Exception { //调用run Configuration conf = new Configuration(); int status = ToolRunner.run(conf, new ChangeSplitCharMR(), args); System.exit(status); }
publicstaticclassChangeSplitMapperextendsMapper<LongWritable,Text,Text,NullWritable>{ //定义输出的Key private Text outputKey = new Text(); //定义输出的Value private NullWritable outputValue = NullWritable.get();
parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL. It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string. Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY_NAME>
parse_url在使用时可以指定多个参数
第一个参数:url:指定要解析的URL
第二个参数:key1:指定要解析的内容1
……
第N个参数:keyN:指定要解析的内容N
示例:
1 2 3
SELECT b.* FROM src LATERALVIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT1;
SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a;
测试:
查询tb_url中每个url的HOST、PATH:
1
select parse_url_tuple(url,"HOST","PATH") as (host,path) from tb_url;
查询tb_url中每个url的PROTOCOL、HOST、QUERY:
1
select parse_url_tuple(url,"PROTOCOL","HOST","PATH") as (protocol,host,path) from tb_url;
实现需求:
1
select parse_url_tuple(url,"HOST","PATH","QUERY") as (host,path,query) from tb_url;
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)* fromClause: FROM baseTable (lateralView)*
基本语法如下:
1
select …… from tabelA lateralview UDTF(xxx) 别名 as col1,col2,col3……
测试
单个lateral view调用,实现上述需求中的应用:
1 2 3 4 5 6 7
select a.id asid, b.host as host, b.path aspath, b.query asquery from tb_url a lateralview parse_url_tuple(url,"HOST","PATH","QUERY") b as host,path,query;
多lateral view调用
1 2 3 4 5 6 7 8 9
select a.id asid, b.host as host, b.path aspath, c.protocol as protocol, c.query asquery from tb_url a lateralview parse_url_tuple(url,"HOST","PATH") b as host,path lateralview parse_url_tuple(url,"PROTOCOL","QUERY") c as protocol,query;
Outer Lateral View
如果UDTF不产生数据时,这时侧视图与原表关联的结果将为空,如下图所示:
1 2 3 4 5 6
select id, url, col1 from tb_url lateralviewexplode(array()) et as col1;
如果加上outer关键字以后,就会保留原表数据,类似于outer join
1 2 3 4 5 6
select id, url, col1 from tb_url lateralviewouterexplode(array()) et as col1;
行列转换应用与实现
行转列:多行转多列
需求
原始数据表:
目标结果表:
case when判断
功能:用于实现对数据的判断,根据条件,不同的情况返回不同的结果,类似于Java中的switch case 功能
语法: 语法一:
1 2 3 4 5 6 7
CASE WHEN 条件1 THEN VALUE1 WHEN 条件2 THEN VALUE2 …… WHEN 条件N THEN VALUEN ELSE 默认值 END
语法二:
1 2 3 4 5 6 7
CASE 列 WHEN V1 THEN VALUE1 WHEN V2 THEN VALUE2 …… WHEN VN THEN VALUEN ELSE 默认值 END
测试:
语法一:当id < 2显示a,当id = 2 显示b ,其他的显示c:
1 2 3 4 5 6 7 8
select id, case whenid < 2then'a' whenid = 2then'b' else'c' endas caseName from tb_url;
语法二:当id =1 显示a,当id = 2 显示b ,其他的显示c
1 2 3 4 5 6 7 8
select id, caseid when1then'a' when2then'b' else'c' endas caseName from tb_url;
实现
创建原始数据表,加载数据
1 2 3 4 5 6 7 8 9 10
--切换数据库 use db_function; --建表 createtable row2col1( col1 string, col2 string, col3 int ) rowformatdelimitedfieldsterminatedby'\t'; --加载数据到表中 loaddatalocal inpath '/export/data/r2c1.txt'intotable row2col1;
SQL实现转换
1 2 3 4 5 6 7 8 9
select col1 as col1, max(case col2 when'c'then col3 else0end) as c, max(case col2 when'd'then col3 else0end) as d, max(case col2 when'e'then col3 else0end) as e from row2col1 groupby col1;
select col1, 'c'as col2, col2 as col3 from col2row1 UNIONALL select col1, 'd'as col2, col3 as col3 from col2row1 UNIONALL select col1, 'e'as col2, col4 as col3 from col2row1;
select json, get_json_object(json,"$.device") as device from tb_json_test1;
获取设备名称及信号强度字段:
1 2 3 4 5 6
select --获取设备名称 get_json_object(json,"$.device") as device, --获取设备信号强度 get_json_object(json,"$.signal") as signal from tb_json_test1;
实现需求:
1 2 3 4 5 6 7 8 9 10
select --获取设备名称 get_json_object(json,"$.device") as device, --获取设备类型 get_json_object(json,"$.deviceType") as deviceType, --获取设备信号强度 get_json_object(json,"$.signal") as signal, --获取时间 get_json_object(json,"$.time") as stime from tb_json_test1;
JSON函数:json_tuple
功能
用于实现JSON字符串的解析,可以通过指定多个参数来解析JSON返回多列的值
语法
1 2
json_tuple(jsonStr, p1, p2, ..., pn) like get_json_object, but it takes multiple names and return a tuple
参数:
第一个参数:指定要解析的JSON字符串
第二个参数:指定要返回的第1个字段
……
第N+1个参数:指定要返回的第N个字段
特点:
功能类似于get_json_object,但是可以调用一次返回多列的值。属于UDTF类型函数
返回的每一列都是字符串类型
一般搭配lateral view使用
使用
获取设备名称及信号强度字段
1 2 3 4
select --返回设备名称及信号强度 json_tuple(json,"device","signal") as (device,signal) from tb_json_test1;
实现需求,单独使用
1 2 3 4
select --解析所有字段 json_tuple(json,"device","deviceType","signal","time") as (device,deviceType,signal,stime) from tb_json_test1;
实现需求,搭配侧视图
1 2 3 4 5
select json,device,deviceType,signal,stime from tb_json_test1 lateralview json_tuple(json,"device","deviceType","signal","time") b as device,deviceType,signal,stime;
+-----------+--------------+-----------+--------------+ | A_USERID | A_LOGINTIME | B_USERID | B_LOGINTIME | +-----------+--------------+-----------+--------------+ | A | 2021-03-22 | A | 2021-03-22 | | B | 2021-03-22 | A | 2021-03-22 | | C | 2021-03-22 | A | 2021-03-22 | | A | 2021-03-23 | A | 2021-03-22 | | C | 2021-03-23 | A | 2021-03-22 | | A | 2021-03-24 | A | 2021-03-22 | | B | 2021-03-24 | A | 2021-03-22 | | A | 2021-03-22 | B | 2021-03-22 | | B | 2021-03-22 | B | 2021-03-22 | | C | 2021-03-22 | B | 2021-03-22 | | A | 2021-03-23 | B | 2021-03-22 | | C | 2021-03-23 | B | 2021-03-22 | | A | 2021-03-24 | B | 2021-03-22 | | B | 2021-03-24 | B | 2021-03-22 | | A | 2021-03-22 | C | 2021-03-22 | | B | 2021-03-22 | C | 2021-03-22 | | C | 2021-03-22 | C | 2021-03-22 | | A | 2021-03-23 | C | 2021-03-22 | | C | 2021-03-23 | C | 2021-03-22 | | A | 2021-03-24 | C | 2021-03-22 | | B | 2021-03-24 | C | 2021-03-22 | | A | 2021-03-22 | A | 2021-03-23 | | B | 2021-03-22 | A | 2021-03-23 | | C | 2021-03-22 | A | 2021-03-23 | | A | 2021-03-23 | A | 2021-03-23 | | C | 2021-03-23 | A | 2021-03-23 | | A | 2021-03-24 | A | 2021-03-23 | | B | 2021-03-24 | A | 2021-03-23 | | A | 2021-03-22 | C | 2021-03-23 | | B | 2021-03-22 | C | 2021-03-23 | | C | 2021-03-22 | C | 2021-03-23 | | A | 2021-03-23 | C | 2021-03-23 | | C | 2021-03-23 | C | 2021-03-23 | | A | 2021-03-24 | C | 2021-03-23 | | B | 2021-03-24 | C | 2021-03-23 | | A | 2021-03-22 | A | 2021-03-24 | | B | 2021-03-22 | A | 2021-03-24 | | C | 2021-03-22 | A | 2021-03-24 | | A | 2021-03-23 | A | 2021-03-24 | | C | 2021-03-23 | A | 2021-03-24 | | A | 2021-03-24 | A | 2021-03-24 | | B | 2021-03-24 | A | 2021-03-24 | | A | 2021-03-22 | B | 2021-03-24 | | B | 2021-03-22 | B | 2021-03-24 | | C | 2021-03-22 | B | 2021-03-24 | | A | 2021-03-23 | B | 2021-03-24 | | C | 2021-03-23 | B | 2021-03-24 | | A | 2021-03-24 | B | 2021-03-24 | | B | 2021-03-24 | B | 2021-03-24 | +-----------+--------------+-----------+--------------+
保存为表
1 2 3 4 5 6 7
createtable tb_login_tmp as select a.userid as a_userid, a.logintime as a_logintime, b.userid as b_userid, b.logintime as b_logintime from tb_login a,tb_login b;
过滤数据:用户id相同并且登陆日期相差1
1 2 3 4 5
select a_userid,a_logintime,b_userid,b_logintime from tb_login_tmp where a_userid = b_userid andcast(substr(a_logintime,9,2) asint) - 1 = cast(substr(b_logintime,9,2) asint);
统计连续登陆两天的用户
1 2 3 4 5
select distinct a_userid from tb_login_tmp where a_userid = b_userid andcast(substr(a_logintime,9,2) asint) - 1 = cast(substr(b_logintime,9,2) asint);
select userid, logintime, --本次登陆日期的第二天 date_add(logintime,1) as nextday, --按照用户id分区,按照登陆日期排序,取下一次登陆时间,取不到就为0 lead(logintime,1,0) over (partitionby userid orderby logintime) as nextlogin from tb_login;
1 2 3 4 5 6 7 8 9 10
with t1 as ( select userid, logintime, --本次登陆日期的第二天 date_add(logintime,1) as nextday, --按照用户id分区,按照登陆日期排序,取下一次登陆时间,取不到就为0 lead(logintime,1,0) over (partitionby userid orderby logintime) as nextlogin from tb_login ) selectdistinct userid from t1 where nextday = nextlogin;
统计连续3天登录
1 2 3 4 5 6 7 8
select userid, logintime, --本次登陆日期的第三天 date_add(logintime,2) as nextday, --按照用户id分区,按照登陆日期排序,取下下一次登陆时间,取不到就为0 lead(logintime,2,0) over (partitionby userid orderby logintime) as nextlogin from tb_login;
1 2 3 4 5 6 7 8 9 10
with t1 as ( select userid, logintime, --本次登陆日期的第三天 date_add(logintime,2) as nextday, --按照用户id分区,按照登陆日期排序,取下下一次登陆时间,取不到就为0 lead(logintime,2,0) over (partitionby userid orderby logintime) as nextlogin from tb_login ) selectdistinct userid from t1 where nextday = nextlogin;
统计连续N天登录
1 2 3 4 5 6 7 8
select userid, logintime, --本次登陆日期的第N天 date_add(logintime,N-1) as nextday, --按照用户id分区,按照登陆日期排序,取下下一次登陆时间,取不到就为0 lead(logintime,N-1,0) over (partitionby userid orderby logintime) as nextlogin from tb_login;
A 2021-01 5 A 2021-01 15 B 2021-01 5 A 2021-01 8 B 2021-01 25 A 2021-01 5 A 2021-02 4 A 2021-02 6 B 2021-02 10 B 2021-02 5 A 2021-03 7 B 2021-03 9 A 2021-03 11 B 2021-03 6
createtable tb_money_mtn as select userid, mth, sum(money) as m_money from tb_money groupby userid,mth;
方案一:自连接分组聚合
基于每个用户每个月的消费总金额进行自连接
1 2 3 4 5 6 7 8
select a.userid as auserid, a.mth as amth, a.m_money as am_money, b.userid as buserid, b.mth as bmth, b.m_money as bm_money from tb_money_mtn a join tb_money_mtn b on a.userid = b.userid;
将每个月之前月份的数据过滤出来
1 2 3 4 5 6 7 8 9
select a.userid as auserid, a.mth as amth, a.m_money as am_money, b.userid as buserid, b.mth as bmth, b.m_money as bm_money from tb_money_mtn a join tb_money_mtn b on a.userid = b.userid where a.mth >= b.mth;
对每个用户每个月的金额进行分组,聚合之前月份的消费金额
1 2 3 4 5 6 7 8
select a.userid as auserid, a.mth as amth, a.m_money as am_money, sum(b.m_money) as t_money from tb_money_mtn a join tb_money_mtn b on a.userid = b.userid where a.mth >= b.mth groupby a.userid,a.mth,a.m_money;
方案二:窗口函数实现
窗口函数sum
功能:用于实现基于窗口的数据求和
语法:sum(colName) over (partition by col order by col)
select empno, ename, salary, deptno, row_number() over (partitionby deptno orderby salary desc) as rn from tb_emp;
过滤每个部门的薪资最高的前两名
1 2 3 4 5 6 7 8 9
with t1 as ( select empno, ename, salary, deptno, row_number() over (partitionby deptno orderby salary desc) as rn from tb_emp ) select * from t1 where rn < 3;