Screenshots to be added later. 2016-03-25

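Scenario

In a company, the upstream data import team has its own table structures and its own delimiter.

Your data processing team needs to use their data. You obviously cannot change their tables; you don't have the permission.

Suppose the import team uses something like $@_@$ as the delimiter: ugly, but practical, because it is unlikely to collide with the actual content.

You want to process the data with Hive, but this is not a delimiter Hive understands out of the box. What now?

Solution:

* Write a custom InputFormat class to handle the complex delimiter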

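Custom InputFormat class

First attempt

It looked simple: just swap the default delimiter for $@_@$ and be done with it. So I went ahead:

I made a copy of the original org.apache.hadoop.mapred.TextInputFormat and put it in my own class, Hive.myinputformat.
Then I made the following change: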

public RecordReader<LongWritable, Text> getRecordReader(
	InputSplit genericSplit, JobConf job, Reporter reporter)
			throws IOException {

		reporter.setStatus(genericSplit.toString());

		/**
		 * Override the value of "textinputformat.record.delimiter".
		 * Note: this property controls where one record ends and the
		 * next begins; it is not the delimiter between fields.
		 */
		job.set("textinputformat.record.delimiter", "$@_@$");

		String delimiter = job.get("textinputformat.record.delimiter");

		//delimiter.toLowerCase().replaceAll("\\$@_@\\$", "\001");

		// Hand the delimiter bytes to the line reader
		byte[] recordDelimiterBytes = null;
		if (null != delimiter) {
			recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
		}
		return new LineRecordReader(job, (FileSplit) genericSplit,
				recordDelimiterBytes);
	}

There was also a small detour:


hive>create table t1(id int,name string)
    stored as inputformat 'myinputformat'
    outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Result: cannot find class myinputformat.

Problem: I forgot the package name -_-


hive>create table t1(id int,name string)
    stored as inputformat 'Hive.myinputformat'
    outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Result: cannot find class Hive.myinputformat.

Problem: it turns out I had to exit Hive and start it again. (。﹏。*)
There is probably some way to refresh it, but I don't know how. Simply restarting Hive works.

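I thought that would be the end of it. But as soon as I loaded the data, it fell over again 0_0

Result:

1     NULL
NULL  NULL

The idea was nice, but reality was harsh (￢_￢)

Failed.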
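A likely reason for this result is that "textinputformat.record.delimiter" tells the reader where records end, so $@_@$ ends up cutting each row into several records instead of separating its columns. The plain-Java sketch below only imitates that behaviour; the class RecordDelimiterDemo, the method splitRecords and the sample row "1$@_@$tom" are made up for illustration and are not Hadoop code or the original data.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordDelimiterDemo {

    // Hypothetical helper: cuts the file content into records at every
    // occurrence of recordDelimiter, roughly what a reader configured
    // with a custom record delimiter would do.
    static List<String> splitRecords(String fileContent, String recordDelimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = fileContent.indexOf(recordDelimiter, start)) >= 0) {
            records.add(fileContent.substring(start, idx));
            start = idx + recordDelimiter.length();
        }
        // Whatever is left after the last delimiter is the final record
        if (start < fileContent.length()) {
            records.add(fileContent.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed sample file: one row with two columns, id and name
        String file = "1$@_@$tom\n";
        for (String record : splitRecords(file, "$@_@$")) {
            System.out.println("[" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

With this sample, one row becomes two records, "1" and "tom" followed by a newline. Neither contains Hive's default field delimiter \001, so each would land entirely in the first column (id int), which would be consistent with one row reading 1, NULL and another reading NULL, NULL.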

Second attempt

A Baidu search turned up this solution: http://www.micmiu.com/bigdata/hive/hive-inputformat-string/

I copied his class, built it into a jar, and used his delimiter |@^_^@| as-is.
It worked right away.

I studied his code, and the key line is this one:

	String strReplace = text.toString().toLowerCase().replaceAll("\\|@\\^_\\^@\\|", "\001");

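But one thing puzzled me:

Why is it not |@^_^@| but \\|@\\^_\\^@\\| ?
After some digging:
it turns out that replaceAll() treats its first argument as a regular expression.

So the line means: replace every "|@^_^@|" in the row with Hive's default field delimiter "\001".

The backslashes are there because of the regular expression:

characters such as | and ^ (and $ in my delimiter) are regex metacharacters. Escaping them tells the regex engine not to interpret them and to match the delimiter literally as one whole string. Each backslash is doubled because it is written inside a Java string literal.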
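To make the escaping concrete, here is a small self-contained sketch comparing a few ways of replacing $@_@$ with \001 in Java. Pattern.quote and String.replace are standard JDK alternatives that are not part of the original code; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class DelimiterEscapeDemo {

    // Hive's default field delimiter (Ctrl-A)
    static final String HIVE_FIELD_DELIM = "\001";

    // Escape each '$' by hand, as in the approach described above
    static String withManualEscapes(String line) {
        return line.replaceAll("\\$@_@\\$", HIVE_FIELD_DELIM);
    }

    // Let the JDK quote the whole delimiter as a literal pattern
    static String withPatternQuote(String line) {
        return line.replaceAll(Pattern.quote("$@_@$"), HIVE_FIELD_DELIM);
    }

    // String.replace works on literal text, so no regex is involved
    static String withLiteralReplace(String line) {
        return line.replace("$@_@$", HIVE_FIELD_DELIM);
    }

    // Unescaped: '$' is read as an end-of-input anchor, so nothing matches
    static String withoutEscapes(String line) {
        return line.replaceAll("$@_@$", HIVE_FIELD_DELIM);
    }

    public static void main(String[] args) {
        String line = "1$@_@$tom";
        System.out.println(withManualEscapes(line).equals("1\001tom"));
        System.out.println(withPatternQuote(line).equals("1\001tom"));
        System.out.println(withLiteralReplace(line).equals("1\001tom"));
        System.out.println(withoutEscapes(line).equals(line));
    }
}
```

The first three variants give the same result. The unescaped one silently leaves the line unchanged, which is an easy mistake to miss because no exception is thrown.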

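So I changed the line to:

	String strReplace = text.toString().toLowerCase().replaceAll("\\$@_@\\$", "\001");

(Keep in mind that the toLowerCase() call carried over from his code also lowercases the row data itself.)

Then I built the jar, put it in the lib directory, created the table, and loaded the data. Success~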