Monday, April 28, 2014

Use R to connect with HDFS and Hive

This post provides some tips on how to connect to Hive and HDFS from R. The RHive manual documents installation and initial setup well, so please complete that before starting. Here I will focus on Hive- and HDFS-related tasks.

As we know, the following commands connect to HiveServerHost on port 10000 via HiveServer2:

> library(RHive)
> rhive.connect("HiveServerHost", 10000, hiveServer2=TRUE)


If rhive.connect is called with no arguments, it connects to the Hive server on localhost at port 10000:

> rhive.connect()

You can then run queries on the Hive server with rhive.query(...). RHive offers more than queries, however: you can also access HDFS and write tables to it.
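As a quick sketch (assuming the connection above, and a Hive table named employees, which is a hypothetical name for illustration), rhive.query returns its result as an ordinary R data frame:

> res <- rhive.query("SELECT COUNT(*) FROM employees")
> res   # a data.frame holding the query result

Because the result is a plain data frame, you can feed it directly into any downstream R analysis.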

If you want to access HDFS, you first need to connect to the NameNode of the HDFS cluster:

> rhive.hdfs.connect("hdfs://namenode:8020")

If the command above doesn't work or isn't supported in your RHive version, prefix it with a dot (.), like:

> .rhive.hdfs.connect("hdfs://namenode:8020")


Next, we can create a three-column table named "test" in Hive and save it to HDFS from R:

> L3 <- LETTERS[1:3]
> d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE))
> rhive.write.table(data=d, tableName="test")
[1] "test"


Now show the contents of table test:
> rhive.query("select * from test")
   x  y fac
1  1  1   B
2  1  2   A
3  1  3   A
4  1  4   B
5  1  5   B
6  1  6   A
7  1  7   C
8  1  8   A
9  1  9   B
10 1 10   A


You can verify that a file now exists at /user/hive/warehouse/test/ in HDFS with the command:
> rhive.hdfs.ls("/user/hive/warehouse/test/")

This confirms that the table's contents have been saved to HDFS.
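Beyond listing, RHive also provides HDFS file-transfer helpers. A minimal sketch, assuming your RHive version exposes rhive.hdfs.get and rhive.hdfs.put (the local path /tmp/test_local is hypothetical):

> # copy the table's data from HDFS to the local filesystem
> rhive.hdfs.get("/user/hive/warehouse/test", "/tmp/test_local")
> # copy a local file back into HDFS
> rhive.hdfs.put("/tmp/test_local", "/user/hive/warehouse/test_copy")

These are convenient when you want to inspect raw table files locally or stage data into HDFS without leaving R.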

There are two good slide decks you can refer to on how to use RHive's basic functions and its HDFS-related functions:

RHive Basic Functions Tutorials
RHive HDFS Functions Tutorials
