Mastering Hadoop 3

HDFS read 

We will talk about two approaches to reading a file and discuss when to use each. In the first approach, we can use the URL class, which is part of the java.net package, to read files stored on HDFS. For this to work, URL.setURLStreamHandlerFactory() must be called with an instance of FsUrlStreamHandlerFactory. This initialization is placed in a static block so that it executes before any instance is created. The method can only be called once per JVM; hence, if a third-party program has already set a URLStreamHandlerFactory, we won't be able to use this approach for reading files from HDFS:

static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}

Once the URL stream handler factory has been set, we can open a stream on the file, which returns an InputStream. The IOUtils class can then be used to copy data from the input stream to the output stream, as shown in the following code:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class HDFSReadUsingURL {

    static {
        // Register the HDFS-aware handler; this can only be done once per JVM
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream fileInputStream = null;
        try {
            // args[0] is the fully qualified HDFS URL of the file to read
            fileInputStream = new URL(args[0]).openStream();
            // Copy the file to standard output using a 4 KB buffer
            IOUtils.copyBytes(fileInputStream, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(fileInputStream);
        }
    }
}

Once the program is ready, you can package it into a .jar file and add it to the Hadoop classpath:

export HADOOP_CLASSPATH=hdfs_read.jar

Now, you can use the class name as a command to read a file from HDFS, as shown in the following code:

hadoop HDFSReadUsingURL hdfs://localhost/user/chanchals/test.txt

We have already mentioned that this approach will not work in every scenario. For such cases, there is another approach, where the FileSystem class API can be used to read an HDFS file. There are basically two steps involved when we use the FileSystem class to read a file from HDFS:

  • Creating a FileSystem instance: The first step is to create a FileSystem instance. HDFS provides different static factory methods for creating a FileSystem instance, and each method can be used in a different scenario:
    public static FileSystem get(Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf, String user) throws IOException, InterruptedException

The Configuration object is common to all of these methods; it contains client and server configuration parameters, which are set by reading properties from the core-default.xml and core-site.xml files. In the second method, the URI object tells the FileSystem which URI scheme to use, for example hdfs://.

  • Calling an open method to read a file: Once the FileSystem instance has been created, we can call open() to get an input stream for a file. FileSystem provides two signatures for the open method, as follows:
    public FSDataInputStream open(Path f) throws IOException

    public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

In the first method, the buffer size is not specified and the default buffer size of 4 KB is used, while in the second method you can specify the buffer size explicitly. The return type of both methods is FSDataInputStream, which extends DataInputStream and allows you to read any part of a file. The class is declared as follows:

package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {

}

The Seekable and PositionedReadable interfaces allow you to read a file from any seekable position. By a seekable position, we mean a position whose value is not greater than the file length; seeking beyond the end of the file results in an IOException. The Seekable interface is defined as follows:

public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
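
Before we put everything together, the following sketch shows how these pieces can be combined: it uses the user-aware get() factory method, opens the file with an explicit buffer size (the second open() signature), and then exercises Seekable and PositionedReadable. The class name HDFSSeekRead, the hdfsuser user name, the 8 KB buffer, and the 16-byte positioned read are illustrative assumptions rather than part of the original example:

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HDFSSeekRead {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Third factory variant: connect as a specific (placeholder) user
        FileSystem fileSystem = FileSystem.get(URI.create(uri), conf, "hdfsuser");
        FSDataInputStream in = null;
        try {
            // Second open() signature: explicit 8 KB buffer instead of the 4 KB default
            in = fileSystem.open(new Path(uri), 8192);
            // First pass: copy the whole file to standard output
            IOUtils.copyBytes(in, System.out, 4096, false);
            // Seekable: jump back to the beginning and copy the file again
            in.seek(0);
            IOUtils.copyBytes(in, System.out, 4096, false);
            // PositionedReadable: read 16 bytes from offset 0 without
            // moving the stream's current position
            byte[] header = new byte[16];
            in.readFully(0, header, 0, header.length);
            System.out.println(new String(header, StandardCharsets.UTF_8));
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Note that readFully() throws an EOFException if the requested range extends beyond the end of the file, which is consistent with the seekable-position rule described previously.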

Now, let's write a program to read the HDFS file by using the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URI;

public class HDFSReadUsingFileSystem {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Create a FileSystem instance for the scheme given in the URI
        FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
        InputStream fileInputStream = null;
        try {
            // Open the file and copy it to standard output using a 4 KB buffer
            fileInputStream = fileSystem.open(new Path(uri));
            IOUtils.copyBytes(fileInputStream, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(fileInputStream);
        }
    }
}

To execute and test this program, we need to package it into a .jar file, as we did previously, add it to the Hadoop classpath, and use it as follows:

hadoop HDFSReadUsingFileSystem filepath