Updated on 2024-08-16 GMT+08:00

HDFS Java APIs

For details about HDFS APIs, see http://hadoop.apache.org/docs/r2.7.2/api/index.html.

Common HDFS APIs

Common HDFS Java classes are as follows:

  • FileSystem: the core class of client applications. For details about its common APIs, see Table 1.
  • FileStatus: records the status of files and directories. For details about its common APIs, see Table 2.
  • DFSColocationAdmin: used to manage Colocation group information. For details about its common APIs, see Table 3.
  • DFSColocationClient: used to operate Colocation files. For details about its common APIs, see Table 4.
    • The system stores only the mapping between nodes and locator IDs; it does not store the mapping between files and locator IDs. When a file is created by using a Colocation API, the file is created on the node that corresponds to a locator ID. File creation and writing must use Colocation APIs.
    • After the file is written, subsequent operations on the file can use other open source APIs in addition to Colocation APIs.
    • The DFSColocationClient class inherits from the open source DistributedFileSystem class and therefore provides its common APIs. You are advised to use DFSColocationClient to perform operations related to Colocation files.
Table 1 Common FileSystem APIs

API

Description

public static FileSystem get(Configuration conf)

FileSystem is the API class that the Hadoop class library provides for users. FileSystem is an abstract class, so a concrete instance can be obtained only by using the get method. The get method has multiple overloads; this one is commonly used.

public FSDataOutputStream create(Path f)

This API is used to create files in the HDFS. f indicates a complete file path.

public void copyFromLocalFile(Path src, Path dst)

This API is used to upload local files to a specified directory in the HDFS. src and dst indicate complete file paths.

public boolean mkdirs(Path f)

This API is used to create folders in the HDFS. f indicates a complete folder path.

public abstract boolean rename(Path src, Path dst)

This API is used to rename a specified HDFS file. src and dst indicate complete file paths.

public abstract boolean delete(Path f, boolean recursive)

This API is used to delete a specified HDFS file or directory. f indicates the complete path to be deleted, and recursive specifies whether to delete the contents of a directory recursively. recursive must be true to delete a non-empty directory.

public boolean exists(Path f)

This API is used to check whether a specified HDFS file or directory exists. f indicates a complete file path.

public FileStatus getFileStatus(Path f)

This API is used to obtain the FileStatus object of a file or folder. The FileStatus object records status information of the file or folder, including the modification time and the path.

public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)

This API is used to query the block locations of a specified file in an HDFS cluster. file indicates the FileStatus object of the file, and start and len specify the byte range (start offset and length) for which block locations are returned.

public FSDataInputStream open(Path f)

This API is used to open the input stream of a specified file in the HDFS and read the file using the API provided by the FSDataInputStream class. f indicates a complete file path.

public FSDataOutputStream create(Path f, boolean overwrite)

This API is used to create the output stream of a specified file in the HDFS and write the file using the API provided by the FSDataOutputStream class. f indicates a complete file path. If overwrite is true, the file is overwritten if it exists; if overwrite is false, an error is reported if the file exists.

public FSDataOutputStream append(Path f)

This API is used to open the output stream of an existing file in the HDFS and append data to the file using the API provided by the FSDataOutputStream class. f indicates a complete file path.
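The following is a minimal sketch of how several APIs in Table 1 fit together (get, mkdirs, create, open, rename, and delete). It assumes that the Hadoop client libraries are on the classpath and that fs.defaultFS in the loaded configuration points to the target file system. The class name FileSystemBasicsSketch, the base directory /tmp/hdfs-api-sketch, and the file names are hypothetical and used only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical class name; not part of the Hadoop API.
class FileSystemBasicsSketch {

    /** Writes text to a file, replacing any existing file at that path. */
    static void writeText(FileSystem fs, Path file, String text) throws IOException {
        // overwrite = true: an existing file is replaced instead of causing an error.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Reads the whole file back as a UTF-8 string. */
    static String readText(FileSystem fs, Path file) throws IOException {
        try (FSDataInputStream in = fs.open(file)) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            // close = false: the try-with-resources block closes the input stream.
            IOUtils.copyBytes(in, buffer, 4096, false);
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Runs mkdirs, create, rename, open, and delete under baseDir. */
    static String roundTrip(FileSystem fs, Path baseDir) throws IOException {
        Path dir = new Path(baseDir, "demo");
        Path src = new Path(dir, "a.txt");
        Path dst = new Path(dir, "b.txt");

        fs.mkdirs(dir);
        writeText(fs, src, "hello hdfs");
        // rename reports failure through its return value, so check it.
        if (!fs.rename(src, dst)) {
            throw new IOException("rename failed: " + src + " -> " + dst);
        }
        String content = readText(fs, dst);
        // recursive = true is needed because the directory is not empty.
        fs.delete(dir, true);
        return content;
    }

    public static void main(String[] args) throws IOException {
        // FileSystem.get selects the concrete class based on the configuration.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        System.out.println(roundTrip(fs, new Path("/tmp/hdfs-api-sketch")));
    }
}
```

The helper methods take a FileSystem argument rather than calling FileSystem.get themselves, so the same code can be exercised against a local file system during development.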

Table 2 Common FileStatus APIs

API

Description

public long getModificationTime()

This API is used to query the modification time of a specified HDFS file.

public Path getPath()

This API is used to query the path of the file or directory that the FileStatus object describes.
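As an illustration of the FileStatus getters together with getFileStatus and getFileBlockLocations from Table 1, the following sketch builds a one-line summary of a file. It assumes that the Hadoop client libraries are on the classpath; the class name FileStatusSketch and the summary format are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical class name; not part of the Hadoop API.
class FileStatusSketch {

    /** Summarizes a file's path, modification time, and number of block locations. */
    static String describe(FileSystem fs, Path file) throws IOException {
        FileStatus status = fs.getFileStatus(file);
        // Request the block locations that cover the whole file:
        // start offset 0, length equal to the file length.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        return status.getPath()
                + " modified=" + status.getModificationTime()
                + " blocks=" + blocks.length;
    }
}
```

getModificationTime returns milliseconds since the epoch, so the value can be passed to java.time.Instant.ofEpochMilli if a readable timestamp is needed.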

Table 3 Common DFSColocationAdmin APIs

API

Description

public Map<String, List<DatanodeInfo>> createColocationGroup(String groupId,String file)

This API is used to create a group based on the locatorIds information in the file. file indicates the file path.

public Map<String, List<DatanodeInfo>> createColocationGroup(String groupId,List<String> locators)

This API is used to create a group based on the locatorIds information in the list in the memory.

public void deleteColocationGroup(String groupId)

This API is used to delete a group.

public List<String> listColocationGroups()

This API is used to return all group information of Colocation. The returned list of group IDs is sorted by creation time.

public List<DatanodeInfo> getNodesForLocator(String groupId, String locatorId)

This API is used to obtain the list of all nodes in the locator.

Table 4 Common DFSColocationClient APIs

API

Description

public FSDataOutputStream create(Path f, boolean overwrite, String groupId,String locatorId)

This API is used to create an FSDataOutputStream in Colocation mode to allow users to write files to the f path.

f is the HDFS path.

overwrite indicates whether an existing file can be overwritten.

The groupId and locatorId that the user specifies for the file must already exist.

public FSDataOutputStream create(final Path f, final FsPermission permission, final EnumSet<CreateFlag> cflags, final int bufferSize, final short replication, final long blockSize, final Progressable progress, final ChecksumOpt checksumOpt, final String groupId, final String locatorId)

The function of this API is the same as that of FSDataOutputStream create(Path f, boolean overwrite, String groupId, String locatorId) except that users are allowed to customize the checksum options.

public void close()

This API is used to close the connection.

Table 5 HDFS client WebHdfsFileSystem API

API

Description

public RemoteIterator<FileStatus> listStatusIterator(final Path)

This API fetches information about child files and folders through multiple requests by using a remote iterator. This prevents the user interface from becoming slow when a directory has a large number of children.
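The following sketch shows one way to consume the iterator returned by listStatusIterator. It is written against the FileSystem base class, which declares the same method in Apache Hadoop 2.7 and later, so it is assumed to apply to a WebHdfsFileSystem instance obtained through a webhdfs:// URI as well. The class name ListStatusIteratorSketch and the method childNames are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Hypothetical class name; not part of the Hadoop API.
class ListStatusIteratorSketch {

    /** Collects the names of the direct children of dir. */
    static List<String> childNames(FileSystem fs, Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        RemoteIterator<FileStatus> children = fs.listStatusIterator(dir);
        // hasNext and next can throw IOException because they may contact the server.
        while (children.hasNext()) {
            names.add(children.next().getPath().getName());
        }
        return names;
    }
}
```

This example collects every name for brevity. A caller that processes each entry inside the loop, instead of storing them all, avoids holding the full listing in memory.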

Using API-based Glob Path Mode to Obtain LocatedFileStatus and Open Files from FileStatus

The following APIs are added to DistributedFileSystem to obtain the FileStatus with a block location and open the file from the FileStatus object. These APIs reduce the number of RPC calls from clients to the NameNode.

Table 6 FileSystem APIs

API

Description

public LocatedFileStatus[] globLocatedStatus(Path, PathFilter, boolean) throws IOException

Returns an array of LocatedFileStatus objects whose file paths comply with the path filtering rule.

public FSDataInputStream open(FileStatus stat) throws IOException

If the stat object is an instance of LocatedFileStatusHdfs and the instance has location information, the input stream is created directly without contacting the NameNode.