Hadoop HDFS Commands with Examples


Introduction:

In this article we will discuss the commands that are generally used in an Apache Hadoop based HDFS environment. Before we do so, let us understand Hadoop a little better. Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large datasets in a distributed computing environment. Hadoop is part of the Apache project sponsored by the Apache Software Foundation. HDFS is an Apache Software Foundation project as well, and a subproject of the Apache Hadoop project. Hadoop uses HDFS as its storage system, and data files are accessed as if they resided on one seamless file system.

The following sections explain in detail the various commands that can be used in conjunction with a Hadoop-based HDFS environment, to access and store data in it.

HDFS File System Commands:

Apache Hadoop comes with a simple yet powerful command line interface to access the underlying Hadoop Distributed File System. In this section we will introduce you to the basic and most useful HDFS file system commands, which are broadly similar to UNIX file system commands. Once the Hadoop daemons are up and running, the HDFS file system is ready to be used. File system operations like creating directories, moving files, adding files, deleting files, reading files and listing directories can be done seamlessly on it.


We can get the list of FS shell commands with the command below.

$ hadoop fs -help

user@ubuntu1:~$ hadoop fs -help

Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] <localsrc> ... <dst>]
        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] <path> ...]
        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] <path> ...]
        [-expunge]
        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] <src> <localdst>]
        [-help [cmd ...]]
        [-ls [-d] [-h] [-R] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-usage [cmd ...]]

Most of the commands that we use in an HDFS environment are listed above. From this thorough list of commands, we will take a look at some of the most important ones, with examples. Let us take a look at the commands and their usage:

1. mkdir:

This is no different from the UNIX mkdir command, and it is used to create a directory in an HDFS environment. The -p option tells the command not to fail if the directory already exists, and to create any missing parent directories along the path.

Syntax:

$ hadoop fs -mkdir [-p] <path> ...

Usage:

$ hadoop fs -mkdir /user/hadoop/

$ hadoop fs -mkdir /user/data/

In order to create subdirectories, the parent directory must already exist (unless the -p option is given). If this condition is not met, a ‘No such file or directory’ message will be returned.
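The -p semantics mirror UNIX mkdir -p. As a local analogy (this runs against the local file system, not HDFS, and the paths are purely illustrative), Python's os.makedirs behaves the same way:

```python
import os
import tempfile

# Work in a scratch directory (illustrative only; not an HDFS path).
base = tempfile.mkdtemp()
nested = os.path.join(base, "user", "hadoop", "data")

# Without -p, creating a path whose parent does not exist fails.
try:
    os.mkdir(nested)            # parent "user/hadoop" is missing -> fails
    parent_missing = False
except FileNotFoundError:
    parent_missing = True

# With -p semantics, missing parents are created and an existing
# directory is not treated as an error.
os.makedirs(nested, exist_ok=True)
os.makedirs(nested, exist_ok=True)   # idempotent, no error
print(os.path.isdir(nested))         # True
```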

2. ls:

This is no different from the UNIX ls command, and it is used for listing the directories and files present under a specific directory in an HDFS system. The -ls -R form (or the older -lsr shortcut) may be used for the recursive listing of the directories and files under a specific folder.

The -d option is used to list directories as plain files.

The -h option is used to format the sizes of files in a human-readable manner rather than as a raw number of bytes.

The -R option is used to recursively list the contents of directories.

Syntax:

$ hadoop fs -ls [-d] [-h] [-R] [<path> ...]

Usage:

$ hadoop fs -ls /

$ hadoop fs -lsr /

The command above will list entries that match the specified file pattern; directory entries are of the form shown below:

permissions - userId groupId sizeOfDirectory(in bytes) modificationDate(yyyy-MM-dd HH:mm) directoryName
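When scripting against this output, each line can be split into those fields. A minimal sketch in Python (the sample line below is hypothetical, not taken from a real cluster):

```python
# Parse one (hypothetical) line of `hadoop fs -ls` output into its fields.
line = "drwxr-xr-x   - user supergroup          0 2017-10-15 10:15 /user/data"

parts = line.split(None, 7)  # split on whitespace into at most 8 fields
entry = {
    "permissions": parts[0],
    "replication": parts[1],   # "-" for directories
    "owner": parts[2],
    "group": parts[3],
    "size_bytes": int(parts[4]),
    "modified": parts[5] + " " + parts[6],  # yyyy-MM-dd HH:mm
    "path": parts[7],
}
print(entry["path"], entry["size_bytes"])
```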

3. put:

This command is used to copy files from the local file system to the HDFS filesystem. It is very much similar to the -copyFromLocal command. The copy fails if the file already exists, unless the -f flag is given to the command, in which case the destination is overwritten before the copy.

The -p flag preserves the access and modification times, the ownership and the mode.

Syntax:

$ hadoop fs -put [-f] [-p] <localsrc> ... <dst>

Usage:

$ hadoop fs -put sample.txt /user/data/

4. get:

This command copies files from the HDFS file system to the local file system, just the opposite of the command we have seen just now. It is very much similar to the -copyToLocal command.

Usage:

$ hadoop fs -get /user/data/sample.txt workspace/

5. cat:

This command is similar to the UNIX cat command and is used for displaying the contents of a file on the console.

Usage:

$ hadoop fs -cat /user/data/sampletext.txt

6. cp:

This command is similar to the UNIX cp command, and it is used for copying files from one directory to another directory within the HDFS file system.

Usage:

$ hadoop fs -cp /user/data/sample1.txt /user/hadoop1

$ hadoop fs -cp /user/data/sample2.txt /user/test/in1

7. mv:

This command is similar to the UNIX mv command, and it is used for moving a file from one directory to another directory within the HDFS file system.

Usage:

$ hadoop fs -mv /user/hadoop/sample1.txt /user/text/

8. rm:

This command is similar to the UNIX rm command, and it is used for removing a file from the HDFS file system. The -rm -r form can be used to delete files recursively.

Directories can’t be deleted by the plain -rm command; we need to use -rm -r (recursive remove) to delete directories and the files inside them. Only files can be removed using the plain -rm command.

The -skipTrash option is used to bypass the trash, if enabled, and immediately delete the specified source.

The -f option is used so that, if the file does not exist, no diagnostic message is displayed and the exit status is not modified to reflect an error.

The -r (or -R) option is used to recursively delete directories.

Syntax:

$ hadoop fs -rm [-f] [-r|-R] [-skipTrash] <src> ...

Usage:

$ hadoop fs -rm -r /user/test/sample.txt

9. getmerge:

This is one of the most important and most useful commands on the HDFS filesystem when trying to read the contents of a MapReduce job’s or a Pig job’s output files. It is used for merging a list of files in a directory on the HDFS filesystem into a single file on the local filesystem.

Usage:

$ hadoop fs -getmerge /user/data merged.txt
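Conceptually, -getmerge concatenates the files in the source directory, in name order, into one local file. A local Python sketch of that behavior (using temporary files rather than HDFS, with made-up part-file names in the MapReduce style):

```python
import os
import tempfile

def getmerge_local(src_dir, dest_path):
    """Concatenate every file in src_dir, in sorted name order, into a
    single file at dest_path (local sketch of hadoop fs -getmerge)."""
    with open(dest_path, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            part = os.path.join(src_dir, name)
            if os.path.isfile(part):
                with open(part, "rb") as f:
                    out.write(f.read())

# Demo with fake MapReduce part files.
src = tempfile.mkdtemp()
for i, text in enumerate(["alpha\n", "beta\n"]):
    with open(os.path.join(src, "part-r-0000%d" % i), "w") as f:
        f.write(text)

dest = os.path.join(tempfile.mkdtemp(), "merged.txt")
getmerge_local(src, dest)
print(open(dest).read())    # alpha then beta, one per line
```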


10. setrep:

This command is used to change the replication factor of a file to a specific count, instead of the default replication factor, for the rest of its lifetime in the HDFS file system. If the path is a directory, then the command will recursively change the replication factor of all the files residing in the directory tree, as per the input provided.

The -w option is used to request that the command wait for the replication to complete. This operation may take a considerably long time.

The -R option is accepted for backward compatibility and has no effect.

Syntax:

$ hadoop fs -setrep [-R] [-w] <rep> <path> ...

Usage:

$ hadoop fs -setrep -R 3 /user/hadoop/

11. touchz:

This command can be used to create a zero-byte file on the HDFS filesystem.

Usage:

$ hadoop fs -touchz URI

12. test:

This command is used to test whether an HDFS path exists, whether it has zero length, or whether it is a directory.

The -d option is used to check whether the path is a directory; it returns 0 if it is a directory.

The -e option is used to check whether the path exists; it returns 0 if it exists.

The -f option is used to check whether the path is a file; it returns 0 if the file exists.

The -s option is used to check whether the file size is greater than 0 bytes; it returns 0 if the size is greater than 0 bytes.

The -z option is used to check whether the file size is zero bytes. If the file size is zero bytes it returns 0, otherwise it returns 1.

Usage:

$ hadoop fs -test -[defsz] /user/test/test.txt
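The same checks exist for local paths, so as a local analogy (local filesystem only, not HDFS), here is what each flag corresponds to, with -test's 0-for-true exit-status convention mirrored at the end:

```python
import os
import tempfile

# Create a directory and a zero-byte file to check against (local analogy).
d = tempfile.mkdtemp()
zero = os.path.join(d, "empty.txt")
open(zero, "w").close()

checks = {
    "-d": os.path.isdir(d),                 # path is a directory
    "-e": os.path.exists(zero),             # path exists
    "-f": os.path.isfile(zero),             # path is a regular file
    "-s": os.path.getsize(zero) > 0,        # size greater than 0 bytes
    "-z": os.path.getsize(zero) == 0,       # size is exactly zero bytes
}

# hadoop fs -test returns exit status 0 for "true"; mirror that here.
for flag, ok in sorted(checks.items()):
    print(flag, 0 if ok else 1)
```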

13. expunge:

This command is used to empty the trash in an HDFS system.

Syntax:

$ hadoop fs -expunge

Usage:

user@ubuntu1:~$ hadoop fs -expunge

17/10/15 10:15:22 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

14. appendToFile:

This command appends the contents of all the given local files to the provided destination file on the HDFS filesystem. The destination file will be created if it does not already exist. If the local source is -, then the input is read from stdin.

Syntax:

$ hadoop fs -appendToFile <localsrc> ... <dst>

Usage:

user@ubuntu1:~$ hadoop fs -appendToFile derby.log data.tsv /in/appendfile

user@ubuntu1:~$ hadoop fs -cat /in/appendfile

Sun Oct 15 14:41:10 IST 2017 Thread[main,5,main] Ignored duplicate property derby.module.dataDictionary in jar:file:/home/user/Downloads/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/apache/derby/modules.properties

Sun Oct 15 14:41:10 IST 2017 Thread[main,5,main] Ignored duplicate property derby.module.lockManagerJ1 in jar:file:/home/user/Downloads/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/apache/derby/modules.properties

Sun Oct 15 14:41:10 IST 2017 Thread[main,5,main] Ignored duplicate property derby.env.classes.dvfJ2 in jar:file:/home/user/Downloads/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/apache/derby/modules.properties

15. tail:

This command is used to show the last 1KB of a file.

The -f option is used to show appended data as the file grows.

Syntax:

$ hadoop fs -tail [-f] <file>

Usage:

user@tri03ws-386:~$ hadoop fs -tail /in/appendfile

Sun Oct 15 14:41:10 IST 2017:

Booting Derby version The Apache Software Foundation - Apache Derby - 10.10.1.1 - (1458268): instance a816c00e-0149-e638-0064-0000093808b8 on database directory /home/user/metastore_db with class loader sun.misc.Launcher$AppClassLoader@3485def8

Loaded from file:/home/user/Downloads/apache-hive-0.14.0-bin/lib/derby-10.10.1.1.jar

java.vendor=Oracle Corporation

java.runtime.version=1.7.0_65-b32

user.dir=/home/user

os.name=Linux

os.arch=amd64

os.version=3.13.0-39-generic

derby.system.home=null

Database Class Loader started - derby.database.classpath=''

16. stat:

This command is used to print statistics about the file or directory at the given path, in the specified format. The format accepts the file size in blocks (%b), the group name of the owner (%g), the file name (%n), the block size (%o), the replication factor (%r), the user name of the owner (%u), and the modification date (%y, %Y).

Syntax:

$ hadoop fs -stat [format] <path> ...

Usage:

user@tri03ws-386:~$ hadoop fs -stat /in/appendfile

2014-11-26 04:57:04

user@tri03ws-386:~$ hadoop fs -stat %Y /in/appendfile

1416977824841

user@tri03ws-386:~$ hadoop fs -stat %b /in/appendfile

20981

user@tri03ws-386:~$ hadoop fs -stat %r /in/appendfile

1

user@tri03ws-386:~$ hadoop fs -stat %o /in/appendfile

134217728
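The raw numbers above can be decoded: %Y is the modification time in milliseconds since the epoch (%y is the same instant formatted as a date), and %o is the block size in bytes. A quick check in Python, using the values printed above:

```python
from datetime import datetime, timezone

mod_ms = 1416977824841          # hadoop fs -stat %Y output (milliseconds)
block = 134217728               # hadoop fs -stat %o output (bytes)

# %Y is epoch milliseconds; divide by 1000 to recover the %y-style date.
when = datetime.fromtimestamp(mod_ms / 1000, tz=timezone.utc)
print(when.strftime("%Y-%m-%d %H:%M:%S"))   # 2014-11-26 04:57:04

# %o here is the common HDFS default block size: 128 MB.
print(block == 128 * 1024 * 1024)           # True
```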

17. setfattr:

This command sets an extended attribute name and value for a file or directory on the HDFS filesystem.

The -n option is used to provide the extended attribute name.

The -x option is used to remove the extended attribute from the file or directory.

The -v option is used to provide the extended attribute value. There are 3 different encoding methods available for the value.

1. If the argument is enclosed in double quotes, then the value is the string inside the quotes

2. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number

3. If the argument begins with 0s or 0S, then it is taken as a Base64 encoding
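All three encodings describe the same raw bytes. A sketch of how each form decodes, using Python's standard library (the value "hello" is purely illustrative):

```python
import base64

raw = b"hello"

# 1. Quoted string: the value is the literal string inside the quotes.
assert "hello".encode() == raw

# 2. 0x prefix: the rest of the argument is hexadecimal bytes.
assert bytes.fromhex("68656c6c6f") == raw       # i.e. 0x68656c6c6f

# 3. 0s prefix: the rest of the argument is Base64.
assert base64.b64decode("aGVsbG8=") == raw      # i.e. 0saGVsbG8=

print("all three encodings decode to", raw)
```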

Syntax:

$ hadoop fs -setfattr {-n name [-v value] | -x name} <path>

18. df:

This command is used to show the capacity, free space and used space of the HDFS filesystem. If the filesystem has multiple partitions, and no path to a specific partition is mentioned, then the status of the root partition will be displayed.

The -h option is used to format the sizes in a human-readable manner rather than as a number of bytes.

Syntax:

$ hadoop fs -df [-h] [<path> ...]

19. du:

This command is used to show the amount of space, in bytes, that has been used by the files that match the specified file pattern. Without the -s option, this shows the size summaries one level deep in the directory.

The -s option is used to show the total (summary) size, rather than the size of each individual file that matches the pattern.

The -h option is used to format the sizes of the files in a human-readable manner rather than as a number of bytes.

Syntax:

$ hadoop fs -du [-s] [-h] <path> ...

20. count:

This command is used to count the number of directories, files and bytes under the path that match the provided file pattern.

Syntax:

$ hadoop fs -count [-q] <path> ...

The output columns are as follows:

Without -q: DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME

With -q: QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
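Those columns can be picked apart with a simple split; a sketch in Python (both sample output lines below are hypothetical, not taken from a real cluster):

```python
# Parse a (hypothetical) `hadoop fs -count` output line: the columns are
# DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME.
plain = "           4           10              415744 /user/data"
dir_count, file_count, content_size, path = plain.split()

# With -q, four quota columns are prepended: QUOTA REMAINING_QUOTA
# SPACE_QUOTA REMAINING_SPACE_QUOTA, then the same four columns.
quota = "none inf none inf 4 10 415744 /user/data"
cols = quota.split()
assert cols[4:] == [dir_count, file_count, content_size, path]

print(path, int(content_size))
```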

21. chgrp:

This command is used to change the group of a file or a path.

Syntax:

$ hadoop fs -chgrp [-R] groupname <path> ...

22. chmod:

This command is used to change the permissions of a file. It works similarly to the Linux shell command chmod, with a few exceptions.

The -R option is used to modify the files recursively, and it is the only option currently supported.

The symbolic mode is the same as the mode used for the shell command; the only letters recognized are ‘rwxXt’.

The octal mode is specified in 3 or 4 digits. The first digit of a 4-digit mode may be 1 or 0 to turn the sticky bit ON or OFF respectively. Unlike the shell command, it is not possible to specify only part of the mode.

Syntax:

$ hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...
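The 4-digit octal form can be decomposed with Python's stat module; for example, 1755 is the sticky bit plus rwxr-xr-x:

```python
import stat

mode = 0o1755   # leading 1 turns the sticky bit ON

# The low 3 digits are the usual user/group/other permission bits.
assert mode & 0o777 == 0o755
assert mode & stat.S_IRWXU == 0o700          # user:  rwx
assert mode & stat.S_IRWXG == 0o050          # group: r-x
assert mode & stat.S_IRWXO == 0o005          # other: r-x

# The leading digit controls the sticky bit ('t' in symbolic mode).
assert mode & stat.S_ISVTX == stat.S_ISVTX

# Rendered symbolically for a directory:
print(stat.filemode(mode | stat.S_IFDIR))    # drwxr-xr-t
```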

23. chown:

This command is used to change the owner and group of a file. This command is similar to the shell’s chown command with a few exceptions.

If only the owner or the group is specified, then only the owner or the group is modified via this command. The owner and group names may only consist of digits, letters and any of the characters in [-_./@a-zA-Z0-9]. The names specified are case sensitive as well.

It is better to avoid using ‘.’ to separate the user name and the group, even though Linux allows it. If user names have dots in them and you are using the local file system, you might see surprising results, since the shell command chown is used for local files.
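The allowed-character rule above can be enforced with a simple regular expression before invoking the command; a sketch (the helper name is my own, not part of Hadoop):

```python
import re

# Characters permitted in owner/group names, per the rule above.
VALID_NAME = re.compile(r"[-_./@a-zA-Z0-9]+\Z")

def is_valid_name(name):
    """Hypothetical helper: check an owner or group name before
    passing it to `hadoop fs -chown`."""
    return bool(VALID_NAME.match(name))

print(is_valid_name("hadoop-user.01"))   # True
print(is_valid_name("bad name!"))        # False (space and '!' not allowed)
```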


The -R option modifies the files recursively and is the only option that is currently supported.

Syntax:

$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH

Conclusion:

In this article, we provided a brief introduction to Apache Hadoop and then covered the most important and most commonly used HDFS commands for getting files into and out of the Hadoop Distributed File System (HDFS). We hope this article serves as a one-stop reference for all the necessary commands.
