Usage and Quirks of fs.default.name in Hadoop FileSystem

January 15, 2013 Bhooshan Mogal


A few weeks ago, while working on a unified storage abstraction for Hadoop at work, I implemented the Hadoop FileSystem interface. To make testing easier, I wanted to make my FileSystem the default for Hadoop URIs on the CLI. That way, I could test my FileSystem implementation by merely executing

‘hadoop fs -ls <path_on_filesystem>’

as opposed to

‘hadoop fs -ls <filesystem_scheme>://<filesystem_authority>:<filesystem_port>/<path_on_filesystem>’

While working on this, I had quite a few ‘Eureka!’ moments about how and where fs.default.name is used in Hadoop.

Going into this task, I had a basic understanding of fs.default.name. As the name of the property suggests, it points to the default URI for all filesystem requests in Hadoop. A filesystem path in Hadoop has two main components: a URI that identifies the filesystem, and a path specifying the location of a file/directory in the filesystem. The URI component is optional. So if the user requests a file by specifying its path only, Hadoop tries to find that path on the filesystem defined by fs.default.name. If fs.default.name is set to an HDFS URI like hdfs://<authority>:<port>, then Hadoop tries to find the path on the HDFS instance whose namenode is running at <authority>:<port>. On the other hand, if the user specifies both the URI and the path in the request, then the URI in the request overrides fs.default.name, and Hadoop tries to find the path on the filesystem identified by the request’s URI. This is a fairly simple and sensible design choice.
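To make that precedence rule concrete, here is a minimal, self-contained Java sketch of the resolution logic. This only mimics the behavior described above; it is not Hadoop's actual code, and the class and method names (DefaultFsResolution, resolveFileSystem) are my own inventions for illustration:

```java
import java.net.URI;

// A toy model of Hadoop's path resolution: a request with a fully qualified
// URI wins; a bare path falls back to the fs.default.name value.
public class DefaultFsResolution {

    // Returns the URI of the filesystem that should serve the request.
    // "defaultFs" stands in for the value of fs.default.name in core-site.xml.
    static URI resolveFileSystem(String defaultFs, String requestPath) {
        URI request = URI.create(requestPath);
        if (request.getScheme() != null) {
            // Fully qualified path: the URI in the request overrides the default.
            return URI.create(request.getScheme() + "://" + request.getAuthority());
        }
        // Bare path: serve it from the default filesystem.
        return URI.create(defaultFs);
    }

    public static void main(String[] args) {
        // Bare path -> goes to the default filesystem.
        System.out.println(resolveFileSystem(
            "hdfs://localhost:9000", "/path/to/file"));          // prints hdfs://localhost:9000
        // Fully qualified path -> the request's URI wins.
        System.out.println(resolveFileSystem(
            "hdfs://localhost:9000", "xyz://localhost:16666/path/to/file")); // prints xyz://localhost:16666
    }
}
```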

However, there are two other interesting characteristics of fs.default.name that I discovered:

1. In the Hadoop client, fs.default.name MUST point to a filesystem URI that can accept filesystem requests.

Assuming my understanding above was right, I set fs.default.name to a local HDFS namenode (hdfs://localhost:9000). I chose ‘xyz’ as the scheme for my filesystem, and localhost:16666 as the authority and port. I then executed my tests as ‘hadoop fs -ls xyz://localhost:16666/path/to/file’, expecting Hadoop to read /path/to/file on my filesystem. Note that I did not have an HDFS namenode running at hdfs://localhost:9000; I presumed that since I was passing a fully qualified path in my request, fs.default.name was not needed. However, as it turned out, I started seeing Connection refused and Connection timed out errors from localhost:9000.

Upon digging further into this, I traced the path of my request through the FsShell class. That was when I had my first “Eureka!” moment of the day: for any filesystem request, Hadoop first tries to ensure that the filesystem fs.default.name points to is available to serve filesystem requests. This happens regardless of whether the user’s request contains a fully qualified URI or just a path. My understanding is that this ensures that if the user specifies just a directory path, the default filesystem is available to serve the request. For example, you cannot set fs.default.name to ‘blah’, even if you always use fully qualified URIs for paths. fs.default.name MUST point to a URI that is available and capable of handling filesystem requests.
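In other words, the client-side core-site.xml must name a live, reachable filesystem even when every command uses fully qualified URIs. A sketch of a valid entry, assuming the Hadoop 1.x property name and a namenode actually listening at localhost:9000:

```xml
<!-- core-site.xml (client side): fs.default.name must point to a filesystem
     that is up and accepting requests, even if you always pass fully
     qualified URIs on the command line. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```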

2. The Hadoop namenode uses fs.default.name to read not only its port number, but also its hostname.

The other usage of fs.default.name I discovered is that the Hadoop namenode uses it to determine its address (hostname and port). This came as a surprise: since the Hadoop (1.x) namenode runs on a single host, I had presumed that its hostname would always be ‘localhost’, and that users would configure only its port by specifying it in fs.default.name. This behavior can present an issue when running Hadoop on a single-host setup, where you would generally have one core-site.xml defining fs.default.name. That single core-site.xml is then used both by the Hadoop client, to get the URI of the default filesystem, and by the namenode, to read its address.
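The derivation is roughly the following. This is a self-contained sketch, not Hadoop's source; in Hadoop 1.x the real parsing lives in NameNode.getAddress(Configuration), and 8020 is the port HDFS assumes when the URI omits one:

```java
import java.net.InetSocketAddress;
import java.net.URI;

// Sketch of how the 1.x namenode derives its bind address from
// fs.default.name: both the hostname AND the port come from the very
// same URI that clients use as their default filesystem.
public class NameNodeAddress {

    static InetSocketAddress getAddress(String fsDefaultName) {
        URI uri = URI.create(fsDefaultName);
        // Fall back to HDFS's conventional default port when none is given.
        int port = (uri.getPort() == -1) ? 8020 : uri.getPort();
        // createUnresolved: we only care about the parsed host/port here.
        return InetSocketAddress.createUnresolved(uri.getHost(), port);
    }

    public static void main(String[] args) {
        InetSocketAddress addr = getAddress("hdfs://localhost:9000");
        // The namenode binds to the host and port from fs.default.name.
        System.out.println(addr.getHostName() + ":" + addr.getPort()); // prints localhost:9000
    }
}
```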

This was surprising because my understanding was that the Hadoop namenode reads all its configuration parameters from hdfs-site.xml, not core-site.xml. If you use the same core-site.xml for both the client and the Hadoop namenode, you can run into the same issue I experienced.

To ease testing, I wanted to use my filesystem as the default. As a result, I set fs.default.name to xyz://localhost:16666 in core-site.xml. I also wanted to run HDFS on the same host, so I ran the HDFS startup script bundled with Hadoop. This caused protocol mismatch errors, since the script was trying to start the Hadoop namenode at localhost:16666 instead of localhost:9000.
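The root cause is one file serving two readers. On a single host, the clash can be untangled by keeping two variants of core-site.xml and putting the right one on each process's classpath; sketches below, with the Hadoop 1.x property name and the hosts/ports from my setup:

```xml
<!-- core-site.xml variant placed on the classpath when starting HDFS:
     the namenode reads its own hostname and port from this value. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

```xml
<!-- core-site.xml variant placed on the classpath of the Hadoop client:
     my (hypothetical) xyz filesystem becomes the default for bare paths. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>xyz://localhost:16666</value>
  </property>
</configuration>
```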

On a production setup, the Hadoop namenode, my filesystem, and the Hadoop client would each live on separate hosts, reducing the chance of this being a problem: I could have separate core-site.xml’s for the client and the Hadoop namenode, each containing a different fs.default.name, and that would work just fine. On a single-host setup with just one core-site.xml, though, this can be a problem. The solution we came up with was to use two different core-site.xml’s: one on the classpath while starting HDFS, the other while running the Hadoop client.

fs.default.name is a crucial HDFS configuration parameter that seems easy to comprehend at a high level. However, I hope this blog post helps clarify two of its finer details.
