a-z in hadoop: 2016

Tuesday, 27 December 2016

How to read fsimage:

We can use the Offline Image Viewer (oiv) tool to view the fsimage data in a human-readable format. Analysing the fsimage is often essential to understand the usage pattern: how many zero-byte files have been created, how space is being consumed, and whether the fsimage is corrupt.

Download the fsimage:

hdfs dfsadmin -fetchImage /fsimage

This downloads the latest fsimage from the NameNode into the local /fsimage directory.

16/12/27 05:40:43 INFO namenode.TransferFsImage: Opening connection to http://<nn_hostname>:50070/getimage?getimage=1&txid=latest

16/12/27 05:40:43 INFO namenode.TransferFsImage: Transfer took 0.23s at 89.74 KB/s

Reading fsimage:

We can read the fsimage in several output formats:

1 . Web is the default output format.

2 . XML document

3 . Delimited

4 . Reverse XML.

5 . FileDistribution is the tool for analyzing file sizes in the namespace image.

In this blog I will focus on two output formats: Web and Delimited.

To get the output on web:

Run the oiv command with fsimage as input file:

hdfs oiv -i /fsimage/fsimage_0000000000000005792

16/12/27 05:48:43 INFO offlineImageViewer.FSImageHandler: Loading 9 strings

16/12/27 05:48:43 INFO offlineImageViewer.FSImageHandler: Loading 64 inodes.

16/12/27 05:48:43 INFO offlineImageViewer.FSImageHandler: Loading inode references

16/12/27 05:48:44 INFO offlineImageViewer.FSImageHandler: Loaded 0 inode references

16/12/27 05:48:44 INFO offlineImageViewer.FSImageHandler: Loading inode directory section

16/12/27 05:48:44 INFO offlineImageViewer.FSImageHandler: Loaded 32 directories

16/12/27 05:48:44 INFO offlineImageViewer.WebImageViewer: WebImageViewer started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.

Now open another terminal and run the commands below to read the fsimage.

hdfs dfs -ls webhdfs://127.0.0.1:5978/

hdfs dfs -ls -R webhdfs://127.0.0.1:5978/

We can also get the output in JSON format by using curl:

curl -i http://127.0.0.1:5978/webhdfs/v1/?op=liststatus
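The JSON returned by the liststatus call can be processed further. The Python sketch below is only an illustration: it assumes the response body has been saved without HTTP headers (curl without -i) and that it follows the usual WebHDFS shape of a FileStatuses object holding a FileStatus list. The sample values are invented.

```python
import json

# Hypothetical LISTSTATUS response body; all values are invented for illustration.
SAMPLE_RESPONSE = """
{"FileStatuses": {"FileStatus": [
  {"pathSuffix": "user", "type": "DIRECTORY", "length": 0,
   "owner": "hdfs", "group": "hdfs", "permission": "755",
   "replication": 0, "modificationTime": 1482816600000},
  {"pathSuffix": "tmp", "type": "DIRECTORY", "length": 0,
   "owner": "hdfs", "group": "hdfs", "permission": "777",
   "replication": 0, "modificationTime": 1482816700000}
]}}
"""


def list_entries(response_text):
    """Return (type, owner, length, name) tuples from a LISTSTATUS JSON body."""
    statuses = json.loads(response_text)["FileStatuses"]["FileStatus"]
    return [(s["type"], s["owner"], s["length"], s["pathSuffix"]) for s in statuses]


# Print a listing that loosely resembles 'hdfs dfs -ls'.
for kind, owner, length, name in list_entries(SAMPLE_RESPONSE):
    print(f"{kind:<9} {owner:<6} {length:>8} /{name}")
```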

To write the output to a file in Delimited format:

hdfs oiv -p Delimited -i /fsimage/fsimage_0000000000000005792 -o /fsimage/fsimage.txt

We can read the data in fsimage.txt by running ‘head fsimage.txt’ from the /fsimage directory.
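As a minimal sketch of the kind of analysis mentioned at the start of this post, the Python snippet below counts zero-byte files and sums file sizes per user from Delimited output. It assumes tab-separated fields in the order Path, Replication, ModificationTime, AccessTime, PreferredBlockSize, BlocksCount, FileSize, NSQUOTA, DSQUOTA, Permission, UserName, GroupName; please check this against the output of your own Hadoop version. The sample rows are invented for illustration.

```python
import csv
from collections import Counter

# Assumed column order of the Delimited output; verify against your Hadoop version.
COLUMNS = ["Path", "Replication", "ModificationTime", "AccessTime",
           "PreferredBlockSize", "BlocksCount", "FileSize", "NSQUOTA",
           "DSQUOTA", "Permission", "UserName", "GroupName"]


def summarize(lines):
    """Return (list of zero-byte file paths, bytes used per user)."""
    zero_byte_files = []
    bytes_by_user = Counter()
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip malformed rows and an optional header row.
        if len(row) != len(COLUMNS) or row[0] == "Path":
            continue
        record = dict(zip(COLUMNS, row))
        # Directories also report size 0, so leave them out.
        if record["Permission"].startswith("d"):
            continue
        size = int(record["FileSize"])
        if size == 0:
            zero_byte_files.append(record["Path"])
        bytes_by_user[record["UserName"]] += size
    return zero_byte_files, bytes_by_user


# Hypothetical sample rows standing in for fsimage.txt.
SAMPLE = "\n".join([
    "/user\t0\t2016-12-27 05:30\t1970-01-01 00:00\t0\t0\t0\t-1\t-1\tdrwxr-xr-x\thdfs\thdfs",
    "/user/alice/data.csv\t3\t2016-12-27 05:31\t2016-12-27 05:31\t134217728\t1\t2048\t0\t0\t-rw-r--r--\talice\thdfs",
    "/user/alice/_SUCCESS\t3\t2016-12-27 05:32\t2016-12-27 05:32\t134217728\t0\t0\t0\t0\t-rw-r--r--\talice\thdfs",
    "/user/bob/log.txt\t3\t2016-12-27 05:33\t2016-12-27 05:33\t134217728\t1\t512\t0\t0\t-rw-r--r--\tbob\thdfs",
])

zero_byte_files, bytes_by_user = summarize(SAMPLE.splitlines())
print("zero-byte files:", zero_byte_files)
print("bytes per user:", dict(bytes_by_user))
```

To run it on real data, pass an open file instead of the sample, for example summarize(open("/fsimage/fsimage.txt")).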

Thursday, 24 November 2016

Hortonworks SmartSense Tool (HST)

Hadoop is a distributed cluster environment that handles data of huge volume, velocity and variety on top of many hardware components with differing configurations, which makes a Hadoop cluster complex. Daily operations and the business need to on-board new components (data or technology integrations) add to the workload on the cluster. In this context it is very difficult to maintain cluster health, control operational risk, optimize performance and staff Hadoop operations efficiently.

To address this, Hortonworks offers the Hortonworks SmartSense Tool (HST), a proactive service that monitors the cluster, detects issues, improves cluster performance and speeds up case resolution so that problems can be solved before they occur. It provides insight into operations, performance and security by analyzing diagnostic information from your HDP cluster, and it provides recommendations for the following components:

· The Operating System
· HDFS
· YARN
· MapReduce2
· Apache Hive and Apache Tez

Here, data refers to logs, HDP cluster component metrics and configurations. Hortonworks has compiled a long list of best practices that take a number of cluster diagnostic variables into account to meet customer needs. Ambari helps to collect all the cluster diagnostic variables quickly, generate a bundle and upload it to the Hortonworks SFTP server (or the bundle can be shared manually). Once the bundle is uploaded, HST, which plugs into Ambari, displays support case resolution details and configuration-related recommendations to improve cluster performance, security and operations. We can schedule a weekly or monthly upload of cluster information to the Hortonworks SFTP server to keep an eye on the cluster.

Configuration-related recommendations contain a detailed analysis with:

· Justification for the recommendation.

· How to apply the recommendation.

· The list of components, services and hosts affected by it.

· Pros & cons of the recommendation.

Support case resolution: this requires a Hortonworks subscription.

The subscription provides faster support case resolution by easily capturing log files and metrics for insight into the root causes of issues. Hortonworks SmartSense also provides proactive cluster configuration via an intelligent stream of cluster analytics and data-driven recommendations.

Sunday, 28 August 2016

Git Commands (Part-1):

Git is a powerful tool used for software and other version control tasks, but you can appreciate its features only when you use its commands. Git commands are more flexible than the GUI tools. You can turn any folder into a git repository to track changes. As I said earlier, Git creates a hidden ‘.git’ folder to track the changes.

Let’s dive into it.

How to create/initialize git:

Note: Git Shell accepts most Unix and DOS commands.

Go to the folder in which you would like to initialize git, then enter ‘git init’. It will create an empty repository for you.

How to add content to your git:

Before adding content you can see what is available to add by using ‘git status’:

The output shows one branch, master, with no commits yet (‘Initial commit’), and the file Demo_Code.scala, which needs to be committed.

Now I know which files are available to add. I am adding them using ‘git add .’. Here I am using ‘.’ because all the files in this directory need to be added to the index. If you want to add only one file, you can use ‘git add file_name’.

Note: If you get a warning message like ‘LF will be replaced by CRLF’, you can ignore it. It has nothing to do with your code. When a file is copied from Unix to Windows, the end of line represented as LF is converted to CRLF.

If you don’t want to see this message again and again, you can turn autocrlf off in your config file with ‘git config core.autocrlf false’.
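The warning is only about line endings. As a rough illustration (this is not Git’s own implementation), the Python sketch below shows what an LF to CRLF conversion does to the bytes of a file. The sample content is made up.

```python
def to_crlf(data: bytes) -> bytes:
    """Convert Unix (LF) line endings to Windows (CRLF) line endings."""
    # Normalize first so that existing CRLF pairs are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


unix_text = b"object Demo\nend\n"   # hypothetical file content with LF endings
windows_text = to_crlf(unix_text)

print(unix_text)      # line endings are the single byte \n
print(windows_text)   # line endings are the two bytes \r\n
```

The visible text is identical in both versions; only the invisible end-of-line bytes differ, which is why the warning can be ignored.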

Once you execute the git add command, the files are added to the index. The index goes by several names, among them cache, staging area and current directory cache. It is not the git repository itself; the index is a staging area in which to prepare your changes for commit.

A simple Git architecture looks like this: working directory, index (staging area), then local repository.

The changes have been added to the index. Now I am committing them to the repository using ‘git commit -m "your commit comments"’.

All my files are now in my local repository.

We can use ‘git commit -am "provide your comments"’ as a shortcut to stage and commit changes in one command. Note that ‘-a’ stages only files Git already tracks; new files still need ‘git add’ first.

How to check your changes in repository:

In our previous commit we also provided comments. It is very important to provide valid and meaningful comments about your changes. We use these comments to identify commits, and for every commit git generates a SHA-1, which we discussed in my previous post. We can see all the commits using ‘git log’.

Your commits can be identified using:

1. Date and time of the commit.
2. Your comments.
3. Their order: commits are listed from most recent to oldest, so your most recent commit is on top.
4. Who made the commit, with email address.

We can get a short log using ‘git shortlog’.

We can see the most recent commit using ‘git log -1’.

If we would like to see the ‘N’ most recent commits, we can use ‘git log -n 3’ (here N is 3):

If we would like to see each commit on one line, with its SHA and commit comments, we can use ‘git log --oneline’.

Note: Commit comments are essential when we analyse our previous commits. In the above screen shot, the comment on the second commit is vague: it does not tell you which file was removed. It is recommended to provide good, short comments.

If we would like to decorate the log we can use ‘git log --oneline --decorate’. Here ‘--decorate’ adds branch and tag information, which we will explore in the next blog.

If we would like to see the change summary and statistics: we can use ‘git log --stat --summary’

Note: The statistics show what has been added with ‘+’ and what has been deleted with ‘-’.

We can filter logs by author name using ‘git log --author="Niranjan"’.

We can also pass an email address to filter the log using ‘git log --author="b.niranjanrao@gmail.com"’.

We can filter commits whose comments match a regular expression using ‘git log --grep="Init"’.

We can see the log for a single file using ‘git log file_name’.

How to find differences between your working directory and git:

Differences between the working directory and the index: ‘git diff’
Differences between the index and the last commit: ‘git diff --cached’
Differences between the working directory and the last commit: ‘git diff HEAD’
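As a mental model, ‘git diff’ compares the working directory with the index, while ‘git diff --cached’ compares the index with the last commit. The Python sketch below models the three areas as plain dictionaries; it is a toy illustration, not how Git stores data, and the file contents are hypothetical.

```python
def changed_paths(old, new):
    """Paths whose content differs between two snapshots (path -> content)."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


# Hypothetical snapshots of the three areas.
last_commit = {"Demo_Code.scala": "v1"}
index = {"Demo_Code.scala": "v1", "LineCount.scala": "new"}        # LineCount.scala was staged
working_dir = {"Demo_Code.scala": "v2", "LineCount.scala": "new"}  # Demo_Code.scala edited, not staged

unstaged = changed_paths(index, working_dir)   # roughly what 'git diff' reports
staged = changed_paths(last_commit, index)     # roughly what 'git diff --cached' reports

print("unstaged:", unstaged)
print("staged:", staged)
```

In this model the edit to Demo_Code.scala shows up only as unstaged, and the newly added LineCount.scala shows up only as staged.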

How to remove Indexed/Cached file:

We can remove a file from the index/cache using ‘git rm --cached file_name’

We can remove more than one file at a time:

We will continue exploring the rest of the git commands in the next blog. See you in my next blog.

Friday, 26 August 2016

GitHub- Source Code Version Control Work Flow- Basics

In this blog I am planning to explore some of the key features of Git. Git is useful when we develop code in a distributed manner, independently and off the network. GitHub is a web-based repository hosting service that provides a platform to host code for open source projects. GitHub provides its services through a web GUI, a command-line tool and desktop tools. In this blog I am focusing on GitHub Desktop features.

1. Git Background:

Git is an open source project that is compatible with most operating systems, such as Mac, Linux and Windows.
It’s a distributed version control system, aimed at speed and data integrity.
Git is independent of network access and of a central server.
Every git directory on every computer is a full-fledged repository with complete history and full version-tracking capabilities.

2. GitHub account:

If you don’t have a GitHub account, create one at https://github.com/

Note: With the free version of a GitHub account, repositories are open to all; if you’re looking for secured/private code management you have to choose the paid version.

3. How to install GitHub Desktop:

Download the exe file from https://desktop.github.com/

Run the exe file. It will take a few minutes to check Internet connectivity.

After GitHub Desktop is installed we should be able to see Git Shell and GitHub under All Programs. In this blog I am planning to show how to use GitHub and manage your code.

Note: Provide the name and email address of the account created in #2, which enables you to make commits on the repository.

4. Creating a new git project on Desktop:

4.1. We can create repository as shown below:

Once the git repository is created we can see the .gitattributes and .gitignore files. There are three specific reasons to create these ‘.git*’ files:

a. Git does not track empty folders, so these files ensure the folder is recognized even if it is otherwise empty.

b. The .gitignore file lists the file types which git has to ignore during its operations.

c. The .gitattributes file is a simple text file that gives attributes to path names.
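To give a feel for how .gitignore patterns select files, here is a rough Python approximation. It handles only simple patterns without slashes (such as ‘*.class’), matched against the file name; real .gitignore rules are richer (negation, directory-only patterns, anchoring), so treat this as a sketch. The patterns and paths are hypothetical.

```python
import fnmatch
import posixpath


def is_ignored(path, patterns):
    """Approximate check: does the file name match any simple ignore pattern?"""
    name = posixpath.basename(path)
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


# Hypothetical ignore patterns and repository paths.
patterns = ["*.class", "*.log"]
paths = ["src/Demo_Code.scala", "target/Demo_Code.class", "logs/run.log"]

for path in paths:
    print(path, "-> ignored" if is_ignored(path, patterns) else "-> tracked")
```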

4.2. We can drag and drop the complete folder that holds our project files onto GitHub Desktop; it will create a git repository under the same folder:

Note: Git doesn’t make any changes to your folder, it just creates some hidden files/folder to track your file changes.

4.3. How to add new files, update files or remove files in your git repository:

If you have new files or file changes in your repository folder, you can provide a ‘Summary’ and a ‘Description’ in the ‘Changes’ tab to commit your changes to your local git repository. The description is optional.

How to remove/rename a file: Let’s say that Demo_Test_Rename.scala has been renamed to Demo_Test.scala. GitHub recognizes these changes like this:

a. File renamed: GitHub understands that a file has been deleted and a new file has been created.

b. File deleted: GitHub understands that the file has been deleted.
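A rename shows up as a delete plus a create because Git stores snapshots of content and does not record renames; tools infer a rename afterwards by comparing the removed and the added files. The Python sketch below pairs files whose content is byte-for-byte identical. It is a simplified illustration (Git can also match files that are merely similar), and the file contents are invented.

```python
import hashlib


def detect_exact_renames(deleted, added):
    """Pair deleted and added files with identical content.

    Both arguments map a path to the file content as bytes.
    """
    by_digest = {hashlib.sha1(data).hexdigest(): path for path, data in deleted.items()}
    renames = {}
    for new_path, data in added.items():
        old_path = by_digest.get(hashlib.sha1(data).hexdigest())
        if old_path is not None:
            renames[old_path] = new_path
    return renames


# Hypothetical change set: one file renamed, one genuinely new file.
deleted = {"Demo_Test_Rename.scala": b"object Demo { }\n"}
added = {"Demo_Test.scala": b"object Demo { }\n", "Other.scala": b"object Other { }\n"}

print(detect_exact_renames(deleted, added))
```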

Note: Without first commit you can’t push your local repository to host (github.com).

4.4. How to check your commits: Using History you can see your commits. We can identify them by their comments and by the date & time of the commit.

4.5. SHA-1: It’s a very important feature of git. SHA-1 is a hash algorithm whose checksum git uses to identify added and committed files/file changes. The algorithm generates a 40-character hexadecimal string, which is a unique identifier for each commit in your repository.
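To see where such a 40-character string comes from, the Python sketch below computes the SHA-1 that Git assigns to a file’s content (a “blob” object): the hash covers a short header, ‘blob’ plus the content length and a NUL byte, followed by the content itself. Commits are hashed in the same way with a ‘commit’ header over the commit’s own text. The function name is my own choice for illustration.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 that Git assigns to a blob holding the given content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_sha1(b"")
print(empty_id)        # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(len(empty_id))   # 40
```

The same value can be checked from Git Shell with ‘git hash-object’ on an empty file.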

Note: In the above GitHub Desktop screen shot the SHA contains only 7 characters. Since we have very few commits, each commit can be identified uniquely with fewer than 10 characters, so it is shown as a 7-character string. If we look at the log from Git Shell we will see the 40-character string, or we can copy and paste it to see the 40-character string.
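The idea behind the short SHA can be sketched in a few lines of Python: start from a minimum length (7 is assumed here as the usual default) and lengthen the prefix until every commit is still told apart. This is only an illustration of the principle, not Git’s own code; the two full hashes are the commits from this repository.

```python
def shortest_unique_abbrev(hashes, minimum=7):
    """Smallest prefix length (at least `minimum`) that keeps all hashes distinct."""
    distinct = set(hashes)
    length = minimum
    while len({h[:length] for h in distinct}) < len(distinct):
        length += 1
    return length


commits = [
    "41431abf38639dd01a077a5c33b9d046b9bf35fb",
    "28b7e310a79e036ca296c1de48da4f6aa1b4b8d5",
]

length = shortest_unique_abbrev(commits)
print(length, [c[:length] for c in commits])
```

With only a handful of commits the minimum length is already enough; longer prefixes are needed only when two hashes happen to share their first characters.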

4.6. How to compare file changes:

Now let’s say that we made some changes to “LineCount.scala” and committed them to our local repository. We can compare and see what changes were made by our commit.

4.6.1. Comparison between your working directory and git repository:

You can find the difference between your working directory and git repository easily from ‘Changes’ tab as shown below:

Here I just split the comment line into two lines. GitHub marks these changes with ‘-’ and ‘+’.
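The same ‘-’ and ‘+’ notation can be reproduced with Python’s difflib module. The sketch below splits one comment line into two, as described above; the Scala lines are a hypothetical stand-in for the real file content.

```python
import difflib

# Hypothetical before/after content of LineCount.scala.
before = [
    "// Count the number of lines in the input file and print it\n",
    "val count = lines.count()\n",
]
after = [
    "// Count the number of lines in the input file\n",
    "// and print it\n",
    "val count = lines.count()\n",
]


def diff_lines(old, new, name):
    """Unified diff of two versions of a file, as a list of lines."""
    return list(difflib.unified_diff(old, new, fromfile="a/" + name, tofile="b/" + name))


def changed_lines(diff):
    """Only the removed ('-') and added ('+') lines, without the file headers."""
    return [line for line in diff
            if line[:1] in ("+", "-") and not line.startswith(("+++", "---"))]


diff = diff_lines(before, after, "LineCount.scala")
print("".join(diff))
```

The old comment appears once with ‘-’ and the two new comment lines appear with ‘+’, while the unchanged code line carries no marker.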

4.6.2. Comparing changes between two commits:

In our repository we have two commits, 41431abf38639dd01a077a5c33b9d046b9bf35fb and 28b7e310a79e036ca296c1de48da4f6aa1b4b8d5; the changes between them can be viewed at:

5. How to create a new GitHub.com repository from GitHub Desktop:

Click on Publish → provide your details and Publish & do the initial commit.

6. Cloning a new project:

We can clone a repository in two ways:

6.1. Cloning a GitHub.com repository from GitHub.com:

Open your repository, click on “Set up in Desktop” and select the folder in which to create the repository.

Now we can see the GitHub.com repository in GitHub Desktop:

6.2. Cloning a GitHub.com repository from GitHub Desktop:

In the top right corner of GitHub Desktop, click on ‘+’ → ‘Clone’ → select the repository to be cloned. Then select the folder where you would like to create the local repository.

Now we can see the GitHub.com repository in GitHub Desktop:

7. How to sync your changes:

Let’s say that you’re working on the A-ZinHadoop repository using GitHub Desktop. Once your changes are complete you can push them from GitHub Desktop to https://github.com/repository using the Sync option.

8. Branches: A branch can be created to implement your new ideas without creating storage overhead and without impacting the code on other branches. At the same time, other members can work on master or on new branches to collaborate on changes. Once the idea is successful we can merge it into the master branch.

Note: Before creating a branch, the changes on your current branch should be committed so that the working directory is clean.