a-z in hadoop: August 2016

Sunday, 28 August 2016

Git Commands (Part-1):

Git is a powerful tool used for software and other version control tasks, but you only feel its features when you use its commands. Git commands are more flexible than any of the other tools. You can turn any folder into a git repository to track its changes. As I said earlier, git creates a hidden ‘.git’ folder to track the changes.

Let’s dive into it.

How to create/initialize git:

Note: Git Shell accepts most Unix and DOS commands.

Go to the folder in which you would like to initialize git, then enter ‘git init’. It will create an empty repository for you.
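As a minimal sketch of the step above (the folder name git_demo is hypothetical and lives in a throwaway temp location), initializing a repository looks roughly like this:

```shell
# Create a throwaway folder (hypothetical name) and turn it into a repository.
repo_dir="$(mktemp -d)/git_demo"
mkdir -p "$repo_dir"
cd "$repo_dir"
git init
# The hidden .git folder now holds everything git needs to track changes.
ls -a
```

Nothing else in the folder is touched; only the hidden ‘.git’ folder is added.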

How to add content to your git:

Before adding content you can see what is available to add by using ‘git status’:

We have one branch, master, which is at its initial commit, and the Demo_Code.scala file needs to be committed.

Now I know which files are available to add. I am adding them using ‘git add .’. Here I am using ‘.’ because all the files in this directory need to be added to the index. If you want to add only one file, you can use ‘git add file_name’.

Note: If you get a warning message like ‘LF will be replaced by CRLF’, you can ignore it. It has nothing to do with your code. When you copy a file from Unix to Windows, the end of line represented as LF is converted to CRLF.

If you don’t want to see this message again and again, you can turn autocrlf off in your config file with ‘git config core.autocrlf false’.

Once you execute the git add command, the files are added to the index. The index goes by several names, among them: cache, staging area, current directory cache. It is not the git repository; the index is a staging area where changes wait to be committed.
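Here is a small sketch of the status/add flow described above, run in a throwaway folder. The file content is made up for illustration; only the file name comes from this post.

```shell
cd "$(mktemp -d)" && git init -q
echo 'object Demo_Code' > Demo_Code.scala   # illustrative content only

git status --short    # prints "?? Demo_Code.scala" (untracked)
git add .
git status --short    # prints "A  Demo_Code.scala" (staged in the index)

# List what is currently sitting in the index.
staged_files="$(git ls-files)"
echo "$staged_files"
```

At this point nothing is in the repository yet; the file is only staged.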

A simple Git architecture looks like this: working directory --> (git add) --> index --> (git commit) --> repository.

The changes are now in the index. Next I commit them to the repository using ‘git commit -m "your commit comments"’.

All my files are added to my local repository.

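We can use ‘git commit -am "provide your comments"’ as a shortcut to add your modified files to the index and commit the changes to the git repository in one command. Note that ‘-a’ only picks up files git is already tracking; brand new files still need ‘git add’ first.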
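To make the two commit styles concrete, here is a hedged sketch in a throwaway repository. The user name, email and commit messages are placeholders I made up for the example.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User"           # placeholder identity, local to this repo
git config user.email "demo@example.com"

# First commit: stage explicitly, then commit.
echo 'line one' > Demo_Code.scala
git add Demo_Code.scala
git commit -q -m "Add Demo_Code.scala"

# Second commit: -a stages the already-tracked file for us.
echo 'line two' >> Demo_Code.scala
git commit -q -am "Add second line to Demo_Code.scala"

commit_count="$(git rev-list --count HEAD)"
echo "$commit_count"
```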

How to check your changes in repository:

With our previous commit we also provided comments. It is very important to provide valid and meaningful comments about your changes. We will use these comments to identify commits, and for every commit git generates a SHA-1, which we have already discussed in my previous post. We can see all the commits using ‘git log’.

Your commits can be identified using:

1. Date and time of the commit.
2. Your comments.
3. The order: commits are listed from newest to oldest, so your most recent commit is on top, followed by the older ones.
4. Who made the commit, with their email address.

We can get a short log using ‘git shortlog’

We can see only the most recent commit using ‘git log -n 1’ (plain ‘git log’ lists all of them)

If we would like to see the ‘N’ most recent commits, we can use ‘git log -n 3’:

If we would like to see each commit on one line, with its SHA and commit comments, we can use ‘git log --oneline’

Note: Commit comments are essential when we are trying to analyse our previous commits. In the above screen shot, if we look at the second commit, the commit comments are vague: they do not tell you which file has been removed. It is recommended to provide good, short comments.

If we would like to decorate the log we can use ‘git log --oneline --decorate’. Here ‘--decorate’ provides details about your branch and tag information, which we will explore in the next blog.

If we would like to see the change summary and statistics, we can use ‘git log --stat --summary’

Note: the statistics show what has been added with ‘+’ and what has been deleted with ‘-’.

We can filter logs by author name using ‘git log --author="Niranjan"’

We can also pass an email address to filter by author: ‘git log --author="b.niranjanrao@gmail.com"’

We can filter commits whose comments match a regular expression using ‘git log --grep="Init"’

We can see the log for a single file using ‘git log file_name’
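The log commands above can be tried together in a scratch repository. This is only a sketch: the email address, file names and commit messages are invented for the example.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Niranjan"
git config user.email "niranjan@example.com"   # placeholder address

echo a > a.txt && git add a.txt && git commit -q -m "Initial commit"
echo b > b.txt && git add b.txt && git commit -q -m "Add b.txt"

git log -n 1                     # only the most recent commit
git log --oneline --decorate     # one line per commit, with branch/tag names
git log --stat --summary         # per-file change statistics
git shortlog HEAD                # commits grouped by author

# Filters: by commit comment, and by author restricted to one file.
grep_hits="$(git log --format=%s --grep="Init")"
author_hits="$(git log --format=%s --author="Niranjan" -- b.txt)"
echo "$grep_hits"
echo "$author_hits"
```

I pass HEAD to ‘git shortlog’ here because, when it is not run from an interactive terminal, it otherwise waits for log text on standard input.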

How to find differences between your working directory and git:

Differences between the working directory and the git index: use ‘git diff’
Differences between the git index and the last commit in the repository: use ‘git diff --cached’
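A quick way to see which comparison each command makes is to change a file, look, stage it, and look again. The sketch below assumes a scratch repository with a made-up file notes.txt.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"
echo 'v1' > notes.txt && git add notes.txt && git commit -q -m "Add notes.txt"

echo 'v2' > notes.txt
unstaged="$(git diff --name-only)"             # working directory vs index

git add notes.txt
unstaged_after_add="$(git diff --name-only)"   # nothing left unstaged
staged="$(git diff --cached --name-only)"      # index vs last commit

echo "before add: $unstaged | after add: $unstaged_after_add | staged: $staged"
```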

How to remove Indexed/Cached file:

We can remove a file from the index/cache using ‘git rm --cached file_name’. The file itself stays in your working directory.

We can remove more than one file at a time by listing several file names:
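For example, a minimal sketch with three made-up files, two of which are taken back out of the index in one command:

```shell
cd "$(mktemp -d)" && git init -q
echo 1 > keep_me.txt
echo 2 > drop_me.txt
echo 3 > drop_me_too.txt
git add .

# Unstage two files at once; they stay on disk.
git rm -q --cached drop_me.txt drop_me_too.txt

indexed="$(git ls-files)"
echo "$indexed"
ls
```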

We will continue exploring the rest of the git commands in the next blog. See you there.

Friday, 26 August 2016

GitHub- Source Code Version Control Work Flow- Basics

In this blog I am planning to explore some of the key features of Git. Git is useful when we are developing code in a distributed manner, independently and off the network. GitHub is a web-based repository hosting service; it provides a platform to save your code for open source projects. GitHub provides its services through a Web GUI, a command line tool and desktop tools. In this blog I am focusing on GitHub Desktop features.

1. Git Background:

Git is an open source project which is compatible with most operating systems, such as Mac, Linux, and Windows.
It’s a distributed version control system, aimed at speed and data integrity.
Git is independent of the network and of any central server.
Every git directory on every computer is a full-fledged repository with complete history and full version-tracking capabilities.

2. GitHub account:

If you don’t have a GitHub account, create one at: https://github.com/

Note: At the time of writing, the free version of a GitHub account creates repositories that are open to all; if you’re looking for secured/private code management you have to choose the paid version.

3. How to install Github on Desktop:

Download the exe file from https://desktop.github.com/

Run the exe file; it will take a few minutes to check Internet connectivity.

After GitHub Desktop is installed we should be able to see the Git Shell and GitHub tools under All Programs. In this blog I am planning to show how to use GitHub and manage your code.

Note: Provide the name and email address of the account we created in #2; this enables you to make commits on the repository.

4. Creating a new git project on Desktop:

4.1. We can create repository as shown below:

Once the git repository is created we can see the .gitattributes and .gitignore files. There are three specific reasons to create these ‘.git*’ files.

a. Even if the folder is otherwise empty, git will recognize the folder.

b. The .gitignore file lists the file types which git has to ignore during its operations.

c. A .gitattributes file is a simple text file that gives attributes to path names.
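To show what a .gitignore does in practice, here is a small sketch. The two patterns (‘*.class’ and ‘target/’) are my own assumptions for a Scala/Maven style project, not the contents GitHub Desktop generates.

```shell
cd "$(mktemp -d)" && git init -q

# Hypothetical ignore rules: compiled classes and the Maven build folder.
printf '%s\n' '*.class' 'target/' > .gitignore

touch Demo.class Demo.scala
git status --short      # lists .gitignore and Demo.scala, but not Demo.class

# Ask git directly whether a path is ignored.
ignored="$(git check-ignore Demo.class)"
echo "$ignored"
```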

4.2. We can drag and drop the complete folder that holds your project files on to GitHub Desktop; it will create a git repository under the same folder:

Note: Git doesn’t make any changes to your folder, it just creates some hidden files/folder to track your file changes.

4.3. How to add a new file, update files, or remove files in your git repository:

If you have new files or file changes in your repository folder, you can provide a ‘Summary’ and a ‘Description’ in the ‘Changes’ tab to commit your changes to your local git repository. The Description is optional.

How to remove/rename a file: Let’s say that Demo_Test_Rename.scala has been renamed to Demo_Test.scala. GitHub recognizes these changes like this:

a. File renamed – GitHub understands it as one file deleted and a new file created.

b. File deleted – GitHub understands it as the file having been deleted.

Note: Without a first commit you can’t push your local repository to the host (github.com).

4.4. How to check your commits: Using History you can see your commits. Based on your comments and the date and time of the commits, you can identify each one.

4.5. SHA-1: It’s a very important feature of git. SHA-1 is a hash algorithm; git uses its checksum to identify and compare added and committed files and file changes. The algorithm generates a 40-character hexadecimal string, which is a unique identifier for each commit in your repository.

Note: In the above GitHub Desktop screen shot the SHA is shown with only 7 characters. Since we have very few commits, each commit can be uniquely identified with fewer than 10 characters, so it is shown as a 7-character string. If we look at the log from Git Shell we can see the full 40-character string, or we can copy and paste it to see all 40 characters.
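The short and full forms of a SHA can be compared from the shell. This is a sketch in a scratch repository with a placeholder identity and file; the actual hash values will differ on every run because they depend on the commit time.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"
echo 'test content' > sample.txt
git add sample.txt && git commit -q -m "Add sample.txt"

full_sha="$(git rev-parse HEAD)"             # full commit identifier
short_sha="$(git rev-parse --short HEAD)"    # abbreviated, still unique here
blob_sha="$(git hash-object sample.txt)"     # files get a SHA-1 too

echo "full:  $full_sha"
echo "short: $short_sha"
echo "blob:  $blob_sha"
```

The short form is simply a prefix of the full one, which is why either can be used to refer to the same commit.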

4.6. How to compare file changes:

Now let’s say that we made some changes to “LineCount.scala” and committed them to our local repository. We can compare the versions and see what changes were made in our commit.

4.6.1. Comparison between your working directory and git repository:

You can find the difference between your working directory and git repository easily from ‘Changes’ tab as shown below:

Here I just split the comment line from one line into two. GitHub marks these changes with ‘-’ and ‘+’.

4.6.2. Comparing changes between two commits:

In our repository we have two commits, 41431abf38639dd01a077a5c33b9d046b9bf35fb and 28b7e310a79e036ca296c1de48da4f6aa1b4b8d5; the changes between them can be viewed at:
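The same comparison can be done from Git Shell with ‘git diff’ followed by the two commit identifiers. Below is a sketch in a scratch repository; the file content is invented, and I refer to the two commits as HEAD~1 and HEAD instead of pasting SHAs.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"

echo '// count the lines in a file' > LineCount.scala
git add LineCount.scala && git commit -q -m "Add LineCount.scala"

printf '%s\n' '// count the lines' '// in a file' > LineCount.scala
git commit -q -am "Split comment over two lines"

# Older commit first, newer commit second; '-' lines are old, '+' lines are new.
git diff HEAD~1 HEAD
changed_files="$(git diff --name-only HEAD~1 HEAD)"
echo "$changed_files"
```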

5. How to create a new GitHub.com repository from GitHub Desktop:

Click on Publish --> provide your details and Publish & do the initial commit.

6. Cloning a new project:

We can clone a repository in two ways:

6.1. Cloning a Github.com repository from Github.com:

Open your repository, click on “Set up in Desktop” and select the folder in which to create the repository.

Now we can see the GitHub.com repository in GitHub Desktop:

6.2. Cloning a Github.com repository from GitHub desktop:

Go to the top right corner of GitHub Desktop and click on ‘+’ --> ‘Clone’ --> select the repository to be cloned. Then select the folder where you would like to create the local repository.

Now we can see the GitHub.com repository in GitHub Desktop:

7. How to sync your changes:

Let’s say that you’re working on the A-ZinHadoop repository using GitHub Desktop. Once your changes are successful you can push them from GitHub Desktop to https://github.com/repository using the Sync option.

8. Branches: A branch can be created to implement your new ideas without creating storage overhead and without impacting the code on other branches. At the same time other members can work on master or on new branches and collaborate on changes. Once the idea is successful we can merge it into the master branch.

Note: Before creating a branch, the changes on your current branch should be committed so that the working directory is clean.
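From Git Shell the same branch workflow can be sketched as below. The branch name new_idea, the file and the identity are placeholders; the starting branch is read from git because it may be called master or main depending on your git settings.

```shell
cd "$(mktemp -d)" && git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.com"
echo 'base' > app.txt && git add app.txt && git commit -q -m "Base commit"

main_branch="$(git symbolic-ref --short HEAD)"   # master or main
git status --short                               # prints nothing: clean, safe to branch

git checkout -q -b new_idea                      # create and switch to the branch
echo 'idea' >> app.txt
git commit -q -am "Try the new idea"

git checkout -q "$main_branch"                   # back to the original branch
git merge -q new_idea                            # bring the idea in

last_line="$(tail -n 1 app.txt)"
echo "$last_line"
```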

Friday, 19 August 2016

Lambda Architecture Overview:

Lambda Architecture (LA) is a scalable and fault-tolerant data processing architecture.

A few years back, Big Data analysis was done only through batch processing using Hadoop. The evolution of Big Data technologies has made real-time Big Data analysis possible. One approach to getting real-time data for analytics is the Lambda Architecture.

The underlying motivations for building systems with the Lambda Architecture are:

· The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
· To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
· The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
· The system should be extensible so that features can be added easily, and it should be easy to debug and require minimal maintenance.

Essentially, the Lambda Architecture comprises the following components, processes, and responsibilities:

· New Data: All data entering the system is dispatched to both the batch layer and the speed layer for processing.

· Batch layer: This layer has two functions: (i) managing the master dataset, an immutable, append-only set of raw data, and (ii) pre-computing arbitrary query functions, called batch views. Hadoop's HDFS is typically used to store the master dataset, and the batch views are computed using MapReduce.

· Serving layer: This layer indexes the batch views so that they can be queried ad hoc with low latency. To implement the serving layer, technologies such as Apache HBase or ElephantDB are usually used. The Apache Drill project provides the capability to execute full ANSI SQL 2003 queries against batch views.

· Speed layer: This layer compensates for the high latency of updates to the serving layer caused by the batch layer. Using fast and incremental algorithms, the speed layer deals with recent data only. Storm is often used to implement this layer.

· Queries: Last but not least, any incoming query can be answered by merging results from the batch views and the real-time views.
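As a toy illustration of that last point only, the sketch below merges a batch view and a real-time view of page counts by adding them per key. The page names and numbers are invented, and plain shell with awk stands in for the real serving and speed layers.

```shell
# Batch view: counts pre-computed by the batch layer (made-up numbers).
batch_view='page_a 100
page_b 40'

# Real-time view: counts for data that arrived after the last batch run.
realtime_view='page_a 3
page_c 1'

# Answer a query by merging both views: sum the counts per key.
merged_view="$(printf '%s\n%s\n' "$batch_view" "$realtime_view" \
  | awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' \
  | sort)"

printf '%s\n' "$merged_view"
```

Here page_a ends up with 103, page_b with 40 and page_c with 1, so the query sees both the historical and the most recent data.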

Key Word: Lambda , Hadoop , Big Data

How to create Jar using Maven:

In my previous post (Apache Spark IDE Setup with Scala with Maven & Spark Installation in Windows) we saw how to set up Eclipse with Scala and Spark so that we can create and run Scala objects. In this post we will see how to create a jar out of Eclipse with/without Maven and submit it to Spark.

Creating Jar using Maven (mvn):

I am assuming that you have some Scala code in your Maven project and are ready to create a jar. If you haven't created any program yet, please look at Apache Spark IDE Setup with Scala with Maven.

Now we will add the Maven dependencies and class path to pom.xml and then compile, test and package the Maven project. For more details please look at the links below:
1. Maven pom.xml update for mvn dependencies: pom.xml basics & pom.xml update
2. Maven Build Life Cycle: Maven Build Life Cycle.

Let's get started by updating pom.xml:

Open your pom.xml and you will see pom.xml will have xml tags as below:

Now we will add the build tags, which contain the Maven plugins, the compiler level, the Maven Assembly Plugin and the mainClass entry in the manifest that makes the jar executable. I am pasting my Maven project's pom.xml for your reference.

__________________________________Starting of POM.XML____________________________
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>a-z.in.hadoop</groupId>
  <artifactId>Maven_Example1</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>Maven_Example1</name>
  <url>http://maven.apache.org</url>

  <!-- Versions kept in one place and referenced below as ${...} -->
  <properties>
    <jdk.version>1.8</jdk.version>
    <junit.version>3.8.1</junit.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>1.4.1</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>${junit.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <finalName>Maven_Example1</finalName>
    <plugins>

      <!-- Eclipse project support; 2.10 is the last published release of this plugin -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <version>2.10</version>
        <configuration>
          <downloadSources>true</downloadSources>
          <downloadJavadocs>false</downloadJavadocs>
        </configuration>
      </plugin>

      <!-- Compiler level -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>${jdk.version}</source>
          <target>${jdk.version}</target>
        </configuration>
      </plugin>

      <!-- Builds Maven_Example1-jar-with-dependencies.jar during the package phase -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <!-- Main class written into the manifest to make the jar executable -->
              <mainClass>a_z.in.hadoop.Maven_Example1.Line_Count</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

    </plugins>
  </build>
</project>

__________________________________End of POM.XML____________________________

If you know the Maven build life cycle hierarchy you can directly run ‘mvn package’. For all the others who are new to Maven, we can follow all the steps:

Right click on your Maven project --> Run as --> select "Maven clean" as a direct option, OR click on "Maven build..." to clean the project.

This option cleans the target folder under your project, provided the build succeeds.

Now we will create the package using "Maven build..."

Right click on your Maven project --> Run as --> Maven build... to create the package.

After a successful build we should be able to see the target directory with the jar created, as shown below:

Note: Refresh your project after every Maven build, so that you can see the changes in Eclipse.
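If you prefer the command line to the Eclipse menus, the same clean and package steps can be sketched as below. I am assuming mvn is on your PATH and that you run this from the folder containing pom.xml; the jar name follows from the finalName and the jar-with-dependencies descriptor in the pom above.

```shell
# Jar name = <finalName> + "-jar-with-dependencies.jar" (from the assembly plugin).
EXPECTED_JAR="target/Maven_Example1-jar-with-dependencies.jar"

if command -v mvn >/dev/null 2>&1 && [ -f pom.xml ]; then
  # clean removes the target folder, package compiles, tests and builds the jars
  mvn clean package
  ls -l "$EXPECTED_JAR"
else
  echo "mvn or pom.xml not found; run this from your Maven project folder"
fi
```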

Now we submit this jar to the Spark cluster using spark-submit:
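A hedged sketch of the submit command follows. The main class and jar name come from the pom.xml above; the master URL ‘local[2]’ is my assumption for a local test run, so replace it with your own cluster's master URL.

```shell
MAIN_CLASS="a_z.in.hadoop.Maven_Example1.Line_Count"
APP_JAR="target/Maven_Example1-jar-with-dependencies.jar"
MASTER="local[2]"   # assumption: local run with 2 threads

if command -v spark-submit >/dev/null 2>&1 && [ -f "$APP_JAR" ]; then
  spark-submit --class "$MAIN_CLASS" --master "$MASTER" "$APP_JAR"
else
  # Spark or the jar is missing: just show the command that would be run.
  echo "spark-submit --class $MAIN_CLASS --master $MASTER $APP_JAR"
fi
```

If your program expects arguments (for example an input file path), they go after the jar name.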

I hope this post helps you create a Maven project and build a jar out of it. Please comment on this post if you have any questions.

Key Words: MAVEN , Maven , mvn , SPARK , Spark, spark , Scala , scala , SCALA . Eclipse , eclipse , ECLIPSE , Hadoop , Big Data .