Monday, 15 August 2016

Spark Installation on Windows:

In this post we will see how to install Spark on a Windows machine and run a Spark application as a standalone application.

Prerequisites:
We should have the following components installed/configured:
1. Java 
2. Spark
3. Scala
4. SBT

Spark Configuration:
To do this we need to download Spark from spark binary. Choose the package type as "Pre-built for Hadoop x.x" and the download type as "Direct Download", then click on Download Spark.

After downloading the zip file, move it to your preferred location (e.g. D:\Softwares\Spark_Binary\spark-1.3.1-bin-hadoop2.4\) and extract it. Now open the extracted Spark directory, go into the bin folder, type cmd in the Explorer address bar, and hit Enter to open a command prompt there.
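Equivalently, you can open a command prompt anywhere and switch to the bin directory yourself (using the example path above; adjust it to your machine):

cd /d D:\Softwares\Spark_Binary\spark-1.3.1-bin-hadoop2.4\bin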

Note: Please use 7-Zip to extract the binary zip. Other extraction tools may fail to extract it cleanly.



With this, our spark-shell setup is done. Let's test spark-shell:

Type spark-shell at the command prompt from the Spark bin directory and hit Enter. You should see the spark-shell welcome screen as below:

We can run all the spark-shell commands against local files. You can access the Spark UI at http://localhost:4040/ while the shell is running.
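For example, a quick sanity check inside spark-shell (sc is the SparkContext that the shell creates for you; the file path is the README that ships with the Spark binary, as extracted above):

val readme = sc.textFile("D:\\Softwares\\Spark_Binary\\spark-1.3.1-bin-hadoop2.4\\README.md") // load a local file as an RDD
readme.count() // total number of lines
readme.filter(line => line.contains("Spark")).count() // lines mentioning "Spark"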




Scala Installation:

Download the scala-x.x.x.msi file from scala_binary and install it. The installer is a simple next-next flow after selecting the preferred installation location.

After installation we should be able to access scala from the command prompt. To verify, go to your Scala bin directory from the command prompt and enter scala as below:
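For example, from the Scala bin directory you can confirm the installed version (the exact version printed depends on what you installed):

scala -version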




SBT Installation:

Download sbt-x.x.x.msi from sbt_binary and install it. The installer is a simple next-next flow after selecting the preferred installation location.


After installation we should be able to access sbt from the command prompt. To verify, go to your sbt bin directory from the command prompt and enter sbt as below:
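For example, from the sbt bin directory you can confirm the installed version (the first run may take a while, since sbt downloads its own dependencies):

sbt sbtVersion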

In order to access scala and sbt from outside their respective bin folders, we have to add their bin paths to the PATH environment variable, as shown below:
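For example, for the current command prompt session only (assuming Scala and sbt were installed under D:\Softwares\; adjust the paths to your install locations):

set PATH=%PATH%;D:\Softwares\scala\bin;D:\Softwares\sbt\bin

For a permanent change, add the same bin paths to the PATH variable under Control Panel > System > Advanced system settings > Environment Variables.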


Now we can access scala/sbt outside of the bin folders:



Spark Application Setup:
To set up a Spark application, we have to create a workspace with a specific folder structure so that sbt package can create a jar for us. We can then submit this jar to Spark.

Let's create the workspace for sbt:
Choose any folder and create the folders below:
1. D:\Spark\sparkproject\src\main\scala\
Inside this folder, create a file "SimpleApp.scala"
2. D:\Spark\sparkproject\target
3. D:\Spark\sparkproject\
Inside this folder, create a file "simple.sbt"
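If you prefer, the same layout can be created from the command prompt:

mkdir D:\Spark\sparkproject\src\main\scala
mkdir D:\Spark\sparkproject\target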

Now open the "D:\Spark\sparkproject\simple.sbt" file and paste the four lines below:

name := "Simple Project"

version := "1.0"

// The Scala version needs to be changed as per your installed Scala version.
scalaVersion := "2.11.8"

// The spark-core version needs to be changed as per your Spark version.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" 

Now open "D:\Spark\sparkproject\src\main\scala\SimpleApp.scala" and place your scala code, for example:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount{
  def main(args: Array[String]) {
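    // hadoop.home.dir must point at a folder containing bin\winutils.exe (see the note at the end of this post)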
    System.setProperty("hadoop.home.dir", "D:\\Spark\\")

    val logFile = "D:\\Softwares\\Spark_Binary\\spark-1.3.1-bin-hadoop2.4\\README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}


After updating SimpleApp.scala, open a command prompt in the project directory and enter sbt as shown below. It may take several minutes to complete the setup:


After the sbt configuration completes successfully, we can see the following folders in the workspace's target directory:


Now let's create the jar using "sbt package":
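Run it from the project root, so that sbt can find simple.sbt:

D:\Spark\sparkproject>sbt package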


After "sbt package" completes successfully, the jar is created under your Spark project workspace's target directory:

Move this jar to your Spark bin folder as shown below:
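For example (sbt writes the jar under target\scala-2.11\ for Scala 2.11.x; the destination is the example Spark path used above):

copy D:\Spark\sparkproject\target\scala-2.11\simple-project_2.11-1.0.jar D:\Softwares\Spark_Binary\spark-1.3.1-bin-hadoop2.4\bin\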



Now run the spark-submit command below from your command prompt (in the Spark bin directory). Note that all spark-submit options must come before the application jar; anything after the jar is passed to your application as arguments:

spark-submit --class WordCount --master local[2] simple-project_2.11-1.0.jar


Note: To avoid any issues with the Hadoop path on Windows, we can download and extract "hadoop-winutils-2.6.0".
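winutils.exe must sit under the bin folder of the directory that hadoop.home.dir points to. Since SimpleApp.scala above sets hadoop.home.dir to D:\Spark\, that means (assuming winutils.exe was extracted into the current directory):

mkdir D:\Spark\bin
copy winutils.exe D:\Spark\bin\winutils.exe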


Keywords: Maven, Spark, Scala, Eclipse, Lambda, Hadoop, Big Data.
