
Spark + Scala + Windows 10

Run and test your Spark code without VMs or Docker, to get started.

Introduction

The goal of this post is to help people install and run Apache Spark on a Windows 10 machine without much hassle. The post is composed of pieces of installation content, somewhat like LEGO bricks, that you can work through piece by piece. I hope you find it useful for getting started with your experiments on the Apache Spark framework.

Note: If you really want to build a serious prototype, I strongly recommend installing one of the quick-start Hadoop virtual machines.

Java Virtual Machine

Install a JDK: you first need a JDK, that is, a Java Development Kit. If you do not already have one, you can download it from Oracle’s website and install it.

We need the JDK because, even though you may end up developing in Python or Scala, everything runs on Spark, which is natively developed in Scala. Even PySpark code ultimately drives the Scala-based Spark engine under the hood. Scala, in turn, runs on top of the Java Virtual Machine. So, in order to run Spark code you need a Scala runtime, which is installed by default as part of Spark, and you need Java, or more specifically the JVM, to actually run that Scala code. It is like a technology layer cake.

JDK 8 is a superset of JRE 8, and contains everything that is in JRE 8, plus tools such as the compilers and debuggers necessary for developing applets and applications. JRE 8 provides the libraries, the Java Virtual Machine (JVM), and other components to run applets and applications written in the Java programming language.
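Once the JDK is installed, you can confirm that it is visible from a Command Prompt (the exact version output depends on your installation):

java -version
javac -version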

Scala Binaries

Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages.

  • Download the Scala binaries for Windows
  • Accept the agreement. Select Next and continue to complete installation.
  • You can verify the Scala installation in the folder C:\Program Files (x86)\scala.
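Once %SCALA_HOME%\bin is on your PATH (set up in the Environment Variables section below), you can also verify the installation from a Command Prompt:

scala -version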

Scala IDE

Scala IDE provides advanced editing and debugging support for the development of pure Scala and mixed Scala-Java applications. While one is free to use the Python shell or an interactive interpreter such as Jupyter or Spyder, we will assume pure Scala development and try Scala IDE.

  • Pick the Windows 64-bit version and save it to your Downloads folder. Choose the installer and accept the default file associations if prompted.
  • Move the archive to the D drive and unzip it using WinRAR or 7-Zip. It creates a folder named eclipse.
  • Right-click the eclipse application, create a shortcut, send it to the desktop, and rename it as you like.
  • Open Scala IDE and launch it in a workspace of your choice (in my case D:\workspace).
  • File => New => Scala Project. Name the project HelloWorld. Then select the src folder, right-click to open the context menu, pick New => Scala Object, type “Hello” and click Finish.
  • Writing code: change the code to the following:

object Hello extends App {
  println("Hello, World!")
}

  • Running it: Right-click on the Hello object in your code and select Run As > Scala Application. You’re done!
  • Output: the Console view prints Hello, World!

Download Spark

Since we are not going to use a Hadoop cluster, the version you choose makes no difference. Fortunately, the Apache website makes prebuilt versions of Spark available that run out of the box, precompiled against the latest Hadoop version. You do not have to build anything; you can just download it to your computer, put it in the right place, and be good to go for the most part.

  • Now, we have used Spark 2.3.2 here, but anything beyond 2.0 should work just fine.
  • Make sure you get a prebuilt version, and select the direct download option; the defaults are perfectly fine.
  • It downloads a TGZ (tar in GZip) file. You can use WinRAR to unzip it; extract the files to any location on your drive where your user has enough permissions (or use the command-line alternative below).
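As an alternative to WinRAR: Windows 10 builds from version 1803 onwards ship a bsdtar-based tar.exe, so if yours does, you can extract the archive from a Command Prompt instead (adjust the file name and target drive to your download):

tar -xvzf spark-2.3.2-bin-hadoop2.7.tgz -C D:\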

WinUtils

The official release of Hadoop does not include the binaries (e.g., winutils.exe) required to run Hadoop on Windows; to obtain them, Hadoop must be compiled from source. So instead we grab a 64-bit winutils.exe from a trusted source; feel free to pick one you trust. Place it so that it ends up at a path like D:\winutils\bin\winutils.exe, since the HADOOP_HOME variable below will point at its parent folder.

Environment Variables

Every process has an environment block that contains a set of environment variables and their values. There are two types of environment variables: user environment variables (set for each user) and system environment variables (set for everyone).

The command processor provides the set command to display its environment block or to create new environment variables. You can also view or modify the environment variables by selecting System from the Control Panel, selecting Advanced system settings, and clicking Environment Variables. Each environment block contains the environment variables in the following format:

Var1=Value1\0
Var2=Value2\0
Var3=Value3\0
...
VarN=ValueN\0\0
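For example, in a Command Prompt the set command with no arguments prints the whole environment block, while a prefix filters it to matching variable names (once the variables below exist):

set
set SPARK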

To set environment variables in Windows 10 and Windows 8:

  1. In Search, search for and then select: System (Control Panel).
  2. Click the Advanced system settings link.
  3. Click Environment Variables. In the System Variables section, find the variable you want and select it, then click Edit. If the variable does not exist, click New.
  4. In the Edit System Variable (or New System Variable) window, specify the value of the variable and click OK. Close all remaining windows by clicking OK.
  5. Reopen the Command Prompt window and run your code.

Create the following variables this way:

  • _JAVA_OPTIONS: we set this variable to the value -Xmx512M -Xms512M. It helps avoid Java heap memory problems with the pre-set defaults. You are free to increase the memory allocated.
  • HADOOP_HOME: even though Spark can run without Hadoop, the prebuilt version I downloaded looks for it. To fix this inconvenience, point this variable at the winutils folder (in my case D:\winutils, with the executable at D:\winutils\bin\winutils.exe).
  • JAVA_HOME: we usually already set this variable when installing Java, but it is better to verify that it exists and is correct. (In my case C:\Java\jdk1.8.0_181, since I avoided Program Files because of the space in the path.)
  • SCALA_HOME: the Scala installation folder. If you used the standard location from the installer, this is C:\Program Files (x86)\scala.
  • SPARK_HOME: the folder where you uncompressed Spark. In my case it is D:\spark-2.3.2-bin-hadoop2.7.
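If you prefer the command line, the same variables can be created with the setx command; here is a sketch using the example paths above (append the /M switch from an administrator prompt to make them system-wide; setx only affects newly opened prompts):

setx _JAVA_OPTIONS "-Xmx512M -Xms512M"
setx HADOOP_HOME "D:\winutils"
setx JAVA_HOME "C:\Java\jdk1.8.0_181"
setx SCALA_HOME "C:\Program Files (x86)\scala"
setx SPARK_HOME "D:\spark-2.3.2-bin-hadoop2.7"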

When you add a program’s folder to an environment variable like PATH, you can run that program from any command line. Since the command line in Windows is the Command Prompt, you can open one in any location and run your commands there. Which paths you add is entirely up to you, since you know which programs you need to access from the Command Prompt.

So, after you have introduced all of the environment variables above, the last one to modify is PATH.

  • PATH: We extend this variable with the bin folders of everything set previously:

%JAVA_HOME%\bin
%SCALA_HOME%\bin
%HADOOP_HOME%\bin
%SPARK_HOME%\bin

Permissions 

After we set everything, spark-shell tries to find the folder tmp/hive. So before you run it, C:\tmp\hive needs permissions; I had to set 777 permissions on it. In theory you can do this with the advanced sharing options of the Sharing tab in the folder’s Properties, but I did it from the command line using winutils.

Open a command prompt as administrator, create the folder if it does not exist yet (mkdir C:\tmp\hive), and type the following (the path assumes winutils.exe sits at D:\winutils\bin\winutils.exe, as configured above):
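D:\winutils\bin\winutils.exe chmod 777 C:\tmp\hive

You can verify the result with D:\winutils\bin\winutils.exe ls C:\tmp\hive.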

Testing

Start the command prompt as administrator again and move to the folder where you have stored your data files. To test the installation, type spark-shell; once the Scala prompt appears, you are ready to use the Spark CLI.

val textFile = spark.read.textFile("a.txt")
textFile.count()

The count should come back as a Long equal to the number of lines in a.txt.

Or, alternatively, try the equivalent standalone program inside Scala IDE. Below is a minimal sketch, assuming you add the jars from the jars folder of your Spark distribution to the project’s build path; the object name and file path are illustrative:
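import org.apache.spark.sql.SparkSession

object SparkTest extends App {
  // Run Spark locally; local[*] uses all available cores.
  val spark = SparkSession.builder()
    .appName("SparkTest")
    .master("local[*]")
    .getOrCreate()

  // Read the file as a Dataset[String] and count its lines,
  // mirroring the spark-shell test above.
  val textFile = spark.read.textFile("a.txt")
  println(s"Line count: ${textFile.count()}")

  spark.stop()
}

Run it the same way as the HelloWorld example: right-click the object and choose Run As > Scala Application.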

Have fun coding. Cheers to the craft of creating clean code!!! Suria ☼
