Spark + Scala + Windows 10

Run and test your Spark code without VMs or Docker, for starters.


The goal of this post is to help people install and run Apache Spark on a Windows 10 machine without much hassle. The post is composed of pieces of installation content—somewhat similar to LEGO bricks—that you can work through piece by piece. I hope you find the post useful for getting started with your experiments on the Apache Spark framework.

Note: If you really want to build a serious prototype, I strongly recommend installing one of the quick-start Hadoop virtual machines.

Java Virtual Machine

Install a JDK: You first need to install a JDK, that is, a Java Development Kit. You can go to Oracle's website, download it, and install it if you need to.

We need the JDK because, even though we may develop in Python or Scala, everything runs on the JVM in the end. Even Python code gets translated under the hood into calls against Spark's core, which is written natively in Scala. Scala, in turn, compiles to bytecode that runs on the Java Virtual Machine. So, in order to run Spark code, you need a Scala runtime, which is installed by default as part of Spark, and you need Java—or more specifically the JVM—to actually execute that Scala code. It's like a technology layer cake. (The application configuration after installation is shown below.)

JDK 8 is a superset of JRE 8, and contains everything that is in JRE 8, plus tools such as the compilers and debuggers necessary for developing applets and applications. JRE 8 provides the libraries, the Java Virtual Machine (JVM), and other components to run applets and applications written in the Java programming language.

Scala Binaries

Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages.

  • Download the Scala binaries for Windows
  • Accept the agreement. Select Next and continue to complete installation.
  • You can verify Scala installation in folder: C:\Program Files (x86)\scala

Scala IDE

Scala IDE provides advanced editing and debugging support for the development of pure Scala and mixed Scala-Java applications. While one is free to use the Python shell or an interactive environment such as Jupyter or Spyder, we will assume pure Scala development and try Scala IDE.

  • Pick the Windows 64-bit version.
  • Save it under your Downloads folder.
  • Choose the installer and file associations.
  • Move the archive to the D drive and unzip it using WinRAR or 7-Zip. It creates a folder named eclipse.
  • Right-click the eclipse application and create a shortcut.
  • Send the shortcut to the desktop and rename it as you wish.
  • Open up Scala IDE and launch it in a workspace of your choice (my case D:\workspace).
  • File => New => Scala Project
  • Name the project HelloWorld. Select the src folder, right-click to open the context menu, pick New => Scala Object, type “Hello”, and click Finish.
  • Writing code: Change the code to the following:

object Hello extends App {
  println("Hello, World!")
}

  • Running it: Right-click the Hello object in your code and select Run As > Scala Application. You’re done!
  • Output:

Download Spark

As we are not going to use Hadoop itself, the version you choose makes no difference. Fortunately, the Apache website makes available prebuilt versions of Spark, precompiled against a recent Hadoop version, that run out of the box. You don’t have to build anything; you can just download it to your computer, put it in the right place, and be good to go for the most part.

  • We have used Spark 2.3.2 here, but anything beyond 2.0 should work just fine.
  • Make sure you get a prebuilt version and select the direct download option; the defaults are perfectly fine.
  • It downloads a TGZ (tar in GZip) file. You can use WinRAR to extract it. Extract the files to any location on your drive where your user has sufficient permissions.


The official release of Hadoop does not include the binaries (e.g., winutils.exe) necessary to run Apache Hadoop on Windows; to obtain them, Hadoop would have to be compiled from source. So we must get the 64-bit winutils.exe from a trusted source. I used the one from here; feel free to pick your own.

Environment Variables

Every process has an environment block that contains a set of environment variables and their values. There are two types of environment variables: user environment variables (set for each user) and system environment variables (set for everyone).

The command processor provides the set command to display its environment block or to create new environment variables. You can also view or modify the environment variables by selecting System from the Control Panel, selecting Advanced system settings, and clicking Environment Variables. Each environment block stores its variables as name=value pairs.


To set environment variables in Windows 10 and Windows 8:

  1. In Search, search for and then select: System (Control Panel)
  2. Click the Advanced system settings link.
  3. Click Environment Variables. In the section System Variables, find the PATH environment variable and select it. Click Edit. If the PATH environment variable does not exist, click New.
  4. In the Edit System Variable (or New System Variable) window, specify the value of the PATH environment variable. Click OK. Close all remaining windows by clicking OK.
  5. Reopen Command prompt window, and run your code.
  6. _JAVA_OPTIONS: We set this variable to the value -Xmx512M -Xms512M. It helps with Java heap memory problems caused by the preset default values. You are free to increase the memory allocated.
  7. HADOOP_HOME: even though Spark can run without Hadoop, the prebuilt version I downloaded looks for it at startup. To fix this inconvenience, this variable points to the folder containing the winutils.exe file (in my case D:\winutils).
  8. JAVA_HOME: we usually set this variable when we install Java, but it is better to verify that it exists and is correct. (In my case C:\Java\jdk1.8.0_181, since I avoided Program Files owing to the blank-space character.)
  9. SCALA_HOME: the Scala installation folder. If you used the standard location from the installer, it should be C:\Program Files (x86)\scala.
  10. SPARK_HOME: the folder where you uncompressed Spark. In my case it is D:\spark-2.3.2-bin-hadoop2.7.

When you add an EXE's folder to the PATH environment variable, you can run the program from any command line. Since the command line in Windows is the Command Prompt, you can open one at any location and run commands. Which paths you add is entirely up to you, since you know which programs you need to access from the Command Prompt.

So after you have introduced all the environment variables described above, the last one to modify is PATH.

  • PATH: We extend this variable with the bin folders of the locations set previously, e.g. %JAVA_HOME%\bin, %SCALA_HOME%\bin and %SPARK_HOME%\bin, plus the folder that holds winutils.exe.
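Once the variables are in place, a quick way to confirm that a Scala program can actually see them is to read them back from sys.env. This is a small sketch; the variable names below are the ones set above, so adjust them if yours differ:

```scala
// Read back the variables set above. A value of "<not set>" means the
// variable was not picked up; reopen the Command Prompt (or the IDE)
// after editing environment variables, since running processes keep
// their old environment block.
val wanted = Seq("JAVA_HOME", "SCALA_HOME", "SPARK_HOME", "HADOOP_HOME")
val found = { name => name -> sys.env.getOrElse(name, "<not set>") }.toMap
found.foreach { case (name, value) => println(s"$name = $value") }
```

Run it once from the IDE and once from a Command Prompt launched after the changes; both should print the same paths.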






After we set everything, the shell tries to find the folder tmp/hive. So before you run spark-shell, “C:\tmp\hive” needs the right permissions; I had to set 777 permissions on it. In theory you can do this with the advanced sharing options of the Sharing tab in the folder's properties, but I did it from the command line using winutils.

Open a command prompt as administrator and type:
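The command itself did not survive in this post; assuming winutils.exe ends up at D:\winutils\bin\winutils.exe (Spark looks for it under %HADOOP_HOME%\bin), the permission fix would look like this:

```shell
:: Grant full (777) permissions on the scratch folder spark-shell uses.
:: Adjust the winutils.exe path if you unpacked it elsewhere.
D:\winutils\bin\winutils.exe chmod 777 C:\tmp\hive
```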


Start the command prompt as administrator again and move to the folder where you have stored the data files. To test the Spark shell, type spark-shell in the terminal; you are ready to use the Spark CLI.

val textFile ="a.txt")
textFile.count()


The screenshot below describes the output expected.

Or, alternatively try the following code inside ScalaIDE:
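The snippet itself is missing from the post; as a stand-in sketch, here is a word count written against plain Scala collections. The flatMap/groupBy chain is the same shape you would use on a Spark Dataset once the Spark libraries are on the project's build path, so it ports over mechanically:

```scala
// Word count over an in-memory collection. With Spark, the Seq would be
// replaced by"a.txt") and the same chain of
// transformations would run distributed.
val lines = Seq("hello spark", "hello scala", "hello world")

val counts: Map[String, Int] = lines
  .flatMap(_.split("\\s+"))                      // lines -> words
  .groupBy(identity)                             // word -> occurrences
  .map { case (word, occ) => (word, occ.size) }  // word -> count

counts.toSeq.sortBy(-_._2).foreach { case (w, n) => println(s"$w: $n") }
```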

Have fun coding. Cheers to the craft of creating clean code!!! Suria ☼


‘Server less’ Or ‘Serve Less’

The journey of distributed computing has evolved over the years from physical deployments to virtualization, to platforms, to services, to container orchestration, and now towards server-less compute services. The word ‘server-less’ doesn’t mean a lack of compute resources, but rather a consumer perspective in which such physical boundaries are nonexistent and computation happens on a need basis.

Server-less architecture focuses on small, granular tasks or jobs as opposed to coarser-grained applications. The difference between an application and a task can be explained thus: applications are hosted at run-time and need container management, distribution, redundancy, and orchestration; a task, on the other hand, tends to be simple: it starts, executes, and stops, scales, stays concurrent, and possibly communicates with other tasks via asynchronous messaging.

Serverless architectures are application designs that incorporate third-party “Backend as a Service” (BaaS) services, and/or that include custom code run in managed, ephemeral containers on a “Functions as a Service” (FaaS) platform. By using these ideas, and related ones like single-page applications, such architectures remove much of the need for a traditional always-on server component. Serverless architectures may benefit from significantly reduced operational cost, complexity, and engineering lead time, at a cost of increased reliance on vendor dependencies and comparatively immature supporting services.  

Martin Fowler,  2017

The goal of a server-less computing platform is to provide an opportunity to reduce, organize, and manage the complexity of an application by treating the application as a bunch of services. Each service is composed of a bundle of specific tasks or actions in an event-based architecture. Thus changing the architecture is easier, which is an important factor for any long-term application. It does not replace PaaS or container-based architecture; rather, server-less architecture provides an alternative abstraction for building scalable and flexible on-demand services.

A Picture is Worth a Thousand Words

It is common to create applications in which services (either built in-house or from third parties) are adopted as part of the system rather than created from scratch. These services are commonly known as Backend as a Service (BaaS). Similarly, business logic can be coded in the form of functions that are hosted elastically in the cloud as Function as a Service (FaaS). Server-less systems also use an event-based architecture to trigger the functions or services on a need basis. The following diagram illustrates how services and custom functions are created, deployed, and consumed by different parts of the software system:

An example of Serverless Reference Architecture Layering

Serverless computing is the new way of consuming cloud computing services. In this style of computing, the cloud vendors are responsible for service provisioning, maintenance, and the stability of the services offered. The platform engineers focus on building context-sensitive, innovative new business features and are billed only for the amount of computation they consume.


There are many reasons why developers prefer the server-less computing model:

  1. No Operations Overhead: Platform engineers are free from managing underlying resources such as infrastructure and OS. The cloud vendors provision and patch the FaaS. This results in improved productivity, as the cloud engineers can now focus on business functionality rather than cloud service architecture.
  2. Scalability and Availability: Both functions and events are small, granular, loosely coupled, stateless components. This makes the services easily scalable and thus effectively available to consumers based on the current load.
  3. Optimization: Services are invoked on a need basis, so you consume and pay only for what is essential. This reduces overall costs and also improves the efficiency of the service invocation layer.
  4. Polyglot Options: Server-less computing opens the door to using different languages and run-times depending on the specific use case under consideration. This gives platform engineers the liberty to use, within an application, the multiple languages that best suit the task at hand.

However, server-less computing is still in its infancy; hence, it is not suitable for all use cases. It has limitations such as a lack of good state-management facilities, vendor lock-in, limited function support, and a lack of debugging tools.


To summarize, there are many different ways to implement server-less compute systems or managed services. The classical use cases suitable for server-less architecture are extensive. For instance, situations that demand less complex computation, stay stateless, and possess predictable behavior are worthy of this architecture. Data transformation, speech recognition, pattern mining, video object recognition, and the like are specific services that are executed once in a while and would benefit from the need-based, serve-less (optimal), managed service invocation of the server-less architecture.

Essential Gradle – A Birdie’s Look.

This blog is to help the reader understand essentials of Gradle build tool.


Building software is a craft by itself; the build process usually involves a series of tasks such as picking the right libraries, compiling the source code, packaging a distributable executable, testing with automated scripts, and reporting on the progress of the process. Each task can be broken down into further activities, for example directing the execution order, resolving dependencies, identifying resources, and configuring components.

Gradle is a relatively new build automation tool that assists with continuous delivery. Gradle offers flexibility in the way projects are put together, optimizes the build in interesting ways, and is extremely customizable. The documentation can be found here. You may find detailed installation instructions here.

Gradle uses Groovy syntax. Groovy is a powerful, optionally typed and dynamic language, with static-typing and static compilation capabilities, for the Java platform aimed at improving developer productivity thanks to a concise, familiar and easy to learn syntax. It integrates smoothly with any Java program, and immediately delivers to your application powerful features.

Gradle includes interesting features such as scripting, DSL (domain-specific language) authoring, run-time and compile-time meta-programming, and functional programming constructs. Thus the programming language Groovy is its natural choice: Groovy has a concise, readable, and expressive syntax that is easy for Java developers to learn.

Gradle Build Scripts

Gradle is built around two core concepts: (1) projects and (2) tasks. Each Gradle build comprises one or more projects. A project represents a thing to be done, such as deploying your application to staging or production environments. Each project is made up of one or more tasks. A task represents some atomic piece of work which a build performs: this might be compiling some classes, creating a Java archive, generating Java documentation, or publishing an archive to a repository.

Gradle works on the principle that every build has two phases: (1) the configuration phase and (2) the execution phase. During the configuration phase, Gradle scans all the configuration details in the build, figures out how the various tasks are laid out, and determines the dependencies between tasks in the form of a directed acyclic graph. The actual work of each task then happens during the execution phase, whether the task is a built-in one or a custom task defined by the developer.
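The two phases are easy to observe from a build script. In the small (hypothetical) task below, the first println fires while the script is being configured, on every build invocation, and the second only when the task is actually executed:

```groovy
task phaseDemo {
    // Runs during the configuration phase, on every build invocation,
    // even when phaseDemo itself is not requested.
    println 'configuring phaseDemo'

    doLast {
        // Runs during the execution phase, only when phaseDemo runs.
        println 'executing phaseDemo'
    }
}
```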

Defining Tasks and Closures

Let us dive into the basics and the syntax soup, assuming you have a proper installation and setup done. For example, a simple task is defined as below:

task helloWorld {
    doLast {
        println 'Hello world.'
    }
}
Here, we have defined a helloWorld task with a doLast action. When executed, the task will print the words ‘Hello world.’ on the console. println is a Groovy method to print text to the console, basically a shorthand version of the System.out.println Java method. A task can comprise any number of actions. Alternatively, we can use the << left-shift operator as a synonym for the doLast method.

task helloWorld << { println 'Hello world.' }

Gradle supports closures. Closures are reusable pieces of code that can be stored in a variable or passed to a method. Closures are delimited by a pair of curly braces ( {. . .} ). We can pass one or more parameters to a closure. When there is only one parameter, it can be referred to via the implicit it. Parameters can also be explicitly named. See the following examples:

task sometask {
    doLast {
        // Using the implicit 'it' closure parameter.
        // The type of 'it' is a Gradle task.
        println "Running ${}"
    }
}

task sometask {
    doLast { Task task ->
        // Using the explicit name 'task' as the closure parameter.
        // We also declared the type of the parameter.
        println "Running ${}"
    }
}

Defining Actions

We can add actions to a task via an implementation of the org.gradle.api.Action interface. An Action has a single execute method, which is invoked when the task is executed. See the simple example below:

task sometask {
    doLast(
        new Action<Task>() {
            void execute(Task task) {
                println "Running ${}"
            }
        }
    )
}

Defining Dependency

We can add task dependencies with the dependsOn method of a task. We can specify a task name as a String value, or a task object, as the argument; we can even specify more than one task name or object to declare multiple dependencies. Task dependencies are lazy: dependencies can be defined via task names, task objects, or closures, and we can depend on a task that is defined later in the build script. Gradle sets up all task dependencies during the configuration phase, not during the execution phase, so the order of task definitions in the build script doesn’t matter. Some examples below:

task third(dependsOn: 'second') << { task ->
    println "Run ${}"
}
task second(dependsOn: 'first') << { task ->
    println "Run ${}"
}
task first << { task ->
    println "Run ${}"
}

Another example: we define a dependency for the second task on all tasks in the project whose names start with the letter ‘f’. For this we use the Groovy method findAll, which returns all tasks that satisfy the condition we define in the closure: the task name starts with the letter ‘f’.

def printTaskName = { task ->
    println "Run ${}"
}

task second << printTaskName

// We use the dependsOn method
// with a closure.
second.dependsOn {
    project.tasks.findAll { task ->'f')
    }
}

task first << printTaskName

task beforeSecond << printTaskName

Besides this, tasks can be organised, task defaults can be specified, tasks can be grouped, additional project/task properties can be defined, and certain tasks can be skipped.

Defining the Java Plugin

In Gradle, extra functionality beyond tasks and properties is introduced via plugins, which we can apply to our project. The plugin concept also keeps this extra functionality decoupled from the core build logic. Gradle ships with plugins that are ready out of the box; it also allows us to write our own. For example, Gradle has a Java plugin. This plugin adds tasks for compiling, testing, and packaging the Java source code of our project.
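As a sketch, wiring in the Java plugin is a single line of build.gradle; the plugin then contributes tasks such as compileJava, test, and jar without further configuration (the compatibility settings below are optional illustrations):

```groovy
// build.gradle
apply plugin: 'java'

// Optional tweaks to the plugin's conventions.
sourceCompatibility = 1.8
targetCompatibility = 1.8
```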

Dependency Management

Gradle provides control of project dependencies via a manageable dependency tree. It also has provisions for declaring custom dependencies and non-managed dependencies, and it can produce a run-time dependency report. The tree structure makes it natural for the tool to scale dynamically to multiple projects while avoiding version conflicts. Inclusions and exclusions can be well scoped within the branches of the tree.
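A minimal sketch of managed dependencies in a 2017-era build.gradle (the artifact coordinates are only illustrative); running gradle dependencies afterwards prints the resolved dependency tree, transitive versions included:

```groovy
repositories {
    mavenCentral()
}

dependencies {
    // 'compile' and 'runtime' were the standard configurations in
    // Gradle 3.x/4.x; later versions renamed them.
    compile 'org.slf4j:slf4j-api:1.7.25'
    runtime 'ch.qos.logback:logback-classic:1.2.3'
}
```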

Test Support

Gradle supports the compilation and execution of tests. It offers the testCompile and testRuntime dependency configurations for this purpose. Gradle supports JUnit, TestNG, and test annotations from a few other frameworks that run on the JVM. Gradle also helps produce test reports in XML and HTML formats, as configured by the test engineer.
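A sketch of that wiring, using the era-appropriate testCompile configuration; gradle test then compiles the tests, runs them, and writes reports under build/reports/tests:

```groovy
apply plugin: 'java'

dependencies {
    testCompile 'junit:junit:4.12'
}

test {
    // Both report formats are on by default; shown here for completeness.
    reports {
        html.enabled = true
        junitXml.enabled = true
    }
}
```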

Build and Publish

Any software project produces artifacts we may want to publish. Such artifacts are stored in JAR, WAR, or ZIP file formats. Gradle allows developers to publish their artifacts to a central repository, enabling other developers in the organization to access them via the intranet or internet.

Concluding Remarks

So in this blog we have understood the basic syntax of Gradle and composed a simple task. Gradle also supports incremental builds. To broaden the horizon, the next steps would be to learn to build more complex tasks, multi-project composition, dependency management, and task-graph design.

Pointers from here

Gradle official documentation site:

Gradle and Spring Boot:

Gradle Essentials Book:

Time stamp validity: The blog content is relevant as of December 2017

Code Schools Vs University

Evgeny Shadchnev (Makers Academy). An interesting interview on learning software development at a university vs code schools such as Makers Academy.

Key Points:
1. It takes about 1,000 hours of practice to reach a decent level of programming skill in one programming language.
2. Code schools target, with laser focus, a single skill leading to quick junior-level employment, while universities look at knowledge and skills for the long term.
3. Code-school learners are mostly mature and keen to switch careers, while universities deal with a younger, fresh audience keen on quality formal education.
4. Code schools aim at producing high-quality junior programmers. Interestingly, qualified code-school teachers have a short-lived teaching career of only a few years, because frameworks/languages/platforms become outdated quickly.
5. The code-school curriculum is flexible, meaning they don’t follow a rigid timetable; they function on challenges and mini-projects where learners research, self-educate, and finish at their own pace.
6. The optimal class size for a code school is recommended to be 24, with at least two instructors. Projects are done in smaller teams with plenty of mentoring.
7. The key success factor for a code school lies in the manner of its learner selection and screening models.
8. Historically, code schools were always around, but the trailblazer was the bootcamp (in the US, around 2012) a few years ago. Finding a sustainable business model in this space is difficult, and hence many code schools are also short-lived, including the famous bootcamp. One success pointer for a code school is to partner with corporates (hiring partners) for a perennial flow of learners.
9. While online courses are cheaper, the learning mode can be challenging; also, post-learning recruitment of remote learners into good technology companies is highly challenging. Thus code schools prefer offline learning as a more effective route to learning and recruitment.

Why this Blog?

Why not? I have always been making excuses for not being organised when it comes to capturing my thoughts. Most of the time I hide behind the ‘I am busy’ excuse. Given my clouded head that includes facts, artifacts, creations, opinions, sarcasm, reviews, and a bunch of natural biases, I have had a few home runs of cute ideas/opinions/phrases in the past. Finally, it is time to maintain a decent journal. Hence this blog. Please feel free to read, admonish, and critique in the comment section. And have fun doing so.


Suria ☼