With so many programming languages out there why should you invest a chunk of your precious time learning Scala & Spark?
The answer is Data.
“The worlds most valuable commodity is no longer oil, but data”
– The Economist
Ask any Big Data industry expert – Scala is the best language to learn when you are starting out.
Scala is now used by a large number of companies. Scala and Spark are being used at Facebook, Groupon, NetFlix, Ebay, TripAdvisor for crunching big data and applying machine learning algorithms.
Increasing industry adoption is causing higher demand for experienced Scala developers and data engineers. In 2017 the number of jobs advertised citing Scala was up 77% on the same period in 2016 (www.itjobswatch.co.uk) and this trend continued through 2018.
Learning a new language can be tricky, however this site aims to make the process slightly less painful. Providing easy access resources to take you from a beginner to a competent Scala developer.
Read on for a high level introduction to Scala & Spark or skip to xx if you are keen to start writing some code.
What is Scala?
What Makes Scala Scalable?
Scala’s scalability is driven by many different factors, however perhaps the most important of these is the blend of object orientated and functional programming. Unlink most other languages, in Scala a function is also an object. Function types are themselves classes that can be inherited by sub classes. Although this might sound like a fairly trivial statement, it has far reaching implications for scalability. More on this in later tutorials.
Briefly, the impact of scalability is a language that is highly flexible, adaptable and can be moulded to the users needs. Scala provides the ability for the user to grow the language, through defining custom libraries and control constructs that effectively feel like native language support. This is an incredibly powerful concept.
Object Orientated and Functional
If you have a technical background, the chances are you’ve heard the OOP vs FP debate. As mentioned previously, Scala combines both of these programming paradigms.
Scala is pure form object orientated language. That is, every value you define is an object and operation is a method call. For example, when you say
10 - 5 in Scala you are actually invoking a method defined
- in class Int.
Additionally, Scala is also a pure functional language. There are two main aspects to functional programming. Firstly, functions are first class values. In Scala a function is a value with the same status as, say, a String. This is very different different to how non functional languages like C treat their functions. Secondly, it is expected that functions should map input values to output values rather update data held in one place. Or in other words, a function should not have any side effects. This means methods should only take some defined arguments as input and only output results. It should not interact or update its environment in any other way.
Scala does not force you into a functional style or an imperative style. You can choose to write your code in whichever style suits you. However, it almost always possible to avoid imperative styles in favour of a more functional approach. This is generally the preferred approach when writing code in Scala and will be encouraged in these tutorials.
What is Apache Spark?
Apache spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. Spark is written in Scala but supports multiple widely used programming languages (Python, Java, Scala and R). It is open source and includes libraries and APIs for diverse tasks ranging from SQL to streaming and machine learning. These APIs are easy to use for developers. Abstracting away much of the complexity of a distributed processing engine behind simple method calls. Spark can run anywhere from a laptop to a cluster of thousands of servers making it an easy system to start with and scale up to process even the largest datasets.
At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on the RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing.
RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.
Spark SQL has become more and more important to the Apache Spark project. It is likely the interface most commonly used by today’s developers when creating applications. Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (in Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.
Alongside standard SQL support, Spark SQL provides a standard interface for reading from and writing to other datastores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular stores—Apache Cassandra, MongoDB, Apache HBase, and many others—can be used by pulling in separate connectors from the Spark Packages ecosystem.
Selecting some columns from a dataframe is as simple as this line:
Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:
spark.sql(“SELECT name, pop FROM cities”)
Behind the scenes, Apache Spark uses a query optimizer called Catalyst that examines data and queries in order to produce an efficient query plan for data locality and computation that will perform the required calculations across the cluster. In the Apache Spark 2.x era, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) is the recommended approach for development. The RDD interface is still available, but is recommended only if you have needs that cannot be encapsulated within the Spark SQL paradigm.
Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. Spark MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selections, and transformations on any structured dataset. MLLib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLLib, and then imported into a Java-based or Scala-based pipeline for production use.
Note that while Spark MLlib covers basic machine learning including classification, regression, clustering, and filtering, it does not include facilities for modeling and training deep neural networks (for details see InfoWorld’s Spark MLlib review). However, Deep Learning Pipelines are in the works.
Spark GraphX comes with a selection of distributed algorithms for processing graph structures including an implementation of Google’s PageRank. These algorithms use Spark Core’s RDD approach to modeling data; the GraphFrames package allows you to do graph operations on dataframes, including taking advantage of the Catalyst optimizer for graph queries.
Spark Streaming was an early addition to Apache Spark that helped it gain traction in environments that required real-time or near real-time processing. Previously, batch and stream processing in the world of Apache Hadoop were separate things. You would write MapReduce code for your batch processing needs and use something like Apache Storm for your real-time streaming requirements. This obviously leads to disparate codebases that need to be kept in sync for the application domain despite being based on completely different frameworks, requiring different resources, and involving different operational concerns for running them.
Spark Streaming extended the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of microbatches, which could then be manipulated using the Apache Spark API. In this way, code in batch and streaming operations can share (mostly) the same code, running on the same framework, thus reducing both developer and operator overhead. Everybody wins.
A criticism of the Spark Streaming approach is that microbatching, in scenarios where a low-latency response to incoming data is required, may not be able to match the performance of other streaming-capable frameworks like Apache Storm, Apache Flink, and Apache Apex, all of which use a pure streaming method rather than microbatches.
Structured Streaming (added in Spark 2.x) is to Spark Streaming what Spark SQL was to the Spark Core APIs: A higher-level API and easier abstraction for writing applications. In the case of Structure Streaming, the higher-level API essentially allows developers to create infinite streaming dataframes and datasets. It also solves some very real pain points that users have struggled with in the earlier framework, especially concerning dealing with event-time aggregations and late delivery of messages. All queries on structured streams go through the Catalyst query optimizer, and can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data.
Structured Streaming is still a rather new part of Apache Spark, having been marked as production-ready in the Spark 2.2 release. However, Structured Streaming is the future of streaming applications with the platform, so if you’re building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.
Next learn how to install Spark on your environment. Getting Started With Spark