Additional materials can be found in this seminar’s Canvas module, but here we provide a brief introduction to the tools we will be using in the seminar – Apache Spark and Jupyter Notebooks. Both are popular data science tools on their own, but combined they provide an excellent platform for the democratization of data – allowing many people in an organization to explore large quantities of data. The platform we will be using is an online interface that Databricks hosts on Amazon’s AWS (and they kindly pick up the cost, so many thanks to them).

Why Jupyter and Spark? Why now?

In this seminar we will be exploring Apache Spark, and we will be doing it through Jupyter notebooks. The current era of Big Data started with a 2004 paper by researchers at Google describing MapReduce – a technique they used to analyze data on a massive scale by spreading the work out across many computers. When Doug Cutting (who was also working on a search engine) read it, he realized it was the solution to the problems he and his collaborators were having building their search index at scale (the search index is the card catalog for the web). Google did not release its software, but based on the paper, Cutting developed Hadoop. Yahoo! (also in the search engine business) hired Cutting because it saw great promise in Hadoop and backed its continued development as an open-source project. Hadoop was the Big Data tool for many years and gave birth to a whole ecosystem of Big Data technologies. If you want to know more about the early days of Hadoop, there is an excellent (non-technical) article in Wired from back in 2011.

Although Hadoop allowed programmers to scale (process larger volumes of data) by spreading the work across many computers, one of its drawbacks was that it was slow. The reason for its speed issues was that it was constantly writing to and reading from disk (think about the difference between recalling something from memory and looking it up in a book). Researchers at Berkeley’s AMPLab developed Spark as the solution – instead of writing to disk across a lot of computers, Spark keeps data in memory across many computers (to the extent possible). That was a lot faster, and Spark took off as a technology.
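To make the memory-versus-disk difference concrete, here is a minimal sketch in PySpark (Spark’s Python interface). The file name server_logs.txt is invented for illustration, and in a Databricks notebook a session named `spark` already exists, so the setup line is only needed elsewhere.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession called `spark` already exists;
# this line is only needed when running somewhere else.
spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Read a (hypothetical) large log file; Spark spreads the rows across the cluster.
logs = spark.read.text("server_logs.txt")

# Ask Spark to keep the data in memory across the cluster's machines, so that
# repeated analyses do not re-read it from disk each time -- the key difference
# from Hadoop MapReduce, which writes intermediate results back to disk.
logs.cache()

print(logs.count())   # first pass reads from disk and fills the in-memory cache
print(logs.count())   # later passes are served from memory and run much faster
```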

However, although both Hadoop and Spark enabled programmers to scale across many computers, many of the people who wanted to use the data were not programmers at all, but data scientists and data analysts. These folks were used to tabular data – if you are not familiar with that term, think of spreadsheets with rows and columns. In the mid-2010s, Spark added Spark SQL and DataFrames, which opened it up to many more analysts: DataFrames provide that familiar rows-and-columns structure, and Spark SQL lets analysts use SQL, a common database query language. You may not be familiar with SQL, but a lot of data analysts knew enough SQL to write queries (questions posed to a database) to get the data they wanted and do the analysis required for their jobs.
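As a small taste of what this looks like in practice, here is a minimal sketch assuming a PySpark environment such as a Databricks notebook. The file cities.csv and its population and country columns are hypothetical, invented purely for illustration.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; elsewhere, create it yourself.
spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read a CSV file into a DataFrame -- rows and columns, like a spreadsheet.
cities = spark.read.csv("cities.csv", header=True, inferSchema=True)

# The DataFrame API: filter and aggregate without writing any SQL.
cities.filter(cities.population > 1_000_000).groupBy("country").count().show()

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
cities.createOrReplaceTempView("cities")
spark.sql("""
    SELECT country, COUNT(*) AS big_cities
    FROM cities
    WHERE population > 1000000
    GROUP BY country
    ORDER BY big_cities DESC
""").show()
```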

The addition of Big Data tools such as Spark SQL allows for the democratization of data – that’s where you come in.  Even if you are not a programmer or computer engineer, these tools will allow you to analyze volumes of data and find new insights.  This seminar will hopefully pique your interest in these capabilities and encourage you to explore further.

Jupyter notebooks:

At the same time that Spark was taking off, another tool was taking hold in the world of data science.  That tool was the IPython Notebook. In 2014, IPython notebooks became Jupyter notebooks.  They have taken off from a data science perspective for a number of reasons, but some of the most important are:

  • Sharing – notebooks allow people to easily share work.  There are over 2.5 million notebooks on GitHub alone (a popular code-sharing website for programmers).  You can import a notebook from a file (the notebook contains only code and documentation – not data), and the files are so small that you could put a ton of them on a floppy disk (if you are not sure what that is, think about the size of your smallest thumb drive and divide by 1000).  You can also load a notebook from a URL – if you see a notebook on the Databricks website that you want to play with, it provides a URL you can easily import (we will share a notebook with you this way for the seminar).
  • Collaboration – in addition to passing notebooks around, web-based implementations such as Databricks or other vendors allow for collaboration among a team.  If you were using this in a business setting, you could collaborate with your co-workers (like Google Docs).  On the community accounts, you can add two of your classmates to your account if you are working on a class project.
  • Markdown – in a notebook there are code cells and markdown cells.  These can be interspersed, and Markdown is an easy-to-write markup language (which can also include raw HTML), so you can write well-formatted documentation – headers, links, tables, bullet lists, and graphics – to describe and document your process without having to learn even basic HTML.
  • Incremental and Iterative – data science often involves asking “what if?” questions.  Instead of writing one long program, in a notebook you can write single code cells, run them, edit them, and re-run them.  If you want to retry anything, you can run only the cells you need, insert a cell, try it, delete it, move the cell, and keep asking “what if?” (there is a short sketch of this style of work after this list).
  • Publishing – not only can you share your notebook, you can share your results.  This can be with policy makers, managers, or others in your community.  The notebook contains your code, the markdown, and your results, but not your data.  You can publish your notebook, and anyone who has that link will be able to see your work. This enables one of the key aspects of data science – storytelling.  It is easy to share with others the story you uncovered in the data.
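To make the incremental, “what if?” style of work described above concrete, here is a minimal sketch of how a few notebook cells might look. It assumes the same hypothetical PySpark setup and cities.csv file used in the earlier example; cell boundaries are shown as comments.

```python
from pyspark.sql import SparkSession

# Cell 1: setup (in a Databricks notebook, `spark` already exists).
spark = SparkSession.builder.appName("what-if").getOrCreate()
cities = spark.read.csv("cities.csv", header=True, inferSchema=True)

# Cell 2: a first question -- which countries have the most cities over one million people?
cities.filter(cities.population > 1_000_000) \
      .groupBy("country").count() \
      .orderBy("count", ascending=False).show(5)

# Cell 3: a "what if?" -- re-run the same idea with a different threshold,
# without touching anything else in the notebook. Keep the cell you like,
# delete the other, and keep exploring.
cities.filter(cities.population > 5_000_000) \
      .groupBy("country").count() \
      .orderBy("count", ascending=False).show(5)
```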

The combination of Apache Spark and Jupyter notebooks is powerful in democratizing data.  Instead of just programmers being able to access and analyze data, data scientists and data analysts can work with volumes of data that would have been impossible to analyze on a desktop – and in a way that is more intuitive, does not require knowledge of distributed computing, and lets them tell a story and share it easily with others.

A few examples:

  • This recent Datanami article discusses why Spark is popular and how it lowers the barriers to analyzing Big Data.
  • Notebooks at Netflix. Netflix has integrated notebooks (including as a front-end to Spark) into multiple aspects of their business, both from a data exploration perspective (data science) and from an operational (engineering) perspective.
    • Netflix runs its entire website on top of Amazon’s cloud, and like other web-based companies, it analyzes a lot of data. This article provides an overview of the infrastructure they created around notebooks to address different types of uses: Beyond Interactive: Notebook Innovation at Netflix.
    • This follow-up post discusses the infrastructure they created using notebooks to run scheduled jobs: Scheduling Notebooks at Netflix.
  • A brief summary of Spark – for a somewhat more “programmer-ish” viewpoint, read this KDnuggets posting (KDnuggets is a data science / machine learning focused website).
  • Jupyter Notebooks as a “Gateway Drug”. This short article discusses why Jupyter Notebooks are the “gateway drug” to data science at universities.