Taught by: Dr. Scott Jensen

Why we hope to see YOU at the seminar!

Apache Spark and Jupyter notebooks are currently two of the hottest tools in data science, and this seminar provides the opportunity to work hands-on with these tools even if you have no prior experience in programming or data science! You don’t even need your own computer!

Jupyter Notebooks and Apache Spark are being used by data scientists at some of the largest web-based companies in Silicon Valley. Apache Spark allows data scientists to explore large datasets in varied formats and quickly identify patterns in the data. Jupyter notebooks allow them to not only visualize and document their results, but also easily share their research with colleagues and even generate publications, webpages, and presentations. Together, through a web-based interface, these tools allow you to explore and experiment with large datasets, quickly ask questions about your data, generate visualizations, and share your work (with a couple of clicks you can even publish your notebook to the web and share a link with family, friends, or recruiters, or include it on your LinkedIn profile) – all without extensive coding!

After participating in the seminar and completing the post-seminar assessment, you will be able to:

  • Load data into Spark DataFrames and ask basic questions of your data using PySpark (see the short sketch just after this list)
  • Understand the importance of documenting your work and using markdown in Jupyter notebooks
  • Create basic visualizations in Jupyter
  • Share and publish your results
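
Curious what that first outcome looks like? Here is a minimal PySpark sketch of loading a CSV file and asking a few basic questions of it. The file name and column names are made up for illustration – in the seminar you will work with the real USASpending data in a prepared notebook:

    # Minimal PySpark sketch – the file path and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("seminar-sketch").getOrCreate()

    # Load a CSV file into a DataFrame, letting Spark infer the column types.
    df = spark.read.csv("contracts_2014_2018.csv", header=True, inferSchema=True)

    # Ask some basic questions of the data.
    df.printSchema()                       # what columns and types do we have?
    print(df.count())                      # how many transactions?
    df.select("agency", "amount").show(5)  # peek at a few rows

(In a Databricks notebook the spark session is already created for you, so the first two lines of code are not needed.)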

What you will be doing during the seminar:

Back in 2012, “data scientist” was declared to be the sexiest job of the 21st century. Although it’s a bit early in this century to be declaring the sexiest job of the century (in 1912 many of the jobs of the 20th century had not yet been imagined), and the claim was being made in the Harvard Business Review (not generally considered to be the definitive guide to sexiness), data scientists have turned out to be in high demand. However, only two years later, one of the academics who declared “data scientist” to be the sexiest job of the century changed their mind and said data scientists were actually data plumbers. Aside from the issue that this now meant that being a plumber was the sexiest job of the century (and nobody imagined that back in 1999), the reason was that data scientists spend 80% of their time doing data wrangling. What’s data wrangling? It’s the work of figuring out what’s in your data and cleaning it up to get it ready for the “sexy” part of data science. As stated by DJ Patil, the first Chief Data Scientist of the U.S. Government (and the other author of the article claiming data scientist was the sexiest job), “Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn’t something that gets in the way of solving the problem: it is the problem.” The upside is that while a lot of data scientists have advanced degrees, you can get your foot in the door learning how to wrangle data, and we will be exploring that during this seminar.
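
To make “data wrangling” a bit more concrete, here is one small but typical cleanup step in PySpark, assuming a DataFrame df like the one sketched above (the column names amount_raw and amount are invented for illustration, not taken from the seminar data):

    # A typical wrangling step: dollar amounts often arrive as strings like
    # "$1,234.56" and must be cleaned up before you can sum or average them.
    # (Column names here are hypothetical.)
    from pyspark.sql import functions as F

    clean = (df
        .withColumn("amount",
            F.regexp_replace("amount_raw", "[$,]", "").cast("double"))
        .dropna(subset=["amount"])     # drop rows whose amount was unparseable
        .filter(F.col("amount") > 0))  # keep only positive payments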

The data we will be using is from a Federal website named USASpending.gov. This site has information on all of the payments on U.S. Government contracts by Federal agencies large and small. You will be analyzing nearly 20 million transactions covering the period 2014 – 2018. This covers multiple years of two different administrations, led by two different political parties and two presidents with different world views. Is spending different across the two administrations? Does the spending by agency reflect different priorities? Are the vendors used located in different states? Does the government’s spending follow any annual patterns across the months of each year?

You will use Spark to profile the data, clean up errors in the data, query the data using Spark SQL to create DataFrames, and explore different types of visualizations to identify patterns. If this all sounds new, that’s great! Never run a SQL query (or even heard of SQL)? No worries! We will walk through the basics in a very hands-on format. Keep in mind that the seminar is zero risk – you can explore something new and different and your grade is not on the line.
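
If you are curious what a Spark SQL query looks like, here is a minimal sketch in the same hypothetical terms as the examples above: register a DataFrame as a temporary view, then answer one of the questions posed earlier – total spending by year (the view and column names are assumptions for illustration):

    # Register the cleaned DataFrame as a temporary view, then query it
    # with plain SQL. (The view and column names are hypothetical.)
    clean.createOrReplaceTempView("contracts")

    spending_by_year = spark.sql("""
        SELECT fiscal_year, SUM(amount) AS total_spent
        FROM contracts
        GROUP BY fiscal_year
        ORDER BY fiscal_year
    """)
    spending_by_year.show()

In a Databricks notebook, calling display(spending_by_year) instead of .show() gives you an interactive table that you can switch to a chart with a click.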

How to get started:

  • Register for the seminar – it’s 100% free, and registering for the seminar will get you access to a Canvas course with all of the seminar materials, optional pre-seminar exercises, and additional materials (some of these are included below, but there are more materials in Canvas, and they are easier to access).
  • Sign up for a Databricks web-based community account – it’s FREE! You will be using your account during the seminar and can continue to use it afterwards to explore further or work with data in your class projects! Databricks was started by the creators of Apache Spark (down the road at UC Berkeley, if you are at SJSU). Your community account combines Apache Spark and a web-based notebook interface. See the simple instructions below on how to sign up for an account.
  • Try out the pre-seminar exercise in Canvas. In this exercise you will use your web-based Databricks community account and a Databricks notebook you import from the web to explore different types of data visualizations. There is no software to install; all you need is a browser!

Seminar materials (additional materials are available in Canvas after you register):

  • Sign up for a FREE Databricks community account. After you have signed up, if you forget the login URL, click here.
  • A brief intro to Apache Spark and Jupyter notebooks. See this page for a brief discussion of why Apache Spark and Jupyter notebooks are hot, and how they relate to you as part of the industry’s current emphasis on the democratization of data. The page linked above also includes links to a few articles you may find interesting.
  • A short exercise on Apache Spark and Jupyter notebooks. See this document for an introduction to loading a notebook from Databricks into your account to explore the types of charts you can create in your notebook. Although the notebook runs in Databricks, the focus is more on possible data visualizations than on Apache Spark (we’ll get into Spark more in the seminar).
  • Seminar slides. This is a PDF of the slides from the seminar. Feel free to look at them beforehand, but if you don’t understand them before the seminar, that’s fine! We will be walking through learning about the topics covered in the slides.
  • The completed seminar notebook. As discussed above, in the seminar you will import a notebook that contains some calculations and markdown documenting what you are doing, and we will walk through wrangling the data, adding queries, and visualizing the data. Your notebook will look like this at the end of the seminar.
  • Creating the data files and loading them manually (totally optional). The data files we will be using in the seminar are based on a download from the USASpending website and contain data from 2014-2018. Some wrangling of the data will have already been done (though you will do some more in the seminar), and the data will be compressed and staged on Amazon’s S3 so the notebook can pull it directly into your Databricks account (avoiding having us all use the network to load large data files at the same time). However, if you want to see the sausage making and get your hands into the data, see this page for more of the details. That page also covers loading data into Databricks in case you want to explore further with a different dataset of your own.
  • Faculty: If you are a faculty member at SJSU or any university or community college, and you would like to host a seminar at your school or use the materials in your course, please see the faculty page for additional materials. If you are a Dean or faculty member at a Bay Area community college, we would like to hear from you! We are working with community college faculty in the Bay Area, providing small stipends to attend the seminar, and we will assist you in presenting it at your school.