The dataset is based on a download from the USASpending website, which provides an API for downloading data on payments for government contracts and government assistance. Although the website offers some pre-packaged download files, they are limited in which accounts they contain. The API can also download by year, but a silent 1GB limit on each download prevents downloading an entire year’s data at once. You do not need a Databricks account to run the download. The download notebook runs in Jupyter through Anaconda on your desktop. You should have a fast Internet connection for downloading the data.
To upload the dataset, you will need a Databricks account and a fast Internet connection for uploading the files.
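One way to stay under a per-download size limit like the one described above is to request a year in smaller date windows and combine the results afterward. The sketch below only builds month-by-month date ranges for a calendar year. `month_windows` is a hypothetical helper, not part of the provided notebook, which may split its requests differently. No USASpending API calls are shown.

```python
import calendar
from datetime import date


def month_windows(year):
    """Return a list of (start, end) date pairs, one per calendar month of `year`."""
    windows = []
    for month in range(1, 13):
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, month)[1]
        windows.append((date(year, month, 1), date(year, month, last_day)))
    return windows


# Each pair could serve as the start and end date of one smaller download request.
for start, end in month_windows(2018):
    print(start.isoformat(), end.isoformat())
```

If a single month still exceeded the limit, the same idea could be applied with shorter windows.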
To create and load the data, perform the following steps:
- Install Anaconda (a free tool for working in Python).
- Create the data files. We have created a Jupyter notebook that downloads the necessary data from the USASpending website using their API, filters it for the columns we will be using, combines it by year, and then generates compressed data files in the BZip2 format. The compressed files can be uploaded to Databricks and used in Spark. The notebook does not require Databricks, just Jupyter running on Anaconda. Click here to download the zipped notebook. The zip file contains a single file, a notebook named DownloadUSASpendingData.ipynb. To create the data files using that notebook, follow these instructions. If you want to expand on the project, you can modify the logic in the notebook to include additional columns (the download includes 74 of approximately 250 columns).
- Alternative to creating the files. If you generate new files with the provided notebook, they may contain additional data or corrections that USASpending has made since we downloaded the data for the seminar. Instead of creating the files, you can download this zip file, which contains a folder named SeminarData with the data files for 2014 – 2018 compressed in the bzip2 format. Unzip this file and then upload all of the files in the SeminarData folder as described below for manually uploading the data. If you have a slow connection, you may want to download each file separately. These are the same files included in the above zip file:
- Manually uploading the seminar data. If you create your own data files and want to upload them manually, follow these instructions for loading the files. If you load the data manually (and use the directory specified in these directions), the seminar notebook will not attempt to load the data again, since it already exists. These directions can also be used as a guide if you want to do another project with your Databricks community account later. In the seminars, we stage the data on S3 and load it directly into the notebook. If you are faculty, we include instructions on how to do that in the faculty materials.
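The filter, combine, and compress steps described in the list above can be sketched with the Python standard library alone. This is a minimal illustration, not the logic of DownloadUSASpendingData.ipynb. The column names in `KEEP_COLUMNS`, the function name `filter_and_compress`, and the file names in the usage comment are all hypothetical.

```python
import bz2
import csv

# Hypothetical column names for illustration only; the real notebook keeps 74 columns.
KEEP_COLUMNS = ["award_id", "recipient_name", "obligated_amount"]


def filter_and_compress(csv_paths, out_path, keep=KEEP_COLUMNS):
    """Write only the `keep` columns from each CSV in `csv_paths` into one bzip2 file.

    Returns the number of data rows written.
    """
    rows_written = 0
    with bz2.open(out_path, "wt", newline="", encoding="utf-8") as out:
        # extrasaction="ignore" silently drops any column not listed in `keep`
        writer = csv.DictWriter(out, fieldnames=keep, extrasaction="ignore")
        writer.writeheader()
        for path in csv_paths:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Example usage (assumed file names):
# filter_and_compress(["2018_part1.csv", "2018_part2.csv"], "2018.csv.bz2")
```

Writing through `bz2.open` in text mode compresses the rows as they are written, so the combined file for a year never has to be held in memory uncompressed.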