Taught by: Dr. Subhankar Dhar
The seminar introduces data science as an interdisciplinary field and delves into different steps needed to become a data scientist. It also covers important statistical concepts along with relevant computer programming fundamentals, in conjunction with hands-on analysis of real-world datasets. It consists of several statistical modules developed using Jupyter Notebook and Python.
Why has this seminar been designed?
Data Science is an important field with a growing number of career opportunities, and it has applications in almost every industry. Examples include movie and restaurant recommendations, improving customer loyalty and retention, hiring the right people, loan approval, measuring brand exposure, detecting credit card fraud, predictive maintenance, and early detection of supply chain disruption, to name a few. This seminar is designed to introduce students to problems and use cases arising from industry and to the statistical concepts needed to deal with them. The modules are meant to introduce students to data science early in their academic careers. No prior knowledge of Data Science is required.
Why is this topic relevant?
A good knowledge of statistics is necessary to solve problems in Data Science. Statistical tools and techniques are useful for exploratory data analysis and decision making. This topic was therefore chosen to introduce the statistical concepts that are most relevant to data scientists.
After completing the seminar, you should be able to:
- Understand basic statistical principles often used by data scientists
- Apply common statistical tools and techniques used in Data Science
- Use Python and Jupyter Notebook to analyze large datasets
- Visualize and interpret results for decision making
After participating in the seminar and completing the post-seminar assessment, you should be able to:
- Work with Jupyter Notebook on your computer
- Use various Python toolkits and related statistical packages most commonly used in data science
- Run statistical applications using Python
- Understand the landscape of data science tools and their applications, and how to identify and dig into new technologies and algorithms needed for the job at hand
- Analyze large datasets for visualization
- Analyze large datasets to get insights and make business decisions
What you will be doing during the seminar:
You will work with open datasets made available on Kaggle, focusing on housing prices. We will analyze the features of each house and try to predict prices from them. No prior experience is needed, but to get the most out of the seminar, please complete the steps below.
How to get started:
- Register for the seminar. It's 100% free, and registering gives you access to a Canvas course with all of the seminar materials, optional pre-seminar exercises, and additional materials (some of these are included below, but they are more convenient to use in Canvas).
- Take the survey.
- Get familiar with the pre-seminar review material in Canvas. This includes basic concepts in probability and statistics, as well as documentation on the Anaconda Distribution, the world's most popular open-source Python/R data science platform. It is the easiest way to perform Python/R data science and machine learning on Linux, Windows, and macOS. With over 11 million users worldwide, it is the industry standard for developing, testing, and training on a single machine.
Seminar materials (additional materials are available in Canvas after you register):
- Download the dataset (kc_house_data.csv) from the following link: https://www.kaggle.com/harlfoxem/housesalesprediction
- The dataset contains house sale prices between May 2014 and May 2015 for King County.
- It has 21,613 observations and includes 19 house features plus the price and the id columns.
- The features include the number of bedrooms and bathrooms, square footage, the year the house was built, and so on.
- Download the data (googleplaystore.csv) from the link provided
- This dataset contains information about apps from the Google Play store.
- It has 13 columns describing features such as the name of the app, its rating, its category, and whether it is free or paid.
- The reviews/ratings column can be used as a rough indicator of how many people use the app.
- Pre-seminar module: Basic probability review (Khan Academy). In addition to the datasets and websites with examples using Python, the Canvas module contains review material covering fundamental concepts of probability and statistics that are widely used in data science. If you want to learn more, there are PDFs and links to other resources that discuss in detail various applications of statistics in data science for decision making.
- Introductory Statistics is a free online book. Read chapters 1 through 6 to get an overview of the material that will be covered in this seminar. URL: https://openstax.org/details/books/introductory-statistics
- Seminar slides. This is a PDF of the slides from the seminar.
- Module 1
- Module 2
- Module 3
- Module 4
- Feel free to look at them beforehand, but if you don't understand them before the seminar, that's fine! We will walk through the topics covered in the slides together.
- Installing Anaconda (optional). As stated, this is totally optional: Anaconda will be installed on the computers we will be using in the seminar, so you do not need to install anything (or even own a computer) to participate in the seminar.
- Faculty: If you are a faculty member at SJSU or any university or community college, and you would like to host a seminar at your school or use the materials in your course, please see the faculty page for additional materials. If you are a Dean or faculty member at a Bay Area community college, we would like to hear from you! We are working with community college faculty in the Bay Area and provide small stipends to attend the seminar and assist you in presenting it at your school.
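For a first feel of the housing-price analysis described under "What you will be doing during the seminar", the following is a minimal sketch, not the seminar's own notebook code. It uses only the Python standard library so that it runs without extra packages. The helper names (`load_rows`, `paired_columns`, `fit_line`) are hypothetical, the column names `price` and `sqft_living` are assumptions about kc_house_data.csv, and the three-row sample is made up purely for illustration.

```python
import csv
import io
import statistics


def load_rows(csv_file):
    """Read an open CSV file into a list of dicts, one per house sale."""
    return list(csv.DictReader(csv_file))


def paired_columns(rows, x_name, y_name):
    """Return two aligned lists of floats, skipping rows with a blank cell."""
    xs, ys = [], []
    for row in rows:
        if row.get(x_name) and row.get(y_name):
            xs.append(float(row[x_name]))
            ys.append(float(row[y_name]))
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up toy sample (NOT real King County data), used so the sketch
# runs without downloading anything.
SAMPLE = "id,price,sqft_living\n1,200000,1000\n2,400000,2000\n3,600000,3000\n"

if __name__ == "__main__":
    rows = load_rows(io.StringIO(SAMPLE))
    sqft, price = paired_columns(rows, "sqft_living", "price")
    slope, intercept = fit_line(sqft, price)
    print(f"rows: {len(rows)}")
    print(f"median price: {statistics.median(price):.0f}")
    print(f"estimated price per extra square foot: {slope:.1f}")
```

To try it on the real data, replace `io.StringIO(SAMPLE)` with `open("kc_house_data.csv", newline="")` after downloading the file, and adjust the column names if they differ in your copy.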