Data Science 3: Data Science Applications and Analysis

Course Description: 

 

"Data Science 3" is currently offered through the Department of Statistics and Applied Probability as "PSTAT 100."

Overview and use of data science tools in Python for data retrieval, analysis, visualization, reproducible research, and automated report generation. This new course focuses on concepts relevant to data science and teaches them through popular software tools in the field. Doing data science is more than applying isolated methods: the course emphasizes creatively combining concepts and domain knowledge to clean, transform, analyze, and present data. Concepts in data ethics and privacy are also discussed. Case studies illustrate practical use of these tools in real usage scenarios.

Prerequisites:

  • Probability and Statistics I (PSTAT 120A)
  • Linear Algebra (MATH 4A)
  • Prior experience with Python or another programming language (CMPSC 9 or CMPSC 16)
Audience and goals
This course is a hands-on introduction to data science intended for intermediate-level students from any discipline with some exposure to probability and basic computing skills, but few or no upper-division courses in statistics or computer science. The course introduces central concepts in statistics – such as sampling variation, uncertainty, and inference – through an applied and computational lens alongside techniques for data exploration and analysis. Course activities model standard data science workflow practices by example, and successful students acquire programming skills, project management skills, and subject exposure that will serve them well in upper-division courses as well as in independent research or projects.
 
Materials
Readings for the course will draw on multiple sources, in particular the Python Data Science Handbook and the textbooks for Berkeley’s Data 8 (Inferential Thinking) and Data 100 (Principles and Techniques of Data Science), all available online. Computing will be hosted on an LSIT server (link to be provided).
 
Learning outcomes:
In this course, students will:
  1. Simulate, retrieve, organize, summarize, visualize, and model data using scientific computing tools in Python.
  2. Practice critical thinking about the relationship between data collection and scope of inference, and assess the plausibility of assumptions required to meaningfully model real data.
  3. Use appropriate programming style, conventions, and practices to write readable, organized, and reproducible code.
  4. Demonstrate good data science workflow and communication practices through completing a collaborative data analysis project and preparing a written summary of results.
 

 

Additional Information: 

Course Schedule Spring 2021
Week | Lecture Topic(s)               | Lab                               | Assessment
1    | Data science lifecycle         | Jupyter Notebooks                 |
2    | Sampling and inference         | Summary statistics and simulation |
3    | Data wrangling and tidy data   | Pandas                            | HW 1 due
4    | Elements of data visualization | Plot types and aesthetics         |
5    | Exploratory analysis I         | Density estimation                | HW 2 due
6    | Exploratory analysis II        | Dimension reduction               |
7    | Statistical models I           | Simple linear regression          | HW 3 due
8    | Statistical models II          | Multiple regression               |
9    | Classification                 | Project workflow                  | HW 4 due
10   | TBD                            | TBD                               |
11   | None (finals week)             | None                              | Project due

 

Course Level: 

  • Undergraduate

Course Number: 

3

Course Time: 

Spring 2021

Tues/Thurs: 3:30 - 4:45 pm
Labs: Wed at 10 am, 2 pm, and 6 pm.