Data Science 3: Data Science Applications and Analysis | Data Science Initiative

Course Description:

"Data Science 3" is currently offered through the Department of Statistics and Applied Probability as "PSTAT 100."

Overview and use of data science tools in Python for data retrieval, analysis, visualization, reproducible research, and automated report generation. This new course focuses on concepts relevant to data science, taught through some of the popular software tools in this area. Doing data science is more than applying isolated methods: the course emphasizes creatively combining concepts and domain knowledge to clean, transform, analyze, and present data. Concepts in data ethics and privacy will also be discussed. Case studies will illustrate practical use of these tools in real usage scenarios.

Prerequisites:

Probability and Statistics I (PSTAT 120A)
Linear Algebra (MATH 4A)
Prior experience with Python or another programming language (CMPSC 9 or CMPSC 16)

Audience and goals

This course is a hands-on introduction to data science intended for intermediate-level students from any discipline with some exposure to probability and basic computing skills, but few or no upper-division courses in statistics or computer science. The course introduces central concepts in statistics – such as sampling variation, uncertainty, and inference – through an applied and computational lens alongside techniques for data exploration and analysis. Course activities model standard data science workflow practices by example, and successful students acquire programming skills, project management skills, and subject exposure that will serve them well in upper-division courses as well as in independent research or projects.
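As a rough illustration of what a computational lens on sampling variation can look like, the sketch below repeatedly samples from a simulated population and compares how much the sample mean varies at two sample sizes. It is not taken from the course materials: the population, the sample sizes, and the helper name `sample_means` are all hypothetical choices made for this example.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw repeated random samples (without replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: 10,000 simulated values (assumed normal, mean 50, SD 10).
population_rng = random.Random(42)
population = [population_rng.gauss(50, 10) for _ in range(10_000)]

# Sample means from small samples vary more than those from large samples.
small = sample_means(population, sample_size=10, n_samples=500)
large = sample_means(population, sample_size=200, n_samples=500)

print("SD of sample means, n=10: ", round(statistics.stdev(small), 2))
print("SD of sample means, n=200:", round(statistics.stdev(large), 2))
```

The spread of the sample means shrinks as the sample size grows, which is the sense in which larger samples reduce uncertainty about a population mean.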

Materials

Readings for the course will draw on multiple sources, including in particular the Python Data Science Handbook and Berkeley’s Data 8 Inferential Thinking and Data 100 Principles and Techniques of Data Science textbooks, all available online. Computing will be hosted on an LSIT server (link to be provided).

Learning outcomes:

In this course, students will:

Simulate, retrieve, organize, summarize, visualize, and model data using scientific computing tools in Python.
Practice critical thinking about the relationship between data collection and scope of inference, and assess the plausibility of assumptions required to meaningfully model real data.
Use appropriate programming style, conventions, and practices to write readable, organized, and reproducible code.
Demonstrate good data science workflow and communication practices through completing a collaborative data analysis project and preparing a written summary of results.

Additional Information:

**Course Schedule Spring 2021**
Week	Lecture Topic(s)	Lab	Assessment
1	Data science lifecycle	Jupyter Notebooks
2	Sampling and inference	Summary statistics and simulation
3	Data wrangling and tidy data	Pandas	HW 1 due
4	Elements of data visualization	Plot types and aesthetics
5	Exploratory analysis I	Density estimation	HW 2 due
6	Exploratory analysis II	Dimension reduction
7	Statistical models I	Simple linear regression	HW 3 due
8	Statistical models II	Multiple regression
9	Classification	Project workflow	HW 4 due
10	TBD	TBD
11	None (finals week)	None	Project due

Course Level:

Undergraduate

Course Number:

PSTAT 100

Course Time:

Winter 2022

Mon/Wed: 8:00 - 9:15 am
Labs: Tues at 2 pm, 3 pm, and 4 pm

Spring 2021

Tues/Thurs: 3:30 - 4:45 pm
Labs: Wed at 10 am, 2 pm, and 6 pm

Instructor:

Trevor Ruiz

February 24, 2021 - 9:08am