TIME AND PLACE
Lectures: M W 12:00pm-1:30pm
Location: Zoom
UNITS: 1.0 CU
INSTRUCTOR
Jing (Jane) Li (janeli@seas.upenn.edu)
Office: Levine 274
Office Hours: 2pm-3pm M
TEACHING ASSISTANTS
- Nick Beckwith (nickbeck@seas.upenn.edu)
- Stefano Yushinski (ystefano@seas.upenn.edu)
COURSE OVERVIEW
Machine learning (ML) techniques are seeing rapidly increasing adoption in daily life, driven by synergistic advances in data, algorithms, and hardware. However, designing and implementing systems that efficiently support ML models across deployment scenarios from edge to cloud remains a significant obstacle, in large part because of the gap between machine learning’s promise (core ML algorithms and methods) and its real-world utility (diverse and heterogeneous computing platforms).
The course is designed to introduce a new engineering discipline at the intersection of machine learning and hardware systems that bridges this gap. Topics include the basics of deep learning, deep learning frameworks, deep learning on contemporary computing platforms (CPU, GPU, FPGA) and programmable accelerators (TPU), performance measures, numerical representation and customized data types for deep learning, co-optimization of deep learning algorithms and hardware, training for deep learning, and support for complex deep learning models. The course combines lectures, labs, research paper reading with in-class discussion, a final project, and guest lectures on state-of-the-art industry practice (Amazon, Facebook, Google, Intel, Microsoft, and Xilinx). The goal is to help students 1) gain hands-on experience deploying deep learning models on CPU, GPU, and FPGA; 2) develop intuition for closed-loop co-design of algorithm and hardware through engineering knobs such as algorithmic transformation, data layout, numerical precision, data reuse, and parallelism, to optimize performance given target accuracy metrics; 3) understand future trends and opportunities at the intersection of ML and computer systems; and 4) (for CIS or ML students) gain the computer hardware knowledge needed for algorithm-level optimizations.
PREREQUISITES
CIS 240, or equivalent
Proficiency in programming: ENGR105, CIS110, CIS120, or equivalent. Lab assignments in this course will be based on PyTorch (CPU, GPU) and OpenCL (FPGA).
(Note: CIS 371 is not officially required but is helpful.)
Undergraduates: Permission of the instructor is required to enroll in this class. If you are unsure whether your background is sufficient for this class, please talk to/email the instructor.
Lab 0 is designed to evaluate whether your background is sufficient to take the course.
GRADING POLICY
Lab Assignments = 40%
Final Project = 50%
Others (Reading, Course Survey) = 10%
Late policy: Each student has 5 free “late days” to use during the semester. You can use these late days to submit a lab or project after the due date without penalty. Once the quota of late days is exhausted, each additional late day deducts 50% of the assignment’s credit, i.e., zero credit after 2 such late days. Do not exhaust all the late days on the first lab.
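The late-policy arithmetic can be sketched in a few lines of Python. This is only an illustration, not an official grading tool: the function name `late_credit` and its parameters are hypothetical, and it assumes that lateness is counted in whole days and that any remaining free late days are applied before a penalty is charged.

```python
from typing import Tuple


def late_credit(days_late: int, free_days_left: int) -> Tuple[float, int]:
    """Return (fraction of credit kept, free late days remaining).

    Assumption (not stated in the syllabus): lateness is counted in whole
    days, and free late days are consumed before any penalty applies.
    """
    if days_late < 0 or free_days_left < 0:
        raise ValueError("day counts must be non-negative")
    # Cover as many late days as possible with the free quota.
    free_used = min(days_late, free_days_left)
    # Each remaining late day costs 50% of the credit, floored at zero.
    penalized_days = days_late - free_used
    credit = max(0.0, 1.0 - 0.5 * penalized_days)
    return credit, free_days_left - free_used


print(late_credit(days_late=2, free_days_left=5))  # (1.0, 3)
print(late_credit(days_late=3, free_days_left=2))  # (0.5, 0)
print(late_credit(days_late=2, free_days_left=0))  # (0.0, 0)
```

The third call shows the "zero credit after 2 late days" case: with no free late days left, two penalized days remove all of the credit.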
Collaboration policy: Study groups are allowed, and students may discuss in groups. However, we expect students to understand and complete their own lab assignments. Each student must conduct the lab independently and hand in one lab assignment per student. For the final project, students are expected to work in groups (2 students per group). Each team should turn in one final project report. In the project report, please write down each team member’s specific contribution.
Reading assignment turn-in: Paper reviews are turned in via Google Form by 11:30am, before lecture (the link will be posted on the Canvas website).
Lab/project assignment turn-in: Lab and project reports are turned in electronically through the Penn Canvas website. Log in to Canvas with your PennKey and password, then select ESE 539 from the Courses and Groups dropdown menu. Submit a single file (preferably .pdf).
CLASS HOMEPAGE:
TBA (a Canvas website will be provided)
Piazza will be used for discussions and clarifications.
INVITED SPEAKERS
CLASS SCHEDULE (TENTATIVE)
Date | Topic | Course Content | Notes/Assignment |
---|---|---|---|
01/20 | Class Introduction | | Lab 0 release |
01/25 | Introduction to Deep Learning | Model, Dataset, Cost (loss) function, Optimizer, Overfitting/Generalization, Regularization | |
01/27 | PyTorch Tutorial | | Lab 0 due, Lab 1 release |
02/01 | Deep Neural Network Architecture | Kernel Computation (Inference), AlexNet, VGG, GoogLeNet, ResNet | |
02/03 | Deep Learning System: Hardware and Software | CPU, GPU, FPGA, TPU, PyTorch, ONNX, MLPerf | Lab 1 due, Lab 2 release |
02/08 | FPGA Fundamentals | | |
02/10 | OpenCL Tutorial | | |
02/15 | Parallelism | Data/Model/Pipeline Parallelism, ILP, DLP, TLP, Roofline, Amdahl’s Law | |
02/17 | Mapping and Scheduling I | Extended Roofline, Parallelism/Data Reuse, Loop Unrolling/Order/Bound, Spatial/Temporal Choice | Lab 2 due, Lab 3 release |
02/22 | Mapping and Scheduling II | Auto Tuning, Optimization for specialized HW, Case studies | |
02/24 | Numerical Precision and Custom Data Type | INT, FP, Bfloat16, MS-FP, TF32, DLFloat16, Quantization Process (Mapping/Scaling/Range Calibration) | |
03/01 | Arithmetic Hardware | Complexity, Cost, Operator fusion | |
03/03 | Co-Design I | Dense transformation (Direct Conv, GEMM, FFT, Winograd) | Lab 3 due, Lab 4 release |
03/08 | Co-Design II_part 1 | Sparse transformation | |
03/10 | No Class | Spring Break | |
03/15 | Co-Design II_part 2 | Sparse transformation | |
03/17 | Co-Design III | Compact Models and NAS | Lab 4 due |
03/22 | Natural Language Processing | RNN, LSTM/GRU, Attention, Transformer | |
03/24 | Project Overview | | Project release |
03/29 | Training Neural Network I | Backprop, Chain Rule, Kernel Computation (Training) | |
03/31 | Training Tutorial | Compact Models and NAS | |
04/05 | Training Neural Network II | Distributed Training | |
04/07 | Guest Lecture (Derek Chiou, Microsoft) | Accelerating the Cloud | |
04/12 | No Class | Engagement Day | |
04/14 | Guest Lecture (Eriko Nurvitadhi, Intel) | Beyond Peak Performance: Comparing the Real Performance of FPGAs and GPUs on Deep Learning Workloads | |
04/19 | Guest Lecture (Ron Diamant, Randy Huang, Amazon AWS) | Accelerating the Pace of AWS Inferentia Chip Development | |
04/21 | Guest Lecture (Yuan Dong Tian, Facebook) | Using Machine Learning to learn heuristics for hard optimization problems in computer system design | |
04/26 | Wrap up | | |
04/28 | Project Presentation | | Project final report (due 05/04) |
READING:
We will assign one paper to read before most lectures. Several review questions will guide you through the paper reading process. In addition to the paper reading questions, we will also ask you to provide brief course feedback after each lecture to help us make fine-grained adjustments throughout the semester.
LAB AND PROJECT:
In addition to Lab 0, which is mainly used to evaluate your background, there are four more labs (two software labs and two hardware labs) and one final project (software/hardware co-design). Lab 1 is a one-week assignment and Labs 2-4 are two-week assignments. Labs 1 and 2 teach students how to build deep neural network (DNN) models in PyTorch and perform workload analysis on CPU and GPU. These two labs help students get familiar with the AWS computing environment and learn to navigate and modify the tools to find performance bottlenecks when running DNNs on different computing platforms. Labs 3 and 4 teach students to implement a core architectural component on FPGA in OpenCL and to get familiar with the Xilinx Vitis unified software platform. The final project is six weeks long. It requires students to apply the key learnings from the four labs and perform hardware/software co-design, using the techniques and design options introduced in the course, to achieve an end-to-end implementation optimized for inference latency under a given resource constraint and batch size.
COMPARISON TO ESE 532:
This course targets a broader audience (e.g., CIS or ML students), including but not limited to computer engineering students. It covers a higher level of abstraction in the system stack, with a focus on a specific application domain: deep learning. The deep learning-specific topics include domain-specific frameworks, algorithmic transformation, numerical precision and customized data types, and the resulting optimization opportunities in both algorithm and hardware. For non-computer engineering students, it provides the computer-related knowledge needed to develop intuition for designing hardware-friendly algorithms. For computer engineering students, it provides in-depth coverage of state-of-the-art deep learning techniques (software and hardware). This course is not a prerequisite for ESE 532, but it motivates and prepares computer engineering students before they dive into the advanced topics covered in ESE 532.
COMPARISON TO ESE 546:
This course focuses more on the practical deployment of deep learning in various computing environments (phone, wearable, cloud, and supercomputer) via the co-design of hardware and algorithm: 1) designing hardware to better support current and next-generation deep learning models and 2) designing algorithms that are hardware friendly and run efficiently on current and future systems. ESE 546 focuses more on the fundamental principles of deep learning and how to build and train deep neural networks. The two courses are complementary.
ACADEMIC MISCONDUCT:
Please refer to Penn’s Code of Academic Integrity for more information.