Distributed Systems
Tuesday/Thursday 8:00 AM to 9:30 AM
310 Hugh Dempster Pavilion
Instructor: Arpan Gujarati
Teaching Assistants: Milad Rezaei, Hadi Sinaee, Cathy Yang, Richard Yang
Online Forums: Piazza, Discord, Canvas
Leslie Lamport, the 2013 ACM Turing Award winner, gave the following definition of a distributed system.
A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
Yet, distribution provides numerous benefits. A system becomes more fault tolerant if there are fewer points of failure and it has no centralized components. By extending the system with more physical nodes the system gains performance and becomes more scalable, capable of handling more load. Distribution can also improve latency, by improving geographic diversity, by placing resources closer to clients who use the system. Achieving these benefits is not easy. As the quote above illustrates, distributed systems can fail in complex ways and these systems are more difficult to build, test, and understand than centralized systems.
This course will introduce you to a broad range of topics in distributed systems. The tentative topics are listed in the schedule below. For the most part this will be a lecture-style course. However, distributed system concepts are notoriously challenging to internalize without first-hand experience. The emphasis of this course, therefore, will be on building distributed system prototypes, small and large.
Overview
- Prerequisites
-
- CPSC 317 (networks)
- CPSC 313 (computer hardware and operating systems)
- Learning objectives
-
- Understand key principles in designing and implementing distributed systems
- Reason about problems that involve distributed components
- Become familiar with important techniques for solving problems that arise in distributed contexts
- Build distributed system prototypes using the Go programming language
- Communication
- We will use Piazza for all course-related communication. Access code will be shared in the first lecture.
- We have also setup a Discord server to facilitate discussion among students.
- Further, the teaching staff will have weekly, in-person office hours:
- Hadi's: Monday, 05:00 PM to 06:00 PM, ICCS 355
- Cathy's: Tuesday, 05:00 PM to 06:00 PM,
ICCS 355ICCS X141 - Milad's: Wednesday, 08:00 AM to 09:00 AM, ICCS 355
- Richard's: Thursday, 10:00 AM to 11:00 AM, ICCS X241
- Arpan's: Friday, 09:00 AM to 10:00 AM, ICCS 333
- Lectures
- The lectures will focus primarily on the understanding fundamental distributed systems concepts and how they apply to production systems. See Schedule for details.
- The research papers and lecture notes will constitute the primary reading material. You may optionally refer to the Distributed Systems: Principles and Paradigms (2nd Edition) textbook.
- The lectures will not be recorded. We will make all lecture slides, lecture notes, and research papers available online on Canvas.
- Programming Assignments / Project / Labs
- There is no open-ended project component in this course.
- There are a sequence of programming labs or assignments, which together constitute one large project, in which you will build a distributed, fault-tolerant, key-value store.
- The programming labs will require Go Programming and are due every week or two throughout the term.
- The first programming lab is to be done individually. The subsequent programming labs are to be done in pairs.
- More information will be released along with the first programming assignment.
- Grading Exams will constitute 62% of the total grade. These may be organized using the Computer Science department's Computer Based Testing Facilities (CBTF). The exams will evaluate your understanding of the lecture material as well as your understanding of the programming assignments. The tentative syllabus for each exam is specified in the Schedule.
- Quiz 1 (16%)
- Quiz 2 (16%)
- Final (30%)
Project labs will constitute 38% of the total grade.
We will evaluate your solutions using automated tests as well as your understanding of your solution using oral examination.
- LAB 1 (04%)
- LAB 2 (18%) = 2A (3%) + 2B (5%) + 2C (5%) + 2D (5%)
- LAB 3 (09%) = 3A (4%) + 3B (5%)
- LAB 4 (07%) = 4A (2%) + 4B (5%) (Note on 4B)
You do not need to pass the Final Exam to pass the course. We will look at your total grade across all components.
Note that sharing quiz / exam questions and aswers to any external site, or to people outside the course section, now or at a later point of time, is forbidden.
- Waiting List
- Do not contact the instructor or course staff about the waiting list or about admission into the class -- instructors have no control over who gets into the course.
- Waitlists are processed in priority order by the department.
- Unfortunately, we cannot sign course registration forms and have no knowledge or control over the class composition or waitlists.
- If you have any questions about registration, please contact the CS advisors.
- If you are on the waiting list, you are required to keep up with all the course work.
- Amanda and Stewart led an in-class Go tutorial in the Winter 2017 version of the course. Here is the recorded version: part 1, and part 2. These are for an earlier version of Go. They are still useful, but take them with a grain of salt.
- Late submissions for programming assignments.
- Other Absences We will deal with absences from quizzes and the final exam on a case-by-case basis.
- IMPORTANT: DO NOT copy/share any portion of the programming assignments. The project is a continuous cumulative project with a series of checkpoints (labs). Copying any portion (even from your own project from a prior semester) means you cheated on the whole project. Cheating is copying or sharing solutions. All work must be done by you. A tutor must not help you with project content. You are not allowed to use (look at, copy) any online source that is specifically related to this project beyond those explicitly provided by us. Once the project is over, your repository must remain private, and must not be shared publicly on the internet.
- Specifically, you are NOT allowed to
- look at prior solutions to this project, or current students' code for the project
- use, look at, etc. any aspects of your own prior solution, if you have taken a similar course previously
- copy code or algorithms from others or the web without appropriate attribution
- discuss code-level details or pseudocode level algorithms with students past or present or take other students' help in any detailed way
- share or discuss details of code or algorithms with students past, present or future or with tutors
- post code to the web during the course, or after completing this course
Make sure every line of code you commit is either provided by us, or written by you (or your team member).
If you see snippets of code on stack overflow or some general resource, you can use them but you must cite your source, and indicate the extent of the code used.
This policy is bidirectional: whether you copied some other code, or your code was copied, the penalties are the same.
- It is straightforward to avoid violating this policy!
- Always make sure your work is your own.
- Never look at or copy another team's solution, either from this term or past terms, either from UBC or from other schools.
- If you have a tutor, don't ask them project-specific questions; use them to help you generally on skill building.
- Do not share your code with other teams or post your solution online.
After the respective lab deadlines, we may carefully examine all commits using automated analyses and the course staff may carefully manually examine every case of derivative code.
We will contact you if we have any questions about whether you violated the policy at some point during the semester.
Even if no copied code is retained in your final solution, if you have a commit with copied code, you have still committed academic misconduct.
- The official policies for Academic Misconduct at UBC can be found at the following links.
Piazza and office hours are intended to be the primary mechanism to communicate with the teaching team. You may use private posts on Piazza to coomunicate with the instructor and the TAs. The teaching staff will not answer questions on Discord.
While we encourage everyone to participate on Piazza and Discord, participation, or lack thereof, on these forums will not influence your grades in any way.
Schedule (a work in progress; will change)
Slides and/or lecture notes will be uploaded after the lecture. These, and other reading material, including all research papers, can be downloaded from the course Canvas page. Note that, since many papers have multiple versions online, if you download a paper directly from the Internet, it may be inconsistent with the version on Canvas. All deadlines are at 6 PM, Vancouver time!
Date / Day | Week / Lecture | Theme | Topic / Reading Material | Milestones |
01/09, Tuesday | W01, L01 | OVERVIEW | slides, whiteboard | |
01/11, Thursday | W01, L02 | DISTRIBUTED PROGRAMMING | Implementing Remote Procedure Calls. TOCS 1984, slides | |
01/16, Tuesday | W02, L03 | ^ | MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004, slides, whiteboard | LAB 1 Released |
01/18, Thursday | W02, L04 | ^ | Tutorial by Hadi Sinaee on Example code used in the class. Go Programming section points to several other online tutorials. |
|
01/23, Tuesday | W03, L05 | CRASH FAULTS | Crash Consistency: FSCK and Journaling. OSTEP (chapter 42), slides Reimplementing the Cedar File System Using Logging and Group Commit. SOSP 1987 |
|
01/25, Thursday | W03, L06 | ^ | Distributed Commit. Van Steen & Tanenbaum (Section 8.5), slides, notes Guardians and Actions: Linguistic Support for Robust, Distributed Programs. TOPLAS 1983 |
LAB 1 Due |
01/30, Tuesday | W04, L07 | ^ | In Search of an Understandable Consensus Algorithm. ATC 2014 (Extended Version), slides Leader Election in Raft. Van Steen & Tanenbaum (Section 5.4.4) Consensus in Faulty Systems with Crash Failures. Van Steen & Tanenbaum (Section 8.2.3) |
LAB 2 Released |
02/01, Thursday | W04, L08 | ^ | ^, slides | |
02/06, Tuesday | W05, L09 | ^ | ^, slides | LAB 2A Due |
02/08, Thursday | W05, L10 | Quiz 1: DISTRIBUTED PROGRAMMING, CRASH FAULTS, and LAB 1 | ||
02/13, Tuesday | W06, L11 | BYZANTINE FAULTS | Practical Byzantine Fault Tolerance. OSDI 1999, slides Consensus in Faulty Systems with Arbitrary Failures. Van Steen & Tanenbaum (Section 8.2.5) |
|
02/15, Thursday | W06, L12 | ^ | Guest Lecture by Neeraj Gandhi from University of Pennsylvania REBOUND: Defending Distributed Systems Against Attacks with Bounded-Time Recovery. EuroSys 2021, slides |
LAB 2B Due |
02/20, Tuesday | W07, L13 | Mid-term Break | ||
02/22, Thursday | W07, L14 | Mid-term Break | ||
02/27, Tuesday | W08, L15 | CONSISTENCY MODELS | Consistent Ordering of Operations. Van Steen & Tanenbaum (Section 7.2.1), slides, notes Memory Coherence in Shared Virtual Memory Systems. TOCS 1989 |
LAB 2C Due |
02/29, Thursday | W08, L16 | ^ | Logical Clocks. Van Steen & Tanenbaum (Section 5.2), slides File Synchronization with Vector Time Pairs. MIT-CSAIL-TR 2005 |
|
03/05, Tuesday | W09, L17 | ^ | ^, slides | |
03/07, Thursday | W09, L18 | ^ | Logical Clocks. Van Steen & Tanenbaum (Section 5.2), slides Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. SOSP 1995 |
|
03/10, Sunday | No lectures | LAB 2D Due | ||
03/11, Monday | No lectures | LAB 3 Released | ||
03/12, Tuesday | W10, L19 | ^ | Guest Lecture by Finn Hackett from University of British Columbia Going Beyond an Incident Report with TLA+. USENIX ;login: 2023, slides, code Understanding Inconsistency in Azure Cosmos DB with TLA+. ICSE 2023 |
|
03/14, Thursday | W10, L20 | MUTUAL EXCLUSION | Mutual Exclusion. Van Steen & Tanenbaum (Section 5.3), slides An Optimal Algorithm for Mutual Exclusion in Computer Networks. CACM 1981 Deconstructing the Bakery to Build a Distributed State Machine. CACM 2022 |
|
03/17, Sunday | No lectures | LAB 3A Due | ||
03/19, Tuesday | W11, L21 | Quiz 2: BYZANTINE FAULTS, CONSISTENCY MODELS, and LABS 2 & 3 | ||
03/21, Thursday | W11, L22 | SHARDING | Guest Lecture by Milad Rezaei Hajidehi from University of British Columbia TAO: Facebook’s Distributed Data Store for the Social Graph. ATC 2013, slides |
|
03/26, Tuesday | W12, L23 | DISTRIBUTED SNAPSHOTS | Global States. Coulouris & others (Section 14.5), slides Distributed Snapshots: Determining Global States of Distributed Systems. TOCS 1985 |
|
03/27, Wednesday | No lectures | LAB 3B Due | ||
03/28, Thursday | W12, L24 | ^ | State Management in Apache Flink: Consistent Stateful Distributed Stream Processing. VLDB 2017, slides | LAB 4 Released |
04/02, Tuesday | W13, L25 | PEER-TO-PEER | Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. SIGCOMM 2001, slides Naming. Van Steen & Tanenbaum (Chapter 6) |
|
04/04, Thursday | W13, L26 | ^ | ^, slides | LAB 4A Due |
04/09, Tuesday | W14, L27 | ^ | How the Bitcoin protocol actually works. 2013, slides Bitcoin: A Peer-to-Peer Electronic Cash System. 2008 |
|
04/11, Thursday | W14, L28 | MISCELLANEOUS | Quiz 2 recap, final exam discussion, practice questions, SEI survey, slides | LAB 4B Due |
04/17, Wednesday | Final Exam: EVERYTHING, INCLUDING LABS 1-4! Time: 3:30 PM, Venue: CBTF Rooms 008 and 014 |
Go Programming
In this course we will exclusively use the Go programming language for all project labs.
Learning a new programming language is an important skill.
You will practice it in this course.
For the most part I will expect that you learn this language on your own.
We will be using Go version 1.20.3 (available at /cs/local/bin/go
on ugrad servers).
If you use a personal machine, make sure to install this exact version.
Though, please note that all homework solutions will be tested on the ugrad server machines.
Go is a systems language originally introduced by Google. It is especially well suited to building distributed systems. Like with any language, the fastest way to become proficient at Go is to put in the time writing programs in Go. Here are some resources to get you started:
Accommodations
For Lab 1, there is no extension. Please start working on it as soon as possible.
For all subsequent labs, i.e., 2A-2D, 3A and 3B, and 4A and 4B, we will apply a flexible slip date policy for late submissions. Each team will be allocated an automatic extension of 4 calendar days for the entire course. Teams can use the extension in daily increments. For instance, you can hand in Lab 2A 4 days late, or Lab 2A 2 days late and Labs 4A and 4B each 1 day late. Since we measure extensions in daily increments, submitting an assignment 1 hour late is equivalent to submitting it 1 day late.
We will provide more instructions on how to seek extensions closer to the Lab 2 release date.
No Copying Policy
How to Do Well in This Course
Learn Go early and practice it regularly. Learning a new language while being time constrained is stressful and not fun. Since the assignments rapidly increase in their difficulty, it will be to your advantage to learn Go as quickly as possible and to learn it well. The posted Go resources are a great starting point, but reading is no substitute for practice, bug, debug, practice, practice, bug, coffee, debug, practice, ...
Do not skimp on software engineering. Distributed systems are hard. They are hard to understand, to build, to debug, to run, to trace, to document, etc. Do not make your life any more difficult. Use best practices from software engineering to help you in this course. Write unit and integration tests, use version control, document your code with comments, write small prototypes, refactor your code, make your code readable and easy to run and debug. If you fail to follow best practices, they will come back to bite you later on. Unfortunately, this course will not explicitly teach you these best practices, but you probably took a course that introduced you to these concepts. If you have any questions, just ask us on Piazza.
In the lectures, we will focus on fundamental concepts of distributed systems. The reading material majorly consists of papers describing well-known systems that rely on these fundamental concepts. To do well in exams, prioritize understanding the key concepts discussed in the class, and, when reading papers, understand how these are applied to practical systems. The exams will not test your knowledge of details covered in the papers that are not related to the concepts covered in class.
Reach out for success. This is intended to be a challenging fourth year course, but that does not mean that you have to work through it on your own! The course piazza should be your first stop for all technical questions. The course has specific office hours (see top of page), but I and the TAs are flexible. Send any of us an email to schedule a time to discuss the course, the assignments, etc. University students often encounter setbacks from time to time that can impact academic performance. Discuss your situation with us or an academic advisor as early as possible. For help in addressing mental or physical health concerns, including seeing a UBC counselor or doctor, visit this link.
Acknowledgements
This course is based on the graduate course on Distributed Systems (6.584) developed by Robert Morris, Frans Kaashoek, and Nickolai Zeldovich at MIT, with permission from the content authors. Many aspects are also inspired by the graduate course on Distributed Systems taught by Peter Druschel and others at MPI-SWS.