HYPERSCALE AND AI INFRASTRUCTURE
COMS E6998, Dept of Computer Science, Columbia University

LECTURES
A tentative list of the papers we will cover appears below; it may change based on the interests of the class. All students are required to read each paper before it is presented and will be graded on their demonstrated understanding of the material and their contributions to class discussion. During class, students will be asked to explain various aspects of the papers as part of the discussion.


September 10 - Course Overview and Background

September 17 - Virtualization
September 24 - Orchestration

  • Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes, "Large-scale Cluster Management at Google with Borg", Proceedings of the 10th European Conference on Computer Systems (EuroSys), Bordeaux, France, April 2015.

  • Chunqiang Tang, Kenny Yu, Kaushik Veeraraghavan, Jonathan Kaldor, Scott Michelson, Thawan Kooburat, Aravind Anbudurai, Matthew Clark, Kabir Gogia, Long Cheng, Ben Christensen, Alex Gartrell, Maxim Khutornenko, Sachin Kulkarni, Marcin Pawlowski, Tuomas Pelkonen, Andre Rodrigues, Rounak Tibrewal, Vaishnavi Venkatesan, and Peter Zhang, "Twine: A Unified Cluster Management System for Shared Infrastructure", Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Virtual, November 2020.
October 1 - No class
October 8 - File Systems and Storage

  • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY USA, October 2003.

  • Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, "Bigtable: A Distributed Storage System for Structured Data", Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Seattle, WA USA, November 2006.
October 15 - Databases

  • Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels, "Dynamo: Amazon's Highly Available Key-value Store", Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), Stevenson, WA USA, October 2007.

  • James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford, "Spanner: Google's Globally-Distributed Database", Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Hollywood, CA USA, October 2012.
October 22 - Caching

  • Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani, "Scaling Memcache at Facebook", Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Lombard, IL USA, April 2013.

  • Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani, "TAO: Facebook's Distributed Data Store for the Social Graph", Proceedings of the 2013 USENIX Annual Technical Conference (USENIX ATC), San Jose, CA USA, June 2013.
October 29 - Midterm Project Presentations
November 5 - TPUs and GPUs

  • Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeff Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Daniel Hurt, Julian Ibarz, Arjun Jaffey, Alek Jaworski, Aaron Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, David Le, Carole Leary, Zhuyao Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kshitij Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Amir Salek, Emery Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Steinberg, Ambuj Sukhwani, Matt Swett, Alice Thorson, Bojian Tian, Greg Toma, Erik Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Dave H. Yoon, "In-Datacenter Performance Analysis of a Tensor Processing Unit", Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON Canada, June 2017.

  • Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza, "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking", Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Belfast, Northern Ireland UK, April 2018.
November 12 - Scaling Models
November 19 - AI Inference Serving
November 26 - No class
December 3 - Optimizing AI Inference
December 10 - Final Project Presentations