CS699 - Representation Learning: Theory and Practice

Welcome to the course homepage of CS699 - Representation Learning: Theory and Practice.

The course is designed for PhD students wishing to gain theoretical and applied skills in Inference, Machine Learning, Information Theory, and Deep Learning. The goal of this course is to bring students up to speed on the skills required for publishing Machine Learning papers in top-tier venues (ICML, NIPS, KDD, CVPR, etc.). We will cover the necessary background in Mathematics (Linear Algebra, Matrix Calculus, Information Theory) and Programming (Numpy, Computational Graphs, TensorFlow). The course includes a variety of inter-related theoretical topics, including Deep Learning, Graphical Models, Variational Methods, and Embeddings. We extensively apply the theoretical concepts to applications (Natural Language Processing, Computer Vision, Graph Theory). The majority of the grade is hands-on: specifically, students implement programs that require a thorough understanding of both the theory and the application domain. Our goal is to show students that most of these applications require a similar set of theoretical skills, which we will teach in the course.

There are no official prerequisites for the course. The unofficial requirements are strong math skills and strong programming skills.

Time & Location

Class will be held Mondays and Wednesdays, 10AM-noon, at VKC 102.

Sami's office hours are held Mondays and Wednesdays, 2-3:20 PM, at Basement of Leavey Library (at the open discussion tables).


This course is delivered to you by:

Aram Galstyan
Greg Ver Steeg
Sami Abu-El-Haija

Teaching Assistant

Kyle Reing


Grades will be computed as follows:

Component                              % of Grade
Assignment 1 (due Sunday Sept 15)      8
Assignment 2 (due Saturday Oct 5)      16
Assignment 3                           14
Assignment 4                           14
Beyond Assignment                      14
Test [last class; multiple choice]     14
Participation [in-class, piazza]       10
All items are to be completed individually, except for the Beyond Assignment, which should be completed in groups.


The purpose of the assignments is to give you sufficient experience in deriving mathematical expressions for models, implementing them (in TensorFlow), and understanding the models, e.g. through visualizations. All assignments must be completed individually. You can ask your classmates for help, but the following rules apply:
  • Assignment discussions must take place without a computer (e.g. on a whiteboard).
  • No one may take physical or electronic notes during the discussion.
  • After the discussion, students must wait an hour before returning to the assignment.
  • The only electronic artifacts that may be shared are links to third-party information (e.g. paper links).

All assignments must be completed on Vocareum. If you have not received access, please contact Sami.

Late Days

Each student has a total of 7 late days for the semester, which can be used across all assignments combined. A student who uses more than 7 late days will receive a 20% deduction, for every additional late day, on the assignment being submitted.

Beyond Assignment

Students are to form groups (of 2 to 4 students each) that will extend one of the assignments in some direction. The direction is entirely up to you, and there are no guidelines except: be creative. Some suggestions: try the same problem on a different dataset (preferably from a different domain, e.g. NLP if the assignment was on Vision), or extend the model in some novel way that will hopefully improve some metrics of your choice. The instructors will give a list of ideas for each of the assignments. You can choose any of them or come up with your own. We will send more information about this during the course.


There is no single golden textbook for Machine Learning or Representation Learning. The reason is obvious: if one started drafting a textbook, the field would have changed significantly by the time the writing was done. Nonetheless, we will use the Deep Learning book (though we won't cover the whole thing), and we will supplement it with published papers and (free) online notes, as linked from the syllabus.
In addition, we are currently developing a supplement handout, to be completed in class, for the Deep Learning material.

Course Outline

The course outline is a moving target: we will adjust the schedule as we go. If you have any suggestions (e.g. topics you really like or dislike, even if they are not listed on the syllabus), please talk to us, either in person (preferred) or via email.

  • Reading: Linear Algebra
  • Welcome; Syllabus Overview; Course Goals; Logistics; Overview of Learning Paradigms; Matrix Calculus; Backpropagation Algorithm;
    [by Sami on whiteboard + slides 1 & 2]
  • Reading: DL#3;
  • Probability Theory and Information Theory (part 1: probability mass functions, densities, joints and conditionals. Change-of-variable rule)
    [by Greg on whiteboard + slides].
  • numpy refresher (basics, io, broadcasting, shared memory, advanced indexing, boolean ops, concatenation & stacking); Skipped: come to office hours for help!
    [by Sami on whiteboard + slides].
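For students who miss the refresher, a minimal self-contained sketch of the behaviors listed above (all variable names here are illustrative, not from the course materials):

```python
import numpy as np

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) array.
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
grid = col + row
assert grid.shape == (3, 4)

# Shared memory: basic slicing returns a view, so writes propagate back.
x = np.zeros(5)
view = x[1:4]
view[:] = 7.0
assert x[2] == 7.0

# Advanced (integer) indexing returns a copy instead, so the original is untouched.
y = np.arange(5)
copy = y[[0, 2, 4]]
copy[:] = -1
assert y[0] == 0

# Boolean ops, concatenation, and stacking.
mask = y > 2                                  # elementwise comparison -> boolean array
assert int(mask.sum()) == 2                   # values 3 and 4
assert np.concatenate([y, y]).shape == (10,)  # joins along an existing axis
assert np.stack([y, y]).shape == (2, 5)       # adds a new leading axis
```

The view-vs-copy distinction is the usual source of silent bugs, which is why it is worth checking with small asserts like these.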
Sept/2 Labor Day Holiday [no class]
  • Recommended Reading: (DL#5)
  • Probability Theory and Information Theory (part 2).

    [by Greg; slides]
  • Deep Learning light intro. Computation Graphs. Tensors. Transformations. Decision Boundaries, Tensorflow for gradient calculation.
    [by Sami on whiteboard + slides]
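As a preview of what TensorFlow's gradient calculation does under the hood, here is a hand-written reverse-mode pass over a tiny computation graph in numpy (the function, values, and names are illustrative; autodiff automates exactly this chain-rule bookkeeping):

```python
import numpy as np

# Tiny computation graph: y = sigmoid(w*x + b), loss L = (y - t)^2.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = 1.5, 1.0          # input and target
w, b = 0.4, -0.2         # parameters

# Forward pass, caching the intermediate value at every node.
z = w * x + b
y = sigmoid(z)
L = (y - t) ** 2

# Backward pass: one local derivative per node, chained together.
dL_dy = 2.0 * (y - t)
dy_dz = y * (1.0 - y)    # derivative of sigmoid, expressed via its output
dL_dz = dL_dy * dy_dz
dL_dw = dL_dz * x        # dz/dw = x
dL_db = dL_dz * 1.0      # dz/db = 1

# Sanity check against a finite-difference estimate.
eps = 1e-6
L_plus = (sigmoid((w + eps) * x + b) - t) ** 2
assert abs((L_plus - L) / eps - dL_dw) < 1e-4
```

The finite-difference check at the end is a habit worth keeping even once TensorFlow computes the gradients for you.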

Supervised Representation Learning: P(Y|X)
  • Reading (DL#6 except 6.1 and 6.5)
  • Supplement Handout
  • Deep Learning: First example. Intuition: what do the layers learn (NLP application).
  • Geometric and Bayesian Perspectives of Regularization
  • Geometric and Bayesian Interpretations of minimizing Cross Entropy.
[by Sami; on whiteboard]
  • Deep Representation Learning for Computer Vision
  • Euclidean Convolution (DL#9 up to 9.3); Spatial Pooling
  • Dropout
  • Summary of Computer Vision Tasks

[by Sami; on slides]
Sept/16 [by Sami; slides above]
  • Deep Architectures: Residual Networks, Dense Networks, U-Net, Spatial Transformer Networks, Normalization, Fully-Convolutional Networks.
  • Summary of last 2 weeks; Deep Networks as Function Approximators.

[by Sami, slides]
Unsupervised (Representation) Learning: P(X)
  • Generative Models
  • Probabilistic Graph Models (PGM)
  • Basic Intro to Statistical Physics.

[by Aram, slides]
  • PGM (continue)
  • Intro to Variational Learning
  • Restricted Boltzmann Machines (RBM);

[by Aram, slides]
  • Representation learning goals
  • Autoencoders
  • Variational Auto Encoders (DL#20)
[by Greg, slides]
[Logistics slides]
  • Rate-Distortion, mutual information, noisy channels
  • Disentanglement.
  • Invariant Representation.
[by Greg, slides]
[Logistics slides]
  • Desirable Properties of Unsupervised Models
  • Autoregressive Models & Density Estimation
  • Concrete Implementations: WaveNet and PixelCNN
  • Scheduled Sampling

[by Sami, slides]
Review of derivation of Cross Entropy. Deeper dive on the Computational Graphs of:
  • Autoregressive Models (Fully-Visible Sigmoid Belief Networks)
  • Variational Auto Encoders.
Both for Learning and Inference.
[by Sami, on whiteboard]
Oct/14 Deeper dive on the Computational Graphs of:
  • Restricted Boltzmann Machines
  • PixelCNN and WaveNet
Change of Variables Formula. Normalizing Flows [time permitting]
[by Sami, on whiteboard]
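The change-of-variables formula from this lecture can be sanity-checked numerically. A minimal sketch (the transform and sample counts are illustrative): for invertible y = g(x), p_y(y) = p_x(g⁻¹(y)) · |dg⁻¹/dy|, so with x ~ Uniform(0, 1) and y = x², p_y(y) = 1/(2√y) on (0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=500_000)
y = x ** 2  # push samples through the (invertible, monotone) transform

# Empirical density of y near y0, vs. the change-of-variables formula.
y0, h = 0.25, 0.01
empirical = np.mean((y > y0 - h) & (y < y0 + h)) / (2 * h)
analytic = 1.0 / (2.0 * np.sqrt(y0))  # p_y(0.25) = 1.0

assert abs(empirical - analytic) < 0.05
```

Normalizing flows stack many such transforms and make the log-determinant of the Jacobian cheap to compute; the density identity being exercised here is the same one.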
Guest Speaker: Alessandro Achille
  • Talk Title: Information in the Weights of Deep Neural Networks
  • Abstract: We introduce the notion of information contained in the weights of a Deep Neural Network and show that it can be used to control and describe the training process of DNNs, and can explain how properties, such as invariance to nuisance variability and disentanglement, emerge naturally in the learned representation. Through its dynamics, stochastic gradient descent (SGD) implicitly regularizes the information in the weights, which can then be used to bound the generalization error through the PAC-Bayes bound. Moreover, the information in the weights can be used to define both a topology and an asymmetric distance in the space of tasks, which can then be used to predict the training time and the performance on a new task given a solution to a pre-training task.

    While this information distance models difficulty of transfer in first approximation, we show the existence of non-trivial irreversible dynamics during the initial transient phase of convergence when the network is acquiring information, which makes the approximation fail. This is closely related to critical learning periods in biology, and suggests that studying the initial convergence transient can yield important insight beyond those that can be gleaned from the well-studied asymptotics.
[Superset of taught slides]
Representations for variable-sized data
Oct/21 Introduction to Embedding Learning.
  • Recommender Systems & Similarity matrices.
  • Closed-form Embedding Learning
  • Stochastic Embedding Learning
  • Language Embeddings & Context Distributions
[by Sami, slides & whiteboard]
Beyond Sequences and Euclidean Structures: Intro to Machine Learning on Graphs (part 1).
  • Notation & overview of applications.
  • Semi-supervision via consistency on related entities
  • Unsupervised Embedding: Matrix Factorization; Auto-encoding the Adjacency Matrix; Modeling Context;
  • Transition Matrix; Stationary Distribution; Random Walks
[by Sami]
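The transition-matrix and stationary-distribution material above can be exercised in a few lines of numpy. A minimal sketch on a toy 3-node graph (the graph is illustrative): build the row-stochastic transition matrix from the adjacency matrix, run power iteration, and confirm that for an undirected graph the stationary distribution is proportional to node degree.

```python
import numpy as np

# Adjacency matrix of a small undirected graph (here: the triangle K3).
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

# Power iteration: repeatedly take one random-walk step from a uniform start.
pi = np.full(3, 1.0 / 3.0)
for _ in range(100):
    pi = pi @ T

assert np.allclose(pi, pi @ T)         # stationarity: pi T = pi

# For a connected undirected graph, pi is proportional to node degree.
deg = A.sum(axis=1)
assert np.allclose(pi, deg / deg.sum())
```

Methods like DeepWalk, covered next, sample trajectories from exactly this kind of walk instead of computing pi in closed form.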
Oct/28 Intro to ML on Graphs (part 2):
  • Recap: Transition Matrix & Random Walks.
  • Analysis of DeepWalk.
  • Learning Random Walk hyperparameters & Context Distributions.
  • Summary of ML algorithms on Graphs.
Sequences & Recurrent Neural Networks (RNNs)
  • Simplest Form
  • Applications
  • Backpropagation through time
  • Extensions
  • Implementations
Nov/4 Convolution on Graph-structured Data.
  • Defining node ordering through heuristics
  • Mapping onto the Spectral Fourier Basis
  • Approximating the Fourier Basis
  • Message Passing
  • Higher-Order Graph Convolution
Application: Computational Graphs for Implementing Graph Convolutional Networks
  • We can collect votes on which ones we should whiteboard.
Misc Topics
Nov/11 Normalizing Flows; Generative Adversarial Networks (GANs) (DL#20). Guest Speaker: Bryan Perozzi (Senior Research Scientist @ Google AI)
Nov/18 Sequence-to-Sequence and Transformer Models; on Videos and NLP. Gradient estimators for non-differentiable computation graphs
Nov/25 Meta-Learning. Thanksgiving Holiday [no class].
  • Neural Network Compression
  • Neural Architecture Search
In-class test (multiple choice) on Misc topics; Farewell; Advice for the future.
The End