# `numpy_ml/neural_nets`
This module implements Keras-style building blocks for larger neural network models. It deliberately does not implement a general autograd system, in order to emphasize conceptual understanding over flexibility.
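Because there is no autograd, each layer is responsible for both its forward pass and its hand-derived backward pass. The sketch below illustrates that contract with a minimal fully-connected layer; the `Dense` class, its method names, and its caching scheme are illustrative assumptions, not the library's actual API.

```python
import numpy as np

class Dense:
    """A minimal fully-connected layer with hand-written gradients."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=1 / np.sqrt(n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, X):
        self.X = X                      # cache the input for the backward pass
        return X @ self.W + self.b

    def backward(self, dOut):
        # Gradients derived by hand -- no autograd involved
        self.dW = self.X.T @ dOut       # gradient w.r.t. the weights
        self.db = dOut.sum(axis=0)      # gradient w.r.t. the bias
        return dOut @ self.W.T          # gradient w.r.t. the layer input
```

A training loop would call `forward` on each layer in order, then `backward` in reverse, passing each layer's input gradient to the one before it.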
- **Activations**. Common activation nonlinearities.
- **Losses**. Common loss functions.
- **Wrappers**. Layer wrappers.
- **Layers**. Common layers / layer-wise operations that can be composed to create larger neural networks.
- **Optimizers**. Common modifications to stochastic gradient descent.
- **Learning Rate Schedulers**. Common learning rate decay schedules.
- **Initializers**. Common weight initialization strategies.
- **Modules**. Common multi-layer blocks that appear across many deep networks. Includes:
    - "Same-convolution" residual blocks (He et al., 2015)
- **Models**. Well-known network architectures. Includes:
    - `vae.py`: Bernoulli variational autoencoder (Kingma & Welling, 2014)
    - `wgan_gp.py`: Wasserstein generative adversarial network with gradient penalty (Gulrajani et al., 2017; Goodfellow et al., 2014)
    - `w2v.py`: word2vec model with CBOW and skip-gram architectures and training via noise contrastive estimation (Mikolov et al., 2012)
- **Utils**. Common helper functions, primarily for dealing with CNNs. Includes:
    - `im2col`
    - `col2im`
    - `conv1D`
    - `conv2D`
    - `dilate`
    - `deconv2D`
    - `minibatch`
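The `im2col` helper is the workhorse behind the convolution routines: it unrolls every receptive-field patch of the input into a column, so a convolution reduces to a single matrix multiplication against the flattened filters. A naive sketch follows, assuming an NHWC layout and no padding; the library's actual signature and vectorization strategy may differ.

```python
import numpy as np

def im2col(X, fh, fw, stride=1):
    """Unroll each (fh, fw) patch of X into a column.

    X has shape (n_samples, height, width, channels); no padding is applied.
    Returns an array of shape (fh * fw * channels, n_samples * out_h * out_w).
    """
    n, h, w, c = X.shape
    out_h = (h - fh) // stride + 1
    out_w = (w - fw) // stride + 1
    cols = np.empty((fh * fw * c, n * out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            # Extract the patch under the filter at spatial position (i, j)
            patch = X[:, i * stride:i * stride + fh, j * stride:j * stride + fw, :]
            cols[:, idx * n:(idx + 1) * n] = patch.reshape(n, -1).T
            idx += 1
    return cols
```

With this layout, `W.reshape(n_filters, -1) @ cols` computes the convolution's output for all patches at once; `col2im` performs the inverse scatter-add for the backward pass.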
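The optimizers listed above are likewise small, explicit update rules rather than opaque machinery. As one example, classical SGD with momentum can be sketched as below; the class name, the per-parameter `name` key, and the functional `update` signature are illustrative assumptions.

```python
import numpy as np

class SGDMomentum:
    """SGD with classical momentum: v <- mu * v - lr * grad; param <- param + v."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = {}                      # one velocity buffer per parameter name

    def update(self, name, param, grad):
        # Lazily initialize the velocity buffer on first use
        v = self.v.get(name, np.zeros_like(param))
        v = self.momentum * v - self.lr * grad
        self.v[name] = v
        return param + v
```

Adam, RMSProp, and the other optimizers differ only in the per-parameter statistics they track and the update rule they apply.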