Daniel J. Hsu, Columbia University
Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no precise mathematical description of their benefits and deficiencies compared with other architectures. This talk presents positive and negative results on the representation power of attention layers, with a focus on relevant complexity parameters such as width, depth, and embedding dimension. The main results establish separations between attention layers and traditional neural network architectures such as recurrent neural networks, as well as separations between different transformer architectures. Based on joint work with Clayton Sanford (Columbia) and Matus Telgarsky (NYU).
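To fix notation for the object the abstract studies, a single softmax attention head can be sketched in NumPy. This is a minimal illustrative sketch, not code from the talk; the sequence length, model dimension, and head dimension below are arbitrary choices, and the projection matrices are random placeholders.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product (softmax) attention.

    X: (n, d_model) input sequence; Wq, Wk, Wv: projection matrices.
    Returns an (n, d_head)-shaped output mixing all positions of X.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    d = Q.shape[-1]                            # head (embedding) dimension
    scores = Q @ K.T / np.sqrt(d)              # scaled dot-product scores
    # Numerically stable row-wise softmax over key positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: n = 4 tokens, d_model = 8, d_head = 8 (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The "embedding dimension" mentioned in the abstract corresponds to the width of these projections, one of the complexity parameters the separation results are stated in terms of.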
3:30pm - Pre-talk meet-and-greet teatime - 219 Prospect Street, 13th floor; light snacks and beverages will be available in the kitchen area.