Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
One really good way to improve the accuracy of a deep learning based speech recognition system is to throw a lot of diverse data at it. That is one trick we use at Baidu's Silicon Valley AI Lab (SVAIL), but it means the network can take a very long time to train. I will talk about some of the hardware, software, and algorithmic tricks we use to enable training on tens of thousands of hours of raw speech data. These techniques are broadly applicable to a wide variety of sequence-based machine learning tasks. We use synchronous SGD while scaling to multiple nodes (8 GPUs per node). Achieving this requires paying careful attention to the performance of gradient synchronization between nodes: we have written our own version of MPI's all_reduce primitive. It also requires maintaining high performance on each individual node as you scale, which means paying close attention to the performance of your BLAS library. Combined with other tricks, such as training in reduced precision, this lets us sustain 3 TFLOPS per node while weak scaling to 128 GPUs.
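A common way to implement the gradient synchronization mentioned above is a ring all-reduce, in which each node exchanges gradient chunks only with its neighbors. The sketch below simulates that data movement in NumPy under stated assumptions: the function name, the list-of-arrays representation of per-worker gradients, and the two-phase structure are illustrative, not SVAIL's actual MPI implementation.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate the data movement of a ring all-reduce.

    grads: list of equal-length 1-D float arrays, one per "worker".
    Returns one summed array per worker (all identical afterwards).
    """
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [list(np.array_split(g, n)) for g in grads]

    # Phase 1: reduce-scatter. At step t, worker i passes chunk
    # (i - t) mod n to its right neighbor, which adds it to its own
    # copy. After n - 1 steps, worker i holds the fully summed
    # chunk (i + 1) mod n.
    for t in range(n - 1):
        sends = [(i, (i - t) % n, chunks[i][(i - t) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + data

    # Phase 2: all-gather. The completed chunks circulate around the
    # ring; receivers overwrite their stale copies rather than add.
    for t in range(n - 1):
        sends = [(i, (i + 1 - t) % n, chunks[i][(i + 1 - t) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    return [np.concatenate(c) for c in chunks]
```

A useful property of this pattern is that each worker transmits roughly 2(n-1)/n times the gradient size in total, nearly independent of the number of workers, which is why ring-style all-reduce scales much better than funneling all gradients through a single root node.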