Alex Smola: Why your machine learning algorithm is slow
Scalable machine learning is not hard, but it is easy to overlook details that cost orders of magnitude in performance. In this talk we will cover the basics of scalable data analysis on multicore machines, clusters of machines, and GPUs. In particular, we will go over recommender systems, computational advertising, and deep learning.
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. Driven by trends in the data economy, rapid progress in AI, and an increasingly programmable physical world, we are at an inflection point that demands a new class of AI system. This new class of systems will go beyond training models at scale to connecting models with the world, rendering predictions in real time under heavy query load, and adapting to new observations and contexts. These systems will need to be composable and elastically scalable to accommodate new technologies and variations in workloads. Operating in the physical world, observing intimate details of our lives, and making critical decisions, these systems must also be secure. At UC Berkeley we are starting a new five-year effort to study the design of Real-time, Intelligent, and Secure (RISE) Systems that brings together researchers across AI, robotics, security, and data systems. In this talk, I will present our research vision and then dive into ongoing projects in prediction serving and hardware-enabled distributed analytics on encrypted data.
Apache Spark is the most popular open source project for big data analytics, while matrix factorization/completion is among the most popular algorithms for collaborative filtering. We will briefly introduce Spark and alternating least squares (ALS), the matrix factorization algorithm implemented there, present a scalable implementation of ALS on Apache Spark, and share the lessons learned. We utilize Spark’s in-memory caching and a special partitioning strategy to make ALS efficient and scalable. Optimized internal data storage and other techniques are used to accelerate the computation and to improve JVM performance. As a result, Spark’s implementation of ALS scales up to billions of ratings.
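For readers unfamiliar with ALS, the alternating scheme named in the abstract above can be sketched on a single machine. This is a minimal illustration in plain Python, not Spark's implementation: it has no caching, partitioning, or optimized storage, and the function names (`solve`, `update_side`, `als`, `rmse`), the toy ratings, and the default rank and regularization values are all assumptions made for the example.

```python
import random

def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def update_side(observed, other, rank, reg):
    """Hold `other` fixed and solve one ridge regression per row of `observed`.

    observed maps a row id to a list of (column id, rating) pairs.
    """
    new = {}
    for row, obs in observed.items():
        # Normal equations: (sum of v v^T + reg * I) x = sum of rating * v
        A = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for col, rating in obs:
            v = other[col]
            for i in range(rank):
                b[i] += rating * v[i]
                for j in range(rank):
                    A[i][j] += v[i] * v[j]
        new[row] = solve(A, b)
    return new

def als(ratings, rank=2, reg=0.1, iters=20, seed=0):
    """Alternate between updating user factors and item factors."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    items = {i: [rng.uniform(0.1, 1.0) for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iters):
        users = update_side(by_user, items, rank, reg)  # items fixed
        items = update_side(by_item, users, rank, reg)  # users fixed
    return users, items

def rmse(ratings, users, items):
    """Root-mean-square error of the factorization on the given ratings."""
    se = sum((r - sum(a * b for a, b in zip(users[u], items[i]))) ** 2
             for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

# Toy (user, item, rating) triples, made up for illustration.
ratings = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 2, 4.0),
           (2, 1, 2.0), (2, 2, 2.0), (3, 0, 2.0), (3, 1, 4.0)]
users, items = als(ratings)
print("training RMSE:", rmse(ratings, users, items))
```

Each half-step is an independent least-squares problem per user (or per item), which is what makes the method easy to parallelize; how the data is laid out so that each solver sees the factors it needs is where the partitioning strategy mentioned in the abstract would come in.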
Modern machine learning tasks depend on complex pipelines composed of a diverse set of components from optimization, statistics and computation. Each one of those components comes with its own set of hyper-parameters that require careful tuning. To make things worse, those components can interact in unpredictable ways and make joint tuning necessary. In this talk I will describe a previously unknown interaction between system and optimization dynamics when running an asynchronous learning system. Asynchronous methods are widely used in deep learning, but have limited theoretical justification when applied to non-convex problems. We will see that running stochastic gradient descent (SGD) in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration. Our result does not assume convexity of the objective function, so it is applicable to deep learning systems. We observe that a standard queuing model of asynchrony results in a form of momentum that is commonly used by deep learning practitioners. This result has one important implication: when designing an asynchronous system, you have to tune your momentum! We see that tuning can substantially reduce the statistical penalty we pay for asynchrony and we provide a simple tuner to jointly control the value of momentum and the level of asynchrony.
The talk will be presented by Dr. Ioannis Mitliagkas on behalf of Chris, who has a last-minute scheduling conflict.
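The claim in the abstract above, that stale gradients behave like a momentum term, can be illustrated with a deliberately simplified, deterministic toy. It is not the queuing model or the tuner from the talk. The assumption made here is that each update applies an average of gradients taken at past iterates, where a gradient that is l steps stale gets weight p(1 - p)^l; the parameter `p`, the function names, and the quadratic objective are all hypothetical choices for the example. Under that assumption the iterates coincide with heavy-ball momentum using momentum 1 - p and step size scaled by p.

```python
def grad(w):
    """Gradient of the toy objective f(w) = 0.5 * (w - 3)^2."""
    return w - 3.0

def stale_gradient_descent(w0, step, p, iters):
    """Gradient descent where each update mixes gradients at past iterates.

    A gradient that is l steps stale gets weight p * (1 - p)**l.
    This weighting is an assumed staleness model for illustration only.
    """
    history = [w0]
    for _ in range(iters):
        t = len(history) - 1
        g = sum(p * (1 - p) ** l * grad(history[t - l]) for l in range(t + 1))
        history.append(history[-1] - step * g)
    return history

def momentum_descent(w0, step, momentum, iters):
    """Heavy-ball iteration: w_next = w - step * grad(w) + momentum * (w - w_prev)."""
    history = [w0]
    prev = w0
    for _ in range(iters):
        w = history[-1]
        history.append(w - step * grad(w) + momentum * (w - prev))
        prev = w
    return history

step, p = 0.5, 0.25
stale = stale_gradient_descent(0.0, step, p, iters=50)
heavy = momentum_descent(0.0, step * p, 1 - p, iters=50)
print("largest gap between the two trajectories:",
      max(abs(a - b) for a, b in zip(stale, heavy)))
```

The equivalence follows because the weighted gradient satisfies g_t = p * grad(w_t) + (1 - p) * g_{t-1}, and the previous update already equals -step * g_{t-1}. In this toy, the more weight stale gradients carry (smaller `p`), the larger the implicit momentum, which is why any explicit momentum setting would need to be chosen with the level of staleness in mind.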