Hey reader, welcome to the 🔥 free edition 🔥 of my weekly newsletter. I write about ML concepts, how to build ML products, and how to thrive at work. You can learn more about me here. Feel free to send me your questions and I'm happy to offer my thoughts. Subscribe to this newsletter to receive it in your inbox every week.
If we want to train a machine learning model, we gather all the training data and let the model train on it. That's how it usually works. The training data is aggregated from various sources and structured in a way that allows the model to train on it. Now what if you're not allowed to combine the data from all those sources? How will you train the machine learning model then?
Why do we need federated learning?
In the traditional training setup, a central cloud server gathers all the data. We clean up this data, prepare it for training, and then run the code to train the model. But there are situations where we won't be able to combine all this data due to:
Data privacy: You are not allowed to share the data because of privacy laws
Data security: You can't share the data since your system might not be secure
Limited bandwidth: You don't have the bandwidth to send large quantities of raw data
In these situations, we need a modeling mechanism that doesn't combine the data from multiple sources into a central database, yet still builds a model that captures the patterns across all of them.
This is where federated learning comes into the picture. You can build models using data you don't own and aren't allowed to see.
What exactly is it?
Federated learning is a way to train a machine learning model based on data generated from multiple sources, but without exchanging data. The result is a model that encompasses all the patterns across those sources. This allows participants to have access to a robust global model without having to share their data with others.
Let's say we're training a neural network using the federated learning technique, and the data is being generated on mobile phones that are geographically distributed. In this situation, each phone trains the model on its local data samples and communicates only the resulting model updates to the central server.
How does it work?
In the federated learning setup, there's a central server that manages the learning process with all the participants. Each participant is called a node. Nodes generate data and have computing resources available locally.
To get started, the central server chooses a particular model, e.g. a random forest or a neural network. The central server then transmits the initial model to several nodes, but not all of them. This gives those nodes a common starting point rather than a blank model.
Each node starts with this model and trains it on its local dataset. Once training is finished across the selected nodes, they send their updated models back to the central server. The central server aggregates these updates, typically by averaging the model parameters, into an updated global model. It then transmits this updated model to the next set of nodes and repeats the whole process until the learning process is terminated.
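The loop described above can be sketched in a few lines of code. This is a minimal illustration of one federated averaging round, not a production implementation: the linear model, the two toy nodes, and all hyperparameters are assumptions made for the example, and a real system would add secure communication between server and nodes.

```python
# Minimal sketch of federated averaging (FedAvg) rounds.
# Illustrative only: toy linear model, synthetic per-node data.
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Train on a node's private data; only the weights leave the node."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def federated_round(global_weights, node_datasets):
    """One round: send the model out, train locally, average the results."""
    updates, sizes = [], []
    for X, y in node_datasets:
        updates.append(local_train(global_weights, X, y))
        sizes.append(len(y))
    # Weight each node's update by its dataset size, as in FedAvg
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

# Two nodes with different amounts of private data, drawn from the same
# underlying relationship (true_w) plus a little noise.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
nodes = []
for n in (50, 80):
    X = rng.normal(size=(n, 2))
    nodes.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

w = np.zeros(2)          # the server's initial global model
for _ in range(20):      # repeat rounds until the model converges
    w = federated_round(w, nodes)
```

After a handful of rounds, the global weights approach the underlying relationship even though the server never saw either node's raw data, only their locally trained weights.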
How is it used in the real world?
Federated learning can be used in any situation where some participants can generate high quality data and some participants can't. You can use federated learning to create a global model that can benefit all the participants. Here are a few examples:
Healthcare: Patient data is generated at various hospitals, but the hospitals can't share the data due to patient privacy reasons. Federated learning can come in handy here.
Industrial IoT: Sensor data is being generated at various industrial facilities. These facilities can't keep the data endpoints open at all times due to security reasons. Federated learning can help build models that can adhere to this constraint.
Mobile: Usage data is generated on everyone's mobile phones, but people don't want to mix their data with other people's data to train a global model. Streaming large quantities of raw data also isn't feasible at all times. Federated learning uses the local computing resources on each mobile device to build a model.
Self-driving cars: Autonomous cars generate large quantities of raw data, but they don't have enough bandwidth to stream all of it to the cloud. There are also latency issues that might pose safety risks. Federated learning uses local compute resources to train models and makes the result available to all the cars.
Where to go from here?
If you want to build a model using federated learning, there are many frameworks available:
TensorFlow Federated: You can use this offering from Google's TensorFlow team to build models using federated learning.
PySyft: This library is from OpenMined. It works with both PyTorch and TensorFlow.
FATE (Federated AI Technology Enabler): It's an open source library hosted by the Linux Foundation.
Federated learning is a relatively new field, so there are many challenges that still need to be addressed. The devices need to have enough compute power to run the training process locally. Smartphones are getting more powerful, but most sensors don't have the compute power to run model training.
Another key issue to keep in mind is model convergence: federated learning models typically take longer to converge than models trained the regular way. Federated learning is currently employed in cases where the nodes are powerful (e.g. smartphones) and the raw data cannot be streamed to the cloud server. Transmitting all that data is difficult over wireless networks, so training locally is advantageous.
Two new episodes on the Infinite ML pod
Amir Feizpour: We talk about building online communities, knowledge discovery, doing research, data science culture, and coaching. You can listen to it on Apple Podcasts, Spotify, and Google Podcasts.
Pedram Navid: We talk about data activation, how data can get political, reverse ETL, and how data practitioners should forge alliances. You can listen to it on Apple Podcasts, Spotify, and Google Podcasts.
How would you rate this week's newsletter?
You can rate this newsletter to let me know what you think. Your feedback will help make it better.