There is more data available to build ML models today.
However, the quality and volume of data differentiate good ML models from bad ones.
The rise of data-hungry algorithms such as Deep Neural Networks has pushed data teams to the brink as they chase, collect, and annotate as much data as possible to train machine learning models.
In the early days of machine learning, a data scientist's job was to collect data. Today, it is also their responsibility to label it.
There are three challenges data scientists face:
- Volume and speed: they need the right volume of data labeled as fast as the project demands, despite time and budget constraints.
- Accuracy: as data scientists chase labeling speed, they end up sacrificing labeling accuracy.
- Expertise: some data can only be labeled by experts, for example X-ray images that must be tagged for pockets of infection, or satellite images in which only specialists can annotate oil pockets.
How data scientists are trying to speed up data labeling
To save time, data scientists use an intelligent data labeling technique in which a machine learning algorithm predicts labels before the data is sent to annotators for review and correction. This technique can save time, but it rests on the assumption that we know precisely how much data we need before building the model. Based on this false assumption, data scientists label all the data in a single tranche, which is expensive and time-consuming. In reality, we don't know in advance how much data we need to train, test, and validate our ML models.
While 90% of all machine learning models are built on supervised learning, is there value in looking at unsupervised learning, or can we find the best of both worlds?
Supervised vs Unsupervised vs Semi-Supervised Learning


Semi-supervised learning paves the way for a new approach to data labeling
The old way of labeling data is sequential: build a training set without knowing how much data the ML model needs, and then use that data to train the model. Because we label the data first and feed it to the model afterwards, we end up adapting the model to the data.
The new way of labeling data goes hand-in-hand with modeling. Today, we can pick and choose the data we need to build a specific model. We can, in fact, label data and build ML models simultaneously. This is called Active Learning.
Active Learning is a process where you incrementally add more and more data to train your ML models. Instead of labeling all of the data at once, it is possible to reach the same model accuracy by labeling just a fraction of the data, as long as the most informative rows are labeled. Active Learning lets data scientists train their models and label their training sets at the same time, aiming for the best results with the minimum number of labels. Instead of adapting the ML model to the data set, Active Learning encourages us to adapt data labeling to our ML models.
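To make the loop concrete, here is a minimal sketch of pool-based Active Learning with uncertainty sampling. Everything in it is an assumption made for illustration: the data is two synthetic 2-D clusters, the "model" is a toy nearest-centroid classifier, and the hypothetical `oracle` function stands in for a human annotator. In each round, the current model picks the unlabeled rows it is least sure about, those rows get labeled, and the model is refit.

```python
import random

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def fit(labeled):
    """Toy nearest-centroid model: one centroid per class (0 and 1)."""
    return {c: centroid([x for x, y in labeled if y == c]) for c in (0, 1)}

def predict(model, x):
    return 0 if dist2(x, model[0]) <= dist2(x, model[1]) else 1

def margin(model, x):
    """A small margin means the model is unsure which class x belongs to."""
    return abs(dist2(x, model[0]) - dist2(x, model[1]))

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

def active_learning(pool, oracle, seed_set, rounds, batch_size):
    """Each round: label the most uncertain pool rows, then refit the model."""
    labeled = list(seed_set)
    unlabeled = list(pool)
    model = fit(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Rank unlabeled rows by uncertainty (smallest margin first).
        unlabeled.sort(key=lambda x: margin(model, x))
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        # The oracle (a human annotator in practice) labels only this batch.
        labeled += [(x, oracle(x)) for x in batch]
        model = fit(labeled)
    return model, labeled

def make_blobs(n, rng):
    """Synthetic data (assumption): class 0 near (0, 0), class 1 near (3, 3)."""
    data = []
    for i in range(n):
        y = i % 2
        center = 0.0 if y == 0 else 3.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), y))
    return data

rng = random.Random(0)
data = make_blobs(400, rng)
truth = dict(data)  # hidden ground truth, revealed only when the oracle is asked

def oracle(x):
    """Hypothetical annotator: returns the true label of one row."""
    return truth[x]

seed_set = data[:2]              # one labeled example per class to start
pool = [x for x, _ in data[2:]]  # everything else starts unlabeled

model, labeled = active_learning(pool, oracle, seed_set, rounds=5, batch_size=4)
print(f"labeled {len(labeled)} of {len(data)} rows, "
      f"accuracy on all rows: {accuracy(model, data):.2f}")
```

In a real project, the toy classifier would be replaced by the model actually being trained, and `oracle` by the annotation workflow. The `rounds` and `batch_size` values here are arbitrary; in practice the loop would usually stop once accuracy on a held-out set stops improving.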