What the Book 'Real World AI' Has to Say About Training Data for Machine Learning


The boundaries of AI and Machine Learning get pushed every year.

Every year, conferences, summits, books, and whitepapers bring us new AI and ML innovations in areas such as neural networks, deep learning architectures, and computer vision.

A lot of machine learning work is rooted in academic research. However, real-world application of AI, a.k.a. applied AI/ML, is not the same as academic AI/ML research and validation.

In the book 'Real World AI: A Practical Guide for Responsible Machine Learning', authors Alyssa Simpson Rochwerger and Wilson Pang paint a crisp picture of the difference between academic and real-world AI and ML.

I wrote this post to highlight what the authors say about one aspect of real-world AI/ML models: gathering data for modeling.

Gathering and organizing data in academia is not the same as in applied ML

A significant challenge of applied machine learning is gathering and organizing the data needed to train ML models.

This is in contrast to scientific research, where training data is usually already available and the goal is to create the right machine learning model.

The authors of 'Real World AI: A Practical Guide for Responsible Machine Learning' say "when creating AI in the real world, the data used to train the model is far more important than the model itself."

They continue: "the data used to train models in academia are only meant to prove the functionality of the model, not solve real problems. Out in the real world, high-quality and accurate data that can be used to train a working model is incredibly tricky to collect."

The trouble with gathering and preparing high quality data

The trouble with public datasets is that they are often not useful for training applied ML models, because they rarely match the specific problem at hand. This leads project engineers to either generate their own data or buy data from third-party sources.

The authors discuss a real-world ML scenario where engineers struggle to gather and organize data to train the ML model.

A group of engineers sets out to create an ML model that detects weeds among crops so that herbicide can be applied to them. This means the engineers need a lot of images of crops and weeds.

For the machine learning model to work reliably, the engineers will need photos of crops under different lighting, environmental, and soil conditions. After gathering the data, they’ll need to label the images as “plant” or “weed.”
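To make the labeling requirement concrete, here is a minimal Python sketch of how a team might check whether its labeled photos cover the conditions described above. It is an illustration only, not something taken from the book: the `LabeledImage` record, its field names, the file names, and the `min_per_group` threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical record for one labeled photo; field names are illustrative only.
@dataclass(frozen=True)
class LabeledImage:
    path: str
    label: str      # "plant" or "weed"
    lighting: str   # e.g. "sunny", "overcast"
    soil: str       # e.g. "dry", "wet"

ALLOWED_LABELS = {"plant", "weed"}

def coverage_report(images, min_per_group=2):
    """Count images per (label, lighting, soil) group and list thin groups."""
    for img in images:
        if img.label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label {img.label!r} for {img.path}")
    counts = Counter((img.label, img.lighting, img.soil) for img in images)
    # Groups that were seen, but with fewer examples than the threshold.
    thin = sorted(group for group, n in counts.items() if n < min_per_group)
    return counts, thin

# Made-up sample data for demonstration.
images = [
    LabeledImage("img_001.jpg", "plant", "sunny", "dry"),
    LabeledImage("img_002.jpg", "plant", "sunny", "dry"),
    LabeledImage("img_003.jpg", "weed", "sunny", "dry"),
    LabeledImage("img_004.jpg", "weed", "overcast", "wet"),
    LabeledImage("img_005.jpg", "weed", "overcast", "wet"),
]

counts, thin = coverage_report(images)
print(thin)
```

Note that this sketch only flags combinations that appear at least once. Combinations with no photos at all (for example, plants under overcast light) would need to be checked against an explicit list of required conditions.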

Data labeling requires manual effort and is tiring work, which has given rise to an entire industry of its own.

In industries such as healthcare and banking, training data may contain sensitive information. In such cases, outsourcing labeling tasks can be tricky and laborious due to privacy concerns.

The problem with verifying data quality and source

Verifying data quality also becomes a big headache as data comes from many different sources within the organization. The reality is that not all data is organized and up to date.

The authors say that "It’s incredibly common in an enterprise to find data scattered throughout databases in different departments without any documentation about where it’s from or how it got there."

There is often no control over when and how often data gets updated, either.

This is what the authors have to say: "as data makes its way from the point where it’s collected into the database where you find it, it’s very likely that it has been changed or manipulated in a meaningful way. If you make assumptions about how the data you’re using got there, you could end up producing a useless model."
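One way to picture the kind of check this implies is a simple audit over an inventory of datasets. The Python sketch below is an assumption-laden illustration, not a method from the book: the `catalog` entries, their field names, and the one-year staleness threshold are all made up for the example.

```python
from datetime import date

# Hypothetical catalog entries; every field name and value is made up for illustration.
catalog = [
    {"table": "sales.orders", "source": "POS export", "last_updated": date(2024, 5, 1)},
    {"table": "crm.contacts", "source": None, "last_updated": date(2021, 1, 15)},
    {"table": "ops.sensors", "source": "field devices", "last_updated": None},
]

def audit(entries, today, max_age_days=365):
    """Return {table: [issues]} for entries with missing provenance or stale data."""
    findings = {}
    for entry in entries:
        issues = []
        if not entry.get("source"):
            issues.append("undocumented source")
        updated = entry.get("last_updated")
        if updated is None:
            issues.append("unknown update date")
        elif (today - updated).days > max_age_days:
            issues.append("stale")
        if issues:
            findings[entry["table"]] = issues
    return findings

findings = audit(catalog, today=date(2024, 6, 1))
for table, issues in findings.items():
    print(table, "->", ", ".join(issues))
```

A report like this does not fix anything by itself, but it turns assumptions about where data came from into a list of questions to ask the owning department before training begins.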


Gathering, organizing, and labeling up-to-date data is no mean task.

That's why we built Traindata Inc. to help enterprises prepare data through labeling, annotation, structuring, and cleaning, on time and within budget.

Talk to us about your AI/ML training data challenges today, or visit www.traindata.us to learn more.