Many companies do not realize that they are sitting on a pile of bad, or dirty, data. Existing data often contains missing fields, inconsistent formatting, and numerous duplicates, or is simply irrelevant. IBM estimated the annual cost of bad data to the U.S. economy at a whopping $3.1 trillion. The good news is that the data a company holds is not necessarily bad; it is often just incomplete for the problem at hand.
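To make the idea of dirty data concrete, here is a minimal sketch of a quick audit that counts three of the problems named above: missing fields, wrong formatting, and duplicates. The sample records, the field names, and the four-digit rule for the year are assumptions made up for illustration, not part of any real catalog.

```python
import re
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"title": "Dune", "author": "Frank Herbert", "year": "1965"},
    {"title": "Dune", "author": "Frank Herbert", "year": "1965"},      # exact duplicate
    {"title": "Emma", "author": "", "year": "1815"},                   # missing author
    {"title": "Ulysses", "author": "James Joyce", "year": "22/1922"},  # badly formatted year
]

# Assumption: a well-formed year is exactly four digits.
YEAR_PATTERN = re.compile(r"\d{4}")


def audit(rows):
    """Count three common kinds of dirty data in a list of dict records."""
    # Records that have at least one empty field.
    missing = sum(1 for r in rows if any(not str(v).strip() for v in r.values()))
    # Records whose year is present but does not look like a four-digit year.
    bad_year = sum(
        1 for r in rows if r.get("year") and not YEAR_PATTERN.fullmatch(r["year"])
    )
    # Every copy of a record beyond the first counts as a duplicate.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return {"missing": missing, "bad_year": bad_year, "duplicates": duplicates}


report = audit(records)
print(report)
```

A report like this does not fix anything by itself, but it gives a first estimate of how much cleanup stands between the raw data and a usable data set.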
It has to do with why present data is collected

The original system is usually built to collect the data needed for human-driven solutions, and moving to an AI-driven solution may require filling in the gaps. While a human can quickly spot those gaps and work around them, an automated system needs automated ways to wrangle the data.

Take a company that wants to build a robot that automatically puts library books on the shelves. The company has plenty of data about each book: its content, the names of the authors, and the year of publication. In reality, though, this data is not sufficient for shelving books automatically. The robot can use the existing data only to find the proper shelf for a book. It does not know the book's measurements, so it cannot easily tell whether the book will fit on the shelf. The company never thought of collecting this information because library staff could easily see whether a book fits the space. Now the company needs a completely new data set, which it does not have, so it has to equip the robot with some way of measuring the books instead. While this is not impossible, the project budget and timeline will change. That is why you should always ask yourself whether you have the right type of data to solve the problem.

Further reading: What the book 'Real World AI' says about training data for machine learning
Why you should let your product get the data

Finding good data should start with the product itself. To get good data, companies should design products that give users the right incentive to contribute their data. This is the stage where you figure out what type of data you need from your users and then decide which incentives to offer in return. Good usability and user experience will encourage users to contribute valuable information.

Further reading: Why enterprises struggle to train data for machine learning
What about the data you already have at hand?

Most organizations come to realize the need for AI- and ML-driven solutions purely because of the large quantities of data they already possess. They tell themselves, "This data should amount to something. A pattern? A change in user behavior? We need to monetize this data." But, circling back to the beginning of this article, not all data is complete or structured. Finding out whether the existing data holds any value takes four steps:
- Clearly define a business problem
- Assess whether existing data can help you gain new insights into the problem
- Analyze whether the existing data is complete
- If it is not, prepare, label, and structure the data before you hand it over to your machine learning engineering team
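
The completeness check in the list above can be sketched in a few lines. This is only an illustration built on the library robot example: the required field names, the sample catalog, and the 95% threshold are all assumptions, and a real project would pick its own.

```python
# Hypothetical fields a shelving robot would need; the names are illustrative.
REQUIRED_FIELDS = ["title", "author", "year", "height_cm", "thickness_cm"]

# Hypothetical catalog: the measurements were never collected.
catalog = [
    {"title": "Dune", "author": "Frank Herbert", "year": 1965},
    {"title": "Emma", "author": "Jane Austen", "year": None},
    {"title": "Ulysses", "author": "James Joyce", "year": 1922},
]


def field_coverage(records, required):
    """Return the fraction of records with a non-empty value for each required field."""
    total = len(records)
    coverage = {}
    for field in required:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        coverage[field] = filled / total if total else 0.0
    return coverage


def classify(coverage, threshold=0.95):
    """Split fields into usable, incomplete, and never-collected groups.

    The threshold is an arbitrary assumption for this sketch.
    """
    result = {"usable": [], "incomplete": [], "absent": []}
    for field, share in coverage.items():
        if share == 0:
            result["absent"].append(field)      # needs a new data source
        elif share < threshold:
            result["incomplete"].append(field)  # needs cleanup or backfilling
        else:
            result["usable"].append(field)
    return result


coverage = field_coverage(catalog, REQUIRED_FIELDS)
plan = classify(coverage)
print(plan)
```

Fields that land in the "absent" group point to data the original system never collected, which is the case that changes a project's budget and timeline. Fields in the "incomplete" group are candidates for the preparation work described in the last step.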