Why Data Management is Important for Machine Learning

In our first pre-processing guide, we're going to be talking about the importance of data management, along with what properly managed data should look like and the effect it can have on a company. It's an important topic because over the past few decades, we've seen a massive increase in automation, resulting in data quickly becoming a foundational building block in business strategy. When we combine that with cloud-based platforms, even the smallest business can use automation at some scale. And this reliance on data is only going to increase, causing companies to lean more heavily on machine learning to help guide their decision-making.

With all that being said, it's fair to assume data management will soon become a major factor in determining the success or failure of a company. We hear people talk about companies becoming more data-driven all the time, but it's important to understand just how encompassing the process really is, and how it can shape the foundation of a company.

The first step in any company should always be for the leadership team to establish a clear data strategy. This gives the company guidance on how it plans to leverage its data to drive profitability or reduce future expenses. Then, with the roadmap in place, the company can move forward with developing sound data management systems. This includes acquiring, validating, storing, protecting, and processing data, as well as managing access to it across its entire lifecycle.
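To make the "validating" step a little more concrete, here is a minimal sketch in Python. The record type, its field names, and the specific rules are all hypothetical and chosen only for illustration; a real system would take its rules from the company's own data strategy.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRecord:
    # Hypothetical fields, used only to illustrate validation.
    customer_id: str
    email: Optional[str]
    signup_date: Optional[date]
    monthly_spend: Optional[float]


def validate(record: CustomerRecord, today: date) -> List[str]:
    """Return the problems found in one record (an empty list means it passed)."""
    problems = []
    # Completeness: required values must be present.
    if not record.customer_id.strip():
        problems.append("missing customer_id")
    if record.email is None or "@" not in record.email:
        problems.append("missing or malformed email")
    # Accuracy: values must be plausible.
    if record.signup_date is None:
        problems.append("missing signup_date")
    elif record.signup_date > today:
        problems.append("signup_date is in the future")
    if record.monthly_spend is None:
        problems.append("missing monthly_spend")
    elif record.monthly_spend < 0:
        problems.append("negative monthly_spend")
    return problems


if __name__ == "__main__":
    today = date(2024, 1, 1)
    good = CustomerRecord("c-001", "ana@example.com", date(2023, 5, 1), 19.99)
    bad = CustomerRecord("c-002", None, date(2030, 1, 1), -5.0)
    print(validate(good, today))
    print(validate(bad, today))
```

Returning a list of problems, rather than stopping at the first one, makes it easier to report on overall data quality before a record ever reaches storage or a model.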

When managed properly, companies are able to develop a strong culture that embraces data-driven decisions, leading to a great deal of success. But in contrast, when data is poorly managed, companies really open themselves up to unnecessary risk, and possibly jeopardize their future success.

Now let's take a look at a couple of different companies and how data management has affected them. The first example we're going to talk about is Netflix. To me, Netflix is a great example because when their media platform first launched, their success hinged solely on how well they were able to manage their data to produce a truly individualized user experience. And I'm pretty sure we all know how well it turned out for them. But let's actually take a look at some of the things they were able to accomplish. At this point, the information's going to be a little bit out of date, because there has been a massive surge in streaming platforms since then. But originally, Netflix used and managed their data systems so well that they were able to save about $1 billion a year through customer retention. It's also pretty remarkable that after 10 years, they're still using only a slightly modified version of the original predictive algorithm. Not only that, their systems were also able to influence about 80% of the content played, and even accurately predict the success of their first original series. It's also really cool to note that through all of their success, they didn't share or sell any of their data. They even managed to avoid any major security breach.

In the next example, we're going to be talking about a web service that Google released back in 2008, called Google Flu Trends. And if you haven't heard of it, there's a pretty good reason for it. Basically, their idea stemmed from Google's previously collected data that suggested that when people got the flu, they started searching for flu-related information. So with a little bit of help from their friends from the Centers for Disease Control and Prevention, Google began mining data records of flu-related search terms to produce a results model that almost exactly matched the CDC data.

All in all, Flu Trends started off okay, but it ran into its first major problem when it underestimated the 2009 outbreak of swine flu. Originally, Google attributed the inaccuracies to changes in flu-related search behavior. So in an attempt to fix that, they revised the algorithm by expanding and modifying the search query. Needless to say, the revisions over-corrected, because just four years later, during the flu season, their algorithm overestimated flu levels by about 140%.

Google ran into the same issue as before with the search query, because searching for terms like "sore throat" or "stuffy nose" didn't necessarily mean people were searching for flu-related symptoms. I mean, for all Google knew, people were doing research on how data collection from search queries could potentially produce some pretty inaccurate models.

I think it's important for us to discuss a case like this, because it shows that even a billion-dollar company like Google is susceptible to improperly managing and implementing data. With no real way to determine the intent behind a search query, it was nearly impossible for Google to know if they were actually getting quality data. And not only that, when you're dealing with any exponential model, a small error in the input can grow exponentially in the output, so you have to be really careful with any inaccuracy in your actual data set.
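To see why inaccurate inputs are so risky with an exponential model, here is a small Python sketch. It is a generic toy model, not Google's actual algorithm, and the growth rates are assumed values picked only to show the effect.

```python
import math


def exponential_model(initial: float, rate: float, t: float) -> float:
    """Simple exponential growth: initial * e^(rate * t)."""
    return initial * math.exp(rate * t)


def relative_error(true_rate: float, measured_rate: float, t: float) -> float:
    """Relative error of the prediction at time t caused by a mis-measured rate."""
    truth = exponential_model(1.0, true_rate, t)
    estimate = exponential_model(1.0, measured_rate, t)
    return (estimate - truth) / truth


if __name__ == "__main__":
    true_rate = 0.50       # assumed "real" growth rate
    measured_rate = 0.525  # the same rate, measured 5% too high
    for t in (1, 5, 10, 20):
        err = relative_error(true_rate, measured_rate, t)
        print(f"t={t:>2}: prediction is off by {err:.1%}")
```

Under these assumptions, the same 5% measurement error puts the prediction off by roughly 2.5% at t=1, but by roughly 65% at t=20. The error in the data never changed; the model amplified it.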

I hope these examples show the importance of data management and how relevant, complete, accurate and meaningful data can help a company grow and succeed. And in contrast, when data isn't managed well, at the very least it ends up being useless. At its worst, it can not only harm the organization but also jeopardize the well-being of the people it was supposed to help.