We talk about the promises of AI quite a bit but forget that there’s a big barrier to getting AI to do its AI thing… the data. If you replace the words “artificial intelligence” with “statistics,” it doesn’t sound as wonderful, but we’re really just looking for patterns in data or estimating probabilities, so we might as well.
To find these patterns, we need data, and on at least a few projects, it has been difficult to get at. Sometimes this means manually scraping data or doing data entry. Sometimes it means a social movement, like Project Common Voice, to collect samples. Or it could mean using Mechanical Turk to tag data for a training set.
Often, the data isn’t directly supplied but needs to be inferred from some other information. For example, absence data might need to be found by subtracting attendance data from the total number of possible days. It’s often not pretty, and this kind of data collection can be costly and introduce many errors. Or the data itself might be dirty.
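To make the absence example concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the function name `infer_absences`, the student names, and the 180 possible days are all assumptions, not data from a real project. It shows the subtraction itself plus the kind of check you end up writing once dirty records show up.

```python
def infer_absences(days_attended, possible_days):
    """Infer absences as possible days minus days attended.

    Returns None when the inputs cannot be trusted, rather than guessing.
    """
    if days_attended is None or possible_days is None:
        return None  # missing record
    if days_attended < 0 or days_attended > possible_days:
        return None  # dirty record: attendance outside the possible range
    return possible_days - days_attended


# Hypothetical records: (student, days attended), out of 180 possible days.
POSSIBLE_DAYS = 180
records = [("ana", 172), ("ben", 181), ("cy", None)]

absences = {name: infer_absences(days, POSSIBLE_DAYS) for name, days in records}
print(absences)
```

Only one of the three made-up records yields a usable number. The other two come back as `None`, which is the point: the inference is a one-line subtraction, and most of the code is about deciding whether the inputs can be believed.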
Yes, machine learning is a powerful tool. But so is calculus. The hardest part is putting together the mechanics of doing the calculation, including finding the data you need to get to your answer.