Hello! And welcome to the 4th post in my Tales of Machine Learning series. This time, I’d like to talk about a crucial part of machine learning that, until recently, I had carefully neglected: data.
The entire reason we pursue ML is that we want technology that can intelligently analyze data and make predictions from it. If we only focus on the models and architectures of ML and not the data, then we aren’t doing ML. We’re just creating fancy, overly complex linear algebra problems.
I have recently been studying additional machine learning resources, and after looking through pages upon pages of documentation, code, and theoretical explanations, I have come to understand that acquiring, accessing, and processing data takes up at least half of the effort of studying ML as a whole. Building the NN architecture, making predictions, and training the weights take up the remainder.
The reason for this is that there is SO MUCH DATA. It can take so many forms. The format we read it in, the data we ignore or use, and how complete it is can all vary a lot. Therefore, in order to succeed in the world of ML, you also need to know how to handle data.
For Context…
After finishing my Udacity nanodegree on programming NNs with Python, I decided I wanted to create my own set of lessons and modules for people to explore the world of ML with. So, I began adapting my existing work into personal Jupyter Notebooks. Take a look at the previous blog here to learn about the struggles I faced. After getting the necessary libraries to work, I needed to get my example code to train on the datasets I used during the nanodegree. In this case, I was using the MNIST database of black-and-white images of hand-drawn digits.
The Challenge…
As I mentioned earlier, I was programming my NN to identify hand-drawn digits as their appropriate values. For instance, it should identify a picture of a 3 as the number 3. However, what I ended up struggling with was getting the NN to accept the format of the data for training. At this point, I was using PyTorch, which only accepts PyTorch tensors. For some reason, the training data kept getting rejected, even though I had imported the necessary libraries to access the MNIST database and create the architecture for the NN.
My Solution…
Fortunately, I was able to find the reason why the MNIST data wasn’t being accepted. The images I was using were PIL images, and each sample in the dataset effectively has two parts: the data portion and the label portion. The data is the picture itself. The label is what identifies the image as a particular digit. A sample for the number 3 will have a picture of a 3 as well as a label indicating the value 3.
I was initially trying to feed the NN a raw PIL image when the network only needs the data, as a tensor, to make a prediction. The NN only needs the data portion of each sample, so the question was how to separate it from the label. Fortunately, that is where the DataLoader class from Torch and Torchvision comes in. It acts as an iterable object, where each output is a batch of data and a batch of labels, already separated.
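To make that concrete, here’s a minimal sketch of the pattern (the data directory and batch size are placeholders, not my exact course code). A ToTensor transform handles the PIL-to-tensor conversion, and the DataLoader hands back images and labels separately:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Convert each PIL image into a PyTorch tensor (pixels scaled to [0, 1])
transform = transforms.ToTensor()

# Download MNIST and apply the transform to every image as it is read
train_data = datasets.MNIST(root="data", train=True, download=True,
                            transform=transform)

# The DataLoader is an iterable: each step yields a batch of image
# tensors and a matching batch of labels, already separated
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([64, 1, 28, 28])
print(labels.shape)  # torch.Size([64])
```

With this in place, the network never sees a PIL image at all; it only ever receives tensors of the data portion, while the labels ride along for computing the loss.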
Further Investigations…
Fortunately, the code I used had a reference to build off of. But most ML projects don’t train to identify digits, let alone using the MNIST database. There are several formats and several data structures that can store information. In addition to the MNIST dataset, there’s Fashion-MNIST, as well as the STL-10 dataset (see the sketch below for how little the loading code changes). There are many more open-source datasets provided by many people, some of which contain no images at all and are purely numerical and categorical information.
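As a rough sketch (the directory name is again a placeholder), the same torchvision loading pattern covers these other image datasets; only the dataset class changes:

```python
from torchvision import datasets, transforms

transform = transforms.ToTensor()

# Same pattern as MNIST, different dataset classes
fashion = datasets.FashionMNIST(root="data", train=True, download=True,
                                transform=transform)
stl10 = datasets.STL10(root="data", split="train", download=True,
                       transform=transform)
```

The purely tabular, non-image datasets are a different story, which is exactly what pushed me toward other learning material.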
Upon seeing all of these kinds of data sources, I realized I had been biased in my view of NNs, having worked only with MNIST up to this point. Therefore, I needed to see how other sources teach ML and data processing. I ended up looking at a book published by O’Reilly called Hands-On Machine Learning with Scikit-Learn & TensorFlow. The first chapter was more conceptual, discussing some of the techniques used in classification and other kinds of machine learning. In the 2nd chapter, however, the topics and coding examples focused solely on gathering, displaying, and manipulating data, and extracting specific pieces of it.
Although the actual data manipulations using Pandas and Scikit-learn were not inherently groundbreaking for me, the nuances and tricks they introduced revealed just how much content I have left to learn. The chapter covered tricks for extracting statistical values like means and standard deviations, ways to gather specific kinds of data, and ways to get numeric information from non-numeric labels.
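For a flavor of what I mean, here’s a small sketch. The DataFrame is invented for illustration (the column names loosely echo the book’s housing example, but the values are made up):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A tiny invented DataFrame standing in for a real tabular dataset
df = pd.DataFrame({
    "median_income": [2.5, 3.8, 5.1, 1.9],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})

# describe() gives count, mean, std, min, quartiles, and max in one shot
print(df.describe())
print(df["median_income"].mean(), df["median_income"].std())

# Turn a non-numeric label column into numeric codes the model can use
encoder = OrdinalEncoder()
df["proximity_code"] = encoder.fit_transform(df[["ocean_proximity"]]).ravel()
print(df)
```

None of these calls is complicated on its own; what struck me was how many of these small moves a real project chains together before a model ever sees the data.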
Takeaways…
Data itself, like any other part of the ML field, deserves due diligence and attention. However, for the longest time, I had been focused on building a NN, which is only a fraction of the world of ML. My next steps in my journey to conquer the world of ML are to go back to the world of statistics and data analysis and learn how Python can be interfaced with data.
Thanks For Reading!
Thank you for reading this post. To see the previous or next blog in this series, click on the appropriate link below.
Previous | Next (TBD)
Until next time, keep building and stay creative!