Computer Vision: Do Androids Dream of Electric Hawaii?
In Back to the Future Part II we were promised flying cars by the year 2015. Hasn't happened. We were also promised that the Chicago Cubs would win the World Series. Didn't happen -- but it sure did one year later (Go Cubs)! People have been predicting the future of technology for quite some time; some with a degree of accuracy and others with delusions of grandeur. It is a favorite pastime of both self-titled "futurists" (whatever they are) and science fiction writers alike.
Speaking of the latter, one of my favorite authors of the genre, Philip K. Dick, has a fairly good track record of predicting future tech (e.g. video chat, e-readers, online role-playing games, etc.). In the film Blade Runner (set in the year 2019 and adapted from his novel Do Androids Dream of Electric Sheep?), robots have become so realistic that they are virtually indistinguishable from humans, and if it weren't for their noticeable lack of emotions they might very well pass the Turing Test (in fact, Harrison Ford's character is charged with performing a form of such a test to distinguish humans from robots, or "replicants" as they are called in the film).
While we clearly aren't to the point of walking among lifelike automatons, we are making strides in that direction. The massive increase in modern computing power has made artificial intelligence (AI) a topic of discussion among a much broader audience than academia alone...and it is slowly permeating business.
The Eyes of Your Computer
AI and machine learning are used interchangeably by some, although in academia machine learning is typically considered a subfield of AI. One area that has grown substantially due to the aforementioned increase in computing power is computer vision: teaching machines how to "see," recognize what they are seeing, and recall what they saw so they can identify new images never before "seen."
Simply put, you feed a computer a bunch of images (even video works, since you can look frame by frame), each labeled with what it shows, and the computer learns which features best define each label and distinguish it from the others so it can recognize a new image and label it appropriately. While it sounds like a neat parlor trick, there are a number of applications in business. More on that later...
C3PO Goes on Vacation
I find it rather amazing that a feat like this can be accomplished on a single computer. Granted, in a corporate setting things get far more complicated but ultimately the algorithms stay about the same. You're really just scaling the storage to hold your images and the computing power to process them, all of which can be handled by a cloud infrastructure like AWS.
For today's demonstration, I wanted to test how well I could classify images from some of the natural wonders here on Hawai'i Island. It isn't difficult to find hundreds of images for each on Google so I set about downloading 30-50 for each. Keep in mind though that as with any machine learning or statistical process, the more (good) data the better. Many computer vision algorithms are built on hundreds of thousands of images to improve their accuracy so I was skeptical using so few for this demo.
Although the R programming language was the first one I actually spent classroom time learning, its "arch nemesis" Python is a little simpler in my opinion for this type of project. Python also has access to existing algorithms for image classification and a really simple third-party tool for creating a visual data workflow called Orange. The great thing about Orange and other similar tools (e.g. RapidMiner, KNIME, etc.) is that they simplify the process by removing some of the intricacies of manual programming -- they're drag-and-drop. That being said, it is still a VERY good idea to understand the underlying programming language here because you can't always make every tweak in Orange.
Orange: Widgets and Workflows
The Orange interface works by providing users with a set of "widgets" each designed to perform a particular task in the data analysis process. You simply drag the widget onto the workspace and double-click it to access its features. Widgets are then "linked" to one another to create the workflow which moves you from loading raw data to data cleansing, to modeling, and to validation. Here is what the full process looks like for this demo:
This might look intimidating at first but I can assure you that trying to understand the same process by looking over line after line of Python code is far more challenging. To make workflows easier to read, I recommend designing them to be read from left to right, top to bottom.
To begin, images are loaded via the "Import Images" widget. There isn't much we can do with the raw images as they are. Instead, we need to create variables that describe each image in a way a computer can understand -- numerically. This is done via the next widget in our workflow, "Image Embedding." This widget takes each image and passes it through a pre-trained model designed for image classification. What it returns are 2,048 new variables describing each image. What exactly each variable represents is somewhat of a "black box," but like most image classification models they are going to be things like horizontal/vertical lines, edges, colors, shapes, and other things that can be used to describe an image.
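To make "describing an image numerically" a little more concrete, here is a toy sketch in Python. To be clear, this is not the pre-trained model behind "Image Embedding"; I'm swapping in a much cruder hand-made descriptor (average color plus a coarse brightness histogram) just to show an image turning into a fixed-length list of numbers. The function name `toy_embedding` and the tiny 2x2 "image" are made up for illustration.

```python
def toy_embedding(pixels, bins=4):
    """Turn an image (rows of (r, g, b) tuples, 0-255) into a fixed-length vector.

    Hypothetical stand-in for a real embedder: 3 average-color values
    followed by a `bins`-bucket brightness histogram (as fractions).
    """
    flat = [px for row in pixels for px in row]
    n = len(flat)
    # Average red, green, and blue, each scaled to the range 0..1
    means = [sum(px[c] for px in flat) / (255 * n) for c in range(3)]
    # Coarse histogram of per-pixel brightness
    hist = [0] * bins
    for r, g, b in flat:
        brightness = (r + g + b) / 3  # 0..255
        idx = min(int(brightness * bins / 256), bins - 1)
        hist[idx] += 1
    return means + [count / n for count in hist]


# A made-up 2x2 "image": two ocean-blue pixels and two foliage-green pixels
image = [[(0, 0, 255), (0, 0, 255)],
         [(0, 128, 0), (0, 128, 0)]]
features = toy_embedding(image)
print(len(features), features)
```

The real embedder does the same kind of job, only with 2,048 learned values instead of 7 hand-picked ones.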
Model Building, Scoring, and Evaluation
For this demonstration, two different methods for testing the model were used: 10-fold cross-validation and hold-out validation. Let's look at the 10-fold cross-validation method first:
OK, so what exactly is "10-fold cross validation" you are probably asking? Cross-validation takes your data (in this case our images) and splits it into separate chunks -- in this case 10 chunks. We then build or "train" our model using 9 of those chunks and then "test" how well our model works on the 10th chunk. This process is repeated 10 times with each chunk playing the part of the "test chunk" once and the remaining 9 used as the "train chunks." At the end, we average together how well each iteration did classifying the images in the test chunk to arrive at a final measure of model accuracy. In our case, the model "works" well if it can take one of our images and accurately classify it into one of the six locations chosen for this demo (e.g. Pololu Valley, Rainbow Falls, etc.). All of this occurs in the "Test & Score (CV)" widget shown in the picture above.
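For the curious, the chunk-and-rotate idea can be sketched in a few lines of plain Python. This is only my illustration of the general technique, not what Orange runs internally; the function names, the fixed `seed`, and the deliberately silly "always guess the most common label" learner are all assumptions made so the example stays self-contained. It also assumes there are at least as many items as chunks.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k roughly equal chunks."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, labels, train_fn, predict_fn, k=10, seed=0):
    """Average accuracy over k rounds; each chunk is the test chunk once."""
    folds = k_fold_indices(len(items), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(items)) if i not in held_out]
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Placeholder learner: always predicts the most common training label
def train_majority(train_items, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, item):
    return model


items = list(range(20))  # stand-ins for 20 images
labels = ["Rainbow Falls"] * 12 + ["Pololu Valley"] * 8
mean_accuracy = cross_validate(items, labels, train_majority, predict_majority)
print(round(mean_accuracy, 3))
```

Swap in any real learner for `train_majority` and `predict_majority` and the averaging logic stays the same.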
But how do we train our model to accomplish this? We need some algorithm that can take our image data (i.e. the 2,048 new variables we created for each) and use it to "learn" which values best differentiate a picture of Mauna Kea from, say, Waipi'o Valley or Akaka Falls from, say, Pololu Valley.
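As a rough illustration of what "learning" from those numbers can mean, here is about the simplest learner I can think of: average the feature vectors for each location, then label a new image by whichever average it sits closest to. This nearest-centroid approach is a stand-in I picked for brevity, not one of the learners used in this demo, and the two-number "embeddings" below are made up (the real ones have 2,048 values).

```python
def train_centroids(vectors, labels):
    """Learn one 'average' feature vector (centroid) per label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict_label(centroids, vec):
    """Label a new vector with the closest centroid (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((c - v) ** 2 for c, v in zip(center, vec))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Made-up 2-value "embeddings" for four training images
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["Mauna Kea", "Mauna Kea", "Akaka Falls", "Akaka Falls"]

model = train_centroids(train_vectors, train_labels)
print(predict_label(model, [0.7, 0.3]))
```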
Orange comes with a set of built-in algorithms, or "learners," for both classification problems like ours and regression problems. The pink widgets in the above image represent some of these. Describing each falls outside the scope of this article, but what you can see is that each is linked to the "Test & Score (CV)" widget. Orange runs 10-fold cross validation using each learner and we can then compare how well each one worked:
Orange gives us a variety of evaluation results but in our case, we are interested in "CA," or Classification Accuracy. From what we can see "Logistic Regression (R)" had the best classification accuracy at "86.1%." This is good to know, but wouldn't it be more useful to actually see which images were misclassified? If we visually compare them to images that were correctly classified to their respective location we might be able to tell why they were misclassified.
The first step in this is to create a "confusion matrix," which is actually somewhat of a misnomer since these tables are easy to understand!
The image above shows where our correct and incorrect classifications occurred (you can also look at each model by selecting it in the box to the upper left). The numbers along the diagonal show correct classifications. For example, 53 images of Akaka Falls were correctly classified as "Akaka Falls" images, 1 image of Kealakekua Bay was _mis_classified as "Akaka Falls," and 2 images of Akaka Falls were _mis_classified as "Kealakekua Bay." You can see how other images were or were not classified correctly the same way.
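If you want to see how little is going on under the hood, a confusion matrix is just a tally of (actual, predicted) pairs, and classification accuracy is the share of pairs that match. Here is a small Python sketch; the five labels are made up for illustration and are not the results from this demo.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of items whose predicted label matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up results for five images
actual = ["Akaka Falls", "Akaka Falls", "Akaka Falls",
          "Kealakekua Bay", "Kealakekua Bay"]
predicted = ["Akaka Falls", "Akaka Falls", "Kealakekua Bay",
             "Kealakekua Bay", "Akaka Falls"]

counts = confusion_counts(actual, predicted)
labels = sorted(set(actual) | set(predicted))
for a in labels:
    # One row per actual label, one column per predicted label
    print(a, [counts[(a, p)] for p in labels])
print("CA:", accuracy(actual, predicted))
```

Rows are the actual locations and columns are the predicted ones, so the correct classifications land on the diagonal.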
Orange can take this a step further. By selecting one of the cells in the confusion matrix, we can see which images are in it using the "Image Viewer" widget. Let's look at our worst classifications -- those between Waipi'o Valley and Pololu Valley. Our model misclassified 13 images of Waipi'o Valley as Pololu Valley and 8 vice versa. Any coincidence they are both valleys?
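Pulling the images out of a single cell amounts to a simple filter. A minimal Python sketch, assuming you have the file names alongside the actual and predicted labels (the file names and labels here are invented):

```python
def images_in_cell(filenames, actual, predicted, actual_label, predicted_label):
    """Return the files whose (actual, predicted) labels match one matrix cell."""
    return [name
            for name, a, p in zip(filenames, actual, predicted)
            if a == actual_label and p == predicted_label]


# Hypothetical file names and results
filenames = ["waipio_01.jpg", "waipio_02.jpg", "pololu_01.jpg", "pololu_02.jpg"]
actual = ["Waipi'o Valley", "Waipi'o Valley", "Pololu Valley", "Pololu Valley"]
predicted = ["Waipi'o Valley", "Pololu Valley", "Pololu Valley", "Waipi'o Valley"]

missed = images_in_cell(filenames, actual, predicted,
                        "Waipi'o Valley", "Pololu Valley")
print(missed)
```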
Visually comparing misclassified images with correctly classified ones reveals some clues as to what our model was basing its decision on. Here are some of the correctly classified images of Waipi'o Valley:
Here are some of the images incorrectly classified as Pololu Valley:
Can you see what is happening here? Although we as humans could discern one valley from the other (decidedly easier for someone familiar with both), our model is clearly focusing on a few key attributes. In the correctly classified images we always see the valley from the same angle: blue water is on the right, green land is on the left, whitewash from the shoreline is prominent in the middle, and foliage is in the foreground. These image features are all captured within the 2,048 new variables we created initially.
Our incorrect images view Waipi'o Valley from different angles and do not contain the same prominent features. They do, however, contain features prominent in the Pololu Valley images and thus the misclassification.
Now keep in mind that we only used a few images for each location. If we were to expand to include several hundred or thousand taken from different perspectives our model should learn more of the nuances that make each location unique.
The second way to test the model is called "hold-out validation." This works on the same principle as 10-fold cross-validation, only rather than dividing our data into 10 chunks, we make just 2 chunks -- one for training our model and one for testing it. Although cross-validation is typically a better choice for testing a model, I'm presenting hold-out validation because it's a faster process, since there is only one training chunk and one testing chunk. That being said, it typically requires you to have more data, which we don't really have.
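In code, a hold-out split is just a shuffle followed by a cut. Here is a minimal Python sketch; the 70/30 proportion, the function name, and the file names are my own assumptions for illustration, not settings taken from this demo.

```python
import random


def hold_out_split(items, test_fraction=0.3, seed=None):
    """Shuffle items and split them into (train, test) lists."""
    shuffled = list(items)
    # seed=None gives a different shuffle on every run; pass a number to repeat one
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for our images
names = [f"img_{i:02d}.jpg" for i in range(10)]
train, test = hold_out_split(names, test_fraction=0.3, seed=1)
print(len(train), "training images,", len(test), "testing images")
```

Leaving `seed` unset produces a new split each time, so the measured accuracy can move around from run to run.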
In Orange, we use the widget "Data Sampler" shown at the left of the above image. This allows us to split our data into a training set and a testing set (you can look at these sets individually using the "Data Table" widget as I've shown above, but it isn't necessary to run the model). The process follows the same workflow as above, only now our "Test & Score (Train/Test Sets)" widget is set to take both our training and testing data sets. Orange also has a widget for "Predictions" which shows each image's predicted classification next to its actual classification, kind of like a disaggregated confusion matrix.
For the hold-out validation, we only used "Logistic Regression (R)" since it was our best performing learner in the cross-validation approach. We can see that our classification accuracy is about the same, but keep in mind that every time we run this model our "Data Sampler" widget will create new training and testing sets, so this can change (which is why cross-validation typically yields a more reliable accuracy estimate):
Putting Computer Vision to Good Use
So neat parlor trick, right? We can teach a computer to recognize images and label them appropriately (most of the time). Although we don't see the actual computations taking place within the Orange application, rest assured that this is no simple task. If anything we should be impressed at how easy it is to create the workflow and run the model so quickly (although with more images the process would take much longer).
However, there are some real-life applications that can be built on computer vision models -- some already in existence:
- Automotive Claims - The process of filing an insurance claim following a traffic accident can take an excessive amount of time in many cases. Damage is typically visually inspected, fault must be determined, police reports need to be reviewed, etc. While determining who will pay for damages may continue to be relegated to human judgement, determining what will be paid can be much simpler with computer vision smartphone apps. By allowing motorists to photograph the damage to their vehicle and send those images through a model similar to the one we used, damage quotes can be estimated in a matter of seconds. The insured's photo is linked to their account, which provides additional information to the model such as policy details and the car's make, model, and VIN, and the app may even use their location data to point them to the nearest body shop that can service them the quickest/cheapest. The company Tractable is currently working on this type of technology.
- (Skin) Condition Diagnosis - Possibly nowhere else will the test of computer vision be more important than in healthcare. This isn't simply because of the need for accurate diagnosis and the possibilities it could open up for an already overtaxed system (see telemedicine). It also signals a fundamental change in human/machine relations -- we reach a point where we favor the medical diagnosis of a computer over that of a human. It probably isn't far off considering some machines are proving to be better than humans in early research.
- Early Detection of Forest Disease or Agricultural Viruses - Fast-spreading diseases infecting plants can have any number of ill effects: threats to the availability (and therefore cost) of food, irreversible damage to ecosystems, and even the loss of a culturally significant tree species, as with the spread of Rapid 'Ohi'a Death here on Hawai'i Island.
While the use of aerial photography for the detection of these types of diseases isn't entirely new technology, the means of capturing the image data as well as the means of analyzing it have significantly improved. While 4K capable cameras are commercially available for drones, cameras like DARPA's 1.8 gigapixel ARGUS-IS could significantly increase the likelihood of detecting aberrations from normal tree/crop growth via this type of image modeling. Quality data in, quality data out.
The image in this article's title bar was created using DeepStyle, a spin-off of Google's DeepDream generator. Using a technique similar to ours, the model has been repurposed from an image classifier to a model interpreter. Basically, we can run the model backward and ask it to adjust the original image according to what the model was trained to do -- in this case, understand different artistic styles. This is, in a sense, a way to see what the computer is trained to see, and it can be used in applications where we ask computers to create rather than classify: music, art, and even Shakespearean plays.
So where will we go from here? Well, we might not have the flying cars we've been promised for years, but autonomous vehicles are quickly becoming reality and you can guess one of the systems used to "see" the road ahead. Maybe someday where we are going (to quote Doc Brown)...we won't need roads...