In this article, we’ll use Microsoft’s Azure Machine Learning (ML) service to predict breast cancer diagnoses from test data.
If you don’t have an Azure account, a free trial is available.
Note that, at the time of writing, ML is in preview, so the details may change. However, the basic concepts should still apply.
The University of California, Irvine (UCI) maintains a repository of machine learning data sets. We’ll use their data set of breast cancer cases from Wisconsin to build a predictive model that distinguishes between malignant and benign growths.
The transformed data set can be downloaded from https://blog.learningtree.com/wp-content/uploads/2015/01/breast-cancer-wisconsin.data.arff.txt.
Before we can start building our prediction model we need to create an ML workspace. Log into your Azure portal and, on the left-hand side (scroll down) you’ll see the Machine Learning tab. Select that and click the
New button at the bottom.
Configure the workspace as shown in the following screenshot. You’ll need to select a unique storage account name.
Click the Create an ML workspace button and wait while Azure creates your workspace. When the workspace has been created, it will appear in the main list. Select it and then click
Sign-in to ML Studio on the following page.
When you first sign in, you’ll be presented with an empty list of “experiments”. An experiment is an ML model. Add a new experiment using the button at the bottom of the screen.
Choose the Blank Experiment template. This gives us a blank canvas on which to build our prediction model.
The first thing we need to do is access the breast cancer data set. As this is available online, we can use the ML
Reader module to make it available in our experiment. We’re using a relatively small data set here, so reading it directly from the URL makes sense, but we could just as easily draw on a big data resource in Azure Storage, for example.
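ML Studio handles this step without any code, but if you like to see the moving parts, the equivalent outside the service might look like the following rough pandas sketch. The short column names and inline sample rows are our own illustrative stand-ins for the full Wisconsin file, which uses the same layout: an ID, nine cell-measurement fields, and a class field.

```python
import io
import pandas as pd

# A few illustrative rows in the layout of the Wisconsin data set: an ID,
# nine cell-measurement fields scored 1-10, and a class field where
# 2 = benign and 4 = malignant. "?" marks a missing value. The short
# column names are our own; the real file keeps the same column order.
sample = io.StringIO("""id,clump,size,shape,adhesion,epithelial,nuclei,chromatin,nucleoli,mitoses,class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,4
1057013,8,4,5,1,2,?,7,3,1,4
""")

# Read the data, treating "?" as a missing value.
df = pd.read_csv(sample, na_values="?")
print(df["class"].value_counts())  # more benign (2) than malignant (4) cases
```

(Note that the linked file is in ARFF format, so reading it directly would need an ARFF-aware loader rather than read_csv.)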
Search for the
Reader module using the search control at the top-left. Drag the
Reader module onto the experiment canvas and configure it as follows (using the URL from earlier):
Click on the
Run button in the toolbar at the bottom of the screen. After a short delay, the
Reader module will display a green check. This means that it has successfully read the data.
Right-click on the “connection” circle at the bottom of the
Reader module. Select “Visualize” from the pop-up menu.
This displays the cancer data set in tabular form. Charts at the top of the columns summarize the data. We can see the class field on the far-left has two values, 2 and 4, representing benign and malignant growths, respectively. There are more benign cases in the data set than malignant ones.
Note that all the data, apart from the diagnosis (class) and ID variables, is in the same range (1–10).
Close the visualization to return to the experiment canvas.
There are three problems with this data set.
Some of the cases in the data set have missing values. For example:
We can remove these cases from the data set using the
Missing Values Scrubber module. Search for the module, drag it onto the canvas and configure it as shown in the following screenshot. The key option is choosing
Remove entire row for missing values. Join the output of the
Reader module to the input of the
Missing Values Scrubber module.
The second problem is the ID field, which has no predictive value. The Project Columns module can be used to choose which fields to take forward into subsequent stages of the analysis, so we can use it to exclude the ID field.
Add a Project Columns module to the canvas and configure it as follows:
The red exclamation mark on the module tells us we have more work to do. Click the
Launch column selector button in the right-hand sidebar to choose the columns: include all columns except the ID column.
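The column selection amounts to dropping one column and keeping the rest. Sketched in pandas, with a couple of made-up rows for illustration:

```python
import pandas as pd

# Two illustrative cases: an ID, one measurement field, and the class.
df = pd.DataFrame({
    "id": [1000025, 1016277],
    "clump": [5, 6],
    "class": [2, 4],
})

# Exclude the ID column and keep everything else --
# the equivalent of the Project Columns selection.
projected = df.drop(columns=["id"])
print(list(projected.columns))
```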
At present, the class field—representing the diagnosis—takes values of either 2 or 4. Benign cases = 2, whereas malignant cases = 4. That’s not very intuitive, to say the least. So, we’re going to convert this field into a true/false value where true denotes that the growth is malignant. We’ll use the
Apply Math Operation module to do this.
Configure the module as shown in the following screenshot. We want to compare (
EqualTo) the class to 4, so that the result will be true when the growth is malignant. We don’t need the original data so we use the
Inplace replacement output mode. Use the
Launch column selector button to specify the class column.
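The math operation itself is just an element-wise equality test. In pandas terms, using illustrative rows:

```python
import pandas as pd

# Illustrative cases with the original 2/4 class coding.
df = pd.DataFrame({"clump": [5, 6, 3], "class": [2, 4, 2]})

# Replace the 2/4 coding in place with a boolean:
# True when the growth is malignant (class == 4).
df["class"] = df["class"] == 4
print(df["class"].tolist())  # [False, True, False]
```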
Our prediction model is going to use logistic regression classification. We will need to teach it how to make diagnoses by presenting it with a number of examples. These examples are the cases in our newly-cleaned breast cancer data set.
As we have a binary output (true/false) we’ll use the
Two-Class Logistic Regression module as our classification method. Its default settings are fine.
We also want to be able to evaluate our model by testing how well it predicts new cases. So, we’ll hold back some of the data to use for testing.
Let’s split the data into training and testing sets: 70% of the data will be used for training and the remaining 30% for testing. This can be done using the Split module.
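A 70/30 split is a standard operation in most ML toolkits. As a rough scikit-learn equivalent, on a small made-up data set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten illustrative cases: two measurement fields and a boolean class
# (True = malignant), standing in for the cleaned data set.
df = pd.DataFrame({
    "clump": [5, 6, 3, 8, 4, 7, 1, 9, 2, 10],
    "size":  [1, 8, 1, 4, 2, 6, 1, 9, 1, 10],
    "class": [False, True, False, True, False, True, False, True, False, True],
})

# Hold back 30% of the cases for testing, as in the experiment.
train, test = train_test_split(df, test_size=0.3, random_state=0)
print(len(train), len(test))  # 7 and 3
```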
Time to train the model using the
Train Model module. Connect the classification method and the training data to it, as in the following screenshot. Make sure that you specify the class column as the training output using the
Launch column selector button.
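The Two-Class Logistic Regression module broadly corresponds to scikit-learn’s LogisticRegression. A minimal training sketch, again on made-up data with our own column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative training data: two measurement fields and a boolean
# class column (True = malignant).
train = pd.DataFrame({
    "clump": [5, 6, 3, 8, 4, 7],
    "size":  [1, 8, 1, 4, 2, 9],
    "class": [False, True, False, True, False, True],
})

# Fit a two-class logistic regression on the measurement fields,
# with the class column as the output to learn.
model = LogisticRegression()
model.fit(train[["clump", "size"]], train["class"])
print(model.classes_)  # the two diagnoses the model can predict
```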
Now that the model is trained, we’ll run the test data through it and see how well it performs. This is achieved using the
Score Model module. Connect our newly-trained model and the test data to it.
At this point we could run the model and launch the visualizer on the
Score Model module’s output to see what diagnoses the model predicted from the test data. Comparing this with the actual diagnoses from the original data set would allow us to calculate the accuracy of the model.
However, this would be quite tedious—and ML provides us with a module that does this work for us. Drag an
Evaluate Model module onto the canvas and wire it to the test results.
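The Score Model and Evaluate Model steps together correspond to predicting on the held-back cases and comparing the predictions with the actual diagnoses. A scikit-learn sketch, continuing with made-up data:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative training and test cases (True = malignant).
train = pd.DataFrame({
    "clump": [5, 6, 3, 8, 4, 7],
    "size":  [1, 8, 1, 4, 2, 9],
    "class": [False, True, False, True, False, True],
})
test = pd.DataFrame({
    "clump": [2, 9],
    "size":  [1, 10],
    "class": [False, True],
})

model = LogisticRegression().fit(train[["clump", "size"]], train["class"])

# Score the held-back test cases, then tally correct and
# incorrect predictions against the actual diagnoses.
predicted = model.predict(test[["clump", "size"]])
print(confusion_matrix(test["class"], predicted))
print("accuracy:", accuracy_score(test["class"], predicted))
```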
Now for the fun part. Run the model using the toolbar button at the bottom of the screen and watch the clock icons on the modules turn to green checks as the analysis progresses.
When the analysis is complete visualize the output of the evaluation by right-clicking on the output node of the
Evaluate Model module.
Among other data, this summarizes the number of correct and incorrect predictions made by the model.
We can see that the accuracy of the model is 98%. It made two incorrect benign predictions (false negatives) and two incorrect malignant predictions (false positives).
You can see how easy it is to undertake machine learning projects in Azure. No programming is required—it’s all drag and drop. You can use other classification methods (e.g. neural networks) by dragging them onto the canvas and wiring them up to the
Train Model module (replacing the current
Two-Class Logistic Regression module).
Another significant benefit of using Azure Machine Learning is that you can publish your experiments as web services, allowing your web or mobile apps to make use of your predictive models, recommendation engines, etc. This is a “point and click” process initiated by the “Publish web service” button in the experiment toolbar.
To learn more about Azure Machine Learning, have a look at Learning Tree’s new 1-day course – Azure Machine Learning. We also have a new session on Spark Machine Learning. Both can be taken online from the convenience of home or office.