Using AI Image Recognition for Breast Cancer Detection and Classification

Overview

Artificial intelligence (AI) refers to algorithms and processes that mimic human cognitive functions such as problem solving. Machine learning and deep learning are subcategories of AI. In machine learning, a system improves at a task over time by learning from data rather than by following explicitly programmed rules. Deep learning is a subset of machine learning that uses multi-layer neural networks, loosely inspired by the human brain, to learn features directly from raw data with little manual feature engineering. Major applications of AI include voice search and recognition and personalized shopping recommendations.

In this tutorial, you will learn how to apply transfer learning, a deep learning technique, on the Rescale platform. Transfer learning trains a model on a base dataset and a base task, and then transfers what the model has learned to another dataset and task. There are two approaches to transfer learning: developing the model yourself or using a pre-trained model. To develop a model yourself, you first select a related predictive modeling problem that has an abundance of data to train on. Next, you develop a skillful model on that base task, one that learns features general enough to carry over to the second task rather than features that fit only the base task. Finally, you use that model as the starting point for the second task. With a pre-trained model, you skip the first two steps: you choose an open source pre-trained model and use it as the starting point for the second task.

Many pre-trained models are available online, such as Microsoft ResNet50 and the Google Inception model. In this tutorial, you will use Microsoft ResNet50, an image classification model that was pre-trained on a large image dataset to make predictions across a large number of classes. That pre-training teaches the model to extract features from a photograph and predict what the input image shows.
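
The pre-trained approach can be sketched with the Keras API that ships with TensorFlow. This is a minimal illustration under stated assumptions, not the code from the Kaggle notebook: the function name build_transfer_model, the 224x224 input size, the choice of ImageNet weights, and the single Dense output layer are all assumptions made for this example.

```python
def build_transfer_model(num_classes=3, input_shape=(224, 224, 3), weights="imagenet"):
    """Return a ResNet50 backbone with a new classification head (illustrative)."""
    # Imported here so that defining the function does not require TensorFlow.
    import tensorflow as tf

    # Pre-trained feature extractor without its original classification layer.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape, pooling="avg"
    )
    base.trainable = False  # keep the features learned on the base task

    # New head for the second task: one output per class.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Freezing the backbone means only the new output layer is trained on the second dataset, which is the simplest form of transfer learning. Calling the function with weights="imagenet" downloads the pre-trained weights the first time it runs.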

More specifically, this tutorial shows you how to further train the pre-trained Microsoft ResNet50 model on a subset of a dataset of 780 breast ultrasound images (training), collected in 2018 from women between 25 and 75 years old, and then input another breast ultrasound image from the same dataset to see whether the model classifies it as malignant, benign, or normal (validation). You will use the Google Chrome Interactive and Conda Miniconda Interactive software to run this tutorial on Rescale Workstations. Rescale Workstations let you interact with the model in real time, so you can change the image that you want to classify and modify the code. This tutorial does not walk through every block of code; instead, it focuses on setting the project up on Rescale and on the results.
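
As a rough illustration of separating training images from validation images, the sketch below shuffles a list reproducibly and splits it. The 80/20 ratio, the seed, and the function name split_train_validation are assumptions for this example; the tutorial does not state the split that the notebook uses.

```python
import random

def split_train_validation(items, val_fraction=0.2, seed=42):
    """Shuffle items reproducibly and split them into (train, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# 780 placeholder file names standing in for the ultrasound images.
images = [f"image_{i}.png" for i in range(780)]
train, val = split_train_validation(images)
print(len(train), len(val))  # 624 156
```

Using a fixed seed keeps the split identical between runs, which makes results easier to compare when you modify the code.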

This tutorial was taken from Kaggle’s “Breast Cancer Detection Using ResNet50” by Khizar Khan. The breast ultrasound image dataset was taken from Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Dataset of breast ultrasound images. Data in Brief. 2020 Feb;28:104863. DOI: 10.1016/j.dib.2019.104863.

Configuring Your Workstation

Go through the following sections to properly configure your workstation.

  1. To start using Rescale, go to platform.rescale.com and log in using your account information. Using Rescale requires no download or installation of additional software. Rescale is browser-based, which allows you to securely access your analyses from your work computer or from your home computer.
  2. Download the two attachments you will need for this project: Dataset_BUSI_with_GT.zip and breast-cancer-detection-using-resnet50.ipynb.
  3. Click the Workstations icon in the upper left-hand corner of the screen. This takes you to the Workstations home screen. Click + New Workstation and then Create New Workstation when prompted by the side screen. This takes you to the Workstations Setup page.
  4. There are now 4 Setup stages to complete.

First, give the workstation a name. Since Rescale saves all your workstations, we recommend choosing a unique name that will help you identify it later. To change the name, click the pencil next to the current workstation name in the top left corner of the window.

Next, upload Dataset_BUSI_with_GT.zip and breast-cancer-detection-using-resnet50.ipynb by clicking the Upload from this computer button.

On completion of this step, the Attachments setup page should look like that shown below:

Click Next to move on to the Software Settings section of the Setup. Here, you select the software you want to use on the workstation. You can scroll down or use the search bar to find a software package. For this demo, select Google Chrome Interactive and Conda Miniconda Interactive.

Next, the Analysis Options must be set.

  • The drop-down selector lets you choose your preferred version of Google Chrome Interactive and Conda Miniconda Interactive.
  • The input files used in this tutorial were tested with the version labeled latest of Google Chrome Interactive and version 4.8.4_e2ed of Conda Miniconda Interactive, so select those options. Google Chrome Interactive lets you interact with the code and change its inputs in real time in Google Chrome. Conda Miniconda Interactive lets you create, save, load, and switch between software environments and manage packages on the workstation.

On completion, the Software Settings page should look like that below:

Now that you have chosen the software you want to use, the next step is to select the computing hardware for the workstation. Click on the Hardware Settings icon.

  • On this page, select your desired Core Type and how many cores you want to use for this workstation. A “core” is a virtualized computing unit, with each core representing a single core from a physical computer. For this demo, select Ruby On Demand-Priority. For an explanation of why this hardware was chosen, refer to the section Improving Workload with Coretype Explorer.
  • Set the Number of Cores to 4. This gives the workstation enough cores and memory to complete the tutorial.
  • The Walltime is how long the workstation runs before it stops automatically. Keep in mind that once a workstation is stopped (either by the walltime running out or by clicking the red Stop button in the upper right-hand corner of the screen), it cannot be restarted. Choose an amount of time that lets you complete your work and lets the workstation produce all of the desired output files, while limiting the monetary cost of running the workstation for too long. For this workstation, set the walltime to 8 hours so that you have enough time to recover from mistakes and to interact with the code by changing your input images.
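
The trade-off between walltime and cost can be estimated with simple arithmetic. The sketch below assumes that pricing is charged per core per hour and uses a placeholder price; neither the billing model nor the number is taken from Rescale's price list, so substitute the value shown for your core type.

```python
def estimate_cost(cores, walltime_hours, price_per_core_hour):
    """Upper bound on cost if the workstation runs for its full walltime."""
    return cores * walltime_hours * price_per_core_hour

# Placeholder price per core-hour, not a real Rescale rate.
HYPOTHETICAL_PRICE = 0.10

# 4 cores for the full 8-hour walltime chosen in this tutorial.
print(estimate_cost(cores=4, walltime_hours=8, price_per_core_hour=HYPOTHETICAL_PRICE))
```

Stopping the workstation before the walltime runs out would bring the actual cost below this bound.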

Your Hardware Settings screen should look like this:

Finally, move to the Review stage of Setup and check that the setup is correct. It should look like that below:

Now, hit the Submit button in the upper right corner of the screen. 

When the default access settings popup appears, click None. However, if you want to visualize using the local client, select Use this IP.

Monitoring Your Workstation

Now you can monitor the progress of your workstation from the Workstations home screen. 

On the Workstations home screen, My Workstations lists all of the workstations you have created as a Rescale user:

Active Workstations lists the workstations you are currently running:

In both My Workstations and Active Workstations, you can track the status of your workstation under Status. As you can see, the status of the Breast Cancer Detection Using ResNet50 workstation is currently ‘Starting’, which means the workstation is loading the attachments, software, and hardware you chose in the Setup stage to prepare for the interactive part of Rescale Workstations.

Once submitted, the workstation stays at the status ‘Starting’ for several minutes. After about 8 minutes, the status changes to ‘Active’ and a blue Connect button appears next to the name of the workstation:

Click Connect to open the in-browser workstation in another browser tab.

This is where you will be able to interact with your breast cancer classification model on Rescale Workstations. 

Once the browser appears, click the Terminal Emulator button.

In the terminal, type jupyter notebook. We use Jupyter Notebook because it is an open-source tool for writing and running code interactively in the browser, and it supports many programming languages through kernels, including Python, R, and Java. Click the first link that appears in the terminal output to open Jupyter Notebook, and wait a few seconds for a new browser window to appear.

Once Jupyter Notebook appears, click the work folder, then the shared folder, and then breast-cancer-detection-using-resnet50.ipynb. This opens the breast cancer classification notebook.

Now, go back to the terminal emulator.

Click File at the top left hand corner and then click Open Terminal.

To install the packages required by the breast cancer classification notebook, type these commands into the terminal, in any order:

  • python3 -m pip install tensorflow
  • pip3 install opencv-python
  • pip install pandas 
  • pip install matplotlib
  • pip install pillow 
  • pip install -U scikit-learn scipy matplotlib
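
After the installs finish, a quick check like the one below can confirm that each package is importable from the same Python interpreter that runs the notebook. It is a sketch rather than part of the original notebook; the function name find_missing is made up for this example, and the mapping reflects the fact that some packages are imported under a different name than the one used with pip.

```python
import importlib.util

# pip package name -> module name used in import statements.
REQUIRED_MODULES = {
    "tensorflow": "tensorflow",
    "opencv-python": "cv2",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "pillow": "PIL",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
}

def find_missing(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]

print("Missing packages:", find_missing() or "none")
```

Running this in a notebook cell is more informative than running it in the terminal, because it tests the environment that the notebook kernel actually uses.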

Now fill in each piece of red text marked with asterisks with values from your own workstation. The first red text is the file path of the image dataset. This step gives the code access to the images that you want to train ResNet50 on; without the file path, the code has no way of finding them. The image dataset should be the same one you attached in the Attachments stage of the workstation Setup.

Hover over the right-most file icon and then hover over the work folder. Click Open Folder.

Once in the work folder, click the shared folder. 

Next, click the Dataset_BUSI_with_GT folder.

There should be a search bar above the folders. Copy what is within this search bar.

Then, paste what was copied into the red double quotes of the second block of code.

Finally, run the first two blocks of code – you will need the output of the second block of code to modify the second red text in the fourth block of code.
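
To make the role of the dataset path concrete, the sketch below builds a dictionary that maps each class subfolder to the image files inside it. It is an illustration only: the function name list_images_by_class is made up, the example path is a placeholder, and the notebook may build its dictionary differently and in a different key order, so rely on the order that the notebook prints.

```python
from pathlib import Path

def list_images_by_class(dataset_dir):
    """Map each class subfolder name to the sorted list of PNG files it contains."""
    root = Path(dataset_dir)
    return {
        sub.name: sorted(str(p) for p in sub.glob("*.png"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }

# Example (placeholder path; paste the one copied from the file browser):
# classes = list_images_by_class("/path/to/Dataset_BUSI_with_GT")
# print(list(classes))
```

If the path is wrong, iterdir raises FileNotFoundError, which is a quick way to detect a mistyped or incomplete path before training starts.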

The second red text is in the first line of the fourth block of code. Its number must match the position of the ‘benign’ key in the dictionary printed by the second block of code. The third and fourth blocks of code are written to handle the ‘benign’ key specifically, so you must enter the index of ‘benign’ among the dictionary keys. Python uses zero-based indexing: the first key in the dictionary is 0, the second is 1, and the third is 2. For example, if ‘benign’ were the first key in your output, you would enter 0. In this tutorial’s example, ‘benign’ is the third key of the dictionary, so the number entered into the red text is 2.
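
Rather than counting keys by eye, the position can be computed. The helper below is a small sketch; the name key_index and the placeholder dictionary are assumptions for illustration, and the key order shown is not necessarily the order that your notebook prints.

```python
def key_index(dictionary, key):
    """Return the zero-based position of key among the dictionary's keys."""
    return list(dictionary).index(key)

# Placeholder dictionary; the key order here is illustrative only.
example = {"normal": [], "malignant": [], "benign": []}
print(key_index(example, "benign"))  # 2
```

Python dictionaries preserve insertion order, so the index matches the order in which the keys appear when the dictionary is printed.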

Lastly, the third red text is in the last (16th) block of code. Here you choose which picture to classify from the image dataset, which is the same dataset that you attached in the Attachments stage of Setup and referenced in the first red text above.

First, hover over the right-most file icon and then hover over the work folder. Click Open Folder.

Once in the work folder, click the shared folder.

Next, click the Dataset_BUSI_with_GT folder. This is the image dataset that contains all of the images that you could classify based on the trained model.

Once you are in the image dataset folder, you can pick any of the subfolders (benign, malignant, or normal), which contain benign, malignant, and normal images respectively, for the model to classify. Because the folder name tells you the true label, you can check the model’s answer: if you input a malignant breast ultrasound image, the model should classify it as malignant. For this example, for no particular reason, we will input a normal breast ultrasound image, specifically normal (3).png. Click the normal folder, which contains all of the normal breast ultrasound images.

Scroll to find the normal (3).png image. Once you find it, right-click it and click Copy. This copies the file path of the image, which the code uses to locate the image on the workstation.

Finally, in the red text of the last block of code, paste the file path that you just copied between the double quotes. Again, this is the image that the model will classify as benign, malignant, or normal. To classify a different image from Dataset_BUSI_with_GT.zip, copy the file path for that image and paste it into the last block of code in place of the path to normal (3).png.
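
File names in this dataset contain spaces and parentheses, so a pasted path is easy to get slightly wrong. The optional check below is a sketch, not part of the notebook; the function name check_image_path is made up, and it simply confirms that a file exists at the pasted path before the path is handed to the model.

```python
from pathlib import Path

def check_image_path(path_text):
    """Return the path as a Path object, or raise if no file exists there."""
    # Remove stray whitespace and surrounding double quotes from a pasted path.
    path = Path(path_text.strip().strip('"'))
    if not path.is_file():
        raise FileNotFoundError(f"No image found at: {path}")
    return path

# Example (placeholder path; use the path that you copied):
# image_path = check_image_path("/path/to/Dataset_BUSI_with_GT/normal/normal (3).png")
```

A clear error at this point is easier to act on than a failure later, when the image is loaded.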

Once you have modified all three red texts based on your own outputs and file directories and completed the steps above, your code is ready to run. Click Kernel in the bar across the top of the Jupyter notebook and click Restart & Run All. This runs your code from top to bottom in a fresh session, so that variables from previous runs do not interfere.

Code blocks 14 and 15 each output a graph:

In both graphs, as the number of epochs increases, accuracy on both the training and validation datasets increases while loss on both decreases. An epoch is one complete pass of the training dataset through the model. In this case, the training dataset is a subset of the images from Dataset_BUSI_with_GT.zip. In this run, additional passes over the training data made the model more accurate at predicting the correct classification (benign, malignant, or normal) for images in the validation set. This trend does not continue indefinitely: if validation accuracy stops improving while training accuracy keeps rising, the model is overfitting.
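
The same curves can be read numerically. The sketch below assumes a dictionary with the keys val_accuracy and val_loss, which is the shape of the history that Keras returns from training when accuracy is tracked; whether the notebook stores its history under these names is an assumption, and the function name summarize_history is made up for this example.

```python
def summarize_history(history):
    """Report the best validation epoch and whether validation loss ended above its minimum."""
    val_acc = history["val_accuracy"]
    val_loss = history["val_loss"]
    best = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return {
        "best_epoch": best + 1,  # epochs are counted from 1
        "best_val_accuracy": val_acc[best],
        "val_loss_rising": len(val_loss) > 1 and val_loss[-1] > min(val_loss),
    }

# Made-up numbers for illustration only; they are not results from this tutorial.
demo = {"val_accuracy": [0.60, 0.72, 0.70], "val_loss": [0.90, 0.65, 0.71]}
print(summarize_history(demo))
```

A final validation loss above its minimum is one simple hint that later epochs may not be helping, and it is worth confirming against the plotted curves.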

Once all of the code is done running, you will see that a number is printed at the bottom of the screen. This number tells you what the model classified the image as: the number ‘0’ means that the input breast ultrasound image is normal, the number ‘1’ means that the input breast ultrasound image is benign, and the number ‘2’ means that the input breast ultrasound image is malignant. In this case, we can see that the model correctly classified the normal breast ultrasound image that was inputted since the number outputted from the code is ‘0’.
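
The printed number is the index of the class with the highest score. The sketch below shows that step in plain Python, using the number-to-label mapping stated above; the function name predicted_label and the scores are made up for illustration, and a real model would supply the scores.

```python
# Mapping stated in this tutorial: 0 = normal, 1 = benign, 2 = malignant.
CLASS_NAMES = {0: "normal", 1: "benign", 2: "malignant"}

def predicted_label(scores):
    """Return (index, name) of the highest score in a list of per-class scores."""
    index = max(range(len(scores)), key=lambda i: scores[i])
    return index, CLASS_NAMES[index]

# Made-up scores for illustration; a real model would supply these values.
print(predicted_label([0.81, 0.12, 0.07]))  # (0, 'normal')
```

Printing the label name alongside the number makes the output easier to read when you classify several images in a row.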

Improving Workload with Coretype Explorer

Optionally, you can optimize your workload for performance, cost, or both by using Rescale’s Coretype Explorer.

Coretype Explorer helps you decide which core type and how many cores to use for a project. It lets you compare core types across Cores/Node, Memory/Node, Storage/Node, GPUs/Node, Price – On Demand, and Price – On Demand Priority, so you can choose the best core type for your budget, memory, and project needs. To use Coretype Explorer, go to Hardware Settings in the Setup. It is located at the bottom right corner of the Specify Hardware Settings table:

As shown below, four coretypes (Citrine, Iolite-1, Emerald, and Ruby) were compared on Cores/Node, Memory/Node, Storage/Node, and Price – ODP (On Demand Priority). Emerald was included because it is popular for running jobs and workstations thanks to its lower cost. Citrine, Iolite-1, and Ruby were included because they cost about the same as one another, which makes their memory, storage, and core counts directly comparable. Ruby was chosen for this tutorial because, although it costs more than Emerald, it offers significantly higher Cores/Node, Memory/Node, and Storage/Node than the other core types at the same price as Citrine and Iolite-1. That makes it well suited to projects that use a lot of memory and run code in an interactive environment such as Rescale Workstations. However, if you value price over memory, storage, and cores, Emerald is probably the better choice.

Conclusion

This tutorial showed you how to classify an image from a breast ultrasound image dataset collected in 2018, using a model trained from the Microsoft ResNet50 image classification model. You can change which image to classify by copying and pasting a different image file path into the last block of code. You can also apply the same principles and Microsoft ResNet50 to train on a different image dataset and classify a different kind of image.

Furthermore, you can use Coretype Explorer to compare coretypes and choose the one that fits your project and budget in the Hardware Settings stage of Setup. See Improving Workload with Coretype Explorer for more information on this option. Image classification models are widely used in artificial intelligence and machine learning for object identification in satellite images, animal classification, medical imaging, and brake light detection.

Completing this tutorial on Rescale lets you leverage high computing power, access different types of hardware and software to test how the same workstation scales, interact with the code in real time to change the image being classified, and cut the code’s runtime to less than half of what it would take to run locally in Python on your computer.