APC Tech Blog

This is the technical blog of AP Communications Co., Ltd. (APC).

Introduction to KDB.AI (6) - Sample code (Image Search)

Introduction

This is Abe from the Lakehouse Department of the GLB Division. In this article, we will introduce a method to search for images similar to head MRI images using KDB.AI.

KDB.AI is a knowledge-base vector database and search engine provided by KX Systems. A vector database stores data as high-dimensional vectors: by converting text (or, as in this article, image) data into vector form, computers can analyze it and respond to natural-language or similarity-based queries. Developers can build scalable, reliable real-time applications that use real-time data to provide advanced search, recommendations, and personalization for AI applications.

An overview of KDB.AI and instructions for creating an account are covered in the previous article.

techblog.ap-com.co.jp

In this article, we will introduce a method for storing brain MRI images as vectors in KDB.AI and searching for similar images. The code is based on the sample provided by KX.

kdb.ai

Purpose of the Tutorial

The purpose of this tutorial is to show how to create embeddings with a trained neural network and store them in a vector database, and how to use KDB.AI to find images similar to an input image. Specifically, we follow the steps below.

  • Loading the image data
  • Creating image embeddings
  • Storing the embeddings in KDB.AI
  • Querying the KDB.AI table
  • Searching for images similar to a target image
  • Deleting the KDB.AI table

The code in this article is taken from the GitHub repository.

github.com

The execution environment is a Databricks workspace, but you can run the code in any environment you like.

Preparation

Now let's take a look at the code. First, install the necessary packages and restart the Python process.

%pip install huggingface_hub umap-learn hdbscan tensorflow Pillow matplotlib kdbai_client -q
dbutils.library.restartPython()

Import the required libraries.

# download data
import os
from zipfile import ZipFile

# embeddings
from huggingface_hub import from_pretrained_keras
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import image_dataset_from_directory
from PIL import Image
import numpy as np
import pandas as pd

# timing
# A library that displays the progress of loops and iterable operations as a progress bar.
from tqdm.auto import tqdm

# vector DB
import kdbai_client as kdbai
#Provides utilities to securely enter sensitive information like passwords without displaying them
from getpass import getpass 
import time

# plotting
import hdbscan
import umap.umap_ as umap
from matplotlib import pyplot as plt

You may get a warning 'Could not find TensorRT', but ignore it as it will not affect later code.

Let's also define some helper functions in advance.

# Check the shape and contents of the DataFrame
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

# Load and display images
def plot_image(axis, source: str, label=None) -> None:
    axis.imshow(plt.imread(source))
    axis.axis("off")
    title = (f"{label}: " if label else "") + source.split("/")[-1]
    axis.set_title(title)

1. Load Image Data

The sample dataset is a brain tumor classification dataset obtained from Kaggle. It consists of MRI brain scan images organized into four classes according to the tumor shown in the image: glioma, meningioma, pituitary, and no tumor.

The original Kaggle dataset consists of two folders, Training and Testing, both of which contain images organized by tumor class. These images were preprocessed by resizing them to (224, 224, 3), renaming each image after its class, and giving each image a unique ID within its directory.

After preprocessing, the dataset's Training folder is used to train the ResNet model that we will use for embedding, and the Testing folder is renamed to data and used in this notebook. Since the ResNet model has never seen the images in the Testing folder, the embeddings we create are not biased by overfitting.
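The preprocessing script itself is not shown in this article. As a rough illustration of the resize-and-rename step described above, a minimal sketch might look like the following (the directory layout and naming scheme here are assumptions, not the exact script KX used):

import os
from PIL import Image

# Hypothetical preprocessing sketch: resize each image to 224x224 and save it
# as "<class>_<id>.png" under dst_dir/<class>/ (paths and naming are assumed)
def preprocess_images(src_dir: str, dst_dir: str) -> None:
    for tumor_class in os.listdir(src_dir):
        out_dir = os.path.join(dst_dir, tumor_class)
        os.makedirs(out_dir, exist_ok=True)
        files = sorted(os.listdir(os.path.join(src_dir, tumor_class)))
        for i, name in enumerate(files, start=1):
            img = Image.open(os.path.join(src_dir, tumor_class, name)).convert("RGB")
            img.resize((224, 224)).save(os.path.join(out_dir, f"{tumor_class}_{i}.png"))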

Define List Of Paths To The Extracted Image Files

Next, extract the image file paths from the subfolders of the data directory (the renamed Testing folder). We will need these paths later to pass to the function that creates the embeddings.

def extract_file_paths_from_folder(parent_dir: str) -> dict:
    image_paths = {}
    for sub_folder in os.listdir(parent_dir):
        sub_dir = os.path.join(parent_dir, sub_folder)
        image_paths[sub_folder] = [
            os.path.join(sub_dir, file) for file in os.listdir(sub_dir)
        ]
    return image_paths

image_paths_map = extract_file_paths_from_folder("data")

You can see that we obtained the paths of the image files.

Within each class folder the image files are numbered from 1 to 100, so images can be retrieved from each folder.
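As a quick sanity check, we can print how many paths were collected per class using the image_paths_map built above:

# print the number of image paths collected for each class folder
for tumor_class, paths in image_paths_map.items():
    print(tumor_class, len(paths))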

Next, use the plot_image() helper function to display image number 20 from each class as an example.

image_index = 20  # feel free to change this!

# create subplots
_, ax = plt.subplots(nrows=len(image_paths_map) // 2, ncols=2, figsize=(10, 8))
axes = ax.reshape(-1)

# get image at specified index
for i, (_, image_paths) in enumerate(image_paths_map.items()):
    for path in image_paths:
        if path.endswith(f"{image_index}.png"):
            break

    # plot each image in subplots
    plot_image(axes[i], path)

We were able to display an image for each tumor class.

Load data using image_dataset_from_directory()

The image_dataset_from_directory() function, imported above from TensorFlow's Keras API, handles image data efficiently for deep learning training and evaluation. It pairs each image with the class label inferred from the image's directory, so we can retrieve the data and labels in a format suitable for creating embeddings.

dataset = image_dataset_from_directory(
    "data",
    labels="inferred",         # labels inferred from the directory names
    label_mode="categorical",  # labels returned in one-hot encoded format
    shuffle=False,
    seed=1,
    image_size=(224, 224),
    batch_size=1,
)

2. Create Image Vector Embeddings

We use a neural network trained for brain tumor classification to create the image embeddings. This example uses a network with a ResNet-50 backbone; ResNet-50 is a neural network architecture commonly used for general image classification tasks.

ResNet-50 was originally trained on the ImageNet dataset, which contains no MRI images, let alone examples of different brain tumors, so the network was retrained to classify MRI brain scan images.
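The training code itself is not part of this notebook, but a model with the structure shown in the summary below could be assembled roughly as follows; the pooling choice, optimizer, loss, and training call are assumptions, not KX's actual script.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.applications.resnet50 import ResNet50

# Hypothetical reconstruction of the classifier: a ResNet-50 backbone with
# global average pooling (output shape (None, 2048)), followed by Flatten
# and two Dense layers for the four tumor classes
backbone = ResNet50(include_top=False, weights="imagenet",
                    input_shape=(224, 224, 3), pooling="avg")
classifier = Sequential([
    backbone,
    Flatten(),
    Dense(8, activation="relu"),
    Dense(4, activation="softmax"),
])
classifier.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(...) would then be run on the preprocessed Training folder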

Load Pre-Trained Classification Neural Network

Load a model that classifies MRI images: the retrained model mri_resnet_model, published by KX on the Hugging Face Hub.

model = from_pretrained_keras("KxSystems/mri_resnet_model")

Check the structure of the model.

model.summary()
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 resnet50 (Functional)       (None, 2048)              23587712  
                                                                 
 flatten_1 (Flatten)         (None, 2048)              0         
                                                                 
 dense_2 (Dense)             (None, 8)                 16392     
                                                                 
 dense_3 (Dense)             (None, 4)                 36        
                                                                 
=================================================================
Total params: 23604140 (90.04 MB)
Trainable params: 23551020 (89.84 MB)
Non-trainable params: 53120 (207.50 KB)
_________________________________________________________________

This model consists of four layers (ResNet-50, Flatten, and two Dense layers). The ResNet-50 layer is actually an abstraction of many layers under a single name and contains millions of parameters. The Flatten layer flattens the output of ResNet-50 into a (1, 2048) feature vector, and the last two Dense (fully connected) layers map the ResNet-50 feature vector to four outputs, one per tumor class.
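The Dense layer parameter counts in the summary follow directly from weights plus biases (inputs × units + units):

# Dense layer parameters = inputs * units (weights) + units (biases)
assert 2048 * 8 + 8 == 16392  # dense_2: 2048 inputs -> 8 units
assert 8 * 4 + 4 == 36        # dense_3: 8 inputs -> 4 units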

The network diagram below details the results of model.summary().

(Retrieved from https://github.com/KxSystems/kdbai-samples/blob/main/image_search/image_search.ipynb)

Transform Classification Network Into Embedding Network

The Dense layers were needed to classify the four brain tumor classes, but they are no longer needed: in this example we are interested not in the output of the Dense layers but in the embedding that precedes them. Therefore, call pop() twice to remove the two Dense layers from the model. The new output of the model is the (1, 2048) feature vector, i.e. the ResNet-50 embedding of the input image.

model.pop()
model.pop()
model.summary()
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 resnet50 (Functional)       (None, 2048)              23587712  
                                                                 
 flatten_1 (Flatten)         (None, 2048)              0         
                                                                 
=================================================================
Total params: 23587712 (89.98 MB)
Trainable params: 23534592 (89.78 MB)
Non-trainable params: 53120 (207.50 KB)
_________________________________________________________________

The Dense layers have been removed.

Use Embedding Network To Create Image Embeddings

To obtain the embedding data, we run the model on each image to extract its features, then save the features and class labels in their respective NumPy arrays.

# create empty arrays to store the embeddings and labels
num_files = len(dataset)
embeddings = np.empty([num_files, 1, 2048])
labels = np.empty([num_files, 1, 4])

# for each (image, label) batch in the dataset, store its embedding and class label
# ("batch" avoids shadowing the image module imported from keras.preprocessing)
for i, batch in tqdm(enumerate(dataset)):
    embeddings[i, :, :] = model.predict(batch[0], verbose=0)
    labels[i, :, :] = batch[1]

Now that we have the class labels, we can get the tumor type by checking which index in the vector is equal to 1.

# drop the singleton batch axis, reducing the arrays from 3D to 2D
reduced_embeddings = embeddings[:, 0, :]
reduced_labels = labels[:, 0, :]

# list the tumor types in order
tumor_types = ["glioma", "meningioma", "no_tumor", "pituitary"]

# for each one-hot label vector, save the tumor type at the index of its maximum
class_labels = [tumor_types[label.argmax()] for label in reduced_labels]

It's often useful to store the entire file path rather than just the name of the image, so the cell below iterates through the files and stores their paths.

# get a single list of all paths
all_paths = []
for _, image_paths in image_paths_map.items():
    all_paths += image_paths

# sort the paths in alphanumeric order
sorted_all_paths = sorted(all_paths)

Now we have all the components: image file paths, image classes, and vector embeddings. Because image_dataset_from_directory() was called with shuffle=False, it walks the class folders in sorted order, so the alphanumerically sorted paths line up row by row with the embeddings. The next step is to combine everything into a DataFrame for insertion into the KDB.AI vector database.

embedded_df = pd.DataFrame(
    {
        "source": sorted_all_paths,
        "class": class_labels,
        "embedding": reduced_embeddings.tolist(),
    }
)

show_df(embedded_df)

We have created a DataFrame consisting of the image file path, tumor class, and embedding values.

Visualising The Embeddings

Since the feature embeddings are high-dimensional, it is difficult to see how they are organized and clustered. UMAP offers an accessible way to inspect this: it is a dimensionality reduction method that lets us visualize the clustering in 2D. This not only gives a better sense of how well the classification network succeeded, but also provides insight into where misclassifications may occur.

_umap = umap.UMAP(n_neighbors=15, min_dist=0.0)  # create a UMAP instance
umap_df = pd.DataFrame(_umap.fit_transform(reduced_embeddings), columns=["u0", "u1"])  # reduce to 2D

The parameters passed when creating the UMAP instance:

  • n_neighbors: how many neighbors UMAP considers around each data point. Larger values emphasize the global structure of the data; smaller values emphasize local structure.
  • min_dist: the minimum distance between data points in the low-dimensional space. When this distance is small, data points tend to form tight clusters; when it is large, they tend to spread out.

Before dimension reduction

reduced_embeddings

After dimension reduction

show_df(umap_df)

Using UMAP, we reduced the embeddings to a two-column (2D) DataFrame.

Next, cluster the dimensionally reduced embedding data. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is used for clustering.

_hdbscan = hdbscan.HDBSCAN(min_cluster_size=5)
clusters = _hdbscan.fit_predict(umap_df)

# number of unique clusters
len(list(set(clusters)))
22

22 clusters were formed. min_cluster_size specifies the minimum number of points required to be recognized as a cluster; in this example, groups of fewer than 5 points are treated as noise (HDBSCAN labels noise points -1, so if any noise is present it counts as one of the 22 unique labels).
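To see how many points HDBSCAN treated as noise, we can count the -1 labels in the clusters array from above:

# HDBSCAN assigns the label -1 to points it considers noise
num_noise = int(np.sum(clusters == -1))
print(f"noise points: {num_noise} of {len(clusters)}")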

Now plot the embeddings in 2D, displaying each class in a different color.

# define color for each class label
class_colors = {
    'glioma': 'blue',
    'meningioma': 'red',
    'no_tumor': 'green',
    'pituitary': 'purple',
}

# Create a figure for plotting
plt.figure(figsize=(10, 8)) 

# Scatter plot with 'u0' and 'u1' columns as x and y, color mapped by class_labels
for class_label, color in class_colors.items():
    indices = [i for i, label in enumerate(class_labels) if label == class_label]
    subset = umap_df.iloc[indices]
    plt.scatter(subset['u0'], subset['u1'], label=f'{class_label}', color=color)

# beautify plot
plt.title('Embeddings Map for MRI Brain Scan Images')
plt.xlabel('u0')
plt.ylabel('u1')
plt.legend()
plt.show()

This is the result of plotting the embeddings for each tumor class.

As shown above, most of the data separates well between classes, but there is still some overlap, especially in the glioma class (shown in blue). However, for most points on the graph, the nearest point belongs to the same class as the point itself. Therefore, when performing a vector similarity search on the embedding data, the majority of results should be of the same class.
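As a rough check of this claim (a sketch using the reduced_embeddings and class_labels arrays from above, not part of the original notebook), we can find each image's nearest other image by L2 distance and measure how often their classes agree:

# brute-force pairwise squared L2 distances between all embeddings
X = reduced_embeddings                        # shape: (num_images, 2048)
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
np.fill_diagonal(d2, np.inf)                  # exclude each point itself

# fraction of images whose nearest neighbour shares their class
nn_idx = d2.argmin(axis=1)
agree = np.mean([class_labels[i] == class_labels[j] for i, j in enumerate(nn_idx)])
print(f"nearest-neighbour class agreement: {agree:.1%}")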

3. Store Embeddings in KDB.AI

Connect to the KDB.AI session. To use KDB.AI, you need two pieces of session information: a URL endpoint and an API key. You can sign up for free here.

Create the session with kdbai.Session, passing the URL endpoint and API key details from the KDB.AI Cloud portal.

KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

Create a session.

session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

Define Vector DB Table Schema

Define the schema for the KDB.AI table that will store the embedded data. The table contains the same three columns as the embeddings DataFrame:

  • source: file path to the raw image file
  • class: tumor class label
  • embedding: 2048-dimensional feature vector used for similarity search

image_schema = {
    "columns": [
        {"name": "source", "pytype": "str"},
        {"name": "class", "pytype": "str"},
        {
            "name": "embedding",
            "vectorIndex": {"dims": 2048, "metric": "L2", "type": "hnsw"},
        },
    ]
}

Create Vector DB Table

Next, use the KDB.AI create_table() function to create a table in the vector database that matches the schema defined above.

# ensure the table does not already exist
try:
    session.table("mri").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

table = session.create_table("mri", image_schema)

Add Embedded Data to KDB.AI Table

Check the memory usage of the data before inserting it into KDB.AI, since the recommended maximum size for a single insert is 10MB.

# convert bytes to MB
embedded_df.memory_usage(deep=True).sum() / (1024**2)

If the DataFrame were larger than 10MB, we would split the insert into batches (chunks), but since this dataset is only about 6MB, we can insert all the data at once.

table.insert(embedded_df)
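For reference, a larger DataFrame could be inserted in chunks instead; here is a minimal sketch (the chunk size is an assumption, to be tuned so each chunk stays under the recommended limit):

# hypothetical chunked insert for DataFrames larger than ~10MB
chunk_size = 200  # rows per insert; tune so each chunk stays under 10MB
for start in range(0, len(embedded_df), chunk_size):
    chunk = embedded_df.iloc[start : start + chunk_size].reset_index(drop=True)
    table.insert(chunk)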

Verify Data Has Been Inserted

If you run table.query(), you will see that the data has been added.

table.query()

4. Query KDB.AI Table

Now that all image embeddings are registered in KDB.AI's database, we are ready to demonstrate KDB.AI's fast query capabilities.

The query function accepts a variety of arguments for easy filtering, aggregation, and sorting; running table.query() with no arguments returns all the rows.

Let's demonstrate this by filtering for rows whose class column contains "glioma".

table.query(filter=[("like", "class", "*glioma*")])

We obtained the data sources and embeddings whose tumor class is "glioma".

5. Search For Similar Images To A Target Image

Finally, we perform an image similarity search. This is done using the table.search() function.

Choose Example Image

First, select a sample row from the test dataset (the index below is an arbitrary choice).

# Get a sample row
random_row_index_1 = 40

# Select the random row and the desired column's value
random_row_1 = embedded_df.iloc[random_row_index_1]

plot_image(plt.subplots()[-1], random_row_1["source"], label="Query Image")

We will search for images similar to this one.

First, save the embedding of this image in the sample_embedding_1 variable.

sample_embedding_1 = random_row_1["embedding"]

Search Based On The Chosen Image

Using the embedding stored in sample_embedding_1, find the eight images nearest to the query image.

results_1 = table.search([sample_embedding_1], n=8)
results_1[0]

The results returned from table.search() list the closest matches along with the nearest-neighbor distance __nn_distance. Since the query image itself is stored in the table, its __nn_distance will of course be 0.
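To inspect just the matches and their distances (assuming the result DataFrame contains these columns, as shown above):

# show the matched image paths, classes, and nearest-neighbour distances
results_1[0][["source", "class", "__nn_distance"]]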

Plot Most Similar Images

Let's visualize these images. The plot_test_result_with_8NN() function plots the query image and its eight nearest neighbors.

def plot_test_result_with_8NN(test_file: str, neighbors: pd.Series) -> None:
    # create figure
    _, ax = plt.subplots(nrows=3, ncols=3, figsize=(12, 7))
    axes = ax.reshape(-1)

    # plot query image
    plot_image(axes[0], test_file, "Test")

    # plot nearest neighbors
    for i, (_, value) in enumerate(neighbors.items(), start=1):
        plot_image(axes[i], value, f"{i}-NN")

nn1_filenames = results_1[0]["source"]
plot_test_result_with_8NN(random_row_1["source"], nn1_filenames)

Although some of the returned images show different cross-sections, the contrast of the tumor appears similar to that of the test image.

Automate This Search Process

Based on the above steps, we define the whole process, from selecting a sample image to displaying the eight nearest images, as the mri_image_nn_search() function.

def mri_image_nn_search(table, df: pd.DataFrame, row_index: int) -> None:
    # Select the random row and the desired column's value
    row = df.iloc[row_index]

    # get the embedding from this row
    row_embedding = row["embedding"]

    # search for 8 nearest neighbors
    nn_results = table.search([row_embedding], n=8)

    # plot the neighbors
    plot_test_result_with_8NN(row["source"], nn_results[0]["source"])

Find similar images with another sample image.

# Get another row
random_row_index_2 = 210

mri_image_nn_search(table, embedded_df, random_row_index_2)

The returned similar images show different cross-sections from the test image, but the contrast of the tumor is similar. Results like these could provide diagnostic material to help doctors determine whether a tumor is present.

6. Delete the KDB.AI Table

This completes the process, from preparing the dataset to searching for images similar to a test image. It is best practice to drop a table once you are finished with it.

table.drop()

Conclusion

In this article, we performed similarity search on MRI images using KDB.AI. We focused on MRI images this time, but since KDB.AI offers fast queries, it could also be used for tasks such as real-time defect detection on manufacturing lines.

The Lakehouse Department is looking for engineers and PMs for development and consulting work on data & AI projects. We are also recruiting in other departments, so if you are interested in APC, please contact us for a casual interview (Job Listing).

Translated by Johann