How to connect Databricks and KX: PyKX with kdb Insights license

Introduction

I am Jung from the GLB Business Unit, Lakehouse Section.

In this article, I will introduce how to use KX's PyKX to connect Databricks and KX.
We will perform verification using kdb Insights license of KX's product kdb+ on Databricks.
kdb+ is the world's fastest time-series database and analytics engine.

Please check the article for more details on KX's product.

techblog.ap-com.co.jp

Introduction
Contents
Introduction to PyKX
Verification
Summary

Introduction to PyKX

PyKX is the first Python interface to the world's fastest time-series database, kdb+, and primarily utilizes the vector programming language q. PyKX takes a Python-first approach to integrate q/kdb+ with Python. It offers users the ability to efficiently query and analyze vast amounts of in-memory and on-disk time-series data.

The description of PyKX is quoted from the following KX document. code.kx.com

The verification process is as follows.

Download kdb Insights license
Install PyKX
Using the kdb Insights license, import PyKX
Try using KX's q language
Try using Databricks' EDA function
Performance comparison between PyKX and Pandas

Verification

1. Download kdb Insights license

Download the kdb Insights license to use kdb+. The kdb Insights license allows a free trial of KX's kdb Insights Personal Edition for 12 months.

kx.com

Once you input personal information on the mentioned page, the license will be sent to you via email.

The license comes in two versions: one is a base64-encoded version written in the email, and the other is an LIC file version that is automatically installed when you click on "kdb Insights Personal Edition License Download File."

2. Install PyKX

Install PyKX on Databricks.

pip install pykx

3. Import PyKX

Use the kdb Insights license to import PyKX.

import pykx as kx

There are two methods to register the kdb Insights license.

➀This is a method to save the license file in the same location as the notebook. This can be done if the Python version is 3.10 or higher. It automatically loads the license.

➁This is a method for inputting the license. The instructions for inputting are available in the document below.

code.kx.com

Those who have performed "1. Download kdb Insights license" can input Y / Y / 2 in order, then copy and input the license key received in the email, which will allow for the import.

4. Try using KX's q language.

Once you import PyKX, you can use KX's q language in Databricks' Python environment.

kx.q.til(10)

You can create a kdb+ table similar to a table in Databricks.

After installing KX's public data (Time series) into Databricks, it can be loaded using pandas.

import pandas as pd
weather = pd.read_csv("自分のファイルのパス")

Load the data from Databricks and import it into kdb+.

# Creating the same table into kdb+
kx.q["weather"] = weather
kx.q["weather"]

Let's try using KX's SQL.

kx.q.sql("SELECT * FROM weather limit 5")

Most of the features available in SQL can be used.

kx.q.sql("SELECT * FROM weather  ORDER BY name limit 5")

kx.q.sql("SELECT * FROM weather WHERE borough = 'Staten Island' limit 5")

Load the data from kdb+ and convert it into a pandas table, then save it to Databricks as a CSV file.

# Creating the same table into pandas
df = kx.q["weather"].pd()
df.head(5)

kx.q.write.csv('path_to_save/weather_kdb.csv', df)

5. Try using Databricks' EDA function

Databricks allows for simple exploratory data analysis (EDA) using features within the workspace. It loads data from kdb+ and visualizes it within Databricks.

display(df)

【+】When you press the button, Visualization and Data Profile appear.

When using Data Profile, you can easily explore the characteristics of features. It classifies and displays them into Numeric Features and Categorical Features.

Numeric Features allow you to see missing values, basic statistics, and simple graphs.

In Categorical Features, you can easily see the number of unique values and the most frequent variables with their frequency count.

When using Visualization, you can visually create without coding using the UI.

You can select various types of charts using the dropdown.
* Please refer to the following URL for more details on the types of graphs.

learn.microsoft.com

If you have longitude and latitude, mapping is possible.

Line graphs can be easily created through column settings (columns like X, Y, Group by, etc.).

You can review the created graphs in the workspace.

The layers of the map can also be adjusted in the workspace.

You can save Line graphs as PNG files for images.

6. Performance comparison between PyKX and Pandas

Compare the performance of PyKX and Pandas.

The method of comparison was based on the materials below.

Comparison of aggregation speed between kdb+'s q language and Pandas:

www.jameslittle.me.uk

Speed comparison and visualization of PyKX and Pandas:

kx.com

Read the data using both PyKX and Pandas' csv read functionality. If you read the csv file with PyKX, it will be saved as a PyKX table, If read with Pandas, it will be saved as a Pandas data frame.

Using PyKX

weather = kx.q.read.csv("自分のファイルのパス")
type (weather)

Using PyKX

weather_pd = pd.read_csv("自分のファイルのパス")
type (weather_pd)

Check the column type of your data. If you read the data with Pandas, it may recognize timestamp as another type. If the type is not timestamp, you cannot use time-related functions such as date and time. If you need a Pandas data frame, you can use timestamp by reading the data with Pykx and converting the table to PD with Pykx.

Compare the performance based on the time it takes when calculating sum and avg based on timestamp time. timestamp has the same format as "2022.07.28D00:00:00.000000000". The aggregation results will create a table like the one below.

When calculating 100 time series aggregations, use "%%capture" and "%timeit -n 100" to save the time taken in each variable.

Using PyKX

%%capture pykx_time_res #Save the code result
%timeit -n 100 kx.q('grp:select sum airtemp, avg airtemp by timestamp.time from weather') #Measure the time it takes to calculate number 100

Using Pandas

%%capture pandas_time_res #Save the code result
%timeit -n 100 grp=weather_pd_q.groupby(weather_pd_q.timestamp.dt.time).agg({'airtemp': ['sum', 'mean']}) #Measure the time it takes to calculate number 100

Create a function to use for visualization. For the function, I used a function from KX Academy. Some of the code was modified and used for this verification.

kx.com

➀ String Processing Functions

# ➀ String Processing Functions
def fix_res_string(v) -> str:
    return v.stdout.split("\n")[0].replace("+-", "±")

➁ Function to dictionary input value to Pandas and PyKX

# ➁ Function to dictionary input value to Pandas and PyKX
def create_res_dict(pandas_time_res, pykx_time_res) -> pd.DataFrame:
    return pd.DataFrame(
        {
            "Time": {
                "Pandas": fix_res_string(pandas_time_res),
                "PyKX": fix_res_string(pykx_time_res),
            }
        }
    )

③ A function that converts the input dictionary into a data frame

# ③ A function that converts the input dictionary into a data frame
def parse_time_vals(d: dict) -> pd.DataFrame:
    all_rows = pd.DataFrame()
    for k, v in d.items():

        if "±" not in v:
            one_row = pd.DataFrame(
                {"avg_time": "", "avg_dev": "", "runs": "", "loops": ""}, index=[k]
            )

        else:
            avg_time, rest1, rest2 = v.split(" ± ")
            avg_dev, _ = rest1.split(" per loop")
            rest2 = rest2.strip(" std. dev. of ")
            rest2 = rest2.strip(" loops each)")
            runs, loops = rest2.split(" runs, ")
            one_row = pd.DataFrame(
                {"avg_time": avg_time, "avg_dev": avg_dev, "runs": runs, "loops": loops}, index=[k]
            )

        all_rows = pd.concat([all_rows, one_row])

    return all_rows.reset_index().rename(columns={"index": "syntax"})

④ If you input the value measured with "%timeit-n 100", the function returns the data frame of the function ③ after calculating the functions ➀ and ➁.

# ④ If you input the value measured with "%timeit-n 100", the function returns the data frame of the function ③ after calculating the functions ➀ and ➁.
def parse_vals(pandas_time_res, pykx_time_res) -> tuple[pd.DataFrame]:
    res_df = create_res_dict(pandas_time_res, pykx_time_res)
    return parse_time_vals(dict(res_df["Time"]))

⑤ Function to compare data frames (PyKX/Pandas)

# ⑤ Function to compare data frames (PyKX/Pandas)
def compare(df: pd.DataFrame, metric: str, syntax1: str, syntax2: str) -> float:
    factor = float(df[df["syntax"] == syntax1][metric].values) / float(
        df[df["syntax"] == syntax2][metric].values
    )
    str_factor = f"{factor:,.2f} times less" if factor > 1 else f"{1/factor:,.2f} time more"
    print(
        color.BOLD
        + f"\nThe '{metric}' for '{syntax2}' is {str_factor} than '{syntax1}'.\n"
        + color.END
    )
    return factor

⑥ Function to convert time

# ⑥ Function to convert time
def fix_time(l: list[str]) -> list[int]:
    fixed_l = []
    for v in l:
        if v.endswith(" ns"):
            fixed_l.append(float(v.strip(" ns")) / 1000 / 1000)
        elif v.endswith(" µs"):
            fixed_l.append(float(v.strip(" µs")) / 1000)
        elif v.endswith(" us"):
            fixed_l.append(float(v.strip(" us")) / 1000)
        elif v.endswith(" ms"):
            fixed_l.append(float(v.strip(" ms")))
        elif v.endswith(" s"):
            fixed_l.append(float(v.strip(" s")) * 1000)
        elif "min " in v:
            mins, secs = v.split("min ")
            total_secs = (float(mins) * 60) + float(secs.strip(" s"))
            fixed_l.append(total_secs * 1000)
        else:
            fixed_l.append(v)

    return fixed_l

⑦ Function to visualize

from tabulate import tabulate
import matplotlib.pyplot as plt
# ⑦ Function to visualize
def graph_time_data(df_to_graph: pd.DataFrame) -> pd.DataFrame:
    print(tabulate(df_to_graph, headers="keys", tablefmt="psql"))

    fig, ax = plt.subplots(figsize=(7, 3))

    df = df_to_graph.copy()
    df["avg_time"] = fix_time(df["avg_time"])
    df["avg_dev"] = fix_time(df["avg_dev"])
    df["upper_dev"] = df["avg_time"] + df["avg_dev"]
    df["lower_dev"] = df["avg_time"] - df["avg_dev"]

    _ = compare(df, "avg_time", "Pandas", "PyKX")

    df.plot(ax=ax, kind="bar", x="syntax", y="avg_time", rot=0)
    # df.plot(ax=ax, kind="scatter", x="syntax", y="upper_dev", color="yellow")
    # df.plot(ax=ax, kind="scatter", x="syntax", y="lower_dev", color="orange")

    ax.set_title("Pandas VS Pykx Time Taken")
    ax.set_ylabel("Average Time (ms)")
    ax.get_legend().remove()
    ax.set(xlabel=None)

    plt.show()

    return df

Creating a class color

# Creating a class color
class color:
    PURPLE = "\033[95m"
    CYAN = "\033[96m"
    DARKCYAN = "\033[36m"
    BLUE = "\033[94m"
    GREEN = "\033[92m"
    YELLOW = "\033[93m"
    RED = "\033[91m"
    BOLD = "\033[1m"
    UNDERLINE = "\033[4m"
    END = "\033[0m"

Visualize using the function you created. When aggregating using timestamps, pykx was found to be 11.7 times faster than pandas.

time_res_df = parse_vals(
    pandas_time_res, pykx_time_res
)

numeric_time_res_df = graph_time_data(time_res_df)

For time series analysis, PyKX gave better results in terms of use of timestamp type and aggregation speed.

Summary

In this post, I registered the kdb Insights license on Databricks and tried using KX's kdb. By using a library called PyKX, it's possible to access KX's KDB from Python and use the q language. This is recommended for those who want to utilize both the functionalities of Databricks and KX's KDB+.

Thank you for reading until the end.
Your continued support is greatly appreciated!

We offer comprehensive support, from implementing data analysis infrastructure using Databricks to aiding in internalization. If you're interested, please feel free to contact us.

www.ap-com.co.jp

Also, we are currently recruiting individuals who can work with us! We look forward to hearing from those who are interested in APC.

www.ap-com.co.jp

Translated by Johann

APC 技術ブログ

株式会社エーピーコミュニケーションズの技術ブログです。

How to connect Databricks and KX: PyKX with kdb Insights license

Introduction

Contents

Introduction to PyKX

Verification

1. Download kdb Insights license

2. Install PyKX

3. Import PyKX

4. Try using KX's q language.

5. Try using Databricks' EDA function

6. Performance comparison between PyKX and Pandas

Summary