Unity Catalog Daft Integration
This page shows you how to use Unity Catalog with Daft.
Daft is a library for parallel and distributed processing of multimodal data.
Set up
To start, install Daft with the extra Unity Catalog dependencies using:
pip install -U "getdaft[unity,deltalake]"
Then import Daft and the UnityCatalog abstraction:
import daft
from daft.unity_catalog import UnityCatalog
You need to have a Unity Catalog server running to connect to.
For testing purposes, you can spin up a local server by running the command below in a terminal:
bin/start-uc-server
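Before connecting, it can help to confirm that the local server is actually accepting connections. The sketch below polls a TCP port using only the Python standard library. The helper name wait_for_port is hypothetical and is not part of Daft or Unity Catalog; the host and port you pass to it are assumptions about where your server listens.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet: wait briefly, then retry.
            time.sleep(0.2)
    return False
```

For the local endpoint used on this page, a call such as wait_for_port("127.0.0.1", 8080) should return True once the server has started. It only checks that the port is open, not that the server is healthy.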
Connect Daft to Unity Catalog
Use the UnityCatalog abstraction to point Daft to your UC server.
This object requires an endpoint and a token. If you launched the UC server locally using the command above, then you can use the values below. Otherwise, replace the endpoint and token values with the corresponding values for your UC server.
# point Daft to your UC server
unity = UnityCatalog(
    endpoint="http://127.0.0.1:8080",
    token="not-used",
)
You can also connect to a Unity Catalog in your Databricks workspace by using the following setting:
endpoint = "https://<databricks_workspace_id>.cloud.databricks.com"
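To avoid hard-coding the endpoint and token, they can be read from environment variables before constructing UnityCatalog. The following is a minimal sketch: the helper name uc_settings and the variable names UC_ENDPOINT and UC_TOKEN are hypothetical, and the defaults are the local test values used on this page.

```python
import os
from typing import Mapping, Optional


def uc_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the endpoint/token keyword arguments from environment variables."""
    # Use the real process environment unless a mapping is passed in (handy for tests).
    env = os.environ if env is None else env
    return {
        # Fall back to the local test server values when the variables are unset.
        "endpoint": env.get("UC_ENDPOINT", "http://127.0.0.1:8080"),
        "token": env.get("UC_TOKEN", "not-used"),
    }
```

With this helper, the connection could be written as unity = UnityCatalog(**uc_settings()), so switching between a local server and a Databricks workspace only requires changing the two variables.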
Once you’re connected, you can list all your available catalogs using:
> print(unity.list_catalogs())
['unity']
You can list all available schemas in a given catalog:
> print(unity.list_schemas("unity"))
['unity.default']
And you can list all the available tables in a given schema:
> print(unity.list_tables("unity.default"))
['unity.default.numbers', 'unity.default.marksheet_uniform', 'unity.default.marksheet']
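The three listing calls compose into a walk over the whole catalog hierarchy. The sketch below assumes only the list_catalogs, list_schemas, and list_tables methods shown above, and that each returns fully qualified names as in the example output. The helper name all_tables is hypothetical.

```python
from typing import List


def all_tables(unity) -> List[str]:
    """Collect the fully qualified name of every table the client can list."""
    tables: List[str] = []
    for catalog in unity.list_catalogs():
        # list_schemas is assumed to return names like "unity.default".
        for schema in unity.list_schemas(catalog):
            # list_tables is assumed to return names like "unity.default.numbers".
            tables.extend(unity.list_tables(schema))
    return tables
```

Any of the returned names can then be passed to unity.load_table. On a large catalog this issues one request per catalog and per schema, so it is best suited to exploration rather than routine use.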
Load Unity Tables into Daft DataFrame
You can use Daft to read Delta Lake tables in a Unity Catalog.
First, point Daft to your Delta table stored in your Unity Catalog:
unity_table = unity.load_table("unity.default.numbers")
Unity Catalog tables are stored in the Delta Lake format.
Simply read your table using the Daft read_deltalake method:
> df = daft.read_deltalake(unity_table)
> df.show()
as_int  as_double
564     188.755356
755     883.610563
644     203.439559
75      277.880219
42      403.857969
680     797.691220
821     767.799854
484     344.003740
477     380.678561
131     35.443732
294     209.322436
150     329.197303
539     425.661029
247     477.742227
958     509.371273
Any subsequent filter operations on the Daft df DataFrame object will be correctly optimized to take advantage of Delta Lake features.
> df = df.where(df["as_int"] > 500)
> df.show()
as_int  as_double
564     188.755356
755     883.610563
644     203.439559
680     797.691220
821     767.799854
539     425.661029
958     509.371273
Daft support for Unity Catalog is under rapid development. Refer to the Daft documentation for more information.