WIP: Variant RFC#15

Open
AdamGS wants to merge 1 commit into develop from adamg/variant

AdamGS commented Feb 25, 2026

Still WIP, but starting to extract my local notes into a publicly shareable form.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>

In addition to a new canonical encoding, we'll need a few more pieces to make variant columns useful:

1. A set of new expressions that extract children of variant arrays using a combination of a path (similar to `GetExpr`) and a dtype.
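
To make the "path plus dtype" idea concrete, here is a minimal sketch in plain Rust. It is not Vortex code: `Value`, `Kind`, and `variant_get` are hypothetical names invented for illustration, and returning `None` on a type mismatch is an assumption, since the RFC text here does not say how mismatches should behave.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a single variant value (not a Vortex type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Hypothetical stand-in for the dtype requested by the expression.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Int,
    Str,
}

/// Walk `path` through nested objects, then keep the leaf only if it
/// matches `want`. A missing key or a type mismatch yields `None`
/// (an assumption made for this sketch).
pub fn variant_get(value: &Value, path: &[&str], want: Kind) -> Option<Value> {
    let mut current = value;
    for key in path {
        match current {
            Value::Object(fields) => current = fields.get(*key)?,
            // The path continues, but this value has no children.
            _ => return None,
        }
    }
    match (current, want) {
        (Value::Int(_), Kind::Int) | (Value::Str(_), Kind::Str) => Some(current.clone()),
        _ => None,
    }
}
```

A real expression would operate on whole arrays rather than one value at a time; the per-value version only shows how the path and the requested dtype combine.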

Maybe also worth mentioning expressions that convert to/from other variant-like data, e.g. JSON as a DType::Utf8 can be parsed into a DType::Variant.

I wonder if our JSON extension type has storage type DType::Utf8, or storage type DType::Variant?

Collaborator Author

In my mind, the JSON type is a “string verified as JSON”, like a PG column.
So far my impression is that there’s no consistent naming, and any choice we make will end up conflicting with something.

3. [JSON](https://clickhouse.com/docs/sql-reference/data-types/newjson) - Builds on top of `Dynamic`, with a few specialized features: users can specify known "typed paths", limit how many dynamic paths and types are supported for untyped paths, and use JSON-specific configuration to skip specific JSON paths on insert.
The full blog post is worth reading, but ClickHouse's on-disk model roughly mirrors the Arrow in-memory format, and they store some metadata outside of the array.

### Others

Maybe worth mentioning the Spark variant type? I think it inspired the Parquet one.


Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema at all. In addition to this "catch-all" column, most systems include the concept of "shredding": extracting a key with a specific type out of this column and storing it densely. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible.
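
As a rough illustration of shredding, the sketch below pulls one integer-typed key out of a column of variant-like values into a dense typed column and keeps the remainder as a residual catch-all column. This is plain Rust, not Vortex code: `Value`, `Shredded`, and `shred_i64` are hypothetical names, and leaving type-mismatched values in the residual is an assumption made for this example.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a single variant value (not a Vortex type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Result of shredding one key out of a column of variant values.
#[derive(Debug, PartialEq)]
pub struct Shredded {
    /// Typed column: `Some(v)` where the key held an integer, else `None`.
    pub typed: Vec<Option<i64>>,
    /// What remains of each row once the typed value has been moved out.
    pub residual: Vec<Value>,
}

/// Shred `key` as an i64 out of every row.
pub fn shred_i64(rows: Vec<Value>, key: &str) -> Shredded {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Value::Object(mut fields) => {
                // Only move the value out when it has the requested type;
                // mismatched values stay in the catch-all column.
                let hit = match fields.get(key) {
                    Some(Value::Int(v)) => Some(*v),
                    _ => None,
                };
                if hit.is_some() {
                    fields.remove(key);
                }
                typed.push(hit);
                residual.push(Value::Object(fields));
            }
            // Non-object rows have nothing to shred.
            other => {
                typed.push(None);
                residual.push(other);
            }
        }
    }
    Shredded { typed, residual }
}
```

Queries that only touch the shredded key can then read `typed` like an ordinary nullable integer column, while everything else still round-trips through `residual`.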

I propose a new dtype, `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is itself stored in some specialized variant encoding.
Contributor

How should we do execute_arrow for these? Using the Parquet Variant, or a union?

3 participants