Signed-off-by: Adam Gutglick <adam@spiraldb.com>
| In addition to a new canonical encoding, we'll need a few more pieces to make variant columns useful:
| 1. A set of new expressions that extract children of variant arrays using a combination of a path (similar to `GetExpr`) and a dtype.
Maybe also worth mentioning expressions that convert to/from other variant-like data, e.g. JSON as a DType::Utf8 can be parsed into a DType::Variant.
I wonder whether our JSON extension type has storage type DType::Utf8 or storage type DType::Variant?
In my mind JSON type is “string verified as JSON”, like a PG column.
So far my impression is that there’s no consistent naming, and any choice we make will end up conflicting with something
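To make the "path plus dtype" extraction from item 1 concrete, here is a minimal sketch over a toy value model. Everything in it (`Variant`, `Want`, `variant_get`) is a hypothetical name made up for illustration and is not the proposed Vortex API; it only shows the idea of walking a path and returning a value when it matches the requested type.

```rust
use std::collections::BTreeMap;

/// Toy schemaless value (hypothetical; not a Vortex or Parquet type).
#[derive(Debug, Clone, PartialEq)]
pub enum Variant {
    Null,
    Bool(bool),
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Variant>),
}

/// Requested output type (hypothetical stand-in for a dtype).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Want {
    Bool,
    Int,
    Str,
}

/// Walk `path` through nested objects, then return the value found there
/// only if its type matches `want`. Missing keys, non-object parents, and
/// type mismatches all yield `None`.
pub fn variant_get(root: &Variant, path: &[&str], want: Want) -> Option<Variant> {
    let mut cur = root;
    for key in path {
        cur = match cur {
            Variant::Object(fields) => fields.get(*key)?,
            _ => return None,
        };
    }
    match (cur, want) {
        (Variant::Bool(_), Want::Bool)
        | (Variant::Int(_), Want::Int)
        | (Variant::Str(_), Want::Str) => Some(cur.clone()),
        _ => None,
    }
}
```

In this sketch a type mismatch behaves like a missing key, which fits a type that is always nullable; a real expression could just as well cast or return an error instead.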
| 3. [JSON](https://clickhouse.com/docs/sql-reference/data-types/newjson) - Builds on top of `Dynamic` with a few specialized features: users can specify known "typed paths", set how many dynamic paths and types to support for untyped paths, and use some JSON-specific configuration, such as skipping specific JSON paths on insert.
| The full blog post is worth reading, but ClickHouse's on-disk model roughly mirrors the Arrow in-memory format, and they store some metadata outside of the array.
| ### Others
Maybe worth mentioning the Spark variant type? I think it inspired the Parquet one.
| Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema at all. In addition to this "catch-all" column, most systems include the concept of "shredding": extracting a key with a specific type out of this column and storing it densely. This design can make commonly accessed subfields perform like first-class columns while keeping the overall schema flexible.
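As a rough illustration of shredding, the sketch below pulls one integer-typed key out of a column of schemaless rows into a dense typed column and leaves everything else in a residual column. The names (`Value`, `Shredded`, `shred_i64`) are hypothetical and chosen for this example only; real formats such as Parquet Variant define their own layouts.

```rust
use std::collections::BTreeMap;

/// Toy schemaless value (hypothetical; not a Vortex or Parquet type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Result of shredding one key out of a column of values.
#[derive(Debug, PartialEq)]
pub struct Shredded {
    /// Typed column: `Some(v)` for rows where the key held an integer.
    pub typed: Vec<Option<i64>>,
    /// Whatever is left over, still schemaless.
    pub residual: Vec<Value>,
}

/// Move integer values stored under `key` into a typed column.
/// Rows where `key` is missing or not an integer keep their data in the residual.
pub fn shred_i64(rows: Vec<Value>, key: &str) -> Shredded {
    let mut typed = Vec::with_capacity(rows.len());
    let mut residual = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Value::Object(mut fields) => {
                // Only shred the key when it has the type we asked for.
                let hit = match fields.get(key) {
                    Some(Value::Int(v)) => Some(*v),
                    _ => None,
                };
                if hit.is_some() {
                    fields.remove(key);
                }
                typed.push(hit);
                residual.push(Value::Object(fields));
            }
            other => {
                typed.push(None);
                residual.push(other);
            }
        }
    }
    Shredded { typed, residual }
}
```

Reads of the shredded key can then scan `typed` directly, and only rows with `None` need to fall back to the residual.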
| I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type.
How should we do execute_arrow for these: using the Parquet Variant, or a union?
Still WIP, but I'm starting to extract my local notes into a publicly shareable form.