Currently, our encoding scheme puts the column name first, which can lead to two issues. First, it may introduce ambiguity if a column has the same name as a keyword in the option key. Second, it is unclear how to set these options on nested datatypes (e.g. struct fields). I'm wondering whether, rather than prefixing the key with the column name, we should postfix it. So in this case, rather than:
'payload.lance.compression' = 'zstd'
'payload.lance.compression-level' = '3'
'ts.lance.structural-encoding' = 'miniblock'
'ts.lance.rle-threshold' = '0.5'
'ts.lance.bss' = 'auto'
we would do something like:
'lance.compression.column.payload' = 'zstd'
'lance.compression-level.column.payload' = '3'
'lance.structural-encoding.column.ts' = 'miniblock'
'lance.rle-threshold.column.ts' = '0.5'
'lance.bss.column.ts' = 'auto'
This would remove the potential ambiguity and allow us to target nested fields specifically by appending further identifiers after the column name.
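To make the nested-field idea concrete, here is a rough sketch of how a postfix key could be split into an option name and a field path. The `lance.` prefix and `.column.` marker come from the examples above; the function name `parse_postfix_key` and the dotted nested path (`payload.inner`) are assumptions for illustration, not existing API.

```python
# Hypothetical constants taken from the key shapes proposed above.
PREFIX = "lance."
COLUMN_MARKER = ".column."


def parse_postfix_key(key: str) -> tuple[str, list[str]]:
    """Split 'lance.<option>.column.<field path>' into (option, path segments).

    Assumes column and field names do not themselves contain dots.
    """
    if not key.startswith(PREFIX):
        raise ValueError(f"not a lance option: {key!r}")
    # Everything before the first '.column.' is the option name,
    # everything after it is the (possibly nested) field path.
    option, sep, path = key[len(PREFIX):].partition(COLUMN_MARKER)
    if not sep or not option or not path:
        raise ValueError(f"missing option name or column path: {key!r}")
    return option, path.split(".")


if __name__ == "__main__":
    print(parse_postfix_key("lance.compression.column.payload"))
    print(parse_postfix_key("lance.compression-level.column.payload.inner"))
```

Because the option name is fixed before the marker, everything after `.column.` can be treated as the field path. A column name that itself contains a dot would still need some quoting or escaping rule, which this sketch does not attempt.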
If this is the direction we decide to go, we can make the change backward compatible. For example, the existing options would continue to accept both formats, while moving forward we would support only the new format.
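One way the backward-compatible path could look is a small normalization step that rewrites legacy keys into the postfix form before they are interpreted. This is only a sketch: `normalize_key` and `normalize_options` are hypothetical names, and it assumes only column-level keys are passed in and that column names contain no dots.

```python
# Hypothetical markers based on the two key shapes discussed above.
NEW_PREFIX = "lance."
COLUMN_MARKER = ".column."
LEGACY_MARKER = ".lance."


def normalize_key(key: str) -> str:
    """Return the key in postfix form, accepting either format."""
    if key.startswith(NEW_PREFIX) and COLUMN_MARKER in key:
        return key  # already in the proposed format
    # Legacy shape: '<column>.lance.<option>'
    column, sep, option = key.partition(LEGACY_MARKER)
    if sep and column and option:
        return f"{NEW_PREFIX}{option}{COLUMN_MARKER}{column}"
    raise ValueError(f"unrecognized column option key: {key!r}")


def normalize_options(options: dict[str, str]) -> dict[str, str]:
    """Normalize every key, rejecting conflicting duplicates."""
    normalized: dict[str, str] = {}
    for key, value in options.items():
        new_key = normalize_key(key)
        if new_key in normalized and normalized[new_key] != value:
            raise ValueError(f"conflicting values for {new_key!r}")
        normalized[new_key] = value
    return normalized


if __name__ == "__main__":
    print(normalize_options({
        "payload.lance.compression": "zstd",
        "lance.rle-threshold.column.ts": "0.5",
    }))
```

Normalizing at the boundary would keep the rest of the code working against a single format, and the conflict check catches the case where the same option is given in both spellings with different values.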