Describe the bug
The query result for numeric columns with use_numpy=True and columnar=True is expected to be a numpy.ndarray instead of a Python list/tuple. However, when a column is sparse (i.e., the majority of the data has the same value, such as 0/-1/nan), the behavior after the introduction of sparse serialization becomes the latter. The implementation does not take into account whether the column is a numpy column or not.
See commit 4de5b2b
https://github.com/mymarilyn/clickhouse-driver/blame/49afa09cede2e904090d46b44c1a059bec14c598/clickhouse_driver/columns/base.py#L49
```python
def apply_sparse(self, items):
    default = self.column.null_value

    if self.column.after_read_items:
        default = self.column.after_read_items([default])[0]

    rv = [default] * (self.items_total - 1)
    for item_number, i in enumerate(self.sparse_indexes):
        rv[i - 1] = items[item_number]

    return rv
```
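A standalone sketch of what this reconstruction does (names and signature simplified from the driver code above; `sparse_indexes` are assumed here to be 1-based positions of the non-default values, matching the `rv[i - 1]` lookup):

```python
def apply_sparse(items, sparse_indexes, items_total, default=0):
    # Rebuild the dense column: start from the column's default value
    # and scatter the explicitly stored items back into place.
    rv = [default] * (items_total - 1)
    for item_number, i in enumerate(sparse_indexes):
        rv[i - 1] = items[item_number]
    return rv

# Three non-default values in a dense column of 8 rows
# (items_total = 9, mirroring the "- 1" bookkeeping above).
dense = apply_sparse(items=[7, 5, 3], sparse_indexes=[2, 4, 8], items_total=9)
print(dense)        # [0, 7, 0, 5, 0, 0, 0, 3]
print(type(dense))  # <class 'list'> -- a plain Python list, not numpy.ndarray
```

The last line is the bug in miniature: regardless of the column type, the result is a Python list.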
To Reproduce
Read any sparse column with use_numpy=True and columnar=True.
Expected behavior
Returns a numpy array, as with regular columns.
Versions
After commit 4de5b2b
Suggested implementation
Add another NumpyColumnSparseSerialization that:
- stores the sparse indexes in a numpy int array
- implements apply_sparse by simply creating a buffer with np.full and assigning buf[indexes] = items
- ideally implements read_sparse in a compiled way
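The apply_sparse part of such a serialization could look like this (a sketch only, not the driver's actual class; `sparse_indexes` are assumed 1-based, as in the current list-based apply_sparse):

```python
import numpy as np

def apply_sparse_numpy(items, sparse_indexes, items_total, default=0):
    # Allocate the dense buffer in one shot, pre-filled with the default,
    # then scatter the stored items with a single fancy-indexing write.
    rv = np.full((items_total - 1,), fill_value=default, dtype=items.dtype)
    rv[np.asarray(sparse_indexes) - 1] = items
    return rv

dense = apply_sparse_numpy(np.array([7, 5, 3]), [2, 4, 8], 9)
print(dense)        # [0 7 0 5 0 0 0 3]
print(type(dense))  # <class 'numpy.ndarray'>
```

Besides preserving the ndarray type, this replaces the Python-level loop with two vectorized operations.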
Alternatively, introduce a simple ad hoc fix such as:
```python
def apply_sparse(self, items):
    default = self.column.null_value

    if hasattr(self.column, 'dtype') and not self.column.nullable:
        import numpy as np

        rv = np.full((self.items_total - 1,), fill_value=default,
                     dtype=items.dtype)
        rv[np.array(self.sparse_indexes) - 1] = items
        return rv

    if self.column.after_read_items:
        default = self.column.after_read_items([default])[0]

    rv = [default] * (self.items_total - 1)
    for item_number, i in enumerate(self.sparse_indexes):
        rv[i - 1] = items[item_number]

    return rv
```