SparseSerialization should preserve whether the column is numpy #499

@victor-zou

Describe the bug
With use_numpy/columnar=True, the query result for a numeric column is expected to be a numpy.ndarray rather than a Python list/tuple. However, when the column is sparse (i.e., the majority of the data has the same value, like 0/-1/nan), the result has been a Python list since sparse serialization was introduced. The implementation does not take into account whether the column is a numpy column or not.

See commit 4de5b2b

https://github.com/mymarilyn/clickhouse-driver/blame/49afa09cede2e904090d46b44c1a059bec14c598/clickhouse_driver/columns/base.py#L49

    def apply_sparse(self, items):
        default = self.column.null_value
        if self.column.after_read_items:
            default = self.column.after_read_items([default])[0]

        rv = [default] * (self.items_total - 1)
        for item_number, i in enumerate(self.sparse_indexes):
            rv[i - 1] = items[item_number]

        return rv
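
To illustrate the effect outside the driver, here is a minimal standalone sketch of the same logic. The function name `apply_sparse_list` and its explicit parameters are hypothetical stand-ins for the method's attributes, and the index convention simply mirrors the snippet above; it is not the driver's actual API.

```python
import numpy as np


def apply_sparse_list(items, sparse_indexes, items_total, default):
    # Hypothetical standalone copy of the list-based logic quoted above.
    rv = [default] * (items_total - 1)
    for item_number, i in enumerate(sparse_indexes):
        rv[i - 1] = items[item_number]
    return rv


# The non-default values arrive as a numpy array, as a numpy column would produce.
items = np.array([5, 7], dtype=np.int32)
result = apply_sparse_list(items, sparse_indexes=[2, 4], items_total=6, default=0)

# The container type is lost: the caller receives a plain Python list.
print(type(result).__name__)
```

Even though `items` is an ndarray, the returned container is a list, which matches the behavior reported here.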

To Reproduce
Read any sparse column with use_numpy/columnar=True.

Expected behavior
A numpy array is returned, as for regular (non-sparse) columns.

Versions
After commit 4de5b2b

Suggested implementation
Add another NumpyColumnSparseSerialization that

  1. saves the sparse indexes in a numpy int array;
  2. has apply_sparse simply create a buffer with np.full and assign buf[index] = items;
  3. ideally implements read_sparse in compiled code.
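
Steps 1 and 2 could look roughly like the sketch below. It is a standalone function rather than the proposed class; the name `apply_sparse_numpy` and its parameters are hypothetical, and the `items_total - 1` length and 1-based indexes are assumptions copied from the existing apply_sparse.

```python
import numpy as np


def apply_sparse_numpy(items, sparse_indexes, items_total, default):
    # Hypothetical sketch: numpy counterpart of the list-based apply_sparse.
    items = np.asarray(items)
    # Step 1: keep the sparse indexes as an integer array (shifted to 0-based).
    idx = np.asarray(sparse_indexes, dtype=np.int64) - 1
    # Step 2: pre-fill a buffer with the default, then scatter the real values.
    buf = np.full(items_total - 1, default, dtype=items.dtype)
    buf[idx] = items
    return buf


out = apply_sparse_numpy(np.array([1.5, 2.5]), [1, 4], items_total=5, default=0.0)
print(type(out).__name__, out.dtype)
```

The scatter assignment replaces the Python-level loop, and the result keeps the dtype of `items`. The explicit `np.int64` index dtype also keeps the assignment valid when there are no non-default values at all.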

Or, as an ad hoc measure, introduce a simple fix like this:

    def apply_sparse(self, items):
        default = self.column.null_value
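        # Numpy, non-nullable column: build the result as an ndarray.
        if hasattr(self.column, "dtype") and not self.column.nullable:
            import numpy as np
            rv = np.full((self.items_total - 1,), dtype=items.dtype, fill_value=default)
            # An explicit integer dtype keeps the indexing valid even when
            # sparse_indexes is empty.
            rv[np.array(self.sparse_indexes, dtype=np.int64) - 1] = items
            return rv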

        if self.column.after_read_items:
            default = self.column.after_read_items([default])[0]

        rv = [default] * (self.items_total - 1)
        for item_number, i in enumerate(self.sparse_indexes):
            rv[i - 1] = items[item_number]

        return rv
