The solution implemented is an "at read time" solution. Step by step:
a new ParquetFile attribute, global_cats, has been created to store categorical values;
when reading a row group (with ParquetFile.read_row_group_file), this global attribute is passed down to read_col;
read_col uses this attribute and populates it with any new categorical values encountered while reading successive row groups;
whenever new values are found (or existing values whose codes are inconsistent with those of previous row groups), a remapping table specific to that row group is created and used to update the column values to the correct codes.
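The merge-and-remap step above can be sketched as follows. This is an illustrative helper, not fastparquet's actual internals: the function name merge_and_remap and its signature are hypothetical, and only the attribute name global_cats comes from this PR.

```python
import numpy as np

def merge_and_remap(global_cats, rg_cats, rg_codes):
    """Merge one row group's categories into the running global list
    and recode that row group's integer codes accordingly.

    Hypothetical sketch of the idea in this PR; not fastparquet's API.

    global_cats : list of categories seen so far (position defines the code)
    rg_cats     : categories as stored in this row group's dictionary page
    rg_codes    : integer codes of the column values in this row group
    """
    lookup = {c: i for i, c in enumerate(global_cats)}
    # remapping table: local code in this row group -> global code
    remap = np.empty(len(rg_cats), dtype=np.int64)
    for local_code, cat in enumerate(rg_cats):
        if cat not in lookup:
            # value never seen in previous row groups: extend global_cats
            lookup[cat] = len(global_cats)
            global_cats.append(cat)
        remap[local_code] = lookup[cat]
    # apply the remapping table to correct this row group's codes
    return remap[rg_codes]
```

For example, if previous row groups established categories ["a", "b"] and the next row group stores ["b", "c"] with local codes [0, 1, 0], the remapped global codes become [1, 2, 1] and "c" is appended to the global list.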
Additional modifications are:
when slicing a ParquetFile (row group selection) with __getitem__, global_cats is reset. It could be reused in a future modification to retrieve categorical values, but after slicing fewer of the stored categorical values would remain relevant. In any case, at the next read_row_group_file operation it is repopulated with the correct values;
global_cats has been added to __getstate__ to ensure correct pickling;
two test cases have been provided, testing one or two categorical columns, appending up to three times, using categorical strings and integers.
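The scenario those tests guard against can be illustrated in plain pandas: two chunks (standing in for two appended row groups) whose categorical columns carry different local category orders must still round-trip by value, not by raw code. This is only an illustration of the invariant; the real tests use fastparquet's write/append and ParquetFile.

```python
import pandas as pd

# Two chunks with differing category sets, mimicking two row groups
# written by successive appends (illustrative; not the PR's test code).
chunk1 = pd.DataFrame({"col": pd.Categorical(["a", "b", "a"])})
chunk2 = pd.DataFrame({"col": pd.Categorical(["c", "b"])})  # new value "c"

# A correct reader must unify the categories so that the *values*
# survive, even though the per-chunk integer codes were inconsistent.
out = pd.concat([chunk1, chunk2], ignore_index=True)
out["col"] = out["col"].astype("category")

assert list(out["col"]) == ["a", "b", "a", "c", "b"]
assert set(out["col"].cat.categories) == {"a", "b", "c"}
```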
@martindurant
A quick word: I am still working on this PR. With a new test case I am working on, I found it does not work when row filtering is combined with nulls. I am investigating.
@yohplala , you might be interested in looking at the progress in kylebarron/arro3#313 , to see to what extent arro3 can meet your requirements, and maybe enumerate what fastparquet can do that that package still cannot.
Thanks Martin, I will be happy to review.
I will first focus on the ongoing fixes we can bring in this PR and PR #956. I also have a feature I would like to work on and then propose: implementing pf = ParquetFile.create_empty(fn, file_scheme, partition_on), from which we could then pf.write_row_groups() and/or mutate with the other methods that are already available.
I think this proposal is not far off, and I would like to keep some time for it. Then, yes, I will be happy to check this new project.
Replaces PR #953
I am very sorry, I made a mess of the commit history, so I restarted from scratch.
Text from PR #953 applies:
PR aiming to solve #949
Finally, the datapage v2 CI workflow now runs as well.