The solution implemented is an "at read time" solution. Step by step:
a new ParquetFile attribute, global_cats, has been created to store categorical values;
when reading a row group (with ParquetFile.read_row_group_file), this global attribute is passed down to read_col;
read_col uses this attribute and populates it with any new categorical values encountered while reading successive row groups;
whenever new values are found (or existing values whose codes are inconsistent with those of previous row groups), a remapping table specific to that row group is created and used to update the column values to the correct codes.
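The merge-and-remap step above can be sketched as follows. This is an illustrative helper, not fastparquet's actual internals: the function name merge_and_remap and its signature are hypothetical, and only the attribute name global_cats comes from this PR.

```python
import numpy as np

def merge_and_remap(global_cats, rg_cats, rg_codes):
    """Merge one row group's categories into the running global list
    and recode that row group's integer codes accordingly.

    Hypothetical sketch of the idea in this PR; not fastparquet's API.

    global_cats : list of categories seen so far (position defines the code)
    rg_cats     : categories as stored in this row group's dictionary page
    rg_codes    : integer codes of the column values in this row group
    """
    lookup = {c: i for i, c in enumerate(global_cats)}
    # remapping table: local code in this row group -> global code
    remap = np.empty(len(rg_cats), dtype=np.int64)
    for local_code, cat in enumerate(rg_cats):
        if cat not in lookup:
            # value never seen in previous row groups: extend global_cats
            lookup[cat] = len(global_cats)
            global_cats.append(cat)
        remap[local_code] = lookup[cat]
    # apply the remapping table to correct this row group's codes
    return remap[rg_codes]
```

For example, if previous row groups established categories ["a", "b"] and the next row group stores ["b", "c"] with local codes [0, 1, 0], the remapped global codes become [1, 2, 1] and "c" is appended to the global list.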
Additional modifications are:
when slicing a ParquetFile (row group selection) with __getitem__, global_cats is reset. It could be reused in a future modification to retrieve categorical values, but after slicing fewer of the stored categorical values would remain relevant. In any case, at the next read_row_group_file operation it is repopulated with the correct values;
global_cats has been added to __getstate__ to ensure correct pickling;
two test cases have been provided, testing one or two categorical columns, appending up to three times, using categorical strings and integers.
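The scenario those tests guard against can be illustrated in plain pandas: two chunks (standing in for two appended row groups) whose categorical columns carry different local category orders must still round-trip by value, not by raw code. This is only an illustration of the invariant; the real tests use fastparquet's write/append and ParquetFile.

```python
import pandas as pd

# Two chunks with differing category sets, mimicking two row groups
# written by successive appends (illustrative; not the PR's test code).
chunk1 = pd.DataFrame({"col": pd.Categorical(["a", "b", "a"])})
chunk2 = pd.DataFrame({"col": pd.Categorical(["c", "b"])})  # new value "c"

# A correct reader must unify the categories so that the *values*
# survive, even though the per-chunk integer codes were inconsistent.
out = pd.concat([chunk1, chunk2], ignore_index=True)
out["col"] = out["col"].astype("category")

assert list(out["col"]) == ["a", "b", "a", "c", "b"]
assert set(out["col"].cat.categories) == {"a", "b", "c"}
```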
@martindurant
A quick word: I am still working on this PR. With a new test case I am working on, I found it does not work when row filtering is combined with nulls. I am investigating.
@yohplala , you might be interested in looking at the progress in kylebarron/arro3#313 , to see to what extent arro3 can meet your requirements, and maybe enumerate what fastparquet can do that that package still cannot.
Thanks Martin, I will be happy to review.
I will first focus on the ongoing fixes we can bring in this PR and PR #956. I also have a feature I would like to work on and then propose: implementing pf = ParquetFile.create_empty(fn, file_scheme, partition_on), from which we could then pf.write_row_groups() and/or mutate with the other methods that are already available.
I think this proposal is not far off, and I would like to keep some time for it. Then, yes, I will be happy to check this new project.
Replaces PR #953
I am very sorry, I made a mess of the commit history, so I restarted from scratch.
Text from PR #953 applies:
PR aiming to solve #949
Finally, the datapage v2 CI workflow now runs as well.