
Commit 842eb7b

zhengruifeng authored and HyukjinKwon committed
[SPARK-54938][PYTHON][TEST][FOLLOW-UP] Fix inferred time unit for pandas >= 3
### What changes were proposed in this pull request?
Fix the inferred time unit for pandas >= 3.

### Why are the changes needed?
There is a behavior change in pandas 3: the inferred resolution for these inputs is microsecond instead of nanosecond.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Manually checked.

pandas=2.3.3
```
In [7]: pd.__version__
Out[7]: '2.3.3'

In [8]: pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"])).dtype
Out[8]: dtype('<M8[ns]')

In [9]: pa.array(pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))).type
Out[9]: TimestampType(timestamp[ns])
```

pandas=3.0.1
```
In [6]: pd.__version__
Out[6]: '3.0.1'

In [7]: pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"])).dtype
Out[7]: dtype('<M8[us]')

In [8]: pa.array(pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))).type
Out[8]: TimestampType(timestamp[us])
```

### Was this patch authored or co-authored using generative AI tooling?
Co-authored-by: Claude code (Opus 4.6)

Closes #55158 from zhengruifeng/fix-pyarrow-ts-inference.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
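The two manual checks differ only in the unit inside the brackets of the NumPy dtype and the Arrow type. As a rough illustration, the sketch below pulls that unit out of the printed representations with the standard library alone. The helper name `time_unit` is hypothetical and not part of the patch, and the strings are copied from the sessions above rather than produced by pandas or PyArrow.

```python
import re


def time_unit(type_repr: str) -> str:
    """Extract the unit from a repr such as '<M8[ns]' or 'timestamp[us]'.

    Hypothetical helper for illustration only; it assumes the unit is the
    first word after the opening bracket.
    """
    match = re.search(r"\[(\w+)", type_repr)
    if match is None:
        raise ValueError(f"no time unit found in {type_repr!r}")
    return match.group(1)


# Representations copied from the manual checks in the commit message.
observed = {
    "2.3.3": ("<M8[ns]", "timestamp[ns]"),
    "3.0.1": ("<M8[us]", "timestamp[us]"),
}

for version, (numpy_repr, arrow_repr) in observed.items():
    # In both sessions the Arrow type carries the same unit as the pandas dtype.
    assert time_unit(numpy_repr) == time_unit(arrow_repr)
    print(version, time_unit(numpy_repr))
```

In both sessions PyArrow reports the same unit as the pandas dtype, which is why the test expectation has to depend on the pandas version.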
1 parent d9c8eda commit 842eb7b

1 file changed

Lines changed: 12 additions & 10 deletions


python/pyspark/tests/upstream/pyarrow/test_pyarrow_array_type_inference.py

--- a/python/pyspark/tests/upstream/pyarrow/test_pyarrow_array_type_inference.py
+++ b/python/pyspark/tests/upstream/pyarrow/test_pyarrow_array_type_inference.py
@@ -299,6 +299,8 @@ def test_pandas_series_numpy_backed(self):
 
         # pandas >= 3 infers large_string instead of string for object-dtype string Series
         string_type = pa.large_string() if LooseVersion(pd.__version__) >= "3.0.0" else pa.string()
+        # pandas >= 3 defaults to microsecond resolution instead of nanosecond
+        ts_unit = "us" if LooseVersion(pd.__version__) >= "3.0.0" else "ns"
 
         sg = ZoneInfo("Asia/Singapore")
         la = "America/Los_Angeles"
@@ -324,17 +326,17 @@ def test_pandas_series_numpy_backed(self):
             (pd.Series([True, False, True]), pa.bool_()),
             # Temporal
             (pd.Series([date1, date2]), pa.date32()),
-            (pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"])), pa.timestamp("ns")),
-            (pd.Series([pd.Timestamp("1970-01-01")]), pa.timestamp("ns")),
-            (pd.Series([pd.Timestamp.min]), pa.timestamp("ns")),
-            (pd.Series([pd.Timestamp.max]), pa.timestamp("ns")),
-            (pd.Series(pd.to_timedelta(["1 day", "2 hours"])), pa.duration("ns")),
-            (pd.Series([pd.Timedelta(0)]), pa.duration("ns")),
-            (pd.Series([pd.Timedelta.min]), pa.duration("ns")),
-            (pd.Series([pd.Timedelta.max]), pa.duration("ns")),
+            (pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"])), pa.timestamp(ts_unit)),
+            (pd.Series([pd.Timestamp("1970-01-01")]), pa.timestamp(ts_unit)),
+            (pd.Series([pd.Timestamp.min]), pa.timestamp(ts_unit)),
+            (pd.Series([pd.Timestamp.max]), pa.timestamp(ts_unit)),
+            (pd.Series(pd.to_timedelta(["1 day", "2 hours"])), pa.duration(ts_unit)),
+            (pd.Series([pd.Timedelta(0)]), pa.duration(ts_unit)),
+            (pd.Series([pd.Timedelta.min]), pa.duration(ts_unit)),
+            (pd.Series([pd.Timedelta.max]), pa.duration(ts_unit)),
             # Timezone-aware
-            (pd.Series([dt1_sg, dt2_sg]), pa.timestamp("ns", tz="Asia/Singapore")),
-            (pd.Series([ts1_la, ts2_la]), pa.timestamp("ns", tz=la)),
+            (pd.Series([dt1_sg, dt2_sg]), pa.timestamp(ts_unit, tz="Asia/Singapore")),
+            (pd.Series([ts1_la, ts2_la]), pa.timestamp(ts_unit, tz=la)),
             # Binary
             (pd.Series([b"hello", b"world"]), pa.binary()),
             # Nested
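
The change computes one version-dependent unit and reuses it in every temporal expectation. A minimal, dependency-free sketch of that pattern follows. It is not the test's code: the helper names `expected_time_unit` and `expected_type_names` are hypothetical, the version check parses only the major component instead of using `LooseVersion`, and the expected types are written as type-name strings rather than `pa.timestamp(...)` / `pa.duration(...)` objects.

```python
def expected_time_unit(pandas_version: str) -> str:
    """Return the unit the test expects for a given pandas version string.

    Hypothetical helper; assumes the version starts with an integer major
    component, e.g. "2.3.3" or "3.0.1".
    """
    major = int(pandas_version.split(".")[0])
    # pandas >= 3: microseconds; earlier versions: nanoseconds.
    return "us" if major >= 3 else "ns"


def expected_type_names(pandas_version: str) -> list[str]:
    """Build the expected type names once from a single unit."""
    unit = expected_time_unit(pandas_version)
    return [
        f"timestamp[{unit}]",
        f"duration[{unit}]",
        f"timestamp[{unit}, tz=Asia/Singapore]",
    ]


print(expected_type_names("2.3.3"))
print(expected_type_names("3.0.1"))
```

Deriving every expectation from a single variable keeps the version check in one place, so the rest of the expectation table reads the same for both pandas lines.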
