fix: recover latin-1 encoded Location headers on redirects #12325

Open
MAXDVVV wants to merge 3 commits into aio-libs:master from MAXDVVV:fix/redirect-non-ascii-location-10047

MAXDVVV commented Apr 6, 2026

Problem

When a server sends a Location header containing raw latin-1 encoded bytes (e.g. \xf8 for ø), the redirect URL gets corrupted.

Redirect chain example (from #10047):

https://cornelius-k.dk/synsproeve/
  → Location: https://cornelius-k.dk/synsprøve  (URL-encoded %C3%B8, OK)
  → Location: https://cornelius-k.dk/synspr\xf8ve  (raw latin-1 byte!)
    → aiohttp sees: https://cornelius-k.dk/synspr\udcf8ve  (broken surrogate)
    → 404!

Root cause

The HTTP parser decodes header values as UTF-8 with the surrogateescape error handler (http_parser.py L208). Some servers send raw latin-1 bytes in the Location header, in violation of the RFCs. A byte such as \xf8 is not valid UTF-8 on its own, so it is decoded to a lone surrogate such as \udcf8. These surrogates then cause URL() to produce a broken URL.
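The decoding step described above can be reproduced with the standard library alone. This is a minimal sketch, not the parser's actual code; the variable names are illustrative:

```python
# Bytes as a server might send them: 0xF8 is "ø" in latin-1,
# but it is not a valid UTF-8 sequence on its own.
raw_header = b"https://cornelius-k.dk/synspr\xf8ve"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
decoded = raw_header.decode("utf-8", "surrogateescape")
print(ascii(decoded))  # 'https://cornelius-k.dk/synspr\udcf8ve'

# A strict UTF-8 encode fails on lone surrogates, which is one way to detect them.
try:
    decoded.encode("utf-8")
    has_surrogates = False
except UnicodeEncodeError:
    has_surrogates = True

print(has_surrogates)  # True
```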

Fix

In the redirect handling code (client.py), after reading the Location header value, detect whether it contains surrogates (i.e. it cannot be encoded as strict UTF-8). If so, round-trip through surrogateescape back to bytes and decode as latin-1, recovering the original characters:

'\udcf8'.encode('utf-8', 'surrogateescape') → b'\xf8'
b'\xf8'.decode('latin-1') → 'ø'

This is a targeted fix that only affects redirect URL processing, not general header decoding.

Verification

>>> r_url = 'https://cornelius-k.dk/synspr\udcf8ve'
>>> raw = r_url.encode('utf-8', 'surrogateescape')
>>> r_url = raw.decode('latin-1')
>>> r_url
'https://cornelius-k.dk/synsprøve'  # correct!
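The detection and recovery steps can be combined into one function. The sketch below is illustrative only: `recover_latin1` is a hypothetical name, not an aiohttp function, and the PR's actual change in client.py may be structured differently.

```python
def recover_latin1(value: str) -> str:
    """Hypothetical helper: re-decode a surrogate-escaped header value as latin-1.

    Values that encode cleanly as UTF-8 are returned unchanged.
    """
    try:
        value.encode("utf-8")  # strict: raises if lone surrogates are present
    except UnicodeEncodeError:
        raw = value.encode("utf-8", "surrogateescape")  # recover the original bytes
        return raw.decode("latin-1")  # latin-1 maps every byte, so this cannot fail
    return value


print(recover_latin1("https://cornelius-k.dk/synspr\udcf8ve"))
print(recover_latin1("https://example.com/plain"))
```

One caveat of this approach: if a value mixes valid UTF-8 characters with stray bytes, the whole value is re-decoded as latin-1, so the valid multi-byte characters would come out as mojibake.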

Fixes #10047

psf-chronographer bot added the bot:chronographer:provided label (There is a change note present in this PR) Apr 6, 2026

codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.11%. Comparing base (e412ccb) to head (5ea52e5).
⚠️ Report is 5 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12325      +/-   ##
==========================================
- Coverage   99.11%   99.11%   -0.01%     
==========================================
  Files         130      130              
  Lines       45558    45623      +65     
  Branches     2404     2406       +2     
==========================================
+ Hits        45156    45218      +62     
- Misses        272      275       +3     
  Partials      130      130              
Flag Coverage Δ
CI-GHA 98.96% <100.00%> (-0.01%) ⬇️
OS-Linux 98.71% <100.00%> (-0.02%) ⬇️
OS-Windows 96.96% <100.00%> (-0.03%) ⬇️
OS-macOS 97.87% <100.00%> (-0.01%) ⬇️
Py-3.10.11 97.42% <100.00%> (+<0.01%) ⬆️
Py-3.10.20 97.89% <100.00%> (+<0.01%) ⬆️
Py-3.11.15 98.10% <100.00%> (+<0.01%) ⬆️
Py-3.11.9 97.63% <100.00%> (+<0.01%) ⬆️
Py-3.12.10 97.72% <100.00%> (+<0.01%) ⬆️
Py-3.12.13 98.20% <100.00%> (+<0.01%) ⬆️
Py-3.13.12 98.44% <100.00%> (+<0.01%) ⬆️
Py-3.14.3 98.50% <100.00%> (+<0.01%) ⬆️
Py-3.14.3t 97.50% <100.00%> (-0.01%) ⬇️
Py-pypy3.11.15-7.3.21 97.38% <100.00%> (-0.01%) ⬇️
VM-macos 97.87% <100.00%> (-0.01%) ⬇️
VM-ubuntu 98.71% <100.00%> (-0.02%) ⬇️
VM-windows 96.96% <100.00%> (-0.03%) ⬇️


codspeed-hq bot commented Apr 6, 2026

Merging this PR will not alter performance

✅ 61 untouched benchmarks
⏩ 4 skipped benchmarks [1]


Comparing MAXDVVV:fix/redirect-non-ascii-location-10047 (5ea52e5) with master (fc67cfd)

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

except (UnicodeEncodeError, UnicodeDecodeError):
    try:
        raw = r_url.encode("utf-8", "surrogateescape")
        r_url = raw.decode("latin-1")
Member

What if it's not latin-1? This seems unreasonable for us to just start guessing charsets randomly.

If fallback_charset_resolver is set, we could use that instead maybe?
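To illustrate the concern, the sketch below re-decodes the same escaped value under two different charsets. `redecode_header` is a hypothetical helper, and its `charset` argument only stands in for whatever a configured resolver would return; how `fallback_charset_resolver` would actually be wired into redirect handling is not shown in this PR and is an assumption here.

```python
def redecode_header(value: str, charset: str) -> str:
    """Hypothetical helper: recover the raw bytes and decode with a chosen charset."""
    raw = value.encode("utf-8", "surrogateescape")
    return raw.decode(charset)


broken = "synspr\udcf8ve"

# The same byte 0xF8 yields different characters depending on the charset assumed.
as_latin1 = redecode_header(broken, "latin-1")     # 0xF8 is U+00F8 (ø)
as_latin2 = redecode_header(broken, "iso-8859-2")  # 0xF8 is U+0159 (ř)

print(as_latin1, as_latin2)
```

Both results are valid strings, so nothing in the bytes themselves says which charset the server intended.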

Dreamsorcerer added the pr-unfinished label (The PR is unfinished and may need a volunteer to complete it) Apr 7, 2026

Development

Successfully merging this pull request may close these issues.

On redirects, middle URL with ø char gets parsed wrongly - leading to a 404