Description
Is there an existing issue for this?
- I have searched the existing issues
Category of feature request
Transpile
Problem statement
When reading a local file to transpile, the remorph CLI currently does not attempt to detect the file encoding and relies on Python's default mechanism to choose one. (The default on Windows is typically cp1252; on Linux/macOS it varies but is typically utf-8.)
This assumption is often incorrect, so the file contents are corrupted when they are read.
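A small illustration of the corruption, assuming a UTF-8 file that starts with a BOM and a process whose default encoding is cp1252. The SQL text is made up for the example:

```python
import codecs

# Bytes of a hypothetical UTF-8 source file that starts with a BOM.
raw = codecs.BOM_UTF8 + "SELECT 'café'".encode("utf-8")

# Decoding with cp1252 (a typical Windows default) keeps the BOM bytes as
# visible characters and mangles the non-ASCII character.
as_cp1252 = raw.decode("cp1252")

# Decoding with utf-8-sig strips the BOM and recovers the original text.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_cp1252))    # 'ï»¿SELECT \'cafÃ©\''
print(repr(as_utf8_sig))  # "SELECT 'café'"
```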
Proposed Solution
If a file starts with a Unicode byte order mark (BOM), the encoding indicated by the BOM should be used when reading the file. If no BOM is present, the default encoding should be used. In pseudo-code:
- Open the file in binary mode, and read the first 4 bytes from the file.
- If the first 4 bytes start with a Unicode BOM, that indicates the encoding to use.
- Otherwise, the encoding will be locale.getpreferredencoding(False).
- Open the file again in text mode, specifying the encoding determined in the previous two steps.
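The steps above could be sketched roughly as follows. This is a minimal sketch, not the remorph or UCX implementation: the names detect_encoding and read_source are hypothetical, and it assumes only the UTF-8, UTF-16 and UTF-32 BOMs need to be recognised.

```python
import codecs
import locale

# BOMs are checked in order. The UTF-32 LE BOM begins with the same two bytes
# as the UTF-16 LE BOM, so the UTF-32 entries must come first.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(path):
    """Return the encoding implied by a leading BOM, else the locale default."""
    # Read the first 4 bytes in binary mode (the longest BOM is 4 bytes).
    with open(path, "rb") as f:
        head = f.read(4)
    # A matching BOM determines the encoding.
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    # No BOM: fall back to the platform default.
    return locale.getpreferredencoding(False)


def read_source(path):
    """Read a file as text using the detected encoding (hypothetical helper)."""
    # Reopen in text mode with the encoding chosen above.
    with open(path, "r", encoding=detect_encoding(path)) as f:
        return f.read()
```

The "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM while decoding, so the returned text does not start with a stray U+FEFF character.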
Additional Context
A reference implementation of how this should work can be found in UCX, where we solve the same problem.