[FEATURE]: Detect source file encoding when transpiling #1516

@asnare

Is there an existing issue for this?

  • I have searched the existing issues

Category of feature request

Transpile

Problem statement

When reading a local file to transpile, the remorph CLI does not currently attempt to detect the file's encoding; it relies on Python's default mechanism to choose one. (The default on Windows is typically cp1252; on Linux/macOS it varies but is typically utf-8.)

This assumption is often incorrect, which corrupts the text when the file is read.
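
As a small illustration of the failure mode (a sketch using a made-up string, not taken from a real remorph run), decoding bytes with the wrong codec either silently produces garbage or fails outright:

```python
# Hypothetical sample text; any non-ASCII character shows the same effect.
original = "café"

# UTF-8 bytes read with a cp1252 default: decoding succeeds but the text is wrong.
misread = original.encode("utf-8").decode("cp1252")
print(misread)  # cafÃ©

# cp1252 bytes read with a utf-8 default: decoding fails instead.
try:
    original.encode("cp1252").decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)  # True
```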

Proposed Solution

If a file starts with a Unicode BOM (byte order mark), the encoding indicated by the BOM should be used when reading the file. If no BOM is present, the default encoding should be used. In pseudo-code:

  1. Open the file in binary mode and read its first 4 bytes.
  2. If those bytes start with a Unicode BOM, use the encoding that the BOM indicates.
  3. Otherwise, use locale.getpreferredencoding(False).
  4. Reopen the file in text mode with the encoding determined in step 2 or 3.
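
The steps above could be sketched in Python roughly as follows. This is a minimal sketch, not the UCX or remorph implementation: the names `detect_encoding` and `read_source` are hypothetical, and it assumes that the standard `utf-8-sig`, `utf-16` and `utf-32` codecs are acceptable, since they consume the BOM while decoding.

```python
import codecs
import locale

# Checked in order. The UTF-32 BOMs must come before the UTF-16 ones because
# BOM_UTF32_LE (FF FE 00 00) begins with BOM_UTF16_LE (FF FE).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(path):
    """Return the BOM-indicated encoding of a file, or the locale default."""
    # Step 1: read the first 4 bytes in binary mode.
    with open(path, "rb") as f:
        prefix = f.read(4)
    # Step 2: a leading BOM determines the encoding.
    for bom, encoding in _BOMS:
        if prefix.startswith(bom):
            return encoding
    # Step 3: no BOM, so fall back to the locale's preferred encoding.
    return locale.getpreferredencoding(False)


def read_source(path):
    """Step 4: reopen the file in text mode with the detected encoding."""
    with open(path, "r", encoding=detect_encoding(path)) as f:
        return f.read()
```

Because these codecs strip the BOM themselves, the returned text does not start with U+FEFF. One known ambiguity remains: a UTF-16 LE file whose first character is NUL has the same leading bytes as a UTF-32 LE BOM and would be treated as UTF-32.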

Additional Context

A reference implementation of how this should work can be found in UCX where we solve the same problem.

Metadata

Labels

  • bug: Something isn't working
  • enhancement: New feature or request
  • fit&finish: Fit and Finish
