Commit 4fa43e6

20250415_00-rel1
- First release
1 parent fff51eb commit 4fa43e6

7 files changed

+309
-0
lines changed

.gitignore

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
# File created using '.gitignore Generator' for Visual Studio Code: https://bit.ly/vscode-gig
# Created by https://www.toptal.com/developers/gitignore/api/visualstudiocode,macos,python
# Edit at https://www.toptal.com/developers/gitignore?templates=visualstudiocode,macos,python

### macOS ###
# General
.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon

# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

### macOS Patch ###
# iCloud generated files
*.icloud

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml

# ruff
.ruff_cache/

# LSP config files
pyrightconfig.json

### VisualStudioCode ###
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/*.code-snippets

# Local History for Visual Studio Code
.history/

# Built Visual Studio Code Extensions
*.vsix

### VisualStudioCode Patch ###
# Ignore all local history of files
.history
.ionide

# End of https://www.toptal.com/developers/gitignore/api/visualstudiocode,macos,python

# Custom rules (everything added below won't be overriden by 'Generate .gitignore File' if you use 'Update' option)

README.md

Lines changed: 37 additions & 0 deletions
@@ -1,2 +1,39 @@
# UnicodeFix

Normalizes Unicode to ASCII equivalents

## Installation

```bash
git clone https://github.com/jeff-h/UnicodeFix.git
cd UnicodeFix
pip install -r requirements.txt
```

## Usage

```bash
(python-3.10-PA-dev) [unixwzrd@xanax: UnicodeFix]$ python bin/cleanup-text.py --help
usage: cleanup-text.py [-h] [-o OUTPUT] [infile]

Clean Unicode quirks from text.

positional arguments:
  infile                Input file (or use STDIN)

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file (default: STDOUT)

python bin/cleanup-text.py <input_file>
```

## License
Copyright 2025 unixwzrd@unixwzrd.ai

[MIT License](LICENSE)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

bin/cleanup-text.py

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
#!/usr/bin/env python3
import sys
import argparse
import re
from unidecode import unidecode

def clean_text(text):
    replacements = {
        '\u2018': "'", '\u2019': "'",
        '\u201C': '"', '\u201D': '"',
        '\u2013': '-', '\u2014': '-',
    }
    for orig, repl in replacements.items():
        text = text.replace(orig, repl)
    text = re.sub(r'[\u200B\u200C\u200D\uFEFF]', '', text)
    return unidecode(text)

def main():
    parser = argparse.ArgumentParser(description="Clean Unicode quirks from text.")
    parser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin,
                        help='Input file (or use STDIN)')
    parser.add_argument('-o', '--output', type=argparse.FileType('w'), default=sys.stdout,
                        help='Output file (default: STDOUT)')
    args = parser.parse_args()

    input_text = args.infile.read()
    cleaned = clean_text(input_text)
    args.output.write(cleaned)

if __name__ == '__main__':
    main()
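
For reviewers who want to try the replacement logic without installing Unidecode, here is a minimal standard-library sketch of the first two steps of `clean_text()`. The names `REPLACEMENTS`, `ZERO_WIDTH`, and `clean_text_stdlib` are hypothetical and not part of this commit, and the final `unidecode()` pass is deliberately left out, so any other non-ASCII characters would pass through unchanged.

```python
import re

# Same punctuation mapping as clean_text() in bin/cleanup-text.py.
REPLACEMENTS = {
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u201C': '"', '\u201D': '"',   # curly double quotes
    '\u2013': '-', '\u2014': '-',   # en dash, em dash
}

# Zero-width space, non-joiner, joiner, and byte-order mark.
ZERO_WIDTH = re.compile(r'[\u200B\u200C\u200D\uFEFF]')


def clean_text_stdlib(text):
    """Hypothetical helper: clean_text() without the final unidecode() call."""
    for orig, repl in REPLACEMENTS.items():
        text = text.replace(orig, repl)
    return ZERO_WIDTH.sub('', text)


sample = "Here\u2019s a \u201Csmart\u201D example\u2014with a zero\u200B-width space."
print(clean_text_stdlib(sample))
```

The sample is written with escape sequences so the invisible characters stay visible in review.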

data/unicode-tst-1.bin

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
Here’s a “smart” example—complete with em dashes, zero-width spaces, and other quirks.​
Notice the zero-width space between 'and' and 'other'? It’s invisible but present.​
Also, beware of zero-width non-joiners‌ and joiners‍ that sneak into your text.

data/unicode-tst.bin

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
Here’s a “smart” example—complete with em dashes, zero-width spaces, and other quirks.​
Notice the zero-width space between 'and' and 'other'? It’s invisible but present.​
Also, beware of zero-width non-joiners‌ and joiners‍ that sneak into your text.

output-clean.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
Here's a "smart" example-complete with em dashes, zero-width spaces, and other quirks.
Notice the zero-width space between 'and' and 'other'? It's invisible but present.
Also, beware of zero-width non-joiners and joiners that sneak into your text.
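
One way to confirm that a cleaned file such as `output-clean.txt` no longer contains the characters present in the test inputs is to list every non-ASCII code point. The sketch below is an assumption about how one might check this; `non_ascii_positions` is a hypothetical helper, not part of the commit, and it runs on short in-memory strings rather than the repository files.

```python
def non_ascii_positions(text):
    """Return (index, character, code point label) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# A fragment shaped like the test data: curly apostrophe plus a trailing zero-width space.
before = "It\u2019s invisible but present.\u200B"
# The same fragment as it appears after cleaning.
after = "It's invisible but present."

print(non_ascii_positions(before))  # lists the apostrophe and the zero-width space
print(non_ascii_positions(after))   # empty list when the text is pure ASCII
```

To apply the same check to a file, the text could be read with an explicit `encoding="utf-8"` and passed to the helper.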

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Unidecode==1.4.0
