mirror of
https://github.com/morpheus65535/bazarr
synced 2025-01-03 13:35:18 +00:00
Metadata-Version: 2.1
Name: ftfy
Version: 6.1.3
Summary: Fixes mojibake and other problems with Unicode, after the fact
License: Apache-2.0
Author: Robyn Speer
Author-email: rspeer@arborelia.net
Requires-Python: >=3.8,<4
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: wcwidth (>=0.2.12,<0.3.0)
Description-Content-Type: text/markdown

# ftfy: fixes text for you

[![PyPI package](https://badge.fury.io/py/ftfy.svg)](https://badge.fury.io/py/ftfy)
[![Docs](https://readthedocs.org/projects/ftfy/badge/?version=latest)](https://ftfy.readthedocs.org/en/latest/)

```python
>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง
```

The full documentation of ftfy is available at [ftfy.readthedocs.org](https://ftfy.readthedocs.org). The documentation covers a lot more than this README, so here are some links into it:

- [Fixing problems and getting explanations](https://ftfy.readthedocs.io/en/latest/explain.html)
- [Configuring ftfy](https://ftfy.readthedocs.io/en/latest/config.html)
- [Encodings ftfy can handle](https://ftfy.readthedocs.io/en/latest/encodings.html)
- [“Fixer” functions](https://ftfy.readthedocs.io/en/latest/fixes.html)
- [Is ftfy an encoding detector?](https://ftfy.readthedocs.io/en/latest/detect.html)
- [Heuristics for detecting mojibake](https://ftfy.readthedocs.io/en/latest/heuristic.html)
- [Support for “bad” encodings](https://ftfy.readthedocs.io/en/latest/bad_encodings.html)
- [Command-line usage](https://ftfy.readthedocs.io/en/latest/cli.html)
- [Citing ftfy](https://ftfy.readthedocs.io/en/latest/cite.html)

## Testimonials

- “My life is livable again!”
  — [@planarrowspace](https://twitter.com/planarrowspace)
- “A handy piece of magic”
  — [@simonw](https://twitter.com/simonw)
- “Saved me a large amount of frustrating dev work”
  — [@iancal](https://twitter.com/iancal)
- “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.”
  — Brennan Young
- “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.”
  — [/u/ocrow](https://reddit.com/u/ocrow)
- “9.2/10”
  — [pylint](https://bitbucket.org/logilab/pylint/)

## What it does

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

    >>> import ftfy
    >>> ftfy.fix_text('âœ” No problems')
    '✔ No problems'
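
To see where text like this comes from, the mix-up can be reproduced with Python's built-in codecs alone. This is only a sketch of the underlying round trip, assuming the wrong decoder was Windows-1252; it is not how ftfy detects or repairs mojibake, and the helper names `make_mojibake` and `undo_mojibake` are made up for illustration.

```python
def make_mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Hypothetical helper: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_mojibake(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Hypothetical helper: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "\u2714 No problems"        # check mark followed by plain ASCII
garbled = make_mojibake(original)      # the three UTF-8 bytes of the check mark
                                       # become three separate cp1252 characters
print(garbled)
print(undo_mojibake(garbled) == original)
```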

Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.

ftfy can fix multiple layers of mojibake simultaneously:

    >>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
    "The Mona Lisa doesn't have eyebrows."
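
Layered mojibake appears when the same mistake is repeated on text that is already garbled. The sketch below builds three layers with the standard codecs and peels them off again in a loop. The helper `peel_layers` is hypothetical and assumes every layer used Windows-1252; unlike ftfy, it has no heuristic for judging whether a result is more plausible than its input, and it leaves the curly apostrophe as it is.

```python
def peel_layers(text: str, wrong_encoding: str = "cp1252", max_layers: int = 5):
    """Hypothetical helper: undo repeated UTF-8-read-as-wrong_encoding mix-ups.

    Returns the recovered text and the number of layers removed.
    """
    layers = 0
    while layers < max_layers:
        try:
            candidate = text.encode(wrong_encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text cannot be reinterpreted any further
        if candidate == text:
            break  # pure ASCII round-trips unchanged, so nothing is left to undo
        text = candidate
        layers += 1
    return text, layers


# Build a three-layer example starting from a curly apostrophe (U+2019).
garbled = "doesn\u2019t"
for _ in range(3):
    garbled = garbled.encode("utf-8").decode("cp1252")

print(garbled)
print(peel_layers(garbled))
```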

It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:

    >>> ftfy.fix_text("l’humanitÃ©")
    "l'humanité"
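
A short experiment with the standard codecs shows why the curly quote gets in the way. This is an illustration under the assumption that the mix-up went through Latin-1 or Windows-1252, and `try_undo` is a hypothetical helper, not part of ftfy.

```python
def try_undo(text: str, wrong_encoding: str) -> str:
    """Hypothetical helper: reverse the mix-up, or report which step failed."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError) as err:
        return type(err).__name__


# U+2019 (curly apostrophe) sitting on top of the mojibake for "e with acute"
garbled = "l\u2019humanit\u00c3\u00a9"

# Latin-1 has no curly apostrophe, so the bytes cannot even be recovered.
print(try_undo(garbled, "latin-1"))

# Windows-1252 can encode it, but the resulting byte 0x92 is not valid UTF-8 here.
print(try_undo(garbled, "cp1252"))

# After uncurling the quote, the simple round trip succeeds.
uncurled = garbled.replace("\u2019", "'")
print(try_undo(uncurled, "latin-1"))
```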

ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:

    >>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
    'à perturber la réflexion'
    >>> ftfy.fix_text('Ã perturber la rÃ©flexion')
    'à perturber la réflexion'
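
The reason this case is hard can be shown with the standard codecs: once the non-breaking space is gone, the byte sequence is no longer valid UTF-8. The sketch below assumes the mix-up went through Latin-1, and `restore_nbsp_naively` is a hypothetical and deliberately crude repair, not ftfy's actual logic.

```python
def undo(text: str) -> str:
    """Reverse a UTF-8-read-as-Latin-1 mix-up (raises if the bytes are not UTF-8)."""
    return text.encode("latin-1").decode("utf-8")


def restore_nbsp_naively(text: str) -> str:
    """Hypothetical repair: assume every 'A-tilde + space' lost a U+A0 and a space."""
    return text.replace("\u00c3 ", "\u00c3\u00a0 ")


# "a with grave" is the bytes C3 A0 in UTF-8; read as Latin-1 that is U+00C3 U+00A0.
garbled = "\u00e0 perturber".encode("utf-8").decode("latin-1")

# Something then turned U+A0 into a space and merged it with the next space.
collapsed = garbled.replace("\u00a0 ", " ")

try:
    undo(collapsed)
except UnicodeDecodeError as err:
    print("cannot undo directly:", err.reason)

print(undo(restore_nbsp_naively(collapsed)))
```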

ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:

    >>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
    >>> ftfy.fix_text('P&EACUTE;REZ')
    'PÉREZ'
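
For comparison, the standard library's `html.unescape` follows the HTML 5 entity table and leaves the all-caps form untouched. The sketch below shows that, followed by one possible way to be more lenient. The helper `unescape_lenient` is hypothetical: it assumes an unknown all-uppercase entity means "the lowercase entity, uppercased", which is an illustration rather than a description of ftfy's implementation.

```python
import html
import re
from html.entities import html5

ALL_CAPS_ENTITY = re.compile(r"&([A-Z]+);")


def unescape_lenient(text: str) -> str:
    """Hypothetical helper: also decode all-caps variants of known entities."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name + ";" in html5:
            return match.group(0)      # valid as written; html.unescape handles it
        lowered = html5.get(name.lower() + ";")
        if lowered is None:
            return match.group(0)      # unknown entity, leave it alone
        return lowered.upper()         # assumption: caps entity means caps character

    return html.unescape(ALL_CAPS_ENTITY.sub(replace, text))


print(html.unescape("P&Eacute;REZ"))    # standard spelling is decoded
print(html.unescape("P&EACUTE;REZ"))    # all-caps spelling is left as-is
print(unescape_lenient("P&EACUTE;REZ"))
```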

These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.

The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.

    >>> ftfy.fix_text('IL Y MARQUÉ…')
    'IL Y MARQUÉ…'
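
The reinterpretation described above can be checked with the standard codecs. This small sketch only demonstrates that the blind round trip succeeds and changes the text, which is why a repair tool needs a heuristic and cannot simply try the conversion and accept whatever decodes.

```python
# 'IL Y MARQU' followed by E-acute (U+00C9) and a horizontal ellipsis (U+2026)
text = "IL Y MARQU\u00c9\u2026"

# In Windows-1252 those two characters are the bytes C9 85,
# which happen to form one valid UTF-8 sequence for U+0245.
reinterpreted = text.encode("cp1252").decode("utf-8")

print(reinterpreted)
print(reinterpreted == text)
```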

## Installing

ftfy is a Python 3 package that can be installed using `pip`:

    pip install ftfy

(Or use `pip3 install ftfy` on systems where Python 2 and 3 are both globally installed and `pip` refers to Python 2.)

### Local development

ftfy is developed using `poetry`. Its `setup.py` is vestigial and is not the recommended way to install it.

[Install Poetry](https://python-poetry.org/docs/master/#installing-with-the-official-installer), check out this repository, and run `poetry install` to install ftfy for local development, such as experimenting with the heuristic or running tests.

## Who maintains ftfy?

I'm Robyn Speer, also known as Elia Robyn Lake. You can find me [on GitHub](https://github.com/rspeer) or [Cohost](https://cohost.org/arborelia).

## Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.

ftfy has a citable record [on Zenodo](https://zenodo.org/record/2591652). A citation of ftfy may look like this:

    Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
    http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is:

    @misc{speer-2019-ftfy,
      author       = {Robyn Speer},
      title        = {ftfy},
      note         = {Version 5.5},
      year         = 2019,
      howpublished = {Zenodo},
      doi          = {10.5281/zenodo.2591652},
      url          = {https://doi.org/10.5281/zenodo.2591652}
    }