From ad3f37f8ac3825b93b82e53926ff098552828888 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Louis=20V=C3=A9zina?= <5130500+morpheus65535@users.noreply.github.com> Date: Tue, 24 Sep 2019 06:23:53 -0400 Subject: [PATCH] WIP --- libs/bs4/AUTHORS.txt | 43 ++ libs/bs4/COPYING.txt | 27 + libs/bs4/NEWS.txt | 1190 +++++++++++++++++++++++++++++++++++++++++ libs/bs4/README.txt | 63 +++ libs/bs4/TODO.txt | 31 ++ libs/bs4/formatter.py | 100 ++++ 6 files changed, 1454 insertions(+) create mode 100644 libs/bs4/AUTHORS.txt create mode 100644 libs/bs4/COPYING.txt create mode 100644 libs/bs4/NEWS.txt create mode 100644 libs/bs4/README.txt create mode 100644 libs/bs4/TODO.txt create mode 100644 libs/bs4/formatter.py diff --git a/libs/bs4/AUTHORS.txt b/libs/bs4/AUTHORS.txt new file mode 100644 index 000000000..2ac8fcc8c --- /dev/null +++ b/libs/bs4/AUTHORS.txt @@ -0,0 +1,43 @@ +Behold, mortal, the origins of Beautiful Soup... +================================================ + +Leonard Richardson is the primary programmer. + +Aaron DeVore is awesome. + +Mark Pilgrim provided the encoding detection code that forms the base +of UnicodeDammit. + +Thomas Kluyver and Ezio Melotti finished the work of getting Beautiful +Soup 4 working under Python 3. + +Simon Willison wrote soupselect, which was used to make Beautiful Soup +support CSS selectors. + +Sam Ruby helped with a lot of edge cases. + +Jonathan Ellis was awarded the prestigous Beau Potage D'Or for his +work in solving the nestable tags conundrum. 
+ +An incomplete list of people who have contributed patches to Beautiful +Soup: + + Istvan Albert, Andrew Lin, Anthony Baxter, Andrew Boyko, Tony Chang, + Zephyr Fang, Fuzzy, Roman Gaufman, Yoni Gilad, Richie Hindle, Peteris + Krumins, Kent Johnson, Ben Last, Robert Leftwich, Staffan Malmgren, + Ksenia Marasanova, JP Moins, Adam Monsen, John Nagle, "Jon", Ed + Oskiewicz, Greg Phillips, Giles Radford, Arthur Rudolph, Marko + Samastur, Jouni Seppänen, Alexander Schmolck, Andy Theyers, Glyn + Webster, Paul Wright, Danny Yoo + +An incomplete list of people who made suggestions or found bugs or +found ways to break Beautiful Soup: + + Hanno Böck, Matteo Bertini, Chris Curvey, Simon Cusack, Bruce Eckel, + Matt Ernst, Michael Foord, Tom Harris, Bill de hOra, Donald Howes, + Matt Patterson, Scott Roberts, Steve Strassmann, Mike Williams, + warchild at redho dot com, Sami Kuisma, Carlos Rocha, Bob Hutchison, + Joren Mc, Michal Migurski, John Kleven, Tim Heaney, Tripp Lilley, Ed + Summers, Dennis Sutch, Chris Smith, Aaron Sweep^W Swartz, Stuart + Turner, Greg Edwards, Kevin J Kalupson, Nikos Kouremenos, Artur de + Sousa Rocha, Yichun Wei, Per Vognsen diff --git a/libs/bs4/COPYING.txt b/libs/bs4/COPYING.txt new file mode 100644 index 000000000..b91188869 --- /dev/null +++ b/libs/bs4/COPYING.txt @@ -0,0 +1,27 @@ +Beautiful Soup is made available under the MIT license: + + Copyright (c) 2004-2015 Leonard Richardson + + Permission is hereby granted, free of charge, to any person obtaining + a copy of this software and associated documentation files (the + "Software"), to deal in the Software without restriction, including + without limitation the rights to use, copy, modify, merge, publish, + distribute, sublicense, and/or sell copies of the Software, and to + permit persons to whom the Software is furnished to do so, subject to + the following conditions: + + The above copyright notice and this permission notice shall be + included in all copies or substantial portions of the 
Software. + + THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + SOFTWARE. + +Beautiful Soup incorporates code from the html5lib library, which is +also made available under the MIT license. Copyright (c) 2006-2013 +James Graham and other contributors diff --git a/libs/bs4/NEWS.txt b/libs/bs4/NEWS.txt new file mode 100644 index 000000000..3726c570a --- /dev/null +++ b/libs/bs4/NEWS.txt @@ -0,0 +1,1190 @@ += 4.4.1 (20150928) = + +* Fixed a bug that deranged the tree when part of it was + removed. Thanks to Eric Weiser for the patch and John Wiseman for a + test. [bug=1481520] + +* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel + Kramer for the patch. [bug=1483781] + +* Improved the implementation of CSS selector grouping. Thanks to + Orangain for the patch. [bug=1484543] + +* Fixed the test_detect_utf8 test so that it works when chardet is + installed. [bug=1471359] + +* Corrected the output of Declaration objects. [bug=1477847] + + += 4.4.0 (20150703) = + +Especially important changes: + +* Added a warning when you instantiate a BeautifulSoup object without + explicitly naming a parser. [bug=1398866] + +* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode + string in Python 3, instead of a UTF8-encoded bytestring in both + versions. In Python 3, __str__ now returns a Unicode string instead + of a bytestring. [bug=1420131] + +* The `text` argument to the find_* methods is now called `string`, + which is more accurate. `text` still works, but `string` is the + argument described in the documentation. 
`text` may eventually + change its meaning, but not for a very long time. [bug=1366856] + +* Changed the way soup objects work under copy.copy(). Copying a + NavigableString or a Tag will give you a new NavigableString that's + equal to the old one but not connected to the parse tree. Patch by + Martijn Pieters. [bug=1307490] + +* Started using a standard MIT license. [bug=1294662] + +* Added a Chinese translation of the documentation by Delong .w. + +New features: + +* Introduced the select_one() method, which uses a CSS selector but + only returns the first match, instead of a list of + matches. [bug=1349367] + +* You can now create a Tag object without specifying a + TreeBuilder. Patch by Martijn Pieters. [bug=1307471] + +* You can now create a NavigableString or a subclass just by invoking + the constructor. [bug=1294315] + +* Added an `exclude_encodings` argument to UnicodeDammit and to the + Beautiful Soup constructor, which lets you prohibit the detection of + an encoding that you know is wrong. [bug=1469408] + +* The select() method now supports selector grouping. Patch by + Francisco Canas [bug=1191917] + +Bug fixes: + +* Fixed yet another problem that caused the html5lib tree builder to + create a disconnected parse tree. [bug=1237763] + +* Force object_was_parsed() to keep the tree intact even when an element + from later in the document is moved into place. [bug=1430633] + +* Fixed yet another bug that caused a disconnected tree when html5lib + copied an element from one part of the tree to another. [bug=1270611] + +* Fixed a bug where Element.extract() could create an infinite loop in + the remaining tree. + +* The select() method can now find tags whose names contain + dashes. Patch by Francisco Canas. [bug=1276211] + +* The select() method can now find tags with attributes whose names + contain dashes. Patch by Marek Kapolka. [bug=1304007] + +* Improved the lxml tree builder's handling of processing + instructions. 
[bug=1294645] + +* Restored the helpful syntax error that happens when you try to + import the Python 2 edition of Beautiful Soup under Python + 3. [bug=1213387] + +* In Python 3.4 and above, set the new convert_charrefs argument to + the html.parser constructor to avoid a warning and future + failures. Patch by Stefano Revera. [bug=1375721] + +* The warning when you pass in a filename or URL as markup will now be + displayed correctly even if the filename or URL is a Unicode + string. [bug=1268888] + +* If the initial tag contains a CDATA list attribute such as + 'class', the html5lib tree builder will now turn its value into a + list, as it would with any other tag. [bug=1296481] + +* Fixed an import error in Python 3.5 caused by the removal of the + HTMLParseError class. [bug=1420063] + +* Improved docstring for encode_contents() and + decode_contents(). [bug=1441543] + +* Fixed a crash in Unicode, Dammit's encoding detector when the name + of the encoding itself contained invalid bytes. [bug=1360913] + +* Improved the exception raised when you call .unwrap() or + .replace_with() on an element that's not attached to a tree. + +* Raise a NotImplementedError whenever an unsupported CSS pseudoclass + is used in select(). Previously some cases did not result in a + NotImplementedError. + +* It's now possible to pickle a BeautifulSoup object no matter which + tree builder was used to create it. However, the only tree builder + that survives the pickling process is the HTMLParserTreeBuilder + ('html.parser'). If you unpickle a BeautifulSoup object created with + some other tree builder, soup.builder will be None. [bug=1231545] + += 4.3.2 (20131002) = + +* Fixed a bug in which short Unicode input was improperly encoded to + ASCII when checking whether or not it was the name of a file on + disk. [bug=1227016] + +* Fixed a crash when a short input contains data not valid in + filenames. 
[bug=1232604] + +* Fixed a bug that caused Unicode data put into UnicodeDammit to + return None instead of the original data. [bug=1214983] + +* Combined two tests to stop a spurious test failure when tests are + run by nosetests. [bug=1212445] + += 4.3.1 (20130815) = + +* Fixed yet another problem with the html5lib tree builder, caused by + html5lib's tendency to rearrange the tree during + parsing. [bug=1189267] + +* Fixed a bug that caused the optimized version of find_all() to + return nothing. [bug=1212655] + += 4.3.0 (20130812) = + +* Instead of converting incoming data to Unicode and feeding it to the + lxml tree builder in chunks, Beautiful Soup now makes successive + guesses at the encoding of the incoming data, and tells lxml to + parse the data as that encoding. Giving lxml more control over the + parsing process improves performance and avoids a number of bugs and + issues with the lxml parser which had previously required elaborate + workarounds: + + - An issue in which lxml refuses to parse Unicode strings on some + systems. [bug=1180527] + + - A returning bug that truncated documents longer than a (very + small) size. [bug=963880] + + - A returning bug in which extra spaces were added to a document if + the document defined a charset other than UTF-8. [bug=972466] + + This required a major overhaul of the tree builder architecture. If + you wrote your own tree builder and didn't tell me, you'll need to + modify your prepare_markup() method. + +* The UnicodeDammit code that makes guesses at encodings has been + split into its own class, EncodingDetector. A lot of apparently + redundant code has been removed from Unicode, Dammit, and some + undocumented features have also been removed. + +* Beautiful Soup will issue a warning if instead of markup you pass it + a URL or the name of a file on disk (a common beginner's mistake). 
+ +* A number of optimizations improve the performance of the lxml tree + builder by about 33%, the html.parser tree builder by about 20%, and + the html5lib tree builder by about 15%. + +* All find_all calls should now return a ResultSet object. Patch by + Aaron DeVore. [bug=1194034] + += 4.2.1 (20130531) = + +* The default XML formatter will now replace ampersands even if they + appear to be part of entities. That is, "&lt;" will become + "&amp;lt;". The old code was left over from Beautiful Soup 3, which + didn't always turn entities into Unicode characters. + + If you really want the old behavior (maybe because you add new + strings to the tree, those strings include entities, and you want + the formatter to leave them alone on output), it can be found in + EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183] + +* Gave new_string() the ability to create subclasses of + NavigableString. [bug=1181986] + +* Fixed another bug by which the html5lib tree builder could create a + disconnected tree. [bug=1182089] + +* The .previous_element of a BeautifulSoup object is now always None, + not the last element to be parsed. [bug=1182089] + +* Fixed test failures when lxml is not installed. [bug=1181589] + +* html5lib now supports Python 3. Fixed some Python 2-specific + code in the html5lib test suite. [bug=1181624] + +* The html.parser treebuilder can now handle numeric attributes in + text when the hexadecimal name of the attribute starts with a + capital X. Patch by Tim Shirley. [bug=1186242] + += 4.2.0 (20130514) = + +* The Tag.select() method now supports a much wider variety of CSS + selectors. + + - Added support for the adjacent sibling combinator (+) and the + general sibling combinator (~). Tests by "liquider". [bug=1082144] + + - The combinators (>, +, and ~) can now combine with any supported + selector, not just one that selects based on tag name. + + - Added limited support for the "nth-of-type" pseudo-class. Code + by Sven Slootweg. 
[bug=1109952] + +* The BeautifulSoup class is now aliased to "_s" and "_soup", making + it quicker to type the import statement in an interactive session: + + from bs4 import _s + or + from bs4 import _soup + + The alias may change in the future, so don't use this in code you're + going to run more than once. + +* Added the 'diagnose' submodule, which includes several useful + functions for reporting problems and doing tech support. + + - diagnose(data) tries the given markup on every installed parser, + reporting exceptions and displaying successes. If a parser is not + installed, diagnose() mentions this fact. + + - lxml_trace(data, html=True) runs the given markup through lxml's + XML parser or HTML parser, and prints out the parser events as + they happen. This helps you quickly determine whether a given + problem occurs in lxml code or Beautiful Soup code. + + - htmlparser_trace(data) is the same thing, but for Python's + built-in HTMLParser class. + +* In an HTML document, the contents of a