mylar/mylar/newpull.py


from bs4 import BeautifulSoup, UnicodeDammit
import urllib2
import csv
import fileinput
import sys
import re
import os
import sqlite3
import datetime
import unicodedata
from decimal import Decimal
from HTMLParser import HTMLParseError
from time import strptime
import requests
import mylar
from mylar import logger
def newpull():
    pagelinks = "http://www.previewsworld.com/Home/1/1/71/952"
    try:
        r = requests.get(pagelinks, verify=False)
    except Exception, e:
        logger.warn('Error fetching data: %s' % e)
        # Nothing to parse if the request failed; r would be undefined below.
        return
    soup = BeautifulSoup(r.content)
    getthedate = soup.findAll("div", {"class": "Headline"})[0]
    #the date will be in the FIRST ahref
    try:
        getdate_link = getthedate('a')[0]
        newdates = getdate_link.findNext(text=True).strip()
    except IndexError:
        newdates = getthedate.findNext(text=True).strip()
    logger.fdebug('New Releases date detected as : ' + re.sub('New Releases For', '', newdates).strip())
    cntlinks = soup.findAll('tr')
    lenlinks = len(cntlinks)
    publish = []
    resultURL = []
    resultmonth = []
    resultyear = []
    x = 0
    cnt = 0
    endthis = False
    pull_list = []
    publishers = {'PREVIEWS PUBLICATIONS', 'DARK HORSE COMICS', 'DC COMICS', 'IDW PUBLISHING', 'IMAGE COMICS', 'MARVEL COMICS', 'COMICS & GRAPHIC NOVELS', 'MAGAZINES', 'MERCHANDISE'}
    isspublisher = None
    while (x < lenlinks):
        headt = cntlinks[x] #iterate through the hrefs pulling out only results.
        found_iss = headt.findAll('td')
        pubcheck = found_iss[0].text.strip() #.findNext(text=True)
        for pub in publishers:
            if pub in pubcheck:
                chklink = found_iss[0].findAll('a', href=True) #make sure it doesn't have a link in it.
                if not chklink:
                    isspublisher = pub
                    break
        if isspublisher == 'PREVIEWS PUBLICATIONS' or isspublisher is None:
            pass
        elif any([isspublisher == 'MAGAZINES', isspublisher == 'MERCHANDISE']):
            #logger.fdebug('End.')
            endthis = True
            break
        else:
            if "PREVIEWS" in headt:
                #logger.fdebug('Ignoring: ' + found_iss[0])
                break
            if '/Catalog/' in str(headt):
                findurl_link = headt.findAll('a', href=True)[0]
                urlID = findurl_link.findNext(text=True)
                issue_link = findurl_link['href']
                issue_lk = issue_link.find('/Catalog/')
                if issue_lk == -1:
                    x+=1
                    continue
                elif "Home/1/1/71" in issue_link:
                    #logger.fdebug('Ignoring - menu option.')
                    x+=1
                    continue
                if len(found_iss) > 0:
                    pull_list.append({"iss_url": issue_link,
                                      "name": found_iss[1].findNext(text=True),
                                      "price": found_iss[2],
                                      "publisher": isspublisher,
                                      "ID": urlID})
        x+=1
    logger.fdebug('Saving new pull-list information into local file for subsequent merge')
    except_file = os.path.join(mylar.CACHE_DIR, 'newreleases.txt')
    try:
        csvfile = open(str(except_file), 'rb')
        csvfile.close()
    except (OSError, IOError):
        logger.fdebug('file does not exist - continuing.')
    else:
        logger.fdebug('file exists - removing.')
        os.remove(except_file)
    oldpub = None
    breakhtml = {"<td>", "<tr>", "</td>", "</tr>"}
    with open(str(except_file), 'wb') as f:
        f.write('%s\n' % (newdates))
        for pl in pull_list:
            if pl['publisher'] == oldpub:
                exceptln = str(pl['ID']) + "\t" + pl['name'].replace(u"\xA0", u" ") + "\t" + str(pl['price'])
            else:
                exceptln = pl['publisher'] + "\n" + str(pl['ID']) + "\t" + pl['name'].replace(u"\xA0", u" ") + "\t" + str(pl['price'])
            for lb in breakhtml:
                exceptln = re.sub(lb, '', exceptln).strip()
            exceptline = exceptln.decode('utf-8', 'ignore')
            f.write('%s\n' % (exceptline.encode('ascii', 'replace').strip()))
            oldpub = pl['publisher']
if __name__ == '__main__':
    newpull()
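
The write loop at the end of newpull() produces a simple layout: the release date on the first line, then a publisher heading each time the publisher changes, followed by tab-separated ID, name and price rows. The sketch below shows that layout in isolation. It is a hypothetical helper, not part of Mylar: the name format_pull_list and the sample values are made up for illustration, it is written for Python 3 (the module above is Python 2), and it skips the final ASCII re-encoding step.

```python
# Hypothetical helper (not part of Mylar): rebuilds the newreleases.txt
# layout in memory so it can be inspected without touching the cache dir.

# HTML fragments stripped from each row, same set as breakhtml in newpull().
BREAK_HTML = ("<td>", "</td>", "<tr>", "</tr>")


def format_pull_list(release_date, pull_list):
    """Return the output lines for the given pull-list entries."""
    lines = [release_date]
    previous_publisher = None
    for entry in pull_list:
        # One tab-separated row per issue: ID, name, price.
        row = "\t".join([
            str(entry["ID"]),
            str(entry["name"]).replace("\xa0", " "),  # non-breaking space
            str(entry["price"]),
        ])
        for fragment in BREAK_HTML:
            row = row.replace(fragment, "")
        row = row.strip()
        # Emit the publisher heading only when the publisher changes.
        if entry["publisher"] != previous_publisher:
            lines.append(entry["publisher"])
        lines.append(row)
        previous_publisher = entry["publisher"]
    return lines


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = [
        {"ID": "ID001", "name": "SAMPLE\xa0TITLE #1",
         "price": "<td>$3.99</td>", "publisher": "DC COMICS"},
        {"ID": "ID002", "name": "SAMPLE TITLE #2",
         "price": "<td>$2.99</td>", "publisher": "DC COMICS"},
    ]
    for line in format_pull_list("New Releases For (sample date)", sample):
        print(line)
```

Building the lines first and writing them afterwards keeps the grouping logic separate from file handling, which makes the heading-on-change behaviour easy to check by hand.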