yt-dlc/youtube_dl/extractor/closertotruth.py

# coding: utf-8
from __future__ import unicode_literals

import re

from .common import InfoExtractor


class CloserToTruthIE(InfoExtractor):
    _VALID_URL = r'https?://(?:www\.)?closertotruth\.com/(?:[^/]+/)*(?P<id>[^/?#&]+)'
    _TESTS = [{
        'url': 'http://closertotruth.com/series/solutions-the-mind-body-problem#video-3688',
        'info_dict': {
            'id': '0_zof1ktre',
            'display_id': 'solutions-the-mind-body-problem',
            'ext': 'mov',
            'title': 'Solutions to the Mind-Body Problem?',
            'upload_date': '20140221',
            'timestamp': 1392956007,
            'uploader_id': 'CTTXML',
        },
        'params': {
            'skip_download': True,
        },
    }, {
        'url': 'http://closertotruth.com/episodes/how-do-brains-work',
        'info_dict': {
            'id': '0_iuxai6g6',
            'display_id': 'how-do-brains-work',
            'ext': 'mov',
            'title': 'How do Brains Work?',
            'upload_date': '20140221',
            'timestamp': 1392956024,
            'uploader_id': 'CTTXML',
        },
        'params': {
            'skip_download': True,
        },
    }, {
        'url': 'http://closertotruth.com/interviews/1725',
        'info_dict': {
            'id': '1725',
            'title': 'AyaFr-002',
        },
        'playlist_mincount': 2,
    }]

    def _real_extract(self, url):
        display_id = self._match_id(url)

        webpage = self._download_webpage(url, display_id)

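        # Kaltura partner ID, taken from a <script> src URL of the form
        # .../partner_id/<digits> or .../p/<digits>.
        partner_id = self._search_regex(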
            r'<script[^>]+src=["\'].*?\b(?:partner_id|p)/(\d+)',
            webpage, 'kaltura partner_id')

        # Use the part of <title> that precedes the "|" separator.
        title = self._search_regex(
            r'<title>(.+?)\s*\|\s*.+?</title>', webpage, 'video title')

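        # Pages that offer several versions of a video list them in a
        # <select id="select-version"> element; return those as a playlist.
        select = self._search_regex(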
            r'(?s)<select[^>]+id="select-version"[^>]*>(.+?)</select>',
            webpage, 'select version', default=None)
        if select:
            entry_ids = set()
            entries = []
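            # Note: <option> tags are matched against the whole page, not just
            # the captured <select> body; duplicate entry IDs are skipped.
            for mobj in re.finditer(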
                    r'<option[^>]+value=(["\'])(?P<id>[0-9a-z_]+)(?:#.+?)?\1[^>]*>(?P<title>[^<]+)',
                    webpage):
                entry_id = mobj.group('id')
                if entry_id in entry_ids:
                    continue
                entry_ids.add(entry_id)
                entries.append({
                    '_type': 'url_transparent',
                    'url': 'kaltura:%s:%s' % (partner_id, entry_id),
                    'ie_key': 'Kaltura',
                    'title': mobj.group('title'),
                })
            if entries:
                return self.playlist_result(entries, display_id, title)

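        # Single video: the Kaltura entry ID is stored in the data-kaltura
        # attribute of the <a id="embed-kaltura"> element.
        entry_id = self._search_regex(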
            r'<a[^>]+id=(["\'])embed-kaltura\1[^>]+data-kaltura=(["\'])(?P<id>[0-9a-z_]+)\2',
            webpage, 'kaltura entry_id', group='id')

        # Hand off to the Kaltura extractor; url_transparent keeps the
        # display_id and title set here.
        return {
            '_type': 'url_transparent',
            'display_id': display_id,
            'url': 'kaltura:%s:%s' % (partner_id, entry_id),
            'ie_key': 'Kaltura',
            'title': title,
        }

[closertotruth] Add extractor Removed print statement from code. Replaced two regex searches with the corret ones. Removed some unnecessary semicolumns fixed title extraction refactored everything to search_regex processed comments on commit 5650b0d, fixed feedback from flake8 Improved regexes and returns info dict now. Added support for closertotruth interview URL Added support for episodes page 2016-02-26 12:31:52 +00:00
[closertotruth] Update and improve (Closes #8680) 2016-06-18 17:35:29 +00:00