-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MegaCartoons extractor #30952
base: master
Are you sure you want to change the base?
Add MegaCartoons extractor #30952
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work!
Just a few points to consider.
youtube_dl/extractor/megacartoons.py
Outdated
'title': 'Help Wanted', | ||
'ext': 'mp4', | ||
'thumbnail': r're:^https?://.*\.jpg$', | ||
'description': 'Help Wanted: Encouraged by his best friend, Patrick Starfish, ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferably for such a long field use 'md5:...'
youtube_dl/extractor/megacartoons.py
Outdated
video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage) # Get the thumbnail | ||
|
||
# Every video has a short summary -> save it as description | ||
video_description = self._html_search_regex(r'<p>(?P<videodescription>.*)</p>', webpage, 'videodescription', fatal=False) or self._og_search_description(webpage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If there may be a newline in the 'videodescription', the
s
(dotall) flag is needed. - The regex as proposed will match everything between the first
<p>
and the last</p>
(in the test video there's only one<p>
element, but that's not robust). - A named group isn't really required.
video_description = self._html_search_regex(r'<p>(?P<videodescription>.*)</p>', webpage, 'videodescription', fatal=False) or self._og_search_description(webpage) | |
article = self._search_regex( | |
r'(?s)<article\b[^>]*?\bclass\s*=\s*[^>]*?\bpost\b[^>]*>(.+?)</article\b', webpage, 'post', default='') | |
video_description = ( | |
self._html_search_regex(r'(?s)<p>\s*([^<]+)\s*</p>', article 'videodescription', fatal=False) | |
or self._og_search_description(webpage)) |
The suggestion (untested) tries to get the <article>
with class post
. Then the first <p>
element in the article's innerHTML is selected, with the description being its stripped text. An alternative could be to find all the <p>
elements in the page and return the one that matches the start of the ld+json or og:description text.
youtube_dl/extractor/megacartoons.py
Outdated
return { | ||
'id': video_id, | ||
'title': title, | ||
'format': video_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will normally be set automatically. You could apply mimetype2ext()
from utils.py
to the extracted video_type
to get mp4
(in the test video), but that's just what yt-dl should set as format
anyway -- and if you rely on ld+json video_type
won't be available.
youtube_dl/extractor/megacartoons.py
Outdated
video_url = url_json.get('sources')[0].get('src') or self._og_search_video_url(webpage) # Get the video url | ||
video_type = url_json.get('sources')[0].get('type') # Get the video type -> 'video/mp4' | ||
video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage) # Get the thumbnail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is retained, use url_or_none()
from utils.py
to condition the values:
video_url = url_json.get('sources')[0].get('src') or self._og_search_video_url(webpage) # Get the video url | |
video_type = url_json.get('sources')[0].get('type') # Get the video type -> 'video/mp4' | |
video_thumbnail = url_json.get('splash') or self._og_search_thumbnail(webpage) # Get the thumbnail | |
video_url = url_or_none(url_json.get('sources')[0].get('src')) or self._og_search_video_url(webpage) # Get the video url | |
video_type = url_json.get('sources')[0].get('type') # Get the video type -> 'video/mp4' | |
video_thumbnail = url_or_none(url_json.get('splash')) or self._og_search_thumbnail(webpage) # Get the thumbnail |
(and from ..utils import url_or_none
at the top).
@dirkf Thank you for your feedback. I implemented every suggestion aside from |
You may need to add |
With this: def _real_extract(self, url):
# ID is equal to the episode name
video_id = self._match_id(url)
webpage = self._download_webpage(url, video_id)
info = self._search_json_ld(webpage, video_id, expected_type='VideoObject', default={})
raise Exception(json.dumps(info)) I am still getting this: [MegaCartoons] help-wanted: Downloading webpage
E
======================================================================
ERROR: test_MegaCartoons (__main__.TestDownload):
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/daenges/Git/youtube-dl/test/test_download.py", line 158, in test_template
res_dict = ydl.extract_info(
File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 808, in extract_info
return self.__extract_info(url, ie, download, extra_info, process)
File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 815, in wrapper
return func(self, *args, **kwargs)
File "/home/daenges/Git/youtube-dl/youtube_dl/YoutubeDL.py", line 836, in __extract_info
ie_result = ie.extract(url)
File "/home/daenges/Git/youtube-dl/youtube_dl/extractor/common.py", line 534, in extract
ie_result = self._real_extract(url)
File "/home/daenges/Git/youtube-dl/youtube_dl/extractor/megacartoons.py", line 31, in _real_extract
raise Exception(json.dumps(info))
Exception: {}
----------------------------------------------------------------------
Ran 1 test in 1.155s
FAILED (errors=1) |
- Verify description through md5 - Implement robust detection of description - Remove format attribute to allow auto detection - Allow conditioning of URLs
@dirkf do you have any further suggestions to get |
The ld+json structure uses Although a few tweaks to the parser fix this, I'd rather back-port the fancier yt-dlp code, which looks like it should already handle this and some other advanced structures, in due course. Meanwhile let's just specialise |
@dirkf Sooo... I copy pasted your code and the test passes. If you have no further suggestions, this PR is ready for merge. Did not contribute that much in the end. ._. |
It's very interesting pull request. Hope you can merge it soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if I can force the CI test.
You forgot to add support for numbers and uppercase letters, preventing a handful of episodes from working: |
Can you suggest test URLs? |
@dirkf This url, for one: https://www.megacartoons.net/1000-years-of-courage/ |
Please follow the guide below
x
into all the boxes [ ] relevant to your pull request (like that [x])Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
MegaCartoons provides many cartoon series, such as SpongeBob or Ben10. As far as I found out, the site has no problems concerning copyright. This extractor allows the extraction of the video URL of single episodes.