Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for shows.acast.com #32747

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

PierreMesure
Copy link

@PierreMesure PierreMesure commented Mar 17, 2024

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

I tried to add support for Acast's url shows.acast.com. It's my first PR and I wasn't able to check my regex properly so I hope someone more experienced can review it properly. That being said, I tested with the new URLs and it works 🙂.

While testing with the old URLs to ensure no regression, I noticed that some of them are not active anymore: https://www.acast.com/todayinfocus returns a 404.

I didn't remove it as I wasn't sure I should but I'm happy to do a bit of cleaning if someones greenlights it.

I also saw that they are using a new format with a dollar sign for embeds URL but (not being a regex expert) I failed to add it. Here's an example: https://embed.acast.com/$/611233d8767fdf0012f22cb6/611233e4988b6a001394b5cf?feed=true

Copy link
Contributor

@dirkf dirkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work!

I've made a couple of suggestions to fix/improve the regexes (should make the CI tests pass).

For tests that don't work because the page is unavailable, you can add 'skip': 'reason why unavailable', to the test case dict with the info_dict key. If that leaves no valid test page for one of the extraction tactics being tested, a test with a valid page that has the same structure has to be added, or it should be documented in comments that the structure is obsolete.

youtube_dl/extractor/acast.py Outdated Show resolved Hide resolved
youtube_dl/extractor/acast.py Outdated Show resolved Hide resolved
@PierreMesure PierreMesure requested a review from dirkf March 29, 2024 13:45
@PierreMesure
Copy link
Author

I committed your suggestions and changed the tests as you asked. Let's see whether there are still any errors.

Copy link
Contributor

@dirkf dirkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes, not all related to the current PR, make the download tests pass:

--- PR/youtube-dl/youtube_dl/extractor/acast.py
+++ new/youtube-dl/youtube_dl/extractor/acast.py
@@ -55,16 +55,16 @@
                     '''
     _TESTS = [{
         'url': 'https://www.acast.com/sparpodcast/2.raggarmordet-rosterurdetforflutna',
-        'md5': 'f5598f3ad1e4776fed12ec1407153e4b',
+        'md5': '9904ad5409bbc3b945efc1fe55e5f5d4',
         'info_dict': {
             'id': '2a92b283-1a75-4ad8-8396-499c641de0d9',
             'ext': 'mp3',
             'title': '2. Raggarmordet - Röster ur det förflutna',
-            'description': 'md5:a992ae67f4d98f1c0141598f7bebbf67',
+            'description': r're:(?s)Spår får kontakt med raggarna från 1979\. En titt i arkiven ger oss en gammal förundersökning som någon sparat trots alla bestämmelser om att den bör förstöras\. .{356}\.$',  # md5:a992ae67f4d98f1c0141598f7bebbf67',
             'timestamp': 1477346700,
             'upload_date': '20161024',
             'duration': 2766,
-            'creator': 'Anton Berg & Martin Johnson',
+            'creator': 'Third Ear Studio',
             'series': 'Spår',
             'episode': '2. Raggarmordet - Röster ur det förflutna',
         }
@@ -83,7 +83,7 @@
     }]
 
     def _real_extract(self, url):
-        channel, display_id = re.match(self._VALID_URL, url).groups()
+        channel, display_id = re.match(self._VALID_URL, url).group('channel', 'id')
         episode = self._call_api(
             '%s/episodes/%s' % (channel, display_id),
             display_id, {'showInfo': 'true'})
@@ -106,8 +106,7 @@
         'info_dict': {
             'id': '4efc5294-5385-4847-98bd-519799ce5786',
             'title': 'Today in Focus',
-            'description': 'md5:c09ce28c91002ce4ffce71d6504abaae',
-            'skip': 'Error 404'
+            'description': r're:Hosted by Nosheen Iqbal and Michael Safi, Today in Focus brings you closer to Guardian journalism\. .{994}$',
         },
         'playlist_mincount': 200,
     }, {
@@ -120,7 +119,8 @@
         'url': 'https://play.acast.com/s/getthebeltpod',
         'info_dict': {
             'id': '4caac858-3cda-44e6-b684-50d88b909c18',
-            'title': 'Get The Belt'
+            'title': 'Get The Belt',
+            'description': r'''re:(?s)British Podcast that you probably are not ready for! Join Myles with segments 'such as Myles Measures' that discusses relevant pop culture and society shenanigans,  .{867}$''',
         },
         'playlist_mincount': 138,
     }]
@@ -137,4 +137,4 @@
         for episode in (show.get('episodes') or []):
             entries.append(self._extract_episode(episode, show_info))
         return self.playlist_result(
-            entries, show.get('id'), show.get('title'), show.get('description'))
+            entries, show.get('id'), show.get('title'), clean_html(show.get('description')))

Example:

$ python test/test_download.py TestDownload.test_ACast
[acast] 2.raggarmordet-rosterurdetforflutna: Downloading JSON metadata
[info] Writing video description metadata as JSON to: test_ACast_2a92b283-1a75-4ad8-8396-499c641de0d9.info.json
[debug] Invoking downloader on u'https://sphinx.acast.com/sparpodcast/2.raggarmordet-rosterurdetforflutna/media.mp3?tk=eyJ0ayI6ImRlZmF1bHQiLCJhZHMiOnRydWUsInNwb25zIjp0cnVlLCJzdGF0dXMiOiJwdWJsaWMifQ==&sig=4mClRAB0qh2d8ZixupPWgjm9YxaIJSPcak-ZATQ-H4g'
[download] Destination: test_ACast_2a92b283-1a75-4ad8-8396-499c641de0d9.mp3
[download] 100% of 10.00KiB in 00:00
.
----------------------------------------------------------------------
Ran 1 test in 3.362s

OK
$ 

@dirkf
Copy link
Contributor

dirkf commented Apr 2, 2024

For long text fields, the download test proposes a test value expressed as a digest using MD5. Such a test value is accurate and short, but doesn't reveal what changed if the test starts to fail, as happened here.

In the diff I used pattern test values for description. If a test starts to fail, a pattern can be compared with the new value, showing whether there's just been a small change to the value, as when the site starts to append some or different boilerplate, or whether a completely different extraction tactic is needed to get the value correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants