Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Fetching YouTube transcript data from a list of video_ids

I have a function that can fetch the transcript of a youtube video. (You don’t have to understand every line)

!pip install youtube_transcript_api
!pip install simplejson
from youtube_transcript_api import YouTubeTranscriptApi

import urllib.request
import json
import urllib


VideoID = 'LfC6pv8VISk'

def fetch_transcript():
    
    params = {"format": "json", "url": "https://www.youtube.com/watch?v=%s" % VideoID}
    url = "https://www.youtube.com/oembed"
    query_string = urllib.parse.urlencode(params)
    url = url + "?" + query_string

    with urllib.request.urlopen(url) as response:
        response_text = response.read()
        data = json.loads(response_text.decode())
        print('Titel: ' + data['title'])
    
    # retrieve the available transcripts
    transcript_list = YouTubeTranscriptApi.list_transcripts(VideoID)

    # iterate over all available transcripts
    for transcript in transcript_list:

        # fetch the actual transcript data
        data = transcript.fetch()
        print(data)         

    # filter for language 
    transcript = transcript_list.find_transcript(['en'])  

fetch_transcript()

Output: [{‘text’: "Okay, here we go. This one’s gonna be\nquick. So get your coffee ready today.", ‘start’: 0.15, ‘duration’: 2.97…}]

Now I collected a list of video_ids like:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

VideoID = ['LfC6pv8VISk', 'befUVytFC80', '4c_rKOaTquM']

How can I iterate over the list using my function?
My function only takes a single string like ‘LfC6pv8VISk’

                                                           #the single string ↓

params = {"format": "json", "url": "https://www.youtube.com/watch?v=%s" % VideoID}

for i in VideoID … fetch_transcript() doesn’t work. It feels so easy but I’m stuck.

>Solution :

Try making your function accept video id as parameter.

You can safely ignore imports from typing, it’s just for annotations.

from typing import Iterable, Tuple, Dict, List, Any

from youtube_transcript_api import YouTubeTranscriptApi

import urllib.request
import json
import urllib


def fetch_transcript(vid_id) -> Tuple[str, List[Dict[str, Any]]]:
    params = {"format": "json", "url": f"https://www.youtube.com/watch?v={vid_id}"}
    url = "https://www.youtube.com/oembed"

    query_string = urllib.parse.urlencode(params)
    url = url + "?" + query_string

    print(f"Fetching from {url}")

    with urllib.request.urlopen(url) as response:
        response_text = response.read()
        title = json.loads(response_text.decode())["title"]

    # retrieve the available transcripts
    transcript_list = YouTubeTranscriptApi.list_transcripts(vid_id)

    return title, transcript_list.find_transcript(['en']).fetch()


def fetch_transcript_gen(vid_ids: Iterable):
    for video_id in vid_ids:
        yield fetch_transcript(video_id)


if __name__ == '__main__':
    from pprint import pprint
    for title_, data_ in fetch_transcript_gen(['LfC6pv8VISk', 'befUVytFC80', '4c_rKOaTquM']):
        print(f"\nTitle: {title_}\nData length: {len(data_)}\n")
        pprint(data_)

Output:

Fetching from https://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DLfC6pv8VISk

Title: KILL Linux processes!! (also manage them) // Linux for Hackers // EP 7
Data length: 399

[{'duration': 2.97,
  'start': 0.15,
  'text': "Okay, here we go. This one's gonna be\n"
          'quick. So get your coffee ready today.'},
 {'duration': 4.4,
  'start': 3.12,
  'text': "We're gonna learn how to manage our\n"
          'processes, processes  and Lennox.'},
...

Fetching from https://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DbefUVytFC80

Title: you need to create a Cryptocurrency RIGHT NOW!! (Solana token)
Data length: 837

...

Fetching from https://www.youtube.com/oembed?format=json&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D4c_rKOaTquM

Title: putting 5G and MEC to the test!! (does it even matter??)
Data length: 741

...
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading