How to Scrape YouTube Data — Video Metrics and Comments
Need to scrape YouTube data for market research, competitor analysis, or content strategy? YouTube is the second largest search engine, and its data — view counts, engagement metrics, comment sentiment, channel growth — is valuable for creators, marketers, and researchers alike.
This tutorial covers every approach: the official YouTube Data API v3, lightweight libraries for metadata extraction, and SearchHive for scraping when the API falls short. Each method includes working Python code.
Key Takeaways
- YouTube Data API v3 is the official but rate-limited option (10K quota units/day on free tier)
- yt-dlp extracts video metadata without API keys — faster setup, no quotas
- SearchHive ScrapeForge handles full page scraping including comments, thumbnails, and related videos
- For comment scraping at scale, combining yt-dlp metadata with SearchHive page scraping is the most reliable approach
- Always respect YouTube's Terms of Service and rate limits to avoid API key suspension
Prerequisites
pip install searchhive yt-dlp google-api-python-client
- searchhive — Web scraping API (free tier: 50K requests/month)
- yt-dlp — Video/metadata downloader, no API key needed
- google-api-python-client — Official YouTube Data API client
Optional: A YouTube Data API v3 key from Google Cloud Console (free, 10K quota/day).
Step 1: Extract Video Metadata with yt-dlp
yt-dlp is the fastest way to pull video metadata without an API key:
import yt_dlp
def get_video_metadata(video_url: str) -> dict:
"""Extract comprehensive metadata from a YouTube video."""
ydl_opts = {
'quiet': True,
        'skip_download': True,  # metadata only; don't download the video
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
info = ydl.extract_info(video_url, download=False)
return {
'id': info.get('id'),
'title': info.get('title'),
'uploader': info.get('uploader'),
'uploader_id': info.get('uploader_id'),
'channel_url': info.get('channel_url'),
'duration': info.get('duration'),
'view_count': info.get('view_count'),
'like_count': info.get('like_count'),
'comment_count': info.get('comment_count'),
'upload_date': info.get('upload_date'),
'description': info.get('description', '')[:500],
'tags': info.get('tags', []),
'categories': info.get('categories', []),
'thumbnail': info.get('thumbnail'),
'live_status': info.get('live_status'),
}
# Usage
metadata = get_video_metadata("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(f"Title: {metadata['title']}")
print(f"Views: {metadata['view_count'] or 0:,}")
print(f"Likes: {metadata['like_count'] or 0:,}")  # like_count can be None when hidden
print(f"Duration: {metadata['duration']}s")
print(f"Tags: {metadata['tags'][:5]}")
This works without any API key. yt-dlp parses the video page directly and extracts structured metadata.
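Two of the returned fields need post-processing before display: duration comes back as raw seconds, and upload_date as a YYYYMMDD string. A couple of small helpers (illustrative, not part of yt-dlp):

```python
from datetime import datetime

def format_duration(seconds: int) -> str:
    """Render yt-dlp's duration (seconds) as H:MM:SS, or M:SS for short videos."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"

def parse_upload_date(yyyymmdd: str) -> str:
    """yt-dlp returns upload_date as a YYYYMMDD string; convert to ISO format."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").date().isoformat()
```

For example, `format_duration(212)` gives `"3:32"` and `parse_upload_date("20091025")` gives `"2009-10-25"`.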
Batch Extract Multiple Videos
def batch_metadata(video_urls: list[str]) -> list[dict]:
"""Extract metadata from multiple YouTube videos."""
results = []
for url in video_urls:
try:
meta = get_video_metadata(url)
results.append(meta)
print(f"OK: {meta['title'][:50]}...")
except Exception as e:
print(f"FAIL {url}: {e}")
return results
urls = [
"https://www.youtube.com/watch?v=video1",
"https://www.youtube.com/watch?v=video2",
"https://www.youtube.com/watch?v=video3",
]
videos = batch_metadata(urls)
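Fetching many pages back-to-back invites throttling. A hedged variant that accepts any fetch function (so it can wrap get_video_metadata above) and pauses between requests:

```python
import time
from typing import Callable

def batch_throttled(urls: list[str], fetch: Callable[[str], dict],
                    delay: float = 1.0) -> list[dict]:
    """Fetch metadata for each URL with a fixed pause between requests."""
    results = []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as e:
            print(f"FAIL {url}: {e}")
        time.sleep(delay)  # simple fixed delay; tune to your volume
    return results
```

Usage: batch_throttled(urls, get_video_metadata, delay=2.0). A fixed delay is the simplest option; exponential backoff on failures is the next refinement.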
Step 2: Use the YouTube Data API v3 for Structured Queries
The official API gives you search, channel info, and playlist data — things yt-dlp doesn't handle well:
from googleapiclient.discovery import build
def get_channel_stats(api_key: str, channel_id: str) -> dict:
"""Get channel statistics using YouTube Data API v3."""
youtube = build('youtube', 'v3', developerKey=api_key)
request = youtube.channels().list(
part='statistics,snippet,brandingSettings',
id=channel_id
)
response = request.execute()
    if response.get('items'):  # 'items' is absent when no channel matches
channel = response['items'][0]
stats = channel['statistics']
snippet = channel['snippet']
return {
'name': snippet['title'],
'subscribers': int(stats.get('subscriberCount', 0)),
'total_views': int(stats.get('viewCount', 0)),
'video_count': int(stats.get('videoCount', 0)),
'description': snippet.get('description', '')[:300],
'published_at': snippet.get('publishedAt'),
'thumbnails': snippet.get('thumbnails', {}),
}
return {}
def search_videos(api_key: str, query: str, max_results: int = 10) -> list[dict]:
"""Search YouTube videos by keyword."""
youtube = build('youtube', 'v3', developerKey=api_key)
request = youtube.search().list(
part='snippet',
q=query,
type='video',
maxResults=max_results,
order='viewCount'
)
response = request.execute()
videos = []
for item in response.get('items', []):
videos.append({
'video_id': item['id']['videoId'],
'title': item['snippet']['title'],
'channel': item['snippet']['channelTitle'],
'published_at': item['snippet']['publishedAt'],
'url': f"https://youtube.com/watch?v={item['id']['videoId']}",
})
return videos
# Usage (replace with your API key)
# stats = get_channel_stats("YOUR_API_KEY", "UCxxxxxxxx")
# results = search_videos("YOUR_API_KEY", "python web scraping tutorial")
API Quota Management
YouTube API v3 costs quota units:
- search.list: 100 units per request
- videos.list: 1 unit per request
- commentThreads.list: 1 unit per request
- Free tier: 10,000 units/day = ~100 searches or ~10,000 video lookups
If you're hitting quota limits, switch to yt-dlp for metadata and save API quota for searches.
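To know when to switch, you can tally your calls against the costs above. A minimal budgeting helper (illustrative; the per-method costs are the documented API v3 values):

```python
# Documented quota costs per YouTube Data API v3 method
QUOTA_COSTS = {"search.list": 100, "videos.list": 1,
               "commentThreads.list": 1, "channels.list": 1}

def quota_remaining(calls: dict[str, int], daily_quota: int = 10_000) -> int:
    """Estimate units left today given a tally of {method: call count}."""
    used = sum(QUOTA_COSTS.get(method, 1) * count for method, count in calls.items())
    return daily_quota - used
```

For example, 50 searches plus 200 video lookups consume 5,200 units, leaving 4,800 for the day.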
Step 3: Scrape Comments with SearchHive
The official API does expose comments via commentThreads.list, but every page costs quota and deep pagination gets slow on popular videos. For bulk comment scraping, use SearchHive:
from searchhive import ScrapeForge
def scrape_youtube_comments(video_url: str, max_comments: int = 50) -> list[dict]:
"""Scrape comments from a YouTube video using SearchHive."""
client = ScrapeForge()
result = client.scrape(
url=video_url,
render_js=True,
wait_for="#comments-section, ytd-comments",
selectors={
"comments": {
"each": "#content ytd-comment-thread-renderer",
"fields": {
"author": "#author-text span",
"text": "#content-text",
"likes": "#vote-count-middle",
"time": ".time a",
}
}
}
)
comments = result.data.get("comments", [])
return comments[:max_comments]
# Usage
comments = scrape_youtube_comments("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
for c in comments[:10]:
print(f"{c.get('author', 'Unknown')}: {c.get('text', '')[:80]}...")
YouTube loads comments dynamically with JavaScript. SearchHive's render_js=True waits for the comments section to populate before extracting data.
Step 4: Get Video Transcript
Transcripts are gold for content analysis, NLP, and SEO research:
# pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi
def get_transcript(video_id: str) -> str:
"""Get the full transcript of a YouTube video."""
try:
transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
full_text = " ".join(
entry['text'] for entry in transcript_list
)
return full_text
except Exception as e:
print(f"No transcript available: {e}")
return ""
# Extract video ID from URL
video_id = "dQw4w9WgXcQ"
transcript = get_transcript(video_id)
print(f"Transcript length: {len(transcript)} characters")
print(transcript[:500] + "...")
Step 5: Scrape Search Results and Trending
YouTube search results pages contain rankings, view counts, and channel info. Use SearchHive to scrape them:
from searchhive import ScrapeForge
def scrape_youtube_search(query: str) -> list[dict]:
"""Scrape YouTube search results without using the API."""
client = ScrapeForge()
url = f"https://www.youtube.com/results?search_query={query.replace(' ', '+')}"
result = client.scrape(
url=url,
render_js=True,
wait_for="ytd-video-renderer",
selectors={
"videos": {
"each": "ytd-video-renderer",
"fields": {
"title": "#video-title",
"channel": "ytd-channel-name a",
"views": "#metadata-line span:first-child",
"time": "#metadata-line span:last-child",
"url": "a#video-title @href",
}
}
}
)
return result.data.get("videos", [])
# Usage
results = scrape_youtube_search("python web scraping tutorial")
for v in results[:5]:
print(f"{v.get('title', 'N/A')[:60]} — {v.get('views', 'N/A')}")
This bypasses API quotas entirely. Each search query uses one SearchHive request instead of 100 API quota units.
Step 6: Build a Complete YouTube Analytics Pipeline
Combine all the methods into a unified pipeline:
import json
import csv
from datetime import datetime
from searchhive import ScrapeForge
import yt_dlp
class YouTubeScraper:
def __init__(self):
self.scrape_client = ScrapeForge()
        self.ydl_opts = {'quiet': True, 'skip_download': True}
def get_metadata(self, video_url: str) -> dict:
with yt_dlp.YoutubeDL(self.ydl_opts) as ydl:
info = ydl.extract_info(video_url, download=False)
return {
'id': info['id'],
'title': info['title'],
'channel': info['uploader'],
'views': info.get('view_count', 0),
'likes': info.get('like_count', 0),
'duration': info.get('duration', 0),
'upload_date': info.get('upload_date'),
'tags': info.get('tags', []),
}
def get_comments(self, video_url: str) -> list[dict]:
result = self.scrape_client.scrape(
url=video_url, render_js=True,
wait_for="#comments-section",
selectors={"comments": {
"each": "ytd-comment-thread-renderer",
"fields": {
"author": "#author-text span",
"text": "#content-text",
"likes": "#vote-count-middle",
}
}}
)
return result.data.get("comments", [])
def full_analysis(self, video_url: str) -> dict:
metadata = self.get_metadata(video_url)
comments = self.get_comments(video_url)
metadata['comments'] = comments
metadata['comment_count'] = len(comments)
metadata['scraped_at'] = datetime.utcnow().isoformat()
return metadata
# Usage
scraper = YouTubeScraper()
analysis = scraper.full_analysis("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
with open("youtube_analysis.json", "w") as f:
json.dump(analysis, f, indent=2, default=str)
print(f"Title: {analysis['title']}")
print(f"Views: {analysis['views']:,}")
print(f"Comments scraped: {len(analysis['comments'])}")
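With comments in hand, a crude word-frequency pass is a useful sanity check before real sentiment analysis. This helper is illustrative (the stopword list is a minimal stand-in, not a real NLP resource):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "to", "of", "i", "in", "for"}

def top_comment_terms(comments: list[dict], n: int = 10) -> list[tuple[str, int]]:
    """Most frequent words across scraped comment text, minus common stopwords."""
    words = []
    for c in comments:
        words += [w for w in re.findall(r"[a-z']+", c.get("text", "").lower())
                  if w not in STOPWORDS and len(w) > 2]
    return Counter(words).most_common(n)
```

Feed it the comments list from full_analysis to see which terms dominate the discussion.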
Complete Code Example
Here's a production-ready script that analyzes multiple videos and exports results:
from searchhive import ScrapeForge
import yt_dlp
import json
import csv
from datetime import datetime
def analyze_videos(video_urls: list[str], output_csv: str = "youtube_data.csv"):
"""Analyze multiple YouTube videos and export to CSV."""
client = ScrapeForge()
    ydl_opts = {'quiet': True, 'skip_download': True}
results = []
for url in video_urls:
try:
# Get metadata via yt-dlp
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
info = ydl.extract_info(url, download=False)
video_data = {
'url': url,
'title': info.get('title', ''),
'channel': info.get('uploader', ''),
'views': info.get('view_count', 0),
'likes': info.get('like_count', 0),
'duration_sec': info.get('duration', 0),
'upload_date': info.get('upload_date', ''),
'tags': '|'.join(info.get('tags', [])[:10]),
}
# Get comment count via SearchHive
try:
scrape_result = client.scrape(
url=url, render_js=True,
wait_for="#comments-section",
selectors={
"comment_count": "#count ytd-comments-header-renderer h2",
"top_comments": {
"each": "ytd-comment-thread-renderer",
"limit": 5,
"fields": {
"author": "#author-text span",
"text": "#content-text",
}
}
}
)
                if scrape_result.data:
                    comments = scrape_result.data.get("top_comments", [])
                    video_data['comments_scraped'] = len(comments)  # capped at 5 by the selector's limit
                    video_data['top_comment'] = comments[0].get('text', '')[:100] if comments else ""
            except Exception:
                video_data['comments_scraped'] = 0
results.append(video_data)
print(f"OK: {video_data['title'][:50]} — {video_data['views']:,} views")
except Exception as e:
print(f"FAIL {url}: {e}")
# Export to CSV
if results:
fieldnames = list(results[0].keys())
with open(output_csv, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(results)
print(f"\nExported {len(results)} videos to {output_csv}")
if __name__ == "__main__":
urls = [
"https://www.youtube.com/watch?v=example1",
"https://www.youtube.com/watch?v=example2",
]
analyze_videos(urls)
Common Issues
yt-dlp returns "Video unavailable"
The video may be private, deleted, or geo-restricted. Check with info.get('availability') before processing.
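A guard you can apply to the info dict yt-dlp returns (availability values follow yt-dlp's conventions; None means the extractor didn't report one):

```python
def is_scrapable(info: dict) -> bool:
    """Filter out videos yt-dlp marks as private, members-only, or login-gated."""
    return info.get("availability") in (None, "public", "unlisted")
```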
YouTube API quota exhausted
Switch to yt-dlp for metadata extraction and SearchHive for page scraping. Neither uses API quota.
Comments not loading with SearchHive
YouTube may require scrolling to load more comments. Use SearchHive's actions parameter to trigger scroll events, or increase the wait_for timeout.
Rate limiting from YouTube
SearchHive's proxy rotation distributes requests across different IPs, reducing the chance of rate limiting. Keep concurrency low (2-3 simultaneous requests) for YouTube specifically.
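One way to enforce that cap, with a fetch function of your choosing, is a thread pool with max_workers set to the concurrency ceiling (a sketch; results come back in input order):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def scrape_concurrently(urls: list[str], fetch: Callable[[str], dict],
                        max_workers: int = 3) -> list[dict]:
    """Run fetches with a hard cap on simultaneous requests."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```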
Next Steps
- Combine YouTube data with SearchHive DeepDive for sentiment analysis on comments
- Check /blog/how-to-monitor-competitor-prices-with-python-automated-system for scraping competitor pricing strategies
- See /compare/scraperapi for how SearchHive compares to other scraping APIs on reliability and cost
Start scraping YouTube data with SearchHive's free tier — 50,000 requests/month with JS rendering and proxy rotation. No API key needed for page scraping. Read the docs.