# Excel Export Guide
MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.
Exported files are written to the `data/{platform}/` directory with timestamped filenames.

Excel export requires the `openpyxl` library:
```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl
```
Then set the save option in `config/base_config.py`:

```python
SAVE_DATA_OPTION = "excel"  # Change from jsonl/json/csv/db to excel
```
```bash
# Xiaohongshu example
uv run main.py --platform xhs --lt qrcode --type search

# Douyin example
uv run main.py --platform dy --lt qrcode --type search

# Bilibili example
uv run main.py --platform bili --lt qrcode --type search
```
Output files appear in the `data/{platform}/` directory, named `{platform}_{crawler_type}_{timestamp}.xlsx`, for example `xhs_search_20250128_143025.xlsx`.

You can also select Excel output per run with the `--save_data_option` flag:

```bash
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel

# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
```
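The filename pattern above can be sketched in a few lines. This is an illustrative reconstruction, not MediaCrawler's actual path-building code; the `export_filename` helper is hypothetical:

```python
from datetime import datetime


def export_filename(platform: str, crawler_type: str) -> str:
    # {platform}_{crawler_type}_{timestamp}.xlsx, e.g. xhs_search_20250128_143025.xlsx
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"data/{platform}/{platform}_{crawler_type}_{ts}.xlsx"


print(export_filename("xhs", "search"))
```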
The **Contents** sheet contains post/video information:

- `note_id`: Unique post identifier
- `title`: Post title
- `desc`: Post description
- `user_id`: Author user ID
- `nickname`: Author nickname
- `liked_count`: Number of likes
- `comment_count`: Number of comments
- `share_count`: Number of shares
- `ip_location`: IP location
- `image_list`: Comma-separated image URLs
- `tag_list`: Comma-separated tags
- `note_url`: Direct link to post

The **Comments** sheet contains comment information:

- `comment_id`: Unique comment identifier
- `note_id`: Associated post ID
- `content`: Comment text
- `user_id`: Commenter user ID
- `nickname`: Commenter nickname
- `like_count`: Comment likes
- `create_time`: Comment timestamp
- `ip_location`: Commenter location
- `sub_comment_count`: Number of replies

The **Creators** sheet contains creator/author information:

- `user_id`: Unique user identifier
- `nickname`: Display name
- `gender`: Gender
- `avatar`: Profile picture URL
- `desc`: Bio/description
- `fans`: Follower count
- `follows`: Following count
- `interaction`: Total interactions

**Large datasets:** For very large crawls (>10,000 rows), consider using database storage instead for better performance.
**Data analysis:** Excel files work well with pandas:

```python
import pandas as pd

df = pd.read_excel('file.xlsx')
```

**Combining data:** You can merge multiple Excel files:

```python
import pandas as pd

df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
combined = pd.concat([df1, df2])
combined.to_excel('combined.xlsx', index=False)
```
**File size:** Excel files are typically 2-3× larger than CSV but smaller than JSON.
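Beyond single-sheet reads, pandas can load every sheet of a multi-sheet export in one call with `sheet_name=None`, which returns a dict of DataFrames keyed by sheet name. The sample rows and file name below are made up for illustration:

```python
import pandas as pd

# Build a small two-sheet workbook mirroring the export layout
contents = pd.DataFrame([{"note_id": "123", "title": "Test Post", "liked_count": 100}])
comments = pd.DataFrame([{"comment_id": "c1", "note_id": "123", "content": "Nice!"}])

with pd.ExcelWriter("sample_export.xlsx") as writer:
    contents.to_excel(writer, sheet_name="Contents", index=False)
    comments.to_excel(writer, sheet_name="Comments", index=False)

# sheet_name=None loads all sheets into a dict: {sheet name -> DataFrame}
sheets = pd.read_excel("sample_export.xlsx", sheet_name=None)
print(sorted(sheets))
```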
If `openpyxl` is missing, install it:

```bash
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
```
If no Excel file is produced, check that:

- `SAVE_DATA_OPTION = "excel"` is set in the config
- the `data/{platform}/` directory exists
After a successful crawl, you'll see log output like:

```
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
```
The resulting Excel file has styled headers and separate sheets for contents, comments, and creators.

You can also drive the store programmatically:

```python
from store.excel_store_base import ExcelStoreBase

# Create the store
store = ExcelStoreBase(platform="xhs", crawler_type="search")

# Store data (store_content is a coroutine)
await store.store_content({
    "note_id": "123",
    "title": "Test Post",
    "liked_count": 100,
})

# Save to file
store.flush()
```
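Because `store_content` is a coroutine, calls outside the crawler's event loop must be driven with `asyncio`. The sketch below uses a hypothetical `StubStore` stand-in (not the real `ExcelStoreBase`) just to show the call pattern:

```python
import asyncio


class StubStore:
    """Hypothetical stand-in exposing the same store_content/flush surface."""

    def __init__(self):
        self.rows = []

    async def store_content(self, item: dict):
        self.rows.append(item)  # the real store buffers rows for the worksheet

    def flush(self):
        return len(self.rows)  # the real store writes the .xlsx file here


async def crawl():
    store = StubStore()
    await store.store_content({"note_id": "123", "title": "Test Post"})
    return store.flush()


print(asyncio.run(crawl()))  # → 1
```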
You can extend `ExcelStoreBase` to customize formatting:

```python
from store.excel_store_base import ExcelStoreBase

class CustomExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Keep the default header styling
        super()._apply_header_style(sheet, row_num)
        # Add your customizations here
```
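For reference, header styling with `openpyxl` generally looks like the snippet below. The column names and colors are illustrative, not MediaCrawler's actual defaults:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

wb = Workbook()
ws = wb.active
ws.append(["note_id", "title", "liked_count"])  # header row

# Bold white text on a blue fill, applied cell by cell across row 1
for cell in ws[1]:
    cell.font = Font(bold=True, color="FFFFFF")
    cell.fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")

wb.save("styled_headers.xlsx")
print(ws["A1"].font.bold)
```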
For issues or questions, please refer to the project's issue tracker.
**Note:** Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.