Web Scraping from Telugu Wikipedia Discussions
Telugu Wikipedia Discussion Scraper

This Python script scrapes user-contributed discussion content from Telugu Wikipedia talk pages. It extracts meaningful conversational text from user discussions and saves it to a text file. The script ensures only Telugu text contributed by users is retained by removing special characters, timestamps, and unrelated content.
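To give a rough idea of what this filtering can look like, here is a minimal sketch. It is not the script's actual code: the function name clean_telugu, the timestamp pattern, and the choice of which punctuation to keep are all assumptions made for illustration. It relies on the Telugu Unicode block, which spans U+0C00 to U+0C7F.

```python
import re

# Signature timestamps, assumed here to look like "10:42, 5 మార్చి 2021 (UTC)".
# This pattern is a guess at the format; adjust it to what the pages contain.
TIMESTAMP = re.compile(r"\d{1,2}:\d{2},\s*\d{1,2}\s+\S+\s+\d{4}\s*\(UTC\)")

# Anything that is not a Telugu code point (U+0C00-U+0C7F), whitespace,
# or one of a few sentence punctuation marks (an illustrative choice).
NON_TELUGU = re.compile(r"[^\u0C00-\u0C7F\s.,?!]")


def clean_telugu(text: str) -> str:
    """Return only the Telugu portion of a raw discussion string."""
    text = TIMESTAMP.sub(" ", text)    # drop timestamps first, month name included
    text = NON_TELUGU.sub(" ", text)   # drop Latin letters, digits, wiki markup
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

Removing timestamps before the character filter matters: otherwise the Telugu month name inside each signature would survive as stray text.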
Features:
- Custom Session with Retry Logic: Ensures reliable scraping by handling transient network issues.
- Telugu Content Extraction: Retains only conversational Telugu text, filtering out other details such as special characters and timestamps.
- Support for Pagination: Handles multiple pages of discussions with options to limit the number of titles fetched.
- Debugging Output: Prints raw and cleaned content for verification during scraping.
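As an illustration of the retry feature, a session of this kind can be built with the requests library roughly as follows. This is a sketch under assumptions, not the script's own code: the function name make_session, the retry count, the backoff factor, the list of status codes, and the User-Agent string are all placeholders.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries: int = 3, backoff: float = 1.0) -> requests.Session:
    """Build a requests session that retries transient failures."""
    retry = Retry(
        total=total_retries,          # maximum number of retries per request
        backoff_factor=backoff,       # wait grows between successive attempts
        status_forcelist=(429, 500, 502, 503, 504),  # retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Placeholder User-Agent; a descriptive one is good practice for scrapers.
    session.headers.update({"User-Agent": "telugu-talk-scraper-example/0.1"})
    return session
```

Every request made through the returned session then retries automatically on connection errors and on the listed HTTP status codes.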
Requirements: Python 3.x
Libraries: requests (third-party; install with pip install requests), re (part of the Python standard library)
Usage: Adjust title_limit in the script to specify the number of discussion titles to fetch. Run the script to save the extracted discussions to telugu_wikipedia_user_comments.txt.
Example Command: python scraper.py
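For readers curious how a title_limit setting can interact with pagination, the sketch below lists talk-page titles through the MediaWiki Action API and stops once the limit is reached. It assumes the script uses list=allpages on namespace 1 (Talk), which this README does not confirm; the function name iter_talk_titles and the batch_size parameter are hypothetical.

```python
from typing import Iterator, Optional

API_URL = "https://te.wikipedia.org/w/api.php"


def iter_talk_titles(session, title_limit: Optional[int] = None,
                     batch_size: int = 50) -> Iterator[str]:
    """Yield talk-page titles, following API continuation up to title_limit.

    `session` is any object with a requests-style get() method,
    for example a requests.Session.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": 1,        # namespace 1 is Talk (assumed target)
        "aplimit": batch_size,   # titles per request
    }
    fetched = 0
    while True:
        response = session.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()
        for page in data.get("query", {}).get("allpages", []):
            if title_limit is not None and fetched >= title_limit:
                return
            yield page["title"]
            fetched += 1
        if title_limit is not None and fetched >= title_limit:
            return               # limit reached; skip the next request
        cont = data.get("continue")
        if not cont:
            return               # no more pages to fetch
        params = {**params, **cont}  # carry the continuation tokens forward

# Example use (performs real network requests):
#   import requests
#   for title in iter_talk_titles(requests.Session(), title_limit=5):
#       print(title)
```

Passing title_limit=None walks through every batch the API returns, while a small value keeps test runs short.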
Output: The script writes the cleaned Telugu text contributed by users in Wikipedia discussions to telugu_wikipedia_user_comments.txt.