Web Scraping from telugu wikipedia discussions (!1) · Merge requests · Manish kumar Sahu / Telugu Language Corpus Collection Platform

shravyavenishetty requested to merge shravyavenishetty03/corpus-collection:shravyavenishetty03-develop-patch-64875 into develop Nov 22, 2024

Telugu Wikipedia Discussion Scraper

This Python script scrapes user-contributed discussion content from Telugu Wikipedia talk pages. It extracts conversational text from user discussions and saves it to a text file. To retain only the Telugu text that users contributed, it removes special characters, timestamps, and unrelated content.

Features:
- Custom Session with Retry Logic: Ensures reliable scraping by handling transient network issues.
- Telugu Content Extraction: Retains only conversational Telugu text, filtering out other details such as special characters and timestamps.
- Support for Pagination: Handles multiple pages of discussions with options to limit the number of titles fetched.
- Debugging Output: Prints raw and cleaned content for verification during scraping.
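The retry behaviour listed above could be set up roughly as follows. This is a minimal sketch, not the script's actual code: the function name make_session, the retry count, the backoff factor, the status codes, and the User-Agent string are all assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries: int = 3, backoff: float = 1.0) -> requests.Session:
    """Build a requests session that retries transient failures (hypothetical helper)."""
    retry = Retry(
        total=total_retries,            # give up after this many retries
        backoff_factor=backoff,         # wait longer between successive attempts
        status_forcelist=(429, 500, 502, 503, 504),  # assumed "transient" status codes
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Placeholder User-Agent; replace with one that identifies your project.
    session.headers.update({"User-Agent": "telugu-corpus-scraper-example/0.1"})
    return session
```

Every request made through the returned session then shares the same retry policy, so the scraping code does not need its own retry loops.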

Requirements: Python 3.x

Libraries: requests (third-party; install with pip install requests), re (Python standard library)

Usage: Adjust title_limit in the script to specify the number of discussion titles to fetch. Run the script to save the extracted discussions to telugu_wikipedia_user_comments.txt.
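One way title_limit could interact with pagination is sketched below. It assumes the script lists talk pages through the MediaWiki API (list=allpages, namespace 1) and follows the API's continue tokens; the merge request does not show this, so the function name iter_talk_titles, the batch_size parameter, and the choice of namespace are assumptions.

```python
from typing import Iterator

API_URL = "https://te.wikipedia.org/w/api.php"


def iter_talk_titles(session, title_limit: int, batch_size: int = 50) -> Iterator[str]:
    """Yield at most title_limit talk-page titles, following API continuation.

    `session` is any object with a requests-style get(url, params=..., timeout=...).
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": 1,  # namespace 1 = article talk pages (assumed target)
        "aplimit": min(batch_size, title_limit),
    }
    fetched = 0
    while fetched < title_limit:
        response = session.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        data = response.json()

        for page in data.get("query", {}).get("allpages", []):
            yield page["title"]
            fetched += 1
            if fetched >= title_limit:
                return

        # No "continue" block means the API has no further pages to return.
        if "continue" not in data:
            return
        params = {**params, **data["continue"]}
```

Passing the session in as an argument keeps the paging logic independent of how requests are sent, so it can be exercised with a stub object that returns canned responses.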

Example Command: python scraper.py

Output: The script writes the cleaned, user-contributed Telugu text from the Wikipedia discussions to telugu_wikipedia_user_comments.txt.
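The cleaning step that produces this output might look something like the sketch below. The regular expressions are assumptions, not the script's actual patterns: they treat a signature timestamp as text shaped like "10:42, 5 మార్చి 2021 (UTC)" and keep only characters in the Telugu Unicode block (U+0C00 to U+0C7F) plus whitespace and basic punctuation. The name clean_telugu is hypothetical.

```python
import re

# Assumed shape of talk-page signature timestamps, e.g. "10:42, 5 మార్చి 2021 (UTC)".
TIMESTAMP_RE = re.compile(r"\d{1,2}:\d{2},\s*\d{1,2}\s+\S+\s+\d{4}\s*\(UTC\)")
# Anything that is not Telugu (U+0C00-U+0C7F), whitespace, or basic punctuation.
NON_TELUGU_RE = re.compile(r"[^\u0C00-\u0C7F\s.,?!]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_telugu(raw: str) -> str:
    """Strip timestamps and non-Telugu characters, then collapse whitespace."""
    text = TIMESTAMP_RE.sub(" ", raw)      # remove timestamps first, while digits are intact
    text = NON_TELUGU_RE.sub(" ", text)    # drop Latin text, wiki markup, symbols
    text = WHITESPACE_RE.sub(" ", text)    # collapse the gaps left behind
    return text.strip()


if __name__ == "__main__":
    sample = "ఇది మంచి వ్యాసం. --Ravi 10:42, 5 మార్చి 2021 (UTC)"
    print(clean_telugu(sample))  # prints: ఇది మంచి వ్యాసం.
```

Removing timestamps before the character filter matters here: the month name in a timestamp is itself Telugu text and would otherwise survive the filter as a stray word.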

Edited Nov 22, 2024 by shravyavenishetty
