Write a program to clean Telugu sentences collected from the web
We have a large amount of text taken from Wikipedia. We now need to clean that text and extract 'proper' Telugu sentences that comply with our constraints.
Two things have to be produced: the constraints that define a proper Telugu sentence, and a program that extracts the sentences satisfying them.
Regular expressions and a generic design are appreciated, so that the program remains effective even if the source changes in the future.
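As a starting point, here is a minimal sketch of how such an extractor might look in Python. The constraints it encodes are assumptions made for illustration only, not the final rules: a sentence must end in '.', '!' or '?', must contain only characters from the Telugu Unicode block (U+0C00 to U+0C7F) plus digits, whitespace and basic punctuation, and must have at least MIN_WORDS Telugu words. The names extract_telugu_sentences, MARKUP, ALLOWED and MIN_WORDS are hypothetical.

```python
import re

# Telugu Unicode block: U+0C00..U+0C7F.
TELUGU_CHAR = re.compile(r"[\u0C00-\u0C7F]")
# Assumed constraint: only Telugu characters, whitespace, digits and
# basic punctuation may appear in a sentence.
ALLOWED = re.compile(r"[\u0C00-\u0C7F\s0-9.,;:!?()\-]+")
# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumed source noise: citation markers such as [12] and HTML-like tags.
MARKUP = re.compile(r"\[\d+\]|<[^>]+>")
# Assumed constraint: minimum number of Telugu words per sentence.
MIN_WORDS = 3


def extract_telugu_sentences(text):
    """Return the sentences in text that satisfy the assumed constraints."""
    text = MARKUP.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    sentences = []
    for candidate in SENTENCE_END.split(text):
        candidate = candidate.strip()
        # Reject fragments that do not end in sentence punctuation.
        if not candidate or candidate[-1] not in ".!?":
            continue
        # Reject anything containing characters outside the allowed set,
        # for example Latin letters.
        if not ALLOWED.fullmatch(candidate):
            continue
        # Count only words that contain at least one Telugu character.
        words = [w for w in candidate.split() if TELUGU_CHAR.search(w)]
        if len(words) < MIN_WORDS:
            continue
        sentences.append(candidate)
    return sentences
```

Because the source-specific noise lives in the MARKUP pattern and the sentence rules live in ALLOWED and MIN_WORDS, a change of source should mostly mean adjusting those constants rather than the extraction loop.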
The time complexity of the program is not an issue, because the program will be run over the dataset only once. Accuracy in identifying the sentences is more important.
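One possible way to make "accuracy" measurable is to compare the program's output with a small hand-labelled sample and report precision and recall. The sketch below assumes such a sample exists; the function name precision_recall and both of its parameters are hypothetical.

```python
def precision_recall(predicted, expected):
    """Compare extracted sentences with a hand-labelled reference list.

    precision: share of extracted sentences that are in the reference.
    recall: share of reference sentences that were extracted.
    """
    predicted_set = set(predicted)
    expected_set = set(expected)
    true_positives = len(predicted_set & expected_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    return precision, recall
```

Low precision would suggest the constraints are too loose, while low recall would suggest they reject valid sentences.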