Commit 8718cbba authored by salecha sravan kumar jain
parent 372aea7f
# Character Recognition for Telugu Language
OCR (optical character recognition) is the recognition of printed or written text characters by a computer. This involves scanning the text character by character, analyzing the scanned image, and then translating each character image into character codes, such as ASCII, commonly used in data processing.
......@@ -9,10 +8,7 @@ In OCR processing, the scanned-in image or bitmap is analyzed for light and dark
OCR is being used by libraries to digitize and preserve their holdings. OCR is also used to process checks and credit card slips and sort the mail. Billions of magazines and letters are sorted every day by OCR machines, considerably speeding up mail delivery.
more about OCR-->
Applications include:
Data entry for business documents, e.g. checks, passports, invoices, bank statements and receipts
......@@ -26,31 +22,14 @@ The Applications are:
Converting handwriting in real time to control a computer (pen computing)
Defeating CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. The purpose can also be to test the robustness of CAPTCHA anti-bot systems.
Assistive technology for blind and visually impaired users
more about Applications:
An optical character recognition (OCR) engine:
Tesseract is an OCR engine with support for Unicode and the ability to recognize more than 100 languages out of the box. It can be trained to recognize other languages.
Tesseract is used for text detection on mobile devices, in video, and in Gmail image spam detection.
more->>
Tesseract was developed as proprietary software by Hewlett Packard Labs. In 2005, HP open sourced it in collaboration with the University of Nevada, Las Vegas. Since 2006 it has been actively developed by Google and many open source contributors.
Tesseract matured with version 3.x, when it started supporting many image formats and gradually added a large number of scripts (languages). Tesseract 3.x is based on traditional computer vision algorithms. In the past few years, deep learning based methods have surpassed traditional machine learning techniques by a huge margin in accuracy in many areas of computer vision. Handwriting recognition is one of the prominent examples. So it was only a matter of time before Tesseract, too, had a deep learning based recognition engine.
We learned OpenCV from
and Tesseract from
......@@ -58,13 +37,53 @@
How we started and got the idea:
At first we planned to implement a bus number plate recognition system; later our idea shifted to Telugu OCR.
We began by working out the requirements and prerequisites for completing the project.
First we looked at the current implementation of the bus tracking system.
We went through the code, got a basic idea of how it works, and learned additional concepts from various sites like
and we also took a course on OpenCV.
We went through various blogs and videos on the net.
We downloaded the Tesseract OCR library from
We took a reference for how to implement it from
We took the sample code from
and in the link below we also found a GitHub link.
Later we downloaded many images from the net and tested them.
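The image testing step could be scripted roughly as below. This is only a sketch, not the code we actually ran: it assumes the `tesseract` command-line tool is on the PATH and that the Telugu language data (`tel`) is installed. The function names and the list of file extensions are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="tel"):
    """Return the argument list for one Tesseract CLI call.

    "stdout" makes Tesseract print the text instead of writing a file,
    and "-l tel" selects the Telugu language data (assumed installed).
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="tel"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout.decode("utf-8")


def ocr_folder(folder, lang="tel", extensions=(".png", ".jpg", ".jpeg")):
    """OCR every image in a folder and return {file name: text}."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in extensions:
            results[path.name] = ocr_image(path, lang)
    return results
```

Calling `ocr_folder("samples")` on a hypothetical `samples` directory would return one text string per downloaded image, which makes it easier to compare results across many test images.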
Bugs we found:
For guninthalu, otthulu and a few letters, the library was not working correctly. Instead of those characters we were getting dotted circles.
For handwritten images, Tesseract is not at all up to the mark.
We explored all the Telugu library files and tried to figure out how the background code works, how the training data is organized,
and how the character code values are linked together.
We modified the character code values, added a few more words to the library, and eliminated the dotted circles.
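A dotted circle usually appears when a combining mark (such as a vowel sign used in guninthalu, or the virama used in otthulu) has no base letter in front of it. The sketch below is one possible way to check OCR output for this problem. It is not the fix we applied to the library files, and the function names are hypothetical. It relies only on Python's standard `unicodedata` module and on the Unicode Telugu block (U+0C00 to U+0C7F).

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # placeholder glyph drawn for a mark with no base


def is_telugu(ch):
    """True if ch lies in the Unicode Telugu block (U+0C00..U+0C7F)."""
    return "\u0c00" <= ch <= "\u0c7f"


def find_dotted_circles(text):
    """Return the index of every literal dotted-circle character."""
    return [i for i, ch in enumerate(text) if ch == DOTTED_CIRCLE]


def find_orphan_marks(text):
    """Return indices of Telugu combining marks with no Telugu character before them.

    Such marks have nothing to attach to, so a text renderer typically
    draws them on a dotted circle.
    """
    orphans = []
    for i, ch in enumerate(text):
        # Unicode categories Mn/Mc/Me all start with "M" (combining marks).
        if is_telugu(ch) and unicodedata.category(ch).startswith("M"):
            prev = text[i - 1] if i > 0 else ""
            if not (prev and is_telugu(prev)):
                orphans.append(i)
    return orphans
```

Running both checks on the recognized text before and after changing the library files gives a quick count of how many dotted circles remain.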
Installation was our first challenge.
Our system had no language dependency other than English, so no other language was recognizable. We explored many ways, came up with a solution, and installed ibus and added the Telugu dependency.
Next came figuring out the reason for the dotted circles and mapping all the built-in files in order to locate the error.
Changing multiple values and adding and removing data was the biggest challenge.
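One related check, separate from the ibus setup described above, is whether Tesseract itself has the Telugu language data. The sketch below assumes the `tesseract` command-line tool is on the PATH and uses its `--list-langs` option. The helper names are hypothetical, and the header line that gets skipped is assumed to start with "List of available languages".

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line (assumed wording).
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def telugu_data_installed():
    """Return True if the 'tel' language data is listed by Tesseract."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
    )
    # Some versions print the list on stderr, so both streams are checked.
    return "tel" in parse_list_langs(result.stdout + "\n" + result.stderr)
```

If `telugu_data_installed()` returns False, the Telugu traineddata file is probably missing from the tessdata directory.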
Future milestone:
To recognize handwritten images and convert PDF content using Tesseract.
Sample result of an image for reference:
ఎవరు ఎమౌంఎవగు నువ్వంటే
' నీఖంయింపొత్రలు అంతే
నీవని ఒంచేబ్రతశ్రేదంటే