Introduction:
OCR (optical character recognition) is the recognition of printed or written text characters by a computer. This involves scanning the text character by character, analyzing the scanned-in image, and then translating the character image into character codes, such as ASCII, commonly used in data processing.


In OCR processing, the scanned-in image or bitmap is analyzed for light and dark areas in order to identify each alphabetic letter or numeric digit. When a character is recognized, it is converted into an ASCII code. Special circuit boards and computer chips designed expressly for OCR are used to speed up the recognition process. 
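
The light/dark analysis described above can be illustrated with a very small sketch. This is not how Tesseract works internally; it only shows the idea of separating dark ink from light paper. The threshold of 128 and the 5x5 sample patch are assumptions made for illustration.

```python
# Illustrative sketch only: classify each pixel as dark (ink) or light (paper).
# The threshold value 128 and the tiny sample image are assumptions.


def binarize(gray_rows, threshold=128):
    """Return rows of 1 (dark) / 0 (light) for a grayscale image given as lists of 0-255 values."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]


def render(binary_rows):
    """Draw the binary image as text so the dark strokes are visible."""
    return "\n".join("".join("#" if bit else "." for bit in row) for row in binary_rows)


# A hypothetical 5x5 grayscale patch containing one vertical stroke.
sample = [
    [250, 245, 30, 240, 255],
    [248, 250, 25, 251, 249],
    [252, 247, 20, 246, 250],
    [249, 251, 35, 248, 252],
    [251, 249, 28, 250, 247],
]

binary = binarize(sample)
print(render(binary))
```

A real pipeline would usually let an image library choose the threshold from the image itself instead of using a fixed value.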


OCR is being used by libraries to digitize and preserve their holdings. OCR is also used to process checks and credit card slips and sort the mail. Billions of magazines and letters are sorted every day by OCR machines, considerably speeding up mail delivery.

More about OCR: https://searchcontentmanagement.techtarget.com/definition/OCR-optical-character-recognition
The Applications are:
---------------------
    Data entry for business documents, e.g. checks, passports, invoices, bank statements and receipts
    Automatic number plate recognition
    In airports, for passport recognition and information extraction
    Automatic extraction of key information from insurance documents
    Extracting business card information into a contact list
    Making textual versions of printed documents more quickly, e.g. book scanning for Project Gutenberg
    Making electronic images of printed documents searchable, e.g. Google Books
    Converting handwriting in real time to control a computer (pen computing)
    Defeating CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. The purpose can also be to test the robustness of CAPTCHA anti-bot systems.
    Assistive technology for blind and visually impaired users
________________________________________________________________		 
Installation:
TESSERACT: https://www.linux.com/blog/using-tesseract-ubuntu
OPENCV: https://pypi.org/project/opencv-python/
        https://www.learnopencv.com/install-opencv3-on-ubuntu/
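
After installation, it can help to confirm that Tesseract is on the PATH and that Telugu language data is present. The sketch below is one possible check, not part of this project's code. It assumes the `tesseract` command-line tool is installed and relies on its `--list-langs` option; `tel` is Tesseract's language code for Telugu. The helper names `parse_langs` and `telugu_available` are hypothetical. On Ubuntu the Telugu data is typically provided by a package such as `tesseract-ocr-tel`, but that depends on your distribution.

```python
import shutil
import subprocess


def parse_langs(list_langs_output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # The first line is a header such as "List of available languages (3):".
    return [line for line in lines if line and not line.lower().startswith("list of")]


def telugu_available():
    """Return True if the tesseract binary is on PATH and reports the `tel` language."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # The list may be printed to stdout or stderr depending on the version,
    # so both streams are inspected.
    return "tel" in parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("Telugu language data found:", telugu_available())
```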
___________________________________
Reference:
We learned OpenCV from edx.org and https://www.learnopencv.com/install-opencv3-on-ubuntu/
and Tesseract from:
https://www.youtube.com/watch?v=QhJiOCwz-_I
https://www.youtube.com/watch?v=jWh0FaRR
https://www.youtube.com/watch?v=6_aqncTWgkk
https://www.learnopencv.com/deep-learning-based-text-recognition-ocr-using-tesseract
https://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/
______________________________
How we started and got the idea:
At first we planned to implement a bus number plate recognition system; later our idea shifted to Telugu OCR.
We began by understanding all the requirements and prerequisites for completing the project.
We first looked at the current implementation of the bus tracking system.
We went through the code, got a basic idea, and learned additional concepts from various sites such as:
https://pypi.org/project/opencv-python/
https://www.learnopencv.com/install-opencv3-on-ubuntu/
_______________________
Implementation:
We also completed a course on OpenCV at edx.org.
We went through various blogs and videos on the net.
We downloaded the Tesseract OCR library from
https://github.com/tesseract-ocr/tesseract
We took reference on how to implement it from
https://www.youtube.com/watch?v=SSdQyvl5MUk
We took the sample code from
https://www.youtube.com/watch?v=83vFL6d57OI
and through this link we also found a GitHub link.
Later we downloaded many images from the net and tested the OCR on them.
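
For readers who want to try the same kind of test, here is a minimal sketch that runs Tesseract on one image with the Telugu language data. It is not the sample code referenced above; it calls the `tesseract` command-line tool directly, using `stdout` as the output base and `-l tel` to select Telugu. The function names and the file name `sample_telugu.png` are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, lang="tel"):
    """Build the CLI call: read image_path and write the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="tel"):
    """Run Tesseract on one image and return the recognized text.

    Raises subprocess.CalledProcessError if Tesseract exits with an error,
    and FileNotFoundError if the tesseract binary is not installed.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Example with a hypothetical file name:
# print(ocr_image("sample_telugu.png"))
```

Preprocessing the image with OpenCV (for example converting it to grayscale) before calling Tesseract is a common extra step, but it is left out here to keep the sketch small.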
___________________________
Bugs we Found:
For guninthalu (consonant-vowel combinations), otthulu (conjunct consonant signs) and a few other letters, the library was not functioning correctly. Instead of those characters we were getting dotted circles.
For handwritten images, Tesseract is not at all up to the mark.
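
A dotted circle usually means a combining mark (such as a vowel sign) has no base letter before it. The sketch below is one way to locate such spots in OCR output. It is an illustration under stated assumptions, not the check used in this project: it treats U+0C00 to U+0C7F as the Telugu block and flags any Telugu combining mark that does not follow another Telugu character. The function names are hypothetical.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # placeholder glyph drawn for a combining mark with no base


def is_telugu(ch):
    """True if the character lies in the Telugu Unicode block (U+0C00 to U+0C7F)."""
    return "\u0c00" <= ch <= "\u0c7f"


def dotted_circle_positions(text):
    """Indexes where the text literally contains the dotted-circle character."""
    return [i for i, ch in enumerate(text) if ch == DOTTED_CIRCLE]


def orphan_mark_positions(text):
    """Indexes of Telugu combining marks that do not follow a Telugu character."""
    positions = []
    for i, ch in enumerate(text):
        if is_telugu(ch) and unicodedata.category(ch).startswith("M"):
            if i == 0 or not is_telugu(text[i - 1]):
                positions.append(i)
    return positions


good = "\u0c15\u0c3e"  # KA followed by vowel sign AA: a well-formed syllable
bad = " \u0c3e"        # vowel sign AA after a space: renderers draw a dotted circle
print(orphan_mark_positions(good), orphan_mark_positions(bad))
```

Running both checks over the OCR output gives a rough count of how many characters were affected before and after a fix.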
_____________
Understanding, Improvements:
Here we explored all the Telugu libraries and tried to figure out how the background code works, how the training data is organized,
and how the character codes are linked together.
We modified the character codes, added a few more words to the library, and eliminated the dotted circles.
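
Telugu characters are stored as Unicode code points, and a single visible syllable is often a sequence of several of them. The sketch below only illustrates that linkage; it does not reproduce the changes made in this project. It lists the code points of the conjunct "kka", which is KA + VIRAMA + KA. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# The conjunct "kka" is written as three code points: KA, VIRAMA, KA.
conjunct = "\u0c15\u0c4d\u0c15"

for code, name in describe(conjunct):
    print(code, name)
```

Listing OCR output this way makes it easier to see which code point in a sequence is missing or misplaced when a syllable renders incorrectly.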
_________________
Challenges:
First, installation was a challenge for us.
Our system had language dependencies only for English, so no other language was recognizable. We explored many approaches and eventually solved this by installing ibus and adding the Telugu dependency.
Later, we had to figure out the reason for the dotted circles and map all the built-in files in order to locate the error.
Changing multiple values and adding and removing data was the biggest challenge.
_____________
Future Milestone:
To recognize handwritten images and convert PDF content using Tesseract.
_________________________
Result:
Sample result for an image, for reference:
ఎవరు ఎమౌంఎవగు నువ్వంటే
' నీఖంయింపొత్రలు అంతే
నీవని ఒంచేబ్రతశ్రేదంటే
తిరస్లెకదిలేచిత్రమేలంత.