Research Article, July 2017

International Journal of Advanced Research in Computer Science and Software Engineering

ISSN: 2277-128X (Volume 7, Issue 7)

Real Time Implementation of Optical Character Recognition Based TTS System Using Raspberry Pi

Sagar G.K*, Department of E&C, Sri Jayachamarajendra College of Engineering, Mysuru, Karnataka, India

Shreekanth T, Assistant Professor, Department of E&C, Sri Jayachamarajendra College of Engineering, Mysuru, Karnataka, India

DOI: 10.23956/ijarcsse/V7I7/0117

Abstract: Text-to-speech (TTS) conversion technology is proposed to help blind people and people with poor vision. According to a survey by the World Health Organization (WHO), there are about 286 million blind people in the world, and about 91% of them reside in developing countries. There is therefore a need for a portable, affordable TTS converter. To help the blind community, a smart reader is proposed in this paper. It includes a webcam to capture the input text page, which is then processed by a TTS unit installed on a Raspberry Pi; the output is amplified and played out on a speaker.

Keywords: Raspberry Pi, Tesseract OCR engine, Python Speech API, camera, speaker.

I. INTRODUCTION

This paper introduces an innovative, efficient, and cost-effective real-time technique that enables users to hear the contents of text images instead of reading them. It combines Optical Character Recognition (OCR) and a Text-to-Speech (TTS) synthesizer on a Raspberry Pi. Such a system helps visually impaired people interact with computers effectively through a vocal interface.

Text extraction from color images is a challenging task in computer vision. Text-to-speech conversion is a method that scans and reads the English letters and numbers present in an image using OCR and converts them into voice. This paper describes the design, implementation, and experimental results of the device, which consists of two modules: an image processing module and a voice processing module. The device was developed on a Raspberry Pi v2 with a 900 MHz processor.

Optical Character Recognition (OCR) is a process that converts a scanned or printed text image, or handwritten text, into editable text for further processing. This paper presents a robust approach for text extraction and its conversion into speech, tested on the Raspberry Pi platform.

The following steps are implemented for character recognition (a minimal sketch follows the list):

• First, the character image is acquired and read.
• The second step is pre-processing: the color image is converted to grayscale, and the gray image is then converted to a binary image by a threshold operation.
• The character is extracted and resized; letters are resized to the template size.
• Templates are loaded so that the letters can be matched against them.
• text.txt is opened as a file for writing.
• The recognized letters are concatenated and written to the text file.
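The paper gives no source for these steps; the following is a minimal Python sketch of them, assuming OpenCV is available. The file name page.jpg, the templates dictionary, and the segment_characters helper are hypothetical placeholders, not artifacts of the paper.

```python
import cv2

# Steps 1-2: read the image, convert to grayscale, then binarize by thresholding.
image = cv2.imread("page.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Steps 3-4: a character patch is resized to the template size and matched
# against every template; the best-scoring template names the character.
def classify(char_img, templates):
    """templates: dict mapping a letter to its binary template image."""
    best_letter, best_score = None, -1.0
    for letter, tmpl in templates.items():
        patch = cv2.resize(char_img, (tmpl.shape[1], tmpl.shape[0]))
        score = cv2.matchTemplate(patch, tmpl, cv2.TM_CCOEFF_NORMED)[0][0]
        if score > best_score:
            best_letter, best_score = letter, score
    return best_letter

# Steps 5-6: open text.txt for writing and concatenate the recognized letters.
# segment_characters (extraction of individual character boxes) is assumed.
# with open("text.txt", "w") as f:
#     f.write("".join(classify(c, templates) for c in segment_characters(binary)))
```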

Feature extraction and classification are the heart of OCR. In the feature extraction phase, the character image is mapped to a higher level by extracting special characteristics and patterns of the image. The classifier is then trained with the extracted features for the classification task. The classification stage identifies each input character image by considering the detected features; template matching and neural networks are used as classifiers. The character image is converted into text, and the text into speech. The algorithm is as follows (a minimal sketch follows the list):

• First, check whether Win32 SAPI (Speech Application Programming Interface) is available on the computer. If it is not, an error is generated and the Win32 SAPI library must be loaded on the computer.
• Get the voice object from Win32 SAPI.
• Compare the input string with the Win32 SAPI string.
• Extract the voice by first selecting one of the voices available in the library.
• Choose the pace of the voice.
• Initialize the wave player to convert the text into speech.
• Finally, obtain the speech for the given image.
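Since the paper lists these SAPI steps without code, here is a minimal sketch using the pywin32 package, which exposes the Win32 SAPI COM objects from Python. It runs only on Windows, and the spoken string is a placeholder.

```python
import win32com.client

try:
    # Get the voice object from Win32 SAPI; this fails if SAPI is unavailable.
    speaker = win32com.client.Dispatch("SAPI.SpVoice")
except Exception as err:
    raise SystemExit("Win32 SAPI is not available: %s" % err)

# Select a voice from those available in the library.
voices = speaker.GetVoices()
if voices.Count > 0:
    speaker.Voice = voices.Item(0)

# Choose the pace of the voice (-10 is slowest, 10 is fastest).
speaker.Rate = 0

# Speak the text recognized from the given image (placeholder string).
speaker.Speak("text recognized from the given image")
```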


Previously analysed speech is stored in small units in the database, for concatenation in the appropriate sequence at runtime. TTS systems first perform text processing, including "letter-to-sound" conversion, to produce the phonetic transcription.

A database containing words or sentences gives higher-quality output. Furthermore, a synthesizer can include human voice characteristics to generate a synthetic but fairly natural voice. A speech synthesizer is evaluated on the similarity of its output speech to natural speech.

Present-day text-to-speech (TTS) systems focus more on the emotional content of speech, i.e., they aim for a sensible, natural voice that closely matches a human voice. This places demands on the input text in order to obtain high-quality output. Examples of speech synthesis applications are voice-enabled e-mail and unified messaging. The first step in speech synthesis is for the user to speak a word into a microphone; the speech is then digitized using an analog-to-digital (A/D) converter and stored in memory in the form of a database.

The text processing here is a form of signal processing where the input is an image, such as a photograph or video frame; the output of image processing may be an image, a video frame, or a set of characteristics or parameters correlated with the image. Digital image acquisition typically suffers from undesirable camera shake due to unbalanced random camera motion, so image improvement algorithms are required to remove these unwanted effects.

These image processing concepts have been implemented on the Raspberry Pi in MAV (micro aerial vehicle) applications. The Raspy is initially connected to the internet through a VLAN (virtual local area network), and the software is installed using command lines. The following steps are followed:

1. The first step is to download the installation script.
2. The second step is to convert it to executable form.
3. The last step starts the script, which does the rest of the installation work. The device setup is shown in Figure 3.

The webcam is manually focused towards the text. Before a picture is taken, a delay of around 6 seconds is provided, which allows the webcam to be refocused if it has accidentally been defocused. After the delay, the picture is taken and processed by the Raspy, and the spoken words of the text are heard through an earphone or speaker plugged into the Raspy's audio jack. A minimal end-to-end sketch of this flow is given below.
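This sketch of the capture-OCR-speak flow just described assumes a USB webcam, the opencv-python and pytesseract packages, and the espeak command-line tool installed on the Pi; the file names are placeholders.

```python
import subprocess
import time

import cv2
import pytesseract

camera = cv2.VideoCapture(0)       # open the webcam
time.sleep(6)                      # ~6 s delay to (re)focus, as described above
ok, frame = camera.read()          # take the picture
camera.release()
if not ok:
    raise SystemExit("could not read a frame from the webcam")

cv2.imwrite("capture.jpg", frame)  # store the captured image

# OCR: extract the text from the captured image (Tesseract engine).
text = pytesseract.image_to_string(frame)
with open("text.txt", "w") as f:   # keep the recognized text on disk
    f.write(text)

# TTS: speak the recognized text through the Pi's audio output via espeak.
subprocess.run(["espeak", text])
```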

II. LITERATURE SURVEY

Researchers have proposed a number of solutions for text-to-speech conversion. Hay Mar Htun et al. proposed a text-to-speech conversion using different speech synthesis techniques. Their TTS system consists of two phases. The first is text analysis, where the input text is transcribed into a phonetic or some other linguistic representation; the second is the generation of speech waveforms. Text-to-phoneme conversion relies on a dictionary-based approach to get the exact phonetic transcription. Speech synthesis methods such as domain-specific synthesis, phoneme-based synthesis, and unit selection synthesis are used for concatenating speech: domain-specific synthesis is applied for numerical text, while in phoneme-based synthesis the input text is treated word by word to produce sound. Hand-labelling is applied because the database is small, but it is time-consuming and introduces small errors [1].

Chaw Su Thu et al. proposed an implementation of text-to-speech conversion. The paper considers a computer-based system that can read any text aloud, whether it was entered directly by an operator or scanned and submitted to an Optical Character Recognition (OCR) system. In this work, the OCR system recognizes the capital English characters A to Z and the numbers 0 to 9, one character at a time [2]. The recognized characters are saved as text in a notepad file, and a MATLAB text-to-speech conversion system takes the text obtained from the image and speaks it. However, it does not support the lowercase characters a to z and cannot handle whole words or sentences.

Bindu Philip et al. proposed a human-machine interface, a smart OCR for the visually challenged, that enables automated conversion of scanned book images directly to the Braille format. Optical character recognition is primarily used for this purpose: OCR converts printed text into editable text, and one of its important applications is Braille translation. Converting English to Braille is hard, however, and requires a Braille printer, which leads to high cost [3].

Roy Shilkrot et al. proposed FingerReader, a wearable device to support text reading; the work moves towards a wearable device that could overcome some issues that current technologies pose to visually impaired (VI) users. The contribution is twofold:

1. Results of focus group sessions with VI users that uncovered salient problems with current text reading solutions and the users' ideas of future assistive devices and their capabilities; these results serve as grounds for the design choices.


V. Ajantha Devi et al. proposed embedded optical character recognition of Tamil text images on the Raspberry Pi. The system provides a camera-based assistive device that people can use to read Tamil text documents, with the image capturing technique implemented in an embedded system based on the Raspberry Pi board. Challenges faced during the recognition process are due to the curves in the characters, the number of strokes and holes, and sliding characters [5].

Catherine A. Todd et al. proposed an audio-haptic tool for visually impaired web users, presenting the preliminary design and implementation of haptic and audio representations of web elements in a browser. Future work aims at creating a complete 3D web browser with an interactive GUI (graphical user interface) that supports features of existing browsers such as History, Favourites, Help, and image data, as well as more complex force models of image-based and other multimedia content, for example using edge detection [6].

Vikram Shirol et al. proposed DRASHTI, an Android reading aid. In this system an image is converted to text using OCR; edge detection and segmentation algorithms detect the text regions in the image and are applied to an image set with different text sizes, font styles, and text languages. TTS is then used to convert the text to voice, which is made audible. However, the text is displayed under the text field in the third interface of the application, which needs a lot of memory to store each captured image [7].

Jyoti Singh et al. proposed part-of-speech tagging of Marathi text using the trigram method. The fundamental version of POS tagging is the identification of words as nouns, verbs, adjectives, etc. POS tagging can be regarded as a simplified form of morphological analysis: it only deals with assigning an appropriate POS tag to a word, while morphological analysis deals with finding the internal structure of the word. The morphological complexity of Marathi makes this somewhat difficult [8].

III. FEATURE EXTRACTION

There are two approaches to feature extraction: one is to extract all the necessary characteristics of the symbols, which is very difficult; the other is to extract only specific features that characterize the symbols while excluding insignificant ones.

Various techniques for feature extraction are:

1. Feature-based techniques.
2. Correlation and template matching techniques.
3. OCR-based speech conversion and text-to-speech technology.
4. Series expansions and transformations.
5. Structural analysis.

With this application, speech can be generated on speakers by taking text input from various sources such as a web page, document, or e-book. Rules for each alphabet, such as its dictionary usage, grammar, and pronunciation, are set in the software.

Because this system converts text to speech, it helps blind people and people with low eyesight to read books and documents. It also helps people who cannot speak: they can write text, which is then converted to speech.

IV. TEXT TO SPEECH SYNTHESIS SYSTEM ARCHITECTURE

The text-to-speech synthesizer produces speech from the text by means of grapheme-to-phoneme conversion of the sentences in the text content.

Figure 1 System Architecture Design

Text is captured by the camera and given as input to the system.

4.1 Text Analysis & Detection

Text analysis is part of pre-processing. It analyzes the input text and organizes it into a manageable list of words, then transforms them into full text. Text detection localizes the text areas in printed documents.

4.2 Dictionary/Letter-to-Sound Conversion


In this step, a normalization process converts all letters to lowercase or uppercase and removes punctuation, accent marks, stop words or "too common" words, and other diacritics from the letters.

4.3 Speech Database

This module performs formant synthesis and thus requires a database of speech samples. To speak out the text, it uses the voice characteristics of a person.

4.4 Selection of Units

First, the input sentence is split into words, and a part-of-speech (POS) tag is assigned to each word using the bigram method. A bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words (a minimal sketch follows below). Then each word is converted to a phoneme transcription using a dictionary-based approach.

These recorded speech samples are then segmented phonetically in the discrete time domain, and the segmented units are stored according to their sample values. There are two approaches to speech segmentation: hand labeling and automatic speech segmentation. In this TTS system, hand labeling is applied because the database is small.
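As an illustration of the front end described above, here is a minimal Python sketch of splitting a sentence into words, forming the bigrams a bigram-based tagger works on, and doing a dictionary-based phoneme lookup. The tiny phoneme dictionary is a hypothetical stand-in for a real lexicon.

```python
sentence = "the quick brown fox"
words = sentence.split()                      # split the sentence into words

# Bigrams: sequences of two adjacent words from the token string.
bigrams = list(zip(words, words[1:]))
print(bigrams)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

# Dictionary-based phoneme transcription: look each word up in a lexicon.
phoneme_dict = {"the": "DH AH", "quick": "K W IH K",
                "brown": "B R AW N", "fox": "F AA K S"}
transcription = [phoneme_dict.get(w, "<oov>") for w in words]
```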

4.5 Speech Generation Modules

The TTS engine, the software used to synthesize speech from text, converts the written text to a phonemic representation, and the phonemic representation is converted to speech that can be given as sound output. Here it is used to convert the text file into an audio file with the .flac extension. FLAC stands for Free Lossless Audio Codec; it is an audio coding format for lossless compression of digital audio. At the end of this phase an audio file is created, as sketched below.
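A minimal sketch of this phase, assuming the espeak and flac command-line tools are installed on the Pi. espeak itself writes WAV, so a separate flac encoding step is assumed here to produce the .flac file mentioned above; the file names are placeholders.

```python
import subprocess

with open("text.txt") as f:
    text = f.read()

# Synthesize the text to a WAV file, then losslessly encode it to FLAC.
subprocess.run(["espeak", "-w", "speech.wav", text], check=True)
subprocess.run(["flac", "-f", "-o", "speech.flac", "speech.wav"], check=True)
```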

4.6 Output Speech

The output is heard by connecting speakers to the audio jack and enabling it in the Raspberry Pi configuration. The volume can be adjusted with alsamixer, and the translated speech output is then heard through the speakers. A sketch of this setup follows.
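A minimal sketch of the audio output setup, assuming the standard ALSA utilities on Raspbian. The mixer control name (PCM) and the numid=3 route control are common on classic Raspberry Pi firmware but are assumptions about this particular setup.

```python
import subprocess

# Route audio to the 3.5 mm jack (value 1 selects the analog jack on
# classic Raspberry Pi firmware).
subprocess.run(["amixer", "cset", "numid=3", "1"])
# Set the playback volume (the control is often named "PCM" on the Pi).
subprocess.run(["amixer", "set", "PCM", "80%"])
# Play the generated audio through the speakers.
subprocess.run(["aplay", "speech.wav"])
```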

Figure 2 System hardware design

The hardware includes the camera, the Raspberry Pi board, Ethernet, and a speaker. The input text image is captured by the camera and processed with image processing techniques, and the audio signal is played out on the speakers. Ethernet is used to access the e-dictionary website.

The Logitech camera module captures the image to be read, and the image file is stored in a folder. Python software is installed on the Raspberry Pi with a command for each component.

This software converts the image file to a text file by extracting the text from the image and storing it in a file with the .txt extension.

5.1 Processor

The first generation of Raspberry Pi used the Broadcom BCM2835 chip, which is quite similar to the SoC used in first-generation smartphones. The BCM2835 consists of a 700 MHz ARM1176JZF processor, random access memory, and a VideoCore IV graphics processing unit. It includes L1 and L2 caches of 16 KB and 128 KB respectively, and the GPU makes use of the L2 cache. Only the corners of the SoC chip are visible on the board because it is placed below the RAM chip. The Raspberry Pi 2 includes a 32-bit quad-core ARM Cortex-A7 processor in the Broadcom BCM2836 SoC (System on Chip), while the Raspberry Pi 3 uses the 1.2 GHz Broadcom BCM2837 SoC with a quad-core ARM Cortex-A53 processor and a shared L2 cache of 512 KB.

5.2 Powering


The board should be powered from a USB port that does not "backfeed", because the RPi (Raspberry Pi) is vulnerable to voltage fluctuations. Use of unsuitable accessories can destroy the device.

5.3 Accessing WLAN

The LAN technology used must support all the devices. For mobile APs with the RPi, Wi-Fi was used as the WLAN technology; using another WLAN technology, support can be expanded without any changes to downstream components. According to 802.11, there are two operating modes for wireless stations:

1. Ad hoc
2. Infrastructure

Ad hoc mode has no central controlling station, so it does not support extra services like DNS or DHCP. Its purpose is to connect multiple stations temporarily; transmitted frames flow directly from the transmitting station to the receiving station. This mode is not used in the mobile AP. In infrastructure mode there is a central station, which can also be used as a gateway. This mode is used in the management of wireless networks.

The channels are controlled by the central station, which also authenticates users seeking access to the network. It incorporates services and features such as DHCP, routing, DNS, NAT, and a firewall on multiple OSI layers, and it allows advanced use of the Wi-Fi network such as roaming. The standardized operating modes help devices from different vendors communicate irrespective of their type. Hence it is in principle possible to build a full-featured access point from an ordinary Wi-Fi adapter, which should then be able to connect other stations in a similar way to hardware access points. In practice this is not possible with the majority of Wi-Fi adapters; information about wireless network interface controllers helps to understand the cause of this and to better understand the technology.

5.4 Different Roles in Infrastructure Mode

The difficulty in infrastructure mode is that a station has to perform different functions based on its role, so the access-point station carries more burden than the client station. Instead of one complete solution, it is therefore sensible to have different implementations for specific features. Traditional USB Wi-Fi dongles used as client stations do not include all features; they have only the required ones. 802.11 network adapters have different operating modes, some of which are listed below:

1. Independent BSS mode (also called peer-to-peer mode)
2. Access point infrastructure mode
3. Managed mode
4. Monitor mode
5. Mesh
6. Wireless Distribution System

Of the modes listed above, a network interface can run in only one at any instant, and the mode must be supported by both software and hardware. The default mode is infrastructure mode, which is supported by all devices and their corresponding drivers. Device manufacturers do not prefer AP infrastructure mode because of its complexity: USB adapters that can run in both client and AP modes are infrequent, whereas WNICs based on PCI (Peripheral Component Interconnect) normally support both modes.

5.5 Platform Specifications

The Broadcom BCM2835 SoC is the core of the device, including 512 MB of main memory and a single-core 700 MHz ARM1176JZF-S processor. The "B" model is used in this project because the earlier versions provided only 128 or 256 MB of main memory. Besides Raspbian, there are other systems supported by third parties or the community. Raspbian provides broad support, and software is widely available since it is an enhanced version of Debian GNU/Linux.

Debian supports the ARM architecture, so most of its packages can be installed and run on Raspbian without any changes to the source code. The Pi has no built-in flash memory or hard disk; an SDHC card is the only storage, divided between the operating system and firmware on one side and data and traditional software on the other. High-speed USB 2.0 can be used to connect external devices, and the USB bus is internally equipped with one Fast Ethernet interface. An RCA or HDMI port is used for video output.

5.6 Overclocking

The first- and second-generation Raspberry Pi SoCs needed cooling only when they were overclocked, and when overclocked the Raspberry Pi 2 chips may heat up more than usual. A heat sink is one cooling method. The majority of Raspberry Pi SoCs can be overclocked up to 800 MHz, and a few even up to 1000 MHz. The command "sudo raspi-config" can be used on the Raspbian Linux distro to overclock on boot without voiding the warranty. If the chip reaches 85 degrees Celsius, the Pi automatically shuts the overclocking down, and the overclocking and overvoltage settings can be overruled. To protect against overheating, a properly sized heat sink is required for the chip.


The clock speed is reduced when the supply voltage is low or when running at high temperature, but the performance can be transiently increased if the chip has more work and its temperature is tolerable, with clock speeds of up to 1 GHz depending on the overclock settings and the individual board used.

VI. APPLICATIONS

1. It is useful for physically disabled people. People who cannot speak can quickly produce speech from text written with high-speed keyboards, and blind people can read books and documents with the help of this device.
2. It can be used in toys and books that speak; many talking toys have already arrived in the market.
3. Speech is often more powerful than written messages, so speech synthesizers can be used in measurement and control systems.
4. The text-to-speech system is helpful for man-machine communication.
5. Basic and applied research text-to-speech synthesizers have unique features that make them excellent laboratory tools for linguists.

Figure 3: Project Set up

VII. RESULTS AND DISCUSSION

Any English character or number can be read by this Pi. Here, numbers are taken as input and analysed: a printed image of numbers is the input. The Logitech camera module captures the image with a single command (a hedged example is sketched below), and the captured image is stored in .jpg format via a 15-pin ribbon cable. Figure 4 shows the image of text captured by the camera module.
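The single capture command is not named in the paper; fswebcam is a common choice for USB webcams on Raspbian, so this sketch assumes it is installed. The resolution and file name are placeholders.

```python
import subprocess

# Grab one frame from the webcam and store it as a .jpg file.
subprocess.run(["fswebcam", "-r", "1280x720", "--no-banner", "image1.jpg"],
               check=True)
```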

Figure 4 Image 1: Text captured and conversion of text to flac file by espeak


Figure 6 Image 3: Text captured and conversion of text to flac file by espeak

Figure 7 Image 4: Text captured and conversion of text to flac file by espeak

Figure 8 Image 5: Text captured and conversion of text to flac file by espeak

Figure 9 Image 6: Text captured and conversion of text to flac file by espeak

Table 1 Performance of proposed method

Image   Manual word count   System count   Accuracy (%)
1       5                   5              100
2       6                   6              100
3       11                  11             100
4       17                  16             94
5       24                  22             90.90
6       36                  35             97.2
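For Table 1, accuracy appears to be the system word count divided by the manual count (rows 1 to 4 and 6 match this; row 5 deviates slightly), so the formula below is an assumption rather than the authors' stated definition.

```python
# Recompute the accuracy column of Table 1 under the assumed formula
# accuracy = 100 * system_count / manual_count.
rows = [(5, 5), (6, 6), (11, 11), (17, 16), (24, 22), (36, 35)]
for image, (manual, system) in enumerate(rows, start=1):
    print(f"Image {image}: {100 * system / manual:.1f} %")
```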

Table 2 Performance Comparison

Method   Accuracy (%)   Runtime
[5]      94             8 seconds


The captured image is converted to text and stored in a .txt file; the text file is then converted to a .flac file, which is given as input for translation. The proposed method captures 36 to 38 words and converts them into speech within 10 to 20 seconds. Compared with [3] and [5], it takes less time to capture and process the data.

VIII. CONCLUSION AND FUTURE WORK

The text-to-speech system is implemented on the Raspberry Pi board. The simulation results were verified successfully, and various samples were used to test the output. The input image is processed efficiently by the algorithm used, and clear output is generated. The device was tested with different texts and verified; it is compact, affordable, and productive for blind people, and useful to society.

Future enhancements can be made to the image-to-text conversion for more accuracy and to the speech output to eliminate noise using various algorithms, and the application can be extended to capture images in different languages and to output the voice in one or more languages. Performance can be enhanced by using commercial Tesseract engines.

In conclusion, this application is an attempt to help blind and sighted people in reading. It cannot overcome the problem of blindness completely, but it can help at least to some extent.

REFERENCES

[1] Hay Mar Htun, Theingi Zin, and Hla Myo Tun, "Text to Speech Conversion Using Different Speech Synthesis," International Journal of Scientific & Technology Research, Volume 4, Issue 07, July 2015.

[2] Chaw Su Thu and Theingi Zin, "Implementation of Text to Speech Conversion," International Journal of Engineering Research & Technology, Volume 3, Issue 3, March 2014.

[3] Bindu Philip and R. D. Sudhaker Samuel, "Human Machine Interface – A Smart OCR for the Visually Challenged," International Journal of Recent Trends in Engineering, Vol. 3, November 2009.

[4] Roy Shilkrot, Pattie Maes, Jochen Huber, Suranga C. Nanayakkara, and Connie K., "FingerReader: A Wearable Device to Support Text Reading on the Go," Journal of Emerging Trends and Information, April-May 2014.

[5] V. Ajantha Devi and Dr. Santhosh Baboo, "Embedded Optical Character Recognition on Tamil Text Image Using Raspberry Pi," International Journal of Computer Science Trends and Technology (IJCST), Volume 2, Issue 4, Jul-Aug 2014.

[6] Catherine A. Todd, Ammara Rounaq, Umm Kulsum Nur, and Fadi Boufarhat, "An Audio Haptic Tool for Visually Impaired Web Users," Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 8, Aug 2012.

[7] Vikram Shirol, Abhijit M., Savitri A., et al., "DRASHTI – An Android Reading Aid," International Journal of Computer Science and Information Technologies, Vol. 6, July 2015.

[8] Jyoti Singh, Nisheeth Joshi, and Iti Mathur, "Part of Speech Tagging of Marathi Text Using Trigram Method."
