Text-To-Speech Technology-Based Programming Tool Final Doc


Text-to-Speech Technology-Based Programming Tool


ABSTRACT

This paper presents an audio programming tool based on text-to-speech technology for blind and vision impaired people to learn programming. The tool can help users edit a program, then compile, debug, and run it. All of these stages are voice enabled. The programming language for evaluation is C# and the tool is developed in Visual Studio .NET. Evaluations have shown that the programming tool can help blind and vision impaired people implement software applications and achieve equality of access and opportunity in information technology education.


Introduction

Blindness

Blindness is the condition of lacking visual perception due to physiological or neurological factors.

Various scales have been developed to describe the extent of vision loss and define blindness.[1] Total blindness is the complete lack of form and visual light perception and is clinically recorded as NLP, an abbreviation for "no light perception."[1] Blindness is frequently used to describe severe visual impairment with residual vision. Those described as having only light perception have no more sight than the ability to tell light from dark and the general direction of a light source.

In order to determine which people may need special assistance because of their visual disabilities, various governmental jurisdictions have formulated more complex definitions referred to as legal blindness.[2] In North America and most of Europe, legal blindness is defined as visual acuity (vision) of 20/200 (6/60) or less in the better eye with best correction possible. This means that a legally blind individual would have to stand 20 feet (6.1 m) from an object to see it, with corrective lenses, with the same degree of clarity as a normally sighted person could from 200 feet (61 m). In many areas, people with average acuity who nonetheless have a visual field of less than 20 degrees (the norm being 180 degrees) are also classified as being legally blind. Approximately ten percent of those deemed legally blind, by any measure, have no vision.


The rest have some vision, from light perception alone to relatively good acuity. Low vision is sometimes used to describe visual acuities from 20/70 to 20/200.[3]

By the 10th Revision of the WHO International Statistical Classification of Diseases, Injuries and Causes of Death, low vision is defined as visual acuity of less than 20/60 (6/18), but equal to or better than 20/200 (6/60), or corresponding visual field loss to less than 20 degrees, in the better eye with best possible correction. Blindness is defined as visual acuity of less than 20/400 (6/120), or corresponding visual field loss to less than 10 degrees, in the better eye with best possible correction.[4][5]
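The acuity thresholds quoted above can be expressed as a small lookup. The following is only an illustrative sketch in Python, not part of the tool described in this paper: the function name `classify_acuity` and the result labels are hypothetical, only the acuity criteria are modelled (the visual field criteria are ignored), and acuities between 20/400 and 20/200 are reported as not covered because the two ranges quoted here do not address them.

```python
from fractions import Fraction

# Thresholds as quoted in the text (Snellen fractions, better eye, best correction).
LOW_VISION_BELOW = Fraction(20, 60)       # low vision: acuity < 20/60 ...
LOW_VISION_AT_LEAST = Fraction(20, 200)   # ... and acuity >= 20/200
BLINDNESS_BELOW = Fraction(20, 400)       # blindness: acuity < 20/400


def classify_acuity(acuity: Fraction) -> str:
    """Classify a Snellen acuity using only the two acuity ranges quoted above."""
    if acuity < BLINDNESS_BELOW:
        return "blindness"
    if LOW_VISION_AT_LEAST <= acuity < LOW_VISION_BELOW:
        return "low vision"
    if acuity >= LOW_VISION_BELOW:
        return "neither"
    # 20/400 <= acuity < 20/200 falls between the two quoted ranges.
    return "not covered by the quoted ranges"


if __name__ == "__main__":
    for feet in (20, 100, 200, 300, 800):
        print(f"20/{feet}: {classify_acuity(Fraction(20, feet))}")
```

Using exact fractions rather than floating-point numbers keeps the boundary cases (exactly 20/200 or exactly 20/400) unambiguous, and it also confirms that the metric values given in parentheses are the same fractions as the imperial ones.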

Blind people with undamaged eyes may still register light non-visually for the purpose of circadian entrainment to the 24-hour light/dark cycle. Light signals for this purpose travel through the retinohypothalamic tract, so a damaged optic nerve beyond where the retinohypothalamic tract exits it is no hindrance.

Causes

Serious visual impairment has a variety of causes:


Diseases

According to WHO estimates, the most common causes of blindness around the world in 2002 were:

1. cataracts (47.9%),
2. glaucoma (12.3%),
3. age-related macular degeneration (8.7%),
4. corneal opacity (5.1%),
5. diabetic retinopathy (4.8%),
6. childhood blindness (3.9%),
7. trachoma (3.6%), and
8. onchocerciasis (0.8%).[13]

In terms of the worldwide prevalence of blindness, the vastly greater number of people in the developing world and the greater likelihood of their being affected mean that the causes of blindness in those areas are numerically more important. Cataract is responsible for more than 22 million cases of blindness and glaucoma 6 million, while leprosy and onchocerciasis each blind approximately 1 million individuals worldwide. The number of individuals blind from trachoma has dropped dramatically in the past 10 years from 6 million to 1.3 million, putting it in seventh place on the list of causes of blindness worldwide. Xerophthalmia is estimated to affect 5 million children each year; 500,000 develop active corneal involvement, and half of these go blind. Central corneal ulceration is also a significant cause of monocular blindness worldwide, accounting for an estimated 850,000 cases of corneal blindness every year in the Indian subcontinent alone. As a result, corneal scarring from all causes now is the fourth greatest cause of global blindness (Vaughan & Asbury's General Ophthalmology, 17e).

People in developing countries are significantly more likely to experience visual impairment as a consequence of treatable or preventable conditions than are their counterparts in the developed world. While vision impairment is most common in people over age 60 across all regions, children in poorer communities are more likely to be affected by blinding diseases than are their more affluent peers.

The link between poverty and treatable visual impairment is most obvious when conducting regional comparisons of cause. Most adult visual impairment in North America and Western Europe is related to age-related macular degeneration and diabetic retinopathy. While both of these conditions are subject to treatment, neither can be cured.

In developing countries, wherein people have shorter life expectancies, cataracts and water-borne parasites (both of which can be treated effectively) are most often the culprits (see river blindness, for example). Of the estimated 40 million blind people located around the world, 70 to 80% can have some or all of their sight restored through treatment.

In developed countries where parasitic diseases are less common and cataract surgery is more available, age-related macular degeneration, glaucoma, and diabetic retinopathy are usually the leading causes of blindness.


Childhood blindness can be caused by conditions related to pregnancy, such as congenital rubella syndrome and retinopathy of prematurity.

Abnormalities and injuries

Eye injuries, most often occurring in people under 30, are the leading cause of monocular blindness (vision loss in one eye) throughout the United States. Injuries and cataracts affect the eye itself, while abnormalities such as optic nerve hypoplasia affect the nerve bundle that sends signals from the eye to the back of the brain, which can lead to decreased visual acuity.

People with injuries to the occipital lobe of the brain can, despite having undamaged eyes and optic nerves, still be legally or totally blind.

Genetic defects

People with albinism often have vision loss to the extent that many are legally blind, though few of them actually cannot see. Leber's congenital amaurosis can cause total blindness or severe sight loss from birth or early childhood.

Recent advances in mapping of the human genome have identified other genetic causes of low vision or blindness. One such example is Bardet-Biedl syndrome.

Poisoning


Rarely, blindness is caused by the intake of certain chemicals. A well-known example is methanol, which is only mildly toxic and minimally intoxicating in itself. When it is not competing with ethanol for metabolism, however, methanol breaks down into formaldehyde and formic acid, which in turn can cause blindness, an array of other health complications, and death.[15] Methanol is commonly found in methylated spirits (denatured ethyl alcohol), where it is added so that the product avoids the taxes levied on ethanol intended for human consumption. Methylated spirits are sometimes used by alcoholics as a desperate and cheap substitute for regular alcoholic beverages.

Willful actions

Blinding has been used as an act of vengeance and torture in some instances, to deprive a person of a major sense by which they can navigate or interact within the world, act fully independently, and be aware of events surrounding them. An example from the classical realm is Oedipus, who gouges out his own eyes after realizing that he fulfilled the awful prophecy spoken of him.

In 2003, a Pakistani anti-terrorism court sentenced a man to be blinded after he carried out an acid attack against his fiancée that resulted in her blinding.[16] The same sentence was given in 2009 for the man who blinded Ameneh Bahrami.

Comorbidities


Blindness can occur in combination with such conditions as mental retardation, autism, cerebral palsy, hearing impairments, and epilepsy.[17][18] In a study of 228 visually impaired children in metropolitan Atlanta between 1991 and 1993, 154 (68%) had an additional disability besides visual impairment.[17] Blindness in combination with hearing loss is known as deafblindness.

Management

A 2008 study published in the New England Journal of Medicine[19] tested the effect of using gene therapy to help restore the sight of patients with a rare form of inherited blindness, known as Leber Congenital Amaurosis or LCA. Leber Congenital Amaurosis damages the light receptors in the retina and usually begins affecting sight in early childhood, with worsening vision until complete blindness around the age of 30.

The study used a common cold virus to deliver a normal version of the gene called RPE65 directly into the eyes of affected patients. Remarkably, all three patients, aged 19, 22, and 25, responded well to the treatment and reported improved vision following the procedure. Given the age of the patients and the degenerative nature of LCA, the improvement of vision in gene therapy patients is encouraging for researchers. It is hoped that gene therapy may be even more effective in younger LCA patients who have experienced limited vision loss, as well as in other blind or partially blind individuals.


Two experimental treatments for retinal problems include a cybernetic replacement and transplant of fetal retinal cells.[20]

Adaptive techniques and aids

Mobility

Figure: Folded long cane.

Many people with serious visual impairments can travel independently, using a wide range of tools and techniques. Orientation and mobility specialists are professionals who are specifically trained to teach people with visual impairments how to travel safely, confidently, and independently in the home and the community. These professionals can also help blind people to practice travelling on specific routes which they may use often, such as the route from one's house to a convenience store. Becoming familiar with an environment or route can make it much easier for a blind person to navigate successfully.

Tools such as the white cane with a red tip, the international symbol of blindness, may also be used to improve mobility. A long cane is used to extend the user's range of touch sensation. It is usually swung in a low sweeping motion, across the intended path of travel, to detect obstacles.

However, techniques for cane travel can vary depending on the user and/or the situation. Some visually impaired persons do not carry these kinds of canes, opting instead for the shorter, lighter identification (ID) cane. Still others require a support cane. The choice depends on the individual's vision, motivation, and other factors.

A small number of people employ guide dogs to assist in mobility. These dogs are trained to navigate around various obstacles, and to indicate when it becomes necessary to go up or down a step. However, the helpfulness of guide dogs is limited by the inability of dogs to understand complex directions. The human half of the guide dog team does the directing, based upon skills acquired through previous mobility training. In this sense, the handler might be likened to an aircraft's navigator, who must know how to get from one place to another, and the dog to the pilot, who gets them there safely.


In addition, some blind people use software using GPS technology as a mobility aid. Such software can assist blind people with orientation and navigation, but it is not a replacement for traditional mobility tools such as white canes and guide dogs.

Government actions are sometimes taken to make public places more accessible to blind people. Public transportation is freely available to the blind in many cities. Tactile paving and audible traffic signals can make it easier and safer for visually impaired pedestrians to cross streets. In addition to making rules about who can and cannot use a cane, some governments mandate the right-of-way be given to users of white canes or guide dogs.

Reading and magnification


Figure: Watch for the blind.

Most visually impaired people who are not totally blind read print, either of a regular size or enlarged by magnification devices. Many also read large-print, which is easier for them to read without such devices. A variety of magnifying glasses, some handheld, and some on desktops, can make reading easier for them.

Others read Braille (or the infrequently used Moon type), or rely on talking books and readers or reading machines, which convert printed text to speech or Braille. They use computers with special hardware such as scanners and refreshable Braille displays as well as software written specifically for the blind, such as optical character recognition applications and screen readers.

Some people access these materials through agencies for the blind, such as the National Library Service for the Blind and Physically Handicapped in the United States, the National Library for the Blind or the RNIB in the United Kingdom.

Closed-circuit televisions, equipment that enlarges and contrasts textual items, are a more high-tech alternative to traditional magnification devices.

There are also over 100 radio reading services throughout the world that provide people with vision impairments with readings from periodicals over the radio. The International Association of Audio Information Services provides links to all of these organizations.


Computers

Access technology such as screen readers, screen magnifiers and refreshable Braille displays enable the blind to use mainstream computer applications and mobile phones. The availability of assistive technology is increasing, accompanied by concerted efforts to ensure the accessibility of information technology to all potential users, including the blind. Later versions of Microsoft Windows include an Accessibility Wizard & Magnifier for those with partial vision, and Microsoft Narrator, a simple screen reader. Linux distributions (as live CDs) for the blind include Oralux and Adriane Knoppix, the latter developed in part by Adriane Knopper who has a visual impairment. Mac OS also comes with a built-in screen reader, called VoiceOver.

The movement towards greater web accessibility is opening a far wider number of websites to adaptive technology, making the web a more inviting place for visually impaired surfers.

Experimental approaches in sensory substitution are beginning to provide access to arbitrary live views from a camera.

Other aids and techniques


Figure: A tactile feature on a Canadian banknote.

Blind people may use talking equipment such as thermometers, watches, clocks, scales, calculators, and compasses. They may also enlarge or mark dials on devices such as ovens and thermostats to make them usable. Other techniques used by blind people to assist them in daily activities include:

• Adaptations of coins and banknotes so that the value can be determined by touch. For example:

    • In some currencies, such as the euro, the pound sterling and the Indian rupee, the size of a note increases with its value.

    • On US coins, pennies and dimes, and nickels and quarters are similar in size. The larger denominations (dimes and quarters) have ridges along the sides (historically used to prevent the "shaving" of precious metals from the coins), which can now be used for identification.


Epidemiology

The WHO estimated that in 2002 there were 161 million visually impaired people in the world (about 2.6% of the total population). Of this number, 124 million (about 2%) had low vision and 37 million (about 0.6%) were blind.[22] In order of frequency, the leading causes were cataract, uncorrected refractive errors (nearsightedness, farsightedness, or astigmatism), glaucoma, and age-related macular degeneration.[23] In 1987, it was estimated that 598,000 people in the United States met the legal definition of blindness.[24] Of this number, 58% were over the age of 65.


Speech synthesis

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.[2]

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.


Overview of text processing

Overview of a typical TTS system

A text-to-speech system (or "engine") is composed of two parts[3]: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion.
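The two front-end tasks can be illustrated with a minimal Python sketch. It is not the implementation of any particular TTS engine: the abbreviation table, the tiny pronunciation lexicon (written in ARPAbet-style symbols), and the function names `normalize` and `to_phonemes` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lookup tables; a real front-end uses far larger resources.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
# Illustrative pronunciations only; unknown words are marked "<unk>".
LEXICON = {"doctor": "D AA K T ER", "two": "T UW", "street": "S T R IY T"}


def number_to_words(n: int) -> list[str]:
    """Spell out an integer in the range 0..999 as a list of words."""
    if n < 20:
        return [ONES[n]]
    if n < 100:
        tens, ones = divmod(n, 10)
        return [TENS[tens]] + ([ONES[ones]] if ones else [])
    hundreds, rest = divmod(n, 100)
    return [ONES[hundreds], "hundred"] + (number_to_words(rest) if rest else [])


def normalize(text: str) -> list[str]:
    """Task 1: expand abbreviations and small numbers into written-out words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdecimal() and int(token) < 1000:
            words.extend(number_to_words(int(token)))
        else:
            # Strip punctuation from ordinary words.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


def to_phonemes(words: list[str]) -> list[list[str]]:
    """Task 2: assign a phonetic transcription to each word by lexicon lookup."""
    return [LEXICON.get(w, "<unk>").split() for w in words]


words = normalize("Dr. Smith lives at 42 Elm St.")
print(words)
print(to_phonemes(words))
```

A lexicon lookup is only one way to do grapheme-to-phoneme conversion; rule-based or statistical approaches would replace `to_phonemes` without changing the normalization step.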

Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations[4]), which is then imposed on the output speech.


History

Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]).[5] This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper.[6] This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.[7]

In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.


The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950. There were several different versions of this hardware device but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels).

Dominant systems in the 1980s and 1990s were the MITalk system, based largely on the work of Dennis Klatt at MIT, and the Bell Labs system;[8] the latter was one of the first multilingual language-independent systems, making extensive use of Natural Language Processing methods.

Early electronic speech synthesizers sounded robotic and were often barely intelligible. The quality of synthesized speech has steadily improved, but output from contemporary speech synthesis systems is still clearly distinguishable from actual human speech.

As the cost-performance ratio causes speech synthesizers to become cheaper and more accessible, more people will benefit from the use of text-to-speech programs.


Electronic devices

The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman[10] used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey,[11] where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman.[12] Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.[13]

Handheld electronics featuring speech synthesis began emerging in the 1970s. One of the first was the Telesensory Systems Inc. (TSI) Speech+ portable calculator for the blind in 1976.[14][15] Other devices were produced primarily for educational purposes, such as Speak & Spell, produced by Texas Instruments[16] in 1978. The first multi-player game using voice synthesis was Milton from Milton Bradley Company, which produced the device in 1980.


Synthesizer technologies

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis


Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram.[17] An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
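To make the idea of a "best chain of candidate units" concrete, here is a minimal Python sketch. Note that it does not use the weighted decision tree mentioned above; it swaps in a simple dynamic-programming (Viterbi-style) search that minimizes a target cost plus a join cost, which is one common way to frame the problem. The `Unit` fields, both cost functions, and their weights are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One indexed segment in a hypothetical speech database."""
    phone: str
    pitch_hz: float
    duration_ms: float


def target_cost(unit, want_pitch, want_dur):
    # How far the candidate is from the requested pitch and duration.
    return abs(unit.pitch_hz - want_pitch) / 10.0 + abs(unit.duration_ms - want_dur) / 10.0


def join_cost(left, right):
    # Penalize pitch jumps at the concatenation point.
    return abs(left.pitch_hz - right.pitch_hz) / 10.0


def select_units(targets, database):
    """targets: list of (phone, pitch_hz, duration_ms); database: phone -> list of Unit."""
    if not targets:
        return []
    # best[i][j] = (cheapest cost of a chain ending in candidate j, back pointer)
    best = []
    for i, (phone, pitch, dur) in enumerate(targets):
        row = []
        for unit in database[phone]:
            tc = target_cost(unit, pitch, dur)
            if i == 0:
                row.append((tc, None))
            else:
                prev_units = database[targets[i - 1][0]]
                cost, back = min(
                    (best[i - 1][k][0] + join_cost(prev_units[k], unit), k)
                    for k in range(len(prev_units))
                )
                row.append((cost + tc, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(database[targets[i][0]][j])
        j = best[i][j][1]
    return chain[::-1]


database = {
    "h": [Unit("h", 100, 50), Unit("h", 200, 50)],
    "i": [Unit("i", 105, 80), Unit("i", 210, 80)],
}
print(select_units([("h", 100, 50), ("i", 110, 80)], database))
```

The join cost is what makes this a chain search rather than an independent per-phone lookup: a candidate that matches its target slightly worse can still win if it joins more smoothly with its neighbors.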

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech.[18] Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.[19]

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA[20] or MBROLA.[21] The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
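The lookup step of diphone synthesis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `"_"` silence marker, the `"a-b"` naming scheme for diphones, and the toy database of sample lists are hypothetical, and the prosody modification stage (LPC, PSOLA or MBROLA) is left out entirely.

```python
def to_diphones(phones):
    """Pair each phone with its successor, padding both ends with silence ('_')."""
    padded = ["_"] + list(phones) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate(phones, diphone_db):
    """Join the single stored recording of each diphone into one sample list."""
    samples = []
    for name in to_diphones(phones):
        samples.extend(diphone_db[name])  # exactly one example per diphone
    return samples


print(to_diphones(["h", "e", "l", "o"]))
# ['_-h', 'h-e', 'e-l', 'l-o', 'o-_']
```

Because every utterance is assembled from the same small inventory, the database stays compact, which matches the "small size" advantage noted above.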

Domain-specific synthesis


Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.[22] The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.
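A talking clock is a convenient example of this approach. The Python sketch below only decides which prerecorded clips to play and in what order; the clip names (`the_time_is`, `num_3`, `oclock`, and so on) and the function `clock_clips` are hypothetical, and actual audio playback is omitted.

```python
def clock_clips(hour: int, minute: int) -> list[str]:
    """Return the names of prerecorded clips that announce a 24-hour time."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("time out of range")
    suffix = "am" if hour < 12 else "pm"
    hour12 = hour % 12 or 12          # 0 and 12 are both spoken as "twelve"
    clips = ["the_time_is", f"num_{hour12}"]
    if minute == 0:
        clips.append("oclock")
    elif minute < 10:
        clips += ["oh", f"num_{minute}"]   # e.g. "twelve oh five"
    else:
        clips.append(f"num_{minute}")
    clips.append(suffix)
    return clips


print(clock_clips(15, 0))
# ['the_time_is', 'num_3', 'oclock', 'pm']
```

The sketch can only ever produce sentences built from its fixed clip inventory, which is why such systems sound natural within their domain but are not general-purpose.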

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only pronounced when the following word has a vowel as its first letter (e.g. "clear out" is realized as /ˌklɪəɹˈʌʊt/). Likewise in French, many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.

Formant synthesis


Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model (physical modelling synthesis).[23] Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components. Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
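As a rough illustration of additive synthesis driven by formant parameters, the Python sketch below builds a short vowel-like waveform by summing the harmonics of a fundamental frequency, each weighted by how close it lies to a formant. It is a deliberately simplified model, not a production formant synthesizer: the resonance formula, the 100 Hz bandwidth, and the formant values used for an "a"-like vowel are rough illustrative assumptions.

```python
import math


def formant_gain(freq, formants, bandwidth=100.0):
    """Sum of simple resonance peaks centred on each formant frequency."""
    return sum(1.0 / (1.0 + ((freq - f) / bandwidth) ** 2) for f in formants)


def synthesize_vowel(f0, formants, duration_s=0.05, sample_rate=8000):
    """Additive synthesis: harmonics of f0 shaped by the formant peaks."""
    n = round(duration_s * sample_rate)
    # Use every harmonic of f0 up to the Nyquist frequency.
    harmonics = [f0 * k for k in range(1, int((sample_rate / 2) // f0) + 1)]
    gains = [formant_gain(h, formants) for h in harmonics]
    samples = []
    for i in range(n):
        t = i / sample_rate
        samples.append(sum(g * math.sin(2 * math.pi * h * t)
                           for g, h in zip(gains, harmonics)))
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]  # normalize to the range [-1, 1]


# Rough, illustrative formant values for an "a"-like vowel at a 120 Hz pitch.
wave = synthesize_vowel(120.0, [700.0, 1200.0, 2600.0])
print(len(wave))
```

Changing `f0` alters the pitch without touching the formants, and changing the formant list alters the vowel quality without touching the pitch, which is the kind of independent control over the output that the paragraph above describes.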

Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s Sega arcade machines[24] and in many Atari, Inc. arcade games[25] using the TMS5220 LPC Chips. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.[26]

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.

Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".

HMM-based synthesis

HMM-based synthesis is a synthesis method based on hidden Markov models, also called Statistical Parametric Synthesis. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion.[27]

Sine wave synthesis

Sine wave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.[28]
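As a minimal sketch of this idea, assume the formant frequencies and amplitudes have already been estimated for each short frame of speech. Each formant can then be replaced by one pure tone and the tones summed. The function name, the frame layout, and the numeric values below are illustrative assumptions, not part of the tool described in this paper.

```python
import math


def sine_wave_speech(formant_frames, sample_rate=8000, samples_per_frame=80):
    """Replace each formant with a pure tone and sum the tones.

    formant_frames: list of frames; each frame is a list of
    (frequency_hz, amplitude) pairs, one pair per formant.
    Every frame is assumed to hold the same number of formants.
    """
    # One running phase per formant keeps each tone continuous
    # when its frequency changes between frames.
    phases = [0.0] * len(formant_frames[0])
    samples = []
    for frame in formant_frames:
        for _ in range(samples_per_frame):
            value = 0.0
            for i, (freq, amp) in enumerate(frame):
                phases[i] += 2.0 * math.pi * freq / sample_rate
                value += amp * math.sin(phases[i])
            samples.append(value)
    return samples


# Made-up formant values for a steady vowel-like sound (illustration only).
frames = [[(500.0, 0.5), (1500.0, 0.3), (2500.0, 0.2)]] * 10
audio = sine_wave_speech(frames)
```

The result is a list of floating-point samples that could be written to a sound file or passed to an audio device; real sine wave speech varies the formant values from frame to frame.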

Challenges

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation.

There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, such as examining neighboring words and using statistics about frequency of occurrence.
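The neighboring-word heuristic can be sketched as follows. This is only an illustration under stated assumptions: the cue-word lists, the fallback choice, and the stress-marked respellings are all hypothetical, whereas a real system would derive such cues from corpus statistics.

```python
# Hypothetical cue words (illustration only): words that tend to
# precede a verb or a noun reading of a homograph.
VERB_CUES = {"to", "will", "can", "must", "should"}
NOUN_CUES = {"the", "a", "my", "latest", "this", "our"}

# Informal stress-marked respellings, not a standard phonetic alphabet.
PRONUNCIATIONS = {"project": {"noun": "PROJ-ekt", "verb": "pro-JEKT"}}


def guess_reading(words, index, window=2):
    """Guess noun or verb from the nearest preceding cue word."""
    for offset in range(1, window + 1):
        j = index - offset
        if j < 0:
            break
        previous = words[j].lower()
        if previous in VERB_CUES:
            return "verb"
        if previous in NOUN_CUES:
            return "noun"
    # Assumed fallback: treat the noun reading as the more frequent one.
    return "noun"


def pronounce_homographs(sentence):
    """Return (word, reading, pronunciation) for each known homograph."""
    words = sentence.split()
    found = []
    for i, word in enumerate(words):
        key = word.lower().strip('.,!?"')
        if key in PRONUNCIATIONS:
            reading = guess_reading(words, i)
            found.append((key, reading, PRONUNCIATIONS[key][reading]))
    return found


result = pronounce_homographs(
    "My latest project is to learn how to better project my voice"
)
```

For the example sentence, the first "project" follows "latest" and is read as a noun, while the second is reached through "to better" and is read as a verb.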

Recently TTS systems have begun to use HMMs (discussed above) to generate "parts of speech" to aid in disambiguating homographs. This technique is quite successful for many cases, such as whether "read" should be pronounced as "red" implying past tense, or as "reed" implying present tense. Typical error rates when using HMMs in this fashion are usually below five percent. These techniques also work well for most European languages, although access to required training corpora is frequently difficult in these languages.
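To make the HMM approach concrete, the sketch below tags a sentence with the Viterbi algorithm and then picks a pronunciation of "read" from the resulting tag. The tag set and every probability are invented for illustration; a real tagger estimates them from a training corpus and uses a far larger tag set.

```python
# Toy model: all tags and probabilities are made-up assumptions.
TAGS = ["PRON", "MODAL", "PERF", "BASE", "PAST"]
START = {"PRON": 1.0}
TRANS = {
    "PRON": {"MODAL": 0.3, "PERF": 0.3, "BASE": 0.2, "PAST": 0.2},
    "MODAL": {"BASE": 0.9, "PRON": 0.1},
    "PERF": {"PAST": 0.9, "PRON": 0.1},
    "BASE": {"PRON": 1.0},
    "PAST": {"PRON": 1.0},
}
EMIT = {
    "PRON": {"i": 0.5, "it": 0.5},
    "MODAL": {"will": 1.0},
    "PERF": {"have": 1.0},
    "BASE": {"read": 1.0},
    "PAST": {"read": 1.0},
}
READ_SOUNDS = {"BASE": "reed", "PAST": "red"}


def viterbi(words):
    """Return the most probable tag sequence for the given words."""
    # Each column maps tag -> (best path probability, previous tag).
    best = [{t: (START.get(t, 0.0) * EMIT[t].get(words[0], 0.0), None)
             for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            prob, prev = max(
                ((best[-1][p][0] * TRANS[p].get(tag, 0.0), p) for p in TAGS),
                key=lambda pair: pair[0],
            )
            column[tag] = (prob * EMIT[tag].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


def pronounce_read(sentence):
    """Pick a pronunciation of 'read' from its predicted tag."""
    words = sentence.lower().split()
    tags = viterbi(words)
    return READ_SOUNDS[tags[words.index("read")]]
```

With this toy model, `pronounce_read("I will read it")` selects "reed" and `pronounce_read("I have read it")` selects "red", because the tag of the preceding word changes which tag is most probable for "read".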

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), like "1325" becoming "one thousand three hundred twenty-five." However, numbers occur in many different contexts; "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty five".
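The readings mentioned above can be produced by a few small functions. This is a sketch that assumes English and numbers below 10,000; the function names are illustrative, and choosing which reading fits a given context is the hard part that the code does not attempt.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out a number from 0 to 9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def digit_by_digit(text):
    """Read each digit separately, as for a code or phone number."""
    return " ".join(ONES[int(ch)] for ch in text)


def as_pairs(text):
    """Read four digits as two two-digit numbers, as for many years.

    Simplification: a pair such as "05" comes out as "five".
    """
    return two_digits(int(text[:2])) + " " + two_digits(int(text[2:]))


readings = [cardinal(1325), digit_by_digit("1325"), as_pairs("1325")]
```

Here `readings` holds "one thousand three hundred twenty-five", "one three two five", and "thirteen twenty-five"; a TTS front end still needs context rules or statistics to decide which of them to speak.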