Speculating on the Future for Automatic Speech Recognition

(1)

A Survey of Attendees by

(2)

THANK YOU !

☺

(3)

“It is hard to predict …” “… especially the future.”

(4)

The Survey(s)

• 12 of the 20 statements were exactly the same as those posed to the participants of ASRU’97 six years ago

• A couple were suggested by the ASRU’04 Technical Committee

• The rest were taken from Ray Kurzweil’s books …

(5)

Predictions from Ray Kurzweil

“A PC will have the computational power of the human brain by 2019, and will be equivalent

(6)

(7)


Some Overall Statistics

                    2003    (1997)
attendees:          222     (180)
forms returned:     47%     (45%)
overall mean:       2055    (2056)
“never”s:           24%     (17%)
named responses:    4       (18)
“2020”s:            10%     (7%)
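As a rough illustration of how the per-statement figures on the following slides (mean, SD, min, max, share of “never”) could be derived from returned forms, here is a minimal Python sketch. The function name `summarise`, the sample responses, the use of the sample standard deviation, and the exclusion of “never” answers from the year statistics are all assumptions for illustration; the slides do not state how the figures were computed. The last line checks the 2003 return rate using the 105 responses mentioned under Overall Impressions.

```python
from statistics import mean, stdev


def summarise(responses):
    """Summarise one statement's responses.

    `responses` is a list of predicted years (int) or the string "never".
    "never" answers are counted separately and left out of the year
    statistics (an assumption; the slides do not say how they were handled).
    """
    years = [r for r in responses if r != "never"]
    nevers = len(responses) - len(years)
    return {
        "mean": round(mean(years)),
        "sd": round(stdev(years)),  # sample SD; an assumption
        "min": min(years),
        "max": max(years),
        "never_pct": round(100 * nevers / len(responses)),
    }


# Hypothetical responses for illustration only (not survey data).
example = [2005, 2010, 2010, 2020, 2050, "never"]
print(summarise(example))

# Return rate from figures in the slides: 105 forms from 222 attendees.
print(round(100 * 105 / 222))  # 47
```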

(10)

The ‘Church Effect’

[Figure: distribution of responses by predicted year, 2000 to 2100; series: 1997 and 2003; annotation: “6 years!”]

(11)

1. More than 50% of new PCs have dictation on them, either at purchase or shortly after.

“won’t be used”

“already comes with Office XP”

“now … but not used”

(12)

1. More than 50% of new PCs have dictation on them, either at purchase or shortly after.

[Figure: distribution of responses by predicted year, 1995 to 2015 plus “Never”; series: 1997 and 2003]

Mean: 2009 (2001) SD: 7 (3) Min: 2000 (1997) Max: 2050 (2010)

(13)

2. Most telephone Interactive Voice Response (IVR) systems accept speech input.

(14)

2. Most telephone Interactive Voice Response systems accept speech input (and more than just digits)

[Figure: distribution of responses by predicted year, 1998 to 2038 plus “Never”; series: 1997 and 2003]

Mean: 2010 (2003) SD: 10 (4) Min: 2000 (1998) Max: 2060 (2020)

(15)

5. Automatic airline reservation by voice over the telephone is the norm.

(16)

5. Automatic airline reservation by voice over the telephone is the norm.

[Figure: distribution of responses by predicted year, 1998 to 2038 plus “Never”; series: 1997 and 2003]

Mean: 2013 (2015) SD: 10 (57) Min: 2002 (1999) Max: 2050 (2500)

(17)

4. Speech recognition is commonly available at home (e.g. interactive TV, control of home appliances and home management systems).

(18)

4. Speech recognition is commonly available at home (e.g. interactive TV, control of home appliances and home management systems).

[Figure: distribution of responses by predicted year, 1998 to 2018 plus “Never”; series: 1997 and 2003]

Mean: 2016 (2010) SD: 15 (12) Min: 2004 (1999) Max: 2100 (2100)

(19)

7. Voice-enabled command, control and communication in cars becomes as common as intermittent wiper, power window or power door lock.

(20)

7. Voice-enabled command, control and communication in cars becomes as common as intermittent wiper, power window or power door lock.

[Figure: distribution of responses by predicted year, 1997 to 2017 plus “Never”; series: 1997 and 2003]

Mean: 2016 (2008) SD: 13 (8) Min: 2004 (1999) Max: 2075 (2050)

(21)

3. TV closed-captioning (subtitling) is automatic and pervasive.

(22)

3. TV closed captioning is automatic and pervasive.

[Figure: distribution of responses by predicted year, 1995 to 2035 plus “Never”; series: 1997 and 2003]

Mean: 2018 (2031) SD: 17 (124) Min: 1998 (1997) Max: 2100 (3001)

(23)

15. Telephones are answered by an intelligent answering machine that converses with the calling party to determine the nature and priority of the call.

(24)

15. Telephones are answered by an intelligent answering machine that converses with the calling party to determine the nature and priority of the call.

[Figure: distribution of responses by predicted year, 2000 to 2040 plus “Never”]

Mean: 2022 (early 2000s) SD: 25 Min: 2000 Max: 2150

(25)

11. First legal case in which a recording of a person’s voice is thrown out because it cannot be proved whether a computer or a person said it.

“not evidence in many countries

(26)

11. First legal case in which a recording of a person's voice is thrown out because it cannot be proved whether a computer or a person said it.

[Figure: distribution of responses by predicted year, 1990 to 2090 plus “Never”; series: 1997 and 2003]

Mean: 2025 (2050) SD: 29 (167) Min: 1995 (1990) Max: 2150 (3000)

(27)

20. Pocket-sized listening machines are commonly available for the hearing impaired.

(28)

20. Pocket-sized listening machines are commonly available for the hearing impaired.

[Figure: distribution of responses by predicted year, 1980 to 2380 plus “Never”]

Mean: 2026 (2009) SD: 32 Min: 2001 Max: 2275

(29)

16. The majority of automatic speech recognition systems have completely abandoned the HMM paradigm for acoustic modelling.

“impossible to answer”

“hope it’s soon”

“HMMs are here to stay, but the assumptions will

(30)

16. The majority of automatic speech recognition systems have completely abandoned the HMM paradigm for acoustic modelling.

[Figure: distribution of responses by predicted year, 2005 to 2045 plus “Never”]

Mean: 2029 SD: 33 Min: 2007 Max: 2200

(31)

10. Public proceedings (e.g. courts, public inquiries, parliament etc.) are transcribed automatically.

(32)

10. Public proceedings (e.g. courts, public inquiries, parliament etc.) are transcribed automatically.

[Figure: distribution of responses by predicted year, 1995 to 2095 plus “Never”; series: 1997 and 2003]

Mean: 2030 (2041) SD: 26 (128) Min: 2006 (2000) Max: 2150 (3001)

(33)

9. A leading cause of time away from work is being hoarse from talking all the time, and people buy keyboards as an alternative to speaking.

“keyboards will always be shipped”

“no problem, there will always be advanced pills available”

(34)

9. A leading cause of time away from work is being hoarse from talking all the time, and people buy keyboards as an alternative to speaking.

[Figure: distribution of responses by predicted year, 2000 to 2100 plus “Never”; series: 1997 and 2003]

Mean: 2030 (2103) SD: 31 (287) Min: 2006 (1998) Max: 2150 (3020)

(35)

14. The majority of automatic speech recognition systems have completely abandoned the n-grams paradigm for language modelling.

“most deployed systems use CFG anyway”

“impossible to answer”

(36)

14. The majority of automatic speech recognition systems have completely abandoned the n-grams paradigm for language modelling.

[Figure: distribution of responses by predicted year, 1995 to 2055 plus “Never”]

Mean: 2033 SD: 39 Min: 1995 Max: 2200

(37)

13. The majority of text is created using continuous speech recognition.

“for authoring

(38)

13. The majority of text is created using continuous speech recognition.

[Figure: distribution of responses by predicted year, 2000 to 2100 plus “Never”]

Mean: 2039 (2009) SD: 48 Min: 2000 Max: 2300

(39)

17. Most routine business transactions take place between a human and a virtual personality (including an animated visual presence that looks like a human face).

(40)

17. Most routine business transactions take place between a human and a virtual personality (including an animated visual presence that looks like a human face).

[Figure: distribution of responses by predicted year, 1990 to 2090 plus “Never”]

Mean: 2041 (2009) SD: 66 Min: 1994 Max: 2500

(41)

18. Translating telephones allow two people across the globe to speak to each other even if they do not speak the same language.

“depends on the task”

(42)

18. Translating telephones allow two people across the globe to speak to each other even if they do not speak the same language.

[Figure: distribution of responses by predicted year, 2000 to 2100 plus “Never”]

Mean: 2057 (early 2000s) SD: 116 Min: 2000 Max: 3000

(43)

12. Speech recognition accuracy equals that of the average (individual) human transcriber.

(44)

12. Speech recognition accuracy equals that of the average (individual) human transcriber.

[Figure: distribution of responses by predicted year, 1995 to 2095 plus “Never”; series: 1997 and 2003]

Mean: 2064 (2046) SD: 222 (124) Min: 2005 (1997) Max: 3827 (3001)

(45)

19. Most interaction with computing is through gestures and two-way natural-language spoken communication.

(46)

19. Most interaction with computing is through gestures and two-way natural-language spoken communication.

[Figure: distribution of responses by predicted year, 2000 to 2040 plus “Never”]

Mean: 2069 (2019) SD: 225 Min: 2004 Max: 3827

(47)

6. It is possible to hold a telephone conversation with an automatic chat-line system for more than 10 minutes without realising it isn’t human.

“could happen, but should not”

“why would one want that?”

“hopefully never”

(48)

6. It is possible to hold a telephone conversation with an automatic chatline system for more than 10 minutes without realising it isn't human.

[Figure: distribution of responses by predicted year, 1995 to 2195 plus “Never”; series: 1997 and 2003]

Mean: 2086 (2128) SD: 228 (328) Min: 2000 (1998) Max: 3579 (4001)

(49)

8. No more need for speech research.

“GASP!”

(50)

8. No more need for speech research.

[Figure: distribution of responses by predicted year, 1980 to 2380 plus “Never”; series: 1997 and 2003]

Mean: 2342 (2240) SD: 1308 (546) Min: 1981 (1984) Max: 10000 (5001)
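With a maximum of 10000, the mean and SD for this statement are likely to be dominated by a few very distant answers. A minimal Python sketch with made-up responses (not the survey data) shows how one such answer moves the mean and SD while leaving the median where it was. The median is not reported in the slides and is used here only for contrast; the variable names are hypothetical.

```python
from statistics import mean, median, stdev

# Hypothetical responses for illustration only (not the survey data):
# most answers fall within two centuries, one respondent writes 10000.
clustered = [2050, 2080, 2100, 2100, 2150, 2200]
with_outlier = clustered + [10000]

for label, years in [("without outlier", clustered),
                     ("with outlier", with_outlier)]:
    print(
        f"{label}: mean={round(mean(years))} "
        f"median={round(median(years))} sd={round(stdev(years))}"
    )
```

In this toy example the median stays at 2100 in both cases, while the mean rises from about 2113 to 3240.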

(51)

Overall Impressions

• High level of participation – thanks to the 105 who responded

• Remarkably consistent with the 1997 survey

• Strong evidence of the ‘Church Effect’

• Neither more optimistic nor more pessimistic

• More agreement & more realistic

• People less willing to be associated with their opinions than 6 years ago

(52)

(53)

20/20 Speech Ltd.

Science Park, Malvern, Worcs., WR14 3SZ, UK

Tel: +44 1 684 585101

Fax: +44 1 684 585151

http://www.2020speech.com

http://www.aurix.com