1 Copyright © 2003 20/20 Speech
Speculating on the Future for Automatic Speech Recognition
A Survey of Attendees by
THANK YOU !
☺
“It is hard to predict …” “… especially the future.”
The Survey(s)
• 12 of the 20 statements were exactly the same as those posed to the participants of ASRU’97 six years ago
• A couple were suggested by the ASRU’04 Technical Committee
• The rest were taken from Ray Kurzweil’s books …
Predictions from Ray Kurzweil
“A PC will have the computational power of the human brain by 2019, and will be equivalent
Some Overall Statistics
                    2003    (1997)
attendees:          222     (180)
forms returned:     47%     (45%)
overall mean:       2055    (2056)
“never”s:           24%     (17%)
named responses:    4       (18)
“2020”s:            10%     (7%)
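The response figures can be cross-checked with a little arithmetic. The sketch below assumes that the 105 respondents thanked on the closing slide are the forms returned in 2003. The 1997 count is only an estimate derived from the quoted 45%, and the variable names are illustrative.

```python
# Figures quoted in the slides.
attendees_2003 = 222
respondents_2003 = 105      # assumed to equal "forms returned" for 2003
attendees_1997 = 180
return_rate_1997 = 0.45     # quoted percentage; the raw 1997 count is not given

# Response rate implied by the 2003 head counts.
rate_2003 = respondents_2003 / attendees_2003

# Rough estimate of how many forms the 1997 percentage corresponds to.
estimated_forms_1997 = round(attendees_1997 * return_rate_1997)

print(f"2003 response rate: {rate_2003:.1%}")
print(f"1997 estimated forms returned: {estimated_forms_1997}")
```

Under these assumptions the 2003 head counts agree with the quoted 47% once rounded to a whole percent.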
The ‘Church Effect’
[Chart: responses by predicted year, 2000 to 2100; series: 1997, 2003; annotated “6 years !”]
1. More than 50% of new PCs have dictation on them, either at purchase or shortly after.
“won’t be used”
“already comes with Office XP”
“now … but not used”
[Chart: responses by predicted year, 1995 to 2015 plus “Never”; series: 1997, 2003]
Mean: 2009 (2001) SD: 7 (3) Min: 2000 (1997) Max: 2050 (2010)
2. Most telephone Interactive Voice Response (IVR) systems accept speech input (and more than just digits).
[Chart: responses by predicted year, 1998 to 2038 plus “Never”; series: 1997, 2003]
Mean: 2010 (2003) SD: 10 (4) Min: 2000 (1998) Max: 2060 (2020)
5. Automatic airline reservation by voice over the telephone is the norm.
[Chart: responses by predicted year, 1998 to 2038 plus “Never”; series: 1997, 2003]
Mean: 2013 (2015) SD: 10 (57) Min: 2002 (1999) Max: 2050 (2500)
4. Speech recognition is commonly available at home (e.g. interactive TV, control of home appliances and home management systems).
[Chart: responses by predicted year, 1998 to 2018 plus “Never”; series: 1997, 2003]
Mean: 2016 (2010) SD: 15 (12) Min: 2004 (1999) Max: 2100 (2100)
7. Voice-enabled command, control and communication in cars becomes as common as intermittent wiper, power window or power door lock.
[Chart: responses by predicted year, 1997 to 2017 plus “Never”; series: 1997, 2003]
Mean: 2016 (2008) SD: 13 (8) Min: 2004 (1999) Max: 2075 (2050)
3. TV closed-captioning (subtitling) is automatic and pervasive.
[Chart: responses by predicted year, 1995 to 2035 plus “Never”; series: 1997, 2003]
Mean: 2018 (2031) SD: 17 (124) Min: 1998 (1997) Max: 2100 (3001)
15. Telephones are answered by an intelligent answering machine that converses with the calling party to determine the nature and priority of the call.
[Chart: responses by predicted year, 2000 to 2040 plus “Never”]
Mean: 2022 (early 2000s) SD: 25 Min: 2000 Max: 2150
11. First legal case in which a recording of a person’s voice is thrown out because it cannot be proved whether a computer or a person said it.
“not evidence in many countries
[Chart: responses by predicted year, 1990 to 2090 plus “Never”; series: 1997, 2003]
Mean: 2025 (2050) SD: 29 (167) Min: 1995 (1990) Max: 2150 (3000)
20. Pocket-sized listening machines are commonly available for the hearing impaired.
[Chart: responses by predicted year, 1980 to 2380 plus “Never”]
Mean: 2026 (2009) SD: 32 Min: 2001 Max: 2275
16. The majority of automatic speech recognition systems have completely abandoned the HMM paradigm for acoustic modelling.
“impossible to answer”
“hope it’s soon”
“HMMs are here to stay, but the assumptions will
[Chart: responses by predicted year, 2005 to 2045 plus “Never”]
Mean: 2029 SD: 33 Min: 2007 Max: 2200
10. Public proceedings (e.g. courts, public inquiries, parliament etc.) are transcribed automatically.
[Chart: responses by predicted year, 1995 to 2095 plus “Never”; series: 1997, 2003]
Mean: 2030 (2041) SD: 26 (128) Min: 2006 (2000) Max: 2150 (3001)
9. A leading cause of time away from work is being hoarse from talking all the time, and people buy keyboards as an alternative to speaking.
“keyboards will always be shipped”
“no problem, there will always be advanced pills available”
[Chart: responses by predicted year, 2000 to 2100 plus “Never”; series: 1997, 2003]
Mean: 2030 (2103) SD: 31 (287) Min: 2006 (1998) Max: 2150 (3020)
14. The majority of automatic speech recognition systems have completely abandoned the n-grams paradigm for language modelling.
“most deployed systems use CFG anyway”
“impossible to answer”
[Chart: responses by predicted year, 1995 to 2055 plus “Never”]
Mean: 2033 SD: 39 Min: 1995 Max: 2200
13. The majority of text is created using continuous speech recognition.
“for authoring
[Chart: responses by predicted year, 2000 to 2100 plus “Never”]
Mean: 2039 (2009) SD: 48 Min: 2000 Max: 2300
17. Most routine business transactions take place between a human and a virtual personality (including an animated visual presence that looks like a human face).
[Chart: responses by predicted year, 1990 to 2090 plus “Never”]
Mean: 2041 (2009) SD: 66 Min: 1994 Max: 2500
18. Translating telephones allow two people across the globe to speak to each other even if they do not speak the same language.
“depends on the task”
[Chart: responses by predicted year, 2000 to 2100 plus “Never”]
Mean: 2057 (early 2000s) SD: 116 Min: 2000 Max: 3000
12. Speech recognition accuracy equals that of the average (individual) human transcriber.
[Chart: responses by predicted year, 1995 to 2095 plus “Never”; series: 1997, 2003]
Mean: 2064 (2046) SD: 222 (124) Min: 2005 (1997) Max: 3827 (3001)
19. Most interaction with computing is through gestures and two-way natural-language spoken communication.
[Chart: responses by predicted year, 2000 to 2040 plus “Never”]
Mean: 2069 (2019) SD: 225 Min: 2004 Max: 3827
6. It is possible to hold a telephone conversation with an automatic chat-line system for more than 10 minutes without realising it isn’t human.
“could happen, but should not”
“why would one want that?”
“hopefully never”
[Chart: responses by predicted year, 1995 to 2195 plus “Never”; series: 1997, 2003]
Mean: 2086 (2128) SD: 228 (328) Min: 2000 (1998) Max: 3579 (4001)
8. No more need for speech research.
“GASP!”
[Chart: responses by predicted year, 1980 to 2380 plus “Never”; series: 1997, 2003]
Mean: 2342 (2240) SD: 1308 (546) Min: 1981 (1984) Max: 10000 (5001)
Overall Impressions
• High level of participation – thanks to the 105 who responded
• Remarkably consistent with the 1997 survey
• Strong evidence of the ‘Church Effect’
• Neither more optimistic nor more pessimistic
• More agreement & more realistic
• People less willing to be associated with their opinions than 6 years ago
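One way to probe the ‘Church Effect’ bullet is to compare the per-statement means from the two surveys. The sketch below assumes the effect refers to predicted dates sliding later by roughly the six years between the surveys; the slides do not spell this out. The numbers are copied from the statement slides for the 12 statements with both 2003 and 1997 results. The cutoff `SD_CUTOFF` and the helper `shift` are hypothetical, introduced only to set aside statements whose 1997 responses were widely spread.

```python
from statistics import mean

# statement number -> (2003 mean, 1997 mean, 1997 SD), as quoted on the slides
results = {
    1: (2009, 2001, 3),
    2: (2010, 2003, 4),
    3: (2018, 2031, 124),
    4: (2016, 2010, 12),
    5: (2013, 2015, 57),
    6: (2086, 2128, 328),
    7: (2016, 2008, 8),
    8: (2342, 2240, 546),
    9: (2030, 2103, 287),
    10: (2030, 2041, 128),
    11: (2025, 2050, 167),
    12: (2064, 2046, 124),
}

SD_CUTOFF = 20  # hypothetical threshold, chosen only for this illustration


def shift(statement: int) -> int:
    """Years by which the mean prediction moved between 1997 and 2003."""
    mean_2003, mean_1997, _ = results[statement]
    return mean_2003 - mean_1997


# Statements whose 1997 responses were tightly clustered.
tight = sorted(s for s, (_, _, sd_1997) in results.items() if sd_1997 < SD_CUTOFF)
tight_shifts = [shift(s) for s in tight]

for s in tight:
    print(f"statement {s}: mean moved by {shift(s):+d} years")
print("average shift:", mean(tight_shifts))
```

With this cutoff, statements 1, 2, 4 and 7 are selected, and their means moved later by 6 to 8 years. The remaining statements have 1997 standard deviations of 57 or more, so their shifts are harder to interpret this way.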
20/20 Speech Ltd.
Science Park, Malvern, Worcs., WR14 3SZ, UK Tel: +44 1 684 585101 Fax: +44 1 684 585151
http://www.2020speech.com http://www.aurix.com