Copyright © 2003 20/20 Speech

Speculating on the Future for Automatic Speech Recognition

A Survey of Attendees by

THANK YOU!

“It is hard to predict …” “… especially the future.”

The Survey(s)

• 12 of the 20 statements were exactly the same as those posed to the participants of ASRU’97 six years ago

• A couple were suggested by the ASRU’04 Technical Committee

• The rest were taken from Ray Kurzweil’s books …

Predictions from Ray Kurzweil

“A PC will have the computational power of the human brain by 2019, and will be equivalent

Some Overall Statistics

                    2003    (1997)
attendees:          222     (180)
forms returned:     47%     (45%)
overall mean:       2055    (2056)
“never”s:           24%     (17%)
named responses:    4       (18)
“2020”s:            10%     (7%)
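As a rough cross-check of the figures quoted in this deck, the sketch below works out how many returned forms are consistent with each stated return rate. It assumes the slide percentages were rounded to the nearest whole percent, which the slides do not state; the function name `implied_respondents` is made up for this illustration.

```python
# Figures quoted on the slides (attendees and "forms returned" percentages).
ATTENDEES = {2003: 222, 1997: 180}
RETURN_RATE_PCT = {2003: 47, 1997: 45}


def implied_respondents(attendees, rate_pct):
    """Return (lowest, highest) whole respondent counts that round to rate_pct.

    Assumption: the slide percentage is the true rate rounded to the
    nearest whole percent.
    """
    counts = [n for n in range(attendees + 1)
              if round(100 * n / attendees) == rate_pct]
    return counts[0], counts[-1]


for year in (2003, 1997):
    low, high = implied_respondents(ATTENDEES[year], RETURN_RATE_PCT[year])
    print(year, low, high)
```

Under that rounding assumption, the 2003 range includes the 105 respondents thanked in the closing slide.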

The ‘Church Effect’

[Chart: distribution of predicted years from 2000 to 2100, 1997 and 2003 surveys; annotation: “6 years!”]

1. More than 50% of new PCs have dictation on them, either at purchase or shortly after.

“won’t be used” “already comes with Office XP” “now … but not used”

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2009 (2001); SD: 7 (3); Min: 2000 (1997); Max: 2050 (2010)

2. Most telephone Interactive Voice Response (IVR) systems accept speech input (and more than just digits).

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2010 (2003); SD: 10 (4); Min: 2000 (1998); Max: 2060 (2020)

5. Automatic airline reservation by voice over the telephone is the norm.

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2013 (2015); SD: 10 (57); Min: 2002 (1999); Max: 2050 (2500)

4. Speech recognition is commonly available at home (e.g. interactive TV, control of home appliances and home management systems).

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2016 (2010); SD: 15 (12); Min: 2004 (1999); Max: 2100 (2100)

7. Voice-enabled command, control and communication in cars becomes as common as intermittent wiper, power window or power door lock.

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2016 (2008); SD: 13 (8); Min: 2004 (1999); Max: 2075 (2050)

3. TV closed-captioning (subtitling) is automatic and pervasive.

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2018 (2031); SD: 17 (124); Min: 1998 (1997); Max: 2100 (3001)

15. Telephones are answered by an intelligent answering machine that converses with the calling party to determine the nature and priority of the call.

[Chart: distribution of predicted years]
Mean: 2022 (early 2000s); SD: 25; Min: 2000; Max: 2150

11. First legal case in which a recording of a person’s voice is thrown out because it cannot be proved whether a computer or a person said it.

“not evidence in many countries

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2025 (2050); SD: 29 (167); Min: 1995 (1990); Max: 2150 (3000)

20. Pocket-sized listening machines are commonly available for the hearing impaired.

[Chart: distribution of predicted years]
Mean: 2026 (2009); SD: 32; Min: 2001; Max: 2275

16. The majority of automatic speech recognition systems have completely abandoned the HMM paradigm for acoustic modelling.

“impossible to answer” “hope it’s soon” “HMMs are here to stay, but the assumptions will

[Chart: distribution of predicted years]
Mean: 2029; SD: 33; Min: 2007; Max: 2200

10. Public proceedings (e.g. courts, public inquiries, parliament etc.) are transcribed automatically.

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2030 (2041); SD: 26 (128); Min: 2006 (2000); Max: 2150 (3001)

9. A leading cause of time away from work is being hoarse from talking all the time, and people buy keyboards as an alternative to speaking.

“keyboards will always be shipped” “no problem, there will always be advanced pills available”

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2030 (2103); SD: 31 (287); Min: 2006 (1998); Max: 2150 (3020)

14. The majority of automatic speech recognition systems have completely abandoned the n-grams paradigm for language modelling.

“most deployed systems use CFG anyway” “impossible to answer”

[Chart: distribution of predicted years]
Mean: 2033; SD: 39; Min: 1995; Max: 2200

13. The majority of text is created using continuous speech recognition.

“for authoring

[Chart: distribution of predicted years]
Mean: 2039 (2009); SD: 48; Min: 2000; Max: 2300

17. Most routine business transactions take place between a human and a virtual personality (including an animated visual presence that looks like a human face).

[Chart: distribution of predicted years]
Mean: 2041 (2009); SD: 66; Min: 1994; Max: 2500

18. Translating telephones allow two people across the globe to speak to each other even if they do not speak the same language.

“depends on the task”

[Chart: distribution of predicted years]
Mean: 2057 (early 2000s); SD: 116; Min: 2000; Max: 3000

12. Speech recognition accuracy equals that of the average (individual) human transcriber.

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2064 (2046); SD: 222 (124); Min: 2005 (1997); Max: 3827 (3001)

19. Most interaction with computing is through gestures and two-way natural-language spoken communication.

[Chart: distribution of predicted years]
Mean: 2069 (2019); SD: 225; Min: 2004; Max: 3827

6. It is possible to hold a telephone conversation with an automatic chat-line system for more than 10 minutes without realising it isn’t human.

“could happen, but should not” “why would one want that?” “hopefully never”

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2086 (2128); SD: 228 (328); Min: 2000 (1998); Max: 3579 (4001)

8. No more need for speech research.

“GASP!”

[Chart: distribution of predicted years, 1997 and 2003 surveys]
Mean: 2342 (2240); SD: 1308 (546); Min: 1981 (1984); Max: 10000 (5001)

Overall Impressions

• High level of participation – thanks to the 105 who responded

• Remarkably consistent with the 1997 survey

• Strong evidence of the ‘Church Effect’

• Neither more optimistic nor more pessimistic

• More agreement & more realistic

• People less willing to be associated with their opinions than 6 years ago

20/20 Speech Ltd.

Science Park, Malvern, Worcs., WR14 3SZ, UK
Tel: +44 1 684 585101
Fax: +44 1 684 585151

http://www.2020speech.com
http://www.aurix.com
