Open Hypermedia and Streaming Audio

(1)

research or study, without prior permission or charge. This thesis cannot be

reproduced or quoted extensively from without first obtaining permission in writing from the copyright holder/s. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the

copyright holders.

When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given e.g.

AUTHOR (year of submission) "Full thesis title", University of Southampton, name of the University School or Department, PhD Thesis, pagination

(2)

1 Introduction

4

2 Background

6

2.1 Introduction: The Five Senses . . . 6

2.2 The Audio Domain . . . 8

2.3 Multimedia Authoring Tools . . . 9

2.3.1 The Timeline authoring method . . . 9

2.3.2 The Flowchart authoring method . . . 10

2.3.3 The \Book" Metaphor . . . 12

2.4 The World Wide Web (WWW) . . . 13

2.5 Existing Standards . . . 16

2.5.1 The Multimedia and Hypermedia Experts Group . . . 16

2.5.2 The Hypermedia/Time-based Structuring Language . . . 19

2.6 Hypermedia Systems . . . 22

2.6.1 Hypertext and Hypermedia . . . 22

2.6.2 \Open" Hypermedia Systems . . . 25

2.6.3 Hypermedia Systems and Audio . . . 26

2.7 The Audio-Linker Tool . . . 29

2.7.1 The Design . . . 29

2.7.2 The Implementation . . . 30

3 Streaming Media Protocols

34

3.1 The Traditional Protocols - TCP/IP . . . 34

3.2 The Real-time Transport Protocol (RTP) . . . 36

(3)

4 The Sound Viewer Tool

41

4.1 The original Sound Viewer Tool . . . 41

4.1.1 An Overview . . . 41

4.1.2 The Interface . . . 43

4.1.3 The Implementation . . . 46

4.1.4 The original Sound Viewer case study . . . 47

4.2 The streaming Sound Viewer . . . 49

4.2.1 An Overview . . . 49

4.2.2 The Design . . . 51

4.2.3 The Implementation and case study . . . 54

5 Conclusion

61

5.1 Future Work . . . 63

(4)

2.1 A screen-shot of a presentation in Flash. . . 11

2.2 A screen-shot of an Authorware presentation. . . 12

2.3 MatchWare's Medi8or design process. . . 13

2.4 The Audio Linker Application . . . 33

4.1 The Sound Viewer Interface. . . 44

4.2 The Audio Device Layers. . . 48

4.3 The original Sound Viewer demonstration. . . 49

4.4 The Audio Device Layers with RTSP. . . 52

4.5 Client / Server interaction for the RTSP \open" function . . . 56

4.6 Client / Server interaction for the RTSP \get" function . . . 57

4.7 Client / Server interaction for the RTSP \get" function (cont'd) . . . . 58

4.8 Client / Server interaction for the RTSP \play" function . . . 59

4.9 The streaming Sound Viewer. . . 60

(5)

Introduction

Audio media is one of the most neglected areas in the Hypermedia / Multimedia domain. To play sound les in traditional hypermedia systems, for example the World Wide Web (WWW), users click upon links (or hyperlinks) within a browser. This action will download the sound le to the local machine and then activate an application to play the le. More recently however, applications can stream the le over the network instead, thus reducing the overhead of having to download it rst.

In multimedia applications such as Microsoft Encarta, users click on areas of the screen called \hot-spots" that will cause an event to occur. This event could cause an audio le to be played, a picture to be displayed or it could cause some other event. In both of these systems audio is usually just the result or side-eect of some action caused by the user.

In the Hypermedia domain the information required to traverse the links is embedded within the document using the HyperText Markup Language (HTML). In multimedia systems the link information is stored in an internal format, which the user does not have access to. This creates problems because only the original author of the hypertext or multimedia document can modify, edit or create new links. Therefore because this link information is dicult to modify, these systems are known as \closed" systems.

The aim of this research is to investigate how audio could be used in Open Hypermedia Systems, specically how links can be used with streaming audio.

Chapter 2

introduces the audio domain and describes how audio has been used in multimedia authoring tools, \Open" Hypermedia Systems and the World Wide Web.

(6)

This chapter also describes existing standards used in both the hypermedia and multi-media communities and the development of a simple audio tool.

Chapter 3

describes the traditional internet protocols, TCP/IP and the protocol used to transport real-time information over TCP/IP, the Real-time Transport Protocol (RTP). A new streaming protocol, the Real Time Streaming Protocol (RTSP), is then discussed.

Chapter 4

gives an overview of the original Soundviewer Tool and how it has been used. The design and implementation of a streaming audio extension to this tool is also discussed. Finally, an example scenario is outlined for the use of this new tool.

(7)

Background

This chapter describes the audio domain and how it is used in a number of systems, specically multimedia authoring tools, the World Wide Web and \Open" Hypermedia Systems. Existing standards, such as MHEG and HyTime are discussed and the chapter draws to a close by describing how a simple audio tool can be created using a modern programming language e.g. Visual Basic.

2.1 Introduction: The Five Senses

Traditionally, vision has always been regarded as the primary sense for normally sighted people. In general, however, humans interact with the outside world by receiving in-formation through a mixture of the ve senses1, processing this information and then

reacting or responding to it. For example, the sense of smell and sight could be used to determine if a piece of food had gone o. The user might react to this situation by disposing of the food. In the development of graphical user interfaces, all of the other senses are regarded as being less important than sight. This is mainly due to the complexity of the other senses.

Dix et al. [1] describe how the majority of interactive computer systems are com-pletely visual in nature, oering rudimentary audio support. As the complexity of these systems increase, more and more visual information could be presented on the screen, making it harder for the user to understand. The authors discuss how the other sensory

1Sight, hearing, touch, taste and smell.

(8)

channels could be used to relieve the pressure of the visual channel and thus reduce the information overload. By increasing the number of sensory channels, users would be able to interact with their computers in the same way that they would interact with their everyday environment.

Most commercial computer systems, however, provide limited support for two of the other senses, hearing and touch. Systems that produce a haptic response (the sense of touch) are being used by the virtual reality (VR) community and joystick manufacturers. For example, VR users can wear a special type of glove which contains small inatable pockets. As the user in the VR world picks up an object, these pockets ll with air giving the impression that the user has actually picked up that object. Joystick manufacturers have developed tactile feedback joysticks which have small motors that move the joystick depending on the type of feedback required.

These interactive systems also support audio, although traditionally it is only used to provide warnings, alarms and status information. Most modern operating systems such as Microsoft' s Windows 95 or NT also provide support for soundcards, which can be used to record, edit and playback audio samples. The majority of the time, however, soundcards are used for playing games. Dix et al. [1] describe how users, when playing a game, will score more points when the sound is turned on rather than o. Users can detect vital information and clues via the changes in the sound, which can be used with the visual information to increase their scores. The authors also describe how audio and visual information can help increase the accuracy of speech recognition systems. For example a camera can be used to video the lip movements of the speaker. The sounds will also be analysed. By using the video footage and the sound information, words and phrases can be more accurately resolved.

Overall these interactive systems use the visual channel as the main medium for transferring information. The auditory channel is rarely used, although the amount of information that can be conveyed using audio is underestimated. The following section describes the audio domain in more detail.

2There are two types of channel, input and output. With humans an input channel represents one of

(9)

2.2 The Audio Domain

To understand the audio domain an overview of the human auditory system is required. Dix et al. [1] and Moore [2] both describe how the human ear consists of three parts and these are:

1. The Outer Ear which consists of the pinna (the visible part of the ear) and the auditory channel. Both the pinna and the auditory channel amplify certain high frequencies, whilst the channel itself secretes a waxy substance that prevents in-sects and dirt from reaching the more sensitive middle ear.

2. The Middle Ear consists of a small cavity containing three of the smallest bones in the human body, the ossicles. These connect the outer ear via the eardrum or tympanic membrane to the cochlea in the inner ear.

3. The Inner Ear. This consists of the cochlea which has rigid bony walls and is lled with a special type of uid. The cochlea also contains tiny hairs or cilia which move when the uid vibrates. This vibration causes small electrical impulses to be passed up the auditory nerve to the brain.

Moore [2] describes how the process of hearing originates with the vibration of an object. This vibration causes a pattern of changes to occur within the surrounding medium (usually air), which results in the creation of a sound wave. This wave travels through the air until it eventually reaches the outer ear, mentioned above. The sound wave will then pass down the auditory channel to the ear drum, causing the drum to vibrate. The vibrations of the ear drum are passed via the ossicles, in the middle ear, to the cochlea. These vibrations change the pressure within the cochlea, which in turn moves the cilia.

This process of passing the sound waves from the outer to the inner ear, ensures the ecient transfer of the actual sound information. Dix et al. [1] describe the dierent characteristics of sound, such as the pitch and amplitude. If the pitch of the sound increases, then the frequency3 will also increase. The amplitude of the sound is

propor-tional to its loudness and therefore, an increase in the amplitude will cause an increase in the volume of the sound.

(10)

The human ear can also identify the sound's location, since the two ears receive slightly dierent sounds. If a sound occurs to the left of a users head, then the left ear will receive the sound wave rst. It will take longer to reach the right ear due to its location and the fact that the wave will also reect o the users head.

Overall, sound can convey a remarkable amount of information. The human ear can use this information to detect dierent types of sounds, where they are coming from and how far away they are. However, the sense of hearing has always been regarded as secondary to that of sight and this can be clearly seen in the development of computer systems. By combining this sound information with visual information, users would be able to interact with computers in a more natural way, see

Section 2.1

.

2.3 Multimedia Authoring Tools

This section describes some of the most popular multimedia authoring tools that are available today. Multimedia authoring allows users to combine text, hypertext, pictures, animation, sound and video into a single application that can be distributed on and over a variety of media, for example the internet, CD-ROMs, oppy disks etc. Users can design anything from an interactive Web site to an electronic product catalogue.

Designing a multimedia presentation, however, can take a lot of time and eort. Therefore the majority of multimedia packages try and reduce this by using a range of authoring techniques. These include the timeline and owchart methods and the \book" metaphor. The following sections give a brief overview of these methods and the products that use them (a more detailed description of the products are in [3]).

2.3.1 The Timeline authoring method

(11)

presentation will show a picture containing all of the elements in the layers e.g. blue sky, a beach and a palm tree. If layer one however only spans 5 frames from the beginning, then the blue sky would only show for 5 frames and then disappear for the remaining 5. Users can also modify a cast member in each frame of a single layer which will result in a simple animation, e.g. modifying a birds wings so that in one frame they are up and in another they are down, giving the impression of ight.

Both of these packages support audio. The Flash application allows users to import audio les directly into a layer, whereas Director uses a separate digital audio layer.

Fig. 2.1

shows a section of a simple presentation in the Flash program. A section of the timeline that contains audio samples, the \speaker ashes" layer, can be seen at the top of the screen. Director has support for up to 10 digital audio channels depending on the hardware used. Audio les can be played in the background of a presentation or they can be activated by several other types of event e.g. a mouse click, the transition from one scene to another etc.

Links can not be followed from within an audio event. Audio can be placed on one layer and at a certain time / frame an event can occur e.g. an URL red, screen transition, video started etc. on another layer. Everything, however, is followed when the \play head" touches the beginning of a cast member in the timeline. So sound again is just another element used within a presentation.

2.3.2 The Flowchart authoring method

Several products use this method and they include Macromedia's Authorware, Asym-etrix's IconAuthor and Linotype's Dazzler. All of these applications use \drag and drop" to pick up and place icons on the presentation page. These icons represent:

events such as mouse clicks, key press,

actions to be performed after an event e.g. a transition, a sound,

routines to perform loops, conditional branches etc.

(12)

Figure 2.1: A screen-shot of a presentation in Flash.

the icon's properties, which can be easily changed. A presentation is built by inserting one object after another e.g. a simple application could contain just three icons; the rst could be a picture, the second a sound icon and the third a text icon. When the presentation is started the user would see and hear all three icons together.

Fig. 2.2

is a screen-shot of Macromedia's Authoware program.

(13)

Figure 2.2: A screen-shot of an Authorware presentation.

2.3.3 The \Book" Metaphor

Asymetrix's ToolBook, Digital Workshop's Illuminatus, Scala Computer Television's MM200 and MatchWare's Medi8or all use this method of authoring. Basically when the application is started, the user is shown a page in which certain objects can be placed e.g. text, pictures, buttons etc. By inserting objects into several pages, a multimedia \book" is eventually created. The author can create transitions between pages and on the objects themselves e.g. zoom text in and out, cause a picture to ow onto the page etc.

Fig. 2.3

is a screen-shot of the design process used in MatchWare's Medi8or program.

(14)

Figure 2.3: MatchWare's Medi8or design process.

being just another entity or object used within the presentation. The majority of the time audio output is caused by an event e.g. a button being pressed, the mouse cursor moving over a hotspot etc.

2.4 The World Wide Web (WWW)

The World Wide Web or W3 was originally designed in 1990 by Tim Berners-Lee at

(15)

was very dicult, if not impossible, to create and then \jump" to these references. The WWW overcame these problems by dening three new platform-independent and network-neutral components. These were:

1. The Universal Resource Locator (URL) addressing scheme. This consists of a simple string, which contains the type of protocol to use, e.g. FTP, HTTP etc., the name of the computer on the internet and the actual name of the document to be retrieved. For example \http://www.idiscover.co.uk/linux/linuxdoc/index-.html".

2. The HyperText Transfer Protocol (HTTP). This protocol was designed so that information could be eciently retrieved, for the purpose of making hypertext \jumps"; see

Section 2.6.1

for a description of hypertext.

3. The HyperText Markup Language (HTML) is used to markup, see

Section 2.5.2

, documents that will be used on the WWW. This process enables authors to pre-pare and format their documents using a mixture of text, pictures, sound, video and hypermedia links. See

Section 2.6.1

for a description of hypermedia. Users can click on these links to \jump" to related items. By using this language, authors ensure that their documents will look the same on dierent WWW browsers. A WWW browser or client is an application that renders an HTML document into a form that can be displayed on the user's screen. Netscape's Navigator and Microsoft's Internet Explorer are two of the most common browsers.

These three components form the core architecture of the WWW. Originally the WWW was only designed to be used on wide-area networks (WANs). With the growth of the Internet however, see

Section 3.1

, WWW browsers are now being used to read and retrieve information o the internet. As a result the WWW has steadily grown in size (the number of web \pages" being created) and popularity.

(16)

le once it had been downloaded. The process of downloading a le, however, could take a long time especially if the le was quite large and / or the network connection was very poor.

To overcome this problem, several companies, such as RealNetworks, Macromedia, Xing Technology etc., have created streaming audio servers and players. A streaming audio link is represented on a web page in exactly the same way as an audio link. This link will contain information about the audio server and the audio le to be streamed. When a user clicks upon this link, three events occur and they are:

1. The browser activates the relevant streaming audio player. This player will then send a message, containing the name of the audio le to be streamed, to the audio server.

2. The audio server splits the audio le into smaller packets and transmits them to the player.

3. The player buers these packets until enough have been received, so that they can be played.

On a reasonably fast network connection the audio le is played almost immediately, giving the impression that the audio le is stored on the local machine. Otherwise it could take a long time for enough information to be received so that it can be played.

The current internet protocol, which the WWW uses, was not designed to handle streaming audio information. This protocol is called TCP/IP and is described in more detail in

Section 3.1

. To overcome this problem several protocols have been developed and they are the Real-time Transport Protocol (RTP) and the Real Time Streaming Protocol (RTSP). These protocols are discussed in more detail in

Sections 3.2

and

3.3

, respectively.

(17)

2.5 Existing Standards

Several international standards have been created in the hypermedia and multimedia community. These are described in more detail in the following sections.

2.5.1 The Multimedia and Hypermedia Experts Group

The Multimedia and Hypermedia Experts Group4 (MHEG) was formed by a

subcom-mittee5 of the International Standards Organisation (ISO) to address the problems of

trying to design a software-neutral interactive multimedia presentation tool. There are several parts to MHEG and a paper by Rodriguez et al. [5] gives a brief description of MHEG parts 1 to 5.

Boudnik and Eelsberg [6] describe MHEG-1, which was the initial specication for robust multimedia data structures using ISO's own Abstract Syntax Notation (ASN.1). This notation describes all of the data structures (or MHEG classes) in an abstract syntax, which ensures that applications conforming to this standard will be able to communicate. An interactive presentation is formed by creating instances of these MHEG classes or MHEG-Objects and forming interrelationships between these objects. MHEG-1 denes several classes, of which four could be used for inserting links into streaming audio. These are:

The MH-Object class which is the root class inherited by all of the other classes.

The Content class which contains the data that the user sees and / or hears.

The Link and Action classes which describes what actions are performed when a

user activates a particular link object.

The Multiplexed Content class which is derived from the content class and either

contains or refers to the multiplexed stream data. It also assists in inter-stream synchronisation, such as lip synchronisation.

Each object when it is created can contain extra information about its original size and its play-out duration. This information is stored as virtual co-ordinates from a generic space and a virtual timeline, respectively. At run-time this extra information is

4Also known as the Multimedia and Hypermedia information coding Experts Group.

(18)

converted into real-time requirements, such as screen co-ordinates and a particular type of timer, depending on the type of hardware being used.

MHEG-2 is exactly the same as MHEG-1, except that the classes are dened in SGML (Standard Generic Markup Language) instead of ASN.1. SGML is described in more detail in

Section 2.5.2

.

A paper by Rutledge et al. [7] gives an overview of the MHEG-3 project which is an extension to MHEG-1. This part of the standard was created to increase the interactivity between multimedia objects and the environments that they run in. This is achieved by using :

1. Scripting languages6 - a script is a program containing a set of procedures that

can be used to monitor events that are created by particular objects, for example a user clicking on a button will cause an event. As a result of these events being generated, the script will execute certain actions, possibly on other objects. This allows objects to interact with each other.

2. A Virtual Machine - this is used to create a mapping table between the run-time services provided by a particular platform / environment and the interface of the scripting language, used to generate the script. This allows interactivity between the scripts and the platform being used for the presentation.

Overall MHEG-3 increases the functionality of MHEG-1, by allowing users to create presentations using external programs.

The fourth part of the MHEG standard (MHEG-4) was a simple extension and is used to register objects and formats supported by MHEG, e.g. MPEG, JPEG etc.

Joseph and Rosengren [8] give an overview of the MHEG-5 standard, which was de-signed to extend the class hierarchy of the initial MHEG-1 specication. This extended hierarchy contains a set of new classes that can be used to develop client / server multi-media applications across platforms with limited resources. This ensures that MHEG-5 conformant applications will run on conformant terminals.

MHEG-5 denes ve classes that are relevant to the streaming audio and open hy-permedia research domain. These are:

6These are also known as

(19)

The LinkEect and the LinkCondition classes. A LinkCondition is triggered by an

event, which if it matches certain conditions will run a LinkEect. The LinkEect contains a list of elementary actions to be carried out.

The Stream class, which provides the functionality to multiplex audio and video

objects, present them in synchronisation and create links from MHEG objects to specic points in the stream (stream-events). Links can also be created to specic user-dened events.

The Audio class, which encapsulates audio objects and can be used with streams.

The HyperText class, which allows users to associate objects with a link to, for

instance, another page. Objects can be words, groups of words, pictures etc. An example of an MHEG-5 application is the \GlassWWWay" tool developed by Geyer et al. [9]. The original system, known as the \GLobally Accessible ServiceS" (GLASS), used MHEG-1 to provide several dierent services such as video on demand, interactive TV or online shopping in an easy-to-use interface. However, due to the prob-lems of inferior browser technology and that the developers thought that MHEG-1 was too abstract, they decided to develop a new system based on the MHEG-5 standard. By re-dening the MHEG-5 classes in Java, they designed a Java applet which could be downloaded into any Java-enabled World Wide Web (WWW) browser. This also provided better communication between the users and the GLASS server.

This paper concentrates mainly on the benets and drawbacks of using Java to imple-ment the MHEG-5 engine and so the audio domain is not imple-mentioned. It is not known whether this system provides the functionality to stream audio or to create links from certain points within an audio stream, which MHEG-5 can. The system does provide basic sound support for interactive television and the other services mentioned above. At this stage, however, Java has very limited sound capabilities and therefore the func-tionality needed to support sound would have to be supplied by external functions7.

The MHEG standard contains many sections of which part ve seems to provide enough information to develop a tool to create links to and from the streaming audio domain. MHEG at the moment, however, is used as a tool to develop visual, easy-to-use, interoperable client / server systems, of which audio is just a small part of this.

7These are also called

(20)

2.5.2 The Hypermedia/Time-based Structuring Language

The Hypermedia / Time-based Structuring Language (HyTime) became an ISO stan-dard (ISO/IEC 10744:1992) in 1992 and uses the Stanstan-dard Generic Markup Language (SGML) to describe document architectures. HyTime evolved from the work of the ANSI X3V1.8M committee on the development of a Standard Music Description Lan-guage (SMDL) and to understand the concepts of HyTime, a brief description of SMDL and SGML will also be given in this section.

A paper by Carr et al. [10] describes traditional markup and the equivalent logical and physical markup techniques in use today. Markup can be described as the process of in-serting specic commands or codes into the original text which the compositor program (or traditionally a human) uses to compose the nal document. Physical markup re-quires the user to explicitly place certain commands, such as `bold' and `centre', around the text to be rendered8; whereas logical markup requires the program to interpret and

render abstract commands, such as \this text is a heading", inserted by the user. This paper describes the Standard Generic Markup Language as taking logical markup to the extreme. SGML does not dene default markup commands or tags; the user creates these by using SGML's logical elements and physical entities. In ArborText's SGML White Paper [11], logical elements are described as pieces of data that may contain either text or other subelements such as chapters and paragraphs etc. Each element also contains their own generic identier (GI), which is used for the markup of documents. Physical entities are described as self-contained pieces of data, e.g. a separate text le, a separate graphic le etc., that can be referenced as a unit.

These elements and entities are then stored in a Document Type Denition (DTD) le, which can be used as a template for the structure of other documents. The syntax of DTDs is very strict which ensures that they will be understood by other SGML-compliant applications. It is upto the application, however, to interpret the `meaning' of the document structure.

Mounce [12] describes in his paper, how the Standard Music Description Language (SMDL) was initially dened using a SGML DTD in 1988. SMDL is dened9 as \an 8The process of turning non-graphical information into a form that can be represented in a graphical

way e.g. a printer renders information into a form that can be printed.

(21)

architecture for the representation of music information, either alone, or in conjunction with text, graphics or other information needed for publishing or business purposes." Basically, SMDL divides a musical work into four domains and these are:

1. The Logical Domain or cantus. This will contain all of the logical information about a piece of music. The cantus can be described as an abstract timeline on which all events can be scheduled e.g. Eliens et al. [13] suggest that the cantus can contain information to do with automated lighting.

2. The Visual Domain. This graphically represents the music in some form e.g. a link to an image le (JPEG, GIF etc.) or to a coded music le.

3. The Gestural Domain. This can contain one or more links to the actual perfor-mance of the musical work e.g. a link to a MIDI le.

4. The Analytical Domain. This contains commentaries or theoretical analyses of all of the other domains.

A le written in SMDL would traditionally contain a cantus, links to one or more graphical representations of the musical work and one or more links, again, to the actual performance of the cantus. This initially provided enough information to create simple presentations, but with the development of digital audio and multimedia, there was a need to extend the functionality of the original SMDL document type denition (DTD). With each subsequent development of the DTD, more and more information was submitted until eventually a separate standards activity was initiated in 1989. This research produced the Hypermedia/Time-based Structuring Language (HyTime) which extended SGML and became a full International Standard in April 199210.

Newcomb et al. [14] describe how HyTime was originally dened using a SGML DTD, which contained the HyTime-specic generic identiers (GIs). The standards committee realised, however, that these identiers could not be changed because HyTime compliant applications needed them to recognise HyTime documents. This reduced the expres-siveness of SGML and as a result, the next draft of the HyTime standard replaced the GIs with SGML architectural forms. In Newcomb's [15] paper, an architectural form denes elements with a standard meaning, e.g. independent hyperlink, and syntax for

(22)

its associated data. An architectural form also denes an attribute type for the element, which is used for identication purposes. Multimedia systems would then use this iden-tier to access only the relevant elements and hence parts of HyTime that they require. HyTime usually provides a standard set of forms to build hypermedia / multimedia documents and therefore, is often referred to as a meta-DTD.

One of the main problems of the hypermedia / multimedia industry is its inability to store, describe and transport media objects in an application-neutral manner. New-comb [16] discusses how the HyTime / SGML paradigm addresses this problem, by encapsulating the information using abstract semantics. This ensures that the informa-tion can co-exist in a variety of applicainforma-tions and contexts.

HyTime consists of six main modules. A paper by Newcomb [15] and Newcomb et al. [14] give a thorough description of each one. They are:

1. The Base Module which includes facilities to manage documents, handle name collision between HyTime-specic and user-dened identiers and an ability to track certain activities e.g. the creation, modication and deletion of links between objects. This module also contains SGML itself.

2. The Location Address Module. SGML uses an unique `identier' and an `identier reference', #ID and #IDREF respectively, to refer to particular elements within

a document. SGML can also create simple links to particular entities e.g. a graphic le. These identiers and links, however, can only be used within the local scope of the document that dened them. HyTime, again, extends this by providing location address, i.e. a pointer, architectural forms. These forms include methods to address locations by name and by position using dimension(s) and axis, e.g. a substring within a string, a word in a sentence, a node within a tree etc. HyTime uses this address information to \resolve" the address and hence recover the information at that location. This can be within the local document or in another document.

(23)

button and the contextual link (clink) which always has two link ends, with one of them being the clink's own location e.g. a footnote.

4. The Measurement Module. This denes a way to address document objects using abstract measurable domains such as space and time.

5. The Scheduling Module which uses nite co-ordinate spaces (FCSs) to dene any number of axes. These axes can represent anything that can be measured or counted e.g. time, money, temperature etc. Objects within FCSs are called events and these are used by the rendition module.

6. The Rendition Module. This module uses two constructs, the proscope and modscope. The proscope can be used to project certain parts of events onto

another FCS e.g. show only this section of a map, whereas themodscope can be

used to modify an event e.g. change the colour. This module calls a schedule of proscopesa baton and a schedule of modscopesa wand. A wand and then a

baton can be applied to a particular event, e.g. take this picture of a crowd, change the colour and then only project a particular section of the crowd. Generally the resulting FCS is in a form that the user can see and / or hear.

2.6 Hypermedia Systems

To fully understand how the audio domain has been used in Hypermedia Systems (HSs), a brief overview of the origins of hypertext and hypermedia has to be given. This section also discusses how hypertext and hypermedia have been used in rst and second generation systems, respectively. Finally this section describes how the audio domain has been supported traditionally in second generation systems and more recently in \open" hypermedia systems.

2.6.1 Hypertext and Hypermedia

(24)

in which traditional text can be viewed. Conklin [17] describes how these systems allow references to be created between dierent chunks of text, which can be in the same or another document. This type of text is called nonlinear text or hypertext because the path through the document can branch-o to other documents via these references.

Three of the main contributors to the area of hypertext were Vannevar Bush, Douglas Engelbart and Theodore Nelson. Conklin [17] discusses each of their hypertext systems and they were:

1. Bush's Memex system. In 1945 Vannevar Bush11 predicted a rapid growth in the

amount of scientic literature and the need to create a way in which this large body of information should be browsed. In his article, see Bush [18], Bush describes how the human mind works by associating related pieces of information. He applied this concept to a machine, called the Memex, which allowed the user to tie two relevant pieces of information, from two separate documents, together. This idea of association is credited as being the rst attempt to describe hypertext.

2. Engelbart's oN Line System (NLS/Augment). In 1963, Engelbart described a computer system that would augment man's intellect, by allowing the user to interact with the system using special cooperative devices12. As a result the

amount of information that a user could manipulate and understand would steadily increase, eectively \amplifying" the native intelligence of the user. The NLS system was implemented ve years later at the Stanford Research Institute. It allowed users to create any number of links between elements within a document and between the documents themselves. See Engelbart and English [19] for more details.

3. Nelson's Xanadu System. During the development of the NLS, Ted Nelson was also developing his own ideas about augmentation. Nelson's system would only allow the storage of documents in their original format and any modications made to these documents, e.g. a dierent paragraph etc. By using links between these modications and the original documents, previous versions could be easily reconstructed. New links could easily be created between dierent bodies of text

11President Roosevelt's science advisor.

(25)

and therefore new pathways could be formed through the material. It was from this system of linking large bodies of text together that Ted Nelson created the term \hypertext". Nelson's book, see Nelson [20], describes his ideas in more detail.

The term hypermedia can be described as an extension to hypertext. Hypertext systems allow users to author, edit and follow links between dierent bodies of text. Hypermedia systems, however, are similar to hypertext systems, except that the user can use other forms of media as well. For example, the authoring of links between an audio le and a body of text.

Halasz [21] describes how Engelbart's NLS/Augment system can be called a rst generation system because it used workstations with little or no graphics capabilities and it focused primarily on text. An overview of these systems can be found in Conklin [17]. Halasz goes on to say that in the early 1980's, second generation systems began to emerge, which used workstations with more advanced user interfaces and graphics. As a result these new systems would allow users to create references between dierent types of media, e.g. text, pictures etc., and hence they were called hypermedia systems. Example hypermedia systems are NoteCards [21], KMS [22] and Intermedia [23].

These second generation hypermedia systems originally used proprietary document formats to store the data. The links themselves were embedded into these documents, which made them considerably easier to transport. However this approach can cause several problems, especially with networks and distributed systems. For example, when a document is moved from one computer on the network to another, all links pointing to this document will have to be updated. Otherwise users will not be able to follow links to this document13. Similar problems will occur if documents are deleted and

the links are not updated or removed. The use of embedded links also made it very dicult to extend these systems, to support other types of media. External programs, that were not fully integrated into these systems, would have to be used. As a result of these problems, the second generation systems were called \closed" hypermedia systems. Goose [24], Beitner [25] and Halasz [21] describe these systems in more detail.

(26)

2.6.2 \Open" Hypermedia Systems

Goose [24] describes that in 1987, at a international hypertext conference, researchers started to express their concerns about the problems mentioned in the previous sec-tion. Several ideas were discussed, see Halasz [21]; for example new search and query mechanisms, management of dynamic information, more integration of existing applica-tions etc. As a result of these discussions, several American research groups dened, in 1989, a reference model for hypermedia. It was called the Dexter Hypertext Reference Model14, see Halasz and Schwartz [26] and it was designed to:

1. Dene both formally and informally the common abstractions found in a range of existing hypertext systems, e.g. NoteCards, Intermedia, KMS etc.

2. Serve as a standard, so that the functionality and characteristics of existing hy-pertext (and non-hyhy-pertext) systems could be compared.

3. Serve as a template, for the development of standards. These would assist in the interoperability and interchange between dierent hypertext systems.

The Dexter reference model is widely regarded as being one of the most important developments in hypermedia research.

A paper by Malcolm et al. [27] describes how hypermedia could be used in industry to integrate large amounts of data from specialist tools and applications. Malcolm, however, describes how the current (second) generation of hypermedia systems were incompatible with each other, as well as the tools and applications used in industry. As a result, Malcolm et al. dened several issues that needed to be addressed, e.g the ability to access and link across dierent platforms (interoperability), templates for common hypermedia structures and interaction with operating systems and networks etc.

Goose [24] describes that in 1991, at another hypertext conference, Halasz revisited his original ideas, see [21], that he placed before the hypermedia community. Halasz re-viewed the progress that had already been made and he also discussed the contribution made by Malcolm et al. [27]. As a result of these discussions Halasz presented several new areas of research, which focused primarily on the the development of \open" sys-tems with independent communicating processes and the way in which large amounts

(27)

of information could be managed and visualised on workstation screens. With the de-velopment of more \open" hypermedia systems, researchers would be able to overcome some of the problems associated with the second generation of hypermedia systems, see

Section 2.6.1

.

In [24], Goose gives a detailed overview of several hypermedia systems, that have embraced some of the concepts of \open" hypermedia. These include the World Wide Web, see

Section 2.4

, Hyper-G, Intermedia and Multicard. The following section describes why the audio domain has been neglected in hypermedia systems and how several \open" hypermedia systems have managed to overcome this.

2.6.3 Hypermedia Systems and Audio

Traditionally, the audio domain has been neglected in the development of hypermedia systems. There are several reasons for this and they include:

The lack of technology. The rst generation of hypermedia systems did not have

the computer technology to manipulate audio data. Engelbart's NLS/Augment system used a computer with a small amount of memory and a very simple display. Over time this technology has improved, e.g. the second generation systems could manipulate pictures and text using advanced workstations. The audio domain however, is quite complex and it requires more advanced technology, which is still being developed.

The problem of visualising audio information. Audio, by its very nature, can not

be seen. As a result, the ability to develop an intuitive graphical user interface, that will allow users to manipulate audio information, is quite dicult.

The dominance of the visual sense. Vision has always been regarded as the primary

sense for normally sighted people.

Section 2.1

describes this in more detail.

The problems of the audio le formats. Currently there are several le formats

(28)

can rapidly consume large amounts of disk space. The amount of memory and the processing power of a computer can also aect the size of an audio le. These problems have caused the majority of hypermedia systems to concentrate mainly on the authoring of links between text, images and video; the visual domain. With the development of more powerful computers, sound cards and \open" hypermedia however, it has become possible to develop applications that can be used in conjunction with these systems, to manipulate the audio domain.

Audio tools have been developed for two hypermedia systems and they are:

1. The SoundViewer tool for Microcosm. Microcosm15, see [28, 29, 30, 31], is an

open hypermedia system which was primarily developed to investigate new areas of hypermedia research. Goose and Hall [32] describe it as being a set of commu-nicating processes which supplement the facilities of the native operating system. Users interact with Microcosm using viewers which display dierent types of me-dia, e.g. text and video, and allow users to author and follow links to and from this media. Viewers can be either specically created for the system or existing third-party programs that have had their functionality extended; for example by using Microsoft Word macros.

When a user performs an action in the viewer, messages are passed from this viewer to Microcosm which then dispatches them through a chain of lters. These lters process the messages and then decide what sort of information should be returned to the user. The lters might return a set of links, which are stored in separate link databases or linkbases, that the user can then follow.

Originally Microcosm had little support for sound. When a user traversed a link to an audio le, a native application would be invoked to play this sample from the beginning to the end. Users could not create links from the application or to an area within the audio sample. As a result of this problem the SoundViewer tool was developed. This tool is discussed in more detail in

Section 4.1

.

2. The Sound (FT) module for MAVIS. MAVIS, the Microcosm Architecture for Video, Image and Sound, is an extension to Microcosm and it was also developed

(29)

at the University of Southampton by the Multimedia Research Group. Lewis et al. [33] describe the system as being a tool that allows users to author generic links16between text and non-text based media. MAVIS achieves this by using the

content of the non-text media as the key to navigation and retrieval of related information.

To navigate and retrieve information related to text is relatively straight forward because algorithms already exist that can match strings or sentences. With Mi-crocosm, a user selects a piece of text within a document and then tells the system that this selection is a generic link. The system will then create a representation of this text (a key) and then store this key in a linkbase. When users want to retrieve information relating to the original selected text, this key will be used with the string matching algorithms to nd other occurrences of the text.

With MAVIS, however, exact matching with non-text media is clearly impossible; the similarities between non-text media, for example images, have to be measured instead and these values can then be used to determine the closest match17. To

achieve this, MAVIS uses signatures to represent the content of a selection. Several signatures are often used since there are several dierent ways in which non-text objects can be matched, e.g. colour, texture etc. Each description of a signature is stored in a module and users can created their own signature modules, which MAVIS can then use.

MAVIS has a prototype signature module that supports links in sound. It is called the Sound FT module and it uses the Fourier Transform (FT) on a selected portion of a digital sound. The resulting sample can then be used as the key for matching. Lewis et al. [33] describe how single words can be articulated into the MAVIS system, which then displays the word as a sound wave. A section of this sound wave, representing the word, can then be chosen as a source anchor of a generic link. The Sound FT module will process this selection and generate the key. This key can then be used to navigate and / or retrieve information, e.g. text, images etc., related to the original audio sample.

16Microcosm supports the concept of generic links { links that can be followed from any point within

any document.

(30)

Both of these tools can be used to create links to and from the audio domain and they have managed to overcome some of problems mentioned at the beginning of this section, see

Section 2.6.3

.

2.7 The Audio-Linker Tool

The Audio-Linker tool was developed18 for a demonstration to show that links could

be created from audio to text and vice-versa. Originally only the Soundviewer tool and Microcosm were going to be used, see

Sections 4.1

and

2.6.3

respectively. However it was decided that a simpler audio tool, to show how the links could be created without having to use an underlying Open Hypermedia System e.g. Microcosm, should also be developed. The design and implementation of this simple tool are discussed in more detail in the following sections.

2.7.1 The Design

The requirements for the design of the tool were, again, very simple. It needed to be created quickly, due to the limited amount of time before the actual demonstration and so a Rapid Application Development (RAD) software tool had to be used, such as Visual Basic, Borland C++ Builder, Visual Cafe etc.

The graphical user interface (GUI) of the tool had to have components to display: 1. The audio sample. This component would also be used by the user to select a

portion of the audio to link to and from.

2. The audio controls, which would allow the user to play, rewind, forward and stop the audio le. In other words, buttons similar to the audio controls on a tape deck or CD player.

3. The text le. This component would also be used to select a portion of the text to link to and from.

The components to display the audio sample and the audio controls would have to interact with each other, so that when the \Play" control / button is pressed, the audio sample display would change to show the current position in the le. The components

(31)

to display the audio sample and the text le would also have to interact, so that any links between the two could be displayed and followed, e.g. a highlighted piece of text would represent a link to a particular part of the audio sample and if the user clicked on this text, the link would be followed and the audio sample would be played from the relevant position.

2.7.2 The Implementation

The tool was implemented using Visual Basic because it was easy to learn and use. Other RAD tools were not really considered because the Audio-Linker application had to be developed quickly, with the resources at hand. Visual Basic provided an environment where components, called OCXs19, can easily be made to interact with each other. By

inserting several of these components into a Visual Basic form, which represents the main application window, the Audio-Linker tool could be rapidly developed.

Several OCX controls were used for the application and they were:

A `Slider' control, which was used to graphically represent the audio sample. This

control consists of a horizontal bar which represents the length of the audio le and a vertical slider which was programmed to represent the current position in the audio le. By moving the slider, the user can move to dierent areas within the audio le. By holding down a particular keyboard key and sliding the vertical bar, a user could easily select a portion of the audio le to be linked to or from.

A `MMControl' control. This is a multimedia MCI

20control that consists of a row

of buttons, i.e. play, stop, pause etc., which allows the user to manipulate visual and audio media. In the Audio Linker, this control was set to play WAV sound les. This control, however, has enough functionality to support MIDI, CD Audio, CD-ROM drives and AVI (Microsoft's proprietary audio/video format) les.

A `RichTextBox' control, which was used to display a text le, in either plain

ASCII or rich text. This control was used to create links from text to audio.

19OCX is an extension to Microsoft's Object Linking and Embedding (OLE) which allows users to

embed objects e.g. Word documents, le dialog boxes, text boxes etc., into other documents.

20Media Control Interface. This is a set of high-level, device-independent commands that control

(32)

A `Timer' control, which was run whenever the play button was pressed. The

timer would then poll a certain function after a few seconds, which would update the vertical bar in the `Slider' control.

A `FileDialog' control, which was used to open project les.

When the application is executed, an \Open" le dialog box is displayed. This is used to load a project le, which contains the name of the WAV audio le, the name of the text le and the links, into the application. The le does not have to contain any links since they can be created in the application and then saved in the le. When the project le has been loaded, the `MMControl' control buttons are activated and the text is displayed in the `RichTextBox' box. By clicking on the \Play" button the audio le is played and if the current position of the audio le is in between the start and end points of an audio link, the link is followed and the relevant text is highlighted. This is shown in

Fig. 2.4

. An array in the program is used to store the links.

To test the application a simple project le was created, using paragraphs of text from a book and an audio transcript of this text. Links were then created from each paragraph of the text, to the relevant position in the audio le. When the audio le was played back, a sentence of text from each paragraph would be highlighted, when the equivalent position was reached in the audio le. If the user moved the vertical slider and hence changed the current position in the audio le, again a sentence of text would be highlighted in the relevant paragraph. These two methods are used to follow links between audio and text. By clicking the right mouse button anywhere within a paragraph, the position in the audio le and hence the slider would move to the start of the audio version of that paragraph. This method is used to follow links between text and audio.

(33)

(34)

(35)

Streaming Media Protocols

Chapter 3 describes a family of protocols that are being used to support the real-time delivery of multimedia data. It specically concentrates on the Real Time Streaming Protocol which is designed to be an open standard. This chapter also gives an overview of streaming audio players and the software tools they use to create streaming multi-media presentations.

3.1 The Traditional Protocols - TCP/IP

The Transmission Control Protocol / Internet Protocol is the main protocol suite used with the Internet. Halsall [34] describes how originally Local Area Networks (LANs) were used to connect dierent computers on a local site, e.g. computers within an oce or building. This allows users to share and distribute information across the local network. Large enterprises, however, usually have several sites situated in dierent areas of the same country and more recently, dierent countries as well. To connect sites that are situated within the same country, companies lease transmission lines between those sites, from the public carriers such as British Telecom. This forms a Wide Area Network (WAN). To connect sites that are situated in other countries, companies use dierent types of communication, for example satellites, optic bres across land and / or sea etc. The networks that are formed from this type of communication are called Internetworks or just Internets.

TCP/IP consists of a suite or a layered stack of two core protocols and they are the internet and transport protocols. The Internet Protocol (IP) provides a number of core

(36)

functions that assist the process of internetworking across dissimilar networks. These are:

1. Addressing. There are three types of address used with the current version of the internet protocol (IPv4), unicast, broadcast and multicast. Unicast addressing is used when a packet of information or datagram is to be sent to a single destination. Broadcast addressing is used when a message is to be delivered to every host on a destination LAN. A multicast address is used to deliver a datagram to a specic set of hosts, called a multicast group. This type of addressing is called IP Multicast. Hosts can join a multicast group at anytime and receive the datagrams that are sent to the group.

2. Fragmentation and reassembly. If the datagrams sent by a host are larger than the packet sizes used by a particular part of the internet, the datagrams will have to be fragmented into smaller chunks so that they can be transmitted. When these smaller packets are received, they have to be reassembled into the original sized packet, so that they can be used.

3. Routing. This is used to determine which subnets, within the internet, the data-grams must travel to get to the destination host. This could involve travelling over several dierent LANs or WANs.

4. Error reporting. This consists of several functions that will detect errors, for example the process of reassembly could cause several packets to be discarded, and report them back to the IP in the source host. Halsall [34] describes these functions in more detail.

The transport protocols are designed to sit on \top of" the internet protocol men-tioned above. They provide two modes of operation, oriented or connection-less. A connection-oriented protocol creates a connection between the transmitter and the receiver before the data is actually transmitted. This is also known as a reliable transport service since the data is guaranteed to get to the destination. The Transmis-sion Control Protocol (TCP) is an example of this type of service.

(37)

the overhead associated with each message transfer because no network connection is established prior to the transmission. TCP/IP provides a connectionless protocol called the User Datagram Protocol (UDP).

When users want to transmit information over the internet, using an internet-aware application, the information is rst passed to the transport protocol layer. This layer will determine the type of delivery mechanism, e.g. TCP or UDP, to use. The information is then passed to the internet protocol layer, which attaches extra information, for example the destination host address etc.

When a host receives information from the internet, the internet protocol will carry out several tests on the packets of information received. For example error reporting, which could result in the re-transmission of particular packets, reassembly of packets etc. The information is then passed to the transport protocol layer which strips o any information to do with the delivery mechanism. The resulting information can be used by the internet-aware application. A more thorough description of this can be found in Halsall [34].

3.2 The Real-time Transport Protocol (RTP)

The Real-time Transport Protocol was developed by the \Audio-Video Transport Work-ing Group"1 and has recently become an internet standard. RTP is described in the

IETF's RFC 1889 [35] specication as being a protocol providing end-to-end delivery services, such as payload type identication, timestamping and sequence numbering, for data with real-time characteristics, e.g. interactive audio and video. It can be used over unicast or multicast networks. RTP itself however, does not provide all of the functionality required for the transport of data and therefore applications usually run it \on top" of a transport protocol such as UDP2, which is discussed in more detail in

the previous section.

RTP usually works in conjunction with another protocol called the Real Time Control Protocol (RTCP)3, which provides minimal control over the delivery and quality of the

data. It performs four main functions and these are:

1Formed by the Internet Engineering Task Force (IETF).

(38)

1. Feedback Information. This is used to check the quality of the data distribution. During an RTP session, RTCP control packets are periodically sent by each par-ticipant to all the other parpar-ticipants. These packets contain information such as the number of RTP packets sent, the number of packets lost etc., which the re-ceiving application or any other third party program can use to monitor network problems. The application might then change the transmission rate of the RTP packets to help reduce any problems.

2. Transport-level identication. This is used to keep track of each of the participants in a session. It is also used to associate multiple data streams from a given participant in a set of related RTP sessions, e.g. the synchronisation of audio and video.

3. Transmission Interval Control. This ensures that the control trac will not over-whelm network resources. Control trac is limited to at most 5% of the overall session trac.

4. Minimal Session Control. This is an optional function which can be used to convey a minimal amount of information to all session participants, e.g. to display the name of a new user joining an informal session.

When an RTP session is initiated, an application denes one network address and two ports for RTP and RTCP. If there are several media formats such as video and audio, a separate RTP session with its own RTCP packets is required for each one. Other participants can then decide which particular session and hence medium they want to receive.

(39)

3.3 The Real Time Streaming Protocol (RTSP)

The Real Time Streaming Protocol is a proposed Internet standard which is being jointly developed by Netscape Communications, RealNetworks4 and Columbia

Univer-sity. The current RFC [36] describes RTSP as being an \application-level protocol", which controls the delivery of streaming media with real-time properties. This media can be streamed over unicast or multicast networks. RTSP itself does not actually de-liver the media data; this is handled by a separate protocol and therefore RTSP can be described as a \network remote control" to the server that is streaming the media.

The underlying protocol, that is used to control the delivery of the media, is deter-mined by the scheme used in the RTSP Uniform Resource Locator (URL)5. The schemes

that are supported on the internet are \rtsp:" which requires that the commands are

delivered using a reliable protocol, e.g. TCP, \rtspu:" which identies an unreliable

protocol such as UDP and \rtsps:" which requires a TCP connection secured by the

Transport Layer Security (TLS) [37] protocol. Therefore, a valid RTSP URL could be \rtspu://foo.bar.com:5150", which requests that the commands be delivered by an un-reliable protocol to the server \foo.bar.com", on port 5150. A more detailed description of the URLs used in RTSP is given in Section 3.2 of the RFC [36].

RTSP is intentionally similar in syntax and operation to the latest version of the HyperText Transfer Protocol (HTTP/1.1) [38], which will soon become an internet standard. There are several reasons for this and they include:

Any future extensions to HTTP/1.1 can also be added to RTSP, with little or no

modication.

RTSP can be easily parsed by standard HTTP or MIME parsers.

It can adopt HTTP's work on web security mechanisms, caches and proxies.

Pizzi and Church [39], however, describe one of the fundamental reasons why RTSP is based upon HTTP/1.1; the previous version of HTTP (v1.0), which is currently being used by most web servers and browsers, was not designed to cope with the transmission of real-time data over the internet. What limited capabilities HTTP/1.0 has in this

4Formerly known as Progressive Networks.

5This is a formatted string that identies via a name, location, or any other characteristic a resource,

(40)

area have already been exhausted. By making RTSP similar in operation and syntax to HTTP/1.1 the designers have essentially provided HTTP-level services to real-time streaming data.

Although RTSP is similar to HTTP/1.1, there are several areas in which it diers. One of these areas is RTSP's need to maintain state in almost all situations. In RTSP, as mentioned previously, the data is streamed via a separate protocol which is independent of the control channel. For example, TCP could be used for the control of the stream, whilst UDP is used for the actual delivery6. Thus the data will still be delivered by

the media server, even if it receives no RTSP control commands. Since most servers are designed to handle more than one user at a time, the server needs to be able to maintain \session state", i.e. whether it is setting up a session, the \SETUP" state; playing a stream, the \PLAY" state etc. This will allow it to correlate RTSP requests with the relevant stream. HTTP/1.1, however, is a stateless protocol; there is no need to save the state of each client.

Another area in which HTTP/1.1 and RTSP dier is in the way the client and the server interact. With HTTP/1.1 the interaction is one-way; the client issues a request for a document and the server responds. With RTSP both the client and the server can issue requests.

RTSP is really more of a protocol framework rather than a protocol itself because it provides:

A way to control the delivery of multiple data streams.

A means for choosing the actual delivery channels such as UDP, multicast UDP

and TCP.

A way to choose a delivery mechanism such as RTP.

RTSP URLs, which are described earlier in this section, are used to control the delivery of the multiple data streams. The type of delivery channel and delivery mechanism, however, are usually implementation specic, e.g. coded into the implementation of the RTSP server. Early implementations of these servers support UDP, multicast UDP, TCP and RTP.

(41)

To create a presentation which could be a live videoconference or the simple trans-mission of stored data, a presentation description is used. This contains a common time axis and information about one or more media streams, e.g. RTSP URLs, dura-tion, start time etc. A simple presentation description could, therefore, contain just one audio stream, whilst a more complex example could contain an audio, video and text stream running in parallel.

RealNetworks have developed the RealMedia architecture, that can be used to create presentations. It consists of several components of which one of them is a scripting language called the Real Time Session Language (RTSL). This language is used to dene the content of a RealMedia presentation and therefore is used to create the presentation description.

(42)

The Sound Viewer Tool

This chapter covers two main areas. The rst is an overview of the original Sound Viewer application and how it has been used. The second is the design and implementation of an extension to the Sound Viewer, which will allow the creation and traversal of links to and from streaming audio. A brief discussion of how this new tool can be used is also given.

4.1 The original Sound Viewer Tool

4.1.1 An Overview

The Sound Viewer tool was developed by Stuart Goose at the University of Southamp-ton, to provide \a generic and meaningful visual representation of audio within a hy-permedia context". A paper by Goose and Hall [32] describes the development of the Sound Viewer tool and discusses some of the unique properties of the audio domain, which is described in more detail in

Section 2.2

.

This paper discusses how audio has helped in many applications; for example audio conrmation of particular types of action possibly reduces the number of errors and non-speech sounds have been successfully used in the navigation of a screen interface for blind users. The human voice, however, is the main medium in which we communicate with each other and so it is natural to assume that it would form an integral part of multimedia and hypermedia systems. As shown in

Sections 2.3, 2.4

and

2.6

most systems support a variety of media such as text, animations, video and pictures. Audio

(43)

is supported but with many systems it is just associated with a particular event e.g. a button being clicked upon, the system shutting down etc.

The main reason why audio has not attracted as much attention as other media is due to its lack of visual identity. Visual media such as text, pictures and videos all have specic graphical applications to create and modify them e.g. word processors, graphics programs etc. With the Sound Viewer a suitable graphical metaphor for sound had to be found and so existing methods were reviewed. They are:

1. The Scrollbar method. This uses a graphical scrollbar, to represent the length of an audio le. By moving the position indicator, the user can seek to a dierent position in the le. Microsoft's Media Player and the simple AudioLinker tool, see

Section 2.7

, use the scrollbar approach.

2. The Waveform method. Several samples of the audio le are taken at a given rate and the results are then drawn, forming a waveform, onto the screen. This seems to be one of the most popular ways of representing sound. Microsoft's Sound Recorder uses this method.

3. The Piano Roll. This displays a single octave of a piano keyboard, drawn vertically, on the left hand side of the display. A single line representing the \play" head scrolls from left to right, at regular time intervals, over markers representing a key depression. This method is most commonly used in programs that manipulate Musical Instrument Digital Interface (MIDI) les e.g. Cakewalk's Pro Audio. 4. The Manuscript / \Score" method. This represents audio using a musical score,

e.g. ve horizontal parallel lines (the stave) represent pitch and musical notes drawn on this stave produce the music. Again this is used with MIDI les. 5. No visualisation (Sound engineers). The majority of sound editing by studio

engineers is performed manually without any visual tools. The equipment used provides ne control over the start and stop positions within the audio recording and playback facilities for further renements of these positions. Time and position indicators are used for noting and re-locating specic sequences.

(44)

whilst it can not be displayed using waveforms. Also MIDI can not be used for speech. Microsoft's proprietary WAV format is usually displayed using waveforms and can be used for both music and speech. However, the waveform itself is of little use to the average user who can not relate the sound to its corresponding waveform. The scrollbar method is really just a means to move within the audio le itself and therefore, it can be used for all of the sound formats e.g. WAV, CD-Audio, MIDI etc.

Each of these methods have their own advantages and disadvantages and the Sound Viewer tool was designed to represent WAV, MIDI and CD-Audio in an uniform way. It would be very confusing if the tool displayed a waveform for a WAV le and something completely dierent for a MIDI le. Therefore it was decided that more priority would be given to the visual authoring of the anchors, rather than how the audio was displayed. Goose and Hall [32] describe several other objectives required for the manipulation of audio by a hypermedia author / user and they are:

To have full control over the audio device.

To discern the current position and duration of the sound sequence.

To have the ability to select a portion of the sequence, playback and rene that

selection in preparation for creating an anchor.

To identify any links present in the sequence.

To traverse links to and from the sequence in an intuitive manner.

The host hypermedia system that manages the link information created to and from the Sound Viewer, is called Microcosm. This \open" system is discussed in more detail in

Section 2.6.3

.

4.1.2 The Interface

The user interface of the Sound Viewer consists of several components and these are the audio controls, the windows that represent the audio and are used to view the links, two position counters and two selection indicators. Each of these components are shown in

Fig. 4.1

.

(45)

Figure 4.1: The Sound Viewer Interface.

are also two extra buttons; the \Memory In" which stores the current position / time within the le and the \Memory Out" button which, if pressed, displays a list box of all the stored positions. The user can double-click on any of these, to move to that position in the le.

The Sound Viewer has two white windows which span the width of the main applica-tion window. The larger of the two windows is called the detail window and the smaller, the overview window. The overview window represents the length of the audio le and within this window is a highlighted rectangle, which represents the exploded view seen in the detail window above. Users can change the width of this rectangle, so that the smaller the rectangle the higher the zoom factor and vice versa. The middle of this rectangle is represented in the detail window as a thin vertical line.

(46)

rectangles with simple text descriptions of the link. The overview window is really just an enhanced scrollbar and by moving the rectangle or the position indicator the user can seek to a position within the le.

There are two position counters and two selection indicators underneath these wi