
System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks

OnurSoft Onur Tolga Şehitoğlu

November 10, 2012 v1.0

Contents

1 Introduction
1.1 Purpose
1.2 Project Scope
1.3 Definitions, Acronyms, and Abbreviations
1.4 Overview

2 Background
2.1 Document Management Systems
2.2 Desktop Search Tools
2.3 Peer to Peer Networking

3 Overall Description
3.1 Product Perspective
3.2 Product Functions
3.3 User Types, Constraints and Dependencies

4 Product Features
4.1 External Interfaces
4.1.1 Authentication
4.1.2 File System
4.1.3 Network Search Request
4.1.4 Network Download Request
4.1.5 User Interface, Search
4.1.6 User Interface, Download
4.1.7 User Interface, Settings
4.1.8 User Interface, Connections
4.1.9 User Interface, Logs
4.2 Software Functions
4.2.1 Search index construction
4.2.2 Authentication
4.2.3 Authorization
4.2.4 Distributed search
4.2.5 Distributed download
4.2.6 Version management
4.3 Performance requirements and Design constraints
4.3.1 Performance
4.3.2 Security
4.3.3 Flexibility

References

Chapter 1

Introduction

1.1 Purpose

This document describes the requirements of a distributed desktop search and document sharing software called DistShare. It aims to describe the required product features, constraints and dependencies, and to form a basis for the design and development phases of the project.

1.2 Project Scope

Document exchange among members of a project is a daily routine activity at the core of the project life cycle. The DistShare project proposes a solution to store and share documents through peer to peer communication, supported by distributed content (i.e. keyword and meta-data) search and authentication.

DistShare is a tool for sharing the local documents of a desktop user with other members of the local area network. It is essentially a combination of a desktop search tool and a document management system.

DistShare watches the documents of a local user and constructs a document index, as any desktop search tool does. It also keeps track of the sharing properties of documents, in other words who can access them.

It then enables users to execute a distributed content search on all computers in the DistShare network, where users see all documents that contain the search words and are accessible to them. After the search, users use the DistShare peer to peer download service to receive the documents. If the same document is available on multiple hosts, it is downloaded in parallel.

1.3 Definitions, Acronyms, and Abbreviations

Document management system: A platform for storing, accessing and sharing documents. Most systems include mechanisms for authentication and version control of documents.

Desktop search tool: Software that keeps track of documents on the local disk and provides fast access to them through name, meta-data and content keywords.

P2P: Peer to peer network.

1.4 Overview

This document describes the software requirements specification for DistShare. Chapter 2 gives background on document management systems, desktop search tools and peer to peer networking. Chapter 3 gives the overall description of the project, its main functions, dependencies and constraints. Chapter 4 gives the specific requirements of the project: system interfaces, functional requirements, and performance and design requirements.

Chapter 2

Background

2.1 Document Management Systems

A document management system is a document store providing access to multiple users. Such systems are usually realized as central web based repositories that are accessed through web browsers. They also offer remote file system links so that users can access documents in the repository as if they were local documents.

A document management system has the following components [10]:

• document meta-data

• integration (direct access through file system or other application)

• indexing

• searching

• versioning

• storage

• retrieval

• security

• distribution

• workflow

• capture (printed documents, fax, OCR)

• collaboration

• publishing

• reproduction

The first eight items above are also within the purpose of DistShare. The remaining items mostly relate to printed documents and the business flow of documents, so they are out of the scope of this project.

Content management systems are similar to document management systems but focus especially on web content and documents.

Some existing document management tools are OpenDocMan [1], Google Docs [3] and Dropbox [2].

2.2 Desktop Search Tools

Desktop search tools provide quick access to documents through previously generated indexes. As the number of files on local filesystems increases, searching for a specific file name, file attribute or content keyword becomes a common and repeated task. In order to accelerate this operation, desktop search software generates a local index that maps search items to document paths. The basic indexing items are file names and file attributes. In addition to these, meta-data of documents, like the singer of a music file or the author of a text file, can be extracted from document content and indexed. Another type of index is the inverted (reverse) index, which maps the individual words contained in documents to the documents containing them, as web search engines do. A well known framework for document indexing is Lucene [7].
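The word-to-document mapping described above can be illustrated with a minimal sketch. It is not how Lucene or any particular desktop search tool is implemented; the function names and the sample paths are hypothetical, and documents are assumed to be already available as plain text.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lower-cased word to the set of document paths containing it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path)
    return index

def search(index, query):
    """Return the paths of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents (path -> extracted text).
docs = {
    "/share/report.txt": "Quarterly budget report",
    "/share/notes.txt": "Meeting notes on the budget",
}
index = build_inverted_index(docs)
```

With this index, `search(index, "budget report")` only has to intersect two small sets instead of reading every document again.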

In order to create indexes, a desktop search tool scans all local documents and generates the database. After creating the initial index, these tools keep the index up to date through periodic rescans or by listening for file change events from the system. Beagle [8], MetaTracker [9] and Google Desktop are examples of such tools.

DistShare needs to repeat most of the tasks of a desktop search tool for a group of user supplied share directories. In addition, it needs to keep track of the authentication information of the documents.

2.3 Peer to Peer Networking

One major drawback of central file sharing is its large data volume and bandwidth requirements. In a distributed environment, however, copies of a document can be distributed over a group of smaller capacity hosts and, on demand, downloaded in parallel from multiple hosts with smaller bandwidth. This idea is realized in peer to peer networks, where each client is also a server for an object and shares it through peer to peer connections, in contrast to all clients connecting to a single master location. This gives more resilient operation and natural redundancy: if some of the nodes or links fail, a peer to peer network can keep operating on the remaining nodes and links.

Objects in a peer to peer network are addressed through a Distributed Hash Table, which is basically a mapping from document digest to document meta-data that is stored, searched and redistributed among all nodes of the network.
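As a rough illustration of the digest to meta-data mapping, the toy table below keeps all node tables in one process. The class name `ToyDHT`, the use of SHA-1, and the rule that a key lives on the node whose hashed name has the smallest XOR distance to it are assumptions made for this sketch; a real P2P framework adds routing, replication and redistribution when nodes join or leave.

```python
import hashlib

def digest(data: bytes) -> str:
    """Hex digest used both as document key and as node identifier."""
    return hashlib.sha1(data).hexdigest()

class ToyDHT:
    def __init__(self, node_names):
        # Each node holds only its own part of the table.
        self.nodes = {name: {} for name in node_names}

    def responsible_node(self, key: str) -> str:
        """Pick the node whose hashed name is closest to the key (XOR distance)."""
        key_int = int(key, 16)
        return min(self.nodes,
                   key=lambda n: int(digest(n.encode()), 16) ^ key_int)

    def put(self, content: bytes, metadata: dict) -> str:
        """Store meta-data under the digest of the document content."""
        key = digest(content)
        self.nodes[self.responsible_node(key)][key] = metadata
        return key

    def get(self, key: str):
        """Look the key up on the node responsible for it."""
        return self.nodes[self.responsible_node(key)].get(key)
```

Because the key is derived from the content, two hosts holding identical copies of a file produce the same key, which is what lets a downloader discover multiple sources for one document.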

DistShare aims to let everyone share documents with everyone else in the network, so it is by nature suitable for peer to peer networking. However, authentication and authorization issues need to be solved, and a distributed search facility is also needed.

There are many frameworks and tools for peer to peer networking, such as Gnutella [4], BitTorrent [5] and Direct Connect [6].

Chapter 3

Overall Description

3.1 Product Perspective

DistShare is a tool that runs on a desktop computer and interacts with the local filesystem, with the user for setup, and with a peer to peer network for search and download facilities.

It may work along with existing desktop search tools, i.e. use their databases, or maintain its own search functionality. Apart from that, it will work as a self-contained tool.

It will have some functions of a document management system, such as search and retrieval of documents and marking them to be shared.

It will use one of the existing protocols for maintaining a distributed hash table of documents and for providing distributed search and download facilities.

3.2 Product Functions

DistShare consists of the following basic tasks:

1. Search index construction

2. Authorization

3. Authentication

4. Distributed search

5. Distributed download

6. Version management

Search index construction is the on demand construction and incremental maintenance of document indexes.

Authorization is setting the sharing properties of local documents so that they are remotely accessible to a specific group of authenticated users. Otherwise, documents will neither be displayed in searches initiated by other hosts nor be downloadable.

Authentication is the mechanism by which users prove their identity to remote hosts. DistShare needs a means of making sure that the host making a request is carrying out the operation on behalf of the correct user.

Distributed search is the tool's ability to retrieve search results from other hosts in the P2P network. A search initiated on the local host is repeated on other P2P hosts and the results are collected.

Distributed download is executed when the user asks to download a file found by a search. If exactly the same file is present on multiple hosts, it will be downloaded in parallel to utilize network bandwidth.

Version management is the system's ability to maintain multiple revisions of the same file. The system keeps track of local changes and tries to associate the file with its origin, so that when the file is searched, results with version options can be returned.

3.3 User Types, Constraints and Dependencies

Users of the system are desktop users working on a local area network or the Internet. All users are in the same class.

The operating environment is a desktop computer. Any of Linux, Windows or Mac OS can be considered as a target platform. Interoperability is expected, since platform independent documents are to be shared. A desktop with a graphical interface is expected for user friendliness.

The major constraint on the system is security. Users who are not authorized should not be able to access the documents of the desktop user.

Another constraint is the lack of central control. As in other P2P applications, almost all information should be kept in network nodes, not on a central server. There should be no central control other than an authentication server.

There are freely available libraries for document indexing, P2P and authentication. The project depends on such libraries.

Chapter 4

Product Features

4.1 External Interfaces

4.1.1 Authentication

Each user of the software needs to be authenticated in order to access other hosts. Authentication information should be sharable among the other members of the P2P network.

Authentication can be operating system/workgroup based or network based. Depending on the design choices, multiple authentication mechanisms could be supported.

Input is the operating system or a network service. Output is the identification of the user.

4.1.2 File System

All files that are marked to be shared should be watched on the filesystem. On manual invocation, on periodic invocation or as files change, the system needs to update its information about the file.

Input is the file on a local file system and output is the database keeping the search indexes.

4.1.3 Network Search Request

The system should listen for network requests in order to answer remote search requests.

Input is a search sentence and the identity of the remote user. Output is the list of matching files.

4.1.4 Network Download Request

When a matched document is requested, it is served by the hosts that hold the file locally. The software needs to serve such requests.

Input is the file locator and the identity of the remote user. Output is the file content, probably in fragments.

4.1.5 User Interface, Search

The major operation in the graphical user interface is the search. A search operation consists of a search sentence, i.e. a list of keywords to search for or a specific search syntax. This search is sent to the active members of the P2P network and to the local host.

Input is the search sentence, output is the list of files that are matched, their details and their locators (locations they can be downloaded from).

4.1.6 User Interface, Download

The user clicks the file to be downloaded and the file is downloaded, probably from multiple locations.

Input is the file locators, output is the file download progress and the status of the download.

4.1.7 User Interface, Settings

Authorization and other settings of local files can be configured by the user. The authorization and watch settings per directory/file can be edited in the user interface.

Input is the file or directory and the settings to be applied.

4.1.8 User Interface, Connections

Existing connections, live hosts and users should be displayed.

Input is the network and system state, output is the list of hosts, users and ongoing connections.

4.1.9 User Interface, Logs

A history of recent activity is listed: executed searches, remotely initiated searches, downloaded files and files downloaded by remote hosts.

Input is the system log database, output is the list of logs.

4.2 Software Functions

4.2.1 Search index construction

4.2.1.1 Meta-data extraction

For each file, the software shall extract meta-data and keywords and update the index accordingly. Meta-data extraction covers text documents, office documents, images, music, video and other possible meta-data types based on file type.

Use case

Activated internally. Input is a file path. The result, extracted according to the file type, is written to the search index database.

severity: must

4.2.1.2 Index reconstruction

On demand, the software shall scan all local files, extract meta-data and construct the indexes.

Use case

Activated by user. Input is all files to be watched. 4.2.1.1 is invoked for each file.

severity must

4.2.1.3 Index update

When a local file under share control is updated, the database should be updated incrementally. The update can be instantaneous if the system supports file change notification, or it can be done through periodic rescans for changed files.

Use case

Activated internally by system on file change. 4.2.1.1 is repeated for the file and the database is updated.

severity must
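The periodic rescan variant of 4.2.1.3 could look like the sketch below. It assumes that a changed modification time is a good enough signal that a file must be re-indexed; the function name `rescan` and the in-memory `known_mtimes` dictionary are hypothetical stand-ins for the real index database.

```python
import os

def rescan(root, known_mtimes):
    """Return (changed, deleted) paths under root and refresh known_mtimes.

    known_mtimes maps a file path to the modification time (in nanoseconds)
    recorded when the file was last indexed; it is updated in place.
    """
    changed, seen = [], set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            seen.add(path)
            if known_mtimes.get(path) != mtime:
                known_mtimes[path] = mtime
                changed.append(path)   # new or modified: re-run 4.2.1.1 on it
    deleted = [p for p in known_mtimes if p not in seen]
    for p in deleted:
        del known_mtimes[p]            # gone from disk: drop from the index
    return changed, deleted
```

Only the paths returned in `changed` need to go through meta-data extraction again, which keeps a rescan much cheaper than the full reconstruction of 4.2.1.2.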

4.2.2 Authentication

4.2.2.1 System authentication

If the system supports workgroup or network based authentication trust, system authentication can provide the identity. The software should take this authentication information from the system.

Use case

Activated by user on first execution. The operating system and environment are consulted. Result is success or failure and the user's identity.

severity must

4.2.2.2 Third party authentication

All members of the network can be authenticated through a trusted third party. Single sign-on servers and Kerberos authentication are examples.

Use case

Activated by user on first execution. An external service is consulted. Result is success or failure and the user's identity.

severity demanded

4.2.2.3 Public key authentication

Public key exchange through P2P confirmation, as in PGP, is another mechanism for authentication. The software might support this authentication via user to user pairing. This mechanism might work as in social networks, where connection requests are confirmed by both parties.

Use case

Activated by user. Input is the user's identity information and the other user to connect to. The other user gets the request. If the other user approves, authentication is successful and the process is repeated automatically afterwards without user invocation. Otherwise it fails.

severity nice to have

4.2.3 Authorization

4.2.3.1 Individual user permissions

The user shall be able to give individual users permission to search and download files.

Use case

Activated by user. Input is a file and user identity. Authorization database is updated.

severity must

4.2.3.2 User grouping

The user should be able to group users and define permissions based on user groups as well.

Use case

Activated by user. Input is a group of users and a group name. Authorization database is updated. Adding and deleting groups and modifying group membership are handled.

severity demanded

4.2.3.3 Public and authenticated groups

Authorization can be granted to two special groups: public, where everyone can download without authentication, and authenticated, where anyone who is authenticated to the system can download. These groups should be supported.

severity demanded

4.2.3.4 User/group exclusion

The user should be able to define permissions that exclude users and groups, for example so that everyone but the selected users can access a file.

severity demanded
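Requirements 4.2.3.1 to 4.2.3.4 can be combined into a single access check, sketched below. The `Permission` structure, the marker names `PUBLIC` and `AUTHENTICATED`, and the rule that an exclusion overrides any grant are assumptions of this sketch; the specification does not fix how conflicts between grants and exclusions are resolved.

```python
from dataclasses import dataclass, field

PUBLIC = "public"                # hypothetical marker: anyone, even unauthenticated
AUTHENTICATED = "authenticated"  # hypothetical marker: any authenticated user

@dataclass
class Permission:
    allowed: set = field(default_factory=set)   # user names, group names or markers
    excluded: set = field(default_factory=set)  # user or group names denied access

def principals(user, groups):
    """All names a requester answers to; user is None when unauthenticated."""
    names = {PUBLIC}
    if user is not None:
        names |= {AUTHENTICATED, user}
        names |= {g for g, members in groups.items() if user in members}
    return names

def is_allowed(perm, user, groups):
    """Check a requester against one file's permission entry."""
    names = principals(user, groups)
    if names & perm.excluded:
        return False             # assumption: exclusion wins over any grant
    return bool(names & perm.allowed)
```

For example, "everyone authenticated except bob" is expressed as `Permission(allowed={AUTHENTICATED}, excluded={"bob"})`.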

4.2.3.5 Directory based permissions

The user should be able to give directories default permissions, so that all files under such a directory, recursively, have a default set of permissions.

Use case

User initiated. Input is a directory and a set of permissions. Authorization database is updated. All files under the directory inherit the permissions afterwards.

severity demanded
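One way to realize recursive directory defaults is to resolve a file's permissions at lookup time by walking up its parent directories. The sketch below assumes POSIX style paths and assumes that a permission set on the file itself, or on a nearer directory, takes precedence over one set further up; the function and parameter names are hypothetical.

```python
from pathlib import PurePosixPath

def effective_permissions(path, file_perms, dir_perms):
    """Return a file's own permissions, else those of its nearest ancestor.

    file_perms: file path -> permission set given to that file directly
    dir_perms:  directory path -> default permission set for everything below
    """
    if path in file_perms:
        return file_perms[path]
    for parent in PurePosixPath(path).parents:   # nearest directory first
        if str(parent) in dir_perms:
            return dir_perms[str(parent)]
    return None                                  # not shared at all
```

Resolving at lookup time means that changing a directory's defaults takes effect for all files below it without rewriting one database entry per file.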

4.2.3.6 Multiple access levels

Permissions can be given at multiple levels. For example, operations like search, download and create revision can be granted as different levels.

severity demanded

4.2.4 Distributed search

4.2.4.1 Keeping track of alive hosts

The system shall keep track of the currently active software instances that are connected. Depending on the underlying mechanism, hosts in the broadcast network, registered hosts and hosts found through P2P host exchange should be monitored for being alive.

Use case

Activated periodically. Repeat host search, collect status results, store.

severity must
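A common way to track liveness is to record when each host was last heard from and to treat hosts that stay silent longer than a timeout as dead. The sketch below shows only that bookkeeping; the class name, the 30 second default and the injectable clock are assumptions, and the actual discovery mechanism (broadcast, registry or P2P host exchange) is left open as in the requirement.

```python
import time

class HostTracker:
    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a host counts as dead
        self.clock = clock       # injectable so the logic can be tested without waiting
        self.last_seen = {}      # host -> time of the last status reply

    def heard_from(self, host):
        """Record a status reply collected from a host."""
        self.last_seen[host] = self.clock()

    def alive_hosts(self):
        """Hosts that replied within the timeout, in sorted order."""
        now = self.clock()
        return sorted(h for h, t in self.last_seen.items()
                      if now - t <= self.timeout)
```

The periodic use case then amounts to calling `heard_from` for every host that answers the repeated host search, and reading `alive_hosts()` before each distributed query.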

4.2.4.2 Local query execution

The local user can invoke a query on the local search database and the resulting files are listed. The query can be on filename, meta-data or file content.

Use case

Activated by user. Search query is the input. Search is executed on local database, results are output.

severity must

4.2.4.3 Query language support

Queries can be guided through a query language. Partial matches, full matches, multiple word matches and meta-data matches can be supported.

Use case

Input is search sentence. Sentence is parsed and verified. As a result, a query procedure is generated.

severity demanded
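The specification does not define the query syntax, so the parser below uses an invented one purely for illustration: `field:value` for a meta-data match, a trailing `*` for a partial (prefix) match, and double quotes for a multiple word match. Tokenizing is delegated to the standard `shlex` module, which also rejects unbalanced quotes.

```python
import shlex

def parse_query(sentence):
    """Parse a search sentence into a list of (kind, field, value) terms.

    Raises ValueError when the sentence is malformed (e.g. an unclosed quote).
    """
    terms = []
    for token in shlex.split(sentence):
        field, sep, value = token.partition(":")
        if sep and field and value:
            terms.append(("meta", field.lower(), value.lower()))
        elif token.endswith("*") and len(token) > 1:
            terms.append(("prefix", None, token[:-1].lower()))
        else:
            terms.append(("word", None, token.lower()))
    return terms
```

The returned term list plays the role of the "query procedure": the search code can evaluate each term against the index and intersect the results.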

4.2.4.4 Query executed by a remote user

Software shall wait on a network port and carry out the queries sent by remote users. Query results are filtered for the identity of the user. Only permitted results are shown.

Use case

Activated by network user. Identity of user and search sentence are input. Query is executed as in 4.2.4.2 and the result is returned over the network.

severity must
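The filtering step of 4.2.4.4 can be sketched as a local query followed by a permission check per result. The sketch assumes the requester's identity has already been authenticated, that the index maps lower-cased words to sets of paths, and that `readers` lists the user names permitted to see each path; all three names are hypothetical simplifications of the index and authorization databases.

```python
def answer_remote_query(requester, words, index, readers):
    """Return the matching paths that `requester` is allowed to see.

    index:   word -> set of paths (local search index)
    readers: path -> set of user names permitted to search that path
    """
    if not words:
        return []
    # A path matches when it contains every query word.
    matches = set.intersection(*(index.get(w.lower(), set()) for w in words))
    # Filter by identity so unauthorized files are never revealed.
    return sorted(p for p in matches if requester in readers.get(p, set()))
```

Filtering on the serving host, rather than on the requesting host, matters for security: file names the requester may not see never leave the machine that owns them.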

4.2.4.5 Query executed on a remote host

The software shall repeat the user's query on all alive hosts, collect the results and display them.

Use case

Activated by user. Input is search sentence. The list of alive hosts is retrieved. For each host, the query is executed as in 4.2.4.4 and the results are shown in the user interface.

severity must
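Repeating a query on all alive hosts is naturally done in parallel, so that one slow host does not delay the others. In the sketch below each host is represented by a plain Python callable that stands in for a real network request; the function name, the 2 second timeout and the decision to report failed hosts separately are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def distributed_search(query, hosts, timeout=2.0):
    """Run `query` on every host in parallel and merge the results.

    hosts: host name -> callable(query) returning a list of file names.
    Returns (results, failed): (host, file) pairs and the hosts that
    raised an error or did not answer in time.
    """
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in hosts.items()}
        for name, future in futures.items():
            try:
                results.extend((name, f) for f in future.result(timeout=timeout))
            except Exception:
                failed.append(name)  # unreachable or slow host; keep other results
    return results, failed
```

Keeping the host name next to each result gives the user interface what it needs to show where a file can be downloaded from.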

4.2.4.6 Search result detail

Based on the search query and the type of match, the information of each match is displayed. This information contains the filename, owner and meta-data.

Use case

Internally activated. Input is the matched file and the query. Output is the detailed information about the file and query type.

severity must

4.2.4.7 Result snippets

In case of keyword match, the portion of the matched file containing the keywords, or a document summary is to be displayed.

Use case

Internally activated. Input is the matched file and the query. Output is the snippet text.

severity nice to have.
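A snippet for a keyword match can be cut out of the document text around the first occurrence of the keyword. The sketch below is one simple way to do it; the function name and the default radius of 30 characters are arbitrary choices, and it assumes that lower-casing does not change the length of the text, which holds for ASCII but not for every Unicode character.

```python
def snippet(text, keyword, radius=30):
    """Return the text around the first case-insensitive match, or None."""
    pos = text.lower().find(keyword.lower())
    if pos < 0:
        return None
    start = max(0, pos - radius)
    end = min(len(text), pos + len(keyword) + radius)
    prefix = "..." if start > 0 else ""       # mark text cut off on the left
    suffix = "..." if end < len(text) else "" # mark text cut off on the right
    return prefix + text[start:end] + suffix
```

For a multiple keyword query, the same function can be applied per keyword and the resulting snippets concatenated.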

4.2.4.8 Query result caching

Results of the last executed queries are cached in case of repetition and for history revisit.

Use case

Internally activated. Input is the search result. Result is written on cache database. Affects 4.2.4.2 and 4.2.4.4.

severity nice to have.
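A cache of the last executed queries is often kept as a least recently used (LRU) structure, which serves both purposes named above: repeated queries are answered from memory and the key order doubles as a history. The class below is a sketch under that assumption; the capacity of 100 and the in-memory storage are placeholders for the cache database.

```python
from collections import OrderedDict

class QueryCache:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.entries = OrderedDict()   # query -> results, least recent first

    def get(self, query):
        """Return cached results, or None when the query is not cached."""
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)       # mark as most recently used
        return self.entries[query]

    def put(self, query, results):
        """Store results and evict the least recently used query if full."""
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def history(self):
        """Cached queries, least recently used first."""
        return list(self.entries)
```

A real implementation would also have to invalidate entries when the index or the permissions change, which this sketch leaves out.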

4.2.5 Distributed download

4.2.5.1 Keeping track of the files to download

Search results contain a locator, probably a distributed hash table identity of the file. This identity should be searchable through all alive hosts. Multiple hosts may contain the same file. Results respect the user's identity and authorization: if the user is not authorized, the file is not displayed. Each file promoted to the DHT should contain the information necessary for download access.

severity must

4.2.5.2 Handling of file download request

The software shall serve a request to download a portion of a file if the request comes from an authorized user.

Use case

Activated over network. Input is the file locator, the identity of the requester and the part of the file that is requested. The requested file part is transferred over P2P.

severity must

4.2.5.3 Downloading a file from remote hosts

A download request initiated by the user should be sent to the remote hosts containing the file, and the file should be downloaded. P2P download semantics are followed.

Use case

Initiated by user. Input is the file selected by user. File request is sent to remote hosts containing the file. If there are multiple hosts, requests are sent for separate parts. File download is executed in background and success status is displayed to user when available.

severity must
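Sending requests "as parts" requires deciding which host serves which byte range. The sketch below assigns fixed-size fragments to hosts in round-robin order; this is a deliberately simple scheduling rule chosen for illustration, not the semantics of any particular P2P protocol, and the function name is hypothetical.

```python
def plan_download(file_size, hosts, fragment_size):
    """Assign each fragment (offset, length) of a file to a host, round-robin.

    Returns a list of (host, offset, length) tuples covering the whole file.
    """
    if not hosts:
        raise ValueError("no host holds the file")
    plan = []
    for i, offset in enumerate(range(0, file_size, fragment_size)):
        length = min(fragment_size, file_size - offset)  # last fragment may be short
        plan.append((hosts[i % len(hosts)], offset, length))
    return plan
```

Each tuple corresponds to one request of the kind served by 4.2.5.2; a fragment whose host fails can simply be reassigned to another host in the list.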

4.2.5.4 File integrity control

The file locator and the DHT should contain a digest of the file and of its fragments (see footnote 1). This digest information should be used to check whether fragments were downloaded correctly. P2P frameworks have mechanisms for this. A file download is only complete when all parts of the file pass the integrity check.

Use case

Internal operation. Input is the meta-data of the file containing fragment checksums and the fragment that has just been received. Output is the marking of the fragment as complete or not.

severity must
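The per-fragment check can be sketched with the standard `hashlib` module. SHA-256 and the function names below are assumptions of this sketch; the P2P framework chosen for the project may prescribe a different digest and fragment layout.

```python
import hashlib

def fragment_digests(data, fragment_size):
    """Digest of every fragment, as published with the file's meta-data."""
    return [hashlib.sha256(data[i:i + fragment_size]).hexdigest()
            for i in range(0, len(data), fragment_size)]

def verify_fragment(fragment, index, expected_digests):
    """True when a received fragment matches the published digest."""
    return hashlib.sha256(fragment).hexdigest() == expected_digests[index]

def download_complete(fragments, expected_digests):
    """fragments maps fragment index -> received bytes.

    The download is complete only when every fragment is present
    and passes its integrity check.
    """
    return all(i in fragments and verify_fragment(fragments[i], i, expected_digests)
               for i in range(len(expected_digests)))
```

A fragment that fails `verify_fragment` is discarded and requested again, ideally from a different host.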

4.2.5.5 Download auto share

The user should be able to configure whether a downloaded file inherits the permissions and is shared for other users to download.

Use case

Internal operation. Input is the path of the file that completed download. The file is added to the search index of shared files.

severity demanded

4.2.6 Version management

4.2.6.1 Version tracking

When a file is changed on disk, the user is asked whether a new revision is to be created. The document's link to its original file is kept and revision semantics are followed in all search and download operations.

Use case

Initiated by system on file write on shared directory. Input is the new file and old file information. Output is the new revision stored on search index.

severity nice to have

1 (footnote to 4.2.5.4) Most P2P applications support fragmentation of large files and keep a summary of integrity information per fragment.

4.2.6.2 Automated Versioning

The user can configure a specific file or directory to be versioned automatically with some maximum period. In other words, when the document is rewritten after a predefined time interval has passed, a version is created automatically without user intervention.

severity nice to have

4.2.6.3 All versions download

When multiple versions exist for a document, all versions can be downloaded with a single click by the user.

Use case

Modification of 4.2.5.3 with an all-download option that downloads all revisions.

severity nice to have

4.3 Performance requirements and Design constraints

4.3.1 Performance

In order to provide responsive interaction, search queries should be answered within a couple of seconds. This implies that search indexes are constructed in advance. Processing the documents on demand per query is not acceptable.

Download performance should be optimized to take advantage of P2P networking.

4.3.2 Security

Since the most private documents of a user could be revealed, security design is critically important in the system. Proper authentication and authorization should be provided without annoying the user.

4.3.3 Flexibility

The types of files supported in search are not static. For example, most of the electronic document formats in use today were introduced in the last ten years, and new formats such as ePub and DjVu are still being introduced. The system's ability to adapt itself as new formats are introduced is therefore very important. The user should not need to reinstall the software in order to introduce a new format to meta-data extraction (4.2.1.1).
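One design that meets this requirement is a registry of extractor functions keyed by file extension, so that support for a new format is added by registering one more function at run time. The registry, the `register` decorator and the sample text extractor below are hypothetical; the specification does not prescribe a plug-in mechanism.

```python
import os

EXTRACTORS = {}  # file extension -> extractor function

def register(*extensions):
    """Decorator that registers an extractor for the given file extensions."""
    def decorator(func):
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return decorator

@register(".txt")
def extract_text(path):
    """Sample extractor: every whitespace-separated word is a keyword."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return {"type": "text", "keywords": f.read().lower().split()}

def extract_metadata(path):
    """Entry point for 4.2.1.1: dispatch on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    extractor = EXTRACTORS.get(ext)
    if extractor is None:
        return {"type": "unknown"}   # unsupported format: index name only
    return extractor(path)
```

Extractors for new formats could then be shipped as separate modules that are loaded at start-up and call `register`, leaving the installed core untouched.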

References

[1] OpenDocMan web site, http://www.opendocman.com/.

[2] Dropbox web site, https://www.dropbox.com/.

[3] Google Docs web site, https://docs.google.com/.

[4] Gnutella in Wikipedia, http://en.wikipedia.org/wiki/Gnutella.

[5] BitTorrent in Wikipedia, http://en.wikipedia.org/wiki/BitTorrent.

[6] Direct Connect in Wikipedia, http://en.wikipedia.org/wiki/Direct_Connect_(file_sharing).

[7] Lucene web site, http://lucene.apache.org/core/.

[8] Beagle web site, http://www.beagle-project.org/.

[9] Meta Tracker web site, http://projects.gnome.org/tracker/.

[10] Wikipedia article on "Document Management Systems", http://en.wikipedia.org/wiki/DMS/.
