Open Text Archive Server and Microsoft Windows Azure Storage

(1)

Open Text Archive Server and

Microsoft Windows Azure Storage

Whitepaper

Open Text

(2)

2 | M ic r o s o f t W in d o w s A zu r e P la t f o r m W h it e P a p e r

Executive Summary / Introduction

Overview

This white paper describes how the Open Text Archive Server integrates with Microsoft Windows Azure Storage.

Azure Storage is not only a newly supported storage platform for the Archive Server, but also brings new features for the deployment of an ECM environment. Traditional storage platforms are optical jukeboxes or hard disk systems which are installed at customer site. The customer had to purchase the hardware together with maintenance contracts. Besides these investments UPS (uninterruptible power supply), cooling and space in the IT center had to be provided.

Microsoft Azure Storage relieves the customer from buying expensive hardware which after only a few years is out-dated. With Azure Storage the customer gets a virtually unlimited storage through a web service interface.

The performance of local storage will be better than for cloud storage, but for long-term storage cost factors can overweigh high-performance requirements.

(5)

About the Open Text Archive Server

The Open Text Archive Server is a core component of the Open Text ECM Suite and constitutes the archiving foundation for enterprise-wide ECM solutions. It enables storage, ingestion and retrieval of archived content.

The archiving functionality is an integral part of the Open Text Enterprise Library. Open Text offers several connectors to expand the archiving functionality. These connectors allow you to manage business documents in different applications and to link them to the business processes, e.g. Open Text Email Archiving for Microsoft Exchange, Open Text Storage Services for Microsoft SharePoint

Architecture

The Open Text Archive Server comprises multiple services and processes, such as the Document Service, the Administration Server and the Storage Manager. The Document Services provides document management functionality, storage of technical metadata, and secure communication with archiving clients. . The Storage Manager is responsible for managing external devices. The

Administration Server offers an API to administer the archive environment, tools and jobs.

Open Text ArchiveServer Architecture

Scalability and Distribution

The Archive Server is built for enterprise-wide deployments. This means, the Archive Server has:

(6)

 Strong capabilities to distribute the system to all business regions.

 Flexibility to run the system on existing databases and operation systems.

 Flexibility to connect the system to existing or new storage hardware. The Archive Server client/server architecture provides versatile options for configuring, scaling and distributing an enterprise-wide archive system.

Features of the Open Text Archive Server

Single Instance Archiving

In groupware scenarios, identical documents can be a risk of wasting storage space when emails with attachments are sent to hundreds of recipients. The Archive Server enables single instance archiving (SIA), keeping the same document only once on the connected storage platform.

Compression

In order to save storage space, content can be compressed before writing to storage system. Compression can be activated different content types, and can reduce storage storage by more than 30 percent.

Encryption of the stored data

By encrypting data, e.g. critical data such as salary tables, content is on the storage is secured and cannot be read without an archive system.

Secure Data Transport

Use of SSL ensures authorized and encrypted communication.

Data transport secured with checksums

Checksums are used to recognize and reveal unwanted modifications to content on its way from creation to the long-term storage. Checksums are verified, and errors reported.

Retention Handling

The Archive Server allows applying retention periods to content. Retention periods are handled by the Archive Server and are passed to the storage platform, as far as the storage platform supports the notion of retention.

(7)

Storage Management

Logical archives

A logical archive is an area on the Archive Server in which documents belonging together can be stored. Each logical archive can be configured to represent a different archiving strategy appropriate to the types of documents archived exclusively there.

Logical archives make it possible to store documents in a structured way. You can organize archived documents in different logical archives according to various criteria, e.g.

 Compliance requirements

 The archiving and cache strategy

 Storage platforms

 Customer relations (for ASPs)

 Security requirements

Hardware abstraction

Key task of the Archive Server is hiding specific hardware characteristics to leading applications, providing transparent access, and optimizing storage resources.

The Archive Server can handle various types of storage hardware; and provides hardware abstraction by offering a unified storage. If a hardware vendor’s storage API changes, or if new versions come up, it’s not necessary to change all the leading applications using the hardware, but only the Archive Server’s interface to the storage device.

Supported storage media

The Archive Server supports a wide range of different storage media and devices. Supported storage media are cloud storage, normal hard disk drive storage, hard disk write-once media and optical media.

Backup, Replication, High Availability and Disaster

Recovery

Backup

Power outages, physical damage, outdated media, hardware faults or usage errors can unexpectedly shut down IT operations at any time. Archive Server provides a variety of options to optimize the availability of the business documents.

(8)

Archive Server can create copies of volumes as backups. The copies may be produced on the local archive server or on a remote backup or standby server. To avoid losing data in the event of a hard disk failure and resume using Archive Server immediately, we recommend using RAID (Redundant Array of

Independent Disks) technology as an additional data backup mechanism. In addition to document content, administrative information is synchronized between original and backup systems.

Disaster recovery

The Archive Server stores the technical meta data together with content on the storage media (e.g. DocId, aid, timestamp, …). This allows Archive Server to completely restore access to archived documents in case the Archive Server hardware has a major breakdown or has been destroyed.

Remote standby

With a remote standby server, all the documents in an archive are duplicated on a second Archive Server —the remote standby server—via a WAN connection for geographic separation. If the production Archive Server fails, the remote standby server continues to provide read-access to all the documents. Physically

separating the two servers also provides optimal protection against fire, flood and other catastrophic loss.

High Availability

To eliminate long downtimes, the Archive Server offers active-passive high availability.

High availability is a two node cluster solution, in which a fully-equipped Archive Server node monitors the current production system by heart-beat. If a node fails, the other node automatically assumes all activities, with full transparency for end users.

If the production system fails, users can continue to work normally on the secondary archive system. In contrast to the remote standby server scenario, both read (retrieval) and write (archiving) access to documents is possible in this configuration.

(9)

About Microsoft Windows Azure Storage

Describe the properties and features of Azure Storage

(10)

1 0 | M ic r o s o f t W in d o w s A zu r e P l a t f o r m W h it e Pa p e r

Archive Server integration with Azure

Storage

The Archive Server treats Microsoft Windows Azure Storage as a storage device where single documents can be stored. To configure the connection to Azure Storage a file <device>.Setup has to be configured.

The *.Setup file (e.g. azure.Setup) is the link between the Archive Server and the Azure system. It contains all necessary information to access the Azure servers. The first line contains the connection info, which is needed to load the

corresponding Azure library. This library is provided by Open Text, and establishes the connection to the Azure storage.

If installed and configured correctly, you will see an entry in Administration Client under Devices showing the Azure storage device.

Storage space in Archive Server devices can be accessed through volumes. Volumes are attached to logical archives, thus providing dedicated storage space to logical archives.

Volumes in Azure devices are closely related to Azure containers. A container is basically the top-level directory in an Azure cloud data is being stored in. One or more volumes can be associated with one Azure container. Linkage between Azure containers and volumes is configured in the Setup file. Actually so-called GS-partitions are linked to the containers. GS-partitions have a one-to-one relation to Archive Server volumes.

To access an Azure container an account name and access key is necessary.

The following picture shows an Azure device names “OTCloud”. Five volumes are configured.

(11)

11 | M ic r o s o f t W in d o w s A zu r e P l a t f o r m W h it e Pa p e r

Figure 1 Configuration of Microsoft Windows Azure as storage device

The volumes market_vol1, market_vol2, market_vol3 are used in the logical archive HH_LA_4.

(12)

Business Case

How can Open Text customers profit from the Microsoft

Azure Storage?

Any customer using an application based on the Open Text ECM Suite and the Archive Server is a candidate for using Azure Storage. Customers have to upgrade to Archive Server 9.7.1. Use of Azure Storage is not restricted to Microsoft Windows platforms, but also available for Unix OS, such as Sun Solaris, IBM AIX, HP HP-UX and Linux. The Archive Server runs on-premise at customer site whereas the Azure Storage is provided over the Internet.

To use Azure Storage customers need to contact Microsoft for an account. With the account the customer can configure the storage environment (see page 10) and start using the cloud.

The Archive Server comes with an in-built Volume Migration tool which allows transparently migrating existing content on local hardware to the Azure Storage.

What are the benefits for the customer

The customer only buys what he needs and can grow continuously. He has only to pay for the storage space in use and the upload and download traffic.

With Azure Storage customers have a small initial investment, no maintenance fees and pay only for what you really need.

(13)

Performance Measurements

Performance tests were done by using the Open Text XOTE test tool.

The XOTE test tool is an internal test suite developed by the Open Text Quality Assurance department to run performance and benchmark tests with the Archive Server and storage platforms. The tool allows creating arbitrary documents of different size; supports automated test scenarios and collects result for evaluation in log files.

Within the benchmark test the following test cases were set up.

Test scenarios

1. Write documents to the disk buffer

2. Read documents from the disk buffer (verify)

3. Write the documents to the Microsoft Windows Azure

4. Purge documents from disk buffer (not evaluated in this white paper) 5. Read documents from Microsoft Windows Azure

6. Delete documents from Microsoft Windows Azure The tests were performed with different documents sizes.

# of documents Document size

20’000 10 kB

10’000 20 kB

10’000 50 kB

5’000 100 kB

5’000 200 kB

2’000 500 kB

2’000 1’000 kB

Measured times are extracted from the log files of the tool with a precision of milliseconds. The start and end times are given in GMT+1 (CET).

Test environment

The test setup consists of one Archive Server 9.7.1 connected to Microsoft Windows Azure storage.

(14)

Microsoft Windows Azure is configured as storage device on the Archive Server, and documents are written to the storage by using a so-called “Single file” pool. Compression and single instance archiving are disabled. The pool is configured with 15 threads to OT Azure library, and 15 connections were configured for the connection between the OT Azure library and Microsoft Windows Azure. SSL was used to connect to Microsoft Azure Storage.

Figure 3Archive Server and Microsoft Windows Azure connection

There are four test PCs each hosting 5 virtual clients that send parallel read and write request to the Archive Server. In sum, 20 parallel clients send read and write requests.

All servers are hosted on a Hyper-V server with Microsoft Windows 2008 Server as operating system.

Host system

2 x Quad Core Opteron 2376 (2,3GHz, 6MB) 32GB (8x4GB Dual Rank DIMMs) 667MHz 450GB SAS 15.000 1/min

Gigabit Ethernet network

Virtual test clients

Windows Server 2008 R2 Standard (64 Bit) 2 (virtual) CPU 2,3 GHz Opteron 2376 2 GB memory

ARCHIVE SERVER

MICROSOFT AZURE STORAGE

Document Service

libAzure http client

http server

15 connections

(15)

Archive Server

Version 9.7.1, Patch AS097-057

Network connection

The Archive Server is connected via a Gigabit Ethernet to the Open Text network in Munich/Germany. The Open Text network (Ethernet backbone) connects via the Internet (155 Mbit) to the cloud storage stored in South US. Therefore, the latencies and throughput from the Archive Server to Windows Azure is dominated by a combination of (a) the connection between Munich and South US, (b) the bandwidth between the two sites..

Hyper-V Server 2 x Quad Core Opteron 2376 (2,3GHz, 6MB) 32GB (8x4GB Dual Rank DIMMs)

667MHz 450GB SAS 15.000 1/min

ARCHIVE SERVER 9.7.1 Hyper-V

Windows Server 2008 R2 Standard (64 Bit) 4 (virtual) CPUs 2,3 GHz Opteron 2376, 4 GB memory

Gigabit Ethernet

Client PC:

5 archive clients per PC 20 clients in total

Gigabit-Ethernet between Clients and Server

Microsoft Azure Storage Internet Location: Munich, Germany

(16)

Figure 4 Test environment and deployment

Hyper-V Server 2 x Quad Core Opteron 2376

(2,3GHz, 6MB) 32GB (8x4GB Dual Rank DIMMs) 667MHz 450GB SAS 15.000 1/min

ARCHIVE SERVER 9.7.1 Hyper-V

Windows Server 2008 R2 Standard (64 Bit) 4 (virtual) CPUs 2,3 GHz Opteron 2376, 4 GB memory

Gigabit Ethernet

Client PC:

5 archive clients per PC 20 clients in total

Gigabit-Ethernet between Clients and Server

Microsoft Azure Storage Internet

(17)

Performance Results

Load on the Archive Server

The following figures show the load of the server during the different phases. These figures didn’t change with different document size.

Figure 5 Archive Server taskmanger during archiving documents to the disk buffer

(18)

Figure 7 Archive server taskmanger during writing documents to the Azure

(19)

Figure 9 Archive server taskmanger during verifying documents from Microsoft Windows Azure

(20)

Iteration with 10 kB documents

The minimum and maximum values for the different scenarios can vary

extremely. This can be due to temporary additional load on the server or on the network.

Action AVG (ms) Min (ms) Max (ms)

Write to disk buffer 19,75 < 16 297

Read from disk buffer 87,25 16 656

Write to Azure 1.239,89 1.190 2.846

Read from Azure 825,25 578 11.327

Delete from Azure 1.153,75 828 2.969

Table 1 Overview on results for 10 kB documents

Iteration duration for 20.000 documents: 6,5 hours

Start: 2009-11-13 22:11:54 (Fri) End: 2009-11-14 04:32:12 (Sat)

The cause for the maximum value is unknown. The average was calculated over 20.000 documents. The minimal value (578 ms) for reading from Azure is an upper boundary for the latency time.

In the Figure 11 the average time per step during the test is shown in a graphical view.

(21)

Figure 11 Graphical overview on results for 10 kB documents 0,00

200,00 400,00 600,00 800,00 1.000,00 1.200,00 1.400,00

20000 documents

10 KB documents average time per step

Write to Diskbuffer Read from Diskbuffer Write to Azure Read from Azure Delete from Azure

(22)

Iteration with 20 kB documents

Write to disk buffer 20,00 < 16 344

Read from disk buffer 121,25 16 39.059

Write to Azure 1.213,52 1.170 2.441

Read from Azure 822,50 578 22.048

The Figure 12 shows the values of Table 2 in a graphical view.

Figure 12 Graphical overview on results for 20 kB documents

Iteration duration for 10.000 documents: 6,5 hours.

Start: 2009-11-13 22:11:54 (Fri) End: 2009-11-14 04:32:12 (Sat)

0,00 200,00 400,00 600,00 800,00 1.000,00 1.200,00 1.400,00

10000 documents

20 KB documents average time per step

(23)

Iteration with 50 kB documents

Write to disk buffer 37,75 16 5.044

Write to Azure 1.362,63 1.180 370.016

Read from Azure 881,50 594 96.232

The graphical overview is shown in Figure 5 below.

Iteration duration for 10.000 documents: approx. 7 hours.

Start: 2009-11-07 11:31:44 (Sat) End: 2009-11-07 18:12:24 (Sat)

0,00 200,00 400,00 600,00 800,00 1.000,00 1.200,00 1.400,00

10000 documents

50 kB documents

(24)

Iteration with 100 kB documents

Write to disk buffer 54,25 16 328

Write to Azure 1.402,12 1.354 4.332

Read from Azure 1.310,25 828 6.874

Table 4 Overview for 100 kB documents

Figure 14 Graphical overview for 100 kB documents

The graphic shows that the read request is longer than the delete request and almost as long as the write request.

The following iterations show that the time of the read process will increase with file size.

Iteration duration for 5.000 documents: approx. 7 hours

Start: 2009-11-07 21:16:57 (Sat) End: 2009-11-08 04:17:31 (Sun)

0,00 200,00 400,00 600,00 800,00 1.000,00 1.200,00 1.400,00 1.600,00

5000 documents

100 kB documents

(25)

Iteration with 200 kB documents

Read from disk buffer 707,50 78 20.499

Write to Azure 1.650,55 1.549 3.839

Read from Azure 2.269,25 1.203 36.934

Delete from Azure 935,75 811 2.577

Table 5 Overview for 200 kB documents

As already described in the iteration of 100 kB documents the time consumption of the read process is growing significantly with the size of the documents.

Iteration duration for 5.000 documents: approx. 10,5 hours. Start: 2009-11-08 11:47:28 (Sun)

End: 2009-11-08 22:13:38 (Sun)

0,00 500,00 1.000,00 1.500,00 2.000,00 2.500,00

5000 documents

200 kB documents

(26)

Iteration with 500 kB documents

Read from disk buffer 1.551,75 202 2.906

Write to Azure 2.530,55 2.221 6.773

Read from Azure 4.016,50 2.125 10.826

Table 6 Overview of time per step for 500 kB documents

The graphical overview is shown in Figure 16.

Iteration duration for 2.000 documents: approx. 10 hours.

Start: 2009-11-09 19:58:19 (Mon) End: 2009-11-10 06:10:26 (Tue)

0,00 500,00 1.000,00 1.500,00 2.000,00 2.500,00 3.000,00 3.500,00 4.000,00 4.500,00

2000 documents

500 kB documents

(27)

Iteration with 1000 kB documents

Read from disk buffer 3.281,50 484 4.453

Write to Azure 3.866,18 2.544 5.511

Read from Azure 7.249,75 3.952 13.405

The graphical overview is shown in Figure 16.

Iteration duration for 2.000 documents: 11 hours:

Start: 2009-11-13 10:21:04 (Fri) End: 2009-11-13 21:09:09 (Fri)

0,00 1.000,00 2.000,00 3.000,00 4.000,00 5.000,00 6.000,00 7.000,00 8.000,00

2000 documents

1.000 kB documents

(28)

Summary

Any interpretation of the results has to be done with care. There are a lot of known and unknown parameters influencing the results.

The following parameters can influence the results:

 Throughput capacity and variations acrossof the internet is unknown

 Variation of throughput capacity during time of day is unknown

 Performance dependency on number of http client connections is unknown. Because of the high network latency times (response time) the throughput strongly depends on the number of parallel requests, i.e. the number of parallel http connections.

Outlook

The results show that the Internet latency seems to be limiting factor for write and read requests. To proof the assumption and to overcome this factor several steps are possible.

 Client and Archive Server installation in the U.S. or use a European data center with the clients in Munich.

 As read performance from Azure Storage decreases with document size, a cache implementation on the Archive Server can improve the performance for larger documents.

(29)

time in msec size 10 KB 20 KB 50 KB 100 KB 200 KB 500 KB 1000 KB

Write to Diskbuffer 20 20 38 54 91 199 384

Read from Diskbuffer 87 121 201 351 708 1.552 3.282 Write to Azure 1.240 1.214 1.363 1.402 1.651 2.531 3.866 Read from Azure 825 823 882 1.310 2.269 4.017 7.250 Delete from Azure 1.154 1.126 1.141 1.114 936 1.090 1.074 Table 8 Overall measurement results

The following graphic shows an overall view on the different test runs.

Figure 18 Overall view of benchmark tests with Microsoft Windows Azure

The following findings can be deduced from Figure 18:

Read and write requests to local disk are significantly faster than requests sent to the cloud.

The average time to write documents increases slightly with larger files. This applies to writing to the disk buffer as well as to writing to Microsoft Windows Azure.

The increase of dependency on document size is higher for write requests from the cloud compared to local disk. This is probably due to the latencies for the http requests and network bandwidth.

0 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000

0 500 1000

Ti m e in m s

Document size in kB

(30)

The deletion of the documents is mainly independent of the document size. The time is constant over the different iterations.

The document retrieval (read) time increases for documents larger than 64 kB. This applies to reading from the disk buffer as well as to reading from the cloud. This affect is due to the fact that the Archive Server reads

documents in 64 kB chunks. This effect did not matter up to now, as it only gets significant if latency times are high. The problem could be resolved by implementing a read cache on the Archive Server.

Figure 19 Write rates for Microsoft Windows Azure

The write rate to Azure is calculated from the number of connections to Azure, document size and write time per document.

Write Rate = # of connections/write time * document size

The number of connections for writing was 15.

The rate did not yet reach the saturation, i.e. more connections or larger documents could lead to higher write rates.

0,00 0,50 1,00 1,50 2,00 2,50 3,00 3,50 4,00 4,50

0 200 400 600 800 1000

Wr ite r ate in M B /sec

(31)

Read Rate = # of clients/read time * docsize

# of clients for reading was: 20

The decrease of the rate for 20 kb documents is unclear. Due to 64 kB block reads the read rate is lower than the write rate.

Microsoft Windows Azure Update

On November 11th 2009 Microsoft released a new version of Windows Azure. Open Text was not aware of the upgrade, and some tests were already

performed with the new release. The results did not show a significant change.

The following diagram shows a representation of the results in a logarithmical manner. This allows a better overview on the results for small documents.

0,00 0,50 1,00 1,50 2,00 2,50 3,00

0 200 400 600 800 1000

R

e

ad

r

ate

in

M

B

/sec

Document size in KB

(32)

Figure 20 Overall view of benchmark test with Microsoft Windows Azure Cloud Storage (logarithmical) 1

10 100 1.000 10.000

0 200 400 600 800 1000

Ti

m

e

in

m

s

(33)

w w w.

o p e n t e x t

. c o m

For more information about Open Text products and services, visit www.opentext.com. Open Text is a publicly traded company on both NASDAQ (OTEX) and the TSX (OTC). Copyright © 2009 by Open Text Corporation. Open Text and The Content Experts are trademarks or registered trademarks of Open Text Corporation. This list is not exhaustive. All other

About Open Text

Open Text is a leader in Enterprise Content Management (ECM). With two decades of experience helping organizations overcome the challenges

associated with managing and gaining the true value of their business content, Open Text stands unmatched in the market.

Together with our customers and partners, we are truly The Content Experts,™ supporting 46,000 organizations and millions of users in 114 countries around the globe. We know how organizations work. We have a keen understanding of how content flows throughout an enterprise, and of the business challenges that organizations face today.

It is this knowledge that gives us our unique ability to develop the richest array of tailored content management applications and solutions in the industry. Our unique and collaborative approach helps us provide guidance so that our customers can effectively address business challenges and leverage content to drive growth, mitigate risk, increase brand equity, automate processes, manage compliance, and generate competitive advantage. Organizations can trust the management of their vital business content to Open Text, The Content Experts.

Open Text Archive Server and Microsoft Windows Azure Storage