#MMTM15 #INFOARCHIVE #EMCWORLD 1

(1)

(2)

INFOARCHIVE

A TECHNICAL

OVERVIEW

DAVID HUMBY

(3)

TWEET LIVE DURING

THE SESSION!

Connect with us:

Sign up for a

Hands On Lab

6th May, 1.30 PM, Galileo 906

Share your thoughts

Take the survey

In the App or via email

Attend

Big Data Velocity &

Information Governance

6th May, 3 PM Session

Set Your Data Free

(4)

(5)

•

InfoArchive

•

High Level Goals

•

Architecture

•

Core Principles

•

Scalability

•

Retention Policy Services Integration

(6)

ACTIVE APPLICATIONS

READ & WRITE

LEGACY APPLICATIONS

READ ONLY

INFRASTRUCTURE

& STORAGE

EMC INFOARCHIVE

DATA

BACKUP COSTS

APPLICATION

PERFORMANCE

STORAGE COSTS

DATA GROWTH

IT COST

SHUT DOWN

USERS

EXTENDED AND

CONTROLLED DATA

ACCESS

COMPLIANCE MANAGEMENT

(7)

HIGH LEVEL GOALS

Scalability
• Number of sources / data types
• Data volumes
• Number of requests (search, retrieval)
Flexibility
• Data types
• SLAs (ingestion, search, retrieval)
• Compliance requirements
TCO
• Infrastructure costs
• Operating costs
• Costs to add a new data type, source or client application

(8)

HIGH LEVEL GOALS

•

Documents

Structured data

Flat records

Complex records

Flexibility : Data types

Property

Name

Title

Keywords

Customer

…

(9)

HIGH LEVEL GOALS

Complex structured records which can have a varying number of associated unstructured contents

(10)

Flexibility : Data types

RDBMS

Purchase Order

PO Number

PO

Date

Customer Ref

Product Description

Business Object

COMPLEX DATA MODEL OF THE

SOURCE APPLICATION

Information independently comprehensible

SIMPLE SELF DESCRIBING DATA MODEL

No direct link between source model & archive model
Simplifies adapting to source data model changes, e.g. application upgrades

POTENTIAL REFERENCE TO AN EXTERNAL ASSOCIATED CONTENT

(11)

EXAMPLE TABLE DATA: EMPLOYEES

5 tables:

employees.xml

dep_emp.xml

departments.xml

salaries.xml

titles.xml

<employees>
  <employee>
    <emp_no>10001</emp_no>
    <birth_date>1953-09-02</birth_date>
    <first_name>Georgi</first_name>
    <last_name>Facello</last_name>
    <hire_date>1986-06-26</hire_date>
  </employee>
  <employee>
    <first_name>Bezalel</first_name>
    <last_name>Simmel</last_name>
    <hire_date>1985-11-21</hire_date>
  </employee>
  ….
</employees>
<dep_emps>
  <dep_emp>
    <dept_no>d005</dept_no>
    <from_date>1986-06-26</from_date>
    <to_date>9999-01-01</to_date>
  </dep_emp>
  <dep_emp>
    <from_date>1996-08-03</from_date>
  </dep_emp>
  ….
</dep_emps>

(12)

EXAMPLE DATA: EMPLOYEE

BUSINESS OBJECT

<employee>
  <first_name>Guoxiang</first_name>
  <last_name>Nooteboom</last_name>
  <hire_date>1987-07-02</hire_date>
  <departments>
    <dept_name>Research</dept_name>
  </departments>
  <titles>
    <title>
      <title>Senior Staff</title>
    </title>
  </titles>
  <salaries>
    <salary>
      ….
    </salary>
  </salaries>
</employee>
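The idea of turning table rows into a self-describing business object can be sketched in Python as follows. This is only an illustration: the sample rows, the emp_no join key in the dep_emp rows, the department name and the helper build_employee are assumptions, and the slides do not show how the extraction is actually done.

```python
import xml.etree.ElementTree as ET

# Hypothetical table exports; element names follow the example tables,
# the rows and the emp_no join key are illustrative assumptions.
EMPLOYEES = """<employees>
  <employee><emp_no>10001</emp_no><first_name>Georgi</first_name>
    <last_name>Facello</last_name><hire_date>1986-06-26</hire_date></employee>
</employees>"""
DEP_EMP = """<dep_emps>
  <dep_emp><emp_no>10001</emp_no><dept_no>d005</dept_no></dep_emp>
</dep_emps>"""
DEPARTMENTS = """<departments>
  <department><dept_no>d005</dept_no><dept_name>Development</dept_name></department>
</departments>"""


def build_employee(emp_no: str) -> ET.Element:
    """Join the table rows for one employee into a nested business object."""
    employees = ET.fromstring(EMPLOYEES)
    dep_emps = ET.fromstring(DEP_EMP)
    departments = ET.fromstring(DEPARTMENTS)

    source = next(e for e in employees if e.findtext("emp_no") == emp_no)
    employee = ET.Element("employee")
    for field in ("first_name", "last_name", "hire_date"):
        ET.SubElement(employee, field).text = source.findtext(field)

    # Resolve dept_no to dept_name so the object is comprehensible on its own.
    names = {d.findtext("dept_no"): d.findtext("dept_name") for d in departments}
    depts = ET.SubElement(employee, "departments")
    for link in dep_emps:
        if link.findtext("emp_no") == emp_no:
            ET.SubElement(depts, "dept_name").text = names[link.findtext("dept_no")]
    return employee


print(ET.tostring(build_employee("10001"), encoding="unicode"))
```

The resulting object carries the department name itself rather than a foreign key, which is what makes it readable without the source data model.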

(13)

HIGH LEVEL GOALS

The performance and the TCO must not significantly change with the number/volume of archived items

Scalability & TCO

Number/volume

of archived items

•

Millions to hundreds of billions

•

TB to hundreds of TB

Time

Operating costs

Performance

(14)

ARCHITECTURE

Source & client applications

Archive Services

GUI

Data Access

Batch & transactional ingestion

Storage Platform

EMC (Atmos, Centera, Isilon, VNX) + Others

Data Services (xDB)

Content Services (Content Server)

Data

Confirmations

(15)

ARCHITECTURE

Batch Ingestion

Module

Scheduling

Ingestion

Reception

Commit

Transactional

Ingestion Module

Archive Services

Module

Access Control

Retention Mngt

Classification

Auditing

Systems

Administration

Reporting

Confirmation

Generation

Source Applications

Portal

Client applications

Data Access Module

Synchronous search

Asynchronous search

Content retrieval

Transactional ingestion

Archive

Storage

Archived

Data

Audits,

logs

Data

Rejection/invalidation

Confirmations

Partition Mngt

Package Aggregation

Partition caching

Async. search. exec.

(16)

ARCHITECTURE

Application

Server

Receiver

Web

Services

Content

Server

xDB

database

GUI

Client

Application

Source

Application

Transfer

File

Enumerator

Job

Scheduler

Ingestor

Order processor xDB cache

Legend

EMC product

InfoArchive component

Third party product

Storage area

Documentum

Administrator

(DA)

RDBMS

database

Reception area

Ingestion working area

xDB datafiles

Working area

RDBMS datafiles

Staging area

Archiving storage

(e.g. Atmos, Centera, Isilon,

NAS, SAN…)

DA

extension

(17)

CORE PRINCIPLES

Alignment with the OAIS
Framework defining the core features and processes of a digital archive

CORE PRINCIPLES

Base OAIS terminology

Archive Storage

Ingestion
Data Access

Archive Services

EMC InfoArchive

Submission Information Packages (SIPs)
Archive Information Packages (AIPs)
Dissemination Information Packages (DIPs)
Archive Holdings for information classification
An AIP contains Archived Information Units (AIUs)

(19)

CORE PRINCIPLES

SIP – Submission Information Package

What do we archive ?

(20)

CORE PRINCIPLES

SIP Descriptor: describes the data in the SIP. It is a small XML file that must conform to a simple schema imposed by InfoArchive

SIP Data : The data to be archived

SIP

Descriptor

SIP Data

(21)

CORE PRINCIPLES

The payload consists of a number of business objects called AIUs (Archive Information Units)
Documents, transactions, customer cases, ...

SIP

Descriptor

SIP Data

AIU

(22)

CORE PRINCIPLES

Each AIU always has:

Meta Data (structured data)

•

SIP

Descriptor

SIP Data

Meta Data

(structured data)

<XML>

(23)

CORE PRINCIPLES

Each AIU consists of:

Meta data (structured data)

And optionally content information (unstructured data)

SIP

Descriptor

SIP Data

Meta Data

(structured data)

Content Information

(unstructured data)

<XML>

(24)

CORE PRINCIPLES

Meta data for all AIUs is contained in a single XML file
Content information is a collection of files in any format

SIP

Descriptor

SIP Data

Meta Data

Content Information

<XML>

(25)

CORE PRINCIPLES

InfoArchive does not impose
The schema of the structured data, which can be as complex as needed
How content files are referenced in the structured data
A given content file can be referenced by several AIUs
AIUs can reference a varying number of content files

eas_sip.xml

eas_pdi.xml

recording1.mp3

recording2.mp3

recording3.mp4

SIP

SIP descriptor

Structured data

Content Information

(if any)
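Because the referencing scheme is left to the data owner, it can help to check a SIP before submission. The Python sketch below assumes a hypothetical structured data layout (the call and recording element names are invented for illustration; only the file names come from the slide) and shows one AIU with no content, one with a single file, and one sharing a file with another AIU.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured data (eas_pdi.xml); InfoArchive does not impose this
# schema, so the <call> and <recording> element names are assumptions.
PDI = """<calls>
  <call id="1"><recording>recording1.mp3</recording></call>
  <call id="2"><recording>recording1.mp3</recording>
               <recording>recording2.mp3</recording></call>
  <call id="3"/>
</calls>"""

# Content files shipped in the same SIP (names taken from the slide).
SIP_FILES = {"recording1.mp3", "recording2.mp3", "recording3.mp4"}


def content_references(pdi_xml: str) -> dict[str, list[str]]:
    """Map each AIU id to the content files it references (possibly none)."""
    root = ET.fromstring(pdi_xml)
    return {
        call.get("id"): [r.text for r in call.findall("recording")]
        for call in root.findall("call")
    }


def missing_files(refs: dict[str, list[str]], sip_files: set[str]) -> set[str]:
    """Return referenced file names that are absent from the SIP."""
    referenced = {name for names in refs.values() for name in names}
    return referenced - sip_files


refs = content_references(PDI)
print(refs)
print(missing_files(refs, SIP_FILES))
```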

(26)

CORE PRINCIPLES

An AIP lightweight (LWSO) repository object is created
Information read from the descriptor is assigned as properties of the object
Content information is imported as content of the object
Meta data is imported into an xDB partition (aka detachable library)

How is information processed & stored?

SIP

Descriptor

SIP Data

Meta Data

Content Information

AIP object

xDB

Repository

(27)

CORE PRINCIPLES

• A compressed copy of the xDB partition data file can be imported as a rendition of the AIP object
  • This allows the partition to be skipped during xDB backup
• The caching service can cache an xDB partition out or in

Data Partitions

Archive

Storage

(e.g. CAS)

Detached

xDB Storage (FS)

xDB

Caching Service

DATA

<XML>

AIP object

DATA

(28)

CORE PRINCIPLES

• The caching service can be configured for each data type according to its associated search SLA
  • All xDB partitions can be cached in to provide synchronous searching on all data
  • xDB partitions of a predefined recent period can be kept in the cache to provide synchronous searching on that period
• An asynchronous search can be posted even if it covers a cached-out partition
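The two caching configurations described above can be sketched as a small selection rule. This is a hedged illustration in Python, not the caching service's real configuration: the Partition class, the keep_days parameter and the sample dates are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Partition:
    aip_id: int
    max_date: date  # most recent record date held in the partition


def partitions_to_keep_cached(partitions, today: date, keep_days: Optional[int]):
    """Return AIP ids to keep cached in; keep_days=None means cache everything."""
    if keep_days is None:
        return [p.aip_id for p in partitions]
    cutoff = today - timedelta(days=keep_days)
    return [p.aip_id for p in partitions if p.max_date >= cutoff]


parts = [
    Partition(1, date(2011, 1, 31)),
    Partition(2, date(2011, 2, 28)),
    Partition(3, date(2011, 3, 31)),
]
# Keep only partitions holding data from the last 60 days.
print(partitions_to_keep_cached(parts, date(2011, 4, 15), keep_days=60))
```

Partitions left out of the result would be cached out and remain reachable through asynchronous searches only.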

Query Processing

Archive

Storage

(e.g. CAS)

Detached

xDB Storage (FS)

xDB

Caching Service

DATA DATA

(29)

CORE PRINCIPLES

•

InfoArchive uses a two-phase search

• Find the packages (AIPs) that might contain records (AIUs) matching the search criteria.
• Search within these AIPs for the individual AIUs matching the search criteria.

How do we search information?

AIP

AIU

(30)

CORE PRINCIPLES

A company records phone calls and archives the recordings

Search Example

SIP

Descriptor

SIP Data

Phonecall

(31)

CORE PRINCIPLES

A company records phone calls and archives the recordings

Search Example

SIP

Descriptor

SIP Data

Meta Data

Content Information

<Calls>

</Calls>

(32)

CORE PRINCIPLES

• The min./max. values contained in some elements of the structured data of the SIP can be extracted during ingestion and assigned as attributes of the AIP repository object
• The first phase of the search consists of
  • Identifying the subset of search criteria associated with such partitioning criteria
  • Searching the subset of archived AIPs whose min./max. value range fits the search condition
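The min./max. extraction can be illustrated with a few lines of Python. The XML layout and the element names (Call, CallDate, Callee) are assumptions made for this sketch; the real extraction is configured in InfoArchive and is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured data of one SIP; element names are assumptions.
PDI = """<Calls>
  <Call><CallDate>2011-02-03</CallDate><Callee>Jack</Callee></Call>
  <Call><CallDate>2011-02-11</CallDate><Callee>John</Callee></Call>
  <Call><CallDate>2011-02-27</CallDate><Callee>Jill</Callee></Call>
</Calls>"""


def extract_range(pdi_xml: str, element: str) -> tuple[str, str]:
    """Return the (min, max) text values of `element` across all records.

    ISO 8601 dates sort correctly as plain strings, so no parsing is needed.
    """
    values = [e.text for e in ET.fromstring(pdi_xml).iter(element)]
    return min(values), max(values)


call_date_min, call_date_max = extract_range(PDI, "CallDate")
# These two values would become attributes of the AIP repository object.
print(call_date_min, call_date_max)
```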

(33)

CORE PRINCIPLES

For the purposes of this example, an archive package contains all the phone calls for a given month
• For example, all the phone calls for the month of February.
• One package may contain calls to different people (one call to Jack, two calls to Jill, etc.)
• The AIP Index field is the date on which the calls were made

2011 January

2011 February

2011 March

2013 March

...

(34)

CORE PRINCIPLES

Now suppose we want to search for a given set of archived phone calls,

specifically:

All phone calls made to John on February 11, 2011

(35)

CORE PRINCIPLES

• Find the relevant AIPs that may contain the records you are searching for
• The AIP Index is queried and all the AIPs that have records for February 2011 are identified
• In this example there is only one such package
In the next slide it is highlighted in green
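The first search phase amounts to a range-overlap test on the min./max. attributes of each AIP. A minimal Python sketch, modeled on the monthly phone call packages of this example (the in-memory list and the function name candidate_aips are assumptions; the real index lives on the AIP repository objects):

```python
from datetime import date

# One row per AIP: (AIP id, CallDate_min, CallDate_max).
AIP_INDEX = [
    (1, date(2011, 1, 1), date(2011, 1, 31)),
    (2, date(2011, 2, 1), date(2011, 2, 28)),
    (3, date(2011, 3, 1), date(2011, 3, 31)),
]


def candidate_aips(index, search_from: date, search_to: date):
    """Phase 1: keep AIPs whose [min, max] range overlaps the search range."""
    return [
        aip_id
        for aip_id, lo, hi in index
        if lo <= search_to and hi >= search_from
    ]


# All calls made on February 11, 2011: only one package qualifies.
print(candidate_aips(AIP_INDEX, date(2011, 2, 11), date(2011, 2, 11)))
```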

(36)

CORE PRINCIPLES

Archive Storage

Cache Storage

Content Server

Archive Holding   AIP ID   CallDate_min   CallDate_max   Cached in
PhoneCalls        1        2011-01-01     2011-01-31     No
PhoneCalls        2        2011-02-01     2011-02-28     Yes
PhoneCalls        3        2011-03-01     2011-03-31     No
PhoneCalls        ...
PhoneCalls        9        2011-09-01     2011-09-30     Yes

Management of AIP’s

AIP repository objects

[Diagram: DATA partitions labelled AIP ID 1 to AIP ID 10]

AIP’s 2 & 9 are also

stored in archive

storage for no xDB

(37)

CORE PRINCIPLES

•

Before we search the contents of an AIP we have to make sure that the

Data Partition is attached to the Data Service search engine

•

By looking in the AIP Registry we see that the answer is Yes

•

If the answer is No we cannot search this data synchronously but

have to issue an asynchronous search instead.
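The decision described above can be sketched as a simple check on the candidate AIPs. The AipEntry class and the search_mode function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class AipEntry:
    aip_id: int
    cached_in: bool  # True when the xDB partition is attached for searching


def search_mode(candidates: list[AipEntry]) -> str:
    """Return 'synchronous' only if every candidate partition is cached in."""
    if all(aip.cached_in for aip in candidates):
        return "synchronous"
    return "asynchronous"


print(search_mode([AipEntry(2, True)]))
print(search_mode([AipEntry(2, True), AipEntry(3, False)]))
```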

(38)

CORE PRINCIPLES

Archive Storage

Cache Storage

Content Server

Archive Holding   AIP ID   CallDate_min   CallDate_max   Cached in
PhoneCalls        1        2011-01-01     2011-01-31     No
PhoneCalls        2        2011-02-01     2011-02-28     Yes
PhoneCalls        3        2011-03-01     2011-03-31     No
PhoneCalls        ...
PhoneCalls        9        2011-09-01     2011-09-30     Yes

Determine whether the AIP is

cached in

AIP repository objects

[Diagram: DATA partitions labelled AIP ID 1 to AIP ID 10]

AIP’s 2 & 9 are also

stored in archive

storage for no xDB

(39)

CORE PRINCIPLES

• The AIP consists of a set of AIUs, which are the phone call recordings
• We need to search the AIU metadata of these recordings to find all the calls that were made to John on Feb 11
• The good news is that we only have to search one AIP, which contains all the phone calls for February
• By filtering out AIPs that do not contain the AIUs we need, we ensure that, for a given search range, the search time is not influenced by the total volume of archived data
  i.e. it will of course take longer to query 10 years of data than 6 months.

(40)

CORE PRINCIPLES

xDB

DATA

AIP ID 2

The index generated & stored in the xDB Partition

is now used to improve the query performance

when searching for individual AIU’s
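The second phase, searching inside the selected AIP, can be sketched as follows. In InfoArchive this step would be an XQuery evaluated by xDB against the partition's index; the linear ElementTree scan below is only a stand-in, and the element names and sample records are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical AIU metadata of AIP ID 2 (February 2011).
AIP_2 = """<Calls>
  <Call><CallDate>2011-02-03</CallDate><Callee>Jack</Callee><File>recording1.mp3</File></Call>
  <Call><CallDate>2011-02-11</CallDate><Callee>John</Callee><File>recording2.mp3</File></Call>
  <Call><CallDate>2011-02-11</CallDate><Callee>Jill</Callee><File>recording3.mp4</File></Call>
</Calls>"""


def find_calls(aip_xml: str, callee: str, call_date: str) -> list[str]:
    """Phase 2: return the content files of the AIUs matching both criteria."""
    root = ET.fromstring(aip_xml)
    return [
        call.findtext("File")
        for call in root.findall("Call")
        if call.findtext("Callee") == callee
        and call.findtext("CallDate") == call_date
    ]


print(find_calls(AIP_2, "John", "2011-02-11"))
```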

(41)

SCALABILITY

• Until now, sizing has been driven by the required batch ingestion throughput, not by the search & retrieval activity to serve
• The key to increasing the global batch ingestion throughput is the ability to run a higher number of concurrent ingestions
• The search & retrieval workload is unlikely to significantly impact the global sizing
• The ingestion workload profile is very different for structured data and unstructured data
• Performance is not sensitive to the volume already archived
• Each batch ingestion works in a single xDB partition
• A search that includes a partitioning criterion quickly narrows the XQuery scope to a subset of xDB partitions

(42)

SCALABILITY

Type of increasing workload: main stressed resources
Batch ingestion throughput
• For structured data
  • CPU of the ingestor and xDB tiers
  • I/O on the ingestion working area & xDB file system
• For unstructured data
  • CPU of the ingestor and Content Server tiers
  • I/O on the ingestion working area & Content Server storage areas
Searches
• CPU & memory on the Web Services and xDB tiers
• I/O on the xDB file system
# of concurrent users
• CPU & memory on the GUI tiers
# of submitted background searches
• CPU & memory on the Order processor and xDB tiers
• I/O on the xDB file system

Transactional ingestion

(43)

SCALABILITY

VM

• Vertical scalability
Several instances of each tier can be run on a given server
Each tier can be isolated on distinct servers*
* Without having to set up any shared file system across tiers

App Server

GUI

HTTP Load

Balancing HTTP Load Balancing

App Server

Web

Services

Receiver

Ingestor

xDB database#1

xDB database#n

xDB cache

Content

Server

RDBMS

database

Order processor

Concurrent

executions

The activity profile of the system is unlikely to require multiple active Content Server or RDBMS instances for scalability purposes, but this is technically possible as well
If needed, several distinct xDB databases can be used concurrently by the system
Session affinity is mandatory for load balancing across GUI instances
All Web services are stateless (i.e. session affinity is not required for load balancing across WS instances)

(44)

SCALABILITY

Instances of each tier can be distributed among multiple servers

•

Horizontal scalability

VM

HTTP Load

Balancing HTTP Load Balancing

App Server

Web

Services

App Server

GUI

VM

Receiver

Ingestor

Order processor

Concurrent

executions

VM

Content

Server

RDBMS

database

VM

xDB database#n

xDB cache#n

VM

xDB database#1

xDB cache#1

The activity profile of the system is unlikely to require multiple active Content Server or RDBMS instances for scalability purposes, but this is technically possible as well

(45)

SCALABILITY

• Use batch ingestion with large SIPs instead of synchronous ingestion with small SIPs whenever possible
  • The associated workload is much lower and the ingestion is much faster
  • Batch ingestion can be scheduled during periods of low search/retrieval activity in order to optimize the usage of the platform
• Configuring distinct file system areas (local, distinct LUNs) for each ingestion node allows the I/O to scale horizontally

(46)

RETENTION POLICY SERVICES

Objectives

• Delegate to RPS the management of the retention and/or the retention markup of the AIPs.
• RPS retention policies can be inherited from the parent folder, or applied directly to the AIP during reception or manually.
• The integration has been designed to:
  • Impose minimal constraints on the definition of the RPS policies.
  • Avoid imposing additional complexity on customers who do not need RPS (i.e. who only have to apply the basic date-based retention provided by the existing built-in InfoArchive retention management).
General principles
• InfoArchive (EIA) will never decide to purge an AIP if at least one RPS retention policy or RPS markup is applied to it.
• The RPS disposal event is trapped by an EIA TBO on the AIP destroy, which attaches the AIP to the EIA Purge lifecycle.
• This attachment is required in order to generate the EIA confirmation message, if one is configured.
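The first general principle can be expressed as a guard condition. The Python sketch below is purely illustrative: the Aip class, its attribute names and the sample policy and markup names are assumptions, not the eas_aip object model.

```python
from dataclasses import dataclass, field


@dataclass
class Aip:
    aip_id: int
    rps_policies: list[str] = field(default_factory=list)
    rps_markups: list[str] = field(default_factory=list)


def infoarchive_may_purge(aip: Aip) -> bool:
    """Built-in purge is allowed only when no RPS policy or markup is applied."""
    return not aip.rps_policies and not aip.rps_markups


print(infoarchive_may_purge(Aip(1)))
print(infoarchive_may_purge(Aip(2, rps_policies=["SevenYearPolicy"])))
print(infoarchive_may_purge(Aip(3, rps_markups=["Hold"])))
```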

(47)

Restrictions

Batch ingestion only
RPS features cannot be used with synchronous ingestion.
AIP only
The application of RPS retention policies is only supported for AIP repository objects (i.e. eas_aip type or sub-types).
Destroy All
RPS policies applied to AIPs must use a disposal strategy that includes the destruction of the repository object (i.e. “Export All, Destroy All” or “Destroy All”); other disposal strategies are not currently supported.

All Renditions

(48)

Applied Retention Policies and Retention Markups

• To know whether an AIP is under RPS control
  • Display the Retainer ID and Retain Content Until columns.
  • If the Retainer ID is not empty, the AIP is linked to a Retention Policy and/or a Retention Markup.
• For more details, go to View > Applied Retention or View > Applied Retention Markup

(49)

Apply a Retention Policy

Manually
• You can manually apply an Individual / Linked Retention Policy to an AIP / Folder by selecting the action menu Records > Apply Retention Policy
• Don’t forget to select the policy you want to apply at the top.
During the reception
• You can decide to apply an RPS retention policy and an EAS retention period.
• At the Holding level, you can decide to apply a default Retention Class.
• The retention class is defined in the table. You decide whether or not to apply an RPS retention policy and/or an EAS retention period.
• The default value can be overridden by a retention class provided by the SIP descriptor.
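The override rule applied during reception can be sketched as a lookup. Everything named here is hypothetical: the class names, the policy name, the retention period and the resolve_retention function are invented to illustrate the rule and do not reflect the actual holding configuration.

```python
from typing import NamedTuple, Optional


class RetentionClass(NamedTuple):
    rps_policy: Optional[str]       # name of an RPS retention policy, if any
    eas_period_days: Optional[int]  # built-in date-based retention, if any


# Hypothetical retention class table; names and values are made up.
RETENTION_CLASSES = {
    "standard": RetentionClass(rps_policy=None, eas_period_days=3650),
    "regulated": RetentionClass(rps_policy="SevenYearPolicy", eas_period_days=None),
}


def resolve_retention(holding_default: str, sip_class: Optional[str]) -> RetentionClass:
    """The class in the SIP descriptor, when present, overrides the holding default."""
    return RETENTION_CLASSES[sip_class or holding_default]


print(resolve_retention("standard", None))
print(resolve_retention("standard", "regulated"))
```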

(50)

Disposition

Date Based Retention
When the Retain Until Date has expired, the retainer is eligible to be promoted to the Final state.
The promotion is performed by the RPS job: dmc_rps_PromotionJob
The disposition is performed by the RPS job: dmc_rps_DispositionJob
When an AIP is disposed, the object is moved to the folder /System EAS/data/purge and attached to the Purge lifecycle. You need to complete the Purge process to delete the object.
Event Based Retention
To be promoted to the Final state, all conditions must be satisfied.
In the Applied Retention view, select an item and go to View > Properties > Info
In the Phases tab, edit the Condition and enter an event date.
Run the Promotion and Disposition jobs to continue the process.
Privileged Delete
You can force the disposition by using the action Records > Privileged Delete
The AIP is moved to the folder /System EAS/data/purge and attached to the Purge Lifecycle.
The Privileged Delete is only possible on an AIP in COM, REJ-DONE or INV-DONE state.
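The state restriction on Privileged Delete can be summarised as a precondition check. In this Python sketch the dictionary representation of an AIP and the function name are assumptions; only the three allowed states and the purge folder path come from the slide.

```python
# States in which Privileged Delete is allowed, as listed on the slide.
PRIVILEGED_DELETE_STATES = {"COM", "REJ-DONE", "INV-DONE"}
PURGE_FOLDER = "/System EAS/data/purge"


def privileged_delete(aip: dict) -> dict:
    """Move an AIP to the purge folder, or raise if its state forbids it."""
    if aip["state"] not in PRIVILEGED_DELETE_STATES:
        raise ValueError(f"Privileged Delete not allowed in state {aip['state']}")
    # The AIP is not deleted here: the Purge process still has to be completed.
    return {**aip, "folder": PURGE_FOLDER, "lifecycle": "Purge"}


print(privileged_delete({"id": 2, "state": "COM"}))
```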

(51)

QUESTIONS

(52)

(53)
