INFOARCHIVE
A TECHNICAL
OVERVIEW
DAVID HUMBY
• InfoArchive
• High Level Goals
• Architecture
• Core Principles
• Scalability
• Retention Policy Services Integration
ACTIVE APPLICATIONS (READ & WRITE)
LEGACY APPLICATIONS (READ ONLY)
INFRASTRUCTURE & STORAGE
EMC INFOARCHIVE
DATA
BACKUP COSTS
APPLICATION PERFORMANCE
STORAGE COSTS
DATA GROWTH
IT COST
SHUT DOWN
USERS
EXTENDED AND CONTROLLED DATA ACCESS
COMPLIANCE MANAGEMENT
HIGH LEVEL GOALS
Scalability
• Number of sources / data types
• Data volumes
• Number of requests (search, retrieval)
Flexibility
• Data types
• SLAs (ingestion, search, retrieval)
• Compliance requirements
TCO
• Infrastructure costs
• Operating costs
• Costs to add a new data type, source or client application
HIGH LEVEL GOALS
Flexibility : Data types
• Documents
• Structured data
  • Flat records
  • Complex records
Property: Name, Title, Keywords, Customer, …
HIGH LEVEL GOALS
Flexibility : Data types
Complex structured records which can have a varying number of associated unstructured contents
RDBMS
Purchase Order: PO Number, PO Date, Customer Ref, Product Description
Business Object
COMPLEX DATA MODEL OF THE SOURCE APPLICATION
Information independently comprehensible
SIMPLE SELF DESCRIBING DATA MODEL
• No direct link between source model & archive model
• Simplifies adapting to source data model changes, e.g. application upgrades
POTENTIAL REFERENCE TO AN EXTERNAL ASSOCIATED CONTENT
EXAMPLE TABLE DATA: EMPLOYEES
5 tables:
employees.xml
dep_emp.xml
departments.xml
salaries.xml
titles.xml
<employees>
  <employee>
    <emp_no>10001</emp_no>
    <birth_date>1953-09-02</birth_date>
    <first_name>Georgi</first_name>
    <last_name>Facello</last_name>
    <gender>M</gender>
    <hire_date>1986-06-26</hire_date>
  </employee>
  <employee>
    <emp_no>10002</emp_no>
    <birth_date>1964-06-02</birth_date>
    <first_name>Bezalel</first_name>
    <last_name>Simmel</last_name>
    <gender>F</gender>
    <hire_date>1985-11-21</hire_date>
  </employee>
  ….
</employees>
<dep_emps>
  <dep_emp>
    <emp_no>10001</emp_no>
    <dept_no>d005</dept_no>
    <from_date>1986-06-26</from_date>
    <to_date>9999-01-01</to_date>
  </dep_emp>
  <dep_emp>
    <emp_no>10002</emp_no>
    <dept_no>d007</dept_no>
    <from_date>1996-08-03</from_date>
    <to_date>9999-01-01</to_date>
  </dep_emp>
  ….
</dep_emps>
EXAMPLE DATA: EMPLOYEE BUSINESS OBJECT
<employee>
  <emp_no>10015</emp_no>
  <birth_date>1959-08-19</birth_date>
  <first_name>Guoxiang</first_name>
  <last_name>Nooteboom</last_name>
  <gender>M</gender>
  <hire_date>1987-07-02</hire_date>
  <departments>
    <dept_no>d008</dept_no>
    <dept_name>Research</dept_name>
    <from_date>1992-09-19</from_date>
    <to_date>1993-08-22</to_date>
  </departments>
  <titles>
    <title>
      <title>Senior Staff</title>
      <from_date>1992-09-19</from_date>
      <to_date>1993-08-22</to_date>
    </title>
  </titles>
  <salaries>
    <salary>
      <amount>40000</amount>
      <from_date>1992-09-19</from_date>
      <to_date>1993-08-22</to_date>
    </salary>
  </salaries>
</employee>
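The move from separate table exports to a self-contained business object can be sketched as follows. This is only an illustration in Python, not InfoArchive tooling: the in-memory rows, the helper names (`add_fields`, `rows_for`, `build_employee`) and the use of `xml.etree.ElementTree` are all assumptions made for the example, and the element layout simply mirrors the employee sample above.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory rows standing in for the source tables.
EMPLOYEES = [{"emp_no": "10015", "birth_date": "1959-08-19", "first_name": "Guoxiang",
              "last_name": "Nooteboom", "gender": "M", "hire_date": "1987-07-02"}]
DEPARTMENTS = {"d008": "Research"}
DEP_EMP = [{"emp_no": "10015", "dept_no": "d008",
            "from_date": "1992-09-19", "to_date": "1993-08-22"}]
TITLES = [{"emp_no": "10015", "title": "Senior Staff",
           "from_date": "1992-09-19", "to_date": "1993-08-22"}]
SALARIES = [{"emp_no": "10015", "amount": "40000",
             "from_date": "1992-09-19", "to_date": "1993-08-22"}]


def add_fields(parent, row, names):
    """Append one child element per column name, using the row value as text."""
    for name in names:
        ET.SubElement(parent, name).text = row[name]


def rows_for(table, emp_no):
    """Return the rows of a table that belong to the given employee."""
    return [row for row in table if row["emp_no"] == emp_no]


def build_employee(emp):
    """Denormalize one employee and its related rows into a single XML element."""
    root = ET.Element("employee")
    add_fields(root, emp, ["emp_no", "birth_date", "first_name",
                           "last_name", "gender", "hire_date"])
    departments = ET.SubElement(root, "departments")
    for link in rows_for(DEP_EMP, emp["emp_no"]):
        ET.SubElement(departments, "dept_no").text = link["dept_no"]
        ET.SubElement(departments, "dept_name").text = DEPARTMENTS[link["dept_no"]]
        add_fields(departments, link, ["from_date", "to_date"])
    titles = ET.SubElement(root, "titles")
    for row in rows_for(TITLES, emp["emp_no"]):
        add_fields(ET.SubElement(titles, "title"), row, ["title", "from_date", "to_date"])
    salaries = ET.SubElement(root, "salaries")
    for row in rows_for(SALARIES, emp["emp_no"]):
        add_fields(ET.SubElement(salaries, "salary"), row, ["amount", "from_date", "to_date"])
    return root


employee = build_employee(EMPLOYEES[0])
print(ET.tostring(employee, encoding="unicode"))
```

The resulting element carries everything needed to understand the employee without joining back to the source tables, which is the point of the business object form.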
HIGH LEVEL GOALS
Scalability & TCO
The performance and the TCO must not significantly change with the number/volume of archived items
Number/volume of archived items
• Millions to hundreds of billions
• TB to hundreds of TB
Chart axes: Time, Operating costs, Performance
ARCHITECTURE
Source & client applications
Archive Services
GUI
Data Access
Batch & transactional ingestion
Storage Platform
EMC (Atmos, Centera, Isilon, VNX) + Others
Data Services (xDB)
Content Services (Content Server)
Data
Confirmations
ARCHITECTURE
Batch Ingestion Module
Scheduling
Ingestion
Reception
Commit
Transactional Ingestion Module
Archive Services Module
Access Control
Retention Mngt
Classification
Auditing
Systems Administration
Reporting
Confirmation Generation
Source Applications
Portal
Client applications
Data Access Module
Synchronous search
Asynchronous search
Content retrieval
Transactional ingestion
Archive Storage
Archived Data
Audits, logs
Data
Rejection/invalidation
Confirmations
Partition Mngt
Package Aggregation
Partition caching
Async. search exec.
ARCHITECTURE
Application
Server
Receiver
Web
Services
Content
Server
xDB
database
GUI
Client
Application
Source
Application
Transfer
File
Enumerator
Job
Scheduler
Ingestor
Order processor
xDB cache
Legend
EMC product
InfoArchive component
Third party product
Storage area
Documentum
Administrator
(DA)
RDBMS
database
Reception area
Ingestion working area
xDB datafiles
Working area
RDBMS datafiles
Staging area
Archiving storage
(e.g. Atmos, Centera, Isilon,
NAS, SAN…)
DA
extension
CORE PRINCIPLES
Alignment with the OAIS
Framework defining the core features and processes of a digital archive
CORE PRINCIPLES
Base OAIS terminology
Archive Storage
Ingestion
Data Access
Archive Services
EMC InfoArchive
Submission Information Packages (SIPs)
Archive Information Packages (AIPs)
Dissemination Information Packages (DIPs)
Archive Holdings for information classification
An AIP contains Archived Information Units (AIUs)
CORE PRINCIPLES
SIP – Submission Information Package
What do we archive ?
CORE PRINCIPLES
SIP Descriptor: a small XML file that describes the data in the SIP; it must conform to a simple schema imposed by InfoArchive
SIP Data: the data to be archived
What do we archive?
SIP
SIP Descriptor
SIP Data
CORE PRINCIPLES
The payload consists of a number of business objects called AIUs (Archive Information Units)
Documents, transactions, customer cases, ...
What do we archive?
SIP
SIP Descriptor
SIP Data
AIU, AIU, AIU, ... (the SIP data holds many AIUs)
CORE PRINCIPLES
Each AIU always has:
• Meta Data (structured data)
What do we archive?
SIP
SIP Descriptor
SIP Data
Meta Data (structured data)
<XML>
CORE PRINCIPLES
Each AIU consists of:
• Meta data (structured data)
• And optionally content information (unstructured data)
What do we archive?
SIP
SIP Descriptor
SIP Data
Meta Data (structured data)
Content Information (unstructured data)
<XML>
CORE PRINCIPLES
Meta data for all AIUs is contained in a single XML file
Content information is a collection of files in any format
What do we archive?
SIP
SIP Descriptor
SIP Data
Meta Data
Content Information
<XML>
<XML>
CORE PRINCIPLES
InfoArchive does not impose:
• The schema of the structured data, which can be as complex as needed
• How content files are referenced in the structured data
  • A given content file can be referenced by several AIUs
  • AIUs can reference a varying number of content files
What do we archive?
SIP
• SIP descriptor: eas_sip.xml
• Structured data: eas_pdi.xml
• Content Information (if any): recording1.mp3, recording2.mp3, recording3.mp4
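Because the way content files are referenced is left open, a small sketch may help show the many-to-many relationship between AIUs and content files. The XML layout below (`<Call>`, `<Id>`, `<Recording>`) is purely hypothetical and chosen for this example; it is not a schema defined by InfoArchive.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical structured data: each <Call> is one AIU and may reference
# zero, one or several content files through <Recording> elements.
PDI = """<Calls>
  <Call><Id>1</Id><Recording>recording1.mp3</Recording></Call>
  <Call><Id>2</Id><Recording>recording1.mp3</Recording><Recording>recording2.mp3</Recording></Call>
  <Call><Id>3</Id></Call>
</Calls>"""


def content_references(pdi_xml):
    """Map each referenced content file name to the ids of the AIUs that use it."""
    refs = defaultdict(list)
    for call in ET.fromstring(pdi_xml).iter("Call"):
        for recording in call.findall("Recording"):
            refs[recording.text].append(call.findtext("Id"))
    return dict(refs)


references = content_references(PDI)
print(references)
```

In this sample, one file is shared by two AIUs, one AIU references two files, and one AIU has no content at all.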
CORE PRINCIPLES
An AIP lightweight (LWSO) repository object is created
Information read from the descriptor is assigned as properties of the object
Content information is imported as content of the object
Meta data is imported into an xDB partition (aka detachable library)
How is information processed & stored?
SIP
SIP Descriptor
SIP Data
Meta Data
Content Information
AIP object
xDB
Repository
CORE PRINCIPLES
• A compressed copy of the xDB partition data file can be imported as a rendition of the AIP object
  • This allows the partition to be skipped during xDB backups
• The caching service can cache an xDB partition out or in
Data Partitions
Archive
Storage
(e.g. CAS)
Detached
xDB Storage (FS)
xDB
Caching Service
DATA
<XML>
AIP object
DATA
CORE PRINCIPLES
• The caching service can be configured for each data type according to its associated search SLA
  • All xDB partitions can be kept cached in to provide synchronous searching on all data
  • xDB partitions of a predefined recent period can be kept in the cache to provide synchronous searching on this period
• An asynchronous search can be submitted even if it covers a cached-out partition
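The two caching options above can be sketched as a simple selection rule. This is an illustration only: the function name, the `recent_days` parameter and the `(aip_id, max_date)` tuples are assumptions made for the example and do not reflect the actual configuration of the caching service.

```python
from datetime import date, timedelta


def partitions_to_keep_cached(partitions, today, recent_days=None):
    """Pick the partitions that should stay cached in.

    partitions  -- list of (aip_id, max_date) tuples (hypothetical shape)
    recent_days -- None keeps every partition cached in; otherwise only
                   partitions holding data newer than the cutoff are kept
    """
    if recent_days is None:
        return [aip_id for aip_id, _ in partitions]
    cutoff = today - timedelta(days=recent_days)
    return [aip_id for aip_id, max_date in partitions if max_date >= cutoff]


PARTITIONS = [(1, date(2011, 1, 31)), (8, date(2011, 8, 31)), (9, date(2011, 9, 30))]
print(partitions_to_keep_cached(PARTITIONS, date(2011, 10, 1)))
print(partitions_to_keep_cached(PARTITIONS, date(2011, 10, 1), recent_days=60))
```

A data type with a strict search SLA would use the first form; a data type where only recent data needs immediate answers would use the second and rely on asynchronous searches for older partitions.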
Query Processing
Archive
Storage
(e.g. CAS)
Detached
xDB Storage (FS)
xDB
Caching Service
DATA
DATA
CORE PRINCIPLES
• InfoArchive uses a two-phase search
  • Phase 1: find the packages (AIPs) that might contain records (AIUs) matching the search criteria.
  • Phase 2: search within these AIPs for the individual AIUs matching the search criteria.
How do we search information?
AIP, AIP, ... (each AIP contains several AIUs)
CORE PRINCIPLES
A company records phone calls and archives the recordings
Search Example
SIP
SIP Descriptor
SIP Data
Phonecall, Phonecall, Phonecall, ... (the SIP data holds many phone calls)
CORE PRINCIPLES
A company records phone calls and archives the recordings
Search Example
SIP
SIP Descriptor
SIP Data
Meta Data
Content Information
<Calls>
  <Call>...</Call>
  <Call>...</Call>
  <Call>...</Call>
  <Call>...</Call>
</Calls>
CORE PRINCIPLES
• The min./max. values of selected elements of the SIP's structured data can be extracted during ingestion and assigned as attributes of the AIP repository object
• The first phase of the search consists of:
  • Identifying the subset of the search criteria associated with such partitioning criteria
  • Selecting the subset of archived AIPs whose min./max. value range satisfies the search condition
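The first phase amounts to a range-overlap test between the search range and each AIP's min./max. attributes. The sketch below is an assumption-laden illustration: the `AipEntry` type, its field names and the `candidate_aips` function are hypothetical, and a real deployment would run this selection as a repository query rather than in application code.

```python
from datetime import date
from typing import NamedTuple


class AipEntry(NamedTuple):
    """Hypothetical view of an AIP repository object with its partitioning range."""
    aip_id: int
    call_date_min: date
    call_date_max: date


def candidate_aips(aips, search_from, search_to):
    """Phase 1: keep the AIPs whose [min, max] range overlaps the search range."""
    return [aip for aip in aips
            if aip.call_date_min <= search_to and aip.call_date_max >= search_from]


AIPS = [
    AipEntry(1, date(2011, 1, 1), date(2011, 1, 31)),
    AipEntry(2, date(2011, 2, 1), date(2011, 2, 28)),
    AipEntry(3, date(2011, 3, 1), date(2011, 3, 31)),
]
hits = candidate_aips(AIPS, date(2011, 2, 11), date(2011, 2, 11))
print([aip.aip_id for aip in hits])
```

Only the AIPs returned here need to be opened in the second phase; all others are excluded without reading their data.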
CORE PRINCIPLES
For the purposes of this example, an archive package contains all the phone calls for a given month
• For example, all the phone calls for the month of February.
• One package may contain calls to different people (one call to Jack, two calls to Jill, etc.)
• The AIP Index field is the date on which calls were made
2011 January
2011 February
2011 March
2013 March
...
CORE PRINCIPLES
Now suppose we want to search for a given set of archived phone calls,
specifically:
All phone calls made to John on February 11, 2011
CORE PRINCIPLES
• Find the relevant AIPs that may contain the records for which you are searching
• The AIP Index is queried and all the AIPs that have records for February 2011 are identified
• In this example there is only one such package
  In the next slide it is highlighted in green
CORE PRINCIPLES
Management of AIPs
Content Server: AIP repository objects
Archive Holding   AIP ID   CallDate_min   CallDate_max   Cached in
PhoneCalls        1        2011-01-01     2011-01-31     No
PhoneCalls        2        2011-02-01     2011-02-28     Yes
PhoneCalls        3        2011-03-01     2011-03-31     No
PhoneCalls        ...      ...            ...            ...
PhoneCalls        9        2011-09-01     2011-09-30     Yes
Cache Storage: DATA for AIP IDs 2 and 9
Archive Storage: DATA for AIP IDs 1 to 10
AIPs 2 & 9 are also stored in archive storage for no xDB
CORE PRINCIPLES
• Before we search the contents of an AIP we have to make sure that the Data Partition is attached to the Data Service search engine
• By looking in the AIP Registry we see that the answer is Yes
• If the answer is No we cannot search this data synchronously but have to issue an asynchronous search instead.
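The choice between a synchronous and an asynchronous search can be sketched as a check on the "Cached in" flag of every candidate AIP. The function name and the `(aip_id, cached_in)` tuples are hypothetical, introduced only for this illustration.

```python
def plan_search(candidate_aips):
    """Choose the search mode from the 'Cached in' flag of each candidate AIP.

    candidate_aips -- list of (aip_id, cached_in) tuples (hypothetical shape)
    """
    if all(cached_in for _, cached_in in candidate_aips):
        return "synchronous"
    # At least one partition is cached out, so the data cannot be
    # searched synchronously.
    return "asynchronous"


print(plan_search([(2, True)]))
print(plan_search([(2, True), (3, False)]))
```

With the February example, only AIP 2 is a candidate and it is cached in, so the search can run synchronously.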
CORE PRINCIPLES
Determine whether the AIP is cached in
Content Server: AIP repository objects
Archive Holding   AIP ID   CallDate_min   CallDate_max   Cached in
PhoneCalls        1        2011-01-01     2011-01-31     No
PhoneCalls        2        2011-02-01     2011-02-28     Yes
PhoneCalls        3        2011-03-01     2011-03-31     No
PhoneCalls        ...      ...            ...            ...
PhoneCalls        9        2011-09-01     2011-09-30     Yes
Cache Storage: DATA for AIP IDs 2 and 9
Archive Storage: DATA for AIP IDs 1 to 10
AIPs 2 & 9 are also stored in archive storage for no xDB
CORE PRINCIPLES
• The AIP consists of a set of AIUs, which are the phone call recordings
• We need to search the AIU metadata of these recordings to find all the calls that were made to John on Feb 11
• The good news is that we only have to search one AIP, which contains all the phone calls for February
• By filtering out AIPs that do not contain the AIUs we need, we ensure that, for an identical search range, the search time is not influenced by the total volume of archived data
  i.e. a query will of course take longer if it has to cover 10 years of data instead of 6 months.
CORE PRINCIPLES
xDB
DATA AIP ID 2
The index generated & stored in the xDB partition is now used to improve the query performance when searching for individual AIUs
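The second phase can be sketched as a filter over the AIUs of the one selected AIP. This is a deliberately simplified stand-in: xDB would evaluate an indexed XQuery, whereas the code below performs a plain linear scan with Python's `xml.etree.ElementTree`, and the element names (`<Callee>`, `<CallDate>`, `<Recording>`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured data of AIP 2 (February 2011).
AIP_2_XML = """<Calls>
  <Call><Callee>John</Callee><CallDate>2011-02-11</CallDate><Recording>recording1.mp3</Recording></Call>
  <Call><Callee>Jill</Callee><CallDate>2011-02-11</CallDate><Recording>recording2.mp3</Recording></Call>
  <Call><Callee>John</Callee><CallDate>2011-02-20</CallDate><Recording>recording3.mp4</Recording></Call>
</Calls>"""


def find_calls(aip_xml, callee, call_date):
    """Phase 2: return the AIUs of one AIP that match the search criteria."""
    root = ET.fromstring(aip_xml)
    return [call for call in root.iter("Call")
            if call.findtext("Callee") == callee
            and call.findtext("CallDate") == call_date]


matches = find_calls(AIP_2_XML, "John", "2011-02-11")
print([call.findtext("Recording") for call in matches])
```

The scan touches only the data of AIP 2, which is why the cost of this phase depends on the search range rather than on the total archived volume.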
SCALABILITY
• Until now, sizing has been driven by the required batch ingestion throughput, not by the search & retrieval activity to serve
  • The key to increasing the global batch ingestion throughput is being able to run a higher number of concurrent ingestions
  • The search & retrieval workload is unlikely to significantly impact the global sizing
  • The ingestion workload profile is very different for structured data and unstructured data
• Performance is not sensitive to the volume already archived
  • Each batch ingestion works in a single xDB partition
  • A search including a partitioning criterion quickly narrows the XQuery scope to a subset of xDB partitions
SCALABILITY
Type of increasing workload: Main stressed resources
Batch ingestion throughput
• For structured data
  • CPU of the ingestor and xDB tiers
  • I/O on the ingestion working area & xDB file system
• For unstructured data
  • CPU of the ingestor and Content Server tiers
  • I/O on the ingestion working area & Content Server storage areas
Searches
• CPU & memory on the Web Services and xDB tiers
• I/O on the xDB file system
# of concurrent users
• CPU & memory on the GUI tiers
# of submitted background searches
• CPU & memory on the Order processor and xDB tiers
• I/O on the xDB file system
Transactional ingestion
SCALABILITY
• Vertical scalability
Several instances of each tier can be run on a given server
Each tier can be isolated on distinct servers*
* Without requiring any shared file system to be set up across tiers
Diagram labels (VMs): HTTP Load Balancing, App Server (GUI), HTTP Load Balancing, App Server (Web Services), Receiver, Ingestor, Order processor (concurrent executions), xDB database#1 to xDB database#n, xDB cache, Content Server, RDBMS database
• Session affinity is mandatory for load balancing across GUI instances
• All Web services are stateless (i.e. session affinity is not required for load balancing across WS instances)
• If needed, several distinct xDB databases can be used concurrently by the system
• The activity profile of the system is unlikely to require multiple active Content Server or RDBMS instances for scalability purposes, but this is technically possible as well
SCALABILITY
• Horizontal scalability
Instances of each tier can be distributed among multiple servers
Diagram labels (VMs): HTTP Load Balancing, App Server (Web Services), HTTP Load Balancing, App Server (GUI), Receiver, Ingestor, Order processor (concurrent executions), Content Server, RDBMS database, xDB database#1 with xDB cache#1 to xDB database#n with xDB cache#n
• The activity profile of the system is unlikely to require multiple active Content Server or RDBMS instances for scalability purposes, but this is technically possible as well
SCALABILITY
• Use batch ingestion with large SIPs instead of synchronous ingestion with small SIPs whenever possible
  • The associated workload is much lower and the ingestion is much faster
  • Batch ingestion can be scheduled during periods of low search/retrieval activity in order to optimize the usage of the platform
• Configuring distinct file system areas (local, distinct LUNs) for each ingestion node allows I/O to scale horizontally
RETENTION POLICY SERVICES
Objectives
• Delegate to RPS the management of the retention and/or the retention markup of the AIPs.
• The RPS retention policies can be inherited from the parent folder or applied directly to the AIP, either during reception or manually.
• The integration has been designed to:
  • Impose minimal constraints on the definition of the RPS policies.
  • Avoid imposing additional complexity on customers who do not need RPS (i.e. who only have to apply the basic date-based retention provided by the existing built-in InfoArchive retention management).
General principles
• InfoArchive (EIA) will never decide to purge an AIP if at least one RPS retention policy or RPS markup is applied to that AIP.
• The RPS disposal event is trapped by an EIA TBO on the AIP destroy operation in order to attach the AIP to the EIA Purge lifecycle.
  • This attachment is required to be able to generate the EIA confirmation message, if one is configured.
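The first general principle can be read as a guard in front of the built-in purge decision. The sketch below is a hypothetical model, not InfoArchive code: the `ArchivedPackage` class, its field names and the assumption that the built-in rule is a simple comparison against a retention date are all introduced for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ArchivedPackage:
    """Hypothetical, simplified view of an AIP for retention purposes."""
    retention_date: date                               # built-in date-based retention
    rps_policies: list = field(default_factory=list)   # applied RPS retention policies
    rps_markups: list = field(default_factory=list)    # applied RPS retention markups


def eia_may_purge(aip, today):
    """Return True only if the built-in retention management may purge the AIP."""
    if aip.rps_policies or aip.rps_markups:
        # RPS is in control: the built-in mechanism never decides to purge.
        return False
    return today >= aip.retention_date


plain = ArchivedPackage(retention_date=date(2012, 1, 1))
held = ArchivedPackage(retention_date=date(2012, 1, 1), rps_markups=["hold"])
print(eia_may_purge(plain, date(2013, 1, 1)), eia_may_purge(held, date(2013, 1, 1)))
```

An AIP under RPS control leaves the archive only through the RPS disposition process, which then hands it over to the Purge lifecycle.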
RETENTION POLICY SERVICES
Restrictions
Batch ingestion only
RPS features cannot be used with synchronous ingestion.
AIP only
The application of RPS retention policies is only supported for AIP repository objects (i.e. eas_aip type or sub-types).
Destroy All
RPS policies applied to AIPs must use a disposal strategy that includes the destruction of the repository object (i.e. “Export All, Destroy All” or “Destroy All”); the use of other disposal strategies is not currently supported.
All Renditions
RETENTION POLICY SERVICES
Applied Retention Policies and Retention Markups
• To know if an AIP is under RPS control
  • Display the Retainer ID and Retain Content Until columns.
  • If the Retainer ID is not empty, the AIP is linked to a Retention Policy and/or a Retention Markup.
• For more details, go to View > Applied Retention or View > Applied Retention Markup
RETENTION POLICY SERVICES
Apply a Retention Policy
Manually
• You can manually apply an Individual / Linked Retention Policy to an AIP / Folder by selecting the action menu Records > Apply Retention Policy
• Don’t forget to select the Policy you want to apply at the top.
During the reception
• You can decide to apply an RPS retention policy and an EAS retention period.
• At the Holding level, you can decide to apply a default Retention Class.
• The retention class is defined in the table. You decide whether or not to apply an RPS retention policy and/or an EAS retention period.
• The default value can be overridden by a retention class provided by the SIP descriptor.
RETENTION POLICY SERVICES
Disposition
Date Based Retention
When the Retain Until Date has passed, the retainer is eligible to be promoted to the Final state.
The promotion is performed by the RPS job: dmc_rps_PromotionJob
The disposition is performed by the RPS job: dmc_rps_DispositionJob
When an AIP is disposed, the object is moved to the folder /System EAS/data/purge and attached to the Purge lifecycle. You need to complete the Purge process to delete the object.
Event Based Retention
To be promoted to the Final state, all conditions must be satisfied.
In the Applied Retention view, select an item and go to View > Properties > Info
In the Phases tab, edit the Condition and enter an event date.
Run the Promotion and Disposition jobs to continue the process.
Privileged Delete
You can force the disposition by using the action Records > Privileged Delete
The AIP is moved to the folder /System EAS/data/purge and attached to the Purge Lifecycle.
The Privileged Delete is only possible on an AIP in COM, REJ-DONE or INV-DONE state.
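The Privileged Delete rule can be sketched as a state check followed by the move to the purge folder. The dictionary-based AIP and the `privileged_delete` function are hypothetical and only model the rule stated above; the real action is performed through the Records > Privileged Delete menu.

```python
# States in which a Privileged Delete is allowed, as listed above.
PRIVILEGED_DELETE_STATES = {"COM", "REJ-DONE", "INV-DONE"}
PURGE_FOLDER = "/System EAS/data/purge"


def privileged_delete(aip):
    """Return a copy of the AIP moved to the purge folder and attached to the Purge lifecycle.

    aip -- dict with at least a 'state' key (hypothetical shape)
    """
    if aip["state"] not in PRIVILEGED_DELETE_STATES:
        raise ValueError(f"Privileged Delete not allowed in state {aip['state']!r}")
    # The object is not deleted here: the Purge process still has to be completed.
    return {**aip, "folder": PURGE_FOLDER, "lifecycle": "Purge"}


moved = privileged_delete({"aip_id": 2, "state": "COM"})
print(moved["folder"], moved["lifecycle"])
```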