© COPYRIGHT 2015 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
MarkLogic 8: Infrastructure
Management API, Flexible Replication, Incremental Backup, and Sizing Recommendations
Caio Milani
MarkLogic 8 Feature Presentations
Topics and Product Managers
– Developer Experience: Samplestack and Reference Architecture (Kasey Alderete)
– Developer Experience: Node.js and Java Client APIs, Server-side JavaScript, and Native JSON (Justin Makeig)
– REST Management API, Flexible Replication, Sizing, and Reference Hardware Architectures (Caio Milani)
– Bitemporal (Jim Clark)
Agenda
– Flexible Replication
– Management API
– Incremental Backup
Flexible Replication
Customizable information sharing between systems
Enable content collaboration across numerous systems
Support directly connected or mobile users
Provide data that users need using simple configurable parameters or queries
Ensure data consistency and security with simple workflows
Even better with Bitemporal and Management API
Intelligent Data Layer Enabling Data Collaboration
Data replicates across many databases
– No need for a master data store
– No need for continuous connectivity
– No need to replicate all data
Consistency on edits can be handled by
– Simple versioning
– Check-in/check-out/publish
– Conflict resolution rules
– Bitemporal collections
Users Get Only the Data They Need
Data moves based on collections, URIs, or user-defined queries
User changes to settings and queries update replicated content on their laptops
Data can be transformed and filtered before replication
Security is consistent across all peers, ensuring reliable data access control
Choosing the Right Feature For the Job
Flexible Replication is a document-centric solution aimed at information sharing
Flexible Replication is not intended for disaster recovery (DR) and does not preserve transaction boundaries
Database Replication makes a transactionally consistent copy of the primary data in another data center, aimed at DR
How Documents Are Replicated
Flexible Replication is an asynchronous solution built on top of the Content Processing Framework (CPF) running on a task queue
Any time a target document changes, its properties fragment is updated. Document updates can be pushed (to the replica) or pulled (by the replica)
For push targets, an immediate push is attempted. For pull targets, the properties are updated to reflect that the document needs to be replicated
Query-based targets typically use pull, and for scalability reasons, query-based push targets will also not have an immediate push attempt
If the task server queue is more than half full, the Master Server will not push documents to the Replica and will instead leave it for the scheduled push task
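The push-versus-deferred rules on this slide can be sketched as a small decision function. This is an illustrative model only, not MarkLogic code: the function name, its parameters, and the returned labels are all assumptions made for the example.

```python
def replication_action(target_mode: str, query_based: bool,
                       queue_length: int, queue_capacity: int) -> str:
    """Model what happens to a changed document for one target.

    All names here are hypothetical; only the rules come from the slide.
    """
    if target_mode not in ("push", "pull"):
        raise ValueError("target_mode must be 'push' or 'pull'")
    if target_mode == "pull":
        # Pull targets: the properties are flagged and the replica fetches later.
        return "mark-for-pull"
    if query_based:
        # Query-based push targets skip the immediate attempt for scalability.
        return "defer-to-scheduled-task"
    if queue_length * 2 > queue_capacity:
        # Task server queue more than half full: leave it to the scheduled push.
        return "defer-to-scheduled-task"
    return "immediate-push"


print(replication_action("push", False, 10, 100))
```

A queue that is exactly half full still gets an immediate push in this sketch, because the slide says "more than half full".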
Scheduled Tasks
Regardless of whether you configure replication as push or pull, you must create a scheduled task to periodically replicate updated content
A scheduled replication task does the following:
– Moves zero-day content that existed before replication was configured
– Provides a retry mechanism in the event the initial replication fails
– Replicates deletes on the Master to the Replica
Replication retries are governed by a combination of the task frequency, the documents per batch, and the minimum and maximum retry wait times
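To see how task frequency and documents per batch interact, the following rough estimate may help. It assumes that each scheduled task run replicates at most one batch, which is a simplification made for this sketch and not a statement about the actual server behavior; the function names are hypothetical and the retry wait times are not modeled.

```python
import math


def runs_to_drain(pending_docs: int, docs_per_batch: int) -> int:
    """Scheduled-task runs needed if each run handles at most one batch (assumption)."""
    if docs_per_batch <= 0:
        raise ValueError("docs_per_batch must be positive")
    return math.ceil(pending_docs / docs_per_batch)


def minutes_to_drain(pending_docs: int, docs_per_batch: int,
                     task_period_minutes: int) -> int:
    """Rough time to work through a backlog, for example zero-day content."""
    return runs_to_drain(pending_docs, docs_per_batch) * task_period_minutes


# 10,000 pending documents, 100 per batch, task scheduled every minute.
print(minutes_to_drain(10_000, 100, 1))
```

Under these assumptions, a larger batch or a more frequent task shortens the backlog time proportionally.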
Choose What To Replicate
Documents are replicated based on domain or serialized queries
A domain may be a document, a collection of documents, or a directory
A query works as if you were replicating the results of a search
Users can manage their queries to control what gets replicated
Users can also pause and restart replication to preserve bandwidth
Query-Based Replication
Based on Alerting
Start with a FlexRep config
Create a query-based target by passing in a user id
Then use the alerting API to manage the user's queries, and any matching documents will be replicated to the target
cfg = flexrep:configuration-create()
flexrep:target-create()
admin:group-add-scheduled-task()
flexrep:configuration-target-set-user-id()
alert:make-rule(…. xdmp:user("me"), cts:word-query("apple")…)
flexrep:pull-create()
Modify Documents Before and After Replication
Flexible Replication supports filters that can modify the content, URI, properties, collections, permissions, or anything else about the document
Filters can help decide which documents to replicate, which to skip, and which should have only parts replicated
Filters can even transform the content wholesale during replication, for example using an XSLT stylesheet to convert from one schema to another
Multi-Master
Each database can be a master for its own document sets and transmit updates to remote servers
A database can be a master for some content and replica for another
A database can transitively replicate to additional data centers
[Diagram labels: Updates, Reads, Domain/Query, Application, Replication]
Ownership and Conflicts
In cases of conflict, the master by default “wins” but filters and custom code can assist with more sophisticated conflict handling
Filters can modify a document's properties to create virtual locks (see the implementation example)
Or filters can move documents through collections such as “pending”, “merging”, and “conflicted” to enable automatic or manual resolution
This is a proven solution deployed in critical operations
Implementation Example
Logic of a virtual lock using custom code on outbound and inbound filters
Scale and Collaborate
Scalability to thousands of systems can be achieved with a tiered architecture
Core clusters replicate to regional clusters, which replicate to personal databases
Modifications on personal databases can be cascaded back to core clusters and redistributed globally
[Diagram: core clusters, regional clusters, personal databases]
Management API
REST-based API to manage all MarkLogic capabilities
Increase efficiency and agility by automating time-consuming repetitive tasks across production, testing and development
Reduce setup time and admin error by orchestrating multi-step configurations and deployments
Fit more seamlessly into IT environments by using REST interfaces rather than a CLI or proprietary APIs
Perform automated testing and monitor performance using market tools that support REST
Adaptive to Every Environment
Stateless HTTP calls adapt to changing datacenter topologies unlike CLI and socket based APIs
Use filtering and property parameters to scope endpoint calls and reduce client-side code
Format payloads and outputs as HTML, JSON, or XML, adapting to different scripting techniques
Control access to endpoints with the manage-user (GET and HEAD) and manage-admin roles
Manage simultaneous requests with built in concurrency and lock control, avoiding partial or erroneous updates
Script All Operations in MarkLogic 8
Topologies
Databases, forests, groups, application servers, cluster coupling and decoupling
Security
Users, roles, amps, privileges, and external security
HA/DR
Local failover, database and flexible replication
Backup and Storage
Backup and restore, Tiered storage, CPF configuration
Configuration
SQL views, re/index, merge, bitemporal, inference operations
Deployment
From Read-Only to Full Control
MarkLogic 5: exposed read-only APIs for status and configuration information
MarkLogic 7: exposed cluster, host and forest-level interfaces sufficient for standing up a cluster
MarkLogic 8: exposes almost all other configuration and management tasks that can be accomplished via the GUI, with minor exceptions
General Pattern of Endpoints
Endpoints under /manage/(v2|latest)/

HTTP method | Endpoint | Description | JSON or XML output/input
GET | resource-type | returns a list of the resources | Yes
POST | resource-type | accepts a “properties” flavor and creates a resource of that type | Yes
GET | resource-type/name | returns a description of the resource | Yes
DELETE | resource-type/name | deletes the resource | N/A
POST | resource-type/name | performs an operation on that resource | Yes
GET | resource-type/name/properties | returns a description of the resource in a “properties” flavor; property representations are generally replayable | Yes
PUT | resource-type/name/properties | accepts a “properties” flavor and modifies the resource accordingly | Yes
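The endpoint pattern can be illustrated with a small path builder. This is a minimal sketch: the helper name and its parameters are hypothetical, and it only assembles paths that follow the pattern on this slide without contacting a server.

```python
from typing import Optional
from urllib.parse import quote

BASE = "/manage/v2"


def endpoint(resource_type: str, name: Optional[str] = None,
             properties: bool = False) -> str:
    """Build a Management API path following the general endpoint pattern."""
    if properties and name is None:
        raise ValueError("the properties view needs a resource name")
    parts = [BASE, resource_type]
    if name is not None:
        # Percent-encode the name so it is safe as a single path segment.
        parts.append(quote(name, safe=""))
    if properties:
        parts.append("properties")
    return "/".join(parts)


print(endpoint("databases"))
print(endpoint("databases", "test-db"))
print(endpoint("databases", "test-db", properties=True))
```

The HTTP method then selects the operation: for example, GET on the first path lists databases, while PUT on the third modifies the properties of `test-db`.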
Request parameters:
– format?
Request headers:
– accept?
– content-type
Response headers:
– content-type
On endpoints that support both content negotiation via Accept headers and a format parameter, the format parameter overrides the Accept header.
Acceptable formats:
– JSON
– XML
Acceptable content types:
– application/xml
– application/json
– application/x-www-form-urlencoded
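The precedence rule (the format parameter wins over the Accept header) can be modeled as follows. This is a simplified sketch with a hypothetical function name; it treats the Accept header as a single media type and does not model quality values or the server default.

```python
from typing import Optional

# Media types for the two formats listed on the slide.
MEDIA_TYPES = {"json": "application/json", "xml": "application/xml"}


def negotiated_type(format_param: Optional[str],
                    accept: Optional[str]) -> Optional[str]:
    """Pick the response media type; the format parameter wins over Accept."""
    if format_param is not None:
        key = format_param.lower()
        if key not in MEDIA_TYPES:
            raise ValueError(f"unsupported format: {format_param}")
        return MEDIA_TYPES[key]
    if accept in MEDIA_TYPES.values():
        return accept
    return None  # neither given: the server default applies (not modeled here)


# format=json overrides an Accept header asking for XML.
print(negotiated_type("json", "application/xml"))
```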
Example: Payloads for POST
JSON:
{
  "admin-username" : "adminuser",
  "admin-password" : "mypassword",
  "realm" : "public"
}
XML:
<instance-admin xmlns="http://marklogic.com/manage">
  <admin-password>mypassword</admin-password>
  <admin-username>adminuser</admin-username>
  <realm>public</realm>
</instance-admin>
Example: Checking Backup Status
(: $jobid and $backup-hostname are assumed to be bound earlier in the query :)
let $payload-status :=
  '{"operation": "backup-status", "job-id": "' || $jobid || '", "host-name": "' || $backup-hostname || '"}'
let $status-response :=
  xdmp:http-post(
    "http://localhost:8002/manage/v2/databases/test-db?format=json",
    <options xmlns="xdmp:http">
      <data>{$payload-status}</data>
      <headers>
        <content-type>application/json</content-type>
        <accept>application/json</accept>
      </headers>
    </options>)
return $status-response
Example: Adding a Host to a Cluster
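# Initialize the joining host
curl -X POST -d "" http://${JOINING_HOST}:8001/admin/v1/init
# Fetch the joining host's server configuration
JOINER_CONFIG=`curl -s -S -X GET -H "Accept: application/xml" http://${JOINING_HOST}:8001/admin/v1/server-config`
# Send it to the bootstrap host and save the returned cluster configuration
curl -s -S --digest --user admin:password -X POST -o cluster-config.zip \
  -d "group=Default" --data-urlencode "server-config=${JOINER_CONFIG}" \
  -H "Content-type: application/x-www-form-urlencoded" \
  http://${BOOTSTRAP_HOST}:8001/admin/v1/cluster-config
# Apply the cluster configuration on the joining host. The target URL is missing
# from the slide text; the joining host's cluster-config endpoint is assumed here.
curl -s -S -X POST -H "Content-type: application/zip" \
  --data-binary @./cluster-config.zip \
  http://${JOINING_HOST}:8001/admin/v1/cluster-config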
Example: Adding Flexrep Configuration on a Master
POST resource-type/name/properties, XQuery script with a JSON payload
let $payload := '{"domain-name": "marklogic-com-domain-2", "alerting-uri": "http://marklogic.com/org/uri"}'
let $response := xdmp:http-post("http://localhost:8002/manage/v2/databases/flxrep-master-db/flexrep-configs?format=json",
  …
  <data>{$payload}</data>
  <headers>
    <content-type>application/json</content-type>
    <accept>application/json</accept>
  </headers>
Incremental Backup
[Diagram: weekly calendar, Sunday through the following Sunday, showing FULL backups and INCREMENTAL BACKUP (delta/differential)]
Store only changes since the previous full or incremental backup
Consume less storage for backup copies
Reduce backup window
Improve availability with multiple daily backups
Work with Log Archiving to enable fine-grained point-in-time recovery
Uncompromised Data Resiliency
Reduce Recovery Point Objective (RPO) with incremental backup and journal archiving
Perform point-in-time recovery to overcome garbage-in problems
Operation is simple: the server restores the backup set and replays the journal starting from the given timestamp
[Diagram: a full backup followed by incremental backups; journal frames with timestamps stream from the active journal to archived journals; after a garbage-in event, the restore timestamp is located in the journal]
Smaller, Faster, and More Consistent
Store only data that changed since the last full backup for faster restores
Store changes since the last incremental (deltas) for faster backups and less space
Shorter validation as subsequent incrementals do not examine the full backup
Backup and restore are transactional and guarantee a consistent view of the data
[Diagram: timeline of 1-Full Backup, 2-Incremental, 3-Subsequent, each running Validation, Copy, and Sync phases between Begin Transaction and End Transaction; FULL-to-FULL schedules compare INCREMENTAL BACKUP (cumulative) with INCREMENTAL BACKUP (delta/differential)]
Distributed Backups and Restores
Database backup and restore operations are distributed
All data nodes in a cluster participate
Backup Directory Structure
When you back up a database, you specify a backup directory
Incremental backups are stored in their own directory
Supports either a shared or unshared directory (same path must exist on each data node)
Example:
In this example, the backup directory is /abc/backups and the incremental backup directory is /abc/incremental
/abc/backups
    20140801-1223942093224 (full backup on 8/1)
/abc/incremental
    20140801-1223942093224
        20140802 331006226070 (incremental backup on 8/2)
        20140803 341007528950 (incremental backup on 8/3)
Flexibility to Select Data to Back Up
By default you back up everything:
– The configuration files
– The Security database, including all of its forests
– The Schemas database, including all of its forests
– All of the forests of the database you are backing up
If you back up all forests, you will have a backup that you can restore to the exact same state as when the backup began copying files
You can also back up individual forests, choosing the ones you need. Forest-level backups are consistent for the data in the forest
Consistent Database-Level Backups and Restores
Backup and restore operations are transactional and guarantee a consistent view of the data
Data changes after copy begins are not reflected in the backup or restore set
Backup and restore operations do not lock the database
Database and Forest administrative tasks such as drop, clear, and delete cannot take place during a backup; any such operation is queued up and will initiate after the backup transaction has completed
Phases of Backup and Restore Operation
Validation Phase
Checks that the needed files and directories exist and are writable and valid
For backup operations, also checks for sufficient disk space
Copy Phase
The files are actually copied to or from the backup directory
The config files are copied at the beginning and a timestamp is written
Starts a transaction; if the transaction fails on a restore, the database remains unchanged
Synchronization Phase
Deletes temporary files
Leaves the database in a consistent state
On a restore, it also takes the old version of the database offline and replaces it with the newly restored version
[Diagram: Validation Phase, Copy Phase, Sync Phase, with Begin Transaction and End Transaction markers]
Summary of Incremental Backup
Since an incremental backup takes less time than a full backup, it is possible to schedule frequent incremental backups (for example, by the hour)
A full backup and a series of incremental backups can allow you to recover from a situation where a database has been lost
Incremental backup can be used with or without journal archiving
If you enable both incremental backup and journal archiving, you can replay the journal starting from the last incremental backup timestamp
Incremental backups are recommended for large databases that would take a long time to back up in full
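The difference between cumulative and delta incrementals shows up at restore time. The sketch below picks which backups would be needed to reach a given point, under the assumption that a delta restore needs every incremental since the last full backup while a cumulative restore needs only the newest one. The data model and function name are hypothetical, and timestamps are plain integers for illustration.

```python
from typing import List, Tuple

Backup = Tuple[int, str]  # (timestamp, kind), kind is "full" or "incremental"


def restore_chain(backups: List[Backup], restore_to: int,
                  cumulative: bool) -> List[Backup]:
    """Backups to apply, oldest first, to reach the state at or before restore_to."""
    usable = sorted(b for b in backups if b[0] <= restore_to)
    full_positions = [i for i, b in enumerate(usable) if b[1] == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the requested time")
    # Start from the latest full backup and keep the incrementals after it.
    chain = usable[full_positions[-1]:]
    if cumulative and len(chain) > 1:
        # A cumulative incremental holds everything since the full backup,
        # so only the newest one is needed.
        chain = [chain[0], chain[-1]]
    return chain


history = [(1, "full"), (2, "incremental"), (3, "incremental"),
           (4, "incremental"), (8, "full")]
print(restore_chain(history, 3, cumulative=False))
print(restore_chain(history, 3, cumulative=True))
```

With journal archiving enabled, the journal would then be replayed from the timestamp of the last backup in the chain up to the requested point in time.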
Backup/Restore Operations with Journal Archiving
Journal Archiving enables restore to a specific point in time between backups with the input of a wall clock time
When journal archiving is enabled, journal frames are written to backup directories by near synchronously streaming from the active journal
When journal archiving is enabled, you will experience longer restore times and slightly increased system load as a result of the streaming of journal frames
Performance can be tuned by adjusting the lag limit, the maximum amount of time by which the frames streamed to the backup journal may lag behind the active journal
REFERENCE HARDWARE ARCHITECTURE
Reference Hardware Architecture
With some direct recommendations, you will know exactly how many nodes you will need for your data to ensure you achieve optimal performance for your applications at the lowest cost.
Sizing Forests of Indexed Content
High Performance: 100 GB/Forest, 8 M Docs/Forest
– Many Facets/Range Indexes (~10)
– Sub-second
– High Number of Concurrent Requests
High Capacity: 500 GB/Forest, 100 M Docs/Forest
– Fewer Concurrent Requests
– Archive/Repository/Analytics
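The per-forest figures translate directly into a forest and host count. The sketch below is back-of-the-envelope arithmetic only: it uses the 100 GB and 500 GB per-forest figures from this slide, plus the 6 primary forests per host and 3-host minimum stated on the Ready to Wear slides. The function names and profile labels are hypothetical.

```python
import math

# Indexed content per forest, from this slide.
GB_PER_FOREST = {"high-performance": 100, "high-capacity": 500}


def forests_needed(indexed_gb: float, profile: str) -> int:
    """Primary forests required to hold the given amount of indexed content."""
    return math.ceil(indexed_gb / GB_PER_FOREST[profile])


def hosts_needed(indexed_gb: float, profile: str,
                 primary_forests_per_host: int = 6, min_hosts: int = 3) -> int:
    """Hosts required at 6 primary forests per host, with a 3-host minimum."""
    forests = forests_needed(indexed_gb, profile)
    return max(min_hosts, math.ceil(forests / primary_forests_per_host))


# 2 TB of indexed content under each profile.
print(forests_needed(2000, "high-performance"), hosts_needed(2000, "high-performance"))
print(forests_needed(2000, "high-capacity"), hosts_needed(2000, "high-capacity"))
```

Note that a 3-host cluster with 18 primary forests holds about 1.8 TB at 100 GB per forest and 9 TB at 500 GB per forest, which lines up with the example clusters later in the deck.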
Indexed Content Versus Non-Indexed Content
– Database records, small text files: 100% indexed
– Media and binaries (metadata only): 1% indexed
Ready to Wear: High Performance/High Capacity
Minimum number of hosts and forests per host remains constant
– 3-host cluster, 6 primary forests and 6 replica forests per host, on commodity hardware
Size of forests shifts depending upon where you are on the High Performance/High Capacity spectrum
Ready to Wear: High Performance
Storage: 20 × 2.5" 15K 600 GB drives
– RAID 10, striping plus mirroring
Use Case: Search Application
– Multiple facets (range indexes)
– Large number of concurrent users
– Subsecond queries
Ready to Wear: High Capacity
Storage: 20 × 2.5" 10K 1200 GB drives
– RAID 50, striping plus parity
Use Case: Data Warehouse, Large Scale Analytics
– Smaller number of concurrent users
– Batch report processing that can run offline
– Forests can get much larger
Hardware/Sizing Recommendations
– 2U 25 SFF Chassis
– 128GB – 256GB RAM
– 22 × 10K 900GB Data Drives
– 2 Socket 8 Core/2.8GHz
– 10Gb Network
– 2 × 2GB RAID Cards
Hardware/Sizing Recommendations
– 2U 25 SFF Chassis
– 300GB/Forest + Temp, Binaries, Logs
– 32 Threads @ 2GHz
– 4GB/8GB per Thread
– 1GB/Sec IO to Network
– 1GB/Sec IO to Disks
Example 3 Node Clusters (All HA)
– Archival, eDiscovery: RAID50, 22TB Indexed (6TB Online, 16TB Nearline)
– Metadata Search, Media Store: RAID50, 9TB Indexed, 20TB Binaries
– Mid-Density Database: RAID10, 4TB Indexed
– High-Performance: RAID10, 2TB Indexed
Best Practices: Ancillary Database Placement
Replicate Security, Triggers, Modules, Schemas, Meters
Critical to replicate Security and Modules; multiple copies are good
When upgrading, masters should all be on ONE HOST in the cluster
Best Practices: Huge Pages
Transparent Huge Pages (THP) are enabled by default in RHEL 6. Disable THP and configure Huge Pages instead.
Huge Pages should be set to 3/8 of physical memory
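The 3/8 rule can be turned into a page count for the Linux `vm.nr_hugepages` setting. This sketch assumes the common 2048 kB huge page size on x86-64; the actual values should be read from `MemTotal` and `Hugepagesize` in `/proc/meminfo`. The function name is hypothetical.

```python
def huge_pages_for(mem_total_kb: int, huge_page_kb: int = 2048) -> int:
    """Number of huge pages that covers 3/8 of physical memory, rounded down."""
    if mem_total_kb <= 0 or huge_page_kb <= 0:
        raise ValueError("sizes must be positive")
    return (mem_total_kb * 3 // 8) // huge_page_kb


# A host with 128 GiB of RAM, expressed in kB as /proc/meminfo reports it.
print(huge_pages_for(128 * 1024 * 1024))
```

For a 128 GiB host this gives 24576 pages of 2 MiB each, that is 48 GiB reserved for huge pages.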
Best Practices: Local Disk Replicas
6 Replicas per Host
– Ingestion: still need background merges for replicas
– Essentially doubles the size of the forests: now we have a copy of all documents in a replica forest
– 2x the size of the forests, 2x the number of forests
– Another way of saying this: non-HA is ½ of HA
Design Patterns for High Availability
6 Primary, 6 Replica per Host
Distribute across hosts: we don't want to be in a situation where load is not shared evenly in a failover situation
Easiest to add 3 hosts at a time and use the same distribution pattern; you can add one or two, but you will need to use forest migration
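One way to picture an even distribution is to rotate each host's replicas across the other hosts. The sketch below is an illustrative placement scheme, not a MarkLogic tool: the function name and host labels are hypothetical, and it only demonstrates that with 3 hosts and 6 primaries per host, a failed host's load splits evenly between the two survivors.

```python
from collections import Counter
from typing import Dict, List, Tuple


def place_replicas(hosts: List[str],
                   primaries_per_host: int = 6) -> Dict[Tuple[str, int], str]:
    """Map (primary_host, forest_index) to the host holding that forest's replica."""
    if len(hosts) < 2:
        raise ValueError("need at least two hosts for failover")
    n = len(hosts)
    placement = {}
    for h, host in enumerate(hosts):
        for f in range(primaries_per_host):
            # Rotate through the other hosts, never the primary's own host.
            offset = 1 + f % (n - 1)
            placement[(host, f)] = hosts[(h + offset) % n]
    return placement


placement = place_replicas(["host1", "host2", "host3"])
# Where do host1's six replicas live? Three on each of the other two hosts.
print(Counter(r for (p, _), r in placement.items() if p == "host1"))
```

When the number of primaries per host is not a multiple of the number of other hosts, this rotation is only approximately even.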