SAP Data Services Hacks
Auto Generating Data Migration Jobs
Shobhit Acharya Session# 3507
Improve data migration efficiency using SAP Data
Services and implementing a few custom approaches that will speed up the extraction and load of source data.
These approaches deal with
Programmatically generating data migration jobs to replace labor
intensive and monotonous job design
Using xml import mechanism to create job templates
Using datastore configurations to ingest multiple instances of
identical source databases
Introductions & Overview
Data Migration & SAP EIM tools
• Data Services: Designer & Workbench • Information Steward
• Review a baseline framework
The use case for DS Workbench Alternative solutions for efficiency
• Understanding the DS Job xml structure
• Developing job generation programs for automation • Datastore configurations
• When to use these programs
Demo
Data Migration
Assess Extract Consolidate Cleanse Load Reconcile
Data migrations at an enterprise scale need a focus on migrating data efficiently, quickly, repeatedly and assuredly. The goal is not just to move and convert data; its to ensure that the data is of high quality and supports the business processes and target system’s operational needs.
Data Migration
80% of organizations will underestimate the costs related to the data acquisition tasks by an average of 50
percent.
SAP Data Services & Information Steward
One Solution that Provides • Data Integration • Data Quality • Text Data Processing One server to execute all capabilities One design environment to manage all development One administration console to monitor all functions
Business
Measure and compare against Information governance rules and standards
SAP Information Steward
IT
Share data quality metrics and problems with business
Case Study
Review a baseline Data Migration Framework
Staging Area IVM Target System Legacy / Staging 1 2 3 4 7 8 BoA Staging BOBJ (Reporting) Source SAP BODS SAP IS 9 10 12 13 14 15 16 Extraction 1 Ingestion into Stage 2 Initial Data Profiling 3 Transform to common structure 4 Initial Health Check 5 Auto De-dup & Cleanse 6 Apply Relevancy Rules 7 Secondary Health Check 8 Facilitated Cleansing 9 Consolidate cleansed data 10 Reference Data validation 11 Fix Reference Data issues 12 Pre-load sign off 13 Load to Target System 14 Reconciliation 15 Post load sign off 16 Transformed 5 6 Target Relevant Load BoA 11
25+ distinct source systems
Multiple source product versions
Need separate job streams
Need separate job control
Over 400 databases to ingest
Hadoop as staging
Multiple waves of migration
Case Study
Ingestion Scope Legacy / Staging 1 2 3 Source SAP BODS Extraction 1 Ingestion into Stage 2 Initial Data Profiling 3The use case for DS Workbench
Quick to build data replication projects
Data flow design
Additional customizations in DS Designer
Progressively additional
Data Replication Design
Data flows
Goodies
The use case for DS Workbench
View data
The use case for DS Workbench
No big data sources or targets
Little or no workflow customizations
A small list of supported transforms
Additional work could be required in DS Designer for job control and customizations
Generate your own jobs in XML
Dataflow= Source Tables -> Query Transforms -> Target (including HDFS)
Workflows
Custom script stages
Jobs
Datastores and Configurations
Flat file formats
Import generated XML as DS Designer Jobs, workflows, dataflow
Configure datastore for multiple deployments of the same source product database
Example : 1 Dataflow in a job
Data Services job export in XML
Understanding the structure of a simple job
Data Services job export in XML
Understanding the structure of a simple job
… 350 lines of xml Lets look closer …
Data Services job export in XML
Understanding the structure of a simple job
DIDatabaseDatastore DIAttributes
DSConfigurations --- <odbc_data_source>*DATASTORE*</odbc_data_source>
DITable --- name="*TABLE_NAME*" owner="*DBOWNER*“ datastore="*DATASTORE”
DIProperties
DIColumn --- List all columns and column properties, *COLUMN_NAME*”
DIDataflow ---name=“ *DATAFLOW*"
DITransforms
DIAttributes --- array sizes, static parameters
DIDatabaseTableSource --- datastoreName="*DATASTORE*“tableName="*TABLE_NAME*"
DIOutputView --- name="*TABLE_NAME*"
DIFileTarget --- formatName="*DATASTORE*_*TABLE_NAME*" filename="*TABLE_NAME*
DIAttributes --- HDFS File location/path + static parameters
DIQuery
DIAttribute --- name="*TABLE_NAME*" value="*TABLE_NAME*"
DISchema
DIElement --- List all columns and column properties, *COLUMN_NAME*
DISelect
DIProjection
DIExpression --- column="*COLUMN_NAME*"
DIFrom
DIFlatFileDatastore --- Name="*DATASTORE*_*TABLE_NAME*"
DISchema
DIElement --- List all columns and column properties, *COLUMN_NAME*
DIAttributes --- Datastore input and output file store attributes, 1 per job
DIUIOptions --- name="*DATASTORE*_*TABLE_NAME*" value="*TABLE_NAME*"
*Variables* Repeat for each
Generating the xml programmatically
Understanding what you need Programmers for your code
• Sql programming skills • Or on java/python/.net
Data Services
• Sandbox Repository
Source database table and column definitions
• Oracle : all_tab_columns, Sql Server : information_schema, Progress DB: sysprogress.syscolumns_full ….
Generating the xml programmatically
Applying the understanding for complex designs
Imported xml This example: 1 Source Datastore 20+ Configurations 400+ Tables 400+ HDFS formats 400+ Dataflow 400+ Workflows 1 Job Auto generated
Generating the xml programmatically
Applying the understanding for complex designs Workflow
Scripts
Dataflow Datastores
Data Services Datastore Configurations
Contains alternate connection
parameters for the datastore
Typically used for promotions to
new environments
Could be leveraged for using
template jobs on multiple
databases with identical schemas (e.g. QAD Progress databases)
Always isolate any custom xml imports into a sandbox repository
Use datastore configurations to maximum effect
Pre-import , export (to xml) the intended source and destination datastore without any tables included
Post-import , re-import these datastores to override the generated datastores
Generating the xml programmatically
Return on Investment
Initial assessment at JCI for developing the custom programs needed (Codename : ATLGEN)
Effort invested in ATLGEN development: 1 pers. week
Typical pre-use efforts: 2 weeks per source
Potential post use efficiency : 10x per source
STAY INFORMED
Follow the ASUGNews team:
Tom Wailgum: @twailgum
Chris Kanaracus: @chriskanaracus