SAP Data Services 4.X
An Enterprise Information management Solution
Table of Contents
I. SAP Data Services 4.X ... 3
Highlights
Training Objectives Audience
Pre Requisites Keys to Success Certification
II. SAP Data Services Core Components ... 4
Designer Repository Job Server Access Server Engine / Service
Data Services Management Console (DMC) Central Management Console (CMC)
III. Data Services Designer... 7
Fundamentals Key Areas Login
Object Hierarchy Define Metadata Pre Requisites Use Case
Naming Standards
IV. Working with Flat Files ... 11
Define a Flat File Query
Case Merge Validation
Custom Functions Working with Variables
V. Data Integrator Transforms ... 15
Performance Tuning
Strategy and Implementation
Life Cycle Management / Deployment Model Failure Recovery
Scheduling
VI. Appendix ... 16
Known Issues
References
SAP Data Services 4.X
The SAP Data Services application can help by delivering a single enterprise- class solution for data integration, data Quality, data profiling, and text analysis that allows you to integrate, transform, improve, and deliver trusted data that supports critical business processes and enables sound decisions.
The software customizes and manages data access and uniquely combines industry-leading technologies for delivering data to analytic, supply-chain management, customer relationship management, and Web applications.
Highlights
An enterprise can manage data as a corporate asset independent of any single system. Integrate data across many systems and reuse that data for many purposes
We can create a single infrastructure for batch and real-time data movement to enable faster and lower cost implementation
Use SAP Data Services to develop enterprise data integration for batch and real- time uses
Distributed Architecture
Training Objectives
Participants will gain the knowledge to participate in implementing, administering, and managing data migration / Integration related projects
Audience
Project team members with basic knowledge of Data Warehousing and data integration concepts
Data Warehousing consultants and project managers who work with data integration
Pre Requisites
Basic knowledge of data warehousing and ETL (Extraction, Transfer and Loading) concepts
Keys to Success
Basic knowledge of SQL language
Basic knowledge of elementary procedural programming
Author recommends combining education courses and hands-on experience to prepare for your certification exam as questions will test your ability to apply the knowledge you have gained in training
Certification
The certification test "SAP Certified Application Associate - Data Integration with SAP Data Services 4.x" verifies proven skills and fundamental knowledge in implementing ETL projects, administering and managing projects using SAP Data Services. This exam will measure your knowledge of SAP Data Services starting with understanding Data Integrator concepts through to managing the Data Integrator environment and profiling data. It will also test your skills in applying performance tuning and sizing methods, implementing complex design methodology, deploying Data Integrator using best practices and troubleshooting and supporting Data Integrator
SAP Data Services Core Components
SAP Data Services combines industry-leading data quality and integration into one platform.
With Data Services, your organization can transform and improve data anywhere. You can have a single environment for development, runtime, management, security and data
connectivity.
Designer
One of the fundamental capabilities of Data Services is extracting, transforming, and loading (ETL) data from heterogeneous sources into a target database or data warehouse. You create applications (jobs) that specify data mappings and transformations by using the Designer.
The Designer is a development tool with an easy-to-use graphical user interface. It enables developers to define data management applications that consist of data mappings, transformations, and control logic.
Use the Designer to create applications containing work flows (job execution definitions) and data flows (data transformation definitions)
Repository
The SAP Data Services repository is a set of tables that hold user-created and predefined system objects, source and target metadata, and transformation rules. Set up repositories on an open client/server platform to facilitate sharing metadata with other enterprise tools. Each repository must be stored on an existing RDBMS and registered in the Central Management Console (CMC).
Local
A local repository is used by an application designer to store definitions of objects (like projects, jobs, work flows, and data flows) and source/target metadata.
Central
A central repository is an optional component that can be used to support multi-user development. The central repository provides a shared object library allowing developers to check objects in and out of their local repositories.
While each user works on applications in a unique local repository, the team uses a central repository to store the master copy of the entire project. The central repository preserves all versions of an application's objects, so you can revert to a previous version if needed.
Multi-user development includes other advanced features such as labeling and filtering to provide you with more flexibility and control in managing application objects.
Job Server
The SAP Data Services Job Server starts the data movement engine that integrates data from multiple heterogeneous sources, performs complex data transformations, and manages extractions and transactions from ERP systems and other sources. The Job Server can move data in either batch or real-time mode and uses distributed query optimization, multi- threading, in-memory caching, in-memory data transformations, and parallel processing to deliver high data throughput and scalability.
Access Server
The SAP Data Services Access Server is a real-time, request-reply message broker that collects message requests, routes them to a real-time service, and delivers a message reply within a user-specified time frame. The Access Server queues messages and sends them to the next available real-time service across any number of computing resources. This approach provides automatic scalability because the Access Server can initiate additional real-time services on additional computing resources if traffic for a given real-time service is high. You can configure multiple Access Servers.
Engine / Service
When Data Services jobs are executed, the Job Server starts engine processes to perform data extraction, transformation, and movement. The engine processes use parallel processing and in-memory data transformations to deliver high data throughput and scalability
The Data Services service is installed when Job and Access Servers are installed. The service starts Job Servers and Access Servers when you restart your system. The Windows service name is SAP Data Services. The UNIX equivalent is a daemon named AL_JobService.
Data Services Management Console (DMC)
A Data Services Management console is a browser based module that includes various utilities like Administrator, Auto Documentation, Impact and Lineage analysis, Data Validation, Data Quality Reports, and Operational Dashboards.
Administrator
This is a valuable utility for production support team, various tasks like Scheduling, monitoring, and executing batch jobs, configuring, starting and stopping real-time services, configuring and managing adapters, publishing batch jobs and real-time services via Web services.
Auto Documentation
Data Services Auto documentation reports capture critical information for understanding your jobs so you can see at a glance the entire ETL process.
After creating a project, you can use Auto documentation reports to quickly create a PDF or Microsoft Word file that captures a selection of job, workflow, and/or data flow information including graphical representations and key mapping details.
Data Validation Reports
In the Data Validation application, create functional areas.
Verify that your source data meets your business rules.
Take the appropriate actions when the data does not meet your business rules Data Quality Reports
Many data quality transforms generate information about the data being processed. Data quality reports provide access to that data processing information.
The statistics-generating transforms include Match, USA Regulatory Address Cleanse, and Global Address Cleanse transforms.
Report types include job summaries, transform-specific reports, and transform group reports Impact and Lineage Analysis
Manage a single metadata repository to capture the relationships between different extraction and access methods and provide integrated lineage and impact analysis.
Analyze the impact of changing a source table, column, element, or field on existing documents and reports.
Analyze lineage to determine data sources of documents and reports.
Track different versions (changes) to each object over time.
Central Management Console (CMC)
SAP Data Services uses the Central Management Console for user accounts and rights management.
Before you can access Data Services repositories in other components such as the Designer and Management Console, you must configure them appropriately.
Create the repository in a database using the Repository Manager.
Register the repository in the Central Management Console (CMC).
Manage security settings for the repository in the CMC.
Data Services Designer
Fundamentals
The fundamental capabilities of Data Services is extracting, transforming, and loading (ETL) data from heterogeneous sources into a target database or data warehouse.
Broad source and target support Including SAP application and databases with bulk loading and CDC changes data capture including file types comma delimited, fixed width, COBOL, XML, Excel etc...
Instant traceability with impact analysis and data lineage capabilities that include the data quality process Multi-user support and versioning via a central repository
Administration tool with scheduling capabilities and monitoring/dashboards Scalable scripting language with a rich set of built-in functions
Interoperability and flexibility with Web services-based applications Debugging and built-in profiling and viewing data
Designer
Key Areas
Project area - 1
Contains the current project (and the job(s) and other objects within it) available to you at a given time.
In Data Services, all entities you create, modify, or work with are objects.
Workspace - 2
The area of the application window in which you define, display, and modify objects.
Local object library - 3
Provides access to local repository objects including built-in system objects, such as transforms and transform configurations, and the objects you build and save, such as jobs and data flows.
Tool palette - 4
Buttons on the tool palette enable you to add new objects to the workspace.
Login
TDB
Object Hierarchy
Projects
A data integration application typically contains one or more Data Services projects. Each project contains one or more jobs, which can load data in batch mode or respond to real-time requests. Projects are a convenient way to group related real-time jobs.
Jobs
There are two types of jobs: batch and real-time. A batch job extracts, transforms, and loads data. A real-time job receives a request, extracts, transforms and loads data to support that request, and then responds to the requestor in real-time.
Work flows
Work flows contain steps that define the order of steps within a job for execution. Work flows call data flows, but cannot manipulate data themselves. Work flows are called from inside other work flows or jobs.
Scripts
A script is a collection of Data Services scripting language statements. Batch jobs start and end with scripts (an initialization and finalization script). Scripts can also be used inside components to execute SQL statements, set variable values, or execute Data Services functions. All variables used in a data flow must be initialized in a script.
Data flows
A data flow contains defines steps to extract data from one or more sources, transform the data, and load the data into one or more target tables.
ABAP Dataflow
An ABAP Dataflow is specific to SAP Application, and has to be defined to extract data from SAP Application.
An ABAP data flow is mandated to use a transport object File Formats
File formats describe the metadata structure.
A file format is a set of properties describing the structure of a flat file (ASCII).
A file format describes a specific file.
A file format template is a generic description that can be used for many data files
The object library stores file format templates that you use to define specific file formats as sources and targets in data flows
Object Hierarchy
Define Metadata
Datastore definitions need to be consistent and secure.
Datastores will be defined once and referenced, used by all developers and jobs.
Developers have READ only access to the datastore definitions, but may modify their local definitions, including the creation of System Configuration for unit testing purposes.
Developers will “re-get” the datastore definitions when changes to structure or passwords are made
Pre Requisites
Define Data Base Schemas as follows ODS
DS_STAGE SALESMART
Note: Please refer to data services tutorial folder to create test data
Use Case
The objective is to load a Datamart (Salesmart) to meet various reporting
The Source of data is an Operational Data Store, and the following tables are used in order to populate Salesmart Star Schema
Source Schema
ODSStage Schema
DS_STAGETarget Schema
SalesmartODS_Customer STG_ODS_Customer CUST_DIM
ODS_Employee STG_ODS_Employee EMPLOYEE_DIM
ODS_Material STG_ODS_Material MRTL_DIM
ODS_Region STG_ODS_Region SALESORG_DIM
ODS_SalesOrder STG_ODS_SalesOrder
SALES_FACT ODS_SalesItem STG_ODS_SalesItem
ODS_Delivery STG_ODS_Delivery
Naming Standards
The Best practice to build a multi user ETL environment is by following proper Naming standards
Prefix Object Example
Job Job Job_ODS_Salesmart_Extract
WF Workflow WF_ODS_Dimension_Extract
WF_ODS_Fact_Extract
DF Dataflow DF_ODS_Customer_Extract
DS Datastore DS_ODS
DS_STAGE DS_SALESMART
Working with Flat Files
Define a Flat File
Create a file format template that defines the structure for a file.
Create a specific source or target file format in a data flow.
The source or target file format is based on a template and specifies connection information such as the file path / file name.
File format objects can describe files in:
Delimited format — Characters such as commas or tabs separate each field Fixed width format — The column width is specified by the user.
All files within the ETL area must be placed on the Job Server.
If the File format is used in multiple places, it should be considered a reusable object and managed like a Datastore.
Query
Maps column from input Schema to output Schema.
Perform transformations and functions on the source data.
Assign Primary Keys to output Schema columns.
Add New Output columns, Nested Schemas, and Function Calls to the output Schema.
Perform Data Nesting and Unnesting with Sub Schemas of the Output Schema. Also assign Make Current Schema.
Generate distinct result set output for the input Schema.
Join data from Multiple Input Source Schemas. Inner Join as well as Outer Join is supported.
Filter input Source Data.
Performs Aggregation based on input column groups.
Generate sorted dataset based on source input column order.
Also we can generate DTD, XML Schema or File Format based on the Input or Output Schema.
Query Editor:
Use the Query editor to specify the Schema In, Schema Out, and Options for the Query transform.
The areas can be resized in order to expand the area in which you are working. You can also expand and contract the columns to change the width of properties displayed in the input and output schema areas.
Important Tabs in Query Editor:
Mapping Tab:
Mapping tab provides complex column mappings and drag drop input schemas and column into the output schema to enable the editor. Use the function wizard and expression editor to build expression.
Select Tab:
This tab specifies whether to output only distinct rows (Discarding any identical duplicate rows).
From Tab:
This tab specifies all input schemas that are used in the current schema.
Where tab:
This tab specifies conditions that determine which rows is output.
Group by tab:
This tab specifies how the output rows are grouped. (If Required)
Order by Tab:
This tab specifies the output rows are sequence (ascending or descending If Required).
Advanced Tab:
Use the Advanced tab to set up Data Services so that it creates separate sub data flows.
Sub data flows process any of the following resource-intensive query clauses:
DISTINCT | GROUP BY | JOIN | ORDER BY
Find Tab:
This tab specifies locate input or output elements containing a specific word or term.
JOINS IN QUERY TRANSFORM Source Tables:
ODS_CUSTOMER | ODS_SALESODER Output Columns:
Cust_id | Cust_name | Order_Date | Order_Status | Net_Order_Date
Join Condition:
ODS_CUSTOMER.CUST_ID = ODS_SALESORDER.CUST_ID