Chapter 4—Data Ingestion
Process Flow
The Data Ingestion subsystem components receive and process data in a series of workflow steps that include extract or preprocess, transform, load, and post-load transformations. Figure 11 shows how the Data Ingestion subsystem components participate in each workflow step to extract, transform, and load data. The workflow processes include the following:
In Extract or Preprocess Workflow, preprocessor components receive the raw market data, business (firm) data, and reference data from their external interfaces. The components then perform data validation and prepare the data for further processing.
In Transform Workflow, transformer components receive the preprocessed market and business data. The components then create derived attributes that support the downstream alert processing.
In Load and Transform Workflow, loader components receive preprocessed Reference data and transformed market and business data. The components then load this data into the database.
In Post Data Load, data transformations occur through Informatica: derivations and aggregations, risk assignment, and watch list processing (refer to Chapter 5, Informatica Workflows on page 133 for more information).
Oracle Financial Services Behavior Detection Platform 6.1.3 Administration Guide
The following sections describe this process flow in more detail.
Figure 11. Data Ingestion Subsystem
Data Ingestion Process Summary
Figure 12 provides a high-level view of the Data Ingestion process for Oracle Financial Services Trading Compliance Solution (TC), Anti-Money Laundering (AML), Broker Compliance Solution (BC), Fraud (FR) and Insurance.
Alternate Process Flow for MiFID Clients
Derivations done by the FDT process for the MiFID scenarios, which use the Order Size Category, require the Four-week Average Daily Share Quantity (4-wk ADTV). An order is defined as small, medium, or large based on how it compares to a percentage of the 4-wk ADTV. The process_market_summary.sh script derives the 4-wk ADTV daily in the end-of-day batch, once the Daily Market Profile is collected for each security from the relevant market data source.
For firms using the MiFID scenarios and running a single end-of-day batch, the process_market_summary.sh script must be executed before the runFDT.sh script so that the 4-wk ADTV for the Current Business Day incorporates the published Current Day Traded Volume.
Figure 13 depicts the dependency between the process_market_summary.sh script and the runFDT.sh script.
Figure 13. Dependency between process_market_summary.sh and runFDT.sh
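The ordering requirement for a single end-of-day batch can be sketched as a small shell function. This is a minimal illustration, not the product's batch controller: it assumes that INSTALL_DIR is set in the environment and that both scripts return a nonzero exit status on failure, which this chapter does not state. The function name run_mifid_end_of_day is hypothetical.

```shell
# Sketch: enforce process_market_summary.sh before runFDT.sh.
# Assumes INSTALL_DIR is exported and that the scripts exit nonzero on failure.
run_mifid_end_of_day() {
    scripts="$INSTALL_DIR/ingestion_manager/scripts"

    # Derive the 4-wk ADTV first so that it includes the current day's volume.
    "$scripts/process_market_summary.sh" || {
        echo "process_market_summary.sh failed; runFDT.sh not started" >&2
        return 1
    }

    # Only then run the FDT, which uses the 4-wk ADTV to size orders.
    "$scripts/runFDT.sh"
}
```

Stopping before runFDT.sh when the summary step fails avoids sizing orders against a 4-wk ADTV that does not yet include the current day.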
For intra-day batch ingestion or intra-day execution of the MiFID scenarios, the process flow does not change from Figure 12. Since the current day’s 4-wk ADTV is not available until the end of the day, the previous day’s 4-wk ADTV is used to determine order size.
For additional information on configuring the percentage values used to define a MiFID-eligible order as Small, Medium, or Large, refer to the Market Supplemental Guidance section in the Data Interface Specification, Release 6.1.3.
Data Ingestion Flow Processes
The following sections take the high-level view of Figure 12 and divide the Data Ingestion flow into distinct processes:
Beginning Preprocessing and Loading
Preprocessing Trading Compliance Solution Data
Processing Data through FDT and MDT
Running Trading Compliance Solution Data Loaders
Rebuilding and Analyzing Statistics
Populating Market and Business Data Tables
Processing Informatica Workflows and other Utilities
Data Ingestion Directory Structure
The processes within each of the procedures refer to input and output directories within the Data Ingestion directory structure. Where not called out in this chapter, all Data Ingestion directories (for example, /inbox or /config) reside in <INSTALL_DIR>/ingestion_manager.
Processing also date-stamps many Data Ingestion directories and subdirectories so that they appear with a YYYYMMDD notation. The system provides this processing date to the set_mantas_date.sh shell script when starting the first batch for the day.
For detailed information about the Data Ingestion directory structure, refer to section Data Ingestion Directory Structure, on page 80.
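As an illustration of the date-stamped layout, the following sketch prints the dated subdirectories for one processing date. The four directory names are the ones that appear with a <yyyymmdd> suffix in the output directory tables of this chapter; the function name dated_ingestion_dirs and its arguments are hypothetical, and the product may date-stamp other directories as well.

```shell
# Sketch: list date-stamped directories for a processing date.
# $1 = ingestion_manager root, $2 = processing date in YYYYMMDD form.
dated_ingestion_dirs() {
    root=$1
    stamp=$2

    # Accept exactly eight digits; reject anything else.
    case $stamp in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *)
            echo "processing date must be YYYYMMDD: $stamp" >&2
            return 1
            ;;
    esac

    # Directories shown with a <yyyymmdd> suffix in this chapter's tables.
    for dir in inbox logs data/errors data/backup; do
        printf '%s/%s/%s\n' "$root" "$dir" "$stamp"
    done
}
```

For example, `dated_ingestion_dirs "$INSTALL_DIR/ingestion_manager" 20240131` prints one path per line, such as the logs directory for that date.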
Beginning Preprocessing and Loading
In Figure 12, section A, preprocessing begins. The system executes preprocessors using the runDP.sh script. The following sample command shows how to invoke a preprocessor:
<INSTALL_DIR>/ingestion_manager/scripts/runDP.sh Account
Ingestion Manager processes data files in groups (in a specified order) from Oracle client data in the /inbox directory.
Table 19 lists the data files by group.
Table 19. Data Files by Group
Group   Data Files
4       AccountAddress
* BackOfficeTransaction must be loaded after the AccountManagementStage utility has been executed (see Miscellaneous Utilities).
Processing of data in Group 1 requires no prerequisite information (dependencies) for preprocessing. Groups that follow, however, rely on successful preprocessing of the previous group to satisfy any dependencies. For example, Ingestion Manager does not run Group 4 until processing of data in Group 3 completes successfully.
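The group-by-group rule can be sketched as follows. This is a simplified illustration that covers preprocessing only and runs the data types within a group one at a time. It assumes that INSTALL_DIR is set and that runDP.sh exits nonzero on failure; neither is stated in this chapter. The function name preprocess_groups_in_order is hypothetical.

```shell
# Sketch: preprocess groups strictly in order and stop at the first failure.
# Each argument is one group, written as a space-separated list of data types.
preprocess_groups_in_order() {
    for group in "$@"; do
        # Word splitting is intentional: a group is a list of data types.
        for data_type in $group; do
            "$INSTALL_DIR/ingestion_manager/scripts/runDP.sh" "$data_type" || {
                echo "preprocessing failed for $data_type; later groups skipped" >&2
                return 1
            }
        done
    done
}
```

A call such as `preprocess_groups_in_order "Account" "AccountAddress"` uses placeholder group contents; Table 19 defines the actual membership of each group.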
The dependencies that determine grouping are based on the referential relationships within the data. If an Oracle client chooses not to perform referential integrity checking, grouping is not required (except in some instances). Even in this case, some reference data files must still be processed before trading data.
These dependencies are as follows:
Prior to executing the runMDS.sh script, you should ingest the following reference data files:
Security
MarketCenter
Prior to executing the runDP.sh, TradeExecution, and runDL.sh scripts, you should ingest the following reference data files:
Security
MarketCenter
CorporateAction
StructuredDeal
SettlementInstruction
Process Flow
The ingestion process flow is as follows:
1. Behavior Detection receives firm data in ASCII flat .dat files, which an Oracle client’s data extraction process places in the /inbox directory. This data can be:
Reference (for example, point-in-time customer and account data)
Transactional (for example, market and trading data)
The preprocessor processes only those files whose names match the naming conventions that the DIS describes and whose date and batch name portions match the current data processing date and batch.
The Oracle client need only supply those file types that the solution sets require.
2. Ingestion Manager executes preprocessors simultaneously (within hardware capacities). The preprocessors use XML configuration files in the /config/datamaps directory to verify that the format of the incoming Oracle client data is correct and to validate its content; specifically:
Error-checking of input data
Assigning sequence IDs to records
Resolving cross-references to reference data
Checking for missing records
Flagging data for insertion or update
Preprocessors place output files in the directories that Table 20 lists.
Table 20. Preprocessing Output Directories
Directory Name                    Description
/inbox/<yyyymmdd>                 Backup of input files (for restart purposes, if necessary).
/data/<business or market>/load   Data files for loading into the database as <data type>_<yyyymmdd>_<batch name>_<N>.XDP. Load control files.
/logs/<yyyymmdd>                  Preprocessing and load status, and error messages.
/data/errors/<yyyymmdd>           Records that failed validation. The file names are the same as those of the input files.
/data/firm/transform              TC trading data files that the FDT processes.
Figure 14 summarizes preprocessing input and output directories.
Figure 14. Preprocessing Input and Output Directories
3. Simultaneous execution of runDL.sh scripts (within hardware capacities) loads each type of data into the FSDM. This script invokes a data loader to load a specified preprocessed data file into the database.
For reference data (any file that has a load operation of Overwrite, which the DIS specifies), two options are available for loading data:
Full Refresh: The system truncates the entire table before loading data. This mode is intended for use when a client provides a complete set of records daily.
Delta Mode: The system updates existing data and inserts new data. This mode is intended for use when a client provides only new or changed records daily.
The FullRefresh parameter in DataIngest.xml controls the use of full refresh or delta mode. When this parameter is true, the system uses full refresh mode; when it is false, the system uses delta mode. The default can be set to either mode, and it can be overridden for individual file types when needed.
The following sample command illustrates execution of data loaders:
<INSTALL_DIR>/ingestion_manager/scripts/runDL.sh Account
Figure 15 illustrates the Trading Compliance Solution data loading process.
Figure 15. TC Data Loading Process
Guidelines for Duplicate Record Handling
The Ingestion Manager considers records to be duplicates if the primary business key for multiple records is the same. The Ingestion Manager manages these records by performing either an insert or update of the database with the contents of the first duplicate record. The system inserts the record if a record is not currently in the
Preprocessing Trading Compliance Solution Data
The Ingestion Manager preprocesses market and trading data as the following steps describe.
1. When Ingestion Manager satisfies dependencies from Group 2 and preprocesses or loads the data in Group 3, it executes the runMDS.sh script to process market data. This script invokes the Market Data server, which does the following:
Supports preprocessing of market data through the following mechanisms:
Preprocessing of queue-based equity market data from a market data I/O stream (for example, Reuters) through TIBCO.
Support of input of market data in flat files.
Assigns sequence numbers to market data records.
Stores market data so that Firm Data Transformer (FDT) and Market Data Transformer (MDT) can retrieve it efficiently.
The Market Data server preprocesses market data files. The following provides a sample command:
<INSTALL_DIR>/ingestion_manager/scripts/runMDS.sh
This command initiates the Market Data server to process the ReportedMarketSale, InsideQuote, and MarketCenterQuote files, which the Oracle Financial Services client previously placed in the /inbox directory.
2. After Ingestion Manager preprocesses and loads the data in Group 2, it executes the runDP.sh script to process TC trading data. This script invokes the preprocessor, which does the following:
Checking input trading data for errors
Assigning sequence IDs to records
Resolving cross-references to market reference data
Checking for missing fields or attributes
When Ingestion Manager executes runMDS.sh, it places output files in the directories in Table 21.
Table 21. runMDS.sh and runDP.sh Output Directories
Directory                         Description
/data/market/extract/<yyyymmdd>   Market data intermediate files.
/logs/<yyyymmdd>                  Preprocessing transformation and load status (in individual, date-stamped log files).
/data/errors/<yyyymmdd>           Records that failed validation.
/inbox/<yyyymmdd>                 Backup of input files (for restart purposes, if necessary).
Figure 16 illustrates input and output directories for preprocessing market and trading data.
Figure 16. runMDS.sh Input and Output Directories
Preprocessing Alternative to the MDS
When ingesting market data in flat files, the Preprocessor can be used as an alternative to the MDS. The following commands can be run in parallel:
<INSTALL_DIR>/ingestion_manager/scripts/runDP.sh InsideQuote
<INSTALL_DIR>/ingestion_manager/scripts/runDP.sh MarketCenterQuote
<INSTALL_DIR>/ingestion_manager/scripts/runDP.sh ReportedMarketSale
The output of these commands will be the same as documented for the MDS. The benefit of this approach is that the Preprocessor has some performance enhancements which the MDS does not. If this alternative is used, everywhere that this document refers to the MDS, these three Preprocessors can be substituted.
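The substitution described above can be sketched as a function that runs either the MDS or the three preprocessors in parallel. This is an illustration only: USE_MDS is a hypothetical switch invented for this sketch, INSTALL_DIR is assumed to be set, and the scripts are assumed to exit nonzero on failure.

```shell
# Sketch: preprocess flat-file market data with the MDS or with the
# three equivalent preprocessors. USE_MDS is a hypothetical switch.
preprocess_market_data() {
    scripts="$INSTALL_DIR/ingestion_manager/scripts"

    if [ "${USE_MDS:-no}" = "yes" ]; then
        "$scripts/runMDS.sh"
        return
    fi

    # Start the three preprocessors in the background and remember their PIDs.
    pids=""
    for data_type in InsideQuote MarketCenterQuote ReportedMarketSale; do
        "$scripts/runDP.sh" "$data_type" &
        pids="$pids $!"
    done

    # Wait for each one; report failure if any preprocessor failed.
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

Waiting on each process ID separately keeps a failure in one preprocessor from being masked by the success of the others.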
Processing Data through FDT and MDT
When the Ingestion Manager has completed preprocessing of TC trading data and market data and has prepared the data output files, the Firm Data Transformer (FDT) and Market Data Transformer (MDT) can retrieve them.
Upon completion of data preprocessing through scripts runDP.sh and runMDS.sh, Ingestion Manager executes the runFDT.sh and runMDT.sh scripts. The runMDT.sh script can run as soon as runMDS.sh processing completes. However, runMDS.sh and
FDT Processing
During execution of the runFDT.sh script, Ingestion Manager processes trade-related data, orders and executions, and trades through the Firm Data Transformer, or FDT (Figure 17). The FDT does the following:
Enriches data.
Produces summary records for orders and trades.
Calculates derived values to support detection needs.
Derives state chains (that is, order life cycle states, marketability states, and displayability states).
Provides data for loading into FSDM schema.
The system executes the FDT with the runFDT.sh script; the following provides a sample command:
<INSTALL_DIR>/ingestion_manager/scripts/runFDT.sh
Figure 17. Firm Data Transformer (FDT) Processing
The FDT:
Processes all files that reside in the /data/firm/transform directory for the current date and batch.
Terminates automatically after processing files that it found at startup.
Ignores files that the system adds after processing begins; the system may process these files by starting FDT again, after exiting from the previous invocation.
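The startup-snapshot behavior in the list above can be illustrated with a generic shell loop. This is not the FDT's implementation: the function name process_startup_snapshot and the handler argument are hypothetical, and the sketch does not filter by date or batch as the FDT does.

```shell
# Sketch: process only the files that exist when the run starts.
# $1 = directory to scan, $2 = command or function to call for each file.
process_startup_snapshot() {
    dir=$1
    handler=$2

    # The glob expands once, before the loop begins, so the list is a
    # snapshot: files created while the loop runs are not visited.
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        "$handler" "$file"
    done
}
```

A file that arrives during a run is picked up only when the function is called again, which mirrors restarting the FDT after the previous invocation exits.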
When Ingestion Manager executes runFDT.sh, it places output files in the directories in Table 22.
Table 22. runFDT.sh Output Directories
Directory                 Description
/data/firm/transform      Rollover data that processing saves for the next run of the FDT. Includes open and closed orders, old executions, old trades, old derived trades, lost order events, and lost trade execution events.
/logs/<yyyymmdd>          Status and error messages.
/data/errors/<yyyymmdd>   Records that the system was unable to transform.
/data/backup/<yyyymmdd>   Backup of preprocessed input files.
/data/firm/load           Transformed output files for loading into the database.
MDT Processing
During execution of the runMDT.sh script, Ingestion Manager processes market data (InsideQuote, MarketCenterQuote, and ReportedMarketSale) through the MDT. The Ingestion Manager also:
Enriches data.
Provides data for loading into FSDM schema.
The system executes the MDT with the runMDT.sh script; the following provides a sample command:
<INSTALL_DIR>/ingestion_manager/scripts/runMDT.sh
Figure 18 illustrates MDT data processing through runMDT.sh.
Figure 18. Market Data Transformer (MDT) Processing
When Ingestion Manager executes runMDT.sh, it places output files in the directories in Table 23.
Table 23. runMDT.sh Output Directories
Directory                Description
/data/market/transform   Checkpoint data kept from one run to the next.
/logs/<yyyymmdd>         Status and error messages.
/data/market/load        Transformed output files to be loaded into the database.
Running Trading Compliance Solution Data Loaders
When FDT and MDT processing complete, the system executes the runDL.sh script for TCS trading and market data load files. This activity loads data from the preprocessors and transformers into the FSDM schema.
The FullRefresh parameter in DataIngest.xml controls use of full refresh or delta mode. A value of <true> implies use of Full Refresh mode; a value of <false> implies use of Delta mode. The default can be set to either mode, and it can be overridden for individual file types.
For reference data (that is, any file that has a load operation of Overwrite, which the DIS specifies), two options are available for loading data:
Full Refresh: Truncates the entire table before loading the file. Use this mode when you plan to provide a complete set of records daily. You must set the FullRefresh parameter in DataIngest.xml to <true> to use the Full Refresh mode.
Delta Mode: Updates existing data and inserts new data. Use this mode only when you plan to provide new or changed records daily. You must set the FullRefresh parameter in DataIngest.xml to <false> to use the Delta mode.
The system executes data loaders using the runDL.sh script; the following provides a sample command:
<INSTALL_DIR>/ingestion_manager/scripts/runDL.sh Order
This command runs the data loaders for the order file that the FDT created previously.
The Ingestion Manager can execute the runDL.sh scripts for trading and market data simultaneously. For example, Ingestion Manager can load ReportedMarketSale, InsideQuote, MarketCenterQuote, and MarketState for market data simultaneously.
Figure 19 illustrates the Trading Compliance Solution data loading process.
Figure 19. TCS Data Loading Process
Rebuilding and Analyzing Statistics
When TCS market data loading is complete, Ingestion Manager does the following (Figure 12 on page 55):
1. Rebuilds database indexes by executing the runRebuildIndexes.sh script for the loaded market data files (refer to section Rebuilding Indexes, on page 69, for more information).
2. Analyzes data table characteristics (refer to section Analyzing Statistics, on page 69, for more information).
Rebuilding Indexes
During the data load process, Ingestion Manager drops database indexes on some tables so that use of Oracle direct-path loading can improve load performance for high-volume data. After loading is complete, Ingestion Manager rebuilds indexes using the runRebuildIndexes.sh script, which makes the table usable.
For example, the system executes the runRebuildIndexes.sh script on completion of the InsideQuote data loader:
<INSTALL_DIR>/ingestion_manager/scripts/runRebuildIndexes.sh InsideQuote
The system then executes the firm_analyze.sh and market_analyze.sh scripts after rebuilding the indexes.
For example:
<INSTALL_DIR>/ingestion_manager/scripts/firm_analyze.sh
Analyzing Statistics
After rebuilding the indexes, Ingestion Manager uses either the firm_analyze.sh script (for trading data) or the market_analyze.sh script (for market data) to analyze data table characteristics. This activity improves index performance and creates internal database statistics about the loaded data.
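The rebuild-then-analyze sequence for market data can be sketched as follows. This is a minimal illustration that assumes INSTALL_DIR is set, that the scripts exit nonzero on failure, and that market_analyze.sh takes no arguments, as the firm_analyze.sh sample suggests. The function name finish_market_load is hypothetical.

```shell
# Sketch: after loading market data, rebuild indexes for each loaded
# data type, then analyze the market tables.
finish_market_load() {
    scripts="$INSTALL_DIR/ingestion_manager/scripts"

    # Rebuild indexes first; the tables are not usable until this completes.
    for data_type in "$@"; do
        "$scripts/runRebuildIndexes.sh" "$data_type" || return 1
    done

    # Gather statistics only after every index has been rebuilt.
    "$scripts/market_analyze.sh"
}
```

For example, `finish_market_load InsideQuote MarketCenterQuote ReportedMarketSale` rebuilds indexes for the three market data types before running the analysis once.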
Populating Market and Business Data Tables
To build and update trade and market summary data in the database, Ingestion Manager runs the process_firm_summary.sh and process_market_summary.sh scripts, as in Figure 12 on page 55.
The following examples illustrate execution of the scripts:
<INSTALL_DIR>/ingestion_manager/scripts/process_firm_summary.sh
<INSTALL_DIR>/ingestion_manager/scripts/process_market_summary.sh
After these two scripts complete processing, Data Ingestion for Trading Compliance Solution is finished.
Processing Informatica Workflows and other Utilities
When the Data Ingestion processes finish loading data into the FSDM, the Ingestion Manager performs the following tasks:
Updates summaries of trading, transaction, and instruction activity.
Assigns transaction and entity risk through watch list processing.
Updates various Balances and Positions derived attributes.
Note: To successfully run Informatica workflows, you must have installed Informatica. Refer to the Oracle Financial Services Behavior Detection Platform Installation Guide for more information.
The system uses Informatica to perform these tasks. Figure 20 illustrates Informatica processing.
Figure 20. Informatica Workflow Processing
Informatica does the following:
Reads mappings or workflows in its repository.
Applies relevant workflows to FSDM: Reference, Transaction, and Derived data.
Updates summary tables (Figure 21), Watch List content, and various Balance and Positions derived attributes (Figure 22).
Figure 21. Informatica Summary Generation
Figure 22 illustrates Informatica Watch List processing and risk assignment.
Figure 22. Informatica Watch List Processing
Refer to section Alternatives to Standard Data Ingestion Practices, on page 77, for more information about Watch List processing.