You use data profiling and analysis to understand your data and ensure that it suits the integration task. WebSphere Information Analyzer is a critical component of IBM Information Server that profiles and analyzes data so that you can deliver trusted information to your users.
WebSphere Information Analyzer can automatically scan samples of your data to determine their quality and structure. This analysis aids you in understanding the inputs to your integration process, ranging from individual fields to high-level data entities. Information analysis also enables you to correct problems with structure or validity before they affect your project.
In many situations, analysis must address data, values, and rules that are best understood by business users. Particularly for comprehensive enterprise resource planning, customer relationship management, or supply chain management packages, validating data against this business knowledge is a critical step. The business knowledge, in turn, forms the basis for ongoing monitoring and auditing of data to ensure validity, accuracy, and compliance with internal standards and industry regulations.
While analysis of source data is a critical first step in any integration project, you must continually monitor the quality of the data. WebSphere Information Analyzer enables you to treat profiling and analysis as an ongoing process and create business metrics that you can run and track over time.
WebSphere Information Analyzer capabilities
IBM WebSphere Information Analyzer automates the task of source data analysis by expediting comprehensive data profiling and minimizing overall costs and resources for critical data integration projects.
WebSphere Information Analyzer represents the next generation in data analysis tools, which are characterized by these attributes:
End-to-end data profiling and content analysis
Provides standard data profiling features and quality controls. The repository holds the data analysis results and project metadata such as project-level and role-level security and function administration.
Business-oriented approach
With its task-based user interface, aids business users in easily reviewing data for anomalies and changes over time, and provides key functional and design information to developers.
Adaptable, flexible, and scalable architecture
Handles high data volumes through common parallel processing technology, and leverages common services such as connectivity to access a wide range of data sources and targets.
Scenarios for information analysis
The following scenarios show how WebSphere Information Analyzer helps organizations understand their data to facilitate integration projects.
Food distribution: Infrastructure rationalization
A leading U.S. food distributor had more than 80 separate mainframe, SAP, and JD Edwards applications supporting global production, distribution, and CRM operations. This infrastructure rationalization project included customer relationship management, order-to-cash, purchase-to-pay, human resources, finance, manufacturing, and supply chain planning. The company needed to move data from these source systems to a single target system.
The company uses WebSphere Information Analyzer to profile its source systems and create master data around key business dimensions, including customer, vendor, item (finished goods), and material (raw materials). They plan to migrate data into a single master SAP environment and a companion SAP BW reporting platform.
Financial services: Data quality assessment
A major brokerage firm had become inefficient by supporting dozens of business groups with their own applications and IT groups. Costs were excessive, regulatory compliance was difficult, and it was impractical to target low-margin, middle-income investors. When the federal government mandated T+1, a regulation that changed industry standard practices, the firm had to find a way to reduce the time to process a trade from 3.5 days to 1 day, a reduction of 71.4 percent.
To meet the federal mandate, the brokerage house uses WebSphere Information Analyzer to inventory their data, identify integration points, remove data redundancies, and document disparities between applications.
The firm now has a repeatable and auditable methodology that leverages automated data analysis. By ensuring that all transactions are processed quickly and uniformly, the company is better able to track and respond to risk resulting from its clients’ and its own investments.
Transportation services: Data quality monitoring
A transportation service provider develops systems that enable its extensive network of independent owner-operators to compete in today’s tough market. The owner-operators were exposed to competition because they could not receive data quickly. Executives had little confidence in the data that they received. Productivity was slowed by excessive time spent reviewing manual interventions and reconciling data from multiple sources.
WebSphere Information Analyzer allows the owner-operators to better understand and analyze their legacy data. It allows them to quickly increase the accuracy of their business intelligence reports and restore executive confidence in their company data. Moving forward, they implemented a data quality solution to cleanse their customer data and spot trends over time, further increasing their confidence in the data.
WebSphere Information Analyzer in a business context
After obtaining project requirements, a project manager initiates the analysis phase of data integration to understand source systems and design target systems. Too often, analysis can be a laborious, manual process that relies on out-of-date (or nonexistent) source documentation or the knowledge of the people who maintain the source systems. But source system analysis is crucial to understanding what data is available and its current state.
Figure 26 shows the role of analysis in IBM Information Server. WebSphere Information Analyzer plays a key role in preparing data for integration by analyzing business information to assure that it is accurate, consistent, timely, and coherent.
Profiling and analysis
Examines data to understand its frequency, dependency, and redundancy, and validates defined schema and definitions.
Data monitoring and trending
Uncovers data quality issues in the source system as data is extracted and loaded into target systems. Validation rules help you create business metrics that you can run and track over time.
Facilitating integration
Uses tables, columns, probable keys, and interrelationships to help with integration design decisions.
Data analysis helps you see the content and structure of data before you start a project and continues to provide useful insight as part of the integration process.
The following data management tasks use data analysis:
Data integration or migration
Data integration or migration projects (including data cleansing and matching) move data from one or more source systems to one or more target systems. Data profiling supports these projects in three critical stages:
1. Assessing sources to support or define business requirements
2. Designing reference tables and mappings from source to target systems
3. Developing and running tests to validate successful integration or migration of data into target systems
Data quality assessment and monitoring
Evaluates quality in targeted static data sources along multiple dimensions including completeness, validity (of values), accuracy, consistency, timeliness, and relevance. Data quality monitoring requires ongoing assessment of data sources. WebSphere Information Analyzer supports these projects by automating many of these dimensions for in-depth snapshots over time.
Figure 26. WebSphere Information Analyzer helps users understand their data.
Asset rationalization
Looks for ways to cut costs that are associated with existing data transformation processes (for example, processor cycles) or data storage. Asset rationalization does not involve moving data, but reviews changes in data over time. WebSphere Information Analyzer supports asset rationalization during the initial assessment of source content and structure and during development and execution of data monitors to understand trends and utilization over time.
Verifying external sources for integration
Validates the arrival of new or periodic external sources to ensure that those sources still support the data integration processes that use them. This process looks at static data sources along multiple dimensions including structural conformity to prior instances, completeness, validity of values, validity of formats, and level of duplication. WebSphere Information Analyzer automates many of these dimensions over time.
A closer look at WebSphere Information Analyzer
WebSphere Information Analyzer is an integrated tool for providing comprehensive enterprise-level data analysis. It features data profiling, analysis, and design and supports ongoing data quality monitoring.
The WebSphere Information Analyzer user interface performs a variety of data analysis tasks, as Figure 27 shows.
Figure 27. Dashboard view of a project provides high-level trends and metrics
WebSphere Information Analyzer can be used by data analysts, subject matter experts, business analysts, integration analysts, and business end users. It has the following characteristics:
Business-driven
Provides end-to-end data lifecycle management (from data access and analysis through data monitoring) to reduce the time and cost to discover, evaluate, correct, and validate data across the enterprise.
Dynamic
Draws on a single active repository for metadata to give you a common platform view.
Scalable
Leverages a high-volume, scalable, parallel processing design to provide high performance analysis of large data sources.
Extensible
Enables you to review and accept data formats and data values as business needs change.
Service oriented
Leverages IBM Information Server’s service-oriented architecture to access connectivity, logging, and security services, allowing access to a wide range of data sources (relational, mainframe, and sequential files) and the sharing of analytical results with other IBM Information Server components.
Robust analytics
Helps you understand embedded or hidden information about content, quality, and structure.
Design integration
Improves the exchange of information from business and data analysts to developers by generating validation reference data and mapping data, which reduces errors.
Robust reporting
Provides a customizable interface for common reporting services, which enables better decision making through visual representation of analysis, trends, and metrics.
IBM WebSphere AuditStage is a suite component that augments WebSphere Information Analyzer by helping you manage the definition and analysis of business rules. WebSphere AuditStage examines source and target data, analyzing across columns for valid value combinations, appropriate data ranges, accurate computations, and correct if-then-else evaluations. WebSphere AuditStage establishes metrics to weight these business rules and stores a history of these analyses and metrics that show trends in data quality.
Where WebSphere Information Analyzer fits in the IBM Information Server architecture
WebSphere Information Analyzer uses a service-oriented architecture to structure data analysis tasks that are used by many new enterprise system architectures.
WebSphere Information Analyzer is supported by a range of shared services and reuses several IBM Information Server components.
Because WebSphere Information Analyzer has multiple discrete services, it has the flexibility to configure systems to match varied customer environments and tiered architectures. Figure 28 shows how WebSphere Information Analyzer interacts with the following elements of IBM Information Server:
IBM Information Server console
Provides a graphical user interface to access WebSphere Information Analyzer functions and organize data analysis results.
Common services
Provide general services that WebSphere Information Analyzer uses such as logging and security. Metadata services provide access, query, and analysis functions for users. Many services that are offered by WebSphere Information Analyzer are specific to its domain of enterprise data analysis such as column analysis, primary key analysis and review, and cross-table analysis.
Common repository
Holds metadata that is shared by multiple projects. WebSphere Information Analyzer organizes data from databases, files, and other sources into a hierarchy of objects. Results that are generated by WebSphere Information Analyzer can be shared with other client programs such as the WebSphere DataStage and WebSphere QualityStage Designer through their respective service layers.
Common parallel processing engine
Addresses high throughput requirements that are inherent in analyzing large quantities of source data by taking advantage of parallelism and pipelining.
Common connectors
Provide connectivity to all the important external resources and access to the common repository from the processing engine. WebSphere Information Analyzer uses these connection services in three fundamental ways:
• Importing metadata
• Performing base analysis on source data
• Providing drill-down and query capabilities
Figure 28. IBM Information Server architecture
WebSphere Information Analyzer tasks
The WebSphere Information Analyzer user interface presents an intuitive set of controls that are designed for integration development workflow.
The WebSphere Information Analyzer user interface aids you in organizing data analysis work into projects. The top-level view is called a Dashboard because it reports a summary of your key project and data metrics, both in a graphical format and in a status grid format.
The high-level status view in Figure 29 on page 50 summarizes the data sources, including their tables and columns, that were analyzed and reviewed so that managers and analysts can quickly determine the status of work. The project view of the GlobalCo project shows a high-level summary of column analysis and an aggregated summary of anomalies found, along with the Getting Started pane.
While many data analysis tools are designed to run in a strict sequence and generate one-time static views of the data, WebSphere Information Analyzer enables you to perform select integration tasks as required or combine them into a larger integration flow. These tasks fall into three categories:
Profiling and analysis
Provides complete analysis of source systems and target systems, and assesses the structure, content, and quality of data, whether at the column level, the cross-column level, the table or file level, the cross-table level, or the cross-source level. This task reports on various aspects of data including classification, attributes, formatting, frequency values, distributions, completeness, and validity.
Data monitoring and trending
Helps you assess data completeness and validity, data formats, and valid-value combinations. This task also evaluates new results against established benchmarks. By using the WebSphere AuditStage component, business users develop additional data rules to assess and measure content and quality over time. Rules can be simple column measures that incorporate knowledge from data profiling or complex conditions that test multiple fields. Validation rules assist in creating business metrics that you can run and track over time.
Facilitating integration
Provides shared analytical information, validation and mapping table generation, and testing of data transformations through cross-comparison of domains before and after processing.
Data profiling and analysis
WebSphere Information Analyzer provides extensive capabilities for profiling source data. The four main data profiling functions are column analysis, primary key analysis, foreign key analysis, and cross-domain analysis.
Figure 29. Information Analyzer project view
Column analysis
Column analysis generates a full frequency distribution and examines all values for a column to infer its definition and properties such as domain values, statistical measures, and minimum and maximum values. Each column of every source table is examined in detail. The following properties are observed and recorded:
• Count of distinct values or cardinality
• Count of empty values, null values, and non-null or empty values
• Minimum, maximum, and average numeric values
• Basic data types, including different date-time formats
• Minimum, maximum, and average length
• Precision and scale for numeric values
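
To make the recorded properties concrete, the following is a minimal Python sketch of a column profile. It is not the product's implementation or API: the function name `profile_column`, the result keys, and the sample `qty_ord` values are hypothetical, and data type inference and precision/scale are left out for brevity.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column (hypothetical helper, not a product API)."""
    nulls = sum(1 for v in values if v is None)
    empties = sum(1 for v in values if v is not None and str(v).strip() == "")
    present = [v for v in values if v is not None and str(v).strip() != ""]

    # Full frequency distribution of the populated values.
    frequency = Counter(present)

    # Keep only the values that parse as numbers for the numeric statistics.
    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except (TypeError, ValueError):
            pass

    lengths = [len(str(v)) for v in present]
    return {
        "cardinality": len(frequency),
        "null_count": nulls,
        "empty_count": empties,
        "non_null_count": len(present),
        "min_value": min(numbers) if numbers else None,
        "max_value": max(numbers) if numbers else None,
        "avg_value": sum(numbers) / len(numbers) if numbers else None,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else None,
        "frequency": frequency,
    }

# Hypothetical sample values for an order quantity column.
qty_ord = ["10", "25", "10", "", None, "7"]
profile = profile_column(qty_ord)
```

In this sketch, empty strings and nulls are counted separately and excluded from the statistics, which mirrors the distinction the list draws between empty, null, and populated values.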
WebSphere Information Analyzer also enables you to drill down on specific columns to define unique quality control measures for each column. Figure 30 shows a closer look at results for a table named GlobalCo_Ord_Dtl. At the top is a summary analysis of the entire table. Beneath the summary is detail for each column that shows standard data profiling results, including data classification, cardinality, and properties. When you select a column, additional tasks that are relevant to that level of analysis become available.
Figure 30. Column analysis example data view
Another function of column analysis is domain analysis. A domain is a valid set of values for an attribute. Domain analysis determines the data domain values for any data element. By using a frequency distribution, you can facilitate testing by providing a list of all the values in a column and the number of occurrences of each. Domain analysis checks whether a data element corresponds to a value in a database table or file. Figure 31 shows a frequency distribution chart that helps find anomalies in the Qtyord column.
The bar chart shows data values on the y-axis and the frequency of those values on the x-axis. This detail points out default and invalid values based on specific selection, ranges, or reference sources, and aids you in iteratively building quality metrics.
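
As an illustration of the idea, a domain check against a reference source can be sketched in a few lines of Python. The function `domain_check`, the sample country codes, and the reference set are all hypothetical; the sketch only shows how a frequency distribution exposes values that fall outside a valid domain.

```python
from collections import Counter

def domain_check(values, valid_domain):
    """Return the frequency distribution and the out-of-domain values with counts."""
    frequency = Counter(values)
    invalid = {
        value: count
        for value, count in frequency.items()
        if value not in valid_domain
    }
    return frequency, invalid

# Hypothetical column values and reference domain.
country_codes = ["US", "DE", "US", "XX", "FR", "XX", "US"]
frequency, invalid = domain_check(country_codes, valid_domain={"US", "DE", "FR"})
```

Here `invalid` maps each value that is missing from the reference set to its number of occurrences, which is the kind of detail a reviewer would use to decide whether a value is a default, an error, or a legitimate addition to the domain.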
When you are validating free-form text, analyzing and understanding the extent of the quality issues is often very difficult. WebSphere Information Analyzer can show each data pattern of the text for a much more detailed quality investigation. It helps with the following tasks:
• Uncovering trends, potential anomalies, metadata discrepancies, and undocumented business practices
• Identifying invalid or default formats and their underlying values
• Verifying the reliability of fields that are proposed as matching criteria for input to WebSphere QualityStage and WebSphere DataStage
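
One common way to derive a data pattern is to replace every letter and digit with a class symbol and then count the resulting shapes. The Python sketch below assumes that approach; the symbols (`A` for a letter, `9` for a digit), the function name `data_pattern`, and the sample phone values are illustrative choices, not the notation the product uses.

```python
from collections import Counter

def data_pattern(text):
    """Map letters to 'A' and digits to '9'; keep every other character as is."""
    symbols = []
    for ch in text:
        if ch.isdigit():
            symbols.append("9")
        elif ch.isalpha():
            symbols.append("A")
        else:
            symbols.append(ch)
    return "".join(symbols)

# Hypothetical free-form phone number field.
phones = ["555-1234", "555-9876", "5551234", "N/A"]
patterns = Counter(data_pattern(p) for p in phones)
```

Grouping by pattern separates the dominant format from rarer shapes such as a missing separator or a placeholder value, which can then be reviewed as possible invalid or default formats.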
Primary key analysis
The primary key of a relational table is a unique identifier that a database uses to access a specific row. Primary key analysis identifies all candidate keys for one or more tables and helps you test a column or combination of columns to determine if it is a candidate for becoming the primary key. Figure 32 on page 53 shows a single-column analysis.
Figure 31. Column analysis example graphical view
The analysis presents all of the columns and the potential primary key candidates. A duplicate check validates the use of such keys. You select the primary key candidate based on its probability for uniqueness and your business knowledge of the data involved. If you select a multicolumn primary key, the system develops a frequency distribution for the concatenated values.
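
The uniqueness test and duplicate check described above can be approximated with a frequency distribution over the key values. The following Python sketch is an illustration under stated assumptions: `key_uniqueness` and the sample `order_lines` rows are hypothetical, and the sketch combines columns as tuples, where the text describes concatenated values, so that different value combinations cannot collide.

```python
from collections import Counter

def key_uniqueness(rows, columns):
    """Return (uniqueness ratio, duplicated key values) for one or more columns."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    frequency = Counter(keys)
    # Any key value that occurs more than once fails the duplicate check.
    duplicates = {key: count for key, count in frequency.items() if count > 1}
    ratio = len(frequency) / len(keys) if keys else 0.0
    return ratio, duplicates

# Hypothetical order detail rows.
order_lines = [
    {"order_id": 1, "line_no": 1, "item": "A"},
    {"order_id": 1, "line_no": 2, "item": "B"},
    {"order_id": 2, "line_no": 1, "item": "A"},
]
single_ratio, single_dups = key_uniqueness(order_lines, ["order_id"])
combo_ratio, combo_dups = key_uniqueness(order_lines, ["order_id", "line_no"])
```

In this sample, `order_id` alone repeats, while the combination of `order_id` and `line_no` is fully unique, so only the combination would remain a candidate. Whether it is the right key is still a business decision.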
Foreign key analysis
Foreign key analysis examines content and relationships across tables. This analysis helps identify foreign keys, check their integrity, and check the referential integrity between the primary key and foreign keys. For example, in a Bill of Materials structure, the parent-child relationships among assemblies and subassemblies would require you to identify relationships between foreign keys and primary keys and validate their referential integrity.
A column qualifies to be a foreign key candidate if the majority (for example, 98 percent or higher) of its frequency distribution values match the frequency distribution values of a primary key column. As Figure 33 on page 54 shows, after you select a foreign key, the system performs a bidirectional test (foreign key to primary key, primary key to foreign key) of each foreign key’s referential integrity and identifies the number of referential integrity violations and "orphan" values (keys that do not match).
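
The candidate threshold and the bidirectional test can be sketched with set operations on the distinct values of each column. This Python example is an assumption-laden illustration: `foreign_key_check`, its result keys, and the sample part identifiers are hypothetical, and the 0.98 default simply reuses the example percentage from the text.

```python
def foreign_key_check(fk_values, pk_values, threshold=0.98):
    """Compare distinct foreign key values with distinct primary key values."""
    fk_distinct = set(fk_values)
    pk_distinct = set(pk_values)

    # Foreign key to primary key: values with no matching primary key (orphans).
    orphans = fk_distinct - pk_distinct
    # Primary key to foreign key: primary key values that are never referenced.
    unreferenced = pk_distinct - fk_distinct

    matched = len(fk_distinct & pk_distinct)
    match_ratio = matched / len(fk_distinct) if fk_distinct else 0.0
    return {
        "is_candidate": match_ratio >= threshold,
        "match_ratio": match_ratio,
        "orphans": orphans,
        "unreferenced": unreferenced,
    }

# Hypothetical Bill of Materials data: part master keys and component references.
part_ids = [100, 101, 102, 103]
bom_component_ids = [100, 100, 101, 999]
result = foreign_key_check(bom_component_ids, part_ids)
```

With this sample data, the component column does not reach the threshold because the value 999 has no matching part, and parts 102 and 103 are never referenced.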
Figure 32. Primary key analysis
Cross-domain analysis
Cross-domain analysis examines content and relationships across tables. This analysis identifies overlaps in values between columns, and any redundancy of data within or between tables. For example, country codes might exist in two different customer tables and you want to maintain a consistent standard for these codes. Cross-domain analysis enables you to directly compare these code values.
WebSphere Information Analyzer uses the results of column analysis for each set of columns that you want to compare. The existence of a common domain might indicate a relationship between tables or the presence of redundant fields. Cross-domain analysis can compare any number of domains within or across sources.
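
For the country code example, a comparison of two domains can be sketched as an overlap calculation on the distinct values of each column. The function `domain_overlap`, its result keys, and both sample columns are hypothetical; this is an illustration of the comparison, not the product's algorithm.

```python
def domain_overlap(left_values, right_values):
    """Compare the distinct values of two columns and report their overlap."""
    left, right = set(left_values), set(right_values)
    common = left & right
    return {
        "common": common,
        "left_only": left - right,
        "right_only": right - left,
        "left_overlap": len(common) / len(left) if left else 0.0,
        "right_overlap": len(common) / len(right) if right else 0.0,
    }

# Hypothetical country code columns from two customer tables.
crm_countries = ["US", "DE", "FR", "US"]
billing_countries = ["US", "DE", "UK", "USA"]
overlap = domain_overlap(crm_countries, billing_countries)
```

The values that appear on only one side, such as `USA` next to `US`, are the ones to review when you want a consistent standard for the codes.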
Data monitoring and trending
With baseline analysis, WebSphere Information Analyzer compares changes to data from one previous column analysis (a baseline) to a new, current column analysis.
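
Conceptually, a baseline comparison is a difference between two snapshots of the same column metrics. The Python sketch below assumes each snapshot is a simple dictionary of metric names and values; `compare_to_baseline` and the sample metrics are hypothetical and stand in for whatever the saved analyses actually contain.

```python
def compare_to_baseline(baseline, current):
    """Return {metric: (baseline value, current value)} for each metric that changed."""
    changes = {}
    for metric in sorted(set(baseline) | set(current)):
        old, new = baseline.get(metric), current.get(metric)
        if old != new:
            changes[metric] = (old, new)
    return changes

# Hypothetical column analysis snapshots taken at two points in time.
baseline = {"cardinality": 120, "null_count": 0, "max_length": 10}
current = {"cardinality": 125, "null_count": 4, "max_length": 10}
changes = compare_to_baseline(baseline, current)
```

Only the metrics that differ are reported, so an unchanged property such as `max_length` drops out and the reviewer sees the new nulls and the shift in cardinality.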
Figure 33. Foreign key analysis
Figure 34 shows the results of comparing two distinct analyses on the