Data Quality Report
Prepared for: Demo (with real data)
Date: 1/11/11
Contents
1 Introduction
1.1 Company Background
1.2 Reference Data
1.3 Address Verification Process
2 Detailed Results
2.1 Input Data Analysis
2.1.1 Data Supplied
2.1.2 Top 10 Field Contents
2.2 Parsing Reports
2.2.1 Parsing Status Chart
2.2.2 Data Completeness Graph
2.2.3 Data Completeness per Top 10 Countries Graph
2.3 Verification Reports
2.3.1 Verification Level Improvement Graph
2.3.2 Cumulative Uplift Graph
2.3.3 Textual Changes Graph
2.3.4 Verification Level per Top 10 Countries Graph
2.3.5 Verification Status Chart
2.3.6 Postcode Status Chart
2.4 Geocoding Reports
2.4.1 Geocoding Level Chart
2.4.2 Geocoding Level per Top 10 Countries Graph
2.5 Conclusions
1 Introduction
1.1 Company Background
Location is a key data element for most business applications. Consumers need to know where to look for products and services, and businesses need to accurately identify the location of their customers to deliver these products and services efficiently. Whether for data quality, fraud prevention or delivering location-based services, identifying an accurate customer location is key to successful business strategies.
Loqate is a leader in geographic data quality solutions, ensuring that the data you collect is accurate and complete - first time, every time. Whether the data is a postal address, a point of interest or a latitude/longitude coordinate, we use a combination of advanced algorithmic analysis and comprehensive reference sources to identify, parse, standardize, verify and enrich it, adding valuable information to increase its value.
Our solutions are global, covering over 240 countries around the world - that’s every populated world territory. So whether you’re trading in one country or globally, Loqate can meet your requirements - any language, any character set.
1.2 Reference Data
Our Global Knowledge Repository contains reference data relating to postal addresses and other geographic points of interest for all 240+ populated countries and territories of the world. Loqate is the only solution of its type with complete global coverage.
Our reference data comes directly from data owners around the world and is licensed for use within the Loqate solutions.
1.3 Address Verification Process
The Loqate Geo-data Quality Engine identifies, verifies and enriches geographic and location-based data such as postal addresses, points of interest, and latitude/longitude coordinates.
Loqate can run as an installed Library/API or through our Software as a Service platform.
Having identified address components, the Verify module will parse unstructured text into labeled address components, transliterate from one character set to another, validate against our Global Knowledge Repository of worldwide reference data to correct misspellings, add missing components and postcodes, and standardize to the correct local format and content.
Our Verify and Geocode modules work for all 240+ populated countries and territories of the world.
2 Detailed Results
2.1 Input Data Analysis
2.1.1 Data Supplied
File name: test_data.txt
Input file type: Tab delimited UTF8 Text
# of records: 342986
Field list from input file:
ADDRESS1 ADDRESS2 ADDRESS3 CITY STATE ZIP COUNTRY
2.1.2 Top 10 Field Contents
Top 10 field contents for each provided field:
ADDRESS1 field   Count
proprietary      1893
proprietary      1028
proprietary      426
proprietary      366
proprietary      366
proprietary      356
proprietary      345
proprietary      305
proprietary      293
proprietary      266
Blank Records    2121
ADDRESS2 field   Count
proprietary      389
proprietary      189
proprietary      176
proprietary      140
proprietary      116
proprietary      99
proprietary      97
proprietary      96
proprietary      82
proprietary      75
Blank Records    289972
ADDRESS3 field   Count
proprietary      79
proprietary      66
proprietary      50
proprietary      49
proprietary      39
proprietary      36
proprietary      32
proprietary      31
proprietary      31
proprietary      29
Blank Records    338453
CITY field       Count
proprietary      19782
proprietary      13935
proprietary      7632
proprietary      6446
proprietary      6073
proprietary      5332
proprietary      5040
proprietary      4288
proprietary      4105
proprietary      3500
Blank Records    2533
STATE field      Count
proprietary      100992
proprietary      39244
proprietary      38485
proprietary      16323
proprietary      7934
proprietary      7628
proprietary      7047
proprietary      4797
proprietary      3982
proprietary      2844
Blank Records    83149
ZIP field        Count
proprietary      546
proprietary      397
proprietary      318
proprietary      236
proprietary      213
proprietary      205
proprietary      191
proprietary      186
proprietary      182
proprietary      182
Blank Records    18679
COUNTRY field    Count
proprietary      234849
proprietary      11210
proprietary      9849
proprietary      5925
proprietary      5262
proprietary      5041
proprietary      4210
proprietary      3761
proprietary      3299
proprietary      3125
Blank Records    41
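For readers who want to reproduce a similar per-field summary on their own file, the following is a minimal sketch and not the method used to produce this report. It assumes a tab-delimited input whose first row is a header containing the field names listed in section 2.1.1; the function name `top_field_contents` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Field names taken from the field list in section 2.1.1.
FIELDS = ["ADDRESS1", "ADDRESS2", "ADDRESS3", "CITY", "STATE", "ZIP", "COUNTRY"]


def top_field_contents(lines, fields=FIELDS, n=10):
    """Return {field: (top_n_values_with_counts, blank_count)}.

    Assumes `lines` yields tab-delimited text with a header row (an assumption,
    not something the report states).
    """
    counters = {f: Counter() for f in fields}
    blanks = {f: 0 for f in fields}
    for row in csv.DictReader(lines, delimiter="\t"):
        for f in fields:
            value = (row.get(f) or "").strip()
            if value:
                counters[f][value] += 1
            else:
                blanks[f] += 1  # empty or whitespace-only counts as blank
    return {f: (counters[f].most_common(n), blanks[f]) for f in fields}


# Hypothetical sample data, for illustration only.
sample = io.StringIO(
    "ADDRESS1\tADDRESS2\tADDRESS3\tCITY\tSTATE\tZIP\tCOUNTRY\n"
    "1 Main St\t\t\tSpringfield\tIL\t62701\tUS\n"
    "1 Main St\tSuite 2\t\tSpringfield\tIL\t62701\tUS\n"
    "\t\t\tToronto\tON\t\tCA\n"
)
report = top_field_contents(sample)
```

A real file can be passed in place of `sample` by opening it with `open(path, encoding="utf-8", newline="")`.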
2.2 Parsing Reports
The parsing reports collectively provide a quantitative picture of the input data quality. More details are provided with each report, explaining specifically what it shows.
2.2.1 Parsing Status Chart
The parsing status chart indicates the percentage of the input data that contained complete, identifiable components. A failure to parse normally occurs because of input records that are either blank or only contain punctuation, or because of data within the input records that could not be identified as a valid component based on the structure or order of the input.
Examples of “unable to parse” data:
Example 1: , UKRAINA MBT-082 KPP LUDZANKA
Example 2: Redwood City, Suite 23, Veterans Blvd. 805, CA
Example 3: ,4 SN-802
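The first failure cause described above (records that are blank or contain only punctuation) can be pre-screened before submission. The sketch below is illustrative only: the function name `is_blank_or_punctuation` is hypothetical and this is not how the Loqate engine decides parse status. It treats a record as unparseable for this reason when it contains no letters or digits.

```python
def is_blank_or_punctuation(record: str) -> bool:
    """Return True if the record contains no letter or digit at all."""
    return not any(ch.isalnum() for ch in record)


# Blank and punctuation-only inputs are caught by this check.
checks = {
    "": is_blank_or_punctuation(""),
    " , .- ": is_blank_or_punctuation(" , .- "),
    # The report's examples contain letters and digits, so they are not caught:
    # they fail for the second reason (unrecognizable structure or order).
    ", UKRAINA MBT-082 KPP LUDZANKA": is_blank_or_punctuation(", UKRAINA MBT-082 KPP LUDZANKA"),
    ",4 SN-802": is_blank_or_punctuation(",4 SN-802"),
}
```

A simple check like this cannot detect the structural failures shown in the examples; those depend on the parser's knowledge of valid address components.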
2.2.2 Data Completeness Graph
For those records that could be parsed, the data completeness graph indicates how complete the input data is. The higher the percentage that is complete to either Premise or Delivery Point level, the more complete the input data is. A high proportion of data that is either empty or complete to Administrative Area or Locality suggests that the input data is either lacking full addresses, or is poor quality.
More confidence can be assigned to data that is marked as complete according to the lexicon, since this means that the data was recognizable as such. Less confidence is associated with data that is marked as complete according to context, since this implies that the data could not be recognized directly and was identified only by its position relative to other components.
2.2.3 Data Completeness per Top 10 Countries Graph
The Data Completeness per Top 10 Countries graph breaks the completeness information down per country. This gives a better understanding of where data quality is better or worse, and may show where upstream (data input) process improvements and cleansing efforts would bring the most benefit.
2.3 Verification Reports
The verification reports collectively provide a picture of how much improvement, in a variety of areas, was achieved by running your input data through the Loqate Global Geo-data Quality Engine.
2.3.1 Verification Level Improvement Graph
The Verification Level Improvement graph shows the qualitative improvement obtained during the cleansing process along with an indication of how closely the input data matched the available reference data. A higher percentage of records verified to Locality or Thoroughfare level is not necessarily an indicator of a lack of quality, since in a significant proportion of the world no reliable reference data is available to verify Thoroughfare, Premise, or Delivery Point. A higher proportion of records with a verification level of None does, however, indicate poorer quality data.
Improvement is indicated when the red bars exceed the blue bars moving to the right (towards Premise and Delivery Point). This shows that data previously verified only to a less specific level has been corrected so that it can be verified to a more specific level.
2.3.2 Cumulative Uplift Graph
The cumulative uplift graph builds on the Verification Level graph, showing how much of a difference the Loqate data cleansing process was able to make to those records which were improved. An uplift level of 1 implies that a record that previously was only verified to Locality level could be verified to Thoroughfare level after processing. An uplift level of 2 implies that a record that was previously only verified to Locality level could be verified to Premise level after processing, etc.
Uplift Level   Count   Percent
At Least 1     44468   13.0%
At Least 2     35197   10.3%
At Least 3     10506   3.1%
At Least 4     5972    1.7%
At Least 5     2896    0.8%
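To make the "At Least N" arithmetic concrete, here is a minimal sketch of how a cumulative uplift table could be derived from per-record verification levels. The level ordering (None, Administrative Area, Locality, Thoroughfare, Premise, Delivery Point) is an assumption inferred from the level names used in this report, and the function name `cumulative_uplift` is hypothetical. The final lines check that the reported percentages are consistent with dividing each count by the 342986 input records.

```python
from collections import Counter

# Assumed ordering from least to most specific (inferred from this report).
LEVELS = ["None", "Administrative Area", "Locality",
          "Thoroughfare", "Premise", "Delivery Point"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def cumulative_uplift(pairs, total_records):
    """Build rows of (k, records uplifted by at least k levels, percent of total).

    `pairs` is an iterable of (level_before, level_after) per record.
    """
    uplift_counts = Counter()
    for before, after in pairs:
        uplift = RANK[after] - RANK[before]
        if uplift > 0:  # only improved records contribute
            uplift_counts[uplift] += 1
    rows = []
    for k in range(1, len(LEVELS)):
        at_least_k = sum(c for u, c in uplift_counts.items() if u >= k)
        rows.append((k, at_least_k, round(100 * at_least_k / total_records, 1)))
    return rows


# Hypothetical records: uplifts of 1, 2, 5 and 0 levels.
example = cumulative_uplift(
    [("Locality", "Thoroughfare"), ("Locality", "Premise"),
     ("None", "Delivery Point"), ("Premise", "Premise")],
    total_records=4,
)

# Consistency check against the counts reported in the table above.
reported = {1: 44468, 2: 35197, 3: 10506, 4: 5972, 5: 2896}
percents = {k: round(100 * c / 342986, 1) for k, c in reported.items()}
```

Because the table is cumulative, a record uplifted by 3 levels is counted in the "At Least 1", "At Least 2" and "At Least 3" rows.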
2.3.3 Textual Changes Graph
The textual change graph shows what level of change was applied during the data cleansing process for those records that were able to be verified. Minor change means that small spelling mistakes needed to be fixed in order to verify an address, generally resulting in some uplift in the verification level.
2.3.4 Verification Level per Top 10 Countries Graph
The Verification Level per Top 10 Countries graph shows how the verification level breaks down by individual country. To reiterate, a higher percentage of records verified to Locality or Thoroughfare level is not necessarily an indicator of a lack of quality, since in a significant proportion of the world no reliable reference data is available to verify Thoroughfare, Premise, or Delivery Point. A higher proportion of records with a verification level of None does, however, indicate poorer quality data. This graph should be compared to the Data Completeness per Top 10 Countries graph: if records are deemed to be complete but cannot be verified, that could indicate the presence of junk or fraudulent data.
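The comparison described above can also be applied per record. The sketch below is a hypothetical illustration, not part of the Loqate output format: the keys `completeness` and `verification`, the function name, and the choice of Premise and Delivery Point as "complete" levels are all assumptions made for the example.

```python
# Assumed: a record counts as "complete" at Premise or Delivery Point level.
COMPLETE_LEVELS = {"Premise", "Delivery Point"}


def flag_complete_but_unverified(records):
    """Return records that parsed as complete but verified to level None."""
    return [
        r for r in records
        if r["completeness"] in COMPLETE_LEVELS and r["verification"] == "None"
    ]


# Hypothetical records for illustration.
records = [
    {"id": 1, "completeness": "Premise", "verification": "None"},
    {"id": 2, "completeness": "Premise", "verification": "Premise"},
    {"id": 3, "completeness": "Locality", "verification": "None"},
]
flagged = flag_complete_but_unverified(records)
```

Flagged records are only candidates for review; in countries without detailed reference data, a complete address may legitimately fail to verify.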
2.3.5 Verification Status Chart
The verification status shows, for those records that could be verified, whether a complete match could be made against the available reference data (Verified), whether there were multiple possible matches (Ambiguous) or whether only a partial match was possible (Partially Verified).
2.3.6 Postcode Status Chart
The Postcode Status chart gives the breakdown of what the Loqate data cleansing process was able to do to the postcode field for those records that could be verified. Please note that a significant proportion of countries worldwide do not have postcode systems. This graph covers those records that had data within the postcode field on output.
2.4 Geocoding Reports
The geocoding reports provide information about the number of records that were geocoded and to what level of accuracy.
2.4.1 Geocoding Level Chart
The geocoding level chart shows what proportion of the input data was able to be geocoded and to what level.
2.4.2 Geocoding Level per Top 10 Countries Graph
The Geocoding per country graph breaks that geocoding information down by individual countries.
2.5 Conclusions
Generally very good quality data
Able to geocode a significant proportion of the file - over 95% of South American records
13% of the data was uplifted by at least 1 verification level