
Data Quality Report

Prepared for: Demo (with real data)

Date: 1/11/11

Contents

1 Introduction
1.1 Company Background
1.2 Reference Data
1.3 Address Verification Process
2 Detailed Results
2.1 Input Data Analysis
2.1.1 Data Supplied
2.1.2 Top 10 Field Contents
2.2 Parsing Reports
2.2.1 Parsing Status Chart
2.2.2 Data Completeness Graph
2.2.3 Data Completeness per Top 10 Countries Graph
2.3 Verification Reports
2.3.1 Verification Level Improvement Graph
2.3.2 Cumulative Uplift Graph
2.3.3 Textual Changes Graph
2.3.4 Verification Level per Top 10 Countries Graph
2.3.5 Verification Status Chart
2.3.6 Postcode Status Chart
2.4 Geocoding Reports
2.4.1 Geocoding Level Chart
2.4.2 Geocoding Level per Top 10 Countries Graph
2.5 Conclusions

1 Introduction

1.1 Company Background

Location is a key data element for most business applications. Consumers need to know where to look for products and services, and businesses need to be able to accurately identify the location of their customer to deliver these products and services efficiently. Whether for data quality, fraud prevention or delivering location based services, identifying an accurate customer location is key to successful business strategies.

Loqate is a leader in geographic data quality solutions, ensuring that the data you collect is accurate and complete - first time, every time. Whether your data is a postal address, point of interest or latitude/longitude coordinate, we use a combination of advanced algorithmic analysis and comprehensive reference sources to identify, parse, standardize, verify and enrich it, adding valuable information to increase its value.

Our solutions are global, covering over 240 countries around the world - that’s every populated world territory. So whether you’re trading in one country or globally, Loqate can meet your requirements - any language, any character set.

1.2 Reference Data

Our Global Knowledge Repository contains reference data relating to postal addresses and other geographic points of interest for all 240+ populated countries and territories of the world. Loqate is the only solution of its type with complete global coverage.

Our reference data comes directly from data owners around the world and is licensed for use within the Loqate solutions.

1.3 Address Verification Process

The Loqate Geo-data Quality Engine identifies, verifies and enriches geographic and location based data such as postal address, point of interest, and latitude/longitude coordinate.

Loqate can run as an installed Library/API or through our Software as a Service platform.

Having identified address components, the Verify module will parse unstructured text into labeled address components, transliterate from one character set to another, validate against our Global Knowledge Repository of worldwide reference data to correct misspellings, add missing components and postcodes, and standardize to the correct local format and content.

Our Verify and Geocode modules work for all 240+ populated countries and territories of the world.

2 Detailed Results

2.1 Input Data Analysis

2.1.1 Data Supplied

File name: test_data.txt

Input file type: Tab delimited UTF8 Text

# of records: 342986

Field list from input file:

• ADDRESS1
• ADDRESS2
• ADDRESS3
• CITY
• STATE
• ZIP
• COUNTRY

2.1.2 Top 10 Field Contents

Top 10 field contents for each provided field:

ADDRESS1 field    Count
proprietary       1893
proprietary       1028
proprietary       426
proprietary       366
proprietary       366
proprietary       356
proprietary       345
proprietary       305
proprietary       293
proprietary       266
Blank Records     2121

ADDRESS2 field    Count
proprietary       389
proprietary       189
proprietary       176
proprietary       140
proprietary       116
proprietary       99
proprietary       97
proprietary       96
proprietary       82
proprietary       75
Blank Records     289972

ADDRESS3 field    Count
proprietary       79
proprietary       66
proprietary       50
proprietary       49
proprietary       39
proprietary       36
proprietary       32
proprietary       31
proprietary       31
proprietary       29
Blank Records     338453

CITY field        Count
proprietary       19782
proprietary       13935
proprietary       7632
proprietary       6446
proprietary       6073
proprietary       5332
proprietary       5040
proprietary       4288
proprietary       4105
proprietary       3500
Blank Records     2533

STATE field       Count
proprietary       100992
proprietary       39244
proprietary       38485
proprietary       16323
proprietary       7934
proprietary       7628
proprietary       7047
proprietary       4797
proprietary       3982
proprietary       2844
Blank Records     83149

ZIP field         Count
proprietary       546
proprietary       397
proprietary       318
proprietary       236
proprietary       213
proprietary       205
proprietary       191
proprietary       186
proprietary       182
proprietary       182
Blank Records     18679

COUNTRY field     Count
proprietary       234849
proprietary       11210
proprietary       9849
proprietary       5925
proprietary       5262
proprietary       5041
proprietary       4210
proprietary       3761
proprietary       3299
proprietary       3125
Blank Records     41
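For readers who want to reproduce this kind of per-field summary on their own file, the following is a minimal Python sketch, not the tool that produced this report. It assumes the tab-delimited input has a header row carrying the seven field names listed in section 2.1.1 (the report does not say whether it does), and the function name `top_field_contents` and the three-record sample are hypothetical.

```python
import csv
import io
from collections import Counter

# Field names as listed in section 2.1.1.
FIELDS = ["ADDRESS1", "ADDRESS2", "ADDRESS3", "CITY", "STATE", "ZIP", "COUNTRY"]


def top_field_contents(lines, fields=FIELDS, top_n=10):
    """Return {field: (top_n most common non-blank values, blank count)}.

    Assumes `lines` is tab-delimited text whose first row is a header.
    """
    counts = {f: Counter() for f in fields}
    blanks = {f: 0 for f in fields}
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for f in fields:
            value = (row.get(f) or "").strip()
            if value:
                counts[f][value] += 1
            else:
                blanks[f] += 1  # empty or whitespace-only counts as blank
    return {f: (counts[f].most_common(top_n), blanks[f]) for f in fields}


# Hypothetical three-record sample standing in for the real input file.
sample = io.StringIO(
    "ADDRESS1\tADDRESS2\tADDRESS3\tCITY\tSTATE\tZIP\tCOUNTRY\n"
    "1 Main St\t\t\tSpringfield\tIL\t62701\tUS\n"
    "1 Main St\tSuite 2\t\tSpringfield\tIL\t62701\tUS\n"
    "\t\t\tLondon\t\t\tGB\n"
)
report = top_field_contents(sample)
```

To run the same summary against a real file, pass an open file object instead of the sample, for example one opened with `open(path, encoding="utf-8", newline="")`.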

2.2 Parsing Reports

The parsing reports collectively provide a quantitative picture of the input data quality. More details are provided with each report, explaining specifically what it shows.

2.2.1 Parsing Status Chart

The parsing status chart indicates the percentage of the input data that contained complete, identifiable components. A failure to parse normally occurs because of input records that are either blank or only contain punctuation, or because of data within the input records that could not be identified as a valid component based on the structure or order of the input.

Examples of “unable to parse” data:

Example 1: , UKRAINA MBT-082 KPP LUDZANKA

Example 2: Redwood City, Suite 23, Veterans Blvd. 805, CA

Example 3: ,4 SN-802
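The first cause of parse failure described above (records that are blank or contain only punctuation) can be screened for before processing. The sketch below is an illustrative pre-check only, with a hypothetical function name; it does not model the second cause, where components cannot be identified from their structure or order, which depends on reference data.

```python
def is_blank_or_punctuation(record: str) -> bool:
    """Return True if the record contains no letters or digits at all."""
    return not any(ch.isalnum() for ch in record)


# All three examples from the report contain letters or digits, so they
# fail to parse for the structural reason rather than this one.
examples = [
    ", UKRAINA MBT-082 KPP LUDZANKA",
    "Redwood City, Suite 23, Veterans Blvd. 805, CA",
    ",4 SN-802",
]
structural_failures = [e for e in examples if not is_blank_or_punctuation(e)]
```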

2.2.2 Data Completeness Graph

For those records that could be parsed, the data completeness graph indicates how complete the input data is. The higher the percentage that is complete to either Premise or Delivery Point level, the more complete the input data is. A high proportion of data that is either empty or complete to Administrative Area or Locality suggests that the input data is either lacking full addresses, or is poor quality.

More confidence can be assigned to data that is marked as complete according to the lexicon, since this means that the data was recognizable as such. Less confidence is associated with data that is marked as complete according to context, since this implies that the data could not be recognized directly and could only be recognized by its position relative to other components.

2.2.3 Data Completeness per Top 10 Countries Graph

The Data Completeness per Top 10 Countries graph breaks the completeness information down per country. This gives a better understanding of where data quality is better or worse, and may show where upstream (data input) process improvement and cleansing efforts would bring the most benefit.

2.3 Verification Reports

The verification reports collectively provide a picture, across a variety of areas, of how much your input data was improved by running it through the Loqate Global Geo-data Quality Engine.

2.3.1 Verification Level Improvement Graph

The Verification Level Improvement graph shows the qualitative improvement obtained during the cleansing process along with an indication of how closely the input data matched the available reference data. A higher percentage of records verified to Locality or Thoroughfare level is not necessarily an indicator of a lack of quality, since in a significant proportion of the world no reliable reference data is available to verify Thoroughfare, Premise, or Delivery Point. A higher proportion of records with a verification level of None does, however, indicate poorer quality data.

Improvement is indicated when the red bars exceed the blue bars moving to the right (towards Premise and Delivery Point). This shows that data which was verified only to a less specific level has been corrected so that it can be verified to a higher level.

2.3.2 Cumulative Uplift Graph

The cumulative uplift graph builds on the Verification Level graph, showing how much of a difference the Loqate data cleansing process was able to make to those records which were improved. An uplift level of 1 implies that a record that previously was only verified to Locality level could be verified to Thoroughfare level after processing. An uplift level of 2 implies that a record that was previously only verified to Locality level could be verified to Premise level after processing, etc.

Uplift Level    Count    Percent
At Least 1      44468    13.0%
At Least 2      35197    10.3%
At Least 3      10506    3.1%
At Least 4      5972     1.7%
At Least 5      2896     0.8%
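The uplift arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Loqate's implementation. The six-level ordering below is assumed from the level names used in this report: the source confirms only that Locality to Thoroughfare is one level and Locality to Premise is two. The function names are hypothetical. The percentages in the table are consistent with dividing each count by the 342986 input records, for example 44468 / 342986 ≈ 13.0%.

```python
from collections import Counter

# Assumed ordering of verification levels, least to most specific.
LEVELS = ["None", "Administrative Area", "Locality",
          "Thoroughfare", "Premise", "Delivery Point"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def uplift(before: str, after: str) -> int:
    """Number of levels gained; records that did not improve count as 0."""
    return max(0, RANK[after] - RANK[before])


def cumulative_uplift(pairs, total_records):
    """Return [(k, count, percent)] for records uplifted by at least k levels."""
    by_uplift = Counter(uplift(before, after) for before, after in pairs)
    rows = []
    for k in range(1, len(LEVELS)):
        count = sum(n for gained, n in by_uplift.items() if gained >= k)
        rows.append((k, count, round(100 * count / total_records, 1)))
    return rows


# Hypothetical (before, after) verification levels for four records.
pairs = [
    ("Locality", "Thoroughfare"),   # uplift 1
    ("Locality", "Premise"),        # uplift 2
    ("None", "Delivery Point"),     # uplift 5
    ("Premise", "Premise"),         # no uplift
]
rows = cumulative_uplift(pairs, total_records=len(pairs))

# Check of the first table row against the record count in section 2.1.1.
first_row_percent = round(100 * 44468 / 342986, 1)
```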

2.3.3 Textual Changes Graph

The textual change graph shows what level of change was applied during the data cleansing process for those records that were able to be verified. Minor change means that small spelling mistakes needed to be fixed in order to verify an address, generally resulting in some uplift in the verification level.

2.3.4 Verification Level per Top 10 Countries Graph

The verification level per country graph shows how the verification levels split across individual countries. To reiterate, a higher percentage of records verified to Locality or Thoroughfare level is not necessarily an indicator of a lack of quality, since in a significant proportion of the world no reliable reference data is available to verify Thoroughfare, Premise, or Delivery Point. A higher proportion of records with a verification level of None does, however, indicate poorer quality data. This graph should be compared to the Data Completeness per Top 10 Countries graph: if records are deemed to be complete but are unable to be verified, that could indicate the presence of junk or fraudulent data.
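The comparison suggested above, finding records that appear complete yet cannot be verified, can also be made per record. The sketch below is a hypothetical illustration: the level ordering is assumed from the level names in this report, and both the function name and the Premise threshold for "complete" are illustrative choices rather than values taken from the report.

```python
# Assumed ordering of levels, least to most specific.
LEVELS = ["None", "Administrative Area", "Locality",
          "Thoroughfare", "Premise", "Delivery Point"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def complete_but_unverified(records, min_completeness="Premise"):
    """Return ids of records that look complete yet verified to no level.

    records: iterable of (record_id, completeness_level, verification_level).
    min_completeness is an illustrative threshold, not one from the report.
    """
    threshold = RANK[min_completeness]
    return [
        record_id
        for record_id, completeness, verification in records
        if RANK[completeness] >= threshold and verification == "None"
    ]


# Hypothetical records: only r2 is complete to Premise but unverified.
sample = [
    ("r1", "Delivery Point", "Delivery Point"),
    ("r2", "Premise", "None"),
    ("r3", "Locality", "None"),
]
flagged = complete_but_unverified(sample)
```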

2.3.5 Verification Status Chart

The verification status shows, for those records that could be verified, whether a complete match could be made against the available reference data (Verified), whether there were multiple possible matches (Ambiguous) or whether only a partial match was possible (Partially Verified).

2.3.6 Postcode Status Chart

The Postcode Status chart gives the breakdown of what the Loqate data cleansing process was able to do to the postcode field for those records that were able to be verified. Please note that a significant proportion of countries worldwide do not have postcode systems. This graph covers those records that had data within the postcode field on output.

2.4 Geocoding Reports

The geocoding reports provide information about the number of records that were geocoded and to what level of accuracy.

2.4.1 Geocoding Level Chart

The geocoding level chart shows what proportion of the input data was able to be geocoded and to what level.

2.4.2 Geocoding Level per Top 10 Countries Graph

The Geocoding per country graph breaks that geocoding information down by individual countries.

2.5 Conclusions

• Generally very good quality data

• Able to geocode a significant proportion of the file - over 95% of South American records

• 13% of the data was uplifted by at least 1 verification level

2.6 Recommendations
