How to Keep OCR Errors from Spoiling Your eDiscovery Party

(1)

How to Keep OCR Errors from Spoiling Your eDiscovery Party

(2)

ACEDS Membership Benefits

Training, Resources and Networking for the E-Discovery Community

Join Today!

aceds.org/join

or Call ACEDS Member Services 786-517-2701

•  Exclusive News and Analysis

•  Weekly Web Seminars

•  Podcasts

•  On-Demand Training

•  Networking

•  Resources

•  Jobs Board & Career Center

•  bits + bytes Newsletter

•  CEDS Certification

•  And Much More!

“ACEDS provides an excellent, much needed forum… to train, network and stay current on critical information.”

(3)

Speaker Introduction

Greg Gies

Director, Product Marketing, Imaging

Nuance Communications

Leads go-to-market planning for Nuance’s print, capture & PDF solutions.

(4)

Agenda

– What OCR is.

– What OCR errors are and what causes them.

– Why eDiscovery professionals should care.

– How common these problems are.

– What can be done to prevent OCR errors.

(5)

What is OCR?

Wikipedia: “Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine encoded / computer-readable text.”

2009 Computerworld Article: “Optical character recognition (OCR) is the translation of optically scanned bitmaps of printed or written text characters into character

(6)

I say, “OCR is the digital transcription of bitmaps containing machine-printed text into encoded text characters, using a coding scheme such as ASCII, which, among other capabilities, enables indexing software to decipher textual elements contained within bitmaps.”

(7)

What are OCR errors and what causes them?

(8)

Types of OCR errors

Transcription errors. Result: misspelled words. Impact: unsearchable. Proportional fonts are especially problematic.

Example: “learning” becomes “leaming”

Formatting errors. Result: poor legibility. Impact: unsearchable.

Example: “learning” becomes “l e a r n i n g”

Deleted metadata. Result: data loss.
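To make the "unsearchable" impact concrete, here is a minimal sketch of a whole-word keyword index. The function name `build_index` and the sample sentences are illustrative assumptions, not part of any eDiscovery product; the two damaged strings reuse the slide's own examples.

```python
import re


def build_index(text: str) -> set[str]:
    """Lower-case the text and return the set of whole-word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Hypothetical sample sentences; only the word "learning" varies.
clean = "Machine learning improves review."
transcription_error = "Machine leaming improves review."       # "rn" misread as "m"
formatting_error = "Machine l e a r n i n g improves review."  # stray spaces

for label, text in [("clean", clean),
                    ("transcription error", transcription_error),
                    ("formatting error", formatting_error)]:
    found = "learning" in build_index(text)
    print(f"{label}: keyword found = {found}")
```

Only the clean sentence yields a hit for "learning". In the other two the word is still legible to a human reader but no longer exists as a token the index can match.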

(9)

Some image defects and causes

Faulty printing equipment

DEFECT: Toner specks; Vertical lines; Toner smear; Gray background; Page skew; Light/dark print

CAUSE: Defective toner cartridge; Worn rollers; Wrong settings; Worn pickup roller; Low toner or ink cartridge; Clogged nozzles

Paper or form elements

DEFECT: Vertical/horizontal lines; Halftones; Noise

CAUSE: Colored paper; Carbon copies; Shaded and lined forms; Low/high contrast background

Faulty

(10)

Examples of image defects

Halftone

Specks

Color Background

Fuzzy edges

Gridlines

(11)

Cause & effect

SKEWED IMAGE

TONER SPECKS

DARK TEXT

(12)

OCR results

•... ; .,

1HIt#~!':tl'Ol\<fLo\i;l;i\~: •••• 'do ••••• ~""r.bf .. . ,200., by- ••• ~ •• ODNEY

D~"'!\Ob$Y~s.IN(; ..• ~~"""""" Iaws "'''''.S_",c.rm",,;, ~,-"'"'_U "s.,.",) •••

JOE STANDUi' ~Irnown •• """,,"). _ and SclIa _<01"'"""y

'l>eknoWhllereinasutlJ.e paiti~S",' . . . ... ..

. "

(13)

Formatting errors

(14)

Why should eDiscovery professionals care about OCR errors?

(15)

PDF Files Created by Scanners Aren’t Necessarily Searchable

1. Unlike PDF files that were “born digital”, a PDF file from a scanner is an image of a paper document.

2. While the text in an image may appear similar or the same as text in a “born digital” PDF, it’s invisible as far as search algorithms are concerned.

3. Images aren’t searchable until processed with OCR.

(16)

Digitized Versus Native Documents

Electronic

•  Originated online

•  Many native file formats

•  Encoded text – “machine-readable”

•  eDiscovery software designed for these documents

Digitized

•  Scans of paper originals

•  TIFF & PDF common formats

•  Text is a bitmap; not encoded

•  OCR makes scans “machine-readable”

(17)

All Searchable PDFs Aren’t Equally Findable

(18)

All Searchable PDFs Aren’t Equally Findable

– OCR word accuracy matters

– Character vs. word accuracy

– 1 character error = 1 word error

– 1 char / 10,000 = .0001

– 1 word / 1,000 = .001

– Seemingly small differences in OCR error rate lead to very large differences in errors

– 98.25% vs. 96.05%

– 948 pages

– ~ 2% delta
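The arithmetic behind these bullets can be sketched as follows. Three things here are assumptions rather than figures from the slide: character errors are treated as independent, a word is taken to be about 10 characters (which is what makes 1 in 10,000 characters line up with roughly 1 in 1,000 words), and a page is taken to hold 300 words. The two percentages are read as word accuracy.

```python
def word_error_rate(char_error_rate: float, chars_per_word: int) -> float:
    """Probability that a word contains at least one misread character,
    assuming character errors are independent of one another."""
    return 1 - (1 - char_error_rate) ** chars_per_word


def expected_word_errors(pages: int, words_per_page: int,
                         word_accuracy: float) -> float:
    """Expected number of misrecognized words across a document set."""
    return pages * words_per_page * (1 - word_accuracy)


# One character error per 10,000 characters, assumed 10 characters per word.
rate = word_error_rate(1 / 10_000, chars_per_word=10)

PAGES = 948            # from the slide
WORDS_PER_PAGE = 300   # assumption, not from the slide

errors_high = expected_word_errors(PAGES, WORDS_PER_PAGE, 0.9825)
errors_low = expected_word_errors(PAGES, WORDS_PER_PAGE, 0.9605)

print(f"word error rate: {rate:.4f}")
print(f"expected word errors at 98.25%: {errors_high:.0f}")
print(f"expected word errors at 96.05%: {errors_low:.0f}")
print(f"ratio: {errors_low / errors_high:.2f}")
```

Whatever page density is assumed, the ratio is the same: 3.95% of words wrong versus 1.75% means the lower-accuracy run produces more than twice as many unsearchable words, even though the accuracy figures differ by only about two points.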

(19)

PDF Isn’t Just PDF Anymore

Improper handling may lead to data loss

Electronic “sticky notes” & stamps obscure text behind

PDF contains digitized pages or elements

OCR software may flatten the image, i.e., convert PDF to TIF, then OCR

PDF contains native pages or

(20)

Four Important Ways PDF is Different

(21)

Four Important Ways PDF is Different

2. PDF elements can be rearranged.

User copied snip of scanned document.

(22)

Four Important Ways PDF is Different

(23)

Four Important Ways PDF is Different

(24)

Why These Differences Matter

When these properties converge with scanned text, eDiscovery pitfalls arise that can lead to inadvertent data loss due to processing errors.

(25)

Single PDF Page Can Contain Both “Born Digital” and Scanned Content

This PDF contains 2 scanned pages.

(26)

Notes, Text Boxes, Callouts, Stamps, Etc. Can Be Overlaid On Images

(27)

How Data Gets Inadvertently Destroyed

OCR may ‘flatten’ objects overlaid on text, making the text underneath unreadable and metadata unsearchable
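A toy model can illustrate why flattening destroys data. It is only a sketch: the page is a list of text lines rather than a real raster image, and the `Annotation` class, its fields and the sample content are hypothetical. Real flattening rasterizes the page, but the effect is analogous: whatever sat under the overlay is overwritten, and the overlay's own metadata is not carried into the result.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for a sticky note or stamp laid over a page."""
    row: int      # line of the page the note covers
    col: int      # starting column of the note
    label: str    # visible text of the note
    author: str   # metadata stored with the annotation object


def flatten(page_lines: list[str], notes: list[Annotation]) -> list[str]:
    """Burn each annotation into the page. Characters under a note are
    overwritten and the note's metadata is dropped."""
    rows = [list(line) for line in page_lines]
    for note in notes:
        for offset, ch in enumerate(note.label):
            if note.col + offset < len(rows[note.row]):
                rows[note.row][note.col + offset] = ch
    return ["".join(row) for row in rows]


page = ["Payment due: 30 days", "Signed: J. Smith"]
notes = [Annotation(row=0, col=13, label="[DRAFT]", author="reviewer1")]

flat = flatten(page, notes)
for line in flat:
    print(line)
```

After flattening, the first line no longer contains the payment term at all, and nothing in the output records who placed the stamp. Running OCR on this flattened result can only recover what is still visible.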

(28)

Why Is Data Destruction A Problem?

“Spoliation is the destruction or significant alteration of evidence, or the failure to preserve property for another's use as evidence in pending or reasonably foreseeable litigation.”

(29)

Penalties for Spoliation

$2,750,000

United States v. Philip Morris USA, Inc.

(30)

What If The Data Was Destroyed Unintentionally? I’m Okay, Right?

“The intent to alter or destroy electronic data is not required for spoliation to occur.”

(31)

Other Reasons Data Loss is Bad

– Destruction of evidence – can affect case outcome.

– Time wasted recreating lost data – reduced profitability.

– Client relations – loss of credibility and future business.

(32)

(33)

(34)

Firms With Initiatives To Convert Paper Documents And Process To Digital

[Bar chart. Axis: percentage of respondents, 0% to 100%. Category: Total. Legend: Yes, No, Don’t know.]

(35)

What Percentage Of The Knowledge Workers Have Access To Scanners?

[Bar chart. Axis: workers, 20% to 100%. Bar values: 72.8%, 66.7%, 75.2%, 64.5%, 78.6%, 80.9%.]

(36)

The Use Of Scanning Within Organizations

[Bar chart. Axis: percentage of respondents, 0% to 100%. Category: Total. Legend: Increasing, Stay the same, Decreasing.]

(37)

Why scanning is increasing

[Bar chart. Axis: percentage of respondents, 20% to 100%. Categories: More people have been given access to scanning; Mix of more documents being scanned and more people gaining access to scanning; Individuals are scanning more documents.]

(38)

What can be done to prevent OCR errors from occurring?

(39)

Image enhancement filters out defects

Halftone Removal

Despeckle

Color

Smooth Characters

Remove Gridlines
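As an illustration of what one such filter does, here is a minimal despeckle sketch. It is not the algorithm of any particular OCR product: it assumes a binary image stored as nested lists (1 = ink, 0 = background) and treats any ink pixel with no ink among its eight neighbours as a speck. The function name and the sample `scan` are illustrative.

```python
def despeckle(image: list[list[int]]) -> list[list[int]]:
    """Return a copy of a binary image (1 = ink, 0 = background) with
    isolated ink pixels removed. A pixel counts as isolated when none
    of its eight neighbours is ink."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned


# A vertical stroke (part of a character) plus one stray toner speck.
scan = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],  # column 1 is the stroke, column 5 is the speck
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]

for row in despeckle(scan):
    print("".join("#" if pixel else "." for pixel in row))
```

The stroke survives because each of its pixels touches another ink pixel, while the lone speck is cleared. A single-pixel rule like this would also erase very small legitimate marks, such as the dot of an "i" at low resolution, so a production filter needs a more careful size threshold.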

(40)

Selectively Processing PDF Files Isn’t a Viable Strategy

– A client delivers a disk with millions of files. How do you know which PDFs are “compound” and require special handling?

– It’s kind of like looking for needles in

(41)

Segregate All PDF Files Before Conversion

– After collecting documents,…

– …but before processing, review and analysis

– Identify all PDF documents
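One way to carry out the identification step is sketched below. It checks each file's leading bytes for the `%PDF-` header instead of trusting the file extension, so a mislabeled PDF is still caught. The function names and the directory layout are assumptions for illustration. A PDF that carries stray bytes ahead of its header would be missed by this simple check.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # header that opens a PDF file


def is_pdf(path: Path) -> bool:
    """True when the file begins with the PDF header, whatever its extension."""
    with path.open("rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC


def segregate(collection_root: Path) -> tuple[list[Path], list[Path]]:
    """Walk a collected document set and split it into PDFs and everything else."""
    pdfs: list[Path] = []
    others: list[Path] = []
    for path in sorted(collection_root.rglob("*")):
        if path.is_file():
            (pdfs if is_pdf(path) else others).append(path)
    return pdfs, others
```

Calling `segregate(Path("collected_docs"))`, where `collected_docs` is a hypothetical folder name, returns two lists: the PDFs to route through OCR pre-processing and the remaining files that can go straight to normal processing.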

(42)

Pre-process all PDF files with OCR

– Text images are potentially hidden within files

– Newer OCR tools will make image text searchable, without disturbing the rest of the data within the document

– Ensures all text within every PDF document can be searched by eDiscovery system

– Should also convert to PDF/A at the same time so files

(43)

How to correct OCR errors when they happen

(44)

Post-processing error-correction options

1. Proofreading – too time-intensive

2. Run an automated spell check

– Effective at identifying and correcting spelling errors

– Doesn’t solve contextual errors

§ Example: “How is you day?”

§ Spelled correctly but clearly is wrong

§ If searching for “your” this instance won’t be found

§ No commercial solutions today solve this problem

§ Helps to understand this potential problem

§ Can either manually review and correct or adjust search strategy to
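One possible way to adjust a search strategy is to expand each query term with the misreadings OCR is likely to have produced. The sketch below is an assumption-laden illustration, not a feature of any named tool: the `CONFUSIONS` table is hypothetical apart from the "rn" to "m" pair taken from the earlier “learning” example, and the matching is plain substring search. It helps with transcription errors. It does not solve the contextual “you”/“your” case, which would still need a manually added variant or manual review.

```python
# Hypothetical confusion pairs (correct, misread). Only ("rn", "m") comes
# from the presentation; the rest are illustrative placeholders.
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("l", "1"), ("O", "0")]


def ocr_variants(term: str) -> set[str]:
    """Return the term plus every variant produced by applying one
    confusion pair at one position."""
    variants = {term}
    for correct, misread in CONFUSIONS:
        start = term.find(correct)
        while start != -1:
            variants.add(term[:start] + misread + term[start + len(correct):])
            start = term.find(correct, start + 1)
    return variants


def search(term: str, documents: list[str]) -> list[int]:
    """Indices of documents containing the term or any of its OCR variants."""
    needles = {variant.lower() for variant in ocr_variants(term)}
    return [
        index
        for index, doc in enumerate(documents)
        if any(needle in doc.lower() for needle in needles)
    ]


docs = ["Machine leaming improves review.", "Unrelated memo."]
print(sorted(ocr_variants("learning")))
print(search("learning", docs))
```

The expanded query finds the first document even though the word "learning" never appears in it correctly. The trade-off is more false positives as the confusion table grows, so the table is best kept short and reviewed against the actual error patterns seen in the collection.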

(45)

Final Thought – Why You Should Care

General business / transaction documents: Purchase orders, invoices, contracts, employee identification documents…

Healthcare organizations / patient medical records: Physician’s notes, discharge summaries, test results, post-operative reports, etc.

Personal identification documents: Driver’s licenses, social security cards, professional certificates

Public institutions / police records: Incident / accident reports, police logs, court records…

Insurance & banking: Claims documents, medical records, financial records…

(46)

(47)

Next Steps

• A recording of today’s webinar will be available shortly for you to review at your leisure

• Contact Nuance with any questions you may have: 781-565-5000 or [email protected]

• For more information visit: http://www.nuance.com/for-business/by-industry/legal/legal-solution

• Try our 30-day free trial of Power PDF at http://www.powerpdf.com

(48)
