How to Keep OCR Errors from Spoiling Your eDiscovery Party


Academic year: 2021

(1)

How to Keep OCR Errors from Spoiling Your eDiscovery Party

(2)

ACEDS Membership Benefits

Training, Resources and Networking for the E-Discovery Community

Join today at aceds.org/join or call ACEDS Member Services at 786-517-2701

•  Exclusive News and Analysis
•  Weekly Web Seminars
•  Podcasts
•  On-Demand Training
•  Networking
•  Resources
•  Jobs Board & Career Center
•  bits + bytes Newsletter
•  CEDS Certification
•  And Much More!

“ACEDS provides an excellent, much needed forum… to train, network and stay current on critical information.”

(3)

Speaker Introduction

Greg Gies

Director, Product Marketing, Imaging

Nuance Communications

Leads go-to-market planning for Nuance’s print, capture & PDF solutions.

(4)

Agenda

What OCR is.

What are OCR errors and what causes them.

Why eDiscovery professionals should care.

How common these problems are.

What can be done to prevent OCR errors.

(5)

What is OCR?

Wikipedia: “Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine encoded / computer-readable text.”

2009 Computerworld Article: “Optical character recognition (OCR) is the translation of optically scanned bitmaps of printed or written text characters into character

(6)

I say, “OCR is the digital transcription of bitmaps containing machine-printed text into encoded text characters, using a coding scheme such as ASCII, which, among other capabilities, enables indexing software to decipher textual elements contained within bitmaps.”

(7)

What are OCR errors and what causes them?

(8)

Types of OCR errors

Transcription errors
Result: misspelled words.
Impact: unsearchable. Proportional fonts are especially problematic.
Example: “learning” becomes “leaming”.

Formatting errors
Result: poor legibility.
Impact: unsearchable.
Example: “learning” becomes “l e a r n i n g”.

Deleted metadata
Result: data loss.
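
To see why both error types make text unsearchable, the short Python sketch below runs an exact-match search for "learning" against clean text and against the two garbled forms from the slide. The sample strings and the helper names `contains_term` and `collapse_letter_spacing` are illustrative assumptions, not part of any eDiscovery product.

```python
def contains_term(text: str, term: str) -> bool:
    """Case-insensitive exact substring search, as a naive index lookup would do."""
    return term.lower() in text.lower()


def collapse_letter_spacing(text: str) -> str:
    """Naive repair for a spaced-out word: drop every space.

    Only sensible for a single token; on a full sentence it would also
    glue separate words together.
    """
    return text.replace(" ", "")


samples = {
    "clean": "machine learning",
    "transcription error": "machine leaming",       # "rn" misread as "m"
    "formatting error": "machine l e a r n i n g",  # letters split by spaces
}

for label, text in samples.items():
    print(f"{label}: found={contains_term(text, 'learning')}")

# Collapsing the spaces recovers the spaced-out token.
print(contains_term(collapse_letter_spacing("l e a r n i n g"), "learning"))
```

Only the clean sample matches. The spaced-out token becomes findable again once the spaces are removed, while the transcription error needs a different kind of correction.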

(9)

Some image defects and causes

DEFECT / CAUSE

Faulty printing equipment
Defects: toner specks, vertical lines, toner smear, gray background, page skew, light/dark print
Causes: defective toner cartridge, worn rollers, wrong settings, worn pickup roller, low toner or ink cartridge, clogged nozzles

Paper or form elements
Vertical/horizontal lines, halftones, noise, colored paper, carbon copies, shaded and lined forms, low/high contrast background

Faulty

(10)

Examples of image defects

Halftone

Specks

Color Background

Fuzzy edges

Gridlines

(11)

Cause & effect

SKEWED IMAGE

TONER SPECKS

DARK TEXT

(12)

OCR results

•... ; .,

1HIt#~!':tl'Ol\<fLo\i;l;i\~: •••• 'do ••••• ~""r.bf .. . ,200., by- ••• ~ •• ODNEY

D~"'!\Ob$Y~s.IN(; ..• ~~"""""" Iaws "'''''.S_",c.rm",,;, ~,-"'"'_U "s.,.",) •••

JOE STANDUi' ~Irnown •• """,,"). _ and SclIa _<01"'"""y

'l>eknoWhllereinasutlJ.e paiti~S",' . . . ... ..

. "

(13)

Formatting errors

(14)

Why should eDiscovery professionals care about OCR errors?

(15)

PDF Files Created by Scanners Aren’t Necessarily Searchable

1. Unlike PDF files that were “born digital”, a PDF file from a scanner is an image of a paper document.

2. While the text in an image may appear similar to, or the same as, text in a “born digital” PDF, it’s invisible as far as search algorithms are concerned.

3. Images aren’t searchable until processed with OCR.
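
One way to act on these three points is to check each page for an existing text layer before deciding whether it needs OCR. The Python sketch below assumes the per-page text has already been extracted by some PDF library (not shown here); the function name `pages_needing_ocr` and the `min_chars` threshold are hypothetical.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 1) -> List[int]:
    """Return 1-based numbers of pages whose extracted text layer is empty.

    page_texts holds the text a PDF library extracted from each page;
    an image-only (scanned) page yields an empty or whitespace-only string.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        if len(text.strip()) < min_chars:
            flagged.append(number)
    return flagged


# Hypothetical three-page file: page 2 is a scan with no text layer.
extracted = ["Purchase order 1001", "", "Terms and conditions"]
print(pages_needing_ocr(extracted))
```

A page-level check like this matters because a single file can mix "born digital" and scanned pages, so a file-level test alone can miss image-only pages.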

(16)

Digitized Versus Native Documents

Electronic

•  Originated online
•  Many native file formats
•  Encoded text: “machine-readable”
•  eDiscovery software designed for these documents

Digitized

•  Scans of paper originals
•  TIFF & PDF common formats
•  Text is a bitmap; not encoded
•  OCR makes scans “machine-readable”

(17)

All Searchable PDFs Aren’t Equally Findable

(18)

All Searchable PDFs Aren’t Equally Findable

OCR Word Accuracy Matters

Character vs. word accuracy
1 character error = 1 word error
1 char / 10,000 = .0001
1 word / 1,000 = .001

Seemingly small differences in OCR error rate lead to very large differences in errors
98.25% vs. 96.05%
948 pages
~ 2% delta
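
The arithmetic on this slide can be worked through in a few lines of Python. This is a rough sketch under stated assumptions: it treats the slide's two percentages as word accuracy, uses the 10 characters per word implied by the slide's 10,000-character / 1,000-word ratio, and assumes 300 words per page, a figure that does not come from the presentation.

```python
def word_error_rate(char_error_rate: float, chars_per_word: float) -> float:
    """Slide's simplification: every character error spoils exactly one word."""
    return char_error_rate * chars_per_word


def expected_word_errors(pages: int, words_per_page: int, word_accuracy: float) -> int:
    """Expected number of misrecognized words at a given word accuracy."""
    total_words = pages * words_per_page
    return round(total_words * (1.0 - word_accuracy))


# 1 error per 10,000 characters, at an assumed 10 characters per word,
# is roughly 1 error per 1,000 words.
print(word_error_rate(1 / 10_000, 10))

PAGES = 948            # from the slide
WORDS_PER_PAGE = 300   # assumption for illustration only

for accuracy in (0.9825, 0.9605):
    errors = expected_word_errors(PAGES, WORDS_PER_PAGE, accuracy)
    print(f"{accuracy:.2%} word accuracy -> about {errors:,} word errors")
```

With these assumed figures, the lower accuracy produces more than twice as many unsearchable words across the 948 pages, even though the two rates differ by only about two percentage points.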

(19)

PDF Isn’t Just PDF Anymore

Improper handling may lead to data loss

Electronic “sticky notes” & stamps obscure text behind

PDF contains digitized pages or elements

OCR software may flatten the image, i.e., convert PDF to TIF, then OCR

PDF contains native pages or

(20)

Four Important Ways PDF is Different

(21)

Four Important Ways PDF is Different

2. PDF elements can be rearranged.

User copied snip of scanned document.

(22)

Four Important Ways PDF is Different

(23)

Four Important Ways PDF is Different

(24)

Why These Differences Matter

When these properties converge with scanned text, eDiscovery pitfalls arise that can lead to inadvertent data loss due to processing errors.

(25)

Single PDF Page Can Contain Both “Born Digital” and Scanned Content

This PDF contains 2 scanned pages.

(26)

Notes, Text Boxes, Callouts, Stamps, Etc. Can Be Overlaid On Images

(27)

How Data Gets Inadvertently Destroyed

OCR may ‘flatten’ objects overlaid on text, making the text underneath unreadable and metadata unsearchable

(28)

Why Is Data Destruction A Problem?

“Spoliation is the destruction or significant alteration of evidence, or the failure to preserve property for another's use as evidence in pending or reasonably foreseeable litigation.”

(29)

Penalties for Spoliation

$2,750,000

United States v. Philip Morris USA, Inc.

(30)

What If The Data Was Destroyed Unintentionally? I’m Okay, Right?

“The intent to alter or destroy electronic data is not required for spoliation to occur.”

(31)

Other Reasons Data Loss is Bad

Destruction of evidence: can affect the case outcome.

Time wasted recreating lost data: reduced profitability.

Client relations: loss of credibility and future business.

(32)
(33)
(34)

Firms With Initiatives To Convert Paper Documents And Process To Digital

[Chart: percentage of respondents, 0% to 100%, total; categories: Yes, No, Don’t know]

(35)

What Percentage Of The Knowledge Workers Have Access To Scanners?

[Chart: axis "Workers", 20% to 100%; plotted values: 72.8%, 66.7%, 75.2%, 64.5%, 78.6%, 80.9%]

(36)

The Use Of Scanning Within Organizations

[Chart: percentage of respondents, 0% to 100%, total; categories: Increasing, Stay the same, Decreasing]

(37)

Why scanning is increasing

[Chart: percentage of respondents, 20% to 100%; categories:]

•  More people have been given access to scanning
•  Mix of more documents being scanned and more people gaining access to scanning
•  Individuals are scanning more documents

(38)

What can be done to prevent OCR errors from occurring

(39)

Image enhancement filters out defects

Halftone Removal

Despeckle

Color

Smooth Characters

Remove Gridlines
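
As a rough illustration of what a despeckle filter does, the Python sketch below clears isolated black pixels from a tiny one-bit bitmap. It is a deliberately simple stand-in, assuming that a "speck" is a single black pixel with no black neighbors; commercial image-enhancement filters are considerably more sophisticated, and the names `Bitmap` and `despeckle` are hypothetical.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = black pixel, 0 = white pixel


def despeckle(bitmap: Bitmap) -> Bitmap:
    """Clear black pixels that have no black neighbor in the 8 surrounding cells."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]  # work on a copy, leave the input intact
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            has_neighbor = any(
                bitmap[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbor:
                cleaned[y][x] = 0
    return cleaned


# The lone pixel in the top-left corner is a speck; the cluster on the
# right stands in for part of a printed character.
page = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
for row in despeckle(page):
    print(row)
```

The isolated pixel is removed while the connected cluster survives, which is the property that keeps toner specks from being misread as punctuation or stray characters.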

(40)

Selectively Processing PDF Files Isn’t a Viable Strategy

A client delivers a disk with millions of files. How do you know which PDFs are “compound” files that require special handling?

It’s kind of like looking for needles in

(41)

Segregate All PDF Files Before Conversion

After collecting documents, but before processing, review and analysis:

Identify all PDF documents
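
Identifying every PDF in a collection can be sketched in Python by checking each file's leading bytes rather than trusting its extension. This is a minimal sketch that assumes the `%PDF-` signature sits at the very start of the file; the function names `is_pdf` and `find_pdfs` and the example path are hypothetical.

```python
import os
from typing import Iterator


def is_pdf(path: str) -> bool:
    """True if the file starts with the PDF signature, whatever its extension."""
    try:
        with open(path, "rb") as handle:
            return handle.read(5) == b"%PDF-"
    except OSError:
        return False  # unreadable files are simply not reported as PDFs


def find_pdfs(root: str) -> Iterator[str]:
    """Yield the path of every PDF found under root, including subfolders."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            if is_pdf(path):
                yield path


if __name__ == "__main__":
    # Placeholder path; point this at the collected documents.
    for pdf_path in find_pdfs("collection"):
        print(pdf_path)
```

Checking content instead of file names catches PDFs that were renamed or saved without an extension, and skips files that merely carry a `.pdf` suffix.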

(42)

Pre-process all PDF files with OCR

Text images are potentially hidden within files

Newer OCR tools will make image text searchable without disturbing the rest of the data within the document

Ensures all text within every PDF document can be searched by the eDiscovery system

Should also convert to PDF/A at the same time so files

(43)

How to correct OCR errors when they happen

(44)

Post-processing error-correction options

1. Proofreading: too time-intensive

2. Run an automated spell check

Effective at identifying and correcting spelling errors

Doesn’t solve contextual errors

•  Example, “How is you day?”
•  Spelled correctly but clearly is wrong
•  If searching for “your” this instance won’t be found
•  No commercial solutions today solve this problem
•  Helps to understand this potential problem
•  Can either manually review and correct or adjust search strategy to
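
The limits of spell checking, and one way of adjusting a search to compensate, can be shown with a short Python sketch. The tiny `VOCABULARY` set stands in for a real dictionary, and `misspelled_words` and `search_any` are hypothetical helpers written for this illustration only.

```python
import re

# Tiny stand-in vocabulary; a real spell checker would use a full dictionary.
VOCABULARY = {"how", "is", "you", "your", "day", "learning"}


def misspelled_words(text: str) -> list:
    """Return the words in text that are not in VOCABULARY."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in VOCABULARY]


def search_any(text: str, terms: list) -> bool:
    """True if any search term occurs as a whole word in text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return any(term.lower() in words for term in terms)


print(misspelled_words("leaming"))           # non-word error is flagged
print(misspelled_words("How is you day?"))   # real-word error is not flagged

# Adjusted search strategy: also look for a likely OCR variant of the term.
print(search_any("How is you day?", ["your"]))
print(search_any("How is you day?", ["your", "you"]))
```

Adding variants widens the search, so it trades precision for recall: the broader query finds the damaged sentence but will also return every legitimate use of "you".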

(45)

Final Thought – Why You Should Care

General business / transaction documents: Purchase orders, invoices, contracts, employee identification documents…

Healthcare organizations / patient medical records: Physician’s notes, discharge summaries, test results, post operative reports, etc…

Personal identification documents: Driver’s licenses, social security cards, professional certificates

Public institutions / Police records: Incident / accident reports, police logs, court records…

Insurance & banking: Claims documents, medical records, financial records…

(46)
(47)

Next Steps

A recording of today’s Webinar will be available shortly for you to review at your leisure

Contact Nuance with any questions you may have: 781-565-5000 or [email protected]

For more information visit:

http://www.nuance.com/for-business/by-industry/legal/legal-solution

Try our 30-Day free trial of Power PDF at

http://www.powerpdf.com

(48)
