• No results found

A Case Study of ebay UTF-8 Database Migration

N/A
N/A
Protected

Academic year: 2021

Share "A Case Study of ebay UTF-8 Database Migration"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

TM

A Case Study

of

eBay UTF-8 Database Migration

Nelson Ng

(2)

2

TM

What is the difficulty in migrating

from ISO Latin 1 to UTF-8?

(3)

TM

eBay: The World’s Online Marketplace

An online trading platform where everyone

can trade almost anything from anywhere

• A global presence in 33 international markets • 203 million registered users

• $44.3 billion in annualized gross merchandise volume (value of goods sold)

• 724,000 small business and individuals in U.S. use eBay as a primary or secondary sales channel

• 36,000 Developers via eBay API

(4)

4

TM

Agenda

Reasons for UTF-8 Migration

Challenges of UTF-8 Migration

Approaches to UTF-8 Migration

(5)

TM

Reasons

for

(6)

6

TM

Ambiguity in Textual Data Representation

UTF-8

Ñ

a

c

Á

d

Ã

a

c

ä

¸

œ

Ã

â

d

,

CP1252

à ?

a

c

ä

¸

Ã

d

ISO 8859-1

¬

Raw Byte Sequence in

Hex representation

â

¬

OR

OR

� �

C3 91

61

63

E4 B8 AD

E5 9C 8B

C3

E2

64

82 AC

81

(7)

TM

eBay Marketplace Textual Data Encoding Model

eBay Marketplace supports both Latin 1 and UTF-8 sites on

a single platform:

¾ Latin 1 Sites: US, CA, UK, AU, FR, DE, IT, BE (nl), BE (fr), NL, ES, CH, AT, IE, NZ

¾ UTF-8 Sites: TW, IN, CN, HK, MY, PH, PL, SG, SE

Textual data for all sites are stored in the same set of

tables and DB hosts

A combination of ISO Latin 1 and UTF-8 DB hosts

¾ ISO Latin 1 DB host:

™ Latin 1 Site records: Encoded in Cp1252

™ UTF-8 Site records: Encoded in UTF-8

¾ UTF-8 DB host:

(8)

8

TM

(9)

TM

Issues with Textual Data Encoding Model

Cross-Border Trade Impediments

Increased code complexity and developmental

costs to support mixed textual data encoding model

Increased 3

rd

Party integration costs to implement

eBay textual data encoding model

Textual data encoding model not supported by

(10)

10

TM

Challenges

of

(11)

TM

Summary of Challenges

S

tringent

Availability

¾Site Uptime Requirement: 99.94%

Magnitude of Data

¾Total Site DB Hosts: ~160

¾Total Active DB Tables: ~2000

¾Total DB Storage Usage: ~200TB

¾Total Redo Log Generation: ~2TB/day

Extreme Usage

¾Peak time SQL Execution: ~462K/sec

¾Avg. time for SELECT: ~7ms

¾Avg. time for INSERT/UPDATE: ~15ms

¾Total Application Servers: ~5K

Frequency of Change

(12)

12

TM

Approach

to

(13)

TM

How Big was the Problem?

Scanning DB hosts with Oracle CSSCAN

Identify different types of textual data encoding

¾ 7 bit ASCII (0x0 to 0x7F): No conversion. Same value as-is in UTF-8

¾ 8 bit Cp1252 (0x80 to 0xFF): Converted into 2 bytes or 3 bytes for UTF-8.

(14)

14

TM

Approaches

1.

Conversion with External File

2.

Conversion with Secondary DB

3.

In-line Conversion WITHOUT Status flag

(15)

TM

(16)

16

TM

Pre Conversion and Data Conversion Stages

1. Create a UTF-8 Status column with default value NULL in each table. 2. Modify code base to support UTF-8 Migration to check DB Charset

a. If DB is UTF-8, then use UTF-8 rules to process textual data for all sites b. If DB is ISO Latin 1, then process textual data according to the UTF-8

Status column

3. Activate Implicit Conversion

(17)

TM

DB Migration Stage

1. Stop Incoming Traffic to DB Host

2. ALTER DATABASE CHARSET from

WE8ISO8859P1

to

UTF8

(18)

18

TM

(19)

TM

Reasons for this Approach

Pros

No DB downtime during data conversion. Only minimal DB outage during ALTER DB

Implicit Conversion to reduce Explicit Conversion effort

Ability to pause Data Conversion if necessary

Ability to rollback ALTER DB if necessary

Does not require additional hardware for secondary DB

Cons

A new UTF-8 Status column has to be created for each table

All DB access code has to be modified to handle the new UTF-8

Status column during transition period of data conversion

(20)

20

TM

Lessons Learned

(21)

TM

Lessons Learned

Planning! Planning! Planning!

¾

Upfront Data Analysis

¾

DB Environment Preparation

¾

Leverage Site Maintenance Windows

Gradual Activation of Data Conversion

Data Monitoring

¾

Data Environment Changes

¾

Integrity of Data Conversion

¾

Problematic SQL in DB Log

Throttle Data Conversion Speed

Dedicated support resources from Development and

Operations teams

(22)

22

TM

Metrics of Migration

Duration of Explicit Conversion:

~7 months

¾

Number of Records:

~18 billion

¾

Number of Columns:

~600

¾

Number of Tables:

~200

Duration of DB Migration:

~7 months

¾

Number of DB Host:

~100

(23)

TM

Q & A

(24)

24

TM

(25)

TM

Alternative 1: Conversion with External File

1. Identify all records with 8 bit Cp1252(0x80 to 0xFF) textual data 2. Extract identified records into a file

3. Use a script to convert the textual data in the file to UTF-8 4. ALTER DATABASE CHARSET from WE8ISO8859P1 to UTF8

(26)

26

TM

Alternative 1: Conversion with External File

Pros

Avoids creating UTF-8 Status column in each table

Code change to handle transition period during data migration is not required.

Avoids requiring additional hardware for secondary DB

Cons

Data conversion must be completed in a single outage window

Prolong DB downtime due to data conversion takes place during outage window

(27)

TM

Alternative 2: Conversion with Secondary DB

1. Create Last Modified Date(LMD) column in each table

2. Replicate Primary DB to Secondary DB and save the timestamp

3. In Secondary DB, identify all records with 8 bit Cp1252(0x80 to 0xFF) textual data 4. In Secondary DB, use a script to convert the identified records to UTF-8

5. In Primary DB, run post conversion script to sync up modified records since DB replication 6. In Secondary DB, ALTER DB CHARSET from WE8ISO8859P1to UTF8

(28)

28

TM

Alternative 2: Conversion with Secondary DB

Pros

Avoids creating UTF-8 Status column in each table

Code change to handle transition period during data migration is not required.

Data conversion can be restarted if necessary

No data quality impact to source during data conversion

No DB outage for ALTER DB.

Minimal time to redirect application servers traffic

Cons

High hardware setup and

administration cost for secondary DB

Overhead to create and maintain Last Modified Date column

High overhead to sync up all

modified records since secondary DB replication

(29)

TM

Alternative 3: In-line Conversion WITHOUT Status column

1. Scan to identify all records with 8 bit Cp1252(0x80 to 0xFF) textual data to be converted

2. Use a script to convert all identified records to UTF-8

3. Repeat Step 1 & 2 to convert newly generated records with Cp1252 data until there is only a small number of Cp1252 records

4. During DB outage window, scan for remaining records with 8 bit Cp1252 textual data and convert them to UTF-8

(30)

30

TM

Alternative 3: In-line Conversion WITHOUT Status flag

Pros

Creation of UTF-8 Status column in each table is not required

Ability to pause Data Conversion if necessary

Ability to rollback ALTER DB if necessary

Does not require additional hardware for secondary DB

Cons

Implement charset detection logic into site code base could increase code

complexity and performance overhead

Charset detection to process each record is less reliable than explicit UTF-8 Status flag

Multiple iteration of data conversion process

Potential longer DB outage to convert remaining Cp1252 records during the final step of data conversion

References

Related documents

study was designed to document the commonly isolated gram-negative bacilli from different clinical specimens and their susceptibility patterns in tertiary health care hospital

This makes sense in light of MOE's focus on using adaptation finance for delivery and not just TA (to generate visibility of climate change investments at the national and

If your database server or application server use a different character encoding form, you may have to configure them to support UTF-8.. This section provides information

 Good file format choices for data archiving include: Plain text (Ascii/UTF-8) or Open Document Format (not Word), CSV (not Excel), JPEG2000 (not JPG), TIFF or PNG (not GIF),

It is very vivid that in Table 3 nearly all responses are on the agree side by indicating the belief on positive impact of educational supervisor in terms of the curriculum

The four common peer review activities identified from Mozilla and Python — submission, identification, resolution and evaluation — share similarities with four steps

From board game-based leadership simulations on topics like change leadership, team performance, stakeholder engagement and strategy execution to bespoke market and

Michael Goodwin, Director-Principal – Durant Campus Debbie Golden, Director-Principal – Hugo/Antlers Campuses Johnnie Meredith, Director-Principal – Idabel Campus. April