TM
A Case Study
of
eBay UTF-8 Database Migration
Nelson Ng
2
TM
What is the difficulty in migrating
from ISO Latin 1 to UTF-8?
TM
eBay: The World’s Online Marketplace
An online trading platform where everyone
can trade almost anything from anywhere
• A global presence in 33 international markets • 203 million registered users
• $44.3 billion in annualized gross merchandise volume (value of goods sold)
• 724,000 small business and individuals in U.S. use eBay as a primary or secondary sales channel
• 36,000 Developers via eBay API
4
TM
Agenda
•
Reasons for UTF-8 Migration
•
Challenges of UTF-8 Migration
•
Approaches to UTF-8 Migration
TM
Reasons
for
6
TM
Ambiguity in Textual Data Representation
UTF-8
Ñ
a
c
Á
d
Ã
‘
a
c
ä
¸
-å
œ
‹
Ã
â
d
,
CP1252
€
à ?
a
c
ä
¸
-å
Ã
d
�
ISO 8859-1
中
¬
Raw Byte Sequence in
Hex representation
國
â
¬
OR
OR
� �
�
�
�
C3 91
61
63
E4 B8 AD
E5 9C 8B
C3
E2
64
82 AC
81
TM
eBay Marketplace Textual Data Encoding Model
•
eBay Marketplace supports both Latin 1 and UTF-8 sites on
a single platform:
¾ Latin 1 Sites: US, CA, UK, AU, FR, DE, IT, BE (nl), BE (fr), NL, ES, CH, AT, IE, NZ
¾ UTF-8 Sites: TW, IN, CN, HK, MY, PH, PL, SG, SE
•
Textual data for all sites are stored in the same set of
tables and DB hosts
•
A combination of ISO Latin 1 and UTF-8 DB hosts
¾ ISO Latin 1 DB host:
Latin 1 Site records: Encoded in Cp1252
UTF-8 Site records: Encoded in UTF-8
¾ UTF-8 DB host:
8
TM
TM
Issues with Textual Data Encoding Model
•
Cross-Border Trade Impediments
•
Increased code complexity and developmental
costs to support mixed textual data encoding model
•
Increased 3
rdParty integration costs to implement
eBay textual data encoding model
•
Textual data encoding model not supported by
10
TM
Challenges
of
TM
Summary of Challenges
•
S
tringentAvailability
¾Site Uptime Requirement: 99.94%
•
Magnitude of Data
¾Total Site DB Hosts: ~160
¾Total Active DB Tables: ~2000
¾Total DB Storage Usage: ~200TB
¾Total Redo Log Generation: ~2TB/day
•
Extreme Usage
¾Peak time SQL Execution: ~462K/sec
¾Avg. time for SELECT: ~7ms
¾Avg. time for INSERT/UPDATE: ~15ms
¾Total Application Servers: ~5K
•
Frequency of Change
12
TM
Approach
to
TM
How Big was the Problem?
• Scanning DB hosts with Oracle CSSCAN
• Identify different types of textual data encoding
¾ 7 bit ASCII (0x0 to 0x7F): No conversion. Same value as-is in UTF-8
¾ 8 bit Cp1252 (0x80 to 0xFF): Converted into 2 bytes or 3 bytes for UTF-8.
14
TM
Approaches
1.
Conversion with External File
2.
Conversion with Secondary DB
3.
In-line Conversion WITHOUT Status flag
TM
16
TM
Pre Conversion and Data Conversion Stages
1. Create a UTF-8 Status column with default value NULL in each table. 2. Modify code base to support UTF-8 Migration to check DB Charset
a. If DB is UTF-8, then use UTF-8 rules to process textual data for all sites b. If DB is ISO Latin 1, then process textual data according to the UTF-8
Status column
3. Activate Implicit Conversion
TM
DB Migration Stage
1. Stop Incoming Traffic to DB Host
2. ALTER DATABASE CHARSET from
WE8ISO8859P1
to
UTF8
18
TM
TM
Reasons for this Approach
Pros
• No DB downtime during data conversion. Only minimal DB outage during ALTER DB
• Implicit Conversion to reduce Explicit Conversion effort
• Ability to pause Data Conversion if necessary
• Ability to rollback ALTER DB if necessary
• Does not require additional hardware for secondary DB
Cons
• A new UTF-8 Status column has to be created for each table
• All DB access code has to be modified to handle the new UTF-8
Status column during transition period of data conversion
20
TM
Lessons Learned
TM
Lessons Learned
•
Planning! Planning! Planning!
¾
Upfront Data Analysis
¾
DB Environment Preparation
¾
Leverage Site Maintenance Windows
•
Gradual Activation of Data Conversion
•
Data Monitoring
¾
Data Environment Changes
¾
Integrity of Data Conversion
¾
Problematic SQL in DB Log
•
Throttle Data Conversion Speed
•
Dedicated support resources from Development and
Operations teams
22
TM
Metrics of Migration
Duration of Explicit Conversion:
~7 months
¾
Number of Records:
~18 billion
¾
Number of Columns:
~600
¾
Number of Tables:
~200
Duration of DB Migration:
~7 months
¾
Number of DB Host:
~100
TM
Q & A
24
TM
TM
Alternative 1: Conversion with External File
1. Identify all records with 8 bit Cp1252(0x80 to 0xFF) textual data 2. Extract identified records into a file
3. Use a script to convert the textual data in the file to UTF-8 4. ALTER DATABASE CHARSET from WE8ISO8859P1 to UTF8
26
TM
Alternative 1: Conversion with External File
Pros
• Avoids creating UTF-8 Status column in each table
• Code change to handle transition period during data migration is not required.
• Avoids requiring additional hardware for secondary DB
Cons
• Data conversion must be completed in a single outage window
• Prolong DB downtime due to data conversion takes place during outage window
TM
Alternative 2: Conversion with Secondary DB
1. Create Last Modified Date(LMD) column in each table
2. Replicate Primary DB to Secondary DB and save the timestamp
3. In Secondary DB, identify all records with 8 bit Cp1252(0x80 to 0xFF) textual data 4. In Secondary DB, use a script to convert the identified records to UTF-8
5. In Primary DB, run post conversion script to sync up modified records since DB replication 6. In Secondary DB, ALTER DB CHARSET from WE8ISO8859P1to UTF8
28
TM
Alternative 2: Conversion with Secondary DB
Pros
• Avoids creating UTF-8 Status column in each table
• Code change to handle transition period during data migration is not required.
• Data conversion can be restarted if necessary
• No data quality impact to source during data conversion
• No DB outage for ALTER DB.
Minimal time to redirect application servers traffic
Cons
• High hardware setup and
administration cost for secondary DB
• Overhead to create and maintain Last Modified Date column
• High overhead to sync up all
modified records since secondary DB replication
TM
Alternative 3: In-line Conversion WITHOUT Status column
1. Scan to identify all records with 8 bit Cp1252(0x80 to 0xFF) textual data to be converted
2. Use a script to convert all identified records to UTF-8
3. Repeat Step 1 & 2 to convert newly generated records with Cp1252 data until there is only a small number of Cp1252 records
4. During DB outage window, scan for remaining records with 8 bit Cp1252 textual data and convert them to UTF-8
30
TM
Alternative 3: In-line Conversion WITHOUT Status flag
Pros
• Creation of UTF-8 Status column in each table is not required
• Ability to pause Data Conversion if necessary
• Ability to rollback ALTER DB if necessary
• Does not require additional hardware for secondary DB
Cons
• Implement charset detection logic into site code base could increase code
complexity and performance overhead
• Charset detection to process each record is less reliable than explicit UTF-8 Status flag
• Multiple iteration of data conversion process
• Potential longer DB outage to convert remaining Cp1252 records during the final step of data conversion