Nucleotide sequence of a cloned woodchuck hepatitis virus genome: comparison with the hepatitis B virus sequence.


JOURNAL OF VIROLOGY, Jan. 1982, p. 51-65

0022-538X/82/010051-15$02.00/0 Vol. 41, No. 1


FRANCIS GALIBERT,* TSIEN NAN CHEN,† AND ELISABETH MANDART

Laboratoire d'Hématologie Expérimentale, Centre Hayem, Hôpital Saint-Louis, 75475 Paris Cédex 10, France

Received 1 June 1981/Accepted 14 July 1981

The complete nucleotide sequence of a woodchuck hepatitis virus genome cloned in Escherichia coli was determined by the method of Maxam and Gilbert. This sequence was found to be 3,308 nucleotides long. Potential ATG initiator triplets and nonsense codons were identified and used to locate regions with a substantial coding capacity. A striking similarity was observed between the organization of human hepatitis B virus and woodchuck hepatitis virus. Nucleotide sequences of these open regions in the woodchuck virus were compared with corresponding regions present in hepatitis B virus. This allowed the location of four viral genes on the L strand and indicated the absence of protein coded by the S strand. Evolution rates of the various parts of the genome as well as of the four different proteins coded by hepatitis B virus and woodchuck hepatitis virus were compared. These results indicated that: (i) the core protein has evolved slightly less rapidly than the other proteins; and (ii) when a region of DNA codes for two different proteins, there is less freedom for the DNA to evolve and, moreover, one of the proteins can evolve more rapidly than the other. A hairpin structure, very well conserved in the two genomes, was located in the only region devoid of coding function, suggesting the location of the origin of replication of the viral DNA.

Biological study of hepatitis B virus (HBV) is drastically limited due to its restricted host range and failure to infect tissue culture. However, with development of recombinant DNA technology, it was possible to clone the HBV genome in Escherichia coli (2, 7, 16) and produce substantial amounts of DNA; this, in turn, allowed its complete primary structure to be analyzed (8, 14, 20). Although genes cannot be identified from a DNA sequence alone because of the possibility of splicing events able to eliminate a closed reading frame from a putative mRNA transcript, the location of open reading frames could be a useful step toward gene identification. On the HBV genome (subtype ayw), we were able to locate several open reading frames with substantial coding capacity (8). Among other considerations, two of them were found to correspond to two viral proteins: polypeptide I and its glycosylated derivative, polypeptide II, the main components of the viral envelope, called the hepatitis B surface antigen (HBsAg), and the hepatitis B core antigen (HBcAg) (4, 8, 19).

Recently three hepatitis viruses infecting, respectively, woodchucks, squirrels, and Peking ducks were isolated (11, 18; J. Summers, W. T. London, T. T. Sun, and B. S. Blumberg, manuscript in preparation). These viruses have many characteristics in common with human HBV, and some are unique to this novel class. Of these characteristics, ultrastructure, antigenic makeup, DNA size and structure (the smallest known DNA genome, made of a partially single-stranded and noncovalently closed circular DNA), and features of persistent infection should be noted (6, 12, 21).

† Present address: Genetic Institute of the Academy of Sciences of China, Peking, China.

Because these HBV-like viruses share all of these properties, we felt that a comparison of their genomes at the nucleotide level might provide an answer to several questions concerning the number and location of the viral genes, the serological relationship between the various viral proteins, and the evolution of the genomes of these viruses, which are localized in different ecological niches.

Recently we were able to locate within the whole cloned DNA genome of woodchuck hepatitis virus (WHV) the gene coding for the WHsAg protein and to compare the amino acid sequences of the surface antigens from human and woodchuck hepatitis viruses (F. Galibert, E. Mandart, and S. N. Chen, Proc. Natl. Acad. Sci.


U.S.A., in press). We report here the complete nucleotide sequence of the genome of WHV and compare its primary structure with that of the human HBV genome (8).

MATERIALS AND METHODS

Enzymes and chemicals. Restriction endonucleases came from New England Biolabs and were used as recommended by the manufacturer. DNA polymerase I came from Boehringer Mannheim, and bacterial alkaline phosphatase and polynucleotide kinase were from P. L. Biochemicals. Chemicals used for the nucleotide sequence analysis were as described (10). [γ-32P]ATP (specific activity, >2,500 Ci/mmol) and α-32P-labeled nucleotide triphosphate (specific activity, >3,000 Ci/mmol) were from New England Biolabs.

Preparation of Eco WHV DNA. A λ-WHV recombinant was constructed by Cumming et al., using λgt WES·λB bacteriophage as the cloning vector (6). The cloned DNA was referred to as Eco WHV DNA. Propagation and purification of the recombinant as well as preparation of the Eco WHV DNA were performed as previously described (4, 8, 10).


FIG. 1. Autoradiogram of 16% acrylamide sequencing gel showing the sequence of chain L from nucleotide 208 (a), 714 (b), and 889 (c). Five chemical reactions specific for G, AG, CT, C, and AC were performed on all sequenced fragments.


Containment. Containment conditions were as recommended by the French National Control Committee. The culture of recombinant bacteriophage was done under L3B1 conditions.

DNA nucleotide sequence. Sequence analyses were performed by the Maxam and Gilbert method (13). Usually, about 10 pmol of Eco WHV DNA (20 μg) was fully digested each time with a given restriction enzyme. Fragments were dephosphorylated and labeled with [γ-32P]ATP and polynucleotide kinase as described (10). To segregate the two labeled ends, fragments were denatured by heating to 92°C in the presence of 30% dimethyl sulfoxide and fractionated by electrophoresis in acrylamide gel (13). Fragments larger than 600 base pairs were hydrolyzed with another restriction enzyme. Under some circumstances, fragments with a recessed 3' end were labeled with an α-32P-labeled nucleotide triphosphate of choice and DNA polymerase I as described by Hartley and Donelson (9).
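The correspondence between 10 pmol and about 20 μg quoted above can be checked with a short calculation. The sketch below assumes an average molar mass of 650 g/mol per base pair of double-stranded DNA, a common rule of thumb that is not stated in the text, and takes the insert length to be the 3,308 base pairs reported for the WHV genome. The function name is illustrative only.

```python
# Assumption (not from the paper): average molar mass of one base pair
# of double-stranded DNA, in g/mol.
AVG_BP_MASS = 650.0


def dsdna_mass_ug(pmol: float, length_bp: int, bp_mass: float = AVG_BP_MASS) -> float:
    """Mass in micrograms of `pmol` picomoles of double-stranded DNA
    that is `length_bp` base pairs long."""
    grams = pmol * 1e-12 * length_bp * bp_mass
    return grams * 1e6


# 10 pmol of a 3,308-bp fragment
print(f"{dsdna_mass_ug(10, 3308):.1f} ug")  # 21.5 ug
```

Under that assumption the result is close to the 20 μg given in the text, so the two figures are mutually consistent.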

RESULTS AND DISCUSSION

We determined by the method of Maxam and Gilbert (Fig. 1) the complete nucleotide sequence of the WHV DNA genome cloned in E. coli. The sequence was determined from a large number of fragments obtained with various restriction enzymes. Both DNA strands were independently analyzed except between nucleotides 1,110 and 1,174, which were analyzed on one strand only, but several times and from different restriction sites (Fig. 2). All restriction sites used as starting points were also analyzed as internal points of overlapping fragments in order to detect potentially very small fragments produced by closely spaced identical restriction sites. From the relative positions of the EcoRI and SacI sites on the viral DNA (6) and the position of the SacI restriction site on the sequenced DNA, it was possible to orient the Eco WHV DNA. As for HBV, we designate the DNA strand of the Eco WHV DNA homologous to the longest viral DNA strand as the L strand and the other strand as the S strand. The L strand starts with the sequence CCAGG, and the S strand starts with the sequence AATTCGGG. The sequence shown in Fig. 3 is complementary to the L strand. It is 3,308 nucleotides long, compared with the 3,182 nucleotides of the HBV L strand.

Location of the open reading frame. A computer program (17) was used to search for stop codons and ATG triplets on both DNA strands (Fig. 4). In chain L, which had a sequence identical to the potential transcripts of chain S, there was a large number of stop codons (44, 64, and 60, respectively) located in the three phases. Moreover, the distribution of these stop codons all along the molecule left only one region in phase 1, between nucleotides 1,719 and 1,179, with a coding capacity above the arbitrarily chosen threshold of 100 amino acids.
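The kind of search described here can be sketched in a few lines. The code below is not the program of reference 17. It is a minimal illustration that treats the sequence as linear (the viral genome is circular, so a full analysis would also follow frames across the EcoRI site) and uses a short made-up test sequence. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_positions(seq: str, frame: int) -> list[int]:
    """0-based start positions of stop codons in reading frame 0, 1, or 2."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]


def open_stretches(seq: str, frame: int, min_codons: int = 100) -> list[tuple[int, int]]:
    """Half-open (start, end) intervals, 0-based, that contain no stop codon
    in the given frame and span at least `min_codons` codons."""
    stops = stop_positions(seq, frame)
    last_codon_end = frame + 3 * ((len(seq) - frame) // 3)
    # Sentinels: a virtual stop just before the frame start, and the end of the last full codon.
    bounds = [frame - 3] + stops + [last_codon_end]
    stretches = []
    for left, right in zip(bounds, bounds[1:]):
        start, end = left + 3, right
        if (end - start) // 3 >= min_codons:
            stretches.append((start, end))
    return stretches


# Made-up toy sequence, not taken from the WHV genome.
toy = "ATG" + "GCA" * 5 + "TAA" + "GCC" * 3 + "TGA" + "AC"
print(stop_positions(toy, 0))                # [18, 30]
print(open_stretches(toy, 0, min_codons=3))  # [(0, 18), (21, 30)]
```

With `min_codons=100`, the default mirrors the threshold of 100 amino acids used in the text.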

On the other strand, the number of stop codons in the various phases was smaller (33, 50, and 30, respectively), and their uneven distribution created several large open reading frames. The locations of these open reading frames on the WHV circular genome are shown and compared with those of the HBV genome in Fig. 5. From this comparison, a striking similarity in the number, size, and location of the open reading frames, relative to each other and to the nick of the L strand, emerged between the two genomes. The fact that there was one open reading frame in chain L of the WHV genome but four in chain L of the HBV genome was the only difference. We previously noted (8) that, in the open reading frames 1 to 4 of the HBV L chain, the first ATG encountered and able to play an initiator role was located well inside these open reading frames, largely reducing their coding capacity. The lack of open reading frames homologous to regions 2, 3, and 4 in the


FIG. 2. Diagram of analyzed DNA fragments. Vertical bars correspond to the position of the labeled ends of restriction fragments used. Length of the arrows is relative to the number of analyzed nucleotides. Most of the fragments were 5' labeled. HinfI fragments were labeled at the 5' and 3' ends. (1) BstEII; (2) HinfI; (3) HaeIII; (4) BamHI, BglII, and HindIII; (5) AluI; (6) Sau3A; (7) RsaI; (8) DdeI.


IleArgAspIleProArgGlyLeuValProPrcGlnThrProThrAsnArgAspGlnGlyArgLysProThrPro

AsnSerGlyHisThrThrTrpPheSerSerAlaSerAsnSerAsnLysSerArgSerArgGluLysAlaTyrSer

1

AATTCGGGACATACCACGTGGTTTAGTTCCGCCTCAAACTCCAACAAATCGAGATCAAGGGAGAAAGCCTACTCC

_.0._.*..*

. . .. *. ..** . .

3081 CCAGCAAAT CCGCCTCCTGCCTCCACCAATCGCCAGTCAGGAAGGCAGCCTACCCC

AlaSerLys o o o CysLeuHisGln o ProVal o LysAla o o Pro

o AlaAsn o o ProAlaSer o o o GlnSer o o Gln o o o

ProThrProProLeuArgAspThrHisProHisLeuThrMetLysAsnGlnThrPheHisLeuGInGlyPheVal

SerAsnSerThrSerLysArgTyrSerProProLeuAsnTyrGluLysSerAspPheSerSerPraGlyValArg

76 TCCAACTCCACCTCTAAGAGATACTCACCCCCACTTAACTATG AAAAATCAGA CT TTCGT

... 0.0 ... ..

3137 GCTGTCTCCACCTTTGAGAAACACTCATCCTCAGGCC ATGCAGTGGAATTC CACAACCTTCCA AlaVal o o PheGluLysHis o SerSerGly HisAlaValGluPhe HisAsnLeuPro

LeuSer o o o o Asn o o o GlnAla o GlnThrAsnSer ThrThr o His

AspGlyLeuArgAspLeuThrThrThrGluArgGlnHisAsnAlaTyrGlyAspProPheThrThrLeuSerPro

GlyArgI1eLysArgLeuAspAsnAsnGlyThrProThrGlnCysLeuTrpArgSerPheTyrAspThrLysPro

151 GGACGGA'ITAAGAGACTTGACAACAACGGAACGCCAACACAATGCCTATGGAGATCCTTTTACGACACTAAGCCC 18

CCAAACTCTGCAAGATCCCAGAGTGAGAGGCCTGTATTTCCCTGCTGGTGGCTCCAGTTCAGGAACAGTAAACCC

ProAsnSerAla o SerGlnSerGluArgProValPhePro o Trp o LeuGIn o ArgAsnSer o o

GlnThrLeuGlnAspProArgValArgGlyLeuTyrPhePro o Gly o SerSerSerGly o ValAsn o WHs

ValValProThrValSerThrI1eLeuSerProProSerThrThrGlyAspProAlaGlnSerPrcGluMet

CysGlySerTyrCysIleHisHisileValSerSerl1eAspAspTrpGlyPrcCysThrValThrGlyAsp

226

TGTGGTTCCTACTGTATCCACCATATTGTCTCCTCCATCGACGACTGGGGACCCTGCACAGTCACCGGAGA'G

93

TGTTCTGACTACTGCCTCTCCCTTATCGTCAATCTTCTCGAGGATTGGGGACCCTGCGCTGAAC

ATGGAG

o SerAsp o o LeuSerLeu o o AsnLeuLeuGlu o o o o o AlaGlu HisGly

o LeuThr o Ala o ProLeuSer o IlePhe o Arglle o o o o LeuAsn o Glu

SerProSerSerLeuLeuGlyLeuLeuAlaGlyLeuGlnValValTyrPheLeuTrpThrLysIeLeuThr

ValThrlIeLysSerProArgThrProArgArgIleThrGlyGlyValPheLeuValAspLysAsnProAsn

299

TCACCATCAAGTCTCCTAGGACTCCTCGCAGGATTACAGGTGGTGTATTTCTTGTGGACAAAAATCCTAACA

163

AACATCACATCAGGATTCCTAGGACCCCTTCTCGTGTTACAGGCGGGGTTTTTCTTGTTGACAAGAATCCTCACA

GluHisHis o Arglle o o o o Ser o Val o o o o o o o o o o o His

AsnlleThr o GlyPhe o o Pro o LeuVal o o

AlaGlyPhe

o o Leu o

Arg

o o o

IleAlaGlnAsnLeuAspTrpTrpTrpThrSerLeuSerPhePrcGlyGlyllePrdGluCysThrGlyGlnAsn

AsnSerSerGluSerArgLeuValValAspPheSerGlnPheSerArgGlyHisThrArgValHisTrpProLys

371

ATAGCTCAGAATCTAGATTGGTGGTGGACTTCTCTCAGTTTTCCAGGGGGCATACCAGAGTGCACTGGCCAAAAT

*C * *v.o -*-* * ***.-. ....@-*@**- *o.*@@@@@*.o. * . ...*@-..

--238

ATACCGCAGAGTCTAGACTCGTGGTGGACTTCTCTCAATTTTCTAGGGGGAACTACCG,I.irI

ii.,uOCAAAAT

o Thr o o o o o o o o o o o o o o o AsnTyr o o Ser o o o

o Pro o Ser o o Ser o o o o o Asn o Leu o o ThrThrVal o Leu o o o

SerGlnPheGlnThrCysLysHisLeuProThrSerCysProProThrCysAsnGlyPheArgTrpMetTyrLeu

PheAlaValProAsnLeuGlnThrLeuAlaAsnLeuLeuSerThrAspLeuGlnTrpLeuSerLeuAspValSer

446 TCGCAGTTCCAAACTTGCAAACACTTGCCAACCTCCTGTCCACCGACTTGCAATGGCTTTCGTTGGATGTATCTG

313 TCGATCAjCTCTATACCTTGCTAATGCTGATGCTGGATGTGTCTG

o o o o o o o Ser o Thr o o o o SerAsn o Ser o o o o o o o o o SerPro o SerAsn o Ser o o o o o o o o Pro o Tyr o o o Cys o

FIG. 3. Comparison of the WHV and HBV DNA sequences. The top line corresponds to the WHV sequence (S strand); the second line corresponds to the HBV sequence (8). Identical nucleotides are marked with a dot. Identical amino acids are indicated by an open circle. In both cases, sequences are numbered from their EcoRI site.


ArgArgPheIlelieTyrLeuLeuValLeuLeuLeuCysLeuIlePheLeuLeuValLeuLeuAspTrpLysGly

5lAlaAlaPheTyrHisIleProIleSerProAlaAlaValProHiisLeuLeuValGlySerProGlyLeuGluArg

521 CTACTTGTGTCCTGCG^AG

388

CGciGT'rrT

TTTCCTCTfCA.CDCTGO

l

GCT.CATGCCTCATCTTCTTGT.TGGTTCTTCTGGA.CTATCAAGGT

o o o o o Leu o LeuHis o o o Met o o o o o o o Ser o o Ser o

o o o o o Phe o Phe o o o o o o o o o o o o o o

TyrGIn

o

LeulieProValCysProlleGInProThrThrGluThrThrValAsn

CysArgGInCysThrlIeSer

PheAsnThrCysLeuSerTyrSerThrHisAsnArgAsnAspSerGln

LeuGlnThrMetHisAsnLeu 596

TTAATAI~CcLXCTGTCTEIATCAACCCACAACAGAXACGACAGTCAA

TTGCAGACAATGCACAATCTCT

00000.0.0o..0 0. .0. 00

000~~~~~ ~

. .

463 ATGT

TyrValAlaArg o o SerAsnSerArglIeLeu o AsnGInHisGlyThrMetProAspLeu o AspTyr

MetLeu o o o o Leulle o GlySerSer o o SerThrGlyPro o o Thr o o ThrThr

VaIGInAspMetTyrThrProProTyrCysCysCysLeuLysProThrAlaGlyAsnCysThrCysTrpProIle

CysThrArgHisValTyrSerSerLeuLeuLeuLeuPheLysThrTyrGlyArgLysLeuHisLeuLeuAlaHis

'665

TAAACCTACGGCAGGCAATTGCACTTGCTGGCCCATC

.

..0.0.0 0 . 0 O. 0 . . . .0 0

538 CTCAAGGAA ATTGCTGTATTCCCATC

CysSer o AsnLeu o Val o o o o o TyrGIn o Phe o o o o o o TyrSer o Ala o GlyThrSerMetTyr o Ser o o o Thr o o SerAsp o o o o o Ile o o

ProSerSerTrpAlaLeuGlyAsnTyrLeuTrpGluTrpAlaLeuAlaArgPheSerTrpLeuAsnLeuLeuVal

ProPhelleMetGlyPheArgLysLeuPrdMetGlyValGlyLeuSerProPheLeuLeuAlaGlnPheThrSer

740 CGGAGTGGGCC

613

CCATCATCTmMAArrCCTATGGGA

CAGTTTACTAGTG

o Ile o Leu o o a o Ile o o o o o o o o o o o o o o o o

o o o o Phe o LysPhe o o o o Ser o a o o ooSer o o o

ProLeuLeuGlnTrpLeuGlyGlylieSerLeulleAlaTrpPheLeuLeuleTrpMetIleTrpPheTrpGly

AlaLeuAlaSerMetValArgArgAsnPheProHisCysValValPheAlaTyrMetAspAspLeuValLeuGly

815 CCCTTGCTTCAATGGTTAGGTGGAATCTCA

1ox.111R,CTTATATGGATGATTTGGTTTTGGGGG

688

CCATTTGTTCA.GTGGTTCGTAGGGCTTTCCCCCACTGTTTGGCTTTCAGTTATATGGATGATGTGGTATTGGGGG

o IleCys o Val o o o Ala o o o o LeuAla o Ser o o o o Val o o o o PheVal o o o Val o Leu o ProThrVal o LeuSerVal o o o Met Tyr o o

End of region 7

ProAlaLeuLeuSerIleLeuProProPhelleProliePheValLeuPhePheLeulieTrpValTyrlle_

AlaArgThrSerGluHisLeuThrAlaIleTyrThrHisIleCysSerValPheLeuAspLeuGlyIleHiisLeu

890

CCCGCACTTlCTGACAITCTTACCGCATTTATACCCATATTTGTTCTGTTTTTCTTGATTTGGGTATAATTTRAA

763

CTTIACCGCTGTTACCAATTTTCITGTCTTTGGGTATACAT1-AA

o LysSerValGln o o GluSerLeuPhe o AlaValThrAsnPheLeu o Ser o o o o o o Ser o Tyr o o o Ser o o Leu LeuLeuProlle o o CysLeu o o o ao

AsnValAsnLysThrLysTrpTrpGlyAsnHisLeuHisPheMetGlyTyrVailIeThrSerSerGlyValLeu

965

ATGTTAATAAAACAAAATGGTGGGXCAATCATTTACATTrTIATGc^ASTATAATTACTAGTTCAGGTGTATTG

838

ACCCTAACAAAACAAAGAGATGGGGTT.ACTCTCTAAATTTTATGG3TTATGTCATTGGATGTTATGGGTCCTTG

o Pro o o a o Arg o o TyrSer o Asn o o o o o GlyCysTyr o Ser o

FIG. 3-continued


ProGInAspLysHisValLysLysLeuSerArgTyrLeuArgSerValProValAsnGInProLeuAspTyrLys

1039

CCACAAGACAAACATGTTAAGAAACTTTCCCGTTATTTACGCTCTGTCCTTATAACCTCTGGATTACAA

o o GluHis o IleGln o lIeLysGluCysPhe o LysLeu o Ile o Arg o Ile o Trp o IleCysGIuArgLeuThrAspIleLeuAsnTyrValAlaProPheThrLeuCysGlyTyrAlaAlaLeuMetPro

**T**-** * * * * * * *- * .- ... * ...*. ... O*- M *@ ...@

987 GTATGTCAACGAATTGTGGGCITGCCCTTTTACACAATGITIGTATCCTGCGTTGATGCCT

Val o Gln o IleValGlyLeu o GlyPheAla o o o o Gln o o o Pro o o o o

LeuTyrHisAlaIleAlaSerArgThrAlaPheValPheSerSerLeuTyrLysSerTrpLeuLeuSerLeuTyr 1189

CTGTATCATGCTATTGCTTCCCGTACGGCTTTCGTTITCTCCTCCTFGTATAAATCCGGTTGCTGTCTCTTTAT

1062

TG.CAATAC

o o AlaCys o Gln o LysGIn o o Thr o o ProThr o o AlaPhe o CysLysGln o

GluGluLeuTrpProValValArgGlnArgGlyValValCysSerValPheAlaAspAlaThrProThrGlyTrp

1137

CTGAACCTTTACCCCGTTGCCCGGCAACGGCCAGGTCTGTGCCAAGTGTTGCTGACGCAACCCCCACTGGCTG

LeuAsn o Tyr o o Ala o o o ProGlyLeu o Gln o o o o o o o o o o

GlylleAlaThrThrCysGInLeuLeuSerGlyThrPheAlaPheProLeuProlleAlaThrAlaGluLeulle

1339 GAACTCATC

*G-~~~~~~*G * *@ .0 @.0 *- .* D.*- ... *- *-. *---..

1212 GGCG GGG

GGS

VTCTGCCGATCCATACTGCGGAACTCCTA

o LeuValMetGlyHis o ArgMetArg o o o SerAla o o o o His o o o o Leu

AlaAlaCysLeuAlaArgCysTrpThrGlyAlaArgLeuLeuGlyThrAspAsnSerValValLeuSerGlyLys

1414

GGGGCTC

GTCGGGGAAG

1287

GCCGGTTGCT.CGCA)CAGGTCTGGAGCAAACATTATCGGGACTGATAACTCTGTTGTCCTATCCCGCAAA

o o o Phe o o SerArgSer o o Asnllelle o o o o o o o o o Arg o

Gene 5

MetAlaAlaArgLeuCysCysGlnLeuAspProAlaArgAspValLeuLeuLeuArgPro

LeuThrSerPheProTrpLeuLeuAlaCysValAlaAsnTrplIeLeuArgGlyThrSerPhecysTyrValPro

1489 TCCGCGGQGCGTCTTTGCACGCCC

1362 T AIACATGTTTCCATGCTGCTGGCLGTGCGCCLACTGGATCCTGCGCGGGACGTCCTTGTflACGTCCCG

Tyr o o o o o o oGly o Ala o o o o o o o o o o Val o o o o o o o o o o o o o o o o o o o0Cyso o o

PheGlySerGInSerSerGlyProProPheProArgProSerAlaGlySerAlaAlaSerProAlaSerSerLeu

SerALaLeuAsnProAlaAspLeuProSerArgGlyLeuLeuProValLeuArgProLeuProArgLeuArgPhe

15646G4CCACAGGACCCTCGAGCTTC

~J

14371437TCGGCGCGAA.TCcCCTTC

CG~GGGTCrCC

. C

GTCIrrCGT

0.. 0

o o o o o o o Asp o o o o Arg o GlyLeuSer o o o Leu o o Pro o Val o AlaGlu o Cys o Arg o o SerGlySerLeuGlyThrLeuSerSerProSerPro o AlaVal

FIG. 3-continued


SerAlaSerAspGluSerAspLeuProLeuGlyArgLeuProAlaCysPheAlaSerAlaSerGlyProCysCys

ArgProProThrSerArgI1eSerLeuTrpAlaAlaSerProProValSerProArgArgProValArgValAla

1639 CGTCCTCCGACGAGTCGGATCTCCCTTrGGGCCGCCTCCCCGCCTTTCGCCTCGGCGCCGTCCGTGTTGC

Ow* * .O-* . O. ..- O... O* O.@.... - *---.. *...@-

*--1512 CGACCGACCATTCTGCCGGACCGTGTGCAC

o o Thr o Gly o Thr o o Tyr o Asp o o Ser o ProSerHisLeu o Asp o o His

ProThrAspHisGlyAlaHis o Ser o ArgGly o o Val o AlaPhe o SerAla o o o Ala

LeuValValThrCysAlaGluLeuArgThrMetAspSerThrValAsnPheValSerTrpHisAla TrpSerSerProValGlnAsnCysGluProTrplIeProProi End of region 6 1714 TGGTCGTCACCTGTGCAGAATTGCCATGGATTCACCGTGAACTTTGTCTCCTGGCATGCA

O * ** .**pOO*-OO ** **@ @ @ @

1587 TTCGCTTCACCTCTGCAC GTCGCATGGAGACCACCGTGAACGCCCACCAAATATTGCCCAAGGTCTTA

PheAla o o LeuHis ValALa o Arg o o

-o ArgPhe o Ser o o Arg o GluThr o o o AlaHisGlnlleLeuProLysValLeu

AsnArgGlnLeuGly MetProSerLysAspLeuTrpTrpProTyrIleArgAspGlnLeuLeu

1779 AATCGTCAACTTGGC ATGCCAAGCAAG.GACCTTTGGACTCCTTATATAAGAGATCAATTATTA

*. *O @@s *0 O *O. -.- O O .0 ** * O..**

1655 CATAAGAGGACTCTTGGACTCTCAGCAATGTCAACGACCGACCTTGAGGCA TACTTCAAAGACTGTTTGTTT HisLys o Thr o o LeuSerAla o SerThrThr o o GluAla o PheLys o Cys o Phe

ThrLys TrpGluGlu GlySer IleAspProArgLeuSerlIePheValLeuGlyGlyCysArgHis 1842 ACTAAA TGGGAGGAG GGCAGC ATTGATCCTAGATTATCAATATTTGTATTAGGAGGCTGTAGGCAT

1727 AAAGACTGGGAGGAGTTGGGGGAGGAGATT AGGTTAAAGGTCTTTGTACTAGGAGGCTGTAGGCAT

o Asp o o o Leu o GluGlu o o o LysValo o o o o o o o

LysCysMetArgLeuProm End of region 5

1908 AAATGCATGCGACTTCCGTAACCATGTATCTTTTTCACCTGTGCCTTGTTITTGCCTGTGTTCCATGTCCTACTGT 1793 AAATTGGTCTGCGCACCAGCACCATGCAACTTTTTCACCTCTGCCTAATCATCTCTrGT TCATGTCCTACTGT

o LeuValCysAla o AlaPrcCysAsnPhePheThrSerAla _ Region 8 - WHc

MetAsp11eAspProTyrLysGluPheGlySerSerTyr 1984 TCAAGCCTCCAAGCTGTGCCTTGGATIATGGACATAGATcCTTATAAAGAATTTGGTTCATCTTAT

1866 TCAAGCCTCCAAGCTG TGT TTGGACATCGACCCflATAAAGAATAGCTACTGTG

o o o o o o o o o o AlaThrVal

GlnLeuLeuAsnPheLeuProLeuAspPhePheProAspLeuAsnAlaLeuValAspThrAlaThrAlaLeuTyr

2060 CAGTTGTTGAATTTTCTTCCTTGGACTTCTTTCCTGATCTAATGC XGACACTGCTACTGCCTlAT

1942 GAGTTACTCTCGTTCl CTTCTGACTTC

AGTCAGCTCTGTAT

Glu o o Ser o o o Ser o o o o SerValArgAsp o Leu o o o Ser o o o

GluGluGluLeuThrGlyArgGluHi sCysSerProHisHi sThrAla 11eArgGlnAlaLeuValCysTrpAsp 2135 GAAGAAGAACTACAGGTAGGGACATGCTCCGCACCATACAGCTATTAGACAACTTAGTATGCTAT_,.00

* .0 *.@@. .- . O .. * * 00 *-OO * O *-.0-..

2017 CGGGMGCCTTAGAGTCTCCTGAGCATTGTTCACCTCACCATACTGCACTCAGGCAGCTGG

Arg o Ala o GluSerPro o o o o o o o o o Leu o o o IleLeuo o Gly

FIG. 3-continued


GluLeu ThrLysLeulleAla TrpMetSerSerAsnleThrSerGluGlnValArgThrllelleVal

2210 GAATTA ACTAAATTGATAGCT TGGATGAGCTCTAACATAACTTCTGAACAAGTAAGAACAATCATTGTA

2092 GAACTAATGACT CTA GCTACCTGGGTGGGTGTTAATTTGGAAGATCCAGCGTCTAGAGACCTAGTAGTC

o o Met o 0 o Thr o ValGlyVal o LeuGluAspProAlaSer o AspLeuVal o AsnHisValAsnAspThr TrpGlyLeuLysValArgGlnSerLeuTrpPheHisLeuSerCysLeuThrPhe

2279 AATCATGTCAATGATACC TGGGGACTTAAGGTGAGACAAAGTTTATGGTTTCATTTGTCATGTCTCACTTTC 2160 AGTTATGTCAAC ACTAATATGGGCCTAAAGTTCAGGCAACTCTTGTGGTTTCACATTTCTTGTCTCACTTTT

SerTyr o o 0 Met o o o Phe o o Leu o o 0 o Ile o o o o o

GlyGlnHisThrValGlnGluPheLeuValSerPheGlyValTrpIleArgThrProAlaPro TyrArgPro

2351 GGACAACATACAGTTCAAGAATTTTTAGTAAGTTTTGGAGTATGGATCAGGACTCCAGCTCCA TATAGACCT

*. **. ...** @ ** *

2233 GGAAGAGAAACAGTTATAGAGTATTTGGTGTCTTTCGGAGTGTGGATTCGCACTCCT CCAGCTTATAGACCA

o ArgGlu o o Ile o Tyro 0 0 0 o 0 0 0 o o o o Ala o o o

Gene 6

MetHisProPheSerArgLeuPheArgAsnIleGlnSerLeuGlyGluGluGluValGlnGluLeuLeuGly

ProAsnAlaProl1eLeuSerThrLeuPrcGluHisThrVaIlleArgArgArgGlyGlyAlaArgAlaSerArg

2423 CCTAATGCACCCATTCTCTCGACTCTAGG

2305 CCAAATGCCCCTATCCTATCAACACTTCCGGAGACTACTGTTGTTAGACGACGAGGC AGG

o o o o o o o o o o o Thr o o Val o o o o 0

o ProLeuSerTyrGlnHis o o ArgLeuLeuLeuLeuAspAsp o Ala o

ProPrcGluAspAlaLeuProLeuLeuAlaGlyGluAspLeuAsnHisArgValAlaAspAlawLeuAsnLeuHis

SerProArgArgArgThrProSerProArgArgArgArgSerGlnSerProArgArgArgArgSerGlnSerPro

2498

TCCCCCAGAAGACGCACTCCCTCTCCTCGCAGGAGAAGATCTCAATCACCGCGTCGCAGACGCTCTCAATCTCCA

. . * .**O.X * X X**... **** ... ...- *@@@@@

2365 TCCCCTAGAAGAAGAACTCCCTCGCCTCGCAGACGAAGGTCTCAATCGCCGCGTCGCAGAAGATCTCAATCTCGG

° ° o o o oo o o o o o o o o o o o o o o o o o Arg

o Leu o GluGlu o o Arg o o Asp o Gly o o Arg o o o GluAsp o o o Gly

LeuProThrAlaAspLeuGlnTrpValHisLysThrAsnAlaIleThrGlyLeuTyrSerAsnGlnAlaAlaGln

SerAlaAsnCys_ End region 8

2573

TCTGCCAACTIGCTGTTAAGACATAAAACTAATGCTATTACAGGTCTTTACTCTAACCAAGCTGCTCAG

2440 GAATCTCAATGTTAGTATTCCTTGGACT AGGTC AACTTTACGCTTATTCTACTGTACCTGTC

Asn o Gln o0

Asn o AsnValSerlIePro o Thr o o ValGlyAsnPhe o o o o o SerThrValProVal

PheAsnProHisTrpIeGlnPrcGluPhePrcGluLeuHisLeuHisAsnGluLeuIleLysLysLeuGInGIn

2649 TTTAATCCGCATTGGATTCAACCTGAGTTTCCTGAGC ACACAATG

MTTTTTCTAMAG

0...0~ ~~~~..

2516 TUTCCCTGAAAACACCATCT

TTTCCTAATACATTACAMATATG'EAACAG

0 o o o o LysThr o Ser o o Asnlle o o o GlnAsplle o o o CysGlu o

TyrPheGlyProLeuThrIleAsnGluLysArgLysLeuGlnLeuAsnPheProAlaArgPhePheProLysAla

2724TAT

IAGAAAATTGCAATTAAATTTTCCTGCAAGATTTTTCAAAGCC

2591

AGATGAGGAA

AGAGATTGAAT

AGATTGATATTGATTATTGCCAGG

TTTTATCCAAAGTT

PheVal o o 0 o Val o o o o Arg o o 0 IleMet o o o o Tyr o o Val

FIG. 3-continued


ThrLysTyrPheProLeulleLysGlylleLysAsnAsnTyrProAsnPheAlaLeuGluHisPhePheAlaThr 2799

AC.AAA

ATTTCTTAAMAGCATAA

M

AACAATTATCCTMTGCTrAGAA C GACC

2666 ACCAAATATATTA CTTCCAAACT

o o o Leu o o Asp o o o o ProTyr o o GluHisLeuValAsn o Tyr o Gln o AlaAsnTyrLeuTrpThrLeuTrpGluAlaGlylleLeuTyrLeuArgLysAsnGInThrThrLeuThrPheLys

2874 GG _GMC TGA AA

2741 ACATACGGGTCCTCATTTG

ArgHis o o His o o o Lys o o o o o Lys o GluThrThrHisSerAlaSer o Cys Gene 7 - pre S region

MetGlyAsnAsnlleLysValThrPheAsn

GlyLysProTyrSerTrpGluHisArg

GInLeuValGlnHisAsnGlyGlnGInHisLysSerHisLeuGln

2949 GGTAAACCATATTC A AAACATA GA

2741

AGGTCACCATATC

MCAAGATCTACAGGGCAGAATCTTTCC ACCAGCAA

o Ser o o o o o GInAspLeu o His o AlaGluSerPhe o Gln o

o o Gln o LeuSer o Ser o

ProAspLyslleAlaAlaTrpTrpProAlaValGlyThrTyrTyrThrThrThrTyrProGlnAsnGlnSerVaI SerArgGInAsnSerSerValValAlaCysSerGlyHisLeuLeuHisAsnHisLeuProSerGluProValSer

3021 A AG CG

AAGCCAGTCAGT

2876 TCC T C TGGG ATT CTTTCCC GACCAC CAG

o Ser o lie LeuPro Arg o Pro

o Leu o Phe PhePro AspHis Gln

PheGInPrcGlylleTyrGlnThrThrSerLeuValAsnProLysThrGInGInGluLeuAspSerValLeulle ValSerThrArgAsnLeuSerAsnAsnllePheGlyLysSerGInAsnSerThrArgThrGlyLeuCysSerHis

3096 GTTTCAACCAGGAATTTATA AA ACAAGAACTGGClCG TCAT

2904 TT GG ATCCAG CCTTCAGAGC AAA CAC C G C

o Gly SerSer LeuGInSer Lys His Arg

Leu AspPro AlaPheArgAla Asn Thr Ala

AsnArgTyrLysGinIleAspTrpAsnThrTrpGlnGlyPheProValAspGlnLysLeuProLeuValAsnArg

LysGlnIeGInThrAspArgLeuGluHisLeuAlaArglieSerCysArgSerLysThrThrIleGlyGInGln

3171 CAAAAACTACCATTGCAACAG

2933 AAA T C C AGATTGGGAC TT CAA

TCCC

'GGC CAG

O Ser o o GLy Ser Gln o

o Ser

0Pro 0 0 0

[Annotation within the figure: nucleotide 2957 and 3055 has no homology at all]

AlaGl

AspProProProLysSerAla GlnThrPheGlulleLysPrdilyProllelleValPrcGly GlySerSerProLyslIeSer SerAsnPheArgAsnGlnThrTrpAlaTyrAsnSerSerTrp 3246 GGATCCTCCCCCAAAATCAGCT CAMCTTTCGAAATCMACCTGGGCCTATMTAGTTCCTGG 3063 GG C AT A CTACAACTTT G

o His ThrThr o o

Gly lie Leu o o Leu

FIG. 3-continued


[Fig. 4 diagram not reproduced: stop codons and ATG triplets plotted for the three reading frames of chain S (open regions 5, 6, 7, and 8 labeled) and of chain L.]

FIG. 4. Diagram showing the localization of the nonsense codons on chains L and S of the Eco WHV DNA. Three reading frames were defined from the 5' end of each DNA strand. On chain S, frame 1 is defined by its first triplet AAT, frame 2 is identified by ATT, and frame 3 is identified by TTC. On chain L, frame 1 is defined by its first triplet CCA, frame 2 is identified by CAG, and frame 3 is identified by AGG. The viral DNA is circular, and its length in nucleotides (3,308) is not a multiple of 3. Therefore, going through the EcoRI site, reading in frame 1 is continued in frame 2, reading in frame 2 is continued in frame 3, and reading in frame 3 is continued in frame 1. Upper vertical bars indicate stop codons; lower vertical bars represent ATG triplets. As for the HBV genome, numbers 1, 5, 6, 7, and 8 define areas with an open reading frame of at least 100 triplets. Region 1 goes from 1,719 to 1,179; region 5 goes from 1,467 to 1,926; region 6 goes from 2,382 to 3,308 and from 1 to 1,756; region 7 goes from 2,983 to 3,308 and from 1 to 962; and region 8 goes from 1,910 to 2,585.

WHV L chain might therefore indicate that they exist by chance in the HBV genome and do not correspond to a particular viral gene. It is harder to estimate the significance of region 1. In the WHV genome, the first in-phase ATG was located only a few residues ahead of stop codon TAA 1,182 closing the open region 1, and no GUG (which could be an alternate initiator codon) was present upstream. The absence of a potential initiator codon indicates that region 1 cannot code by itself for a viral protein but would have to be spliced to a 5' leader segment.
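The frame bookkeeping stated in the legend to Fig. 4 follows directly from the genome length. As a minimal sketch, with frames labeled 1 to 3 by the position of their first nucleotide and with hypothetical function names, one can step codon by codon past the end of the numbering and see which frame label the reading lands in:

```python
GENOME_LENGTH = 3308  # WHV genome length reported in the text


def frame_of(position: int) -> int:
    """Reading frame (1, 2, or 3) of a codon starting at 1-based `position`."""
    return (position - 1) % 3 + 1


def frame_after_wrap(frame: int, genome_length: int = GENOME_LENGTH) -> int:
    """Frame label carried by a reading in `frame` after it crosses the
    origin of numbering once on a circular genome."""
    position = frame
    # Advance one codon at a time until the codon start lies past the last nucleotide.
    while position <= genome_length:
        position += 3
    return frame_of(position - genome_length)


for f in (1, 2, 3):
    print(f, "->", frame_after_wrap(f))
# 1 -> 2
# 2 -> 3
# 3 -> 1
```

Because 3,308 leaves a remainder of 2 on division by 3, each frame shifts by one label at the EcoRI site. A genome whose length is a multiple of 3 would keep the same label.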

Nucleotide and amino acid sequence comparison. Using the computer program NSEQFITN, developed by Staden (17), the nucleotide sequences of the HBV and WHV genomes were compared. The two sequences were largely homologous, in the range of 62 to 70% all along the genome, except in two regions: one centered around the EcoRI site, where most of the differences in length were encountered, and one between regions 6 and 8, corresponding to a part of region 5 (Fig. 3). Although there were no amino acid data on the WHV proteins that could be used to locate the genes, as has been done for HBsAg, the very similar open reading frames and the large degree of sequence homology shared by the two viruses are suitable arguments for localizing the gene for a homologous protein in a homologous open region and comparing them at the nucleotide and amino acid levels.

Region 7. The open reading frame of region 7 went from nucleotide 2,983 through the EcoRI site up to TAA 962 and more probably gave rise


by translation from ATG 296 to the viral envelope protein (Galibert et al., in press) called the WHsAg protein. As in the case of the HBV genome, the ATG corresponding to the N-terminal methionine was preceded by a large open reading frame, called the pre-S region, in which there were two ATG triplets. Region 7 of WHV could then code for proteins of 426 and/or 282 and 222 amino acids, and region 7 of HBV could code for proteins of 389 and/or 281 and 226 amino acids. This is more than a coincidence and suggests that the pre-S region codes for some proteins. Although a high degree of homology was found in nearly all parts of the two genomes, it is remarkable that one of the two regions of large divergence started around nucleotide 2,980 (WHV) and went up to nucleotide 215, almost exactly corresponding to the pre-S region.

As a result of this large difference at the DNA level, little sequence homology at the amino acid level was observed. However, both putative amino acid sequences revealed the presence of the sequence Asn-X-Thr, known as a glycosylation site. More interestingly, these sequences occupied a similar position. They were encountered at -58 from the WHsAg N-terminal methionine and at -52 from the HBsAg N-terminal methionine. This putative glycosylation site was in both cases located very close to the second methionine of the pre-S region. Some other common features of this pre-S region are depicted in Fig. 6, and more can be seen in Fig. 3. Translation of the pre-S region of the two viruses could give a protein with a large amount of proline residues (26 for the WHV protein and 20 for the HBV protein) and no cysteine residue, which might indicate that this part of the protein could be more like a random coil with a low amount of secondary structure. The significance of the amino acid sequence coded by the pre-S region is not clear. Does it represent the extra sequence of an envelope protein precursor, or is it part of a minor component of the viral coat? Its poorly conserved status during evolution, whereas in the meantime the S region is well conserved (Galibert et al., in press), suggests that the pre-S region codes for the extra sequence of a precursor molecule (15).
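A search for the Asn-X-Thr motif can be illustrated as follows. This is a sketch only. The two peptide fragments are short stretches that begin at the second pre-S methionine, transcribed into one-letter code from the translations printed in Fig. 3 as they appear in this copy, so they should be read as illustrative rather than authoritative. The function name is hypothetical.

```python
import re


def asn_x_thr_sites(peptide: str) -> list[int]:
    """1-based positions of the Asn residue in every Asn-X-Thr tripeptide.
    `peptide` is given in one-letter amino acid code; X is any residue."""
    # A lookahead is used so that overlapping motifs would all be reported.
    return [m.start() + 1 for m in re.finditer(r"(?=N.T)", peptide.upper())]


# Fragments starting at the second pre-S methionine (assumed transcription from Fig. 3).
WHV_FRAGMENT = "MKNQTFHLQGFV"  # Met-Lys-Asn-Gln-Thr-Phe-His-...
HBV_FRAGMENT = "MQWNSTTFH"     # Met-Gln-Trp-Asn-Ser-Thr-Thr-Phe-His

print(asn_x_thr_sites(WHV_FRAGMENT))  # [3]
print(asn_x_thr_sites(HBV_FRAGMENT))  # [4]
```

In both fragments the motif lies within a few residues of the methionine, which is the proximity the text describes.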

Region 6. In both viruses, region 6 covered nearly 80% of the genome. From the first ATG (2,427) up to the stop codon TGA 1,756, proteins of 879 and 838 amino acids could be predicted for WHV and HBV, respectively. The use of the second ATG as an initiator triplet is less probable, since it would reduce the size of the protein to 371 and 400 amino acids, respectively.
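The protein sizes quoted for regions 6 and 7 can be reproduced from the coordinates with modular arithmetic on the circular map. The sketch below assumes that a stop codon is numbered by its first nucleotide. That convention is not stated in the text, but it is the one that reproduces the 879 codons quoted for WHV region 6. The start positions it derives for region 7 are calculated values, not figures given by the authors, and the function names are hypothetical.

```python
WHV_LENGTH = 3308  # nucleotides, from the text


def orf_codons(start: int, stop: int, genome_length: int) -> int:
    """Number of sense codons from a start codon at 1-based `start` up to,
    but not including, a stop codon whose first nucleotide is at `stop`,
    reading around a circular genome if necessary."""
    span = (stop - start) % genome_length
    if span % 3 != 0:
        raise ValueError("start and stop codons are not in the same frame")
    return span // 3


def start_for(codons: int, stop: int, genome_length: int) -> int:
    """1-based position of the start codon that yields `codons` sense codons
    before a stop codon beginning at `stop`."""
    return (stop - 3 * codons - 1) % genome_length + 1


# Region 6: first ATG at 2,427, stop codon TGA at 1,756
print(orf_codons(2427, 1756, WHV_LENGTH))  # 879

# Region 7: stop codon TAA at 962; start positions implied by the quoted sizes
for size in (426, 282, 222):
    print(size, start_for(size, 962, WHV_LENGTH))
# 426 2992
# 282 116
# 222 296
```

The start implied for the 426-residue product, 2,992, lies just inside the open frame of region 7, which begins at 2,983.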

FIG. 5. Localization of the open reading frames on the viral genome and comparison with the HBV genome (9). The striped area in region 7 corresponds to the pre-S sequence. Arrows indicate the position of the first ATG found within an open reading frame.

[Fig. 6 schematic not reproduced. Labels recoverable from the figure: N terminus, WHV Met Gly Asn Asn Ile . . Thr . Asn Pro Pro Pro Pro, HBV Met Gly Gln Asn Leu . Thr . Asn Pro Pro Pro Pro; a proline-rich stretch, Pro . Pro ... Pro Pro . Pro .. Pro Pro; and, near the second methionine, WHV Met . . Asn Gln Thr Phe His Thr Thr Thr Met (WHs), HBV Met . . Asn Ser Thr Phe His Ser Ser Ser Met (HBs).]

FIG. 6. Some common characteristic features of the putative proteins made from the pre-S region of both viruses.

Region 6, like region 7, overlapped the EcoRI site. Therefore, the percentage of homology varied greatly all along the DNA and amino acid sequences. From the first methionine up to amino acid 23 of the WHV protein, the sequences of the two proteins were quite different. However, it is noticeable that the DNA sequences were better conserved than the amino acid sequences. Starting with ATG 2,427, 37 of the first 54 nucleotides were identical, whereas this was true for only 4 of 18 amino acids. This divergence in the percentage of homology between the amino acid and nucleotide sequences probably indicates that the proteins coded by the overlapping gene 8 could not evolve as much as the protein coded by gene 6. It also indicates that, by virtue of the code degeneracy, a sequence of nucleotides which codes in two different frames for two different proteins can evolve in such a way that one protein is well conserved whereas the other is less so.

The two sequences then became much more similar, and 90 amino acids of 162 were identical. After this, the sequences again diverged completely at the DNA and amino acid levels. As previously noted, the difference in size between the two virus genomes was mainly due to differences located in this region, which could code for 183 and 140 amino acids. Finally, the sequences again became similar at the nucleotide and amino acid levels. Of the 515 remaining amino acids, 309 were identical.
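To make the variation along region 6 easier to compare, the identity counts quoted in the text can be converted to percentages. This is a simple illustrative calculation; the segment labels are descriptive shorthand introduced here, not terms used in the original.

```python
def percent_identity(identical, compared):
    """Percentage of identical positions among the positions compared."""
    return 100.0 * identical / compared


# (segment, identical, compared), counts taken from the text on region 6
segments = [
    ("first 54 nucleotides from ATG 2,427", 37, 54),
    ("first 18 amino acids", 4, 18),
    ("second stretch, amino acids", 90, 162),
    ("carboxy-terminal stretch, amino acids", 309, 515),
]

for name, identical, compared in segments:
    print(f"{name}: {percent_identity(identical, compared):.1f}%")
```

The amino-terminal stretch stands out: about 68.5% nucleotide identity but only about 22% amino acid identity, the pattern the text attributes to constraint from the overlapping gene read in another frame.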

Although there is no direct argument indicating that region 6 codes for a viral protein, the size of such a nucleotide sequence devoid in one reading frame of a stop codon leaves little doubt about the coding function of this region. In this respect, it has been suggested that the DNA polymerase found within the virion could be coded by gene 6 (8). If this is true, the existence of a large piece of DNA centered around the EcoRI site and devoid of homology between the two viruses is puzzling. How is it that this piece of DNA has evolved so rapidly in sequence and size without affecting the biological property of the corresponding protein? In the case of gene 7, because of its location, we have suggested that this piece of DNA could code for the extra sequence of a precursor molecule. Another explanation could be suggested for gene 6, namely, the existence of an intron sequence eliminating a region which therefore can evolve more rapidly. However, no sequence resembling the consensus sequence (often found at the border of many intron sequences) can be found in the HBV and WHV sequences in support of this hypothesis (1, 3, 10).

Region 8. Region 8 was open from nucleotides 1,910 to 2,585, and it corresponded in position, length, and sequence to region 8 of the HBV genome, which codes for the HBc protein (8, 14). Several ATGs were observed at the beginning of this open reading frame, and thus the initiator triplet was hard to locate. From the first ATG common to the two sequences, located at positions 1,931 and 1,816, respectively, a protein of 225 and 219 amino acids could be made. This is slightly above the value which can be calculated from the molecular weight of the HBc protein as measured by gel electrophoresis. Translation from the second common ATG, located at positions 2,021 and 1,903, therefore seems more probable and would also agree with the results of Valenzuela et al. (20), who found a TGA stop codon instead of a TGG two triplets upstream of this second ATG.

Upon comparison, the nucleotide sequences downstream of ATG 2,021 and 1,903 and the predicted amino acid sequences were well preserved in the two viruses. This finding is in agreement with the results of Werner et al. (21), who have demonstrated by an immunodiffusion test that HBc and WHc proteins have common antigenic determinants.

Table 1 shows characteristic features of the genes and their products. Due to the large number of silent mutations, the homology observed between the two amino acid sequences was higher than that observed between the nucleotide sequences. An opposite result was obtained with gene S coding for the surface antigen (Galibert et al., in press). This was probably due to the fact that the sequence coding for the core protein does not code for another protein, whereas the sequence coding for the surface protein codes for another protein, allowing less freedom in the evolution of the latter DNA sequence.

TABLE 1. Comparison of HBc and WHc genes and products

Gene   Length   Residues identical   Triplets mutated         Deleted    Inserted   Silent      Length of     Amino acids identical
                No.      %           Once    Twice   Thrice   triplets   triplets   mutations   polypeptide   No.      % homology
HBc    549      363      66          69      32      14       10         5          74          183           131      73
WHc    564                                                                                      188

Nucleotide changes: transversion, 122; transition, 64.
Amino acid changes: within identical group, 21; polar ↔ NP, 9; uncharged ↔ charged, 15; basic ↔ acid, 2.
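The entries of Table 1 are linked by simple arithmetic, which can be used as a consistency check. The sketch below is illustrative and rests on assumptions: the assignment of values to columns follows one reading of the printed table, and the denominator used for amino acid homology (aligned triplets, that is, polypeptide length minus unmatched triplets) is inferred rather than stated in the paper.

```python
# Values as read from Table 1; the column assignment is an assumption
# wherever the printed layout is ambiguous.
hbc_nt, whc_nt = 549, 564            # gene lengths (nucleotides)
hbc_aa, whc_aa = 183, 188            # polypeptide lengths (amino acids)
identical_nt = 363                   # identical residues
transversions, transitions = 122, 64
identical_aa = 131
deleted_triplets, inserted_triplets = 10, 5
aa_changes = {"identical group": 21, "polar/NP": 9,
              "uncharged/charged": 15, "basic/acid": 2}

# Assumed: homology is computed over triplets that can be aligned in both genes.
aligned_triplets = whc_aa - deleted_triplets

nt_identity = 100 * identical_nt / hbc_nt
aa_identity = 100 * identical_aa / aligned_triplets

checks = {
    "gene length = 3 x polypeptide length":
        hbc_nt == 3 * hbc_aa and whc_nt == 3 * whc_aa,
    "transversions + transitions = non-identical residues":
        transversions + transitions == hbc_nt - identical_nt,
    "same number of aligned triplets from either gene":
        whc_aa - deleted_triplets == hbc_aa - inserted_triplets,
    "classified amino acid changes = non-identical aligned positions":
        sum(aa_changes.values()) == aligned_triplets - identical_aa,
}

print(f"nucleotide identity: {nt_identity:.1f}%")
print(f"amino acid identity: {aa_identity:.1f}%")
for label, ok in checks.items():
    print(("ok   " if ok else "FAIL ") + label)
```

With this reading, nucleotide identity comes to about 66% and amino acid identity to about 73.6%, in line with the 66 and 73 listed in the table.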

As previously noted, the core protein exhibited a particular structure at its carboxy end, which is very rich in proline and arginine residues (8). Eleven proline residues out of the last 60 residues and 16 arginine residues out of the last 34 residues could be observed in the HBc and WHc proteins. The repeat Ser-Pro-Arg-Arg-Arg-Arg-Ser-Gln previously noted at the carboxy end of the HBc protein was also observed at the end of the WHc protein.

Region 5. Region 5 was open from nucleotide 1,467 up to stop codon TAA 1,926. The first ATG was found at position 1,503, giving a coding capacity for a protein of 141 amino acids. Similarly, in the HBV genome, there was an open reading frame from nucleotide 1,340 up to TAA 1,838 which could be translated from ATG 1,376 into a protein of 154 amino acids.
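The coordinates given for region 5 determine both the predicted protein lengths and the number of in-frame codons lying between the start of the open frame and the first ATG. The following sketch is an illustrative check, assuming that each coordinate is the first base of the corresponding triplet; the helper name is hypothetical.

```python
def codons_between(first_base, end_base):
    """Whole codons from first_base up to, but not including, end_base."""
    span = end_base - first_base
    if span < 0 or span % 3 != 0:
        raise ValueError("coordinates are not in the same reading frame")
    return span // 3


# Coordinates quoted in the text for region 5 (assumed first-base positions).
region5 = {
    "WHV": {"open_from": 1467, "atg": 1503, "stop": 1926},
    "HBV": {"open_from": 1340, "atg": 1376, "stop": 1838},
}

summary = {
    virus: {
        "codons_upstream_of_atg": codons_between(r["open_from"], r["atg"]),
        "protein_length": codons_between(r["atg"], r["stop"]),
    }
    for virus, r in region5.items()
}
print(summary)
```

Both genomes give 12 in-frame codons upstream of the ATG, and protein lengths of 141 (WHV) and 154 (HBV) amino acids, matching the values in the text.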

There is at present no direct argument indicating that region 5 corresponds to a particular viral gene. However, several indirect arguments suggest this quite strongly. (i) These two open reading frames occupied identical positions within the two genomes with respect to the other open regions and the nick of the L strand. (ii) They could give rise to two proteins of similar size. (iii) A third, more convincing argument is based on comparison of the amino acid sequences which could be deduced from the DNA sequences. Region 5 was read in frame 2 and was overlapped halfway down by region 6, which was read in frame 3. Translation in frame 3 from nucleotide 1,464 to nucleotide 1,551 gave an amino acid sequence sharing a large degree of homology with the corresponding region of the HBV genome: 11 of 13 amino acids were identical upstream of ATG 1,503, and 14 of 16 were identical downstream of the ATG. In contrast, translation in frame 2 gave a different result: upstream of ATG 1,503, only 1 of 12 amino acids was identical, and 16 of 16 were identical downstream of ATG 1,503 (Fig. 7). This drastic change in the conservation of the amino acid sequence indicates that another protein is made starting from ATG 1,503, in frame 2. It also indicates, as previously noted, that selective pressure on a particular gene does not greatly influence the coding capacity of the very same sequence read in another frame. Otherwise, the selective pressure exerted on gene 6 upstream and downstream of ATG 1,503 could likewise influence the translation in frame 3 upstream and downstream of this ATG.

Gene 5

WHV, frame 2       Phe Arg Gly Val Val Gly Glu Ala Asp Val Leu Ser Met Ala
WHV, frame 3   Asn Ser Val Val Leu Ser Gly Lys Leu Thr Ser Phe Pro Trp Leu
WHV            TAATTCCGTGGTGTTGTCGGGGAAGCTGACGTCCTTTCCATGGCTG
HBV            TAACTCTGTTGTCCTATCCCGCAAATATACATCGTTTCCATGGCTG
HBV, frame 3    o   o   o   o   o   o  Arg  o  Tyr  o   o   o   o   o   o
HBV, frame 2       Leu Cys Cys Pro Ile Pro Gln Ile Tyr Ile Val  o   o   o

WHV, frame 2   Ala Arg Leu Cys Cys Gln Leu Asp Pro Ala Arg Asp Val Leu
WHV, frame 3   Leu Ala Cys Val Ala Asn Trp Ile Leu Arg Gly Thr Ser Phe
WHV            CTGGCCTGTGTTGCCAACTGGATCCTGCGCGGGACGTCCTTC
HBV            CTAGGCTGTGCTGCCAACTGGATCCTGCGCGGGACGTCCTTT
HBV, frame 3    o  Gly  o  Ala  o   o   o   o   o   o   o   o   o   o
HBV, frame 2    o   o   o   o   o   o   o   o   o   o   o   o   o   o

FIG. 7. Nucleotide sequence of the two viruses around the first ATG of region 5. Translation in frames 2 (region 5) and 3 (region 6) from the WHV sequence is compared with the corresponding sequences found in the HBV genome; o, amino acid identical to that of WHV. Translation in frame 2 downstream of the ATG gives a protein which closely resembles the homologous protein made from the HBV genome.

FIG. 8. Comparison of the organization of the two genomes showing that only one region, located between regions 5 and 8, is devoid of coding function.
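The frame comparison summarized in Fig. 7 can be reproduced from the first block of nucleotide sequence shown there. The sketch below is illustrative: it uses the standard genetic code, takes the two 46-nucleotide segments as printed in the figure (spacing removed), and counts identities only for sense codons that begin upstream of the ATG. Treating offset 0 of the segment as frame 2 and offset 1 as frame 3 is an assumption based on the translations printed in the figure; amino acids are given in one-letter code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (* = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Segments as printed in the first block of Fig. 7.
WHV = "TAATTCCGTGGTGTTGTCGGGGAAGCTGACGTCCTTTCCATGGCTG"
HBV = "TAACTCTGTTGTCCTATCCCGCAAATATACATCGTTTCCATGGCTG"

ATG_INDEX = WHV.find("ATG")  # 0-based index of ATG 1,503 within the segment
assert HBV.find("ATG") == ATG_INDEX


def translate(seq, offset):
    """Translate every complete codon of seq, starting at the given offset."""
    return [CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3)]


def upstream_identity(offset):
    """(identical, compared) over sense codons that start upstream of the ATG."""
    identical = compared = 0
    pairs = zip(translate(WHV, offset), translate(HBV, offset))
    for k, (a, b) in enumerate(pairs):
        if offset + 3 * k >= ATG_INDEX or "*" in (a, b):
            continue  # skip codons at or past the ATG, and stop codons
        compared += 1
        identical += a == b
    return identical, compared


frame2 = upstream_identity(0)  # region 5 frame
frame3 = upstream_identity(1)  # region 6 frame
print("frame 2:", frame2)  # frame 2: (1, 12)
print("frame 3:", frame3)  # frame 3: (11, 13)
```

Under these conventions the counts match the text: 1 of 12 residues identical upstream of the ATG in frame 2, against 11 of 13 in frame 3.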

Comparison of the nucleotide sequence of gene 5 and of its corresponding protein showed that they were less conserved than the sequences of other genes. After the first stretch of amino acids, which was identical in both proteins and was used as an argument to demonstrate the existence of gene 5, the two proteins became very different, and several deletions or insertions must be postulated in the central part of the molecule to align 43 amino acids out of 115. However, a potentially identical secondary structure could be predicted from the identical positions of several amino acids such as Pro, Cys, and Gly. Finally, near the carboxy end, which seemed to accumulate acidic and basic amino acids, an identical nonapeptide (Phe-Val-Leu-Gly-Gly-Cys-Arg-His-Lys) was encountered.

Region 1. Region 1 lay on the other DNA strand and, as we observed previously in the HBV sequence (8), was devoid of a potential initiator codon. Comparison of the nucleotide sequences of region 1 of the two viruses showed that they were completely different. As a matter of fact, they were not located exactly at the same position within the two genomes but about 200 nucleotides apart. Therefore, it may be concluded that they exist by chance and do not correspond to a particular viral gene. It can also be inferred that there are only four viral genes, all located on the L strand.

FIG. 9. Two possible hairpin structures located in the only region devoid of coding function.

Location of the nick and the origin of replication. Figure 8 displays the various open reading frames from their putative initiator ATG triplets to their stop codons. As already noted, they overlapped each other without leaving any gap except between regions 5 and 8, where there was a small sequence, devoid of coding function, in both viruses. Interestingly, comparison of these two nucleotide sequences indicated they were very well conserved even though they have no coding function. These nucleotide sequences also exhibited a palindromic structure able to form a hairpin structure with quite stable energy (Fig. 9).
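For readers unfamiliar with how a palindromic (inverted-repeat) sequence yields a hairpin, the following sketch counts the consecutive Watson-Crick base pairs that can form outward from a chosen loop. It is a generic illustration only: the test sequence is a made-up toy example, not the WHV or HBV sequence of Fig. 9, the function names are hypothetical, and G·T pairs and folding energies are not considered.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hairpin_stem_length(seq, loop_start, loop_end):
    """Number of consecutive Watson-Crick pairs flanking the loop seq[loop_start:loop_end]."""
    left, right, pairs = loop_start - 1, loop_end, 0
    while left >= 0 and right < len(seq) and COMPLEMENT[seq[left]] == seq[right]:
        pairs += 1
        left -= 1
        right += 1
    return pairs


# Toy example (hypothetical): an 8-base arm, a 5-base loop, and the
# reverse complement of the arm.
arm, loop = "GAACCTCC", "GTTTC"
toy = arm + loop + reverse_complement(arm)
print(hairpin_stem_length(toy, len(arm), len(arm) + len(loop)))  # 8
```

A sequence whose two arms are exact reverse complements gives a stem as long as the arm; mismatches shorten the run counted from the loop outward.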

According to the location of the L-strand nick, as determined by gel electrophoresis (5, 6) for both viruses, and the length of the genome, one may calculate that the nick could be located around nucleotide 1,846 in the HBV sequence and around nucleotide 2,011 in the WHV genome, that is, precisely within the only region devoid of coding function. These four results, i.e., absence of coding function, highly conserved DNA sequence, hairpin structure, and position of the nick, are all consistent with the location of the origin of replication of the viral DNA in this particular sequence.
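The statement that the calculated nick positions fall inside the noncoding gap can be verified from coordinates given earlier in the text. The sketch below is illustrative and assumes that the gap runs from the base after the region 5 stop codon (TAA 1,926 in WHV, TAA 1,838 in HBV, taken as first-base positions) to the putative initiator ATG of region 8 (2,021 in WHV, 1,903 in HBV).

```python
# Coordinates quoted in the text; the gap boundaries are an assumption (see above).
viruses = {
    "WHV": {"region5_stop": 1926, "region8_atg": 2021, "nick": 2011},
    "HBV": {"region5_stop": 1838, "region8_atg": 1903, "nick": 1846},
}


def nick_in_gap(region5_stop, region8_atg, nick):
    """True if the nick lies after the stop codon and before the next ATG."""
    gap_start = region5_stop + 3  # first base after the three-base stop codon
    return gap_start <= nick < region8_atg


result = {name: nick_in_gap(**coords) for name, coords in viruses.items()}
gap_length = {name: c["region8_atg"] - (c["region5_stop"] + 3)
              for name, c in viruses.items()}
print(result)      # {'WHV': True, 'HBV': True}
print(gap_length)  # {'WHV': 92, 'HBV': 62}
```

With these boundaries the noncoding gap is short in both genomes, and both calculated nick positions fall within it.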

ACKNOWLEDGMENTS

We are very grateful to J. Summers, who gave us the EcoWHV cloned DNA.

This work was supported by a grant from the Institut de la Santé et de la Recherche Médicale.

LITERATURE CITED

1. Breathnach, R., C. Benoist, K. O'Hare, F. Gannon, and P. Chambon. 1978. Ovalbumin gene: evidence for a leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proc. Natl. Acad. Sci. U.S.A. 75:4853-4857.

2. Burrell, C. J., P. Mackay, P. J. Greenaway, P. H. Hofschneider, and K. Murray. 1979. Expression in Escherichia coli of hepatitis B virus DNA cloned in plasmid pBR322. Nature (London) 279:43-47.

3. Catterall, J. F., W. O. O'Malley, M. A. Robertson, R. Staden, R. Tanaka, and G. G. Brownlee. 1978. Nucleotide sequence homology at 12 intron-exon junctions in the chick ovalbumin gene. Nature (London) 275:510-513.

4. Charnay, P., E. Mandart, A. Hampe, F. Fitoussi, P. Tiollais, and F. Galibert. 1979. Localization on the viral genome and nucleotide sequence of the gene coding for the two major polypeptides of the hepatitis B surface antigen (HBsAg). Nucleic Acids Res. 7:335-346.

5. Charnay, P., C. Pourcel, A. Louise, A. Fritsch, and P. Tiollais. 1979. Cloning in Escherichia coli and physical structure of hepatitis B virion DNA. Proc. Natl. Acad. Sci. U.S.A. 76:2222-2226.

6. Cumming, I. W., J. K. Browne, W. A. Salser, G. V. Tyler, R. L. Snyder, J. M. Smolec, and J. Summers. 1980. Isolation, characterization and comparison of recombinant DNAs derived from the human hepatitis B and woodchuck hepatitis virus genomes. Proc. Natl. Acad. Sci. U.S.A. 77:1842-1846.

7. Fritsch, A., C. Pourcel, P. Charnay, and P. Tiollais. 1978. Clonage du virus de l'hépatite B dans Escherichia coli. C.R. Acad. Sci. 287:1453-1546.

8. Galibert, F., E. Mandart, F. Fitoussi, P. Tiollais, and P. Charnay. 1979. Nucleotide sequence of the hepatitis B virus genome (subtype ayw) cloned in E. coli. Nature (London) 281:646-650.

9. Hartley, J. L., and J. E. Donelson. 1980. Nucleotide sequence of the yeast plasmid. Nature (London) 286:860-864.

10. Herisse, J., G. Courtois, and F. Galibert. 1980. Nucleotide sequence of the EcoRI D fragment of Adenovirus 2 genome. Nucleic Acids Res. 8:2173-2191.

11. Marion, P. L., L. S. Oshiro, D. C. Regnery, G. H. Scullard, and W. S. Robinson. 1980. A virus in Beechey ground squirrels that is related to hepatitis B virus of humans. Proc. Natl. Acad. Sci. U.S.A. 77:2941-2945.

12. Mason, S. W., G. Seal, and J. Summers. 1980. Virus of Pekin ducks with structural and biological relatedness to human hepatitis B virus. J. Virol. 36:829-836.

13. Maxam, A., and W. Gilbert. 1980. Sequencing end labeled DNA with base specific chemical cleavage. Methods Enzymol. 65:499-560.

14. Pasek, M., T. Goto, W. Gilbert, B. Zink, H. Schaller, P. McKay, G. Leadbetter, and K. Murray. 1979. Hepatitis B virus genes and their expression in E. coli. Nature 282:575-579.

15. Perler, F., A. Efstratiadis, P. Lomedico, W. Gilbert, R. Kolodner, and J. Dodgson. 1980. The evolution of genes: the chicken preproinsulin gene. Cell 20:555-566.

16. Sninski, J. J., A. Siddiqui, W. S. Robinson, and S. N. Cohen. 1979. Cloning and endonuclease mapping of the hepatitis B viral genome. Nature (London) 279:346-348.

17. Staden, R. 1977. Sequence data handling by computer. Nucleic Acids Res. 4:4037-4051.

18. Summers, J., J. M. Smolec, and R. Snyder. 1978. A virus similar to human hepatitis B virus associated with hepatitis and hepatoma in woodchucks. Proc. Natl. Acad. Sci. U.S.A. 75:4533-4537.

19. Valenzuela, P., P. Gray, M. Quiroga, J. Zaldivar, H. M. Goodman, and W. J. Rutter. 1979. Nucleotide sequence of the gene coding for the major protein of hepatitis B virus surface antigen. Nature (London) 280:815-819.

20. Valenzuela, P., M. Quiroga, J. Zaldivar, P. Gray, and W. J. Rutter. 1981. The nucleotide sequence of the hepatitis B viral genome and the identification of the major viral genes. In B. Fields, R. Jaenisch, and C. F. Fox (ed.), Animal virus genetics. Academic Press, Inc., New York.

21. Werner, B. G., J. M. Smolec, R. Snyder, and J. Summers. 1979. Serological relationship of woodchuck hepatitis virus to human hepatitis B virus. J. Virol. 32:314-322.
