1.2 Scale-up and large scale processing
1.3.2 Proteins
In the semi-permeable catalytic bag cell model, the most important aspect is clearly the catalysts, since they drive all the reactions taking place within the bag. Within a real cell system, these catalysts are called proteins. Proteins fulfil a wide range of functions within a cell, from structural roles, to catalysis, to storage and control of the flow of materials into, around and out of the cell.
At its most basic level, a protein is a polymer of individual building blocks called amino acids. Amino acids are a class of chemical molecules that vary greatly in their individual chemical properties, but all contain an amino (−NH2) and a carboxylic acid (−COOH) functional group, hence the name. The general chemical structure of an amino acid is H2N−CHR′−COOH, where the R′ group can be one of 21 commonly occurring side chains. The carbon that carries the R′ group is referred to as the α carbon, and this is also a chiral centre. In living organisms, the amino acids incorporated into proteins are L-form enantiomers; this is an important feature for the structural properties of proteins. There are other, rarer side chains, but these will not be discussed here. A complete table of the most commonly used amino acids is given in figure 1.4 (p. 33). Each of these amino acids has a single-letter code, as shown in figure 1.4, which is used as shorthand when describing the sequence of amino acids in a protein.
1.3. KEY BIOLOGICAL PRINCIPLES 33
Figure 1.4: A graphical representation of the 21 most frequently occurring amino acids, grouped by their general features. All amino acids presented here are shown with charge states based on pKa values at physiological pH (7.4). This figure was produced by Cojocari (2016), and is freely available for reuse under the Creative Commons licence, via Wikimedia Commons.
(polymer) sequentially through a series of amide (or peptide) bonds; this is where the carboxylic acid group reacts with the amino group in a condensation reaction, to form a covalent linkage between the two monomers. As the polymer grows, the protein has a regular repeating chain referred to as the protein ‘backbone’, with the R groups spanning out from it. Due to physical space requirements (also referred to as steric hindrance), these groups usually arrange themselves on alternating sides of the backbone in the absence of external forces.
Chirality describes the three-dimensional arrangement of the atoms or groups in a molecule around a central point, which is typically a carbon atom as it readily forms 4 bonds with tetrahedral geometry. Around each of these ‘chiral carbon’ centres there are 2 distinct configurations that the surrounding atoms can take, and no amount of rotation can convert one arrangement into the other. For a carbon to be chiral, it must be bonded to 4 different atoms or groups of atoms. The two forms are called enantiomers, and are referred to as L and D, signifying the left- or right-handedness of the rotational direction of the atomic ordering.
All amino acids have uniform L-chirality, with the exception of glycine, which lacks a chiral carbon as the R′ group is H. Each amino acid also has two chemically distinct ends, the amino group and the acid group. As a result, the order in which the subunits are joined has an effect on the structural properties of the protein. When joining 2 amino acids, A + H ≠ H + A; as a result, a protein sequence is always read beginning from the amino side and ending at the acid side by convention. It should be noted that it is possible to form palindromic sequences that ignore this directionality (the peptide ‘RACECAR’, for example), and whilst these are generally rare, they can cause problems with proteomic data processing in some cases (discussed later). Polymers of amino acid residues are referred to as peptides when they are ‘short’ and proteins when they are longer, although the cut-off between the two is fairly arbitrary. In this thesis, a peptide will refer to either any polypeptide molecule that is less than 4000 Da (generally fewer than 36 amino acids long, on average), or any protein that has had its backbone severed.
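The directionality, palindrome and mass cut-off points above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the table holds standard monoisotopic residue masses for the 20 standard amino acids only, and the peptide is assumed to be linear and unmodified.

```python
# Monoisotopic residue masses in Da (amino acid minus one water).
# Assumption: only the 20 standard residues are covered here.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is retained across the two free termini


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of a linear peptide written amino end to acid end."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def is_palindromic(sequence: str) -> bool:
    """True if the sequence reads the same in both directions."""
    return sequence == sequence[::-1]


def is_peptide(sequence: str, cutoff_da: float = 4000.0) -> bool:
    """Apply the 4000 Da peptide/protein cut-off used in this thesis."""
    return peptide_mass(sequence) < cutoff_da


if __name__ == "__main__":
    # 'AH' and 'HA' are different molecules but have identical mass.
    print("AH" != "HA", peptide_mass("AH") == peptide_mass("HA"))
    print(is_palindromic("RACECAR"), round(peptide_mass("RACECAR"), 3))
```

Because mass depends only on composition, 'AH' and 'HA' cannot be told apart by intact mass alone, and a palindromic sequence reads identically from either terminus; both are reasons why sequence direction has to be tracked explicitly in data processing.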
Shapes and structures are vitally important in biology, and are the driving force for the specificity of reactions and the catalytic activity of proteins. Protein structure is controlled at 4 levels to produce the final functional conformation. These are referred to as primary, secondary, tertiary and quaternary structure. The precise ordering and type of amino acid residues that make up the peptide backbone is referred to as the ‘primary structure’. This primary structure is fundamentally important for the structural properties of the protein, as it controls the order of the different R′ groups and is generally fixed at the point when the protein is created. A major exception to this is cysteine. When 2 of these sulphur-containing amino acid residues are spatially close together, a spontaneous oxidation reaction occurs, resulting in a covalent bond forming between the two. This reaction can happen either between 2 cysteine residues in the same protein, forming a loop in the primary protein structure, or between 2 separate protein molecules, forming a quaternary covalent bridge between different proteins. It is also worth noting that the amino acid proline has a significant effect on the structure of proteins and peptides. Proline is the only one of the 21 common amino acids that contains a secondary amine instead of a primary amine. This occurs because the R′ group is bonded back to the amine, and so the regular repeating structure of the peptide backbone is kinked at an unusual angle due to steric hindrance. This kink is also pronounced in short peptides that would otherwise be linear.
‘Secondary’ protein structure consists of local structures called α-helices and β-sheets. These regular repeating structures are stabilised by hydrogen bonding between the peptide bonds in the backbone, and are observed repeatedly across all proteins. Glycine and proline residues are referred to as ‘helix breaking’, and can disrupt these structures. As proline has a secondary amine, its backbone nitrogen carries no hydrogen once it is part of a peptide bond and therefore cannot donate to the hydrogen bonding motif, whilst in the case of glycine, as the R′ group is simply a hydrogen atom, the residue is non-chiral and too unconstrained to contribute to a regular repeating pattern. ‘Tertiary’ protein structure refers to the global packing of the whole protein. This structure is stabilised by a range of different forces:
• Hydrophobic interactions – a form of packing that takes place where water is present in the environment. This is partially secured by van der Waals forces, although it is largely an expression of the hydrogen bonding of water excluding hydrophobic residues. These forces are supported by hydrophobic amino acids such as tryptophan.
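• Hydrogen bonding – a dipole interaction in which a hydrogen attached to an electronegative atom (such as oxygen or nitrogen) is attracted to a neighbouring electronegative atom; it is responsible for the interactions between water and other polar or charged species. Hydrogen bonds are responsible for stabilising the majority of biological interactions and can form on any amino acid residue that is not hydrophobic.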
• Salt bridges – an electrostatic attraction between permanent charges. A salt bridge forms between 2 residues with opposite charge signs, such as lysine and aspartic acid. Salt bridges are affected by the surrounding pH, as raising or lowering the pH past a pKa value determines whether an R′ group is charged or not.
• Disulphide bonds – as described above, a covalent bond that forms spontaneously in oxidising conditions. These bonds can be broken through exposure to a reducing environment; however, unless the cysteine residues responsible are chemically blocked, typically through alkylation, the bonds will spontaneously reform when exposed to an oxidising environment again.
• Steric hindrance – the forces associated with stopping two atoms occupying the same physical space. Also referred to as Pauli exclusion, this occurs when the negatively charged electron clouds of two atoms attempt to occupy the same physical space.
These features are a direct result of the different amino acid residues that make up the protein and how they pack together. Whilst this process is currently too complex for a protein structure to be accurately predicted from its primary sequence alone, two proteins with the same primary sequence will fold to produce exactly the same terminal structure, and so controlling the primary sequence of the protein gives control over the final three-dimensional structure. Finally, ‘quaternary’ structure is stabilised by all the same features as tertiary structure, but it refers to interactions between 2 or more distinct protein chains. These can be identical subunits or different proteins; the important part of this interaction is that separate primary protein chains are interacting to produce a functional final product.
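The pH dependence of salt bridges described in the list of forces can be sketched with the Henderson–Hasselbalch relationship. This is a rough illustration only: the pKa values are approximate textbook figures for free amino acid side chains (an assumption; values shift inside a folded protein), and the function names and the 0.5 threshold are hypothetical choices.

```python
# Approximate side-chain pKa values (assumed textbook figures) and whether
# the group is charged when deprotonated ("acid") or protonated ("base").
APPROX_SIDE_CHAIN_PKA = {
    "D": (3.9, "acid"),   # aspartic acid
    "E": (4.1, "acid"),   # glutamic acid
    "H": (6.0, "base"),   # histidine
    "K": (10.5, "base"),  # lysine
    "R": (12.5, "base"),  # arginine
}


def fraction_charged(residue: str, ph: float) -> float:
    """Fraction of side chains carrying a charge at the given pH."""
    pka, kind = APPROX_SIDE_CHAIN_PKA[residue]
    if kind == "acid":
        # Acids are charged (negative) once deprotonated, i.e. above their pKa.
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    # Bases are charged (positive) while protonated, i.e. below their pKa.
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def salt_bridge_likely(res_a: str, res_b: str, ph: float,
                       threshold: float = 0.5) -> bool:
    """True if the residues have opposite charge types and both are mostly charged."""
    kind_a = APPROX_SIDE_CHAIN_PKA[res_a][1]
    kind_b = APPROX_SIDE_CHAIN_PKA[res_b][1]
    if kind_a == kind_b:
        return False
    return (fraction_charged(res_a, ph) > threshold
            and fraction_charged(res_b, ph) > threshold)


if __name__ == "__main__":
    for ph in (2.0, 7.4, 12.0):
        print(ph, salt_bridge_likely("K", "D", ph))
```

Under these assumed pKa values, lysine and aspartic acid are both almost fully charged at physiological pH, whereas at pH 2 the aspartic acid is mostly neutral and at pH 12 the lysine is mostly neutral, so the pairing is lost at either extreme.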
Proteins can also be modified after their creation, in a process called post-translational modification (PTM). The most common PTM is the addition of a phosphate (Pi, or a mix of HPO₄²⁻ and H₂PO₄⁻) onto an alcohol-containing residue, such as serine, threonine or tyrosine. This addition is performed by proteins known as kinases, and is reversed by proteins known as phosphatases. This can have a number of effects on a protein. Firstly, adding a large charged group to the residue makes it charged, enabling it to participate in hydrogen bonding and salt bridge formation. Adding phosphate can disrupt entire hydrophobic sections of a protein and result in dramatic reshuffling of the protein structure. As a result, phosphorylation is the most frequently used method for feedback and cascade control in living organisms, and so phosphorus levels are typically a limiting factor for biological growth. Beyond the charge aspect, the modification is relatively large, particularly for a residue like serine, and so addition of phosphate can physically change the shape of a protein by generating a steric hindrance effect. Both of these effects can occur to differing degrees, and either can enable or prevent participation of the modified protein in quaternary interactions by giving it a different shape. Dozens of other PTMs are known, including glycosylation, the addition of sugars onto proteins; ubiquitination, which attaches smaller proteins to a protein to signal for its destruction; and alkylation, where hydrophobic hydrocarbons are added to charged residues, blocking their activity (such as cysteine forming disulphide bridges) or burying them within the hydrophobic core of the protein.
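As a small illustration of the phosphorylation described above, the sketch below lists the residues in a sequence that carry an alcohol group and the mass each phosphate would add. The function names and the example sequence are hypothetical; the mass shift is the standard monoisotopic value for a net gain of HPO3, and the sketch says nothing about whether a given site is actually modified in a real protein.

```python
from typing import List, Tuple

# Net monoisotopic mass gained per phosphorylation (addition of HPO3), in Da.
PHOSPHO_SHIFT_DA = 79.96633

# Alcohol-containing residues named in the text: serine, threonine, tyrosine.
PHOSPHO_ACCEPTORS = "STY"


def candidate_phosphosites(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, residue) for every S, T or Y in the sequence."""
    return [(i + 1, aa) for i, aa in enumerate(sequence)
            if aa in PHOSPHO_ACCEPTORS]


def phospho_mass_shift(n_sites: int) -> float:
    """Total mass added when n_sites residues are phosphorylated."""
    return n_sites * PHOSPHO_SHIFT_DA


if __name__ == "__main__":
    sites = candidate_phosphosites("MSTAYK")  # hypothetical example sequence
    print(sites)
    print(round(phospho_mass_shift(len(sites)), 3))
```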
In the same way that the structures of proteins are stabilised by these different forces, the arrangement of charges and the flexibility of protein molecules are also what provide their catalytic properties. For example, some proteins have ‘pockets’ that molecules of a certain shape fit into but others are excluded from, and buried within these protein pockets are arrangements of charges that, due to their close proximity to the molecules held within the pockets, can have strong catalytic activity. A good example of a protein like this is the protease trypsin. Trypsin has pockets that only long-chain, positively charged residues can fit into; when one binds to the pocket, it triggers a catalytic reaction that results in cleavage of a peptide bond. The resulting effect is that trypsin will selectively cleave proteins at regions where arginine and lysine are present, but not elsewhere. This is referred to in proteomics as site-specific cleavage. Cleavage is a very important topic in proteomics, and will be discussed at length in a later section in this chapter.
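The site-specific cleavage described for trypsin can be sketched as a simple in silico digestion. This is a minimal sketch rather than a full model of the enzyme: the function name is hypothetical, cleavage is assumed to occur on the acid side of every lysine (K) or arginine (R), and the optional flag implements the commonly used rule of no cleavage before proline, which is not stated in the text above and is included only as a labelled assumption.

```python
import re
from typing import List


def tryptic_digest(sequence: str, skip_before_proline: bool = False) -> List[str]:
    """Split a protein sequence after every K or R.

    skip_before_proline: assumption not taken from the text; when True,
    a K or R that is immediately followed by P is not cleaved.
    """
    # Zero-width pattern: the position just after K or R.
    pattern = r"(?<=[KR])(?!P)" if skip_before_proline else r"(?<=[KR])"
    # Drop the empty string produced when the sequence ends in K or R.
    return [peptide for peptide in re.split(pattern, sequence) if peptide]


if __name__ == "__main__":
    protein = "MKWVTFISLLRPSAK"  # hypothetical example sequence
    print(tryptic_digest(protein))
    print(tryptic_digest(protein, skip_before_proline=True))
```

Splitting on a zero-width match keeps the K or R at the end of each resulting peptide, which mirrors the way the cleaved residue remains on the peptide that bound the pocket.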
Whilst there is a fairly good understanding of how different properties contribute to the structure, even if the structures of all proteins were known it would still be impossible to know exactly what function a protein has from its sequence or structure alone. The most effective method for determining what a protein does is to measure that protein directly, referred to here as a de novo investigation. This can be done through purification of the protein and direct analysis. However, while some proteins can be removed from the cellular environment and still retain their function (in vitro analysis, ‘in glass’), others behave differently once removed from a living environment and must be studied in place (in vivo analysis, ‘in life’).
In practice, beyond de novo investigation into the function of the protein, the accumulated knowledge of pre-existing proteins of a similar sequence is used to infer what newly discovered proteins do. Whilst this practice is not perfect, as a similar protein sequence does not necessarily mean a similar protein function, it provides a lot more utility to proteomic data generated in organisms that have not been studied as extensively.
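One simple way to quantify ‘similar sequence’ is percent identity between two sequences that have already been aligned. The sketch below is illustrative only: the function name is hypothetical, the inputs are assumed to be pre-aligned to equal length with ‘-’ marking gaps, and producing the alignment itself (normally done with a dedicated alignment tool) is outside its scope.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over the gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore any column in which either sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Hypothetical aligned fragments differing at one position.
    print(round(percent_identity("ACDEFG", "ACDEYG"), 1))
```

A high score from a measure like this suggests, but does not prove, shared function, which is the caveat raised in the paragraph above.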