Data analysis. - DNA − protein interactions

10. Data analysis.

10.1. Analysis of hydroxyl radical footprinting data.

10.1.1. Quantification and normalization of time-resolved footprints.

The output file of an electrophoresis run on the ALF Express DNA analyzer II (fluorescence intensity versus time for each lane on the gel, see Figure 11 in “Results”) was converted into ASCII format. Using the software OriginPro (OriginLab, Northampton, MA), a linear baseline was subtracted from each dataset, and a smaller file was created containing only the time region of the run to be analyzed. The Peak Fitting Module of OriginPro was used to fit the peaks in each lane to Lorentzian curves. Approximately 150 peaks were fit simultaneously, covering the whole region interacting with the polymerase. The peak areas within each lane were divided by the average area of several peaks within the same lane that were not protected by the polymerase in the course of the binding reaction. The normalized area values were then divided by the values for the corresponding peaks in the lane of DNA cleaved in the absence of protein. The resulting values represent the relative change compared with naked DNA, Ф (Figure 12 in “Results”).
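The two normalization steps described above can be sketched as follows. This is only an illustration: the function names, the peak areas and the choice of reference peaks are hypothetical, and it is assumed that the naked DNA lane is normalized to the same unprotected reference peaks as the sample lane.

```python
def normalize_lane(areas, reference_idx):
    """Divide every peak area in a lane by the mean area of the reference peaks."""
    ref_mean = sum(areas[i] for i in reference_idx) / len(reference_idx)
    return [a / ref_mean for a in areas]


def relative_change(sample_areas, naked_areas, reference_idx):
    """Phi per peak: normalized sample area divided by normalized naked-DNA area."""
    sample = normalize_lane(sample_areas, reference_idx)
    naked = normalize_lane(naked_areas, reference_idx)
    return [s / n for s, n in zip(sample, naked)]


# Hypothetical fitted peak areas (arbitrary units) for five peaks.
# Peaks 0 and 4 are assumed to lie outside the region protected by the polymerase.
naked_lane = [100.0, 80.0, 120.0, 90.0, 110.0]
bound_lane = [50.0, 20.0, 30.0, 36.0, 55.0]

phi = relative_change(bound_lane, naked_lane, reference_idx=[0, 4])
print([round(p, 3) for p in phi])
```

In this sketch a value of Ф close to 1 means that cleavage at that position is unchanged relative to naked DNA, and values below 1 indicate protection.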

10.1.2. Fit of the kinetic data to single and double exponential equations.

The progression curves (Ф versus time) of appearance of protection from hydroxyl radical cleavage at each nucleotide position were fit individually to single or double exponential expressions as follows:

$y = L + (U - L)\,e^{-kt}$,   [2]

and

$y = L + A\,e^{-k_A t} + B\,e^{-k_B t}$,   $U = L + A + B$,   [3]

where A and B are the signal amplitudes and $k$, $k_A$ and $k_B$ are the observed, apparent rate constants. The upper (U) and lower (L) limits from these fits were used to normalize the data from 0 to 1, where 0 corresponds to the value in the absence of polymerase and 1 corresponds to the value for the area of the peak at its minimum, resulting in the value of the fractional saturation at that site. We observed that often the difference in the value of the rate and amplitude measured for different nucleotides within a certain region was smaller than the error derived from the fit. The data points for these nucleotides, taken from at least two independent experiments, were combined and then fit again to reduce the error. In order to determine whether a single or a double exponential expression better describes the results in each dataset, a visual analysis of the residuals was carried out (for example, see Figure 1S in “Supporting materials”) as well as an F test (for example, see Table 1S in “Supporting materials”).
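As an illustration of the single exponential fit and of the subsequent normalization to fractional saturation, a minimal sketch is given below. It is not the fitting procedure used in this work (the fits were done in OriginPro). It uses a swapped-in approach that needs only the Python standard library: for each trial rate constant k, the model y = L + C·exp(−kt), with C = U − L, is linear in L and C and is solved by linear least squares, and k is then refined by a golden-section search. The time points, the parameter values and the formula (U − y)/(U − L) for the fractional saturation are assumptions made for the example.

```python
import math


def fit_linear_part(t, y, k):
    """For a fixed k, solve y = L + C*exp(-k*t) for L and C by linear least squares."""
    e = [math.exp(-k * ti) for ti in t]
    n = len(t)
    se, see = sum(e), sum(ei * ei for ei in e)
    sy, sey = sum(y), sum(ei * yi for ei, yi in zip(e, y))
    det = n * see - se * se
    L = (sy * see - se * sey) / det
    C = (n * sey - se * sy) / det
    ss = sum((yi - (L + C * ei)) ** 2 for yi, ei in zip(y, e))
    return L, C, ss


def fit_single_exponential(t, y, k_min=1e-3, k_max=10.0, steps=400):
    """Scan k on a logarithmic grid, then refine it by golden-section search."""
    grid = [k_min * (k_max / k_min) ** (i / steps) for i in range(steps + 1)]
    best = min(range(len(grid)), key=lambda i: fit_linear_part(t, y, grid[i])[2])
    lo, hi = grid[max(best - 1, 0)], grid[min(best + 1, steps)]
    g = (math.sqrt(5) - 1) / 2
    for _ in range(60):
        a, b = hi - g * (hi - lo), lo + g * (hi - lo)
        if fit_linear_part(t, y, a)[2] < fit_linear_part(t, y, b)[2]:
            hi = b  # the minimum lies in [lo, b]
        else:
            lo = a  # the minimum lies in [a, hi]
    k = (lo + hi) / 2
    L, C, ss = fit_linear_part(t, y, k)
    return {"L": L, "U": L + C, "k": k, "ss": ss}


def fractional_saturation(y, L, U):
    """Rescale so that U (no polymerase) maps to 0 and L (peak minimum) maps to 1."""
    return [(U - yi) / (U - L) for yi in y]


# Hypothetical noise-free progression curve: L = 0.4, U = 1.0, k = 0.5 per time unit.
times = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
phi = [0.4 + 0.6 * math.exp(-0.5 * ti) for ti in times]

fit = fit_single_exponential(times, phi)
theta = fractional_saturation(phi, fit["L"], fit["U"])
print(fit)
print([round(x, 3) for x in theta])
```

The same separation into linear parameters (L, A, B) and nonlinear parameters (k_A, k_B) would in principle apply to the double exponential expression, but a general-purpose nonlinear least-squares routine is the more usual choice there.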

10.1.3. Residuals from nonlinear regression.

Nonlinear regression finds the parameter values that make a model fit the data as closely as possible (given some assumptions). It does not automatically ask whether another model might work better. Graphical analysis of the residuals is one of the key tools for model validation.

Residuals are the differences between the observed and predicted responses. In other words, a residual is the vertical distance of a data point from the fit curve. Residuals are positive when the points are above the curve and negative when the points are below the curve. Careful inspection of the residuals can show whether the assumptions made are reasonable and whether the choice of model is appropriate. The general assumption is that the residuals are roughly normally and (approximately) independently distributed around the fit curve [Motulsky et al., 2003].
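The definition of the residuals can be illustrated with a short sketch. The observed and predicted values below are hypothetical. Counting sign changes is added here only as a crude numerical companion to the visual inspection; it was not part of the analysis described in this work.

```python
def residuals(y_obs, y_fit):
    """Observed minus predicted: positive above the curve, negative below it."""
    return [o - f for o, f in zip(y_obs, y_fit)]


def sign_changes(res):
    """Number of times consecutive non-zero residuals switch sign."""
    signs = [r > 0 for r in res if r != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Hypothetical data points and the values predicted by a fitted curve.
y_obs = [1.00, 0.80, 0.66, 0.52, 0.45, 0.41]
y_fit = [0.95, 0.82, 0.70, 0.55, 0.44, 0.38]

res = residuals(y_obs, y_fit)
print([round(r, 2) for r in res])
print(sign_changes(res))
```

Residuals that stay on one side of the curve over long stretches, giving few sign changes, suggest a systematic deviation from the model rather than random scatter.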

10.1.4. Extra sum-of-squares F test.

As outlined above, the goal of nonlinear regression is to find parameter values that bring the curve close to the data points or, in other words, that minimize the sum of squares of the vertical distances of the data points from the curve. It might therefore seem that the model that fits the data with the smallest sum-of-squares is the best. That approach is too simple. A more complicated model (more parameters) can fit the data better (come closer to the points) just because it can have more inflection points. For instance, a two-phase model almost always fits the data better than a one-phase model. Any method that compares a simple model with a more complicated one therefore has to balance the decrease in sum-of-squares against the increase in the number of parameters.

The F test (extra sum-of-squares) is one of the key statistical approaches used to compare related models (two models are related when one is a simpler case of the other). It also takes into account the number of data points and the number of parameters of each model.

If the simpler model (fewer parameters) is correct, then the relative increase in the sum-of-squares (going from the complicated to the simple model) is expected to be approximately equal to the relative increase in degrees of freedom (DF), where DF equals the number of data points minus the number of parameters. If the more complicated model is correct, then the relative increase in the sum-of-squares is expected to be greater than the relative increase in degrees of freedom.

The F ratio equals the relative difference in the sum-of-squares divided by the relative difference in degrees of freedom:

$F = \dfrac{(SS_1 - SS_2)/SS_2}{(DF_1 - DF_2)/DF_2}$   [4]

That equation is more commonly shown in the following equivalent form:

$F = \dfrac{(SS_1 - SS_2)/(DF_1 - DF_2)}{SS_2/DF_2}$   [5]

where the numbers 1 and 2 refer to the simple (for example, single exponential) and more complex (for example, double exponential) models, respectively. $(DF_1 - DF_2)$ is the degrees of freedom for the numerator (DFn), and $DF_2$ is the degrees of freedom for the denominator (DFd).

If the simpler model is correct, then the F ratio is expected to be near 1.0. If the F ratio is much greater than 1.0, then there are two possibilities:

• The more complicated model is correct.

• The simpler model is correct, but random scatter in the data led the more complicated model to fit better.

The P value, which can be calculated from the F ratio and the two DF values, tells how frequently the second possibility would occur. P is the probability (ranging from 0 to 1) that results like those observed could have occurred by chance if the simpler model were correct. If the P value is low (0.05 or below), one can conclude that the more complicated model fits significantly better than the simple model.
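A worked example of the F ratio and its P value may make the comparison more concrete. The sketch below uses only the Python standard library. The P value is taken as the upper tail of the F distribution, computed from the regularized incomplete beta function with a standard continued-fraction evaluation. The sums of squares and the number of data points are hypothetical. The parameter counts (three for the single exponential: L, U, k; five for the double exponential: L, A, B, k_A, k_B) follow from the expressions in section 10.1.2.

```python
import math


def f_ratio(ss1, df1, ss2, df2):
    """Extra sum-of-squares F ratio; model 1 is the simpler (nested) model."""
    return ((ss1 - ss2) / (df1 - df2)) / (ss2 / df2)


def _betacf(a, b, x, max_iter=300, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd steps of the continued fraction recurrence.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
              + a * math.log(x) + b * math.log(1.0 - x))
    bt = math.exp(log_bt)
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def f_test_p_value(f, dfn, dfd):
    """Upper-tail probability of the F distribution with (dfn, dfd) degrees of freedom."""
    if f <= 0.0:
        return 1.0
    return _betai(dfd / 2.0, dfn / 2.0, dfd / (dfd + dfn * f))


# Hypothetical comparison: 20 data points, single (3 parameters) vs double (5 parameters).
n_points = 20
ss1, df1 = 0.30, n_points - 3   # single exponential
ss2, df2 = 0.10, n_points - 5   # double exponential

f = f_ratio(ss1, df1, ss2, df2)
p = f_test_p_value(f, df1 - df2, df2)
print(f, p)
```

With these hypothetical numbers the F ratio is far above 1 and the P value is far below 0.05, so the double exponential expression would be preferred.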

10.2. Analysis of potassium permanganate footprinting data.

In order to obtain the image of the potassium permanganate footprints of DNA bound by RNAP, the radioactive (32P) gel was exposed to a Fuji Imager plate BAS IIIS and scanned using a Bas-1000 PhosphorImager instrument. Band intensity profiles along each gel lane were determined using the MacBas software. For integration of the peak areas in each lane, the lowest intensity points in the lane were used as the horizontal baseline and the peaks were fit to Lorentzian curves. The peak areas were then normalized to the total amount of DNA material applied to the corresponding gel lane. The normalized area values of the peaks in the lane of DNA treated with KMnO4 in the absence of RNAP (local background intensity) were then subtracted from the normalized area values of the corresponding peaks within each lane. The resulting values represent the change compared with naked DNA, Ф.
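The normalization and background subtraction for the permanganate data can be sketched in the same way. Unlike the hydroxyl radical data, the naked DNA lane is subtracted here rather than used as a divisor. The function names and all numbers are hypothetical, and the total amount of DNA in a lane is simply assumed to be available as a single number per lane.

```python
def normalize_to_total(areas, total_dna):
    """Express each peak area as a fraction of the total DNA loaded in the lane."""
    return [a / total_dna for a in areas]


def reactivity_change(sample_areas, sample_total, naked_areas, naked_total):
    """Phi per thymine: normalized sample area minus normalized naked-DNA background."""
    sample = normalize_to_total(sample_areas, sample_total)
    background = normalize_to_total(naked_areas, naked_total)
    return [s - b for s, b in zip(sample, background)]


# Hypothetical peak areas for three thymine positions (arbitrary units).
rnap_lane, rnap_total = [30.0, 120.0, 60.0], 1000.0
naked_lane, naked_total = [10.0, 20.0, 10.0], 500.0

phi = reactivity_change(rnap_lane, rnap_total, naked_lane, naked_total)
print([round(p, 3) for p in phi])
```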

The plots of the increase in KMnO4 accessibility at each thymine position (Ф versus time) were fit to single ([2]) or double ([3]) exponential equations and processed in the same way as the hydroxyl radical footprints (see section 10.1.2.). Analysis of the residuals as well as the statistical F test showed that in all cases the plots were fit better by the double than by the single exponential equation (see Figure 2S and Table 2S in “Supporting materials”, respectively).

In document Rogozina, Anastasia (2009): The pathway to transcriptionally active Escherichia coli RNAP-T7A1 promoter complex formation: Positioning of RNAP at the promoter using X-ray hydroxyl radical footprinting. Dissertation, LMU München: Fakultät für Chem (Page 104-108)