CHAPTER 3. IMAGE METRICS, A NEXT-GEN AFM IMAGE
3.5. Image Analysis
3.5.2. Shape Analysis
Shape analysis is an important application in computer vision that often involves image registration, recognition, and classification [122-125]. To describe and distinguish the shape of an object, shape descriptors (heights, contours, critical points, etc.) and shape contexts [126-129] are often used, and the shape has to be transformed to adjust for its scaling, three-dimensional orientation, and deformation [130-133]. In AFM, since particles such as proteins and nucleic acids are often imaged on a flat, blank surface, shape descriptors are routinely used to describe particle shapes. In particular, shape description via particle metrics (Table 3.3) is a staple feature of many AFM programs (e.g. Particle and Pore Analysis in SPIP [96], or Grain Analysis in Gwyddion [92]). Image Metrics provides a similar particle analysis module called Particle Analyzer that can quantitatively measure particle metrics in batch (Section 3.5.2A). Image Metrics can also match and classify particles using a correlation-based particle alignment and classification method (Section 3.5.2B, C) that is unique in AFM. Like the particle alignment and classification techniques used in cryo-EM [134], Image Metrics can adjust for orientation differences, classify molecules based on their shape, and take class averages to obtain a more refined molecular conformation [115, 135, 136].
A. Particle Metrics
Particle metrics are geometric descriptions of particles, such as height and area (Table 3.3). The analysis of particle metrics is known as particle analysis in AFM. In Image Metrics, they can be measured in batch via the Particle Analyzer module. First, the particles of interest need to be masked and detected (Section 3.5.1). After that, batch analysis for most particle metrics is performed using the Image Processing Toolbox in MATLAB. Unlike most AFM image analysis programs, Image Metrics allows batch processing of particles across a collection of images. Image Metrics reports the measurement results and statistics to the end users in a table (Figure 3.13D, E). From there, users can sort, filter, categorize, and locate particles by their metrics, and graph the statistical results (Figure 3.13F).
Particle Metrics
Area/Convex Area/Filled Area    Intensity (max, mean, min)
Eccentricity                    Major/Minor Axis Length
Equivalent Diameter             Orientation
Euler Number                    Perimeter
Extent                          Solidity
Fiber Length                    Volume
Table 3.3 Selected List of Particle Metrics
Descriptions of most of these metrics can be found in the documentation of MATLAB’s regionprops function29. Volume and fiber length are measured as described in Section 3.5.3B, C. Measurements of selected particle metrics are compared across multiple AFM programs in Section 3.5.4.
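As an illustration of how a few of the metrics in Table 3.3 are defined, the following minimal sketch computes them from a labeled mask. It is written in Python with NumPy rather than MATLAB and is not the Image Metrics implementation; the function name particle_metrics and the toy arrays are hypothetical. It assumes the particles have already been detected and labeled (label 0 is background), and it reports values in pixel units. Equivalent diameter and extent follow the usual regionprops definitions (diameter of a circle with the same area, and area divided by bounding-box area).

```python
import numpy as np

def particle_metrics(labels, heights):
    """Return a dict of simple metrics for each labeled particle.

    labels  : 2D integer array, 0 = background, 1..N = particle ids
    heights : 2D float array of the same shape (e.g. AFM height values)
    """
    results = {}
    for lab in np.unique(labels):
        if lab == 0:
            continue  # skip background
        mask = labels == lab
        rows, cols = np.nonzero(mask)
        area = int(mask.sum())  # number of pixels in the particle
        bbox_area = (rows.max() - rows.min() + 1) * (cols.max() - cols.min() + 1)
        values = heights[mask]
        results[int(lab)] = {
            "area": area,
            # diameter of a circle with the same area as the particle
            "equivalent_diameter": float(np.sqrt(4.0 * area / np.pi)),
            # fraction of the bounding box covered by the particle
            "extent": area / float(bbox_area),
            "max_intensity": float(values.max()),
            "mean_intensity": float(values.mean()),
            "min_intensity": float(values.min()),
        }
    return results

# Toy example: a 2x2 square particle and an L-shaped particle.
labels = np.zeros((6, 6), dtype=int)
labels[1:3, 1:3] = 1
labels[4, 2:5] = 2
labels[3, 4] = 2
heights = np.arange(36, dtype=float).reshape(6, 6)
metrics = particle_metrics(labels, heights)
```

Computing only the requested keys of such a table is one simple way to realize the "select only the metrics you need" optimization.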
To improve performance, users can select only the metrics they want to calculate. Pre-filtering unnecessary particles while making the mask also helps (Section 3.5.1). After the measurements are made, users can filter particles by the numeric values of their metrics and, if the particles are categorized (Section 3.5.2B), by set operations on their categories. The filters can be applied to all the particles or only to the particles that are currently selected, and multiple filters can be applied at the same time. Users can also make a subset from the current particle selection and perform all filtering operations within that subset only. Users can manually remove particles from, or add particles to, the filtered particles. Similar to SPIP, filtered particles can be categorized into sets using color and name labels, and users can perform set operations (intersection and union) to further dissect or group the categorized particles. In addition, particles can be displayed in a grid for better visibility (Figure 3.16), a feature not found in other programs. Collectively, these operations give users considerable flexibility in filtering, categorizing, and ultimately picking out the particles that satisfy given descriptions and criteria.
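The filtering and set operations described above can be sketched in a few lines. This is an illustrative Python example, not code from Image Metrics; the particle table, its metric values, and the helper filter_particles are all hypothetical.

```python
# Hypothetical measurement table: particle id -> metric values.
particles = {
    1: {"area": 120.0, "height": 2.1},
    2: {"area": 340.0, "height": 4.8},
    3: {"area": 95.0, "height": 5.2},
    4: {"area": 410.0, "height": 1.9},
}

def filter_particles(table, metric, low, high, within=None):
    """Return ids whose metric lies in [low, high].

    If `within` is given, only particles in that subset are considered,
    which mimics filtering inside a user-defined subset.
    """
    ids = table.keys() if within is None else within
    return {pid for pid in ids if low <= table[pid][metric] <= high}

# Two independent filters over all particles.
large = filter_particles(particles, "area", 300.0, 500.0)
tall = filter_particles(particles, "height", 4.0, 6.0)

# Set operations on the resulting categories.
large_and_tall = large & tall   # intersection
large_or_tall = large | tall    # union

# The same height filter restricted to the "large" subset.
tall_in_large = filter_particles(particles, "height", 4.0, 6.0, within=large)
```

Representing each category as a set of particle ids keeps intersection and union trivial and makes manual additions or removals a single set operation.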
Figure 3.13. Particle Analysis
The main interface of the Particle Analyzer module is shown, with its panels marked by boxes of different colors. A. Processed image. Particles are labeled by number and color; different colors represent the categories to which they are assigned. B. Regional inspection (zoomed-in view of the blue box in A). C. Particle inspection of particle #22 (labeled in A). D. Table listing metrics of all selected particles. E. Statistics of a selected metric. F. Distribution (histogram plot) of the selected metric.
B. Particle Matching
It is often difficult to fully distinguish the shape of a particle using simple geometric metrics. For example, in Figure 3.14, particles of very different shapes may have the same size (area), and particles of the same conformation may differ in slight ways that result in notably different measurements, such as fiber length. Using a combination of the mathematical descriptions listed in Table 3.3 can help recognize, but not completely distinguish, different shapes.
Figure 3.14 The Insufficiencies of Particle Metrics in Describing Shapes
Three hMutLα proteins (CHAPTER 4) are shown in A, B, C. Under the same height threshold, which outlines them in red contours (D, E, F), their masks all have similar sizes (areas), but the particle in B has a globular shape, which is notably different from the other two particles (A, C), which have more open conformations. On the other hand, although particles A and C both have extended conformations and may be categorized into the same shape group, they have different fiber lengths (black lines, D, F).
Correlation methods have traditionally been used in EM to identify and match particles precisely with a reference shape [115, 137, 138], which could be a custom shape or the shape of a template particle. To calculate the correlation between two particles, they have to be aligned both translationally and rotationally. In EM, auto-correlation and cross-correlation are used to solve for both the translational and rotational shifts between a given particle pair [115]. After particles are aligned, their correlation can be used for particle classification, and class averages through correlation averaging can be used to obtain higher resolution of a particle image [136, 139].
Image Metrics implements a similar rotary correlation matching algorithm, which has not been implemented in mainstream AFM software. In this method, one particle is chosen as the template and another as the target. The two particles form a particle pair, and the program tries to align them by rotating the target particle at designated intervals and calculating its correlation with the template particle at each rotation. When the target particle aligns with the template particle, the correlation reaches its maximum and the corresponding rotation angle is recorded. This procedure is repeated for all possible particle pairs. The result is a correlation map and a rotation map, from which the program can look up the correlation and alignment angle of any particle pair. A threshold correlation (called the quality score in Image Metrics) is designated by the user to determine whether a particle pair is a match. Therefore, by cycling through all possible template particles, users can find all matching particles and put them into categories according to the shapes of the templates.
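The rotate-and-correlate procedure can be sketched as follows. This is a simplified Python/NumPy illustration, not the Image Metrics code. It assumes that particles are already centered in equal-size images, so correlation is evaluated at zero translational shift only, and it uses a basic nearest-neighbor rotation. The function names (rotate_nearest, ncc, best_rotation, correlation_maps) are hypothetical, and the toy example uses 90-degree steps so that the rotations are exact.

```python
import numpy as np

def rotate_nearest(image, angle_deg):
    """Rotate an image about its center using nearest-neighbor sampling."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    rows, cols = np.indices(image.shape)
    y, x = rows - cy, cols - cx
    # Inverse mapping: source coordinates for every output pixel.
    src_x = np.cos(theta) * x + np.sin(theta) * y + cx
    src_y = -np.sin(theta) * x + np.cos(theta) * y + cy
    sr = np.rint(src_y).astype(int)
    sc = np.rint(src_x).astype(int)
    valid = (sr >= 0) & (sr < h) & (sc >= 0) & (sc < w)
    out = np.zeros_like(image)
    out[valid] = image[sr[valid], sc[valid]]
    return out

def ncc(a, b):
    """Zero-shift normalized cross-correlation of two equal-size images."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_rotation(template, target, step_deg):
    """Rotate the target at fixed intervals; return (max correlation, angle)."""
    best_score, best_angle = -np.inf, 0
    for angle in range(0, 360, step_deg):
        score = ncc(template, rotate_nearest(target, angle))
        if score > best_score:
            best_score, best_angle = score, angle
    return best_score, best_angle

def correlation_maps(particles, step_deg=10):
    """Correlation map and rotation map for all (template, target) pairs."""
    n = len(particles)
    corr = np.zeros((n, n))
    rot = np.zeros((n, n), dtype=int)
    for i in range(n):          # i indexes the template
        for j in range(n):      # j indexes the target
            corr[i, j], rot[i, j] = best_rotation(particles[i], particles[j], step_deg)
    return corr, rot

# Toy particles: an L shape, the same L turned by a quarter turn, and a square.
template = np.zeros((9, 9))
template[2:7, 3] = 1.0
template[6, 3:7] = 1.0
target = np.rot90(template)
blob = np.zeros((9, 9))
blob[3:6, 3:6] = 1.0
corr, rot = correlation_maps([template, target, blob], step_deg=90)
```

In this toy case the two L shapes reach a correlation of about 1 at a quarter-turn rotation, while the square correlates poorly with both, so a quality score between those values would separate them.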
In Image Metrics, users can choose one of two correlation metrics for this operation: normalized cross-correlation (NCORR [140]) and sum of squared differences (SSD [141]). Although both are aimed at matching features, their emphases differ. NCORR obtains a normalized correlation by using normalized image data for the correlation calculation, whereas SSD obtains a normalized correlation by using raw image data for the correlation calculation and normalizing the result afterwards. Therefore, NCORR tends to match features with the same textures regardless of their intensities, while SSD tends to match features with the same intensities [142]. In general, NCORR is better suited to finding matches if the surface is not flat (or cannot be flattened), while SSD is better suited to finding matches if the surface is flat and every particle is on even “ground”. The results from NCORR may be undesirable if a tall particle matches a short particle in texture, whereas the results from SSD may suffer from uneven surface heights, such that a short particle on a high background matches a tall particle on a low background in absolute height.
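The different emphases of the two metrics can be seen in a small numerical sketch. The Python/NumPy code below is illustrative only: ncorr is a zero-mean normalized correlation at zero shift, and ssd is the plain (un-normalized) sum of squared differences, so the final normalization step that Image Metrics applies to SSD is not reproduced here. The synthetic particles are hypothetical.

```python
import numpy as np

def ncorr(a, b):
    """Zero-mean normalized correlation at zero shift (1.0 = identical texture)."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def ssd(a, b):
    """Raw sum of squared differences (0.0 = identical heights)."""
    return float(((a - b) ** 2).sum())

# Synthetic particles on a 9x9 grid: a short bump, a bump three times as tall,
# and the short bump sitting on a background that is 1.5 units higher.
y, x = np.mgrid[-4:5, -4:5]
short = 2.0 * np.exp(-(x ** 2 + y ** 2) / 8.0)
tall = 3.0 * short
raised = short + 1.5

texture_match_tall = ncorr(short, tall)      # close to 1: same texture
texture_match_raised = ncorr(short, raised)  # close to 1: offset is ignored
height_mismatch_tall = ssd(short, tall)      # large: heights differ
height_mismatch_raised = ssd(short, raised)  # large: background differs
```

The normalized correlation treats all three images as the same particle, whereas the squared-difference measure penalizes both the taller particle and the raised background, which is the trade-off described above.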
An example calculation with 70 particles and 360 rotation steps per pair involves about 1.76 million correlation calculations (70 × 70 × 360 = 1,764,000). To speed up the calculation in Image Metrics, symmetry30 can be applied, and parallel computing31 on multi-core computer systems or on computer clusters running MDCS32 can be used. The speed can be further improved if some of the EM alignment methods mentioned earlier [115] are used33.
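A short worked example shows how much pairing symmetry can save for the 70-particle case. The sketch below is Python/NumPy and rests on two assumptions that are not stated in the text: self-pairs are skipped, and the rotation for the reversed pair is taken as the negative of the computed angle modulo 360 degrees. The helper mirror_maps is hypothetical.

```python
import numpy as np

n_particles = 70
n_rotations = 360

# Every ordered pair at every rotation step.
all_pairs = n_particles * n_particles * n_rotations

# Pairing symmetry: compute each unordered pair once and skip self-pairs.
unordered_pairs = n_particles * (n_particles - 1) // 2
with_pair_symmetry = unordered_pairs * n_rotations

def mirror_maps(corr_upper, rot_upper, full_turn=360):
    """Fill the lower triangles of both maps from the upper triangles.

    corr[j, i] is copied from corr[i, j]; rot[j, i] is the opposite
    rotation, i.e. (-rot[i, j]) modulo a full turn.
    """
    corr = np.triu(corr_upper) + np.triu(corr_upper, 1).T
    rot = np.triu(rot_upper) + (-np.triu(rot_upper, 1).T) % full_turn
    return corr, rot

# Two-particle example: particle 1 aligns to particle 0 at 60 degrees.
corr_full, rot_full = mirror_maps(
    np.array([[1.0, 0.9], [0.0, 1.0]]),
    np.array([[0, 60], [0, 0]]),
)
```

Under these assumptions the count drops from 1,764,000 to 869,400 correlation calculations, roughly a factor of two, before any parallel computing is applied.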
Once the correlation and rotation maps are obtained, finding matching particles is an iterative process. Figure 3.15 shows an example of this process, used to characterize the shape of the protein UHRF1, which is studied in a paper that I co-authored [143]34. Another example characterizing MutSα-DNA conformations is shown in Figure D.10. Users start by choosing a template particle (e.g. bottom left particle, Figure 3.15A) and defining a correlation score (quality score). Image Metrics finds all particles above the designated score, aligns them to the template, and displays their correlations (Figure 3.15B). Users can further filter them by hand, or by using a new quality score (Figure 3.15B). After the particles are filtered to the user’s satisfaction, they are assigned a category (e.g. a category of a specific conformation). The aligned particles can also be averaged to obtain a refined image of the particle category (Figure 3.15C, also see Section 3.4.2). This process is repeated by using found particles as new templates, or by using a new particle as the template, until users have cycled through and assigned categories to all the particles. After the initial rough assignment, users can review the assignment by category and make changes. This reviewing process is repeated until users have fine-tuned the assignment for every particle. Image Metrics can plot the categories in a pie chart so that the populations of the categories can be compared (Figure 3.15D).
30 Symmetry includes rotation symmetry and particle pairing symmetry. For example, if particle 1 is aligned with particle 2 by rotating particle 1 clockwise by 60 degrees, then particle 2 should be aligned with particle 1 by rotating particle 2 counterclockwise by 60 degrees (in reality the angle may be slightly different due to interpolation in image rotation operations). To save time, the two rotations can be obtained by rotating only one of the particles. Similarly, the particle pairing is also symmetric. The correlation of particle 1 to particle 2 should in principle be the same as the correlation of particle 2 to particle 1. However, minor differences might occur depending on which particle is used as the template for cross-correlation. To save time, only one order of pairing needs to be calculated instead of both.
31 https://www.mathworks.com/help/distcomp/index.html
32 MDCS - MATLAB Distributed Computing Server
33 The current implementation in Image Metrics performs one cross-correlation for every rotation. Some of the EM methods use the auto-correlation function (ACF) of the particle image, which is translationally invariant. Therefore, the rotations can be performed on the ACF without needing translational alignment, and only one cross-correlation calculation is required to align the image translationally after the optimal rotation angle is found.
34 UHRF1 is short for Ubiquitin-like, containing PHD and RING finger domains, 1. It is important in the epigenetic inheritance of DNA methylation.
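The template-driven part of this workflow can be sketched as a lookup in a precomputed correlation map. The Python/NumPy example below is a greedy simplification and not the Image Metrics procedure: the manual review, re-templating, and class averaging steps are not modeled, and the correlation values, function names (find_matches, categorize), and choice of templates are hypothetical.

```python
import numpy as np

def find_matches(corr, template, quality_score, unassigned):
    """Unassigned particles whose correlation to the template meets the score."""
    return sorted(p for p in unassigned if corr[template, p] >= quality_score)

def categorize(corr, templates, quality_score):
    """Assign each particle to the first template it matches.

    Returns a dict {template id: member ids} and the list of particles
    that matched no template.
    """
    unassigned = set(range(corr.shape[0]))
    categories = {}
    for t in templates:
        if t not in unassigned:
            continue  # this template was already placed in another category
        members = find_matches(corr, t, quality_score, unassigned)
        categories[t] = members
        unassigned -= set(members)
    return categories, sorted(unassigned)

# Hypothetical correlation map for five particles (row = template).
corr = np.array([
    [1.00, 0.92, 0.30, 0.25, 0.88],
    [0.92, 1.00, 0.28, 0.31, 0.85],
    [0.30, 0.28, 1.00, 0.90, 0.35],
    [0.25, 0.31, 0.90, 1.00, 0.20],
    [0.88, 0.85, 0.35, 0.20, 1.00],
])
categories, leftover = categorize(corr, templates=[0, 2], quality_score=0.8)
```

Raising the quality score shrinks each category and leaves more particles unassigned, which corresponds to the user tightening the threshold before reviewing the remaining particles by hand.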
Figure 3.15. Shape Matching Analysis of UHRF1 Proteins
A. Overview of UHRF1 proteins. B. Shape analysis: UHRF1 proteins are aligned to a template, correlation matching scores are calculated and labeled, and proteins are removed (hollow circles) either automatically by their low scores (purple) or manually. The shape conformation is named (‘double’). C. Selected proteins are correlation averaged, and the averaged protein image is compared to the template. D. Categorization of conformations of proteins after inspecting
C. Particle Classification
Putting particles into categories is a process of classification, one of the most common tasks in computer vision. In the previous sub-section, I described a manual classification process through particle correlation and alignment. Computer-assisted automatic classification methods have also been developed, of which clustering analysis (CA) is routinely used ([124] Ref. Section 8.3). In a typical clustering analysis, objects are clustered based on their similarity, where distance of object attributes is calculated as their similarity measure35. In shape analysis, the attributes could be one or more of their shape descriptors (metrics such as height, area, contour, etc.) ([129], [124] Ref. Chapter 6) or their individual pixels36. A variety of distance metrics exist, such as Jaccard distance and Euclidean distance [144, 145]. After the distances between objects are obtained, the objects can be clustered by using one of many clustering methods available, such as hierarchical ascendant clustering (HAC) and K-means clustering ([145], [124] Ref. Section 8.3).
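The two ingredients of clustering analysis named above, a distance metric and a clustering method, can be sketched as follows. This Python/NumPy example is illustrative only: it implements Jaccard and Euclidean distances and a plain agglomerative (hierarchical ascendant) clustering with average linkage, which is one of several possible linkage rules and is an assumption here. The attribute values are hypothetical.

```python
import numpy as np

def jaccard_distance(mask_a, mask_b):
    """1 - |A and B| / |A or B| for two boolean masks."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(mask_a, mask_b).sum() / union

def euclidean_distance(a, b):
    """Euclidean distance between two attribute vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt((diff ** 2).sum()))

def hierarchical_clusters(dist, n_clusters):
    """Agglomerative clustering with average linkage on a distance matrix."""
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between all members of the two clusters.
                d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical shape descriptors per particle: (height, area).
attributes = np.array([
    [2.0, 100.0],
    [2.2, 110.0],
    [5.0, 300.0],
    [5.1, 290.0],
    [2.1, 105.0],
])
n = len(attributes)
dist = np.array([[euclidean_distance(attributes[i], attributes[j])
                  for j in range(n)] for i in range(n)])
clusters = hierarchical_clusters(dist, n_clusters=2)

# Jaccard distance between two overlapping one-dimensional masks.
overlap = jaccard_distance(np.array([1, 1, 1, 0], dtype=bool),
                           np.array([0, 1, 1, 1], dtype=bool))
```

In practice the attributes would usually be rescaled first, since here the area values dominate the Euclidean distance simply because of their larger magnitude.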
Particle classification by shape is widely used in electron microscopy (EM). In EM, particles are usually classified through multivariate statistical analysis37 (MSA) followed by clustering analysis [135, 139, 146]. Particle classes can then be used to categorize conformations and reconstruct their 3D models [138, 147]. Advances in algorithms, computing power, and cryogenic technology have enabled scientists to resolve molecular structures at near-atomic resolution in their native state [148], which may be one of the most important scientific breakthroughs in recent history [149]. However, particle classification in AFM is still, to my knowledge, a novelty.
35 The shorter the distance between objects’ attributes, the higher the similarity between the objects.
36 Similarity between the individual pixels of two objects is also called correlation [115]. For example, cross-correlation is one of the correlation metrics.
Image Metrics aims to bridge that gap. Similar to EM, I have implemented automatic classification methods (APPENDIX F). Initially, I developed a custom clustering scheme that involves a variant of multivariate statistical analysis using Jaccard distance (I named it eigenanalysis, see Appendix G.I). I have since implemented standard hierarchical clustering and K-means clustering based on the distance between particle correlations using MATLAB’s Statistics Toolbox (Appendix G.II). In Appendix G.III, the concept and usage of the different clustering schemes in Image Metrics are explained using a simulated data set, and the accuracy of clustering is compared and verified. The section also explains how to optimize classification by screening major parameters and/or by hybridizing different clustering methods.
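One possible reading of "K-means clustering based on the distance between particle correlations" is to treat each particle's row of the correlation map as its feature vector and cluster those rows. The Python/NumPy sketch below follows that reading as an assumption; it is not the implementation described in Appendix G.II. The kmeans function is a plain Lloyd's algorithm with a deterministic farthest-point initialization (chosen here only to make the example reproducible), and the correlation values are hypothetical.

```python
import numpy as np

def kmeans(features, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    features = np.asarray(features, dtype=float)
    # Start from the first sample, then repeatedly add the sample that is
    # farthest from all centers chosen so far.
    centers = [features[0]]
    while len(centers) < k:
        d = np.min([((features - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(features[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            features[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical correlation map; each row is one particle's correlation profile.
corr = np.array([
    [1.00, 0.92, 0.30, 0.25, 0.88],
    [0.92, 1.00, 0.28, 0.31, 0.85],
    [0.30, 0.28, 1.00, 0.90, 0.35],
    [0.25, 0.31, 0.90, 1.00, 0.20],
    [0.88, 0.85, 0.35, 0.20, 1.00],
])
labels, centers = kmeans(corr, k=2)
```

Particles with similar correlation profiles end up with the same label, so in this toy case particles 0, 1, and 4 form one class and particles 2 and 3 form the other.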
Compared to EM, where the goal of clustering is to generate distinct classes with the ultimate purpose of minimizing intra-class distances, Image Metrics focuses on maximizing inter-class distances so that distinct conformations can be separated. The results may be similar, but the emphases are different. In EM, data that contain minor conformational species or subtle conformational changes are often discarded in favor of obtaining higher-resolution class averages for 3D reconstruction [139], whereas in Image Metrics a class often encompasses some variations of the same conformation. In other words, EM clustering aims to generate more, but more refined, classes so that classes are distinct from each other even for minor conformational changes; Image Metrics clustering aims to generate fewer, but “messier”, classes that distinguish particles by major rather than minor conformational changes. As discussed in Section 3.7.3, the methods implemented in Image Metrics are well suited to the type of data (AFM) they process, and they can also be useful for classifying a variety of image types in other fields.
In Image Metrics, a module called Particle Categorization is provided to perform particle classification. Particles can be classified into groups, and groups can be merged into categories, either manually or using one of the aforementioned methods. In a typical workflow, users first group particles by choosing a number of parameters that gauge the quality of the match and the size of the groups. Once the initial grouping is performed by the computer, users can inspect the groups, throw away outlier particles, regroup the particles, and merge groups into categories.
The module can classify particles with high throughput. For a sample of a thousand particles, the initial grouping on an average computer takes only seconds, and the whole classification process (grouping and categorization), including user inspection, usually takes about half an hour. For example, in Figure 3.16, more than a thousand Saccharomyces cerevisiae MutLα proteins38 are being categorized using the aforementioned clustering schemes. After automatic grouping, the proteins of a group or a category are displayed in a gallery (Figure 3.16 red box). Users can choose a template protein (Figure 3.16 orange box) within the group or category (or let the computer choose the one that has the best matches) and align all other proteins to that template protein (Figure 3.16 red box). The matching scores can be displayed for each protein and sorted in order (Figure 3.16 blue numbers in red box). From the gallery, users can discard outliers into the trash or put them in another group (Figure 3.16 blue box). After each group is confirmed to the user’s satisfaction, the groups can be automatically or manually merged into categories (Figure 3.16 dark red box). After each category is confirmed to the user’s satisfaction, the classification process is complete.
Figure 3.16 Particle Classification Module
Shown in the figure is the classification process of 1112 Saccharomyces cerevisiae MutLα proteins. Red box – Gallery that displays particles from a selected group or category. The particles are centered in the images and masked with assorted color overlays (to separate nearby particles). Orange box – Panel that displays the template particle that is used for alignment of all other particles in its group or category. Blue box – Groups that are created manually or automatically. Numbers indicate correlation quality to the template. Dark red box – Categories that