SumThreshold Method - Technical development and scientific preparation for the e-MERLIN Cygnus OB2 radio survey

Offringa et al. (2010a) demonstrate that SumThreshold is the most effective thresholding method currently available, and it is the RFI detection algorithm adopted for SERPent. An overview of the method is given here; for a more in-depth analysis of the process see Offringa et al. (2010a).

Threshold methods work on the basis that RFI increases visibility amplitudes at the times and frequencies in which it is present. RFI-contaminated visibilities therefore differ considerably in amplitude from RFI-free visibilities, which makes them statistical outliers. If the RFI amplitudes fulfil a certain threshold condition, they are detected and flagged. The threshold level is dictated by the statistics of the relevant visibility subset, which can be the entire observation (all time scans, frequency channels, baselines etc.) or a smaller portion, for example separate baselines, IFs and polarisations. This has the advantage of increasing the reliability of the statistics, because RFI may be independent of baseline and the distribution between IFs may differ. This is particularly relevant for L-band (1.3 to 1.8 GHz) observations, where the RFI is more problematic.

Figure 2.1: Time-frequency plot of the visibilities of the source 0555+398 from dataset number 3 (see Table 2.1). A single IF and RR polarisation is shown with a frequency range from 4.54 to 4.66 GHz from the baseline Knockin-Pickmere (5-7). RFI is seen to vary both in time (vertical axis) and frequency (horizontal axis) at around 4.64 GHz.

The visibility data within AIPS are sorted by time and then baseline (TB format). Within each time-baseline data sample the data are further divided by IF, channel, Stokes parameter and the real part, imaginary part and weight of the visibility (see Section 1.1.1). The SumThreshold method applied in SERPent works on visibility data which are separated by baseline and polarisation and arranged in a 2D array, with the individual time scans and frequency channels comprising the array axes, i.e. time-frequency space. The frequency axis is further split by IF. The idea is that peak RFI and broadband RFI will be easily detectable when the visibility amplitudes are arranged in time-frequency space.

If the visibility weight is greater than 0.0, i.e. if data exist for that time and frequency, then the magnitude of the real and imaginary parts of the visibility is taken to constitute the amplitude. The visibility weight is present from cross correlation, where it operates as a noise scaling factor during correlation to account for the different antenna sensitivities in the e-MERLIN array. If the weight is 0.0 or less, i.e. no data exist for this time-frequency position on this baseline, then the amplitude is set to 'NaN'. This datum has no effect on the sample statistics or threshold value, but acts as a structural substitute for that elemental position within the array, which both AIPS and SERPent require to retain the correct time-frequency information. The Python module NumPy is employed to create and manipulate the 2D arrays, as the module is implemented with performance-optimised compiled code.
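The weight test described above can be sketched in NumPy as follows. This is a minimal illustration rather than SERPent's own code: the function name `amplitudes_from_visibilities` and the toy arrays are hypothetical, and the real, imaginary and weight values are assumed to be already arranged as 2D time-frequency arrays.

```python
import numpy as np

def amplitudes_from_visibilities(real, imag, weight):
    """Return |V| where weight > 0 and NaN elsewhere (hypothetical helper)."""
    real = np.asarray(real, dtype=float)
    imag = np.asarray(imag, dtype=float)
    weight = np.asarray(weight, dtype=float)
    amp = np.hypot(real, imag)  # sqrt(real**2 + imag**2), element by element
    # NaN keeps the time-frequency position but is ignored by NaN-aware statistics.
    return np.where(weight > 0.0, amp, np.nan)

# Toy data: 2 time scans x 3 frequency channels (illustrative values only).
real = np.array([[3.0, 0.0, 1.0], [6.0, 2.0, 0.0]])
imag = np.array([[4.0, 1.0, 0.0], [8.0, 0.0, 5.0]])
weight = np.array([[1.0, 0.0, 2.0], [1.0, -1.0, 1.0]])

amp = amplitudes_from_visibilities(real, imag, weight)
print(amp)
print(np.nanmedian(amp))  # the median is computed over the valid samples only
```

Using NaN-aware reductions such as `np.nanmedian` is one way to ensure the placeholder values do not enter the sample statistics.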

There are two concepts associated with the SumThreshold method: the threshold (χ) and the subset size (N). The threshold levels are discussed below in Section 2.2.1. A subset is defined as N elements along one direction (time or frequency) of the 2D array which are to be tested, i.e. the window is one dimensional. Within this subset window the amplitudes are averaged, and if the average amplitude exceeds the threshold level χ(N), the elements within the window are flagged in a separate flag array (a NumPy float array with the same size and structure as the data array). In this flag array, 0.0 denotes a normal visibility, 1.0 signifies RFI found in the time direction, 2.0 RFI found in the frequency direction, and higher values RFI found in any subsequent runs of the flagger function.

Once all window positions for the subset size N in the time direction have been tested, all flagged elements are set to the next threshold level. This is a unique feature of the SumThreshold method which differs from normal thresholding methods. The algorithm then proceeds to the next subset size in a specified series and repeats. This subset series increases as $S_N$ = [1, 2, 4, 8, 16, 32, 64], which provides a good balance between flagging performance and computational performance (Offringa et al. 2010a).
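The windowing procedure for a single direction can be sketched as below. This is a simplified illustration under stated assumptions, not the SERPent implementation: the function name `sumthreshold_1d` is hypothetical, the thresholds are assumed to follow χ(N) = χ(1) / ρ^(log2 N) as in Offringa et al. (2010a), a boolean flag array is used instead of the float flag array described in the text, and the shortened subset series and toy data are chosen only to keep the example small.

```python
import numpy as np

def sumthreshold_1d(row, chi1, sizes=(1, 2, 4, 8), rho=1.5):
    """Flag one time (or frequency) slice; returns a boolean flag array."""
    work = np.array(row, dtype=float)  # working copy, the input is left untouched
    flags = np.zeros(work.size, dtype=bool)
    thresholds = [chi1 / rho ** np.log2(n) for n in sizes]
    for k, n in enumerate(sizes):
        if n > work.size:
            break
        new_flags = flags.copy()
        for start in range(work.size - n + 1):
            window = work[start:start + n]
            if np.all(np.isnan(window)):
                continue  # no data at all in this window
            if np.nanmean(window) > thresholds[k]:
                new_flags[start:start + n] = True
        flags = new_flags
        # Flagged samples are set to the next threshold level, so they can
        # no longer push a larger window over that level on their own.
        if k + 1 < len(sizes):
            work[flags & ~np.isnan(work)] = thresholds[k + 1]
    return flags

# Toy slice: a strong single-sample spike plus a weak feature four samples wide.
row = np.ones(16)
row[3] = 100.0
row[8:12] = 5.0
flags = sumthreshold_1d(row, chi1=10.0)
print(np.flatnonzero(flags))
```

In this toy case the spike is caught at N = 1, while the weaker feature stays below the N = 1 and N = 2 levels and is only caught once the window reaches N = 4.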

Once all the subset sizes in the series have been checked for RFI in the time direction, the process is repeated in the frequency direction in exactly the same manner. Running the SumThreshold method in both the time and frequency directions constitutes one full run of the algorithm. Subsequent full runs of the algorithm check the flag array for flagged amplitudes, which are then set to the next threshold level before the algorithm commences. This removes previously found RFI and helps the algorithm to search for any remaining weaker RFI.

In addition to the SumThreshold methodology, certain clauses have been added to prevent the algorithm from over-flagging the dataset. If any threshold level reaches the value mean + α × (variance estimate), where 0.0 < α < 5.0, the flagging run for that direction (time or frequency) stops. The default for this kickout clause in SERPent is α = 3.0; smaller values allow more flagging and higher values restrict the algorithm.

The flagging process can run multiple times at the cost of computational time. By default, an initial run with subset size N = 1 only is included to remove extremely high amplitude RFI. This is followed by two full runs of the algorithm (as described above): the first with subset sizes N up to 32 and the second with subset sizes N up to 256. The subset size N can be manually selected by the user for optimisation purposes.

The execution of these two full runs is conditional on two factors: that the maximum value within the array after each run is a certain factor of the median, and that flags exist from the previous run. On each subsequent cycle, all flagged visibilities from the previous run are set to the next threshold in the visibility array, so that they do not skew the subsequent statistics and any weaker RFI which may remain can be found. This is necessary because some RFI in the e-MERLIN commissioning data is found to exist over a range of amplitude levels, even as high as 10,000 times the astronomical signal.

2.2.1 Statistical Variance Estimators

The variance of a sample is an important estimator of statistical outliers which represent RFI. Some statistical methods are sensitive to extreme values whereas others are robust against them. A study into a range of methods and various estimators is described and tested by Fridman (2008). The median absolute deviation (MAD) and the median of pairwise averaged squares are the most effective estimators at removing outliers, although Fridman (2008) comments that neither is as efficient as other methods (i.e. both need a larger sample population). Since the sample size in any given observation from e-MERLIN will be sufficiently large, with over 500 channels per IF and over 100 time intervals for one scan giving > 50,000 values, this is not an issue. The breakdown point for MAD is also very high (0.5), i.e. almost half the data may be contaminated by outliers (Fridman 2008). MAD is adopted for this algorithm as an initial statistical estimator of the visibility population because of these robust properties, which reduce the bias of RFI on the sample. Again, Fridman (2008) stresses that the type and intensity of RFI, the type of observation and the method of implementation are important factors when deciding what estimate to use for any given interferometer.

The MAD is the variance estimator employed in the SERPent algorithm and is defined by Equation 2.1, where $\mathrm{median}_i(x_i)$ is the median of the original population. This median is subtracted from every element in the population and the absolute value of each difference is taken, creating a new modified sample of the same size as the original. The median of this new population is then calculated and multiplied by a constant scale factor of 1.4826 to make this estimate consistent with that of an expected Gaussian distribution (Rousseeuw and Croux 1993; Fridman 2008).

$$\mathrm{MAD} = 1.4826 \, \mathrm{median}_j \left\{ \left| x_j - \mathrm{median}_i(x_i) \right| \right\} \tag{2.1}$$
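Equation 2.1 translates directly into NumPy. The sketch below is illustrative only: the function name `mad` and the sample values are hypothetical, and NaN-aware medians are assumed so that missing visibilities do not enter the statistics.

```python
import numpy as np

def mad(values, scale=1.4826):
    """Scaled median absolute deviation, ignoring NaN entries."""
    values = np.asarray(values, dtype=float)
    centre = np.nanmedian(values)  # median_i(x_i)
    return scale * np.nanmedian(np.abs(values - centre))

# One extreme outlier and one missing sample (illustrative values only).
sample = np.array([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])
robust_spread = mad(sample)
classical_spread = np.nanstd(sample)
print(robust_spread, classical_spread)
```

The single outlier inflates the standard deviation to several tens, while the MAD stays at 1.4826, which illustrates why a robust estimator is preferred when RFI is present.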

The first threshold level χ(1) (i.e. when the size of the scanning window N = 1) is determined by the median of the sample, $\mathrm{median}_i(x_i)$, the variance estimator (MAD) and an aggressiveness parameter β, as shown in Equation 2.2 (Niamsuwan et al. 2005). Since the median is less sensitive to outliers, it is preferred to the traditional mean in this equation, and the MAD is preferred to the traditional standard deviation for similar reasons. If the data are Gaussian in nature then the MAD value will be similar to the standard deviation (and the median to the mean). A range of values for β has been tested for multiple observations and frequencies, and a stable value of around β = 25 was found empirically and is used as the default in the algorithm. Increasing the value of β reduces the aggressiveness of the threshold and decreasing the value increases the aggressiveness.

$$\chi(1) = \mathrm{median}_i(x_i) + \beta \, \mathrm{MAD} \tag{2.2}$$
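As a worked example of Equation 2.2, the following sketch computes χ(1) for a small sample and applies it as a single-sample (N = 1) test. The function name `first_threshold` and the sample values are hypothetical; β = 25 is the default quoted in the text.

```python
import numpy as np

def first_threshold(amplitudes, beta=25.0):
    """chi(1) = median + beta * MAD, ignoring NaN entries."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    median = np.nanmedian(amplitudes)
    mad = 1.4826 * np.nanmedian(np.abs(amplitudes - median))
    return median + beta * mad

# Five quiet samples, one strong outlier and one missing value (illustrative).
sample = np.array([1.0, 1.2, 0.8, 1.1, 0.9, 50.0, np.nan])
chi1 = first_threshold(sample)

# Comparisons against NaN evaluate to False, so missing data are never flagged.
with np.errstate(invalid="ignore"):
    outliers = np.flatnonzero(sample > chi1)
print(chi1, outliers)
```

Here the median is 1.05 and the MAD is about 0.222, giving χ(1) of about 6.61, so only the sample at index 5 exceeds the threshold.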

The subsequent threshold levels (i.e. window sizes N > 1) are determined by Equation 2.3, where N is the subset size and ρ defines how coarse the difference between threshold levels is. A value of ρ = 1.5 empirically works well for the SumThreshold method (Offringa et al. 2010a).

$$\chi(N) = \frac{\chi(1)}{\rho^{\log_2 N}} \tag{2.3}$$
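The threshold series can be tabulated with a few lines of Python. This sketch assumes that Equation 2.3 has the form χ(N) = χ(1) / ρ^(log2 N) given by Offringa et al. (2010a); the function name `threshold_levels` and the value χ(1) = 10 are illustrative only.

```python
import math

def threshold_levels(chi1, sizes=(1, 2, 4, 8, 16, 32, 64), rho=1.5):
    """Map each subset size N to its per-sample threshold chi(N)."""
    return {n: chi1 / rho ** math.log2(n) for n in sizes}

levels = threshold_levels(chi1=10.0)
for n, chi_n in levels.items():
    # N * chi(N) is the equivalent threshold on the summed amplitudes.
    print(f"N={n:3d}  chi(N)={chi_n:7.3f}  N*chi(N)={n * chi_n:8.3f}")
```

With ρ = 1.5 the per-sample threshold falls by a factor of 1.5 each time N doubles, while the equivalent threshold on the sum of the N amplitudes still rises by a factor of 2 / 1.5. Wider windows can therefore detect weaker but more extended RFI.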
