PV MODULE PERFORMANCE UNDER REAL-WORLD TEST
CONDITIONS–A DATA ANALYTICS APPROACH
by
YANG HU
Submitted in partial fulfillment of the requirements
For the degree of Master of Science
Thesis Adviser: Prof. Roger H. French
Department of Materials Science and Engineering
CASE WESTERN RESERVE UNIVERSITY
Case Western Reserve University
Case School of Graduate Studies
We hereby approve the thesis of
YANG HU
for the degree of
Master of Science
Prof. Roger French
Committee Chair, Adviser 11/21/2013
Prof. Roger French
Prof. David Matthiesen
Committee Member 11/21/2013
Prof. David Matthiesen
Prof. Jennifer Carter
Committee Member 11/21/2013
Prof. Jennifer Carter
Prof. Jiayang Sun
Committee Member 11/21/2013
Prof. Jiayang Sun
Dr. Timothy Peshek
Committee Member 11/21/2013
Dr. Timothy Peshek
Dr. Yifan Xu
Committee Member 11/21/2013
Dr. Yifan Xu
Dedicated to science
and the pursuit of progress.
List of Figures
Acknowledgements
Abstract
Chapter 1. Introduction
Lifetime and degradation science approach
Thesis overview
Chapter 2. Background and literature review
Previous research on real world PV modules’ performance
Standards
Data science
Clustering analysis
Chapter 3. Real-world Data Acquisition
SDLE SunFarm design
Global SunFarm network and Energy CRADLE
Chapter 4. Results: Real-world data analytics
Overview
Raw data validation
Exploratory Data Analysis (EDA) on Integrated Data
Clustering of AC Power Data
Data Assembly
Sub-sampling
Clustering of Solar Noon Time Performance Ratio Data
Chapter 5. Discussion
Data analytics
Performance at different relative positions
Performance of different brands
Power time series data clustering
Solar noon time performance ratio clustering
Chapter 6. Conclusions
Chapter 7. Future research
Improved SunFarm data quality and redundancy
Predictive model
Appendix A. List of 24 manufacturers and nameplate power
Appendix B. SunFarm network
SDLE SunFarm design & characteristics
Energy CRADLE SunFarm informatics
Appendix. Complete References
2.1 Pie chart of method used to determine Rd
2.2 PR subsetting
4.1 60 PV modules distribution
4.2 Baseline result of 20 brands
4.3 Power time series plot
4.4 Microinverter’s efficiency
4.5 Power curve comparison
4.6 Total power production of 20 brands
4.7 Normalized power production of 20 brands
4.8 Hierarchical cluster 1
4.9 Total within cluster sum of square 1
4.10 Power time series plot with clustering result
4.11 Normalized performance metrics
4.12 Noontime PR versus yI
4.13 PR in different climate condition
4.14 Pairs plot of PR 1
4.15 Pairs plot of PR 2
4.16 Total within cluster sum of square 2
4.17 Hierarchical cluster 2
4.18 PR time series plot with clustering result
5.1 Sensor cross check
5.2 Averaged performance ratio of 15 min around solar noon time
5.3 Comparison of normalized AC power in winter
5.4 Comparison of normalized AC power in summer
B.1 An overview of SDLE Sunfarm
B.2 Sample tray and concentrator
B.3 Dual axis tracker
B.4 Tracker frame
B.5 SunFarms within Ohio
B.6 Architecture of NO-SQL Hadoop system
B.7 Architecture of Energy CRADLE’s user front end
I would like to express my deepest gratitude for the patience, diligence, and resourcefulness of the entire team of researchers in the Solar Durability and Lifetime Extension (SDLE) center at Case Western Reserve University, Department of Materials Science and Engineering, headed by Prof. Roger H. French. Particular thanks go to Dr. Timothy Peshek and Mohammad A. Hossain, who helped build and maintain the SDLE SunFarm data acquisition system.
Thanks also go to the researchers at the Center for Statistical Research, Computing and Collaboration (SR2C), Department of Epidemiology & Biostatistics, for their coordinated efforts. Prof. Jiayang Sun and Dr. Yifan Xu’s guidance in statistics and data science was instrumental in this work.
Assistance and technical support from researchers in the Medical Informatics Division of EECS, especially Prof. G.Q. Zhang and his group members Yashwanth Reddy Gunapati and Tarun Jian, were extremely valuable in completing the data collection and Energy CRADLE part of this work.
I would also like to acknowledge the funding for this work. The SDLE center was established with funding from the Ohio Third Frontier, Wright Project Program Award Tech 12-004. The PV module case study was supported by the Bay Area Photovoltaic Consortium Prime Award No. DE-EE0004946, Subaward Agreement No. 60220829-51077-T.
Finally, I certify that there is no proprietary material in this thesis.
PV Module Performance Under Real-world Test Conditions–A Data
Analytics Approach
Abstract by YANG HU
In pursuit of a higher-fidelity understanding of the long-term degradation of long-lived technologies, such as photovoltaic (PV) systems, the framework of Lifetime and Degradation Science (L&DS) goes beyond initial qualification tests and investigates the underpinning mechanisms of degradation. L&DS concerns itself with the complex and multivariate signatures of the degradation process and with uncovering the fundamental physical mechanisms contributing to that degradation. In the case of PV modules, this effort requires extensive continuous monitoring of the modules’ power production and of climatic conditions. The responses of PV modules to the stressors of the real world are cross-correlated with the simulated and accelerated stressors placed on devices in a laboratory setting.
A unique, highly instrumented, outdoor test facility for PV materials, components, and systems, the Solar Durability and Lifetime Extension (SDLE) center’s SunFarm, was built for the purpose of better understanding the power degradation mechanisms of PV modules and materials. The SDLE SunFarm provides an apparatus for the collection of real-world time series data consisting of output power, weather, and insolation through appropriate grid-tied inverters.
The metrology package developed at CWRU for the collection of time series data provides a model to be implemented at external sites around the globe. To expand the ability to monitor PV systems’ performance under different climatic conditions, a global SunFarm Network was implemented among nine outdoor test facilities around the world in collaboration with academic institutions and industrial partners, including commercial power plants.
This thesis provides the initial data analytics on the first six months of data from 60 PV modules on the SDLE SunFarm, and serves as a model for the analytics of the full dataset from the global SunFarm Network. The data were first validated by characterization of the measurement apparatus, redundancy of measurement, and time-slewing according to minimization of the time cross-correlation function, using the free and open-source statistical software language and packages known as “R”. Clustering analysis in R (v3.0.1) [1], based upon the unfiltered AC power time series, showed that the data fell into six clusters, which represented the six different electrical sites of the SDLE SunFarm.
The data were then assembled and subsampled around solar noon time. The PV performance ratio (PR), a measure of a PV module’s output at a given incident power from sunlight, was used as an indicator of the modules’ working effectiveness. Correlations among the filtered subset of solar noon time PR data were discerned with hierarchical clustering analysis. K-means clustering was used to confirm the optimum number of clusters for the analytics. The clustering results differentiated modules on different physical sites and pointed out malfunctions of the PV mounting system and the incapacity of certain module brands. These results are useful for correlating different modules’ responses to stressors and those stressors’ effects on overall performance.
1
Introduction
Solar energy is becoming a more mature and mainstream source of electricity; the photovoltaic (PV) industry has experienced remarkable growth over the past decade. Worldwide, PV exceeded the 100 GW installed capacity mark in 2012 [2]. Germany led installations in 2012 with 7.6 GW, followed by China with between 3.5 and 4.5 GW [3,4]. In the US, 3.2 GW were installed during 2012, fourth in the world [2]. A solar project will be installed, on average, every four minutes in the US [5]. By the end of 2013, over 100,000 individual solar systems will be installed, exceeding 4.4 GW in capacity. In the academic world, although much PV research still focuses on gaining higher efficiencies and introducing new technologies, interest in lifetime and degradation has risen. At the 2010 Department of Energy Science for Energy Technology workshop [6], the topic of PV lifetime and degradation science (L&DS) was made a research priority, and its importance was reconfirmed in the Mesoscale Science Report [7]. A quantification of power decline over time, known as the degradation rate (Rd), is as important as initial performance. Especially for investigators and PV power plant owners, degradation rates essentially determine the lifetime of a PV system. A well-known disaster in the PV industry was Carrizo Plains, which was once the largest PV power plant in the world [8]. The installation failed after four years of operation because it exhibited a power degradation rate of 10% per year. Commercial PV panels claim a degradation rate lower than 1% per year, and usually come with a 25-year manufacturer warranty [9]. However, recent research drawing on over 2000 degradation rates reported around the world suggests that some PV systems exhibit a power degradation rate (Rd) higher than 1% per year. Additionally, the study observes that Rd is highly dependent on the operating environment [10].
1.1 Lifetime and degradation science approach
In order to predict the performance and lifetime of PV modules, a better understanding of degradation mechanisms and the influence of climate conditions is necessary. A performance and lifetime prediction (PLP) tool based on a reliability physics and prognostics approach was proposed, which requires indoor accelerated studies of PV materials, components, and systems together with real-world degradation and time series analysis of PV modules [11,12].
Real-world testing plays a critical role in researching degradation mechanisms, firstly because it is the typical operating environment for PV systems [13]. A real-world environment is a unique combination of different stressors that no indoor testing chamber is able to duplicate. Stressors in the real world include, but are not limited to, solar irradiance, rain, snow, salt fog, and soiling. Isolating the influence of a single stressor or of several stressors requires precise and redundant monitoring of climate conditions. Secondly, outdoor testing is the only way to correlate indoor accelerated testing with real-world performance. By developing metrics, metrology, and tools to quantify, compare, and cross-correlate the response of PV modules and components to a variety of stressors for both accelerated and outdoor testing, it is possible to link observed responses to particular stressors and determine quantitative rates of degradation.
1.2 Thesis overview
1.2.1 Background and literature review
This chapter reviews previous research on the performance of PV modules and PV power plants under real-world operating conditions and the data filtering methods applied. Two IEC standards used for data monitoring and data cleaning in this study are also reviewed. Finally, some background information on data science is provided.
1.2.2 Real World Data Acquisition
SDLE SunFarm’s design and the data acquisition methods applied to the case study of 60 PV modules on SDLE SunFarm are explained.
1.2.3 Results: Real-world data analytics
Descriptive data analysis and data clustering results are presented in this section.
1.2.4 Discussion
Discussion of the data analytic procedures for outdoor test data, comparison of modules’ performance under different climate conditions and at different positions relative to sunlight, and comparison of the initial indoor performance and outdoor performance of 20 brands are presented in this section.
1.2.5 Conclusions
1.2.6 Future research
An improved study protocol and a predictive model are planned for future research.
1.2.7 Appendix
A list of 24 PV models being studied in this thesis, SunFarm design and characteristics, and full references are presented in the appendices.
2
Background and literature review
2.1 Previous research on real world PV modules’ performance
2.1.1 PV module degradation
PV modules’ power output is known to decline over time, and this decline is quantified by the degradation rate (Rd) of a PV system. For investigators and power plant owners, knowing the degradation rate of PV modules is as important as knowing their initial efficiency. Jordan and Kurtz reviewed over 2000 degradation rate reports in 2011 [10]. All the reported degradation rates were determined using one of the four methods introduced below.
Current-voltage (I-V) curves, which are typically taken at discrete time intervals indoors with a solar simulator or outdoors with a portable I-V curve tracer, are used for determining Rd [14]. To take an indoor I-V curve, the PV module needs to be taken off the array, which is inconvenient for PV system owners. Outdoor I-V curve tracing requires a very clear sky. Fig. 2.1 shows the methodologies used to determine Rd. The use of indoor I-V curve tracing increased after the year 2000 due to the widespread use of indoor flash solar simulators. Neither of these methods provides continuous measurements; in fact, it would take a large effort to acquire I-V curve measurements on every PV module in a real PV power plant [15]. As a result, a large portion of the Rd measurements (40 out of 58, around 70%) were determined using only two, or even one, data point, which leads to low accuracy and high uncertainty [16]. Using continuous power data for Rd determination can improve the accuracy [10]. Photovoltaics for Utility Scale Applications (PVUSA) [17] and performance ratio (PR) [18] Rd measurement methods are in the [...] the PVUSA project. The PVUSA method provides an empirical relationship of the module’s AC output as a function of solar irradiance, ambient temperature, and wind speed. PR is the ratio of a module’s conversion efficiency in the field to the manufacturer-provided qualification test efficiency under standard test conditions (STC): 25 °C, 1 kW/m², and AM 1.5 irradiance.
Figure 2.1. Pie chart of the number of references deploying the indicated methods to determine degradation rates prior to and following the year 2000 [10].
The degradation rate is determined from the trend of the continuous data using time series analysis [19,20]. Both methods display strong seasonality that can affect reported rates and increase uncertainties. In practice, preferentially choosing data subsets, referred to as data filtering (for example, keeping data for sunny days only), can reduce the noise in the data [21]. However, data filtering usually eliminates or disregards the impact of different climate conditions on modules’ Rd.
2.1.2 Performance ratio filtering
Performance ratio (PR) reflects the PV system conversion efficiency in the field compared to that under qualification STC. Previous research reported that the typical PR of PV systems is about 70%-80%. A survey conducted by Nils Reich from the Fraunhofer Institute for Solar Energy Systems suggests that the PR of newly built PV systems in Germany has increased to 90% [22]. However, these reported PR values are all filtered and averaged with a certain methodology. Reich’s study only considered plane-of-array (POA) irradiance between 800 and 1000 W/m² and temperatures in either the 35-40 °C or the 40-45 °C bin. Outliers remained after this first round of filtering, so all data points deviating by more than ±5% from the median of the annual PR were discarded. Fig. 2.2 shows how the annual PR was determined from the already filtered data set. There are obvious outliers exceeding 110% at the beginning of the study, and additional outliers lower than 40% during the study. This range was selected because “there is no physical reason apart from malfunctions or measurement uncertainty, why PR at selected irradiance and temperature conditions should differ that much” [23].
Figure 2.2. PR subsetting of an entire plant, keeping only data within ±5% of the median annual PR [22].
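The two-stage filtering described above (an irradiance window and temperature bin, followed by a ±5% band around the median) can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the cited study: the function name `filter_pr`, the record keys `poa`, `temp`, and `pr`, and the choice to interpret ±5% as relative to the median are all assumptions made here.

```python
from statistics import median

def filter_pr(records, g_min=800.0, g_max=1000.0, t_min=35.0, t_max=40.0, tol=0.05):
    """Two-stage PR filter (hypothetical helper).

    records: list of dicts with keys 'poa' (W/m^2), 'temp' (deg C), 'pr'.
    Stage 1 keeps points inside the irradiance window and temperature bin.
    Stage 2 drops points deviating more than tol (relative) from the median.
    """
    selected = [r["pr"] for r in records
                if g_min <= r["poa"] <= g_max and t_min <= r["temp"] < t_max]
    if not selected:
        return []
    med = median(selected)
    # Assumption: the +/-5% band is taken relative to the median PR.
    return [pr for pr in selected if abs(pr - med) <= tol * med]

records = [
    {"poa": 900, "temp": 37, "pr": 0.80},
    {"poa": 850, "temp": 36, "pr": 0.82},
    {"poa": 950, "temp": 38, "pr": 0.81},
    {"poa": 900, "temp": 37, "pr": 1.12},  # outlier, removed in stage 2
    {"poa": 500, "temp": 37, "pr": 0.79},  # irradiance too low, removed in stage 1
    {"poa": 900, "temp": 42, "pr": 0.80},  # outside the temperature bin
]
print(filter_pr(records))
```

Only the three mutually consistent points survive in this toy data set, which illustrates how much of the raw record such a filter can discard.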
Another study, conducted by Jordan et al. from NREL, used three-step filtering [13]. POA irradiance was restricted to between 800 W/m² and 1200 W/m². Two further filters were applied, denoted as stability and outliers. The stability filter “eliminates data points when POA changes more than 20 W/m²/min and the module temperature more than 1 °C/min”. The outlier filter “uses DC/POA to eliminate snow days, partial shading conditions. Furthermore, the data for sunny days were selected by filtering for clearness index >0.5”. The clearness index of the sky is the ratio of measured global irradiance to the extraterrestrial beam irradiance on a similarly tilted surface [24]. After filtering, PR shows good precision, which benefits degradation determination. However, the filter keeps only data from stable, bright, sunny days and eliminates all other weather conditions. Filtered data were also averaged to eliminate seasonality, yet weather and season have important effects.
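A minimal Python sketch of the irradiance window, the stability filter, and the clearness index is given below. It is illustrative only: the function names and the tuple layout are hypothetical, and the sketch assumes that a point is dropped when either the irradiance or the temperature rate limit is exceeded, which is one possible reading of the quoted description.

```python
def clearness_index(g_measured, g_extraterrestrial):
    """Ratio of measured global irradiance to extraterrestrial irradiance."""
    return g_measured / g_extraterrestrial

def stability_filter(samples, g_min=800.0, g_max=1200.0, max_dpoa=20.0, max_dtemp=1.0):
    """Keep stable, high-irradiance samples (hypothetical helper).

    samples: time-sorted list of (minute, poa_w_m2, module_temp_c).
    A sample is kept when its POA irradiance lies in [g_min, g_max] and both
    rates of change since the previous sample are within the limits.
    Assumption: exceeding either rate limit removes the point.
    """
    kept = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur[0] - prev[0]
        if dt <= 0:
            continue  # skip duplicated or unordered timestamps
        dpoa = abs(cur[1] - prev[1]) / dt      # W/m^2 per minute
        dtemp = abs(cur[2] - prev[2]) / dt     # deg C per minute
        if g_min <= cur[1] <= g_max and dpoa <= max_dpoa and dtemp <= max_dtemp:
            kept.append(cur)
    return kept

samples = [(0, 900, 40.0), (1, 905, 40.2), (2, 960, 40.5),
           (3, 965, 42.0), (4, 970, 42.3), (5, 700, 42.0)]
print(stability_filter(samples))
```

In this toy series only the samples at minutes 1 and 4 pass: the others follow a jump in irradiance, a jump in temperature, or fall below the irradiance window.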
Recently, Hasselbrink et al. from SunPower Corporation developed a unique approach using “3 million module-years of live site data” [25]. Instead of determining yearly degradation from monthly averaged PR with a moving-average method, which ignores seasonality by smoothing out the variation, the performance index on the same day of each year was used to determine a degradation rate for each day of the year. The yearly degradation rate was then determined from the distribution of the 365 Rd values. This method included all climate conditions; however, isolating the influence of each climate stressor was not the focus of their study.
2.1.3 Influence of weather stressors
A PV system’s operating environment is a combination of multiple weather stressors, including temperature, humidity, radiation, and soiling. Interest has risen in investigating the influence of one or multiple stressors. Faiman, Ye et al. conducted an experiment on three different types of modules: mono c-Si, micromorph Si, and single-junction a-Si. Their performance under two distinct monsoon seasons throughout the year was modeled [26]. The results show that module efficiency is highly correlated to temperature. However, as a result of Singapore’s low latitude, module efficiency at noon is not strongly correlated to spectral effects, which arise from changes in air mass. Another study, focused on the soiling losses of solar systems, was conducted by a group of researchers at the University of California San Diego. They qualitatively modeled the losses caused by dust accumulating on module surfaces between two days of rain [27]. The research explicitly compared the average soiling losses of modules mounted at tilt angles of 0-5, 6-19, and greater than 20 degrees. Sites with tilt angles shallower than 5° showed soiling losses five times those of the other sites.
Seasonal variation, which has usually been neglected in the process of determining Rd, contains information about the influence of climate stressors on PV modules’ performance and reliability. The research reported here aims to extract more information by performing exploratory data analysis and clustering analysis on the entire AC power time series before sub-sampling or “filtering”.
2.2 Standards
2.2.1 Photovoltaic system performance monitoring
IEC 61724 describes general guidelines for the monitoring and analysis of the electrical performance of photovoltaic systems [28].
Meteorology. For monitoring climate conditions, the total irradiance in the plane of array ($G_I$) shall be measured in the same plane as the PV array by calibrated reference devices or pyranometers. Ambient air temperature ($T_{am}$) shall be measured at a location representative of array conditions using temperature sensors that are shielded from direct solar radiation. Wind speed ($S_W$) shall be measured at a height representative of array conditions.
Electrical parameters. The PV system electrical parameters, including output voltage ($V_A$), output current ($I_A$), and output power ($P_A$), represent the DC electrical characteristics. The utility grid electrical parameters include utility voltage ($V_U$), current to the utility grid ($I_{TU}$), current from the grid ($I_{FU}$), and power to the utility grid ($P_{TU}$). The standard also points out that “AC voltage and current may not need to be monitored in every situation. DC power can either be calculated in real time as the product of sampled voltage and current quantities or measured directly using a power sensor. If DC power is calculated, the voltage and current quantities shall be sampled not averaged.” This explains why the microinverter used in this study provides instantaneous DC voltage and DC current and averaged AC power.
System performance indices. System performance indices are derived parameters, calculated from the recorded monitoring data, that relate to system energy balance and performance. Performance indices normalize system performance, which makes PV systems of different configurations and at different locations comparable. These indices include yields, losses, and efficiencies. Yields are energy quantities normalized to rated array power. System efficiencies are normalized to array area. Losses are the differences between yields.
Daily mean yields. a) The array yield $Y_A$ is the daily array energy output per kW of installed PV array:
$$Y_A = E_{A,d}/P_0 = \tau_r \times \left(\sum_{day} P_A\right)/P_0 \tag{2.1}$$
This yield represents “the number of hours per day that the array would need to operate at its rated output power, $P_0$, to contribute the same daily array energy to the system as was monitored”.
b) The final PV system yield $Y_f$ is the portion of the daily net energy output of the entire PV plant that was supplied by the array, per kW of installed PV array:
$$Y_f = Y_A \times \eta_{LOAD} \tag{2.2}$$
This yield represents the number of hours per day that the array would need to operate at its rated output power to equal the monitored net daily yield. $\eta_{LOAD}$ is the load efficiency.
c) The reference yield $Y_r$ can be calculated by dividing the total daily in-plane irradiation by the module’s reference in-plane irradiance $G_{I,ref}$:
$$Y_r = \tau_r \times \left(\sum_{day} G_I\right)/G_{I,ref} \tag{2.3}$$
This yield represents the number of hours per day that the sun would need to be at the reference irradiance level to contribute the same incident energy as was measured in the field.
Normalized losses. Normalized losses are calculated by subtracting yields.
a) The “array capture” losses $L_c$ represent the losses due to array operation:
$$L_c = Y_r - Y_A \tag{2.4}$$
b) The balance of system (BOS) losses $L_{BOS}$ represent the losses in the BOS components:
$$L_{BOS} = Y_A \times (1 - \eta_{BOS}) \tag{2.5}$$
c) The PR indicates the overall effect of losses on the array’s rated output due to array temperature, incomplete utilization of the irradiation, and system component inefficiencies or failure:
$$PR = Y_f/Y_r \tag{2.6}$$
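As a worked illustration of Eqs. 2.1 to 2.4 and 2.6, the following Python sketch computes the daily yields, the array capture loss, and the PR from sampled power and irradiance. The function name `daily_yields`, the argument names, and the units (kW, kW/m², hours) are assumptions made for this example; $\tau_r$ is taken here to be the recording interval in hours, and the default reference irradiance of 1 kW/m² is likewise an assumption.

```python
def daily_yields(p_a_kw, g_i_kw_m2, tau_r_h, p0_kw, g_ref_kw_m2=1.0, eta_load=1.0):
    """Daily yields and PR from sampled data (hypothetical helper).

    p_a_kw:     array output power samples over one day, in kW
    g_i_kw_m2:  in-plane irradiance samples over the same day, in kW/m^2
    tau_r_h:    recording interval in hours (assumed meaning of tau_r)
    p0_kw:      rated array power in kW
    """
    y_a = tau_r_h * sum(p_a_kw) / p0_kw              # Eq. 2.1, array yield
    y_f = y_a * eta_load                             # Eq. 2.2, final yield
    y_r = tau_r_h * sum(g_i_kw_m2) / g_ref_kw_m2     # Eq. 2.3, reference yield
    return {
        "Y_A": y_a,
        "Y_f": y_f,
        "Y_r": y_r,
        "L_c": y_r - y_a,                            # Eq. 2.4, array capture loss
        "PR": y_f / y_r,                             # Eq. 2.6, performance ratio
    }

# Three hourly samples for a 0.25 kW module (made-up numbers).
res = daily_yields([0.1, 0.2, 0.1], [0.5, 1.0, 0.5], tau_r_h=1.0, p0_kw=0.25)
print(res)
```

With these made-up samples the module delivers 0.4 kWh against 2 kWh/m² of in-plane irradiation, giving an array yield of 1.6 h, a reference yield of 2 h, and a PR of 0.8.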
2.2.2 Procedures for temperature and irradiance corrections to measured current-voltage characteristics
IEC 60891 [18] introduces three correction procedures. For brevity, only the first procedure, which was used for the baseline data correction in this work, is introduced here. The second procedure is especially suited to large irradiance corrections (>20%). The third procedure is needed when the temperature coefficients of the PV device are unknown.
Correction procedure 1. The measured current-voltage characteristic shall be corrected to standard test conditions, 25 °C and 1000 W/m², by applying the following equations:
$$I_2 = I_1 + I_{SC} \cdot (G_2/G_1 - 1) + \alpha \cdot (T_2 - T_1) \tag{2.7}$$
$$V_2 = V_1 - R_S \cdot (I_2 - I_1) - \kappa \cdot I_2 \cdot (T_2 - T_1) + \beta \cdot (T_2 - T_1) \tag{2.8}$$
where $I_1$, $V_1$ are coordinates of points on the measured characteristics; $I_2$, $V_2$ are coordinates of the corresponding points on the corrected characteristics; $G_1$ is the irradiance measured with the reference device; $G_2$ is the irradiance at the standard or other desired irradiance; $T_1$ is the measured temperature of the test specimen; $T_2$ is the standard or other desired temperature; $I_{SC}$ is the measured short-circuit current of the test specimen at $G_1$ and $T_1$; $\alpha$ and $\beta$ are the current and voltage temperature coefficients of the test specimen in the standard or target irradiance for correction and within the temperature range of interest; $R_S$ is the internal series resistance of the test specimen; $\kappa$ is a curve correction factor.
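Equations 2.7 and 2.8 translate directly into code. The Python sketch below corrects a single measured (I, V) point; the function name `correct_iv_point` and the numerical values in the example are hypothetical and are not taken from the modules studied here.

```python
def correct_iv_point(i1, v1, g1, g2, t1, t2, i_sc, alpha, beta, r_s, kappa):
    """Apply correction procedure 1 (Eqs. 2.7 and 2.8) to one I-V point.

    Currents in A, voltages in V, irradiance in W/m^2, temperatures in deg C.
    alpha (A/degC) and beta (V/degC) are the temperature coefficients,
    r_s is the series resistance (ohm), kappa the curve correction factor.
    """
    i2 = i1 + i_sc * (g2 / g1 - 1.0) + alpha * (t2 - t1)                    # Eq. 2.7
    v2 = v1 - r_s * (i2 - i1) - kappa * i2 * (t2 - t1) + beta * (t2 - t1)   # Eq. 2.8
    return i2, v2

# Made-up example: a point measured at 800 W/m^2 and 45 degC, corrected to STC.
i2, v2 = correct_iv_point(i1=8.0, v1=30.0, g1=800.0, g2=1000.0, t1=45.0, t2=25.0,
                          i_sc=8.5, alpha=0.004, beta=-0.12, r_s=0.5, kappa=0.001)
print(round(i2, 4), round(v2, 4))
```

Applying the function to every point of a measured curve yields the corrected curve. When the measured and target conditions coincide, both correction terms vanish and the point is returned unchanged, which is a convenient sanity check.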
2.3 Data science
2.3.1 Data validation
Data validation is the process of ensuring that data analysis is based on a clean, correct, and useful data set [29]. Data validation includes data type checks (for example, whether the data are the power production of a PV module or the irradiance intensity in the module’s plane); file existence checks, which determine for which days data files are available for analysis; and cross-system consistency checks, which compare a data point with the same variable collected in a different system to ensure that it is consistent. In practice, data validation rules can be implemented through the automated facilities of a data dictionary [30], or by the inclusion of explicit application program validation logic [31].
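Two of the checks named above, the file existence check and the cross-system consistency check, can be sketched as explicit validation logic in Python. The helper names, the 5% relative tolerance, and the sample values are assumptions for illustration and do not reproduce the validation rules actually applied to the SunFarm data.

```python
def missing_days(available_days, expected_days):
    """File existence check: expected days for which no data file is available."""
    return sorted(set(expected_days) - set(available_days))

def cross_system_check(system_a, system_b, rel_tol=0.05):
    """Cross-system consistency check (hypothetical helper).

    Compares paired readings of the same variable from two systems and returns
    the indices where they disagree by more than rel_tol (relative difference).
    """
    flagged = []
    for idx, (x, y) in enumerate(zip(system_a, system_b)):
        scale = max(abs(x), abs(y))
        if scale > 0 and abs(x - y) / scale > rel_tol:
            flagged.append(idx)
    return flagged

expected = ["2012-11-25", "2012-11-26", "2012-11-27"]
available = ["2012-11-25", "2012-11-27"]
print(missing_days(available, expected))

# Irradiance from two pyranometers (made-up values, W/m^2).
print(cross_system_check([800, 500, 0, 1000], [810, 600, 0, 1000]))
```

The first call reports the one day without a file, and the second flags only the reading pair that differs by more than the tolerance.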
2.3.2 Exploratory data analysis
In outdoor testing of PV systems, test conditions are not controllable; the best we can do is to collect as much data as possible so as to quantitatively evaluate the climate stressors and the PV systems’ response. Exploratory data analysis (EDA) [32] encompasses and surpasses initial data analysis (IDA) [33]. While IDA focuses narrowly on hypothesis testing and checking assumptions, EDA encourages statisticians to explore the data, possibly formulating hypotheses that can guide further experiments and data collection. EDA usually summarizes the main characteristics of data by visual methods, including box plots, histograms, and multi-vari charts, which graphically display patterns of variation.
2.4 Clustering analysis
Data describe the characteristics of different PV systems. To understand the various responses and phenomena, one of the most important steps in data analysis is to classify or group data into a set of categories or clusters. Data objects classified into the same group or cluster should show similar properties according to some criterion. Classification processes can be supervised or unsupervised. Supervised classification maps data objects into predefined classes. Unsupervised classification is known as cluster analysis [34]. As described in the literature, “A direct reason for unsupervised clustering comes from the requirement of exploring the unknown natures of the data that are integrated with little or no prior information” [35]. The clustering algorithms discussed in this thesis are hierarchical clustering and k-means clustering.
Hierarchical clustering is a connectivity-based clustering algorithm. It is based on the core idea of “objects being more related to nearby objects than to objects farther away” [36]. To determine the similarity of two objects, the distance between them needs to be defined. Distance metrics include Euclidean distance, squared Euclidean distance, Manhattan distance, dynamic time warping, etc. Euclidean distance is the square root of the sum of squared differences between the coordinates of a pair of objects:
$$D_{XY} = \sqrt{\sum_k (x_{ik} - x_{jk})^2} \tag{2.9}$$
The standard Euclidean distance can be squared in order to place progressively greater weight on objects that are farther apart:
$$D^2_{XY} = \sum_k (x_{ik} - x_{jk})^2 \tag{2.10}$$
Manhattan distance, or city block distance, represents the distance between points in a city road grid. It is the sum of the absolute differences between the coordinates of a pair of objects:
$$D_{XY} = \sum_k |x_{ik} - x_{jk}| \tag{2.11}$$
The linkage criterion specifies whether two sets of objects can be joined into one by measuring the distances between pairs of objects drawn from the two sets.
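The three distance metrics of Eqs. 2.9 to 2.11 and one possible linkage criterion can be sketched in Python as follows. The function names are hypothetical, and single linkage (the smallest pairwise distance between two sets) is used only as an example of a linkage criterion; the text above does not prescribe a particular one.

```python
import math

def euclidean(x, y):
    """Eq. 2.9: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Eq. 2.10: sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """Eq. 2.11: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def single_linkage(set_a, set_b, dist=euclidean):
    """Example linkage criterion (assumption): smallest pairwise distance."""
    return min(dist(p, q) for p in set_a for q in set_b)

print(euclidean([0, 0], [3, 4]))          # 5.0
print(squared_euclidean([0, 0], [3, 4]))  # 25
print(manhattan([0, 0], [3, 4]))          # 7
print(single_linkage([[0, 0], [1, 0]], [[4, 4], [1, 3]]))  # 3.0
```

In agglomerative hierarchical clustering, the two sets with the smallest linkage value are merged at each step, and the choice of distance metric and linkage together determines the shape of the resulting dendrogram.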
K-means clustering is also known as centroid-based clustering [37]; it partitions objects so that objects assigned to the same cluster are close to one another. K-means clustering uses the Euclidean distance metric. The quantity used to evaluate the quality of a k-means clustering result is the within-cluster sum of squares (WCSS), which is the sum of the squared distances between each object and the centroid of its cluster. The goal is to assign each object to a cluster such that the total WCSS is minimized.
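A minimal Python sketch of k-means and of the total WCSS is shown below. It uses the common Lloyd iteration with caller-supplied initial centroids, which is an assumption made to keep the example deterministic; the function names and the toy points are hypothetical, and this is not the R implementation used for the results in this thesis.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, init_centroids, max_iter=100):
    """Lloyd-style k-means with given starting centroids (illustrative sketch)."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(
                [sum(col) / len(members) for col in zip(*members)] if members else old)
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

def wcss(points, labels, centroids):
    """Total within-cluster sum of squares."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels2, cents2 = kmeans(points, [[0, 0], [10, 10]])
labels1, cents1 = kmeans(points, [[0, 0]])
print(labels2, wcss(points, labels2, cents2))  # two tight clusters
print(labels1, wcss(points, labels1, cents1))  # one loose cluster
```

Computing the total WCSS for a range of cluster counts and looking for the point where it stops dropping sharply is one common way to choose the number of clusters.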
3
Real-world Data Acquisition
3.1 SDLE SunFarm design
The SDLE SunFarm, located on the west campus of CWRU, is about one acre in size. Fourteen high-precision Feina SF20 dual-axis trackers and 2 sites of adjustable tilted racking comprise the 16 electrical sites of the SDLE SunFarm. The 122 individual PV power plants include 120 PV modules with microinverters and two sets of 8 PV modules connected in series with string inverters. Output power is monitored through the inverters and fed back to the grid through a reversing relay. The 120 modules with microinverters were evenly separated into two groups of 60 samples each (3 module samples from each of 20 brands). The two groups use two different microinverter models for comparison. The first 60 microinverters installed were Enphase model M215. Electrical data were reported by Enphase’s embedded Enlighten data acquisition system. The metrology platform (shown in Table 3.1) includes insolation and weather monitoring. Minute-by-minute global horizontal irradiance (GHI) data were monitored by a Kipp & Zonen CMP6 pyranometer positioned near the fixed racking. A second pyranometer, a Kipp & Zonen CMP11, was also set on the horizontal plane and connected to a Daystar multi-tracer. Two Vaisala WXD520 weather stations were placed on the SunFarm to record wind speed, wind direction, rainfall, rain intensity, rain duration, and humidity. An anemometer was connected to the Master Control Unit of the trackers to monitor the wind load on the trackers. T-type thermocouples were used for backsheet temperature monitoring.
Instrument                        Attributes
Enphase micro-inverter            AC power, DC current, DC voltage
Kipp & Zonen pyranometer          Solar irradiance
Vaisala WXD 520 weather station   Temperature, Wind speed, Wind direction, Rainfall, Hail, Relative humidity
T-type thermocouple               Backsheet temperature
Table 3.1. Parameters monitored using the SDLE SunFarm metrology platform.
The data acquisition system consists of 17 networked Campbell Scientific CR-1000 dataloggers, each connected to an AM 16-32 multiplexer that extends the capacity of the datalogger to 32 differential measurement channels. The Campbell dataloggers monitor thermocouple and sensor outputs. Enphase micro-inverters use the Enphase Envoy Communications Gateway to connect each individual micro-inverter to the Enlighten monitoring and management software. Similarly, Solectria string inverters use the Solrenview system to collect data. Minute-by-minute data can be downloaded from Solrenview web servers.
3.2 Global SunFarm network and Energy CRADLE
Cleveland’s humid continental climate is not typical for PV degradation research. In order to study PV modules’ performance under different climatic conditions, a global SunFarm network was established among nine PV outdoor test beds across the world.
The purpose of the Energy Common Research Analytics and Data Lifecycle Environment (Energy CRADLE) is to create, for engineering and in particular lifetime science, the tools and protocols necessary to transform Big Data into information, which informs scientific knowledge to guide further analysis [30,38–40]. Energy CRADLE is tightly focused on serving the needs of handling and sharing data across the SunFarm network. Appendix B provides further details.
4
Results: Real-world data analytics
4.1 Overview
This chapter focuses on the real-world performance of 60 crystalline silicon PV modules from 20 different manufacturers exposed from November 25, 2012 to May 31, 2013 on the SDLE SunFarm. The purpose of this case study is to interpret the information in the data collected during the first 6 months of the SDLE SunFarm’s operation and to develop a data cleaning and data munging procedure. This analytic procedure will be integrated within Energy CRADLE and will guide the way for data processing on the cloud. This case study can also inform experimental design and prompt further research.
Fig. 4.1 is a blueprint of the SDLE SunFarm’s 16 electrical sites. The 60 modules studied are distributed on the sites marked with red boxes, specifically fixed rack Site 1 and tracker Sites 4, 6, 8, 12, and 14. All three modules from the same manufacturer are placed on the same site. On fixed rack Site 1, 18 modules from 6 brands are aligned horizontally, and modules of the same brand are placed adjacent to each other. On the trackers, which carry either 6, 9, or 12 modules each, modules of the same brand are evenly distributed on the same tracker frame.
In this study, the manufacturer information of the modules is withheld; each brand is referred to by a capital letter A through T. A module’s location is represented by a lower case f or t, short for “fixed rack” or “tracker”, respectively, followed by the site number. Each module has a sample number starting with “sa”. For example, in Fig. 4.3 power data was recorded from “A.f1.sa18259.00”, which is a module of brand A mounted on fixed rack Site 1 with sample number “sa18259.00”.
Figure 4.1. The 60 PV modules studied in this section are distributed on 6 different electrical sites, shown in red boxes in the plot. Site 1, the long site along the bottom, is a fixed tilt rack site; 18 modules are exposed on Site 1. The rest of the modules are exposed on even-numbered tracker sites. There are 12 modules on Site 4, 6 modules each on Sites 6 and 8, and 9 modules each on Sites 12 and 14.
4.1.1 Analytical methods
R [1], a free and open source programming language and software environment for statistical computing and graphics, was chosen as the data analysis tool. The data analytical methods applied to this case study consist of raw data validation, exploratory data analysis, data assembly, data subsampling, and clustering analysis.
4.2 Raw data validation
4.2.1 Module baseline
All the modules studied on the SDLE SunFarm were brand-new modules purchased on the open market. Before the modules were exposed to sunlight, the I-V characteristics of each module were recorded using a SPIRE SPI-4800 solar simulator and I-V curve tracer located at the Wright Center for Photovoltaic Innovation and Commercialization (PVIC) at the University of Toledo. In order to reduce the impact of instrument uncertainty, sixteen I-V curve measurements were acquired for each module. Additionally, the backsheet temperature and the irradiance intensity were recorded. Each measurement was corrected to standard test conditions (STC), specified as 25 °C and 1 kW/m², according to IEC 60891 [18]. The maximum power output ($P_{max}$) was taken from the 16 corrected I-V curve measurements to represent the initial performance of each module under STC. For the 60 modules, the standard deviations of the 16 $P_{max}$ measurements fall between 0.04% and 0.9%, which supports the reliability of the baseline results.
In order to evaluate the initial performance of each brand, the mean $P_{max}$ of the three modules of each brand was taken and normalized by dividing by the nominal power output of the module. Fig. 4.2 shows the normalized performance of each brand; the deviation among the three module samples is shown as error bars. The initial performance of most brands (all except H and Q) falls between 0.95 and 1.05, which means it meets the common market expectation of ±5% of nominal power.
Figure 4.2. Cross-sectional comparison of crystalline silicon PV modules from 20 different manufacturers. The Y axis is normalized power. The X axis shows the brands and locations. Brand names were replaced with letters A through T. The letters f and t represent fixed tilt rack and tracker. The maximum power output ($P_{max}$) of three modules of each brand was measured. The bars show the averaged normalized power of each brand. The standard deviation is plotted as error bars.
4.2.2 Power data
As introduced in the previous chapter, the electricity generated by all 60 modules is reported by the microinverters’ data acquisition system, Enlighten. Enlighten reports DC current, DC voltage, the microinverter’s internal temperature, and AC power. The data collection interval is 5 minutes. Prior to Energy CRADLE, power data was collected from Enlighten manually. Fig. 4.3 shows an example module’s AC power over six months. The daily variation of the power data is clearly visible in this figure. During the 180 days, there are several gaps in the data, partly due to three trips of the interconnection relay. Over the 180-day observation time, 99 days have power data reports.
Figure 4.3. Power production versus time of one PV module.
4.2.3 Microinverter’s efficiency
A microinverter’s conversion efficiency is given by the ratio of AC power to DC power. DC power is calculated as the product of DC current and DC voltage; AC power is provided in the data. Both DC current and DC voltage have two significant digits, while AC power is given as an integer. Fig. 4.4 shows the efficiency of the 60 microinverters. The majority of the efficiencies are between 95% and 99%, which is consistent with the efficiency provided by the manufacturer. However, there are 12 points that exceed 1.0, which is contrary to the laws of thermodynamics. Inspection of the raw data suggests that when the PV module’s DC output is low (around 1 W), the reported AC power tends to be the product of DC current and voltage “rounded up” to an integer. This rounding behavior explains why abnormal efficiency appears mostly on trackers 12 and 14; these two trackers did not track properly during the majority of the 180 days over which data was collected, so the modules on these trackers were exposed to low irradiance levels longer than the other modules.
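The efficiency check described above can be sketched in a few lines of Python (the thesis analysis itself was done in R; Python is used here only for illustration). The field names `i_dc`, `v_dc`, and `p_ac` and the two sample readings are hypothetical and are not values from the Enlighten export.

```python
def inverter_efficiency(dc_current, dc_voltage, ac_power):
    """Return AC/DC conversion efficiency, or None when DC power is not positive."""
    dc_power = dc_current * dc_voltage
    if dc_power <= 0:
        return None
    return ac_power / dc_power


def flag_unphysical(records, limit=1.0):
    """Return (record, efficiency) pairs whose efficiency exceeds the limit."""
    flagged = []
    for rec in records:
        eff = inverter_efficiency(rec["i_dc"], rec["v_dc"], rec["p_ac"])
        if eff is not None and eff > limit:
            flagged.append((rec, eff))
    return flagged


# Hypothetical readings: DC current (A), DC voltage (V), integer AC power (W)
records = [
    {"i_dc": 6.1, "v_dc": 29.0, "p_ac": 170},  # normal operation
    {"i_dc": 0.03, "v_dc": 22.0, "p_ac": 1},   # low DC input, AC reported as 1 W
]
suspicious = flag_unphysical(records)
```

In this toy input only the low-power reading is flagged, which mirrors the pattern described for the 12 abnormal points.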
Figure 4.4. The efficiency of 60 microinverters from 99 days of data collection after exposure on the SDLE SunFarm on fixed rack 1 (blue), tracker 4 (red), tracker 6 (yellow), tracker 8 (green), tracker 12 (purple), and tracker 14 (light blue).
4.2.4 Microinverter burst-mode
Further investigation of the “round up” effect shows that it arises because the Enphase microinverter works in “burst-mode” at low DC input. When a PV module is working under low irradiance, the DC output of the module is low, and therefore the DC to AC conversion efficiency drops [41]. The microinverter can scan the DC voltage at each AC cycle (1/60 second). When the microinverter detects that the DC input is lower than 30%, it charges a capacitor instead of converting DC to AC power. At the next cycle, the microinverter scans the PV module for its output again and adds that to the charge already stored in the capacitor bank from the previous cycle. If the combined power is high enough for a DC-to-AC conversion, the capacitor releases the charge. As a result, when the microinverter is “bursting” the stored-up charge, its AC output is higher than what the DC input alone would dictate. This explains why the microinverter appears to round up its AC power, and hence shows a higher efficiency, when it is working at a low irradiance level. However, it is also known that the AC power reported by Enlighten is an averaged value instead of an instantaneous measurement. The effect of “burst-mode” will be shown in the data subsampling part of this chapter.
4.2.5 Weather data
Insolation data. This study uses global horizontal irradiance (GHI), monitored by a Kipp & Zonen CMP6 pyranometer placed on a horizontal plane, as the reference. The sampling rate for irradiance data was determined by the datalogger’s scan period, which is 1 minute for all the dataloggers on the SDLE SunFarm. The irradiance incident on a PV module’s plane differs from that on the horizontal plane, so in order to convert GHI to plane of array (POA) irradiance, the assumption was made that all incident sunlight is direct sunlight.
Global horizontal irradiance (GHI) to plane of array (POA) irradiance conversion. In reality, global horizontal irradiance (GHI) consists of direct irradiance and diffuse irradiance. The direct component on a plane is proportional to the direct normal irradiance (DNI) through a sine function of the angle between the sunlight and the plane, while the diffuse irradiance varies between planes.
$$
\begin{aligned}
GHI &= DNI \times \sin\alpha + I_{dif} \\
POA_{tracker} &= DNI + I_{dif} \\
POA_{fixed} &= DNI \times \sin(\alpha + \beta) + I_{dif}
\end{aligned}
\tag{4.1}
$$
where $I_{dif}$ is the diffuse irradiance, $\alpha$ is the elevation angle of the sun, and $\beta$ is the tilt angle of the fixed rack, in this case $\beta$ equals 22.3°. The elevation angle is given as:
$$\alpha = 90^\circ - \theta + \delta \tag{4.2}$$
where $\theta$ is the latitude and $\delta$ is the declination angle, given as:
$$\delta = 23.45^\circ \sin[360^\circ \times (284 + d)/365] \tag{4.3}$$
where $d$ is the day of the year. Since neither DNI nor $I_{dif}$ data was available for the first 6 months, the assumption here is that all the incident light is direct sunlight, which simplifies the formulas to:
$$
\begin{aligned}
GHI &= DNI \times \sin\alpha \\
POA_{tracker} &= DNI \\
POA_{fixed} &= DNI \times \sin(\alpha + \beta)
\end{aligned}
$$
The estimated POA irradiance is expected to be higher than the actual irradiance incident on a module’s surface, because diffuse light is treated as direct light and is then amplified when GHI is converted to POA. In the future, this systematic error can be removed by placing an irradiance sensor on the plane of array and adding a direct irradiance sensor for DNI. In this case study, because the modules’ performance is compared cross-sectionally under the same conversion, the systematic error can be ignored.
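A minimal sketch of Equations 4.1 to 4.3 under the direct-sunlight-only assumption is given below. The function names are hypothetical, the latitude of roughly 41.5° N for Cleveland and the 600 W/m² GHI reading are assumed inputs, and Python stands in for the R code used in the thesis.

```python
import math

FIXED_TILT_DEG = 22.3  # tilt angle beta of the fixed rack, from the text


def declination_deg(day_of_year):
    """Declination angle delta in degrees (Equation 4.3)."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0))


def elevation_deg(latitude_deg, day_of_year):
    """Elevation angle alpha in degrees (Equation 4.2)."""
    return 90.0 - latitude_deg + declination_deg(day_of_year)


def poa_from_ghi(ghi, alpha_deg, tilt_deg=None):
    """Estimate POA irradiance from GHI, treating all light as direct.

    tilt_deg=None models a tracker plane (POA = DNI); otherwise a fixed
    plane tilted by tilt_deg (POA = DNI * sin(alpha + beta)).
    """
    if alpha_deg <= 0:
        raise ValueError("sun must be above the horizon")
    dni = ghi / math.sin(math.radians(alpha_deg))
    if tilt_deg is None:
        return dni
    return dni * math.sin(math.radians(alpha_deg + tilt_deg))


# Assumed example: day 80 of the year, assumed latitude 41.5 degrees,
# and a hypothetical GHI reading of 600 W/m^2
alpha = elevation_deg(41.5, 80)
poa_fixed = poa_from_ghi(600.0, alpha, FIXED_TILT_DEG)
poa_tracker = poa_from_ghi(600.0, alpha)
```

Because the division by sin α grows quickly at low sun angles, any diffuse light counted as direct is amplified most strongly at low elevation, which is the systematic error noted above.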
Additional climate data. Additional climate data including special climate events and
the cloudiness levels were collected from online open source historical data, such as Weather Underground (http://www.wunderground.com).
4.2.6 Data alignment
Data alignment is another important validation process for time series data. There are multiple data sources on the SunFarm, and different devices synchronize their clocks with different time sources. For example, the weather data used in this study was collected by dataloggers on the SDLE SunFarm. Time on these dataloggers was synchronized through controller software on a desktop computer in the SDLE lab. The power data was reported by the Enphase user interface, which synchronizes time with Enphase’s server. PV modules generate power almost instantaneously when sunlight hits the front surface; therefore, the time series data of power and irradiance should be highly correlated.
Weather and power data were aligned using the sample cross correlation function (ccf) in R. The ccf in R is defined as the set of sample correlations between time series X at time t + h (h = 0, ±1, ±2, ...) and time series Y at time t, where X is potentially a predictor of Y. If the two time series are perfectly aligned, the correlation is highest when h = 0 and drops as the absolute value of h increases. However, if the maximum correlation appears when h is positive, then X lags Y. If the correlation is maximum when h is negative, then X leads Y. Using the ccf, it was determined that the weather data led the power data by 3 minutes before March 10th, 2013. After daylight saving time began on March 10th, 2013, the time shift became 63 minutes. When compared against standard Greenwich Mean Time (GMT), the timestamps of the power data were found to be the more trustworthy. The weather data were therefore separated into two parts, before and after March 10, and the timestamps of each part were shifted accordingly.
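The lag search described above can be illustrated with a short sketch. It is not R’s `ccf` implementation (which normalizes with the full-series mean and variance); it simply computes the Pearson correlation between x[t + h] and y[t] on the overlapping samples for each lag h. All names and the synthetic signals are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def lagged_correlations(x, y, max_lag):
    """Correlation between x[t + h] and y[t] for h in [-max_lag, max_lag]."""
    n = len(x)
    result = {}
    for h in range(-max_lag, max_lag + 1):
        if h >= 0:
            xs, ys = x[h:], y[:n - h]
        else:
            xs, ys = x[:n + h], y[-h:]
        result[h] = pearson(xs, ys)
    return result


def best_lag(x, y, max_lag):
    """Lag h at which the correlation between x[t + h] and y[t] is largest."""
    corr = lagged_correlations(x, y, max_lag)
    return max(corr, key=corr.get)


# Synthetic example: y[t] equals x[t + 3], so the best lag should be h = 3
x = [math.sin(0.3 * t) + 0.5 * math.sin(0.11 * t) for t in range(200)]
y = [x[t + 3] if t + 3 < len(x) else 0.0 for t in range(len(x))]
shift = best_lag(x, y, max_lag=10)
```

Once the lag is known, the timestamps of one series can be shifted by that many sampling intervals before the two sources are merged.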
4.2.7 Malfunction of trackers
According to the maintenance record, tracker 4 did not experience any mechanical problems. All the other trackers experienced some malfunctions during the observation time. In order to determine the days when trackers malfunctioned, the power data from an example module on each tracker was plotted versus time and compared to the example power data from the functioning tracker 4. An example of the power curve from a stopped tracker (tracker 8) and the power curve from a normally operating tracker (tracker 4) is shown in Fig. 4.5.
The power curve on tracker 8 is not symmetric: the majority of the power was generated in the afternoons, which indicates that the tracker was stopped facing west. By comparing the curves in this manner, the malfunctioning dates of each tracker were determined. Tracker 6 was stopped for 5 days in May. Tracker 8 was stopped for 10 days in April and May. Tracker 12 was off tracking until its gear counter was replaced. Tracker 14 was not tracking most of the time because of a gear stopper issue.
Figure 4.5. A comparison of AC power generated by a module on tracker 8 (top) and tracker 4 (bottom) from May 12th to May 18th, 2013, when tracker 8 had stopped functioning and was facing west.
4.3 Exploratory Data Analysis (EDA) on Integrated Data
4.3.1 Total power production
A common way of evaluating a PV module’s performance is to compare the total power production. A module’s power production, in this case, is affected not only by its nominal power rating but also by the module’s mounting system. The averaged total power production of each brand is shown in Fig. 4.6. The highest power production in 99 days is 60.36 kWh, from brand G on tracker 4, and the lowest, 28.79 kWh, is from brand T on tracker 14. The four brands on tracker 4 (red) produced, on average, about 40% more power than the six brands on the fixed rack (blue). Modules on the other trackers produced less power than those on tracker 4 by varying degrees. Generally, modules on a tracker should produce more power than those on a fixed rack when the tracker is operating correctly. However, except for the trackers on Sites 4 and 6, the modules on trackers produced less power on average than the modules on fixed rack Site 1. In order to compare the performance of different brands on the same site, total power needs to be normalized.
Figure 4.6. The bar graph shows the averaged total power production of each brand from fixed rack 1 (blue), tracker 4 (red), tracker 6 (yellow), tracker 8 (green), tracker 12 (purple), and tracker 14 (light blue). The standard deviations are plotted as error bars.
4.3.2 Normalized power yield
Normalized power yield is defined as the ratio of the total power production to the product of nominal power and exposure days (i.e., 99 days) [28]. Normalized power yield is equivalent to the number of hours per day that the PV plant would have to operate at nominal power output to produce the same energy. Normalized power yield is an important factor in choosing PV modules. In the PV industry, a module’s price is quoted in dollars per watt, so modules with a higher normalized power yield are more cost effective. Fig. 4.7 shows the normalized power production of each brand. As the trackers experienced different problems, it is not valid to compare the performance of two brands on different sites. However, the ranking of brands within one site demonstrates the relative performance of different brands. For example, on the fixed rack, shown in blue, brand B’s total power production is the lowest but its normalized power production is the highest among the 6 brands. This indicates that under the same environment, including temperature and irradiance conditions, brand B performs better than the other brands on Site 1.
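As a worked illustration of the definition above, the sketch below computes the normalized power yield in hours per day. The 60.36 kWh total and the 99 exposure days come from the text; the 250 W nameplate rating is a hypothetical value chosen only for the example, since the actual ratings are listed in Appendix A.

```python
def normalized_yield(total_energy_kwh, nominal_power_w, exposure_days):
    """Equivalent hours per day at nominal power (kWh per kW per day)."""
    nominal_power_kw = nominal_power_w / 1000.0
    return total_energy_kwh / (nominal_power_kw * exposure_days)


# 60.36 kWh in 99 days is from the text; the 250 W rating is an assumption
example_yield = normalized_yield(60.36, 250.0, 99)
```

With these inputs the result is roughly 2.44 equivalent hours per day; a module with a smaller nameplate rating and slightly lower total energy can still score higher, which is the situation described for brand B.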
Figure 4.7. The bar graph shows the averaged normalized power production of each brand from fixed rack 1 (blue), tracker 4 (red), tracker 6 (yellow), tracker 8 (green), tracker 12 (purple), and tracker 14 (light blue). The standard deviations are plotted as error bars.
4.4 Clustering of AC Power Data
Section 4.3 showed that integrated performance, in terms of total power production and normalized power production, varies among the 20 different brands; however, brands on the same site perform similarly. In order to determine whether modules of the same brand always perform similarly, it is necessary to check the similarity of the 60 modules’ AC power time series data. As mentioned in Chapter 2, a statistical way of checking the similarity of multiple observations is clustering analysis. A hierarchical clustering analysis (HCA) was conducted on all of the AC power time series data from the 99 days of observed data. There are 9698 observations for each module. A dendrogram that uses the Euclidean distance metric and the average linkage criterion is shown in Fig. 4.8. The distance metric and linkage criteria will be discussed in Chapter 5. Red boxes in the plot show the result of dividing the modules into six groups. The grouping result reflects exactly the 6 physical sites on the SDLE SunFarm. Although there are some exceptions, most of the modules of the same brand are close to each other in distance. In Fig. 4.8, from left to right, the 6 groups consist of modules from tracker 14, tracker 12, tracker 8, fixed rack 1, tracker 6, and tracker 4, respectively.
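The grouping step can be illustrated with a small, self-contained sketch of agglomerative clustering with Euclidean distance and average linkage. This is a naive illustration and not the R `hclust` routine used in the thesis; stopping the merging when k clusters remain corresponds to cutting the dendrogram into k groups. The toy series and all function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def average_linkage_clusters(series, k, dist=euclidean):
    """Merge the two closest clusters (average linkage) until k clusters remain."""
    n = len(series)
    d = [[dist(series[i], series[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy "power series" for six modules on three hypothetical sites
toy_series = [
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0],    # site X
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0],    # site Y
    [10.0, 0.0, 10.0], [10.0, 0.2, 10.0],  # site Z
]
groups = average_linkage_clusters(toy_series, k=3)
```

In the toy data the three recovered groups coincide with the three hypothetical sites, analogous to the six groups matching the six electrical sites.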
Figure 4.8. Hierarchical cluster analysis of 60 modules based on all AC power time series data. The clusters were generated using “hclust” in the “stats” package in R (v3.0.1). The distance matrix is computed using the Euclidean method. The distance between sets of observations is defined with the average linkage method. When the dendrogram is divided into 6 groups, each group includes exactly the modules physically located on the same electrical site.
However, six is an arbitrary number chosen from experience with the data. In order to confirm that the result of HCA is valid, the k-means algorithm was used. K-means clustering partitions the observations into k clusters so as to minimize the total within-cluster sum of squares (WCSS). In this case, each sample (PV module) has a set of 9698 observations and is treated as a 9698-dimensional vector. A commonly used method for determining the k value that gives the most reasonable result is the elbow method [42]. The elbow method is applied to a plot of WCSS as a function of the number of clusters, k. The best clustering result occurs at the point where adding another cluster no longer substantially improves the model of the data. This point should be chosen as the cluster number, hence the “elbow criterion”. The WCSS as a function of k is plotted in Fig. 4.9. The elbow point is at k equal to 6, which is marked with a red circle. The k-means clustering result with k equal to 6 is consistent with the result of HCA.
Figure 4.9. Total within-cluster sum of squares (WCSS). The elbow point occurs when k is equal to 6.
In order to visually confirm that the AC power time series falling into each group are similar to each other, Fig. 4.10 plots the AC power output of the 60 PV modules over 99 days according to both the k-means and the hierarchical clustering results. The 60 AC power time series were separated into 6 groups, and modules from the same brand are shown in the same color. Groups 1 through 6 correspond to tracker 14, tracker 12, tracker 8, fixed rack 1, tracker 6, and tracker 4, respectively. Fig. 4.10 confirms that the shape and magnitude of the AC power time series in each cluster are similar.
Figure 4.10. AC power of 60 modules grouped by hierarchical clustering result. The color of each curve differentiates module brands.
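The WCSS-versus-k survey behind the elbow method can be sketched as follows. This is a plain Lloyd’s-algorithm k-means with a deterministic farthest-point initialization, chosen here only so the example is reproducible; it is an assumption and not the initialization used by R’s `kmeans`. The toy points stand in for the 9698-dimensional module vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def farthest_point_init(points, k):
    """Pick the first point, then repeatedly the point farthest from chosen centers."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans_wcss(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the total within-cluster sum of squares."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[idx].append(p)
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append([sum(col) / len(group) for col in zip(*group)])
            else:
                new_centers.append(center)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data: three well-separated blobs of three points each
points = [
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 0], [20, 1], [21, 0],
]
wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 5)}
```

For the toy data the WCSS falls steeply up to k = 3 and only marginally afterwards, which is the “elbow” shape; in the thesis data the corresponding bend was read off the plot at k = 6.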
4.5 Data Assembly
4.5.1 Performance metrics
Up to this point, the analysis was based on the 60 modules’ AC power data. In order to correlate the modules’ power output with climate conditions, climate data and power data were assembled in the following way. To compare the performance of 60 modules with different nominal power and different mounting systems, a normalized analysis and presentation was introduced based on IEC 61724 [28] and the work of H. Haeberlin et al. [43].
Normalized energy yields and losses. The definitions of the six performance indices introduced in IEC 61724 were discussed in Chapter 2. The 60 modules being studied each work with an individual microinverter instead of forming a PV array, and the AC power generated is fed directly back to the grid, so each module is one PV plant. Data was collected on a minute basis instead of on a daily basis: power data was collected every 5 minutes and weather data was collected every minute. Therefore, it is necessary to modify the performance metrics. The new performance indices are normalized instantaneous quantities. Irradiance yield, $Y_I$, is the POA irradiance normalized to the reference irradiance of 1 kW/m² (Equation 4.5).
$$Y_I = POA / G_0, \quad G_0 = 1\ \mathrm{kW/m^2} \tag{4.5}$$
DC yield, $Y_{DC}$, is the DC power normalized to a module’s nominal power (Equation 4.6). DC power was calculated by multiplying DC current by DC voltage.
$$Y_{DC} = P_{DC} / P_0 \tag{4.6}$$
AC yield, $Y_{AC}$, is the AC power normalized to a module’s nominal power (Equation 4.7).
$$Y_{AC} = P_{AC} / P_0 \tag{4.7}$$
Capture losses, $L_c$, are the part of the incident solar power not captured by the solar cell (Equation 4.8).
$$L_c = Y_I - Y_{DC} \tag{4.8}$$
System losses, $L_s$, are the DC-AC inverter conversion losses (Equation 4.9).
$$L_s = Y_{DC} - Y_{AC} \tag{4.9}$$
Performance ratio (PR) is the ratio of the useful energy fed back into the grid to the energy which would be generated by an ideal PV module with a cell temperature of 25 °C under the same irradiance.
$$PR = Y_{AC} / Y_I \tag{4.10}$$
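The six indices defined in Equations 4.5 to 4.10 can be computed together from one record, as in the sketch below. The function name, the dictionary keys, and the sample values (800 W/m² POA irradiance, 6 A, 30 V, 172 W AC, 250 W nameplate) are hypothetical and serve only to make the arithmetic concrete.

```python
G0_W_PER_M2 = 1000.0  # reference irradiance G0 = 1 kW/m^2


def performance_metrics(poa_w_m2, dc_current, dc_voltage, ac_power_w, nominal_power_w):
    """Return Y_I, Y_DC, Y_AC, L_c, L_s, and PR for one instantaneous record."""
    y_i = poa_w_m2 / G0_W_PER_M2
    y_dc = dc_current * dc_voltage / nominal_power_w
    y_ac = ac_power_w / nominal_power_w
    return {
        "Y_I": y_i,
        "Y_DC": y_dc,
        "Y_AC": y_ac,
        "L_c": y_i - y_dc,                     # capture losses
        "L_s": y_dc - y_ac,                    # system (inverter) losses
        "PR": y_ac / y_i if y_i > 0 else None,  # undefined without irradiance
    }


# Hypothetical record for illustration only
metrics = performance_metrics(800.0, 6.0, 30.0, 172.0, 250.0)
```

For this record the yields are 0.80, 0.72, and 0.688, the losses are 0.08 and 0.032, and PR is 0.86; note that the two losses and the AC yield add back up to the irradiance yield.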
4.5.2 Solar time
Local noon is usually not the time when the sun is highest in the sky, because of the Earth’s orbit and human adjustments such as time zones and daylight saving time. Noon in local solar time (LST) is defined as the time when the sun is highest in the sky for a particular location, which is not necessarily local noon [44]. In order to better understand the modules’ performance in relation to solar motion, the timestamps of the data need to be converted from local time (LT) to LST. The local standard time meridian (LSTM) is a reference meridian used for a particular time zone and is similar to the Prime Meridian (longitude = 0°), which is used for Greenwich Mean Time (GMT) [45]. The formula for calculating LSTM is given by Equation 4.11:
$$LSTM = 15^\circ \times \Delta T_{GMT} \tag{4.11}$$
where $\Delta T_{GMT}$ is the difference of the local time from GMT in hours. $\Delta T_{GMT}$ equals −4
time (EoT) corrects for the eccentricity of the Earth’s orbit and the Earth’s axial tilt (Equation 4.12).
$$EoT = 9.87\sin(2B) - 7.53\cos(B) - 1.5\sin(B) \tag{4.12}$$
where
$$B = 360^\circ (d - 81)/365$$
in degrees, and $d$ is the day of the year. The net time correction factor (TC) accounts for the variation of LST within a given time zone (Equations 4.13 and 4.14).
$$TC = 4(Longitude - LSTM) + EoT \tag{4.13}$$
$$LST = LT + TC/60 \tag{4.14}$$
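Equations 4.11 to 4.14 chain together as in the sketch below, with times in decimal hours, EoT and TC in minutes, and longitude in degrees (negative west of the Prime Meridian). The function names are hypothetical, and the example uses a site exactly on the −75° meridian with a −5 hour offset so that the result is easy to check by hand; these are not the SunFarm coordinates.

```python
import math


def equation_of_time_min(day_of_year):
    """Equation of time in minutes (Equation 4.12)."""
    b = math.radians(360.0 * (day_of_year - 81) / 365.0)
    return 9.87 * math.sin(2 * b) - 7.53 * math.cos(b) - 1.5 * math.sin(b)


def local_solar_time(local_time_h, longitude_deg, gmt_offset_h, day_of_year):
    """Convert local time (decimal hours) to local solar time (Equations 4.11-4.14)."""
    lstm = 15.0 * gmt_offset_h                                   # Equation 4.11
    tc = 4.0 * (longitude_deg - lstm) + equation_of_time_min(day_of_year)  # 4.13
    return local_time_h + tc / 60.0                              # Equation 4.14


# Assumed example: clock noon on day 81 at longitude -75 degrees, GMT-5
lst_example = local_solar_time(12.0, -75.0, -5, 81)
```

On day 81 the term B is zero, so EoT reduces to −7.53 minutes and clock noon corresponds to about 11.87 h in local solar time at this assumed location.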
The six performance metric variables of one single module over one day in LST are shown in Fig. 4.11. The $Y_I$, $Y_{DC}$, $Y_{AC}$, $L_c$, $L_s$, and PR curves are plotted in black, green, blue, yellow, brown, and red, respectively. On a clear sunny day both irradiance and PR show a dome-shaped curve. The PR curve has a comparatively flat top, which suggests that PR and POA irradiance are highly correlated and that PR is less sensitive to POA irradiance at high levels (over 750 W/m²).
4.6 Sub-sampling
4.6.1 Solar noon time performance ratio
Figure 4.11. Normalized performance of one single module (D.f1.sa18286.00) on the fixed rack on December 12, 2012. The variables are PR (red), $Y_I$ (black), $Y_{DC}$ (green), $Y_{AC}$ (blue), $L_c$ (yellow), and $L_s$ (brown).
From the EDA plot of the performance metrics (Fig. 4.11), it is clear that PR is correlated with POA irradiance. PR can reach up to 0.85 on the 22.3° fixed rack and 0.90 on the tracker at solar noon, when the POA irradiance is high. In order to reduce the volume of data and reduce temporal fluctuations, the PR data is subset to ±15 minutes around solar noon. The sampling interval of the PR is 5 minutes, so there are about 7 data points within this 30-minute window. During the 99 days, there are roughly 700 observations for each module, which is still statistically sufficient for further analysis.
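The solar noon sub-sampling can be sketched as a simple filter on timestamps that have already been converted to local solar time. Here each record is a hypothetical `(solar_minute, pr)` pair, with the time expressed as minutes after solar midnight; the constant PR value of 0.8 is a placeholder, not measured data.

```python
SOLAR_NOON_MIN = 12 * 60  # solar noon in minutes after solar midnight


def noon_subset(records, half_width_min=15):
    """Keep (solar_minute, pr) records within +/- half_width_min of solar noon."""
    return [
        (t, pr) for t, pr in records
        if abs(t - SOLAR_NOON_MIN) <= half_width_min
    ]


def mean_pr(records):
    """Average PR of a list of (solar_minute, pr) records."""
    return sum(pr for _, pr in records) / len(records)


# One hypothetical day sampled every 5 minutes with a placeholder PR of 0.8
day = [(t, 0.8) for t in range(0, 24 * 60, 5)]
subset = noon_subset(day)
daily_noon_pr = mean_pr(subset)
```

With a 5-minute sampling interval the window from 11:45 to 12:15 contains 7 samples, matching the count quoted above.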
4.6.2 Snowy days
EDA on the solar noon time PR subset was performed by plotting PR versus $Y_I$ for each module. An example of a PR vs. $Y_I$ plot is shown in Fig. 4.12. Three groups of abnormal data points were identified.
Figure 4.12. Solar noon time PR of a module (C.f1.sa18328.00) on the fixed rack versus $Y_I$. The vertical blue line marks a POA irradiance of 1200 W/m², and the red horizontal line marks a PR of 1.0. Group 1 contains the points that have a PR greater than one. Group 2 contains the points at an irradiance higher than 1200 W/m². Group 3 contains the points that show zero PR.
Group 1. In theory, the PR can never exceed one. In the literature and standards [22, 28, 43], PR is normally reported to be 0.8 to 0.85 on average. The abnormal points in Group 1 have a calculated PR greater than one. Inspection of the raw data, including AC power, DC power, and POA irradiance, at each of these data points revealed two potential causes. First, these data points appeared when the irradiance changed quickly. As discussed previously, power data was reported by the Enphase Envoy system, and its method of data acquisition is unknown. It could be that Enphase does not report instantaneous power but an averaged value. Using averaged power data together with an instantaneous POA measurement for the PR calculation introduces a systematic error. Another cause of the high performance could be the microinverter working in burst-mode, as discussed earlier.
Group 2. Solar radiation outside the Earth’s atmosphere is 1.36 kW/m², and the global irradiance on a tracker plane at noon in Denver, Colorado is less than 1.2 kW/m². The abnormal data points in Group 2 show a POA irradiance on the fixed rack (22.3° tilt plane) higher than 1.2 kW/m². This is a systematic error introduced by converting GHI to POA. Without direct global POA irradiance monitoring or direct sunlight monitoring, it is not possible to correct the error. However, since the same irradiance conversion method is used for all modules, it will not affect the cross-sectional comparison of the modules’ performance.
Group 3. The PR of the modules was small or equal to zero even when the irradiance was not very low, which indicates that the module may have been covered. Moreover, this only appeared on certain days in December and January, and only on some modules, mainly on the fixed rack. Given the climate, this suggests that it was caused by snow coverage.
Snowy days. In order to document the relationship between the low performance of modules and snowy weather, historical climate condition data was collected from a third party web site. PR time series were plotted for each module, and data points on snowy days were highlighted in red and blue [46]. Fig. 4.13 shows the PR of six modules of two different brands (three modules from each brand). Low PR appears only during or after snow or fog-snow days, which supports the conclusion that the abnormal Group 3 points were most likely caused by snow coverage. All snow-covered dates were determined by plotting the PR of all 60 modules versus time, and the snowy-day data were assembled as a subgroup.
Figure 4.13. Solar noon time PR of six modules, data points when there was snow or fog-snow were highlighted in red and blue. Three modules on the top row are from brand A placed on fixed rack. Three modules on the bottom row are from brand G placed on tracker 4. All three A brand modules showed low/zero performance during or after some snowy days, while the other three modules on tracker do not.
4.7 Clustering of Solar Noon Time Performance Ratio Data
As discussed in Section 4.6, the PV modules’ performance ratio data was subset to ±15 minutes around solar noon. In order to further reduce the data volume, the average of each day’s PR data was taken. After removing the snowy days, there are 75 days of PR data left; thus 75 data points represent the solar noon time performance of each module. Since the relationships among the 60 modules’ noon time performance are not intuitive, an EDA can lead to a better understanding of the data. A pairs plot is commonly the first step of EDA.
Figure 4.14. A pairs plot of solar noon time PR of three modules of the same brand. Within each row, the Y axis of every panel is the PR of one module. Within each column, the X axis of every panel is the PR of one module. Each module’s sample number is shown in the diagonal boxes. The correlation coefficient of each X, Y pair is calculated and represented by varying shades of green; the darker the green, the higher the correlation coefficient.
The pairs plot takes the PR of one module as the X coordinate and the PR of another module as the Y coordinate. If the two modules under comparison performed the same at each observation time, then all data points are expected to lie on a diagonal line. Fig. 4.14 is a plot of the solar noon time PR of three modules of the same brand. In order to better visualize the correlation of the X and Y coordinates, the correlation coefficient of the two modules is represented by varying shades of green; a darker green indicates a higher correlation coefficient between the two PR series. In Fig. 4.14, the performance of module G.t4.sa18211.00 is over 99% correlated with that of G.t4.sa18210.00. Only the pairs plot of the first ten modules is shown in Fig. 4.15, as space is limited; however, a pairs plot of all 60 modules was studied. In the pairs plot of all 60 modules, a pattern of green and gray panels helps visualize that the modules fall into different groups.
Quantitatively grouping the modules requires a Pearson distance matrix, which uses the correlation coefficient to define the distance between different observations [47]. Also, since there is no domain knowledge suggesting the number of clusters, a k-means clustering analysis was used to determine the number of clusters. Fig. 4.16 shows the WCSS as a function of the number of clusters, k, and there is a clear “elbow point” when k equals 5. An HCA dendrogram of the solar noon time performance ratio using the Pearson distance matrix and average linkage criterion is shown in Fig. 4.17. Modules are divided into 5 groups using the cutree function. The first group on the left consists of all the modules on the fixed rack. The second group contains all modules on trackers 4, 6, and 8 except for the three modules of brand M. The third group contains all modules on tracker 14, and the fourth group contains all modules on tracker 12. The last group on the right contains only the three modules of brand M.
Figure 4.15. Pairs plot of solar noon time PR of ten modules. Within each row, the Y axis of every panel is the PR of one module. Within each column, the X axis of every panel is the PR of one module. Each module’s sample number is shown in the diagonal boxes. The correlation coefficient of each X, Y pair is calculated and represented by color; the darker the green background, the higher the correlation coefficient. The first three modules show strong correlations (over 99%) among themselves. They also show fairly strong correlations (about 90%) with the next three modules at a different location. However, the first three modules show low correlation with the last four modules, with a correlation coefficient lower than 30%.
The time series of each module are plotted according to the HCA result (Fig. 4.18). There are several gaps, since the data was not continuous due to snow and noncontiguous AC power data. The variability of the curves within the same group is mainly caused by the noncontiguous nature of the data. The largest dispersion of data curves appears in the last group. Before mid-February, the data curves of its three modules are highly varied.
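A correlation-based distance matrix of the kind used for this clustering can be sketched as follows. The text only states that the Pearson distance uses the correlation coefficient; defining the distance as 1 − r is a common convention and is an assumption here, as are the function names and the three short toy PR series.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def pearson_distance_matrix(series):
    """Pairwise distance d = 1 - r (assumed definition of the Pearson distance)."""
    n = len(series)
    return [
        [1.0 - pearson_r(series[i], series[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy daily noon-PR series for three hypothetical modules
pr_series = [
    [0.80, 0.85, 0.70, 0.90],
    [0.81, 0.86, 0.71, 0.91],  # same shape as the first, offset by 0.01
    [0.90, 0.70, 0.85, 0.60],  # moves opposite to the first
]
dist = pearson_distance_matrix(pr_series)
```

Unlike the Euclidean distance used for the AC power data, this distance ignores constant offsets between modules and responds only to whether their day-to-day PR variations move together.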
Figure 4.16. Total within-cluster sum of square (WCSS). The elbow point occurs when k equals 5.
Figure 4.17. The hierarchical clustering of 60 modules based on solar noon time PR time series data. The distance matrix is computed using the Pearson method. The distance between the sets of observations is defined with the average linkage method. From left to right, the first group includes all modules on the fixed rack; the second includes all modules from trackers 4 and 6 and brand N on tracker 8; the third includes all modules on tracker 14; the fourth includes all modules on tracker 12; the last group consists of the three modules from brand M (on tracker 8).
Figure 4.18. Solar noon time PR of 60 modules grouped by hierarchical clustering result. Color of the curve differentiates samples.
5
Discussion
5.1 Data analytics
This section will focus on the problems found in the process of data cleaning, munging, and exploratory data analysis (EDA).
5.1.1 Irradiance data crosscheck
In the data subsampling part (Section 4.6), some modules, especially those on the fixed rack, showed low performance during and after snowy days due to snow coverage. Snow also has the potential to cover the irradiance sensors on the SunFarm. Since all irradiance data used in this study was measured by a pyranometer mounted on top of an electrical cabinet near the fixed rack, it is necessary to evaluate the irradiance data quality.
There were two pyranometers working on the SDLE SunFarm during the observation time. The GHI data used in this work was collected by a CMP11 pyranometer mounted on an electrical cabinet. The other was mounted horizontally and connected to a Daystar multi-tracer. The Daystar can trace real-time I-V curves of up to 32 modules. Unlike the dataloggers, the Daystar does not collect irradiance measurements every minute; it records irradiance data only when an I-V curve is being taken. In the first several months, the Daystar took an I-V curve at 30-minute intervals. After proper data