Operational Analysis Documentation


Release 2.2

NREL PRUF OA Team

Aug 20, 2021

1 Install
  1.1 Requirements
  1.2 Installation
  1.3 Development
  1.4 Contributors

2 Examples
  2.1 Use ENGIE’s open data set
  2.2 Quality Check Diagnostic Work
  2.3 First step in gap analysis is to determine the AEP based on operational data.
  2.4 Example operational analysis using the augmented capabilities of the AEP class
  2.5 The next step in the gap analysis is to calculate the Turbine Ideal Energy (TIE) for the wind farm based on SCADA data
  2.6 The next step in the gap analysis is to estimate electrical losses from the wind farm
  2.7 Perform energy yield assessment (EYA)-operational assessment (OA) gap analysis

3 Toolkits
  3.1 Filters
  3.2 Power Curve
  3.3 Imputing
  3.4 Timeseries
  3.5 Met Data Processing
  3.6 Metadata Fetch
  3.7 Unit Conversion
  3.8 Plotting

4 Analysis Methods
  4.1 Plant Level Analysis
  4.2 Turbine Level Analysis
  4.3 Electrical Losses Analysis

5 Project Data
  5.1 Schemas
  5.2 PlantData
  5.3 AssetData
  5.4 ReanalysisData

6 Contributing
  6.1 Issue Tracking
  6.2 Repository
  6.3 Pull Request
  6.6 Testing
  6.7 Deploying a Package to PyPi

7 Credit

Python Module Index

Index

This library provides a generic framework for working with large timeseries data from wind plants. Its development has been motivated by the WP3 Benchmarking (PRUF) project, which aims to provide a reference implementation for plant-level performance assessment.

The implementation makes use of a flexible backend, so that data loading, processing, and analysis can be performed locally (e.g., with Pandas dataframes), in a semi-distributed manner (e.g., with Dask dataframes), or in a fully distributed manner (e.g., with Spark dataframes).

Data processing and ETL are handled by the PlantData class and by project-specific modules that implement subclasses. These modules can be used to import, inspect, pre-process, and save the raw data from wind turbines, meters, met towers, and reanalysis products such as MERRA-2.

Analysis routines are grouped by purpose into toolkits, which provide an abstract, low-level API for common computations, and methods, which provide a higher-level, wind-industry-specific API. In addition to the provided modules, anyone can write their own, which is intended to encourage natural growth of tools within this framework.

To see how each of these components of OpenOA is used, please visit our example notebooks on Binder, or view them statically on the examples page.


INSTALL

This library provides a framework for working with large timeseries data from wind plants, such as SCADA. Its development has been motivated by the WP3 Benchmarking (PRUF) project, which aims to provide a reference implementation for plant-level performance assessment.

Analysis routines are grouped by purpose into methods, and these methods in turn rely on more abstract toolkits. In addition to the provided analysis methods, anyone can write their own, which is intended to provide natural growth of tools within this framework.

The library is written around Pandas Data Frames, utilizing a flexible backend so that data loading, processing, and analysis could be performed using other libraries, such as Dask and Spark, in the future.

If you would like to try out the code before installation or simply explore the possibilities, please see our examples on Binder.

1.1 Requirements

• Python 3.6-3.8 with pip.

OpenOA should be compatible with newer versions of Python, but one of its dependencies, Shapely, does not yet have binary wheels in pip for Python 3.9 on Mac.

We strongly recommend using the Anaconda Python distribution and creating a new conda environment for OpenOA.

You can download Anaconda through their website.

After installing Anaconda, create and activate a new conda environment with the name “openoa-env”:

conda create --name openoa-env python=3.8

conda activate openoa-env

1.2 Installation

Clone the repository and install the library and its dependencies using pip:

git clone https://github.com/NREL/OpenOA.git

pip install ./OpenOA

You should now be able to import operational_analysis from the Python interpreter:

python

>>> import operational_analysis

1.2.1 Common Installation Issues:

• In Windows you may get an error regarding geos_c.dll. To fix this install Shapely using:

conda install Shapely

• In Windows, an ImportError regarding win32api can also occur. This can be resolved by fixing the version of pywin32 as follows:

pip install --upgrade pywin32==255

1.3 Development

Development dependencies are provided through the develop extra flag in setup.py. Here, we install OpenOA, with development dependencies, in editable mode, and activate the pre-commit workflow (note: this second step must be done before committing any changes):

pip install -e "./OpenOA[develop]"

pre-commit install

1.3.1 Example Notebooks and Data

The example data will be automatically extracted as needed by the tests. To manually extract the example data for use with the example notebooks, use the following command:

unzip examples/data/la_haute_borne.zip -d examples/data/la_haute_borne/

In addition, you will need to install the packages required for running the examples with the following command:

pip install -r ./OpenOA/examples/requirements.txt

The example notebooks are located in the examples directory. We suggest installing the Jupyter notebook server to run the notebooks interactively. The notebooks can also be viewed statically on Read The Docs.

jupyter notebook


1.3.2 Testing

Tests are written in the Python unittest framework and are runnable using pytest. There are two types of tests. Unit tests (located in test/unit) run quickly and are run automatically for every pull request to the OpenOA repository.

Regression tests (located at test/regression) provide a comprehensive suite of scientific tests that may take a long time to run (up to 20 minutes on our machines). These tests should be run locally before submitting a pull request, and are run weekly on the develop and main branches.

To run all unit and regression tests:

pytest

To run unit tests only:

pytest test/unit

To run all tests and generate a code coverage report:

pytest --cov=operational_analysis

1.3.3 Documentation

Documentation is automatically built by, and visible through, Read The Docs.

You can build the documentation with sphinx, but will need to ensure Pandoc is installed on your computer first:

cd sphinx

pip install -r requirements.txt

make html

1.4 Contributors

Alphabetically: Nathan Agarwal, Nicola Bodini, Anna Craig, Jason Fields, Rob Hammond, Travis Kemper, Joseph Lee, Monte Lunacek, John Meissner, Mike Optis, Jordan Perr-Sauer, Sebastian Pfaffel, Caleb Phillips, Charlie Plumley, Eliot Quon, Sheungwen Sheng, Eric Simley, and Lindy Williams.


EXAMPLES

All notebooks are located at /examples in the OpenOA repository, and can be modified and run on Binder.

2.1 Use ENGIE’s open data set

ENGIE provides access to the data of its ‘La Haute Borne’ wind farm through https://opendata-renewables.engie.com and through an API. The data can be used to create additional turbine objects and gives users the opportunity to work with further real-world data.

The series of notebooks in the ‘examples’ folder uses SCADA data downloaded from https://opendata-renewables.engie.com, saved in the ‘examples/data’ folder. Additional plant level meter, availability, and curtailment data were synthesized based on the SCADA data.

In the following example, data is loaded into a turbine object and plotted as a power curve. The selected turbine can be changed if desired.

[1]: import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from bokeh.plotting import show
from bokeh.io import output_notebook
output_notebook()

from project_ENGIE import Project_Engie

from operational_analysis.toolkits import filters
from operational_analysis.toolkits import power_curve
from operational_analysis.toolkits import pandas_plotting

2.1.1 Import the data

[2]: project = Project_Engie('./data/la_haute_borne')
project.prepare()

INFO:project_ENGIE:Loading SCADA data
INFO:operational_analysis.types.timeseries_table:Loading name:la-haute-borne-data-2014-2015
INFO:project_ENGIE:SCADA data loaded
INFO:project_ENGIE:Timestamp QC and conversion to UTC
INFO:project_ENGIE:Correcting for out of range of temperature variables
INFO:project_ENGIE:Flagging unresponsive sensors
INFO:project_ENGIE:Converting field names to IEC 61400-25 standard
INFO:operational_analysis.types.timeseries_table:Loading name:plant_data
INFO:operational_analysis.types.timeseries_table:Loading name:plant_data
INFO:operational_analysis.types.timeseries_table:Loading name:merra2_la_haute_borne
INFO:operational_analysis.types.timeseries_table:Loading name:era5_wind_la_haute_borne

Now that the data is imported, we can take a look at the wind farm. There are 4 turbines, along with nearby forestry, a small town, and neighbouring wind farms, all of which could affect performance. Now let’s have a look at the turbines.

[3]: show(pandas_plotting.plot_windfarm(project, tile_name="OpenMap", plot_width=600, plot_height=600))

[4]: # List of turbines
turb_list = project.scada.df.id.unique()
turb_list

[4]: array(['R80736', 'R80721', 'R80790', 'R80711'], dtype=object)

Let’s examine the first turbine from the list above.

[5]: df = project.scada.df.loc[project.scada.df['id'] == turb_list[0]]

windspeed = df["wmet_wdspd_avg"]

power_kw = df["wtur_W_avg"]/1000 # Put into kW

[6]: def plot_flagged_pc(ws, p, flag_bool, alpha):
    # Plot all points, then overlay the flagged points in red
    plt.scatter(ws, p, s = 1, alpha = alpha)
    plt.scatter(ws[flag_bool], p[flag_bool], s = 1, c = 'red')
    plt.xlabel('Wind speed (m/s)')
    plt.ylabel('Power (kW)')  # the power series passed in is in kW
    plt.show()

First, we’ll make a scatter plot of the raw power curve data.

[7]: plot_flagged_pc(windspeed, power_kw, np.repeat(True, df.shape[0]), 1)


2.1.2 Range filter

[8]: out_of_range = filters.range_flag(windspeed, below=0, above=70)
windspeed[out_of_range].head()

[8]: Series([], Name: wmet_wdspd_avg, dtype: float64)

No wind speeds out of range
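To make the idea behind the range filter concrete, here is a minimal plain-Python sketch. The function name `range_flag_sketch` is hypothetical and the strict inequalities are an assumption; this is not the OpenOA implementation, which operates on Pandas series.

```python
def range_flag_sketch(values, below, above):
    """Return True for each value outside the interval [below, above].

    Assumption: values exactly equal to a limit are treated as in range.
    """
    return [v < below or v > above for v in values]


# Made-up wind speeds (m/s) containing two physically implausible readings
wind_speeds = [3.2, -0.5, 12.8, 71.0, 25.1]
flags = range_flag_sketch(wind_speeds, below=0, above=70)
print(flags)
```

Only the negative reading and the reading above 70 m/s are flagged; everything else passes through untouched.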

2.1.3 Window range filter

Now, we’ll apply a window range filter to remove data with power values outside of the window from 20 to 2100 kW for wind speeds between 5 and 40 m/s.

[9]: out_of_window = filters.window_range_flag(windspeed, 5., 40, power_kw, 20., 2100.)
plot_flagged_pc(windspeed, power_kw, out_of_window, 0.2)

Let’s remove these flagged data from consideration


[10]: windspeed_filt1 = windspeed[~out_of_window]

power_kw_filt1 = power_kw[~out_of_window]
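The window range filter described above can be sketched in plain Python as follows. The function `window_range_flag_sketch` and its argument names are hypothetical, and the inclusive window bounds are an assumption rather than a statement about the library's behaviour.

```python
def window_range_flag_sketch(window, window_start, window_end,
                             values, value_min, value_max):
    """Flag points that fall inside the window but outside the value range.

    Points outside the window are never flagged (assumed inclusive bounds).
    """
    flags = []
    for w, v in zip(window, values):
        in_window = window_start <= w <= window_end
        out_of_range = v < value_min or v > value_max
        flags.append(in_window and out_of_range)
    return flags


# Made-up wind speeds (m/s) and powers (kW)
ws = [3.0, 6.0, 12.0, 15.0]
p = [0.0, 5.0, 1800.0, 2200.0]
window_flags = window_range_flag_sketch(ws, 5.0, 40.0, p, 20.0, 2100.0)
print(window_flags)
```

The first point produces no power but sits below 5 m/s, so it is left alone; the second and fourth points are inside the wind speed window with power outside 20 to 2100 kW, so they are flagged.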

2.1.4 Bin filter

We may be interested in fitting a power curve to data representing ‘normal’ turbine operation. In other words, we want to flag all anomalous data or data representative of underperformance. To do this, the ‘bin_filter’ function is useful.

It works by binning the data by a specified variable, bin width, and start and end points. The criteria for flagging is based on some measure (scalar or standard deviation) from the mean or median of the bin center.

As an example, let’s bin on power in 100 kW increments, starting from 20 kW but stopping at 90% of peak power (i.e. we don’t want to flag all the data at peak power and high wind speed). Let’s use a scalar threshold of 1.5 m/s from the median for each bin. Let’s also consider data on both sides of the curve by setting the ‘direction’ parameter to ‘all’.

[11]: max_bin = 0.90*power_kw_filt1.max()
bin_outliers = filters.bin_filter(power_kw_filt1, windspeed_filt1, 100, 1.5, 'median', 20., max_bin, 'scalar', 'all')
plot_flagged_pc(windspeed_filt1, power_kw_filt1, bin_outliers, 0.5)

As seen above, one call to the bin filter has done a decent job of cleaning up the power curve to represent ‘normal’ operation, without excessive removal of data points. There are a few points at peak power but low wind speed that weren’t flagged, however. Let’s catch those, and then remove those as well as the flagged data above, and plot our ‘clean’ power curve.

[12]: windspeed_filt2 = windspeed_filt1[~bin_outliers]

power_kw_filt2 = power_kw_filt1[~bin_outliers]
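As a rough illustration of the binning logic described in this section, the following plain-Python sketch flags points whose wind speed lies more than a scalar threshold from the median of their power bin. The function `bin_filter_sketch` is hypothetical, handles only the median/scalar/both-directions case, and is not the OpenOA implementation.

```python
from statistics import median


def bin_filter_sketch(bin_values, flag_values, bin_width, threshold,
                      bin_min, bin_max):
    """Flag points whose flag_value deviates from its bin median by more
    than `threshold`. Points outside [bin_min, bin_max) are never flagged."""
    flags = [False] * len(bin_values)
    edge = bin_min
    while edge < bin_max:
        upper = min(edge + bin_width, bin_max)
        members = [i for i, b in enumerate(bin_values) if edge <= b < upper]
        if members:
            center = median(flag_values[i] for i in members)
            for i in members:
                flags[i] = abs(flag_values[i] - center) > threshold
        edge += bin_width
    return flags


# Made-up data: power (kW) is the binning variable, wind speed (m/s) is checked
power = [50, 60, 70, 80, 150, 160, 170]
wind = [4.0, 4.2, 4.1, 9.0, 5.5, 5.6, 5.4]
bin_flags = bin_filter_sketch(power, wind, 100, 1.5, 20, 220)
print(bin_flags)
```

In the first bin (20 to 120 kW) the median wind speed is 4.15 m/s, so the 9.0 m/s point is flagged as underperformance while its neighbours are kept.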

Unresponsive Filter

As a final filtering demonstration, we can look for an unresponsive sensor (i.e. repeating measurements). In this case, let’s look for 3 or more repeating wind speed measurements:

[13]: frozen = filters.unresponsive_flag(windspeed_filt2, 3)
windspeed_filt2[frozen]

[13]: time
2014-01-10 14:40:00    0.0
2014-01-10 14:50:00    0.0
2014-01-10 15:00:00    0.0
2014-01-11 22:30:00    0.0
2014-01-11 22:40:00    0.0
                      ...
2015-12-09 22:50:00    0.0
2015-12-09 23:00:00    0.0
2015-12-15 02:20:00    5.5
2015-12-15 02:30:00    5.5
2015-12-15 02:40:00    5.5
Name: wmet_wdspd_avg, Length: 1926, dtype: float64

We actually found a lot, so let’s remove these data as well before moving on to power curve fitting.

Note that many of the unresponsive sensor values identified above are likely caused by the discretization of the data to only two decimal places. However, the goal is to illustrate the filtering process.

[14]: windspeed_final = windspeed_filt2[~frozen]

power_kw_final = power_kw_filt2[~frozen]
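The unresponsive sensor check can be pictured as a search for runs of identical consecutive values. Below is a minimal plain-Python sketch; `unresponsive_flag_sketch` is a hypothetical name, and flagging every member of a qualifying run is an assumption about the intended behaviour.

```python
def unresponsive_flag_sketch(values, threshold):
    """Flag every value that belongs to a run of at least `threshold`
    identical consecutive values."""
    flags = [False] * len(values)
    run_start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start >= threshold:
                for j in range(run_start, i):
                    flags[j] = True
            run_start = i
    return flags


# Made-up wind speeds (m/s) with two frozen stretches
ws = [5.5, 5.5, 5.5, 6.1, 6.1, 7.0, 0.0, 0.0, 0.0, 0.0]
frozen_flags = unresponsive_flag_sketch(ws, 3)
print(frozen_flags)
```

The pair of 6.1 m/s readings is shorter than the threshold and is kept, which also shows why coarsely discretized data can trigger this filter more often than expected.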

Power curve fitting

We will now consider three different models for fitting a power curve to the SCADA data.

[ ]: # Fit the power curves
iec_curve = power_curve.IEC(windspeed_final, power_kw_final)
l5p_curve = power_curve.logistic_5_parametric(windspeed_final, power_kw_final)
spline_curve = power_curve.gam(windspeed_final, power_kw_final, n_splines = 20)

[ ]: # Plot the results
x = np.linspace(0,20,100)
plt.figure(figsize = (10,6))
plt.scatter(windspeed_final, power_kw_final, alpha=0.5, s = 1, c = 'gray')
plt.plot(x, iec_curve(x), color="red", label = 'IEC', linewidth = 3)
plt.plot(x, spline_curve(x), color="C1", label = 'Spline', linewidth = 3)
plt.plot(x, l5p_curve(x), color="C2", label = 'L5P', linewidth = 3)
plt.xlabel('Wind speed (m/s)')
plt.ylabel('Power (kW)')
plt.legend()
plt.show()

The above plot shows that the IEC method accurately captures the power curve, although it results in a ‘choppy’ fit, while the L5P model (constrained by its parametric form) deviates from the knee of the power curve through peak production. The spline fit tends to fit the best.
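The ‘choppy’ appearance of the IEC fit follows from the method of bins, in which power is averaged within fixed wind speed bins. The sketch below assumes 0.5 m/s bins and returns zero for empty bins; `binned_power_curve_sketch` is a hypothetical helper and may differ from what `power_curve.IEC` actually does.

```python
from collections import defaultdict


def binned_power_curve_sketch(wind_speeds, powers, bin_width=0.5):
    """Return a step function mapping wind speed to the mean power
    observed in that wind speed bin (0.0 for bins with no data)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ws, p in zip(wind_speeds, powers):
        b = int(ws // bin_width)
        sums[b] += p
        counts[b] += 1
    bin_means = {b: sums[b] / counts[b] for b in sums}

    def curve(ws):
        return bin_means.get(int(ws // bin_width), 0.0)

    return curve


# Made-up wind speeds (m/s) and powers (kW)
ws = [4.1, 4.3, 4.6, 4.9, 5.2]
p = [100.0, 120.0, 160.0, 180.0, 230.0]
curve = binned_power_curve_sketch(ws, p)
print(curve(4.2), curve(4.75), curve(12.0))
```

Because the result is constant within each bin and jumps at bin edges, it tracks the data closely but is not smooth, in contrast to the parametric and spline fits.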

2.2 Quality Check Diagnostic Work

This notebook illustrates some quality control steps that should be considered when analyzing a new dataset. In this example we’ll use the ‘WindToolKitQualityControlDiagnosticSuite’ class to automate some of the QC analysis for SCADA data.


2.2.1 Step 1: Load in Data

[1]: %load_ext autoreload

%autoreload 2

[2]: from operational_analysis.methods.quality_check_automation import WindToolKitQualityControlDiagnosticSuite as QC
import pandas as pd
import numpy as np

[3]: scada_df = pd.read_csv('./data/la_haute_borne/la-haute-borne-data-2014-2015.csv')

[4]: scada_df.head()

[4]:   Wind_turbine_name                  Date_time  Ba_avg      P_avg  Ws_avg  \
0            R80736  2014-01-01T01:00:00+01:00   -1.00  642.78003    7.12
1            R80721  2014-01-01T01:00:00+01:00   -1.01  441.06000    6.39
2            R80790  2014-01-01T01:00:00+01:00   -0.96  658.53003    7.11
3            R80711  2014-01-01T01:00:00+01:00   -0.93  514.23999    6.87
4            R80790  2014-01-01T01:10:00+01:00   -0.96  640.23999    7.01

   Va_avg  Ot_avg     Ya_avg     Wa_avg
0    0.66    4.69  181.34000  182.00999
1   -2.48    4.94  179.82001  177.36000
2    1.07    4.55  172.39000  173.50999
3    6.95    4.30  172.77000  179.72000
4   -1.90    4.68  172.39000  170.46001

Convert Date_time to a datetime object

[5]: # To illustrate timezone QC functions, we'll remove the timezone information
date = [s[0:10] for s in scada_df['Date_time']]
time = [s[11:19] for s in scada_df['Date_time']]
datetime = [date[s] + ' ' + time[s] for s in np.arange(len(date))]
scada_df['datetime'] = pd.to_datetime(datetime, format = "%Y-%m-%d %H:%M:%S")
scada_df.set_index('datetime', inplace = True, drop = False)

[6]: scada_df.dtypes

[6]: Wind_turbine_name object

Date_time object

Ba_avg float64

P_avg float64

Ws_avg float64

Va_avg float64

Ot_avg float64

Ya_avg float64

Wa_avg float64

datetime datetime64[ns]

dtype: object


2.2.2 Step 2: Initializing QC and Performing the Run Method

Now that we have our dataset with the necessary columns and datatypes, we are ready to perform our quality check diagnostic. This analysis will not make the adjustments for us, but it will allow us to quickly flag some key irregularities that we need to manage before going on.

To start, let’s initialize a QC object, qc, and call its run method.

[7]: qc = QC(df = scada_df, ws_field = 'Ws_avg', power_field= 'P_avg', time_field = 'datetime',
        id_field= 'Wind_turbine_name', freq = '10T',
        lat_lon = (48.45, 5.586), dst_subset = 'France', check_tz = False)

INFO:operational_analysis.methods.quality_check_automation:Initializing QC_Automation Object

[8]: qc.run()

INFO:operational_analysis.methods.quality_check_automation:Identifying Time Duplications
INFO:operational_analysis.methods.quality_check_automation:Identifying Time Gaps
INFO:operational_analysis.methods.quality_check_automation:Grabbing DST Transition Times
INFO:operational_analysis.methods.quality_check_automation:Isolating Extrema Values
INFO:operational_analysis.methods.quality_check_automation:QC Diagnostic Complete

2.2.3 Step 3: Deep Dive with QC Diagnostic Results

Let’s take a deeper look at the results of our QC diagnostic.

Perform a general scan of the distributions for each numeric variable

[9]: qc.column_histograms()

Check ranges of each variable

[10]: qc._max_min

[10]:                                   min                        max
Wind_turbine_name                R80711                     R80790
Date_time         2014-01-01T01:00:00+01:00  2016-01-01T00:50:00+01:00
Ba_avg                          -121.26                     262.61
P_avg                            -17.92                    2051.87
Ws_avg                                0                      19.31
Va_avg                          -179.95                     179.99
Ot_avg                           -273.2                      39.89
Ya_avg                                0                        360
Wa_avg                                0                        360
datetime            2014-01-01 01:00:00        2016-01-01 00:50:00

These values look fairly reasonable and consistent.

Identify any timestamp duplications and timestamp gaps.

Duplications in October and gaps in March would suggest DST.

[11]: qc._time_duplications

[11]: datetime
2014-03-30 03:00:00   2014-03-30 03:00:00
2014-03-30 03:00:00   2014-03-30 03:00:00
2014-03-30 03:00:00   2014-03-30 03:00:00
2014-03-30 03:00:00   2014-03-30 03:00:00
2014-03-30 03:10:00   2014-03-30 03:10:00
2014-03-30 03:10:00   2014-03-30 03:10:00
2014-03-30 03:10:00   2014-03-30 03:10:00
2014-03-30 03:10:00   2014-03-30 03:10:00
2014-03-30 03:20:00   2014-03-30 03:20:00
2014-03-30 03:20:00   2014-03-30 03:20:00
2014-03-30 03:20:00   2014-03-30 03:20:00
2014-03-30 03:20:00   2014-03-30 03:20:00
2014-03-30 03:30:00   2014-03-30 03:30:00
2014-03-30 03:30:00   2014-03-30 03:30:00
2014-03-30 03:30:00   2014-03-30 03:30:00
2014-03-30 03:30:00   2014-03-30 03:30:00
2014-03-30 03:40:00   2014-03-30 03:40:00
2014-03-30 03:40:00   2014-03-30 03:40:00
2014-03-30 03:40:00   2014-03-30 03:40:00
2014-03-30 03:40:00   2014-03-30 03:40:00
2014-03-30 03:50:00   2014-03-30 03:50:00
2014-03-30 03:50:00   2014-03-30 03:50:00
2014-03-30 03:50:00   2014-03-30 03:50:00
2014-03-30 03:50:00   2014-03-30 03:50:00
2015-03-29 03:00:00   2015-03-29 03:00:00
2015-03-29 03:00:00   2015-03-29 03:00:00
2015-03-29 03:00:00   2015-03-29 03:00:00
2015-03-29 03:00:00   2015-03-29 03:00:00
2015-03-29 03:10:00   2015-03-29 03:10:00
2015-03-29 03:10:00   2015-03-29 03:10:00
2015-03-29 03:10:00   2015-03-29 03:10:00
2015-03-29 03:10:00   2015-03-29 03:10:00
2015-03-29 03:20:00   2015-03-29 03:20:00
2015-03-29 03:20:00   2015-03-29 03:20:00
2015-03-29 03:20:00   2015-03-29 03:20:00
2015-03-29 03:20:00   2015-03-29 03:20:00
2015-03-29 03:30:00   2015-03-29 03:30:00
2015-03-29 03:30:00   2015-03-29 03:30:00
2015-03-29 03:30:00   2015-03-29 03:30:00
2015-03-29 03:30:00   2015-03-29 03:30:00
2015-03-29 03:40:00   2015-03-29 03:40:00
2015-03-29 03:40:00   2015-03-29 03:40:00
2015-03-29 03:40:00   2015-03-29 03:40:00
2015-03-29 03:40:00   2015-03-29 03:40:00
2015-03-29 03:50:00   2015-03-29 03:50:00
2015-03-29 03:50:00   2015-03-29 03:50:00
2015-03-29 03:50:00   2015-03-29 03:50:00
2015-03-29 03:50:00   2015-03-29 03:50:00
Name: datetime, dtype: datetime64[ns]

[12]: qc._time_gaps

[12]: 12678   2014-03-30 02:00:00
12679   2014-03-30 02:10:00
12680   2014-03-30 02:20:00
12681   2014-03-30 02:30:00
12682   2014-03-30 02:40:00
12683   2014-03-30 02:50:00
65094   2015-03-29 02:00:00
65095   2015-03-29 02:10:00
65096   2015-03-29 02:20:00
65097   2015-03-29 02:30:00
65098   2015-03-29 02:40:00
65099   2015-03-29 02:50:00
dtype: datetime64[ns]

Based on the duplicated timestamps, it does seem like there is a DST correction in spring but no time gap in the fall.
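The duplication and gap checks can be illustrated with a small standard-library sketch for a single, regularly sampled series. The helper `find_duplicates_and_gaps` is hypothetical and is not how the QC class is implemented; the timestamps are made up to mimic a spring DST transition.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_duplicates_and_gaps(timestamps, step):
    """Return (duplicated timestamps, missing timestamps) for a series
    that is expected to advance by `step` between first and last entry."""
    counts = Counter(timestamps)
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    present = set(timestamps)
    gaps = []
    t, end = min(timestamps), max(timestamps)
    while t <= end:
        if t not in present:
            gaps.append(t)
        t += step
    return duplicates, gaps


step = timedelta(minutes=10)
stamps = [
    datetime(2014, 3, 30, 1, 40),
    datetime(2014, 3, 30, 1, 50),
    # 02:00 to 02:50 is skipped by the clock change
    datetime(2014, 3, 30, 3, 0),
    datetime(2014, 3, 30, 3, 0),   # repeated record
    datetime(2014, 3, 30, 3, 10),
]
duplicates, gaps = find_duplicates_and_gaps(stamps, step)
print(duplicates)
print(gaps)
```

For multi-turbine data such as the SCADA file above, the same check would be applied per turbine ID so that identical timestamps from different turbines are not counted as duplicates.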

Check the DST plot to look in more detail

[13]: qc.daylight_savings_plot()


So we do in fact have a gap in the spring data when DST kicks in (as well as duplicated data for some reason) but not duplicated data in the fall.

The final question regarding datetime is whether we’re in UTC or local time. Given the daylight saving gap, it’s likely we’re in local time. This is further confirmed by the raw datetime info provided in the SCADA file, which shows either a +1h or +2h offset from UTC. So we are operating in local time. Therefore, the project import script for La Haute Borne should shift the timestamps back to put them into UTC.
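As a small illustration of the shift to UTC, the offset embedded in the raw Date_time strings can be used directly. This sketch uses only the standard library (Python 3.7+); the helper name `to_utc` is hypothetical, the winter timestamp is taken from the table above, and the summer timestamp is a made-up example of the +2h case.

```python
from datetime import datetime, timezone


def to_utc(iso_string):
    """Parse an ISO 8601 timestamp with a UTC offset and convert it to UTC."""
    local = datetime.fromisoformat(iso_string)
    return local.astimezone(timezone.utc)


winter = to_utc("2014-01-01T01:00:00+01:00")  # +1h offset (standard time)
summer = to_utc("2014-07-01T01:00:00+02:00")  # +2h offset (daylight saving time)
print(winter.isoformat())
print(summer.isoformat())
```

Converting with the recorded offset, rather than subtracting a fixed number of hours, handles both the winter and summer cases without special treatment of the DST transitions.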

2.2.4 Inspect the turbine power curves

Now that we have gathered some useful information about our timeseries, the one last check we may want to make is to inspect each turbine profile. We can look at each turbine’s power curve and perform an initial scan for irregularities.

[14]: qc.plot_by_id('Ws_avg', 'P_avg')

Overall, these power curves look pretty common with some downtime, derating, and what look like a few erroneous data points.

2.2.5 Step 4: Performing adjustments on our data

Recall that this notebook is only for diagnostic QC of plant data and does not actually change the data in the project import script. Any issues identified here should be incorporated into the project import script.

Note that the necessary corrections have already been applied to the project import script for this data.

2.3 First step in gap analysis is to determine the AEP based on operational data.

%autoreload 2

This notebook provides an overview and walk-through of the steps taken to produce a plant-level operational energy assessment (OA) of a wind plant in the PRUF project. The La Haute-Borne wind farm is used here and throughout the example notebooks.

Uncertainty in the annual energy production (AEP) estimate is calculated through a Monte Carlo approach. Specifically, inputs into the OA code as well as intermediate calculations are randomly sampled based on their specified or calculated uncertainties. By performing the OA assessment thousands of times under different combinations of the random sampling, a distribution of AEP values results from which uncertainty can be deduced. Details on the Monte Carlo approach will be provided throughout this notebook.
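The Monte Carlo idea can be sketched with the standard library alone. In this toy version, the 0.5% revenue meter uncertainty comes from the description later in this notebook, while the base AEP of 12 GWh, the single 3% factor standing in for all other uncertainty sources, and the function name `monte_carlo_aep_sketch` are assumptions made purely for illustration. The real analysis reruns the full OA calculation for each sample.

```python
import random
import statistics


def monte_carlo_aep_sketch(base_aep_gwh, n_sim=2000, seed=42):
    """Return AEP samples obtained by randomly perturbing the inputs."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    samples = []
    for _ in range(n_sim):
        # 0.5% revenue meter uncertainty, sampled around 1.0
        meter_factor = rng.gauss(1.0, 0.005)
        # Hypothetical 3% uncertainty standing in for all other sources
        other_factor = rng.gauss(1.0, 0.03)
        samples.append(base_aep_gwh * meter_factor * other_factor)
    return samples


samples = monte_carlo_aep_sketch(base_aep_gwh=12.0)  # 12 GWh is a made-up value
aep_mean = statistics.mean(samples)
aep_uncertainty = statistics.stdev(samples) / aep_mean  # relative uncertainty
print(f"AEP = {aep_mean:.2f} GWh +/- {100 * aep_uncertainty:.1f}%")
```

The spread of the resulting distribution, here summarized as the standard deviation divided by the mean, is what is reported as the AEP uncertainty.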

2.3.1 Step 1: Import plant data into notebook

A zip file included in the OpenOA ‘examples/data’ folder needs to be unzipped to run this step. Note that this zip file should be unzipped automatically as part of the project.prepare() function call below. Once unzipped, 4 CSV files will appear in the ‘examples/data/la_haute_borne’ folder.

[2]: # Import required packages
import os
import copy

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm

from project_ENGIE import Project_Engie  # needed for the Project_Engie call below
from operational_analysis.methods import plant_analysis

In the call below, make sure the appropriate path to the CSV input files is specified. In this example, the CSV files are located directly in the ‘examples/data/la_haute_borne’ folder.

[3]: # Load plant object

project = Project_Engie('./data/la_haute_borne')

[4]: # Prepare data
project.prepare()

INFO:project_ENGIE:SCADA data loaded
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
INFO:operational_analysis.types.timeseries_table:Loading name:merra2_la_haute_borne
INFO:operational_analysis.types.timeseries_table:Loading name:era5_wind_la_haute_borne

2.3.2 Step 2: Review the data

Several Pandas data frames have now been loaded. Histograms showing the distribution of the plant-level metered energy, availability, and curtailment are shown below:

[5]: # Review plant data
fig, (ax1, ax2, ax3) = plt.subplots(ncols = 3, figsize = (15,5))
ax1.hist(project._meter.df['energy_kwh'], 40)  # Metered energy data
ax2.hist(project._curtail.df['availability_kwh'], 40)  # Curtailment and availability loss data
ax3.hist(project._curtail.df['curtailment_kwh'], 40)  # Curtailment and availability loss data
plt.tight_layout()
plt.show()

2.3.3 Step 3: Process the data into monthly averages and sums

The raw plant data can be in different time resolutions (in this case 10-minute periods). The following steps process the data into monthly averages and combine them into a single ‘monthly’ data frame to be used in the OA assessment.

[6]: project._meter.df.head()

[6]:                      energy_kwh                time
time
2014-01-01 00:00:00     369.726 2014-01-01 00:00:00
2014-01-01 00:10:00     376.409 2014-01-01 00:10:00
2014-01-01 00:20:00     309.199 2014-01-01 00:20:00
2014-01-01 00:30:00     350.176 2014-01-01 00:30:00
2014-01-01 00:40:00     286.333 2014-01-01 00:40:00

First, we’ll create a MonteCarloAEP object which is used to calculate long-term AEP. Two reanalysis products are specified as arguments.

[7]: pa = plant_analysis.MonteCarloAEP(project, reanal_products = ['era5', 'merra2'])

INFO:operational_analysis.methods.plant_analysis:Initializing MonteCarloAEP Analysis Object

Let’s view the result. Note the extra fields we’ve calculated that we’ll use later for filtering:

• energy_nan_perc : the percentage of NaN values in the raw revenue meter data used in calculating the monthly sum. If this value is too large, we shouldn’t include this month

• nan_flag : if too much energy, availability, or curtailment data was missing for a given month, flag the result

• num_days_expected : number of days in the month (useful for normalizing monthly gross energy later)

• num_days_actual : actual number of days per month as found in the data (used when trimming monthly data frame)

[8]: # View the monthly data frame
pa._aggregate.df.head()

[8]:             energy_gwh  energy_nan_perc  num_days_expected  num_days_actual  \
time
2014-01-01    1.279667              0.0                 31               31
2014-02-01    1.793873              0.0                 28               28
2014-03-01    0.805549              0.0                 31               31
2014-04-01    0.636472              0.0                 30               30
2014-05-01    1.154255              0.0                 31               31

            availability_gwh  curtailment_gwh  gross_energy_gwh  \
time
2014-01-01          0.008721         0.000000          1.288387
2014-02-01          0.005280         0.000000          1.799153
2014-03-01          0.000151         0.000000          0.805700
2014-04-01          0.002773         0.000000          0.639245
2014-05-01          0.015176         0.000225          1.169656

            availability_pct  curtailment_pct  avail_nan_perc  curt_nan_perc  \
time
2014-01-01          0.006769         0.000000             0.0            0.0
2014-02-01          0.002934         0.000000             0.0            0.0
2014-03-01          0.000188         0.000000             0.0            0.0
2014-04-01          0.004338         0.000000             0.0            0.0
2014-05-01          0.012974         0.000192             0.0            0.0

            nan_flag  availability_typical  curtailment_typical  \
time
2014-01-01     False                  True                 True

            combined_loss_valid      era5    merra2
time
2014-01-01                 True  7.314878  7.227947
2014-02-01                 True  8.347006  8.598686
2014-03-01                 True  5.169673  5.207071
2014-04-01                 True  4.756275  4.872304
2014-05-01                 True  6.162751  6.351635
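The monthly fields described above can be illustrated with a small plain-Python sketch for one month of meter readings. The function `summarize_month`, the 15% missing-data threshold, and the choice to express energy_nan_perc as a fraction are all assumptions for illustration; they are not taken from the OpenOA code.

```python
import math


def summarize_month(energy_kwh, max_nan_fraction=0.15):
    """Sum valid readings to GWh and report how much data was missing.

    max_nan_fraction is a hypothetical threshold for setting nan_flag.
    """
    n = len(energy_kwh)
    valid = [e for e in energy_kwh if not math.isnan(e)]
    energy_nan_perc = (n - len(valid)) / n
    return {
        "energy_gwh": sum(valid) / 1e6,  # kWh -> GWh
        "energy_nan_perc": energy_nan_perc,
        "nan_flag": energy_nan_perc > max_nan_fraction,
    }


# Four made-up readings (kWh), one of them missing
month = summarize_month([400.0, float("nan"), 350.0, 250.0])
print(month)
```

A month with a large share of missing readings would understate the monthly sum, which is why such months are flagged rather than silently included.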

2.3.4 Step 4: Review reanalysis data

Reanalysis data will be used to long-term correct the operational energy, that is, to adjust the energy produced over the plant’s period of operation to long-term conditions. It is important that we only use reanalysis data that show reasonable trends over time with no noticeable discontinuities. A plot like the one below, in which normalized annual wind speeds are shown from 1997 to present, provides a good first look at data quality.

The plot shows that both of the reanalysis products track each other reasonably well and seem well-suited for the analysis.

[9]: pa.plot_reanalysis_normalized_rolling_monthly_windspeed().show()


2.3.5 Step 5: Review energy and loss data

It is useful to take a look at the energy data and make sure the values make sense. We begin with scatter plots of gross energy and wind speed for each reanalysis product. We also show a time series of gross energy, as well as availability and curtailment loss.

Let’s start with the scatter plots of gross energy vs wind speed for each reanalysis product. Here we use the ‘Robust Linear Model’ (RLM) module of the Statsmodels package with the default Huber algorithm to produce a regression fit that excludes outliers. Data points in red show the outliers, and were excluded based on a Huber sensitivity factor of 3.0 (the factor is varied between 2.0 and 3.0 in the Monte Carlo simulation).

The plots below reveal that:

• there are some outliers

• both reanalysis products are strongly correlated with plant energy

[10]: pa.plot_reanalysis_gross_energy_data(outlier_thres=3).show()
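To give a feel for how a Huber-type robust fit downweights outliers, here is a self-contained sketch of iteratively reweighted least squares for a straight line. It is a simplified stand-in, not the Statsmodels RLM implementation: the function `huber_line_fit_sketch`, the scale estimate (median absolute residual divided by 0.6745), and the data are all assumptions for illustration.

```python
import statistics


def huber_line_fit_sketch(x, y, t=3.0, n_iter=50):
    """Fit y = slope * x + intercept with Huber weights.

    Points with |residual| > t * scale get weight t * scale / |residual|
    and are reported as outliers after the final iteration.
    """
    w = [1.0] * len(x)
    for _ in range(n_iter):
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx
        intercept = my - slope * mx
        resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
        # Robust scale estimate; guarded against a perfect fit
        scale = max(statistics.median(abs(r) for r in resid) / 0.6745, 1e-12)
        limit = t * scale
        w = [1.0 if abs(r) <= limit else limit / abs(r) for r in resid]
    outliers = [abs(r) > limit for r in resid]
    return slope, intercept, outliers


# Made-up monthly data: y = 2x + 1 plus small noise, with one bad month
x = list(range(1, 11))
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
y = [2 * xi + 1 + ni for xi, ni in zip(x, noise)]
y[6] -= 8.0  # inject an outlier
slope, intercept, outliers = huber_line_fit_sketch(x, y, t=3.0)
print(round(slope, 2), round(intercept, 2), outliers)
```

Lowering `t` makes the fit treat more points as outliers, which is why varying this factor in the Monte Carlo simulation captures analyst subjectivity about what counts as an outlier.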


Next we show time series plots of the monthly gross energy, availability, and curtailment. Note that the availability and curtailment data were estimated based on SCADA data from the plant.

Long-term availability and curtailment losses for the plant are calculated based on average percentage losses for each calendar month. Summing those average values weighted by the fraction of long-term gross energy generated in each month yields the long-term annual estimates. Weighting by monthly long-term gross energy helps account for potential correlation between losses and energy production (e.g., high availability losses in summer months with lower energy production). The long-term losses are calculated in Step 9.

[11]: pa.plot_aggregate_plant_data_timeseries().show()
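The energy-weighted averaging of losses described above can be written out in a few lines. This is a sketch with made-up numbers: the function `long_term_loss_sketch` is hypothetical, and the twelve loss and energy values are chosen only to show the effect of the weighting.

```python
def long_term_loss_sketch(monthly_loss_fraction, monthly_gross_energy):
    """Weight each calendar month's average loss by that month's share
    of long-term gross energy and sum to an annual loss estimate."""
    total_energy = sum(monthly_gross_energy)
    return sum(
        loss * energy / total_energy
        for loss, energy in zip(monthly_loss_fraction, monthly_gross_energy)
    )


# Made-up example: six windy months with 1% loss, six calm months with 3% loss
losses = [0.01] * 6 + [0.03] * 6
gross_energy_gwh = [1.5] * 6 + [0.5] * 6
weighted_loss = long_term_loss_sketch(losses, gross_energy_gwh)
simple_mean = sum(losses) / len(losses)
print(round(weighted_loss, 4), round(simple_mean, 4))
```

Here the weighted estimate (1.5%) is lower than the unweighted mean (2%) because the high-loss months contribute little energy, which is exactly the correlation the weighting is meant to capture.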

2.3.6 Step 6: Specify availability and curtailment data not representative of actual plant performance

There may be anomalies in the reported availability that shouldn’t be considered representative of actual plant performance. Force majeure events (e.g. lightning) are a good example. Such losses aren’t typically considered in pre-construction AEP estimates; therefore, plant availability loss reported in an operational AEP analysis should also not include such losses.

The ‘availability_typical’ and ‘curtailment_typical’ fields in the monthly data frame are initially set to True. Below, individual months can be set to ‘False’ if it is deemed those months are unrepresentative of long-term plant losses. By flagging these months as false, they will be omitted when assessing average availability and curtailment loss for the plant.

Justification for removing months from assessing average availability or curtailment should come from conversations with the owner/operator. For example, if a high-loss month is found, reasons for the high loss should be discussed with the owner/operator to determine if those losses can be considered representative of average plant operation.

[12]: # For illustrative purposes, let's suppose a few months aren't representative of long-term losses
pa._aggregate.df.loc['2014-11-01',['availability_typical','curtailment_typical']] = False
pa._aggregate.df.loc['2015-07-01',['availability_typical','curtailment_typical']] = False


2.3.7 Step 7: Select reanalysis products to use

Based on the assessment of reanalysis products above (both long-term trend and relationship with plant energy), we now set which reanalysis products we will include in the OA. For this particular case study, we use both products given the high regression relationships.

2.3.8 Step 8: Set up Monte Carlo inputs

The next step is to set up the Monte Carlo framework for the analysis. Specifically, we identify each source of uncertainty in the OA estimate and use that uncertainty to create distributions of the input and intermediate variables from which we can sample for each iteration of the OA code. For input variables, we can create such distributions beforehand. For intermediate variables, we must sample separately for each iteration.

Detailed descriptions of the sampled Monte Carlo inputs, which can be specified when initializing the MonteCarloAEP object if values other than the defaults are desired, are provided below:

• slope, intercept, and num_outliers : These are intermediate variables that are calculated for each iteration of the code.

• outlier_threshold : Sample values between 2 and 3 which set the Huber algorithm outlier detection parameter. Varying this threshold accounts for analyst subjectivity on which data points constitute outliers and which do not.

• metered_energy_fraction : Revenue meter energy measurements are associated with a measurement uncertainty of around 0.5%. This uncertainty is used to create a distribution centered at 1 (and therefore with a standard deviation of 0.005). This column represents random samples from that distribution. For each iteration of the OA code, a value from this column is multiplied by the monthly revenue meter energy data before the data enter the OA code, thereby capturing the 0.5% uncertainty.

• loss_fraction : Reported availability and curtailment losses are estimates and are associated with uncertainty. For now, we assume the reported values are associated with an uncertainty of 5%. Similar to above, we therefore create a distribution centered at 1 (with std of 0.05) from which we sample for each iteration of the OA code. These sampled values are then multiplied by the availability and curtailment data independently before entering the OA code to capture the 5% uncertainty in the reported values.

• num_years_windiness : This is intended to capture the uncertainty associated with the number of historical years an analyst chooses to use in the windiness correction. The industry standard is typically 20 years and is based on the assumption that year-to-year wind speeds are uncorrelated. However, a growing body of research suggests that there is some correlation in year-to-year wind speeds and that there are trends in the resource on the decadal timescale. To capture this uncertainty both in the long-term trend of the resource and the analyst choice, we randomly sample integer values between 10 and 20 as the number of years to use in the windiness correction.

• loss_threshold : Due to uncertainty in reported availability and curtailment estimates, months with high combined losses are associated with high uncertainty in the calculated gross energy. It is common to remove such data from analysis. For this analysis, we randomly sample float values between 0.1 and 0.2 (i.e. 10% and 20%) to serve as the criterion for the combined availability and curtailment losses. Specifically, months are excluded from analysis if their combined losses exceed that criterion for the given OA iteration.

• reanalysis_product : This captures the uncertainty of using different reanalysis products and, lacking a better method, is a proxy way of capturing uncertainty in the modelled monthly wind speeds. For each iteration of the OA code, one of the reanalysis products that we’ve already determined as valid (see the cells above) is selected.
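As a rough illustration of the inputs listed above, the sketch below draws one table of pre-sampled Monte Carlo inputs with NumPy. The variable name `mc_inputs`, the fixed seed, and the choice of a uniform distribution for outlier_threshold and loss_threshold are assumptions made for this example; the actual sampling inside MonteCarloAEP may differ in detail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
num_sim = 2000
products = ["era5", "merra2"]  # products judged valid in the earlier steps

mc_inputs = pd.DataFrame({
    # Normal distributions centered at 1 (0.5% and 5% standard deviation)
    "metered_energy_fraction": rng.normal(1.0, 0.005, num_sim),
    "loss_fraction": rng.normal(1.0, 0.05, num_sim),
    # Integer number of years, 10 to 20 inclusive (upper bound is exclusive)
    "num_years_windiness": rng.integers(10, 21, num_sim),
    # Uniform floats (assumed distribution shape)
    "loss_threshold": rng.uniform(0.1, 0.2, num_sim),
    "outlier_threshold": rng.uniform(2.0, 3.0, num_sim),
    # One reanalysis product per iteration
    "reanalysis_product": rng.choice(products, num_sim),
})

print(mc_inputs.describe())
```

Each row corresponds to one iteration of the OA code. The intermediate variables (slope, intercept, num_outliers) are not included because they can only be filled in once the regression for that iteration has been run.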

2.3.9 Step 9: Run the OA code

We’re now ready to run the Monte-Carlo based OA code. We repeat the OA process “num_sim” times using different sampling combinations of the input and intermediate variables to produce a distribution of AEP values.

A single line of code here in the notebook performs this step, but below is more detail on what is being done.


Steps in OA process:

• Set the wind speed and gross energy data to be used in the regression based on i) the reanalysis product to be used (Monte-Carlo sampled); ii) the NaN energy data criteria (1%); iii) Combined availability and curtailment loss criteria (Monte-Carlo sampled); and iv) the outlier criteria (Monte-Carlo sampled)

• Normalize gross energy to 30-day months

• Perform linear regression and determine slope and intercept values, their standard errors, and the covariance between the two

• Use the information above to create distributions of possible slope and intercept values (e.g. mean equal to slope, std equal to the standard error) from which we randomly sample a slope and intercept value (note that slope and intercept values are highly negatively correlated, so the sampling from both distributions is constrained accordingly)

• To perform the long-term correction, first determine the long-term monthly average wind speeds (i.e. average January wind speed, average February wind speed, etc.) based on a 10-20 year historical period as determined by the Monte Carlo process

• Apply the Monte-Carlo sampled slope and intercept values to the long-term monthly average wind speeds to calculate long-term monthly gross energy

• ‘Denormalize’ monthly long-term gross energy back to the normal number of days

• Calculate AEP by subtracting out the long-term availability loss (curtailment loss is left in as part of AEP)

[13]: # Run Monte-Carlo based OA
pa.run(num_sim=2000, reanal_subset=['era5', 'merra2'])

INFO:operational_analysis.methods.plant_analysis:Running with parameters: {'uncertainty_meter': 0.005, 'uncertainty_losses': 0.05, 'uncertainty_loss_max': array([10., 20.]), 'uncertainty_windiness': array([10., 20.]), 'uncertainty_nan_energy': 0.01, 'num_sim': 2000, 'reanal_subset': ['era5', 'merra2']}

100%|| 2000/2000 [00:22<00:00, 89.99it/s]

INFO:operational_analysis.methods.plant_analysis:Run completed
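To make the steps listed above more concrete, here is a minimal sketch of a single OA iteration on synthetic data. All numbers are made up, the Monte Carlo sampling of inputs and of the slope and intercept is left out for brevity, and `np.polyfit` stands in for whatever regression routine the library uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 24-month period of record; every number here is made up
days = np.tile([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], 2)
wind_speed = rng.uniform(5.0, 8.5, 24)                    # monthly mean, m/s
gross_30day = 0.3 * wind_speed - 0.8 + rng.normal(0, 0.05, 24)
gross_gwh = gross_30day * days / 30                       # energy per actual month

# 1. Normalize gross energy to 30-day months
gross_norm = gross_gwh * 30 / days

# 2. Linear regression; cov is the covariance matrix of (slope, intercept)
(slope, intercept), cov = np.polyfit(wind_speed, gross_norm, 1, cov=True)

# 3. Long-term average wind speed per calendar month (assumed values)
lt_wind = np.array([7.3, 7.6, 6.8, 6.0, 5.8, 5.4, 5.3, 5.5, 6.1, 6.7, 7.0, 7.4])
lt_gross_norm = slope * lt_wind + intercept

# 4. Denormalize back to the normal number of days per month
days_lt = np.array([31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
lt_gross = lt_gross_norm * days_lt / 30

# 5. Subtract the long-term availability loss (curtailment stays in AEP)
avail_loss = 0.01  # assumed long-term availability loss fraction
aep_gwh = lt_gross.sum() * (1 - avail_loss)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, AEP={aep_gwh:.2f} GWh/yr")
```

In the full method this sequence is repeated num_sim times, each time with a different draw of the inputs and of the slope and intercept, which is what produces a distribution of AEP values rather than a single number.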

The key result is shown below: a distribution of AEP values from which uncertainty can be deduced. In this case, uncertainty is around 9%.

[14]: # Plot a distribution of AEP values from the Monte-Carlo OA method

pa.plot_result_aep_distributions().show()
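One common way to summarize such a distribution is the standard deviation expressed as a percentage of the mean, optionally alongside percentiles. The sketch below uses a synthetic stand-in for the AEP results; the array `aep_gwh` is generated here for illustration and is not the output of the run above, and the exact uncertainty metric reported by the plotting function is not stated in this section.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for a Monte Carlo AEP distribution (GWh/yr)
aep_gwh = rng.normal(12.0, 1.1, 2000)

mean_aep = aep_gwh.mean()
unc_pct = 100 * aep_gwh.std() / mean_aep    # standard deviation as % of mean
p5, p95 = np.percentile(aep_gwh, [5, 95])   # 90% interval of the samples

print(f"mean AEP: {mean_aep:.2f} GWh/yr, uncertainty: {unc_pct:.1f}%")
print(f"5th to 95th percentile: {p5:.2f} to {p95:.2f} GWh/yr")
```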


2.3.10 Step 10: Post-analysis visualization

Here we show some supplementary results of the Monte Carlo OA approach to help illustrate how it works.

First, it’s worth looking at the Monte-Carlo tracker data frame again, now that the slope, intercept, and number of outlier fields have been completed. Note that for transparency, debugging, and analysis purposes, we’ve also included in the tracker data frame the number of data points used in the regression.

[15]: # Produce histograms of the various MC-parameters
mc_reg = pd.DataFrame(data = {'slope': pa._mc_slope.ravel(),
                              'intercept': pa._mc_intercept,
                              'num_points': pa._mc_num_points,
                              'metered_energy_fraction': pa._inputs.metered_energy_fraction,
                              'loss_fraction': pa._inputs.loss_fraction,
                              'num_years_windiness': pa._inputs.num_years_windiness,
                              'loss_threshold': pa._inputs.loss_threshold,
                              'reanalysis_product': pa._inputs.reanalysis_product})


It’s useful to plot distributions of each variable to show what is happening in the Monte Carlo OA method. Based on the plot below, we observe the following:

• metered_energy_fraction and loss_fraction sampling follow a normal distribution, as expected

• The slope and intercept distributions appear normally distributed, even though different reanalysis products are considered, resulting in different regression relationships. This is likely because the reanalysis products agree with each other closely.

• 24 data points were used for all iterations, indicating that there was no variation in the number of outlier months removed

• We see approximately equal sampling of the num_years_windiness, loss_threshold, and reanalysis_product, as expected

[16]: plt.figure(figsize=(15,15))
for s in np.arange(mc_reg.shape[1]):
    plt.subplot(4,3,s+1)
    plt.hist(mc_reg.iloc[:,s],40)
    plt.title(mc_reg.columns[s])
plt.show()

It’s worth highlighting the inverse relationship between slope and intercept values under the Monte Carlo approach. As stated earlier, slope and intercept values are strongly negatively correlated (e.g. slope goes up, intercept goes down), which is captured by the covariance result when performing linear regression. By constraining the random sampling of slope and intercept values based on this covariance, we ensure we aren’t sampling unrealistic combinations.

The plot below shows that the values are being sampled appropriately.


[17]: # Produce scatter plots of slope and intercept values, and overlay the resulting line of best fits over the actual wind speed
# and gross energy data points. Here we focus on the ERA-5 data
plt.figure(figsize=(8,6))
plt.plot(mc_reg.intercept[mc_reg.reanalysis_product =='era5'],mc_reg.slope[mc_reg.reanalysis_product =='era5'],'.')
plt.xlabel('Intercept (GWh)')
plt.ylabel('Slope (GWh / (m/s))')
plt.show()
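The effect of the covariance-constrained sampling discussed above can be sketched by comparing joint draws from a bivariate normal distribution with independent draws. The regression values, standard errors, and correlation below are made-up numbers, and using `multivariate_normal` is an assumption about how such sampling could be done rather than a description of the library internals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up regression result: slope (GWh per m/s), intercept (GWh), standard errors
slope, intercept = 0.30, -0.80
se_slope, se_intercept = 0.02, 0.13
corr = -0.98  # assumed strong negative slope-intercept correlation

cov = np.array([
    [se_slope**2, corr * se_slope * se_intercept],
    [corr * se_slope * se_intercept, se_intercept**2],
])

# Joint sampling honours the covariance between slope and intercept
joint = rng.multivariate_normal([slope, intercept], cov, size=5000)
sample_corr = np.corrcoef(joint[:, 0], joint[:, 1])[0, 1]

# Independent sampling ignores it and produces unrealistic combinations
indep_slope = rng.normal(slope, se_slope, 5000)
indep_intercept = rng.normal(intercept, se_intercept, 5000)

# Spread of the predicted 30-day energy at a wind speed of 6.5 m/s
ws = 6.5
spread_joint = np.std(joint[:, 0] * ws + joint[:, 1])
spread_indep = np.std(indep_slope * ws + indep_intercept)

print(f"sample correlation: {sample_corr:.2f}")
print(f"energy spread, joint: {spread_joint:.3f} GWh, independent: {spread_indep:.3f} GWh")
```

Ignoring the covariance inflates the spread of the predicted energy considerably in this example, which is why the slope and intercept are sampled together.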

We can look further at the influence of certain Monte Carlo parameters on the AEP result. For example, let’s see what effect the choice of reanalysis product has on the result:

[18]: # Boxplot of AEP based on choice of reanalysis product
tmp_df=pd.DataFrame(data={'aep':pa.results.aep_GWh,'reanalysis_product':mc_reg['reanalysis_product']})
tmp_df.boxplot(column='aep',by='reanalysis_product',figsize=(8,6))
plt.ylabel('AEP (GWh/yr)')
plt.xlabel('Reanalysis product')
plt.title('AEP estimates by reanalysis product')
plt.suptitle("")
plt.show()

/Users/esimley/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)


In this case, the two reanalysis products lead to similar AEP estimates, although MERRA2 yields slightly higher uncertainty.

We can also look at the effect of the number of years used in the windiness correction:

[19]: # Boxplot of AEP based on number of years in windiness correction
tmp_df=pd.DataFrame(data={'aep':pa.results.aep_GWh,'num_years_windiness':mc_reg['num_years_windiness']})
tmp_df.boxplot(column='aep',by='num_years_windiness',figsize=(8,6))
plt.ylabel('AEP (GWh/yr)')
plt.xlabel('Number of years in windiness correction')
plt.title('AEP estimates by windiness years')
plt.suptitle("")
plt.show()

/Users/esimley/opt/anaconda3/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)


As seen above, the number of years used in the windiness correction does not significantly impact the AEP estimate.


2.4 Example operational analysis using the augmented capabilities of the AEP class

%autoreload 2

This notebook provides an overview and walk-through of the augmented capabilities which have been added to the plant-level operational energy assessment (OA) of a wind plant in the PRUF project. The La Haute-Borne wind farm is used here and throughout the example notebooks.

The overall structure of the notebook follows the walk-through in the standard AEP example notebook ‘02_plant_aep_analysis’, to which we refer the reader for a detailed description of the steps needed to prepare the analysis. Here, we focus on the application of various approaches in the AEP calculation, with different time resolutions, regression inputs, and regression models used.

[14]: # Import required packages
import os
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import pandas as pd
import copy

from operational_analysis.methods import plant_analysis

In the call below, make sure the appropriate path to the CSV input files is specified. In this example, the CSV files are located directly in the ‘examples/operational_AEP_analysis/data’ folder.

[3]: # Prepare data
project.prepare()

˓→2014-2015

INFO:project_ENGIE:Correcting for out of range of temperature variables

INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.

INFO:project_ENGIE:Flagging unresponsive sensors

2.4.1 Comparison 1: AEP calculation using different regression models and different time resolution

The updated AEP class includes the choice of four different regression algorithms to calculate the long-term OA. The choice is based on what is specified by the reg_model parameter:

• linear regression (reg_model = ‘lin’, default)

• generalized additive regression model (reg_model = ‘gam’)

• gradient boosting regressor (reg_model = ‘gbm’)

• extremely randomized trees model (reg_model = ‘etr’)

Linear regression can be selected without restrictions, but should only be used at monthly resolution, since wind plant power curves are not linear at fine time resolution. On the other hand, as machine learning models are more suited for problems with a large number of data points, we have restricted the use of gam, gbm and etr regressors to OA performed at daily and hourly resolution only.
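The restriction described above amounts to a simple lookup of allowed model and resolution pairs. The helper below is a hypothetical illustration only: `check_reg_setup` and `ALLOWED_RESOLUTIONS` are names invented for this sketch and are not part of the library, which performs its own validation.

```python
# Allowed time resolutions per regression model, as described in the text:
# 'lin' is unrestricted, the machine learning models need daily or hourly data.
ALLOWED_RESOLUTIONS = {
    "lin": {"M", "D", "H"},
    "gam": {"D", "H"},
    "gbm": {"D", "H"},
    "etr": {"D", "H"},
}

def check_reg_setup(reg_model, time_resolution):
    """Return True if the combination is allowed, otherwise raise ValueError."""
    if reg_model not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"Unknown reg_model: {reg_model!r}")
    if time_resolution not in ALLOWED_RESOLUTIONS[reg_model]:
        raise ValueError(
            f"reg_model={reg_model!r} is not supported at "
            f"time_resolution={time_resolution!r}"
        )
    return True

print(check_reg_setup("lin", "M"))
print(check_reg_setup("gam", "H"))
```

Note that although linear regression is accepted at any resolution, the text recommends using it at monthly resolution only.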

Here, we’ll calculate AEP using all four regression models, using only wind speed as input (Comparison 2 will show an example of a multivariate regression). The linear regression model is run at monthly resolution; the GBM and ETR models at daily resolution; the GAM model at hourly resolution.

[4]: pa_lin = plant_analysis.MonteCarloAEP(project, reanal_products = ['merra2','era5'],
    time_resolution = 'M',
    reg_temperature = False, reg_winddirection = False, reg_model = 'lin')

pa_gam = plant_analysis.MonteCarloAEP(project, reanal_products = ['merra2','era5'],
    time_resolution = 'H',
    reg_temperature = False, reg_winddirection = False, reg_model = 'gam')

pa_gbm = plant_analysis.MonteCarloAEP(project, reanal_products = ['merra2','era5'],
    time_resolution = 'D',
    reg_temperature = False, reg_winddirection = False, reg_model = 'gbm')

pa_etr = plant_analysis.MonteCarloAEP(project, reanal_products = ['merra2','era5'],
    time_resolution = 'D',
    reg_temperature = False, reg_winddirection = False, reg_model = 'etr')

INFO:operational_analysis.methods.plant_analysis:Initializing MonteCarloAEP Analysis Object

As an example, the monthly data frame below includes wind speed averages for both the reanalysis products selected for the analysis.

[5]: # View the monthly data frame
pa_lin._aggregate.df.head()

[5]:
            energy_gwh  energy_nan_perc  num_days_expected  num_days_actual  \
time
2014-01-01    1.279667              0.0                 31               31
2014-02-01    1.793873              0.0                 28               28
2014-03-01    0.805549              0.0                 31               31
2014-04-01    0.636472              0.0                 30               30
2014-05-01    1.154255              0.0                 31               31

            availability_gwh  curtailment_gwh  gross_energy_gwh  \
time
2014-01-01          0.008721         0.000000          1.288387
2014-02-01          0.005280         0.000000          1.799153
2014-03-01          0.000151         0.000000          0.805700
2014-04-01          0.002773         0.000000          0.639245
2014-05-01          0.015176         0.000225          1.169656

            availability_pct  curtailment_pct  avail_nan_perc  curt_nan_perc  \
time
2014-01-01          0.006769         0.000000             0.0            0.0
2014-02-01          0.002934         0.000000             0.0            0.0
2014-03-01          0.000188         0.000000             0.0            0.0
2014-04-01          0.004338         0.000000             0.0            0.0
2014-05-01          0.012974         0.000192             0.0            0.0

            nan_flag  availability_typical  curtailment_typical  \
time

            combined_loss_valid    merra2      era5
time
2014-01-01                 True  7.227947  7.314878
2014-02-01                 True  8.598686  8.347006
2014-03-01                 True  5.207071  5.169673
2014-04-01                 True  4.872304  4.756275
2014-05-01                 True  6.351635  6.162751

We now run the Monte-Carlo based OA for the four setups specified above. The following lines of code launch the Monte Carlo-based OA for AEP. We identify each source of uncertainty in the OA estimate and use that uncertainty to create distributions of the input and intermediate variables from which we can sample for each iteration of the OA code.

We repeat the OA process “num_sim” times using different sampling combinations of the input and intermediate variables to produce a distribution of AEP values. Running the OA with the machine learning models at daily resolution is significantly slower than the case of a simple linear regression. Therefore, we have reduced the num_sim parameter to speed up the computation here. Once again, for a detailed description of the steps in the OA process, please refer to the standard AEP example notebook.

[6]: # Run Monte-Carlo based OA - linear monthly
pa_lin.run(num_sim=1000)

# Run Monte-Carlo based OA - gam model, hourly resolution
pa_gam.run(num_sim=500)

# Run Monte-Carlo based OA - gradient boosting model, daily resolution
pa_gbm.run(num_sim=500)

# Run Monte-Carlo based OA - extra randomized tree model, daily resolution
pa_etr.run(num_sim=500)

˓→energy': 0.01, 'num_sim': 1000, 'reanal_subset': ['merra2', 'era5']}

100%|| 1000/1000 [00:14<00:00, 68.65it/s]

0%| | 0/500 [00:00<?, ?it/s]
Fitting 5 folds for each of 20 candidates, totalling 100 fits
0%| | 1/500 [00:07<1:03:44, 7.66s/it]
Fitting 5 folds for each of 20 candidates, totalling 100 fits
100%|| 500/500 [05:11<00:00, 1.61it/s]
0%| | 0/500 [00:00<?, ?it/s]
Fitting 5 folds for each of 20 candidates, totalling 100 fits
100%|| 500/500 [02:33<00:00, 3.25it/s]
0%| | 0/500 [00:00<?, ?it/s]
Fitting 5 folds for each of 20 candidates, totalling 100 fits
1%| | 5/500 [00:44<38:33, 4.67s/it]
Fitting 5 folds for each of 20 candidates, totalling 100 fits
100%|| 500/500 [08:35<00:00, 1.03s/it]

The key results for the AEP analysis are shown below: distributions of AEP values from which uncertainty can be deduced. We can now compare the AEP distributions obtained for the four configurations of the OA.

[7]: # Plot a distribution of AEP values from the Monte-Carlo OA method - wind speed only
pa_lin.plot_result_aep_distributions().show()

[8]: # Plot a distribution of AEP values from the Monte-Carlo OA method - gam model
pa_gam.plot_result_aep_distributions().show()

[9]: # Plot a distribution of AEP values from the Monte-Carlo OA method - gradient boosting model
pa_gbm.plot_result_aep_distributions().show()

[10]: # Plot a distribution of AEP values from the Monte-Carlo OA method - extra randomized tree model
pa_etr.plot_result_aep_distributions().show()

For this specific case, we see a decrease in AEP uncertainty when the calculation is performed with a machine learning regression model at daily resolution, which becomes even more significant when performing the calculation at hourly resolution.

2.4.2 Comparison 2: AEP calculation using various input variables

The augmented capabilities of the AEP class now allow the user to include temperature and/or wind direction as additional inputs to the long-term OA. This choice is controlled by the booleans “reg_temperature” and “reg_winddirection”. In this example, we will compute AEP using a multivariate hourly GAM regression, including wind speed and temperature as inputs, and compare the results with the univariate GAM applied in the previous comparison.

[11]: pa_gam_T = plant_analysis.MonteCarloAEP(project, reanal_products = ['merra2','era5'],
    time_resolution = 'H',
    reg_temperature = True, reg_winddirection = False, reg_model = 'gam')

˓→Object


We now run the Monte-Carlo based OA for this new setup:

[12]: # Run Monte-Carlo based OA - gam model
pa_gam_T.run(num_sim=500)

0%| | 0/500 [00:00<?, ?it/s]
Fitting 5 folds for each of 20 candidates, totalling 100 fits
100%|| 500/500 [09:23<00:00, 1.13s/it]
INFO:operational_analysis.methods.plant_analysis:Run completed

And we can now take a look at the AEP distribution:

[13]: # Plot a distribution of AEP values from the Monte-Carlo OA method - wind speed + temperature
pa_gam_T.plot_result_aep_distributions().show()


In this case, only a slight reduction in AEP uncertainty is achieved when temperature is added as an additional input to the hourly GAM regression. Our analysis (Bodini et al. 2021, Wind Energy) showed that adding temperature as an additional input has the largest benefits for wind plants that experience a strong seasonal cycle, which might not be the case for the specific wind plant considered in this example.


2.5 The next step in the gap analysis is to calculate the Turbine Ideal Energy (TIE) for the wind farm based on SCADA data

%autoreload 2

This notebook provides an overview and walk-through of the turbine ideal energy (TIE) method in OpenOA. The TIE metric is defined as the amount of electricity generated by all turbines at a wind farm operating under normal conditions (i.e., not subject to downtime or significant underperformance, but subject to wake losses and moderate turbine performance losses). The approach to calculate TIE is to:

1. Filter out underperforming data from the power curve for each turbine

2. Develop a statistical relationship between the remaining power data and key atmospheric variables from a long-term reanalysis product

3. Long-term correct the period of record power data using the above statistical relationship

4. Sum up the long-term corrected power data across all turbines to get TIE for the wind farm

Here we use different reanalysis products to capture the uncertainty around the modeled wind resource. We also consider uncertainty due to power data accuracy and the power curve filtering choices for identifying normal turbine performance made by the analyst.

In this example, the process for estimating TIE is illustrated both with and without uncertainty quantification.

[2]: # Import required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from operational_analysis.methods import turbine_long_term_gross_energy

In the call below, make sure the appropriate path to the CSV input files is specified. In this example, the CSV files are located directly in the ‘examples/data/la_haute_borne’ folder.

[4]: # Load and prepare the wind farm data
project.prepare()

˓→2014-2015

[5]: # Let's take a look at the columns in the SCADA data frame
project._scada.df.columns

[5]: Index(['id', 'wrot_BlPthAngVal1_avg', 'wmet_wdspd_avg', 'wmet_VaneDir_avg',
       'wmet_EnvTmp_avg', 'wyaw_YwAng_avg', 'wmet_HorWdDir_avg', 'wtur_W_avg',
       'energy_kwh'],
      dtype='object')

2.5.1 TIE calculation without uncertainty quantification

Next we create a TIE object which will contain the analysis to be performed. The method has the ability to calculate uncertainty in the TIE metric through a Monte Carlo sampling of filtering thresholds, power data, and reanalysis product choices. For now, we turn this option off and run the method a single time.


[6]: ta = turbine_long_term_gross_energy.TurbineLongTermGrossEnergy(project)

INFO:operational_analysis.methods.turbine_long_term_gross_energy:Initializing TurbineLongTermGrossEnergy Object
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Note: uncertainty quantification will NOT be performed in the calculation
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Processing SCADA data into dictionaries by turbine (this can take a while)

All of the steps in the TIE calculation process are pulled under a single run() function. These steps include:

1. Processing reanalysis data to daily averages

2. Filtering the SCADA data

3. Fitting the daily reanalysis data to daily SCADA data using a Generalized Additive Model (GAM)

4. Applying GAM results to calculate long-term TIE for the wind farm

By setting UQ = False (the default argument value), we must manually specify key filtering thresholds that would otherwise be sampled from a range of values through Monte Carlo. Specifically, we must set thresholds applied to the bin_filter() function in the toolkits.filtering class of OpenOA.

[7]: # Specify filter threshold values to be used
wind_bin_thresh = 2.0 # Exclude data outside 2 m/s of the median for each power bin
max_power_filter = 0.90 # Don't apply bin filter above 0.9 of turbine capacity
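To show what these two thresholds control, the sketch below flags points whose wind speed lies more than `wind_bin_thresh` from the median wind speed of their power bin, skipping bins above `max_power_filter` of turbine capacity. The function `flag_power_bin_outliers`, the 100 kW bin width, and the sample data are assumptions made for this illustration; it is a simplified stand-in and not the library's bin_filter() implementation.

```python
import numpy as np
import pandas as pd

def flag_power_bin_outliers(power_kw, wind_speed, capacity_kw,
                            wind_bin_thresh=2.0, max_power_filter=0.90,
                            bin_width_kw=100.0):
    """Return a boolean array, True where a point is flagged as an outlier."""
    power = pd.Series(power_kw, dtype=float).reset_index(drop=True)
    ws = pd.Series(wind_speed, dtype=float).reset_index(drop=True)

    # Assign each point to a power bin and get the bin's median wind speed
    bins = np.floor(power / bin_width_kw)
    median_ws = ws.groupby(bins).transform("median")

    # Only filter below the chosen fraction of turbine capacity
    in_range = power <= max_power_filter * capacity_kw
    too_far = (ws - median_ws).abs() > wind_bin_thresh
    return (in_range & too_far).to_numpy()

# Made-up data for a 2000 kW turbine
power = [500, 510, 520, 530, 1950, 1960]
wind = [6.0, 6.1, 5.9, 10.0, 12.0, 20.0]
flags = flag_power_bin_outliers(power, wind, capacity_kw=2000)
print(flags)
```

In this example the fourth point is flagged because its wind speed is far above the median of its power bin, while the last two points are left alone because they sit above 90% of capacity, where the power curve is flat and wind speed naturally varies widely.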

We must also decide how to deal with missing data when computing daily sums of energy production from each turbine. Here we set the threshold at 0.9: if more than 90% of SCADA data are available for a given day, the daily energy is scaled up to account for the fraction of data missing; if data recovery is less than 90%, that day is excluded from the analysis.

[8]: # Set the correction threshold to 90%

correction_threshold = 0.90
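The missing-data correction can be sketched as follows, assuming 10-minute SCADA records (144 per day) held in a pandas Series with NaN for missing values. The function name `daily_energy_with_correction` and the synthetic data are hypothetical; days with exactly 90% recovery are kept here, which is a choice made for this sketch.

```python
import numpy as np
import pandas as pd

def daily_energy_with_correction(energy_kwh, correction_threshold=0.90,
                                 expected_per_day=144):
    """Daily energy sums scaled for missing data; low-recovery days become NaN."""
    daily_sum = energy_kwh.resample("D").sum(min_count=1)
    # Fraction of expected records that are present (count ignores NaN)
    data_frac = energy_kwh.resample("D").count() / expected_per_day
    corrected = daily_sum / data_frac
    # Exclude days with data recovery below the threshold
    corrected[data_frac < correction_threshold] = np.nan
    return corrected

# Two synthetic days of 10-minute data, 1 kWh per record
idx = pd.date_range("2014-01-01", periods=288, freq="10min")
energy = pd.Series(1.0, index=idx)
energy.iloc[:10] = np.nan      # day 1: about 93% recovery, scaled up
energy.iloc[144:244] = np.nan  # day 2: about 31% recovery, excluded

daily = daily_energy_with_correction(energy)
print(daily)
```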

Now we’ll call the run() method to calculate TIE, choosing two reanalysis products to be used in the TIE calculation process.

[9]: # We can choose to save key plots to a file by setting enable_plotting = True and

# specifying a directory to save the images. For now we turn off this feature.

ta.run(reanal_subset = ['era5', 'merra2'], enable_plotting = False, plot_dir = None, wind_bin_thresh = wind_bin_thresh, max_power_filter = max_power_filter, correction_threshold = correction_threshold)

0%| | 0/2 [00:00<?, ?it/s]
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Filtering turbine data
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Processing reanalysis data to daily averages
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Processing scada data to daily sums
0it [00:00, ?it/s]
4it [00:00, 27.11it/s]
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Setting up daily data for model fitting
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Fitting model data
/Users/esimley/opt/anaconda3/lib/python3.7/site-packages/scipy/linalg/basic.py:1321: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
x, resids, rank, s = lstsq(a, b, cond=cond, check_finite=False)
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Applying fitting results to calculate long-term gross energy
50%| | 1/2 [00:02<00:02, 2.02s/it]
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Filtering turbine data
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Processing reanalysis data to daily averages
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Processing scada data to daily sums
0it [00:00, ?it/s]
4it [00:00, 25.93it/s]
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Setting up daily data for model fitting
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Fitting model data
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Applying fitting results to calculate long-term gross energy
100%|| 2/2 [00:03<00:00, 1.93s/it]
INFO:operational_analysis.methods.turbine_long_term_gross_energy:Run completed

Now that we’ve finished the TIE calculation, let’s examine the results.

[10]: ta._plant_gross

[10]: array([[13587744.52216543], [13716630.11389883]])

[11]: # What is the long-term annual TIE for whole plant
print('Long-term turbine ideal energy is %s GWh/year' %np.round(np.mean(ta._plant_gross/1e6),1))

Long-term turbine ideal energy is 13.7 GWh/year

The long-term TIE value of 13.7 GWh/year is based on the mean TIE resulting from the two reanalysis products considered.

Next, we can examine how well the filtering worked by examining the power curves for each turbine using the plot_filtered_power_curves() function.

[12]: # Currently saving figures in examples folder. The folder where figures are saved can be changed if desired.
ta.plot_filtered_power_curves(save_folder = "./", output_to_terminal = True)
