PEXTA: A PGAS I/O for Hyperspectral Data Storage

(1)

G. Nimako, E.J Otoo and Michael Sears

School of Computer Science The University of the Witwatersrand

Johannesburg, South Africa

(2)

Introduction

The conceptual model of a dataset as amultidimensional array

has been around since the early days ofrelational database model.

Records of multi-attribute files, perceived as points in

multi-dimensional space, form the bases of many of the efficient data structures and algorithms.

Examples: KD-Tree, Quad-Tree, Grid-File, multi-dimensional OLAP and data warehousing.

Scientific data-sets are maintained predominantly as multi-dimensional arrays - HDF5, NetCDF, New SciDB, etc.

(3)

The conceptual model of a dataset as amultidimensional array

has been around since the early days ofrelational database model.

Records of multi-attribute files, perceived as points in

multi-dimensional space, form the bases of many of the efficient data structures and algorithms.

Examples: KD-Tree, Quad-Tree, Grid-File, multi-dimensional OLAP and data warehousing.

Scientific data-sets are maintained predominantly as multi-dimensional arrays - HDF5, NetCDF, New SciDB, etc.

(4)

Motivation

Processing massively large scientific data(Big Data), in the realm of Peta-scale and Exa-scale computing usingPartitioned Global Address Space (PGAS)computational model. This also

subsumes thehadoop-likecomputing model. The PGAS computational model are exemplified by:

Global Array Toolkit, Unified Parallel C (UPC), OpenShmem Application to hyper-spectral image data storage and analysis The database consistent with this model of computation isSciDB

It is anArray Data Modelon shared parallel architectures with unbounded non-uniform dimensions with holes, arbitrary nesting, sophisticated storage representations, including overlapping chunks, etc.

(5)

Distributed dense arrays that can be accessed through a shared memory-like style

(6)

Motivation: The Global Array Toolkit

Distributed dense arrays that can be accessed through a shared memory-like style

(7)

Multidimensional Array Files

Mapping ofconventional array elements onto linear storage, using row-major orcolumn-major order allows for extendability of the array in only one dimension.

The problem has been addressed in some earlier papers. Notably by :

G. Nimako, E. Otoo and D. Kwofie, PARCO Journal [2013] G. Nimako, E. Otoo and D. Kwofie, IASDS of IEEE Cluster [2013] G. Nimako, E. Otoo and D. Kwofie, P2S2 [2012]

Tsuji, M. Kuroda and K. Higuchi [2008] A. Rosenberg and L. Stockmeyer [1977]

(8)

Multidimensional Array Files

The question then arises: ”How can we store arrays elements so that the index range of any dimension can be extended?”

(9)

The question then arises: ”How can we store arrays elements so that the index range of any dimension can be extended?”

(10)

Main Contributions

We present a computable access functionF∗₍_h_i

0, i1, . . . , ik−1i)

that maps the k-dimensional indexhi0, i1, . . . , ik−1ito a value that

gives the linear address location`j in timeO(k+ logN)using

O(k2N1/k)addition space. This allows us to store elements of extendible array directly as array-files.

We introduce Parallel Chunked Extendible Dense Array I/O (PEXTA) for the GA

PEXTA can be used as a drop-in replacement for the Disk Resident Array of the GA-Toolkit

PEXTA provides an efficient I/O for Parallel Implementation of Endmember Extraction Algorithms From Hyperspectral Data.

(11)

Main Contributions

0, i1, . . . , ik−1i)

(12)

Main Contributions

0, i1, . . . , ik−1i)

(13)

0, i1, . . . , ik−1i)

(14)

Rest of the Talk

Extendible Arrays of Data Chunked Extendible Arrays PEXTA I/O Model

PEXTA Implementation for GA

Using PEXTA for Parallel Implementation of Endmember Extraction Algorithms From Hyperspectral Data

(15)

Consider an 3-D arrayA[U0][U1][U2]whose elements are

allocated in row-major order.

The location of an elementAhi0, i1, i2i, is given by

loc(Ah0,0,0i+qwhereq is computed as: q =i0∗C0+i1∗C1+i2∗C2

and

C0=U1∗U2

C1=U2

C2= 1

The valuesC0, C1, C2 are held in thedope-vector.

(16)

Conventional Array Using Row-Major Ordering

Consider an 3-D arrayA[3][3][2]whose elements are allocated in row-major order as below

(17)

(18)

Conventional Array Using Row-Major Ordering

(19)

(20)

Linear Mapping for a Dense Extendible Array

The mapping function for extendible array uses axial-vectors to store information that includes thedope-vector at the time of extending the dimension

A list of axial-vectors is maintained for each dimension.

LetA[U∗0][U∗1][U∗2]be an arbitrary3-dimensional array, whereU∗j implies that the bound can be extended.

Similarly we employ the notation:

F()when referring to conventional array mapping function.

(21)

LetA[U∗0][U∗1][U∗2]be an arbitrary3-dimensional array, whereU∗j implies that the bound can be extended.

Similarly we employ the notation:

(22)

Linear Mapping for a Dense Extendible Array

LetA[U∗0][U∗1][U∗2]be an arbitrary3-dimensional array, whereU∗j

implies that the bound can be extended. Similarly we employ the notation:

(23)

(24)

Illustration of the Linear Mapping for a Dense Extendible

Array

(25)

(26)

Illustration of the Linear Mapping for a Dense Extendible

Array

(27)

(28)

Illustration of the Linear Mapping for a Dense Extendible

Array

(29)

(30)

Illustration of the Linear Mapping for a Dense Extendible

Array

(31)

(32)

Illustration of the Linear Mapping for a Dense Extendible

Array

(33)

(34)

Illustration of the Linear Mapping for a Dense Extendible

Array

(35)

(36)

Linear Mapping for a Dense Extendible Array

Suppose that in ak-dimensional extendible arrayA[U∗0][U∗1][U∗2]...[U∗k−1],

dimensionlis extended byλl, then the index range increases fromU∗l to U∗l +λl.

Let the locationAh0,0, ...,U∗l, ...,0i(i.e. the starting location of an

allocatedhyperslab) be denoted as`_Z0

U∗_l whereZ0_U∗ l = Qk−1 r=0U ∗ r.

The Mapping Function

q∗=F∗(hi0, i1, i2, ..., ik−1i)) =Z0_U∗ l + (il−U ∗ l)C ∗ l + k−1 X j=0 j6=l ijCj∗ Cl∗= k−1 Y j=0 j6=l U∗j C_j∗= k−1 Y r=j+1 r6=l U∗r

The limitation imposed byF()is that extensions of the array can only be done on one dimension (i.e. that is dimensionU0since it was not used in the evaluation ofF().

(37)

Suppose that in ak-dimensional extendible arrayA[U∗0][U∗1][U∗2]...[U∗k−1],

dimensionlis extended byλl, then the index range increases fromU∗l to U∗l +λl.

Let the locationAh0,0, ...,U∗l, ...,0i(i.e. the starting location of an

allocatedhyperslab) be denoted as`_Z0

U∗_l whereZ0_U∗ l = Qk−1 r=0U ∗ r.

The Mapping Function

q∗=F∗(hi0, i1, i2, ..., ik−1i)) =Z0_U∗ l + (il−U ∗ l)C ∗ l + k−1 X j=0 j6=l ijCj∗ Cl∗= k−1 Y j=0 j6=l U∗j C_j∗= k−1 Y r=j+1 r6=l U∗r

The limitation imposed byF()is that extensions of the array can only be done on one dimension (i.e. that is dimensionU0since it was not used

in the evaluation ofF().

(38)

Chunking Extendible Dense Arrays

The use of the list for axial-vectors can be expensive and depends in particular on the interruptible expansions. (worst case is when we have cubical extensions).

Such interruptible expansions cause the addition of a new entry in the list of axial vectors.

Chunking the array gives some additional advantages:

It gives contiguous storage allocations for the elements of the chunks.

When arrays are allocated onto secondary storage, I/O can be made in multiples of the chunk size.

The allocation is done inchunksas opposed to the single

(39)

The use of the list for axial-vectors can be expensive and depends in particular on the interruptible expansions. (worst case is when we have cubical extensions).

Such interruptible expansions cause the addition of a new entry in the list of axial vectors.

Chunking the array gives some additional advantages: It gives contiguous storage allocations for the elements of the chunks.

When arrays are allocated onto secondary storage, I/O can be made in multiples of the chunk size.

The allocation is done inchunksas opposed to the single elements.

(40)

(41)

We model the PEXTA after MPI-I/O Definitions

Thefiletypedescribes the data layout in file

Thebuftypedescribes how data is laid out in a process’s buffer The elementary datatype,etype, ensures consistency between the type signature of both filetype and buftype.

The principal-array data and the axial-vectors, that correspond to

the leaf-nodes of theO2-Tree, are stored on disk as the

principal-array file,Fp and metadata file,Fm, respectively.

AlthoughFpwill be stored on disk, it is accessed through the

(42)

PEXTA Overview and Implementation

We model the PEXTA after MPI-I/O Definitions

Thefiletypedescribes the data layout in file

Thebuftypedescribes how data is laid out in a process’s buffer The elementary datatype,etype, ensures consistency between the type signature of both filetype and buftype.

The principal-array data and the axial-vectors, that correspond to the leaf-nodes of theO2-Tree, are stored on disk as the

principal-array file,Fp and metadata file,Fm, respectively.

AlthoughFpwill be stored on disk, it is accessed through the

(43)

PEXTA Model supports both regular and irregular distributions of global array of the GA-Toolkit. We have completed implementation for regular distribution.

Processes can perform independent or collective reads and writes of their respective sub-arrays.

An external/independent process can request for an exclusive

write lock token on the metadata file,Fm and append chunks of

(44)

PEXTA Overview and Implementation

PEXTA Model supports both regular and irregular distributions of global array of the GA-Toolkit. We have completed implementation for regular distribution.

Processes can perform independent or collective reads and writes of their respective sub-arrays.

An external/independent process can request for an exclusive write lock token on the metadata file,Fm and append chunks of

(45)

When a new session starts, the processes within a communicator group reads the newly updated metadata file into their individual

O2-Trees.

(46)

PEXTA Overview and Implementation

When a new session starts, the processes within a communicator group reads the newly updated metadata file into their individual O2-Trees.

(47)

The PEXTA API provides routines quite similar to DRA for creating disk resident array

We however provide additional functions for extendibility. Functions

int PEXTA Init(int maxArrays, double maxArraySize, double totalDiskSpace, double maxMemory);

int PEXTA Create(int type, int ndim, size t

dims[],char *name, char* filename, int mode, int mode, size t reqdims[], int g a, int *p a);

int PEXTA IRead_{(int g a, int p a, int *request);} int PEXTA Extend(int p a, size t edims[],size t odims[], int *request);

(48)

(49)

Recently, high-performance computing systems have become more widespread in remote sensing applications.

Parallel techniques for endmember extraction may allow rapid inspection of massive data archives

It will also enable generation of comprehensive spectral libraries and other useful information

An opportunity to explore methodologies in areas such as

multisource data registration and mining that previously looked too computationally intensive for practical applications due to the immense files involved

(50)

Hyperspectral Data Analysis

Mixed pixels (due to insufficient spatial resolution and mixing effects in surfaces)

Sub-pixel targets (very important and crucial in many hyperspectral applications)

(51)

The goal is to find extreme pixel vectors (endmembers) that can be used to unmix other mixed pixels in the data using a linear mixture model.

Each mixed pixel can be obtained as a combination of

endmember fractional abundances. A crucial issue is how to find spectral endmembers.

Endmember Extraction Algorithms:

Boardman’s Pixel Purity Index (PPI) Algorithm

Harsanyi and Chang’s Orthogonal Subspace Projection (OSP) Winter’s N-FINDR Algorithm

(52)

Parallel Endmember Extraction Using PEXTA and GA

The master processor scatters the data among the distributed workers

We provide collective operations in PEXTA for data partitioning. Alternatively, the GA collective operation for data partitioning can be used.

(53)

Spectral-Domain Partitioning

A single pixel vector (spectral signature) may be stored in different processing units

It requires communications for individual pixel-based calculations such as spectral angle computations.

(54)

Parallel Endmember Extraction Using PEXTA and GA

Spatial-Domain Partitioning

Every pixel vector (spectral signature) is stored in the same processing unit.

This is beneficial for a proposed sliding-window algorithm in terms of low-level image processing computations.

(55)

Poor parallel I/O performance has been a bottleneck in many data-intensive applications

The organisation of extendible arrays using a mapping function is highly appropriate for most scientific datasets where the model of the data is perceived as a large multi-dimensional array.

Main application areas is for hyper-spectral image processing and a host other scientific applications that are based on multidimensional data models.

Remote sensing missions require efficient I/O able to cope with the extremely high dimensionality of the collected data without

compromising mission payload.

Future work on the use of in memory resident caches of both the chunks and the array are anticipated.

(56)

PEXTA: A PGAS I/O for Hyperspectral Data Storage

Introduction

Motivation

Motivation: The Global Array Toolkit

Multidimensional Array Files

Multidimensional Array Files

Main Contributions

Main Contributions

Main Contributions

Rest of the Talk

Conventional Array Using Row-Major Ordering

Conventional Array Using Row-Major Ordering

Linear Mapping for a Dense Extendible Array

Linear Mapping for a Dense Extendible Array

Illustration of the Linear Mapping for a Dense Extendible

Array

Illustration of the Linear Mapping for a Dense Extendible

Array

Illustration of the Linear Mapping for a Dense Extendible

Array

Illustration of the Linear Mapping for a Dense Extendible

Array

Illustration of the Linear Mapping for a Dense Extendible

Array

Illustration of the Linear Mapping for a Dense Extendible

Array

Linear Mapping for a Dense Extendible Array

Chunking Extendible Dense Arrays

PEXTA Overview and Implementation

PEXTA Overview and Implementation

PEXTA Overview and Implementation

Hyperspectral Data Analysis

Parallel Endmember Extraction Using PEXTA and GA

Parallel Endmember Extraction Using PEXTA and GA

Thanks!

Thank You

Questions?