The Piranha computer algebra system: introduction and implementation details

(1)

Francesco Biscani

Advanced Concepts Team

European Space Agency (ESTEC)

Course on Differential Equations and Computer Algebra

(2)

Outline

1. A Brief Overview
   - Algebraic Structures
   - Software Framework
   - Pyranha: the Python Bindings for Piranha
2. Fast multiplication algorithms
   - Asymptotically fast?
   - Practically fast
3. Benchmarks
4. The Future

(4)

Piranha in a Nutshell

- It is an algebraic manipulation framework
- Around 12000 SLOC (Source Lines Of Code)
- Written in C++ and object-oriented
- It makes extensive use of existing Free-Software tools and libraries (Boost, GMP, Python, ...)
- Multiplatform (GNU/Linux, Windows, BSD)
- Free Software itself

(5)

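Algebraic Structures for Celestial Mechanics

- Polynomials: $\sum_i C_i \, p_i$
- Fourier series: $\sum_i C_i \begin{matrix}\cos\\ \sin\end{matrix}(i \cdot t)$
- Poisson series: $\sum_i \sum_j C_{i,j} \, p_j \begin{matrix}\cos\\ \sin\end{matrix}(i \cdot t)$
- Echeloned Poisson series: $\sum_i \sum_j \sum_k \dfrac{C_{i,j,k} \, p_k}{\prod_l (l \cdot d)^{\delta_{j,l}}} \begin{matrix}\cos\\ \sin\end{matrix}(i \cdot t)$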

(6)

The Framework 1/2

Q: Can we manipulate these algebraic structures in a general and unified way?

The Basic Ideas

1. Series are collections of terms
2. Terms are coefficient-key pairs
3. Terms are uniquely identified by their keys: $t_1 \equiv t_2 \iff t_1.\mathrm{key} \equiv t_2.\mathrm{key}$
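The three ideas above can be illustrated with a minimal Python sketch. The `Series` class and its `insert` method are hypothetical names chosen for this illustration and are not Piranha's or Pyranha's API; a plain dictionary stands in for whatever container the framework actually uses.

```python
from fractions import Fraction


class Series:
    """Toy series: a collection of terms stored as key -> coefficient."""

    def __init__(self):
        self.terms = {}

    def insert(self, coefficient, key):
        # Two terms with equal keys are the same term: merge their coefficients.
        total = self.terms.get(key, 0) + coefficient
        if total == 0:
            self.terms.pop(key, None)  # drop terms that cancel out
        else:
            self.terms[key] = total


# Polynomial in (x, y): the key is the tuple of exponents.
s = Series()
s.insert(Fraction(1, 2), (2, 0))  # 1/2 x^2
s.insert(3, (1, 1))               # 3 x y
s.insert(Fraction(1, 2), (2, 0))  # same key as the first term: merged into x^2
s.insert(-3, (1, 1))              # cancels the x y term
print(s.terms)                    # {(2, 0): Fraction(1, 1)}
```

Changing the key type (for instance a tuple of trigonometric multipliers plus a cos/sin flag) would give a Fourier series with the same container logic, which is the kind of sharing the slide is asking about.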

(9)

Object-Oriented and Generic Programming

The C++ Language

- High performance and high-level design are not mutually exclusive
- OO: inheritance, polymorphism, encapsulation, modularity
- Generic programming: type-agnostic classes
- Template meta-programming (aka modern C++, see Alexandrescu [2001]): OO features with zero overhead, efficient compile-time optimizations and checks

The Bottom Line

It is possible to share a substantial portion of the implementation among the supported algebraic structures and reduce code

(11)

A Quick SLOC Analysis 1/2

Gregoire & Colbert

- Chapront [2003]: Fourier and Poisson series manipulators
- Written in Fortran 90
- Feature set comparable with Piranha’s
- ∼4000 SLOC each (Piranha is ∼12000 SLOC)

Piranha additionally supports:

- polynomials as top-level series
- multiple representations for keys and numerical coefficients (complex, reals, integers, rationals, arbitrary-size, etc.)
- 12 different manipulators are currently implemented within the framework (other combinations can be trivially added)

(12)

A Quick SLOC Analysis 2/2

Piranha’s SLOC count divided by directory:

(13)

Pyranha

- Python bindings for Piranha
- Uses the Boost.Python library
- Compiled-code performance with the flexibility of an interpreted language
- Python is a real computer language (not some obscure ad-hoc language)
- Many possibilities for extensions
- Interactive graphical environment with IPython, matplotlib and PyQt4

(15)

Schoolbook multiplication

Given:

$a(x) = a_1 x + a_0, \quad b(x) = b_1 x + b_0,$

compute $a(x) \cdot b(x)$ as

$a_0 b_0 + a_0 b_1 x + a_1 b_0 x + a_1 b_1 x^2.$

Complexity: $O(n^2)$.
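As a small illustration of the schoolbook method, here is a Python sketch for dense univariate polynomials stored as coefficient lists (lowest degree first). The function name and the list representation are assumptions made for this example only.

```python
def schoolbook_multiply(a, b):
    """Multiply dense polynomials given as coefficient lists, lowest degree first."""
    result = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            result[i + j] += ai * bj  # one coefficient product per pair of terms
    return result


# (3x + 2) * (5x + 4) = 15x^2 + 22x + 8
print(schoolbook_multiply([2, 3], [4, 5]))  # [8, 22, 15]
```

The two nested loops make the quadratic cost visible: every term of one operand meets every term of the other.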

(16)

Asymptotically fast multiplication: Karatsuba

Karatsuba’s algorithm: given

$a(x) = a_1 x + a_0, \quad b(x) = b_1 x + b_0,$

express $a(x) \cdot b(x)$ as

$a_0 b_0 + [(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1] \, x + a_1 b_1 x^2,$

with 3 multiplications vs 4 of the classical method.

Complexity: $O(n^{\log_2 3})$.
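Applied recursively to the two halves of each operand, the identity above gives the following Python sketch. It assumes dense coefficient lists of equal length (lowest degree first); the helper names `add` and `sub` are introduced here for illustration.

```python
def add(p, q):
    """Add two coefficient lists, padding the shorter one with zeros."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [x + y for x, y in zip(p, q)]


def sub(p, q):
    """Subtract coefficient list q from p."""
    return add(p, [-y for y in q])


def karatsuba(a, b):
    """Multiply coefficient lists of equal length n; returns 2n - 1 coefficients."""
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a0, a1 = a[:h], a[h:]  # a = a0 + a1 * x^h
    b0, b1 = b[:h], b[h:]  # b = b0 + b1 * x^h
    low = karatsuba(a0, b0)                     # a0 * b0
    high = karatsuba(a1, b1)                    # a1 * b1
    mid = karatsuba(add(a0, a1), add(b0, b1))   # (a0 + a1) * (b0 + b1)
    cross = sub(sub(mid, low), high)            # a0 * b1 + a1 * b0
    result = [0] * (2 * n - 1)
    for i, c in enumerate(low):
        result[i] += c
    for i, c in enumerate(cross):
        result[i + h] += c
    for i, c in enumerate(high):
        result[i + 2 * h] += c
    return result


print(karatsuba([2, 3], [4, 5]))  # [8, 22, 15]
```

Each level performs three recursive products instead of four, which is where the exponent $\log_2 3$ comes from.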

(17)

Asymptotically fast multiplication: FFT

1. Convert polynomials to vector of coefficients
2. Compute the FFT of both vectors
3. Pointwise multiplication of the FFTed vectors
4. Inverse FFT to recover the result of the multiplication

Complexity: $O(n \log n)$.
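The four steps can be sketched in plain Python as follows. This is a textbook recursive radix-2 FFT over complex floating point, chosen here only for brevity; it assumes integer coefficients so that the result can be rounded back exactly, and the function names are illustrative.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_multiply(a, b):
    """Multiply integer coefficient lists (lowest degree first) via the FFT."""
    length = len(a) + len(b) - 1
    size = 1
    while size < length:
        size *= 2
    fa = fft(a + [0] * (size - len(a)))          # steps 1 and 2
    fb = fft(b + [0] * (size - len(b)))
    pointwise = [x * y for x, y in zip(fa, fb)]  # step 3
    product = fft(pointwise, invert=True)        # step 4 (unnormalised)
    # Apply the 1/size normalisation and round back to integers.
    return [round((c / size).real) for c in product[:length]]


print(fft_multiply([2, 3], [4, 5]))  # [8, 22, 15]
```

Note that both operands are padded to full dense vectors before the transform, regardless of how many coefficients are actually non-zero.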

(18)

Alas...

Issues

Both Karatsuba and FFT:

- have a high constant factor in complexity, which makes them unsuitable for typical problems in Celestial Mechanics
- rely on the assumption that the polynomials being multiplied are dense
- perform poorly on real-world multivariate polynomials

(19)

Kronecker’s trick

z    y    x    Code
0    0    0    0
0    0    1    1
0    0    2    2
0    0    3    3
0    1    0    4
0    1    1    5
0    1    2    6
0    1    3    7
0    2    0    8
0    2    1    9
0    2    2    10
0    2    3    11
...  ...  ...  ...
3    3    3    63

- Idea: code the sets of exponents into integer values
- Maintain lexicographic order
- Homomorphism between the space of integer vectors and $\mathbb{Z}$, which preserves addition and subtraction
- Operations on integer vectors are reduced to $O(1)$ complexity
- Codes can be used as perfect hash values or indices in an array
- Series are encoded on-the-fly during multiplication
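The table corresponds to packing three exponents, each in the range 0 to 3, into one base-4 integer. A minimal Python sketch of that packing follows; the function names and the fixed `base=4` are assumptions taken from the table, not from Piranha's code.

```python
def encode(exponents, base=4):
    """Pack exponents (z, y, x), each in [0, base), into one integer."""
    code = 0
    for e in exponents:
        assert 0 <= e < base, "exponent outside the range the code can hold"
        code = code * base + e
    return code


def decode(code, nvars=3, base=4):
    """Inverse of encode: recover the exponent vector from its code."""
    exponents = []
    for _ in range(nvars):
        code, e = divmod(code, base)
        exponents.append(e)
    return tuple(reversed(exponents))


print(encode((0, 2, 3)))  # 11
print(encode((3, 3, 3)))  # 63

# Monomial multiplication adds exponent vectors; on codes it is a single
# integer addition, as long as no exponent overflows its range.
a, b = (0, 1, 2), (1, 1, 1)
assert decode(encode(a) + encode(b)) == (1, 2, 3)
```

The addition property only holds while every component of the sum stays below the base, so the range reserved for each exponent has to accommodate the exponents of the product.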

(20)

Exploiting modern computer designs

- memory hierarchies → (to the whiteboard)
- spatial locality of reference
- temporal locality of reference
- prefetcher
- multi-core CPUs

(22)

“Dense” multiplication

- use Kronecker exponents directly as indices in an array
- use cache-blocking to promote temporal locality of reference
- monomial ordering is prefetch-friendly
- when applicable, top performance is achieved
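The loop structure behind the first two points might look like the Python sketch below. It only illustrates the indexing and the blocked traversal; Python itself will not show the cache effects, and the names `dense_multiply`, `size` and `block` are assumptions made for this example.

```python
def dense_multiply(p, q, size, block=2):
    """Multiply polynomials given as lists of (code, coefficient) pairs.

    `size` must exceed the largest code of the product; `block` is the number
    of terms per block (a stand-in for a cache-sized chunk).
    """
    result = [0] * size  # the Kronecker code is the array index
    for i0 in range(0, len(p), block):
        for j0 in range(0, len(q), block):
            # One block of p against one block of q at a time, so that the
            # same small region of `result` is updated repeatedly.
            for ci, ai in p[i0:i0 + block]:
                for cj, bj in q[j0:j0 + block]:
                    result[ci + cj] += ai * bj
    return result


# (1 + x + y)^2 with base-4 codes: x -> 1, y -> 4
p = [(0, 1), (1, 1), (4, 1)]
print(dense_multiply(p, p, size=9))  # [1, 2, 1, 0, 2, 2, 0, 0, 1]
```

Reading the output by index: code 2 is $x^2$, code 5 is $xy$ and code 8 is $y^2$, so the array holds $1 + 2x + x^2 + 2y + 2xy + y^2$.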

(23)

Memory access patterns

(24)

Sparse multiplication

- use Kronecker exponents directly as hash values
- optimized hash table: items stored in sequential and contiguous buckets
- order input polynomials according to exponent modulo table size
- cache-blocking
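A rough Python sketch of the first point is given below, with the built-in `dict` standing in for the optimized hash table described on the slide. The bucket layout, the input ordering and the cache-blocking are not modelled, and `sparse_multiply` is a name chosen for this example.

```python
def sparse_multiply(p, q):
    """Multiply sparse polynomials stored as {code: coefficient} dicts."""
    result = {}
    for ci, ai in p.items():
        for cj, bj in q.items():
            code = ci + cj  # product monomial: one integer addition
            total = result.get(code, 0) + ai * bj
            if total:
                result[code] = total
            else:
                result.pop(code, None)  # remove terms that cancel
    return result


# (x^5 + 1) * (x^5 - 1) = x^10 - 1, univariate so the code is the exponent
print(sparse_multiply({5: 1, 0: 1}, {5: 1, 0: -1}))  # {10: 1, 0: -1}
```

Only the terms that actually occur are stored, so the memory used follows the number of terms rather than the size of the exponent range.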

(25)

Parallelization

P1        P2
1    →    1    2    3    ···    n−3    n−2    n−1    n    –
2    →    2    3    4    ···    n−2    n−1    n      –    1
3    →    3    4    5    ···    n−1    n      –      1    2
4    →    4    5    6    ···    n      –      1      2    3

- cache-blocking provides a natural way to avoid contention
- interval arithmetic on the exponents is used to guarantee that write-in memory areas are disjoint
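The rotation pattern in the table can be reproduced with a few lines of Python. This sketch assumes that each row is handled by its own worker and that "–" marks an idle slot; it generates the visiting order only and does not model the interval arithmetic on the exponents. The function and parameter names are hypothetical.

```python
def schedule(n_blocks, n_workers):
    """Row w lists the P2 blocks visited by worker w; None marks the idle slot."""
    slots = list(range(1, n_blocks + 1)) + [None]
    rows = []
    for w in range(n_workers):
        shift = w % len(slots)
        rows.append(slots[shift:] + slots[:shift])  # cyclic rotation by w
    return rows


rows = schedule(n_blocks=6, n_workers=4)
for w, row in enumerate(rows, start=1):
    print(w, "->", row)

# At every step the workers are on pairwise different P2 blocks
# (this holds as long as n_workers <= n_blocks + 1).
for step in zip(*rows):
    assert len(set(step)) == len(step)
```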

(27)

Benchmarks

Fateman’s dense benchmark. Compute:

$s \, (s + 1), \quad s = (1 + x + y + z + t)^{30}.$

- 46376 x 46376 = 2 150 733 376 term-by-term multiplications
- Final polynomial length = 635 376

Monagan-Pearce’s sparse benchmark. Compute:

$f \cdot g, \quad f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^{12}, \quad g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^{12}.$

- 6188 x 6188 = 38 291 344 term-by-term multiplications
- Final polynomial length = 5 821 335
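Most of the counts above follow from a binomial coefficient: the expansion of $(1 + v_1 + \dots + v_n)^d$, with each $v_i$ a power of a distinct variable, has $\binom{n + d}{n}$ terms. The short Python check below uses a helper name, `expansion_terms`, introduced for this example; the final length of the sparse product is not covered by this formula and is not checked.

```python
from math import comb


def expansion_terms(nvars, power):
    """Number of terms of (1 + v1 + ... + vn)^power, one distinct variable per vi."""
    return comb(nvars + power, nvars)


terms_s = expansion_terms(4, 30)  # s = (1 + x + y + z + t)^30
print(terms_s, terms_s ** 2)      # 46376 2150733376
print(expansion_terms(4, 60))     # 635376, length of s * (s + 1)

terms_f = expansion_terms(5, 12)  # f and g have the same number of terms
print(terms_f, terms_f ** 2)      # 6188 38291344
```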

(28)

Benchmark results

Test        Coefficient      System        Time      ccpm
Fateman     double           Core2Quad     4.29s     4.8
Fateman     double           Core2Duo      5.62s     4.6
Fateman     double           PPC64         4.96s     4.6
Fateman     double           Xeon          3.73s     4.6
Fateman     double           Atom          20.15s    15.0
Fateman     GMP mpz          Core2Quad     67.90s    75.8
Fateman     61-bit integer   SDMP-Core2    60.25s    67.2
Fateman     61-bit integer   SDMP-Corei7   70.59s    85.3
ELP         double           Core2Quad     15.62s    10.3
MP-sparse   double           Core2Quad     1.71s     107.2
MP-sparse   double           Xeon          1.59s     110.5
MP-sparse   double           Corei7        1.15s     88.0
MP-sparse   37-bit integer   SDMP-Core2    1.86s     116.6
MP-sparse   37-bit integer   SDMP-Corei7   1.56s     108.4

ccpm: clock cycles per term-by-term multiplication.

(29)

Benchmark results: parallelization

- dense case: ∼90% of maximum theoretical speedup
- sparse case: ∼70% of maximum theoretical speedup
- SDMP gets (super)linear speedup in the dense case, but does not scale up in the sparse case
- possible improvements:
  - reduce synchronization barriers
  - make algorithm non-deterministic
  - ...

(31)

Future Steps

- Code refactor and cruft-elimination
- Extension of the Python bindings, GUI improvements, etc.
- Documentation