The Piranha computer algebra system:
introduction and implementation details
Francesco Biscani
Advanced Concepts Team
European Space Agency (ESTEC)
Course on Differential Equations and Computer Algebra
Outline
1 A Brief Overview
  Algebraic Structures
  Software Framework
  Pyranha: the Python Bindings for Piranha
2 Fast multiplication algorithms
  Asymptotically fast?
  Practically fast
3 Benchmarks
4 The Future
Piranha in a Nutshell
An algebraic manipulation framework
Around 12000 SLOC (Source Lines Of Code)
Written in C++, with an object-oriented design
Makes extensive use of existing Free Software tools and libraries (Boost, GMP, Python, ...)
Multiplatform (GNU/Linux, Windows, BSD)
Itself Free Software
Algebraic Structures for Celestial Mechanics
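Polynomials: $\sum_{\mathbf{i}} C_{\mathbf{i}} \, \mathbf{p}^{\mathbf{i}}$
Fourier series: $\sum_{\mathbf{i}} C_{\mathbf{i}} \, {\cos \atop \sin} (\mathbf{i} \cdot \mathbf{t})$
Poisson series: $\sum_{\mathbf{i}} \sum_{\mathbf{j}} C_{\mathbf{i},\mathbf{j}} \, \mathbf{p}^{\mathbf{j}} \, {\cos \atop \sin} (\mathbf{i} \cdot \mathbf{t})$
Echeloned Poisson series: $\sum_{\mathbf{i}} \sum_{\mathbf{j}} \sum_{\mathbf{k}} C_{\mathbf{i},\mathbf{j},\mathbf{k}} \, \frac{\mathbf{p}^{\mathbf{k}}}{\prod_{\mathbf{l}} (\mathbf{l} \cdot \mathbf{d})^{\delta_{\mathbf{j},\mathbf{l}}}} \, {\cos \atop \sin} (\mathbf{i} \cdot \mathbf{t})$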
The Framework 1/2
Q: Can we manipulate these algebraic structures in a general and unified way?
The Basic Ideas
1 Series are collections of terms
2 Terms are coefficient-key pairs
3 Terms are uniquely identified by their keys: $t_1 \equiv t_2 \iff t_1.\mathrm{key} \equiv t_2.\mathrm{key}$
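The three ideas above can be illustrated with a minimal Python sketch. This is not Piranha's actual (C++) class design: the `Series` class, its `insert` method and the choice of exponent tuples as keys are all hypothetical names introduced here for illustration only.

```python
from fractions import Fraction


class Series:
    """Toy series: a collection of terms stored as {key: coefficient}."""

    def __init__(self, terms=()):
        self.terms = {}
        for coefficient, key in terms:
            self.insert(coefficient, key)

    def insert(self, coefficient, key):
        # A term is identified by its key: terms with equal keys are merged.
        total = self.terms.get(key, 0) + coefficient
        if total == 0:
            self.terms.pop(key, None)  # drop terms that cancel out
        else:
            self.terms[key] = total

    def __add__(self, other):
        result = Series((c, k) for k, c in self.terms.items())
        for key, coefficient in other.terms.items():
            result.insert(coefficient, key)
        return result


# Polynomial keys (assumption): exponent tuples for the variables (x, y).
p = Series([(Fraction(1, 2), (2, 0)), (3, (0, 1))])   # x^2/2 + 3y
q = Series([(Fraction(1, 2), (2, 0)), (-3, (0, 1))])  # x^2/2 - 3y
total = p + q                                         # x^2
```

The same container would serve a Fourier series if the key were, for instance, a pair such as `("cos", (1, -2))`: only the key type changes, which is the sense in which the structures can be handled in a unified way.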
Object-Oriented and Generic Programming
The C++ Language
High performance and high-level design are not mutually exclusive
OO: inheritance, polymorphism, encapsulation, modularity
Generic programming: type-agnostic classes
Template meta-programming (aka modern C++, see Alexandrescu [2001]): OO features with zero overhead, efficient compile-time optimizations and checks
The Bottom Line
It is possible to share a substantial portion of the implementation among the supported algebraic structures and reduce the amount of code
A Quick SLOC Analysis 1/2
Gregoire & Colbert Chapront [2003]:
  Fourier and Poisson series manipulators
  Written in Fortran 90
  Feature set comparable with Piranha’s
  ∼4000 SLOC each (Piranha is ∼12000 SLOC)
Piranha additionally supports:
  polynomials as top-level series
  multiple representations for keys and numerical coefficients (complex, reals, integers, rationals, arbitrary-size, etc.)
12 different manipulators are currently implemented within the framework (other combinations can be trivially added)
A Quick SLOC Analysis 2/2
Piranha’s SLOC count divided by directory (figure)
Pyranha
Python bindings for Piranha
Uses the Boost.Python library
Compiled-code performance with the flexibility of an interpreted language
Python is a real computer language (not some obscure ad-hoc language)
Many possibilities for extensions
Interactive graphical environment with IPython, matplotlib and PyQt4
Schoolbook multiplication
Given: $a(x) = a_1 x + a_0$, $b(x) = b_1 x + b_0$,
compute $a(x) \cdot b(x)$ as
$a_0 b_0 + a_0 b_1 x + a_1 b_0 x + a_1 b_1 x^2$.
Complexity: $O(n^2)$.
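For polynomials with n coefficients the same scheme becomes a double loop. The sketch below assumes dense univariate polynomials stored as coefficient lists, lowest degree first; the function name `schoolbook_mul` is chosen here for illustration.

```python
def schoolbook_mul(a, b):
    """Multiply dense polynomials given as coefficient lists (lowest degree first)."""
    if not a or not b:
        return []
    result = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            # every coefficient of a meets every coefficient of b: n^2 products
            result[i + j] += ai * bj
    return result


# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
product = schoolbook_mul([1, 2], [3, 4])
```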
Asymptotically fast multiplication: Karatsuba
Karatsuba’s algorithm: given $a(x) = a_1 x + a_0$, $b(x) = b_1 x + b_0$,
express $a(x) \cdot b(x)$ as
$a_0 b_0 + [(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1] x + a_1 b_1 x^2$,
with 3 multiplications vs 4 of the classical method.
Complexity: $O(n^{\log_2 3})$.
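Applied recursively to the two halves of each polynomial, the identity above gives the stated complexity. A minimal sketch, assuming both inputs are coefficient lists of the same length and that this length is a power of two (a simplification made here, not a requirement of the algorithm):

```python
def karatsuba(a, b):
    """Karatsuba product of two coefficient lists (lowest degree first).

    Assumption of this sketch: len(a) == len(b) and the length is a power of two.
    """
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a0, a1 = a[:h], a[h:]          # a = a0 + a1 * x^h
    b0, b1 = b[:h], b[h:]          # b = b0 + b1 * x^h
    low = karatsuba(a0, b0)        # a0 * b0
    high = karatsuba(a1, b1)       # a1 * b1
    mid = karatsuba([x + y for x, y in zip(a0, a1)],
                    [x + y for x, y in zip(b0, b1)])
    # (a0 + a1)(b0 + b1) - a0*b0 - a1*b1 = a0*b1 + a1*b0
    mid = [m - l - u for m, l, u in zip(mid, low, high)]
    result = [0] * (2 * n - 1)
    for i, c in enumerate(low):
        result[i] += c
    for i, c in enumerate(mid):
        result[i + h] += c
    for i, c in enumerate(high):
        result[i + 2 * h] += c
    return result


product = karatsuba([1, 2, 3, 4], [5, 6, 7, 8])
```

Each call performs three recursive products on inputs of half the size instead of four.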
Asymptotically fast multiplication: FFT
Convert polynomials to vectors of coefficients
Compute the FFT of both vectors
Pointwise multiplication of the FFTed vectors
Inverse FFT to recover the result of the multiplication
Complexity: $O(n \log n)$.
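The four steps can be sketched in pure Python with a textbook recursive radix-2 transform. The helper names `fft` and `fft_mul` are illustrative, and the sketch assumes integer coefficients small enough for the final rounding to be exact in double precision.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey transform; len(values) must be a power of two.

    With invert=True the result is the unnormalized inverse transform.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_mul(a, b):
    """Multiply integer-coefficient polynomials (lowest degree first) via FFT."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:               # pad to a power of two that holds the product
        n *= 2
    fa = fft(list(a) + [0] * (n - len(a)))
    fb = fft(list(b) + [0] * (n - len(b)))
    pointwise = [x * y for x, y in zip(fa, fb)]
    inverse = fft(pointwise, invert=True)
    # normalize by n and round away floating-point noise
    return [round((v / n).real) for v in inverse[:size]]


product = fft_mul([1, 2, 3, 4], [5, 6, 7, 8])
```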
Alas. . .
Issues
Both Karatsuba and FFT:
  have a high constant factor in their complexity, which makes them unsuitable for typical problems in Celestial Mechanics
  rely on the assumption that the polynomials being multiplied are dense
  perform poorly on real-world multivariate polynomials
Kronecker’s trick
z   y   x   Code
0   0   0   0
0   0   1   1
0   0   2   2
0   0   3   3
0   1   0   4
0   1   1   5
0   1   2   6
0   1   3   7
0   2   0   8
0   2   1   9
0   2   2   10
0   2   3   11
... ... ... ...
3   3   3   63
Idea: code the sets of exponents into integer values
Maintain lexicographic order
Homomorphism between the space of integer exponent vectors and $\mathbb{Z}$ which preserves addition and subtraction
Operations on integer vectors are reduced to $O(1)$ complexity
Codes can be used as perfect hash values or indices in an array
Series are encoded on-the-fly during multiplication
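The table above corresponds to reading the exponents (z, y, x) as the digits of a base-4 number. A minimal sketch of this encoding follows; the function names and the explicit `base` parameter are assumptions made for illustration, not Piranha's interface.

```python
def encode(exponents, base):
    """Pack an exponent tuple (most significant variable first) into one integer."""
    code = 0
    for e in exponents:
        if not 0 <= e < base:
            raise ValueError("exponent does not fit in the chosen base")
        code = code * base + e
    return code


def decode(code, base, nvars):
    """Inverse of encode(): recover the exponent tuple from its integer code."""
    digits = []
    for _ in range(nvars):
        code, e = divmod(code, base)
        digits.append(e)
    return tuple(reversed(digits))


# (z, y, x) = (0, 1, 2) times (0, 1, 1): adding the codes multiplies the monomials
lhs = encode((0, 1, 2), 4) + encode((0, 1, 1), 4)
rhs = encode((0, 2, 3), 4)
```

The addition of codes matches the addition of exponent vectors only as long as no exponent of the result reaches `base`, so the base has to be chosen from the exponent range of the product, not of the inputs.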
Exploiting modern computer designs
memory hierarchies → (to the whiteboard)
  spatial locality of reference
  temporal locality of reference
  prefetcher
multi-core CPUs
“Dense” multiplication
use Kronecker exponents directly as indices in an array
use cache-blocking to promote temporal locality of reference
monomial ordering is prefetch-friendly
when applicable, top performance is achieved
Memory access patterns (figure)
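The loop structure of a blocked dense multiplication can be sketched as follows. This is a Python illustration of the iteration order only (the performance benefit of blocking appears in compiled code, not here); `dense_mul_blocked` and its `block` parameter are hypothetical names, and the inputs are assumed to be arrays indexed by Kronecker code.

```python
def dense_mul_blocked(a, b, block=64):
    """Multiply two coefficient arrays indexed by Kronecker code, tile by tile."""
    result = [0] * (len(a) + len(b) - 1)
    for i0 in range(0, len(a), block):
        for j0 in range(0, len(b), block):
            # one tile: a[i0:i0+block] times b[j0:j0+block] writes only into
            # the bounded window result[i0+j0 : i0+j0+2*block-1]
            for i in range(i0, min(i0 + block, len(a))):
                ai = a[i]
                if ai == 0:
                    continue
                for j in range(j0, min(j0 + block, len(b))):
                    result[i + j] += ai * b[j]
    return result


# Two variables (y, x) coded in base 4: x -> code 1, y -> code 4.
x_plus_y = [0, 1, 0, 0, 1]
square = dense_mul_blocked(x_plus_y, x_plus_y, block=2)  # x^2 + 2xy + y^2
```

Within a tile the same small slices of the operands and of the result are reused many times, which is the temporal locality that cache-blocking aims for.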
Sparse multiplication
use Kronecker exponents directly as hash values
optimized hash table: items stored in sequential and contiguous buckets
order input polynomials according to exponent modulo table size
cache-blocking
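A rough sketch of the sparse case, under simplifying assumptions: a Python `dict` stands in for the optimized hash table with contiguous buckets, and `table_size` is a hypothetical parameter used only to order the inputs by code modulo table size, as in the third point above.

```python
def sparse_mul(a, b, table_size=1024):
    """Multiply sparse polynomials stored as {kronecker_code: coefficient}."""
    def slot(item):
        return item[0] % table_size   # bucket the term would map to

    # visiting terms in slot order makes consecutive writes land in nearby buckets
    a_terms = sorted(a.items(), key=slot)
    b_terms = sorted(b.items(), key=slot)
    result = {}
    for code_a, coeff_a in a_terms:
        for code_b, coeff_b in b_terms:
            code = code_a + code_b    # monomial product = addition of codes
            total = result.get(code, 0) + coeff_a * coeff_b
            if total:
                result[code] = total
            else:
                result.pop(code, None)  # drop terms that cancel out
    return result


# (x + y)(x - y) = x^2 - y^2 with x -> code 1, y -> code 4 (base 4)
difference_of_squares = sparse_mul({1: 1, 4: 1}, {1: 1, 4: -1})
```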
Parallelization
P1     P2
1  →   1    2    3    ···  n−3  n−2  n−1  n    –    –    –
2  →   2    3    4    ···  n−2  n−1  n    –    –    –    1
3  →   3    4    5    ···  n−1  n    –    –    –    1    2
4  →   4    5    6    ···  n    –    –    –    1    2    3
cache-blocking provides a natural way to avoid contention
interval arithmetic on the exponents is used to guarantee that the memory areas being written to are disjoint
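The interval check can be illustrated on Kronecker codes. The sketch below reflects one reading of the schedule above, namely that at a given step the thread handling block r of P1 works on block r + s of P2; the helper names and the block size of 4 are assumptions made for illustration.

```python
def write_interval(block_a, block_b):
    """Closed range of result codes that the product of two code blocks can touch."""
    return (min(block_a) + min(block_b), max(block_a) + max(block_b))


def disjoint(first, second):
    """True if two closed intervals share no point."""
    return first[1] < second[0] or second[1] < first[0]


def block(index, size):
    """Codes covered by the index-th block of a dense, code-ordered operand."""
    return range(index * size, (index + 1) * size)


# Step s = 1 of the shifted schedule with three threads and blocks of 4 codes.
step = 1
intervals = [write_interval(block(r, 4), block(r + step, 4)) for r in range(3)]
shifted_ok = all(disjoint(intervals[i], intervals[j])
                 for i in range(3) for j in range(i + 1, 3))

# Without the shift, two threads sharing the same P2 block would overlap.
same_block_ok = disjoint(write_interval(block(0, 4), block(0, 4)),
                         write_interval(block(1, 4), block(0, 4)))
```

In this toy setting the shifted pairing yields pairwise disjoint write ranges, while pairing neighbouring P1 blocks with the same P2 block does not.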
Benchmarks
Fateman’s dense benchmark. Compute:
$s(s + 1)$, $s = (1 + x + y + z + t)^{30}$.
46376 x 46376 = 2 150 733 376 term-by-term multiplications
Final polynomial length = 635 376
Monagan-Pearce’s sparse benchmark. Compute:
$f \cdot g$, $f = (1 + x + y + 2z^2 + 3t^3 + 5u^5)^{12}$, $g = (1 + u + t + 2z^2 + 3y^3 + 5x^5)^{12}$.
6188 x 6188 = 38 291 344 term-by-term multiplications
Final polynomial length = 5 821 335
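The operand sizes quoted above follow from counting monomials, which can be checked with binomial coefficients. The sketch below covers the input sizes of both benchmarks and the output size of the dense one; the sparse output length is not derived here. The function name `dense_terms` is illustrative.

```python
from math import comb


def dense_terms(nvars, degree):
    """Number of monomials of total degree <= degree in nvars variables."""
    return comb(nvars + degree, nvars)


# Fateman: s = (1 + x + y + z + t)^30 contains every monomial up to degree 30,
# and s * (s + 1) contains every monomial up to degree 60.
fateman_input = dense_terms(4, 30)
fateman_output = dense_terms(4, 60)

# Monagan-Pearce: a sum of 6 terms raised to the 12th power expands into one
# distinct monomial per way of distributing the 12 factors among the 6 terms.
mp_input = comb(12 + 5, 5)

fateman_products = fateman_input ** 2
mp_products = mp_input ** 2
```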
Benchmark results
Test        Coefficient      System        Time     ccpm
Fateman     double           Core2Quad     4.29s    4.8
Fateman     double           Core2Duo      5.62s    4.6
Fateman     double           PPC64         4.96s    4.6
Fateman     double           Xeon          3.73s    4.6
Fateman     double           Atom          20.15s   15.0
Fateman     GMP mpz          Core2Quad     67.90s   75.8
Fateman     61-bit integer   SDMP-Core2    60.25s   67.2
Fateman     61-bit integer   SDMP-Corei7   70.59s   85.3
ELP         double           Core2Quad     15.62s   10.3
MP-sparse   double           Core2Quad     1.71s    107.2
MP-sparse   double           Xeon          1.59s    110.5
MP-sparse   double           Corei7        1.15s    88.0
MP-sparse   37-bit integer   SDMP-Core2    1.86s    116.6
MP-sparse   37-bit integer   SDMP-Corei7   1.56s    108.4
Benchmark results: parallelization
dense case: ∼90% of maximum theoretical speedup
sparse case: ∼70% of maximum theoretical speedup
SDMP gets (super)linear speedup in the dense case, but does not scale up in the sparse case
possible improvements:
  reduce synchronization barriers
  make algorithm non-deterministic
  ...
Future Steps
Code refactor and cruft-elimination
Extension of the Python bindings, GUI improvements, etc.
Documentation