15-418 Final Project Report Trading Platform Server
Yinghao Wang
[email protected] May 8, 2014
Executive Summary
This project implements a trading platform server that provides back-end support for trading algorithms and a user interface. Specifically, the server supports order execution, indicator computation, Monte Carlo simulation, market scanning and client request handling. The trading platform was tested on GHC 3000 machines, each with a 6-core Xeon CPU and a GTX 480 graphics card.
This report provides an overview of the background, implementation details, challenges and performance analysis of a trading platform server that could be launched on a workstation.
Background
In the dynamic and ever-changing financial markets, traders and portfolio managers increasingly rely on advanced trading technology and infrastructure to achieve returns that outperform the broad market. As trading becomes automated and algorithms replace conventional traders in scanning market signals, execution speed and API support are important to outstanding performance.
The implementation follows the structure shown in the figure below. The trading platform provides key back-end functionality, such as order routing, computing and storing technical indicators, portfolio scenario analysis and market signal scanning.
Functions & Performance Objective
Order Execution
The trading platform exposes its order execution functions through an API to the client side. The order execution function is a wrapper that bridges trading algorithms and brokerage servers. Since the trading platform is a wrapper for brokerage trading functions, it should not significantly increase trade execution time. The objective is that all orders should be handled within 100 milliseconds. The trading speed of major brokerage accounts is shown in the figure below. Adding 100 milliseconds to their trade execution time will not decrease overall trading speed by a significant amount.
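The wrapper idea and the 100 ms objective can be illustrated with a minimal Python sketch. The names `execute_order`, `broker_send` and `clock` are hypothetical and are not taken from the project's API; the sketch only shows how a wrapper might forward an order and check the elapsed time against the stated objective.

```python
import time

ORDER_DEADLINE_SECONDS = 0.100  # the 100 ms objective stated in this report


def execute_order(order, broker_send, clock=time.perf_counter):
    """Forward `order` to a brokerage function and report whether the objective was met.

    `broker_send` stands in for a brokerage API call (hypothetical).
    `clock` is injectable so that the timing logic can be checked without real delays.
    """
    start = clock()
    response = broker_send(order)
    elapsed = clock() - start
    return response, elapsed, elapsed <= ORDER_DEADLINE_SECONDS
```

Counting the calls for which the last returned value is False gives the number of orders that missed the objective.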
Scheduler
Several key factors are considered in measuring a scheduler’s performance.
Priority. Tasks with higher priority should be more likely to be executed than tasks with lower priority. Orders should be executed immediately after the request arrives at the master process, because execution speed is key to electronic trading.
Deadline. Certain operations must finish within a fixed period of time. For example, a new database update is generated every second, and each such request should be executed before the next one arrives.
Starvation. Although higher priority tasks should be executed earlier than lower priority tasks, the scheduler should not create starvation for lower priority tasks.
Waiting time relative to task size. Reducing execution time by 1 second is more significant for a task that usually takes 10 seconds than for a task that usually takes 1 second.
A performance score is defined for the scheduler. It depends on the priority of each client request, the maximum priority level, the execution time T, the number of occurrences in which an order is not handled within 100 ms, and the number of occurrences in which a database update is not handled within 1 second.
Since the scheduler aims to execute client requests as fast as possible, a lower score indicates better performance.
Approach
Scheduler
Algorithm 1. This algorithm is based on the idea that, over the trailing 10 seconds, the weighted execution time of the tasks in each category should be roughly balanced. Specifically, the time spent executing tasks of priority i is weighted by priority i, and these weighted times should be roughly equal across priority levels. The scheduler maintains separate work queues for the different task types and selects tasks so as to keep the weighted execution times balanced.
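The balancing idea in Algorithm 1 can be sketched in Python under stated assumptions. The class `BalancedScheduler` and its method names are hypothetical and are not taken from the project code. The exact weighting is also an assumption: here the time spent on a priority level is divided by its priority number, so that a larger priority number earns a proportionally larger share of the trailing window.

```python
from collections import deque


class BalancedScheduler:
    """Hypothetical sketch: one FIFO queue per priority level, balanced by weighted time."""

    def __init__(self, priorities, window=10.0):
        self.window = window                            # trailing window in seconds
        self.queues = {p: deque() for p in priorities}  # priorities must be positive numbers
        self.history = deque()                          # (finish_time, priority, duration)

    def submit(self, priority, task):
        self.queues[priority].append(task)

    def record(self, finish_time, priority, duration):
        """Log a finished task (in chronological order) so it counts toward its level's share."""
        self.history.append((finish_time, priority, duration))

    def weighted_time(self, now):
        """Time spent per priority inside the window, divided by priority (assumed weighting)."""
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()                      # drop entries older than the window
        spent = {p: 0.0 for p in self.queues}
        for _, p, d in self.history:
            spent[p] += d
        return {p: spent[p] / p for p in spent}

    def next_task(self, now):
        """Pop from the non-empty queue with the smallest weighted time; ties go to higher priority."""
        weighted = self.weighted_time(now)
        ready = [p for p, q in self.queues.items() if q]
        if not ready:
            return None
        chosen = min(ready, key=lambda p: (weighted[p], -p))
        return chosen, self.queues[chosen].popleft()
```

Because a level that has recently consumed a lot of time loses its turn to levels that have consumed little, low-priority queues are still served, which is one way to address the starvation concern.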
Algorithm 2. The skeleton of this algorithm is similar to that of Algorithm 1. However, it partitions large tasks into smaller parts. For example, if a simulation has 100,000 paths, the scheduler may partition the task into 10 parts. After executing 10,000 paths, the scheduler puts the task, together with its partial results, back on the work queue.
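The partitioning step of Algorithm 2 can be sketched as follows. This is an illustration only: the function `run_chunked`, the task dictionary layout and the `simulate_path` callback are hypothetical names, and a real scheduler would interleave other queued tasks between slices rather than run a single task alone.

```python
import math
import random
from collections import deque


def run_chunked(total_paths, parts, simulate_path, seed=0):
    """Run a simulation in slices, requeueing the unfinished task between slices.

    Assumes total_paths > 0. Returns (mean of the simulated values, number of slices).
    """
    rng = random.Random(seed)
    chunk = max(1, math.ceil(total_paths / parts))    # e.g. 100,000 paths / 10 parts = 10,000
    queue = deque([{"done": 0, "partial_sum": 0.0}])  # task state carries the partial result
    slices = 0
    while queue:
        task = queue.popleft()
        n = min(chunk, total_paths - task["done"])
        for _ in range(n):
            task["partial_sum"] += simulate_path(rng)
        task["done"] += n
        slices += 1
        if task["done"] < total_paths:
            queue.append(task)                        # back on the work queue with partial results
        else:
            return task["partial_sum"] / total_paths, slices
```

For example, `run_chunked(100_000, 10, lambda rng: rng.gauss(0.0, 1.0))` runs ten slices of 10,000 paths each. Since the partial sum travels with the task, slicing does not change the final result for a given seed.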
Indicators Computation
The task is to compute technical indicators for Russell 3000 stocks. The following technical indicators are supported: simple moving average, exponential moving average, Bollinger bands and parabolic SAR. Three approaches were tested.
Single CPU core
Multiple CPU cores
CUDA
The rationale behind a CUDA implementation is that the bottleneck in indicators computation is memory access. The algorithms that generate these technical indicators perform O(n) memory accesses.
Since the computation itself also has O(n) complexity, there is little arithmetic per value loaded, so increasing memory access speed is key to improving performance. A CUDA implementation can hide memory latency and provide higher memory bandwidth.
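To make the O(n) claim concrete, here is a small Python sketch of two of the supported indicators. It is not the project's code. The function names are illustrative, and seeding the exponential moving average with the first price and using the smoothing factor 2 / (window + 1) are common conventions assumed here.

```python
def simple_moving_average(prices, window):
    """SMA via a running sum: each price is added once and subtracted once, so the pass is O(n)."""
    out, running = [], 0.0
    for i, p in enumerate(prices):
        running += p
        if i >= window:
            running -= prices[i - window]  # drop the value that left the window
        if i >= window - 1:
            out.append(running / window)
    return out


def exponential_moving_average(prices, window):
    """EMA with alpha = 2 / (window + 1), seeded with the first price (assumed convention)."""
    alpha = 2.0 / (window + 1)
    out = []
    for p in prices:
        out.append(float(p) if not out else alpha * p + (1 - alpha) * out[-1])
    return out
```

Each output value needs only one or two loads and a handful of arithmetic operations, which is why such loops tend to be limited by memory access rather than by computation.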
Monte Carlo Simulation
The trading platform is able to handle scenario analysis tasks. In scenario analysis tasks, Monte Carlo simulation is widely used. Three implementations were tested:
Sequential implementation
Single-thread implementation
Multi-thread implementation
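A multi-threaded Monte Carlo run can be sketched in Python as below. The report does not state which model the project simulates, so the geometric Brownian motion model, its parameter values and the function names are all assumptions. In CPython the global interpreter lock prevents pure-Python threads from running in parallel, so the sketch illustrates how paths are partitioned and partial results combined rather than an actual speedup.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(n_paths, seed, s0=100.0, mu=0.05, sigma=0.2, horizon=1.0):
    """Sum of terminal prices over n_paths draws of geometric Brownian motion (assumed model)."""
    rng = random.Random(seed)  # private generator per worker: no shared state
    drift = (mu - 0.5 * sigma ** 2) * horizon
    vol = sigma * math.sqrt(horizon)
    return sum(s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0)) for _ in range(n_paths))


def monte_carlo_mean(n_paths, n_workers, base_seed=0):
    """Split the paths as evenly as possible across workers and combine the partial sums."""
    base, extra = divmod(n_paths, n_workers)
    sizes = [base + (1 if i < extra else 0) for i in range(n_workers)]
    seeds = [base_seed + i for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(simulate_chunk, sizes, seeds))
    return sum(partial_sums) / n_paths
```

Giving each worker its own seeded generator keeps the paths independent of thread scheduling, so repeated runs with the same seed and worker count return the same estimate.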
Results
Scheduler
For comparison purposes, a number of algorithms were tested.
First in first out.
Orders are executed first and other tasks are executed disregarding priority.
Algorithm 1
Algorithm 2
Algorithm 2 has the best performance because it balances the workload between tasks of different priorities and is designed to capture the diminishing return of performance improvements (saving 1 second on a program that usually takes 100 seconds is more significant than saving 1 second on a program that usually takes 500 seconds).
Monte Carlo Simulation
The speedup diagram is shown below. The trading platform achieves significant speedup on CPU-intensive tasks. Peak performance occurs when using 4 threads on a single core.
Figure: Scheduler performance score (lower is better).
FIFO: 6330
No priority: 1813
Algorithm 1: 652
Algorithm 2: 505
Indicators Computation
The CUDA implementation achieves a 10.7 times speedup over the sequential implementation.
CPU-based implementations were not able to achieve significant speedup because the bottleneck is memory access. The GPU, however, can hide latency and provide higher memory bandwidth, so the CUDA implementation outperforms the CPU implementations.
Figure: Monte Carlo simulation speedup (y-axis: Speedup, 0 to 6; x-axis: Seq, 1, 2, 4, 8, 16).
Figure: Indicators computation speedup over the sequential implementation.
Seq: 1
1 thread: 1.4
4 threads: 1.5
GPU: 10.7