Co-Design Issues in HPC and Big Data Convergence

Full text

(1)Co-Design Issues in HPC and Big Data Convergence Satoshi Matsuoka Professor Global Scientific Information and Computing (GSIC) Center Tokyo Institute of Technology Fellow, Association for Computing Machinery (ACM) Co-Design Meeting, HPC China Wuxi, China 2015/11/10.

(2) Two Big Data CREST Programs (2013-2020) ~$60 mil. Advanced Core Technologies for Big Data Integration. Research Supervisor: Masaru Kitsuregawa Director General, National Institute of Informatics. Advanced Application Technologies to Boost Big Data Utilization for Multiple-Field Scientific Discovery and Social Problem Solving Research Supervisor: Yuzuru Tanaka Professor, Graduate School of Information Science and Technology, Hokkaido University.

(3) CREST Big Data Projects circa 2014 (blue = big data application area). Advanced Core Technologies for Big Data Integration • •. Establishment of Knowledge-Intensive Structural Natural Language Processing and Construction of Knowledge Infrastructure Privacy-preserving data collection and analytics with guarantee of information control and its application to personalized medicine and genetic epidemiology. •. EBD: Extreme Big Data – Convergence of Big Data and HPC for Yottabyte Processing. • • • • •. Discovering Deep Knowledge from Complex Data and Its Value Creation Data Particlization for Next Generation Data Mining Foundations of Innovative Algorithms for Big Data Recognition, Summarization and Retrieval of Large-Scale Multimedia Data The Security Infrastructure Technology for Integrated Utilization of Big Data. Advanced Application Technologies to Boost Big Data Utilization for Multiple-Field Scientific Discovery and Social Problem Solving • • • • • •. Development of a knowledge-generating platform driven by big data in drug discovery through production processes. Innovating "Big Data Assimilation" technology for revolutionizing very-short-range severe weather prediction Establishing the most advanced disaster reduction management system by fusion of real-time disaster simulation and big data assimilation Exploring etiologies, sub-classification, and risk prediction of diseases based on big-data analysis of clinical and whole omics data in medicine Detecting premonitory signs and real-time forecasting of pandemic using big biological data Statistical Computational Cosmology with Big Astronomical Imaging Data.

(4) JST-CREST “Extreme Big Data” Project (2013-2018) Future Non-Silo Extreme Big Data Scientific Apps. Given a top-class supercomputer, how fast can we accelerate next generation big data c.f. Clouds?. Ultra Large Scale Graphs and Social Infrastructures. Large Scale Metagenomics. Co-Design EBD Bag. Graph Store. Cloud IDC Very low BW & Efficiency Highly available, resilient. Massive Sensors and Data Assimilation in Weather Prediction. Co-Design Co-Design. EBD System Software incl. EBD Object System. Cartesian Plane KV S. KV S NVM/Fla NVM/Flas 2Tbps HBM NVM/Fla sh h 4~6HBM Channels NVM/Flas NVM/Fla NVM/Flas sh h 1.5TB/s DRAM & h sh DRAM DRAM NVM BW DRAM DRAM DRAM DRAM Low Low High Powered 30PB/s I/O BW Possible Main CPU Power Power 1 Yottabyte / Year CPU CPU TSV Interposer. KV S. EBD KVS. Exascale Big Data HPC PCB. Convergent Architecture (Phases 1~4) Large Capacity NVM, High-Bisection NW. Issues regading Architectural, algorithmic, and system software evolution? Use of GPUs?. Supercomputers Compute&Batch-Oriented More fragile.

(5) Extreme Big Data (EBD) Team Co-Design EHPC and EDB Apps • Satoshi Matsuoka (PI), Toshio • Yutaka Akiyama, Ken Kurokawa (Tokyo Endo, Hitoshi Sato (Tokyo Tech.) Tech) (EBD App1 Genome) (EBD Software System) •. • Osamu Tatebe (Univ. Tsukuba) • Takemasa Miyoshi (Riken AICS) (EBD App2 Weather, data assim.) (EBD-I/O). • Michihiro Koibuchi (NII) (EBD Network). • Toyotaro Suzumura (IBM Watson / Columbia U)(EBD App3 Social Simulation).

(6) 100,000 Times Fold EBD “Convergent” System Architecture Akiyama Group. Miyoshi Group. Suzumura Group. Large Scale Genomic Correlation. SQL for EBD Message Passing (MPI, X10) for EBD. Data Assimilation in Large Scale Sensors and Exascale Atmospherics. Large Scale Graphs and Social Infrastructure Apps. Graph Framework. PGAS/Global Array for EBD. Workflow/Scripting Languages for EBD. EBD Abstract Data Models. MapReduce for EBD. Matsuoka Group EBD Algorithm Kernels. (Distributed Array, Key Value, Sparse Data Model, Tree, etc.) EBD Bag. Cartesian Plane (Search/ Sort, Matching, Graph Traversals, , etc.) KVS. EBD File System. EBD Data Object. EBD Burst I/O Buffer. Tatebe Group Graph Store NVM. (FLASH, PCM, STT-MRAM, ReRAM, HMC, etc.). TSUBAME 3.0. EBD Network Topology and Routing. Koibuchi Group HPC Storage. Web Object Storage. TSUBAME-GoldenBox. Interconnect. (InfiniBand 100GbE). Cloud Datacenter. KVS. KVS. EBD KVS. Network (SINET5). Intercloud / Grid (HPCI) 6. Programming Layer Basic Algorithms Layer System SW Layer Big Data & SC HW Layer Big Data & SC HW Layer.

(7) Acceleration of EBD Processing (1). • Large Capacity – Multi-Terabytes, Petabytes, Exabytes • Kernel algorithms for discrete data – graph, sort, etc. • EBD Characteristics. • Sparse and random data structure • Involve frequent and abundant data transfer. • EBD Solutions (research). Implies low latency and high bandwidth access. • High capacity at low power: non-volatile memory, deep memory hierarchy • High bandwidth: fast on-package memory + memory hierarchy+ Supercomputer Network (>100Gbps injection, Petabits bisection) + bandwidth reducing algorithms for EBD • Low Latency Our research: define & invent • latency reduction => memory 3-D stacking, EBD architecture + algorithm fast on-package memory + low latency network + system SW • Latency hiding => many core + many threading + latency reducing algorithms for EBD.

(8) Acceleration of EBD Processing (2) • Classification algorithms – statistical modeling/optimization, Machine Learning • EBD Characteristics: iterative numerical optimization. • Kernel may be sparse (e.g., SVM) or dense (e.g., Deep Learning) • Parallelism difficult due to massive sample size (10~100 billion images). • EBD Solutions (our research). • Approach: Employ traditional and new HPC/supercomputer parallelization and acceleration strategies • Sparse algorithms – high bandwidth processors (e.g., GPU) w/stacked memory and on-package memory + memory hierarchy + supercomputing network + bandwidth reducing algorithms (sparse linear algebra) Limited • Dense algorithms – many-core high FLOPS processor (e.g., GPU) + showing algorithmic advances for strong scaling today • High volume data – utilize “burst buffer” technology (incl. Clouds).

(9)

(10) Elapsed Time (ms). The Graph500 – June 2014 and June 2015 K Computer #1 Tokyo Tech[EBD CREST] Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu 88,000 nodes, 1500. Communi… Computati…. 1000. 700,000 CPU Cores 1.6 Petabyte mem 20GB/s Tofu NW. 500 0. 65536 nodes. (Scale 30). (Scale 40). Rank. GTEPS. Implementation. November 2013. 4. 5524.12. Top-down only. June 2014. 1. 17977.05. Efficient hybrid. November 2014. 2. June 2015. 1. Efficient hybrid Hybrid + Node Compression. *Problem size is weak scaling “Brain-class” graph. ≫. 64 nodes. List. 38621.4. 73% total exec time wait in communication. LLNL-IBM Sequoia 1.6 million CPUs 1.6 Petabyte mem.

(11) Optimized Graph500 program (1) – Bandwidth Reducing Algorithm Sparse Matrix Representation with Bitmap Problem . . Since the partitioned graph is a hyper sparse matrix, we need efficient hyper sparse matrix representation for large scale distributed graph processing.. Our proposal: Sparse Matrix Representation with Bitmap . Enables compression of row indexes and fast access to each row.. Comparison with other methods Performance. Data size of row index (MB/node). (8064 nodes, Scale 36). (8064 partition, Scale 36). CSR (Compressed Sparse Row) DCSC Coarse Index + Skip List Bitmap (Proposal). 1806 861 309 337. 3,328. 3500 Performance (GTEPS). . 3000 2500. 2,294. 2,653. 2000 1500 1000 500 0. DCSC. Coarse Index + Skip List. Bitmap.

(12) Optimized Graph500 program (2) – Bandwidth Reducing Algorithm Vertex Reordering for Bitmap Optimization . Our idea  . . Creates reordered vertex number by sorting vertices by degree. Use reordered number for bitmap access and original number for other processing.. Result . 16% speedup by reduction of bitmap data, 28% speedup by localized memory access, and 49% speedup in total. (8064 nodes) Performance Bitmap Access. (8064 nodes, Scale 36). 3500. 3,328. Reorder. Unnecessary part. (ii) Reduce the size of Bitmap (i) Localize memory access. Performance (GTEPS). 3000. 2,596. 2500. 2,235. 28% 16%. 49%. 1,891. 2000 1500 1000 500 0. Proposal. Only remove No reorder unnecessary. Convert at last.

(13) EBD App1: Ultra-fast and High-sensitive Homology Search for Metagenomics [Akiyama Group] GOAL. 1. To develop ultra-fast and high-sensitive metagenomic analyzer. 2. To generalize the invented scheme, as a common tool for "EBD vs. EBD".. DNA sequencing Environmental samples Hospital Microbiome. Home Microbiome. Reference Databases Earth Microbiome. Phylogenetic composition and Metabolic pathway analysis.

(14) Extreme performance of GHOST-MP GHOST-MP shows extreme scalability.. GHOST-MP is x96 - x138 faster than mpi-BLAST. Weak scaling: 0.85 on 24576 nodes. [Suzuki et. al., PLOS ONE, 2014]. (=196,608 cores), 85% efficiency. Speedup. faster. on TSUBAME 2.5 10. 100. 1,000. # CPU cores. 10,000. 25000. 1. 20000. 0.8. 15000. 0.6. 10000. 0.4. 5000. 0.2. 0. 0. 0. 10000. 20000. 30000. Number of Nodes. [1] Suzuki S, Kakuta M, Ishida T, Akiyama Y, PLOS ONE, 9(8): e103833, 2014. [2] Suzuki S, Kakuta M, Ishida T, Akiyama Y, Bioinformatics, 31(8):1183-1190, 2015.. Speedup / Number of Nodes. compared to 1 node, on K-computer.. [Suzuki et. al., Bioinformatics, 2015].

(15) Large-scale application in human body metagenome We have performed Homology Search for whole human oral microbiome data (18 billion DNA reads) in HMP database.. Discovery of functional clusters. L.M. Proctor, Cell Host & Microbe, 2011. [4] Kakuta M, Suzuki S, Ishida T, Akiyama Y , (submitted).

(16) High-sensitive Homology Search required in Metagenomics:. A typical situation where EBD vs. EBD calculation required. O(m) Reference. Databases. O(n) Meas. data. O(m n) calculation Similarity search. In case of a human oral metagenome analysis: Input:. Output:. Measured Data: 1,284 GB (18 billion DNA reads) A Reference DB: 15 GB (KEGG GENES DB only) Search Result: 1,223 GB (Similarity, Ontology). in 2009 on a Xeon cluster w/ 144 cores 0.18 M reads/hour (100,000 hours / 18 billion reads). x3178 faster. in 2013 on K-computer 572M reads/hour (32 hours / 18 billion reads). and beyond....

(17) Toward Co-design: Generalization of scheme Current GHOST-MP structure main algorithm (GHOSTX/ GHOSTZ). data partition policy. original task dispatcher. MapReduce implementations of GHOST-MP using Hadoop, Spark, and Hamar [Zhang et al., HPCS2015]. Future GHOST-MP structure main algorithm (GHOSTX/ GHOSTZ). new API. common EBD tools. [3] Zhang C, Shirahata K, Suzuki S, Akiyama Y, Matsuoka S, HPCS2015 (2015)..

(18) Collecting Atmospheric Data Satellite Radar. Aircraft. Weather balloon Ship. Surface station. Global continuous collection Variety of sensors, stationary and mobile. Buoy.

(19) EBD App2: Miyoshi Group (Weather Forecast Application) Big Data Assimilation for severe weather forecast Only in 10 minutes! Goal ： Pinpoint (100-m resol.) forecast of severe local weather by updating 30-min forecast every 30 sec!. Revolutionary super-rapid 30-sec. cycle. 120 times more rapid than hourly update cycles.

(20) Future work (with Tatebe Group) • Geographical Search for LETKF – Searching local O(103) from global O(106) at every grid point – The number of grid points = O(108) – This part takes ~50 % of the total computer time.. • R-tree or kd-tree will be applied to accelerate the search. – In collaboration with Tatebe Group.

(21) EBD App3: Large-Scale Traffic Simulation [Suzumura Group] Agent-Based Model Optimizations. Exact-Differential Simulation. • Background. • Background. • Objective. • Objective. • Parallel and distributed computation systems are indispensable for large-scale simulations. • But there are many performance overheads especially load imbalance and node synchronizations. • Reduce such overheads to gain parallelism by nodes and threads.. • It needs to run simulation many times with various parameters or scenarios (what-if analysis). • There are many redundant events between initial simulation and later repeating simulation. • Propose a redundant reduction technique keeping exactly same results..

(22) Performance Optimization for Agent-Based Traffic Simulation [Kanezashi et. al., WSC 2015]. • Background. • Parallel and distributed computation systems are indispensable for largescale simulations. • But there are many performance overheads especially load imbalance.. • Optimization Methods. 1. Inner-node agent assignments 2. Across-node agent assignments 3. Synchronization reduction.

(23) Estimated Compute Resource Requirements for Deep Learning [Source: Preferred Network Japan Inc.] To complete the learning phase in one day. Image/Video Recognition 10P（Image) 〜 10E（Video）. 学習データ：1億枚の画像 10000クラス分類数千ノードで6ヶ月 [Google 2015]. Flops. Image Recognition. 10P〜 Flops. 1万人の5000時間分の音声データ人工的に生成された10万時間の音声データを基に学習 [Baidu 2015]. P:Peta E:Exa F:Flops. Bio / Healthcare. 100P 〜 1E Flops. 一人あたりゲノム解析で約10M個のSNPs 100万人で100PFlops、1億人で1EFlops. Auto Driving. 1E〜100E. Flops 自動運転車１台あたり1日 1TB 10台〜1000台, 100日分の走行データの学習. Robots / Drones. 1E〜100E Flops. 1台あたり年間1TB 100万台〜1億台から得られたデータで学習する場合. 機械学習、深層学習は学習データが大きいほど高精度になる現在は人が生み出したデータが対象だが、今後は機械が生み出すデータが対象となる各種推定値は1GBの学習データに対して1日で学習するためには 1TFlops必要だとして計算. 10PF 2015. 100PF 2020. 1EF. 10EF 2025. 100EF 2030.

(24) Status Quo of Deep Learning 2012. ILSVRC(ImageNet Large Scale Visual Recognition Competition)DL wins!(top-5 pred. error: 16.42%)[1]. 2013. Algorithm Improvement（Network-in-Network [2], drop-XXX [3, 4], etc.）. 2014. ITLAB uses Tegra to recognize pedestrians[5] Human-level facial recognition by Facebook [6] ILSVRC new record by Google（6.66%）[7] but thousands of CPUs, does not scale well. 2015. “DL supercomputer development” by Baidu [8] ILSVRC new record by Baidu（5.98%）[9]. LeCun. Hinton Bengio. ILSVRC new record by MSR（4.94%）[10] Slide Courtesy：Copyright (C) 2015 DENSO IT LABORATORY, INC. All Rights Reserved.. 24. Ng.

(25) Realistic DL requires immense compute power Voice Recognition 1GPU: 55h ・420 Million parameters ・1.1 Billion Samples. 3.4x. 10,000 cores: 25h. 1800 cores: 16h Simple scale-out in Clouds cannot handle real-life DL applications such a auto-driving 提供： Copyright (C) 2015 DENSO IT LABORATORY, INC. All Rights Reserved.. 25.

(26) NEW App 2015: EBD Co-Design for Deep Learning Applications • Deep Learning IS HPC!. • Training models – mostly dense MatVec • Data Access for training target data sets • Sharing updated training parameters in neural networks. • Goals. • Accelerate DL applications in EBD architectures ?. TSUBAME-KFC TSUBAME3.0 Prototype. Many companies (ex. Baidu, etc.) employ GPU-based Cluster Architectures, similar to TSUBAME2 & KFC. • Extreme-scale Parallelization, Fast Interconnects, Storage I/O, etc.. • Performance bottlenecks of multi-node parallel DL algorithms on current HPC systems ?. • Current Status. Real DL Applications. I/O • Official Collaboration w/DENSO IT Lab to be signed October • Profiling based bottleneck identification and performance Performance Model optimization of a real DL application on TSUBAME • > 100 million images, 1500 GPUs (6 Pflops) 1 week grand challenge run • Compete w/Google, MS, Baidu etc. in ILSVRC in ImageNet. Comm. Calc Feed Back.

(27) EBD Algorithm Kernels, Programming, and System Software (Matsuoka-G) • EBD Algorithm Kernels. • Graph Processing - Graph 500 (collaboration w/ RIKEN, Univ. Kyushu – Katsuki Fujisawa “Graph CREST”, Fujitsu) => #1 June 2014 and June 2015 • Sorting - Extremely Parallel Implementation & NVM-based Implementation. • EBD Programming Framework. • Graph-based Programming Framework (Python interface for high-performance ScaleGraph) • NVM-based Graph Storage (Collaboration w/ LLNL). • EBD System Software • • • •. Inter-cloud Federation using Cloud Burst Buffers (collaboration w/ Amazon AWS) Network Visualization Tool (collaboration w/ BSC, LLNL) Resource Optimization on a GPU-based System using Process Migration Tool, mrCUDA Performance Evaluation for Next-gen Storage System (collaboration w/ Data Direct Networks). • EBD CoDesign Apps • • • •. Meta-genome Apps using EBD Programming Framework (w/ Akiyama G) Traffic Simulation (w/ Suzumura G) Deep Learning (collaboration w/ Denso Lab) Others (Text Processing, Agent Simulation, etc.).

(28) Hamar (Highly Accelerated Map Reduce) [Shirahata, Sato, … IEEE CCGrid 2013]. • A software framework for large-scale supercomputers w/ many-core accelerators and local NVM devices – Abstraction for deepening memory hierarchy. • Device memory on GPUs, DRAM, Flash devices, etc.. • Features. – Object-oriented. • C++-based implementation • Easy adaptation to modern commodity many-core accelerator/Flash devices w/ SDKs – CUDA, OpenNVM, etc.. – Weak-scaling over 1000 GPUs • TSUBAME2. – Out-of-core GPU data management • Optimized data streaming between device/host memory • GPU-based external sorting. – Optimized data formats for many-core accelerators • Similar to JDS format. EBD Programming Framework.

(29) Out-of-core GPU-MapReduce for Large-scale Graph Processing [IEEE Cluster 2014]. EBD Programming Framework. Emergence of large-scale graphs -. GPU. SNS, road network, smart grid, etc. Millions to trillions of vertices/edges. Map. Map. Shuf fle. -. Map: 1.41x, Reduce: 1.49x, Sort: 4.95x speedup Overlapping communication effectively. Map. Map. Initialization. Scan. Red uce. Red uce. Sort. Shuf fle. Operation on GPU. Performance [MEdges/sec]. Experimental Results: performance improvement over CPUs. Processing for each chunk. Sort. Problem: GPU memory capacity limits scalable large-scale graph processing. - Stream-based GPU MapReduce - Out-of-core GPU sorting. Memcpy (H2D, D2H). Operation on GPU. → Need for fast graph processing on supercomputers. Proposal: Out-of-core GPU memory management on MapReduce. CPU. Red uce. Red uce. Weak scaling on TSUBAME2.5. 3000. 1CPU (S23 per node). 2500. 1GPU (S23 per node). 2000. 2GPUs (S24 per node). 2CPUs (S24 per node) 3GPUs (S24 per node). 1500. 2.10x. (3 GPU vs 2CPU). 1000 500 0. 0. 500 1000 Number of Compute Nodes. 1500.

(30) GPU-based Distributed Sorting. EBD Algorithm Kernels. [Shamoto, IEEE BigData 2014, IEEE Trans. Big Data 2015] • Sorting: Kernel algorithm for various EBD processing • Fast sorting methods – Distributed Sorting: Sorting for distributed system • Splitter-based parallel sort • Radix sort • Merge sort. – Sorting on heterogeneous architectures • Many sorting algorithms are accelerated by many cores and high memory bandwidth.. • Sorting for large-scale heterogeneous systems remains unclear • We develop and evaluate bandwidth and latency reducing GPU-based HykSort on TSUBAME2.5 via latency hiding – Now preparing to release the sorting library.

(31) GPU implementation of splitterbased sorting (HykSort) •. Weak scaling performance (Grand Challenge on TSUBAME2.5) – 1 ~ 1024 nodes (2 ~ 2048 GPUs) – 2 processes per node – Each node has 2GB 64bit integer. •. x1.4. 0.25 [TB/s]. x3.61. Yahoo/Hadoop Terasort: 0.02[TB/s] – Including I/O. x389. Performance prediction •. x2.2 speedup compared to CPU-based implementation when the # of PCI bandwidth increase to 50GB/s. PCIe_#: #GB/s bandwidth of interconnect between CPU and GPU 8.8% reduction of overall runtime when the accelerators work 4 times faster than K20x.

(32) EBD Algorithm Kernels. Efficient Parallel Sorting Algorithm for Variable-Length Keys. •. •. •. Comparison-based sorts inefficient for long/variable-length keys (like strings) Better way: examining individual characters (based on MSD Radix sort algorithm) Hybrid parallelization scheme: combining data-parallel and taskparallel stages. apple apricot banana kiwi. 70 M keys/second sorting throughput on 100bytes strings. Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. Efficient String Sorting on Multi- and ManyCore Architectures. in Proceedings of IEEE 3rd International Congress on Big Data. Anchorage, USA, August 2014 Aleksandr Drozd, Miquel Pericàs, Satoshi Matsuoka. MSD Radix String Sort on GPU: Longer Keys, Shorter Alphabets in proceedings of 第142回ハイパフォーマンスコンピューティング合同研究発表会 (HOKKE-21).

(33) GPU + NVM + PCIe SSD Sorting our new Xtr2sort library [H.Sato et.al. SC15 Poster] in-core GPU. Xtr2sort GPU+CPU+NVM. CPU+NVM. Single Node Xeon - 2 socket 36 cores - 128GB DDR4 - K40 GPU (12GB) - SSD PCIe card (2.4TB).

(34) Large Scale Graph Processing Using NVM [Iwabuchi, IEEE BigData2014] 1. Hybrid-BFS ( Beamer’11 ) 2. Proposal EBD Algorithm Kernels. Switching two approaches. DRAM. NVM. Holds highly accessed data Top-down. Holds full size of Graph. Bottom-up. CPU DRAM. Intel Xeon E5-2690 × 2. NVM. EBD-I/O 2TB × 2. 256 GB. www.crucial.com/. mSATA ・・・ mSATA SSD SSD. ×8. RAID Card (RAID 0) www.adaptec.com. mSATA-SSD. RAID Card. Median GigaTEPS. 3. Experiment. (Giga Traversed Edges Per Seconds). Load highly accessed graph data before BFS. # of frontiers:nfrontier，# of all vertices:nall, Parameter : α, β. Limit of DRAM Only. 6.0. 4.1. 5.0 4.0. DRAM + EBD-I/O DRAM Only. 3.0 2.0. 3.8. 4 times larger graph with. 1.0. 6.9 % of degradation. 0.0. 23 24 25 26 27 28 29 30 31. Ranked 3rd. SCALE(# vertices = 2SCALE). in Green Graph500 (June 2014).

(35) Dynamic Graph Data Structure Using Local-NVRAM [Iwabuchi et al. SC15 Poster]. Dynamic edges and vertices insert and delete. The dynamic graph data structure (This work). Comp. Node Controller / Partitioner. Node-local NVRAM. Comp. Node. Increase page-level locality of data stored in NVRAM by carefully choosing a hash function. Cumulated Exec. TIme (sec.). Comp. Node. Edge Insert Performance (sorted edges) 1000 800 600. Exceed DRAM cache size (4GB). 400 200 0 1 201 401 601 801 1001 The Number of Inserted Edges (million).

(36) ScaleGraph Large-scale Graph Processing Framework enhanced w/ User-Friendly Python / Spark Interface • ScaleGraph [Suzumura]. • X10-based open source Highly Scalable Large Scale Graph Analytics Library beyond the scale of billions of vertices and edges on Distributed Systems • XPregel: Pregel-based bulk synchronous parallel graph processing framework • Built-in graph algorithms (Centrality, Connected Component, Clustering, etc.). • NEW Development: Python Interface. • Allow users to use ScaleGraph with Spark* by easy python interface Software stack. Cluster. User Program Graph Algorithm. X10. Sparse Matrix (Graph Processing System) BLAS XPregel. Third Party Library X10 & C++ (ARPACK, METIS). File IO. ScaleGraph X10 Standard Lib Base Team Library MPI. User Python Script. ScaleGraph Spark (RDD). HDFS. *Apache Spark: http://spark.apache.org/.

(37) Cloud-based I/O Burst Buffer Architecture [Xu et al. SC15 Poster]. In collaboration talks w/ AWS. • Cloud-based I/O Burst Buffers. • Using several compute nodes in a public cloud as I/O buffer nodes • Taking advantage of high throughput of LAN inside a public cloud. System Architecture Compute node X. IOnode. Hash of file path IOnode. Master Y IOnode. IOnode. IOnode. IOnode. SCBB 0. IOnode. Master M-1 IOnode. IOnode. IOnode. IOnode. SCBB Y. IOnode IOnode IOnode. SCBB M-1. Burst buffers. Shared cloud storage. • The whole system consists of several SCBB (Sub CloudBB) • Each SCBB consists of a Master and several IOnodes. • •. Master=1 Master=1 Master=1. Smaller is better. Master 0 IOnode. Application Case Study. Masters: controlling all IOnodes in the same SCBBs and handling all I/O requests from Compute Nodes. IOnodes: storing actual data and transferring data with Compute Nodes.. 4.58x. Montage: Improve Performance.

(38) EBD System Software. Insightful Analysis of Performance Metrics on Fat-tree Networks [IEEE ICPADS 2015] 1. app Open MPI library network hardware process 1. 2. app. Profiler source port a b. dest. port b a. traffic (kb) 5 15. network communication profile. Hardware-centric traffic visualization switches compute nodes. Tree-topology viz. design. Open MPI library network hardware process 2. Non-intrusive collection of performance metrics using our ibprof profiler • •. Low overhead Captures links traffic.

(39) EDB Distributed Object Store (co-PI: Osamu Tatebe, U-Tsukuba) • Background – Demand for high IOPS and high BW of object store by EDB applications – Innovation of storage device and storage interface. • Objective - design and implementation for EDB Distributed Object Store – – – –. Design of object store in proposed standard storage interface [GPC 2015] Design of concurrent B+tree index for NVM Key Value Ongoing: Indexed cyclo-join for nearest neighbor search Ongoing: Co-design of execution model with meta-genomics application.

(40) Object Storage Design in OpenNVM [Takatsu et al GPC 2015]. • New interface - Sparse address space, atomic batch operations and persistent trim • Simple design by fixed-size Region enabled by sparse address space and persistent trim. Super Region. K OPS. 15.6 Kops/s. DirectFS. 61.3 Kops/s. Proposal. 746 Kops/s. …. Region N. …. 746Kops/sec. Object Creation Performance with Optimizations. 128 reservations + 32 initializations. 800 600. – Bulk reservation and bulk initialization. XFS. Region 2. Next Region ID[]. – Free’ed by persistent trim and no reuse – Enough region size to store one object. • Optimization techniques for object creation. Region 1. 2.8x. 400. 128 reservations. 200 0. Baseline 1. 2. 4 8 # thread. 16. 32. Fuyumasa Takatsu, Kohei Hiraga, and Osamu Tatebe, “Design of object storage using OpenNVM for high-performance distributed file system”, the 10th International Conference on Green, Pervasive and Cloud Computing (GPC 2015), May 4, 2015. 1.5x.

(41) Optimizations for Object Creation • Bulk Reservation. • Bulk Initialization. – Super region keeps the position of the next region – Bulk reservation commits the position every N times – When failure happens, only logical sparse space is lost. – Super block in each region will be initialized when it is used – OpenNVM atomic batch write performs better when the number of vectors increases – Bulk Initialization initializes N regions with a single batch write operation – When failure happens, lose non-used initialized super blocks. • Physical space is not affected. Super Region. Super Region 1 Region. Super Region 2 Region. …. Super Region N Region. …. Update every N times. 2015/11/10. Initialize with one atomic operation Super … Region. Super Block. Super Block. …. Super Block. … 41.

(42) Concurrent B+Tree Index for Native NVM-KVS [Jabri] •. •. Enable range-queries support for KVS running natively on NVM like fusionio ioDrive Design of Lock-free concurrent B+Tree. Lock-free operations – search, insert and delete • Dynamic rebalancing of the Tree • Nodes to be split or merged are frozen until replaced by new nodes. •. • Asynchronous interface using future/promise in C++11/14. NVM-KVS supporting range-queries. In-memory B+ Tree. OpenNVM like KVS Interface NVM (Fusion-io flash device).

(43) Proposal: Asynchronous equivalent call for NVMKV • •. NVMKV API calls are synchronous blocking calls Turn them asynchronous with respect to calling thread at the cost of occupying some background threads: – Leverage C++11/14 concurrency and asynchrony tools: future/promise, std::async – Lock-free queue of closures (lambda) with a set of background threads consuming those lambdas • Each lambda could be seen as a transaction (atomic) with respect to others. – Calls are wrapped into lambdas returning a future that encapsulates the result and exceptions – Calling thread could retrieve the result when needed by querying the future. •. •. Calling thread could query the future to check if the result is ready (do something else if not or just block until it is) Could be more relevant with task (as future) continuation and composition like Microsoft PPL’s or HPX: .then(), .when_any(), when_all(). Active Object, handling asynchronous nvmkv calls Background Threads: execute & set promises value to execution results dequeue. Lock-free message queue enqueue. Future/Promise channel setup for the incoming request. Async. Put/Get invocation. Return future to query and retrieve the result of Get/Put.

(44) Review and future work H25. H26. H27. H28. H29. H30. Design of distributed object store Prototype Implementation and evaluation Evaluation by applications and enhancement. • Design of Distributed Object Store. – Design of Local Object Store in OpenNVM – Design of concurrent B+Tree index for native NVM-KVS. • Ongoing work. – Spatial index and cyclo-join for nearest neighbor search with Miyoshi Group – Co-design of execution model with meta-genomics application with Akiyama Group. 44.

(45) EBD Interconnects [Koibuchi G] Background: DC and supercomputer NW are not optimized to EBD traffic Goal: low-latency topology/routing and direct remote access to storage EBD non-uniform access Typical Data Centers -Poor scalability - 1GbE + 10GbE - TCP/IP basis. Low latency write/read ～10μs for 4KB. Extreme Big Data Flow. K computer. Supercomputers - Dedicated to neighboring and uniform access. Low-jitter topology w/ random shortcuts. TCP/IP bypassing direct comm. to flash.

(46) EBD Traffic Patterns Better. 2D-Torus Random. 1.5. Performance. • All-to-all access – Random is better – Low latency requirement! • Pure stencil access – Topology does not impact. 2.0 1.0 0.5 0.0. CG. FT. Graph500.

(47) Our EBD Random-Based Topology Design. [Ikki Fujiwara et al, Skywalk: a Topology for HPC Networks with Low-delay Switches, IPDPS2014]. Cable delay will affects topology design → Make cable-geometric random topology + minimal-latency path routing. . Cisco SFS7000D. . QLogic 12300. . A future product. Switch latency 200 ns 140 ns. ?. ~ 50 ns ?. 21. 17. 13. 31 30 29. 227. 27 24. 20. 16. 12. 9 8. z. 28. 23 22. 18. 14. 7 10 6. 26. 4. 25. 5. Switch 0. 19. 33. Switch 1. 15. 34. Switch 2. 11. 35. Cabinet Switch 3. 32. 67. 99 98. 66 65. 97 96. 64. 131. 163 162. 130 129. 161 160. 128. 194. 226 225. 193 192. 224. 195. = 10m fiber. Max. Latency [ns]. (when compared to Dragonfly, pure random). 2600. 2600. 2400. 2400. 2200. 2200. Our goal. 2000 1800. 3-D Torus degree=6 diameter=20. 1600 1400 1200. Random degree=6 diameter=7. 1000 800. Max. Latency [ns]. Better tradeoff between latency and NW radix. 0. 50. 100. 150. Switch delay [ns]. 200. A new topology for high-radix NW • The new topology achieves +19% maximum latency at −90% cable length than Dragonfly! • achieves low latency at short cable length 47. 2000 1800. Hy d d. 1600 1400 1200. Ra d d. 1000 800. 0. 50. 100. Switch delay [n.

(48) Is Random-basis best ?. Random is theoretically best in some cases. http://research.nii.ac.jp/graphgolf/.

(49) Graph Golf: Order/degree Open Problem • We organize a graph competition now – The winner will be awarded in CANDAR’15. • Find a graph that has smallest diameter & average shortest path length for given an order and a degree ⇔ Famous graph diameter/degree problem is to maximize the network size for a given degree (Toward Moore bound). http://research.nii.ac.jp/graphgolf/.

(50) Review of the results so far & Future work • Capturing MPI EBD app enables new efficient EBD Topology and Routing – by successfully extending SimGrid event-discrete network simulator – Towards Ideal Network Topology IPDPS2014, ICPADS2014,IEEE Trans. on Parallel and Distributed Systems2014,HPCA2015, IEEE Spectrum (June,2015). ・ Future Work. (1) The use of Free-Space Optics (new device) for EBD Apps. (2) Approximate Networking for EBD Apps FSO Terminal Switch. Laser Beam Cabinet Cable. 50.

(51) EBD Open Source Software Development • Early research deliverable outcome • Distributed as OSS & facilitated on TSUBAME2/3 for CREST-BD • GPU-based Large-scale Distributed Sorting • Scales over 1000 nodes • STL-like Interfaces. • ScaleGraph: Large-scale Graph Processing Framework enhanced w/ User-Friendly Python / Spark Interface • Calling Fast Graph Algorithm Implementations via Python • Easy Pregel-style Graph Programing using Python. • Cloud-based I/O Burst Buffers. • Fuse-based Filesystem for general purpose I/O operations.

(52) 2016 Q4 TSUBAME3.0 Leading Machine Towards Exa & Big Data TSUBAME3.0. 1. “Everybody’s Supercomputer” - High Performance (~20 Petaflops, 4~5PB/s Mem, ~Pbit/s NW), innovative high cost/performance packaging & design, in mere 180m2… 2. “Extreme Green” – ~10GFlops/W power-efficient architecture, system-wide power control, advanced cooling, future energy reservoir load leveling & energy recovery 3. “Big Data Convergence” – Extreme high BW &capacity, deep memory 2013 hierarchy, extreme I/O acceleration, Big Data SW Stack TSUBAME2.5 for machine learning, graph processing, … upgrade 5.7PF DFP 4. “Cloud SC” – dynamic deployment, container-based 2016 TSUBAME3.0 /17.1PF SFP ~20PF(DFP) 4~5PB/s Mem BW node co-location & dynamic configuration, resource 20% power 10GFlops/W power efficiency elasticity, assimilation of public clouds… reduction Big Data & Cloud Convergence 5. “Transparency” - full monitoring & user visibility of machine & job state, 2010 TSUBAME2.0 accountability 2.4 Petaflops #4 World “Greenest Production SC” via reproducibility 2006 TSUBAME1.0 80 Teraflops, #1 Asia #7 World “Everybody’s Supercomputer”. Large Scale Simulation 2013 TSUBAME-KFC Big Data Analytics 5 #1 Green 52 500 Industrial Apps 2011 ACM Gordon Bell Prize.

(53) Phase 2 TSUBAME3.0 2016 EBD Highlights (Co-designed exclusively for TSUBAME3 w/vendors) • ~80 Petaflops FP16 for Deep Learning etc. • 4~5 Petabyte/s total memory BW (~x1.5 K-Computer) • 800 Gbps NW throughput / node, 1 Petabit/s total • = Entire global intra-IDC network capacity circa 2016 according to CISCO. • 1~2 microsecond NW end-to-end latency, ~10 microsecond entire machine reduction op • 5 Petabyte, 5~6 Terabyte/s NVM storage Also many advanced • Could extend to max 14 Petabyte, ~50 Terabyte/s NVMe for Skylake 2017 Cloud and Big Data • = ~1.6 Zetabyte/year “storage” throughput software elements on • Immense IOPs (not calculated yet ;-) top of the HW • Approximately 30 racks (1/30 K-computer) • 2.4MW max, 1MW for bandwidth dominant EBD applications including cooling (1/30 Post-K) using state-of-the-art supercomputing cooling and power control • (We need a little more money to realize the above full spec…).

(54) Phase 2 TSUBAME3 (2016) Architecture for EBD Ultra High BW, Deep Mem Hierarchy, Low Latency NW NV-Link 80GB/s. DDR4 x 4 64^128GB 100GB/s. 16GB/s PCI-E 3.0 PLX. PCI-e x 16. Broadwell Xeon-EP 14~ cores. 16GB/s 16GB/s Mellanox EDR HCA No Or OmniPath 100Gbps ~30 racks. existing product. 16. Pascal GPU 32GB 1TB/s. PCI-E 3.0 PLX. HBM HBM. QPI 2.0. Pascal GPU 32GB 1TB/s. x10. PCI-E 3.0 PLX. HBM HBM. PCI-e x 16. DDR4 x 4 64~128GB 100GB/s Broadwell 16 Xeon-EP 14~ cores. D. Pascal GPU 32GB 1TB/s. NV-Link 80GB/s. 16GB/s. 4 Mellanox Mellanox x30~ EDR HCA EDR HCA 100 On-board Or OmniPath Or OmniPath Flash Terabytes 100Gbps 100Gbps Gigabytes/s. Pascal GPU 32GB 1TB/s. PCI-E 3.0 PLX Mellanox EDR HCA Or OmniPath 100Gbps. 400+400Gbps/node ~1Petabit/s total 2 microsec end-to-end.

(55) SINET5: Nationwide Academic Network  2016 SINET5 connects all the SINET nodes in a fully-meshed topology and minimizes the latency between every pair of the nodes using nationwide dark fiber  MPLS-TP devices connect a pair of the nodes by primary and secondary MPLS-TP paths.. SINET5 2016. SINET4 present. • Connects all the nodes in a fully-meshed topology with redundant paths • Secondary paths do not consume resources. • Connects nodes in a star-like topology • Secondary circuits of leased lines need dedicated resources : Leased Line (Primary Circuit). : MPLS-TP Path (Primary). Data center. : Leased Line (Secondary Circuit). : MPLS-TP Path (Secondary). SINET Node. SINET Node. SINET Node. ROADM Node +MPLS-TP. ROADM +MPLS-TP. ROADM +MPLS-TP. Edge Edge. GW Node. Edge Node Core. Edge. GW Node. Edge Node. Core. : Dark Fiber. Core. Edge. : Wavelength Path. : MPLS-TP Path. Edge Node Edge. Node. Edge Edge. Node. Node Node. ：10 Gbps ：100 Gbps ：> 200 Gbps.

(56) Proposed Big Data and HPC Convergent Infrastructure => “Nationoal Big Data Science Virtual Institute” （Tokyo Tech GSIC）（Objective） • • •. Convergence of High Bandwidth and Large Capacity HPC facilities with “Big Data” currently processed managed by domain laboratories HPCI HPC Center => HPC and Big Data Science Center People convergence: domain scientists + data scientists + CS/Infrastructure => Big data virtual institute. Present. Domain labs segregated data facilities No mutual collaborations Inefficient, not scalable with Not enough data scientists. 2013 TSUBAME2.5 Upgrade 5.7Petaflops. 2016 TSUBAME3.0 Green&Big Data HPCI Leading Machine Ultra-fast memory network, I/O Data Management Big Data Storage Deep Learning SW Infrastructure. Convergence of top-tier HPC and Big Data Infrastructure. National Labs With Data Big Data Science Applications. Mid-tier “Cloud” Storage. Archival Long-Term Storage Goal 100 Petabytes. Virtual Multi-Institutional Data Science => People Convergence.

(57) Application-Centric Overlay Cloud Utilizing Inter-Cloud (PI: Kento Aida, NII Japan)infrastructure • data store/access method データベース. 計算機・ストレージ. selecting the optimal set of resources (computers, network, storages) that fulfill application requirements. compt. 計算機・ストレージ. data. compt. Jitsumoto Gr. (w/Matsuoka) Tokyo Inst. Tech.. データベース. data. platform customized for app A. 計算機・ストレージ. compt. Ogasawara Gr. (National Genetics Inst.). Internet. platform customized for app B. data. automatically and quickly building data analysis platform optimized for application. Aida Gr. National Inst. Informatics. application • Genome sequencing • Fluid Acoustic Analysis. データベース. middleware. portal. user. app A. Amano Gr. (Univ. Kyushu). on. inter-cloud • testbed operation. optimal resource selection. Munetomo Gr. (Hokkaido U). 57. user. app B. Approved as a JSTCREST “Big Data” Project, 2015 Kento Aida, National Institute of Informatics.

(58) HPCI Data Publication Prototype GSIC and DDBJ @ National Institute of Genetics & Amazon Storage Service (Cloud Burst Buffer) DDBJ Center, National Institute of Genetics Data Generation and Provanance NEW: HPCI Storage Server: 300TB. GSIC Center, Tokyo Institute of Technology HPCI Storage Cache for Data Publication. Test HPCI Global File System (Gfarm: 1.2 PB). NIG User. Gfarm Meta Data Server. NEW: HPCI Storage Server: 0.9PB for Cache. SINET4 (10Gbps) ＬｏｃａｌＳｔｏｒａｇｅ > 3PB Genetic Data. TSUBAME User. ＬｏｇｉｎＮｏｄｅ. ＬｏｇｉｎＮｏｄｅ Direct distance: 86km. NIG Supercomputer (Mishima, Shizuoka, JAPAN). Cloud Burst Buffer. ＬｏｃａｌＳｔｏｒａｇｅ ~11PB. TSUBAME 2.5. (Meguro, Tokyo, JAPAN).

(59) 2015 EBD Co-Design Status Quo Akiyama Group. Large Scale Graphs and Social Infrastructure Apps. Large Scale Genomic Correlation. Generalized Programing Model for Large-scale Genomic Correlation HAMAR: MapReduce for EBD. Miyoshi Group. Suzumura Group. Programing Model for Social Simulation. Data Assimilation in Large Scale Sensors and Exascale Atmospherics. Fail-safe I/O for Severe Weather Forecast. I/O Acceleration for Genome DB Matching. Nearest Neighbor Search. Concurrent B+Tree Index for NVM. ScaleGraph: Graph Computation Framework. GPU-based Large-scale Graph Computation i.e. Grpah500. Burst Buffer Evaluation on TSUBAME systems. NVM-based Graph Computation/Store. Network Visualization GPU-based Largescale Sorting. Interaction w/ Simulation Log Results on DB. NVM-based Sorting. NVM-based Object Store w/ EBD Standard Interfaces. Next-gen Network Design using EBD Co-Design Apps. Indexed Cyclo-Join. Cloud-based Burst Buffers.

(60) mrCUDA. EBD System Software. • Background:. • rCUDA [1]: Remote CUDA execution middleware • Our previous work [2]: We showed that rCUDA can be used for solving the idle-GPU scattering problem in multi-GPU batch-queue systems. • Performance Problem with rCUDA: Some HPC applications (e.g. LAMMPS) experience huge slow down when using rCUDA (an example on next slide) • Motivation: Local GPUs may become available while rCUDA jobs are running, and local is always better than remote in terms of performance • Proposed solution: mrCUDA – low-overhead middleware for transparently live migrating CUDA execution from remote to local GPUs [1] J. Duato et al. rCUDA: Reducing the number of GPU-based accelerators in high performance clusters. (HPCS2010) [2] P. Markthub et al. Using rCUDA to reduce GPU resource-assignment fragmentation caused by job scheduler. (PDCAT2014).

(61) Linearly increase due to the rCUDA’s overhead before migration. mrCUDA’s overhead. Case Study (LAMMPS) negligible. visible but small. After the migration, mrCUDA completely cuts off the rCUDA’s overhead. *2 nodes, Tesla K20c, InfiniBand 4xFDR. *N: used native CUDA, *R: used rCUDA *x%: migrated after finish x% of total iterations.

(62) EBD Co-Design Apps.

(63) EBD Co-Design Apps.

(64) Japan’s High Performance Computing Infrastructure (HPCI) HPCI: a nation-wide HPC infrastructure Hokkaido U. - Supercomputers ~40 PFlops (2015 April) - National Storage 22.5 PB HDDs + Tape (~14PB Used) - Research Network (SINET4), 40+10GBps->SINET5 (2016) 200~400Gbps - SSO (HPCI-ID), Distributed FS (Gfarm FS) - National HPCI Allocation Process Osaka U.. Kyoto U.. JAMSTEC. Tohoku U.. U. Tsukuba 1PF. Tokyo Tech.. Kyushu U 1.7PF. Nagoya U.. Riken AICS • “K” computer 11Petaflops • HPCI Storage (10PB). Tokyo Tech TSUBAME2.5 5.7 Petaflops HPCI Storage (0.5->1.5PB). U. Tokyo • Supercomputers 1.1PF • HPCI Storage (12PB). NII • Management of SINET &Single sign-on. 3.

(65) Towards the Next Flagship Machine & Beyond PF. 1. • The Flagship 2020 project: the next national “flagship” system (NOT Exascale) for 2020 • Alternative “Leading Machines” inbetween. 2008. T2K 2010. RIKEN. PostT2K TSUBAME3.0. 9 Universities and National Laboratories. • Co-design primary key • Some academic-led R&D (esp. system SW and overall architecture). • International collaboration • New targets e.g. power, big data, etc.. Tokyo Tech. TSUBAME2.0. U. Tsukuba U. Tokyo Kyoto U.. Future Exascale. Flagship 2020 “Post K”. 2012. 2014. 2016. 2018. 2020. 65. 2014/09/11.

(66) Japanese “Leading Machine” Candidates Roadmap of the 9 HPCI University Centers + Post K -> Total ExaFlop? Fiscal Year (start March). Hokkaido Tohoku Tsukuba Tokyo. 2012. 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023. Hitachi SR16000/M1 (172 TF, 22TB) Cloud System Hitachi BS2000 (44TF, 14TB). NEC SX-9 + Exp5800 (31TF). NEC SX-ACE 800TFlop/s. HA-PACS (800 TF, GPU) COMA (1PF Xeon Phi) T2K Todai (140 TF). Tsubame 2.0 (2.4PF, Tsubame 2.5 (5.7 PF, 110+ 97TB, 744 TB/s)1.8MW TB, 1160 TB/s), 1.4MW. Nagoya. Fujitsu M9000(3.8TF, 1TB/s) HX600(25.6TF, 6.6TB/s) FX1(30.7TF, 30 TB/s). Osaka Kyushu. 30+Pflop/s. Fujitsu FX10 (1PFlops, 150TiB, 408 TB/s), Hitachi SR16000/M1 (54.9 TF, 10.9 TiB, 5.376 TB/s). Cray XE6 (300TF, 92.6TB/s), GreenBlade 8000 (243TF, 61.5 TB/s). Hitachi SR1600(25TF). 100+ PF 50+ PF. Tsubame 3.0 (20~25 PF, 5~6PB/s) 1.8MW (Max 2.3MW). Fujitsu FX10 (90.8TF, 31.8 TB/s), CX400(470.6TF, 55 TB/s) Post FX10 Upgrade (3PF) Cray XC30 (429TF) Cray XC30 Phi(584TF). SX-8 + SX-9 (21.7 TF, 3.3 TB, 50.4 TB/s). 50+ PF. PostT2K JHPCA (~30 PF, (100+ TiB, 600TiB, 4.0+ PB/s, 0.68+ PB/s)). Tokyo Tech.. Kyoto. 10+ PF. 50+ PFlops. 10+ PF. 50+ PF. NEC SC-ACE 400TFlops. Hitachi HA8000tc/HT210(500TF, 215 TiB, 98.82TB/s), Xeon Phi (212TF, 26.25 TiB, 67.2 TB/s), SR16000(8.2TF, 6 TiB, 4.4 TB/s). Fujitsu FX10（270TF, 65.28 TB/s), CX400(510TF, 152.5 TiB, 151.14 TB/s), GPGPU(256TF, 30 TiB, 53.92 TB/s). ~17PF April 2015, Japan-wide ~40PF(incl. K),. Tsubame 4.0 (100~200 PF, 20~40PB/s), 2.3~1.8MW. (5+ PiB/s) 10+ PF. 50+ PF.

(67) Flagship 2020 “Post K” Supercomputer  CPU • Many-core with Interconnect interface integrated on chip • Power Knob feature for saving power  Interconnect • TOFU (mesh/torus network). 2014/09/11. … … … … … … … … … … … …. Co-design may include: • Compute Node Features • FP performance • Memory hierarchy, control, capacity, and bandwidth • Network Performance • I/O Performance. I/O Network Maitenance Servers. Portal Servers Login Servers Hierarchical Storage System. :Interconnect : Compute Node. 67.

(68) Current status of the Flagship 2020 project . . The project currently procured development of the basic design of the Flagship 2020 supercomputer In the specification RFP: . Constraints are:    . . The system should be designed to maximize the performance of applications in each computational science field. .  . Power capacity (about 30MW) Space for system installation (in Kobe AICS building) Budget (money) for development (NRE) and production. ... some degree of compatibility to the current K computer.. "Co-design" is a keyword!. First Prototype 2018, Deployment start 2019, full system completion 2020 Fujitsu announced as the winner Oct 1, 2014 68.

(69) 9 Priority Research Areas Category Healthcare and longevity. Priority Research Area ① Infrastructure for advanced drug discovery controlling function of biomolecule ② Integrated computational life science for individual preventive medical treatment. Disaster mitigation and Environment al Issues. ③ Total prediction system for disasters caused by complex combination of earthquake and tsunami. Energy. ⑤ New social infrastructure for efficient energy generation, conversion, storage, and use. ④ Advanced prediction of weather and global environment with observed BIG DATA. ⑥ Advanced clean energy system technology. Manufacturin g and Engineering. ⑦ New functional devices and high-performance materials. Basic Science. ⑨ Investigation of the fundamental laws of the universe. 5 Frontier Research Areas Frontier Research Area Acceptan ce is decided based on future possibiliti es. ⑩ Frontier of Basic Science – Challenging Exploratory ⑪ Modelling interaction among various social economical phenomena and its applications ⑫ Investigating birth of Extrasolar planet (another earth）and environmental change in planets of the solar system ⑬ Investigating neural network mechanism realizing think and applying the mechanism to artificial intelligence. ⑧ Advanced CAD/CAM leading the near-future manufacturing. 69.

(70) HPCI Nationwide HPC Storage Cloud.  21.8 PB (separate from local). ~70% full. HPCI East HUB Univ. Tokyo • 11.5PB + 20PB Tape. NII SINET4->SINET5.  High resiliency and availability Tokyo Tech  Redundant Servers・RAID6 • 0.3PB -> 1.2PB  Active Repair • (TSUBAME 11PB Local)  Multi-Tier Distributed Storage HPCI West HUB  Multi-vendor utility Riken AICS  ZABBIX, Ganglia • 10PB + 60PB Tape  Fault Detection & Information Sharing. HPCI. High Performance Computing Infrastructure. copy file to “Gfarm” via login node. HPCI Single Sign-on AAA. Tohoku-U Tsukuba U JAMSTEC. K Computer (30PB Local) => PostK (2020) U-Kyushu. replication to (neighbor) host - access efficiency, dependability. WHub. Tokyo Tech. Hokkaido-U. EHub. HPCI Storage Cloud Gfarm over SINET. Inst. Stat. Math. Nagoya-U Kyoto-U Osaka-U. copy file to Site-local FS via login node Site-Local FS.

(71) Extreme Big Data Examples Rates and Volumes are extremely immense Social NW – large graph processing •. Facebook – – –. •. •. •. Genomics advanced sequence matching. 7 billion people. Peta~Zetabytes Data. Input Data –. 500 million active users 340 million tweets per day 300 million new websites per year 48 hours of video to YouTube per minute 30,000 YouTube videos played per second. Target Area: Planet (Open Street Map). –. –. Internet – – –. NOT simply mining Tbytes Silo Data. Applications –. Twitter – –. •. 〜1 billion users Average 130 friends 30 billion pieces of content shared per month. Social Simulation. –. Road Network for Planet: 300GB (XML) Trip data for 7 billion people 10KB (1trip) x 7 billion = 70TB Real-Time Streaming Data. Ultra High-BW Data Stream. (e.g., Social sensor, physical data). •. Highly Unstructured, Irregular. Simulated Output for 1 Iteration –. 700TB. Weather – real time large data assimilation Phased Array Radar 1GB/30sec/2 radars. Himawari 500MB/2.5min. A-1. Quality Control A-2. Data Processing. B-1. Quality Control B-2. Data Processing. ①30-sec Ensemble Forecast Simulations 2 PFLOP. ②Ensemble Data Assimilation 2 PFLOP. Analysis Data 2GB. Ensemble Forecasts シミュレーションシミュレーション 200GB データデータ. Ensemble Analyses シミュレーションシミュレーション 200GB データデータ. ③30-min Forecast Simulation 1.2 PFLOP. Repeat every 30 sec.. 30-min Forecast 2GB. Complex correlations between data from multiple sources Extreme Capacity, Bandwidth, Compute All Required.

(72) Further real-world applications in medical science Oral Microbiome. Gut Microbiome. Clinical Study on Periodontotitis （歯周病） using GHOST-MP. Fundamental Study on Gut immunity （腸管免疫） using GHOST-MP. Prof. Kazuyuki Ishihara (Tokyo Dental College). Prof. Satoshi Uematsu (IMS, Univ. of Tokyo). Collaboration research (involving human subjects) formally started from 2015.. Collaboration research started from 2015.. www.niddk.nih.gov.

(73) Review of the current results and future work Current. 1. GHOST-MP is about x100 faster than mpi-BLAST, and has extreme scaling with its original efficient load-balancing mechanism (mpidp). 2. Matsuoka group has transported it to MapReduce framework (Hamar). 3. We have analyzed whole human oral microbiome data on HMP database which has 18 billion DNA reads (1.3TB) with high-sensitivity search. Future. 1. We will renovate the engine to employ further faster GHOSTZ algorithm. Furthermore, we have implemented GHOSTZ-GPU utilizing multi-GPUs. 2. Architecture co-design, especially on combination of fast local storage (SSD, etc.) with burst-buffer-like hierarchical technique will be studied. 3. Real-world applications in medical science is ongoing (perio- and gut)..

(74) Failsafe methods to recover the system downtime. LETKF: Local Ensemble Transform Kalman Filter (Hunt et al. 2007).

(75) Forecast accuracy  compute speed.

(76) 100 members. 10240 members. Sampling noise reduced. [Miyoshi, T., Kondo, K. and Imamura, T. 2014: 10240-member ensemble Kalman filtering with an intermediate AGCM. Geophys. Res. Lett., 41, 5264–5271.]. High-precision probabilistic representation.

(77) EBD App 2 Conclusion • 4D-LETKF can recover the system downtime effectively. • We found tradeoffs between the computer time and forecast accuracy.. Future perspective Potential collaboration with Matsuoka Group. Phased array weather radar 4MB/30sec (Thinned). Himawari-8 500MB/2.5min. • Enhanced data transfer and I/O in BDA 7.5GB/member. A-1. Quality controll A-2. Data processing. B. Data processing. ①Ensemble 30 seconds forecast simulation (Model：JMA-NHM). 1200 sec.（K 1440 nodes）. ②LETKF Data assimilation （100 members） 1400 sec.（K 1440 nodes）. Analysis data 4.0GB. 45.5GB/member Ensemble forecast シミュレーションシミュレーション 2500GB データデータ. Repeat. Ensemble analysis シミュレーションシミュレーション 374GB データデータ. ③30 minutes forecast simulation 1160 sec.(K 900 nodes) Forecast data 142GB.

(78) • Evaluation Results • Conditions Area. Dublin. Cross points. 19,062. Roads. 40,965. Vehicles. 234,378. Simulation steps. 14,400 (4 hours). Machine. TSUBAME 2.5 Thin-node. Nodes. Up to 16. Threads/node. 12. Memory/node. 48GB. Java Version. HotSpot 1.7.0. X10 Version. 2.5.1. • Dynamic agent assignments. • Synchronization Reduction. • Achieved 37% speed-up with across-node assignments and 35% with reduction of synchronization frequency. • The number of nodes at the best performance is always 6 (total 72 threads) regardless of any optimization method..

(79) Exact-Differential Traffic Simulation [Hanai et. al., SIGSIM PADS 2015]. • It needs to run simulation many times with various parameters or scenarios (what-if analysis). • There are many redundant events between initial simulation and later repeating simulation.. • Naive partial simulation never brings about exact results (exactly same as full simulation).. • We proposed a redundant reduction technique keeping exactly same results..

(80) • Main ideas. • Store all intermediate logs in initial simulation • Process only affected events in repeating simulation. • Implementation. • Use optimistic simulation (Time Warp) to detect affected events and reprocess them.

(81) • Results. • Emergent traffic control scenarios in Tokyo Bay Area (161K junctions, 203K roads, 5000 vehicles in 3 hours) • Reduce 92.4% events in average, 62.8% off in the worst case • 2x - 7x speed-up from original simulation.

(82) EBD App3 Conclusion and Future Works • Consideration from Results • Across-node agent assignment is the most effective method • The parallelism did not yet improve by our methods. • Future Work. • Further performance evaluation with larger transportation systems • More sophisticated (monitoring and automatic) agent-assignment method. • Towards Big Data Problems • Initial simulation is a highly data intensive workload. • over 10 GB logs even in small scenario (Tokyo Bay Area 141K junctions, 3 hours) • In full Tokyo simulation (770K junctions, 24 hours), it generates over TB scale logs. • Co-design Solutions. • Algorithm optimization (lazy storing, reverse computing) • Out-of-core process with database (in Tatebe G.) • Hardware supports (nonvolatile memory, SSD, burst buffering in Matsuoka G., etc.).

(83) EBD Co-Design and Meetings • Mutual co-design feedbacks between apps and system software teams • 8 face-to-face meetings in total • Mini-apps as software instances for developing EBD system software stacks • Meta-genome Application (GhostX, Ghost-MP etc.) from Akiyama Group • EBD-based MapReduce-style Software Framework • I/O Acceleration using NVM technologies. • Weather Assimilation Application (LETKF) from Miyoshi Group • Workflow Acceleration • Nearest Neighbor Search Acceleration • I/O Acceleration w/ FT. • Social Simulation from Suzumura Group. • Co-design meeting with other CREST-BD groups desirable • Use TSUBAME as a common platform for true interaction. • Co-design meetings with DENSO Lab..

(84)

No results found