• No results found

Development of a Computational and Data-Enabled Science and Engineering Ph.D. Program

N/A
N/A
Protected

Academic year: 2021

Share "Development of a Computational and Data-Enabled Science and Engineering Ph.D. Program"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Development of a Computational and

Data-Enabled Science and Engineering Ph.D.

Program

Paul T. Bauman, Varun Chandola, Abani Patra Matthew Jones

University at Buffalo, State University of New York

EduHPC 2014 Workshop November 16, 2014

(2)

What is CDSE?

(3)

Need for Data Science Training

• 140,000–190,000 additional trained personnel for “deep

analytical talent positions”, and 1.5 million more “data-savvy managers” needed to take full advantage of big data in the United States.

• National shortage for such talent is at least 60% – McKinsey

Global Institute

• “Universities can hardly turn out data scientists fast enough.” New York Times

(4)

Need for Data Science Training

• Many new discoveries in science, new constructs in

engineering and efficiencies/opportunities in business are made possible by CDSE methodologies e.g.

• LHC generates 2-6 GB data/second — mining of which led to

the Higgs discovery!

• LSST is hailed as a breakthrough in data enabled science.

• Demand – students in multiple departments (MAE, CBE,

CSEE, CSE, Math, Physics, Chemistry, ...) are in ad hoc programs of study aligned with this theme with gaps, poor sequencing and preparation

(5)

A Little History!

• Analog with Computational

Science program development

• Many scientists doing

computing, but little training in numerical mathematics, algorithms, software engineering

• Major advancements came

not just from hardware, but from better algorithms

Taken from David Keyes, “Computational and applied mathematics in scientific discovery”

(6)

Interdisciplinary Approach

• We expect similar trends in

the emergence of “Big-Data”

• Currently, trends essentially

follow Moore’s Law

• Opportunities for major

advancements emerging from interdisciplinary efforts

• We want history to repeat

itself here!

• Gains will come from

scientists who understand the problems on both sides

Transaction Performance vs. Moore’s Law: A Trend Analysis 117

Fig. 3. TPC-C primary performance metric trend vs. Moore’s Law

During the years 1996 to 2000, Moore’s Law slightly underestimates NtpmC, while starting with 2008 it slightly overestimates NtpmC. In the period between 1996 and 2000 the publications used a very diverse combination of database and hardware plat-forms (about 12 hardware and 8 software platplat-forms). Between 2001 and 2007 Moore’s Law estimates NtpmC pretty accurately. In this period most publications were on x86 platforms running Microsoft SQL Server. Starting in 2008 Moore’s Law over esti-mated NtpmC. At the same time the number of Microsoft SQL Server publications started decreasing. 2008 saw 21 benchmark publications, 2009 saw 6 benchmark

pub-lications and, so far, 20103 has seen 5 benchmark publications. This is mostly because

of Microsoft’s increased interest in the new TPC-E benchmark, introduced in 2007, while Oracle’s DBMS took the lead in a number of TPC-C publications. Also, during this period the number of cores per processor for x86 processors increased to six. Another reason for the decline in the number of TPC-C publication is the increased benchmark implementation cost, mainly due to an increased number of spindles re-quired to cope with increase in processor speed.

Ignoring the two time periods during which NtpmC differed slightly from Moore’s Law, TPC-C performance increases are remarkably similar to the processor perfor-mance improvements. This suggests that TPC-C perforperfor-mance is attributed solely to 3 As of June, 2010. 100 1000 10000 100000 1000000 199 3 199 4 199 5 199 6 199 7 199 8 199 9 200 0 200 1 200 2 200 3 200 4 200 5 200 6 200 7 200 8 200 9 201 0 Av er ag e N tp m C p er Ye ar Publication Year

Average tpmC per processor Moore's Law

TPC-C Revision 2

TPC-C Revision 3, First Windows result, First x86 result First clustered result

First result using 7.2K RPM SCSI disk drives First result using storage area network (SAN)

Intel introduces multithreading TPC-C Revision 5, First x86-64 bit result First multi core result

First Linux result First result using 15K RPM SCSI disk drives

First result using 15K RPM SAS disk drives First result using solid state disks (SDD)

Taken from Nambiar and Poess, “Transaction Performance vs. Moore’s Law: A Trend Analysis”, TPCTC 2010

(7)

UB CDSE Ph.D. Program

• Students will get heavy doses of graduate applied

mathematics/statistics, computer science, and HPC

• Build on a strong foundation in their core science/engineering

• Will have “end-to-end” expertise in innovating, developing, and

(8)

Other Data Science Programs

• NYU: Many MS programs, Ph.D. CS specializations

• Columbia: MS Data Science

• Stanford: MS CME, Data Science track

Data

Science ComputationalScience

CDSE

(9)

UB CDSE Ph.D. Program

(10)
(11)

Core Courses

Data Science

• CSE 574: Introduction to Machine Learning

• EE 634: Principles of Information Theory and Coding

• IE 512: Decision Analysis

• MAE 674: Optimal Estimation Methods

• STA 502: Statistical Inference

• STA 503: Regression Analysis

• STA 521: Introduction to Theoretical Statistics I

• STA 536: Statistical Design and Analysis of Experiments

• STA 567: Bayesian Statistics

Applied and Numerical Mathematics

• IE 572: Linear Programming

• IE 575: Stochastic Methods

• MAE 702: Applied Functional Analysis • MTH 539: Methods of Applied Mathematics I • MTH 540: Methods of Applied Mathematics II • MTH 543: Fundamentals of Applied Mathematics • MTH 631: Analysis I • MTH 649: Partial Differential Equations

(12)

Core Courses

High Performance and Data Intensive Computing • CSE 562: Database Systems

• CSE 587: Data Intensive Computing

• CSE 603: Parallel and Distributed Processing

• CSE 999: Large-Scale Distributed Data Systems

• IE 551: Simulation and Stochastic Models

• MAE 609: High Performance Computing I

• MAE 610: High Performance Computing II

• STA 545: Data Mining I

• STA 546: Data Mining II

(13)

Core Faculty

Co-Directors

Venu Govindaraju (CSE), Abani Patra (MAE)

Lead Faculty

Gino Biondini (Math), Vipin Chaudhary (CSE), Marianthi Markatou (Biostats), Murali Ramanathan (Pharm), Christian Tiu (Mgt Sci)

New CDSE Faculty

Paul Bauman (MAE), Varun Chandola (CSE), Michele Dupuis (CBE)

(14)

Affiliated Faculty

School of Engineering and Applied Sciences

A. Aref (CSEE), C.W. Chen (CSE), G. Dargush (MAE), S. Das(MAE), P. Desjardin (MAE), M. Demirbas (CSE), J. Errington (CBE), E. Furlani (CBE), T. Furlani (CCR), K. Hoffmann (Phys & BME), D. Kofke (CBE), T. Kosar (CSE), K. Lewis (CSE), D. Pados (EE), C. Qiao (CSE), D. Salac (MAE), G. Scutari(EE), ...

College of Arts and Sciences

L. Biang (GEO), M. Bursik (GLY), J.-H. Jung (Math), O. Khan (SAP), J. Ringland(Math),S. Metcalf (GEO), C. Renschler (GEO), B. Spencer (Math), G. Valentine (GLY), S. Rapoccio(Phy)

School of Management

S. Das Smith (MSS), R. Ramesh (MSS), H. R. Rao (MSS), A. Zhang (CSE)

(15)

Center for Computational Research

• Leading Academic Supercomputing Center

• More than 14 years experience delivering HPC to campus

• 100 Tflops Total Aggregate Compute Capacity

• 8000 cores; 600 TB of high performance storage

• Major Upgrades Planned for 2014

• $1.2M ESD CFA Award

• $100M Genome Initiative with NY Genome Center in NYC

• Utilization in 2013

• 454 users (147 faculty), 2.1M jobs

• NYS HPC2 Consortium

• RPI,UB, Stony Brook, Brookhaven, NYSERNet

• Bringing HPC to NYS Industry and Academia

• NSF TAS Award

• 5-year $7.5M XSEDE-based Award

(16)

http://cdse.buffalo.edu

Figure

Fig. 3. TPC-C primary performance metric trend vs. Moore’s Law  During the years 1996 to 2000, Moore’s Law slightly underestimates NtpmC, while  starting  with  2008  it  slightly  overestimates  NtpmC

References

Related documents

– Students in this program are exposed to a wide range of computa- tional and data science applications, and will learn computational science tools, high-performance computing,

Data Engineers build the tools and platforms to answer questions with data , using software engineering best practices, computer science fundamentals, core database principles,

High Performance Computing: LLGrid + Hadoop Distributed Database/ Distributed File System Distributed Database: Accumulo (triple store). A C E B Array

4 Education Initiatives: Data Science, Informatics, Computational Science and Intelligent Systems?. Engineering;

several aspects of data intensive computing from storage to algorithms, to processing and education.. • Progress in Data Intensive Programming Models • Progress in Academic

acquisition, data visualization, image analysis, alysis, high performance computing, and knowledge- high performance computing, and knowledge- based systems are used in MMM..

The “Computational Analytics” specialization will give students a strong background in mathematics, algorithmic and computational thinking, computer systems, and data analysis,

This paper, presented at the Alliance Chautauqua held at the Albuquerque High Performance Computing Center in Albuquerque, NM, August 1999, summarizes the status of