Big Data Tutorial on Mapping Big Data Applications to Clouds and HPC: Sports Analytics

(1)

BIG DATA APPLICATIONS & ANALYTICS

SPORTS ANALYTICS

1/26/2015

Sports Analytics 1

Geoffrey Fox

January 26, 2015

BigDat 2015: International Winter School on Big Data

Tarragona, Spain, January 26-30, 2015

[email protected]

http://www.infomall.org

School of Informatics and Computing Digital Science Center

(2)

Sports Informatics

Summary

Sports sees significant growth in analytics, with pervasive statistics shifting to more sophisticated measures. We start with baseball, as the game is built around segments dominated by individuals, where detailed (video/image) achievement measures including PITCHf/x and FIELDf/x are moving the field into the big data arena. There are interesting relationships between the economics of sports and big data analytics. We look at wearables and consumer sports/recreation. The importance of spatial visualization is discussed. We look at other sports: soccer, Olympics, NFL football, basketball, tennis and horse racing.

(3)

Sports Analytics

Lesson 1: Introduction

Introduction to all Sports Informatics

Moneyball The 2002-2003 Oakland Athletics

Diamond Dollars economic model of baseball

Performance – Dollar relationship

Value of a Win

SPORTS INFORMATICS I: SABERMETRICS

(BASEBALL)

1/26/2015

(4)

• Understood to be very important

• In ~1990 Quants went to Wall Street; now they are becoming part of sports management

• Applicable to all fields but more prominent in some such as baseball

• Several data sources

• Statistics on actions within game

• Analysis of real-time video

• Signals from special position sensitive tags

• Custom instruments such as those in fitness wearables

• MIT Sloan Sports Analytics conference probably best source of data

http://www.sloansportsconference.com/

• Baseball’s http://sabr.org/ SABR (Society for American Baseball Research) has an interesting conference and rich history

SPORTS ANALYTICS AND INFORMATICS

(5)

• We will see three source types for data

• Very precise numerical data on game results, athlete performances etc.

• Use of sensors that are tracked (position, acceleration, physical condition etc.)

• Video where image processing is used to extract information

• Some results are used to predict results and make decisions on players or game play (e.g. which relief pitcher to use in baseball or which offensive play to use in NFL)

• These analyses use probabilities of course; choose the play that is most likely to succeed

• Other results are used for “spatial visualization” e.g. to map where field goal attempts are successful in NBA or where passing plays are successful in NFL.

SOME BROAD TRENDS
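The "choose the play that is most likely to succeed" idea can be sketched in a few lines. This is only an illustration: the play names and probabilities below are hypothetical, and a real system would estimate them from game data.

```python
# Hypothetical success probabilities for three offensive plays.
# These numbers are made up for illustration; they are not from the slides.
play_success_prob = {
    "run_left": 0.42,
    "short_pass": 0.55,
    "deep_pass": 0.31,
}


def most_likely_play(probabilities):
    """Return the play whose estimated success probability is highest."""
    return max(probabilities, key=probabilities.get)


best_play = most_likely_play(play_success_prob)
print(best_play)
```

In practice the decision would also weigh the payoff of each play, not just its chance of success; this sketch only covers the simple rule stated on the slide.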

(6)

BASEBALL EXAMPLES

(7)

• http://en.wikipedia.org/wiki/Sabermetrics

Sabermetrics is the empirical analysis of baseball, especially baseball statistics that measure in-game activity. The term is derived from the acronym SABR, which stands for the Society for American Baseball Research http://sabr.org/.

• It was coined by Bill James, who is one of its pioneers and is often considered its most prominent advocate and public face.

• Sabermetricians frequently question traditional measures of baseball skill. For instance, they doubt that batting average is as useful as conventional wisdom says it is because team batting average provides a relatively poor predictor for team runs scored.

• Sabermetric reasoning would say that runs win ballgames, and that a good measure of a player's worth is his ability to help his team score more runs than the opposing team.

• Use VORP or Value over replacement player to measure value of individual players

SABERMETRICS

(8)

• Moneyball: The Art of Winning an Unfair Game is a book by Michael Lewis, published in 2003, and 2011 Film starring Brad Pitt

• 2002 Oakland Athletics finished first in the American League West with a record of 103-59

• Billy Beane general manager used sabermetrics and farm system

MONEYBALL

(9)

• This is a book by Vince Gennaro that does not incorporate sophisticated statistics but does define and describe baseball as a business with a nontrivial economic model that has to be considered in analytics, as a player’s value depends on this model.

• Note that in e-commerce and web search analysis, the fact that the real criterion for success was economic, with strategies balancing user happiness against dollars from sales, led to nontrivial problems.

• Different teams have different economic models due to different fan characteristics: size, expectations (of winning).

• Fans include those that go to ballpark plus those that watch on various media outlets that include TV and Internet

• The Yankees have the highest team value and highest fan interest summing all revenue sources

• There is a relationship between winning and attendance and hence winning and revenues

• Appearing in post season makes an important revenue difference and that requires winning!

DIAMOND DOLLARS I

(10)

• This illustrates that revenue per year varies according to team with a factor of 2.8 difference in 2013 between top and bottom

DIAMOND DOLLARS II

(11)

• Interestingly attendance at ball games is increasing each year even with increased media outlets.

• Note YES network (a regional sports network, RSN) has huge value to Yankees

• YES is Baseball, Basketball and Football

• The 90th win is worth more than the 80th as it’s directly relevant to playoffs

• Wins on the margin are worth $5M each today

• New stadiums add value

DIAMOND DOLLARS III

(12)

• A player’s value (measured in dollars to team) depends on

• Performance on the field quantified by WAR

• Team’s win-loss record

• Situation of other teams that could affect value of given player to other teams

• Depth of club at player’s position

• Contract status (free-agency?)

• Marquee value which depends on and builds team brand value

• Team Brands

• New York Yankees (26 world championships in 85 years)

• Chicago Cubs

• Boston Red Sox

• A win is roughly equivalent to 10 runs

DIAMOND DOLLARS IV

(13)

Sports Analytics

Lesson 2: Basic Sabermetrics

Different Types of Baseball Data

Sabermetrics

Overview of all data

Details of some statistics based on basic data

OPS, wOBA, ERA, ERC, FIP, UZR

SPORTS INFORMATICS I: SABERMETRICS

(BASEBALL)

1/26/2015

(14)

• There are classic statistics that tell you important information; examples are Batting Average for batters or Earned Run Average for pitchers

• http://en.wikipedia.org/wiki/Baseball_statistics includes

• 41 Batting (including OPS, wOBA) statistics definitions

• 7 Base running

• 50 Pitching (including ERA, FIP, ERC)

• 12 Fielding

• 3 Overall Value (including WAR)

• 4 General

• As Baseball is a complicated game, these basic statistics may not be best correlated with success and there are choices of/combinations of basic statistics that (are claimed to) correlate better with success

• OPS and WAR are two examples of such combinations

• However these are still calculated from “little data”; basically a measure (few numbers at most) of each pitch and at bat

BIG DATA AND LITTLE DATA

(15)

Little data: basic statistics. Used in most popular discussions of sport

• One can add into basic averages, selections such as size of ball park and performance against say left and right handed pitchers to make specific predictions more precisely. This leads

Little data: Sabermetric statistics like OPS and WAR. This has been used for some time in professional analyses and is the origin of the Oakland Athletics success chronicled in Moneyball.

• Finally one can use the detailed video record of each action in baseball coming from products like PITCHf/x that replace a pitch or hit by a video that can be analyzed with a more sophisticated model and this is Big Data

• The discussion (given earlier) “Diamond Dollars” (by Vince Gennaro) also uses such statistics but includes a sophisticated fiscal model of baseball

• A win is worth more for some teams than others

INCREASINGLY SOPHISTICATED ANALYSES

(16)

Big data: PITCHf/X, HITf/X, FIELDf/X and Commandf/X (catching) etc. is perhaps future of Sabermetrics and described on the web by Vince Gennaro in talks and blogs.

• It makes a more sophisticated model of a player

• It uses video – much larger amounts of data although this is summarized in terms of numbers measuring speed, curve, location etc.

• Also can measure physical status of players and so help fitness and health

• The analysis of this data uses more sophisticated analytics such as recommender engines

• http://www.sportvision.com/baseball

• http://m.mlb.com/news/article/68514514/mlbam-introduces-new-way-to-analyze-every-play is a rival to Sportvision FIELDf/X

LEADING TO BIG DATA

(17)

• Very accurate clean data over a long time interval – over 140 years with clear metadata

• Actions clearly associated with Pitcher, Batter, Fielder although two-way interaction present

• E.g. A given batter will do differently for different styles of pitch

• This contrasts with soccer or basketball where team features much more important although some actions like shooting free throws or 3-pointers are individually focused.

• Enough data that can in detail train models and then test on a different sample of data

FEATURES OF BASEBALL

(18)

• This is a “sabermetric baseball statistic“

OPS = OBP + SLG

OBP = (H + BB + HBP)/(AB + BB + SF + HBP)

SLG = TB/AB

• OBP On-base percentage

• SLG Slugging Average

• H = Hits

• BB = Base on balls

• HBP = Times hit by pitch

• AB = At bats (Plate appearances, not including bases on balls, being hit by pitch, sacrifices, interference, or obstruction)

• SF = Sacrifice flies (Fly balls hit to the outfield which although caught

OPS: ON-BASE PLUS SLUGGING
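The OPS formulas above translate directly into code. The sketch below assumes the standard definition of total bases, TB = 1B + 2×2B + 3×3B + 4×HR, which the slide uses but does not spell out. The season line in the example is hypothetical.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + SF + HBP)."""
    return (h + bb + hbp) / (ab + bb + sf + hbp)


def slugging_average(singles, doubles, triples, home_runs, ab):
    """SLG = TB / AB, with TB = 1B + 2*2B + 3*3B + 4*HR (standard definition)."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / ab


def ops(singles, doubles, triples, home_runs, bb, hbp, ab, sf):
    """OPS = OBP + SLG."""
    hits = singles + doubles + triples + home_runs
    obp = on_base_percentage(hits, bb, hbp, ab, sf)
    slg = slugging_average(singles, doubles, triples, home_runs, ab)
    return obp + slg


# Hypothetical season line (not a real player).
example_ops = ops(singles=90, doubles=30, triples=5, home_runs=25,
                  bb=60, hbp=5, ab=500, sf=5)
print(round(example_ops, 3))
```

Note that OBP and SLG have different denominators, so OPS is a convenient sum of two rates and not itself a rate.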

(19)

• This is a “sabermetric baseball statistic“

• AB (At Bats): Number of trips to the plate in which the batter does not walk, get hit by a pitch, sacrifice (fly or bunt), or reach on interference

• HBP (Hit By Pitches), SF (Sacrifice Flies),

• BB (Walks), IBB (Intentional Walks)

• 1B = Single, 2B = Double, 3B = Triple

• HR = Home run

• wOBA = (0.690×(BB-IBB) + 0.722×HBP + 0.888×1B + 1.271×2B + 1.616×3B + 2.101×HR) / (AB + BB – IBB + SF + HBP) in 2013

• Empirically better than other statistics in measuring contribution to run scoring

• Weights calculated separately for each year

• http://www.fangraphs.com/library/offense/offensive-statistics-list/

wOBA (WEIGHTED ON-BASE AVERAGE)
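As a sketch, the 2013 wOBA formula above can be written with the weights kept in a table, which makes it easy to swap in another year's values. The weights are the ones quoted on the slide; the season line is hypothetical.

```python
# 2013 linear weights as quoted on the slide (they change every year).
WOBA_WEIGHTS_2013 = {
    "ubb": 0.690,  # unintentional walks, BB - IBB
    "hbp": 0.722,
    "1b": 0.888,
    "2b": 1.271,
    "3b": 1.616,
    "hr": 2.101,
}


def woba(bb, ibb, hbp, singles, doubles, triples, home_runs, ab, sf,
         weights=WOBA_WEIGHTS_2013):
    """Weighted on-base average using the formula from the slide."""
    numerator = (weights["ubb"] * (bb - ibb)
                 + weights["hbp"] * hbp
                 + weights["1b"] * singles
                 + weights["2b"] * doubles
                 + weights["3b"] * triples
                 + weights["hr"] * home_runs)
    denominator = ab + bb - ibb + sf + hbp
    return numerator / denominator


# Hypothetical season line (not a real player).
example_woba = woba(bb=60, ibb=5, hbp=5, singles=90, doubles=30,
                    triples=5, home_runs=25, ab=500, sf=5)
print(round(example_woba, 3))
```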

(20)

ERA is the mean number of earned runs given up by a pitcher per nine innings pitched (i.e. the traditional length of a game).

• It is determined by dividing the number of earned runs allowed by the number of innings pitched and multiplying by nine.

• Runs resulting from defensive errors (including pitchers' defensive errors) are recorded as unearned runs and are not used to determine ERA

• ERA is misleading for relief pitchers, because they are charged only for runs scored by batters who reached base while batting against them. They can “blow the save” by letting batters who were already on base when they started score, yet still have a zero ERA

• Pitchers for the Colorado Rockies have historically faced many problems, all damaging to their ERAs. The combination of high altitude (5,280 ft or 1,610 m) and a semi-arid climate in Denver causes fly balls to travel up to 10% farther than at sea level.

• Denver's altitude and low humidity also reduce the ability of pitchers to

ERA EARNED RUN AVERAGE
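The ERA calculation described above is a one-line formula. A minimal sketch follows, with innings passed as a plain number (so a third of an inning would be written as 1/3); the example figures are hypothetical.

```python
def earned_run_average(earned_runs, innings_pitched):
    """ERA = 9 * earned runs / innings pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * earned_runs / innings_pitched


# Hypothetical starter: 60 earned runs over 200 innings.
starter_era = earned_run_average(60, 200)

# Hypothetical reliever who allowed inherited runners to score but was
# charged with no earned runs of his own: ERA stays at zero.
reliever_era = earned_run_average(0, 20)

print(starter_era, reliever_era)
```

The second case illustrates why ERA can flatter relief pitchers: runs scored by inherited runners are charged to the pitcher who put them on base.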

(21)

Sports Analytics

Lesson 3: Wins Above Replacement

Wins above Replacement WAR

Discussion of Calculation

Examples

Comparisons of different methods

Coefficient of Determination

Another Sabermetrics Example

Summary of Sabermetrics

SPORTS INFORMATICS I: SABERMETRICS

(BASEBALL)

1/26/2015

(22)

• http://en.wikipedia.org/wiki/Wins_Above_Replacement

• http://www.fangraphs.com/library/misc/war/

• http://www.baseball-reference.com/about/war_explained.shtml

Wins Above Replacement Player is a sophisticated sabermetric baseball statistic developed to sum up the extent of “a player’s total contributions to their team”

• It has an agreed goal but many implementations compared at http://www.baseball-reference.com/about/war_explained_comparison.shtml

WARP from Baseball Prospectus

rWAR or updated bWAR from Baseball Reference

fWAR from Fangraphs

• 10 runs are roughly equal to a win

• A replacement level player is defined as contributing 20.5 runs fewer than a player of league-average performance over 600 plate appearances, i.e. below average!

• WAR = 2.05 is an average player over 600 plate appearances

WAR(P): WINS ABOVE REPLACEMENT I
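The two numbers on this slide are linked by simple arithmetic: a 20.5 run gap at 10 runs per win gives 2.05 wins. The sketch below makes that explicit. Scaling the replacement gap linearly with plate appearances is an assumption made here for illustration, and the function name is hypothetical; real WAR implementations differ in detail.

```python
RUNS_PER_WIN = 10.0                    # "10 runs are roughly equal to a win"
REPLACEMENT_GAP_RUNS_PER_600_PA = 20.5  # replacement vs league average


def war_from_runs_above_average(runs_above_average, plate_appearances):
    """Convert runs above league average into wins above replacement.

    Assumes the replacement gap scales linearly with playing time.
    """
    replacement_runs = (REPLACEMENT_GAP_RUNS_PER_600_PA
                        * plate_appearances / 600.0)
    return (runs_above_average + replacement_runs) / RUNS_PER_WIN


average_player_war = war_from_runs_above_average(0.0, 600)
replacement_player_war = war_from_runs_above_average(-20.5, 600)
print(average_player_war, replacement_player_war)
```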

(23)

• A team of replacement-level players is expected to have a .294 (originally 0.32) winning percentage, or 47.6 wins in a 162 game season.

• Definition change made in March 2013

• This is 1000 wins extra for average (0.5 win %) teams summed over 2 leagues

• i.e. total WAR for all teams in 2013 is 1000.

• Wikipedia suggests formula Wins = 52.7 + 0.97 fWAR (old definition)

• Cameron found that a team's projected record based on fWAR and that team's actual record have a strong correlation of 0.83

• The comparison is not to average players, who are relatively rare, difficult to obtain and highly paid, whereas replacement level players, by their very definition, are players easy to obtain when a starter goes down.

• These are the players who receive non-roster invites at the start of the year or the players who are 6-year minor league free agents.

WAR(P): WINS ABOVE REPLACEMENT II
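The figures on this slide can be checked with a short calculation. The sketch assumes 30 major league teams across the two leagues and treats the Wikipedia relation quoted above as a simple linear function; the fWAR value passed to it is hypothetical.

```python
GAMES_PER_SEASON = 162
TEAMS = 30                      # both leagues combined (assumed here)
REPLACEMENT_WIN_PCT = 0.294     # new definition from March 2013

# Wins expected from a team made entirely of replacement-level players.
replacement_wins = REPLACEMENT_WIN_PCT * GAMES_PER_SEASON

# An average team wins half its games; the difference is its WAR.
average_wins = 0.5 * GAMES_PER_SEASON
war_per_average_team = average_wins - replacement_wins

# Summed over all teams this is the roughly 1000 wins quoted on the slide.
league_total_war = TEAMS * war_per_average_team


def projected_wins_old_definition(team_fwar):
    """Wins = 52.7 + 0.97 * fWAR (relation quoted for the old definition)."""
    return 52.7 + 0.97 * team_fwar


print(round(replacement_wins, 1), round(league_total_war))
print(projected_wins_old_definition(40.0))  # hypothetical team fWAR
```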

Baseball talent among the population is

(24)

• http://www.baseball-reference.com/about/war_explained_position.shtml

WAR for position players has six components:

• Batting Runs e.g. use wOBA (weighted on-base average)

• Base running Runs (Stolen Bases and Caught Stealing runs, 1st to 3rd on singles, outs on the bases, tagging up on fly balls, scoring from third on a ground ball, etc.)

• Runs added or lost due to Grounding into Double Plays in DP situations

• Fielding Runs (“Defensive Runs Saved”)

• Positional Adjustment Runs e.g. a catcher gets added runs and designated hitters runs removed. Pitchers (when they bat) need special treatment

• Replacement level Runs (based on playing time) as 5 components compared to League average

These are complicated formulae but all involve league averages and complicated but little data statistics

WAR(P): WINS ABOVE REPLACEMENT III

(25)

• http://www.baseball-reference.com/about/war_explained_pitch.shtml

WAR for pitchers based on Runs Allowed (both earned and unearned) and Innings Pitched compared to average pitcher

• This is then adjusted by difference between average and replacement level pitcher

• Fangraphs uses FIP (Fielding Independent Pitching)

• Use an average pitcher corrected for situation current pitcher placed in

• Level of Opposition

• Handling Interleague (Currently AL performs better than NL) and designated hitter difference

• Team defense ability as seen in FIP or Defense-Independent Pitching Stats (DIPS)

• Ball park effects. These are accounted for statistically and with a physics model as in Big data approach

• Relievers versus starters. Relievers have better ERA but only pitch a few innings. Relieving in a close game is worth more than relieving in a noncompetitive game

WAR(P): WINS ABOVE REPLACEMENT IV

(26)

• In 2014 Mike Trout (Angels) had a fWAR of 7.8 (10.5 in 2013)

• In 2014, Corey Kluber (Indians) had 7.3 fWAR while Clayton Kershaw (Dodgers) had one of 7.2 even though

• Kluber was 18 wins, 9 losses and 2.44 ERA

• Kershaw was 21 wins, 3 losses and 1.77 ERA

• In 2014 Dodgers had total fWAR 41.8: divided 27.1 batting and 15.4 Pitching

• Red Sox had highest fWAR of 43.3 in either League: divided 36.5 batting and 13.6 pitching

• Over all time (sum over appearances) Babe Ruth (10616 PA) had best fWAR at 168.4 followed by Barry Bonds (12606 PA) at 164.0

• PA = Plate Appearances

• http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2014&month=0&season1=1871&ind=0

• Note definition valid and statistics available over complete recorded

fWAR EXAMPLES

(27)

• Note that there are modest volumes of data in sabermetrics analyses like WAR but they are determined by the data itself

• All the magic coefficients in FIP and wOBA and other WAR components do not come from “theory” – they come from fitting data.

• In this sense classic sabermetrics illustrates key features of “big data” – the data not a model determines the answer

• Of course one needs a lot of baseball savvy to know what variables to include in formulae – albeit with unknown coefficients

• Little data sabermetrics discussed in EdX course

https://courses.edx.org/courses/BUx/SABR101x/2T2014/courseware/10e616fc7649469ab4457ae18df92b20/ with modules on using SQL and R to calculate

LITTLE DATA SABERMETRICS
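To illustrate what "coefficients come from fitting data" means, here is a minimal least-squares sketch that fits wins against team WAR and reports the coefficient of determination. The four data points are synthetic, chosen only to make the arithmetic easy to follow, and the helper names are hypothetical; real sabermetric fits use many more seasons and variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic (made-up) team WAR totals and season wins.
team_war = [20.0, 30.0, 40.0, 50.0]
team_wins = [72.0, 83.0, 90.0, 103.0]

intercept, slope = fit_line(team_war, team_wins)
fit_quality = r_squared(team_war, team_wins, intercept, slope)
print(intercept, slope, round(fit_quality, 3))
```

The fitted intercept and slope play the role of the "magic coefficients": nothing in the code says what they should be, the data decides.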

(28)

Lesson: Pitching Clustering and Video in Baseball

A Big Data Pitcher Clustering method introduced by Vince Gennaro

Data from Blog and video at 2013 SABR conference

(29)

• Vince Gennaro’s Blog http://vincegennaro.mlblogs.com/

• First decide on key properties of pitchers that are

a) Important to batter’s performance

b) Available from PITCHf/X

• These 12 properties are in right hand column

CLUSTERING PITCHERS I

(30)

• Below is a visual mapping of pitcher clusters.

• Each node represents a pitcher and each line between pitchers represents a “connection” or a similarity, based on a defined minimum threshold level.

• This graph includes only LHPs and it clusters them against only right-handed hitters.

CLUSTERING PITCHERS II
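The node-and-line picture described above can be sketched as a thresholded similarity graph. This is not Gennaro's actual method: the pitcher names, the three features (the slides mention 12 PITCHf/X properties), the similarity function and the use of connected components as clusters are all simplifying assumptions for illustration.

```python
from math import sqrt

# Hypothetical pitchers with made-up, already-normalized feature vectors.
pitchers = {
    "pitcher_a": (0.90, 0.20, 0.50),
    "pitcher_b": (0.85, 0.25, 0.55),
    "pitcher_c": (0.10, 0.80, 0.30),
    "pitcher_d": (0.15, 0.75, 0.35),
    "pitcher_e": (0.50, 0.50, 0.95),
}


def similarity(u, v):
    """Map Euclidean distance to a similarity in (0, 1]."""
    distance = sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + distance)


def build_graph(features, threshold):
    """Connect two pitchers when their similarity reaches the threshold."""
    names = sorted(features)
    edges = {name: set() for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(features[first], features[second]) >= threshold:
                edges[first].add(second)
                edges[second].add(first)
    return edges


def connected_components(edges):
    """Treat each connected group of nodes as one cluster."""
    seen = set()
    clusters = []
    for start in sorted(edges):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


clusters = connected_components(build_graph(pitchers, threshold=0.8))
print(clusters)
```

Raising the threshold removes lines from the graph and splits clusters; lowering it merges them, which is the role of the "defined minimum threshold level" on the slide.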

(31)

• This summarizes the traditional one on one approach (given batter record versus given pitcher) compared to clustering method

CLUSTERING PITCHERS III

Generalize to hitter clusters

(32)

• Return on Investment

• Replace 30th percentile guy by 70th percentile for 81 days – 19 runs

• Optimize pinch hitter 100 times – 9 runs

• Choose optimal relief pitcher 50 times – 5 runs

• 33 runs is 3 wins

• 1 win is worth $5M for a competitive team

ROI OF OPTIMIZING MATCH UPS
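The return-on-investment arithmetic on this slide can be written out as follows, using the run gains, the 10 runs per win rule of thumb and the $5M per marginal win quoted in the slides. The dictionary keys are descriptive labels chosen here.

```python
RUNS_PER_WIN = 10          # rule of thumb from the slides
DOLLARS_PER_WIN = 5_000_000  # value of a marginal win for a competitive team

# Run gains quoted on the slide for each kind of match-up optimization.
run_gains = {
    "lineup_upgrade_81_days": 19,
    "pinch_hitter_100_times": 9,
    "relief_pitcher_50_times": 5,
}

total_runs = sum(run_gains.values())
wins_gained = total_runs / RUNS_PER_WIN
dollar_value = wins_gained * DOLLARS_PER_WIN

print(total_runs, wins_gained, dollar_value)
```

The slide rounds 3.3 wins down to 3, so the value is roughly $15M to $16.5M per season depending on rounding.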

(33)

• http://baseball.physics.illinois.edu/FieldFX-TDR-GregR.pdf

• http://www.sportvision.com/baseball/fieldfx

• The FIELDf/x® service uses Sportvision's baseball technology to digitally record the position of all players and hit balls in real time.

• Left illustrates type of material available

• Right is result of 95 foot run to catch ball

FIELDf/X FOR FIELDERS I

(34)

• http://grantland.com/the-triangle/mlb-advanced-media-play-tracking-bob-bowman-interview/

START OF CATCH

(35)

• https://www.youtube.com/watch?v=YkjtnuNmK74

• Talk from Sabermetrics expert at Tufts

HITf/X AND RESULT OF HIT

(36)

Lesson: Mainly Pretty Pictures …….

Spatial Visualization

(37)

• http://www.slideshare.net/Tricon_Infotech/big-data-for-big-sports

• Players analyzed in real time

• Speed

• Heart rate

• Hydration

• Breathing

• Fatigue

• Pain

• Coaches relate to technique

• Enhance fantasy play

• Fans engage through social media, real-time enhanced data

• Teams get locations contacts for fans

• Implies Injury reduction, and benefits for marketing and betting

GENERAL COMMENTS II

(38)

• http://www.slideshare.net/BrandEmotivity/sports-analytics-innovation-summit-data-powered-storytelling

CONSUMER DEVICES

Accelerometer

(39)

• http://www.slideshare.net/Tricon_Infotech/big-data-for-big-sports

SOCCER VISUALIZATION

(40)

• http://www.sloansportsconference.com/wp-content/uploads/2014/06/Automated_Playbook_Generation.pdf

• http://autoscout.adsc.illinois.edu/publications/football-trajectory-dataset/

• Computer vision and Machine learning to classify plays and then predict next one; players recognized from video

AMERICAN FOOTBALL

(41)

NFL AMERICAN FOOTBALL

• http://www.slideshare.net/elew/sport-analytics-innovation

(42)

• http://www.slideshare.net/elew/sport-analytics-innovation

NBA SHOOTING LOCATION/SUCCESS I

Jose Calderon 2012-2013

(43)

• http://www.slideshare.net/elew/sport-analytics-innovation

(44)

http://www.sloansportsconference.com/wp-content/uploads/2012/02/Goldsberry_Sloan_Submission.pdf

COMPARING 4 NBA PLAYERS

(45)

TENNIS

(46)

• http://www.trakus.com/technology.asp#tNetText

• More accurate and immediate than GPS or other positioning techniques, the Trakus system uses proprietary wireless communications to track tags fitted into each horse’s saddlecloth during live racing.

HORSE RACING

The durable, lightweight tag weighs 2.8 ounces (86 g) and it has the profile and size of a credit card or PCMCIA
