
(1)

Karl Rexer, PhD

President

Rexer Analytics

www.RexerAnalytics.com

2010 Data Miner Survey Highlights

… The Views of 735 Data Miners

Predictive Analytics World

Washington, DC

(2)

2010 Data Miner Survey: Overview

• Fourth annual survey

• 47 questions

• 10,000+ invitations emailed, plus newsgroups, vendors, and snowball referrals

• Respondents: 735 data miners from 60 countries

[Pie chart: respondents by type (Corporate, Consultants, Academics, NGO / Gov’t, Vendors); slice values 33%, 31%, 12%, 5%, 19%]

Note: Data from tool vendors was excluded from many analyses.

North America (45%)

•  USA 40%

•  Canada 4%

Europe (36%)

•  Germany 7%

•  UK 5%

•  France 4%

•  Poland 4%

Asia Pacific (12%)

•  India 4%

•  Australia 3%

•  China 2%

Central & South America (4%)

•  Colombia 2%

•  Brazil 1%

Middle East & Africa (3%)

•  Israel 1%

•  Turkey 1%

(3)

Fields Applying Data Mining

Question: In what fields do you TYPICALLY apply data mining? (Select all that apply)

Percent of respondents selecting each field:
Government 10%
Internet-based 10%
Manufacturing 10%
Medical 11%
Technology 13%
Pharmaceutical 13%
Retail 14%
Telecommunications 15%
Insurance 15%
Academic 25%
Financial 29%
CRM / Marketing 41%

• CRM / Marketing, Financial and Academic are the most commonly reported fields. This has been consistent since the 2007 survey.

(4)

Data Mining Algorithms

Question: What algorithms/analytic methods do you TYPICALLY use? (Select all that apply)

Percent of respondents selecting each method:
MARS 8%
Uplift Modeling 9%
Link Analysis 9%
Genetic Algorithms 11%
Social Network Analysis 12%
Rule Induction 13%
Survival Analysis 14%
Anomaly Detection 16%
Bayesian 21%
Support Vector Machines 21%
Ensemble Models 22%
Association Rules 25%
Text Mining 26%
Factor Analysis 27%
Neural Nets 31%
Time Series 32%
Cluster Analysis 60%
Regression 68%
Decision Trees 69%

• Decision trees, regression, and cluster analysis continue to form a triad of core algorithms for most data miners. This is very consistent, year to year.

• However, a wide variety of algorithms are being used.

[Charts: Ensemble Models and Uplift Modeling by respondent type (Corporate, Consultants, Academic, NGO / Gov’t); extracted values 10%, 12%, 4%, 5%]

(5)

Text Mining

• About a third of data miners currently incorporate text mining into their analyses, and another third plan to.

[Pie chart: Text Miners, Plan to Start Text Mining, No Plans to Conduct Text Mining; slice values 30%, 34%, 36%]

[Bar chart, axis 0% to 60%; bar values 55%, 59%, 21%; categories:]

– The focus of our text mining is to extract key themes (sentiment analysis)

– We use text fields as inputs / predictors in a larger model

– We use text mining as part of social network analyses

Software Used
STATISTICA Text Miner 19%
IBM SPSS Modeler 17%
SAS Text Miner 9%
IBM SPSS Text Analytics 7%
Rapid Miner 6%
Provalis Wordstat 2%
GATE 2%
KXEN 2%
Oracle Text or ODM 1%
Megaputer Text Analyst 1%
Autonomy 1%
Other 35%

(6)

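Computing Environments

Question: What are the computing environments/platforms on which data mining/analytics occurs at your company/organization? (Check all that apply)

• A lot of data mining happens on desktop and laptop computers.

• Frequently the data and processing are local (not on servers, mainframe or cloud).

• Only a small minority of data mining is on the cloud.

Percent of respondents reporting each environment, overall and by respondent type:

– Cloud Computing: overall 7%; Corporate 5%, Consultant 10%, Academic 7%, NGO / Gov’t 3%, Vendor 14%

– Centralized Mainframe/Server: overall 18%; Corporate 20%, Consultant 16%, Academic 14%, NGO / Gov’t 32%, Vendor 26%

– Local Server: overall 26%; Corporate 28%, Consultant 30%, Academic 19%, NGO / Gov’t 29%, Vendor 45%

– Desktop PC/Workstation (with data & processing on server, mainframe or cloud): overall 39%; Corporate 48%, Consultant 36%, Academic 25%, NGO / Gov’t 47%, Vendor 39%

– Desktop PC/Workstation (with data & processing locally): overall 49%; Corporate 43%, Consultant 49%, Academic 58%, NGO / Gov’t 58%, Vendor 35%

– Laptop PC (with data & processing on server, mainframe or cloud): overall 24%; Corporate 29%, Consultant 24%, Academic 15%, NGO / Gov’t 32%, Vendor 37%

– Laptop PC (with data & processing locally): overall 35%; Corporate 28%, Consultant 36%, Academic 46%, NGO / Gov’t 42%, Vendor 44%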

(7)

Analytic Capability & Data Quality

Analytic Capability Question: How do you rate the analytic capabilities of your company/organization?

• Analytic capability:

– There’s room to improve if we’re going to “Compete on Analytics”.

[Chart: analytic capability ratings; values 13%, 35%, 30%, 20%]

Data Quality Question: How do you rate the quality of data available for analysis at your company/organization?

• Data quality:

– 48% rate it “strong” or “very strong” (same as last year)

– 16% rate it “poor” or “very poor” (13% last year)

[Chart: data quality ratings; values 8%, 40%, 35%, 13%]

(8)

Overcoming Challenges: Best Practices

• Top challenges facing data miners:

– Dirty data: #1 challenge every year, 2007-2010

– Explaining data mining to others: always in the top 4 challenges, 2007-2010

– Difficult access to data: always in the top 3 challenges, 2007-2010

• This year, survey respondents provided “Best Practices” for overcoming these challenges.

– E.g., Dirty Data: Use anomaly detection to flag records to put before subject matter experts.

– E.g., Dirty Data: All projects begin with low-level data reports showing counts of records, verification of keys (uniqueness, widows/orphans), and distributions of field contents. These reports are echoed back to the data content experts.

– See the list of Best Practices at www.RexerAnalytics.com in early November.
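The two dirty-data practices quoted above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the respondents' actual procedure: the table layout, field names, and helper names (`profile_table`, `orphans`, `widows`, `flag_anomalies`) are hypothetical, "widows" is read here as parent records with no children, and the anomaly check uses a simple median-based (modified z-score) rule as a stand-in for whatever anomaly detection method a team prefers.

```python
from collections import Counter
from statistics import median


def profile_table(rows, key):
    """Low-level report: record count, duplicate keys, field distributions."""
    key_counts = Counter(row[key] for row in rows)
    fields = sorted({name for row in rows for name in row})
    return {
        "record_count": len(rows),
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        # Distribution of each field's contents (missing fields count as None).
        "distributions": {f: Counter(row.get(f) for row in rows) for f in fields},
    }


def orphans(child_rows, foreign_key, parent_rows, parent_key):
    """Child records whose foreign key matches no parent record."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


def widows(parent_rows, parent_key, child_rows, foreign_key):
    """Parent records with no child records (assumed meaning of "widows")."""
    referenced = {row[foreign_key] for row in child_rows}
    return [row for row in parent_rows if row[parent_key] not in referenced]


def flag_anomalies(rows, field, threshold=3.5):
    """Flag records far from the median, using a modified z-score (assumed rule)."""
    values = [row[field] for row in rows]
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against; nothing is flagged
    return [row for row in rows if 0.6745 * abs(row[field] - med) / mad > threshold]


# Hypothetical sample data, invented for illustration only.
customers = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": "West"},
    {"customer_id": 2, "region": "West"},   # duplicate key
    {"customer_id": 3, "region": None},     # missing region, no orders
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 22.0},
    {"order_id": 12, "customer_id": 2, "amount": 19.0},
    {"order_id": 13, "customer_id": 2, "amount": 21.0},
    {"order_id": 14, "customer_id": 9, "amount": 23.0},   # orphan
    {"order_id": 15, "customer_id": 1, "amount": 950.0},  # unusual amount
]

report = profile_table(customers, "customer_id")
orphan_orders = orphans(orders, "customer_id", customers, "customer_id")
widow_customers = widows(customers, "customer_id", orders, "customer_id")
flagged = flag_anomalies(orders, "amount")

print(report["record_count"], report["duplicate_keys"])
print(orphan_orders, widow_customers, flagged)
```

The report and the flagged records are plain dictionaries and lists, so they can be written to a spreadsheet or document and echoed back to the data content experts for review.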

(9)

Data Mining Software

Survey Questions:

• What data mining/analytic tools did you use in 2009? (rate each as “never”, “occasionally”, or “frequently”)

• What one data mining software package do you use most frequently?

[Chart panels: Overall, Corporate, Consultants, Academics, NGO / Gov’t]

• The average data miner reports using 4.6 software tools.

• R is used by the most data miners (43%).

(10)

Satisfaction with Data Mining Tools

Question: Please rate your overall satisfaction with your primary Data Mining software package.

[Chart legend: 2010, 2009, Sample size < 20]

• STATISTICA received the highest satisfaction ratings. Consistent with the 2009 findings, R and SPSS Modeler users are also quite satisfied.

– About 80% of STATISTICA and R users also report that they are extremely likely to stay with these primary tools over the next 3 years. This is reported by only 42-45% of SAS, SPSS Statistics, and SAS-EM users, and only 18% of Weka users.

Continued Use question (not graphed): What is the likelihood that you will continue to use this tool as your primary Data Mining software package over the next 3 years?

(11)

Data Mining and the Economy

Question: How will the number of data mining projects your organization conducts in 2010 compare to what has been typical in the past few years?

There is a strong market for data mining:

• 73% of data miners foresee increases in the number of data mining projects.

• Offshoring of data mining is also increasing: it is reported by 14% of data miners this year (8% last year).

Offshoring Question (not graphed): Has your company moved any data mining or other analytics to another country to take advantage of lower wages in the destination country?

(12)

Future Trends in Data Mining

“What do you envision as the primary future trends in data mining?” (open-ended survey question)

Number of respondents:
Growth in Data Mining Adoption 50
Text Mining 32
Social Network Analysis 32
Automation 26
Cloud Computing 15
Data Visualization 15
Tools Get Easier to Use 12
Scaling to Bigger Data 11

(13)

How to Get More Information

• Questions? – Talk with me at PAW

– Call or email me if you don’t see me in the hallways

• Copy of these slides – Available now

• 2010 Data Miner Survey Summary Report (Free)

– Available in early November

– Available at PAW website or email me

• Best Practices for overcoming data mining challenges

– Available in early November at www.RexerAnalytics.com

Karl Rexer, PhD

[email protected]

www.RexerAnalytics.com

617-233-8185
