Bayesian Count Data Modeling for Finding Technological Sustainability

Article

Sunghae Jun 1,*

1 Department of Big Data and Statistics, Cheongju University, Chungbuk 28503, Korea

* Correspondence: [email protected]; Tel.: +82-10-7745-5677

Abstract: Technology development changes society, and society in turn demands new and innovative technology. We analyze technology to understand both society and technology itself. Much research on technology analysis has been introduced in various fields, and most of it is based on patent analysis, because detailed and accurate results of research and development are patented. In this paper, we study a new patent analysis method based on a count data model and Bayesian regression analysis. Using the count data model, we analyze the technological keywords extracted from the collected patent documents. We use the posterior distribution of Bayesian statistics to reflect the experience and knowledge of the relevant technological experts in the analysis model. Moreover, we apply the proposed model to finding sustainable technologies. Finding and developing sustainable technologies is an important activity for companies and research institutes that want to maintain their technological competitiveness. To illustrate how our modeling can be applied to a real domain, we carry out a case study using patent documents related to artificial intelligence.

Keywords: count data; Bayesian regression; technological sustainability; Poisson probability distribution; patent analysis

1. Introduction

Sustainable technology keeps a company technologically competitive [1-2]. Most companies try to find sustainable areas for technological innovation and new product development, so sustainable technology is an important issue in the management of technology (MOT) [3]. Many academics, research institutes, and companies have studied sustainable technologies. Recently, Kim et al. (2018) published a statistical method for sustainable technology analysis [4]. They combined Bayesian inference and social network analysis and applied the method to the technology domain of artificial intelligence (AI). As input data they used the IPC (International Patent Classification) codes extracted from patent documents. The IPC is a hierarchical system for classifying patents by technology [5]. For example, the IPC code G06F represents electric digital data processing technology [6]. In general, each IPC code covers a wide range of technologies, so it is difficult to grasp the detailed technological structure of a specific technology field from IPC codes alone.

To overcome this problem, we propose a technology analysis method using patent keywords. The keywords are extracted from patent documents related to a specific technology by text mining techniques [7]. A technology keyword can therefore describe a specific technology field in more detail than an IPC code. In addition, we propose statistical modeling based on Bayesian count data analysis for understanding the sustainability of a given technology domain. The count of an event is the number of times the event occurs [8]. In this paper, each occurrence of a patent keyword is an event, and we analyze the count data of patent keywords. We use the Poisson probability distribution for the proposed model because the counts of patent keywords are nonnegative integers [9]. We combine the Poisson count model with Bayesian regression analysis to build Bayesian count data modeling for finding technological sustainability. The prior probability distribution reflects the experience and knowledge of the relevant technology experts, and the likelihood function represents the collected patent data. The posterior distribution is proportional to the product of the prior distribution and the likelihood function. Finally, these Bayesian probability distributions are applied to count data regression to obtain the Bayesian count data model.

To show the validity of our modeling, we perform a case study using patent documents related to AI. The remainder of this paper is organized as follows. Section 2 reviews the research background of our study. Section 3 explains the proposed modeling for finding technological sustainability. Section 4 illustrates the results of our case study. Section 5 concludes our research and describes future work.

2. Sustainable Technology and Patent Analysis

In this paper, sustainable technology means a technology that can sustain a company's technological competitiveness, so it is important for a company to know what its sustainable technologies are. In the MOT field, many companies and institutes have tried to find their sustainable technologies, and many research results on sustainable technology analysis have been published [1-2,4,10-11]. These analyses have used diverse analytical methods in various technological fields. However, research in this field is still lacking, and companies and research institutions demand more sophisticated and feasible methodologies for sustainable technology analysis.

Most methods for sustainable technology analysis rely on patent analysis, because patents contain accurate and extensive results of research and development. Inventors disclose these results in detail in exchange for the exclusive right to use the technology. Thus, a great deal of research has been done on patent analysis [12-16]. For statistical analysis, the collected patent documents must be transformed into structured data consisting of keywords, IPC codes, citations, and so on. For this preprocessing, we use the R language and its ‘tm’ package [7,17]. Using the structured patent data, we perform the proposed modeling for finding technological sustainability.

3. Finding Technological Sustainability using Bayesian Count Data Modeling

Bayesian modeling is increasingly used in diverse areas of data analysis such as regression and classification. It is one of the two main approaches to statistics, the other being the frequentist approach. We start the Bayesian count data modeling from the following expression [18].

$$P(\theta \mid y) = \frac{P(y \mid \theta)\,P(\theta)}{P(y)} \tag{1}$$

where $\theta$ is the model parameter and $y$ is the response variable to be predicted. $P(\theta)$ and $P(\theta \mid y)$ are the prior and posterior probabilities of the parameter $\theta$, respectively, and $P(y \mid \theta)$ is the likelihood function of $y$ given $\theta$. $P(y)$ is calculated by the following integration [19].

$$P(y) = \int P(y \mid \theta)\,P(\theta)\,d\theta \tag{2}$$
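Equations (1) and (2) can be checked numerically on a grid. The following is only a minimal sketch under stated assumptions: the likelihood is taken to be Poisson with $\theta$ as its rate, the prior is flat on (0, 20), and the single observation y = 3 is hypothetical. None of these choices come from the case study.

```python
import math

def poisson_likelihood(y, theta):
    """P(y | theta) for one Poisson count with rate theta (assumed likelihood)."""
    return math.exp(-theta) * theta ** y / math.factorial(y)

def grid_posterior(y, grid, prior):
    """Approximate P(theta | y) on an evenly spaced grid using equations (1)-(2)."""
    step = grid[1] - grid[0]
    unnormalized = [poisson_likelihood(y, t) * p for t, p in zip(grid, prior)]
    evidence = sum(unnormalized) * step      # Riemann sum for P(y) in equation (2)
    posterior = [u / evidence for u in unnormalized]
    return posterior, evidence

# Midpoints of 2000 cells covering (0, 20); flat prior density 1/20 (assumption).
grid = [0.005 + 0.01 * k for k in range(2000)]
prior = [1.0 / 20.0] * len(grid)

posterior, evidence = grid_posterior(3, grid, prior)   # y = 3 is hypothetical
step = grid[1] - grid[0]
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(round(evidence, 4), round(posterior_mean, 3))
```

With a flat prior the posterior is shaped by the likelihood alone, which is why the result stays close to the maximum likelihood estimate.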

Using Bayesian modeling, we determine the posterior distribution of the model parameter and compute a credible interval for the true parameter. We are interested in the mean of the parameter within this credible interval. The $100(1-\alpha)\%$ credible interval of $\theta$ is defined as follows [19].

$$P(\theta \in C) = 1 - \alpha \tag{3}$$

where $C$ is an interval that depends on $y$. In addition, the credible interval of Bayesian modeling is [...] modeling. Either non-informative or informative priors can be used for the prior distribution in Bayesian modeling. In general, the result obtained with a non-informative prior is close to the maximum likelihood estimate (MLE) of frequentist statistics, because the most popular non-informative prior is the uniform distribution [9,19]. On the other hand, we should use an informative prior to get an improved result for the parameter estimation. However, using an informative prior generally requires Bayesian computing such as Markov chain Monte Carlo (MCMC) [19]. To alleviate the computational burden, we can use a conjugate prior. In Bayesian count data modeling, the gamma and beta distributions are conjugate to the Poisson and binomial distributions, respectively. In our research, we consider Bayesian count data modeling with the Poisson distribution for finding technological sustainability. The Poisson probability distribution is the most popular model for count data. If the random variable $Y$ follows a Poisson distribution with parameter $\lambda$, written $Y \sim \text{Poisson}(\lambda)$, its distribution is defined as follows [9].

$$P(Y = y \mid \lambda) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \quad y = 0, 1, 2, \ldots \tag{4}$$

Here, the expectation $E(Y)$ and the variance $Var(Y)$ are both equal to the parameter $\lambda$. In this paper, the frequency of each patent keyword extracted from the patent document data is a Poisson random variable with its own parameter, as follows.

$$Y_i \sim \text{Poisson}(\lambda_i), \quad i = 1, 2, \ldots, m \tag{5}$$

where $m$ is the number of keywords. We extract the patent keywords from the collected patent documents using the text mining process shown in Figure 1.

Figure 1. Text mining process for creating structured data for Bayesian count data modeling.

In our text mining process, we first determine the target technology for statistical analysis. Next, we use a search expression to collect the patent documents related to the target technology from patent databases around the world. The retrieved patent documents cannot be analyzed directly, because raw text is not suitable as input for statistical analysis such as Bayesian count data modeling. So, we build structured data for statistical analysis. The first step is to create a corpus, a collection of text documents prepared for processing. Based on the corpus, we transform the patent documents into plain text and clean the text by eliminating whitespace, removing stop-words (“and”, “for”, “in”, “is”, etc.), stemming, and filtering. [...] modeling. The matrix consists of patents (rows) and keywords (columns), and its elements are the frequencies (counts) of the keywords occurring in each patent document.
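The patent-by-keyword count matrix described above can be sketched in a few lines. The paper builds it with the R ‘tm’ package; the plain-Python version below is only an illustrative stand-in. The stop-word list, the three patent texts, and the keyword list are hypothetical, and stemming is omitted for brevity.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"and", "for", "in", "is", "a", "the", "of", "to"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_term_matrix(documents, keywords):
    """Rows are patents, columns are keywords, cells are occurrence counts."""
    matrix = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        matrix.append([counts[k] for k in keywords])
    return matrix

# Hypothetical patent abstracts and keyword list.
patents = [
    "A learning network for image recognition and image analysis",
    "Speech interface for natural language dialogue",
    "Pattern recognition in video data",
]
keywords = ["image", "recognition", "speech", "data"]

dtm = document_term_matrix(patents, keywords)
for row in dtm:
    print(row)
```

Each column of such a matrix is the count sample of one keyword, which is the input to the Poisson models that follow.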

In our modeling, we denote the frequency (count) of the $i$th keyword by $Y_i$ and represent the data set as follows [9].

$$Y_i \sim \text{Poisson}(\lambda_i), \quad E(Y_i) = Var(Y_i) = \lambda_i, \quad i = 1, 2, \ldots, m \tag{6}$$
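As a small illustration of (4) and (6), the sketch below evaluates the Poisson probability mass function, checks numerically that its mean and variance both equal $\lambda$, and computes the maximum likelihood estimate of $\lambda$ for one keyword, which is the sample mean of its counts. The count vector is hypothetical and the function names are illustrative.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y | lambda) from equation (4)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def poisson_mle(counts):
    """The MLE of lambda is the sample mean of the observed counts."""
    return sum(counts) / len(counts)

def log_likelihood(counts, lam):
    """Sum of log pmf values over all observations."""
    return sum(math.log(poisson_pmf(y, lam)) for y in counts)

# Hypothetical counts of one keyword in eight patents.
counts = [0, 2, 1, 0, 3, 0, 1, 1]
lam_hat = poisson_mle(counts)

# Mean and variance of the fitted distribution, truncated at y = 60.
support = range(61)
mean = sum(y * poisson_pmf(y, lam_hat) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, lam_hat) for y in support)
print(lam_hat, round(mean, 6), round(variance, 6))
```

The log-likelihood is lower at any rate other than the sample mean, which is what makes the mean the MLE.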

Using the data set in (6), we fit a generalized linear model (GLM) with the Poisson probability distribution, no predictors, and the log link function as follows.

$$\log(\lambda_i) = \beta_i \quad \text{or} \quad \lambda_i = \exp(\beta_i), \quad i = 1, 2, \ldots, m \tag{7}$$

We determine the response and predictor variables (keywords) from the results of the GLM. In our study, we use keywords with large coefficient values as response variables and those with small coefficient values as explanatory variables.

Figure 2. Determination of response variable by GLM results.
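The intercept-only Poisson GLM in (7) has a closed-form solution: the fitted coefficient is the log of the sample mean. The sketch below fits it by Newton-Raphson anyway to show the mechanics, then ranks keywords by coefficient so that the largest one can be picked as the response. The keyword counts are hypothetical, the function name is illustrative, and the code assumes every keyword occurs at least once.

```python
import math

def fit_intercept_only_poisson(counts, max_iter=100, tol=1e-12):
    """Newton-Raphson for log(lambda) = beta with no predictors."""
    n, total = len(counts), sum(counts)
    beta = 0.0
    for _ in range(max_iter):
        mu = math.exp(beta)
        step = (total - n * mu) / (n * mu)   # score divided by Fisher information
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical keyword counts over six patents.
keyword_counts = {
    "object": [2, 1, 0, 3, 1, 1],
    "data": [1, 0, 1, 2, 0, 1],
    "mind": [0, 0, 1, 0, 0, 0],
}

coefficients = {k: fit_intercept_only_poisson(v) for k, v in keyword_counts.items()}
ranked = sorted(coefficients, key=coefficients.get, reverse=True)
print(ranked)   # keyword with the largest coefficient comes first
```

Because the log link is monotone, ranking by coefficient is the same as ranking by mean count.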

In this paper, we denote the response variable by $Y$ and the explanatory variables by $X_1, X_2, \ldots, X_p$, where $p$ is the number of explanatory variables. For Bayesian count data modeling, we build the Poisson regression model with a gamma distribution as the prior. The Poisson model with parameter $\lambda$ is defined as follows [20].

$$f(Y \mid \lambda) = \exp\left(-n\lambda + n\bar{y}\log(\lambda) - \sum_{i=1}^{n}\log(y_i!)\right) \tag{8}$$

where $n$ is the number of collected patent documents and $\bar{y}$ is the sample mean of the observed counts $y_1, \ldots, y_n$. An informative gamma prior for $\lambda$ is given as follows.

$$P(\lambda) = \frac{b^{a}}{\Gamma(a)}\,\lambda^{a-1}e^{-b\lambda} \tag{9}$$

where $\Gamma(\cdot)$ is the gamma function, and $E(\lambda)$ and $Var(\lambda)$ are $a/b$ and $a/b^{2}$, respectively. Expression (8) serves as the likelihood and expression (9) as the prior in the Bayesian count data modeling. Multiplying the likelihood by the prior gives the posterior distribution, up to the normalizing constant, as follows.

$$P(\lambda \mid Y) \propto \exp\left(-n\lambda + n\bar{y}\log\lambda - \sum_{i=1}^{n}\log y_i!\right) \times \frac{b^{a}}{\Gamma(a)}\,\lambda^{a-1}e^{-b\lambda} \tag{10}$$

We can ignore the terms not involving $\lambda$, which yields the following proportional form of the posterior distribution.

$$P(\lambda \mid Y) \propto \exp\left(-(n+b)\lambda + (n\bar{y} + a - 1)\log\lambda\right) \tag{11}$$

Expression (11) is the kernel of the gamma distribution with shape $n\bar{y} + a$ and rate $n + b$. By the properties of the gamma distribution, the posterior mean and variance of $\lambda$ are $\frac{n\bar{y}+a}{n+b}$ and $\frac{n\bar{y}+a}{(n+b)^{2}}$, respectively.

In the Bayesian Poisson regression case, $y_i \mid x_i$ follows a Poisson distribution with mean $\mu_i = \exp(x_i'\beta)$, where $\beta$ is the parameter vector of the Poisson regression. In our research, $(Y \mid x)$ corresponds to (response keyword | explanatory keywords). Therefore, we obtain the Bayesian count data modeling as follows.

$$y_i \mid x_i \sim \text{Poisson}(\mu_i) \tag{12}$$

$$y_i = \exp(x_i'\beta) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^{2}) \tag{13}$$

$$P(\beta): \text{Gamma prior density of } \beta \tag{14}$$
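The conjugate gamma-Poisson update behind (8) to (11) needs no sampling for the posterior itself: a gamma prior with shape a and rate b becomes a gamma posterior with shape a plus the total count and rate b plus the number of documents. The sketch below applies this update and then estimates a 95% credible interval, in the sense of (3), by simulating from the posterior. The counts, the prior values a = 2 and b = 4, and the function names are hypothetical choices for illustration only.

```python
import random

def gamma_poisson_update(counts, a, b):
    """Posterior Gamma(shape, rate) for a Poisson rate under a Gamma(a, b) prior."""
    n = len(counts)
    shape = sum(counts) + a     # n * ybar + a
    rate = n + b
    return shape, rate

def credible_interval(shape, rate, level=0.95, draws=20000, seed=1):
    """Equal-tailed credible interval estimated from simulated posterior draws."""
    rng = random.Random(seed)
    # random.gammavariate takes a scale parameter, so pass 1 / rate.
    samples = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(draws))
    lower = samples[int(draws * (1.0 - level) / 2.0)]
    upper = samples[int(draws * (1.0 + level) / 2.0) - 1]
    return lower, upper

counts = [0, 2, 1, 0, 3, 0, 1, 1]   # hypothetical keyword counts, MLE = 1.0
a, b = 2.0, 4.0                     # hypothetical expert prior, mean a / b = 0.5

shape, rate = gamma_poisson_update(counts, a, b)
posterior_mean = shape / rate
posterior_variance = shape / rate ** 2
lower, upper = credible_interval(shape, rate)
print(shape, rate, round(posterior_mean, 4), round(posterior_variance, 4))
```

The posterior mean lies between the prior mean and the MLE, which is how the expert prior and the patent data are combined.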

Using the model in (12)-(14), we build a technology structure to understand the target technology from the viewpoint of sustainability, as shown in Figure 3.

Figure 3. Technology structure of Bayesian count data modeling.

In our modeling, all keywords except the response are used as explanatory keywords. In Figure 3, the keyword at the beginning of an arrow is an explanatory variable, and the keyword at the end of the arrow is the response variable. Each keyword follows a Poisson distribution with parameter $\lambda$. From the final result of the Bayesian count data modeling, we obtain the regression coefficients $\beta$ between the response and explanatory variables (keywords). Using $\beta$, we build a technological structure of the target domain for sustainable technology management. The Bayesian count data modeling is therefore based on the concept shown in Figure 4.

Figure 4. Concept of Bayesian count data modeling.

In this paper, we combine the subjective knowledge of experts with the objective results of patent data analysis. That is, the prior represents the domain knowledge of experts, and the likelihood represents the objective data from patent documents. The product of the prior and the likelihood is proportional to the posterior, which we use as the predictive model for finding technological sustainability.

With this approach, we expect improved performance of patent technology analysis for sustainable technology. The next section illustrates how this research can be applied to a practical problem through a case study.
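The regression step of this section can be sketched for one response keyword and one explanatory keyword. The code below is a simplified stand-in, not the procedure used in the case study. It finds the posterior mode (MAP estimate) of a Poisson regression by Newton steps instead of running MCMC, and it assumes independent zero-mean normal priors on the coefficients, whereas (14) states a gamma prior. The data, the prior standard deviations, and all function names are hypothetical.

```python
import math

def log_posterior(beta, xs, ys, prior_sd):
    """Poisson log-likelihood (dropping log y! terms) plus normal log-priors."""
    b0, b1 = beta
    value = -(b0 ** 2 + b1 ** 2) / (2.0 * prior_sd ** 2)
    for x, y in zip(xs, ys):
        eta = b0 + b1 * x
        value += y * eta - math.exp(eta)
    return value

def gradient(beta, xs, ys, prior_sd):
    """Gradient of the log-posterior with respect to (b0, b1)."""
    b0, b1 = beta
    g0 = -b0 / prior_sd ** 2
    g1 = -b1 / prior_sd ** 2
    for x, y in zip(xs, ys):
        resid = y - math.exp(b0 + b1 * x)
        g0 += resid
        g1 += resid * x
    return g0, g1

def map_estimate(xs, ys, prior_sd=10.0, max_iter=100, tol=1e-10):
    """Posterior mode by Newton steps with step halving."""
    beta = (0.0, 0.0)
    for _ in range(max_iter):
        g0, g1 = gradient(beta, xs, ys, prior_sd)
        if math.hypot(g0, g1) < tol:
            break
        # Entries of the negative Hessian, which is positive definite.
        h00 = h11 = 1.0 / prior_sd ** 2
        h01 = 0.0
        for x in xs:
            mu = math.exp(beta[0] + beta[1] * x)
            h00 += mu
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        current = log_posterior(beta, xs, ys, prior_sd)
        t = 1.0
        while t > 1e-8:   # halve the step until the log-posterior stops dropping
            candidate = (beta[0] + t * d0, beta[1] + t * d1)
            if log_posterior(candidate, xs, ys, prior_sd) >= current - 1e-9:
                break
            t /= 2.0
        beta = (beta[0] + t * d0, beta[1] + t * d1)
    return beta

# Hypothetical counts: x is an explanatory keyword, y is the response keyword.
xs = [0, 0, 1, 1, 2, 2, 3, 3]
ys = [1, 0, 1, 2, 2, 4, 5, 7]

loose = map_estimate(xs, ys, prior_sd=10.0)   # weak prior, close to the MLE
tight = map_estimate(xs, ys, prior_sd=0.1)    # strong prior pulls the slope to 0
print(loose, tight)
```

Comparing the two fits shows the role of the prior: the stronger the prior belief that a keyword has no effect, the more the estimated coefficient shrinks toward zero.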


4. Case Study

To show how this research can be applied to a practical problem, we performed a case study using patent documents related to artificial intelligence (AI) technology. We collected the patents applied for and registered by 2016 from WIPSON [21]. The total number of collected valid patents was 11,973. First, we consulted AI experts and extracted the AI-related keywords from the collected patent documents [22]. Next, using text mining techniques, we built structured patent data for our case study [7,17]. The structured patent data is shown in Figure 5.

Figure 5. Structured patent data.

The rows and columns of this data matrix are patents and AI-related keywords, respectively, and each cell holds the occurrence frequency of a keyword in a patent. Based on the structured data, we classified AI technology as follows.

Table 1. Hierarchical structure of AI technology

| Sub-technology | Patent keywords |
|---|---|
| Learning | Learning, inference, ontology, representation, analysis, data |
| Behavior | Behavior, awareness, situation, sentiment, mind, spatial, collaborative |
| Language | Language, natural, understanding, morphological, dialogue, sentence, corpus, voice, speech, conversation, interface |
| Vision | Vision, figure, object, video, image |
| Neuro | Neuro, network, computing, feedback, pattern, recognition, cognitive |

In Table 1, we divided AI technology into five sub-technologies: learning, behavior, language, vision, and neuro. We also listed the patent keywords belonging to each sub-technology. We used this technology tree to retrieve and analyze the AI patents. First, we estimated the Poisson parameters of the patent keywords by maximum likelihood estimation (MLE) from the frequency values of the keywords. Table 2 shows the estimates of the Poisson parameters for all patent keywords.

Table 2. Estimates of Poisson parameters for all keywords

| Keyword | $\hat{\lambda}$ | Keyword | $\hat{\lambda}$ | Keyword | $\hat{\lambda}$ |
|---|---|---|---|---|---|
| analysis | 0.0287 | image | 0.2745 | pattern | 0.1506 |
| awareness | 0.0008 | inference | 0.0019 | recognition | 0.0211 |
| behavior | 0.0258 | interface | 0.0170 | representation | 0.0062 |
| cognitive | 0.0001 | language | 0.0426 | sentence | 0.0103 |
| collaborative | 0.0001 | learning | 0.0058 | sentiment | 0.0016 |
| computing | 0.0010 | mind | 0.0004 | situation | 0.0060 |
| conversation | 0.0035 | morphological | 0.0004 | spatial | 0.0966 |
| corpus | 0.0123 | natural | 0.0011 | speech | 0.5527 |
| data | 0.8316 | network | 0.2668 | understanding | 0.0005 |
| dialogue | 0.0009 | neuro | 0.0014 | video | 0.5114 |
| feedback | 0.0287 | object | 1.3510 | vision | 0.0123 |
| figure | 0.0007 | ontology | 0.0044 | voice | 0.0153 |

Table 2 lets us compare the relative frequencies of the patent keywords. These estimates are the maximum likelihood estimates (MLEs) of the Poisson parameters of the AI keywords [18]. Figure 6 illustrates the MLEs for all patent keywords.

Figure 6. MLEs of all patent keywords.

In this figure, the MLEs of ‘object’, ‘data’, ‘speech’, ‘video’, ‘image’, ‘network’, ‘pattern’, ‘spatial’, ‘language’, ‘analysis’, ‘feedback’, ‘behavior’, ‘recognition’, ‘interface’, ‘voice’, ‘corpus’, ‘vision’, and ‘learning’ are relatively larger than those of the other keywords. Using the result of Figure 6, we determine the patent keywords that affect AI technology: a keyword with a larger MLE has more impact on AI technology. We also carried out the Bayesian count data modeling on the [...]

Table 3. Model parameters and weight

| Keyword | Poisson | Gaussian | Weight |
|---|---|---|---|
| analysis | 0.1023 | 0.0176 | 0.0599 |
| awareness | -0.6630 | -0.0013 | -0.3321 |
| behavior | 0.1062 | 0.0103 | 0.0582 |
| cognitive | -0.2173 | -0.0016 | -0.1094 |
| collaborative | -0.1971 | 0.0000 | -0.0985 |
| computing | 0.0867 | 0.2221 | 0.1544 |
| conversation | -1.7300 | -0.0009 | -0.8654 |
| corpus | -4.5122 | -0.0009 | -2.2566 |
| data | -0.2869 | -0.0065 | -0.1467 |
| dialogue | -0.8820 | -0.0007 | -0.4414 |
| feedback | -5.6046 | -0.0046 | -2.8046 |
| figure | -0.6320 | -0.0002 | -0.3161 |
| image | -16.8687 | -0.0159 | -8.4423 |
| inference | -1.1521 | -0.0013 | -0.5767 |
| interface | 0.1670 | 0.0284 | 0.0977 |
| language | -0.2675 | 0.0020 | -0.1327 |
| learning | 0.1336 | 0.0500 | 0.0918 |
| mind | -0.5577 | -0.0004 | -0.2791 |
| morphological | -0.5222 | -0.0012 | -0.2617 |
| natural | -0.8831 | -0.0016 | -0.4424 |
| network | -0.3976 | -0.0099 | -0.2038 |
| neuro | -1.5059 | -0.0007 | -0.7533 |
| object | -0.3076 | -0.0085 | -0.1580 |
| ontology | -2.0103 | -0.0028 | -1.0065 |
| pattern | 0.1729 | 0.0155 | 0.0942 |
| recognition | -0.0030 | 0.0214 | 0.0092 |
| representation | -1.2470 | -0.0080 | -0.6275 |
| sentence | -3.2219 | -0.0033 | -1.6126 |
| sentiment | -1.6950 | -0.0006 | -0.8478 |
| situation | -2.1258 | -0.0019 | -1.0638 |
| spatial | 0.1356 | 0.0144 | 0.0750 |
| speech | -1.2228 | -0.0131 | -0.6180 |
| understanding | -0.5460 | -0.0037 | -0.2749 |
| video | -0.5061 | -0.0072 | -0.2567 |
| vision | -3.2688 | -0.0032 | -1.6360 |
| voice | -3.6855 | -0.0027 | -1.8441 |

We fitted Bayesian regression models with Gaussian as well as Poisson distributions. The weight in Table 3 is the average of the Poisson and Gaussian parameters. In this paper, we selected the keywords with larger weight values for finding technological sustainability in AI technology. Using the result of Table 3, Figure 7 shows the ranking of the patent keywords that influence AI technology.

Figure 7. Patent keyword ranking that influences artificial intelligence.
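The weight column of Table 3 and the resulting ranking can be reproduced with simple arithmetic. The sketch below uses a subset of the rows of Table 3 and sorts the keywords by weight. The cut-offs that separate the groups in Figure 7 are not given in the text, so no grouping is attempted here.

```python
# Subset of Table 3: keyword -> (Poisson coefficient, Gaussian coefficient).
table3 = {
    "analysis": (0.1023, 0.0176),
    "behavior": (0.1062, 0.0103),
    "computing": (0.0867, 0.2221),
    "image": (-16.8687, -0.0159),
    "interface": (0.1670, 0.0284),
    "learning": (0.1336, 0.0500),
    "pattern": (0.1729, 0.0155),
    "spatial": (0.1356, 0.0144),
}

# The weight is the average of the two coefficients.
weights = {k: (poisson + gaussian) / 2.0 for k, (poisson, gaussian) in table3.items()}

# Keywords ordered from the largest weight to the smallest.
ranking = sorted(weights, key=weights.get, reverse=True)
for keyword in ranking:
    print(f"{keyword:10s} {weights[keyword]: .4f}")
```

Averaging lets a keyword rank highly if either model assigns it a large coefficient, as with ‘computing’, whose weight comes mostly from the Gaussian model.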

In Figure 7, the keywords in the 1st group have a greater impact on AI technology than those in the 2nd or 3rd group. Using the experimental results of this paper, we built the following technological structure for sustainable AI technology.

Figure 8. Technological structure for sustainable AI.

In this paper, we divided AI technology into five sub-technologies (learning, behavior, language, vision, and neuro) in Table 1. In Figure 8, each sub-technology contains the keywords that represent it. For example, the keywords learning, inference, ontology, representation, analysis, and data describe the learning technology of AI. Each keyword is set in bold or underlined type depending on its importance in the results of Tables 2 and 3. The keywords in bold type are those that have an impact on AI technology according to the Poisson MLEs, and the underlined keywords affect AI technology according to the Bayesian regression model. We therefore found that the sub-technologies of learning, behavior, language, and neuro influence the sustainability of AI technology, whereas the sub-technology of vision has a relatively small effect compared with the other sub-technologies. In this [...] interface’, and ‘recognition pattern’ are important for maintaining the sustainability of AI technology.

5. Conclusions

We proposed Bayesian count data modeling to find technological sustainability. Knowing the sustainable technologies of a given technological field is very important for improving the technological competitiveness of a company or nation. Various studies have been conducted to find sustainable technologies, but most of them used statistical models that do not consider the count nature of the data from patent documents, even though most structured patent data have a count data structure. To resolve this discrepancy, we proposed Bayesian count data modeling in this study. In addition, we reflected the domain knowledge of experts in the prior distribution of the model parameters of the Bayesian regression model. To show the validity of the proposed modeling and illustrate how our approach can be applied to a practical problem, we carried out a case study using patent documents related to AI technology. In the case study, we found the sub-technologies that make AI technology sustainable: learning from data, spatial behavior, interface of language, and pattern recognition. Therefore, research and development should concentrate on these sub-technologies to maintain the sustainability of AI technology.

Our research can be applied to the research and development planning of companies and research institutes. It can also contribute to diverse technological fields other than AI technology. In this research, we considered only the keywords extracted from patent documents for patent technology analysis using statistical modeling. Our future work will use more diverse elements, such as citations and claims, in addition to keywords to find sustainable technologies for specific technological domains.

Author Contributions: Sunghae Jun designed this study, collected and preprocessed the data, selected the valid patents, analyzed the data to show the validity of the study, wrote the paper, and performed all the research steps.

Conflicts of Interest: The author declares no conflict of interest.

References

1. Park, S.; Jun, S. Statistical Technology Analysis for Competitive Sustainability of Three Dimensional Printing. Sustain. 2017, 9, 1142.
2. Choi, J.; Jun, S.; Park, S. A patent analysis for sustainable technology management. Sustain. 2016, 8, 1–13.
3. Roper, A.T.; Cunningham, S.W.; Porter, A.L.; Mason, T.W.; Rossini, F.A.; Banks, J. Forecasting and Management of Technology; John Wiley & Sons: Hoboken, NJ, USA, 2011.
4. Kim, J.; Jun, S.; Jang, D.; Park, S. Sustainable Technology Analysis of Artificial Intelligence Using Bayesian and Social Network Models. Sustain. 2018, 10, 115.
5. WIPO. World Intellectual Property Organization. Available online: www.wipo.org (accessed 2018).
6. WIPO IPC. International Patent Classification (IPC), World Intellectual Property Organization. Available online: http://www.wipo.int/classifications/ipc/en (accessed 2018).
7. Feinerer, I.; Hornik, K. Package ‘tm’ Ver. 0.7-5, Text Mining Package, CRAN of R Project. 2018. Available online: https://cran.r-project.org/web/packages/tm/tm.pdf (accessed on 1 August 2018).
8. Hilbe, J.M. Modeling Count Data; Cambridge University Press: Cambridge, UK, 2014.
9. Cameron, A.C.; Trivedi, P.K. Regression Analysis of Count Data; Cambridge University Press: New York, NY, USA, 2013.
10. Kim, S.; Jang, D.; Jun, S.; Park, S. A novel forecasting methodology for sustainable management of defense technology. Sustain. 2015, 7(12), 16720–16736.
11. Park, S.; Lee, S.; Jun, S. A network analysis model for selecting sustainable technology. Sustain. 2015, 7(10), 13126–13141.
12. Jun, S.; Park, S. Examining technological innovation of Apple using patent analysis. Ind. Manage. Data Syst.
13. Kim, J.; Jun, S. Graphical causal inference and copula regression model for Apple keywords by text mining. Adv. Eng. Inform. 2015, 29(4), 918–929.
14. Jun, S.; Park, S. Examining technological competition between BMW and Hyundai in the Korean car market. Technol. Anal. Strateg. Manage. 2016, 28(2), 156–175.
15. Grimaldi, M.; Cricelli, L.; Rogo, F. Valuating and analyzing the patent portfolio: the patent portfolio value index. European J. of Innovation Manage. 2018, 21(2), 174–205.
16. Kim, J.; Jun, S.; Jang, D.; Park, S. An Integrated Social Network Mining for Product-based Technology Analysis of Apple. Ind. Manage. Data Syst. 2017, 117(10), 2417–2430.
17. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2018. Available online: http://www.R-project.org.
18. Ross, S.M. Introduction to Probability and Statistics for Engineers and Scientists, 4th Edition; Elsevier: Seoul, Korea, 2012.
19. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd Edition; Chapman & Hall/CRC Press: Boca Raton, FL, 2013.
20. Hogg, R.V.; McKean, J.M.; Craig, A.T. Introduction to Mathematical Statistics, 8th Edition; Pearson: Upper Saddle River, NJ, 2018.
21. WIPSON. WIPS Corporation. Available online: http://www.wipson.com, http://global.wipscorp.com (accessed 2018).
22. KISTA. Korea Intellectual Property Strategy Agency. Available online: http://www.kista.or.kr (accessed