An Efficient Procedure for Characteristic mining of Mathematical Formulas from Document

Academic year: 2020

Appa Rao G

Department of CSE, GIT, GITAM, INDIA

Venkata Rao K

Department of CSSE, Andhra University, INDIA

Prasad Reddy PVGD

Department of CSSE, Andhra University, INDIA

Lava Kumar T

Department of IT, ANITS, INDIA

Abstract: In academic fields such as physics, mathematics and statistics, mathematical formulas are indispensable for expressing technical knowledge. Because formulas carry both structural and semantic information, searching for related mathematical formulas is an important but challenging problem. To search formulas effectively, it is essential to extract the structural and semantic features faithfully from their mathematical presentation. In this paper we propose an efficient approach for formula feature extraction. The proposed approach is evaluated on text containing both formulas that appear once and formulas that are repeated. Retrieval performance is assessed on a dataset of repeated and non-repeated formulas, and the time taken for retrieval is also measured.

1. Introduction

Interest in mining time series has increased in recent years. Time series are complex because of their high dimensionality and the strong relationships among data points. Many technical papers are published with mathematical formulas involving time series. Researchers working in technical disciplines cannot search effectively for related mathematical expressions with a text-based search engine unless appropriate text keywords are known. This limitation of existing text search engines can be overcome with a math-aware search engine.

Searching for mathematical formulas is important and difficult because formulas contain both structural and semantic information. In general, the theoretical foundation of the knowledge in many technical documents is represented with mathematical formulas. Popular search engines such as Google and Yahoo work efficiently for text-based information, but they struggle to search data that contains mathematical formulas.

Consider, for example, the formula Sinx+eCosx. It contains three symbols, e, sin and cos, which represent the exponential and trigonometric functions. In this formula the sin and cos terms carry semantic meaning, and they are structurally related to the exponential.

Several methods have already been proposed for the extraction of mathematical formulas [1, 2, 3]. In this paper we present a new method for retrieving the repeated and non-repeated formulas that appear in a given text.

The rest of this paper is organized as follows. Section 2 reviews related work on mathematical retrieval systems. Section 3 describes the proposed approach. Section 4 presents the experimental results. Finally, the conclusion is presented in Section 5.

1.1. Extracting Math formula

When parsing a docx file, we use the .xsl file to generate the math formulae. The entire set of math formulae is stored in an array list. To look up a particular formula, a Map data type is used; the Map also gives the frequency of the formula. To display the full set of math formulae in the browser, a script called Mathjax.js is used.

The elements of MathML are grouped into several clusters. One cluster comprises msub, munder, mmultiscripts, mrow, mstyle and mfrac. The maction element allows different kinds of actions on notation. The first MathML table in Section 2 shows the example formula x = (−b ± √(b² − 4ac)) / 2a encoded in MathML. In that table, mo denotes the operators =, − and ±, mi denotes the identifiers x, b, a and c, mn denotes the numbers 2 and 4, msup denotes the superscript, and msqrt denotes the square root function. In the second table, mi represents e, i, ω and t, and mo represents the − symbol.

2. Related literature

In "Mathematical formula retrieval for problem solving", Sidath Harshanath Samarasinghe and Siu Cheung Hui proposed a method for solving mathematical problems that students find difficult. They use document retrieval to help explain mathematical problems, with Kohonen's Self-Organizing Maps for data clustering, and they compare the efficiency of their approach with other clustering techniques.

In "Feature Extraction and Clustering-based Retrieval for Mathematical Formulas", Kai Ma, Siu Cheung Hui and Kuiyu Chang observed that mathematical formulas or expressions are indispensable for presenting scientific knowledge. Because mathematical formulas contain both semantic and structural information, retrieving related formulas is very difficult. They proposed an approach to formula retrieval that uses K-means, Self-Organizing Map (SOM) and Agglomerative Hierarchical Clustering (AHC).

MathML generation for formula and the syntax meaning

MathML                     Meaning
<math>                     root (all starts with <math>)
  <mi>x</mi>               identifier x (mi)
  <mo>=</mo>               operator = (mo)
  <mfrac>                  fraction tag
    <mrow>                 format tag
      <mo>-</mo>           operator - (mo)
      <mi>b</mi>           identifier b (mi)
      <mo>±</mo>           operator ± (mo)
      <msqrt>              root function tag
        <msup>             power (superscript)
          <mrow>           format tag
            <mi>b</mi>     identifier b (mi)
          </mrow>
          <mrow>           format tag
            <mn>2</mn>     number constant 2 (mn)
          </mrow>
        </msup>
        <mo>-</mo>         operator - (mo)
        <mn>4</mn>         number constant 4 (mn)
        <mi>a</mi>         identifier a (mi)
        <mi>c</mi>         identifier c (mi)
      </msqrt>             root function tag
    </mrow>
    <mrow>                 format tag
      <mn>2</mn>           number constant 2 (mn)
      <mi>a</mi>           identifier a (mi)
    </mrow>
  </mfrac>
</math>

MathML                     Meaning
<math>                     root (all starts with <math>)
  <msup>                   power (superscript)
    <mrow>                 format tag
      <mi>e</mi>           identifier e (mi)
    </mrow>
    <mrow>                 format tag
      <mo>-</mo>           operator - (mo)
      <mi>i</mi>           identifier i (mi)
      <mi>ω</mi>           identifier ω (mi)
      <mi>t</mi>           identifier t (mi)
    </mrow>
  </msup>
</math>

In the formulas above, the square root and the exponential each have their own meaning. The sub-formula iωt is related to the exponential, while the sub-formulas b² − 4ac and 2a are related to the square root and the division symbol. A formula therefore carries two kinds of information: one related to meaning (semantic) and the other related to structure.

As the MathML tables show, every component of a mathematical formula has some meaning: √ means the square root of a given expression, and the variables a, b and c are structurally related to the square root and the division.
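The distinction between semantic and structural information can be illustrated with a minimal sketch. It is written in Python with the standard library only, as a stand-in for the Java implementation described in this paper, and it is not the authors' code. The function names `leaf_tokens` and `structural_pairs` are hypothetical, and treating `math` and `mrow` as pure grouping tags is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# MathML for the exponential formula, copied from the second MathML table.
MATHML = (
    "<math><msup>"
    "<mrow><mi>e</mi></mrow>"
    "<mrow><mo>-</mo><mi>i</mi><mi>ω</mi><mi>t</mi></mrow>"
    "</msup></math>"
)

# Leaf tags and the meaning the tables assign to them.
LEAF_KINDS = {"mi": "identifier", "mo": "operator", "mn": "number"}


def leaf_tokens(root):
    """Semantic view: (kind, text) for every mi/mo/mn element, in document order."""
    return [(LEAF_KINDS[el.tag], el.text) for el in root.iter() if el.tag in LEAF_KINDS]


def structural_pairs(root, grouping=("math", "mrow")):
    """Structural view: (leaf text, nearest enclosing non-grouping tag)."""
    pairs = []

    def walk(el, context):
        if el.tag in LEAF_KINDS:
            pairs.append((el.text, context))
            return
        # Grouping tags do not change the structural context (assumption).
        next_context = context if el.tag in grouping else el.tag
        for child in el:
            walk(child, next_context)

    walk(root, None)
    return pairs


root = ET.fromstring(MATHML)
tokens = leaf_tokens(root)
pairs = structural_pairs(root)
print(tokens)
print(pairs)
```

In this sketch every leaf of the exponential formula is reported as lying inside `msup`, which mirrors the statement that iωt is structurally related to the exponential.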

3. Proposed Approach

Many retrieval systems have been developed to extract text from documents, and others extract formulas together with their semantic and structural meaning. The proposed approach concentrates on repeated and non-repeated mathematical formulas. The approach is explained in the figure.

To extract math formulae from doc and docx files, we follow these steps.

• Math formulae are generally written using MathML or LaTeX.

• Microsoft provides a file named OMML2MML.xsl. This file is used to generate MathML from the math markup stored in the document.

• We use that .xsl file in order to extract MathML formulae from the document.

• A Java library, POI-OOXML-3.9.jar, contains the classes and functions that are used to extract math formulae from a docx file.

• When we load the docx file into our program, its math content is parsed into CTOMath (a predefined class in the specified .jar file).

• While parsing the docx file, we use the .xsl file to generate the math formulae.

• The entire set of math formulae is stored in an ArrayList.

• To look up a particular formula, a Map data type is used. The Map also gives the frequency of the formula.

• To display the full set of math formulae in the browser, a script called Mathjax.js is used.
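The list-plus-map bookkeeping in the steps above can be sketched as follows. This is an illustration in Python, not the Java code used in this work: a plain list plays the role of the ArrayList and `collections.Counter` plays the role of the Map. The helper names `formula_frequencies` and `split_repeated`, the sample MathML strings, and the whitespace normalization are assumptions added for the example.

```python
from collections import Counter


def formula_frequencies(formulas):
    """Count how often each extracted formula string occurs."""
    # Assumption: remove all whitespace so that layout differences
    # do not split the count of an otherwise identical formula.
    normalized = ["".join(f.split()) for f in formulas]
    return Counter(normalized)


def split_repeated(frequencies):
    """Separate formulas that occur more than once from those that occur once."""
    repeated = {f: n for f, n in frequencies.items() if n > 1}
    single = [f for f, n in frequencies.items() if n == 1]
    return repeated, single


# Hypothetical extraction result: the list of MathML strings from one document.
extracted = [
    "<math><msup><mi>e</mi><mi>x</mi></msup></math>",
    "<math><mi>x</mi></math>",
    "<math> <msup><mi>e</mi><mi>x</mi></msup> </math>",
]

freq = formula_frequencies(extracted)
repeated, single = split_repeated(freq)
print(repeated)
print(single)
```

Removing all whitespace is only safe for MathML without attributes, as in these samples; a real implementation would need a more careful canonical form.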

Table 1: Semantic and structural information of mathematical formulas [1, 2].

Mathematical formula   Semantic information                        Structural information
[formula]              Integration, tangent function, variable x    tangent function in integration, x in tangent function
[formula]              Square function, variable x                  x in square function
[formula]              Exponential function,

Flow chart for extraction of formulas [1] (math formulae processing):

• Load the different sets of formulae (docx) along with the text.

• Load OMML2MML.xsl to generate MathML.

• Generate MathML for every formula.

• Generate the DOM tree for the MathML.

• After creation, load the result into the browser script for display.

MathML

<math> <msup> <mrow> <mi>e</mi> </mrow> <mrow> <mo>-</mo> <mi>i</mi> <mi>w</mi> <mi>t</mi> </mrow> </msup> </math>

DOM tree

root
  msup
    mi[e]
    mrow
      mo[-]
      mi[i]
      mi[w]
      mi[t]

Fig. 1: DOM tree generation for the formula e^(−iωt) [1]
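As a rough illustration of the DOM tree step, the sketch below parses the MathML string of Fig. 1 and prints one node per line, indented by depth. It uses Python's `xml.etree.ElementTree` as a stand-in for the DOM parser of the actual implementation, and the function name `tree_lines` is hypothetical. Unlike the figure, this sketch keeps every `mrow` node, including the one that wraps the single identifier e.

```python
import xml.etree.ElementTree as ET

# MathML string from Fig. 1, with whitespace between tags removed.
MATHML = (
    "<math><msup>"
    "<mrow><mi>e</mi></mrow>"
    "<mrow><mo>-</mo><mi>i</mi><mi>w</mi><mi>t</mi></mrow>"
    "</msup></math>"
)


def tree_lines(el, depth=0):
    """Return the tree as indented lines; leaves are shown as tag[text]."""
    text = (el.text or "").strip()
    label = f"{el.tag}[{text}]" if text and len(el) == 0 else el.tag
    lines = ["  " * depth + label]
    for child in el:
        lines.extend(tree_lines(child, depth + 1))
    return lines


lines = tree_lines(ET.fromstring(MATHML))
print("\n".join(lines))
```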

4. Time taken for retrieval

In this section we present the retrieval time for repeated and non-repeated formulas in a document. The experiments were conducted on documents containing exponential, sigma, function, cosine and sine formulas. Some formulas in the documents are repeated 2, 4 or 6 times, and some appear only once. The retrieval time depends on the RAM and the processor speed. In this work the retrieval time was measured on a system with 4 GB of RAM and an i3 processor; it may be lower on a system with a higher configuration.

• Loading the .xsl file and the input file takes around 306 milliseconds for one formula and 850 milliseconds on average for a larger number of formulae.

• For a normal equation like 2 1 0, it takes around 400 milliseconds.

• For complex equations like sin sin 2 sin cos ∓ , it takes 900-1000 milliseconds.

Processing around 40 different formulae, ranging from basic to very complex, takes around 18591 milliseconds.
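Timings of this kind can be collected with a small wrapper around the extraction call. The sketch below is in Python and is only an assumption about how such a measurement might be set up; the measurements reported here were taken with the Java implementation. The names `time_ms` and `count_formulas` are hypothetical, and `count_formulas` is a trivial stand-in for the real extraction step.

```python
import time


def time_ms(func, *args):
    """Run func(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def count_formulas(document_text):
    # Hypothetical stand-in for formula extraction: counts <math> openings.
    return document_text.count("<math")


sample = "text <math><mi>x</mi></math> more text <math><mi>y</mi></math>"
found, elapsed = time_ms(count_formulas, sample)
print(found, "formulas found in", round(elapsed, 3), "ms")
```

Repeating the measurement several times and averaging would reduce the influence of start-up costs such as loading the .xsl file.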

Table 2: Time taken (in milliseconds) for the extraction of mathematical formulas repeated an even number of times.

Formula              Non-repeated   Repeated twice   Repeated four times   Repeated eight times
[formula: √ 4 2]     3407           4105             4516                  5969
[formula]            3531           3969             4510                  5650
[formula]            3532           3938             4469                  5407
[formula: 1]         3563           4048             4620                  5516
[formula: cos sin]   3719           3922             4719                  5766

Table 3: Time taken (in milliseconds) for the extraction of mathematical formulas repeated an odd number of times.

Formula              Repeated three times   Repeated five times   Repeated seven times   Repeated nine times
[formula: √ 4 2]     4250                   4656                  5235                   5969
[formula]            4735                   4703                  5235                   5725
[formula]            4603                   4860                  5188                   5766
[formula: 1]         4438                   5149                  5431                   5663
[formula: cos sin]   4578                   5174                  5481                   6292

5. Conclusion

In this paper we proposed a model that retrieves repeated and non-repeated mathematical formulas. The performance of the proposed method is measured in terms of the time taken to generate the parser output for repeated and non-repeated formulas. The proposed technique gives its best results when a formula is not repeated many times. Experiments are currently being conducted on a much larger set of training documents.

References

[1] Kai Ma, Siu Cheung Hui and Kuiyu Chang, “Feature Extraction and Clustering-based Retrieval for Mathematical Formulas,” pp. 372-377.

[2] Sidath Harshanath Samarasinghe and Siu Cheung Hui, “Mathematical Document Retrieval for Problem Solving,” International Conference on Computer Engineering and Technology, 2009, pp. 583-587.

[3] J. Misutka and L. Galambos, “Mathematical Extension of Full Text Search Engine Indexer,” Proc. 3rd International Conference on Information and Communication Technologies: From Theory to Applications (ICTTA 08), April 2008, pp. 1-6.

[4] B.R. Miller and A. Youssef, “Technical Aspects of the Digital Library of Mathematical Functions,” Annals of Mathematics and Artificial Intelligence, Springer Netherlands, 2003, pp. 121-136.

[5] H. Zhang, T.B. Ho and M.S. Lin, “An Evolutionary K-means Algorithm for Clustering Time Series Data,” Proc. International Conference on Machine Learning and Cybernetics, 2004, pp. 1282-1287.

[6] R. Munavalli and R. Miner, “MathFind: A Math-aware Search Engine,” Proc. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 735-735.
