Developed non-text contents automatic extractor for effective utilization and sharing of national R&D reports


Il-Kwon Lim¹, Kwang-Nam Choi², Ji-Seung Son³, YongJu Shin⁴*
¹⁻⁴ NTIS Center, KISTI, 245 Daehak-ro, Yuseong-gu, Daejeon, 34141, Korea

Abstract

At KISTI, R&D reports, which are the results of national R&D projects, are collected from the task management (specialized) agencies under the ministries, built into a high-quality DB, and made available to researchers. For DB construction, the collected R&D reports are standardized as PDF and then converted to XML format. In addition, non-text contents must be collected to support a supplementary table/picture service for R&D reports. Accordingly, we developed an extractor that automatically extracts the titles and contents of non-text items such as tables and pictures during conversion to XML. Non-text contents are extracted mainly from the captions and objects corresponding to tables and pictures. In practice, because extraction runs from the cover to the end of the report, a large number of unwanted tables and pictures are extracted, and these are inspected and supplemented manually. To improve the system, we restricted the extraction range of non-text contents based on the list of tables and list of figures in the R&D report, together with the captions. We also improved the functions for checking and supplementing extracted contents, and developed an API so that the contents can be shared and used in other systems. Except for reports whose tables and figures are of poor technical quality, unwanted extraction of non-text content from reports with many attachments at the end or in the appendix was significantly reduced, which reduces DB construction effort. The API will first be applied to the search results of the NTIS (National Science & Technology Information Service) homepage, and will later be used in NDSL (National Digital Science Library) and by external organizations.

Keywords: National R&D report, PDF-to-XML document conversion, automatic extractor for non-text content, table/picture extraction, National R&D Reports Management System.

1. Introduction

National R&D projects are led by the government because of the difficulty of securing profitability and the risk of failure if such work is left entirely to private research and development (Kyung-Jae Lee et al., 2016), (Seung-koo Ahn and Joo-il Kim, 2016). To promote the accumulation of new knowledge and technological innovation, countries are increasing their R&D budgets for these national R&D projects, as shown in Table 1. The Korean R&D budget, which was 2.3 trillion won in 1996, has increased continuously to 19.1 trillion won in 2016. Total R&D investment as a proportion of GDP was 4.29% in 2014 (Kyung-jae Lee, 2016).

Table 1. Government R&D Budget Trends in Major Countries

Countries  2016      2017      Unit             Remarks
USA        1,430     1,512     Billion dollars  Federal R&D budget
EU         7,063     10,300    Million euros    Horizon 2020
Japan      39,503    39,746    Billion yen      Science and Technology Original Budget
China      2,728.50  2,899.20  Billion yuan     National financial science and technology expenditure

* Corresponding Author: YongJu Shin, NTIS Center, KISTI, 245 Daehak-ro, Yuseong-gu, Daejeon, 34141, Korea

Accordingly, the Ministry of Science, ICT and Future Planning designates and manages "specialized agencies for the management and distribution of research achievements" for the management and utilization of the research results of national R&D projects. The results of national research and development are divided into nine types of achievements, such as papers, patents, and R&D reports, and each type is designated to and managed by a responsible organization. The Korea Institute of Science and Technology Information (hereinafter referred to as KISTI) has been designated as the management and distribution agency for R&D reports and operates this function systematically (Research Performance Management Distribution System User Manual, 2015).

KISTI processes bibliographic information, tables of contents, references, and non-text content information, and detects and deletes personal information, for the efficient management and circulation of R&D reports. In addition, the National R&D Reports Management System (hereinafter referred to as NRMS) has been developed and is operated, as shown in Figure 1, for the systematic management and distribution of these R&D reports.

NRMS operates a web hard, an information link agent, a collection client program, and other components for collecting R&D reports. The collected R&D report texts are processed, refined, and converted to XML, and serve as the basis for permanent storage and common utilization (Hong-Ro Lee et al., 2010), (Huh Tae-Sang et al., 2009), (Jeong-Kyeom Kim et al., 2011), (National R&D Reports Management System, 2017).

Fig. 1. National R&D Reports Management System

To maximize the utilization of the collected R&D reports, research is being conducted on an automatic extractor for non-text content, along with research on XML conversion of the original PDF text. In this paper, non-text content refers to content composed of tables and pictures, excluding the body text of the R&D reports.

This paper therefore describes the development and operation of the non-text content automatic extractor for R&D reports and the enhancements made to its performance.

2. Related research

A. XML conversion system of the R&D report

R&D reports collected in PDF format are converted into digital content through the DB establishment process (registration, acquisition, ordering, processing, delivery, loading, provision). At this point, the original PDF document is converted into the W3C standard XML format and stored. The conversion is performed in two stages: metadata extraction and XML conversion.

In the first stage, the R&D report is filtered to assemble the necessary information, and the metadata is extracted from the result. The automatic metadata extraction tool consists of a filtering step and an extraction step, as shown in Figure 2. The filtering step assembles the information needed for extracting the metadata, and the extraction step extracts the metadata from the submission statement, table of contents, and list-of-tables pages of the PDF, based on the temporary file constructed in the filtering step. In the second stage, as shown in Figure 3, the document structure of the whole R&D report and the captions of the non-text content are analyzed using the table of contents information extracted in the first stage. Based on the analyzed information, non-text content extraction and text storage are performed. After this, the XML generator builds the Front from the metadata and the Back from the references, and the structure analyzer structures the text using the extracted sentence structure and caption information, completing the transformation into XML (Gyu-Jin Choi et al., 2014), (Hong-Woo CHUN, 2010), (Kwang-Nam Cho, 2016).

Fig. 2. Metadata auto-extraction process

Fig. 3. XML transformation process
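The final assembly step described above (Front from metadata, structured text with captions, Back from references) can be illustrated with a minimal sketch. The element names (`report`, `front`, `body`, `section`, `p`, `caption`, `back`, `ref`) and the input shapes are assumptions made for illustration only; the source does not give the actual XML schema used by NRMS.

```python
import xml.etree.ElementTree as ET


def build_report_xml(metadata, sections, references):
    """Assemble a report XML string with front, body, and back parts.

    All element names are hypothetical; the real NRMS schema is not
    described in the paper.
    """
    root = ET.Element("report")

    # Front: built from the metadata extracted in the first stage.
    front = ET.SubElement(root, "front")
    for key, value in metadata.items():
        ET.SubElement(front, key).text = value

    # Body: structured text plus the captions of non-text content.
    body = ET.SubElement(root, "body")
    for section in sections:
        sec = ET.SubElement(body, "section", title=section["title"])
        for paragraph in section.get("paragraphs", []):
            ET.SubElement(sec, "p").text = paragraph
        for caption in section.get("captions", []):
            ET.SubElement(sec, "caption").text = caption

    # Back: built from the reference list.
    back = ET.SubElement(root, "back")
    for ref in references:
        ET.SubElement(back, "ref").text = ref

    return ET.tostring(root, encoding="unicode")
```

Keeping the three parts as separate subtrees mirrors the division of labor in the text: the generator can fill Front and Back independently of the structure analyzer that fills the body.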

B. Non-text content automatic extractor study

The existing non-text content automatic extractor extracts the pictures and tables in the R&D report using the caption information obtained by the structure analyzer during XML conversion, and stores them as images. For a picture, a composite image composed of lines, text, and pictures is extracted as a single image based on the caption information. For a table, only the start and end coordinates of the table are determined, and the corresponding region is captured and stored as an image (Gyu-Jin Choi et al., 2014). The image files of the extracted tables and figures are stored in a specified folder, and the metadata of the R&D report, the position of the caption, and the file information are stored together in the database for later use. The content is also combined with the structured captions when the XML is generated. Figure 4 shows the process of building the non-text content DB.

Fig. 4. A non-text content DB construction process
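As a rough illustration of caption-driven extraction, the sketch below scans page text for lines that look like table or figure captions and records where they occur. The regular expression, the `Caption` record, and the page/line input format are assumptions for this example; the paper does not describe how the structure analyzer actually recognizes captions, and a real extractor would also need layout coordinates to crop the image region.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: "Table 1. Title", "Fig. 2. Title", "Figure 3: Title".
# A period or colon after the number is required so that body sentences
# such as "Figure 4 below shows ..." are not treated as captions.
CAPTION_RE = re.compile(
    r"^\s*(?P<kind>Table|Fig\.?|Figure)\s*(?P<number>\d+)\s*[.:]\s*(?P<title>.+)$",
    re.IGNORECASE,
)


@dataclass
class Caption:
    kind: str        # "table" or "figure"
    number: int
    title: str
    page: int        # 1-based page index
    line_index: int  # 0-based line index within the page


def find_captions(pages):
    """Return caption records found in `pages`, a list of lists of text lines."""
    captions = []
    for page_no, lines in enumerate(pages, start=1):
        for i, line in enumerate(lines):
            match = CAPTION_RE.match(line)
            if match is None:
                continue
            kind = "table" if match.group("kind").lower() == "table" else "figure"
            captions.append(
                Caption(kind, int(match.group("number")),
                        match.group("title").strip(), page_no, i)
            )
    return captions
```

Each record corresponds to the "position of the caption" that the text says is stored in the database alongside the image file.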

3. Developed automatic extractor for non-text content

The developed non-text content automatic extractor is included in the national R&D report construction management system (hereinafter referred to as RRMS) of NRMS, and was developed with the following functional enhancements for extracting the non-text contents of R&D reports.

A. Function to extract non-text content based on the list of table/picture

The existing non-text content automatic extractor extracts caption information according to the document structure through text analysis of the entire R&D report, and extracts the non-text content according to the caption position information. However, in addition to the basic table of contents, national R&D reports contain a list of figures and a list of tables. In this case, the non-text contents can be extracted according to the positions given in these lists. Therefore, in this paper, we developed a non-text content extractor that improves the ability to retrieve the contents according to these lists.

DB construction of non-text contents is done in three steps, filtering, conversion, and generation, as shown in Figure 4. In the filtering step, text and images are separated. In the conversion step, non-text content is extracted. In the generation step, XML information is generated and stored. In this paper, we improve the conversion step so that it checks for the list of tables and the list of figures, as shown in Figure 5, and stores the related information accordingly. Extraction of non-text contents is then based on the stored list of tables, list of figures, and index/tagging information. If a report has no list of tables or figures, the non-text content is extracted in the same way as before.

Fig. 5. Improved non-text content DB building process based on the list of table/figure
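The list-based restriction and its fallback can be sketched as a simple filter over caption candidates. Everything below is hypothetical: the `ListEntry` and `Candidate` records, the matching key (kind, number, page), and the `page_tolerance` parameter are assumptions chosen for illustration, since the paper does not specify how list entries are matched to extracted objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListEntry:
    """One line of the report's list of tables or list of figures."""
    kind: str    # "table" or "figure"
    number: int
    page: int    # page given in the list


@dataclass(frozen=True)
class Candidate:
    """A caption/object pair found by caption-based extraction."""
    kind: str
    number: int
    page: int
    caption: str


def select_candidates(candidates, list_entries, page_tolerance=0):
    """Keep only candidates that appear in the list of tables/figures.

    If the report has no such list, fall back to the previous behavior
    and keep every caption-based candidate.
    """
    if not list_entries:
        return list(candidates)

    expected = {(e.kind, e.number): e.page for e in list_entries}
    selected = []
    for cand in candidates:
        page = expected.get((cand.kind, cand.number))
        if page is None:
            continue  # not listed, e.g. a table inside an attachment
        if abs(cand.page - page) <= page_tolerance:
            selected.append(cand)
    return selected
```

Under these assumptions, a table in an appendix that reuses the number "Table 1" on a different page is dropped, which corresponds to the reduction in unwanted extractions that the paper reports for reports with many attachments.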

B. Developed automatic extracting and complementing tool for non-text contents

The extracted non-text content is delivered to the actual service after inspection by the professional staff of the R&D report DB. Errors that occurred during automatic detection are corrected during the inspection work, using the screens shown in Figures 6 to 8. Table 2 shows the errors that occur during automatic detection and their error codes.

Table 2. Types of errors that occur and their code numbers

Code  Division                                  Explanation
01    Extraction area position                  When non-text content extraction is out of range
                                                - Non-text content truncated
                                                - Extraction of area outside the range of n minutes
02    Caption title                             Missing non-text content title
                                                - Non-text content title truncated
                                                - Extract different range headings
05    Extraction area position + caption title  If both the extraction area and the caption are errors
04    Not extracted                             If the non-text content is not yet extracted
                                                - Target: non-text content exists in the original text of the R&D report but is not yet extracted
03    Other errors (except build)               Other errors (except build)
                                                - If the pages are organized in succession
                                                - When there is no title or it is unclear
                                                - Boxed tables or formulas
                                                - If personal information is detected and marked
                                                - A table with budget and enforcement amounts
                                                - If it is composed of only portraits (group photographs after review)
99    Extraction succeeded                      If extracted normally
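The codes in Table 2 map naturally onto an enumeration that an inspection tool could store with each extracted item. The sketch below is only an illustration: the class name, member names, and the `classify` helper are assumptions, while the code values and the rule that code 05 covers simultaneous area and caption errors come from Table 2.

```python
from enum import Enum


class ExtractionStatus(Enum):
    """Error codes from Table 2 (member names are hypothetical)."""
    AREA_POSITION = "01"
    CAPTION_TITLE = "02"
    OTHER = "03"
    NOT_EXTRACTED = "04"
    AREA_AND_CAPTION = "05"
    SUCCEEDED = "99"


def classify(area_error: bool, caption_error: bool) -> ExtractionStatus:
    """Pick a code for an extracted item from two inspection findings."""
    if area_error and caption_error:
        return ExtractionStatus.AREA_AND_CAPTION
    if area_error:
        return ExtractionStatus.AREA_POSITION
    if caption_error:
        return ExtractionStatus.CAPTION_TITLE
    return ExtractionStatus.SUCCEEDED
```

Codes 03 and 04 are not derived from these two flags; in this sketch they would be assigned directly by the inspector.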

Figure 6 shows the detailed data retrieval screen for an R&D report from which non-text content has been extracted. Figure 7 shows the image correction function for extraction errors in the extracted non-text contents. Figure 8 shows the new registration function for non-text content that was missed during automatic detection. We developed and implemented a non-text automatic extraction complementing tool that provides the functions required for this work.

Fig. 6. Detailed data retrieval screen for non-text extracted R&D report

Fig. 7. Non-text content editing function screen

Fig. 8. New registration screen for non-text content

C. Developed non-text content search API function

Non-text content data constructed through automatic extraction and verification is served to users through the search function of NRMS. Users can search for non-text content by the title of the R&D report or by caption, as shown in Figure 9.

The non-text content search function was developed so that it can be distributed as an API (Application Programming Interface), in order to maximize sharing and utilization.

Accordingly, the API can be provided to any institutional system that wishes to use non-text content search. Table 3 shows the input and output parameters of the developed non-text content search API. The input parameters consist of seven values, such as the institution ID value, search classification, and query value for API authentication.

The output parameters consist of eleven values, including the search word, the number of search results, and caption information. The search screen built on the provided API is the same as Figure 9.

Fig. 9. Non-text content search screen

Table 3. Parameters of the non-text content API

Input parameters
No.  Type    Required  Description
1    String  O         Search category
2    String  O         Agency ID
3    String  O         Search word
4    String  -         Report number
5    String  -         Report name
6    String  -         Search start value
7    String  -         Search end value

Output parameters
No.  Type    Description
1    String  Search word
2    String  Number of searches
3    String  Report number
4    String  Report name
5    String  Caption information
6    String  Order by content
7    String  File directory path
8    String  Original size file name
9    String  Thumbnail file name
10   String  File download intermediate URL
11   String  NRMS linkage image popup URL
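The input side of Table 3 can be sketched as a small request object that enforces the three required parameters before a query is built. The field names and the choice of a URL-encoded query string are assumptions for illustration; the paper lists only parameter descriptions and types, not the actual parameter names, endpoint, or transport format of the NRMS API.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class SearchRequest:
    """Input parameters of Table 3 (field names are hypothetical)."""
    search_category: str                 # required
    agency_id: str                       # required
    search_word: str                     # required
    report_number: Optional[str] = None
    report_name: Optional[str] = None
    start: Optional[str] = None          # search start value
    end: Optional[str] = None            # search end value

    def to_query(self) -> str:
        """Return a URL-encoded query string, omitting unset optional values."""
        for name in ("search_category", "agency_id", "search_word"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required")
        params = {k: v for k, v in vars(self).items() if v is not None}
        return urlencode(params)
```

For example, `SearchRequest("caption", "AGENCY01", "budget", start="1", end="10").to_query()` yields a query string containing only the three required values and the two range values. All values are kept as strings, matching the String type given for every parameter in Table 3.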

4. Conclusion

KISTI is researching and developing the R&D report collection function and NRMS for the effective management and utilization of national R&D reports. In this paper, we developed an automatic extractor for non-text contents. The existing extractor automatically extracts non-text contents according to caption positions obtained by analyzing the entire R&D report; in this paper, it was improved to extract non-text contents according to the list of tables and the list of figures. In addition, errors occur in the extraction of non-text contents depending on the type of R&D report, which varies with the characteristics of the task, and on mistakes made by the researcher, so we developed a complementing function for non-text contents. We also developed a non-text content search API for sharing and utilizing non-text contents. In the future, an image analysis function will be developed to maximize the utilization of the collected non-text contents.

5. Acknowledgment

This research was supported by the project "Maximize the Value of National Science and Technology by Strengthen Sharing/Collaboration of National R&D Information", funded by the Korea Institute of Science and Technology Information (KISTI).

References

[1] Gyu-Jin Choi, Seung-Jun Cha, Kyu-Chul Lee, “Development of the XML-based Contents Extraction and Conversion Systems for Enhancing Preservation and Service of National R&D Reports”, Journal of DATABASE RESEARCH, 30(1), (2014), web: https://goo.gl/D3iG1t.

[2] Hong-Ro Lee, Jeong-Kyeom Kim, Kwang-Nam Cho, Ki-Seok Choi, Jae-Soo Kim, “A Study on the Improvement of Framework for Collection and Management of National R&D Reports”, Proceedings of Korean Society for Internet Information Conference, pp. 297-302, (2010.6), web: https://goo.gl/WJpJvb.

[3] Hong-Woo CHUN, “Conversion of PDF to XML-based Structural Information”, Research Report of KISTI, (2010), web: https://goo.gl/R1jK8n.

[4] Huh Tae-Sang, Choi Ki-Seok, Kim Jae-Soo, Park Min-Woo, Shin Young-ho, “Design and Implementation of Registration System for Management National R&D Reports”, Proceedings of Korean Institute of Information Scientists and Engineers, (2009.6), 36(1): pp. 230-235, web: https://goo.gl/QuPEy6.

[5] Jeong-Kyeom Kim, Hong-Ro Lee, Kwang-Nam Cho, Ki-Seok Choi, “A study of efficient national R&D Reports improved registration system for collection and registration”, Proceedings of Korea Internet Information Society, (2011), 12(1): pp. 153-154, web: https://goo.gl/zDaobp.

[6] Kwang-Nam Cho, “XML Element Weight-based Similarity Analysis System of R&D Report”, Ph.D. Thesis, Pai Jae University, Daejeon, Rep. of Korea, (2016), p. 107, web: https://goo.gl/sy8JNF.

[7] Kyung-Jae Lee and others, A Research on Global R&D Investment Trends in the year 2016, Research Report of Korea Institute of S&T Evaluation and Planning, (2016), web: https://goo.gl/Ts1KvN.

[8] Kyung-jae Lee, “2017 Government R&D Investment Direction and Key Features”, KISTEP InI, No. 13, (2016), web: https://goo.gl/UcdkVJ.

[9] “National R&D Reports Management System”, web: nrms.kisti.re.kr, Search date: Sep. 3, 2017.

[10] Research Performance Management Distribution System User Manual, Korea Institute of S&T Evaluation and Planning, (2015), web: https://goo.gl/cGN5rD.

[11] Seung-koo Ahn, Joo-il Kim, Government Research and Development Budget Analysis in the FY 2016, Research Report of Korea Institute of S&T Evaluation and Planning, (2016), web: https://goo.gl/VF6h6Z.

[12] Reddy, A.V.N., Krishna, C.P. & Mallick, P.K. An image classification framework exploring the capabilities of extreme learning machines and artificial bee colony. Neural Comput & Applic (2019). https://doi.org/10.1007/s00521-019-04385-5