
Automatic Text Summarization Using Sentence Scoring and Random Forest Algorithm


Academic year: 2021


AUTOMATIC TEXT SUMMARIZATION USING SENTENCE SCORING AND RANDOM FOREST ALGORITHM. UNDERGRADUATE THESIS (SKRIPSI). By: LA ODE ABD. EL HAFIZH HIDAYAT, NIM. 15650059. DEPARTMENT OF INFORMATICS ENGINEERING, FACULTY OF SCIENCE AND TECHNOLOGY, STATE ISLAMIC UNIVERSITY OF MAULANA MALIK IBRAHIM MALANG, 2020.

AUTOMATIC TEXT SUMMARIZATION USING SENTENCE SCORING AND RANDOM FOREST ALGORITHM. UNDERGRADUATE THESIS (SKRIPSI). Submitted to: State Islamic University (UIN) of Maulana Malik Ibrahim Malang, in partial fulfillment of the requirements for the degree of Sarjana Komputer (S.Kom). By: LA ODE ABD. EL HAFIZH HIDAYAT, NIM. 15650059. DEPARTMENT OF INFORMATICS ENGINEERING, FACULTY OF SCIENCE AND TECHNOLOGY, STATE ISLAMIC UNIVERSITY OF MAULANA MALIK IBRAHIM MALANG, 2020.

LETTER OF APPROVAL. AUTOMATIC TEXT SUMMARIZATION USING SENTENCE SCORING AND RANDOM FOREST ALGORITHM. UNDERGRADUATE THESIS (SKRIPSI). By: LA ODE ABD. EL HAFIZH HIDAYAT, NIM. 15650059. Examined and approved for examination on: 18 May 2020.
Supervisor I: Dr. Cahyo Crysdian, NIP. 197404242009011008.
Supervisor II: Ainatul Mardhiyah, M.CS, NIP. 19860330201608012075.
Acknowledged by the Head of the Department of Informatics Engineering, Faculty of Science and Technology, State Islamic University of Maulana Malik Ibrahim Malang: Dr. Cahyo Crysdian, NIP. 197404242009011008.

LETTER OF VALIDITY. AUTOMATIC TEXT SUMMARIZATION USING SENTENCE SCORING AND RANDOM FOREST ALGORITHM. UNDERGRADUATE THESIS (SKRIPSI). By: LA ODE ABD. EL HAFIZH HIDAYAT, NIM. 15650059. Defended before the Board of Examiners and accepted as one of the requirements for the degree of Sarjana Komputer (S.Kom) on: 18 May 2020.
Board of Examiners:
1. Main Examiner: A'la Syauqi, M.Kom, NIP. 197712012008011007
2. Chair of Examiners: Irwan Budi Santoso, M.Kom, NIP. 197701032011011004
3. Secretary of Examiners: Dr. Cahyo Crysdian, NIP. 197404242009011008
4. Member of Examiners: Ainatul Mardhiyah, M.CS, NIP. 19860330201608012075
Acknowledged by the Head of the Department of Informatics Engineering, Faculty of Science and Technology, State Islamic University of Maulana Malik Ibrahim Malang: Dr. Cahyo Crysdian, NIP. 197404242009011008.


MOTTO. فَاِذَا عَزَمْتَ فَتَوَكَّلْ عَلَى اللّٰهِ ۗ اِنَّ اللّٰهَ يُحِبُّ الْمُتَوَكِّلِيْنَ
When you have made a firm decision, put your trust in Allah. Indeed, Allah loves those who put their trust in Him. (Excerpt from QS. Ali Imran: 159).

DEDICATION PAGE. بِسْمِ اللّٰهِ الرَّحْمٰنِ الرَّحِيْمِ. اَلْحَمْدُ لِلّٰهِ رَبِّ الْعَالَمِيْنَ
This work is still far from worthy of being dedicated to my loved ones. It cannot compare with the sweat and toil of our parents in raising us. Allah Subhaanahu wata'ala alone is the best rewarder of kindness. God willing (Insya Allah), this work is the first page of greater works to come.

PREFACE. Assalamualaikum Warahmatullahi Wabarakatuh. Praise and gratitude to Allah Subhaanahu wata'ala for His abundant mercy and blessings, by which the author was able to complete his studies and this thesis. This thesis was written to fulfill one of the requirements for sitting the Sarjana Komputer examination at the Faculty of Science and Technology (FSAINTEK), Informatics Engineering study program, State Islamic University (UIN) of Maulana Malik Ibrahim Malang. Many people were involved in this work and helped in many ways. The author therefore expresses his deepest gratitude to:
1. Prof. Dr. Abdul Haris, M.Ag, Rector of the State Islamic University (UIN) of Maulana Malik Ibrahim Malang.
2. Dr. Sri Harini, M.Si, Dean of the Faculty of Science and Technology, State Islamic University (UIN) of Maulana Malik Ibrahim Malang.
3. Dr. Cahyo Crysdian, Head of the Department of Informatics Engineering and Supervisor I, who guided the writing of this thesis to its completion.
4. Ainatul Mardhiyah, M.CS, Supervisor II, who guided the writing of this thesis to its completion.
5. Roro Inda Melani, M.Kom, academic advisor, who always gave much advice throughout the author's years of study.
6. A'la Syauqi, M.Kom and Irwan Budi Santoso, M.Kom, the examiners, who gave many suggestions for the author's benefit.
7. The author's beloved father, mother, older siblings, and younger siblings, who gave so many prayers and so much support until this thesis could be completed.
8. Fellow students of Informatics Engineering 2015, who were constant companions on the journey of learning.
9. The open source developers who made it easier for the author to develop the research application.
10. The IndoSum research team, who provided the dataset that supports this research.
11. Everyone who helped in the writing of this thesis and whom the author cannot mention one by one.
The author is aware that this thesis still has shortcomings, and hopes that it will be useful to its readers and especially to the author personally.
Malang, 18 May 2020. The author.

Table of Contents
TITLE PAGE
LETTER OF APPROVAL
LETTER OF VALIDITY
LETTER OF AUTHENTICITY
MOTTO
LETTER OF GRATITUDE
PREFACE
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
ABSTRAK
الملخص
CHAPTER I INTRODUCTION
  1.1 Research Background
  1.2 Research Question
  1.3 Research Objectives
  1.4 Research Scope
CHAPTER II LITERATURE REVIEW
  2.1 Automatic Text Summarization
    2.1.1 Sentence Scoring
    2.1.2 Evaluation Metrics
  2.2 Ensemble Learning
    2.2.1 Bootstrap Aggregating (Bagging)
    2.2.2 Random Forest
CHAPTER III SYSTEM DESIGN AND IMPLEMENTATION
  3.1 System Design
    3.1.1 Dataset Customization
    3.1.2 System Development
      3.1.2.1 Preprocessing
      3.1.2.2 Sentence Scoring
      3.1.2.3 Random Forest
  3.2 System Implementation
    3.2.1 Sentence preprocessing
    3.2.2 Sentence Weighting
    3.2.3 Ensemble Learning
      3.2.3.1 Bootstrap Aggregating
      3.2.3.2 Random Forest
    3.2.4 User Interface & Database
CHAPTER IV RESULTS AND DISCUSSION
  4.1 Evaluation Metrics
  4.2 Test Result
    4.2.1 The Best Split Attribute
    4.2.2 OOB Evaluation
    4.2.3 System Summary Evaluation
  4.3 Discussion
CHAPTER V CONCLUSION AND SUGGESTION
  5.1 Conclusion
  5.2 Suggestion
REFERENCES
ATTACHMENTS
  Attachment 1
  Attachment 2
  Attachment 3
  Attachment 4
  Attachment 5

LIST OF FIGURES
Figure 2.1 Ensemble decision making
Figure 3.1 System Design
Figure 3.2 Statistical input dataset
Figure 3.3 The structure of the article in the dataset
Figure 3.4 Preprocessing
Figure 3.5 Sentence token
Figure 3.6 Sentence Scoring Process
Figure 3.7 Random Forest Process
Figure 3.8 SentenceMunging Class constructor
Figure 3.9 Example of input data on SentenceMunging Class
Figure 3.10 remove_punctuation() method
Figure 3.11 Example input of remove_punctuation() function
Figure 3.12 case_folding() method
Figure 3.13 Example output of case_folding() function
Figure 3.14 remove_stopword() function
Figure 3.15 Example output of remove_stopword() function
Figure 3.16 lemmatization() function
Figure 3.17 tokens transformation
Figure 3.18 SentenceScoring Class constructor
Figure 3.19 Example input data on SentenceScoring Class
Figure 3.20 word_frequency() method
Figure 3.21 title_similarity() method
Figure 3.22 sentence_position() method
Figure 3.23 sentence_length() method
Figure 3.24 centrality() method
Figure 3.25 tf_isf(), calculate_cosim(), bi_gram() and tri_gram() methods
Figure 3.26 feature_scaling() method
Figure 3.27 Bagging Class constructor
Figure 3.28 sample_size() method
Figure 3.29 sample_attributes() method
Figure 3.30 bootstrap_aggregating() method
Figure 3.31 RandomForest Class constructor
Figure 3.32 random_forest_exec() method
Figure 3.33 docstring of gain_ratio() method
Figure 3.34 docstring of get_max_gain() method
Figure 3.35 Home Page
Figure 3.36 Training Page
Figure 3.37 Step By Step Training
Figure 3.38 After Training
Figure 3.39 The display before summarizing the article
Figure 3.40 The display after summarizing the article
Figure 3.41 Statistic Page
Figure 3.43 datasetsources table and relation
Figure 3.44 originaldataset table and schema
Figure 3.45 Testing database schema
Figure 4.1 Bar Graph of The best split attribute
Figure 4.2 Interval Histogram
Figure 4.3 Summary Application Interface
Figure 4.4 Recall Histogram
Figure 4.5 Precision Histogram
Figure 4.6 F-Measure Histogram
Figure 4.7 Execution Time (millisecond) Histogram

LIST OF TABLES
Table 2.1 Ensemble decision making
Table 2.2 Example dataset with one feature
Table 2.3 Bootstrap Sample 1
Table 2.4 Bootstrap Sample 2
Table 2.5 Bootstrap Sample 3
Table 2.6 Bootstrap Sample 4
Table 2.7 Bootstrap Sample 5
Table 3.1 The structure of the converted dataset
Table 3.2 Dataset with sentence scoring
Table 3.3 Sentence and Label pairs
Table 3.4 Sentence Munging
Table 3.5 The Sentence Scoring Results
Table 3.6 The Feature Scaling Result
Table 3.7 Descriptive statistics of the dataset to be processed in training
Table 3.8 Rules Structure of decisions.txt
Table 4.1 Distributive Frequency of The best split attribute
Table 4.2 Structure rules extracted from Random Forest
Table 4.3 Overall test results for the OOB dataset
Table 4.4 Summary Application Interface
Table 4.5 Article Example 1
Table 4.6 Article Example 2
Table 4.7 Statistic of Example Article

ABSTRACT

Hafizh Hidayat, La Ode Abd. El. 2020. Automatic Text Summarization Using Sentence Scoring and Random Forest Algorithm. Undergraduate Thesis. Department of Informatics Engineering, Faculty of Science and Technology, State Islamic University of Maulana Malik Ibrahim Malang. Supervisors: (I) Dr. Cahyo Crysdian (II) Ainatul Mardhiyah, M.CS.

Keywords: Automatic Text Summarization, Sentence Scoring, Random Forest, IndoSum.

The transformation of digital media has produced an abundance of information. The purpose of this study is therefore to build a summarization system that automatically selects the most important information from a text. Sentence scoring builds an initial representation so that the system can identify what distinguishes each sentence. In extractive text summarization this is referred to as a statistical-based approach. It does not depend on a particular language (language independent), so the process requires no additional language-specific linguistic knowledge. The sentence scoring features then form the initial dataset for the random forest algorithm, which finds patterns or knowledge in the data. On the testing split of the IndoSum dataset, evaluation of the system with ROUGE-2 gives an average precision and recall of 0.4.

ABSTRAK (Indonesian-language abstract, translated)

Hafizh Hidayat, La Ode Abd. El. 2020. Automatic Text Summarization Using Sentence Scoring and the Random Forest Algorithm. Undergraduate Thesis (Skripsi). Department of Informatics Engineering, Faculty of Science and Technology, State Islamic University of Maulana Malik Ibrahim Malang. Supervisors: (I) Dr. Cahyo Crysdian (II) Ainatul Mardhiyah, M.CS.

Keywords: Automatic Text Summarization, Sentence Scoring, Random Forest, IndoSum.

The transformation of information media to digital form has led to an abundance of information. The purpose of this research is therefore to build a summarization system that can automatically pick out the most important information in a text. Sentence scoring assigns a weight to every sentence so that the system can recognize what distinguishes each sentence in the text. In extractive text summarization this is referred to as a statistical-based approach. It does not depend on a particular language (language independent), so the process requires no additional language-specific linguistic knowledge. The sentence scoring features then become training data for the Random Forest algorithm, which discovers patterns or knowledge in that data as it learns. Using the testing dataset from IndoSum, evaluation of the system with ROUGE-2 gives an average precision and recall of 0.4.

الملخص (Arabic-language abstract, translated)

Hafizh Hidayat, La Ode Abd. El. 2020. Automatic Text Summarization Using Sentence Scoring and the Random Forest Algorithm. Undergraduate Thesis. Informatics, Faculty of Science and Technology, State Islamic University of Maulana Malik Ibrahim Malang. Supervisors: (I) Dr. Cahyo Crysdian (II) Ainatul Mardhiyah, M.CS.

Keywords: Automatic Text Summarization, Sentence Scoring, Random Forest, IndoSum.

The transformation of digital media produces an abundant amount of information. The purpose of this study is therefore to build a summarization system that can automatically identify the most important information in a text. The sentence scoring method builds an initial representation so that the system can identify what distinguishes each sentence in the text. In extractive text summarization this is referred to as a statistical-based approach. It does not depend on a particular language (language independent), so the process requires no additional linguistic knowledge. Each sentence scoring feature then becomes part of the initial dataset for the Random Forest algorithm, which finds patterns or knowledge in the data. Using the testing dataset from IndoSum, evaluation of the system with ROUGE gives an average precision and recall of 0.4.

CHAPTER I
INTRODUCTION

1.1 Research Background

Information media are growing and transforming rapidly, and this growth is directly proportional to the amount of information being spread. Around 2.5 quintillion bytes of data are produced worldwide every day (Marr, 2018). Because Indonesia has the fourth-largest population in the world, text in Bahasa Indonesia contributes a large portion of that data.

Such a large amount of textual data results in information overload, a condition in which a person's effectiveness in obtaining relevant and useful information for their work is hampered by the large number of information sources (David and Lyn, 2008). News articles are one such source. Someone who wants to obtain important and relevant information by reading all of the available news will find it time-consuming and tiring. Automatically condensing that information into summaries would therefore be very helpful for extracting the essential information from an article. A reader who understands the summary of an article also indirectly understands the article as a whole, because the summary represents its important points. If the requirement is only to reduce reading time, make text selection easier, and make indexing more effective, then an automatic summarization system does not have to be as perfect as a human summarizer (Suyanto, 2018).

Furthermore, in the context of Islam, the Al-Qur'an is the main source of information and reference. However, understanding the content of the Qur'an as a whole is not easy, even though translations are available. Therefore Allah Subhaanahu wata'ala revealed the opening chapter of the Qur'an, Al-Fatihah, as a summary of the important lessons of all of its chapters. Hasan Al-Basri, one of the scholars among the tabi'in, stated that Allah Subhaanahu wata'ala summarized all the knowledge of the previous books in the Qur'an, and then summarized all the knowledge of the Qur'an in Al-Fatihah. Whoever comprehends the interpretation of Al-Fatihah is as though he has mastered the interpretation of all the revealed books (Arkoun, 1998).

For humans, reading an article and then drawing conclusions from it or retelling it is easy. For machines it is not. Sentence scoring aims to build an initial representation of an article so that the machine can identify what distinguishes each of its sentences. The sentence scoring features then form the initial dataset for the random forest algorithm, which finds patterns or knowledge in the data.

The random forest algorithm is an ensemble learning technique, specifically a variant of bagging (bootstrap aggregating), that can produce relatively low error rates; its main strength is random feature selection (subspace sampling). The learners being combined are decision tree classifiers built with the C4.5 algorithm. C4.5 uses the Gain Ratio measure, so it is not biased in determining the best split attribute when handling the numeric output of sentence scoring (Suyanto, 2018). C4.5 is also considered to have better accuracy and a lower computational cost (Sabuna & Setyohadi, 2017). The output of the random forest is then expected to serve as the prediction model for extracting automatic summaries from a document.

1.2 Research Question

a) How accurate is the Random Forest prediction model when measured using the Out-Of-Bag (OOB) dataset?
b) What is the score of the system summary when measured using ROUGE-2?

1.3 Research Objectives

a) To measure the accuracy of the Random Forest prediction model using the Out-Of-Bag (OOB) dataset.
b) To measure the score of the system summary using ROUGE-2.
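To make the second research question concrete, the following is a minimal sketch of how a ROUGE-2 score can be computed from bigram overlap between a system summary and a reference summary. It is an illustration only, not the evaluation code used in this thesis: the function names `bigrams` and `rouge_2` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def bigrams(text):
    """Return a multiset of word bigrams (assumes simple whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(system_summary, reference_summary):
    """Return (recall, precision, f_measure) based on clipped bigram overlap."""
    sys_bigrams = bigrams(system_summary)
    ref_bigrams = bigrams(reference_summary)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((sys_bigrams & ref_bigrams).values())
    # max(..., 1) avoids division by zero for texts shorter than two words.
    recall = overlap / max(sum(ref_bigrams.values()), 1)
    precision = overlap / max(sum(sys_bigrams.values()), 1)
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


if __name__ == "__main__":
    print(rouge_2("the cat sat on the mat", "the cat lay on the mat"))
```

In this toy pair, three of the five bigrams on each side match, so recall and precision are both 3/5.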

1.4 Research Scope

This study is limited in scope as follows:
a) The dataset of Indonesian articles is taken entirely from IndoSum (a new benchmark dataset for Indonesian text summarization) (Kurniawan & Louvan, 2018).
b) This study focuses on single-document summarization and extractive text summarization.

CHAPTER II
LITERATURE REVIEW

2.1 Automatic Text Summarization

In daily life, humans are accustomed to abstracting the important points from one or more sources of information when making decisions. Consider spending a weekend watching movies: with so many alternatives, viewers choose a film by looking at box office ratings, short official trailers, or reviews from independent film critics. This is the behavior we try to bring into machine learning, particularly the sub-field of natural language processing (NLP): how a machine can mimic the human way of thinking, and in this case how a machine can summarize the text of an article.

Automatic text summarization is an NLP topic with a fairly long research history. Text summarization has attracted researchers since the 1950s, beginning with H.P. Luhn, who was the first to study the topic. In 1958 he emphasized the significance of a word based on its frequency. In the same year, Baxendale studied the same field and proposed sentence-based features, finding that 7% of the sentences at the beginning and end of a document contained important information (Shirwandkar, 2018).

About half a century later, even though the field was already relatively old, some researchers in the early 2000s regarded it as still quite young, among other reasons because of the small number of available corpora (Hahn & Mani, 2000). At that time the tools and computing power for processing large amounts of data (big data) were not as capable as they are today. By contrast, one more recent study used a corpus of about 200 thousand English news articles to build its summarization system (Cheng & Lapata, 2016).

Building a text summarization system basically means answering two fundamental questions: how to select the important content in a document, and how to arrange the selected content into an appropriate summary (Bhatia & Jaiswal, 2016). In terms of the number of documents, there are two categories: single-document summarization and multi-document summarization. In single-document summarization the output summary is generated from one document, whereas a multi-document summary is generated from more than one document.

There are two broad approaches to building a text summarization system: extractive and abstractive. The extractive approach identifies the most important sentences in the text and compiles them into a summary. The abstractive approach instead identifies the important points in the text and then re-expresses them (paraphrasing) in a summary (Gambhir & Gupta, 2017).

In this study, we focus on single-document summarization and extractive text summarization. Unlike research on English text summarization, research on abstractive text summarization in Indonesian is not yet as mature as that on extractive text summarization. This is due to the lack of benchmark datasets and the differing evaluation metrics in use, which make it very difficult to compare one study with another (Kurniawan & Louvan, 2018).

2.1.1 Sentence Scoring

Sentence scoring assigns a weight to each sentence so that the machine can distinguish every sentence in the text. In extractive text summarization this is referred to as a statistical-based approach. It does not depend on a particular language (language independent), so the process requires no additional language-specific linguistic knowledge. The statistical features used include sentence position, positive and negative keywords (based on frequency counts), sentence centrality (similarity to other sentences), resemblance of the sentence to the title, relative sentence length, the presence of numerical data in the sentence, and the presence of proper nouns (Gambhir & Gupta, 2017).

Sentence scoring can stand alone as the basis of a summarization system, and some researchers also combine it with other methods. When sentence scoring is used alone, each sentence is weighted and the summary is then determined by the compression rate, the ranking of the sentences, or a threshold derived from the gold summaries.
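The stand-alone use of sentence scoring described above can be sketched as follows. This is a simplified illustration, not the thesis implementation: it computes only three of the listed features (position, relative length, and title overlap), and the function names, the punctuation handling, and the unweighted sum used for ranking are all assumptions made for the example.

```python
PUNCTUATION = ".,;:!?\"'()"


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip basic punctuation (assumed scheme)."""
    words = (w.strip(PUNCTUATION) for w in sentence.lower().split())
    return [w for w in words if w]


def score_sentences(title, sentences):
    """Return one feature dict per sentence, with every value in [0, 1]."""
    title_words = set(tokenize(title))
    lengths = [len(tokenize(s)) for s in sentences]
    longest = max(lengths) if lengths else 0
    n = len(sentences)
    features = []
    for i, sentence in enumerate(sentences):
        words = set(tokenize(sentence))
        features.append({
            # Earlier sentences score higher (1.0 for the first sentence).
            "position": (n - i) / n,
            # Length relative to the longest sentence in the document.
            "length": lengths[i] / longest if longest else 0.0,
            # Share of title words that also occur in the sentence.
            "title_similarity": (
                len(words & title_words) / len(title_words) if title_words else 0.0
            ),
        })
    return features


def extract_summary(title, sentences, compression_rate=0.3):
    """Keep the top-ranked sentences and restore their original order."""
    features = score_sentences(title, sentences)
    # Assumption: rank by an unweighted sum of the feature values.
    totals = [sum(f.values()) for f in features]
    keep = max(1, round(len(sentences) * compression_rate))
    ranked = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Restoring the original sentence order after ranking keeps the extracted summary readable, since the selected sentences still follow the flow of the source article.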

When sentence scoring is combined with other methods, the weights obtained for each sentence become the initial dataset for the next method. Some researchers combine sentence scoring with a classification method, in which a classifier decides which sentences of a given input should be prioritized. One example combines sentence scores with a decision tree, using statistical features that include TF-IDF, uppercase letters, nouns, cue phrases, numerical data, sentence length, sentence position, and similarity to the title. These features inform the decision tree in producing the resulting model (Sabuna & Setyohadi, 2017).

Broadly speaking, the sentence scoring features are used as the training dataset, so they are open to a wide range of machine learning approaches, whether supervised, unsupervised, or semi-supervised. Another line of work uses neural networks, including extracting summaries with deep learning. The statistical features used there include sentence position, sentence length, numerical tokens, TF-ISF, cosine similarity, bi-grams, tri-grams, proper nouns, and thematic words. The deep learning architecture used is the Restricted Boltzmann Machine (RBM), which has two layers: the visible layer and the hidden layer. Each node in the visible layer is connected to all nodes in the hidden layer (Shirwandkar & Kulkarni, 2018). The network is said to be well trained if it can perform the backward translation with high accuracy.

2.1.2 Evaluation Metrics

Just as summarizing text needs to be automated, assessing the quality of the resulting summaries also needs to be automated. Since 2004, the Document Understanding Conference (DUC) has used ROUGE as the de facto standard for evaluating text summaries (Microsoft, 2019). ROUGE is still the standard metric in text summarization research (Kurniawan & Louvan, 2018). One of its strengths is the demonstrated correlation between the assessments produced by ROUGE and assessments made by humans (Lin, 2004). ROUGE therefore serves as a benchmark for comparing the results of one study with those of other studies.

However, ROUGE is not the only choice, particularly for the abstractive approach. Because ROUGE matches sentences on the basis of n-grams, it tends to give low scores to abstractive summaries. An alternative is Semantic Similarity Checking (SSC), which measures the compatibility between two documents that have different structures (Koto, 2016).

2.2 Ensemble Learning

Most machine learning algorithms focus on building a single model that predicts a particular case accurately. Ensemble learning takes a different approach: rather than relying on one prediction model, it relies on a collection of models and produces predictions by aggregating the output of each model. The collection of models is called the ensemble model. An ensemble model has two characteristics:

1. It builds several different models from the same dataset, using versions of the original dataset that have been modified into several pseudo-datasets.
2. It makes predictions by combining the predictions of the different models. For a categorical target, a voting mechanism over the models can be used; for a continuous target, a measure of central tendency over the model predictions, such as the mean or median, can be used.

Ensemble learning is analogous to a group of experts working together to solve a problem. With this combination, the success rate is expected to be higher than that of one expert working alone. Besides cooperating, the models must also be independent of one another: no model depends on, or is built on top of, another model. With a large population of independent models, the ensemble can be very accurate even if one or two of its models perform worse than random prediction (Kelleher et al., 2015).

Some researchers also consider that ensemble learning can compensate for the weaknesses of a machine learning method applied to a case that does not suit the character of the model, such as being unstable and easily trapped in overfitting.

As a case study, consider gold price prediction with two labels, Up and Down. Suppose three institutions (A, B, C) release their predictions for the next day's gold price, respectively Up, Up, and Down. The three institutions developed their prediction models using independent data and techniques, with prediction accuracy over the past year of 80%, 82%, and 85%, respectively. Under the second characteristic of the ensemble model, we do not simply follow the institution with the highest accuracy (85%), which predicts that the gold price will go Down. Instead, with majority voting, the ensemble predicts that the gold price will go Up. The justification is obtained by calculating the accuracy of the aggregated prediction with a simple probability calculation, namely the probability that the combined vote of the three institutions is correct:

$$P = (0.80)(0.82)(0.15) + (0.80)(0.18)(0.85) + (0.20)(0.82)(0.85) + (0.80)(0.82)(0.85) = 0.9178$$

The calculation above sums three events in which two votes are right and one vote is wrong, plus one event in which all three votes are right (consensus), as illustrated in Table 2.1. The resulting probability of being correct is considerably higher than the accuracy of each stand-alone model (Suyanto, 2018).
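The majority-voting argument above can be checked numerically. The following is a minimal sketch that assumes the three institutions err independently; the function name `majority_vote_accuracy` is illustrative and not part of the thesis code.

```python
from itertools import product

# Accuracies of the three independent institutions (A, B, C) from the example.
accuracies = [0.80, 0.82, 0.85]


def majority_vote_accuracy(accs):
    """Probability that a strict majority of independent voters is correct."""
    total = 0.0
    # Enumerate every outcome: 1 means the voter is right, 0 means wrong.
    for outcome in product([0, 1], repeat=len(accs)):
        if sum(outcome) * 2 > len(accs):  # strict majority is right
            prob = 1.0
            for correct, acc in zip(outcome, accs):
                prob *= acc if correct else (1.0 - acc)
            total += prob
    return total


ensemble_accuracy = majority_vote_accuracy(accuracies)
print(round(ensemble_accuracy, 4))
```

The enumeration visits the same eight occurrences as Table 2.1 and keeps the four rows in which the vote is correct.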

Table 2.1 Ensemble decision making.

Occurrence   A   B   C   Voting
1            0   0   0   0
2            0   0   1   0
3            0   1   0   0
4            0   1   1   1
5            1   0   0   0
6            1   0   1   1
7            1   1   0   1
8            1   1   1   1

2.2.1 Bootstrap Aggregating (Bagging)

Bagging works by drawing random subsets of the dataset, which are referred to as bootstrap samples. Each bootstrap sample has the same size as the original dataset and is obtained by resampling the dataset with replacement. As a consequence, each bootstrap sample contains duplicate data objects, while some data objects are missing from one sample but present in another. The bootstrap samples therefore differ from one another, and so do the prediction models built from them (Kelleher et al., 2015).
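Sampling with replacement, as described above, can be sketched in a few lines. This is only an illustration: the helper name `bootstrap_sample` and the fixed seed are assumptions, and the toy data mirrors the one-feature example dataset used later in this chapter (Table 2.2).

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement; also return the left-out items."""
    n = len(data)
    indices = [rng.randrange(n) for _ in range(n)]
    chosen = set(indices)
    sample = [data[i] for i in indices]
    # Items never drawn form the out-of-bag set for this sample.
    out_of_bag = [data[i] for i in range(n) if i not in chosen]
    return sample, out_of_bag


# Toy (x, y) pairs, one feature and one target.
dataset = [(2.4, 1), (3.1, 1), (3.7, 1), (4.4, -1), (5.2, -1),
           (5.8, -1), (6.4, -1), (7.5, -1), (8.6, 1), (9.4, 1)]

rng = random.Random(0)  # arbitrary seed, only for reproducibility
samples = [bootstrap_sample(dataset, rng) for _ in range(5)]

for sample, out_of_bag in samples:
    print(sorted(x for x, _ in sample), "left out:", len(out_of_bag))
```

Each sample has the same size as the original dataset, typically repeats some objects, and omits others, which is what makes the resulting models differ.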

As explained in Section 2.2, ensemble learning techniques can overcome overfitting. Besides overcoming overfitting, bagging is a general-purpose procedure (its use is not limited to certain methods) for reducing variance. Mathematically, given a set of $n$ independent observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is $\sigma^2 / n$. In other words, averaging a set of observations reduces the variance.

Reducing the variance and increasing prediction accuracy in this way requires taking many training sets from the population, building a separate prediction model on each training set, and averaging the resulting predictions. In other words, we can calculate $\hat{f}^{1}(x), \hat{f}^{2}(x), \dots, \hat{f}^{B}(x)$ using $B$ separate training sets and average them to obtain a single low-variance model, as in Equation 2-1.

$$\hat{f}_{\mathrm{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x) \qquad \text{(2-1)}$$

In general, however, this is not practical because we often do not have many training sets. We therefore use the bootstrap, taking repeated samples from a single training dataset. In this approach, we generate $B$ different bootstrapped training datasets from the same dataset. We then train our method on the $b$th bootstrap sample to get $\hat{f}^{*b}(x)$ and average all the predictions, as in Equation 2-2. This is what is referred to as bagging (James et al., 2013).
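The variance-reduction claim above follows in one line, assuming the observations are independent and share a common variance. The symbols $Z_i$, $\sigma^2$, and $n$ are notation chosen here for illustration:

```latex
\[
\operatorname{Var}(\bar{Z})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Z_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(Z_i)
  = \frac{n\,\sigma^{2}}{n^{2}}
  = \frac{\sigma^{2}}{n}.
\]
```

The second equality relies on independence. Bootstrap samples drawn from the same dataset are not fully independent, so in practice the reduction is smaller than this idealized bound suggests.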

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x) \qquad \text{(2-2)}$$

As an illustration of bootstrap samples, suppose we have a dataset with one feature x and target y, as shown in Table 2.2.

Table 2.2 Example dataset with one feature.

x   2.4   3.1   3.7   4.4   5.2   5.8   6.4   7.5   8.6   9.4
y     1     1     1    -1    -1    -1    -1    -1     1     1

Five bootstrap samples of 10 data objects each that can be generated from this dataset are as follows (Suyanto, 2018):

Table 2.3 Bootstrap Sample 1.

x   2.4   3.1   3.7   5.2   5.2   6.4   6.4   7.5   8.6   9.4
y     1     1     1    -1    -1    -1    -1    -1     1     1

Table 2.4 Bootstrap Sample 2.

x   2.4   2.4   2.4   3.1   3.7   4.4   4.4   5.2   5.8   8.6
y     1     1     1     1     1    -1    -1    -1    -1     1

Table 2.5 Bootstrap Sample 3.

x   2.4   4.4   4.4   5.2   5.2   5.8   6.4   7.5   8.6   9.4
y     1    -1    -1    -1    -1    -1    -1    -1     1     1

Table 2.6 Bootstrap Sample 4.

x   2.4   3.1   3.7   3.7   4.4   4.4   4.4   5.2   7.5   8.6
y     1     1     1     1    -1    -1    -1    -1    -1     1

Table 2.7 Bootstrap Sample 5.

x   2.4   3.1   3.7   5.8   7.5   8.6   8.6   8.6   9.4   9.4
y     1     1     1    -1    -1     1     1     1     1     1

2.2.2 Random Forest

One of the most suitable algorithms to combine with bagging is the decision tree, because a decision tree is very sensitive to changes in the dataset (it overfits easily). Even a slight change can alter which feature is selected to split the dataset at the root (the feature split), which in turn determines every subtree that follows (James et al., 2013).

Figure 2.1 Ensemble decision making. Source: (Kelleher et al., 2015).

When bagging is applied to decision trees, it is common to also choose a random subset of the available features. This random sampling of features is referred to as subspace sampling. Subspace sampling plays a major role in creating diversity in the ensemble, and it also reduces the time complexity of training. The combination of bagging, subspace sampling, and decision trees is called a random forest (Kelleher et al., 2015).

In other words, plain bagging considers all $p$ predictors (features) in every bootstrap sample, that is $m = p$, whereas a random forest considers only a portion $m$ of the predictors. The usual choice is around the square root of the total number of predictors, $m \approx \sqrt{p}$ (James et al., 2013), or 20% of the number of predictors (Suyanto, 2018). As a result, the decision trees that are built are less dependent on one another, because a predictor that dominates the splits in one bootstrap sample may not be available at all in another. On average, a fraction $(p - m)/p$ of the splits will not consider the dominant predictor, so the other predictors have more of a chance (James et al., 2013).
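Subspace sampling can be illustrated with a short sketch. It assumes the square-root rule of thumb for the number of predictors and uses nine placeholder feature names; the helper names are hypothetical and not taken from the thesis code.

```python
import math
import random


def subspace_size(p):
    """Number of predictors to consider, using the m ~ sqrt(p) rule of thumb."""
    return max(1, round(math.sqrt(p)))


def sample_subspace(features, rng):
    """Pick a random subset of features without replacement."""
    return rng.sample(features, subspace_size(len(features)))


features = [f"F{i}" for i in range(1, 10)]  # nine placeholder predictors
m = subspace_size(len(features))
subset = sample_subspace(features, random.Random(42))  # arbitrary seed

# Share of draws in which one given (for example, dominant) predictor is absent.
excluded_fraction = (len(features) - m) / len(features)
print(m, subset, round(excluded_fraction, 3))
```

With nine predictors the rule gives three predictors per subset, so any single predictor is absent from two thirds of the subsets on average.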

CHAPTER III
SYSTEM DESIGN AND IMPLEMENTATION

3.1 System Design

The system design in Figure 3.1 consists of several components: Indosum dataset customization, preprocessing, sentence scoring, random forest, and applying rules to generate the system summary.

Figure 3.1 System Design.

Each component is elaborated in the following process:

1) The Indosum dataset, which is originally in the JSON line format, is converted into a dataframe.
2) Each sentence in the dataframe is preprocessed by filtering out irrelevant tokens and returning each word to its basic form.
3) Each sentence is then given feature values that serve as its unique identity among the sentences.
4) Random Forest processes each sentence along with its features to produce output in the form of rules.
5) These rules become the knowledge base for determining which sentences are candidates for the system summary.

3.1.1 Dataset Customization

Indosum is a dataset that contains news articles from various online media sources in Indonesia, along with their summaries. The dataset can be accessed through its GitHub repository (https://github.com/kata-ai/indosum). Indosum provides nearly 20 thousand articles, almost 200 times more than previously available datasets (Kurniawan & Louvan, 2018). The Indosum dataset is stored in the .jsonl (JSON line) file format and divided into a 5-fold corpus. The dataset statistics are shown in Figure 3.2, and the data structure of each article is illustrated in Figure 3.3.

Figure 3.2 Statistical input dataset.

Figure 3.3 The structure of the article in the dataset.

Not all fields in the JSON are used in this study. We therefore convert the dataset from the JSON line format into a dataframe structure, commonly known as a matrix, as in Table 3.1.

Table 3.1 The structure of the converted dataset.

In Table 3.1, the document is an article found in each JSON line, the sentence is one of the sequence of sentences in each article, and the label is sourced from the gold labels in the JSON structure in Figure 3.3, which are the labels for extractive text summarization. This dataframe is then processed in the next phase.

3.1.2 System Development

This phase is the most important because it produces the prediction model that is used as the knowledge base for building the text summarization system.

3.1.2.1 Preprocessing

Preprocessing is needed to support the performance of the main algorithm. It serves to present a consistent dataset. The process is shown in Figure 3.4.

Figure 3.4 Preprocessing.
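The dataset conversion described in Section 3.1.1 can be sketched as follows. This is a minimal sketch under stated assumptions: the field names `paragraphs` and `gold_labels`, and their nesting (paragraphs containing tokenized sentences, with a parallel structure of boolean labels), are assumed rather than confirmed by the text, and plain dictionaries stand in for dataframe rows.

```python
import json


def jsonl_to_rows(lines):
    """Flatten JSON-line articles into one row per sentence."""
    rows = []
    for doc_id, line in enumerate(lines):
        article = json.loads(line)
        sentence_no = 0
        # Assumed layout: paragraphs -> sentences -> tokens, with parallel labels.
        for paragraph, labels in zip(article["paragraphs"], article["gold_labels"]):
            for tokens, label in zip(paragraph, labels):
                rows.append({
                    "document": doc_id,
                    "sentence": sentence_no,
                    "tokens": tokens,
                    "label": label,
                })
                sentence_no += 1
    return rows


# A hand-written stand-in for one line of the .jsonl file.
example_line = json.dumps({
    "paragraphs": [
        [["MU", "sempat", "unggul", "."], ["Usai", "pertandingan", "."]],
        [["Mourinho", "mengaku", "khawatir", "."]],
    ],
    "gold_labels": [[True, False], [False]],
})

rows = jsonl_to_rows([example_line])
for row in rows:
    print(row)
```

The resulting list of rows has the three columns used in the rest of the pipeline (document, sentence, label) plus the token list, and could be passed to a dataframe constructor if one is available.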

Preprocessing in NLP generally starts with tokenization, that is, separating sentences into tokens. Because we use the Indosum dataset, each sentence has already been tokenized. Figure 3.5 illustrates the basic structure of a tokenized sentence in the dataset.

Figure 3.5 Sentence token.

Next, the words in each sentence undergo case folding: every token that contains capital letters is converted to lowercase. Punctuation such as periods (.), exclamation points (!), question marks (?), and others is then removed from each sentence. Once case folding and punctuation removal are complete, stopwords are removed. Stopwords are words that are considered irrelevant or that carry no particular meaning, such as conjunctions. At this point, the number of words in each sentence has been reduced. Finally, each word in the sentence is lemmatized, that is, returned to its basic form. The lemmatization reference is taken from the spaCy library (https://spacy.io).

3.1.2.2 Sentence Scoring

Each sentence is given its scores step by step, as shown in Figure 3.6.

Figure 3.6 Sentence Scoring Process.

The details of each feature are as follows:

• Word Frequency (F1)
In general, the most frequently occurring words are indicators of information. The relative frequency of a word is the number of its occurrences compared with the whole document. The word frequency score of a sentence is the accumulated relative frequency of the words in that sentence.

(3-1)

• Title Similarity (F2)
If words from the title appear in a sentence, the score can be calculated as in Equation 3-2:

(3-2)

The score is the number of title words contained in the sentence divided by the number of words in the title (Krishnaveni & Balasundaram, 2017).

• Sentence Position (F3)
Sentences at the beginning and the end of the article are important and have the maximum information value. Sentence position can be calculated as in Equation 3-3:

(3-3)

The calculation uses the total number of sentences and the location of the sentence in the article (Shirwandkar & Kulkarni, 2018).

• Sentence Length (F4)
Short sentences do not contain much meaningful information. To identify important sentences based on their length, the score can be calculated as in Equation 3-4 (Shirwandkar & Kulkarni, 2018):

(3-4)

• Sentence Centrality (F5)
Sentence centrality is based on the similarity score of each pair of sentences (Zheng & Lapata, 2019).

(3-5)

• TF-ISF (Term Frequency - Inverse Sentence Frequency) (F6)
Term frequency - inverse document frequency is widely used in information retrieval. In the context of single-document summarization, sentences are treated as documents. TF-ISF can be calculated as in Equation 3-6.

(3-6)

The calculation uses the total number of appearances of each term across all sentences, the number of appearances of the term in the ith sentence, and the number of words in the sentence (Shirwandkar & Kulkarni, 2018).

• Cosine Similarity (F7)
This feature is the cosine similarity between each sentence and the centroid. The centroid is the sentence with the largest TF-ISF value. The calculation is as in Equation 3-7 (Shirwandkar & Kulkarni, 2018):

(3-7)

• Bi-Gram (F8)
A bi-gram is a pair of two adjacent words in a sentence (Shirwandkar & Kulkarni, 2018).

(3-8)

• Tri-Gram (F9)
A tri-gram extends the bi-gram to a sequence of three adjacent words in a sentence (Shirwandkar & Kulkarni, 2018).

(3-9)

After the features of each sentence have been extracted, the dataframe, which originally consisted of only three columns, is extended with one column per feature, as shown in Table 3.2.

Table 3.2 Dataset with sentence scoring.

The process does not stop once the features have been extracted. Feature extraction is likely to produce features whose values lie on very different ranges. Feature scaling is therefore necessary: we normalize each feature to the range 0 to 1 without affecting the shape of its distribution. For this step we use MinMaxScaler from the scikit-learn library (Scikit-Learn, 2019).
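Two of the steps above, TF-ISF with its cosine-to-centroid feature and the final rescaling to the range 0 to 1, can be sketched in plain Python. The weighting `tf * log(N / sf)` and the use of the summed weights as a sentence's TF-ISF value are common formulations assumed here for illustration, and may differ from Equations 3-6 and 3-7; `min_max_scale` is a hand-written stand-in for the scaling step, not the scikit-learn implementation.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """One {term: weight} dict per sentence, with weight = tf * log(N / sf)."""
    n = len(sentences)
    sentence_freq = Counter(term for s in sentences for term in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append({t: tf[t] * math.log(n / sentence_freq[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Three short, already preprocessed sentences (illustrative tokens).
sentences = [
    ["gmic", "indonesia", "2017"],
    ["gmic", "teknologi"],
    ["bisnis", "teknologi", "bisnis", "digital"],
]

vectors = tf_isf_vectors(sentences)
tf_isf_scores = [sum(v.values()) for v in vectors]
centroid = vectors[tf_isf_scores.index(max(tf_isf_scores))]
cosine_scores = [cosine(v, centroid) for v in vectors]
scaled_tf_isf = min_max_scale(tf_isf_scores)
print(cosine_scores, scaled_tf_isf)
```

Scaling each feature column independently keeps the ordering of sentences within a feature while placing all nine features on a comparable range.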

3.1.2.3 Random Forest

The Random Forest algorithm relies on the bagging technique and a tree classifier, as shown in Figure 3.7.

Figure 3.7 Random Forest Process.

The process starts by redrawing the dataset into bootstrap samples, so that there are fifty bootstrap samples in total. At the same time, the out-of-bag samples are set aside; these are the data that are not selected into a bootstrap sample. The out-of-bag samples become important later in the system evaluation process. It is also crucial to limit the number of features used in each bootstrap sample (the random subspace method). In this case, three of the nine sentence scoring features are chosen at random as predictors for each bootstrap sample.

Random forest is closely related to the decision tree classifier. We chose the C4.5 algorithm as the tree classifier for the random forest, and each bootstrap sample is fed to the C4.5 classifier. The gain ratio is the key indicator for determining the best split attribute, which is then translated into decision rules. Several steps are needed to obtain the gain ratio.

(3-10)

The first step is to compute the entropy, which measures the diversity of the labels as a whole. The entropy is computed over the classes or labels; with reference to the dataset structure in Table 3.2, the label in question is the label column, and the proportion used in the formula is the share of one label relative to the others.

(3-11)
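The gain-ratio steps used by C4.5 can be sketched with the textbook definitions of entropy, information gain, and split information. The sketch assumes a binary split of a numeric feature at a threshold, which may differ in detail from Equations 3-10 to 3-13; the function names and toy data are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting a numeric feature at the given threshold."""
    left = [y for x, y in zip(values, labels) if threshold >= x]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    parts = (left, right)
    # Information gain: entropy before the split minus weighted entropy after.
    weighted = sum(len(part) / n * entropy(part) for part in parts)
    gain = entropy(labels) - weighted
    # Split information penalizes splits into many or very uneven parts.
    split_info = -sum(len(part) / n * math.log2(len(part) / n) for part in parts)
    return gain / split_info


# Toy scaled feature values with summary labels (True = in the summary).
x = [0.1, 0.2, 0.8, 0.9]
y = [False, False, True, True]
ratio = gain_ratio(x, y, 0.5)
print(entropy(y), ratio)
```

In this toy case the threshold separates the two classes perfectly, so both the information gain and the split information equal one bit.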

After obtaining the diversity of the labels, the effectiveness of each feature in the dataset is measured by calculating its information gain. The calculation involves the feature in question, the number of samples at a predetermined threshold, and the entropy of each of the feature's values.

(3-12)

Then, before obtaining the gain ratio, the split information of each attribute must be calculated from the subsets of the data formed by the values of that attribute. Once the split information has been obtained, the gain ratio can be formulated as in Equation 3-13.

(3-13)

3.2 System Implementation

The application was built with Python 3.7 on top of the Django 2.2 web framework, which acts as the backend. PostgreSQL 10.12 is used as the database to make the website more dynamic. The full application code can be accessed through the GitLab repository at https://gitlab.com/ELHafizh/auto-summarizer.

3.2.1 Sentence Preprocessing

This stage prepares the input for the Sentence Scoring algorithm. In the program code it is implemented as the SentenceMunging class, which can also be called sentence preprocessing. As shown in Figure 3.8, SentenceMunging requires a tokens variable as input, containing the collection of words of one sentence. An example of the input tokens is shown in Figure 3.9.

    class SentenceMunging:
        def __init__(self, tokens):
            self._tokens = tokens
            # Run the four preprocessing steps in order.
            self.remove_punctuations()
            self.case_folding()
            self.remove_stopword()
            self.lemmatization()

Figure 3.8 SentenceMunging class constructor.

    ['Jakarta', ',', 'CNN', 'Indonesia', '-', '-', 'Menteri', 'Riset', ',',
     'Teknologi', ',', 'dan', 'Pendidikan', 'Tinggi', 'Muhammad', 'Nasir',
     'akan', 'membentuk', 'tim', 'khusus', 'untuk', 'menyelidiki', 'dugaan',
     'permainan', 'uang', 'pada', 'pemilihan', 'rektor', 'universitas',
     'negeri', '.']

Figure 3.9 Example of input data for the SentenceMunging class.

As shown in Figure 3.8, the SentenceMunging algorithm consists of four functions: remove_punctuations(), case_folding(), remove_stopword(), and lemmatization(). The four functions are executed one by one in sequence.

    # Requires `import re` at module level.
    def remove_punctuations(self):
        temp_tokens = []
        regex_punct = re.compile(r'[:;\/\"\(\)\.\,!?\-\“\”]')
        regex_ws = re.compile(r'\s')
        regex_num = re.compile(r'\d')
        for token in self._tokens:
            match_punct = regex_punct.search(token)
            match_ws = regex_ws.search(token)
            match_num = regex_num.search(token)
            if match_num:
                # Keep any token that contains a digit.
                temp_tokens.append(token)
            elif match_punct or match_ws:
                # Drop tokens containing punctuation or whitespace.
                pass
            else:
                temp_tokens.append(token)
        self._tokens = temp_tokens

Figure 3.10 remove_punctuations() method.

In Figure 3.10, the remove_punctuations() function eliminates tokens that are identified as punctuation. Punctuation in the tokens is identified using regular expressions. The token array in Figure 3.9 is thereby processed into the array in Figure 3.11.

    ['Jakarta', 'CNN', 'Indonesia', 'Menteri', 'Riset', 'Teknologi', 'dan',
     'Pendidikan', 'Tinggi', 'Muhammad', 'Nasir', 'akan', 'membentuk', 'tim',
     'khusus', 'untuk', 'menyelidiki', 'dugaan', 'permainan', 'uang', 'pada',
     'pemilihan', 'rektor', 'universitas', 'negeri']

Figure 3.11 Example output of the remove_punctuations() function.

    def case_folding(self):
        temp_tokens = []
        for token in self._tokens:
            temp_tokens.append(token.lower())
        self._tokens = temp_tokens

Figure 3.12 case_folding() method.

To ensure that every word token has a consistent form, each capital letter is converted to lowercase by the case_folding() function. The token array in Figure 3.11, which has already been cleared of punctuation, is converted to lowercase as shown below.

    ['jakarta', 'cnn', 'indonesia', 'menteri', 'riset', 'teknologi', 'dan',
     'pendidikan', 'tinggi', 'muhammad', 'nasir', 'akan', 'membentuk', 'tim',
     'khusus', 'untuk', 'menyelidiki', 'dugaan', 'permainan', 'uang', 'pada',
     'pemilihan', 'rektor', 'universitas', 'negeri']

Figure 3.13 Example output of the case_folding() function.

    # `nlp` is the spaCy pipeline object loaded at module level.
    def remove_stopword(self):
        temp_tokens = []
        for token in self._tokens:
            doc = nlp(token)
            if not doc[0].is_stop:
                temp_tokens.append(token)
        self._tokens = temp_tokens

Figure 3.14 remove_stopword() function.

The word tokens in Figure 3.13 are then processed in the third stage, which drops the tokens identified as stopwords. A stopword is a word that is very common or appears most often in sentences. The stopword collection comes from the spaCy library. The remove_stopword() algorithm in Figure 3.14 generates a new array with fewer tokens, as shown in Figure 3.15.

    ['jakarta', 'cnn', 'indonesia', 'menteri', 'riset', 'teknologi',
     'pendidikan', 'muhammad', 'nasir', 'membentuk', 'tim', 'khusus',
     'menyelidiki', 'dugaan', 'permainan', 'uang', 'pemilihan', 'rektor',
     'universitas', 'negeri']

Figure 3.15 Example output of the remove_stopword() function.

    def lemmatization(self):
        temp_tokens = []
        for token in self._tokens:
            doc = nlp(token)
            temp_tokens.append(doc[0].lemma_)
        self._tokens = temp_tokens

Figure 3.16 lemmatization() function.

The tokens removed as stopwords by the remove_stopword() function are dan, tinggi, akan, untuk, and pada. The last function is lemmatization(). The lemmatization() code in Figure 3.16 returns each token to its basic word based on morphological analysis, using the spaCy library (https://spacy.io/). The token array in Figure 3.15 is transformed as shown in Figure 3.17.

Figure 3.17 Tokens transformation.

Having described the input and output of each function in the SentenceMunging class, we now describe how SentenceMunging is applied to each sentence in an article. As can be seen in Figure 3.3, the Indosum dataset has many components, including the paired paragraphs and gold labels. The system customizes the dataset around these two components so that they become the main input of the new dataset, as illustrated in Table 3.3.

Table 3.3 Sentence and Label pairs.

No. 1 (Label: True)
  Sentence: Jakarta, CNN Indonesia- -Manajer Manchester United, Jose Mourinho, tidak punya kritikan untuk para pemainnya setelah The Red Devils hanya mampu bermain imbang 1- 1 saat menjamu Swansea City pada lanjutan Liga Primer Inggris di Stadion Old Trafford, Minggu (30 /4) .
  Sentence Token: ['Jakarta', ',', 'CNN', 'Indonesia', '-', '-', 'Manajer', 'Manchester', 'United', ',', 'Jose', 'Mourinho', ',', 'tidak', 'punya', 'kritikan', 'untuk', 'para', 'pemainnya', 'setelah', 'The', 'Red', 'Devils', 'hanya', 'mampu', 'bermain', 'imbang', '1', '-', '1', 'saat', 'menjamu', 'Swansea', 'City', 'pada', 'lanjutan', 'Liga', 'Primer', 'Inggris', 'di', 'Stadion', 'Old', 'Trafford', ',', 'Minggu', '(', '30', '/', '4', ')', '.']

No. 2 (Label: True)
  Sentence: MU sempat unggul lewat gol penalti kontroversial Wayne Rooney di pengujung babak pertama.
  Sentence Token: ['MU', 'sempat', 'unggul', 'lewat', 'gol', 'penalti', 'kontroversial', 'Wayne', 'Rooney', 'di', 'pengujung', 'babak', 'pertama', '.']

No. 3 (Label: False)
  Sentence: Usai pertandingan Mourinho mengaku puas dengan penampilan MU.
  Sentence Token: ['Usai', 'pertandingan', 'Mourinho', 'mengaku', 'puas', 'dengan', 'penampilan', 'MU', '.']

No. 4 (Label: False)
  Sentence: "Kami kehilangan pemain dan kehilangan poin, jadi hari ini tidak terlalu buruk.
  Sentence Token: ['"', 'Kami', 'kehilangan', 'pemain', 'dan', 'kehilangan', 'poin', ',', 'jadi', 'hari', 'ini', 'tidak', 'terlalu', 'buruk', '.']

No. 5 (Label: False)
  Sentence: "Kami punya skuat dari 22 pemain, dan berkurang menjadi 13 atau 14.
  Sentence Token: ['"', 'Kami', 'punya', 'skuat', 'dari', '22', 'pemain', ',', 'dan', 'berkurang', 'menjadi', '13', 'atau', '14', '.']

No. 6 (Label: False)
  Sentence: Mourinho mengaku khawatir dengan cedera yang dialami Shaw.
  Sentence Token: ['Mourinho', 'mengaku', 'khawatir', 'dengan', 'cedera', 'yang', 'dialami', 'Shaw', '.']

No. 7 (Label: False)
  Sentence: "Saya tidak tahu cedera apa.
  Sentence Token: ['"', 'Saya', 'tidak', 'tahu', 'cedera', 'apa', '.']

No. 8 (Label: False)
  Sentence: Hasil imbang membuat MU gagal menggeser Manchester City dari posisi empat besar klasemen Liga Primer.
  Sentence Token: ['Hasil', 'imbang', 'membuat', 'MU', 'gagal', 'menggeser', 'Manchester', 'City', 'dari', 'posisi', 'empat', 'besar', 'klasemen', 'Liga', 'Primer', '.']

A dataset with the components shown in Table 3.3, as obtained right after conversion, can be categorized as a raw dataset. A raw dataset cannot be fed directly into the main algorithm because it still contains noise. As can be seen in the sentence token column of Table 3.3, many tokens are not words, such as punctuation. A further conversion stage is therefore needed, which applies the preprocessing described in Section 3.1.2.1: the SentenceMunging code is applied to each sentence token list in Table 3.3. The results are shown in Table 3.4.

Table 3.4 Sentence Munging.

No. 1 (Label: True)
  Sentence: Jakarta, CNN Indonesia- -Manajer Manchester United, Jose Mourinho, tidak punya kritikan untuk para pemainnya setelah The Red Devils hanya mampu bermain imbang 1- 1 saat menjamu Swansea City pada lanjutan Liga Primer Inggris di Stadion Old Trafford, Minggu (30 /4) .
  Sentence Token: ['jakarta', 'cnn', 'indonesia', 'manajer', 'manchester', 'united', 'jose', 'mourinho', 'kritikan', 'main', 'the', 'red', 'devils', 'main', 'imbang', '1', '1', 'jamu', 'swansea', 'city', 'lanjut', 'liga', 'primer', 'inggris', 'stadion', 'old', 'trafford', 'minggu', '30', '4']

No. 2 (Label: True)
  Sentence: MU sempat unggul lewat gol penalti kontroversial Wayne Rooney di pengujung babak pertama.
  Sentence Token: ['mu', 'unggul', 'gol', 'penalti', 'kontroversial', 'wayne', 'rooney', 'ujung', 'babak']

No. 3 (Label: False)
  Sentence: Usai pertandingan Mourinho mengaku puas dengan penampilan MU.
  Sentence Token: ['tanding', 'mourinho', 'aku', 'puas', 'tampil', 'mu']

No. 4 (Label: False)
  Sentence: "Kami kehilangan pemain dan kehilangan poin, jadi hari ini tidak terlalu buruk.
  Sentence Token: ['hilang', 'main', 'hilang', 'poin', 'buruk']

No. 5 (Label: False)
  Sentence: "Kami punya skuat dari 22 pemain, dan berkurang menjadi 13 atau 14.
  Sentence Token: ['skuat', '22', 'main', 'kurang', '13', '14']

No. 6 (Label: False)
  Sentence: Mourinho mengaku khawatir dengan cedera yang dialami Shaw.
  Sentence Token: ['mourinho', 'aku', 'khawatir', 'cedera', 'dialami', 'shaw']

No. 7 (Label: False)
  Sentence: "Saya tidak tahu cedera apa.
  Sentence Token: ['cedera']

No. 8 (Label: False)
  Sentence: Hasil imbang membuat MU gagal menggeser Manchester City dari posisi empat besar klasemen Liga Primer.
  Sentence Token: ['hasil', 'imbang', 'mu', 'gagal', 'geser', 'manchester', 'city', 'posisi', 'klasemen', 'liga', 'primer']

Comparing the sentence token columns of Table 3.3 and Table 3.4 shows a significant difference: in Table 3.4 the tokens are more consistent. For example, there are no more punctuation marks, only tokens of words and numbers.

3.2.2 Sentence Weighting

Each sentence has now been paired with its label. Next, each sentence is weighted using the Sentence Scoring algorithm of Section 3.1.2.2.

    class SentenceScoring:
        def __init__(self, title, articles):
            self._articles = articles
            self._title = title
            self._score = {}
            # Compute the nine sentence scoring features (F1 to F9).
            self.word_frequency()
            self.title_similarity()
            self.sentence_position()
            self.sentence_length()
            self.centrality()
            self.tf_isf()
            self.calculate_cosim()
            self.bi_gram()
            self.tri_gram()

(57) 37 Figure 3.18 SentenceScoring Class constructor.. The code program class SentenceScoring in Figure 3.18 is an implementation of the Sentence Scoring algorithm in Section 3.1.2.2. The parameters of the SentenceScoring program code in Figure 3.18 are article titles and two-dimensional arrays for example as shown in Figure 2.1. [['portal', 'berita', 'startup', 'inovasi', 'teknologi'], ['dapatkan', 'diskon', '20',. '%',. 'tiket',. 'gmic',. 'indonesia',. '2017',. 'laman',. 'deals'],. ['–',. 'dailysocial', 'media', 'partner', 'gmic', 'indonesia', '2017'], ['informasi', 'terkini', 'gmic', 'sila', 'kunjungi'], ['gmic', 'indonesia', '2017', 'harap', 'wadah', 'gerak',. 'giat',. 'teknologi',. 'bidang', 'mobile',. 'intelligence',. 'gmic',. 'paham',. 'dalam',. 'internet'], ['tema',. 'harap',. 'katalis',. 'industri',. 'teknologi',. 'new', 'frontiers',. 'bidang',. 'teknologi',. 'of',. 'temu',. 'tukar', 'pikir'], ['gmic', 'indonesia', '2017', 'diselenggarakan', 'tanggal', '26', 'september'], ['gwc', 'global', 'selenggara', 'gmic', 'indonesia', '2017', 'wadah',. 'giat',. 'teknologi',. 'paham',. 'dalam',. 'industri',. 'teknologi',. 'bidang', 'mobile', 'internet'], ['implikasinya', 'bisnis', 'adaptasi', 'cepat', 'adopsi', 'baru', 'sistem', 'mengakselerasi', 'bisnis'], ['revolusi', 'digital', 'masuk',. 'babak',. 'ditandai',. 'banyaknya',. 'ubah',. 'tatanan',. 'digerakkan', 'teknologi']]. Figure 3.19 Example input data on SentenceScoring Class. def word_frequency(self): list_of_word = [] for article in self._articles: for word in article: list_of_word.append(word) self._word_counter = dict(Counter(list_of_word)). 'industri',.

        # Sum the article-wide frequencies of the words in each sentence.
        list_frequency_sentence = []
        for article in self._articles:
            sentence_frequency_words = []
            for word in article:
                sentence_frequency_words.append(self._word_counter[word])
            result = 0
            for sentence_frequency_word in sentence_frequency_words:
                result += sentence_frequency_word
            list_frequency_sentence.append(result)
        self._score['f1_wf'] = list_frequency_sentence

Figure 3.20 word_frequency() method.

The word_frequency() method in Figure 3.20 counts how often each word occurs in the entire article. The frequencies of the words in a sentence are then summed to give that sentence's score.

    def title_similarity(self):
        # Split the title on spaces and drop empty tokens.
        title_tokens = []
        for title_token in self._title.split(" "):
            if title_token:
                title_tokens.append(title_token)
        score_lists = []
        title_len = len(title_tokens)
        # Count the matches between sentence words and title tokens.
        for article in self._articles:
            score = 0
            for word in article:
                for token in title_tokens:
                    if word == token:

                        score += 1
            score_lists.append(score)
        # Normalize each count by the number of title tokens.
        for i, score_list in enumerate(score_lists):
            score_lists[i] = round((score_list / title_len), 2)
        self._score['f2_ts'] = score_lists

Figure 3.21 title_similarity() method.

The title_similarity() method in Figure 3.21 measures the similarity between each sentence and the article title. Following Equation 3-2, the numerator of the title similarity score is the number of words shared by the title and the sentence (their intersection), and the denominator is the number of words in the title.

    def sentence_position(self):
        n = len(self._articles)
        score_lists = []
        for i, article in enumerate(self._articles):
            if i == 0:
                # First sentence.
                score_lists.append(1)
            elif i == n - 1:
                # Last sentence.
                score_lists.append(1)
            else:
                score_lists.append(round((((n - i) / n)), 2))
        self._score['f3_sp'] = score_lists

Figure 3.22 sentence_position() method.

The sentence_position() method in Figure 3.22 calculates a score based on the position of the sentence in the article. The maximum sentence position score is one. A sentence at the beginning or the end of the article scores one; for any other sentence, the earlier its position, the higher its score.

    def sentence_length(self):
        # Find the index of the longest sentence.
        index_max = 0
        for i, article in enumerate(self._articles):
            if len(article) > len(self._articles[index_max]):
                index_max = i
        # Score each sentence by its length relative to the longest one.
        score_lists = []
        for article in self._articles:
            score_lists.append(round((len(article) / len(self._articles[index_max])), 2))
        self._score['f4_sl'] = score_lists
        return score_lists

Figure 3.23 sentence_length() method.

The sentence_length() method in Figure 3.23 calculates a score as the ratio between the number of words in each sentence and the number of words in the longest sentence of the article.

    def centrality(self):
        limit = len(self._articles)
        sentence_position = 0
        score_centrality = []
        for article in self._articles:
            pair_score = []
            for i in range(limit):

                # Compare the current sentence with every other sentence.
                if i != sentence_position:
                    vector1, vector2 = self.build_vector(article, self._articles[i])
                    score = round(self.cosine_similarity(vector1, vector2), 2)
                    pair_score.append(score)
            # Sum the pairwise similarities into one score per sentence.
            result = 0
            for score in pair_score:
                result += score
            score_centrality.append(round(result, 2))
            sentence_position += 1
        self._score['f5_centrality'] = score_centrality

Figure 3.24 centrality() method.

The centrality() method in Figure 3.24 scores each sentence by its cosine similarity to every other sentence in the article. The pairwise similarities are then summed to give the sentence's final score.

    def tf_isf(self):
        list_isf, list_tf, list_len, score_tf_isf = [], [], [], []
        for article in self._articles:
            score_isf = 0
            score_tf = 0
            article_counter = dict(Counter(article))
            list_len.append(len(article))
            # Article-wide frequencies of the words in this sentence.
            for word in article:
                score_isf += self._word_counter[word]
            # Within-sentence frequencies of the words in this sentence.
            for k, v in article_counter.items():
                score_tf += v

            list_isf.append(score_isf)
            list_tf.append(score_tf)
        for tf, isf, length in zip(list_tf, list_isf, list_len):
            try:
                score = (log(isf) * tf) / length
                score_tf_isf.append(round(score, 2))
            except ValueError:
                # log() raises ValueError when isf is 0 (empty sentence).
                score = 0
                score_tf_isf.append(score)
        self._score['f6_tfisf'] = score_tf_isf
        return score_tf_isf

    def calculate_cosim(self):
        # The centroid is the sentence with the largest TF-ISF score.
        index_max = 0
        for i, value in enumerate(self._score['f6_tfisf']):
            if value > self._score['f6_tfisf'][index_max]:
                index_max = i
        # Score each sentence by its cosine similarity to the centroid.
        score_cosim = []
        for article in self._articles:
            vector1, vector2 = self.build_vector(article, self._articles[index_max])
            score = round(self.cosine_similarity(vector1, vector2), 2)
            score_cosim.append(score)
        self._score['f7_cosim'] = score_cosim
        return score_cosim

    def bi_gram(self):
        # Count the bigrams that can be formed from each sentence.
        score_bigram = []
        for article in self._articles:
            score = len(list(ngrams(article, 2)))
            score_bigram.append(score)

        self._score['f8_bigram'] = score_bigram
        return score_bigram

    def tri_gram(self):
        # Count the trigrams that can be formed from each sentence.
        score_trigram = []
        for article in self._articles:
            score = len(list(ngrams(article, 3)))
            score_trigram.append(score)
        self._score['f9_trigram'] = score_trigram
        return score_trigram

Figure 3.25 tf_isf(), calculate_cosim(), bi_gram() and tri_gram() methods.

In Figure 3.25, tf_isf() scores each sentence from its term frequency and inverse sentence frequency. The TF-ISF scores also determine the centroid, which is the sentence with the largest TF-ISF score. calculate_cosim() then measures the similarity between each sentence and the centroid. bi_gram() and tri_gram() work the same way: each counts the n-grams (n = 2 and n = 3, respectively) that can be formed from a sentence's tokens. The output of the Sentence Scoring algorithm is a numerical representation of a collection of features.

Table 3.5 The Sentence Scoring Results

    No.  f1   f2    f3    f4    f5    f6    f7    f8  f9  label
    1    45   0.29  1     1     0.67  3.81  1     29  28  TRUE
    2    11   0.14  0.25  0.3   0.24  2.4   0     8   7   TRUE
    3    11   0.29  0.38  0.2   0.66  2.4   0.07  5   4   FALSE
    4    10   0     0.5   0.17  0.28  2.3   0.13  4   3   FALSE
    5    9    0     0.62  0.2   0.29  2.2   0.14  5   4   FALSE
    6    10   0.14  0.75  0.2   0.81  2.3   0.07  5   4   FALSE
    7    2    0     0.88  0.03  0.41  0.69  0     0   0   FALSE
    8    18   0.29  1     0.37  0.48  2.89  0.26  10  9   FALSE

The Sentence Scoring algorithm assigns a weight to each sentence. As Table 3.5 shows, the scores differ considerably from one sentence to another, and the features also differ widely in range. For example, the largest value of f1 is 45, while the largest value of f6 is 3.81. To reduce these differences in range without changing the pattern of the data distribution, feature scaling is needed. Feature scaling brings every feature onto the same range. The feature scaling code is shown in Figure 3.26.

    # Imports assumed by this method (not shown in the original listing).
    import pandas as pd
    from sklearn import preprocessing

    def feature_scaling(self, targetclass):
        # Rescale every column of the dataset to the range [0, 1].
        scaler = preprocessing.MinMaxScaler()
        scaled_df = scaler.fit_transform(self._dataset)
        features = self._features
        features.append(self._feature_target)
        scaled_df = pd.DataFrame(scaled_df, columns=features)
        self._dataset = scaled_df
        self._attributes = list(self._dataset)
        if targetclass == "end":
            # The target class is the last column; separate it from the features.
            self._features = self._attributes
            self._feature_target = self._features[len(self._attributes) - 1]
            del self._features[len(self._attributes) - 1]
        self.central_tendency()

Figure 3.26 feature_scaling() method.

The results of the feature scaling algorithm can be seen in Table 3.6.
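The min-max rescaling used in Figure 3.26 can be illustrated with a minimal sketch in plain Python. This is not the thesis implementation, which relies on scikit-learn's MinMaxScaler; the function name min_max_scale is hypothetical, and the inputs are the f1 and f6 columns of Table 3.5. Because the sketch takes the minimum and maximum from these eight rows only, its output is not expected to match Table 3.6, which appears to have been scaled with the minimum and maximum of a larger dataset.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1].

    Hypothetical helper for illustration: x' = (x - min) / (max - min).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# f1 and f6 columns of Table 3.5 (eight sentences).
f1 = [45, 11, 11, 10, 9, 10, 2, 18]
f6 = [3.81, 2.4, 2.4, 2.3, 2.2, 2.3, 0.69, 2.89]

scaled_f1 = min_max_scale(f1)
scaled_f6 = min_max_scale(f6)

# After scaling, both features span the same range despite their
# very different original magnitudes.
print([round(v, 3) for v in scaled_f1])
print([round(v, 3) for v in scaled_f6])
```

The ordering of the values within each feature is preserved; only the range changes, which is why scaling does not alter the pattern of the data distribution.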

Table 3.6 The Feature Scaling Result

    No.  f1     f2     f3     f4     f5     f6     f7     f8     f9     label
    1    0.657  0.007  1.000  0.751  0.172  0.683  0.424  1.000  1.000  1
    2    0.149  0.001  0.167  0.215  0.051  0.330  1.000  0.292  0.250  1
    3    0.149  0.007  0.311  0.138  0.169  0.330  0.017  0.167  0.125  0
    4    0.134  1.000  0.444  0.115  0.062  0.305  0.044  0.125  0.083  0
    5    0.119  1.000  0.578  0.138  0.065  0.280  0.048  0.167  0.125  0
    6    0.134  0.001  0.722  0.138  0.212  0.305  0.017  0.167  0.125  0
    7    0.015  1.000  0.867  0.008  0.099  0.380  1.000  1.000  1.000  0
    8    0.254  0.007  1.000  0.268  0.119  0.453  0.100  0.375  0.333  0

Table 3.6 differs from Table 3.5 in two ways: the maximum value of every feature is now 1.0, and the label, originally a Boolean, is now represented as 0 or 1.

3.2.3 Ensemble Learning

After the Sentence Scoring process, the Indosum dataset that we use yields a new dataset in .csv (comma-separated values) format. Its structure can be seen in Table 3.2 and its descriptive statistics in Table 3.7.

Table 3.7 Descriptive statistics of the dataset to be processed in training.

            f1      f2       f3      f4      f5       f6      f7       f8      f9
    N       150340  150340   150340  150340  150340   150340  150340   150340  150340
    Mean    0.380   0.397    0.596   0.385   0.376    0.501   0.340    0.387   0.363
    Median  0.326   0.0134   0.604   0.357   0.305    0.495   0.118    0.320   0.302
    Min     0.0155  3.83e-4  0.0110  0.0076  0.00274  0.0426  0.00439  0.040   0.0377
    Max     1.00    1.00     1.00    1.00    1.00     1.00    1.00     1.00    1.00

3.2.3.1 Bootstrap Aggregating

As Table 3.7 shows, the dataset has 150340 rows. These rows are distributed into one hundred bags following the bagging concept in Section 2.2.1. Each bag holds two thirds of the source dataset.

    class Bagging:
        def __init__(self, dataset, number_of_bootstrap=100):
            self._dataset = dataset
            self._attributes = list(self._dataset)   # column names
            # Note: _features refers to the same list object as _attributes.
            self._features = self._attributes
            self._number_of_bootstrap = number_of_bootstrap
            # The target class is the last column; remove it from the features.
            self._feature_target = self._features[len(self._attributes) - 1]
            del self._features[len(self._attributes) - 1]
            self.sample_size()
            self.sample_attributes()
            self.bootstrap_aggregating()

Figure 3.27 Bagging Class constructor.

The parameters of the Bagging class are dataset and number_of_bootstrap. The dataset parameter is a DataFrame (https://pandas.pydata.org/) loaded from a CSV file containing the Sentence Scoring scores. The number_of_bootstrap parameter defaults to 100 bags in our study. There are three functions in the Bagging class:
