Removing Duplicates Method of English Tests

Shi-jiao ZHANG 1,2, Yuan SUN 1,2,* and Zhen ZHU 1,2

1 School of Information Engineering, Minzu University of China
2 Minority Languages Branch, National Language Resource and Monitoring Research Center, 100081, Beijing, China

18810372500@163.com, *tracy.yuan.sun@gmail.com, 18957736389@163.com

Keywords: Removing duplicates, Word2vec, Synonyms' distance, Coincidence ratio.

Abstract. This paper explores a method for removing duplicates from English tests. English tests contain many duplicate items, and removing them by hand is very difficult, so we propose an automatic method. An English test item is composed of three parts: question stem, options and answer analysis. Because the parts have different characteristics, we apply a different strategy to each. First, we use a word2vec method on the answer analyses, synonym-distance calculation on the options, and coincidence-ratio calculation on the question stems. We then treat these scores as the variables of a multiple regression model. Finally, the model decides whether two English test items are duplicates.

Introduction

Our schools hold many duplicate English test items, and if students practice on these duplicates, the benefit is small. As a data-processing technology, duplicate removal has received much attention in academia and industry [1]. In 1996, J.R. Brown and P.J. Rich proposed automatically removing duplicate and obsolete patterns [2]. Since then, duplicate removal has been studied extensively [3-7]. R.V. Guha proposed a method and system for removing duplicate query results in a database system comprising a plurality of data sources: if the first data source already contains the data, the second query result is considered a duplicate and is discarded [8]. In 2014, S. Chandiramani et al. presented a method that detects and removes duplicate video search results [9]. P. Fisher built a workflow that takes in two lists of strings, concatenates them, removes any duplicates, and returns the resulting file to the user [10].

These methods provide a good reference for our work. However, removing duplicates from English tests is seldom discussed.

In this paper, we describe a method for removing duplicates from English tests, show experimental results, and finally summarize our work.

Method of Removing Duplicates in English Tests

An English test item is composed of three parts: question stem, options and answer analysis. Because the parts have different characteristics, we apply a different strategy to each, and each part contributes its own evidence for duplicate removal. To improve the accuracy of duplicate removal, this paper combines the results by multi-attribute collective decision.


The specific calculation strategies are as follows.

Word2vec Method in Answer Analysis

The word2vec method represents words as vectors, so the semantic similarity of two answer analyses can be measured through their word vectors. We use 6.9 GB of encyclopedic description texts as the training corpus, train with the Skip-gram model and negative sampling, and obtain a 400 MB word-vector language model that contains all the words; the cosine distance between two words is then calculated from their 128-dimensional word vectors. There are two ways to calculate similarity based on word2vec.
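As an illustration, the following is a minimal sketch of such a setup using the gensim toolkit (an assumption; the paper does not name its implementation), with Skip-gram, negative sampling and 128-dimensional vectors as described above; the tiny corpus stands in for the encyclopedic training texts.

    # Minimal word2vec sketch with gensim (assumed toolkit): Skip-gram (sg=1),
    # negative sampling (negative=5), 128-dimensional vectors as in the paper.
    from gensim.models import Word2Vec

    # Stand-in for the tokenized 6.9 GB encyclopedic training corpus.
    corpus = [["duplicate", "questions", "share", "the", "same", "meaning"],
              ["similar", "words", "get", "similar", "vectors"]]

    model = Word2Vec(sentences=corpus, vector_size=128, sg=1, negative=5,
                     window=5, min_count=1)

    # Cosine similarity between two word vectors.
    print(model.wv.similarity("duplicate", "similar"))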

The Average Highest Similarity Method

This algorithm is relatively simple. First, we segment both answer analyses into words. For each word in one analysis we calculate its similarity to every word in the other and take the maximum as that word's similarity value. Finally, we average these highest similarities over all words of the answer analyses.
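A minimal sketch of this strategy, assuming sim(a, b) returns the word2vec cosine similarity of two words (e.g. model.wv.similarity):

    def avg_highest_similarity(words_a, words_b, sim):
        """Average, over the words of one analysis, of each word's best
        match in the other (one direction shown; it can be symmetrized)."""
        if not words_a or not words_b:
            return 0.0
        best = [max(sim(a, b) for b in words_b) for a in words_a]
        return sum(best) / len(best)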

Keyword Similarity Method

We extract the keywords from the two answer analyses, calculate the similarity between them, and take the maximum value as the similarity score. The HanLP keyword-extraction algorithm can be used to obtain the keywords.
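A sketch of this strategy; extract_keywords is a hypothetical stand-in for HanLP's keyword extraction, and sim is the word2vec similarity as before:

    def keyword_similarity(text_a, text_b, extract_keywords, sim):
        """Maximum pairwise word2vec similarity between two keyword sets."""
        keys_a, keys_b = extract_keywords(text_a), extract_keywords(text_b)
        return max((sim(a, b) for a in keys_a for b in keys_b), default=0.0)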

Because an answer analysis often uses different language to explain the same content, string matching cannot solve this problem, but similarity calculation can: when the threshold is reached, the two items are assumed to be duplicates.

Synonyms’ Distance Calculation in Options

This method calculates the distance between words based on a synonym dictionary, seeking the shortest word distance. In this paper, only the shortest distance over all word pairs is used as the similarity value of the options.
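A minimal sketch, assuming a hypothetical word_distance(a, b) that returns the synonym-dictionary distance between two words:

    def shortest_synonym_distance(words_a, words_b, word_distance):
        """Only the shortest distance over all word pairs of the two
        options is kept as the option-similarity feature."""
        return min((word_distance(a, b) for a in words_a for b in words_b),
                   default=float("inf"))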

Coincidence Ratio Calculation in Question Stem

The coincidence ratio is the number of coincident words divided by the total number of words.

The Calculation of All Coincidence Number

The overall coincidence ratio is computed from all the words of the two question stems being compared: the words are matched against each other and the coincidence ratio is counted. The coincidence ratio of similar question stems is much larger than that of different ones. However, some coinciding words, such as "is", "a" and "the", do not indicate the same question stem and are not useful for the result; these can be filtered out with a stop-word ("disabled words") list.
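A sketch of the overall coincidence ratio; it reads "the number of all words" as the size of the union of the two filtered word sets (an assumption):

    def coincidence_ratio(words_a, words_b, stopwords):
        """Shared words divided by all words, after stop-word filtering."""
        set_a = {w for w in words_a if w not in stopwords}
        set_b = {w for w in words_b if w not in stopwords}
        union = set_a | set_b
        return len(set_a & set_b) / len(union) if union else 0.0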

The High-frequency Words Calculation of Coincidence Ratio Based on TF*IDF

TF*IDF is a very common method in text classification. We apply it to this task under the assumption that high-frequency words are relatively representative of a sentence's meaning; TF*IDF comprehensively reflects the importance of a word in the sentence. The calculation uses the following three equations, Eqs. 1-3.

$$V_{i,j} = TF_{i,j} \times IDF_i \qquad (1)$$

$$TF_{i,j} = \frac{n_{i,j}}{N_j} \qquad (2)$$

$$IDF_i = \log\left(\frac{D}{D_i + 1}\right) \qquad (3)$$

Here $n_{i,j}$ is the number of occurrences of word $i$ in question stem $j$, $N_j$ is the total number of words in stem $j$, $D$ is the total number of question stems, and $D_i$ is the number of stems containing word $i$.

First, because the question stems are short and their length has little effect on feature extraction, we apply the stop-word list. Then we select the top 15 words of each question stem as its coincidence vocabulary. Finally, we count the coincident words between the two vocabularies.
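A sketch of this feature following Eqs. 1-3: score every word of a question stem by TF*IDF, keep the top 15 as its coincidence vocabulary, and count the overlap of two vocabularies:

    import math
    from collections import Counter

    def top_tfidf_words(stem_words, all_stems, k=15):
        """Top-k words of one stem by V = TF * IDF (Eqs. 1-3)."""
        tf = Counter(stem_words)                       # n_(i,j)
        n_total = len(stem_words)                      # N_j
        def idf(w):                                    # Eq. 3
            d_i = sum(w in stem for stem in all_stems)
            return math.log(len(all_stems) / (d_i + 1))
        scores = {w: (c / n_total) * idf(w) for w, c in tf.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    def coincident_word_count(stem_a, stem_b, all_stems):
        """Feature 5: overlap of the two top-15 vocabularies."""
        return len(set(top_tfidf_words(stem_a, all_stems))
                   & set(top_tfidf_words(stem_b, all_stems)))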


Five kinds of features are used to represent an English test item, as shown in Table 1.

Table 1. Feature selection list.

Num | Feature source  | Calculation
1   | Answer analysis | Average maximum similarity method
2   | Answer analysis | Keyword similarity method
3   | Option          | Distance calculation based on synonyms
4   | Question stem   | The calculation of all coincidences
5   | Question stem   | The high-frequency word calculation of coincidence ratio based on TF*IDF

Combination Strategy Based on Multiple Linear Regression

We consider that the above five features have different strengths in different settings, but we cannot determine in advance how to distribute the weight of each feature to get the best final classification. Therefore, this paper uses multiple linear regression to train the weights, then uses the resulting classification polynomial to predict whether two English tests are the same.

The result of each feature is defined as a variable $x_i$, $i \in [1, 5]$, and multiplied by a weight $a_i$ that represents the importance of $x_i$ to the result. The weighted results are summed and an offset value $b$ is added; the sign of the result determines whether the two English tests are the same, as shown in Eqs. 4 and 5.

$$y = b + \sum_{i=1}^{5} a_i x_i \geq 0 \;\Rightarrow\; \text{label: same} \qquad (4)$$

$$y = b + \sum_{i=1}^{5} a_i x_i < 0 \;\Rightarrow\; \text{label: different} \qquad (5)$$

The above are the classification formulas; the weights $a_i$ are obtained by training on the corpus. The least-squares method is used in the calculation, with the objective shown in Eq. 6.

$$RSS = \sum_{j}(y_j - \hat{y}_j)^2 = \sum_{j}\Big[y_j - \big(b + \sum_{i=1}^{5} a_i x_{i,j}\big)\Big]^2 \qquad (6)$$

Here $x_{i,j}$ and $y_j$ are known; to optimize $a_i$ and $b$, we take the partial derivative of RSS with respect to each parameter and move each parameter along the direction of its derivative, with step size given by the learning rate, until the extremum is reached.
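As an illustration, the weights can also be fitted in closed form with numpy's least-squares solver (an assumed shortcut for the same RSS objective; the paper instead steps along the partial derivatives):

    import numpy as np

    def fit_weights(X, y):
        """X: (n_samples, 5) feature matrix; y: +1 for duplicate pairs,
        -1 otherwise. Minimizes the RSS of Eq. 6 in closed form."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # extra column for b
        coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        return coef[:-1], coef[-1]                     # a_1..a_5, b

    def is_same_test(x, a, b):
        """Sign rule of Eqs. 4-5."""
        return b + a @ x >= 0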

Experimental Results Analysis

In order to use supervised training, we labeled some training data. In this corpus, a number of pairs were marked as the same test and 1434 pairs were marked as different tests. Two-thirds of the data was used as training data and the rest as test data.

Table 2. The results of the two experiments.

Num | Total tests | Machine identification | Correctly identified | Correct rate | Recall rate | F1
1   | 1000        | 990                    | 750                  | 75.82%       | 75.40%      | 75.61%
2   | 1000        | 985                    | 865                  | 87.61%       | 86.60%      | 87.10%


As we can see from Table 2, in the first experiment we use all five variables, train the weighting parameters on the training corpus and use the multiple regression model to predict the result; the resulting F1 is 75.61%.

Figure 1. The results of the two experiments.

Fig. 1 clearly compares the first and second experiments. After analyzing the results, we found that the first and fifth variables have little effect on the results, so we removed them. With the same training and test data, the correct rate then reached 87.61% and F1 reached 87.10%, achieving the effect of data fusion to a certain extent.

Conclusions

In this paper, a method for removing duplicates from English tests is proposed. We mainly discuss five factors: (1) the average maximum similarity method; (2) the keyword similarity method; (3) distance calculation based on synonyms; (4) the calculation of all coincidences; and (5) the high-frequency word calculation of coincidence ratio based on TF*IDF. Analysis of the experimental results shows that the method can achieve good accuracy in removing duplicates from English tests.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61501529, No. 61331013), the National Language Committee Project (No. YB125-139, ZDI125-36), and the Minzu University of China Scientific Research Project (No. 2015MDQN11, No. 2015MDTD14C).

References

[1] Y. Hu, De-duplication Technology Research and Implementation Oriented to Large-scale Short Texts. (2007).

[2] J.R. Brown, P.J. Rich, Automatically Removing Duplicate and Obsolete Patterns. (1996).

[3] C. Wu, Removing duplicate objects from an object store. (2005).

[4] N.M. Joy, S. Salas, Systems and Methods for Removing Duplicate Search Engine Results. (2007).

[5] J. Colgrove, J. Hayes, E. Miller, J.S. Hasbani, C. Sandvig, Method for Removing Duplicate Data From a Storage Array. (2013).

[6] X. Li, Q. Yang, L.N. Zeng, Clustering Web Retrieval Results Accompanied by Removing Duplicate Documents. (2010).

[7] S.P. Semprevivo, M.R. Wells, System and Method of Removing Duplicate Leads. (2010).

[8] R.V. Guha, Pass-through Architecture Via Hash Techniques to Remove Duplicate Query Results. (2000).

[9] S. Chandiramani, P.U. Chandraghatgi, I. Sridharan, Interactive Deduplication Methods for Detecting and Removing Duplicates in Video Search Results. (2014).
