Removing Duplicates Method of English Tests
Shi-jiao ZHANG 1,2, Yuan SUN 1,2,* and Zhen ZHU 1,2
1 School of Information Engineering, Minzu University of China
2 Minority Languages Branch, National Language Resource and Monitoring Research Center, 100081 Beijing, China
18810372500@163.com, *tracy.yuan.sun@gmail.com, 18957736389@163.com
Keywords: Removing duplicates, Word2vec, Synonyms’ distance, Contact ratio.
Abstract. This paper explores a method for removing duplicates from English tests. English tests contain many duplicate items, and removing them by manual work is very difficult, so we propose an automatic method. An English test item is composed of three parts: the question stem, the options, and the answer analysis. Because the parts have different characteristics, we apply a different strategy to each: the word2vec method to the answer analysis, synonym distance calculation to the options, and coincidence ratio calculation to the question stem. We then use these scores as variables in a multiple regression model and thereby remove duplicates from English tests.
Introduction
Schools hold many duplicate English test items, and doing duplicated tests has little effect on students. Removing duplicates, as a data processing technology, has received much attention in academia and industry [1]. In 1996, J.R. Brown and P.J. Rich proposed automatically removing duplicate and obsolete patterns [2]. Since then, duplicate removal has been extensively studied [3-7]. R.V. Guha proposed a method and system for removing duplicate query results in a database system comprising a plurality of data sources: if the first data source already contains the data, the second query result is considered a duplicate and discarded [8]. In 2014, S. Chandiramani et al. presented a method that detects and removes duplicate video search results [9]. P. Fisher built a workflow that takes in two lists of strings, concatenates them, removes any duplicates that are present, and returns the resulting file to the user [10].
These methods provide a good reference for our work. However, removing duplicates from English tests is seldom discussed.
In this paper, we describe a method for removing duplicates from English tests, show experimental results, and finally summarize our work.
Method of Removing Duplicates in English Tests
An English test item is composed of three parts: the question stem, the options, and the answer analysis. Because each part has different characteristics, we use a different strategy for each part, and each part contributes its own evidence for duplicate detection. To improve the accuracy of duplicate removal, this paper combines the results through a multi-attribute collective decision.
The specific calculation strategies are as follows.
Word2vec Method in Answer Analysis
The word2vec method represents words as vectors, so that the similarity of two word vectors indicates the semantic similarity of two answer analyses. We use 6.9 GB of encyclopedic description text as the training corpus, with a Skip-gram model and Negative Sampling, obtaining a 400 MB word-vector language model that covers the whole vocabulary; the cosine distance between two words is calculated from their 128-dimensional word vectors. There are two ways to calculate similarity based on word2vec.
The Average Highest Similarity Method
The algorithm is relatively simple. First, we segment both answer analyses into words. For each word in one analysis, we calculate its similarity to every word in the other and take the maximum as that word's similarity value. Finally, we average these highest similarities over all words in the answer analyses.
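The average-highest-similarity computation can be sketched as follows. The tiny vector table below is an illustrative stand-in for the trained 128-dimensional word2vec model; the words and values are assumptions, not the paper's data.

```python
import numpy as np

# Toy word vectors standing in for the trained 128-dimensional model.
VECTORS = {
    "student": np.array([0.9, 0.1, 0.0]),
    "pupil":   np.array([0.8, 0.2, 0.1]),
    "reads":   np.array([0.1, 0.9, 0.2]),
    "studies": np.array([0.2, 0.8, 0.3]),
}

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_highest_similarity(words_a, words_b):
    """For each word in analysis A, take its highest cosine similarity to
    any word in analysis B, then average those maxima over all words of A."""
    maxima = [max(cosine(VECTORS[a], VECTORS[b]) for b in words_b)
              for a in words_a]
    return sum(maxima) / len(maxima)

score = avg_highest_similarity(["student", "reads"], ["pupil", "studies"])
```

A high score (close to 1) suggests the two segmented answer analyses use semantically similar vocabulary even when the exact words differ.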
Keyword Similarity Method
We extract the keywords from the two answer analyses, calculate the similarity between every pair of keywords, and take the maximum as the similarity value. The HanLP keyword extraction algorithm can be used to obtain the keywords.
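A self-contained sketch of the keyword similarity method, with two loudly labeled assumptions: a frequency-based extractor stands in for HanLP, and a toy similarity table stands in for the word2vec cosine distances.

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "to", "and", "answer"}  # minimal illustrative list

def extract_keywords(text, k=3):
    """Stand-in for the HanLP keyword extractor: the k most frequent
    non-stopword tokens of the text."""
    counts = Counter(w for w in text.lower().split() if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

# Toy pairwise similarity table standing in for word2vec cosine similarity.
SIM = {("gerund", "gerund"): 1.0, ("gerund", "participle"): 0.7,
       ("participle", "gerund"): 0.7, ("participle", "participle"): 1.0}

def keyword_similarity(text_a, text_b):
    """Maximum pairwise similarity between the keywords of two analyses."""
    pairs = [(a, b) for a in extract_keywords(text_a)
                    for b in extract_keywords(text_b)]
    return max((SIM.get(p, 0.0) for p in pairs), default=0.0)

s = keyword_similarity("the gerund is the answer gerund",
                       "participle participle of the answer")
```

In practice the similarity lookup would come from the trained word-vector model rather than a hand-written table.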
Because an answer analysis often uses different language to analyze the same content, string matching cannot solve this problem, but similarity calculation can: when the threshold is reached, the items are assumed to be duplicates.
Synonyms’ Distance Calculation in Options
This method calculates the distance between words based on a synonym dictionary. In this paper, only the shortest word distance over all word pairs is used as the similarity value of the options.
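One way to realize this, assuming the synonym dictionary can be read as a graph where directly synonymous words have distance 1, is a breadth-first search; the tiny dictionary below is an illustrative assumption.

```python
from collections import deque

# Toy synonym dictionary standing in for the one used in the paper.
SYNONYMS = {
    "big":      {"large", "huge"},
    "large":    {"big"},
    "huge":     {"big", "enormous"},
    "enormous": {"huge"},
    "small":    {"little"},
    "little":   {"small"},
}

def synonym_distance(a, b):
    """Shortest number of synonym hops between two words (0 = same word);
    returns None when the words are unconnected in the dictionary."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        word, d = queue.popleft()
        for nxt in SYNONYMS.get(word, ()):
            if nxt == b:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

def option_similarity(words_a, words_b):
    """Shortest distance over all word pairs of the two options."""
    dists = [synonym_distance(a, b) for a in words_a for b in words_b]
    dists = [d for d in dists if d is not None]
    return min(dists) if dists else None
```

A smaller distance means the two options are more likely paraphrases of each other.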
Coincidence Ratio Calculation in Question Stem
The coincidence ratio is the number of coincident words divided by the total number of words.
The Calculation of All Coincidence Number
The overall coincidence ratio is obtained by taking all the words of the two question stems being compared, matching them against each other, and computing the proportion of coincident words. The coincidence ratio of similar question stems is much larger than that of different question stems. However, some coincident words, such as "is", "a", and "the", do not indicate the same question stem and are not useful for the result; they can be filtered out with a disabled words (stopword) list.
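A minimal sketch of the overall coincidence ratio with stopword filtering. Here the ratio is read as shared distinct words divided by all distinct words of the two stems; the stopword list is a small illustrative stand-in for the disabled words list.

```python
STOPWORDS = {"is", "a", "the", "of", "to"}  # stand-in for the disabled words list

def coincidence_ratio(stem_a, stem_b):
    """Shared words divided by all distinct words of the two question
    stems, after filtering with the disabled words list."""
    a = {w for w in stem_a.lower().split() if w not in STOPWORDS}
    b = {w for w in stem_b.lower().split() if w not in STOPWORDS}
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

r = coincidence_ratio("the cat is on the mat", "a cat sat on a mat")
```

Similar stems score close to 1, unrelated stems close to 0, so a threshold on this ratio is a cheap pre-filter before the more expensive semantic comparisons.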
The High-frequency Words Calculation of Coincidence Ratio Based on TF*IDF
TF*IDF is a very common method in text classification. We apply it to this task under the assumption that high-frequency words are relatively representative of a sentence's meaning: TF*IDF comprehensively reflects the importance of a word in a sentence. The calculation uses Equations 1-3.
V_{i,j} = TF_{i,j} × IDF_i                         (1)

TF_{i,j} = n_{i,j} / N_j                           (2)

IDF_i = log( |D| / (|{j : t_i ∈ d_j}| + 1) )       (3)

where n_{i,j} is the number of occurrences of word t_i in question stem d_j, N_j is the total number of words in d_j, and |D| is the number of question stems.
First, because the question stems are short and their length has little effect on feature extraction, we filter them with the disabled words list. We then sort the words of each question stem by TF*IDF and keep the top 15 as its coincidence vocabulary. Finally, we count the number of coincident words between the two vocabularies.
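The ranking and counting steps above can be sketched as follows. Since Equation 3's exact smoothing was not fully specified, a standard smoothed IDF is assumed; the stems and stopword list are illustrative.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "to"}  # stand-in for the disabled words list

def top_tfidf_words(stems, k=15):
    """Rank each stem's words by TF*IDF (a smoothed IDF is assumed here)
    and keep the top k as its coincidence vocabulary."""
    docs = [[w for w in s.lower().split() if w not in STOPWORDS] for s in stems]
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    tops = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: (c / len(doc)) * math.log(n_docs / (df[w] + 1) + 1)
                  for w, c in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        tops.append(set(ranked[:k]))
    return tops

def coincidence_count(vocab_a, vocab_b):
    """Number of coincident words between two coincidence vocabularies."""
    return len(vocab_a & vocab_b)

tops = top_tfidf_words(["the cat chased the mouse", "a dog chased a cat"], k=3)
shared = coincidence_count(tops[0], tops[1])
```

In the paper k = 15; the tiny k here only keeps the example readable.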
Five kinds of features are utilized to represent an English test, as shown in Table 1.
Table 1. Feature selection list.

Feature source     Calculation                                                Num
Answer analysis    Average maximum similarity method                          1
Answer analysis    Keyword similarity method                                  2
Option             Distance calculation based on synonyms                     3
Question stem      The calculation of all coincidences                        4
Question stem      The high-frequency word coincidence ratio based on TF*IDF  5
Combination Strategy Based on Multiple Linear Regression
The five features above have different strengths in different situations, but we cannot determine a priori how to weight each feature to obtain the best final classification. Therefore, this paper uses a multiple regression method: the weights are trained, and the resulting polynomial classifier predicts whether two English tests are the same.
The result of the five features is defined as variables x_i, i ∈ [1, 5]. Each is multiplied by a weight a_i, which represents the importance of x_i to the result; the products are summed and an offset value b is added. The sign of the result determines whether the two English tests are the same, as shown in Equations 4 and 5.

y = b + Σ_{i=1..5} a_i x_i ≥ 0  →  same English test         (4)

y = b + Σ_{i=1..5} a_i x_i < 0  →  different English tests   (5)

The weights a_i are obtained by training on the corpus using the least squares method, as shown in Equation 6.
RSS = Σ_k (y_k − ŷ_k)² = Σ_k [ y_k − (b + Σ_{i=1..5} a_i x_{k,i}) ]²   (6)

Given the training pairs x_k and y_k, we optimize a_i and b by taking the partial derivative of RSS with respect to each parameter and moving the parameter in the direction of descent according to the learning rate, until the parameters are optimized.
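A minimal gradient-descent sketch of this least-squares fit and the sign decision of Equations 4 and 5. The feature vectors and labels below are illustrative assumptions, not the paper's corpus, and labels are encoded as ±1 so that the sign of y decides the class.

```python
def train_weights(xs, ys, lr=0.1, epochs=3000):
    """Fit weights a_i and offset b by gradient descent on the residual
    sum of squares (Equation 6)."""
    n_feat = len(xs[0])
    a, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_a, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(xs, ys):
            err = (b + sum(ai * xi for ai, xi in zip(a, x))) - y
            grad_b += err
            for i in range(n_feat):
                grad_a[i] += err * x[i]
        # Move each parameter against its averaged partial derivative.
        b -= lr * grad_b / len(xs)
        a = [ai - lr * gi / len(xs) for ai, gi in zip(a, grad_a)]
    return a, b

def predict(a, b, x):
    """Sign decision of Equations 4 and 5: +1 = same test, -1 = different."""
    return 1 if b + sum(ai * xi for ai, xi in zip(a, x)) >= 0 else -1

# Toy feature vectors (the five similarity scores) with labels
# +1 for duplicate pairs and -1 for distinct pairs.
xs = [[0.90, 0.80, 0.90, 0.70, 0.80], [0.85, 0.90, 0.80, 0.90, 0.70],
      [0.10, 0.20, 0.10, 0.30, 0.20], [0.20, 0.10, 0.30, 0.10, 0.20]]
ys = [1, 1, -1, -1]
a, b = train_weights(xs, ys)
```

With similarity scores bounded in [0, 1], duplicate pairs cluster at high feature values, so a linear separator of this form is usually easy to fit.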
Experimental Results Analysis
In order to use a supervised training method, this paper labeled some training data: pairs marked as the same test and 1,434 pairs marked as different tests in this corpus. Two-thirds of the data was used for training and the rest for testing.
Table 2. The results of the two experiments.

Num  Total tests  Machine identification  Correctly identified  Correct rate  Recall rate  F1
1    1000         990                     750                   75.82%        75.40%       75.61%
2    1000         985                     865                   87.61%        86.60%       87.10%
As Table 2 shows, in the first experiment we used all five variables, trained the weighting parameters on the training corpus, and predicted with the multiple regression model; the resulting F1 is 75.61%.
Figure 1. The results of the two experiments.
Fig. 1 compares the first and second experiments. After analyzing the results, we found that the first and fifth variables have little effect on the result, so we removed them. Using the same data for training and testing, the accuracy then reached 87.61% and the F1 87.10%, achieving, to a certain extent, the effect of data fusion.
Conclusions
In this paper, a method for removing duplicates from English tests is proposed. We mainly discuss five features: (1) the average maximum similarity method; (2) the keyword similarity method; (3) distance calculation based on synonyms; (4) the calculation of all coincidences; and (5) the high-frequency word coincidence ratio based on TF*IDF. Analysis of the experimental results shows that the method removes duplicate English tests accurately.
Acknowledgments
This work is supported by National Nature Science Foundation (No. 61501529, No. 61331013), National Language Committee Project (No. YB125-139, ZDI125-36), and Minzu University of China Scientific Research Project (No. 2015MDQN11, No. 2015MDTD14C).
References
[1] Yang Hu, De-duplication Technology Research and Implementation Oriented to Large-scale Short Texts. (2007).
[3] C. Wu, Removing Duplicate Objects from an Object Store. (2005).
[4] N.M. Joy, S. Salas, Systems and Methods for Removing Duplicate Search Engine Results. (2007).
[5] J. Colgrove, J. Hayes, E. Miller, J.S. Hasbani, C. Sandvig, Method for Removing Duplicate Data from a Storage Array. (2013).
[6] X. Li, Q. Yang, L.N. Zeng, Clustering Web Retrieval Results Accompanied by Removing Duplicate Documents. (2010).
[7] S.P. Semprevivo, M.R. Wells, System and Method of Removing Duplicate Leads. (2010).
[8] R.V. Guha, Pass-through Architecture Via Hash Techniques to Remove Duplicate Query Results. (2000).
[9] S. Chandiramani, P.U. Chandraghatgi, I. Sridharan, Interactive Deduplication Methods for Detecting and Removing Duplicates in Video Search Results. (2014).