
2018 International Conference on Computational, Modeling, Simulation and Mathematical Statistics (CMSMS 2018) ISBN: 978-1-60595-562-9

Cross-modal Retrieval of Chinese-CQA Based on CCA Algorithm

Xi LIU, Lei SU*, Di JIANG and Zheng-yu FAN

Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650051, China

*Corresponding author

Keywords: Chinese CQA, Cross-modal retrieval, CCA, Semantic abstraction.

Abstract. With the development of Chinese Q&A communities, a large number of question-answer pairs have accumulated. These question-answer pairs may contain text, pictures, audio, video and other multi-modal data. The key question for a Chinese Q&A community platform therefore becomes how to match questions with the most appropriate answers by using cross-modal information such as text and images. In this paper, we propose a question-and-answer retrieval model based on the CCA cross-modal retrieval algorithm. First, LDA is used to represent Chinese text features. Then image features are extracted using a convolutional neural network, and the K-means clustering method is used to obtain the final image features. Finally, the Canonical Correlation Analysis (CCA) method is used to retrieve between images and text. CCA bridges the heterogeneity of the underlying multimedia data while retaining the correlation of the variables, and yields the cross-modal retrieval results for questions and answers. Once the correlation between the two modalities is established, the image and text features are mapped into the same feature space, where the similarity of the feature vectors can be measured directly, and multi-modal retrieval with a document retrieval map is implemented. The experimental results show that the CCA-based cross-modal retrieval method can improve the accuracy of answer retrieval in Chinese Q&A communities.

Introduction

With the development of Chinese Q&A communities such as ‘ZhiHu’ and ‘Baidu’, tens of thousands of questions are posted every day. Q&A community search engines usually reply to users with a series of other questions related to the issue. The recommended questions were posted by previous users and have already been answered. However, with the rapid increase in the number of questions and answers on these web sites, the answers may contain text, pictures, audio, video and other multi-modal data. The key question for a Chinese Q&A community platform becomes how to match questions with the most appropriate answers [1].


Cross-modal information is generally understood as a collection of text, image, video and audio objects. Cross-modal retrieval is a new trend in multimedia information retrieval research. It bridges the semantic gap between objects in different modalities. At present, there are three main approaches to cross-modal retrieval: fusion analysis, association mining and correlation analysis. Reference [10] uses the CCA method to reduce the dimensionality of the data of the two modalities and to eliminate the heterogeneity of the underlying features among the different modalities.

The rest of this paper is organized as follows. Section 2 introduces the LDA method and the convolutional neural network, which are used to represent the features of text and pictures respectively. Section 3 uses the CCA method to calculate the relevance between the Chinese question and the pictures in the answers, so as to match the user with more relevant and reasonable answers. We introduce the datasets and experimental results in Section 4. Section 5 concludes the paper.

Cross-modal Retrieval Based on CCA (Canonical Correlation Analysis)

Canonical correlation analysis is a multivariate statistical method used to study the interdependence between two groups of random variables. It considers the correlation between the two groups as a whole and finds two composite variables, X and Y, that represent the two groups. The correlation between X and Y is then used to reflect the overall correlation between the two groups of variables [12].

Multi-modal Information Feature Representation

LDA (Latent Dirichlet Allocation) [13]. The basic idea of the model is to assume that a document is a random mixture of latent topics, and that each topic can be expressed as a random mixture of feature terms. The process of creating a document can be described as follows. Over the whole document collection, a term-document matrix is defined. First, the topic distribution of the collection is determined, and a topic is selected according to the probabilities in that distribution. Once the topic is chosen, a feature term is selected according to a certain probability, and these selections form the term-document matrix. Therefore, each topic can be represented as a collection of terms, each document can be represented as a collection of terms, and repeated selection of topics and feature terms generates all documents [12].
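The generative process described above can be illustrated with a minimal sketch. The topic distribution, the per-topic term probabilities and the vocabulary below are hypothetical toy values, and the topic distribution is fixed by hand rather than drawn from a Dirichlet prior as in full LDA.

```python
import random

def generate_document(topic_probs, topic_term_probs, vocab, length, seed=0):
    """Generate `length` terms: pick a topic, then pick a term from that topic."""
    rng = random.Random(seed)
    topics = list(range(len(topic_probs)))
    terms = []
    for _ in range(length):
        # Select a topic according to the document's topic distribution.
        z = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Select a feature term according to the chosen topic's term distribution.
        w = rng.choices(vocab, weights=topic_term_probs[z], k=1)[0]
        terms.append(w)
    return terms

def term_counts(terms, vocab):
    """One column of the term-document matrix: term frequencies for a document."""
    return [terms.count(t) for t in vocab]

# Hypothetical vocabulary and two toy topics ("finance" and "medicine").
vocab = ["stock", "bank", "virus", "doctor"]
topic_term_probs = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits finance terms
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits medicine terms
]
doc = generate_document([1.0, 0.0], topic_term_probs, vocab, length=20)
counts = term_counts(doc, vocab)
```

Because the toy document draws only from topic 0, its term counts are concentrated on the finance terms. Fitting LDA inverts this process and estimates the topic mixture from observed counts.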

The traditional machine learning K-means clustering method is used to extract image features. The feature extracted from the fully connected layer of the convolutional neural network is a comprehensive representation of the image as a feature vector. In K-means, the parameter km_k represents the number of image cluster categories. By analogy with the LDA model, the set of feature representations of an image plays the role of the "word list" of a document, and all features assigned to one K-means cluster are equivalent to one word in the LDA "vocabulary". The "word frequency" of the image is then counted: the feature set ("word list") of the image is input to K-means, the "word" corresponding to each feature in the set is determined, and the number of occurrences of each "word" in the feature set is counted as its word frequency. K-means clustering thus outputs word frequency statistics, which form one of the feature vectors corresponding to the image. The last step is feature vector combination, which fuses the image features and the word frequency statistics together to form the feature vector space of the image.
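The "word frequency" counting step can be sketched as follows. The sketch assumes that the K-means centroids have already been fitted (here km_k = 3) and uses hypothetical 2-dimensional features in place of the CNN features. The function names are illustrative only.

```python
def nearest_centroid(feature, centroids):
    """Index of the centroid closest to `feature` (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(feature, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def visual_word_histogram(features, centroids):
    """Count how many features of one image fall into each cluster ("word")."""
    hist = [0] * len(centroids)
    for feature in features:
        hist[nearest_centroid(feature, centroids)] += 1
    return hist

# Hypothetical fitted centroids (km_k = 3) and features of a single image.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
image_features = [[0.5, -0.2], [9.0, 11.0], [0.1, 0.3], [1.0, 9.5]]
hist = visual_word_histogram(image_features, centroids)
```

Each entry of `hist` is the frequency of one visual "word" in the image, so the histogram length always equals km_k regardless of how many features the image has.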

Feature Subspace Projection Based on Correlation

Correlation-based feature subspace projection mines the latent correlation between the underlying features of different modalities and learns the optimal subspace projection matrices, in order to solve the problem of feature heterogeneity and to measure the relationship between the modalities directly [15].


... a 5-dimensional feature extracted from the 5-topic LDA model. However, there is no direct link between the feature spaces I1 and T1. Through training on many image-text sample pairs, the CCA algorithm maps I1 and T1 to I2 and T2 respectively. The feature spaces I2 and T2 are linearly correlated, so the similarity between feature vectors in I2 and T2 can be measured directly. This gives the similarity between the question and the pictures in the answers.

Cross-modal Answer Retrieval Algorithm Based on CCA in Chinese Q&A Community

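The image-text cross-modal retrieval algorithm is based on CCA (canonical correlation analysis). First, we extract the image features $X$ ($X \in \mathbb{R}^{n \times p}$) and the question text features $Y$ ($Y \in \mathbb{R}^{n \times q}$) from the dataset $D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$. For each pair $(X_k, Y_l)$, where $X_k = (x_{k,1}, x_{k,2}, \ldots, x_{k,N})$ collects the $k$-th image feature over all samples, $k \in \{1, 2, \ldots, p\}$, and $Y_l = (y_{l,1}, y_{l,2}, \ldots, y_{l,N})$ collects the $l$-th text feature, $l \in \{1, 2, \ldots, q\}$, we calculate the mean values of $X_k$ and $Y_l$ as follows:

$$\mu_k = \bar{X}_k = \frac{1}{N} \sum_{i=1}^{N} x_{k,i} \tag{1}$$

$$v_l = \bar{Y}_l = \frac{1}{N} \sum_{i=1}^{N} y_{l,i} \tag{2}$$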

Here, $N$ is the size of the data set. Then, we calculate the sample covariance matrix:

$$C_w(X_k, Y_l) = \frac{1}{N} \sum_{i=1}^{N} (x_{k,i} - \mu_k)(y_{l,i} - v_l) \tag{3}$$

The covariance matrix of the entire data set can be expressed as:

$$C_w = \begin{bmatrix} C_w(X, X) & C_w(X, Y) \\ C_w(Y, X) & C_w(Y, Y) \end{bmatrix} \tag{4}$$

After obtaining the covariance matrix of the data set, we define $\mu = a^T X$ and $v = b^T Y$ according to the traditional canonical correlation analysis method, and then calculate the correlation between $\mu$ and $v$, where $\mu$ and $v$ correspond to the isomorphic subspaces of the image and the text respectively.

$$\mathrm{Corr}(\mu, v) = \frac{a^T C_w(X, Y)\, b}{\sqrt{a^T C_w(X, X)\, a}\,\sqrt{b^T C_w(Y, Y)\, b}} \tag{5}$$

To solve this formula, we construct the following Lagrangian equation:

$$L = a^T C_w(X, Y)\, b - \frac{\lambda}{2}\left(a^T C_w(X, X)\, a - 1\right) - \frac{\theta}{2}\left(b^T C_w(Y, Y)\, b - 1\right) \tag{6}$$

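According to the known conditions $a^T C_w(X, X)\, a = 1$ and $b^T C_w(Y, Y)\, b = 1$, we can obtain the result:

$$\lambda = \theta = a^T C_w(X, Y)\, b \tag{7}$$

That means $\lambda$ is $\mathrm{Corr}(\mu, v)$, so we only need to find the largest $\lambda$. In matrix form this is expressed as follows:

$$\begin{bmatrix} C_w(X, X)^{-1} & 0 \\ 0 & C_w(Y, Y)^{-1} \end{bmatrix} \begin{bmatrix} 0 & C_w(X, Y) \\ C_w(Y, X) & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \lambda \begin{bmatrix} a \\ b \end{bmatrix} \tag{8}$$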
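The step from the Lagrangian (6) to the matrix form (8) can be filled in by setting the partial derivatives of $L$ to zero. The sketch below assumes that the multipliers enter (6) as $\lambda/2$ and $\theta/2$, and it uses the shorthand $C_{XX} = C_w(X, X)$, $C_{XY} = C_w(X, Y)$, $C_{YX} = C_w(Y, X)$, $C_{YY} = C_w(Y, Y)$, which is introduced here only for brevity.

```latex
\begin{aligned}
\frac{\partial L}{\partial a} &= C_{XY}\, b - \lambda\, C_{XX}\, a = 0, \\
\frac{\partial L}{\partial b} &= C_{YX}\, a - \theta\, C_{YY}\, b = 0 .
\end{aligned}
% Left-multiply the first condition by a^T and the second by b^T,
% then apply the constraints a^T C_{XX} a = 1 and b^T C_{YY} b = 1:
\lambda = a^T C_{XY}\, b, \qquad
\theta = b^T C_{YX}\, a = \left(a^T C_{XY}\, b\right)^T = \lambda .
% With \theta = \lambda the two conditions stack into one block system:
\begin{bmatrix} 0 & C_{XY} \\ C_{YX} & 0 \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
= \lambda
\begin{bmatrix} C_{XX} & 0 \\ 0 & C_{YY} \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
```

Multiplying both sides by the inverse of the block-diagonal matrix on the right gives the eigenvalue problem (8), provided $C_{XX}$ and $C_{YY}$ are invertible.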

Finally, the result is:

$$C_w(X, X)^{-1}\, C_w(X, Y)\, C_w(Y, Y)^{-1}\, C_w(Y, X)\, a = \lambda^2 a \tag{9}$$
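As an illustration of how the eigenvalue problem in formula (9) might be solved numerically, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `cca_first_pair`, the small ridge term `reg` added for numerical stability, and the synthetic data are all assumptions made for this example, and it assumes the largest canonical correlation is strictly positive.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """Return (lam, a, b): the largest canonical correlation and its projections."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                      # subtract the feature means
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Formula (9): Cxx^{-1} Cxy Cyy^{-1} Cyx a = lam^2 a
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, eigvecs = np.linalg.eig(M)
    top = int(np.argmax(eigvals.real))
    lam = float(np.sqrt(max(eigvals.real[top], 0.0)))
    a = eigvecs[:, top].real
    a = a / np.sqrt(a @ Cxx @ a)                 # enforce a^T Cxx a = 1
    b = np.linalg.solve(Cyy, Cxy.T @ a) / lam    # b = Cyy^{-1} Cyx a / lam
    return lam, a, b

# Synthetic stand-ins for image (X) and text (Y) features sharing one latent factor z.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([rng.normal(size=(500, 1)), -z + 0.1 * rng.normal(size=(500, 1))])
lam, a, b = cca_first_pair(X, Y)
corr = np.corrcoef(X @ a, Y @ b)[0, 1]           # correlation of the projected data
```

The correlation of the projected samples `X @ a` and `Y @ b` should match `lam`, which is the property that lets similarity be measured directly in the shared subspace.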


The entire calculation process is as follows:

CCA-based Chinese Q&A Community Answer Retrieval Algorithm

Inputs:

D: data set {(X_1, Y_1, Z_1), (X_2, Y_2, Z_2), …, (X_n, Y_n, Z_n)}, where

X (X∈R^(n × p)): image features;

Y (Y∈R^(n × q)): text features of the question;

Z (Z∈R^(n × p)): text features of the answer.

Outputs:

λ1: correlation coefficient of μ and v;

λ2: correlation coefficient of v and ¢.

Process:

1. The input image data is passed through the convolutional neural network, which outputs a set of feature vectors. Each extracted feature vector is represented by a 128-element floating-point array.

2. K-means clustering is applied on top of the convolutional neural network features to produce the image feature matrix μ_tr (3000×128).

3. The input text data is processed by LDA to obtain a text feature matrix v_tr (1000×10) with a dimension of 10.

4. Calculate μ_k and v_l by formulas (1) and (2).

5. Calculate C_w(X_k, Y_l) by formulas (3) and (4).

6. Calculate the correlation of μ and v by formula (5).

7. To maximize the correlation, fix the denominator, optimize the numerator, and construct the Lagrangian equation L of formula (6).

8. Solve for λ1 by formulas (7), (8) and (9).

9. Solve for λ2 using cosine similarity.

10. Return λ1, λ2.
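The cosine similarity used in step 9 can be sketched as below. The vectors are hypothetical projected features in the shared subspace, and `rank_answers` is an illustrative helper, not part of the algorithm listing.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_answers(query_vec, answer_vecs):
    """Return answer indices sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, v), idx) for idx, v in enumerate(answer_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]

# Hypothetical projected question vector and three candidate answer vectors.
query = [1.0, 0.0]
answers = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
ranking = rank_answers(query, answers)
```

Cosine similarity ignores vector length, so an answer whose projected vector points in nearly the same direction as the question ranks first even if its magnitude differs.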

Experiment

Experiment Dataset


Table 1. Feature extraction and test dataset.

Corpus              Chinese Wikipedia Data    Sogou Natural Language Data
Internet            650                       760
Finance             580                       790
Physical            630                       830
Medicine            590                       820
History             550                       800
Image-text pairs    3000                      4000

Feature Extraction

The user's question is related to both the texts and the pictures in the answers. The image data in the answer is mapped to the image feature space I1, and the text data of the question is mapped to the text feature space T1, so that the relevance between the question and the pictures in the answer can be calculated. The features of the original image and text data are extracted by the methods described in Section 2, where the image is represented by a 128-dimensional feature matrix and the text by a 10-dimensional feature extracted from a 5-topic LDA model. We use the CCA algorithm to map I1 and T1 to I2 and T2 respectively. The feature spaces I2 and T2 are linearly related, so the similarity between feature vectors in I2 and T2, that is, the similarity between the question and the picture in the answer, can be measured directly. Finally, we combine the similarities to match users with more relevant and reasonable answers.

We use a processed dataset with a total of 4000 pairs of text and image samples. During the experiment, 20% of the data were randomly selected as the testing set and the remaining data were used as the training set. We train on the image and text features of the 3200 training samples and use the CCA algorithm to learn the combination weight coefficients of the image features and the text features, so that the transformed image and text features have the maximum correlation.
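The random 80/20 split described above can be reproduced with a short sketch. The placeholder pair names and the fixed seed are assumptions for illustration; the paper does not state how the random selection was seeded.

```python
import random

def train_test_split(pairs, test_fraction=0.2, seed=42):
    """Randomly split `pairs` into (train, test) with the given test fraction."""
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    n_test = int(round(len(pairs) * test_fraction))
    test = [pairs[i] for i in indices[:n_test]]
    train = [pairs[i] for i in indices[n_test:]]
    return train, test

# Placeholder identifiers standing in for the 4000 image-text pairs.
pairs = [(f"image_{i}", f"text_{i}") for i in range(4000)]
train, test = train_test_split(pairs)
```

Splitting at the level of pairs keeps each image together with its text, so the CCA projections are learned only from pairs that do not appear in the testing set.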

Experimental Evaluation Index

We selected two evaluation metrics for performance evaluation: 1. Top-K accuracy, used to determine the validity of a single object retrieval; 2. the MAP value based on the full text, used to measure the effectiveness of the entire method.
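A minimal sketch of Top-K accuracy follows. It assumes the common definition in which a query counts as correct when at least one relevant item appears among its first K results. The paper does not spell out its exact definition, and the toy rankings are hypothetical.

```python
def top_k_accuracy(rankings, relevant_sets, k=10):
    """Fraction of queries with at least one relevant item in the top k results."""
    if not rankings:
        return 0.0
    hits = 0
    for ranked, relevant in zip(rankings, relevant_sets):
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(rankings)

# Three toy queries: ranked result ids and the ids that are actually relevant.
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
relevant_sets = [{1}, {1}, {2}]
acc_at_1 = top_k_accuracy(rankings, relevant_sets, k=1)
acc_at_2 = top_k_accuracy(rankings, relevant_sets, k=2)
```

Accuracy can only stay equal or increase as K grows, because a larger cut-off never removes a hit that a smaller one already found.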

The MAP value is mainly composed of three parts and is calculated as follows:

$$\mathrm{MAP} = \bar{p}(r) = \frac{1}{N_q} \sum_{i=1}^{N_q} p_i(r)$$

This is the mean average precision (MAP) formula, which is mainly used to evaluate the performance of retrieval tasks. Here $N_q$ is the number of queries, $p_i(r)$ is the precision of the $i$-th query at recall level $r$, and $\bar{p}(r)$ is the average precision over all queries at recall level $r$. The higher the relevant documents are ranked in the search results, the higher the MAP value. If the MAP value is 0, no relevant document was retrieved.
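The behaviour of MAP can be checked with a small sketch. It uses the common rank-based definition (average precision per query, averaged over queries), which is an assumption here and may differ in detail from the recall-level form of the formula above. The toy rankings are hypothetical.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevant_sets):
    """Average of the per-query average precision over all queries."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(aps) / len(aps) if aps else 0.0

# Two toy queries with their ranked results and relevant items.
rankings = [["a", "b", "c"], ["c", "a", "b"]]
relevant_sets = [{"a", "c"}, {"b"}]
map_value = mean_average_precision(rankings, relevant_sets)
no_hit = mean_average_precision([["x"]], [{"y"}])
```

In the first query the relevant items sit at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2. In the second the only relevant item is at rank 3, giving 1/3. A query whose results contain no relevant item contributes 0.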

Experiment Results


In the experiment, we trained our model on the crawled Chinese dataset and retrieved 10 results per query to test the accuracy of the retrieved results. The following table shows the accuracy of the results when the number of returned results is fixed at 10.

Table 2. Accuracy of the retrieved results.

Accuracy by times   1       2       3       4       5
Single model        0.16    0.22    0.30    0.38    0.43
CCA-Cross model     0.21    0.28    0.33    0.45    0.52


CCA-cross-modal retrieval has exceeded 50%. In general, the accuracy of CCA cross-modal retrieval is higher than that of single-modal text retrieval.

When retrieving similar objects, we used different distance functions to examine their impact on the experimental results. L1, L2 and NC were selected, and the final MAP results are as follows:

Table 3. Results of different distance functions.

                    L1      L2      NC      Average
Single model        0.21    0.19    0.24    0.213
CCA-Cross model     0.28    0.24    0.28    0.266

From the above experimental results we can see that CCA-based cross-modal retrieval for the Chinese Q&A community achieves the expected results and is superior to single-modal text retrieval.
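For reference, the three distance functions compared above can be sketched as follows. L1 and L2 are the standard Manhattan and Euclidean distances. NC is assumed here to mean normalized correlation, implemented as one minus the cosine of the angle between the vectors. The paper does not give its exact definition, so this reading is an assumption.

```python
import math

def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2_distance(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nc_distance(u, v):
    """Assumed NC distance: 1 minus the normalized correlation (cosine) of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0

# Hypothetical projected feature vectors.
u, v = [1.0, 2.0], [4.0, 6.0]
d1, d2, dnc = l1_distance(u, v), l2_distance(u, v), nc_distance(u, v)
```

For these two vectors L1 and L2 report a large gap because the magnitudes differ, while NC is close to zero because the directions almost coincide. This is one reason the choice of distance can change the MAP value.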


Since many retrieval algorithms and models have been proposed in the field of cross-modal information retrieval, we used comparative experiments to verify the performance and validity of the CCA cross-modal retrieval model proposed in this paper. The methods used for comparison are: a cross-modal retrieval algorithm based on the correspondence autoencoder (Corr-AE), a cross-modal retrieval algorithm based on the correspondence cross-modal autoencoder (Corr-Cross-AE), and a cross-media semantic mapping method based on a regularized deep neural network (RE-DNN). The experimental comparison results are shown in the following table:

Table 4. Comparison experiment results of cross-modal retrieval.

Experimental method    The pair of image and text
                       Image retrieval text    Text retrieval image    Average value
LDA-CNN-CCA            0.359                   0.372                   0.366
Corr-AE                0.326                   0.361                   0.344
Corr-Cross-AE          0.348                   0.341                   0.338
RE-DNN                 0.341                   0.353                   0.347

The table shows that the CCA cross-modal retrieval model proposed in this paper compares well with the other cross-modal retrieval models and performs well in cross-modal retrieval tasks: in both the image-retrieves-text and the text-retrieves-image tasks, it leads the results of the other algorithms.

Conclusion

With the rapid development of Chinese Q&A communities, cross-modal information such as text and pictures can be used to match more reasonable answers to users' questions. In this paper, we use the LDA method to extract text features and a convolutional neural network with K-means clustering to extract image features. Finally, we use the CCA method for correlation analysis, which improves the performance and accuracy of the question answering system. However, the research on cross-modal retrieval in this paper is limited. Fusing CCA with other methods can be considered to further improve retrieval performance.

Acknowledgments


References

[1] Liu Yu, Yuan Jian. Method for ranking candidate answers in Q & A community based on RTEM model [J]. Electronic Technology, 2016, 29(5): 130-134.

[2] Zhao Shanshan. The combination of deep learning and multivariate characteristics of the answer to choose the sort of research [D]. Harbin Institute of Technology, 2016.

[3] Wang M., Manning C.D. Probabilistic tree-edit models with structured latent variables for textual entailment and question answering [C]//Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010: 1164-1172.

[4] Heilman M., Smith N.A. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions [C]//Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010: 1011-1019.

[5] Yao X., Van Durme B., Callison-Burch C., et al. Answer Extraction as Sequence Tagging with Tree Edit Distance [C]//HLT-NAACL. 2013: 858-867.

[6] Yih W., Chang M.W., Meek C., et al. Question answering using enhanced lexical semantic models [J]. 2013.

[7] Yih W., Zweig G., Platt J.C. Polarity inducing latent semantic analysis [C]//Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012: 1212-1222.

[8] Severyn A., Moschitti A. Automatic Feature Engineering for Answer Selection and Extraction [C]//EMNLP. 2013: 458-467.

[9] Zhang H., Zhuang Y.T., Wu F. Cross-modal Correlation Learning for Cluster on Image-audio Data Set. ACM International Conference on Multimedia, Augsburg. 2007. 273-276.

[10] Wu F., Zhang H., Zhuang Y.T. Learning Semantic Correlations for Cross-media Retrieval. In IEEE International Conference on Image Processing, Atlanta, GA. 2006. 1465-1468.

[11] Lu Bo, Wang Guoren. New Method of Cross-media Retrieval Based on Large-scale Data. Journal of Computer Science and Technology. 2012. 1140-1149.

[12] Liu Yao. Cross-modal multimedia information fusion based on CCA and Adaboost [D]. Southwest University, 2016.

[13] Blei D.M., Ng A.Y., Jordan M.I. Latent dirichlet allocation [J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.

[14] Lowe D. Object Recognition from Local Scale-Invariant Features [C]. International Conference on Computer Vision. 1999.

[15] Ding Heng, Lu Wei. Research on cross-modal information retrieval based on correlation [J]. New Technology of Library and Information Service, 2016, 32(1): 17-23.

[16] Hotelling H. Relations between two sets of variates [J]. Biometrika, 1936, 28(3/4): 321-377.

[17] Li Yuxiang. Question-answer community-based problem correlation and answer ranking [D]. Shanxi University, 2011.

[18] Wang Y., Liu Z., Huang J.C. Multimedia Content Analysis: Using Both Audio and Visual Clues. IEEE Signal Processing Magazine. 2000. 12-36.


... Multimedia, Apr. 2008. 10(3): 437-446.

[20] Blei D., Ng A., Jordan M. Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003. 3: 993-1022.

[21] Zhang Ying, Zhao Yanjun. Architecture and Method of Multimedia Data Mining in Digital Library [J]. New Intelligence, 2008, 28(1): 92-94.

[22] Zhuang Y.T., Yang Y., Wu F. Mining Semantic Correlation of Heterogeneous Multimedia Data for Cross-media Retrieval. IEEE Transactions on Multimedia. 2008. 10(2): 221-229.