Task-Oriented Multi-User Semantic Communications for Multimodal Data

(1)

Task-Oriented Multi-User Semantic Communications for Multimodal Data

Huiqiang Xie^∗, Zhijin Qin^∗, Geoffrey Ye Li^†

∗Queen Mary University of London, UK

†Imperial College London, UK

Email: {h.xie, z.qin}@qmul.ac.uk, [email protected]

Abstract—Semantic communications focus on the successful transmission of information relevant to the transmission task. In this paper, we investigate multi-users transmission for multimodal data in a task semantic communication system. We take the vision-answering as the semantic transmission task, in which part of the users transmit images and the other users transmit text to inquiry the information about the images. The receiver will provide answers based on the image and text from multiple users in the considered system. To exploit the correlation between the multimodal data from multiple users, we proposed a deep neural network enabled multi-user semantic communication system, named MU-DeepSC, for the visual question answering (VQA) task, in which the answer is highly dependent on the related image and text from the multiple users. Particularly, based on the memory, attention, and composition (MAC) neural network, we jointly design the transceiver and merge the MAC network to capture the features from the correlated multimodal data for serving the transmission task. The MU-DeepSC extracts the semantic information of image and text from different users and then generates the corresponding answers. Simulation results validate the feasibility of the proposed MU-DeepSC, which is more robust to various channel conditions than the traditional communication systems, especially in the low signal-to-noise (SNR) regime.

I. INTRODUCTION

The continuously increasing connected-mobile devices and enriched various intelligent demands cause the explosion of wireless data traffic, and bring new challenges to communication systems, including dealing with the huge volumes of data, providing the cornerstone for various intelligent tasks, and exploiting the limited frequency resource. Semantic communications is a promising technology to address these challenges because of its great potential to significantly reduce required resources for transmission [1], [2]. Compared with the traditional systems, semantic communications focus on extracting essential features from the source at the transmitter to serve the transmission task directly at the receiver. Therefore, only relevant information will be transmitted to the receiver instead of the bit sequences mapped from the source. By doing so, for the same amount of source information, a lower volume of data will be generated and transmitted without degrading the transmission accuracy. Moreover, the semantic communication system has been approved to work well in harsh communication environments.

Although Weaver has proposed the concept of semantic communication more than seven decades ago [3], the develop- ments of semantic communications are still in its infancy due

to the lack of mathematical models for processing semantic information. Following Weaver’s idea, Carnap et al. [4] and Bao et al. [5] tried to establishing semantic communication information theory by using logical probability to measure semantic information initially. However, how to design effective semantic coding is still an open problem. Until very recently, we are witnessing great interest in semantic communications from both academia and the industry, powered by the boom of artificial intelligence (AI).

Inspired by the emerging deep learning (DL), there have been several works on the semantic coding using the neural networks, which outperform the traditional communication systems in terms of recovering data and performing intelligent tasks. For data recovery, Farsad et al. [6] proposed the joint source-channel coding (JSCC) based on long short-term memory (LSTM) for text transmission, which shows the robust performance over erasure channels. Xie et al. [7] developed a semantic communication system based on Transformer, named DeepSC, to recover the text under more complex and practical wireless environments, which demonstrates the potential of semantic communications over the air. Xie et al. [8] further extended the work in [7] and designed a lite DeepSC to decrease the model parameters for suiting narrow bandwidth and power-limited devices in distributed communication environments. Bourtsoulatze et al. [9] proposed a JSCC with convolutional neural networks (CNNs) to recover image and Weng et al. [10] developed a semantic communication system for speech signals, named DeepSC-S. Besides, there have been several works on performing intelligent tasks. Lee et al. [11] considered the image classification task and designed the transmission-recognition communication system jointly.

Jankowski et al. [12] merged the JSCC and image features networks, i.e., ResNet, for the image retrieval tasks. However, the aforementioned works focus on the point-to-point semantic communication scenarios and for a specific type of source, i.e., text only, image only, or speech only. However, how to design multi-user semantic communication system to deal with the multimodal data is still an open problem.

Compared with single modality data, multimodal data can provide more information about the final transmission task than single data, introduce new degrees of freedom, and increase the performance of transmission task [13]. To deal with multimodal data, the first challenge is to extract the suitable semantic information from each user, which requires

arXiv:2108.07357v1 [eess.SP] 16 Aug 2021

(2)

Semantic Encoder 1 ( ) User 1

User N

…

Task

… …

Channel Estimation

s1

sN x_N

x1

y1

yM

ˆx1

ˆ_N x

z1

zN

α1

βN

γ1

γN

φ y2

Semantic Encoder N ( )

Channel Encoder 1 ( )

Channel Encoder N ( )

Channel Decoder 1 ( )

Channel Decoder N ( )

Semantic Decoder ( )

Base Station N Users

β1

αN

Fig. 1. Multi-user semantic communication system model.

to transmit useful information with respect to the transmission task. The second challenge is to build a model for multimodal semantic information fusion at the receiver, which requires that the model can explore the complementarity underlying different semantic information and make the decision effectively. It can be summarized as follows

• Q1: How to extract the related semantic information from each user?

• Q2: How to deal with the correlated multimodal data at the receiver?

In this paper, we investigate multi-user semantic communications for multimodal data transmission, where the data from each user is correlated. Specifically, the visual question answering (VQA) task, one of the most challenging multi- user multimodal data tasks, is considered, which enables the receiver to answer the questions raised by one device about the visual information captured from another device. The main contributions are summarized as follows:

• A novel task-oriented framework is proposed for the multi-user semantic communication systems for correlated multimodal data transmission, named MU-DeepSC, where the transceivers are jointly designed by exploiting the neural networks.

• The MU-DeepSC transmitter adopts the memory, attention, and composition (MAC) neural network to process the correlated data. By extracting the semantic information of image and text from different transmitters, the MU-DeepSC will directly generate the answers based on the received semantic information at the receiver. These address the aforementioned Q1 and Q2.

The rest of this article is structured as follows. Section II in- troduces the model of the multi-user semantic communication system. In Section III, the details of the proposed MU-DeepSC are presented. Simulation results are discussed in Section IV and Section V concludes this work.

II. SYSTEMMODEL

As shown in Fig. 1, we consider the multi-user semantic communication system, which consists of one receiver with M antennas and N single-antenna transmitters. In the system, we focus on the correlated multimodal data transmission from multiple users, where each transmits semantic information

correlated with others. For example, one user transmits collect the vision information collected by a camera while the other user sends the text information collected by a sensor. The collected vision and text information are employed to perform certain tasks, i.e., the VQA tasks.

A. Semantic Transmitters

As shown in Fig. 1, we denote the source data of the i-th user as s_i, i.e., image or text, where each source contains the correlated semantic information. Then, the transmitted complex signal of the i-th user, xi, can be represented by

xi= C (S (si; αi) ; βi) , (1) where S (si; αi) and C (·; βi) are the semantic encoding and channel encoding, respectively, αi and βi are the corresponding encoding parameters of the i-th user, respectively.

The semantic information of source data is extracted first by each semantic encoder. Different from traditional channel coding to add parity bits, the channel encoder in semantic communication compresses semantic information in addition to dealing with the effects from channels.

B. Semantic Receiver

When the transmitted signal passes a physical channel, the M × 1 received signal at the receiver can be expressed as

y = Hx + n, (2)

where H = [h1, h2, ..., hN] ∈ C^{M ×N} is the channel between the BS and users, x = [x1, x2, · · · , xN] ∈ C^{N ×1} denotes transmit symbols from all N users in the considered system, and n ∈ C^{M ×1} indicates the circular symmetric Gaussian noise, items of n are of variance σ_n².

Compared with conventional end-to-end communication system based on auto-encoder, we employ the additional domain knowledge, i.e., channel estimation, to improve the training speed and enhance the final decision accuracy. With estimated channel, the transmitted signal can be estimated by

ˆ

x = H^HH⁻¹

H^Hy = x + ˆn, (3) where ˆx = [ˆx1, ˆx2, · · · , ˆxN] is the estimated information of all N users, ˆn = H^HH−1

H^Hn represents the impact of noise. With operation in (3), the channel effect is transferred from multiplicative noise to additive noise, which significantly

(3)

ResNet-101Bi-LSTM Image Channel Encoder

Is the number of matte blocks in front of the small yellow cylinder greater than the number of large

red shiny cylinders?

Text Channel Encoder Channel Estimation

Image Semantic yes Information

Text Semantic Information

Transmitted Image Symbols

Transmitted Text Symbols

Received Symbols

Received Image Symbols

Received Text Symbols

Recovered Text Semantic Information

Recovered Image Semantic Information

MAC Network

Fig. 2. The structure of proposed MU-DeepSC.

reduces the learning burden. Therefore, the semantic information from each user is recovered by the channel decoder and can be computed by

zi= C⁻¹(ˆxi; γi) , (4) where C⁻¹(ˆx_i; γ_i)¹ is channel decoder for the i-th user with parameters γi.

With the received semantic information, we can directly perform the final task, i.e., classification category or answers, with the semantic decoder by

Task= S⁻¹(z; ϕ) , (5)

where S⁻¹(z; ϕ) is the semantic decoder with parameters ϕ, and z = [z1, z₂, · · · , z_N] is the recovered semantic information.

It is difficult to build a network that can support various tasks, we therefore choose one of the most challenging tasks, VQA task, to perform multi-user semantic communications for multimodal data transmission.

C. Performance Metrics

Different from the traditional communication systems, the final decoded signal is Task defined in (5) instead of a sequence of bits, which is hard to measure the system performance with bit-error rate (BER) or symbol-error rate (SER).

Therefore, it is important to reasonable performance metrics to indicate the accuracy of Task. Here, we choose the accuracy of predicted answers as the performance metrics for the VQA task. But the metrics could vary for different transmission tasks.

III. PROPOSEDMU-DEEPSC TRANSCEIVER

In this section, we design a deep neural network (DNN) for the considered semantic communication system, named as MU-DeepSC in Fig. 2, of which the MAC network is adopted for answering questions.

1In order to reduce the number of representation symbols, we use ·⁻¹here to represent the decoder.

TABLE I

THE SETTING OF THE CHANNEL ENCODER AND DECODER.

Image Layer Name Units Activation

Channel Encoder

CNN 256, 3×3 Elu

CNN 128, 3×3 Elu

Channel Decoder

CNN 256, 3×3 Elu

CNN 512, 3×3 Elu

Text Layer Name Units Activation

Channel Encoder Dense 256 Relu

Channel Decoder Dense 512 Relu

A. The Proposed MU-DeepSC

As shown in Fig. 2, we use the image and text information as an example, where only two users are discussed for con- venience. It is easy to expand the network with the input of multi-image and multi-text.

The proposed MU-DeepSC consists of a image transmitter and a text transmitter. For the image transmitter, we extract the semantic information by image semantic encoder, which employs the first 30 blocks from ResNet-101 network [14] pre- trained on ImageNet [15]. This ensures similar results in different setups. Afterwards, the semantic information is mapped to the transmitted symbols directly by the image channel encoder shown in Table I, which consists of CNN layers due to its characteristics of learning different local features of image effectively. For the text transmitter, we employ one layer Bi- LSTM to extract the semantic representations of each word in the input sentence. Then, the text will be compressed and mapped to the transmitted text symbols via the text channel encoder shown in Table I, which includes several dense layers to preserve all text semantic information since the dense layer can transform the entire input information.

The MU-DeepSC receiver processes the signal by three components: channel estimation, image and text channel decoder, and semantic decoder. The channel state information (CSI) obtained from channel estimation is the domain knowledge to be used to speed up the training process. Without channel estimation component, the neural network has to learn to deal with the channel matrix, H, which requires

(4)

Classifier Read

Unit 1

Write Unit 1

Read Unit N

Write Unit N

...

Control Unit 1

Control Unit N

...

MAC cell 1

Read Unit 2

Write Unit 2 Control

Unit 2

MAC cell 2 MAC cell N

Received text semantic information

Received image semantic information

Fig. 3. The structure for the memory, attention, and composition network for text and image semantic information fusion and the VQA.

an additional learning burden. Then, similar to the channel encoders, the image and text channel decoder consists of CNN layers and dense layers to decompress and restore semantic information. Meanwhile, the MAC network [16] combines with the text and image semantic information as well as to answer the vision question.

The MAC network shown in Fig. 3 consists of multiple MAC cells, of which each includes control unit, read unit, and write unit. The control unit first generates a query based on the received text semantic information, i.e., the object of question and the type of question, by the attention mechanism, then the read unit gets the query and searches the corresponding key from image semantic information by another attention module.

Finally, the write unit integrates information and outputs the predicted answers to the questions.

B. Loss Function

As indicated before, the objective of the proposed MU- DeepSC is to answer the questions based on the images and texts. The proposed transceiver is task-oriented, where the answers will be predicted directly at the MU-DeepSC receiver.

As the image and text will not be recovered in the proposed transceiver like traditional communication systems, the loss function based on bit-error or symbol-error are not applicable any more. In order to improve the accuracy of answers, the cross-entropy (CE) is used as the loss function to measure the difference between the correct answer, a, and the predicted answer, ˆa, which can be formulated as

LCE(a, â; α, β, γ, ϕ) = −p (a) log (p (â)), (6) where p(a) is the real probability of the answer, and p(â) is the probability of the predicted answer. The CE can measure difference between the two probability distributions. In real life, when human beings try to answer the question about the image, they observe the image and question first, and then choose the answer with the highest probability by their inference. Through reducing the loss value of CE, the network learns the correct answer first and tries to predict the answer with the highest correct probability. Then the network can be optimized by gradient descent. We take the semantic encoder,

α, as an example, the updated trainable parameters at the i+1- th training epoch is given by

α (i + 1) = α (i) − η∇_α(i)LCE(a, ˆa; α(i), β(i), γ(i), ϕ(i)), (7) where η is the learning rate.

IV. SIMULATIONRESULTS ANDDISCUSSIONS

In this section, we compare the proposed MU-DeepSC with DL-enabled semantic communication algorithms and the traditional source coding and channel coding methods under the AWGN channels, Rayleigh fading channels, and Rician fading channels, where the perfect CSI is assumed for all methods. The transceiver is assumed with two single-antenna users and one receiver with two antennas.

A. Implementation Details

The adopted Dataset is CLEVR [17], which consists of a training set of 70,000 images and 699,989 questions and a validation set of 15,000 images and 149,991 questions.

In the experiments, images will be resized to a common 224

× 224 resolution with bicubic interpolation before computing semantic information. Text will be embedded by embedding layer, which is initialized from N (0, 1) with shape (vocab size, embedding-dim). The embedding dimension is set to be 300. We employ the network as shown in Section III.A, where the MAC network consists of 12 cells. We employ the ADAM optimizer with a learning rate of 0.0001. For the baselines, different from predicting with image and text, we mainly consider two cases, only using question or image to predict the answer, and the typical separate source and channel coding,

• Error-free transmission: The full, noiseless images and texts input to the MAC network directly.

• Traditional method: To perform the source and channel coding separately, we use the following technologies respectively:

– Joint photographic experts group (JEPG) [18] as the image source coding, a commonly used method of lossy compression with compression rate of 75 for digital images,

(5)

– Huffman coding for text source coding, lossless compression for text,

– Low-density parity-check codes (LDPC) with coding rate of 1/3 [19] as the channel coding, especially for a large data size.

• Text or image only based prediction: The transmitter is the same as the text user or image user in Fig. 2, but the receiver replaces the MAC network with a one-layer classifier.

The modulation method for the traditional method is 16 quadrature amplitude modulation (QAM). For the traditional method, the recovered image and text are inputted to the MAC network to get answers. The answer accuracy is adopted to measure the system performance.

B. Performance of MU-DeepSC

Fig. 4 shows the relationship between the answer accuracy and the SNR for the similar number of transmitted symbols over AWGN, Rayleigh fading channels, and Rician fading channels. Among the methods in Fig. 4, the proposed MU- DeepSC outperforms other baselines, especially in the low SNR regime, and is about to approach the upper bound at high SNR regime, which validates the effectiveness of the proposed MU-DeepSC. Besides, over all SNR regimes, transmitting single source, such as text or image only, has the similar answer accuracy over three channels. The reason is that the system only has the semantic information of the question or image and then need to guess the answers based on the text or image information, where the case transmitting text only achieves higher accuracy since the text information can help the system narrow the search range of answers. For example, it is possible to guess the answers with given questions, but, it is hard to predict answers with images only.

Moreover, in Fig. 4(a), the traditional method performs worse at the low SNR regime since the images are corrupted by error bits, but performing higher accuracy as the SNR increases. Similarly, for more complex channels, the answer accuracy of the traditional method can increase with the higher SNR in Fig. 4(b) and 4(c), which means that the traditional method can achieve similar accuracy as MU-DeepSC at higher SNR or with wider bandwidth to allow more parity bits to correct errors, which validates the effectiveness of MU- DeepSC at harsh communication environments.

V. CONCLUSION

In this paper, we have proposed multi-users semantic communication system, named MU-DeepSC, for dealing with the correlated image and text information, where the VQA task is considered. By jointly design the semantic encoder and the channel encoder by learning and extracting the essential semantic information, the proposed MU-DeepSC handles the image and text semantic information effectively and predicts the answers accurately by merging different semantic information. The simulation results have demonstrated that the MU- DeepSC outperforms various benchmarks, especially in the low SNR regime. Hence, we are highly confident that the

0 3 6 9 12 15 18

SNR (dB) 0

0.2 0.4 0.6 0.8 1

Answer Accuracy

Error-free transmission JPEG+Huffman+LDPC DeepSC: text only DeepSC: image only MU-DeepSC

(a) AWGN Channels.

0 3 6 9 12 15 18

SNR (dB) 0

0.2 0.4 0.6 0.8 1

Answer Accuracy

Error-free transmission JPEG+Huffman+LDPC DeepSC: Text only DeepSC: Image only MU-DeepSC

(b) Rayleigh Fading Channels.

0 3 6 9 12 15 18

SNR (dB) 0

0.2 0.4 0.6 0.8 1

Answer Accuracy

Error-free transmission JPEG+Huffman+LDPC DeepSC: Text only DeepSC: Image only MSDeepSC

(c) Rician Fading Channels.

Fig. 4. Answer accuracy for various testing channels based on different trained models; (a) testing under AWGN channels; (b) testing under Rayleigh channels; (c) testing under Rician channels.

proposed MU-DeepSC is a promising candidate for multi-user semantic communication systems to transmit multimodal data.

(6)

REFERENCES

[1] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep learning in physical layer communications,” IEEE Wire. Comm., vol. 26, no. 2, pp. 93–99, 2019.

[2] G. Shi, D. Gao, X. Song, J. Chai, M. Yang, X. Xie, L. Li, and X. Li, “A new communication paradigm: from bit accuracy to semantic fidelity,”

arXiv preprint arXiv:2101.12649, 2021.

[3] C. E. Shannon and W. Weaver, The Mathematical Theory of Communi- cation. The University of Illinois Press, 1949.

[4] R. Carnap, Y. Bar-Hillel et al., An Outline of A Theory of Semantic Information. RLE Technical Reports 247, Research Laboratory of Electronics, Massachusetts Institute of Technology., Cambridge MA, Oct. 1952.

[5] J. Bao, P. Basu, M. Dean, C. Partridge, A. Swami, W. Leland, and J. A. Hendler, “Towards a theory of semantic communication,” in IEEE Network Science Workshop, West Point, NY, USA, Jun. 2011, pp. 110–

117.

[6] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source- channel coding of text,” in Proc. IEEE Int’l. Conf. Acoustics Speech Signal Process. (ICASSP), Calgary, AB, Canada, Apr. 2018, pp. 2326–

2330.

[7] H. Xie, Z. Qin, G. Y. Li, and B. H. Juang, “Deep learning enabled semantic communication systems,” IEEE Trans. on Signal Process., Early Access, 2021.

[8] H. Xie and Z. Qin, “A lite distributed semantic communication system for internet of things,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp.

142–153, Jan. 2021.

[9] E. Bourtsoulatze, D. B. Kurka, and D. G¨und¨uz, “Deep joint source- channel coding for wireless image transmission,” IEEE Trans. Cogn.

Commun. Netw., vol. 5, no. 3, pp. 567–579, May 2019.

[10] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” arXiv preprint arXiv:2102.12605, 2021.

[11] C. Lee, J. Lin, P. Chen, and Y. Chang, “Deep learning-constructed joint transmission-recognition for internet of things,” IEEE Access, vol. 7, pp.

76 547–76 561, Jun. 2019.

[12] M. Jankowski, D. G¨und¨uz, and K. Mikolajczyk, “Wireless image retrieval at the edge,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp.

89–100, Jan. 2021.

[13] D. Lahat, T. Adali, and C. Jutten, “Multimodal data fusion: An overview of methods, challenges, and prospects,” Proc. IEEE, vol. 103, no. 9, pp.

1449–1477, 2015.

[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of the IEEE conf. on computer vision and pattern recog., 2016, pp. 770–778.

[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” Int’l. J. Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[16] D. A. Hudson and C. D. Manning, “Compositional attention networks for machine reasoning,” in Int’l. Conf. on Learning Repre., ICLR, Vancouver, BC, Canada, Apr., 2018, 2018.

[17] J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in Proc. IEEE Int’l. Conf. on Computer Vision and Pattern Recog., 2017, pp. 2901–

2910.

[18] W. B. Pennebaker and J. L. Mitchell, JPEG: Still image data compression standard. Springer Science & Business Media, 1992.

[19] R. Gallager, “Low-density parity-check codes,” IRE Trans. Inf. Theory, vol. 8, no. 1, pp. 21–28, 1962.