## Wonjun Kim, Yongjun Ahn, Jinhong Kim, and Byonghyo Shim

| Learning technique | Applicable problem | Loss function | Application example |
|---|---|---|---|
| Supervised learning | Detection problem using classification training | Cross entropy, Kullback-Leibler (KL) divergence | LoS/NLoS detection, time-domain channel tap detection |
| Supervised learning | Estimation problem using regression training | Mean squared error (MSE), mean absolute error (MAE) | AoA/AoD/DoA estimation, path gain estimation |
| Unsupervised learning | Optimization problem | Objective function to be optimized (e.g., sum-rate, cell throughput) | Pilot signal allocation, channel state information feedback |
| Reinforcement learning | Sequential decision-making problem | Cumulative reward (e.g., sum-rate, total power consumption) | Beam tracking and selection, CQI feedback |

3) Reinforcement learning (RL): RL is a goal-oriented learning technique in which an agent learns how to solve a task by trial and error. In the learning process, the agent observes the state of an environment, takes an action, and then receives a reward for the action. RL is suitable for sequential decision-making problems whose purpose is to find a series of actions optimizing a performance metric such as data rate, energy consumption, or latency. Recently, deep RL (DRL) has been widely used since it can effectively handle the large-scale state-action space in dynamically varying wireless environments. While DRL is in general not directly used for channel estimation, it can be used for channel-related functions such as beam tracking and channel quality indication. For example, when one tries to track the angle in a V2X system using DRL, the base station (BS), playing the role of the agent, observes the state (e.g., distance, velocity, and angles (AoD/DoA)) and then determines the action (adjusting the beam direction) that maximizes the reward (e.g., a reward that decreases with the number of retransmissions).
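The agent-environment loop above can be sketched with a tiny tabular Q-learning toy (a hypothetical stand-in for DRL: the beam grid, drift model, and reward below are invented for illustration, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

N_BEAMS = 8            # quantized beam directions (illustrative)
ACTIONS = (-1, 0, 1)   # steer left, hold, steer right

# State: angular error between the serving beam and the user, in beam units.
Q = np.zeros((N_BEAMS, len(ACTIONS)))
alpha, gamma, eps = 0.2, 0.9, 0.1

error = 0
for step in range(5000):
    # epsilon-greedy action selection
    a = rng.integers(len(ACTIONS)) if rng.random() < eps else int(np.argmax(Q[error]))
    drift = rng.choice((-1, 0, 1), p=(0.1, 0.8, 0.1))   # user movement
    next_error = (error - ACTIONS[a] + drift) % N_BEAMS
    # Reward: a beam aligned with the user means no retransmission is needed.
    r = 1.0 if next_error == 0 else 0.0
    Q[error, a] += alpha * (r + gamma * np.max(Q[next_error]) - Q[error, a])
    error = next_error

# learned greedy policy: hold when aligned, steer toward the user otherwise
greedy = [int(np.argmax(Q[s])) for s in range(N_BEAMS)]
```

After training, the greedy policy holds the beam when the error is zero and steers one step toward the user when it drifts, which is exactly the tracking behavior the reward encodes.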

In this section, we delve into two main issues in DL-based wireless channel estimation: a sufficient and comprehensive training dataset and a properly designed neural network.

When the number of training samples is insufficient, the DL-based channel estimation model will be closely fitted to the training dataset, making it difficult to draw a reasonable inference for unseen channels. This problem, where the trained DL model lacks generalization capability, is called overfitting. In the angle estimation task, for example, if the received signals are generated from angles in $[0, \pi/2)$, then the trained network cannot accurately identify the remaining angles in $[\pi/2, \pi)$. To prevent the overfitting problem, a dataset should be large enough to cover all possible scenarios. This is not easy, in particular for wireless channels, since the number of required real transmissions would be humongous. In acquiring the training dataset, we basically have three options:

• Collection from the actual received signals

• Synthetic data generation using the analytic system model or the ray-tracing simulator

• Real-like training set generation using a generative adversarial network (GAN)

In the training set acquisition, a straightforward option is to collect real transmit/receive signal pairs. Doing so, however, will cause a significant overhead since it requires too many training data transmissions. For example, collecting one million received signals in 5G NR systems will take more than 13 minutes ($10^6$ symbols × 0.1 subframe/symbol × 8 ms/subframe = 800 s).

To reduce the overhead, one can consider a synthetically generated dataset (see Fig. 4(a)). In fact, in the design, test, and performance evaluation phases of most channel estimation algorithms, analytic models have been widely used. For example, propagation channels such as the extended pedestrian A (EPA) channel or extended vehicular A (EVA) channel have been popularly employed in the generation of training datasets [21]. In the EVA channel, for example, the channel is expressed by 9 tap delays (i.e., [0, 30, 150, 310, 370, 710, 1090, 1730, 2510] ns), each of which has a different path gain in the delay-Doppler domain. Since the synthetic data can be generated with simple programming, the time and effort to collect a huge training dataset can be saved. However, there might be some, arguably non-trivial, performance degradation caused by the model mismatch.
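A synthetic tapped-delay-line generator along these lines can be sketched as follows; the tap delays are the EVA delays quoted above, while the relative tap powers and the sampling rate are assumptions of this sketch (the power values follow the standard 3GPP EVA profile):

```python
import numpy as np

rng = np.random.default_rng(1)

# EVA power-delay profile: excess tap delays (ns) from the text, and
# relative tap powers (dB) from the standard 3GPP EVA profile (assumption)
delays_ns = np.array([0, 30, 150, 310, 370, 710, 1090, 1730, 2510])
powers_db = np.array([0.0, -1.5, -1.4, -3.6, -0.6, -9.1, -7.0, -12.0, -16.9])

fs = 30.72e6                                          # LTE sampling rate (assumption)
n_taps = int(np.ceil(delays_ns[-1] * 1e-9 * fs)) + 1  # length of the sampled CIR

def eva_channel_realization():
    """One Rayleigh-fading impulse response on a uniform sampling grid."""
    h = np.zeros(n_taps, dtype=complex)
    amp = 10 ** (powers_db / 20)                      # dB power -> amplitude scale
    gains = amp * (rng.standard_normal(9) + 1j * rng.standard_normal(9)) / np.sqrt(2)
    idx = np.round(delays_ns * 1e-9 * fs).astype(int) # nearest-sample tap positions
    np.add.at(h, idx, gains)
    return h

# a training set of 1000 sparse time-domain channel realizations
train_set = np.stack([eva_channel_realization() for _ in range(1000)])
```

Each realization is sparse: only 9 of the roughly 80 time-domain taps are nonzero, which is the structure the DL-based tap detector discussed later exploits.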

Yet another intriguing option is to use artificial but realistic samples generated by a special DL technique. This approach is particularly useful when the analytic model is unknown or non-existent (e.g., deep underwater acoustic channel and non-terrestrial network (NTN) channel) and the real measured data is not large enough (e.g., THz-band channel). In such cases, the GAN technique, an approach to generate samples having the same distribution as the input dataset, can be employed [22]. Basically, GAN consists of two neural networks: generator G and discriminator D (see Fig. 4(b)). The generator G tries to produce real-like data samples and the discriminator D tries to distinguish real (authentic) and fake data samples. To be specific, G is trained to generate real-like data G(z) from the random noise z and D is trained to distinguish whether the generator output G(z) is real or fake. To accomplish this mission, the min-max loss function, expressed as the cross-entropy (i.e., $H(q)=-q \log (p(q))-(1-q) \log (1-p(q))$) between the distribution of the generator output G(z) and that of the measured channel data x, is used:

$$\min_{G} \max_{D} \; \mathbb{E}_{\mathbf{x}}[\log D(\mathbf{x})]+\mathbb{E}_{\mathbf{z}}[\log (1-D(G(\mathbf{z})))].$$
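The two sides of this min-max game can be made concrete with a toy numerical sketch; the linear generator, logistic discriminator, and 1-D "channel" data below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def discriminator(v, w):        # toy logistic discriminator D(v) = sigmoid(w·v)
    return 1.0 / (1.0 + np.exp(-(v @ w)))

def generator(z, g):            # toy linear generator G(z) = g0 + g1*z
    return g[0] + g[1] * z

# "measured channel" samples x and generator noise z (1-D for illustration)
x = rng.normal(loc=2.0, scale=0.5, size=(256, 1))
z = rng.normal(size=(256, 1))

w = np.array([1.0])             # discriminator parameters
g = np.array([0.0, 1.0])        # generator parameters
gz = generator(z, g)

# D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; equivalently it minimizes:
d_loss = -np.mean(np.log(discriminator(x, w)) + np.log(1 - discriminator(gz, w)))
# non-saturating generator loss: G minimizes -E[log D(G(z))]
g_loss = -np.mean(np.log(discriminator(gz, w)))
```

In an actual GAN, both losses would be minimized alternately by gradient descent on the generator and discriminator networks.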

In the design of a DNN-based channel estimator, one should consider the channel characteristics (e.g., temporal/spatial/geometric correlation), wireless environments (e.g., mmWave/THz/V2X/UAV link), and system configurations (e.g., bandwidth, power of the transmit pilot signal, number of antennas).

1) Baseline network: A natural first step of the DNN design is to choose the baseline architecture. Based on the connection shape between neighboring layers, neural networks can be roughly divided into three types: fully-connected network (FCN), convolutional neural network (CNN), and recurrent neural network (RNN).

FCN can be used universally since each hidden unit (neuron) is connected to all neurons in the next layer. When the input dataset has a spatial structure (e.g., the doubly selective channel and the spatial-frequency domain channel in MIMO systems), CNN might be an appealing option. In CNN, each neuron is computed by the convolution of a 2D spatial filter and a part (e.g., a rectangular-shaped region) of the neurons in the previous layer. Due to the local connectivity of the convolution filter, CNN facilitates the extraction of spatially correlated features. For example, in MIMO channel estimation, the spatial correlation among uniform rectangular array (URA) antennas can be extracted using CNN. When the input sequence is temporally correlated, which is true for most wireless fading channels, an RNN model (e.g., long short-term memory (LSTM)) might be a good choice. In these approaches, by employing current inputs together with the outputs of the previous hidden layer, temporally correlated features can be extracted. For instance, by applying RNN to mmWave channel estimation, the change of the Doppler frequency caused by the mobile's movement can be extracted.

Another promising network architecture distinct from the representative structures we just mentioned is the attention module [23]. In essence, the attention module is a network component that quantifies the correlation between every pair of input elements. By measuring the attention score between two input elements called key and query, the attention module allows modeling of the dependencies between input elements regardless of their distance in pixels or time. Recently, the attention module has been used instead of CNN and RNN since it offers the capability to extract long-distance and long-term dependencies between input elements [24], [25]. For example, the attention module can be used to extract the spatial correlation between distant MIMO antenna arrays (e.g., elements placed at both ends) or to extract the temporal correlation between the traffic pattern and the co-channel interference in the MIMO channel.

2) Activation layer: The activation layer is used to 1) embed nonlinearity in the hidden layers and 2) generate the desired type of output in the final layer.

In each hidden layer, the weighted sum of inputs passes through the activation layer to determine whether the information generated by the hidden unit is activated (delivered to the next layer) or not. To this end, the rectified linear unit (ReLU) function $f(x) = \max(x, 0)$ or the hyperbolic tangent function $f(x) = \tanh(x)$ can be used [1]. By applying the activation to the linearly transformed input, one can better capture nonlinear operations (e.g., spline interpolation between adjacent subcarrier channels) and systematic nonlinearity (e.g., the relationship between the spatial-domain channel and AoA/AoD).

In the final layer of the DNN, the activation layer is used to make sure that the generated output is of the desired type. In the classification problem, the ground-truth label for each class is a probability, so the final output should also be in the form of a probability. When there are several time-domain channel taps or the angular-domain channel paths are non-unique in the channel estimation problem, it would be desirable to use the sigmoid function $f(x)=1/(1+e^{-x})$ returning the individual probability for each class [1], [26]. In contrast, when the problem is modeled as a multi-class classification problem such as the CQI detection problem that generates the proper value among quantized CQI levels (e.g., 0∼15 in LTE), the softmax function $f_i(\mathbf{x})=e^{x_i} / \sum_j e^{x_j}$ would be a good choice since it normalizes the output vector into a probability distribution over all classes [27].
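The distinction between the two output activations can be checked numerically; the logit values below are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 1.5])   # raw outputs of the final layer

# multi-label case (e.g., several channel taps can be active at once):
# each entry is an independent probability; the entries need not sum to 1
p_multi = sigmoid(logits)

# multi-class case (e.g., exactly one CQI level is correct):
# the entries form a probability distribution over the classes
p_class = softmax(logits)
```

Sigmoid lets several outputs be close to 1 simultaneously (several active taps), while softmax forces a competition so that exactly one class dominates.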

3) Input normalization: In the training process, the neural network computes the gradient of a loss function with respect to each weight and then updates the weight in the negative direction of the gradient. When an input varies over a wide range (e.g., a multi-user communication scenario), the variation in the weight update process will also be large, severely degrading the training stability and convergence speed. To prevent such ill-behavior, one should normalize the inputs of each layer. Typically, there are two types of normalization strategies: layer normalization and batch normalization.

First, when the input vector contains signals from multiple users with different wireless geometries, the variation of the received signals would be quite large. In such cases, layer normalization, which normalizes each individual input vector, is a good choice because it ensures that the normalized input distribution has a fixed mean and variance.

In contrast, when the input data consists of several different types of information, batch normalization (BN) might be a useful option. In the mini-batch $\mathbf{B}=[\mathbf{y}^{(1)} \cdots \mathbf{y}^{(N)}]^T$ consisting of multiple input samples $\mathbf{y}^{(1)}, \cdots, \mathbf{y}^{(N)}$, the elements in each column of B (i.e., elements with the same input type) are normalized. For example, in the DRL-based beam tracking problem, both the velocity of the moving target and the angles (AoD/DoA) are used as inputs to DRL. Since the scales of the two components would be quite different, layer normalization would simply mess up the input dataset. To avoid this, the velocity and angles need to be normalized separately using BN.
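The difference between the two normalization strategies can be sketched in a few lines; the velocity/angle inputs and batch size below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# mini-batch B: N samples, each with 2 feature types on very different
# scales (velocity in km/h and angle in radians -- illustrative only)
velocity = rng.uniform(0, 120, size=(8, 1))
angle = rng.uniform(-np.pi, np.pi, size=(8, 1))
B = np.concatenate([velocity, angle], axis=1)

def layer_norm(x):
    """Normalize each sample (row) across its own features."""
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    return (x - mu) / (sd + 1e-8)

def batch_norm(x):
    """Normalize each feature (column) across the mini-batch."""
    mu = x.mean(axis=0, keepdims=True)
    sd = x.std(axis=0, keepdims=True)
    return (x - mu) / (sd + 1e-8)

bn = batch_norm(B)   # velocity and angle are rescaled separately
ln = layer_norm(B)   # velocity dominates each row's statistics
```

With BN, each column (feature type) ends up zero-mean and unit-variance, which is exactly the per-component scaling the beam tracking example requires.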

4) Dropout layer: When we use a DNN consisting of multiple hidden layers, the final output is determined by the activated hidden units in each layer. Thus, for highly correlated inputs (e.g., samples generated from a correlated (low-rank) MIMO channel), the activation patterns will also be similar, so the final inference can be easily corrupted in the presence of perturbations (e.g., noise, ADC quantization error, and imperfect RF filtering). To mitigate this problem, the dropout layer, where activated hidden units are dropped out randomly, can be used in the training phase [28]. In this scheme, by temporarily removing part of the incoming and outgoing connections at random, the ambiguity (similarity) of the activation patterns among correlated data can be resolved.
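A minimal sketch of (inverted) dropout, assuming a drop probability of 0.3 purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def dropout(activations, p_drop=0.3, training=True):
    """Inverted dropout: zero a random subset of units during training and
    rescale the survivors so the expected activation stays unchanged."""
    if not training:
        return activations                  # inference: no-op
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((4, 10))                        # toy hidden-layer activations
h_train = dropout(h, p_drop=0.3)            # random units zeroed, rest scaled
```

Because each forward pass drops a different random subset of units, correlated inputs can no longer rely on one identical activation pattern.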

5) Ensemble learning: Ensemble learning, a method to average out multiple outputs (inferences) of independently trained networks, is conceptually analogous to the receiver diversity technique since it can enhance the output quality without requiring additional wireless resources (e.g., frequency, time, and transmit power). In a multi-user communication scenario, for example, the trained network might be closely fitted to a certain wireless setup (i.e., a specific co-scheduled user pair) so that the trained DNN might not generate a reliable channel prediction for inputs obtained from an unobserved wireless environment. In such cases, ensemble learning becomes a useful option. The essence of ensemble learning is to train multiple DNNs with different initial parameters and training sets obtained from different wireless environments and then combine the outputs of the multiple DNNs. Using ensemble learning, one can mitigate the overfitting to specific wireless environments and generate a more robust channel estimate.

6) Loss function: Since the DNN weights are updated in a direction that minimizes the loss function, the loss function should well reflect the design goal. When the ground-truth label is available, one can use the cross entropy, MSE, or mean absolute error (MAE). If this is not the case, as mentioned in the discussion of unsupervised learning, one might use the design goal (e.g., throughput or bit error rate) as a loss function.

If there exist multiple goals for the problem at hand, they can be combined together. For example, in the joint active user detection (AUD) and channel estimation (CE) in NOMA systems, the DNN is trained to detect the active devices and at the same time estimate the channels associated with the active users. To do so, we set the loss function $J$ as the weighted sum of the cross-entropy loss $J_{AUD}$ for AUD and the MSE loss $J_{CE}$ for CE (i.e., $J=J_{AUD}+\lambda J_{CE}$).

7) Weight update strategy: In order to update the network parameter set, the gradient of the loss function should be computed first. A naive way to update the parameters is the batch gradient descent (BGD) method, where the gradient of the loss function is computed over the entire training dataset. Since the whole dataset is used in each and every training iteration, the training cost is quite expensive and the training speed will be very slow. In a non-static scenario where the channel characteristics are varying, parameters corresponding to the dynamically changing wireless environments (e.g., Doppler spread, scatterer locations) would not be updated properly. A better option to deal with this issue is the stochastic gradient descent (SGD) method. In contrast to BGD, SGD uses a small number, say D, of samples in each training iteration (i.e., $\Theta_t=\Theta_{t-1}-\frac{\eta}{D} \sum_{d=1}^D \nabla_{\Theta} J^{(d)}(\Theta)$) so that it can update the network parameters as soon as a few samples are obtained.
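The SGD update above can be sketched on a toy scalar regression problem; the data model, step size, and mini-batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# toy regression: learn theta from noisy observations y = theta_true * x
theta_true = 3.0
x = rng.normal(size=10000)
y = theta_true * x + 0.01 * rng.normal(size=10000)

def grad(theta, xb, yb):
    # gradient of the per-sample MSE loss J = (x*theta - y)^2
    return 2 * xb * (xb * theta - yb)

theta, eta, D = 0.0, 0.05, 32      # initial weight, step size, mini-batch size
for t in range(500):
    idx = rng.integers(0, len(x), size=D)              # draw D random samples
    # SGD step: theta_t = theta_{t-1} - (eta/D) * sum of per-sample gradients
    theta -= (eta / D) * np.sum(grad(theta, x[idx], y[idx]))
```

Each iteration touches only D = 32 samples instead of the full dataset, which is what lets SGD react quickly when fresh samples from a changing channel arrive.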

8) Knowledge distillation: When one tries to use DL-based channel estimation in an Internet of Things (IoT) device, on-device energy consumption is a big concern since most IoT devices are battery-powered. To reduce the training overhead, knowledge distillation (KD), an approach to generate a relatively small-sized DL model from a trained large model, can be employed [29]. The key idea of KD is to train a small network (a.k.a. student network) using the output of a large network (a.k.a. teacher network). In the generation of the loss function, the output of the student network is compared against the output of the teacher network as well as the ground-truth label. In doing so, the student network implemented in the IoT device can easily capture the underlying features (e.g., similarities and differences among the classes) extracted by the teacher network with minimal training overhead.
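A common way to realize this combined loss is a weighted sum of the hard-label cross entropy and the cross entropy against the teacher's temperature-softened probabilities; the temperature, weight, and logit values below are illustrative assumptions, not values from the text:

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))   # temperature T softens the distribution
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, onehot_label, T=4.0, lam=0.5):
    """Distillation loss: cross entropy with the hard label plus cross
    entropy with the teacher's softened class probabilities."""
    p_s = softmax(student_logits)
    hard = -np.sum(onehot_label * np.log(p_s + 1e-12))      # ground-truth term
    p_t = softmax(teacher_logits, T)                        # teacher soft labels
    p_s_T = softmax(student_logits, T)
    soft = -np.sum(p_t * np.log(p_s_T + 1e-12))             # teacher term
    return (1 - lam) * hard + lam * (T ** 2) * soft         # T^2 rescales gradients

loss = kd_loss(np.array([2.0, 0.5, -1.0]),    # student outputs (assumed)
               np.array([3.0, 1.0, -2.0]),    # teacher outputs (assumed)
               np.array([1.0, 0.0, 0.0]))     # ground-truth one-hot label
```

The softened teacher probabilities carry the inter-class similarity structure that a hard one-hot label cannot express.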

In order to exemplify the techniques we discussed, we present the DNN architecture for AoA detection (see Fig. 5). Due to the sparse scattering in the mmWave band, a propagation path can be characterized by a few AoAs. By identifying these angles, the receiver can align the beam direction to the transmitter, thereby maximizing the beamforming gain. In the DNN, we use the received signal y and the steering matrix $\mathbf{A}=[\mathbf{a}(\theta_1) \cdots \mathbf{a}(\theta_N)]$ as inputs and the set of detected AoAs Ω as the output^{4}. Since an input is a composite of the received signal and steering vectors, we use BN to normalize each component separately. Also, to generate an individual probability for each angle, we use the sigmoid activation function in the final layer.

^{4}$\mathbf{a}(\theta_i)=[1 \;\; e^{j \pi \sin \theta_i} \;\; \cdots \;\; e^{j \pi(m-1) \sin \theta_i}]^T$ is the steering vector corresponding to $\theta_i$.
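The steering matrix A built from these vectors can be sketched as follows; the array size, angle grid, and single-path test signal are illustrative assumptions:

```python
import numpy as np

def steering_vector(theta, m):
    """a(theta) for an m-element half-wavelength ULA, as in the footnote."""
    return np.exp(1j * np.pi * np.arange(m) * np.sin(theta))

m = 16                                   # number of receive antennas (assumption)
grid = np.linspace(0, np.pi / 2, 64)     # candidate AoA grid (assumption)
A = np.column_stack([steering_vector(t, m) for t in grid])   # m x N steering matrix

# noiseless toy received signal from a single path at the 10th grid angle
y = steering_vector(grid[10], m)
corr = np.abs(A.conj().T @ y)            # matched-filter correlation per angle
best = int(np.argmax(corr))              # detected AoA index
```

The correlation peaks exactly at the true grid angle; the DNN in Fig. 5 learns a soft (probabilistic) version of this angle-by-angle decision.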

To judge the effectiveness of the DNN components consisting of the normalization, dropout, and ensemble learning, we evaluate the success probability of AoA detection. In the numerical evaluations, we test 1) FCN, 2) FCN with BN, 3) FCN with the dropout layer, 4) FCN with the ensemble network, and 5) FCN with all components we discussed in this section. As shown in Fig. 6, the performance gain introduced by the aforementioned techniques is considerable. For example, FCN with the dropout layer achieves a 4.8 dB gain over the conventional FCN since the highly correlated steering vectors can be better resolved using the dropout technique. The gain obtained from BN is also significant (4.7 dB gain) since it reduces the variation of the received signal caused by the device location change. Finally, when the gains induced by all techniques are combined together, we can achieve very reliable AoA detection performance, which is not achievable by the basic FCN even in the high SNR regime.

In this section, we discuss the DL-based channel estimation techniques. We also investigate the DL-based CSI feedback in MIMO system and the meta learning-based GAN technique for real-time training.

It has been shown from recent measurement campaigns that the scattering effect of the environment is limited in many wireless channels. In other words, propagation paths tend to be clustered within a small spread so that there exist only a few dominant scattering clusters. In wideband communication systems where the maximal delay and Doppler spreads are large and there are only a few dominant paths, the channel can be readily modeled by a few tapped delay lines in the time domain (i.e., delay-Doppler domain) [18]. Moreover, since the path delays vary much more slowly than the path gains due to the temporal channel correlation, such sparsity is almost unchanged during the coherence time. Therefore, one can greatly reduce the pilot overhead by estimating the channel in the time domain.

Let us consider the OFDM system where we allocate the pilot symbols $\mathbf{p}_f=[p_0 \; p_1 \; \cdots \; p_{m-1}]$ in the frequency domain. When we model the time-domain channel as an n-tap channel impulse response $\mathbf{h}=[h_0 \; h_1 \; \cdots \; h_{n-1}]^T \in \mathbb{C}^{n \times 1}$, the received signal y in the frequency domain is expressed as

$$\mathbf{y}=\operatorname{diag}(\mathbf{p}_f) \boldsymbol{\Phi} \mathbf{D} \mathbf{h}+\mathbf{w}=\mathbf{P h}+\mathbf{w},$$

with $\mathbf{w}$ denoting the noise vector,

where $\mathbf{D} \in \mathbb{C}^{n \times n}$ is the DFT matrix converting the time-domain channel response h to the frequency-domain response, $\boldsymbol{\Phi} \in \mathbb{C}^{m \times n}$ is the row selection matrix determining the location of the pilots in the frequency domain, and $\mathbf{P}=\operatorname{diag}(\mathbf{p}_f) \boldsymbol{\Phi} \mathbf{D}$. Due to the limited number of scattering clusters, the number of nonzero taps in h is very small (i.e., much smaller than $n$). When we introduce the nonzero tap indicator $\boldsymbol{\delta}=[\delta_0 \; \delta_1 \; \cdots \; \delta_{n-1}]$ ($\delta_i$ = 1 and 0 indicate that the $i$-th element of h is nonzero and zero, respectively), y can be rewritten as

$$\mathbf{y}=\sum_{i=0}^{n-1} \delta_i h_i \mathbf{p}_i+\mathbf{w},$$

where $h_i$ is the $i$-th element of h and $\mathbf{p}_i$ is the $i$-th column of P. Consequently, the time-domain channel estimation problem can be decomposed into two sub-problems: 1) support identification to find the nonzero positions in h and 2) nonzero element estimation to find $h_i$. Once the support of h is identified, the estimates of the nonzero elements can be easily obtained by conventional linear estimation techniques (see Section II.A).

As mentioned, the time-domain channel taps can be detected using the CS technique, but it might not work well when the columns of the system matrix P are highly correlated and the sparsity (number of nonzero taps) of h increases. When we use the DL technique, we need to find the direct mapping $\boldsymbol{\delta}=g(\mathbf{y} ; \boldsymbol{\theta})$ between the received signal y and the nonzero tap indicator $\boldsymbol{\delta}$, where $\boldsymbol{\theta}$ is the set of network parameters (see Fig. 7(a)). In the DL-based channel tap detection, the final output is the n-dimensional vector $\hat{\boldsymbol{\delta}}$ whose elements represent the probabilities of being a nonzero tap. Thus, $\hat{\boldsymbol{\delta}}$ needs to be compared against the true indicator $\boldsymbol{\delta}$ in the calculation of the loss J. To do so, we employ the cross-entropy loss $J=-\sum_{i=0}^{n-1}(\delta_i \log \hat{\delta}_i+(1-\delta_i) \log (1-\hat{\delta}_i))$ during the training.
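The cross-entropy loss above can be written directly; the example indicator and predictions are invented for illustration:

```python
import numpy as np

def tap_detection_loss(delta_hat, delta):
    """Binary cross entropy between the predicted nonzero-tap
    probabilities delta_hat and the true indicator delta."""
    eps = 1e-12                       # guard against log(0)
    return -np.sum(delta * np.log(delta_hat + eps)
                   + (1 - delta) * np.log(1 - delta_hat + eps))

delta = np.array([1, 0, 0, 1, 0, 0, 0, 1])                 # true support of h
good = np.array([0.9, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.9])  # confident, correct
bad = np.array([0.1, 0.9, 0.1, 0.2, 0.1, 0.1, 0.1, 0.2])   # confident, wrong

l_good = tap_detection_loss(good, delta)
l_bad = tap_detection_loss(bad, delta)
```

A prediction that places high probability on the true taps yields a much smaller loss than one that misplaces the support, which is exactly the gradient signal used to train the tap detector.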

In Fig. 7(b), we numerically evaluate the MSE performance of the DL-based scheme as a function of SNR. For comparison, we also examine the performance of the conventional LS estimator, the LMMSE estimator, and the OMP algorithm. We observe that the DL-based time-domain channel estimation outperforms the conventional schemes in all SNR regimes^{5}. Since the DNN learns the correlation features of the system matrix P, the DL-based channel tap detection can better discriminate the correlated supports in the test phase. For example, we observe that the DL-based time-domain channel estimation achieves a 2 dB gain over OMP at MSE = $10^{-5}$.

^{5}Note that we used 100,000 samples for the training and 5,000 samples for the test.

As a core technology for 5G and 6G, mmWave and THz communication have received much attention recently [30]–[32]. By leveraging the abundant frequency spectrum resources in the mmWave/THz frequency bands (30 GHz ∼ 10 THz), mmWave/THz-based systems can support a much higher data rate than the conventional sub-6 GHz systems. In the mmWave/THz systems, the beamforming technique realized by MIMO antenna arrays is required to compensate for the severe path loss in the high frequency bands [15]. Since the beamforming gain is maximized only when the beams are properly aligned with the signal propagation paths, the acquisition of an accurate downlink channel is of importance for the success of mmWave/THz MIMO systems.

We consider the downlink MIMO system where the BS equipped with $N_T$ transmit antennas serves a user equipment (UE) equipped with $N_R$ receive antennas. Then, the goal of the channel estimation problem is to obtain the downlink channel matrix of the k-th pilot subcarrier $\mathbf{H}[k] \in \mathbb{C}^{N_R \times N_T}$ from the received signal $\mathbf{Y}[k] \in \mathbb{C}^{M \times T}$:

$$\mathbf{Y}[k]=\mathbf{S}^{H}[k] \mathbf{F}^{H} \mathbf{H}^{H}[k] \mathbf{W}+\mathbf{N}[k],$$

with $\mathbf{N}[k]$ denoting the noise matrix,

where $\mathbf{F}=[\mathbf{f}_1, \cdots, \mathbf{f}_M] \in \mathbb{C}^{N_T \times M}$ is the beamforming matrix, $\mathbf{W}=[\mathbf{w}_1, \cdots, \mathbf{w}_T] \in \mathbb{C}^{N_R \times T}$ is the combining matrix, and $\mathbf{S}[k]=\operatorname{diag}(s_1[k], \cdots, s_M[k])$ is the pilot symbol matrix. While the LS and LMMSE channel estimators have been popularly used for the acquisition of the mmWave MIMO channel [15], [33], these conventional approaches have a drawback in that the amount of pilot resources is proportional to the number of transmit/receive antennas. In fact, due to the massive numbers of transmit and receive antennas, this issue is serious for mmWave/THz MIMO systems. For example, if the numbers of transmit and receive antennas are $N_T$ = 64 and $N_R$ = 4, respectively, then the BS has to use at least 3 resource blocks (RBs) (12 × 7 resources in each RB of 4G LTE) just for the pilot transmission, occupying almost 25% of a subframe in LTE.

In order to reduce the dimension of the channel matrix to be estimated, one can consider the estimation of the sparse channel parameters rather than the full-dimensional MIMO channel matrix. Specifically, when we assume the geometric narrowband channel model, the frequency-domain channel matrix $\mathbf{H}^l[k]$ of the $k$-th pilot subcarrier at the $l$-th channel coherence interval is expressed as

$$\mathbf{H}^{l}[k]=\sum_{i=1}^{P} \alpha_i^l e^{-j 2 \pi k f_s \tau_i^l} \mathbf{a}_{\mathrm{R}}(\theta_i^l) \mathbf{a}_{\mathrm{T}}^{H}(\phi_i^l),$$

where $f_s$ is the subcarrier spacing, P is the number of propagation paths, $\alpha_i^l$ is the complex gain, $\theta_i^l$ and $\phi_i^l$ are the AoA and AoD of the $i$-th path, respectively, and $\tau_i^l$ is the path delay of the $i$-th path at the $l$-th channel coherence interval.

One can see that $\mathbf{H}^l[k]$ is parameterized by a few channel parameters: AoAs $\{\theta_i^l\}_{i=1}^P$, AoDs $\{\phi_i^l\}_{i=1}^P$, path delays $\{\tau_i^l\}_{i=1}^P$, and path gains $\{\alpha_i^l\}_{i=1}^P$, since the number of effective paths is at most a few (LoS and 1∼2 NLoS paths) in the mmWave/THz channel. Typically, the number of propagation paths P is much smaller than the total number of antennas $N=N_T \times N_R$ (e.g., P = 2 ∼ 8 while N = 16 ∼ 256) due to the high path loss and directivity of the mmWave/THz signal. Therefore, one can greatly reduce the pilot overhead by estimating the sparse channel parameters $\{\theta_i^l, \phi_i^l, \tau_i^l, \alpha_i^l\}_{i=1}^P$ instead of the full-dimensional MIMO channel matrix $\mathbf{H}^l[k]$. To this end, one can exploit the LSTM, a DL technique specialized in extracting temporally correlated features from sequential data^{6}. In our context, the LSTM can dynamically adjust the channel parameter estimate based on the extracted temporally correlated features. For example, when the mobility of a mobile is low and thus the temporal correlation of the channel parameters is high, the past parameters highly affect the current parameters. LSTM can effectively deal with this type of temporal dependency and thus makes a fast yet accurate channel parameter estimation with a relatively small amount of pilot resources. In a nutshell, the LSTM-based channel estimator learns a complicated nonlinear mapping $g$ between the received downlink pilot signals ($\mathbf{y}^1, \cdots, \mathbf{y}^l$) and the channel parameters ($\theta^l, \phi^l, \tau^l, \alpha^l$):

$$(\hat{\theta}^l, \hat{\phi}^l, \hat{\tau}^l, \hat{\alpha}^l)=g(\mathbf{y}^1, \cdots, \mathbf{y}^l ; \Theta),$$

^{6}Basically, the key ingredients of LSTM block are cell state and three gates, viz., input gate, forget gate, and output gate. The cell state, serving as a memory to store information extracted from the past inputs, sequentially passes through the forget, input, and output gates. Based on the input and the previous output, each gate determines how much information to be removed, written, and read in the cell state.

where $\Theta$ is the set of weights and biases (see Fig. 8(a)). To find the parameters $\{\hat{\theta}^l, \hat{\phi}^l, \hat{\tau}^l, \hat{\alpha}^l\}$, the channel estimate $\widehat{\mathbf{H}}^l[k]=\sum_{i=1}^P \hat{\alpha}_i^l e^{-j 2 \pi k f_s \hat{\tau}_i^l} \mathbf{a}_{\mathrm{R}}(\hat{\theta}_i^l) \mathbf{a}_{\mathrm{T}}^{H}(\hat{\phi}_i^l)$ needs to be compared against the true channel matrix $\mathbf{H}^l[k]$ in the training. To do so, we define the loss function $J(\Theta)$ as

$$J(\Theta)=\frac{1}{L K} \sum_{l=1}^{L} \sum_{k=1}^{K}\left\|\mathbf{H}^{l}[k]-\widehat{\mathbf{H}}^{l}[k]\right\|_F^2,$$

where L and K are the number of coherence intervals and the number of subcarriers, respectively.

In order to evaluate the efficacy of the LSTM-based parametric channel estimation technique, we test the normalized MSE (NMSE) as a function of SNR^{7}. As shown in Fig. 8(b), the LSTM-based channel estimation technique outperforms the conventional channel estimation algorithms by a large margin since the trained DNN exploits the sparsity of the angular-domain channel, whereas no such mechanism is used in the conventional LS and MMSE techniques. For example, when SNR = 10 dB, the NMSE of the DL-based scheme is less than $10^{-2}$ while those of the LMMSE estimator and the LS estimator are $10^{-1}$ and 0.5, respectively. We also observe that the DL-based scheme outperforms the CS-based approach (i.e., 5 dB gain at SNR = 15 dB), since the channel parameters can be readily estimated by using the temporal correlation between the past and current channels.

^{7}In our experiments, we assume a scenario where the user moves along a linear trajectory so that the temporal correlation between the geometric parameters is reflected in the dataset.

As mentioned, to fully enjoy the benefit of massive MIMO systems, the acquisition of accurate downlink CSI at the BS is crucial. To do so, the UE needs to estimate the downlink channels using the pilot signal (e.g., CSI reference signal (CSI-RS) in 5G) of each antenna and then feed them back in the form of implicit indices (e.g., PMI, RI, and CQI in 4G LTE and 5G NR) to the BS [34]. Since the number of bits required to convey these indices increases in proportion to the number of antennas, the CSI feedback overhead is a big concern for massive MIMO systems.

Over the years, there have been many works to reduce the overhead of the CSI feedback by exploiting the spatio-temporal correlation of the CSI [35]–[37]. For example, in [35], the correlated channel vectors are transformed into an uncorrelated sparse vector in some bases and then the CS technique is used to estimate the CSI from a small number of channel measurements. Also, in recent years, DL-based CSI feedback schemes have been proposed [38], stimulated by the outstanding performance of DL in correlation extraction. The DL-based feedback scheme can be derived from the autoencoder (AE) architecture (see Fig. 9(a)). In the AE, the encoder is used to transform the raw channel matrices into a compressed codeword vector and the decoder reconstructs the original channel matrices from the codeword.

We consider the downlink MIMO-OFDM system where the BS is equipped with $N_t$ transmit antennas and the UE is equipped with $N_r$ receive antennas. Since each antenna port at the BS sends a CSI-RS for the channel measurement, $N_t \times N_r$ elements in the MIMO channel matrix $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ should be fed back to the BS. In the AE-based CSI feedback scheme, the encoder in the UE compresses the channel estimate $\hat{\mathbf{H}}$ into a low-dimensional feature vector and the decoder in the BS reconstructs the channel $\hat{\mathbf{H}}$ from the compressed vector. As an input to the encoder, the vectorized version of $\hat{\mathbf{H}}$ (the $N_t N_r \times 1$-dimensional vector $\hat{\mathbf{h}}$) is used. Then, we use multiple hidden layers to obtain the low-dimensional vector p, which can be expressed as

$$\mathbf{p}=f_e(\hat{\mathbf{h}} ; \Theta_e),$$

where $f_e(\cdot)$ is the encoder operation and $\Theta_e$ is the training parameter set of the encoder. In the decoder, to reconstruct $\hat{\mathbf{h}}$ from $\mathbf{p}$, we design multiple FCN layers with BN and ReLU activation (see Fig. 9(a)). Finally, the reconstructed channel vector $\hat{\mathbf{h}}_{rec}$ can be expressed as

$$\hat{\mathbf{h}}_{rec} = f_d(\mathbf{p}; \Theta_d),$$

where $f_d(\cdot)$ is the decoder operation and $\Theta_d$ is the training parameter set of the decoder. Since the training objective of the AE-based CSI feedback is to find the channel closest to $\hat{\mathbf{h}}$, we use the MSE between the input channel $\hat{\mathbf{h}}$ and the reconstructed channel $\hat{\mathbf{h}}_{rec}$ as a loss function:

$$\mathcal{L}(\Theta_e, \Theta_d) = \big\| \hat{\mathbf{h}} - \hat{\mathbf{h}}_{rec} \big\|_2^2.$$

In Fig. 9(b), we evaluate the MSE between the reconstructed channel $\hat{\mathbf{H}}_{rec}$ and the true channel $\mathbf{H}$ (i.e., $\|\hat{\mathbf{H}}_{rec} - \mathbf{H}\|_2^2$) of the AE-based CSI feedback scheme as a function of SNR. Note that the genie channel $\mathbf{H}$ is obtained from the geometric MIMO channel model in (17). As shown in Fig. 9(b), the reconstruction quality of the AE-based scheme improves gradually with the dimension $N_l$ of the latent vector $\mathbf{p}$. This is mainly because a latent vector with a large dimension can capture the detailed channel features (e.g., geometric channel parameters of NLoS paths) as well as the core features (e.g., channel parameters of the dominating LoS path). For example, the AE-based CSI feedback technique achieves MSE $= 10^{-4}$ at SNR $= 15$ dB when $N_l = 128$, a level the other configurations cannot reach even in the higher SNR regime.
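The encoder/decoder mapping and the MSE objective above can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the evaluated network: the layer widths and latent dimension are arbitrary, batch normalization and the training loop are omitted, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are illustrative, not the paper's configuration.
N_T, N_R, N_L = 32, 4, 16        # Tx/Rx antennas, latent dimension
DIM = 2 * N_T * N_R              # real-valued input (Re/Im parts stacked)

def relu(x):
    return np.maximum(x, 0.0)

# Encoder f_e(.; Theta_e): two FC layers compressing h_hat to the latent p
W1 = rng.standard_normal((64, DIM)) * 0.1
W2 = rng.standard_normal((N_L, 64)) * 0.1
# Decoder f_d(.; Theta_d): mirror structure reconstructing h_hat from p
W3 = rng.standard_normal((64, N_L)) * 0.1
W4 = rng.standard_normal((DIM, 64)) * 0.1

def encode(h):
    return W2 @ relu(W1 @ h)     # p = f_e(h; Theta_e)

def decode(p):
    return W4 @ relu(W3 @ p)     # h_rec = f_d(p; Theta_d)

h_hat = rng.standard_normal(DIM) # vectorized channel estimate
p = encode(h_hat)                # UE side: compress, feed back N_L values
h_rec = decode(p)                # BS side: reconstruct the channel
loss = np.sum((h_hat - h_rec) ** 2)  # MSE training objective
print(p.shape, loss)
```

Note that only the $N_l = 16$ latent values cross the feedback link instead of the 256 real numbers describing the channel, which is where the overhead reduction comes from.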

In Section III.A, we discussed that the GAN technique can be used to collect realistic wireless channel samples. In reality, however, GAN is also a data-driven DL technique, so it requires considerable training samples, and the practical benefit of this approach might be washed away when the training data is insufficient [39]. To overcome this shortcoming, one can consider meta learning, a technique to train a model on a variety of tasks and then solve a new task using only a small number of training samples [40]. In short, meta learning is a special training technique to obtain the initialization parameters of a DNN from which one can easily and quickly learn the desired function with a small number of training samples.

To be specific, the overall procedure of the meta learning-based GAN training is as follows. First, we perform the meta learning to obtain the initialization parameters. We then update the network parameters of GAN to perform the fine-tuning of the DNN such that the trained DNN generates channel samples for the desired wireless environments. In the meta learning phase, we extract the common features of multiple channel datasets, say $M$ datasets $\{D_1, \cdots, D_M\}$, and then use them to obtain the network initialization parameters $\theta$ (see Fig. 10(a)):

$$\psi_{D_i, t} = \theta_{t-1} - \alpha \nabla_{\theta} \mathcal{L}_{D_i}(\theta_{t-1}),$$
$$\theta_t = \theta_{t-1} - \beta \nabla_{\theta} \sum_{i=1}^{M} \mathcal{L}_{D_i}(\psi_{D_i, t}),$$

where $\theta_t$ and $\theta_{t-1}$ are the parameters of GAN in the $t$-th and $(t-1)$-th steps, respectively. Also, $\psi_{D_i, t}$ is the parameter associated with dataset $D_i$ in the $t$-th step, $\mathcal{L}_{D_i}$ is the loss function of GAN for the $i$-th dataset $D_i$, and $\alpha$ and $\beta$ are the step sizes for the parameter update (see Algorithm 1). Next, in the parameter update phase, we use $\theta$ as the initialization parameters and then train the DNN to generate samples close to the desired channel dataset, say $D_{M+1}$. In a nutshell, all that is needed is to learn the distinct features (of $D_{M+1}$) left unextracted by the meta learning.
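The two phases above can be sketched with a first-order MAML-style loop on a toy problem. This is an illustration under our own assumptions: the quadratic losses, dimensions, and step sizes are hypothetical stand-ins for the per-dataset GAN losses $\mathcal{L}_{D_i}$, chosen so the behavior is easy to verify.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the per-dataset loss L_{D_i}: a quadratic whose minimizer
# c_i plays the role of dataset D_i's channel statistics (illustration only).
centers = [rng.standard_normal(4) for _ in range(3)]   # M = 3 source datasets

def grad_loss(theta, c):
    """Gradient of the toy loss L(theta) = 0.5 * ||theta - c||^2."""
    return theta - c

alpha, beta = 0.1, 0.5            # inner / outer step sizes
theta = np.zeros(4)

# Meta-learning phase (first-order MAML): find initialization parameters.
for t in range(100):
    meta_grad = np.zeros_like(theta)
    for c in centers:
        psi = theta - alpha * grad_loss(theta, c)  # task-specific update
        meta_grad += grad_loss(psi, c)             # outer gradient at psi
    theta = theta - beta * meta_grad / len(centers)
theta_meta = theta.copy()          # initialization shared across tasks

# Fine-tuning phase: adapt to a new dataset D_{M+1} with only a few steps.
c_new = rng.standard_normal(4)
for _ in range(5):
    theta = theta - alpha * grad_loss(theta, c_new)
```

On this toy problem the meta-learned initialization lands at the "common feature" of the source tasks (the mean of the minimizers), so a handful of fine-tuning steps suffices to move toward the new task, which mirrors the small-sample fine-tuning of the GAN.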

In Fig. 10(b), we test the MSE performances of the DL-based channel estimator trained on three different datasets: the real-measured dataset [41], the dataset generated from the vanilla GAN, and the dataset generated from the GAN trained by meta learning^{8}. We observe that the channel estimation performance of the meta learning-based GAN is slightly worse than that using the real samples (e.g., 2 dB loss at MSE $= 10^{-2}$). In contrast, the performance of the conventional GAN-based approach is much worse (i.e., more than 6 dB loss at MSE $= 10^{-2}$) since the number of training samples is not large enough to ensure the convergence of GAN and there is no mechanism to exploit the common features of diverse channel conditions.

^{8}Specifically, we have used 10,000 samples for the benchmark training, 4,000 samples for meta learning, and 800 samples for fine-tuning and the training of the conventional CGAN.

In this article, we discussed DL-based channel estimation with emphasis on the design issues related to DL model selection, training set acquisition, and DNN architecture design. As automated services and applications using machines, vehicles, and sensors proliferate, we expect that DL will become a more popular channel estimation paradigm in the 6G era. To deal with various frequency bands (sub-6 GHz/mmWave/THz), wireless resources (massive MIMO antennas, intelligent reflecting surfaces, relays), and geographical environments, we need to go beyond the state-of-the-art DL techniques and consider more aggressive and advanced ones. For example, when we try to train a DL model to estimate the desired wireless channel, transfer learning, an approach that reuses a pre-trained model for a similar task, can be employed. By recycling most of the parameters in the pre-trained model and then training only a small part of the parameters, the new model can learn the distinct features of the desired channel while reusing the common features of the pre-trained channel environments. Another approach worth investigating is federated/split/distributed learning, a technique to learn the desired task through the cooperation of multiple decentralized devices or servers. Our hope is that this article will serve as a useful reference for communication researchers who want to apply DL techniques in their wireless channel estimation applications.
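The transfer learning recipe mentioned above (freeze the shared parameters, retrain a small head) can be sketched on a toy regression task. Everything here, the model, the data, and the dimensions, is hypothetical; only the freezing pattern reflects the idea in the text, and the "pre-trained" weights are random placeholders rather than a real source model.

```python
import numpy as np

rng = np.random.default_rng(2)

# A tiny "pre-trained" two-layer regressor. In practice W_shared would come
# from a model trained on a similar channel environment; here it is random
# purely for illustration. The point is which parameters get updated.
W_shared = rng.standard_normal((8, 4)) * 0.5   # frozen feature extractor
W_head   = rng.standard_normal((1, 8)) * 0.1   # small task-specific head

X = rng.standard_normal((4, 20))               # few samples of the new task
y = np.sum(X, axis=0, keepdims=True)           # toy regression target

def mse():
    H = np.maximum(W_shared @ X, 0.0)
    return float(np.mean((W_head @ H - y) ** 2))

mse_before = mse()
lr = 0.05
for _ in range(500):                           # train the head only
    H = np.maximum(W_shared @ X, 0.0)          # frozen features (no update)
    err = W_head @ H - y
    W_head -= lr * err @ H.T / X.shape[1]      # gradient step on head only
mse_after = mse()
print(mse_before, mse_after)
```

Only the 8 head parameters are updated while the 32 shared parameters stay fixed, which is why a handful of samples from the new environment can suffice.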

This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00972, Development of Intelligent Wireless Access Technologies) and the ITRC (Information Technology Research Center) support program (IITP-2022-2017-0-01637) supervised by the IITP.

Wonjun Kim (Member, IEEE) received the B.S. degree from the Department of Electrical and Electronic Engineering, Yonsei University, in 2016, and the Ph.D. degree in electrical engineering from Seoul National University in 2021. Since September 2021, he has been with Samsung Electronics, working on 6G systems. His current research interests include compressed sensing and deep learning techniques for 6G wireless communications.

Yongjun Ahn (Member, IEEE) received the B.S. degree from the Department of Electrical and Computer Engineering, Seoul National University, South Korea, in 2018. He is currently pursuing the Ph.D. degree at the Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University. His research interests include 6G wireless communications and deep learning techniques.

Jinhong Kim (Member, IEEE) received the B.S. degree from the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, South Korea, in 2016. He is currently pursuing the Ph.D. degree at the Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University. His research interests include sparse signal recovery and deep learning techniques for 6G wireless communications.

Byonghyo Shim (Senior Member, IEEE) received the B.S. and M.S. degrees in Control and Instrumentation Engineering from Seoul National University (SNU), Seoul, Korea, in 1995 and 1997, respectively, and the M.S. degree in Mathematics and the Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, in 2004 and 2005, respectively. From 1997 to 2000, he was with the Department of Electronics Engineering at the Korean Air Force Academy as an Officer (First Lieutenant) and an Academic Full-time Instructor. He also held short-term research positions at Texas Instruments and Samsung Electronics in 2004 and 2019, respectively. From 2005 to 2007, he was with Qualcomm Inc., San Diego, CA, as a Staff Engineer working on CDMA systems. From 2007 to 2014, he was with the School of Information and Communication, Korea University, Seoul, Korea, as an Associate Professor. Since September 2014, he has been with the Department of Electrical and Computer Engineering, Seoul National University, where he is currently a Professor. His research interests include signal processing for wireless communications, statistical signal processing, machine learning, and information theory. Dr. Shim was the recipient of the M. E. Van Valkenburg Research Award from the ECE Department of the University of Illinois (2005), the Hadong Young Engineer Award from IEIE (2010), the Irwin Jacobs Award from Qualcomm and KICS (2016), the Shinyang Research Award from the Engineering College of SNU (2017), the Okawa Foundation Research Award (2020), and the IEEE COMSOC Asia Pacific Outstanding Paper Award (2021).
He was a technical committee member of Signal Processing for Communications and Networking (SPCOM), and is currently serving as an associate editor of IEEE Transactions on Signal Processing (TSP), IEEE Transactions on Communications (TCOM), IEEE Transactions on Vehicular Technology (TVT), IEEE Wireless Communications Letters (WCL), and the Journal of Communications and Networks (JCN), and as a guest editor of IEEE Journal on Selected Areas in Communications (location awareness for radios and networks).

- 1 Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
- 2 K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE CVPR, 2016, pp. 770-778.
- 3 D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
- 4 I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
- 5 H. Ju, S. Kim, Y. Kim, and B. Shim, "Energy-efficient ultra-dense network with deep reinforcement learning," IEEE Trans. Wireless Commun., vol. 21, no. 8, pp. 6539-6552, Aug. 2022.
- 6 K. Suh, S. Kim, Y. Ahn, S. Kim, H. Ju, and B. Shim, "Deep reinforcement learning-based network slicing for beyond 5G," IEEE Access, vol. 10, pp. 7384-7395, Jan. 2022.
- 7 Y. Ahn and B. Shim, "Deep learning-based beamforming for intelligent reflecting surface-assisted mmWave systems," in Proc. ICTC, 2021.
- 8 E. Vlachos, G. C. Alexandropoulos, and J. Thompson, "Massive MIMO channel estimation for millimeter wave systems via matrix completion," IEEE Signal Process. Lett., vol. 25, no. 11, pp. 1675-1679, Nov. 2018.
- 9 M. S. Oh, S. Hosseinalipour, T. Kim, C. G. Brinton, and D. J. Love, "Channel estimation via successive denoising in MIMO-OFDM systems: A reinforcement learning approach," in Proc. IEEE ICC, 2021, pp. 1-6.
- 10 X. Ma, Z. Gao, F. Gao, and M. D. Renzo, "Model-driven deep learning based channel estimation and feedback for millimeter-wave massive hybrid MIMO systems," IEEE J. Sel. Areas Commun., vol. 39, no. 8, pp. 2388-2406, 2021.
- 11 S. Park, J. Choi, J. Seol, and B. Shim, "Expectation-maximization-based channel estimation for multiuser MIMO systems," IEEE Trans. Commun., vol. 65, no. 6, pp. 2397-2410, Jun. 2017.
- 12 3GPP TR 37.817, "Study on enhancement for data collection for NR and EN-DC (Release 17)," v1.1.0, 2022.
- 13 3GPP TR 38.831, "User Equipment (UE) radio frequency (RF) requirements for Frequency Range 2 (FR2)," v16.1.0, 2021.
- 14 J. Kim, Y. Ahn, S. Kim, and B. Shim, "Parametric sparse channel estimation using long short-term memory for mmWave massive MIMO systems," in Proc. IEEE ICC, 2022.
- 15 H. Tang, J. Wang, and L. He, "Off-grid sparse Bayesian learning based channel estimation for mmWave massive MIMO uplink," IEEE Wireless Commun. Lett., vol. 8, no. 1, pp. 45-48, Feb. 2019.
- 16 Q. He, T. Q. Quek, Z. Chen, Q. Zhang, and S. Li, "Compressive channel estimation and multi-user detection in C-RAN with low-complexity methods," IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 3931-3944, 2018.
- 17 A. Manoj and A. P. Kannu, "Channel estimation strategies for multiuser mmWave systems," IEEE Trans. Commun., vol. 66, no. 11, pp. 5678-5690, Jul. 2018.
- 18 J. W. Choi, B. Shim, Y. Ding, B. Rao, and D. I. Kim, "Compressed sensing for wireless communications: Useful tips and tricks," IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1527-1550, 2017.
- 19 I. Saffar, M. L. A. Morel, K. D. Singh, and C. Viho, "Semi-supervised deep learning-based methods for indoor outdoor detection," in Proc. IEEE ICC, 2019, pp. 1-7.
- 20 M. Alenezi, K. K. Chai, A. S. Alam, Y. Chen, and S. Jimaa, "Unsupervised learning clustering and dynamic transmission scheduling for efficient dense LoRaWAN networks," IEEE Access, vol. 8, pp. 191495-191509, Oct. 2020.
- 21 3GPP TS 36.104, "Evolved Universal Terrestrial Radio Access (E-UTRA); Base Station (BS) radio transmission and reception."
- 22 I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NIPS, 2014.
- 23 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 2017, pp. 6000-6010.
- 24 A. Rehman, S. U. Rehman, M. Khan, M. Alazab, and T. Reddy, "CANintelliIDS: Detecting in-vehicle intrusion attacks on a controller area network using CNN and attention-based GRU," IEEE Trans. Netw. Sci. Eng., vol. 8, no. 2, pp. 1456-1466, Apr.-Jun. 2021.
- 25 M. Li, Y. Wang, Z. Wang, and H. Zheng, "A deep learning method based on an attention mechanism for wireless network traffic prediction," Ad Hoc Networks, vol. 107, no. 1, Oct. 2020.
- 26 Y. Ahn, W. Kim, and B. Shim, "Active user detection and channel estimation for massive machine-type communication: Deep learning approach," IEEE Internet Things J., vol. 9, no. 14, pp. 11904-11917, Jul. 2021.
- 27 W. Kim, Y. Ahn, and B. Shim, "Deep neural network-based active user detection for grant-free NOMA systems," IEEE Trans. Commun., vol. 68, no. 4, pp. 2143-2155, 2020.
- 28 N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, 2014.
- 29 G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
- 30 S. Rangan, T. S. Rappaport, and E. Erkip, "Millimeter-wave cellular wireless networks: Potentials and challenges," Proc. IEEE, vol. 102, no. 3, pp. 366-385, 2014.
- 31 H. Ji, H. Yang, H. Noh, J. Yeo, Y. Kim, and J. Lee, "Compressed channel estimation for point-to-point millimeter-wave communications," in Proc. IEEE Globecom Wkshps, 2019, pp. 1-5.
- 32 J. Park, S. Kim, J. Moon, and B. Shim, "Fast terahertz beam training via frequency-dependent precoding," in Proc. IEEE ICC, 2022.
- 33 K. Venugopal, A. Alkhateeb, N. G. Prelcic, and R. W. Heath, "Channel estimation for hybrid architecture based wideband millimeter wave systems," IEEE J. Sel. Areas Commun., vol. 35, no. 9, pp. 1996-2009, 2017.
- 34 3GPP TS 38.214, "Physical layer procedures for data," v17.0.0, 2022.
- 35 S. Kim, J. W. Choi, and B. Shim, "Downlink pilot precoding and compressed channel feedback for FDD-based cell-free systems," IEEE Trans. Wireless Commun., vol. 19, no. 6, pp. 3658-3672, 2020.
- 36 B. Lee, J. Choi, J.-Y. Seol, D. J. Love, and B. Shim, "Antenna grouping based feedback compression for FDD-based massive MIMO systems," IEEE Trans. Commun., vol. 63, no. 9, pp. 3261-3274, 2015.
- 37 S. Kim and B. Shim, "Energy-efficient millimeter-wave cell-free systems under limited feedback," IEEE Trans. Commun., vol. 69, no. 6, pp. 4067-4082, Jun. 2021.
- 38 C.-K. Wen, W.-T. Shih, and S. Jin, "Deep learning for massive MIMO CSI feedback," IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 748-751, Oct. 2018.
- 39 J. Kim, Y. Ahn, and B. Shim, "Massive data generation for deep learning-aided wireless systems using meta learning and generative adversarial network," to appear in IEEE Trans. Veh. Technol.
- 40 C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proc. ICML, 2017, pp. 1126-1135.
- 41 E. Everett, C. Shepard, L. Zhong, and A. Sabharwal, "SoftNull: Many-antenna full-duplex wireless via digital beamforming," IEEE Trans. Wireless Commun., vol. 15, no. 12, pp. 8077-8092, Dec. 2016.