
Kim, Ahn, Kim, and Shim: Towards Deep Learning-aided Wireless Channel Estimation and Channel State Information Feedback for 6G

Wonjun Kim, Yongjun Ahn, Jinhong Kim, and Byonghyo Shim

Towards Deep Learning-aided Wireless Channel Estimation and Channel State Information Feedback for 6G

Abstract: Deep learning (DL), a branch of artificial intelligence (AI) techniques, has shown great promise in various disciplines such as image classification and segmentation, speech recognition, and language translation, among others. This remarkable success of DL has stimulated increasing interest in applying this paradigm to wireless channel estimation in recent years. Since DL principles are inductive in nature and distinct from conventional rule-based algorithms, when one tries to apply DL techniques to channel estimation, one might easily get stuck and confused by the many knobs to control and small details to be aware of. The primary purpose of this paper is to discuss key issues and possible solutions in DL-based wireless channel estimation and channel state information (CSI) feedback for 6G, including DL model selection, training data acquisition, and neural network design. Specifically, we present several case studies together with numerical experiments to demonstrate the effectiveness of the DL-based wireless channel estimation framework.

Keywords: Channel estimation, channel state information (CSI) feedback, deep learning (DL)


Artificial intelligence (AI) is a powerful tool for performing tasks that seem simple for human beings but are extremely difficult for conventional (rule-based) computer programs. Deep learning (DL), a branch of AI techniques popularized by LeCun, Bengio, and Hinton [1], has shown great promise in many practical applications. In the past few years, we have witnessed the great success of DL in various disciplines such as the game of Go, image classification, speech recognition, and language translation, to name just a few [2]–[4]. DL techniques have also become prominent in various wireless communication tasks such as active user detection (AUD), spectrum sensing, and resource scheduling [5]–[7]. Efforts are also underway to exploit DL as a means to overcome various problems in wireless channel estimation, particularly in the high frequency bands (e.g., pilot overhead and channel estimation accuracy degradation in mmWave MIMO systems). Notable applications include multiple-input multiple-output (MIMO) channel estimation, angle-domain channel estimation for the mmWave and terahertz (THz) bands, and channel estimation in vehicle-to-everything (V2X) systems [8]–[10]. In response to this trend, 3GPP recently decided to include artificial intelligence as one of the main items in 5G Advanced (Rel 17) [12].

When one tries to apply DL techniques to wireless channel estimation, one can easily be overwhelmed by the many knobs to control and small details to be aware of. In contrast to the conventional linear minimum mean square error (LMMSE) or least squares (LS) techniques, where the estimation algorithm design is fairly straightforward, DL-based techniques require a lot of hands-on experience and heuristic tips and tricks in the design of the neural network, the generation of the training dataset, and the choice of the training strategy. In fact, since a DL-aided system is data-driven and inductive in nature, one might easily get confused, get stuck in the middle, or come up with a suboptimal channel estimation scheme.

The primary goal of this paper is to show that DL is an effective means for channel estimation and feedback, in particular for future wireless systems using high frequency bands. Recently, there have been some studies on AI-based channel estimation [8]–[10]. Since the main focus of these studies is to present a specific DL-based channel estimation technique for a specific wireless scenario (e.g., channel estimation in the MIMO system and the optical wireless system), it might not be easy to grasp the general idea and a systematic view of the problem. In a nutshell, the successful design of a DL-based channel estimation and CSI feedback scheme comes down to 1) the choice of a proper DL model for the target wireless system, 2) detailed deep neural network architecture design, and 3) training data acquisition along with training strategy selection. We get to the heart of these, discuss detailed issues, and provide several useful design tips learned from our past experience in time/frequency-domain channel estimation, geometric channel parameter estimation (e.g., angle-of-departure/arrival (AoD/AoA) estimation) in mmWave systems, and channel state information (CSI) feedback.

The rest of this paper is organized as follows. In Section II, we briefly review the design principles of conventional and DL-based channel estimation and discuss the learning techniques used to estimate and exploit the wireless channel. In Section III, we explain the training dataset collection and neural network architecture design issues. In Section IV, we provide several case studies that use DL to estimate and exploit channels in the wireless system, together with experimental results. We discuss future issues and conclude the paper in Section V.

Fig. 1.

Design principles of traditional channel estimation algorithm and DL-based channel estimation technique.


In this section, we briefly compare two design principles: conventional channel estimation and the DL-based channel estimation techniques. We then discuss how DL technique can be mapped to the channel estimation in specific wireless environments.

A. Design Principle of Conventional Wireless Channel Estimation

When designing an algorithm to estimate the wireless channel, one should perform the system modeling, algorithm design, and performance analysis (see Fig. 1). The system model, typically expressed as a clean-cut linear equation, defines the relationship between the observation (e.g., received pilot signal) and the variables to be recovered (e.g., time-domain channel tap, angle of arrival/departure). Using this model, a proper algorithm achieving near-optimal performance is developed (e.g., MMSE channel estimator, expectation-maximization (EM)-based channel estimator), and a theoretical analysis is conducted to obtain a performance bound such as the mean squared error (MSE) bound.

One well-known example is the LMMSE channel estimator derived from the linear system model given by [11]

[TeX:] $$\mathbf{y}=\mathbf{X h}+\mathbf{n}$$

where [TeX:] $$\mathbf{y} \in \mathbb{C}^{m \times 1}$$ is the received signal vector, [TeX:] $$\mathbf{X} \in \mathbb{C}^{m \times n}$$ is a matrix containing the pilots, [TeX:] $$\mathbf{h} \in \mathbb{C}^{n \times 1}$$ is the channel vector to be reconstructed, and [TeX:] $$\mathbf{n} \sim \mathcal{C N}\left(\mathbf{0}, \sigma_n^2 \mathbf{I}\right)$$ is a complex Gaussian noise vector. The linear estimator [TeX:] $$\mathbf{W}^*$$ that minimizes the Bayesian MSE [TeX:] $$E\left[\|\mathbf{h}-\hat{\mathbf{h}}\|_2^2\right]=E\left[\left\|\mathbf{h}-\mathbf{W}^H \mathbf{y}\right\|_2^2\right]$$ is

[TeX:] $$\mathbf{W}^*=\arg \min _{\mathbf{W}} \operatorname{MSE}(\mathbf{W})$$

After taking the gradient [TeX:] $$\frac{\partial \mathrm{MSE}}{\partial \mathbf{W}}=2 E\left[\mathbf{y} \mathbf{y}^H\right] \mathbf{W}-2 E\left[\mathbf{y h}^H\right]$$ and setting it to zero, we have

[TeX:] $$\mathbf{W}^*=E\left[\mathbf{y} \mathbf{y}^H\right]^{-1} E\left[\mathbf{y} \mathbf{h}^H\right]$$

[TeX:] $$=E\left[\mathbf{X} \mathbf{h} \mathbf{h}^H \mathbf{X}^H+\mathbf{n} \mathbf{n}^H\right]^{-1} E\left[\mathbf{X} \mathbf{h} \mathbf{h}^H+\mathbf{n} \mathbf{h}^H\right]$$

[TeX:] $$=\left(\mathbf{X} R_{h h} \mathbf{X}^H+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{X} R_{h h}$$

where [TeX:] $$R_{h h}=E\left[\mathbf{h h}^H\right]$$ is the autocorrelation matrix of h. Let [TeX:] $$\hat{\mathbf{h}}=\left(\mathbf{W}^*\right)^H \mathbf{y}$$, then

[TeX:] $$\hat{\mathbf{h}}=R_{h h} \mathbf{X}^H\left(\mathbf{X} R_{h h} \mathbf{X}^H+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y}$$

Note that, when X is a diagonal matrix, we also obtain [TeX:] $$\hat{\mathbf{h}}=R_{h h}\left(R_{h h}+\sigma_n^2\left(\mathbf{X} \mathbf{X}^H\right)^{-1}\right)^{-1} \tilde{\mathbf{h}}$$ where [TeX:] $$\tilde{\mathbf{h}}=\mathbf{X}^{-1} \mathbf{y}$$ is the LS estimate of h. The LMMSE estimate is the channel estimate minimizing the MSE when the system is expressed as an overdetermined linear system where the number of measurements is larger than or equal to the size of the unknown channel vector ([TeX:] $$m \geq n$$). However, recovering the channel vector in an underdetermined scenario where the measurement size is smaller than the size of the unknown channel vector ([TeX:] $$m<n$$) is challenging, since one cannot find a unique solution in general. If one wishes to convert this ill-posed problem into a well-posed one, additional side information is indispensable.
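To make the comparison between the LS and LMMSE estimators concrete, the following is a minimal NumPy sketch under stated assumptions. The exponential correlation model for the channel covariance, the random unit-modulus pilot matrix, and all dimensions and noise levels are illustrative choices that do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma2, trials = 16, 8, 0.1, 2000

# Assumed exponential correlation model for R_hh (illustrative only).
rho = 0.9
R_hh = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(R_hh)

# Assumed pilot matrix X (m x n) with random unit-modulus entries.
X = np.exp(1j * 2 * np.pi * rng.random((m, n))) / np.sqrt(m)

def crandn(*shape):
    """Circularly symmetric complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# LMMSE: h_hat = R_hh X^H (X R_hh X^H + sigma2 I)^{-1} y
W_lmmse = R_hh @ X.conj().T @ np.linalg.inv(X @ R_hh @ X.conj().T + sigma2 * np.eye(m))
# LS: h_tilde = (X^H X)^{-1} X^H y, computed via the pseudo-inverse
W_ls = np.linalg.pinv(X)

# Each column of H is one channel realization with covariance R_hh.
H = L @ crandn(n, trials)
Y = X @ H + np.sqrt(sigma2) * crandn(m, trials)

mse_lmmse = np.mean(np.sum(np.abs(W_lmmse @ Y - H) ** 2, axis=0))
mse_ls = np.mean(np.sum(np.abs(W_ls @ Y - H) ** 2, axis=0))
print(f"LS MSE: {mse_ls:.4f}, LMMSE MSE: {mse_lmmse:.4f}")
```

In this setup the LMMSE estimator benefits from knowing the channel covariance, so its empirical MSE should come out below that of the LS estimator.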

When the sparsity constraint is provided as side information, compressed sensing (CS)-based techniques can be used to estimate the channel in the underdetermined system. For example, a mmWave propagation channel (e.g., FR2 in 5G [13]) is characterized by geometric parameters such as the AoD/AoA, path delay, and path gain. Due to the severe attenuation of signal power caused by the high diffraction and penetration loss, atmospheric absorption, and rain attenuation in the mmWave band, the number of effective paths is at most a few (LoS and at most one or two NLoS paths), meaning that the channel can be represented by a small number of geometric parameters [14]. Using this property as side information, one can find the sparse parameters used to reconstruct the mmWave channel.

Fig. 2.

(a) Illustration of mmWave MIMO system with [TeX:] $$N_T$$ transmit antennas and [TeX:] $$N_R$$ receive antennas. (b) Illustration of approximated channel using discretized angular bases [TeX:] $$\mathbf{A}_R$$ and [TeX:] $$\mathbf{A}_T$$.

Consider the narrowband geometric mmWave MIMO channel model given by

[TeX:] $$\mathbf{H}=\sum_{i=1}^P \alpha_i e^{-j 2 \pi \tau_i f_s} \mathbf{a}_R\left(\theta_i\right) \mathbf{a}_T^H\left(\phi_i\right)$$

where P is the number of propagation paths and [TeX:] $$\alpha_i, \theta_i, \phi_i$$ and [TeX:] $$\tau_i$$ are the path gain, AoA, AoD, and path delay of the [TeX:] $$i$$-th path, respectively. Also, [TeX:] $$f_s$$ is the subcarrier spacing, and [TeX:] $$\mathbf{a}_R\left(\theta_i\right) \in\mathbb{C}^{N_R \times 1}$$ and [TeX:] $$\mathbf{a}_T\left(\phi_i\right) \in \mathbb{C}^{N_T \times 1}$$ are the receive (UE) and transmit (BS) array response vectors, respectively ([TeX:] $$N_T$$ and [TeX:] $$N_R$$ are the numbers of transmit and receive antennas).1 In the downlink transmission scenario where the UE uses the [TeX:] $$i$$-th combining vector [TeX:] $$\mathbf{w}_i$$ and the BS uses the [TeX:] $$j$$-th precoding vector [TeX:] $$\mathbf{f}_j$$, the received signal [TeX:] $$y_{i, j}$$ is

1 [TeX:] $$$$ = [TeX:] $$$$ and [TeX:] $$$$ = [TeX:] $$$$ , respectively, where λ is the wavelength and [TeX:] $$d$$ is the antenna spacing.

[TeX:] $$y_{i, j}=\mathbf{w}_i^H \mathbf{H} \mathbf{f}_j s_j+v_{i, j}$$

where [TeX:] $$s_j$$ is the transmitted pilot symbol, [TeX:] $$\mathbf{H} \in \mathbb{C}^{N_R \times N_T}$$ is the downlink channel matrix, and [TeX:] $$v_{i, j}$$ is the noise. Assuming that we use [TeX:] $$M$$ precoding vectors and [TeX:] $$N$$ combining vectors for the channel estimation, the received signal matrix [TeX:] $$\mathbf{Y} \in \mathbb{C}^{N \times M}$$ can be expressed as

[TeX:] $$\mathbf{Y}=\mathbf{W}^H \mathbf{H F S}+\mathbf{V}$$

where [TeX:] $$\mathbf{F}=\left[\mathbf{f}_1, \cdots, \mathbf{f}_M\right] \in \mathbb{C}^{N_T \times M}$$ is the beamforming matrix, [TeX:] $$\mathbf{W}=\left[\mathbf{w}_1, \cdots, \mathbf{w}_N\right] \in \mathbb{C}^{N_R \times N}$$ is the combining matrix, [TeX:] $$\mathbf{S}=\operatorname{diag}\left(s_1, \cdots, s_M\right)$$ is the matrix containing the pilot symbols, and [TeX:] $$\mathbf{V} \in \mathbb{C}^{N \times M}$$ is the matrix containing the noise elements (see Fig. 2(a)).

In order to estimate the channel, CS-based techniques approximate H using the discretized model [TeX:] $$\mathbf{H} \approx \mathbf{A}_R \boldsymbol{\Delta} \mathbf{A}_T^H$$, where [TeX:] $$\mathbf{A}_R \in \mathbb{C}^{N_R \times G_r}$$ and [TeX:] $$\mathbf{A}_T \in \mathbb{C}^{N_T \times G_t}$$ are uniformly discretized angular bases for the AoA and AoD, respectively2 (see Fig. 2(b)). To convert the channel estimation problem into a sparse signal recovery problem, we rearrange the system model in (9) using the mixed Kronecker matrix-vector property3 :

2[TeX:] $$\mathbf{A}_R=\frac{1}{\sqrt{N_R}}\left[\mathbf{a}_{\mathrm{R}}(0), \mathbf{a}_{\mathrm{R}}\left(\frac{2 \pi}{G_r}\right), \cdots, \mathbf{a}_{\mathrm{R}}\left(\frac{2 \pi}{G_r}\left(G_r-1\right)\right)\right]$$ and [TeX:] $$\mathbf{A}_T=\frac{1}{\sqrt{N_T}}\left[\mathbf{a}_{\mathrm{T}}(0), \mathbf{a}_{\mathrm{T}}\left(\frac{2 \pi}{G_t}\right), \cdots, \mathbf{a}_{\mathrm{T}}\left(\frac{2 \pi}{G_t}\left(G_t-1\right)\right)\right]$$, respectively, where [TeX:] $$G_r$$ and [TeX:] $$G_t$$ denote the number of angular grids in each basis [15]–[17]. [TeX:] $$\Delta$$ is a sparse path gain matrix containing P non-zero channel path gains. Specifically, the ([TeX:] $$i, j$$ )-th element of [TeX:] $$\Delta$$ is the path gain corresponding to the [TeX:] $$i$$ -th angular grid for the AoA and the [TeX:] $$j$$ -th angular grid for the AoD.

3[TeX:] $$\operatorname{vec}(\mathbf{A X} \mathbf{C})=\left(\mathbf{C}^T \otimes \mathbf{A}\right) \operatorname{vec}(\mathbf{X})$$
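As a sanity check of this rearrangement, the sketch below numerically verifies the vectorized form in the noiseless case. It assumes the notation ỹ = vec(Y) and g = vec(Δ), with sensing matrix Φ = (A_T^H F S)^T ⊗ (W^H A_R), which is what the property in footnote 3 yields when applied to Y = W^H A_R Δ A_T^H F S. The random matrices are stand-ins for the actual beamforming, combining, and angular basis matrices, and all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N_R, N_T, N, M, G_r, G_t = 4, 4, 3, 3, 8, 8

def crandn(*shape):
    """Complex Gaussian stand-in matrices (illustrative only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

W, F = crandn(N_R, N), crandn(N_T, M)          # combining / beamforming matrices
S = np.diag(np.exp(1j * 2 * np.pi * rng.random(M)))  # unit-modulus pilot symbols
A_R, A_T = crandn(N_R, G_r), crandn(N_T, G_t)  # stand-ins for the angular bases

# Sparse path gain matrix with two nonzero entries (two paths).
Delta = np.zeros((G_r, G_t), dtype=complex)
Delta[2, 5] = 1.0 - 0.5j
Delta[6, 1] = 0.3j

# Noiseless received signal matrix Y = W^H A_R Delta A_T^H F S.
Y = W.conj().T @ A_R @ Delta @ A_T.conj().T @ F @ S

def vec(A):
    """Column-stacking vectorization."""
    return A.reshape(-1, order="F")

# vec(A X C) = (C^T kron A) vec(X) with A = W^H A_R, X = Delta, C = A_T^H F S.
Phi = np.kron((A_T.conj().T @ F @ S).T, W.conj().T @ A_R)
g = vec(Delta)

print(Phi.shape, np.allclose(vec(Y), Phi @ g))
```

Here Φ has NM rows and G_r G_t columns, so increasing the angular grid resolution widens Φ while the number of measurements stays fixed.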

To identify the nonzero elements in the sparse path gain vector g, sparse recovery algorithms such as the orthogonal matching pursuit (OMP) can be used [18]. For instance, in each iteration, the OMP algorithm picks the column [TeX:] $$\boldsymbol{\Phi}_i$$ of the sensing matrix [TeX:] $$\Phi$$ that is maximally correlated with the residual vector [TeX:] $$\mathbf{r}^{k-1}$$:

[TeX:] $$i=\arg \max _{i=1, \cdots, G_r G_t}\left\|\boldsymbol{\Phi}_i^H \mathbf{r}^{k-1}\right\|_2^2$$

where [TeX:] $$\hat{\Omega}^{k-1}$$, [TeX:] $$\mathbf{r}^{k-1}=\tilde{\mathbf{y}}-\boldsymbol{\Phi}_{\hat{\Omega}^{k-1}} \hat{\mathbf{g}}^{k-1}$$, and [TeX:] $$\hat{\mathbf{g}}^{k-1}=\left(\boldsymbol{\Phi}_{\hat{\Omega}^{k-1}}^H \boldsymbol{\Phi}_{\hat{\Omega}^{k-1}}\right)^{-1} \boldsymbol{\Phi}_{\hat{\Omega}^{k-1}}^H \tilde{\mathbf{y}}$$ are the estimated positions of the nonzero elements up to the (k − 1)-th iteration, the (k − 1)-th residual vector, and the (k − 1)-th sparse path gain vector estimate, respectively.
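The OMP iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the sensing matrix is a random complex Gaussian stand-in with normalized columns, the observation is noiseless, and the number of paths is assumed to be known.

```python
import numpy as np

def omp(Phi, y, num_paths):
    """Orthogonal matching pursuit: greedily select num_paths columns of Phi."""
    residual = y.copy()
    support = []
    coef = np.zeros(0, dtype=complex)
    for _ in range(num_paths):
        # Pick the column most correlated with the current residual.
        corr = np.abs(Phi.conj().T @ residual)
        if support:
            corr[support] = 0.0  # never reselect an already chosen column
        support.append(int(np.argmax(corr)))
        # Least-squares fit on the selected support, then update the residual.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    g_hat = np.zeros(Phi.shape[1], dtype=complex)
    g_hat[support] = coef
    return g_hat, sorted(support)

rng = np.random.default_rng(2)
num_meas, num_grid, P = 64, 128, 3

# Random stand-in for the sensing matrix, with unit-norm columns.
Phi = rng.standard_normal((num_meas, num_grid)) + 1j * rng.standard_normal((num_meas, num_grid))
Phi /= np.linalg.norm(Phi, axis=0)

# P-sparse path gain vector and noiseless measurement.
true_support = [5, 20, 41]
g = np.zeros(num_grid, dtype=complex)
g[true_support] = [1.0, -1.0j, (1.0 + 1.0j) / np.sqrt(2)]
y = Phi @ g

g_hat, support = omp(Phi, y, P)
print("estimated support:", support)
```

With noise or with a highly coherent sensing matrix (as happens with a fine angular grid), the greedy selection can pick wrong columns, which is the limitation discussed next.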

While the CS-based channel estimation is effective in dealing with the sparsity of the mmWave channel, it might not work well in practical scenarios where the mismatch between the true angles {[TeX:] $$\theta_i, \phi_i$$} and the quantized angles in the angular bases {[TeX:] $$\hat{\theta}_i \in\left[0,2 \pi / G_r, \cdots, 2 \pi / G_r\left(G_r-1\right)\right], \hat{\phi}_i \in\left[0,2 \pi / G_t, \cdots, 2 \pi / G_t\left(G_t-1\right)\right]$$} is considerable [14]. By applying high-resolution angle quantization, one can reduce the error caused by the mismatch. In this case, however, the column dimension of the sensing matrix [TeX:] $$\Phi$$ would be much larger than the size of measurement vector [TeX:] $$\tilde{\mathbf{y}}$$, increasing the underdetermined ratio of the system and degrading the channel estimation performance.

B. Design Principle of DL-based Wireless Channel Estimation

As discussed, conventional channel estimation techniques are effective in a system where the input-output relationship is explicit and linearly expressed. However, the efficacy of this approach can degrade when the wireless environments and systems become complicated and the input-output relationship is nonlinear.

As an entirely new paradigm to deal with the channel estimation problem, DL has received much attention in recent years. When a system is complex, meaning that it has a complicated input/output relationship and its internal structure is highly nonlinear, it would be very difficult to come up with a closed-form analytic solution. In fact, when the system has a constraint (e.g., the solution lies on a nonlinear manifold or the output is a nonlinear transformation of the input) and thus the analytic solution is far from being optimal, DL comes to the rescue. The holy grail of DL is to let the machine learn the complicated, often highly nonlinear, relationship between the input dataset and the desired output without human intervention. In a nutshell, DL-based channel estimation is distinct from conventional channel estimation in two main respects: data-driven training and end-to-end learning of the black box (see Fig. 1). Instead of following the analytical avenue, the DL model approximates the channel estimation function via the training process. In the training phase, the DL parameters (weights and biases) are updated to identify the end-to-end mapping between the input dataset (typically the received pilot signal) and the wireless channel. Once the training is finished, in the inference phase, the DL model returns the wireless channel for the input signal. One can infer that what we essentially need to do is to feed a well-prepared training dataset into a properly designed neural network. This seems simple but requires a lot of hands-on experience to achieve the maximum effect.

C. Learning Techniques for DL-based Wireless Channel Estimation and Exploitation

When one tries to use DL to estimate the wireless channel, perhaps the first thing to consider is to determine what learning technique to use. Depending on the design goal, training dataset, and learning mechanism, DL techniques can be roughly divided into three categories: supervised learning, unsupervised learning, and reinforcement learning.

1) Supervised learning: The primary goal of supervised learning is to learn a mapping function between the input dataset and the desired solution, called the label. To scrutinize the quality of a designed neural network and reflect it in the weight update process, we need a loss function that measures how far the predicted channel (or the latent variable constructing the channel) is from the label. The difference between the two, in the form of the MSE or cross entropy, is used as the loss function. Typically, there are two types of supervised learning: classification, which finds the categorical class of a given input, and regression, which returns a numerical value. The classification task is suitable for detection problems such as time-domain channel tap detection, LoS/NLoS detection, and indoor/outdoor detection, and the regression task is a good fit for estimation problems such as angle (DoA/AoD) estimation, path gain estimation, and path delay estimation [19].

In the time-domain channel tap detection, for example, we identify a few nonzero (dominating) taps among all possible tapped delay-lines so that the problem can be well modeled as a multi-label classification problem identifying a few, say k, labels among N classes. By employing the set of received vectors {[TeX:] $$\mathbf{y}_1, \cdots, \mathbf{y}_n$$} as inputs and the nonzero tap index [TeX:] $$\Omega$$ as an output, deep neural network (DNN) is trained to find out indices of nonzero taps (see Fig. 3). In the DoA estimation problem, on the other hand, desired task is to produce the real-valued angle estimate [TeX:] $$\hat{\psi}$$ from the received pilot signals y (see Fig. 2(a)). Using the MSE between the real angle [TeX:] $$\psi$$ and estimate [TeX:] $$\hat{\psi}$$ obtained from DL as a loss function, DNN learns the regression mapping between y and [TeX:] $$\psi$$ (see more details in Section IV.B).
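The two loss types can be illustrated with a short NumPy sketch. The multi-hot tap label, the predicted probabilities, and the angle values below are made-up numbers chosen for illustration, and the binary cross entropy per tap is one common choice for multi-label classification rather than the specific loss used in the paper.

```python
import numpy as np

def binary_cross_entropy(p, target, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and multi-hot labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def mse(estimate, target):
    """Mean squared error between an estimate and its label."""
    return float(np.mean((np.asarray(estimate) - np.asarray(target)) ** 2))

# Classification: taps 2, 4, and 5 out of N = 8 candidate delay taps are nonzero.
omega = np.zeros(8)
omega[[2, 4, 5]] = 1.0

good_pred = np.where(omega == 1, 0.9, 0.1)  # confident and correct prediction
flat_pred = np.full(8, 0.5)                 # uninformative prediction

loss_good = binary_cross_entropy(good_pred, omega)
loss_flat = binary_cross_entropy(flat_pred, omega)

# Regression: angle estimates (radians) compared with the true angles.
psi = np.array([0.30, 1.10])
psi_hat = np.array([0.28, 1.15])
angle_loss = mse(psi_hat, psi)

print(loss_good, loss_flat, angle_loss)
```

The confident, correct prediction yields a smaller classification loss than the uninformative one, which is the signal the weight update uses during training.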

2) Unsupervised learning: Unsupervised learning is used when the ground-truth label is unavailable for reasons such as the nonlinearity or nonconvexity of the problem. In this case, one cannot compute the difference between the generated output and the label, and thus the design goal (i.e., the objective function) is used as the loss function instead. In the pilot signal allocation problem, for example, it is very difficult to find an optimal pilot pattern (e.g., the pilot position in the resource block (RB) in 4G LTE and 5G NR) optimizing the performance metric since the problem is a highly nonlinear mixed-integer program [20]. In this case, the DL model can be trained by employing the quality of service (QoS) function as the loss function. The loss function can be the target MSE (e.g., the MSE between the original channel and the reconstructed channel), the throughput (e.g., the throughput obtained by using the channel estimated from the learned pilot pattern), and so on.

Table 1.

| Learning technique | Applicable problem | Loss function | Application example |
|---|---|---|---|
| Supervised learning | Detection problem using the classification training | Cross entropy, Kullback-Leibler (KL) divergence | LoS/NLoS detection, time-domain channel tap detection |
| Supervised learning | Estimation problem using the regression training | Mean squared error (MSE), mean absolute error (MAE) | AoA, AoD, DoA estimation, path gain estimation |
| Unsupervised learning | Optimization problem | Objective function to be optimized (e.g., sum-rate, cell throughput) | Pilot signal allocation, channel state information feedback |
| Reinforcement learning | Sequential decision making problem | Cumulative reward (e.g., sum-rate, total power consumption) | Beam tracking and selection, CQI feedback |

Fig. 3.

Illustration of tapped delay lines of the time-domain channel. In this case, we set Ω = [0,0,1,0,1,1,0,··· ,1,··· ,0].

3) Reinforcement learning (RL): RL is a goal-oriented learning technique where an agent learns how to solve a task by trial and error. In the learning process, the agent observes the state of an environment, takes an action, and then receives a reward for the action. RL is suitable for sequential decision-making problems whose purpose is to find a series of actions optimizing a performance metric such as data rate, energy consumption, or latency. Recently, deep RL (DRL) has been widely used since it can effectively handle the large number of state-action pairs in dynamically varying wireless environments. While DRL is in general not directly used for channel estimation, it can be used for channel-related functions such as beam tracking and channel quality indication. For example, when one tries to track the angle in the V2X system using DRL, a base station (BS), playing the role of the agent, observes the state (e.g., distance, velocity, and angles (AoD/DoA)) and then determines the action (adjusting the beam direction) that maximizes the reward (e.g., by minimizing the number of retransmissions).


In this section, we delve into two main issues in DL-based wireless channel estimation: a sufficient and comprehensive training dataset and a properly designed neural network.

A. Training Dataset Acquisition

When the number of training samples is not sufficient, the designed DL-based channel estimation model would be closely fitted to the training dataset, making it difficult to make a reasonable inference for an unseen channel. This problem, in which the trained DL model lacks the capability to generalize, is often called overfitting. In the angle estimation task, for example, if the received signals are generated only from the angles in [TeX:] $$[0, \pi / 2)$$, then the trained network cannot accurately identify the remaining angles in [TeX:] $$[\pi / 2, \pi)$$. To prevent the overfitting problem, the dataset should be large enough to cover all possible scenarios. This is not easy, in particular for the wireless channel, since the number of real transmissions required would be enormous. In acquiring the training dataset, we basically have three options:

• Collection from the actual received signals

• Synthetic data generation using the analytic system model or the ray-tracing simulator

• Real-like training set generation using generative adversarial network (GAN)

In the training set acquisition, a straightforward option is to collect the real transmit/receive signal pairs. Doing so, however, will cause a significant overhead since it requires too many training data transmissions. For example, collecting one million received signals in 5G NR systems will take more than 13 minutes ([TeX:] $$10^6$$ symbols × 0.1 subframe/symbol × 8 ms/subframe = 800 s).

To reduce the overhead, one can consider a synthetically generated dataset (see Fig. 4(a)). In fact, analytic models have been widely used in the design, test, and performance evaluation phases of most channel estimation algorithms. For example, propagation channels such as the extended pedestrian A (EPA) channel or the extended vehicular A (EVA) channel have been popularly employed in the generation of training datasets [21]. In the EVA channel, for example, the channel is expressed by 9 tapped delays (i.e., [0, 30, 150, 310, 370, 710, 1090, 1730, 2510] ns), each of which has a different path gain in the delay-Doppler domain. Since synthetic data can be generated with simple programming, the time and effort needed to collect a huge training dataset can be saved. However, there might be some, arguably non-trivial, performance degradation caused by the model mismatch.
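A minimal sketch of such a synthetic data generator is given below. The tap delays are those listed above; the relative tap powers, the subcarrier spacing, the number of subcarriers, the SNR, and the all-ones pilot are assumptions of this sketch (the tap powers are taken to follow the 3GPP EVA profile and should be checked against the specification before use). Each tap is drawn as an independent Rayleigh-fading gain and Doppler is ignored.

```python
import numpy as np

# EVA tap delays from the text (ns). The relative tap powers (dB) are an
# assumption of this sketch, taken to follow the 3GPP EVA delay profile.
DELAYS_NS = np.array([0, 30, 150, 310, 370, 710, 1090, 1730, 2510])
POWERS_DB = np.array([0.0, -1.5, -1.4, -3.6, -0.6, -9.1, -7.0, -12.0, -16.9])

def generate_dataset(num_samples, num_subcarriers=64, spacing_hz=15e3,
                     snr_db=10.0, seed=0):
    """Return (Y, H): noisy pilot observations and true frequency-domain channels."""
    rng = np.random.default_rng(seed)
    num_taps = DELAYS_NS.size

    # Normalize tap powers so that the average channel power is one.
    p = 10 ** (POWERS_DB / 10)
    p /= p.sum()

    # Independent complex Gaussian (Rayleigh) gain per tap and per sample.
    gains = np.sqrt(p / 2) * (rng.standard_normal((num_samples, num_taps))
                              + 1j * rng.standard_normal((num_samples, num_taps)))

    # Frequency response: H[k] = sum_l gain_l * exp(-j 2 pi f_k tau_l).
    freqs = spacing_hz * np.arange(num_subcarriers)
    phase = np.exp(-2j * np.pi * np.outer(DELAYS_NS * 1e-9, freqs))
    H = gains @ phase

    # All-ones pilots, so the received pilot signal is H plus noise.
    noise_var = 10 ** (-snr_db / 10)
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal(H.shape)
                                      + 1j * rng.standard_normal(H.shape))
    return H + noise, H

Y, H = generate_dataset(1000)
print(Y.shape, H.shape)
```

The pairs (Y, H) can serve as inputs and labels for supervised training; any mismatch between this analytic model and the deployment environment carries over to the trained network.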

Yet another intriguing option is to use artificial but realistic samples generated by a special DL technique. This approach is particularly useful when the analytic model is unknown or non-existent (e.g., the deep underwater acoustic channel and the non-terrestrial network (NTN) channel) and the real measured data is not large enough (e.g., the THz-band channel). In such cases, the GAN technique, an approach to generate samples having the same distribution as the input dataset, can be employed [22]. Basically, a GAN consists of two neural networks: the generator G and the discriminator D (see Fig. 4(b)). The generator G tries to produce real-like data samples and the discriminator D tries to distinguish real (authentic) data samples from fake ones. To be specific, G is trained to generate real-like data G(z) from the random noise z and D is trained to distinguish whether the generator output G(z) is real or fake. To accomplish this, the min-max loss function, expressed as the cross-entropy (i.e., [TeX:] $$H(q)=-q \log (p(q))-(1-q) \log (1-p(q))$$) between the distribution of the generator output G(z) and that of the measured channel data x, is used:

Fig. 4.

Illustration of data acquisition strategies: (a) Synthetic data generation; (b) GAN-based data generation.

[TeX:] $$\min _G \max _D \mathbb{E}_{\mathbf{x}}[\log (D(\mathbf{x}))]+\mathbb{E}_{\mathbf{z}}[\log (1-D(G(\mathbf{z})))]$$

where the discriminator output D(x) is the probability of x being real (non-fake). In the training process, parameters of G are updated while those of D are fixed and vice versa. When the training is finished properly, the generator output G(z) is fairly reliable so that the discriminator cannot judge whether the generator output is real or fake (i.e., D(G(z)) ≈ 0.5). This means that we can safely use the generator output for the training purpose (we will say more on GAN-based data generation in Section IV.D).
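The behavior of the min-max objective can be checked numerically. The sketch below evaluates a Monte Carlo estimate of the objective for hand-picked discriminator outputs; no networks are trained, and the output values 0.9, 0.1, and 0.5 are illustrative assumptions.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Sample estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    d_real = np.clip(d_real, eps, 1 - eps)
    d_fake = np.clip(d_fake, eps, 1 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

# Discriminator that separates real and generated samples well.
confident = gan_value(np.full(4, 0.9), np.full(4, 0.1))

# Discriminator that cannot tell them apart: D(x) = D(G(z)) = 0.5.
confused = gan_value(np.full(4, 0.5), np.full(4, 0.5))

print(confident, confused)
```

When the discriminator outputs 0.5 everywhere, the objective equals −2 log 2, which is lower than the value obtained by a discriminator that separates the two sample sets, consistent with the generator trying to minimize what the discriminator maximizes.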

B. DNN Architecture for Wireless Channel Estimation

In the design of a DNN-based channel estimator, one should consider the channel characteristics (e.g., temporal/spatial/geometric correlation), wireless environments (e.g., mmWave/THz/V2X/UAV link), and system configurations (e.g., bandwidth, power of the transmit pilot signal, number of antennas).

1) Baseline network: A natural first step of the DNN design is to choose the baseline architecture. Based on the connection shape between neighboring layers, neural network can be roughly divided into three types: fully-connected network (FCN), convolutional neural network (CNN), and recurrent neural network (RNN).

FCN can be used universally since each hidden unit (neuron) is connected to all neurons in the next layer. When the input dataset has a spatial structure (e.g., the doubly selective channel and the spatial-frequency domain channel in MIMO systems), CNN might be an appealing option. In CNN, each neuron is computed by the convolution of a 2D spatial filter and a part (e.g., a rectangular region) of the neurons in the previous layer. Due to the local connectivity of the convolution filter, CNN facilitates the extraction of spatially correlated features. For example, in MIMO channel estimation, the spatial correlation among the uniform rectangular array (URA) antennas can be extracted using CNN. When the input sequence is temporally correlated, which is true for most wireless fading channels, an RNN model (e.g., long short-term memory (LSTM)) might be a good choice. In these approaches, temporally correlated features can be extracted by employing the current inputs together with the outputs of the previous hidden layer. For instance, by applying RNN to mmWave channel estimation, the change of the Doppler frequency caused by the movement of the mobile can be extracted.

Another promising network architecture distinct from the representative structures we just mentioned is the attention module [23]. In essence, the attention module is a network component that quantifies the correlation between every pair of input elements. By measuring the attention score between two input elements, called the key and the query, the attention module makes it possible to model the dependencies between input elements regardless of their distance in pixels or time. Recently, the attention module has been used instead of CNN and RNN since it offers the capability to extract long-distance and long-term dependencies between input elements [24], [25]. For example, the attention module can be used to extract the spatial correlation between distant MIMO antenna elements (e.g., elements placed at both ends of the array) or to extract the temporal correlation between the traffic pattern and the co-channel interference in the MIMO channel.
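To illustrate how attention relates every pair of input elements, the sketch below implements single-head self-attention in NumPy. It assumes the common scaled dot-product form, which the text does not specify; the random projection matrices stand in for learned weights, and treating each antenna element as one input token with real and imaginary parts as features is a hypothetical choice made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over the rows (tokens) of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # scores[i, j] quantifies how strongly token i attends to token j.
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    return scores @ V, scores

rng = np.random.default_rng(3)
num_ant, d_in, d_model = 16, 2, 8

# One token per antenna element: real and imaginary part of its received sample.
X = rng.standard_normal((num_ant, d_in))
# Random stand-ins for the learned query, key, and value projections.
W_q, W_k, W_v = (rng.standard_normal((d_in, d_model)) for _ in range(3))

out, scores = self_attention(X, W_q, W_k, W_v)
print(out.shape, scores.shape)
```

Because the score matrix has an entry for every pair of antennas, the first and last elements of the array interact in a single layer, whereas a convolution with a small kernel would need several layers to connect them.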

2) Activation layer: Activation layer is used to 1) embed the nonlinearity in the hidden layer and 2) generate the desired type of output in the final layer.

In each hidden layer, weighted sum of inputs passes through the activation layer to determine whether the information generated by the hidden unit is activated (delivered to the next layer) or not. To this end, rectified linear unit (ReLU) function [TeX:] $$f(x)$$ = max([TeX:] $$x$$, 0) or hyperbolic tangent function [TeX:] $$f(x)$$ = tanh([TeX:] $$x$$) can be used [1]. By imposing the activation to the linearly transformed input, one can better promote the nonlinear operation (e.g., spline interpolation between adjacent subcarrier channel) and systematic nonlinearity (e.g., relationship between spatial-domain channel and AoA/AoD).

In the final layer of the DNN, the activation layer is used to make sure that the generated output is of the desired type. In the classification problem, the ground-truth label for each class is a probability, so the final output should also take the form of a probability. When there are several time-domain channel taps or the angular-domain channel paths are non-unique in the channel estimation problem, it would be desirable to use the sigmoid function [TeX:] $$f(x)=1 /\left(1+e^{-x}\right)$$, which returns an individual probability for each class [1], [26]. In contrast, when the problem is modeled as a multi-class classification problem, such as the CQI detection problem that selects the proper value among quantized CQI levels (e.g., 0∼15 in LTE), a softmax function [TeX:] $$f_i(\mathbf{x})=e^{x_i} / \sum_j e^{x_j}$$ would be a good choice since it normalizes the output vector into a probability distribution over all classes [27].
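The difference between the two output activations can be seen on a small example. In the sketch below, the logits are arbitrary numbers, and the 0.5 detection threshold for the multi-label case is an assumption made for illustration.

```python
import numpy as np

def sigmoid(x):
    """Element-wise sigmoid: an independent probability per class."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Softmax: a single probability distribution over all classes."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5, 3.0])  # final-layer outputs before activation

# Multi-label case (e.g., several nonzero taps): threshold each probability.
tap_probs = sigmoid(logits)
detected_taps = np.flatnonzero(tap_probs > 0.5)

# Multi-class case (e.g., one CQI level): pick the most probable class.
level_probs = softmax(logits)
selected_level = int(np.argmax(level_probs))

print(detected_taps, selected_level)
```

The sigmoid outputs need not sum to one, so several taps can be declared active at once, while the softmax outputs sum to one and select exactly one level.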

3) Input normalization: In the training process, the neural network computes the gradient of a loss function with respect to any weight and then updates the weight in the negative direction of the gradient. When an input varies in a wide range (e.g., multi-user communication scenario), therefore, variation in the weight update process will also be large, degrading the training stability and convergence speed severely. To prevent such ill-behavior, one should perform the normalization of the outputs for each layer. Typically, there are two types of the normalization strategies: layer normalization and batch normalization.

First, when the input vector contains signals from multiple users with different wireless geometries, variation of the received signals would be quite large. In such case, the layer normalization, normalizing each individual input vector, is a good choice because the layer normalization scheme ensures that the normalized input distribution has the fixed mean and variance.

In contrast, when the input data consists of several different types of information, the batch normalization (BN) might be a useful option. In the mini-batch [TeX:] $$\mathbf{B}=\left[\mathbf{y}^{(1)} \cdots \mathbf{y}^{(N)}\right]^T$$ consisting of multiple input samples [TeX:] $$\mathbf{y}^{(1)}, \cdots, \mathbf{y}^{(N)}$$, elements in each column of B (i.e., elements with the same input type) are normalized. For example, in the DRL-based beam tracking problem, both the velocity of the moving target and the angles (AoD/DoA) are used as inputs to DRL. Since the scales of the two components would be quite different, the layer normalization would mix quantities of different types and distort the input dataset. To avoid this, the velocity and angles need to be normalized separately using BN.
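The difference between the two normalization strategies can be illustrated with a small NumPy sketch. It assumes the mini-batch is stored with one sample per row (so that each column holds one input type), omits the learnable scale and shift that practical BN and layer normalization layers usually carry, and uses made-up velocity and angle ranges.

```python
import numpy as np

def layer_norm(B, eps=1e-5):
    # Normalize each sample (row) across its own entries.
    mean = B.mean(axis=1, keepdims=True)
    var = B.var(axis=1, keepdims=True)
    return (B - mean) / np.sqrt(var + eps)

def batch_norm(B, eps=1e-5):
    # Normalize each input type (column) across the mini-batch.
    mean = B.mean(axis=0, keepdims=True)
    var = B.var(axis=0, keepdims=True)
    return (B - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
N = 8                                                     # mini-batch size
velocity = rng.uniform(0.0, 30.0, size=(N, 1))            # hypothetical, m/s
angles = rng.uniform(-np.pi / 2, np.pi / 2, size=(N, 2))  # hypothetical, rad
B = np.hstack([velocity, angles])                         # shape (N, 3)

B_bn = batch_norm(B)   # velocity and angles are each rescaled separately
B_ln = layer_norm(B)   # velocity and angles are mixed within each sample
```

With `batch_norm`, every column ends up with approximately zero mean and unit variance regardless of its original unit, which is the behavior desired for the heterogeneous beam tracking input.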

4) Dropout layer: When we use a DNN consisting of multiple hidden layers, the final output is determined by the activated hidden units in each layer. So, for the highly correlated inputs (e.g., samples generated from correlated (low-rank) MIMO channel), their activation patterns will also be similar so that the final inference can be easily corrupted in the presence of perturbations (e.g., noise, ADC quantization error, and imperfect RF filtering). In order to mitigate this problem, the dropout layer where the activated hidden units are dropped out randomly can be used in the training phase [28]. In this scheme, by temporarily removing part of incoming and outgoing connections randomly, ambiguity (similarity) of the activation patterns among correlated dataset can be resolved.
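A minimal sketch of the dropout operation is given below. It uses the so-called inverted dropout convention (surviving units are rescaled by the keep probability during training so that nothing changes at test time); this convention and the drop probability of 0.5 are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def dropout(h, drop_prob, rng, training=True):
    # At test time (or with drop_prob = 0) the activations pass unchanged.
    if not training or drop_prob == 0.0:
        return h
    keep_prob = 1.0 - drop_prob
    # Each hidden unit is kept independently with probability keep_prob.
    mask = rng.random(h.shape) < keep_prob
    # Rescale the survivors so the expected activation is preserved.
    return h * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 1000))                       # hypothetical hidden activations
h_train = dropout(h, drop_prob=0.5, rng=rng)
h_test = dropout(h, drop_prob=0.5, rng=rng, training=False)
```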

5) Ensemble learning: Ensemble learning, a method to average out multiple outputs (inferences) of independently-trained networks, is conceptually analogous to the receiver diversity technique since it can enhance the output quality without requiring additional wireless resources (e.g., frequency, time, and transmitter power). In a multi-user communication scenario, for example, the trained network might be closely fitted to a certain wireless setup (i.e., a specific co-scheduled user pair) so that the trained DNN might not generate a reliable channel prediction for the inputs obtained from an unobserved wireless environment. In such a case, ensemble learning becomes a useful option. The essence of ensemble learning is to train multiple DNNs with different initial parameters and training sets obtained from different wireless environments and then combine the outputs of the multiple DNNs. Using ensemble learning, one can mitigate the overfitting to the specific wireless environments and generate more robust channel estimates.
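The diversity-like effect of output averaging can be seen in a toy experiment. In the sketch below, the K trained networks are replaced by a stand-in: each "network" returns the true channel vector plus an independent unit-variance error. This idealized independence is an assumption; errors of real networks trained on related data are usually correlated, so the gain would be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim = 5, 1000                       # number of networks, output dimension
h_true = rng.standard_normal(dim)      # hypothetical ground-truth channel

# Stand-in for K independently trained DNNs (truth + independent error).
outputs = h_true + rng.standard_normal((K, dim))

# Ensemble inference: average the K outputs.
h_ens = outputs.mean(axis=0)

mse_single = np.mean((outputs - h_true) ** 2, axis=1)  # one MSE per network
mse_ens = np.mean((h_ens - h_true) ** 2)               # MSE of the average
```

Under this idealization the ensemble MSE is roughly 1/K of the single-network MSE, in the same way that combining K independent receive branches averages out the noise.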

6) Loss function: Since DNN weights are updated in a direction to minimize the loss function, the loss function should well reflect the design goal. When the ground-truth label is available, one can use the cross entropy, MSE, or mean absolute error (MAE). If this is not the case, as we mentioned in the unsupervised learning, one might use the design goal (e.g., throughput or bit error rate) as a loss function.

If there exist multiple goals for the problem at hand, those can be combined together. For example, in the joint AUD and channel estimation in NOMA system, the DNN is trained to detect the active devices and at the same time estimate the channels associated with active users. To do so, we set the loss function [TeX:] $$J$$ as the weighted sum of the cross-entropy loss [TeX:] $$J_{A U D}$$ for AUD and the MSE loss [TeX:] $$J_{C E}$$ for CE (i.e., [TeX:] $$J=J_{A U D}+\lambda J_{C E}$$).
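A weighted-sum loss of this form can be sketched as follows. The activity indicators, channel values, and the weight `lam` are made-up numbers for illustration, and the clipping constant `eps` is only a numerical safeguard against log(0).

```python
import numpy as np

def binary_cross_entropy(delta, delta_hat, eps=1e-12):
    # J_AUD: cross-entropy between true activity and predicted probability
    delta_hat = np.clip(delta_hat, eps, 1.0 - eps)
    return -np.sum(delta * np.log(delta_hat)
                   + (1.0 - delta) * np.log(1.0 - delta_hat))

def mse(h, h_hat):
    # J_CE: mean squared error between true and estimated channels
    return np.mean(np.abs(h - h_hat) ** 2)

def joint_loss(delta, delta_hat, h, h_hat, lam):
    # J = J_AUD + lambda * J_CE
    return binary_cross_entropy(delta, delta_hat) + lam * mse(h, h_hat)

delta = np.array([1.0, 0.0, 1.0])          # hypothetical true activity
delta_hat = np.array([0.9, 0.1, 0.8])      # hypothetical DNN output
h = np.array([1.0 + 1.0j, 0.0, 0.5])       # hypothetical true channels
h_hat = np.array([0.9 + 1.0j, 0.0, 0.6])   # hypothetical channel estimates

J = joint_loss(delta, delta_hat, h, h_hat, lam=0.5)
```

A larger `lam` puts more emphasis on channel estimation accuracy relative to detection accuracy, so its value is a design choice that depends on which goal matters more.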

7) Weight update strategy: In order to update the network parameter set, the gradient of a loss function should be computed first. A naive way to update the parameters is the batch gradient descent (BGD) method where the gradient of the loss function is computed for the entire training dataset. Since the whole dataset is used in each and every training iteration, training cost is quite expensive and the training speed will be very slow. In the non-static scenario where the channel characteristics are varying, parameters corresponding to the dynamically changing wireless environments (e.g., Doppler spread, scatter location) would not be updated properly. A better option to deal with the issue is the stochastic gradient descent (SGD) method. In contrast to BGD, SGD uses a small number, say D, of samples in each training iteration (i.e., [TeX:] $$\Theta_t=\Theta_{t-1}-\frac{\eta}{D} \sum_{d=1}^D \nabla_{\Theta} J^{(d)}(\Theta)$$) so that it can update the network parameters as soon as a few samples are obtained.
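The mini-batch update rule above can be tried on a toy problem. The sketch below fits a linear model with a squared-error loss; the model, the synthetic data, the step size `eta`, and the mini-batch size `D` are all assumptions chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))                 # synthetic inputs
theta_true = np.array([1.0, -2.0, 0.5, 3.0])       # parameters to recover
y = X @ theta_true + 0.01 * rng.standard_normal(1000)

def minibatch_gradient(theta, Xb, yb):
    # (1/D) * sum_d grad J^(d), with J^(d) = 0.5 * (x_d^T theta - y_d)^2
    return Xb.T @ (Xb @ theta - yb) / len(yb)

theta = np.zeros(4)
eta, D = 0.1, 16                                   # step size, mini-batch size
for t in range(500):
    # Draw D samples and update immediately; the full dataset is never
    # needed for a single iteration (in contrast to BGD).
    idx = rng.choice(len(y), size=D, replace=False)
    theta = theta - eta * minibatch_gradient(theta, X[idx], y[idx])
```

Because each iteration touches only `D` samples, the same loop could in principle be run on freshly received samples, which is what makes this type of update attractive when the channel statistics drift over time.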

8) Knowledge distillation: When one tries to use the DL-based channel estimation in the Internet of Things (IoT) device, on-device energy consumption is a big concern since most of the IoT devices are battery-powered. To reduce the training overhead, knowledge distillation (KD), an approach to generate a relatively small-sized DL model from a trained large model, can be employed [29]. The key idea of KD is to train a small network (a.k.a. student network) using the output of a large network (a.k.a. teacher network). In the generation of the loss function, the output of the student network is compared against the output of the teacher network as well as the ground-truth label. In doing so, the student network implemented in the IoT device can easily capture the underlying features (e.g., similarity and difference among the classes) extracted by the teacher network with minimal training overhead.
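One common way to build such a loss is sketched below for a classification output. The temperature `T` and the mixing weight `alpha` follow the widely used soft-label formulation of KD and are assumptions here, since the text does not specify how the two comparisons are weighted; some implementations additionally scale the soft term by `T**2`, which is omitted.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - np.max(z))
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    # Hard term: cross-entropy against the ground-truth label.
    hard = -np.log(softmax(student_logits)[label])
    # Soft term: cross-entropy against the temperature-softened teacher output.
    q_teacher = softmax(teacher_logits, T)
    q_student = softmax(student_logits, T)
    soft = -np.sum(q_teacher * np.log(q_student))
    return alpha * hard + (1.0 - alpha) * soft

teacher = np.array([4.0, 1.0, 0.0])        # hypothetical teacher logits
student_good = np.array([4.0, 1.0, 0.0])   # student that mimics the teacher
student_bad = np.array([0.0, 1.0, 4.0])    # student that disagrees
loss_good = kd_loss(student_good, teacher, label=0)
loss_bad = kd_loss(student_bad, teacher, label=0)
```

The softened teacher output carries the relative similarity among classes (here, that class 1 is closer to class 0 than class 2 is), which is the information the student cannot obtain from the one-hot label alone.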

In order to exemplify the detailed techniques we discussed, we present the DNN architecture for the AoA detection (see Fig. 5). Due to the sparse scattering in the mmWave band, a propagation path can be characterized by a few AoAs. By identifying these angles, the receiver can align the beam direction to the transmitter, thereby maximizing the beamforming gain. In DNN, we use the received signal y and the steering matrix [TeX:] $$\mathbf{A}=\left[\begin{array}{lll} \mathbf{a}\left(\theta_1\right) & \cdots & \mathbf{a}\left(\theta_N\right) \end{array}\right]$$ as inputs and the set of the detected AoAs Ω as outputs (see footnote 4). Since an input is a composite of the received signal and steering vectors, we use BN to normalize each component separately. Also, to generate the individual probability for each angle, we use the sigmoid activation function in the final layer.

4 [TeX:] $$\mathbf{a}\left(\theta_i\right)=\left[\begin{array}{llll} 1 & e^{j \pi \sin \theta_i} & \cdots & e^{j \pi(m-1) \sin \theta_i} \end{array}\right]^T$$ is the steering vector corresponding to [TeX:] $$\theta_i$$.
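The steering vector in the footnote and the steering matrix A can be constructed as in the sketch below. The array size, the angle grid, and the noise level are made-up values, and the final matched-filter step is only a simple non-DL baseline added to check that the matrix behaves as expected; it is not the DNN of Fig. 5.

```python
import numpy as np

def steering_vector(theta, m):
    # a(theta) = [1, e^{j*pi*sin(theta)}, ..., e^{j*pi*(m-1)*sin(theta)}]^T
    return np.exp(1j * np.pi * np.arange(m) * np.sin(theta))

def steering_matrix(thetas, m):
    # A = [a(theta_1) ... a(theta_N)], one column per candidate angle
    return np.stack([steering_vector(t, m) for t in thetas], axis=1)

m = 16                                     # hypothetical number of antennas
grid = np.deg2rad(np.arange(-60, 61, 5))   # hypothetical candidate AoAs
A = steering_matrix(grid, m)               # shape (m, N) with N = 25

# Toy single-path received signal arriving from grid index 15 (15 degrees).
rng = np.random.default_rng(0)
true_idx = 15
noise = 0.05 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
y = A[:, true_idx] + noise

# Matched-filter baseline: pick the column most correlated with y.
est_idx = int(np.argmax(np.abs(A.conj().T @ y)))
```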

To judge the effectiveness of the DNN components consisting of the normalization, dropout, and ensemble learning, we evaluate the success probability of the AoA detection. In the numerical evaluations, we test 1) FCN, 2) FCN with BN, 3) FCN with the dropout layer, 4) FCN with the ensemble network, and 5) FCN with all components we discussed in this section. As shown in Fig. 6, the performance gain introduced by the aforementioned techniques is considerable. For example, FCN with the dropout layer achieves 4.8 dB gain over the conventional FCN since the highly correlated steering vectors can be better resolved using the dropout technique. The gain obtained from BN is also significant (4.7 dB gain) since it reduces the variation of the received signal caused by the device location change. Finally, when the gains induced by all techniques are combined together, we can achieve very reliable AoA detection performance, which cannot be achieved by the basic FCN even in the high SNR regime.


In this section, we discuss the DL-based channel estimation techniques. We also investigate the DL-based CSI feedback in MIMO system and the meta learning-based GAN technique for real-time training.

A. Time-domain Channel Estimation for Wideband Multipath Communications

It has been shown from recent measurement campaigns that the scattering effect of the environment is limited in many wireless channels. In other words, propagation paths tend to be clustered within a small spread so that there exist only a few dominant scattering clusters. In the wideband communication systems where the maximal delay and Doppler spreads are large and there are only a few dominant paths, the channel can be readily modeled by a few tapped delay lines in the time-domain (i.e., delay-Doppler domain) [18]. Moreover, since the path delays vary much more slowly than the path gains due to the temporal channel correlation, such sparsity is almost unchanged during the coherence time. Therefore, one can greatly reduce the pilot overhead by estimating the channel in time-domain.

Let us consider the OFDM systems where we allocate the pilot symbols [TeX:] $$\mathbf{p}_f=\left[\begin{array}{llll} p_0 & p_1 & \cdots & p_{m-1} \end{array}\right]$$ in the frequency domain. When we model the time-domain channel as an n-tap channel impulse response [TeX:] $$\mathbf{h}=\left[h_0 \; h_1 \cdots h_{n-1}\right]^T \in \mathbb{C}^{n \times 1}$$, then the received signal y in the frequency domain is expressed as

[TeX:] $$\mathbf{y}=\operatorname{diag}\left(\mathbf{p}_f\right) \boldsymbol{\Phi} \mathbf{D} \mathbf{h}+\mathbf{v}=\mathbf{P} \mathbf{h}+\mathbf{v}$$

Fig. 5. Exemplary DNN architecture designed for the AoA detection.

Fig. 6. AoA detection performance of various DNNs as a function of SNR. The detection success probability corresponds to the percentage of detected AoAs among all angles.

where [TeX:] $$\mathbf{D} \in \mathbb{C}^{n \times n}$$ is the DFT matrix that converts the time-domain channel response h to the frequency-domain response, [TeX:] $$\boldsymbol{\Phi} \in \mathbb{C}^{m \times n}$$ is the row selection matrix determining the location of pilots in the frequency domain, and [TeX:] $$\mathbf{P}=\operatorname{diag}\left(\mathbf{p}_f\right) \boldsymbol{\Phi} \mathbf{D}$$. Due to the limited number of scattering clusters, the number of nonzero taps in h is very small compared with the channel length [TeX:] $$n$$. When we introduce the nonzero tap indicator [TeX:] $$\boldsymbol{\delta}=\left[\begin{array}{llll} \delta_0 & \delta_1 & \cdots & \delta_{n-1} \end{array}\right]$$ ([TeX:] $$\delta_i$$ = 1 and 0 indicate that the [TeX:] $$i$$-th element of h is nonzero and zero, respectively), y can be rewritten as

[TeX:] $$\mathbf{y}=\sum_{i=0}^{n-1} \delta_i \mathbf{p}_i h_i+\mathbf{v}$$

where [TeX:] $$h_i$$ is the [TeX:] $$i$$-th element of h and [TeX:] $$\mathbf{p}_i$$ is the [TeX:] $$i$$-th column of P. Consequently, the time-domain channel estimation problem can be decomposed into two sub-problems: 1) support identification to find out nonzero positions in h and 2) nonzero element estimation to find out [TeX:] $$h_i$$. Once the support of h is identified, the estimate of the nonzero elements can be easily obtained by conventional linear estimation techniques (see Section II.A).
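The signal model and the second sub-problem can be sketched as follows. The dimensions, the evenly spaced pilot pattern, the unit-modulus pilots, and the restriction of the nonzero taps to the first m delays are assumptions made so that the example stays small and well conditioned; the support is taken as known here, standing in for the output of a support identification step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 16, 3        # channel length, pilots, nonzero taps (k is made up)

D = np.fft.fft(np.eye(n))                      # n x n DFT matrix
pilot_idx = np.arange(0, n, n // m)            # evenly spaced pilot subcarriers
Phi = np.eye(n)[pilot_idx]                     # m x n row selection matrix
p_f = np.exp(2j * np.pi * rng.random(m))       # unit-modulus pilot symbols
P = np.diag(p_f) @ Phi @ D                     # m x n system matrix

# Sparse channel: k nonzero taps among the first m delays.
support = np.sort(rng.choice(m, size=k, replace=False))
h = np.zeros(n, dtype=complex)
h[support] = rng.standard_normal(k) + 1j * rng.standard_normal(k)

v = 0.01 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
y = P @ h + v                                  # received pilot signal

# Sub-problem 2: with the support given, solve a small least-squares
# problem over the selected columns of P only.
h_hat = np.zeros(n, dtype=complex)
h_hat[support] = np.linalg.lstsq(P[:, support], y, rcond=None)[0]
```

Only k unknowns are estimated from m pilot observations, instead of n unknowns, which is where the pilot saving of the time-domain approach comes from.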

As mentioned, the time-domain channel taps can be detected using the CS technique, but it might not work well when the columns of the system matrix P are highly correlated and the sparsity (number of nonzero taps) of h increases. When we use the DL technique, we need to find out the direct mapping [TeX:] $$\boldsymbol{\delta}=g(\mathbf{y} ; \boldsymbol{\theta})$$ between the received signal y and the nonzero tap indicator [TeX:] $$\boldsymbol{\delta}$$, where [TeX:] $$\boldsymbol{\theta}$$ is the set of network parameters (see Fig. 7(a)). In the DL-based channel tap detection, the final output is the n-dimensional vector [TeX:] $$\hat{\boldsymbol{\delta}}$$ whose elements represent the probability of being a nonzero tap. Thus, [TeX:] $$\hat{\boldsymbol{\delta}}$$ needs to be compared against the true probability [TeX:] $$\boldsymbol{\delta}$$ in the calculation of the loss J. To do so, we employ the cross-entropy loss [TeX:] $$J=-\sum_{i=0}^{n-1}\left(\delta_i \log \hat{\delta}_i+\left(1-\delta_i\right) \log \left(1-\hat{\delta}_i\right)\right)$$ during the training.

In Fig. 7(b), we numerically evaluate the MSE performance of the DL-based scheme as a function of SNR. For comparison, we also examine the performance of the conventional LS estimator, LMMSE estimator, and OMP algorithm. We observe that the DL-based time-domain channel estimation outperforms the conventional schemes in all SNR regimes (see footnote 5). Since the DNN learns the correlation feature of the system matrix P, DL-based channel tap detection can better discriminate the correlated supports in the test phase. For example, we observe that the DL-based time-domain channel estimation achieves 2 dB gain over the OMP at MSE = [TeX:] $$10^{-5}$$.

5 Note that we used 100,000 samples for the training and 5,000 samples for the test.

Fig. 7. (a) DNN architecture for time-domain channel tap detection. (b) MSE as a function of SNR (m = 64, n = 256).

B. Parametric Channel Estimation for mmWave/THz MIMO Communications

As a core technology for 5G and 6G, mmWave and THz communication have received much attention recently [30]–[32]. By leveraging the abundant frequency spectrum resource in the mmWave/THz frequency band (30 GHz ∼ 10 THz), mmWave/THz-based systems can support a much higher data rate than the conventional sub-6 GHz systems. In the mmWave/THz systems, the beamforming technique realized by the MIMO antenna arrays is required to compensate for the severe path loss in the high frequency bands [15]. Since the beamforming gain is maximized only when the beams are properly aligned with the signal propagation paths, acquisition of an accurate downlink channel is of importance for the success of the mmWave/THz MIMO systems.

We consider the downlink MIMO systems where the BS equipped with [TeX:] $$N_T$$ transmit antennas serves the user equipment (UE) equipped with [TeX:] $$N_R$$ receive antennas. Then, the goal of the channel estimation problem is to obtain the downlink channel matrix of the k-th pilot subcarrier [TeX:] $$\mathbf{H}[k] \in \mathbb{C}^{N_R \times N_T}$$ from the received signal [TeX:] $$\mathbf{Y}[k] \in \mathbb{C}^{T \times M}$$:

[TeX:] $$\mathbf{Y}[k]=\mathbf{W}^H \mathbf{H}[k] \mathbf{F S}[k]+\mathbf{N}[k], \quad \forall k \in \mathcal{K}$$

where [TeX:] $$\mathbf{F}=\left[\mathbf{f}_1, \cdots, \mathbf{f}_M\right] \in \mathbb{C}^{N_T \times M}$$ is the beamforming matrix, [TeX:] $$\mathbf{W}=\left[\mathbf{w}_1, \cdots, \mathbf{w}_T\right] \in \mathbb{C}^{N_R \times T}$$ is the combining matrix, and [TeX:] $$\mathbf{S}[k]=\operatorname{diag}\left(s_1[k], \cdots, s_M[k]\right)$$ is the pilot symbol matrix. While LS and LMMSE channel estimators have been popularly used for the acquisition of the mmWave MIMO channel [15], [33], these conventional approaches have a drawback since the amount of pilot resources is proportional to the number of transmit/receive antennas. In fact, due to the massive numbers of transmit and receive antennas, this issue is serious for the mmWave/THz MIMO systems. For example, if the numbers of transmit and receive antennas are [TeX:] $$N_T$$ = 64 and [TeX:] $$N_R$$ = 4, respectively, then the BS has to use at least 3 resource blocks (RBs) (12 × 7 resources in each RB of 4G LTE) just for the pilot transmission, occupying almost 25% of a subframe in LTE.

In order to reduce the dimension of a channel matrix to be estimated, one can consider the estimation of the sparse channel parameters, not the full-dimensional MIMO channel matrix. Specifically, when we assume the geometric narrowband channel model, then the frequency domain channel matrix [TeX:] $$\mathbf{H}^l[k]$$ of the [TeX:] $$k$$-th pilot subcarrier at the [TeX:] $$l$$-th channel coherence interval is expressed as

[TeX:] $$\mathbf{H}^l[k]=\sum_{i=1}^P \alpha_i^l e^{-j 2 \pi \tau_i^l k f_s} \mathbf{a}_R\left(\theta_i^l\right) \mathbf{a}_T^H\left(\phi_i^l\right)$$

where [TeX:] $$f_s$$ is the subcarrier spacing, P is the number of propagation paths, [TeX:] $$\alpha_i^l$$ is the complex gain, [TeX:] $$\theta_i^l$$ and [TeX:] $$\phi_i^l$$ are the AoA and AoD of the [TeX:] $$i$$-th path, respectively, and [TeX:] $$\tau_i^l$$ is the path delay of the [TeX:] $$i$$-th path at the [TeX:] $$l$$-th channel coherence interval.
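The geometric model above can be evaluated numerically as in the following sketch. The half-wavelength uniform linear array response and all numerical values (path gains, delays, angles, subcarrier spacing, antenna numbers) are assumptions chosen for illustration.

```python
import numpy as np

def ula_response(angle, num_antennas):
    # Half-wavelength ULA response (assumed array geometry)
    return np.exp(1j * np.pi * np.arange(num_antennas) * np.sin(angle))

def geometric_channel(alpha, tau, theta, phi, k, f_s, N_R, N_T):
    # H[k] = sum_i alpha_i * exp(-j*2*pi*tau_i*k*f_s) * a_R(theta_i) a_T(phi_i)^H
    H = np.zeros((N_R, N_T), dtype=complex)
    for a, t, th, ph in zip(alpha, tau, theta, phi):
        a_R = ula_response(th, N_R)
        a_T = ula_response(ph, N_T)
        H += a * np.exp(-2j * np.pi * t * k * f_s) * np.outer(a_R, a_T.conj())
    return H

# Hypothetical two-path channel (one strong path and one weaker path).
alpha = [1.0, 0.3j]            # complex path gains
tau = [10e-9, 35e-9]           # path delays in seconds
theta = [0.2, -0.7]            # AoAs in radians
phi = [0.5, -0.1]              # AoDs in radians
H = geometric_channel(alpha, tau, theta, phi, k=3, f_s=120e3, N_R=4, N_T=16)
```

Although `H` has 4 × 16 = 64 complex entries, it is fully determined by the four parameters of each of its two paths, and its rank cannot exceed the number of paths.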

One can see that [TeX:] $$\mathbf{H}^l[k]$$ is parameterized by a few channel parameters: AoAs [TeX:] $$\left\{\theta_i^l\right\}_{i=1}^P$$, AoDs [TeX:] $$\left\{\phi_i^l\right\}_{i=1}^P$$, path delays [TeX:] $$\left\{\tau_i^l\right\}_{i=1}^P$$, and path gains [TeX:] $$\left\{\alpha_i^l\right\}_{i=1}^P$$, since the number of effective paths is at most a few (LoS and 1∼2 NLoS paths) in the mmWave/THz channel. Typically, the number of propagation paths P is much smaller than the total number of antennas [TeX:] $$N=N_T \times N_R$$ (e.g., P = 2 ∼ 8 while N = 16 ∼ 256) due to the high path loss and directivity of the mmWave/THz signal. Therefore, one can greatly reduce the pilot overhead by estimating the sparse channel parameters [TeX:] $$\left\{\theta_i^l, \phi_i^l, \tau_i^l, \alpha_i^l\right\}_{i=1}^P$$ instead of the full-dimensional MIMO channel matrix [TeX:] $$\mathbf{H}^l[k]$$. To this end, one can exploit the LSTM, a DL technique specialized for extracting temporally correlated features from sequential data (see footnote 6). In our context, LSTM can dynamically adjust the channel parameter estimation based on the extracted temporally correlated feature. For example, when the mobility of a mobile is low and thus the temporal correlation of channel parameters is high, past parameters strongly affect the current parameters. LSTM can effectively deal with this type of temporal dependency and make a fast yet accurate channel parameter estimation with a relatively small amount of pilot resources. In a nutshell, the LSTM-based channel estimator learns a complicated nonlinear mapping [TeX:] $$g$$ between the received downlink pilot signals ([TeX:] $$\mathbf{y}^1, \cdots, \mathbf{y}^l$$) and the channel parameters ([TeX:] $$\theta^l, \phi^l, \tau^l, \alpha^l$$):

6 Basically, the key ingredients of the LSTM block are the cell state and three gates, viz., input gate, forget gate, and output gate. The cell state, serving as a memory to store information extracted from the past inputs, sequentially passes through the forget, input, and output gates. Based on the input and the previous output, each gate determines how much information is to be removed, written, and read in the cell state.

Fig. 8. (a) Overall receiver structure of the LSTM-based channel estimation scheme. (b) NMSE performance of channel estimation techniques as a function of SNR.

[TeX:] $$\left\{\hat{\theta}^l, \hat{\phi}^l, \hat{\tau}^l, \hat{\alpha}^l\right\}=g\left(\mathbf{y}^1, \cdots, \mathbf{y}^l ; \Theta\right)$$

where [TeX:] $$\Theta$$ is the set of weights and biases (see Fig. 8(a)). To find out the parameters {[TeX:] $$\hat{\theta}^l, \hat{\phi}^l, \hat{\tau}^l, \hat{\alpha}^l$$}, the channel estimate [TeX:] $$\widehat{\mathbf{H}}^l[k]=\sum_{i=1}^P \hat{\alpha}_i^l e^{-j 2 \pi \hat{\tau}_i^l k f_s} \mathbf{a}_R\left(\hat{\theta}_i^l\right) \mathbf{a}_T^H\left(\hat{\phi}_i^l\right)$$ needs to be compared against the true channel matrix [TeX:] $$\mathbf{H}^l[k]$$ in the training. To do so, we define the loss function [TeX:] $$J(\Theta)$$ as

[TeX:] $$J(\Theta)=\frac{1}{L} \sum_{l=1}^L \sum_{k=1}^K\left\|\mathbf{H}^l[k]-\widehat{\mathbf{H}}^l[k]\right\|_F^2$$

where L and K are the number of coherence intervals and the number of subcarriers, respectively.
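For readers less familiar with the LSTM recursion that underlies the mapping g, one time step of a standard LSTM cell can be sketched in NumPy as below. The weight layout (four gate blocks stacked in one matrix), the dimensions, and the random untrained weights are assumptions for illustration; a trained estimator would additionally need an output layer that maps the final hidden state to the channel parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # W maps the concatenation [h_prev, x] to four stacked pre-activations.
    n_h = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n_h])               # forget gate: what to remove
    i = sigmoid(z[n_h:2 * n_h])         # input gate: what to write
    g = np.tanh(z[2 * n_h:3 * n_h])     # candidate cell content
    o = sigmoid(z[3 * n_h:4 * n_h])     # output gate: what to read
    c = f * c_prev + i * g              # updated cell state (memory)
    h = o * np.tanh(c)                  # updated hidden state (output)
    return h, c

rng = np.random.default_rng(0)
n_x, n_h, seq_len = 6, 8, 5             # hypothetical sizes
W = 0.1 * rng.standard_normal((4 * n_h, n_h + n_x))
b = np.zeros(4 * n_h)

# Feed a sequence of (hypothetical) pilot feature vectors y^1, ..., y^l.
h, c = np.zeros(n_h), np.zeros(n_h)
for y_t in rng.standard_normal((seq_len, n_x)):
    h, c = lstm_step(y_t, h, c, W, b)
```

The cell state `c` is what carries information from earlier coherence intervals forward, so a slowly varying AoA or delay can influence the current estimate without re-estimating it from scratch.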

In order to evaluate the efficacy of the LSTM-based parametric channel estimation technique, we test the normalized MSE (NMSE) as a function of SNR (see footnote 7). As shown in Fig. 8(b), the LSTM-based channel estimation technique outperforms the conventional channel estimation algorithms by a large margin since the trained DNN exploits the sparsity of the angular domain channel, but no such mechanism is used for the conventional LS and LMMSE techniques. For example, when SNR = 10 dB, the NMSE of the DL-based scheme is less than [TeX:] $$10^{-2}$$ while those of the LMMSE estimator and the LS estimator are [TeX:] $$10^{-1}$$ and 0.5, respectively. We also observe that the DL-based scheme outperforms the CS-based approach (i.e., 5 dB gain at SNR = 15 dB), since the channel parameters can be readily estimated by using the temporal correlation between the past and current channels.

7 In our experiments, we assume the scenario in which the user is moving along a line trajectory so that the temporal correlation between geometric parameters is reflected in the dataset.

C. CSI Feedback in mmWave/THz Massive MIMO System

As mentioned, to fully enjoy the benefit of massive MIMO systems, an acquisition of an accurate downlink CSI at the BS is crucial. To do so, the UE needs to estimate the downlink channels using the pilot signal (e.g., CSI reference signal (CSI-RS) in 5G) of each antenna and then feed them back in the form of implicit indices (e.g., PMI, RI, and CQI in 4G LTE and 5G NR) to the BS [34]. Since the required number of bits to convey these indices increases in proportion to the number of antennas, the CSI feedback overhead is a big concern for the massive MIMO systems.

Over the years, there have been many works to reduce the overhead of the CSI feedback by exploiting the spatio-temporal correlation of CSI [35]–[37]. For example, in [35], the correlated channel vectors are transformed into an uncorrelated sparse vector in some bases and then the CS technique is used to estimate the CSI from the small number of channel measurements. Also, in recent years, DL-based CSI feedback schemes have been proposed [38], which are stimulated by the outstanding performance of DL in the correlation extraction. The DL-based feedback scheme can be derived from the autoencoder (AE) architecture (see Fig. 9(a)). In the AE, the encoder is used to transform the raw channel matrices into a compressed codeword vector and the decoder reconstructs the original channel matrices from the codeword.

We consider the downlink MIMO-OFDM system where the BS is equipped with [TeX:] $$N_t$$ transmit antennas and the UE is equipped with [TeX:] $$N_r$$ receive antennas. Since each antenna port at the BS sends a CSI-RS for the channel measurement, [TeX:] $$N_t \times N_r$$ elements in the MIMO channel matrix [TeX:] $$\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$$ should be fed back to the BS. In the AE-based CSI feedback scheme, the encoder in the UE compresses the channel estimate [TeX:] $$\hat{\mathbf{H}}$$ into a low-dimensional feature vector and the decoder in the BS reconstructs the channel [TeX:] $$\hat{\mathbf{H}}$$ from the compressed vector. As an input to the encoder, the vectorized version of [TeX:] $$\hat{\mathbf{H}}$$ (the [TeX:] $$N_t N_r \times 1$$-dimensional vector [TeX:] $$\hat{\mathbf{h}}$$) is used. Then, we use multiple hidden layers to obtain the low-dimensional vector p, which can be expressed as

[TeX:] $$\mathbf{p}=f_e\left(\hat{\mathbf{h}} ; \Theta_e\right)$$

where [TeX:] $$f_e(\cdot)$$ is the encoder operation and [TeX:] $$\Theta_e$$ is the training parameter set of the encoder. In the decoder, to reconstruct [TeX:] $$\hat{\mathbf{h}}$$ using p, we design multiple FCN layers with BN and ReLU activation (see Fig. 9(a)).

Fig. 9. (a) AE architecture for efficient CSI feedback. (b) MSE between the genie channel H and the reconstructed channel estimate [TeX:] $$\hat{\mathbf{H}}_{r e c}$$ as a function of SNR. [TeX:] $$N_l$$ is the dimension of the latent vector p.

Finally, the reconstructed channel vector [TeX:] $$\hat{\mathbf{h}}_{r e c}$$ can be expressed as

[TeX:] $$\hat{\mathbf{h}}_{r e c}=f_d\left(f_e\left(\hat{\mathbf{h}} ; \Theta_e\right) ; \Theta_d\right)$$

where [TeX:] $$f_d(\cdot)$$ is the decoder operation and [TeX:] $$\Theta_d$$ is the training parameter set of the decoder. Considering that the training objective of the AE-based CSI feedback is to find out the channel closest to [TeX:] $$\hat{\mathbf{h}}$$, we use the MSE between the input channel and the reconstructed channel [TeX:] $$\hat{\mathbf{h}}_{r e c}$$ as a loss function:

[TeX:] $$J\left(\Theta_e, \Theta_d\right)=\frac{1}{\sqrt{N_t N_r}}\left\|\hat{\mathbf{h}}-\hat{\mathbf{h}}_{r e c}\right\|_2^2$$
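The encoder, decoder, and loss can be traced end to end with a very small forward-pass sketch. Everything specific here is an assumption: a single linear layer with ReLU stands in for the multi-layer encoder, a single linear layer stands in for the decoder (BN is omitted), the weights are random and untrained, and the complex channel is fed to the network as stacked real and imaginary parts.

```python
import numpy as np

rng = np.random.default_rng(0)
N_t, N_r, N_l = 8, 2, 4            # hypothetical antenna numbers, latent size

def to_real(h):
    # Stack real and imaginary parts so that real-valued layers can be used.
    return np.concatenate([h.real, h.imag])

# Random (untrained) stand-ins for the parameter sets Theta_e and Theta_d.
W_e = 0.1 * rng.standard_normal((N_l, 2 * N_t * N_r))
W_d = 0.1 * rng.standard_normal((2 * N_t * N_r, N_l))

def f_e(h_real):
    # Encoder at the UE: compress to the N_l-dimensional codeword p.
    return np.maximum(W_e @ h_real, 0.0)

def f_d(p):
    # Decoder at the BS: reconstruct the channel vector from p.
    return W_d @ p

H_hat = rng.standard_normal((N_r, N_t)) + 1j * rng.standard_normal((N_r, N_t))
h_hat = to_real(H_hat.reshape(-1))   # vectorized channel estimate
p = f_e(h_hat)                       # codeword that is fed back
h_rec = f_d(p)                       # reconstruction at the BS

# Loss with the same normalization as in the expression above.
loss = np.sum((h_hat - h_rec) ** 2) / np.sqrt(N_t * N_r)
```

Only the `N_l` entries of `p` need to be fed back instead of the full channel vector, so `N_l` directly sets the trade-off between feedback overhead and reconstruction quality.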

In Fig. 9(b), we evaluate the MSE between the reconstructed channel [TeX:] $$\hat{\mathbf{H}}_{r e c}$$ and the true channel [TeX:] $$\mathbf{H}$$ (i.e., [TeX:] $$\left\|\hat{\mathbf{H}}_{r e c}-\mathbf{H}\right\|_2^2$$) of the AE-based CSI feedback scheme as a function of SNR. Note that the genie channel [TeX:] $$\mathbf{H}$$ is obtained by the geometric MIMO channel model in (17). As shown in Fig. 9(b), we observe that the reconstruction quality of the AE-based scheme improves gradually with the dimension [TeX:] $$N_l$$ of the latent vector p. This is mainly because the latent vector having a large dimension can capture the detailed channel features (e.g., geometric channel parameters of NLoS paths) as well as the core features (e.g., channel parameters of the dominating LoS path). For example, we observe that the AE-based CSI feedback technique achieves MSE = [TeX:] $$10^{-4}$$ at SNR = 15 dB when [TeX:] $$N_l$$ = 128, which is by no means possible in other cases even in the higher SNR regime.

Algorithm 1. Meta learning-based GAN training process

D. Meta Learning-based GAN for Real-time Training

In Section III.A, we discussed that the GAN technique can be used to collect realistic wireless channel samples. In reality, however, GAN is also a data-driven DL technique so that it requires considerable training samples and hence the practical benefit of this approach might be washed away when the training data is insufficient [39]. To overcome this shortcoming, one can consider meta learning, a technique to train a model using a variety of tasks and then solve a new task using only a small number of training samples [40]. In short, meta learning is a special training technique to obtain initialization parameters of a DNN from which one can easily and quickly learn the desired function with a small number of training samples.

To be specific, the overall procedure of the meta learning-based GAN training is as follows. First, we perform the meta learning to obtain the initialization parameters. We then update the network parameters of GAN to perform the fine-tuning of DNN such that the trained DNN generates channel samples for the desired wireless environments. In the meta learning phase, we extract the common features of multiple channel datasets, say M datasets {[TeX:] $$D_1, \cdots, D_M$$}, and then use them to obtain the network initialization parameters [TeX:] $$\theta$$ (see Fig. 10(a)):

[TeX:] $$\psi_{D_i, t}=\theta_{t-1}-\alpha \nabla_\theta \mathcal{L}_{D_i}\left(\theta_{t-1}\right)$$

[TeX:] $$\theta_t=\theta_{t-1}-\beta \nabla_\theta \sum_{i=1}^M \mathcal{L}_{D_i}\left(\psi_{D_i, t}\right)$$

where [TeX:] $$\theta_t$$ and [TeX:] $$\theta_{t-1}$$ are the parameters of GAN in the [TeX:] $$t$$-th step and the ([TeX:] $$t$$ − 1)-th step, respectively. Also, [TeX:] $$\psi_{D_i, t}$$ is the parameter associated with dataset [TeX:] $$D_i$$ in the [TeX:] $$t$$-th step, [TeX:] $$\mathcal{L}_{D_i}$$ is the loss function of GAN for the [TeX:] $$i$$-th dataset [TeX:] $$D_i$$, and α and β are the step sizes for the parameter update (see Algorithm 1). Next, in the parameter update phase, we use [TeX:] $$\theta$$ as the initialization parameters and then train the DNN to generate samples close to the desired channel dataset, say [TeX:] $$D_{M+1}$$. In a nutshell, all that is needed is to learn the distinct features (of [TeX:] $$D_{M+1}$$) that were not extracted by the meta learning.
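The two-level update above can be followed on a toy problem in which the GAN loss of each dataset is replaced by a simple quadratic, so that all gradients are available in closed form. This substitution, the dimensions, and the step sizes are assumptions made purely for illustration; in an actual GAN the gradients would be obtained by backpropagation.

```python
import numpy as np

def loss_grad(theta, c):
    # Gradient of the toy loss L_{D_i}(theta) = 0.5 * ||theta - c_i||^2
    return theta - c

rng = np.random.default_rng(0)
M, dim = 4, 3
centers = rng.standard_normal((M, dim))   # hypothetical optimum of each D_i
alpha, beta = 0.1, 0.05                   # inner and outer step sizes
theta = np.zeros(dim)

for t in range(200):
    meta_grad = np.zeros(dim)
    for c in centers:
        # Inner update: psi_{D_i,t} = theta_{t-1} - alpha * grad L_{D_i}(theta_{t-1})
        psi = theta - alpha * loss_grad(theta, c)
        # Gradient of L_{D_i}(psi) with respect to theta; for this quadratic
        # loss the chain rule gives d(psi)/d(theta) = (1 - alpha) * I.
        meta_grad += (1.0 - alpha) * loss_grad(psi, c)
    # Outer update: theta_t = theta_{t-1} - beta * grad sum_i L_{D_i}(psi_{D_i,t})
    theta = theta - beta * meta_grad

# Fine-tuning on a new dataset D_{M+1} starts from the meta-learned theta.
c_new = centers.mean(axis=0) + 0.1
theta_ft = theta - alpha * loss_grad(theta, c_new)
```

In this toy setting the meta-learned initialization converges to the point that is, on average, closest to all task optima, so a new task with a nearby optimum needs only a few update steps.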

In Fig. 10(b), we test the MSE performances of the DL-based channel estimator trained by three different datasets: the real-measured dataset [41], the dataset generated from the vanilla GAN, and the dataset generated from the GAN trained by meta learning (see footnote 8). We observe that the channel estimation performance of the meta learning-based GAN is slightly worse than that using the real samples (e.g., 2 dB loss at MSE = [TeX:] $$10^{-2}$$). In contrast, the performance of the conventional GAN-based approach is much worse (i.e., more than 6 dB loss at MSE = [TeX:] $$10^{-2}$$) since the number of training samples is not large enough to ensure the convergence of GAN and there is no mechanism to exploit the common features of diverse channel conditions.

8 Specifically, we have used 10,000 samples for the benchmark training, 4,000 samples for meta learning, and 800 samples for fine-tuning and the training of the conventional CGAN.


In this article, we discussed the DL-based channel estimation with emphasis on the design issues related to DL model selection, training set acquisition, and DNN architecture design. As the automated services and applications using machines, vehicles, and sensors proliferate, we expect that DL will be a more popular channel estimation paradigm in the 6G era. To deal with various frequency bands (sub-6 GHz/mmWave/THz), wireless resources (massive MIMO antennas, intelligent reflecting surface, relays), and geographical environments, we need to go beyond the state-of-the-art DL technique and consider more aggressive and advanced DL techniques. For example, when we try to train a DL model to estimate the desired wireless channel, transfer learning, an approach to use the pre-trained model for a similar task, can be employed. By recycling most of the parameters in the pre-trained model and then training only a small part of the parameters, the new model can learn the distinct features of the desired channel while reusing common features of the pre-trained channel environments. Another approach worth investigating is federated/split/distributed learning, a technique to learn the desired task by the cooperation of multiple decentralized devices or servers. Our hope is that this article will serve as a useful reference for the communication researchers who want to apply the DL technique in their wireless channel estimation application.


This work was partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00972, Development of Intelligent Wireless Access Technologies) and the ITRC (Information Technology Research Center) support program (IITP-2022-2017-0-01637) supervised by the IITP.


Wonjun Kim

Wonjun Kim (Member, IEEE) received the B.S. degree from the Department of Electrical and Electronic Engineering, Yonsei University, in 2016, and the Ph.D. degree in electrical engineering from Seoul National University in 2021. Since September 2021, he has been with Samsung Electronics and has been working on 6G systems. His current research interests include compressed sensing and deep learning techniques for the 6G wireless communications.


Yongjun Ahn

Yongjun Ahn (Member, IEEE) received the B.S. degree from the Department of Electrical and Computer Engineering, Seoul National University, South Korea, in 2018. He is currently pursuing the Ph.D. degree in Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University. His research interests include 6G wireless communications and deep learning techniques.


Jinhong Kim

Jinhong Kim (Member, IEEE) received the B.S. degree from the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, South Korea, in 2016. He is currently pursuing the Ph.D. degree in Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University. His research interests include sparse signal recovery and deep learning techniques for the 6G wireless communications.


Byonghyo Shim

Byonghyo Shim (Senior Member, IEEE) received the B.S. and M.S. degree in Control and Instru- mentation Engineering from Seoul National Uni- versity (SNU), Seoul, Korea, in 1995 and 1997, respectively, and the M.S. degree in Mathematics and the Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Ur- banaChampaign (UIUC), Urbana, in 2004 and 2005, respectively. From 1997 and 2000, he was with the Department of Electronics Engineering at the Korean Air Force Academy as an Officer (First Lieutenant) and an Academic Full-time Instructor. He also had a short time research position in Texas Instruments and Samsung Electronics in 2004 and 2019, respectively. From 2005 to 2007, he was with the Qualcomm Inc., San Diego, CA as a Staff Engineer working on CDMA systems. From 2007 to 2014, he was with the School of Information and Communication, Korea University, Seoul, Korea, as an Associate Professor. Since September 2014, he has been with the Dept. of Electrical and Computer Engineering, Seoul National University, where he is currently a Professor. His research interests include signal processing for wireless communications, statistical signal processing, machine learning, and information theory. Dr. Shim was the recipient of the M. E. Van Valkenburg Research Award from the ECE Department of the University of Illinois (2005), the Hadong Young Engineer Award from IEIE (2010), the Irwin Jacobs Award from Qualcomm and KICS (2016), the Shinyang Research Award from the Engineering College of SNU (2017), the Okawa Foundation Research Award (2020), and the IEEE COMSOC Asia Pacific Outstanding Paper Award (2021). 
He was a technical committee member of Signal Processing for Communications and Networking (SPCOM), and is currently serving as an associate editor of IEEE Transactions on Signal Processing (TSP), IEEE Transactions on Communications (TCOM), IEEE Transactions on Vehicular Technology (TVT), IEEE Wireless Communications Letters (WCL), and Journal of Communications and Networks (JCN), and as a guest editor of IEEE Journal on Selected Areas in Communications (location awareness for radios and networks).


  • 1 Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.doi:[[[10.1007/978-981-13-9113-2_16]]]
  • 2 K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE CVPR, pp. 770-778, 2016.doi:[[[10.1109/cvpr.2016.90]]]
  • 3 D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.doi:[[[10.1038/nature16961]]]
  • 4 I. Goodfellow, Y. Bengio, and A. Courville, "Deep learning," MIT Press, 2016.doi:[[[10.1002/9781119845041.ch7]]]
  • 5 H. Ju, S. Kim, Y. Kim, and B. Shim, "Energy-efficient ultra-dense network with deep reinforcement learning," IEEE Trans. Wireless Commun., vol. 21, no. 8, pp. 6539-6552, Aug. 2022.doi:[[[10.1109/twc.2022.3150425]]]
  • 6 K. Suh, S. Kim, Y. Ahn, S. Kim, H. Ju, and B. Shim, "Deep reinforcement learning-based network slicing for beyond 5G," IEEE Access, vol. 10, pp. 7384-7395, Jan. 2022.doi:[[[10.1109/access.2022.3141789]]]
  • 7 Y. Ahn and B. Shim, "Deep learning-based beamforming for intelligent reflecting surface-assisted mmWave systems," in Proc. ICTC, 2021.doi:[[[10.1109/ictc52510.2021.9621150]]]
  • 8 E. Vlachos, G. C. Alexandropoulos, and J. Thompson, "Massive MIMO channel estimation for millimeter wave systems via matrix completion," IEEE Signal Process. Lett., vol. 25, no. 11, pp. 1675-1679, Nov. 2018.doi:[[[10.1109/lsp.2018.2870533]]]
  • 9 M. S. Oh, S. Hosseinalipour, T. Kim, C. G. Brinton, and D. J. Love, "Channel estimation via successive denoising in MIMO-OFDM systems: A reinforcement learning approach," in Proc. IEEE ICC, 2021, pp. 1-6.doi:[[[10.1109/ICC42927.2021.9500671]]]
  • 10 X. Ma, Z. Gao, F. Gao, and M. Di Renzo, "Model-driven deep learning based channel estimation and feedback for millimeter-wave massive hybrid MIMO systems," IEEE J. Sel. Areas Commun., vol. 39, no. 8, pp. 2388-2406, 2021.doi:[[[10.1109/jsac.2021.3087269]]]
  • 11 S. Park, J. Choi, J. Seol, and B. Shim, "Expectation-maximization-based channel estimation for multiuser MIMO systems," IEEE Trans. Commun., vol. 65, no. 6, pp. 2397-2410, Jun. 2017.doi:[[[10.1109/tcomm.2017.2688447]]]
  • 12 3GPP TR 37.817, "Study on enhancement for Data Collection for NR and EN-DC (Release 17)," v1.1.0, 2022.custom:[[[-]]]
  • 13 3GPP TR 38.831, "User Equipment (UE) Radio Frequency (RF) requirements for Frequency Range 2 (FR2)," v16.1.0, 2021.custom:[[[-]]]
  • 14 J. Kim, Y. Ahn, S. Kim, and B. Shim, "Parametric sparse channel estimation using long short-term memory for mmWave massive MIMO systems," in Proc. IEEE ICC, 2022.doi:[[[10.1109/icc45855.2022.9838434]]]
  • 15 H. Tang, J. Wang, and L. He, "Off-grid sparse Bayesian learning based channel estimation for mmWave massive MIMO uplink," IEEE Wireless Commun. Lett., vol. 8, no. 1, pp. 45-48, Feb. 2019.doi:[[[10.1109/LWC.2018.2850900]]]
  • 16 Q. He, T. Q. Quek, Z. Chen, Q. Zhang, and S. Li, "Compressive channel estimation and multi-user detection in C-RAN with low-complexity methods," IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 3931-3944, 2018.doi:[[[10.1109/twc.2018.2818125]]]
  • 17 A. Manoj and A. P. Kannu, "Channel estimation strategies for multiuser mmWave systems," IEEE Trans. Commun., vol. 66, no. 11, pp. 5678-5690, Jul. 2018.doi:[[[10.1109/TCOMM.2018.2854188]]]
  • 18 J. W. Choi, B. Shim, Y. Ding, B. Rao, and D. I. Kim, "Compressed sensing for wireless communications: Useful tips and tricks," IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1527-1550, 2017.doi:[[[10.1109/comst.2017.2664421]]]
  • 19 I. Saffar, M. L. A. Morel, K. D. Singh, and C. Viho, "Semi-supervised deep learning-based methods for indoor outdoor detection," in Proc. ICC, 2019, pp. 1-7.doi:[[[10.1109/icc.2019.8761297]]]
  • 20 M. Alenezi, K. K. Chai, A. S. Alam, Y. Chen, and S. Jimaa, "Unsupervised learning clustering and dynamic transmission scheduling for efficient dense LoRaWAN networks," IEEE Access, vol. 8, pp. 191495-191509, Oct. 2020.doi:[[[10.1109/access.2020.3031974]]]
  • 21 3GPP TS 36.104, "Evolved Universal Terrestrial Radio Access (E-UTRA); Base Station (BS) Radio Transmission and Reception," 3rd Generation Partnership Project; Technical Specification Group Radio Access Network.custom:[[[-]]]
  • 22 I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NIPS, 2014.doi:[[[10.1007/978-3-658-40442-0_9]]]
  • 23 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 2017, pp. 6000-6010.custom:[[[https://arxiv.org/abs/1706.03762]]]
  • 24 A. Rehman, S. U. Rehman, M. Khan, M. Alazab, and T. Reddy, "CANintelliIDS: detecting in-vehicle intrusion attacks on a controller area network using CNN and attention-based GRU," IEEE Trans. Netw. Sci. Eng., vol. 8, no. 2, pp. 1456-1466, Apr.-Jun. 2021.doi:[[[10.1109/tnse.2021.3059881]]]
  • 25 M. Li, Y. Wang, Z. Wang, and H. Zheng, "A deep learning method based on an attention mechanism for wireless network traffic prediction," Ad Hoc Networks, vol. 107, no. 1, Oct. 2020.doi:[[[10.1016/j.adhoc.2020.102258]]]
  • 26 Y. Ahn, W. Kim, and B. Shim, "Active user detection and channel estimation for massive machine-type communication: Deep learning approach," IEEE Internet Things J., vol. 9, no. 14, pp. 11904-11917, Jul. 2021.doi:[[[10.1109/jiot.2021.3132329]]]
  • 27 W. Kim, Y. Ahn, and B. Shim, "Deep neural network-based active user detection for grant-free NOMA systems," IEEE Trans. Commun., vol. 68, no. 4, pp. 2143-2155, 2020.doi:[[[10.1109/tcomm.2020.2969184]]]
  • 28 N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.custom:[[[https://dl.acm.org/doi/10.5555/2627435.2670313]]]
  • 29 G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.custom:[[[https://arxiv.org/abs/1503.02531]]]
  • 30 S. Rangan, T. S. Rappaport, and E. Erkip, "Millimeter-wave cellular wireless networks: Potentials and challenges," Proc. IEEE, vol. 102, no. 3, pp. 366-385, 2014.doi:[[[10.1109/jproc.2014.2299397]]]
  • 31 H. Ji, H. Yang, H. Noh, J. Yeo, Y. Kim, and J. Lee, "Compressed channel estimation for point-to-point millimeter-wave communications," in Proc. IEEE Globecom Wkshps, 2019, pp. 1-5.doi:[[[10.1109/gcwkshps45667.2019.9024523]]]
  • 32 J. Park, S. Kim, J. Moon, and B. Shim, "Fast terahertz beam training via frequency-dependent precoding," in Proc. IEEE ICC, 2022.doi:[[[10.1109/iccworkshops53468.2022.9814478]]]
  • 33 K. Venugopal, A. Alkhateeb, N. G. Prelcic, and R. W. Heath, "Channel estimation for hybrid architecture based wideband millimeter wave systems," IEEE J. Sel. Areas Commun., vol. 35, no. 9, pp. 1996-2009, 2017.doi:[[[10.1109/JSAC.2017.2720856]]]
  • 34 3GPP TS 38.214, "Physical layer procedures for data," v17.0.0, 2022.custom:[[[https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3216]]]
  • 35 S. Kim, J. W. Choi, and B. Shim, "Downlink pilot precoding and compressed channel feedback for FDD-based cell-free systems," IEEE Trans. Wireless Commun., vol. 19, no. 6, pp. 3658-3672, 2020.doi:[[[10.1109/twc.2020.2974838]]]
  • 36 B. Lee, J. Choi, J.-Y. Seol, D. J. Love, and B. Shim, "Antenna grouping based feedback compression for FDD-based massive MIMO systems," IEEE Trans. Commun., vol. 63, no. 9, pp. 3261-3274, 2015.doi:[[[10.1109/tcomm.2015.2460743]]]
  • 37 S. Kim and B. Shim, "Energy-efficient millimeter-wave cell-free systems under limited feedback," IEEE Trans. Commun., vol. 69, no. 6, pp. 4067-4082, Jun. 2021.doi:[[[10.1109/tcomm.2021.3059877]]]
  • 38 C.-K. Wen, W.-T. Shih, and S. Jin, "Deep learning for massive MIMO CSI feedback," IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 748-751, Oct. 2018.doi:[[[10.1109/lwc.2018.2818160]]]
  • 39 J. Kim, Y. Ahn, and B. Shim, "Massive data generation for deep learning-aided wireless systems using meta learning and generative adversarial network," to appear in IEEE Trans. Veh. Technol.custom:[[[https://arxiv.org/abs/2208.11910]]]
  • 40 C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proc. Int. Conf. Mach. Learn. (ICML). PMLR, Aug. 2017, pp. 1126-1135.custom:[[[https://arxiv.org/abs/1703.03400]]]
  • 41 E. Everett, C. Shepard, L. Zhong, and A. Sabharwal, "Softnull: Many-antenna full-duplex wireless via digital beamforming," IEEE Trans. Wireless Commun., vol. 15, no. 12, pp. 8077-8092, Dec. 2016.doi:[[[10.1109/TWC.2016.2612625]]]