## Dohyun Kwon, Joongheon Kim, David A. Mohaisen, and Wonjun Lee

| Parameter | Description |
|---|---|
| $N$ | The number of vehicles |
| $K$ | The number of mBSs |
| $\mathcal{X}$ | The set of mBSs |
| $\mathcal{U}$ | The set of vehicles |
| $\mathcal{C}$ | The set of mBS caches |
| $\mathcal{B}$ | The set of vehicular buffers |
| $\mathcal{M}$ | The macro base station |
| $\mathcal{V}$ | The vector of vehicle velocities |
| $\bar{c}$ | The upper bound (UB) of mBS cache storage |
| $\bar{b}$ | The UB of the vehicular buffer |
| $\bar{m}$ | The UB of cached video at $x_{j}$ for $u_{i}$ |
| $\varrho$ | The vector of FSMC transition probabilities |
| $R$ | The total reward of the IoV network |
| $\gamma$ | The learning rate of $\mathcal{M}$ |
| $C_{N \times K}$ | The state of mBS cache |
| $B_{N \times K}$ | The state of vehicular buffer |
| $H_{N \times K}$ | The state of average quality history |
| $P_{N \times K}$ | The state of vehicle position |
| $V_{N \times K}$ | The action of mBS power allocation |
| $L_{N \times K}$ | The action of proactive cache allocation |

There exists a set of video caches $\mathcal{C}=\{c_{0}, c_{1}, \cdots, c_{j}, \cdots, c_{K-1}\}$ on the highway, and each $c_{j}$ is equipped with mBS $x_{j}$ such that $j \in [0, K)$. Each video cache $c_{j}$ of $x_{j}$ stores video chunks for vehicles. Suppose that $c_{j}$ is requested to provide video chunks by $u_{i}$, where $p_{i,j} = -1$; then $x_{j}$ immediately provides the cached video chunks or requests them from the media server. In addition, the following mBS, denoted by $x_{j+1}$, notices the request of the vehicle and proactively allocates cache size and contents from the media server to prepare for the position transition of $u_{i}$. The spatial upper bound of $c_{j}$ is denoted by $\bar{c}$, and the video contents cached in $c_{j}$ are transmitted toward the associated set of vehicles $u_{y,j}$, such that $y \in [0, N)$. We assume that the capacity of the mmWave link between $u_{i}$ and $x_{j}$ is sufficient, so that $x_{j}$ can provision the entire video chunks toward the set of vehicles associated with $x_{j}$. Moreover, the video buffer set $\mathcal{B}=\{b_{0}, b_{1}, \cdots, b_{i}, \cdots, b_{N-1}\}$ represents the set of buffers equipped in each vehicle. The buffer $b_{i}$ is mounted on $u_{i}$, and the spatial upper bound of $b_{i}$ and the buffer playback rate are denoted by $\bar{b}$ and $\mathcal{F}$, respectively. Meanwhile, there is a set of video qualities, denoted by $\mathcal{Q}$, and it is assumed that each $u_{i}$ can be served with each quality of video chunk in $\mathcal{Q}$. For example, $\mathcal{Q}$ can be defined as $\mathcal{Q}=[360\mathrm{p}, 480\mathrm{p}, 720\mathrm{p}, 1080\mathrm{p}, 4\mathrm{K}]$. Each quality level in $\mathcal{Q}$ requires an average link capacity of 1, 3, 5, 8, and 40 Mbps, respectively, in ascending order, which determines the QoS of $u_{i}$ associated with $x_{j}$. Per the aforementioned assumption regarding video contents, the unit size of a chunk is determined by the quality of the video.
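As an illustration, the quality-to-capacity mapping above can be expressed as a small lookup. This is a sketch; the dictionary and the `best_quality` helper are our own names, not from the paper.

```python
# Average required link capacity (Mbps) for each quality level in Q,
# per the mapping given in the text (1, 3, 5, 8, 40 Mbps, ascending).
QUALITY_MBPS = {"360p": 1, "480p": 3, "720p": 5, "1080p": 8, "4K": 40}

def best_quality(link_mbps):
    """Return the highest quality level whose required capacity fits the link."""
    feasible = [(mbps, q) for q, mbps in QUALITY_MBPS.items() if mbps <= link_mbps]
    return max(feasible)[1] if feasible else None
```

For example, a 6 Mbps link supports 720p but not 1080p, so `best_quality(6)` selects `"720p"`.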

The system can be represented in terms of reinforcement learning, where the agent in the system is $\mathcal{M}$. $\mathcal{M}$ controls the overall power and proactive cache allocation toward each $x_{j}$ on a highway from the remote media server $\mathcal{Z}$, with a specific power allocation level $v_{k}$ of the video and cache sizes $c_{i,j}$ and $c_{i,j+1}$ for $u_{i}$, where $k \in [0, 2]$, $i \in [0, N)$, and $j \in [0, K-1)$, respectively. In the following, the learning process of power allocation and proactive cache allocation toward mBSs is introduced in detail in terms of the state space, action space, reward, and algorithmic description.

The state space of the caching system consists of the following elements: the preloaded unit size of video for each $u_{i}$ across the entire mBS set $\mathcal{X}$, the buffer occupancy of each $u_{i}$, the average quality history of the provisioned video for each $u_{i}$, and the position of each vehicle $u_{i}$ over time. These elements are denoted by $C_{N \times K}$, $B_{N \times K}$, $H_{N \times K}$, and $P_{N \times K}$, respectively. The state of a position is represented as in (3), and the rest of the elements of the state space are represented as follows:

First, $c_{i,j}$ in (4) represents the cache occupancy of $x_{j}$, which is the preloaded unit size of the video for satisfying $u_{i}$'s request. The maximum storage size of each $c_{i,j}$ for $i \in [0, N)$ and $j \in [0, K)$ is limited to $\bar{m}$ to guarantee fair video transmission toward vehicles, and $\sum_{k} c_{k,j} \leqslant \bar{c}$, where $k \in [0, N)$ and $\bar{m} \times N \leqslant \bar{c}$.
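The storage constraints above can be checked with a short sketch; the function name and the flat-list representation of one mBS's cache column are illustrative assumptions, not from the paper.

```python
def cache_feasible(cache_col, m_bar, c_bar):
    """cache_col: list of c_{i,j} for all vehicles i at one mBS x_j.
    Checks c_{i,j} <= m_bar, sum_i c_{i,j} <= c_bar, and m_bar * N <= c_bar."""
    per_vehicle_ok = all(c <= m_bar for c in cache_col)  # fairness bound m_bar
    total_ok = sum(cache_col) <= c_bar                   # mBS storage bound c_bar
    sizing_ok = m_bar * len(cache_col) <= c_bar          # worst-case sizing
    return per_vehicle_ok and total_ok and sizing_ok
```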

Next, $b_{i,j}$ in (5) represents the buffer occupancy of $u_{i}$ associated with $x_{j}$. Each $b_{i,j}$ for $\forall j \in [0, K)$ has an upper bound of $\bar{b}$, and a packet drop can occur when $b_{i,j} + c_{i,j} - \mathcal{F} \geqslant \bar{b}$. Moreover, the video playback service can be stalled if $b_{i,j} + c_{i,j} - \mathcal{F} \leqslant 0$.
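A minimal sketch of this buffer dynamic, under the assumption that the buffer drains by $\mathcal{F}$ units per playback step and is clipped at its physical bounds:

```python
def step_buffer(b, c, F, b_bar):
    """One playback step: buffer b_{i,j} gains cached chunks c_{i,j} and
    drains F. Flags the drop (overflow) and stall (underflow) conditions
    described in the text, then clips the buffer to [0, b_bar]."""
    level = b + c - F
    dropped = level >= b_bar   # b_{i,j} + c_{i,j} - F >= b_bar: packet drop
    stalled = level <= 0       # b_{i,j} + c_{i,j} - F <= 0: playback stall
    new_b = min(max(level, 0), b_bar)
    return new_b, dropped, stalled
```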

Finally, the average quality state of the provisioned video at $u_{i,j}$ can be calculated as the cumulative average quality of the provisioned video history along the trajectory of $u_{i}$ from $x_{0}$ to $x_{j}$. The average quality state of $u_{i,j}$ is denoted by $h_{i,j}$ and is utilized for learning the policy of $\mathcal{M}$, which aims to provision an enhanced quality of the video toward $u_{i}$. Supposing that $\mathcal{S}_{i,j}$ represents the sojourn time step of $u_{i}$ associated with $x_{j}$, $h_{i,j}$ can be calculated as follows:

Moreover, $h_{i,j} = 0$ when $p_{i,j} = 1$ and $j = 0$; i.e., we only consider the history of quality provisioned at $u_{i}$ which has driving experience on the highway. The $q_{i}^{k,t}$ in (7) represents the $t$th quality index of video chunks provisioned at $u_{i}$ associated with $x_{k}$.
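The cumulative average described above amounts to a running mean over the provisioned quality indices $q_{i}^{k,t}$. A minimal sketch, where representing the history as one flat list of quality indices is our assumption:

```python
def avg_quality_history(quality_trace):
    """quality_trace: quality indices q_i^{k,t} provisioned to u_i along its
    trajectory from x_0 to the current mBS. h_{i,j} is their running mean;
    an empty trace (vehicle not yet on the highway) gives h = 0."""
    if not quality_trace:
        return 0.0
    return sum(quality_trace) / len(quality_trace)
```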

$\mathcal{M}$ can learn the optimal action, which proactively requests the optimal power and cache allocation toward $x_{j}$ and $x_{j+1}$ (i.e., the vicinity of $u_{i}$) from $\mathcal{Z}$ for seamless video retrieval, given that the state of the mmWave IoV networks can be observed. Here, the action space of $\mathcal{M}$ consists of $V_{N \times K}$ and $L_{N \times K}$, which represent the power allocation matrix and the cache allocation matrix, respectively, and they can be denoted as follows:

First, $v_{i,j}$ of $V_{N \times K}$ represents the amount of power allocated to the mBS, where $\mathcal{M}$ requests a specific quality of video with respect to the power $v_{i,j}$ from $\mathcal{Z}$ to serve video at $x_{j}$ for $u_{i}$. In addition, $l_{i,j}$ of $L_{N \times K}$ stands for the size of the cache allocated at $x_{j}$ for $u_{i}$ by $\mathcal{M}$. $\mathcal{M}$ requests cache size at up to two neighboring mBSs to accomplish two missions: (i) securing the seamless current video provisioning service at $x_{j}$, and (ii) preemptive cache allocation at $x_{j+1}$ for enabling seamless services, where $j \in [0, K-1)$. For example, if $u_{i}$ on a highway is associated with $x_{j}$, $\mathcal{M}$ allocates cache storage with unit sizes of $l_{i,j}$ and $l_{i,j+1}$ at $x_{j}$ and $x_{j+1}$, respectively, for supporting the current video service for $u_{i}$ and preemptive video caching for the handoff of $u_{i}$. When $u_{i}$ is not yet on the highway, i.e., $p_{i,j} = 1$ and $j = 0$, $\mathcal{M}$ only allocates cache size toward $x_{0}$ for $u_{i}$ for proactive video caching. If $u_{i}$ is associated with $x_{K-1}$, $\mathcal{M}$ requests $x_{K-1}$ to allocate a cache size of $l_{i,K-1}$ to satisfy the current video service of $u_{i}$.
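The three allocation cases above can be sketched as follows. The function and argument names are illustrative; in particular, using `l_next` for the pre-staged size at $x_0$ before the vehicle enters the highway is our assumption.

```python
def proactive_allocation(j, l_cur, l_next, K, on_highway=True):
    """Return {mBS index: allocated cache units} for vehicle u_i associated
    with x_j, mirroring the three cases described in the text."""
    if not on_highway:               # p_{i,j} = 1 and j = 0: pre-stage at x_0 only
        return {0: l_next}
    if j == K - 1:                   # last mBS x_{K-1}: no successor to pre-cache
        return {j: l_cur}
    return {j: l_cur, j + 1: l_next}  # serve now at x_j, pre-cache at x_{j+1}
```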

$\mathcal{M}$ learns the proactive caching policy and accomplishes power and preemptive cache allocation toward mBSs for seamless video retrieval by utilizing the proposed DDPG based algorithm, as shown in Algorithm 1. The overall caching policy learning procedure is as follows. First, the parameters of the actor and critic networks, which generate and evaluate the actions of $\mathcal{M}$, are initialized (line 1). Then, the target networks for both the actor and critic networks, $\mathcal{Q}'$ and $\mathcal{A}'$, are initialized with the original networks' parameters (line 2). By iterating over each episode, $\mathcal{M}$ repeats the following procedures to learn the optimal power-cache aware caching policy:

i) For every episode, the transition pairs, consisting of an arbitrarily generated set of states $s$ of size $\varphi$, the corresponding actions generated by the actor network with input $s$, the reward value for $s$ and $a$, and the next state $s'$, are paired and stored in the replay buffer $\Phi$ (lines 5-7).

ii) After $\Phi$ is fully populated, a minibatch of transitions is randomly sampled from the replay buffer. Then, the $i$th transition pair of the minibatch is utilized for calculating the difference between the target value $y_{i}$ and the model value $Q(s_{i}, a_{i} \mid \theta^{Q})$, which is used to update the critic network with the gradients obtained from the difference. In addition, the stochastic policy gradient is utilized to update the parameters of the actor network as per line 15.

iii) Overall, the updated parameters of the critic and actor networks are utilized to update the target parameters of $\mathcal{Q}'$ and $\mathcal{A}'$ with a soft update weight for efficient and stable learning (i.e., (10) and (11)). Note that the sampled minibatch is refreshed with other trained transition pairs for a better learning procedure after it is sampled. The computational complexity of this algorithm depends on the stochastic policy gradient method for minimizing the loss; our proposed algorithm does not exceed the complexity of the stochastic policy gradient.
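The soft target update in step iii), i.e., (10) and (11), can be sketched as follows, under the simplifying assumption that network parameters are flattened into plain lists of floats; `tau` is the soft-update weight.

```python
def soft_update(target, online, tau):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter,
    so the target network tracks the online network slowly and stably."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]
```

A small `tau` (e.g. 0.001 in the original DDPG paper) keeps the target values nearly stationary between updates, which is what stabilizes the critic's bootstrapped targets.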

A comprehensive revenue of $\mathcal{M}$ is used as our caching scheme's reward. $\mathcal{M}$ takes the composite action $\{V_{N \times K}, L_{N \times K}\}$ when it observes the state of the mmWave IoV networks as $\{C_{N \times K}, B_{N \times K}, H_{N \times K}, P_{N \times K}\}$, and gets the next state of the IoV networks $\{C'_{N \times K}, B'_{N \times K}, H'_{N \times K}, P'_{N \times K}\}$ and the weighted average reward sum $R$ of the considered system reward, including (i) quality variation, (ii) packet drop occurrence, and (iii) playback stall. These sub-rewards are denoted as $r^{q}$, $r^{p}$, and $r^{s}$, respectively. The total reward $R$ of each episode is calculated as the average of the rewards of the transitions sampled from the replay buffer. Note that the sub-rewards are calculated in a vehicle-by-vehicle manner, and each of them is summed for each reward domain as $R^{q}$, $R^{p}$, and $R^{s}$, respectively. That is, the episode reward $R$ is equal to $R^{q} + R^{p} + R^{s}$.

First, the quality reward $r^{q}$ is determined by the action taken by $\mathcal{M}$, which is $V_{N \times K}$. The IoV network environment $e$, which interacts with $\mathcal{M}$, calculates $r^{q}$ by comparing $H'_{N \times K}$ and $H_{N \times K}$. Then $e$ compares the cumulative average quality of video between them and gives a weighted reward to $\mathcal{M}$ if the expected quality corresponding to the allocated power $v_{i,j}$ yields an enhancement of the provisioned video quality, and vice versa. $\mathcal{M}$ receives a reward if the action for $u_{i}$ results in a higher video quality than its previous average video quality. If not, $\mathcal{M}$ gets a negative reward $r^{q}$ as a penalty, which represents the degradation of the QoS of $u_{i}$. Specifically, the quality of video is determined by the data rate, which can be calculated by:

The $g_{i,j}$ in (12) represents the power gain from the $j$th mBS to the $i$th user. In addition, $g_{i,j}^{TX}$ and $g_{i,j}^{RX}$ stand for the transmit antenna gain and receive antenna gain from the $j$th mBS to the $i$th user. Moreover, (12) also involves the carrier wavelength and the path-loss exponent, while $d_{i,j}$ and $d_{0}$ represent the distance from the $j$th mBS to the $i$th user and the far-field reference distance, respectively. The $v_{i,j}$ in (13) represents the transmit power from the $j$th mBS to the $i$th user, and $\sigma^{2}$ is the variance of the additive white Gaussian noise (AWGN). According to Shannon's capacity formula, the achievable rate for the $i$th user from the $j$th mBS is given by (14). $W$ stands for the system bandwidth, and $K_{j}$ is the total number of users associated with the $j$th mBS; thus, each user can utilize $1/K_{j}$ of the total frequency bandwidth of each mBS. Based on $a_{i,j}$, each user can receive the corresponding quality of video chunks from the associated mBS.
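Under these definitions, the per-user achievable rate (14) can be sketched as follows. Interference terms are omitted here, so this is an SNR-only simplification; the argument names follow the symbols in the text.

```python
import math

def achievable_rate(v, g, sigma2, W, K_j):
    """Shannon-style rate: each of the K_j users associated with x_j gets
    W / K_j of the bandwidth; v is the transmit power v_{i,j}, g the overall
    link gain g_{i,j}, and sigma2 the AWGN variance."""
    snr = v * g / sigma2
    return (W / K_j) * math.log2(1.0 + snr)
```

For instance, with unit power, unit gain, unit noise variance, $W = 10$, and $K_j = 2$, each user sees $5 \cdot \log_2 2 = 5$ rate units.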

Next, $r^{p}$ is calculated by observing the current buffer occupancy of each $u_{i}$, the allocation action $L_{N \times K}$ of $\mathcal{M}$, and the buffer saturation rate $\mathcal{F}$. If the difference between the sum of the buffer occupancy of $u_{i}$ with cache $x_{j}$ and $\mathcal{F}$ exceeds $\bar{b}$, then a packet drop occurs at $u_{i}$ during the video provisioning service. That is, $\mathcal{M}$ is punished by receiving a negative reward $r^{p}$, because the action of $\mathcal{M}$ caused spectrum waste and power consumption at the corresponding mBS. By exploiting this reward structure, the mBS learns to cache the video chunks in a way that the computational overhead and communication loss derived from unnecessary delivery service are avoided.

Finally, $r^{s}$ can be computed by subtracting $\mathcal{F}$ from $b_{i,j}$ and adding $c_{i,j}$. If the result is positive, $u_{i}$ can play back the provisioned video chunks without any stall. On the other hand, if the result is less than zero, the video playback at $u_{i}$ can stall, which results in a deteriorating QoS for $u_{i}$. Therefore, we define $R$ as:

where $\psi$, $\Xi$, and $\aleph$ represent the reward weights of $r^{q}$, $r^{p}$, and $r^{s}$, respectively. The $(a/b)^{+}$ function returns 1 if $a < b$ and 0 otherwise. Moreover, $(a/b)^{-}$ is a function that returns 1 if $a > b$ and 0 otherwise. The $(1/c)^{\triangle}$ function returns 0 if $c \rightarrow -\infty$ and 1 otherwise, while its complementary function returns 1 if $c \rightarrow -\infty$ and 0 otherwise. By utilizing these functions, (15) represents the total system reward calculation with respect to quality variation, packet drop occurrence, and playback stall.
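The indicator functions and the weighted combination of sub-rewards can be sketched as follows. Passing the vehicle-wise sub-rewards as plain lists and the helper names are our illustrative choices.

```python
def ind_plus(a, b):
    """(a/b)^+ : 1 if a < b, 0 otherwise."""
    return 1 if a < b else 0

def ind_minus(a, b):
    """(a/b)^- : 1 if a > b, 0 otherwise."""
    return 1 if a > b else 0

def total_reward(r_q, r_p, r_s, psi, xi, aleph):
    """Weighted total reward per (15): vehicle-wise sub-rewards r^q, r^p, r^s
    are summed per domain and scaled by psi, Xi, and aleph, respectively."""
    return psi * sum(r_q) + xi * sum(r_p) + aleph * sum(r_s)
```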

We performed various simulations to verify the performance of our proposed power-cache aware caching scheme in mmWave based IoV networks. The caching scheme is evaluated with these simulations by measuring the corresponding results with respect to the aforementioned rewards of interest, given that the state of the IoV network is observable by the agent. We leveraged TensorFlow [39] in our simulations to implement our proposed DDPG based caching scheme. We first present the simulation settings and then discuss the results. As much of the reinforcement learning literature presents results based on empirical convergence without complexity analysis [40]-[44], we present simulation results based on this approach.

In the following, we elaborate on the implementation details of the proposed DDPG learning based video caching scheme in the mmWave IoV network. First, we introduce the hardware configuration for our simulation, and then show the overall design and implementation details of the software.

Hardware. For hardware, we used an NVIDIA DGX Station equipped with 4 × Tesla V100 GPUs (128 GB of total GPU memory) and a 20-core Intel Xeon E5-2698 v4 2.2 GHz CPU (256 GB of system memory).

Software. We used Python 3.6 on Ubuntu 16.04 LTS to build the DDPG based caching scheme. In addition, we used the Xavier initializer to avoid vanishing gradients during the learning phase. The neural network is constructed as a fully connected deep neural network, and the number of nodes in the hidden layer was 200.
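For reference, the Xavier (Glorot) uniform initialization mentioned above can be sketched in NumPy as follows. The input dimension of 64 is an arbitrary placeholder, not a value from the paper.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, seed=0):
    """Xavier/Glorot uniform init: W ~ U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out)), which keeps activation variance
    roughly constant across layers and curbs vanishing gradients."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Weight matrix feeding the 200-node hidden layer described in the text.
W1 = xavier_uniform(64, 200)
```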

Table 2. Simulation parameters.

| Parameter | Value |
|---|---|
| Total episodes $\mathcal{E}$ | 500 |
| Time steps $T$ | 100 |
| Minibatch size | 64 |
| Discount factor | 0.95 |
| Initial epsilon | 0.9 |
| Size of $\mathcal{D}$ | 1000 |
| Optimizer | Adam |
| Activation function | ReLU |

We implemented both the DDPG based caching algorithm and the customized mmWave IoV networks in a highway scenario. The agent in the DDPG based caching algorithm continuously interacts with the dynamic IoV network environment and attains pairs of state transitions. In turn, the optimal caching policy can be acquired with the policy gradient after the learning phase has converged. The simulation parameters are summarized in Table 2.

First, the caching scheme is evaluated with three different values of the learning rate $\gamma$. Figs. 3 to 5 show the convergence tendency of the learning phase throughout the episodes. Note that Figs. 3 to 5 share the same simulation setting of $(K, N) = (20, 200)$ and $\varrho = 0.143$ with different learning rates. For each learning rate, the learning tendency of each reward is shown. For example, in Fig. 4, the impact of each reward category can be obtained from the gap between the measured values of the mixed rewards. As the green-lined graph, which represents the reward value without (w.o.) the packet drop occurrence reward, grows higher, it can be concluded that $\mathcal{M}$ finds an optimal policy which weighs the playback stall and the quality of the provisioned video more heavily than the packet drop for maximizing the QoS. Similarly, the red-lined graph in Fig. 4 decreases and converges at a specific value. This means that the total reward value of the caching scheme is underestimated without the quality reward value, i.e., the importance of the quality reward in the learning phase is not negligible.

When $\gamma = 10^{-4}$, an interesting learning tendency can be observed in Fig. 5. While the red-lined graph in Fig. 5 hardly changes over the entire learning phase, the other graphs increase dramatically and finally converge at the optimal point. That is, $\mathcal{M}$ learns the caching policy so as to maximize the quality reward over the other criteria. The red-lined graph, which is the mixed reward value consisting of the packet drop occurrence reward and the playback stall reward, does not change, while the other two graphs increase sharply, indicating that the quality reward increases dramatically.

In summary, the total reward is illustrated in Fig. 6. Throughout the learning phase, $\mathcal{M}$ with different learning rates learns its caching policy, and the policy can be evaluated by the system reward criteria mentioned earlier. In the cases of $\gamma = 10^{-3}$ and $\gamma = 5 \times 10^{-4}$, $\mathcal{M}$ obtains a converged caching policy around the 100th episode. However, with a smaller $\gamma$, $\mathcal{M}$ optimizes its policy later. Therefore, our proposed power-cache aware video caching scheme accomplishes stable and optimal video provisioning service toward vehicles in mmWave based distributed IoV networks.

Once the optimal caching policy is attained, $\mathcal{M}$ can immediately allocate power and cache units toward the distributed mBSs as soon as the system state is observed by $\mathcal{M}$, and thus the caching scheme maximizes the QoS of the users. This policy differs from a classical caching scheme's policy, which needs to recalculate the optimal caching strategy for each observation of the IoV networks over time. Thus, the proposed caching scheme is well suited for the optimal power and cache allocation of mBSs to provision superior quality and playback experience while maintaining seamless service.

In the following, we argue for the importance of scalability in IoV networks. As the scale of the considered IoV networks grows, calibrating the optimal caching policy for seamless video services is hard to accomplish with classical approaches. Moreover, when the number of objectives to optimize becomes larger, calculating the optimal point for seamless video services becomes even harder.

Fig. 7 illustrates the convergence tendency of the total reward value throughout the learning phase. Note that the total reward of each case is proportional to the scale of the IoV networks. In addition, $\varrho$ of the FSMC model is set to 0.186, where the average velocity of vehicles in the system is 100 km/h, and the learning rate was set to $10^{-3}$. The action space of Figs. 3 to 6 was $4000 = 20 \times 200$, whereas the IoV networks in Fig. 7 have action spaces of 5000, 7500, and 20000. That is, the robustness of the proposed caching scheme with respect to scalability is validated through the simulation in Fig. 7. Each scale of IoV networks in Fig. 7 showed converged performance for provisioning optimal video quality and mitigating the playback stall phenomenon by learning power and cache allocation toward mBSs.

Next, the learning tendencies of the average quality level with respect to the controlled transmit power of the mBSs and the unit size of the mBS cache at various scales of IoV networks are presented in Figs. 8 and 9. For the power control aspect, $\mathcal{M}$ with scales of $(K, N) = (20, 250)$ and $(K, N) = (25, 300)$ learns optimal power allocation toward the roadside mBSs, which results in a sufficient data rate toward users so that the maximum video quality (i.e., 4K resolution) can be provisioned. Meanwhile, when the scale of the IoV networks is $(K, N) = (40, 500)$, which is 5× denser compared to the setting of Figs. 3 to 5, $\mathcal{M}$ learned to allocate power corresponding to a 720p video resolution toward users with limited spectrum availability.

Finally, $\mathcal{M}$ learns to allocate cache size toward mBSs for supporting seamless video retrieval at neighboring users. As in Fig. 9, $\mathcal{M}$ with scales of $(K, N) = (20, 250)$ and $(K, N) = (25, 300)$ learns to allocate a smaller cache size than at the scale of $(K, N) = (40, 500)$. That is, $\mathcal{M}$ at a larger scale learns a caching policy with a low power utilization strategy; however, it stabilizes the distributed IoV networks with a more generous cache size for each user, so that the playback stall problem at the user can be mitigated. Conversely, at smaller scales, $\mathcal{M}$ aims to learn the caching scheme that achieves a maximized average quality level of the provisioned video (i.e., higher power allocation of the mBSs). Therefore, the proposed power-cache aware video caching scheme in distributed mmWave IoV networks learns the optimal caching policy, which accomplishes optimal power and cache allocation toward mBSs and attains stabilized performance even for an enlarged scale of IoV networks.

We proposed a deep reinforcement learning based video caching scheme in mmWave IoV networks to optimize the power consumption and cache allocation of mBSs with a minimum number of stall events for seamless services. With our proposed caching scheme, stabilized and optimized caching decisions in large-scale distributed IoV networks can be achieved as soon as the system state is observed. Through an extensive set of simulations, the proposed caching scheme is shown to be appropriate for learning a massive action space and achieves stabilized learning performance, even when the scale of the considered distributed IoV networks is enlarged.

As future work, a real-world implementation and its corresponding prototype-based performance evaluation will be considered. Furthermore, additional performance evaluations comparing against other reinforcement learning algorithms will be conducted intensively. Lastly, the extension of our work with multi-agent deep reinforcement learning algorithms is worth considering in order to build scalable large-scale systems with multiple distributed base stations. To guarantee convergence in multi-agent deep reinforcement learning, we need more sophisticated and well-designed reward functions and action spaces.

This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00170, Virtual Presence in Moving Objects through 5G) and also by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-20202017-0-01637) supervised by the IITP (Institute for Information & Communications Technology Promotion). J. Kim, A. Mohaisen, and W. Lee are the corresponding authors of this paper.

Dohyun Kwon is currently a Research Engineer at Hyundai-Autoever, Seoul, Republic of Korea. He received his B.S. and M.S. degrees in Computer Science and Engineering from Chung-Ang University, Seoul, Republic of Korea, in 2018 and 2020, respectively. His research focus includes deep reinforcement learning for mobile networks.

Joongheon Kim (M'06-SM'18) is currently an Assistant Professor of Electrical Engineering at Korea University, Seoul, Korea. He received the B.S. and M.S. degrees in Computer Science and Engineering from Korea University, Seoul, Korea, in 2004 and 2006, respectively, and the Ph.D. degree in Computer Science from the University of Southern California (USC), Los Angeles, CA, USA, in 2014. Before joining Korea University as an Assistant Professor in 2019, he was with LG Electronics as a research engineer (Seoul, Korea, 2006-2009), InterDigital as an intern (San Diego, CA, USA, 2012), Intel Corporation as a systems engineer (Santa Clara in Silicon Valley, CA, USA, 2013-2016), and Chung-Ang University as an Assistant Professor (Seoul, Korea, 2016-2019). He is a Senior Member of the IEEE. He was a recipient of the Annenberg Graduate Fellowship with his Ph.D. admission from USC (2009), the Intel Corporation Next Generation and Standards (NGS) Division Recognition Award (2015), the Haedong Young Scholar Award by KICS (2018), the IEEE Vehicular Technology Society (VTS) Seoul Chapter Award (2019), the Outstanding Contribution Award by KICS (2019), the Gold Paper Award from the IEEE Seoul Section Student Paper Contest (2019), and the IEEE Systems Journal Best Paper Award (2020).

David Aziz Mohaisen earned his M.Sc. and Ph.D. degrees from the University of Minnesota in 2012. Currently, he is an Associate Professor of Computer Science at the University of Central Florida. Prior to joining Central Florida, he was an Assistant Professor at SUNY Buffalo (2015-2017), a Senior Research Scientist at Verisign Labs (2012-2015), and a Researcher at ETRI (2007-2009). He was awarded the Summer Faculty Fellowship from the US AFOSR (2016), the Best Student Paper at ICDCS (2017), the Best Paper Award at WISA (2014), the Best Poster Award at IEEE CNS (2014), and a Doctoral Dissertation Fellowship from the University of Minnesota (2011). He is on the editorial board of IEEE Transactions on Mobile Computing. He is a Senior Member of ACM and a Senior Member of IEEE.

Wonjun Lee received the B.S. and M.S. degrees in Computer Engineering from Seoul National University, Seoul, South Korea, in 1989 and 1991, respectively, the M.S. degree in Computer Science from the University of Maryland at College Park, College Park, MD, USA, in 1996, and the Ph.D. degree in Computer Science and Engineering from the University of Minnesota, Minneapolis, MN, USA, in 1999. In 2002, he joined the faculty of Korea University, Seoul, where he is currently a Professor with the Department of Computer Science and Engineering. He has authored or co-authored over 180 papers in refereed international journals and conferences. His research interests include communication and network protocols, optimization techniques in wireless communication and networking, security and privacy in mobile computing, and radio frequency powered computing and networking. Dr. Lee has served as a Technical Program Committee member for the IEEE International Conference on Computer Communications from 2008 to 2018. He was associated with the ACM International Symposium on Mobile Ad Hoc Networking and Computing from 2008 to 2009, the IEEE International Conference on Computer Communications and Networks from 2000 to 2008, and over 118 international conferences.

- [1] L. Wei, R. Q. Hu, Y. Qian, and G. Wu, "Key elements to enable millimeter wave communications for 5G wireless systems," *IEEE Wireless Commun.*, vol. 21, no. 6, pp. 136-143, Dec. 2014. doi: 10.1109/MWC.2014.7000981
- [2] M. A. Salkuyeh and B. Abolhassani, "Optimal video packet distribution in multipath routing for urban VANETs," *J. Commun. Netw.*, vol. 20, no. 2, pp. 198-206, Apr. 2018.
- [3] J. G. Andrews et al., "What will 5G be?," *IEEE J. Sel. Areas Commun.*, vol. 32, no. 6, pp. 1065-1082, June 2014.
- [4] T. E. Bogale and L. B. Le, "Massive MIMO and mmWave for 5G wireless HetNet: Potential benefits and challenges," *IEEE Veh. Technol. Mag.*, vol. 11, no. 1, pp. 64-75, Mar. 2016.
- [5] J. Kim, G. Caire, and A. F. Molisch, "Quality-aware streaming and scheduling for device-to-device video delivery," *IEEE/ACM Trans. Netw.*, vol. 24, no. 4, pp. 2319-2331, Aug. 2016. doi: 10.1109/TNET.2015.2452272
- [6] Cisco, "Cisco visual networking index: Global mobile data traffic forecast, 2016-2021 QA," https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visualnetworking-index-vni/vni-forecast-qa.html, July 2018. [Accessed: 2019-01-08].
- [7] I. Parvez, A. Rahmati, I. Guvenc, A. I. Sarwat, and H. Dai, "A survey on low latency towards 5G: RAN, core network and caching solutions," *IEEE Commun. Surveys Tuts.*, vol. 20, no. 4, pp. 3098-3130, Fourth Quarter 2018.
- [8] Y. Niu, C. Gao, Y. Li, L. Su, and D. Jin, "Exploiting multi-hop relaying to overcome blockage in directional mmWave small cells,"
*J.Commun.Netw.*, vol. 18, no. 3, pp. 364-374, June, 2016.doi:[[[10.1109/JCN.2016.000052]]] - 9 J. Kim, Y. Tian, S. Mangold, A. F. Molisch, "Joint scalable coding and routing for 60 GHz real-time live HD video streaming applications,"
*IEEE Trans.Broadcasting*, vol. 59, no. 3, pp. 500-512, Sept, 2013.doi:[[[10.1109/TBC.2013.2273598]]] - 10 M. Baianifar, S. M. Razavizadeh, H. Akhlaghpasand, I. Lee, "Energy efficiency maximization in mmWave wireless networks with 3D beamforming,"
*J.Commun.Netw.*, vol. 21, no. 2, pp. 125-135, Apr, 2019.custom:[[[-]]] - 11 J. Kim, A. F. Molisch, "Fast millimeter-wave beam training with receive beamforming,"
*J. Commun. Netw.*, vol. 16, no. 5, pp. 512-522, Oct, 2014.doi:[[[10.1109/JCN.2014.000090]]] - 12 J. Kim, Y. Tian, S. Mangold, A. F. Molisch, "Quality-aware coding and relaying for 60 GHz real-time wireless video broadcasting,"
*in Proc. IEEE ICC*, pp. 5148-5152, June, 2013.custom:[[[-]]] - 13 S. Park, B. Kim, H. Yoon, S. Choi, "RA-eV2V: Relaying systems for LTE-V2V communications,"
*J.Commun.Netw.*, vol. 20, no. 4, pp. 198-206, Aug, 2018.doi:[[[10.1109/JCN.2018.000055]]] - 14 T. S. Rappaport et al., "Millimeter wave mobile communications for 5G cellular: It will work!,"
*IEEE Access*, vol. 1, no. 1, pp. 335-349, 2013.doi:[[[10.1109/ACCESS.2013.2260813]]] - 15 J. Kim, A. F. Molisch, "Quality-aware millimeter-wave device-todevice multi-hop routing for 5G cellular networks,"
*in Proc. IEEE ICC*, pp. 5251-5256, June, 2014.custom:[[[-]]] - 16 S. Zhang, N. Zhang, X. Fang, P. Yang, X. S. Shen, "Self-sustaining caching stations: Toward cost-effective 5G-enabled vehicular networks,"
*IEEE Commun.Mag.*, vol. 55, no. 11, pp. 202-208, Nov, 2017.doi:[[[10.1109/MCOM.2017.1700129]]] - 17 N. Magaia, Z. Sheng, P. R. Pereira, M. Correia, "REPSYS: A robust and distributed incentive scheme for in-network caching and dissemination in vehicular delay-tolerant networks,"
*IEEE Wireless Commun.*, vol. 25, no. 3, pp. 65-71, June, 2018.custom:[[[-]]] - 18 H. Ahlehagh, S. Dey, "Video-aware scheduling and caching in the radio access network,"
*IEEE /ACM Trans. Netw.*, vol. 22, no. 5, pp. 1444-1462, Oct, 2014.doi:[[[10.1109/TNET.2013.2294111]]] - 19
*Highway Data Explorer (Online). Available:*, http://dtdapps.coloradodot.info/otis/HighwayData - 20 L. Yao, A. Chen, J. Deng, J. Wang, G. Wu, "A cooperative caching scheme based on mobility prediction in vehicular content centric networks,"
*IEEE Trans.Veh.Technol.vol 67*, no. 6, pp. 5435-5444, June, 2017.doi:[[[10.1109/TVT.2017.2784562]]] - 21 R. S. Sutton, A. G. Barto, "Reinforcement learning: An introduction,"
*MIT press*, 2018.doi:[[[10.1109/TNN.1998.712192]]] - 22 Y. Guo, Q. Yang, F. R. Yu, V. C. Leung, "Cache-enabled adaptive video streaming over vehicular networks: A dynamic approach,"
*IEEE Trans.Veh.Technol.*, vol. 67, no. 6, pp. 5445-5459, June, 2018.doi:[[[10.1109/TVT.2018.2817210]]] - 23 K. Poularakis, G. Iosifidis, A. Argyriou, I. Koutsopoulos, L. Tassiulas, "Caching and operator cooperation policies for layered video content delivery,"
*in Proc.IEEE INFOCOM*, pp. 1-9, Apr, 2016.custom:[[[-]]] - 24 Y. Huang, X. Song, F. Ye, Y. Yang, X. Li, "Fair caching algorithms for peer data sharing in pervasive edge computing environments,"
*in Proc. IEEE ICDCS*, pp. 605-614, June, 2017.custom:[[[-]]] - 25 S. Fu, P. Duan, Y. Jia, "Content-exchanged based cooperative caching in 5G wireless networks,"
*in Proc.IEEE GLOBECOM*, pp. 1-6, Dec, 2017.custom:[[[-]]] - 26 S. Arabi, E. Sabir, H. Elbiaze, "Information-centric networking meets delay tolerant networking: Beyond edge caching,"
*in Proc. IEEE WCNC*, pp. 1-6, Apr, 2018.custom:[[[-]]] - 27 R. Kim, H. Lim, B. Krishnamachari, ""Prefetching-based data dissemination in vehicular cloud systems,"
*IEEE Trans.Veh.Technol.*, vol. 65, no. 1, pp. 292-306, Jan, 2015.custom:[[[-]]] - 28 G. Mauri, M. Gerla, F. Bruno, M. Cesana, G. Verticale, "Optimal content prefetching in NDN vehicle-to-infrastructure scenario,"
*IEEE Trans. Veh.Technol.*, vol. 66, no. 3, pp. 2513-2525, June, 2016.doi:[[[10.1109/TVT.2016.2580586]]] - 29 M. Chen et al., "Caching in the Sky: Proactive deployment of cacheenabled unmanned aerial vehicles for optimized quality-of-experience,"
*IEEE J.Sel.AreasCommun.*, vol. 35, no. 5, pp. 1046-1061, May, 2017.custom:[[[-]]] - 30 T. T. Le, R. Q. Hu, "Mobility-aware edge caching and computing in vehicle networks: A deep reinforcement learning,"
*IEEE Trans.Veh.Technol.*, vol. 67, no. 11, pp. 10190-10203, Nov, 2018.doi:[[[10.1109/TVT.2018.2867191]]] - 31 Y. He, F. R. Yu, N. Zhao, V. C. Leung, H. Yin, "Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach,"
*IEEE Commun.Mag.*, vol. 55, no. 12, pp. 31-37, Dec, 2017.doi:[[[10.1109/MCOM.2017.1700246]]] - 32 Y. He, N. Zhao, H. Yin, "Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach,"
*IEEE Trans.Veh.Technol.*, vol. 67, no. 1, pp. 44-55, Jan, 2017.doi:[[[10.1109/TVT.2017.2760281]]] - 33 Y. He et al., "Deep reinforcement learning-based optimization for cacheenabled opportunistic interference alignment wireless networks,"
*IEEE Trans.Veh.Technol.*, vol. 66, no. 11, pp. 10433-10445, Nov, 2017.custom:[[[-]]] - 34 S. Lin et al., "Fast simulation of vehicular channels using finite-state markov models,"
*IEEE WirelessCommun.Lettersearlyaccess*, 2019.custom:[[[-]]] - 35 Z. Ning, X. Wang, F. Xia, J. J. Rodrigues, "Joint computation offloading, power allocation, and channel assignment for 5G-enabled traffic management systems,"
*IEEE Trans.Ind.Inf.*, vol. 15, no. 5, pp. 3058-3067, May, 2019.custom:[[[-]]] - 36 N. Wang, E. Hossain, V. K. Bhargava, "Joint downlink cell association and bandwidth allocation for wireless backhauling in two-tier HetNets with large-scale antenna arrays,"
*IEEE Trans. Wireless Commun.*, vol. 15, no. 5, pp. 3251-3268, Jan, 2016.doi:[[[10.1109/TWC.2016.2519401]]] - 37 K. Shanmugam, N. Golrezaei, A. G. Dimakis, A. F. Molisch, G. Caire, "Femtocaching: Wireless content delivery through distributed caching helpers,"
*IEEE Trans. Inf. Theory*, vol. 59, no. 12, pp. 8402-8413, Sept, 2013.doi:[[[10.1109/TIT.2013.2281606]]] - 38 L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, "Web caching and Zipf-like distributions: Evidence and implications,"
*in Proc. IEEE INFOCOM*, pp. 126-134, Mar, 1999.custom:[[[-]]] - 39 Y. J. Mo, J. Kim, J.-K. Kim, A. Mohaisen, W. Lee, "Performance of deep learning computation with TensorFlow software library in GPUcapable multi-core computing platforms,"
*in Proc.IEEE ICUFN*, pp. 240-242, July, 2017.custom:[[[-]]] - 40 J. Clausen, W. L. Boyajian, L. M. Trenkwalder, V. Dunjko, H. J. Briegel, "On the convergence of projective-simulation-based reinforcement learning in Markov decision processes,"
*in arXiv preprint arXiv:1910.11914*, 2019.custom:[[[-]]] - 41 V. Mnih et al., "Asynchronous methods for deep reinforcement learning,"
*in Proc.ICML*, pp. 1928-1937, June, 2016.custom:[[[-]]] - 42 T. P. Lillicrap et al., "Continuous control with deep reinforcement learning,"
*in Proc.ICLR*, pp. 1-14, May, 2016.custom:[[[-]]] - 43 M. Hessel et al., "Rainbow: Combining improvements in deep reinforcement learning,"
*in Proc.AAAI*, pp. 1-8, Feb, 2018.custom:[[[-]]] - 44 H. V. Hasselt, A. Guez, D. Silver, "Deep reinforcement learning with double Q-learning,"
*in Proc.AAAI*, pp. 1-7, Feb, 2016.custom:[[[-]]]