**Self-Adaptive Power Control with Deep Reinforcement Learning for Millimeter-Wave Internet-of-Vehicles Video Caching**

Dohyun Kwon, Joongheon Kim, David A. Mohaisen, and Wonjun Lee

## Abstract

**Abstract:** Video delivery and caching over the millimeter-wave (mmWave) spectrum is a promising technology for achieving high data rates and efficient frequency utilization in many applications, including distributed vehicular networks. However, because the handoff duration is short, jointly calibrating the optimal power allocation of each base station toward its associated vehicles and the cache allocation is challenging owing to its computational complexity. Until now, most video delivery applications have relied on online or offline algorithms, which are limited in their ability to compute and optimize high-dimensional objectives with low delay in large-scale vehicular networks. Deep reinforcement learning, on the other hand, has been shown to handle problems of this scale through an optimized policy learning phase. In this paper, we propose deep deterministic policy gradient-based power control of mmWave base stations (mBSs) and proactive cache allocation toward mBSs in distributed mmWave Internet-of-vehicle (IoV) networks. Simulation results validate the performance of the proposed caching scheme in terms of the quality of the provisioned video and playback stall in IoV networks of various scales.

**Keywords:** deep reinforcement learning, internet-of-vehicle caching, video caching

## I. Introduction

Millimeter-wave (mmWave) communication is a promising technology for provisioning high-resolution video contents, with superior data rates and more efficient frequency utilization [1]-[5]. Based on current global trends, the share of video traffic in mobile data traffic is expected to increase, with video contents projected to make up 78% of mobile traffic in 2021 [6]. As such, video caching in mmWave networks has been highlighted by both industry and academia [7]-[12].

In particular, it is expected that most traffic in forthcoming mobile networks will consist of mmWave-based video chunks. Meanwhile, among the many use cases of mmWave-based video provisioning, Internet-of-vehicle (IoV) networks face multiple challenges [13]. For example, the user equipment (UE) installed in a vehicle has an intrinsic feature: very high mobility. Considering that mmWave propagation is highly directive and has a comparatively short coverage region [7], [15], the association time between a distributed mmWave base station (mBS) and its UEs is short.

In realistic mmWave IoV networks, the caching scheme should also consider proactive cache size allocation toward distributed mBSs to prevent playback stall. In addition, power control of each mBS for energy efficiency, and the number of chunks requested from media servers for minimizing the number of dropped video chunks, have been investigated [16], [17]. That is, the edge node (i.e., the mBS) is responsible for providing cache size and power allocation to support its associated vehicles. Moreover, if the caching scheme must reflect more optimization objectives or is applied to even larger IoV networks, classical caching schemes are limited in their ability to find such an optimal point within the delay bounds needed to avoid video streaming stall events [5], [18]. As such, a novel caching scheme for mmWave-based IoV networks is required. To address these issues, we propose a deep reinforcement learning (DRL) based caching scheme that learns the optimal power control of each mBS in the considered IoV networks, together with cache allocation toward the mBSs, under a realistic caching scenario. DRL is chosen among the various optimization and learning based algorithms because it is one of the emerging sequential decision-making algorithms for time-varying systems under unexpected observations. Once the DRL agent has learned the optimal action policy, the caching scheme can derive the optimal power and cache size allocation for the UEs of each edge node as soon as the agent observes the state of the environment.

Contributions. This paper proposes a DRL-based video caching scheme for mmWave IoV networks, as illustrated in Fig. 1, to enable an optimal video provisioning service under highly dynamic and multidimensional learning objectives. Note that real-world vehicle velocity data is utilized so that a realistic simulation of IoV networks is possible [19]. The algorithmic contributions of this paper can be summarized as follows:

Extended action space: Among the various DRL algorithms, the deep deterministic policy gradient (DDPG) is adopted, which enables learning over a continuous and multidimensional action space. DDPG learns an optimal point of power and cache size for each mBS and for each observed IoV network. The learning agent is a macro base station (MBS), which learns and controls the constrained optimal power allocation toward distributed edge nodes (i.e., multiple mBSs) so that their associated vehicular user equipment (VUE) can experience a high-quality video provisioning service with optimized video quality and seamless playback.

Model free and off policy: The advantage of the model-free property of DDPG is that the MBS does not need complete information about the vehicular network, whereas model-based DRL algorithms need such knowledge. The MBS interacts with the environment, accumulates the experience of these interactions, and utilizes it for learning the optimal caching policy. In addition, the advantage of the off-policy property of DDPG is that, even though the caching policy is updated by the learning process, experience collected under a previous caching policy can still be utilized, so that the data efficiency is considerably better than that of on-policy algorithms such as SARSA [21].
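
As a minimal sketch of the off-policy property described above, the following Python snippet stores transitions in a fixed-size buffer and samples them uniformly, regardless of which version of the policy produced them. The class name `ReplayBuffer`, the capacity, and the stored values are illustrative assumptions and are not taken from the proposed scheme.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity, seed=0):
        # A deque with maxlen evicts the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement; transitions collected under
        # older policies are reused exactly like recent ones.
        batch_size = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), batch_size)


# Hypothetical usage: five interactions, buffer keeps the latest three.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0.1 * step, reward=-1.0, next_state=step + 1)
batch = buffer.sample(batch_size=2)
```

An on-policy method would have to discard such stored transitions after each policy update, which is the data-efficiency difference the paragraph above refers to.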

Organization. The rest of this paper is organized as follows: Section II introduces various caching schemes including the DRL based approaches and the classical one. Next, Section III summarizes an overview of reinforcement learning, including DDPG. Section IV proposes the system model and description of our caching scheme. Section V discusses the simulation settings and our proposed scheme’s performance for power-cache allocation learning in mmWave IoV networks. Finally, Section VI concludes this paper.

## II. RELATED WORK

In this section, existing video caching schemes, including optimization formula-based approaches and DRL-based methods, are introduced. First, we summarize the classical optimization formula-based approaches. Next, we review the DRL-based video caching schemes, especially the deep-Q network (DQN)-based approach.

##### A. Optimization for Caching

In [20], vehicular content centric networks (VCCN) with mobility prediction capabilities were proposed for efficiently electing caching nodes. Among multiple vehicles located in a specific hot spot region, a representative vehicle is selected based on its sojourn time. The elected caching node in the hot spot region mediates the caching procedure, and the rest of the vehicles are served by the caching node. The dynamic cache algorithm (DCA) proposed in [22] enables an adaptive bitrate (ABR) video streaming service in vehicular networks. The authors considered the mobility of vehicular networks, which induces a time-varying wireless channel state. The DCA-based caching scheme addresses the issues arising from mobility by jointly considering quality adaptation, cache placement, and bandwidth allocation. A distributed content caching architecture focused on reducing content delivery delay was proposed in [23]. Specifically, minimizing the delay in the presence of layered-video encoding techniques such as scalable video coding (SVC) is NP-hard, so the authors transformed the caching problem and derived a pseudopolynomial-time optimal caching solution. Moreover, several video caching schemes for vehicular networks and mobile edge networks were proposed in [24] under various scenarios, and caching optimizations with classical approaches in information centric networks (ICNs) were proposed in [25] and [26].

In addition, prefetching-based data dissemination in vehicular cloud systems (VCSs) has been widely studied for vehicular ad hoc networks (VANETs) to support various wireless communication services such as multimedia streaming, vehicle information, and autonomous navigation [27]. The authors focused on how to exploit the local data storage (i.e., content cache) of roadside wireless access points (APs) within a VCS for efficient data dissemination. That is, the prefetching approach proactively caches contents for efficient data dissemination in the VCS. Such data dissemination can enhance local access to popular Internet contents via proxy servers. In [28] proactive caching for various wireless network applications was studied. In [28], a content prefetching technique for named data networking (NDN), which is one of the ICN frameworks, was proposed and shown to maximize the probability that a user retrieves the desired content in a vehicle-to-infrastructure (V2I) scenario. The authors leveraged an integer linear programming formulation for optimally distributing content among the network nodes while also accounting for the available storage and link capacities.

In [29], the deployment of unmanned aerial vehicles for video caching is discussed, and a conceptor-based echo state network is used for solving the quality-of-experience (QoE) optimization. This approach is novel and well discussed; however, the problem in that paper is not equivalent to ours because we assume the existence of fixed mBS infrastructure for more reliable service provisioning.

##### B. DRL-based Caching

DRL-based caching schemes aim to find the optimal policy in a learning phase. The agent observes the system state and, having acquired this information, follows its policy to interact with the system. After the agent's action is applied, the environment returns the corresponding reward value to the agent, and the agent learns a better policy based on the reward value. A mobility-aware caching and computational offloading scheme based on deep Q-learning was proposed in [30]. The authors formulated a joint optimal caching and computing resource allocation problem to minimize the overall system cost under hard deadline delay, dynamic storage capacity, and computation resource constraints. In addition, deep reinforcement learning based caching schemes for a variety of application areas, including interference alignment, software-defined networks (SDNs), and 5G mobile edge computing, were proposed in [31]-[33]. An integrated framework that can dynamically orchestrate networking, caching, and computing resources was proposed in [31] to enhance the performance of services for smart cities. Based on the framework, a mobile edge computing and caching scheme with SDN and network functions virtualization (NFV) was proposed using a deep Q-learning based approach. Similarly, [32] proposed a deep Q-learning based resource allocation strategy for next generation vehicular networks. The authors formulated the resource allocation problem and jointly considered the orchestration of content caching with ICN, networking (e.g., SDN and NFV), and computing (e.g., cloud/edge computing) for optimizing the network. In addition, a cache-enabled interference alignment strategy for next generation wireless networks was proposed in [33]. Unlike most previous interference alignment (IA) techniques, which assumed an invariant channel, the channel state transition is modeled there as a finite state Markov channel (FSMC).

## III. DEEP REINFORCEMENT LEARNING FOR CACHING

In this section, we review deep reinforcement learning based video caching in mmWave-based IoV networks. First, we summarize the preliminaries of reinforcement learning. Next, DQN-based caching schemes are introduced. Finally, the DDPG algorithm, which is appropriate for large action and state spaces, is introduced as the basis of the proposed power-cache aware control policy.

##### A. Preliminaries

A Markov decision process (MDP) is defined as [TeX:] $$M=\{\mathcal{S}, \mathcal{A}, T, r\}$$, where [TeX:] $$\mathcal{S}$$ denotes the state space, [TeX:] $$\mathcal{A}$$ denotes the set of possible actions, [TeX:] $$T$$ denotes the transition model, and [TeX:] $$r$$ denotes the reward value. Based on the MDP, the goal of reinforcement learning is to train a policy [TeX:] $$\pi_{\theta} \in \Pi: \mathcal{S} \times \mathcal{A} \rightarrow[0,1]$$. The policy [TeX:] $$\pi$$ maps the state of the environment to an action so as to maximize the expected reward [TeX:] $$\mathcal{J}(\pi)$$. For a process with a finite horizon [TeX:] $$T$$, the expected reward [TeX:] $$\mathcal{J}(\pi)$$ can be described as the accumulation of the discounted reward at each time step, [TeX:] $$\mathcal{J}(\pi)=\mathbb{E}\left[\sum_{t=0}^{T} \delta^{t} r_{t} \mid \pi\right]$$, where [TeX:] $$\delta$$ is a discount factor which adjusts the effect of future rewards on the current decision. The optimal policy [TeX:] $$\pi^{*}$$ is then described as follows:

##### (1)

[TeX:] $$\pi^{*}=\arg \max _{\pi} \mathbb{E}\left[\sum_{t=0}^{T} \delta^{t} r_{t} \mid \pi\right].$$

Based on this equation, the objective of reinforcement learning is described as [TeX:] $$\arg \max _{\theta} \mathbb{E}_{s \sim \pi_{\theta}}\left[r\left(s, \pi_{\theta}(s)\right)\right]$$.
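
The expected discounted reward defined in this subsection can be illustrated with a small Monte Carlo estimate. The sketch below assumes that a few finite reward sequences have already been collected under some policy; the function names, the two sample episodes, and the discount value 0.5 are illustrative assumptions.

```python
def discounted_return(rewards, delta):
    """Sum of delta**t * r_t over one finite episode."""
    total = 0.0
    for t, r_t in enumerate(rewards):
        total += (delta ** t) * r_t
    return total


def estimate_objective(episodes, delta):
    """Monte Carlo estimate of J(pi): mean discounted return over episodes."""
    returns = [discounted_return(episode, delta) for episode in episodes]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed under one fixed policy.
episodes = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
j_hat = estimate_objective(episodes, delta=0.5)
# Episode returns are 1.75 and 1.0, so j_hat is 1.375.
```

A smaller discount factor shrinks the contribution of later rewards, which is the sense in which it adjusts the effect of future rewards on the current decision.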

##### B. DQN

DQN utilizes a neural network to approximate state-action value functions (Q-functions). Approximating the Q-function with a neural network allows DQN to learn a policy even in a high-dimensional system state space. DQN builds on the classical Q-learning algorithm, in which the Q-value of a state-action pair is estimated through iterative updates based on repeated interactions with the environment. Accordingly, in DQN the immediate reward received and the expected Q-value of the new state are used to update the Q-function, and the objective of DQN is described as follows:

##### (2)

[TeX:] $$\arg \min _{\theta} L_{D Q N}(\theta)=\arg \min _{\theta}\left(Q\left(s_{t}, a_{t} ; \theta\right)-\bar{Q}\left(s_{t}, a_{t} ; \theta\right)\right)^{2},$$

where [TeX:] $$s_{t}$$ is the state at time [TeX:] $$t$$, [TeX:] $$a_{t}$$ is the action selected at [TeX:] $$s_{t}$$, and [TeX:] $$\theta$$ denotes the parameters of the Q-function. [TeX:] $$\bar{Q}\left(s_{t}, a_{t} ; \theta\right)$$ is the target Q-value, which is derived from the current Q-function at time [TeX:] $$t$$. Therefore, [TeX:] $$\bar{Q}\left(s_{t}, a_{t} ; \theta\right)=r_{t}+\delta \max _{\hat{a}} Q\left(s_{t+1}, \hat{a} ; \theta\right)$$.
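
To make the target and the loss in (2) concrete, the following sketch evaluates them for a single transition. A lookup table stands in for the neural network here, which is a simplifying assumption; the state labels, action names, and numerical values are hypothetical.

```python
def td_target(q_table, reward, next_state, actions, delta):
    """Target value: r_t + delta * max over next actions of Q(s_{t+1}, a)."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    return reward + delta * best_next


def dqn_loss(q_table, state, action, reward, next_state, actions, delta):
    """Squared gap between the current estimate and the target, as in (2)."""
    current = q_table.get((state, action), 0.0)
    target = td_target(q_table, reward, next_state, actions, delta)
    return (current - target) ** 2


# Hypothetical discrete actions and Q-values (a table replaces the network).
actions = ["low", "high"]
q_table = {("s0", "low"): 1.0, ("s1", "low"): 2.0, ("s1", "high"): 4.0}
target = td_target(q_table, reward=1.0, next_state="s1",
                   actions=actions, delta=0.5)
loss = dqn_loss(q_table, "s0", "low", reward=1.0, next_state="s1",
                actions=actions, delta=0.5)
# target = 1.0 + 0.5 * 4.0 = 3.0 and loss = (1.0 - 3.0) ** 2 = 4.0
```

The `max` over an explicit list of actions is the step that requires a finite, discrete action set.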

##### C. DDPG

While DQN-based approaches can only handle discrete and low-dimensional action spaces, environments in many realistic applications have continuous and high-dimensional action spaces (e.g., proactive caching and resource management). Moreover, DQN cannot be straightforwardly applied to continuous actions, since it depends on choosing the action that maximizes the Q-value function. When there is a finite number of discrete actions, the action with the maximal Q-value can be chosen because the Q-value at a given state can be computed directly for each action. When the action space is continuous, however, it is hard to evaluate the Q-values exhaustively. DDPG is an algorithm which concurrently learns the Q-value function and the policy: the action-value function is learned and then used to learn the policy. In DDPG, the optimal Q-function [TeX:] $$Q^{*}(s, a)$$ is approximated by a neural network, as in DQN. Because the action space is continuous, the approximated Q-function can be taken to be differentiable with respect to the action, and on that basis a policy [TeX:] $$\pi_{\theta}$$ can be updated efficiently. The function [TeX:] $$Q_{\phi}(s, a)$$, which is approximated with the parameters [TeX:] $$\phi$$, is updated by minimizing the mean-squared Bellman error (MSBE), [TeX:] $$L(\phi, \mathcal{D})=\mathbb{E}\left[\left(Q\left(s_{t}, a_{t} ; \phi\right)-\bar{Q}\left(s_{t}, a_{t} ; \phi\right)\right)^{2}\right]$$, where [TeX:] $$\mathcal{D}$$ is a set of transitions [TeX:] $$\left(s, a, r, s^{\prime}\right)$$. DDPG aims to learn a deterministic policy [TeX:] $$\pi_{\theta}(s)$$ which provides an action that maximizes [TeX:] $$Q_{\phi}(s, a)$$. With respect to the policy parameters [TeX:] $$\theta$$, gradient ascent is performed to update the policy [TeX:] $$\pi_{\theta}$$ as [TeX:] $$\max _{\theta} \mathbb{E}_{s \sim \mathcal{D}}\left[Q_{\phi}\left(s, \pi_{\theta}(s)\right)\right]$$.
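
The policy update can be sketched on a toy problem in which the critic is held fixed and has a known closed form. This is only an illustration of gradient ascent through a differentiable Q-function: the quadratic critic, the linear policy, the step size, and the sampled states are all assumptions, and a full DDPG agent would train both networks together.

```python
def critic_grad_action(state, action):
    """dQ/da for a hypothetical critic Q(s, a) = -(a - 2 * s) ** 2."""
    return -2.0 * (action - 2.0 * state)


def policy(theta, state):
    """Deterministic linear policy pi_theta(s) = theta * s."""
    return theta * state


def policy_gradient(theta, states):
    """Chain rule: dQ/da evaluated at a = pi_theta(s), times dpi/dtheta = s,
    averaged over the sampled states."""
    grads = [critic_grad_action(s, policy(theta, s)) * s for s in states]
    return sum(grads) / len(grads)


states = [0.5, 1.0, 1.5]  # stand-in for states sampled from the set D
theta = 0.0
for _ in range(200):
    theta += 0.1 * policy_gradient(theta, states)  # gradient ascent step
# theta approaches 2.0, the slope at which the assumed critic is maximal
```

No maximization over actions is carried out at any point; the policy is improved only by following the gradient of the critic with respect to the action.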

## IV. DDPG-BASED POWER-STORAGE-AWARE CACHING

In this section, we propose the overall architecture of a power-cache aware video caching scheme based on the DDPG algorithm introduced in the previous section. In IoV networks, a classical caching scheme, which must compute the optimal video caching options toward the roadside unit (RSU) (i.e., the mBS cache of the edge node) at every time step through extensive calculation, is an unrealistic option because of its time-domain overhead. As the association time between an mBS and vehicles is short, this overhead may severely affect the video provisioning service, resulting in degradation of QoS. Thus, we introduce a DDPG-based power and cache storage aware proactive caching scheme that meets the requirements of the considered scenario through two ideas: (i) calculation of the optimal caching action through a learning process, so that the optimal caching option can be derived for seamless services once learning is complete, and (ii) scale-adaptable IoV networks that satisfy optimal power and preemptive cache allocation of the mBSs for a high-quality video provisioning service. The system description and the design of the DDPG-based caching scheme are given in the following subsections.

##### A. Assumptions

Before introducing the overall system description, we state several assumptions regarding the elements of the proposed caching scheme for mmWave IoV networks. The assumptions cover the following components: the mBS, the vehicle, the MBS, and the video contents. Note that the components are taken to fully satisfy the corresponding assumptions for the quality-cache aware video caching scheme.

mBS: Considering that most typical highways are built in rural regions and that the signal propagation of an mBS is highly directive, we assume that the mBSs of the considered IoV networks are oriented directly toward the highway. Specifically, the mBSs perform beam alignment toward vehicles on the highway within a range of azimuth angles appropriate for serving their own coverage regions. In addition, the data transmission of each mBS is assumed to be unaffected by the others. That is, the mBSs are independent and identically distributed (i.i.d.) along the highway, spaced so that their coverage regions do not overlap, and IA is therefore out of the scope of this paper. Finally, because two or more vehicles associated with the same mBS cannot occupy the same position on the highway at the same time, we assume that the downlink (DL) of each mBS is sufficient to transmit all of the video chunks in its cache to each associated vehicle within a unit time step, regardless of the quality of the video. That is, the DL capacity of an mBS is sufficient to provision video to each associated vehicle, owing to the i.i.d. setting of the mBSs and the non-overlapping positions of the associated vehicles, with the full bandwidth of the air interface available to each vehicle.

Vehicle: For practical reasons, the vehicles on the highway are assumed to move only forward, i.e., after a time step a vehicle can either enter the coverage region of the following mBS or stay within the coverage region of the currently associated mBS. We assume that the vehicles move forward according to the FSMC transition model with the value of [TeX:] $$\varrho$$ in Fig. 2, which represents the transition probability of vehicles given that the average velocity of the vehicles in the IoV networks and the range of the mBS cell coverage region are both available. As presented in [34], the FSMC has been widely utilized to represent the dynamic variation of vehicular network channels. Because the channel is established between roadside mBSs and UEs in the vehicular network, the simplified position transition of vehicles can be modeled with the FSMC. Note that [TeX:] $$u_{i}$$ and [TeX:] $$x_{j}$$ in Fig. 2 represent the [TeX:] $$i$$th vehicle and the [TeX:] $$j$$th mBS, respectively. In addition, the request of a vehicle is collected by the MBS, so that the currently associated mBS can provide the corresponding video chunks, while the following mBS proactively allocates cache size and video chunks from the media server to prepare for the handoff.

MBS: In the proposed mmWave IoV networks, the MBS acts as the learning agent for power and preemptive cache allocation of the mBSs, enabling a seamless video provisioning service for the vehicles on the highway. Based on the assumption in [35] that the MBS has full knowledge of the channel state information (CSI) in the vehicular network, and on the low-cost signaling of the IoV network state through backhaul communications between the MBS and the mBSs [36], we assume that the MBS has full knowledge of the considered IoV networks in four aspects: (i) association information between mBSs and vehicles, (ii) cache occupancy state of each mBS, (iii) buffer occupancy state of each vehicle, and (iv) history of the average quality of video provisioned to vehicles along their trajectories. These four elements constitute the state observed by the MBS, from which the corresponding caching action is derived by the neural network of the DDPG algorithm. In other words, the MBS can learn the optimal caching policy through trial and error. The detailed learning procedure is given in Algorithm 1. In addition, we assume that the video chunks cached at an mBS for each vehicle are limited to [TeX:] $$\bar{m}$$ unit size, which is the upper bound (UB) of the video size cached at the corresponding mBS, to ensure fairness of the caching service given the limited storage of the mBS cache.

Contents: Guided by previous research [37], we assume that the popularity of video contents among vehicles follows a Zipf distribution [38], where all chunks of a video session are deterministically requested in sequence. Moreover, for each video chunk, the data rate required by the corresponding video quality is assumed to determine the unit size of the chunk. For example, suppose that there exist two video chunks with 360p and 720p quality, whose required data rates are 1 Mbps and 5 Mbps, respectively. Then, we assume that the unit sizes of a single chunk at those qualities are 1 and 5, respectively. That is, when a vehicle requests three chunks of 720p video, the vehicle's buffer occupancy increases by 3 × 5 unit size, while the associated mBS's cache decreases by the same amount. Finally, each vehicle is assumed to watch the video it first requested throughout its entire sojourn time on the highway.
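
The unit-size bookkeeping in the example above can be written out as a short sketch. The function name `deliver_chunks` and the initial cache level of 20 units are illustrative assumptions, and the buffer upper bound is deliberately ignored at this point.

```python
# Unit size of one chunk, equal to the required data rate in Mbps
# (values taken from the 360p / 720p example in the text).
UNIT_SIZE = {"360p": 1, "720p": 5}


def deliver_chunks(buffer_units, cache_units, quality, num_chunks):
    """Move num_chunks chunks of one quality from the mBS cache to the
    vehicle buffer and return the updated (buffer, cache) occupancies."""
    moved = UNIT_SIZE[quality] * num_chunks
    if moved > cache_units:
        raise ValueError("the cache does not hold enough units")
    return buffer_units + moved, cache_units - moved


# Three 720p chunks: the buffer gains 3 * 5 = 15 units, the cache loses 15.
buffer_units, cache_units = deliver_chunks(
    buffer_units=0, cache_units=20, quality="720p", num_chunks=3)
```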

##### B. System Description

In the following, descriptions of system elements including vehicles, cache, buffer, and video are provided.

##### B.1 Vehicles on Highway

There exist up to [TeX:] $$N$$ vehicles and [TeX:] $$K$$ fixed mBSs on the roadside in the considered highway mmWave IoV network. The sets of vehicles and mBSs are denoted by [TeX:] $$\mathcal{U}=\left\{u_{0}, u_{1}, \cdots, u_{i}, \cdots, u_{N-1}\right\}$$ and [TeX:] $$\mathcal{X}=\left\{x_{0}, x_{1}, \cdots, x_{j}, \cdots, x_{K-1}\right\}$$, respectively. Here, [TeX:] $$u_{i}$$ and [TeX:] $$x_{j}$$ represent the [TeX:] $$i$$th vehicle and the [TeX:] $$j$$th mBS, where [TeX:] $$\forall u_{i} \in \mathcal{U}$$, [TeX:] $$\forall x_{j} \in \mathcal{X}$$, [TeX:] $$i \in[0, N)$$, and [TeX:] $$j \in[0, K)$$. We assume that every [TeX:] $$u_{i}$$ can be associated with only one mBS at a time, which follows the hard handoff mechanism. In addition, [TeX:] $$u_{i}$$ can only move forward on the highway, i.e., if [TeX:] $$u_{i}$$ is associated with [TeX:] $$x_{j}$$ at time step [TeX:] $$t$$, then [TeX:] $$u_{i}$$ can only associate with [TeX:] $$x_{j}$$ or [TeX:] $$x_{j+1}$$ at time step [TeX:] $$t+1$$, not [TeX:] $$x_{j-1}$$, where [TeX:] $$j \in(0, K-1)$$. The association information, or the discrete position of every [TeX:] $$u_{i}$$, can be represented as a matrix [TeX:] $$P_{N \times K}$$, where

##### (3)

[TeX:] $$P_{N \times K}=\left\|\begin{array}{ccc} p_{0,0} & \cdots & p_{0, K-1} \\ \vdots & p_{i, j} & \vdots \\ p_{N-1,0} & \cdots & p_{N-1, K-1} \end{array}\right\|$$

and each element [TeX:] $$p_{i, j}$$ of [TeX:] $$P_{N \times K}$$ represents whether [TeX:] $$u_{i}$$ belongs to the coverage of [TeX:] $$x_{j}$$ or not. For example, [TeX:] $$p_{i, j}$$ is [TeX:] $$-1$$ if [TeX:] $$u_{i}$$ is associated with [TeX:] $$x_{j}$$. Moreover, if [TeX:] $$p_{i, j}$$ is 1, [TeX:] $$u_{i}$$ is associated with [TeX:] $$x_{j-1}$$. Otherwise, the value is set to 0.
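
The encoding of the association matrix can be sketched as follows. The function name and the use of `None` for a vehicle that has not yet entered the highway are assumptions made for illustration; for such a vehicle only the entry for the first mBS is set to 1, mirroring the rule that 1 marks the mBS following the current one.

```python
def association_matrix(positions, num_mbs):
    """Build P (N x K): -1 at the serving mBS, 1 at the following mBS,
    and 0 elsewhere. positions[i] is the index j of the mBS serving
    vehicle i, or None if the vehicle has not yet entered the highway."""
    matrix = []
    for j in positions:
        row = [0] * num_mbs
        if j is None:
            row[0] = 1           # the next mBS to be reached is x_0
        else:
            row[j] = -1          # currently associated mBS
            if j + 1 < num_mbs:
                row[j + 1] = 1   # following mBS, if one exists
        matrix.append(row)
    return matrix


# Three vehicles and three mBSs: at x_0, at x_2 (last cell), not yet entered.
P = association_matrix([0, 2, None], num_mbs=3)
```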

The value [TeX:] $$p_{i, j} \in P_{N \times K}$$ can stay the same or change over time according to the FSMC transition probability model described in Fig. 2, where the transition probability is represented as [TeX:] $$\varrho$$, which is derived from the system average velocity [TeX:] $$\mathcal{V}$$ and the mBS cell range [TeX:] $$\mathcal{O}$$. For example, if [TeX:] $$u_{i}$$ is associated with [TeX:] $$x_{j}$$ such that [TeX:] $$i \in[0, N)$$ and [TeX:] $$j \in[0, K-1)$$, then [TeX:] $$p_{i, j}=-1$$ and [TeX:] $$p_{i, j+1}=1$$, and the remaining elements [TeX:] $$p_{i, l}$$ such that [TeX:] $$l \in[0, K)$$, except [TeX:] $$j$$ and [TeX:] $$j+1$$, are 0. The mBS cell range [TeX:] $$\mathcal{O}$$ is assumed to be 150 m, and the system average velocity [TeX:] $$\mathcal{V}$$ of [TeX:] $$\forall u_{i} \in \mathcal{U}$$ is set to [TeX:] $$\mathcal{V}=80 \mathrm{km} / \mathrm{h}$$ [19]. Supposing that each time step is one second and considering the [TeX:] $$\mathcal{V}$$ and [TeX:] $$\mathcal{O}$$ settings, [TeX:] $$\varrho$$ can be calculated as 0.143, which is the probability that a vehicle moves forward and associates with the following mBS at the next time step. That is, each [TeX:] $$u_{i} \in \mathcal{U}$$ that is associated with [TeX:] $$x_{j}$$ transits over time toward the cell of the following mBS [TeX:] $$x_{j+1}$$ with the FSMC transition probability, given that the average velocity of users in the IoV networks is available and [TeX:] $$\mathcal{O}=150 \mathrm{m}$$, for [TeX:] $$i \in[0, N)$$ and [TeX:] $$j \in[0, K-1)$$.
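
The text does not spell out how the value 0.143 is obtained. One reading that reproduces it, offered here only as an assumption, is to count the whole one-second steps a vehicle needs to cross a cell and take the reciprocal: at 80 km/h a vehicle covers about 22.2 m per second, so crossing 150 m takes 6.75 steps, which rounds up to 7 and gives 1/7. The plain ratio of distance per step to cell range would instead give roughly 0.148.

```python
import math


def transition_probability(avg_speed_kmh, cell_range_m, step_s=1.0):
    """Per-step probability of moving to the next cell, assuming the
    sojourn time is rounded up to a whole number of time steps."""
    speed_ms = avg_speed_kmh * 1000.0 / 3600.0          # km/h -> m/s
    steps_in_cell = math.ceil(cell_range_m / (speed_ms * step_s))
    return 1.0 / steps_in_cell


rho = transition_probability(avg_speed_kmh=80.0, cell_range_m=150.0)
# 150 m / 22.2 m per step = 6.75 steps, rounded up to 7, so rho = 1/7
```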

##### B.2 Cache, Buffer, and Video

There exists a set of video caches [TeX:] $$\mathcal{C}=\left\{c_{0}, c_{1}, \cdots, c_{j}, \cdots, c_{K-1}\right\}$$ on the highway, and each [TeX:] $$c_{j}$$ is installed at mBS [TeX:] $$x_{j}$$ such that [TeX:] $$j \in[0, K)$$. Each video cache [TeX:] $$c_{j}$$ of [TeX:] $$x_{j}$$ stores video chunks for vehicles. Suppose that [TeX:] $$c_{j}$$ is requested to provide video chunks by [TeX:] $$u_{i}$$, where [TeX:] $$p_{i, j}=-1$$; then [TeX:] $$x_{j}$$ immediately provides the cached video chunks or requests them from the media server. In addition, the following mBS, denoted by [TeX:] $$x_{j+1}$$, notices the request of the vehicle and proactively allocates cache size and contents from the media server to prepare for the position transition of [TeX:] $$u_{i}$$. The spatial upper bound of [TeX:] $$c_{j}$$ is denoted as [TeX:] $$\bar{c}$$, and the video contents cached in [TeX:] $$c_{j}$$ are transmitted toward the associated set of vehicles [TeX:] $$u_{y, j}$$, such that [TeX:] $$y \in[0, N)$$. We assume that the capacity of the mmWave link between [TeX:] $$u_{i}$$ and [TeX:] $$x_{j}$$ is sufficient so that [TeX:] $$x_{j}$$ can provision all of the video chunks to the set of vehicles associated with [TeX:] $$x_{j}$$. Moreover, the video buffer set [TeX:] $$\mathcal{B}=\left\{b_{0}, b_{1}, \cdots, b_{i}, \cdots, b_{N-1}\right\}$$ represents the set of buffers with which the vehicles are equipped. The buffer [TeX:] $$b_{i}$$ is mounted on [TeX:] $$u_{i}$$, and the spatial upper bound of [TeX:] $$b_{i}$$ and the buffer playback rate are denoted by [TeX:] $$\bar{b}$$ and [TeX:] $$\mathcal{F}$$, respectively. Meanwhile, there is a set of video qualities, denoted by [TeX:] $$\mathcal{Q}$$, and it is assumed that each [TeX:] $$u_{i}$$ can be served with video chunks of any quality in [TeX:] $$\mathcal{Q}$$. For example, [TeX:] $$\mathcal{Q}$$ can be defined as [TeX:] $$\mathcal{Q}=[360 \mathrm{p}, 480 \mathrm{p}, 720 \mathrm{p}, 1080 \mathrm{p}, 4 \mathrm{K}]$$. The quality levels in [TeX:] $$\mathcal{Q}$$ require an average link capacity of 1, 3, 5, 8, and 40 Mbps, respectively, in ascending order, which determines the QoS of [TeX:] $$u_{i}$$ associated with [TeX:] $$x_{j}$$. Per the aforementioned assumption regarding video contents, the unit size of a chunk is determined by the quality of the video.

##### C. DDPG-based Caching

The system can be represented in terms of reinforcement learning, where the agent in the system is [TeX:] $$\mathcal{M}$$. The agent [TeX:] $$\mathcal{M}$$ controls the overall power and proactive cache allocation toward each [TeX:] $$x_{j}$$ on a highway from the remote media server [TeX:] $$\mathcal{Z}$$, with a specific power allocation level [TeX:] $$v_{k}$$ of the video and cache sizes [TeX:] $$c_{i, j}$$ and [TeX:] $$c_{i, j+1}$$ for [TeX:] $$u_{i}$$, where [TeX:] $$k \in[0,2]$$, [TeX:] $$i \in[0, N)$$, and [TeX:] $$j \in[0, K-1)$$. In the following, the learning process for power allocation and proactive cache allocation toward the mBSs is introduced in detail in terms of the state space, action space, reward, and algorithmic description.

##### C.1 State space

The state space of the caching system consists of the following elements: the preloaded unit size of video for each [TeX:] $$u_{i}$$ across the entire mBS set [TeX:] $$\mathcal{X}$$, the buffer occupancy of each [TeX:] $$u_{i}$$, the average quality history of the video provisioned to each [TeX:] $$u_{i}$$, and the position of each vehicle [TeX:] $$u_{i}$$ over time. These elements are denoted by [TeX:] $$C_{N \times K}$$, [TeX:] $$B_{N \times K}$$, [TeX:] $$H_{N \times K}$$, and [TeX:] $$P_{N \times K}$$, respectively. The position state is represented as in (3), and the remaining elements of the state space are represented as follows:

##### (4)

[TeX:] $$C_{N \times K}=\left\|\begin{array}{ccc} c_{0,0} & \cdots & c_{0, K-1} \\ \vdots & c_{i, j} & \vdots \\ c_{N-1,0} & \cdots & c_{N-1, K-1} \end{array}\right\|,$$

##### (5)

[TeX:] $$B_{N \times K}=\left\|\begin{array}{ccc} b_{0,0} & \cdots & b_{0, K-1} \\ \vdots & b_{i, j} & \vdots \\ b_{N-1,0} & \cdots & b_{N-1, K-1} \end{array}\right\|,$$

##### (6)

[TeX:] $$H_{N \times K}=\left\|\begin{array}{ccc} h_{0,0} & \cdots & h_{0, K-1} \\ \vdots & h_{i, j} & \vdots \\ h_{N-1,0} & \cdots & h_{N-1, K-1} \end{array}\right\|.$$

First, [TeX:] $$c_{i, j}$$ in (4) represents the cache occupancy of [TeX:] $$x_{j}$$, which is the preloaded unit size of the video for satisfying the request of [TeX:] $$u_{i}$$. The maximum storage size of each [TeX:] $$c_{i, j}$$ for [TeX:] $$i \in[0, N)$$ and [TeX:] $$j \in[0, K)$$ is limited to [TeX:] $$\bar{m}$$ to guarantee fair video transmission toward vehicles, and [TeX:] $$\sum_{k} c_{k, j} \leqslant \bar{c}$$, where [TeX:] $$k \in[0, N)$$ and [TeX:] $$\bar{m} \times N \leqslant \bar{c}$$.
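
The cache constraints stated above can be checked mechanically. The following sketch takes a cache occupancy matrix as a list of rows, one per vehicle; the function name and the sample values are illustrative assumptions, and the non-negativity check is an added assumption that the text does not state explicitly.

```python
def cache_feasible(cache, m_bar, c_bar):
    """Check the per-vehicle bound c[i][j] <= m_bar, the per-mBS bound
    sum_k c[k][j] <= c_bar, and the sizing condition m_bar * N <= c_bar."""
    num_vehicles = len(cache)
    num_mbs = len(cache[0])
    if m_bar * num_vehicles > c_bar:
        return False
    for row in cache:
        # Non-negative occupancy is assumed here for illustration.
        if any(c < 0 or c > m_bar for c in row):
            return False
    for j in range(num_mbs):
        if sum(cache[i][j] for i in range(num_vehicles)) > c_bar:
            return False
    return True


ok = cache_feasible([[5, 0], [3, 5]], m_bar=5, c_bar=10)        # feasible
too_big = cache_feasible([[6, 0], [3, 5]], m_bar=5, c_bar=10)   # 6 > m_bar
undersized = cache_feasible([[5, 0], [3, 5]], m_bar=5, c_bar=8) # 5 * 2 > 8
```

When the sizing condition and the per-vehicle bound both hold, the per-mBS sum bound follows automatically, so the last loop serves only as a safeguard.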

Next, [TeX:] $$b_{i, j}$$ in (5) represents the buffer occupancy of [TeX:] $$u_{i}$$ associated with [TeX:] $$x_{j}$$. Each [TeX:] $$b_{i, j}$$ for [TeX:] $$\forall j \in[0, K)$$ has a UB of [TeX:] $$\bar{b}$$, and packet drop can occur when [TeX:] $$b_{i, j}+c_{i, j}-\mathcal{F} \geqslant \bar{b}$$. Moreover, the video playback service can be stalled if [TeX:] $$b_{i, j}+c_{i, j}-\mathcal{F} \leqslant 0$$.
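
The two buffer conditions above can be expressed as a small classification function. The function name, the return labels, and the sample values are illustrative assumptions; the thresholds are exactly the inequalities stated in the text.

```python
def buffer_outcome(b, c, playback_rate, b_bar):
    """Classify one time step from the buffer level b + c - F."""
    level = b + c - playback_rate
    if level >= b_bar:
        return "drop"    # buffer would overflow: packet drop can occur
    if level <= 0:
        return "stall"   # nothing left to play: playback can stall
    return "ok"


# Hypothetical values with playback rate F = 5 and buffer bound 20.
overflow = buffer_outcome(b=10, c=15, playback_rate=5, b_bar=20)  # level 20
starved = buffer_outcome(b=2, c=0, playback_rate=5, b_bar=20)     # level -3
normal = buffer_outcome(b=5, c=5, playback_rate=5, b_bar=20)      # level 5
```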

Finally, the average quality state of the provisioned video at [TeX:] $$u_{i, j}$$ is the cumulative average quality of the video provisioned along the trajectory of [TeX:] $$u_{i}$$ from [TeX:] $$x_{0}$$ to [TeX:] $$x_{j}$$. This state is denoted by [TeX:] $$h_{i, j}$$ and is used for learning the policy of [TeX:] $$\mathcal{M}$$, which aims to provision an enhanced video quality toward [TeX:] $$u_{i}$$. Let [TeX:] $$s_{i, j}$$ denote the number of sojourn time steps of [TeX:] $$u_{i}$$ associated with [TeX:] $$x_{j}$$. Then [TeX:] $$h_{i, j}$$ can be calculated as follows:

##### (7)

[TeX:] $$h_{i, j}=\sum_{k=0}^{j} \frac{\sum_{t=0}^{s_{i, k}-1} q_{i}^{k, t}}{s_{i, k}}, i \in[0, N), j \in(0, K).$$

Moreover, [TeX:] $$h_{i, j}=0$$ when [TeX:] $$p_{i, j}=1$$ and [TeX:] $$j=0$$, i.e., we only consider the quality history of a [TeX:] $$u_{i}$$ that has already driven on the highway. The [TeX:] $$q_{i}^{k, t}$$ in (7) represents the [TeX:] $$t$$th quality index of the video chunks provisioned at [TeX:] $$u_{i}$$ while it is associated with [TeX:] $$x_{k}$$.
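Equation (7) can be evaluated directly from a per-vehicle quality log. The sketch below follows the formula as printed, i.e., it sums the per-mBS mean qualities for the mBSs up to index j. The name `average_quality_state` and the data layout are assumptions, as is the choice to skip an mBS with zero sojourn time to avoid dividing by zero.

```python
def average_quality_state(quality_log, j):
    """Evaluate h_{i,j} of (7) for one vehicle.

    quality_log[k] is the list of quality indices q_i^{k,t} received while
    the vehicle was associated with mBS x_k, so len(quality_log[k]) plays
    the role of the sojourn time s_{i,k}.
    """
    total = 0.0
    for k in range(j + 1):
        chunks = quality_log[k]
        if chunks:  # assumption: an mBS with no sojourn contributes nothing
            total += sum(chunks) / len(chunks)
    return total


# One vehicle that has passed three mBSs.
log = [[1, 3], [2, 2, 2], [4]]
h_1 = average_quality_state(log, 1)  # mean(1, 3) + mean(2, 2, 2)
```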

##### C.2 Action Space

The [TeX:] $$\mathcal{M}$$ can learn the optimal action, which proactively requests the optimal power and cache allocation toward [TeX:] $$x_{j}$$ and [TeX:] $$x_{j+1}$$ (i.e., the vicinity of [TeX:] $$u_{i}$$) from [TeX:] $$\mathcal{Z}$$ for seamless video retrieval, given that the state of the mmWave IoV network can be observed. The action space of [TeX:] $$\mathcal{M}$$ consists of [TeX:] $$V_{N \times K}$$ and [TeX:] $$L_{N \times K}$$, the power allocation matrix and the cache allocation matrix, respectively, which are denoted as follows:

##### (8)

[TeX:] $$V_{N \times K}=\left\|\begin{array}{ccc} v_{0,0} & \cdots & v_{0, K-1} \\ \vdots & v_{i, j} & \vdots \\ v_{N-1,0} & \cdots & v_{N-1, K-1} \end{array}\right\|,$$

##### (9)

[TeX:] $$L_{N \times K}=\left\|\begin{array}{ccc} l_{0,0} & \cdots & l_{0, K-1} \\ \vdots & l_{i, j} & \vdots \\ l_{N-1,0} & \cdots & l_{N-1, K-1} \end{array}\right\|.$$

First, [TeX:] $$v_{i, j}$$ of [TeX:] $$V_{N \times K}$$ represents the amount of power allocated by the mBS: [TeX:] $$\mathcal{M}$$ requests from [TeX:] $$\mathcal{Z}$$ the video quality that corresponds to the power [TeX:] $$v_{i, j}$$ in order to serve video at [TeX:] $$x_{j}$$ for [TeX:] $$u_{i}$$. In addition, [TeX:] $$l_{i, j}$$ of [TeX:] $$L_{N \times K}$$ stands for the cache size allocated at [TeX:] $$x_{j}$$ for [TeX:] $$u_{i}$$ by [TeX:] $$\mathcal{M}$$. The [TeX:] $$\mathcal{M}$$ requests cache at up to two neighboring mBSs to accomplish two missions: (i) securing the seamless current video provisioning service at [TeX:] $$x_{j}$$ and (ii) preemptively allocating cache at [TeX:] $$x_{j+1}$$ to enable seamless service, where [TeX:] $$j \in[0, K-1)$$. For example, if [TeX:] $$u_{i}$$ on a highway is associated with [TeX:] $$x_{j}$$, the [TeX:] $$\mathcal{M}$$ allocates cache storage with unit sizes [TeX:] $$l_{i, j}$$ and [TeX:] $$l_{i, j+1}$$ at [TeX:] $$x_{j}$$ and [TeX:] $$x_{j+1}$$, respectively, to support the current video service for [TeX:] $$u_{i}$$ and preemptive video caching for the handoff of [TeX:] $$u_{i}$$. When [TeX:] $$u_{i}$$ is not yet on the highway, i.e., [TeX:] $$p_{i, j}=1$$ and [TeX:] $$j=0$$, the [TeX:] $$\mathcal{M}$$ only allocates cache at [TeX:] $$x_{0}$$ for [TeX:] $$u_{i}$$ for proactive video caching. If [TeX:] $$u_{i}$$ is associated with [TeX:] $$x_{K-1}$$, the [TeX:] $$\mathcal{M}$$ requests [TeX:] $$x_{K-1}$$ to allocate a cache size of [TeX:] $$l_{K-1}$$ to satisfy the current video service of [TeX:] $$u_{i}$$.
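The three allocation cases described above (not yet on the highway, associated with an interior mBS, associated with the last mBS) can be summarized in a few lines. This is only a sketch of the indexing rule; the function name `cache_targets` and the boolean flag `on_highway` are hypothetical.

```python
def cache_targets(j, n_mbs, on_highway=True):
    """Return the mBS indices at which cache is requested for one vehicle.

    j: index of the mBS the vehicle is (or will first be) associated with.
    n_mbs: total number of mBSs K.
    on_highway: False models a vehicle that has not yet entered the highway.
    """
    if not 0 <= j < n_mbs:
        raise ValueError("mBS index out of range")
    if not on_highway:
        return [0]          # proactive caching at the first mBS only
    if j == n_mbs - 1:
        return [j]          # last mBS: no next hop to prefetch for
    return [j, j + 1]       # current service plus preemptive caching


targets = cache_targets(3, n_mbs=10)
```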

##### C.3 Algorithm for Learning The Proactive Caching

The [TeX:] $$\mathcal{M}$$ learns the proactive caching policy and performs power and preemptive cache allocation toward mBSs for seamless video retrieval by using the proposed DDPG-based algorithm shown in Algorithm 1. The overall policy learning procedure is as follows. First, the parameters of the actor and critic networks, which generate and evaluate the actions of [TeX:] $$\mathcal{M}$$, are initialized (line 1). Then, the target networks of the critic and actor, [TeX:] $$\mathcal{Q}^{\prime}$$ and [TeX:] $$\mathcal{A}^{\prime}$$, are initialized with the parameters of the original networks (line 2). In each episode, the [TeX:] $$\mathcal{M}$$ repeats the following procedure to learn the optimal power-cache aware caching policy:

i) For every episode, transition pairs are built from an arbitrarily generated set of states [TeX:] $$s$$ of size [TeX:] $$\varphi$$, the corresponding actions [TeX:] $$a$$ generated by the actor network with input [TeX:] $$s$$, the reward value for [TeX:] $$s$$ and [TeX:] $$a$$, and the next state [TeX:] $$s^{\prime}$$; these are stored in the replay buffer [TeX:] $$\Phi$$ (lines 5–7).

ii) After the replay buffer [TeX:] $$\Phi$$ is filled, a minibatch of transitions is randomly sampled from it. Then, the [TeX:] $$i$$th transition pair of the minibatch is used to calculate the difference between the target value [TeX:] $$y_{i}$$ and the model value [TeX:] $$Q\left(s_{i}, a_{i} \mid \theta^{Q}\right)$$, and the critic network is updated with the gradients obtained from this difference. In addition, the stochastic policy gradient is used to update the parameters of the actor network as per line 15.

iii) Finally, the updated parameters of the critic and actor networks are used to update the target parameters of [TeX:] $$\mathcal{Q}^{\prime}$$ and [TeX:] $$\mathcal{A}^{\prime}$$ with a soft update weight for efficient and stable learning (i.e., (10) and (11)). Note that, after sampling, the sampled minibatch is refreshed with other transition pairs for a better learning procedure. The computational complexity of this algorithm depends on the stochastic policy gradient method for minimizing the loss, and the proposed algorithm does not exceed the complexity of the stochastic policy gradient.
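The numerical steps in ii) and iii) can be sketched without a deep learning framework. The code below assumes the standard DDPG forms for the critic target, the critic loss, and the soft target update; Algorithm 1 and (10)-(11) are not reproduced in this section, so these forms are an assumption rather than the authors' exact expressions. The names `discount` and `tau` are hypothetical, and `discount` is deliberately not called gamma because the paper uses that symbol for the learning rate.

```python
import random


def sample_minibatch(replay_buffer, batch_size, rng=random):
    """Draw `batch_size` distinct transitions uniformly at random."""
    return rng.sample(replay_buffer, batch_size)


def critic_target(reward, next_q, discount):
    """Target y_i = r_i + discount * Q'(s'_i, A'(s'_i)) (standard DDPG form)."""
    return reward + discount * next_q


def critic_loss(targets, predictions):
    """Mean squared difference between targets y_i and model values Q(s_i, a_i)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction `tau` toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]


# Toy usage with scalar "parameters" and made-up numbers.
buffer = [(s, 0.0, 1.0, s + 1) for s in range(10)]  # (state, action, reward, next_state)
batch = sample_minibatch(buffer, 4, random.Random(0))
y = [critic_target(r, next_q=0.5, discount=0.9) for (_, _, r, _) in batch]
loss = critic_loss(y, [1.0] * len(y))
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

A small `tau` keeps the target networks changing slowly, which is the usual motivation for the soft update in DDPG.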

##### C.4 Reward

The comprehensive revenue of [TeX:] $$\mathcal{M}$$ is used as the reward of our caching scheme. When [TeX:] $$\mathcal{M}$$ observes the state of the mmWave IoV network as [TeX:] $$\left\{C_{N \times K}, B_{N \times K}, H_{N \times K}, P_{N \times K}\right\}$$, it takes the composite action [TeX:] $$\left\{V_{N \times K} , L_{N \times K}\right\}$$ and obtains the next state [TeX:] $$\left\{C_{N \times K}^{\prime}, B_{N \times K}^{\prime}, H_{N \times K}^{\prime}, P_{N \times K}^{\prime}\right\}$$ together with the weighted reward sum [TeX:] $$R$$, which covers (i) quality variation, (ii) packet drop occurrence, and (iii) playback stall. These sub-rewards are denoted as [TeX:] $$r^{q}$$, [TeX:] $$r^{p}$$, and [TeX:] $$r^{s}$$, respectively. The total reward [TeX:] $$R$$ of each episode is calculated as the average of the rewards of the transitions sampled from the replay buffer [TeX:] $$\Phi$$. Note that the sub-rewards are calculated vehicle by vehicle and then summed within each reward domain as [TeX:] $$R^{q}$$, [TeX:] $$R^{p}$$, and [TeX:] $$R^{s}$$, respectively. That is, the episode reward [TeX:] $$R$$ is equal to [TeX:] $$R^{q}+R^{p}+R^{s}$$.

First, the quality reward [TeX:] $$r^{q}$$ is determined by the action [TeX:] $$V_{N \times K}$$ taken by [TeX:] $$\mathcal{M}$$. The IoV network environment [TeX:] $$e$$, which interacts with [TeX:] $$\mathcal{M}$$, calculates [TeX:] $$r^{q}$$ by comparing [TeX:] $$H_{N \times K}^{\prime}$$ and [TeX:] $$H_{N \times K}$$. Specifically, [TeX:] $$e$$ compares the cumulative average video quality in the two matrices and gives a weighted reward to [TeX:] $$\mathcal{M}$$ if the expected quality corresponding to the allocated power [TeX:] $$v_{i, j}$$ improves the provisioned video quality. That is, [TeX:] $$\mathcal{M}$$ obtains a reward if its action for [TeX:] $$u_{i}$$ results in a higher video quality than the previous average video quality. Otherwise, [TeX:] $$\mathcal{M}$$ receives a negative [TeX:] $$r^{q}$$ as a penalty, which represents the degradation of the QoS of [TeX:] $$u_{i}$$. The quality of video is determined by the data rate, which can be calculated by:

##### (12)

[TeX:] $$g_{i, j}=\frac{g_{i, j}^{T X} g_{i, j}^{R X} \mu^{2}}{16 \pi^{2}\left(\frac{d_{i, j}}{d_{0}}\right)^{\eta}}$$

##### (13)

[TeX:] $$S I N R_{i, j}=\frac{v_{i, j} g_{i, j}}{\sum_{k \in \mathcal{U}} v_{k, j} g_{k, j}+\sigma^{2}}$$

##### (14)

[TeX:] $$a_{i, j}=\frac{W}{K_{j}} \log _{2}\left(1+S I N R_{i, j}\right), \forall j \in \mathcal{X}$$

The [TeX:] $$g_{i, j}$$ in (12) represents the power gain from the [TeX:] $$j$$th mBS to the [TeX:] $$i$$th user. In addition, [TeX:] $$g_{i, j}^{T X}$$ and [TeX:] $$g_{i, j}^{R X}$$ stand for the transmit antenna gain and the receive antenna gain from the [TeX:] $$j$$th mBS to the [TeX:] $$i$$th user. Moreover, [TeX:] $$\mu$$ represents the wavelength, and [TeX:] $$d_{i, j}$$ and [TeX:] $$d_{0}$$ represent the distance from the [TeX:] $$j$$th mBS to the [TeX:] $$i$$th user and the far-field reference distance, respectively. Lastly, [TeX:] $$\eta$$ represents the path-loss exponent. The [TeX:] $$v_{i, j}$$ in (13) represents the transmit power from the [TeX:] $$j$$th mBS to the [TeX:] $$i$$th user, and [TeX:] $$\sigma^{2}$$ is the variance of the additive white Gaussian noise (AWGN). According to Shannon’s capacity formula, the achievable rate of the [TeX:] $$i$$th user from the [TeX:] $$j$$th mBS is given by (14). The [TeX:] $$W$$ stands for the system bandwidth, and [TeX:] $$K_{j}$$ is the total number of users associated with the [TeX:] $$j$$th mBS. Thus, each user can utilize [TeX:] $$1 / K_{j}$$ of the total frequency bandwidth of each mBS. Based on [TeX:] $$a_{i, j}$$, each user can receive the corresponding quality of video chunks from the associated mBS.
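The chain from (12) to (14) can be sketched as three small functions. The function and argument names are hypothetical. Because the printed index set of the sum in (13) does not state whether it excludes user i, the sketch leaves that choice to the caller by taking the interfering (power, gain) pairs as an explicit argument.

```python
import math


def path_gain(g_tx, g_rx, wavelength, distance, d0, eta):
    """Power gain of (12): antenna gains and wavelength over the path loss."""
    return (g_tx * g_rx * wavelength ** 2) / (
        16.0 * math.pi ** 2 * (distance / d0) ** eta
    )


def sinr(power, gain, interferers, noise_var):
    """SINR of (13); `interferers` holds the (power, gain) pairs in the sum."""
    interference = sum(p * g for p, g in interferers)
    return (power * gain) / (interference + noise_var)


def achievable_rate(bandwidth, users_at_mbs, sinr_value):
    """Rate of (14): each of the K_j users gets an equal share of W."""
    return (bandwidth / users_at_mbs) * math.log2(1.0 + sinr_value)


# Made-up numbers, only to show how the pieces connect.
g = path_gain(g_tx=10.0, g_rx=10.0, wavelength=0.005, distance=50.0, d0=1.0, eta=2.0)
s = sinr(power=1.0, gain=g, interferers=[(1.0, g / 10.0)], noise_var=1e-12)
rate = achievable_rate(bandwidth=1e9, users_at_mbs=10, sinr_value=s)
```

The resulting rate would then be mapped to a video quality level; that mapping is not specified in this section, so it is omitted here.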

Next, [TeX:] $$r^{p}$$ is calculated by observing the current buffer occupancy of each [TeX:] $$u_{i}$$, the allocation action [TeX:] $$L_{N \times K}$$ of [TeX:] $$\mathcal{M}$$, and the buffer saturation rate [TeX:] $$\mathcal{F}$$. If the sum of the buffer occupancy of [TeX:] $$u_{i}$$ and the cache at [TeX:] $$x_{j}$$, minus [TeX:] $$\mathcal{F}$$, exceeds [TeX:] $$\bar{b}$$, then packet drop occurs at [TeX:] $$u_{i}$$ during the video provisioning service. That is, [TeX:] $$\mathcal{M}$$ is punished with a negative [TeX:] $$r^{p}$$ because its action wasted the spectrum and power of the corresponding mBS. By exploiting this reward structure, the MBS learns to cache the video chunks in a way that avoids the computational overhead and communication loss caused by unnecessary delivery service.

Finally, [TeX:] $$r^{s}$$ can be computed by subtracting [TeX:] $$\mathcal{F}$$ from [TeX:] $$b_{i, j}$$ and adding [TeX:] $$c_{i, j}$$. If the result is positive, [TeX:] $$u_{i}$$ can play back the provisioned video chunks without any stall. On the other hand, if the result is less than zero, the video playback at [TeX:] $$u_{i}$$ can stall, which deteriorates the QoS of [TeX:] $$u_{i}$$. Therefore, we define [TeX:] $$R$$ as:

##### (15)

[TeX:] $$\begin{aligned} R=& R^{q}+R^{p}+R^{s} \\ =& \sum_{j=0}^{K-1} \sum_{i=0}^{N-1}\left(\psi r_{i, j}^{q}+\Xi r_{i, j}^{p}+\aleph r_{i, j}^{s}\right) \\ =& \sum_{j=0}^{K-1} \sum_{i=0}^{N-1}\left(\psi \cdot p_{i, j} \cdot r^{q}\left(\left|\left(\frac{h_{i, j}}{q_{i, j}}\right)^{+}\right|-\left|\left(\frac{h_{i, j}}{q_{i, j}}\right)^{-}\right|\right)\right.\\ &+\Xi \cdot p_{i, j} \cdot r^{p}\left(\left|\left(\frac{b_{i, j}+l_{i, j}-\mathcal{F}}{\bar{b}}\right)^{+}\right|-\left|\left(\frac{b_{i, j}+l_{i, j}-\mathcal{F}}{\bar{b}}\right)^{-}\right|\right) \\ &\left.+\aleph \cdot p_{i, j} \cdot r^{s}\left(\left|\left(\frac{1}{b_{i, j}+l_{i, j}-\mathcal{F}}\right)^{<}\right|-\left|\left(\frac{1}{b_{i, j}+l_{i, j}-\mathcal{F}}\right)^{>}\right|\right)\right) \end{aligned}$$

where [TeX:] $$\psi$$, [TeX:] $$\Xi$$, and [TeX:] $$\aleph$$ represent the reward weights of [TeX:] $$r^{q}$$, [TeX:] $$r^{p}$$, and [TeX:] $$r^{s}$$, respectively. The [TeX:] $$(a / b)^{+}$$ is a function that returns 1 if [TeX:] $$a<b$$ and 0 otherwise. Moreover, [TeX:] $$(a / b)^{-}$$ is a function that returns 1 if [TeX:] $$a>b$$ and 0 otherwise. The [TeX:] $$(1 / c)^{<}$$ function returns 0 if [TeX:] $$c \rightarrow-\infty$$ and returns 1 otherwise. Lastly, the [TeX:] $$(1 / c)^{>}$$ function returns 1 if [TeX:] $$c \rightarrow-\infty$$ and returns 0 otherwise. Using these functions, (15) gives the total system reward with respect to the quality variation, packet drop occurrence, and playback stall.
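A per-vehicle term of (15) can be sketched as follows. The quality and packet-drop terms use the two comparison indicators exactly as defined. For the stall term, the sketch follows the sign rule stated before (15) (positive buffer level means no stall, negative means stall), which is one reading of the stall indicators and should be treated as an assumption. All function and argument names are hypothetical.

```python
def indicator_plus(a, b):
    """(a / b)^+ : 1 if a < b, else 0."""
    return 1 if a < b else 0


def indicator_minus(a, b):
    """(a / b)^- : 1 if a > b, else 0."""
    return 1 if a > b else 0


def vehicle_reward(h, q, b, l, f, b_bar, p=1,
                   weights=(1.0, 1.0, 1.0), base=(1.0, 1.0, 1.0)):
    """One (i, j) term of (15).

    h: previous average quality, q: current quality, b: buffer occupancy,
    l: allocated cache, f: buffer saturation rate, b_bar: buffer bound,
    p: position indicator p_{i,j}, weights: (psi, Xi, aleph),
    base: magnitudes (r^q, r^p, r^s).
    """
    psi, xi, aleph = weights
    r_q, r_p, r_s = base
    level = b + l - f
    quality = r_q * (indicator_plus(h, q) - indicator_minus(h, q))
    drop = r_p * (indicator_plus(level, b_bar) - indicator_minus(level, b_bar))
    # Assumed reading of the stall term: +1 if the level is positive, -1 if negative.
    stall = r_s * ((1 if level > 0 else 0) - (1 if level < 0 else 0))
    return p * (psi * quality + xi * drop + aleph * stall)


def total_reward(terms):
    """Sum the per-vehicle terms over all (i, j) pairs."""
    return sum(vehicle_reward(**t) for t in terms)


example = total_reward([
    dict(h=2, q=3, b=3, l=2, f=1, b_bar=8),   # better quality, no drop, no stall
    dict(h=3, q=2, b=0, l=0, f=1, b_bar=8),   # worse quality, no drop, stall
])
```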

## V. PERFORMANCE EVALUATION

We ran various simulations to verify the performance of the proposed power-cache aware caching scheme in mmWave-based IoV networks. The caching scheme is evaluated by measuring the results with respect to the aforementioned rewards of interest, given that the state of the IoV network is observable by the agent. We used TensorFlow [39] to implement the proposed DDPG-based caching scheme. We first present the simulation settings and then discuss the results. Since much of the reinforcement learning literature reports results based on empirical convergence without complexity analysis [40]-[44], we present our simulation results following the same approach.

##### A. Setup

In the following, we describe the implementation details of the proposed DDPG-based video caching scheme in the mmWave IoV network. First, we introduce the hardware configuration for our simulation, and then present the overall design and implementation details of the software.

Hardware. We used an NVIDIA DGX Station equipped with 4 × Tesla V100 GPUs (128 GB of GPU memory in total) and a 20-core Intel Xeon E5-2698 v4 2.2 GHz CPU (256 GB of system memory in total).

Software. We used Python 3.6 on Ubuntu 16.04 LTS to build the DDPG-based caching scheme. In addition, we used the Xavier initializer to avoid vanishing gradients during the learning phase. The neural network is a fully connected deep neural network, and the number of nodes in the hidden layer was 200.

We implemented both the DDPG-based caching algorithm and the customized mmWave IoV network in a highway scenario. The agent in the DDPG-based caching algorithm continuously interacts with the dynamic IoV network environment and collects state transition pairs. In turn, the optimal caching policy can be acquired with the policy gradient once the learning phase has converged. The simulation parameters are summarized in Table 2.
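For readers unfamiliar with Xavier initialization, the following framework-free sketch shows the uniform variant, in which weights are drawn from a range that depends on the layer's fan-in and fan-out. The paper does not state whether the uniform or normal variant was used, and the input width of 8 below is a made-up value; only the hidden width of 200 comes from the text.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=None):
    """Return a fan_in x fan_out weight matrix with Xavier (Glorot) uniform values.

    Each weight is drawn from U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out)).
    """
    rng = rng or random.Random(0)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Hypothetical input width of 8 feeding the 200-node hidden layer.
W_hidden = xavier_uniform(fan_in=8, fan_out=200)
```

Scaling the range by the layer widths keeps the variance of activations and gradients roughly constant across layers, which is why this initializer helps against vanishing gradients.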

##### B. Converged Performance for Each Learning Rate

First, the caching scheme is evaluated with three different values of the learning rate [TeX:] $$\gamma$$. Fig. 3 to Fig. 5 show the convergence tendency of the learning phase over the episodes. Note that Fig. 3 to Fig. 5 share the same simulation setting of [TeX:] $$(K, N)=(20,200)$$ and [TeX:] $$\rho=0.143$$ and differ only in the learning rate. For each learning rate, the learning tendency of each reward is shown. For example, in Fig. 4, the impact of each reward category can be read from the gap between the measured values of the mixed rewards. As the green line, which represents the reward value without (w.o.) the packet drop occurrence reward, rises, it can be inferred that [TeX:] $$\mathcal{M}$$ forms a policy that treats playback stall and the quality of the provisioned video as more important than packet drop for maximizing the QoS. Similarly, the red line in Fig. 4 decreases and converges at a specific value. This suggests that the total reward of the caching scheme would be underestimated without the quality reward, i.e., the importance of the quality reward in the learning phase is not negligible.

When [TeX:] $$\gamma=10^{-4}$$, an interesting learning tendency can be observed in Fig. 5. While the red line in Fig. 5 barely changes over the entire learning phase, the other graphs increase dramatically and finally converge at the optimal point. That is, [TeX:] $$\mathcal{M}$$ learns the caching policy to maximize the quality reward rather than the other criteria. The red line, which is the mixed reward consisting of the packet drop occurrence reward and the playback stall reward, does not change, while the other two graphs increase sharply, indicating that the quality reward increases dramatically.

In summary, the total reward is illustrated in Fig. 6. Throughout the learning phase, [TeX:] $$\mathcal{M}$$ learns its caching policy under each learning rate, and the policy can be evaluated by the system reward criteria mentioned earlier. For [TeX:] $$\gamma=10^{-3}$$ and [TeX:] $$\gamma=5 \times 10^{-4}$$, [TeX:] $$\mathcal{M}$$ reaches a converged caching policy around the 100th episode. However, with a smaller [TeX:] $$\gamma$$, [TeX:] $$\mathcal{M}$$ optimizes its policy later. Therefore, our proposed power-cache aware video caching scheme accomplishes stable and optimal video provisioning service toward vehicles in mmWave-based distributed IoV networks.

Once the optimal caching policy is attained, [TeX:] $$\mathcal{M}$$ can immediately allocate power and cache units toward the distributed mBSs as soon as it observes the system state, and thus the caching scheme maximizes the QoS of the users. This differs from classical caching schemes, which need to recalculate the optimal caching strategy for each observation of the IoV network over time. Thus, the proposed caching scheme is well suited to optimal power and cache allocation of mBSs, provisioning superior quality and playback experience with seamless service.

##### C. Robustness on Scalability

In the following, we discuss the importance of scalability in IoV networks. As the scale of the considered IoV network grows, calibrating the optimal caching policy for seamless video services is hard to accomplish with classical approaches. Moreover, as the number of objectives to optimize grows, calculating the optimal point for seamless video services becomes harder as well.

Fig. 7 illustrates the convergence tendency of the total reward throughout the learning phase. Note that the total reward of each case is proportional to the scale of the IoV network. In addition, the [TeX:] $$\rho$$ of the FSMC model is set to 0.186, where the system average velocity of vehicles is 100 km/h. Moreover, the learning rate was set to [TeX:] $$10^{-3}$$. The action space in Fig. 3 to Fig. 6 was 4000 = 20×200, whereas the action spaces of the IoV networks in Fig. 7 are 5000, 7500, and 20000. That is, the robustness of the proposed caching scheme with respect to scalability is validated through the simulation in Fig. 7. Each scale of IoV network in Fig. 7 showed converged performance in provisioning optimal video quality and mitigating playback stall by learning power and cache allocation toward mBSs.

Next, the learning tendency of the average quality level with respect to the controlled transmit power of the mBS and the unit cache size of the mBS at various scales of IoV networks is presented in Figs. 8 and 9. Regarding power control, [TeX:] $$\mathcal{M}$$ at the scales [TeX:] $$(K, N)=(20,250)$$ and [TeX:] $$(K, N)=(25,300)$$ learns the optimal power allocation toward the roadside mBSs, which supports a sufficient data rate toward users so that the maximum video quality (i.e., 4K resolution) can be provisioned. In contrast, when the scale of the IoV network is [TeX:] $$(K, N)=(40,500)$$, which is 5× denser than the setting of Fig. 3 to Fig. 5, [TeX:] $$\mathcal{M}$$ learned to allocate power corresponding to 720p video toward users under limited spectrum availability.

Finally, [TeX:] $$\mathcal{M}$$ learns to allocate cache toward mBSs to support seamless video retrieval at neighboring users. As shown in Fig. 9, [TeX:] $$\mathcal{M}$$ at the scales [TeX:] $$(K, N)=(20,250)$$ and [TeX:] $$(K, N)=(25,300)$$ learns to allocate a smaller cache size than at the scale [TeX:] $$(K, N)=(40,500)$$. That is, [TeX:] $$\mathcal{M}$$ at the larger scale learns a caching policy with a low power utilization strategy. However, it stabilizes the distributed IoV network with a more generous cache size for each user so that the playback stall problem at users can be mitigated. For smaller scales, [TeX:] $$\mathcal{M}$$ aims to learn the caching scheme that maximizes the average quality level of the provisioned video (i.e., higher power allocation of the mBS). Therefore, the proposed power-cache aware video caching scheme in distributed mmWave IoV networks learns the optimal caching policy, which accomplishes optimal power and cache allocation toward mBSs and attains stable performance even for an enlarged scale of IoV networks.

## VI. CONCLUSION AND FUTURE WORK

We proposed a deep reinforcement learning-based video caching scheme in mmWave IoV networks to optimize the power consumption and cache allocation of mBSs with a minimum number of stall events for seamless services. With the proposed caching scheme, stable and optimized caching decisions in large-scale distributed IoV networks can be made as soon as the system state is observed. Through an extensive set of simulations, the proposed caching scheme is shown to be capable of learning over a massive action space with stable learning performance, even when the scale of the considered distributed IoV network is enlarged.

As future work, a real-world implementation and the corresponding prototype-based performance evaluation will be considered. Furthermore, additional performance evaluations comparing the scheme with other reinforcement learning algorithms will be conducted. Lastly, extending our work with multi-agent deep reinforcement learning algorithms is worth considering in order to build scalable large-scale systems with multiple distributed base stations. To guarantee convergence in multi-agent deep reinforcement learning, more sophisticated and well-designed reward functions and action spaces are needed.

## ACKNOWLEDGEMENTS

This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00170, Virtual Presence in Moving Objects through 5G) and also by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-20202017-0-01637) supervised by the IITP (Institute for Information & Communications Technology Promotion). J. Kim, A. Mohaisen, and W. Lee are the corresponding authors of this paper.

## Biography

##### Dohyun Kwon

Dohyun Kwon is currently a Research Engineer at Hyundai-Autoever, Seoul, Republic of Korea. He received his B.S. and M.S. degrees in Computer Science and Engineering from Chung-Ang University, Seoul, Republic of Korea, in 2018 and 2020, respectively. His research focus includes deep reinforcement learning for mobile networks.


##### Joongheon Kim

Joongheon Kim (M’06-SM’18) is currently an Assistant Professor of Electrical Engineering at Korea University, Seoul, Korea. He received the B.S. and M.S. degrees in Computer Science and Engineering from Korea University, Seoul, Korea, in 2004 and 2006, respectively; and the Ph.D. degree in Computer Science from the University of Southern California (USC), Los Angeles, CA, USA, in 2014. Before joining Korea University as an Assistant Professor in 2019, he was with LG Electronics as a research engineer (Seoul, Korea, 2006-2009), InterDigital as an intern (San Diego, CA, USA, 2012), Intel Corporation as a systems engineer (Santa Clara in Silicon Valley, CA, USA, 2013-2016), and Chung-Ang University as an Assistant Professor (Seoul, Korea, 2016-2019). He is a Senior Member of the IEEE. He was a recipient of Annenberg Graduate Fellowship with his Ph.D. admission from USC (2009), Intel Corporation Next Generation and Standards (NGS) Division Recognition Award (2015), Haedong Young Scholar Award by KICS (2018), IEEE Veh. Technol. Society (VTS) Seoul Chapter Award (2019), Outstanding Contribution Award by KICS (2019), Gold Paper Award from IEEE Seoul Section Student Paper Contest (2019), and IEEE Systems J. Best Paper Award (2020).


##### David Aziz Mohaisen

David Aziz Mohaisen earned his M.Sc. and Ph.D. degrees from the University of Minnesota in 2012. Currently, he is an Associate Professor of Computer Science at the University of Central Florida. Prior to joining Central Florida, he was an Assistant Professor at SUNY Buffalo (2015–2017), a Senior Research Scientist at Verisign Labs (2012–2015), and a Researcher at ETRI (2007–2009). He was awarded the Summer Faculty Fellowship from the US AFOSR (2016), the Best Student Paper at ICDCS (2017), the Best Paper Award at WISA (2014), the Best Poster Award at IEEE CNS (2014), and a Doctoral Dissertation Fellowship from the University of Minnesota (2011). He is on the editorial board of IEEE Trans. Mobile Computing. He is a Senior Member of ACM and a Senior Member of IEEE.


##### Wonjun Lee

Wonjun Lee received the B.S. and M.S. degrees in Computer Engineering from Seoul National University, Seoul, South Korea, in 1989 and 1991, respectively, the M.S. degree in Computer Science from the University of Maryland at College Park, College Park, MD, USA, in 1996, and the Ph.D. degree in Computer Science and Engineering from the University of Minnesota, Minneapolis, MN, USA, in 1999. In 2002, he joined the faculty of Korea University, Seoul, where he is currently a Professor with the Department of Computer Science and Engineering. He has authored or co-authored over 180 papers in refereed international journals and conferences. His research interests include communication and network protocols, optimization techniques in wireless communication and networking, security and privacy in mobile computing, and radio frequency powered computing and networking. Dr. Lee has served as a Technical Program Committee member for the IEEE International Conference on Computer Communications from 2008 to 2018, the Association for Computing Machinery International Symposium on Mobile Ad Hoc Networking and Computing from 2008 to 2009, the IEEE International Conference on Computer Communications and Networks from 2000 to 2008, and over 118 international conferences.

## References

- 1 L. Wei, R. Q. Hu, Y. Qian, G. Wu, "Key elements to enable millimeter wave communications for 5G wireless systems,"
*IEEE Wireless Commun.*, vol. 21, no. 6, pp. 136-143, Dec, 2014.doi:[[[10.1109/MWC.2014.7000981]]] - 2 M. A. Salkuyeh, B. Abolhassani, "Optimal video packet distribution in multipath routing for urban V ANETs,"
*J. Commun. Netw.*, vol. 20, no. 2, pp. 198-206, Apr, 2018.custom:[[[-]]] - 3 J. G. Andrews et al., "What will 5G be?,"
*IEEE J. Sel. Areas Commun.*, vol. 32, no. 6, pp. 1065-1082, June, 2014.custom:[[[-]]] - 4 T. E. Bogale, L. B. Le, "Massive MIMO and mmWave for 5G wireless HetNet: Potential benefits and challenges,"
*IEEE Veh. Technol. Magazine*, vol. 11, no. 1, pp. 64-75, Mar, 2016.custom:[[[-]]] - 5 J. Kim, G. Caire, A. F. Molisch, "Quality-aware streaming and scheduling for device-to-device video delivery,"
*IEEE /ACM Trans. Netw.*, vol. 24, no. 4, pp. 2319-2331, Aug, 2016.doi:[[[10.1109/TNET.2015.2452272]]] - 6] Cisco, "Cisco visual networking index: Global mobile data traffic forecast, 2016-2021 QA," https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visualnetworking-index-vni/vni-forecast-qa.html, July 2018, [Accessed: 201901-08 Cisco visual networking index: Global mobile data traffic forecast, 2016-2021 QA," https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visualnetworking-index-vni/vni-forecast-qa.html, July 2018, [Accessed: 201901-08-sciedit-2-03">
*Cisco,*, https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visualnetworking-index-vni/vni-forecast-qa.html,July2018,[Accessed:201901-08] , custom:[[[, https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visualnetworking-index-vni/vni-forecast-qa.html,July2018,[Accessed:201901-08] , ]]]. - 7 I. Parvez, A. Rahmati, I. Guvenc, A. I. Sarwat, H. Dai, "A survey on low latency towards 5G: RAN, core network and caching solutions,"
*IEEE Commun.SurveysTuts.Fourth Quarter*, vol. 20, no. 4, pp. 3098-3130, 2018.custom:[[[-]]] - 8 Y. Niu, C. Gao, Y. Li, L. Su, D. Jin, "Exploiting multi-hop relaying to overcome blockage in directional mmwave small cells,"
*J.Commun.Netw.*, vol. 18, no. 3, pp. 364-374, June, 2016.doi:[[[10.1109/JCN.2016.000052]]] - 9 J. Kim, Y. Tian, S. Mangold, A. F. Molisch, "Joint scalable coding and routing for 60 GHz real-time live HD video streaming applications,"
*IEEE Trans.Broadcasting*, vol. 59, no. 3, pp. 500-512, Sept, 2013.doi:[[[10.1109/TBC.2013.2273598]]] - 10 M. Baianifar, S. M. Razavizadeh, H. Akhlaghpasand, I. Lee, "Energy efficiency maximization in mmWave wireless networks with 3D beamforming,"
*J.Commun.Netw.*, vol. 21, no. 2, pp. 125-135, Apr, 2019.custom:[[[-]]] - 11 J. Kim, A. F. Molisch, "Fast millimeter-wave beam training with receive beamforming,"
*J. Commun. Netw.*, vol. 16, no. 5, pp. 512-522, Oct, 2014.doi:[[[10.1109/JCN.2014.000090]]] - 12 J. Kim, Y. Tian, S. Mangold, A. F. Molisch, "Quality-aware coding and relaying for 60 GHz real-time wireless video broadcasting,"
*in Proc. IEEE ICC*, pp. 5148-5152, June, 2013.custom:[[[-]]] - 13 S. Park, B. Kim, H. Yoon, S. Choi, "RA-eV2V: Relaying systems for LTE-V2V communications,"
*J.Commun.Netw.*, vol. 20, no. 4, pp. 198-206, Aug, 2018.doi:[[[10.1109/JCN.2018.000055]]] - 14 T. S. Rappaport et al., "Millimeter wave mobile communications for 5G cellular: It will work!,"
*IEEE Access*, vol. 1, no. 1, pp. 335-349, 2013.doi:[[[10.1109/ACCESS.2013.2260813]]] - 15 J. Kim, A. F. Molisch, "Quality-aware millimeter-wave device-todevice multi-hop routing for 5G cellular networks,"
*in Proc. IEEE ICC*, pp. 5251-5256, June, 2014.custom:[[[-]]] - 16 S. Zhang, N. Zhang, X. Fang, P. Yang, X. S. Shen, "Self-sustaining caching stations: Toward cost-effective 5G-enabled vehicular networks,"
*IEEE Commun.Mag.*, vol. 55, no. 11, pp. 202-208, Nov, 2017.doi:[[[10.1109/MCOM.2017.1700129]]] - 17 N. Magaia, Z. Sheng, P. R. Pereira, M. Correia, "REPSYS: A robust and distributed incentive scheme for in-network caching and dissemination in vehicular delay-tolerant networks,"
*IEEE Wireless Commun.*, vol. 25, no. 3, pp. 65-71, June, 2018.custom:[[[-]]] - 18 H. Ahlehagh, S. Dey, "Video-aware scheduling and caching in the radio access network,"
*IEEE /ACM Trans. Netw.*, vol. 22, no. 5, pp. 1444-1462, Oct, 2014.doi:[[[10.1109/TNET.2013.2294111]]] - 19
*Highway Data Explorer* (Online). Available: http://dtdapps.coloradodot.info/otis/HighwayData - 20 L. Yao, A. Chen, J. Deng, J. Wang, G. Wu, "A cooperative caching scheme based on mobility prediction in vehicular content centric networks,"
*IEEE Trans. Veh. Technol.*, vol. 67, no. 6, pp. 5435-5444, June 2017. doi:10.1109/TVT.2017.2784562 - 21 R. S. Sutton, A. G. Barto, "Reinforcement learning: An introduction,"
*MIT Press*, 2018. doi:10.1109/TNN.1998.712192 - 22 Y. Guo, Q. Yang, F. R. Yu, V. C. Leung, "Cache-enabled adaptive video streaming over vehicular networks: A dynamic approach,"
*IEEE Trans. Veh. Technol.*, vol. 67, no. 6, pp. 5445-5459, June 2018. doi:10.1109/TVT.2018.2817210 - 23 K. Poularakis, G. Iosifidis, A. Argyriou, I. Koutsopoulos, L. Tassiulas, "Caching and operator cooperation policies for layered video content delivery,"
*in Proc. IEEE INFOCOM*, pp. 1-9, Apr. 2016. - 24 Y. Huang, X. Song, F. Ye, Y. Yang, X. Li, "Fair caching algorithms for peer data sharing in pervasive edge computing environments,"
*in Proc. IEEE ICDCS*, pp. 605-614, June 2017. - 25 S. Fu, P. Duan, Y. Jia, "Content-exchanged based cooperative caching in 5G wireless networks,"
*in Proc. IEEE GLOBECOM*, pp. 1-6, Dec. 2017. - 26 S. Arabi, E. Sabir, H. Elbiaze, "Information-centric networking meets delay tolerant networking: Beyond edge caching,"
*in Proc. IEEE WCNC*, pp. 1-6, Apr. 2018. - 27 R. Kim, H. Lim, B. Krishnamachari, "Prefetching-based data dissemination in vehicular cloud systems,"
*IEEE Trans. Veh. Technol.*, vol. 65, no. 1, pp. 292-306, Jan. 2015. - 28 G. Mauri, M. Gerla, F. Bruno, M. Cesana, G. Verticale, "Optimal content prefetching in NDN vehicle-to-infrastructure scenario,"
*IEEE Trans. Veh. Technol.*, vol. 66, no. 3, pp. 2513-2525, June 2016. doi:10.1109/TVT.2016.2580586 - 29 M. Chen et al., "Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience,"
*IEEE J. Sel. Areas Commun.*, vol. 35, no. 5, pp. 1046-1061, May 2017. - 30 T. T. Le, R. Q. Hu, "Mobility-aware edge caching and computing in vehicle networks: A deep reinforcement learning,"
*IEEE Trans. Veh. Technol.*, vol. 67, no. 11, pp. 10190-10203, Nov. 2018. doi:10.1109/TVT.2018.2867191 - 31 Y. He, F. R. Yu, N. Zhao, V. C. Leung, H. Yin, "Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach,"
*IEEE Commun. Mag.*, vol. 55, no. 12, pp. 31-37, Dec. 2017. doi:10.1109/MCOM.2017.1700246 - 32 Y. He, N. Zhao, H. Yin, "Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach,"
*IEEE Trans. Veh. Technol.*, vol. 67, no. 1, pp. 44-55, Jan. 2017. doi:10.1109/TVT.2017.2760281 - 33 Y. He et al., "Deep reinforcement learning-based optimization for cache-enabled opportunistic interference alignment wireless networks,"
*IEEE Trans. Veh. Technol.*, vol. 66, no. 11, pp. 10433-10445, Nov. 2017. - 34 S. Lin et al., "Fast simulation of vehicular channels using finite-state Markov models,"
*IEEE Wireless Commun. Letters*, early access, 2019. - 35 Z. Ning, X. Wang, F. Xia, J. J. Rodrigues, "Joint computation offloading, power allocation, and channel assignment for 5G-enabled traffic management systems,"
*IEEE Trans. Ind. Inf.*, vol. 15, no. 5, pp. 3058-3067, May 2019. - 36 N. Wang, E. Hossain, V. K. Bhargava, "Joint downlink cell association and bandwidth allocation for wireless backhauling in two-tier HetNets with large-scale antenna arrays,"
*IEEE Trans. Wireless Commun.*, vol. 15, no. 5, pp. 3251-3268, Jan. 2016. doi:10.1109/TWC.2016.2519401 - 37 K. Shanmugam, N. Golrezaei, A. G. Dimakis, A. F. Molisch, G. Caire, "Femtocaching: Wireless content delivery through distributed caching helpers,"
*IEEE Trans. Inf. Theory*, vol. 59, no. 12, pp. 8402-8413, Sept. 2013. doi:10.1109/TIT.2013.2281606 - 38 L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, "Web caching and Zipf-like distributions: Evidence and implications,"
*in Proc. IEEE INFOCOM*, pp. 126-134, Mar. 1999. - 39 Y. J. Mo, J. Kim, J.-K. Kim, A. Mohaisen, W. Lee, "Performance of deep learning computation with TensorFlow software library in GPU-capable multi-core computing platforms,"
*in Proc. IEEE ICUFN*, pp. 240-242, July 2017. - 40 J. Clausen, W. L. Boyajian, L. M. Trenkwalder, V. Dunjko, H. J. Briegel, "On the convergence of projective-simulation-based reinforcement learning in Markov decision processes,"
*arXiv preprint arXiv:1910.11914*, 2019. - 41 V. Mnih et al., "Asynchronous methods for deep reinforcement learning,"
*in Proc. ICML*, pp. 1928-1937, June 2016. - 42 T. P. Lillicrap et al., "Continuous control with deep reinforcement learning,"
*in Proc. ICLR*, pp. 1-14, May 2016. - 43 M. Hessel et al., "Rainbow: Combining improvements in deep reinforcement learning,"
*in Proc. AAAI*, pp. 1-8, Feb. 2018. - 44 H. V. Hasselt, A. Guez, D. Silver, "Deep reinforcement learning with double Q-learning,"
*in Proc. AAAI*, pp. 1-7, Feb. 2016.