Dynamic Video Delivery using Deep Reinforcement Learning for Device-to-Device Underlaid Cache-Enabled Internet-of-vehicle Networks

Minseok Choi , Myungjae Shin and Joongheon Kim

Abstract

This paper addresses an Internet-of-vehicle network that utilizes a device-to-device (D2D) underlaid cellular system, where distributed caching at each vehicle is available and the video streaming service is provided via D2D links. Given the spectrum reuse policy, three decisions with different timescales in such a D2D underlaid cache-enabled vehicular network are investigated: 1) The decision on the cache-enabled vehicles for providing contents, 2) power allocation for D2D users, and 3) power allocation for cellular vehicles. Since wireless link activation for video delivery could introduce delays, node association is determined at a larger timescale than power allocations. We jointly optimize these delivery decisions by maximizing the average video quality under constraints on the playback delays of streaming users and the data rate guarantees for cellular vehicles. Depending on the channel and queue states of users, the decision on the cache-enabled vehicle for video delivery is adaptively made based on the frame-based Lyapunov optimization theory by comparing the expected costs of vehicles. For each cache-enabled vehicle, the expected cost is obtained from the stochastic shortest path problem that is solved by deep reinforcement learning without the knowledge of global channel state information. Specifically, the deep deterministic policy gradient (DDPG) algorithm is adopted for dealing with the very large state space, i.e., time-varying channel states. Simulation results verify that the proposed video delivery algorithm achieves all the given goals, i.e., average video quality, smooth playback, and reliable data rates for cellular vehicles.

Keywords: Deep reinforcement learning , device-to-device underlaid network , vehicular networks , video delivery , wireless caching

I. INTRODUCTION

WITH the rapidly growing number of mobile devices, it is expected that tens of exabytes of global data traffic will be handled on a daily basis, 70% of which will consist of on-demand video streaming services [1]. In on-demand video streaming services, a small number of popular contents is requested at ultra-high rates, i.e., most of the requests are repeated, and thus, provision of the desired contents from remote base stations (BSs) would waste resources. In this regard, the wireless caching technology discussed in [2], whereby the BS pushes popular contents to cache-enabled nodes with limited storage space during off-peak times so that these nodes can provide popular contents directly to nearby mobile users, is advantageous for video streaming services.

As device-to-device (D2D) communication has become a promising technology for improving spectral efficiency, D2D-assisted caching networks have been studied [3], [4], where mobile devices can store popular contents and directly respond to the requests of neighboring users. Especially for delay-sensitive content delivery, the necessary decision on which cache-enabled device will deliver the desired content to the streaming user has been extensively researched. The simplest method is to let the cache-enabled node with the strongest channel condition deliver the content to the streaming user [5]. More advanced node association schemes have been developed for heterogeneous caching networks by jointly optimizing routing and caching [7], managing interference among D2D-assisted delivery links [8], and allowing cooperation between adjacent BSs [9]. However, these methods did not consider the different quality levels of contents and assumed that all the cached contents have the same size.

Since multimedia contents (e.g., video files) can be encoded into multiple versions that differ in quality level (e.g., in their peak signal-to-noise ratio (PSNR) and spatial resolution) [10], each cache-enabled node can store identical video contents at different quality levels. In this case, the decision on which of the cache-enabled nodes performs content delivery becomes closely related to the content quality that the user can enjoy [11]. There exist several studies on methods that dynamically select the quality level of the desired video [12]–[14] or maximize a network utility function of time-average video quality [15]. Whereas the video delivery policies presented in [12], [15] are operated at the BS side, this paper considers a scenario where users dynamically choose the desired content quality, as in dynamic adaptive streaming over HTTP (DASH) [16]. In [11], [13], [14], the authors also considered video quality selection at the user side, but not resource allocations in the D2D underlaid cellular system. In addition, the above studies [10]–[15] require global channel state information (CSI); however, in practical vehicular networks, it is difficult to track the time-varying channel gains due to the high mobility of vehicle users.

Reinforcement learning (RL) algorithms have recently been employed to proactively cache popular contents in scenarios without global CSI [17]–[19]; however, content delivery is not optimized in these works. In addition, an RL-based dynamic resource allocation method for edge computing networks was proposed in [20], but content delivery is not considered. Content placement and delivery are also jointly optimized in cache-enabled D2D networks, based on the deep Q-network (DQN) [21] and the deep deterministic policy gradient (DDPG) frameworks [22]. In addition, an RL-based content delivery policy of a mobile device with a service delay constraint was proposed in [23]. However, the above studies [21]–[23] did not consider the cache-enabled D2D underlaid cellular system and differentiated quality requirements of multimedia contents.

In parallel, D2D underlaid cellular systems have been extensively researched for efficient use of spectrum resources, in which frequency bands are shared by both cellular users (CUEs) and D2D users (DUEs). In general, when spectrum is allocated to CUEs, a newly generated D2D link reuses the spectrum of one of the CUEs; therefore, the DUE interferes with the existing cellular and other D2D links. In order to manage the interference as well as to maximize network performance, advanced power control and resource allocation have been proposed for the D2D underlaid cellular system in [24]–[31]. Both centralized and distributed power control schemes that improve the signal-to-interference-plus-noise ratio (SINR) were proposed and analyzed for D2D underlaid cellular networks in [24], and the scheme proposed in [25] achieves proportional fairness among users. As the D2D underlaid cellular system supports vehicle-to-vehicle (V2V) connections, global CSI is difficult to obtain due to the high mobility of vehicle users [32]. To deal with this issue, power controls that reduce the requirement of global CSI were presented by utilizing the vehicles’ geographic features [26], large-scale channel fading information [27], statistical CSI [28], or the path loss rather than instantaneous CSI [29]. Furthermore, deep learning-based and deep reinforcement learning (DRL)-based power allocation schemes that do not require global CSI have been proposed in [30] and [31], respectively. However, none of the above studies considers content delivery in cache-enabled D2D underlaid cellular networks. Content delivery in D2D underlaid cellular networks was researched in [33]; however, this work focuses on offloading cellular data traffic via D2D links rather than on node association for content delivery in caching networks. Cache-enabled D2D underlaid cellular networks have been studied in [34] and [35], which proposed a caching method and an incentive mechanism, respectively, but not content delivery.

In this context, this paper jointly optimizes the node association and resource allocations without global CSI for content delivery in the D2D underlaid cache-enabled Internet-of-vehicle system. This paper reflects two characteristics of cellular system-assisted D2D communications in the multimedia content delivery policy for wireless vehicular caching networks: spectrum reuse and high mobility. In the absence of global CSI, we propose a deep reinforcement learning (DRL)-based adaptive delivery scheme which learns a policy that makes the following decisions: 1) The cache-enabled vehicle that will deliver the desired content to the streaming user, 2) power allocations for D2D-assisted delivery links, and 3) power allocations for cellular links. Specifically, the delivery policy is learned by deep deterministic policy gradient (DDPG), which is a model-free and off-policy algorithm. The main contributions of this paper are as follows:

A framework combining the characteristics of the D2D underlaid cellular system, the vehicular network, and the wireless caching network is presented. For such a network, the joint optimization problem of three decisions having different timescales is formulated: 1) Association with the cache-enabled vehicle for delivering multimedia contents (e.g., video files), 2) power allocation of the cache-enabled vehicle delivering the content via the D2D link, and 3) power allocation of the CUE whose spectrum is reused by the D2D-assisted delivery link. The optimization problem maximizes the time-average video quality under the constraints on the limited playback delay of the DUE and the minimum data rate of the CUE.

The problem of dynamic power allocations for both cellular and D2D links sharing the identical spectrum without knowledge of global CSI is formulated as a Markov decision process and solved using the DRL approach. In contrast to the approaches in [26] and [27], the proposed approach dynamically changes power allocations to control interference and to limit the playback delay based on channel statistics and queue states. Since the state space is continuous and extremely large, we adopt a DDPG-based method to achieve efficient and improved learning of the delivery policy.

Considering the interference between the CUEs and DUEs, the decision on the cache-enabled vehicle that will deliver the content, i.e., the D2D transmitter, is made under the frame-based Lyapunov optimization theory [36], at a larger timescale than the power allocations of cellular and D2D links. With the help of the DRL-based power allocations determined at the smaller timescale, the node associations for content delivery can also be completed without global CSI.

We present an evaluation via data-intensive simulations to verify the proposed video delivery policy, as well as to show the advantages of Lyapunov optimization theory and DRL-based approaches.

II. NETWORK MODEL

This section describes the D2D underlaid cache-enabled vehicular network that we consider and introduces the user queue model. Since we focus on multimedia services (e.g., on-demand streaming), the delivered content chunks are accumulated in the user queue, and the playback latency or stall events are closely related to the queue dynamics.

A. D2D Underlaid Cache-enabled Vehicular Network

This paper addresses the D2D underlaid vehicular caching network where a certain vehicle user can request a video file from one of the cache-enabled vehicles in its vicinity while some CUEs are communicating with the BS, as shown in Fig. 1. In Fig. 1, we can see that both cellular and D2D links coexist, and CUE 1 (or 2) and DUE 1 (or 2) share the identical frequency band. The server has already pushed popular video files during off-peak hours to cache-enabled vehicles, whose storage size is finite. Since we focus on video delivery, caching policies are outside the scope of this paper and only cache-enabled vehicles that store the desired video are considered. Assume that the desired video has [TeX:] $$N$$ quality levels and that there is no quality controller in the cache-enabled vehicles; therefore, they can transmit only the video file of the fixed quality that the server pushes. In this case, the user can choose the quality level of the video file to be received; let [TeX:] $$q \in\{1, \cdots, N\}$$ denote the desired quality level. Therefore, the choice of the quality level of the video file to be received is consistent with the choice of the nearby cache-enabled vehicle for video delivery. When the video request of a certain user has been accepted at one of the nearby cache-enabled vehicles, a D2D link is activated between them for delivering the desired content. Therefore, the cache-enabled vehicle that is selected to deliver the requested content to the streaming user will simply be called the D2D transmitter.

Fig. 1. D2D underlaid cache-enabled vehicular network.

Suppose that CUEs send massive traffic data to nearby infrastructure, e.g., BSs or roadside units. These vehicles utilize high-capacity vehicle-to-infrastructure (V2I) communication via cellular links. Meanwhile, there exist several pairs of highly mobile DUEs for video delivery via D2D links. For the D2D underlaid cellular system, the spectrum reuse model is utilized, i.e., both CUEs and DUEs share the spectrum resources. Whenever a D2D link is generated, one of the orthogonal resources of the CUEs is reused by the D2D link. This paper considers only one CUE and a pair consisting of a D2D transmitter and a file-requesting user that reuses the spectrum of the considered CUE; thus, the spectrum reuse policy for multiple CUEs and DUEs is outside its scope.

Table 1. System description parameters.

[TeX:] $$N$$: Number of quality levels
[TeX:] $$\alpha(t)$$: Cache-enabled vehicle chosen at [TeX:] $$t$$
[TeX:] $$P_{c}(t)$$: Transmit power of cellular link at [TeX:] $$t$$
[TeX:] $$P_{d}(t)$$: Transmit power of D2D link at [TeX:] $$t$$
[TeX:] $$Q(t)$$: Queue backlog of DUE at [TeX:] $$t$$
[TeX:] $$W(t)$$: Virtual queue backlog at [TeX:] $$t$$
[TeX:] $$q$$: Video quality
[TeX:] $$\mathcal{P}(q)$$: Measure of video quality [TeX:] $$q$$
[TeX:] $$S(q)$$: Video file size of quality [TeX:] $$q$$
[TeX:] $$K$$: Number of frames
[TeX:] $$T$$: Time duration of a frame
[TeX:] $$t_{c}$$: Unit time slot duration
[TeX:] $$t_{k}$$: Beginning time of the [TeX:] $$k$$-th frame
[TeX:] $$\mathcal{T}_{k}$$: Time interval of the [TeX:] $$k$$-th frame
[TeX:] $$R_{c}(t)$$: Data rate of cellular link at [TeX:] $$t$$
[TeX:] $$R_{d}(t)$$: Data rate of D2D link at [TeX:] $$t$$
[TeX:] $$\eta_{c}$$: Minimum data rate of cellular link
[TeX:] $$P_{d}^{0}, P_{c}^{0}$$: Maximum power of D2D and cellular links
[TeX:] $$\mathcal{L}(t)$$: Lyapunov function
[TeX:] $$\lambda$$: Intensity of cache-enabled vehicle distribution
[TeX:] $$\mathcal{B}$$: Bandwidth
[TeX:] $$\sigma^{2}$$: Noise variance
[TeX:] $$V$$: Lyapunov coefficient
[TeX:] $$\mathcal{C}_{k}$$: Cache-enabled vehicle candidate set in the [TeX:] $$k$$-th frame
[TeX:] $$\Theta(t)$$: State of MDP at [TeX:] $$t$$
[TeX:] $$\Xi(t)$$: Action of MDP at [TeX:] $$t$$
[TeX:] $$\mathcal{S}$$: State space of MDP
[TeX:] $$\mathcal{A}$$: Action space of MDP
[TeX:] $$r(t)$$: Reward of MDP at [TeX:] $$t$$
[TeX:] $$P_{s^{\prime} s}$$: Transition probability from state [TeX:] $$s$$ to state [TeX:] $$s^{\prime}$$
[TeX:] $$\pi$$: Trained policy of DDPG algorithm

The cache-enabled vehicles are modeled using independent Poisson point processes (PPPs) with intensity [TeX:] $$\lambda$$. Assuming a probabilistic caching policy [37], let [TeX:] $$p_{q}$$ be the caching probability for the desired video of quality [TeX:] $$q$$. Then, the PPP intensity of vehicles caching the desired video of quality [TeX:] $$q$$ is [TeX:] $$\lambda p_{q}$$. Suppose that the system does not allow any additional D2D link that reuses the identical spectrum of the considered CUE and the streaming user who is already downloading the desired content from a certain cache-enabled vehicle. Then, the system can guarantee negligible interference between multiple D2D links. User mobility is also captured in the network model. The user is moving in a pre-determined direction and periodically searches for a cache-enabled vehicle from which to continuously receive the desired video file. As shown in Fig. 1, the geographical distribution of cache-enabled vehicles in the vicinity of the user that store the desired file can vary at each time slot, and therefore, the decision on the cache-enabled vehicle to be used for video delivery should be appropriately updated.
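To make the vehicle geometry concrete, the following minimal sketch (Python/NumPy) draws cache-enabled vehicles from a one-dimensional PPP and thins them by cached quality level, so that vehicles caching quality q form a PPP of intensity λp_q; the road length, intensity, caching probabilities, user position, and search radius below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

road_len_m = 1000.0              # assumed 1 km road segment
lam = 0.01                       # vehicles per meter (illustrative)
p_q = np.array([0.5, 0.3, 0.2])  # assumed caching probabilities for N = 3 qualities

# Number of cache-enabled vehicles on the segment ~ Poisson(lambda * length)
n_veh = rng.poisson(lam * road_len_m)
# Conditioned on the count, PPP points are uniformly distributed on the segment
pos = rng.uniform(0.0, road_len_m, size=n_veh)
# Independent thinning: assign the cached quality level with probability p_q
quality = rng.choice(len(p_q), size=n_veh, p=p_q) + 1   # quality levels 1..N

# Candidate set C_k: cached vehicles within a search radius of the streaming user
user_pos, radius = 400.0, 150.0
candidates = np.flatnonzero(np.abs(pos - user_pos) <= radius)
print(n_veh, "cached vehicles,", candidates.size, "candidates in C_k")
```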

B. User Queue Model and Channel Model

A video content consists of many sequential chunks. The user receives the video content from the cache-enabled vehicle and processes data for video streaming services in units of chunks. Each chunk accounts for a portion of the playback time of the entire stream. Even if all the chunks are in the correct sequence, the quality of each chunk can differ in dynamic streaming. Therefore, users can dynamically choose the video quality level for each chunk's processing time. In a queue model, playback delay occurs when the chunk to be played has not yet arrived in the queue. Thus, the receiver queue dynamics collectively reflect the various factors that cause the playback delay. Therefore, we focus on limiting the queueing delay by dynamically adjusting the queue backlogs of the streaming user.

In general, a queue model has its own arrival and departure processes. The queue dynamics of the D2D user in each discrete time slot [TeX:] $$t \in\{0,1, \cdots\}$$ can be represented as follows:

(1)
[TeX:] $$Q(t+1)=\max \{Q(t)+\mu(t)-c, 0\}$$

(2)
[TeX:] $$Q(0)=0,$$

where [TeX:] $$Q(t)$$ and [TeX:] $$\mu(t)$$ represent the queue backlog and the arrival process of the DUE receiver at slot [TeX:] $$t$$, respectively. Since the playback rate of streaming is usually unchanging, the departure is assumed to be a constant [TeX:] $$c$$ here for simplicity. The queue states are updated in each unit time slot [TeX:] $$t$$, and the interval of each slot is set to the channel coherence time, [TeX:] $$t_{c}$$. Assume a block fading channel whose gain is static during the processing of multiple chunks; then, [TeX:] $$t_{c}=c \tau$$, where [TeX:] $$\tau$$ is the chunk processing time and [TeX:] $$c$$ is a positive integer.

In this study, the queue backlog [TeX:] $$Q(t)$$ counts the number of video chunks in the queue. [TeX:] $$\mu(t)$$ represents the number of received chunks and clearly depends on the data rate of the D2D link between the streaming user and its associated cache-enabled vehicle and on the chunk size. The arrival process is:

(3)
[TeX:] $$\mu(t)=\left\lfloor\frac{R_{d}\left(\alpha(t), P_{c}(t), P_{d}(t), t\right) \cdot t_{c}}{S(q(\alpha(t)))}\right\rfloor,$$

where [TeX:] $$\alpha(t)$$, [TeX:] $$P_{c}(t)$$, and [TeX:] $$P_{d}(t)$$ denote the cache-enabled vehicle associated with the user (i.e., the D2D transmitter), the transmit power of the CUE, and the transmit power of [TeX:] $$\alpha(t)$$ at slot [TeX:] $$t$$, respectively, and [TeX:] $$q(\alpha(t))$$ is the quality level of the requested file that the D2D transmitter [TeX:] $$\alpha(t)$$ can provide. [TeX:] $$R_{d}\left(\alpha(t), P_{c}(t), P_{d}(t), t\right)$$ and [TeX:] $$S(q(\alpha(t)))$$ indicate the data rate of the D2D link and the chunk size of the requested file with the desired quality [TeX:] $$q(\alpha(t))$$ at slot [TeX:] $$t$$, respectively. Some video chunks can be only partially delivered as the channel condition varies at every time [TeX:] $$t$$. Because partial chunk transmission is meaningless in our algorithm, flooring is applied in (3).
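As a quick illustration of (1) and (3), the helper below computes the integer number of chunks delivered in one slot and updates the receiver queue; the numeric arguments are placeholders rather than the paper's settings.

```python
import math

def chunk_arrivals(rate_bps: float, t_c_sec: float, chunk_bits: float) -> int:
    """mu(t) in (3): number of whole chunks delivered in one slot (partial chunks discarded)."""
    return math.floor(rate_bps * t_c_sec / chunk_bits)

def queue_update(q: int, mu: int, c: int) -> int:
    """Q(t+1) in (1): add the received chunks, play out c chunks, clip at zero."""
    return max(q + mu - c, 0)

# Placeholder numbers, not the paper's settings
mu = chunk_arrivals(rate_bps=50e6, t_c_sec=0.5, chunk_bits=2621e3)
q_next = queue_update(q=40, mu=mu, c=15)
print(mu, q_next)   # 9 delivered chunks, updated queue backlog of 34
```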

A Rayleigh fading channel is assumed for the wireless links from all vehicles to infrastructure and to other vehicles. Denote the channel by [TeX:] $$h=\sqrt{X} \beta g$$, where [TeX:] $$X=A / d^{\gamma}$$ captures path loss, with [TeX:] $$d$$ being the distance between the transmitter and the receiver, [TeX:] $$A$$ the path loss constant, and [TeX:] $$\gamma$$ the path loss exponent. In addition, [TeX:] $$\beta$$ is a log-normal shadowing random variable with standard deviation [TeX:] $$\xi$$, and [TeX:] $$g$$ represents the fast fading component with complex Gaussian distribution [TeX:] $$g \sim \mathcal{C N}(0,1)$$. In vehicular networks, fast fading components are not easily estimated at the receiver side owing to users' high mobility. Thus, in this study we consider the situation in which only the slow fading components, i.e., [TeX:] $$X$$ and [TeX:] $$\beta$$, are known in advance, but the fast fading components are not.
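The channel model can be sampled as in the following sketch; the path loss constant A, exponent γ, and shadowing deviation ξ are illustrative, and the dB-to-amplitude conversion for the shadowing term reflects one common convention rather than a detail stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def channel_gain(d_m, A=1e-3, gamma=3.0, xi_db=3.0):
    """h = sqrt(X) * beta * g, with X = A / d^gamma (path loss),
    beta a log-normal shadowing term (std xi_db in dB), and g ~ CN(0, 1)."""
    X = A / d_m ** gamma
    beta = 10.0 ** (rng.normal(0.0, xi_db) / 20.0)          # dB draw mapped to amplitude
    g = (rng.normal() + 1j * rng.normal()) / np.sqrt(2.0)   # unit-variance complex Gaussian
    return np.sqrt(X) * beta * g

h = channel_gain(d_m=50.0)
print(abs(h) ** 2)   # instantaneous power gain |h|^2
```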

Consider a typical streaming user reusing the spectrum of a certain CUE. Then, the cellular link and the D2D link for streaming that share the spectrum are interfering with each other. Therefore, the data rate of the cellular link from the CUE to the BS is given by

(4)
[TeX:] $$R_{c}(t)=\mathcal{B} \log _{2}\left(1+\frac{P_{c}(t)\left|h_{c, B}(t)\right|^{2}}{P_{d}(t)\left|h_{d, B}(\alpha(t), t)\right|^{2}+\sigma^{2}}\right),$$

where [TeX:] $$\mathcal{B}$$ is the bandwidth and [TeX:] $$\sigma^{2}$$ is the normalized noise variance. [TeX:] $$h_{c, B}(t)$$ and [TeX:] $$h_{d, B}(\alpha(t), t)$$ are respectively the channel fading gains from the CUE and the D2D transmitter [TeX:] $$\alpha(t)$$ to the BS. Similarly, the data rate of the D2D link between the streaming user and the D2D transmitter [TeX:] $$\alpha(t)$$ is given by

(5)
[TeX:] $$R_{d}(t)=\mathcal{B} \log _{2}\left(1+\frac{P_{d}(t)\left|h_{d}(\alpha(t), t)\right|^{2}}{P_{c}(t)\left|h_{c, d}(t)\right|^{2}+\sigma^{2}}\right),$$

where [TeX:] $$h_{d}(\alpha(t), t)$$ and [TeX:] $$h_{c, d}(t)$$ are the channel gains of the D2D link for video delivery and of the interference link from the CUE to the streaming user, respectively. Note that the data rate of the video delivery link in (5) limits the number of chunks that can be delivered to the streaming user as given by (3), which depends on the associated cache-enabled vehicle as well as on the power allocations for cellular and D2D users.

Fig. 2. Different timescales for decisions on [TeX:] $$\alpha(t)$$, [TeX:] $$P_{d}(t)$$, and [TeX:] $$P_{c}(t)$$.
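Equations (4) and (5) differ only in which link is the target and which is the interferer, so a single helper covers both; the linear-scale powers and channel gains passed in below are placeholders.

```python
import numpy as np

def rate_bps(bandwidth_hz, p_signal, g_signal, p_interf, g_interf, noise_var):
    """Shannon rate with a single interferer, as in (4) and (5):
    B * log2(1 + P_sig * |h_sig|^2 / (P_int * |h_int|^2 + sigma^2))."""
    sinr = p_signal * g_signal / (p_interf * g_interf + noise_var)
    return bandwidth_hz * np.log2(1.0 + sinr)

# Placeholder linear-scale values
B, sigma2 = 2e6, 1e-13
P_c, P_d = 0.2, 0.2                                  # transmit powers in watts (assumed)
g_cB, g_dB, g_d, g_cd = 1e-9, 2e-10, 5e-9, 1e-10     # |h|^2 power gains (assumed)

R_c = rate_bps(B, P_c, g_cB, P_d, g_dB, sigma2)      # cellular link, interfered by the D2D link
R_d = rate_bps(B, P_d, g_d, P_c, g_cd, sigma2)       # D2D link, interfered by the CUE
print(R_c / 1e6, R_d / 1e6)                          # rates in Mbps
```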

III. DYNAMIC NODE ASSOCIATIONS AND POWER ALLOCATIONS

This section explains how the timescales of the decisions on node association and power control differ, and formulates the optimization problem that maximizes the average video quality under the constraints of limited playback latency for DUEs and data rate guarantees for CUEs.

A. Decisions at Different Timescales

The goal of this study is to determine the appropriate decisions for video delivery at each slot [TeX:] $$t$$ in the network model described in Section II: 1) The cache-enabled vehicle for video delivery [TeX:] $$\alpha(t)$$, 2) the transmit power of the cache-enabled vehicle [TeX:] $$P_{d}(t)$$, and 3) the transmit power of the cellular vehicle [TeX:] $$P_{c}(t)$$. The last decision is not directly related to video delivery, but significantly affects it in terms of interference. Here, the association with one of the cache-enabled vehicles in the vicinity of the user can be made only after it has been determined which vehicles store the desired file, the delivery has been requested, and the request has been accepted. Therefore, the decision on the cache-enabled vehicle [TeX:] $$\alpha(t)$$ takes longer than the dynamic power allocation. Thus, we consider the scenario where the decision on [TeX:] $$\alpha(t)$$ is made at a larger timescale than the decisions on [TeX:] $$P_{d}(t)$$ and [TeX:] $$P_{c}(t)$$.

In this context, the different timescales of the decisions on [TeX:] $$\alpha(t)$$, [TeX:] $$P_{d}(t)$$, and [TeX:] $$P_{c}(t)$$ are shown in Fig. 2. The D2D transmitter and the CUE allocate [TeX:] $$P_{d}(t)$$ and [TeX:] $$P_{c}(t)$$ at time slots [TeX:] $$t \in\{0,1,2, \cdots\}$$, but the update of the association with the D2D transmitter is performed at a larger timescale, [TeX:] $$t \in\{0, T, 2 T, \cdots\}$$, where [TeX:] $$T$$ is the time interval for the decision on [TeX:] $$\alpha(t)$$. The time slot for the [TeX:] $$k$$-th association is denoted by [TeX:] $$t_{k}=(k-1) T$$ for [TeX:] $$k \in\{1,2, \cdots\}$$. Let the [TeX:] $$k$$-th frame for updates of the association with the D2D transmitter [TeX:] $$\alpha(t)$$ be [TeX:] $$\mathcal{T}_{k}=\left\{t_{k}, t_{k}+1, \cdots, t_{k}+T-1\right\}$$. Hereinafter, we use [TeX:] $$\alpha_{k}$$ rather than [TeX:] $$\alpha(t)$$ for [TeX:] $$t \in \mathcal{T}_{k}$$, because [TeX:] $$\alpha_{k}$$ remains unchanged over the frame [TeX:] $$\mathcal{T}_{k}$$. As shown in Fig. 2, after [TeX:] $$\alpha_{k}$$ has been determined at time [TeX:] $$t_{k}$$, decisions on the transmit power levels of the D2D transmitter [TeX:] $$\alpha_{k}$$ and the CUE, i.e., [TeX:] $$P_{d}(t)$$ and [TeX:] $$P_{c}(t)$$, are made over [TeX:] $$t \in \mathcal{T}_{k}$$ to maximize the average streaming quality while guaranteeing the data rate of the CUE as well as limiting the playback delay of the streaming user.

The user can create a candidate set of nearby cache-enabled vehicles storing the desired video, denoted by [TeX:] $$\mathcal{C}_{k}$$, where [TeX:] $$\alpha\left(t_{k}\right) \in \mathcal{C}_{k}$$. In existing studies, the nearest vehicle was usually chosen, because it can provide the best channel condition. However, because the streaming user is moving and there exists an interfering cellular vehicle, an association with the nearest vehicle is not always the best choice. Therefore, the user collects up to [TeX:] $$N$$ candidate cache-enabled vehicles located in its vicinity for video delivery, and selects the one that can provide the best video quality while allowing the user to avoid playback delay during the frame [TeX:] $$\mathcal{T}_{k}$$.

B. Problem Formulation

For determining the appropriate video delivery policy, three performance metrics are considered: The average video quality, the playback delay of the streaming user, and the data rate of the CUE, the spectrum of which is reused by the streaming user. Based on these goals, we can formulate the optimization problem that maximizes the long-term average video quality constrained by the need to avert queue emptiness and guarantee the data rate of the CUE

(6)
[TeX:] $$\max _{\mathbf{P}_{\mathbf{d}}, \mathbf{P}_{\mathbf{c}}, \boldsymbol{\alpha}} \lim _{K \rightarrow \infty} \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\mathcal{P}\left(q\left(\alpha_{k}\right)\right)\right]$$

(7)
[TeX:] $$\text { s.t. } \lim _{K \rightarrow \infty} \frac{1}{K T} \sum_{t=1}^{K T} \mathbb{E}[Z(t)]<\infty$$

(8)
[TeX:] $$\lim _{K \rightarrow \infty} \frac{1}{K T} \sum_{t=1}^{K T} \mathbb{E}\left[R_{c}\left(P_{c}, P_{d}, t\right)\right] \geq \eta_{c}$$

(9)
[TeX:] $$0 \leq P_{d} \leq P_{0}^{d}$$

(10)
[TeX:] $$0 \leq P_{c} \leq P_{0}^{c},$$

where [TeX:] $$\mathcal{P}\left(q\left(\alpha_{k}\right)\right)$$ is the quality measure of [TeX:] $$q\left(\alpha_{k}\right)$$ and [TeX:] $$\eta_{c}$$ is the minimum data rate for the CUE. [TeX:] $$P_{0}^{d}$$ and [TeX:] $$P_{0}^{c}$$ are the power budgets of the cache-enabled vehicles and the CUE, respectively. The decision vectors are represented as [TeX:] $$\boldsymbol{\alpha}=\left[\alpha_{1}, \cdots, \alpha_{K}\right]$$, [TeX:] $$\mathbf{P}_{d}=\left[P_{d}(0), P_{d}(1), \cdots, P_{d}(K T-1)\right]$$, and [TeX:] $$\mathbf{P}_{c}=\left[P_{c}(0), P_{c}(1), \cdots, P_{c}(K T-1)\right]$$. The expectation in (6) is with respect to random channel realizations and the stochastic distribution of vehicles. The constraint (7) limits the playback delay of the streaming service, and the constraint (8) guarantees the minimum data rate of the CUE.

As mentioned previously, playback delay occurs when the next chunk to be played has not arrived in the queue, and therefore, the role of the constraint (7) is to avoid queue emptiness, where [TeX:] $$Z(t)=\tilde{Q}-Q(t)$$. Here, [TeX:] $$Z(t)$$ is introduced so that [TeX:] $$Q(t)$$ is sufficiently large to avert playback delay, and [TeX:] $$\tilde{Q}$$ is a sufficiently large parameter that affects the maximal queue backlog. From (2), the queue dynamics of [TeX:] $$Z(t)$$ can be represented as

(11)
[TeX:] $$Z(t+1)=\min \{Z(t)+c-\mu(t), \tilde{Q}\} \text { and } Z(0)=\tilde{Q}.$$

Although the update rules of [TeX:] $$Q(t)$$ and [TeX:] $$Z(t)$$ are different, both queue dynamics refer to the same video chunk processing. Therefore, playback delay due to emptiness of [TeX:] $$Q(t)$$ can be explained by the queueing delay of [TeX:] $$Z(t)$$. By Little's theorem [38], the expected value of [TeX:] $$Z(t)$$ is proportional to the time-averaged queueing delay. Our aim is to limit the queueing delay by addressing (7), and it is well known that Lyapunov control-based time-average optimization with (7) can keep [TeX:] $$Z(t)$$ bounded [39].

From the optimization problem in (6)–(10), we can intuitively see the manner in which decisions are made according to [TeX:] $$Z(t)$$. Suppose that the queue is almost empty, i.e., [TeX:] $$Q(t) \approx 0$$ and [TeX:] $$Z(t) \approx \tilde{Q}$$. In this case, the user prefers the cache-enabled vehicle that has a strong channel condition and stores a low-quality file, and the associated cache-enabled vehicle prefers to allocate more transmit power to reduce [TeX:] $$Z(t)$$. However, the large transmit power of the D2D transmitter will significantly interfere with the CUE. In addition, the geographical location of the D2D transmitter [TeX:] $$\alpha(t)$$ influences the data rate of the CUE. Conversely, when the chunks already accumulated in the queue are sufficient to avoid playback delay, i.e., [TeX:] $$Q(t) \approx \tilde{Q}$$ and [TeX:] $$Z(t) \approx 0$$, the streaming user will want to associate with the vehicle caching a high-quality video. In addition, it is preferable that the transmit power of the D2D transmitter be small so that the CUE is provided with a large data rate.

IV. DECISION ON CACHE-ENABLED VEHICLE FOR VIDEO DELIVERY

For avoiding the emptiness of the queue [TeX:] $$Q(t)$$, i.e., for pursuing the stability of [TeX:] $$Z(t)$$, the optimization problem of (6)–(10) is solved based on the Lyapunov optimization theory. We first transform the inequality constraint in (8) into the form of queue stability. Specifically, we define the virtual queue [TeX:] $$W(t)$$ with the update equation:

(12)
[TeX:] $$W(t+1)=\max \left\{W(t)+\eta_{c}-R_{c}\left(P_{c}, P_{d}, t\right), 0\right\}.$$

The strong stability of the virtual queue [TeX:] $$W(t)$$ forces the time-average of [TeX:] $$R_{c}\left(P_{c}, P_{d}, t\right)$$ to be large enough that the constraint in (8) is satisfied.
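For reference, the two bookkeeping recursions used by the controller, (11) for the playback-margin queue Z(t) and (12) for the virtual rate queue W(t), are sketched below; the numeric arguments are placeholders.

```python
def z_update(z: float, c: int, mu: int, q_tilde: float) -> float:
    """(11): Z grows when fewer than c chunks arrive in a slot and is capped at Q~."""
    return min(z + c - mu, q_tilde)

def w_update(w: float, eta_c: float, r_c: float) -> float:
    """(12): W grows whenever the CUE rate falls short of the target eta_c."""
    return max(w + eta_c - r_c, 0.0)

# Placeholder values
z = z_update(z=120.0, c=15, mu=9, q_tilde=500.0)     # -> 126.0
w = w_update(w=5.0e6, eta_c=30e6, r_c=28e6)          # -> 7.0e6
```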

In addition, because the timescale of the decision on [TeX:] $$\boldsymbol{\alpha}$$ is larger than that of the decisions on [TeX:] $$\mathbf{P}_{d}$$ and [TeX:] $$\mathbf{P}_{c}$$, the frame-based Lyapunov optimization theory [36] is used for the decision on the cache-enabled vehicle to be used for video delivery. Let [TeX:] $$\Theta(t)=[Z(t), W(t)]$$ be the concatenated vector of actual and virtual queue backlogs. Then, the quadratic Lyapunov function [TeX:] $$L[\Theta(t)]$$ is defined as

(13)
[TeX:] $$L(\Theta(t))=\frac{1}{2}\left\{Z(t)^{2}+\gamma W(t)^{2}\right\},$$

where [TeX:] $$\gamma$$ is a coefficient for adjusting the scales of [TeX:] $$Z(t)$$ and [TeX:] $$W(t)$$. Then, let [TeX:] $$\Delta(\cdot)$$ be the frame-based conditional Lyapunov drift, formulated as [TeX:] $$\mathbb{E}\left[L\left(t_{k}+T\right)-L\left(t_{k}\right) \mid \Theta\left(t_{k}\right)\right]$$, i.e., the drift over the time interval [TeX:] $$T$$. The dynamic policy is designed to solve the given optimization problem of (6)–(10) by observing the current queue states [TeX:] $$\Theta\left(t_{k}\right)$$ and determining the cache-enabled vehicle such that the upper bound on the frame-based drift-plus-penalty is minimized [36]:

(14)
[TeX:] $$\Delta\left(t_{k}\right)-V \mathbb{E}\left[\mathcal{P}\left(q\left(\alpha_{k}\right)\right) \mid \Theta\left(t_{k}\right)\right],$$

where V is the importance weight for quality improvement.

The upper bound on the Lyapunov drift can be derived from the Lyapunov function:

(15)
[TeX:] $$\begin{aligned} L[\Theta(t+1)]-L[\Theta(t)] &=\frac{1}{2}\left\{Z(t+1)^{2}-Z(t)^{2}+\gamma\left(W(t+1)^{2}-W(t)^{2}\right)\right\} \\ &\leq \frac{1}{2}\left\{c^{2}+\mu(t)^{2}+\gamma\left(\eta_{c}-R_{c}\left(P_{c}, P_{d}, t\right)\right)^{2}\right\} \\ &\quad+\left\{Z(t)(c-\mu(t))+\gamma W(t)\left(\eta_{c}-R_{c}\left(P_{c}, P_{d}, t\right)\right)\right\}. \end{aligned}$$

By summing (15) over [TeX:] $$t \in \mathcal{T}_{k}$$ and taking the expectation with respect to random channel realizations, the upper bound on the frame-based conditional Lyapunov drift is obtained as

(16)
[TeX:] $$\begin{aligned} \Delta\left(\boldsymbol{\Theta}\left(t_{k}\right)\right) \leq & B T+\mathbb{E}\left[\sum _ { t = t _ { k } } ^ { t _ { k } + T - 1 } \left\{Z(t)(c-\mu(t))\right.\\ &\left.\left.+\gamma W(t)\left(\eta_{c}-R_{c}\left(P_{c}, P_{d}, t\right)\right)\right\}\right] \end{aligned}$$

where we assume that the departure and arrival rates of all queues are bounded, and B is a constant such that

(17)
[TeX:] $$\frac{1}{2} \mathbb{E}\left[c^{2}+\mu(t)^{2}+\gamma\left(\eta_{c}-R_{c}\left(P_{c}, P_{d}, t\right)\right)^{2}\right] \leq B.$$

According to (14), minimizing the bound on frame-based drift-plus-penalty is consistent with minimizing

(18)
[TeX:] $$\begin{array}{l} \mathcal{D}\left(\alpha_{k}, \Theta\left(t_{k}\right), \mathbf{P}_{d, k}, \mathbf{P}_{c, k}\right) \\ =\mathbb{E}\left[\sum _ { t = t _ { k } } ^ { t _ { k } + T - 1 } \left\{Z(t)\left(c-\left\lfloor\frac{t_{c} R_{d}\left(\alpha_{k}, P_{d}, P_{c}, t\right)}{S\left(q\left(\alpha_{k}\right)\right)}\right\rfloor\right)\right.\right. \\ \quad+\gamma W(t)\left(\eta_{c}-R_{c}\left(P_{c}, P_{d}, t\right)\right) \\ \left.\left.\quad-V T \cdot \mathcal{P}\left(q\left(\alpha_{k}\right)\right)\right\} \mid \Theta\left(t_{k}\right)\right] \end{array},$$

because [TeX:] $$B T$$ is a constant. Let [TeX:] $$\mathbf{P}_{d, k}=\left[P_{d}\left(t_{k}\right), P_{d}\left(t_{k}+1\right), \cdots, P_{d}\left(t_{k}+T-1\right)\right]$$ and [TeX:] $$\mathbf{P}_{c, k}=\left[P_{c}\left(t_{k}\right), P_{c}\left(t_{k}+1\right), \cdots, P_{c}\left(t_{k}+T-1\right)\right]$$. Note that [TeX:] $$P_{c}$$ and [TeX:] $$P_{d}$$ change over time slots [TeX:] $$t \in \mathcal{T}_{k}$$. This frame-based drift-plus-penalty algorithm was shown in [36] to satisfy the queue stability constraints of (7)–(8) while maximizing the objective function of (6). For a given spectrum reuse policy, the minimum of the bound on the frame-based drift-plus-penalty over the power allocations is given by

(19)
[TeX:] $$\mathcal{D}\left(\alpha_{k}, \Theta\left(t_{k}\right)\right)=\min _{\mathbf{P}_{d, k}, \mathbf{P}_{c, k}} \mathcal{D}\left(\alpha_{k}, \Theta\left(t_{k}\right), \mathbf{P}_{d, k}, \mathbf{P}_{c, k}\right).$$

In Section V, we describe the determination of [TeX:] $$\mathbf{P}_{d, k}$$ and [TeX:] $$\mathbf{P}_{c, k}$$ by using the deep reinforcement learning approach, which aims to minimize (18) for a given pair of a streaming user and a CUE sharing the spectrum.

The system parameter [TeX:] $$V$$ in (18) is the weight factor for the term representing the video quality. The value of [TeX:] $$V$$ is important for controlling the queue backlogs and the quality level of the desired video, i.e., for choosing [TeX:] $$\alpha_{k}$$ at every frame. The appropriate initial value of [TeX:] $$V$$ needs to be obtained empirically, because it depends on the distributions of cache-enabled vehicles and the channel environments. In addition, [TeX:] $$V \geq 0$$ should be satisfied; if [TeX:] $$V<0$$, the optimization goal is converted into minimizing the video quality. Moreover, in the case of [TeX:] $$V=0$$, vehicle users aim only at keeping the queue backlog [TeX:] $$Z(t)$$ small by stacking up video chunks, and do not pursue a high-quality video. In contrast, when [TeX:] $$V \rightarrow \infty$$, vehicle users do not consider the queue state and thus simply associate with the cache-enabled vehicle that stores the highest-quality video. [TeX:] $$V$$ can be regarded as the parameter that partially controls the tradeoff among the video quality, the data rate of the CUE, and the queueing delay of the streaming user, which reflects the fact that the selection of the cache-enabled vehicle for video delivery explicitly adjusts the mutual interference between the CUE and DUEs.

With the initial condition of [TeX:] $$\Theta\left(t_{k}\right)$$, the system can compute [TeX:] $$\mathcal{D}\left(\alpha_{k}, \Theta\left(t_{k}\right)\right)$$ for all possible [TeX:] $$\alpha_{k} \in \mathcal{C}_{k}$$. Then, the optimal association policy of [TeX:] $$\alpha_{k}^{*}$$ that minimizes [TeX:] $$\mathcal{D}\left(\alpha_{k}, \Theta\left(t_{k}\right)\right)$$ can be obtained by

(20)
[TeX:] $$\alpha^{*}\left(t_{k}\right)=\underset{\alpha \in \mathcal{C}_{k}}{\arg \min } \mathcal{D}\left(\alpha\left(t_{k}\right), \Theta\left(t_{k}\right)\right).$$

In practice, since the number of cache-enabled vehicles in the vicinity of the streaming user is finite, the user can easily find a suitable vehicle for video delivery, i.e., make the decision on [TeX:] $$\alpha_{k}$$, via a greedy search.
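Conceptually, the association step reduces to evaluating the expected frame cost D(α, Θ(t_k)) for every candidate in C_k and taking the minimizer, as in (20). The sketch below assumes a hypothetical helper expected_frame_cost(alpha, theta_k) that returns this cost, e.g., by rolling out the learned power-control policy over one frame.

```python
def select_vehicle(candidates, theta_k, expected_frame_cost):
    """Greedy search over the candidate set C_k per (20): pick the cache-enabled
    vehicle whose expected frame-based drift-plus-penalty is smallest."""
    return min(candidates, key=lambda alpha: expected_frame_cost(alpha, theta_k))

# Usage sketch: candidates is C_k, theta_k = (Z(t_k), W(t_k)), and
# expected_frame_cost(alpha, theta_k) approximates D(alpha, Theta(t_k)) in (19).
# alpha_star = select_vehicle(candidates, (z_k, w_k), expected_frame_cost)
```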

V. DYNAMIC POWER ALLOCATIONS BY DEEP REINFORCEMENT LEARNING

A. Modeling of Markov Decision Process

According to (18), we can formulate the drift-plus-penalty minimization for the [TeX:] $$k$$-th frame as

(21)
[TeX:] $$\left\{\mathbf{P}_{d, k}^{*}, \mathbf{P}_{c, k}^{*}\right\}=\arg \min \mathcal{D}\left(\alpha_{k}, \Theta\left(t_{k}\right), \mathbf{P}_{d, k}, \mathbf{P}_{c, k}\right)$$

(22)
[TeX:] $$\text { s.t. } 0 \leq P_{d} \leq P_{0}^{d}$$

(23)
[TeX:] $$0 \leq P_{c} \leq P_{0}^{c}.$$

The problem of (21)–(23) can be modeled by a Markov decision process (MDP). In the network model, [TeX:] $$\alpha_{k}$$ is given, and [TeX:] $$\Theta(t)$$ for [TeX:] $$t \in \mathcal{T}_{k}$$ can be observed before making decisions on [TeX:] $$P_{c}(t)$$ and [TeX:] $$P_{d}(t)$$ at every time [TeX:] $$t \in \mathcal{T}_{k}$$.

The MDP is defined as [TeX:] $$M=\{\mathcal{S}, \mathcal{A}, T, r\}$$, where [TeX:] $$\mathcal{S}$$ denotes the state space, [TeX:] $$\mathcal{A}$$ denotes the action space, [TeX:] $$T$$ denotes the transition model, and [TeX:] $$r$$ denotes the reward structure. The queue backlog vector [TeX:] $$\Theta(t)$$ represents the current state, which satisfies the Markov property. The state space is [TeX:] $$\mathcal{S}=\mathcal{Z} \times \mathbb{R}^{+}$$, where [TeX:] $$Z(t) \in \mathcal{Z}=\{0,1, \cdots, \tilde{Q}\}$$ and [TeX:] $$W(t) \in \mathbb{R}^{+}$$, with [TeX:] $$\mathbb{R}^{+}$$ representing the set of nonnegative real numbers. The action set consists of the power allocations for the D2D transmitter and the CUE, i.e., [TeX:] $$P_{d}(t)$$ and [TeX:] $$P_{c}(t)$$ for [TeX:] $$t \in \mathcal{T}_{k}$$. Denote the action at slot [TeX:] $$t$$ by [TeX:] $$\Xi(t)=\left[P_{c}(t), P_{d}(t)\right]$$. By letting the action space be [TeX:] $$\mathcal{A}=\left[0, P_{0}^{d}\right] \times\left[0, P_{0}^{c}\right]$$, the constraints of (22) and (23) are satisfied. Let the power allocations for both the CUE and the DUE be uniformly discretized into [TeX:] $$N_{A}$$ levels; the finite action space is then represented by [TeX:] $$\mathcal{A}=\left\{0, P_{0}^{d} / N_{A}, \cdots,\left(N_{A}-1\right) P_{0}^{d} / N_{A}\right\} \times\left\{0, P_{0}^{c} / N_{A}, \cdots,\left(N_{A}-1\right) P_{0}^{c} / N_{A}\right\}$$. The action decisions are made over the [TeX:] $$k$$-th frame, i.e., [TeX:] $$t \in \mathcal{T}_{k}$$, and according to (18), the reward (i.e., the incurred cost with a negative sign) at each slot [TeX:] $$t \in \mathcal{T}_{k}$$ is represented by

(24)
[TeX:] $$\begin{array}{l} r(\Theta(t), \Xi(t))=Z(t)\left(c-\left\lfloor\frac{t_{c} R_{d}\left(\alpha_{k}, P_{d}, P_{c}, t\right)}{S\left(q\left(\alpha_{k}\right)\right)}\right\rfloor\right) \\ +\gamma W(t)\left(\eta_{c}-R_{c}\left(P_{c}, P_{d}, t\right)\right)-V T \cdot \mathcal{P}\left(q\left(\alpha_{k}\right)\right) \end{array};$$

therefore, the reward [TeX:] $$r$$ is the cost in (24) with its sign reversed. At every slot [TeX:] $$t$$, channel gains are randomly generated, and state transitions occur according to random network events and the current queue state [TeX:] $$\Theta(t)$$. The transition from [TeX:] $$\Theta(t)$$ to [TeX:] $$\Theta(t+1)$$ is defined as

(25)
[TeX:] $$P_{s^{\prime} s}(\xi)=\operatorname{Pr}\left\{\Theta(t+1)=s^{\prime} \mid \Theta(t)=s, \Xi(t)=\xi\right\}$$

for all states s,[TeX:] $$s^{\prime} \in \mathcal{S}$$ and [TeX:] $$\xi \in \mathcal{A}$$.
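The discretized action space and the per-slot cost in (24) are straightforward to code; the sketch below builds the (P_c, P_d) grid and evaluates the cost whose negative is fed back to the agent as the reward (the power budgets passed in are placeholders, and the rate values are assumed to come from the environment).

```python
import itertools
import math

def action_grid(p0_c, p0_d, n_a):
    """Discrete action space: N_A power levels per link, combined as (P_c, P_d) pairs."""
    levels_c = [i * p0_c / n_a for i in range(n_a)]
    levels_d = [i * p0_d / n_a for i in range(n_a)]
    return list(itertools.product(levels_c, levels_d))

def slot_cost(z, w, c, r_d, r_c, t_c, chunk_bits, eta_c, gamma, v, t_frame, quality):
    """Per-slot cost in (24); the RL reward is -slot_cost(...)."""
    delivered = math.floor(r_d * t_c / chunk_bits)   # chunks delivered this slot
    return (z * (c - delivered)
            + gamma * w * (eta_c - r_c)
            - v * t_frame * quality)

actions = action_grid(p0_c=0.2, p0_d=0.2, n_a=10)    # placeholder budgets in watts
```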

According to the Bellman optimality equation, the minimum incurred cost starting from [TeX:] $$\Theta\left(t_{0}\right)=s_{0}$$ is given by

(26)
[TeX:] $$\begin{aligned} &\min _{\Xi} \mathbb{E}\left[\sum_{t=t_{0}}^{T} r(\Theta(t), \Xi(t)) \mid \Theta\left(t_{0}\right)=s_{0}\right] \\ &=\min _{\xi} \mathbb{E}\left[r\left(s_{0}, \xi\right)+G\left(\Theta\left(t_{0}+1\right)\right) \mid \Theta\left(t_{0}\right)=s_{0}, \Xi\left(t_{0}\right)=\xi\right] \\ &=\min _{\xi}\left[r\left(s_{0}, \xi\right)+\sum_{s^{\prime} \in \mathcal{S}} P_{s^{\prime} s_{0}}(\xi) G\left(s^{\prime}\right)\right], \end{aligned}$$
where [TeX:] $$G(\cdot)$$ denotes the expected cost-to-go from the next state.

Note that the channel information is not known and the state transition probabilities are not given; therefore, we solve the problem (21)–(23) by using a DRL algorithm. Based on the finite MDP, the goal of reinforcement learning is to train a policy [TeX:] $$\pi \in \Pi: \mathcal{S} \times \mathcal{A} \rightarrow[0,1]$$ which assigns each action candidate at every state a probability value in [0, 1]. The policy [TeX:] $$\pi$$ maps the state of the environment to the action that maximizes the expected reward. Denote the expected reward under the policy [TeX:] $$\pi$$ by [TeX:] $$\mathcal{J}(\pi)$$. With finite [TeX:] $$T$$ steps, [TeX:] $$\mathcal{J}(\pi)$$ can be described as the accumulation of the reward at each time step:

(27)
[TeX:] $$\mathcal{J}(\pi)=\mathbb{E}\left[\sum_{t=0}^{T} \delta^{t} r(\Theta(t), \Xi(t)) \mid \pi\right]$$

where [TeX:] $$\delta$$ is a discount factor that adjusts the effect of future rewards to the current decision. The optimal policy [TeX:] $$\pi^{*}$$ is

(28)
[TeX:] $$\pi^{*}=\arg \max _{\pi} \mathcal{J}(\pi) .$$

In deep reinforcement learning, the policy [TeX:] $$\pi$$ is approximated with a parameter set [TeX:] $$\theta$$. The state sequence [TeX:] $$\mathbf{s}$$ follows the distribution induced by the policy [TeX:] $$\pi_{\theta}$$. Then, the expected reward obtained by the state sequence [TeX:] $$\mathbf{s}$$ and the policy [TeX:] $$\pi_{\theta}$$ can be denoted as [TeX:] $$\mathcal{J}\left(\mathbf{s}, \pi_{\theta}(\mathbf{s})\right)$$, and the objective of reinforcement learning is formulated as:

(29)
[TeX:] $$\arg \max _{\theta} \mathbb{E}_{s \sim \pi_{\theta}}\left[\mathcal{J}\left(\mathbf{s}, \pi_{\theta}(\mathbf{s})\right)\right].$$

The following subsections describe the deep reinforcement learning algorithms that can solve (29) by finding the optimal policy for every state at each time step t. In the following subsections, [TeX:] $$s_{t}$$, [TeX:] $$a_{t}$$, and [TeX:] $$r_{t}$$ are used for state, action, and reward at time step t for simplicity, rather than [TeX:] $$\Theta(t)$$, [TeX:] $$\Xi(t)$$, and [TeX:] $$r(\Theta(t), \Xi(t))$$.

B. Deep Q-network (DQN)

The deep Q-network (DQN) is one of the breakthrough deep reinforcement learning algorithms that apply a neural network to reinforcement learning. A neural network is used to approximate the state-action value functions (Q-functions). The Q-functions approximated by the neural network allow the DQN to work directly with a high-dimensional state space. The concept of the DQN is based on the classic Q-learning algorithm. In classical Q-learning, the Q-value of a state-action pair is estimated through iterative updates based on multiple interactions with the environment. Therefore, in a DQN, for every action taken in a state, the immediate reward received and the expected Q-value of the new state are used to update the Q-functions. Accordingly, the objective of a DQN is described as

(30)
[TeX:] $$\arg \min _{\theta} \ell_{D Q N}(\theta)=\arg \min _{\theta}\left(\mathcal{Q}\left(s_{t}, a_{t} ; \theta\right)-\overline{\mathcal{Q}}\left(s_{t}, a_{t} ; \theta\right)\right)^{2},$$

where [TeX:] $$s_{t} \in \mathcal{S}$$ is the state at time [TeX:] $$t$$, [TeX:] $$a_{t} \in \mathcal{A}$$ is the action selected at [TeX:] $$s_{t}$$, and [TeX:] $$\theta$$ is the parameter set of the Q-functions. [TeX:] $$\overline{\mathcal{Q}}\left(s_{t}, a_{t} ; \theta\right)$$ is the target Q-value derived from the current Q-functions at time [TeX:] $$t$$, i.e., [TeX:] $$\overline{\mathcal{Q}}\left(s_{t}, a_{t} ; \theta\right)=r_{t}+\delta \max _{a} \mathcal{Q}\left(s_{t+1}, a ; \theta\right)$$. Although DQNs show successful and high performance in many domains, a DQN selects actions by maximizing the Q-function over the action set, and thus it can be used only for a discrete action space.
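In other words, the DQN regresses Q(s_t, a_t; θ) onto the bootstrapped target r_t + δ max_a Q(s_{t+1}, a; θ). A minimal sketch of this target, assuming a q_values(state) helper that returns one value per discrete action:

```python
import numpy as np

def dqn_target(reward, next_state, q_values, delta, done=False):
    """Bootstrapped TD target used in the DQN loss (30)."""
    if done:
        return reward
    return reward + delta * np.max(q_values(next_state))

# The squared loss for one transition is then (q_values(state)[action] - target) ** 2.
```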

C. Deterministic Policy Gradient (DPG)

Policy gradient methods attempt to learn a policy function directly, in contrast to a DQN, which attempts to learn action-value (Q) functions. The basic idea is to increase the probabilities of actions that lead to high rewards and reduce the probabilities of actions that lead to low rewards until the optimal policy is trained. In deterministic policy gradient (DPG) methods, a neural network is used to approximate the policy. The policy is trained by updating its parameters via stochastic gradient optimization. The objective of the DPG method is described as

(31)
[TeX:] $$\arg \max _{\theta} \ell_{D P G}(\theta)=\arg \max _{\theta} \mathbb{E}\left[\log \pi_{\theta}\left(a_{t} \mid s_{t} ; \theta\right) \hat{r}_{t}\right],$$

where [TeX:] $$\hat{r}_{t}$$ is the reward that is returned from the environment. The DPG method is an on-policy algorithm and can be used for environments having either discrete or continuous action spaces.

D. Deep Deterministic Policy Gradient (DDPG)

Although DQNs can solve problems with high-dimensional state spaces, they can handle only discrete and low-dimensional action spaces. However, the action spaces of the environments of many applications (e.g., proactive caching, resource management) are continuous and high dimensional. As mentioned previously, DQN algorithms cannot be straightforwardly applied to continuous actions, because a DQN relies on selecting the action that maximizes the Q-value function. When there exists a finite number of discrete actions, the action that maximizes the Q-value can be selected, because the possible Q-values at the state can be computed directly for each action. However, when the action space is continuous, it is difficult to evaluate the Q-values exhaustively.

The deep deterministic policy gradient (DDPG) algorithm concurrently learns the Q-value function and the policy. The action-value Q-function is learned, and it is also used to learn the policy. In DDPG, the function [TeX:] $$\mathcal{Q}^{*}(s, a)$$ is approximated by a neural network, as in a DQN. Because the action space is continuous, the function [TeX:] $$\mathcal{Q}^{*}(s, a)$$ is differentiable with respect to the action; thus, the policy can be updated efficiently. [TeX:] $$\mathcal{Q}_{\phi}(s, a)$$, which is approximated using the parameter set [TeX:] $$\phi$$, is updated by minimizing the mean-squared Bellman error (MSBE):

(32)
[TeX:] $$\ell(\phi, \mathcal{D})=\mathbb{E}\left[\left(\mathcal{Q}\left(s_{t}, a_{t} ; \phi\right)-\overline{\mathcal{Q}}\left(s_{t}, a_{t} ; \phi\right)\right)^{2}\right],$$

where [TeX:] $$\mathcal{D}$$ is a set of transitions [TeX:] $$\left(s, a, r, s^{\prime}, d\right)$$. DDPG aims to learn a deterministic policy [TeX:] $$\pi_{\theta}(s)$$, which gives the action that maximizes [TeX:] $$\mathcal{Q}_{\phi}(s, a)$$:

(33)
[TeX:] $$\max _{\theta} \mathbb{E}_{s \sim \mathcal{D}}\left[\mathcal{Q}_{\phi}\left(s, \pi_{\theta}(s)\right)\right].$$

In summary, the DPG is used for finding the optimal deterministic policy by using the gradient method, and the DQN provides the optimal stochastic policy by using the deep Q-network. Lastly, the DDPG employs the deep learning framework for deriving the DPG, especially when the state and action spaces are continuous and/or the state transition model is not tractable.
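A condensed sketch of one DDPG update step is given below in TensorFlow 2; the replay buffer, network definitions, optimizers, and target-network soft updates are assumed to exist elsewhere, and this is a generic DDPG step rather than a reproduction of the paper's implementation. The critic is fit to the MSBE target of (32) and the actor is updated to maximize the critic's value as in (33).

```python
import tensorflow as tf

def ddpg_step(batch, actor, critic, target_actor, target_critic,
              actor_opt, critic_opt, delta):
    """One DDPG update on a replay batch; s, a, r, s2 are float32 tensors
    with r shaped (batch_size, 1) to match the critic output."""
    s, a, r, s2 = batch

    # Critic: minimize the mean-squared Bellman error of (32)
    with tf.GradientTape() as tape:
        target_q = r + delta * target_critic(tf.concat([s2, target_actor(s2)], axis=1))
        q = critic(tf.concat([s, a], axis=1))
        critic_loss = tf.reduce_mean(tf.square(q - tf.stop_gradient(target_q)))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Actor: maximize Q_phi(s, pi_theta(s)) as in (33), i.e., minimize its negative
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic(tf.concat([s, actor(s)], axis=1)))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    return critic_loss, actor_loss
```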

VI. SIMULATION RESULTS

This section presents simulation results to verify the advantages of the proposed dynamic node association and power allocation policy for content delivery in cache-enabled D2D underlaid cellular systems. We leveraged TensorFlow in our simulations to implement the proposed DDPG-based scheme. As shown in (6)–(10), the main performance metrics of streaming users (i.e., DUEs) are the average content quality and the delay incidence, and that of CUEs is the average data rate. Therefore, we first show the convergence of the proposed DDPG algorithm in the D2D underlaid cache-enabled vehicular network, and then compare the performance metrics of the proposed scheme with those of other techniques.

A. Simulation Environments

In the simulation, the scenario of the D2D underlaid cache-enabled vehicular network shown in Fig. 1 was considered, in which vehicles are moving in traffic lanes and the BS is located between the incoming and outgoing lanes. Some vehicles are communicating with the BS via cellular links, and streaming users (vehicles) receive the desired video chunks from nearby cache-enabled vehicles via D2D links. We assume that all vehicles can activate only one type of link, i.e., a vehicle communicating via a cellular link cannot activate a D2D link and vice versa. Since streaming users having the desired videos in their own storage do not activate a D2D link, we consider only the streaming users requesting contents from other cache-enabled devices. For the D2D underlaid cellular model, the spectrum of one CUE can be reused by one D2D link. Since this paper focuses on node association and power control for content delivery in the cache-enabled D2D underlaid cellular system, spectrum allocation is out of scope; therefore, suppose that the spectrum reuse policy is given, i.e., the pairs of cellular and D2D links sharing the spectrum have already been determined. For simplicity, we also suppose that the maximum transmit power and noise variance of both cellular and D2D links are the same, i.e., [TeX:] $$P_{0}^{c}=P_{0}^{d}=23$$ dBm and [TeX:] $$\sigma^{2}=-114$$ dBm. In addition, [TeX:] $$\eta_{c}=30$$ Mbps and [TeX:] $$\mathcal{B}=2$$ MHz are used. The shadowing standard deviations for cellular and D2D links are 8 dB and 3 dB, respectively.

As shown in Fig. 1 and Table 1, the rectangular road, the size of which is [TeX:] $$r_{x} \times r_{y}$$, consists of four lanes in each direction. [TeX:] $$r_{x}=1$$ and [TeX:] $$r_{y}=\left(N_{l}+1\right) w_{l}$$, where [TeX:] $$w_{l}=4 \mathrm{~m}$$, because there are eight lanes and one lane between the two four-lane parts of the road for the BS. The BS is located at (0, 0). For simplicity, the location of the CUE is fixed at (200, 8); however, the location of the DUE (i.e., the streaming user) is randomly chosen. We assume a constant speed of [TeX:] $$v$$ = 60 km/h for every vehicle, and cache-enabled vehicles are distributed with [TeX:] $$\lambda$$ = 0.01. We consider three quality levels whose PSNR measures and file sizes are [TeX:] $$\mathcal{P}(q)$$ = [34, 36.64, 39.11] dB and [TeX:] $$S(q)$$ = [2621, 5073, 10658] kbits, respectively. In addition, [TeX:] $$T$$ = 10, [TeX:] $$t_{c}$$ = 1 ms, [TeX:] $$V$$ = 0.3, [TeX:] $$c$$ = 15, [TeX:] $$\tilde{Q}$$ = 500, and [TeX:] $$\gamma=10^{-6}$$ are used in this section.

For the simulation, we employed an NVIDIA DGX Station equipped with 4 Tesla V100 GPUs (a total of 128 GB of GPU memory) and an Intel Xeon E5-2698 v3 2.2 GHz CPU with 20 cores (256 GB of system memory in total). In addition, Python version 3.6 on Ubuntu 16.04 LTS was utilized to build the DDPG-based node association and power control scheme. We also used a Xavier initializer [40] to avoid vanishing gradients during the learning phase. The neural network was constructed as a fully connected deep neural network (DNN), and the number of nodes in the hidden layer was 200. The discount factor for the reward is [TeX:] $$\delta=10^{-6}$$. The RL model was trained through a total of 100,000 iterations. Here, note that the DNN is used to approximate the Q-function [TeX:] $$Q(s, a)$$ of this system because the state space is continuous and the random event distribution (i.e., the channel statistics in this setup) is not known. Therefore, the inputs of the DNN are the current state and action of the agent, and the output is its Q-function value.
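As a sketch of the described approximator (a fully connected network with a single 200-node hidden layer, Xavier/Glorot initialization, a concatenated state-action input, and a scalar Q-value output), using tf.keras; the ReLU activation is an assumption, since the paper does not specify it.

```python
import tensorflow as tf

def build_q_network(state_dim: int, action_dim: int) -> tf.keras.Model:
    """Q(s, a) approximator: concatenated (state, action) in, scalar Q-value out."""
    init = tf.keras.initializers.GlorotUniform()   # Xavier initializer
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim + action_dim,)),
        tf.keras.layers.Dense(200, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(1, kernel_initializer=init),
    ])

q_net = build_q_network(state_dim=2, action_dim=2)   # state (Z, W), action (P_c, P_d)
```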

To verify the advantages of the proposed dynamic video delivery policy, we compared the proposed scheme with four other schemes:

“Genie-aided”: With knowledge of the fast fading gains of all links, the transmit powers of both the CUE and the DUE are obtained optimally. The decision on the cache-enabled vehicle to be used for video delivery is not considered.

The scheme presented in [27]: The power allocations for the CUE and the DUE are derived based on ergodic capacity and aim to satisfy a constraint on the probability that delay occurs. This approach is not dynamic, and therefore, a fixed power allocation is applied for each frame. Because no association algorithm for video delivery is included in the method in [27], the decision on the cache-enabled vehicle for video delivery is not considered.

Fig. 3.
Learning traces with learning rate of [TeX:] $$3 \times 10^{-6}$$.

“Nearest”: The streaming user associates with the nearest cache-enabled vehicle that is likely to provide the strongest channel condition. The power allocations based on deep reinforcement learning are the same as those of the proposed scheme.

“Highest-Qual”: The streaming user associates with the cache-enabled vehicle that caches the highest quality version of the requested file. If there are many vehicles caching the highest-quality version of video, the user chooses the nearest. The power allocations based on deep reinforcement learning are the same as those of the proposed scheme.

In summary, a comparison of our scheme with the “Genie-aided” scheme and the scheme presented in [27] provides specific insight into the effectiveness of the deep reinforcement learning-based power allocations. The performance comparison of the proposed cache-enabled vehicle association for video delivery based on frame-based Lyapunov optimization with the “Nearest” and “Highest-Qual” schemes shows its advantages.

B. Learning Traces

Fig. 3 shows the traces of learning the power allocations [TeX:] $$\mathbf{P}_{d}$$ and [TeX:] $$\mathbf{P}_{c}$$ at a learning rate of [TeX:] $$3 \times 10^{-6}$$ to minimize the upper bound on the drift-plus-penalty in (18). The learning trace in Fig. 3 is obtained for an example scenario of a certain randomly generated vehicular network; however, the learning traces of most of the random generations are similar to that in this figure. We can easily see that a dramatic increase in the reward is obtained after around 2000 episodes; 4000 episodes were used in the simulation to provide a sufficient margin for the learning process to converge. However, although the reward seems to converge to a certain degree, the trace is in general unsteady. This means that the learning process is not quite stable, because the fast fading gains are completely unknown. Therefore, we restrict the action space of the power allocations to a finite set instead of the set of nonnegative real numbers. In practice, the system is usually unable to change the transmit power continuously, and finite [TeX:] $$N_{A}$$ levels of transmit power are used. Thus, the transmit powers become [TeX:] $$P_{d}(t) \in\left\{0, P_{0}^{d} / N_{A}, \cdots,\left(N_{A}-1\right) P_{0}^{d} / N_{A}\right\}$$ and [TeX:] $$P_{c}(t) \in\left\{0, P_{0}^{c} / N_{A}, \cdots,\left(N_{A}-1\right) P_{0}^{c} / N_{A}\right\}$$. In addition, this deep reinforcement learning is very sensitive to changes in the learning rate. We can see in Fig. 4 that the total rewards do not converge even as the episodes proceed when the learning rate is [TeX:] $$4 \times 10^{-6}$$.

Fig. 4.
Learning traces with learning rate of [TeX:] $$4 \times 10^{-6}$$.
Fig. 5.
Data rate of CUE [TeX:] $$R_{c} \text { vs. } P_{0}$$.
C. Performances of Cellular and D2D Links

We used three performance metrics: 1) The data rate of the CUE, 2) the playback delay at the DUE, and 3) the time-average video quality. Figs. 5–7 show plots of these performance metrics versus the transmit power budgets of both cellular and D2D users, i.e., [TeX:] $$P_{0}^{d}$$ and [TeX:] $$P_{0}^{c}$$. Because we assume [TeX:] $$P_{0}^{d}=P_{0}^{c}=P_{0}$$, the power of the target signal of the CUE, as well as that of the interfering signal from the D2D link, increases as [TeX:] $$P_{0}$$ grows. Therefore, the data rate of the CUE does not dramatically change with [TeX:] $$P_{0}$$, as shown in Fig. 5. We can see that the “Genie-aided” scheme definitely shows the largest data rates among the compared techniques. However, the data rate of the scheme presented in [27] is the smallest, because it endeavors to reduce the queueing delay of the DUE rather than to raise the data rate.

The effect of the proposed step for selecting the cache-enabled vehicle for video delivery based on the frame-based Lyapunov optimization can be seen by comparing its performance with that of the “Nearest” and “Highest-Qual” schemes. Because the streaming user in the “Nearest” scheme must associate with the nearest cache-enabled vehicle, the distance between the D2D transmitter and the BS is usually shorter than that produced by the proposed algorithm and “Highest-Qual”. Therefore, the data rate of “Nearest” is smaller than those of the proposed scheme and “Highest-Qual”, because the interference power that it introduces at the BS is large.

Similarly, since the distribution density of cache-enabled vehicles that store the highest-quality content is considerably lower than that of vehicles that store the desired content at lower qualities, “Highest-Qual” can provide a larger data rate for the CUE than the proposed scheme. However, considering the minimum data rate of the CUE (i.e., [TeX:] $$\eta_{c}$$), the proposed algorithm adequately satisfies this constraint, whereas “Nearest” does not. In addition, as already mentioned regarding the unstable learning process in Fig. 3, the data rate performances of all methods that use deep reinforcement learning-based power allocation are not monotonic; however, the fluctuations in their trends are not severe.

Fig. 6. Delay occurrence rates vs. [TeX:] $$P_{0}$$.

Fig. 6 plots the playback delay occurrence rates versus [TeX:] $$P_{0}$$. As explained previously, a playback delay occurs whenever the queue is empty while the streaming service is being played; therefore, the playback delay occurrence rate reflects how often the queue is empty over the total playback time. In Fig. 6, it can be seen that the “Genie-aided” scheme and the scheme presented in [27] result in no delay, whereas the other power allocation algorithms cannot completely avoid playback delays. A comparison of Figs. 5 and 6 reveals that the proposed algorithm endeavors to satisfy the constraint on the minimum data rate of the CUE at the expense of its delay performance.
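Concretely, the delay occurrence rate plotted in Fig. 6 can be computed from a recorded backlog trace as the fraction of playback slots with an empty queue; the trace format in the sketch below is an assumption made for illustration.

```python
def delay_occurrence_rate(queue_trace):
    """Fraction of playback slots in which the DUE queue backlog Q(t) is empty,
    i.e., no chunk is available and a playback stall occurs."""
    stalls = sum(1 for q in queue_trace if q <= 0)
    return stalls / len(queue_trace)

# Hypothetical backlog trace (in chunks) over ten playback slots
print(delay_occurrence_rate([3, 2, 1, 0, 0, 1, 2, 2, 1, 0]))  # -> 0.3
```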

Among the comparison schemes that use deep reinforcement learning-based power allocation, “Nearest” shows the lowest playback delay occurrence rates and the proposed scheme shows the second lowest, whereas the “Highest-Qual” scheme yields rather long buffering times. As [TeX:] $$P_{0}$$ increases, more chunks can be delivered to the streaming user owing to the increased data rate of the D2D link; thus, the delay occurrence rates of all the schemes decrease. In particular, “Nearest” provides the strongest channel condition on the D2D link, and therefore its delay performance is better than those of the proposed and “Highest-Qual” schemes.

Fig. 7 shows the time-average video qualities of all the comparison schemes. Since the “Genie-aided” scheme and the scheme presented in [27] do not consider the decision on the cache-enabled vehicle for video delivery, they are not shown in this figure. Meanwhile, “Nearest” and “Highest-Qual” show quality trends that differ from those of the proposed scheme. Obviously, the “Highest-Qual” scheme gives the best video quality. Because the “Nearest” scheme does not pursue video quality enhancement, its performance is the lowest among the techniques compared in Fig. 7. The performance of the proposed algorithm is better than that of “Nearest” and poorer than that of “Highest-Qual”.

Fig. 7. Average video quality vs. [TeX:] $$P_{0}$$.

Overall, although the proposed algorithm incurs a small number of playback delays, it achieves both the minimum data rate of the CUE and a fairly good time-average video quality. The scheme in [27] shows almost no playback delay, but it cannot satisfy the minimum data rate constraint; moreover, in contrast to the method in [27], which does not consider wireless caching and video delivery, the proposed algorithm achieves a good average video quality. Similarly, “Nearest” provides a better delay performance than the proposed scheme, but it can neither satisfy the minimum data rate constraint nor achieve the video quality of the proposed algorithm. Finally, “Highest-Qual” delivers the highest-quality streaming service, but its delay performance is poorer than that of the proposed scheme. Thus, we conclude that the proposed algorithm strikes a balanced trade-off among the data rate of the CUE, the playback delay occurrence, and the time-average video quality.

VII. CONCLUSION AND FUTURE WORK

In this paper, a method for jointly optimizing three decisions with different timescales in D2D underlaid cellular and vehicular caching networks was proposed: 1) Association with a cache-enabled vehicle for video delivery, 2) power allocation for the DUE, and 3) power allocation for the CUE. The proposed algorithm maximizes the long-term time-averaged video quality while limiting the playback delay and guaranteeing the data rate of the cellular user under the given spectrum reuse policy. The decision on the cache-enabled vehicle used for video delivery is made based on the frame-based Lyapunov optimization theory while taking the interference from the CUE into account. The dynamic power allocations of both the CUEs and DUEs are obtained by using the deep reinforcement learning approach without knowledge of the fast channel fading. Our extensive simulation results verify that the proposed algorithm effectively achieves a balanced trade-off among the data rate of the cellular user, the playback delay occurrence of video streaming, and the average video quality. As future work, we will consider extensions to dynamic content delivery and routing optimization in multi-hop wireless networks, e.g., vehicular ad hoc networks (VANETs).

Biography

Minseok Choi

Minseok Choi received the B.S., M.S., and Ph.D. degrees from the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2011, 2013, and 2018, respectively. He was a Visiting Postdoctoral Researcher in electrical and computer engineering with the University of Southern California (USC), Los Angeles, CA, USA, and a Research Professor in electrical engineering with Korea University, Seoul, South Korea. He has been an Assistant Professor with Jeju National University, Jeju, South Korea, since 2020. His research interests include wireless caching networks, stochastic network optimization, non-orthogonal multiple access, and 5G networks.

References

  • 1 "", Cisco. (Online). Available:, https://www.cisco.com/c/en/us/solutions/collateral/serviceprovider/visualnetworking-index-vni/mobile-white-paper-c11-520862.html
  • 2 E. Bastug, M. Bennis, and M. Debbah, "Living on the edge: The role of proactive caching in 5G wireless networks," IEEE Commun. Mag., vol. 52, no. 8, pp. 82-89, Aug. 2014.
  • 3 N. Golrezaei, A. F. Molisch, A. G. Dimakis, and G. Caire, "Femtocaching and device-to-device collaboration: A new architecture for wireless video distribution," IEEE Commun. Mag., vol. 51, no. 4, pp. 142-149, Apr. 2013.
  • 4 M. Ji, G. Caire, and A. F. Molisch, "Wireless device-to-device caching networks: Basic principles and system performance," IEEE J. Sel. Areas Commun., vol. 34, no. 1, pp. 176-189, Jan. 2016.
  • 5 S. H. Chae and W. Choi, "Caching placement in stochastic wireless caching helper networks: Channel selection diversity via caching," IEEE Trans. Wireless Commun., vol. 15, no. 10, pp. 6626-6637, Oct. 2016.
  • 6 C. Yang, Y. Yao, Z. Chen, and B. Xia, "Analysis on cache-enabled wireless heterogeneous networks," IEEE Trans. Wireless Commun., vol. 15, no. 1, pp. 131-145, Jan. 2016.
  • 7 K. Poularakis, G. Iosifidis, and L. Tassiulas, "Approximation algorithms for mobile data caching in small cell networks," IEEE Trans. Commun., vol. 62, no. 10, pp. 3665-3677, Oct. 2014.
  • 8 L. Zhang, M. Xiao, G. Wu, and S. Li, "Efficient scheduling and power allocation for D2D-assisted wireless caching networks," IEEE Trans. Commun., vol. 64, no. 6, pp. 2438-2452, June 2016.
  • 9 W. Jiang, G. Feng, and S. Qin, "Optimal cooperative content caching and delivery policy for heterogeneous cellular networks," IEEE Trans. Mobile Comput., vol. 16, no. 5, pp. 1382-1393, May 2017.
  • 10 K. Poularakis, G. Iosifidis, A. Argyriou, and L. Tassiulas, "Video delivery over heterogeneous cellular networks: Optimizing cost and performance," in Proc. IEEE INFOCOM, 2014, pp. 1078-1086.
  • 11 M. Choi, J. Kim, and J. Moon, "Wireless video caching and dynamic streaming under differentiated quality requirements," IEEE J. Sel. Areas Commun., vol. 36, no. 6, pp. 1245-1257, June 2018.
  • 12 J. Kim, G. Caire, and A. F. Molisch, "Quality-aware streaming and scheduling for device-to-device video delivery," IEEE/ACM Trans. Netw., vol. 24, no. 4, pp. 2319-2331, Aug. 2016.
  • 13 M. Choi, A. No, M. Ji, and J. Kim, "Markov decision policies for dynamic video delivery in wireless caching networks," IEEE Trans. Wireless Commun., vol. 18, no. 12, pp. 5705-5718, Dec. 2019.
  • 14 M. Choi, A. F. Molisch, and J. Kim, "Joint distributed link scheduling and power allocation for content delivery in wireless caching networks," IEEE Trans. Wireless Commun., Early Access, Aug. 2020.
  • 15 D. Bethanabhotla, G. Caire, and M. J. Neely, "Adaptive video streaming for wireless networks with multiple users and helpers," IEEE Trans. Commun., vol. 63, no. 1, pp. 268-285, Jan. 2015.
  • 16 T. Stockhammer, "Dynamic adaptive streaming over HTTP - Standards and design principles," in Proc. ACM MMSys, Feb. 2011.
  • 17 S. O. Somuyiwa, A. György, and D. Gündüz, "A reinforcement-learning approach to proactive caching in wireless networks," IEEE J. Sel. Areas Commun., vol. 36, no. 6, pp. 1331-1344, June 2018.
  • 18 W. Jiang, G. Feng, S. Qin, T. S. P. Yum, and G. Cao, "Multi-agent reinforcement learning for efficient content caching in mobile D2D networks," IEEE Trans. Wireless Commun., vol. 18, no. 3, pp. 1610-1622, Mar. 2019.
  • 19 V. Kirilin, A. Sundarrajan, S. Gorinsky, and R. K. Sitaraman, "RL-Cache: Learning-based cache admission for content delivery," IEEE J. Sel. Areas Commun., vol. 38, no. 10, pp. 2372-2385, Oct. 2020.
  • 20 S. Deng et al., "Dynamical resource allocation in edge for trustable Internet-of-things systems: A reinforcement learning method," IEEE Trans. Industrial Informatics, vol. 16, no. 9, pp. 6103-6113, Sept. 2020.
  • 21 L. Li et al., "Deep reinforcement learning approaches for content caching in cache-enabled D2D networks," IEEE Internet Things J., vol. 7, no. 1, pp. 544-557, Jan. 2020.
  • 22 G. Qiao, S. Leng, S. Maharjan, Y. Zhang, and N. Ansari, "Deep reinforcement learning for cooperative content caching in vehicular edge computing and networks," IEEE Internet Things J., vol. 7, no. 1, pp. 247-257, Jan. 2020.
  • 23 Z. Nan, Y. Jia, Z. Chen, and L. Liang, "Reinforcement-learning-based optimization for content delivery policy in cache-enabled HetNets," in Proc. IEEE GLOBECOM, 2019.
  • 24 N. Lee, X. Lin, J. G. Andrews, and R. W. Heath, "Power control for D2D underlaid cellular networks: Modeling, algorithms, and analysis," IEEE J. Sel. Areas Commun., vol. 33, no. 1, pp. 1-13, Jan. 2015.
  • 25 X. Li, R. Shankaran, M. A. Orgun, G. Fang, and Y. Xu, "Resource allocation for underlay D2D communication with proportional fairness," IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6244-6258, July 2018.
  • 26 Y. Ren, F. Liu, Z. Liu, C. Wang, and Y. Ji, "Power control in D2D-based vehicular communication networks," IEEE Trans. Veh. Technol., vol. 64, no. 12, pp. 5547-5562, Dec. 2015.
  • 27 L. Liang, G. Y. Li, and W. Xu, "Resource allocation for D2D-enabled vehicular communications," IEEE Trans. Commun., vol. 65, no. 7, pp. 3186-3197, July 2017.
  • 28 P. Sun, K. G. Shin, H. Zhang, and L. He, "Transmit power control for D2D-underlaid cellular networks based on statistical features," IEEE Trans. Veh. Technol., vol. 66, no. 5, pp. 4110-4119, May 2017.
  • 29 N. Cheng et al., "Performance analysis of vehicular device-to-device underlay communication," IEEE Trans. Veh. Technol., vol. 66, no. 6, pp. 5409-5421, June 2017.
  • 30 W. Lee, M. Kim, and D. Cho, "Deep learning based transmit power control in underlaid device-to-device communication," IEEE Systems J., vol. 13, no. 3, pp. 2551-2554, Sept. 2019.
  • 31 I. Budhiraja, N. Kumar, and S. Tyagi, "Deep reinforcement learning based proportional fair scheduling control scheme for underlay D2D communication," IEEE Internet Things J., vol. 9, no. 5, pp. 3143-3156, Mar. 2021.
  • 32 Y. Liu, Z. Tan, and X. Chen, "Modeling the channel time variation using high-order-motion model," IEEE Commun. Lett., vol. 15, no. 3, pp. 275-277, Mar. 2011.
  • 33 C. Xu, C. Gao, Z. Zhou, Z. Chang, and Y. Jia, "Social network-based content delivery in device-to-device underlay cellular networks using matching theory," IEEE Access, vol. 5, pp. 924-937, Nov. 2017.
  • 34 Y. Wang, X. Tao, X. Zhang, and Y. Gu, "Cooperative caching placement in cache-enabled D2D underlaid cellular network," IEEE Commun. Lett., vol. 21, no. 5, pp. 1151-1154, May 2017.
  • 35 L. Shi, L. Zhao, G. Zheng, Z. Han, and Y. Ye, "Incentive design for cache-enabled D2D underlaid cellular networks using Stackelberg game," IEEE Trans. Veh. Technol., vol. 68, no. 1, pp. 765-779, Jan. 2019.
  • 36 M. J. Neely and S. Supittayapornpong, "Dynamic Markov decision policies for delay constrained wireless scheduling," IEEE Trans. Automatic Control, vol. 58, no. 8, pp. 1948-1961, Aug. 2013.
  • 37 B. Blaszczyszyn and A. Giovanidis, "Optimal geographic caching in cellular networks," in Proc. IEEE ICC, 2015.
  • 38 D. Bertsekas and R. G. Gallager, Data Networks, 2nd ed., Prentice Hall, 1992.
  • 39 M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems, Morgan & Claypool Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1-211, 2010.
  • 40 X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. AISTATS, 2010.
  • 41 H. Gao, C. Liu, Y. Li, and X. Yang, "V2VR: Reliable hybrid-network-oriented V2V data transmission and routing considering RSUs and connectivity probability," IEEE Trans. Intelligent Transportation Systems, Early Access, Apr. 2020.

Table 1. System description parameters.

[TeX:] $$N$$ Number of quality levels
[TeX:] $$\alpha(t)$$ Cache-enabled vehicle chosen at [TeX:] $$t$$
[TeX:] $$P_{c}(t)$$ Transmit power of cellular link at [TeX:] $$t$$
[TeX:] $$P_{d}(t)$$ Transmit power of D2D link at [TeX:] $$t$$
[TeX:] $$Q(t)$$ Queue backlog of DUE at [TeX:] $$t$$
[TeX:] $$W(t)$$ Virtual queue backlog at [TeX:] $$t$$
[TeX:] $$q$$ Video quality
[TeX:] $$\mathcal{P}(q)$$ Measure of video quality [TeX:] $$q$$
[TeX:] $$S(q)$$ Video file size of quality [TeX:] $$q$$
[TeX:] $$K$$ Number of frames
[TeX:] $$T$$ Time duration of a frame
[TeX:] $$t_{c}$$ Unit time slot duration
[TeX:] $$t_{k}$$ Beginning time of the [TeX:] $$k$$-th frame
[TeX:] $$\mathcal{T}_{k}$$ Time interval of the [TeX:] $$k$$-th frame
[TeX:] $$R_{c}(t)$$ Data rate of cellular link at [TeX:] $$t$$
[TeX:] $$R_{d}(t)$$ Data rate of D2D link at [TeX:] $$t$$
[TeX:] $$\eta_{c}$$ Minimum data rate of cellular link
[TeX:] $$P_{0}^{d}, P_{0}^{c}$$ Maximum transmit powers of the D2D and cellular links
[TeX:] $$\mathcal{L}(t)$$ Lyapunov function
[TeX:] $$\lambda$$ Intensity of cache-enabled vehicle distribution
[TeX:] $$\mathcal{B}$$ Bandwidth
[TeX:] $$\sigma^{2}$$ Noise variance
[TeX:] $$V$$ Lyapunov coefficient
[TeX:] $$\mathcal{C}_{k}$$ Cache-enabled vehicle candidate set in the [TeX:] $$k$$-th frame
[TeX:] $$\Theta(t)$$ State of MDP at [TeX:] $$t$$
[TeX:] $$\Xi(t)$$ Action of MDP at [TeX:] $$t$$
[TeX:] $$\mathcal{S}$$ State space of MDP
[TeX:] $$\mathcal{A}$$ Action space of MDP
[TeX:] $$r(t)$$ Reward of MDP at [TeX:] $$t$$
[TeX:] $$P_{s^{\prime} s}$$ Transition probability from state [TeX:] $$s^{\prime}$$ to state [TeX:] $$s$$
[TeX:] $$\pi$$ Trained policy of DDPG algorithm