I. INTRODUCTION
A. Context and Motivations
The Internet of Things (IoT) paradigm has been recognized as a key driving force for realizing the smart concept in various domains such as smart cities [1], smart grids [2], and smart factories [3], since it enables the interconnection and interoperability of IoT-enabled physical and virtual entities to create smart services and support informed decision making for monitoring, control, and management purposes [4], [5]. The underlying principle of this realization involves a set of activities that includes collecting, processing, analyzing, and extracting insights from the IoT data perceived by the IoT devices. Traditionally, the cloud computing platform plays an essential role in the realization process since it provides rich and powerful resources (e.g., storage, computation, networking) to handle an enormous amount of IoT data (big data) efficiently [6]. However, data traffic has increased exponentially due to the growing number of IoT-enabled devices and customized applications, thus leading to congested networks. Some of the leading IoT applications have also put higher demands on resource-constrained devices. Additionally, more stringent quality of service (QoS) requirements of IoT service provisioning, such as (ultra)low delay, expose crucial limitations of cloud-based solutions, because delivering data from the IoT devices to the centralized cloud servers seriously degrades the performance of data processing and analysis and ultimately results in network congestion and excessive delay. This context leads to a strong push to integrate fog computing into IoT-cloud systems, since it puts computing, storage, communication, and control closer to the IoT devices to meet the prescribed QoS requirements [7], [8].
Technically, the fog computing platform that is placed between the physical IoT devices and the cloud servers can handle a majority of service requests on behalf of the cloud servers to improve the system performance in terms of service delay, workload balancing, and resource utilization [9].
The mutual benefits gained from the combination of fog and cloud enable the resulting IoT-fog-cloud systems to provide uninterrupted IoT services with various QoS requirements for the end users along the things-to-cloud continuum. However, employing fog computing raises another concern regarding the decision of whether tasks should be processed in the fog or in the cloud. Many factors affect the offloading decision policies, such as the offloading criteria and application scenarios [9]. Basically, in most existing offloading techniques, the tasks are offloaded to the best surrogate nodes, which have the most ample resources (e.g., large storage capacity, high processing speed) and reliable network conditions in terms of delay and bandwidth between them and their neighbors, the IoT devices, and even the cloud servers. However, such fog offloading solutions face significant challenges regarding the workload distribution among heterogeneous fog devices characterized by different computation resources and capabilities. The challenge is further amplified by increasing rates of service requests, which are likely to make the task queues of resource-rich fog nodes longer. As a result, the requirements of latency-sensitive applications can be violated because of the excessive waiting time in long queues. Furthermore, relying on the remote cloud servers to accomplish the tasks may not improve the situation due to high communication delay or network-related disturbances.
Executing the tasks in the fog computing tier requires efficient resource allocation and management to satisfy the QoS requirements. However, achieving this objective faces many critical challenges due to the complex heterogeneity and limitations of fog resources, locality restrictions, and the dynamic nature of resource demands. Most existing heuristic algorithms are proposed as efficient resource allocation solutions for distributing and executing tasks in certain specific computing scenarios [10], [11]. In other words, they lack a generic framework for studying the resource allocation issues in a practical computing context, which encompasses multiple criteria for deriving efficient algorithms such as heterogeneity, QoS management, scalability, mobility, federation, and interoperability [12].
Reinforcement learning (RL) has been increasingly studied and applied to effectively solve resource allocation issues in many uncertain computing environments [13]. In principle, RL-based algorithms employ learning processes to learn the dynamic and changing environment and accumulate experience, thus deriving the best decisions for long-term operation [14]–[16]. For example, an RL-based task scheduling algorithm has been developed and deployed in cloud computing scenarios to reduce the average task execution delay and the task congestion level [17]. Many RL-powered algorithms have been proposed to improve the computing performance, for example by reducing the energy consumption in data centers [18]–[20]. Recently, the combination of deep learning and RL has further enhanced the capabilities of the resulting deep RL (DeepRL) approaches, which achieve outstanding performance in complex control fields and show their superiority in decision making in complex and uncertain environments [21], [22]. For example, a DeepRL algorithm is developed to optimize the task scheduling in the data center [23]. The resource allocation problem is also tackled by using a DeepRL algorithm to achieve optimal job-virtual machine scheduling in the cloud [24]. Dynamic task scheduling problems are effectively addressed by RL-based techniques in the field of computer science [25]. These findings suggest that the RL concept is an efficient alternative for solving the resource allocation problems in fog computing. In this regard, this paper reviews the state-of-the-art RL-based methods for solving the resource allocation problems in the fog computing environment.
B. Contributions
The main contributions of this paper are summarized as follows:
· We highlight key issues regarding resource management in fog computing for task offloading and task computation algorithms.
· We examine the RL concept as a potential solution to the resource management issues.
· We survey the state-of-the-art applications of RL in the fog computing environment.
· We explore and discuss the challenges as well as the associated open issues when using RL algorithms in the context of fog computing.

Fig. 1. A typical three-tier architecture of an IoT-fog-cloud system, which provides specific kinds of IoT services (e.g., services A, B, and C) through either the fog layer or the cloud based on the adaptive resource allocation mechanism.
C. Paper Structure
The rest of this paper is structured as follows. Section II presents the key concepts of fog computing and the existing resource management issues of fog nodes. Section III overviews the principle of RL and the key algorithms developed and used in practical applications. Section IV presents a comprehensive review of existing works that apply RL algorithms in the context of fog computing. Section V discusses the challenges and open issues. Finally, Section VI concludes the work.
II. FOG COMPUTING ENVIRONMENT
A. System Model
A fog computing system is usually placed between an IoT layer and a cloud layer to form a three-tier architecture for the resulting IoT-fog-cloud system as illustrated in Fig. 1.
The first layer includes all IoT-enabled devices that generate IoT data, optionally pre-process the data, and periodically report the raw and/or pre-processed data to the fog or cloud for further advanced processing and computing (i.e., data streaming applications). The fog computing devices (e.g., routers, switches, gateways, and cloudlets for mobile computing services) distributed in the fog layer and the servers in the cloud layer are responsible for receiving, processing, and responding to the IoT computing service requests sent from the IoT layer. In contrast to the powerful servers in the cloud tier, the fog devices have limited resources. In addition, they are heterogeneous in terms of computation, storage, and communication capability. Therefore, an efficient resource allocation strategy is required to improve the performance of the fog computing tier in serving and delivering the services.

TABLE I. RESOURCE STATE TABLE OF NEIGHBORS OF FOG NODE $F_{1}$.
Depending on the specific objectives and scale of the application system, the fog layer can be structured in three main architectures: centralized, distributed, and hierarchical. In the centralized architecture, the computing system comprises a fog controller and a set of fog nodes [26]. The controller is responsible for information aggregation and decision making, whereas the nodes serve directly as the supporting computing devices. Such an architecture is widely applicable in software-defined networks. The distributed architecture is referred to as a network of fogs, in which connectivity among the fogs is formed in a distributed manner, and task scheduling and resource allocation in the task offloading processes are decided by the fogs themselves. In the hierarchical architecture, there exist clusters, each of which operates according to master-slave operations. Specifically, a fog node serves as a master to control and manage the operations of the associated fog nodes, known as slaves. The master knows the states of its slaves regarding resources and capacity; thus, it can derive optimal resource allocation and task scheduling in its cluster. In addition, federation is enabled among the master fogs for further resource and load sharing.
Regardless of the architecture, the systems rely on available resource tables that contain the updated resource states of the fogs in the system to facilitate the resource allocation processes. Depending on the specific architecture of the fog system, the tables are maintained by different responsible fogs. For example, in the centralized and hierarchical architectures, the controllers and fog masters know the resource states of all fog devices and of the fog devices in their clusters, respectively. In the distributed system, in contrast, each fog maintains its own neighbor resource table containing the updated information about the available resources of its colony [27], [28]. These tables are shared among the members of the colony to support the primary host in making offloading decisions, which ultimately aim at selecting the offloadees, the hosts, and the collaborative fogs. Table I shows an example of a neighbor resource table stored by the fog node $F_{1}$, which records the resource states of its neighbors with respect to residual memory (M), clock frequency (f), processor (i.e., CPU or GPU support), round-trip time (RTT), and expected waiting time in queue (W).
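As a minimal sketch of how such a neighbor resource table could be represented and queried, the following Python fragment stores one row per neighbor with the five attributes listed above. The field names, units, sample values, and the ranking rule (RTT plus expected waiting time) are assumptions made for illustration and are not taken from Table I or from [27], [28].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeighborState:
    """One row of a neighbor resource table (attributes as listed in the text)."""
    node: str          # neighbor identifier, e.g. "F2"
    memory_mb: float   # residual memory M (unit assumed: MB)
    freq_ghz: float    # clock frequency f (unit assumed: GHz)
    has_gpu: bool      # processor type: True if GPU support is available
    rtt_ms: float      # round-trip time RTT (unit assumed: ms)
    wait_ms: float     # expected waiting time in queue W (unit assumed: ms)


# Hypothetical table kept by fog node F1; all values are made up.
TABLE_F1 = [
    NeighborState("F2", 512.0, 1.2, False, 4.0, 30.0),
    NeighborState("F3", 2048.0, 2.4, True, 6.0, 120.0),
    NeighborState("F4", 1024.0, 1.8, False, 5.0, 10.0),
]


def candidate_offloadees(table, min_memory_mb, need_gpu=False):
    """Return neighbors that meet the memory and processor needs of a task,
    ordered by RTT plus expected queueing delay (an assumed ranking rule)."""
    feasible = [
        n for n in table
        if n.memory_mb >= min_memory_mb and (n.has_gpu or not need_gpu)
    ]
    return sorted(feasible, key=lambda n: n.rtt_ms + n.wait_ms)


ranked = candidate_offloadees(TABLE_F1, min_memory_mb=800.0)
print([n.node for n in ranked])
```

With these sample values, F2 is filtered out for lack of memory, and the remaining neighbors are ranked by how soon they are expected to respond.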
Each computing task k can be described with a tuple $(A_{k}, B_{k})$, where $A_{k}$ and $B_{k}$ are the vectors representing the input data features and the required computing resources, respectively. Basically, $A_{k}$ can include the following features: total size (in bits or bytes), splittable or non-splittable, and number of data types. The input data sizes of tasks can range from kilobytes to terabytes depending on the specific application [29]. Based on this feature, the tasks can be classified into light, medium, and heavy tasks, as studied in many existing works [28], [30], to further analyze the impact of task sizes on the performance of fog computing systems. The divisibility of a task (i.e., of its input data) is also investigated in the literature. In the indivisible case, the whole input data can only be processed by a single fog device, as it cannot be split into data subsets. In contrast, several existing works assume that a task can be divided into subtasks with smaller data sizes. Such task division is employed to benefit from parallel computing since the subtasks can be processed by different fog devices simultaneously. Fig. 2 illustrates the main subtasks for computing the input data a when it can be divided into three independent subsets $\{a_{1}, a_{2}, a_{3}\}$.

Fig. 2. Based on the data fragmentation concept, processing the input data a includes a set of tasks, which process the data subsets $a_{1}, a_{2}, a_{3}$ in a distributed manner before completing the whole task by jointly processing the output data $f(a_{1})$, $f(a_{2})$, and $f(a_{3})$ to achieve f(a).
In particular, the outputs of the subtasks (i.e., $f(a_{1})$, $f(a_{2})$, and $f(a_{3})$) are collected and jointly processed in the final stage to achieve the desired outcome (i.e., f(a)). This mechanism is called partial offloading, as investigated in existing works such as [31]. The input data of a certain computing task can include multiple types of data such as text, image, video, and audio, as studied in [27], [32]. Regarding the required computing resources, many attributes can be included in $B_{k}$ to process the task. Some existing works consider only the number of central processing unit (CPU) cycles in $B_{k}$ [33]. In other scenarios, GPU and memory requirements are also considered during resource allocation for executing heavy and complex tasks such as machine learning algorithms [34].
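The task description and the partial offloading mechanism above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names of `Task`, the size thresholds in `classify`, and the choice of `sum` as both the per-subset function f and the final combining step are hypothetical and do not come from the cited works.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Task k as a pair (A_k, B_k); the field names are illustrative."""
    size_bytes: int         # A_k: total input size
    splittable: bool        # A_k: whether the input can be divided
    cpu_cycles: float       # B_k: required CPU cycles
    memory_mb: float = 0.0  # B_k: optional memory requirement


def classify(task, light_max=10**6, medium_max=10**9):
    """Label a task by input size; the thresholds are assumptions."""
    if task.size_bytes <= light_max:
        return "light"
    if task.size_bytes <= medium_max:
        return "medium"
    return "heavy"


def split(data, parts):
    """Divide the input into `parts` contiguous, nearly equal subsets."""
    base, extra = divmod(len(data), parts)
    subsets, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        subsets.append(data[start:end])
        start = end
    return subsets


def partial_offload(data, parts, f, combine):
    """Apply f to each subset (as separate fog nodes would) and merge the
    partial outputs f(a_1), ..., f(a_n) into the final result f(a)."""
    return combine(f(subset) for subset in split(data, parts))


a = list(range(1, 11))
result = partial_offload(a, parts=3, f=sum, combine=sum)
print(split(a, 3), result)
```

The sketch works because summation can be recombined from partial sums; a function f without this property would require a different combining step or could not be divided at all, which corresponds to the non-splittable case in the text.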
B. Resource Allocation Problems in Fog Computing Environment
The fog computing technology provides additional resources for executing tasks and correspondingly providing various services with improved quality in the computing systems. However, the nature of the fog computing environment exposes major challenges regarding resource allocation for task execution. This section explores and discusses these key challenges in fog computing, which call for alternative solutions beyond the existing heuristic propositions.

Fig. 3. The heterogeneity and unbalanced workload of the fog environment in IoT-fog-cloud systems expose issues in providing IoT services.
Firstly, the fog computing environment is complex and changes dynamically. Basically, fog computing devices such as gateways, hubs, and switches are heterogeneous in terms of computation, storage, and communication capability. In many use cases, the complicated heterogeneity of the fog environment represents a critical challenge for these systems in achieving the performance improvement objective. For instance, the presence of limited-resource fogs results in an imbalanced workload distribution among the fogs, which in turn negatively affects the system performance in terms of delay. Fig. 3 illustrates this typical issue in the fog computing environment, in which a fog node $F_{2}$ is unable to process the whole input data of service request A due to a lack of resources. Meanwhile, offloading the task to the fog neighbors $F_{1}$ and $F_{3}$ may lead to an extensive delay since there are high workloads in the queues of these fog nodes.
In addition, a high rate of requests potentially prolongs the task queues in the powerful fog nodes, since the fog nodes with limited computational capability and storage may be unable to process the whole input data of a service. Furthermore, constrained by the cloud rental cost, another objective is to maximize the usage of fog resources, thus minimizing the reliance on the cloud. All these perspectives point toward exploring the task division concept, which can potentially help in reducing the task execution delay through parallel subtask execution in the limited-resource fogs. However, dividing tasks may not be effective in large-scale systems or under high task request rates since it may increase the resource contention among the fogs, thus increasing the time and computation complexity of the algorithms [27].
Secondly, the fog computing resources change dynamically due to mobility. In many practical applications, the presence of mobile fog nodes such as drones [35] and vehicles [36], [37] leads to a dynamic change of fog computing resources over time. Moreover, fog devices leaving and joining the fog computing systems are also causes of this change. Therefore, the resource allocation strategies must be designed in an adaptive and flexible way to cope with this situation.
Thirdly, the resource requirements for executing the tasks also change dynamically due to the various types of tasks. The demand for resources varies according to the specific applications, time periods, and environmental conditions. Generally, there are three major computing problems involving resource allocation in fog computing systems: resource sharing, task scheduling, and task offloading [12]. Basically, resource sharing refers to methods for sharing the available resources among the fog devices to meet the computation demands of tasks [38], [39]. This mechanism is essential in the fog computing environment, where the heterogeneity of fog resources stresses the need for multiple computing devices to complete a single task. As the fog computing system is considered as a pool with a set of resource types such as storage, CPU, and GPU, resource sharing requires cooperation among the nodes to execute the computational task demands. Therefore, mechanisms for fog-to-fog collaboration must be established within the fog computing domains to facilitate the resource sharing [40]. However, enabling such fog-to-fog cooperation is challenging since the practical devices, such as gateways, routers, and hubs, differ in terms of hardware, software, and functionalities.

Task scheduling is a problem that has a high impact on the performance of the computing system, especially in terms of task execution delay. In particular, task scheduling is usually performed without sufficient information support. Practically, there are no patterns to predict the arriving task profiles, which are characterized by the number and size of tasks as well as the arrival rate. Therefore, the algorithm has to schedule the tasks at once without any prior experience or prepared information. In addition, the scheduling algorithm needs to automatically optimize the resource utilization based on the changing demand.
Generally, scheduling the tasks involves deciding which resources process which tasks within which expected time period. In large-scale systems comprising the IoT layer, the edge and fog layer, and the cloud layer, the possible resources for computation execution include IoT devices, edge devices, fog devices, and also servers in the cloud tier. The heterogeneity of fog resources in terms of computation capabilities directly leads to an imbalance of the workload distributed among the nodes. Concretely, from a greedy perspective, more load tends to be carried by the powerful devices (e.g., those with fast processing speed). Without proper management, the performance of the computing system can degrade in the long run since a large amount of the resources available in the limited-resource fogs is underutilized. Balancing the workload among the computing devices is therefore a challenging problem in ensuring the stable operation of computing systems in the long run. Many factors need to be considered when designing and developing efficient task offloading algorithms, such as the resource states of fog devices and the task requirements, which change dynamically over time.
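The imbalance caused by a greedy choice of the fastest node can be illustrated with a small Python sketch. It is a toy model under strong assumptions that are not part of the surveyed works: all tasks arrive at time zero, every task needs the same number of CPU cycles, communication delay is ignored, and the finish time of a task is the queued work plus its own work divided by the clock frequency. The node names and numbers are hypothetical.

```python
def completion_time(node, cycles):
    """Estimated finish time: queued work plus this task, divided by speed."""
    return (node["queued_cycles"] + cycles) / node["freq_hz"]


def assign(nodes, cycles, policy):
    """Pick a node under the given policy and append the task to its queue."""
    if policy == "greedy":
        # Always prefer the fastest processor, regardless of its queue.
        chosen = max(nodes, key=lambda n: n["freq_hz"])
    elif policy == "delay_aware":
        # Prefer the node with the earliest estimated finish time.
        chosen = min(nodes, key=lambda n: completion_time(n, cycles))
    else:
        raise ValueError(f"unknown policy: {policy}")
    finish = completion_time(chosen, cycles)
    chosen["queued_cycles"] += cycles
    return chosen["name"], finish


def simulate(policy, n_tasks=6, cycles=2e9):
    """Schedule n_tasks identical tasks on two hypothetical fog nodes."""
    nodes = [
        {"name": "F1", "freq_hz": 2e9, "queued_cycles": 0.0},  # powerful node
        {"name": "F2", "freq_hz": 1e9, "queued_cycles": 0.0},  # limited node
    ]
    return [assign(nodes, cycles, policy) for _ in range(n_tasks)]


greedy = simulate("greedy")
aware = simulate("delay_aware")
print(max(t for _, t in greedy), max(t for _, t in aware))
```

In this toy setting the greedy policy sends every task to F1 and leaves F2 idle, whereas the delay-aware policy also uses the slower node and finishes the last task earlier.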
In order to solve the above problems, many related studies have been carried out. Most of them focus on some specific scenarios or rely on details of the incoming tasks acquired in advance to derive efficient heuristic algorithms. In addition, most of the previous studies are oriented toward single-objective optimization (e.g., delay, energy consumption, or resource utilization). In particular, none of them has a generic framework to model and examine the above-mentioned challenges for further designing appropriate algorithms. In the next sections, we briefly introduce the principal concept of RL and conduct a review of the application of RL-based methods for solving the resource allocation problems in the fog computing environment.

TABLE II. SYMBOLS USED IN THE WORK.

Fig. 4. The fundamental operation of RL with agent, environment, reward, state, and action.
III. REINFORCEMENT LEARNING
A. Basic Concept
RL is a learning paradigm that lies between supervised and unsupervised learning. It is not strictly supervised because it does not depend solely on labeled training data, but it is not unsupervised either because rewards or punishments are given to the agent as feedback on its actions. In RL algorithms, the agents learn to identify the best actions in any situation so as to achieve the overall goal based on the rewards and punishments. In other words, RL can be defined in simplest terms as the “science of decision making” [41].
For the sake of clarity, Table II lists the symbols most commonly used in RL frameworks.
In the standard RL model, the agent interacts with its environment to learn the characteristics of the environment. The data perceived by sensing the environment serve as the input for the agent's decision making. The actions taken by the agent result in a change of the environment, which is communicated back to the agent so that the entire process starts over again. Fig. 4 illustrates the basic operation of RL.
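The interaction cycle described above can be sketched as a short Python loop. The environment below is a hypothetical toy (a single fog node with a bounded queue), and its reward values, service probability, and the threshold policy are assumptions chosen only to make the loop concrete.

```python
import random


class QueueEnv:
    """Toy environment (hypothetical): one fog node with a bounded task queue.

    State: current queue length.
    Action: 0 = process the new task locally, 1 = offload it to the cloud.
    """

    def __init__(self, capacity=5, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.queue = 0

    def reset(self):
        self.queue = 0
        return self.queue

    def step(self, action):
        if action == 0:
            # The task joins the local queue; delay grows with queue length.
            self.queue = min(self.capacity, self.queue + 1)
            reward = -float(self.queue)
        else:
            # Fixed (assumed) cost of a round trip to the cloud.
            reward = -3.0
        # With probability 0.5 the node finishes one queued task.
        if self.queue > 0 and self.rng.random() < 0.5:
            self.queue -= 1
        return self.queue, reward


def threshold_policy(state):
    """Process locally while the queue is short, otherwise offload."""
    return 0 if state < 2 else 1


def run_episode(env, policy, steps=20):
    """Agent-environment loop: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        action = policy(state)
        state, reward = env.step(action)
        total_reward += reward
    return total_reward


print(run_episode(QueueEnv(seed=1), threshold_policy))
```

An RL algorithm would replace the fixed `threshold_policy` with a policy that is updated from the observed states and rewards.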
At every time step t of interaction, the agent makes a partial observation of the state $s_{t}$ of the world and then decides to take an action $a_{t}$ [42]. The environment changes when the agent acts on it, or it may change on its own. The agent also perceives a reward $r_{t}$ from the environment, which tells it how good or bad the current state is. The ultimate goal of the agent is to maximize its cumulative reward, called the return. Therefore, RL methods are the ways in which the agent can learn behaviors to achieve its goal [43]. The following defines the fundamental concepts and notations used to describe RL problems and algorithms.

Fig. 5. Two simple MDPs that illustrate the underlying concepts and algorithms [46].
1) State Space: Interaction with the dynamic environment trains the RL system by trial and error as it learns a mapping from situations to actions. The RL system must be able to observe the information provided by the environment that influences the action to be taken; thus, the true states of the environment affect the actions taken by the RL system. In most environments, the state transition follows the Markov property, that is, the current state $s_{t}$ provides enough information to obtain an optimal decision. Therefore, selecting an action a yields the same probability distribution over the next states whenever this action is taken in the same state [44]. A Markov decision process (MDP) is a mathematical framework used to model decision-making situations in RL problems. An MDP consists of states, actions, transitions between states, and a reward function. The system is Markovian if the result of an action does not depend on the previous actions and previously visited states, but only on the current state, i.e., $P(s_{t+1} \mid s_{t}, a_{t}, s_{t-1}, a_{t-1}, \ldots)=P(s_{t+1} \mid s_{t}, a_{t})$ [45]. MDPs are often depicted as a state transition graph in which the nodes correspond to states and the (directed) edges denote transitions. The two simple MDPs illustrated in Fig. 5 show the underlying concepts, in which each transition is labeled with the action taken and a pair of numbers: the first number is the immediate reward and the second number is the transition probability.
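A state transition graph of this kind maps naturally onto a nested dictionary. The following Python sketch uses a hypothetical two-state MDP (it is not one of the MDPs of Fig. 5); the state names, action names, probabilities, and rewards are made up for illustration.

```python
import random

# Hypothetical MDP: P[state][action] is a list of
# (transition probability, next state, immediate reward) edges.
P = {
    "idle": {
        "accept": [(0.8, "busy", 1.0), (0.2, "idle", 0.0)],
        "reject": [(1.0, "idle", 0.0)],
    },
    "busy": {
        "accept": [(1.0, "busy", -1.0)],
        "reject": [(0.6, "idle", 0.0), (0.4, "busy", 0.0)],
    },
}


def is_valid(mdp, tol=1e-9):
    """Check that the outgoing probabilities of every (state, action) sum to 1."""
    return all(
        abs(sum(p for p, _, _ in edges) - 1.0) <= tol
        for actions in mdp.values()
        for edges in actions.values()
    )


def sample_transition(mdp, state, action, rng):
    """Draw (next_state, reward) from the current state and action only,
    which is exactly what the Markov property permits."""
    edges = mdp[state][action]
    weights = [p for p, _, _ in edges]
    _, next_state, reward = rng.choices(edges, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)
print(is_valid(P), sample_transition(P, "idle", "accept", rng))
```

Because `sample_transition` receives no history, any dependence on earlier states or actions would have to be encoded in the state itself.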
2) Action Space: The actions available to the agent depend entirely on the environment; therefore, different environments result in different actions. The set of all valid actions in a given environment is called the action space, denoted as A [47]. Some environments, such as Atari and Go, have discrete action spaces, where only a finite number of actions is available to the agent [48]. Other environments, such as those in which an agent controls a robot in the physical world, have continuous action spaces [49].
3) Reward and Return: As the agent performs an action $a_{t}$, it immediately gets a reward $r_{t}$. The immediate reward $r_{t}$ is quantified by a numerical value (either positive or negative) that evaluates the desirability of the action that the agent took. However, the goal of the agent is to maximize the total amount of rewards, i.e., the cumulative reward, rather than the immediate rewards.
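The return, or cumulative reward, at time step t is defined as:

$$R_{t}=r_{t}+r_{t+1}+r_{t+2}+\cdots+r_{T},$$

where T is the final time step. However, to account for the relative importance of immediate and future rewards, the return is discounted by a discount factor $\gamma$ $(0 \leq \gamma \leq 1)$, thus:

$$R_{t}=r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\cdots+\gamma^{T-t} r_{T}=\sum_{k=0}^{T-t} \gamma^{k} r_{t+k}.$$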
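For a finite sequence of rewards, the discounted return described above can be computed with a single backward pass. The sketch below is a minimal illustration; the function name and the convention that `rewards[0]` is the reward at the current time step are assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over the sequence.

    The loop runs backwards and uses the recursion
    R_t = r_t + gamma * R_{t+1}, which avoids computing powers of gamma.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Setting `gamma=1.0` recovers the undiscounted cumulative reward, while `gamma=0.0` keeps only the immediate reward.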
4) Policy: Policy is defined as the rule used by the agent to decide which action to take in each state. In other words, a policy is a function mapping states to actions. Therefore, the ultimate objective of RL algorithms is to derive the optimal policy that maximizes the return.
5) State Value and State-action Value Function: A state value function $V^{\pi}(s)$ is used to specify how good it is for an agent to be in a particular state s under a policy $\pi$. Mathematically, $V^{\pi}(s)$ is defined as the expected return when starting from state s and following $\pi$:

$$V^{\pi}(s)=\mathbb{E}_{\pi}\left[R_{t} \mid s_{t}=s\right],$$

where $R_{t}$ denotes the (discounted) return at time step t.

A state-action value function, or Q-function, $Q^{\pi}(s, a)$ is used to specify how good it is for an agent to perform a particular action a in a state s when following a policy $\pi$. The mathematical formulation of the Q-function is as follows:

$$Q^{\pi}(s, a)=\mathbb{E}_{\pi}\left[R_{t} \mid s_{t}=s, a_{t}=a\right].$$
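Since the environment is stochastic, a policy $\pi$ takes an action a in the current state s with a probability denoted as $\pi(a \mid s)$. Certainly, $\sum_{a \in A} \pi(a \mid s)=1$. The relationship between $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ is expressed by:

$$V^{\pi}(s)=\sum_{a \in A} \pi(a \mid s) Q^{\pi}(s, a).$$

By denoting $P_{s s^{\prime}}^{a}$ as the probability of transiting from a state s to a state $s^{\prime}$ when performing an action a, the relationship between $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ is also expressed by:

$$Q^{\pi}(s, a)=r(s, a)+\gamma \sum_{s^{\prime}} P_{s s^{\prime}}^{a} V^{\pi}\left(s^{\prime}\right),$$

where r(s,a) is the immediate reward achieved after taking action a given the current state s.

Consequently, the state values can be formulated as:

$$V^{\pi}(s)=\sum_{a \in A} \pi(a \mid s)\left[r(s, a)+\gamma \sum_{s^{\prime}} P_{s s^{\prime}}^{a} V^{\pi}\left(s^{\prime}\right)\right].$$

Similarly, for the Q-functions, we have:

$$Q^{\pi}(s, a)=r(s, a)+\gamma \sum_{s^{\prime}} P_{s s^{\prime}}^{a} \sum_{a^{\prime} \in A} \pi\left(a^{\prime} \mid s^{\prime}\right) Q^{\pi}\left(s^{\prime}, a^{\prime}\right),$$

where $a^{\prime}$ is a possible next action taken in the next state $s^{\prime}$.

Fig. 6. Taxonomy of RL algorithms.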
In order to achieve the maximal return, RL algorithms have to find the optimal policy, which has an associated optimal state value function or state-action value function. Mathematically, the optimal policy $\pi^{*}$ is found as:

$$\pi^{*}=\arg \max _{\pi} V^{\pi}(s),$$

or

$$\pi^{*}=\arg \max _{\pi} Q^{\pi}(s, a).$$
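When the transition probabilities and rewards are known, one standard way to obtain the optimal value function and a corresponding greedy policy is value iteration, which repeatedly applies the recursive relationship between state values and state-action values with a maximization over actions. The text does not prescribe this algorithm; the sketch below is an illustration, and the two-state MDP, its rewards, and the discount factor are hypothetical.

```python
def q_value(mdp, V, s, a, gamma):
    """Q(s, a): expected immediate reward plus discounted next-state value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])


def value_iteration(mdp, gamma=0.9, tol=1e-8):
    """Return the optimal state values and a greedy policy for a known MDP.

    mdp[s][a] is a list of (probability, next_state, reward) edges.
    """
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            best = max(q_value(mdp, V, s, a, gamma) for a in mdp[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {
        s: max(mdp[s], key=lambda a: q_value(mdp, V, s, a, gamma))
        for s in mdp
    }
    return V, policy


# Hypothetical deterministic MDP with two states and two actions.
TOY_MDP = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}

V_opt, pi_opt = value_iteration(TOY_MDP, gamma=0.5)
print(V_opt, pi_opt)
```

In this example, staying in B yields a reward of 2 at every step, so its optimal value is 2 / (1 - 0.5) = 4, and moving from A to B is worth 1 + 0.5 × 4 = 3.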
The next subsection reviews the key algorithms and taxonomies used in the literature to find the optimal policies for different scenarios.
B. Taxonomy of RL Algorithms
Broadly, RL can be classified into two categories including model-free and model-based methods [50]. A non-exhaustive list of RL based algorithms in these two classes is presented in Fig. 6.
In the model-based approach, the agent is able to plan, as it can see a range of future possibilities for its choices and thus decide between its options well ahead. The agent can then distill the planning results into a learned policy. For example, the authors of [51] used this approach in AlphaZero (AZ), in which the sample efficiency is improved significantly over methods that do not use a model. However, this approach has shortcomings when the model must be learned by the agent purely from experience, which itself creates many challenges. The biggest one is bias in the learned model, which can be exploited by the agent and thereby causes it to perform below par in the real environment. Secondly, model learning is very computation-intensive, and this effort can ultimately fail to pay off.
In contrast, model-free algorithms do not need a model. As a result, their sample efficiency is lower, but they are easier to implement and tune, which makes them more popular than their model-based counterparts. The algorithms of this type can be further divided according to the kind of learning carried out, of which there are two. The first one is policy optimization, in which the parameters $\theta$ of a policy $\pi_{\theta}(\cdot \mid s_{t})$ are optimized based on the most recently collected data. Key examples are A2C/A3C [52], where the performance is maximized directly by gradient ascent, and PPO [53], where a surrogate objective function is used as an indirect measure of the performance. The second one is Q-learning, in which an approximator $Q_{\theta}(s, a)$ is learned for the optimal action-value function $Q^{*}(s, a)$. The objective function of the Q-learning approach is based on the Bellman equation [54]. This approach is used in DQN [55], which is a milestone for deep RL, and in C51 [56], which learns a distribution over returns whose expectation is $Q^{*}$. In addition, some algorithms combine policy optimization and Q-learning to trade off the strengths and weaknesses of either side. For example, the authors of [57] proposed DDPG, which simultaneously learns a policy and a Q-function. Likewise, the work in [58] proposed SAC, which combines Q-learning with stochastic policies and entropy regularization, thereby achieving higher scores.
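To make the Q-learning branch concrete, the following Python sketch implements the tabular form of the algorithm, in which the approximator is simply a table of values instead of a parameterized function. The epsilon-greedy exploration, the hyperparameter values, and the two-state toy environment are assumptions for illustration; DQN and C51 replace the table with neural networks and add further components not shown here.

```python
import random
from collections import defaultdict


def q_learning(step, actions, start_state, episodes=500, horizon=20,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    `step(state, action)` must return (next_state, reward); no model of the
    transition probabilities is needed.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.choice(actions)                         # explore
            else:
                a = max(actions, key=lambda act: Q[(s, act)])   # exploit
            s2, r = step(s, a)
            # Bellman-based target: reward plus discounted best next value.
            target = r + gamma * max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q


def toy_step(state, action):
    """Hypothetical two-state environment: returns (next_state, reward)."""
    if state == "A":
        return ("B", 1.0) if action == "go" else ("A", 0.0)
    return ("B", 2.0) if action == "stay" else ("A", 0.0)


ACTIONS = ["stay", "go"]
Q = q_learning(toy_step, ACTIONS, start_state="A", gamma=0.5)
greedy_policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ("A", "B")}
print(greedy_policy)
```

The learned greedy policy moves from A to B and then stays in B, which is the behavior with the highest discounted return in this toy environment.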
Model-based RL methods are less clearly defined, as models can be used in many orthogonal ways. Broadly, they can be classified based on whether the model is given or learned. The learned-model approach is used in [59], called MBMF, in which a pure planning technique, namely model predictive control, is used to select actions on some standard benchmark tasks for deep RL. The given-model approach is used in [51], called AZ, in which an explicit representation of the policy is learned together with pure planning, which produces actions that are stronger than those the policy alone would have produced.
IV. RL-BASED ALGORITHMS FOR RESOURCE ALLOCATION IN THE FOG COMPUTING SYSTEM
This section summarizes the key RL-based algorithms in the literature that address the resource allocation problems in the fog computing environment, which are discussed according to two key types, namely RL-based and deep RL (DRL)-based methods. In particular, the review emphasizes how the components of RL-based solutions in fog computing, such as the state space, action space, MDP, and reward, are formulated.
A. Resource Sharing and Management
Considering the fog computing system as a resource pool with multiple kinds of resources (i.e., CPU, GPU, storage, and memory), resource sharing is an important mechanism for allocating the resources efficiently. In principle, resource sharing algorithms require the collaboration of fog entities to exchange their available resource states, thus facilitating the resource allocation.
The work in [60] studies resource management for reducing the energy consumption in fog radio access networks (F-RANs), which can provide two main types of services: communication and computation. Considering the dynamics of edge and fog resource states, the network controller makes fast and intelligent decisions on the communication modes of the user equipment (UE) (i.e., cloud-RAN (C-RAN) mode, device-to-device (D2D) mode, and fog access point (FAP) mode) and on the on-off states of processors to minimize the long-term power consumption of the system. A well-trained DRL model is built on the system architecture to derive this optimal decision. In this model, the state space is defined as $S=\{s=(s^{\text{processor}}, s^{\text{mode}}, s^{\text{cache}})\}$, where $s^{\text{processor}}$ is a vector representing the current on-off states of all the processors, $s^{\text{mode}}$ is a vector representing the current communication modes of all the UEs, and $s^{\text{cache}}$ is a vector consisting of the cache state at each D2D transmitter. The network controller is able to control the on-off state of one processor and the communication mode of one UE at each time step. Thus, to reduce the number of actions, the action space is defined as $A=\{a=(a_{\text{processor}}, a_{\text{mode}})\}$, where $a_{\text{processor}}$ indicates the “turn on” or “turn off” action for a processor, and $a_{\text{mode}}$ represents a change of the communication mode for a UE (i.e., C-RAN, FAP, or D2D). To achieve a green F-RAN, the reward function is defined as the negative of the system energy consumption, which is incurred by the operation of processors in the cloud, fronthaul transmission, and wireless transmission in the fog tier. To enhance the performance of the proposed algorithm, multiple techniques are developed and integrated in the DRL model.
First, prioritized replay is used to reuse experienced transitions more effectively. Second, double DRL is used to overcome optimistic Q-value estimation and to improve learning when the environment changes. Finally, transfer learning is integrated to accelerate learning and allow quick convergence. These techniques account for the superiority of the proposed algorithm over the related and baseline works.
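The prioritized replay idea mentioned above can be illustrated with a minimal sketch. It assumes the common proportional scheme, in which a transition is sampled with probability proportional to a power of its absolute TD error; the class name, the exponent alpha, and the offset eps are hypothetical and are not taken from [60].

```python
import random


class PrioritizedReplay:
    """Proportional prioritized replay (illustrative; not the buffer of [60])."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha      # assumed exponent: 0 = uniform, 1 = fully prioritized
        self.eps = eps          # keeps every priority strictly positive
        self.items = []         # stored transitions
        self.priorities = []    # one priority per stored transition

    def add(self, transition, td_error):
        if len(self.items) >= self.capacity:
            # Buffer full: drop the oldest transition.
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, k, rng=random):
        # Transitions with a larger TD error are drawn more often.
        return rng.choices(self.items, weights=self.priorities, k=k)
```

A full implementation would typically also correct the sampling bias with importance weights, which this sketch omits.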
The Internet of vehicles (IoV), in which all vehicles are connected, has emerged as a fundamental element of smart cities that uses real-time data to react instantly to user requests. Vehicular fog computing (VFC) has emerged as a paradigm that addresses the dynamic management of networking, caching, and computing resources to enhance the efficiency of next-generation vehicular networks [61]. In these networks, both moving and parked vehicles serve as fog nodes, which have limited resources for offering services such as communication, computation, and storage. Given the highly dynamic and complicated nature of the VFC environment, enhancing QoS metrics such as real-time response is a challenging job. This problem is investigated for vehicular applications in [62], which seeks efficient resource allocation strategies to minimize the service latency. To this end, an RL algorithm is developed that employs an LSTM-based DNN to predict the movement and parking status of vehicles, thus facilitating the resource assignment. In addition, the proposed RL technique uses proximal policy optimization, which can continuously learn the dynamic environment and adjust the resource allocation decisions accordingly.
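For readers unfamiliar with proximal policy optimization, its core is a clipped surrogate objective that limits how far one update can move the policy. The sketch below shows that objective for a single sample; the function name and the clip range of 0.2 are assumptions for illustration and are not details reported for [62].

```python
def ppo_clip_objective(ratio, advantage, clip=0.2):
    """Clipped PPO surrogate for one sample.

    ratio: new-policy probability divided by old-policy probability
    advantage: estimated advantage of the action that was taken
    clip: assumed clip range (hypothetical default)
    """
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    # Taking the minimum removes any incentive to move the ratio
    # outside the interval [1 - clip, 1 + clip].
    return min(ratio * advantage, clipped_ratio * advantage)
```

In training, this quantity is averaged over a batch of samples and maximized with respect to the policy parameters.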
Providing high-quality live streaming with low bit-rate variation for vehicles is a significant challenge because of the dynamic characteristics of wireless resources and channels in the IoV. A transcoding and streaming scheme is presented that maximizes the video bit rate while reducing bit-rate variance and time delays in VFC-powered IoV. The scheme formulates the joint optimization of vehicle scheduling, bit-rate selection, and spectrum/computational resource allocation as an MDP. A deep RL algorithm, i.e., soft actor-critic based on the maximum entropy framework, is employed to solve the MDP [63]. Moreover, an asynchronous advantage actor-critic (A3C) RL-based cooperative resource allocation and computation offloading framework for vehicular networks is presented in [64].
In another VFC application, a resource sharing scheme for supporting task offloading in VFC is presented in [65]. In this model, vehicles are incentivized to share their idle computing resources through dynamic pricing, which jointly accounts for task priority, service availability, and vehicle mobility. Because of the dynamic vehicular environment, task offloading is formulated as an MDP that aims to maximize the mean latency-aware utility of tasks over a period of time. Based on the DRL method, a soft actor-critic algorithm is developed to maximize both the policy entropy and the expected reward. Moreover, a mobile network operator (MNO) preference and switching problem is formulated by jointly considering the switching cost, the various prices charged by different MNOs and fog servers, and the quality-of-service variations among MNOs. A switching policy based on a double deep Q-network (DQN) is presented and shown to reduce each vehicle's long-term mean cost with promising reliability and latency performance [66]. Similarly, the optimal computation offloading policy in an ultra-dense system for mobile edge computing (MEC) is modeled as an MDP. A deep Q-network-based RL algorithm is presented as the computation offloading method to overcome the large dimensionality and to determine the optimal policy under dynamic statistics and without prior knowledge [67]. A semi-MDP is formulated for an optimal and agile resource slicing framework that simultaneously allocates the computing, storage, and radio resources of the network provider to various slice requests. A deep Q-learning method with dual neural networks is implemented, which outperforms other RL-based approaches for network slicing management [68].
B. Task Scheduling
In IoT-fog-cloud systems, task scheduling refers to deciding which tasks are processed by the IoT layer, the fog layer, or the cloud layer to achieve the target design objectives [12], [69]. In most applications, the main objective of scheduling algorithms is to minimize the task execution delay. However, an efficient scheduling design may simultaneously improve other system performance indicators, such as energy consumption and workload balance.
To minimize the long-term service delay and computation cost of fog computing systems under task deadline and resource constraints, a double deep Q-learning (DDQL)-based scheduling algorithm is introduced in [70]. Considering a fog-based IoT system with a hierarchical architecture, the work develops schedulers (the agents in RL terms), each embedded in a corresponding gateway (GW), to allocate resources (i.e., virtual machines (VMs) embedded in the fog nodes and the cloud servers) for executing tasks. To reduce the dimension of the RL problem, paths with updated resource states are modeled as the state space of the system, [TeX:] $$S=\left\{s(t)_{j}^{i}=\left(\mathit{uCPU}_{j}^{i}, \mathit{uMemory}_{j}^{i}, \mathit{uStorage}_{j}^{i}\right)\right\}.$$ In this formula, the three elements represent the utilization of [TeX:] $$\text{path}_{i}$$ in terms of CPU, memory, and storage, respectively, at the moment t when [TeX:] $$\text{task}_{j}$$ arrives. The agent (i.e., scheduler) assigns a certain resource (i.e., a VM in the fog or cloud) to process the task through its action. Thus, the action space is formulated as [TeX:] $$A=\left\{a_{j}^{i} \mid 1 \leq a_{j}^{i} \leq \mathit{vmn}_{i}\right\},$$ where [TeX:] $$a_{j}^{i}$$ is the action taken by [TeX:] $$\text{agent}_{i}$$ for [TeX:] $$\text{task}_{j},$$ and [TeX:] $$\mathit{vmn}_{i}$$ is the total number of VMs in [TeX:] $$\text{path}_{i}.$$ To capture the end-to-end (E2E) service delay, the immediate reward function is defined as [TeX:] $$\mathit{IR}_{j}^{i}(a)=1 / \mathit{nSD}_{j}^{i},$$ where [TeX:] $$\mathit{IR}_{j}^{i}(a)$$ is the immediate reward after taking action a for [TeX:] $$\text{task}_{j}$$ in [TeX:] $$\text{path}_{i},$$ and [TeX:] $$\mathit{nSD}_{j}^{i}$$ is the normalized service delay of [TeX:] $$\text{task}_{j}$$ in [TeX:] $$\text{path}_{i}.$$ The term [TeX:] $$\mathit{nSD}_{j}^{i}$$ accounts for the waiting time in [TeX:] $$\text{path}_{i}$$ to get [TeX:] $$\mathit{VM}_{a},$$ the execution delay on [TeX:] $$\mathit{VM}_{a},$$ and the transmission and propagation time of [TeX:] $$\text{task}_{j}.$$ Because the reward is the reciprocal of the delay, an agent that maximizes the cumulative reward through efficient action selection also minimizes the E2E delay. To select the optimal action, the double DQL concept is introduced, in which each agent is supported by two separate models, [TeX:] $$Q_{1}$$ and [TeX:] $$Q_{2},$$ for action selection and Q-value calculation, respectively. With two Q-values, the agents reduce the probability of taking invalid or inefficient actions, thus accelerating the learning process. To evaluate the framework, the work first creates a simulation environment with multiple agents, state spaces, action spaces, and reward functions. Then, the proposed DDQL-based scheduling algorithm is applied in this environment to decide which fog nodes process which tasks in order to achieve the objectives. In addition, target network and experience replay mechanisms are integrated into the DDQL-based scheduling policy to suppress fluctuations in the results.
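The reciprocal reward and the split between action selection and Q-value calculation can be sketched as follows. This is an illustrative sketch only: the function names, the discount factor gamma, and the list-based Q-value inputs are assumptions and do not reproduce the implementation in [70].

```python
def immediate_reward(normalized_delay):
    """IR = 1 / nSD: a shorter normalized service delay gives a larger reward."""
    return 1.0 / normalized_delay


def double_q_target(reward, next_q1, next_q2, gamma=0.9, done=False):
    """Double-Q learning target.

    next_q1: Q1 values of every VM choice in the next state (selects the action)
    next_q2: Q2 values of every VM choice in the next state (evaluates it)
    gamma: assumed discount factor (hypothetical default)
    """
    if done:
        return reward
    # Q1 picks the action, Q2 scores it, which curbs over-optimistic estimates.
    best_action = max(range(len(next_q1)), key=lambda a: next_q1[a])
    return reward + gamma * next_q2[best_action]
```

For example, with `next_q1 = [1.0, 3.0]` the second VM is selected, and its value is then read from `next_q2` rather than from `next_q1`.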
In data-driven IoT-based systems, the end devices or IoT sensors constantly generate online tasks, which the upper layers such as the fog or the cloud must process within their deadlines. Online tasks pose critical challenges for task scheduling because there is an inherent lack of prior information about task arrivals. This stresses the need for adaptive scheduling solutions, which have been investigated and developed in the literature using the RL principle. In particular, DRL-based approaches have proven effective for online task scheduling. For example, the work [22] designed an efficient scheduler based on a feedforward neural network (FNN), which schedules n online tasks at a time to reduce the overall task slowdown. Although the algorithm performs well for a predetermined n, it lacks flexibility, since adjusting n may degrade the system performance. In a similar manner, the RL-based model in [71] makes a scheduling decision for each arriving task. However, such a method is better suited to the cloud environment than to fog computing, since it can induce considerable overhead. The work presented in [72] identifies these limitations and proposes neural task scheduling (NTS) to overcome them. In principle, NTS adopts a recurrent neural network (RNN) model based on the pointer network [73] to obtain more flexible output sequences. In addition, the network integrates long short-term memory (LSTM) and an attention mechanism [74] to enhance flexibility and learning efficiency when handling long sequences of tasks.
From the RL design perspective, the state space S is modeled by the system state in time slot t, which is represented by all [TeX:] $$n_{t}$$ pending tasks with their characteristics (i.e., resource demands and execution durations) and the amount of remaining resources in the future M time slots. In mathematical form, [TeX:] $$S=\left\{s_{t}\right\},$$ where [TeX:] $$s_{t}$$ is a matrix of size [TeX:] $$n_{t} \times(m+1+m \cdot M),$$ and m is the number of resource types (e.g., CPU, storage, and memory). Regarding the action space, the action at time slot t is defined as [TeX:] $$a_{t}=\left\{j_{1}, j_{2}, \cdots, j_{n_{t}}\right\},$$ which determines the order of resource allocation for the [TeX:] $$n_{t}$$ pending tasks. The reward function is defined as [TeX:] $$r_{t}=\sum_{j \in n_{t}} 1 / l_{j},$$ where [TeX:] $$l_{j}$$ is the execution duration of task j. Thus, the average task slowdown is minimized as the agent aims at maximizing the cumulative reward. Extensive simulations show that the algorithm efficiently reduces the average task slowdown while ensuring the best QoS.
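To make the state dimensions concrete, the following sketch builds a state matrix of size n_t x (m + 1 + m*M) and evaluates the reward defined above. The column order (task demands, then duration, then the flattened remaining resources) and the function names are assumptions chosen for illustration; the exact layout used in [72] is not given here.

```python
def build_state(tasks, free_resources):
    """Build s_t as a list of n_t rows, each of length m + 1 + m*M.

    tasks: list of (demands, duration) pairs, where demands has m entries
    free_resources: M rows, each holding the m remaining resource amounts
                    of one future time slot
    """
    # Flatten the M x m table of remaining resources into m*M values.
    flat_free = [amount for slot in free_resources for amount in slot]
    return [list(demands) + [duration] + flat_free
            for demands, duration in tasks]


def reward(durations):
    """r_t = sum over the pending tasks of 1 / l_j."""
    return sum(1.0 / l for l in durations)
```

With m = 2 resource types and M = 3 future slots, each row has 2 + 1 + 6 = 9 entries, one row per pending task.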
Furthermore, task scheduling is investigated in an edge computing scenario in which various tasks are scheduled to virtual machines to maximize the long-term task satisfaction degree. The problem is formulated as an MDP, for which the state, action, state transition, and reward are defined. DRL is applied to time scheduling and resource allocation, taking into account the heterogeneity of the tasks and the diversity of the available resources [75]. To achieve multi-resource fairness among diverse tasks, an online task scheduling system based on DRL techniques, FairTS, is proposed. The system learns directly from experience to efficiently reduce the mean task slowdown while guaranteeing multi-resource fairness among the tasks [26]. Moreover, in industrial applications, network traffic and computation offloading are explored using RL techniques to investigate the tradeoff between service delay and energy consumption. A cost minimization problem is formulated in the MDP framework, and dynamic RL scheduling algorithms are proposed to resolve the offloading decision problem [77].
Even though fog networking is a promising technology for handling the limitations of the cloud and current networks, challenges remain to be addressed in the future. Most significantly, there is a need for a distributed intelligent platform at the edge that controls distributed computing, networking, and storage resources. Optimal distribution decisions in fog networks are difficult because of the uncertainties associated with task requirements and available resources at the fog nodes, and because of the wide range of computing capabilities of the nodes [78]. Moreover, the delay between nodes, which can increase the processing time, must be considered in the distribution decision. Hence, the difficulties encountered by the fog computing model are diverse and numerous; they include significant decisions about i) whether or not to offload at fog nodes, ii) the optimal number of tasks to offload, and iii) the mapping of incoming tasks to possible fog nodes under the corresponding resource limits [79]. Considering these challenges, the offloading problem is expressed as an MDP subject to the fog nodes' behavior and the dynamics of the system. The MDP enables fog nodes to offload their computation-intensive tasks by choosing the most suitable adjacent fog node in the presence of uncertainties about the task requirements and the availability of resources at the fog nodes. Nevertheless, the system cannot accurately predict the transition probabilities and rewards because of dynamically varying incoming task requirements and resource states. To resolve this dilemma, RL can be used to solve MDPs with unknown reward and transition functions by learning from observed experience.
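How RL sidesteps the unknown transition and reward functions can be illustrated with a tabular Q-learning update, which only needs observed (state, action, reward, next state) samples. This is a generic sketch, not the algorithm of any cited work; the state and action labels, the learning rate alpha, and the discount factor gamma are hypothetical.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One model-free Q-learning step from a single observed transition.

    Q: mapping from (state, action) to the current value estimate
    actions: candidate actions, e.g. the neighbor fog nodes to offload to
    """
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the estimate a fraction alpha toward the sampled target.
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]
```

A table created as `Q = defaultdict(float)` starts every estimate at zero, so no prior model of the fog environment is required.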
C. Task Offloading and Redistribution
Workload imbalance among the fog resources, mainly caused by the heterogeneity of the fog computing environment, can degrade the performance of computing systems in long-term operation. This calls for efficient mechanisms that address the situation by offloading and redistributing the load.
The task offloading problem under uncertainties in the mobility of end user (EU) devices, the mobility of cloudlets, and the resource availability of cloudlets is studied in [80]. A deep Q-network (DQN) [81] is used to learn an efficient solution and then derive the optimal actions, i.e., how many tasks are processed locally by the EU devices and how many tasks are offloaded to the cloudlets. In the proposed model, the state space is defined as [TeX:] $$S=\left\{s=\left(Q^{u}, Q^{c}, D\right)\right\},$$ where [TeX:] $$Q^{u},$$ [TeX:] $$Q^{c},$$ and [TeX:] $$D$$ denote the queue state of the EU device, the queue state of the cloudlets, and the distance state of the cloudlets, respectively. The mobility of devices and cloudlets affects the performance of their communication; thus, the distance state is used to capture the changes in the computing system. To determine the optimal offloading decision, the action space is defined as [TeX:] $$A=\left\{a=\left(a_{0}, \cdots, a_{i}, \cdots, a_{N}\right)\right\},$$ where [TeX:] $$a_{0}$$ and [TeX:] $$a_{i}$$ are the numbers of tasks to be processed locally and by cloudlet i, respectively. The immediate reward function is defined as [TeX:] $$R(s, a)=U(s, a)-C(s, a),$$ where [TeX:] $$U(s, a)$$ and [TeX:] $$C(s, a)$$ are the immediate utility and cost functions, respectively.
In the utility and cost functions, [TeX:] $$\rho$$ is a utility constant and N is the number of cloudlets deployed in the system to assist the offloading process. In addition, [TeX:] $$I(s, a),$$ [TeX:] $$E(s, a),$$ [TeX:] $$D(s, a),$$ and [TeX:] $$\Gamma(s, a)$$ are the immediate required payment, energy consumption, delay, and task loss probability, respectively. Therefore, by maximizing the cumulative reward, the algorithm obtains the maximum utility as well as the minimum operation cost. The Q-network can be considered as a neural network approximator of the action-value function [TeX:] $$Q(s, a ; \theta)$$ with weights [TeX:] $$\theta.$$ At each decision period, the user first takes the state vector [TeX:] $$s=\left(Q^{u}, Q^{c}, D\right)$$ as the input of the Q-network and obtains the Q-values [TeX:] $$Q(s, \cdot)$$ for all possible actions a as outputs. Then, the user selects the action according to the [TeX:] $$\epsilon$$-greedy method. Furthermore, the Q-network is trained by iteratively adjusting the weights [TeX:] $$\theta$$ to minimize a sequence of loss functions, one for each time step t.
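The action selection and training signal described above can be sketched in a few lines. The sketch assumes the squared temporal-difference error of the standard DQN formulation [81] for a single transition; the function names, the discount factor gamma, and the separately supplied target-network values are assumptions and are not details taken from [80].

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_loss(q_sa, reward, next_q_target, gamma=0.9):
    """Squared TD error of one transition.

    q_sa: Q(s, a; theta) predicted for the action that was taken
    next_q_target: Q-values of the next state from the (assumed) target network
    """
    target = reward + gamma * max(next_q_target)
    return (target - q_sa) ** 2
```

In practice, the loss is averaged over a minibatch drawn from the replay memory and minimized by gradient descent on the weights of the Q-network.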
Although DQN-based algorithms can achieve excellent performance in high-dimensional decision-making problems [81], the proposed algorithm is evaluated in a simulation environment with only 4 cloudlets. The task arrival rate is varied to analyze its impact on the queue states of the EU devices and cloudlets. In addition, the work provides no comparison with baseline or related works, so the extent of the performance improvement remains unexplored.
Balancing the workload among nodes while minimizing the task processing delay is studied in [26]. In an SDN-based fog computing system, the SDN fog controller knows the global information about the system states (e.g., task profiles and queue states of fog devices) and can thus derive the optimal task offloading. Using a DRL-based approach, the fog controller serves as the agent that makes the decisions. In this model, the state space is defined as [TeX:] $$S=\left\{s=\left(n^{l}, w, Q\right)\right\},$$ where [TeX:] $$n^{l}$$ is the fog node, w is the number of tasks to be allocated per unit time, and [TeX:] $$Q=\left(Q_{1}, \cdots, Q_{N}\right)$$ is a vector indicating the number of tasks currently remaining in the queues of the N fog nodes. The action space has the form [TeX:] $$A=\left\{a=\left(n^{0}, w^{0}\right)\right\},$$ in which [TeX:] $$n^{0}$$ is a neighbor node of [TeX:] $$n^{l},$$ and [TeX:] $$w^{0}$$ is the number of tasks to be offloaded to [TeX:] $$n^{0}.$$ Aiming at maximizing the utility while minimizing the task processing delay and overload probability, the immediate reward function is modeled as [TeX:] $$R(s, a)=U(s, a)-(D(s, a)+O(s, a)),$$ where [TeX:] $$U(s, a)$$ is the immediate utility, [TeX:] $$D(s, a)$$ is the immediate delay, accounting for the waiting delay in queues, the communication delay for offloading, and the execution delay at the local device and the offloading neighbor node, and [TeX:] $$O(s, a)$$ is the overload probability averaged over [TeX:] $$n^{0}$$ and [TeX:] $$n^{l}.$$ Q-learning with the [TeX:] $$\epsilon$$-greedy algorithm is applied in the RL-based model to derive the optimal action selection.
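The structure of this reward can be written out directly. The sketch below follows the formula R = U - (D + O) with the overload probability averaged over the local and neighbor nodes; the function and parameter names are hypothetical, and any scaling weights that [26] may apply to the individual terms are omitted.

```python
def offloading_reward(utility, delay, p_overload_local, p_overload_neighbor):
    """R(s, a) = U(s, a) - (D(s, a) + O(s, a)).

    delay: total of waiting, communication, and execution delay
    p_overload_local / p_overload_neighbor: overload probabilities of the
        local node and of the neighbor node receiving the offloaded tasks
    """
    # O(s, a) is the overload probability averaged over the two nodes.
    overload = (p_overload_local + p_overload_neighbor) / 2.0
    return utility - (delay + overload)
```

Offloading to a lightly loaded neighbor lowers both the delay and the averaged overload term, so the same utility then yields a larger reward.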
The benefit of task offloading in fog networks depends on selecting appropriate nodes and managing resources suitably while assuring the QoS requirements of the users [82]. The work [82] considers joint task offloading and resource allocation management for heterogeneous service tasks across multiple fog nodes. The problem is formulated as a partially observable stochastic game in which the nodes cooperate to maximize the combined local rewards. A deep recurrent Q-network (DRQN) method is applied to cope with the partial observability, and an adaptive exploration-exploitation approach is used to guarantee the accuracy and convergence of the neural network [82]. Furthermore, for IoT applications, the sequential allocation of the limited resources of fog nodes under heterogeneous latency requirements is considered. The problem in the fog radio access network is formulated as an MDP, and different RL approaches, such as Q-learning, Monte Carlo, SARSA, and Expected SARSA, are then applied to derive optimal decision-making policies [83].
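Two of the listed methods differ only in how they bootstrap from the next state, which the following sketch makes explicit. It assumes an epsilon-greedy behavior policy for Expected SARSA; the function names and the discount factor gamma are hypothetical and are not settings reported in [83].

```python
def sarsa_target(reward, next_q, next_action, gamma=0.9):
    """SARSA bootstraps from the action actually taken in the next state."""
    return reward + gamma * next_q[next_action]


def expected_sarsa_target(reward, next_q, epsilon, gamma=0.9):
    """Expected SARSA averages next-state values under the epsilon-greedy policy."""
    n = len(next_q)
    greedy = max(range(n), key=lambda a: next_q[a])
    # Every action gets epsilon / n; the greedy one gets the remaining mass.
    probs = [epsilon / n + (1.0 - epsilon if a == greedy else 0.0)
             for a in range(n)]
    expected_value = sum(p * q for p, q in zip(probs, next_q))
    return reward + gamma * expected_value
```

Averaging over the policy removes the randomness of the sampled next action, which usually lowers the variance of the update.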
In many fog-enabled networks, the task nodes, i.e., fog nodes with tasks queued for processing, do not know the resource information of their neighbors. Therefore, offloading decisions require a trade-off between exploiting the empirically best nodes and exploring other nodes to find more beneficial actions, which is commonly addressed by the [TeX:] $$\epsilon$$-greedy algorithm [84], [85]. However, this low-complexity solution converges slowly and is non-optimal [86]. In this context, multi-armed bandit (MAB)-based solutions are developed to address these shortcomings [87]. In particular, the upper-confidence bound (UCB) mechanism is integrated to obtain guaranteed performance with low complexity [88], [89]. Accordingly, the work [88] introduces the BLOT (bandit learning-based offloading of tasks) algorithm to offload non-splittable tasks. Meanwhile, D2CIT, a decentralized computation offloading algorithm, is proposed in [89] to offload the subtasks that constitute a high-complexity task.
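The UCB idea can be illustrated with the classic UCB1 index, which adds an exploration bonus that shrinks as a node is tried more often. This is a generic sketch rather than the BLOT or D2CIT algorithm; the function name and the exploration constant c are assumptions.

```python
import math


def ucb1_select(counts, mean_rewards, c=2.0):
    """Return the index of the neighbor node with the highest UCB1 score.

    counts: number of times each node has been selected so far
    mean_rewards: empirical mean reward observed for each node
    c: assumed exploration constant (hypothetical default)
    """
    for node, n in enumerate(counts):
        if n == 0:
            # Try every node once before comparing scores.
            return node
    total = sum(counts)
    scores = [mean_rewards[i] + math.sqrt(c * math.log(total) / counts[i])
              for i in range(len(counts))]
    return max(range(len(scores)), key=lambda i: scores[i])
```

A rarely tried node keeps a large bonus, so it is revisited occasionally even when its empirical mean is lower than that of the current best node.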
For the sake of clarity, Table III summarizes the notable applications of RL to fog computing resource allocation proposed in the literature.
V. CHALLENGES AND OPEN ISSUES OF RL-BASED ALGORITHMS IN FOG RESOURCE ALLOCATION
Although RL is a powerful approach for introducing intelligence into fog computing-based systems, many challenges and open issues must still be addressed to fully exploit the potential of RL in assisting the fog paradigm. This section enumerates the key challenges and explores the corresponding open issues regarding the use of RL-based approaches for solving resource allocation problems in the fog computing environment. We identify and discuss them according to three key classes relating to RL, the fog computing environment, and the computing tasks.
A. RL-related Challenges
1) Nature of RL-based Algorithms: RL-based algorithms are inherently time and resource consuming, since they require a large volume of data collected through exploration and exploitation to obtain an effective learning model. Meanwhile, fog computing resources are heterogeneous and limited in terms of computation, storage, and energy compared to cloud computing servers. Therefore, running RL-based algorithms on fog devices in long-term operation is a challenging job that calls for appropriate and lightweight algorithm designs.
THE SUMMARY OF KEY RL-BASED ALGORITHMS IN THE RESOURCE ALLOCATION PROBLEMS.
2) Load Balancing in RL-enabled Fog Computing: RL approaches can help nodes find the optimal policies (i.e., the number and size of tasks, and the offloadees in both the fog stratum and the cloud stratum) for offloading their requested tasks. An action selection is considered optimal if it minimizes the overload probability and processing time [91]. However, the action selection in most of the reviewed works depends mainly on the exploration policy. This probably leads to greedy and optimistic decisions that choose the more powerful fog resources for offloading, which in turn results in workload imbalance. To address the imbalance from the RL perspective, merging model-free RL with a model-based learning method can provide bias-free results that depend less on the exploration policy.
3) Task Scheduling in RL-enabled Fog Computing: Even though a fog node is equipped with better storage and computing power than end devices, it still possesses far fewer resources than cloud servers. Due to network heterogeneity, complexity, and the uncertainty of the wireless environment, task scheduling in the fog computing environment is categorized as a complex problem [70]. RL algorithms can model complex problems; however, as the state and action space dimensions increase, the hardware requirements of the scheduler also increase. Deep RL solutions for multi-dimensional problems would require multiple deep learning models, which add to the computational power, storage, and memory requirements. RL should therefore provide a lightweight but efficient solution to the complex task scheduling problem.
4) Energy Efficiency Trade-off in RL-enabled Fog Computing: In time-critical IoT applications, RL-assisted fog computing systems could bring intelligent features that utilize readily available data and computing resources. Minimizing delay and energy consumption simultaneously has been a key challenge for RL algorithms [92]. Training the learning model consumes considerable energy, and fast learning on time-critical systems with limited samples is a complex problem for RL-enabled fog systems. Thus, many open challenges remain in deploying RL models with large-scale state-action spaces on resource-constrained fog systems.
5) Real-time Responsiveness in RL-enabled Fog Computing: Ultra-reliable and low latency communication (URLLC) and real-time interaction are among the main enablers of IoT and 5G communication [93]. Fog computing can support computation tasks with low latency, thus enabling some kinds of real-time applications. In a heterogeneous IoT network, some applications are delay tolerant while others are delay sensitive. RL algorithms give fog nodes the capability to utilize resources intelligently; however, they also take time to process large-scale state-action-reward tuples [94]. With multidimensional state and action spaces, the processing time increases further. Therefore, one of the critical challenges of RL-enabled fog networks is to intelligently satisfy time-critical IoT applications. A deep RL system can learn more quickly and efficiently through episodic memory and meta-learning techniques, which have not been explored in the literature [94].
6) Security and Privacy in RL-enabled Fog Computing: The RL algorithm in fog nodes collects and processes a large amount of critical data from network devices. Fog devices are distributed and have limited resources. Because of this limited computing power, it is challenging for RL-enabled fog devices to execute proper security solutions in parallel with the learning mechanisms. Similarly, there is an absence of trust between IoT devices and edge computing nodes [95], [96]. RL combined with blockchain can solve the trust issue. In such a solution, how to optimize the various blockchain metrics using RL techniques is a critical question that needs to be explored.
7) Advance of Optimization Algorithms: RL algorithms are time-consuming because they require extensive time for the learning process. Therefore, RL-based algorithms should be advanced to reduce the convergence time and thus accelerate decision making. In addition, the performance of RL-based algorithms depends on the complexity of the fog network and the arrival rate of tasks. Thus, to achieve a good trade-off between training time and performance, it is recommended to reduce the sample data set to a reasonable scale with sampling techniques, which lowers the complexity of the scheduling model. More advanced algorithms are required to improve the speed and effectiveness of the learning process. How to reduce the dimension of the problems (i.e., the state space and action space) to accelerate the learning process remains an open issue.
B. Fog Computing Environment Related Challenges
The fog computing tier is increasingly essential in IoT-enabled systems for meeting application requirements. However, different applications require different designs of the fog computing architecture (i.e., either application-specific or application-agnostic), which makes it challenging to use RL in this context, because no common RL-based solution can be used across different fog computing architectures and applications.
1) RL-based Resource Allocation in F-RAN: For densely deployed IoT devices, the cloud radio access network (C-RAN) architecture has been proposed. C-RAN improves spectral efficiency and reduces energy consumption. However, the growing number of IoT devices and applications places a high burden on centralized cloud computing. Busy cloud servers and limited fronthaul capacity cause large computation and transmission delays. Some IoT applications are delay-sensitive and cannot tolerate such delays. To handle this problem, F-RAN is a critical solution for fifth-generation (5G) communication systems to support the URLLC requirements of IoT devices and applications [97]. The fog nodes are capable of performing signal processing, computation, and RF functionalities. The IoT environment is heterogeneous in nature, with various traffic transmission rates and latency needs. The fog nodes are expected to allocate resources in F-RAN efficiently. RL methods provide a solution for efficient resource utilization while satisfying the low-latency requirements of various IoT applications. Multi-agent RL is utilized to enhance network resource utilization in heterogeneous environments. Similarly, model-free RL techniques are used for user scheduling in heterogeneous networks to improve network energy efficiency [98]. Various RL methods, such as Q-learning, SARSA, Expected SARSA (E-SARSA), and Monte Carlo (MC), are used for resource allocation problems. However, open issues remain, such as providing a dynamic resource allocation framework with heterogeneous service times. Similarly, collaborative resource allocation mechanisms among multiple fog nodes are one of the future directions.
2) RL-based Power Consumption Optimization for F-RAN: F-RAN with 5G support is well suited to provide various IoT services, including healthcare, IIoT, and autonomous vehicles. In F-RAN, each device can operate in different communication modes, including D2D, C-RAN, or FAP. In the resource management mechanism, the communication mode selection problem is considered NP-hard due to network dynamics and the continuously changing environment. Nevertheless, applying deep reinforcement learning (DRL) has shown considerable benefits in complex environments with high-dimensional raw input data. Thanks to the caching capabilities of D2D transmitters, devices can obtain the desired information locally without accessing the base station. In this way, the burden on the fronthaul can be reduced by traffic offloading, and some processors in the cloud can be turned off for energy optimization. The dynamics of the cache state of the D2D transmitters can be modeled as an MDP [99]. In such an MDP, the network controller learns to minimize the system power consumption by adjusting the communication modes of the devices and the on-off states of the processors at each learning step. However, further research on DRL-based resource allocation is required to incorporate power control of D2D communicating devices, sub-channel allocation, and fronthaul resource allocation to improve the F-RAN performance.
3) RL for Ultra-dense Fog Network: An ultra-dense network assisted by fog computing is a promising solution for sustaining the increasing traffic demand in wireless networks. Network densification brings a number of challenges, including resource management [100]. Machine learning, and particularly RL, has been shown to solve resource management challenges effectively. In an ultra-dense fog network scenario, RL algorithms need to support parallelism and the partitioning of large-scale networks to manage the computational load of fog devices. Similarly, in wired and wireless backhaul networks, bandwidth allocation must be considered, since it is an important factor affecting performance. The powerful capabilities of deep learning and neural networks can enhance the performance of the resulting DRL-based methods in solving high-dimensional problems. Many proposed models reduce the dimension by efficiently configuring the state and action spaces. However, none of the existing works has investigated or developed RL-based algorithms for large-scale fog computing systems, which contain a large number of fog nodes. This remains an open issue in practice.
4) Reliability of Fog Networks: The complex nature of the fog computing environment may cause reliability problems for the fog network. For example, the dynamic mobility of fog nodes, which strongly impacts the fog resource status, must be taken into account in designing the algorithms. In addition, an outdated channel can block the communication between fog nodes [101]. Moreover, VMs in both the fog stratum and the cloud may fail and lead to QoS degradation [102]. Hence, the reliability of VMs should also be considered when addressing the fog resource provisioning problem. None of the reviewed algorithms considers the reliability of the fog network, which leaves this issue open for future studies.
5) Security and Trust in Fog Network: A fog node is responsible for ensuring security and trust for other devices. It must apply a global concealing process to the released data and provide a mechanism that lets all nodes place a certain level of trust in each other. Because the fog node handles the workload of other nodes in real time, protecting integrity in case of a malicious attack is one of the challenges of fog computing. Similarly, authentication is essential to provide a secure connection between the fog node and other devices. Authentication must be provided for real-time data communication, particularly in scenarios where nodes move from one coverage area to another. The user should experience minimal delay in real-time services while traveling, yet the authentication process performed in the fog node adds latency. During the authentication process, the user identity may also be exposed to attackers. Authorization and authentication are therefore among the major concerns in fog networks, and providing real-time services with secure authentication in a fog computing environment is a priority research area.
C. Computing Task Related Challenges
In the IoT context, the requested computing tasks vary widely in terms of profiles, complexity levels, and resource requirements for computation. These variations make it challenging to apply RL-based algorithms.
1) Big Data Analytics: IoT and end-user devices increasingly produce a huge amount of high-dimensional big data to be processed at the fog nodes [103]. The prediction model and the RL parameters, e.g., learning rate and discount factor, need to be selected carefully to obtain an optimized model for big data analytics. A proper analytic model can produce accurate results and can learn from heterogeneous data sources. In big data analytics, one of the challenges a learning algorithm faces is how to distribute big data fairly among resource-constrained fog devices.
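For reference, the textbook tabular Q-learning update shows where the learning rate and the discount factor enter. This is the generic form, not a model specific to any of the surveyed works; here $\alpha \in (0, 1]$ denotes the learning rate and $\gamma \in [0, 1)$ the discount factor.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

A larger $\alpha$ makes the estimate track recent samples more quickly at the cost of higher variance, while a $\gamma$ closer to 1 gives more weight to long-term reward.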
2) Data Fragmentation for Parallel Computing Exploitation: Besides big data analytics tasks, many tasks in IoT-enabled systems can be complex in terms of size and data structure. For example, the input data of an ML computing task can contain four types of data: text, image, video, and audio, which require many kinds of resources to process. However, the limited fog computing resources cause workload imbalance among the fog nodes, since many nodes have insufficient available resources to process a single task. Data division is a key approach to solving this issue in the complicated heterogeneous fog environment [27]. However, the diversity of input data structures in practical applications requires alternative division solutions for improving or optimizing the system performance. For instance, when the data dependency constraints among the subtasks are taken into account in the associated workflow model, the collaborative task offloading mechanism must be adapted accordingly. In addition, the data can be divided according to different features, for example explicitly by size. In this way, an optimization problem can be formulated to find the optimal number of data subsets and their associated sizes. Although the input data of tasks can be divided to benefit from parallel computing, doing so may enlarge the state and action spaces when RL models are applied. Therefore, achieving an efficient trade-off between performance and training time requires searching for an optimal number of data subsets into which the input data is divided.
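The search for a suitable number of data subsets can be illustrated with a small sketch. The cost model below is a hypothetical assumption made only for illustration: the input is split into k equal subsets processed in parallel on identical fog nodes, and each additional subset adds a fixed coordination overhead. The function names and parameters are likewise illustrative and do not come from the surveyed works.

```python
def completion_time(data_size, k, rate, overhead_per_subset):
    """Assumed model: parallel processing time plus overhead that grows with k."""
    return data_size / (k * rate) + overhead_per_subset * k


def best_subset_count(data_size, rate, overhead_per_subset, max_subsets):
    """Exhaustively search k = 1..max_subsets for the smallest completion time."""
    candidates = range(1, max_subsets + 1)
    return min(
        candidates,
        key=lambda k: completion_time(data_size, k, rate, overhead_per_subset),
    )


if __name__ == "__main__":
    k = best_subset_count(data_size=100.0, rate=1.0, overhead_per_subset=1.0,
                          max_subsets=20)
    print(k, completion_time(100.0, k, 1.0, 1.0))
```

Even in this toy model, splitting further stops paying off once the overhead term dominates. In an RL setting the same tension appears between the parallelism gained and the growth of the space the agent must explore.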
VI. CONCLUSIONS
Fog computing has been integrated into a wide range of IoT-enabled systems as a supporting computing resource to relieve the pressure on cloud computing resources, thus improving the operational performance of these systems. However, the fog computing environment is a complex resource pool in terms of heterogeneity, mobility, and dynamic change, which serve as critical barriers to achieving efficient and effective resource allocation strategies. In addition, computing tasks vary with respect to task characteristics and resource demands. Moreover, most of the efficient heuristic algorithms in the literature lack the adaptivity and flexibility to respond to the uncertainties of the fog computing environment. These challenges stress the need to develop alternative resource allocation solutions that deal flexibly with the complexity of the fog computing environment.
This paper surveys the literature on the applications of RL in the fog computing environment to allocate computing resources for task computation and execution. The concept of RL is briefly introduced to highlight its role and algorithmic models in deriving optimal decisions in many practical applications (e.g., games, robotics, and finance). A state-of-the-art literature review is conducted to describe in depth the key RL-based solutions for resource allocation problems in the fog computing environment. We identify and analyze these algorithms according to three major problems, namely, resource sharing, task scheduling, and task offloading. Finally, the work also explores and discusses the key challenges arising from the nature of RL-based algorithms, the fog computing environment, and the computing tasks in a variety of practical applications. The corresponding open issues are also presented for further studies.