Article Information
Corresponding Author: Qiang Liu , liuq@bjtu.edu.cn
Zequn Jia, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, Beijing, China, and with the School of Electrical and Data Engineering, University of Technology Sydney, Broadway, Ultimo, Sydney, 2007, NSW, Australia, zachary@bjtu.edu.cn
Qiang Liu, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, Beijing, China, liuq@bjtu.edu.cn
Ying He, School of Electrical and Data Engineering, University of Technology Sydney, Broadway, Ultimo, Sydney, 2007, NSW, Australia, Ying.He@uts.edu.au
Qianqian Wu, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, Beijing, China, qianqianwu0921@bjtu.edu.cn
Ren Ping Liu, School of Electrical and Data Engineering, University of Technology Sydney, Broadway, Ultimo, Sydney, 2007, NSW, Australia, RenPing.Liu@uts.edu.au
Yantao Sun, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, Beijing, China, ytsun@bjtu.edu.cn
Received: January 5 2023
Revision received: June 13 2023
Accepted: June 20 2023
Published (Electronic): May 31 2023
I. INTRODUCTION
DATA centers are warehouses that host many servers, provide data processing services, and enable communication between a large number of computing resources [1]. Massively distributed services in data centers like microservices [2], [3] and Spark [4] supports various user-end application. Highspeed data center networks connect separated servers together to provide more reliable and scalable computing capability.
With the growth of the DCN scale and the emerging application of new network devices, failures in DCNs have become the norm rather than occasional events. Failures in DCNs usually have more severe consequences than in general networks, from disruption of services to loss of critical data [5]. Meanwhile, the complexity of large-scale DCN management makes it difficult for network administrators to find and locate failures in a timely manner. As a result, rapid detection and localization of failures are crucial in DCN research.
Failure detection is essential to data transmission reliability and system security. Passive detection methods (e.g., SNMP and NetFlow) were utilized in the past, which find and locate failures by periodically collecting port statistics of transmitted bytes and packets. Due to the unpredictability of the data traffic characteristics in DCN, the results of the analysis are imperfect [6], [7]. Nowadays, active probing is a common failure detection technique in DCNs. It will send one or more probes periodically to other nodes to identify any connection problems. This technique can adapt to frequently changing networks. Therefore, active network probing solutions like Microsoft’s PingMesh [8] have been developed. By deploying end-to-end probing agents on the edge servers, PingMesh measures the latency between servers and analyzes the results to identify the network’s current problem status.
However, there are two main problems with active detection methods such as PingMesh. Firstly, it is challenging for conventional probing methods to use deterministic probing paths due to the abundance of equal-cost multipaths (ECMP) in DCNs. Second, a reasonable set of probing paths must be carefully chosen in order to achieve high detection accuracy with low overhead.
The emergence of new technologies such as softwaredefined networks, programmable switches, etc., leads to the development of in-band network telemetry (INT)-based failure detection methods [9], [10], [11]. As a result, it is possible to ascertain the probes’ forwarding paths, which addresses the issue of too many ECMPs.
Unfortunately, generating a collection of probing paths with high accuracies remains challenging. Most existing research solutions have poor detection accuracy because of the data center’s huge scale, high-reliability requirements, and highapplication diversity. Moreover, the network load could potentially increase due to the measurement traffic introduced into the network [12]. The reliability of the active failure detection is dependent on the probing path of the agents [13]. PingMesh tries to cover more failures by generating three tiers of ping lists: Intra-racks, inter-racks, and inter-datacenters. These strategies are developed based on the expertise of network specialists. While it is difficult for this approach to generate probing path sets with higher failure-locating accuracy. DeTector [14] tries to improve the accuracy of failure localization by proposing probing matrix construction (PMC) and packet loss localization (PLL) algorithms. While failure detection matrix generation and failure localization accuracy still have growth potential. Besides, there is still limited theoretical analysis on failure detection and localization so far.
In this paper, we propose a formal representation of the accuracy of a given failure probing matrix. The design of the probing matrix is then formulated as an optimization problem to minimize the overhead for a given accuracy. We derive an iterative probing matrix construction algorithm with an alternative indicator for evaluating the probing matrices. This algorithm selects a path with the largest accuracy increment from the available detection paths and iterates repeatedly until the accuracy reaches a given value. In doing so, the proposed algorithm will find the optimal path under the current path number limitation at each iteration. Experimental results show that when the detection path is not saturated, the selected probing matrix has higher accuracy than previous methods.
Furthermore, we would like to design an algorithm with higher detection accuracy than conventional heuristics methods. Deep reinforcement learning has generated interest in a variety of research applications. In recent years, the DRL model has been widely used in failure detection, and the automatic feature learning process of its architecture offers excellent potential for solving the problem of failure detection [15]. Therefore, the rational use of reinforcement learning methods can help achieve better failure detection and localization.
So in this paper, we also proposed another DRL-based solution for probing matrix construction. We focus on constructing a probing matrix using the DRL model that performs better than traditional heuristics methods at a given number of probing paths. The experimental results show that our approach is effective, i.e., DRL-based methods are able to find better matrices with higher accuracy than all traditional methods.
Specifically, we have made the following contributions.
We theoretically analyzed the end-to-end active failure detection and localization method. The failure localization process was analyzed, and we derived the theoretical accuracy for different probing matrices and failure rates.
We proposed a heuristic approach for probing matrix construction. A novel evaluation was proposed based on the formula above. Then we proposed a heuristic PMC algorithm using the new indicator to get higher failure localization accuracy.
We propose a DRL-based probing matrix construction method. DRL was incorporated into the PMC problem, and a proximal policy optimization (PPO)-based solution was proposed. The probing matrix generated by this method can achieve higher accuracy.
Our proposed solutions have the potential to make a meaningful impact. By improving the accuracy and efficiency of failure detection, our work can help data center operators enhance network reliability, minimize downtime, and reduce costs. This can eventually enable a seamless user experience for various data center applications and services.
Furthermore, our research can serve as a foundation for future studies in the field of data center network optimization and resilience, ultimately contributing to the development of more robust and efficient data center infrastructures. Our work provides new insights into failure detection in data center networks and has the potential to improve the reliability and availability of data center services.
The rest of this paper is organized as follows. The details of related work are presented in Section II. Section III outlines the system model and defines the probing matrix construction problem. Section IV and Section V describe the motivation and details of our proposed PMC algorithm and DRL-based PMC algorithm, respectively. The evaluation results are presented and discussed in Section VI. Finally, Section VII concludes this article.
II. RELATED WORK
Researchers have been working on failure detection and developed various failure detection techniques for different applications [16]. PTPmesh [17] network monitoring tool uses precision time protocols to estimate one-way latency and quantify packet loss, and it is a viable tool for cloud tenants to obtain network performance statistics. Similar to PingMesh, PTPmesh does not prioritize improving failure localization precision by choosing more appropriate failure detection paths. Pandey et al. [18] proposed a novel networked multi-agent system diffusion protocol based on network structure and node priority. Nodes interact with neighboring nodes to diffuse information based on the weighted difference of available resources on the neighboring nodes. While this solution still relies on point-to-point probing, which increases the expense of replacing network devices. Liang et al. [19] proposed a heuristic algorithm to optimize the computed backup path, formalizing the backup path selection as an integer programming problem. However, this scheme does not allow for flexible modeling of failure modes, and detection accuracy is inadequate. Xie et al. [20] proposed a new chunked matrix completion algorithm to detect link failures. Compared with the existing matrix completion algorithms, the proposed chunking algorithm reduces the sampling complexity. Nevertheless, this method includes a centralized scheduling mechanism to periodically choose the probe paths, which increases the system’s costs and complexity. DeTector [14] is a topologyaware failure detection system for DCNs. The authors gave more attention to the significance of failure probing paths and proposed an algorithm to identify suitable probing paths in order to improve the accuracy of failure localization.
However, the theoretical analysis of failure detection is still insufficient, and there is still room for further research in constructing a more efficient and accurate path selection algorithm based on the analysis. We proposed a scheme to address this issue and included a comparison of our approach to deTector in this paper.
With artificial intelligence, failure detection methods based on machine learning have been introduced [21]. Hou et al. [22] use an unsupervised two-stage method to detect and characterize the faults in networks. They collect massive time series data of probing results and provide valuable insights on anomaly mining. Mahmood et al. [23] proposed a routing method based on reinforcement learning for intelligent failure detection that finds the best route with minimum endto- end delay. In addition, it also demonstrates that the proposed fault-tolerant strategy contains highly trusted computational capabilities that successfully improve network efficiency and thus reduce the risk of network failure. Karthik et al. [24] proposed an automatic link failure prediction and detection method that improves existing techniques with ML to detect and predict failures with automatic troubleshooting capabilities. Such prediction helps the network to react before a failure occurs. Traffic loss during failures is almost zero and can be achieved by predicting failures in advance. Deep learning methods such as graph neural networks (GNNs) are also applied to failure detection tasks within data centers. Jiang et al. [25] proposed APGNN, which extracts features and learns alarm-fault mappings using GNN. Directed graph convolutional neural networks (DGCNN) [26] are applied to locate link failures in data center simulations. The DGCNN model uses the structure of a directed line graph to represent the second-order structure of its underlying graph, providing a new approach to modeling the domain represented by the directed graph.
DRL methods may be more appropriate for this issue than CNNs because it is more adaptive to diverse network topologies. But there are currently rare applicable DRL-based solutions for this problem. We applied deep reinforcement learning to this problem and obtained better results than with conventional approaches.
III. SYSTEM OVERVIEW
In this section, we present the system model for end-to-end failure detection. We then analyze the formulation of failure detection accuracy given the network topology and the failure probing matrix and propose the optimization problem.
A. System Model
Fig. 1 depicts a system model for failure detection via the end-to-end probe mechanism. In this model, agents are deployed on the edge servers of the DCN to send and receive probes, and the results of the probes are analyzed to identify network issues. This model consists of the following components.
Probing matrix is a matrix used to represent a set of failure detection paths in a DCN. Each row of it represents a probe path, whereas each column represents a DCN link.
Probing matrix construction (PMC) algorithm is employed to generate probing matrices. According to the structure of the network and the detection objectives, the algorithm can construct a probing matrix that meets the requirements.
Packet loss localization (PLL) algorithm is designed to analyze the collected probing results to identify and locate failures in the network.
Agent is a lightweight software that is installed on the edge servers of a DCN and consists of senders and receivers. The sender receives the probing matrix from the PMC and constructs probes to transmit to the specified peers according to the probing matrix. Receiving the probes from the sender, the receiver logs the results and reports them to the PLL for analysis routinely.
System architecture of end-to-end probing solutions.
This architecture is highly compatible with current data center networks, especially the software-defined data center networks, as it leverages the flexibility and programmability of these networks. The PMC and PLL algorithms can be integrated into the SDN controllers. The PMC algorithm, when deployed on the controller, can construct corresponding probing matrices based on the network topology collected by the controller. Similarly, the probing statistics collected in the edge servers can also be aggregated to the controller through the SDN control plane and analyzed by the PLL algorithm in the controller. This integration allows for a more efficient and rapid response to network issues, as well as more effective remediation.
The sender and receiver agents can be implemented as lightweight software modules running on the edge servers of the data center network. As their functions are straightforward, they can be designed to operate with minimal overhead, ensuring that their impact on the data center’s primary services is very minimal.
Assuming that there are 𝑛 links in the whole network topology, we use a binary vector [TeX:] $$L \in \mathbb{I}^n$$ to indicate the current status of the network. [TeX:] $$L_i$$ is the 𝑖th element of 𝑳 that represent the status of link 𝑖. [TeX:] $$L_i=0$$ for link failure, [TeX:] $$L_i=1$$ for link is in a normal state.
A forwarding path consists of multiple connected links. Probing paths are forwarding paths in the network, whose source and destination nodes both belong to the set of edge servers. The matrix of candidate probing paths is defined as [TeX:] $$R \in \mathbb{I}^{k \times n},$$ where 𝑘 is the number of all available probing paths. Each row of this matrix denotes the related link to this probing path.
The PMC algorithm is expected to choose 𝑚 forwarding paths from candidate paths, i.e., probing paths [TeX:] $$\boldsymbol{P}=\left\{p_1, p_2, \cdots, p_m\right\}.$$ And we construct the probing matrix [TeX:] $$D \in \mathbb{I}^{m \times n}$$ by extracting [TeX:] $$R_{p_1}, R_{p_2}, \cdots, R_{p_m}$$ from 𝑅. In other words, 𝐷 is a subset of 𝑅, i.e.,
where [TeX:] $$\mathcal{F}_p$$ indicates the PMC algorithm. The value of [TeX:] $$D_{i j}$$ is either 0 or 1, implying that link [TeX:] $$l_j$$ is involved in path [TeX:] $$p_i$$.
The probing results will demonstrate whether or not each link in the probing paths works well. A path will be reported as failed if any link in the probing path are in failure. And in reverse, a failed probing path means at least one related link fails. The probing result of 𝐷 is defined as vector 𝒔 with length 𝑚, where [TeX:] $$s_i=1$$ means that path [TeX:] $$p_i$$ is probed to be working, while [TeX:] $$s_i=0$$ means [TeX:] $$p_i$$ is out-of-order.
Then we use the PLL algorithm to infer 𝐿 according to 𝒔. The resulting inferred link status, denoted as 𝐿′, are shown below.
Thus, the key problem of failure probing is to design appropriate PMC ([TeX:] $$\mathcal{F}_p$$) and PLL ([TeX:] $$\mathcal{F}_m$$) algorithms to find the minimized Diff (𝐿′, 𝐿). We could use the ratio of detected link failure number and total link number, i.e., 𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦, to indicate Diff (𝐿′, 𝐿).
B. Problem Formulation
In this subsection, we model the detection accuracy. According to the analytical model, one can calculate the detection accuracy value of the failure detection matrix for any network topology under a specific failure localization policy.
Problem: Given topology matrix 𝑅 and probing matrix 𝐷, find failures accuracy [TeX:] $$\mathcal{A}(D)$$
The set of all possible link failures is the power set of 𝑳, i.e., [TeX:] $$\mathcal{P}(L)$$, and [TeX:] $$|\mathcal{P}(\boldsymbol{L})|=2^n$$. Each row of 𝑄 is a unique status of 𝑳 and represents a failure situation of the topology. [TeX:] $$Q_{i j}$$ means in the 𝑖th failure situation, [TeX:] $$L_j$$ is failed or normal.
As mentioned above, the probing result relies on each involved link, therefore the probing result matrix of all failure situation [TeX:] $$\mathcal{S} \in \mathbb{I}^{2^n \times m}$$ is a binary matrix as shown below.
Here [TeX:] $$\mathcal{B}$$ is the binarization function. [TeX:] $$\mathcal{S}_{i j}$$ is the probing result of the 𝑗th probing path under the 𝑖th failure situation. [TeX:] $$\mathcal{S}_{i j}=1$$ means that all links in [TeX:] $$p_j$$ are reported normal, while [TeX:] $$\mathcal{S}_{i j}=0$$ means one or more links are out of order.
We now turn our attention to the failure localization algorithm [TeX:] $$\mathcal{F}_m$$ and present its theoretical accuracy formulation. The algorithm [TeX:] $$\mathcal{F}_m$$ is designed as a reverse mapping that retrieves the failure situation from the probing results, as depicted in (2). Due to the constraints of probing complexity, this mapping is not always one-to-one. Consequently, multiple failures may lead to the same probing outcome for a given probing result. In such cases, there could be several candidate failure situations [TeX:] $$Q_i$$ that satisfy (3), making it impossible for the PLL algorithm to determine the expected result.
We assume that all link failures in the network follow a binomial distribution with probability 𝑝, i.e., 𝑋 ∼ 𝐵(𝑛, 𝑝). Let [TeX:] $$\hat{S}_{N \times m}$$ represent the compressed S matrix with duplicate rows removed, where 𝑁 denotes the number of unique row vectors in S. We then define the mapping matrix [TeX:] $$U_{N \times 1}$$ as follows:
In this case, [TeX:] $$U_i$$ corresponds to the number of times the row vector [TeX:] $$\hat{S}_i$$ appears in S. We also define [TeX:] $$\hat{Q}_i$$ as the set of candidate link failure situations, meaning that all elements within it result in the same probing outcome [TeX:] $$\hat{S}_i$$:
Each [TeX:] $$\hat{Q} i j$$ is a vector representing a failure situation.
We now consider the accuracy of inferring [TeX:] $$\hat{Q}_{ij}$$ from S𝒊. If the inferred [TeX:] $$\hat{Q} i j$$ is chosen randomly from the candidate failure situations [TeX:] $$\hat{Q}_i$$, the probability that the algorithm can correctly identify all failed links is given by the following expression, where [TeX:] $$\Sigma R_j$$ represents the length of the probing paths.
But we would like to find a better policy for choosing [TeX:] $$\hat{Q}_{ij}$$ from [TeX:] $$\hat{Q}_i$$ to improve the detection accuracy. Therefore, we proposed to choose the [TeX:] $$\hat{Q}_{ij}$$ with the minimum number of failure occurrences from [TeX:] $$\hat{Q}_i$$, i.e., [TeX:] $$j^*=\underset{j}{\arg \min } \Sigma \hat{Q}_{i j}.$$ Then the probability that the algorithm can identify all the failed links become as follows.
Here, 𝑝 is the probability of link failure, this value is usually less than 20% in commercial data center networks [27]. Therefore we could assume that 𝑝 < 0.5. It could be proved that (5) is higher than (4) for 𝑝 < 0.5 as follows.
1) Let
To prove that (5) > (4), we need to prove that for a given [TeX:] $$U_i, P_2\gt \operatorname{avg}\left(P_{1_j}\right) \text {. }$$
2) Let
In (5), we let [TeX:] $$j^*=\operatorname{argmin} \Sigma \hat{Q}_{i j},$$ and as a result, [TeX:] $$\Sigma \hat{Q}_{i j}-\Sigma \hat{Q}_{i j^*} \geq 0, \forall j \in U_i.$$
3) We get [TeX:] $$K\lt 1 \text { when } p\lt 0.5 \text {, i.e., } P_{1_j}\lt P_2, \forall j \in U_i$$ That means [TeX:] $$\operatorname{avg}\left(P_{1_j}\right)\lt P_2,$$ and the proof is finished.
But the probabilities in (5) means the located failures situations [TeX:] $$\hat{Q}_{i j}$$ are exactly the same with the actual failures. Whereas the accuracy only counts the number of located failure links. Therefore we also present the probability for different statistic calibers.
where [TeX:] $$j^*=\operatorname{argmin} \Sigma \hat{Q}_{i j} \text {, and } f$$ indicates the statistical functions for different statistic calibers. There are different [TeX:] $$f\left(\hat{Q}_{i k}, \hat{Q}_{i j}\right)$$ in (7), (8) and (9) for accuracy, false negative rate (FNR) and false positive rate (FPR) .
For [TeX:] $$P\left(\frac{\# \text { of detected failures }}{\# \text { of total failures }}\right)$$, i.e., the accuracy [TeX:] $$\mathcal{A},$$,
For [TeX:] $$P\left(\frac{\# \text { of undetected failures }}{\# \text { of total failures }}\right)$$, i.e., FNR,
For [TeX:] $$P\left(\frac{\# \text { of misreported normal link }}{\# \text { of total failures }}\right)$$, i.e., FPR,
The size of the probing matrix 𝐷 affects how much probing overhead is required. This means more probing agents need to be deployed in DCNs and more extra bandwidth is occupied. Therefore we would like to achieve a given detection accuracy [TeX:] $$\mathcal{A},$$ with minimum probing overhead, i.e., the smallest number of probing paths. We model the optimization problem as shown in (10).
IV. PROBING MATRIX CONSTRUCTION ALGORITHM
The above optimization problem of failure detection via round-trip probing is NP-hard, as proven in [28]. In order to obtain the available probing matrix in a reasonable time, we propose an iterative probing matrix calculation algorithm, which chooses one path from the sub-optimal probing paths that maximizes the accuracy increment [TeX:] $$\Delta \mathcal{A}$$ and iterates until the accuracy reaches the target.
However it remains a complex problem to calculate the accuracy of the new probing matrix 𝑃′ after each iteration. According to (6), the complexity of computing the probing accuracy for a given probing matrix will be [TeX:] $$O\left(2^n\right)$$, where 𝑛 is the number of physical links. While the scale of a modern data center network is enormous, e.g., Microsoft hosts over one million servers in over 100 data centers globally [29]. Therefore, we need an approximate indicator with lower enough computational complexity to reflect the accuracy variation.
As we mentioned in the previous section, one given state S may correspond to multiple [TeX:] $$Q_i$$ values. In (6), there are multiple probing methods to choose a value from multiple possible values [TeX:] $$\hat{Q}_i$$ as our predicted value. For both probing methods 1 and 2, the less optional values, i.e., [TeX:] $$\left|\hat{Q}_i\right|,$$ the higher the probability that the predicted result is the same as the actual result, which means higher probing accuracy.
Define [TeX:] $$\mathcal{G}_t$$ as the set of all indistinguishable failures sets in step 𝑡.
At the start of constructing the probing matrix, we put all possible failures into [TeX:] $$\mathcal{G}$$. After a new probing path 𝑝 is added to the probing matrix 𝐷, the set can be divided according to whether the link in the failure exists on the path. The set division from step 𝑡 to 𝑡 + 1 is defined as follows.
where
We can increase 𝑡 by adding more paths to 𝐷, until the accuracy constraint is met. Here each [TeX:] $$G_i$$ is corresponding to a [TeX:] $$\hat{Q}_i$$. We would like to decrease [TeX:] $$\left|\hat{Q}_i\right|$$ by finding a more reasonable solution of set division, i.e., constructing the probing matrix.
On the basis of the above observations, we propose a new indicator 𝐶 to evaluate the result of set division. There are two key requirements for set division: 1) each [TeX:] $$G_i$$ should contain as fewer elements as possible to make failures easier to identify; 2) each [TeX:] $$\left|\hat{Q}_i\right|$$ should be evenly distributed to avoid lower accuracy due to too many failures reside in some particular [TeX:] $$\hat{Q}_i$$. Therefore the proposed indicator 𝐶 is defined as follows.
where [TeX:] $$\mathcal{L}(\mathcal{G})=\left\{\left|G_i\right|\right\}, \forall G_i \in \mathcal{G} \text { and } \sigma$$ is the standard deviation of it. Smaller 𝐶 means fewer elements in each subset of [TeX:] $$\mathcal{G}$$ and/or elements are more balanced divided, eventually leading to better accuracy.
However, the number of failures initially being divided i.e., [TeX:] $$\sum \mathcal{L},$$ is still up to [TeX:] $$2^n$$. Considering the identifiability mentioned in [14], we can use the failure set of maximum simultaneous failures to reduce the complexity. As a result, the total number of failures will fall to [TeX:] $$\sum_{i=1}^{\beta}\left(\begin{array}{l} n \\ i \end{array}\right).$$
The proposed PMC algorithm is shown in Algorithm 1, where 𝐶 is the criterion when adding a new probing path. This algorithm is a greedy method, which will iterate until given number of probing paths are selected or a given accuracy is achieved. At each iteration, the algorithm calculates the 𝐶 for each candidate path after adding it to the current set of detected paths, and selects the path that minimizes 𝐶 to the set. The overall time complexity will be [TeX:] $$O(m \times k \times n),$$ where 𝑚 is the size of probing matrix and 𝑘 is the number of candidate probing paths.
𝐶 will remain valid until set division of [TeX:] $$\mathcal{G}$$ is complete, i.e., each subsets in [TeX:] $$\mathcal{G}$$ contains only one element. As Algorithm 1 is an iterative algorithm, in each iteration, the algorithm evaluates the candidate set division solutions according to 𝐶. Thus it is necessary to ensure that 𝐶 is valid, i.e., [TeX:] $$|\mathcal{G}| \leq \mathcal{L}(\mathcal{G})$$. In the subsequent experiments, we found that generally when [TeX:] $$\beta=1,|\mathcal{G}|$$ will reach [TeX:] $$\mathcal{L}(\mathcal{G})$$ too fast, making the iteration fail to continue. While = 2 or 3 is large enough to achieve the expected accuracy.
Proposed probing matrix contruction algorithm
V. DRL-BASED PROBING MATRIX CONSTRUCTION
Algorithm 1 focuses on finding a better probing path in each step. Therefore, the final probing matrix is likely to be the local optimum and there is no guarantee that the global optimum will be obtained.
But in some specific scenarios that are extremely sensitive to overhead, we would like to find a probing matrix that is closer to the optimal result, allowing higher system accuracy on the same probing matrix size. Reinforcement learning methods have achieved remarkable results in solving complex combinatorial optimization problems [30]. The PPO algorithm [31] has been reported to be more efficient and stable compared to other reinforcement learning algorithms such as DQN in solving these combinatorial optimization problems [32]. Therefore, we apply the PPO algorithm to the probing path construction problem. We provide feedback to the policy model to enable it to choose more reasonable failure detection paths. The PPO algorithm uses an actor-critic model and limits the change of the policy through the hyperparameter [TeX:] $$\epsilon$$ so that the update of the policy is within a limited range, making it more stable.
The above probing matrix construction problem is abstracted into an environment in order to employ reinforcement learning algorithms. To model this problem with a DRL model, we use the selected set of probing paths as the output of the reinforcement learning model, i.e., actions. But the solution space of this problem is exponential. Thus the action space will become overwhelmingly large, and it will grow rapidly with the increase of the network size. Excessive action space will have a negative impact on the performance of the DRL model, which is not conducive to find an appropriate policy [33].
In order to manage and reduce the complexity of the action space, we have designed our reinforcement learning environment to treat the probing matrix construction problem as an episode, which is divided into multiple steps. This approach allows us to break down the problem into smaller, more manageable subproblems. At each step, the DRL policy selects a single probing path and adds it to the existing probing path set. By doing so, we effectively reduce the action space at each step, as the DRL model only needs to consider a single probing path rather than the entire solution space. This stepby- step approach not only simplifies the action space but also enables the DRL model to learn more efficiently, as it can focus on making incremental improvements to the probing path set. As a result, the performance of the DRL model is less likely to be negatively impacted by the exponential growth of the action space, allowing it to find an appropriate policy more effectively.
Additionally, it is worth noting that while this method of reducing complexity may have some potential negative impact on the final results, our experimental results demonstrate that this approach still achieves optimization outcomes that surpass those of traditional methods. This indicates that the trade-off between managing the complexity of the action space and the quality of the final solution is well-balanced in our proposed method. By effectively handling the action space complexity, our DRL model is able to learn more efficiently and ultimately outperform conventional techniques in solving the probing matrix construction problem.
State space: The state space includes the information that the DRL policy needs to know when making path selections in each step. In this environment, we feed the set of selected probing paths as the state to the policy model [TeX:] $$\pi .$$ The policy model [TeX:] $$\pi$$ will choose new probe paths to be added to the set based on the current state. Therefore, we define the state vector [TeX:] $$S \in \mathbb{I}^{1 \times k}$$ as follows, where [TeX:] $$S_i$$ indicates whether the candidate probing path 𝑖 exists in the set of detection paths and 𝑘 is the number of all candidate probing paths.
Action Space: The action space of each step is the new probing path selected by policy [TeX:] $$\pi$$ in the current step. We directly use the ordinal number of the probe path to represent the corresponding path. Formally, the action space 𝐴 can be defined as a discrete integer.
Reward: The reward provides feedback to the policy [TeX:] $$\pi$$ and must be carefully designed to ensure that qualified probing paths are chosen. We divide the reward in the whole environment into two parts: 1) the reward after each action step is completed (step reward), and 2) the reward at the end of an episode when the compliant set of probing paths is obtained (termination reward).
Step reward: After each step is completed, we first penalize the repeatedly selected path, as it provides no help to the probing. We directly give a large negative reward and terminate this episode. This penalty discourages the DRL policy [TeX:] $$\pi$$ from selecting the same path multiple times. For each distinct path, we also give a small penalty to encourage the DRL policy [TeX:] $$\pi$$ to continue exploring until the size of the probing matrix meets the requirement. This step reward design aims to balance exploration and exploitation during the learning process.
Termination reward: For the reward at the end of each episode, we use the proposed indicator 𝐶. We design the reward 𝑊 to increase in proportion to the performance, as follows:
This termination reward design encourages the DRL policy to select the set of probing paths with a higher final reward 𝑊. By using the indicator 𝐶, we ensure that the reward is directly related to the detection accuracy, thus guiding the policy towards better solutions.
As shown in Fig. 2, the reward mechanisms are adopted to train the DRL policy. We do not include the termination reward 𝑊 in the intermediate step and introduce a discount, as we are more concerned about the final result rather than the intermediate process for this model. This design choice helps the policy avoid converging to a local optimum, as might happen with a greedy algorithm.
The training process of the PPO-based PMC algorithm.
VI. EVALUATION AND DISCUSSION
A. Experimental Setup
To evaluate the performance of the two algorithms proposed in this paper, we constructed simulation platforms for experimental assessment.
First, we developed a Python-based simulation platform to assess the detection accuracy of failure probing matrices produced by the PMC algorithm. This platform initially generates a fat-tree network topology with faults, taking into account the specified link failure rate 𝑞. Subsequently, it executes the provided PMC algorithm to create the corresponding probing matrix and deploys the relevant agents to the appropriate edge servers. Then detection agents run on the edge server nodes, transmitting probing data. The PLL algorithm from [14] is employed for failure localization using collected probing data. Finally, the identified link failures are compared with the actual link failures to compute metrics such as detection accuracy.
Additionally, to emulate a more realistic probing topology, we constructed a similar experimental environment utilizing the ns-3 network simulator. The probe transmission and reception processes are simulated on the edge servers within the network topology established in ns-3. The forwarding model of ns-3 more closely resembles the actual data center network nodes, thereby providing a better representation of our algorithms’ performance in real-world scenarios.
We compare our proposed algorithms with deTector [14] as it is currently one of the leading algorithms for host-based failure localization. It builds upon the well-known PingMesh [8]. Our work improves upon deTector in several critical aspects, including theoretical analysis, a superior indicator, and further more, a DRL-based solution. By comparing against deTector, we aim to highlight the strengths of our proposed algorithms.
The fat-tree topology [34] is one of the most prevalent network structures in data center networks, with many commercial data centers employing fat-tree and its variants [35]. It utilizes a hierarchical structure where every level is fully interconnected. This design guarantees multiple pathways between any two nodes within the network, thereby circumventing bottlenecks and augmenting overall network performance. Therefore, we used a 4-ary fat-tree network topology in our experiments, consisting of a total of 36 nodes.
The simulation platform and other related experiments in this paper are run on a workstation with an Intel Core i7-8700 processor and 32 GB of memory.
B. Theoretical Accuracy
We first investigate the consistency between the proposed accuracy model and the simulation accuracy results. The experiments are conducted in a 4-ary fat-tree network topology with two pods, where 128 candidate probing paths are involved in this network. Algorithm 1 was executed with different link failure rates to generate the probing matrices at each given link failure rate. Then we construct probing matrices with different sizes from 1 to 20, and run the PLL algorithm respectively based on them. The theoretical and simulation accuracies are shown in Fig. 3.
It is shown that the simulated values (detected rates, undetected rates, and misreported rates) are closely approximated to the theoretical values given by (6) with different link failure rates (𝑞 = 0.05 and 𝑞 = 0.1). This indicates our proposed model in Section III is able to accurately reflect the failure detection accuracy given the network topology and probing matrix.
The comparison of theoretical and simulation values of (6) with different link failure rates.
C. Heuristic PMC Algorithm
We propose an indicator 𝐶 in (13) as an alternative to the absolute accuracy with high time complexity. We first compare the probing accuracies using different evaluation indicators ([TeX:] $$\Delta \mathcal{A} \text { and } C$$) with our proposed Algorithm 1.
As shown in Fig. 4, We ran the failure detection algorithm using [TeX:] $$\Delta \mathcal{A} \text { and } C$$ as evaluation metrics respectively. It is shown that with the same link failure rate, the accuracies produced by the two approaches are close to each other. This demonstrates that the indicator 𝐶 we proposed for choosing better probing paths during each iteration is reasonable and can depict the contribution of each probing path to the final accuracy.
The accuracy results when choosing different indicators in Algorithm 1.
We evaluate the theoretical values in (7), (8), and (9) of the proposed PMC algorithm and deTector. The results shown in Fig. 5 demonstrate that our proposed PMC algorithm outperforms the PMC algorithm in deTector. Specifically, the proposed algorithm gains higher detected rates as well as lower undetected and misreported rates.
Theoretical values of different methods with different .
In addition, we evaluate the simulated failure detection accuracies of our algorithm. The experiments are conducted in both numerical and ns-3 simulation environments with deTector as a comparison.
The PLL algorithm proposed in [14] is a failure localization algorithm that is efficient and could be deployed in data center networks. Therefore, we also performed an accuracy evaluation using the PLL algorithm. We generated the probing matrices using the same parameters and sent their probing results to the PLL algorithm as inputs. The accuracy of the link failures inferred by the PLL algorithm according to the probing results is shown in Fig. 6(a).
To validate the actual effectiveness of the algorithm in real data center networks, we implemented the whole probing matrix construction, probing agents and failure localization algorithms in a popular network simulator ns-3. We use ns-3 to generate the experimental network topology, and obtain the corresponding probing results on the receiver agents by sending actual probe packets from the sender agents. The probing results are aggregated at the controller to calculate the failure localization using the PLL algorithm. Results in Fig. 6(b) indicate that the proposed algorithm also gains better accuracies in the ns-3 simulator than the deTector algorithm.
Accuracies of generated probing matrices using different methods with different s.
These several experimental data above show that our proposed algorithm has a significant improvement in the accuracy of failure detection compared with deTector. Our method’s accuracy increases more rapidly with the number of probing paths and can be accomplished with fewer probing paths compared with deTector, which could reduce the overall probing overhead.
We can also find that with the different values of , the accuracies of = 2 and = 3 are almost the same in the first half of the curve, whereas the curve of = 2 tends to flatten out while = 3 continues to grow in the rest. This is due to the failures of = 2 is already divided completely when the path number is greater than 15, i.e., there is only one element in each set. As a result, the set number cannot be used to enhance the subsequent probing path selection. While there are more elements in = 3 and they can keep being split as the number of paths increases.
This provides insight for selecting an appropriate value. The value of can be determined in accordance with the accuracy requirement. With a given accuracy, a smaller is preferred to reduce the overhead of deploying agents and probing bandwidth.
D. DRL-based PMC Algorithm
We also evaluated the accuracy of the failure probing matrix obtained using the DRL-based method. As the main objective of using reinforcement learning methods is to improve the final accuracy as closely to the global optimal as possible, the reinforcement learning method we designed focuses on improving the final 𝐶, and not on the values in the intermediate process.
The comparison of the accuracies of the three methods is shown in table Table II. This table demonstrates that our reinforcement learning scheme is able to find a probing matrix that is closer to the optimal solution than the heuristic algorithm and deTector with the same given number of probing paths and value.
COMPARISON OF ACCURACIES OF DIFFERENT SCHEMES IN DIFFERENT AND PROBING PATH NUMBER PARAMETERS.
It is worth noting that although the DRL-based method can find a better probing matrix, it takes longer training time than the greedy heuristic algorithms. However, considering the fattree network topology with 128 candidate paths as an example, the complexity of finding the optimal probing matrix with 10 paths is [TeX:] $$\left(\begin{array}{c} 128 \\ 10 \end{array}\right) \approx 2 \times 10^{14}$$, while the complexity of selecting 15 or 20 paths is even higher. This means the DRL-based algorithm is able to find a suboptimal matrix that is closer to the optimal one after several hours of training even for such a complex problem.
To further justify our choice of the PPO algorithm, we conducted a comparative experiment between PPO and DQN, two popular DRL algorithms. As shown in Fig. 7, the PPO algorithm demonstrates faster growth, smaller fluctuations, and ultimately achieves a higher normalized reward compared to DQN. In contrast, the DQN algorithm exhibits slower growth, larger fluctuations, and its final reward is only about 60% of the PPO’s reward. These results provide empirical evidence supporting our choice of the PPO algorithm for solving the complex combinatorial optimization problem in our study. The superior performance of PPO in terms of both convergence speed and final reward makes it a more suitable choice for our application.
Training performance comparison of PPO and DQN-based PMC algorithms.
The entire training process of 1000K PPO steps took approximately 2 hours. The training was performed on an Intel Core i7-8700 workstation with 32 GB of RAM, utilizing only the CPU. Because reinforcement learning typically employs smaller neural networks, so the benefits of using a GPU would be counterbalanced by the communication overhead between the CPU and GPU.
E. Limitations and Future Work
Our experimental results demonstrate the effectiveness of the proposed solutions. The results show that our theoretical model is consistent with actual scenarios, and both our heuristic algorithm and DRL-based PMC algorithm outperform other methods.
Nevertheless, our algorithms exhibit certain limitations and drawbacks. The DRL-based algorithm delivers superior detection performance, achieving higher accuracy with equivalent probing overhead or reduced probing overhead with the same target accuracy. In contrast, the heuristic algorithm provides expedited construction speed but slightly lower detection performance. As a result, these two algorithms cater to distinct application scenarios. When probing overhead is highly sensitive and a smaller probing matrix is desired, the DRL-based algorithm should be employed given the same target accuracy. On the other hand, when the construction speed of the probing matrix is prioritized and its size is less critical, the heuristic algorithm should be selected.
Additionally, scalability remains a significant limitation for both algorithms. As the size of data center networks increases, the matrix construction speed of these methods still has considerable constraints. Further research is warranted in this area.
Fortunately, recent studies such as Bernardez et al. [36] suggest that trained GNNs can yield favorable results on unseen topologies in the training data, indicating a degree of generality. Our work establishes the groundwork for utilizing deep reinforcement learning in failure probing matrix construction. Building upon this foundation, it might be feasible to develop a universal reinforcement learning PMC solution employing GNN-based reinforcement learning techniques. This approach would enable the construction of probing matrices on any network topology following a single training session. We leave this exploration for future work.
Besides, our proposed algorithms are anticipated to perform optimally in data centers with physical infrastructure instead of virtualized networks. This is mainly due to physical data center networks’ regular and predictable topologies. Conversely, virtualized data center networks present a more diverse range of topologies. These virtualized networks may not always offer multiple equivalent paths, which could potentially diminish the effectiveness of our solution. We acknowledge that further research is necessary to optimize probing matrix construction in data centers with virtual infrastructure.
VII. CONCLUSION
This paper firstly models the end-to-end failure detection method and presents the theoretical accuracy of failure detection under a given network topology and probing matrix. Secondly, we design an alternative indicator for evaluating the effectiveness of a probing matrix, and propose a new probing matrix construction algorithm. The evaluation results show that the proposed algorithm using the new indicator focuses on improving the accuracy at each step of selection. As a result, the selected probing matrix can achieve the target accuracy sooner than previous methods. Furthermore, we propose to use the deep reinforcement learning algorithm to construct probing matrices. We provide feedback to the reinforcement learning policy via balanced set division result. The proposed algorithms are evaluated in both numerical and ns-3 simulations. Experimental results show that the DRL-based probing matrix construction is able to find a probing matrix that is closer to the optimal. Finally, we discussed the different application scenarios of the two proposed methods that are aiming to achieve better detection accuracy or construction speed.